- Create a new Anaconda environment with Python 3.10:

  ```bash
  $ conda create -n next-word-prediction python=3.10
  ```
- Install all dependencies:

  ```bash
  $ pip install -r requirements.txt
  ```
- The training script is located in `train.py`.
- `Next Word Prediction (Approach 1).ipynb` is the same as `train.py`, but in Jupyter Notebook format.
- `next-word-prediction (Approach 2 - Google USE).ipynb` is the second approach, which uses the Google Universal Sentence Encoder to retrieve the embedding of the input text.
- `Next Word Prediction (word2vec).ipynb` is the third approach, which uses word2vec to retrieve the embedding.
- The most difficult part of this project was overfitting. I tried several popular techniques to reduce it, such as dropout and L2 regularization, but none of them produced a significant improvement. A rough sketch of what these attempts looked like is shown below.
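  This is a minimal sketch, assuming (hypothetically) that the model in `train.py` is a Keras LSTM over one-hot inputs; the vocabulary size, sequence length, layer sizes, and rates here are illustrative, not the actual values used:

  ```python
  from tensorflow.keras import Sequential, layers, regularizers

  VOCAB_SIZE = 10_000  # hypothetical vocabulary size
  SEQ_LEN = 5          # hypothetical input sequence length

  model = Sequential([
      # One-hot input: each timestep is a sparse VOCAB_SIZE-dimensional vector
      layers.LSTM(128, input_shape=(SEQ_LEN, VOCAB_SIZE),
                  kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty on weights
      layers.Dropout(0.3),  # randomly drop 30% of activations during training
      layers.Dense(VOCAB_SIZE, activation="softmax"),
  ])
  model.compile(optimizer="adam", loss="categorical_crossentropy")
  ```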
- One major reason for the overfitting is that my input and output data are too sparse. In particular, I set the labels to be one-hot vectors, which makes it hard for the model to learn and generalize.
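  To make the sparsity concrete, with a hypothetical 10,000-word vocabulary each one-hot label is a 10,000-dimensional vector containing a single 1 and 9,999 zeros:

  ```python
  from tensorflow.keras.utils import to_categorical

  VOCAB_SIZE = 10_000  # hypothetical vocabulary size
  label = to_categorical(42, num_classes=VOCAB_SIZE)  # word index 42 -> one-hot
  print(label.shape)      # (10000,)
  print(int(label.sum())) # 1 -- only one non-zero entry out of 10,000
  ```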
- What I can do instead is replace the sparse input vectors with word embeddings. Embeddings are dense, low-dimensional representations of words that capture semantic similarities.
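  A sketch of this change on the input side, under the same hypothetical Keras setup: word indices are fed through a trainable `Embedding` layer, so each word becomes a dense 100-dimensional vector instead of a sparse one-hot vector (all dimensions are illustrative).

  ```python
  from tensorflow.keras import Sequential, layers

  model = Sequential([
      # Maps each word index to a dense, trainable 100-d vector
      layers.Embedding(input_dim=10_000, output_dim=100),
      layers.LSTM(128),
      layers.Dense(10_000, activation="softmax"),
  ])
  ```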
- Similarly, I should consider using embeddings as the target output as well. This can be achieved by using an embedding layer for the output and training with a cosine similarity loss.
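  A sketch of the output side under the same assumptions: the model predicts a dense embedding vector, and a cosine similarity loss pulls predictions toward the target word's embedding (`EMBED_DIM` is hypothetical).

  ```python
  import tensorflow as tf
  from tensorflow.keras import Sequential, layers

  EMBED_DIM = 100  # hypothetical embedding size

  model = Sequential([
      layers.Embedding(input_dim=10_000, output_dim=EMBED_DIM),
      layers.LSTM(128),
      layers.Dense(EMBED_DIM),  # predict an embedding, not class probabilities
  ])
  # Keras' CosineSimilarity loss returns -1 for a perfect match, so
  # minimizing it maximizes the cosine similarity to the target embedding.
  model.compile(optimizer="adam", loss=tf.keras.losses.CosineSimilarity(axis=-1))
  ```

  At inference time the predicted vector would then be mapped back to a word by a nearest-neighbour lookup over the vocabulary embeddings.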
- I could try a different model architecture, such as a transformer, to improve the performance.
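  As a rough illustration, a single transformer block in Keras could look like this (positional encodings are omitted for brevity, and all dimensions are hypothetical):

  ```python
  from tensorflow.keras import Input, Model, layers

  def transformer_block(x, num_heads=4, key_dim=32, ff_dim=128):
      # Self-attention with a residual connection and layer normalization
      attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
      x = layers.LayerNormalization()(x + attn)
      # Position-wise feed-forward network, again with a residual connection
      ff = layers.Dense(ff_dim, activation="relu")(x)
      ff = layers.Dense(x.shape[-1])(ff)
      return layers.LayerNormalization()(x + ff)

  inputs = Input(shape=(5,))                 # hypothetical sequence length
  x = layers.Embedding(10_000, 100)(inputs)  # dense word embeddings
  x = transformer_block(x)
  x = layers.GlobalAveragePooling1D()(x)
  outputs = layers.Dense(10_000, activation="softmax")(x)
  model = Model(inputs, outputs)
  ```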
- If time permits, I also intend to dockerize everything into a container for easy deployment.
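  A hypothetical `Dockerfile` sketch for that plan, reusing the Python version and `requirements.txt` from the setup steps above:

  ```dockerfile
  FROM python:3.10-slim

  WORKDIR /app
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  COPY . .

  # Runs the training script by default; swap in an inference entrypoint as needed
  CMD ["python", "train.py"]
  ```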