- Create a new Anaconda environment with Python 3.10:

  ```bash
  $ conda create -n next-word-prediction python=3.10
  ```
- Install all dependencies:

  ```bash
  $ pip install -r requirements.txt
  ```
- The training script is located in `train.py`.
- `Next Word Prediction (Approach 1).ipynb` is the same as `train.py`, but in Jupyter Notebook format.
- `next-word-prediction (Approach 2 - Google USE).ipynb` is the second approach, which uses the Google Universal Sentence Encoder to retrieve the embedding of the input text.
- `Next Word Prediction (word2vec).ipynb` is the third approach, which uses word2vec to retrieve the embedding.
- The most difficult part of this project was overfitting. I tried several popular techniques to reduce it, such as dropout and L2 regularization, but none of them produced a significant improvement. A rough sketch of what these attempts looked like is shown below.
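  This is a minimal sketch, assuming (hypothetically) that the model in `train.py` is a Keras LSTM over one-hot inputs; the vocabulary size, sequence length, layer sizes, and rates here are illustrative, not the actual values used:

  ```python
  from tensorflow.keras import Sequential, layers, regularizers

  VOCAB_SIZE = 10_000  # hypothetical vocabulary size
  SEQ_LEN = 5          # hypothetical input sequence length

  model = Sequential([
      # One-hot input: each timestep is a sparse VOCAB_SIZE-dimensional vector
      layers.LSTM(128, input_shape=(SEQ_LEN, VOCAB_SIZE),
                  kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty on weights
      layers.Dropout(0.3),  # randomly drop 30% of activations during training
      layers.Dense(VOCAB_SIZE, activation="softmax"),
  ])
  model.compile(optimizer="adam", loss="categorical_crossentropy")
  ```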
- One major reason for the overfitting is that my input and output data are too sparse. In particular, I set the labels to be one-hot vectors, which makes it hard for the model to learn and generalize.
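  To make the sparsity concrete, with a hypothetical 10,000-word vocabulary each one-hot label is a 10,000-dimensional vector containing a single 1 and 9,999 zeros:

  ```python
  from tensorflow.keras.utils import to_categorical

  VOCAB_SIZE = 10_000  # hypothetical vocabulary size
  label = to_categorical(42, num_classes=VOCAB_SIZE)  # word index 42 -> one-hot
  print(label.shape)      # (10000,)
  print(int(label.sum())) # 1 -- only one non-zero entry out of 10,000
  ```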
- What I can do instead is replace the sparse input vectors with word embeddings. Embeddings are dense, low-dimensional representations of words that capture semantic similarities.
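  A sketch of this change on the input side, under the same hypothetical Keras setup: word indices are fed through a trainable `Embedding` layer, so each word becomes a dense 100-dimensional vector instead of a sparse one-hot vector (all dimensions are illustrative).

  ```python
  from tensorflow.keras import Sequential, layers

  model = Sequential([
      # Maps each word index to a dense, trainable 100-d vector
      layers.Embedding(input_dim=10_000, output_dim=100),
      layers.LSTM(128),
      layers.Dense(10_000, activation="softmax"),
  ])
  ```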
- Similarly, I should consider using embeddings as the target output as well. This can be achieved by using an embedding layer for the output and training with a cosine similarity loss.
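  A sketch of the output side under the same assumptions: the model predicts a dense embedding vector, and a cosine similarity loss pulls predictions toward the target word's embedding (`EMBED_DIM` is hypothetical).

  ```python
  import tensorflow as tf
  from tensorflow.keras import Sequential, layers

  EMBED_DIM = 100  # hypothetical embedding size

  model = Sequential([
      layers.Embedding(input_dim=10_000, output_dim=EMBED_DIM),
      layers.LSTM(128),
      layers.Dense(EMBED_DIM),  # predict an embedding, not class probabilities
  ])
  # Keras' CosineSimilarity loss returns -1 for a perfect match, so
  # minimizing it maximizes the cosine similarity to the target embedding.
  model.compile(optimizer="adam", loss=tf.keras.losses.CosineSimilarity(axis=-1))
  ```

  At inference time the predicted vector would then be mapped back to a word by a nearest-neighbour lookup over the vocabulary embeddings.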
- I could try a different model architecture, such as a transformer, to improve the performance.
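  As a rough illustration, a single transformer block in Keras could look like this (positional encodings are omitted for brevity, and all dimensions are hypothetical):

  ```python
  from tensorflow.keras import Input, Model, layers

  def transformer_block(x, num_heads=4, key_dim=32, ff_dim=128):
      # Self-attention with a residual connection and layer normalization
      attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
      x = layers.LayerNormalization()(x + attn)
      # Position-wise feed-forward network, again with a residual connection
      ff = layers.Dense(ff_dim, activation="relu")(x)
      ff = layers.Dense(x.shape[-1])(ff)
      return layers.LayerNormalization()(x + ff)

  inputs = Input(shape=(5,))                 # hypothetical sequence length
  x = layers.Embedding(10_000, 100)(inputs)  # dense word embeddings
  x = transformer_block(x)
  x = layers.GlobalAveragePooling1D()(x)
  outputs = layers.Dense(10_000, activation="softmax")(x)
  model = Model(inputs, outputs)
  ```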
- If time permits, I also intend to dockerize everything into a container for easy deployment.
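  A hypothetical `Dockerfile` sketch for that plan, reusing the Python version and `requirements.txt` from the setup steps above:

  ```dockerfile
  FROM python:3.10-slim

  WORKDIR /app
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  COPY . .

  # Runs the training script by default; swap in an inference entrypoint as needed
  CMD ["python", "train.py"]
  ```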