1 About
Modern distributional semantic algorithms (also known as word embedding algorithms) can be found inside many intelligent systems dealing with natural language. The goal of developing such embeddings is to learn meaningful vectors (embeddings), such that semantically similar words have mathematically similar vectors. The different types of word embeddings can be broadly classified into two categories:
• Frequency-based Embedding:
There are generally three types of vectors that we encounter under this category: the Count Vector, the TF-IDF Vector, and the Co-occurrence Matrix with a fixed context window (a minimal co-occurrence sketch follows this list).
• Prediction-based Embedding:
Word2vec is a popular method for training such representations, which can be implemented using either of these two models: Continuous Bag of Words (CBOW) and Skip-Gram (SG).
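To make the frequency-based category concrete, here is a minimal, non-prescriptive sketch of building a co-occurrence matrix with a fixed context window and reducing it with SVD; the toy sentences, the window size, and the dimensionality k are placeholder choices, not values required by the assignment.

# Hypothetical sketch: co-occurrence counts with a fixed window, then truncated SVD.
from collections import defaultdict
import numpy as np

corpus = [["the", "camera", "takes", "sharp", "pictures"],
          ["the", "battery", "lasts", "all", "day"]]
window = 2                                   # fixed context window size (illustrative)

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as k-dimensional embeddings.
U, S, Vt = np.linalg.svd(counts)
k = 3
embeddings = U[:, :k] * S[:k]                # one k-dimensional vector per vocabulary word
print(embeddings[idx["camera"]])

In the actual assignment, the corpus from Section 3 replaces the toy sentences, and the rows of the truncated matrix serve as the word embeddings.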
The vanilla implementation of the word2vec training algorithm incurs a high computational cost because gradients are computed over the entire vocabulary. To reduce this cost while still providing a good approximation, variants such as the Hierarchical Softmax output, Negative Sampling, and subsampling of frequent words were proposed on top of the basic architecture.
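For reference, this is the standard formulation (a sketch, not an additional requirement). With a full softmax, the probability of a target word w_O given the context representation v_{w_C} requires a normalization over the whole vocabulary V:

P(w_O \mid w_C) = \frac{\exp\left(u_{w_O}^{\top} v_{w_C}\right)}{\sum_{w \in V} \exp\left(u_{w}^{\top} v_{w_C}\right)}

Negative sampling replaces this with k binary classifications against sampled noise words, maximizing per training pair:

\log \sigma\left(u_{w_O}^{\top} v_{w_C}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-u_{w_i}^{\top} v_{w_C}\right)\right]

where \sigma is the logistic sigmoid, u and v are output and input embedding vectors, and P_n(w) is a noise distribution (commonly the unigram distribution raised to the 3/4 power). Only k + 1 output vectors receive gradient updates per training pair instead of all |V|.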
This assignment aims to familiarize you with the above algorithms by implementing one of the frequency-based approaches and comparing it with the embeddings obtained using one of the variants of word2vec. Specifically, you will first obtain embeddings by applying the Singular Value Decomposition (SVD) method to the specified corpus. Next, you will implement the CBOW variant of word2vec with Negative Sampling. A short analysis will follow, highlighting the differences in the quality of the embeddings obtained.
Note on terminology: The terms "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to the fact that we are encoding aspects of a word’s meaning in a lower-dimensional space. As Wikipedia states, "conceptually it involves a mathematical embedding from space with one dimension per word to a continuous vector space with a much lower dimension".
2.1 Theory
Explain negative sampling. How do we approximate the word2vec training computation using this technique? [10 marks]
2.2 Implementation
1. Implement a word embedding model and train word vectors by first building a Co-occurrence Matrix followed by the application of SVD. [25 marks]
2. Implement the word2vec model and train word vectors using the CBOW model with Negative Sampling (a minimal sketch follows this list). [35 marks]
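For orientation only, here is a hedged NumPy sketch of a single CBOW update with Negative Sampling. The sizes V, dim, k, the learning rate, and the uniform noise_dist are illustrative placeholders (in practice the noise distribution is usually the unigram distribution raised to the 3/4 power), and collisions between the target and sampled negatives are ignored for simplicity.

# Sketch of one CBOW + Negative Sampling stochastic update (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
V, dim, k, lr = 5000, 100, 5, 0.025          # vocab size, embedding dim, negatives, learning rate
W_in = rng.normal(scale=0.01, size=(V, dim)) # input (context) embeddings
W_out = np.zeros((V, dim))                   # output (target) embeddings
noise_dist = np.full(V, 1.0 / V)             # placeholder for unigram**0.75, normalized

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_ns_step(context_ids, target_id):
    """One update for a single (context window, target word) training pair."""
    h = W_in[context_ids].mean(axis=0)                       # average of the context vectors
    negatives = rng.choice(V, size=k, p=noise_dist)          # sampled "noise" words
    out_ids = np.concatenate(([target_id], negatives))
    labels = np.array([1.0] + [0.0] * k)                     # 1 for the true target, 0 for negatives

    scores = sigmoid(W_out[out_ids] @ h)                     # only k+1 output rows touched, not all V
    errors = scores - labels                                 # gradient of the logistic loss

    grad_h = errors @ W_out[out_ids]                         # backprop to the hidden (averaged) layer
    W_out[out_ids] -= lr * np.outer(errors, h)               # update sampled output vectors
    W_in[context_ids] -= lr * grad_h / len(context_ids)      # spread gradient over the context words

# Example: predict word id 42 from a window of four context word ids.
cbow_ns_step([10, 11, 13, 14], 42)

A full training loop would iterate such updates over all (context, target) pairs extracted from the corpus for several epochs.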
2.3 Analysis
Report the following for both models after training.
1. Display the top-10 closest word vectors for five different words (a combination of nouns, verbs, adjectives, etc.) on a 2D plot using t-SNE (or a similar dimensionality-reduction method). [10 marks]
2. What are the top 10 closest words for the word 'camera' in the embeddings generated by your program? Compare them against the pre-trained word2vec embeddings that you can download off the shelf (a sketch covering both analysis items follows this list). [10 marks]
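The following sketch shows one possible way to produce both analysis items. It assumes a dict named embeddings mapping each word to its trained vector (produced by either of your two models); the probe words, the helper nearest, and the gensim model name "word2vec-google-news-300" are illustrative choices rather than requirements.

# Illustrative analysis sketch: t-SNE plot of neighbours plus an off-the-shelf comparison.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gensim.downloader as api

def nearest(embeddings, query, topn=10):
    """Cosine-similarity neighbours of `query` among the trained embeddings."""
    q = embeddings[query]
    sims = {w: v @ q / (np.linalg.norm(v) * np.linalg.norm(q))
            for w, v in embeddings.items() if w != query}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# 2-D visualisation of a few probe words and their nearest neighbours.
probes = ["camera", "buy", "good", "battery", "return"]      # example word choices
words = [w for p in probes for w in [p] + nearest(embeddings, p)]
points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(
    np.stack([embeddings[w] for w in words]))
plt.scatter(points[:, 0], points[:, 1])
for (x, y), w in zip(points, words):
    plt.annotate(w, (x, y))
plt.savefig("tsne_neighbours.png")

# Off-the-shelf comparison for "camera" using publicly available pre-trained vectors.
pretrained = api.load("word2vec-google-news-300")
print(nearest(embeddings, "camera"))
print([w for w, _ in pretrained.most_similar("camera", topn=10)])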
2.4 Presentation
The following points will be checked during individual evaluations, which will involve a code walk-through, an explanation of the report analysis, and questions based on the implementation.
• Implementation efficiency & quality
• Report quality
• Inclusion of a readme along with the code
• Crisp & clear explanation of the code during evals
[10 marks]
3 Training corpus
Please train your model on the corpus linked here:
http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
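A minimal way to stream this corpus, assuming the usual layout of one JSON object per line with the review body stored under the "reviewText" key (if your copy uses Python-literal syntax instead, ast.literal_eval can be substituted for json.loads):

# Sketch: stream the gzipped reviews without loading the whole file into memory.
import gzip
import json

def review_sentences(path="reviews_Electronics_5.json.gz"):
    """Yield lowercased, whitespace-tokenised review texts one at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            review = json.loads(line)
            yield review.get("reviewText", "").lower().split()

# Example: count the reviews in the corpus.
print(sum(1 for _ in review_sentences()))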