$25
In this homework, we try to solve the problem of predicting wine quality from review texts and other properties of the wine.
While you can find several kernels on Kaggle already, I highly recommend you start your own solution from scratch. For this homework, use only wines from the United States. Feel free to subsample the data when building your model.
Task 1 Bag of Words and Simple Features
1.1 Create a baseline model for predicting wine quality using only non-text features.
1.2 Create a simple text-based model using a bag-of-words approach and a linear model.
1.3 Try using n-grams, character n-grams, tf-idf rescaling and possibly other ways to tune the BoW model. Be aware that you might need to adjust the linear model (in particular its regularization) for different feature sets.
1.4 Combine the non-text features and the text features. How much does adding the non-text features improve upon using bag-of-words alone?
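One way to organize the three models in Task 1 is sketched below with scikit-learn. It is only an illustration on a toy table: the column names `description`, `variety`, `price` and `points`, the choice of `Ridge` (treating quality as a regression target), and all hyperparameters are assumptions, not part of the assignment.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the wine data; the column names are assumptions.
wines = pd.DataFrame({
    "description": [
        "ripe cherry and soft oak",
        "thin and sour with a short finish",
        "bright citrus and crisp acidity",
        "dense black fruit and firm tannins",
        "simple sweet and flat",
        "elegant floral notes and long finish",
    ],
    "variety": ["Pinot Noir", "Merlot", "Riesling", "Cabernet", "Merlot", "Riesling"],
    "price": [45.0, 12.0, None, 80.0, 9.0, 30.0],
    "points": [91, 82, 88, 94, 80, 90],
})


def non_text_features():
    # Impute and scale the numeric column, one-hot encode the categorical one.
    numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    return [
        ("price", numeric, ["price"]),
        ("variety", OneHotEncoder(handle_unknown="ignore"), ["variety"]),
    ]


def text_features():
    # Passing the column name as a string hands the vectorizer a 1-D column.
    return [("bow", TfidfVectorizer(ngram_range=(1, 2)), "description")]


def build_model(transformers, alpha=1.0):
    # alpha is the regularization strength; it usually needs retuning
    # whenever the feature set changes.
    return make_pipeline(ColumnTransformer(transformers), Ridge(alpha=alpha))


models = {
    "non-text only": build_model(non_text_features()),
    "text only": build_model(text_features()),
    "combined": build_model(non_text_features() + text_features()),
}

X, y = wines.drop(columns="points"), wines["points"]
predictions = {name: model.fit(X, y).predict(X) for name, model in models.items()}
```

The toy example fits and predicts on the same six rows only to show the plumbing. For an actual comparison, each pipeline would be scored on held-out data, for example with `cross_val_score`, and `alpha` tuned per feature set, for example with `GridSearchCV`.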
Task 2 Word Vectors
Use a pretrained word embedding (word2vec, GloVe or fastText) for featurization instead of the bag-of-words model. Does this improve classification? How about combining the embedded words with the BoW model?
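A common and simple way to turn word vectors into document features is to average the vectors of the words in each review. The sketch below assumes whitespace tokenization and uses a hypothetical three-dimensional toy embedding (`toy_vectors`) in place of a real pretrained model; the function name `embed_texts` is likewise made up for illustration.

```python
import numpy as np


def embed_texts(texts, word_vectors, dim):
    """Represent each text as the mean of the vectors of its known words."""
    features = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        vectors = [
            word_vectors[token]
            for token in text.lower().split()
            if token in word_vectors
        ]
        if vectors:  # texts without any known word keep the all-zero vector
            features[row] = np.mean(vectors, axis=0)
    return features


# Hypothetical toy embedding standing in for word2vec, GloVe or fastText.
toy_vectors = {
    "ripe": np.array([1.0, 0.0, 0.0]),
    "cherry": np.array([0.0, 1.0, 0.0]),
    "oak": np.array([0.0, 0.0, 1.0]),
}

features = embed_texts(["Ripe cherry", "oak", "unknown words only"], toy_vectors, dim=3)
```

Any object that supports `token in word_vectors` and `word_vectors[token]` can be passed in, which includes a plain dict and gensim's `KeyedVectors`. To combine these dense features with a sparse BoW matrix, one option is `scipy.sparse.hstack`.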
Task 3 Transformers (bonus / optional)
Fine-tune a BERT model on the text data alone using the transformers library.
How does this model compare to a BoW model, and how does it compare to a model using all features?
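A minimal fine-tuning loop could look roughly like the sketch below. It assumes PyTorch as the backend, the `bert-base-uncased` checkpoint, and a regression head on the quality score; the function names and all hyperparameters are illustrative choices, not requirements of the task.

```python
import random


def batches(items, size):
    """Yield consecutive chunks of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fine_tune(texts, scores, model_name="bert-base-uncased",
              epochs=1, batch_size=16, lr=2e-5, max_length=128):
    # The heavy dependencies are imported here so that the helper above
    # stays usable without them.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels=1 makes the head a single-output regressor trained with MSE.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    examples = list(zip(texts, scores))
    model.train()
    for _ in range(epochs):
        random.shuffle(examples)  # new batch order every epoch
        for batch in batches(examples, batch_size):
            batch_texts = [text for text, _ in batch]
            labels = torch.tensor(
                [score for _, score in batch], dtype=torch.float
            )
            encoded = tokenizer(
                batch_texts, padding=True, truncation=True,
                max_length=max_length, return_tensors="pt",
            )
            loss = model(**encoded, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return tokenizer, model
```

Standardizing the scores (subtracting the mean and dividing by the standard deviation) before training tends to make the regression loss better behaved. Training on a subsample and on a GPU keeps the run time manageable.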