CSCI5260 – Project 4: Guess What?


Description
Background
You now work for a prominent winery that has hired you to predict the quality of the wine it produces, based on already-collected data. The winery collects two main sets of data: one on the white wines it produces (winequality-white.csv, n=4898) and one on the red wines it produces (winequality-red.csv, n=1599).

Data Description

Data are in two files: winequality-white.csv (4898 rows x 12 columns) and winequality-red.csv (1599 rows x 12 columns).

Input Variables

These input variables come from physicochemical tests that are run regularly.

fixed acidity          Range: 3.8 to 15.9
volatile acidity       Range: 0.08 to 1.58
citric acid            Range: 0 to 1.66
residual sugar         Range: 0.9 to 65.8
chlorides              Range: 0.009 to 0.611
free sulfur dioxide    Range: 1 to 289
total sulfur dioxide   Range: 6 to 440
density                Range: 0.98711 to 1.03898
pH                     Range: 2.72 to 4.01
sulphates              Range: 0.22 to 2.0
alcohol                Range: 8 to 14.9
Output Variable

quality                Range: 0 to 10
Part 1 – Unsupervised Learning
Coding and Analysis Requirements
Create a file called project4_clustering.py. Write a program that does the following:

Read winequality-white.csv and winequality-red.csv into two separate Pandas data frames. Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv
Create a target_white data frame and a target_red data frame by selecting the data’s last column (the ‘quality’ column) and storing it there. For example: target_red = data_red['quality']. Be sure to use the drop function after you have copied it, to remove it from the original data. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop
Using sklearn.cluster.KMeans, run the k-means clustering algorithm on the white wines and the red wines. Use 11 clusters, because we know there are 11 possible quality values (labeled 0-10). See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Note that after calling the fit function, the fitted estimator exposes the following attributes:

cluster_centers_    ndarray of shape (n_clusters, n_features). Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
labels_             ndarray of shape (n_samples,). Labels of each point.
inertia_            float. Sum of squared distances of samples to their closest cluster center.
n_iter_             int. Number of iterations run.
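The attributes above can be illustrated on a tiny toy fit (synthetic one-feature data with two obvious groups, not the wine data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four 1-D points forming two obvious clusters.
X = np.array([[0.0], [0.1], [10.0], [10.1]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (n_clusters, n_features) -> (2, 1)
print(km.labels_.shape)           # (n_samples,) -> (4,)
print(km.inertia_, km.n_iter_)    # within-cluster SSE and iteration count
```

With the real data, `X` would be the feature frame (after dropping 'quality') and `n_clusters` would be 11.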
Analyze the results for the white wine and the red wine examples, and add a discussion to the Project4.docx writeup document. Remember that the cluster labels ARE NOT predictions of quality; a label is simply the grouping to which an example belongs. To analyze this, write a procedure that determines a quality for each cluster by averaging the qualities of all items in that cluster.
This is OPEN-ENDED, but you should use this information to plot the quality values for each cluster. Include these plots in your docx writeup.
Does the data indicate 11 clearly-defined quality metrics? Explain why it does or does not.
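One way the Part 1 pipeline could be sketched is below. A small synthetic frame stands in for the real data so the snippet runs on its own; with the actual files you would instead call something like `pd.read_csv("winequality-red.csv", sep=";")` (the UCI wine-quality CSVs are semicolon-separated; adjust `sep` if your copies differ). The column names and sizes here are illustrative, not the full 12-column schema.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for winequality-red.csv (illustrative columns only).
rng = np.random.default_rng(0)
data_red = pd.DataFrame({
    "alcohol": rng.uniform(8, 15, 60),
    "pH": rng.uniform(2.7, 4.0, 60),
    "quality": rng.integers(3, 9, 60),
})

# Step 2: copy off the target, then drop it from the feature frame.
target_red = data_red["quality"]
data_red = data_red.drop(columns=["quality"])

# Step 3: k-means with 11 clusters, one per possible quality value 0-10.
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0).fit(data_red)

# Step 4: a cluster label is NOT a quality prediction, so estimate a
# quality per cluster by averaging the true qualities of its members.
cluster_quality = target_red.groupby(kmeans.labels_).mean()
print(cluster_quality.sort_index())
```

`cluster_quality` is what you would then plot (e.g. a bar chart of mean quality per cluster) for the writeup.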
Part 2 – Supervised Learning
Coding and Analysis Requirements
Create a file called project4_ml.py. Using the same data sets as above, do the following.

Combine the data sets into a single data set. To do this, add a column called “type” to each data frame.
Set red wine as type 0 and white wine as type 1.
Split the data into train and test sets.
Train and test two of the following learning algorithms from the scikit-learn library. Be sure to use the same train and test data for each.

Decision Tree Classifier - https://scikit-learn.org/stable/modules/tree.html#classification
Linear Regression Classifier - https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression
Gaussian Naïve Bayes Classifier - https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
Nearest Neighbor Classifier - https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification
Support Vector Machine - https://scikit-learn.org/stable/modules/svm.html#classification

Analyze the results by showing the following (add your analysis to Project4.docx):
Which classification method performed better?
You should measure the number of true negatives, true positives, false negatives, and false positives. If you want to drill down, it might be helpful to track these by class.
Based on the results, what could you do to improve performance?
Keep in mind the ideas of feature engineering and feature scaling as you respond to this.
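A minimal sketch of the Part 2 steps follows, using a decision tree and Gaussian naive Bayes as the two chosen classifiers (any two from the list would do). Synthetic stand-in frames are used so the snippet is self-contained; with the real data, `data_red` and `data_white` would be the frames read in Part 1, with 'quality' still attached.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for the red and white frames (illustrative columns).
rng = np.random.default_rng(1)
def fake_wine(n):
    return pd.DataFrame({"alcohol": rng.uniform(8, 15, n),
                         "pH": rng.uniform(2.7, 4.0, n),
                         "quality": rng.integers(3, 9, n)})
data_red, data_white = fake_wine(80), fake_wine(120)

# Tag each frame (red = 0, white = 1), then stack into one data set.
data_red["type"] = 0
data_white["type"] = 1
data = pd.concat([data_red, data_white], ignore_index=True)

# Split once, and reuse the SAME split for every model being compared.
X = data.drop(columns=["quality"])
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

for model in (DecisionTreeClassifier(random_state=0), GaussianNB()):
    pred = model.fit(X_train, y_train).predict(X_test)
    # In the multiclass confusion matrix, the diagonal holds the true
    # positives; per-class FP/FN/TN follow from its row and column sums.
    print(type(model).__name__, confusion_matrix(y_test, pred).trace())
```

For the TN/TP/FN/FP counts the writeup asks for, the confusion matrix (or `sklearn.metrics.classification_report`) gives the per-class breakdown directly.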
Part 3 – Deep Learning
Coding and Analysis Requirements
Create a file called project4_nn.py. 

Use the combined data set from Part 2, and the same train and test sets.
Train and test a Multilayer Perceptron Neural Network Classifier (MLPClassifier).
https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification
Analyze the results (recording the analysis in Project4.docx) by:
Showing true negatives, true positives, false negatives, and false positives. If you want to drill down, you might track these by class to better analyze results.
Comparing the results to the models trained above.
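The Part 3 model could be sketched as below. Random arrays stand in for the wine features so the snippet runs standalone; with the real data you would reuse the exact `X_train`/`X_test`/`y_train`/`y_test` split from Part 2 so the comparison is fair. Scaling before the MLP is a choice worth noting in the writeup: MLPs are sensitive to feature magnitudes, and the wine features span very different ranges.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

# Synthetic stand-in: 200 examples with 12 features, quality labels 3-8.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 12))
y = rng.integers(3, 9, 200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Scale features, then fit the MLP (layer sizes here are illustrative).
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
pred = clf.fit(X_train, y_train).predict(X_test)

cm = confusion_matrix(y_test, pred)  # diagonal = true positives per class
print(cm.sum(), "test examples classified")
```

The same confusion-matrix breakdown used in Part 2 then lets you compare the MLP's TN/TP/FN/FP counts directly against the earlier models.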
