NCTU - CS Assignment #1 - Naïve Bayes

Introduction to Machine Learning Program Assignment #1 - Naïve Bayes

This programming assignment aims to help you understand the algorithm behind Naïve Bayes classifier and the basic workflow of machine learning.

Before we start
You may choose to go to the PC classrooms or finish this HW elsewhere.
It won’t affect anything. For fairness’ sake, we’ll use Discord as the Q&A system.

Join the Discord server for TA support

Ask questions on it, and we shall reply. (We won’t respond to raised hands.)
Try not to ask for obvious answers or bug fixes.
Memes and chit chat welcome
Objective
There are two datasets that need to be analyzed. For each dataset, you have to do the following:

Data Input - 5%
Data Visualization - 5%
For mushroom dataset
Show the data distribution by value frequency of every feature.
For Iris dataset
Show the data distribution by average, standard deviation, and value frequency (binning might be needed) of every feature.
Split data based on their labels (targets) and show the data distribution of each feature again.
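The distribution summaries listed above can be computed before anything is plotted. The following is a minimal sketch using only the standard library. The helper names (`value_frequency`, `mean_std`, `binned_frequency`, `split_by_label`), the equal-width binning scheme, and the use of the population standard deviation are assumptions made for illustration, not requirements of the assignment.

```python
from collections import Counter
from statistics import mean, pstdev


def value_frequency(values):
    """Count how often each value of a categorical feature occurs."""
    return Counter(values)


def mean_std(values):
    """Mean and population standard deviation of a numerical feature."""
    return mean(values), pstdev(values)


def binned_frequency(values, n_bins=5):
    """Count numerical values in n_bins equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


def split_by_label(rows, labels):
    """Group rows by their label so each class can be summarized separately."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(y, []).append(row)
    return groups
```

The returned counts and statistics can then be passed to any plotting library to draw the bar charts or histograms for the report.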
Data Preprocessing - 5% + (10%)
Drop features with any missing value.
Transform data format and shape so your model can process them.
Shuffle the data.
Bonus: any other transformation that boosts the final performance. - (10%)
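The first and third preprocessing steps can be sketched as below, assuming the data has already been read into a list of rows (one list of values per sample) and that missing values are marked with `?`, as the mushroom attribute list does for stalk-root. The function names and the fixed seed are hypothetical choices for illustration.

```python
import random


def drop_features_with_missing(rows, missing="?"):
    """Remove every column that contains the missing marker in any row.

    Returns the cleaned rows and the indices of the columns that were kept.
    """
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if all(row[j] != missing for row in rows)]
    return [[row[j] for j in keep] for row in rows], keep


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy; a fixed seed keeps experiments repeatable."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Shuffling a copy instead of shuffling in place keeps the original order available for debugging.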
Model Construction - 20%
You must construct two Naïve Bayes classifiers for the two datasets.
Naïve Bayes classifier $M$ in log-space:
$$M(q) = \arg\max_{Y \in T} \left[ \log P(Y) + \sum_{i=1}^{m} \log P(X_i \mid Y) \right]$$
where $q = \{X_1, X_2, \dots, X_m\}$ is a sample to be predicted, whose features are $X_1$ to $X_m$. $T$ is the set of all possible labels.
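The decision rule above can be sketched as a small function that scores every label and returns the best one. The names `predict`, `safe_log`, `log_prior`, and `log_likelihood` are hypothetical. Mapping a zero probability to negative infinity is one possible convention for the unsmoothed case, not something the assignment prescribes.

```python
import math


def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising an error."""
    return math.log(p) if p > 0 else -math.inf


def predict(sample, log_prior, log_likelihood):
    """Return the label Y that maximizes log P(Y) + sum_i log P(X_i | Y).

    log_prior: dict mapping each label to log P(Y)
    log_likelihood: function (i, x_i, label) -> log P(X_i = x_i | Y = label)
    """
    def score(label):
        return log_prior[label] + sum(
            log_likelihood(i, x, label) for i, x in enumerate(sample)
        )

    # On ties (including all scores being -inf), max returns the first label.
    return max(log_prior, key=score)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when a sample has many features, and it does not change the selected label because the logarithm is monotonically increasing.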
For the mushroom dataset, whose features are all categorical, $P(X_i \mid Y)$ must be computed with and without Laplace smoothing for result comparison. - 10%
Without Laplace smoothing
$$P(X_i \mid Y) = \frac{N(X_i \mid Y)}{N(Y)}$$
Laplace smoothing
$$P(X_i \mid Y) = \frac{N(X_i \mid Y) + k}{N(Y) + k\tau}$$
where $\tau$ is the number of all possible events of feature $X_i$
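Both estimates can be sketched with one function, since setting the constant to zero reduces the smoothed formula to the unsmoothed one. The function name and argument layout are hypothetical. Reading $k$ as the smoothing constant (commonly 1) is an assumption, as the text above does not define it, and the sketch also assumes the label occurs at least once in the training data.

```python
def conditional_probability(feature_values, labels, value, label, tau, k=0):
    """Estimate P(X_i = value | Y = label) = (N(X_i|Y) + k) / (N(Y) + k * tau).

    k = 0 gives the unsmoothed estimate; k > 0 applies Laplace smoothing.
    tau is the number of possible values of the feature.
    """
    # N(Y): number of training samples with this label
    n_y = sum(1 for y in labels if y == label)
    # N(X_i|Y): number of those samples whose feature equals the value
    n_xy = sum(
        1 for x, y in zip(feature_values, labels) if x == value and y == label
    )
    return (n_xy + k) / (n_y + k * tau)
```

For example, with feature values `["n", "n", "w", "w"]` and labels `["e", "e", "e", "p"]`, a value that never occurs with label `e` gets probability 0 without smoothing, but (0 + 1) / (3 + 1 * 3) = 1/6 with `k=1` and `tau=3`.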
For the Iris dataset, whose features are all numerical, assume $P(X_i \mid Y)$ follows a 1D normal (Gaussian) distribution. - 10%
$$P(X_i \mid Y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $x$ is the observed value of $X_i$, and $\mu, \sigma$ are the mean and standard deviation of feature $X_i$ respectively, computed over the samples with label $Y$.
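The Gaussian likelihood above can be sketched as follows. The function names are hypothetical, and using the population standard deviation (`pstdev`) instead of the sample standard deviation is an assumption, since the text does not say which one is intended. The sketch also assumes the standard deviation is greater than zero.

```python
import math
from statistics import mean, pstdev


def fit_gaussian(values):
    """Estimate (mu, sigma) from one feature's values within one class."""
    return mean(values), pstdev(values)


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def gaussian_log_pdf(x, mu, sigma):
    """log of gaussian_pdf, computed directly to stay stable in log-space."""
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - ((x - mu) ** 2) / (
        2.0 * sigma ** 2
    )
```

Computing the logarithm directly, instead of calling `math.log` on the density, avoids taking the log of a value that has underflowed to zero for points far from the mean.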
Train-Test-Split - 5%
Two validation methods need to be implemented.
Holdout validation with the ratio 7:3
K-fold cross-validation with $K=3$
Obtain the final performance by averaging all folds’ performance.
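The two validation schemes above can be sketched as plain list slicing, assuming the rows have already been shuffled and are stored in a Python list. The function names are hypothetical, and giving the last fold the leftover rows when the data does not divide evenly is one possible choice.

```python
def holdout_split(rows, train_ratio=0.7):
    """Split rows into (train, test) with the given training ratio."""
    cut = round(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


def k_fold_splits(rows, k=3):
    """Yield (train, test) pairs; each row lands in exactly one test fold."""
    n = len(rows)
    bounds = [n * i // k for i in range(k + 1)]  # fold boundaries
    for i in range(k):
        test = rows[bounds[i]:bounds[i + 1]]
        train = rows[:bounds[i]] + rows[bounds[i + 1]:]
        yield train, test


def average(scores):
    """Mean of the per-fold scores, used as the final K-fold performance."""
    return sum(scores) / len(scores)
```

A typical use is to train and evaluate once per pair yielded by `k_fold_splits`, collect one score per fold, and report `average(scores)`.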
Results - 10%
Obtain the performances of all experiment settings in tables by the following metrics:
Confusion matrix
Accuracy
Sensitivity (Recall)
Precision
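The metrics above can all be derived from the confusion matrix. The sketch below assumes rows index the true label and columns index the predicted label, and it returns 0.0 when a metric's denominator is zero. Both conventions, as well as the function names, are assumptions made for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts samples with true label labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of all samples on the diagonal (predicted correctly)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def recall(matrix, i):
    """Sensitivity for class i: correct predictions / samples truly in class i."""
    actual = sum(matrix[i])
    return matrix[i][i] / actual if actual else 0.0


def precision(matrix, i):
    """Precision for class i: correct predictions / samples predicted as class i."""
    predicted = sum(row[i] for row in matrix)
    return matrix[i][i] / predicted if predicted else 0.0
```

For the three-class Iris problem, recall and precision are computed once per class by passing each class index in turn.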
Comparison & Conclusion - 5%
Questions - 25%
For the mushroom dataset
Show $P(X_{\text{stalk-color-below-ring}} \mid Y = e)$ with and without Laplace smoothing by histograms - 10%
For Iris dataset
What are the values of $\mu$ and $\sigma$ of the assumed $P(X_{\text{petal\_length}} \mid Y = \text{Iris Versicolour})$? - 5%
Use a graph to show the probability density function of the assumed $P(X_{\text{petal\_length}} \mid Y = \text{Iris Versicolour})$ - 10%
Finish during class - 20%
Submit your report and source codes to the newE3 system before class ends.
Finish time will be determined by the submission time.
Data
1. Mushroom dataset
Data can be downloaded here:
https://archive.ics.uci.edu/ml/datasets/mushroom
Please NOTE that the first column is the label (edible=e, poisonous=p)
Data Set Information
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.
Attribute Information
cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
bruises?: bruises=t,no=f
odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
gill-attachment: attached=a,descending=d,free=f,notched=n
gill-spacing: close=c,crowded=w,distant=d
gill-size: broad=b,narrow=n
gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
stalk-shape: enlarging=e,tapering=t
stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r, missing=?
stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
veil-type: partial=p,universal=u
veil-color: brown=n,orange=o,white=w,yellow=y
ring-number: none=n,one=o,two=t
ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
2. Iris dataset
Data can be downloaded here:
https://archive.ics.uci.edu/ml/datasets/iris
Data Set Information
This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. Predicted attribute: class of iris plant. This is an exceedingly simple domain.
Attribute Information
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
Iris Setosa
Iris Versicolour
Iris Virginica
Submission & Scoring Policy
Please submit a zip file, which contains the following, to the newE3 system.
Report
Explanation of how your code works.
All the content mentioned above.
Your name and student ID at the very beginning - 10%
Accepted formats: HTML
Source codes
Accepted languages: Python 3
Accepted formats: .ipynb
Package-provided models are allowed
Your score will be determined mainly by the submitted report.
If there’s any problem with your code, the TA might ask you (through email) to demo it. Otherwise, no demo is needed.
Scores will be adjusted at the end of the semester to fit the school regulations.
Plagiarizing is not allowed.
You will get ZERO on that homework if you get caught the first time.
The second time, you’ll FAIL this class.