NCTU_ML-Homework 1

Before we start
You may choose to go to the PC classrooms or finish this HW elsewhere.
It won’t affect a thing. For fairness’ sake, we’ll use Discord as the Q&A system.

Join the Discord server for TA support.

Ask your questions there and we will reply. (We won’t respond to raised hands.)
Try not to ask for obvious answers or bug fixes.
Memes and chit-chat are welcome.
Objective
There are two datasets that need to be analyzed. For each dataset, you have to do the following:

Data Input - 5%
Data Visualization - 5%
  For the mushroom dataset: show the data distribution by value frequency of every feature.
  For the Iris dataset: show the data distribution by average, standard deviation, and value frequency (binning might be needed) of every feature.
  Split the data based on their labels (targets) and show the data distribution of each feature again.
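The summaries asked for in the visualization step can be computed before any plotting. The following is a minimal sketch using only the Python standard library; the function names, the equal-width binning, and the use of the sample standard deviation are assumptions, not requirements of the assignment.

```python
from collections import Counter
from statistics import mean, stdev


def value_frequency(column):
    """Count how often each value occurs in one feature column."""
    return Counter(column)


def summarize(column):
    """Return (average, sample standard deviation) of a numeric column."""
    return mean(column), stdev(column)


def bin_frequency(column, n_bins=5):
    """Count numeric values per equal-width bin spanning [min, max]."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant columns
    counts = [0] * n_bins
    for v in column:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def split_by_label(rows, labels):
    """Group rows by their label so each class can be summarized separately."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(y, []).append(row)
    return groups
```

The counts returned by `value_frequency` and `bin_frequency` can be passed to whichever plotting tool you prefer to draw the bar charts or histograms.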
Data Preprocessing - 5% + (10%)
  Drop features with any missing value.
  Transform data format and shape so your model can process them.
  Shuffle the data.
  Bonus: any other transformation that boosts the final performance. - (10%)
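As an illustration of the first and third preprocessing steps, here is a minimal sketch that drops every feature column containing a missing value and then shuffles the rows. It assumes the data is already loaded as a list of rows (lists of strings) and that missing values are marked with `?`, as in the mushroom attribute list; the function names and the fixed seed are hypothetical.

```python
import random


def drop_features_with_missing(rows, missing="?"):
    """Remove every column that holds the missing marker in at least one row.

    Returns the cleaned rows and the indices of the columns that were kept.
    """
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if all(r[j] != missing for r in rows)]
    cleaned = [[r[j] for j in keep] for r in rows]
    return cleaned, keep


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of the rows; the seed makes the order repeatable."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Keeping the label in the same row as its features, as assumed here, ensures that shuffling does not break the pairing between samples and labels.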
Model Construction - 20%
  You must construct two Naïve Bayes classifiers for the two datasets.
  Naïve Bayes classifier M in log-space:
    $$M(q) = \arg\max_{Y \in T} \left[ \log P(Y) + \sum_{i=1}^{m} \log P(X_i \mid Y) \right]$$
    where $q = \{X_1, X_2, \ldots, X_m\}$ is a sample to be predicted, whose features are $X_1$ to $X_m$, and $T$ is the set of all possible labels.
  For the mushroom dataset, whose features are all categorical, $P(X_i \mid Y)$ must be computed with and without Laplace smoothing for result comparison. - 10%
    Without Laplace smoothing:
      $$P(X_i \mid Y) = \frac{N(X_i \mid Y)}{N(Y)}$$
    With Laplace smoothing:
      $$P(X_i \mid Y) = \frac{N(X_i \mid Y) + k}{N(Y) + k\tau}$$
      where $k$ is the smoothing constant and $\tau$ is the number of all possible events (values) of feature $X_i$.
  For the Iris dataset, whose features are all numerical, assume $P(X_i \mid Y)$ follows a 1D normal (Gaussian) distribution. - 10%
      $$P(X_i \mid Y) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
      where $x$ is the observed value of feature $X_i$, and $\mu$ and $\sigma$ are the mean and standard deviation of feature $X_i$ among the samples with label $Y$.
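The categorical case can be sketched as follows: count N(Y) and N(Xi|Y) on the training data, turn the counts into conditional probabilities with an optional smoothing constant k, and pick the label with the largest log-space score. This is a minimal sketch, not a reference solution. All function names are hypothetical, and τ is taken here as the number of distinct values observed in the training data, which is an assumption; you may prefer to take it from the attribute list instead.

```python
import math
from collections import Counter, defaultdict


def train_categorical_nb(rows, labels):
    """Collect the counts N(Y) and N(X_i | Y) from the training data."""
    class_counts = Counter(labels)
    # feature_counts[(i, y)][v]: number of class-y samples whose feature i equals v
    feature_counts = defaultdict(Counter)
    # values[i]: distinct values seen for feature i (its size is used as tau)
    values = defaultdict(set)
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, y)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def conditional_prob(model, i, v, y, k=0.0):
    """P(X_i = v | Y = y); k = 0 means no smoothing, k > 0 applies Laplace smoothing."""
    class_counts, feature_counts, values = model
    tau = len(values[i])
    return (feature_counts[(i, y)][v] + k) / (class_counts[y] + k * tau)


def predict(model, row, k=1.0):
    """Return the label maximizing log P(Y) + sum_i log P(X_i | Y)."""
    class_counts, _, _ = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for y, n_y in class_counts.items():
        score = math.log(n_y / total)
        for i, v in enumerate(row):
            p = conditional_prob(model, i, v, y, k)
            # Without smoothing, an unseen value gives p = 0, i.e. log p = -inf.
            score += math.log(p) if p > 0 else -math.inf
        if best_label is None or score > best_score:
            best_label, best_score = y, score
    return best_label
```

Calling `predict` with `k=0` and then with `k=1` on the same test split gives the two settings that the assignment asks you to compare.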
Train-Test-Split - 5%
  Two validation methods need to be implemented.
  Holdout validation with the ratio 7:3
  K-fold cross-validation with K=3
    Obtain the final performance by averaging all folds’ performance.
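Both validation methods reduce to choosing index sets. The sketch below shows one possible way to produce a 7:3 holdout split and K=3 folds over shuffled indices, plus the averaging of per-fold scores. The function names, the fixed seed, and the interleaved fold assignment are assumptions made for illustration.

```python
import random


def holdout_split(n_samples, train_ratio=0.7, seed=0):
    """Return (train_indices, test_indices) for a holdout split, 7:3 by default."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_ratio))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_samples, k=3, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def average(scores):
    """Average the per-fold scores to obtain the final performance."""
    return sum(scores) / len(scores)
```

Each sample appears in exactly one test fold, so every sample is used for testing once and for training K-1 times.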
Results - 10%
  Obtain the performances of all experiment settings in tables by the following metrics:
    Confusion matrix
    Accuracy
    Sensitivity (Recall)
    Precision
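All four metrics can be derived from the confusion matrix alone. The following is a minimal sketch in which rows are true labels and columns are predicted labels; that orientation, the function names, and returning 0.0 for an empty denominator are assumptions. For the three-class Iris dataset, recall and precision are computed per class.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """matrix[t][p] = number of samples with true label t predicted as label p."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def recall(matrix, i):
    """Sensitivity of class i: correct predictions over all true class-i samples."""
    actual = sum(matrix[i])
    return matrix[i][i] / actual if actual else 0.0


def precision(matrix, i):
    """Precision of class i: correct predictions over all samples predicted as i."""
    predicted = sum(row[i] for row in matrix)
    return matrix[i][i] / predicted if predicted else 0.0
```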
Comparison & Conclusion - 5%
Questions - 25%
  For the mushroom dataset
    Show $P(X_{\text{stalk-color-below-ring}} \mid Y = e)$ with and without Laplace smoothing by histograms - 10%
  For the Iris dataset
    What are the values of $\mu$ and $\sigma$ of the assumed $P(X_{\text{petal\_length}} \mid Y = \text{Iris Versicolour})$? - 5%
    Use a graph to show the probability density function of the assumed $P(X_{\text{petal\_length}} \mid Y = \text{Iris Versicolour})$ - 10%
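For the Iris questions, μ and σ can be estimated from the petal-length values of one class and then plugged into the Gaussian density. The sketch below estimates both parameters and evaluates the density on a grid of points that can be handed to a plotting tool. The function names, the population (divide-by-n) standard deviation, and the ±4σ plotting range are assumptions.

```python
import math


def fit_gaussian(values):
    """Estimate (mu, sigma) of one feature from the samples of a single class."""
    n = len(values)
    mu = sum(values) / n
    # Population variance (divide by n); dividing by n - 1 is an equally common choice.
    variance = sum((v - mu) ** 2 for v in values) / n
    return mu, math.sqrt(variance)


def gaussian_pdf(x, mu, sigma):
    """Density of the 1D normal distribution with mean mu and std sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def pdf_curve(mu, sigma, n_points=101, span=4.0):
    """Return (xs, ys) sampling the density on [mu - span*sigma, mu + span*sigma]."""
    lo = mu - span * sigma
    step = 2.0 * span * sigma / (n_points - 1)
    xs = [lo + i * step for i in range(n_points)]
    ys = [gaussian_pdf(x, mu, sigma) for x in xs]
    return xs, ys
```

Passing the petal-length column of the Iris Versicolour rows to `fit_gaussian` yields the two requested parameters, and the `(xs, ys)` pairs from `pdf_curve` trace the curve to draw.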
Finish during class - 20%
  Submit your report and source code to the newE3 system before class ends.
  Finish time will be determined by the submission time.
Data
1. Mushroom dataset
Data can be downloaded here: https://archive.ics.uci.edu/ml/datasets/mushroom
Please NOTE that the first column is the label (edible=e, poisonous=p)
Data Set Information
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.
Attribute Information
cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
2. Iris dataset
Data can be downloaded here: https://archive.ics.uci.edu/ml/datasets/iris
Data Set Information
This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. Predicted attribute: class of iris plant. This is an exceedingly simple domain.
Attribute Information
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
  Iris Setosa
  Iris Versicolour
  Iris Virginica
