CA4015 Assignment 2



Code can be viewed here: https://github.com/kiansweeney11/ca4015-second-assignment 
Introduction 
For our second assignment of CA4015: Advanced Machine Learning we were tasked with analysing data related to volatile organic compounds (VOCs). The data set provided contained four sheets, two of which (‘Long’ and ‘Wide’) focused on the VOCs that were successfully identified across all the bacteria and across differing nutritional environments. We will focus only on those two sheets, and most specifically the ‘Wide’ sheet. We were tasked with using an Automated Machine Learning (AutoML) approach to classify our data on an appropriate platform, and then comparing our results to a random forest approach. We also needed to combine this with cross-validation in both the random forest and AutoML approaches to validate our respective results. In my experiments I tried predicting the strain and medium of bacteria using the compound columns as features. I tried varying training to test splits in both my AutoML and random forest approaches. Firstly, I would have to take the dataset and clean it as required.
Data Cleaning 
Initially, I loaded my dataset in Excel to have a quick look at it before processing it in a Jupyter Notebook. The raw sheet is very messy: the column headings appear in the fourth row, and the text in the second row explaining the contents of the sheet is unnecessary, as is typical of an unclean dataset.
We can also see null values in some of our columns. Initially, I interpreted these as meaning no reading of the compound was present, so they could be denoted as a 0 value. However, I changed my approach and "imputed" the data instead, using the scikit-learn (sklearn) SimpleImputer implementation to replace all missing values with the mean of each column. This is regarded as a better approach when the dataset is small (we have only 84 rows, albeit with high dimensionality). It can add bias and variance to the data, but most other approaches appear liable to this as well. I also looked at applying linear regression to predict the missing values, but felt that approach could be complicated to implement across so many columns. We also tidy up our headings and remove the first four null rows from the ‘wide’ sheet; the cleaning code is sketched below.
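A minimal sketch of the cleaning and imputation, with the file name, sheet name and exact row offsets assumed for illustration (they are not given in the report):

    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Load the 'wide' sheet, treating the fourth row as the header so the
    # explanatory text above it is skipped (file name assumed)
    df = pd.read_excel("voc_data.xlsx", sheet_name="wide", header=3)

    # Drop any leading rows that contain no data at all
    df = df.dropna(how="all")

    # Replace missing values in each numeric compound column with that
    # column's mean, as described above
    numeric_cols = df.select_dtypes(include="number").columns
    imputer = SimpleImputer(strategy="mean")
    df[numeric_cols] = imputer.fit_transform(df[numeric_cols])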
This leaves us with our cleaned dataset, and we can get ready to classify it using AutoML and RF. I did actually implement AutoML using the original dataset with the missing values; however, I couldn't apply cross-validation, as the sklearn module I was using for it doesn't support null values. I also adapted my approach to categorical values: I originally used the string values attached to each strain, but changed this to number each value using code along the following lines.
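A sketch of the numeric encoding, assuming the strain labels sit in a column named 'Strain' (the column name is an assumption):

    from sklearn.preprocessing import LabelEncoder

    # Map each strain string to an integer label (column name assumed)
    le = LabelEncoder()
    df["Strain"] = le.fit_transform(df["Strain"])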
Once this was done, I set about implementing my AutoML classification task: predicting the strain given the values of the compounds contained in the bacteria. We would try this at 10:90, 20:80, 25:75, 30:70, 40:60 and 50:50 training to test splits.
AutoML Classification – Strain Prediction 
There are a lot of different AutoML platforms available today, such as Google Cloud AutoML and TPOT amongst many others. I decided to use the mljar-supervised AutoML package in my Jupyter Notebook.
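The package is installed on the command line and imported like so:

    pip install mljar-supervised

    from supervised.automl import AutoML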
This package creates a directory with reports on each AutoML run, and also computes a baseline for our data, assessing whether machine learning is needed at all. It imputes missing values as well, which is useful for the data provided; however, it does so only during the AutoML run, so our underlying dataset never actually receives the imputed values. This is why I initially used my unimputed dataset for AutoML and then re-ran it on a separate file with the imputed data already present, allowing us to compare the results. First, we look at how AutoML performed on the unimputed dataset. We split our data up accordingly: the strain column is the target we are trying to predict, and all the numeric compound columns are our features, as sketched below.
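A minimal sketch of the setup and one AutoML run. The column name 'Strain' and passing the split fraction as the test share are assumptions; the report's tables are consistent with the fraction being the test portion:

    from sklearn.model_selection import train_test_split
    from supervised.automl import AutoML

    # Target: the (numerically encoded) strain column
    y = df["Strain"]
    # Features: every numeric compound column, excluding the target
    X = df.select_dtypes(include="number").drop(columns=["Strain"])

    # One example ratio; this was repeated for each split in the tables below
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    automl = AutoML()  # writes a results directory with per-run reports
    automl.fit(X_train, y_train)
    predictions = automl.predict(X_test)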
Firstly, I'll look at how the AutoML classification performed when imputing the NaN values by itself. The following table summarises the results for each training to test split:
Training to Test Split    Accuracy
0.1                       0.7778
0.2                       0.6471
0.25                      0.6190
0.3                       0.7308
0.4                       0.6471
0.5                       0.6190
We couldn't cross-validate these results due to the presence of nulls in the data, so now I look at the AutoML classification of bacteria strain using our imputed data from earlier. These are our results:
Training to Test Split    Accuracy
0.15                      0.9231
0.2                       0.8235
0.25                      0.8095
0.3                       0.8462
0.4                       0.6471
0.5                       0.6905
As we can see, accuracy with values imputed from each column's mean is significantly higher than before. This time we can also cross-validate our accuracy values. I implemented a couple of different metrics, including cross-validation, to test the accuracy of our results: C-support vector classification (SVC) and sklearn's built-in cross-validation scoring. The latter splits the data, fits a model and computes a validation score four times; we take the mean of these four values as our cross-validation score for each X and y test split. This is the simplest way of implementing cross-validation and the one we use for all our data. SVC works well with smaller datasets, which benefits us here. Support vector machines like this are highly effective on high-dimensional datasets like ours and are also memory efficient, as they use a subset of the training points in the decision function; the SVC implementation's fit time scales at least quadratically with the number of samples. I felt that by combining these two metrics we would get a very clear picture of how accurate our predictions were. A sketch of this evaluation follows; the table after it gives our scores for both:
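A sketch of the combined evaluation, following the description above: SVC scored on the held-out split, and four cross-validation folds averaged over the test portion (the fold count matches the "four times" above; the exact arguments are assumptions):

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # C-support vector classification fitted on the training portion
    svc = SVC()
    svc.fit(X_train, y_train)
    svc_score = svc.score(X_test, y_test)

    # Four-fold cross-validation over the test split; the mean of the four
    # fold scores is taken as the cross-validation score for this split
    cv_mean = cross_val_score(SVC(), X_test, y_test, cv=4).mean()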
Training to Test Split    SVC       Cross-Validation Mean Score
0.15                      0.4615    0.33
0.2                       0.7059    0.24
0.25                      0.7143    0.45
0.3                       0.7692    0.54
0.4                       0.6471    0.59
0.5                       0.5952    0.57
Looking at these scores combined with our accuracy scores from earlier, there appears to be a sweet spot between a 0.25 and 0.3 split. The cross-validation scores increase with the larger splits, but the SVC scores suffer directly as a result. There certainly appears to be a risk of overfitting with splits in excess of 40:60 here. This leads us on to our implementation of random forests, to see how the results contrast with the AutoML approach.
Random Forest Comparison of Bacteria Prediction 
Next, we implement a random forest approach to compare with our results from earlier. Again, we couldn't use our original cleaned data with null values, as the sklearn library we were using doesn't allow them; this meant we had to use the data imputed with column means. We use sklearn.ensemble's RandomForestRegressor to predict the (numerically encoded) strain from the compound columns. This is an example of the code we ran, sketched below.
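A minimal sketch following the report's stated choice of RandomForestRegressor on the encoded strain labels; rounding the continuous predictions back to integer labels for the accuracy comparison is an assumption:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)

    # The regressor outputs continuous values; round to the nearest
    # encoded strain label before computing accuracy (an assumption)
    preds = np.rint(rf.predict(X_test)).astype(int)
    accuracy = (preds == y_test).mean()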
We ran the tests at 10:90, 20:80, 25:75, 30:70, 40:60 and 50:50 splits. Here were our results for accuracy, SVC and cross-validation mean scores:
Training to Test Split    Accuracy    SVC       Cross-Validation Mean Score
0.1                       0.6667      0.6667    0.54
0.2                       0.5294      0.7059    0.64
0.25                      0.5238      0.6667    0.62
0.3                       0.3462      0.6923    0.65
0.4                       0.4706      0.7647    0.61
0.5                       0.4286      0.6190    0.74
As we can see, our accuracy is significantly lower than with our AutoML approach. This is to be expected, as AutoML trains and tests the data on a selected collection of algorithms and picks the best-fitting one. When I ran AutoML on the mean-imputed data it predominantly returned XgBoost as the best-performing algorithm, despite random forest being tested each time. One similarity appears in our random forest implementation: the best split across our three metrics here is around 25:75 (the threshold seems to lie between 0.20 and 0.25, rather than between 0.25 and 0.3 as with AutoML). However, one interesting thing to note is the much higher cross-validation scores across the different data splits. The lowest cross-validation score with our RF approach would be the third-highest among our AutoML cross-validation scores, and all our other RF cross-validation scores exceed the highest AutoML score. While the SVC scores are at a similar level, the difference in cross-validation scores is stark. This suggests that our RF approach might generalise better to unseen values than our AutoML approach, and that our AutoML accuracy scores are potentially inflated, with the data suffering from overfitting. It would be interesting to implement our own version of XgBoost and see whether its cross-validation scores align with the poor AutoML cross-validation scores, as XgBoost was regularly reported as the best-performing algorithm.
AutoML for Medium Prediction 
Next, we try to predict the bacteria medium from our data. The medium is denoted in the samples column as part of the string, as either ‘LB’, ‘TSB’ or ‘BHI’. I ran code along the lines sketched below to extract this value from the data and add it to our original data frame.
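A sketch of the extraction, assuming the sample identifiers sit in a column named 'Sample' with the medium name embedded in the string (both the column name and the string format are assumptions):

    # Pull the medium name out of each sample string and attach it
    # to the data frame as a new column
    df["Medium"] = df["Sample"].str.extract(r"(LB|TSB|BHI)", expand=False)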
I followed the same approach for the AutoML classification as before, the only difference being that I used sklearn's built-in accuracy_score function to calculate the model's accuracy instead of my own. We ran AutoML at 15:85, 20:80, 25:75, 30:70, 40:60 and 50:50 training to test splits. Here were our results:
Training to Test Split    Accuracy    SVC       Cross-Validation Mean Score
0.15                      1.0000      0.7692    0.38
0.2                       0.9412      0.8824    0.53
0.25                      0.9524      0.8571    0.67
0.3                       0.9615      0.8077    0.57
0.4                       0.7941      0.7941    0.70
0.5                       0.8095      0.7143    0.76
Here we see remarkably high accuracy scores at the lower end of the training to test splits; the 15:85 split even gives a flawless prediction result. However, these lower-ratio splits also have significantly lower cross-validation scores, a key metric for how the model would respond to unseen data. The model appears strongest at around a 40:60 split or slightly more. It does obtain a reasonably good cross-validation score of 0.67 at the 25:75 split, although this drops noticeably again at 30:70. Across the board here we undoubtedly have the highest accuracy, SVC and cross-validation scores of any of our tests thus far. I feel this could be due to the small number of distinct values in our medium column: there are only three potential values, all with a very similar frequency distribution. A quick check of this is sketched below.
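The frequency check can be sketched as:

    # How many samples fall into each medium (roughly balanced here)
    print(df["Medium"].value_counts())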
This would make prediction of the values a lot easier, as certain mediums might be associated with certain compounds to a larger extent, and the algorithms tested by AutoML might learn, for example, that if a compound is above a certain value the sample belongs to a particular medium. The best AutoML models for our different splits were XgBoost for the 15:85 and 20:80 splits and an ensemble for the rest. We now look at our random forest classifier results for the same training to test splits. These are the accuracy, SVC and cross-validation scores for this model:
Training to Test Split    Accuracy    SVC       Cross-Validation Mean Score
0.15                      0.8462      1.0000    0.54
0.2                       0.8824      1.0000    0.61
0.25                      0.9048      1.0000    0.71
0.3                       0.8846      0.9615    0.57
0.4                       0.9118      0.9412    0.60
0.5                       0.9286      0.8810    0.74
Although our accuracy values are not as high as with our AutoML approach, we still obtain a high degree of accuracy along with higher cross-validation scores than before. We also see very high SVC scores here, achieving perfect scores three times. The sweet spot for the training to test split is probably about 25:75; although we have a high value at 50:50, there is a danger of overfitting the model with a higher training to test split. It is interesting to note that despite the AutoML output telling us the best-performing algorithms are an ensemble and XgBoost, there would easily be a case for a random forest classifier here: with the correct training to test split, and with cross-validation scores that are strong across the different splits when compared with AutoML's, the random forest algorithm performs strongly.
Limitations and Future Work 
One limitation of my results was the reliance on imputed data. Because the chosen platforms (mljar-supervised AutoML and the sklearn libraries used for random forests and cross-validation scores) could not handle null values, imputation was necessary, and there were a variety of approaches we could have selected: multiple imputation by chained equations, k-nearest-neighbour imputation, or imputation with the mean, median or 0. I felt the best approach was to use the mean, but each of the aforementioned approaches has its pros and cons, with some better suited to certain datasets than others. The fact that we did not know what the null values meant led me to take this approach, even though they could easily have denoted a value of 0, as mentioned earlier. For future work we could look at other prediction tasks, such as using the ‘long’ sheet to predict the chemical class a bacterium belonged to, and try other AutoML platforms such as TPOT.
