- Exploratory Analysis
- Solutions and Insights
- Conclusion and Next Steps

Problem Statement
Can we predict, from certain variables, whether a certified asteroid will be hazardous or not?

Warning!!!

About the Dataset
- 90,836 rows
- 10 columns
- 6 variables

Exploratory Data Analysis

Variable Analysis & Cleaning
● Estimated diameter min and estimated diameter max are correlated, so one of them will be removed.
● Orbiting body and sentry object will be removed.
● Columns like name and id will also be removed.

Correlation between the Variables of Interest

Absolute Magnitude
A measure of an asteroid's intrinsic brightness (luminosity), used to estimate its approximate diameter.

Estimated Diameter
- About 75% of all hazardous asteroids fall in the range of 0.25 to 1 kilometer, even though that range accounts for only ⅙ of the total number of asteroids.
- Asteroids smaller than 0.25 kilometers were hazardous less than 1% of the time, while larger ones were hazardous 40% of the time.
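As a rough illustration of how these shares might be computed, a minimal pandas sketch follows. It is a sketch only: the file name neo.csv and the exact column names est_diameter_max and hazardous are assumptions based on the slides, not the authors' actual code.

```python
import pandas as pd

# Hypothetical file name and column names, inferred from the slides.
df = pd.read_csv("neo.csv")

# Bucket asteroids by estimated maximum diameter (km).
bins = [0.0, 0.25, 1.0, float(df["est_diameter_max"].max())]
labels = ["< 0.25 km", "0.25-1 km", "> 1 km"]
df["size_bucket"] = pd.cut(df["est_diameter_max"], bins=bins, labels=labels)

# Hazard rate within each size bucket (e.g. ~40% above 0.25 km per the slides).
print(df.groupby("size_bucket", observed=True)["hazardous"].mean())

# Share of all asteroids falling into each bucket (~1/6 in 0.25-1 km).
print(df["size_bucket"].value_counts(normalize=True))
```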
Solutions and Insights

Naive Bayes
Qcut: Estimated Diameters
Most of the asteroids are relatively small, but the range of diameters is large, so we binned the estimated diameters into quantiles with qcut. The resulting feature importance actually fits our estimation.
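A minimal sketch of the qcut step, continuing from the DataFrame above. The bin count of 10 is an assumed choice; the slides do not state it.

```python
import pandas as pd

df = pd.read_csv("neo.csv")  # hypothetical path, as above

# Quantile-based bins: equal-frequency intervals keep the many small
# diameters from collapsing into a single wide bin. q=10 is an assumption.
df["diameter_bin"] = pd.qcut(df["est_diameter_max"], q=10,
                             labels=False, duplicates="drop")
print(df["diameter_bin"].value_counts().sort_index())
```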
Naive Bayes
Likelihoods → Importance

Accuracy:
1. Prior probability of "no hazard" (baseline): 90.15%
2. Naive Bayes:
   a. Training data: 89.77%
   b. Testing data: 89.46%
F1 Score:
1. Precision score: 27.9%
2. Recall score: 3.6%
3. F1 Score:
   a. Training data: 6.0%
   b. Testing data: 6.4%
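A minimal sketch of the Naive Bayes fit behind these numbers, assuming a Gaussian Naive Bayes on the cleaned numeric features and an 80/20 split; the exact feature set, split, and file name are assumptions, not stated in the slides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

df = pd.read_csv("neo.csv")  # hypothetical path
features = ["absolute_magnitude", "est_diameter_max",
            "relative_velocity", "miss_distance"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["hazardous"], test_size=0.2, random_state=42)

nb = GaussianNB().fit(X_train, y_train)
pred = nb.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1       :", f1_score(y_test, pred))
```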
Logistic Regression

Accuracy:
- Baseline: 0.9024
- Prediction: 0.9033
Variable Importance:
- Most significant: est_diameter_min, est_diameter_max
- Least significant: relative_velocity
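A sketch of a comparable logistic regression, with standardized inputs so that coefficient magnitudes can serve as a rough importance ranking. The feature list, split, and file name are assumptions based on the slides.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("neo.csv")  # hypothetical path
features = ["est_diameter_min", "est_diameter_max",
            "relative_velocity", "miss_distance", "absolute_magnitude"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["hazardous"], test_size=0.2, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# With standardized inputs, |coefficient| is a rough proxy for importance.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in sorted(zip(features, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:20s} {coef:+.3f}")
```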
K-Nearest Neighbors

Baseline: 0.9024
Prediction: 0.9031
F-Score = 0.0006
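A minimal sketch of a KNN classifier consistent with these numbers. The value n_neighbors=5 (sklearn's default) and the feature set are assumptions; the slides do not state them.

```python
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("neo.csv")  # hypothetical path
features = ["absolute_magnitude", "est_diameter_max",
            "relative_velocity", "miss_distance"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["hazardous"], test_size=0.2, random_state=42)

# Scaling matters for distance-based models; n_neighbors=5 is the default.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print("accuracy:", knn.score(X_test, y_test))
print("f1      :", f1_score(y_test, pred))
```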
Trees and Ensemble Methods

• First, we tried fitting decision trees and other ensemble classifiers (Bagging, Random Forest, and Gradient Boosting) without tuning the parameters much.
• The metrics obtained via these models are shown on the right.
• Clearly, the large decision tree was overfitting, with a training accuracy of 100% and a test accuracy of 89.4% (even lower than the baseline accuracy).
• Bagging and Random Forest were also overfitting, with much higher accuracy on the training data than on the test data.
• Boosting seemed to perform the best on the test set.
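A sketch of this first untuned pass, using default parameters for each classifier. The data loading and feature set are the same assumptions as in the earlier sketches.

```python
import pandas as pd
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("neo.csv")  # hypothetical path
features = ["absolute_magnitude", "est_diameter_max",
            "relative_velocity", "miss_distance"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["hazardous"], test_size=0.2, random_state=42)

# Default (untuned) parameters, mirroring the first pass described above.
models = {
    "decision tree":     DecisionTreeClassifier(random_state=42),
    "bagging":           BaggingClassifier(random_state=42),
    "random forest":     RandomForestClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:18s} train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
```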
Trees and Ensemble Methods

• Next, we will try to tune the parameters for Gradient Boosting.
• Varying the number of estimators across 100, 200, 300, 400, and 500 gives the first plot on the right. The differences are minuscule; we take the number of trees = 500.
• Varying max_depth from 1 to 11 gives the second plot on the right. The test accuracy seems to peak at max_depth = 9.
• Finally, we try to find the best number of predictors to use for the random forest. Varying the number of predictors from 1 to 4 gives the third plot on the right, so we take mtry = 4 for our random forest model.
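A sketch of these parameter sweeps in scikit-learn, under the same data assumptions as before; mtry corresponds to max_features in sklearn's RandomForestClassifier.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("neo.csv")  # hypothetical path
features = ["absolute_magnitude", "est_diameter_max",
            "relative_velocity", "miss_distance"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["hazardous"], test_size=0.2, random_state=42)

# Sweep the number of boosting stages (differences were minuscule).
for n in [100, 200, 300, 400, 500]:
    gb = GradientBoostingClassifier(n_estimators=n,
                                    random_state=42).fit(X_train, y_train)
    print(f"n_estimators={n}: test acc={gb.score(X_test, y_test):.4f}")

# Sweep tree depth at the chosen number of stages (peak near max_depth=9).
for depth in range(1, 12):
    gb = GradientBoostingClassifier(n_estimators=500, max_depth=depth,
                                    random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: test acc={gb.score(X_test, y_test):.4f}")

# Sweep the number of predictors per split for the random forest (mtry).
for m in range(1, 5):
    rf = RandomForestClassifier(max_features=m,
                                random_state=42).fit(X_train, y_train)
    print(f"max_features={m}: test acc={rf.score(X_test, y_test):.4f}")
```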
Trees and Ensemble Methods

• Running our models with the parameters chosen on the previous slide gives the metrics in the table on the right.
• Clearly, Gradient Boosting performs the best, with a test accuracy of 92.02% (compared to the 90.2% baseline accuracy). The F1 score obtained is 0.48.
• From the confusion matrix on the right, precision = 0.65 and recall = 0.38, which is definitely an improvement over the baseline model.
• Plotting the feature importances for the Gradient Boosted Decision Tree model gives the third figure on the right.
• Clearly, nearly all the variables have similar importances, with no single dominant variable. Among the 4 variables, miss_distance seems to be the most important feature.
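A sketch of the final tuned model and the metrics reported above, under the same assumed data loading; the tuned values come from the previous slide.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

df = pd.read_csv("neo.csv")  # hypothetical path
features = ["absolute_magnitude", "est_diameter_max",
            "relative_velocity", "miss_distance"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["hazardous"], test_size=0.2, random_state=42)

# Final model with the tuned parameters from the previous slide.
gb = GradientBoostingClassifier(n_estimators=500, max_depth=9,
                                random_state=42).fit(X_train, y_train)
pred = gb.predict(X_test)

print(confusion_matrix(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1       :", f1_score(y_test, pred))

# Per-feature importances for the fitted model.
for name, imp in zip(features, gb.feature_importances_):
    print(f"{name:20s} {imp:.3f}")
```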
Conclusion and Next Steps

Conclusion
1. The models we used do not seem to be especially good at predicting whether an NEO will be hazardous or not; we achieved only modest gains over the baseline accuracy.
2. Only 4 variables are used to predict whether an NEO is hazardous or not (miss_distance, absolute_magnitude, est_diameter_max, and relative_velocity).
3. None of the 4 selected variables seems to heavily impact the target variable. All 4 have similar feature importance values in the Gradient Boosted Tree model, and the other classification models likewise showed little variation in the relative importance of features.

Conclusion
1. Given the point above, the findings of Point 1 are to be expected, since the four variables do not seem to be strong predictors. Our best model obtained a test accuracy of 92.015% and an F1 score of around 0.48.
2. miss_distance seems to be the most important feature according to our best model (the Gradient Boosted Decision Tree). However, according to Logistic Regression it is est_diameter_max, and according to Naive Bayes it is absolute_magnitude. This variance probably suggests that no strong relationship exists and that all of the variables are nearly similar in importance.

Next Steps
• Find a larger dataset with a more even distribution of the dependent variable
• Use more variables
• Limitation: many asteroids were listed multiple times

Questions?