NASA Near-Earth Objects - Solved

Agenda
- Exploratory Analysis
- Solutions and Insights
- Conclusion and Next Steps

Problem Statement
Can we predict, from certain variables, whether a certified asteroid will be hazardous or not?

Warning!!!

About the Dataset
- 90,836 rows
- 10 columns
- 6 variables

Exploratory Data Analysis

Variable Analysis & Cleaning
● Estimated diameter min and estimated diameter max are highly correlated, so one of them will be removed.
● Orbiting body and sentry object will be removed.
● Columns like name and id will also be removed.

Correlation between the Variables of Interest

Absolute Magnitude
A measure of luminosity used to estimate the approximate diameter of an asteroid.

Estimated Diameter
- About 75% of all hazardous asteroids fall in the 0.25 to 1 kilometer range, even though that range accounts for only about 1/6 of all asteroids.
- Asteroids smaller than 0.25 kilometers were hazardous less than 1% of the time, while larger ones were hazardous 40% of the time.

Solutions and Insights

Naive Bayes
Qcut: Estimated Diameters
Most of the diameter values are relatively small, but the range is large! Binning with qcut gives feature importances that match our expectations.

Naive Bayes
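The slides don't include code, but the qcut binning and Naive Bayes fit can be sketched as follows. This assumes pandas and scikit-learn; since the NEO CSV isn't bundled here, synthetic stand-in data with the same skewed shape is used, so the printed numbers will not match the slide.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the NEO data: most diameters are small,
# but the range is large (heavy right tail), as noted above.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "est_diameter_max": rng.lognormal(mean=-1.5, sigma=1.0, size=n),
    "hazardous": rng.random(n) < 0.1,   # ~10% positives, matching the skew
})

# qcut puts roughly equal counts in each bin, so the many small
# diameters are not crushed into one bin the way equal-width cut() would.
df["diam_bin"] = pd.qcut(df["est_diameter_max"], q=10, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    df[["diam_bin"]], df["hazardous"], test_size=0.2, random_state=0)

nb = CategoricalNB().fit(X_train, y_train)
print(f"train acc: {nb.score(X_train, y_train):.3f}")
print(f"test  acc: {nb.score(X_test, y_test):.3f}")
```

With a 90/10 class split, accuracy near 90% mostly reflects the prior, which is why the deck also reports F1.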
Likelihoods → Importance

Accuracy:
1. Prior probability for No Hazard: 90.15%
2. Naive Bayes:
   a. Training data: 89.77%
   b. Testing data: 89.46%

F1 Score:
1. Precision: 27.9%
2. Recall: 3.6%
3. F1 Score:
   a. Training data: 6.0%
   b. Testing data: 6.4%

Logistic Regression
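A minimal sketch of the logistic regression fit and the coefficient-based variable ranking (scikit-learn assumed; the data is a synthetic stand-in with illustrative column names, so the ranking here is arbitrary rather than the slide's result):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n = 5000
# Synthetic stand-ins for the four predictors used in the deck.
X = np.column_stack([
    rng.lognormal(size=n),   # est_diameter_min
    rng.lognormal(size=n),   # est_diameter_max
    rng.normal(size=n),      # relative_velocity
    rng.normal(size=n),      # miss_distance
])
names = ["est_diameter_min", "est_diameter_max",
         "relative_velocity", "miss_distance"]
y = rng.random(n) < 0.1      # ~10% hazardous

# Standardize first so coefficient magnitudes are comparable
# across features and can serve as a rough importance proxy.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_[0]
for name, c in sorted(zip(names, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:18s} |coef| = {abs(c):.3f}")
```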
Accuracy:
- Baseline: 0.9024
- Prediction: 0.9033

Variable Importance:
- Most significant: est_diameter_min, est_diameter_max
- Least significant: relative_velocity

K-Nearest Neighbors
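A sketch of the KNN fit against the majority-class baseline (scikit-learn assumed; synthetic stand-in data, so the printed values only illustrate the pattern of the slide, not its exact numbers):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 4))
y = rng.random(n) < 0.1            # heavily imbalanced target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
baseline = 1 - y_te.mean()         # always predict "not hazardous"
pred = knn.predict(X_te)

print(f"baseline acc: {baseline:.4f}")
print(f"knn acc:      {knn.score(X_te, y_te):.4f}")
# With a ~90/10 class split, accuracy barely moves off the baseline
# and minority-class F1 collapses toward zero -- the slide's 0.0006.
print(f"F1:           {f1_score(y_te, pred, zero_division=0):.4f}")
```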
Baseline: 0.9024
Prediction: 0.9031
F1 Score = 0.0006

Trees and Ensemble Methods
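A minimal sketch of the untuned fits this slide describes (scikit-learn assumed; synthetic stand-in data, so only the overfitting pattern, not the deck's exact metrics, is reproduced):

```python
import numpy as np
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 4))
y = rng.random(n) < 0.1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default parameters everywhere, as on this slide.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    # A large train/test gap flags overfitting; the unpruned tree
    # memorizes the training set and hits 100% train accuracy.
    print(f"{name:18s} train={m.score(X_tr, y_tr):.3f} "
          f"test={m.score(X_te, y_te):.3f}")
```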
• First, we tried fitting a decision tree and other ensemble classifiers (Bagging, Random Forest, and Gradient Boosting) without tuning the parameters.
• The metrics obtained from these models are shown on the right.
• The full decision tree was clearly overfitting, with a training accuracy of 100% and a test accuracy of 89.4% (lower even than the baseline accuracy).
• Bagging and Random Forest were also overfitting, with much higher accuracy on the training set than on the test set.
• Boosting performed best on the test set.

Trees and Ensemble Methods
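The parameter sweeps described on this slide can be sketched as simple loops that record test accuracy per setting (scikit-learn assumed; synthetic stand-in data and smaller grids to keep the sketch fast, so the values won't match the plotted curves):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 4))
y = rng.random(2000) < 0.1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Sweep one hyperparameter at a time, mirroring the curves on the
# slides: number of estimators first, then tree depth.
acc_by_n_est = {}
for n_est in [100, 200, 300]:
    gb = GradientBoostingClassifier(n_estimators=n_est, random_state=0)
    acc_by_n_est[n_est] = gb.fit(X_tr, y_tr).score(X_te, y_te)

acc_by_depth = {}
for depth in [1, 3, 5, 9]:
    gb = GradientBoostingClassifier(max_depth=depth, random_state=0)
    acc_by_depth[depth] = gb.fit(X_tr, y_tr).score(X_te, y_te)

print(acc_by_n_est)
print(acc_by_depth)
```

Plotting each dictionary gives curves like those on the right; the deck picks the setting where test accuracy flattens or peaks.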
• Next, we tune the parameters for Gradient Boosting.
• Varying the number of estimators across 100, 200, 300, 400, and 500 gives the first plot on the right. The differences are minuscule; we take the number of trees = 500.
• Varying max_depth from 1 to 11 gives the second plot on the right. Test accuracy appears to peak at max_depth = 9.
• Finally, we look for the best number of predictors for the random forest. Varying it from 1 to 4 gives the third plot on the right, so we take mtry = 4 for our random forest model.

Trees and Ensemble Methods
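With the tuned parameters fixed, the final Gradient Boosting fit and its metrics can be sketched as follows (scikit-learn assumed; synthetic stand-in data and illustrative column names, so the printed confusion matrix and scores will not match the slide's 92.02% / 0.48):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 4))
names = ["est_diameter_max", "relative_velocity",
         "miss_distance", "absolute_magnitude"]
y = rng.random(2000) < 0.1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Tuned settings from the previous slide: 500 trees, max_depth = 9.
gb = GradientBoostingClassifier(n_estimators=500, max_depth=9,
                                random_state=0).fit(X_tr, y_tr)
pred = gb.predict(X_te)

print(confusion_matrix(y_te, pred))
print(f"precision={precision_score(y_te, pred, zero_division=0):.2f} "
      f"recall={recall_score(y_te, pred, zero_division=0):.2f} "
      f"F1={f1_score(y_te, pred, zero_division=0):.2f}")

# Importances sum to 1; similar values across the four variables
# would mean no single dominant predictor, as the deck observes.
for name, imp in zip(names, gb.feature_importances_):
    print(f"{name:20s} {imp:.3f}")
```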
• Running our models with the parameters chosen on the previous slide gives the metrics in the table on the right.
• Gradient Boosting clearly performs best, with a test accuracy of 92.02% (versus the 90.2% baseline accuracy). The F1 score obtained is 0.48.
• From the confusion matrix on the right, precision = 0.65 and recall = 0.38, a definite improvement over the baseline model.
• Plotting the feature importances of the Gradient Boosted Decision Tree model gives the third figure on the right.
• Nearly all the variables have similar importances, with no single dominant variable. Among the 4 variables, miss_distance appears to be the most important feature.

Conclusion and Next Steps

Conclusion
1. The models we used do not seem to be especially good at predicting whether an NEO will be hazardous. We achieved only modest gains over the baseline accuracy.
2. Only 4 variables are used to predict whether an NEO is hazardous (miss_distance, absolute_magnitude, est_diameter_max, and relative_velocity).
3. None of the 4 selected variables seems to heavily impact the target variable. All 4 have similar feature-importance values in the Gradient Boosted Tree model, and the other classification models likewise showed little variation in relative feature importance.

Conclusion
1. Given the previous points, the findings of Point 1 are to be expected, as the four variables do not seem to be strong predictors. Our best model obtained a test accuracy of 92.015% and an F1 score of around 0.48.
2. miss_distance seems to be the most important feature according to our best model (the Gradient Boosted Decision Tree). However, Logistic Regression ranks est_diameter_max highest, and Naive Bayes ranks absolute_magnitude highest. This variance probably suggests that no strong relationship exists and that all the variables are of roughly similar importance.

Next Steps
• Find a larger dataset with a more even distribution of the dependent variable
• Use more variables
• Limitation: many asteroids were listed multiple times

Questions?
