EEC DNS Data Classification

Part I (Static Model):
1. Data Analysis:
• First five rows of the static dataset: the dataset has 268,074 rows and 16 columns.
• Distribution of the dataset:
➔ The summary statistics show the count, mean, standard deviation, minimum, the 25%, 50% (median), and 75% quartiles, and the maximum value of each column.
• The number of values in each class, to check whether the data is balanced or imbalanced:
The dataset is reasonably balanced, as the class counts do not differ much: 55% of the data is class 1 and 45% is class 0, a difference of only 10%.
• Each feature's distribution and skew pattern:
The features have different skew patterns; for example, the labels_average column is right skewed.
• The skew of every feature:
Some features are right skewed and others are left skewed.
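A minimal sketch of these inspection steps with pandas; the CSV file name and the target column name ("Label") are placeholders, not names from the original report:

    import pandas as pd

    # Load the static dataset (file name is an assumption for illustration)
    df = pd.read_csv("static_dataset.csv")

    print(df.head())       # first five rows
    print(df.shape)        # (268074, 16)
    print(df.describe())   # count, mean, std, min, 25%/50%/75% quartiles, max
    # Class balance: roughly 55% class 1 vs 45% class 0 in this dataset
    print(df["Label"].value_counts(normalize=True))
    # Skew of every numeric feature (positive = right skewed, negative = left skewed)
    print(df.skew(numeric_only=True))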
2. Feature engineering and data cleaning: 
• Check for Null values: 
There are 8 missing values in longest_word; this is a very small number, so I removed those rows.
• Check for categorical columns:
The timestamp, longest_word, and sld columns contain categorical values, and we will need to encode them before feeding them to the model.
• Encoding categorical features:
I used label encoding to convert the categorical values into numerical values.
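A minimal sketch of the cleaning and encoding steps, continuing from the same DataFrame df; the exact column names and capitalization are assumptions:

    from sklearn.preprocessing import LabelEncoder

    # Drop the 8 rows with a missing longest_word value
    df = df.dropna(subset=["longest_word"])

    # Label-encode the categorical columns so the model can consume them
    for col in ["timestamp", "longest_word", "sld"]:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))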
3. Feature Filtering/Selection: The dataset has 16 features; we want to reduce them to make the model more general and faster and to eliminate redundant information.
I select independent features based on:
• High correlation with the target variable,
• High information gain (mutual information) between the independent variable and the target,
• The ANOVA F-test.
• Finding the correlation of the independent predictors with the target variable:
➔ I used this approach to filter out features that have little relation to the target column, setting a threshold on the correlation: I only keep a feature if its correlation with the target is greater than 0.2.
➔ The features I decided to keep from this approach are:
FQDN_count, subdomain_length, lower, numeric, special, labels, longest_word, sld, subdomain.
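A sketch of the correlation filter, assuming the encoded DataFrame df from above and the placeholder target column "Label"; thresholding the absolute correlation at 0.2 is an assumption about how the rule was applied:

    # Pearson correlation of each feature with the target
    corr = df.corr(numeric_only=True)["Label"].drop("Label")

    # Keep only features whose absolute correlation with the target exceeds 0.2
    selected = corr[corr.abs() > 0.2].index.tolist()
    print(selected)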
➔ Mutual Information (Information Gain): this approach agrees with removing upper and timestamp.
➔ ANOVA F-test feature selection: by plotting the score of each feature, I am confident in removing the entropy, labels_max, labels_average, upper, len, and timestamp columns.
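A sketch of the mutual-information and ANOVA F-test scoring with scikit-learn, under the same assumed names; low-scoring features under either criterion are candidates for removal:

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif, f_classif

    X = df.drop(columns=["Label"])
    y = df["Label"]

    # Mutual information (information gain) between each feature and the target
    mi = pd.Series(mutual_info_classif(X, y), index=X.columns).sort_values()

    # ANOVA F-test score of each feature
    f_stat, _ = f_classif(X, y)
    f_scores = pd.Series(f_stat, index=X.columns).sort_values()

    print(mi)        # lowest mutual information first
    print(f_scores)  # lowest F score first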
 
 4. Model Training: 
I had already split the data before feature selection, but I need to split it again afterwards.
I split the dataset into the target and the features, then used train_test_split from sklearn to split the data into train and test sets, with 25% of the data for testing and the rest for training, and stratified the split.
I normalized the training set and then the test set using Normalizer from sklearn.
I used two learning algorithms and examined their performance before and after data normalization.
The model with the best accuracy is Logistic Regression before normalization, but accuracy is not the best metric for evaluating our models even though the data is balanced; it only gives us an intuition about how well both positive and negative samples were classified. In this problem I want to make sure that as few attacks as possible are classified as normal actions (i.e., high recall), but I also do not want every action to be classified as an attack, so I care about precision as well. The F1 score is a good metric for balancing precision and recall, and the Logistic Regression model without normalized data is the winner here.
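A sketch of the training setup described above: a stratified 75/25 split, a Normalizer fitted on the training set, and a comparison of Logistic Regression and Naïve Bayes by F1 score; the selected feature list and variable names carry over from the sketches above and are assumptions:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import Normalizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import f1_score

    X_train, X_test, y_train, y_test = train_test_split(
        X[selected], y, test_size=0.25, stratify=y, random_state=42)

    # Fit the normalizer on the training set, then apply it to both sets
    norm = Normalizer().fit(X_train)
    X_train_n, X_test_n = norm.transform(X_train), norm.transform(X_test)

    candidates = [
        ("LR without norm", LogisticRegression(max_iter=1000), X_train, X_test),
        ("LR with norm",    LogisticRegression(max_iter=1000), X_train_n, X_test_n),
        ("NB without norm", GaussianNB(), X_train, X_test),
        ("NB with norm",    GaussianNB(), X_train_n, X_test_n),
    ]
    for name, clf, X_tr, X_te in candidates:
        clf.fit(X_tr, y_train)
        print(name, "F1 (class 1):", f1_score(y_test, clf.predict(X_te)))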
5. Model evaluation: 
• AUC-ROC:
- This measures the ability of the classifier to distinguish between the classes.
- The ROC AUC is 0.81, which is a good score for our problem.
- A higher value on the y-axis indicates that TP is higher than FN.
• Precision-Recall curve:
- A good classifier maintains both high precision and high recall, which we can see in the PR graph.
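A sketch of the AUC-ROC and precision-recall evaluation for the chosen model (Logistic Regression on the un-normalized features), using scikit-learn and matplotlib and the same assumed variable names:

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_prob = lr.predict_proba(X_test)[:, 1]   # probability of the positive class

    print("AUC-ROC:", roc_auc_score(y_test, y_prob))   # about 0.81 in this report

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
    plt.title("ROC curve")
    plt.show()

    prec, rec, _ = precision_recall_curve(y_test, y_prob)
    plt.plot(rec, prec)
    plt.xlabel("Recall"); plt.ylabel("Precision")
    plt.title("Precision-Recall curve")
    plt.show()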
Results of both models before and after normalization (per-class scores given as class 1 / class 0):

Metric              LR without Norm   LR with Norm   NB without Norm   NB with Norm
Accuracy            82.44             81.56          80.30             80.30
F1-score (1 / 0)    86 / 76           85 / 75        84 / 74           84 / 74
Precision (1 / 0)   76 / 99           76 / 95        76 / 90           76 / 90
Recall (1 / 0)      100 / 62          97 / 63        94 / 64           94 / 64
Micro avg           81                80             79                79

• Confusion matrix: it shows the combinations of predicted and actual values.
We can see that the model almost never misclassifies class 1 as class 0 (only 139 such points), while 11,623 points of class 0 were classified as class 1.
➔ At the end, I saved the LR model to use it in the dynamic part.
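A sketch of the confusion matrix computation and of persisting the chosen model for the dynamic part; the output file name is a placeholder:

    import joblib
    from sklearn.metrics import confusion_matrix

    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_test, lr.predict(X_test)))

    # Save the Logistic Regression model for reuse in the dynamic part
    joblib.dump(lr, "lr_static_model.joblib")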
Part II (Dynamic Model): 
1. I loaded the model from the static part and then preprocessed “Kafka_dataset.csv” in the same way as before.
2. I ran the consumer code and validated that the data stream was being received.
3. I appended 1,000 streamed observations to form each window.
4. I created a function to preprocess the streaming data in the same way as the static data.
5. I created a pipeline with the same model architecture as in the static part.
6. The dynamic part processes 250 windows; as the data streams in, it computes the accuracy and F1 score for each window, compares the F1 score against the threshold (F1 = 85%), and retrains the model if it falls below it (a sketch of this loop follows the list).
7. I evaluated the F1 score (the metric chosen in the previous part) for the model on each window and plotted F1 vs. iteration.
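A sketch, under assumed names, of the windowed evaluate-and-retrain loop in steps 3-7: the Kafka topic, the consumer configuration, the preprocess() helper, and the target column name are placeholders, and kafka-python is assumed as the client library:

    import json
    import joblib
    import pandas as pd
    from kafka import KafkaConsumer
    from sklearn.metrics import f1_score

    model = joblib.load("lr_static_model.joblib")
    consumer = KafkaConsumer(
        "dns_stream",                                   # topic name is an assumption
        value_deserializer=lambda m: json.loads(m.decode("utf-8")))

    WINDOW_SIZE, F1_THRESHOLD, MAX_WINDOWS = 1000, 0.85, 250
    buffer, f1_history = [], []

    for message in consumer:
        buffer.append(message.value)
        if len(buffer) < WINDOW_SIZE:
            continue
        window = preprocess(pd.DataFrame(buffer))       # same preprocessing as the static part
        buffer = []
        X_w, y_w = window.drop(columns=["Label"]), window["Label"]
        f1 = f1_score(y_w, model.predict(X_w))
        f1_history.append(f1)
        if f1 < F1_THRESHOLD:                           # retrain when F1 drops below 85%
            model.fit(X_w, y_w)
        if len(f1_history) == MAX_WINDOWS:              # stop after 250 windows
            break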
These are the results of the last window generated. Since we care about the F1 score, we can see that it does not change much; however, these small changes can be important depending on the use case and the user requirements, so whether the dynamic implementation is better depends on the situation.
