EEC DNS Data Classification

Part I (Static Model):
1. Data Analysis:
• First five rows of the static dataset: the dataset has 268,074 rows and 16 columns.
• Distribution of the dataset:
➔ The summary statistics show the count, mean, standard deviation, minimum, the 25%, 50% (median), and 75% quartiles, and the maximum value of each column.
• The number of values in each class, to check whether the data is balanced or imbalanced:
The dataset is reasonably balanced, as the class counts do not differ much: 55% of the data is class 1 and 45% is class 0, a difference of only 10%.
• Each feature's distribution and skew pattern:
The features have different skew patterns; for example, the labels_average column is right skewed.
• The skew of every feature:
Some features are right skewed and others are left skewed.
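A minimal sketch of these inspection steps with pandas; the CSV file name and the target column name ("Label") are placeholders, not names from the original report:

    import pandas as pd

    # Load the static dataset (file name is an assumption for illustration)
    df = pd.read_csv("static_dataset.csv")

    print(df.head())       # first five rows
    print(df.shape)        # (268074, 16)
    print(df.describe())   # count, mean, std, min, 25%/50%/75% quartiles, max
    # Class balance: roughly 55% class 1 vs 45% class 0 in this dataset
    print(df["Label"].value_counts(normalize=True))
    # Skew of every numeric feature (positive = right skewed, negative = left skewed)
    print(df.skew(numeric_only=True))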
2. Feature engineering and data cleaning: 
• Check for Null values: 
There are 8 missing values in longest_word; this is a very small number, so I removed those rows.
• Check for categorical columns:
The timestamp, longest_word, and sld columns contain categorical values, and we will need to encode them before feeding them to the model.
• Encoding categorical features:
I used label encoding to convert the categorical values into numerical values.
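A minimal sketch of the cleaning and encoding steps, continuing from the same DataFrame df; the exact column names and capitalization are assumptions:

    from sklearn.preprocessing import LabelEncoder

    # Drop the 8 rows with a missing longest_word value
    df = df.dropna(subset=["longest_word"])

    # Label-encode the categorical columns so the model can consume them
    for col in ["timestamp", "longest_word", "sld"]:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))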
3. Feature Filtering/Selection: The dataset has 16 features; we want to reduce them to make the model more general and faster and to eliminate redundant information.
I select independent features based on:
• High correlation with the target variable,
• High information gain (mutual information) between the independent variable and the target,
• The ANOVA F-test.
• Finding the correlation of the independent predictors with the target variable:
➔ I used this approach to filter out features that have little relation to the target column, setting a threshold on the correlation: I only keep a feature if its correlation with the target is greater than 0.2.
➔ The features I decided to keep from this approach are:
FQDN_count, subdomain_length, lower, numeric, special, labels, longest_word, sld, subdomain.
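A sketch of the correlation filter, assuming the encoded DataFrame df from above and the placeholder target column "Label"; thresholding the absolute correlation at 0.2 is an assumption about how the rule was applied:

    # Pearson correlation of each feature with the target
    corr = df.corr(numeric_only=True)["Label"].drop("Label")

    # Keep only features whose absolute correlation with the target exceeds 0.2
    selected = corr[corr.abs() > 0.2].index.tolist()
    print(selected)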
➔ Mutual Information (Information Gain): this approach agrees with removing upper and timestamp.
➔ ANOVA F-test feature selection: by plotting the score of each feature, I am confident in removing the entropy, labels_max, labels_average, upper, len, and timestamp columns.
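A sketch of the mutual-information and ANOVA F-test scoring with scikit-learn, under the same assumed names; low-scoring features under either criterion are candidates for removal:

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif, f_classif

    X = df.drop(columns=["Label"])
    y = df["Label"]

    # Mutual information (information gain) between each feature and the target
    mi = pd.Series(mutual_info_classif(X, y), index=X.columns).sort_values()

    # ANOVA F-test score of each feature
    f_stat, _ = f_classif(X, y)
    f_scores = pd.Series(f_stat, index=X.columns).sort_values()

    print(mi)        # lowest mutual information first
    print(f_scores)  # lowest F score first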
 
 4. Model Training: 
I had already split the data before feature selection, but I need to split it again afterwards.
I split the dataset into the target and the features, then used train_test_split from sklearn to split the data into train and test sets, with 25% of the data for testing and the rest for training, and stratified the split.
I normalized the training set and then the test set using Normalizer from sklearn.
I used two learning algorithms and examined their performance before and after data normalization.
The model with the best accuracy is Logistic Regression before normalization, but accuracy is not the best metric for evaluating our models even though the data is balanced; it only gives us an intuition about how well both positive and negative samples were classified. In this problem I want to make sure that as few attacks as possible are classified as normal actions (i.e., high recall), but I also do not want every action to be classified as an attack, so I care about precision as well. The F1 score is a good metric for balancing precision and recall, and the Logistic Regression model without normalized data is the winner here.
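A sketch of the training setup described above: a stratified 75/25 split, a Normalizer fitted on the training set, and a comparison of Logistic Regression and Naïve Bayes by F1 score; the selected feature list and variable names carry over from the sketches above and are assumptions:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import Normalizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import f1_score

    X_train, X_test, y_train, y_test = train_test_split(
        X[selected], y, test_size=0.25, stratify=y, random_state=42)

    # Fit the normalizer on the training set, then apply it to both sets
    norm = Normalizer().fit(X_train)
    X_train_n, X_test_n = norm.transform(X_train), norm.transform(X_test)

    candidates = [
        ("LR without norm", LogisticRegression(max_iter=1000), X_train, X_test),
        ("LR with norm",    LogisticRegression(max_iter=1000), X_train_n, X_test_n),
        ("NB without norm", GaussianNB(), X_train, X_test),
        ("NB with norm",    GaussianNB(), X_train_n, X_test_n),
    ]
    for name, clf, X_tr, X_te in candidates:
        clf.fit(X_tr, y_train)
        print(name, "F1 (class 1):", f1_score(y_test, clf.predict(X_te)))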
5. Model evaluation: 
• AUC-ROC:
- This measures the ability of the classifier to distinguish between the classes.
- The ROC AUC is 0.81, which is a good score for our problem.
- A higher value on the y-axis indicates that TP is higher than FN.
• Precision-Recall curve:
- A good classifier maintains both high precision and high recall, which we can see in the PR graph.
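A sketch of the AUC-ROC and precision-recall evaluation for the chosen model (Logistic Regression on the un-normalized features), using scikit-learn and matplotlib and the same assumed variable names:

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_prob = lr.predict_proba(X_test)[:, 1]   # probability of the positive class

    print("AUC-ROC:", roc_auc_score(y_test, y_prob))   # about 0.81 in this report

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
    plt.title("ROC curve")
    plt.show()

    prec, rec, _ = precision_recall_curve(y_test, y_prob)
    plt.plot(rec, prec)
    plt.xlabel("Recall"); plt.ylabel("Precision")
    plt.title("Precision-Recall curve")
    plt.show()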
Results of both models before and after normalization (per-class scores given as class 1 / class 0):

Metric              LR without Norm   LR with Norm   NB without Norm   NB with Norm
Accuracy            82.44             81.56          80.30             80.30
F1-score (1 / 0)    86 / 76           85 / 75        84 / 74           84 / 74
Precision (1 / 0)   76 / 99           76 / 95        76 / 90           76 / 90
Recall (1 / 0)      100 / 62          97 / 63        94 / 64           94 / 64
Micro avg           81                80             79                79

• Confusion matrix: it shows the combinations of predicted and actual values.
We can see that the model almost never misclassifies class 1 as class 0 (only 139 such points), while 11,623 points of class 0 were classified as class 1.
➔ At the end, I saved the LR model to use it in the dynamic part.
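A sketch of the confusion matrix computation and of persisting the chosen model for the dynamic part; the output file name is a placeholder:

    import joblib
    from sklearn.metrics import confusion_matrix

    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_test, lr.predict(X_test)))

    # Save the Logistic Regression model for reuse in the dynamic part
    joblib.dump(lr, "lr_static_model.joblib")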
Part II (Dynamic Model): 
1. I loaded the model from the static part and then preprocessed “Kafka_dataset.csv” in the same way as before.
2. I ran the consumer code and validated that the data stream was being received.
3. I appended 1,000 streamed observations to form each window.
4. I created a function to preprocess the streaming data in the same way as the static data.
5. I created a pipeline with the same model architecture as in the static part.
6. The dynamic part processes 250 windows; as the data streams in, it computes the accuracy and F1 score for each window, compares the F1 score against the threshold (F1 = 85%), and retrains the model if it falls below it (a sketch of this loop follows the list).
7. I evaluated the F1 score (the metric chosen in the previous part) for the model on each window and plotted F1 vs. iteration.
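A sketch, under assumed names, of the windowed evaluate-and-retrain loop in steps 3-7: the Kafka topic, the consumer configuration, the preprocess() helper, and the target column name are placeholders, and kafka-python is assumed as the client library:

    import json
    import joblib
    import pandas as pd
    from kafka import KafkaConsumer
    from sklearn.metrics import f1_score

    model = joblib.load("lr_static_model.joblib")
    consumer = KafkaConsumer(
        "dns_stream",                                   # topic name is an assumption
        value_deserializer=lambda m: json.loads(m.decode("utf-8")))

    WINDOW_SIZE, F1_THRESHOLD, MAX_WINDOWS = 1000, 0.85, 250
    buffer, f1_history = [], []

    for message in consumer:
        buffer.append(message.value)
        if len(buffer) < WINDOW_SIZE:
            continue
        window = preprocess(pd.DataFrame(buffer))       # same preprocessing as the static part
        buffer = []
        X_w, y_w = window.drop(columns=["Label"]), window["Label"]
        f1 = f1_score(y_w, model.predict(X_w))
        f1_history.append(f1)
        if f1 < F1_THRESHOLD:                           # retrain when F1 drops below 85%
            model.fit(X_w, y_w)
        if len(f1_history) == MAX_WINDOWS:              # stop after 250 windows
            break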
These are the results of the last window generated. Since we care about the F1 score, we can see that it does not change much; however, these small changes can be important depending on the use case and the user requirements, so whether the dynamic implementation is better depends on the situation.
