CSE343-ECE343 ML Report Assignment 1 Solution

SECTION - A
(a) No, the fact that two variables exhibit a strong correlation with a third variable does not necessarily imply that they will also display a high degree of correlation with each other. Correlation is a measure of the linear relationship between two variables, and each pair of variables can have its own unique relationship, regardless of their relationship with a third variable.
Here's an example to illustrate this:
Let's say you have three variables: A, B, and C.
Variable A and Variable B both have a strong positive correlation with Variable C: as the values of A and B increase, the values of C also tend to increase. However, this says nothing about the relationship between A and B themselves. For instance, if A and B are independent and C = A + B, then each of A and B correlates strongly with C (about 0.71), yet the correlation between A and B is approximately 0.
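A minimal numpy sketch of this situation (A and B independent, C = A + B; the values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=10_000)        # A and B are independent
b = rng.normal(size=10_000)
c = a + b                          # C is driven by both A and B

print(np.corrcoef(a, c)[0, 1])     # ~0.71: A correlates strongly with C
print(np.corrcoef(b, c)[0, 1])     # ~0.71: B correlates strongly with C
print(np.corrcoef(a, b)[0, 1])     # ~0.0:  yet A and B are uncorrelated
```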
(b)
The defining criteria for a function to be classified as a logistic function include:
1. S-Shaped Curve: A logistic function must exhibit an S-shaped curve, which means that it starts with a gradual increase, becomes steeper as it progresses, and then levels off as it approaches the upper and lower limits (asymptotes). This characteristic S-shape is a fundamental property of logistic functions.
2. Range: The output of a logistic function should be constrained to a specific range, typically between 0 and 1. As the input becomes very large (positive or negative), the output approaches the asymptotes of 1 and 0.
3. Symmetry and Center: A logistic function is symmetric about its midpoint, which is typically located at x=0. This means that the values of the function are symmetrically distributed around this central point.
4. Monotonicity: A logistic function is monotonically increasing, which means that as the input increases, the output also increases. However, the rate of increase changes as input moves away from the center.
5. Continuity and Differentiability: Logistic functions are typically continuous and differentiable everywhere within their domain. This is important for many applications, including optimization and gradient-based algorithms.
The most common form of the logistic function is the sigmoid function, which is given by:
f(x) = 1/(1+e^-x), which satisfies all of the criteria mentioned above.

sinh(x): It is not a valid logistic function. It does not exhibit an S-shaped curve, and its output is unbounded rather than constrained to a range such as [0, 1]. sinh(x) = (e^x - e^(-x)) / 2

cosh(x): It is not a valid logistic function either. It lacks the S-shaped curve and a constrained output range between 0 and 1 (its range is [1, ∞), and it is not monotonic).
cosh(x) = (e^x + e^(-x)) / 2

tanh(x): It is a valid logistic function. It exhibits an S-shaped curve and its output range is (-1, 1). While it is not constrained between 0 and 1, it still meets the criteria of an S-shaped curve and a bounded range. tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

signum(x): It is not a valid logistic function. It produces only the discrete outputs -1, 0, and 1, is discontinuous at x = 0, and does not exhibit an S-shaped curve.
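The range and monotonicity criteria can also be checked numerically; a small numpy sketch (the grid is arbitrary):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
candidates = [
    ("sigmoid", lambda t: 1 / (1 + np.exp(-t))),
    ("sinh", np.sinh),
    ("cosh", np.cosh),
    ("tanh", np.tanh),
    ("signum", np.sign),
]
for name, f in candidates:
    y = f(x)
    monotonic = bool(np.all(np.diff(y) >= 0))  # non-decreasing on the grid
    print(f"{name:7s} range=({y.min():10.3f}, {y.max():10.3f})  monotonic={monotonic}")
```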

(c) For very sparse datasets, the "Leave-One-Out Cross-Validation" (LOOCV) technique can be beneficial. LOOCV is particularly advantageous when dealing with sparse data because it utilizes each data point for validation individually, which helps in making the most of the limited available data.
Leave-One-Out Cross Validation (LOOCV) is a model validation technique where a single sample from the dataset is used as the test set, and the remaining samples are used as the training set. The process is repeated n times, where n is the number of samples in the dataset. Each sample is used once as the test set and the model is trained on the remaining n-1 samples. The average performance score across all n iterations is used to validate the model.
Difference from K-Fold Cross-Validation:
In K-Fold Cross-Validation, the dataset is divided into K subsets (folds), and the model is trained K times, each time using K-1 folds for training and the remaining fold for validation. LOOCV is therefore the special case of K-fold cross-validation with K = n; it makes the fullest use of scarce data but requires n model fits.
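A minimal sketch contrasting the two (assuming scikit-learn is available; the diabetes toy dataset stands in for an actual sparse dataset):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# LOOCV: n fits, each holding out exactly one sample.
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
# K-fold: K fits, each holding out n/K samples.
kf_mse = -cross_val_score(model, X, y, cv=KFold(n_splits=5),
                          scoring="neg_mean_squared_error").mean()

print("LOOCV mean MSE :", loo_mse)
print("5-fold mean MSE:", kf_mse)
```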
(d)

(e) In the simple linear regression model:
Y = α + βx + ε

So, the correct answer is:
(a) α, β, σ. The error term ε is a random variable rather than a parameter: it is typically assumed to be N(0, σ²), so instead of estimating ε directly we estimate its standard deviation σ, together with α and β (also clear from the formula above).
(f) Given data:
X = [20, 30, 50, 60, 80, 90]
Y = [125, 110, 95, 90, 110, 130]

As is clear from the scatter plot of Y against X, the points trace an upward parabola, so the coefficient of x^2 should be positive (β2 > 0). Hence option (d), Y = α + β1x + β2x^2 + ε with β2 > 0, is correct.
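A quick numpy check of the sign of the quadratic coefficient on the given data:

```python
import numpy as np

x = np.array([20, 30, 50, 60, 80, 90], dtype=float)
y = np.array([125, 110, 95, 90, 110, 130], dtype=float)

b2, b1, a = np.polyfit(x, y, deg=2)   # coefficients, highest degree first
print(f"alpha = {a:.2f}, beta1 = {b1:.3f}, beta2 = {b2:.4f}")  # beta2 comes out positive
```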
SECTION - B
(a) Brief explanation of what the code does:
1. LogisticRegression Class: This class represents the logistic regression model. It has methods for sigmoid activation, cross-entropy loss calculation, training the model using SGD, and testing the model's performance.
2. Sigmoid Function: The sigmoid function calculates the sigmoid activation of a given input z .
3. Cross-Entropy Loss: The cross_entropy_loss function computes the binary cross-entropy loss between true labels (y_true) and predicted probabilities (y_pred). It also clips the predictions with a small epsilon for numerical stability.
4. Training: The train method trains the logistic regression model. It iterates through the training data for a specified number of iterations using SGD. In each iteration, it calculates the gradient of the loss with respect to the model's parameters and updates the model's parameters accordingly. It also calculates and tracks training and validation losses and accuracies during training.
5. Testing: The test method evaluates the trained model on a test dataset. It computes metrics such as accuracy, precision, recall, F1 score, and a confusion matrix.
6. Data Loading and Preprocessing: The code loads a dataset from a CSV file, separates features and labels, standardizes the features, and adds a bias term.
7. Hyperparameters: It allows the user to specify the learning rate and the number of training iterations.
8. Training and Visualization: It initializes the logistic regression model, trains it using the training data, and visualizes the training and validation loss and accuracy over iterations using matplotlib.
9. Testing and Metrics: Finally, it evaluates the trained model on a test dataset and prints out the confusion matrix, accuracy, precision, recall, and F1 score to assess the model's performance.
In each epoch of stochastic gradient descent I calculate the training loss, validation loss, training accuracy, and validation accuracy.
Convergence: the model converges early for a high learning rate and late for a low learning rate.
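A minimal sketch of the core training pieces described above (names and structure are illustrative, not the assignment's exact code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # numerical stability
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def train_sgd(X, y, lr=0.01, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # bias handled as an extra all-ones column
    losses = []
    for _ in range(n_iters):
        i = rng.integers(len(y))             # one random sample per SGD step
        pred = sigmoid(X[i] @ w)
        w -= lr * (pred - y[i]) * X[i]       # gradient of the cross-entropy loss
        losses.append(cross_entropy(y, sigmoid(X @ w)))
    return w, losses

# Toy usage with synthetic data:
X = np.hstack([np.random.randn(200, 2), np.ones((200, 1))])
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, losses = train_sgd(X, y)
```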
Comparison and analysis of the plots:
(b) & (c) Learning rate = 1


Confusion matrix: [[17, 14], [5, 42]]
Accuracy: 0.7564102564102564
Precision: 0.7727272727272727
Recall: 0.5483870967741935
F1 Score: 0.6415094339622641
Learning rate = 0.1


Confusion matrix: [[14, 17], [3, 44]]
Accuracy: 0.7435897435897436
Precision: 0.8235294117647058
Recall: 0.45161290322580644
F1 Score: 0.5833333333333334
Learning rate = 0.01


Confusion matrix: [[18, 13], [3, 44]]
Accuracy: 0.7948717948717948
Precision: 0.8571428571428571
Recall: 0.5806451612903226
F1 Score: 0.6923076923076923
Learning rate = 0.001


Confusion matrix: [[24, 7], [18, 29]]
Accuracy: 0.6794871794871795
Precision: 0.5714285714285714
Recall: 0.7741935483870968
F1 Score: 0.6575342465753424
Analysis: As we decrease the learning rate, the loss and accuracy curves become less steep (the graphs get smoother for decreasing learning rates), and we attain the maximum accuracy and precision at a learning rate of 0.01.
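For reference, all of the metrics above follow directly from the confusion matrices; a small sketch, assuming the layout [[TP, FN], [FP, TN]] (which matches the precision and recall values reported):

```python
def metrics(cm):
    (tp, fn), (fp, tn) = cm
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics([[18, 13], [3, 44]]))  # learning rate 0.01 -> (0.795, 0.857, 0.581, 0.692)
```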
(d) We only modify the loss function by adding a penalty term.
The new loss functions are:
Lasso regression:
L = -[y_true*log(y_pred) + (1-y_true)*log(1-y_pred)] + λ*∑|wj|, where λ is the L1 regularization parameter.
The last term is the part specific to Lasso regression: it adds the sum of the absolute values of the feature coefficients (weights), multiplied by the regularization parameter λ, to the loss function.
The gradient of this new loss function is (prediction - y_i)*x_i + λ*sign(w).
Ridge regression:
L = -[y_true*log(y_pred) + (1-y_true)*log(1-y_pred)] + λ*∑(wj^2), where λ is the L2 regularization parameter.
The last term is the part specific to Ridge regression: it adds the sum of the squared coefficients (weights), multiplied by the regularization parameter λ, to the loss function.
The gradient of this new loss function is (prediction - y_i)*x_i + 2λ*w.
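A minimal sketch of a single regularized SGD step under these two penalties (names are illustrative):

```python
import numpy as np

def sgd_step(w, x_i, y_i, lr, lam, penalty="l2"):
    pred = 1.0 / (1.0 + np.exp(-(x_i @ w)))   # sigmoid prediction for one sample
    grad = (pred - y_i) * x_i                 # gradient of the data term
    if penalty == "l1":
        grad = grad + lam * np.sign(w)        # Lasso: lambda * sign(w)
    else:
        grad = grad + 2 * lam * w             # Ridge: 2 * lambda * w
    return w - lr * grad
```

In practice the bias weight is usually excluded from the penalty term, so only the feature weights are shrunk.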
Using the values (0.001, 0.01, 0.1, 1, 10) for the penalty term in both Lasso and Ridge regression, I got:
Best l1_penalty: 0.1
Best l2_penalty: 0.001
(Lasso regression for best L1 penalty):


(Ridge regression for best L2 penalty):


(e) tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). The range of tanh(x) is (-1, 1), so we rescale it to (0, 1) with a new function f(x) = (1 + tanh(x))/2 and use the same loss function. f(x) simplifies to 1/(1+e^(-2x)), so:
loss = -[y*log(f(x)) + (1-y)*log(1-f(x))]
gradient = 2*(f(x) - y_i)*x_i
The smoothest graph is obtained for learning rate = 0.001.
There is not much difference in the model's performance when using the hyperbolic tangent instead of the sigmoid.
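A quick numpy check that the rescaled tanh is exactly the sigmoid evaluated at 2x:

```python
import numpy as np

x = np.linspace(-5, 5, 11)
f = (1 + np.tanh(x)) / 2           # rescaled tanh, range (0, 1)
sig2x = 1 / (1 + np.exp(-2 * x))   # sigmoid at 2x

print(np.allclose(f, sig2x))       # True: the two forms are identical
```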


(f) Batch size = 2


Batch size = 5


Batch size = 8


Batch size = 15


As we increase the batch size, the loss decreases more slowly and the accuracies increase more slowly.
In plain SGD the batch size is 1, so the loss decreases most rapidly and the accuracy increases most rapidly; hence the model converges earliest. For any larger batch size, each epoch performs fewer (though less noisy) parameter updates, so the model does not converge as early as with SGD.
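A minimal mini-batch SGD sketch (illustrative; X is assumed to already carry a bias column):

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=8, n_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(y))                    # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            preds = 1 / (1 + np.exp(-(X[batch] @ w)))
            grad = X[batch].T @ (preds - y[batch]) / len(batch)  # gradient averaged over the batch
            w -= lr * grad
    return w
```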
SECTION - C
(a) Insights:
As engine size increases, CO2 emissions increase.
Vehicles with Fuel Consumption Comb (mpg) between 20 and 30 are in the majority.
As Fuel Consumption Comb (mpg) increases, CO2 emissions decrease.
As the number of cylinders increases, Fuel Consumption Comb (mpg) decreases.
As Fuel Consumption Comb (L/100 km) increases, Fuel Consumption Comb (mpg) decreases.
Fuel type X is in the majority, and fuel type E has the highest median.
(b)

From the t-SNE scatter plot above, the data is separable in some regions, but overall it is not well separable.
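A hedged sketch of producing such a projection (scikit-learn; the iris toy dataset stands in for the CO2 data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)                 # stand-in dataset
X_std = StandardScaler().fit_transform(X)
emb = TSNE(n_components=2, random_state=0).fit_transform(X_std)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=10)
plt.title("t-SNE projection (2 components)")
plt.show()
```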
(c)

(d)

As the number of components increases, the error metrics decrease and the R2 score increases; hence the model's performance improves.
(e) After one-hot encoding (which creates a separate column for each category of a categorical feature):

In part (c) the difference between the training and testing error is very small and the R2 score is close to 1; hence the overall performance of the model in (c) is much better than in (e).
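A small pandas sketch of one-hot encoding (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Fuel Type": ["X", "Z", "E", "X"],
                   "Engine Size": [2.0, 3.5, 1.6, 2.4]})
encoded = pd.get_dummies(df, columns=["Fuel Type"])   # one 0/1 column per category
print(encoded.columns.tolist())
# ['Engine Size', 'Fuel Type_E', 'Fuel Type_X', 'Fuel Type_Z']
```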
(f)

As the number of PCA components increases, the train and test MSE and RMSE decrease and the test R2 score increases.
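A hedged sketch of such a sweep (scikit-learn; the diabetes toy dataset stands in for the CO2 data):

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)            # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in range(1, X.shape[1] + 1):               # sweep the number of components
    model = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"components={k}  test MSE={mean_squared_error(y_te, pred):.1f}  "
          f"test R2={r2_score(y_te, pred):.3f}")
```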
(g)

The MSE, RMSE, and MAE for Lasso are greater than for Ridge, while the R2 score and adjusted R2 score for Ridge are greater than for Lasso.
(h)

Performance is almost the same as in part (c).
