GitHub link : https://github.com/akj-new-era/Stock-Price-Prediction
Abstract
We examine a number of different forecasting techniques for predicting future stock returns from past returns. We do this by applying supervised learning methods to historical market data. We primarily apply linear models and later move on to neural networks.
Introduction
Stock (also known as equity) is a security that represents ownership of a fraction of a corporation. It entitles the owner to a proportion of the corporation's assets and profits equal to the amount of stock they own. Units of stock are called "shares." "Stock" is a general term for the ownership certificates of any company.
Stock prices change every day due to market forces. By this we mean that share prices change because of supply and demand. If more people want to buy a stock (demand) than sell it (supply), the price moves up. Conversely, if more people want to sell a stock than buy it, there is greater supply than demand, and the price falls.
Data Set
To train and test our model we used data for HDFC, ITC, and Infosys from 2000 to 2021, taken from Kaggle (https://www.kaggle.com/datasets/rohanrao/nifty50-stock-market-data). This dataset contains daily opening and closing prices, daily high and low values, each company's volume and turnover, the previous day's stock price, etc.
[Figures: historical price charts for HDFC, ITC, and INFOSYS]
Linear Regression
Linear regression is one of the simplest and most popular machine learning algorithms. It is a statistical method used for predictive analysis: it predicts continuous/real-valued variables such as sales, salary, age, or product price.
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x_i), showing how the value of the dependent variable changes with the values of the independent variables.
In this method we find the best-fit line as follows.
1) Define a linear model: y = a0 + a1*x + ε
Here,
y = dependent variable (target variable), x = independent variable (predictor variable), a0 = intercept of the line (gives an additional degree of freedom), a1 = linear regression coefficient (scale factor applied to each input value),
ε = random error.
Writing it in matrix form, with X as the design matrix and A as the coefficient vector, we get
Y = XA + E
2) Define a cost function
Here we use the MSE cost function:
MSE = (1/N) * Σ_{i=1}^{N} (y_i − (a1*x_i + a0))^2
Where,
N = total number of observations,
y_i = actual value,
(a1*x_i + a0) = predicted value.
3) Use the matrix method for minimizing the error between predicted values and actual values:
A = (X^T X)^(-1) X^T Y
This equation gives the optimized value of the coefficient matrix A.
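The normal-equation step above can be sketched in NumPy. The data here is synthetic (a noisy line with known intercept and slope); in the report, X would hold the chosen feature and Y the stock price.

```python
import numpy as np

# Synthetic data: y = 2.0 + 0.5*x + noise, standing in for real price data.
rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=100)

# Design matrix with a column of ones so the intercept a0 is estimated too.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: A = (X^T X)^(-1) X^T Y, giving A = [a0, a1].
A = np.linalg.inv(X.T @ X) @ X.T @ y
print(A)  # approximately [2.0, 0.5]
```

In practice `np.linalg.lstsq` is preferred over the explicit inverse for numerical stability, but the form above matches the equation in step 3.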
4) Check the performance using R-squared (the coefficient of determination):
R^2 = 1 − (residual sum of squares) / (total sum of squares)
A high R-squared value indicates a small difference between predicted and actual values, and hence a good model.
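The coefficient of determination can be computed directly from its definition. The arrays below are illustrative placeholders, not values from the dataset.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])  # a close fit, so R^2 is near 1
print(r_squared(y_true, y_pred))          # ≈ 0.995
```

A perfect fit gives R^2 = 1; a model no better than predicting the mean gives R^2 = 0, and worse models can go negative (as in the linear regression results later in this report).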
Results:
[Figures: linear regression fits on train and test data for ITC, HDFC, and INFOSYS]
Recurrent Neural Networks
RNNs are a class of neural networks tailored to temporal data. The neurons of an RNN have a cell state (memory), and input is processed according to this internal state, which is achieved with the help of loops within the network. Recurring modules of 'tanh' layers allow RNNs to retain information, but not for long, which is why we need LSTM models.
LSTM Model
Recurrent neural networks (RNNs) have proved to be among the most powerful models for processing sequential data, and Long Short-Term Memory (LSTM) is one of the most successful RNN architectures.
LSTM introduces the memory cell, a unit of computation that replaces traditional artificial neurons in the hidden layer of the network. With these memory cells, networks can effectively associate memories with inputs that are remote in time, and are hence suited to grasping the structure of data dynamically over time with high predictive capacity.
LSTMs have a chain-like structure, but the repeating module differs from a plain RNN's: instead of a single neural network layer, there are four, interacting in a very special way.
[Figure: the repeating module in an LSTM]
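The four interacting layers can be sketched as one LSTM time step in NumPy. The weights here are random placeholders, only meant to show the gate structure; a real model learns them during training.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of all four gates."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ x + U @ h_prev + b          # pre-activations for all four gates
    f, i, g, o = np.split(z, 4)         # forget, input, candidate, output
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)                      # candidate cell values ('tanh' layer)
    c = f * c_prev + i * g              # update the cell state (the memory)
    h = o * np.tanh(c)                  # hidden state / output
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4                      # toy sizes for illustration
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The forget gate f decides how much of the old cell state to keep, which is what lets an LSTM retain information over longer spans than a plain tanh RNN.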
Analysis
For analyzing the efficiency of the system we have used the Mean Squared Error (MSE). The error, i.e. the difference between the target and the obtained output value, is minimized with respect to the MSE. MSE is the mean of the squares of all the errors. MSE is very widely used and makes an excellent general-purpose error metric for numerical predictions.
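The metric just described is a one-liner; the sample arrays below are illustrative, not values from the dataset.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([10.0, 12.0, 14.0])
y_pred = np.array([11.0, 12.0, 13.0])  # errors of 1, 0, 1
print(mse(y_true, y_pred))             # (1 + 0 + 1) / 3 ≈ 0.667
```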
Implementation:
1. Using scikit-learn (machine learning model)
2. Data preprocessing using the dataset
3. Visualization of the dataset
4. Feature scaling
5. Preparing the datasets for training
6. Reshaping the datasets
7. Model development
8. Implementation of Sequential, Dense, LSTM and Dropout layers
9. Preprocessing the data
10. Predicting the output
11. Result visualization
Methodology
Stage 1: Raw Data: In this stage, historical stock data is collected as described in the Data Set section; this historical data is used for the prediction of future stock prices.
Stage 2: Data Preprocessing: The pre-processing stage involves:
a) Data discretization: part of data reduction, of particular importance for numerical data.
b) Data transformation: normalization.
c) Data cleaning: filling in missing values.
d) Data integration: integration of data files.
After the dataset is transformed into a clean dataset, it is divided into training and testing sets for evaluation. The training values are taken as the more recent values, and testing data is kept as 5-10 percent of the total dataset.
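The normalization and splitting steps above can be sketched as follows. The price series is synthetic, and the sketch follows the common convention of holding out the most recent 10 percent for testing; the window length of 60 days is an illustrative assumption.

```python
import numpy as np

# Synthetic price series standing in for the cleaned closing-price column.
prices = np.linspace(100.0, 200.0, 500) + np.random.default_rng(2).normal(size=500)

# Data transformation: min-max normalization into [0, 1].
scaled = (prices - prices.min()) / (prices.max() - prices.min())

# Chronological split: last 10 percent of the series held out for testing.
split = int(len(scaled) * 0.9)
train, test = scaled[:split], scaled[split:]

# Sliding windows: each sample holds `lookback` past prices, and the
# target is the next price, which is the input shape an LSTM expects.
lookback = 60
X_train = np.array([train[i - lookback:i] for i in range(lookback, len(train))])
y_train = train[lookback:]
print(X_train.shape, y_train.shape)  # (390, 60) (390,)
```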
Stage 3: Feature Extraction: In this layer, only the features to be fed to the neural network are chosen. We choose from Date, Open, High, Low, Close, and Volume; here we have chosen Date as the feature.
Stage 4: Training Neural Network: In this stage, the data is fed to the neural network, which is trained for prediction starting from randomly assigned biases and weights. Our LSTM model is composed of a sequential input layer followed by two LSTM layers, a dense layer with ReLU activation, and finally a dense output layer with a linear activation function.
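A minimal Keras sketch of the architecture described in Stage 4, assuming the `tensorflow.keras` API. The window length (60 days), unit counts (50 and 25), and the Adam optimizer are illustrative assumptions, not values taken from the report.

```python
from tensorflow import keras

# Two LSTM layers, a dense ReLU layer, and a linear output layer, as
# described in Stage 4. Hyperparameters here are placeholders.
model = keras.Sequential([
    keras.layers.Input(shape=(60, 1)),          # 60 past prices, 1 feature
    keras.layers.LSTM(50, return_sequences=True),
    keras.layers.LSTM(50),
    keras.layers.Dense(25, activation="relu"),
    keras.layers.Dense(1),                      # linear activation by default
])
model.compile(optimizer="adam", loss="mse")     # MSE loss, as in the Analysis section
model.summary()
```

Training would then be a call like `model.fit(X_train, y_train, epochs=25)` on the windowed, normalized data from Stage 2.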
Results
[Figures: LSTM predictions on testing data for ITC, Infosys, and HDFC]
Analysis:
Errors obtained using linear regression:

Company | Mean Squared Error | Mean Absolute Error | Coefficient of Determination
HDFC    | 245146.57934       | 416.1567            | -1.5125
ITC     | 53423.3812         | 228.669             | -29.9926
INFOSYS | 159334.2670        | 280.6234            | -2.1797
Errors obtained using LSTM:

Company | Mean Squared Error | Mean Absolute Error | Coefficient of Determination
HDFC    | 5933.854308        | 56.567703           | 0.925932
ITC     | 61.668587          | 5.757666            | 0.968361
INFOSYS | 3004.878401        | 28.037047           | 0.954257
Thus, we observe that the error for LSTM is significantly less than that for linear regression. It provides a much better fit as it accounts for past data to predict future values.
Comparative Results Using Different Parameters and Epochs

Set of normalized features | No. of epochs | MSE on normalized features
Prices + Volume            | 25            | 0.0022
Prices + Volume            | 50            | 0.0018
7_MA + Volume              | 25            | 0.0015
7_MA + Volume              | 50            | 0.0013
Price + Volume + 7_MA      | 25            | 0.0018
Price + Volume + 7_MA      | 50            | 0.0012
Conclusion
The popularity of stock market trading is continuously increasing, prompting experts to develop new prediction methods utilizing new techniques. Forecasting techniques benefit not only scholars but also investors and anyone involved in the stock market. A forecasting model with high accuracy is required to assist in the prediction of stock indexes. In this work we used one of the most precise forecasting approaches, a Recurrent Neural Network with Long Short-Term Memory units, for predicting the stock market's future behavior. We can also conclude that by taking different combinations of the given variables as parameters for the model, we can achieve higher accuracies, which reflects the fact that "more data with better models produces outstanding results."