
Stock Price Prediction | AI Project

Updated: May 19, 2024

Machine learning has proven immensely helpful in many industries for automating tasks that earlier required human labor. One such application of ML is predicting whether a particular trade will be profitable or not.


In this article, we'll learn how to predict a signal that indicates whether buying a particular stock will be helpful or not by using ML.


Let's start by importing a few libraries that will be used for different purposes, which will be explained later in this article.




Importing the libraries

Python libraries make it very easy for us to handle data and perform typical and complex tasks with a single line of code.


  • Pandas – This library helps to load the data frame in a 2D array format and has many functions to perform analysis tasks in one go.

pip install pandas
  • Numpy – Numpy arrays are very fast and can perform large computations in a very short time.

pip install numpy
  • Matplotlib/Seaborn – These libraries are used to draw visualizations.

pip install seaborn
pip install matplotlib
  • Sklearn – This module contains many libraries with pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.

pip install scikit-learn
  • XGBoost – This contains the eXtreme Gradient Boosting machine learning algorithm, which is one of the algorithms that helps us achieve high accuracy on predictions.

pip install xgboost


Importing the Dataset

The dataset we are going to use here to perform the analysis and build a predictive model is the Tesla stock price data. We'll use the OHLC ('Open', 'High', 'Low', 'Close') data from 1st January 2010 to 31st December 2017, which is 8 years of data for the Tesla stock. You can download the CSV file from:





From the first five rows, we can see that data for some of the dates is missing. The reason is that on weekends and holidays the stock market remains closed, hence no trading happens on these days.


From this, we get to know that there are 1692 rows of data available, and for each row we have 7 different features or columns.
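The snippet below, taken from the full code listing at the end of the article, loads the CSV (saved as TSLA.csv, as the listing assumes) and inspects it:

import pandas as pd

# Load the Tesla OHLC data and take a first look at it.
df = pd.read_csv('TSLA.csv')
print(df.head())   # first five rows
print(df.shape)    # (rows, columns) -> expected (1692, 7)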





Exploratory Data Analysis:

EDA is an approach to analyzing the data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.


While performing the EDA of the Tesla stock price data, we will analyze how the prices of the stock have moved over the period of time and how the end of the quarters affects the prices of the stock.


The prices of Tesla stock show an upward trend, as depicted by the plot of the closing price of the stock.
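The plot referred to above is produced by this part of the listing:

import matplotlib.pyplot as plt

# Line plot of the daily closing price over the whole period.
plt.figure(figsize=(15, 5))
plt.plot(df['Close'])
plt.title('Tesla Close price.', fontsize=15)
plt.ylabel('Price in dollars.')
plt.show()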


If we observe carefully, we can see that the data in the 'Close' column and the data available in the 'Adj Close' column are the same. Let's check whether this is the case for every row or not.


From here, we can conclude that all rows of the columns 'Close' and 'Adj Close' have the same data. So, keeping redundant data in the dataset is not going to help; we'll drop this column before further analysis.
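Both the check and the drop come from the listing:

# If every row matches, the filtered frame has the same number of rows as df.
print(df[df['Close'] == df['Adj Close']].shape)

# 'Adj Close' duplicates 'Close', so drop it before further analysis.
df = df.drop(['Adj Close'], axis=1)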


Now let's draw the distribution plots for the continuous features given in the dataset. Before moving further, let's check for null values, if any are present in the data frame.


This implies that there are no null values in the data set provided.
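Both the null check and the distribution plots appear in the listing as follows (with seaborn's deprecated distplot swapped for histplot):

import seaborn as sb

# Confirm there are no missing values in any column.
print(df.isnull().sum())

# Distribution plot for each continuous feature.
features = ['Open', 'High', 'Low', 'Close', 'Volume']
plt.subplots(figsize=(20, 10))
for i, col in enumerate(features):
    plt.subplot(2, 3, i + 1)
    sb.histplot(df[col], kde=True)  # distplot is deprecated in recent seaborn
plt.show()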



In the distribution plots of the OHLC data, we can see two peaks, which means the data has varied significantly in two regions. And the Volume data is left-skewed.



From the above boxplots, we can conclude that only the Volume data contains outliers, while the data in the rest of the columns is free from any outliers.
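The boxplots discussed above are drawn like this in the listing:

# Boxplots to spot outliers in each continuous feature.
plt.subplots(figsize=(20, 10))
for i, col in enumerate(features):
    plt.subplot(2, 3, i + 1)
    sb.boxplot(df[col])
plt.show()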


Feature Engineering:

Feature engineering helps to derive some valuable features from the existing ones. These additional features sometimes help in increasing the performance of the model significantly and certainly help to gain deeper insights into the data.

Now we have three more columns, namely 'day', 'month' and 'year'; all three have been derived from the 'Date' column which was initially provided in the data.


A quarter is defined as a group of three months. Each company prepares its quarterly results and publishes them publicly so that people can analyze the company's performance. These quarterly results affect stock prices heavily, which is why we have added this feature, since it may be a helpful feature for the learning model.
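These derived columns are created in the listing as follows; the 'Date' column is assumed to be in MM/DD/YYYY format, which is what the splitting below implies:

import numpy as np

# Split the 'Date' string into its month, day and year parts.
splitted = df['Date'].str.split('/', expand=True)
df['day'] = splitted[1].astype('int')
df['month'] = splitted[0].astype('int')
df['year'] = splitted[2].astype('int')

# Quarter-end flag: 1 for March, June, September and December, else 0.
df['is_quarter_end'] = np.where(df['month'] % 3 == 0, 1, 0)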



From the above bar graph, we can see that the average stock prices rise steeply in the later years covered by the dataset.


Here are some of the important observations from the above grouped data:

Prices are higher in the quarter-end months as compared to the non-quarter-end months. The volume of trades is lower in the quarter-end months.
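Both groupings come from the listing; numeric_only=True is added here so the string 'Date' column does not break the aggregation on recent pandas versions:

# Yearly averages of the OHLC columns, shown as bar charts.
data_grouped = df.groupby('year').mean(numeric_only=True)
plt.subplots(figsize=(20, 10))
for i, col in enumerate(['Open', 'High', 'Low', 'Close']):
    plt.subplot(2, 2, i + 1)
    data_grouped[col].plot.bar()
plt.show()

# Compare quarter-end months against the remaining months.
print(df.groupby('is_quarter_end').mean(numeric_only=True))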


Above, we have added some more columns that will help in the training of our model. We have added the target feature, which is a signal of whether to buy or not; we will train our model to predict only this. But before proceeding, let's check whether the target is balanced or not using a pie chart.
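The target and the two helper features are built as follows in the listing (the pie chart here labels the slices with the actual class values rather than a hard-coded order):

# Helper features plus the binary target:
# target is 1 if the next day's close is higher than today's, else 0.
df['open-close'] = df['Open'] - df['Close']
df['low-high'] = df['Low'] - df['High']
df['target'] = np.where(df['Close'].shift(-1) > df['Close'], 1, 0)

# Pie chart to check whether the two target classes are roughly balanced.
plt.pie(df['target'].value_counts().values,
        labels=df['target'].value_counts().index, autopct='%1.1f%%')
plt.show()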


When we add features to our dataset, we have to ensure that there are no highly correlated features, as they do not help in the learning process of the algorithm.



From the above heatmap, we can say that there is a high correlation between the OHLC columns, which is pretty self-evident, and that the added features are not highly correlated with each other or with the previously provided features, which means that we are good to go and build our model.
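The heatmap is drawn with the snippet below; numeric_only=True is added for the same pandas-version reason as above:

plt.figure(figsize=(10, 10))

# Highlight only the highly correlated pairs (correlation above 0.9).
sb.heatmap(df.corr(numeric_only=True) > 0.9, annot=True, cbar=False)
plt.show()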

Data Splitting and Normalization:


After selecting the features to train the model on, we should normalize the data, because normalized data leads to stable and fast training of the model. After that, the whole dataset is split into two parts with a 90/10 ratio so that we can evaluate the performance of our model on unseen data.
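This step appears in the listing as:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = df[['open-close', 'low-high', 'is_quarter_end']]
target = df['target']

# Standardize the features so each one has zero mean and unit variance.
scaler = StandardScaler()
features = scaler.fit_transform(features)

# 90/10 train/validation split with a fixed seed for reproducibility.
X_train, X_valid, Y_train, Y_valid = train_test_split(
    features, target, test_size=0.1, random_state=2022)
print(X_train.shape, X_valid.shape)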


Model Development and Evaluation:

Now is the time to train some state-of-the-art machine learning models (Logistic Regression, Support Vector Machine, XGBClassifier), and then, based on their performance on the training and validation data, select which ML model serves the purpose at hand better.

For the evaluation metric, we will use the ROC-AUC curve. Why this one? Because instead of predicting a hard label of 0 or 1, we would like the model to predict soft probabilities, that is, continuous values between 0 and 1. And with soft probabilities, the ROC-AUC curve is generally used to measure the quality of the predictions.
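The training and evaluation step looks like this (equivalent to the loop in the full listing, just iterating over the models directly):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn import metrics

models = [LogisticRegression(),
          SVC(kernel='poly', probability=True),
          XGBClassifier()]

for model in models:
    model.fit(X_train, Y_train)
    print(f'{model} : ')
    # ROC-AUC computed on the predicted probability of the positive class.
    print('Training Accuracy : ', metrics.roc_auc_score(
        Y_train, model.predict_proba(X_train)[:, 1]))
    print('Validation Accuracy : ', metrics.roc_auc_score(
        Y_valid, model.predict_proba(X_valid)[:, 1]))
    print()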



Among the three models we have trained, XGBClassifier has the highest performance, but it is prone to overfitting, as the difference between the training and the validation accuracy is too high. In the case of Logistic Regression, this is not the case.

Now let's plot a confusion matrix for the validation data.
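With recent scikit-learn versions, plot_confusion_matrix is no longer available, so the ConfusionMatrixDisplay API is used instead:

# Confusion matrix of the Logistic Regression model on the validation set.
metrics.ConfusionMatrixDisplay.from_estimator(models[0], X_valid, Y_valid)
plt.show()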


Conclusion:

We can observe that the accuracy achieved by the state-of-the-art ML models is no better than simply guessing with a probability of 50%. Possible reasons for this may be the lack of data or the use of a very simple model to perform such a complex task as stock market prediction.


Project code:

Create a Python file named main.py and copy-paste the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('TSLA.csv')
df.head()
df.shape

df.describe()

df.info()

plt.figure(figsize=(15,5))
plt.plot(df['Close'])
plt.title('Tesla Close price.', fontsize=15)
plt.ylabel('Price in dollars.')
plt.show()

df.head()

df[df['Close'] == df['Adj Close']].shape

df = df.drop(['Adj Close'], axis=1)

df.isnull().sum()

features = ['Open', 'High', 'Low', 'Close', 'Volume']

plt.subplots(figsize=(20,10))

for i, col in enumerate(features):
    plt.subplot(2,3,i+1)
    sb.histplot(df[col], kde=True)  # distplot is deprecated in recent seaborn
plt.show()

plt.subplots(figsize=(20,10))
for i, col in enumerate(features):
    plt.subplot(2,3,i+1)
    sb.boxplot(df[col])
plt.show()

splitted = df['Date'].str.split('/', expand=True)

df['day'] = splitted[1].astype('int')
df['month'] = splitted[0].astype('int')
df['year'] = splitted[2].astype('int')

df.head()

df['is_quarter_end'] = np.where(df['month']%3==0,1,0)
df.head()

# numeric_only=True keeps the string 'Date' column out of the aggregation
data_grouped = df.groupby('year').mean(numeric_only=True)
plt.subplots(figsize=(20,10))

for i, col in enumerate(['Open', 'High', 'Low', 'Close']):
    plt.subplot(2,2,i+1)
    data_grouped[col].plot.bar()
plt.show()

df.groupby('is_quarter_end').mean(numeric_only=True)

df['open-close'] = df['Open'] - df['Close']
df['low-high'] = df['Low'] - df['High']
df['target'] = np.where(df['Close'].shift(-1) > df['Close'], 1, 0)


# Label slices with the actual class values (value_counts orders by frequency).
plt.pie(df['target'].value_counts().values,
        labels=df['target'].value_counts().index, autopct='%1.1f%%')
plt.show()

plt.figure(figsize=(10, 10))

# As our concern is with the highly
# correlated features only so, we will visualize
# our heatmap as per that criteria only.
sb.heatmap(df.corr(numeric_only=True) > 0.9, annot=True, cbar=False)
plt.show()

features = df[['open-close', 'low-high', 'is_quarter_end']]
target = df['target']

scaler = StandardScaler()
features = scaler.fit_transform(features)

X_train, X_valid, Y_train, Y_valid = train_test_split(
	features, target, test_size=0.1, random_state=2022)
print(X_train.shape, X_valid.shape)

models = [LogisticRegression(),
          SVC(kernel='poly', probability=True),
          XGBClassifier()]

for i in range(3):
    models[i].fit(X_train, Y_train)

    print(f'{models[i]} : ')
    print('Training Accuracy : ', metrics.roc_auc_score(
	Y_train, models[i].predict_proba(X_train)[:,1]))
    print('Validation Accuracy : ', metrics.roc_auc_score(
	Y_valid, models[i].predict_proba(X_valid)[:,1]))
    print()

# plot_confusion_matrix was removed from scikit-learn; use ConfusionMatrixDisplay instead
metrics.ConfusionMatrixDisplay.from_estimator(models[0], X_valid, Y_valid)
plt.show()


