How to Develop a Stock Price Prediction Model: A Beginner's Guide (2024)

Introduction

As a data science enthusiast, you will almost inevitably stumble into, or boldly take on, the task of predicting stock prices. It is a challenging task, but this article breaks it into simple steps so that you can approach it (and similar tasks) with confidence and, in time, apply more advanced concepts and techniques on your own.

Since this article is designed to provide a simple guide to beginners, let us briefly explain fundamental concepts, such as regression and stock price prediction, before we jump right in.

Regression Learning

Regression learning is a type of supervised learning in machine learning where the goal is to predict the output of a continuous variable (also called the dependent variable) based on one or more input features (independent variables). In other words, regression models are used for predicting a quantity, a numerical value, or a continuous response.

The main objective of regression analysis is to find the relationship between the input features and the target variable, allowing us to make predictions on new, unseen data. The output of a regression model is a continuous value (a number), as opposed to classification models, which predict a discrete label, class, or category.
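To make the idea concrete, here is a minimal sketch of regression with a single input feature, written in plain Python. The helper fit_line() and the numbers are made up for illustration; they are not part of the stock dataset used later.

```python
# Tiny illustration of regression: learn y = slope * x + intercept from examples.
# The data below is invented purely for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

inputs = [1.0, 2.0, 3.0, 4.0, 5.0]     # input feature (independent variable)
targets = [3.0, 5.0, 7.0, 9.0, 11.0]   # continuous target (dependent variable)

slope, intercept = fit_line(inputs, targets)
new_prediction = slope * 6.0 + intercept   # prediction for a new, unseen input
print(slope, intercept, new_prediction)
```

The model returns a number rather than a label, which is what separates regression from classification.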

Stock Price Prediction

Stock price prediction is one of the common use cases of regression analysis. It is an important and challenging problem because investors and businesses rise and fall with the market. Predicting stock prices accurately is inherently difficult because financial markets are complex and dynamic, which is why machine learning techniques are often used to model and predict stock prices based on historical data.

In this article, we will predict Google stock prices. The dataset is available on Yahoo Finance for download.

Objectives

At the end of this article, you will be able to:

Download financial data (Google stock data) from Yahoo Finance using Python
Read Data from your local machine
Explore the dataset for a better understanding
Preprocess the dataset
Train a regression model
Test the model
Evaluate the model

Prerequisites

To follow the steps, you must have the following:

Python installed
Jupyter Notebook (for example, via Anaconda) or Google Colab
A basic understanding of programming in Python
An interest in learning
Determination to follow this guide to the end

Let's get right into the game!😍

Import Required Libraries

We are importing the following libraries for the task.

# Import Required Libraries
import pandas as pd
import yfinance as yf
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings("ignore")

Fetch Historical Stock Price Data

The dataset we will use is Google stock data. Notice that we imported the yfinance library to help us download the historical data. You can choose a different time period for your data, but we chose 2010 to 2023.

We use the download() function of the yfinance library to download the data. Here we pass it three arguments: the stock symbol, the start date, and the end date. We then display the first 100 entries using the head() function.

# Define the stock symbol (ticker symbols are conventionally upper case)
stock_symbol = "GOOGL"
# Download historical data from Yahoo Finance
data = yf.download(stock_symbol, start='2010-01-01', end='2023-01-01')
data.head(100)

#Output

[Figure 1: output of data.head(100)]

Read Input Data

If you don't want to download directly into your notebook, you can download the CSV file from Yahoo Finance. We will use Pandas read_csv() to read the data. Please set the path to your file correctly.

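# Skip this step if you did the previous one. You can use either.
# (If you used yf.download(), one option is googleData = data.reset_index(),
# though the column layout can differ between yfinance versions.)
# Load data
googleData = pd.read_csv("Datasets/stocks_data/googleData.csv")
googleData.head(20)

#Output

[Figure 2: output of googleData.head(20)]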
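By default, read_csv() loads the Date column as plain text. As an optional variation, the sketch below uses the parse_dates parameter to convert it to a datetime type while reading. The inline CSV is a hypothetical two-row sample in the same column layout; its values are placeholders, not real quotes.

```python
import io
import pandas as pd

# Hypothetical sample with the same columns as the Yahoo Finance CSV.
# The numbers are placeholders, not real prices.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2020-01-02,10.0,11.0,9.5,10.5,10.5,1000\n"
    "2020-01-03,10.5,11.5,10.0,11.0,11.0,1200\n"
)

# parse_dates converts the Date column from text to datetime while reading
sample = pd.read_csv(sample_csv, parse_dates=["Date"])
print(sample.dtypes)
```

With a real file, you would pass the file path instead of the in-memory sample.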

Understand your Input Data

To help us get a better understanding of our data, we will use some Pandas functions.

Step 1:

Let's begin with the info() function to get a general overview of our data

# Get information about your data
googleData.info()

#Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4782 entries, 0 to 4781
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       4782 non-null   object
 1   Open       4782 non-null   float64
 2   High       4782 non-null   float64
 3   Low        4782 non-null   float64
 4   Close      4782 non-null   float64
 5   Adj Close  4782 non-null   float64
 6   Volume     4782 non-null   int64
dtypes: float64(5), int64(1), object(1)
memory usage: 261.6+ KB

The output above shows that our data has seven (7) columns and 4,782 entries. It also shows there are no null values.

Step 2:

We will use the Pandas describe() function to get a statistical summary of our data. It is used to view details like count, percentile, mean, min, max, and std of a data frame or a series of numeric values, excluding NaN (Not a Number) values.

# Get a statistical summary
googleData.describe()

#Output

[Figure 3: output of googleData.describe()]

describe() also accepts three (3) parameters: percentiles, include, and exclude. The percentiles parameter defaults to .25, .50, and .75, which produced the 25%, 50%, and 75% rows in the output above, but it can be set to any percentiles of choice. Both include and exclude default to None; with these defaults, a DataFrame that contains numeric columns is summarized using only those numeric columns, and nothing is explicitly excluded. When include is set to "all", describe() summarizes every column regardless of its type, and any statistic that does not apply to a column is shown as NaN.

Let's use it.

# Get a summary of ALL your data
googleData.describe(include="all")

#Output

[Figure 4: output of googleData.describe(include="all")]

Notice that the above output displays NaNs because we set the include parameter to "all". You may be wondering why, since our data has no missing values. These NaNs do not indicate missing data; they simply mark statistics that do not apply to a column's type. For example, mean and std cannot be computed for the text-based Date column, while unique, top, and freq are not reported for numeric columns. We also now have more details, such as unique, top, and freq, for the Date column.
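The behaviour of describe() is easier to see on a tiny frame. The sketch below uses a made-up DataFrame named toy with one text column and one numeric column and no missing values; the names and numbers are illustrative only.

```python
import pandas as pd

# Made-up miniature frame: one text column, one numeric column, nothing missing
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-06"],
    "Close": [10.0, 11.0, 12.0],
})

numeric_summary = toy.describe()                  # numeric columns only
custom = toy.describe(percentiles=[0.1, 0.9])     # custom percentiles
everything = toy.describe(include="all")          # every column, any dtype

print(everything)
# The NaNs in this output mark statistics that do not apply to a column,
# even though the frame itself contains no missing values.
print(toy.isnull().sum().sum())
```

The median (50%) row still appears when custom percentiles are passed, because describe() always reports it.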

Step 3:

Let's recheck our data to see if it has nulls using the isnull() and sum() functions.

# Check for nulls
googleData.isnull().sum()

#Output

[Figure 5: output of googleData.isnull().sum()]

The output shows that no column has a single null value.
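Our data has no nulls, so there is nothing to clean here. For comparison, the sketch below shows what isnull().sum() reports when a value is missing, using a made-up frame. Dropping incomplete rows with dropna() is only one simple option and is shown as an assumption, not as a step this guide requires.

```python
import numpy as np
import pandas as pd

# Made-up frame with one deliberately missing Close value
toy = pd.DataFrame({
    "Open": [10.0, 10.5, 11.0],
    "Close": [10.4, np.nan, 11.2],
})

null_counts = toy.isnull().sum()   # number of missing values per column
cleaned = toy.dropna()             # one simple option: drop incomplete rows

print(null_counts)
print(len(cleaned))
```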

Step 4:

Let's find the correlation between the variables in our data.

Correlation is significant in exploratory data analysis because it helps us to find the relationship between variables and their possible dependencies. It's a vital technique for dimensionality reduction (a term that refers to selecting and engineering features from a dataset to eliminate redundancy and keep only relevant features).

There are three types of correlation: positive, neutral (zero), and negative. A correlation is positive if two variables tend to move in the same direction, neutral if no linear relationship exists, and negative if they tend to move in opposite directions.

The strength of this relationship is usually measured using a correlation coefficient, which takes a value between -1.0 and 1.0. The closer the value is to 1.0 or -1.0, the stronger the relationship; values near 0 indicate a weak or no linear relationship.
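As a rough illustration of how such a coefficient is computed, the sketch below implements the Pearson correlation coefficient in plain Python. The helper pearson() and the sample lists are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

base = [1.0, 2.0, 3.0, 4.0]
same_direction = [2.0, 4.0, 6.0, 8.0]       # rises when base rises
opposite_direction = [8.0, 6.0, 4.0, 2.0]   # falls when base rises

r_positive = pearson(base, same_direction)
r_negative = pearson(base, opposite_direction)
print(r_positive)   # close to +1: strong positive correlation
print(r_negative)   # close to -1: strong negative correlation
```

Both relationships are equally strong; only the sign differs.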

Therefore, we will use the corr() function to find the existing correlation between variables in the data. The goal is to keep only the relevant features for modelling.

# Find the correlation in our data
# numeric_only=True skips the text-based Date column (needed in recent pandas versions)
googleData.corr(numeric_only=True)

#Output

[Figure 6: correlation table for the numeric columns]

Let's visualize this output for a more precise understanding using the plot() function.

# Call the plot() method and set the kind parameter to bar
googleData.corr(numeric_only=True).plot(kind="bar")

#Output

[Figure 7: bar chart of the correlation table]

The heatmap() function of the Seaborn library also visualizes it.

# Import the Seaborn library
import seaborn as sns
sns.heatmap(googleData.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap="rocket_r")
plt.show()

#Output

[Figure 8: heatmap of the correlation table]

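The output shows that Open, High, Low, Close, and Adj Close all correlate positively with one another (with values of 1.0 or close to 1.0), while Volume correlates negatively with the others. Note that a negative sign alone does not make a feature irrelevant; what matters is the strength of the relationship. In this guide, we treat Volume as not useful for predicting the dependent variable alongside the price columns.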

Data Preprocessing

We will preprocess this data by dropping the irrelevant and redundant variables using the Pandas drop() function. Since Adj Close and Close carry essentially the same information, and Volume does not correlate positively with any other column, we will drop Adj Close and Volume.

Step 1:

Let's drop columns

# Drop unnecessary columns
googleData.drop(['Adj Close', 'Volume'], axis=1, inplace=True)
googleData.head(20)

#Output

[Figure 9: output of googleData.head(20) after dropping the columns]

Step 2:

Now that we have dropped some columns, we can visualize our data using the Pandas plot().

# Use the plot function to plot
googleData.plot()

#Output

[Figure 10: line plot of the remaining columns]

Step 3:

Let's visualize the correlation again, but we will use the Seaborn pairplot() function this time.

# Use seaborn to plot the data
sns.pairplot(googleData)

#Output

[Figure 14: pair plot of the remaining columns]

The output shows that the variables correlate positively with each other: as the values on the x-axis increase, the values on the y-axis also increase. The data is ready for modelling.

Split Data

We will now split our data in two ways: first into independent and dependent variables, and then into training and testing sets.

Step 1:

We will split the data into input variables (Open, Low, and High) and the output variable (Close). The former will be used to predict the latter.

# Declare two variables to hold the two sets
x = googleData[['Open', 'High', 'Low']].values
y = googleData[['Close']].values
# You can output the variables x and y to see their contents
print(x)
print(y)

#Output for Input Variables

[[  4.211713   4.233631   4.147454]
 [  4.096396   4.221676   4.017691]
 [  4.183569   4.254802   4.146956]
 ...
 [136.639999 136.839996 135.330002]
 [136.960007 138.880005 136.080002]
 [137.820007 138.       135.479996]]

#Output for Output Variable

[[  4.219185]
 [  4.112087]
 [  4.172361]
 ...
 [136.380005]
 [138.699997]
 [136.940002]]

Step 2:

We will now split the data into training and testing sets using the train_test_split() function from the sklearn library. This function allows us to define the percentage of data used for training and testing. We have set the test_size parameter to 0.2 (20% of the data), which means the remaining 80% is used for training. The random_state parameter fixes the shuffling so that the split is reproducible. Note that, to reduce the risk of overfitting, it is crucial to train a model with sufficient data so that it can generalize to new and unseen data.

# Import train_test_split
from sklearn.model_selection import train_test_split
# Split data into two sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
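To see what the split produces, the sketch below runs train_test_split() on a small synthetic array instead of the stock data. The names x_demo and y_demo and their values are stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x and y: 10 rows, 3 input columns, 1 output column
x_demo = np.arange(30, dtype=float).reshape(10, 3)
y_demo = np.arange(10, dtype=float).reshape(10, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(x_tr.shape, x_te.shape)   # 8 training rows and 2 testing rows

# Repeating the call with the same random_state reproduces the same split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(x_te, x_te2))
```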

Model Training

We will train our regression model on the 80% set of our data (x_train and y_train) using the fit() function of the linear regression class. But first, we will create an instance of LinearRegression.

# Instantiate Linear Regression
model = LinearRegression()
# Train model
model.fit(x_train, y_train)
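If you are curious about what fit() actually learns, the sketch below trains LinearRegression on synthetic data generated from a known rule and then inspects the learned weights. The arrays features and target and the rule itself are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known rule: target = 2*a + 3*b + 1
features = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
target = 2.0 * features[:, 0] + 3.0 * features[:, 1] + 1.0

demo_model = LinearRegression()
demo_model.fit(features, target)

print(demo_model.coef_)        # learned weights, one per input column
print(demo_model.intercept_)   # learned constant term
```

Because the synthetic target follows the rule exactly, the learned weights should come out very close to 2 and 3, with an intercept close to 1. The same attributes, coef_ and intercept_, are available on the model trained above.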

Model Testing

Let's test our trained model on our test data, the 20% set we split earlier. We will do this using the predict() function of the model.

Step 1:

Let's predict our test data

# Make predictions on the test data
prediction = model.predict(x_test)
# Display the predictions
prediction

#Output

array([[ 59.23012508],
       [ 11.89693548],
       [ 47.6725332 ],
       [142.20883151],
       [ 65.36181412],
       [104.1629295 ],
       [ 19.18841781],
       [ 53.12259541],
       [ 14.52710348],
       [ 26.67452603],
       [ 46.32961794],
       [  8.90880651],
       [ 18.29991079],
       [ 61.10237052],
       [ 15.43339827],
       [ 31.01669073],
       [ 22.60792572],
       [ 53.23952951],
       [ 25.28671865],
       ...

Step 2:

Let's compare the actual and predicted values to see the performance of our model.

# Compare Actual and Predicted
comparison = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': prediction.flatten()})
comparison.head(20)

#Output

[Figure 15: table comparing actual and predicted values]
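The comparison table can also carry the prediction error for each row. The sketch below builds such a table from made-up values shaped like y_test and prediction; the Error column is an optional addition for illustration, not part of the original steps.

```python
import numpy as np
import pandas as pd

# Made-up values shaped like y_test and prediction: (n, 1) arrays
actual = np.array([[10.0], [12.0], [15.0]])
predicted = np.array([[10.5], [11.0], [15.0]])

# flatten() turns each (n, 1) array into a 1-D array so it fits in one column
table = pd.DataFrame({
    "Actual": actual.flatten(),
    "Predicted": predicted.flatten(),
})
table["Error"] = table["Actual"] - table["Predicted"]
print(table)
```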

Step 3:

Let's visualize the comparison between the actual and predicted values.

# Visualize the comparison using Matplotlib
cGraph = comparison.head(20)
cGraph.plot(kind='bar')
plt.title('Actual Vs Predicted')
plt.ylabel('Closing Price')
plt.show()

#Output

[Figure 16: bar chart of actual vs predicted closing prices]

The output shows that our model's predicted values are very close to the actual values.

Model Evaluation

We've successfully trained and tested our model, but how do we evaluate its performance? It's important to note that we cannot measure its performance with accuracy, because accuracy is a metric for classification, not regression. Instead, for regression models we calculate the error, that is, the difference between the actual and predicted values. But before using the error metrics, let's first score our model using the score() function.

Step 1:

Let's score our model. We will use the score() function and pass it the test data. The value is at most 1, and the closer the result is to 1, the better the model; a model that predicts poorly can score 0 or even below 0.

# Let's score this model using the test data
model.score(x_test, y_test)

#Output

0.9999188517055966

Step 2:

It's important to note that, for a regression model, the score() function outputs the R-squared value; thus, r2_score() outputs the same result as score(). Values near 1 are considered signs of a good fit.

# Import library
from sklearn.metrics import r2_score
r2_score(y_test, prediction)

#Output

0.9999188517055966

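The score() and r2_score() outputs show that our model fits the test data almost perfectly. This is not too surprising: a day's Close always lies between that day's Low and High, so these inputs already carry most of the information about the closing price.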
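For intuition, R-squared can be computed by hand as 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below does this in plain Python on made-up numbers; the helper r_squared() is illustrative and not part of sklearn.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual_values = [1.0, 2.0, 3.0, 4.0]

good = r_squared(actual_values, [1.1, 1.9, 3.2, 3.8])   # predictions close to actual
poor = r_squared(actual_values, [4.0, 3.0, 2.0, 1.0])   # predictions in reverse order

print(good)   # near 1
print(poor)   # below 0: worse than always predicting the mean
```

The second case shows that R-squared is not limited to the range 0 to 1 when predictions are very poor.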

Step 3:

Other metrics for calculating the error of our model are Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).

Put simply:

Prediction Error = Actual Value - Predicted Value

MAE

This calculates the average of the absolute prediction errors.

The better the model performance, the lower the MAE score.

MSE

This calculates the average squared error between the Actual and Predicted Values.

The better the model performance, the lower the MSE score.

RMSE

This calculates the square root of MSE.
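The three definitions above can be worked through by hand. The sketch below does so in plain Python on four made-up values, so that each step of the arithmetic is visible.

```python
import math

# Made-up actual and predicted values for illustration
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 13.0, 20.0]

# Prediction Error = Actual Value - Predicted Value
errors = [a - p for a, p in zip(actual, predicted)]   # [-1.0, 0.0, 2.0, 0.0]

mae = sum(abs(e) for e in errors) / len(errors)       # (1 + 0 + 2 + 0) / 4
mse = sum(e ** 2 for e in errors) / len(errors)       # (1 + 0 + 4 + 0) / 4
rmse = math.sqrt(mse)                                 # square root of MSE

print(mae, mse, rmse)
```

Squaring makes MSE penalize the single large error more heavily than MAE does, and RMSE brings the result back to the same unit as the prices.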

The good news is that these metrics are all available in the sklearn library. So, let's implement them to see their results.

# Libraries
import math
from sklearn import metrics
# Error metrics for a continuous target variable
print('_____________________________________')
print('Mean Absolute Error (MAE)')
mae = metrics.mean_absolute_error(y_test, prediction)
print('MAE Value:', mae)
print('_____________________________________')
print('Mean Squared Error (MSE)')
mse = metrics.mean_squared_error(y_test, prediction)
print('MSE Value:', mse)
print('_____________________________________')
print('Root Mean Squared Error Value (RMSE)')
rmse = math.sqrt(mse)
print('RMSE Value:', rmse)

#Output

_____________________________________
Mean Absolute Error (MAE)
MAE Value: 0.1894684150177685
_____________________________________
Mean Squared Error (MSE)
MSE Value: 0.1152946446650223
_____________________________________
Root Mean Squared Error Value (RMSE)
RMSE Value: 0.33955065110381144

The output above shows that our model's errors are low, confirming that the model fits the data well, though it can still be optimized. That, however, is beyond the scope of this guide. Good job!

Conclusion

Congratulations! ✌👏👏👏👏

If you've come this far, you've done well. Simple as this task may be for the expert, it is rarely easy for the beginner or enthusiast, which is why this guide was written for you.

Understand that financial markets are dynamic in the real world, and models may need regular updates. This model should, therefore, be monitored and regularly updated for better performance from time to time. Accurately predicting stock prices is challenging, and even sophisticated models may not always provide accurate predictions given that financial markets are influenced by numerous unpredictable factors, and past performance may not guarantee future results.

If this is helpful, kindly like, comment, and share. Follow Goodnews Daniel for more.

If you want to learn more, W3Schools.com provides sufficient resources. Happy learning!😍✌ Maziv Technologies Limited