
Abstract

Project Title: Stock Market Prediction

Group Members: Lakshman Boddoju

Contact Email: lakshmanboddoju2019@u.northwestern.edu

GitHub: https://github.com/pavanboddoju/stock-market-prediction-project

Northwestern University

EECS 349: Machine Learning Project

The objective of this project is to predict the direction of a stock's price movement using machine learning. Most of the effort went into feature selection and into preprocessing the data I used.

Inputs: Traditional Stock Market Data such as Open, High, Low, Close, Volume, Adjusted Close.
Additional Inputs: Sentiment Data collected from News headlines

Output: Predicted Stock Price Movement

Why it's interesting:

The problem is difficult mainly because of the nature of markets themselves: their dynamics change constantly, and the behavior of the parties who trade these instruments (and their mentalities) is inherently hard to explain. The trading market is also a very profitable place to apply machine learning. It has been observed that hedge funds that use machine learning to make trading decisions consistently outperform traditional hedge funds, which suggests that significant potential for financial gain remains.

Approach:

I trained several models on both traditional stock features and added sentiment features to predict stock price movements. The learners included Random Forests, Logistic Regression, and Multi-Layer Perceptrons. The dataset and features are explained in the detailed report below, but the key features were: open, high, low, close, volume, adjusted close, compound sentiment, negative sentiment, neutral sentiment and positive sentiment.

Key Results:

The Multi-Layer Perceptron model worked better than the other models, but in general the learners' predictive power is only modestly better than random guessing. The models also fail to predict highly volatile movements such as large spikes, even though stocks spike regularly in the real world. Still, the models guess the direction of stock movements better than ZeroR: the best model reached 62.4% directional accuracy, compared to 54.1% for ZeroR.

Here is an example of the average stock market prediction of a Random Forest Model on AAPL:

I chose this diagram to show that the model (especially the Random Forest model) was not great at predicting the stock price, but it still provides some insight into the direction the stock might move. A major flaw is that it was unable to predict the large spikes, likely because it was modeling a single company's stock.

By contrast, the same model performed better on DJIA, an index ticker that usually doesn't have such high volatility:

Detailed Report

Data:

There are 2519 instances of data points in total for each stock ticker. I kept aside 504 for testing and used the remaining 2015 for training the models.
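The report does not say how the 504 test rows were chosen. A minimal sketch of one common option for time-series data, holding out the most recent rows so the models never train on days that come after the test period, could look like this. The function name and the placeholder rows are illustrative assumptions, not the project's actual code.

```python
def chronological_split(rows, n_test):
    """Split date-ordered rows into (train, test), keeping the last n_test rows for testing."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    return rows[:-n_test], rows[-n_test:]


# Placeholder rows standing in for 2519 daily records, oldest first (assumed ordering).
rows = list(range(2519))
train, test = chronological_split(rows, n_test=504)
print(len(train), len(test))  # 2015 504
```

A shuffled split would also produce 2015 and 504 rows, but it would mix future days into the training set, which tends to make the evaluation look better than it is.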

I collected the data from two sources. Historical price data came directly from Yahoo Finance (https://finance.yahoo.com/). News data came from a script I wrote that pulls from news sources such as the New York Times through various APIs (https://www.nytimes.com/, https://developer.nytimes.com/, and https://newsapi.org/).

The data from Yahoo Finance is the traditional data, with fields such as date, open, high, low, volume, close and adjusted close for a stock. For the evaluation, I chose three different stock tickers, "DJIA", "KO" and "AAPL", and used their historical data over 10 years.

This is the Traditional Data Snapshot:

This is example data from the API calls that gathered news headlines during that time period:

This is the example of the stock prices and the headlines:

This is an example of stock prices and the sentiment values, namely the compound (aggregated), negative, neutral and positive sentiment scores from the NLTK package:
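As a rough sketch of how per-headline scores could be turned into one row of sentiment features per trading day: the code below averages the scores of all headlines that share a date. Averaging is an assumption, since the report does not say how headlines were combined. The `toy_scorer` is a hypothetical stand-in so the example runs without NLTK data; with NLTK installed and the `vader_lexicon` resource downloaded, `SentimentIntensityAnalyzer().polarity_scores` from `nltk.sentiment.vader` returns a dict with the same four keys and could be passed in as `scorer` instead.

```python
from collections import defaultdict
from statistics import mean

SCORE_KEYS = ("compound", "neg", "neu", "pos")


def daily_sentiment(headlines, scorer):
    """Average per-headline scores by date.

    headlines: iterable of (date_string, headline_text) pairs.
    scorer: callable returning a dict with the keys in SCORE_KEYS.
    """
    by_date = defaultdict(list)
    for date, text in headlines:
        by_date[date].append(scorer(text))
    return {
        date: {key: mean(s[key] for s in scores) for key in SCORE_KEYS}
        for date, scores in by_date.items()
    }


def toy_scorer(text):
    """Hypothetical word-list scorer, used only so this sketch runs offline."""
    positive = {"gain", "gains", "record", "beats"}
    negative = {"loss", "falls", "lawsuit", "misses"}
    words = text.lower().split()
    pos = sum(w in positive for w in words) / len(words)
    neg = sum(w in negative for w in words) / len(words)
    return {"compound": pos - neg, "neg": neg, "neu": 1 - pos - neg, "pos": pos}


# Made-up headlines, for illustration only.
headlines = [
    ("2017-01-03", "Apple beats estimates"),
    ("2017-01-03", "Apple faces lawsuit"),
    ("2017-01-04", "Markets flat"),
]
features = daily_sentiment(headlines, toy_scorer)
for date, scores in sorted(features.items()):
    print(date, scores)
```

The resulting per-date dictionary can then be joined to the price rows on the date field.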

Results:

This is an example of the format of the predicted prices made by the models:

And here is an example of the output graph representing the performance of the models between the average predicted price and the average actual price using a random forest model on AAPL:

An example output of KO stock ticker:

An example output of DJIA stock ticker:

The directional accuracy of the different models can be summarized as follows. To calculate the predicted direction of the stock on a given day, I subtract that day's open price from the predicted close price and check whether the result is positive or negative. I then compare that to the sign of the actual close price minus the open price.
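The directional-accuracy calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names and prices are made up, a day with zero change is counted as "not up" (the report only mentions positive and negative), and the ZeroR baseline here takes the majority direction from the same days being scored rather than from a separate training set.

```python
def direction(open_price, close_price):
    """Return True when the close is above the open (an 'up' day)."""
    return close_price - open_price > 0


def directional_accuracy(opens, predicted_closes, actual_closes):
    """Fraction of days where the predicted and actual moves have the same direction."""
    hits = sum(
        direction(o, p) == direction(o, a)
        for o, p, a in zip(opens, predicted_closes, actual_closes)
    )
    return hits / len(opens)


def zero_r_accuracy(opens, actual_closes):
    """Accuracy of always guessing the majority direction (simplified ZeroR)."""
    ups = sum(direction(o, a) for o, a in zip(opens, actual_closes))
    return max(ups, len(opens) - ups) / len(opens)


# Made-up prices, for illustration only.
opens = [100.0, 101.0, 102.0, 103.0]
predicted = [100.5, 100.2, 102.3, 103.4]
actual = [101.0, 100.0, 101.5, 104.0]
print(directional_accuracy(opens, predicted, actual))  # 0.75
print(zero_r_accuracy(opens, actual))  # 0.5
```

Comparing the two numbers is what makes a figure such as 62.4% meaningful: a model only adds value to the extent that it beats the majority-direction baseline.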

Key Insights:

The Multi-Layer Perceptron model usually works better than the other models.
In general, though, no model works significantly well.
The models usually cannot predict prices they have never seen before. This is especially noticeable when a stock reaches new highs or new lows that were not encountered in training.
Single-company stock tickers perform worse than indices. This could mean that single-company stocks are more volatile, more easily influenced, and harder to predict.
The assumption that stock prices are influenced by sentiment does seem to make sense given the outputs observed.
It also seems that, given more time and resources, it is possible to come up with a decision-making system that could be profitable based on this hypothesis.
Many other factors that influence stock market prices are not modeled in this system. Over the course of the project, I realized that feature selection is a huge factor in the success of a prediction task.

Challenges:

Data collection and preprocessing played a huge part in the project. Traditional data was available, but I had to write scripts to fix formatting and remove irrelevant and erroneous data points. Integrating APIs to gather sources and iterating over them to calculate the sentiment scores took a lot of time. I also experimented with several other APIs first and had to change direction, since those approaches were neither scalable nor affordable.
This project also required an understanding of the financial domain. It rests on the assumption that sentiment influences stock prices, and it ignores other features that are generally agreed to influence prices.
Others have claimed more accurate prediction of stock movement, up to 90%, based on a different hypothesis.

Future Work:

Try out more models and add other types of sources, such as Reddit, Twitter, and various other forums, by writing a dedicated script for each of them.
Add more features other than sentiment, and analyze their impact on the prices.

Key Packages Used during the whole process:

NLTK, Pandas, NumPy, scikit-learn and Weka.
NYT API, newsapi, ArchiveApi, tweepy, BeautifulSoup, etc.