Stock trend prediction using news articles: a text mining
approach
Abstract: Stock market prediction with data mining techniques is one of the
most important problems under investigation. Mining textual documents and
time series concurrently, for example predicting stock price movements from
the contents of news articles, is an emerging topic in the data mining and
text mining communities. Previous research has shown a strong relationship
between the time at which news stories are released and the time at which
stock prices fluctuate.
In this thesis, we present a model that predicts changes in the stock trend
by analyzing the influence of non-quantifiable information, namely news
articles, which are rich in information and superior to numeric data. In
particular, we investigate the immediate impact of news articles on the time
series, based on the Efficient Markets Hypothesis. We treat this as a binary
classification problem and address it with several data mining and text
mining techniques.
To build the prediction model, we use intraday prices and time-stamped news
articles related to the Iran-Khodro Company for the consecutive years 1383
and 1384. A new statistics-based piecewise segmentation algorithm is proposed
to identify trends in the time series. The news articles are preprocessed and
labeled as either rise or drop by aligning them back to the segmented trends.
A document selection heuristic based on chi-square estimation is used to
select the positive training documents. The selected news articles are
represented using the vector space model with the TF-IDF term weighting
scheme. Finally, the relationship between the contents of the news stories
and the trends in the stock prices is learned with a support vector machine.
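
As a rough illustration of the vector space representation step, the sketch
below computes TF-IDF vectors in plain Python. It is not the implementation
used in the thesis: the function name tfidf_vectors, the whitespace
tokenizer, the raw-count term frequency, the log(N/df) inverse document
frequency, and the three sample headlines are all assumptions made for the
example. The resulting vectors, paired with their rise/drop labels, are the
kind of input a support vector machine would then be trained on.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per document."""
    # Assumption: simple lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = raw term count * log(N / document frequency).
        weights = [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors


# Hypothetical, already-labeled headlines (not data from the thesis).
news = [
    ("company reports record sales", "rise"),
    ("company recalls cars after fault", "drop"),
    ("record exports lift company outlook", "rise"),
]
vocab, vecs = tfidf_vectors([text for text, _ in news])
labels = [label for _, label in news]
```

A term that occurs in every document, such as "company" here, receives a
weight of zero, so it contributes nothing to separating rise from drop.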
Several experiments are conducted to evaluate various aspects of the
proposed model, and encouraging results are obtained in all of them. The
prediction model achieves an accuracy of 83%, compared with 51% for random
labeling of the news, an increase of about 30 percentage points. Put
differently, the model predicts about 1.6 times more accurately than random
news labeling.
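
The two comparisons above can be checked with simple arithmetic on the
reported accuracies. The short sketch below uses only the 83% and 51% figures
from the abstract; the variable names are illustrative.

```python
model_accuracy = 0.83     # reported accuracy of the prediction model
baseline_accuracy = 0.51  # reported accuracy of random news labeling

# Absolute gain, as a fraction of 1 (multiply by 100 for percentage points).
absolute_gain = model_accuracy - baseline_accuracy
# Ratio of the two accuracies ("times better").
ratio = model_accuracy / baseline_accuracy

print(f"absolute gain: {absolute_gain * 100:.0f} percentage points")
print(f"ratio: {ratio:.2f}x")
```

The gain is 32 percentage points, which the abstract rounds to 30%, and the
ratio of the two accuracies is about 1.6.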