- Who: In this story, there are three main characters: 1) the people/community who need help, 2) the data scientist (that is, you), and 3) AI.
- BERT summarization models: From an ordinary reader's perspective, it is very difficult to analyze every record when each company produces huge amounts of data (which may vary over time). Reading and understanding all of that data each time is an arduous and time-consuming task.
- On behalf of users, machines can understand the content of news articles and provide brief summaries, which saves time and cost.
- As data scientists, our objective is to provide a solution that makes the user's job (understanding the content) easier without losing the core content of each article.
- Implementing summarization requires a deep neural network. Instead of developing the summarization pipeline (and the many NLP steps involved) from scratch, we can use BERT, a pre-trained model released by Google, to generate brief summaries from the original data without losing the essence of each record in the dataset; the summarized data is then fed to sentiment analysis to generate sentiment scores.
- Time-series forecasting model: As data scientists, our goal is to provide a system that helps interpret news and stock data to see which factors are impacting the industry and how.
- The extracted data will be fed to neural networks, which can analyze patterns from the news archive. With extensive training, the neural networks learn to interpret the news alongside current data and should be able to predict whether a stock is worth buying.
- How much does the data scientist understand Assignment 1 (domain) and Assignment 2 (data)?
-
BERT summarization models:
The given equity dataset contains a couple of attributes. For summarization, it is sufficient to process only the text data, removing special symbols, regardless of record length. After pre-processing, summarizing, and analyzing the summarized data, we need to evaluate the text generated by the BERT model against a human-written summary, which gives us recall.
-
Transformer based Sentiment analysis:
It is important to understand how the market data moves and changes with respect to the sentiment expressed in the news articles. We therefore employ a model capable of extracting knowledge from a given article and establishing the kind of sentiment associated with it.
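As a hedged illustration of this step, the sketch below scores a few texts with the Hugging Face `transformers` sentiment pipeline; the default checkpoint and the example sentences are assumptions for illustration, not the exact setup used in this project.

```python
# Minimal sketch: transformer-based sentiment scoring with the Hugging Face pipeline.
# The model defaults to a DistilBERT checkpoint fine-tuned on SST-2 (an assumption here).
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

summaries = [
    "The company reported record quarterly revenue and raised its guidance.",
    "Shares fell sharply after the regulator opened an investigation.",
]

for text, result in zip(summaries, sentiment(summaries)):
    # Each result is a dict with a 'label' (POSITIVE/NEGATIVE) and a confidence 'score'.
    print(result["label"], round(result["score"], 3), "-", text[:60])
```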
Time-series forecasting model:
While preprocessing the data, we examined almost all of its fields, what they mean, and how to analyze them. With this understanding, we moved forward by building a simple time-series model to predict how the data changes. Once the time-series model is built, the effect of the sentiment values on the stock price changes can be detected.
- What models and analysis did the data scientist and AI apply to fulfill the need of the people or the community?
-
BERT summarization models:
There are two types of summarization: abstractive and extractive. Abstractive summarization rewrites the key points in new words, while extractive summarization builds a summary by copying the most important spans/sentences directly from a document. For instance, if we are doing research and need a quick summary of a financial dataset with thousands of records, extractive summarization is the more helpful approach.
Our objective is to build a BERT extractive summarizer over the content column of the equity dataset. In addition, the summarized text serves as input to sentiment analysis, which lets us analyze the data while minimizing system resources (time and cost) without losing the essence of the original content.
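A minimal sketch of this step, assuming the `bert-extractive-summarizer` package and a CSV file with a 'content' column (both the file name and column name are illustrative assumptions):

```python
# Minimal sketch: extractive summarization of each record's text with a BERT-based model.
# pip install bert-extractive-summarizer
import pandas as pd
from summarizer import Summarizer

df = pd.read_csv("equity_news.csv")   # assumed file with a 'content' column
model = Summarizer()                  # BERT-based extractive summarizer

# Keep roughly the most important 20% of sentences from each article.
df["summary"] = df["content"].astype(str).apply(lambda text: model(text, ratio=0.2))
print(df[["content", "summary"]].head())
```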
-
Transformer based Sentiment analysis:
The goal of this model is to extract sentiment from news articles and, in doing so, help evaluate stock price fluctuations in relation to that sentiment. This helps the community answer the question of whether news articles on the web are a reliable resource for generating insights into stock prices.
-
Time-series forecasting model:
Time-series forecasting models predict future values based on previously observed values. Time-series forecasting is widely used for non-stationary data, i.e., data whose statistical properties vary over time.
This non-stationary input data is usually called a time series. Examples include temperature over time, stock prices over time, house prices over time, etc. The input is therefore a signal (time series) defined by observations taken sequentially in time, as sketched below. Two types of time-series models are used here, and their results are compared.
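As a hedged example of what such an input looks like in this project, the sketch below loads a daily closing-price series with pandas; the file name and column names follow the common stock-dataset layout and are assumptions for illustration.

```python
# Minimal sketch: a stock price time series as sequential, date-indexed observations.
import pandas as pd

prices = pd.read_csv("ibm.us.txt", parse_dates=["Date"], index_col="Date")  # assumed file
series = prices["Close"].sort_index()   # one observation per trading day

print(series.head())                     # sequential observations over time
print("Observations:", len(series))
```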
-
The ARIMA (Autoregressive Integrated Moving Average) model is a widely used forecasting technique for time-series prediction. It is a classical statistical forecasting algorithm.
-
The LSTM (Long Short-Term Memory) model is a deep learning model also used for time-series prediction.
- Can the data scientist estimate and select data for their goals from Assignment 1? Can they map data sets from Assignment 2 onto appropriate ML models?
- Transformer based Classification and Summarization models: To summarize the equity dataset, we preprocess the data and apply the BERT extractive summarizer. The summarized data then serves as input to the sentiment analysis, which generates a score for each record. Comparing the original and summarized content by length: total number of words in the original text = 90,836; total number of words after summarization = 4,304.
- Time-series forecasting model: While modeling the data, we extracted the stock market dataset from here and preprocessed it. Preprocessing included collecting the stocks and excluding the ETFs, analyzing the companies based on their ticker values, etc. We then modeled the data in order to predict the value of the stocks.
- ARIMA model: The preprocessed dataset can be mapped onto the model; the values of a particular stock (IBM in our case) varying over time are shown below:
After the modeling is done, the actual and the predicted values are also visualized:
- LSTM model: The actual and predicted values are compared and visualized.
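As a hedged illustration of how the actual and predicted series can be compared visually, the sketch below uses matplotlib; `y_test` and `y_pred` are placeholders for the held-out prices and either model's forecasts, not the project's actual arrays.

```python
# Minimal sketch: overlaying actual and predicted stock prices for visual comparison.
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_test, y_pred, title="IBM closing price: actual vs predicted"):
    plt.figure(figsize=(10, 4))
    plt.plot(y_test, label="actual")
    plt.plot(y_pred, label="predicted")
    plt.title(title)
    plt.xlabel("time step")
    plt.ylabel("price")
    plt.legend()
    plt.show()
```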
- Can the data scientist connect Story 1 with ML models/stories about what an ML model can do? To perform good ML research, what in-depth knowledge and experience with ML algorithms and ML stories does a data scientist need?
- BERT summarization models: Before performing summarization and sentiment analysis on the text data to generate scores for each company, we should pre-process each record and filter out any unnecessary columns, making sure the core of the original content is not lost before proceeding with any further operations.
- Time-series forecasting model: Since a time-series model is designed for forecasting future values and for non-stationary data, it is suitable for stock price prediction; used together with the sentiment-analyzed data, it can achieve our goal.
- To perform good research in this area, one should be able to answer questions such as: What is non-stationary data? Which models predict it well? Will a deep neural network play an important role in modeling the data, and if so, which deep learning algorithms can be used? How can sentiment analysis be integrated into the time-series model so that the relevant changes can be observed? These are some of the areas of expertise one needs in order to model the data and achieve the goal.
- When has to do with the iterations (Calibration 2). How much time did it take for experimentation? How efficient is the modeling/algorithm?
BERT summarization models:
BERT is one of the finest models Google has developed so far for NLP tasks, even though other models exist. To evaluate the performance of BERT summarization, the ROUGE-1 F1 score (which combines recall and precision) is an appropriate evaluation metric. Recall measures how many of the words (and/or n-grams) in the human reference summary appear in the machine-generated summary; in other words, it compares an automatically produced summary against a human-produced reference (or set of references). Precision measures how many of the words (and/or n-grams) in the machine-generated summary appear in the human reference summary. Finally, the F1 measure combines the two metrics.
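A minimal sketch of this evaluation, assuming the `rouge-score` package; the reference and generated summaries below are made-up examples for illustration.

```python
# Minimal sketch: ROUGE-1 recall, precision, and F1 between a reference and a generated summary.
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "Apple shares rose after strong iPhone sales lifted quarterly earnings."
generated = "Apple shares rose on strong quarterly earnings driven by iPhone sales."

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
scores = scorer.score(reference, generated)["rouge1"]

# recall    = share of reference unigrams covered by the generated summary
# precision = share of generated unigrams found in the reference summary
print(f"recall={scores.recall:.3f} precision={scores.precision:.3f} f1={scores.fmeasure:.3f}")
```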
Transformer based Sentiment analysis:
Sentiment analysis is performed on the news dataset to understand, based on each article and the news spread over different channels, what sentiment they share. To do this, we used state-of-the-art attention-based transformer representations. We customized the model by adding a separate layer, both to stabilize the training accuracy and to better capture the relationships needed to understand stock news.
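A hedged sketch of the "extra layer on top of a transformer" idea, using PyTorch and a BERT encoder; the hidden size, dropout rate, and number of classes are assumptions, not the exact customization used here.

```python
# Minimal sketch: BERT encoder + an extra task-specific layer + a classification head.
import torch
import torch.nn as nn
from transformers import AutoModel

class NewsSentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, base="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.dropout = nn.Dropout(0.3)
        # Extra layer inserted between the encoder and the classifier.
        self.hidden = nn.Linear(self.encoder.config.hidden_size, 128)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(torch.relu(self.hidden(self.dropout(pooled))))
```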
Time-series forecasting model:
The stock market and news archive datasets considered here cover around 725 companies, so modeling the data involves many iterations. For the time-series modeling, we considered one dataset at a time for both models, and the results are compared using the R² score on the test data.
The R² score for the ARIMA model is 0.99, and the R² score for the LSTM is 0.26.
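As a hedged sketch of how the two models can be compared, the snippet below computes R² with scikit-learn; the arrays are made-up placeholders, not the project's actual predictions.

```python
# Minimal sketch: comparing ARIMA and LSTM forecasts with the R^2 score.
from sklearn.metrics import r2_score

# Placeholder arrays standing in for held-out prices and each model's forecasts.
y_test     = [145.2, 146.0, 147.1, 146.5, 148.3]
arima_pred = [145.0, 146.2, 146.9, 146.7, 148.0]
lstm_pred  = [143.5, 144.0, 149.2, 144.9, 150.1]

print(f"ARIMA R^2: {r2_score(y_test, arima_pred):.2f}")
print(f"LSTM  R^2: {r2_score(y_test, lstm_pred):.2f}")
```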
Finally, the effect of sentiment on the stock prices of some of the companies is visualized and shown below:
- Can the data scientist determine the acceptance level of the model (validation with accuracy and runtime performance) considering the targeted users?
BERT Summarization models:
To determine the acceptance level of the model, we calculate the F1 score from recall and precision, using ROUGE for recall and BLEU for precision.
Time-series forecasting model:
Considering each stock in the dataset, both the prediction accuracy and the model runtime are quite good; based on the stock value predictions, the accuracy seems acceptable for the targeted users.
- Where has to do with the learning environment. Where did this experiential learning process take place? For example, it was part of an online Deep Learning course.
BERT summarization is a recent development in NLP from the Google research team. When doing research that requires quick summaries of a text dataset with millions of records, BERT summarization is very helpful for the task.
Transformer-based classification is done by customizing the model: we changed the last few layers of the network to perform classification-based sentiment analysis.
Before even modeling the considered dataset, for the time-series modeling we went through articles on the web and free video tutorials and discussed among teammates which approach and modeling technique to use. After each milestone, we identified improvements to be made to the model in order to improve its overall performance.
-
Why explains the modeling. Explainable ML models.
- Transformer based Classification & Summarization models: Performing sentiment analysis directly on the given dataset, which contains huge amounts of text, is difficult because of the system resources, time, and cost involved. Instead, we summarize the text of each record using the BERT extractive summarizer and use that summary for sentiment analysis, generating a score for each record without losing the core content. The BERT summarizer is a recent development in NLP from Google, although a few other techniques are available. Most research institutes follow a similar architecture to build summarization models and report strong F1 scores (recall via ROUGE and precision via BLEU).
- Time-series forecasting model: A famous and widely used forecasting method for time-series prediction is the AutoRegressive Integrated Moving Average (ARIMA) model. ARIMA models are capable of capturing different standard temporal structures in time-series data.
- Terminology:
AR (AutoRegressive): means that the model uses the relationship between an observation and a number of lagged observations.
I (Integrated): means that the model performs differencing of raw observations (e.g. it subtracts an observation from the observation at the previous time step) in order to make the time series stationary (turning the non-stationary data into stationary data).
MA (Moving Average): means that the model uses the relationship between the residual errors and the observations.
Model parameters:
The standard ARIMA model expects three input parameters, p, d, and q (a fitting sketch follows the list below):
- p is the number of lag observations.
- d is the degree of differencing.
- q is the size/width of the moving average window.
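A hedged sketch of fitting an ARIMA(p, d, q) model with statsmodels; the order (5, 1, 0), the file name, and the 30-day hold-out are illustrative assumptions rather than the project's exact configuration.

```python
# Minimal sketch: fit ARIMA on a closing-price series and forecast the held-out period.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

prices = pd.read_csv("ibm.us.txt", parse_dates=["Date"], index_col="Date")  # assumed file
series = prices["Close"].sort_index()

train, test = series[:-30], series[-30:]      # hold out the last 30 days

model = ARIMA(train, order=(5, 1, 0))         # p=5 lags, d=1 differencing, q=0
fit = model.fit()
forecast = fit.forecast(steps=len(test))

print(forecast.head())
```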
-
LSTM model: One of the most sought-after deep neural network models for time-series analysis. LSTMs are able to remember past values, so the present stock price is predicted based on them.
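A hedged sketch of this idea in Keras: a window of past (scaled) closing prices predicts the next one. The window size, layer sizes, epochs, and the synthetic placeholder series are assumptions for illustration.

```python
# Minimal sketch: LSTM forecaster that maps the previous `window` values to the next value.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(values, window=60):
    X, y = [], []
    for i in range(window, len(values)):
        X.append(values[i - window:i])
        y.append(values[i])
    return np.array(X)[..., np.newaxis], np.array(y)   # shape: (samples, window, 1)

# Placeholder standing in for the normalized closing-price series (a 1-D array).
scaled_prices = np.sin(np.linspace(0, 20, 500))
X, y = make_windows(scaled_prices)

model = Sequential([
    LSTM(50, input_shape=(X.shape[1], 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

predicted = model.predict(X[-5:])   # forecast a few recent points
print(predicted.ravel())
```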
- How: If you would like, you can add a dimension of how. How did it happen? Sometimes, the answer to how can be covered by what, when and where.
- Transformer based Classification & Summarization models:
So far, we have performed sentiment analysis on text summarized with the BERT extractive summarizer. In addition, we also implemented abstractive summarization instead of extractive summarization and compared the recall, precision, F1, and sentiment scores across the whole dataset.
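A hedged sketch of the abstractive alternative, using the Hugging Face summarization pipeline; the default model, the length limits, and the example article are assumptions for illustration.

```python
# Minimal sketch: abstractive summarization via a seq2seq summarization pipeline.
from transformers import pipeline

abstractive = pipeline("summarization")   # defaults to a BART-style model

article = (
    "The company announced better than expected earnings for the quarter, citing strong "
    "demand for its cloud services. Analysts raised their price targets after the report, "
    "and the stock rose in after-hours trading."
)

print(abstractive(article, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])
```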
- Time-series forecasting model:
While looking into the various possibilities for modeling the stock data over time, we came across time-series modeling both with classical methods and with neural networks, and in the end it helped us achieve our goal.
1. Make a copy of and run the following Google Colab notebook, then copy the ngrok tunnel URL here.
2. Create a new Notebook and run your model in the Jupyter Lab window.
3. Save results and run inference on Jupyter Dash / your own visualization server.
4. Run ./ngrok http <your port> and copy the application link.
Colab Links:
https://colab.research.google.com/drive/18lHSmbVT-hVTZEtZfuywUzd4z5R0wQDY?usp=sharing
https://colab.research.google.com/drive/12E0wqMxJ_Tu-Z-_pCWFs7wT2PVjlRcAf?usp=sharing
https://colab.research.google.com/drive/1FpGouGI94Y3aS7GAsFgzpUvMvAE8Ow6S?usp=sharing
Reference: EduKC