Mining dynamic social networks from public news articles for company value prediction.
1. Mining dynamic social networks from public news articles for company value prediction.
- PRATIK, MICHEL, KAI & MINGHAO
2. Objectives and Key notes
What we discovered!
1. Study, analyze and understand impactful relations that exist between companies.
2. Transform the discovered relations into intercompany networks, revealing features
and metrics about the company.
3. Generate models that integrate network-feature metrics with company financial valuations in order to predict a company's future value or profit over time.
e.g. metrics such as the number of companies a company relates with (network feature metric) and the company's profit (financial metric).
3. Concepts and Techniques utilized.
Network Analysis
Graph theory
Ranking
Machine learning Algorithms
Regression (𝑦 = 𝑎 + 𝑏𝑥)
Statistical Methods
Correlation ($R^2$).
Mean Squared Error.
Algebraic equations
e.g. the one used for the relation score
4. Choice of research domain
Document-level and sentence-level co-occurrence
The more companies co-appear or are described together in important news articles
and/or sentences, the stronger their mutual relationship.
NB: The study doesn't extract specific relations separately but rather generalizes all co-occurrences as impact relations, i.e., how much impact a company receives from others, by considering positive/negative structural impacts from networks.
5. Research Coverage
For a Target company
Generation of inter-company networks entailing local and global relations, historical relations and the change (delta) in the impact of relations over time.
Borrowing the ranking idea behind the PageRank algorithm used in information retrieval systems: companies are ranked by each network feature and by company valuations (e.g. profit).
Use of machine learning algorithms such as linear regression and SVM regression to combine the features of the longitudinal network with a company's financial information to predict the company value.
6. Extracting Data
New York Times
Social Network Data
Drawn from the large-scale public data about companies available in the news and electronically through the web (mainly news articles). Data dated from 1981 to 2009 (year by year).
e.g. IBM appeared in about 300 news articles in the New York Times in 2009 (277 articles
as IBM and 84 articles as International Business Machines).
Interviews, Questionnaires and Observations.
Financial Data.
Company valuations were also obtained from the Fortune 500 list (1955-2009).
7. Pre-processing the data
For a Target company
For target company x, let the candidate company be y (one that impacts x during a period of time t). The sets of documents D and sentences S in which they co-occurred during time t are collected.
Generating longitudinal directed/undirected and valued/unvalued networks over a period of years for a set of companies $V$:
$G^{T} = \{G^{t_1}, G^{t_2}, G^{t_3}, \dots\}$, where $t_1 < t_2 < t_3$
For each company $x \in V$, a structural feature vector $F_x^{T}$ is generated, $F_x^{T} \subseteq G^{T}$, where $F_x^{T}$ indicates the network effects for target company $x$.
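The time-ordered sequence of networks can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that relation scores have already been computed per year, stores each yearly snapshot as a valued, undirected adjacency dictionary, and uses hypothetical score values and a hypothetical function name.

```python
from collections import defaultdict

def build_longitudinal_networks(records):
    """Group (year, x, y, score) records into one valued, undirected graph per year."""
    networks = defaultdict(lambda: defaultdict(dict))
    for year, x, y, score in records:
        # Undirected: store the edge weight in both directions.
        networks[year][x][y] = score
        networks[year][y][x] = score
    # Return plain dicts, ordered by year so that t1 < t2 < t3 ...
    return {
        year: {node: dict(nbrs) for node, nbrs in networks[year].items()}
        for year in sorted(networks)
    }

# Hypothetical relation scores, for illustration only.
records = [
    (2009, "IBM", "Microsoft", 4.1),
    (2008, "IBM", "Microsoft", 3.2),
    (2009, "IBM", "SPSS", 1.5),
]
G = build_longitudinal_networks(records)
```

Here `G[2009]["IBM"]` lists IBM's related companies and edge weights for 2009, so a feature vector for a target company can be computed snapshot by snapshot.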
9. Calculating Impact relation Strength
Algorithm
$Score_x(y) = a \sum_{i \in D^{t}_{x,y}} w_d(i) + b \sum_{j \in S^{t}_{x,y}} w_s(j)$
$w_d(i)$ and $w_s(j)$: weights computed over the documents and sentences in which target company $x$ and candidate company $y$ co-occur.
$w_d(i) = \log\left(1 + \frac{1}{Y'(i)} + \frac{tf_x(i)}{\sum_{y \in \{x, Y\}} tf_y(i)}\right)$
$w_s(j) = \log\left(1 + \frac{1}{Y''(j)}\right)$
e.g. IBM in 2009. Microsoft apparently had the greatest impact on IBM in 2009: they co-occurred in 55 articles and were described together in 264 sentences. From these sentences, we can infer that they are direct competitors.
Sometimes the impact isn't obvious. SPSS and IBM are not competitors and co-occurred in only 1 article and 3 sentences, but their relation is important because that article is a high-weight document (the entire article describes only the acquisition relation between SPSS and IBM).
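The effect of a high-weight document can be checked with a small sketch of the relation score. It assumes the natural logarithm, equal trade-off constants a = b = 0.5, and made-up counts; the function names and the input layout (one tuple per document, one count per sentence) are hypothetical, chosen only for illustration.

```python
import math

def doc_weight(n_companies, tf_x, tf_total):
    """w_d(i) = log(1 + 1/Y'(i) + tf_x(i) / total term frequency of the companies considered)."""
    return math.log(1 + 1 / n_companies + tf_x / tf_total)

def sent_weight(n_companies):
    """w_s(j) = log(1 + 1/Y''(j))."""
    return math.log(1 + 1 / n_companies)

def relation_score(docs, sents, a=0.5, b=0.5):
    """Weighted sum of document and sentence weights (a, b are assumed values)."""
    doc_part = sum(doc_weight(*d) for d in docs)
    sent_part = sum(sent_weight(s) for s in sents)
    return a * doc_part + b * sent_part

# A focused article naming only 2 companies versus a crowded one naming 10.
focused = doc_weight(n_companies=2, tf_x=3, tf_total=6)
crowded = doc_weight(n_companies=10, tf_x=1, tf_total=20)

# One focused document and one sentence naming 2 companies.
score = relation_score(docs=[(2, 3, 6)], sents=[2])
```

With these numbers the focused article gets a weight of log 2, several times the weight of the crowded one, which is why a single article can matter as much as many passing co-mentions.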
10. Mining Longitudinal Network
Network effects
Six types of network effects are considered.
1. The number of connections that target company has.
2. Distance between x and its related nodes.
3. The number of connections that the companies relating with target company have.
4. Number of connections among x’s related nodes.
5. Distance between target company’s related nodes.
6. Number of node pairs having x on the shortest path.
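The six effects listed above can be sketched on a toy graph. This is an illustrative reading, not the paper's exact definitions: it assumes an unweighted, undirected graph stored as a dictionary of neighbor sets, treats every node reachable from x as a related node, averages the distance and degree measures, and requires at least two related nodes. All names are hypothetical.

```python
from collections import deque
from itertools import combinations

def bfs_distances(graph, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def network_effects(graph, x):
    dist = {n: bfs_distances(graph, n) for n in graph}
    related = [n for n in dist[x] if n != x]  # nodes reachable from x
    pairs = list(combinations(related, 2))
    return {
        # 1. number of connections of the target company
        "degree_x": len(graph[x]),
        # 2. distance between x and its related nodes (mean)
        "mean_dist_x_to_related": sum(dist[x][i] for i in related) / len(related),
        # 3. number of connections of the related companies (mean)
        "mean_degree_related": sum(len(graph[i]) for i in related) / len(related),
        # 4. number of connections among x's related nodes
        "links_among_related": sum(1 for i, j in pairs if j in graph[i]),
        # 5. distance between related nodes (mean)
        "mean_dist_among_related": sum(dist[i][j] for i, j in pairs) / len(pairs),
        # 6. node pairs with x on at least one shortest path
        "pairs_through_x": sum(
            1 for i, j in pairs if dist[i][x] + dist[x][j] == dist[i][j]
        ),
    }

# Toy graph with edges x-A, x-B and A-C.
graph = {"x": {"A", "B"}, "A": {"x", "C"}, "B": {"x"}, "C": {"A"}}
effects = network_effects(graph, "x")
```

In the toy graph, x sits on a shortest path for the pairs (A, B) and (B, C) but not for (A, C), which are directly linked.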
11. Mining Longitudinal Network
1. Network effects generation
A set of nodes that directly or indirectly impact focal company $x$ is generated: $N_x$
Three different types of node pairs are defined:
$(x, i)\ \forall\, i \in N_x$
$(i, j)\ \forall\, i, j \in N_x,\ i \neq j$
$(i, k)\ \forall\, i \in N_x,\ k \in V$
Measures of degree connectivity $\beta(i, j)$, eccentricity $\mu(i, j)$ and betweenness $\zeta_x(i, j)$ are computed and then standardized to the network size $|V|$.
12. Further analysis on the Networks
Traversing the valued directed network for more patterns revealing possible impact relations.
1. Two new sub-networks are incorporated.
Neighboring node sets $L_x$, which are considered to exert an impact on x through their direct connection to $N_x$.
NB: $L_x : N_x$ shows the degree to which companies are directly related to x rather than indirectly.
2. Only arcs (directed edges) are retained, to reveal who is impacting whom.
3. Step 1 (network effects generation, previous slide) is repeated to obtain historical network effects.
13. Network Feature Selection
Filtering out companies with maximum Impact
Individual feature selection.
Companies are ranked by network features 𝑓𝑖 and by their valuations (profit).
𝑋𝑖 – Rank vector of companies ranked by network feature
Y – Companies ranked by their valuations like profit.
Spearman’s rank correlation is calculated between 𝑋𝑖 and Y.
The salient implication is that if there is an increase in the ratio of the number of
connections that a company has with the numbers of connections that its neighbors
have, then the value of its profits will increase.
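The ranking comparison can be sketched with the textbook Spearman formula for untied ranks, $\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$. The feature and profit values below are made up for illustration, and the helper names are hypothetical.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman's rank correlation for two equally long, untied sequences."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical values for four companies.
feature = [0.9, 0.4, 0.7, 0.1]   # one network feature f_i per company
profit = [120, 30, 80, 50]       # valuation per company
rho = spearman(feature, profit)
```

A feature whose ranking closely follows the profit ranking gives a correlation near 1 and is a candidate for the prediction model; a value near 0 suggests the feature carries little ranking information about profit.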
14. Prediction Model
Network effects + Company valuations
Longitudinal network effects and the valuations of each target company x are integrated into:
Linear regression model (LRM): predicting a company's current or future financial value.
Support vector regression model (SVR): to learn the parameters.
Experimental results.
20 Fortune companies are selected as a sample. Their valuation records (i.e. profits) are captured and networks are generated.
First, the mean profit of the companies is calculated. The model is then trained on the records spanning each five-year window of networks and tested by predicting the next five years' profits; the predictions are compared with the actual values.
This is then repeated for a single company.
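The train-then-test procedure can be sketched with a one-variable least-squares fit of y = a + bx, scored by mean squared error. This is a deliberate simplification: the study combines several network features with financial data, whereas the sketch uses a single hypothetical feature and made-up numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b

def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((p - q) ** 2 for p, q in zip(actual, predicted)) / len(actual)

# Hypothetical training window: network feature x and profit y for five years.
train_x, train_y = [1, 2, 3, 4, 5], [3, 5, 7, 9, 11]
a, b = fit_line(train_x, train_y)

# Hypothetical test window: predict profits, then compare with the actual values.
test_x, test_y = [6, 7], [13, 16]
predicted = [a + b * x for x in test_x]
error = mse(test_y, predicted)
```

A lower error on the held-out window indicates a better model, which is the basis on which the network-only, financial-only and joint models are compared.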
15. Performance Evaluation
Prediction of the mean profits of 20 companies
Discovered
Network features do not seem to contribute to revenue prediction but rather contribute to predicting companies' profit.
Company profit prediction by joint network and financial analysis outperforms network-only by 150% and financial-only by 34%.
17. Aspects of Network science in paper.
Graph theory: measures such as degree of connectivity, diameter and shortest path are used to calculate network effects.
Developing models to understand the network
Extracting data from the NYT; the problem statement part of the paper.
Building models to anticipate the evolution of the networks.
Network effects, company valuations
Constructing models to optimise the outcomes of networks
Experimental results and improvements.
18. What else can be done.
Improvements
1. A company's value (or performance) may encompass several factors depending on the context in which it is defined, such as market performance, employee satisfaction and responsibility. Analysis of these areas can potentially improve the model's performance.
2. More social network data sources can be used, especially social media, e.g. Twitter or Facebook analysis to obtain the longitudinal social network data.
3. Categorizing relations as negative or positive using sentiment analysis, and handling the networks separately, i.e. positive impact relation networks on their own and negative impact relation networks on their own.
Editor's Notes
Specifically, the paper aims to address the three bullets, in the order placed above.
Main point: Researchers aimed to develop a formula that would predict a company's financial value over a period of time.
Techniques employed included: Network Analysis (graph theory), Statistical Methods (correlation), Machine Learning Algorithms (regression), and algebraic equations (e.g. the one used for the relation score).
The concept introduced in this research was interesting, not one that can easily be thought of.
It assumed that if companies co-appear in written records, then they are most likely impacting each other in one way or another, which definitely makes sense.
For instance, in football, you'll often see two giant rival clubs mentioned alongside each other in documents, articles or anywhere else. They impact each other by virtue of their rivalry: as one goes into the market to purchase top players, the other makes a similar move just to stay on top.
However, the mutual relationship was something we didn't agree with, because the impact isn't necessarily based on a common understanding, but is rather an automatic impact. So when the researcher claims mutual relationship
Extracting data about the relations on a local as well as global level and going back through the years to capture historical relations between companies was smart. Past and present statistics speak volumes about the future.
In order to filter out the companies that made the most impact, an algorithm that ranks the relations between them was useful.
PageRank: used to rank the importance of web pages by the count of backlinks to the page.
Regression (a machine learning algorithm) is a reliable predictive analysis tool that was used to project outcomes after putting together all the necessary metrics discussed earlier: network feature metrics and financial metrics.
The New York Times was the source of the data used in the research.
Interviews, questionnaires and observations were a brilliant addition because researchers would then have more elaborate answers to their questions, which would validate the data published in articles.
DISAGREE: A variety of data sets generated from different locations would be ideal. We didn't agree with the fact that only one source was used.
The target company was the focal company, so the companies with which it relates formed the network of this company.
The set $G^{T}$ represents the graphs for the specified times; $x \in V$.
You can flip through slides and talk about the network effects to make this clear enough.
The relation score was an indicator of the strength of the relationship between companies. It can otherwise be understood as degree of connectivity, if interpreted through graph theory.
Candidate company: the company being investigated to discover how strong its relationship with the target company is.
The relation score was obtained by summing the weights of the documents and sentences in which the companies of interest co-appeared.
The weights were obtained using the formulas above. $Y'(i)$ and $Y''(j)$: counts of the company names that appear in document i and sentence j.
$tf_x(i)$ and $tf_y(i)$: frequencies of company names x and y in document i.
a and b were constants that represented the trade-off between document weight and sentence weight.
i.e. the higher this metric, the more connected the involved companies were.
Degree of connectivity of target company.
Eccentricity of target company.
Degree of connectivity of candidate companies.
Vertex degree of the graph
Eccentricity of candidate companies (related nodes).
Betweenness centrality.
$(x, i)\ \forall\, i \in N_x$: target company x and a candidate company i, for each company i in node set $N_x$.
$(i, j)\ \forall\, i, j \in N_x,\ i \neq j$: candidate companies i and j both belong to node set $N_x$.
$(i, k)\ \forall\, i \in N_x,\ k \in V$: k belongs to the full network of all nodes, so it can belong to any subgraph within the entire network.
$L_x$: set of nodes that are indirectly connected to x
$N_x$: set of nodes that are directly connected to x
3. We agree; this would be very useful.
Clearly there is an overlap in the research methodologies of these three areas: they draw on data gathered from social networks, infrastructures, sensors and the Internet of Things.