This document discusses using open company data from OpenCorporates to map corporate networks and structures. It introduces the concept of mapping relationships between companies and directors to identify "corporate sprawls" - groups of companies connected by shared directors. The document demonstrates how to construct these maps by starting with seed companies and snowballing to find other connected companies. It shows how these maps can be analyzed and visualized, for example to identify important companies or directors. Further opportunities discussed include adding geographical and temporal dimensions to the analysis.
2. As company filings start to appear as open data, opportunities
may arise for watchdogs to start mining this data in support of
their investigations and monitoring activities.
This presentation introduces several ideas relating to mapping
network structures in order to learn something about the
structure of “corporate sprawls”, corporate groupings defined
on the basis of co-director relationships.
6. This network map shows Twitter users who are commonly
followed by the followers of @TOGYnews
Although hard to see at this scale, the map is actually
constructed from labeled points connected by lines (in the
jargon, “nodes connected by edges”).
The algorithm used to position the labeled nodes tries to place
nodes that are heavily connected to each other close to each
other. In a sense, we can view the diagram as a map, with
regions that are highlighted using false colours identifying
clusters of nodes that may in some sense be similar to each
other based on the sharing of common followers.
8. The map is constructed using data grabbed from the Twitter
API.
Using one or more “focus” users (a specific Twitter account, for
example, or the set of users of a particular hashtag), we grab a
list of their followers.
10. For each of the followers, we grab a list of their friends (or a
sample thereof) – that is, a list of some or all of the people
they follow on Twitter.
We can use this data to construct a network of people followed
by the followers of the original focus.
It is typically at this point, where there is most relational
information contained within the network, that we lay it out
using automatic layout tools.
12. Drawing on the insight that people on Twitter are likely to
follow accounts that are of interest to them, we can start to
imagine the network as a projection of the interests of the
people who are interested in one or more of the things the
focus is associated with.
However, interests of followers may spread to a wide range of
topics, so we look for consistency of interest, pruning the
network to remove people who are not commonly followed by
the followers of the focus. That is, we remove nodes who are
followed by only a few of the followers of the focus.
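
The pruning step can be sketched as follows. This is a minimal illustration that assumes the follower data has already been fetched into a plain dictionary; the account names and the `min_followers` threshold are made up, and no Twitter API calls are shown.

```python
from collections import Counter

# Hypothetical data: each follower of the focus account mapped to the
# set of accounts that follower follows (their "friends").
friends_of = {
    "follower_a": {"acct_1", "acct_2", "acct_3"},
    "follower_b": {"acct_1", "acct_2"},
    "follower_c": {"acct_1", "acct_4"},
}

def commonly_followed(friends_of, min_followers):
    """Return accounts followed by at least min_followers of the followers."""
    counts = Counter()
    for friends in friends_of.values():
        counts.update(friends)
    return {acct: n for acct, n in counts.items() if n >= min_followers}

kept = commonly_followed(friends_of, min_followers=2)

# Edges of the pruned network: follower -> commonly followed account
edges = [(f, a) for f, friends in friends_of.items()
         for a in sorted(friends) if a in kept]
```

Raising `min_followers` gives a sparser map that reflects only the most consistently shared interests.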
14. Having laid out the network map, we might now tidy it up a
little by removing all the nodes that are not themselves
followed by a significant number of the followers of the
original focus.
18. It may also be possible to use metadata associated with social
networks to develop additional insights.
A recent paper describes one way of mining social network
data for information about people working for a particular
company, and using public biographical information along with
social connection data to map out the organisational structures
of large companies.
22. For example, if we can get hold of directorial appointment and
termination data, we can start to construct maps that show how
companies are connected by common directors, as well as
which companies are co-directed by particular directors.
As with the emergent social positioning network maps, if
particular directors have particular corporate interests, we may
be able to identify particular organisational groupings in
corporate sprawls made up from dozens of operating
companies working across a range of business areas.
24. One possible source of open company information is
OpenCorporates.
OpenCorporates’ ambitious aim is to mint a unique corporate
identifier for every corporate legal entity in the world [CHECK],
as well as collating and normalising (or “harmonising”)
company information about company filings, trademarks,
patents(?) and officers (that is, company directors, company
secretaries and so on).
For GB registered companies, there is a growing repository of
data relating to company directorships, which provides us with
an opportunity to develop maps that show how companies are
connected by virtue of having common directors.
26. Just a note – my experience in looking at data related to GB
registered companies suggests that the directors of the
“top”/nominal company in a large multinational grouping are
“atypical” compared to the officers appointed to UK based
operating companies in the same corporate sprawl, being
appointed from the great and the good, or from senior officers
who do not take directorships across operating divisions or
companies, rather than representing directors of operating
companies.
When seeding corporate sprawl trawlers – algorithms that try
to identify companies that make up a corporate sprawl based
on co-directorships – my experience suggests that it often
makes sense to seed the search with one or more operating
companies that have directors who are likely to be directors of
other operating companies, rather than the “top level”
company.
28. We can reuse the ideas that underpin the construction of the
emergent social positioning graph to map out corporate
structures based on director information.
30. As well as corporate information pages, OpenCorporates
maintains information pages about directorial appointments.
At the moment, there are no authority files providing
identifiers that identify the same physical person – each
directorial appointment to a company provides the director with
a unique officer ID. It is possible to search for officers of other
companies with the same name as a particular director, but there
are no identifiers that link them as the same physical person. (That
said, there does appear to be a slot in the metadata for
authoritative identifiers.)
34. The general shape of this diagram might remind you of
something…?
For each of the seed companies, we grab a list of their
directors.
We can use this data to construct a network of people who are
directors or other officers of the original seed company or
companies.
38. For each of the directors, we run a search for them on
OpenCorporates, to see what directorial appointments have
been made to other companies for people of exactly the same
name.
We can use this data to construct a network of companies
directed by the directors of the original seed company.
For those companies that are directed by N or more of the
directors associated with the seed company or companies
(where N is typically 2) we might now say they are part of the
corporate sprawl. The companies sharing fewer than N
directors associated with companies admitted to the corporate
sprawl are added to a list of possible candidate companies. As
we find more directors associated with companies included in
the sprawl, we might be able to “legitimise” membership of
these companies within the sprawl.
42. If we so decide, we can continue with this snowball discovery
process, looking up further directors associated with
companies we have included in our sprawl, with a view to
trying to discover more companies that should be included in
the sprawl.
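
The snowball process described above can be sketched in a few lines. This is an illustration only, not the scraper itself: it assumes the appointment records have already been collected into a list of (director name, company name) pairs, and all the names, the `snowball` function and the `max_rounds` cap are hypothetical. Because matching is on exact names, two different people who share a name would be treated as one director.

```python
from collections import defaultdict

# Hypothetical appointment records: (director name, company name).
# In practice these would come from looking up officers on OpenCorporates.
APPOINTMENTS = [
    ("A", "Seed Ltd"), ("B", "Seed Ltd"), ("C", "Seed Ltd"),
    ("A", "Op One Ltd"), ("B", "Op One Ltd"), ("D", "Op One Ltd"),
    ("C", "Op Two Ltd"), ("D", "Op Two Ltd"),
    ("A", "Side Ltd"),
]

def snowball(appointments, seeds, n=2, max_rounds=10):
    """Grow a sprawl from seed companies using an N-shared-director rule."""
    directors_of = defaultdict(set)
    companies_of = defaultdict(set)
    for director, company in appointments:
        directors_of[company].add(director)
        companies_of[director].add(company)

    sprawl = set(seeds)
    candidates = set()
    for _ in range(max_rounds):
        # All directors of companies admitted so far
        known = set().union(*(directors_of[c] for c in sprawl))
        # Companies those directors are appointed to, outside the sprawl
        reachable = set().union(*(companies_of[d] for d in known)) - sprawl
        admitted = {c for c in reachable if len(directors_of[c] & known) >= n}
        # Companies sharing fewer than n directors stay on the candidate list
        candidates = reachable - admitted
        if not admitted:
            break
        sprawl |= admitted
    return sprawl, candidates

sprawl, candidates = snowball(APPOINTMENTS, {"Seed Ltd"}, n=2)
```

In this toy data "Op Two Ltd" shares only one director with the seed company, but it is admitted in the second round once director D has been picked up via "Op One Ltd". "Side Ltd" never reaches the threshold and stays a candidate.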
44. Using this snowball approach, I have constructed a scraper on
Scraperwiki that mines OpenCorporates, given one or more
seed companies (or seed directors) to map out corporate
sprawls, limiting myself to the capture of current directors and
active companies registered in the UK.
(The code needs checking and is perhaps not as easy to use as
it might be. Developing a more robust and user friendly tool
may be worth exploring if this approach is seen to be useful.)
46. So – we can generate a network that connects companies with
their directors, and grow this network out to identify
companies that share several directors.
As with the emergent social positioning map, we can use
automatic layout tools to try to position companies and
directors close to each other based on their connectivity,
producing a map over the corporate sprawl.
50. This map shows companies in a corporate sprawl grown out
from Royal Dutch Shell.
Note the presence of BP in there – somehow, these two
groupings are connected by shared directorships of some
intermediate company.
52. One of the nice things about representing this sort of structure
in an abstract mathematical or computational way is that we
can wrangle it with code...
So for example, companies C1 and C2 are connected by a
single shared director, whereas C2 and C3 are connected by
two directors.
54. We can represent this by transforming the original bipartite
(two types of node) graph, which connects directors to
companies and companies to directors, into a graph that just
connects companies that share directors.
The thickness of the line (or “edge”) connecting the companies
represents its “weight”, which in this case is given by the
number of shared directors between connected companies.
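
As a small sketch of this transformation, the code below collapses a list of (director, company) pairs into weighted company-to-company edges. The data mirrors the C1/C2/C3 example; the director labels and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical bipartite edges: (director, company), arranged so that
# C1 and C2 share one director while C2 and C3 share two.
BIPARTITE = [
    ("D1", "C1"), ("D1", "C2"),
    ("D2", "C2"), ("D2", "C3"),
    ("D3", "C2"), ("D3", "C3"),
]

def project_companies(edges):
    """Collapse director-company edges into weighted company-company edges."""
    companies_of = defaultdict(set)
    for director, company in edges:
        companies_of[director].add(company)
    weights = Counter()
    for companies in companies_of.values():
        # Every pair of companies sharing this director gains one unit of weight
        for pair in combinations(sorted(companies), 2):
            weights[pair] += 1
    return weights

company_edges = project_companies(BIPARTITE)
```

Swapping the order of each pair, so that companies play the role of the shared node, gives the corresponding director-to-director graph.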
56. We can also filter the graph, for example by adding together
the weights of all the edges incident on a node, and throwing
away all nodes for whom this sum is below a specified
threshold value.
We might alternatively prune the network by removing
(“cutting”) all edges below a specified weight, and then
throwing away nodes that aren’t connected to other nodes.
(For example, we might remove connections between
companies that only share a single director, and then throw
away companies that aren’t connected to any other
companies. Which is to say, we cut out companies that don’t
share two or more directors with any other single company.
When you start working with graphs, you begin to realise quite
how beautiful, and powerful, a way they are for working with data
elements that are related to each other in some way.)
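
Both filters can be sketched over a dictionary of weighted edges. The edge data and function names are hypothetical, and the first filter computes each node's summed weight once on the original graph rather than repeating the calculation after nodes have been removed.

```python
from collections import defaultdict

# Hypothetical weighted edges: (company, company) -> number of shared directors
EDGES = {("C1", "C2"): 1, ("C2", "C3"): 2, ("C3", "C4"): 3, ("C4", "C5"): 1}

def filter_by_weighted_degree(edges, threshold):
    """Keep edges whose endpoints both have summed edge weight >= threshold."""
    strength = defaultdict(int)
    for (a, b), w in edges.items():
        strength[a] += w
        strength[b] += w
    keep = {node for node, s in strength.items() if s >= threshold}
    return {e: w for e, w in edges.items() if e[0] in keep and e[1] in keep}

def cut_light_edges(edges, min_weight):
    """Drop edges lighter than min_weight; isolated nodes vanish with them."""
    kept = {e: w for e, w in edges.items() if w >= min_weight}
    nodes = {n for e in kept for n in e}
    return kept, nodes

by_degree = filter_by_weighted_degree(EDGES, threshold=4)
by_cut, remaining = cut_light_edges(EDGES, min_weight=2)
```

The two approaches answer slightly different questions: the first keeps companies that are heavily connected overall, the second keeps companies that have at least one strong tie.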
58. Here’s an example of the Shell corporate sprawl with the
directors removed and edges connecting companies that share
two or more directors. The labels are sized relative to the
PageRank score of each node, which is a measure of how well
connected the node is in the graph (the “importance” of each
node is dependent on the “importance” of the nodes
connected to it….)
The lines also provide a background that highlights the
connectivity – and structure – of the corporate elements.
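
To make the idea behind PageRank concrete, here is a minimal power-iteration sketch over an undirected, weighted edge list. It is not the code used to produce the map; the damping value, iteration count and toy "Hub" graph are assumptions, and network tools normally provide this statistic ready-made.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on an undirected graph."""
    neighbours = {}
    for (a, b), w in edges.items():
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w
    nodes = list(neighbours)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share...
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # ...plus a share of each neighbour's rank, split by edge weight
        for n in nodes:
            total = sum(neighbours[n].values())
            for m, w in neighbours[n].items():
                new[m] += damping * rank[n] * w / total
        rank = new
    return rank

# Hypothetical star-shaped sprawl: one hub company tied to three others
scores = pagerank({("Hub", "A"): 1, ("Hub", "B"): 1, ("Hub", "C"): 1})
```

The hub ends up with the largest score because every other node passes all of its rank to it.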
60. In this view, I have resized the labels based on the
betweenness centrality of each node. This network statistic
highlights nodes that play an important role in connecting
clusters or groupings of nodes. So for example, we see the
suggestion that The Consolidated Petroleum Company and
Shell Mex and BP Limited may be the companies that connect
the Shell sprawl to the BP one.
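
Betweenness centrality can be illustrated with a short, unweighted version of Brandes' algorithm. This is a sketch under assumptions: the adjacency data is a made-up pair of clusters joined through a single "Bridge" company, and edge weights are ignored.

```python
from collections import deque

def betweenness(adjacency):
    """Unweighted betweenness centrality (Brandes' algorithm), undirected graph."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search counting shortest paths from source
        order = []
        preds = {v: [] for v in adjacency}
        paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies back from the farthest nodes
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted from both ends
    return {v: s / 2.0 for v, s in score.items()}

# Hypothetical graph: two small clusters joined only through "Bridge"
ADJACENCY = {
    "S1": ["S2", "Bridge"], "S2": ["S1", "Bridge"],
    "Bridge": ["S1", "S2", "B1", "B2"],
    "B1": ["Bridge", "B2"], "B2": ["Bridge", "B1"],
}
centrality = betweenness(ADJACENCY)
```

Only "Bridge" scores above zero here, since every shortest path between the two clusters has to pass through it.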
64. Just as we collapsed the network to show how companies
could be linked directly by virtue of co-directorships, so we can
collapse the network to show how directors are connected.
For example, director D1 is connected by a single shared
company to directors D2 and D3, whereas D2 and D3 are
connected by two companies.
70. As to how we get those graphs plotted? I built a crude
workflow in Scraperwiki that gets data out of the scraped
database and into a form that allows it to be visualised using
the Gephi desktop tool or in a web page using different
Javascript libraries (sigma.js or d3.js).
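
The workflow itself is not reproduced here. As a hedged illustration of the "into a form that can be visualised" step, one simple interchange format is a CSV edge list with Source, Target and Weight columns, which Gephi's spreadsheet import is generally able to read. The function name and sample edges are assumptions.

```python
import csv
import io

def edges_to_csv(edges):
    """Serialise weighted edges as a Source,Target,Weight CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()

# Hypothetical company-company edges weighted by shared directors
csv_text = edges_to_csv({("C1", "C2"): 1, ("C2", "C3"): 2})
```

The same rows could equally be written out as JSON for a browser-based library.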
72. This is Gephi – a cross-platform desktop tool that’s great for
generating effective network visualisations. I have some
tutorials and sample datasets if anyone wants to give it a
whirl…
74. So where can we take the OpenCorporates data next?
I have a couple of ideas:
- we can go spatial in a geographical sense and start to
geocode the registered addresses of companies, to see
whether any of them are located in offshore tax havens, for
example, or to see whether there are different registered
addresses that might lead us to yet more companies (by virtue
of sharing common registered office addresses, rather than co-
directors, for example);
- we could start trying to tie non-GB registered companies into
the mix. At the moment, director information for other
territories is sparse – might there be some other way we can
look for connections?
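
The shared-address idea can be sketched without any geocoding at all, simply by grouping companies on a crudely normalised registered address. The records and helper names below are hypothetical, and real addresses would need much more careful cleaning.

```python
from collections import defaultdict

# Hypothetical records: (company name, registered address)
COMPANIES = [
    ("Alpha Ltd", "1 High Street, London, EC1A 1AA"),
    ("Beta Ltd", "1 high street,  london, ec1a 1aa"),
    ("Gamma Ltd", "PO Box 1, George Town, Cayman Islands"),
]

def normalise_address(address):
    """Crude normalisation: lower-case and collapse runs of whitespace."""
    return " ".join(address.lower().split())

def group_by_address(companies):
    """Return addresses shared by two or more companies."""
    groups = defaultdict(list)
    for name, address in companies:
        groups[normalise_address(address)].append(name)
    return {addr: names for addr, names in groups.items() if len(names) > 1}

shared = group_by_address(COMPANIES)
```

Companies that fall into the same group could then be treated as extra candidates for the sprawl, alongside those found through co-directorships.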
75. And
“When”?
- company timelines (set-up dates, renaming)
- explore director timelines (by company)
- explore director timelines (by director)
76. Another approach might be to start analysing corporate
sprawls in a time dimension. There are several opportunities
here:
- If we have access to company formation and dissolution
dates, we can map out a timeline of a corporate sprawl, which
might reveal how companies change name, directorship or
association with other companies;
- if we get all the director information associated with a
company, we can visualise how director appointments and
terminations occurred across one or more companies, which
might in turn reveal identifiable “features” that we might be
able to associate with news or business restructuring events;
- if we track down companies a particular director appears to
be associated with, we can start to develop “career timelines”
of directors, showing how they have been associated with
different corporate groupings over time (and maybe the odd
company on the side…)
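
A minimal sketch of the director timeline ideas, assuming appointment and termination dates are available as simple records. The names, dates and function names are invented for illustration.

```python
from datetime import date

# Hypothetical appointment records:
# (director, company, appointed on, terminated on or None if still current)
APPOINTMENTS = [
    ("J Smith", "Op Two Ltd", date(2005, 6, 1), None),
    ("J Smith", "Op One Ltd", date(1999, 1, 15), date(2005, 5, 31)),
    ("A Jones", "Op One Ltd", date(2001, 3, 1), None),
]

def career_timeline(appointments, director):
    """Return one director's appointments, ordered by appointment date."""
    rows = [a for a in appointments if a[0] == director]
    return sorted(rows, key=lambda a: a[2])

def directors_on(appointments, company, day):
    """Return directors of a company who were in post on a given day."""
    return sorted(
        d for d, c, start, end in appointments
        if c == company and start <= day and (end is None or day <= end)
    )

timeline = career_timeline(APPOINTMENTS, "J Smith")
board_2003 = directors_on(APPOINTMENTS, "Op One Ltd", date(2003, 1, 1))
```

Plotting each row of `timeline` as a bar from its start date to its end date would give a simple "career timeline" chart.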
78. Whilst it is possible to generate insight from the analysis of
data that is contained just within OpenCorporates, there are
likely to be many opportunities for using OpenCorporates to
annotate other datasets, or use external datasets to annotate
OpenCorporates data.
80. As this example starts to explore, we might try to reconcile
company names as recorded in local spending data records
with corporate entities identified within OpenCorporates to
build up a better picture of how money flows into corporate
sprawls.
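
Name reconciliation of this kind can be sketched with the standard library's fuzzy matching. This is a rough illustration only: the suffix list, the `cutoff` value and the sample names are assumptions, and a real reconciliation would need manual checking of the matches.

```python
import difflib
import re

# Assumed list of common UK company suffixes to ignore when comparing names
SUFFIXES = {"ltd", "limited", "plc", "llp"}

def normalise_name(name):
    """Lower-case, strip punctuation and drop common company suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)

def reconcile(spending_names, registry_names, cutoff=0.85):
    """Map each spending-record name to its closest registry name, if any."""
    lookup = {normalise_name(n): n for n in registry_names}
    matches = {}
    for raw in spending_names:
        close = difflib.get_close_matches(
            normalise_name(raw), list(lookup), n=1, cutoff=cutoff)
        matches[raw] = lookup[close[0]] if close else None
    return matches

# Hypothetical names from a spending dataset and from a company register
matched = reconcile(
    ["Acme Services Ltd.", "Unrelated Supplier"],
    ["ACME SERVICES LIMITED", "BETA HOLDINGS PLC"],
)
```

Names that fail to match are returned as `None` so they can be reviewed by hand rather than silently dropped.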
On a lobbying front, we might look for mentions of meetings
between government officials and company officers, and
then try to make mappings between government departments
and operational areas of a corporate sprawl, and so on.
82. [ This is part of an ongoing informal exploration of the patterns
and structures we can find across large open datasets.
For more information, follow:
- blog.ouseful.info
- @psychemedia
All comments welcome. ]
Editor's Notes
To introduce the idea of a network map, let’s have a look at a view we can construct over the Twitter social space…
The result is a map that shows groups of people positioned according to the shared projected presumed interests of their followers.
A more principled way of looking at corporate structures at a company level may possibly be derived from publicly available corporate information.
So how might we go about constructing a corporate sprawl? Let’s start with one or more seed companies.
Here’s another way of imagining it – a company surrounded by its directors.
We now have a larger set of companies, reflecting those companies who share N or more directors with the original seed company or companies.
We can view this network in various ways. For example, we might choose to view just the companies.
This is just a tweaking of the layout of the previous graph to try to highlight the separation of the different clusters.
Once again, we use line thickness (that is, edge weight) to denote how heavily connected directors are.
Here’s a view over connected directors in the Shell corporate sprawl.