OpenCorporates Co-Director Mapping Graphs

OpenCorporates
Co-Director Mapping
Tony Hirst
Dept of Communications and Systems,
The Open University

As company filings start to appear as open data, opportunities
may arise for watchdogs to start mining this data in support of
their investigations and monitoring activities.
This presentation introduces several ideas relating to mapping
network structures in order to learn something about the
structure of “corporate sprawls”, corporate groupings defined
on the basis of co-director relationships.

Social
Media
Mapping
Introducing “Graphs”

To introduce the idea of a network map, let’s have a look at a
view we can construct over the Twitter social space…

This network map shows Twitter users who are commonly
followed by the followers of @TOGYnews.
Although hard to see at this scale, the map is actually
constructed from labeled points connected by lines (in the
jargon, “nodes connected by edges”).
The algorithm used to position the labeled nodes tries to place
nodes that are heavily connected to each other close to each
other. In a sense, we can view the diagram as a map, with
regions that are highlighted using false colours identifying
clusters of nodes that may in some sense be similar to each
other based on the sharing of common followers.

A
B
Is followed by
focus
Find the followers

The map is constructed using data grabbed from the Twitter
API.
Using one or more “focus” users (a specific Twitter account, for
example, or the set of users of a particular hashtag), we grab a
list of their followers.

A
B peer
peer
Is followed by
focus
Find Friends of Followers

For each of the followers, we grab a list of their friends (or a
sample thereof) – that is, a list of some or all of the people
they follow on Twitter.
We can use this data to construct a network of people followed
by the followers of the original focus.
It is typically at this point, where there is most relational
information contained within the network, that we lay it out
using automatic layout tools.

A
B
peer
Is followed by
focus
Find Common Friends of Followers

Drawing on the insight that people on Twitter are likely to
follow accounts that are of interest to them, we can start to
imagine the network as a projection of the interests of the
people who are interested in one or more of the things the
focus is associated with.
However, interests of followers may spread to a wide range of
topics, so we look for consistency of interest, pruning the
network to remove people who are not commonly followed by
the followers of the focus. That is, we remove nodes who are
followed by only a few of the followers of the focus.
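
The pruning step can be sketched in a few lines of Python. This is a minimal illustration, assuming the friend lists have already been fetched and stored in a dictionary; the names `friends_of`, `commonly_followed` and `min_followers` are made up for the example and are not part of the Twitter API.

```python
from collections import Counter

def commonly_followed(friends_of, min_followers=2):
    """Return accounts followed by at least `min_followers` followers of the focus.

    friends_of maps each follower of the focus to the set of accounts
    that follower follows (hypothetical, pre-fetched data).
    """
    counts = Counter()
    for friends in friends_of.values():
        counts.update(set(friends))  # count each account once per follower
    return {account: n for account, n in counts.items() if n >= min_followers}

# Toy data: three followers of the focus and the accounts they follow.
friends_of = {
    "follower_a": {"peer_1", "peer_2", "focus"},
    "follower_b": {"peer_1", "peer_3", "focus"},
    "follower_c": {"peer_1", "peer_2", "focus"},
}
kept = commonly_followed(friends_of, min_followers=2)
```

Here `peer_3` is dropped because only one follower of the focus follows it; what counts as "a few" is a judgment call that depends on the size of the follower sample.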

peer
focus
Filter out not commonly followed

Having laid out the network map, we might now tidy it up a
little by removing all the nodes that are not themselves
followed by a significant number of the followers of the
original focus.

The result is a map that shows groups of people positioned
according to the shared projected presumed interests of their
followers.

It may also be possible to use metadata associated with social
networks to develop additional insights.
A recent paper describes one way of mining social network
data for information about people working for a particular
company, and using public biographical information along with
social connection data to map out the organisational structures
of large companies.

Corporate
Structure
Maps
Introducing “Graphs”

A more principled way of looking at corporate structures at a
company level may possibly be derived from publicly available
corporate information.

C3
C1
C2
D1
D3 D2
Companies & Directors

For example, if we can get hold of directorial appointment and
termination data, we can start to construct maps that show how
companies are connected by common directors, as well as
which companies are co-directed by particular directors.
As with the emergent social positioning network maps, if
particular directors have particular corporate interests, we may
be able to identify particular organisational groupings in
corporate sprawls made up from dozens of operating
companies working across a range of business areas.

Company Records on OpenCorporates

One possible source of open company information is
OpenCorporates.
OpenCorporates’ ambitious aim is to mint a unique corporate
identifier for every corporate legal entity in the world [CHECK],
as well as collating, and normalising (or “harmonising”)
company information about company filings, trademarks,
patents(?) and officers (that is company directors, company
secretaries and so on).
For GB registered companies, there is a growing repository of
data relating to company directorships, which provides us with
an opportunity to develop maps that show how companies are
connected by virtue of having common directors.

Subsidiary Companies have “working” directors

Just a note – my experience in looking at data related to GB
registered companies suggests that the directors of the
“top”/nominal company in a large multinational grouping are
“atypical” compared to the officers appointed to UK based
operating companies in the same corporate sprawl. They tend
to be appointed from the great and the good, or from senior
officers who do not take directorships across operating
divisions or companies, rather than being drawn from the
directors of operating companies.
When seeding corporate sprawl trawlers – algorithms that try
to identify companies that make up a corporate sprawl based
on co-directorships – my experience suggests that it often
makes sense to seed the search with one or more operating
companies who have directors that are likely to be directors of
other operating companies, rather than the “top level”
company.

Co-Director
Mapping
More Graphs

We can reuse the ideas that underpin the construction of the
emergent social positioning graph to map out corporate
structures based on director information.

Director Records on OpenCorporates

As well as corporate information pages, OpenCorporates
maintains information pages about directorial appointments.
At the moment, there are no authority files providing
identifiers that identify the same physical person – each
directorial appointment to a company gives the director a
unique officer ID. It is possible to search for officers of other
companies with the same name as a particular director, but
there are no identifiers that link them as the same physical
person. (That said, there does appear to be a slot in the
metadata for authoritative identifiers.)

So how might we go about constructing a corporate sprawl?
Let’s start with one or more seed companies.

C1
D1
Has director
D2

The general shape of this diagram might remind you of
something…?
For each of the seed companies, we grab a list of their
directors.
We can use this data to construct a network of people who are
directors or other officers of the original seed company or
companies.

Here’s another way of imagining it – a company surrounded by
its directors.

C1
C2
Has director
D2
D1

For each of the directors, we run a search for them on
OpenCorporates, to see what directorial appointments have
been made to other companies for people of exactly the same
name.
We can use this data to construct a network of companies
directed by the directors of the original seed company.
For those companies that are directed by N or more of the
directors associated with the seed company or companies
(where N is typically 2) we might now say they are part of the
corporate sprawl. The companies sharing fewer than N
directors associated with companies admitted to the corporate
sprawl are added to a list of possible candidate companies. As
we find more directors associated with companies included in
the sprawl, we might be able to “legitimise” membership of
these companies within the sprawl.
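
The admission rule described above might be sketched as follows. This is only an illustration under stated assumptions: `appointments` is a hypothetical list of (director name, company) pairs gathered from name searches, and `classify_companies` is a made-up helper, not an OpenCorporates call.

```python
from collections import defaultdict

def classify_companies(appointments, known_directors, n=2):
    """Split companies into sprawl members and candidates.

    appointments: iterable of (director_name, company) pairs
    (hypothetical, pre-fetched search results).
    A company joins the sprawl if it has at least `n` of the known
    directors; with at least one but fewer than `n` it is a candidate.
    """
    shared = defaultdict(set)
    for director, company in appointments:
        if director in known_directors:
            shared[company].add(director)
    sprawl = {c for c, ds in shared.items() if len(ds) >= n}
    candidates = set(shared) - sprawl
    return sprawl, candidates

appointments = [
    ("D1", "C1"), ("D2", "C1"),
    ("D1", "C2"), ("D2", "C2"), ("D3", "C2"),
    ("D2", "C3"),
]
sprawl, candidates = classify_companies(appointments, {"D1", "D2"}, n=2)
```

With N set to 2, C1 and C2 are admitted and C3 is held back as a candidate. Because matching is by exact name only, two different people who share a name would be counted as one director, so the output needs checking by hand.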

Find Companies With Two or More Seed Directors

We now have a larger set of companies, reflecting those
companies who share N or more directors with the original
seed company or companies.

C1
C2 D3
D1
Has director
D2

If we so decide, we can continue with this snowball discovery
process, looking up further directors associated with
companies we have included in our sprawl, with a view to
trying to discover more companies that should be included in
the sprawl.
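
A rough sketch of the snowball loop, assuming two hypothetical lookup functions, `directors_of` and `companies_of`, that stand in for whatever scraped data is available. It is not the Scraperwiki code itself, and `max_rounds` is an illustrative safety limit.

```python
def snowball(seed_companies, directors_of, companies_of, n=2, max_rounds=5):
    """Grow a corporate sprawl outwards from one or more seed companies.

    directors_of(company) -> set of director names (hypothetical lookup)
    companies_of(director) -> set of companies with an officer of that
    exact name (hypothetical lookup)
    """
    sprawl = set(seed_companies)
    for _ in range(max_rounds):
        # Collect every director of every company admitted so far.
        directors = set()
        for company in sprawl:
            directors |= directors_of(company)
        # For each outside company, record which known directors it shares.
        shared = {}
        for director in directors:
            for company in companies_of(director):
                if company not in sprawl:
                    shared.setdefault(company, set()).add(director)
        admitted = {c for c, ds in shared.items() if len(ds) >= n}
        if not admitted:
            break  # nothing new qualifies, so stop growing
        sprawl |= admitted
    return sprawl

# Toy board memberships standing in for scraped data.
board = {
    "C1": {"D1", "D2"},
    "C2": {"D1", "D2", "D3"},
    "C3": {"D2", "D3"},
    "C4": {"D3"},
}
result = snowball(
    {"C1"},
    directors_of=lambda company: board.get(company, set()),
    companies_of=lambda director: {c for c, ds in board.items() if director in ds},
)
```

In this toy run C3 shares only one director with the seed in the first round, but is admitted in the second round once D3 has been discovered via C2. C4 never shares more than one director and stays outside.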

Using this snowball approach, I have constructed a scraper on
Scraperwiki that mines OpenCorporates, given one or more
seed companies (or seed directors) to map out corporate
sprawls, limiting myself to the capture of current directors and
active companies registered in the UK.
(The code needs checking and is perhaps not as easy to use as
it might be. Developing a more robust and user friendly tool
may be worth exploring if this approach is seen to be useful.)

So – we can generate a network that connects companies with
their directors, and grow this network out to identify
companies that share several directors.
As with the emergent social positioning map, we can use
automatic layout tools to try to position companies and
directors close to each other based on their connectivity,
producing a map over the corporate sprawl.

We can view this network in various ways. For example, we
might choose to view just the companies.

This map shows companies in a corporate sprawl grown out
from Royal Dutch Shell.
Note the presence of BP in there – somehow, these two
groupings are connected by shared directorships of some
intermediate company.

One of the nice things about representing this sort of structure
in an abstract mathematical or computational way is that we
can wrangle it with code...
So for example, companies C1 and C2 are connected by a
single shared director, whereas C2 and C3 are connected by
two directors.

C3
C1
C2
Companies Sharing Directors

We can represent this by transforming the original bipartite
(two types of node) graph, which connects directors to
companies and companies to directors, into a graph that just
connects companies that share directors.
The thickness of the line (or “edge”) connecting the companies
represents its “weight”, which in this case is given by the
number of shared directors between connected companies.
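
The collapse from a director-company graph to a weighted company-company graph can be illustrated with the standard library alone. The mapping `directs` and the helper `project` are illustrative names, and the toy data is invented.

```python
from collections import defaultdict
from itertools import combinations

def project(memberships):
    """Collapse a director -> companies mapping into weighted company pairs.

    Returns {(company_a, company_b): number_of_shared_directors},
    with each pair stored in sorted order.
    """
    weights = defaultdict(int)
    for companies in memberships.values():
        # Every pair of companies this director sits on gains one unit of weight.
        for a, b in combinations(sorted(companies), 2):
            weights[(a, b)] += 1
    return dict(weights)

directs = {
    "D1": {"C1", "C2"},
    "D2": {"C2", "C3"},
    "D3": {"C2", "C3"},
}
company_edges = project(directs)
```

C1 and C2 end up joined by an edge of weight 1, and C2 and C3 by an edge of weight 2. Swapping the roles (a company -> directors mapping) gives the director-director view in the same way. NetworkX's bipartite module offers a weighted projection that could be used instead.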

C3
C2
Companies Sharing Two or More Directors

We can also filter the graph, for example by adding together
the weights of all the edges incident on a node, and throwing
away all nodes for whom this sum is below a specified
threshold value.
We might alternatively prune the network by removing
(“cutting”) all edges below a specified weight, and then
throwing away nodes that aren’t connected to other nodes.
(For example, we might remove connections between
companies that only share a single director, and then throw
away companies that aren’t connected to any other
companies. Which is to say, we cut out companies that don’t
share two or more directors with any other single company.
When you start working with graphs, you begin to realise quite
how beautiful, and powerful, a way they are for working with
data elements that are related to each other in some way.)
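
Both filtering strategies can be sketched on a small weighted edge dictionary. The function names and thresholds are illustrative assumptions, and the data is a toy example.

```python
from collections import defaultdict

def prune_by_weight(edges, min_weight=2):
    """Cut edges lighter than `min_weight`, then drop unconnected nodes."""
    kept = {pair: w for pair, w in edges.items() if w >= min_weight}
    nodes = {node for pair in kept for node in pair}
    return kept, nodes

def prune_by_strength(edges, min_strength=2):
    """Drop nodes whose summed incident edge weight is below `min_strength`."""
    strength = defaultdict(int)
    for (a, b), w in edges.items():
        strength[a] += w
        strength[b] += w
    nodes = {node for node, s in strength.items() if s >= min_strength}
    # Keep only edges whose two endpoints both survive.
    kept = {(a, b): w for (a, b), w in edges.items() if a in nodes and b in nodes}
    return kept, nodes

edges = {("C1", "C2"): 1, ("C2", "C3"): 2}
by_weight = prune_by_weight(edges, min_weight=2)
by_strength = prune_by_strength(edges, min_strength=2)
```

On this toy graph both approaches remove C1, but they can differ on larger graphs: a company with many single-director links passes the summed-weight test while failing the edge-weight one.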

Here’s an example of the Shell corporate sprawl with the
directors removed and edges connecting companies that share
two or more directors. The labels are sized relative to the
PageRank score of each node, which is a measure of how well
connected the node is in the graph (the “importance” of each
node is dependent on the “importance” of the nodes
connected to it…)
The lines also provide a background that highlights the
connectivity, and structure, of the corporate elements.
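
To give a feel for how a PageRank-style score behaves, here is a plain power-iteration sketch for an undirected weighted graph. It is an illustration only, not necessarily the calculation behind the figure; the damping factor 0.85 is the conventional default, and the function and variable names are made up.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected weighted graph.

    edges maps (node_a, node_b) to a positive weight. Every node appears
    in at least one edge, so there are no dangling nodes to handle.
    """
    neighbours = {}
    for (a, b), w in edges.items():
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w
    n = len(neighbours)
    rank = {node: 1.0 / n for node in neighbours}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in neighbours}
        for node, nbrs in neighbours.items():
            total = sum(nbrs.values())
            # Share this node's score among neighbours in proportion to weight.
            for nbr, w in nbrs.items():
                new_rank[nbr] += damping * rank[node] * w / total
        rank = new_rank
    return rank

# Toy star graph: one hub company linked to three others.
star = {("HUB", "A"): 1, ("HUB", "B"): 1, ("HUB", "C"): 1}
scores = pagerank(star)
top = max(scores, key=scores.get)
```

The hub collects score from all three neighbours and so ranks highest, which is the behaviour that makes the score useful for sizing labels.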

In this view, I have resized the labels based on the
betweenness centrality of each node. This network statistic
highlights nodes that play an important role in connecting
clusters or groupings of nodes. So for example, we see the
suggestion that The Consolidated Petroleum Company and
Shell Mex and BP Limited may be the companies that connect
the Shell sprawl to the BP one.
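
Betweenness centrality can be computed with Brandes' algorithm. The sketch below handles unweighted, undirected graphs only and returns unnormalised scores. The node names are invented, and the helper names are illustrative.

```python
from collections import deque

def adjacency_from(edge_list):
    """Build a symmetric adjacency mapping from (a, b) pairs."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def betweenness(adjacency):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalised)."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}   # number of shortest paths from source
        dist = {v: -1 for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adjacency}
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected path was counted once from each end.
    return {v: s / 2.0 for v, s in score.items()}

# Two tight clusters joined through a single intermediate company.
links = [("A", "B"), ("A", "C"), ("B", "C"),
         ("X", "Y"), ("X", "Z"), ("Y", "Z"),
         ("C", "BRIDGE"), ("BRIDGE", "X")]
scores = betweenness(adjacency_from(links))
connector = max(scores, key=scores.get)
```

The intermediate node scores highest because every shortest path between the two clusters runs through it, which is how such a statistic can point to the companies that join two sprawls.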

This is just a tweaking of the layout of the previous graph to try
to highlight the separation of the different clusters.

Just as we collapsed the network to show how companies
could be linked directly by virtue of co-directorships, so we can
collapse the network to show how directors are connected.
For example, director D1 is connected by a single shared
company to directors D2 and D3, whereas D2 and D3 are
connected by two companies.

Once again, we use line thickness (that is, edge weight) to
denote how heavily connected directors are.

Here’s a view over connected directors in the Shell
corporate sprawl.

OpenCorporates
Scraperwiki db
JSON
D3.js
Networkx
Gexf
Gephi sigma.js

As to how we get those graphs plotted? I built a crude
workflow in Scraperwiki that gets data out of the scraped
database and into a form that allows it to be visualised using
the Gephi desktop tool or in a web page using different
Javascript libraries (sigma.js or d3.js).
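
As one example of the export step, a weighted edge dictionary can be written out as node-link JSON, a layout commonly used in d3.js force-layout examples. The exact shape emitted by the workflow above is not stated, so the field names here (`name`, `source`, `target`, `value`) are an assumption.

```python
import json

def to_node_link(edges):
    """Convert {(a, b): weight} into a node-link structure for the browser."""
    nodes = sorted({node for pair in edges for node in pair})
    index = {name: i for i, name in enumerate(nodes)}
    return {
        "nodes": [{"name": name} for name in nodes],
        # Links refer to nodes by their position in the nodes list.
        "links": [
            {"source": index[a], "target": index[b], "value": w}
            for (a, b), w in sorted(edges.items())
        ],
    }

edges = {("C1", "C2"): 1, ("C2", "C3"): 2}
payload = json.dumps(to_node_link(edges), indent=2)
```

The resulting string can be saved to a file and loaded by a page script; the `value` field can then drive line thickness.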

This is Gephi – a cross-platform desktop tool that’s great for
generating effective network visualisations. I have some
tutorials and sample datasets if anyone wants to give it a
whirl…

“Where”
Next…?
- geocode registered addresses
- explore non-GB registered companies

So where can we take the OpenCorporates data next?
I have a couple of ideas:
- we can go spatial in a geographical sense and start to
geocode the registered addresses of companies, to see
whether any of them are located in offshore tax havens, for
example, or to see whether there are different registered
addresses that might lead us to yet more companies (by virtue
of sharing common registered office addresses, rather than co-
directors, for example);
- we could start trying to tie non-GB registered companies into
the mix. At the moment, director information for other
territories is sparse – might there be some other way we can
look for connections?

And
“When”?
- company timelines (set-up dates, renaming)
- explore director timelines (by company)
- explore director timelines (by director)

Another approach might be to start analysing corporate
sprawls in a time dimension. There are several opportunities
here:
- If we have access to company formation and dissolution
dates, we can map out a timeline of a corporate sprawl, which
might reveal how companies change name, directorship or
association with other companies;
- if we get all the director information associated with a
company, we can visualise how director appointments and
terminations occurred across one or more companies, which
might in turn reveal identifiable “features” that we might be
able to associate with news or business restructuring events;
- if we track down companies a particular director appears to
be associated with, we can start to develop “career timelines”
of directors, showing how they have been associated with
different corporate groupings over time (and maybe the odd
company on the side…)
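
The "career timeline" idea might start from something as simple as grouping appointment records by director and sorting them by start date. The record layout (director, company, start, end) and all the dates below are invented for illustration; real appointment data may be shaped differently.

```python
from collections import defaultdict
from datetime import date

def career_timelines(appointments):
    """Group (director, company, start, end) records by director, oldest first.

    `end` is None for an appointment that is still current.
    """
    timelines = defaultdict(list)
    for director, company, start, end in appointments:
        timelines[director].append((start, end, company))
    for spans in timelines.values():
        spans.sort(key=lambda span: span[0])  # order by appointment date
    return dict(timelines)

# Invented toy records.
appointments = [
    ("D1", "C2", date(2010, 5, 1), None),
    ("D1", "C1", date(2005, 1, 1), date(2010, 4, 30)),
    ("D2", "C1", date(2008, 3, 1), None),
]
timelines = career_timelines(appointments)
d1_companies = [company for _, _, company in timelines["D1"]]
```

Each director's sorted list of spans could then be drawn as bars on a shared time axis, one row per company.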

Linking out
and in
- linking companies or directors with external
datasets

Whilst it is possible to generate insight from the analysis of
data that is contained just within OpenCorporates, there are
likely to be many opportunities for using OpenCorporates to
annotate other datasets, or use external datasets to annotate
OpenCorporates data.

As this example starts to explore, we might try to reconcile
company names as recorded in local spending data records
with corporate entities identified within OpenCorporates to
build up a better picture of how money flows into corporate
sprawls.
On a lobbying front, we might look for mentions of meetings
between government officials and company officers, and
then try to make mappings between government departments
and operational areas of a corporate sprawl, and so on.
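
Reconciling company names from spending records against registered names usually needs some normalisation and fuzzy matching. The sketch below uses Python's standard `difflib` module. The normalisation rules, the 0.85 cutoff and the company names are all illustrative assumptions, not a tested reconciliation recipe.

```python
import difflib
import re

def normalise(name):
    """Upper-case, strip punctuation and abbreviate a trailing LIMITED."""
    name = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    name = re.sub(r"\s+", " ", name).strip()
    return re.sub(r"\bLIMITED$", "LTD", name)

def reconcile(raw_name, registry_names, cutoff=0.85):
    """Return the closest registered name, or None if nothing is close enough."""
    lookup = {normalise(n): n for n in registry_names}
    match = difflib.get_close_matches(
        normalise(raw_name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[match[0]] if match else None

# Invented names for illustration.
registry = ["ACME WIDGETS LIMITED", "EXAMPLE HOLDINGS PLC"]
hit = reconcile("Acme Widgets Ltd.", registry)
miss = reconcile("Totally Different Co", registry)
```

A high cutoff keeps false matches down at the cost of missing some genuine ones, so any automatic matches are best treated as suggestions for review.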

[ This is part of an ongoing informal exploration of the patterns
and structures we can find across large open datasets.
For more information, follow:
- blog.ouseful.info
- @psychemedia
All comments welcome. ]
