OpenCorporates
Co-Director Mapping
Tony Hirst
Dept of Communications and Systems,
The Open University
As company filings start to appear as open data, opportunities
may arise for watchdogs to start mining this data in support of
their investigations and monitoring activities.
This presentation introduces several ideas relating to mapping
network structures in order to learn something about the
structure of “corporate sprawls”, corporate groupings defined
on the basis of co-director relationships.
Social Media Mapping
Introducing “Graphs”
To introduce the idea of a network map, let’s have a look at a
view we can construct over the Twitter social space…
Emergent Social Positioning
This network map shows Twitter users who are commonly
followed by the followers of @TOGYnews.
Although hard to see at this scale, the map is actually
constructed from labeled points connected by lines (in the
jargon, “nodes connected by edges”).
The algorithm used to position the labeled nodes tries to place
nodes that are heavily connected to each other close to each
other. In a sense, we can view the diagram as a map, with
regions that are highlighted using false colours identifying
clusters of nodes that may in some sense be similar to each
other based on the sharing of common followers.
[Diagram: nodes A, B and “focus”; edge label “Is followed by”]
Find the followers
The map is constructed using data grabbed from the Twitter
API.
Using one or more “focus” users (a specific Twitter account, for
example, or the set of users of a particular hashtag), we grab a
list of their followers.
[Diagram: nodes A, B, “peer” and “focus”; edge label “Is followed by”]
Find Friends of Followers
For each of the followers, we grab a list of their friends (or a
sample thereof); that is, a list of some or all of the people
they follow on Twitter.
We can use this data to construct a network of people followed
by the followers of the original focus.
It is typically at this point, where there is most relational
information contained within the network, that we lay it out
using automatic layout tools.
[Diagram: nodes A, B, “peer” and “focus”; edge label “Is followed by”]
Find Common Friends of Followers
Drawing on the insight that people on Twitter are likely to
follow accounts that are of interest to them, we can start to
imagine the network as a projection of the interests of the
people who are interested in one or more of the things the
focus is associated with.
However, interests of followers may spread to a wide range of
topics, so we look for consistency of interest, pruning the
network to remove people who are not commonly followed by
the followers of the focus. That is, we remove nodes who are
followed by only a few of the followers of the focus.
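A minimal sketch of this pruning step, assuming the follower and friend lists have already been fetched into a plain dictionary (the account names and the `min_followers` threshold are hypothetical, for illustration only):

```python
from collections import Counter

def commonly_followed(friends_of, min_followers):
    """Return accounts followed by at least `min_followers` of the followers.

    `friends_of` maps each follower of the focus to the set of accounts
    that follower follows (their "friends").
    """
    counts = Counter()
    for follower, friends in friends_of.items():
        counts.update(set(friends))  # each follower counts once per account
    return {account: n for account, n in counts.items() if n >= min_followers}

# Hypothetical sample data: three followers of the focus and who they follow.
friends_of = {
    "follower_a": {"peer_1", "peer_2", "peer_3"},
    "follower_b": {"peer_1", "peer_2"},
    "follower_c": {"peer_1", "peer_4"},
}

kept = commonly_followed(friends_of, min_followers=2)
print(sorted(kept.items()))
```

Raising `min_followers` prunes harder; what counts as “only a few” is a judgment call that depends on how many followers were sampled.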
[Diagram: nodes “peer” and “focus”]
Filter out not commonly followed
Having laid out the network map, we might now tidy it up a
little by removing all the nodes that are not themselves
followed by a significant number of the followers of the
original focus.
Emergent Social Positioning
The result is a map that shows groups of people positioned
according to the shared projected presumed interests of their
followers.
A More Principled Approach
It may also be possible to use metadata associated with social
networks to develop additional insights.
A recent paper describes one way of mining social network
data for information about people working for a particular
company, and using public biographical information along with
social connection data to map out the organisational structures
of large companies.
Corporate Structure Maps
Introducing “Graphs”
A more principled way of looking at corporate structures at a
company level may possibly be derived from publicly available
corporate information.
[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3]
Companies & Directors
For example, if we can get hold of directorial appointment and
termination data, we can start to construct maps that show how
companies are connected by common directors, as well as
which companies are co-directed by particular directors.
As with the emergent social positioning network maps, if
particular directors have particular corporate interests, we may
be able to identify particular organisational groupings in
corporate sprawls made up from dozens of operating
companies working across a range of business areas.
Company Records on OpenCorporates
One possible source of open company information is
OpenCorporates.
OpenCorporates’ ambitious aim is to mint a unique corporate
identifier for every corporate legal entity in the world [CHECK],
as well as collating and normalising (or “harmonising”)
company information about company filings, trademarks,
patents(?) and officers (that is, company directors, company
secretaries and so on).
For GB registered companies, there is a growing repository of
data relating to company directorships, which provides us with
an opportunity to develop maps that show how companies are
connected by virtue of having common directors.
Subsidiary Companies have “working” directors
Just a note: my experience in looking at data related to GB
registered companies suggests that the directors of the
“top”/nominal company in a large multinational grouping are
“atypical” compared to the officers appointed to UK based
operating companies in the same corporate sprawl. They tend to
be appointed from the great and the good, or from senior
officers who do not take directorships across operating
divisions or companies, rather than being drawn from the
directors of operating companies.
When seeding corporate sprawl trawlers (algorithms that try
to identify companies that make up a corporate sprawl based
on co-directorships), my experience suggests that it often
makes sense to seed the search with one or more operating
companies who have directors that are likely to be directors of
other operating companies, rather than the “top level”
company.
Co-Director Mapping
More Graphs
We can reuse the ideas that underpin the construction of the
emergent social positioning graph to map out corporate
structures based on director information.
Director Records on OpenCorporates
As well as corporate information pages, OpenCorporates
maintains information pages about directorial appointments.
At the moment, there are no authority files providing
identifiers that identify the same physical person: each
directorial appointment to a company provides the director with
a unique officer ID. It is possible to search for officers of other
companies with the same name as a particular director, but
there are no identifiers that link them as the same physical
person. (That said, there does appear to be a slot in the
metadata for authoritative identifiers.)
Start With One or More Seed Companies
So how might we go about constructing a corporate sprawl?
Let’s start with one or more seed companies.
[Diagram: company C1 with directors D1 and D2; edge label “Has director”]
Find Friends of Followers
The general shape of this diagram might remind you of
something…?
For each of the seed companies, we grab a list of their
directors.
We can use this data to construct a network of people who are
directors or other officers of the original seed company or
companies.
Find Directors of Seed Company(s)
Here’s another way of imagining it – a company surrounded by
its directors.
[Diagram: companies C1 and C2, directors D1 and D2; edge label “Has director”]
Find Friends of Followers
For each of the directors, we run a search for them on
OpenCorporates, to see what directorial appointments have
been made to other companies for people of exactly the same
name.
We can use this data to construct a network of companies
directed by the directors of the original seed company.
For those companies that are directed by N or more of the
directors associated with the seed company or companies
(where N is typically 2) we might now say they are part of the
corporate sprawl. The companies sharing fewer than N
directors associated with companies admitted to the corporate
sprawl are added to a list of possible candidate companies. As
we find more directors associated with companies included in
the sprawl, we might be able to “legitimise” membership of
these companies within the sprawl.
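The admission rule described above can be sketched roughly as follows, assuming director names have already been matched by exact string comparison (so, as noted earlier, two different people who share a name would be treated as one). The company and director labels are hypothetical placeholders:

```python
def classify_companies(directorships, sprawl_directors, n=2):
    """Split companies into sprawl members and candidates.

    `directorships` maps a company to the set of director names appointed
    to it; `sprawl_directors` is the set of names already associated with
    companies in the sprawl. Companies sharing at least `n` of those names
    are admitted; companies sharing fewer (but at least one) are candidates.
    """
    members, candidates = set(), set()
    for company, directors in directorships.items():
        shared = len(directors & sprawl_directors)
        if shared >= n:
            members.add(company)
        elif shared > 0:
            candidates.add(company)
    return members, candidates

# Hypothetical data: the seed company's directors are D1 and D2.
seed_directors = {"D1", "D2"}
found = {
    "C2": {"D1", "D2", "D3"},  # shares two seed directors
    "C4": {"D2", "D9"},        # shares one seed director
    "C5": {"D8"},              # shares none
}

members, candidates = classify_companies(found, seed_directors, n=2)
print(sorted(members), sorted(candidates))
```

Candidates are kept rather than discarded because a later round may turn up a second shared director and so “legitimise” them.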
Find Companies With Two or More Seed Directors
We now have a larger set of companies, reflecting those
companies who share N or more directors with the original
seed company or companies.
[Diagram: companies C1 and C2, directors D1, D2 and D3; edge label “Has director”]
Find Friends of Followers
If we so decide, we can continue with this snowball discovery
process, looking up further directors associated with
companies we have included in our sprawl, with a view to
trying to discover more companies that should be included in
the sprawl.
Using this snowball approach, I have constructed a scraper on
Scraperwiki that mines OpenCorporates, given one or more
seed companies (or seed directors) to map out corporate
sprawls, limiting myself to the capture of current directors and
active companies registered in the UK.
(The code needs checking and is perhaps not as easy to use as
it might be. Developing a more robust and user friendly tool
may be worth exploring if this approach is seen to be useful.)
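For readers who want the shape of the snowball loop, here is a minimal sketch. It is not the Scraperwiki scraper itself: `directors_of` and `companies_of` are hypothetical lookup functions standing in for whatever queries the data source supports, and the in-memory dictionaries are made-up test data.

```python
def grow_sprawl(seeds, directors_of, companies_of, n=2, max_rounds=10):
    """Repeatedly admit companies sharing >= n directors with the sprawl."""
    sprawl = set(seeds)
    for _ in range(max_rounds):
        # Collect every director of every company admitted so far.
        directors = set()
        for company in sprawl:
            directors |= directors_of(company)
        # Count how many of those directors each outside company shares.
        shared = {}
        for director in directors:
            for company in companies_of(director):
                if company not in sprawl:
                    shared[company] = shared.get(company, 0) + 1
        admitted = {c for c, count in shared.items() if count >= n}
        if not admitted:
            break  # nothing new to add, so the snowball has stopped growing
        sprawl |= admitted
    return sprawl

# Hypothetical data standing in for live lookups.
directors_by_company = {
    "C1": {"D1", "D2"},
    "C2": {"D1", "D2", "D3"},
    "C3": {"D2", "D3"},
    "C4": {"D3"},
}
companies_by_director = {}
for company, directors in directors_by_company.items():
    for director in directors:
        companies_by_director.setdefault(director, set()).add(company)

sprawl = grow_sprawl(
    {"C1"},
    directors_of=lambda c: directors_by_company.get(c, set()),
    companies_of=lambda d: companies_by_director.get(d, set()),
)
print(sorted(sprawl))
```

The `max_rounds` cap is a safety limit added for the sketch; without some limit a snowball search over real data can keep growing for a long time.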
[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3]
Companies & Directors
So – we can generate a network that connects companies with
their directors, and grow this network out to identify
companies that share several directors.
As with the emergent social positioning map, we can use
automatic layout tools to try to position companies and
directors close to each other based on their connectivity,
producing a map over the corporate sprawl.
[Diagram: companies C1, C2 and C3]
Companies
We can view this network in various ways. For example, we
might choose to view just the companies.
PageRank
This map shows companies in a corporate sprawl grown out
from Royal Dutch Shell.
Note the presence of BP in there – somehow, these two
groupings are connected by shared directorships of some
intermediate company.
[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3]
Companies & Directors
One of the nice things about representing this sort of structure
in an abstract mathematical or computational way is that we
can wrangle it with code...
So for example, companies C1 and C2 are connected by a
single shared director, whereas C2 and C3 are connected by
two directors.
[Diagram: companies C1, C2 and C3]
Companies Sharing Directors
We can represent this by transforming the original bipartite
(two types of node) graph, which connects directors to
companies and companies to directors, into a graph that just
connects companies that share directors.
The thickness of the line (or “edge”) connecting the companies
represents its “weight”, which in this case is given by the
number of shared directors between connected companies.
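As a small worked sketch of this collapse, the code below uses one assignment of directors to companies that is consistent with the C1/C2/C3 example (the exact assignment is an assumption). Graph libraries offer the same operation; NetworkX, for instance, has `bipartite.weighted_projected_graph`.

```python
from itertools import combinations

def project(neighbours):
    """Link every pair of keys that share neighbours; weight = number shared."""
    edges = {}
    for a, b in combinations(sorted(neighbours), 2):
        shared = neighbours[a] & neighbours[b]
        if shared:
            edges[(a, b)] = len(shared)
    return edges

def invert(neighbours):
    """Swap node types: company -> directors becomes director -> companies."""
    flipped = {}
    for key, values in neighbours.items():
        for value in values:
            flipped.setdefault(value, set()).add(key)
    return flipped

# Assumed assignment: C1 and C2 share D1; C2 and C3 share D2 and D3.
directors_by_company = {
    "C1": {"D1"},
    "C2": {"D1", "D2", "D3"},
    "C3": {"D2", "D3"},
}

company_edges = project(directors_by_company)
director_edges = project(invert(directors_by_company))
print(company_edges)
print(director_edges)
```

Applying the same function to the inverted mapping gives the director-to-director view, where the weight counts shared companies instead.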
[Diagram: companies C2 and C3]
Companies Sharing Two or More Directors
We can also filter the graph, for example by adding together
the weights of all the edges incident on a node, and throwing
away all nodes for whom this sum is below a specified
threshold value.
We might alternatively prune the network by removing
(“cutting”) all edges below a specified weight, and then
throwing away nodes that aren’t connected to other nodes.
(For example, we might remove connections between
companies that only share a single director, and then throw
away companies that aren’t connected to any other
companies. Which is to say, we cut out companies that don’t
share two or more directors with any other single company.)
When you start working with graphs, you begin to realise quite
how beautiful, and powerful, a way they are for working with
data elements that are related to each other in some way.
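Both filters can be sketched in a few lines, assuming the company graph is held as a dictionary of weighted edges (a representation chosen for this example, not one the slides specify):

```python
def cut_light_edges(edges, min_weight):
    """Keep only edges of at least `min_weight`.

    Because the graph is stored as an edge list, any node left with no
    edges simply disappears, which is the "throw away unconnected nodes" step.
    """
    return {pair: w for pair, w in edges.items() if w >= min_weight}

def drop_weak_nodes(edges, min_strength):
    """Drop nodes whose summed incident edge weight is below `min_strength`.

    This is a single pass: strengths are not recomputed after removal.
    """
    strength = {}
    for (a, b), w in edges.items():
        strength[a] = strength.get(a, 0) + w
        strength[b] = strength.get(b, 0) + w
    keep = {node for node, s in strength.items() if s >= min_strength}
    return {(a, b): w for (a, b), w in edges.items() if a in keep and b in keep}

# The C1/C2/C3 example: one shared director, then two shared directors.
company_edges = {("C1", "C2"): 1, ("C2", "C3"): 2}

print(cut_light_edges(company_edges, min_weight=2))
print(drop_weak_nodes(company_edges, min_strength=2))
```

On this tiny example both filters leave just the C2 to C3 link; on larger graphs they can differ, since a company with many single-director links passes the node filter but not the edge cut.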
PageRank
Here’s an example of the Shell corporate sprawl with the
directors removed and edges connecting companies that share
two or more directors. The labels are sized relative to the
PageRank score of each node, which is a measure of how well
connected the node is in the graph (the “importance” of each
node is dependent on the “importance” of the nodes
connected to it…).
The lines also provide a background that highlights the
connectivity, and structure, of the corporate elements.
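To make the “importance depends on the importance of neighbours” idea concrete, here is a minimal power-iteration sketch of PageRank on a weighted, undirected edge list. The damping factor of 0.85 and the fixed iteration count are conventional choices assumed for the example; the scores on the slide will have come from a graph tool rather than from code like this.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Approximate PageRank for an undirected graph with weighted edges."""
    # Build a symmetric adjacency map from the edge list.
    adjacency = {}
    for (a, b), w in edges.items():
        adjacency.setdefault(a, {})[b] = w
        adjacency.setdefault(b, {})[a] = w
    nodes = sorted(adjacency)
    strength = {n: sum(adjacency[n].values()) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share...
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # ...plus a share of each neighbour's rank, split by edge weight.
        for n in nodes:
            for neighbour, w in adjacency[n].items():
                new_rank[neighbour] += damping * rank[n] * w / strength[n]
        rank = new_rank
    return rank

company_edges = {("C1", "C2"): 1, ("C2", "C3"): 2}
scores = pagerank(company_edges)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy graph C2 scores highest because both other companies link to it, and C3 outranks C1 because its link to C2 is heavier.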
Betweenness
In this view, I have resized the labels based on the
betweenness centrality of each node. This network statistic
highlights nodes that play an important role in connecting
clusters or groupings of nodes. So for example, we see the
suggestion that The Consolidated Petroleum Company and
Shell Mex and BP Limited may be the companies that connect
the Shell sprawl to the BP one.
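A minimal sketch of betweenness centrality for an unweighted, undirected graph, using breadth-first search in the style of Brandes' algorithm. The node names are made up: two small clusters joined through a single “Link” node, loosely mirroring the idea of a company that bridges two sprawls.

```python
from collections import deque

def betweenness(edge_list):
    """Count, for each node, the shortest paths between others that pass through it."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    score = {n: 0.0 for n in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from `source`.
        distance = {source: 0}
        paths = {n: 0 for n in adjacency}
        paths[source] = 1
        predecessors = {n: [] for n in adjacency}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, sharing out path dependencies.
        dependency = {n: 0.0 for n in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair was counted once from each end.
    return {n: s / 2.0 for n, s in score.items()}

# Hypothetical graph: clusters {S1, S2} and {B1, B2} joined only via "Link".
edge_list = [
    ("S1", "S2"), ("S1", "Link"), ("S2", "Link"),
    ("B1", "B2"), ("B1", "Link"), ("B2", "Link"),
]
central = betweenness(edge_list)
print(central)
```

Only “Link” gets a non-zero score here, since every shortest path between the two clusters has to pass through it.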
Betweenness (repositioned)
This is just a tweaking of the layout of the previous graph to try
to highlight the separation of the different clusters.
[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3]
Companies & Directors
Just as we collapsed the network to show how companies
could be linked directly by virtue of co-directorships, so we can
collapse the network to show how directors are connected.
For example, director D1 is connected by a single shared
company to directors D2 and D3, whereas D2 and D3 are
connected by two companies.
[Diagram: directors D1, D2 and D3]
Co-Directors
Once again, we use line thickness (that is, edge weight) to
denote how heavily connected directors are.
PageRank
Here’s a view over connected directors in the Shell
corporate sprawl.
[Workflow diagram: OpenCorporates, Scraperwiki db, Networkx, JSON, Gexf, D3.js, Gephi, sigma.js]
As to how we get those graphs plotted? I built a crude
workflow in Scraperwiki that gets data out of the scraped
database and into a form that allows it to be visualised using
the Gephi desktop tool or in a web page using different
Javascript libraries (sigma.js or d3.js).
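As an illustration of the “into a form that can be visualised” step, the sketch below serialises a weighted edge list to JSON with separate node and link lists, a layout that many D3 force-layout examples read. The key names (`id`, `source`, `target`, `weight`) are assumptions for the example; the keys the actual workflow used are not stated here. For Gephi, NetworkX provides `write_gexf` to save a graph as a GEXF file.

```python
import json

def to_node_link(edges):
    """Turn {(a, b): weight} into a nodes/links structure ready for JSON."""
    names = sorted({name for pair in edges for name in pair})
    return {
        "nodes": [{"id": name} for name in names],
        "links": [
            {"source": a, "target": b, "weight": w}
            for (a, b), w in sorted(edges.items())
        ],
    }

company_edges = {("C1", "C2"): 1, ("C2", "C3"): 2}
text = json.dumps(to_node_link(company_edges), indent=2)
print(text)
```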
This is Gephi – a cross-platform desktop tool that’s great for
generating effective network visualisations. I have some
tutorials and sample datasets if anyone wants to give it a
whirl…
“Where” Next…?
- geocode registered addresses
- explore non-GB registered companies
So where can we take the OpenCorporates data next?
I have a couple of ideas:
- we can go spatial in a geographical sense and start to
geocode the registered addresses of companies, to see
whether any of them are located in offshore tax havens, for
example, or to see whether there are different registered
addresses that might lead us to yet more companies (by virtue
of sharing common registered office addresses, rather than co-
directors, for example);
- we could start trying to tie non-GB registered companies into
the mix. At the moment, director information for other
territories is sparse; might there be some other way we can
look for connections?
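The shared registered office idea in the first bullet can be tried before any geocoding, simply by grouping companies on a normalised address string. A rough sketch, with made-up addresses and a deliberately crude normalisation (real addresses would need more care):

```python
import re
from collections import defaultdict

def normalise_address(address):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", address.lower())
    return " ".join(cleaned.split())

def shared_addresses(address_by_company):
    """Group companies by normalised address, keeping addresses used more than once."""
    groups = defaultdict(set)
    for company, address in address_by_company.items():
        groups[normalise_address(address)].add(company)
    return {addr: cos for addr, cos in groups.items() if len(cos) > 1}

# Hypothetical registered addresses.
address_by_company = {
    "C1": "1 Example Street, London",
    "C2": "1 example street  London",
    "C3": "2 Other Road, Leeds",
}

shared = shared_addresses(address_by_company)
print(shared)
```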
And “When”?
- company timelines (set-up dates, renaming)
- explore director timelines (by company)
- explore director timelines (by director)
Another approach might be to start analysing corporate
sprawls in a time dimension. There are several opportunities
here:
- if we have access to company formation and dissolution
dates, we can map out a timeline of a corporate sprawl, which
might reveal how companies change name, directorship or
association with other companies;
- if we get all the director information associated with a
company, we can visualise how director appointments and
terminations occurred across one or more companies, which
might in turn reveal identifiable “features” that we might be
able to associate with news or business restructuring events;
- if we track down companies a particular director appears to
be associated with, we can start to develop “career timelines”
of directors, showing how they have been associated with
different corporate groupings over time (and maybe the odd
company on the side…)
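A small sketch of the director timeline ideas above, assuming appointment records are available as (director, company, appointed, terminated) tuples with `None` marking a current appointment. The record layout and all of the dates are invented for illustration.

```python
from datetime import date

# Hypothetical appointment records.
appointments = [
    ("D1", "C1", date(2005, 3, 1), date(2009, 6, 30)),
    ("D1", "C2", date(2008, 1, 15), None),
    ("D2", "C2", date(2010, 9, 1), None),
]

def career_timeline(records, director):
    """Return one director's appointments in order of appointment date."""
    rows = [r for r in records if r[0] == director]
    return sorted(rows, key=lambda r: r[2])

def serving_on(records, company, day):
    """List directors holding an appointment at `company` on `day`."""
    return sorted(
        d for d, c, start, end in records
        if c == company and start <= day and (end is None or day <= end)
    )

print([company for _, company, _, _ in career_timeline(appointments, "D1")])
print(serving_on(appointments, "C2", date(2011, 1, 1)))
```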
Linking out and in
- linking companies or directors with external datasets
Whilst it is possible to generate insight from the analysis of
data that is contained just within OpenCorporates, there are
likely to be many opportunities for using OpenCorporates to
annotate other datasets, or use external datasets to annotate
OpenCorporates data.
Sankey Flow Diagrams
As this example starts to explore, we might try to reconcile
company names as recorded in local spending data records
with corporate entities identified within OpenCorporates to
build up a better picture of how money flows into corporate
sprawls.
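Reconciling names usually starts with a crude normalisation so that trivial differences in case, punctuation and suffixes do not block a match. A minimal sketch follows; the abbreviation table, the sample names and the identifiers are all invented for illustration, and exact matching on a normalised key will still miss misspellings and trading names.

```python
import re

# A deliberately small, assumed table of suffix abbreviations.
ABBREVIATIONS = {"limited": "ltd", "company": "co"}

def name_key(name):
    """Reduce a company name to a comparison key."""
    text = name.lower().replace("&", " and ")
    words = re.sub(r"[^a-z0-9 ]+", " ", text).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def reconcile(payee_names, registry):
    """Map each payee name to a registry identifier, or None if unmatched."""
    by_key = {name_key(name): company_id for name, company_id in registry.items()}
    return {payee: by_key.get(name_key(payee)) for payee in payee_names}

# Hypothetical registry entries (name -> made-up identifier) and payee names.
registry = {
    "EXAMPLE WIDGETS LIMITED": "gb/00000001",
    "SAMPLE & SONS LTD": "gb/00000002",
}
payees = ["Example Widgets Ltd.", "Sample and Sons Limited", "Unknown Trader"]

matches = reconcile(payees, registry)
print(matches)
```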
On a lobbying front, we might look for mentions of meetings
between government officials and company officers, and
then try to make mappings between government departments
and operational areas of a corporate sprawl, and so on.
What do you think?
[ This is part of an ongoing informal exploration of the patterns
and structures we can find across large open datasets.
For more information, follow:
- blog.ouseful.info
- @psychemedia
All comments welcome. ]

DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

OpenCorporates Co-Director Mapping Graphs

  • 1. OpenCorporates Co-Director Mapping Tony Hirst Dept of Communications and Systems, The Open University
  • 2. As company filings start to appear as open data, opportunities may arise for watchdogs to start mining this data in support of their investigations and monitoring activities. This presentation introduces several ideas relating to mapping network structures in order to learn something about the structure of “corporate sprawls”, corporate groupings defined on the basis of co-director relationships.
  • 4. To introduce the idea of a network map, let’s have a look at a view we can construct over the Twitter social space…
  • 6. This network map shows Twitter users who are commonly followed by the followers of @TOGYnews. Although hard to see at this scale, the map is actually constructed from labeled points connected by lines (in the jargon, “nodes connected by edges”). The algorithm used to position the labeled nodes tries to place nodes that are heavily connected to each other close to each other. In a sense, we can view the diagram as a map, with regions that are highlighted using false colours identifying clusters of nodes that may in some sense be similar to each other based on the sharing of common followers.
  • 8. The map is constructed using data grabbed from the Twitter API. Using one or more “focus” users (a specific Twitter account, for example, or the set of users of a particular hashtag), we grab a list of their followers.
  • 9. [Diagram: “Find Friends of Followers”; labels: focus, “is followed by”, A, B, peer, peer]
  • 10. For each of the followers, we grab a list of their friends (or a sample thereof) – that is, a list of some or all of the people they follow on Twitter. We can use this data to construct a network of people followed by the followers of the original focus. It is typically at this point, where there is most relational information contained within the network, that we lay it out using automatic layout tools.
  • 12. Drawing on the insight that people on Twitter are likely to follow accounts that are of interest to them, we can start to imagine the network as a projection of the interests of the people who are interested in one or more of the things the focus is associated with. However, interests of followers may spread to a wide range of topics, so we look for consistency of interest, pruning the network to remove people who are not commonly followed by the followers of the focus. That is, we remove nodes who are followed by only a few of the followers of the focus.
  • 14. Having laid out the network map, we might now tidy it up a little by removing all the nodes that are not themselves followed by a significant number of the followers of the original focus.
  • 16. The result is a map that shows groups of people positioned according to the shared projected presumed interests of their followers.
  • 18. It may also be possible to use metadata associated with social networks to develop additional insights. A recent paper describes one way of mining social network data for information about people working for a particular company, and using public biographical information along with social connection data to map out the organisational structures of large companies.
  • 20. A more principled way of looking at corporate structures at a company level may possibly be derived from publicly available corporate information.
  • 22. For example, if we can get hold of directorial appointment and termination data, we can start to construct maps that show how companies are connected by common directors, as well as which companies are co-directed by particular directors. As with the emergent social positioning network maps, if particular directors have particular corporate interests, we may be able to identify particular organisational groupings in corporate sprawls made up from dozens of operating companies working across a range of business areas.
  • 24. One possible source of open company information is OpenCorporates. OpenCorporates’ ambitious aim is to mint a unique corporate identifier for every corporate legal entity in the world [CHECK], as well as collating, and normalising (or “harmonising”) company information about company filings, trademarks, patents(?) and officers (that is company directors, company secretaries and so on). For GB registered companies, there is a growing repository of data relating to company directorships, which provides us with an opportunity to develop maps that show how companies are connected by virtue of having common directors.
  • 26. Just a note – my experience in looking at data related to GB registered companies suggests that the directors of the “top”/nominal company in a large multinational grouping are “atypical” compared to the officers appointed to UK based operating companies in the same corporate sprawl, being appointed from the great and the good, or from senior officers who do not take directorships across operating divisions or companies, rather than representing directors of operating companies. When seeding corporate sprawl trawlers – algorithms that try to identify companies that make up a corporate sprawl based on co-directorships – my experience suggests that it often makes sense to seed the search with one or more operating companies who have directors that are likely to be directors of other operating companies, rather than the “top level” company.
  • 28. We can reuse the ideas that underpin the construction of the emergent social positioning graph to map out corporate structures based on director information.
  • 30. As well as corporate information pages, OpenCorporates maintains information pages about directorial appointments. At the moment, there are no authority files providing identifiers that identify the same physical person – each directorial appointment to a company provides the director with a unique officer ID. It is possible to search for officers of other companies with the same name as a particular director, but there are no identifiers that link them as the same physical person. (That said, there does appear to be a slot in the metadata for authoritative identifiers.)
  • 32. So how might we go about constructing a corporate sprawl? Let’s start with one or more seed companies.
  • 34. The general shape of this diagram might remind you of something…? For each of the seed companies, we grab a list of their directors. We can use this data to construct a network of people who are directors or other officers of the original seed company or companies.
  • 36. Here’s another way of imagining it – a company surrounded by its directors.
  • 38. For each of the directors, we run a search for them on OpenCorporates, to see what directorial appointments have been made to other companies for people of exactly the same name. We can use this data to construct a network of companies directed by the directors of the original seed company. For those companies that are directed by N or more of the directors associated with the seed company or companies (where N is typically 2) we might now say they are part of the corporate sprawl. The companies sharing fewer than N directors associated with companies admitted to the corporate sprawl are added to a list of possible candidate companies. As we find more directors associated with companies included in the sprawl, we might be able to “legitimise” membership of these companies within the sprawl.
  • 40. We now have a larger set of companies, reflecting those companies who share N or more directors with the original seed company or companies.
  • 42. If we so decide, we can continue with this snowball discovery process, looking up further directors associated with companies we have included in our sprawl, with a view to trying to discover more companies that should be included in the sprawl.
  • 44. Using this snowball approach, I have constructed a scraper on Scraperwiki that mines OpenCorporates, given one or more seed companies (or seed directors) to map out corporate sprawls, limiting myself to the capture of current directors and active companies registered in the UK. (The code needs checking and is perhaps not as easy to use as it might be. Developing a more robust and user friendly tool may be worth exploring if this approach is seen to be useful.)
  • 46. So – we can generate a network that connects companies with their directors, and grow this network out to identify companies that share several directors. As with the emergent social positioning map, we can use automatic layout tools to try to position companies and directors close to each other based on their connectivity, producing a map over the corporate sprawl.
  • 48. We can view this network in various ways. For example, we might choose to view just the companies.
  • 50. This map shows companies in a corporate sprawl grown out from Royal Dutch Shell. Note the presence of BP in there – somehow, these two groupings are connected by shared directorships of some intermediate company.
  • 52. One of the nice things about representing this sort of structure in an abstract mathematical or computational way is that we can wrangle it with code... So for example, companies C1 and C2 are connected by a single shared director, whereas C2 and C3 are connected by two directors.
  • 54. We can represent this by transforming the original bipartite (two types of node) graph that connects directors to companies and companies to directors into a graph that just connects companies that are connected by directors. The thickness of the line (or “edge”) connecting the companies represents its “weight”, which in this case is given by the number of shared directors between connected companies.
  • 56. We can also filter the graph, for example by adding together the weights of all the edges incident on a node, and throwing away all nodes for whom this sum is below a specified threshold value. We might alternatively prune the network by removing (“cutting”) all edges below a specified weight, and then throwing away nodes that aren’t connected to other nodes. (For example, we might remove connections between companies that only share a single director, and then throw away companies that aren’t connected to any other companies. Which is to say, we cut out companies that don’t share two or more directors with any other single company. When you start working with graphs, you begin to realise quite how beautiful, and powerful, a way they are for working with data elements that are related to each other in some way.)
  • 58. Here’s an example of the Shell corporate sprawl with the directors removed and edges connecting companies that share two or more directors. The labels are sized relative to the PageRank score of each node, which is a measure of how well connected the node is in the graph (the “importance” of each node is dependent on the “importance” of the nodes connected to it…). The lines also provide a background that highlights the connectivity – and structure – of the corporate elements.
  • 60. In this view, I have resized the labels based on the betweenness centrality of each node. This network statistic highlights nodes that play an important role in connecting clusters or groupings of nodes. So for example, we see the suggestion that The Consolidated Petroleum Company and Shell Mex and BP Limited may be the companies that connect the Shell sprawl to the BP one.
  • 62. This is just a tweaking of the layout of the previous graph to try to highlight the separation of the different clusters.
  • 64. Just as we collapsed the network to show how companies could be linked directly by virtue of co-directorships, so we can collapse the network to show how directors are connected. For example, director D1 is connected by a single shared company to directors D2 and D3, whereas D2 and D3 are connected by two companies.
  • 66. Once again, we use line thickness (that is, edge weight) to denote how heavily connected directors are.
  • 68. Here’s a view over connected directors in the Shell corporate sprawl.
  • 70. As to how we get those graphs plotted? I built a crude workflow in Scraperwiki that gets data out of the scraped database and into a form that allows it to be visualised using the Gephi desktop tool or in a web page using different Javascript libraries (sigma.js or d3.js).
  • 72. This is Gephi – a cross-platform desktop tool that’s great for generating effective network visualisations. I have some tutorials and sample datasets if anyone wants to give it a whirl…
  • 73. “Where” Next…? - geocode registered addresses - explore non-GB registered companies
  • 74. So where can we take the OpenCorporates data next? I have a couple of ideas: - we can go spatial in a geographical sense and start to geocode the registered addresses of companies, to see whether any of them are located in offshore tax havens, for example, or to see whether there are different registered addresses that might lead us to yet more companies (by virtue of sharing common registered office addresses, rather than co-directors, for example); - we could start trying to tie non-GB registered companies into the mix. At the moment, director information for other territories is sparse – might there be some other way we can look for connections?
  • 75. And “When”? - company timelines (set-up dates, renaming) - explore director timelines (by company) - explore director timelines (by director)
  • 76. Another approach might be to start analysing corporate sprawls in a time dimension. There are several opportunities here: - if we have access to company formation and dissolution dates, we can map out a timeline of a corporate sprawl, which might reveal how companies change name, directorship or association with other companies; - if we get all the director information associated with a company, we can visualise how director appointments and terminations occurred across one or more companies, which might in turn reveal identifiable “features” that we might be able to associate with news or business restructuring events; - if we track down companies a particular director appears to be associated with, we can start to develop “career timelines” of directors, showing how they have been associated with different corporate groupings over time (and maybe the odd company on the side…)
  • 77. Linking out and in - linking companies or directors with external datasets
  • 78. Whilst it is possible to generate insight from the analysis of data that is contained just within OpenCorporates, there are likely to be many opportunities for using OpenCorporates to annotate other datasets, or for using external datasets to annotate OpenCorporates data.
  • 80. As this example starts to explore, we might try to reconcile company names as recorded in local spending data records with corporate entities identified within OpenCorporates to build up a better picture of how money flows into corporate sprawls. On a lobbying front, we might look for mentions of meetings between government officials and company officers, and then try to make mappings between government departments and operational areas of a corporate sprawl, and so on.
  • 82. [ This is part of an ongoing informal exploration of the patterns and structures we can find across large open datasets. For more information, follow: - blog.ouseful.info - @psychemedia All comments welcome. ]
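
The snowball discovery process described in slides 32 to 42 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Scraperwiki scraper itself: the function `grow_sprawl`, its parameters, the two lookup callables and the toy `BOARDS` data are all hypothetical, and a real run would replace the lookups with searches against OpenCorporates, matching directors by exact name as slide 38 describes.

```python
from collections import defaultdict

# Hypothetical toy data standing in for OpenCorporates lookups:
# company name -> set of current director names (exact-name matching only).
BOARDS = {
    "Seed Ltd": {"Ann Lee", "Bob Roy", "Cat Dow"},
    "Ops One Ltd": {"Ann Lee", "Bob Roy", "Dan Fox"},
    "Ops Two Ltd": {"Dan Fox", "Cat Dow"},
    "Side Co Ltd": {"Ann Lee", "Eve Kay"},
    "Unrelated Ltd": {"Zed Poe"},
}

def directors_of(company):
    """Directors of a company (stand-in for a company officers lookup)."""
    return BOARDS.get(company, set())

def companies_of(director):
    """Companies with a director of exactly this name (stand-in for an officer search)."""
    return {c for c, ds in BOARDS.items() if director in ds}

def grow_sprawl(seeds, directors_of, companies_of, min_shared=2, max_rounds=3):
    """Snowball out from seed companies via shared director names.

    A company is admitted once it shares at least `min_shared` directors
    (the N of slide 38) with companies already in the sprawl; companies
    sharing fewer are kept as candidates that later rounds may legitimise.
    """
    sprawl = set(seeds)
    sprawl_directors = set()
    candidates = {}
    frontier = set(seeds)
    for _ in range(max_rounds):
        if not frontier:
            break
        # Collect the directors of the companies admitted in the last round.
        for company in frontier:
            sprawl_directors.update(directors_of(company))
        # For each outside company, record which sprawl directors sit on it.
        shared = defaultdict(set)
        for director in sprawl_directors:
            for company in companies_of(director):
                if company not in sprawl:
                    shared[company].add(director)
        frontier = {c for c, ds in shared.items() if len(ds) >= min_shared}
        candidates = {c: ds for c, ds in shared.items() if len(ds) < min_shared}
        sprawl |= frontier
    return sprawl, candidates

sprawl, candidates = grow_sprawl({"Seed Ltd"}, directors_of, companies_of)
print(sorted(sprawl))
print(candidates)
```

In the toy data, "Ops Two Ltd" shares only one director with the seed company, so it starts out as a candidate. It is admitted in the second round, once a director of "Ops One Ltd" is also found on its board. "Side Co Ltd" never shares more than one director with the sprawl and so remains a candidate.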

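The graph transformations described in slides 52 to 56 and 64 to 66 can also be sketched in plain Python. This is again a minimal sketch under assumptions: the function names `project`, `cut_edges` and `weighted_degree` are made up for illustration, and the appointments are the C1/C2/C3 example from slide 52 rather than real OpenCorporates data. (A graph library would do the same job; this version simply avoids any dependency.)

```python
from collections import Counter
from itertools import combinations

# (director, company) appointments: the bipartite graph of slide 52, in which
# C1 and C2 share one director and C2 and C3 share two.
appointments = [
    ("D1", "C1"), ("D1", "C2"),
    ("D2", "C2"), ("D2", "C3"),
    ("D3", "C2"), ("D3", "C3"),
]

def project(pairs):
    """Collapse (linking node, projected node) pairs into weighted edges.

    Two projected nodes get an edge whose weight is the number of linking
    nodes they have in common (for example, companies joined by directors).
    """
    groups = {}
    for link, node in pairs:
        groups.setdefault(link, set()).add(node)
    weights = Counter()
    for nodes in groups.values():
        for a, b in combinations(sorted(nodes), 2):
            weights[(a, b)] += 1
    return weights

def cut_edges(weights, min_weight=2):
    """Drop edges lighter than min_weight, then drop nodes left unconnected."""
    kept = {edge: w for edge, w in weights.items() if w >= min_weight}
    nodes = {n for edge in kept for n in edge}
    return kept, nodes

def weighted_degree(weights):
    """Sum of the weights of the edges incident on each node."""
    totals = Counter()
    for (a, b), w in weights.items():
        totals[a] += w
        totals[b] += w
    return totals

company_edges = project(appointments)
kept_edges, kept_companies = cut_edges(company_edges, min_weight=2)
degrees = weighted_degree(company_edges)

# Swapping the roles of director and company gives the director-director view.
director_edges = project((company, director) for director, company in appointments)

print(dict(company_edges), kept_companies, dict(degrees), dict(director_edges))
```

Cutting edges of weight below 2 leaves only C2 and C3, which is the "share two or more directors" filter of slide 56. The weighted degree totals support the alternative filter, where nodes whose summed edge weight falls below a threshold are thrown away.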
Editor's Notes

  1. As company filings start to appear as open data, opportunities may arise for watchdogs to start mining this data in support of their investigations and monitoring activities.This presentation introduces several ideas relating to mapping network structures in order to learn something about the structure of “corporate sprawls”, corporate groupings defined on the basis of co-director relationships.
  2. To introduce the idea of a network map, let’s have a look at a view we can construct over the Twitter social space…
  3. This network maps shows Twitter users who are commonly followed by the followers of @TOGYnewsAlthough hard to see at this scale, the map is actually constructed from labeled points connected by lines (in the jargon, “nodes connected by edges”).The algorithm used to position the labeled nodes tries to place nodes that are heavily connected to each other close to each other. In a sense, we can view the diagram as a map, with regions that are highlighted using false colours identifying clusters of nodes that may in some sense be similar to each other based on the sharing of common followers.
  4. The map is constructed using data grabbed from the Twitter API.Using one or more “focus” users (a specific Twitter account, for example, or the set of users of a particular hashtag), we grab a list of their followers.
  5. For each of the followers, we grab a list of their friends (or a sample thereof) – that is, a lists of some or all of the people they follow on Twitter.We can use this data to construct a network of people followed by the followers of the original focus.It is typically at this point, where there is most relational information contained within the network, that we lay it out using automatic layout tools.
  6. Drawing on the insight that people on Twitter are likely to follow accounts that are of interest to them, we can start to imagine the network as a projection of the interests of the people who are interested in one or more of the things the focus is associated with.However, interests of followers may spread to a wide range of topics, so we look for consistency of interest, pruning the network to remove people who are not commonly followed by the followers of the focus. That is, we remove nodes who are followed by only a few of the followers of the focus.
  7. Having laid out the network map, we might now tidy it up a little by removing all the nodes that are not themselves followed by a significant number of the followers of the original focus,
  8. The result is a map that shows groups of people positioned according to the shared projected presumed interests of their followers.
  9. It may also be possible to use metadata associated with social networks to develop additional insights.A recent paper describes one way of mining social network data for information about people working for a particular company, and using public biographical information along with social connection data to map out the organisational structures of large companies.
  10. A more principled way of looking at corporate structures at a company level may possibly be derived from publicly available corporate information.
  11. For example, if we can get hold of directorial appointment and termination data, we can start to construct maps that who how companies are connected by common directors, as well as which companies are co-directed by particular directors.As with the emergent social positioning network maps, if particular directors have particular corporate interests, we may be able to identify particular organisational groupings in corporate sprawls made up from dozens of operating companies working across a range of business areas.
  12. One possible source of open company information is OpenCorporates.OpenCorporates’ ambitious aim is to mint a unique corporate identifier for every corporate legal entity in the world [CHECK], as well as collating, and normalising (or “harmonising”) company information about company filings, trademarks, patents(?) and officers (that is company directors, company secretaries and so on).For GB registered companies, there is a growing repository of data relating to company directorships, which provides us with an opportunity to develop maps that show how companies are connected by virtue of having common directors.
  13. Just a note – my experience in looking at data related to GB registered companies suggests that the directors of the “top”/nominal company in a large multinational grouping are “atypical” compared to the officers appointed to UK based operating companies in the same corporate sprawl, being appointed from the great and the good, or from senior officers who do not take directorships across operating divisions or companies, rather than representing directors of operating companies.When seeding corporate sprawl trawlers – algorithms that try to identify companies that make up a corporate sprawl based on co-directorships – my experience suggests that it often makes sense to see the search with one or more operating companies who have directors that are likely to be directors of other operating companies, rather than the “top level” company.
  14. To introduce the idea of a network map, let’s have a look at a view we can construct over the Twitter social space…
  15. As well as corporate information pages, OpenCorporates maintains information pages about directorial appointments. At the moment, there are no authority files providing identifiers that identify the same physical person – each directorial appointment to company provides the director with a unique officer ID. It is possible to search for officers of other companies with the same name as a particular director, but no identifiers that link them as the same physical person. (That said, there does appear to be a slot in the metadata for authoritative identifiers.)
  16. So how might we go about constructing a corporate sprawl?Let’s start with one or more seed company.
  17. The general shape of this diagram might remind you of something…?For each of the seed companies, we grab a list of their directors.We can use this data to construct a network of people who are directors or other officers of the original seed company or companies.
  18. Here’s another way of imagining it – a company surrounded by its directors.
  19. For each of the directors, we run a search for them on OpenCorporates, to see what directorial appointments have been made to other companies for people of exactly the same name.We can use this data to construct a network of companies directed by the directors of the original seed company.For those companies that are directed by N or more of the directors associated with the seed company or companies (where N is typically 2) we might now say they are part of the corporate sprawl. The companies sharing fewer than N directors associated with companies admitted to the corporate sprawl are added to a list of possible candidate companies. As we find more directors associated with companies included in the sprawl, we might be able to “legitimise” membership of these companies within the sprawl.
  20. We now have a larger set of companies, reflecting those companies who share N or more directors with the original seed company or companies.
  21. If we so decide, we can continue with this snowball discovery process, looking up further directors associated with companies we have included in our sprawl, with a view to trying to discover more companies that should be included in the sprawl.
  22. Using this snowball approach, I have constructed a scraper on Scraperwiki that mines OpenCorporates, given one or more seed companies (or seed directors) to map out corporate sprawls, limiting myself to the capture of current directors and active companies registered in the UK.(The code needs checking and is perhaps not as easy to use as it might be. Developing a more robust and user friendly tool may be worth exploring if this approach is seen to be useful.)
  23. So – we can generate a network that connects companies with their directors, and grow this network out to identify companies that share several directors.As with the emergent social positioning map, we can use automatic layout tools to try to position companies and directors close to each other based on their connectivity, producing a map over the corporate sprawl.
  24. We can view this network in various ways. For example, we might choose to view just the companies.
  25. This map shows companies in a corporate sprawl grown out from Royal Dutch Shell.Note the presence of BP in there – somehow, these two groupings are connected by shared directorships of some intermediate company.
  26. One of the nice things about representing this sort of structure in an abstract mathematical or computational way is that we can wrangle it with code...So for example, companies C1 and C2 are connected by a single shared director, whereas C2 and C3 are connected by two directors.
  27. We can represent this by transforming the original bipartite (two types of node) graph, which connects directors to companies and companies to directors, into a graph that just connects companies that share directors. The thickness of the line (or “edge”) connecting the companies represents its “weight”, which in this case is given by the number of shared directors between connected companies.
  28. We can also filter the graph, for example by adding together the weights of all the edges incident on a node, and throwing away all nodes for which this sum is below a specified threshold value. We might alternatively prune the network by removing (“cutting”) all edges below a specified weight, and then throwing away nodes that aren’t connected to other nodes. (For example, we might remove connections between companies that only share a single director, and then throw away companies that aren’t connected to any other companies. Which is to say, we cut out companies that don’t share two or more directors with any other single company. When you start working with graphs, you begin to realise quite how beautiful, and powerful, a way they are for working with data elements that are related to each other in some way.)
  29. Here’s an example of the Shell corporate sprawl with the directors removed and edges connecting companies that share two or more directors. The labels are sized relative to the PageRank score of each node, which is a measure of how well connected the node is in the graph (the “importance” of each node is dependent on the “importance” of the nodes connected to it…). The lines also provide a background that highlights the connectivity – and structure – of the corporate elements.
  30. In this view, I have resized the labels based on the betweenness centrality of each node. This network statistic highlights nodes that play an important role in connecting clusters or groupings of nodes. So for example, we see the suggestion that The Consolidated Petroleum Company and Shell Mex and BP Limited may be the companies that connect the Shell sprawl to the BP one.
  31. This is just a tweaking of the layout of the previous graph to try to highlight the separation of the different clusters.
  32. Just as we collapsed the network to show how companies could be linked directly by virtue of co-directorships, so we can collapse the network to show how directors are connected. For example, director D1 is connected by a single shared company to directors D2 and D3, whereas D2 and D3 are connected by two companies.
  33. Once again, we use line thickness (that is, edge weight) to denote how heavily connected directors are.
  34. Here’s a view over connected directors in the Shell corporate sprawl.
  35. As to how we get those graphs plotted? I built a crude workflow in Scraperwiki that gets data out of the scraped database and into a form that allows it to be visualised using the Gephi desktop tool, or in a web page using different JavaScript libraries (sigma.js or d3.js).
  36. This is Gephi – a cross-platform desktop tool that’s great for generating effective network visualisations. I have some tutorials and sample datasets if anyone wants to give it a whirl…
  37. So where can we take the OpenCorporates data next? I have a couple of ideas: we can go spatial in a geographical sense and start to geocode the registered addresses of companies, to see whether any of them are located in offshore tax havens, for example, or to see whether there are different registered addresses that might lead us to yet more companies (by virtue of sharing common registered office addresses, rather than co-directors, for example); we could start trying to tie non-GB registered companies into the mix. At the moment, director information for other territories is sparse – might there be some other way we can look for connections?
  38. Another approach might be to start analysing corporate sprawls in a time dimension. There are several opportunities here: if we have access to company formation and dissolution dates, we can map out a timeline of a corporate sprawl, which might reveal how companies change name, directorship or association with other companies; if we get all the director information associated with a company, we can visualise how director appointments and terminations occurred across one or more companies, which might in turn reveal identifiable “features” that we might be able to associate with news or business restructuring events; if we track down companies a particular director appears to be associated with, we can start to develop “career timelines” of directors, showing how they have been associated with different corporate groupings over time (and maybe the odd company on the side…)
  39. Whilst it is possible to generate insight from the analysis of data that is contained just within OpenCorporates, there are likely to be many opportunities for using OpenCorporates to annotate other datasets, or for using external datasets to annotate OpenCorporates data.
  40. As this example starts to explore, we might try to reconcile company names as recorded in local spending data records with corporate entities identified within OpenCorporates to build up a better picture of how money flows into corporate sprawls. On a lobbying front, we might look for mentions of meetings between government officials and company officers, and then try to make mappings between government departments and operational areas of a corporate sprawl, and so on.
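
The graph operations described in slides 26 to 28 (collapsing the bipartite company/director graph into a weighted company graph, then filtering or pruning it) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the scraper code mentioned in the talk: the `appointments` data, the company and director labels, and all function names are hypothetical, and the example uses only the standard library.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each company mapped to its set of current directors.
# C1 and C2 share one director (D2); C2 and C3 share two (D3 and D4).
appointments = {
    "C1": {"D1", "D2"},
    "C2": {"D2", "D3", "D4"},
    "C3": {"D3", "D4"},
    "C4": {"D5"},
}


def project_companies(appointments):
    """Collapse the bipartite graph to company-company edges.

    The weight of each edge is the number of directors the two
    companies share; pairs with no shared director get no edge.
    """
    weights = {}
    for a, b in combinations(sorted(appointments), 2):
        shared = appointments[a] & appointments[b]
        if shared:
            weights[(a, b)] = len(shared)
    return weights


def prune_edges(weights, min_weight):
    """Cut edges lighter than min_weight, then keep only connected nodes."""
    kept = {edge: w for edge, w in weights.items() if w >= min_weight}
    nodes = {company for edge in kept for company in edge}
    return kept, nodes


def filter_by_strength(weights, threshold):
    """Drop nodes whose summed incident edge weight is below threshold."""
    strength = defaultdict(int)
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
    keep = {company for company, s in strength.items() if s >= threshold}
    return {
        (a, b): w for (a, b), w in weights.items() if a in keep and b in keep
    }


edges = project_companies(appointments)
pruned_edges, pruned_nodes = prune_edges(edges, min_weight=2)
print(edges)
print(pruned_edges, sorted(pruned_nodes))
```

With this toy data, the projection gives the C1 to C2 edge a weight of 1 and the C2 to C3 edge a weight of 2. Pruning with `min_weight=2` leaves only C2 and C3, which corresponds to keeping companies that share two or more directors with some other single company. The director-to-director view from slide 32 would follow the same pattern with the mapping inverted (each director mapped to a set of companies).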