Some slide prompts to support a data framing investigation around corporate data – originally prepared for the OGP Festival, London, October 2013.
For more information, contact: schoolOfData.org
1
These notes provide a worked example of how to download company ownership relationship data from OpenCorporates (opencorporates.com) using the cross-platform data cleaning tool OpenRefine (openrefine.org), and then visualise the data using the cross-platform Gephi network visualisation tool (gephi.org).
2
OpenCorporates is a private company that has set itself the ambitious task of building a database of registered company information for every legal corporate entity in the world.
One of the views OpenCorporates offers over at least some of the data in its database shows how companies are connected by beneficial ownership or shareholder relationships.
Although complex, this diagram is “human readable” – the data is presented in a way that is intended to make some sort of meaningful sense to us.
3
But as well as publishing data for us humans to read, OpenCorporates also makes data available in a way that machines can read: machine-readable data.
You may have heard of the term “API” in the context of data publishing websites. To all intents and purposes, an API is an interface that computers can use to get information out of websites in a way that they, and the databases they work with, can understand.
The data is published in a format known as JSON – JavaScript Object Notation. But you don’t really need to know much more than that – just that it’s called JSON, and tools that can parse and work with JSON can parse and work with the data that the OpenCorporates API publishes.
4
If you aren’t a programmer, here’s a way of getting the data out of OpenCorporates and into a tabular form you may be more comfortable with, and which we can use to generate a network diagram to display in a tool such as Gephi…
You can download the OpenRefine application from openrefine.org. When you run it on your computer, it will launch an application that runs inside a browser tab using your default web browser.
5
We can get company ownership (subsidiary relations, major shareholdings, etc.) from OpenCorporates by hacking the web address/URL of a company page on OpenCorporates.
From a company page on OpenCorporates, which should have the form: http://opencorporates.com/companies/JURISDICTION/COMPANY_ID
add the following to the end of the web address/URL: /network.json?depth=2
to give something with the following form: http://opencorporates.com/companies/JURISDICTION/COMPANY_ID/network.json?depth=2
(Note: company network data may not be available in all jurisdictions or for all companies.)
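For anyone who would rather script the URL edit than do it by hand, here is a minimal Python sketch of the same step. The function name `network_url` is made up for illustration, and the JURISDICTION and COMPANY_ID parts are placeholders, as in the notes above.

```python
from urllib.parse import urlsplit, urlunsplit


def network_url(company_page_url, depth=2):
    """Turn a company page URL into the matching network.json URL."""
    parts = urlsplit(company_page_url)
    # Drop any trailing slash before appending the network.json suffix.
    path = parts.path.rstrip("/") + "/network.json"
    # Rebuild the address with depth as the only query parameter.
    return urlunsplit((parts.scheme, parts.netloc, path, f"depth={depth}", ""))


print(network_url("http://opencorporates.com/companies/JURISDICTION/COMPANY_ID"))
```

The sketch only builds the address; it does not fetch anything, and the caveat about data availability still applies.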
6
In OpenRefine, select the option to Create [a new] Project using the web address – or URL – to the JSON data page that reveals the data relating to the corporate ownership network of the company we are interested in on OpenCorporates.
Note that you can import data into OpenRefine from several web addresses all in one go, though the data returned from each URL should have the same format or structure.
Using multiple URLs results in a combined data set, which can be quite handy.
7
Being machine readable, the data makes more sense to OpenRefine than it probably does to us!
Select a block of data in the preview view that is typical of a set of data that you want to map into a single row in a “traditional” spreadsheet-like view.
Data blocks are typically contained within braces (curly brackets); these things: { }
Note that in some machine readable data, some data blocks may be contained within other data blocks…
Each of the items in a single data block can be mapped into a separate cell – that is, a separate column – in a single row of data.
So each data block is a row, and each item in the block is a column… OpenRefine will give you a preview of how the data will look if you click the right button!
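To make the "block becomes a row, item becomes a column" idea concrete, here is a small Python sketch. The JSON sample and its field names (`relationships`, `parent`, `child`, `type`) are invented for illustration and are not the actual OpenCorporates response format; the " - " separator used to name columns from nested blocks is likewise an arbitrary choice.

```python
import json

# Hypothetical sample data: two data blocks, each with nested blocks inside.
sample = """
{"relationships": [
  {"parent": {"name": "Parent Co", "id": "p1"},
   "child": {"name": "Child A", "id": "c1"},
   "type": "subsidiary"},
  {"parent": {"name": "Parent Co", "id": "p1"},
   "child": {"name": "Child B", "id": "c2"},
   "type": "shareholding"}
]}
"""


def flatten(block, prefix=""):
    """Map one data block to one row: each item becomes a named column."""
    row = {}
    for key, value in block.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # A block inside a block: fold its items into the same row.
            row.update(flatten(value, column + " - "))
        else:
            row[column] = value
    return row


rows = [flatten(block) for block in json.loads(sample)["relationships"]]
for row in rows:
    print(row)
```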
8
You can preview the effect of making particular block selections using Update Preview.
To return to the block highlighter, use ‘Pick Record Nodes’.
When you are happy with your selection, you are ready to “Create Project”.
9
Once we’re happy with the data preview, we can import the data into a more familiar looking layout.
The arrows at the top of each column open pop-up menus that allow us to run a wide variety of operations on a column.
One of the operations lets us change the column name, so I’m going to rename the child company and parent company columns to what Gephi expects: Source and Target.
10
This is the format that Gephi wants to see when we import data from a simple two-column, comma-separated values (CSV) text file.
One of the columns needs to be called Source, another needs to be called Target. When constructing the network diagram, Gephi then knows to draw a line going from each Source element to the corresponding Target.
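As a rough illustration of what such a file contains, the Python sketch below writes a Source/Target edge list as CSV text. The company names and the helper name `edges_to_csv` are placeholders, not part of either tool.

```python
import csv
import io

# Placeholder edge list: each pair is one line in the eventual network diagram.
edges = [("Company A", "Company B"), ("Company A", "Company C")]


def edges_to_csv(edges):
    """Return CSV text with the Source and Target header row described above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


print(edges_to_csv(edges))
```

Saving the returned text to a file with a .csv extension would give a file of the same general shape as the one exported from OpenRefine later in these notes.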
11
The OpenCorporates network data in tabulated form. The default column names are not necessarily as human readable as they could be!
In particular, we can identify the name of the parent company and the child company for each ownership relation. We also have access to the OpenCorporates IDs for all of those companies. The type of relationship between the companies is also described. For the moment, we will treat them all equally.
(If you want to view just those company connections that relate to a particular type of relation, use the Facet or Text Filter tool applied to the appropriate column.)
12
From the appropriate column menu, select “Edit Column” and then “Rename this column” to change the column name.
13
We can now export the data using the Custom Tabular Exporter.
Deselect all the columns then select just the Source and Target columns – we will only export data from these two columns.
14
Preview your data to check that it looks like the sort of data you expect to export.
From the Download tab, select the CSV output type and export your data – it should be saved into the default download directory used by your browser, with a file name that corresponds to the OpenRefine project name.
You should now have the two-column data saved on your computer, ready to load into Gephi.
15
Gephi is a powerful cross-platform desktop tool for visualising data that describes networks, such as social networks or corporate ownership networks. You can import data into Gephi using specialised graph/network representation formats, or from simple two-column data files where each row describes a simple connection between two elements (e.g. thing1, thing2 would say that thing1 connects to thing2).
You can download the Gephi application from gephi.org. When you run it on your computer, it will launch a desktop application. Note that Gephi requires Java – if you are on a Mac, you may need to download and install Java yourself: www.java.com
16
Launch Gephi (download it from gephi.org if you don’t already have it installed) and select Data Laboratory.
If the Data Table toolbar is empty, go to the application’s File menu and select ‘New Project’. A new project will be created and you should see several toolbar options appear in the Data Table.
17
Load the data in using the “Import Spreadsheet” tool option. Make sure that you select Edges table as the table type.
If your data file does not have Source and Target column names, an error will occur and you will not be able to import the data file. (In such a case, you could always open the file in a text editor, change the column names in the file, save it, and try again. Alternatively, go in to OpenRefine, change the column names there, and re-export the custom tabulated data…)
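If you want to check a file before importing it, a quick Python sketch along these lines would do. The helper name `has_edge_columns` is made up, and the exact, case-sensitive match it performs is an assumption that may be stricter than Gephi itself.

```python
import csv
import io


def has_edge_columns(csv_text):
    """Check that the first row of the CSV text names Source and Target columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return "Source" in header and "Target" in header


print(has_edge_columns("Source,Target\nCompany A,Company B\n"))
print(has_edge_columns("child,parent\nCompany A,Company B\n"))
```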
18
The final stage of the import gives some additional information about how uploaded data will be treated.
Because we are simply loading in data that describes how one company (identified by its name) is connected to another company (also identified by its name), we need to get Gephi to automatically create a node each time it sees a new company (as identified by its company name…).
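The idea of creating a node the first time a company name is seen can be sketched in a few lines of Python. This illustrates the principle only, not how Gephi implements it; the function name and sample names are placeholders.

```python
def nodes_from_edges(edges):
    """List each distinct name once, in the order it first appears in the edges."""
    nodes = []
    seen = set()
    for source, target in edges:
        for name in (source, target):
            if name not in seen:
                # First sighting of this company name: create its node.
                seen.add(name)
                nodes.append(name)
    return nodes


edges = [("Company A", "Company B"), ("Company A", "Company C"), ("Company C", "Company B")]
print(nodes_from_edges(edges))
```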
19
When the data is imported, we can preview it, either by looking at a list of nodes that have been created, or ‘edges’ – that is, connections between two companies.
20
So now let’s see where we can start to view this data as a network visualisation.
Click on the Overview button in the top palette to get an overview of the network in visual form. This is the area where we can interactively visualise the network.
21
The default Overview layout has three main areas:
- in the middle is the canvas where we can see the current layout of the network; along the left hand side of the central panel are several tools for operating on the elements shown on the canvas; along the bottom of the central panel are several tools for controlling how text labels are displayed;
- to the left are several tools for manipulating what the network looks like: tools for laying out the network (that is, positioning the nodes) automatically, as well as colouring and sizing the nodes;
- to the right are several tools that allow us to analyse and process the graph (that is, the mathematical structure that defines the network); for example, we can run various statistics on the network, or filter the nodes that are displayed according to one or more specified criteria.
22
Let’s start by laying out the network. There are several layout tools provided by default (you can install more from the Tools->Plugins menu) which each have slightly different behaviours and can be differently effective at laying out networks with different sorts of structure.
A couple of good all-round layout algorithms are:
- ForceAtlas2
- Yifan Hu
If you imagine connected nodes held together by springs, you can think of these layout tools as trying to position the nodes so that the springs are stretched as little as possible. Sort of.
23
At the moment, we don’t know what each node represents. By default, when labels are switched on, Gephi looks for a label column value associated with a node and displays that. But we can also display other values. In this case, we are using a company name as the node ID, so we can select id as the element to display when we switch labels on. Click on the clipboard icon on the toolbar at the bottom of the screen to raise the label selector.
To actually switch labels on, click on the leftmost/darkest T button on the toolbar at the bottom of the screen.
The slider on the right controls the text label size.
24
We can also change the size of labels proportional to the size of a node – but how do we size nodes?
Whilst it is possible to load in data that describes various attributes associated with each node (for example, in the case of a company node it might be the turnover or profit in the last financial year), we can also generate information about each node based on various network properties.
For example, the degree of a node says how many connections it has with other nodes. Where connections are ‘directed’ – that is, represented by arrows – the number of arrows that leave a node is referred to as the out-degree of the node, and the number of arrows that come into a node as the in-degree.
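The definitions of degree, in-degree and out-degree can be worked through on a tiny directed edge list. The Python sketch below uses placeholder node names and a made-up function name; it follows the definitions just given rather than any particular tool's implementation.

```python
from collections import Counter


def degrees(edges):
    """Count arrows out of, and into, each node of a directed edge list."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    nodes = set(out_degree) | set(in_degree)
    return {
        node: {
            "in": in_degree[node],
            "out": out_degree[node],
            # Total degree counts every arrow that touches the node.
            "degree": in_degree[node] + out_degree[node],
        }
        for node in nodes
    }


edges = [("A", "B"), ("A", "C"), ("C", "B")]
result = degrees(edges)
for node in sorted(result):
    print(node, result[node])
```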
25
We can use the Average Degree statistic tool to calculate the degree, in-degree and out-degree values for each node.
We can then use these values as the basis for sizing the nodes in the network visualisation.
26
Here we have sized the nodes by Degree. The min and max size parameters can be set as required to scale the size of the nodes.
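One way to picture what the min and max size parameters do is as a linear rescaling of the degree values. The Python sketch below assumes a straight linear interpolation, which may not be exactly how Gephi maps values to sizes; the function name and default sizes are made up.

```python
def scale_sizes(values, min_size=10.0, max_size=50.0):
    """Linearly map each node's value onto the range min_size..max_size."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        # Every node has the same value, so give them all the minimum size.
        return {node: min_size for node in values}
    span = high - low
    return {
        node: min_size + (value - low) / span * (max_size - min_size)
        for node, value in values.items()
    }


node_degrees = {"A": 1, "B": 3, "C": 5}
print(scale_sizes(node_degrees, min_size=10.0, max_size=50.0))
```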
27
We can set the label size so that it is proportional to the node size – from the black/dark A label on the toolbar at the bottom of the screen, select the [proportional to] Node Size menu option.
28
As well as tools for generating grand-scale layouts, there are also layout tools for tweaking a particular layout.
The Expansion tool just stretches (or shrinks) the layout in the x and y directions. This can be good for just putting a bit of space into a layout.
The Label Adjust tool juggles nodes so that their labels don’t overlap. Note that this tool may move some nodes quite a distance compared to their neighbours and so may upset any meaningful spatial relationships obtained using the other layout tools.
29
We can colour and size nodes according to a wide range of properties obtained from running various network statistics.
As you work with network data more and more, you start to get a feel for which tools to use to help you look for particular patterns, structures and stories within the data. But that is a tutorial for another day…
30
We can use various tools in concert to tweak the layout of the network.
In this example, I have:
- sized the nodes by degree;
- set the label sizes proportional to the Degree;
- tweaked the scale using the text-size slider;
- used the Authority value (obtained via the HITS statistic) to colour the nodes;
- laid out the network using a ForceAtlas2 algorithm, a bit of Expansion and a dash of Label Adjust.
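For the curious, the Authority value comes from the HITS algorithm, which gives every node a hub score and an authority score and repeatedly updates one from the other. The Python sketch below is a bare-bones version of that iteration on placeholder nodes; the function name, the fixed iteration count and the normalisation are choices made here, and Gephi's own implementation and stopping rule may differ.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed edge list."""
    nodes = {name for edge in edges for name in edge}
    hub = {name: 1.0 for name in nodes}
    authority = {name: 1.0 for name in nodes}

    def normalise(scores):
        # Scale the scores so that their squares sum to one.
        length = sum(value * value for value in scores.values()) ** 0.5
        return {n: (v / length if length else 0.0) for n, v in scores.items()}

    for _ in range(iterations):
        # A node's authority is the sum of the hub scores of nodes pointing at it.
        authority = normalise(
            {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        )
        # A node's hub score is the sum of the authority scores it points at.
        hub = normalise(
            {n: sum(authority[t] for s, t in edges if s == n) for n in nodes}
        )
    return hub, authority


edges = [("A", "B"), ("A", "C"), ("D", "B")]
hub, authority = hits(edges)
for node in sorted(authority):
    print(node, round(authority[node], 3))
```

In this toy network B is pointed at by two nodes and C by one, so B ends up with the higher authority score, while A and D, which nothing points at, score zero.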
31
If you want to know more, contact us…
32