Scoda project companygraph

This tutorial describes how to use network analysis tools to visually explore the links between companies working on the same contract.

1

The example dataset we will use comes from the World Bank. Each row represents a contract. Inspecting the column names tells us what data we have available about each contract. Looking at the data, we can see how we could order the companies based on the value of the total contract amount; or we might order the contracts by time; or we might look to see which contracts were awarded in a particular project, or to a particular company in the event of the same company being awarded more than one contract.

2

We might also wish to look for patterns in the data that show us how the things described in one row might connect to things described in other rows. For example, can we organise the data somehow to see which companies are associated with which projects? Could a network style visualisation help us do this?

3

But if we were to draw a network, what sort of thing should we connect to what? And how would we know what to connect to each other? One way is to look at the data… at which point we might notice that some of the entries within a column take on the same value. This means that we can “connect” the data that appears in different rows using these common elements…

4

So what columns have usefully repeating elements? The projects column certainly has repeating elements, so we should be able to draw diagrams that show all the companies that connect to each project. And if a company is associated with more than one project, it should in a certain sense be seen to join those projects together…

5

A few of the contract numbers repeat, so it might be interesting to explore the extent to which companies connect to contracts. If two different companies are associated with the same contracts, that might be interesting.
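As a rough sketch of that idea, we could group suppliers by contract number and keep only the contracts that more than one company appears on. The contract numbers and company names below are made up for illustration; they are not taken from the World Bank data.

```python
from collections import defaultdict

# Hypothetical (contract number, supplier) pairs; values are invented.
rows = [
    ("C-001", "Company A"),
    ("C-001", "Company B"),
    ("C-002", "Company A"),
    ("C-003", "Company C"),
]

# Collect the set of suppliers seen against each contract number.
suppliers_by_contract = defaultdict(set)
for contract_no, supplier in rows:
    suppliers_by_contract[contract_no].add(supplier)

# Keep only the contracts that connect two or more different companies.
shared_contracts = {
    contract_no: sorted(suppliers)
    for contract_no, suppliers in suppliers_by_contract.items()
    if len(suppliers) > 1
}
print(shared_contracts)
```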

6

Let’s get some data so we can start to explore the network…

7

We just need to do a little bit of tidying of the data before we make use of it. The major problem is that the Total Contract Amount column does not contain numbers, as such… In particular, we need to get rid of the dollar sign. Let’s create a new column into which we can put the cleaned values.

8

This little bit of code says: take the value of each cell in the original column and replace the $ symbol with nothing (that is, an empty string). In other words, delete the dollar sign… Put this value in the corresponding cell of the new column, and make the cell a number type.
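If you wanted to do the same tidying step in a script instead, a minimal Python sketch might look like the following. The function name clean_amount is hypothetical, and stripping thousands separators (commas) is an extra assumption on our part; the step described above only removes the dollar sign.

```python
def clean_amount(raw):
    """Turn a string such as "$1,250,000.00" into a number.

    Removing commas is an added assumption; the tutorial step itself
    only deletes the dollar sign.
    """
    return float(raw.replace("$", "").replace(",", "").strip())


print(clean_amount("$1,250,000.00"))
```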

9

Now we can export the data using the Custom Tabular Exporter, which allows us to select just those columns we want to export. (This can be very handy when a table has a large number of columns that we are not interested in!) I have rearranged the cells in the Custom Tabular Exporter simply by clicking on them and dragging them around. We just want three columns for now: Project ID, Supplier, and our new Amount column. Now that you know how to export the data just a few columns at a time, once you are comfortable with the process of visualising the data, you should be able to take other slices through the data (such as companies related to contracts) and visualise them yourself. You might also like to try using a similar method on a data set of your own…

10

There’s a final bit of tidying to do before we can use this data in Gephi, the application we’ll be using to visualise the network. In particular, Gephi expects the data to be presented to it with particular column names. Open the exported CSV data in a text editor and rename the columns in the header row to Source,Target,Weight (with no spaces after the commas). Note – you could have also renamed the columns in OpenRefine before exporting them…
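The export and rename steps could also be sketched in a few lines of Python. This assumes the cleaned rows are already in memory with the column names used above (Project ID, Supplier, Amount); the values are invented, and the output is written to a string here purely so the result is easy to inspect.

```python
import csv
import io

# Hypothetical cleaned rows; column names follow the tutorial, values are made up.
rows = [
    {"Project ID": "P001", "Supplier": "Company A", "Amount": 1000.0},
    {"Project ID": "P002", "Supplier": "Company B", "Amount": 250.5},
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# Header names for the edge table: Source, Target, Weight.
writer.writerow(["Source", "Target", "Weight"])
for row in rows:
    # Project is the source, supplier the target, amount the edge weight.
    writer.writerow([row["Project ID"], row["Supplier"], row["Amount"]])

edge_csv = buffer.getvalue()
print(edge_csv)
```

To save the result to disk instead, the same writer can be pointed at a file opened with newline="".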

11


12

Network diagrams allow us to show relationships between different things. Networks are referred to in mathematical terms as graph structures, or graphs. You may be more familiar with thinking of things like line charts and bar charts as graphs, but when it comes to networks, we use the term graph to describe the mathematical structure that defines the network. The circles – or nodes – represent “things” in the network, in this case, particular companies or projects. The lines – or edges – represent relationships between the things in the network. In this example, the edges represent contracts that associate a particular company with one or more projects (or conversely, associate a project with one or more companies). Where nodes are placed in the diagram can be used to convey information about the structure of the network. Many different algorithms exist to lay out (that is, place, or position) the nodes at specific points in the diagram. Typically, we try to place nodes that are heavily interconnected by edges close to each other. Nodes that are grouped closely together on the page might then be assumed to be associated in some way because of the larger number of links that connect them to each other.

13

Launch Gephi and from the File menu select New Project. Click on the Data Laboratory tab, and then Import Spreadsheet. Load in the file (with amended column names) as an Edges Table. The default settings should be fine…

14

Click on the Overview tab – you should see the network that connects Companies to Project IDs displayed there… But what does it mean? And can we tidy it up a little?!

15

I used the Yifan Hu layout to generate this view over the network. Yifan Hu is a good all-round layout engine that works particularly well when the data is hierarchically structured. Another good general purpose layout algorithm is ForceAtlas2.

16

Whilst we might get a feeling for the structure and shape of the dataset as a whole from the overall visualisation, we often want to inspect one or more of the nodes in detail. The quickest way of doing this is to look at the labels… You may also have noticed that some lines are thicker than others. In this case, the line thicknesses are proportional to the contract value, which we set in the weight column. If a company is associated with more than a single contract on a particular project, the edge weight will be proportional to the overall (total) sum of values of all the contracts relating that company to that project.
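To see what that summing amounts to, here is a small sketch that adds up the weights of repeated project and company pairs by hand. The edges are hypothetical and the code simply mirrors the behaviour described above; it is not a description of how Gephi does this internally.

```python
from collections import defaultdict

# Hypothetical (project, company, contract value) rows; values are invented.
edges = [
    ("P001", "Company A", 100.0),
    ("P001", "Company A", 50.0),   # a second contract on the same project
    ("P001", "Company B", 30.0),
]

# Sum the contract values for each (project, company) pair.
total_weight = defaultdict(float)
for project, company, amount in edges:
    total_weight[(project, company)] += amount

for (project, company), weight in total_weight.items():
    print(project, company, weight)
```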

17

As well as using space (or position) and colour to represent structural elements of the network, we can also use edge weight (that is, the thickness, or width, of the lines connecting nodes to each other) to represent some feature of the network. In this case, we might use edge weight to represent the value of the contract that connects a company with a project, or the number of contracts that a company has on a particular project. When placing nodes, we might also use edge weight to contribute to the determination of how closely two connected nodes should be placed to each other. If you think of the edge thickness in terms of the size, thickness or strength of a mechanical spring, you might perhaps start to imagine how nodes connected by thick springs will be pulled closer to each other than nodes connected by much weaker springs.

18

As well as edge thickness, we might also make use of node size to highlight some feature of the network. In this example, we use node size to represent the degree of each node, that is, the number of edges connected to it. Sometimes, we might want to highlight nodes that have small numbers of connections, for example to identify projects with very few companies contracted to them. In this case, we might make nodes with only a single incoming edge very large, and nodes with a large number of edges much smaller. The node size thus represents how well connected a node is. In this case, the size of each project node indicates how many companies are associated with it, and the size of each company node depicts how many project contracts the company is engaged with. Note that we can combine edge weight and node size, for example, by setting node size proportional to the summed weights of edges that are connected to the node. Hopefully, you are already starting to see how a network diagram can provide a range of powerful visual representations for helping us explore the structure of a network and identify key elements of it.
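Degree is straightforward to compute by hand, which can help make the idea concrete. The sketch below counts the edges attached to each node in a small, made-up edge list, assuming each project and company pair appears only once.

```python
from collections import Counter

# Hypothetical (project, company) edges, one per distinct pair.
edges = [
    ("P001", "Company A"),
    ("P001", "Company B"),
    ("P002", "Company A"),
]

# The degree of a node is the number of edges that touch it.
degree = Counter()
for project, company in edges:
    degree[project] += 1
    degree[company] += 1

for node, count in sorted(degree.items()):
    print(node, count)
```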

19

We can size the nodes according to statistical values calculated over the network. In this case, we might want to highlight nodes according to the total value of contracts flowing into them (for companies) or out of them (for projects). The Avg. Weighted Degree statistic calculates the corresponding value (the weighted degree) for each node in the network. The spline operator in the Ranking tab – where we set the node size – allows us to tweak the relationship between the value used to size the node and the node size. The default is a simple linear proportional map. However, we may find that the range of values we want to map is “clumped” together (for example, one very large value and a range of smaller values clumped together at the other end of the overall range). In such a case, we might want to tweak the mapping to provide a little more salience when it comes to distinguishing between the values that are otherwise clumped together. As well as making node size proportional to some quantity, we can also set the label size to be proportional to the node size.
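The sketch below illustrates both ideas on invented data: summing edge weights per node to get a weighted degree, and then mapping that value to a node size. The node_size helper and its default sizes are hypothetical, and the logarithmic transform is only a stand-in for the kind of non-linear mapping a spline lets you draw; it is not Gephi's spline.

```python
import math
from collections import defaultdict

# Hypothetical (project, company, total contract value) edges.
edges = [
    ("P001", "Company A", 1_000_000.0),
    ("P001", "Company B", 20_000.0),
    ("P002", "Company B", 10_000.0),
]

# Weighted degree: the sum of the weights of all edges touching a node.
weighted_degree = defaultdict(float)
for source, target, weight in edges:
    weighted_degree[source] += weight
    weighted_degree[target] += weight


def node_size(value, lo, hi, min_size=10.0, max_size=50.0, transform=lambda v: v):
    """Map value in [lo, hi] to a size, optionally via a transform such as math.log10."""
    t_lo, t_hi, t_val = transform(lo), transform(hi), transform(value)
    if t_hi == t_lo:
        return min_size
    return min_size + (max_size - min_size) * (t_val - t_lo) / (t_hi - t_lo)


lo, hi = min(weighted_degree.values()), max(weighted_degree.values())
for node, value in weighted_degree.items():
    linear = node_size(value, lo, hi)
    logged = node_size(value, lo, hi, transform=math.log10)
    print(f"{node}: linear={linear:.1f} log={logged:.1f}")
```

With one very large value in the data, the linear map leaves the smaller nodes almost indistinguishable in size, whereas the non-linear map spreads them out.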

20

There are several other tools available to us that allow us to explore other properties of the network. For example, there is a wide selection of filters that allow us to select particular filtered views of the network. In this case, we use the degree range filter to show only nodes that have degree of two or more. This filters out nodes that have degree 1 – for example, companies that are only associated with a single project. The result is a view over the network that shows which companies are associated with two or more projects, and which projects they are. The node sizes are indicative of the total overall value of contracts associated with each particular node. So for example, we see that Siemens AG is associated with contracts from projects P072018 and P090104. The large node size suggests that the sum total of contracts Siemens AG has received via these projects is quite significant. In addition, the thickness of the line from P072018 to Siemens AG suggests that the total value of contracts (or maybe just a single contract) Siemens AG has received from that project is quite large.
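A degree range filter of this kind can be approximated in a few lines. The sketch below, on made-up edges, keeps only nodes whose degree is two or more and then keeps only the edges whose two ends both survive; it illustrates the idea and makes no claim about how Gephi's filter is implemented.

```python
from collections import Counter

# Hypothetical (project, company) edges, one per distinct pair.
edges = [
    ("P001", "Company A"),
    ("P002", "Company A"),
    ("P001", "Company B"),
    ("P002", "Company C"),
    ("P003", "Company D"),
]

# Count how many edges touch each node.
degree = Counter(node for edge in edges for node in edge)

# Keep nodes with degree of two or more, and edges between kept nodes.
kept_nodes = {node for node, count in degree.items() if count >= 2}
kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]

print(sorted(kept_nodes))
print(kept_edges)
```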

21

So far, our network diagram has shown us how companies relate to projects, and conversely, how projects relate to companies. But sometimes we may want to know rather more directly the extent to which two things are connected by virtue of having a common partner – for example, which companies worked on the same projects together, or which projects are linked by virtue of having used the same companies. When the data is represented as a graph, we can manipulate the graph in order to generate derived graphs that can capture these sorts of relationship directly.

22

When we have a dataset represented in the form of a network, we can start to analyse it by looking at additional network properties. For example, for the projects and companies graph, we might process the graph so as to remove project nodes and replace the edges with edges that connect companies that worked on one or more of the same projects. We might even use edge weight to depict how many projects there were in common between two companies.

23

From the workspace menu, duplicate the original network (remember to turn off all the filters! We want the whole network.) You will automatically be moved to a new workspace containing a copy of the original network. (Navigate between workspaces from the workspace selector at the bottom right hand corner of the whole application window.) In the Multimode Networks Projection panel, click on Graph Coloring to try to split the network into complementary types of node (companies and projects). Hopefully, the tool will return with the report that Bipartite:true. That is, two complementary sets of nodes have been found (nodes in the first group are only ever connected to nodes in the second group). Click on Load attributes and select the Node Color Multimode option.
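The bipartite check can be understood as a two-colouring problem: try to give every node one of two colours so that no edge joins two nodes of the same colour. The sketch below does this with a breadth-first search over hypothetical edges; the function name two_colour is ours, and this is an illustration of the idea, not the plugin's own code.

```python
from collections import defaultdict, deque


def two_colour(edges):
    """Return {node: 0 or 1} if the graph is bipartite, otherwise None."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    colour = {}
    for start in adjacency:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in colour:
                    # Give each neighbour the opposite colour.
                    colour[neighbour] = 1 - colour[node]
                    queue.append(neighbour)
                elif colour[neighbour] == colour[node]:
                    # Two joined nodes share a colour: not bipartite.
                    return None
    return colour


# Hypothetical project-company edges.
colouring = two_colour([("P001", "Company A"), ("P001", "Company B"), ("P002", "Company A")])
print(colouring is not None)
```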

24

To check what the multimode tool has called nodes of each type, click on the edit button in the palette toolbar, and click on a project node. An edit panel will appear – make a note of what colour the project type node has been labelled. We can now use the multimode network projection tool to process the network by joining together company nodes that are connected by a common project, and deleting the project nodes. That is, we want to connect blue company nodes to blue company nodes if they are connected by edges that pass through a common red project node. Once we have made the mapping, we can delete the inner red project nodes. Running the projection results in several distinct clusters of companies that are connected to each other by virtue of being associated with the same project, as well as some companies that bridge different clusters by virtue of being associated with companies from different projects.
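The projection itself can be sketched without any special tooling. For each project we list the companies attached to it, connect every pair of those companies, and count how many projects each pair has in common. The edges below are invented, and the counting of shared projects as an edge weight is one possible choice of weight.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (project, company) edges, one per distinct pair.
edges = [
    ("P001", "Company A"),
    ("P001", "Company B"),
    ("P002", "Company A"),
    ("P002", "Company B"),
    ("P002", "Company C"),
]

# Group the companies attached to each project.
companies_by_project = defaultdict(set)
for project, company in edges:
    companies_by_project[project].add(company)

# Connect every pair of companies that share a project; the weight
# counts the number of projects the pair has in common.
shared_projects = defaultdict(int)
for companies in companies_by_project.values():
    for a, b in combinations(sorted(companies), 2):
        shared_projects[(a, b)] += 1

for (a, b), count in sorted(shared_projects.items()):
    print(a, b, count)
```

Swapping the roles of the two columns (grouping projects by company instead) gives the converse projection, in which projects are linked by the companies they have in common.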

25

Conversely, we might remove the company nodes, and identify a new set of edges that connect projects that shared one or more common contracted companies. Again, edge thickness might be used to show how tightly connected two projects were by virtue of increasing numbers of common contracted companies.

26

By projecting the original network onto the network that shows links between projects that arise from common companies, we get a much clearer picture about how many projects there are, as well as possible linkages between them.

27

Here are some of the things you have hopefully learned…feel free to add anything else you might have learned to the list…

28

For more information, and a wide range of further tutorials on all matters data related, visit the School Of Data at SchoolOfData.org, or on Twitter via @SchoolOfData.
