
Scoda project companygraph


Page 1: Scoda project companygraph

This tutorial describes how to use network analysis tools to visually explore the links between companies working on the same contract.

Page 2: Scoda project companygraph

The example dataset we will use comes from the World Bank. Each row represents a contract. Inspecting the column names tells us what data we have available about each contract.

Looking at the data, we can see how we could order the companies based on the value of the total contract amount; or we might order the contracts by time; or we might look to see which contracts were awarded in a particular project, or to a particular company in the event of the same company being awarded more than one contract.

Page 3: Scoda project companygraph

We might also wish to look for patterns in the data that show us how the things described in one row might connect to things described in other rows.

For example, can we organise the data somehow to see which companies are associated with which projects? Could a network style visualisation help us do this?

Page 4: Scoda project companygraph

But if we were to draw a network, what sort of thing should we connect to what? And how would we know what to connect to each other?

One way is to look at the data… at which point we might notice that some of the entries within a column take on the same value. This means that we can “connect” the data that appears in different rows using these common elements…

Page 5: Scoda project companygraph

So what columns have usefully repeating elements? The projects column certainly has repeating elements, so we should be able to draw diagrams that show all the companies that connect to each project. And if a company is associated with more than one project, it should in a certain sense be seen to join those projects together…

Page 6: Scoda project companygraph

A few of the contract numbers repeat, so it might be interesting to explore the extent to which companies connect to contracts. If two different companies are associated with the same contracts, that might be interesting.

Page 7: Scoda project companygraph

Let’s get some data so we can start to explore the network…

Page 8: Scoda project companygraph

We just need to do a little bit of tidying of the data before we make use of it.

The major problem is that the Total Contract Amount column does not contain numbers, as such… In particular, we need to get rid of the dollar sign. Let’s create a new column into which we can put the cleaned values.

Page 9: Scoda project companygraph

This little bit of code says: take the value of each cell in the original column and replace the $ symbol with nothing (that is, an empty string). In other words, delete the dollar sign… Put this value in the corresponding cell of the new column, and make the cell a number type.
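
For readers who prefer to script this step, the same clean-up can be sketched in Python. This is only an illustration, not the expression used on the slide: the function name `parse_amount`, the sample rows and the extra removal of thousands separators are all assumptions.

```python
def parse_amount(raw):
    """Turn a currency string such as "$1,250.50" into a float.

    Removing commas is an assumption; the tutorial only mentions the $ sign.
    """
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned)


# Made-up rows standing in for the contracts table.
rows = [
    {"Supplier": "Company A", "Total Contract Amount": "$1,250.50"},
    {"Supplier": "Company B", "Total Contract Amount": "$300"},
]

# Put the cleaned value into a new column, leaving the original untouched.
for row in rows:
    row["Amount"] = parse_amount(row["Total Contract Amount"])
```

Writing the result to a new `Amount` key mirrors the tutorial's choice of a new column, so the original text is still there if the conversion needs checking.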

Page 10: Scoda project companygraph

Now we can export the data using the Custom Tabular Exporter, which allows us to select just those columns we want to export. (This can be very handy when a table has a large number of columns that we are not interested in!)

I have rearranged the cells in the Custom Tabular Exporter simply by clicking on them and dragging them around. We just want three columns for now: Project ID, Supplier, and our new Amount column.

Now that you know how to export the data just a few columns at a time, once you are comfortable with the process of visualising the data, you should be able to take other slices through the data (such as companies related to contracts) and visualise them yourself.

You might also like to try using a similar method on a data set of your own…

Page 11: Scoda project companygraph

There’s a final bit of tidying to do before we can use this data in Gephi, the application we’ll be using to visualise the network.

In particular, Gephi expects the data to be presented to it with particular column names.

Open the exported CSV data in a text editor and rename the columns: Source,Target,Weight (with no spaces after the commas), so that Project ID becomes Source, Supplier becomes Target and Amount becomes Weight.

Note: you could have also renamed the columns in OpenRefine before exporting them…
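
The export and renaming steps can also be done in one go with a short script. The sketch below uses only Python's standard `csv` module; the helper name `to_edge_list`, the sample rows and the extra `Region` column are made up for illustration.

```python
import csv
import io


def to_edge_list(rows, source_col, target_col, weight_col):
    """Return CSV text holding just three columns, renamed for Gephi."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])  # no spaces after commas
    for row in rows:
        writer.writerow([row[source_col], row[target_col], row[weight_col]])
    return buffer.getvalue()


# Made-up rows; "Region" stands in for the columns we do not want to export.
rows = [
    {"Project ID": "P001", "Supplier": "Company A", "Amount": 1250.5, "Region": "X"},
    {"Project ID": "P001", "Supplier": "Company B", "Amount": 300.0, "Region": "Y"},
]

edge_csv = to_edge_list(rows, "Project ID", "Supplier", "Amount")
print(edge_csv)
```

Passing different column names to the helper gives other slices through the data, such as contracts against suppliers.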

Page 12: Scoda project companygraph

We might also wish to look for patterns in the data that show us how the things described in one row might connect to things described in other rows.

For example, can we organise the data somehow to see which companies are associated with which projects? Could a network style visualisation help us do this?

Page 13: Scoda project companygraph

Network diagrams allow us to show relationships between different things. Networks are referred to in mathematical terms as graph structures, or graphs. You may be more familiar with thinking of things like line charts and bar charts as graphs, but when it comes to networks, we use the term graph to describe the mathematical structure that defines the network.

The circles (or nodes) represent “things” in the network, in this case, particular companies or projects.

The lines (or edges) represent relationships between the things in the network. In this example, the edges represent contracts that associate a particular company with one or more projects (or conversely, associate a project with one or more companies).

Where nodes are placed in the diagram can be used to convey information about the structure of the network. Many different algorithms exist to lay out (that is, place, or position) the nodes at specific points in the diagram. Typically, we try to place nodes that are heavily interconnected by edges close to each other. Nodes that are grouped closely together on the page might then be assumed to be associated in some way because of the increasing number of links that connect them to each other.
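
To make the idea of nodes and edges concrete, here is a minimal sketch of how such a graph might be held in memory. The edge values are made up, and the adjacency-set representation is just one common choice rather than anything Gephi exposes.

```python
from collections import defaultdict

# Each edge is (project, company, contract amount); values are made up.
edges = [
    ("P001", "Company A", 1250.5),
    ("P001", "Company B", 300.0),
    ("P002", "Company B", 90.0),
]

nodes = set()                  # every "thing" in the network
neighbours = defaultdict(set)  # which nodes each node is linked to

for project, company, _amount in edges:
    nodes.update([project, company])
    neighbours[project].add(company)
    neighbours[company].add(project)

print(sorted(neighbours["Company B"]))
```

Here Company B is linked to both projects, which is the kind of connection a layout algorithm would try to show by drawing those nodes near each other.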

Page 14: Scoda project companygraph

Launch Gephi and from the File menu select New Project. Click on the Data Laboratory tab, and then Import Spreadsheet.

Load in the file (with amended column names) as an Edges Table. The default settings should be fine…

Page 15: Scoda project companygraph

Click on the Overview tab: you should see the network that connects Companies to Project IDs displayed there…

But what does it mean? And can we tidy it up a little?!

Page 16: Scoda project companygraph

I used the Yifan Hu layout to generate this view over the network.

Yifan Hu is a good all round layout engine that works particularly well when the data is hierarchically structured.

Another good general purpose layout algorithm is ForceAtlas2.

Page 17: Scoda project companygraph

Whilst we might get a feeling for the structure and shape of the dataset as a whole from the overall visualisation, we often want to inspect one or more of the nodes in detail.

The quickest way of doing this is to look at the labels…

You may also have noticed that some edges are drawn thicker than others. In this case, the line thicknesses are proportional to the contract value, which we set in the Weight column.

If a company is associated with more than a single contract on a particular project, the edge weight will be proportional to the overall (total) sum of values of all the contracts relating that company to that project.
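
The summing of repeated contracts described above can be sketched as follows. The function name `merge_parallel_edges` and the contract values are hypothetical; the sketch only illustrates the arithmetic, not how Gephi implements it.

```python
from collections import defaultdict


def merge_parallel_edges(edges):
    """Sum the weights of all edges that share the same (source, target) pair."""
    totals = defaultdict(float)
    for source, target, weight in edges:
        totals[(source, target)] += weight
    return dict(totals)


# Company A holds two contracts on the same (made-up) project.
contracts = [
    ("P001", "Company A", 100.0),
    ("P001", "Company A", 50.0),
    ("P001", "Company B", 25.0),
]

merged = merge_parallel_edges(contracts)
print(merged[("P001", "Company A")])
```

The two Company A contracts collapse into a single edge whose weight is their total, so that edge would be drawn thicker than the one to Company B.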

Page 18: Scoda project companygraph

As well as using space (or position) and colour to represent structural elements of the network, we can also use edge weight (that is, the thickness, or width, of the lines connecting nodes to each other) to represent some feature of the network.

In this case, we might use edge weight to represent the value of the contract that connects a company with a project, or the number of contracts that a company has on a particular project.

When placing nodes, we might also use edge weight to contribute to the determination of how closely two connected nodes should be placed to each other. If you think of the edge thickness in terms of the size, thickness or strength of a mechanical spring, you might perhaps start to imagine how nodes connected by thick springs will be pulled closer to each other than nodes connected by much weaker springs.

Page 19: Scoda project companygraph

As well as edge thickness, we might also make use of node size to highlight some feature of the network.

In this example, we use node size to represent the degree of each node, that is, the number of edges connected to it. Sometimes, we might want to highlight nodes that have small numbers of connections, for example to identify projects with very few companies contracted to them. In this case, we might make nodes with only a single incoming edge very large, and nodes with a large number of edges much smaller.

The node size thus represents how well connected a node is. In this case, the size of each project node indicates how many companies are associated with it, and the size of each company node depicts how many project contracts the company is engaged with.

Note that we can combine edge weight and node size, for example, by setting node size proportional to the summed weights of edges that are connected to the node.

Hopefully, you are already starting to see how a network diagram can provide a range of powerful visual representations for helping us explore the structure of a network and identify key elements of it.
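
Degree and summed edge weight are both simple counts over the edge list. The sketch below computes them for a made-up set of edges; the variable names are illustrative only.

```python
from collections import Counter

# (project, company, contract amount); values are made up.
edges = [
    ("P001", "Company A", 100.0),
    ("P001", "Company B", 50.0),
    ("P002", "Company B", 25.0),
]

degree = Counter()           # number of edges touching each node
weighted_degree = Counter()  # summed weight of those edges

for project, company, amount in edges:
    for node in (project, company):
        degree[node] += 1
        weighted_degree[node] += amount

print(degree["Company B"], weighted_degree["Company B"])
```

Either number could drive node size: `degree` shows how well connected a node is, while `weighted_degree` shows how much contract value passes through it.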

Page 20: Scoda project companygraph

We can size the nodes according to statistical values calculated over the network.

In this case, we might want to highlight nodes according to the total value of contracts flowing into them (for companies) or out of them (for projects). The weighted average statistic calculates the corresponding value for each node in the network.

The spline operator in the Ranking tab, where we set the node size, allows us to tweak the relationship between the value used to size the node and the node size. The default is a simple linear proportional map. However, we may find that the range of values we want to map are “clumped” together (for example, one very large value and a range of smaller values clumped together at the other end of the overall range). In such a case, we might want to tweak the mapping to provide a little more salience when it comes to distinguishing between the values that are otherwise clumped together.

As well as making node size proportional to some quantity, we can also set the label size to be proportional to the node size.
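
The effect of bending the value-to-size mapping can be illustrated with a small function. This sketch uses a simple power curve as a stand-in; it is not Gephi's spline editor, and the name `scale_size`, the size limits and the `exponent` parameter are assumptions chosen for the example.

```python
def scale_size(value, vmin, vmax, min_size=10.0, max_size=50.0, exponent=1.0):
    """Map a value in [vmin, vmax] to a node size.

    exponent=1.0 is the plain linear map; an exponent below 1 stretches the
    low end of the range, separating values that are clumped together there.
    """
    if vmax == vmin:
        return min_size
    t = (value - vmin) / (vmax - vmin)  # position within the range, 0..1
    return min_size + (max_size - min_size) * t ** exponent


# One very large value and several small ones clumped together (made up).
values = [1, 2, 4, 100]
linear = [scale_size(v, 1, 100) for v in values]
curved = [scale_size(v, 1, 100, exponent=0.5) for v in values]
print(linear)
print(curved)
```

With the linear map the three small values get almost identical sizes; with the curved map they spread out, while the smallest and largest values keep the same sizes as before.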

Page 21: Scoda project companygraph

There are several other tools available to us that allow us to explore other properties of the network. For example, there is a wide selection of filters that allow us to select particular filtered views of the network.

In this case, we use the degree range filter to show only nodes that have degree of two or more. This filters out nodes that have degree 1, for example, companies that are only associated with a single project. The result is a view over the network that shows which companies are associated with two or more projects, and which projects they are. The node sizes are indicative of the total overall value of contracts associated with each particular node.

So for example, we see that Siemens AG is associated with contracts from projects P072018 and P090104. The large node size suggests that the sum total of contracts Siemens AG has received via these projects is quite significant. In addition, the thick line from P072018 to Siemens AG suggests that the total value of contracts (or maybe just a single contract) Siemens AG has received from that project is quite large.
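
A degree filter of this kind can be sketched in a few lines. The function name `degree_range_filter` and the sample edges are hypothetical, and the sketch assumes degrees are counted once on the unfiltered network.

```python
from collections import Counter


def degree_range_filter(edges, min_degree):
    """Keep only edges whose two end nodes both have at least min_degree edges."""
    degree = Counter()
    for source, target, _weight in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, count in degree.items() if count >= min_degree}
    return [e for e in edges if e[0] in keep and e[1] in keep]


# Made-up edges: only Company B works on more than one project.
edges = [
    ("P001", "Company A", 100.0),
    ("P001", "Company B", 50.0),
    ("P002", "Company B", 25.0),
    ("P002", "Company C", 10.0),
]

filtered = degree_range_filter(edges, min_degree=2)
print(filtered)
```

Companies A and C drop out because each has a single edge, leaving Company B and the two projects it connects.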

Page 22: Scoda project companygraph

So far, our network diagram has shown us how companies relate to projects, and conversely, how projects relate to companies.

But sometimes we may want to know rather more directly the extent to which two things are connected by virtue of having a common partner: for example, which companies worked on the same projects together, or which projects are linked by virtue of having used the same companies.

When the data is represented as a graph, we can manipulate the graph in order to generate derived graphs that can capture these sorts of relationship directly.

Page 23: Scoda project companygraph

When we have a dataset represented in the form of a network, we can start to analyse it by looking at additional network properties.

For example, for the projects and companies graph, we might process the graph so as to remove the project nodes and replace their edges with edges that directly connect companies that worked on one or more of the same projects. We might even use edge weight to depict how many projects there were in common between two companies.

Page 24: Scoda project companygraph

From the workspace menu, duplicate the original network (remember to turn off all the filters! We want the whole network.)

You will automatically be moved to a new workspace containing a copy of the original network. (Navigate between workspaces from the workspace selector at the bottom right hand corner of the whole application window.)

In the Multimode Networks Projection panel, click on Graph Coloring to try to split the network into complementary types of node (companies and projects). Hopefully, the tool will return with the report that Bipartite: true. That is, two complementary sets of nodes have been found (nodes in the first group are only ever connected to nodes in the second group). Click on Load attributes and select the Node Color Multimode option.
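
The check for two complementary sets of nodes can be sketched as a breadth-first two-colouring. This is a generic textbook approach, not a description of what the Gephi plugin does internally; the function name `two_colour` and the sample edges are made up.

```python
from collections import defaultdict, deque


def two_colour(edges):
    """Return {node: 0 or 1} if the graph is bipartite, otherwise None."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    colour = {}
    for start in list(neighbours):
        if start in colour:
            continue  # already reached from an earlier start node
        colour[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in colour:
                    colour[other] = 1 - colour[node]  # opposite colour
                    queue.append(other)
                elif colour[other] == colour[node]:
                    return None  # two linked nodes share a colour
    return colour


company_graph = [("P001", "Company A"), ("P001", "Company B"), ("P002", "Company B")]
triangle = [("X", "Y"), ("Y", "Z"), ("Z", "X")]

print(two_colour(company_graph))
print(two_colour(triangle))
```

The project and company example splits cleanly into two colours, whereas the triangle cannot be split and returns `None`.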

Page 25: Scoda project companygraph

To check what the multimode tool has called nodes of each type, click on the edit button in the palette toolbar, and click on a project node. An edit panel will appear; make a note of what colour the project type node has been labelled.

We can now use the multimode network projection tool to process the network by joining together company nodes that are connected by a common project, and deleting the project nodes.

That is, we want to connect blue company nodes to blue company nodes if they are connected by edges that pass through a common red project node. Once we have made the mapping, we can delete the inner red project nodes.

Running the projection results in several distinct clusters of companies that are connected to each other by virtue of being associated with the same project, as well as some companies that bridge different clusters by virtue of being associated with companies from different projects.
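
The projection itself can be sketched directly on an edge list: group companies by project, then link every pair within a group. The function name `project_onto_companies` and the data are hypothetical, and using the number of shared projects as the new edge weight is one possible choice.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_companies(edges):
    """Link companies that share a project; weight = number of shared projects."""
    members = defaultdict(set)
    for project, company in edges:
        members[project].add(company)

    shared = defaultdict(int)
    for companies in members.values():
        # Sorting gives each pair one consistent (a, b) ordering.
        for a, b in combinations(sorted(companies), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Made-up edges: Companies A and B appear together on two projects.
edges = [
    ("P001", "Company A"),
    ("P001", "Company B"),
    ("P002", "Company A"),
    ("P002", "Company B"),
    ("P002", "Company C"),
]

company_links = project_onto_companies(edges)
print(company_links)
```

Swapping the two items in each edge tuple before calling the function gives the complementary projection, in which projects are linked by the companies they share.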

Page 26: Scoda project companygraph

Conversely, we might remove the company nodes, and identify a new set of edges that connect projects that shared one or more common contracted companies. Again, edge thickness might be used to show how tightly connected two projects were by virtue of increasing numbers of common contracted companies.

Page 27: Scoda project companygraph

By projecting the original network onto the network that shows links between projects that arise from common companies, we get a much clearer picture about how many projects there are, as well as possible linkages between them.

Page 28: Scoda project companygraph

Here are some of the things you have hopefully learned… feel free to add anything else you might have learned to the list…

Page 29: Scoda project companygraph

For more information, and a wide range of further tutorials on all matters data related, visit the School Of Data at SchoolOfData.org, or on Twitter via @SchoolOfData.