Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com

Page 1:

Easier than Excel: Social Network Analysis of DocGraph with Gephi

Janos G. Hajagos
Stony Brook School of Medicine

Fred Trotter
fredtrotter.com

Page 2:

DocGraph

- Based on a FOIA request to CMS by Fred Trotter
- Pre-released at Strata RX 2012
- Medicare providers (more than doctors)
- CY 2011 dates of service
- Share 11 or more patients in a 30 day forward window
- Initial access restricted to MedStartr funders
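One possible reading of the sharing rule is that provider A is linked to provider B when at least 11 distinct patients saw B within 30 days after seeing A. The sketch below implements that reading on hypothetical (patient, provider, date) visit records. The record layout, the function name `shared_patient_edges`, and the handling of the window are assumptions for illustration, not the method CMS used.

```python
from collections import defaultdict
from datetime import date, timedelta

def shared_patient_edges(visits, window_days=30, min_patients=11):
    """visits: iterable of (patient_id, provider_id, date) tuples (hypothetical layout)."""
    by_patient = defaultdict(list)
    for patient, provider, day in visits:
        by_patient[patient].append((day, provider))

    window = timedelta(days=window_days)
    patients_per_pair = defaultdict(set)
    for patient, seen in by_patient.items():
        seen.sort()  # chronological order; same-day visits fall in provider-id order
        for i, (day_a, prov_a) in enumerate(seen):
            for day_b, prov_b in seen[i + 1:]:
                if day_b - day_a > window:
                    break  # later visits are outside the forward window
                if prov_a != prov_b:
                    patients_per_pair[(prov_a, prov_b)].add(patient)

    # Keep only directed pairs that meet the patient threshold.
    return {pair: len(p) for pair, p in patients_per_pair.items()
            if len(p) >= min_patients}

# Eleven made-up patients see provider "A", then provider "B" two weeks later.
visits = [(f"patient{i}", "A", date(2011, 1, 1)) for i in range(11)]
visits += [(f"patient{i}", "B", date(2011, 1, 15)) for i in range(11)]
edges = shared_patient_edges(visits)
```

Because the window only looks forward in time, the resulting edges are directed: A to B does not imply B to A.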

Page 3:

DocGraph by the numbers

- Directed graph
- Average total degree 52.8
- 940,492 providers (graph nodes/vertices)
- 49,685,810 shared edges
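The 52.8 figure can be checked against the two counts on the slide. It appears to be the number of edges divided by the number of nodes, which in a directed graph is the average out-degree (and equally the average in-degree). Counting both ends of every edge would give twice that value. A short arithmetic check:

```python
nodes = 940_492
edges = 49_685_810

# Each directed edge adds one to exactly one node's out-degree.
avg_out_degree = edges / nodes

# Counting both endpoints of every edge doubles the figure.
avg_in_plus_out = 2 * edges / nodes

print(round(avg_out_degree, 1))
print(round(avg_in_plus_out, 1))
```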

Page 4:

Geographic visualization

http://isurfsoftware.com/blog/2012/12/13/visualizing-geographic-connections-between-us-doctors/

Page 5:

DocGraph data

Page 7:

NPPES

- National Plan and Provider Enumeration System
- Source of NPI (National Provider Identifier)
- No cost download
- Information is entered and updated by provider
  - Data quality is good to poor
- CSV file with 314 columns
- A custom MySQL load script is used to normalize the database
- Bloom.api open source project to make data easier to access
  - http://www.bloomapi.com/

Page 8:

Tabular data

Page 9:

Things we can do with tabular data

Page 10:

Graph data

Relation between authors and MeSH terms from PubMed

http://dx.doi.org/10.6084/m9.figshare.94595

Page 11:

Graph types

- Undirected graph
  - Facebook friendships
- Directed graph
  - Twitter: follow and be followed
- Bipartite graph
- Multipartite
  - RDF graph model
  - Property graph model
- Allow parallel edges
  - RDF graph model
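The graph types above differ mainly in how an edge is stored. The following is a minimal sketch using plain Python containers rather than a graph library; all names and data are made up for illustration.

```python
from collections import Counter

# Undirected: a friendship has no direction, so each pair is stored as a frozenset.
friendships = {frozenset(("ann", "bob")), frozenset(("bob", "ann"))}

# Directed: "ann follows bob" is a different edge from "bob follows ann".
follows = {("ann", "bob"), ("bob", "ann")}

# Parallel edges: repeated edges between the same ordered pair are counted.
parallel = Counter([("ann", "bob"), ("ann", "bob"), ("bob", "ann")])

# Bipartite: every edge joins a node from one set to a node from the other.
prescribers = {"dr_x", "dr_y"}
patients = {"p1", "p2"}
prescriptions = [("dr_x", "p1"), ("dr_x", "p2"), ("dr_y", "p2")]
is_bipartite = all(a in prescribers and b in patients for a, b in prescriptions)

print(len(friendships), len(follows), parallel[("ann", "bob")], is_bipartite)
```

The undirected set collapses both orderings into one friendship, while the directed set keeps them as two separate edges.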

Page 12:

Components of a network/graph

Page 13:

Graphs in healthcare

- Prescriber and patient (bipartite)
  - NCPDP data with NPI
- Referral data sets
- Shared patients
  - DocGraph
- Social networks
  - Tweeting about a disease
- Limited by imagination

Page 14:

Generating GraphML

- XML based file format for graphs
- Readable by a large number of tools
  - Gephi
  - Mathematica
  - igraph (R)
- NetworkX, a Python library for graphs, can export to GraphML
- GraphML is not a file format for really large graphs
- GraphML is not readable by d3.js
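To show what a GraphML file contains, here is a minimal sketch that builds one with Python's standard `xml.etree.ElementTree` module instead of NetworkX. The function name `build_graphml`, the attribute names `label` and `weight`, and the sample providers are assumptions for illustration; the tutorial's own script may lay out its files differently.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def build_graphml(nodes, edges):
    """nodes: {node_id: label}; edges: list of (source, target, weight)."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # <key> elements declare the attributes carried by nodes and edges.
    ET.SubElement(root, "key", {"id": "label", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for node_id, label in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="label").text = label
    for source, target, weight in edges:
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="weight").text = str(weight)
    return ET.tostring(root, encoding="unicode")

# Two made-up providers sharing 25 patients.
xml_text = build_graphml({"a": "Provider A", "b": "Provider B"}, [("a", "b", 25)])
print(xml_text)
```

Writing `xml_text` to a file with a `.graphml` extension should give something the tools listed above can open.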

Page 15:

GraphML can be loaded into Mathematica

Page 16:

Gephi

Page 17:

Gephi

- Java based open source tool
- Focused on interactivity
  - Fast graphics
  - Multi-threaded
  - Visual updates
- Strong graph analytics
- Graphs stored in memory
  - Upper limit is about 100,000 nodes
- NetBeans plugin architecture
  - Integration with Neo4J
  - Additional layout algorithms

Page 18:

Downloading Gephi

http://gephi.org/users/download/

Page 19:

Downloading sample files

https://dl.dropboxusercontent.com/u/21690634/DocGraph/docgraph_tutorial_examples.zip

Page 20:

Subsets are generated using a Python script

python extract_providers_to_graphml.py "npi='1750499653'" sterrence Leaf-edges

Opening connection referral
Configuration
Selection criteria for subset graph: npi='1750499653'
Referral table name: referral.referral2011
NPI detail table name: referral.npi_summary_primary_taxonomy
Nodes will be labeled by: provider_name
Leaf-to-leaf edges will be exported? False
…
Imported 1 nodes
…
Imported 986 nodes
…
Imported 1724 edges
Edge types imported {'core-to-leaf': 866, 'leaf-to-core': 856, None: 2}
Leaf-to-leaf edges were not selected for export
Writing GraphML file

Page 21:

Generating a subset: some concepts

Core nodes

Adding leaf nodes

Connecting core nodes

Connecting to leaf nodes

Connecting leaf nodes
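The concepts above can be sketched on a plain edge list. This assumes core nodes are the providers matching the selection criteria and leaf nodes are any other providers directly linked to a core node. The function `extract_subset` and the label `core-to-core` are hypothetical; the other edge labels follow the script output shown earlier. It is not the code of `extract_providers_to_graphml.py`.

```python
def extract_subset(edges, core, include_leaf_to_leaf=False):
    """edges: iterable of (source, target); core: set of selected node ids."""
    edges = list(edges)

    # Adding leaf nodes: non-core nodes that share an edge with a core node.
    leaves = set()
    for source, target in edges:
        if source in core and target not in core:
            leaves.add(target)
        elif target in core and source not in core:
            leaves.add(source)

    kept = []
    for source, target in edges:
        if source in core and target in core:
            kind = "core-to-core"   # connecting core nodes
        elif source in core and target in leaves:
            kind = "core-to-leaf"   # connecting to leaf nodes
        elif source in leaves and target in core:
            kind = "leaf-to-core"
        elif source in leaves and target in leaves:
            kind = "leaf-to-leaf"   # connecting leaf nodes (optional)
        else:
            continue                # edge lies outside the subset
        if kind == "leaf-to-leaf" and not include_leaf_to_leaf:
            continue
        kept.append((source, target, kind))
    return leaves, kept

sample_edges = [("a", "b"), ("b", "a"), ("a", "x"), ("y", "a"),
                ("x", "y"), ("x", "z"), ("z", "q")]
leaves, kept = extract_subset(sample_edges, core={"a", "b"})
print(sorted(leaves), kept)
```

Leaving out leaf-to-leaf edges keeps the subset centered on the core providers, which may be why the sample run above reports them as not selected for export.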

Page 22:

Sample files

- jamestown_core_provider_graph.graphml
  - Providers selected with practice addresses in Jamestown, NY
  - Small city in far western New York (approximately 30,000 residents)
  - 179 nodes with 5,560 edges
- jamestown_core_and_leaf_provider_graph.graphml
  - Includes providers above and those who are linked to them
  - 1,322 nodes with 12,457 edges
- albany_core_provider_graph.graphml
  - Providers selected with practice addresses in Albany, NY
  - A small city in New York (approximately 100,000 residents)
  - 1,368 nodes with 44,711 edges

Page 23:

Sample files (continued)

- bronx_core_provider_graph.graphml
  - Providers selected with practice addresses in Bronx, NY
  - Urban community (1.4 million residents)
  - 3,268 nodes and 53,828 edges

Page 24:

Opening a graph file

Page 25:

Import report

Page 26:

Force directed layout of the graph

Page 27:

Results of the layout

Page 28:

ForceAtlas 2 works well for larger graphs

Page 29:

Navigating the graph

- Best experience with a three button mouse with a scroll wheel
  - Right click and hold to pan
  - Scroll wheel to zoom in and out
  - Left click to select
  - Right click for context menus
- MacBook users
  - Command key and click and hold down on trackpad to pan
  - Two fingers to zoom on trackpad
  - Click on trackpad to select
  - Control click for context menus

Page 30:

Coloring the graph (partitioning)

Page 31:

Coloring the graph (partitioning)

Page 32:

Varying node size based on importance

- Step 1: Need to select a measure for node importance
  - Degree
  - PageRank
  - Eigenvector centrality
- Step 2: Run the measure against the graph
- Step 3: Ranking tab and “Size/Weight”
- Step 4: Set size range
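Gephi performs these steps through its interface. As a rough sketch of what they amount to, the code below uses degree as the importance measure and maps it linearly onto a size range. The function name `node_sizes`, the default sizes, and the linear mapping are assumptions, not a description of Gephi's internals.

```python
from collections import Counter

def node_sizes(edges, min_size=10.0, max_size=50.0):
    """Map each node's degree linearly onto [min_size, max_size]."""
    # Steps 1 and 2: pick a measure (degree) and compute it for every node.
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1

    # Steps 3 and 4: rank nodes by the measure and interpolate a size.
    lowest, highest = min(degree.values()), max(degree.values())
    span = highest - lowest
    sizes = {}
    for node, value in degree.items():
        if span == 0:
            sizes[node] = min_size  # all nodes equally important
        else:
            sizes[node] = min_size + (value - lowest) / span * (max_size - min_size)
    return sizes

sizes = node_sizes([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])
print(sizes)
```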

Page 33:

Graph measures

- Degree
  - In-degree
  - Out-degree
- Graph structure measures
  - Clustering (global and local)
  - Network diameter
- Centrality measures
  - Eigenvector centrality
  - PageRank (Google search)
- Community measures
- And more . . .
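As an illustration of one of the centrality measures listed above, here is a minimal PageRank sketch using simple power iteration. The damping factor of 0.85, the fixed iteration count, and the uniform handling of nodes without outgoing edges are common textbook choices assumed here; Gephi's implementation may differ in its details.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """edges: list of (source, target) pairs in a directed graph."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping + damping * dangling) / len(nodes)
                    for n in nodes}
        # Every node passes its rank evenly along its outgoing edges.
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(max(ranks, key=ranks.get))
```

In this toy graph node "c" receives edges from both other nodes, so it ends up with the highest rank, while "b", which nothing points to, ends up with the lowest.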

Page 34:

Interactively viewing node attributes

Click the “T” icon on the bottom to turn on node labeling

Page 35:

Data Laboratory

Page 36:

Selecting visible fields

Page 37:

Viewing edge attributes

Page 38:

Saving your graph

- Save your graph in .gephi format
  - XML based format
  - Preserves layout, size, and color
- Save in GraphML format for use with outside programs

Page 39:

Filtering nodes by attributes

Page 40:

Hints for filtering nodes

- Drag field filter “is_physician” from the top pane to the lower pane
- Set the value to filter on
  - Value should equal 1
  - 1 is equivalent to true
- Click “Filter” to apply
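Outside Gephi, the same filter can be sketched in a few lines. This assumes nodes are held as a dictionary of attribute dictionaries and that an edge is kept only when both of its endpoints pass the filter; the function name `filter_physicians` and the sample data are hypothetical.

```python
def filter_physicians(nodes, edges):
    """nodes: {node_id: {attribute: value}}; edges: list of (source, target)."""
    # Keep nodes whose is_physician attribute equals 1 (that is, true).
    keep = {node_id for node_id, attrs in nodes.items()
            if attrs.get("is_physician") == 1}
    # Assumption: an edge survives only if both endpoints survive.
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges

sample_nodes = {
    "n1": {"is_physician": 1},
    "n2": {"is_physician": 0},
    "n3": {"is_physician": 1},
}
sample_links = [("n1", "n2"), ("n1", "n3"), ("n3", "n1")]
physicians, physician_links = filter_physicians(sample_nodes, sample_links)
print(sorted(physicians), physician_links)
```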

Page 41:

Producing a final graph

We need to rescale the edge weights in the graph
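The slide does not say how the weights are rescaled. One common choice, assumed here, is a linear min-max rescale onto a narrow range so that a few very heavy edges do not dominate the drawing. The function name `rescale_weights` and the target range are hypothetical.

```python
def rescale_weights(edges, new_min=1.0, new_max=10.0):
    """edges: list of (source, target, weight); returns edges with rescaled weights."""
    weights = [weight for _, _, weight in edges]
    lowest, highest = min(weights), max(weights)
    if highest == lowest:
        # All weights are equal, so every edge gets the minimum thickness.
        return [(s, t, new_min) for s, t, _ in edges]
    scale = (new_max - new_min) / (highest - lowest)
    return [(s, t, new_min + (w - lowest) * scale) for s, t, w in edges]

# Made-up shared-patient counts between three providers.
rescaled = rescale_weights([("a", "b", 11), ("b", "c", 511), ("c", "a", 261)])
print(rescaled)
```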

Page 42:

Producing a final graph after scaling

Page 43:

Bronx core provider graph

Page 44:

Challenge questions

- Which institution is the most “important” provider for the Bronx?
  - Hint: try a centrality measure
- Can you determine if geography plays a role in patient sharing in the Bronx?
  - Which parameter could be used to partition the graph?
- Can you filter the graph to show only radiologists?
- Which radiologist has the highest “authority” in the graph?

Page 45:

Other tools for graph analysis

- NetworkX
  - Python
  - Lots of algorithms
- igraph
  - R and Python
- Gremlin: graph traversal and manipulation
  - Groovy shell
  - Gremlin interface is implemented for Neo4J
- And more . . .

Page 46:

Scaling the analysis to the entire DocGraph

- Most healthcare graphs will be big (millions of nodes)
- What we learn at the local level can be applied at the global level
  - Importance of geography
  - Supernodes (radiologist, ER docs, pathologist, transportation, …)
- Many graph measures don’t scale well
  - Maximal cliques
- Currently exploring how to use Faunus to scale the analysis with Hadoop

Page 47:

Links

- http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information)
- https://github.com/jhajagos/DocGraph (code)
- http://notonlydev.com/docgraph-data/ (open source $1 covers bandwidth fees)
- https://groups.google.com/forum/#!forum/docgraph (mailing list)

Page 48:

Questions

Try to publish your own healthcare dataset as a graph!