Algorithms for Data Mining and Querying with Graphs
Investigators: Padhraic Smyth, Sharad Mehrotra
University of California, Irvine
Students: Joshua O’Madadhain, Dawit Seid, Jon Hutchins
JUNG: Java Universal Network/Graph Framework
GAAL: A General-Purpose Graph Query Language
Link Prediction Algorithms
Example of software built using JUNG: Netsight, an interactive graph visualization and analysis tool
- extensible, open source software library (API) for graph/network modeling, analysis, and visualization
- can decorate graphs, vertices, edges with any JUNG object
- complex filtering/transformation/subset management
- includes library of network and graph algorithms
  - clustering, centrality, importance, paths, flows, etc.
- includes visualization API, or can use other visualization APIs (e.g. prefuse)
- supports graphs, hypergraphs, parallel edges, mixed-mode graphs, k-partite graphs
- active user/developer community
  - 30,000 downloads, 1.3 million page visits
  - ranked #60 out of 100k Sourceforge projects
- used in social network analysis, games, trust metrics, upcoming version of HP Zoomgraph, email visualization, and Netsight
JUNG software is publicly available at http://jung.sourceforge.net
We have developed a general predictive learning approach that uses historical graph data to learn a predictive model of whether a link is likely to exist between any pair of nodes A and B in a future time period. The prediction model uses both structural graph features around A and B and individual node attributes for A and B. For example, for co-author graphs, features can include the distance between A and B in the co-author graph, properties of A’s and B’s graph neighborhoods, and topic models in the form of probability distributions characterizing A’s and B’s research interests.
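As a rough illustration of the kinds of features described above, the following Python sketch computes a graph distance, neighborhood statistics, and a topic similarity for one candidate pair. The function names, the toy authors, and the choice of cosine similarity between topic distributions are assumptions made for this example; the poster does not specify the exact features or the learning model that consumes them.

```python
from collections import deque
from math import sqrt

def build_adjacency(edges):
    """Build an undirected adjacency map from (author, author) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def graph_distance(adj, source, target):
    """Breadth-first search distance; None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def cosine(p, q):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = sqrt(sum(x * x for x in p)) * sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

def pair_features(adj, topics, a, b):
    """Feature dict for the candidate link (a, b)."""
    na, nb = adj.get(a, set()), adj.get(b, set())
    return {
        "distance": graph_distance(adj, a, b),
        "common_neighbors": len(na & nb),
        "degree_a": len(na),
        "degree_b": len(nb),
        "topic_similarity": cosine(topics[a], topics[b]),
    }

# Hypothetical historical co-author edges and topic distributions.
history = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
topics = {"ann": [0.8, 0.2], "bob": [0.5, 0.5],
          "cat": [0.8, 0.2], "dan": [0.1, 0.9]}
adj = build_adjacency(history)
features = pair_features(adj, topics, "ann", "cat")
```

In a full pipeline, feature vectors like this would be computed on an earlier time window and labeled by whether the link appears in a later window, then passed to whatever classifier is chosen.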
We have developed a new query language called GAAL that allows users to express complex relational queries on attributed graphs. It supports queries on graph properties and aggregation operations, and is designed to scale to very large graphs. In 2005 we extended this approach to provide an algebraic framework for spatio-temporal analysis of semantic graphs.
[Figure: multi-relational (attributed graph) representation of entity/event data. Tables and attributes:]
- Researcher (Rid, name, URL)
- Paper (Pid, title, abs, year)
- Cite (Pid1, Pid2)
- Write (Rid, Pid, pos)
- Institute (Iid, name, type)
- WorksIn (Rid, Iid)
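One way to picture the researcher/paper data above as an attributed graph is the small Python sketch below. The `AttributedGraph` class and its methods are hypothetical and written only for this illustration; they are not part of GAAL or JUNG, and the sample records are invented.

```python
from dataclasses import dataclass, field

@dataclass
class AttributedGraph:
    # (node type, node id) -> attribute dict
    nodes: dict = field(default_factory=dict)
    # list of (relation, source key, target key, attribute dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_type, node_id, **attrs):
        self.nodes[(node_type, node_id)] = attrs

    def add_edge(self, relation, source, target, **attrs):
        # Both endpoints must already exist, much like foreign keys in tables.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("unknown endpoint")
        self.edges.append((relation, source, target, attrs))

    def neighbors(self, key, relation):
        """Targets reachable from key over edges of the given relation."""
        return [t for r, s, t, _ in self.edges if r == relation and s == key]

g = AttributedGraph()
g.add_node("Researcher", "r1", name="A. Author", URL="http://example.org/a")
g.add_node("Paper", "p1", title="Graph Queries", abs="...", year=2005)
g.add_node("Paper", "p2", title="Earlier Work", abs="...", year=2003)
g.add_node("Institute", "i1", name="Example University", type="academic")
g.add_edge("Write", ("Researcher", "r1"), ("Paper", "p1"), pos=1)
g.add_edge("Cite", ("Paper", "p1"), ("Paper", "p2"))
g.add_edge("WorksIn", ("Researcher", "r1"), ("Institute", "i1"))
papers = g.neighbors(("Researcher", "r1"), "Write")
```

Entity tables become typed nodes with attributes, and relationship tables become typed edges that can carry their own attributes, such as author position.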
[Figure: system architecture. Components: Graph Schema and Other Metadata; Graph Schema Definition Interface; Visualization/Analysis Applications (NetSight); GAAL Language (Graph Querying Algebra); DBMS-Specific Adapters; Extensible ORDBMS.]
Algorithms for Ranking Nodes in Dynamic Networks
Email Rankings and Organizational Structure
We have developed a novel algorithmic approach to the problem of determining the importance of nodes in a network where the links occur over time, e.g., an email network or a co-author network. The concept is similar to centrality ideas in social networks, and HITS and PageRank for Web page ranking, but produces a “dynamic rank” such that the rank of each node varies over time as it receives messages in the network.
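The poster does not give the ranking algorithm itself, so the Python sketch below uses a deliberately simple stand-in: each node's score at time t is an exponentially decayed count of the messages it has received so far. It is meant only to show how a node's rank can change over time as messages arrive; the function names, the half-life parameter, and the toy messages are all assumptions.

```python
from math import exp, log

def dynamic_scores(messages, times, half_life=30.0):
    """Return {time: {node: score}} from decayed receive counts.

    messages: iterable of (timestamp, sender, recipient) tuples.
    times: timestamps at which scores are evaluated.
    """
    decay = log(2) / half_life
    ordered = sorted(messages)
    result = {}
    for t in sorted(times):
        scores = {}
        for ts, _sender, recipient in ordered:
            if ts > t:
                break  # later messages have not happened yet at time t
            weight = exp(-decay * (t - ts))
            scores[recipient] = scores.get(recipient, 0.0) + weight
        result[t] = scores
    return result

def ranking(scores):
    """Nodes ordered by descending score, ties broken by name."""
    return sorted(scores, key=lambda n: (-scores[n], n))

# Hypothetical message log: (day, sender, recipient).
log_entries = [(0, "a", "b"), (1, "c", "b"),
               (40, "a", "c"), (41, "b", "c"), (42, "a", "c")]
by_time = dynamic_scores(log_entries, times=[2, 45])
early = ranking(by_time[2])
late = ranking(by_time[45])
```

In this toy log, node b is the only ranked node early on, while node c overtakes it later because c's messages are more recent.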
Example of Rankings over Time
Results on KDD Challenge/Biobase Data
This 2005 prediction competition evaluated different approaches to link prediction. The specific problem was to predict new collaborations among 300,000 medical researchers in 2002, based on co-author relations in 128,000 papers published from 1998 to 2001. The figure to the right shows the “lift curve”: the ratio of the number of true new collaborations predicted by our models’ rankings to the number predicted by a random ranking. In the top 50 predictions, for example, our models predict between 40 and 45 true collaborations (versus about 3 for a random ranking).
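The numbers quoted for the top 50 predictions correspond to a lift of roughly 13 to 15 (40/3 to 45/3). The Python sketch below shows how lift at a cutoff k could be computed from a ranked list of candidate pairs; the function names and the toy data are assumptions, and the expected number of random hits is taken as k times the fraction of candidates that are true links.

```python
def hits_at_k(ranked_pairs, true_links, k):
    """Number of true new links among the top-k ranked candidate pairs."""
    return sum(1 for pair in ranked_pairs[:k] if pair in true_links)

def lift_at_k(ranked_pairs, true_links, k):
    """Model hits in the top k divided by the expected hits of a random ranking.

    Assumes every true link appears somewhere in ranked_pairs.
    """
    expected_random = k * len(true_links) / len(ranked_pairs)
    return hits_at_k(ranked_pairs, true_links, k) / expected_random

# Hypothetical ranking over 10 candidate pairs, 2 of which become real links.
ranked = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d"),
          ("b", "c"), ("d", "e"), ("a", "e"), ("b", "e"), ("c", "e")]
true_links = {("a", "b"), ("b", "d")}
toy_lift = lift_at_k(ranked, true_links, k=2)

# Lift implied by the figures quoted in the text for the top 50 predictions.
poster_lift_low = 40 / 3
poster_lift_high = 45 / 3
```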
Data: Corporate Email History (1 million emails, 21 months, 628 individuals)
[Figure: algebraic framework linking Graph Query, Spatial Queries, and Temporal Queries through the operators STProject(G,P,O,T,F) and GProject(G,I,R). Diagram labels: "A triple of: (target-property, spatial-property, objectId)"; "Point set, relationship set (pair of points), spatial/temporal value"; "Graph that embeds input data"; "Node set, edge set, property value, graph, graph-set".]
LEGEND:
G - base graph
P - set of spatial/temporal properties
O - node/link type with the target property
T - set of target properties of nodes/edges (optional)
F - a filtering condition (optional)
I - spatial/temporal query output
R - relationships to be used in embedding of I
• STProject: projects out spatial/temporal properties and a target property of nodes/links for spatial analysis.
• GProject: embeds a set of nodes or a new relationship type in the graph; nodes that have the same type as those in I are filtered out.
Find the news source that had the most coverage of the most heavily damaged regions during the tsunami disaster.
Step 1: Spatial projection:
STProject(TsunamiGraph, source_agency, report, mentionedCity,,)
Step 2: Find the top 3 cities with most damage:
Distinguish(topic=damage,,sum,3)
Step 3: Project into graph to find the sources:
GProject(TsunamiGraph, topCities, range(mentionedCity))
Step 4: Find the top 3 sources using graph query language:
SELECT ?source
FROM {?report, mentionedCity, ?city, ?source, reports, ?report}
WHERE ?city IN <topCities>
GROUP BY source_agency
AGGREGATE branchwise count(reports) INTO ?source.repCount)
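The logic of the four steps can be imitated on toy data in plain Python, as sketched below. This does not implement GAAL, STProject, Distinguish, or GProject; the record layout, the agency and city names, and the use of the number of damage-topic reports as a proxy for how heavily a city was damaged are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records: (report id, source agency, mentioned city, topic).
reports = [
    ("r1", "AgencyA", "CityX", "damage"),
    ("r2", "AgencyA", "CityY", "damage"),
    ("r3", "AgencyB", "CityX", "damage"),
    ("r4", "AgencyB", "CityZ", "relief"),
    ("r5", "AgencyC", "CityX", "damage"),
    ("r6", "AgencyA", "CityX", "relief"),
    ("r7", "AgencyC", "CityZ", "damage"),
]

def top_damaged_cities(records, n):
    """Steps 1-2: project (city, topic) and keep the n cities with the
    most damage-topic reports (ties broken alphabetically)."""
    counts = Counter(city for _, _, city, topic in records if topic == "damage")
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [city for city, _ in ordered[:n]]

def top_sources(records, cities, n):
    """Steps 3-4: restrict to reports mentioning the chosen cities, then
    count reports per source agency and keep the n largest counts."""
    wanted = set(cities)
    counts = Counter(src for _, src, city, _ in records if city in wanted)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

top_cities = top_damaged_cities(reports, n=2)
sources = top_sources(reports, top_cities, n=2)
```

The split mirrors the query plan: a spatial/aggregate step first narrows the set of cities, and a graph-style grouping step then counts reports per source over that reduced set.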
[Figure: GRAPH SCHEMA for TsunamiGraph. Node and attribute labels: Topic, Report, Source_agency, Reporter, city, country, name, date, Hour, fileDate, fileHour, fileLocation. Relationship labels: reports, mentions, About, basedIn, Replicates, References, mentionedCity, filedBy. An annotation on the schema reads "Damage related topics: differentiate these".]
Architecture
Query Example
Algebraic Framework