Marko Grobelnik, Dunja Mladenic (JSI)
Parts of the presentation are taken from the tutorial "Structure and function of real-world graphs and networks" by Jure Leskovec, CMU/JSI
Slide 2
What are networks? A few examples
Network properties
  Small worlds
  Power law
  Long tail
  Network resilience
Structure of networks
Applications
  Mining e-mail server logs
  Mining MSN Messenger data
Slide 3
[Diagram: (complex) networks, with labels Statistics, Computer systems, Theory and algorithms, Machine learning / Data mining]
Slide 4
[Diagram: (complex) networks, with labels Statistics, Computer systems, Theory and algorithms, Machine learning / Data mining, Social Sciences, Biology, Physics, Industry & Applications, Computer Science]
Slide 5
Vertex / Node
Slide 6
Edge/ Link
Slide 7
Vertex / Node Edge/ Link Direction
Slide 8
Vertex / Node, Edge / Link, Direction, Probabilities (0.3, 0.6, 0.1)
Slide 9
Vertex / Node, Edge / Link, Direction, Probabilities (0.3, 0.6, 0.1)
In dynamic networks, all the elements of the graph are changing
Dealing with dynamic networks is an active research topic
Slide 10
Example of Dynamic Graph (1/3)
[Screenshot labels: Query; Active topic during limited time period]
Slide 11
Example of Dynamic Graph (2/3)
On 1996-08-30, Clinton and Chicago are connected
Slide 12
Example of Dynamic Graph (3/3)
On 1996-10-02, Clinton and Chicago are NOT connected
Slide 13
Information networks:
  World Wide Web: hyperlinks
  Citation networks
  Blog networks
Social networks: people + interactions
  Organizational networks
  Communication networks
  Collaboration networks
  Sexual networks
Technological networks:
  Power grid
  Airline, road, river networks
  Telephone networks
  Internet
  Autonomous systems
[Figures: Florence families, Karate club network, Collaboration network, Friendship network]
Slide 14
Biological networks:
  Metabolic networks
  Food web
  Neural networks
  Gene regulatory networks
Language networks
Semantic networks
Software networks
[Figures: Yeast protein interactions, Semantic network, Language network, XFree86 network]
Slide 15
Directed / undirected
Multigraphs (multiple edges between nodes)
Hypergraphs (edges connecting multiple nodes)
Bipartite graphs (e.g., papers to authors)
Weighted networks
Different types of nodes and edges
Evolving networks:
  Nodes and edges only added
  Nodes and edges added and removed
Slide 16
Sociologists were the first to study networks:
  Study of patterns of connections between people to understand the functioning of society
  People are nodes, interactions are edges
  Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
  Typical questions: centrality and connectivity
Limited to small graphs (~10 nodes) and to properties of individual nodes and edges
Slide 17
Large networks (e.g., web, internet, on-line social networks) with millions of nodes
Many traditional questions are no longer useful:
  Traditional: What happens if a node U is removed?
  Now: What percentage of nodes needs to be removed to affect network connectivity?
Focus moves from a single node to the study of statistical properties of the network as a whole
We cannot draw (plot) the network and examine it
Slide 18
What does the network look like, even if I can't look at it?
Need for statistical methods and tools to quantify large networks
3 parts/goals:
  Statistical properties of large networks
  Models that help understand these properties
  Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes
Slide 19
Features common to networks of different types:
Properties of static networks:
  Small-world effect
  Transitivity or clustering
  Degree distributions (scale-free networks)
  Network resilience
  Community structure
  Subgraphs or motifs
Temporal properties:
  Densification
  Shrinking diameter
Slide 20
Six degrees of separation (Milgram, 1960s):
  Random people in Nebraska were asked to send letters to stockbrokers in Boston
  Letters could only be passed to first-name acquaintances
  Only 25% of the letters reached the goal
  But they reached it in about 6 steps
Measuring path lengths:
  Diameter (longest shortest path): max d_ij
  Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
  Mean geodesic (shortest) distance l
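The three path-length measures above can be sketched in a few lines of Python. This is only an illustration, assuming an undirected, unweighted graph stored as a dictionary of neighbour sets; the function names and the toy path graph are made up for the example. It runs a breadth-first search from every node, which is feasible only for small graphs (large networks are usually sampled instead).

```python
from collections import deque
import math


def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_stats(adj, quantile=0.9):
    """Return (diameter, effective diameter, mean geodesic distance).

    Only connected pairs of distinct nodes are counted.
    """
    lengths = []
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if s != t:
                lengths.append(d)
    lengths.sort()
    diameter = lengths[-1]
    # Smallest distance within which at least `quantile` of the pairs lie.
    effective = lengths[math.ceil(quantile * len(lengths)) - 1]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean


# Hypothetical toy graph: a path a - b - c - d - e.
path_graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
diameter, effective, mean = path_length_stats(path_graph)
print(diameter, effective, mean)
```

On the toy path graph the diameter is 4 while the effective diameter is 3, which shows why the effective diameter is less sensitive to a few very long paths.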
Slide 21
An empirical observation for the Web graph is that its diameter is small relative to the size of the network
  This property is called "Small World"
  Formally, small-world networks have a diameter exponentially smaller than their size
By simulation it was shown that for a Web size of 1B pages the diameter is approx. 19 steps
  Empirical studies confirmed the findings
Slide 22
The network represents collaboration between institutions on FP5-IST projects funded by the European Union
  There are 7886 organizations collaborating on 2786 projects
  In the network, each node is an organization; two organizations are connected if they collaborate on at least one project
Small-world properties of the collaboration network:
  The main connected part of the network contains 94% of the nodes
  The max distance between any two organizations is 7 steps, meaning that any organization can be reached in at most 7 steps from any other organization
  The average distance between any two organizations is 3.15 steps (with standard deviation 0.38)
  38% (2770) of organizations have an avg. distance of 3 or less
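As a rough sketch of how such a collaboration network can be built, the Python below connects two organizations whenever they appear in the same project and then measures the share of nodes in the largest connected part. The project and organization names are hypothetical placeholders, not FP5-IST data, and the function names are invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def project_collaborations(project_members):
    """Connect two organizations if they share at least one project."""
    adj = defaultdict(set)
    for members in project_members.values():
        for org in members:
            adj[org]  # touch the key so single-partner projects still add a node
        for a, b in combinations(sorted(set(members)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def largest_component_share(adj):
    """Fraction of all nodes that lie in the largest connected component."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / len(adj)


# Hypothetical input: project id -> participating organizations.
projects = {
    "P1": ["OrgA", "OrgB", "OrgC"],
    "P2": ["OrgC", "OrgD"],
    "P3": ["OrgE"],
}
collab = project_collaborations(projects)
print(sorted(collab["OrgC"]))           # partners of OrgC
print(largest_component_share(collab))  # share of nodes in the main part
```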
Slide 23
1856 collaborations; avg. distance is 1.95; max. distance is 4
Slide 24
179 collaborations; avg. distance is 2.42; max. distance is 4
Slide 25
8 collaborations; max. distance is 7
Slide 26
Distribution of shortest path lengths
Microsoft Messenger network:
  180 million people
  1.3 billion edges
  Edge if two people exchanged at least one message in a one-month period
Pick a random node, count how many nodes are at distance 1, 2, 3, ... hops
[Plot: number of nodes vs. distance (hops); label: 7]
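The "pick a random node and count nodes per hop" measurement can be sketched as a single breadth-first search. This is a minimal illustration on a hypothetical six-node ring, not the Messenger data; `hop_histogram` and the fixed random seed are assumptions made for the example.

```python
from collections import Counter, deque
import random


def hop_histogram(adj, source):
    """Map each hop distance (1, 2, 3, ...) to the number of nodes at that distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = Counter(dist.values())
    del counts[0]  # drop the source node itself
    return dict(sorted(counts.items()))


# Hypothetical toy graph: a ring of six nodes 0..5.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

start = random.Random(0).choice(sorted(ring))  # "pick a random node"
print(start, hop_histogram(ring, start))
```

On a real network the histogram would be averaged over many random start nodes to obtain the distribution of shortest path lengths.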
Slide 27
Power law describes relations between the objects in the network
  It is very characteristic of networks generated by some kind of social process
  It describes the scale invariance found in many natural phenomena (including physics, biology, sociology, economy and linguistics)
Slide 28
In the context of the Web, the power law appears in many cases:
  Web page sizes
  Web page connectivity
  Web connected component sizes
  Web page access statistics
  Web browsing behavior
Formally, the power laws describing web page degrees are:
(This property has been preserved as the Web has grown)
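For orientation, a power-law degree distribution has the generic form sketched below. The constant C and the exponent γ are generic symbols introduced here for illustration; the specific exponents for web page degrees are not reproduced.

```latex
P(k) = C\,k^{-\gamma}, \qquad \gamma > 1
\quad\Longrightarrow\quad
\log P(k) = \log C - \gamma \log k
```

On a log-log plot this is a straight line with slope −γ. Rescaling the degree by a factor c only rescales the probability, P(ck) = c^{-γ} P(k), which is the scale invariance mentioned on the previous slide.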
Slide 29
Slide 30
Degree distribution: number of people a person talks to on Microsoft Messenger
[Plot: count vs. node degree; X marks the highest degree]
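A degree distribution like the one plotted here can be computed by counting how many nodes have each degree. The sketch below also fits a straight line to the log-log points as a crude estimate of a power-law exponent; least squares on log-log data is a rough heuristic (maximum-likelihood estimators are generally considered more reliable), and all names and toy values are assumptions for illustration.

```python
from collections import Counter
import math


def degree_distribution(adj):
    """Map each degree k to the number of nodes that have degree k."""
    return Counter(len(neighbours) for neighbours in adj.values())


def loglog_slope(dist):
    """Least-squares slope of log(count) against log(degree)."""
    points = [(math.log(k), math.log(c)) for k, c in dist.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical toy graph: one hub connected to four leaves.
star = {"hub": {"a", "b", "c", "d"}, "a": {"hub"}, "b": {"hub"},
        "c": {"hub"}, "d": {"hub"}}
print(degree_distribution(star))  # four nodes of degree 1, one of degree 4

# Synthetic counts that follow count = 1000 * k**-2 exactly.
synthetic = {1: 1000, 10: 10}
print(loglog_slope(synthetic))    # close to -2
```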
Slide 31
This is not directly related to graphs, but it nicely explains the long tail effect.
It shows that there is a big market for niche products.
Slide 32
We observe how the connectivity (length of the paths) of the network changes as vertices get removed
  It is important for epidemiology
  Removal of vertices corresponds to vaccination
Real-world networks are resilient to random attacks
  One has to remove all web pages of degree > 5 to disconnect the web
  But this is a very small percentage of web pages
A random network has better resilience to targeted attacks
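A small experiment makes the difference between random and targeted removal concrete. The sketch below tracks the size of the largest connected component as nodes are removed, either in random order or starting from the highest-degree nodes. It measures component size rather than path length as a simple proxy for connectivity, ranks nodes by their initial degree only, and uses a hypothetical hub-and-spoke graph; all of these are simplifying assumptions.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def removal_curve(adj, order):
    """Largest component size after removing the first 0, 1, 2, ... nodes of `order`."""
    removed = set()
    curve = [largest_component_size(adj, removed)]
    for node in order:
        removed.add(node)
        curve.append(largest_component_size(adj, removed))
    return curve


def targeted_order(adj):
    """Attack order: highest initial degree first."""
    return sorted(adj, key=lambda n: len(adj[n]), reverse=True)


def random_order(adj, seed=0):
    """Attack order: a seeded random permutation of the nodes."""
    nodes = sorted(adj)
    random.Random(seed).shuffle(nodes)
    return nodes


# Hypothetical toy graph: a hub with five leaves.
leaves = [f"leaf{i}" for i in range(5)]
hub_graph = {"hub": set(leaves), **{leaf: {"hub"} for leaf in leaves}}

print(removal_curve(hub_graph, targeted_order(hub_graph)[:1]))  # remove the hub first
print(removal_curve(hub_graph, ["leaf0"]))                      # remove one leaf
print(removal_curve(hub_graph, random_order(hub_graph)[:2]))    # two random removals
```

Removing the single hub shatters the toy graph, while removing a leaf barely changes it, which mirrors the slide's point about hubs in networks with heavy-tailed degree distributions.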
Slide 33
What are the building blocks (motifs) of networks?
Do motifs have specific roles in networks?
Network motif detection process:
  Count how many times each subgraph appears
  Compute the statistical significance of each subgraph: the probability of it appearing in a random network as often as in the real network
[Figure: 3-node motifs]
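The two detection steps can be sketched for one 3-node motif, the feed-forward loop (a→b, b→c, a→c). The code counts the motif in a directed edge list and compares the count with random graphs having the same number of nodes and edges, reporting a z-score. The null model is a deliberate simplification chosen for brevity: motif studies typically use degree-preserving randomization instead. All function names and the toy edge list are hypothetical.

```python
import random
from statistics import mean, pstdev


def count_feed_forward(edges):
    """Count feed-forward loops: a->b, b->c and a->c with a, b, c distinct."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    count = 0
    for a, targets in succ.items():
        for b in targets:
            if b == a:
                continue
            for c in succ.get(b, ()):
                if c != a and c != b and c in targets:
                    count += 1
    return count


def random_digraph(nodes, n_edges, rng):
    """Random directed graph on `nodes` with `n_edges` edges and no self-loops."""
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(possible, n_edges)


def motif_z_score(edges, n_random=200, seed=0):
    """Return (real count, z-score against the simple random null model)."""
    edges = list(set(edges))
    rng = random.Random(seed)
    nodes = sorted({n for e in edges for n in e})
    real = count_feed_forward(edges)
    counts = [count_feed_forward(random_digraph(nodes, len(edges), rng))
              for _ in range(n_random)]
    mu, sigma = mean(counts), pstdev(counts)
    z = (real - mu) / sigma if sigma > 0 else None  # undefined if no variation
    return real, z


# Hypothetical toy network containing exactly one feed-forward loop.
example_edges = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
print(count_feed_forward(example_edges))
print(motif_z_score(example_edges, n_random=50))
```

Enumerating all possible edges in `random_digraph` is quadratic in the number of nodes, so this sketch is suitable only for small graphs.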
Slide 34
Biological networks:
  Feed-forward loop
  Bi-fan motif
Web graph:
  Feedback with two mutual dyads
  Mutual dyad
  Fully connected triad
Slide 35
Intuition says that distances between the nodes slowly grow as the network grows (like log n)
But as the network grows, the distances between nodes slowly decrease
[Plots: Internet, Citations]
Slide 36
In November 1999, a large-scale study using AltaVista crawls of over 200M nodes and 1.5B links reported a "bow tie" structure of web links
  We suspect that, because of the scale-free nature of the Web, this structure is still preserved
Slide 37
SCC: strongly connected component, where pages can reach each other via directed paths
IN: pages that can reach the core via a directed path, but cannot be reached from the core
OUT: pages that can be reached from the core via a directed path, but cannot reach the core
TENDRILS: components disconnected from the core, attached only via directed paths to IN or OUT, with no path from or to the core
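The bow-tie regions can be recovered with two reachability searches from any page known to lie in the core: one following links forward and one following them backward. The sketch below assumes such a core page is given, and it lumps tendrils together with every other page that has no path from or to the core under a single "OTHER" label. Function names and the toy edge list are hypothetical.

```python
from collections import deque


def reachable(adj, start):
    """All nodes reachable from `start` by following edges in `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(edges, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to `core_node`'s component."""
    fwd, bwd, nodes = {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, set()).add(v)
        bwd.setdefault(v, set()).add(u)
        nodes.update((u, v))
    descendants = reachable(fwd, core_node)  # pages the core can reach
    ancestors = reachable(bwd, core_node)    # pages that can reach the core
    scc = descendants & ancestors
    return {
        "SCC": scc,
        "IN": ancestors - scc,
        "OUT": descendants - scc,
        "OTHER": nodes - ancestors - descendants,  # tendrils and disconnected parts
    }


# Hypothetical toy web: i -> core {a, b} -> o, a tendril t hanging off IN,
# and a separate pair d1 -> d2.
links = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"),
         ("i", "t"), ("d1", "d2")]
parts = bow_tie(links, core_node="a")
print(parts)
```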
Slide 38
Slide 39
We address the problem of how to construct a taxonomy from social network data
  We adapt the approach used when dealing with text
As an example we use the e-mail graph of a mid-size research institution
  ...communication records of JSI, 770 people
The experiments and evaluation show our approach to be useful and applicable in real-life situations
  The approach could easily be reused in other case studies (and elsewhere)
Slide 40
The main contribution of the deliverable is the architecture & software, consisting of 5 major steps:
1. Starting with log files from the institutional e-mail server, where the data include information about e-mail transactions with three fields: time, sender and the list of receivers.
2. After cleaning, we get the data in the form of e-mail transactions, which include the e-mail addresses of sender and receiver.
3. From the set of e-mail transactions we construct a graph where vertices are e-mail addresses, connected if there is a transaction between them.
4. The e-mail graph is transformed into a sparse matrix, allowing data manipulation and analysis operations to be performed.
5. The sparse matrix representation of the graph is analyzed with ontology learning tools, producing an ontological structure corresponding to the organizational structure of the institution the e-mails came from.
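Steps 3 and 4 above can be sketched in plain Python as follows. The code treats the graph as undirected and uses the number of transactions between two addresses as the edge weight; both choices, like the function names and the example.org addresses, are assumptions for illustration and not necessarily what the original software did. The sparse matrix is stored as a dictionary of rows, so each address ends up with a sparse vector of its contacts, similar to a bag-of-words vector for a document.

```python
from collections import defaultdict


def build_email_graph(transactions):
    """Step 3: turn (sender, receivers) transactions into weighted undirected edges.

    Returns (index, weights): `index` maps address -> integer id and
    `weights` maps (i, j) with i < j -> number of transactions.
    """
    index = {}
    weights = defaultdict(int)
    for sender, receivers in transactions:
        for receiver in receivers:
            if sender == receiver:
                continue  # ignore mail sent to oneself
            i = index.setdefault(sender, len(index))
            j = index.setdefault(receiver, len(index))
            weights[(min(i, j), max(i, j))] += 1
    return index, dict(weights)


def sparse_rows(index, weights):
    """Step 4: symmetric sparse matrix as {row id: {column id: weight}}."""
    rows = {i: {} for i in index.values()}
    for (i, j), w in weights.items():
        rows[i][j] = w
        rows[j][i] = w
    return rows


# Hypothetical cleaned transactions.
transactions = [
    ("ana@example.org", ["bor@example.org", "cene@example.org"]),
    ("bor@example.org", ["ana@example.org"]),
]
index, weights = build_email_graph(transactions)
matrix = sparse_rows(index, weights)
print(index)
print(matrix)
```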
Slide 41
The data is a collection of log files with e-mail transactions from the local e-mail spam filter software Amavis (http://www.amavis.org/)
  Each line of the log files denotes one event at the spam filter software
We were interested in the events on successful e-mail transactions
  ...having information on time, sender, and list of receivers
An example of a successful e-mail transaction is the following line:
  2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, [217.32.164.151] [193.113.30.29] ->, Message-ID:, Hits: -1.668, 6389 ms
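Extracting time, sender and receivers from such a line might look like the sketch below. The addresses are missing from the example line above, so the sample line used here fills them in with hypothetical example.org addresses in angle brackets, separated by commas; that layout, the regular expression, and the function name are assumptions and should be checked against real log files before use.

```python
import re
from datetime import datetime

# Assumed layout: "<sender> -> <receiver1>,<receiver2>," after "Passed CLEAN,".
LINE_RE = re.compile(
    r"^(?P<time>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r".*?Passed CLEAN,.*?"
    r"<(?P<sender>[^>]*)> -> (?P<receivers>(?:<[^>]*>,?)+)"
)


def parse_transaction(line):
    """Return (time, sender, receivers) for a successful transaction, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # Collapse repeated spaces (single-digit days) before parsing the timestamp.
    stamp = " ".join(match.group("time").split())
    when = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")  # English month names
    receivers = re.findall(r"<([^>]*)>", match.group("receivers"))
    return when, match.group("sender"), receivers


# Hypothetical line: same shape as the example, with made-up addresses added.
sample = (
    "2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, "
    "[217.32.164.151] [193.113.30.29] <alice@example.org> -> "
    "<bob@example.org>,<carol@example.org>, Message-ID: <id@example.org>, "
    "Hits: -1.668, 6389 ms"
)
print(parse_transaction(sample))
print(parse_transaction("2005 Mar 28 14:00:00 patsy amavis[1]: some other event"))
```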
Slide 42
The log files include e-mail data from Sep 5th 2003 to Mar 28th 2005
  This sums up to 12.8 GB of data
  After filtering for successful e-mail transactions, 564 MB remain, containing approx. 2.7 million successful e-mail transactions used for further processing
The whole dataset contains references to approx. 45000 e-mail addresses
  After the data cleaning phase the number is reduced to approx. 17000 e-mail addresses
  ...of which 770 e-mail addresses are internal to the home institution (with the ijs.si domain name)
Slide 43
Organizational structure of JSI produced from cleaned e-mail
transactions with OntoGen in