Microsoft Instant Messenger Communication Network
How does the world communicate?

Jure Leskovec ([email protected])
Machine Learning Department
http://www.cs.cmu.edu/~jure

Joint work with: Eric Horvitz, Microsoft Research

Networks: Why?

Today: large on-line systems leave detailed records of social activity
• On-line communities: MySpace, Facebook
• Email, blogging, instant messaging
• On-line publication repositories: arXiv, MedLine

Emerging behavior (need lots of data):
• Actions of individual nodes are independent, but global patterns and regularities emerge

3

The Largest Social Network

What is the largest social network in the world (that we can relatively easily obtain)?

For the first time we had a chance to look at the complete (anonymized) communication of the whole planet (using the Microsoft MSN instant messenger network)

4

Instant Messaging

• Contact (buddy) list
• Messaging window

5

Instant Messaging as a Network

[Figure labels: Buddy, Conversation]

6

IM – Phenomena at planetary scale

Observe social phenomena at planetary scale:
• How does communication change with user demographics (distance, age, sex)?
• How does geography affect communication?
• What is the structure of the communication network?

7

Communication data

The record of communication:
• Presence data: user status events (login, status change)
• Communication data: who talks to whom
• Demographics data: user age, sex, location

8

Data description: Presence

Events:
• Login, Logout
• Whether this is the first-ever login
• Add/Remove/Block buddy
• Add unregistered buddy (invite new user)
• Change of status (busy, away, BRB, idle, …)

For each event:
• User Id
• Time

9

Data description: Communication

For every conversation (session) we have a list of users who participated in the conversation
• There can be multiple people per conversation

For each conversation and each user:
• User Id
• Time Joined
• Time Left
• Number of Messages Sent
• Number of Messages Received
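A minimal sketch of how such per-user session rows could be turned into "who talks to whom" pairs. The class name, field names, the `conversation_id` key and the time unit are hypothetical; the slides only list the fields. Linking every pair of co-participants is a simplifying assumption, not necessarily the rule used in the study.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Participation:
    """One user's row within one conversation (hypothetical field names)."""
    conversation_id: int
    user_id: int
    time_joined: float        # assumed unit: seconds
    time_left: float
    messages_sent: int
    messages_received: int


def communication_edges(rows):
    """Return the set of undirected user pairs that shared a conversation."""
    by_conversation = {}
    for row in rows:
        by_conversation.setdefault(row.conversation_id, set()).add(row.user_id)
    edges = set()
    for users in by_conversation.values():
        # every pair of participants in the same session becomes one edge
        for u, v in combinations(sorted(users), 2):
            edges.add((u, v))
    return edges


rows = [
    Participation(1, 10, 0.0, 60.0, 3, 2),
    Participation(1, 11, 0.0, 60.0, 2, 3),
    Participation(2, 10, 100.0, 130.0, 1, 2),
    Participation(2, 11, 100.0, 130.0, 1, 2),
    Participation(2, 12, 105.0, 130.0, 1, 2),
]
edges = communication_edges(rows)
print(sorted(edges))
```

Storing each pair in sorted order means a pair seen in several conversations still yields a single edge.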

10

Data description: Demographics

For every user (self-reported):
• Age
• Gender
• Location (Country, ZIP)
• Language
• IP address (we can do reverse geo IP lookup)

11

Data collection

• Log size: 150 GB/day
• Just copying over the network takes 8 to 10 h
• Parsing and processing takes another 4 to 6 h
• After parsing and compressing: ~45 GB/day
• Collected data for 30 days of June 2006
• Total: 1.3 TB of compressed data
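The totals follow from the daily figures. A quick check, assuming the sizes are gigabytes and using decimal units (1 TB = 1000 GB); the variable names are illustrative only.

```python
DAYS = 30
RAW_GB_PER_DAY = 150
COMPRESSED_GB_PER_DAY = 45

raw_total_tb = RAW_GB_PER_DAY * DAYS / 1000                # raw logs over the month
compressed_total_tb = COMPRESSED_GB_PER_DAY * DAYS / 1000  # after parsing and compressing
compression_ratio = RAW_GB_PER_DAY / COMPRESSED_GB_PER_DAY

print(f"raw: {raw_total_tb:.2f} TB, compressed: {compressed_total_tb:.2f} TB")
print(f"compression ratio: {compression_ratio:.2f}x")
```

Under these assumptions the compressed total is 1.35 TB, consistent with the roughly 1.3 TB quoted.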

12

Network: Conversations

Conversation

13

Data statistics

Activity over June 2006 (30 days):
• 245 million users logged in
• 180 million users engaged in conversations
• 17.5 million new accounts activated
• More than 30 billion conversations

14

Data statistics per day

Activity on June 1, 2006:
• 1 billion conversations
• 93 million users logged in
• 65 million different users talked (exchanged messages)
• 1.5 million invitations for new accounts sent

15

User characteristics: age

16

Age pyramid: MSN vs. the world

17

Conversation: Who talks to whom?

Cross-gender edges:
• 300 million male-male and 235 million female-female edges
• 640 million female-male edges

18

Number of people per conversation

The maximum number of people talking simultaneously is 20, but a conversation can have more participants in total

19

Conversation duration

Most conversations are short

20

Conversations: number of messages

Sessions between fewer people run out of steam

21

Time between conversations

• Individuals are highly diverse
• What is the probability of logging into the system after t minutes?
• Power law with exponent 1.5
• Task queuing model [Barabasi]
• My email and Darwin’s and Einstein’s letters follow the same pattern

22

Age: Number of conversations

[Figure: axis label "User self-reported age"; scale from Low to High]

23

Age: Total conversation duration

[Figure: axis label "User self-reported age"; scale from Low to High]

24

Age: Messages per conversation

[Figure: axis label "User self-reported age"; scale from Low to High]

25

Age: Messages per unit time

[Figure: axis label "User self-reported age"; scale from Low to High]

26

Who talks to whom: Number of conversations

27

Who talks to whom: Conversation duration

28

Geography and communication

Count the number of users logging in from a particular location on Earth

29

How is Europe talking

Logins from Europe

30

Users per geo location

Blue circles have more than 1 million logins.

31

Users per capita

Fraction of population using MSN:
• Iceland: 35%
• Spain: 28%
• Netherlands, Canada, Sweden, Norway: 26%
• France, UK: 18%
• USA, Brazil: 8%

32

Communication heat map

For each conversation between geo points (A,B) we increase the intensity on the line between A and B
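The accumulation step can be sketched as follows. This is an illustration under stated assumptions, not the rendering code behind the slide: geo points are taken to be cells of a flat grid, the line is a straight segment sampled at unit steps, and `draw_conversation` is a hypothetical helper name.

```python
def draw_conversation(grid, a, b, weight=1):
    """Add `weight` to every grid cell on the straight segment from a to b.

    `a` and `b` are (row, col) cells. The segment is sampled at unit steps,
    a simple stand-in for a proper line-drawing routine.
    """
    (r0, c0), (r1, c1) = a, b
    steps = max(abs(r1 - r0), abs(c1 - c0))
    visited = set()
    for i in range(steps + 1):
        t = i / steps if steps else 0.0
        cell = (round(r0 + t * (r1 - r0)), round(c0 + t * (c1 - c0)))
        if cell not in visited:          # count each cell once per conversation
            visited.add(cell)
            grid[cell[0]][cell[1]] += weight


grid = [[0] * 5 for _ in range(5)]
draw_conversation(grid, (0, 0), (4, 4))   # diagonal
draw_conversation(grid, (0, 4), (4, 0))   # anti-diagonal, crosses the first at (2, 2)
for row in grid:
    print(row)
```

Cells crossed by many conversation lines accumulate high intensity, which is what makes heavily used corridors stand out.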

33

Homophily (gliha v kup štriha, a Slovenian saying roughly equivalent to "birds of a feather flock together")

[Figure: Age vs. Age; panels: Correlation, Probability]

34

Per country statistics

On a particular typical day…

Country   # of logins   # of users   # of messages   Messages per user
USA       38,319,363    13,261,337   412,729,278     31.12
Brazil    20,582,613    7,864,424    467,972,522     59.50
France    19,163,131    6,475,858    518,931,785     80.13
Unknown   18,444,352    6,872,347    191,167,085     27.81
Spain     16,868,549    6,140,895    503,759,240     82.03
UK        16,659,009    5,724,826    487,018,470     85.07
Canada    14,558,692    5,021,185    160,249,686     31.91
China     14,225,163    5,314,463    101,003,729     19.00
Turkey    13,619,789    4,696,555    353,540,475     75.27
Mexico    10,756,989    4,359,932    209,195,100     47.98

Note that global usage and market share statistics are higher if we accumulate data over longer time periods.
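The "Messages per user" column appears to be the number of messages divided by the number of users. A small check on three rows of the table above (the dictionary layout is illustrative only):

```python
# (users, messages) on the day shown in the table, for three of the rows
table = {
    "USA": (13_261_337, 412_729_278),
    "France": (6_475_858, 518_931_785),
    "UK": (5_724_826, 487_018_470),
}

messages_per_user = {
    country: messages / users for country, (users, messages) in table.items()
}
for country, value in messages_per_user.items():
    print(f"{country}: {value:.2f}")
```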

35

Per typical user per country

On a typical day, an MSN user from a country …

Country       Logins on a particular day   Users on a particular day   Messages sent   Messages per user
Slovenia      364,988                      130,884                     15,919,892      121.63
Malta         122,846                      41,829                      4,993,316       119.37
Hungary       1,214,268                    427,320                     47,623,604      111.45
Bosnia        105,584                      35,689                      3,254,170       91.18
Reunion       100,335                      33,399                      3,041,635       91.07
Gibraltar     19,096                       6,452                       581,195         90.08
UK            16,659,009                   5,724,826                   487,018,470     85.07
Macedonia     126,729                      43,754                      3,669,977       83.88
Netherlands   7,399,160                    2,696,669                   221,300,210     82.06
Spain         16,868,549                   6,140,895                   503,759,240     82.03

Note that global usage and market share numbers are higher if we accumulate data over longer time periods.

36

What about Slovenia (per capita)?

Statistic                  Number       Rank (per capita)
Conversations inside       19,868,886   22
Conversations to outside   7,868,483    48
Total conversations        27,737,369   29
Avg. time inside           309.49       147
Avg. time outside          314.39       80
Avg. time inside (pct.)    0.4960
Messages sent inside       9.78         32
Messages sent outside      9.46         19
Messages inside (pct.)     0.5083
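The "(pct.)" rows appear to be the inside value divided by the sum of the inside and outside values, and the total is the sum of the two conversation counts. This reading is an inference from the numbers, not something the slide states. A quick check:

```python
avg_time_inside, avg_time_outside = 309.49, 314.39
msgs_inside, msgs_outside = 9.78, 9.46

# share of the "inside" value in the inside + outside sum
time_inside_share = avg_time_inside / (avg_time_inside + avg_time_outside)
msgs_inside_share = msgs_inside / (msgs_inside + msgs_outside)
total_conversations = 19_868_886 + 7_868_483

print(f"time inside share: {time_inside_share:.4f}")
print(f"messages inside share: {msgs_inside_share:.4f}")
print(f"total conversations: {total_conversations:,}")
```

With the rounded inputs shown on the slide, both shares agree with the listed values to about three decimal places.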

37

Who is Slovenia talking to?

Rank   Target country   Pairs of people   Number of conversations   Avg. time per conv.   Avg. # of messages
1      Slovenia         13,41,250         19,868,886                309.4                 9.78
2      USA              61,794            922,527                   303.4                 9.14
3      Spain            27,650            310,357                   289.4                 7.97
4      UK               14,709            204,335                   325.4                 9.02
5      Germany          9,047             129,551                   350.3                 10.20
6      Bosnia           9,956             114,509                   385.9                 14.62
7      Yugoslavia       8,194             104,270                   381.7                 12.55
8      Italy            8,612             100,698                   358.8                 9.89
9      Croatia          6,838             84,362                    359.0                 11.00
10     Turkey           10,763            77,651                    292.4                 8.08
11     Albania          9,517             76,440                    320.7                 10.88
12     Sweden           5,083             69,019                    306.9                 8.34
13     Netherlands      5,061             68,287                    315.9                 8.87
14     Canada           5,003             60,617                    301.8                 7.38

38

Instant Messaging as a Network

[Figure label: Buddy]

39

IM Communication Network

Buddy graph:
• 240 million people (people who logged in during June ’06)
• 9.1 billion edges (friendship links)

Communication graph:
• There is an edge if the users exchanged at least one message in June 2006
• 180 million people
• 1.3 billion edges
• 30 billion conversations

40

Buddy network: Number of buddies

Buddy graph: 240 million nodes, 9.1 billion edges (~40 buddies per user)
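The per-user figure can be reproduced from the graph size. A rough check follows; which counting convention the slide uses is not stated, so both readings are shown and the variable names are illustrative.

```python
nodes = 240e6      # users in the buddy graph
edges = 9.1e9      # buddy links

# If each link is counted once per list owner, edges / nodes is the mean list size.
edges_per_node = edges / nodes
# If each link is undirected and counted once, it contributes to both endpoints.
mean_degree_if_undirected = 2 * edges / nodes

print(f"edges per node: {edges_per_node:.1f}")
print(f"mean degree if links are undirected: {mean_degree_if_undirected:.1f}")
```

The first reading gives about 38, in line with the roughly 40 buddies per user quoted.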

Network: Small-world

• 6 degrees of separation [Milgram ’60s]
• Average distance 5.5
• 90% of nodes can be reached in < 8 hops

Hops   Nodes
1      10
2      78
3      396
4      8648
5      3299252
6      28395849
7      79059497
8      52995778
9      10321008
10     1955007
11     518410
12     149945
13     44616
14     13740
15     4476
16     1542
17     536
18     167
19     71
20     29
21     16
22     10
23     3
24     2
25     3
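A hop-count table like this one can be summarized directly. The sketch below computes the table's mean hop count and the smallest number of hops that covers a given fraction of the listed nodes. How the table was produced (for example, from a single start node) is not stated, so its mean need not equal the network-wide average quoted on the slide.

```python
hops_to_nodes = {
    1: 10, 2: 78, 3: 396, 4: 8648, 5: 3299252, 6: 28395849, 7: 79059497,
    8: 52995778, 9: 10321008, 10: 1955007, 11: 518410, 12: 149945,
    13: 44616, 14: 13740, 15: 4476, 16: 1542, 17: 536, 18: 167, 19: 71,
    20: 29, 21: 16, 22: 10, 23: 3, 24: 2, 25: 3,
}

total = sum(hops_to_nodes.values())
mean_hops = sum(h * n for h, n in hops_to_nodes.items()) / total


def hops_covering(fraction):
    """Smallest hop count h such that at least `fraction` of nodes are within h hops."""
    reached = 0
    for h in sorted(hops_to_nodes):
        reached += hops_to_nodes[h]
        if reached >= fraction * total:
            return h
    return max(hops_to_nodes)


print(f"nodes listed: {total:,}")
print(f"mean hops in this table: {mean_hops:.2f}")
print(f"hops needed to cover 90% of listed nodes: {hops_covering(0.9)}")
```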

43

Network: Searchability

Milgram’s experiment showed:
(1) short paths exist in networks
(2) humans are able to find them

Assume the following setting:
• Nodes are scattered on a plane
• Given a starting node u, we want to reach a target node v
• Algorithm: always navigate to the neighbor that is geographically closest to target node v

Surprise: Geo-routing finds the short paths (for an appropriate distance measure)

[Figure: start node u and target node v]
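The greedy rule described on this slide can be sketched as below. This is an illustration only: Euclidean distance on the plane is an assumption (the slide only says "appropriate distance measure"), and the function name, the `max_hops` guard and the toy graph are hypothetical.

```python
import math


def greedy_geo_route(positions, neighbors, source, target, max_hops=1000):
    """Route from source to target by always stepping to the neighbor closest to target.

    Returns the path as a list of nodes, or None if routing gets stuck
    (no neighbor is strictly closer to the target than the current node).
    """
    def dist(a, b):
        return math.dist(positions[a], positions[b])

    path = [source]
    current = source
    while current != target and len(path) <= max_hops:
        best = min(neighbors[current], key=lambda n: dist(n, target), default=None)
        if best is None or dist(best, target) >= dist(current, target):
            return None          # local minimum: greedy routing fails here
        path.append(best)
        current = best
    return path if current == target else None


positions = {"u": (0, 0), "a": (1, 0), "b": (0, 3), "c": (2, 1), "v": (3, 1)}
neighbors = {
    "u": ["a", "b"],
    "a": ["u", "c"],
    "b": ["u"],
    "c": ["a", "v"],
    "v": ["c"],
}
path = greedy_geo_route(positions, neighbors, "u", "v")
print(path)
```

Requiring each step to get strictly closer to the target guarantees termination, at the cost of giving up when the walk reaches a node with no closer neighbor.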

44

Communication network: Clustering

How many triangles are closed?
• Clustering normally decays as k^(-1)
• Communication network is highly clustered: k^(-0.37)

[Figure labels: High clustering, Low clustering]
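One common way to measure how clustering varies with degree k is to average the local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected) over all nodes of each degree. The sketch below assumes that definition; the slide does not say which estimator was used, and `clustering_by_degree` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def clustering_by_degree(adj):
    """Average local clustering coefficient for each degree k.

    `adj` maps every node to the set of its neighbors (undirected graph).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue                      # coefficient undefined for k < 2
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        sums[k] += closed / (k * (k - 1) / 2)
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


# triangle 1-2-3 with an extra node 4 attached to node 1
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}
result = clustering_by_degree(adj)
print(result)
```

Plotting these averages against k on log-log axes is how a decay exponent such as the one on this slide would typically be read off.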

45

Communication Network Connectivity

46

k-Cores decomposition

What is the structure of the core of the network?

47

k-Cores: core of the network

• People with k<20 are the periphery
• Core is composed of 79 people, each having 68 edges among them
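A k-core is what remains after repeatedly deleting nodes with fewer than k neighbors. A minimal sketch of the standard peeling procedure, written for clarity rather than for a graph of this size (the function name is hypothetical and the quadratic-time minimum search would not scale):

```python
def core_numbers(adj):
    """Core number of every node, by repeatedly peeling a minimum-degree node.

    `adj` maps every node to the set of its neighbors (undirected graph).
    """
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])          # core number never decreases while peeling
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


# triangle 1-2-3 with a pendant node 4 attached to node 1
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
core = core_numbers(adj)
print(core)
```

The nodes with the largest core number form the innermost core.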

48

Network robustness

We delete nodes (in some order) and observe how the network falls apart:
• Number of edges deleted
• Size of largest connected component
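The experiment can be sketched on a toy graph as follows. Only the size of the largest connected component is tracked here, and recomputing components from scratch after every deletion is a simplification that would not scale to the real network. The function names and the two deletion orders are illustrative assumptions.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                      # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, size)
    return best


def robustness_curve(adj, order):
    """Largest-component size after deleting the first i nodes of `order`, for each i."""
    removed, curve = set(), [largest_component_size(adj, set())]
    for node in order:
        removed.add(node)
        curve.append(largest_component_size(adj, removed))
    return curve


# star graph: hub 0 connected to leaves 1..4
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
by_degree = sorted(adj, key=lambda n: len(adj[n]), reverse=True)   # hub first
random_order = random.Random(0).sample(sorted(adj), len(adj))      # fixed seed
targeted = robustness_curve(adj, by_degree)
random_curve = robustness_curve(adj, random_order)
print(targeted)
```

Comparing curves for different deletion orders shows how much the outcome depends on which nodes are removed first.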

49

Robustness: Nodes vs. Edges

50

Robustness: Connectivity

51

Conclusion

• A first look at a planetary-scale social network
• The largest social network analyzed
• Strong presence of homophily: people who communicate share attributes
• Well connected: in only a few hops one can reach most of the network
• Very robust: many (random) people can be removed and the network is still connected

52

References

Leskovec and Horvitz: Worldwide Buzz: Planetary-Scale Views on an Instant-Messaging Network, 2007

http://www.cs.cmu.edu/~jure