Understanding email traffic
David Graus, University of Amsterdam [email protected] @dvdgrs
2
3
Recipient recommendation
• Given a sender, an email, and all possible recipients (in an enterprise),
• predict which recipient(s) are most likely to receive the email.
4
Why?
• Understanding communication in, and the structure of, an enterprise
• Applications in:
  • enterprise search
  • expert finding
  • community detection
  • spam classification
  • anomaly detection
5
How?
• Gmail
  • Whom do you frequently “co-address”?
  • Ego network
• Related work
  • Social Network Analysis (SNA)
  • Email content
• Us
  • SNA + email content
7
image by Calvinius - Creative Commons Attribution-Share Alike 3.0
8
SNA for predicting recipients?
1. Importance of a node in the network
   More important people are more likely to be the recipient of an email.
2. Strength of connection between two nodes
   Given the sender of the email, recipients whom the sender addresses frequently are more likely to be the recipient.
9
SNA for predicting recipients?
1. Importance of a node in the network
   1. Number of received emails
   2. PageRank score of the node
2. Strength of connection between two nodes
   1. Number of emails sent between the nodes
   2. Number of times two nodes are addressed together
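The four signals listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the talk: the toy `emails` list of `(sender, recipients)` tuples, the function names, the damping factor of 0.85, and the choice to weight PageRank edges by email count are all made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (sender, [recipients]) per email.
emails = [
    ("ann", ["bob", "cat"]),
    ("ann", ["bob"]),
    ("bob", ["ann"]),
    ("cat", ["bob", "ann"]),
]

def received_counts(emails):
    """Importance 1: number of emails each node received."""
    return Counter(r for _, recipients in emails for r in recipients)

def sent_counts(emails):
    """Strength 1: number of emails sent from one node to another."""
    return Counter((s, r) for s, recipients in emails for r in recipients)

def coaddress_counts(emails):
    """Strength 2: number of times two nodes are addressed together."""
    pairs = Counter()
    for _, recipients in emails:
        for a, b in combinations(sorted(set(recipients)), 2):
            pairs[(a, b)] += 1
    return pairs

def pagerank(emails, damping=0.85, iterations=50):
    """Importance 2: PageRank by power iteration on the sender -> recipient graph."""
    edges = sent_counts(emails)
    nodes = sorted({n for pair in edges for n in pair})
    out_weight = Counter()
    for (s, _), w in edges.items():
        out_weight[s] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes that never sent mail is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for (s, r), w in edges.items():
            new_rank[r] += damping * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank
```

On the toy data, `received_counts(emails)["bob"]` is 3 and `coaddress_counts(emails)[("bob", "cat")]` is 1. The PageRank scores sum to 1, so they can be compared across nodes.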
10
Part 2: Email content
• Statistical Language Models (LMs)
  • Assign a probability to a sequence of words
  • Compute models for different corpora
• Used in lots of places:
  • Information Retrieval
  • Machine Translation
  • Speech Recognition
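To make "assign a probability to a sequence of words" concrete, here is a minimal sketch of one simple kind of language model. The slides do not say which model or smoothing was used, so the unigram model, the add-alpha smoothing, the function names, and the two tiny corpora are assumptions for illustration only.

```python
import math
from collections import Counter

def train_unigram_lm(texts):
    """Count word occurrences over a corpus (a list of strings)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def log_probability(words, counts, vocab_size, alpha=1.0):
    """Log-probability of a word sequence under an add-alpha smoothed unigram LM."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab_size))
        for w in words
    )

# Hypothetical corpora: two tiny sets of emails.
work = train_unigram_lm(["please review the budget report", "the meeting is at noon"])
home = train_unigram_lm(["dinner at home tonight", "see you at the game"])
vocab_size = len(set(work) | set(home))

query = "budget meeting".split()
print(log_probability(query, work, vocab_size) > log_probability(query, home, vocab_size))
```

The comparison prints `True`: the model computed from the "work" corpus gives the query a higher probability than the model computed from the "home" corpus, which is the sense in which different corpora yield different models.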
16
Language Models
• Language models as communication “profiles”
  1. Incoming LM (how people talk to the user)
  2. Outgoing LM (how the user talks to people)
  3. Interpersonal LM (how node1 talks with node2)
  4. Corpus LM (how everyone talks)
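The four profiles can be collected in a single pass over the emails. The sketch below is a hedged illustration: the `(sender, recipients, body)` tuple layout and the function name are hypothetical, word counts stand in for a full language model, and treating the interpersonal model as one shared model per unordered pair of nodes is an assumption about what "talks with" means.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (sender, [recipients], body) per email.
emails = [
    ("ann", ["bob"], "budget report attached"),
    ("bob", ["ann", "cat"], "thanks for the report"),
    ("cat", ["ann"], "lunch tomorrow"),
]

def build_profiles(emails):
    incoming = defaultdict(Counter)       # how people talk to a user
    outgoing = defaultdict(Counter)       # how a user talks to people
    interpersonal = defaultdict(Counter)  # how two nodes talk with each other
    corpus = Counter()                    # how everyone talks
    for sender, recipients, body in emails:
        words = body.lower().split()
        outgoing[sender].update(words)
        corpus.update(words)
        for recipient in recipients:
            incoming[recipient].update(words)
            # Unordered pair: mail in both directions feeds one model.
            interpersonal[frozenset((sender, recipient))].update(words)
    return incoming, outgoing, interpersonal, corpus

incoming, outgoing, interpersonal, corpus = build_profiles(emails)
```

With this data, the ann/bob interpersonal counts contain "report" twice, once from each direction of the conversation.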
17
Why language models?
• Comparisons between communication profiles:
  • Find nodes with the most similar communication
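One way to compare two profiles is a divergence between their word distributions. The slides do not name the similarity measure, so the Jensen-Shannon divergence used here is an assumption, as are the function names; profiles are plain word counts and are assumed to be non-empty.

```python
import math
from collections import Counter

def to_distribution(counts, vocab):
    """Relative word frequencies over a shared vocabulary."""
    total = sum(counts.values())
    return {w: counts.get(w, 0) / total for w in vocab}

def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) between two word-count profiles.

    0 means identical word distributions, 1 means no shared words.
    """
    vocab = set(counts_a) | set(counts_b)
    p = to_distribution(counts_a, vocab)
    q = to_distribution(counts_b, vocab)
    divergence = 0.0
    for w in vocab:
        m = 0.5 * (p[w] + q[w])
        if p[w] > 0:
            divergence += 0.5 * p[w] * math.log2(p[w] / m)
        if q[w] > 0:
            divergence += 0.5 * q[w] * math.log2(q[w] / m)
    return divergence

def most_similar(target, profiles):
    """Name of the profile closest to `target` (lowest divergence)."""
    return min(profiles, key=lambda name: js_divergence(target, profiles[name]))
```

For example, a target profile built from "budget report budget" is closer to a profile built from "budget report" than to one built from "lunch tomorrow", so `most_similar` returns the former.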
18
SNA
1. Importance of a node in the network
2. Strength of connection between nodes

Email Content
1. Incoming LM
2. Outgoing LM
3. Interpersonal LM
4. Corpus-based LM
19
Approach: time-based
t=0: 1 email, 2 addresses
t=1: 2 emails, 2 addresses
t=2: 3 emails, 4 addresses
t=3: 4 emails, 5 addresses
etc.
t=n: 607,011 emails, 2,068 addresses
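The time-based view amounts to replaying the emails in chronological order and letting the network grow one email at a time. A minimal sketch, assuming a hypothetical `(timestamp, sender, recipients)` tuple per email and toy data chosen to reproduce the first four steps listed above:

```python
def network_growth(emails):
    """Yield (emails_seen, addresses_seen) after each email, in time order."""
    addresses = set()
    ordered = sorted(emails, key=lambda email: email[0])
    for count, (_, sender, recipients) in enumerate(ordered, start=1):
        addresses.add(sender)
        addresses.update(recipients)
        yield count, len(addresses)

# Hypothetical data: (timestamp, sender, [recipients]).
emails = [
    (0, "a", ["b"]),
    (1, "b", ["a"]),
    (2, "a", ["c", "d"]),
    (3, "d", ["e"]),
]
print(list(network_growth(emails)))
# [(1, 2), (2, 2), (3, 4), (4, 5)]
```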
20
At some time interval t
• Given the email, sender, and network
• Remove recipients from the email
• Rank all nodes in the network
• By computing for each candidate (recipient) node:
  1. Importance of the candidate
  2. Strength of connection between sender and candidate
  3. Similarity between sender and candidate LMs
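The ranking step can be sketched as a scoring function over all known nodes. Everything specific here is an assumption made for illustration: the equal-weight linear combination, the normalization by history size, the use of cosine similarity, and comparing the words of the new email with what each candidate has received so far (a stand-in for the sender/candidate LM comparison, whose exact form the slides do not give).

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_candidates(sender, body, history, weights=(1.0, 1.0, 1.0)):
    """Rank every known node as a recipient of a new email from `sender`.

    `history` holds earlier emails as (sender, [recipients], body) tuples.
    """
    received = Counter()             # importance of candidate
    sent = Counter()                 # strength of connection
    incoming = defaultdict(Counter)  # words each node has received
    nodes = set()
    for s, recipients, text in history:
        nodes.add(s)
        for r in recipients:
            nodes.add(r)
            received[r] += 1
            sent[(s, r)] += 1
            incoming[r].update(text.lower().split())
    query = Counter(body.lower().split())
    total = max(len(history), 1)
    w_imp, w_str, w_lm = weights
    scores = {
        c: w_imp * received[c] / total
           + w_str * sent[(sender, c)] / total
           + w_lm * cosine(query, incoming[c])
        for c in nodes if c != sender
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical history, then a new email whose recipients were removed.
history = [
    ("ann", ["bob"], "budget report"),
    ("ann", ["bob"], "budget numbers"),
    ("cat", ["dan"], "lunch tomorrow"),
]
ranking = rank_candidates("ann", "budget update", history)
```

Here `ranking` puts "bob" first: he has received the most mail, ann has written to him before, and the new email shares vocabulary with what he usually receives. Setting a weight to zero switches off the corresponding signal, which is one way to compare SNA-only and content-only rankings.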
21
22
Findings: what works for predicting recipients?
• Importance of node: number of received emails of the node
• Strength of connection: number of emails between the nodes
• LM similarity: the interpersonal LM is most important
23
Findings: SNA vs email content
• SNA:
  • SNA signals deteriorate over time
  • SNA signals are most informative on highly active users
• Email content:
  • LM signal improves over time
  • LM signal does worse with highly active users
24
Finally
• Combining Social Network Analysis with Language Modeling is better than doing either alone.
25
Why for E-Discovery
• Anomaly detection
  • Given a working prediction model, identify “unexpected” communication
• Language models for communication
  • For a node, find the most different interpersonal communication
    • Friends/family vs. colleagues?
  • Find communication that differs from the corpus-based communication