School of Computer Science, Carnegie Mellon University
Athens University of Economics & Business
Patterns amongst Competing Task Frequencies:
Super-Linearities, & the Almond-DG model
Danai Koutra, B. Aditya Prakash
Vasileios Koutras, Christos Faloutsos
PAKDD, 15-17 April 2013, Gold Coast, Australia
© Danai Koutra (CMU) - PAKDD'13 2CMU AUEB
Questions we answer (1)
Patterns: If Bob executes task x n_x times, how many times does he execute task y?
Modeling: Which 2-d distribution fits 2-d clouds of points?
[Figure: 2-d cloud of points, one per user, e.g., ‘Smith’ (100 calls, 700 sms)]

Let’s peek... at our contributions
Patterns:
• power laws between competing tasks
• log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios
[Figure: scatter plot with axes ln(comments) and ln(tweets)]

Roadmap
Data
Observed Patterns
Related Work
Proposed Distribution
Goodness of Fit
Conclusions

Data 1: Tencent Weibo
• micro-blogging website in China
• 2.2 million users
• Tasks extracted: Tweets, Retweets, Comments, Mentions, Followees

Data 2: Phonecall Dataset
• phone-call records
• 3.1 million users
• Tasks extracted: Calls, Messages, Voice friends, SMS friends, Total minutes of phonecalls

Pattern 1 - SuRF: Super Linear Relative Frequency (1)
[Figure: scatter plot with axes ln(retweets) and ln(tweets); highlighted point ‘Smith’ (1100 retweets, 7 tweets); value 0.23 annotated on the fit]
Logarithmic Binning Fit [Akoglu’10]
• 15 log buckets
• E[Y|X=x] per bucket
• linear regression on conditional means
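The three steps of the logarithmic binning fit can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `log_binning_fit`, the choice of equal-width buckets in ln(x), the use of the bucket midpoint as the x-coordinate, and the dropping of zero counts are all assumptions made here.

```python
import math

def log_binning_fit(xs, ys, n_buckets=15):
    """Return (slope, intercept) of ln(mean y per bucket) against ln(x)."""
    # Assumption: pairs with a zero count are dropped, since ln(0) is undefined.
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0 and y > 0]
    log_lo = math.log(min(x for x, _ in pairs))
    log_hi = math.log(max(x for x, _ in pairs))
    width = (log_hi - log_lo) / n_buckets
    if width <= 0:
        raise ValueError("need at least two distinct positive x values")
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for x, y in pairs:
        # Equal-width buckets in ln(x); the maximum goes into the last bucket.
        b = min(int((math.log(x) - log_lo) / width), n_buckets - 1)
        sums[b] += y
        counts[b] += 1
    # One point per non-empty bucket: (bucket midpoint in ln x, ln E[Y | bucket]).
    pts = [(log_lo + (b + 0.5) * width, math.log(sums[b] / counts[b]))
           for b in range(n_buckets) if counts[b] > 0]
    # Ordinary least squares on the conditional means.
    n = len(pts)
    mean_x = sum(px for px, _ in pts) / n
    mean_y = sum(py for _, py in pts) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in pts)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Fitting on conditional means rather than on raw points keeps the many low-activity users from dominating the regression, which is the usual motivation for binning heavy-tailed data first.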

Pattern 1 – SuRF (2)
[Figure: scatter plot with axes ln(comments) and ln(tweets); value 0.304 annotated on the fit]
Corr coeff: ++
Intuition: 2x tweets, 4x comments

Pattern 1 – SuRF (3)
[Figure: scatter plot with axes ln(mentions) and ln(tweets); value 0.33 annotated on the fit]
Corr coeff: ++

Pattern 1 – SuRF (4)
[Figure: scatter plot with axes ln(retweets) and ln(followees); value 0.25 annotated on the fit]
Corr coeff: ++

Pattern 1 – SuRF (5)
Super-linear relationship: more calls, even more minutes
[Figure: scatter plot with axes ln(calls_no) and ln(total_mins); value 1.18 annotated on the fit]
Corr coeff: ++

Pattern 1 – SuRF (6a)
[Figure: scatter plot with axes ln(calls_no) and ln(voice_friends)]
2x friends, 3x phonecalls

Pattern 1 – SuRF (6b)
Telemarketers?
[Figure: scatter plot with axes ln(calls_no) and ln(voice_friends)]

Pattern 1 – SuRF (7)
[Figure: scatter plot with axes ln(sms_no) and ln(sms_friends)]
2x friends, 5x sms
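The "2x ..., Nx ..." intuitions on the SuRF slides can be read as power-law exponents: if y is proportional to x^s, multiplying x by a factor multiplies y by that factor raised to s. The snippet below is only a back-of-the-envelope sketch under that pure power-law assumption; `implied_exponent` is a hypothetical helper, and the exponents it gives are implied by the rounded intuitions, not fitted values from the slides.

```python
import math

def implied_exponent(x_factor, y_factor):
    """Exponent s such that y ~ x**s turns an x_factor change into a y_factor change."""
    return math.log(y_factor) / math.log(x_factor)

# Intuitions quoted on the SuRF slides.
for label, xf, yf in [("tweets -> comments", 2, 4),
                      ("voice friends -> phonecalls", 2, 3),
                      ("sms friends -> sms", 2, 5)]:
    print(f"{label}: s = {implied_exponent(xf, yf):.2f}")
```

Any exponent above 1 means the second task grows faster than the first, which is what "super-linear" refers to.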

Contributions revisited (1)
Patterns:
• power laws between competing tasks
• log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios.
[Figure: scatter plot with axes ln(comments) and ln(tweets)]

Pattern 2: log-logistic marginals (1)
NOT power law
[Figure: marginal PDF with axes ln(retweets) and ln(frequency)]

Pattern 2: log-logistic marginals (2)
NOT power law
[Figure: marginal PDF with axes ln(comments) and ln(frequency)]

Pattern 2: log-logistic marginals (3)
power law
[Figure: marginal PDF with axes ln(mentions) and ln(frequency)]

Contributions revisited (2)
Patterns: We observe
• power law relationships between competing tasks
• log-logistic distributions for many tasks
Modeling: We propose the Almond-DG distribution for fitting 2-d real world datasets
Practical Use: spot outliers; what-if scenarios.

Roadmap
Data
Observed Patterns
Proposed Distribution
Problem Definition
Almond-DG
Background: copulas
Goodness of Fit
Conclusions

Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures
(a) the marginals
(b) the dependency
[Figure: 2-d cloud of points]

Solutions in the Literature?
• Multivariate Logistic [Malik & Abraham, 1973]
• Multivariate Pareto Distribution [Mardia, 1962]
• Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks
BUT none of them captures the 2-d marginals AND the dependency / correlation!!!

Roadmap
Related Work
Data
Observed Patterns
Proposed Distribution
Problem Definition
Almond-DG
Background: copulas
Goodness of Fit
Conclusions

STEP 1: How to model the marginal distributions?
• A: Log-logistic!
• Q: Why?
• A: Because it
  • mimics Pareto
  • captures the top concavity
  • matches reality
[Figure: marginal PDF with axes ln(retweets) and ln(frequency)]

Reminder: Log-logistic (1)
BACKGROUND
• The longer you survive the disease, the even longer you survive
• Not memoryless
• 2 parameters: scale (α) and shape (β)
[Figure: log-logistic curves for α=1 and varying β]
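For reference, the standard two-parameter log-logistic with scale α and shape β can be written down directly. The sketch below uses the textbook parameterization, which is assumed (not confirmed by the slides) to be the one the authors use; the function names are illustrative.

```python
import math

def loglogistic_cdf(x, alpha, beta):
    """Prob(X <= x) for scale alpha > 0 and shape beta > 0, with x > 0."""
    return 1.0 / (1.0 + (x / alpha) ** (-beta))

def loglogistic_pdf(x, alpha, beta):
    """Density: derivative of the CDF above."""
    z = (x / alpha) ** beta
    return (beta / x) * z / (1.0 + z) ** 2

def loglogistic_ppf(u, alpha, beta):
    """Inverse CDF, usable for inverse-transform sampling with 0 < u < 1."""
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)
```

In this form the scale α is the median: `loglogistic_cdf(alpha, alpha, beta)` is 0.5 for any β.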

Reminder: Log-logistic (2a)
BACKGROUND
• In log-log scales, looks like hyperbola
[Figure: log-logistic curves in log-log scales for α=1 and varying β]

Reminder: Log-logistic (2b)
BACKGROUND
• In log-log scales, looks like hyperbola
• Blank out the top concavity - power law
[Figure: log-logistic curves in log-log scales for α=1 and varying β]

Fact: Log-logistic (3)
BACKGROUND
• linear log-odd plots, where odds = Prob(X<=x) / Prob(X>x)
[Figure: ln(odds) against ln(mentions), real data vs. theory; α = 2.07, β = 1.27]
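Why the log-odd plot is a straight line follows in one step from the log-logistic CDF. The derivation below assumes the standard parameterization with scale α and shape β:

```latex
F(x) = \frac{1}{1 + (x/\alpha)^{-\beta}}
\quad\Longrightarrow\quad
\frac{\Pr(X \le x)}{\Pr(X > x)} = \frac{F(x)}{1 - F(x)} = \left(\frac{x}{\alpha}\right)^{\beta}
\quad\Longrightarrow\quad
\ln(\text{odds}) = \beta \ln x - \beta \ln \alpha .
```

Under this parameterization the slope of the line is β and the intercept is −β ln α, so both parameters can be read off the plot.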

Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures
(a) the marginals
(b) the dependency
[Figure: 2-d cloud of points, marked ✔ ✔]

STEP 2: How to model the dependency?
• A: we borrow an idea from survival models, financial risk management, decision analysis: COPULA!

Copulas in a nutshell
BACKGROUND
• Model dependence between r.v.’s (e.g., X = # of one task, Y = # of another task)
• Create multivariate distribution s.t.:
  • the marginals are preserved
  • the correlation (+, -, none) is captured
[Figure: 2-d cloud of points]
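The "marginals are preserved" property is the content of Sklar's theorem, stated here for two variables as background. The symbols H, C, F_X and F_Y are notation introduced for this note:

```latex
H(x, y) = \Pr(X \le x,\; Y \le y) = C\bigl(F_X(x),\, F_Y(y)\bigr),
\qquad C : [0,1]^2 \to [0,1],
```

where F_X and F_Y are the marginal CDFs and the copula C carries only the dependence. Choosing the marginals and choosing C are therefore two independent modeling decisions.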

STEP 2: Which copula?
• A: among the many copulas
  • Blah
  • Blah
  • Blah
  • Gumbel’s copula

Applications of Gumbel’s copula
BACKGROUND
Modeling of:
• the dependence between loss and lawyer’s fees in order to calculate reinsurance premiums
• the rainfall frequency as a joint distribution of volume, peak, duration etc.
• …

Gumbel’s copula: Example 1
BACKGROUND
• Uniform marginals
• No dependence
[Figure: 2-d cloud of points]

Gumbel’s copula: Example 2
BACKGROUND
• Skewed marginals
• No correlation
[Figure: 2-d cloud of points]

Gumbel’s copula: Example 3
BACKGROUND
• Skewed marginals
• ρ = 0.7
[Figure: 2-d cloud of points]
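The examples above can be related to the usual closed form of Gumbel's copula, C(u,v) = exp(-[(-ln u)^θ + (-ln v)^θ]^(1/θ)) with θ >= 1. The sketch below uses that textbook form; the function name is illustrative and not taken from the slides.

```python
import math

def gumbel_copula_cdf(u, v, theta):
    """Gumbel copula C(u, v) for 0 < u, v < 1 and dependence parameter theta >= 1."""
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))
```

With θ = 1 the expression reduces to u·v, the independence case of Examples 1 and 2; larger θ moves C(u,v) towards min(u,v), i.e., stronger positive dependence as in Example 3.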

Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures
(a) the marginals
(b) the dependency
[Figure: 2-d cloud of points, marked ✔ ✔]

Proposed Continuous Distribution: ALMOND
where θ = (1 – ρ)^(-1) captures the dependence and ρ = Spearman’s coefficient
[Figure: example plots for ρ=0, ρ=0.4, ρ=0.7 and for ρ=0, ρ=0.2, ρ=0.7; α = ?, β = ?]
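One plausible way to write the ALMOND joint CDF, assuming it is Gumbel's copula applied to two log-logistic marginals as the two modeling steps suggest, is shown below. The per-axis subscripts on α and β are notation introduced here, and the exact form used by the authors should be checked against the paper:

```latex
F_{X,Y}(x, y) = \exp\!\Bigl(-\bigl[(-\ln F_X(x))^{\theta} + (-\ln F_Y(y))^{\theta}\bigr]^{1/\theta}\Bigr),
\qquad
F_X(x) = \frac{1}{1 + (x/\alpha_x)^{-\beta_x}},
\quad
F_Y(y) = \frac{1}{1 + (y/\alpha_y)^{-\beta_y}},
\quad
\theta = (1 - \rho)^{-1} \ge 1 .
```

For ρ = 0 this gives θ = 1 and the joint CDF factorizes into F_X(x)·F_Y(y), i.e., independent log-logistic marginals.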

Proposed Discrete Distribution: ALMOND-DG
If (X,Y) ~ ALMOND, then (floor(X), floor(Y)) ~ ALMOND-DG, where X>=1 and Y>=1,
i.e., we discretize the values of ALMOND, and reject the pairs with either X=0 or Y=0.
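The discretize-and-reject rule can be sketched end to end. Everything below is an illustrative assumption rather than the authors' generator: ALMOND is taken to be Gumbel's copula over log-logistic marginals with θ = 1/(1 - ρ), the copula is sampled with the Marshall-Olkin construction using Kanter's positive-stable sampler (a technique swapped in here), and all function names are hypothetical.

```python
import math
import random

def _open_unit(rng):
    """Uniform draw from the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def _positive_stable(a, rng):
    """Kanter's sampler: positive stable variable with Laplace transform exp(-s**a), 0 < a < 1."""
    t = math.pi * _open_unit(rng)
    w = -math.log(_open_unit(rng))
    return (math.sin(a * t) / math.sin(t) ** (1.0 / a)
            * (math.sin((1.0 - a) * t) / w) ** ((1.0 - a) / a))

def sample_gumbel_copula(theta, rng):
    """One (u, v) pair from Gumbel's copula (Marshall-Olkin construction)."""
    e1 = -math.log(_open_unit(rng))
    e2 = -math.log(_open_unit(rng))
    if theta == 1.0:  # independence: no shared random factor
        return math.exp(-e1), math.exp(-e2)
    shared = _positive_stable(1.0 / theta, rng)
    return (math.exp(-(e1 / shared) ** (1.0 / theta)),
            math.exp(-(e2 / shared) ** (1.0 / theta)))

def loglogistic_ppf(u, alpha, beta):
    """Inverse log-logistic CDF; u is clamped away from 0 and 1 for numerical safety."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)

def sample_almond_dg(n, rho, ax, bx, ay, by, seed=0):
    """Draw n integer pairs: floor the continuous sample, reject pairs with X=0 or Y=0."""
    rng = random.Random(seed)
    theta = 1.0 / (1.0 - rho)  # mapping stated on the ALMOND slide; requires 0 <= rho < 1
    out = []
    while len(out) < n:
        u, v = sample_gumbel_copula(theta, rng)
        x = math.floor(loglogistic_ppf(u, ax, bx))
        y = math.floor(loglogistic_ppf(v, ay, by))
        if x >= 1 and y >= 1:
            out.append((x, y))
    return out
```

For example, `sample_almond_dg(1000, 0.7, 2.07, 1.27, 2.07, 1.27)` draws 1000 pairs using the α and β reported for mentions on both axes, purely as placeholder values.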

Contributions revisited (3)
Patterns: We observe
• power laws between competing tasks
• log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios.

Roadmap
Related Work
Data
Observed Patterns
Proposed Distribution
Goodness of Fit
Conclusions

Synthetic Data Generation
Parameter Estimation
• Traditionally: MLE, MOM (log-logistic)
• Proposed: log-odd plot, 2 parameters (intercept + slope of the line)
• Copula-based generation, 1 parameter (dependence θ)
Evaluation is hard even for 1-d skewed distributions!!! [Chakrabarti, 2006]
[Figure: log-odd plot with axes ln(mentions) and ln(odds)]
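The log-odd plot estimate can be sketched as a least-squares line through the empirical log-odds. This is an illustration under the standard log-logistic parameterization, where ln(odds) = β ln x - β ln α; the function name and the handling of ties and of the largest sample are choices made here, not details from the slides.

```python
import math

def fit_loglogistic_logodds(samples):
    """Estimate (alpha, beta) from the line ln(odds) = beta*ln(x) - beta*ln(alpha)."""
    xs = sorted(samples)  # all samples must be > 0
    n = len(xs)
    pts = []
    for i in range(n - 1):            # the largest sample is skipped: its odds are infinite
        if xs[i + 1] == xs[i]:
            continue                  # for tied values keep only the last occurrence
        p = (i + 1) / n               # empirical Prob(X <= x)
        pts.append((math.log(xs[i]), math.log(p / (1.0 - p))))
    m = len(pts)
    mean_x = sum(px for px, _ in pts) / m
    mean_y = sum(py for _, py in pts) / m
    sxx = sum((px - mean_x) ** 2 for px, _ in pts)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    beta = slope
    alpha = math.exp(-intercept / slope)
    return alpha, beta
```

The dependence parameter is then the only quantity left to set for copula-based generation, via θ = 1/(1 - ρ) as stated on the ALMOND slide.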

Goodness of Fit (1a)
Real data - Synthetic data
[Figure: marginal PDFs, ln(frequency) against ln(comments) and ln(frequency) against ln(mentions)]
1 ✔

Goodness of Fit (1b)
[Figure: contour plots and conditional means (SuRF) for real data and synthetic data; axes ln(comments) and ln(mentions)]
2 ✔ 3 ✔

Goodness of Fit (2a)
Real data - Synthetic data
[Figure: marginal PDFs, ln(frequency) against ln(retweets) and ln(frequency) against ln(tweets)]
1 ✔

Goodness of Fit (2b)
[Figure: contour plots and conditional means (SuRF) for real data and synthetic data; axes ln(retweets) and ln(tweets)]
2 ✔ 3 ✔

Conclusions
Patterns: Discovery of
• power law between competing tasks
• log-logistic distributions for many tasks
Modeling: Almond-DG, that explains (i) super-linearity, (ii) marginals and (iii) conditionals in real 2-d data
Practical Use: anomaly detection; what-if scenarios.
[Figure: scatter plot with axes ln(comments) and ln(tweets)]
Backup slides
• Likely question areas: ideas glossed over, shortcomings of methods or results, and future work
• Why Gumbel? It fits, it has been used in the past, and it is parsimonious (θ, α, β)