School of Computer Science, Carnegie Mellon University
Athens University of Economics & Business

Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model

Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos

PAKDD, 15-17 April 2013, Gold Coast, Australia

Questions we answer (1)

Patterns: If Bob executes task x n_x times, how many times does he execute task y?

Modeling: Which 2-d distribution fits 2-d clouds of points?

[Figure: scatter plot, # of one task vs. # of another per user (task icons lost in extraction); example point: ‘Smith’ (100 calls, 700 sms)]

Questions we answer (2)

Patterns: If Bob executes task x n_x times, how many times does he execute task y?

Modeling: Which 2-d distribution fits 2-d clouds of points?

[Figure: scatter plot, # of one task vs. # of another per user (task icons lost in extraction)]

Let’s peek... at our contributions

Patterns:
• power laws between competing tasks
• log-logistic distributions for many tasks

Modeling: Almond-DG distribution for 2-d real datasets

Practical Use: spot outliers; what-if scenarios

[Figure: ln(tweets) (y-axis) vs. ln(comments) (x-axis)]

Roadmap

Data

Observed Patterns

Related Work

Proposed Distribution

Goodness of Fit

Conclusions

Data 1: Tencent Weibo
• micro-blogging website in China
• 2.2 million users
• Tasks extracted: Tweets, Retweets, Comments, Mentions, Followees

Data 2: Phonecall Dataset
• phone-call records
• 3.1 million users
• Tasks extracted: Calls, Messages, Voice friends, SMS friends, Total minutes of phonecalls

Roadmap

Data

Observed Patterns

Related Work

Proposed Distribution

Goodness of Fit

Conclusions

Pattern 1 - SuRF: Super Linear Relative Frequency (1)

[Figure: ln(tweets) (y-axis) vs. ln(retweets) (x-axis); example point: ‘Smith’ (1100 retweets, 7 tweets); value annotated on the plot: 0.23]

Logarithmic Binning Fit [Akoglu’10]
• 15 log buckets
• E[Y|X=x] per bucket
• linear regression on conditional means
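The logarithmic binning fit listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `log_binning_fit`, the use of the geometric bucket center as the x-coordinate, and the plain least-squares fit are assumptions made for the example.

```python
import math

def log_binning_fit(xs, ys, n_buckets=15):
    """Return (slope, intercept) of ln(E[Y | bucket]) regressed on ln(bucket center).

    Hypothetical helper: buckets are equally wide in ln(x), as in a
    logarithmic binning of the x-axis.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0]
    log_lo = math.log(min(x for x, _ in pairs))
    log_hi = math.log(max(x for x, _ in pairs))
    if log_hi == log_lo:
        raise ValueError("need at least two distinct positive x values")
    width = (log_hi - log_lo) / n_buckets

    # Accumulate the sum and count of y in each logarithmic bucket of x.
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for x, y in pairs:
        b = min(int((math.log(x) - log_lo) / width), n_buckets - 1)
        sums[b] += y
        counts[b] += 1

    # One point per usable bucket: (ln of bucket center, ln of conditional mean).
    points = []
    for b in range(n_buckets):
        if counts[b] > 0 and sums[b] > 0:
            center = log_lo + (b + 0.5) * width
            points.append((center, math.log(sums[b] / counts[b])))
    if len(points) < 2:
        raise ValueError("need at least two non-empty buckets")

    # Ordinary least squares on the bucket points.
    n = len(points)
    mean_u = sum(u for u, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in points)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in points)
    slope = sxy / sxx
    return slope, mean_v - slope * mean_u
```

The slope of the fitted line is the power-law exponent: a slope above 1 means y grows super-linearly in x.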

Pattern 1 – SuRF (2)

[Figure: ln(tweets) (y-axis) vs. ln(comments) (x-axis); value annotated on the plot: 0.304]

Corr coeff: ++
Intuition: 2x tweets, 4x comments

Pattern 1 – SuRF (3)

[Figure: ln(tweets) (y-axis) vs. ln(mentions) (x-axis); value annotated on the plot: 0.33]

Corr coeff: ++
Intuition:

Pattern 1 – SuRF (4)

[Figure: ln(followees) (y-axis) vs. ln(retweets) (x-axis); value annotated on the plot: 0.25]

Corr coeff: ++
Intuition:

Pattern 1 – SuRF (5)

Super-linear relationship: more calls, even more minutes

[Figure: ln(total_mins) (y-axis) vs. ln(calls_no) (x-axis); value annotated on the plot: 1.18]

Corr coeff: ++
Intuition:

Pattern 1 – SuRF (6a)

[Figure: ln(voice_friends) (y-axis) vs. ln(calls_no) (x-axis)]

2x friends, 3x phonecalls

Pattern 1 – SuRF (6b)

Telemarketers?

[Figure: ln(voice_friends) (y-axis) vs. ln(calls_no) (x-axis)]

Pattern 1 – SuRF (7)

[Figure: ln(sms_friends) (y-axis) vs. ln(sms_no) (x-axis)]

2x friends, 5x sms

Contributions revisited (1)

Patterns:
• power laws between competing tasks
• log-logistic distributions for many tasks

Modeling: Almond-DG distribution for 2-d real datasets

Practical Use: spot outliers; what-if scenarios.

[Figure: ln(tweets) (y-axis) vs. ln(comments) (x-axis)]

Pattern 2: log-logistic marginals (1)

NOT power law

[Figure: marginal PDF, ln(frequency) (y-axis) vs. ln(retweets) (x-axis)]

Pattern 2: log-logistic marginals (2)

NOT power law

[Figure: marginal PDF, ln(frequency) (y-axis) vs. ln(comments) (x-axis)]

Pattern 2: log-logistic marginals (3)

power law

[Figure: marginal PDF, ln(frequency) (y-axis) vs. ln(mentions) (x-axis)]

Contributions revisited (2)

Patterns: We observe
• power law relationships between competing tasks
• log-logistic distributions for many tasks

Modeling: We propose the Almond-DG distribution for fitting 2-d real world datasets

Practical Use: spot outliers; what-if scenarios.

Roadmap

Data

Observed Patterns

Proposed Distribution

Problem Definition

Almond-DG

Background: copulas

Goodness of Fit

Conclusions

Problem definition

Given: cloud of points

Find: a 2-d PDF, f(x,y), that captures
(a) the marginals
(b) the dependency

[Figure: 2-d clouds of points, # of one task vs. # of another (task icons lost in extraction)]

Solutions in the Literature?

• Multivariate Logistic [Malik & Abraham, 1973]
• Multivariate Pareto Distribution [Mardia, 1962]
• Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks

Solutions in the Literature?

• Multivariate Logistic [Malik & Abraham, 1973]
• Multivariate Pareto Distribution [Mardia, 1962]
• Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks

BUT none of them captures the 2-d marginals AND the dependency / correlation!!!

Roadmap

Related Work

Data

Observed Patterns

Proposed Distribution

Problem Definition

Almond-DG

Background: copulas

Goodness of Fit

Conclusions

Problem definition

Given: cloud of points

Find: a 2-d PDF, f(x,y), that captures
(a) the marginals
(b) the dependency

[Figure: 2-d clouds of points, # of one task vs. # of another (task icons lost in extraction)]

STEP 1: How to model the marginal distributions?

• A: Log-logistic!
• Q: Why?
• A: Because it
  • mimics Pareto
  • captures the top concavity
  • matches reality

[Figure: marginal PDF, ln(frequency) (y-axis) vs. ln(retweets) (x-axis)]

Reminder: Log-logistic (1)

• The longer you survive the disease, the even longer you survive
• Not memoryless
• 2 parameters: scale (α) and shape (β)

BACKGROUND

[Figure: log-logistic curves; legend "a=1 β=" truncated in extraction]
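As a reminder of what the two parameters do, here is a small sketch of the standard log-logistic CDF and PDF with scale α and shape β. The function names are chosen for this example only; the formulas are the textbook ones, F(x) = 1 / (1 + (x/α)^(-β)) and its derivative.

```python
import math

def loglogistic_cdf(x, alpha, beta):
    """CDF of the log-logistic distribution with scale alpha and shape beta."""
    if x <= 0:
        return 0.0
    return 1.0 / (1.0 + (x / alpha) ** (-beta))

def loglogistic_pdf(x, alpha, beta):
    """PDF of the log-logistic distribution (derivative of the CDF above)."""
    if x <= 0:
        return 0.0
    z = (x / alpha) ** beta
    # (beta/alpha) * (x/alpha)**(beta-1) / (1+z)**2, rewritten with z.
    return (beta / x) * z / (1.0 + z) ** 2
```

The scale α is the median (the CDF equals 1/2 at x = α), and for large x the PDF decays like x^(-β-1), which is the Pareto-like tail mentioned in STEP 1.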

Reminder: Log-logistic (2a)

• In log-log scales, looks like hyperbola

BACKGROUND

[Figure: log-logistic curves in log-log scales; legend "a=1 β=" truncated in extraction]

Reminder: Log-logistic (2b)

• In log-log scales, looks like hyperbola
• Blank out the top concavity: power law

BACKGROUND

[Figure: log-logistic curves in log-log scales; legend "a=1 β=" truncated in extraction]

Fact: Log-logistic (3)

• linear log-odds plots

BACKGROUND

odds(x) = Prob(X<=x) / Prob(X>x)

[Figure: ln(odds) (y-axis) vs. ln(mentions) (x-axis), real data vs. theory; α = 2.07, β = 1.27]
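The linearity follows from the log-logistic CDF: odds(x) = F(x) / (1 - F(x)) = (x/α)^β, so ln(odds) = β ln(x) - β ln(α). A hedged sketch of reading α and β off that line is below. The helper names and the (i + 0.5)/n plotting position for the empirical CDF are assumptions of this example, not details taken from the slides.

```python
import math

def loglogistic_ppf(u, alpha, beta):
    """Inverse CDF of the log-logistic distribution, for 0 < u < 1."""
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)

def fit_loglogistic_log_odds(samples):
    """Estimate (alpha, beta) from the line ln(odds) = beta*ln(x) - beta*ln(alpha)."""
    xs = sorted(x for x in samples if x > 0)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        f = (i + 0.5) / n  # empirical CDF, kept strictly inside (0, 1)
        points.append((math.log(x), math.log(f / (1.0 - f))))

    # Ordinary least squares for slope and intercept of the log-odds line.
    mean_u = sum(u for u, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in points)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in points)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_u

    beta = slope
    alpha = math.exp(-intercept / slope)
    return alpha, beta
```

For example, feeding the function exact quantiles generated with `loglogistic_ppf` for α = 2.07 and β = 1.27 (the values shown on the slide) returns those two parameters.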

Problem definition

Given: cloud of points

Find: a 2-d PDF, f(x,y), that captures
(a) the marginals ✔
(b) the dependency

[Figure: 2-d clouds of points, # of one task vs. # of another (task icons lost in extraction)]

STEP 2: How to model the dependency?

• A: we borrow an idea from survival models, financial risk management, decision analysis

COPULA!

Copulas in a nutshell

BACKGROUND

• Modeling dependence between r.v.’s (e.g., X = # of [task icon], Y = # of [task icon])

Copulas in a nutshell

BACKGROUND

• Model dependence between r.v.’s (e.g., X = # of [task icon], Y = # of [task icon])
• Create multivariate distribution s.t.:
  • the marginals are preserved
  • the correlation (+, -, none) is captured

[Figure: 2-d cloud of points (axis icons lost in extraction)]

STEP 2: Which copula?

• A: among the many copulas
  • Blah
  • Blah
  • Blah
  • Gumbel’s copula

Applications of Gumbel’s copula

BACKGROUND

Modeling of:
• the dependence between loss and lawyer’s fees in order to calculate reinsurance premiums
• the rainfall frequency as a joint distribution of volume, peak, duration etc.
• …

Gumbel’s copula: Example 1

BACKGROUND

• Uniform marginals
• No dependence

[Figure: 2-d cloud of points (axis icons lost in extraction)]

Gumbel’s copula: Example 2

BACKGROUND

• Skewed marginals
• No correlation

[Figure: 2-d cloud of points (axis icons lost in extraction)]

Gumbel’s copula: Example 3

BACKGROUND

• Skewed marginals
• ρ = 0.7

[Figure: 2-d cloud of points (axis icons lost in extraction)]
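For reference, the bivariate Gumbel copula has the standard closed form C(u, v) = exp(-[(-ln u)^θ + (-ln v)^θ]^(1/θ)) with θ >= 1. The sketch below evaluates it; the function name is an assumption of this example. With θ = 1 it reduces to the independence case C(u, v) = u·v, and as θ grows it approaches min(u, v), i.e., perfect positive dependence.

```python
import math

def gumbel_copula_cdf(u, v, theta):
    """Bivariate Gumbel copula C(u, v) for u, v in [0, 1] and theta >= 1."""
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("u and v must lie in [0, 1]")
    if u == 0.0 or v == 0.0:
        return 0.0
    # abs() turns ln(u) <= 0 into -ln(u) >= 0 (and avoids a negative zero at u = 1).
    s = abs(math.log(u)) ** theta + abs(math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))
```

Plugging the marginal CDFs in for u and v gives a joint CDF whose marginals are unchanged, which is the "marginals are preserved" property from the copula slides.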

Problem definition

Given: cloud of points

Find: a 2-d PDF, f(x,y), that captures
(a) the marginals ✔
(b) the dependency ✔

[Figure: 2-d clouds of points, # of one task vs. # of another (task icons lost in extraction)]

Proposed Continuous Distribution: ALMOND

[Equation: not recovered from the slide image]

where θ = (1 – ρ)^-1 captures the dependence, and ρ = Spearman’s coefficient

[Figure: example ALMOND distributions for ρ = 0, 0.4, 0.7 and for ρ = 0, 0.2, 0.7; the α and β values appear as "?" in the extracted text]
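The slide's own formula is not recoverable from the extracted text. As a sketch only, under the assumption that ALMOND is exactly what STEP 1 and STEP 2 describe (log-logistic marginals joined by Gumbel's copula), the joint CDF would take the following form. The subscripted parameters (α_1, β_1) and (α_2, β_2) for the two marginals are notation introduced here.

```latex
% Log-logistic marginals (STEP 1)
F_X(x) = \frac{1}{1 + (x/\alpha_1)^{-\beta_1}}, \qquad
F_Y(y) = \frac{1}{1 + (y/\alpha_2)^{-\beta_2}}

% Gumbel copula applied to the marginals (STEP 2), using
% -\ln F_X(x) = \ln\!\bigl(1 + (x/\alpha_1)^{-\beta_1}\bigr)
F(x, y) = \exp\!\Bigl( -\Bigl[
    \bigl(\ln(1 + (x/\alpha_1)^{-\beta_1})\bigr)^{\theta}
  + \bigl(\ln(1 + (y/\alpha_2)^{-\beta_2})\bigr)^{\theta}
\Bigr]^{1/\theta} \Bigr),
\qquad \theta = \frac{1}{1 - \rho}
```

As a sanity check, ρ = 0 gives θ = 1 and the expression factorizes into F_X(x) F_Y(y), the independent case; ρ = 0.7 gives θ ≈ 3.33.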

Proposed Discrete Distribution: ALMOND-DG

If (X,Y) ~ ALMOND, then (floor(X), floor(Y)) ~ ALMOND-DG, where X>=1 and Y>=1.

i.e., we discretize the values of ALMOND, and reject the pairs with either X=0 or Y=0.
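The discretization step is simple enough to show directly. In this sketch the function name is hypothetical and the input is assumed to be any list of continuous (x, y) samples, however they were produced.

```python
import math

def almond_dg_from_continuous(pairs):
    """Floor each continuous (x, y) pair and reject pairs where either floor is 0."""
    kept = []
    for x, y in pairs:
        fx, fy = math.floor(x), math.floor(y)
        if fx >= 1 and fy >= 1:
            kept.append((fx, fy))
    return kept
```

For instance, (0.5, 3.2) and (4.7, 0.99) are rejected because one coordinate floors to 0, while (1.9, 2.0) becomes (1, 2).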

Contributions revisited (3)

Patterns: We observe
• power laws between competing tasks
• log-logistic distributions for many tasks

Modeling: Almond-DG distribution for 2-d real datasets

Practical Use: spot outliers; what-if scenarios.

Roadmap

Related Work

Data

Observed Patterns

Proposed Distribution

Goodness of Fit

Conclusions

Synthetic Data Generation

Parameter Estimation
• Traditionally: MLE, MOM
• Proposed: log-odds plot (log-logistic: 2 parameters, the intercept + slope of the line)
• Copula-based generation (1 parameter: the dependence θ)

Evaluation is hard even for 1-d skewed distributions!!! [Chakrabarti, 2006]

[Figure: ln(odds) (y-axis) vs. ln(mentions) (x-axis)]
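Copula-based generation can be sketched end to end as follows. This is one possible generator, not necessarily the one the authors used. It samples the Gumbel copula with the Marshall-Olkin frailty construction (a positive stable variable drawn via Kanter's representation), maps the uniforms through log-logistic inverse CDFs, and then floors and rejects as in ALMOND-DG. All function names are hypothetical, θ = 1/(1 - ρ) follows the ALMOND slide, and the clamping constant is an implementation choice.

```python
import math
import random

def _open_unit(rng):
    """Uniform draw from the open interval (0, 1)."""
    r = rng.random()
    while r == 0.0:
        r = rng.random()
    return r

def positive_stable(a, rng):
    """Draw S > 0 with Laplace transform exp(-t**a), 0 < a <= 1 (Kanter's representation)."""
    u = math.pi * _open_unit(rng)
    e = -math.log(_open_unit(rng))  # Exp(1) variate
    return (math.sin(a * u) / math.sin(u) ** (1.0 / a)) * \
           (math.sin((1.0 - a) * u) / e) ** ((1.0 - a) / a)

def gumbel_copula_sample(theta, rng):
    """One (u, v) pair from the Gumbel copula via the Marshall-Olkin construction."""
    s = positive_stable(1.0 / theta, rng)
    e1 = -math.log(_open_unit(rng))
    e2 = -math.log(_open_unit(rng))
    return (math.exp(-(e1 / s) ** (1.0 / theta)),
            math.exp(-(e2 / s) ** (1.0 / theta)))

def loglogistic_ppf(u, alpha, beta):
    """Inverse log-logistic CDF; u is clamped away from 0 and 1 to stay finite."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)

def almond_dg_sample(n, rho, marg_x, marg_y, seed=0):
    """Draw n integer pairs: copula -> log-logistic marginals -> floor -> reject zeros."""
    theta = 1.0 / (1.0 - rho)  # dependence parameter as defined on the ALMOND slide
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u, v = gumbel_copula_sample(theta, rng)
        x = math.floor(loglogistic_ppf(u, *marg_x))
        y = math.floor(loglogistic_ppf(v, *marg_y))
        if x >= 1 and y >= 1:
            out.append((x, y))
    return out
```

A call such as `almond_dg_sample(1000, 0.7, (2.07, 1.27), (2.07, 1.27))` produces a synthetic cloud of points; the marginal parameters here are illustrative values borrowed from the log-odds slide, not a fitted pair of tasks.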

Goodness of Fit (1a)

Marginal PDF: real data vs. synthetic data

[Figure: two panels, ln(frequency) (y-axis) vs. ln(comments) (x-axis) and ln(frequency) (y-axis) vs. ln(mentions) (x-axis)]

1 ✔

Goodness of Fit (1b)

Contour plots; Conditional Means (SuRF)

[Figure: four panels, real data and synthetic data for each plot type; ln(mentions) (y-axis) vs. ln(comments) (x-axis)]

2 ✔ 3 ✔

Goodness of Fit (2a)

Marginal PDF: real data vs. synthetic data

[Figure: two panels, ln(frequency) (y-axis) vs. ln(retweets) (x-axis) and ln(frequency) (y-axis) vs. ln(tweets) (x-axis)]

1 ✔

Goodness of Fit (2b)

Contour plots; Conditional Means (SuRF)

[Figure: four panels, real data and synthetic data for each plot type; ln(tweets) (y-axis) vs. ln(retweets) (x-axis)]

2 ✔ 3 ✔

Roadmap

Related Work

Data

Observed Patterns

Proposed Distribution

Goodness of Fit

Conclusions

Conclusions

Patterns: Discovery of
• power law between competing tasks
• log-logistic distributions for many tasks

Modeling: Almond-DG, that explains (i) super-linearity, (ii) marginals and (iii) conditionals in real 2-d data

Practical Use: anomaly detection; what-if scenarios.

[Figure: ln(tweets) (y-axis) vs. ln(comments) (x-axis)]

Thank you!

Backup slides

• Likely question areas: ideas glossed over, shortcomings of methods or results, and future work

• Why Gumbel? It fits, it has been used in the past, and it is parsimonious (theta, alpha, beta)