Content and the Scan Statistic for the Enron Data John M. Conroy Institute for Defense Analyses Center for Computing Sciences Bowie, MD


Page 1: Content and the Scan Statistic for the Enron Data

Content and the Scan Statistic for the Enron Data

John M. Conroy

Institute for Defense Analyses

Center for Computing Sciences

Bowie, MD

Page 2: Content and the Scan Statistic for the Enron Data

Citations and Coauthors

• C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park, “Scan Statistics on Enron Graphs,” Computational and Mathematical Organization Theory, to appear. http://www.ams.jhu.edu/∼priebe/sseg.html

• J. M. Conroy, J. D. Schlesinger, J. Goldstein, and D. P. O'Leary, “Left-Brain/Right-Brain Multi-Document Summarization.” http://www.nlpir.nist.gov/projects/duc/pubs.html

Page 3: Content and the Scan Statistic for the Enron Data

Outline

• Enron Data

• Review of Scan Statistic

• Content Analysis

– Content of Week 109 (Chatter Week)

– Communication vs. Content for Week 109

Page 4: Content and the Scan Statistic for the Enron Data

Enron Email Data

• Email boxes of 184 user accounts, mostly executives.

• 55,362 stored messages (many duplicates).

• 125,409 transactions (from-to pairs) among the 184 user accounts.

• 189 weeks, from 1998 through 2002.

Page 5: Content and the Scan Statistic for the Enron Data

Review of Scan Statistic

• Anomaly Detection.

• E.g.

– Introduction of new actors.

– Increase in communication between a group of people (chatter).
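As a rough illustration of how "chatter" can be detected, here is a minimal sketch of a graph scan statistic. It assumes the order-1 locality statistic used in the cited Priebe et al. paper (the number of edges in the subgraph induced by a vertex and its neighbors), simplified to an undirected graph. The function names and the toy edge list are illustrative and are not taken from the slides.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from (sender, receiver) pairs; self-loops ignored."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def locality_statistic(adj, v):
    """Number of edges in the subgraph induced by v and its neighbors."""
    hood = adj.get(v, set()) | {v}
    # Count each undirected edge once by requiring a < b.
    return sum(1 for a in hood for b in adj.get(a, set()) if b in hood and a < b)


def scan_statistic(adj):
    """Largest locality statistic over all vertices, with the vertex attaining it."""
    return max((locality_statistic(adj, v), v) for v in adj)


# Toy week of email traffic (made-up users).
week_edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"), ("carol", "dave")]
adj = build_adjacency(week_edges)
stat, center = scan_statistic(adj)
```

Computing this value for each week and flagging weeks where it is unusually large relative to recent history is one way to surface a chatter week; the normalization used in the talk is not recoverable from the slides.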

Page 6: Content and the Scan Statistic for the Enron Data


Page 7: Content and the Scan Statistic for the Enron Data


Page 8: Content and the Scan Statistic for the Enron Data


Page 9: Content and the Scan Statistic for the Enron Data


Page 10: Content and the Scan Statistic for the Enron Data


Page 11: Content and the Scan Statistic for the Enron Data

Gees, What Were They Saying?

• Data for week 109

– 1092 transactions among 22 users.

– 343 files.

– 91 unique messages.

• What were the subjects of discussion?

Page 12: Content and the Scan Statistic for the Enron Data

Counts and Subject Lines

Count  Subject line
9
5      Analysis of Joskow / Hogan Papers
4      FERC Request
3      Data on Monthly Generation for SCE
3      Draft Talking points about California Gas market
3      EnronOnline question
3      Presentations from GA Meeting on December 8
2      California Price Issues
2      Conectiv / Delmarva
2      Davis, Hoecker and Richardson
2      FYI-Edison wants Reregulation
1      Additional Arguments for Enron's Gas Cap Response
1      Calif. Performance Issues
1      California Update--12.12.00
1      Capacity Release Info for Enron's Gas Cap Response

Page 13: Content and the Scan Statistic for the Enron Data

Clustering Based on Content

• Find emails with similar content based on terms that occur.

• Term: space-delimited string of characters from {a,b,c,…,z}, after text is lower cased and all other characters and stop words are removed.

• Need to restrict our attention to indicative terms (signature terms).

– Terms that occur more often than expected.
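The term definition above can be sketched as follows. This is only an illustration: the stop-word list is a tiny hypothetical one (the slides do not say which list was used), and non-letter characters are replaced by spaces rather than deleted, an assumption suggested by example terms such as "na" and "com" that appear later in the deck.

```python
import re

# Hypothetical, deliberately small stop-word list; not the one used in the talk.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "on", "for"}


def extract_terms(text):
    """Lower-case the text, keep only runs of a-z, and drop stop words."""
    lowered = text.lower()
    # Assumption: every character outside a-z becomes a space (a term delimiter).
    letters_only = re.sub(r"[^a-z]", " ", lowered)
    return [tok for tok in letters_only.split() if tok not in STOP_WORDS]


terms = extract_terms("Analysis of Joskow / Hogan Papers, 12/11/2000")
```

Selecting the signature terms among these candidates is a separate step, based on the log-likelihood test described on the following slides.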

Page 14: Content and the Scan Statistic for the Enron Data

Signature Terms

Terms that occur more often than expected

• Based on a 2×2 contingency table of relevance counts.

• Log-likelihood; equivalent to mutual information.

• Dunning 1993, Hovy & Lin 2000.

Page 15: Content and the Scan Statistic for the Enron Data

Hypothesis Testing

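H0: P(C | t_i) = p = P(C | ~t_i)

H1: P(C | t_i) = p_1 ≠ p_2 = P(C | ~t_i)

ML estimates of p, p_1, and p_2, from the observed counts:

          C      ~C
 t_i     O11    O12
~t_i     O21    O22

\[
p = \frac{O_{11} + O_{21}}{O_{11} + O_{21} + O_{12} + O_{22}}, \qquad
p_1 = \frac{O_{11}}{O_{11} + O_{12}}, \qquad
p_2 = \frac{O_{21}}{O_{21} + O_{22}}
\]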

Page 16: Content and the Scan Statistic for the Enron Data

Likelihood of H0 vs. H1 and Mutual Information

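\[
\frac{L(H_0)}{L(H_1)} =
\frac{b(p;\, O_{11},\, O_{11} + O_{12})\; b(p;\, O_{21},\, O_{21} + O_{22})}
     {b(p_1;\, O_{11},\, O_{11} + O_{12})\; b(p_2;\, O_{21},\, O_{21} + O_{22})}
\]

\[
-2 \log\left( \frac{L(H_0)}{L(H_1)} \right) = 2 N I(C \mid t),
\]

where I(C | t) is the mutual information statistic, when b is binomial.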
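The identity can be checked numerically. The sketch below assumes that N is the total count O11 + O12 + O21 + O22, that logarithms are natural, and that both row totals are nonzero. The binomial coefficients are omitted because they cancel in the ratio. Function names and example counts are illustrative.

```python
import math


def log_binom_kernel(p, k, n):
    """log of p^k (1-p)^(n-k); terms with a zero count contribute nothing."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)


def neg2_log_lambda(o11, o12, o21, o22):
    """-2 log(L(H0)/L(H1)) using the ML estimates p, p1, p2."""
    n1, n2 = o11 + o12, o21 + o22
    p = (o11 + o21) / (n1 + n2)
    p1, p2 = o11 / n1, o21 / n2
    log_h0 = log_binom_kernel(p, o11, n1) + log_binom_kernel(p, o21, n2)
    log_h1 = log_binom_kernel(p1, o11, n1) + log_binom_kernel(p2, o21, n2)
    return -2.0 * (log_h0 - log_h1)


def mutual_information(o11, o12, o21, o22):
    """Empirical mutual information (in nats) of the 2x2 table."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    cells = ((o11, 0, 0), (o12, 0, 1), (o21, 1, 0), (o22, 1, 1))
    return sum((o / n) * math.log(o * n / (rows[r] * cols[c]))
               for o, r, c in cells if o > 0)


# Made-up counts: the term occurs 10 times in C and 5 times outside it.
table = (10, 5, 20, 65)
statistic = neg2_log_lambda(*table)
two_n_i = 2 * sum(table) * mutual_information(*table)
```

Terms whose statistic is large are the candidates for signature terms; the cutoff used in the talk is not stated on the slides.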

Page 17: Content and the Scan Statistic for the Enron Data

Example: Subject: Re: Analysis of Joskow / Hogan Papers

Sounds very good.

Might be useful to get a "reputable" economist to write a paper that 1) describes traditional means for defining, identifying and mitigating market power, 2) compares those with the "new" means folks are coming up with these days, and 3) comments on the "split" in the academic community over the issues.

When Steve Kean and I discussed the notion initially, thought it might be a good idea to gently "pile on" to the public discussion with the goal of making clear 1) just how complex this issue is and 2) how important it will be to have a thorough analysis (say, about 12+ months worth?) before rushing to judgment on anything Joskow might allege in his paper. Thoughts?

Best,

Jeff

James D Steffes 12/11/2000 10:02 AM

To: Alan Comnes/PDX/ECT@ECT, Joe Hartsoe/Corp/Enron@ENRON, Richard Shapiro/NA/Enron@Enron

cc: Jeff Dasovich/NA/Enron@Enron, Mary Hain/HOU/ECT@ECT, Susan J Mara/NA/Enron@Enron

Subject: Re: Analysis of Joskow / Hogan Papers

Having read the Hogan paper, I think that the "academic" community is somewhat divided on this issue. If we want to move forward on the issues Joskow addresses, I would recommend that EPSA be the vehicle. The entire marketer / generator community needs to counter.

What do people think about seeking activity through EPSA, WPTF, and/or IEP of CA to push back on the studies and analysis especially after the Dec 13 Order? I don't think that the discussions will be ending very soon.

Jim

Alan Comnes@ECT 12/07/2000 03:07 AM

To: James D Steffes/NA/Enron@ENRON

cc: Jeff Dasovich/NA/Enron@Enron, Susan J Mara/NA/Enron@ENRON, Mary Hain/HOU/ECT

Subject: Re: Analysis of Joskow / Hogan Papers

The Joskow/Kahn paper raises two issues: price above cost and witholding.

Enron obviously has concerns with the "price above cost" analysis. I drafted some specific concerns and put them into a draft to Enron's reponse to Hoeker Question 1. Although the detail was dropped in the final draft, the basic technical concerns were laid out there. To really rebut Joskow/Kahn would take considerable work. Jeff D's idea was to write a paper that raised issues and indicated how complicated a "correct" response would be.

The Joskow/Kahn withholding section has recieved criticism from the ISO so I am not sure Enron needs to respond to that.

I think my bottom line now is that the debate at FERC will soon be over or enter a new stage on the 13th. As far as how a response would help us in California, I think requires a discussion with Jeff.

Alan

From: James D Steffes@ENRON on 12/05/2000 07:22 PM CST

To: Alan Comnes/PDX/ECT@ECT, Jeff Dasovich/NA/Enron@Enron, Susan J Mara/NA/Enron@ENRON, Mary Hain/HOU/ECT@ECT

cc:

Subject: Analysis of Joskow / Hogan Papers

Alan --

Before we bring in Seabron Adamson to do some analysis, I'd like your read of the Joskow and Hogan papers. When we have our understanding straight, let's talk.

Jim

----- Forwarded by James D Steffes/NA/Enron on 12/05/2000 07:20 PM -----

Jeff Dasovich

Sent by: Jeff Dasovich

11/30/2000 11:49 AM

To: [email protected], Richard Shapiro/NA/Enron@Enron, James D Steffes/NA/Enron@Enron, Sandra McCubbin/NA/Enron@Enron, Paul Kaufman/PDX/ECT@ECT, Joe Hartsoe/Corp/Enron@ENRON, Sarah Novosel/Corp/Enron@ENRON, Mary Hain/HOU/ECT@ECT, Karen Denne/Corp/Enron@ENRON, [email protected], Susan J Mara/NA/Enron@ENRON, Alan Comnes/PDX/ECT@ECT

cc:

Subject: From Today's Electricity Daily

FYI. In bizarre times, help can sometimes come from bizarre places. Granted, we're likely to disagree strongly with Hogan's continued obsession with Poolco, but the discussion in his paper regarding market power may be helpful---I've read the Joskow paper, but haven't yet had a chance to review the Hogan piece.

Steve and I discussed the need to do a focused assessment of the Joskow/Kahn "analysis" (remember it's Ed Kahn, not Alfred Kahn). Seems that it would be very useful to fold into that analysis any useful stuff on market power included in the paper done by Hogan & Co. If, in the end, there ain't nothing useful, so be it. But seems like there's little downside to exploring it.

Jim, my understanding is that Alan is already working with the fundamentals folks on the Portland desk to deconstruct the Joskow paper. Might want to include the Hogan paper in those discussions and might also be useful to pull Seabron Adamson into the thinking, too. Ultimately, may be preferable to have any assessment of Joskow and/or Hogan to come from economists, rather than directly from us.

Best,

Jeff

----- Forwarded by Jeff Dasovich/NA/Enron on 11/30/2000 11:38 AM -----

"Daniel Douglass" <[email protected]> 11/30/2000 11:29 AM

To: <[email protected]>, <[email protected]>, <[email protected]>,

<[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>

cc:

Subject: From Today's Electricity Daily

Has FERC Gone Far Enough in California?

The Federal Energy Regulatory Commission isn't going far enough in its attempt to reform the California wholesale electric market, according to a paper by three prominent economists done for San Diego Gas and Electric. The paper by John D. Chandley, Scott M. Harvey, and William W. Hogan argues that FERC should first end the artificial separation that divides the California Power Exchange and the California Independent System Operator, rather than worrying about the governance of the two institutions.

"The change in governance may help," says the paper - "Electricity Market Reform in California" - "but it is not likely to be decisive in the near term. Explicit guidance from the commission regarding the nature and trajectory of reforms will be essential if market reform is to be accomplished within an acceptable time frame." Hogan, of the Kennedy School of Government, has been writing since 1995 in opposition to California's market separation.

Also, argues the paper, freeing the California utilities to engage in forward contracting is no panacea. "The expectation that merely allowing utilities to participate in forward contracting necessarily would be the solution to high prices is problematic and not supported by the commission's staff report," says the analysis, adding that "putting pressure on buyers to sign contracts in the present environment may make things worse." If the underlying problem in California is high cost and low capacity, requiring forward contracting could harm not only California but also the entire Western U.S. electric system.

FERC's $150 so-called "soft cap" is a wild card that has the three economists scratching their heads. "It does not appear in the staff report and there is little critical analysis of their implications, other than the discussion of Commissioner [Curt] Hebert." If the intent of the soft cap is to move toward cost justification for bids above $150/MWh, then FERC is headed into an administrative morass "that would rival those under wellhead price controls in the natural gas industry."

If, on the other hand, the soft cap is "truly soft" and would only require some paper work at FERC and the possibility of a refund if the price is eventually deemed not just and reasonable, "there might be little impact on consumer prices (particularly if the principal sources of those high prices are high costs and regional capacity shortages rather than the exercise of market power). Even so, the proposal might serve to deter entry and new investments, thus combining the worst of both worlds, high consumer prices and little or no new investment."

FERC's proposed order in California also demonstrates confusion about just what constitutes market power. The paper cites the proposed order's lawyerly, obfuscatory conclusion that "while this record does not support findings of specific exercises of market power, and while we are not able to reach definite conclusions about the actions of individual sellers, there is clear evidence that in California market structure and rules provide the opportunity for sellers to exercise market power when supply is tight and can result in unjust and unreasonable rates under the [Federal Power Act]." The economists note, "In this regard, the debate is confused because we are dancing around the words where the truth may be hard to face."

In the case of California, say the economists, there is no evidence of market power. Even the practice of generators avoiding the day-ahead market in favor of the real-time market "is a response to bad market design and pricing incentives (including price caps), but does not demonstrate the exercise of market power." Nor is bidding above marginal cost necessarily an exercise of market power, they add. "The distinction between direct marginal cost and opportunity cost is sometimes lost in the discussion. Hence, a competitive bidder whose direct cost of generation is $40 but who could sell the same energy outside California for $100 should bid no less than $100. This would not be an exercise of market power."

Page 18: Content and the Scan Statistic for the Enron Data

Example Signature Terms

• analysis, california, com, economists, enron, hogan, joskow, kahn, market, na, paper, power

Page 19: Content and the Scan Statistic for the Enron Data

Simple Clustering

For each message compute signature terms.

Form an n × d matrix F with

F(i,j) = # times sigterm i occurs in doc j.

[R,P]=corrcoef(F);

P is the d × d matrix of p-values for R.
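For readers without MATLAB, the correlation step can be approximated in plain Python. This is a sketch under stated assumptions, not a reimplementation of `corrcoef`: the p-values come from a normal approximation based on Fisher's z-transform rather than an exact t-based test, constant columns are given correlation 0, and more than three signature terms are required. The function name and the toy matrix are illustrative.

```python
import math


def corrcoef_with_pvalues(F):
    """F: list of n rows (signature terms), each with d counts (documents).

    Returns (R, P), both d x d: document-by-document Pearson correlations and
    approximate two-sided p-values (Fisher z, normal approximation).
    """
    n, d = len(F), len(F[0])
    cols = [[F[i][j] for i in range(n)] for j in range(d)]
    centered = []
    for c in cols:
        mean = sum(c) / n
        centered.append([x - mean for x in c])
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]

    R = [[1.0] * d for _ in range(d)]
    P = [[1.0] * d for _ in range(d)]  # diagonal left at 1: no self-links
    for a in range(d):
        for b in range(a + 1, d):
            if norms[a] == 0.0 or norms[b] == 0.0:
                r = 0.0  # assumption: undefined correlation treated as 0
            else:
                dot = sum(x * y for x, y in zip(centered[a], centered[b]))
                r = max(-1.0, min(1.0, dot / (norms[a] * norms[b])))
            if abs(r) >= 1.0:
                p = 0.0
            else:
                z = math.atanh(r) * math.sqrt(n - 3)
                p = math.erfc(abs(z) / math.sqrt(2.0))
            R[a][b] = R[b][a] = r
            P[a][b] = P[b][a] = p
    return R, P


# Toy data: 6 signature terms (rows) by 3 documents (columns).
F = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1], [5, 10, 0], [6, 12, 1]]
R, P = corrcoef_with_pvalues(F)
```

In the toy matrix, documents 0 and 1 use the signature terms in the same proportions, so their correlation is close to 1 and their p-value is small, while document 2 is only weakly correlated with either.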

Page 20: Content and the Scan Statistic for the Enron Data

Simple Clustering (cont.)

• Consider the graph G(P < α), in which two documents are joined when their p-value falls below a threshold α.

• Take the clusters as the connected components of G.

• Thus, two documents are connected if there is a significant overlap in their signature terms!
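The clustering step can be sketched as a connected-components search over the thresholded p-value matrix. The cutoff name `alpha` and its default of 0.05 are assumptions, since the threshold used in the talk is not recoverable from the slides. Note that thresholding on p-values alone also links documents with a significant negative correlation.

```python
def clusters_from_pvalues(P, alpha=0.05):
    """Connected components of the graph with an edge i-j whenever P[i][j] < alpha.

    P is a symmetric d x d matrix of p-values; returns a list of sorted index lists.
    """
    d = len(P)
    seen = [False] * d
    clusters = []
    for start in range(d):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:  # depth-first search from the unvisited document
            u = stack.pop()
            component.append(u)
            for v in range(d):
                if v != u and not seen[v] and P[u][v] < alpha:
                    seen[v] = True
                    stack.append(v)
        clusters.append(sorted(component))
    return clusters


# Made-up p-values for four documents.
P = [
    [1.00, 0.01, 0.90, 0.80],
    [0.01, 1.00, 0.02, 0.70],
    [0.90, 0.02, 1.00, 0.60],
    [0.80, 0.70, 0.60, 1.00],
]
clusters = clusters_from_pvalues(P)  # [[0, 1, 2], [3]]
```

Documents 0 and 2 end up in the same cluster even though they are not directly linked, because both are linked to document 1.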

Page 21: Content and the Scan Statistic for the Enron Data

Connected Components of Email

Page 22: Content and the Scan Statistic for the Enron Data

Connected Components

Page 23: Content and the Scan Statistic for the Enron Data

Summarizing the Clusters

• Single Msg. Summarization

– Score the sentences:

• Given signature terms.

• Want first “few” great sentences.

• Want the probability that a sentence is a summary sentence.

Page 24: Content and the Scan Statistic for the Enron Data

Summarizing (cont.)

• Multi-document Summarization

– Use HMM scores to select candidate sentences (~2w).

– Terms as sentence features

• Terms: {t_1, …, t_m}, R^m

• Sentences: {s_1, …, s_n}, R^n

• Scaling: ||a|| = HMM score

– Use Pivoted QR to select sentences.

\[
\begin{array}{c|ccc}
       & s_1    & \cdots & s_n    \\ \hline
t_1    & a_{11} & \cdots & a_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
t_m    & a_{m1} & \cdots & a_{mn}
\end{array}
\]
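The sentence-selection idea behind pivoted QR can be illustrated with a small greedy routine. This is a sketch using Gram-Schmidt with column pivoting, not a call to a tuned QR library, and it assumes the columns of the term-by-sentence matrix have already been scaled so that each column norm equals the sentence's HMM score. The function name and the toy matrix are illustrative.

```python
import math


def select_sentences_pivoted_qr(A, k):
    """A: m x n list of rows (terms x sentences). Returns up to k column indices.

    At each step, pick the remaining column with the largest norm, then remove
    its direction from every other column so redundant sentences lose weight.
    """
    m, n = len(A), len(A[0])
    cols = [[float(A[i][j]) for i in range(m)] for j in range(n)]
    chosen = []
    for _ in range(min(k, n)):
        norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
        best = max((j for j in range(n) if j not in chosen), key=lambda j: norms[j])
        if norms[best] < 1e-12:
            break  # nothing left that adds new term content
        chosen.append(best)
        q = [x / norms[best] for x in cols[best]]
        for j in range(n):
            if j in chosen:
                continue
            proj = sum(qi * xi for qi, xi in zip(q, cols[j]))
            cols[j] = [xi - proj * qi for xi, qi in zip(cols[j], q)]
    return chosen


# Toy matrix: 3 terms (rows) by 3 sentences (columns).
# Sentence 1 repeats the content of sentence 0; sentence 2 covers a new term.
A = [
    [3.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
picked = select_sentences_pivoted_qr(A, 2)  # [0, 2]
```

The second pick skips sentence 1 despite its larger original norm, because its content is already covered by sentence 0.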

Page 25: Content and the Scan Statistic for the Enron Data

Summaries of Clusters

100-Word Summary of Cluster 7; 4 msgs

/data/Enron/maildir/lavorato-j/_sent_mail/11.

Subject:

* The power mark to market book will pay NewAlb a capacity payment of $4.87 ...

for 5 years. We shaped this payment as follows:...

* Enron will also pay NewAlb $2.00/MW hour for varialbe o&m.

/data/Enron/maildir/lavorato-j/sent_items/225.

Subject:

The following points refer to the methodology that we are taking to rebook the New Albany Plant. Please send me a note immediately if you disagree....

Assume that NewAlb is a non mark to market entity and Enron is the mark to market entity. However, it is fully owned and operated by us for now....

* This will create an entity "NewAlb" that will return 9% assuming a book value of $336/kw on 12/31/2005 vs. 409 currently.

Page 26: Content and the Scan Statistic for the Enron Data

Cluster 3: 6 Documents

/data/Enron/maildir/dasovich-j/all_documents/4665.

Subject: Re:

To: Jeff Dasovich/NA/[email protected]: Jeff Dasovich on 12/13/2000 10:36 AM...Sent by: Jeff Dasovich...To: Richard Shapiro/NA/[email protected]: ...Kahn's secretary has left messages that he's "very tied up" and continues to ...try to contact me. Suggests to me that they may be planning something and

/data/Enron/maildir/dasovich-j/all_documents/4562.

Subject: Re: Presentations from GA Meeting on December 8

From: Jeff Dasovich on 12/12/2000 10:57 AM...Sent by: Jeff Dasovich...To: Richard Shapiro/NA/Enron@Enron

/data/Enron/maildir/dasovich-j/all_documents/4688.

Subject: Re: Davis, Hoecker and Richardson

From: Jeff Dasovich on 12/13/2000 12:50 PM...Sent by: Jeff Dasovich...Word on the street is that Davis, Hoecker and Richardson are meeting in D.C.

/data/Enron/maildir/dasovich-j/all_documents/4565.

Subject: Re: Presentations from GA Meeting on December 8

From: Jeff Dasovich on 12/12/2000 10:57 AM...Sent by: Jeff Dasovich...To: Richard Shapiro/NA/Enron@Enron

Page 27: Content and the Scan Statistic for the Enron Data

Cluster 1: 35 documents

200-Word Summary of Email Cluster 1

/data/Enron/maildir/taylor-m/all_documents/3944.

Subject: EnronOnline question

To: Stacy E Dickson/HOU/ECT@ECT... cc: Mark Taylor/HOU/ECT@ECT

/data/Enron/maildir/dasovich-j/all_documents/4693.

Subject: FYI-Edison wants Reregulation

"Stephanie-Newell" <[email protected]>, "Sue Mara" ...<[email protected]>, "Tom Ross" <[email protected]>, "Kate Castillo" ...<[email protected]>, "Bill Carlson" <[email protected]>, "Bill Woods" ...<[email protected]>, "Bob Escalante" <[email protected]>, "Carolyn ...Baker" <[email protected]>, "Cody Carter" <[email protected]>, ..."Curt Hatton" <[email protected]>, "Curtis Kebler" ...<[email protected]>, "Dave Parquet" <[email protected]>, ..."Dean Gosselin" <[email protected]>, "Duane Nelsen" ...<[email protected]>, "Eric Eisenman" <[email protected]>, "Frank ...DeRosa" <[email protected]>, "Greg Blue" <[email protected]>, "Hap Boyd" ...<[email protected]>, "Jack Pigott" <[email protected]>, "Jeff Dasovich" ...<[email protected]>, "Jim Willey" <[email protected]>, "Joe Greco" ...<[email protected]>, "Joe Ronan" <[email protected]>, "Jonathan Weisgall" ...<[email protected]>, "Ken Hoffman" <[email protected]>, "Kent ...McFadden" <[email protected]>, "Paula Soos"

/data/Enron/maildir/dasovich-j/all_documents/4460.

Subject: Re: Data on Monthly Generation for SCE

McCubbin/NA/Enron@Enron, Tim Belden/HOU/ECT@ECT, Robert Badeer/HOU/ECT@ECT, ...Chris H Foster/HOU/ECT@ECT, Susan J Mara/NA/Enron@ENRON, Alan ...Comnes/PDX/ECT@ECT

/data/Enron/maildir/kean-s/all_documents/2288.

Subject: FW: NEW HARASSMENT

[email protected]; [email protected];

Page 28: Content and the Scan Statistic for the Enron Data

Ada Lovelace

• "The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths."

Page 29: Content and the Scan Statistic for the Enron Data

Reprocess the Data!

• Remove any line with 2 or more @’s.

• Re-compute signature terms.

• Re-compute clusters.

• Re-compute summaries!

• Do over!
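The first reprocessing step is simple enough to sketch directly: drop every line that contains two or more "@" characters, which removes most address lists while keeping body text. The function name and the sample message (with made-up addresses) are illustrative.

```python
def strip_address_lines(message):
    """Return the message with every line containing 2 or more '@' removed."""
    kept = [line for line in message.splitlines() if line.count("@") < 2]
    return "\n".join(kept)


# Hypothetical message; the addresses are placeholders.
sample = "\n".join([
    "To: a@example.com, b@example.com, c@example.com",
    "cc: d@example.com",
    "Sounds very good.",
])
cleaned = strip_address_lines(sample)
```

A line with a single address, such as the `cc:` line here, survives the filter, so this rule trims long recipient lists without touching ordinary sentences that mention one address.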

Page 30: Content and the Scan Statistic for the Enron Data

New t=109, Iter1

Page 31: Content and the Scan Statistic for the Enron Data

New t=109, Iter 2

Page 32: Content and the Scan Statistic for the Enron Data

New t=109 Iter8

Page 33: Content and the Scan Statistic for the Enron Data

“Your mother is near! So, as fast as you can, Think of something to do!

You will have to get rid of Thing One and Thing Two!”

Subject: Organizational Changes

---------------------- Forwarded by Richard Shapiro/NA/Enron on 12/08/2000=

Subject: Re: Analysis of Joskow / Hogan Papers

Having read the Hogan paper, I think that the "academic" community is ...paper by three prominent economists done for San Diego Gas and Electric. The ...paper by John D. Chandley, Scott M. Harvey, and William W. Hogan argues that

Subject: Hogan-California Market Power

FYI. Not sure if you had seen this. Hogan makes many of the arguments about

Subject: Re: Draft Talking points about California Gas market

Given the way the numbers came out, I guess we don't need the talking points,

Subject: Re: FERC Request

Drew is okay with this. I will email the list to FERC.

Subject: Update on FERC California Gas/Electric Matters

into the California market last summer....Various Enron units continue to receive informal data requests from FERC ...staff regarding current California gas/electric

Page 34: Content and the Scan Statistic for the Enron Data

Related News Item

January 13, 2001: Leading economists Paul Joskow and Edward Kahn conclude that high wholesale prices observed in summer 2000 [in California] cannot be explained as the natural outcome of ‘market fundamentals’ in competitive markets since there is a very significant gap between actual market prices and competitive benchmark prices. (Source: CATO Policy Analysis) http://cantwell.senate.gov/news/releases/2002_04_18_consumer.html

Page 35: Content and the Scan Statistic for the Enron Data

Content vs. Contact

Let F be the term-document matrix of signature terms for the emails sent during a given period.

Consider the induced dot product graph, based on the correlation of the documents' signature-term vectors.

Page 36: Content and the Scan Statistic for the Enron Data

A Content-Based Dot Product Graph

Let P(A_ij = 1) ~ R_ij, where R_ij is the correlation coefficient of documents i and j.

Note that this dot product graph is based on the correlation of two sparse vectors, so it is not low dimensional!
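One way to read the edge model is as independent coin flips with success probability given by the correlation. The sketch below samples such a graph from a correlation matrix; clipping negative correlations to probability 0, the fixed random seed, and the function name are all assumptions made for illustration.

```python
import random


def sample_content_graph(R, seed=0):
    """Sample a symmetric 0/1 adjacency matrix with P(A[i][j] = 1) = clip(R[i][j], 0, 1)."""
    rng = random.Random(seed)
    d = len(R)
    A = [[0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            # Assumption: negative correlations are treated as probability 0.
            prob = min(1.0, max(0.0, R[i][j]))
            if rng.random() < prob:
                A[i][j] = A[j][i] = 1
    return A


# Made-up document correlations.
R = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, -0.5],
    [0.0, -0.5, 1.0],
]
A = sample_content_graph(R)
```

Comparing a content graph like this with the observed communication graph for the same week is the kind of comparison the following slides show, although the exact procedure used in the talk cannot be recovered from the transcript.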

Page 37: Content and the Scan Statistic for the Enron Data

Communication and Content

Page 38: Content and the Scan Statistic for the Enron Data

Various Thresholds

Page 39: Content and the Scan Statistic for the Enron Data

Content 109 vs. All Communication

Page 40: Content and the Scan Statistic for the Enron Data

Rho Threshold Plot

Page 41: Content and the Scan Statistic for the Enron Data

Content and Communication are not the Same

Example: (due to Libby Beer)

Alice & Bob exchange love letters

and

Carol & Dave exchange love letters

DOES NOT imply

Alice & Dave send love letters!

Page 42: Content and the Scan Statistic for the Enron Data

Conclusions

• The scan statistic on graphs rocks!

• Summarization methods are useful in analyzing email, but exploiting the structure of email is integral.

• Content is correlated with communication but shows about 18% of contacts.

• Probability of communication correlates with message content correlation!

Page 43: Content and the Scan Statistic for the Enron Data

Future Work

• Content scan statistic would track changing user interests.

• Augment the communication information.

• Predict “love is in the air.”

• Content scan statistic with nodes being documents!

– E.g., emerging themes in research papers.