Identifying Temporal Patterns in Collections of Documents

Rich Caruana, Thorsten Joachims, Johannes Gehrke, Benyah Shaparenko

Cornell University

{caruana,tj,johannes,benyah}@cs.cornell.edu

KDD Challenge 2005 Open Task

Introduction and Goals

• Evolution of Document Collections
  – How can one identify and compactly summarize the temporal development of topics in a data stream?
• Identifying Influential Ideas
  – What are the key documents that drive the development and change?
• Identifying Influential Authors
  – Which authors have the largest impact on the development?

Identifying Key Documents: Goals

• Identify “leading” papers:
  – Which documents introduce new ideas?
  – Which documents most influence future work?
• Constraints
  – Analysis is not limited to scientific papers, so it must work without citation data
  – Only text and timestamp may be used

Identifying Key Documents: Methods

• Document Lead/Lag Index
  – Find the k nearest neighbors (NN) of document d by cosine distance
  – Raw lead/lag score: $LL_{raw}(d) = (\#\,\text{later NN}) - (\#\,\text{earlier NN})$
  – Scaled lead/lag score (avoids edge effects): $LL_{norm}(d) = LL_{raw}(d) \,/\, \mathrm{AVG}_{d' \in year(d)}\big(LL_{raw}(d')\big)$

[Figure: timeline from 1997 to 2003 with documents d, d1, d2, d3, d4]
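The two scores above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the sparse-dictionary document representation, and the handling of a zero yearly average (returned as `None`) are assumptions, as is the choice to count neighbors from the same year as neither earlier nor later.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def raw_lead_lag(docs, k):
    """docs is a list of (year, vector) pairs; returns LL_raw per document."""
    scores = []
    for i, (year_i, vec_i) in enumerate(docs):
        others = [(cosine(vec_i, vec_j), year_j)
                  for j, (year_j, vec_j) in enumerate(docs) if j != i]
        others.sort(key=lambda pair: pair[0], reverse=True)
        nearest = others[:k]  # the k most similar documents
        later = sum(1 for _, year in nearest if year > year_i)
        earlier = sum(1 for _, year in nearest if year < year_i)
        scores.append(later - earlier)  # same-year neighbors count as neither
    return scores


def normalized_lead_lag(docs, k):
    """Divide each raw score by the average raw score of its year."""
    raw = raw_lead_lag(docs, k)
    by_year = defaultdict(list)
    for (year, _), score in zip(docs, raw):
        by_year[year].append(score)
    avg = {year: sum(s) / len(s) for year, s in by_year.items()}
    # Assumption: a zero yearly average leaves the scaled score undefined.
    return [score / avg[year] if avg[year] else None
            for (year, _), score in zip(docs, raw)]
```

Calling `normalized_lead_lag(docs, k)` on a list of `(year, tfidf_vector)` pairs yields one scaled score per document, in input order.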

Identifying Key Documents: Biobase Results

| Score | Year | Cites | Paper Title and Authors |
|---|---|---|---|
| 1.153 | 2000 | 45 | “characterization of a new human b7-related protein: b7rp-1 is the ligand to the co-stimulatory protein icos” by t. boone, d. brankow, m.a. coccia, t. dai, j. delaney, h. han, t. horan, h. hui, s.d. hkare, t. kohno, r. manoukian, k. miner, j. pistillo, m. sonnenberg, j.s. whoriskey, s.k. yoshinaga, m. zhang |
| 1.082 | 2000 | 2 (168) | “mycobacterium tuberculosis and human macrophage: the bacillus with ‘environment-sensing’” by v. colizzi, m. fraziano, f. mariani |
| 1.082 | 2000 | 242 | “taci and bcma are receptors for a tnf homologue implicated in b-cell autoimmune disease” by h. blumberg, c.h. clegg, s.r. dillon, r. enselman, k. foley, d. foster, j.a. gross, a. grossman, k. harrison, h. haugen, j. johnston, w. kindsvogel, a. littau, c. lofton-day, k. madden, m. moore, s. mudri, j. parrish-novak, w. xu |
| 1.082 | 2000 | 111 | “mouse inducible costimulatory molecule (icos) expression is enhanced by cd28 costimulation and regulates differentiation of cd4 t cells” by v.a. boussiotis, t.t. chang, p.-j. chen, t. chernova, j.s. duke-cohan, e.a. greenfield, c. jabs, v.k. kuchroo, v. ling, a.e. lumelsky, n. malenkovich, a.j. mcadam, a.h. sharpe |

Identifying Key Documents: NIPS Results

| Score | Year | Cites | Paper Title and Authors |
|---|---|---|---|
| 1.167 | 1996 | 128 | “improving the accuracy and speed of support vector machines” by chris j.c. burges, b. schoelkopf |
| 1.128 | 1999 | 17 (466) | “using analytic qp and sparseness to speed training of support vector machines” by john c. platt |
| 0.986 | 1999 | 18 | “regularizing adaboost” by gunnar raetsch, takashi onoda, klaus-robert mueller |
| 0.953 | 1996 | 41 (3711) | “support vector method for function approximation, regression, and signal processing” by v. vapnik, s. golowich, a. smola |
| 0.945 | 1998 | 27 | “training methods for adaptive boosting of neural networks” by holger schwenk, yoshua bengio |
| 0.945 | 1997 | 3 | “modeling complex cells in an awake macaque during natural image viewing” by william e. vinje, jack l. gallant |
| 0.934 | 1998 | 17 | “em optimization of latent-variable density models” by chris bishop, markus svensen, chris williams |
| 0.934 | 1995 | 584 | “a new learning algorithm for blind signal separation” by s. amari, a. cichocki, h. h. yang |

Aggregated Lead/Lag Index: Goals

• Identify “leading” authors
  – Who are the major players?
  – Which authors most influence future work?
• Constraints
  – Analysis is not limited to scientific papers, so it must work without citation data
  – Only text and timestamp may be used

Aggregated Lead/Lag Index: Methods

• Author Lead/Lag Index
  – Assume author a has documents d1, …, dn
  – Compute the scaled lead/lag score for each document and average: $LL_{norm}(a) = \frac{1}{n}\big(LL_{norm}(d_1) + \dots + LL_{norm}(d_n)\big)$
  – Compute the variance v of the per-document scores and rank authors by $LL_{norm}(a) - 2\sqrt{v/n}$
  – Use smoothing to avoid small-sample artifacts
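The ranking rule above amounts to a lower confidence bound on an author's mean score. The sketch below is a hedged illustration: the function names are hypothetical, the variance is taken as the population variance (the slides do not say which estimator is used), and the smoothing step is omitted because the slides do not specify it.

```python
import math


def author_rank_score(doc_scores):
    """Mean per-document score minus two standard errors."""
    n = len(doc_scores)
    mean = sum(doc_scores) / n
    # Assumption: population variance (divide by n), not sample variance.
    variance = sum((s - mean) ** 2 for s in doc_scores) / n
    return mean - 2.0 * math.sqrt(variance / n)


def rank_authors(scores_by_author):
    """Return author names sorted from highest to lowest rank score."""
    return sorted(scores_by_author,
                  key=lambda author: author_rank_score(scores_by_author[author]),
                  reverse=True)
```

An author with a single document has zero variance, so the unsmoothed bound equals that one score. This is the kind of small-sample artifact the smoothing bullet is meant to address.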

Aggregated Lead/Lag Index: Biobase Results

Aggregated Lead/Lag Index: NIPS Results

Temporal Cluster Histograms: Goals

• What are the main topics in a collection?
  – Identify key topics.
  – What fraction of documents is on each topic?
• How do topics develop?
  – What are new emerging topics?
  – Which topics are fading?
  – When did particular topics peak in popularity?

Temporal Cluster Histograms: Methods

• K-Means Clustering
  – Measure distance via TFIDF cosine on text vectors
  – Different k = 7, 13, 30
  – 10 runs; select the run with the least squared error
• Visualization
  – Find how many documents are in each cluster in each year
  – Plot these numbers in a “stacked” histogram
  – Label clusters with the 5 words of highest value in the centroid vector
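The clustering and counting steps above could look roughly like the following pure-Python sketch. It assumes the TFIDF vectors are already computed and stored as sparse dictionaries. It normalizes them to unit length and runs ordinary k-means on them, since squared Euclidean distance between unit vectors equals 2 − 2·cosine. All function names, the iteration count, and the seeding scheme are illustrative assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def unit(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def sq_dist(u, v):
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in set(u) | set(v))


def centroid(vecs):
    total = defaultdict(float)
    for vec in vecs:
        for term, w in vec.items():
            total[term] += w
    return {term: w / len(vecs) for term, w in total.items()}


def assign_all(vecs, centers):
    return [min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            for v in vecs]


def kmeans(vecs, k, rng, iters=20):
    """One k-means run; returns (assignment, centers, squared error)."""
    centers = rng.sample(vecs, k)
    for _ in range(iters):
        assign = assign_all(vecs, centers)
        for c in range(k):
            members = [v for v, a in zip(vecs, assign) if a == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = centroid(members)
    assign = assign_all(vecs, centers)
    error = sum(sq_dist(v, centers[a]) for v, a in zip(vecs, assign))
    return assign, centers, error


def best_of_runs(vecs, k, runs=10, seed=0):
    """Repeat k-means and keep the run with the least squared error."""
    rng = random.Random(seed)
    vecs = [unit(v) for v in vecs]
    return min((kmeans(vecs, k, rng) for _ in range(runs)),
               key=lambda run: run[2])


def yearly_counts(years, assign):
    """counts[year][cluster]: the numbers behind a stacked histogram."""
    counts = defaultdict(Counter)
    for year, cluster in zip(years, assign):
        counts[year][cluster] += 1
    return counts


def label(center, n=5):
    """The n highest-weighted terms of a centroid."""
    ranked = sorted(center.items(), key=lambda p: (-p[1], p[0]))
    return [term for term, _ in ranked[:n]]
```

The result of `yearly_counts` can be passed to any plotting library to draw the stacked histogram, and `label` supplies the five-word legend entry for each cluster.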

Temporal Cluster Histograms: Biobase Results

12: /inf >, inf >, 2 <, <, 4 <
11: dc, csf, dcs, gm, cells
10: il, 12, gamma, cells, production
9: cells, cell, protein, hla, mhc
8: hiv, infected, cd4, virus, aids
7: alpha, &, tnf, ifn, gamma
6: isolates, pylori, strains, pcr, infection
5: patients, ra, sle, disease, hla
4: mice, vaccine, responses, immunization, immune
3: transplantation, patients, graft, transplant, rejection
2: asthma, ige, allergic, allergen, allergens
1: hcv, hepatitis, hbv, rna, liver
0: <, /sup >, sup >, cells, cd4<

[Figure: two stacked histograms for 1/6 of Biobase (15 papers min), one in absolute counts (y-axis: Number of Papers, 0 to 3000) and one in percent (0% to 100%); x-axis: Year, 1 to 4]

Temporal Cluster Histograms: NIPS Results

[Figure: stacked histogram “NIPS k-means clusters (k=13)”; x-axis: Year, 1 to 14; y-axis: Number of Papers, 0 to 180]

12: chip, circuit, analog, voltage, vlsi
11: kernel, margin, svm, vc, xi
10: bayesian, mixture, posterior, likelihood, em
9: spike, spikes, firing, neuron, neurons
8: neurons, neuron, synaptic, memory, firing
7: david, michael, john, richard, chair
6: policy, reinforcement, action, state, agent
5: visual, eye, cells, motion, orientation
4: units, node, training, nodes, tree
3: code, codes, decoding, message, hints
2: image, images, object, face, video
1: recurrent, hidden, training, units, error
0: speech, word, hmm, recognition, mlp

Topic Drift: Goals

• Identify change within topics
  – Did the topic drift in focus?
  – Did new ideas get introduced without abandoning a topic?

Topic Drift: Methods

• Identify drift in description of topics
  – Use clusters from k-means clustering
  – Compute centroids of each cluster for
    • The earlier half of the years
    • The later half of the years
  – Extract the 5 highest scoring terms in each centroid and time period
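A possible sketch of the drift computation, given cluster assignments from a prior k-means run. The names are hypothetical, and the treatment of an odd number of years (the middle year goes to the later half) is an assumption the slides do not settle.

```python
from collections import defaultdict


def top_terms(vecs, n=5):
    """Top-n terms of the centroid of a list of sparse vectors."""
    if not vecs:
        return []
    total = defaultdict(float)
    for vec in vecs:
        for term, w in vec.items():
            total[term] += w
    center = {term: w / len(vecs) for term, w in total.items()}
    ranked = sorted(center.items(), key=lambda p: (-p[1], p[0]))
    return [term for term, _ in ranked[:n]]


def topic_drift(years, assign, vecs, n=5):
    """Per cluster, the top terms for the earlier and later half of the years."""
    distinct = sorted(set(years))
    # Assumption: with an odd number of years, the middle year is "late".
    split = distinct[len(distinct) // 2]
    halves = defaultdict(lambda: {"early": [], "late": []})
    for year, cluster, vec in zip(years, assign, vecs):
        halves[cluster]["early" if year < split else "late"].append(vec)
    return {cluster: {period: top_terms(members, n)
                      for period, members in periods.items()}
            for cluster, periods in halves.items()}
```

Comparing the `"early"` and `"late"` term lists for a cluster shows which vocabulary entered or left the topic while the cluster itself persisted.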

Topic Drift: Biobase Results

Topic Drift: NIPS Results

Conclusions

• Scaled document lead/lag index
  – Works well without citation information
  – Ideas vs. citations (e.g. Platt’s paper)
• Author lead/lag index finds key authors
  – Can find influential authors
• K-means finds meaningful clusters
  – Clusters correspond to known topics in NIPS
  – Effectively shows development of topics

Areas for Future Work

• Temporal flow clustering
  – Determine flow in stream of ideas
  – Analyze how topics “fork” and “merge” over time
  – Explicitly exploit time in distance metric
• Characterizing author behavior
  – Are there patterns of how authors move from topic to topic?
  – What marks emerging trends early? (e.g. prominent authors)
  – Who are early vs. late adopters of trends?
  – What characterizes authors that publish on few/many topics?
• Improved clustering algorithms
  – Guidance in determining distance metric, number of clusters
  – Meta-clustering
  – Burst (new topic) detection and how bursts align with events
• Corpora beyond scientific literature
  – Email, web pages, news, etc.