

Learning Hawkes Processes from Short Doubly-Censored Event Sequences

Hongteng Xu 1 Dixin Luo 2 Hongyuan Zha 1

Abstract

Many real-world applications require robust algorithms to learn point processes based on a type of incomplete data — the so-called short doubly-censored (SDC) event sequences. We study this critical problem of quantitative asynchronous event sequence analysis under the framework of Hawkes processes by leveraging the idea of data synthesis. Given SDC event sequences observed in a variety of time intervals, we propose a sampling-stitching data synthesis method, sampling predecessors and successors for each SDC event sequence from potential candidates and stitching them together to synthesize long training sequences. The rationality and the feasibility of our method are discussed in terms of arguments based on likelihood. Experiments on both synthetic and real-world data demonstrate that the proposed data synthesis method indeed improves learning results for both time-invariant and time-varying Hawkes processes.

1. Introduction

Real-world interactions among multiple entities are often recorded as asynchronous event sequences, such as user behaviors in social networks, job hunting and hopping among companies, and diseases and their complications. The entities or event types in the sequences often exhibit self-triggering and mutually-triggering patterns. For example, a tweet of a twitter user may trigger further responses from her friends (Zhao et al., 2015). A disease of a patient may trigger other complications (Choi et al., 2015). Hawkes processes, an important kind of temporal point process model (Hawkes & Oakes, 1974), have the capability to describe such triggering patterns quantitatively and capture the infectivity network of the entities.

1 Georgia Institute of Technology, Atlanta, Georgia, USA. 2 University of Toronto, Toronto, Ontario, Canada. Correspondence to: Hongteng Xu <[email protected]>, Dixin Luo <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Figure 1. For each SDC sequence, i.e., the incomplete disease history of a person over his lifetime, we design a mechanism to select other SDC sequences as predecessors/successors and synthesize a long sequence. Then, we can estimate the unobserved triggering patterns among diseases, i.e., the red dashed arrows, and construct a disease network.

Despite the usefulness of Hawkes processes, robust learning of Hawkes processes often needs many event sequences with events occurring over a long observation window. Unfortunately, in many important practical applications the observation window is likely to be very short and sequence-specific, i.e., within an imagined universal window, each sequence is only observed within a corresponding short sub-interval, and the events outside this sub-interval are not observed — we call them short doubly-censored (SDC) event sequences. Existing learning algorithms for Hawkes processes, when applied directly to SDC sequences, may suffer from over-fitting; what is worse, the triggering patterns between historical events and current ones are lost, so the triggering patterns learned from SDC event sequences are often unreliable. This problem is a thorny issue in several practical applications, especially those with time-varying triggering patterns. For example, the disease networks of patients should evolve with age. However, it is very hard to track and record people's diseases on a life-time scale. Instead, we can often obtain only a few admissions (even just one) per person at a hospital over one or two years, which are exactly SDC event sequences. Therefore, it is highly desirable to develop a method for learning Hawkes processes with long-time support from a collection of SDC event sequences.

In this paper, we propose a novel and simple data synthesis method to enhance the robustness of learning algorithms for Hawkes processes. Fig. 1 illustrates the principle of our method. Given a set of SDC event sequences, we sample a predecessor for each event sequence from potential candidates and stitch them together as new training data. In the sampling step, the distribution of predecessors (and successors) is estimated according to the similarities between the current sequence and its candidates, where similarity is defined based on the time stamps and (optional) features of the event sequences. We analyze the rationality and the feasibility of our data synthesis method and discuss the necessary conditions for using it. Experimental results show that our data synthesis method indeed helps to improve the robustness of various learning algorithms for Hawkes processes. Especially in the case of time-varying Hawkes processes, applying our method in the learning phase achieves much better results than learning directly from SDC event sequences, which is meaningful for many practical applications, e.g., constructing dynamical disease networks and learning long-term infectivity among IT companies.

2. Related Work

An event sequence can be represented as $s = \{(t_i, c_i)\}_{i=1}^M$, where the time stamps $t_i$ are in an observation window $[T_b, T_e]$ and the events $c_i$ are in a set of event types $\mathcal{C} = \{1, ..., C\}$. A point process $\{N_c\}_{c\in\mathcal{C}}$ is a random process model taking event sequences as instances, where $N_c = \{N_c(t)\,|\,t \in [T_b, T_e]\}$ and $N_c(t)$ is the number of type-$c$ events occurring at or before time $t$. A point process can be characterized via its conditional intensity function $\lambda_c(t) = \mathbb{E}[dN_c(t)\,|\,\mathcal{H}^{\mathcal{C}}_t]/dt$, where $c \in \mathcal{C}$ and $\mathcal{H}^{\mathcal{C}}_t = \{(t_i, c_i)\,|\,t_i < t,\, c_i \in \mathcal{C}\}$ is the set of historical events. The intensity represents the expected instantaneous rate of events given the historical record (Daley & Vere-Jones, 2007). It is often modeled with parameters $\Theta$ to capture the phenomena of interest, e.g., self-triggering (Hawkes & Oakes, 1974) or self-correcting (Xu et al., 2015) behavior. Based on $\{\lambda_c(t)\}_{c\in\mathcal{C}}$, the likelihood of an event sequence $s$ is

$$\mathcal{L}(s; \Theta) = \prod_i \lambda_{c_i}(t_i)\, \exp\Big(-\sum_c \int_{T_b}^{T_e} \lambda_c(s)\, ds\Big). \quad (1)$$

Hawkes Processes. Hawkes processes (Hawkes & Oakes, 1974) have a particular form of intensity:

$$\lambda_c(t) = \mu_c + \sum_{c'=1}^{C} \int_0^t \phi_{cc'}(t, s)\, dN_{c'}(s), \quad (2)$$

where $\mu_c$ is the exogenous base intensity independent of the history, while $\int_0^t \phi_{cc'}(t, s)\, dN_{c'}(s)$ is the endogenous intensity capturing the influence of historical events on type-$c$ ones at time $t$ (Xu et al., 2016a). Here, $\phi_{cc'}(t, s) \geq 0$ is called the impact function. It quantifies the influence of a type-$c'$ event at time $s$ on a type-$c$ event at time $t$. Hawkes processes provide us with a physically-meaningful model to capture the infectivity among various events, and have been used in social network analysis (Zhou et al., 2013b; Zhao et al., 2015), behavior analysis (Yang & Zha, 2013; Luo et al., 2015) and financial analysis (Bacry et al., 2013). However, the methods in these references assume that the impact function is shift-invariant (i.e., $\phi_{cc'}(t, s) = \phi_{cc'}(t - s)$, $t \geq s$), which limits their applicability on long time scales. Recently, the time-dependent Hawkes process (TiDeH) in (Kobayashi & Lambiotte, 2016) and the neural network-based Hawkes process in (Mei & Eisner, 2016) learn very flexible Hawkes processes with complicated intensity functions. Because they depend heavily on the size and the quality of the data, they may fail in the case of SDC event sequences.
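To make the definitions concrete, here is a minimal sketch (not the paper's code) that evaluates the intensity (2) and the log-likelihood implied by (1) for a one-dimensional Hawkes process with the common shift-invariant exponential kernel $\phi(t-s) = a\beta e^{-\beta(t-s)}$; the function names and the toy parameter values are our own illustrations.

```python
import numpy as np

def intensity(t, events, mu, a, beta):
    """Intensity (2) for a 1-D Hawkes process with exponential kernel
    phi(t - s) = a * beta * exp(-beta * (t - s))."""
    past = events[events < t]
    return mu + a * beta * np.sum(np.exp(-beta * (t - past)))

def log_likelihood(events, T_b, T_e, mu, a, beta):
    """Log of (1): sum of log-intensities at the observed events minus
    the integrated intensity (compensator) over [T_b, T_e]."""
    log_term = sum(np.log(intensity(t, events, mu, a, beta)) for t in events)
    # The exponential kernel integrates in closed form:
    # int_{t_i}^{T_e} a*beta*exp(-beta(s - t_i)) ds = a * (1 - exp(-beta*(T_e - t_i)))
    compensator = mu * (T_e - T_b) + a * np.sum(1.0 - np.exp(-beta * (T_e - events)))
    return log_term - compensator

# Toy usage with made-up events and parameters.
events = np.array([0.5, 1.2, 1.3, 3.7])
print(log_likelihood(events, T_b=0.0, T_e=5.0, mu=0.2, a=0.5, beta=1.0))
```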

Learning from Imperfect Observations. In practice, we often need to learn sequential models from imperfect observations (e.g., interleaved (Xu et al., 2016b), aggregated (Luo et al., 2016) and extremely-short sequences (Xu et al., 2016c)). Multiple imputation (MI) (Rubin, 2009) is a general framework for building surrogate observations from the current model. For time series, the bootstrap method (Efron, 1982; Politis & Romano, 1994; Goncalves & Kilian, 2004) and its variants (Paparoditis & Politis, 2001; Guan & Loh, 2007) have been used to improve learning results when observations are insufficient. In survival analysis, many techniques have been developed to deal with truncated and censored data (Turnbull, 1974; De Gruttola & Lagakos, 1989; Klein & Moeschberger, 2005; Van den Berg & Drepper, 2016). For point processes, global (Streit, 2010) or local (Fan, 2009) maximum likelihood estimators (MLE) are used to learn Poisson processes. Nonparametric approaches for non-homogeneous Poisson processes use the pseudo MLE (Sun & Kalbfleisch, 1995) or the full MLE (Wellner & Zhang, 2000). The bootstrap methods above have also been used to learn point processes (Cowling et al., 1996; Guan & Loh, 2007; Kirk & Stumpf, 2009). To learn Hawkes processes robustly, structural constraints, e.g., low-rank (Luo et al., 2015) and group-sparse regularizers (Xu et al., 2016a), have been introduced. However, none of these methods consider the case of SDC event sequences for Hawkes processes.

3. Learning from SDC Event Sequences

Suppose that the original complete event sequences are in a long observation window. In practice, however, the observation window might be segmented into several intervals $\{[T^n_b, T^n_e]\}_{n=1}^N$, and we can only observe $K_n$ SDC sequences $\{s^n_k\}_{k=1}^{K_n}$ in the $n$-th interval, $n = 1, ..., N$. Although we can still apply the maximum likelihood estimator to learn Hawkes processes, i.e.,

$$\min_\Theta\; -\sum_{n,k} \log\mathcal{L}(s^n_k; \Theta), \quad (3)$$

the SDC event sequences would lead to over-fitting and the loss of triggering patterns. Can we do better in such a situation? In this work, we propose a data synthesis method based on a sampling-stitching mechanism, which extends SDC event sequences to longer ones and enhances the robustness of learning algorithms.


3.1. Data Synthesis via Sampling-Stitching

Denote the $k$-th SDC event sequence in the $n$-th interval as $s^n_k$. Because its predecessor is unavailable, if we learn the parameters of our model via (3) directly, we actually impose a strong assumption on our data: there is no event happening before $s^n_k$ (or previous events are too far away from $s^n_k$ to have influence on it). Obviously, this assumption is questionable — it is likely that there are influential events happening before $s^n_k$. A more reasonable strategy is enumerating potential predecessors and maximizing the expected log-likelihood over the whole observation window:

$$\min_\Theta\; -\sum_{n,k} \mathbb{E}_{s\sim \mathcal{H}^{\mathcal{C}}_{T^n_b}}\big[\log\mathcal{L}([s, s^n_k]; \Theta)\big]. \quad (4)$$

Here $\mathbb{E}_{x\sim D}[f(x)]$ represents the expectation of a function $f(x)$ with random variable $x$ obeying a distribution $D$, $s \sim \mathcal{H}^{\mathcal{C}}_{T^n_b}$ ranges over all possible histories before $T^n_b$, and $\mathcal{L}([s, s^n_k]; \Theta)$ is the likelihood of the stitched sequence $[s, s^n_k]$.

The stitched sequence $[s, s^n_k]$ can be generated by sampling an SDC sequence $s$ from the previous $1$st, ..., $(n-1)$-th intervals and stitching $s$ to $s^n_k$. The sampling process obeys a probabilistic distribution over the stitched sequences. Given $s^n_k$, we can compute its similarity to a potential predecessor $s^{n'}_{k'}$ observed in $[T^{n'}_b, T^{n'}_e]$ as

$$w(s^{n'}_{k'}, s^n_k) = \begin{cases} 0, & T^{n'}_e \geq T^n_b,\\ S(T^n_b, T^{n'}_e)\, S(f^n_k, f^{n'}_{k'}), & \text{otherwise.} \end{cases} \quad (5)$$

Here, $S(a, b) = \exp(-\|b - a\|_2^2/\sigma_s)$ is a predefined similarity function with parameter $\sigma_s$, and $f^n_k$ is the feature of $s^n_k$, which is available in some applications. Note that features are optional — even if the feature of a sequence is unavailable, we can still define the similarity measurement purely based on time stamps. The normalized $\{w(s^{n'}_{k'}, s^n_k)\}$ provide us with the probability that $s^{n'}_{k'}$ appears before $s^n_k$, i.e., $p(s^{n'}_{k'}\,|\,s^n_k) \propto w(s^{n'}_{k'}, s^n_k)$. Then, we can sample $s^{n'}_{k'}$ according to the categorical distribution, i.e., $s^{n'}_{k'} \sim \mathrm{Category}(w(\cdot, s^n_k))$.
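As a minimal illustration of this step, the sketch below computes the unnormalized weights in (5) for one SDC sequence against a pool of candidates and draws a predecessor from the resulting categorical distribution. The data structures (a sequence as a dict with window bounds `T_b`, `T_e` and an optional feature vector `feat`) are our own assumption, not the paper's.

```python
import numpy as np

def sim(a, b, sigma_s=1.0):
    """Similarity S(a, b) = exp(-||b - a||_2^2 / sigma_s)."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-np.sum((b - a) ** 2) / sigma_s)

def predecessor_weights(seq, candidates):
    """Unnormalized weights w(s', s) in (5): zero for candidates that
    end after seq begins, otherwise a product of time and feature similarities."""
    w = np.zeros(len(candidates))
    for i, cand in enumerate(candidates):
        if cand["T_e"] >= seq["T_b"]:
            continue  # candidate overlaps the current window: weight stays 0
        w[i] = sim(seq["T_b"], cand["T_e"])
        if seq.get("feat") is not None and cand.get("feat") is not None:
            w[i] *= sim(seq["feat"], cand["feat"])  # optional feature term
    return w

def sample_predecessor(seq, candidates, rng):
    """Draw a predecessor from Category(w(., seq)), or None if no candidate fits."""
    w = predecessor_weights(seq, candidates)
    if w.sum() == 0:
        return None
    return candidates[rng.choice(len(candidates), p=w / w.sum())]
```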

We can apply such a sampling-stitching mechanism $L$ times iteratively to the SDC sequences, in both the backward and forward directions, and obtain long stitched event sequences. Specifically, we represent a stitched event sequence as $s_{\mathrm{stitch}} = [s_1, ..., s_{2L+1}]$, $s_l \in \{s^n_k\}$, $l = 1, ..., 2L+1$, whose probability is

$$p(s_{\mathrm{stitch}}) \propto \prod_{l=1}^{2L} w(s_l, s_{l+1}). \quad (6)$$

Note that our data synthesis method naturally admits two variants. When the starting (or ending) point of the time window is unavailable, we use the time stamp of the first (or last) event of the SDC sequence instead. Additionally, we can relax the constraint $T^{n'}_e \leq T^n_b$ in (5) and allow an SDC sequence to overlap with its predecessor/successor. In this case, we randomly preserve the overlapping part from either the sequence itself or its predecessor/successor before applying our sampling-stitching method. These two variants make our data synthesis method practical; they are used in the following experiments on real-world data.
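Continuing the previous sketch, the loop below builds one stitched chain per (6), growing it backward and forward up to $L$ times and tracking its unnormalized probability. It reuses the hypothetical `sample_predecessor` and `predecessor_weights` helpers above; the successor step simply applies (5) with the roles of the two sequences swapped.

```python
import numpy as np

def synthesize(seq, pool, L, rng):
    """Stitch up to L predecessors and L successors onto one SDC sequence.
    Returns the stitched chain and its unnormalized probability per (6)."""
    chain, prob = [seq], 1.0
    left = right = seq
    for _ in range(L):
        # Backward direction: sample a predecessor for the current left end.
        pred = sample_predecessor(left, pool, rng)
        if pred is not None:
            prob *= predecessor_weights(left, [pred])[0]  # w(pred, left)
            chain.insert(0, pred)
            left = pred
        # Forward direction: candidates for which `right` is an admissible
        # predecessor, i.e., w(right, c) > 0.
        succs = [c for c in pool if predecessor_weights(c, [right])[0] > 0]
        if succs:
            w = np.array([predecessor_weights(c, [right])[0] for c in succs])
            succ = succs[rng.choice(len(succs), p=w / w.sum())]
            prob *= predecessor_weights(succ, [right])[0]  # w(right, succ)
            chain.append(succ)
            right = succ
    return chain, prob
```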

3.2. Justification

After applying our data synthesis method, we obtain many stitched event sequences, which can be used as instances for estimating $\mathbb{E}_{s\sim \mathcal{H}^{\mathcal{C}}_{T^n_b}}[\log\mathcal{L}([s, s^n_k]; \Theta)]$. Specifically, taking advantage of the stitched sequences, we can rewrite the learning problem in (4) approximately as

$$\min_\Theta\; -\sum_{s_{\mathrm{stitch}}\in\mathcal{S}} p(s_{\mathrm{stitch}}) \log\mathcal{L}(s_{\mathrm{stitch}}; \Theta), \quad (7)$$

which is actually a minimum cross-entropy estimation. $p(s_{\mathrm{stitch}})$ represents the "true" probability that the stitched sequence happens, which is estimated via the predefined similarity measurement and the sampling mechanism. The likelihood $\mathcal{L}(s_{\mathrm{stitch}}; \Theta)$ represents the "unnatural" probability that the stitched sequence happens, which is estimated based on the definition in (1). Our data synthesis method takes advantage of the information of time stamps and (optional) features and makes $p(s_{\mathrm{stitch}})$ suitable for practical situations. For example, the likelihood of a sequence generally decreases as the observation window grows. The proposed probability $p(s_{\mathrm{stitch}})$ follows the same pattern — according to (6), the longer a stitched sequence is, the smaller its probability becomes.

The set of all possible stitched sequences, i.e., the $\mathcal{S}$ in (7), is very large: its cardinality is $|\mathcal{S}| = \mathcal{O}(\prod_{n=1}^N K_n)$. In practice, we cannot and do not need to enumerate all possible combinations. An empirical choice is to make the number of stitched sequences comparable to that of the original SDC event sequences, i.e., to generate $\mathcal{O}(\sum_{n=1}^N K_n)$ stitched sequences. In the following experiments, we apply just $U = 5$ trials and generate 5 stitched sequences for each original SDC event sequence, which achieves a trade-off between computational complexity and performance.
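Under the same assumptions as the sketches above (with `pool` the list of all SDC sequences and `synthesize` the hypothetical helper from Section 3.1), generating the training set then amounts to a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
U = 5  # stitched sequences generated per original SDC sequence
stitched_set = [synthesize(s, pool, L=1, rng=rng) for s in pool for _ in range(U)]
# Each entry is a (chain, unnormalized probability) pair; normalizing the
# probabilities over the generated set gives the weights p_n used later in (10).
```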

3.3. Feasibility

It should be noted that our data synthesis method is only suitable for those complicated point processes whose historical events have influence on current and future ones. Specifically, we analyze the feasibility of our method for several typical point processes.

Poisson Processes. Our data synthesis method cannot improve learning results if the event sequences are generated via Poisson processes. For Poisson processes, the happening rate of future events is independent of historical events. In other words, the intensity function of each interval can be learned independently based on the SDC event sequences, and the stitched sequences do not provide any additional information.

Hawkes Processes. For Hawkes processes, whose intensity function is defined in (2), our data synthesis method generally enhances the robustness of the learning algorithm. In particular, consider a "long" event sequence generated via a Hawkes process in the time window $[T_b, T_e]$. If we divide the time window into two intervals, $[T_b, T]$ and $(T, T_e]$, the intensity function corresponding to the second interval can be written as

$$\lambda_c(t) = \mu_c + \sum_{t_i \leq T} \phi_{cc_i}(t, t_i) + \sum_{T < t_i \leq T_e} \phi_{cc_i}(t, t_i). \quad (8)$$

If the events in the first interval are unobserved, we just have an SDC event sequence, and the second term in (8) is unavailable. Learning Hawkes processes directly from the SDC event sequence ignores the information in this term, which has a negative influence on learning results. Our data synthesis method leverages the information from other potential predecessors and generates multiple candidate long sequences. As a result, we obtain multiple intensity functions sharing the second interval and maximize the weighted sum of their log-likelihood functions (i.e., an estimated expectation of the log-likelihood of the real long sequence), as in (7).

Compared with learning from SDC event sequences directly, applying our data synthesis method improves learning results in general, unless the term $\sum_{t_i \leq T} \phi_{cc_i}(t, t_i)$ is negligible. Specifically, we can model the impact functions $\{\phi_{cc'}(t, s)\}_{c,c'\in\mathcal{C}}$ of Hawkes processes with a basis representation:

$$\phi_{cc'}(t, s) = \underbrace{\psi_{cc'}(t)}_{\text{Infectivity}} \times \underbrace{g(t - s)}_{\text{Triggering kernel}} = \sum_{m=1}^M a_{cc'm}\, \kappa_m(t)\, g(t - s). \quad (9)$$

Here, we decompose each impact function into two parts: 1) the infectivity $\psi_{cc'}(t) = \sum_{m=1}^M a_{cc'm}\kappa_m(t)$ represents the infectivity of event type $c'$ on $c$ at time $t$;¹ 2) the triggering kernel $g(t) = \exp(-\beta t)$ measures the time decay of infectivity. It means that the infectivity of a historical event on a current one decays exponentially with the temporal distance between them. When $\beta$ is very large, $\phi_{cc'}(t, s)$ decays rapidly as $t - s$ grows, and events that happened long ago can be ignored. In such a situation, our data synthesis method is unable to improve learning results.

¹When $M = 1$ and $\kappa_m(t) \equiv 1$, we obtain the simplest time-invariant Hawkes process. Relaxing the shift-invariant assumption, i.e., $M > 1$ and $\kappa_m(t)$ Gaussian, we obtain a flexible time-varying Hawkes process model.
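A minimal sketch of the parameterization (9), assuming Gaussian bases and the exponential kernel above; the basis centers, bandwidth, and coefficient values are placeholders, not values from the paper.

```python
import numpy as np

def kappa(t, centers, sigma_kappa):
    """Gaussian bases kappa_m(t) = exp(-(t - t_m)^2 / sigma_kappa)."""
    return np.exp(-(t - centers) ** 2 / sigma_kappa)

def impact(t, s, a_ccm, centers, beta=0.2, sigma_kappa=25.0):
    """Impact function (9): phi_{cc'}(t, s) = psi_{cc'}(t) * g(t - s),
    with infectivity psi_{cc'}(t) = sum_m a_{cc'm} kappa_m(t)
    and triggering kernel g(t) = exp(-beta * t)."""
    psi = np.dot(a_ccm, kappa(t, centers, sigma_kappa))
    return psi * np.exp(-beta * (t - s))

# Toy usage: M = 3 bases on a [0, 50] window, made-up coefficients.
centers = np.array([0.0, 25.0, 50.0])
a_ccm = np.array([0.1, 0.05, 0.2])
print(impact(t=30.0, s=28.0, a_ccm=a_ccm, centers=centers))
```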

4. Implementation for Hawkes Processes

A Hawkes process is a physically-interpretable model for many natural and social phenomena, and the model in (9) reflects many common properties of real-world event sequences. First, the infectivity among event types often changes smoothly in practice: in social networks, the interaction between two users changes smoothly rather than being established or blocked suddenly; in disease networks, the infectivity among diseases should change smoothly with the patient's age. Applying a Gaussian basis representation guarantees the smoothness of the infectivity function. Second, the triggering kernel measures the decay of infectivity over time. According to existing work, this decay is approximately exponential, which has been verified on many real-world data sets (Zhou et al., 2013a; Kobayashi & Lambiotte, 2016; Choi et al., 2015). To learn Hawkes processes from SDC event sequences, we combine our data synthesis method with an EM-based learning algorithm for Hawkes processes. Applying our data synthesis method, we obtain a set of stitched event sequences $\mathcal{S} = \{s_n\}$ and their appearance probabilities $\{p_n\}$, where $s_n = \{(t^n_i, c^n_i)_{i=1}^{I_n}\,|\,t^n_i \in [T^n_b, T^n_e],\, c^n_i \in \mathcal{C}\}$ and $p_n$ is calculated based on (5). According to (7) and (9), we can learn the target Hawkes process via

$$\min_{\mu\geq 0,\, A\geq 0}\; -\sum_{n=1}^{|\mathcal{S}|} p_n \log\mathcal{L}(s_n; \Theta) + \gamma R(A). \quad (10)$$

Here $\Theta = \{\mu, A\}$ represents the parameters of our model; the vector $\mu = [\mu_c]$ and the tensor $A = [a_{cc'm}]$ are nonnegative. Based on (1) and (9), the log-likelihood function is

$$\begin{aligned} \log\mathcal{L}(s_n; \Theta) &= \sum_{i=1}^{I_n} \log\lambda_{c_i}(t_i) - \sum_{c=1}^C \int_{T^n_b}^{T^n_e} \lambda_c(s)\, ds\\ &= \sum_{i=1}^{I_n} \log\Big[\mu_{c^n_i} + \sum_{j<i} g(\tau^n_{ij}) \sum_m a_{c^n_i c^n_j m}\, \kappa_m(t^n_i)\Big] - \sum_{c=1}^C \Big(\Delta_n \mu_c + \sum_m \sum_{i=1}^{I_n} \sum_{j\leq i} a_{c c^n_j m} G_{ij}\Big), \end{aligned}$$

where $\tau^n_{ij} = t^n_i - t^n_j$, $G_{ij} = \int_{t^n_i}^{t^n_{i+1}} \kappa_m(s)\, g(s - t^n_j)\, ds$, and $\Delta_n = T^n_e - T^n_b$. $R(A)$ is a regularizer on the parameters, with weight $\gamma$. Following existing work (Luo et al., 2015; Zhou et al., 2013a; Xu et al., 2016a), we assume the infectivity connections among different event types to be sparse and impose an $\ell_1$-norm regularizer on the coefficient tensor $A$, i.e., $R(A) = \|A\|_1 = \sum_{c,c',m} |a_{cc'm}|$.

We solve the problem via an EM algorithm. Specifically, when the sparse regularizer is applied, we take advantage of the ADMM method, introducing an auxiliary variable $Z = [z_{cc'm}]$ and a dual variable $U = [u_{cc'm}]$ for $A$ and rewriting the objective function in (10) as

$$-\sum_n p_n \log\mathcal{L}(s_n; \Theta) + 0.5\rho\|A - Z\|_F^2 + \rho\, \mathrm{tr}\big(U^\top(A - Z)\big) + \gamma\|Z\|_1.$$

Here $\rho$ controls the weight of the regularization terms and increases with the number of EM iterations, and $\mathrm{tr}(\cdot)$ computes the trace of a matrix. We then update $\{\mu, A\}$, $Z$, and $U$ alternately.

Update $\mu$ and $A$: Given the parameters in the $k$-th iteration, we apply Jensen's inequality to $-\sum_n \log\mathcal{L}(s_n; \Theta)$ and obtain a surrogate objective function for $\mu$ and $A$:

$$\begin{aligned} Q(\mu, A;\, \mu^k, A^k, Z^k, U^k) = -\sum_{n=1}^N p_n \Bigg\{ \sum_{i=1}^{I_n} \bigg[ &\sum_{j<i} \sum_{m=1}^M q_{ijm} \log\frac{g(\tau^n_{ij})\, a_{c^n_i c^n_j m}\, \kappa_m(t^n_i)}{q_{ijm}} + q_i \log\frac{\mu_{c^n_i}}{q_i}\\ &- \sum_{c=1}^C \sum_{m=1}^M \sum_{j\leq i} a_{c c^n_j m} G_{ij} \bigg] - \Delta_n \sum_{c=1}^C \mu_c \Bigg\} + 0.5\rho\|A - Z^k + U^k\|_F^2, \end{aligned}$$

where $q_i = \frac{\mu^k_{c^n_i}}{\lambda^k_{c^n_i}(t^n_i)}$, $q_{ijm} = \frac{g(\tau^n_{ij})\, a^k_{c^n_i c^n_j m}\, \kappa_m(t^n_i)}{\lambda^k_{c^n_i}(t^n_i)}$, and $\lambda^k_{c^n_i}(t^n_i)$ is calculated based on $\mu^k$ and $A^k$. We can then update $\mu$ and $A$ by solving $\frac{\partial Q}{\partial \mu} = 0$ and $\frac{\partial Q}{\partial A} = 0$. Both equations have closed-form solutions:

$$\mu^{k+1}_c = \frac{\sum_n p_n \sum_{c^n_i = c} q_i}{\sum_n p_n \Delta_n}, \qquad a^{k+1}_{cc'm} = \frac{\sqrt{B^2 - 4\rho C} - B}{2\rho}, \quad (11)$$

where

$$B = \rho(u^k_{cc'm} - z^k_{cc'm}) + \sum_n p_n \sum_{c^n_i = c,\, c^n_j = c',\, j\leq i} G_{ij}, \qquad C = -\sum_n p_n \sum_{c^n_i = c,\, c^n_j = c',\, j\leq i} q_{ijm}.$$

Update $Z$: Given $A^{k+1}$ and $U^k$, we update $Z$ by solving the following optimization problem:

$$\min_Z\; \gamma\|Z\|_1 + 0.5\rho\|A^{k+1} - Z + U^k\|_F^2.$$

Applying the soft-thresholding method, we have

$$Z^{k+1} = \mathcal{F}_{\gamma/\rho}(A^{k+1} + U^k), \quad (12)$$

where $\mathcal{F}_\eta(x) = \mathrm{sign}(x)\max\{|x| - \eta,\, 0\}$ is the soft-thresholding function.

Update $U$: Given $A^{k+1}$ and $Z^{k+1}$, we further update the dual variable as

$$U^{k+1} = U^k + (A^{k+1} - Z^{k+1}). \quad (13)$$

Algorithm 1 Learning Algorithm of Hawkes Processes
1: Input: Event sequences $\mathcal{S}$, threshold $V$, predefined parameters $\beta$, $\sigma_\kappa$, and $\gamma$.
2: Output: Parameters $A$ and $\mu$.
3: Initialize $A^k$ and $\mu^k$ randomly; $Z^k = A^k$, $U^k = 0$, $k = 0$, $\rho = 1$.
4: repeat
5:   Obtain $A^{k+1}$ and $\mu^{k+1}$ via (11).
6:   Obtain $Z^{k+1}$ via (12).
7:   Obtain $U^{k+1}$ via (13).
8:   $k = k + 1$, $\rho = 1.5\rho$.
9: until $\|A^k - A^{k-1}\|_F < V$
10: $A = A^k$, $\mu = \mu^k$.

In summary, Algorithm 1 shows the scheme of our learning method. Note that the algorithm can be applied to SDC event sequences directly by ignoring the $p_n$'s.
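The following sketch mirrors Algorithm 1 in code. The closed-form E/M updates (11) are abstracted behind a hypothetical `em_update` callback, while the ADMM steps (12)-(13), the $\rho$ schedule, and the stopping rule are written out; the soft-thresholding operator is the standard one from (12).

```python
import numpy as np

def soft_threshold(x, eta):
    """F_eta(x) = sign(x) * max(|x| - eta, 0), applied elementwise; see (12)."""
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

def learn(em_update, shape, gamma=1.0, V=1e-4, max_iter=200, seed=0):
    """Skeleton of Algorithm 1. `em_update(A, Z, U, rho)` is assumed to
    return (A_next, mu_next) via the closed-form solutions (11)."""
    rng = np.random.default_rng(seed)
    A = rng.random(shape)      # coefficient tensor [a_{cc'm}], shape (C, C, M)
    mu = rng.random(shape[0])  # base intensities [mu_c]
    Z, U, rho = A.copy(), np.zeros(shape), 1.0
    for _ in range(max_iter):
        A_prev = A
        A, mu = em_update(A, Z, U, rho)         # EM step: update mu, A via (11)
        Z = soft_threshold(A + U, gamma / rho)  # ADMM step: update Z via (12)
        U = U + (A - Z)                         # ADMM step: update U via (13)
        rho *= 1.5                              # increase the ADMM weight
        if np.linalg.norm(A - A_prev) < V:      # stop when ||A^k - A^{k-1}||_F < V
            break
    return A, mu
```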

5. Experiments

5.1. Implementation Details

To demonstrate the usefulness of our data synthesis method, we combine it with various learning algorithms for Hawkes processes and learn different models from SDC event sequences accordingly. For time-invariant Hawkes processes, we consider two learning algorithms — our EM-based learning algorithm and the least squares (LS) algorithm in (Eichler et al., 2016). For time-varying Hawkes processes, we apply our EM-based learning algorithm. In the following experiments, we use Gaussian basis functions $\kappa_m(t) = \exp(-(t - t_m)^2/\sigma_\kappa)$ with centers $t_m$ and bandwidth $\sigma_\kappa$. The number and the bandwidth of the bases can be set according to the basis selection method proposed in (Xu et al., 2016a). Additionally, we set $V = 10^{-4}$, $\gamma = 1$, and $\sigma_s = 1$ in our algorithm. Given SDC event sequences, we learn Hawkes processes in three ways: 1) learning directly from the SDC event sequences; 2) applying the stationary bootstrap method in (Politis & Romano, 1994) to generate more synthetic SDC event sequences and learning from them; 3) learning from the stitched sequences generated via our data synthesis method. For real-world data, whose SDC sequences do not have predefined starting and ending time stamps, we apply the variants of our method mentioned at the end of Section 3.1.

5.2. Synthetic Data

The synthetic SDC event sequences are generated as follows: 2000 complete event sequences are simulated in the time window $[0, 50]$ based on a 2-dimensional Hawkes process. The base intensities $\{\mu_c\}_{c=1}^2$ are randomly generated in the range $[0.1, 0.2]$. The parameter of the triggering kernel, $\beta$, is set to 0.2. For time-invariant Hawkes processes, we set the infectivity $\{\psi_{cc'}(t)\}$ to be 4 constants randomly generated in the range $[0, 0.2]$. For time-varying Hawkes processes, we set $\psi_{cc'}(t) = 0.2\cos(2\pi\omega_{cc'} t/50)$, where the $\{\omega_{cc'}\}$ are randomly generated in the range $[1, 4]$. Given these complete event sequences, we select 1000 sequences as the testing set and the remaining 1000 sequences as the training set. To generate SDC event sequences, we segment the time window into 10 intervals and randomly preserve the data in only one interval for each training sequence. We test all methods in 10 trials and compare them on the relative error between the real parameters $\Theta$ and their estimates $\hat{\Theta}$, i.e., $\|\hat{\Theta} - \Theta\|_2/\|\Theta\|_2$, and on the log-likelihood of the testing sequences.

Figure 2. Comparisons on log-likelihood and relative error: (a) our learning algorithm; (b) the least squares algorithm.

Time-invariant Hawkes Processes. Fig. 2 shows the comparisons on log-likelihood and relative error for the various methods. In Fig. 2(a) we find that, compared with the learning results based on complete event sequences, the results based on SDC event sequences degrade considerably (lower log-likelihood and higher relative error) because of the loss of information. Our data synthesis method improves the learning results consistently as the number of training sequences increases, and it also outperforms its bootstrap-based competitor (Politis & Romano, 1994). To demonstrate the universality of our method, besides our EM-based algorithm we apply it to the least squares (LS) algorithm (Eichler et al., 2016). Fig. 2(b) shows that our method also improves the learning results of the LS algorithm in the case of SDC event sequences. Both the log-likelihood and the relative error obtained from the stitched sequences approach the results learned from complete sequences.

Figure 3. Comparisons on log-likelihood and relative error.

Figure 4. Comparisons on infectivity functions $\{\psi_{cc'}(t)\}$. The number of original SDC sequences is 200, stitched via our method once.

Time-varying Hawkes Processes. Fig. 3 shows the comparisons on log-likelihood and relative error for the various methods. Similarly, the learning results are improved by applying our method — higher log-likelihood and lower relative error are obtained, and their standard deviations (the error bars associated with the curves) shrink. In this case, applying our method twice achieves better results than applying it once, which verifies the usefulness of the iterative framework in our sampling-stitching algorithm. Besides objective measurements, in Fig. 4 we visualize the infectivity functions $\{\psi_{cc'}(t)\}$. It is easy to see that the infectivity functions learned from stitched sequences (red curves) are comparable to those learned from complete event sequences (yellow curves), and both have small estimation errors w.r.t. the ground truth (black curves).

Note that our iterative framework is useful, especially for time-varying Hawkes processes, when the number of stitches is not very large. In our experiments, we fixed the maximum number of synthetic sequences. As a result, Figs. 2 and 3 show that the likelihoods first increase (i.e., stitching once or twice) and then decrease (i.e., stitching three or more times), while the relative errors change in the opposite direction w.r.t. the number of stitches. These phenomena imply that too many stitches introduce too much unreliable interdependency among events. Therefore, we fix the number of stitches to 2 in the following experiments.

Figure 5. Comparisons on log-likelihood: (a) LinkedIn data; (b) MIMIC III data.

5.3. Real-World Data

Besides synthetic data, we also test our method on real-world data, including LinkedIn data collected by ourselves and the MIMIC III data set (Johnson et al., 2016).

LinkedIn Data. The LinkedIn data we collected online contain the job hopping records of 3,000 LinkedIn users across 82 IT companies. For each user, her/his check-in time stamps corresponding to different companies are recorded as an event sequence, and her/his profile (e.g., education background, skill list, etc.) is treated as the feature associated with the sequence. For each person, the attractiveness of a company is time-varying. For example, a young man may be willing to join startup companies and increase his income by jumping between different companies; with increasing age, he would rather stay in the same company and achieve internal promotions. In other words, the infectivity network among different companies should be dynamic w.r.t. the age of employees. Unfortunately, most of the records on LinkedIn are short and doubly-censored — only the job hopping events of recent years are recorded. How to construct the dynamical infectivity network among different companies from such SDC event sequences is still an open problem.

Applying our data synthesis method, we can stitch different users' job hopping sequences based on their ages (time stamps) and their profiles (features) and learn the dynamical network of companies over time. In particular, we select 100 users with relatively complete job hopping histories (i.e., the range of their working experience is over 25 years) as the testing set. The remaining 2,900 users are randomly selected as the training set. The log-likelihood of the testing set in 10 trials is shown in Fig. 5(a). We find that the log-likelihood obtained from stitched sequences is higher than that obtained from the original SDC sequences or from the sequences generated via the bootstrap method (Politis & Romano, 1994), and its standard deviation is stably bounded.

Figure 6. Comparisons on infectivity functions $\{\psi_{cc'}(t)\}$: (a) LinkedIn data; (b) MIMIC III data.

Fig. 6(a) visualizes the adjacency matrices of the infectivity network. The properties of the network verify the rationality of our results: 1) the diagonal elements of the adjacency matrix are generally larger than the other elements, which reflects the fact that most employees would like to stay in the same company and achieve a series of internal promotions; 2) with the increase of age, the infectivity network becomes sparse, which reflects the fact that users are more likely to try different companies in the early stages of their careers.

MIMIC III Data. The MIMIC III data contain the admission records of over 40,000 patients in the Beth Israel Deaconess Medical Center between 2001 and 2012. For each patient, her/his admission time stamps and diseases (represented via ICD-9 codes (Deyo et al., 1992)) are recorded as an event sequence, and her/his profile (including gender, race, and chronic history) is represented as a binary feature of the sequence. As mentioned above, some work (Choi et al., 2015) has been done to extract a time-invariant disease network from admission records. However, the real disease network should be time-varying w.r.t. the age of the patient. Similar to the LinkedIn data, here we only obtain SDC event sequences — the ranges of most admission records are just 1 or 2 years.

Applying our data synthesis method, we can leverage the information from different patients and stitch their sequences based on their ages and their profiles. Focusing on 600 common diseases in 12 categories, we randomly select 15,000 patients' admission records as the training set and 1,000 patients with relatively complete records as the testing set. Fig. 5(b) shows that applying our data synthesis method indeed helps to improve the log-likelihood of the testing data. Compared with the bootstrap-based competitor, our data synthesis method obtains a more obvious improvement.


Figure 7. The diseases (nodes) are labeled with ICD-9 codes. The colors indicate their sub-categories. The size of the $c$-th node is $\sum_{c'} \psi_{cc'}(t)$. The width of an edge is $\psi_{cc'}(t)$.

Furthermore, we visualize the adjacency matrices of the dynamical network of disease categories in Fig. 6(b). We find that: 1) with the increase of age, the disease network becomes dense, reflecting the fact that complications of diseases become more and more common as people get old; 2) the networks show that neoplasms and the diseases of the circulatory, respiratory, and digestive systems have strong self-triggering patterns, because the treatments of these diseases often include several phases and require patients to be admitted multiple times; 3) for kids and teenagers, the disease networks (i.e., the "Age 0" and "Age 10" networks) are very sparse, and the common diseases mainly include neoplasms and the diseases of the circulatory, respiratory, and digestive systems; 4) for middle-aged people, the reasons for admission are diverse and complicated, so their disease networks are dense and include many mutually-triggering patterns; 5) for the very old, the disease networks (i.e., the "Age 80" and "Age 90" networks) are relatively sparser than those of middle-aged people, because their admissions are generally caused by chronic diseases of old age.

Additionally, we visualize the dynamical networks of the diseases of the circulatory system in Fig. 7 and find some interesting triggering patterns. For example, for kids (the "Age 0" network), the typical circulatory diseases are "diseases of mitral and aortic valves" (ICD-9 396) and "cardiac dysrhythmias" (ICD-9 427), which are common for premature babies and kids with congenital heart disease. For the old (the "Age 80" network), the network becomes dense. We find that 1) as a main cause of death, "heart failure" (ICD-9 428) is triggered by multiple other diseases, especially "secondary hypertension" (ICD-9 405); 2) "secondary hypertension" is also likely to cause "other and ill-defined cerebrovascular disease" (ICD-9 437); 3) "hemorrhoids" (ICD-9 455), a common disease with a strong self-triggering pattern, causes frequent admissions of patients. In summary, the analysis above verifies the rationality of our results — the dynamical disease networks we learned indeed reflect properties of human health trajectories. The list of ICD-9 codes and the complete enlarged network over age are shown in the supplementary file.

6. Conclusion

In this paper, we propose a novel data synthesis method to learn Hawkes processes from SDC event sequences. With the help of temporal information and optional features, we measure the similarities among different SDC event sequences and estimate the distribution of potential long event sequences. Applying a sampling-stitching mechanism, we successfully synthesize a large number of long event sequences and learn point processes robustly. We test our method on both time-invariant and time-varying Hawkes processes. Experiments show that our data synthesis method improves the robustness of learning algorithms for various models. In the future, we plan to provide more theoretical and quantitative analysis of our data synthesis method and apply it to more applications.

Acknowledgements

This work is supported in part via NSF IIS-1639792, DMS-1317424, NIH R01 GM108341, NSFC 61628203, U1609220 and the Key Program of Shanghai Science and Technology Commission 15JC1401700.


References

Bacry, Emmanuel, Delattre, Sylvain, Hoffmann, Marc, and Muzy, Jean-Francois. Some limit theorems for Hawkes processes and application to financial statistics. Stochastic Processes and their Applications, 123(7):2475–2499, 2013.

Choi, Edward, Du, Nan, Chen, Robert, Song, Le, and Sun, Jimeng. Constructing disease network and temporal progression model via context-sensitive Hawkes process. In ICDM, 2015.

Cowling, Ann, Hall, Peter, and Phillips, Michael J. Bootstrap confidence regions for the intensity of a Poisson point process. Journal of the American Statistical Association, 91(436):1516–1524, 1996.

Daley, Daryl J and Vere-Jones, David. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.

De Gruttola, Victor and Lagakos, Stephen W. Analysis of doubly-censored survival data, with application to AIDS. Biometrics, pp. 1–11, 1989.

Deyo, Richard A, Cherkin, Daniel C, and Ciol, Marcia A. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. Journal of Clinical Epidemiology, 45(6):613–619, 1992.

Efron, Bradley. The Jackknife, the Bootstrap and Other Resampling Plans, volume 38. SIAM, 1982.

Eichler, Michael, Dahlhaus, Rainer, and Dueck, Johannes. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 2016.

Fan, Chun-Po Steve. Local Likelihood for Interval-censored and Aggregated Point Process Data. PhD thesis, University of Toronto, 2009.

Goncalves, Sílvia and Kilian, Lutz. Bootstrapping autoregressions with conditional heteroskedasticity of unknown form. Journal of Econometrics, 123(1):89–120, 2004.

Guan, Yongtao and Loh, Ji Meng. A thinned block bootstrap variance estimation procedure for inhomogeneous spatial point patterns. Journal of the American Statistical Association, 102(480):1377–1386, 2007.

Hawkes, Alan G and Oakes, David. A cluster process representation of a self-exciting process. Journal of Applied Probability, pp. 493–503, 1974.

Johnson, Alistair EW, Pollard, Tom J, Shen, Lu, Lehman, Li-wei H, Feng, Mengling, Ghassemi, Mohammad, Moody, Benjamin, Szolovits, Peter, Celi, Leo Anthony, and Mark, Roger G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

Kirk, Paul DW and Stumpf, Michael PH. Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data. Bioinformatics, 25(10):1300–1306, 2009.

Klein, John P and Moeschberger, Melvin L. Survival Analysis: Techniques for Censored and Truncated Data. Springer Science & Business Media, 2005.

Kobayashi, Ryota and Lambiotte, Renaud. TiDeH: Time-dependent Hawkes process for predicting retweet dynamics. arXiv preprint arXiv:1603.09449, 2016.

Luo, Dixin, Xu, Hongteng, Zhen, Yi, Ning, Xia, Zha, Hongyuan, Yang, Xiaokang, and Zhang, Wenjun. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In IJCAI, 2015.

Luo, Dixin, Xu, Hongteng, Zhen, Yi, Dilkina, Bistra, Zha, Hongyuan, Yang, Xiaokang, and Zhang, Wenjun. Learning mixtures of Markov chains from aggregate data with structural constraints. Transactions on Knowledge and Data Engineering, 28(6):1518–1531, 2016.

Mei, Hongyuan and Eisner, Jason. The neural Hawkes process: A neurally self-modulating multivariate point process. arXiv preprint arXiv:1612.09328, 2016.

Paparoditis, Efstathios and Politis, Dimitris N. Tapered block bootstrap. Biometrika, 88(4):1105–1119, 2001.

Politis, Dimitris N and Romano, Joseph P. The stationary bootstrap. Journal of the American Statistical Association, 89(428):1303–1313, 1994.

Rubin, Donald B. Multiple Imputation for Nonresponse in Surveys, volume 307. John Wiley & Sons, 2009.

Streit, Roy L. Poisson Point Processes: Imaging, Tracking, and Sensing. Springer Science & Business Media, 2010.

Sun, J and Kalbfleisch, JD. Estimation of the mean function of point processes based on panel count data. Statistica Sinica, pp. 279–289, 1995.

Turnbull, Bruce W. Nonparametric estimation of a survivorship function with doubly censored data. Journal of the American Statistical Association, 69(345):169–173, 1974.

Van den Berg, Gerard J and Drepper, Bettina. Inference for shared-frailty survival models with left-truncated data. Econometric Reviews, 35(6):1075–1098, 2016.

Wellner, Jon A and Zhang, Ying. Two estimators of the mean of a counting process with panel count data. Annals of Statistics, pp. 779–814, 2000.

Xu, Hongteng, Zhen, Yi, and Zha, Hongyuan. Trailer generation via a point process-based visual attractiveness model. In IJCAI, 2015.

Xu, Hongteng, Farajtabar, Mehrdad, and Zha, Hongyuan. Learning Granger causality for Hawkes processes. In ICML, 2016a.

Xu, Hongteng, Ning, Xia, Zhang, Hui, Rhee, Junghwan, and Jiang, Guofei. PInfer: Learning to infer concurrent request paths from system kernel events. In ICAC, 2016b.

Xu, Hongteng, Wu, Weichang, Nemati, Shamim, and Zha, Hongyuan. ICU patient flow prediction via discriminative learning of mutually-correcting processes. arXiv preprint arXiv:1602.05112, 2016c.

Yang, Shuang-Hong and Zha, Hongyuan. Mixture of mutually exciting processes for viral diffusion. In ICML, 2013.

Zhao, Qingyuan, Erdogdu, Murat A, He, Hera Y, Rajaraman, Anand, and Leskovec, Jure. SEISMIC: A self-exciting point process model for predicting tweet popularity. In KDD, 2015.

Zhou, Ke, Zha, Hongyuan, and Song, Le. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, 2013a.

Zhou, Ke, Zha, Hongyuan, and Song, Le. Learning triggering kernels for multi-dimensional Hawkes processes. In ICML, 2013b.