
Trailer Generation via a Point Process-Based Visual Attractiveness Model

Hongteng Xu^{1,2}, Yi Zhen^2, Hongyuan Zha^{2,3}

^1 School of ECE, Georgia Institute of Technology, Atlanta, GA, USA
^2 College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
^3 Software Engineering Institute, East China Normal University, Shanghai, China

Abstract

Producing attractive trailers for videos needs human expertise and creativity, and hence is challenging and costly. Different from video summarization that focuses on capturing storylines or important scenes, trailer generation aims at producing trailers that are attractive so that viewers will be eager to watch the original video. In this work, we study the problem of automatic trailer generation, in which an attractive trailer is produced given a video and a piece of music. We propose a surrogate measure of video attractiveness named fixation variance, and learn a novel self-correcting point process-based attractiveness model that can effectively describe the dynamics of attractiveness of a video. Furthermore, based on the attractiveness model learned from existing training trailers, we propose an efficient graph-based trailer generation algorithm to produce a max-attractiveness trailer. Experiments demonstrate that our approach outperforms the state-of-the-art trailer generators in terms of both quality and efficiency.

1 Introduction

With the proliferation of online video sites such as YouTube, promoting online videos through advertisements is becoming more and more popular and important. Nowadays, most advertisements consist of only key frames or shots accompanied by textual descriptions. Although these advertisements can be easily produced using existing video summarization techniques, they are oftentimes not attractive enough. Only a small portion of videos, e.g., Hollywood movies, are promoted by highly attractive trailers consisting of well-designed montages and mesmerizing music. The quality of trailers generated by video summarization techniques is far from satisfactory, which is to be expected: the goal of trailer generation is to maximize the video attractiveness, or equally, to minimize the loss of attractiveness, whereas video summarization aims at selecting key frames or shots to capture storylines or important scenes. To produce highly attractive trailers, human expertise and creativity are always needed, making trailer generation costly. In this paper, we study how to produce trailers automatically and efficiently, and our approach may be applied to potentially millions of online videos and hence lower the cost substantially.

Trailer generation is challenging because we not only need to select key shots but also re-organize them in such a coherent way that the whole trailer is most attractive. Moreover, attractiveness is a relatively ambiguous and subjective notion, and it may be conveyed through several factors such as the shots, the order of shots, and the background music. In this paper, we propose an automatic trailer generation approach which consists of three key components: 1) A practical surrogate measure of trailer attractiveness; 2) An efficient algorithm to select and re-organize shots to maximize the attractiveness; 3) An effective method to synchronize shots and music for improved viewer experience. Specifically, we learn an attractiveness model for movie trailers by leveraging self-correcting point process methodology [Isham and Westcott, 1979; Ogata and Vere-Jones, 1984]. Then, we position the trailer montages by exploiting the saliency information of the theme music. Finally, based upon the montage information and the attractiveness model, we construct a shot graph and generate a trailer by finding a shortest path, which is equivalent to maximizing the attractiveness.

We summarize our contributions as follows: 1) We propose an effective surrogate measure of video attractiveness, namely, fixation variance. With this measure, we study the dynamics of video attractiveness and the properties of movie trailers. 2) Although point processes have been widely used for modeling temporal event sequences, such as earthquakes [Ogata and Vere-Jones, 1984] and social behaviors [Zhou et al., 2013], to the best of our knowledge, our methodology is the first to model video attractiveness using self-correcting point processes, joining a small number of existing works that leverage point process methodology for vision problems. 3) We investigate the influence of music on trailer generation, and propose a graph-based music-guided trailer generation algorithm. Compared to the state-of-the-art methods, our method achieves significantly better results while using much less computation.

Related Work. Video summarization techniques have attracted much research interest and many methods have been proposed in the past. Early works [Gong and Liu, 2000; Li et al., 2001] extract features of frames and cluster frames accordingly, but their performance is limited. Besides visual frames, other information has been taken into consideration in video summarization. For example, viewer attention model based summarization methods are proposed in [Ma et al., 2005; You et al., 2007]. User interaction is incorporated in a generic framework of video summarization [Li et al., 2006]. Textual information is used to achieve transfer learning based video summarization [Li et al., 2011]. Audio and visual analysis is performed simultaneously [Jiang et al., 2011]. Recently, a joint aural, visual, and textual attention model was proposed for movie summarization [Evangelopoulos et al., 2013]. Moreover, semantic information of videos has been exploited, including the saliency maps of images [Yan et al., 2010], special events [Wang et al., 2011; 2012], key people and objects [Lee et al., 2012; Khosla et al., 2013], and storylines [Kim et al., 2014]. External information sources such as web images have also been demonstrated to be useful [Khosla et al., 2013; Kim et al., 2014]. Focusing on the trailer generation problem, as far as we know, Ma et al. proposed the first user attention model for trailer generation [Ma et al., 2002]. Irie et al. [Irie et al., 2010] built a trailer generator that combines a topic model based on Plutchik's emotions [Hospedales et al., 2009; Irie et al., 2009] with the Bayesian surprise model [Itti and Baldi, 2009], and their system is reported to achieve state-of-the-art performance. However, the system does not incorporate the relationship between a trailer and its music, and the causality between the surprise degree and video attractiveness is questionable. In all the above works, the definition and measurement of video attractiveness are largely overlooked, which turn out to be rather critical for trailer generation.

2 Properties of Movie Trailer

Suppose that we have a set of $K$ training trailers $\{T_k\}_{k=1}^K$ containing $N$ shots $\{C_i\}_{i=1}^N$ in total. Each shot comes from one trailer and consists of a set of frames. Let $C_i \in T_k$ indicate that shot $C_i$ comes from the trailer $T_k$, and $C_i = \{f_j^{(i)}\}_{j=1}^{n_i}$ indicate that there are $n_i$ frames in $C_i$ and $f_j^{(i)}$ is the $j$th frame of $C_i$. Similarly, we use $T \subset V$ to indicate that $T$ is the trailer of the video $V$. We also represent a video $V$ or a trailer $T$ as a sequence of shots, denoted as $\{C_i\}_{i=1}^N$, in the sequel. We use the index of a frame as the time stamp of the shot (trailer and movie) for convenience. The beginning and the ending of a video or a trailer are denoted as $L_0 = 0$ and $L_N = \sum_{i=1}^N n_i$. The position of the montage between $C_i$ and $C_{i+1}$ is denoted as $L_i = L_{i-1} + n_i = \sum_{j=1}^i n_j$. The trailer generation problem is: given a video $V$ and a piece of music $m$, we would like to generate a trailer $T \subset V$ that is the most attractive.

2.1 Measure and Dynamics of Attractiveness

We might observe a common phenomenon: when attractive scenes such as handsome characters and hot action appear, viewers look at the same area on the screen; on the other hand, when boring scenes such as the cast of characters and tedious dialogues appear, viewers no longer focus on the same screen area. In other words, the attractiveness of a video is highly correlated with the attention of viewers when they watch the video. Therefore, we propose a surrogate measure of attractiveness based on viewers' eye movements, whose efficacy is validated by the following experiments. Specifically, we invite 14 (6 female and 8 male) volunteers to watch 8 movie trailers, which contain 1,083 shots¹. We record the motions of their gazes and calculate the mapped fixation points in each frame using a Tobii T60 eye tracker. Denote the locations of gaze on the screen, namely, the fixation points, in the $j$th frame of $C_i$ as $[\mathbf{x}_j^{(i)}, \mathbf{y}_j^{(i)}]$, where $\mathbf{x}_j^{(i)} \in \mathbb{R}^{14}$ (resp. $\mathbf{y}_j^{(i)} \in \mathbb{R}^{14}$) is the vector of the horizontal (resp. vertical) coordinates of the fixation points of the 14 volunteers. For the $j$th frame, we define the fixation variance as the determinant of the covariance matrix of the fixation points:

$$\sigma_j^{(i)} = \det\big(\mathrm{cov}([\mathbf{x}_j^{(i)}, \mathbf{y}_j^{(i)}])\big). \quad (1)$$

We average $\{\sigma_j^{(i)}\}_{j=1}^{n_i}$ over the frames belonging to the same shot. The averaged fixation variance reflects the spread of attention when watching the shot. Following the above reasoning and definition, we expect that boring shots (e.g., background) should have large fixation variance whereas attractive shots (e.g., hot action scenes, characters) should have small fixation variance. To verify this, we label these two types of shots manually and calculate the statistics of their fixation variance. The results are summarized in Table 1.

Table 1: The statistics of normalized fixation variance (×10^8)

                    mean(σ)   median(σ)   variance(σ)
Boring shots        1.19      0.45        0.03
Attractive shots    0.60      0.22        0.01

It is easy to observe that both the mean and the median of the fixation variance of boring shots are about twice as large as those of the attractive shots. The variance is very small, meaning that our proposed fixation variance is stable in both the boring and the attractive shot groups. These results show that fixation variance is negatively correlated with video attractiveness: it measures the loss of attractiveness accurately and robustly. Fig. 1(a-d) further shows typical examples.
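To make the surrogate measure concrete, here is a minimal sketch (our illustration, not the authors' code) of how the fixation variance of Eq. (1) and its per-shot average could be computed; the array shapes, synthetic gaze data, and function names are assumptions for illustration only.

```python
import numpy as np

def fixation_variance(x, y):
    """Eq. (1): determinant of the 2x2 covariance matrix of the fixation
    points of all viewers in one frame.
    x, y: 1-D arrays of horizontal/vertical gaze coordinates (one per viewer)."""
    pts = np.stack([x, y])                  # shape (2, num_viewers)
    return float(np.linalg.det(np.cov(pts)))

def shot_fixation_variance(gaze_x, gaze_y):
    """Average fixation variance over the frames of one shot.
    gaze_x, gaze_y: arrays of shape (num_frames, num_viewers)."""
    sigmas = [fixation_variance(gx, gy) for gx, gy in zip(gaze_x, gaze_y)]
    return float(np.mean(sigmas)), np.asarray(sigmas)

# Example with synthetic gaze data: a 30-frame shot watched by 14 viewers.
rng = np.random.default_rng(0)
gaze_x = 320 + 40 * rng.standard_normal((30, 14))
gaze_y = 240 + 40 * rng.standard_normal((30, 14))
avg_sigma, per_frame_sigma = shot_fixation_variance(gaze_x, gaze_y)
print(avg_sigma)
```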

Figure 1: (a-d) Four shots in the trailer "The Wolverine 2013" and their fixation variances: (a) authorization screen, σ = 2.9×10^8; (b) the background, σ = 1.9×10^8; (c) main character, σ = 0.3×10^8; (d) action scene, σ = 0.1×10^8. (e) Dynamics of fixation variance calculated from the training trailers.

¹ In this paper, we segment videos into shots using the commercial software "CyberLink PowerDirector".


The dynamics of fixation variance. Given $N$ shots $\{C_i\}_{i=1}^N$, the averaged fixation variance in the $j$th frame is calculated as $\bar{\sigma}_j = \frac{1}{N_j}\sum_{i=1}^{N_j}\sigma_j^{(i)}$, where $N_j$ is the number of shots having at least $j$ frames. In Fig. 1(e), we find that the change of $\bar{\sigma}_j$ over time can be approximated by an increasing exponential curve. In other words, within a shot, its attractiveness decreases over time. The inter-shot dynamics can be modeled as the stitching of the fitted exponential curves for adjacent shots. This means that although the attractiveness within one shot decreases over time, the montage between shots increases attractiveness.
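The exponential trend can be checked numerically; the sketch below is our own illustration (not from the paper) that averages the per-frame fixation variances over all shots with at least $j$ frames and fits an exponential curve by linear regression on the log values. The input format and function names are assumptions.

```python
import numpy as np

def averaged_fixation_variance(per_shot_sigmas):
    """per_shot_sigmas: list of 1-D arrays, one per shot, holding the per-frame
    fixation variances of that shot. Returns sigma_bar[j], averaged over all
    shots that have at least j+1 frames (0-based frame index)."""
    max_len = max(len(s) for s in per_shot_sigmas)
    sigma_bar = np.zeros(max_len)
    for j in range(max_len):
        vals = [s[j] for s in per_shot_sigmas if len(s) > j]
        sigma_bar[j] = np.mean(vals)
    return sigma_bar

def fit_exponential(sigma_bar):
    """Fit sigma_bar[j] ~ a * exp(b * j) by linear regression on log sigma_bar."""
    j = np.arange(len(sigma_bar))
    b, log_a = np.polyfit(j, np.log(sigma_bar), 1)
    return np.exp(log_a), b  # b > 0 indicates an increasing loss of attractiveness
```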

2.2 The Dynamics of Music

Similar to the dynamics of attractiveness, we empirically find that the dynamics of music are also highly correlated with the montages between shots. To see this, we first detect the saliency points of the music associated with a trailer as follows. 1) Using the saliency detection algorithm in [Hou et al., 2012], we extract the saliency curve of a piece of music as

$$\hat{m} = G\big((\mathrm{idct}(\mathrm{sign}(\mathrm{dct}(m))))^2\big), \quad (2)$$

where $\mathrm{dct}(\cdot)$ and $\mathrm{idct}(\cdot)$ are a DCT transformation pair, $\mathrm{sign}(\cdot)$ is the sign function that returns 1, -1, and 0 for positive, negative, and zero inputs, respectively, and $G(\cdot)$ is a Gaussian filter. 2) After re-sampling $\hat{m}$ according to the number of frames, we detect the peaks in $\hat{m}$. Regarding these peaks as the saliency points of the music, we investigate their correlation with the montages in the trailer as follows. For each peak, we label it as a correct indicator of the location of a montage when a montage appears within ±6 frames (about 0.25 second). On the 8 sample trailers, we find that the time stamps of the saliency points are highly correlated with those of the montages (with accuracy 84.88%). Fig. 2 presents an example: in high-quality trailers, the montages of shots are synchronized with the rhythm of the background music.
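The following sketch illustrates one plausible implementation of Eq. (2) and the subsequent peak detection using standard SciPy routines; the smoothing width, the resampling scheme, and the function names are our assumptions rather than the authors' exact choices.

```python
import numpy as np
from scipy.fft import dct, idct
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def music_saliency(m, sigma=5.0):
    """Eq. (2): saliency curve of a mono audio signal m via the sign of its DCT."""
    return gaussian_filter1d(
        idct(np.sign(dct(m, norm='ortho')), norm='ortho') ** 2, sigma)

def montage_positions(m, num_frames):
    """Resample the saliency curve to the video frame rate and return the peak
    frame indices, used as candidate montage positions."""
    m_hat = music_saliency(m)
    idx = np.linspace(0, len(m_hat) - 1, num_frames).astype(int)
    peaks, _ = find_peaks(m_hat[idx])
    return peaks
```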

Figure 2: The saliency points of music vs. montage positions (trailer of "The Bling Ring").

According to the analytic experiments above, we summarize three properties of movie trailers as follows:
Property 1. The loss of attractiveness can be approximated by a surrogate measure, namely, fixation variance.
Property 2. Within each shot, the loss of attractiveness increases exponentially, and this tendency is corrected when a new shot appears.
Property 3. The self-correction of attractiveness, the rhythm of the music in the video, and the montages between shots are highly correlated.

3 Point Process-Based Attractiveness Model

3.1 Motivation

Our modeling assumption is that the fixation variance is highly correlated with the number of viewers losing their attention on the screen, which directly connects the notion of the attractiveness of a video with a specific point process model we will discuss below. To this end, suppose that there are $V$ viewers watching a movie. We define a sequence $E_v(t)$ of the event "whether the viewer loses her attention or not at time $t$" for each viewer $v$:

$$E_v(t) = \begin{cases} 1, & \text{viewer } v \text{ loses her attention at time } t,\\ 0, & \text{otherwise.} \end{cases}$$

Although we do not observe the event sequence directly, we assume that the fixation variance is proportional to the number of viewers losing their attention. Therefore, given the fixation variance of the training trailers, we can approximate the aggregated observations of viewers' events $\sum_v E_v(t)$, and model viewers' events as a temporal point process. Specifically, we propose an attractiveness model based on a specific point process, i.e., the self-correcting point process.

3.2 Self-Correcting Point Process

A self-correcting point process is a point process with the following intensity function:

$$\lambda(t) = \frac{\mathbb{E}(dN(t)\,|\,\mathcal{H}_t)}{dt} = \exp\Big(\alpha t - \sum_{i:\, t_i < t}\beta\Big), \quad (3)$$

where $N(t)$ is the number of events occurring in the time range $(-\infty, t]$, $\mathcal{H}_t$ denotes the historical events happening before time $t$, and $\mathbb{E}(dN(t)\,|\,\mathcal{H}_t)$ is the expectation of the number of events happening in the interval $(t, t+dt]$ given the historical observations $\mathcal{H}_t$. The intensity function in Eq. (3) represents the expected instantaneous rate of future events at time $t$.
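As a quick illustration of Eq. (3), the snippet below evaluates the intensity of a self-correcting process for given event times; the parameter values and event times are arbitrary examples.

```python
import numpy as np

def self_correcting_intensity(t, event_times, alpha, beta):
    """Eq. (3): lambda(t) = exp(alpha * t - beta * #{events before t}).
    The intensity grows exponentially and drops by a factor exp(-beta)
    each time an event occurs (the 'self-correction')."""
    n_before = np.sum(np.asarray(event_times) < t)
    return np.exp(alpha * t - beta * n_before)

# Example: intensity right before and right after an event at t = 10.
events = [3.0, 7.5, 10.0]
print(self_correcting_intensity(9.99, events, alpha=0.1, beta=0.5))
print(self_correcting_intensity(10.01, events, alpha=0.1, beta=0.5))
```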

The intensity function of the self-correcting point process increases exponentially with rate $\alpha$, and this tendency can be corrected by the historical observations via rate $\beta$. Note that the intensity function exactly matches the dynamics of attractiveness described in Property 2. Therefore, given a video $V = \{C_i\}_{i=1}^N$, for each shot $C_i$ we define the local intensity function in its time interval $(0, n_i]$ as

$$\lambda_{C_i}(t) = \exp\big(\alpha_i H^i t - \beta_i D_t^i\big), \quad (4)$$

where $t \in (0, n_i]$ and

$$H^i = \begin{cases} H(\hat{f}_1^{(1)}), & i = 1,\\ D(\hat{f}_{n_{i-1}}^{(i-1)}\,\|\,\hat{f}_1^{(i)}), & i > 1, \end{cases} \qquad D_t^i = \sum_{j=2}^{\lfloor t\rfloor} D(\hat{f}_{j-1}^{(i)}\,\|\,\hat{f}_j^{(i)}),$$

where $\hat{f} = G((\mathrm{idct2D}(\mathrm{sign}(\mathrm{dct2D}(f))))^2)/C$ is the normalized saliency map of frame $f$ [Hou et al., 2012], and $C$ is an $l_1$ normalizer that guarantees $\hat{f}$ to be a distribution. For the first shot $C_1$ of $V$ ($i = 1$), $H^i = H(\hat{f}_1^{(1)})$ represents the entropy of the first frame in $C_1$, which is the initial stimulus given by $C_1$. For the following shots ($i > 1$), $H^i = D(\hat{f}_{n_{i-1}}^{(i-1)}\,\|\,\hat{f}_1^{(i)})$ represents the KL-divergence between the last frame of $C_{i-1}$ and the first frame of $C_i$, which is the initial stimulus given by $C_i$. Similarly, $D(\hat{f}_{j-1}^{(i)}\,\|\,\hat{f}_j^{(i)})$ represents the KL-divergence between adjacent frames in $C_i$, which is the supplementary stimulus. $D_t^i = \sum_{j=2}^{\lfloor t\rfloor} D(\hat{f}_{j-1}^{(i)}\,\|\,\hat{f}_j^{(i)})$ is the accumulative influence caused by the supplementary stimulus in the time interval $(0, t]$, where $\lfloor t\rfloor$ is the largest integer not exceeding $t$.

The intensity function in Eq. (4) imitates the loss of attractiveness: it increases exponentially with the initial stimulus until the supplementary stimulus corrects this tendency. For each shot $C_i$, its intensity function has two non-negative parameters $(\alpha_i, \beta_i)$. To summarize, we define our attractiveness model as a self-correcting point process with a global intensity function for the time interval $(0, \sum_{i=1}^N n_i]$, which stitches the $N$ local intensity functions as $\lambda_V(t) = \sum_{i=1}^N \lambda_{C_i}(t - L_{i-1})$, where $L_0 = 0$ and $L_i = L_{i-1} + n_i = \sum_{j=1}^i n_j$.
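To make Eq. (4) and the stitched global intensity concrete, here is a rough sketch under our own assumptions (grayscale frames, a fixed Gaussian smoothing width, and simple NumPy implementations of entropy and KL-divergence); it illustrates the structure of the model rather than reproducing the authors' implementation.

```python
import numpy as np
from scipy.fft import dctn, idctn
from scipy.ndimage import gaussian_filter

def saliency_map(frame, sigma=3.0, eps=1e-12):
    """Normalized image-signature saliency map of a grayscale frame
    [Hou et al., 2012], l1-normalized so it can be treated as a distribution."""
    s = gaussian_filter(idctn(np.sign(dctn(frame, norm='ortho')), norm='ortho') ** 2, sigma)
    return s / (s.sum() + eps)

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def kl_div(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def local_intensity(frames, alpha, beta, prev_last_frame=None):
    """Eq. (4) for one shot: lambda_Ci(t) = exp(alpha * H_i * t - beta * D_t^i),
    evaluated at integer frame indices t = 1..n_i.
    frames: list of grayscale frames; prev_last_frame: last frame of the
    previous shot (None for the first shot of the video)."""
    maps = [saliency_map(f) for f in frames]
    # Initial stimulus: entropy of the first frame (first shot) or the
    # KL-divergence from the previous shot's last frame (later shots).
    H = entropy(maps[0]) if prev_last_frame is None \
        else kl_div(saliency_map(prev_last_frame), maps[0])
    # Accumulated supplementary stimulus D_t^i over adjacent frames.
    D = np.cumsum([0.0] + [kl_div(maps[j - 1], maps[j]) for j in range(1, len(maps))])
    t = np.arange(1, len(frames) + 1)
    return np.exp(alpha * H * t - beta * D)

def global_intensity(shots, alphas, betas):
    """Stitch local intensities: lambda_V(t) = sum_i lambda_Ci(t - L_{i-1})."""
    vals, prev_last = [], None
    for frames, a, b in zip(shots, alphas, betas):
        vals.append(local_intensity(frames, a, b, prev_last))
        prev_last = frames[-1]
    return np.concatenate(vals)
```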

3.3 Model Learning

Given $K$ training trailers $\{T_k\}_{k=1}^K$ consisting of $N$ shots, our goal is to learn the parameters of the $N$ local intensity functions. Similar to [Zhou et al., 2013], we achieve this goal by pursuing a maximum likelihood estimate (MLE) of the parameters, and the likelihood function can be written as:

$$L = \prod_{i=1}^N L_i = \prod_{i=1}^N \Big(\Big(\prod_{t=1}^{n_i} \big(\lambda_{C_i}(t)\big)^{\sigma_{C_i}(t)}\Big)\exp\Big(-\sigma_m \int_0^{n_i}\lambda_{C_i}(s)\,ds\Big)\Big), \quad (5)$$

where $L_i$ denotes the local likelihood for $C_i$, $\sigma_{C_i}(t)$ is the fixation variance of the $t$th frame of shot $C_i$, and $\sigma_m$ is the maximum of the fixation variance.

Besides capturing the global event dynamics by maximizing Eq. (5), we also require the proposed model to fit the local event information in each time interval. Therefore, we further propose to minimize a novel data fidelity loss that correlates the local intensity and the fixation variance in each frame: $\sum_{i=1}^N\sum_{t=1}^{n_i}|\log(\gamma\lambda_{C_i}(t)/\sigma_{C_i}(t))|^2$, where $\gamma$ is shared by all frames. This term encourages the local intensity (scaled by $\gamma$) to be equal to the fixation variance in each frame. To sum up, we learn our model by solving the following problem:

$$\min_{\boldsymbol{\alpha},\boldsymbol{\beta},\gamma}\; -\log(L) + \mu\sum_{i=1}^N\sum_{t=1}^{n_i}\Big|\log\Big(\frac{\gamma\lambda_{C_i}(t)}{\sigma_{C_i}(t)}\Big)\Big|^2, \quad \text{s.t. } \boldsymbol{\alpha}\ge 0,\ \boldsymbol{\beta}\ge 0,\ \gamma>0, \quad (6)$$

where $\boldsymbol{\alpha} = [\alpha_1,\ldots,\alpha_N]$ and $\boldsymbol{\beta} = [\beta_1,\ldots,\beta_N]$ represent the parameters of the $N$ local intensity functions.

We develop an alternating algorithm to solve Eq. (6). To be specific, given initial values of $(\boldsymbol{\alpha},\boldsymbol{\beta},\gamma)$, we first solve the following subproblem for each shot with $\gamma$ fixed:

$$\min_{\alpha_i,\beta_i}\; -\log(L_i) + \mu\sum_{t=1}^{n_i}\Big|\log\Big(\frac{\gamma\lambda_{C_i}(t)}{\sigma_{C_i}(t)}\Big)\Big|^2, \quad \text{s.t. } \alpha_i\ge 0,\ \beta_i\ge 0. \quad (7)$$

The objective of Eq. (7) can be written as follows:

$$I_i = \sum_{t=1}^{n_i}\Big(\sigma_{C_i}(t)\big(\beta_i D_t^i - \alpha_i H^i t\big) + \sigma_m\frac{e^{\alpha_i H^i t} - e^{\alpha_i H^i (t-1)}}{\alpha_i H^i\, e^{\beta_i D_t^i}} + \mu\big|\log\gamma + \alpha_i H^i t - \beta_i D_t^i - \log\sigma_{C_i}(t)\big|^2\Big).$$

We can solve the subproblems using gradient-based methods in a parallel manner. With $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ learned in this way, $\gamma$ can be updated using the following equation:

$$\gamma = \exp\Big(\frac{\sum_{i=1}^N\sum_{t=1}^{n_i}\log\big(\sigma_{C_i}(t)/\lambda_{C_i}(t)\big)}{\sum_{i=1}^N n_i}\Big). \quad (8)$$

Algorithm 1 summarizes our learning algorithm.

Algorithm 1 Learning the Proposed Attractiveness Model
Input: Training shots $\{C_i\}_{i=1}^N$, the maximum number of iterations $M = 500$, the parameter $\mu = 0.5$, and the gradient descent step sizes $\delta_\alpha = 10^{-4}$, $\delta_\beta = 10^{-5}$.
Output: Parameters of our model $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$, $\gamma$.
  Initialize $\boldsymbol{\alpha}^0$, $\boldsymbol{\beta}^0$ and $\gamma^0$ randomly.
  for $m = 1 : M$ do
    for $i = 1 : N$ do
      $\alpha_i^m = \big(\alpha_i^{m-1} - \delta_\alpha\,\partial I_i/\partial\alpha_i\big|_{\alpha_i=\alpha_i^{m-1}}\big)_+$
      $\beta_i^m = \big(\beta_i^{m-1} - \delta_\beta\,\partial I_i/\partial\beta_i\big|_{\beta_i=\beta_i^{m-1}}\big)_+$
      ($(\cdot)_+$ sets negative values to 0.)
    end for
    $\gamma^m$ is calculated by Eq. (8).
  end for
  $\boldsymbol{\alpha} = \boldsymbol{\alpha}^M$, $\boldsymbol{\beta} = \boldsymbol{\beta}^M$, $\gamma = \gamma^M$.
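A rough sketch of Algorithm 1 is given below. It evaluates the per-shot objective $I_i$ and performs projected gradient steps; for brevity the gradients are approximated by finite differences and the parameters are floored at a small positive value to avoid division by zero in $I_i$, both of which are our simplifications of the paper's procedure. The input format and helper names are assumptions.

```python
import numpy as np

def shot_objective(alpha, beta, gamma, H, D, sigma, sigma_m, mu=0.5):
    """Per-shot objective I_i: negative log-likelihood of the self-correcting
    process plus the data-fidelity term (Eqs. (5)-(7)).
    H: initial stimulus (scalar); D: cumulative supplementary stimuli D_t^i
    for t = 1..n_i; sigma: per-frame fixation variances."""
    n = len(sigma)
    t = np.arange(1, n + 1)
    log_lam = alpha * H * t - beta * D
    nll = np.sum(sigma * (-log_lam)) \
        + sigma_m * np.sum((np.exp(alpha * H * t) - np.exp(alpha * H * (t - 1)))
                           / (alpha * H * np.exp(beta * D)))
    fid = mu * np.sum((np.log(gamma) + log_lam - np.log(sigma)) ** 2)
    return nll + fid

def learn(shots, sigma_m, M=500, mu=0.5, da=1e-4, db=1e-5, eps=1e-6, seed=0):
    """Alternating projected-gradient learning in the spirit of Algorithm 1.
    shots: list of dicts with keys 'H' (scalar), 'D' (array), 'sigma' (array)."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(0.01, 0.1, len(shots))
    beta = rng.uniform(0.01, 0.1, len(shots))
    gamma = 1.0
    for _ in range(M):
        for i, s in enumerate(shots):
            def f(a, b):
                return shot_objective(a, b, gamma, s['H'], s['D'],
                                      s['sigma'], sigma_m, mu)
            ga = (f(alpha[i] + eps, beta[i]) - f(alpha[i], beta[i])) / eps
            gb = (f(alpha[i], beta[i] + eps) - f(alpha[i], beta[i])) / eps
            # Projection step; a small floor (rather than 0) keeps I_i finite.
            alpha[i] = max(alpha[i] - da * ga, 1e-8)
            beta[i] = max(beta[i] - db * gb, 1e-8)
        # Eq. (8): closed-form update of the shared scale gamma.
        num, den = 0.0, 0
        for i, s in enumerate(shots):
            t = np.arange(1, len(s['sigma']) + 1)
            lam = np.exp(alpha[i] * s['H'] * t - beta[i] * s['D'])
            num += np.sum(np.log(s['sigma'] / lam))
            den += len(s['sigma'])
        gamma = np.exp(num / den)
    return alpha, beta, gamma
```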

4 Trailer Generation

After learning the attractiveness model from the training set, we are able to generate, for a new testing video $V$ and a piece of music $m$, a trailer with maximum attractiveness, or equally, minimum loss of attractiveness. The problem of trailer generation can be formulated as:

$$\min_{T} \int_{L_0}^{L_N}\lambda_{T}(s)\,ds, \quad \text{s.t. } T\subset V, \quad (9)$$

where $L_0 = 0$ and $L_N = \sum_{i=1}^N n_i$ are the beginning and the ending of the trailer, respectively, $n_i$ is the number of frames in $C_i$, and $N$ is the number of candidate shots selected from $V$. $\{n_i\}$ and $N$ determine the positions of the montages. According to Property 3 in Section 2.2, they are determined by the saliency points of the music, which are detected from $\hat{m}$ in Eq. (2): the interval length between adjacent saliency points determines $n_i$, and the number of saliency points determines $N$. Since Eq. (9) is a combinatorial and NP-hard problem, we have to resort to approximate solutions. Inspired by [Xu et al., 2014], we propose a graph-based method to solve Eq. (9) approximately and efficiently.

Step 1: Candidate Selection. We rewrite Eq. (9) as:

$$\min_{T} \sum_{i=1}^N\int_0^{n_i}\lambda_{C_i}(s)\,ds, \quad \text{s.t. } T = \{C_i\in S_i\}_{i=1}^N, \quad (10)$$

where $S_i$, $i = 1,\ldots,N$, is the set of $s$ ($s = 5$ in our experiments) candidate shots selected from $V$ for $C_i$. Each selected shot satisfies the following two constraints: 1) its length is not shorter than $n_i$; 2) it does not appear in $S_{i-1}$. $S_N$ contains only one shot, which corresponds to the title of the trailer given in advance.

Step 2: Parameter Assignment. For a candidate shot $C_i$ in the new video, we do not know the parameters of $\lambda_{C_i}(t)$ in advance. In this paper, we first extract the feature of $C_i$ as

$$f_{C_i} = [H(\hat{f}_1^{(i)}),\ D(\hat{f}_1^{(i)}\,\|\,\hat{f}_2^{(i)}),\ \ldots] \in \mathbb{R}^{64}. \quad (11)$$

We fix the length of the feature vector to 64 in this work: if $C_i$ is shorter than 64 frames, we pad zeros at the end of $f_{C_i}$; if $C_i$ is longer than 64 frames, we truncate $f_{C_i}$. Then, we select the matching shot in the training set for $C_i$ by comparing the features of the training shots with those of the candidate shots, using the Euclidean distance as the matching criterion. The parameters of the matching shot are assigned to $\lambda_{C_i}(t)$.
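A small sketch of Step 2 under our own assumptions (the per-frame descriptors of Eq. (11) are already computed, and the training features and parameters are stored as arrays; all names are illustrative):

```python
import numpy as np

def shot_feature(descriptors, length=64):
    """Eq. (11): fixed-length feature of a shot. `descriptors` is the sequence
    [H(f_1), D(f_1||f_2), D(f_2||f_3), ...] computed from the shot's saliency
    maps; it is zero-padded or truncated to `length` entries."""
    f = np.zeros(length)
    d = np.asarray(descriptors[:length], dtype=float)
    f[:len(d)] = d
    return f

def assign_parameters(candidate_descriptors, train_features, train_params):
    """Nearest-neighbor parameter assignment: match each candidate shot to the
    closest training shot in Euclidean distance and copy its (alpha, beta).
    train_features: (num_train, 64) array; train_params: list of (alpha, beta)."""
    assigned = []
    for desc in candidate_descriptors:
        f = shot_feature(desc)
        dists = np.linalg.norm(train_features - f, axis=1)
        assigned.append(train_params[int(np.argmin(dists))])
    return assigned
```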

Step 3: Graph-based Stitching. Eq. (10) is still a complicated combinatorial optimization problem. To see this, note that the selection of $C_{i-1}$ has a recursive influence on the selection of the subsequent shots $\{C_i, C_{i+1}, \ldots\}$: the initial stimulus in $\lambda_{C_i}(t)$ is the KL-divergence between the last frame of $C_{i-1}$ and the first frame of $C_i$. Hence, if we change $C_{i-1}$, the intensity function $\lambda_{C_i}(t)$ will change, and so does the selection of $C_i$.

The problem can be solved efficiently if we only consider the pairwise relationships between the shots in adjacent candidate sets. Given $\{S_i\}_{i=1}^N$, we construct a trellis graph $G$ with $N+1$ layers. The nodes in the $i$th layer are the candidate shots from $S_i$. The edge weights in the graph are defined as

$$w_{p,q}^{i,i+1} = \int_0^{n_i}\lambda_{C_{p,i}}(s)\,ds + \int_0^{n_{i+1}}\lambda_{C_{q,i+1}}(s)\,ds, \quad (12)$$

where $w_{p,q}^{i,i+1}$ is the weight connecting the $p$th candidate in $S_i$ with the $q$th candidate in $S_{i+1}$. We calculate all the weights independently: the initial stimulus in $\lambda_{C_{p,i}}(t)$ is the entropy of the first frame of $C_{p,i}$, which is independent of the shot selection in the former layers; on the other hand, $C_{p,i}$ only influences $C_{q,i+1}$ through the initial stimulus in $\lambda_{C_{q,i+1}}$, which is the KL-divergence between the last frame of $C_{p,i}$ and the first frame of $C_{q,i+1}$. The influence of $C_{p,i}$ does not propagate to the following layers. In other words, Eq. (10) can be solved approximately by finding the shortest path [Dijkstra, 1959] in the graph $G$ (from the first layer to the last one). We summarize our algorithm in Algorithm 2 and illustrate it in Fig. 3.
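Because the graph is a layered DAG, the shortest path can also be recovered by a simple layer-by-layer dynamic program; the sketch below is our own illustration with hypothetical edge weights, equivalent to running a shortest-path algorithm on the trellis.

```python
import numpy as np

def shortest_path_trellis(weights):
    """Dynamic programming over the trellis graph (Step 3).
    weights[i] is a (|S_i| x |S_{i+1}|) matrix whose (p, q) entry is the edge
    weight w_{p,q}^{i,i+1} of Eq. (12). Returns (total_cost, chosen candidate
    index in each layer)."""
    cost = np.zeros(weights[0].shape[0])   # cost of reaching each node in layer 0
    back = []                              # back-pointers for path recovery
    for W in weights:
        total = cost[:, None] + W          # cost of reaching each next-layer node
        back.append(np.argmin(total, axis=0))
        cost = np.min(total, axis=0)
    # Recover the path from the last layer backwards.
    path = [int(np.argmin(cost))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return float(np.min(cost)), path

# Example: 3 layers with 2 candidates each and hypothetical edge weights.
w01 = np.array([[1.0, 4.0], [2.0, 0.5]])
w12 = np.array([[3.0, 1.0], [0.2, 2.0]])
print(shortest_path_trellis([w01, w12]))   # -> (0.7, [1, 1, 0])
```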

Algorithm 2 Graph-based Trailer Generation Algorithm
Input: a video $V$, a piece of music $m$, training shots with features and learned parameters.
Output: a movie trailer $T$.
  1. Segment $V$ into shots $\{C_j\}$ and extract features.
  2. For each $C_j$, find the matching shot in the training set and assign parameters accordingly.
  3. Detect the saliency points of $m$ by Eq. (2).
  4. Construct the candidate sets $S_i$ and the trellis graph $G$.
  5. Calculate the edge weights by Eqs. (4) and (12).
  6. Find the shortest path in $G$.
  7. $T$ is constructed by the sequence of shots corresponding to the shortest path, associated with the music $m$.

Figure 3: The scheme of our trailer generator.

5 Experiments

We conduct two groups of experiments (objective and subjective) to empirically evaluate the proposed method. Our data set consists of 16 publicly available movies, including 3 animation movies, 2 fantasy movies, 2 action movies, 5 fiction action movies and 4 dramas in 2012 and 2014. We also collect the movies, their theme music and official trailers. The experimental settings are as follows: we first select 8 of the trailers as the training set and collect the fixation data from 14 volunteers. Then, we learn our attractiveness model as described in Section 3.3. Finally, based on the attractiveness model, we produce trailers for the remaining 8 movies following Section 4. All movies and their trailers have frame size 640×480.

We compare our method ("Ours") with the following four competitors²: i) the trailer generator "V2T" in [Irie et al., 2010]; ii) the commercial video summarization software "Muvee"³; iii) the real official trailers ("RT") generated by professionals; iv) the real trailers without speech information ("RTwS"). For a fair comparison, for "V2T", the training and testing sets are the same as described above; for "Muvee", we generate trailers for the testing movies only, as there is no training phase. We also note that, in this work, we focus on the visual attractiveness model and its contribution to trailer generation, and hence we use "RTwS" as a baseline. Moreover, we are only able to implement the shot selection and arrangement algorithm of "V2T", since other steps such as feature extraction are omitted in the reference.

² Representative trailers generated by all methods are on the website: https://vimeo.com/user25206850/videos.
³ http://www.muvee.com/

5.1 Objective Evaluation

Loss of Attractiveness. An important criterion for trailers is the loss of attractiveness, which can be approximately measured by the proposed fixation variance. We invite the 14 volunteers to watch the testing trailers generated by all of the five methods mentioned above, and record their fixation points in each frame with an eye tracker. For each method, we calculate the fixation variance σ in each frame of all 8 testing trailers, and hence obtain 32,309 σ's. The statistics of the σ's reflect the overall loss of attractiveness: a larger σ indicates more loss of attractiveness. We present the mean, the median and the standard deviation of the σ's for each method in Fig. 4.

Figure 4: The mean, the median and the standard deviation ofthe fixation variance σ for various methods.

It is easy to observe that both the mean and the median obtained by our method are the smallest among all competitors. Specifically, the results of our method are comparable to those of "RT", whereas the results of "Muvee", "V2T" and "RTwS" are much higher than those of "RT". Last but not least, the standard deviations show that the attractiveness of our trailers is the most stable, and the "RT" trailers are the second best in this respect. In our opinion, the superiority of our method mainly stems from the use of fixation variance and the proposed point process model. Firstly, fixation variance is a real signal measured from viewers, which reflects the attractiveness of a video better than features learned from the video itself. Secondly, the proposed point process model captures the dynamics of attractiveness well, which provides useful guidance for trailer generation.

Computational Cost. Computational cost is a key factor for trailer generation. As we mentioned in previous sections, it is very promising to have efficient automatic trailer generators that may potentially be applied to millions of online videos. We note that the training and testing computational complexities of "V2T" are both $O(N^3)$, whereas ours are $O(N)$ and $O(N^2)$ for training and testing, respectively.

Table 2: Comparison on computational cost.

                          Training cost       Testing cost
V2T [Irie et al., 2010]   0.0024 sec/frame    0.2676 sec/frame
Ours                      0.0014 sec/frame    0.0113 sec/frame

Table 2 compares the empirical training and testing costs of our method and "V2T". Both methods are implemented in MATLAB and run on the same platform (Core i7 CPU @3.40GHz with 32GB memory). Specifically, the training cost is calculated as the model learning time per frame on the training set, and the testing cost is calculated as the trailer generation time per frame of the generated trailer. Table 2 validates that our method requires much lower training and testing cost than "V2T".

5.2 Subjective Evaluation

In this subsection, we evaluate our method as well as the baselines through subjective experiments. Similar to [Irie et al., 2010], for each testing trailer generated by the different methods, we invited 14 volunteers to evaluate it by answering the following 3 questions. Rhythm: "How well does the montage match the rhythm of the background music?" Attractiveness: "How attractive is the trailer?" Appropriateness: "How close is the trailer to a real trailer?" For each question, the volunteers were asked to provide an integer score in the range of 1 (lowest) to 7 (highest). Fig. 5 shows the overall results for all 8 testing movies.

Figure 5: The box plots of scores for various methods onthree questions. The red crosses are means and the red barsare medians.

Consistency of Objective and Subjective Evaluation. Similar to the objective evaluation, we find in Fig. 5 that our method is better than "V2T" and "Muvee" on all three questions, indicating that our attractiveness model based on fixation variance and self-correcting point processes is reasonable and is able to generate trailers that match our subjective feelings. On the other hand, our method is inferior to "RTwS" and "RT", which differs from the result of the objective evaluation. The difference may be attributed to the fact that we only use visual information to learn the attractiveness model and produce trailers. On the contrary, the official trailers of "RT" often provide speech, subtitles and special montage effects, which impress viewers a lot. Similarly, although the trailers of "RTwS" do not have speech information, they still contain subtitles and special montage effects. Since these factors contribute to raising the attractiveness of the video, volunteers feel that the real trailers are better than our trailers. This observation points out a future extension of our work: in addition to visual and music information, we should also enrich our model with information sources such as speech, subtitles, and montage effects to capture holistic movie attractiveness.


6 Conclusion and Future Work

In this work, we studied a challenging problem, namely, automatic trailer generation. To generate attractive trailers, we proposed a practical surrogate measure of video attractiveness called fixation variance, and made the first attempt to use point processes to model the attractiveness dynamics. Based upon the attractiveness model, we developed a graph-based music-guided trailer generation method. In the future, we are interested in extending our method to utilize other information such as speech and subtitles. We would also like to explore parallel algorithms to further improve the scalability of our method.

Acknowledgement. This work is supported in part by NSFDMS-1317424 and NSFC-61129001/F010403.

References

[Dijkstra, 1959] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.

[Evangelopoulos et al., 2013] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia, 15(7):1553–1568, 2013.

[Gong and Liu, 2000] Yihong Gong and Xin Liu. Video summarization using singular value decomposition. In CVPR, volume 2, pages 174–180, 2000.

[Hospedales et al., 2009] Timothy Hospedales, Shaogang Gong, and Tao Xiang. A Markov clustering topic model for mining behaviour in video. In ICCV, pages 1165–1172, 2009.

[Hou et al., 2012] Xiaodi Hou, Jonathan Harel, and Christof Koch. Image signature: Highlighting sparse salient regions. PAMI, 34(1):194–201, 2012.

[Irie et al., 2009] Go Irie, Kota Hidaka, Takashi Satou, Akira Kojima, Toshihiko Yamasaki, and Kiyoharu Aizawa. Latent topic driving model for movie affective scene classification. In ACM Multimedia, pages 565–568, 2009.

[Irie et al., 2010] Go Irie, Takashi Satou, Akira Kojima, Toshihiko Yamasaki, and Kiyoharu Aizawa. Automatic trailer generation. In ACM Multimedia, pages 839–842, 2010.

[Isham and Westcott, 1979] Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic Processes and Their Applications, 8(3):335–347, 1979.

[Itti and Baldi, 2009] Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention. Vision Research, 49(10):1295–1306, 2009.

[Jiang et al., 2011] Wei Jiang, Courtenay Cotton, and Alexander C. Loui. Automatic consumer video summarization by audio and visual analysis. In ICME, pages 1–6, 2011.

[Khosla et al., 2013] Aditya Khosla, Raffay Hamid, Chih-Jen Lin, and Neel Sundaresan. Large-scale video summarization using web-image priors. In CVPR, pages 2698–2705, 2013.

[Kim et al., 2014] Gunhee Kim, Leonid Sigal, and Eric P. Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR, 2014.

[Lee et al., 2012] Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, pages 1346–1353, 2012.

[Li et al., 2001] Ying Li, Tong Zhang, and Daniel Tretter. An overview of video abstraction techniques. HP Laboratories Palo Alto, 2001.

[Li et al., 2006] Ying Li, Shih-Hung Lee, Chia-Hung Yeh, and C.-C. Jay Kuo. Techniques for movie content analysis and skimming: tutorial and overview on video abstraction techniques. IEEE Signal Processing Magazine, 23(2):79–89, 2006.

[Li et al., 2011] Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha, and Yong Yu. Video summarization via transferrable structured learning. In WWW, pages 287–296. ACM, 2011.

[Ma et al., 2002] Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, and Mingjing Li. A user attention model for video summarization. In ACM Multimedia, pages 533–542, 2002.

[Ma et al., 2005] Yu-Fei Ma, Xian-Sheng Hua, Lie Lu, and Hong-Jiang Zhang. A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia, 7(5):907–919, 2005.

[Ogata and Vere-Jones, 1984] Y. Ogata and D. Vere-Jones. Inference for earthquake models: a self-correcting model. Stochastic Processes and Their Applications, 17(2):337–347, 1984.

[Wang et al., 2011] Zheshen Wang, Mrityunjay Kumar, Jiebo Luo, and Baoxin Li. Sequence-kernel based sparse representation for amateur video summarization. In Proceedings of the 2011 Joint ACM Workshop on Modeling and Representing Events, pages 31–36, 2011.

[Wang et al., 2012] Meng Wang, Richang Hong, Guangda Li, Zheng-Jun Zha, Shuicheng Yan, and Tat-Seng Chua. Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia, 14(4):975–985, 2012.

[Xu et al., 2014] Hongteng Xu, Hongyuan Zha, and Mark Davenport. Manifold based dynamic texture synthesis from extremely few samples. In CVPR, pages 3019–3026, 2014.

[Yan et al., 2010] Junchi Yan, Mengyuan Zhu, Huanxi Liu, and Yuncai Liu. Visual saliency detection via sparsity pursuit. IEEE Signal Processing Letters, 17(8):739–742, 2010.

[You et al., 2007] Junyong You, Guizhong Liu, Li Sun, and Hongliang Li. A multiple visual models based perceptive analysis framework for multilevel video summarization. IEEE Transactions on CSVT, 17(3):273–285, 2007.

[Zhou et al., 2013] Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, pages 641–649, 2013.
