

Proceedings of the 8th International Joint Conference on Natural Language Processing, pages 194–203, Taipei, Taiwan, November 27 – December 1, 2017. ©2017 AFNLP

Information Bottleneck Inspired Method For Chat Text Segmentation

S Vishal¹*, Mohit Yadav²*$, Lovekesh Vig¹ and Gautam Shroff¹
¹TCS Research New Delhi, India

²University of Massachusetts, Amherst
[email protected], [email protected], [email protected], [email protected]

Abstract

We present a novel technique for segmenting chat conversations using the information bottleneck method (Tishby et al., 2000), augmented with sequential continuity constraints. Furthermore, we utilize critical non-textual clues such as the time between two consecutive posts and people mentions within the posts. To ascertain the effectiveness of the proposed method, we have collected data from public Slack conversations and Fresco, a proprietary platform deployed inside our organization. Experiments demonstrate that the proposed method yields an absolute (relative) improvement of as high as 3.23% (11.25%). To facilitate future research, we are releasing manual annotations for segmentation on public Slack conversations.

1 Introduction

The prolific upsurge in the amount of chat conversations has notably influenced the way people wield language in conversations. Moreover, conversation platforms have now become prevalent for both personal and professional usage. For instance, in a large enterprise scenario, project managers can utilize these platforms for various tasks such as decision auditing and dynamic responsibility allocation (Joty et al., 2013). Logs of such conversations offer potentially valuable information for various other applications, such as the automatic assessment of possible collaborative work among people (Rebedea et al., 2011).

* indicates that both authors contributed equally. $ indicates that the author was at TCS Research New Delhi during the course of this work.

It is thus vital to invent effective segmentation methods that can separate discussions into small granules of independent conversational snippets. By 'independent', we mean that a segment should be as self-contained as possible and should discuss a single topic, such that the segment can be suggested if any similar conversation occurs again. As an outcome of this, various short text similarity methods can be employed directly. Segmentation can also potentially act as an empowering preprocessing step for various downstream tasks such as automatic summarization (Dias et al., 2007), text generation (Barzilay and Lee, 2004), information extraction (Allan, 2012), and conversation visualization (Liu et al., 2012). It is worth noting that chat segmentation presents a number of gruelling challenges, such as the informal nature of the text, the frequently short length of the posts, and a significant proportion of irrelevant interspersed text (Schmidt and Stone).

Research in text segmentation has a long history, going back to the earliest attempt of Kozima (1993). Since then many methods, including but not limited to, TextTiling (Hearst, 1997), Choi's segmentation (Choi, 2000), representation learning based on semantic embeddings (Alemi and Ginsparg, 2015), and topic models (Du et al., 2015a) have been presented. However, very little research effort has been devoted to segmenting informal chat text. For instance, Schmidt and Stone have attempted to highlight the challenges of chat text segmentation, though they have not presented any algorithm specific to chat text.

The Information Bottleneck (IB) method has been successfully applied to clustering in the NLP domain (Slonim and Tishby, 2000). Specifically, IB attempts to balance the trade-off between accuracy and compression (or complexity) while clustering the target variable, given a joint probability distribution between the target variable and an observed relevant variable. Similar to clustering, this paper interprets the task of text segmentation as a compression task with a constraint that allows only contiguous text snippets to be in a group.

The focus of this paper is to develop text segmentation methods for chat text utilizing the IB framework. In the process, this paper makes the following major contributions:

(i) We introduce an IB inspired objective function for the task of text segmentation.

(ii) We develop an agglomerative algorithm to optimize the proposed objective function that also respects the necessary sequential continuity constraint for text segmentation.

(iii) To the best of our knowledge, this paper is the first attempt to address segmentation for chat text and to incorporate non-textual clues.

(iv) We have created a chat text segmentation dataset and are releasing it for future research.

The remainder of this paper is organized as follows: we present a review of related literature in Section 2. Then, we formulate the text segmentation problem and define the necessary notations in Section 3. Following this, we explain the proposed methodology in Section 4. Section 5 presents experiments and provides details on the dataset, experimental set-up, baselines, results, and the effect of parameters. Finally, conclusions and potential directions for future work are outlined in Section 6.

2 Related Work

The IB method (Tishby et al., 2000) was originally introduced as a generalization of rate distortion theory, which balances the trade-off between the preservation of information about a relevance variable and the distortion of the target variable. Later on, similar to this work, a greedy bottom-up (agglomerative) IB based approach (Slonim and Tishby, 1999, 2000) was successfully applied to NLP tasks such as document clustering.

Furthermore, the IB method has been widely studied for multiple machine learning tasks, including but not limited to, speech diarization (Vijayasenan et al., 2009), image segmentation (Bardera et al., 2009), image clustering (Gordon et al., 2003), and visualization (Kamimura, 2010). In particular, similar to this paper, work on image segmentation has treated segmentation as the compression part of the IB method. However, image segmentation does not involve continuity constraints, as imposing them would prevent the exploitation of similarity within the image. Yet another similar attempt that utilizes information theoretic terms as an objective (only the first term of the IB approach) has been made for the task of text segmentation and alignment (Sun et al., 2006).

Broadly stated, a typical text segmentation method: (a) consumes text representations for every independent text snippet, and (b) applies a search procedure for segmentation boundaries while optimizing an objective for segmentation. Here, we review the literature on text segmentation by organizing it into 3 categories based on focus: Category1 - (a), Category2 - (b), and Category3 - both (a) and (b).

Category1 approaches utilize or benefit from the great amount of effort put into developing robust topic models that can model discourse in natural language texts (Brants et al., 2002). Recently, Du et al. (2013, 2015b) have proposed a hierarchical Bayesian model for unsupervised topic segmentation that integrates a point-wise boundary sampling algorithm used in Bayesian segmentation into a structured (ordering-based) topic model. For a more comprehensive view of classical work on topic models for text segmentation, we refer to Misra et al. (2009); Riedl and Biemann (2012). This work does not explore topic models; they are left as a direction for future research.

Category2 approaches comprise the different search procedures proposed for the task of text segmentation, including but not limited to, divisive hierarchical clustering (Choi, 2000), dynamic programming (Kehagias et al., 2003), and graph based clustering (Pourvali and Abadeh, 2012; Glavas et al., 2016; Utiyama and Isahara, 2001). This work proposes an agglomerative IB based hierarchical clustering algorithm, an addition to the arsenal of approaches that fall in this category.

Similar to the proposed method, Category3 cuts across both of the above introduced dimensions of segmentation. Alemi and Ginsparg (2015) have proposed the use of semantic word embeddings and a relaxed dynamic programming procedure. We have also argued for utilizing chat clues and introduced an IB based approach augmented with sequential continuity constraints. Yet another similar attempt has been made by Joty et al. (2013), in which they use topical and conversational clues and introduce an unsupervised random walk model for the task of text segmentation.

Beyond the above mentioned categorization, a significant amount of research effort has been put into studying evaluation metrics for text segmentation (Pevzner and Hearst, 2002; Scaiano and Inkpen, 2012). Here, we make use of the classical and most widely utilized metric, introduced by Beeferman et al. (1999). Also, there have been attempts to track topic boundaries for thread discussions (Zhu et al., 2008; Wang et al., 2008). While these methods look similar to the proposed method, they differ in that they attempt to recover thread structure with respect to a topic level view of the discussions within a thread community.

The most similar directions of research to this work are conversation trees (Louis and Cohen, 2015) and disentangling chat conversations (Elsner and Charniak, 2010). Both of these directions cluster independent posts, leading to topic labelling and segmentation of these posts simultaneously. It is important to note that these methods do not have a sequential continuity constraint and consider lexical similarity even between long distant posts (Elsner and Charniak, 2011). Moreover, if these methods are applied only for segmentation, then they are very likely to produce segments with relatively much shorter durations, as reflected in the ground truth annotations of the correspondingly released dataset (Elsner and Charniak, 2008). It is worth noting that Elsner and Charniak (2010) have also advocated utilizing time gaps and people mentions, similar to the proposed method of this work.

3 Problem Description And Notations

Let C be an input chat text sequence C = {c_1, ..., c_i, ..., c_{|C|}} of length |C|, where c_i is a text snippet such as a sentence or a post from chat text. In a chat scenario, a text post c_i will have a corresponding time-stamp c_i^t. A segment or a subsequence can be represented as C_{a:b} = {c_a, ..., c_b}. A segmentation of C is defined as a segment sequence S = {s_1, ..., s_p}, where s_j = C_{a_j:b_j} and b_j + 1 = a_{j+1}. Given an input text sequence C, segmentation is defined as the task of finding the most probable segment sequence S.
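To ground the notation, the following minimal Python sketch (our own, purely illustrative; the field names are not part of the paper's formalism) shows one way to represent posts and a segmentation:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Post:
    text: str                 # textual content of c_i
    timestamp: float          # time-stamp c_i^t
    poster: str               # author of the post
    mentions: List[str] = field(default_factory=list)  # people mentioned

# A chat sequence C is a list of posts; a segmentation S is a list of
# contiguous index spans (a_j, b_j) with b_j + 1 = a_{j+1}, e.g. for
# |C| = 7: S = [(0, 2), (3, 4), (5, 6)].
Segmentation = List[Tuple[int, int]]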

4 Proposed Methodology

This section first presents the proposed IB inspired method for text segmentation, which conforms to the necessary constraint of sequential continuity, in Section 4.1. Next, in Section 4.2, the proposed IB inspired method is augmented to incorporate important non-textual clues that arise in a chat scenario. More specifically, the time between two consecutive posts and people mentions within the posts are integrated into the proposed IB inspired approach for the text segmentation task.

4.1 IB Inspired Text Segmentation Algorithm

The IB method introduces a set of relevance variables R which encapsulate meaningful information about C while compressing the data points (Slonim and Tishby, 2000). Similarly, we propose that a segment sequence S should also contain as much information as possible about R (i.e., maximize I(R,S)), constrained by the mutual information between S and C (i.e., minimize I(S,C)). Here, C is a chat text sequence, following the notation introduced in the previous section. The IB objective can be achieved by maximizing the following:

F = I(R,S) − (1/β) × I(S,C)        (1)

In other words, the above IB objective function attempts to balance a trade-off between the most informative segmentation with respect to R and the most compact representation of C; here β is a constant parameter that controls the relative importance of the two terms.

Similar to Tishby et al. (2000), we model R as word clusters and optimize F in an agglomerative fashion, as explained in Algorithm 1. In simple words, the maximization of F boils down to agglomeratively merging the adjacent pair of posts that corresponds to the least value of d. In Algorithm 1, p(s) is equal to p(s_i) + p(s_{i+1}), and d(s_i, s_{i+1}) is computed using the following definition:

d(s_i, s_{i+1}) = JSD[p(R|s_i), p(R|s_{i+1})] − (1/β) × JSD[p(C|s_i), p(C|s_{i+1})]        (2)

Here, JSD indicates the Jensen-Shannon divergence. The computation of R and p(R,C) is explained later in Section 5.2. Algorithm 1 terminates once SC drops below a threshold θ, where SC is computed as follows:

SC = I(R,S) / I(R,C)        (3)

Algorithm 1: IB inspired text segmentation

  Input:  Joint distribution p(R,C); trade-off parameter β
  Output: Segmentation sequence S
  Initialization: S ← C;
      calculate ΔF(s_i, s_{i+1}) = p(s) × d(s_i, s_{i+1}) ∀ s_i ∈ S

  1: while the stopping criterion is false do
  2:     i ← argmin_{i'} ΔF(s_{i'}, s_{i'+1})
  3:     merge {s_i, s_{i+1}} ⇒ s ∈ S
  4:     update ΔF(s, s_{i−1}) and ΔF(s, s_{i+2})
  5: end while

The value of SC is expected to decrease due to a relatively large dip in the value of I(R,S) when more dissimilar clusters are merged. Therefore, SC provides strong clues to terminate the proposed IB approach. The inspiration behind this specific computation of SC comes from the fact that it has produced stable results on the similar task of speaker diarization (Vijayasenan et al., 2009). The value of θ is tuned by optimizing performance over a validation dataset, just like the other hyper-parameters.

The IB inspired text segmentation algorithm (Algorithm 1) respects the sequential continuity constraint, as it considers merging only adjacent pairs (see steps 2, 3, and 4 of Algorithm 1) while optimizing F, unlike the agglomerative IB clustering (Slonim and Tishby, 2000). As a result, the proposed IB based approach requires a limited number of computations; more precisely, linear in the number of text snippets.
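To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 (our own illustrative rendering, not the authors' released code). For clarity it recomputes every adjacent merge cost at each step, whereas step 4 of Algorithm 1 updates only the two affected pairs; weighting the JSD terms by the merge proportions is an assumption here, as are all function names. The default β and θ echo the tuned values for the Text variant in Table 3.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def mutual_info(P):
    # I(X;Y) for a joint distribution matrix P whose entries sum to 1
    return entropy(P.sum(axis=1)) + entropy(P.sum(axis=0)) - entropy(P.ravel())

def js_div(p1, p2, pi1, pi2):
    # Jensen-Shannon divergence with mixture weights pi1 + pi2 = 1
    return entropy(pi1 * p1 + pi2 * p2) - pi1 * entropy(p1) - pi2 * entropy(p2)

def ib_segment(P_rc, beta=1000.0, theta=0.4):
    """Sketch of Algorithm 1. P_rc is the joint p(R, C), shape (|R|, |C|)."""
    n = P_rc.shape[1]
    segs = [[j] for j in range(n)]        # every post starts as its own segment
    I_rc = mutual_info(P_rc)

    def d(a, b):                          # Eq. 2 for two adjacent segments
        pa, pb = P_rc[:, a].sum(), P_rc[:, b].sum()
        pi1, pi2 = pa / (pa + pb), pb / (pa + pb)
        pR_a, pR_b = P_rc[:, a].sum(axis=1) / pa, P_rc[:, b].sum(axis=1) / pb
        pC_a, pC_b = np.zeros(n), np.zeros(n)
        pC_a[a] = P_rc[:, a].sum(axis=0) / pa   # p(C|s): mass on the segment's posts
        pC_b[b] = P_rc[:, b].sum(axis=0) / pb
        return js_div(pR_a, pR_b, pi1, pi2) - js_div(pC_a, pC_b, pi1, pi2) / beta

    while len(segs) > 1:
        P_rs = np.stack([P_rc[:, s].sum(axis=1) for s in segs], axis=1)
        if mutual_info(P_rs) / I_rc <= theta:   # SC of Eq. 3 has dropped below theta
            break
        # Delta_F(s_i, s_{i+1}) = p(s) * d(s_i, s_{i+1}) for each adjacent pair
        costs = [P_rc[:, segs[i] + segs[i + 1]].sum() * d(segs[i], segs[i + 1])
                 for i in range(len(segs) - 1)]
        i = int(np.argmin(costs))
        segs[i:i + 2] = [segs[i] + segs[i + 1]]  # merge the cheapest adjacent pair
    return segs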

4.2 Incorporating Non-Textual Clues

As mentioned above, we submit that non-textual clues (such as the time between two consecutive posts and people mentions within the posts) are critical for segmenting chat text. To incorporate these two important clues, we augment Algorithm 1, developed in the previous section. More precisely, we modify d of Equation 2 to d̄ as follows:

d̄(s_i, s_{i+1}) = w_1 × d(s_i, s_{i+1}) + w_2 × (c^t_{a_{i+1}} − c^t_{b_i}) + w_3 × ||sp_i − sp_{i+1}||        (4)

Here, c^t_{a_{i+1}}, c^t_{b_i}, and sp_i represent the time-stamp of the first post of segment s_{i+1}, the time-stamp of the last post of segment s_i, and a representation of the poster information embedded in segment s_i, respectively. The sp_i representation is computed as a bag of posters, counting all the people mentioned in the posts as well as the posters themselves in a segment. w_1, w_2, and w_3 are weights indicating the relative importance of the distance terms computed for the three different clues. ||·|| in Equation 4 indicates the Euclidean norm.

It is important to note that Algorithm 1 utilizes d of Equation 2 to represent the textual dissimilarity between a pair of posts in order to achieve the optimal segment sequence S. Following the same intuition, d̄ in Equation 4 measures a weighted distance based not only on textual similarity but also on the information in time-stamps, posters, and people mentioned. The intuition behind the second distance term in d̄ is that if the time difference between two posts is small, then they are likely to be in the same segment. Additionally, the third distance term in d̄ is intended to merge segments that involve a higher number of common posters and people mentions. Following the same intuition, in addition to the changes in d, we modify the stopping criterion as well, while the rest stays the same as in Algorithm 1. The algorithm again terminates once SC drops below a threshold θ, where SC is now computed as follows:

SC = w_1 × I(R,S)/I(R,C) + w_2 × (1 − G(S)/G_max) + w_3 × H(S)/H_max        (5)

Here, the G(S) and H(S) mentioned in Equation 5 are computed as follows:

G(S) = Σ_{s_i ∈ S} (c^t_{b_i} − c^t_{a_i})        (6)

H(S) = Σ_{i=1}^{|S|−1} ||sp_i − sp_{i+1}||        (7)

              Slack   Fresco
# Threads         4       46
# Posts        9000     5000
# Segments      900      800
# Documents      73       73

Table 1: Statistics of the chat datasets.

The first term of SC in Equation 5 is taken from the stopping criterion of Algorithm 1, and the remaining second and third terms are derived analogously. Both the second and third terms decrease as the cardinality of S decreases, and they reflect behaviour analogous to the two introduced clues. The first term computes the fraction of information contained in S about R, normalized by the information contained in C about R; similarly, the second term computes the fraction of the time duration between segments normalized by the total duration of the chat text sequence (i.e., 1 minus the fraction of the durations of all segments normalized by the total duration), and the third term computes the sum of inter-segment distances in terms of poster information, normalized by the maximum of the same quantity (i.e., when each post is a segment).
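For illustration, a minimal sketch of the clue-augmented distance (Equation 4) and stopping score (Equation 5) follows. It is our own rendering under stated assumptions: the weight values shown are placeholders, and the helper names and argument layout are not from the paper.

import numpy as np

def d_bar(d_text, t_last_i, t_first_next, sp_i, sp_next, w=(0.4, 0.4, 0.2)):
    # Eq. 4: fuse the textual IB distance with the time gap between the
    # segments and the distance between their bag-of-posters vectors
    w1, w2, w3 = w
    return (w1 * d_text
            + w2 * (t_first_next - t_last_i)    # c^t_{a_{i+1}} - c^t_{b_i}
            + w3 * np.linalg.norm(np.asarray(sp_i) - np.asarray(sp_next)))

def stopping_score(info_ratio, seg_spans, sp_vectors, total_duration, h_max,
                   w=(0.4, 0.4, 0.2)):
    # Eq. 5: info_ratio is I(R,S)/I(R,C); seg_spans holds (t_start, t_end)
    # per segment; h_max is the value of H(S) when every post is a segment
    w1, w2, w3 = w
    G = sum(t_end - t_start for t_start, t_end in seg_spans)           # Eq. 6
    H = sum(np.linalg.norm(np.asarray(sp_vectors[i]) -
                           np.asarray(sp_vectors[i + 1]))
            for i in range(len(sp_vectors) - 1))                       # Eq. 7
    return w1 * info_ratio + w2 * (1 - G / total_duration) + w3 * H / h_max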

5 Experiments

This section starts with a description of the datasets collected from real world conversation platforms in Subsection 5.1. Later, in Subsection 5.2, we explain the evaluation metric utilized in our experiments. Subsection 5.3 describes the meaningful baselines developed for a fair comparison with the proposed IB approach. Next, in Subsections 5.4 and 5.5, we discuss the performance accomplished by the proposed approach on both of the collected datasets. Lastly, we analyse the stability of the proposed IB approach with respect to the parameters β and θ in Subsection 5.6.

5.1 Dataset Description

We have collected chat text datasets, namely, Slack and Fresco, from http://slackarchive.io/ and http://talk.fresco.me/, respectively. After that, we have manually annotated them for the text segmentation task. We have utilized annotations done by 3 workers, with problematic cases resolved by consensus. The datasets' statistics are given in Table 1. The collected raw data was in the form of threads, which were later divided into segments. Further, we have created multiple documents where each document contains N continuous segments from the original threads. N was selected randomly between 5 and 15. 60% of these documents were used for tuning hyper-parameters, which include the weights (w_1, w_2, w_3), θ, and β; the remaining were used for testing.

A small portion of one of the documents from the Slack dataset is depicted in Figure 1(a). Here, manual annotations are marked by bold black horizontal lines, and are also enumerated as 1), 2), and 3). Every text line is a post made by one of the users on the Slack platform during conversations. As mentioned above, in a chat scenario, every post has the following three integral components:

1) poster (indicated by the corresponding identity in Figure 1, from the beginning till '-=[*says'),
2) time-stamp (between '-=[*' and '*]=-'), and
3) textual content (after '*]=-::: ' till the end).
One must also notice that some of the posts have people mentions within them (indicated as '<@USERID>' in Figure 1).

To validate the differences between the collected chat datasets and traditional datasets such as Choi's dataset (Choi, 2000), we computed the fraction of words occurring with a frequency less than a given word frequency, as shown in Figure 2. It is clearly evident from Figure 2 that the chat segmentation datasets have a significantly higher proportion of infrequent words in comparison to the traditional text segmentation datasets. The presence of a large number of infrequent words makes it hard for textual similarity methods to succeed, as it increases the proportion of out-of-vocabulary words (Gulcehre et al., 2016). Therefore, it becomes even more critical to utilize the non-textual clues while processing chat text.
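The statistic behind Figure 2 can be computed as follows (a sketch; counting over word types rather than tokens is our assumption):

from collections import Counter

def fraction_below(corpus_tokens, max_freq):
    # fraction of vocabulary items that occur fewer than max_freq times
    counts = Counter(corpus_tokens)
    return sum(1 for c in counts.values() if c < max_freq) / len(counts)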

5.2 Evaluation and Setup

For performance evaluation, we have employed the Pk metric (Beeferman et al., 1999), which is widely utilized for evaluating the text segmentation task. A sliding window of fixed size k (usually half of the average length of all the segments in the document) slides over the entire document from top to bottom. Both inter and intra segment errors for all pairs of posts k apart are calculated by comparing the inferred and annotated boundaries.
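A minimal sketch of the Pk computation of Beeferman et al. (1999) follows; representing a segmentation as one segment id per post is our own choice:

def pk(reference, hypothesis, k=None):
    # reference / hypothesis: one segment id per post, e.g. [0, 0, 1, 1, 2]
    n = len(reference)
    if k is None:
        # half the average reference segment length, a common convention
        k = max(1, round(0.5 * n / len(set(reference))))
    disagreements = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        disagreements += int(same_ref != same_hyp)
    return disagreements / (n - k)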

We model the set of relevance variables R as word clusters estimated by utilizing agglomerative IB based document clustering (Slonim and Tishby, 2000), where posts are treated as relevance variables. Consequently, R comprises word clusters that are informative about posts. Thus, each entry p(r_i, c_j) in the matrix p(R,C) represents the joint probability of observing word cluster r_i in post c_j. We calculate p(r_i, c_j) simply by counting the common words in r_i and c_j and then normalizing.
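For instance, the counting-and-normalizing step can be sketched as follows (our own illustration; tokenization and smoothing details are assumptions):

import numpy as np

def build_joint(word_clusters, posts):
    # word_clusters: list of sets of word types (the relevance variables R)
    # posts: list of tokenized posts (the observations C)
    P = np.zeros((len(word_clusters), len(posts)))
    for j, post in enumerate(posts):
        for i, cluster in enumerate(word_clusters):
            P[i, j] = sum(1 for tok in post if tok in cluster)  # common words
    total = P.sum()
    return P / total if total else P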

5.3 Baseline Approaches

For comparisons, we have developed multiple baselines. In Random, 5 to 15 boundaries are inserted randomly. In the case of No Boundary, the entire document is labelled as one segment. Next, we implemented C-99 and Dynamic Programming, which are classical benchmarks for the text segmentation task.


Figure 1: (a) Manually created ground truth for Slack public conversations. Black lines represent segmentation boundaries. (b) Results obtained for multiple approaches. Text best read magnified.

Methods                    Span of Weights                                       Slack   Fresco
Random                     –                                                     60.6    54
No Boundary                –                                                     36.76   45
Average Time               –                                                     32      35
C-99                       –                                                     35.18   37.75
Dynamic Programming        –                                                     28.7    35
Encoder-Decoder Distance   –                                                     29      38
LDA Distance               –                                                     36      44
IB Variants:
Text                       w1 = 1, w2 = 0, w3 = 0                                33      42
TimeDiff                   w1 = 0, w2 = 1, w3 = 0                                26.75   34.25
Poster                     w1 = 0, w2 = 0, w3 = 1                                34.52   41.50
Text + TimeDiff            w1, w2 ∈ (0,1); w3 = 0; w1 + w2 = 1                   26.47   34.68
Text + Poster              w1, w3 ∈ (0,1); w2 = 0; w1 + w3 = 1                   28.57   38.21
Text + TimeDiff + Poster   w1, w2, w3 ∈ (0,1); w1 + w2 + w3 = 1                  25.47   34.80

Table 2: Performance evaluation: Pk metric [in terms of % error] for various methods. Lower is better.


Figure 2: Fraction of words less than a given word frequency.

Another very simple and yet effective baseline, Average Time, is prepared, in which boundaries are inserted after a fixed amount of time has elapsed. The fixed time interval is calculated from a separate portion of our annotated dataset.
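A sketch of this baseline, under the assumption that the interval is measured in the same units as the post time-stamps:

def average_time_boundaries(timestamps, interval):
    # place a boundary whenever `interval` has elapsed since the last boundary;
    # `interval` is the fixed duration estimated from held-out annotated data
    boundaries, t_prev = [], timestamps[0]
    for i, t in enumerate(timestamps[1:], start=1):
        if t - t_prev >= interval:
            boundaries.append(i)   # a new segment starts at post i
            t_prev = t
    return boundaries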

The next baseline utilized in our experiments is Encoder-Decoder Distance. In this approach, we have trained a sequence-to-sequence RNN encoder-decoder (Sutskever et al., 2014) utilizing 1.5 million posts from the publicly available Slack dataset, excluding the labelled portion. The network comprises 2 hidden layers, and the hidden state dimension was set to 256 for each. The encoded representations were greedily merged in an agglomerative fashion using Euclidean distance. The stopping criterion for this approach was similar to the third term in Equation 5, corresponding to poster information. The optimization of the hidden state dimension was computationally demanding and is hence left for future exploration. Similar to Encoder-Decoder Distance, we have developed LDA Distance, where the representations come from a topic model (Blei et al., 2003) with 100 topics.

5.4 Quantitative Results

The results for all the prepared baselines and the variants of IB on both the Slack and Fresco datasets are reported in Table 2. For both datasets, multiple variants of IB yield superior performance when compared against all the developed baselines. More precisely, for the Slack dataset, 4 different variants of the proposed IB based method achieve higher performance, with an absolute improvement of as high as 3.23% and a relative improvement of 11.25% when compared against the baselines.

Figure 3: Normalized frequency distribution of segment length for both the collected chat datasets.

In the case of the Fresco dataset, 3 different variants of the proposed method achieve superior performance, but not as significantly in terms of absolute Pk value as they do for the Slack dataset. We hypothesize that this behaviour is potentially because of the lower number of posts per segment for Fresco (5000/800 = 6.25) in comparison to Slack (9000/900 = 10). Also, note that the time clue alone in the IB framework performs best on the Fresco dataset, indicating that the relative importance of the time clue is higher for a dataset with shorter segments (i.e., a low number of posts per segment). To validate our hypothesis further, we estimate the normalized frequency distribution of segment length (number of posts per segment) for both datasets, as shown in Figure 3.

It is worth noting that the obtained empirical results support the major hypothesis of this work, as variants of IB yield superior performance on both datasets. Also, on incorporation of individual non-textual clues, improvements of 3.23% and 7.32% are observed from Text to Text+TimeDiff for Slack and Fresco, respectively; similarly, from Text to Text+Poster, improvements of 4.43% and 3.79% are observed for Slack and Fresco, respectively. Further, the best performance on both datasets is achieved by fusing both non-textual clues, indicating that the clues are complementary as well.

5.5 Qualitative Results

Results obtained for multiple approaches, namely, Average Time, IB:TimeDiff, and IB:Text+TimeDiff+Poster, corresponding to the small portion of chat text placed in part (a) of Figure 1, are presented in part (b) of Figure 1. The Average Time baseline (indicated by purple) managed to find three boundaries, albeit one of the boundaries is significantly off, potentially due to the constraint of a fixed time duration.

Similarly, the IB:TimeDiff approach also manages to find the first two boundaries correctly but fails to recover the third boundary. The results seem to indicate that the time clue alone is not very effective at reconstructing segmentation boundaries when segment length varies a lot within the document. Interestingly, the combination of all three clues, as happens in the IB:Text+TimeDiff+Poster approach, yielded the best results, as all three segmentation boundaries in the ground truth are recovered with high precision. Therefore, we submit that the incorporation of non-textual clues is critical to achieving superior results when segmenting chat text.

5.6 Effect Of Parameters

To analyse the behaviour of the proposed IB based methods, we compute the average performance metric Pk of IB:Text with respect to β and θ over the test set of the Slack dataset. Also, to facilitate the reproduction of results, we list the optimal values of all the parameters for all the variants of the proposed IB approach in Table 3.

Figure 4 shows the behaviour of the average performance evaluation metric Pk over the test set of the Slack dataset with respect to the hyper-parameter β. As mentioned above, the parameter β represents a trade-off between the preserved amount of information and the level of compression. It is clearly observable that the optimal value of β does not lie at the extremes, indicating the importance of both terms (as in Equation 1) of the proposed IB method. The coefficient of the second term (i.e., 1/β, equal to 10⁻³) is small. One could interpret the behaviour of the second term as a regularization term, because 1/β controls the complexity of the learnt segment sequence S. Furthermore, the optimal values in Table 3 for variants fusing two or more clues indicate the complementary and relative importance of the studied non-textual clues.

The average performance evaluation metric Pk over the test set of the Slack dataset with respect to the hyper-parameter θ is depicted in Figure 5. Figure 5 makes the appropriateness of the stopping criterion clearly evident. Initially, the average Pk value decreases as more coherent posts are merged, and it continues to decrease up to a particular value of θ. After that, the average Pk value starts increasing, potentially due to the merging of more dissimilar segments. The optimal value of θ varies significantly from one variant to another, requiring mandatory tuning over the validation dataset, as given in Table 3, for all IB variants proposed in this work.

Figure 4: Average evaluation metric Pk over the Slack dataset with respect to hyper-parameter β.

Figure 5: Average evaluation metric Pk over the Slack dataset with respect to hyper-parameter θ.

6 Discussion And Future Work

We started by highlighting the increasing importance of efficient methods for processing chat text, in particular for text segmentation. We have collected and introduced datasets for the same. Our introduction of chat text datasets has enabled us to explore segmentation approaches that are specific to chat text. Further, our results demonstrate that the proposed IB method yields an absolute improvement of as high as 3.23%. Also, a significant boost (3.79%-7.32%) in performance is observed on incorporation of non-textual clues, indicating their criticality. In the future, it will be interesting to investigate the possibility of incorporating semantic word embeddings in the proposed IB method (Alemi and Ginsparg, 2015).

                          Slack                              Fresco
IB Variants:              β     (w1, w2, w3)        θ        β     (w1, w2, w3)        θ
Text                      1000  (1, 0, 0)           0.4      1000  (1, 0, 0)           0.5
TimeDiff                  750   (0, 1, 0)           0.9      750   (0, 1, 0)           0.9
Poster                    750   (0, 0, 1)           0.09     750   (0, 0, 1)           0.1
Text+TimeDiff             750   (0.3, 0.7, 0)       0.75     750   (0.3, 0.7, 0)       0.75
Text+Poster               750   (0.1, 0, 0.9)       0.2      ∞     (0.3, 0, 0.7)       0.2
Text+TimeDiff+Poster      750   (0.24, 0.58, 0.18)  0.65     750   (0.10, 0.63, 0.27)  0.65

Table 3: Optimal values of parameters corresponding to results obtained by IB variants in Table 2.

References

Alexander A. Alemi and Paul Ginsparg. 2015. Text segmentation based on semantic word embeddings. arXiv preprint arXiv:1503.05543.

James Allan. 2012. Topic detection and tracking: event-based information organization, volume 12. Springer Science & Business Media.

Anton Bardera, Jaume Rigau, Imma Boada, Miquel Feixas, and Mateu Sbert. 2009. Image segmentation using information bottleneck method. IEEE Transactions on Image Processing, 18(7):1601–1612.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. arXiv preprint arXiv:0405039.

Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34(1-3):177–210.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Thorsten Brants, Francine Chen, and Ioannis Tsochantaridis. 2002. Topic-based document segmentation with probabilistic latent semantic analysis. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, CIKM '02, pages 211–218.

Freddy Y. Y. Choi. 2000. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33.

Gael Dias, Elsa Alves, and Jose Gabriel Pereira Lopes. 2007. Topic segmentation algorithms for text summarization and passage retrieval: An exhaustive evaluation. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2, AAAI'07, pages 1334–1339. AAAI Press.

Lan Du, Wray L. Buntine, and Mark Johnson. 2015a. Topic segmentation with a structured topic model. In HLT-NAACL. The Association for Computational Linguistics.

Lan Du, John K. Pate, and Mark Johnson. 2013. Topic segmentation with a structured topic model. In Proceedings of the North American Chapter of the Association for Computational Linguistics Conference, pages 190–200.

Lan Du, John K. Pate, and Mark Johnson. 2015b. Topic segmentation with an ordering-based topic model. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2232–2238.

Micha Elsner and Eugene Charniak. 2008. You talking to me? A corpus and algorithm for conversation disentanglement. pages 834–842.

Micha Elsner and Eugene Charniak. 2010. Disentangling chat. Computational Linguistics, 36:389–409.

Micha Elsner and Eugene Charniak. 2011. Disentangling chat with local coherence models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1179–1189.

Goran Glavas, Federico Nanni, and Simone Paolo Ponzetto. 2016. Unsupervised text segmentation using semantic relatedness graphs. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM 2016), pages 125–130.

Shiri Gordon, Hayit Greenspan, and Jacob Goldberger. 2003. Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ICCV '03, pages 370–, Washington, DC, USA. IEEE Computer Society.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. arXiv preprint arXiv:1603.08148.

Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. 2013. Topic segmentation and labeling in asynchronous conversations. Journal of Artificial Intelligence Research, 47:521–573.

Ryotaro Kamimura. 2010. Information-theoretic enhancement learning and its application to visualization of self-organizing maps. Neurocomputing, 73(13):2642–2664.

Athanasios Kehagias, Fragkou Pavlina, and Vassilios Petridis. 2003. Linear text segmentation using a dynamic programming algorithm. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics - Volume 1, pages 171–178. Association for Computational Linguistics.

Hideki Kozima. 1993. Text segmentation based on similarity between words. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, ACL '93, pages 286–288.

Shixia Liu, Michelle X. Zhou, Shimei Pan, Yangqiu Song, Weihong Qian, Weijia Cai, and Xiaoxiao Lian. 2012. TIARA: Interactive, topic-based visual text summarization and analysis. ACM Transactions on Intelligent Systems and Technology (TIST), 3(2):25.

Annie Louis and Shay B. Cohen. 2015. Conversation trees: A grammar model for topic structure in forums. In Proceedings of the Empirical Methods in Natural Language Processing, pages 1543–1553. The Association for Computational Linguistics.

Hemant Misra, Francois Yvon, Joemon M. Jose, and Olivier Cappe. 2009. Text segmentation via topic modeling: An analytical study. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 1553–1556.

Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28:19–36.

Mohsen Pourvali and Mohammad Saniee Abadeh. 2012. A new graph based text segmentation using Wikipedia for automatic text summarization. Editorial Preface, 3(1).

Traian Rebedea, Mihai Dascalu, Stefan Trausan-Matu, Gillian Armitt, and Costin Chiru. 2011. Automatic assessment of collaborative chat conversations with PolyCAFe. In European Conference on Technology Enhanced Learning, pages 299–312. Springer.

Martin Riedl and Chris Biemann. 2012. Text segmentation with topic models. In Journal for Language Technology and Computational Linguistics, volume 27.1, pages 47–69.

Martin Scaiano and Diana Inkpen. 2012. Getting more from segmentation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 362–366. Association for Computational Linguistics.

Alan P. Schmidt and Trevor K. M. Stone. Detection of topic change in IRC chat logs. http://www.trevorstone.org/school/ircsegmentation.pdf.

Noam Slonim and Naftali Tishby. 1999. Agglomerative information bottleneck. In Advances in Neural Information Processing Systems, pages 617–623.

Noam Slonim and Naftali Tishby. 2000. Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 208–215. ACM.

Bingjun Sun, Ding Zhou, Hongyuan Zha, and John Yen. 2006. Multi-task text segmentation and alignment based on weighted mutual information. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06, pages 846–847.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Naftali Tishby, Fernando C. N. Pereira, and William Bialek. 2000. The information bottleneck method. arXiv preprint arXiv:0004057.

Masao Utiyama and Hitoshi Isahara. 2001. A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pages 499–506.

Deepu Vijayasenan, Fabio Valente, and Herve Bourlard. 2009. An information theoretic approach to speaker diarization of meeting data. IEEE Transactions on Audio, Speech, and Language Processing, 17(7):1382–1393.

Yi-Chia Wang, Mahesh Joshi, William W. Cohen, and Carolyn Penstein Rosé. 2008. Recovering implicit thread structure in newsgroup style conversations. In Proceedings of The International Conference on Weblogs and Social Media, pages 152–160. The AAAI Press.

Mingliang Zhu, Weiming Hu, and Ou Wu. 2008. Topic detection and tracking for threaded discussion communities. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT'08. IEEE/WIC/ACM International Conference on, volume 1, pages 77–83. IEEE.
