Social Media and Text Analytics III WebST (19/7/2016) Social Media and Text Analytics III Topic Modelling and Trend Analysis; Semantic and Discourse Analysis of Social Media; Restrictions and Ethics of Social Media Usage Timothy Baldwin


Social Media and Text Analytics III WebST (19/7/2016)

Social Media and Text Analytics III: Topic Modelling and Trend Analysis; Semantic and Discourse Analysis of Social Media; Restrictions and Ethics of Social Media Usage

Timothy Baldwin


Talk Outline

1 Topic Modelling and Trend Analysis

2 Lexical Semantic Analysis of Twitter

3 Discourse Analysis of Web User Forums
  - Introduction
  - Experimental Setup
  - Experiments and Analysis
  - The Crowning Glory: Enhanced IR
  - Other NLP Research over Web Forums

4 Duplicate Question Detection
  - Introduction
  - Dataset
  - Methodology

5 Restrictions and Ethics of Social Media Usage

6 Overall Summary


Topic Modelling 101 I

Topic models have enduring popularity for document clustering/thematically understanding a document collection at a macro-level


Topic Modelling 101 II

Document-term counts:

        disk   mouse   rat
  d1:   0      1       3
  d2:   2      1       0
  d3:   1      0       3

Topic-term probabilities:

        disk   mouse   rat
  t1:   0.01   0.09    0.90
  t2:   0.65   0.35    0.00

Document-topic probabilities:

        t1     t2
  d1:   0.05   0.95
  d2:   0.99   0.01
  d3:   0.35   0.65

Source(s): Blei et al. [2003]


Topic Modelling of Twitter

The main challenges in applying topic modelling to Twitter are: (a) the large vocabulary; and (b) the short text length

Approaches to improving topic model performance over Twitter:

  - “smooth” topic allocations by document pooling based on hashtags etc. [Mehrotra et al., 2013]

  - use semi-supervised learning via labelled LDA [Ramage et al., 2010], based on hashtags, emoticons, etc.
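
As a rough illustration of the hashtag-based pooling idea, the sketch below merges tweets that share a hashtag into one longer pseudo-document before topic modelling. It is a minimal sketch and not the procedure of Mehrotra et al.; the function name `pool_by_hashtag`, the hashtag pattern and the handling of untagged tweets are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed, simplified hashtag pattern; real tokenisers handle more cases.
HASHTAG = re.compile(r"#\w+")


def pool_by_hashtag(tweets):
    """Group tweets into pseudo-documents keyed by lower-cased hashtag.

    Returns (pools, unpooled): pools maps each hashtag to the concatenation
    of all tweets containing it; unpooled lists the tweets with no hashtag,
    which remain individual short documents.
    """
    grouped = defaultdict(list)
    unpooled = []
    for tweet in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        if not tags:
            unpooled.append(tweet)
        for tag in sorted(tags):
            # A tweet with several hashtags contributes to several pools.
            grouped[tag].append(tweet)
    pools = {tag: " ".join(members) for tag, members in grouped.items()}
    return pools, unpooled
```

Each pooled string can then be treated as a single document by the topic model, which gives the sampler more word co-occurrence evidence than a lone tweet would.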


Twitter Trend Analysis

Twitter has emerged as a popular form of social media in recent years

Twitter users post “tweets”, i.e. short messages of up to 140 characters

Trend analysis applications display trending topics in Twitter

Trend analysis provides a way for users to identify popular discussions to follow or participate in


Novel Event Detection

Novel event detection has a relatively long history in NLP/IR (esp. in the context of TDT: Allan [2002]), but the vast majority of methods are retrospective and based on batch processing(!) [Kireyev et al., 2009, Diao et al., 2012]

For practical applications, real-time methods which can operate over a text stream are clearly needed

Recent surge of interest in streaming methods, largely based on detection of “bursty” terms [Zanzotto et al., 2011] but also some hashing approaches [Petrovic et al., 2010, Osborne et al., 2012]


Issues with Keyword-based Approaches

Most applications are keyword-based: trending topics are provided in the form of simple terms or hashtags

Short keywords fall short in providing fine-grained insights into the nature of the event

Motivation: Look for an alternate means of presenting trends

Topic model gives a potential solution: represent trends with a list of connected words.

Example

〈whitney houston rip love #whitneycnn kevin costner funeral r.i.p carter〉


Online Processing Variant of LDA

LDA processes documents in one single batch, which is not suitable for our purpose as tweets are generated constantly

There are also online variants of LDA [AlSumait et al., 2008] which are designed to enable LDA to scale to large document collections (but not suited to online trend analysis, as the model saturates over time)

We propose a new online processing variant of LDA that uses collapsed Gibbs sampling for approximate inference

The idea is simple: use the θ and φ counts from the previous model to serve as the α and β priors for the new model


Time Slice and Sliding Window

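Time is discretised into “slices”; a time slice contains a collection of tweets.

Model keeps a “window” of N slices of documents: at the arrival of new documents, old documents in the oldest time slice are discarded.

For previously seen documents and words:

$$\alpha'_{dt} = \frac{n(d,t)}{N_{\mathrm{old}}} \times D_{\mathrm{old}} \times T \times \alpha_0$$

$$\beta'_{tw} = \beta_0 \times (1 - c) + \frac{n(t,w)}{N_{\mathrm{old}}} \times T \times W_{\mathrm{new}} \times \beta_0 \times c$$

For new documents and words:

$$\alpha'_{dt} = \alpha_0; \quad \beta'_{tw} = \beta_0$$

s.t. $\sum \alpha' = \sum \alpha = D \times T \times \alpha_0$, and $\sum \beta' = \sum \beta = T \times W \times \beta_0$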

Source(s): Lau et al. [2012]
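
The prior update for previously seen documents and words can be sketched in a few lines of plain Python. This is only an illustration of the two formulas above, not the authors' implementation: the function names are hypothetical, the counts are passed as nested lists, and N_old is assumed to be the total number of counted tokens in the previous model.

```python
def updated_alpha(n_dt, alpha0):
    """alpha' for previously seen documents.

    n_dt[d][t] is the document-topic count n(d, t) from the previous model.
    N_old is assumed to be the sum of all these counts.
    """
    d_old = len(n_dt)
    n_topics = len(n_dt[0])
    n_old = sum(sum(row) for row in n_dt)
    scale = d_old * n_topics * alpha0 / n_old
    return [[count * scale for count in row] for row in n_dt]


def updated_beta(n_tw, beta0, c, w_new):
    """beta' for previously seen words.

    n_tw[t][w] is the topic-word count n(t, w) from the previous model,
    c is the contribution factor in [0, 1], and w_new is the vocabulary
    size of the new model (W_new in the formula).
    """
    n_topics = len(n_tw)
    n_old = sum(sum(row) for row in n_tw)
    scale = n_topics * w_new * beta0 * c / n_old
    return [[beta0 * (1.0 - c) + count * scale for count in row] for row in n_tw]
```

With c = 0 every β′ falls back to β0, so the previous model is ignored; with c = 1 the word priors are driven entirely by the previous topic-word counts. The α′ values for old documents sum to D_old × T × α0, in line with the constraint stated above.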


Time Slice Example

If time slice = 1 day:

[Figure: four consecutive time slices, k0 (2012-09-01), k1 (2012-09-02), k2 (2012-09-03) and k3 (2012-09-04), each holding a collection of tweets <Tweet 1> <Tweet 2> <Tweet 3> <Tweet 4> ...]


Sliding Window

If window = 2 days (and time slice = 1 day):

[Figure, three animation frames over time slices k0 (2012-09-01), k1 (2012-09-02), k2 (2012-09-03) and k3 (2012-09-04): as the window moves over the slices, the frames show Model m0, then Model m1, then Model m2]
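
The window bookkeeping itself is simple to sketch. The class below is a hypothetical illustration (the name `SlidingWindow` and its methods are not from the source) that keeps the most recent N time slices and drops the oldest slice when a new one arrives; retraining the topic model on the window contents is left out.

```python
from collections import deque


class SlidingWindow:
    """Keep the documents of the most recent n_slices time slices."""

    def __init__(self, n_slices):
        # A bounded deque discards the oldest slice automatically.
        self.slices = deque(maxlen=n_slices)

    def add_slice(self, label, tweets):
        """Append a new time slice, e.g. one day of tweets."""
        self.slices.append((label, list(tweets)))

    def labels(self):
        """Labels of the slices currently inside the window, oldest first."""
        return [label for label, _ in self.slices]

    def documents(self):
        """All tweets currently inside the window, oldest slice first."""
        return [tweet for _, tweets in self.slices for tweet in tweets]
```

With `n_slices=2` and daily slices, adding a third day leaves only the second and third days in the window, and the model would then be updated on those documents.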


Novelty of Our Model

Implements a flexible vocabulary: the vocabulary is updated at every model update.

Introduces a contribution factor c (ranging from 0 to 1) to control the influence of historic parameters.

Source(s): Lau et al. [2012]


Detecting Novel Topics

Topics evolve when the topic model is updated

To detect novel topics, i.e. newly emerged topics, we can track topics across different periods

Example of topic evolution for a topic (Topic ID 137):

  2012-02-10   #stalbans #harpenden #business #finance rate video ...
  2012-02-11   #stalbans #harpenden #finance #business #in ...
  2012-02-12   whitney houston r.i.p rip sad die news #stalbans ...
  2012-02-13   whitney houston r.i.p rip die sad news dead legend ...

Measure the evolution e(t) of a topic between adjacent periods using Jensen-Shannon divergence

Topic t is novel if its topic evolution score e(t) on a day exceeds a defined threshold

Source(s): Lau et al. [2012]
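
To make the evolution score concrete, here is a minimal sketch of Jensen-Shannon divergence between the word distributions of the same topic in two consecutive models. It assumes both distributions are already aligned over one shared vocabulary and sum to 1, and it uses base-2 logarithms so that scores fall between 0 and 1; the function names and the threshold value in the test are illustrative assumptions, not settings from the source.

```python
import math


def _kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; skips zero entries of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two aligned probability vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def novel_topics(prev_topics, curr_topics, threshold):
    """Return ids of topics whose evolution score e(t) exceeds the threshold.

    prev_topics[t] and curr_topics[t] are the word distributions of topic t
    in the previous and the current model.
    """
    return [
        t
        for t, (old, new) in enumerate(zip(prev_topics, curr_topics))
        if js_divergence(old, new) > threshold
    ]
```

A topic whose top words switch from local business hashtags to a breaking news story, as in the Topic ID 137 example, would produce a large score on the day of the switch and a small one on the days before and after.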


Synthetic Dataset

Generate a suite of synthetic datasets to test our model

Datasets are generated using tweets so that the data mimics real-world text

A dataset consists of background events and novel events

Background events are events that occur consistently over a period of time

Novel events are events embedded into the dataset for our model to detect

Source(s): Lau et al. [2012]


Background and Novel Events

Background events are generated using tweets that contain hashtags that occur consistently over a long period of time

A novel event is generated using tweets that contain hashtags about a current news event

Novel events range from natural disasters to celebrity deaths.

Tweets of a novel event are replaced with sentences from TDT3 documents that have a similar topic to the news event

In the dataset, a tweet = a document in the topic model

Source(s): Lau et al. [2012]


Synthetic Dataset Example

Event Type   Document Content
Background   ugh i be go to be so sore tomorrow
Background   rt @pagswagxo : next status i see about m . burn and i be gonna go insane .
Background   ! ! saatnya mencarus wifi supaya ipod touch gw conect ke internet , dan ngetweet via twitter for iph one ,
Novel        the kosovo information center claim serb police be pass out weapon to serb civilian in the region .
Background   rt @rickyricchi : rt @atikaftri : jan lupa nya jan lupa juga mention yaw ˆˆ
Background   @1aurenheilman lol , do you spell ”pet peeve ” wrong on purpose ?
Background   had2let it be know ! & thanks for txtn back - - rt @phliwidapencil lmfao rt @skrillafoccapo : all big booty aint good big booty ! !
Background   rt @desintadict cb : rtif u want follower ( cont ) http ://t.co/joej7wfz

Source(s): Lau et al. [2012]

Original Novel Event: Kim Jong Il’s death (#kimjongil, #kim #kimjong, ...)

Replaced Novel Event: Holbrooke-Milosevic Meeting (TDT3)


Justification for TDT3 Replacement

Precision: Original tweets that contain the novel hashtag could be spam or unrelated to the novel event

Recall: Background tweets might be related to the original novel event


Evaluation

At every update, calculate topic evolution score e(t) for each topic

Topics that exceed a defined threshold are determined as novel topics

Documents whose highest-probability topic is a novel topic are classified as novel documents

We know which documents belong to the novel events, so precision, recall and F-score can be calculated for the classification of novel documents
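
The evaluation steps above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: document-topic distributions are passed as nested lists, documents are identified by their index, and the function names are hypothetical.

```python
def novel_documents(theta, novel_topic_ids):
    """Indices of documents whose highest-probability topic is novel.

    theta[d][t] is the probability of topic t in document d.
    """
    novel = set(novel_topic_ids)
    predicted = []
    for doc_id, topic_probs in enumerate(theta):
        best_topic = max(range(len(topic_probs)), key=topic_probs.__getitem__)
        if best_topic in novel:
            predicted.append(doc_id)
    return predicted


def precision_recall_f(predicted, gold):
    """Precision, recall and F-score of predicted novel documents."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Here `gold` is the set of documents known to belong to the embedded novel event, which the synthetic dataset construction provides.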


Results

Number of topics T varies

Number of background events = 50

F-Scores of the classification of novel documents:

No. of Topics T   Van-Mitch   Washi-Mitch   Liege-Pinochet   Kim-Milosevic   Costa-Swissair
25                0.50        0.00          0.00             0.51            0.00
50                0.74        0.62          0.47             0.72            0.37
100               0.63        0.61          0.55             0.62            0.47
150               0.65        0.45          0.59             0.76            0.46

Source(s): Lau et al. [2012]


Results

Number of background events varies

Number of topics T = 50

F-Scores of the classification of novel documents:

No. of BG Events   Van-Mitch   Washi-Mitch   Liege-Pinochet   Kim-Milosevic   Costa-Swissair
25                 0.77        0.55          0.81             0.80            0.62
50                 0.74        0.62          0.47             0.72            0.37
100                0.61        0.53          0.00             0.82            0.45
150                0.45        0.34          0.00             0.70            0.46

Source(s): Lau et al. [2012]


Trend Detection in New York and London

Apply the proposed methodology over one month (February 2012) to look for trending topics on a daily basis in New York and London tweets

Number of topics T = 300

Other parameter settings are the same as in the synthetic experiment

Number of tweets averages around 50,000–60,000 per day per location

Source(s): Lau et al. [2012]


Top-13 Trending Topics in London

Date (UTC)   Topic   Topic Words
2012-02-05   95      snow #uks london finally settle look #snow garden nom food
2012-02-06   256     webb howard penalty unite chelsea #mufc game #cfc utd give
2012-02-09   49      capello england fabio resign manager italian job sink ship #capello
2012-02-11   74      suarez evra hand shake racist liverpool #lfc cunt #mufc win
2012-02-12   160     whitney houston rip die dead omg sad amy r.i.p believe
2012-02-12   168     whitney houston sad rip music r.i.p love bong voice remember
2012-02-12   197     whitney houston rip sad r.i.p peace love voice #whitneyhouston song
2012-02-12   137     whitney houston r.i.p rip sad die news #stalbans #harpenden dead
2012-02-13   91      #bafta win film award bafta artist watch meryl #baftas love
2012-02-13   49      zambium penalty ivory coast win #zambia cup zambia miss drogba
2012-02-14   81      happy valentine love &lt xxx ;3&lt dear follow fan load
2012-02-22   17      win brit #britawards award adele artist british watch international woman
2012-02-22   251     blur adele cut #brits love speech brit sound shit song

Source(s): Lau et al. [2012]


Top-11 Trending Topics in New York

Date (UTC)   Topic   Topic Words
2012-02-06   50      #giants #superbowl giant win fuck die touchdown #giantsnation pat root
2012-02-06   88      giant win #superbowl patriot wear shirt jersey fan superbowl today
2012-02-06   60      super bowl giant 2012 champion york fan move target sunday
2012-02-06   207     brady tom elus #superbowl #giants manning giant win catch game
2012-02-12   51      whitney houston rip sad die love #whitneyhouston music r.i.p #rip
2012-02-12   45      whitney houston rip sad r.i.p dead die omg wow damn
2012-02-13   227     chri brown #grammys rihanna bobby love coldplay grammy performance sing
2012-02-13   4       minaj nickus performance nicki #grammys wtf adele lol grammy perform
2012-02-14   273     valentine happy love single &lt today holiday word tomorrow heart
2012-02-19   250     whitney houston rip love #whitneycnn kevin costner funeral r.i.p carter
2012-02-27   246     #oscars win octavium oscar spencer speech love jlo look dress

Source(s): Lau et al. [2012]


Going “Social”

There has been quite a bit of work on author-topic models, where documents are linked by author metadata [Rosen-Zvi et al., 2004], and author community models, where documents are linked by the social network of their authors [Liu et al., 2009]; it would be relatively easy to combine this work with our trend analysis idea

It is also possible to inject textual features from author profiles etc. into topic models

Recent work on neural topic models [Larochelle and Lauly, 2012, Cao et al., 2015, Iyyer et al., 2016] opens up possibilities for more readily incorporating various network, user, community etc. features ... although getting it all to work in a trend analysis scenario where the model needs to be sensitised to localised changes in lexical distributions would be non-trivial


Conclusion

Topic modelling approach to detecting novel trends in streamed data based on:

  - online variant of LDA using a sliding window approach and “parameter memory” from previous iteration

  - detection of topic evolution via Jensen-Shannon divergence for a given topic between iterations

Excellent results over a synthetic dataset constructed based on Twitter and TDT3

Anecdotally strong results over raw Twitter data

2 Lexical Semantic Analysis of Twitter

NLP for Social Media

Lots of NLP research on Twitter

Lexical normalisation, text-based geolocation, POS tagging, named entity recognition, sentiment analysis ...

Lexical semantics?

  - Challenges for WSD: short, noisy text (lack of reliable context)

  - Possible benefits:
      - possible benefits to applications (e.g. sentiment analysis)
      - possible insights into how social media and conventional text differ


Word Usage Patterns

Conventional text

One sense per discourse [Gale et al., 1992]

First-sense heuristic [McCarthy et al., 2004]

Twitter

One sense per tweeter?

  - documents are too small to consider applying one sense per discourse, but we can possibly address the lack of context with user-level sense priors

First-sense heuristic?

  - shown to change substantially across domains, so not clear that it will work as well over Twitter


Resources

Sense inventory: Macmillan Dictionary
  - coarse-grained senses
  - regularly updated

Target lemmas: 20 nouns
  - high-to-mid frequency
  - medium polysemy: ≥ 3 senses

Source(s): Gella et al. [2014]


Datasets

4 datasets: {Twitter-1/2, ukWaC} × {rand, user}

  - ukWaC: more-conventional (web) text

  - rand: random sample of usages from Twitter-1/2/ukWaC

  - user: 5 usages of a given word from each user (Twitter-1/2) or document (ukWaC)

2000 items each: 100 usages of each noun

Source(s): Gella et al. [2014]


Annotation

Use Amazon Mechanical Turk for annotation

For each usage, pick the most appropriate sense(s), or “Other”

Quality control
  - included some gold-standard Macmillan example sentences in each HIT
  - filtered annotations based on accuracy over these items

Fleiss’ Kappa: 0.47–0.71

Source(s): Gella et al. [2014]
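
For readers unfamiliar with the agreement statistic, Fleiss' kappa can be computed as in the sketch below. It uses the standard textbook formula and assumes that every item was labelled by the same number of annotators and that each annotator chose exactly one category per item; the slides allow multiple senses per usage, so how the reported figures handled that case is not covered here.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators.
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean over items of the per-item agreement P_i.
    per_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions p_j.
    n_categories = len(ratings[0])
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which puts the reported range of 0.47–0.71 in context.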


Analysis

Average proportion of users/documents using a noun in the same sense across all 5 usages

  - Twitter_user: 65%

  - ukWaC_doc: 63%

One sense per tweeter heuristic is as strong as one sense per discourse

Source(s): Gella et al. [2014]
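
The statistic above can be sketched as follows, assuming the sense annotations are stored per noun and per user (or document). The data layout, the function names and the averaging over nouns are assumptions made for illustration; the source does not spell out the exact aggregation.

```python
def same_sense_rate(senses_by_group):
    """Proportion of groups (users or documents) with a single sense.

    senses_by_group maps a user or document id to the list of sense
    labels annotated for that group's usages of one noun.
    """
    groups = list(senses_by_group.values())
    consistent = sum(1 for senses in groups if len(set(senses)) == 1)
    return consistent / len(groups)


def average_same_sense_rate(annotations):
    """Average the per-noun rate over all nouns.

    annotations maps each noun to its senses_by_group dictionary.
    """
    rates = [same_sense_rate(groups) for groups in annotations.values()]
    return sum(rates) / len(rates)
```

For example, if both users are consistent for one noun but only one of two users is consistent for another, the average rate is 0.75.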


Analysis: Pairwise Agreement

Dataset              Partition   Agreement (%)
Gale et al. (1992)   document    94.4
Twitter_user         user        95.4
Twitter_user         (none)      62.9
Twitter_rand         (none)      55.1
ukWaC_doc            document    94.2
ukWaC_doc            (none)      65.9
ukWaC_rand           (none)      60.2

Source(s): Gella et al. [2014]
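
Pairwise agreement counts how often two usages of the same noun carry the same sense, either within a partition (the same user or document) or across the whole sample. The sketch below pools all pairs before dividing, which is an assumption; the source may aggregate per noun or per partition instead, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(senses):
    """Fraction of usage pairs with identical sense, ignoring partitions."""
    pairs = list(combinations(senses, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def partitioned_agreement(partitions):
    """Fraction of same-sense pairs, pairing usages only within a partition.

    partitions is a list of sense-label lists, one per user or document.
    """
    agreeing = total = 0
    for senses in partitions:
        for a, b in combinations(senses, 2):
            agreeing += a == b
            total += 1
    return agreeing / total
```

Comparing the two numbers for the same data shows how much of the agreement comes from the partition itself, which is the contrast the table draws between the "user"/"document" rows and the unpartitioned rows.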


Other Lexical Semantic Tales

Comparing Twitter_rand and ukWaC_rand:

First-sense tagging is less accurate in Twitter data
  - Twitter_rand: 45.3%
  - ukWaC_rand: 55.4%

Sense distributions are less skewed on Twitter
  - sense entropy lower for ukWaC_rand for 15 nouns

8/20 nouns have different first senses

More “Other” senses in Twitter data
  - Twitter_rand: 12.3%
  - ukWaC_rand: 6.6%

Source(s): Gella et al. [2014]
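
Both measures in this comparison are easy to compute from a list of annotated senses for one noun. The sketch below is illustrative only: it takes the "first sense" as a parameter because the source does not say how it was chosen, and it offers the most frequent sense in the sample as one possible choice.

```python
import math
from collections import Counter


def sense_entropy(senses):
    """Shannon entropy (in bits) of the sense distribution of one noun."""
    counts = Counter(senses)
    total = len(senses)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def most_frequent_sense(senses):
    """The most common sense label in the sample (ties broken by first seen)."""
    return Counter(senses).most_common(1)[0][0]


def first_sense_accuracy(senses, first_sense):
    """Accuracy of tagging every usage with the given first sense."""
    return sum(sense == first_sense for sense in senses) / len(senses)
```

A less skewed sense distribution has higher entropy and, for the same reason, a lower ceiling on first-sense accuracy, which matches the direction of the Twitter versus ukWaC figures above.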


Other Work on Lexical Semantic Analysis of Social Media

“Usage similarity” in Twitter [Gella et al., 2013]

Wikification/babelfication [Mihalcea and Csomai, 2007, Ferragina and Scaiella, 2010, Moro et al., 2014]

WordNet supersense tagging of Twitter data [Johannsen et al., 2014]


Opportunities for Lexical Semantic Analysis of Social Media

Impact of time on sense distributions (per user or overall)?

Interaction between geospatial and sociolinguistic factors on sense preferences?

Network/thread-level analysis (also for comments associated with given document, user forum threads)

Word sense (d)evolution over streamed data

Geospatial word sense dispersal

Interaction between user profile and word usage?


Summary

One sense per tweeter?
  - at least as strong as one sense per discourse

First-sense heuristic?
  - first-sense tagging is less accurate for Twitter

3 Discourse Analysis of Web User Forums

Introduction

Example Thread

HTML Input Code - CNET Coding & scripting Forums

Post 1 (User A), "HTML Input Code": ...Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...

Post 2 (User B), "Re: html input code": Part 1: create a form with a text field. See ... Part 2: give it a Javascript action

Post 3 (User C), "asp.net c# video": I’ve prepared for you video.link click ...

Post 4 (User A), "Thank You!": Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...

Post 5 (User D), "A little more help...": You would simply do it this way: ... You could also just ... An example of this is ...

Callouts: External Link; External Video; 500 words in total

Source(s): http://forums.cnet.com/

Discourse Structure of Forum Threads

Post 1 (User A), "HTML Input Code": ...Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...
Labels: Question-Question (attached to the root, Ø)

Post 2 (User B), "Re: html input code": Part 1: create a form with a text field. See ... Part 2: give it a Javascript action
Labels: Answer-Answer

Post 3 (User C), "asp.net c# video": I’ve prepared for you video.link click ...
Labels: Answer-Answer

Post 4 (User A), "Thank You!": Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...
Labels: Answer-Confirmation, Question-Add

Post 5 (User D), "A little more help...": You would simply do it this way: ... You could also just ... An example of this is ...
Labels: Answer-Answer

Source(s): Kim et al. [2010]

Research Aim and Contributions

Aim:

- jointly classify the discourse structure of forum threads

Contributions:

- apply structural learning and dependency parsing

- in situ classification analysis

Source(s): Wang et al. [2011b]

Experimental Setup

Dataset

From Kim et al. [2010], 1332 posts spanning 315 threads from CNET

Each post is labelled with one or more links, and each link is labelled with a dialogue act

- Question

* Question, Add, Correction, Confirmation

- Answer

* Answer, Add, Objection, Confirmation

- Resolution

- Reproduction

- Other

Most common label: 1+Answer-Answer (28.4%)

Source(s): Wang et al. [2011b]

Recap

Post 1 (User A), "HTML Input Code": ...Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...
Labels: 0+Question-Question (attached to the root, Ø)

Post 2 (User B), "Re: html input code": Part 1: create a form with a text field. See ... Part 2: give it a Javascript action
Labels: 1+Answer-Answer

Post 3 (User C), "asp.net c# video": I’ve prepared for you video.link click ...
Labels: 2+Answer-Answer

Post 4 (User A), "Thank You!": Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...
Labels: 1+Answer-Confirmation, 3+Question-Add

Post 5 (User D), "A little more help...": You would simply do it this way: ... You could also just ... An example of this is ...
Labels: 4+Answer-Answer
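Each label in the recap pairs a link distance with a dialogue act. As a rough illustration, the Python sketch below decodes such labels into explicit link targets. The `distance+Class-Subclass` string format is read off the slide; the names `LinkDA` and `decode` are hypothetical, and the assignment of labels to posts is our reading of the figure rather than something the slide states outright.

```python
from typing import NamedTuple, Optional


class LinkDA(NamedTuple):
    source: int            # 1-based index of the post carrying the label
    target: Optional[int]  # post being linked to, or None for the root (Ø)
    act: str               # dialogue act, e.g. "Answer-Answer"


def decode(post_index: int, label: str) -> LinkDA:
    """Decode a label such as '2+Answer-Answer' attached to a given post."""
    distance_text, act = label.split("+", 1)
    distance = int(distance_text)
    if distance == 0:
        # Distance 0 is taken to mean "no earlier post": attach to the root.
        return LinkDA(post_index, None, act)
    target = post_index - distance
    if target < 1:
        raise ValueError(f"{label!r} on post {post_index} points before the thread start")
    return LinkDA(post_index, target, act)


# Labels per post, as read from the recap figure (an assumption).
thread_labels = {
    1: ["0+Question-Question"],
    2: ["1+Answer-Answer"],
    3: ["2+Answer-Answer"],
    4: ["1+Answer-Confirmation", "3+Question-Add"],
    5: ["4+Answer-Answer"],
}

links = [decode(i, label) for i, labels in thread_labels.items() for label in labels]
multi_headed = [i for i, labels in thread_labels.items() if len(labels) > 1]
```

Because distances are always counted backwards, every decoded link points to an earlier post, and a post with more than one label (Post 4 here) has more than one head.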

Task Description

Main task: joint classification of inter-post links (Link) and dialogue acts (DA)

Explore two different learning approaches to the task

- a linear-chain CRF (CRFSGD)

- a dependency parser (MaltParser)

The task is a natural fit for dependency parsing, with some special properties:

- ⊕ strict reverse-chronological directionality (100%)

- non-projective dependencies (2%)

- multi-headedness (6%)

- disconnected sub-graphs (2%)

Source(s): Wang et al. [2011b]

Features

Structural features:

- Initiator: binary feature indicating whether the current post’s author is the thread initiator

- Position: relative position of the current post

Semantic features:

- TitSim: relative location of the post which has the most similar title to the current post

- PostSim: relative location of the post which has the most similar content to the current post

- Punct: number of question marks (QusCount), exclamation marks (ExcCount) and URLs (UrlCount) in the current post

- UserProf: class distribution of the current post’s author

Source(s): Kim et al. [2010]
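To make the feature list concrete, here is a minimal Python sketch of how some of these features might be computed for one post. It is not the authors' implementation: the normalisation of Position, the bag-of-words cosine used for PostSim, the URL pattern, and the dictionary layout of a post are all assumptions made for illustration. TitSim and UserProf are left out (TitSim would mirror PostSim over titles; UserProf needs training-set label counts).

```python
import math
import re
from collections import Counter
from typing import Dict, List

URL_PATTERN = re.compile(r"https?://\S+")  # assumed definition of "URL"


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9#]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def post_features(posts: List[Dict[str, str]], i: int) -> Dict[str, float]:
    """Features for posts[i]; each post is a dict with 'author' and 'body' keys."""
    post = posts[i]
    body = post["body"]
    feats = {
        "Initiator": float(post["author"] == posts[0]["author"]),
        "Position": i / max(len(posts) - 1, 1),  # assumed: 0.0 = first, 1.0 = last
        "QusCount": float(body.count("?")),
        "ExcCount": float(body.count("!")),
        "UrlCount": float(len(URL_PATTERN.findall(body))),
    }
    # PostSim (assumed form): how many posts back the most similar earlier post is.
    current = bag_of_words(body)
    best_distance, best_sim = 0, -1.0
    for j in range(i):
        sim = cosine(current, bag_of_words(posts[j]["body"]))
        if sim > best_sim:
            best_sim, best_distance = sim, i - j
    feats["PostSim"] = float(best_distance)
    return feats


posts = [
    {"author": "A", "body": "How do I create an input box for a user ID?"},
    {"author": "B", "body": "Create a form with a text field. See http://example.com/forms"},
    {"author": "A", "body": "Thanks a lot! How do I include the form in my site?"},
]
features = post_features(posts, 2)
```

For the third post in this toy thread, the author matches the thread initiator and the most similar earlier post is the opening question, two posts back.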

Experiments and Analysis

Post/thread-level Joint Classification F-scores

Method        CRFSGD          MaltParser
              (post/thread)   (post/thread)
Heuristic            .515/.311
NoFeatures    .508/.394       .533/.356
Joint +ALL    .756/.578       .738/.578

Post-level analysis: Initiator affects MaltParser significantly

Source(s): Wang et al. [2011b]

Thread-level analysis: the best thread-level F-scores from the two learners are not significantly different

Source(s): Wang et al. [2011b]

Threads Evolve Over Time

Post 1 (User A), "HTML Input Code": ...Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...
Labels: Question-Question (attached to the root, Ø)

Post 2 (User B), "Re: html input code": Part 1: create a form with a text field. See ... Part 2: give it a Javascript action
Labels: Answer-Answer

Post 3 (User C), "asp.net c# video": I’ve prepared for you video.link click ...
Labels: Answer-Answer

Post 4 (User A), "Thank You!": Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...
Labels: Answer-Confirmation, Question-Add

Post 5 (User D), "A little more help...": You would simply do it this way: ... You could also just ... An example of this is ...
Labels: Answer-Answer

In situ classification: compare the accuracy of different models when applied to partial threads vs. complete threads.

Classify the “Evolving Threads”

HTML Input Code - CNET Coding & scripting Forums

The same thread is classified at successive stages of its growth:

- Classify first 2 posts: Post 1 (User A), Post 2 (User B)

- Classify first 4 posts: Post 1 (User A), Post 2 (User B), Post 3 (User C), Post 4 (User A)

- Classify all posts: Post 1 (User A), Post 2 (User B), Post 3 (User C), Post 4 (User A), Post 5 (User D)

Evaluation of In situ Classification

The figure shows the full thread (Posts 1-5) alongside its 2-post prefix (Posts 1-2) and its 4-post prefix (Posts 1-4). Whichever of these is classified, the predictions are scored over a fixed extent of the thread:

- Evaluate first 2 posts

- Evaluate first 4 posts

In Situ Classification

Link-DA F-score for CRFSGD/MaltParser for in situ classification over sub-threads of different lengths, broken down over different post extents

Test \ Breakdown   [1, 2]      [1, 4]      [1, 6]      [1, 8]      [All]
[1, 2]             .947/.947   -           -           -           -
[1, 4]             .946/.947   .836/.841   -           -           -
[1, 6]             .946/.947   .840/.841   .800/.794   -           -
[1, 8]             .946/.947   .840/.841   .800/.794   .780/.769   -
[All]              .946/.946   .840/.838   .800/.791   .776/.767   .756/.738

From this, we conclude that our method can be robustly applied to real-time analysis of dynamically evolving threads.

Source(s): Wang et al. [2011b]
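The shape of this evaluation can be sketched in a few lines of Python: classify a prefix of the thread, then score the predictions over each post extent that the prefix covers. Everything below is an illustrative assumption rather than the authors' setup. The micro-averaged F-score over label sets is one plausible reading of "Link-DA F-score", `previous_post_baseline` is a hypothetical stand-in for a trained model, and the gold labels are those read from the recap figure.

```python
from typing import Callable, Dict, List, Set, Tuple

Labels = Dict[int, Set[str]]  # 1-based post index -> set of Link-DA labels


def micro_f(gold: Labels, pred: Labels, extent: int) -> float:
    """Micro-averaged F-score over the labels of posts 1..extent."""
    tp = fp = fn = 0
    for i in range(1, extent + 1):
        g, p = gold.get(i, set()), pred.get(i, set())
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def in_situ_scores(
    gold: Labels,
    classify: Callable[[int], Labels],
    prefix_lengths: List[int],
) -> Dict[Tuple[int, int], float]:
    """Map (classified prefix length, evaluated extent) to an F-score."""
    thread_length = max(gold)
    scores = {}
    for test_len in prefix_lengths:
        test_len = min(test_len, thread_length)
        pred = classify(test_len)  # predictions for posts 1..test_len only
        for extent in prefix_lengths:
            extent = min(extent, thread_length)
            if extent <= test_len:
                scores[(test_len, extent)] = micro_f(gold, pred, extent)
    return scores


def previous_post_baseline(n_posts: int) -> Labels:
    """Hypothetical stand-in: post 1 is a question, later posts answer the previous post."""
    pred = {1: {"0+Question-Question"}}
    for i in range(2, n_posts + 1):
        pred[i] = {"1+Answer-Answer"}
    return pred


gold = {
    1: {"0+Question-Question"},
    2: {"1+Answer-Answer"},
    3: {"2+Answer-Answer"},
    4: {"1+Answer-Confirmation", "3+Question-Add"},
    5: {"4+Answer-Answer"},
}
scores = in_situ_scores(gold, previous_post_baseline, [2, 4, 5])
```

The stand-in classifier ignores context, so its score for a given extent does not change with the prefix length. A trained model's predictions for early posts could change as later posts arrive, which is the effect the table's columns are designed to expose.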

The Crowning Glory: Enhanced IR

What’s Discourse Parsing got to do with it?

All well and good, but:

(a) does discourse parsing actually aid information access over user forums?

(b) are our models accurate enough to be useful?

Explore these questions relative to the ancestry.com dataset of Elsas [2011], in the context of IR

Best IR model of Elsas [2011] = perform IR over individual posts in each thread, score the thread via the geometric mean of the top-k retrieved posts’ scores (k = 5)

Source(s): Wang et al. [2013]
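The thread-scoring rule described above is simple enough to sketch directly. The Python below is a minimal illustration under stated assumptions: post scores are positive retrieval scores, and a thread with fewer than k retrieved posts is averaged over whatever it has (the original model may handle short threads differently). The function names are hypothetical.

```python
import math
from typing import Dict, List


def thread_score(post_scores: List[float], k: int = 5) -> float:
    """Geometric mean of the top-k post scores (assumes positive scores)."""
    top = sorted(post_scores, reverse=True)[:k]
    if not top or min(top) <= 0.0:
        return 0.0
    return math.exp(sum(math.log(s) for s in top) / len(top))


def rank_threads(scores_by_thread: Dict[str, List[float]], k: int = 5) -> List[str]:
    """Thread ids ordered from highest to lowest thread score."""
    return sorted(
        scores_by_thread,
        key=lambda thread_id: thread_score(scores_by_thread[thread_id], k),
        reverse=True,
    )


ranking = rank_threads({"a": [0.9, 0.1], "b": [0.5, 0.5]}, k=2)
```

Compared with an arithmetic mean, the geometric mean penalises a thread whose top posts include a very weak match, so thread "b" (two moderately relevant posts) ranks above thread "a" (one strong and one weak post) in this toy example.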

IR Evaluation over Ancestry Dataset

DA subset                   mAP_pref   p_pref@10
IR baseline [Elsas, 2011]   .657       .664
DAs  +ALL                   .668       .672
     –Qq                    .674       .678

Source(s): Wang et al. [2013]

Other NLP Research over Web Forums

Thread classification for topic, “solvedness” etc. [Feng et al., 2006, Baldwin et al., 2007, Wang et al., 2012]

Thread structure analysis, e.g. for summarisation [Wang and Rose, 2010] or information retrieval [Seo et al., 2009, Wang et al., 2011a]

Expert finding [Jurczyk and Agichtein, 2007, Bouguessa et al., 2008, Lin et al., 2009]

Question–answer pair extraction [Cong et al., 2008, Ding et al., 2008, Hong and Davison, 2009]

Post quality assessment [Weimer et al., 2007, Wanas et al., 2008, Lui and Baldwin, 2009]

The Road Ahead

Better user support within forums (thread recommendation,question/thread routing)

Research generally focused on specific forums; much to be done on cross-forum analysis (forum recommendation, cross-forum thread routing)

General-purpose discourse analyser for forum threads?

More use of user priors

Summary

“Discourse parsing” of DA and link structure of web user forum threads, via structured classification and dependency parsing

Findings:

- empirically little to separate simple CRF model and dependency parsing

- in situ classification: our method is robust over dynamically evolving threads

Demonstration of utility of discourse parsing in an IR context

Duplicate Question Detection

Introduction

Same or Different?

indenting paragraphs of text

i am trying to write a paper in latex . i am able to indent most of the paragraphs by using \\ at the end of the previous paragraph and adding a blank line in between the paragraphs like so ...

no indent in the first paragraph in a section ?

i am just curious that the following format looks good to you or not ? the preamble i used is just ... i do not feel it is so good because the first paragraph has no indent but the second does ...

Why do we Care?

Reduce community overhead in manually flagging duplicate questions in community question answering (cQA) forums

Immediate resolution to FAQs

Reduce false negative rates for newbie cQA questions

An interesting mix of research and engineering questions: how to do this accurately in real time

Constructing the Data

The first step in developing solutions to this problem is to construct a high-quality dataset

StackOverflow offers an immediate solution, but based on our analysis:

  - many “meta-answer” duplicates
  - false positive rate around 1%
  - false negative rate harder to measure, but at least 35%

Possible to infer probable false negatives through a combination of graph analysis and pooling the results of information retrieval systems

Effort underway to community crowdsource extra annotations

Source(s): Hoogeveen et al. [2015, 2016]
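
The graph-analysis half of this idea can be sketched as follows. This is an illustrative reading only, not necessarily the procedure used by Hoogeveen et al.: it assumes duplicate labels are treated as undirected edges, so that two questions in the same connected component with no direct label between them become candidate false negatives. The function name `probable_missing_duplicates` and the question IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def probable_missing_duplicates(labelled_pairs):
    """Return question pairs that are linked only transitively by duplicate labels."""
    parent = {}

    def find(q):
        # Union-find lookup with path halving.
        parent.setdefault(q, q)
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    # Merge the two questions of every labelled duplicate pair.
    for a, b in labelled_pairs:
        parent[find(a)] = find(b)

    # Group questions by the representative of their component.
    clusters = defaultdict(list)
    for q in parent:
        clusters[find(q)].append(q)

    # Any pair inside a cluster that lacks a direct label is a candidate.
    labelled = {frozenset(p) for p in labelled_pairs}
    candidates = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) not in labelled:
                candidates.add((a, b))
    return candidates


# Hypothetical labels: q1~q2 and q2~q3 are flagged, q1~q3 is not.
print(probable_missing_duplicates([("q1", "q2"), ("q2", "q3"), ("q4", "q5")]))
# prints {('q1', 'q3')}
```

Candidates found this way would still need manual checking, since duplicate links on a forum are not guaranteed to be transitive.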

Approach and Results

Approach:

1 run doc2vec dbow model [Le and Mikolov, 2014] over each question to generate a document embedding:

  - pre-train over full CQADupStack dataset, initialising the word embeddings with pre-trained Google word2vec embeddings

2 calculate question pair similarity based on simple cosine similarity, and rank all question pairs based on this representation

3 evaluate in terms of ROC AUC over the full set of question pairs

And does it work?

  - ROC AUC of 0.95 (averaged across 12 forums)
  - ... but not real time, and does not explicitly tell us what the duplicates are ... watch this space for more

Source(s): Hoogeveen et al. [2015], Lau and Baldwin [to appear]
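
Steps 2 and 3 can be sketched in a few lines. This is a minimal illustration, assuming the document embeddings from step 1 have already been computed elsewhere; the toy two-dimensional vectors, the question IDs and the function names are hypothetical. ROC AUC is computed here as the probability that a randomly chosen duplicate pair scores higher than a randomly chosen non-duplicate pair, with ties counting as half.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(embeddings, duplicate_pairs):
    """Score every question pair by cosine similarity and return the ROC AUC."""
    gold = {frozenset(p) for p in duplicate_pairs}
    pairs = list(combinations(sorted(embeddings), 2))
    scores = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
    labels = [frozenset(p) in gold for p in pairs]
    return roc_auc(scores, labels)


# Hypothetical embeddings: q1 and q2 point the same way, q3 does not.
toy = {"q1": [1.0, 0.0], "q2": [0.9, 0.1], "q3": [0.0, 1.0]}
print(evaluate(toy, [("q1", "q2")]))  # prints 1.0
```

Scoring all pairs is quadratic in the number of questions, which is one reason an approach of this shape is not real time.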

Data Restrictions: Twitter

Twitter is famously “open” as a service:

  - possible to crawl any (undeleted) public tweet ever posted to Twitter (in a rate-limited way)
  - possible to access “random” sub-sample of tweets via Streaming API (“garden hose” vs. “fire-hose”)

Each tweet is provided as a JSON object containing a wealth of message and user data (various user meta-data, geotag, language, basic social network data, thread data, ...)

Historically it was not possible to redistribute Twitter data, other than in the form of tweet IDs (for others to recrawl) ... although this has recently changed, with Twitter allowing direct redistribution of datasets of up to 50K tweets (in original JSON format)

It is possible to crawl social network data from Twitter, but this is heavily rate-limited
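
Redistribution by tweet ID amounts to stripping each JSON object down to its identifier. The following is a minimal sketch, assuming a crawl stored as one JSON object per line and the `id_str` field of the classic Twitter API payload; the function name `tweet_ids` and the sample data are hypothetical.

```python
import json


def tweet_ids(json_lines):
    """Yield the ID of each tweet in an iterable of JSON-encoded lines."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        obj = json.loads(line)
        # Assumption: real tweets carry a top-level "id_str"; other
        # messages (e.g. deletion notices) do not and are skipped.
        if "id_str" in obj:
            yield obj["id_str"]


# Hypothetical crawl: one tweet, one blank line, one deletion notice.
crawl = [
    '{"id_str": "1", "text": "hello", "lang": "en"}',
    "",
    '{"delete": {"status": {"id_str": "2"}}}',
]
print(list(tweet_ids(crawl)))  # prints ['1']
```

Anyone rebuilding the dataset from such an ID list will miss tweets deleted in the meantime, which is the reproducibility problem with ID-only distribution.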

Data Restrictions: Other Sites

YouTube terms of use similar to Twitter (but YouTube used much less for social media research)

Facebook offers very limited access to its data, and there is hence relatively little published research relating to it

Individual forums vary considerably in the terms of use of their data, with many commercial forums banning crawling, but many community-run forums having relatively permissive licenses

Wikipedia is perhaps the most open social media site: all content is available via a Creative Commons Attribution-ShareAlike 3.0 Unported License

Research Datasets

The Wikimedia Foundation provides periodic dumps of Wikipedia, which are heavily used for research purposes (although version data is often not provided in publications); also StackExchange, Reddit, ...

Spinn3r made available a large crawl of blog data as part of ICWSM-2011 [Burton et al., 2011], which is widely used

Various Twitter datasets have been made available, in the form of tweet IDs for others to crawl

  - issues for reproducibility of published results over Twitter

similarly with YouTube datasets

Large-scale datasets of user-tagged images (e.g. from Flickr)

Various other individual datasets made available through ICWSM

Ethics of Social Media Research

In “traditional” NLP datasets, the data is generated by organisations (commercial or otherwise) and published/licensed directly by that organisation, often without information identifying the authors of individual documents (with some exceptions, e.g. BNC, ANC)

In the case of social media sites, the data is generated by individual users of that site, often for personal use, in some cases with user-specific data privacy settings (e.g. Twitter, Facebook), and other cases with site-specific privacy settings (e.g. forums, Wikipedia)

User accounts are often associated with publicly-accessible user-declared profile data (e.g. age, gender, date-of-birth, location, ...), as well as site-generated “activity” statistics (e.g. date joined site, number of posts, number of followers, average user rating, ...)

How to be Ethical when Using Social Media Data

For publicly-available, publicly-crawlable data, surely anything goes?!

NO!

  - important to get institutional ethics (“IRB”) approval for social media data in cases where there is any interaction with the users or any publication of user-identifying information
  - generally OK to publish “aggregated” models/datasets (within the terms of use of a given site), as long as it is not possible to extract identifying user data from it

Vulnerability of anonymised social media datasets to “privacy attacks” (as the source data is often public)

Researchers are often faced with tradeoffs between scientific reproducibility and ethical data use

Ethics of In-Site Social Media Research

Social media sites are continually rolling out new functionality, or improving existing functionality, as part of which they use user interaction data to validate/A-B test new functionalities

Ideally, users should be made aware of any A-B testing (as it involves direct user interaction), but there is a subtle question of whether, in drawing users’ attention to the testing, the test is “faithful”

Infamous case of A-B testing relating to Facebook “news feeds”, in looking at the correlation between positive/negative information in a user’s feeds, and the sentiment in their own posts [Kramer et al., 2014]

  - ethical?

Final Words

Much has been done over social media, but even more remains to be done

Different social media sources present different challenges, but one common theme is the use of user information and various types of linking information

There is much more that can be done to properly “socialise” NLP approaches to social media text, with the intersection of network science and NLP being an immediately fruitful area of discovery

There is more to social media than Twitter

Social media is a many-splendored thing, with lots of room to play for all, and many open challenges for NLP

Acknowledgements

These slides are based heavily on joint work with Trevor Cohn, Nigel Collier, Paul Cook, Spandana Gella, Bo Han, Su Nam Kim, Marco Lui, Joakim Nivre, Afshin Rahimi, and Li Wang. The research was supported in part by the Australian Research Council, NICTA, Google and Xerox Research.

References I

James Allan. Introduction to topic detection and tracking. In James Allan, editor, Topic Detection and Tracking: Event-based Information Organization, pages 1–16. Kluwer, 2002.

Loulwah AlSumait, Daniel Barbará, and Carlotta Domeniconi. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM-08), pages 3–12, Washington, DC, USA, 2008.

Timothy Baldwin, David Martinez, and Richard B. Penman. Automatic thread classification for Linux user forum information access. In Proceedings of the Twelfth Australasian Document Computing Symposium (ADCS 2007), pages 72–79, Melbourne, Australia, 2007.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Mohamed Bouguessa, Benoît Dumoulin, and Shengrui Wang. Identifying authoritative actors in question-answering forums: The case of Yahoo! answers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08), pages 866–874, Las Vegas, USA, 2008. URL http://doi.acm.org/10.1145/1401890.1401994.

References II

Kevin Burton, Niels Kasch, and Ian Soboroff. The ICWSM 2011 Spinn3r dataset. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain, 2011.

Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. A novel neural topic model and its supervised extension. In Proceedings of the 29th Annual Conference on Artificial Intelligence (AAAI-15), pages 2210–2216, Austin, USA, 2015.

Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, and Yueheng Sun. Finding question-answer pairs from online forums. In Proceedings of 31st International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08), pages 467–474, Singapore, 2008.

Q. Diao, J. Jiang, F. Zhu, and E.P. Lim. Finding bursty topics from microblogs. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pages 536–544, Jeju Island, Korea, 2012.

Shilin Ding, Gao Cong, Chin-Yew Lin, and Xiaoyan Zhu. Using conditional random fields to extract context and answers of questions from online forums. In Proceedings of the 46th Annual Meeting of the ACL: HLT (ACL 2008), pages 710–718, Columbus, USA, 2008.

Jonathan L. Elsas. Ancestry.com online forum test collection. Technical report, Carnegie Mellon University, 2011.

References III

Donghui Feng, Erin Shaw, Jihie Kim, and Eduard Hovy. Learning to detect conversation focus of threaded discussions. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL ’06), pages 208–215, New York, USA, 2006.

Paolo Ferragina and Ugo Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010), pages 1625–1628, 2010.

William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the 4th DARPA Speech and Natural Language Workshop, pages 233–237, 1992.

Spandana Gella, Paul Cook, and Bo Han. Unsupervised word usage similarity in social media texts. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), pages 248–253, Atlanta, USA, 2013. URL http://www.aclweb.org/anthology/S13-1036.

Spandana Gella, Paul Cook, and Timothy Baldwin. One sense per tweeter ... and other lexical semantic tales of Twitter. In Proceedings of the 14th Conference of the EACL (EACL 2014), pages 215–220, Gothenburg, Sweden, 2014.

References IV

Liangjie Hong and Brian D. Davison. A classification-based approach to question answering in discussion boards. In Proceedings of the 32nd International ACM SIGIR conference on Research and Development in Information Retrieval, pages 171–178, 2009.

Doris Hoogeveen, Karin Verspoor, and Timothy Baldwin. CQADupStack: A benchmark data set for community question-answering research. In Proceedings of the Twentieth Australasian Document Computing Symposium (ADCS 2015), pages 3:1–3:8, Sydney, Australia, 2015.

Doris Hoogeveen, Karin Verspoor, and Timothy Baldwin. CQADupStack: Gold or silver? In Proceedings of the SIGIR 2016 Workshop on Web Question Answering Beyond Factoids (WebQA 2016), Pisa, Italy, 2016.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), pages 1534–1544, 2016. URL http://aclweb.org/anthology/N16-1180.

References V

Anders Johannsen, Dirk Hovy, Héctor Martínez Alonso, Barbara Plank, and Anders Søgaard. More or less supervised super-sense tagging of Twitter. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), pages 1–11, Dublin, Ireland, 2014.

Pawel Jurczyk and Eugene Agichtein. Discovering authorities in question answer communities by using link analysis. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM ’07), pages 919–922, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-803-9. doi: 10.1145/1321440.1321575. URL http://doi.acm.org/10.1145/1321440.1321575.

Su Nam Kim, Li Wang, and Timothy Baldwin. Tagging and linking web forum posts. In Proceedings of the 14th Conference on Natural Language Learning (CoNLL-2010), pages 192–202, Uppsala, Sweden, 2010.

K. Kireyev, L. Palen, and K. Anderson. Applications of topics models to analysis of disaster-related Twitter data. In NIPS Workshop on Applications for Topic Models: Text and Beyond, Whistler, Canada, 2009.

Adam D.I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790, 2014.

References VI

Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In Advances in Neural Information Processing Systems 25, pages 2708–2716, 2012.

Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, to appear.

Jey Han Lau, Nigel Collier, and Timothy Baldwin. On-line trend analysis with topic models: #twitter trends detection topic model online. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 1519–1534, Mumbai, India, 2012.

Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1188–1196, Beijing, China, 2014.

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, and Lei Zhang. Modeling semantics and structure of discussion threads. In Proceedings of the 18th International Conference on the World Wide Web (WWW 2009), pages 1103–1104, Madrid, Spain, 2009.

Yan Liu, Alexandru Niculescu-Mizil, and Wojciech Gryc. Topic-link LDA: joint models of topic and author community. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 665–672, 2009.

References VII

Marco Lui and Timothy Baldwin. You are what you post: User-level features in threaded discourse. In Proceedings of the 14th Australasian Document Computing Symposium (ADCS 2009), Sydney, Australia, 2009.

Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding predominant senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), pages 280–287, Barcelona, Spain, 2004.

Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), pages 889–892, Dublin, Ireland, 2013.

Rada Mihalcea and Andras Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 233–242, Lisbon, Portugal, 2007. URL http://doi.acm.org/10.1145/1321440.1321475.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244, 2014.

M. Osborne, S. Petrović, R. McCreadie, C. Macdonald, and I. Ounis. Bieber no more: First story detection using Twitter and Wikipedia. In Proceedings of the SIGIR 2012 Workshop on Time-aware Information Access, Oregon, USA, 2012.

References VIII

Saša Petrović, Miles Osborne, and Victor Lavrenko. Streaming first story detection with application to Twitter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 181–189, Los Angeles, USA, 2010.

Daniel Ramage, Susan T. Dumais, and Daniel J. Liebling. Characterizing microblogs with topic models. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM 2010), Washington D.C., USA, 2010.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494, 2004.

Jangwon Seo, W. Bruce Croft, and David A. Smith. Online community search using thread structure. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), pages 1907–1910, Hong Kong, China, 2009.

Nayer Wanas, Motaz El-Saban, Heba Ashour, and Waleed Ammar. Automatic scoring of online discussion posts. In Proceedings of the 2nd ACM Workshop on Information Credibility on the Web (WICOW’08), pages 19–26, Napa Valley, USA, 2008.

References IX

Hongning Wang, Chi Wang, ChengXiang Zhai, and Jiawei Han. Learning online discussion structures by conditional random fields. In Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR 2011), pages 435–444, Beijing, China, 2011a.

Li Wang, Marco Lui, Su Nam Kim, Joakim Nivre, and Timothy Baldwin. Predicting thread discourse structure over technical web forums. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 13–25, Edinburgh, UK, 2011b.

Li Wang, Su Nam Kim, and Timothy Baldwin. The utility of discourse structure in identifying resolved threads in technical user forums. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2739–2756, Mumbai, India, 2012.

Li Wang, Su Nam Kim, and Timothy Baldwin. The utility of discourse structure in forum thread retrieval. In Proceedings of the 9th Asian Information Retrieval Societies Conference (AIRS 2013), pages 284–295, Singapore, 2013.

Yi-Chia Wang and Carolyn P. Rosé. Making conversational structure explicit: identification of initiation-response pairs within online discussions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 673–676, 2010.

References X

Markus Weimer, Iryna Gurevych, and Max Mühlhäuser. Automatically assessing the post quality in online discussions on software. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL 2007), pages 125–128, Prague, Czech Republic, 2007.

F. Zanzotto, M. Pennacchiotti, and K. Tsioutsiouliklis. Linguistic redundancy in twitter. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 659–669, Edinburgh, United Kingdom, 2011.