Sub-topic clustering on tweets and generating brief pseudo summaries Summarization

IRE-2014: Sub-topic clustering on tweets and generating brief pseudo summaries

Sub-topic clustering on tweets and generating brief pseudo summaries, IIITH, Hyderabad. By Anil Kumar Sutrala, Raghav K, Dinesh Singla, Snigdha Verma

Team Members

Anil Sutrala, Snigdha Verma, Dinesh Singla, Raghav K

Introduction

Summarizing tweets can be viewed as an instance of the more general problem of automated text summarization.

A Twitter post, or tweet, is at most 140 characters long; in this study we consider only English posts.

Basic Idea

Identifying important entities from a cluster of tweets.

For each cluster, we identify the most important entities of each type (such as geographic location or person) using TF-IDF scores.

Finding the most important tweets using these important entities.

Generate a brief pseudo summary for each cluster using the important entities and important tweets.

Dataset

Labelled tweets are taken from the RepLab dataset. RepLab is a competitive evaluation exercise for Online Reputation Management systems.

In the dataset provided, each tweet has a tweet id, author, entity id and text.

The labeled dataset contains the fields tweet id, author, entity id, filtering, polarity, topic and topic priority.
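The topic field of the labeled records is enough to form per-topic groups of tweets. The following is a minimal sketch of that grouping, assuming each record is a Python dict; the key names and the sample tweets are made up for illustration and are not the RepLab file format.

```python
from collections import defaultdict


def build_base_clusters(tweets):
    """Group labeled tweet records by their topic name."""
    clusters = defaultdict(list)
    for tweet in tweets:
        clusters[tweet["topic"]].append(tweet)
    return dict(clusters)


# Hypothetical records; only a few of the labeled fields are shown.
sample_tweets = [
    {"tweet_id": "1", "topic": "Product launch", "text": "New model unveiled in Paris"},
    {"tweet_id": "2", "topic": "Pricing", "text": "Prices went up again"},
    {"tweet_id": "3", "topic": "Product launch", "text": "Launch event starts today"},
]

base_clusters = build_base_clusters(sample_tweets)
for topic in sorted(base_clusters):
    print(topic, len(base_clusters[topic]))
```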

MVP Model

Named Entity Recognition

Using the labeled data, we generate base tweet clusters for further processing by grouping tweets on their topic name.

We then use the Aritter NLP tool to identify named entities, attributes and attribute relations.

Generate TF-IDF scores for the recognized entities.

Location (the “geo-loc” named entity type in the Aritter classification) is given the highest priority among all named entity types.
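The slides do not give the exact TF-IDF formula, so the following is only a sketch under stated assumptions: each cluster is treated as one document, term frequency is the entity's share of all entity mentions in the cluster, and inverse document frequency is log(N / df) over clusters. The function names, the input shape and the sample entities are hypothetical, and putting "geo-loc" first in the ranking is one possible reading of the location priority.

```python
import math
from collections import Counter


def entity_tfidf(clusters):
    """clusters: {label: [(entity_text, entity_type), ...]}.

    Returns {label: {(entity_text, entity_type): tfidf_score}}.
    """
    n_clusters = len(clusters)
    doc_freq = Counter()
    for entities in clusters.values():
        doc_freq.update(set(entities))  # count each entity once per cluster

    scores = {}
    for label, entities in clusters.items():
        counts = Counter(entities)
        total = sum(counts.values())
        scores[label] = {
            ent: (cnt / total) * math.log(n_clusters / doc_freq[ent])
            for ent, cnt in counts.items()
        }
    return scores


def rank_entities(cluster_scores, priority_type="geo-loc"):
    """Sort entities with the priority type first, then by descending score."""
    return sorted(
        cluster_scores.items(),
        key=lambda item: (item[0][1] != priority_type, -item[1]),
    )


# Made-up entity mentions for two clusters.
clusters = {
    "A": [("Paris", "geo-loc"), ("Paris", "geo-loc"), ("Apple", "company")],
    "B": [("Apple", "company"), ("London", "geo-loc")],
}
scores = entity_tfidf(clusters)
ranked_a = rank_entities(scores["A"])
```

With this unsmoothed formula, an entity that occurs in every cluster gets a score of zero; a smoothed variant may be preferable on real data.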

Generate Summary Per Cluster

Generate a map from each named entity type to the named entities of that type found in the list of all tweets; call this map NETYPE_MAP.

A tweet summary is broadly one of three types:

– Case I: When the named entities with the maximum TF-IDF are all of location type.

– Case II: When no location-type named entity has the maximum TF-IDF and only the named entity type with the maximum TF-IDF is “important”.

– Case III: When the named entity types with the maximum TF-IDF include location and other types.
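The map and the three-way split above can be sketched as follows. This is an illustration only: the input shapes, the function names and the tie handling via math.isclose are assumptions, and the check looks only at which entity types share the maximum score, not at the "important" test used in Case II.

```python
import math
from collections import defaultdict

LOCATION = "geo-loc"  # location tag name, as given in the slides


def build_netype_map(entities):
    """Map each named entity type to the set of entity strings of that type."""
    netype_map = defaultdict(set)
    for text, ne_type in entities:
        netype_map[ne_type].add(text)
    return dict(netype_map)


def classify_cluster(scores):
    """scores: {(entity_text, entity_type): tfidf}. Returns 1, 2, 3 or None."""
    if not scores:
        return None
    top = max(scores.values())
    top_types = {
        ne_type
        for (_, ne_type), score in scores.items()
        if math.isclose(score, top)
    }
    if top_types == {LOCATION}:
        return 1  # all top-scoring entities are locations
    if LOCATION not in top_types:
        return 2  # no location among the top-scoring entities
    return 3      # locations and other types share the top score


# Made-up entities for illustration.
NETYPE_MAP = build_netype_map(
    [("Paris", "geo-loc"), ("Apple", "company"), ("London", "geo-loc")]
)
```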

Case 1

This case occurs when the named entities with the maximum TF-IDF are all of location type.

In this case, the summary is the collection of tweet texts that contain the location-type named entities with the maximum TF-IDF counts.
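A minimal sketch of this selection step is shown here, assuming entity scores are already available as a dict keyed by (entity text, entity type). Matching an entity against a tweet by case-insensitive substring is an assumption made for brevity; the slides do not say how the match is done, and the function name and sample tweets are hypothetical.

```python
import math


def case1_summary(tweets, scores, location_type="geo-loc"):
    """Return the tweet texts that mention a top-scoring location entity."""
    loc_scores = {
        text: score
        for (text, ne_type), score in scores.items()
        if ne_type == location_type
    }
    if not loc_scores:
        return []
    top = max(loc_scores.values())
    top_locations = [
        text.lower() for text, score in loc_scores.items()
        if math.isclose(score, top)
    ]
    return [
        tweet for tweet in tweets
        if any(loc in tweet.lower() for loc in top_locations)
    ]


# Made-up tweets and scores.
tweets = [
    "New model unveiled in Paris",
    "Queues outside the London store",
    "paris event was packed",
]
scores = {("Paris", "geo-loc"): 0.6, ("London", "geo-loc"): 0.2}
summary = case1_summary(tweets, scores)
```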

Case 2

This case occurs when no location-type named entity has the maximum TF-IDF and only the named entity type with the maximum TF-IDF is “important”.

A named entity type is marked as “important” only if its TF-IDF count is not less than half of the max TF-IDF count.

Here, max is the highest TF-IDF count of any NE type in the cluster.
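The half-of-max rule can be written in a few lines. This sketch assumes a TF-IDF count per named entity type has already been computed for the cluster; how entity scores are combined into a per-type count is not stated in the slides, so the input dict, the function name and the sample values are hypothetical.

```python
def important_types(type_scores):
    """Return the NE types whose score is at least half of the cluster maximum."""
    if not type_scores:
        return []
    top = max(type_scores.values())
    return sorted(
        ne_type for ne_type, score in type_scores.items()
        if score >= top / 2
    )


# Made-up per-type TF-IDF counts for one cluster.
type_scores = {"person": 8.0, "company": 4.0, "product": 3.0}
print(important_types(type_scores))
```

In this example "company" sits exactly on the threshold (4.0 is half of 8.0) and is kept, since the rule is "not less than half"; "product" falls below it and is dropped.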

Case 3

This case occurs when the named entity types with the maximum TF-IDF include location as well as other types (mixed case).

This becomes a subcase of Case 2. A named entity type is marked as “important” only if its TF-IDF count is not less than half of the max TF-IDF count.

Web UI for summaries

We generate a web UI for the tweet cluster summary.

Clusters are listed in alphabetical order, and each summary is generated in the following format.

Cluster Label: @ <Location> <List of Named entities> <Tweet Cluster Summary>
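As a rough illustration of this format, the sketch below renders one line per cluster in alphabetical order of the cluster label. The separators between fields (spaces, and commas inside the entity list), the function names and the sample data are assumptions; the slides only give the field order.

```python
def format_cluster_summary(label, location, entities, summary):
    """Render one cluster as 'Label: @ Location entities summary'."""
    return f"{label}: @ {location} {', '.join(entities)} {summary}"


def render_all(clusters):
    """clusters: {label: (location, [entities], summary)}; sorted by label."""
    return [
        format_cluster_summary(label, *clusters[label])
        for label in sorted(clusters)
    ]


# Made-up clusters for illustration.
clusters = {
    "Pricing": ("London", ["Apple"], "Prices went up again"),
    "Launch": ("Paris", ["Apple", "Tim Cook"], "New model unveiled in Paris"),
}
lines = render_all(clusters)
```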

Results

We have generated pseudo summaries for the tweet clusters and will compare them with the summaries produced by the TextRank tool.

A sample screenshot is shown in the next slide.

Results (sample screenshot)