Hw09 Understanding Natural Language

News and Blog Analysis

with Lydia

October 2nd, 2009

Charles Ward – Stony Brook University

Karthik Balaji, Levon Lloyd – General Sentiment

Outline

• Lydia System Overview

• News Analysis Examples

• Data and Workflow Organization

• Data Access Interface

• Conclusion

Large-Scale News/Blog Analysis

• The Lydia news/blog analysis system performs a daily analysis of more than 1,000 English and foreign-language online newspapers, plus blogs and other text sources.

• We currently track tens of millions of named entities in the news and blogs, providing spatial, temporal, relational, and sentiment analysis.

• Customers track entities of interest using reports generated in our user interface.

Lydia Text Analysis Phases

• Lydia performs named entity recognition and analysis over large text corpora.

• Spidering: Lydia spiders and parses thousands of online news sources. We also ingest the social media feed provided by Spinn3r.

• Named Entity Recognition: Lydia identifies and classifies occurrences of named entities (people, places, companies, etc.).

• Sentiment Analysis: Lydia assigns sentiment scores to identified entities using shallow NLP techniques.

• Entity Statistics Aggregation: Lydia digests marked-up text and produces usable entity statistics.

• Data Exploration: Aggregated entity statistics are made available through user interfaces and programming APIs for detailed exploration of the data.
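As a rough illustration of the aggregation phase, the sketch below turns a stream of entity mentions into per-entity monthly frequency series. The input shape (pairs of entity name and article date), the function name `monthly_frequency`, and the sample mentions are all assumptions made for this example; the actual Lydia markup and statistics formats are not shown in these slides.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_frequency(mentions):
    """Aggregate (entity, date) mentions into per-entity monthly counts."""
    series = defaultdict(Counter)
    for entity, day in mentions:
        # Bucket each mention by calendar month, e.g. "2009-01".
        series[entity][day.strftime("%Y-%m")] += 1
    # Return plain dicts with months in chronological order.
    return {entity: dict(sorted(counts.items())) for entity, counts in series.items()}


# Hypothetical mentions, invented for illustration only.
mentions = [
    ("Barack Obama", date(2009, 1, 20)),
    ("Barack Obama", date(2009, 1, 21)),
    ("Barack Obama", date(2009, 2, 3)),
    ("Michael Phelps", date(2009, 2, 1)),
]
series = monthly_frequency(mentions)
```

In a map-reduce setting the same grouping would typically be split into a map step that emits (entity, month) keys and a reduce step that sums the counts.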

Lydia Architecture

Frequency Time Series

• Michael Vick references (2004-2009)

• Mel Gibson references (2004-2009)

Sentiment Analysis

• Michael Phelps sentiment score (June 2008-Feb 2009)

• David Paterson sentiment score (Jan 2008-Jul 2009)

Comparative Analysis

• Peyton Manning vs. Eli Manning

Heatmaps

• Arnold Schwarzenegger

• Alabama

Ethnic Biases in News Coverage

Frequency of coverage of entities with Hispanic names in the U.S. news, 2004-2008

Percentage of population self-reporting as Hispanic in the 2000 census. Courtesy of Wikipedia.

Ethnic Biases in News Coverage

• (a) African

• (b) Hispanic

• (c) East Asian

• (d) Indian

• (e) Eastern European

• (f) Muslim

Juxtaposition Analysis

• Top Juxtapositions for Barack Obama

• Juxtapositions between Barack Obama and John McCain

Hadoop in Lydia

• The legacy Perl NLP pipeline runs in parallel on Hadoop Streaming, generating articles with marked-up entities which are stored as compressed XML in HDFS.

• To build or update Lydia entity statistics and indexes for a single text corpus, over 80 map-reduce jobs are necessary.

• We have developed a custom workflow management framework in Amazon EC2 to manage our data and processing.

Lydia Workflow Framework

High-level concepts:

• A depository is a statistics dataset derived from a text corpus. It consists of artifacts.

  – Stored as a directory structure in HDFS.

• An artifact is a homogeneous dataset of a specific type.

  – Examples:

    – Key-value artifacts, e.g. entity name -> frequency time series

    – Lucene index artifacts (entity and article indexes)

  – Stored as a directory in HDFS containing several map-reduce job output subdirectories named as date ranges (we update at daily granularity).

Artifact Dependencies

Most Lydia artifacts are derived from other artifacts.

Artifact Storage

Lydia artifacts are stored in HDFS inside the depository directory:

/dailies (depository name)
  /EXACT_DUP_ARTICLES (artifact name)
    /2004_11_01-2009_03_31 (date range-named MR output)
      /part-00000
      . . .
      /part-00017
    /2009_04_01-2009_04_02
      . . .
    /2009_04_03
      . . .
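The date range naming above can be decoded mechanically. The following is a minimal sketch, assuming that a name with two dates is an inclusive range and that a single-date name such as `2009_04_03` denotes a one-day range; the helper name `parse_date_range` is hypothetical and not part of Lydia.

```python
from datetime import date, datetime


def parse_date_range(dirname):
    """Parse '2004_11_01-2009_03_31' or '2009_04_03' into (start, end) dates."""
    name = dirname.strip("/")
    # partition() leaves end_text empty when the name holds a single date.
    start_text, _, end_text = name.partition("-")
    start = datetime.strptime(start_text, "%Y_%m_%d").date()
    # Assumption: a single date stands for a one-day range.
    end = datetime.strptime(end_text, "%Y_%m_%d").date() if end_text else start
    return start, end


full_history = parse_date_range("/2004_11_01-2009_03_31")
single_day = parse_date_range("/2009_04_03")
```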

Job Input Selection

• Artifact updates are incrementally propagated through the dependency graph.

• Multiple date ranges (sometimes overlapping) typically exist for each artifact.

• Some small artifacts get fully rebuilt on every update.
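One plausible way to pick the input for an incremental update is to compare the days covered by an input artifact with the days already covered by the artifact derived from it. The sketch below is an assumption about how such a selection could work, not the rule Lydia actually uses; the function names and the sample ranges are invented for illustration.

```python
from datetime import date, timedelta


def days_in(start, end):
    """Return the set of days in the inclusive range [start, end]."""
    return {start + timedelta(days=offset) for offset in range((end - start).days + 1)}


def covered_days(ranges):
    """Union of all days covered by a list of (start, end) ranges; overlaps are harmless."""
    days = set()
    for start, end in ranges:
        days |= days_in(start, end)
    return days


def pending_days(input_ranges, output_ranges):
    """Days present in the input artifact but not yet in the derived artifact."""
    return sorted(covered_days(input_ranges) - covered_days(output_ranges))


# Hypothetical ranges for an input artifact and the artifact derived from it.
input_ranges = [
    (date(2009, 4, 1), date(2009, 4, 2)),
    (date(2009, 4, 2), date(2009, 4, 3)),  # overlaps the previous range
]
output_ranges = [(date(2009, 4, 1), date(2009, 4, 2))]
todo = pending_days(input_ranges, output_ranges)
```

Small artifacts that are fully rebuilt on every update would skip this step and read every available input range.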

Depository Build Scheduling

• The same tool is used for the initial depository build and for updating it with new data.

• Any set of target artifacts to build can be specified, similarly to a makefile. Prerequisites of the targets are automatically identified.

• Artifacts are built in the correct order according to dependencies.

• The build process runs as a sequence of Hadoop map-reduce jobs and occasional serial jobs.
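The makefile-like behavior described above amounts to collecting the transitive prerequisites of the requested targets and ordering them topologically. The sketch below shows one way to do that with Python's standard `graphlib` module (Python 3.9 or later). The dependency table and every artifact name in it are hypothetical; the real Lydia dependency graph is not reproduced in these slides.

```python
from graphlib import TopologicalSorter


def build_order(dependencies, targets):
    """Return the targets plus their transitive prerequisites in a valid build order."""
    needed = set()
    stack = list(targets)
    while stack:
        artifact = stack.pop()
        if artifact not in needed:
            needed.add(artifact)
            stack.extend(dependencies.get(artifact, ()))
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {artifact: set(dependencies.get(artifact, ())) for artifact in needed}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical artifact dependency table (artifact -> prerequisites).
DEPENDENCIES = {
    "MARKED_UP_ARTICLES": set(),
    "ENTITY_FREQUENCY": {"MARKED_UP_ARTICLES"},
    "ENTITY_SENTIMENT": {"MARKED_UP_ARTICLES"},
    "ENTITY_INDEX": {"ENTITY_FREQUENCY", "ENTITY_SENTIMENT"},
    "ARTICLE_INDEX": {"MARKED_UP_ARTICLES"},
}
order = build_order(DEPENDENCIES, ["ENTITY_INDEX"])
```

Each name in `order` would then correspond to one or more map-reduce or serial jobs; artifacts that are not prerequisites of the targets (here `ARTICLE_INDEX`) are left untouched.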

Amazon EC2

• We run Hadoop on Amazon EC2.

– Quickly scale capacity as requirements change.

• 10 extra large nodes for weekly data processing.

• Amazon S3 is our persistent data store.

• All our web services are hosted on dedicated Amazon nodes.

• S3 is not meeting our required level of service.

– Moving to EBS

Depository Server

• Random access to the Lydia depository, e.g.:

– Monthly frequency time series of Barack Obama in all U.S. sources

– Top juxtapositions for Continental Airlines in February 2009

– Sentiment time series for Michael Phelps in all U.S. sources

• Uses the mapfiles generated by map-reduce jobs.

• Currently not distributed (but we can put different depositories on different machines).

• Provides a caching subsystem to reduce the number of HDFS accesses.
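The slides do not say how the caching subsystem works. As one possible shape, the sketch below puts a small least-recently-used cache in front of an expensive lookup; the LRU policy, the class name `LruCache`, and the in-memory dictionary standing in for a mapfile read are all assumptions.

```python
from collections import OrderedDict


class LruCache:
    """Least-recently-used cache in front of an expensive lookup function."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch          # called only on a cache miss
        self._capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value


# A dictionary stands in for the slow backing store (hypothetical data).
backing_store = {"a": 1, "b": 2, "c": 3}
cache = LruCache(backing_store.__getitem__, capacity=2)
first = cache.get("a")
second = cache.get("a")   # served from the cache
cache.get("b")
cache.get("c")            # evicts "a", the least recently used key
cache.get("a")            # must be fetched again
```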

Artifact Date Range Merging

• The depository server combines results from multiple groups of mapfiles on the fly (MR output = date range = mapfile group).

• This may result in performance problems and memory shortages (direct memory buffers).

• Solution: limit the number of covering date ranges to O(log N) after N daily updates.
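The slide states the O(log N) goal but not the merging policy. One well-known scheme that meets it is binary-counter merging: append each day as its own range, then merge the two newest ranges whenever they have equal length. The sketch below assumes this scheme and uses integer day indexes instead of calendar dates for brevity; it is not necessarily what Lydia does.

```python
def _length(date_range):
    """Number of days in an inclusive (start, end) range of day indexes."""
    return date_range[1] - date_range[0] + 1


def add_daily_update(ranges, day):
    """Append a one-day range, then merge equal-sized neighbours (binary-counter style)."""
    ranges.append((day, day))
    while len(ranges) >= 2 and _length(ranges[-1]) == _length(ranges[-2]):
        newer = ranges.pop()
        older = ranges.pop()
        # In a real system this step would be a job that merges two mapfile groups.
        ranges.append((older[0], newer[1]))
    return ranges


ranges = []
for day in range(1000):
    add_daily_update(ranges, day)
```

With this policy the range lengths are distinct powers of two, so after N updates the number of ranges equals the number of 1 bits in N, which is at most log2(N) + 1. The trade-off is that some days trigger several merges in a row, although the amortized cost per update stays logarithmic.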

Conclusion

• Using Hadoop greatly improved Lydia's performance and scalability (up to 20x).

• Lydia with Hadoop makes new types of automated analysis of web-scale content possible.
