Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases Dr. Maria Grineva Systems Group @ ETH Zurich Sunday, April 7, 13


Page 1: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Dr. Maria Grineva

Systems Group @ ETH Zurich

Sunday, April 7, 13

Page 2: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Today’s Search is Based On Links

• Full-text search is the main way to access information on the Web

• The goal of Web search engines: find the pages most relevant to the user’s query

• Google employs the Web’s hyperlinks to compute relevance of a Web page (PageRank)
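As a rough illustration of link-based ranking, the sketch below runs a basic power-iteration PageRank over a toy link graph. The graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are assumptions made for illustration; none of them come from the slides or from Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): nothing links to "d".
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with no inbound links ("d") ends up with only the baseline score, which is the situation link-based ranking cannot resolve on its own.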

22 March 2011 Systems Group @ ETH Zurich for Hasler Foundation


Page 3: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Domains Without Links

• PageRank does not work when documents are not interlinked

• Breaking news and blog posts - must be searchable in real time, before any links to them have been created

• Enterprise databases - documents are not well interconnected because of organizational silos and the limited number of people who create and use them


Page 4: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Web-based User-Generated Knowledge Bases

• To rank and organize documents that are not interlinked well, we need additional knowledge bases:

• Wikipedia - Online encyclopedia

• Twitter - real-time microblogging service


Page 5: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

The Goal of This Project

Develop a technology which automatically extracts semantic information:

• from Wikipedia - term meanings, relationships, ontologies ...

• from Twitter - real-time information about breaking news, trends, people’s opinions ...

and applies this information to organize:

• news and blogs on the Web

• documents in enterprise databases

We will release our technology as an open-source software framework


Page 6: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Semantic Text Analysis Using Wikipedia

• Leveraging Wikipedia to improve text analysis methods:

• Comprehensive coverage (6M terms vs. 65K in Britannica)

• Continuously brought up-to-date

• Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, info-boxes)

• New algorithms:

• Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic Inference

• Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds

• Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation

• Zero-cost deployment and customization: No need to train methods, no human labor, no “cold start” problem


Page 7: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Basic Technique: Semantic Relatedness of Terms

• We analyze Wikipedia’s link structure to compute the semantic relatedness of Wikipedia terms

• We use the Dice measure with weighted hyperlinks (bi-directional links, direct links, “see also” links, etc.)

Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
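One plausible way to read "Dice measure with weighted hyperlinks" is sketched below: each article is described by the articles it is linked with, each link carries a weight that depends on its type, and relatedness is the weighted overlap normalized by the total weight. The weight values, the `min`-based overlap, and all names and link data here are assumptions for illustration; the slides do not give the actual weights or formula, and this sketch does not implement SimRank from the cited paper.

```python
# Hypothetical weights per link type; the slides do not state the real values.
LINK_WEIGHTS = {"bidirectional": 1.0, "see_also": 0.8, "direct": 0.5}


def weighted_neighbors(typed_links):
    """Turn {neighbor_title: link_type} into {neighbor_title: weight}."""
    return {title: LINK_WEIGHTS[kind] for title, kind in typed_links.items()}


def dice_relatedness(links_a, links_b):
    """Weighted Dice: 2 * shared weight / (weight of A + weight of B)."""
    wa = weighted_neighbors(links_a)
    wb = weighted_neighbors(links_b)
    shared = set(wa) & set(wb)
    # A shared neighbor contributes the smaller of its two weights.
    overlap = sum(min(wa[t], wb[t]) for t in shared)
    total = sum(wa.values()) + sum(wb.values())
    return 2.0 * overlap / total if total else 0.0


# Toy link data (hypothetical, not taken from a Wikipedia dump).
apache = {"Web server": "bidirectional", "HTTP": "direct", "Open source": "see_also"}
nginx = {"Web server": "bidirectional", "HTTP": "direct", "Reverse proxy": "direct"}
helicopter = {"Aircraft": "bidirectional", "Rotor": "direct"}

related = dice_relatedness(apache, nginx)
unrelated = dice_relatedness(apache, helicopter)
identical = dice_relatedness(apache, apache)
```

With this formulation the score stays in the range 0 to 1: articles with no shared neighbors score 0, and an article compared with itself scores 1.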

Page 8: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Word Sense Disambiguation

• Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians

• We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text

• Example: the term “platform” is disambiguated by its context: implementation, open-source, web server, HTTP
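The disambiguation idea can be sketched as follows: list the candidate senses a disambiguation page would offer, score each sense by its relatedness to the unambiguous context terms, and keep the best one. The link sets, the sense list, the plain (unweighted) Dice coefficient, and the function names are all hypothetical stand-ins chosen to keep the example small; the slides do not specify the scoring function.

```python
def dice(a, b):
    """Plain Dice coefficient between two sets of linked article titles."""
    return 2.0 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


# Toy link sets standing in for Wikipedia articles (hypothetical data).
LINKS = {
    "Computing platform": {"Software", "Operating system", "Web server", "Open source"},
    "Railway platform": {"Train", "Railway station", "Passenger"},
    "Web server": {"HTTP", "Software", "Open source", "Computing platform"},
    "HTTP": {"Web server", "Software", "Internet"},
    "Open source": {"Software", "License", "Web server"},
}

# Senses a disambiguation page might list for an ambiguous term.
SENSES = {"platform": ["Computing platform", "Railway platform"]}


def disambiguate(term, context_articles):
    """Pick the sense whose links overlap most with the context articles."""
    def score(sense):
        return sum(dice(LINKS[sense], LINKS[c]) for c in context_articles)
    return max(SENSES[term], key=score)


best_sense = disambiguate("platform", ["Web server", "HTTP", "Open source"])
```

Because the context articles share links only with the computing sense, that sense wins; no training data is involved, only the link structure.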


Page 9: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Prototype of a Semantic Search Engine for the Blogosphere


Page 10: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Twitter - A Real-Time News Medium

• ~200M users all over the world posting short messages (tweets) via mobile devices and web browsers

• ~140M tweets per day

• Twitter is an open social network where anyone can follow anyone

• Retweets - a mechanism for fast news spreading


Page 11: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Following + Retweets: Twitter is the Fastest News Medium

• Twitter reacts faster than mainstream media: Haiti earthquake, Hudson River plane crash

• Everyone can be a reporter: real-time updates on the revolutions in Tunisia, Egypt, Libya, Iran ...


Page 12: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Extracting Useful Information From Twitter

• Popularity of a URL

• Sentiments, opinions about a news story (tweets containing the news URL)

• Trending topics: what is being actively discussed right now

• Personalization of news based on the user’s friend connections: The Tweeted Times http://tweetedtimes.com
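Two of the signals listed above, URL popularity and trending topics, can be approximated with simple counting. The sketch below is a minimal illustration under stated assumptions: the tweets are invented, `example.com` URLs are placeholders, and scoring a term by its recent count divided by its smoothed baseline count is one simple heuristic, not the method used by the project.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")


def url_popularity(tweets):
    """Count how many tweets mention each URL (once per tweet)."""
    counts = Counter()
    for text in tweets:
        counts.update(set(URL_PATTERN.findall(text)))
    return counts


def term_counts(tweets):
    """Count lowercase word occurrences, ignoring URLs."""
    counts = Counter()
    for text in tweets:
        without_urls = URL_PATTERN.sub(" ", text.lower())
        counts.update(re.findall(r"[a-z#@']+", without_urls))
    return counts


def trending_terms(recent, baseline, top=3):
    """Rank terms by recent count relative to their (smoothed) baseline count."""
    recent_counts = term_counts(recent)
    base_counts = term_counts(baseline)
    score = {t: c / (base_counts[t] + 1) for t, c in recent_counts.items()}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(score, key=lambda t: (-score[t], t))[:top]


# Invented sample data for illustration only.
baseline = ["good morning everyone", "coffee time this morning", "the weather is nice"]
recent = [
    "earthquake in haiti http://example.com/a",
    "huge earthquake reported http://example.com/a",
    "praying for haiti after the earthquake http://example.com/b",
]

popular = url_popularity(recent)
trending = trending_terms(recent, baseline, top=2)
```

A production system would also need to expand shortened URLs and handle retweets, which this sketch leaves out.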


Page 13: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

The Tweeted Times: a personalized newspaper generated from a user’s Twitter account


Page 14: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

At the Systems Layer

• Scalable distributed architecture is required:

• Hadoop (MapReduce software framework) for batch processing of Wikipedia snapshots

• Real-time analytics based on distributed key-value store for online Twitter stream processing
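To make the batch side concrete, the sketch below mimics a MapReduce job in a single Python process: the map step emits a (target, source) pair for every wiki link in an article, a shuffle groups the pairs by target, and the reduce step produces the list of inbound links per article. It does not use any Hadoop API; the function names, the regular expression for `[[...]]` links, and the two sample articles are assumptions for illustration.

```python
import re
from collections import defaultdict

# Captures the target of [[Target]], [[Target|label]] and [[Target#Section]].
LINK_PATTERN = re.compile(r"\[\[([^\]|#]+)")


def map_article(title, wikitext):
    """Map step: emit (link_target, source_title) for each wiki link."""
    for match in LINK_PATTERN.finditer(wikitext):
        yield match.group(1).strip(), title


def reduce_inbound(target, sources):
    """Reduce step: collapse all sources of one target into a sorted list."""
    return target, sorted(set(sources))


def run_job(articles):
    """In-process stand-in for the map, shuffle, and reduce phases."""
    grouped = defaultdict(list)  # the "shuffle": group map output by key
    for title, text in articles.items():
        for target, source in map_article(title, text):
            grouped[target].append(source)
    return dict(reduce_inbound(t, s) for t, s in grouped.items())


# Two invented articles standing in for a Wikipedia snapshot.
snapshot = {
    "Apache HTTP Server": "An [[open-source software|open-source]] [[web server]] for [[HTTP]].",
    "Nginx": "A [[web server]] and [[reverse proxy]] speaking [[HTTP#Versions|HTTP]].",
}
inbound = run_job(snapshot)
```

Because the map and reduce functions only see one article or one key at a time, the same logic could in principle be distributed across machines, which is the property a framework such as Hadoop exploits.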


Page 15: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Scalable Real-Time Analytics Based On Distributed Key-Value Store

• At Systems Group, we are working on a system for real-time analytics based on Cassandra:

• We extend Cassandra with:

• push-style procedure for real-time analytics

• incremental computations (alternative to batch-processing) - processing data as it arrives from the stream
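The contrast between incremental and batch processing, together with push-style delivery, can be sketched in a few lines. The class below keeps a per-key count and running mean that are updated in constant time as each event arrives, and it pushes every new value to registered callbacks. It is an in-memory stand-in only: it does not use Cassandra, and the class name, method names, and sample events are hypothetical.

```python
from collections import defaultdict


class IncrementalStats:
    """Per-key count and mean, updated as each event arrives."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)
        self.subscribers = []

    def subscribe(self, callback):
        """Register a callback that receives (key, count, mean) on every update."""
        self.subscribers.append(callback)

    def update(self, key, value):
        """Fold one event into the aggregates without rescanning old data."""
        self.count[key] += 1
        self.mean[key] += (value - self.mean[key]) / self.count[key]
        # Push-style delivery: notify listeners instead of waiting for a query.
        for callback in self.subscribers:
            callback(key, self.count[key], self.mean[key])


# Invented event stream: (key, value) pairs arriving one at a time.
events = [("url_a", 1.0), ("url_b", 4.0), ("url_a", 3.0), ("url_a", 5.0)]

pushed = []
stats = IncrementalStats()
stats.subscribe(lambda key, count, mean: pushed.append((key, count, mean)))
for key, value in events:
    stats.update(key, value)
```

A batch job would recompute each mean from all stored events on every run; here the result after the last event matches that batch result while touching only one key per event.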


Page 16: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

References

• Prototype of the semantic search engine Blognoon: http://blognoon.com

• The Tweeted Times - personalized newspaper based on user’s Twitter account: http://tweetedtimes.com

• Triggy: a system for real-time analytics: http://www.systems.ethz.ch/research/projects
