COMP33111, 2011 1
Analysis of textual data
(text mining)
Goran Nenadic
School of Computer Science
COMP33111
Lecture 5
Aims
Understand the need and challenges in the
extraction and integration of information
from textual unstructured data sources
Discuss how to get unstructured data into a
form that can be used in OLAP or data
mining
Discuss techniques for text mining
Analyse possible application areas
Structured & unstructured data
Structured numerical or coded information
e.g. transactional databases
Unstructured or semi-structured information
e.g. company reports, market analyses, e-mails, videos, audio, legal documents, newspapers, Web pages, scientific papers, etc.
[Chart labels: 90%, 10%]
Textual data is everywhere
Company reports
Internet
News-wire, reports, blogs, social networks, etc.
150 million web domains, ~12 billion pages (2006)
Facebook: 700 status updates per second
Twitter tweets: 600 per second
Scientific literature
~700,000 scientific articles per year in biomedicine only (~2,000 articles per day!)
Textual data
Typically available as semi- and un-structured data: no strict format, but can still have some structure with optional entities, elements and attributes
e.g. structured: author, title, date, etc.
most of the content is unstructured
some entities/elements/attributes may be missing, some repeated
Examples:
HTML document with titles (e.g. <H1>) and paragraphs (<p>), but no information about author(s), date, etc.
LaTeX, Word documents
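The HTML case above can be sketched in a few lines. This is only an illustration using Python's standard `html.parser`; the page content and the `OutlineParser` class are made up for the example. It shows that the marked-up parts (title, paragraphs) are easy to pull out, while author and date simply are not there to extract.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect the text found inside <h1> and <p> elements."""

    def __init__(self):
        super().__init__()
        self.current = None               # tag we are currently inside, if any
        self.found = {"h1": [], "p": []}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <H1> arrives as "h1".
        if tag in self.found:
            self.current = tag
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.found[self.current][-1] += data


# Hypothetical page: it has a title and paragraphs but no author or date markup.
page = ("<html><body><H1>Quarterly results</H1>"
        "<p>Sales rose.</p><p>Costs fell.</p></body></html>")

parser = OutlineParser()
parser.feed(page)
parser.close()

record = {
    "title": parser.found["h1"][0] if parser.found["h1"] else None,
    "paragraphs": parser.found["p"],
    "author": None,  # missing: nothing in the markup identifies an author
    "date": None,    # missing: nothing in the markup identifies a date
}
print(record)
```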
Textual data
Stored in documents (files) that are “annotated”, i.e. tagged, for various purposes:
type setting
visualisation (e.g. browsing)
document retrieval (search engines)
text analytics
Various annotations supporting various purposes
e.g. many publishers use XML as internal format
but many also rely on PDF or proprietary formats
XML textual document representation (example)
[Figure: an XML document with its title, meta-data and text parts labelled]
Retrieving information from text
Aim: get textual data into a form that could be used for data integration and analytics
e.g. classification, clustering, association rule mining
combine with other resources (structured data)
Retrieval of textual data (strings) from structured sources is easy
e.g. using SQL (if in a relational DB) or XPath and XQuery (if in an XML DB)
known “meaning” of attributes and objects
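A minimal sketch of why the structured case is easy, using only Python's standard library. The table, column and element names are invented for the illustration. In both cases the query names the attribute it wants, so no interpretation of the text itself is needed.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational source: the column names tell us what each value means.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (title TEXT, author TEXT, year INTEGER)")
conn.execute("INSERT INTO report VALUES ('Annual review', 'G. Torretta', 2011)")
conn.execute("INSERT INTO report VALUES ('Market analysis', 'N. Andrews', 2010)")
sql_titles = [row[0] for row in
              conn.execute("SELECT title FROM report WHERE year = 2011")]
conn.close()

# XML source: element and attribute names play the same role as column names.
doc = ET.fromstring(
    "<reports>"
    "<report year='2011'><title>Annual review</title></report>"
    "<report year='2010'><title>Market analysis</title></report>"
    "</reports>"
)
# ElementTree implements only a subset of XPath, which is enough here.
xml_titles = [t.text for t in doc.findall("./report[@year='2011']/title")]

print(sql_titles, xml_titles)
```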
Retrieving information from text
How to extract information from unstructured parts?
text mining, natural language processing
Feature selection and extraction:
identify features of interest, extract them from unstructured data sources and store them as “structured” data
features: names of companies, drugs, genes, their co-occurrence, facts, relations, etc.
Two main problems with textual data:
variability
ambiguity
Main problems: variability
Numerous ways to express objects (entities) and relationships among them
Gina Torretta succeeds Nicholas Andrews as chairperson of BNC Holdings Inc.
BNC Holdings Inc. named Ms G. Torretta as its new chair-person after Nick Andrews’ departure.
Nicholas Andrews was succeeded by Gina Torretta as chair-person of BNC Holdings.
Main problems: ambiguity
Different meanings/senses of words
e.g. Apple (the company) or apple (the fruit)
e.g. Toyota can be a car or a company
e.g. acronyms have different meanings in different contexts
USA =
United States Army
United States of America
Ulhasnagar Sindhi Association
Ultimate in Suspense and Action
Unconditional Self-Acceptance
Unconventional Stellar Aspect
Under Secretary of the Army
Underground Service Alert
Underground Sewer Adapter
Underwriting Service Assistant
Unicycling Society of America
Union of South Africa
Union Street Athletics
Unionville-Sebewaing Area
Unique Settable Attributes
Unit Self Assessment
United Scenic Artists
University of South Alabama
University of South Australia
Unix System Admin
Unstable Angina
Unusually Sensitive Area
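One simple way to pick an expansion is to look at the surrounding words. The sketch below is a toy illustration only: the cue words in `SENSES` are invented for three of the expansions, and real systems would learn such cues from data rather than list them by hand.

```python
import re

# Hypothetical context cues for three possible expansions of "USA".
SENSES = {
    "United States of America": {"country", "president", "states", "washington"},
    "University of South Australia": {"university", "students", "adelaide", "campus"},
    "Unstable Angina": {"patient", "chest", "pain", "cardiac"},
}


def expand_usa(sentence):
    """Return the expansion whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))


medical = expand_usa("The patient with USA reported chest pain.")
academic = expand_usa("USA students returned to the Adelaide campus.")
print(medical)
print(academic)
```

If no cue word occurs at all, `max` simply returns the first entry, so a real system would also need a way to say "unknown".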
Tasks and techniques
Information retrieval (IR)
select a set of relevant documents
Information extraction (IE)
extract factual information (facts) from texts
Question answering (QA)
find/generate an answer to a given question
Information retrieval (IR)
Searching for relevant documents
whether in stand-alone or hypertext collections (both Internet
or intranets)
Search engines are an example of IR systems
Result of IR is a set of relevant documents
filtering huge collections based on a query
no fine-grained information, just whole documents
users would need to read and analyse these documents on
their own
Information retrieval (IR)
IR is based on indexing by keywords:
pre-calculated and stored for easier retrieval
each document represented by an index vector
Simple techniques are used:
index all words and “phrases”
use meta-information (cross-links, titles, etc.)
add stemming, spelling variation, etc.
Calculate “similarity” between documents & query:
ranked list of documents, based on similarity
vector space model & distances (Euclidean, cosine)
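The vector space idea can be sketched as follows, assuming the simplest possible weighting (raw term counts) and no stemming or phrase handling. The three documents and the query are made up for the example. Each document and the query become a term-count vector, and documents are ranked by cosine similarity to the query.

```python
import math
from collections import Counter


def vectorise(text):
    """Index vector: term -> number of occurrences (raw counts, no stemming)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini collection.
documents = {
    "d1": "apple reports record profits",
    "d2": "apple pie recipe with fresh apple",
    "d3": "toyota recalls cars",
}

# The index is computed once and reused for every query.
index = {doc_id: vectorise(text) for doc_id, text in documents.items()}

q = vectorise("apple profits")
ranked = sorted(index, key=lambda d: cosine(index[d], q), reverse=True)
for doc_id in ranked:
    print(doc_id, round(cosine(index[doc_id], q), 3))
```

Note that d2 still scores above zero for a query about company profits, which is the ambiguity problem showing up in the ranking.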
IR challenges
Number of documents (e.g. on the Web) is growing
also, many (web) pages are updated frequently, which forces the IR engines to revisit them periodically (refresh)
Dynamically generated sites/documents may be difficult to index, or may result in excessive results from a single site
Queries one can make are currently limited to searching by/for keywords, which may result in many
false positives (wrong hits), because of ambiguity
false negatives (non-returns), because of variability
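Both error types can be seen in a tiny keyword search. The documents and the set of "relevant" answers below are invented for illustration, assuming the user wants documents about the company Apple.

```python
import re

# Hypothetical collection; the user is interested in the company Apple.
documents = {
    "fruit": "An apple a day keeps the doctor away.",
    "company": "Apple announced a new phone today.",
    "paraphrase": "The iPhone maker announced record profits.",
}
relevant = {"company", "paraphrase"}


def keyword_search(keyword, docs):
    """Return ids of documents that contain the keyword as a whole word."""
    keyword = keyword.lower()
    return {doc_id for doc_id, text in docs.items()
            if keyword in re.findall(r"\w+", text.lower())}


hits = keyword_search("Apple", documents)
false_positives = hits - relevant    # returned but wrong sense (ambiguity)
false_negatives = relevant - hits    # relevant but worded differently (variability)
print(sorted(false_positives), sorted(false_negatives))
```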
Information extraction (IE)
Extract information i.e. facts from text
Identify instances of pre-defined entities (dates, names
of people, locations, etc.) and relations between them
Fill in database-like tables with “facts”
San Salvador, 7/10/96
It has been officially reported that a policeman was wounded today when urban guerrillas attacked the guards at a power substation located downtown San Salvador.

Slot          Information
Date          7/10/96 (today)
Location      San Salvador
Victim        injured policeman
Victim        attacked guards
Perpetrator   urban guerrillas
IE steps
Tokenisation: finding individual words
Lexical processing: classifying words into nouns,
verbs, adjectives, etc. (part-of-speech tagging); also,
identification of names, places, etc.
Syntactic processing: finding (syntactic) links between
words: phrases, and also subjects, objects, etc.
Semantic processing: linking extracted data and
making sense of it
e.g. this number represents the age of a person, not their phone number
e.g. this chemical entity is a hormone in this context.
etc.
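The first two steps can be sketched with a regular-expression tokeniser and a dictionary lookup. This is a deliberately naive illustration: the token pattern and the `TOY_LEXICON` are assumptions made for this example, and real part-of-speech taggers use context rather than a fixed word list.

```python
import re

# Token = a slash-separated date, a (possibly hyphenated) word, or one punctuation mark.
TOKEN = re.compile(r"\d+(?:/\d+)+|\w+(?:-\w+)*|[^\w\s]")

# Toy lexicon for illustration only.
TOY_LEXICON = {
    "a": "DET", "policeman": "NOUN", "was": "VERB",
    "wounded": "VERB", "today": "ADV", "in": "PREP",
}


def tokenise(text):
    """Tokenisation: split the text into individual tokens."""
    return TOKEN.findall(text)


def tag(tokens):
    """Lexical processing: look each token up; unknown words stay untagged."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "UNKNOWN")) for tok in tokens]


sentence = "A policeman was wounded today in San Salvador, 7/10/96."
tokens = tokenise(sentence)
tagged = tag(tokens)
print(tokens)
print(tagged)
```

Names such as "San Salvador" come out as UNKNOWN here, which is why identification of names and places is listed as a separate part of lexical processing.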
IE: key steps
Identification of entities and terms of interest:
find strings in text that denote specific entities or concepts (named-entity recognition; term identification)
designed for each class of interest
e.g. persons, jobs, companies, dates, time, locations, genes, species, medications, tools, …
Various methodologies:
dictionaries (need to be updated)
rule-based: define term patterns and/or context
RSM North Ltd. was established in 2010.
machine learning: generalise from a training set
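A rule-based recogniser for the example sentence might look like the sketch below. The two patterns are assumptions made for illustration: a company is taken to be a run of capitalised words ending in a legal suffix, and a year is a four-digit number starting with 19 or 20.

```python
import re

# Term pattern: capitalised words followed by a company suffix (assumed rule).
COMPANY = re.compile(r"\b((?:[A-Z][A-Za-z]*\s)+(?:Ltd|Inc|plc)\.?)")
# Context-free pattern for years (assumed rule).
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

text = "RSM North Ltd. was established in 2010."
companies = COMPANY.findall(text)
years = YEAR.findall(text)
print(companies, years)
```

Such rules are precise for the names they were written for, but a company without one of the listed suffixes would be missed, which is where dictionaries and machine learning come in.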
IE: key steps
Extract specific facts, relations and events by linking
entities
use of templates, regular expressions, grammars
<PERSON> is appointed as a <JOB> of <COMPANY>
typically designed around important verbs –
e.g. attacked, bought, appointed, merged, acquired, etc.
also, machine learning approaches
Relies on parsing sentences
identification of main syntactic units and their relations
e.g. finding syntactic chunks (noun phrases, verb phrases),
subjects, objects, etc.
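The appointment template can be approximated with a regular expression built around the verb "appointed". This is a sketch only: the patterns standing in for PERSON, JOB and COMPANY are rough assumptions, where a real system would reuse the output of entity recognition and parsing.

```python
import re

# <PERSON> is/was appointed as a/the <JOB> of <COMPANY>
TEMPLATE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) (?:is|was) appointed as (?:a |the )?"
    r"(?P<job>[a-z][a-z -]*?) of (?P<company>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"
)

text = "Gina Torretta was appointed as the chairperson of BNC Holdings Inc. last week."
match = TEMPLATE.search(text)
fact = match.groupdict() if match else None
print(fact)

# The same fact expressed around a different verb is not covered by this template.
variant = "Nicholas Andrews was succeeded by Gina Torretta as chair-person of BNC Holdings."
missed = TEMPLATE.search(variant)
print(missed)
```

Each verb of interest typically needs its own template, and passive or reordered sentences need further variants.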
IE: final result
Provide table-like structured information
IE: some challenges
Variability and ambiguity of “templates”
IBM today announced …
John Lewis today announced …
They bought the company with last year’s profits.
They bought the company with 100 employees.
Co-references
using different words to refer to the same entity
Question answering (QA)
Produce factual (short) answers to a user query that is formulated as a question
“When was the takeover of AstraZeneca?”
“Who is the CEO of Software Ltd.?”
Combine IR, IE and heuristics:
if the question starts with “When …” then look for date and time expressions
if the question starts with “Who …” then look for persons
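These heuristics can be sketched as a mapping from the first word of the question to an expected answer type. The passage is invented and assumed to have been retrieved already by an IR step; the name in it and the two extraction patterns are hypothetical stand-ins for a proper IE component.

```python
import re

# Heuristic: first word of the question -> expected answer type.
EXPECTED_TYPE = {"when": "DATE", "who": "PERSON"}

# Very rough extraction patterns (assumptions for this sketch).
PATTERNS = {
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
    "PERSON": re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)*"),
}


def answer(question, passage):
    """Return the first expression of the expected type, or None."""
    first_word = question.split()[0].lower()
    kind = EXPECTED_TYPE.get(first_word)
    if kind is None:
        return None
    matches = PATTERNS[kind].findall(passage)
    return matches[0] if matches else None


# Hypothetical passage, assumed to be the top document returned by IR.
passage = "Software Ltd. was taken over in 2009. Its CEO is Ms Jane Smith."
when_answer = answer("When was the takeover of Software Ltd.?", passage)
who_answer = answer("Who is the CEO of Software Ltd.?", passage)
print(when_answer)
print(who_answer)
```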
Text mining
Application of data mining techniques on features extracted from unstructured data
e.g. predict how well a company would do on the market based on data that has been reported in the news
e.g. extract and analyse user complaint e-mails
e.g. cluster genes based on functions
e.g. find causal links between symptoms or diseases and drugs or chemicals
e.g. analyse Twitter sentiments
Twitter sentiment
Twitter as “crystal ball”
Identify mood of tweets:
72 mood adjectives and related words
~10 million tweets
Correlate them with stock exchange data (up/down):
very good correlation
Can Twitter predict the stock market?
Dow Jones on its own: 73% accuracy
Dow Jones + Twitter data: 87% accuracy
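The general idea of scoring mood words and comparing the result with market direction can be sketched as below. Everything here is a toy assumption: the six-word lexicon, the tweets and the market moves are invented, so the perfect agreement says nothing about the accuracy figures quoted above.

```python
import re

# Toy mood lexicon (the study used 72 mood adjectives and related words).
MOOD_WORDS = {"happy": 1, "calm": 1, "hopeful": 1,
              "anxious": -1, "worried": -1, "angry": -1}


def mood_score(tweets):
    """Sum the mood values of all lexicon words found in the tweets."""
    score = 0
    for tweet in tweets:
        for word in re.findall(r"[a-z']+", tweet.lower()):
            score += MOOD_WORDS.get(word, 0)
    return score


# Invented data for illustration only.
daily_tweets = {
    "day1": ["Feeling calm and hopeful today", "so happy"],
    "day2": ["worried about my job", "angry and anxious"],
    "day3": ["happy happy", "a bit anxious"],
}
market_moves = {"day1": "up", "day2": "down", "day3": "up"}

predicted = {day: "up" if mood_score(tweets) > 0 else "down"
             for day, tweets in daily_tweets.items()}
agreement = sum(predicted[d] == market_moves[d] for d in market_moves) / len(market_moves)
print(predicted, agreement)
```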
Personalised movie “matcher”
Match movies to individuals based on their preference profiles
Information sources:
written reviews of movies
users’ lists of favorite movies
[Figure: example pipeline from movie reviews, through “sentiment analysis”, to personalised matched movies]
recommendation portal
A recommendation portal for movies and TV shows that provides recommendations in response to a free-text search.
~10,000 movie, TV and video titles
Based on semantic technologies, Jinni uses text mining on plot, mood, style, setting, soundtrack and more, in combination with an ontology created by film professionals.
You don’t need to know the exact title, actor, director, place or year of production to get a result: you can simply enter a phrase describing the mood, genre or place the movie is about, and you will be guided through a facilitated search to narrow the results and find what you want.
Alternatively, if you are searching for a movie and have only a vague idea of the plot, you can describe the plot in free text.
As it also offers APIs for Internet and TV content providers, you can go directly to an online store to download or purchase the movie.
Restaurant Reputation Report
A service targeting restaurant owners that provides them with reports of positive and negative reviews of food, service and ambiance at their restaurants.
The service monitors negative and positive trends across hundreds of online review sites. Restaurant owners can subscribe to receive a PDF of their monthly reports. These PDFs come with charts, trends, rankings, summaries and some quotes from users, month by month. The reports may enable restaurant owners to react and improve their services in specific areas.
A simple but straightforward way of using text mining and semantic technologies in business.
Text mining
Text mining is now widely used as an
umbrella term for a large variety of natural language
processing techniques, denoting all
approaches to retrieve, extract and analyse
textual information
Also known as text analytics
Provided by a number of vendors
Oracle, SAS, Autonomy, etc.
Text mining
IBM Cognos Content Analytics
Text mining: Keyword driven investigation. View, filter and export. Automatically extracted concepts, relationships, meta data and organization.
Delivers new business understanding from the content of unstructured data. Trend and pattern detection and anomaly highlighting for focused research. Pre-built and customizable entity extraction and visualization. Combines content access, entity and context extraction, analysis and categorization with exploratory mining and operational reporting.
Integration point for structured and unstructured content. Integrates with and delivers analytics to Cognos 8 BI, InfoSphere Warehouse, IBM ECM, WebSphere Portal, and custom-built solutions. Provides ETL interface for unstructured content. Enables content integration between applications, systems and processes
SAS® Text Analytics
Enterprise Content Categorization – Drive faster, more efficient information organization, access and findability with automated content categorization.
Ontology Management – Maximize the value of your text repositories by linking them together with consistently and systematically defined relationships.
Sentiment Analysis – Automatically locate and extract sentiment from the Internet, social networking sites and internal electronic documents to identify more effective strategies.
Text Mining – Capitalize on the value hidden in textual information by mining unstructured data sources.
http://www.sas.com/text-analytics/
Challenges
Text mining is possible but difficult
language variability and ambiguity
currently only approximation is used, but text needs
interpretation, context and background knowledge (not
everything is explicit)
Nicholas Andrews was succeeded by Gina
Torretta as chair-person of BNC Holdings. She
was appointed 2 days ago.
Summary
Differences between retrieving information from structured and unstructured data:
computers do not “understand”
Problems with understanding text documents:
language variability and ambiguity
growing number and size of documents
Text mining techniques: IR, IE, QA, etc.
extraction of various features: entities, terms, facts, relations
classification, clustering, association rules
Multi-media data: discussed later
Reading
See on-line materials, tutorials and software
Our School’s Text mining and Natural Language
Processing group
with applications in biology, medicine, bioinformatics,
engineering, social sciences (sentiment), marketing …
Also, we host the National Centre for Text Mining
(www.nactem.ac.uk)