
COMP33111 Lecture 5 Analysis of textual data


COMP33111, 2011 1


Analysis of textual data

(text mining)

Goran Nenadic

School of Computer Science

COMP33111

Lecture 5


Aims

Understand the need and challenges in the extraction and integration of information from textual unstructured data sources

Discuss how to get unstructured data into a form that can be used in OLAP or data mining

Discuss techniques for text mining

Analyse possible application areas

Structured & unstructured data

Structured numerical or coded information, e.g. transactional databases

Unstructured or semi-structured information, e.g. company reports, market analyses, e-mails, videos, audio, legal documents, newspapers, Web pages, scientific papers, etc.

[Chart: 90% / 10% split between the two kinds of data]


Textual data is everywhere

Company reports

Internet

News-wire, reports, blogs, social networks, etc.

150 million web domains, ~12 billion pages (2006)

Facebook: 700 status updates per second

Twitter tweets: 600 per second

Scientific literature

~700,000 scientific articles per year in biomedicine alone (~2,000 articles per day!)

Textual data

Typically available as semi-structured and unstructured data: no strict format, but it can still have some structure, with optional entities, elements and attributes

e.g. structured: author, title, date, etc.

most of the content is unstructured

some entities/elements/attributes may be missing, some repeated

Examples: an HTML document with titles (e.g. <H1>) and paragraphs (<p>), but no information about author(s), date, etc.

LaTeX, Word documents


Textual data

Stored in documents (files) “annotated”, i.e. tagged, for various purposes

typesetting

visualisation (e.g. browsing)

document retrieval (search engines)

text analytics

Various annotations supporting various purposes

e.g. many publishers use XML as internal format

but many also rely on PDF or proprietary formats


XML textual document representation (example)

[Figure: example XML document, with the meta-data (e.g. title) and text parts labelled]

Retrieving information from text

Aim: get textual data into a form that could be used for data integration and analytics e.g. classification, clustering, association rule mining

combine with other resources (structured data)

Retrieval of textual data (strings) from structured sources is easy, e.g. using SQL (if in a relational DB) or XPath and XQuery (if in an XML DB)

known “meaning” of attributes and objects
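To make the "easy" case concrete, here is a minimal sketch using Python's standard `sqlite3` and `xml.etree.ElementTree` modules. The table, element names and the sample report are invented for illustration; the point is only that column and element names tell us what each string means.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational case: the schema (hypothetical) tells us which string is the title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (title TEXT, author TEXT, body TEXT)")
conn.execute(
    "INSERT INTO report VALUES (?, ?, ?)",
    ("Annual review", "J. Smith", "Sales grew in the last quarter."),
)
titles = [
    row[0]
    for row in conn.execute("SELECT title FROM report WHERE author = ?", ("J. Smith",))
]

# XML case: element names (also hypothetical) play the same role.
# ElementTree supports a limited subset of XPath, which is enough here.
doc = ET.fromstring(
    "<document>"
    "<meta><title>Annual review</title><author>J. Smith</author></meta>"
    "<text>Sales grew in the last quarter.</text>"
    "</document>"
)
xml_title = doc.findtext("./meta/title")
xml_body = doc.findtext("./text")

print(titles, xml_title, xml_body)
```

In both cases the body text comes back as one opaque string: the query finds it, but says nothing about what is inside it.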

Retrieving information from text

How to extract information from unstructured parts? Text mining, natural language processing

Feature selection and extraction: identify features of interest, extract them from unstructured data sources and store them as “structured” data

features: names of companies, drugs, genes, their co-occurrence, facts, relations, etc.

Two main problems with textual data: variability and ambiguity


Main problems: variability

Numerous ways to express objects (entities) and relationships among them

Gina Torretta succeeds Nicholas Andrews as chairperson of BNC Holdings Inc.

BNC Holdings Inc. named Ms G. Torretta as its new chair-person after Nick Andrews’ departure.

Nicholas Andrews was succeeded by Gina Torretta as chair-person of BNC Holdings.
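A small sketch of why variability hurts: the three sentences above state the same fact, but a hand-written pattern only covers the phrasing it was written for. The regular expressions below are illustrative assumptions, not a recommended rule set.

```python
import re

sentences = [
    "Gina Torretta succeeds Nicholas Andrews as chairperson of BNC Holdings Inc.",
    "BNC Holdings Inc. named Ms G. Torretta as its new chair-person after Nick Andrews' departure.",
    "Nicholas Andrews was succeeded by Gina Torretta as chair-person of BNC Holdings.",
]

# Naive assumption: a person name is two capitalised words.
NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"
active = re.compile(rf"(?P<new>{NAME}) succeeds (?P<old>{NAME}) as")
passive = re.compile(rf"(?P<old>{NAME}) was succeeded by (?P<new>{NAME}) as")

def extract(sentence):
    """Return (new_person, old_person) if one of the two patterns matches."""
    for pattern in (active, passive):
        m = pattern.search(sentence)
        if m:
            return m.group("new"), m.group("old")
    return None

results = [extract(s) for s in sentences]
for sentence, result in zip(sentences, results):
    print(result, "<-", sentence)
```

Two patterns cover the first and third sentences; the second ("named ... as its new chair-person", "Ms G. Torretta", "Nick") would need yet another rule plus name normalisation.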


Main problems: ambiguity

Different meanings/senses of words e.g. Apple (the company) or apple (the fruit)

e.g. Toyota can be a car or a company

e.g. acronyms have different meanings in different contexts

USA =

United States Army

United States of America

Ulhasnagar Sindhi Association

Ultimate in Suspense and Action

Unconditional Self-Acceptance

Unconventional Stellar Aspect

Under Secretary of the Army

Underground Service Alert

Underground Sewer Adapter

Underwriting Service Assistant

Unicycling Society of America

Union of South Africa

Union Street Athletics

Unionville-Sebewaing Area

Unique Settable Attributes

Unit Self Assessment

United Scenic Artists

University of South Alabama

University of South Australia

Unix System Admin

Unstable Angina

Unusually Sensitive Area

Tasks and techniques

Information retrieval (IR)

select a set of relevant documents

Information extraction (IE)

extract factual information (facts) from texts

Question answering (QA)

find/generate an answer to a given question


Information retrieval (IR)

Searching for relevant documents

whether in stand-alone or hypertext collections (both Internet and intranets)

Search engines are an example of IR systems

Result of IR is a set of relevant documents

filtering huge collections based on a query

no fine-grained information, just whole documents

users would need to read and analyse these documents on their own

Information retrieval (IR)

IR is based on indexing by keywords, pre-calculated and stored for easier retrieval

each document is represented by an index vector

Simple techniques are used:

index all words and “phrases”

use meta-information (cross-links, titles, etc.)

add stemming, spelling variation, etc.

Calculate “similarity” between documents and the query

ranked list of documents, based on similarity

vector space model and distances (Euclidean, cosine)
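The vector space model can be sketched in a few lines. This is a minimal illustration with raw term counts and cosine similarity; the three toy documents are made up, and real IR systems would add weighting (e.g. tf-idf), stemming and a proper inverted index.

```python
import math
from collections import Counter

def vectorise(text):
    """Represent a text as a bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy collection (invented); the index is computed once and stored.
documents = {
    "d1": "apple reports record profits",
    "d2": "apple pie recipe with fresh apple slices",
    "d3": "toyota recalls cars",
}
index = {doc_id: vectorise(text) for doc_id, text in documents.items()}

def rank(query):
    """Return ids of documents with non-zero similarity, best match first."""
    q = vectorise(query)
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

print(rank("apple profits"))
```

Note that the query "apple profits" still returns the recipe document, ranked second: keyword matching cannot tell the company from the fruit.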


IR challenges

Number of documents (e.g. on the Web) is growing

also, many (web) pages are updated frequently, which forces the IR engines to revisit them periodically (refresh)

Dynamically generated sites/documents may be difficult to index, or may result in excessive results from a single site

Queries one can make are currently limited to searching by/for keywords, which may result in many

false positives (wrong hits), because of ambiguity

false negatives (non-returns), because of variability


Information extraction (IE)

Extract information i.e. facts from text

Identify instances of pre-defined entities (dates, names of people, locations, etc.) and relations between them

Fill in database-like tables with “facts”

San Salvador, 7/10/96

It has been officially reported that a policeman was wounded today when urban guerrillas attacked the guards at a power substation located downtown San Salvador.

Slot          Information
Date          7/10/96 (today)
Location      San Salvador
Victim        injured policeman
Victim        attacked guards
Perpetrator   urban guerrillas
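As a rough sketch of slot filling, the San Salvador report can be turned into a dictionary of slots with a few hand-written patterns. The patterns and slot keys below are assumptions tailored to this one text; they illustrate the idea of filling a template, not a general extractor.

```python
import re

text = (
    "San Salvador, 7/10/96\n"
    "It has been officially reported that a policeman was wounded today "
    "when urban guerrillas attacked the guards at a power substation "
    "located downtown San Salvador."
)

slots = {}

# Dateline at the start of the report: "<Location>, <d/m/yy>"
dateline = re.match(r"(?P<location>[A-Z][\w ]+), (?P<date>\d{1,2}/\d{1,2}/\d{2})", text)
if dateline:
    slots["Location"] = dateline.group("location")
    slots["Date"] = dateline.group("date")

# "<perpetrator> attacked the <victim>"
attack = re.search(r"when (?P<perp>[\w ]+?) attacked the (?P<victim>\w+)", text)
if attack:
    slots["Perpetrator"] = attack.group("perp")
    slots["Victim (attacked)"] = attack.group("victim")

# "a <victim> was wounded"
wounded = re.search(r"\ba (?P<victim>\w+) was wounded", text)
if wounded:
    slots["Victim (injured)"] = wounded.group("victim")

for slot, value in slots.items():
    print(f"{slot:20}{value}")
```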

IE steps

Tokenisation: finding individual words

Lexical processing: classifying words into nouns, verbs, adjectives, etc. (part-of-speech tagging); also, identification of names, places, etc.

Syntactic processing: finding (syntactic) links between words: phrases, and also subjects, objects, etc.

Semantic processing: linking extracted data and making sense of it

e.g. this number represents the age of a person, not their phone number

e.g. this chemical entity is a hormone in this context.

etc.
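The first two steps can be sketched as follows. The tokeniser is a simple regular expression and the tagger is a lookup in a tiny hand-made lexicon (`TOY_LEXICON` is hypothetical); real systems use trained part-of-speech taggers, so treat this only as an illustration of what each step produces.

```python
import re

def tokenise(text):
    """Split text into words (keeping internal hyphens/apostrophes) and punctuation."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

# Hypothetical mini-lexicon; a real tagger would be trained on annotated text.
TOY_LEXICON = {
    "urban": "ADJ",
    "guerrillas": "NOUN",
    "guards": "NOUN",
    "attacked": "VERB",
    "the": "DET",
}

def tag(tokens):
    """Assign a coarse part-of-speech label to each token."""
    tagged = []
    for tok in tokens:
        lower = tok.lower()
        if lower in TOY_LEXICON:
            pos = TOY_LEXICON[lower]
        elif tok.isdigit():
            pos = "NUM"
        elif tok[0].isupper():
            pos = "PROPN"   # crude guess: unknown capitalised word is a name
        elif not tok[0].isalnum():
            pos = "PUNCT"
        else:
            pos = "UNK"
        tagged.append((tok, pos))
    return tagged

print(tag(tokenise("Urban guerrillas attacked the guards.")))
```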


IE: key steps

Identification of entities and terms of interest

find strings in text that denote specific entities or concepts (named-entity recognition; term identification)

designed for each class of interest

e.g. persons, jobs, companies, dates, time, locations, genes, species, medications, tools, …

Various methodologies

dictionaries (need to be updated)

rule-based: define term patterns and/or context

RSM North Ltd. was established in 2010.

machine learning: generalise from a training set
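A minimal sketch of the rule-based approach, applied to the example sentence above. The two patterns (capitalised words ending in a legal suffix such as "Ltd" for companies, four-digit numbers starting 19 or 20 for years) are illustrative assumptions and would miss or over-match many real cases.

```python
import re

# Term pattern: one or more capitalised words followed by a company suffix.
COMPANY = re.compile(r"\b(?:[A-Z][\w&]*\s+)+(?:Ltd|Inc|Plc|Corp)\b\.?")
# Term pattern: a plausible year.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def find_entities(text):
    """Return (string, class) pairs for every pattern match in the text."""
    entities = [(m.group(), "COMPANY") for m in COMPANY.finditer(text)]
    entities += [(m.group(), "YEAR") for m in YEAR.finditer(text)]
    return entities

print(find_entities("RSM North Ltd. was established in 2010."))
```

A dictionary-based recogniser would replace the patterns with a lookup in a list of known names, which is why such lists need to be kept up to date.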


IE: key steps

Extract specific facts, relations and events by linking entities

use of templates, regular expressions, grammars

<PERSON> is appointed as a <JOB> of <COMPANY>

typically designed around important verbs, e.g. attacked, bought, appointed, merged, acquired, etc.

also, machine learning approaches

Relies on parsing sentences

identification of main syntactic units and their relations

e.g. finding syntactic chunks (noun phrases, verb phrases), subjects, objects, etc.
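The template `<PERSON> is appointed as a <JOB> of <COMPANY>` can be approximated with a regular expression. In this sketch the entity slots are filled by naive sub-patterns (an assumption made to keep the example self-contained); in practice they would come from the entity recognition step, and the test sentence is constructed for illustration.

```python
import re

# Naive stand-ins for recognised entities (assumptions, not real NER output).
PERSON = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"
JOB = r"(?P<job>[a-z][a-z -]*?)"
COMPANY = r"(?P<company>[A-Z][\w&]*(?: [A-Z][\w&]*)*)"

# Template built around the verb "appointed".
TEMPLATE = re.compile(
    rf"{PERSON} (?:is|was) appointed as (?:a |an |the )?{JOB} of {COMPANY}"
)

def extract_appointment(sentence):
    """Return the filled template as a dict, or None if it does not apply."""
    m = TEMPLATE.search(sentence)
    return m.groupdict() if m else None

print(extract_appointment("Gina Torretta is appointed as a chairperson of BNC Holdings."))
print(extract_appointment("BNC Holdings Inc. named Ms G. Torretta as its new chair-person."))
```

The second call returns `None`: the fact is the same, but the sentence is built around a different verb, so it needs its own template.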


IE: final result

Provides table-like structured information


IE: some challenges

Variability and ambiguity of “templates”

IBM today announced …

John Lewis today announced …

They bought the company with last year’s profits.

They bought the company with 100 employees.

Co-references

using different words to refer to the same entity

Question answering (QA)

Produce factual (short) answers to a user query that is formulated as a question

“When was the takeover of AstraZeneca?”

“Who is the CEO of Software Ltd.?”

Combine IR, IE and heuristics

if the question starts with “When …” then look for date and time expressions

if the question starts with “Who …” then look for persons
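These heuristics can be sketched directly: map the question word to an expected answer type, then look for an expression of that type in a retrieved passage. The date and person patterns are deliberately naive assumptions, and the passage (including its year) is invented for the example.

```python
import re

def expected_answer_type(question):
    """Map the first word of the question to the kind of answer to look for."""
    words = question.split()
    if not words:
        return "UNKNOWN"
    return {"when": "DATE", "who": "PERSON"}.get(words[0].lower(), "UNKNOWN")

# Naive patterns: d/m/yy dates or four-digit years; two capitalised words as a person.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b|\b(?:19|20)\d{2}\b")
PERSON = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

def answer(question, passage):
    """Return the first expression of the expected type found in the passage."""
    pattern = {"DATE": DATE, "PERSON": PERSON}.get(expected_answer_type(question))
    if pattern is None:
        return None
    m = pattern.search(passage)
    return m.group() if m else None

# Invented passage, standing in for a document returned by the IR step.
passage = "Gina Torretta was appointed chairperson of BNC Holdings in 2010."
print(answer("Who is the chairperson of BNC Holdings?", passage))
print(answer("When was she appointed?", passage))
```

Taking the first match of the right type is the weakest possible selection rule; it ignores whether the match actually relates to the entities in the question.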

Text mining

Application of data mining techniques to features extracted from unstructured data

e.g. predict how well a company would do on the market based on data that has been reported in the news

e.g. extract and analyse user complaint e-mails

e.g. cluster genes based on functions

e.g. find causal links between symptoms or diseases and drugs or chemicals

e.g. analyse Twitter sentiments


Twitter sentiment

Twitter as “crystal ball”

Identify mood of tweets

72 mood adjectives and related words

~10 million tweets

Correlate them with stock exchange data (up/down)

very good correlation

Can Twitter predict the stock market?

Dow Jones on its own: 73% accuracy

Dow Jones + Twitter data: 87% accuracy

Personalised movie “matcher”

Match movies to individuals based on their preference profiles

Information sources

written reviews of movies

users’ lists of favorite movies

[Figure (example): movie reviews, “sentiment analysis”, personalised matched movies]


recommendation portal

A recommendation portal for movies and TV shows that provides recommendations in response to a free-text search.

~10,000 movie, TV and video titles

Based on semantic technologies, Jinni uses text mining on plot, mood, style, setting, soundtrack and more, in combination with an ontology created by film professionals.

You don’t need to know the exact title, actor, director, place or year of production to get a result: you can simply enter a phrase describing the mood, genre or place the movie is about, and you will be guided through a facilitated search to narrow it down until you find what you want.

Alternatively, if you are searching for a movie and have only a vague idea of the plot, you can describe the plot in free text.

As it also offers APIs for Internet and TV content providers, you can go directly to an online store to download or purchase the movie.

Restaurant Reputation Report

A service targeting restaurant owners that provides them with reports of positive and negative reviews of food, service and ambiance at their restaurants.

The service monitors negative and positive trends across hundreds of online review sites. Restaurant owners can subscribe to receive a PDF of their monthly reports. These PDFs come with charts, trends, rankings, summaries and some quotes from users, month by month. The reports may enable restaurant owners to react and improve their services in the specific area.

A simple but straightforward way of using text mining and semantic technologies in business.

Text mining

Text mining is now widely used as an umbrella term for a large variety of natural language processing techniques, denoting all approaches to retrieve, extract and analyse textual information

Also known as text analytics

Provided by a number of vendors

Oracle, SAS, Autonomy, etc.


IBM Cognos Content Analytics

Text mining: Keyword driven investigation. View, filter and export. Automatically extracted concepts, relationships, meta data and organization.

Delivers new business understanding from the content of unstructured data. Trend and pattern detection and anomaly highlighting for focused research. Pre-built and customizable entity extraction and visualization. Combines content access, entity and context extraction, analysis and categorization with exploratory mining and operational reporting.

Integration point for structured and unstructured content. Integrates with and delivers analytics to Cognos 8 BI, InfoSphere Warehouse, IBM ECM, WebSphere Portal, and custom-built solutions. Provides ETL interface for unstructured content. Enables content integration between applications, systems and processes.

SAS® Text Analytics

Enterprise Content Categorization – Drive faster, more efficient information organization, access and findability with automated content categorization.

Ontology Management – Maximize the value of your text repositories by linking them together with consistently and systematically defined relationships.

Sentiment Analysis – Automatically locate and extract sentiment from the Internet, social networking sites and internal electronic documents to identify more effective strategies.

Text Mining – Capitalize on the value hidden in textual information by mining unstructured data sources.

http://www.sas.com/text-analytics/


Autonomy

Computer Science MSc projects

MSc projects


Challenges

Text mining is possible but difficult

language variability and ambiguity

currently only approximation is used, but full understanding needs interpretation, context and background knowledge (not everything is explicit)

Nicholas Andrews was succeeded by Gina Torretta as chair-person of BNC Holdings. She was appointed 2 days ago.

Summary

Differences between retrieving information from structured and unstructured data

computers do not “understand”

Problems with understanding text documents

language variability and ambiguity

growing number and size of documents

Text mining techniques: IR, IE, QA, etc.

extraction of various features: entities, terms, facts, relations

classification, clustering, association rules

Multi-media data: discussed later


Reading

See on-line materials, tutorials and software

Our School’s Text mining and Natural Language Processing group

with applications in biology, medicine, bioinformatics, engineering, social sciences (sentiment), marketing …

Also, we host the National Centre for Text Mining (www.nactem.ac.uk)