Center of Signal and Image Processing, Georgia Institute of Technology

Discovering Knowledge in and Extracting Information from Multimedia Patterns

Chin-Hui Lee
School of ECE, Georgia Institute of Technology
Atlanta, GA 30332, USA
chl@ece.gatech.edu

(Most of this work was completed at Bell Labs; some was done while visiting NUS in 2001-2002)

ISIMP2004, HKPolyU, Oct. 21, 2004

Outline
• Rich content of heterogeneous media patterns
– Text, audio, video, speech, image, object, graphics, sketch, etc.
– Web is becoming the largest multimedia database & playground
• 4M in human information processing technology
– Multimedia, multi-modal, multi-lingual, multi-disciplinary
• Technology dimensions (more language engineering)
– Parametrization, feature extraction, modeling, segmentation, etc.
– Coding, synthesis, recognition, verification, understanding, etc.
• Knowledge discovery and information extraction
– From spotting cues and events to understanding media patterns
• Summary and emerging opportunities

Evolution of Language and Media

[Figure: timeline, "Historic Flow of Knowledge & Civilization". Labels: Spoken Language; Written Language (3000 BC); Paper; Print (1450 AD); Telegraph & Telephone; Radio; TV; Recording Media; Electronic Media (1900 AD); Computer & Digital Processing; Internet & WWW; Hyper & Virtual Media ? (21st Cen)]

Growth in Network Traffic

[Figure: voice and data traffic in terabytes per day (axis 0 to 3000) by year, 96 to 01]

Voice Traffic: 576 TB/day
Data Traffic: 1178 TB/day at YE’00, 2136 TB/day at YE’01

The Internet Explosion
• Internet hosts: 75,000,000
• Web pages: 2,000,000,000
• Worldwide users: 275,000,000
• Traffic: CAGR since 1998 of 100%

Heterogeneous Multimedia Pages

Rich Content:
• Audio
• Video
• Image
• Speech
• Graphics
• Objects
• Comic Strips
• Files.xxx
• Links
• Multilingual


Picasso’s “Parade” (1917)

A Picture is worth more than a thousand words?

On display at IFC, HK

Ubiquitous Wireless Access
(Mobile Info Access and Transactions)

[Figure: access chain. Labels: Devices; Air Access Interface; Network (Internet, Corporate Networks, LANs); Services]

Any wireless device, any air interface, any desired network, any service

Multimodal Access of Multimedia DBs
(Research & Business Opps for Info Intelligence)

[Figure: block diagram. Labels: User Model; User Input (Keyboard, Speech, MM-pad); Speech Recognizer; Text Processing; User Intent Understanding; Q&A Dialogue; User Feedback; Info Fusion; Info Fusion & Retrieval; Multimedia Presentation; Audio/Video Rendering; A/V Browser; Information Appliance; Network; Raw A/V Database; Multimedia Processing; Audio/Video Recognizer; Multimedia Indexing (Video, Audio, Text); Indexed A/V Database]

Human Information Technologies & 4M
• Multimedia Documents
– Audio, video, speech, image, text, chart, map, etc.
– Indexing, retrieval, presentation, rendering, etc.
• Multi-Modal Human-Machine Interface (HCI)
– Speech, gesture, point ‘n’ click, pen, MM sketch pad, etc.
– Multiple sensory inputs and feedbacks
• Multilingual Information Sources
– Multilingual human language understanding
– Multilingual presentation, cross-language referencing
• Multidisciplinary Collaborative Research
– Engineers, scientists, artists, psychologists, etc.
– Human factors, behavior science, wide range of soft topics

Human Language Engineering Abstraction
• Modeling of Input-Output Relationship
– Shannon’s Channel Modeling and Decoding Paradigm
• Signal Processing of Linguistic Features
– e.g. latent semantic analysis and vector space representation
• Similarity Measures between Documents
– Clustering and modeling of linguistic events
• Machine Learning Techniques for Classifier Design
• Document Classification, Verification, Understanding
– Many research and business opportunities

Vector Space Representation of Queries & Documents (Latent Semantic Indexing)

[Figure: a query vector $q$ and document vectors $d_i$ in the vector space. Labels: Credit Card Services; Deposit Services; Consumer Lending; Home Equity Service; Loan Servicing]

Query Vector Feature Extraction
• Text Pre-processing (SMART, Salton, 1971)
– extract the root form of a word, e.g. check for checking
– remove ignore words, e.g. um, uh
– remove stop words, e.g. I would like to
– count occurrences of the remaining key terms

[Figure: pipeline. Text, or Speech via ASR → Morphological Filtering → Query-Vector Extraction → Query Vector, supported by a Stop/Ignore List and a Key Term List]
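The pre-processing steps on this slide can be sketched in a few lines of Python. This is only an illustration: the word lists, the `stem` helper, and the `query_vector` function are hypothetical names introduced here, and the crude suffix stripping merely stands in for whatever morphological filter the original system used.

```python
import re
from collections import Counter

# Hypothetical lists for illustration; the slides do not give the real ones.
IGNORE_WORDS = {"um", "uh"}
STOP_WORDS = {"i", "would", "like", "to", "a", "the", "my", "on"}
KEY_TERMS = ["check", "balance", "loan", "card"]


def stem(word):
    """Crude suffix stripping, a stand-in for a real morphological filter."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def query_vector(text):
    """Map raw text to a vector of key-term counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop ignore words and stop words, then reduce the rest to root forms.
    kept = [stem(t) for t in tokens
            if t not in IGNORE_WORDS and t not in STOP_WORDS]
    counts = Counter(kept)
    # One entry per key term, in the fixed order of KEY_TERMS.
    return [counts[term] for term in KEY_TERMS]


print(query_vector("Um, I would like to check my checking balance"))
```

With these assumed lists, "checking" is reduced to "check", so the first key term is counted twice and "balance" once.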

LSA Based Feature Extraction
• LSA Matrix (also known as Routing Matrix) C: $c_{ij} = (1 - \varepsilon_i)\, n_{ij} / n_{\cdot j}$
– number of times word $w_i$ occurs in $A_j$: $n_{ij}$
– total number of words present in $A_j$: $n_{\cdot j}$ (column sum)
– total number of times $w_i$ occurs in A: $n_{i \cdot}$ (row sum)
– “indexing” power of $w_i$ in corpus A: $\eta_i = 1 - \varepsilon_i$
– normalized entropy: $\varepsilon_i = -\frac{1}{\log N} \sum_{j=1}^{N} \frac{n_{ij}}{n_{i \cdot}} \log \frac{n_{ij}}{n_{i \cdot}}$, with $0 \le \varepsilon_i \le 1$
– $\varepsilon_i = 0$ if $n_{ij} = n_{i \cdot}$ (maximum indexing power); $\varepsilon_i = 1$ if $n_{ij} = n_{i \cdot} / N$ (no power, equally probable)
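As a rough sketch of the entropy weighting on this slide, the following NumPy function builds the weighted matrix from raw counts. The function name `lsa_weighted_matrix` is an assumption made here, the convention 0 log 0 = 0 is applied for words that are absent from a document, and at least two documents are assumed so that log N is nonzero.

```python
import numpy as np


def lsa_weighted_matrix(counts):
    """Entropy-weighted term-document matrix.

    counts[i, j] is the number of times word i occurs in document j.
    Returns (C, eps): the weighted matrix and the normalized entropies.
    Assumes at least two documents (N >= 2).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    row_sum = counts.sum(axis=1, keepdims=True)   # occurrences of word i in the corpus
    col_sum = counts.sum(axis=0, keepdims=True)   # number of words in document j

    # p[i, j] = n_ij / n_i. (zero for words that never occur)
    p = np.divide(counts, row_sum, out=np.zeros_like(counts), where=row_sum > 0)
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])       # 0 log 0 is treated as 0
    eps = -plogp.sum(axis=1) / np.log(n_docs)

    # Scale each count by the word's indexing power and the document length.
    normalized = np.divide(counts, col_sum, out=np.zeros_like(counts), where=col_sum > 0)
    C = (1.0 - eps)[:, None] * normalized
    return C, eps


C, eps = lsa_weighted_matrix([[4, 0], [1, 1]])
print(eps)
print(C)
```

In this toy input the first word occurs in only one document, so its entropy is 0 and it keeps its full weight; the second word is spread evenly, so its entropy is 1 and its row is weighted down to zero.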

LSA Feature Space
• Mapping into Latent Semantic Space S
– SVD: $C = T S D^t$, with $M \approx 10{,}000$, $N \approx 100{,}000$, $R \approx 150$ to $200$
– each document vector (one of the N column vectors $a_j$ of matrix C) is mapped to a (1xR)-vector $v_j^t = d_j^t S$
– each term vector (one of the M row vectors $b_i$ of matrix C) is mapped to a (1xR)-vector $u_i = t_i S$
– each query vector (a new Mx1 vector) is mapped to a (1xR)-vector through the pseudo-document vector
– closeness in the S space is much easier to measure for both document-document and term-term comparisons
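A minimal NumPy sketch of this mapping is shown below, under the assumption that a truncated SVD of C is taken with R retained singular values. The function names are hypothetical, and the query fold-in uses the common choice of projecting the query onto the left singular vectors, which is one plausible reading of "through the pseudo-document vector" rather than a method confirmed by the slides.

```python
import numpy as np


def lsa_space(C, R):
    """Truncated SVD C ~ T S D^t with R retained singular values."""
    T, s, Dt = np.linalg.svd(C, full_matrices=False)
    T, s, Dt = T[:, :R], s[:R], Dt[:R, :]
    term_vecs = T * s        # row i is the scaled term vector t_i S
    doc_vecs = Dt.T * s      # row j is the scaled document vector d_j^t S
    return T, s, term_vecs, doc_vecs


def fold_in_query(q, T):
    """Map a new M-dimensional query into the R-dimensional space.

    For a column a_j of C, a_j^t T equals d_j^t S, so a query is treated
    like a new document column (an assumption, see the lead-in).
    """
    return np.asarray(q, dtype=float) @ T


rng = np.random.default_rng(0)
C = rng.random((6, 4))       # toy matrix: 6 terms, 4 documents
T, s, term_vecs, doc_vecs = lsa_space(C, R=2)
print(fold_in_query(C[:, 0], T))   # lands on the first document vector
print(doc_vecs[0])
```

Folding in an existing column of C reproduces that document's vector, which is a quick sanity check that queries and documents share one space.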

Confidence Scoring
• Inner Product: $s(x, y) = x \cdot y^t$
• Cosine: $s(x, y) = \frac{x \cdot y^t}{\|x\| \, \|y\|}$, or $\cos^{-1}[s(x, y)]$
• Confidence Scoring: sigmoid function fitting, $\mathrm{Conf}(s; \alpha, \beta) = [1 + e^{-(\alpha s + \beta)}]^{-1}$
• Other Scores
– Euclidean, Manhattan, etc.
• Generalized Scores
– between any two vectors: $s(x, y) = f(x, y; \Gamma)$

Term Clustering
• $MM^T$ characterizes all co-occurrences between terms; the (i, j) cell of $MM^T$ reflects the similarity between $w_i$ and $w_j$
• Define a distance measure: $K(w_i, w_j) = \cos(u_i S, u_j S) = \frac{u_i S^2 u_j^T}{\|u_i S\| \, \|u_j S\|}$, $D(w_i, w_j) = \cos^{-1} K(w_i, w_j)$
• Given D, one can perform word clustering using any clustering algorithm, e.g. K-Means
• For document clustering, use $M^T M$ instead
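The distance measure above can be computed for all term pairs at once. The sketch below assumes the rows of `U` are the term vectors from the SVD and `s` holds the singular values; `term_distance_matrix` is a hypothetical name, and every scaled row is assumed to be nonzero.

```python
import numpy as np


def term_distance_matrix(U, s):
    """Pairwise angular distances between the scaled term vectors u_i S."""
    V = np.asarray(U, dtype=float) * np.asarray(s, dtype=float)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    Vn = V / norms                          # unit-length rows
    K = np.clip(Vn @ Vn.T, -1.0, 1.0)       # cosine similarities
    return np.arccos(K)                     # D = arccos(K)


U = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
s = [3.0, 1.0]
D = term_distance_matrix(U, s)
print(np.round(D, 3))
```

The resulting matrix D can then be handed to any clustering routine that accepts pairwise distances; for document clustering the same function applies to the document vectors.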

K-Means Term Clustering Example
• 9492 words into 100 clusters (one example)

oub bank Singapore cent uob db account share singtel trade Bangkok manage save entity annual ocbc tangible debt sti keppel custom transact currency deposit card sixth citibank integer subscribe handset creation loan auditor merger autom merge sharehold attract uncondi asx optu sembawang ibra restructur singland landlord uic yaw sgx

Document Clustering Example
• 2000 documents into 100 clusters (one example)

N Korea Proposes Resumed Talks with S Korea-Yonhap
North Korea Proposes Resuming Talks with Seoul
South Korea Set for Key Vote on Approach to North
Korea to Replace Four to Eight Ministers on Friday
S.Korea to Push North Policy Despite Kim Setback
……

Conventional View on PR

[Figure: an unknown pattern $d_j$ is passed to m classifiers $T_1, \ldots, T_i, \ldots, T_m$; the i-th classifier outputs a label $L_i(d_j)$ with a true/false (T/F?) decision, up to $L_m(d_j)$, the label by the m-th classifier]

Modeling and recognition units are the same!

Shannon’s Channel Modeling Paradigm – An Information Theoretic Perspective

[Figure: $I$ → Channel $P(O|I)$ → $O$ → Channel Decoder → $\hat{I}$]

$\hat{I} = \arg\max_{I \in \Gamma} P(I|O) = \arg\max_{I \in \Gamma} \frac{P(O|I)\, P(I)}{P(O)}$

• Channel input is hidden (unobserved) while output is observed and used to infer the input (which is often approximated by a structural Markov model in many problems in speech, language and MM processing)
• Channel Modeling with (I, O) pairs in training
• Modeling units are usually smaller than recognition units
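The decoding rule on this slide picks the input that maximizes P(O|I) P(I), since P(O) does not depend on I. A toy sketch follows; the `decode` function and all probabilities are made up for illustration, with the channel stored as a dictionary keyed by (observation, input) pairs.

```python
import math


def decode(observation, prior, channel):
    """Return the input I that maximizes log P(O|I) + log P(I).

    prior:   dict mapping each candidate input I to P(I)
    channel: dict mapping (O, I) pairs to P(O|I)
    """
    best, best_score = None, -math.inf
    for candidate, p_i in prior.items():
        p_o_given_i = channel.get((observation, candidate), 0.0)
        if p_i <= 0.0 or p_o_given_i <= 0.0:
            continue  # impossible under the model
        score = math.log(p_o_given_i) + math.log(p_i)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Made-up numbers: "form" is a priori more likely, but the channel model
# says the observed string "frm" is more often a corruption of "from".
prior = {"form": 0.6, "from": 0.4}
channel = {("frm", "form"): 0.1, ("frm", "from"): 0.3}
print(decode("frm", prior, channel))
```

Working in log probabilities avoids underflow once the inputs become long sequences rather than single words.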

Other Applications in Pattern Recognition

Application | Input | Output | P(I) | P(O|I)
Optical Char. Recognition | Actual Letters | Noisy Letters | Character (Letter) LM | OCR Error Model
Part-of-Speech Tagging | POS Tag Sequence | Word Sequence | POS Tag LM | Tagging Model
Parsing | Parse Tree | Word Sequence | LM of Derivations | Parsing Model
Text Understanding | Semantic Concept | Word Sequence | Concept LM | Semantic Model
Machine Translation | Source Sentence | Target Sentence | Source LM | Translation Model
Bioinformatics | Actual DNA Sequence | Noisy DNA Sequence | LM of Nucleotides | Bio-genetic Model

Modeling Input-Output Associations
• Artificial Neural Network (ANN)
– MLP functional approximation and input-output mapping
• Classification and Regression Tree (CART)
– Multi-layer tree approximation
• Support Vector Machine (SVM) and LVQ
• Kernel-based, mixture of experts, Bayesian network
• Other Machine Learning Techniques
• Many New Applications
– Rule induction, statistical parsing, machine translation, etc.
– Pronunciation modeling and multilingual transliteration
– Information retrieval, text categorization, and call routing

Hidden Markov Model (HMM) - Dynamic Time or Space Warping

$P_\Lambda(X|C) = \sum_q P_\Lambda(X, q|C)$

$P_\Lambda(X, q|C) = a_0 \prod_t a_{q_{t-1} q_t} b_{q_t}(x_t)$

$X = (x_1, x_2, x_3, \ldots, x_T)$

• Each state represents a process of measurable observations.
• Inter-process transition is governed by a finite state Markov chain.
• Processes are stochastic and individual observations do not immediately identify the hidden state.

HMM models spectral and temporal variations simultaneously!
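The sum over all state sequences q grows exponentially with T, but it can be computed efficiently with the forward recursion. The sketch below is for a discrete-observation HMM and makes one assumption not stated on the slide: the initial term is read as an initial-state distribution, called `pi` here. A brute-force sum over every state path is included only to check the recursion on a tiny example.

```python
import itertools

import numpy as np


def forward_likelihood(pi, A, B, obs):
    """P(X | model) by the forward algorithm.

    pi[i]   : probability of starting in state i (assumed reading of a_0)
    A[i, j] : transition probability from state i to state j
    B[i, k] : probability of emitting symbol k in state i
    obs     : sequence of observed symbol indices x_1 .. x_T
    """
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
    return float(alpha.sum())


def brute_force_likelihood(pi, A, B, obs):
    """Same quantity by summing the joint probability over every path q."""
    total = 0.0
    for q in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total


pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
print(forward_likelihood(pi, A, B, obs), brute_force_likelihood(pi, A, B, obs))
```

The forward recursion costs on the order of T times the squared number of states, whereas the brute-force sum visits every possible path.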

Text Categorization: Training Classifiers

[Figure: training flow. Training set for each category $C_i$, $i = 1, \ldots, m$ (positive $P_i$ + negative $N_i$) → (1) Feature Extraction & Reduction → documents in the new feature space → (2) Classifier Learning → classifier $T_i$ for category $C_i$]

Related Work on Classifier Design
• Decision Tree: Simple, popular, and powerful classifier. Many available tools, C4.5, CART, ID3
• Linear discriminative function: $f(X, W) = \sum_{i=1}^{D} w_i x_i - w_0$
• Support Vector Machine (SVM)
• Naïve Bayes: simple distributions for each class
• K-Nearest Neighbor (kNN)
• Semantic Perceptron Net (SPN)
• Hidden Markov Model (HMM)
• Discriminative Training
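To make the linear discriminative function concrete, here is a small sketch that evaluates it and fits its weights with the classic perceptron rule on a positive and a negative training set for one category. The perceptron is used purely as an illustration and is not claimed to be the training method behind the slides; the function names and the toy data are assumptions.

```python
def discriminant(x, w, w0):
    """Linear discriminative function: sum_i w_i * x_i - w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) - w0


def train_perceptron(positives, negatives, max_epochs=100):
    """Fit (w, w0) so the function is positive on positives, negative on negatives."""
    data = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
    w = [0.0] * len(data[0][0])
    w0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * discriminant(x, w, w0) <= 0:
                # Move the decision boundary toward the misclassified point.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                w0 -= y
                mistakes += 1
        if mistakes == 0:
            break  # every training document is on the correct side
    return w, w0


positives = [[2.0, 2.0], [3.0, 1.0]]    # toy feature vectors inside the category
negatives = [[0.0, 0.0], [-1.0, 1.0]]   # toy feature vectors outside it
w, w0 = train_perceptron(positives, negatives)
print(w, w0)
```

A document with features X is then assigned to the category when the function value is positive. For linearly separable data the loop stops after finitely many updates; otherwise it simply ends at `max_epochs`.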

Reading Tables in Documents (TTS)

COMPANY | TODAY'S OPEN | TODAY'S CHANGE | YESTERDAY'S OPEN | YESTERDAY'S CHANGE
BLUE INC | 75 1/2 | +1 1/8 | 74 9/16 | -4 1/4
GREEN.COM | 89 1/4 | +2 | 88 5/8 | -2 13/16
RED INC | 22 1/4 | +5/16 | 21 13/16 | -3/8
YELLOW LTD | 103 3/8 | -1 13/16 | 101 | -4
PURPLE INC | 27 11/16 | -2 5/8 | 27 5/8 | -1 1/8
BROWN.COM | 68 | +11/16 | 66 11/16 | -1 5/8
PINK LTD | 130 7/16 | +1 1/16 | 130 | -2 3/8

Document understanding is needed before rendering!

Web Information Access & Presentation

[Figure: a news page (HTML), "Sampras volunteers for Davis Cup doubles duty", is converted into News Content (Text), Summary, and Links]

• Web data mining
• Web content extraction
• Topic detection and automatic summarization
• Information rendering and presentation
• Q&A construction for natural interface

Image Segmentation & Annotation

• Concept definition needed?
• What is image understanding?

“Building, sky, lake, tree, landscape”

Concept vs. Content Based Search

[Figure: comparison of Google and ATR Concept Search results. Labels: Query Taxonomy; Google; ATR]

Multilingual IA (IIS/Taiwan)

[Table: images with their top 4 keywords; the images are not reproduced, one row per image]
彩虹 (Rainbow), 天氣 (Weather), 花 (Flower), 自然 (Nature)
向日葵 (Sunflower), 花 (Flower), 植物 (Plant), 沙漠 (Desert)
海豹 (Seal), 哺乳類 (Mammal), 海岸 (Coast), 動物 (Animal)
太陽系 (Solar System), 慧星 (Comet), 熱帶魚 (Tropical Fish), 太空 (Universe)
瀑布 (Waterfall), 地形 (Landform), 自然 (Nature), 蟑螂 (Cockroach)
狗 (Dog), 哺乳類 (Mammal), 穿山甲 (Pangolin), 羊 (Sheep)

Cross-Language Web Search (IIS/Taiwan)
• A Web search service allows users to query in one language and search documents that are written or indexed in another language.

Audio Segmentation & Annotation (DP-based; often involves segmental models such as HMM)

Audio Speech

SpeechFind: Speech & Speaker Annotation

Fully searchable online database of spoken word collections spanning the 20th century

http://svoice.colorado.edu (Bowen Zhou)

Video & Audio Segmentation
(Story Segmentation of Audiovisual Documents)


Video Clip Browsing over IP on 3G

From Web Search to Web Mining
• Exploring the Development of Advanced IR Techniques through Web Mining

[Figure: Weblogs, texts, images, … → Knowledge Discovery & Info Extraction → Search Engine. Label groups follow]

• Anchor Texts
• Query Term Logs
• Query Session Logs
• Audio/Image Banks

Language info, Speaker Profile, Image Semantics, Background info, Term Extraction, Face/Object ID, etc.

• Cross-Language IR
• Concept Search
• Personalized Search
• Multimedia Search

Personal Media: A New Scenario

[Figure: stages Specification, Media Space, Composition, Presentation. Labels: media mining; content analysis; semantic analysis; authored story; media server; navigation]

Summary
• Rich content of heterogeneous media patterns
– Text, audio, video, speech, image, object, graphics, sketch, etc.
– Web is becoming the largest multimedia database & playground
• 4M in human information processing technology
– Multimedia, multi-modal, multi-lingual, multi-disciplinary
• Technology dimensions
– Parametrization, feature extraction, modeling, segmentation, etc.
– Coding, synthesis, recognition, verification, understanding, etc.
• Knowledge discovery and information extraction
– Spotting cues/events embedded in unconstrained media patterns
• Many emerging research opportunities
