Subproject III - Spoken Language Systems


Members:

Lin-shan Lee (PI), Lee-Feng Chien (Co-PI)

Hsin-min Wang (Co-PI), Berlin Chen (Co-PI)

Other Participants:

Sin-Horng Chen, Yih-Ru Wang

Yuan-Fu Liao, Jen-Tzung Chien


Outline

• Members

• Research Theme

• Current Achievements with Demos

• Future Directions


Members


Research Theme

[Figure: research theme overview. Diagram labels: Information Extraction and Retrieval (IE & IR); Spoken Document Understanding and Organization; Spoken Dialogues; Multimedia Network Content; Networks; Users.]

[Figure: Chinese Broadcast News Archive. Diagram labels: Named Entity Extraction; Segmentation; Topic Analysis and Organization; Summarization; Title Generation; Information Retrieval; Two-dimensional Tree Structure for Organized Topics; Input Query; User Instructions; retrieval results; titles, summaries.]

[Figure: latent topic model. Documents $d_1, d_2, \ldots, d_i, \ldots, d_N$ are connected to latent topics $T_1, T_2, \ldots, T_k, \ldots, T_K$, which are connected to the query terms $t_1, t_2, \ldots, t_j, \ldots, t_n$. Edge labels: $P(d_i)$, $P(T_k \mid d_i)$, $P(t_j \mid T_k)$. Query: $Q = t_1 t_2 \ldots t_j \ldots t_n$.]


Research Roadmap

Current Achievements

• Term Extraction/Organization

• Term Translation/Indexing

• Retrieval Modeling

• Title/Summary Generation

• Topic Analysis/Organization

• Information Extraction and Retrieval (IE & IR)

• Spoken Document Understanding and Organization

• Spoken Dialogues

• Distributed Speech Recognition

Future Directions

• Information Navigation across Multimedia/Spoken Documents

• Cross-language Information Processing

• Knowledge Discovery and Web Mining

• Spoken Language Applications

Speech & Language Understanding

• …..


Information Extraction & Retrieval (IE & IR)

Named Entity Extraction from Text/Spoken Documents

Taxonomy Generation

Term Translation

Retrieval Modeling for Text/Spoken Documents


Named Entity Extraction from Text/Spoken Documents

Global Information for the Entire Document Extracted from Forward/Backward PAT-Trees

– Some named entities may not be easily identified from a single sentence, but can be extracted when information in several sentences is jointly considered

Named Entity Matching Using Retrieved Text Documents to Identify Some Out-of-Vocabulary (OOV) Words
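The idea of using document-wide statistics can be sketched as follows. This is only an illustration, not the project's method: plain dictionaries stand in for the forward/backward PAT-trees, and the function name `candidate_terms`, its thresholds, and the toy string are all hypothetical. A substring is kept as a candidate when it repeats across the document and both its left and right neighbours vary, which suggests a complete unit rather than a fragment of a longer one.

```python
from collections import defaultdict

def candidate_terms(text, max_len=4, min_count=2):
    """Return repeated substrings whose left and right neighbours both vary."""
    count = defaultdict(int)
    left, right = defaultdict(set), defaultdict(set)
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            count[gram] += 1
            # "^" and "$" mark the document boundaries (assumed sentinels)
            left[gram].add(text[i - 1] if i > 0 else "^")
            right[gram].add(text[i + n] if i + n < len(text) else "$")
    return {g: c for g, c in count.items()
            if c >= min_count and len(left[g]) > 1 and len(right[g]) > 1}

# Toy document: "ABC" recurs in three different contexts.
print(candidate_terms("xABCy zABCw qABCr"))
```

In the toy string, "AB" and "BC" also repeat, but each always has the same neighbour on one side, so only the full unit "ABC" survives.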


Automatic Taxonomy Generation (1/2)

Problem

– Find relationships and associations between terms, and organize them into a hierarchical structure (i.e., a taxonomy)

– Useful for identifying and analyzing concepts embedded in documents and queries

Method

– An approach for clustering terms into comprehensive hierarchical clusters

– Web mining techniques: automatically generating relationships between terms based on the relationships between the documents retrieved from the Web with those terms
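A minimal sketch of this idea, assuming each term is represented by the words of the snippets a Web search returns for it. The clustering rule (greedy agglomerative merging by cosine similarity), the function names, and the snippet data are illustrative assumptions, not the approach actually proposed in the project.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_taxonomy(term_snippets):
    """Merge the two most similar clusters until a single binary tree remains.
    A tree is either a term (str) or a (left, right) pair of trees."""
    clusters = [(term, Counter(" ".join(snips).lower().split()))
                for term, snips in term_snippets.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical snippets standing in for documents retrieved from the Web.
snippets = {
    "speech recognition": ["acoustic model decoding speech audio", "speech audio recognition"],
    "speech synthesis": ["speech audio waveform synthesis", "text to speech audio"],
    "web search": ["query ranking web pages", "web pages index query"],
}
tree = build_taxonomy(snippets)
print(tree)
```

The two speech terms share vocabulary in their snippets and are merged first; the unrelated term joins only at the root.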


Automatic Taxonomy Generation (2/2)

A Typical Example for Term Taxonomy


Automatic Term Translation (1/2)

Problem

– Cross-language information retrieval systems usually rely on bilingual dictionaries; however, search terms are very often missing from them because they are proper nouns or OOV words

– Discovering translations of unknown query terms in different languages

Method

– Finding translations of query terms by mining huge quantities of data obtained from the Web

– Correlation/association patterns extracted from parallel bilingual pages retrieved from the Web, the anchor texts of pages whose out-links point to multilingual pages, etc.
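The anchor-text idea can be sketched as a co-occurrence score: terms that appear in anchor texts pointing to the same pages as the query term are translation candidates. The Dice coefficient used here, the function name, and the toy anchor sets are assumptions for illustration; the slide does not specify the scoring function, and a real system would also filter candidates by target language.

```python
from collections import defaultdict

def translation_candidates(anchor_sets, query):
    """Rank terms that share link targets with `query` by the Dice coefficient."""
    pages_with = defaultdict(set)      # term -> ids of pages whose anchors use it
    for page_id, anchors in enumerate(anchor_sets):
        for term in anchors:
            pages_with[term].add(page_id)
    q_pages = pages_with.get(query, set())
    scores = {}
    for term, pages in pages_with.items():
        if term == query:
            continue
        shared = len(q_pages & pages)
        if shared:
            scores[term] = 2 * shared / (len(q_pages) + len(pages))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: one set of anchor texts per link target.
anchor_sets = [
    {"Academia Sinica", "中央研究院", "research"},
    {"Academia Sinica", "中央研究院"},
    {"Academia Sinica", "institute", "research"},
    {"research", "university"},
]
ranked = translation_candidates(anchor_sets, "Academia Sinica")
print(ranked[0])
```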


Automatic Term Translation (2/2)

Machine-Extracted Translations

The Live Query Term Translation System (LiveTrans)

http://wkd.iis.sinica.edu.tw/LiveTrans/lt.html



Retrieval Modeling for Text/Spoken Documents (1/2)

Problem

– Conventional retrieval models cannot be trained or improved through use

– Word usage mismatch between the query and the documents

Method

– Literal term matching: HMM/N-gram model trained with maximum likelihood (ML) or minimum classification error (MCE) criteria

– Concept matching: Topical mixture model (TMM), extended from PLSA, trained in either a supervised or an unsupervised manner


Retrieval Modeling for Text/Spoken Documents (2/2)

HMM/N-gram retrieval model

– A document is viewed as a probabilistic generative model for the query

– Literal term matching

Topical Mixture Model (extended from PLSA)

– A document is composed of a set of K latent topical distributions (unigrams) for predicting the query

– Concept matching
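A minimal sketch of literal term matching, assuming the simplest form of the generative view: each query term is emitted either by the document's unigram model or by a collection-wide unigram model, mixed with a fixed weight. The weight `lam`, the function name, and the toy documents are hypothetical; training the weights with ML or MCE criteria is not shown.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|D) with a document unigram interpolated with a collection unigram."""
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:                   # term unseen everywhere: zero likelihood
            return float("-inf")
        score += math.log(p)
    return score

# Hypothetical two-document collection.
docs = {
    "d1": "typhoon hits taipei heavy rain".split(),
    "d2": "stock market falls in taipei".split(),
}
collection = [w for d in docs.values() for w in d]
query = "taipei typhoon".split()
ranking = sorted(docs, reverse=True,
                 key=lambda d: query_log_likelihood(query, docs[d], collection))
print(ranking)
```

Both documents contain "taipei", but only d1 contains "typhoon", so d1 ranks first; the collection model keeps d2's score finite. Concept matching replaces the document unigram with a mixture over latent topics, so that a document can score well without containing the query terms literally.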


Spoken Document Understanding & Organization (1/2)

Problem

– The content of multimedia documents is very often described by the associated speech information

– Unlike text documents, whose paragraphs and titles are easy to look through at a glance, multimedia/spoken documents are unstructured and difficult to retrieve and browse


Spoken Document Transcription

Multimedia/Spoken Document Segmentation

Summarization for Multimedia/Spoken Documents

Title Generation for Multimedia/Spoken Documents

Topic Analysis and Organization for Multimedia/Spoken Documents



Spoken Document Segmentation (Broadcast News)

Dividing a One-hour News Episode into News Stories

– An improved audio segmentation technique integrating BIC and divide-and-conquer approaches

– Viterbi search over the hidden Markov model of text clusters

[Figure: distance computation]
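The slide does not detail the improved technique, so the following is only a sketch of the standard ingredients: a delta-BIC test that compares modelling a window with one Gaussian versus two, applied recursively in a divide-and-conquer fashion. One-dimensional features, the penalty weight, the margin, and the minimum segment length are simplifying assumptions.

```python
import math

def variance(xs):
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-6)  # floor avoids log(0)

def delta_bic(xs, i, penalty_weight=1.0):
    """BIC gain of modelling xs[:i] and xs[i:] with two Gaussians instead of one."""
    n = len(xs)
    gain = 0.5 * (n * math.log(variance(xs))
                  - i * math.log(variance(xs[:i]))
                  - (n - i) * math.log(variance(xs[i:])))
    # A second 1-D Gaussian adds two parameters (mean and variance).
    return gain - penalty_weight * 0.5 * 2 * math.log(n)

def find_change_point(xs, margin=2):
    """Return the best split index, or None if no split has positive delta-BIC."""
    best_i = max(range(margin, len(xs) - margin + 1), key=lambda i: delta_bic(xs, i))
    return best_i if delta_bic(xs, best_i) > 0 else None

def segment(xs, offset=0, min_len=6):
    """Divide and conquer: split at the best change point, then recurse on each side."""
    if len(xs) < min_len:
        return []
    cp = find_change_point(xs)
    if cp is None:
        return []
    return (segment(xs[:cp], offset, min_len) + [offset + cp]
            + segment(xs[cp:], offset + cp, min_len))

# Hypothetical 1-D feature track with one obvious acoustic change at index 6.
features = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 5.0, 5.1, 4.9, 5.05, 4.95, 5.0]
print(segment(features))
```

Real audio segmentation would use multivariate features such as cepstral vectors with full covariance matrices; the recursion and the sign test on delta-BIC stay the same.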


Title Generation for Spoken Documents (Broadcast News)

Training Phase

– Developing statistical relationships between words in the training documents and their human-generated titles

Generation Phase (for New Spoken Documents)

– Transcribing into term sequences

– Identifying suitable terms, and using them to generate a readable title

[Figure: training and generation. Training documents D = {dj, j = 1, 2, …, N} (text form) are paired with human-generated titles T = {tj, j = 1, 2, …, N} (text form). New spoken documents D = {di, i = 1, 2, …, N} (speech form) yield computer-generated titles T = {ti, i = 1, 2, …, M} (text/speech form).]


Topic Analysis and Organization for Spoken Documents (Broadcast News)

Based on Probabilistic Latent Semantic Analysis (PLSA)

– Terms (words, syllable pairs, etc.) and documents analyzed by probabilities considering a set of latent topics

– Trained by the EM algorithm

– Related documents don’t have to share common sets of terms, and related terms don’t have to co-exist in the same set of documents

$$P(t_j \mid d_i) = \sum_{k=1}^{K} P(t_j \mid T_k)\, P(T_k \mid d_i)$$

Spoken Documents Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or a Two-layer Map

[Figure: Two-dimensional Tree Structure for Organized Topics]
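A compact sketch of PLSA training by EM for the decomposition P(t_j | d_i) = sum over k of P(t_j | T_k) P(T_k | d_i). This is the textbook algorithm in plain Python, not the project's implementation; the function names, the random initialisation, the small smoothing constant, and the toy count matrix are assumptions.

```python
import random

def plsa(counts, num_topics, iterations=50, seed=0):
    """Fit P(t|T) and P(T|d) by EM; counts[i][j] = count of term j in document i."""
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalized(row):
        total = sum(row)
        return [v / total for v in row]

    p_t_T = [normalized([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(num_topics)]
    p_T_d = [normalized([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(n_docs)]

    for _ in range(iterations):
        # Tiny constant keeps every probability strictly positive.
        new_t_T = [[1e-12] * n_terms for _ in range(num_topics)]
        new_T_d = [[1e-12] * num_topics for _ in range(n_docs)]
        for i in range(n_docs):
            for j in range(n_terms):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior P(T_k | d_i, t_j), up to normalisation.
                joint = [p_t_T[k][j] * p_T_d[i][k] for k in range(num_topics)]
                total = sum(joint)
                for k in range(num_topics):
                    weight = counts[i][j] * joint[k] / total
                    new_t_T[k][j] += weight
                    new_T_d[i][k] += weight
        # M-step: renormalise the expected counts.
        p_t_T = [normalized(row) for row in new_t_T]
        p_T_d = [normalized(row) for row in new_T_d]
    return p_t_T, p_T_d

def p_term_given_doc(j, i, p_t_T, p_T_d):
    """P(t_j | d_i) as the topic-weighted mixture."""
    return sum(p_t_T[k][j] * p_T_d[i][k] for k in range(len(p_t_T)))

# Hypothetical term-document counts: two documents per vocabulary block.
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]
p_t_T, p_T_d = plsa(counts, num_topics=2)
```

Documents can then be clustered by their dominant topic, for example `max(range(2), key=lambda k: p_T_d[i][k])` for document i.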


Spoken Dialogues

Analysis and Design Using Quantitative Simulations


Analysis and Design Based on Quantitative Simulations

Problem

– Dialogue performance cannot be predicted before the system is online

– The effects of different factors, such as the system’s dialogue strategies and the speech recognition and understanding conditions, cannot be quantitatively identified and analyzed

Method

– Computer-aided analysis and design approaches based on quantitative simulations

[Figure: simulation results. Axis labels: misunderstanding rate, slot loss rate, transaction success rate.]
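A toy Monte Carlo sketch of how such a simulation might estimate the transaction success rate from the misunderstanding rate and the slot loss rate. The dialogue model is an assumption made for illustration: a slot-filling task in which a lost slot costs a turn, a misunderstood slot silently fails the transaction, and the dialogue is limited to a fixed number of turns. None of these definitions or parameter values come from the slides.

```python
import random

def simulate_dialogue(num_slots, misunderstanding_rate, slot_loss_rate, max_turns, rng):
    """Return True if every slot is filled correctly within max_turns."""
    filled = 0
    for _ in range(max_turns):
        if filled == num_slots:
            break
        r = rng.random()
        if r < slot_loss_rate:
            continue                   # slot lost: the system has to ask again
        if r < slot_loss_rate + misunderstanding_rate:
            return False               # wrong value accepted: transaction fails
        filled += 1
    return filled == num_slots

def transaction_success_rate(num_slots=4, misunderstanding_rate=0.05,
                             slot_loss_rate=0.1, max_turns=10,
                             trials=20000, seed=0):
    """Estimate the success rate by simulating many independent dialogues."""
    rng = random.Random(seed)
    wins = sum(simulate_dialogue(num_slots, misunderstanding_rate,
                                 slot_loss_rate, max_turns, rng)
               for _ in range(trials))
    return wins / trials

# Sweep one factor while holding the other fixed.
for loss in (0.0, 0.2, 0.4):
    print(loss, transaction_success_rate(slot_loss_rate=loss))
```

Sweeping both rates over a grid gives a surface of success rates, which allows dialogue strategies to be compared before a system is deployed.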


Demo: Understanding and Organization of Chinese Broadcast News with Interactive Interface

[Figure: demo system. Diagram labels: Segmentation; Summarization; Title Generation; Input Query; User Instructions; retrieval results; titles, summaries.]


Future Directions

Information Navigation across Multimedia/Spoken Documents

– Fast-growing quantities of multimedia/spoken documents are much more difficult to browse than text documents

– Better approaches to navigating across huge quantities of multimedia/spoken documents using comprehensive presentation (e.g., topic taxonomy)

Cross-language Information Processing Technologies

– Reducing language barriers in a future multilingual world

– Seeking international collaboration and resource exchange

– Collaboration between the two major non-English languages may be a good direction

Knowledge Discovery and Web Mining

– The Web offers live, dynamic, and by far the most complete global knowledge available to humanity

– Better approaches to exploring Web resources and enhancing language processing technologies