Subproject III - Spoken Language Systems. Members: Lin-shan Lee (PI), Lee-Feng Chien (Co-PI), Hsin-min Wang (Co-PI), Berlin Chen (Co-PI). Other Participants: Sin-Horng Chen, Yih-Ru Wang
1
Subproject III - Spoken Language Systems
Members:
Lin-shan Lee (PI), Lee-Feng Chien (Co-PI)
Hsin-min Wang (Co-PI), Berlin Chen (Co-PI)
Other Participants:
Sin-Horng Chen, Yih-Ru Wang
Yuan-Fu Liao, Jen-Tzung Chien
2
Outline
Members
Research Theme
Current Achievements with Demos
Future Directions
3
Members
4
Research Theme
Information Extraction and Retrieval (IE & IR)
Spoken Dialogues
Spoken Document Understanding and Organization
Multimedia Network Content
Networks
Users
[Figure: system block diagram built around the Chinese Broadcast News Archive. Component labels: Named Entity Extraction, Segmentation, Topic Analysis and Organization, Summarization, Title Generation, Information Retrieval, Two-dimensional Tree Structure for Organized Topics. Interface labels: Input Query, User Instructions, retrieval results, titles, summaries.]
[Figure: three-layer diagram. Documents $d_1, d_2, \ldots, d_i, \ldots, d_N$, each with probability $P(d_i)$, connect to latent topics $T_1, T_2, \ldots, T_k, \ldots, T_K$ through $P(T_k \mid d_i)$; the latent topics connect to the query terms $t_1, t_2, \ldots, t_j, \ldots, t_n$ through $P(t_j \mid T_k)$. The query is $Q = t_1 t_2 \ldots t_j \ldots t_n$.]
5
Research Roadmap
• Term Extraction/Organization, Term Translation/Indexing
• Retrieval Modeling
• Title/Summary Generation
• Topic Analysis/Organization
• Information Extraction and Retrieval (IE & IR)
• Spoken Document Understanding and Organization
• Spoken Dialogues
• Distributed Speech Recognition
Information Navigation across Multimedia/Spoken Documents
Cross-language Information Processing
Knowledge Discovery and Web Mining
Spoken Language Applications
Future Directions
Current Achievements
Speech & Language Understanding
• …..
6
Information Extraction & Retrieval (IE & IR)
Named Entity Extraction from Text/Spoken Documents
Taxonomy Generation
Term Translation
Retrieval Modeling for Text/Spoken Documents
7
Named Entity Extraction from Text/Spoken Documents
Global Information for the Entire Document Extracted from Forward/Backward PAT-Trees
– Some named entities may not be easily identified from a single sentence, but can be extracted when the information in several sentences is jointly considered
Named Entity Matching using Retrieved Text Documents to Identify Some Out-of-Vocabulary (OOV) Words
8
Automatic Taxonomy Generation (1/2)
Problem
– Find relationships and associations between terms, and organize them into a hierarchical structure (i.e. a taxonomy)
– Useful for identifying and analyzing concepts embedded in documents and queries
Method
– An approach proposed for clustering terms into comprehensive hierarchical clusters
– Web mining techniques: automatically generating relationships between terms based on relationships between the documents retrieved with those terms from the Web
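The clustering step above can be illustrated with a minimal sketch. It assumes each term is represented by the words found in documents retrieved for it (the `contexts` dictionary below is hand-made stand-in data, not real Web results) and uses plain average-link agglomerative clustering with cosine similarity. The slides do not specify the actual clustering algorithm, so the function names and the choice of linkage are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_taxonomy(contexts):
    """Average-link agglomerative clustering; returns a nested tuple tree."""
    vectors = {term: Counter(words) for term, words in contexts.items()}
    # Each cluster is (subtree, list of member terms).
    clusters = [(term, [term]) for term in sorted(contexts)]

    def link(c1, c2):
        sims = [cosine(vectors[a], vectors[b]) for a in c1[1] for b in c2[1]]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical context words standing in for documents retrieved from the Web.
contexts = {
    "speech recognition": ["acoustic", "model", "speech", "decoder"],
    "speech synthesis": ["speech", "voice", "acoustic", "prosody"],
    "web search": ["query", "index", "ranking", "web"],
    "web mining": ["web", "crawl", "query", "pattern"],
}
tree = build_taxonomy(contexts)
print(tree)
```

On this toy data the two speech terms and the two Web terms end up in separate subtrees, which is the kind of hierarchy the slide describes.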
9
Automatic Taxonomy Generation (2/2)
A Typical Example for Term Taxonomy
10
Automatic Term Translation (1/2)
Problem
– Cross-language information retrieval systems usually rely on bilingual dictionaries; however, search terms are very often missing from them because they are proper nouns or OOV words
– Discovering translations of unknown query terms in different languages
Method
– Finding translations of query terms by mining huge quantities of data obtained from the Web
– Correlation/association patterns extracted from parallel bilingual pages retrieved from the Web, the anchor texts of the pages indicating out-links to multilingual pages, etc.
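As a rough illustration of the anchor-text idea, the sketch below ranks candidate translations by how often they co-occur with the source term among anchor texts that link to the same page, scored with the Dice coefficient. The anchor-text sets are invented, the scoring function is an assumption (the slides do not name one), and a real system would also filter candidates by language.

```python
from collections import Counter


def rank_translations(source_term, anchor_sets, top_k=3):
    """Rank co-occurring anchor texts by Dice coefficient with source_term.

    anchor_sets: list of sets; each set holds the anchor texts of links
    that point to one and the same page.
    """
    term_count = Counter()
    joint = Counter()
    for anchors in anchor_sets:
        for a in anchors:
            term_count[a] += 1
        if source_term in anchors:
            for a in anchors:
                if a != source_term:
                    joint[a] += 1
    scores = {
        cand: 2.0 * n / (term_count[source_term] + term_count[cand])
        for cand, n in joint.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical anchor texts grouped by the page they link to.
anchor_sets = [
    {"Yahoo", "雅虎", "portal"},
    {"Yahoo", "雅虎"},
    {"Yahoo", "搜尋"},
    {"portal", "入口網站"},
]
ranked = rank_translations("Yahoo", anchor_sets)
for candidate, score in ranked:
    print(candidate, round(score, 2))
```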
11
Automatic Term Translation (2/2)
Machine-Extracted Translations
The Live Query Term Translation System (LiveTrans)
http://wkd.iis.sinica.edu.tw/LiveTrans/lt.html
12
Retrieval Modeling for Text/Spoken Documents (1/2)
Problem
– Conventional retrieval models cannot be trained or improved through use
– Word usage mismatch between the query and the documents
Method
– Literal term matching: HMM/N-gram model trained with ML (maximum likelihood) or MCE (minimum classification error) criteria
– Concept matching: Topical mixture model (TMM), extended from PLSA, trained in either a supervised or an unsupervised manner
13
Retrieval Modeling for Text/Spoken Documents (2/2)
HMM/N-gram retrieval model
– A document is viewed as a probabilistic generative model for the query
– Literal term matching
Topical Mixture Model (extended from PLSA)
– A document is composed of a set of K latent topical distributions (unigrams) for predicting the query
– Concept matching
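The difference between the two models can be sketched in a few lines. The code assumes the unigram case of the HMM model, in which each query term is generated by a linear interpolation of the document model and a collection model, and it assumes the TMM scores a query as the product over query terms of $\sum_k P(t_j \mid T_k)\,P(T_k \mid d_i)$. The interpolation weight, the topic tables, and the toy documents are all made-up values for illustration.

```python
import math
from collections import Counter


def hmm_unigram_loglik(query, doc, collection, lam=0.7):
    """log P(Q|D) with P(t|D) interpolated with a collection model."""
    d, c = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        p = lam * d[t] / len(doc) + (1.0 - lam) * c[t] / len(collection)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


def tmm_loglik(query, p_term_given_topic, p_topic_given_doc):
    """log P(Q|D) where each term is drawn from a mixture of K topics."""
    score = 0.0
    for t in query:
        p = sum(p_term_given_topic[k].get(t, 0.0) * w
                for k, w in enumerate(p_topic_given_doc))
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


# Toy data: the query word "vote" occurs in neither document.
doc_a = ["election", "ballot", "candidate"]
doc_b = ["typhoon", "rain", "flood"]
collection = doc_a + doc_b + ["vote"]
query = ["vote"]

# Hypothetical topic tables (topic 0: politics, topic 1: weather).
p_term_given_topic = [
    {"election": 0.3, "ballot": 0.2, "candidate": 0.2, "vote": 0.3},
    {"typhoon": 0.4, "rain": 0.3, "flood": 0.3},
]
hmm_a = hmm_unigram_loglik(query, doc_a, collection)
hmm_b = hmm_unigram_loglik(query, doc_b, collection)
tmm_a = tmm_loglik(query, p_term_given_topic, [0.9, 0.1])
tmm_b = tmm_loglik(query, p_term_given_topic, [0.1, 0.9])
print("literal matching:", hmm_a, hmm_b)
print("concept matching:", tmm_a, tmm_b)
```

With literal matching the two documents tie, because neither contains the query word; with concept matching the politics document scores higher, which is the word usage mismatch the TMM is meant to address.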
14
Spoken Document Understanding & Organization (1/2)
Problem
– The content of multimedia documents is very often described by the associated speech information
– Unlike text documents, whose paragraphs and titles are easy to look through at a glance, multimedia/spoken documents are unstructured and difficult to retrieve/browse
15
Spoken Document Understanding & Organization (2/2)
Spoken Document Transcription
Multimedia/Spoken Document Segmentation
Summarization for Multimedia/Spoken Documents
Title Generation for Multimedia/Spoken Documents
Topic Analysis and Organization for Multimedia/Spoken Documents
16
Spoken Document Segmentation (Broadcast News)
Dividing a one-hour News Episode into News Stories
An improved audio segmentation technique integrating BIC and Divide-and-Conquer Approaches
Viterbi search over the Hidden Markov Model of text clusters
[Figure: distance computation]
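A minimal sketch of how BIC change detection can be combined with a divide-and-conquer recursion is given below. It assumes one-dimensional features modelled by a single Gaussian per segment, whereas real audio segmentation would use multi-dimensional features such as MFCCs. The penalty weight `lam`, the `margin`, and the synthetic signal are assumptions, and the slides do not describe the improved technique in enough detail to reproduce it.

```python
import math


def _logvar(x):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / len(x)
    return math.log(max(v, 1e-6))


def best_split(x, lam=1.0, margin=5):
    """Return the index with the largest positive delta-BIC, or None."""
    n = len(x)
    best_i, best_delta = None, 0.0
    for i in range(margin, n - margin + 1):
        # Likelihood gain of two Gaussians over one, minus the BIC penalty
        # for the two extra parameters (mean and variance) in one dimension.
        delta = (0.5 * n * _logvar(x)
                 - 0.5 * i * _logvar(x[:i])
                 - 0.5 * (n - i) * _logvar(x[i:])
                 - lam * math.log(n))
        if delta > best_delta:
            best_i, best_delta = i, delta
    return best_i


def segment(x, offset=0, lam=1.0, margin=5):
    """Divide and conquer: split at the best change point, then recurse."""
    i = best_split(x, lam, margin)
    if i is None:
        return []
    left = segment(x[:i], offset, lam, margin)
    right = segment(x[i:], offset + i, lam, margin)
    return left + [offset + i] + right


# Synthetic feature track: 20 frames around 0 followed by 20 frames around 5.
signal = [0.0, 0.1, -0.1, 0.05, -0.05] * 4 + [5.0, 5.1, 4.9, 5.05, 4.95] * 4
boundaries = segment(signal)
print(boundaries)
```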
17
Title Generation for Spoken Documents (Broadcast News)
Training Phase
– Developing statistical relationships between words in the training documents and their human-generated titles
Generation Phase (for New Spoken Documents)
– Transcribing into term sequences
– Identifying suitable terms, and using them to generate a readable title
[Figure: Training Phase: Training Documents D={dj, j=1,2,…,N} (text form) and Human-generated Titles of Training Documents T={tj, j=1,2,…,N} (text form). Generation Phase: New Spoken Documents D={di, i=1,2,…,N} (speech form) and Computer-generated Titles of Spoken Documents T={ti, i=1,2,…,M} (text/speech form).]
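The term selection part of the two phases can be sketched as follows. The sketch assumes the "statistical relationships" are simple co-occurrence estimates of P(title word | document word); the slides do not say which model is used, and the step of ordering the selected terms into a readable title (which would need a language model) is left out. All documents and titles below are invented.

```python
from collections import Counter, defaultdict


def train(docs, titles):
    """Count how often each document word co-occurs with each title word."""
    co = defaultdict(Counter)  # co[doc_word][title_word]
    for doc, title in zip(docs, titles):
        for w in set(doc):
            for t in set(title):
                co[w][t] += 1
    return co


def select_title_terms(doc, co, top_k=2):
    """Score candidate title terms by summing P(title word | doc word)."""
    scores = Counter()
    for w in doc:
        row = co.get(w, {})
        total = sum(row.values())
        for t, n in row.items():
            scores[t] += n / total
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_k]]


# Hypothetical training documents and their human-written titles.
docs = [
    ["typhoon", "rain", "flood", "taipei"],
    ["election", "vote", "candidate", "taipei"],
    ["typhoon", "wind", "damage"],
]
titles = [["typhoon", "flood"], ["election", "result"], ["typhoon", "damage"]]

co = train(docs, titles)
# Stand-in for the term sequence transcribed from a new spoken document.
new_doc = ["typhoon", "rain", "wind"]
title_terms = select_title_terms(new_doc, co)
print(title_terms)
```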
18
Topic Analysis and Organization for Spoken Documents (Broadcast News)
Based on Probabilistic Latent Semantic Analysis (PLSA)
– Terms (words, syllable pairs, etc.) and documents analyzed by probabilities considering a set of latent topics
– Trained by the EM algorithm
– Related documents don’t have to share common sets of terms, and related terms don’t have to co-exist in the same set of documents
$$P(t_j \mid d_i) = \sum_{k=1}^{K} P(t_j \mid T_k)\, P(T_k \mid d_i)$$
Spoken Documents Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or a Two-layer Map
[Figure: Two-dimensional Tree Structure for Organized Topics]
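For readers who want to see the EM training in concrete form, here is a small pure-Python sketch of PLSA fitted to a term-document count matrix. It follows the standard E-step (posterior of each topic given a document-term pair) and M-step (re-estimating P(t|T) and P(T|d) from expected counts). The random initialisation, the iteration count, and the toy counts are assumptions, and grouping documents by their dominant topic is only one simple way to read off clusters.

```python
import random


def plsa(counts, n_topics, n_iter=100, seed=0):
    """counts[i][j] = count of term j in document i.

    Returns (p_t_T, p_T_d) with p_t_T[k][j] = P(t_j | T_k) and
    p_T_d[i][k] = P(T_k | d_i).
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalize(row):
        s = sum(row)
        return [v / s for v in row]

    p_t_T = [normalize([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_T_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        # Small floor keeps every row normalizable.
        new_t_T = [[1e-12] * n_terms for _ in range(n_topics)]
        new_T_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for i in range(n_docs):
            for j in range(n_terms):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior P(T_k | d_i, t_j).
                post = normalize([p_t_T[k][j] * p_T_d[i][k]
                                  for k in range(n_topics)])
                # M-step accumulators: expected counts per topic.
                for k in range(n_topics):
                    new_t_T[k][j] += counts[i][j] * post[k]
                    new_T_d[i][k] += counts[i][j] * post[k]
        p_t_T = [normalize(r) for r in new_t_T]
        p_T_d = [normalize(r) for r in new_T_d]
    return p_t_T, p_T_d


# Toy counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]
p_t_T, p_T_d = plsa(counts, n_topics=2)
dominant = [max(range(2), key=lambda k: row[k]) for row in p_T_d]
print(dominant)
```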
19
Spoken Dialogues
Analysis and Design Using Quantitative Simulations
20
Analysis and Design Based on Quantitative Simulations
Problem
– Dialogue performance cannot be predicted before the system is on line
– The effects of different factors, such as the system’s dialogue strategies and the speech recognition and understanding conditions, cannot be quantitatively identified and analyzed
Method
– Computer-aided analysis and design approaches based on quantitative simulations
[Figure: plot; labels: misunderstanding rate, slot loss rate, transaction success rate]
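The idea of predicting dialogue performance by simulation can be sketched with a small Monte Carlo experiment. The dialogue model below is an assumption made for illustration only: a slot-filling task in which each user turn is lost with probability `p_loss`, misunderstood with probability `p_mis`, and otherwise filled correctly, with an optional confirmation strategy that costs one extra turn. The slides name the quantities (slot loss rate, misunderstanding rate, transaction success rate) but not the simulation model.

```python
import random


def simulate_dialogue(n_slots, p_loss, p_mis, confirm, max_turns, rng):
    """Return True if every slot is correctly filled within max_turns."""
    filled = 0
    turns = 0
    while filled < n_slots and turns < max_turns:
        turns += 1
        r = rng.random()
        if r < p_loss:
            continue              # slot lost: the system has to ask again
        if r < p_loss + p_mis:
            if not confirm:
                return False      # wrong value accepted: transaction fails
            turns += 1            # confirmation turn exposes the error
            continue
        if confirm:
            turns += 1            # confirmation turn for a correct value
        filled += 1
    return filled == n_slots


def transaction_success_rate(n_trials=2000, seed=0, **settings):
    """Estimate the transaction success rate over simulated dialogues."""
    rng = random.Random(seed)
    wins = sum(simulate_dialogue(rng=rng, **settings)
               for _ in range(n_trials))
    return wins / n_trials


for confirm in (False, True):
    rate = transaction_success_rate(n_slots=3, p_loss=0.1, p_mis=0.1,
                                    confirm=confirm, max_turns=8)
    print(f"confirm={confirm}: transaction success rate = {rate:.3f}")
```

Sweeping `p_loss`, `p_mis`, and the strategy flag in this way gives curves of the kind the slide's figure labels suggest, before any real system is deployed.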
21
Demo: Understanding and Organization of Chinese Broadcast News with Interactive Interface
[Figure: system block diagram built around the Chinese Broadcast News Archive. Component labels: Named Entity Extraction, Segmentation, Topic Analysis and Organization, Summarization, Title Generation, Information Retrieval, Two-dimensional Tree Structure for Organized Topics. Interface labels: Input Query, User Instructions, retrieval results, titles, summaries.]
24
Future Directions
Information Navigation across Multimedia/Spoken Documents
– Fast-growing quantities of multimedia/spoken documents are much more difficult to browse than text documents
– Better approaches to navigate across huge quantities of multimedia/spoken documents using comprehensive presentation (e.g. topic taxonomy)
Cross-language Information Processing Technologies
– Reducing language barriers in a future multilingual world
– Seeking international collaboration and resource exchange
– Collaboration between the two major non-English languages may be a good direction
Knowledge Discovery and Web Mining
– The Web offers live, dynamic, and by far the most complete global knowledge that human beings have
– Better approaches to explore Web resources and enhance language processing technologies