
DESCRIPTION

We describe details of our runs and the results obtained for the "IR for Spoken Documents (SpokenDoc) Task" at NTCIR-9. The focus of our participation in this task was the investigation of the use of segmentation methods to divide the manual and ASR transcripts into topically coherent segments. The underlying assumption of this approach is that these segments will capture passages in the transcript relevant to the query. Our experiments investigate the use of two lexical-coherence-based segmentation algorithms (TextTiling, C99). These are run on the provided manual and ASR transcripts, and on the ASR transcript with stop words removed. Evaluation of the results shows that TextTiling consistently performs better than C99, both in segmenting the data into retrieval units as evaluated using the centre-located relevant information metric, and in having higher content precision in each automatically created segment.


DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task

Maria Eskevich, Gareth J.F. Jones

Centre for Digital Video Processing
Centre for Next Generation Localisation
School of Computing
Dublin City University
Ireland

December 8, 2011


Outline

Retrieval Methodology
  Transcript Preprocessing
  Text Segmentation
  Retrieval Setup

Results
  Official Metrics
  uMAP
  pwMAP
  fMAP

Conclusions


Retrieval Methodology

[Pipeline diagram: Transcript → Segmentation → Topically Coherent Segments → Indexing → Indexed Segments; Information Request → Queries; Indexed Segments + Queries → Retrieval → Retrieval Results]


6 Retrieval Runs

Transcript                            Segmentation   Run name
1-best ASR                            C99            ASR C99
1-best ASR                            TextTiling     ASR TT
1-best ASR with stop words removed    C99            ASR NSW C99
1-best ASR with stop words removed    TextTiling     ASR NSW TT
Manual Transcript                     C99            Manual C99
Manual Transcript                     TextTiling     Manual TT


Transcript Preprocessing

- Recognize individual morphemes of the sentences: ChaSen 2.4.0, based on the Japanese morphological analyzer JUMAN 2.0 with ipadic grammar 2.7.0

- Form the text out of the base forms of the words in order to avoid stemming

- Remove the stop words (SpeedBlog Japanese Stop-words) for one of the runs (a sketch of this pipeline follows below)
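
As a rough illustration (not part of the original slides): a minimal Python sketch of this pipeline, where analyze_morphemes is a hypothetical stand-in for the ChaSen call and the stop-word set is supplied by the caller:

```python
def analyze_morphemes(sentence: str) -> list[tuple[str, str]]:
    """Hypothetical stand-in for a ChaSen 2.4.0 invocation; a real run
    would call the analyzer and parse its (surface form, base form)
    output. Here whitespace tokens double as their own base forms."""
    return [(tok, tok) for tok in sentence.split()]

def preprocess(sentence: str, stopwords: frozenset = frozenset()) -> list[str]:
    """Reduce a sentence to morpheme base forms (so no stemming is
    needed), dropping stop words for the 'ASR NSW' style runs."""
    base_forms = [base for _surface, base in analyze_morphemes(sentence)]
    return [b for b in base_forms if b not in stopwords]
```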


Text Segmentation

Use of algorithms originally developed for text; individual IPUs are treated as sentences:

- TextTiling: cosine similarities between adjacent blocks of sentences (see the sketch below)

- C99:
  - Compute similarity between sentences using a cosine similarity measure to form a similarity matrix
  - Cosine scores are replaced by the rank of the score in the local region
  - Segmentation points are assigned using a clustering procedure
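
A minimal sketch of the TextTiling-style block comparison, assuming whitespace-tokenized IPUs. Real TextTiling places boundaries at depth-score minima of the similarity curve, so the fixed threshold here is a simplification for illustration only:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def block_boundaries(ipus: list[str], block: int = 3,
                     threshold: float = 0.1) -> list[int]:
    """Propose a boundary before IPU index g whenever the blocks of
    IPUs on either side of the gap look lexically dissimilar."""
    bows = [Counter(ipu.split()) for ipu in ipus]
    boundaries = []
    for g in range(block, len(bows) - block + 1):
        left = sum(bows[g - block:g], Counter())
        right = sum(bows[g:g + block], Counter())
        if cosine(left, right) < threshold:
            boundaries.append(g)
    return boundaries
```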


Retrieval Setup

SMART information retrieval system extended to use language modelling with a uniform document prior probability.

A query q is scored against a document d within the SMART framework in the following way:

P(q|d) = \prod_{i=1}^{n} \left( \lambda_i P(q_i \mid d) + (1 - \lambda_i) P(q_i) \right)

where:

- q = (q_1, ..., q_n) is a query comprising n query terms,
- P(q_i | d) is the probability of generating the i-th query term from a given document d, estimated by maximum likelihood,
- P(q_i) is the probability of generating it from the collection, estimated by document frequency.

The retrieval model used λ_i = 0.3 for all q_i, this value having been optimized on the TREC-8 ad hoc dataset.
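
A minimal sketch of this scoring function, assuming whitespace-tokenized segments and a precomputed document-frequency table (the actual runs used the extended SMART system); log-space is used to avoid underflow of the product:

```python
import math
from collections import Counter

def lm_score(query_terms, doc_tokens, doc_freq, n_docs, lam=0.3):
    """log P(q|d) under the smoothed unigram model
    P(q|d) = prod_i ( lam * P(q_i|d) + (1 - lam) * P(q_i) ),
    with P(q_i|d) the maximum-likelihood estimate from the segment and
    P(q_i) estimated from document frequency; lam = 0.3 for all terms."""
    tf, dlen = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for t in query_terms:
        p_doc = tf[t] / dlen if dlen else 0.0
        p_coll = doc_freq.get(t, 0) / n_docs
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in segment and collection
        score += math.log(p)
    return score
```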


Results: Official Metrics

Transcript type   Segmentation type   uMAP     pwMAP    fMAP
BASELINE          -                   0.0670   0.0520   0.0536
manual            TT                  0.0859   0.0429   0.0500
manual            C99                 0.0713   0.0209   0.0168
ASR               TT                  0.0490   0.0329   0.0308
ASR               C99                 0.0469   0.0166   0.0123
ASR nsw           TT                  0.0312   0.0141   0.0174
ASR nsw           C99                 0.0316   0.0138   0.0120

- Only runs on the manual transcript had higher scores than the baseline, and only on the uMAP metric

- TextTiling results are consistently higher than C99 on all the metrics for the manual and ASR runs


Time-based Results Assessment Approach

For each run and each query:

[Diagram: a ranked list of retrieved passages, some containing relevant content; a Precision value is computed for each such passage and averaged per query over the ranks whose passages contain relevant content]

where:

Precision = Length of the Relevant Part / Length of the Whole Passage
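
A minimal sketch of this per-query average, assuming each retrieved passage is reduced to a (relevant seconds, total seconds) pair; this data layout is an assumption, not the official evaluation code:

```python
def average_precision_per_query(passages: list[tuple[float, float]]) -> float:
    """Average Length-of-Relevant-Part / Length-of-Whole-Passage over
    the ranks whose passages contain any relevant content."""
    scores = [rel / total for rel, total in passages if rel > 0 and total > 0]
    return sum(scores) / len(scores) if scores else 0.0

# Example: three retrieved passages, two containing relevant speech:
# (12/20 + 5/25) / 2 = 0.4
print(average_precision_per_query([(12.0, 20.0), (0.0, 15.0), (5.0, 25.0)]))
```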


Average of Precision for all passages with relevant content

- The TextTiling algorithm has a higher average of precision for all types of transcript, i.e. topically coherent segments are better located


Results: utterance-based MAP (uMAP)

Transcript type   Segmentation type   uMAP
BASELINE          -                   0.0670
manual            TT                  0.0859
manual            C99                 0.0713
ASR               TT                  0.0490
ASR               C99                 0.0469
ASR nsw           TT                  0.0312
ASR nsw           C99                 0.0316

- The trend 'manual > ASR > ASR nsw' seen for both C99 and TextTiling is not confirmed by the averages of precision

- The higher average precision of the TextTiling segmentation over C99 is not reflected in the uMAP scores

- For some of the queries, runs on the C99 segmentation rank the segments with relevant content better (a MAP sketch follows below)
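
For reference, a minimal sketch of mean average precision over ranked units with binary relevance; the official uMAP first maps retrieved passages onto their utterances (IPUs), which this sketch does not reproduce:

```python
def mean_average_precision(runs: dict) -> float:
    """runs maps a query id to (ranked list of unit ids, set of relevant
    unit ids). AP rewards placing relevant units early in the ranking,
    which is why a run can win on MAP despite lower per-segment precision."""
    ap_values = []
    for ranked, relevant in runs.values():
        hits, precisions = 0, []
        for rank, unit in enumerate(ranked, start=1):
            if unit in relevant:
                hits += 1
                precisions.append(hits / rank)
        ap_values.append(sum(precisions) / len(relevant) if relevant else 0.0)
    return sum(ap_values) / len(ap_values) if ap_values else 0.0
```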


Relevance of the Central IPU Assessment

[Charts: the number of ranks taken or not taken into account by pwMAP, and the Average of Precision for the passages at those ranks]

- TextTiling has a higher number of segments whose central IPU is relevant to the query

- Overall, the number of ranks at which a segment with relevant content is retrieved is approximately the same for both segmentation techniques


Results: pointwise MAP (pwMAP)

Transcript type   Segmentation type   pwMAP
BASELINE          -                   0.0520
manual            TT                  0.0429
manual            C99                 0.0209
ASR               TT                  0.0329
ASR               C99                 0.0166
ASR nsw           TT                  0.0141
ASR nsw           C99                 0.0138

- TextTiling segmentation places better topic boundaries around relevant content and has higher precision scores for the retrieved relevant passages


Average Length of Relevant Part and Segments (in seconds)

[Charts: average lengths for the cases where the center IPU is relevant and where it is not]

- Center IPU is relevant: the average length of the relevant content is of the same order for both segmentation schemes, slightly higher for TextTiling

- Center IPU is not relevant: the average length of the relevant content is higher for C99 segmentation; due to the poor segmentation it correlates with much longer segments


Results: fraction MAP (fMAP)

Transcript type   Segmentation type   fMAP
BASELINE          -                   0.0536
manual            TT                  0.0500
manual            C99                 0.0168
ASR               TT                  0.0308
ASR               C99                 0.0123
ASR nsw           TT                  0.0174
ASR nsw           C99                 0.0120

- The average number of ranks with segments having a non-relevant center IPU is more than 5 times higher than the number with a relevant one

- The segmentation technique with longer, poorly segmented passages (C99) has much lower precision-based scores


Conclusions

- TextTiling segmentation shows better overall retrieval performance than C99:
  - Higher numbers of segments with higher precision
  - Higher precision even for the segments with a non-relevant center IPU
  - The high level of poor segmentation makes it harder to retrieve relevant content for the C99 runs

- Removal of stop words before segmentation did not have any positive effect on the results


Thank you for your attention!

Questions?