EARS STT Workshop at ICASSP, March 2005

Christopher Cieri, Mohamed Maamouri, Shudong Huang, James Fiumara, Stephanie Strassel, David Graff, Kevin Walker, Mark Liberman

{ccieri,maamouri,shudong,jfiumara,strassel,graff,walkerk,[email protected]}

What Happens Next?

• Collect feedback here
• Check feasibility of new ideas
  – e.g. availability of BN (tran)scripts
• Estimate cost, timeline for wish list
• Sponsors allocate funds
• EARS Board revise priorities
• Re-estimate cost, timeline for task list
• Communicate final plan
• “Start”

What Happened Next?

• Feedback was generally favorable
• Next day learned of 3-month projects
• Received 25% funding
• Preparation of utility thresholds
• Learned of TIDES/EARS end
• Learned that GALE <> TIDES+EARS
• Completed existing commitments
  – STT Test Sets (MT Test Set)
  – CTS Collections
• Adjusted focus to GALE preparation

Broadcast News

• Continue 2004 collection
  – >2000h English: VOA, NBC/MSNBC, CNN, ABC, PBS, PRI, WB17
  – >1000h Chinese: VOA, CCTV, Radio Free Asia (RFA), NTDTV, Tai Yuan
  – >1000h Arabic: VOA, Al Hurra, Al Jazeera, Dubai, Jordan TV, LBC, Nile
• Select 2005 evaluation set then distribute 2004 data (February 2005)
  – delivery made after eval set picked
• 2005 Collection same sources, volumes
  – add semi-automatic language, source, program ID to QC process
  – harvest (tran)scripts where possible
  – 100 hours of transcribed Chinese BN (commercial, QTr)
  – 100 hours of transcribed Arabic BN (commercial, QTr)
  – collect broadcast conversations: audio and (tran)scripts
• Continue IPR negotiations
• Contribute to Experiments
  – Utility of Careful vs. Commercial vs. QTr. vs. CC. vs. Roverized ASR
• Update pronouncing Lexicons with vocab from English, Chinese, Arabic
• Continue collection with sources adjusted for GALE
  – Greater focus on broadcast conversation
  – Total: 62.5 hrs/week of Arabic, 60 hrs/week of Chinese, 75 hrs/week of English
  – BC: 2.5 hrs/week Arabic, 15 hrs/week Chinese, 25 hrs/week English
  – Acquired IPR for several new programs: 100% English, 50% of Arabic, Chinese
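
The weekly figures above imply a very different broadcast conversation (BC) share per language. A minimal sketch of that arithmetic, assuming the BC hours are counted inside the weekly totals (the slide does not say so explicitly); the names `weekly_total`, `weekly_bc`, and `bc_share` are illustrative only:

```python
# Weekly collection rates in hours, copied from the slide.
weekly_total = {"Arabic": 62.5, "Chinese": 60.0, "English": 75.0}
weekly_bc = {"Arabic": 2.5, "Chinese": 15.0, "English": 25.0}


def bc_share(language: str) -> float:
    """Return broadcast conversation hours as a fraction of the weekly total.

    Assumption: the BC hours are a subset of the weekly total.
    """
    return weekly_bc[language] / weekly_total[language]


for language in weekly_total:
    print(f"{language}: {bc_share(language):.1%} broadcast conversation")
```

Under that assumption the BC share is 4% for Arabic, 25% for Chinese, and about a third for English.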

English CTS

• Volume: complement 2003 collection to provide another 1400 hours (was 850) with subjects making 1-20 10-minute calls
• Used November 2003 Topics
• BBNT/WordWave doing transcription
• Complete collection of 1400 hours
• Finalize evaluation set
• Distribute beginning in December as transcripts are ready
• 1400 hours sent to BBN/WordWave for transcription
• 450 hours distributed to sites February 17

Chinese CTS

• New Collection at HKUST
  – Target 200 hours transcribed, gender balance, regions represented
• Transcription based upon RT03
• 150 hours delivered to LDC so far
  – regions not balanced across delivery increments
• Select 2005 evaluation & dev/test sets
  – to control demographics across train/test sets
• Deliver training data once final increment has arrived and evaluation data extracted
• Repeat collection in 2005
  – require gender, age, regional balance across collection epoch
  – require word segmentation?
• Build portable platform?
• HKUST finished Collection of 150 hours of CTS
  – ready for release once test set extracted
  – will deliver 50 more hours at end of March
  – will collect & transcribe another 50 hours through June

Arabic CTS

• Fisher Protocol, platform in US
• Select 2005 evaluation set from current collection
• Continue collection until current pool sapped
• Complete audit and transcription; deliver in December
• Add ‘yellow’ tier (surface phonemic) transcription
• Build portable platform? Begin new dialect?
• Demographics changed since last test sets created
  – new Dev/Test as well as Eval set required
• Finished 50 hours of Levantine Arabic CTS
• Released on 01/15/2005 as LDC2005S07 & LDC2005T03
• 50 more hours of Levantine due March 31, 2005
• 85 hours scheduled June 30, 2005 ???
• Yellow layer transcription of 15h underway
• RT rates improving: 8-10xRT on green, 15xRT yellow (assuming green)
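
The real-time (RT) rates above can be turned into a rough effort estimate. This is only a sketch: it assumes "NxRT" means N annotator hours per hour of audio, and it applies the rates to the 50-hour and 15-hour batches named on the slide, which the slide itself does not do. The function name `annotator_hours` is illustrative.

```python
def annotator_hours(audio_hours: float, rt_factor: float) -> float:
    """Estimate annotator effort, assuming N xRT = N work hours per audio hour."""
    return audio_hours * rt_factor


# Green-tier transcription of a 50-hour batch at 8-10xRT.
green_low = annotator_hours(50, 8)
green_high = annotator_hours(50, 10)

# Yellow-layer transcription of 15 hours at 15xRT, on top of existing green.
yellow = annotator_hours(15, 15)

print(f"green tier, 50h audio: {green_low:.0f}-{green_high:.0f} annotator hours")
print(f"yellow layer, 15h audio: {yellow:.0f} annotator hours")
```

Under these assumptions a 50-hour green-tier batch costs roughly 400 to 500 annotator hours, and the 15-hour yellow layer roughly 225 more.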

STT Test Sets

• None

MDE

• Ported English specification v6.2 to Chinese, Arabic

• Created MDE v7 specification, tool for English

• Created Chinese and Arabic tools

• Created small pilot data set in each language

• Distributed as: LDC2004E47

GALE Preparation

• Created 13 new Fisher English topics designed to elicit ACE-worthy conversations
• Collected 500 conversations; manually selected 25% for transcription. ACE transcribed; are in ACE annotation pipeline
• LDC Staff read DLI DLPT material in Arabic
• LDC Staff read WSJ articles
• In preparation for GALE, adding new source types
  – e-lists, blogs, chat, technical reports, GovDocs
• Built general purpose speech annotation toolkit; ready April 1.

Distribution Rules

• Most EARS sites are LDC members
• Those who are not have data under evaluation agreement
  – Require return at end of program
  – LDC will offer extension; sites not part of GALE by June 2005 must return data then
  – Or non-members, non-GALE sites can keep data by becoming LDC members
• Exception: drive arrays of BN data. These must be returned by both members and non-members not involved in GALE
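
One possible reading of the rules above, written as a small decision function. This is a hedged interpretation, not LDC policy text: it assumes "not involved in GALE" applies to both members and non-members in the drive-array exception, and that non-member sites that do join GALE keep their data under the offered extension. The function and parameter names are hypothetical.

```python
def must_return(is_ldc_member: bool, in_gale: bool, is_bn_drive_array: bool) -> bool:
    """Return True if a site must send the data back (by June 2005).

    Assumptions (not stated outright on the slide):
    - the BN drive-array exception applies to any site outside GALE,
      whether or not it is an LDC member;
    - non-members inside GALE keep data under the extension.
    """
    if is_bn_drive_array:
        return not in_gale
    if is_ldc_member:
        return False
    return not in_gale


# A non-member outside GALE has to return ordinary data...
print(must_return(is_ldc_member=False, in_gale=False, is_bn_drive_array=False))
# ...but keeps it after becoming an LDC member.
print(must_return(is_ldc_member=True, in_gale=False, is_bn_drive_array=False))
# BN drive arrays go back even from members, if they are outside GALE.
print(must_return(is_ldc_member=True, in_gale=False, is_bn_drive_array=True))
```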

GALE-related efforts

• Data scouting in English, Chinese, Arabic
  – Exploring new domains
    • Broadcast conversation (roundtable, talk shows, call-ins)
    • Web text (blogs, newsgroups, chat, discussion forums)
  – Defining best practices
    • Identifying, Harvesting, Formatting, Licensing
  – Researching more economical sources, methods
    • Transcripts, story segmentation
    • Annotation efficiencies
• Local infrastructure in place
  – Annotation toolkit
  – Annotation guidelines & web resources guide
  – Scouting teams for English, Chinese
    • Arabic lagging
• Sharable version of tools, docs in progress
• To date,
  – English: 270 sites identified (16 topics)
  – Chinese: 57 sites identified (10 topics)
  – Arabic: 10 sites identified (3 topics)
  – All of these now/soon in ACE annotation pipeline
  – IPR secured under “fair use”

Documentation

Process

• Use search engine to find sites for each type
  – Minimum thresholds for each data type/subject
• Tool tallies good/bad sites identified; logs URLs/judgments to DB
• Categorize URLs as good or bad for TIDES-type annotation
  – “Bad” URLs are not revisited for a topic

The left side of the web scouting tool shows a tally of the data types found for the annotator’s topic.

The bottom pane of the tool is a window where the annotator inputs information, including data type, title, and URL, for each site that he finds.

The top pane of the tool is occupied by a web browser.
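
The bookkeeping described on this slide (log each judgment, tally good sites by data type, skip URLs already judged bad for a topic) can be sketched as follows. This is not the LDC tool: the table layout, the function names, and the use of an in-memory SQLite database are all assumptions made for illustration.

```python
import sqlite3


def open_log() -> sqlite3.Connection:
    """Create an in-memory judgment log (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE judgments ("
        "topic TEXT, url TEXT, data_type TEXT, title TEXT, good INTEGER, "
        "PRIMARY KEY (topic, url))"
    )
    return conn


def record(conn, topic, url, data_type, title, good):
    """Log one annotator judgment; a later judgment replaces an earlier one."""
    conn.execute(
        "INSERT OR REPLACE INTO judgments VALUES (?, ?, ?, ?, ?)",
        (topic, url, data_type, title, int(good)),
    )


def should_visit(conn, topic, url):
    """A URL judged bad for a topic is not revisited for that topic."""
    row = conn.execute(
        "SELECT good FROM judgments WHERE topic = ? AND url = ?", (topic, url)
    ).fetchone()
    return row is None or row[0] == 1


def tally(conn, topic):
    """Count good sites per data type for one topic."""
    rows = conn.execute(
        "SELECT data_type, COUNT(*) FROM judgments "
        "WHERE topic = ? AND good = 1 GROUP BY data_type",
        (topic,),
    ).fetchall()
    return dict(rows)


conn = open_log()
record(conn, "elections", "http://example.org/a", "blog", "Site A", True)
record(conn, "elections", "http://example.org/b", "chat", "Site B", False)
print(tally(conn, "elections"))
print(should_visit(conn, "elections", "http://example.org/b"))
```

Keying the log on the (topic, URL) pair means a URL rejected for one topic can still be considered for another, which matches the "for a topic" wording on the slide.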

Up-to-minute updates

http://www.ldc.upenn.edu/Projects/GALE/Annotation/DataScouting/status.php