The Games Corpus: Design, implementation and annotation
Agustín Gravano (agus@cs.columbia.edu)
Spoken Language Processing Group, Columbia University



The Games Corpus

1. Design and Implementation

2. Annotation



Experiment Design

Goal: Study the relation between the down-stepped contour and
• Information status
• Syntactic position
• Discourse position

Spontaneous speech
• Both monologue and dialogue


Experiment Design

• Three computer games.
• Two players, each on a different computer.
• They collaborate to perform a common task.
• Totally unrestricted speech.


Cards Game #1

[Figure: Player 1 (Describer) and Player 2 (Searcher)]

• Short monologues
• Vary frequency and order of occurrence of objects on the cards.


Cards Game #2

[Figure: Player 1 (Describer) and Player 2 (Searcher)]

• Dialogue
• Vary frequency and order of occurrence of objects on the cards.


Objects Game

[Figure: Player 1 (Describer) and Player 2 (Searcher)]

• Dialogue
• Vary target and surrounding objects (subject and object position).


Games Session

• Repeat 3 times:
  • Cards Game #1
  • Cards Game #2
• Short break (optional)
• Repeat 3 times:
  • Objects Game

Each subject participated in 2 sessions. 12 sessions in total.


Subjects

Postings:
• Columbia’s webpage for temporary job ads.
• Craigslist (http://www.craigslist.org). Category: Gigs, Event gigs.

Problem: People are unreliable. ~50% did not show up, or cancelled on short notice.


Subjects

Possible solutions:
• Give precise instructions to e-mail ALL required info: name, native speaker?, hearing impairments?, etc.
• Ask for a phone number. Call them and explain why it is so important for us that they show up (or cancel with adequate notice).
• Increase the pay after each session. Example: $5, $10, $15 instead of $10, $10, $10.


Recording

Sound-proof booth
• 2 subjects + 1 or 2 confederates.
• Head-mounted mics.
• Digital Audio Tape (DAT): one channel per speaker.

Wav files
• One mono file per speaker.
• Sample rate: 48000 Hz, downsampled to 16000 Hz (but kept original files!)
• ~20 hours of speech
• 2.8 GB (16k)
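As a rough sanity check on these figures, the size of uncompressed audio can be estimated from duration and sample rate. The sketch below assumes 16-bit mono PCM and 1 GB = 10^9 bytes; neither is stated on the slide, and the function names are made up for illustration.

```python
def pcm_bytes(hours, sample_rate, bytes_per_sample=2, channels=1):
    """Size in bytes of uncompressed PCM audio (WAV headers ignored)."""
    return int(hours * 3600 * sample_rate * bytes_per_sample * channels)


def pcm_hours(size_bytes, sample_rate, bytes_per_sample=2, channels=1):
    """Duration in hours of uncompressed PCM audio of the given size."""
    return size_bytes / (sample_rate * bytes_per_sample * channels) / 3600


# 20 hours at 16 kHz, assuming 16-bit mono samples.
size_16k = pcm_bytes(20, 16000)          # 2_304_000_000 bytes, about 2.3 GB
# The 48 kHz originals take three times as much space.
size_48k = pcm_bytes(20, 48000)
# Conversely, 2.8 GB at 16 kHz corresponds to roughly 24.3 hours of audio.
hours_in_2_8_gb = pcm_hours(2.8e9, 16000)
```

Under these assumptions 2.8 GB holds somewhat more than 20 hours of audio. One possible explanation is that each speaker's file also contains the stretches where that speaker is silent, but that is a guess rather than something the slides state.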


Logs

Log everything the subjects do to a text file. Example:

17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
...

Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
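A minimal sketch of how such a log could be split into smaller units. It assumes that each NEXT_TURN opens a task and the following RESULTS line closes it; the slides do not define the event semantics, so treat that pairing and the function names as illustrative.

```python
from datetime import timedelta


def parse_log_line(line):
    """Split 'HH:MM:SS:mmm EVENT [details]' into (time, event, details)."""
    stamp, _, rest = line.strip().partition(" ")
    h, m, s, ms = (int(x) for x in stamp.split(":"))
    event, _, details = rest.partition(" ")
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms), event, details


def split_into_turns(lines):
    """Pair each NEXT_TURN with the next RESULTS line (assumed semantics)."""
    turns, start = [], None
    for line in lines:
        if not line.strip():
            continue
        t, event, details = parse_log_line(line)
        if event == "NEXT_TURN":
            start = t
        elif event == "RESULTS" and start is not None:
            turns.append({
                "start": start,
                "end": t,
                "seconds": (t - start).total_seconds(),
                "details": details,
            })
            start = None
    return turns


LOG = """\
17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
"""

turns = split_into_turns(LOG.splitlines())
```

The timestamps are wall-clock times, so mapping a turn onto the audio would additionally need the clock time at which the recording started, which is not part of this log excerpt.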


The Games Corpus

1. Design and Implementation

2. Annotation


Speech Processing Tools

Praat http://www.praat.org

WaveSurfer http://www.speech.kth.se/wavesurfer

Transcriber http://trans.sourceforge.net


Orthographic Tier - Method 1

Problems
• Very stressful
• Time consuming

Separate transcription from alignment.



Orthographic Tier - Method 2

1. Transcribe chunks using a web interface.

2. Align each chunk automatically.

3. Concatenate all chunks.

4. Correct the alignment by hand using Praat, WaveSurfer or similar.
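Step 3 amounts to shifting each chunk's word times by the chunk's start time within the session. A minimal sketch, assuming each chunk's alignment is available as (start, end, word) tuples relative to the chunk and that the chunk offsets are known; the data layout and names are hypothetical.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk alignments into one session-level word list.

    chunks: list of (chunk_start_seconds, [(start, end, word), ...]),
    where word times are relative to the beginning of the chunk.
    """
    merged = []
    for chunk_start, labels in sorted(chunks, key=lambda c: c[0]):
        for start, end, word in labels:
            # Round to milliseconds to avoid floating-point noise in the output.
            merged.append((round(chunk_start + start, 3),
                           round(chunk_start + end, 3),
                           word))
    return merged


chunks = [
    (0.0, [(0.10, 0.45, "okay"), (0.45, 0.90, "next")]),
    (5.0, [(0.20, 0.60, "the"), (0.60, 1.10, "lion")]),
]
session = concatenate_chunks(chunks)
```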


Orthographic Tier - Method 2

Advantages
• Transcription task is very comfortable.
• Most of the alignment task is done automatically. Only fine-grained hand corrections are needed.

Problems
• Overhead: chunking, automatic alignment, concatenation.
• Error prone! It is easy for humans to overlook errors in the automatic alignment.


Orthographic Tier - Method 3

1. Transcribe the whole file, using:
   • a regular audio player (e.g., Windows Media Player), and
   • a regular plain-text editor (e.g., Notepad).

2. Use WaveSurfer to align the words.
   • “Load text labels” function
   • Check out: spectrogram settings, customizable shortcuts


Orthographic Tier

Transcription guidelines
• capital letters
• abbreviations
• disfluencies
• mmhm, uhhuh, gotcha, etc.

Alignment guidelines
• boundaries

http://www.cs.columbia.edu/~agus/games
username/password = speech/lions


Too many cooks…

Concurrency problem

File-locking webpage
• Annotators lock a file before working on it, and release it when done.
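The slides describe a webpage for this; the same lock/release idea can be sketched on a shared file system with lock files. This is only an illustration under that assumption, not the project's actual mechanism, and the annotator names are placeholders.

```python
import os
import tempfile


def acquire_lock(path, annotator):
    """Try to lock `path` for `annotator`; return True on success."""
    lock_path = path + ".lock"
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists,
        # so only one annotator can create it.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(annotator)
    return True


def release_lock(path, annotator):
    """Release the lock only if `annotator` is the one holding it."""
    lock_path = path + ".lock"
    try:
        with open(lock_path) as f:
            owner = f.read()
    except FileNotFoundError:
        return False
    if owner != annotator:
        return False
    os.remove(lock_path)
    return True


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "s08.cards.3.B.words")
first = acquire_lock(target, "alice")          # succeeds
second = acquire_lock(target, "bob")           # refused: alice holds the lock
released_by_bob = release_lock(target, "bob")  # refused: not the owner
released_by_alice = release_lock(target, "alice")
third = acquire_lock(target, "bob")            # succeeds after the release
```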


Annotation: Cue Words

• okay, mmhm, uhhuh, right, etc.
• Acknowledgment, Backchannel, Segment Beginning, Segment End, etc.
• Developed an ad-hoc application in Java. Bad idea!!! Development took too long.
• Instead, use Praat (or another general-purpose tool).
  • For simple, specific tasks, Praat is not difficult to learn.
  • Create a file with empty points at the middle point of the words that need to be labeled.
  • Annotators only label those words, safely ignoring the rest.
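A minimal sketch of the "empty points" step: given aligned words as (start, end, word) tuples, emit one unlabeled point at the midpoint of every cue word. The cue-word set below only contains the examples named on the slide, and the output uses simple "Time Value" lines; converting those to a Praat file would be a separate step that is not shown here.

```python
# Only the cue words named on the slide; the real list is presumably longer.
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}


def empty_points(words, targets=CUE_WORDS):
    """Return [(midpoint, "")] for every word that needs a label."""
    return [(round((start + end) / 2, 3), "")
            for start, end, word in words
            if word.lower() in targets]


words = [
    (0.00, 0.40, "okay"),
    (0.40, 0.70, "the"),
    (0.70, 1.30, "lion"),
    (1.50, 1.90, "right"),
]
points = empty_points(words)
# One "Time Value" line per point, with the value left empty for the annotator.
lines = [f"{t:.3f} {label}".rstrip() for t, label in points]
```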


Other Annotations

Turn switches
• Smooth switches, interruptions, backchannels, etc.
• The labeler received a Praat file with empty turns.

Prosody
• ToBI Labeling Conventions: Tones and Break Indices.

Questions
• Identification, form and function.


Guidelines for Guidelines

• Web based (password protected)
• Highlight recent changes
• Avoid long lists: categorize, use trees.


Files

games/data/session_NN/sNN.GAME.P.Y.ext
• NN = 01..12
• GAME = {cards, objects}
• P = 0..3 if GAME=cards, 0..1 if GAME=objects
• Y = {A, B}
• ext = {wav, words, tones, breaks, misc, turns, …}


Files

Examples:

games/data/session_08/
  s08.cards.3.B.wav
  s08.cards.3.B.words
  s08.cards.3.B.misc
  …
  s08.objects.1.A.wav
  s08.objects.1.A.words
  s08.objects.1.A.misc
  …

games/data/session_11/…
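A regular naming scheme like this makes file names easy to parse and validate. A minimal sketch using the ranges given for the sNN.GAME.P.Y.ext pattern; the extension is accepted as any word, since the list of extensions on the slide is open-ended, and the function name is hypothetical.

```python
import re

NAME_RE = re.compile(
    r"^s(?P<session>\d{2})\.(?P<game>cards|objects)\."
    r"(?P<part>\d)\.(?P<speaker>[AB])\.(?P<ext>\w+)$"
)
# Highest allowed value of P for each game, as given in the naming scheme.
MAX_PART = {"cards": 3, "objects": 1}


def parse_name(filename):
    """Return the fields of a corpus file name, or None if it is invalid."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    info = m.groupdict()
    session, part = int(info["session"]), int(info["part"])
    if not 1 <= session <= 12 or part > MAX_PART[info["game"]]:
        return None
    info["session"], info["part"] = session, part
    return info


ok = parse_name("s08.cards.3.B.wav")
bad_part = parse_name("s08.objects.2.A.wav")    # P out of range for objects
bad_session = parse_name("s13.cards.0.A.wav")   # session out of range
```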


Files Format

All files (except *.wav) are saved as plain text, in the WaveSurfer format:
• Start End Value (for interval tiers)
• Time Value (for point tiers)

Advantages
• Human-readable.
• Very easy to process.

Problems
• Consistency
• Rounding
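A minimal sketch of reading both tier types and running one possible consistency check. The slides do not spell out what the consistency and rounding problems were; the check below (reversed or overlapping intervals, with a small tolerance for rounded boundaries) is an assumed interpretation, and the function names are hypothetical.

```python
def parse_tier(text, kind):
    """Parse a tier; kind is 'interval' (Start End Value) or 'point' (Time Value)."""
    n_times = 2 if kind == "interval" else 1
    rows = []
    for line in text.splitlines():
        fields = line.split(None, n_times)
        if not fields:
            continue  # skip blank lines
        if len(fields) < n_times:
            raise ValueError(f"malformed line: {line!r}")
        times = tuple(float(x) for x in fields[:n_times])
        value = fields[n_times].strip() if len(fields) > n_times else ""
        rows.append(times + (value,))
    return rows


def check_intervals(intervals, tolerance=0.0005):
    """Report reversed intervals and overlaps larger than the tolerance."""
    problems, prev_end = [], None
    for start, end, value in intervals:
        if end <= start:
            problems.append(f"empty or reversed interval: {start} {end} {value}")
        if prev_end is not None and start < prev_end - tolerance:
            problems.append(f"overlaps previous interval: {start} {end} {value}")
        prev_end = end
    return problems


words = parse_tier("0.000 0.400 okay\n0.400 0.700 the\n0.650 1.300 lion\n", "interval")
tones = parse_tier("0.200 H*\n1.700 L-L%\n", "point")
problems = check_intervals(words)
```

The tier kind is passed explicitly because a line such as `1.23 4 p` is ambiguous on its own: it could be an interval from 1.23 to 4 labeled `p`, or a point at 1.23 labeled `4 p`.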


Files Format

Other formats:

XML
• General-purpose mark-up language.
• <TAG attribute="value"> … </TAG>
• Solves problems like consistency and rounding.
• Not human-readable, harder to process.

Praat
• Not human-readable, hard to process.
• Also has the consistency problem.


Scripts

So far, we have needed dozens of Perl scripts. Examples:

• Convert between Praat and WaveSurfer formats.
• Create a Praat file with empty cue-word (CW) labels, turns, etc.
• Find typos, missing labels, and other errors.
• Unify notation (e.g., “mm-hmm” → “mmhm”).
• Check consistency of files.
• …
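The original scripts were written in Perl and are not shown. As an illustration of the "unify notation" task, here is a minimal Python sketch that rewrites variant spellings in a words tier of "Start End Value" lines. Only the "mm-hmm" to "mmhm" pair comes from the slides; the other entry in the table is a hypothetical addition.

```python
# Variant spelling -> canonical spelling.
# "mm-hmm" -> "mmhm" is from the slides; "uh-huh" -> "uhhuh" is an assumed example.
CANONICAL = {"mm-hmm": "mmhm", "uh-huh": "uhhuh"}


def unify(word):
    """Return the canonical spelling of a word, or the word unchanged."""
    return CANONICAL.get(word.lower(), word)


def unify_words_tier(lines):
    """Rewrite the Value field of 'Start End Value' lines; count the changes."""
    out, changed = [], 0
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) == 3:
            old = fields[2].strip()
            new = unify(old)
            if new != old:
                changed += 1
            line = f"{fields[0]} {fields[1]} {new}"
        out.append(line)
    return out, changed


fixed, n_changed = unify_words_tier([
    "0.000 0.400 mm-hmm",
    "0.400 0.900 lion",
])
```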


Back-up!

• Back up wav files only once (too heavy), in different places (DVD, 3+ computers).
• Back up everything else (plain text: light) periodically and automatically. Configure “cron” to make a backup copy every 8 hours.
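A minimal sketch of the periodic part: archive every non-wav file into a timestamped tarball, and let cron run the script. The directory layout, archive name, script name and crontab paths are all placeholders, not the project's actual setup.

```python
import os
import tarfile
import tempfile
import time


def backup_text_files(corpus_dir, backup_dir):
    """Archive every non-wav file under corpus_dir into a timestamped .tar.gz.

    backup_dir should live outside corpus_dir so archives are not re-archived.
    """
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(backup_dir, f"games-{stamp}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(corpus_dir):
            for name in sorted(files):
                if name.lower().endswith(".wav"):
                    continue  # audio is backed up once, separately
                full = os.path.join(root, name)
                tar.add(full, arcname=os.path.relpath(full, corpus_dir))
    return archive_path


# Hypothetical crontab entry to run this every 8 hours (paths are placeholders):
#   0 */8 * * * python3 /path/to/backup_games.py

# Small demonstration on a throwaway directory.
corpus = tempfile.mkdtemp()
session_dir = os.path.join(corpus, "session_08")
os.makedirs(session_dir)
for name in ("s08.cards.3.B.wav", "s08.cards.3.B.words"):
    with open(os.path.join(session_dir, name), "w") as f:
        f.write("demo\n")

archive = backup_text_files(corpus, tempfile.mkdtemp())
with tarfile.open(archive) as tar:
    archived = tar.getnames()
```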


Timeline

Orthographic tier first!

[Figure: timeline along a time axis, showing design + implementation, orthographic tier, cue words, prosody (ToBI), and turn switches]
