2
TalkBank
• Brian MacWhinney
– Carnegie Mellon University, Psychology
– Child Language Data Exchange System (CHILDES)
• Steven Bird, Mark Liberman
– University of Pennsylvania, Linguistics
– Linguistic Data Consortium (LDC)
• Howard Wactlar
– Carnegie Mellon University, Computer Science
– Informedia Project
3
Basic Premise of TalkBank
• Human Communication is a unified fact,
• but it is studied by 8 disciplines and up to 40 subdisciplines.
• Analysis is important, but so is synthesis.
• We can put the puzzle back together by focusing all the disciplines on the data.
4
Some Examples
• “My Theory”
• Bettino Craxi
• Nixon’s Watergate Tapes
• MacWhinney’s Lectures
• Ross and Mark
• Graphics lesson
• Bilingual Classroom
5
My Theory: An Example
Special Issue of Discourse Processes edited by Tim Koschmann, with articles from
• Rogers Hall
• Jay Lemke
• Annemarie Palincsar
• Carl Frederiksen
• Commentary by
– Judith Green & Marleen McClelland
– Jeremy Roschelle
6
TalkBank Areas
• Classroom Discourse - CMU Dec 99
• Conversation Analysis - Odense Oct
• Text and Discourse - Santa Barbara July
• Child Language Disorders - Madison 2002
• Language and Gesture - CMU October
• Child Language Learning - Madison Aug 2002
• Animal Communication - Penn May 2000
7
More areas ...
• Field Linguistics - LSA Dec 99, Penn Dec 2000
• Aphasia
• Corpus Linguistics
• Signed Language
• Second Language Learning
• Anthropological Linguistics
• Cross-cultural studies
8
More areas ...
• Multilingualism, code-switching - LIDES
• Mother-infant interaction
• Psychiatry
• Conflict Resolution
• Management Styles
• Small-group Interaction - soon
• Human-computer Interaction
10
Why data-sharing is important
• Increasing the size and reliability of the empirical basis
• Opening science to the community, practitioners, and students
• Opening science to collaborative commentary
• Creating transparency across disciplines
11
Key Features of TalkBank
• Multimodal digitized data
• Internet access
• Defense of confidentiality
• Codon: transcription, coding, viewing, and analysis
• XML standard for underlying representation
• Alliance of databases from many fields
12
Why TalkBank can be built now
• The Internet
• Fast computers, big disks, cheap storage
• Good audio and video digitization
• Advances in web-based database design
• Emergence of annotation standards
• Maturation of the social sciences
13
CHILDES: A Prototype
• Brian MacWhinney - CMU
• Leonid Spektor - CMU
• Catherine Snow - Harvard
• 2000 Members
• 400 Active contributors
14
1850-1950 Darwin and Diaries
• Darwin, Stern, Ament
• Emotion, gesture, language, the soul
• Card files and shoe boxes
15
1950-1984 Tapes
• Nagras and TEAC, VHS and Beta
• Dittos, mimeo, notes in the margins
• Good “raw” data, unclear transcription
19
Universals
• Are there basic patterns to babbling?
• Are early word orders universal?
• Does UG give children a universal set of functional categories?
• Is the vocabulary spurt universal?
The answer requires LOTS of data
20
Particulars
• Do children have individual styles?
– Gestalt vs. Analytic
– Enactive (1S) vs. Depictive (3S)
• Do children respond differentially to parental recasts?
• Do children vary in their match to cue validity?
Again, we need LOTS of data
21
Comparisons
• How should we match SLI children to normal controls: by MLU? Morphology? TTR?
• How should we compare language socialization processes across social classes? Between cultures?
• How should we compare the course of development across languages? The case of Romance.
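The two matching measures named above, MLU (mean length of utterance) and TTR (type-token ratio), can be sketched as follows. This is only a minimal illustration under stated assumptions: MLU is counted here in words rather than morphemes, tokens are split on whitespace, and the function names `mlu_words` and `type_token_ratio` are hypothetical, not CLAN commands.

```python
def mlu_words(utterances):
    """Mean length of utterance, counted in words (a simplification:
    MLU is conventionally counted in morphemes)."""
    lengths = [len(u.split()) for u in utterances if u.split()]
    if not lengths:
        return 0.0
    return sum(lengths) / len(lengths)


def type_token_ratio(utterances):
    """Number of distinct word forms divided by total word tokens."""
    tokens = [w.lower() for u in utterances for w in u.split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical child utterances with punctuation and codes already removed.
child = ["yeah", "yeah", "okay", "look at this"]
print(mlu_words(child))         # 6 words over 4 utterances
print(type_token_ratio(child))  # 5 types over 6 tokens
```

Both measures depend on sample size, which is one reason matching on either alone is open to question.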
23
CHAT Format
@Begin
@Participants: CHI Target_Child Sid, MOT Mother
*MOT: you want them to go in there?
*CHI: yeah. [+ Q]
*CHI: yeah. [+ SR]
*MOT: okay.
*CHI: okay. [+ I]
*CHI: look at this.
%act: CHI picks up piece of paper
@End
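A rough sketch of how the line types in this sample could be read by a program: `@` lines are headers, `*` lines are speaker utterances, `%` lines are dependent tiers attached to the preceding utterance, and `[+ ...]` marks a postcode. The function `parse_chat` and the dictionary layout are assumptions made for illustration; this is not how CLAN itself is implemented, and it ignores most of the CHAT format.

```python
import re

SAMPLE = """\
@Begin
@Participants: CHI Target_Child Sid, MOT Mother
*MOT: you want them to go in there?
*CHI: yeah. [+ Q]
*CHI: yeah. [+ SR]
*MOT: okay.
*CHI: okay. [+ I]
*CHI: look at this.
%act: CHI picks up piece of paper
@End
"""

# Postcodes look like "[+ Q]" at the end of an utterance.
POSTCODE = re.compile(r"\[\+ ([^\]]+)\]")


def parse_chat(text):
    """Split a CHAT-style transcript into headers and utterances."""
    headers = []
    utterances = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line, e.g. "@Participants: ..." or bare "@Begin".
            name, _, value = line[1:].partition(":")
            headers.append((name, value.strip()))
        elif line.startswith("*"):
            # Main tier: speaker code, then the utterance text.
            speaker, _, content = line[1:].partition(":")
            utterances.append({
                "speaker": speaker,
                "text": POSTCODE.sub("", content).strip(),
                "postcodes": POSTCODE.findall(content),
                "tiers": {},
            })
        elif line.startswith("%") and utterances:
            # Dependent tier: belongs to the most recent utterance.
            tier, _, value = line[1:].partition(":")
            utterances[-1]["tiers"][tier] = value.strip()
    return headers, utterances


headers, utterances = parse_chat(SAMPLE)
for u in utterances:
    print(u["speaker"], u["text"], u["postcodes"], u["tiers"])
```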
28
Phonology
• MakeMod
• ModRep
• PhonFreq
• UniCode
• Inventory (in progress, LIPP, CompProf)
• Process Analysis (in progress)
30
The Database
• English - 25 corpora
• Non-English - 18 languages
• Clinical - 14 corpora, aphasia, SLI, Down, autism, Williams, and other groups
• Narrative - Frog stories, Red Balloon
• Childhood Bilingualism
• Adult Second Language Learning
31
Morphology
• MOR
• Post, PostTrain -- Christophe Parisse
• Parse -- Kenji Sagae
• --> revised DSS, LARSP, IPSyn
• MinMor for 14 languages
• MaxMor for English, Spanish, Italian, Hungarian, Dutch, German
32
New Technologies
• Sonic CHAT
• Bullets
• QuickTime Movies
• Sound editor by wave
• Movie editor by dragging
• Fast mode editing
• Web streaming of audio and video
33
Sample Topics
• Past tense debate
• Functional categories, tenseless verbs
• Verb frame generalization
• Fine-tuning of the input
• Theory of mind
• Lexical range and communicative context
• MLU and vocabulary growth in disorders
34
Research based on CHILDES
• Over 1200 published studies
• Syntax
• Morphology
• Discourse
• Lexicon
• Narrative, Literacy
• Language Impairments
• Phonology
35
Allied Efforts
• JCHAT, Chinese, Korean
• Dutch, Nordic, Celtic
• Romance (Italian, Spanish, Portuguese)
• Slavic (Krakow, Vienna)
• Bilingualism -- Catalan, Basque
• Frogs, Disorders, Code-switching
• Classroom discourse
38
Format Babel
Alembic Annotator Archivage CA CHAT
COCOSDA CSAE CSLU DAISY DAMSL
Delta DRI EAGLES Emu Festival
FSA’s GATE HIAT Hyperlex Intex
ISIP LDC MATE MICASE MPEG
MPI Multitext Observer Partitur Praat
SABLE SAMPA SGREP SignStream SIL
SLAM SMDL SNACK StandOff SUSANNE
TalkBank TEI Tipster Transcriber TreeBank
TSNLP Unicode UTF
50
Confidentiality Levels
1 - fully public
2 - copying block
3 - transcripts public, audio/video protected
4 - non-disclosure
5 - non-disclosure, no copying
6 - data-viewing with approval
7 - data-viewing under direct supervision
8 - archived only
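One way the eight levels might be represented in software is as an ordered enumeration. The sketch below assumes that a higher number means a more restrictive level, which the numbering suggests but the slide does not state; the class name `Confidentiality`, the member names, and the helper `strictest` are all hypothetical.

```python
from enum import IntEnum


class Confidentiality(IntEnum):
    """The eight levels from the slide; member names are illustrative."""
    FULLY_PUBLIC = 1
    COPYING_BLOCK = 2
    TRANSCRIPTS_PUBLIC_MEDIA_PROTECTED = 3
    NON_DISCLOSURE = 4
    NON_DISCLOSURE_NO_COPYING = 5
    VIEWING_WITH_APPROVAL = 6
    VIEWING_UNDER_SUPERVISION = 7
    ARCHIVED_ONLY = 8


def strictest(levels):
    """Return the most restrictive level in a collection.

    Assumes higher numbers are more restrictive (an assumption, see above).
    """
    return max(levels)


# A corpus mixing sessions at different levels would take the strictest one.
sessions = [Confidentiality.FULLY_PUBLIC, Confidentiality.NON_DISCLOSURE]
print(strictest(sessions).name)
```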
51
Conclusions
• Child Language has guided other fields, but now we need to link to these other fields.
• CLAN must give way to more international tools and distributed databases.
• Number counting will give way to reality-linked number counting.
• Lab-based research will have to open up to collaborative annotation.