What's Happened Since the First SIGDAT Meeting?
Kenneth Ward Church
AT&T Labs-Research
The First SIGDAT Meeting
• WVLC-1 was held just before ACL-93
• Great turnout!
– More like a conference than a workshop
• We knew that corpora were “hot,”
– but didn't appreciate just how hot they would turn out to be.
Sister meetings have also done very well since 1993
• Information Retrieval
– http://www.acm.org/sigir/
• Digital Libraries
– http://fox.cs.vt.edu/DL99/
• Machine Learning
– http://www.cs.cmu.edu/Web/Groups/NIPS
• Data-mining, Databases, Data Warehousing
– http://www.acm.org/sigkdd/
– http://www.vldb.org/
Empiricism has a long history
• In the 1950’s, empiricism dominated a broad set of fields:
– from psychology (behaviorism)
– to electrical engineering (information theory).
• At the time, it was common practice in linguistics to classify words not only on the basis of their meanings
– but also on the basis of their co-occurrence with other words.
– “You shall know a word by the company it keeps” (Firth, 1957)
• Regrettably, interest in empiricism faded in the 1960’s:
– Chomsky's criticism of ngrams in Syntactic Structures (1957) and
– Minsky and Papert's criticism of neural nets in Perceptrons (1969).
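The idea of classifying words by their co-occurrence with other words can be illustrated with a minimal sketch. This is not a method taken from the talk: the function name `cooccurrence_counts`, the `window` parameter, and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count, for each word, the words seen within `window` positions of it.

    Hypothetical helper for illustration; `window` is an assumed parameter.
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                counts[word][tokens[j]] += 1
    return counts


# Toy input (an assumption, not data from the talk).
tokens = "strong tea and strong coffee".split()
company = cooccurrence_counts(tokens, window=1)
print(company["strong"].most_common())
```

Words whose neighbour counts look alike keep similar "company"; on a real corpus the raw counts would normally be replaced by an association score before comparing words.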
1990’s Revival
• Empiricism regained a dominant position:
– Ngrams and Hidden Markov Models (HMMs) became the method of choice in Speech.
– Neural Networks (Perceptrons + Hidden Layers) helped create Machine Learning.
• Empiricism → Rationalism → Empiricism
– Oscillates about once a career
• Mark Twain: Grandparents and Grandchildren have a natural alliance.
Why the Revival?
“It was a bad idea then, and it is still a bad idea now”
• More powerful computers??
• Availability of massive quantities of data!!
– Text is available like never before.
– Not long ago, the Brown Corpus was considered large.
– But now, text is available like never before!
• First came collection efforts (www.ldc.upenn.org),
• And now everyone has access to the Web!
• Experiments are routinely carried out on gigabytes of text.
• Some researchers are even working with terabytes.
Big Changes Since 1993
• The Web, stupid!
– Demos
– Data
• Research:
– Shared resources + evaluation
– Scale: How large is very large?
– Increased breadth: Geography, Topics
• Commercial: Wall Street & Main Street
The Web, Stupid!
• If you publish a paper about neat stuff, it is expected that you will post it on the web.
• I’ll mention just a few examples of neat stuff on the web.
– Demos
– Data
– Tools
Lots of Neat Demos on the Web
• Web Searching with Machine Translation
– www.altavista.com (uses Systran)
• Cross-Language Information Retrieval (CLIR):
– www.xrce.xerox.com
• Parallel Corpora: www-rali.iro.umontreal.ca
• Latent Semantic Indexing (LSI)
– superbook.bellcore.com/~remde/lsi
– lsa.colorado.edu
• Speech Synthesis: www.bell-labs.com/project/tts
• Dotplot: www.cs.unm.edu/~jon/dotplot
Lots of Neat Data on the Web
• Wordnet: www.cogsci.princeton.edu/~wn
• Linguistic Data Consortium (LDC):
– www.ldc.upenn.org
• SIGLEX: www.clres.com/siglex.html
• Discourse Resource Initiative (DRI)
– www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html
• The Federalist Papers:
– www.mcs.net/~knautzr/fed
More Neat Data on the Web (in Lots of Languages)
• Chinese:
– rocling.iis.sinica.edu.tw
– www.sinica.edu.tw
• Japanese: cl.aist-nara.ac.jp/lab/resource/resource.html
– Electronic Dictionary Research (EDR): www.iijnet.or.jp/edr
– Advanced Telecommunications Research (ATR): www.atr.co.jp
– www.rdt.monash.edu.au/~jwb/japanese.html
• Korean: korterm.kaist.ac.kr
• European Language Resources Association (ELRA)
– www.icp.grenet.fr/ELRA
• Parallel Text (Resnik, ACL-99)
– Canadian Hansards: WWW.Parl.GC.CA
– Turkish: www.nlp.cs.bilkent.edu.tr
– Swedish: svenska.gu.se
Lots of Neat Tools on the Web
• Penntools (links to all over the world)
– www.cis.upenn.edu/~adwait/penntools.html
• Part of Speech Taggers (see above)
• Juman/Chasen
– pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html
– cl.aist-nara.ac.jp/lab/nlt/chasen.html
• Suffix Arrays
– http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
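For readers unfamiliar with suffix arrays, here is a minimal sketch of the data structure in Python. It is not the ssort.c program linked above, and the names `build_suffix_array` and `count_occurrences` are hypothetical. A suffix array lists the start position of every suffix of a corpus in sorted order, so all occurrences of a phrase sit in one contiguous block that two binary searches can find.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, in sorted order.

    Naive construction that compares whole suffixes: fine for a toy corpus,
    far too slow for the corpus sizes discussed in this talk.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens, suffix_array, phrase):
    """Count how often `phrase` (a list of tokens) occurs in `tokens`."""
    n = len(phrase)

    def prefix(k):
        # First n tokens of the k-th smallest suffix.
        start = suffix_array[k]
        return tokens[start:start + n]

    # Binary search for the first suffix whose prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for the first suffix whose prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


# Toy corpus (an assumption, not data from the talk).
tokens = "to be or not to be".split()
sa = build_suffix_array(tokens)
print(sa, count_occurrences(tokens, sa, ["to", "be"]))
```

Once the array is built, counting a phrase of any length takes a logarithmic number of comparisons, which is what makes the structure attractive for ngram statistics over large corpora.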
Big Changes Since 1993
• The Web, stupid!
– Demos
– Data
• Research:
– Shared resources + evaluation
– Scale: How large is very large?
– Increased breadth: Geography, Topics
• Commercial: Wall Street & Main Street
Shared Resources + Evaluation
• Common tasks:
– Trec (trec.nist.gov), Tipster, MUC
• Common benchmark corpora: Brown, Penn Treebank, Wall Street Journal, Switchboard
• Shared lexical resources: Wordnet (www.cogsci.princeton.edu/~wn/)
• Common labeling conventions/standards in all areas of NLP from Speech to Discourse
• Evaluation, evaluation, evaluation
– Required to get a paper accepted anywhere.
In 1993, it wasn’t like this...
• Invited talks at ACL-93
– “Planning Multimodal Discourse”
– “Transfers of Meaning”
– “Quantificational Domains and Recursive Contexts”
• Less sharing of resources
• Evaluation not required
Empiricism vs. Rationalism
• Pluses: Clear measurable progress
– Speech Recognition
– Part of Speech Tagging
– Parsing
• Minuses: Herd mentality, incrementalism, mindless metrics, duplicated effort
– Recall: empiricism fell out of favor in 1960s when methodology became too burdensome.
Big Changes Since 1993
• The Web, stupid!
– Demos
– Data
• Research:
– Shared resources + evaluation
– Scale: How large is very large?
– Increased breadth: Geography, Topics
• Commercial: Wall Street & Main Street
Main Street: Big change since 1993
• Large corpora are now having an impact on ordinary users:
– Web search engines/portals
– Managing gigabytes, not just a popular book, but something that ordinary users are beginning to take for granted.
Huge Commercial Successes (Since 1993)
• Information Retrieval & Digital Libraries
– Web search engines/portals: highly successful on both Wall Street and Main Street
• Invited talks from Lycos (1997) & Infoseek (1998)
• Machine Translation & Speech
– Available wherever software is sold
– Can’t use a phone without talking to a computer
Big Changes Since 1993
• The Web, stupid!
– Demos
– Data
• Research:
– Shared resources + evaluation
– Scale: How large is very large?
– Increased breadth: Geography, Topics
• Commercial: Wall Street & Main Street
How Large is Very Large?
Year    Source                   Size (words)
1788    Federalist Papers        1/5 million
1982    Brown Corpus             1 million
1987    Birmingham Corpus        20 million
1988-   Associated Press (AP)    50 million (per year)
1993    MUC, TREC, Tipster
Mirror, mirror on the wall
• Who is the largest of them all?
– The Web?
– Lexis-Nexis?
– West?
• We have had invited talks from all three
– Web: Lycos (1997) & Infoseek (1998)
– Lexis-Nexis (1993)
– West (1997)
Big Changes Since 1993
• The Web, stupid!
– Demos
– Data
• Research:
– Shared resources + evaluation
– Scale: How large is very large?
– Increased breadth: Geography, Topics
• Commercial: Wall Street & Main Street
Internationalization
• SIGDAT-93: Nearly equal participation
– America: 4 papers
– Asia: 4 papers
– Europe: 3 papers
• Great growth in activity around the world, especially Asia
• SIGDAT has met in a dozen cities (50% in America)
– America: Columbus, Cambridge, Philadelphia, Providence, Montreal, College Park
– Asia: Kyoto, Beijing, Hong Kong
– Europe: Dublin, Copenhagen, Granada
Some Topics that are Behind the International Expansion
• Classic Issues
– Machine Translation (MT) / Tools
– Input Method Editor (IME): MS-IME98
– Morphology: Juman, Chasen
• New Issues
– Cross-language Information Retrieval (CLIR)
– Browsing the Internet: integrate IME + CLIR + MT
– Parallel and comparable corpora
– Terminology Extraction & Alignment
– Suffix Arrays
Big Changes Since 1993
• The Web, stupid!
– Demos
– Data
• Research:
– Shared resources + evaluation
– Scale: How large is very large?
– Increased breadth: Geography, Topics
• Commercial: Wall Street & Main Street
Broader (and More Applied) View of Computational Linguistics
• Data-mining, Databases, Data Warehousing
• Digital Libraries
• Information Retrieval, Categorization, Extraction
• Lexicography
• Machine Learning
• Machine Translation
• Speech
• Text Analysis
Data-Mining Issues (How Large is Very Large?)
• Similar technology to corpus-based methods
• But much larger datasets
– Newswire (AP): 1 million words per week
– Telephone calls: 1-10 billion per month
– IP packets: expected to be even larger
• Tasks: Fraud, Marketing, Operations, Care
– Identify knobs that business partners can turn
• Increase demand (buy TV ads, reduce price)
• Increase supply (buy network capacity, enhance operations)
– Target opportunities for improvement (marketing prospects)
– Track market response in real time (supply/demand by knob)
Best of SIGDAT
• Best Invited Talk
• Work of Note
• Work of Note (in Related Fields)
Best Invited Talk at a SIGDAT Meeting
• Henry Kučera and Nelson Francis
– Third Workshop on Very Large Corpora (1995)
– Massachusetts Institute of Technology (MIT)
– Cambridge, MA, USA
• Described their work on the Brown Corpus
– At a time when empiricism was out of fashion
– especially at MIT
– Personal & Touching (received standing ovation)
Work of Note
• Statistical Machine Translation / Alignment
– Brown et al.
• Statistical Parsing (In 1993, poor use of lexical info)
– Jelinek, Magerman, Charniak, Collins
• Statistical PP Attachment
– Hindle and Rooth
• Word-sense Disambiguation
– Yarowsky
• Text-tiling (Discourse Parsing)
– Hearst
Work of Note (in Related Fields)
• Learning
– Classification and Regression Trees (CART)
– Ripper
• Web Tools
– Managing Gigabytes, Harvest, SGML XML
• Representation
– Suffix Arrays
– Latent Semantic Indexing
Summary: Reaching a Wider Audience
• Commercial Successes
– Main Street & Wall Street
• Internationalization
– Goal: equal rep from America, Asia & Europe
• More topic areas
– Information Retrieval, Speech, Machine Translation, Machine Learning, Data-mining
Self-organizing vs. EDA
• Self-organizing: Learning, HMM
– Statistics do it all
• Manual
– Wilks’ Stone Soup: Statistics don’t do nothing
• Exploratory Data Analysis (EDA)
– Hybrid of above
Time for a little controversy: Two types of Empiricism
• New Linguistic Insights vs. Methodology
• Reviewers do what reviewers do
– Safe, conservative, seek precedents, case law
– Reviewers go easy on methodology papers
• Grim historical reminder:
– Recall: empiricism fell out of favor in 1960s when methodology became too burdensome.
• Shouldn’t let the methodology get in the way of what we are here to do.