ABRAPT Mini-curso 30.08.04

The Corpógrafo: Theory and Practice

Belinda Maia & Luís Sarmento

PoloFLUP

LINGUATECA

A bit of history

• PALC ’97 – 'Do-it-yourself corpora ... with a little bit of help from your friends!'

• CULT 1998 - ‘Making corpora – a learning process’

Contrastive linguistics, Corpora linguistics, Translation teaching

General > specific language

A bit of history

• 2000 – First Master’s in Terminology and Translation at FLUP

• PALC 2001 - ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’

Specialized translation and terminology

Contact with domain experts

Importance of IT

Need for technical help for more ambitious students!

A bit of history

• LREC 2002 - ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’

• 2002 – Second Master’s in Terminology and Translation at FLUP

Plea for help to Diana Santos

October 2002

LINGUATECA - Polo FLUP

LINGUATECA

• See http://www.linguateca.pt

• Leader > Diana Santos (SINTEF – Oslo)

• Objective - to create resources and tools for the computational processing of Portuguese

• Poles at Oslo, Lisbon, Braga and Porto

• Porto – Polo CLUP/FLUP

Polo CLUP/FLUP

• See http://www.linguateca.pt/poloclup/
• On-line suite of corpora tools to work with comparable corpora, with emphasis on bilingual research
– Focus on special domains
– Construction of terminology databases, ontologies and domain models

Corpógrafo

Polo CLUP/FLUP

• See http://www.linguateca.pt/poloclup/

• General help in constructing resources specific to the needs of FLUP/CLUP
– For researchers, teachers and students
– For teaching methodology at FLUP

BNC & Reuter’s corpora on intranet

A small ‘chat’ corpus

More history

• 2003 – Poster of the GC at CL2003
• 2003 – ‘What are comparable corpora?’ at CL2003
• 2003 – Experimentation with evaluation of Machine Translation
• 2003 – Experimentation with GC
• 2003 – Third Master’s in Terminology and Translation at FLUP

GC – Integrated Web Environment for Corpora Linguistics

Motivation

• Lack of comprehensive, wide-scope corpora tools
• Commercial packages are usually difficult to integrate/customize
• Tools are not prepared to support cooperative work
• Linguistic knowledge is not usually integrated in tools

What is GC?

GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for corpora-based linguistic research. GC allows users to:

• access several corpora tools from a single entry point using a regular web browser
• access and query generic corpora (BNC, Reuter’s, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal corpora (statistics, POS-taggers, filters, etc.)
• communicate and exchange results with other users

Internet Integration

GC provides seamless integration with the World Wide Web, allowing users to:

• search specific corpora resources on the Internet
• query the web for concordances
• use available translation engines in parallel

[Diagram labels: input formats (DOC, HTML, TXT, PS, PDF, RTF); generic corpora (BNC, CETEMPúblico, COMPARA, others); Personal Corpora; Virtual Desktop; Custom Interface for each role (USER, ADM, DEV); Inter-user Communication; Tool Pool; Terminology DB; Internet]

Administrator’s Tasks:

• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics

Corpora Taxonomy

• Medium: written, spoken, multimedia
• Domain: engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.

Tool Pool

• Concordance Engine
• Taggers
• Aligner (Semi-Auto)
• Corpora Bot
• Statistics
• Custom Tools

Terminology Extraction Tool (Auto/Semi-Auto)

Developer’s Tasks:

• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs

Inter-User Communication

• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources

Teacher’s Tasks:

• Provide on-line tutorials
• Provide links to:
  • on-line teaching material
  • bibliography and other resources

And then...

• PoloCLUP’s 3rd function: evaluation of Machine Translation
– Experimentation with evaluation
– Teaching + research focus
• Results:
– TrAva – MT evaluation tool
– CorTA – Corpus of 1 EN input + 4 MT output sentences

Prescriptive v descriptive terminology

• Paper > digital form

• Static > dynamic resources

• ‘Democratization’ of terminology

• ISO standards > socioterminology

• Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you ….

Perspectives of terminology users

• Domain experts and vested interests
• Translators
• Information retrieval
• Knowledge engineering

Standardized terminology
Getting the right word
Finding information
Perfecting Google
Structuring knowledge
Finding it fast

Bridging the Gap

• General linguists
• Translation teachers
• Translation students
• Corpus linguists
• Computational linguists
• Computer engineers

Computer-phobia

Computer-worship

The Corpógrafo combines:

• Terminology, translation and language study and research (Belinda)
• Terminology databases (Domain experts)
• Computational linguistics research and production of resources (Diana)
• Information retrieval and artificial intelligence (Luís)

= Discussions on priorities!

Corpora and Terminology

• Corpora as input

• Terminology extraction

• Terminology databases

• Structuring of domain knowledge

• Further corpora

[Diagram labels: Internet; Corpora; Corpora Analysis; Terminology Database; Text details]

Working with the Corpógrafo

• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research done ONLINE
• Each username/password = separate space on our server
• At present > anyone can work with it using 10 MB space for FREE
• BUT - you get an empty space + tools + tutorial!

Terminology: old v new

• Prescriptive > descriptive
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you ….

Focus of Corpógrafo

• Design priorities are to:
– See the Big Picture
– Create the Overall Framework
– Get feedback from users to see their needs
– Develop according to real research needs
– Fill in the details and improve techniques as needed

Corpógrafo and special domains

• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Technology – GPS (Global Positioning System)

Corpógrafo and terminology/translation research

• Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, definition searches, semantic relations, conceptual analysis
– Corpora – text analysis, corpora construction
– Technical writing > Electrical Appliances
– Localization
– Terminology in documentaries
– Translation of Multimedia

Linguateca

• Linguateca’s policy - all resources and tools freely available online

• Primary users - Portuguese and Brazilian

Polo CLUP/FLUP

• Bi- or multi-lingual in interest

• Corpógrafo available for experiments on a small scale to the general public

• Possibilities of future work on projects with users from other universities and other countries

Corpógrafo

1. File Manager - area where each individual or group can:
– convert various text formats to .txt
– upload texts to their space on server
– ‘clean’ them of unnecessary material
– check tokenization and sentence divisions
– consult wordlists – alphabetical, frequency etc
– group texts into corpora
– register full information on source, domain and text type
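
The tokenization, sentence-division and wordlist steps listed above can be illustrated with a minimal Python sketch. The slides do not describe how the Corpógrafo implements these steps, so the function names, the regular expressions and the sample text here are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical helpers: the Corpógrafo's own tokenizer is not described in the slides.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*")


def tokenize(text):
    """Lower-case the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def split_sentences(text):
    """Naive sentence division: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def wordlists(text):
    """Return (alphabetical wordlist, frequency wordlist) for one text."""
    counts = Counter(tokenize(text))
    alphabetical = sorted(counts)
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


# Invented sample text, used only to exercise the helpers.
sample = "Corrosion damages steel. Coatings slow corrosion."
alphabetical, by_frequency = wordlists(sample)
print(split_sentences(sample))
print(alphabetical)
print(by_frequency)
```

A real tool would need language-specific rules (abbreviations, clitics, hyphenation), which is why the slide lists checking tokenization and sentence divisions as a manual step.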

Corpógrafo

2. Corpora analysis area:
– Concordancing tools allowing for
  • KWIC concordancing
  • KWIC concordancing sorted according to the word to the left or right
– N-gram tool
  • N-grams
  • Term-candidates
    – With filters for PT
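
As a rough illustration of the two tool families named above, the sketch below builds a KWIC (keyword in context) concordance that can be sorted on the word to the left or right, and counts n-grams to propose term candidates. It is not the Corpógrafo's code: the function names, the tiny English stoplist (standing in for the Portuguese filters mentioned on the slide) and the sample text are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens; sentence boundaries are ignored in this sketch."""
    return re.findall(r"\w+(?:[-']\w+)*", text.lower())


def kwic(tokens, keyword, width=3, sort_by=None):
    """Return (left context, keyword, right context) for every hit.

    sort_by="left" orders hits by the word immediately left of the keyword,
    sort_by="right" by the word immediately right; None keeps text order.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    if sort_by == "left":
        hits.sort(key=lambda h: h[0][-1] if h[0] else "")
    elif sort_by == "right":
        hits.sort(key=lambda h: h[2][0] if h[2] else "")
    return hits


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Illustrative English stoplist; a Portuguese filter would use its own list.
STOPWORDS = {"the", "of", "a", "and", "in", "is", "to"}


def term_candidates(tokens, n=2, min_freq=2):
    """N-grams seen at least min_freq times that neither start nor end with a stopword."""
    return [
        (" ".join(gram), freq)
        for gram, freq in ngrams(tokens, n).most_common()
        if freq >= min_freq and gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS
    ]


# Invented sample text, loosely based on one of the domains named in the slides.
text = ("The kidney machine filters the blood. "
        "A kidney machine is used in dialysis. "
        "The blood returns to the patient.")
tokens = tokenize(text)
for left, key, right in kwic(tokens, "machine", width=2, sort_by="right"):
    print(" ".join(left).rjust(20), key.upper(), " ".join(right))
print(term_candidates(tokens))
```

Frequency plus a stoplist is only the simplest possible filter; candidates produced this way still need review by a terminologist or domain expert.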

Corpógrafo

3. Terminology database
– Terms
– Definitions
– Examples
– Morphology
– Multilingual equivalents
– Sources and text details of corpora used
– Semantic relations – further complexity
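
One way to picture a record holding the fields listed above is a small Python data class. This is only a sketch: the actual schema of the Corpógrafo database is not given in the slides, and the class name, field names, relation label and sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TermEntry:
    """Hypothetical terminology record mirroring the fields named on the slide."""
    term: str
    language: str
    definition: str = ""
    examples: List[str] = field(default_factory=list)
    morphology: str = ""
    equivalents: Dict[str, str] = field(default_factory=dict)       # language code -> equivalent term
    sources: List[str] = field(default_factory=list)                # corpus texts the entry draws on
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation type, target term)

    def add_relation(self, relation: str, target: str) -> None:
        """Record a semantic relation from this term to another term."""
        self.relations.append((relation, target))

    def equivalent_in(self, language: str) -> str:
        """Return the equivalent in the given language, or '' if none is stored."""
        return self.equivalents.get(language, "")


# Illustrative entry; the source identifier and relation label are placeholders.
entry = TermEntry(term="corrosion", language="en", morphology="noun")
entry.equivalents["pt"] = "corrosão"
entry.sources.append("corpus text 12 (hypothetical identifier)")
entry.add_relation("has_type", "galvanic corrosion")
print(entry.equivalent_in("pt"), entry.relations)
```

Keeping relations as (type, target) pairs leaves room for the "further complexity" the slide mentions, since new relation types can be added without changing the record layout.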

Future developments – general policy

• General testing and improvement of the Corpógrafo

• Experimentation with ideas from other projects, e.g. WordNet, FrameNet

• Experimentation with theories of semantic primitives, human universals etc

• Development of new ideas or functions – using isomorphic relationships between researchers’ needs and our possibilities

Future developments – File Manager

• Creation of overall framework – perhaps UDC based – for:
– consultation of research available to public
– information on ongoing research
• Coordination of individual corpus projects into bigger projects, when possible or necessary

File Manager – theoretical questions

• Domain organization – UDC or ?
• Categorization of text by genre – how many genres?
• Reliability of texts from Internet – how does one guarantee quality?
• Is a translator or linguist able to distinguish a ‘good text’?
• Should the domain specialist choose the texts?

Corpora construction – theoretical questions / problems

• How large is a good domain corpus?

• No domain corpus will produce EVERY term in the area

• Comparable corpora v. Parallel corpora

• Aligning comparable corpora at term level

Future developments – Corpora analysis

• Development of finer-grained concordancing

• Experimentation with finding definitions in context

• Semi-automatic creation of keyword shortlists for further text retrieval

Corpora Analysis – theoretical questions

• How far can one rely on the computational linguist or computer engineer to produce analyses of corpora?

• If (semi-) automated processes produce 80% of the possible results, should the linguist / translator rubbish these processes?

• Can we leave it all to the computer engineer?

Future developments – terminology databases

• Refinement of terminology fields

• Development of further multi-lingual functions

• Development of organized and robust set of semantic relations

• Semi-automatic visualizing of semantic relations

Terminology databases – theory

• How much information does a database need?

• How much does the user of a database need?

• Is it reasonable to hope that all our databases could one day communicate with each other and help us with translation / information retrieval – or whatever?

How is the Corpógrafo being used at present?

• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Translation and Localization

How is the Corpógrafo being used at present?

• Dissertations completed on:

– Definitions for different purposes + pedagogical glossary for Corrosion, Electrical engineering http://www.fe.up.pt/~cdm/QAE/QAE_gloss_b.htm

– Socioterminology – in the area of Composite Materials

– Graphical representation of Conceptual systems

– Terminology and Metaphors

– Football Metaphors

How is the Corpógrafo being used at present?

• Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, conceptual analysis
– Corpora – text analysis, corpora construction
– Translation and localization terminology
– Technical writing > Electrical Appliances
– Terminology in documentaries

Pedagogical applications of the Corpógrafo

• Undergraduate courses – only possible if both teachers and students are trained to use it

• Postgraduate research

– Terminology and translation (Belinda + domain experts)

– Computational linguistics (Diana)

– Information retrieval (Luís)

• Long live team work!

To what extent is the Corpógrafo available to others?

• Linguateca’s policy is to make all resources and tools available online

• Primary users are expected to be Portuguese and Brazilian, as most of the resources and tools are for Portuguese

• PoloFLUP’s main objective – comparable corpora and terminology tools

To what extent is the Corpógrafo available to others?

• PoloFLUP is, by definition, bi- or multi-lingual in interest

• The Corpógrafo is therefore available for experiments on a small scale to the general public

• In the future – we hope to be able to work on projects with users from other universities and other countries

Contacts

If you are interested in finding out more, please contact me:

Belinda Maia

[email protected]

The Corpógrafo can be used

(with a username and password) at:

http://www.linguateca.pt and

http://poloclup.linguateca.pt/ferramentas/gc