ITN2008Dubrovnik2008-05-21
Language and Knowledge Technologies for News Collections in Croatia
Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr
ITN2008Dubrovnik
2008-05-21
Talk overview
- who we are?
- what are we doing?
- text collections used for research
- applicable language technologies
- applicable knowledge technologies
Who we are?
- University of Zagreb, Croatia
- two faculties in a joint mission: to build systems that develop and enable the use of language resources and tools for Croatian
Who we are 2?
Faculty of Humanities and Social Sciences
- Institute/Department of Linguistics
- Department of Information Sciences
- basic computational linguistic tasks for Croatian
  - compiling and processing large language resources: Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  - digitization of Croatian lexicographic heritage: 60+ dictionaries digitized so far
  - tagger, lemmatizer
  - chunker, parser
  - NERC system, gazetteers (e.g. Croatian (sur)names)
Who we are 3?
Faculty of Electrical Engineering and Computing
- Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab (Knowledge Technologies Laboratory)
- the group deals with
  - text preprocessing techniques for Croatian for machine learning procedures
  - dimensionality reduction and document clustering in the vector space model + visualisation
  - automatic indexing of documents
  - intelligent, language-specific and non-specific information retrieval and extraction
What are we doing?
working jointly on several research projects
- AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA); Institute of Linguistics/FFZG & ZEMRIS/FER, 2006-2008
- Computational Linguistic Models and Language Technologies for Croatian (rmjt.ffzg.hr), 2007-2011; national research programme, prof. Marko Tadić
- Sources for Croatian Heritage and Croatian European Identity, 2007-2011; national research programme, prof. Damir Boras
- CADIAL: Computer Aided Document Indexing for Accessing Legislation; joint Flemish-Croatian project, 2007-2009; prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić
What are we doing 2?
Composition of the programme RMJT
- P1: Croatian language resources and their annotation; project leader: Marko Tadić
- P2: Computational syntax of Croatian; project leader: Zdravko Dovedan
- P3: Lexical semantics in building Croatian WordNet; project leader: Ida Raffaelli
- P4: Information technology in translating Croatian and language e-learning; project leader: Sanja Seljan
- P5: Knowledge discovery in textual data; project leader: Bojana Dalbelo Bašić
participation in the FP7 project CLARIN: LR & LT as a research infrastructure for e-SSH
Text collections used for research
we have done research on different kinds of texts, but predominantly in the journalistic genre
- Croatian National Corpus (hnk.ffzg.hr)
  - 101.2 million tokens in size
  - newspaper articles: 37% (ca. 37 million tokens)
  - magazine articles: 16% (ca. 16 million tokens)
- Croatian-English Parallel Corpus
  - 3.5 million tokens from Croatia Weekly
  - newspaper articles: 100%, bilingual
- special text collections
  - database of Vjesnik articles: 2000-2003, >90,000 articles
  - Narodne novine collection: 1998-2008, >10,000 texts, >15 million tokens
  - Parallel corpus of Southeast European Times: 2007-, >25,000 articles, >4 million tokens, in 10 languages
Applicable language technologies
morphological processing
- important for inflectionally rich languages, e.g. Croatian
  - noun in 14 word-forms (7 cases, 2 numbers):
        Sg          Pl
    N:  student     studenti
    G:  studenta    studenata
    D:  studentu    studentima
    A:  studenta    studente
    V:  studente    studenti
    L:  studentu    studentima
    I:  studentom   studentima
- unlike English
  - noun in 2 (4?) word-forms (2 numbers + possessive?):
    Sg: student     Poss: (student's)
    Pl: students    Poss: (students')
- present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
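The paradigm above can be written down as data to make the case and number slots explicit. This is only an illustrative sketch: the variable names are made up here, and the vocative singular is given in its standard form "studente". It also shows that several slots share one surface string, so a single word-form can have more than one morphological analysis.

```python
from collections import defaultdict

# Paradigm of the Croatian noun "student": 7 cases x 2 numbers = 14 slots.
CASES = ["N", "G", "D", "A", "V", "L", "I"]
SINGULAR = ["student", "studenta", "studentu", "studenta",
            "studente", "studentu", "studentom"]  # V sg: standard form
PLURAL = ["studenti", "studenata", "studentima", "studente",
          "studenti", "studentima", "studentima"]

# Map every (case, number) slot to its word-form.
slots = {}
for case, sg, pl in zip(CASES, SINGULAR, PLURAL):
    slots[(case, "sg")] = sg
    slots[(case, "pl")] = pl

# Invert the mapping: one surface string may fill several slots.
analyses = defaultdict(list)
for slot, form in slots.items():
    analyses[form].append(slot)

distinct_forms = sorted(analyses)

print(len(slots), "slots,", len(distinct_forms), "distinct strings")
print("studenta ->", analyses["studenta"])
```

The 14 slots collapse to 8 distinct strings, and "studenta" is ambiguous between genitive and accusative singular.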
Applicable language technologies 2
- recognizing which lexeme(s) a word-form (WF) belongs to
  - helps us avoid the problem of data sparseness in many text processing tasks: information retrieval, text mining, document classification, document indexing
- query processing
  - search engines are not “inflectionally sensitive”
  - speakers of inflectionally rich languages use the normal/base form = lemma
  - e.g. www.google.hr input: noun in nominative singular
  - did you know that accusative and genitive are more frequent in Croatian?
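A minimal sketch of why lemma recognition matters for query processing, assuming a toy word-form lexicon and three made-up Croatian sentences (none of this comes from the authors' tools). An exact-match search for the nominative "student" misses documents that contain only inflected forms, while matching on lemmas finds them.

```python
# Toy word-form -> lemma lexicon (hypothetical; a real system would rely on
# a full morphological lexicon or a lemmatizer).
LEMMA_OF = {
    "student": "student", "studenta": "student", "studentu": "student",
    "studente": "student", "studentom": "student", "studenti": "student",
    "studenata": "student", "studentima": "student",
}

# Made-up example documents.
DOCS = {
    1: "profesor je pohvalio studenta",
    2: "student piše ispit",
    3: "predavanje je održano studentima",
}


def lemmatize(token):
    """Return the lemma of a token, or the token itself if it is unknown."""
    return LEMMA_OF.get(token, token)


def exact_search(query, docs):
    """Inflectionally insensitive search: the query string must occur as is."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if query in text.split())


def lemma_search(query, docs):
    """Compare lemmas instead of surface strings."""
    target = lemmatize(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if any(lemmatize(tok) == target for tok in text.split()))


print(exact_search("student", DOCS))
print(lemma_search("student", DOCS))
```

Here the exact search returns only document 2, while the lemma-based search returns all three documents.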
Applicable language technologies 3
Applicable language technologies 4
Applicable language technologies 5
Applicable language technologies 6
Named Entity Recognition and Classification (NERC)
- NEs introduce exact information from the outer world into the world-of-text
- they represent answers to the basic journalistic questions: who?, where?, when?, how much?
- types of NEs (according to the MUC conferences): person, organization, location, date, time, currency and measurements, percentage
- system that works for Croatian with >90% precision
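To make the NE types concrete, here is a deliberately simple sketch that combines gazetteer lookup with regular expressions. It is not the system described in the talk: the gazetteer entries, patterns, and function name are assumptions chosen for illustration, and a real NERC system needs far richer rules or a trained model.

```python
import re

# Hypothetical mini-gazetteers; real ones hold many thousands of entries.
GAZETTEER = {
    "person": ["Bojana Dalbelo Bašić", "Marko Tadić"],
    "organization": ["University of Zagreb"],
    "location": ["Dubrovnik", "Zagreb"],
}

# Simple surface patterns for two of the numeric NE types.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "percentage": re.compile(r"\d+(?:\.\d+)?\s?%"),
}


def find_entities(text):
    """Return (string, type) pairs, preferring longer matches on overlap."""
    candidates = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                candidates.append((m.start(), m.end(), label))
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), label))

    # Longest match first, so "University of Zagreb" wins over "Zagreb".
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    kept = []
    for start, end, label in candidates:
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))

    kept.sort()
    return [(text[s:e], label) for s, e, label in kept]


sentence = ("Marko Tadić of the University of Zagreb "
            "spoke in Dubrovnik on 2008-05-21.")
for entity in find_entities(sentence):
    print(entity)
```

The longest-match rule is one simple way to resolve nested candidates; other strategies (type priorities, context rules) are equally possible.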
Applicable language technologies 7
- system that works for Croatian with >90% precision
Applicable L&K technologies
Applicable language technologies 8
semantic networks as language resources
- covering the general lexicon and NEs in a language
- WordNet: words are linked by meaning
  - synonyms, antonyms, hypo-/hyperonyms, meronyms…
- realized as ontologies or taxonomies
- allow for words and/or NEs
  - synonymy/antonymy search
  - evoking upper levels in the taxonomy, e.g. activating the region/state/continent when a city is mentioned, or a company when a director is in focus
  - explicit social networking connections between NEs
- semantic processing: roles in sentences (agent, patient, instrument etc.)
- event detection: from verbal frames and scenarios
- connection with geo-data
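The idea of evoking upper levels in a taxonomy can be sketched with a small parent map. The taxonomy below is a hypothetical fragment written for this example (it is not taken from Croatian WordNet), and the function names are illustrative.

```python
# Hypothetical mini-taxonomy: each node points to its immediate upper level.
PARENT = {
    "Dubrovnik": "Dubrovnik-Neretva County",
    "Dubrovnik-Neretva County": "Croatia",
    "Zagreb": "Croatia",
    "Croatia": "Europe",
}


def upper_levels(node, parent=PARENT):
    """Walk up the taxonomy and return all ancestors, nearest first."""
    chain = []
    seen = {node}
    while node in parent:
        node = parent[node]
        if node in seen:  # guard against accidental cycles in the data
            break
        seen.add(node)
        chain.append(node)
    return chain


def expand_mentions(mentions):
    """Activate every mentioned node together with all of its upper levels."""
    activated = set(mentions)
    for mention in mentions:
        activated.update(upper_levels(mention))
    return activated


print(upper_levels("Dubrovnik"))
print(sorted(expand_mentions(["Zagreb"])))
```

With this expansion, a text that mentions only the city can still be retrieved by a query for the region, the state, or the continent.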
Applicable knowledge technologies
automatic document indexing
- eCADIS system
  - developed for Croatian legal docs
  - applicable to any document collection
  - uses machine learning techniques
  - automatically attaches keywords (descriptors) from a controlled thesaurus to a document
  - the descriptors represent a description of the document content
  - integrates corpus and document analysis
CADIS system
eCADIS system
- integrates the information from the whole document collection
- greyed n-grams are statistically relevant in the corpus, i.e. collocations
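The slides do not say which statistic eCADIS uses to decide that an n-gram is relevant. As one common possibility, the sketch below scores bigrams by pointwise mutual information (PMI); the choice of PMI, the `min_count` threshold, and the toy token sequence are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score bigrams seen at least min_count times by PMI (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Made-up token sequence in which "narodne novine" recurs.
tokens = ["narodne", "novine", "objavljuju", "zakon",
          "narodne", "novine", "su", "službeni", "list"]
for pair, score in bigram_pmi(tokens).items():
    print(pair, round(score, 2))
```

A positive score means the two words co-occur more often than their individual frequencies would predict, which is the intuition behind treating the pair as a collocation.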
eCADIS system
- automatic suggestion of relevant descriptors, hence the automatic indexing
eCADIS system
- compare it to manually attached descriptors…
Applicable knowledge technologies
automatic document classification
- uses a series of classifiers, 3500 classifiers combined
- results represented in a vector-space model
- dimensionality reduction
  - matrices could be huge (Vjesnik: 90,000 x 600,000)
- features selected
  - types
  - lemmas
  - collocations
  - NEs
  - …
- evaluated by the F1 measure (combination of precision/recall)
  - F1 > 90% in most cases
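The F1 measure is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). A small sketch for a single class follows; the function name and the document IDs are invented for the example.

```python
def precision_recall_f1(true_items, predicted_items):
    """Compute precision, recall and F1 for one class from two ID collections."""
    true_set = set(true_items)
    predicted_set = set(predicted_items)
    true_positives = len(true_set & predicted_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Documents 1-4 truly belong to the class; the classifier proposed 1, 2 and 5.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {1, 2, 5})
print(precision, recall, f1)
```

In this example P = 2/3 and R = 1/2, so F1 = 4/7, which is lower than the arithmetic mean of the two: the harmonic mean penalizes an imbalance between precision and recall.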
Applicable knowledge technologies
visualisation of classification between pages, Croatia Weekly, English side
- go = economy
- ks = culture/sport
- te = tourism/ecology
- po = politics
Applicable knowledge technologies
visualisation of classification between culture (lower right) and sport (upper left), Croatia Weekly, English side
- go = economy
- ks = culture/sport
- te = tourism/ecology
- po = politics
Applicable knowledge technologies
visualisation of classification for documents that differentiate between home policy (blue, upward) and foreign policy (blue, downward), Croatia Weekly, English side
- go = economy
- ks = culture/sport
- te = tourism/ecology
- po = politics