Language and Knowledge Technologies for News Collections in Croatia

Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
ITN2008, Dubrovnik, 2008-05-21

Talk overview
- who are we?
- what are we doing?
- text collections used for research
- applicable language technologies
- applicable knowledge technologies

Who are we?
- University of Zagreb, Croatia
- two faculties in a joint mission
  - build the systems that will develop and enable the usage of language resources and tools for Croatian

Who are we 2?
- Faculty of Humanities and Social Sciences
  - Institute/Department of Linguistics
  - Department of Information Sciences
- basic computational linguistic tasks for Croatian
  - compiling and processing large language resources
    - Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  - digitization of Croatian lexicographic heritage: 60+ dictionaries digitized so far
  - tagger, lemmatizer
  - chunker, parser
  - NERC system, gazetteers (e.g. Croatian (sur)names)

Who are we 3?
- Faculty of Electrical Engineering and Computing
  - Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
- the Knowledge Technologies Laboratory group deals with
  - text preprocessing techniques for Croatian for machine learning procedures
  - dimensionality reduction and document clustering in the vector space model + visualisation
  - automatic indexing of documents
  - intelligent, language-specific and non-specific information retrieval and extraction

What are we doing?
- working jointly on several research projects
  - AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
    - Institute of Linguistics/FFZG & ZEMRIS/FER, 2006-2008
  - Computational Linguistic Models and Language Technologies for Croatian (rmjt.ffzg.hr), 2007-2011
    - national research programme, prof. Marko Tadić
  - Sources for Croatian Heritage and Croatian European Identity, 2007-2011
    - national research programme, prof. Damir Boras
  - CADIAL: Computer Aided Document Indexing for Accessing Legislation
    - joint Flemish-Croatian project, 2007-2009
    - prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

What are we doing 2?
- composition of the programme RMJT
  - P1: Croatian language resources and their annotation
    - project leader: Marko Tadić
  - P2: Computational syntax of Croatian
    - project leader: Zdravko Dovedan
  - P3: Lexical semantics in building Croatian WordNet
    - project leader: Ida Raffaelli
  - P4: Information technology in translating Croatian and language e-learning
    - project leader: Sanja Seljan
  - P5: Knowledge discovery in textual data
    - project leader: Bojana Dalbelo Bašić
- participation in the FP7 project CLARIN
  - LR & LT as a research infrastructure for e-SSH

Text collections used for research
- we have done research on different kinds of texts, but predominantly in the journalistic genre
- Croatian National Corpus (hnk.ffzg.hr)
  - 101.2 million tokens in size
  - newspaper articles: 37% (ca. 37 million tokens)
  - magazine articles: 16% (ca. 16 million tokens)
- Croatian-English Parallel Corpus
  - 3.5 million tokens from Croatia Weekly
  - newspaper articles: 100%, bilingual
- special text collections
  - database of Vjesnik articles: 2000-2003, >90,000 articles
  - Narodne novine collection: 1998-2008, >10,000 texts, >15 million tokens
  - parallel corpus of Southeast European Times: 2007-, >25,000 articles, >4 million tokens, in 10 languages

Applicable language technologies
- morphological processing
  - important for inflectionally rich languages, e.g. a Croatian noun in 14 word-forms (7 cases, 2 numbers):

        Case  Singular    Plural
        N     student     studenti
        G     studenta    studenata
        D     studentu    studentima
        A     studenta    studente
        V     studentu    studenti
        L     studentu    studentima
        I     studentom   studentima

  - unlike an English noun in 2 (4?) word-forms (2 numbers + possessive?):

        Number  Base       Possessive
        Sg      student    (student's)
        Pl      students   (students')

  - present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
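The paradigm above can be treated as a small data structure. The following is a minimal sketch, not part of the original talk: the forms are copied from the slide as listed, and the helper names (`slots`, `distinct_forms`, `readings`) are hypothetical. It illustrates that the 14 case/number slots collapse into fewer distinct strings, so a single word-form can carry several morphological readings.

```python
# Paradigm of the Croatian noun "student", copied from the slide as listed:
# case -> (singular form, plural form)
PARADIGM = {
    "N": ("student", "studenti"),
    "G": ("studenta", "studenata"),
    "D": ("studentu", "studentima"),
    "A": ("studenta", "studente"),
    "V": ("studentu", "studenti"),
    "L": ("studentu", "studentima"),
    "I": ("studentom", "studentima"),
}


def slots(paradigm):
    """Return every (case, number, word_form) slot of a paradigm."""
    return [
        (case, number, form)
        for case, forms in paradigm.items()
        for number, form in zip(("sg", "pl"), forms)
    ]


def distinct_forms(paradigm):
    """Return the set of distinct surface strings in a paradigm."""
    return {form for _, _, form in slots(paradigm)}


def readings(paradigm, word_form):
    """Return every (case, number) reading that a surface string can have."""
    return [(case, number) for case, number, form in slots(paradigm)
            if form == word_form]


print(len(slots(PARADIGM)), "slots,", len(distinct_forms(PARADIGM)), "distinct forms")
print("studentima:", readings(PARADIGM, "studentima"))
```

With the forms as given, "studentima" is ambiguous between dative, locative and instrumental plural, which is the kind of ambiguity a tagger has to resolve in context.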

Applicable language technologies 2
- recognizing which lexeme(s) a word-form (WF) belongs to
- helps us avoid the problem of data sparseness in many text processing tasks:
  - information retrieval
  - text mining
  - document classification
  - document indexing
  - query processing
- search engines are not inflectionally sensitive
  - speakers of an inflectionally rich language use the normal/base form = lemma
  - e.g. www.google.hr input: noun in nominative singular
  - did you know that accusative and genitive are more frequent in Croatian?
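The sparseness problem can be shown with a toy count. This is a sketch under stated assumptions: the `LEMMA_OF` table is a hypothetical hand-made stand-in for a real morphological lexicon and lemmatizer, and the token list is invented for illustration. A query in the nominative singular ("student") matches nothing when terms are counted by surface form, but matches every occurrence once tokens are mapped to their lemma.

```python
from collections import Counter

# Hypothetical form -> lemma table; a real system would consult a
# morphological lexicon and disambiguate in context.
LEMMA_OF = {
    "student": "student",
    "studenta": "student",
    "studentu": "student",
    "studenti": "student",
    "studente": "student",
    "studentima": "student",
}


def lemmatize(token):
    """Map a token to its lemma, falling back to the lowercased token."""
    token = token.lower()
    return LEMMA_OF.get(token, token)


def count_terms(tokens, use_lemmas):
    """Count index terms either by surface form or by lemma."""
    if use_lemmas:
        return Counter(lemmatize(t) for t in tokens)
    return Counter(t.lower() for t in tokens)


# Invented sample: only oblique case forms occur, never the nominative.
tokens = ["studenta", "studentu", "studenti", "studenta"]
by_form = count_terms(tokens, use_lemmas=False)
by_lemma = count_terms(tokens, use_lemmas=True)

print("query 'student' by form: ", by_form["student"])
print("query 'student' by lemma:", by_lemma["student"])
```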

Applicable language technologies 3

Applicable language technologies 6
- Named Entity Recognition and Classification (NERC)
  - NEs bring exact information from the outside world into the world of text
  - they represent answers to the basic journalistic questions: who?, where?, when?, how much?
- types of NEs (according to the MUC conferences)
  - person
  - organization
  - location
  - date
  - time
  - currency and measurements
  - percentage
- system that works for Croatian with >90% precision

Applicable language technologies 7
- system that works for Croatian with >90% precision
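The slides do not describe how the Croatian NERC system works internally. As a rough illustration of the simplest ingredient mentioned earlier in the talk (gazetteers), the sketch below does a greedy longest-match lookup of token sequences in a small list. The `GAZETTEER` entries and the `tag_entities` function are hypothetical, and a real system would add context rules or statistical models on top.

```python
# Hypothetical gazetteer: lowercased token sequence -> NE type
GAZETTEER = {
    ("university", "of", "zagreb"): "organization",
    ("zagreb",): "location",
    ("dubrovnik",): "location",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_entities(tokens):
    """Greedy longest-match lookup; returns (start, end, type) spans."""
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that "University of Zagreb"
        # wins over the shorter entry "Zagreb".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


sentence = "The University of Zagreb met in Dubrovnik".split()
for start, end, ne_type in tag_entities(sentence):
    print(" ".join(sentence[start:end]), "->", ne_type)
```

Longest match first is a deliberate choice here: it keeps the organization reading from being split into a location plus leftover tokens.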

Applicable L&K technologies

Applicable language technologies 8
- semantic networks as language resources
  - covering the general lexicon and NEs in a language
  - WordNet: words are linked by meaning
    - synonyms, antonyms, hypo-/hyperonyms, meronyms
  - realized as ontologies or taxonomies
- allow for words and/or NEs
  - synonymy/antonymy search
  - evoking upper levels in a taxonomy
    - e.g. activating the region/state/continent when a city is mentioned, or a company when a director is in focus
  - explicit social networking connections between NEs
- semantic processing: roles in sentences (agent, patient, instrument etc.)
- event detection: from verbal frames and scenarios
- connection with geo-data
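The idea of evoking upper levels in a taxonomy when a city is mentioned can be sketched as a walk along parent links. This is an illustration only: the `PARENT` table and the `upper_levels` function are hypothetical and stand in for the hypernym-style relations a resource such as a WordNet or a gazetteer-backed ontology would provide.

```python
# Hypothetical parent links: entity -> the next level up in the taxonomy
PARENT = {
    "Dubrovnik": "Dubrovnik-Neretva County",
    "Dubrovnik-Neretva County": "Croatia",
    "Croatia": "Europe",
}


def upper_levels(entity, parent=PARENT):
    """Follow parent links upward and return the chain of ancestors."""
    chain = []
    seen = {entity}
    while entity in parent:
        entity = parent[entity]
        if entity in seen:  # guard against accidental cycles in the data
            break
        seen.add(entity)
        chain.append(entity)
    return chain


print(upper_levels("Dubrovnik"))
```

A document mentioning only "Dubrovnik" could then also be retrieved for queries about its region, state or continent.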

Applicable knowledge technologies
- automatic document indexing
- eCADIS system
  - developed for Croatian legal docs
  - applicable to any document collection
  - uses machine learning techniques
  - automatically attaches the keywords (descriptors) from a controlled thesaurus to a document
    - they represent the document content description
  - integrates the corpus and document analysis

CADIS system

eCADIS system
- integrates the information from the whole document collection
- greyed n-grams are statistically relevant in the corpus, i.e. collocations
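The slide does not say which statistic eCADIS uses to decide that an n-gram is relevant in the corpus. As one common possibility, the sketch below scores bigrams by pointwise mutual information (PMI); the choice of PMI, the `bigram_pmi` function and the `min_count` threshold are assumptions for illustration, and the token sequence is invented.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1):
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    Pairs seen fewer than min_count times are dropped, because PMI tends
    to overrate rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented toy sequence in which "narodne novine" recurs as a unit.
tokens = ["narodne", "novine", "i", "zakon", "i", "narodne", "novine"]
for pair, score in sorted(bigram_pmi(tokens, min_count=2).items()):
    print(pair, round(score, 2))
```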

eCADIS system
- automatic suggestion of relevant descriptors, hence the automatic indexing

eCADIS system
- compare it to manually attached descriptors
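One standard way to quantify such a comparison is to treat the suggested and the manually attached descriptors as two sets and compute precision, recall and their harmonic mean, F1. The sketch below is an illustration: the function name and the descriptor labels are hypothetical and are not actual eCADIS output.

```python
def precision_recall_f1(suggested, manual):
    """Compare automatically suggested descriptors with manual ones."""
    suggested, manual = set(suggested), set(manual)
    true_positives = len(suggested & manual)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(manual) if manual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical descriptor sets for a single document.
suggested = {"fisheries", "import", "customs tariff", "agriculture"}
manual = {"fisheries", "import", "customs tariff", "trade policy", "export"}

p, r, f1 = precision_recall_f1(suggested, manual)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Three of the four suggestions are correct (precision 0.75) and three of the five manual descriptors are found (recall 0.60).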

Applicable knowledge technologies
- automatic document classification
  - uses a series of classifiers (3,500 classifiers combined)
  - results represented in a vector-space model
  - dimensionality reduction
    - matrices could be huge (Vjesnik: 90,000 x 600,000)
  - features selected
    - types
    - lemmas
    - collocations
    - NEs
  - evaluated by F1 measure (combination of precision/recall)
    - F1 > 90% in most cases
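A quick calculation shows why a matrix of that size matters. The sketch below assumes dense storage with 8-byte floats, which is an assumption and not something stated in the talk. It then shows a sparse alternative in which each document stores only its non-zero features, with cosine similarity as a typical vector-space comparison. The function names and sample tokens are hypothetical.

```python
import math
from collections import Counter

DOCS, FEATURES = 90_000, 600_000   # Vjesnik figures from the slide
BYTES_PER_CELL = 8                 # assumption: dense float64 storage
dense_bytes = DOCS * FEATURES * BYTES_PER_CELL
print(f"dense matrix: {dense_bytes / 1e9:.0f} GB")


def to_vector(tokens):
    """Sparse document vector: only non-zero feature counts are stored."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = to_vector(["vlada", "zakon", "zakon"])
doc_b = to_vector(["zakon", "sabor"])
print(round(cosine(doc_a, doc_b), 3))
```

Sparse storage only addresses memory; reducing the number of features is a separate step that also affects classifier quality.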

Applicable knowledge technologies
- visualisation of classification between pages
- Croatia Weekly, English side
- go = economy, ks = culture/sport, te = tourism/ecology, po = politics

Applicable knowledge technologies
- visualisation of classification between culture (lower right) and sport (upper left)
- Croatia Weekly, English side
- go = economy, ks = culture/sport, te = tourism/ecology, po = politics

Applicable knowledge technologies
- visualisation of classification for documents that differentiate between home (blue upward) and foreign policy (blue downward)
- Croatia Weekly, English side
- go = economy, ks = culture/sport, te = tourism/ecology, po = politics
