Language and Knowledge Technologies for News Collections in Croatia


Language and Knowledge Technologies for News Collections in Croatia. Bojana Dalbelo Bašić, Marko Tadić, University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences, bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr


  • Language and Knowledge Technologies for News Collections in Croatia
    Bojana Dalbelo Bašić, Marko Tadić
    University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
    bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr
    ITN2008, Dubrovnik, 2008-05-21

  • Talk overview
    who are we?
    what are we doing?
    text collections used for research
    applicable language technologies
    applicable knowledge technologies

  • Who are we?
    University of Zagreb, Croatia
    two faculties on a joint mission:
    to build systems that develop and enable the use of language resources and tools for Croatian

  • Who are we? 2
    Faculty of Humanities and Social Sciences
      Institute/Department of Linguistics
      Department of Information Sciences
    basic computational linguistic tasks for Croatian
    compiling and processing large language resources
      Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
    digitization of Croatian lexicographic heritage: 60+ dictionaries digitized so far
    tagger, lemmatizer
    chunker, parser
    NERC system, gazetteers (e.g. Croatian (sur)names)

  • Who are we? 3
    Faculty of Electrical Engineering and Computing
      Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
      Knowledge Technologies Laboratory
    the group deals with
      text preprocessing techniques for Croatian for machine learning procedures
      dimensionality reduction and document clustering in the vector space model + visualisation
      automatic indexing of documents
      intelligent, language-specific and non-specific information retrieval and extraction

  • What are we doing?
    working jointly on several research projects
      AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
        Institute of Linguistics/FFZG & ZEMRIS/FER, 2006-2008
      Computational Linguistic Models and Language Technologies for Croatian (rmjt.ffzg.hr), 2007-2011
        national research programme, prof. Marko Tadić
      Sources for Croatian Heritage and Croatian European Identity, 2007-2011
        national research programme, prof. Damir Boras
      CADIAL: Computer Aided Document Indexing for Accessing Legislation
        joint Flemish-Croatian project, 2007-2009
        prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

  • What are we doing? 2
    composition of the programme RMJT
      P1: Croatian language resources and their annotation
        project leader: Marko Tadić
      P2: Computational syntax of Croatian
        project leader: Zdravko Dovedan
      P3: Lexical semantics in building Croatian WordNet
        project leader: Ida Raffaelli
      P4: Information technology in translating Croatian and language e-learning
        project leader: Sanja Seljan
      P5: Knowledge discovery in textual data
        project leader: Bojana Dalbelo Bašić
    participation in the FP7 project CLARIN
      LR & LT as a research infrastructure for e-SSH

  • Text collections used for research
    we have done research on different kinds of texts, but predominantly in the journalistic genre
    Croatian National Corpus (hnk.ffzg.hr)
      101.2 million tokens in size
      newspaper articles: 37% (ca. 37 million tokens)
      magazine articles: 16% (ca. 16 million tokens)
    Croatian-English Parallel Corpus
      3.5 million tokens from Croatia Weekly
      newspaper articles: 100%, bilingual
    special text collections
      database of Vjesnik articles: 2000-2003, >90,000 articles
      Narodne novine collection: 1998-2008, >10,000 texts, >15 million tokens
      Parallel corpus of Southeast European Times: 2007-, >25,000 articles, >4 million tokens, in 10 languages

  • Applicable language technologies
    morphological processing
      important for inflectionally rich languages, e.g.
      a Croatian noun has 14 word-forms (7 cases, 2 numbers):
             Sg           Pl
        N:   student      studenti
        G:   studenta     studenata
        D:   studentu     studentima
        A:   studenta     studente
        V:   studentu     studenti
        L:   studentu     studentima
        I:   studentom    studentima
      unlike an English noun, which has 2 (4?) word-forms (2 numbers + possessive?):
        Sg:  student      Poss: (student's)
        Pl:  students     Poss: (students')
      present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...

  • Applicable language technologies 2
    recognizing which lexeme(s) a word-form (WF) belongs to
      helps us avoid the problem of data sparseness in many text processing tasks:
        information retrieval
        text mining
        document classification
        document indexing
        query processing
    search engines are not inflectionally sensitive
      speakers of inflectionally rich languages use the normal/base form = lemma
      e.g. www.google.hr input: noun in nominative singular
      did you know that accusative and genitive are more frequent in Croatian?
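A minimal sketch of how mapping word-forms to a lemma reduces data sparseness. It assumes a toy lookup table built from the distinct forms of "student" in the paradigm slide; the names FORM_TO_LEMMA, lemmatize and lemma_counts are hypothetical, and a real lemmatizer for Croatian would draw on a full morphological lexicon and handle ambiguous forms.

```python
from collections import Counter

# Toy lexicon (assumption): the distinct word-forms of "student"
# from the paradigm slide, all mapped to one lemma.
FORM_TO_LEMMA = {form: "student" for form in (
    "student", "studenta", "studentu", "studente", "studentom",
    "studenti", "studenata", "studentima",
)}

def lemmatize(token):
    """Return the lemma of a known word-form, else the lowercased token."""
    token = token.lower()
    return FORM_TO_LEMMA.get(token, token)

def lemma_counts(tokens):
    """Count tokens after collapsing word-forms onto their lemmas."""
    return Counter(lemmatize(t) for t in tokens)

tokens = ["studenta", "studentima", "Student", "studenata"]
form_counts = Counter(t.lower() for t in tokens)  # four separate entries
counts = lemma_counts(tokens)                     # one entry for the lemma
```

Counting by word-form spreads the evidence over four entries, while counting by lemma gathers it under "student". The same lookup could normalize a query typed in the nominative singular so that it also matches genitive or accusative forms in the documents.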

  • Applicable language technologies 3

  • Applicable language technologies 4

  • Applicable language technologies 5

  • Applicable language technologies 6
    Named Entity Recognition and Classification (NERC)
      NEs introduce exact information from the outer world into the world of text
      they represent answers to the basic journalistic questions: who?, where?, when?, how much?
      types of NEs (according to the MUC conferences)
        person
        organization
        location
        date
        time
        currency and measurements
        percentage
      system that works for Croatian with >90% precision
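The NE types listed above can be illustrated with a small rule-based sketch. It is not the NERC system described in the talk: the gazetteer entries, the two regular expressions and the function names are assumptions chosen for illustration, and only four of the MUC types are covered.

```python
import re

# Hypothetical toy gazetteers; a real system would use large name lists.
GAZETTEER = {
    "location": {"Zagreb", "Dubrovnik"},
    "organization": {"University of Zagreb"},
}

# Simple surface patterns for two numeric NE types (assumed formats).
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "percentage": re.compile(r"\b\d+(?:\.\d+)?\s?%"),
}

def candidate_spans(text):
    """Collect (start, end, type, surface) for every gazetteer or pattern hit."""
    spans = []
    for ne_type, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                spans.append((m.start(), m.end(), ne_type, m.group()))
    for ne_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), ne_type, m.group()))
    return spans

def find_entities(text):
    """Keep the longest non-overlapping spans, returned in text order."""
    kept = []
    for span in sorted(candidate_spans(text), key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)

text = ("The University of Zagreb presented in Dubrovnik on 2008-05-21; "
        "precision was above 90%.")
entities = find_entities(text)
```

Preferring the longest match keeps "University of Zagreb" as one organization instead of also tagging "Zagreb" as a location. Exact string matching like this would miss inflected Croatian forms, which is one reason morphological processing comes first.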

  • Applicable language technologies 7
    system that works for Croatian with >90% precision


  • Applicable L&K technologies

  • Applicable L&K technologies


  • Applicable language technologies 8
    semantic networks as language resources
      covering the general lexicon and NEs in a language
      WordNet: words are linked by meaning
        synonyms, antonyms, hypo-/hyperonyms, meronyms
      realized as ontologies or taxonomies
      allow for words and/or NEs
        synonymy/antonymy search
        evoking upper levels in the taxonomy
          e.g. activating the region/state/continent when a city is mentioned, or a company when a director is in focus
        explicit social networking connections between NEs
    semantic processing: roles in sentences (agent, patient, instrument etc.)
    event detection: from verbal frames and scenarios
    connection with geo-data
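Evoking upper levels in a taxonomy can be sketched as a walk up a chain of parent links. The PARENT table and the functions upper_levels and expand_query are hypothetical and hand-written for illustration; they are not taken from Croatian WordNet or any resource named in the talk.

```python
# Hypothetical "is located in" links, written by hand for this example.
PARENT = {
    "Dubrovnik": "Dubrovnik-Neretva County",
    "Dubrovnik-Neretva County": "Croatia",
    "Zagreb": "Croatia",
    "Croatia": "Europe",
}

def upper_levels(entity, parent=PARENT):
    """Return the chain of ancestors of an entity, nearest first."""
    chain = []
    seen = {entity}
    while entity in parent:
        entity = parent[entity]
        if entity in seen:  # guard against accidental cycles in the table
            break
        seen.add(entity)
        chain.append(entity)
    return chain

def expand_query(terms, parent=PARENT):
    """Add the ancestors of every term, keeping first-seen order."""
    expanded = []
    for term in terms:
        for item in [term] + upper_levels(term, parent):
            if item not in expanded:
                expanded.append(item)
    return expanded

levels = upper_levels("Dubrovnik")
query = expand_query(["Dubrovnik", "Zagreb"])
```

When a city is mentioned, the region, state and continent become available for search or indexing. The same walk would apply to a director-to-company link if such relations were stored in the network.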

  • Applicable knowledge technologies
    automatic document indexing
      eCADIS system
        developed for Croatian legal documents
        applicable to any document collection
        uses machine learning techniques
        automatically attaches keywords (descriptors) from a controlled thesaurus to a document
          the descriptors represent a description of the document content
        integrates corpus and document analysis

  • CADIS system

  • eCADIS system
    integrates the information from the whole document collection
    greyed n-grams are statistically relevant in the corpus, i.e. collocations
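One common way to decide that an n-gram is statistically relevant is pointwise mutual information (PMI) over bigrams. The slide does not say which statistic eCADIS uses, so PMI is an assumption here, and the function name bigram_pmi, the min_count threshold and the English toy tokens are all illustrative.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score each bigram seen at least min_count times by PMI (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams give unreliable estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = ("the official gazette published the law and "
          "the official gazette published the decision").split()
scores = bigram_pmi(tokens)
best = max(scores, key=scores.get)
```

In this toy text "official gazette" scores higher than "the official", because "the" also occurs freely with other words. Estimating the counts over the whole collection instead of a single document is what ties the scoring to the corpus.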

  • eCADIS system
    automatic suggestion of relevant descriptors, hence the automatic indexing

  • eCADIS system
    compare it to manually attached descriptors

  • Applicable knowledge technologies
    automatic document classification
      uses a series of classifiers, 3500 classifiers combined
      results represented in a vector-space model
      dimensionality reduction
        matrices can be huge (Vjesnik: 90,000 x 600,000)
      features selected
        types
        lemmas
        collocations
        NEs
      evaluated by the F1 measure (combination of precision and recall)
        F1 > 90% in most cases
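The F1 measure named above is the harmonic mean of precision and recall. A minimal sketch for one class follows, assuming the classifier output and the gold standard are given as sets of document ids; the function name and the ids are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for one class from two sets of ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Documents the classifier put in the class vs. documents that belong to it.
predicted_docs = {1, 2, 3}
gold_docs = {1, 2, 4, 5}
precision, recall, f1 = precision_recall_f1(predicted_docs, gold_docs)
```

Here two of the three predicted documents are correct (precision 2/3) and two of the four gold documents are found (recall 1/2), which gives F1 = 4/7. Because it is a harmonic mean, F1 stays low unless both precision and recall are high.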

  • Applicable knowledge technologies
    visualisation of classification between pages
    Croatia Weekly, English side
    go = economy, ks = culture/sport, te = tourism/ecology, po = politics

  • Applicable knowledge technologies
    visualisation of classification between culture (lower right) and sport (upper left)
    Croatia Weekly, English side
    go = economy, ks = culture/sport, te = tourism/ecology, po = politics

  • Applicable knowledge technologies
    visualisation of classification for documents that differentiate between home policy (blue, upward) and foreign policy (blue, downward)
    Croatia Weekly, English side
    go = economy, ks = culture/sport, te = tourism/ecology, po = politics
