Corpus lexicography in Russia: recent trends and perspectives

Maria Khokhlova
St.Petersburg State University
Philological Faculty
[email protected]




Prehistory of Russian Corpus Linguistics

Frequency Dictionary of Russian (L. N. Zasorina, 1977)

The text database contained about 1 million units. During its compilation, a large number of well-known problems were discussed:
representativeness;
tokenization;
lemmatization...

So it was the earliest computer corpus of Russian.


Prehistory of Russian Corpus Linguistics

«Computer Fund of the Russian Language»

Idea: Acad. Andrey Yershov

Andrey Petrovich Yershov (1931-1988)


Jeršov A.P. "On methodology of constructing dialogue systems: the phenomenon of business prose" (1978)

The idea was formulated as follows: "Any progress in the field of constructing models and algorithms will remain a purely academic exercise, unless a most important problem of creating a Computer fund of the Russian language is solved. We hope that creation of such a Computer fund by linguists, qualified for the task, will precede construction of large systems for application purposes. This would minimize labour costs and simultaneously would protect the Russian language from arbitrary and incompetent intervention."


Russian Corpora (1)

The Uppsala Russian Corpus (1960s), the earliest corpus

The Tübingen Russian Corpus (Universität Tübingen, 1999-2004, under the guidance of T. Berger)

The HANCO corpus (Helsinki Annotated Corpus), Helsinki University, Slavic and Baltic Languages Department (2001-2004, A. Mustajoki, M. Kopotev). It is a small teaching corpus with morphological and syntactic annotation.


Russian Corpora (2)

Three big corpora of Russian:

The National Corpus of Russian Language (NCRL, about 364 million words) (http://ruscorpora.ru)

Corpora at the University of Leeds created by S. Sharoff (about 2000 million words) (http://corpus.leeds.ac.uk/ruscorpora.html)

A corpus of Russian fiction from the Automatic Text Processing (AOT) initiative team, 680 million words (http://aot.ru).


Russian National Corpus (1)

Over 364 million words

Based on Yandex Search:
Search by exact form(s);
Lexico-grammatical search.

See www.yandex.ru (Advanced Search) and www.ruscorpora.ru (Search in the Corpus)

Additional options:
morphological features;
semantic features;
metadata.


Russian National Corpus (2)

Subcorpora: Modern Russian corpus, Diachronic corpus (the Church Slavonic language), Syntactic corpus, Spoken corpus, News corpus, Parallel corpora, Poetic corpus, Dialect corpus, Speech corpus, Multimodal corpus


Dictionaries based on the Russian National Corpus

Grammatical Dictionary of Russian Neologisms;

New Frequency Dictionary of Russian;

The Combinatory Dictionary of Russian Intensifiers;

The Verbal Combinatory Dictionary of Russian Abstract Nouns

http://dict.lang.ru


AOT (1)


AOT (2)


Russian Corpora (Leeds University, Serge Sharoff)

Russian Reference Corpus

Russian Reference Corpus, another version

Russian Fiction (disambiguated)

Russian Newspapers

Russian Internet Corpus

Russian National Corpus…


Collocations


St.Petersburg Corpus of Hagiographic Texts

Biographies of saints and holy people;

50 manuscripts; 500 000 tokens

http://project.phil.spbu.ru/scat/page.php?page=project


The Fundamental Digital Library of Russian Literature and Folklore

FEB-web accumulates information in text, audio, visual, and other forms on 11th-20th-century Russian literature, Russian folklore, and the history of Russian literary scholarship and folklore studies.


Conference “Corpus Linguistics”

2002, 2004, 2006, 2008, 2011, 2013 (late June)

Saint Petersburg

St.Petersburg State University, Department of Mathematical Linguistics


Thank you for your attention!