NT2Lex - University of Rochester

NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet

Anaïs Tack 1,2, Thomas François 1, Piet Desmet 2, Cédrick Fairon 1

1 CENTAL, Université catholique de Louvain, Louvain-la-Neuve, Belgium

2 ITEC, imec, KU Leuven Kulak, Kortrijk, Belgium

CEFR-GRADED LEXICONS

A graded lexicon is a lexical database that includes lexical frequencies observed in texts graded along a difficulty scale.

Foreign language (L2) materials

• textbooks and readers / learner texts

• CEFR scale [A1 > A2 > B1 > B2 > C1 > C2] (Council of Europe, 2001)

CEFRLex: cental.uclouvain.be/cefrlex/

ANALYSIS Semantics

ANALYSIS Frequency

KEY TAKEAWAYS

NT2Lex

• a new resource for Dutch as a foreign language (NT2)

• 17,743 entries with graded frequency distributions

• measure of receptive word difficulty

• measure of word sense complexity through linkage to Open Dutch WordNet

cental.uclouvain.be/nt2lex/

French - FLELex (François et al., 2014)

Swedish - SVALex (François et al., 2016)

English - EFLLex (Dürlich & François, 2018)

Swedish - SweLLex (Volodina et al., 2016)

ANALYSIS Psycholinguistics

NT2LEX Tools

Online tools for lexical complexity analysis

• database search

• CEFR-based complex word identification (Tack et al., 2016)
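The poster does not describe how the CEFR-based complex word identification tool decides that a word is complex. As a minimal sketch only, assuming a simple threshold rule (a sense is complex for a learner if it is first attested above the learner's level), the idea could look as follows. The names `FIRST_LEVEL` and `is_complex` are hypothetical, the rule is not claimed to be the method of Tack et al. (2016), and the first-attested levels are read off the sample NT2Lex rows shown on this poster.

```python
# CEFR levels covered by the NT2Lex corpus, from easiest to hardest.
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# First level at which each sense has an observed frequency
# (taken from the three sample rows on the poster).
FIRST_LEVEL = {
    "pakken-v-1": "A1",
    "pakken-v-10": "A2",
    "zijn-v-1": "A1",
}


def is_complex(sense, learner_level, first_level=FIRST_LEVEL):
    """Return True if the sense is first attested above the learner's level."""
    level = first_level.get(sense)
    if level is None:
        # Assumption: a sense that is absent from the lexicon counts as complex.
        return True
    return LEVELS.index(level) > LEVELS.index(learner_level)
```

Under this rule, the sense pakken-v-10 ("to defeat") would be flagged for an A1 learner but not for an A2 learner.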

Corpus of reading materials

• corpus of 461,088 tokens

• 5 CEFR levels (A1, A2, B1, B2, C1)

Preprocessing

• part-of-speech tagging with Frog (van den Bosch et al., 2007)

• SVM WSD tool trained on DutchSemCor (Vossen et al., 2012)

• linkage to Open Dutch WordNet (Postma et al., 2016)

Lexical frequencies

• lexical entries with per-level observed frequency

• normalised for lexical dispersion (Carroll et al., 1971)
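The poster attributes the dispersion normalisation to Carroll et al. (1971) but does not give the formula. As a minimal sketch, assuming the commonly cited form of Carroll's D and treating each corpus part (for example, each textbook within a level) as one unit, dispersion could be computed as follows. The function name `carroll_d` and its arguments are illustrative and not part of NT2Lex.

```python
import math


def carroll_d(freqs, sizes):
    """Carroll's D over n corpus parts: 0 (one part only) to 1 (evenly spread).

    freqs: raw counts of the word in each part
    sizes: token counts of each part (same order as freqs)
    """
    if len(freqs) != len(sizes) or len(freqs) < 2:
        raise ValueError("need matching lists with at least two parts")
    # Relative frequency of the word in each part.
    props = [f / s for f, s in zip(freqs, sizes)]
    total = sum(props)
    if total == 0:
        return 0.0  # the word never occurs
    # Parts where the word is absent contribute nothing to the sum.
    weighted = sum(p * math.log(p) for p in props if p > 0)
    entropy = math.log(total) - weighted / total
    return entropy / math.log(len(props))
```

A word spread evenly over all parts gets D close to 1, and a word confined to a single part gets D close to 0, which matches the poster's reading of high dispersion as a sign of basic vocabulary.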

NT2LEX Resource

lemma               pos   sense        synset               A1     A2     B1     B2     C1
pakken (to grab)    WW()  pakken-v-1   odwn-10-101230891-v  35     117    101    5      -
pakken (to defeat)  WW()  pakken-v-10  eng-30-01100145-v    -      51     12     -      -
zijn (to exist)     WW()  zijn-v-1     eng-30-02603699-v    2,094  1,647  1,423  1,253  1,335
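To make the structure of an entry concrete, the three sample rows can be held in a small record type. This is only a sketch: the class name `Entry`, its field names, and the helper `first_level` are assumptions for illustration and do not reflect the actual NT2Lex file format. A dash in the poster's table is represented as `None`.

```python
from dataclasses import dataclass

LEVELS = ("A1", "A2", "B1", "B2", "C1")


@dataclass(frozen=True)
class Entry:
    lemma: str
    pos: str      # part-of-speech tag as printed on the poster
    sense: str    # sense identifier
    synset: str   # Open Dutch WordNet synset identifier
    freqs: tuple  # one frequency per level in LEVELS; None where the poster shows "-"

    def first_level(self):
        """Return the lowest level with an observed frequency, or None."""
        for level, freq in zip(LEVELS, self.freqs):
            if freq is not None:
                return level
        return None


# The three sample rows from the poster.
ENTRIES = [
    Entry("pakken", "WW()", "pakken-v-1", "odwn-10-101230891-v", (35, 117, 101, 5, None)),
    Entry("pakken", "WW()", "pakken-v-10", "eng-30-01100145-v", (None, 51, 12, None, None)),
    Entry("zijn", "WW()", "zijn-v-1", "eng-30-02603699-v", (2094, 1647, 1423, 1253, 1335)),
]
```

Keeping one record per sense rather than per lemma mirrors the table, where the two senses of pakken have different distributions across levels.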

[Figure: dispersion (0.0 to 1.0) plotted against frequency (0 to 80); r² = 0.83]

frequency

• correlation Subtlex-NL (Keuleers et al., 2010)

• Zipfian effects

shorter = more frequent

dispersion

• theoretical familiarity

• more dispersed = basic vocabulary
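The poster reports r² = 0.83 on its frequency plot but does not say how the value was computed. As a minimal sketch, assuming it is the squared Pearson correlation between two paired lists of values (for example, frequencies of the same words in two resources), it could be obtained as follows. The function name `r_squared` is illustrative.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two paired lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of squares and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)
```

Word frequencies are heavily skewed, so in practice such a correlation is often computed on log-transformed values. Whether that was done here is not stated on the poster.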

[Figure: polysemes (0.5 to 4.0) per level (A1, A2, B1, B2, C1, TOTAL)]

semasiology

• form-to-meaning mappings

• easy = more polysemous

onomasiology

• meaning-to-form mappings

• lower degree of synonymy

• L2-specific lexicalisations
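The two perspectives above can be turned into simple counts per level: senses per lemma for semasiology (polysemy) and lemmas per synset for onomasiology (synonymy). The sketch below is an assumption about how such counts might be derived, not the poster's actual procedure. It uses only the three sample rows shown on the poster, and the function names are hypothetical.

```python
from collections import defaultdict

LEVELS = ("A1", "A2", "B1", "B2", "C1")

# (lemma, sense, synset, per-level frequencies); None where the poster shows "-".
ROWS = [
    ("pakken", "pakken-v-1", "odwn-10-101230891-v", (35, 117, 101, 5, None)),
    ("pakken", "pakken-v-10", "eng-30-01100145-v", (None, 51, 12, None, None)),
    ("zijn", "zijn-v-1", "eng-30-02603699-v", (2094, 1647, 1423, 1253, 1335)),
]


def mean_senses_per_lemma(rows, level):
    """Semasiology: average number of attested senses per lemma at a level."""
    idx = LEVELS.index(level)
    senses = defaultdict(set)
    for lemma, sense, _synset, freqs in rows:
        if freqs[idx] is not None:
            senses[lemma].add(sense)
    if not senses:
        return 0.0
    return sum(len(s) for s in senses.values()) / len(senses)


def lemmas_per_synset(rows, level):
    """Onomasiology: number of attested lemmas expressing each synset at a level."""
    idx = LEVELS.index(level)
    lemmas = defaultdict(set)
    for lemma, _sense, synset, freqs in rows:
        if freqs[idx] is not None:
            lemmas[synset].add(lemma)
    return {synset: len(found) for synset, found in lemmas.items()}
```

On these three rows alone, pakken has one attested sense at A1 and two at A2, so the mean rises from 1.0 to 1.5. Three rows are far too few to say anything about the trend in the full lexicon.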

[Figure: density (0.00 to 0.30) of age of acquisition (0 to 20) by level (A1, A2, B1, B2, C1, TOTAL)]

[Figure: density (0.00 to 0.40) of concreteness (0 to 6) by level (A1, A2, B1, B2, C1, TOTAL)]

interplay of psycholinguistic norms (Brysbaert et al., 2014)
