

NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet

Anaïs Tack 1,2, Thomas François 1, Piet Desmet 2, Cédrick Fairon 1

1 CENTAL, Université catholique de Louvain, Louvain-la-Neuve, Belgium

2 ITEC, imec, KU Leuven Kulak, Kortrijk, Belgium

CEFR-GRADED LEXICONS

A graded lexicon is a lexical database that includes lexical frequencies observed in texts graded along a difficulty scale.

Foreign language (L2) materials

• textbooks and readers / learner texts

• CEFR scale [A1 > A2 > B1 > B2 > C1 > C2] (Council of Europe, 2001)

CEFRLex: cental.uclouvain.be/cefrlex/

ANALYSIS Semantics

ANALYSIS Frequency

KEY TAKEAWAYS

NT2Lex

• a new resource for Dutch as a foreign language (NT2)

• 17,743 entries with graded frequency distributions

• measure of receptive word difficulty

• measure of word sense complexity through linkage to Open Dutch WordNet

cental.uclouvain.be/nt2lex/

French - FLELex (François et al., 2014)

Swedish - SVALex (François et al., 2016)

English - EFLLex (Dürlich & François, 2018)

Swedish - SweLLex (Volodina et al., 2016)

ANALYSIS Psycholinguistics

NT2LEX Tools

Online tools for lexical complexity analysis

• database search

• CEFR-based complex word identification (Tack et al., 2016)

Corpus of reading materials

• corpus of 461,088 tokens

• 5 CEFR levels (A1, A2, B1, B2, C1)

Preprocessing

• part-of-speech tagging with Frog (van den Bosch et al., 2007)

• SVM WSD tool trained on DutchSemCor (Vossen et al., 2012)

• linkage to Open Dutch WordNet (Postma et al., 2016)

Lexical frequencies

• lexical entries with per-level observed frequency

• normalised for lexical dispersion (Carroll et al., 1971)
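
The dispersion normalisation named above can be illustrated with Carroll's D in one common formulation, D = [log2(Σp) − (Σ p·log2 p) / Σp] / log2(n), where p ranges over the counts of a word in n sub-corpora. The sketch below is only an illustration under stated assumptions: the poster does not say which units (textbooks, chapters, documents) the counts are taken over, nor how D enters the final normalised frequency, and the names `carroll_d` and `counts` are hypothetical.

```python
import math
from typing import Sequence


def carroll_d(counts: Sequence[float]) -> float:
    """Carroll's dispersion index D for one word.

    `counts` holds the word's frequency in each of n sub-corpora
    (an assumed unit; the poster does not specify it).
    D is 1 when the word is spread evenly and 0 when it occurs
    in a single sub-corpus only.
    """
    n = len(counts)
    total = sum(counts)
    if n < 2 or total <= 0:
        # D is undefined for fewer than two parts or an unseen word;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    # By convention, 0 * log2(0) is treated as 0, so zero counts are skipped.
    weighted = sum(c * math.log2(c) for c in counts if c > 0)
    return (math.log2(total) - weighted / total) / math.log2(n)


even = carroll_d([5, 5, 5, 5])        # evenly spread over four parts
clumped = carroll_d([20, 0, 0, 0])    # concentrated in one part
```

With these inputs, `even` comes out at 1 and `clumped` at 0 (up to floating-point rounding), which matches the reading "more dispersed = basic vocabulary" given further down the poster.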

NT2LEX Resource

lemma               pos   sense        synset               A1     A2     B1     B2     C1
pakken (to grab)    WW()  pakken-v-1   odwn-10-101230891-v  35     117    101    5      -
pakken (to defeat)  WW()  pakken-v-10  eng-30-01100145-v    -      51     12     -      -
zijn (to exist)     WW()  zijn-v-1     eng-30-02603699-v    2,094  1,647  1,423  1,253  1,335
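
The three example rows above can be read as records with one frequency per CEFR level. The following sketch shows one way to parse such rows and derive a simple receptive-difficulty signal, namely the first level at which a sense is attested. The record layout, the names `Entry`, `parse_freq`, `first_level` and `is_complex_for`, and the thresholding rule are all assumptions made for illustration; they are not the distribution format of NT2Lex, and the rule is not claimed to be the method of Tack et al. (2016).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LEVELS = ("A1", "A2", "B1", "B2", "C1")


@dataclass(frozen=True)
class Entry:
    lemma: str
    pos: str
    sense: str
    synset: str
    # One value per level in LEVELS; None where the table shows "-".
    freqs: Tuple[Optional[int], ...]


def parse_freq(cell: str) -> Optional[int]:
    """Turn a table cell such as '2,094' or '-' into an int or None."""
    return None if cell == "-" else int(cell.replace(",", ""))


def first_level(entry: Entry) -> Optional[str]:
    """Return the lowest CEFR level at which the sense is attested."""
    for level, freq in zip(LEVELS, entry.freqs):
        if freq is not None:
            return level
    return None


def is_complex_for(entry: Entry, learner_level: str) -> bool:
    """Assumed rule: a sense is complex if first attested above the learner's level."""
    first = first_level(entry)
    if first is None:
        return True
    return LEVELS.index(first) > LEVELS.index(learner_level)


# Rows copied from the poster's example table.
ROWS = [
    ("pakken", "WW()", "pakken-v-1", "odwn-10-101230891-v", "35 117 101 5 -"),
    ("pakken", "WW()", "pakken-v-10", "eng-30-01100145-v", "- 51 12 - -"),
    ("zijn", "WW()", "zijn-v-1", "eng-30-02603699-v", "2,094 1,647 1,423 1,253 1,335"),
]
ENTRIES = [
    Entry(lemma, pos, sense, synset, tuple(parse_freq(c) for c in cells.split()))
    for lemma, pos, sense, synset, cells in ROWS
]
```

Under this rule, the "to defeat" sense of pakken (first attested at A2) would be flagged for an A1 learner, while the "to grab" sense (attested from A1) would not, which is the kind of sense-level distinction the WordNet linkage makes possible.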

[Figure: dispersion (y-axis, 0.0 to 1.0) plotted against frequency (x-axis, 0 to 80); annotated r² = 0.83]

frequency

• correlation with Subtlex-NL (Keuleers et al., 2010)

• Zipfian effects: shorter = more frequent

dispersion

• theoretical familiarity

• more dispersed = basic vocabulary

[Figure: polysemes (y-axis, 0.5 to 4.0) by level (x-axis: A1, A2, B1, B2, C1, TOTAL)]

semasiology

• form > meaning mappings

• easy = more polysemous

onomasiology

• meaning > form mappings

• lower degree of synonymy

• L2-specific lexicalisations
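
The two perspectives above can be made concrete by counting in both directions over (lemma, sense, synset) triples: senses per lemma for the semasiological view, and lemmas per synset for the onomasiological view. This is a minimal sketch using only the three entries shown in the poster's example table; the function names are hypothetical, and the poster does not state how its own polysemy and synonymy measures were computed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (lemma, sense, synset)

# Triples taken from the poster's example table.
TRIPLES = [
    ("pakken", "pakken-v-1", "odwn-10-101230891-v"),
    ("pakken", "pakken-v-10", "eng-30-01100145-v"),
    ("zijn", "zijn-v-1", "eng-30-02603699-v"),
]


def senses_per_lemma(triples: Iterable[Triple]) -> Dict[str, int]:
    """Semasiology (form to meaning): number of distinct senses per lemma."""
    senses = defaultdict(set)
    for lemma, sense, _synset in triples:
        senses[lemma].add(sense)
    return {lemma: len(found) for lemma, found in senses.items()}


def lemmas_per_synset(triples: Iterable[Triple]) -> Dict[str, int]:
    """Onomasiology (meaning to form): number of distinct lemmas per synset."""
    lemmas = defaultdict(set)
    for lemma, _sense, synset in triples:
        lemmas[synset].add(lemma)
    return {synset: len(found) for synset, found in lemmas.items()}


polysemy = senses_per_lemma(TRIPLES)
synonymy = lemmas_per_synset(TRIPLES)
```

On this tiny sample, pakken has two senses and zijn one, and every synset is expressed by a single lemma; restricting the triples to the entries attested at a given level would give per-level counts of the same kind.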

[Figure: density (y-axis, 0.00 to 0.30) of age of acquisition (x-axis, 0 to 20), by level: A1, A2, B1, B2, C1, TOTAL]

[Figure: density (y-axis, 0.00 to 0.40) of concreteness (x-axis, 0 to 6), by level: A1, A2, B1, B2, C1, TOTAL]

interplay of psycholinguistic norms (Brysbaert et al., 2014)