Textdatenbanken (Text Databases)

Summer semester 2020

Lecture 12

- Text Genres and Corpus Statistics, Part 3 -

Uwe Quasthoff

Universität Leipzig

Institute of Computer Science (Institut für Informatik)

quasthoff@informatik.uni-leipzig.de

U. Quasthoff Textdatenbanken 2

Presenting the measurement results

• Visual: graphical representation (e.g. a straight line)

• Minimal: describe the graph with as few parameters as possible (e.g. the slope)

• By example: list examples in a table

Presenting the measurement results: regularities

• Visual: graphical representation (e.g. a straight line)

• Recognizing regularities / relationships

• Minimal: describe the graph with as few parameters as possible (e.g. the slope)

• Use in feature vectors

• Basis for comparison / clustering

• By example: list examples in a table

– Typical examples

Presenting the measurement results: irregularities

• Visual: graphical representation (e.g. a straight line)

• Deviation from expected regularities / relationships

• By example: list examples in a table

– Extreme examples

• Tracking down interesting special cases

• Hints at errors in preprocessing

Example: sentence length in words (German)

• Graph

• Mean

• Examples of extremely short and long sentences.

But also:

• Distinguishing by sentence type: statement / question / exclamation

• Distinguishing the sources by mean sentence length

• Variance of sentence length per source
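
The per-source measurements listed above can be sketched as follows. This is only a minimal illustration: the function names, the whitespace tokenization, and the tiny sample corpus are assumptions made for the example, not part of the lecture material.

```python
from collections import defaultdict
from statistics import mean, pvariance

def sentence_type(sentence):
    """Classify a sentence by its final punctuation mark."""
    s = sentence.rstrip()
    if s.endswith("?"):
        return "question"
    if s.endswith("!"):
        return "exclamation"
    return "statement"

def sentence_length_stats(pairs):
    """pairs: iterable of (source, sentence).
    Returns {source: (mean length, variance)}, lengths counted in words."""
    lengths = defaultdict(list)
    for source, sentence in pairs:
        # Assumption: words are separated by whitespace.
        lengths[source].append(len(sentence.split()))
    return {src: (mean(ls), pvariance(ls)) for src, ls in lengths.items()}

# Hypothetical sample data, for illustration only.
sample = [
    ("news", "This is short."),
    ("news", "Is this a longer question?"),
    ("wiki", "Stop!"),
]
stats = sentence_length_stats(sample)
types = [sentence_type(s) for _, s in sample]
```

Sources with an unusually high or low mean, or with a very small variance, are natural candidates for the table of extreme examples.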

Sentence length in the Ido Wikipedia

A large part of the text was generated automatically. Example sentences:

• Tama, Iowa: Segun la kontado di 2000 esas 2,731 homi, 1,065 hemanari, e 723 familii qui rezidas en la urbo. La lojanto-denseso esas 349.2/km² (905.1/mi²).

• Harper, Iowa: Segun la kontado di 2000 esas 134 homi, 55 hemanari, e 40 familii qui rezidas en la urbo. La lojanto-denseso esas 574.9/km² (1,523.7/mi²).

Words by Length without multiplicity

Here we ignore the fact that words have different frequencies, so each distinct word counts equally towards the average word length. For each fixed word length, we count the number of different words having this length.

With a logarithmic scale on the y-axis, we get a nearly linear part between lengths 15 and 40.

Words by Length with multiplicity

Because stopwords are both very frequent and short, the average word length is lower than in the previous picture.
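
The difference between counting without and with multiplicity can be made concrete with a small sketch. The function name and the toy token list are assumptions chosen for illustration; a real corpus would of course be tokenized more carefully.

```python
from collections import Counter

def word_length_stats(tokens):
    """Return length histograms and average word lengths,
    once per distinct word (types) and once per occurrence (tokens)."""
    freq = Counter(tokens)

    # Without multiplicity: every distinct word counts exactly once.
    hist_types = Counter(len(w) for w in freq)
    avg_types = sum(len(w) for w in freq) / len(freq)

    # With multiplicity: every word is weighted by its frequency.
    hist_tokens = Counter()
    for word, count in freq.items():
        hist_tokens[len(word)] += count
    avg_tokens = sum(len(w) * c for w, c in freq.items()) / sum(freq.values())

    return hist_types, avg_types, hist_tokens, avg_tokens

# Hypothetical toy text in which the short words are also the frequent ones.
tokens = "the cat and the dog and the elephant".split()
hist_types, avg_types, hist_tokens, avg_tokens = word_length_stats(tokens)
```

In the toy text the frequent words are the short ones, so the token-based average comes out below the type-based average, which mirrors the effect described on the slide.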

Average word length for different frequency ranges

The table shows the average word length (counted without multiplicity) for the most frequent 10^n (n = 1, 2, …) words.
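
A table of this kind could be computed roughly as follows. The function name, the `max_exp` parameter, and the synthetic frequency list are assumptions for the sake of a runnable example.

```python
from collections import Counter

def avg_length_top_words(freq, max_exp=5):
    """Average word length (each word counted once) among the
    top 10^n words by frequency, for n = 1..max_exp."""
    ranked = [word for word, _ in freq.most_common()]
    result = {}
    for n in range(1, max_exp + 1):
        k = 10 ** n
        if k > len(ranked):
            break  # the vocabulary is too small for this range
        top = ranked[:k]
        result[k] = sum(len(w) for w in top) / k
    return result

# Synthetic frequency list: 10 frequent two-letter "words"
# followed by 90 rarer five-letter "words".
freq = Counter()
for i in range(10):
    freq[f"w{i}"] = 1000 - i
for i in range(10, 100):
    freq[f"w{i:04d}"] = 1000 - i

table = avg_length_top_words(freq)
```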

Number of letter-N-grams at word beginnings

How many different letter-N-grams do we find at the beginning of a word? Of course we will find many unexpected N-grams, but they will have low frequency. For this reason we count these numbers for different frequency ranges and use the top K = 10^n words (n = 2, 3, 4, 5, 6).
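
Counting distinct word-initial N-grams for several frequency ranges might look like this. The function name is hypothetical, and the example uses tiny values of K instead of 10^n only so that the result can be checked by hand.

```python
def count_initial_ngrams(ranked_words, n, ks):
    """ranked_words: words sorted by descending frequency.
    Returns {K: number of distinct letter n-grams found at the
    beginning of the top K words}."""
    result = {}
    for k in ks:
        # Words shorter than n letters have no initial n-gram.
        prefixes = {w[:n] for w in ranked_words[:k] if len(w) >= n}
        result[k] = len(prefixes)
    return result

# Hypothetical ranked word list; real use would pass K = 100, 1000, ...
ranked = ["the", "they", "this", "and", "ant", "a"]
bigram_counts = count_initial_ngrams(ranked, n=2, ks=(3, 6))
```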

Text coverage by top words

Text coverage measures the number of words necessary to cover a certain amount of the text of a corpus. The table shows the text coverage for the first N = 10^n words, n = 1, …, 5.

A diagram with these values and a logarithmic x-axis shows a nearly straight line.
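
As a minimal sketch, the coverage of the top N words can be computed by summing the N largest frequencies and dividing by the total number of tokens. The function name and the toy input are assumptions for illustration.

```python
from collections import Counter

def text_coverage(tokens, ks):
    """Percentage of all running words (tokens) that are covered
    by the K most frequent distinct words, for each K in ks."""
    freq = Counter(tokens)
    total = sum(freq.values())
    counts = sorted(freq.values(), reverse=True)
    return {k: 100.0 * sum(counts[:k]) / total for k in ks}

# Toy corpus with frequencies 4, 3, 2, 1.
tokens = "a a a a b b b c c d".split()
coverage_toy = text_coverage(tokens, ks=(1, 2, 4))
```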

Text Coverage

text coverage by the most frequent 10 words: 21.129%

text coverage by the most frequent 100 words: 40.212%

text coverage by the most frequent 1 000 words: 60.632%

text coverage by the most frequent 10 000 words: 80.703%

text coverage by the most frequent 100 000 words: 93.498%
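
To describe these values by as few parameters as possible, one option is to fit a straight line to coverage against log10(N). The sketch below uses an ordinary least-squares fit; the choice of method and the helper name `fit_line` are assumptions, the five data points are the ones listed above.

```python
import math

# Top N words -> text coverage in percent, copied from the list above.
coverage = {10: 21.129, 100: 40.212, 1_000: 60.632,
            10_000: 80.703, 100_000: 93.498}

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# The x-axis is logarithmic, so we fit against log10(N).
slope, intercept = fit_line([(math.log10(n), c) for n, c in coverage.items()])
```

For these five points the slope comes out at about 18.5 percentage points per factor of ten in N, which is the single-parameter summary of the nearly straight line.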

Sentences containing the most frequent words

For the most frequent words we present the percentage of sentences containing this word.
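
One possible way to obtain these percentages is sketched below. The function name, the lowercasing, and the whitespace tokenization are assumptions; each sentence counts at most once per word, however often the word occurs in it.

```python
from collections import Counter

def sentence_share_of_top_words(sentences, top=10):
    """For the `top` most frequent words, return the percentage
    of sentences that contain the word at least once."""
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(tok for toks in tokenized for tok in toks)
    shares = {}
    for word, _ in freq.most_common(top):
        containing = sum(1 for toks in tokenized if word in toks)
        shares[word] = 100.0 * containing / len(tokenized)
    return shares

# Hypothetical sample sentences.
sentences = ["the cat sat", "the dog and the bird", "a bird sang", "the end"]
shares = sentence_share_of_top_words(sentences, top=2)
```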

Length of sentences in characters and words

Most frequent sentence beginnings and endings of different length

Distribution of the letter Z within sentences

Distribution of the letter X within sentences

Distribution of colons within sentences

Sentences consisting of short words only

In this subsection we look for sentences containing only short words. The sentences have a minimum length of 40 characters and are ordered by the length of their longest word.
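
The selection described above can be sketched like this. The function name, the punctuation stripping, and the `limit` of 30 results are assumptions made for the example; only the 40-character minimum and the ordering by longest word come from the slide.

```python
def short_word_sentences(sentences, min_chars=40, limit=30):
    """Return (longest word length, sentence) pairs for sentences of at
    least min_chars characters, ordered by the length of the longest word."""
    candidates = []
    for sentence in sentences:
        if len(sentence) < min_chars:
            continue
        # Assumption: strip common punctuation so it does not count as letters.
        words = [w.strip(".,;:!?\"'()") for w in sentence.split()]
        words = [w for w in words if w]
        if words:
            candidates.append((max(len(w) for w in words), sentence))
    candidates.sort(key=lambda pair: pair[0])
    return candidates[:limit]

# Hypothetical sample sentences.
sample = [
    "Internationalization considerably complicates everything.",
    "Too short.",
    "He is on it so we go to do it as he is at it.",
]
ranking = short_word_sentences(sample)
```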

30 sentences with least maximum word number

Sentences with highest average word number

Sentences with highest average word length

30 sentences with shortest word of maximal length

Types of Sentences by Punctuation Mark

Sentences consisting of long words only

The table shows the sentences with maximal average word length. Because some languages allow very long words, such sentences may also contain short stopwords. Hence, we may find (at least some) well-formed sentences.

Language Fingerprint

NN co-occurrences within the 10 most frequent words

Number of NN-co-occurrences depending on frequency classes

In many cases, two co-occurring words have nearly the same frequency. In many other cases (like DET NN), the frequencies differ widely. The following plot shows the frequency classes of co-occurring words. Frequency classes are defined as the logarithm (with base 2) of the frequency rank. The size of the dots corresponds to the number of co-occurrences with the corresponding pair of frequency classes.
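
The data behind such a plot might be prepared as in the following sketch. Several points are assumptions here and not taken from the slides: co-occurrences are approximated by distinct pairs of directly adjacent words, the logarithm is rounded down to an integer class, and all function names are hypothetical.

```python
from collections import Counter

def frequency_classes(freq):
    """Map each word to floor(log2(rank)), with rank 1 = most frequent.
    bit_length() - 1 gives the exact integer part of log2 for rank >= 1."""
    ranked = [word for word, _ in freq.most_common()]
    return {word: rank.bit_length() - 1
            for rank, word in enumerate(ranked, start=1)}

def class_pair_counts(tokens):
    """Count distinct adjacent word pairs per pair of frequency classes."""
    classes = frequency_classes(Counter(tokens))
    neighbours = set(zip(tokens, tokens[1:]))  # each pair counted once
    return Counter((classes[a], classes[b]) for a, b in neighbours)

# Hypothetical toy text.
tokens = "the cat the dog the cat sat".split()
pair_counts = class_pair_counts(tokens)
```

Each key of `pair_counts` is one dot position in the plot, and the value would determine the dot size.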

Number of sentence co-occurrences vs. Frequency

The diagram below displays for any word its frequency and number of sentence co-occurrences.
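
A simplified way to collect both values per word is sketched below. It counts every distinct word that shares at least one sentence with the given word; whether the original statistic additionally applies a significance filter to co-occurrences is not stated here, so this unfiltered count is an assumption, as is the function name.

```python
from collections import Counter, defaultdict

def frequency_and_cooccurrences(sentences):
    """Return {word: (frequency, number of distinct words that
    occur together with it in at least one sentence)}."""
    freq = Counter()
    partners = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        freq.update(tokens)
        types = set(tokens)
        for word in types:
            partners[word] |= types - {word}
    return {word: (freq[word], len(partners[word])) for word in freq}

# Hypothetical toy sentences.
points = frequency_and_cooccurrences(["a b c", "a d", "a"])
```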

Size of Sources

Sentence length for different sources: Min and Max

Average word length for different sources: Min and Max

Sources consisting of many / few words with frequency 1

Sources with low / high average word length of rare words
