STYLOMETRY IN IR SYSTEMS Leyla BİLGE Büşra ÇELİKKAYA Kardelen HATUN


OUTLINE

Stylistics and Stylometry
Applications of stylometry
History of stylometric research
Stylistic features
Recent Studies
Our approach
Conclusion

4/20/2007  Stylometry in IR Systems


STYLISTICS

The theoretical framework for stylistics combines:
  Halliday’s Language Theory
  Sanders’ Theories of Stylistics

Halliday says: “A text is what is meant, selected from the total set of options that constitute what can be meant”

Sanders says: “Style is the result of choices made by an author from a range of possibilities offered by the language system”


STYLISTICS

Stylistic variation depends on:
  Author preferences and competence
  Familiarity
  Genre
  Communicative context
  Expected characteristics of the intended audience

Modeling, representing and utilizing this variation is the business of stylistic analysis.


STYLOMETRY

The application of the study of linguistic style

Style refers to the linguistic choices of authors that persist over their works, independently of content

The aim is to describe a text from a rather formal perspective, using measures such as:
  Number of words
  Number of repetitions
  Sentence length
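The formal measures named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `formal_profile` is hypothetical, sentences are split naively on `.`, `!` and `?`, and "repetitions" is interpreted as every occurrence of a word beyond its first.

```python
import re
from collections import Counter

def formal_profile(text: str) -> dict:
    """Describe a text by simple counts, independently of its content."""
    # Naive sentence split on final punctuation (assumption, not a real splitter).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Assumption: a "repetition" is any occurrence of a word after its first.
    repetitions = sum(c - 1 for c in counts.values())
    return {
        "words": len(words),
        "repetitions": repetitions,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

profile = formal_profile("The cat sat. The cat ran away!")
print(profile)
```

None of these numbers depends on what the text is about, which is the point of the "formal perspective".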


APPLICATIONS OF STYLOMETRY

Authorship attribution
  Forensic author identification
  To find the author of an anonymous text

Observation of the “characteristics” of a particular author

Organization and retrieval of documents based on their writing style

Systems for genre-based information retrieval


HISTORY OF STYLOMETRY

Stylometry grew out of analyzing text for evidence of authenticity and authorial identity

According to the modern practice of the discipline, a language offers distinctive patterns by which authors can be identified

With the development of computers and the growth of their capacities:
  Large data sets can be analyzed
  New methods can be devised and easily applied


HISTORY OF STYLOMETRY, CONT’D

Current research uses techniques based on term frequency counts
  Frequency data are collected for common terms
  These data are then analyzed using a range of fairly standard statistical techniques

However, these techniques cannot yet guarantee quality output, e.g. Ulysses


METHODOLOGY

Use a subset of structural and stylometric features on a set of authors, without consideration of author characteristics

Currently, authorship attribution studies are dominated by the use of lexical measures

Generally used statistics:
  Word length
  Syllables per word
  Sentence length
  Sentence count
  Text length in words
  Use of punctuation marks
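As a rough illustration of the statistics listed above, the sketch below computes them for a single text. The names `count_syllables` and `surface_statistics` are hypothetical, and the syllable count is an assumption: it approximates syllables as runs of vowels, which is an English-oriented heuristic and would need replacing for other languages.

```python
import re
import string

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (English-oriented heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def surface_statistics(text: str) -> dict:
    # Words are runs of Unicode letters; digits and underscores are excluded.
    words = re.findall(r"[^\W\d_]+", text)
    # Naive sentence split on final punctuation (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = [ch for ch in text if ch in string.punctuation]
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        "text_length_in_words": len(words),
        "sentence_count": len(sentences),
        "mean_word_length": sum(len(w) for w in words) / n,
        "syllables_per_word": sum(count_syllables(w) for w in words) / n,
        "punctuation_per_word": len(punctuation) / n,
    }

stats = surface_statistics("Well, it rained. We stayed home.")
print(stats)
```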


STYLISTIC FEATURES

Lexically-Based Methods
  Vocabulary richness of the author
  Frequencies of occurrence of individual words

Vocabulary diversity:
  Type-token ratio V/N
    V: size of vocabulary of sample text
    N: number of tokens
  Hapax legomena
    How many words occur once

Frequencies of occurrence:
  Function words
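The lexical measures on this slide are simple to compute once a text is tokenized. The following sketch shows the type-token ratio V/N, the hapax legomena count, and relative function-word frequencies. The function name `lexical_features` and the six-word `FUNCTION_WORDS` list are assumptions chosen for illustration; a real study would use a much longer, language-specific list.

```python
from collections import Counter

# Tiny illustrative function-word list (assumption, not from the slides).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in")

def lexical_features(tokens: list[str]) -> dict:
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: size of the vocabulary (distinct word types)
    hapax = sum(1 for c in counts.values() if c == 1)  # words occurring once
    return {
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
        "function_word_freq": {
            w: (counts[w] / n if n else 0.0) for w in FUNCTION_WORDS
        },
    }

tokens = "the style of the author and the choice of words".split()
features = lexical_features(tokens)
print(features)
```

Here N is 10 and V is 7, so the type-token ratio is 0.7, and five words occur exactly once.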


STYLISTIC FEATURES

Problems:
  Text length dependent
  Unstable for short texts
  Function word set requires manual effort
  Specific to the group of authors considered

Solution:
  Use set of most frequent words
  Both content words and function words
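One way to read the proposed solution is to let the corpus itself choose the vocabulary: take the k most frequent words over all documents, whatever their type, and describe each document by the relative frequency of those words. The sketch below assumes that reading; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def most_frequent_words(documents: list[list[str]], k: int) -> list[str]:
    """Return the k most frequent words over the whole corpus."""
    totals = Counter()
    for tokens in documents:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(k)]

def relative_frequencies(tokens: list[str], vocabulary: list[str]) -> list[float]:
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero on empty documents
    return [counts[w] / n for w in vocabulary]

# Toy corpus, made up for illustration only.
corpus = [
    "the sea and the sky".split(),
    "the sea was calm".split(),
]
vocab = most_frequent_words(corpus, k=2)
vectors = [relative_frequencies(doc, vocab) for doc in corpus]
print(vocab, vectors)
```

No hand-built function-word list is needed, and the selected set can mix content words and function words.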


RELATED STUDIES

Analysis of the text by a natural language processing tool:
  Use an existing NLP tool
  Sentence and Chunk Boundaries Detector (SCBD)

Use sub-word units like character n-grams instead of word frequencies:
  Character sequences of length n
  The most frequent n-grams provide information about the author’s stylistic choices on the lexical, syntactical and structural levels
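Character n-grams need no tokenizer at all, which a short sketch makes concrete. The helper names `char_ngrams` and `top_ngrams` are hypothetical; the code simply counts every overlapping character sequence of length n and returns the most frequent ones.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character sequence of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def top_ngrams(text: str, n: int, k: int) -> list[tuple[str, int]]:
    """The k most frequent character n-grams with their counts."""
    return char_ngrams(text, n).most_common(k)

grams = char_ngrams("banana", 2)
print(grams)
```

Because spaces and punctuation are characters too, n-grams over raw text also pick up punctuation habits and word boundaries.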


WORD BASED FEATURES

Bag-of-words
  Apply stemming and stopword list

Function words
  Content-free

POS Annotation
Feature Selection
Semantic Disambiguation


LINGUISTIC CONSTITUENTS

The structure of natural language sentences shows that word occurrences follow a specific order

Words are grouped into syntactic units called “constituents”

Use word relationships by extracting constituents for feature construction
  Subdivide the document into sentences
  Construct a syntax tree for each sentence


SYNTAX TREE

Use a syntax tree representation of different authors’ sentences as features


OUR APPROACH

Use stylometry to analyze the following:
  Texts translated by the same translator but written by different authors
  Texts translated by different translators but written by the same author


PROPOSED STEPS

1. Feature Extraction
  Determine which features represent the style best

2. Training
  Training the classifier with a training set
  Many methods are available (SVM, Bayesian, …)

3. Recognition and Classification of texts

4. Analyzing the results of classification


1. FEATURE EXTRACTION

The stylometric features of a text can be:
  Word length
  Sentence length
  Paragraph length
  Character n-grams
  Function words

The choice of features strongly affects classification results.

Then obtain a feature vector with n dimensions: V = {v1, v2, v3, …, vn}
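To show how individual measures end up as a single vector V, here is a minimal sketch that puts a few of the listed features in a fixed order. The function name `feature_vector`, the three-word `FUNCTION_WORDS` list, the blank-line paragraph split, and the naive sentence split are all assumptions made for illustration.

```python
import re

FUNCTION_WORDS = ("the", "and", "of")  # illustrative list (assumption)

def feature_vector(text: str) -> list[float]:
    """Assemble V = {v1, ..., vn} from a few surface features."""
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    n = len(words) or 1
    vector = [
        sum(len(w) for w in words) / n,       # v1: mean word length
        len(words) / (len(sentences) or 1),   # v2: mean sentence length in words
        len(words) / (len(paragraphs) or 1),  # v3: mean paragraph length in words
    ]
    # v4..vn: relative frequency of each function word
    vector += [words.count(w) / n for w in FUNCTION_WORDS]
    return vector

v = feature_vector("The sun set.\n\nThe wind and the rain came.")
print(v)
```

Every text maps to a vector of the same length and order, which is what lets a classifier compare texts dimension by dimension.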


2. TRAINING


Choose training data for every class
  May be randomly selected texts
  May be manually picked
Determine the corresponding parameters for each class
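The slides do not say which parameters are learned, so the sketch below makes one simple assumption: the parameters of a class are the mean (centroid) of its training feature vectors. This is a stand-in for illustration, not the SVM or Bayesian methods mentioned earlier. The function names and the numbers in `data` are made up.

```python
import random

def choose_training(vectors_by_class: dict[str, list[list[float]]],
                    per_class: int, seed: int = 0) -> dict[str, list[list[float]]]:
    """Randomly select `per_class` training vectors for every class."""
    rng = random.Random(seed)
    return {label: rng.sample(vectors, per_class)
            for label, vectors in vectors_by_class.items()}

def class_centroids(training: dict[str, list[list[float]]]) -> dict[str, list[float]]:
    """Assumed class parameters: the mean of each class's training vectors."""
    centroids = {}
    for label, vectors in training.items():
        dims = len(vectors[0])
        centroids[label] = [sum(vec[d] for vec in vectors) / len(vectors)
                            for d in range(dims)]
    return centroids

# Made-up feature vectors (mean word length, mean sentence length).
data = {
    "author_a": [[4.0, 12.0], [5.0, 14.0], [4.5, 13.0]],
    "author_b": [[6.0, 25.0], [5.0, 23.0]],
}
training = choose_training(data, per_class=2)
centroids = class_centroids(training)
print(centroids)
```

Manually picked training texts would simply replace the call to `choose_training` with a hand-written dictionary.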


3. RECOGNITION AND CLASSIFICATION


Use the parameters we obtained from the training data

Compute the distance

Label the data

Classify the data
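A minimal sketch of this step, assuming the class parameters are centroid vectors and the distance is Euclidean (the slides name neither): an unseen text is labeled with the class whose centroid lies nearest to its feature vector. The function name `classify` and the centroid values are hypothetical.

```python
import math

def classify(vector: list[float], centroids: dict[str, list[float]]):
    """Label a vector with the class whose centroid is nearest."""
    # Euclidean distance is an assumption; other metrics could be swapped in.
    distances = {label: math.dist(vector, c) for label, c in centroids.items()}
    return min(distances, key=distances.get), distances

# Made-up centroids standing in for parameters obtained from training.
centroids = {"author_a": [4.0, 12.0], "author_b": [5.5, 25.0]}
label, distances = classify([4.2, 14.0], centroids)
print(label, distances)
```

Features on very different scales would normally be normalized first, otherwise the largest-valued feature dominates the distance.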


RESULTS OF THE CLASSIFICATION

We will have two sets of results:
  The original texts, classified by author
  The translated texts, classified with no prior class information

These results will give us a clue about the two issues we stated at the beginning

Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators
  Check whether these translations are clustered in one class or in separate classes
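The check described in the example can be expressed as a small grouping step. In the sketch below, the cluster assignments are placeholder values invented to show the shape of the data, not results, and the names `clusters_per_work` and `translator_1` to `translator_3` are hypothetical.

```python
from collections import defaultdict

def clusters_per_work(assignments: list[tuple[str, str, int]]) -> dict[str, set[int]]:
    """Map each work to the set of clusters its translations were placed in."""
    by_work = defaultdict(set)
    for work, _translator, cluster in assignments:
        by_work[work].add(cluster)
    return dict(by_work)

# Placeholder (work, translator, cluster id) triples, not real results.
assignments = [
    ("The Picture of Dorian Gray", "translator_1", 0),
    ("The Picture of Dorian Gray", "translator_2", 0),
    ("The Picture of Dorian Gray", "translator_3", 1),
]
spread = clusters_per_work(assignments)
same_class = {work: len(ids) == 1 for work, ids in spread.items()}
print(spread, same_class)
```

A work whose translations all fall in one cluster would suggest the author's style survives translation, while a spread across clusters would point to the translators' styles.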


OUR AIM

With the right classification we will be able to identify:
  Whether stylometric analysis works in finding an author in two different languages
  Whether translations carry more of their translators’ style or still retain their authors’ style

“…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”


CONCLUSION

Today there are many useful applications of stylometry:
  Authorship attribution, plagiarism detection, genre-based information retrieval

What features are valuable for analysis is still an important question.

We aim to find the stylistic connection between a text and its translation.


