A SURVEY OF ARABIC DISCOURSE ANNOTATION

By: Abeer Al-Qahtani, Afnan Al-Moadi, Nujoud Al-Ghamdi

Discourse annotation


Page 1: Discourse annotation


Page 2: Discourse annotation


INTRODUCTION

Arabic discourse annotation and segmentation have become a popular area of research. The aim of this presentation is to survey and summarize some of the techniques used in discourse annotation and segmentation, and to present their methods and results.

Page 3: Discourse annotation


CLAUSE-BASED DISCOURSE SEGMENTATION OF ARABIC TEXTS

Discourse parsing consists of two steps:
1- Discourse segmentation, which aims at identifying Elementary Discourse Units (EDUs).
2- Building the discourse structure by linking EDUs using a set of rhetorical or discursive relations.

Arabic language characteristics:
- It is agglutinative.
- It does not have capital letters.
- Diacritics are usually absent from written text.

Page 4: Discourse annotation


METHODOLOGY

Their analysis was carried out on two different corpus genres: news articles and elementary school textbooks.

They proposed a three-step segmentation algorithm:
Step 1: punctuation marks.
Step 2: lexical cues.
Step 3: a mix of punctuation marks and lexical cues.

Page 5: Discourse annotation


METHODOLOGY CONT.

Step 1: punctuation marks:
[د. طارق السويدان عالج أمراض مختلفة.]

[Dr. Tarak Swiden has treated various diseases.]

Step 2: lexical cues:
[سيعرف الجميع متى نبدأ] [ولكن لا يعرفوا متى ننتهي]

[They will know when we start] [but they don't know when we finish]
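Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the cue list contains just the one cue shown on the slide, and the abbreviation list (which keeps the period in "د." from closing a segment) and the set of strong punctuation marks are hypothetical.

```python
# Hypothetical resources: the slides only show the cue "ولكن" (but)
# and the abbreviation "د." (Dr.).
LEXICAL_CUES = ("ولكن",)
ABBREVIATIONS = ("د.",)
STRONG_PUNCT = ".!?؟"  # assumed set of segment-final marks


def split_on_punctuation(text):
    """Step 1: close a segment after strong punctuation, skipping abbreviations."""
    segments, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in STRONG_PUNCT and token not in ABBREVIATIONS:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments


def split_on_cues(segment, cues=LEXICAL_CUES):
    """Step 2: open a new segment at each lexical cue that is not segment-initial."""
    pieces, current = [], []
    for token in segment.split():
        if token in cues and current:
            pieces.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        pieces.append(" ".join(current))
    return pieces


def segment(text):
    """Apply step 1, then step 2 inside each punctuation-delimited segment."""
    return [p for s in split_on_punctuation(text) for p in split_on_cues(s)]
```

Calling `segment` on the Step 2 sentence yields two segments, split just before "ولكن"; the Step 1 sentence stays in one piece because the period after "د" is treated as part of an abbreviation.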

Page 6: Discourse annotation


METHODOLOGY CONT.

Step 3: a mix of punctuation marks and lexical cues:
If a comma is followed by the conjunction "و" (waw) or "ف" (fā) and then by a preposition of localization {في, على, من, عن, إلى}, it indicates the end of a segment.

Example:
(كان أهله على عادة كثير من العائلات التونسية يتخلعون ببلدة المرسى,) (وعلى شاطئها البديع بدأ اللقاء حميما بينه وبين الطبيعة.)

[Like Tunisian families, her family left Marsa city,]

[then, they found themselves at the wonderful Marsa’s beach.]

Page 7: Discourse annotation


METHODOLOGY CONT.

If a comma is followed by the conjunction "و" (waw) or "ف" (fā) and then by a possessive noun {له, لها, لهما, لهم, لهنّ, لك, لكما, لكم, لكنّ, لي, لنا}, it indicates the end of a segment.

Example:
(رأيت أختي في الخارج,) (لها دمية تتكلم.)
[I saw my sister outside,] [with a talking doll]

If a comma is followed by a demonstrative pronoun {هذا, هذه, ذلك, ذاك, تلك, لهذا, بهذا, لهذه, بهذه, لذلك} and then by a word that is not a verb, there is no segment boundary.

Example:
(وقف معلمنا سي حامد, هذا اليوم, أمامنا ينظر في وجوهنا مليّا.)

[Mr. Hamed, our teacher, was standing up, looking at us.]
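The comma rules of Step 3 can be expressed as a small decision function. The sketch below is a hedged illustration rather than the authors' code. It assumes whitespace tokenization, assumes the conjunction is written attached to the following word (as in "وعلى"), compares words without diacritics, and takes a caller-supplied `is_verb` predicate because the slides do not say how verbs are detected.

```python
CONJUNCTIONS = ("و", "ف")
PREPOSITIONS = {"في", "على", "من", "عن", "إلى"}
# Written without diacritics, so "لكن" here stands for the possessive form.
POSSESSIVES = {"له", "لها", "لهما", "لهم", "لهن", "لك", "لكما", "لكم", "لكن", "لي", "لنا"}
DEMONSTRATIVES = {"هذا", "هذه", "ذلك", "ذاك", "تلك", "لهذا", "بهذا", "لهذه", "بهذه", "لذلك"}


def strip_conjunction(token):
    """Return the token without a leading waw/fa clitic, or None if it has none."""
    if len(token) > 1 and token[0] in CONJUNCTIONS:
        return token[1:]
    return None


def comma_decision(next_token, following_token, is_verb):
    """True = segment ends at the comma, False = explicitly no boundary,
    None = the rules on the slides do not apply."""
    rest = strip_conjunction(next_token)
    if rest in PREPOSITIONS or rest in POSSESSIVES:
        return True
    if next_token in DEMONSTRATIVES and not is_verb(following_token):
        return False
    return None


def segment(tokens, is_verb=lambda token: False):
    """Split a token list at commas that the rules mark as boundaries.
    The default is_verb is a placeholder that never reports a verb."""
    segments, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token.endswith((",", "،")):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            after = tokens[i + 2] if i + 2 < len(tokens) else ""
            if comma_decision(nxt, after, is_verb) is True:
                segments.append(current)
                current = []
    if current:
        segments.append(current)
    return segments
```

The three-valued result keeps the "no boundary" rule for demonstratives distinct from the case where no rule fires, which matters if these rules are later combined with a default treatment of commas.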

Page 8: Discourse annotation


THE RESULT

Page 9: Discourse annotation


SEMANTIC-BASED SEGMENTATION FOR ARABIC TEXT

In this approach, the aim is to divide the text into complete, meaningful parts that can stand on their own, without the parts that precede or follow them.

Connectors classification:
- Active: words that indicate the beginning of a new segment, the end of a segment, or a complete segment (لكن, هنالك, ...).
- Passive: words that do not indicate a new segment, the end of a segment, or a complete segment by themselves, but when they occur with active elements, they contribute to determining where segments start or end.

Page 10: Discourse annotation


METHODOLOGY

1. Identifying the connectors that indicate complete segments (with S instances in the SegBoundary property).
2. Locating the active connectors.
3. Resolving the case where adjacent active connectors exist.
4. Setting the segment boundaries.
5. Creating the final list of segments.
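Steps 2 to 5 of this pipeline can be illustrated with a short Python sketch. It is a simplification under several assumptions that do not come from the slides: every active connector is taken to open a new segment, a run of adjacent active connectors is resolved by keeping only the first one, and the connector list holds just the two examples shown earlier. Step 1 (connectors that form complete segments on their own) is not modeled.

```python
ACTIVE_CONNECTORS = {"لكن", "هنالك"}  # only the two examples from the slides


def locate_active(tokens, active=ACTIVE_CONNECTORS):
    """Step 2: positions of active connectors in the token list."""
    return [i for i, token in enumerate(tokens) if token in active]


def resolve_adjacent(positions):
    """Step 3 (assumed policy): in a run of adjacent connectors keep the first."""
    kept, previous = [], None
    for position in positions:
        if previous is None or position != previous + 1:
            kept.append(position)
        previous = position
    return kept


def segment(tokens, active=ACTIVE_CONNECTORS):
    """Steps 4 and 5: turn connector positions into boundaries and cut."""
    starts = resolve_adjacent(locate_active(tokens, active))
    bounds = sorted({0, *starts, len(tokens)})
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]
```

Passive connectors would enter this picture as extra evidence that shifts or confirms a boundary next to an active connector; the sketch leaves that part out because the slides do not describe the mechanism.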

Page 11: Discourse annotation


THE RESULT

Page 12: Discourse annotation

ARABIC DISCOURSE SEGMENTATION BASED ON RHETORICAL METHOD

This technique is derived from Arabic rhetoric.

It focuses on the connector Waw "و".
It categorizes the six known rhetorical types of "و" into two classes: "Fasl" and "Wasl".

They use SVM machine learning.

"Fasl": types 1, 2 and 3
"Wasl": types 4, 5 and 6

Page 13: Discourse annotation


EXAMPLES

Waw1 "و القسم":
الأساتذة يعلمون التلاميذ العلم والفضيلة والله إنهم ليقدمون عملا عظيما للأمة.
[Professors teach students sciences and virtue, I swear to God, they have done a great mission for their nation]

Waw2 "و رب":
الشباب ليسوا وحدهم الذين يعانون بل إن أزماتهم جزء من أزمات المجتمع كله وربّ سائل يقول: "لماذا ركزتم على الشباب من بين طبقات المجتمع؟"
[Young people are not the only ones who suffer, but their crises are part of the crises of the whole society and someone may ask: Why have you focused only on youth and not on the divisions of the whole society?]

Waw3 "و الاستئناف":
يعاني المراهقون من بعض المشكلات النفسية و المجتمع عامة به سلبيات أخرى كثيرة.
[Adolescents suffer from some psychological problems and there are, in general, other numerous problems in the society.]

Waw4 "و الحال":
دخل المدرس الفصل وهو يبتسم.
[The teacher came into the classroom smiling.]

Waw5 "و المعية":
جلس الحبيبان وضوء القمر.
[The couple sat together with the light of the moon.]

Waw6 "و العطف":
بدأت الدراسة وانتظم المعلمون والطلاب في المدارس.
[The study started and students and teachers enrolled in schools.]

Page 14: Discourse annotation


METHODOLOGY

Preprocessing:
- Diacritization.
- Discriminating the connector "و" from the letter "و".

Feature extraction:
- They extract 22 features to distinguish each type of "و".

Classification

Page 15: Discourse annotation


FEATURE EXTRACTION

Waw1: "و القسم"
- X1 = "الله" and X7 = genitive mark.
- X3 = noun, X7 = genitive mark and X16 = no.

Waw2: "و رب"
- X1 = "رب" and X7 = accusative mark.
- X3 = noun, X5 = indefinite, X6 ≠ genitive mark and X7 = genitive mark.

Waw3: "و الاستئناف"
- X12 ≠ X13.
- X14 ≠ X15.
- X19 ≠ X20.
- X21 = no and X22 = no.

Waw4: "و الحال"
- X16 = yes.
- X1 = "قد", X10 = verb and X11 = past tense.

Waw5: "و المعية"
- X3 = noun and X7 = accusative mark.

Waw6: "و العطف"
- X2 = X3, X6 = X7, and (X4 = X5 OR X8 = X9 OR X17 = X18).
- X12 = X13, X14 = X15, X19 = X20 and (X21 = yes OR X22 = yes).
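Read as if-then rules, the feature patterns above can be written down directly. The Python sketch below is only an illustration of that reading: the authors classify with an SVM, not with a rule cascade. It assumes that each listed line is an alternative pattern for its class, that classes are tried in slide order with the first match winning, and that feature values are encoded as the strings used here ("noun", "genitive", "yes", and so on). The meaning of X1 to X22 is not given on the slides, so the features are kept as opaque keys.

```python
def classify_waw(features):
    """Return the first Waw class whose pattern matches, or None.

    `features` maps names "X1".."X22" to values; missing features never match.
    """
    get = features.get

    def same(a, b):
        return get(a) is not None and get(a) == get(b)

    def differ(a, b):
        return get(a) is not None and get(b) is not None and get(a) != get(b)

    rules = [
        ("Waw1", get("X1") == "الله" and get("X7") == "genitive"),
        ("Waw1", get("X3") == "noun" and get("X7") == "genitive"
                 and get("X16") == "no"),
        ("Waw2", get("X1") == "رب" and get("X7") == "accusative"),
        ("Waw2", get("X3") == "noun" and get("X5") == "indefinite"
                 and get("X6") != "genitive" and get("X7") == "genitive"),
        ("Waw3", differ("X12", "X13") or differ("X14", "X15")
                 or differ("X19", "X20")
                 or (get("X21") == "no" and get("X22") == "no")),
        ("Waw4", get("X16") == "yes"),
        ("Waw4", get("X1") == "قد" and get("X10") == "verb"
                 and get("X11") == "past"),
        ("Waw5", get("X3") == "noun" and get("X7") == "accusative"),
        ("Waw6", same("X2", "X3") and same("X6", "X7")
                 and (same("X4", "X5") or same("X8", "X9") or same("X17", "X18"))),
        ("Waw6", same("X12", "X13") and same("X14", "X15") and same("X19", "X20")
                 and (get("X21") == "yes" or get("X22") == "yes")),
    ]
    for label, matched in rules:
        if matched:
            return label
    return None
```

One detail worth noticing: the four Waw3 patterns together are the logical negation of the second Waw6 pattern, so on features X12 to X22 the two classes are complementary.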

Page 16: Discourse annotation


THE RESULT

The Corpus of Arabic Discourse Segmentation was used in this experiment.

They used 1200 instances for training and 293 for testing.
Class Waw5 did not appear in either the training or the testing data.
Classes Waw3 and Waw6 appear most frequently.

Segmentation accuracy = 98.98%
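As a quick sanity check on the reported figure, the snippet below looks for the number of correctly handled test instances that would round to 98.98%. It assumes, which the slides do not state, that the accuracy was computed over the 293 test instances and rounded to two decimals.

```python
TEST_INSTANCES = 293        # from the slide
REPORTED_ACCURACY = 98.98   # percent, from the slide

# Every count of correct instances whose accuracy rounds to the reported value.
matching_counts = [
    correct
    for correct in range(TEST_INSTANCES + 1)
    if round(100 * correct / TEST_INSTANCES, 2) == REPORTED_ACCURACY
]
```

Under that assumption the only matching count is 290, which would mean 3 misclassified instances out of 293.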

Page 17: Discourse annotation


THE LEEDS ARABIC DISCOURSE TREEBANK: ANNOTATING DISCOURSE CONNECTIVES FOR ARABIC

This is the first effort toward producing an Arabic Discourse Treebank.
Discourse connectives are defined as lexical expressions that relate two text segments. The segments are called arguments.
Discourse relations play an important role in producing a coherent discourse.

Collecting Arabic connectives:
- They used text analysis and corpus-based techniques.
- Connectives were manually extracted from 50 randomly selected texts from the PATB and from 10 different websites.
- The resulting list was manually tested by two native speakers.
- The final list contains 107 discourse connectives.

Page 18: Discourse annotation


CONT. Types Of Relations:

Page 19: Discourse annotation


CONT. Agreement Studies:

The Corpus: PATB ADA Tool & Annotating process.

After annotating

Page 20: Discourse annotation


METHODOLOGY

The annotation was done by two independent native speakers of Arabic.

Agreement is measured on two tasks:
Task 1: measures whether annotators agree on the binary decision of whether an item constitutes a discourse connective in context.
Task 2: measures whether annotators agree on which discourse relation an identified connective expresses.
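The slides do not say which agreement statistic was used. For two annotators, Cohen's kappa is a common choice, and the sketch below computes it from two parallel label lists; it is offered as an illustration of how agreement on either task could be quantified, not as the study's actual measure. For Task 1 the labels would be binary (connective or not), for Task 2 they would be relation names.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance, which matters most for Task 2, where a few frequent relations can inflate simple percentage agreement.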

Page 21: Discourse annotation


THE RESULT

Agreement on TASK I is highly reliable.

Agreement on TASK II (relation assignment) is relatively low.

Page 22: Discourse annotation


MODELLING DISCOURSE RELATIONS FOR ARABIC.

Discourse Connective Recognition.

Discourse connective recognition distinguishes between the discourse usage and the non-discourse usage of potential connectives.

Conjunctions such as و /w/ (and) and او /āw/ (or) can have a discourse usage, or they can simply conjoin two non-abstract entities, as in عمر و سارة /mr w sārh/ (Omar and Sarah).

Page 23: Discourse annotation


CONT.

Features:
1. Surface features (SConn).
2. Part-of-speech features (POS).
3. Lexical features of surrounding words (Lex).
4. Syntactic category of related phrases (Syn).
5. Al-Masdar feature.
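To make the feature groups more concrete, here is a minimal sketch of how the surface, POS and lexical-context features of one candidate connective might be collected into a dictionary. The feature names, the sentence-edge markers and the choice of a one-word window are assumptions made for illustration; the syntactic and Al-Masdar features need a parser or morphological analysis and are left out.

```python
def connective_features(tokens, index, pos_tags=None):
    """Collect illustrative features for the candidate connective tokens[index].

    pos_tags, if given, is a list of tags parallel to tokens (hypothetical input).
    """
    previous_word = tokens[index - 1] if index > 0 else "<s>"
    next_word = tokens[index + 1] if index + 1 < len(tokens) else "</s>"
    features = {
        "SConn": tokens[index],                 # surface form of the candidate
        "sentence_initial": index == 0,         # assumed surface feature
        "Lex_prev": previous_word,              # word to the left
        "Lex_next": next_word,                  # word to the right
    }
    if pos_tags is not None:
        features["POS"] = pos_tags[index]
        features["POS_prev"] = pos_tags[index - 1] if index > 0 else "<s>"
        features["POS_next"] = pos_tags[index + 1] if index + 1 < len(tokens) else "</s>"
    return features
```

For the phrase عمر و سارة, the candidate "و" would be described by its neighbours "عمر" and "سارة"; a classifier could learn that a conjunction between two proper nouns is unlikely to be a discourse usage.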

Page 24: Discourse annotation


RESULTS AND DISCUSSION

Page 25: Discourse annotation


Discourse Relation Recognition:

1. Connective features.
2. Words and POS of arguments. E.g. when the first word of Arg2 is قد /qd/ (might/may) or كان /kān/ (had), the relation is likely to be EXPANSION.BACKGROUND or EXPANSION.CONJUNCTION.
3. Tense and negation.
4. Masdar.
5. Argument parent.
6. Production rules.

Page 26: Discourse annotation


Performance of different models for identifying fine-grained discourse relations on two datasets

Performance of different models for identifying class-level discourse relations on two datasets

Page 27: Discourse annotation


CONCLUSION

In this survey, we presented several techniques for annotating discourse connectives and for segmenting Arabic text. These techniques rely on different corpora and methods and, accordingly, they report quite different results.

Page 28: Discourse annotation


THANKS!