UrduGram: Towards a Deep, Large-Coverage Grammar for Urdu and Hindi

Tafseer Ahmed, Tina Bögel, Miriam Butt, Annette Hautli, Ghulam Raza, Sebastian Sulger and Veronika Walther

Universität Konstanz

FB Kolloquium, May 2010

Preview

1 Urdu & the UrduGram Project

2 Urdu Transliterator

3 Syntax

4 Semantics

Urdu & the UrduGram Project

Urdu

Urdu is

a South Asian language spoken primarily in Pakistan and India

descended from (a version of) Sanskrit (sister language of Latin)

structurally identical to Hindi (spoken mainly in India)

together with Hindi the fourth most spoken language in the world (∼250 million native speakers)

Urdu and Hindi

The two languages are regarded as structurally identical:

syntax/morphology are practically identical

vocabulary is practically identical (Urdu: borrowed from Persian/Arabic; Hindi: borrowed from Sanskrit)

main difference is in the script

→ We are developing a single grammar and lexicon for both of the languages!

Context of Work

Computational LFG grammar in development in Konstanz

Aim: large-scale LFG grammar for parsing Urdu/Hindi

Grammar is part of the ParGram project

Collaborative, world-wide research project

Devoted to developing parallel LFG grammars for a variety of languages

Features and analyses are kept parallel for easy transfer between languages

Languages involved:

→ English, German, French, Japanese, Norwegian, Welsh, Georgian, Hungarian, Turkish, Chinese, Indonesian, Urdu (among many others)

The ParGram Grammar Architecture

The ‘Parallel’ in ParGram

Analysis for a transitive sentence in the English ParGram grammar (F-Structure, “Functional Structure”):

"Nadya saw the book."

57:[ PRED 'see<[1:Nadya], [113:book]>'
     SUBJ 1:[ PRED 'Nadya'
              CHECK [ _LEX-SOURCE morphology, _PROPER known-name ]
              NTYPE [ NSEM [ PROPER [ NAME-TYPE first_name, PROPER-TYPE name ] ]
                      NSYN proper ]
              CASE nom, GEND-SEM female, HUMAN +, NUM sg, PERS 3 ]
     OBJ 113:[ PRED 'book'
               CHECK [ _LEX-SOURCE countnoun-lex ]
               NTYPE [ NSEM [ COMMON count ]
                       NSYN common ]
               SPEC [ DET [ PRED 'the', DET-TYPE def ] ]
               CASE obl, NUM sg, PERS 3 ]
     CHECK [ _SUBCAT-FRAME V-SUBJ-OBJ ]
     TNS-ASP [ MOOD indicative, PERF -_, PROG -_, TENSE past ]
     CLAUSE-TYPE decl, PASSIVE -, VTYPE main ]

The ‘Parallel’ in ParGram (cont.)

Analysis for the same transitive sentence in the Urdu ParGram grammar (F-Structure, “Functional Structure”):

"nAdiyah nE kitAb dEkHI"

42:[ PRED 'dEkH<[1:nAdiyah], [20:kitAb]>'
     SUBJ 1:[ PRED 'nAdiyah'
              CHECK [ _NMORPH obl ]
              NTYPE [ NSEM [ PROPER [ PROPER-TYPE name ] ]
                      NSYN proper ]
              SEM-PROP [ SPECIFIC + ]
              CASE erg, GEND fem, NUM sg, PERS 3 ]
     OBJ 20:[ PRED 'kitAb'
              NTYPE [ NSEM [ COMMON count ]
                      NSYN common ]
              CASE nom, GEND fem, NUM sg, PERS 3 ]
     CHECK [ _VMORPH [ _MTYPE infl ]
             _RESTRICTED -, _SUBCAT-FRAME V-SUBJ-OBJ, _VFORM perf ]
     LEX-SEM [ AGENTIVE + ]
     TNS-ASP [ ASPECT perf, MOOD indicative ]
     CLAUSE-TYPE decl, PASSIVE -, VTYPE main ]

→ Analyses are kept parallel where possible

→ Features are kept parallel where possible
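
To make the notion of parallel features concrete, here is a minimal sketch in Python. It stores simplified versions of the two f-structures above as nested dictionaries and compares their attribute paths. The dictionary encoding, the function name `feature_paths`, and the decision to omit the CHECK features and the structure indices are assumptions made for illustration; this is not how XLE represents f-structures internally.

```python
from typing import Any, Dict, Set, Tuple

FStructure = Dict[str, Any]

# Simplified from the English analysis; CHECK features and indices are omitted.
english: FStructure = {
    "PRED": "see<SUBJ, OBJ>",
    "SUBJ": {"PRED": "Nadya", "NTYPE": {"NSYN": "proper"},
             "CASE": "nom", "NUM": "sg", "PERS": 3},
    "OBJ": {"PRED": "book", "NTYPE": {"NSYN": "common"},
            "SPEC": {"DET": {"PRED": "the", "DET-TYPE": "def"}},
            "CASE": "obl", "NUM": "sg", "PERS": 3},
    "TNS-ASP": {"MOOD": "indicative", "TENSE": "past"},
    "CLAUSE-TYPE": "decl", "PASSIVE": "-", "VTYPE": "main",
}

# Simplified from the Urdu analysis in the same way.
urdu: FStructure = {
    "PRED": "dEkH<SUBJ, OBJ>",
    "SUBJ": {"PRED": "nAdiyah", "NTYPE": {"NSYN": "proper"},
             "CASE": "erg", "GEND": "fem", "NUM": "sg", "PERS": 3},
    "OBJ": {"PRED": "kitAb", "NTYPE": {"NSYN": "common"},
            "CASE": "nom", "GEND": "fem", "NUM": "sg", "PERS": 3},
    "TNS-ASP": {"ASPECT": "perf", "MOOD": "indicative"},
    "CLAUSE-TYPE": "decl", "PASSIVE": "-", "VTYPE": "main",
}


def feature_paths(fs: FStructure, prefix: Tuple[str, ...] = ()) -> Set[str]:
    """Collect the dotted path of every atomic-valued attribute."""
    paths: Set[str] = set()
    for attr, value in fs.items():
        path = prefix + (attr,)
        if isinstance(value, dict):
            paths |= feature_paths(value, path)
        else:
            paths.add(".".join(path))
    return paths


shared = feature_paths(english) & feature_paths(urdu)
only_english = feature_paths(english) - feature_paths(urdu)
only_urdu = feature_paths(urdu) - feature_paths(english)

print(sorted(shared))
print(sorted(only_english))
print(sorted(only_urdu))
```

In this simplified encoding most attribute paths are shared. The differences are language-specific material such as the English determiner and TENSE versus the Urdu GEND and ASPECT features, which fits the "parallel where possible" wording.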

Demo: Large-Scale English ParGram Grammar

Computational Grammars - What For?

The Motivation behind ParGram

The ParGram project is working on Deep Grammars

Provide detailed syntactic and semantic analyses

Encode grammatical functions, tense, number etc.

Linguistically motivated

Usually manually constructed (→ linguistic intuition)

Possible Applications

Large-Coverage, Deep Computational Grammars can be useful for:

Meaning-Sensitive Applications

Web-Search

Question-Answering

Knowledge Representation

Text Summarization

Machine Translation

Computer-Assisted Language Learning

powerset.com

“Semantic search engine”

Uses large-scale English LFG

Works on English Wikipedia

Parses query and matches with parsed corpus

→ Can give better results than regular search engines

(Example: ‘X was bought by Y’ vs. ‘Y acquired X’)
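
The following toy sketch in Python illustrates why matching on parsed structure can help with the example above. It is not Powerset's method: the hand-written "parses", the `Fact` tuple, the `normalize` function, and the one-entry synonym table are all hypothetical stand-ins for what a deep grammar and a lexical resource would supply.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    pred: str
    agent: Optional[str]
    theme: Optional[str]


# Hypothetical synonym table mapping predicates to one canonical form.
CANONICAL_PRED = {"buy": "acquire", "acquire": "acquire"}


def normalize(pred: str, subj: str, obj: Optional[str] = None,
              by_agent: Optional[str] = None, passive: bool = False) -> Fact:
    """Map a toy parse to a canonical fact, undoing passivization."""
    if passive:
        # In a passive clause the surface subject is the theme.
        agent, theme = by_agent, subj
    else:
        agent, theme = subj, obj
    return Fact(CANONICAL_PRED.get(pred, pred), agent, theme)


# 'X was bought by Y': passive, surface subject X, by-phrase Y.
indexed = normalize("buy", subj="X", by_agent="Y", passive=True)
# 'Y acquired X': active, subject Y, object X.
query = normalize("acquire", subj="Y", obj="X")

print(indexed == query)
```

A plain keyword comparison of the two sentences shares only the strings X and Y, whereas the normalized facts are identical.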

Our Overall Architecture

Our parsing architecture currently looks like this:

tokenizer
↓
transliterator (Urdu & Hindi to Roman script)
↓
morphology (FST)
↓
syntax (c- and f-structure) (XLE)
↓
semantics (XFR ordered rewriting)

XLE is the overall development platform, with the other modules (FST and XFR) being plugged into it.

Urdu Transliterator

Aim of the transliterator

Our aim is to build and integrate a transliterator that allows both Urdu and Hindi to be parsed and generated with the same grammar.

[Figure: a couplet by the poet Mirza Ghalib in Urdu and in Hindi script; both are linked to the Romanized script used by the XLE grammar]

→ Right now we are working on the Urdu-Roman transliterator.

Transliteration scheme

An excerpt from our scheme table:

Urdu character (Unicode)   Latin letter in transliteration scheme   Phoneme
ب (U+0628)                 b                                        /b/
پ (U+067E)                 p                                        /p/
ت (U+062A)                 t                                        /t/
ٹ (U+0679)                 T                                        /ʈ/
ج (U+062C)                 j                                        /d͡ʒ/
چ (U+0686)                 c                                        /t͡ʃ/

Basic idea of the transliterator

Use a finite-state transducer so that the same mapping serves both parsing and generation.

Urdu script: با
parsing ↓   ↑ generating
ASCII: bA

The same concept will be used to create a transliterator for Hindi/Devanagari

This way we can parse Urdu script and generate Hindi script(and vice versa)
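
The bidirectional idea can be sketched in Python as follows. This is a simplification and not the project's transducer: a real finite-state transducer encodes a relation with context-dependent rules, while the sketch uses a one-to-one character table that is simply inverted for generation. The table holds the six letters from the scheme excerpt plus alif mapped to `A`, which is an assumption based on the bA example; the names `parse` and `generate` are hypothetical.

```python
# Excerpt of the scheme, keyed by Unicode code point.
URDU_TO_LATIN = {
    "\u0628": "b",  # beh
    "\u067E": "p",  # peh
    "\u062A": "t",  # teh
    "\u0679": "T",  # tteh (retroflex)
    "\u062C": "j",  # jeem
    "\u0686": "c",  # tcheh
    "\u0627": "A",  # alif (assumed mapping, taken from the bA example)
}

# Generation uses the same table read in the opposite direction.
LATIN_TO_URDU = {latin: urdu for urdu, latin in URDU_TO_LATIN.items()}


def parse(urdu_text: str) -> str:
    """Urdu script to the ASCII-based scheme."""
    return "".join(URDU_TO_LATIN[ch] for ch in urdu_text)


def generate(latin_text: str) -> str:
    """ASCII-based scheme back to Urdu script."""
    return "".join(LATIN_TO_URDU[ch] for ch in latin_text)


word = "\u0628\u0627"  # beh + alif
print(parse(word))
print(generate(parse(word)) == word)
```

Because generation is the inverse of parsing, pairing the Urdu table with an analogous Devanagari table would let Urdu input be parsed and Hindi output generated through the shared Roman form.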

Position of the transliterator

the transliterator is composed with the tokenizer (which separates the words within a sentence)

tokenizer and transliterator are placed in front of the morphology

Transliterator   Input:  کتاب
                 ↓
                 Output: kitAb

Morphology       Input:  kitAb
                 ↓
                 Output: kitAb+Noun+Fem+Sg+Count

XLE              ...
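
The ordering of the modules can be sketched as a chain of functions in Python. Everything here is a toy stand-in: the two one-entry lookup tables, the whitespace tokenizer, and the `+Unknown` fallback tag are assumptions for illustration, whereas the real modules are finite-state transducers composed inside XLE.

```python
from typing import Dict, List

# Hypothetical one-entry tables standing in for the real transducers.
TRANSLITERATION: Dict[str, str] = {
    "\u06A9\u062A\u0627\u0628": "kitAb",  # keheh, teh, alif, beh
}
MORPHOLOGY: Dict[str, str] = {
    "kitAb": "kitAb+Noun+Fem+Sg+Count",
}


def tokenize(sentence: str) -> List[str]:
    """Separate the words of a sentence (here: split on whitespace)."""
    return sentence.split()


def transliterate(token: str) -> str:
    """Urdu script to Roman script; unknown tokens pass through unchanged."""
    return TRANSLITERATION.get(token, token)


def analyze(token: str) -> str:
    """Roman-script token to a morphological analysis string."""
    return MORPHOLOGY.get(token, token + "+Unknown")


def preprocess(sentence: str) -> List[str]:
    """Tokenizer, then transliterator, then morphology, in that order."""
    return [analyze(transliterate(tok)) for tok in tokenize(sentence)]


print(preprocess("\u06A9\u062A\u0627\u0628"))
```

The analysis strings returned by `preprocess` correspond to what the syntax component receives as input.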

Example

→ The transliterator at this position works quite well:

(1) laRkE  kI   kitAb
    boy    gen  book
    ‘The boy’s book’

→ Problem: long sentences, or words that are highly ambiguous in the script, need some time to parse.

Problems of the script - an example

The problem of the vowels ...

Diacritics represent short vowels

Urdu script   Roman script
بَ             ba
بِ             bi
بُ             bu

(2) nAdyA  nE   yasIn  kO   kitAb  dEkHnE  dI
    Nadya  erg  Yasin  dat  book   see     let
    ‘Nadya let Yasin see the book’

نادیا نے یَسین کو کِتاب دیکھنے دی

Unfortunately, these diacritics tend to be left out.

نادیا نے یسین کو کتاب دیکھنے دی
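
The effect of dropping the diacritics can be reproduced with Python's standard `unicodedata` module. The sketch below removes every character with a non-zero canonical combining class, which covers the three short-vowel marks but also any other combining mark, so it is a simplification. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop all combining marks, including the short-vowel diacritics."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


ba = "\u0628\u064E"  # beh + fatha (zabar), read "ba"
bi = "\u0628\u0650"  # beh + kasra (zer), read "bi"
bu = "\u0628\u064F"  # beh + damma (pesh), read "bu"

bare = {strip_diacritics(s) for s in (ba, bi, bu)}
print(len(bare))
```

All three syllables collapse to the same bare letter, so a reader or a transliterator has to recover the short vowel from context.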

Consequences

If the input is without diacritics, e.g. ...

Urdu script   letter combination   representation   translation
کتاب          ktAb                 kitAb            ‘book’

... then there are all kinds of possible combinations: kitAb, kutaAb, kitAbu, ikatAubi, ukitAbia, akatAbu, aukatAib ...

(demo)
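
The overgeneration can be made concrete with a small Python sketch. It is only an illustration, not the transliterator itself: it assumes that at most one short vowel (a, i, u) may be inserted in each slot before, between and after the letters of ktAb. The name `vocalizations` is hypothetical. The combinations listed above also contain slots with two short vowels, so the real search space is larger still.

```python
from itertools import product

# "" means that no short vowel is inserted at this slot.
SHORT_VOWELS = ("", "a", "i", "u")

def vocalizations(letters):
    """Yield every string obtained by optionally inserting one short vowel
    before each letter and after the last letter."""
    slots = len(letters) + 1
    for choice in product(SHORT_VOWELS, repeat=slots):
        # zip stops after the last letter; the final slot is appended separately
        body = "".join(vowel + letter for vowel, letter in zip(choice, letters))
        yield body + choice[-1]

candidates = set(vocalizations("ktAb"))
print(len(candidates))            # number of candidate readings for four letters
print("kitAb" in candidates)      # the intended reading is among them
print("ikatAubi" in candidates)   # ... and so is a clearly impossible one
```

Even under this simplifying assumption a four-letter word already yields more than a thousand candidates, which is why unrestricted input is slow to parse.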

21 / 60

Urdu Transliterator

Solution

To restrict this overgeneration, the possible letter combinations need to be constrained:

which vowels are actually allowed to cooccur?

→ ai, but not ia?

which consonants are actually allowed to cooccur?

→ initial kr, but not gr?

certain combinations with semi-vowels or consonants are not allowed:

→ a short vowel followed by v may not be followed by u or i

certain positions are prohibited:

→ a word can never end in a short vowel, nor begin with a short vowel that is only represented with a diacritic
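
As a rough sketch of how such constraints prune the candidate space, the Python fragment below encodes the positional constraint and the v constraint as regular-expression filters. The regular expressions only stand in for whatever rule formalism the transliterator actually uses. The names `FILTERS` and `well_formed` are hypothetical, and the generator again assumes at most one short vowel per slot.

```python
import re
from itertools import product

SHORT_VOWELS = ("", "a", "i", "u")

def vocalizations(letters):
    """Candidate readings with at most one short vowel per slot (simplification)."""
    for choice in product(SHORT_VOWELS, repeat=len(letters) + 1):
        yield "".join(v + c for v, c in zip(choice, letters)) + choice[-1]

# A well-formed candidate must not match any of these patterns.
FILTERS = [
    re.compile(r"^[aiu]"),      # word-initial short vowel written only as a diacritic
    re.compile(r"[aiu]$"),      # word-final short vowel
    re.compile(r"[aiu]v[ui]"),  # short vowel + v followed by u or i
]

def well_formed(candidate):
    return not any(pattern.search(candidate) for pattern in FILTERS)

survivors = sorted(c for c in vocalizations("ktAb") if well_formed(c))
print(len(survivors))
print("kitAb" in survivors, "kitAbu" in survivors)
```

The two positional filters alone remove every candidate with a vowel in the first or last slot. Further filters on vowel and consonant co-occurrence would narrow the set again.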

22 / 60

Urdu Transliterator

Solution

write rules and filters based on these constraints and apply them to the transliterator

(demo)

Problem: these “rules” cannot be found in the literature; they are the product of extensive manual labor

However, the transliterator works quite well now

→ Some sentences are still a little slow (but I keep looking for possible restrictions)

→ next: continue with the generation of Urdu and with the Hindi transliterator

23 / 60

Urdu Transliterator

Overview

Overall Architecture

tokenizer
↓
transliterator (Urdu & Hindi to Roman script)
↓
morphology (fst)
↓
syntax (c- and f-structure) (xle)
↓
semantics (xfr ordered rewriting)

24 / 60

Syntax

Syntax

the syntax component is at the core of the Urdu grammar

theoretical background: LFG (Lexical-Functional Grammar)

well-studied (∼ 30 years) framework with computational usability

c- and f-structures used for syntactic representation

c-structure: basic constituent structure (“tree”) and linear precedence (∼ what parts belong together)

f-structure: encodes syntactic functions and properties

25 / 60

Syntax

Syntax

C-structure for "nAdiyah hansI":

[ROOT [S [KP [NP [N nAdiyah]]] [VCmain [V hansI]]]]

F-structure for "nAdiyah hansI":

PRED      'hans<[1:nAdiyah]>'
SUBJ      [1] PRED      'nAdiyah'
              NTYPE     [NSEM [PROPER [PROPER-TYPE name]], NSYN proper]
              SEM-PROP  [SPECIFIC +]
              CASE nom, GEND fem, NUM sg, PERS 3
CHECK     [_VMORPH [_MTYPE infl], _RESTRICTED -, _SUBCAT-FRAME V-SUBJ, _VFORM perf]
LEX-SEM   [VERB-CLASS unerg]
TNS-ASP   [ASPECT perf, MOOD indicative]
CLAUSE-TYPE decl, PASSIVE -, VTYPE main
(f-structure index: 18)

current size: 53 phrase-structure rules, annotated for syntactic function (usual size of large-scale grammars: 350–400 rules)

coverage: basic clauses with free word order, NP syntax, tense and aspect, causative verbs, complex predicates, relative clauses, passives, semantically-based case marking
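
To make the division of labour between the two levels tangible, here is a minimal Python sketch of the analysis of nAdiyah hansI. It assumes a c-structure can be modelled as nested tuples and an f-structure as nested dictionaries. This is only a convenient stand-in for XLE's internal representation, only part of the f-structure is copied, and the helper names `terminals` and `get_path` are hypothetical.

```python
# c-structure: (label, child, child, ...) with plain strings as leaves
c_structure = (
    "ROOT",
    ("S",
     ("KP", ("NP", ("N", "nAdiyah"))),
     ("VCmain", ("V", "hansI"))),
)

# f-structure: attribute-value matrix as nested dictionaries (excerpt)
f_structure = {
    "PRED": "hans<[1:nAdiyah]>",
    "SUBJ": {"PRED": "nAdiyah", "CASE": "nom", "GEND": "fem", "NUM": "sg", "PERS": 3},
    "TNS-ASP": {"ASPECT": "perf", "MOOD": "indicative"},
    "CLAUSE-TYPE": "decl",
}

def terminals(tree):
    """Return the leaves of a tuple tree from left to right (linear precedence)."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [leaf for child in children for leaf in terminals(child)]

def get_path(fs, *attributes):
    """Follow a path of attributes through a nested f-structure."""
    for attribute in attributes:
        fs = fs[attribute]
    return fs

print(" ".join(terminals(c_structure)))
print(get_path(f_structure, "SUBJ", "CASE"))
```

The tree records what belongs together and in which order, while the attribute paths record grammatical functions and features independently of word order.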

26 / 60

Syntax

Discontinuous NPs in Urdu

1 Well known discontinuities

2 NP-internal discontinuity in Urdu

3 LFG implementation

4 Conclusion

27 / 60

Syntax

Extraction from DP

(2) a. Er  hat  viele  Bücher  über  Logik  gekauft.
       he  has  many   books   on    logic  bought
       ‘He has bought many books about logic.’

    b. Bücher über Logik hat er viele gekauft.

    c. Über Logik hat er viele Bücher gekauft. (German)

(3) mantiq=par    nidA=nE   Ek   kitAb       xarId-I   he.
    logic=Loc.on  Nida=Erg  one  book.F.3Sg  buy-Perf  be.Pres
    ‘Nida has purchased a book on logic.’ (Urdu)

28 / 60

Syntax

Quantifier Float

(4) a. They all have bought a car.

b. They have all bought a car.

(5) Am        alI=nE   bahut  kHA-E
    mango.Pl  Ali=Erg  many   eat-Perf
    ‘Ali ate many mangoes.’ (Urdu)

29 / 60

Syntax

Constituent-level discontinuities in Urdu

NP-internal discontinuity

Discontinuous NP

Discontinuous AP

30 / 60

Syntax

When NP-internal discontinuity occurs in Urdu

NP-internal discontinuity in Urdu can occur when the argument-taking noun is modified by:

argument-taking adjectives

argument-taking specifier nouns

31 / 60

Syntax

Argument-taking adjectives in Urdu

Nr.    Type of Argument   Example of Adjective Phrase

(i)    Dative Marked      sadr=kO        hAsil
                          president=Dat  possessed
                          ‘possessed by the president’

(ii)   Ablative Marked    adliyah=sE  xAif
                          courts=Abl  afraid
                          ‘afraid of courts’

(iii)  Locative Marked    buxAr=mEN     mubtalA
                          fever=Loc.in  suffered
                          ‘suffered with fever’

(iv)   Adpositional       sihat=kE    liyE  muzir
                          health=Gen  for   harmful
                          ‘harmful for health’

32 / 60

Syntax

Simple examples of argument-taking nouns

(6) a. istisnA
       ‘immunity’

    b. muqaddamAt=sE      istisnA
       court-case.Pl=Abl  immunity
       ‘immunity from court-cases’

    c. muqaddamAt=sE      AInI            istisnA
       court-case.Pl=Abl  constitutional  immunity
       ‘constitutional immunity from court-cases’

33 / 60

Syntax

Simple examples of argument-taking nouns

(7) a. barIfiNg
       ‘briefing’

    b. salAmtI=par   barIfiNg
       security=Loc  briefing
       ‘briefing on security’

    c. salAmtI=par   tafsIlI   barIfiNg
       security=Loc  detailed  briefing
       ‘detailed briefing on security’

34 / 60

Syntax

Simple examples of argument-taking nouns

(8) a. mutAlbA
       ‘demand’

    b. ArmI-cIf=sE     mutAlbA
       army-chief=Abl  demand
       ‘demand to the army-chief’

    c. ArmI-cIf=sE     qAnUnI  mutAlbA
       army-chief=Abl  legal   demand
       ‘legal demand to the army-chief’

35 / 60

Syntax

Examples of discontinuous NPs

(9) a1. sadr=kO_1      hAsil_1    muqaddamAt=sE_2  AInI            istisnA_2
        president=Dat  possessed  court-cases=Abl  constitutional  immunity
        ‘Constitutional immunity from court-cases possessed by the president’

    a2. [NP [AP [KP sadr=kO] hAsil] [KP muqaddamAt=sE] AInI istisnA]

    b. muqaddamAt=sE_2 sadr=kO_1 hAsil_1 AInI istisnA_2

    c. sadr=kO_1 muqaddamAt=sE_2 hAsil_1 AInI istisnA_2

    d. *hAsil_1 muqaddamAt=sE_2 sadr=kO_1 AInI istisnA_2

36 / 60

Syntax

Hierarchical structure of AP in NP

[NP [KP [NP [N muqaddamAt]] [K sE]]
    [AP [KP [NP [N s3adr]] [K kO]] [A h2As3il]]
    [AP [A AInI]]
    [N istis2nA]]

Figure: Hierarchical structure of AP in NP

37 / 60

Syntax

Examples of discontinuous NPs

(10) a1. ArmI-cIf=sE_2   salAmtI=par_1    barIfiNg_1=kA  mutAlbA_2
         army-chief=Abl  security=Loc.on  briefing=Gen   demand
         ‘The demand to the army chief for briefing on security’

     a2. [NP [KP ArmI-cIf=sE] [KP [NP [KP salAmtI=par] barIfiNg]=kA] mutAlbA]

     b. salAmtI=par_1 ArmI-cIf=sE_2 barIfiNg_1=kA mutAlbA_2

38 / 60

Syntax

Examples of discontinuous NPs

(11) [NP [KP ArmI-cIf=sE] [KP [NP [KP mulkI salAmtI=par] tafsIlI barIfiNg]=kA] qAnUnI mutAlbA]

     ArmI-cIf=sE     mulkI       salAmtI=par      tafsIlI   barIfiNg=kA   qAnUnI  mutAlbA
     army-chief=Abl  of-country  security=Loc.on  detailed  briefing=Gen  legal   demand
     ‘The legal demand to the army chief for a detailed briefing on security of the country’

39 / 60

Syntax

LFG implementation of NP-internal discontinuity

NP

KP/PP              A+              A             N
Spec(N)/Arg(N/A)   Arg-taking-adj  Arg-less-adj  Head-noun

Scrambling of elements in oval possible with some constraints

Figure: Word Order in Noun Phrases of Urdu

40 / 60

Syntax

Implementation Issues

Free word order in an NP

Relating arguments with corresponding heads

Head last constraint

41 / 60

Syntax

LFG instruments used

Shuffle operator (‘,’): accommodates the free word order of the different elements in the noun phrase.

Non-deterministic operator (‘$’): relates the arguments to their corresponding heads.

Head-precedence operator (‘<h’): ensures that a head does not precede its arguments in the noun phrase.
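
The combined effect of free ordering and the head-last constraint can be illustrated with a short Python sketch based on example (9). It does not reproduce XLE's operators. It simply enumerates all orders of the prenominal elements and keeps those in which an argument precedes its head. The names `DEPENDENCIES` and `head_last` are hypothetical.

```python
from itertools import permutations

# (argument, head) pairs inside the NP, taken from example (9):
# sadr=kO is the argument of the adjective hAsil.
DEPENDENCIES = [("sadr=kO", "hAsil")]

def head_last(order):
    """True if no head precedes its own argument in this order."""
    return all(order.index(arg) < order.index(head) for arg, head in DEPENDENCIES)

# Elements that may scramble; AInI and the head noun istisnA stay in place.
prenominal = ["sadr=kO", "hAsil", "muqaddamAt=sE"]

orders = [
    list(p) + ["AInI", "istisnA"]
    for p in permutations(prenominal)
    if head_last(p)
]
for order in orders:
    print(" ".join(order))
```

Of the six possible orders, the three that survive are the grammatical variants (9a1), (9b) and (9c). The starred order (9d), in which hAsil precedes sadr=kO, is filtered out.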

42 / 60

Syntax

An excerpt from Grammar Rules

NP -->
   KP*: { (^ ADJUNCT $ OBL) = !
        | (^ ADJUNCT $ OBJ-GO) = !
        | (^ OBL) = !
        | (^ OBJ-GO) = ! },   “for scrambling”
   AP*: ! $ (^ ADJUNCT)
   N: ^ = !

__________________________________________

   KP*: { (^ ADJUNCT $ OBL) = !
          (^ ADJUNCT) <h (^ ADJUNCT $ OBL)
        | ..... }
   .......

Figure: Grammar Rules

43 / 60

Syntax

C-structure for a discontinuous NP

[NP [KP [NP [N s3adr]] [K kO]]
    [KP [NP [N muqaddamAt]] [K sE]]
    [AP [A h2As3il]]
    [AP [A AInI]]
    [N istis2nA]]

Figure: C-structure

44 / 60

Syntax

F-structure for a discontinuous NP

"s3adr kO muqaddamAt sE h2As3il AInI istis2nA"

PRED     'istis2nA<[34:muqaddamah]>'
OBL      [34] PRED   'muqaddamah'
              CHECK  [_NMORPH obl]
              CASE inst, GEND masc, NUM pl
ADJUNCT  { [39] PRED     'h2As2il<[1:s3adr]>'
                OBJ-GO   [1] PRED   's3adr'
                             CHECK  [_NMORPH obl]
                             NTYPE  [NSEM [COMMON count], NSYN common]
                             CASE dat, GEND masc, NUM sg, PERS 3
                CHECK    [_RESTRICTED -]
                LEX-SEM  [GOAL +]
                ATYPE    attributive
           [44] PRED   'AInI'
                ATYPE  attributive
                [39:h2As2il] <s }
(f-structure index: 49)

Figure: F-structure

45 / 60

Syntax

Summary

Urdu is a typical example of a language in which discontinuous NPs are found at both:

Clause-level

Constituent-level

Constituent-level discontinuity in Urdu can be implemented in the LFG framework by making use of:

Shuffle operator (‘,’)

Non-deterministic operator (‘$’)

Head-precedence operator (‘<h’)

46 / 60

Semantics

Intro

Aim: a large-coverage computational semantic analyzer on the basis of a deep syntactic analysis

use f-structures as starting point

apply xfr semantic rules → from f-structure facts to a semantic representation (Crouch and King, 2006)

judgment on the semantic well-formedness of a sentence

The girl laughs. → semantically well-formed
#The tree laughs. → semantically ill-formed

we need lexical information about the words in a sentence

1 lexical resource for Urdu verbs

more information on the verb and its arguments

2 general lexical resource for Urdu nouns, adjectives etc.

48 / 60

Semantics

Intro

F-structure for nAdiyah hansI (‘Nadya laughed’).

"nAdiyah hansI"

PRED      'hans<[1:nAdiyah]>'
SUBJ      [1] PRED      'nAdiyah'
              NTYPE     [NSEM [PROPER [PROPER-TYPE name]], NSYN proper]
              SEM-PROP  [SPECIFIC +]
              CASE nom, GEND fem, NUM sg, PERS 3
CHECK     [_VMORPH [_MTYPE infl], _RESTRICTED -, _VFORM perf]
LEX-SEM   [VERB-CLASS unerg]
TNS-ASP   [ASPECT perf, MOOD indicative]
CLAUSE-TYPE decl, PASSIVE -, VTYPE main
(f-structure index: 18)

xfr semantic rule:

PRED(%1, hans), SUBJ(%1, %subj), -OBJ(%1, %obj)
==>
word(%1, hans, verb), role(Agent, %1, %subj).
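
What the rule does can be mimicked in a few lines of Python. This is only a sketch of the idea, not the XFR system. It assumes that f-structure facts are stored as tuples, that the node names f18 and f1 stand for the f-structure indices 18 and 1, that matched left-hand-side facts are consumed, and that the `-OBJ` condition is a test for absence. The function name `apply_laugh_rule` is hypothetical.

```python
def apply_laugh_rule(facts):
    """Rewrite PRED/SUBJ facts for 'hans' into word/role facts
    when the same node has no OBJ."""
    facts = set(facts)
    for attr, node, value in sorted(facts):
        if (attr, value) != ("PRED", "hans"):
            continue
        subjects = [v for a, n, v in facts if a == "SUBJ" and n == node]
        has_obj = any(a == "OBJ" and n == node for a, n, _ in facts)
        if subjects and not has_obj:
            subj = subjects[0]
            # consume the matched facts ...
            facts -= {("PRED", node, "hans"), ("SUBJ", node, subj)}
            # ... and add the semantic facts on the right-hand side
            facts |= {("word", node, "hans", "verb"), ("role", "Agent", node, subj)}
    return facts

f_structure_facts = {
    ("PRED", "f18", "hans"),
    ("SUBJ", "f18", "f1"),
    ("PRED", "f1", "nAdiyah"),
}
semantic_facts = apply_laugh_rule(f_structure_facts)
for fact in sorted(semantic_facts):
    print(fact)
```

Facts that the rule does not mention, such as the PRED of the subject, pass through unchanged and remain available to later rules in the ordered rewriting sequence.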

49 / 60

Semantics

Developing an Urdu VerbNet (1)

following the methodology of the English VerbNet (Kipper-Schuler 2006)

categorization of English verbs in 250 classes
information on event structure and argument structure of verbs
provides the general architecture for a VerbNet in any language
e.g. parts of the entry for ‘laugh’ in the English VerbNet

50 / 60

Semantics

Developing an Urdu VerbNet (2)

Difficulty: resource sparseness of Urdu

Approach 1:

translating the entries in the English VerbNet to Urdu

figure out problematic cases

Approach 2:

fully rely on corpus work

extend tool for automatic subcategorization extraction (Ghulam, 2010)

Can we benefit from a Hindi lexical resource?

51 / 60

Semantics

Hindi WordNet

Facts:

inspired in methodology and architecture by the English WordNet(Fellbaum 1998)

52 / 60

Semantics

Hindi WordNet

developed at the Indian Institute of Technology, Bombay, India

separated into four independent “semantic nets”

verbs, nouns, adjectives and adverbs

about 3,900 verbs, 57,000 nouns, 13,700 adjectives and 1,300 adverbs

words are grouped according to their meaning similarity (“synsets”)

53 / 60

Semantics

Hindi WordNet

Issues

far less specific concepts than in the English WordNet

Hindi WordNet:
TOP 〉 Noun 〉 Inanimate 〉 Object 〉 Artifact 〉 kitAb
TOP 〉 Noun 〉 Inanimate 〉 Object 〉 Artifact 〉 mez

English WordNet:
entity 〉 physical entity 〉 object 〉 whole, unit 〉 artifact 〉 creation 〉 product 〉 piece of work 〉 publication 〉 book
entity 〉 physical entity 〉 object 〉 whole, unit 〉 artifact 〉 instrumentality 〉 furnishing 〉 piece of furniture 〉 table

54 / 60

Semantics

Benefits for an Urdu VerbNet

Preliminary experiments for Urdu/Hindi verbs

Resources that we have:

the database from Hindi WordNet
a list of Urdu verbs

out of 3,900 Hindi verbs, we have found 534 verbs in an Urdu verb list (Humayoun, 2006)

complex predicates are included in Hindi WordNet, but not in the Urdu word list

total of around 700 Urdu verbs → more than 2/3 of the Urdu verbs are found

all found verbs seem to be valid

→ extract verb information from Hindi WordNet for the Urdu VerbNet
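
The lookup behind these figures amounts to a set intersection. The Python sketch below uses a handful of toy verb stems in the Roman transliteration of this talk. Both lists are illustrative only and assume that the two resources have already been brought into the same transliteration. The last two lines redo the arithmetic for the reported numbers.

```python
# Toy data, not the actual resources: verb stems in Roman transliteration.
hindi_wordnet_verbs = {"hans", "kHA", "dEkH", "xarId", "jA"}
urdu_verb_list = {"hans", "kHA", "dEkH", "dE"}

shared = hindi_wordnet_verbs & urdu_verb_list
coverage = len(shared) / len(urdu_verb_list)
print(sorted(shared), coverage)

# Figures reported above: 534 matches against roughly 700 Urdu verbs.
reported = 534 / 700
print(round(reported, 3), reported > 2 / 3)
```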

55 / 60

Semantics

Urdu Lexical Semantics

Polysemy: an extreme case are the eat expressions in Hindi/Urdu (Hook and Pardeshi, 2009):

employing ‘eat’ in idiomatic expressions

about 160 eat expressions for Hindi/Urdu

variety of uses due to loan translations from Persian

56 / 60

Semantics

Urdu Lexical Semantics

h2asan=ne   kEk=ko    kHAyA
h2asan.Erg  cake.Acc  eat.Perf.Sg.Masc
‘Hasan ate the cake.’

eat = 〈Agent, Theme〉

inqilAbI       fikar    zang  kHA  jAEgI
revolutionary  thought  rust  eat  go.Fut
‘Revolutionary thinking will gather rust.’

eat (gather rust) = 〈Patient, Theme〉

is    sAl=kI    mandI         sheyar-bAzAr  kHA  gAyI
this  year.Gen  slowdown.Fem  stockmarket   eat  go.Perf.Fem
‘This year’s slowdown wrecked (lit. devoured) the stock market.’

eat (wreck) = 〈Agent, Theme〉

57 / 60

Semantics

Urdu Lexical Semantics

How do we approach polysemy in the computational semantics?

extensive corpus work to find polysemous verbs

assign different thematic roles to polysemous verbs?

put all combinations in the Urdu VerbNet, but mark the “original” use?

analysis for all sentences, mark idiomatic and semantically ill-formed sentences as such?

58 / 60

Semantics

Wrap up

What we have talked about:

architecture of the Urdu LFG Grammar

ongoing work

transliteration
discontinuous NPs
computational semantics

challenges ahead

Demo

59 / 60

Semantics

Thank you!

60 / 60