
Chapter.1 Introduction

‘Where shall I begin, please your Majesty?’ he asked.

‘Begin at the beginning,’ the King said gravely, ‘and go on till you come to the end: then stop.’

Carroll, 2003

1. Introduction

It is a well-established fact that the development of NLP tools and applications requires a substantial amount of linguistic knowledge, which can take the form either of a computational grammar1 or of a syntactically annotated, machine-readable corpus known as a treebank.2 This research is an effort to create a treebank for the Kashmiri language [KashTreeBank].3 It investigates the theoretical as well as the practical issues involved in creating a small-scale dependency treebank of Kashmiri, using a simple grammar formalism for syntactic parsing and annotation.

Treebank creation is a Promethean task that requires different types of resources and enormous funding, both for the development or acquisition of corpora and tools and for labor-intensive annotation, expert opinion and validation. The current research is an initiative to build language resources for Kashmiri so that a baseline syntactic parser of Kashmiri can be developed. Its findings can serve as a basis for carrying out treebanking for Kashmiri on a large scale.

The next section discusses the motivations behind the current research and highlights its social relevance. Section three provides a brief introduction to the Kashmiri language. Section four, on the research problem, introduces the spectrum of issues associated with treebanking in general and with the development of KashTreeBank in particular. Section five, on theoretical preliminaries, elaborates the theoretical framework used in this research work.

1 An assembly of an electronic dictionary and formal representations of the word-, phrase- and sentence-formation rules of a language.

2 Well-known treebanks include the Penn English Treebank (Marcus et al., 1993), the Penn Arabic Treebank (Maamouri et al., 2004), the Penn Chinese Treebank (Xue et al., 2004), the Prague Dependency Treebank of Czech (Hajičová & Hajič, 1998; Böhmová et al., 2003) and the HyDTB Hindi Treebank (Begum et al., 2008).

3 KashTreeBank was initially conceived as a summer school project at IASNLP 2011.


2. Motivation

A treebank is a rich language resource for research on grammar development and grammar engineering. Grammar engineering is the practice of building elaborate linguistic models on computers, and it has been used for practical purposes for many years, for instance in developing grammar checkers such as the Microsoft grammar checker and Boeing’s Simplified English grammar checker. Contemporary grammar engineering, however, involves the extraction and induction of probabilistic grammars. Besides, if a treebank is created as a reference work rather than as an application-oriented repository, it can serve multiple functions in various subfields of linguistics as well as in language technology. Theoretical linguists can use a treebank to search for illustrations of the syntactic phenomena under investigation, whereas psycholinguists can use it to find the relative frequencies of the possible PP attachments or relative clauses (Abeillé, 2003). Similarly, formal and computational linguists can evaluate the correctness and coverage of grammars and lexicons against the analyses stored in a treebank and, at a more general level, the adequacy of linguistic theories and formalisms can be assessed.

Further, treebanking is not a goal in itself; rather, treebank-driven parsers are used as an important component of artificial intelligence (AI) systems such as machine translation (MT) systems, question-answering systems and grammar checkers. A treebank is therefore a valuable resource not only for Computational Linguistics (CL) and Natural Language Processing (NLP) tasks, such as automatic syntactic parsing, grammar-induction4 and grammar-extraction5, but also for non-technological academic research such as experimental syntax. Evaluation of NLP systems or their components is yet another field which is currently very active. These days, treebanks are in much demand for the testing and optimization of syntactic parsers.

Treebanks can also be used for pedagogical purposes, both in the teaching of language and of linguistic theory. For example, the Visual Interactive Syntax Learning (VISL) project, established at the University of Southern Denmark, has developed teaching treebanks for twenty-two languages with a number of different teaching

4 During the last decade treebanks have been used for the induction of probabilistic grammars for syntactic parsing (see Collins, 1999; Charniak, 2000), but currently they are used in data-driven parsing (see Bod, 1998; Nivre, 2009), which eliminates the traditional notion of grammar completely and uses a probabilistic model defined directly on the treebank.

5 Besides the optimization of syntactic parsers, treebanks are used to induce other linguistic information relevant to NLP, e.g. the extraction of subcategorization frames (Briscoe, 1997).


tools, including interactive games such as Syntris6. Treebanks are also being used for empirical linguistic research in theoretical syntax and historical linguistics. For instance, the creation of historical treebanks for Middle English (Kroch & Taylor, 2000), Old English (Taylor et al., 2003), Early New High German (Demske et al., 2004), etc., has revolutionized historical linguistics and comparative philology over the last decade or two. Given the capacity of treebanks to hold a vast amount of empirical grammatical knowledge, and given their commercial utility as language data in research and development, it is the need of the hour to develop large-scale treebanks for all resource-poor languages, and Kashmiri is one of them.

3. Kashmiri Language

Kashmiri, locally known as “Koshur,” is one of the 22 scheduled languages of the Indian Union, as per the 8th Schedule of its Constitution. It is mainly spoken in the greater region called “Kashmir,” which includes the State of Jammu and Kashmir (JK) and Pakistan-administered Kashmir. JK is located at a strategically important geographical point, bordered by Tibet in the east, China in the north, Pakistan in the west and south-west, and the rest of India in the south-east (Hussein, 1987). There are approximately six million Kashmiri speakers scattered across India, Pakistan, the UK, the USA and the Gulf countries (Ethnologue, 2006). Kashmiri is a Dardic language; it was at first considered genealogically distinct from the Indo-Aryan and Indo-Iranian languages (Grierson, 1915), but later on it was classified under the Dardic group within the Indo-Aryan language family (Morgenstierne, 1961). It is closely related to Shina and some other languages of the North-West Frontier (Koul, 2006).

It is a highly inflectional language with a predominant V2 phenomenon and pronominal clitics, like the Germanic languages. Kashmiri is the only Dardic language which has a written tradition. It is written in a modified Perso-Arabic script, with additional diacritics to capture its peculiar phonetic features. Like Urdu, it is written from right to left. Although Perso-Arabic is the officially approved script, Kashmiri is also written in Devanagari; moreover, the Sharda and Roman scripts have also been used for it from time to time. It is, however, mainly written in the modified Perso-Arabic script. The script uses an additional distinguishing set of diacritic markers and letters for

6 See http://visl.edu.dk


representing a system of central vowels and secondary articulations, e.g. palatalization at token-initial, medial and final positions. The script is therefore fully capable of representing all the sounds of Kashmiri. It has two writing styles, Naskh and Nastaliq. Kashmiri is mainly written in the Nastaliq style, either by hand by calligraphers (kA:tib) or with a word processor such as InPage Urdu. It can also be typed directly in Microsoft Word, where it is displayed in the Naskh style, like Arabic, because the available Unicode fonts are only in that style. It is worth mentioning that readers of Kashmiri are not normally used to this style and find it difficult to read.

4. The Research Problem

Kashmiri is a highly inflectional language with relatively variable word order, extensive pronominal cliticisation and a predominant V2 phenomenon. As far as computational resources are concerned, it is a resource-poor language, lagging far behind other Indian languages such as Hindi, Urdu, Punjabi, Bengali, Telugu and Tamil. Several kinds of resources are needed for developing a treebank, such as annotation guidelines that state the conventions guiding the annotators throughout their work and a software tool to aid the annotation work. Since constructing syntactic trees manually is a very slow and error-prone process, semi-automatic annotation can be opted for, but semi-automated treebank annotation needs a whole battery of NLP modules: a tokenizer, a POS tagger, a morphological analyzer, a chunker (shallow parser) and a syntactic parser.

The development of KashTreeBank involves many challenges, ranging from preliminary decisions about the framework and its associated formalism to the actual syntactic annotations and their representation in a particular format. This multi-dimensional problem of “Creating KashTreeBank” can therefore be better addressed by describing the spectrum of smaller problems related to its design and development: the choice of corpus, the selection of the framework and the associated grammar formalism, the choice of annotation scheme, the nature of the annotation process, the representation of the treebank and the choice of annotation tool.

4.1. Choice of Corpus

A treebank can't be created in a vacuum. One needs some primary source data (machine-readable text) to work on and to annotate with the required linguistic information. Either corpus resources already created under various projects can be used, or new resources can be created for this purpose. But the choice between acquiring old resources and developing new ones should be a principled one. The principles that should govern the choice of corpus for treebanking are as follows:

a) The corpus should be freely available for research and development under an easy licensing policy.

b) The licensing policy for the distribution of the corpus should not undermine the developers' rights over the treebank.

c) The corpus should have been developed following certain encoding standards, preferably Unicode for character encoding and XML for text encoding.

d) The corpus should be sanitized and normalized, i.e. free of typographical errors, tokenization problems and missing (crucial) diacritics.

e) The corpus should be balanced (with samples from all the existing domains).

f) The corpus should represent almost all the construction types of the language, so that the treebank has wide coverage and yields a robust annotation scheme and parsing model.

g) The corpus should preferably be annotated with morph, POS and chunk information so that one can directly start parsing sentences.

h) A sufficient quantity of corpus should be available (at least 1500-2000 sentences). A smaller quantity may be sufficient for developing the annotation scheme and guidelines, but it won’t be sufficient to train a baseline parser.

It is not obligatory to follow all the above criteria strictly. They can vary from language to language, but one certainly has to think along these lines when acquiring or developing a corpus to create a treebank. It is important to mention that a corpus of shorter sentences (of considerable complexity) should be used in the initial stage of the research to lay down a basic annotation scheme.

4.2. Treebanks and Linguistic Theory

The choice of a suitable framework as well as an implementable formalism is of paramount importance in any treebanking endeavor, as it determines the nature of all data (trees) in the treebank and consequently the value and utility of the entire treebank. Since a number of grammatical frameworks and formalisms exist, it is imperative to choose one among the existing models to be implemented on the selected sets of the Kashmiri corpus. The choice of annotation scheme for a treebank is influenced by different factors. One of the most central considerations is its relationship with linguistic theory: it has to be decided whether the annotation scheme should be theory-specific or theory-neutral. If the first alternative is taken, which theoretical framework should be adopted? If the second is opted for, how do we achieve a broad consensus on framework selection, given that true theory-neutrality is almost impossible? Although it has been argued that theoretical neutrality should be maintained while creating a treebank (Fei Xia, 2008), in reality a theoretically neutral treebank is a myth. However, if theory neutrality is interpreted as NLP-friendliness, one can choose for the annotation guidelines the framework that is most advantageous for Natural Language Processing. The solution to the problem of framework selection and annotation-scheme design comes from the interaction between the different factors that govern treebanking, in particular from the nature of the language (configurational or non-configurational) being analyzed. Also, researchers, particularly those in resource-poor scenarios, cannot afford to disregard the resources and tools already created for automatic and interactive annotation. The following criteria can be posited to help in selecting a grammar formalism vis-à-vis an annotation scheme.

a) The formalism should be simple & elegant with fewer abstractions, i.e. it

should be NLP friendly.

b) The associated resources (tools and schemes) should be accessible.

c) It should suit the nature of the language under investigation.

d) It shouldn’t disregard the grammatical tradition for the language.

e) It should have some cognitive reality.

Two types of frameworks, constituency and dependency, have been used in framing annotation schemes for different treebanks. A constituency-based annotation scheme posits the structure of a sentence as hierarchically organized phrases (IP = Spec + X’ & X’ = X + Comp.), where the annotations are confined to phrasal tags (such as S, JJP, NP, PP, VP, etc.). Such schemes do not explicitly represent the grammatical relations between and within the constituents. A dependency-based annotation scheme, on the other hand, posits a sentence as a dependency graph, i.e. a structure consisting of heads and dependents connected by labeled arcs (which can also be directed), each denoting the grammatical relation (GR) between a head and its dependent. The relations in the syntactic structure can be labeled not only with GRs but also with other specifications of the function of the dependent. Syntactic units are words in the more lexicalized dependency frameworks (Hudson, 1984; Mel’čuk, 1988), but dependency annotation schemes sometimes rely on units of several words or word clusters, e.g. chunks in the case of Abney (1991) and Bharati et al. (1994).
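As an illustration of the dependency graph just described, the following Python sketch stores each labeled arc as a head, a dependent and a relation, and checks that every word has exactly one head. The class and function names, the example sentence and the relation labels (Subj, Obj, Root) are assumptions made for illustration; they are not part of any annotation scheme discussed here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Arc:
    head: int        # position of the head word (0 = artificial root)
    dependent: int   # position of the dependent word (1-based)
    relation: str    # grammatical relation (GR) labelling the arc


def heads_by_dependent(arcs: List[Arc]) -> Dict[int, Arc]:
    """Index arcs by dependent, rejecting words with more than one head."""
    index: Dict[int, Arc] = {}
    for arc in arcs:
        if arc.dependent in index:
            raise ValueError(f"word {arc.dependent} has more than one head")
        index[arc.dependent] = arc
    return index


# Hypothetical three-word sentence used only for illustration.
words = ["Alfred", "reads", "books"]
arcs = [
    Arc(head=2, dependent=1, relation="Subj"),
    Arc(head=0, dependent=2, relation="Root"),
    Arc(head=2, dependent=3, relation="Obj"),
]

index = heads_by_dependent(arcs)
for position, word in enumerate(words, start=1):
    arc = index[position]
    governor = "ROOT" if arc.head == 0 else words[arc.head - 1]
    # Printed as: dependent --relation--> head
    print(f"{word} --{arc.relation}--> {governor}")
```

Storing one arc per dependent makes the single-head condition easy to verify; a chunk-based scheme could use the same structure with chunk addresses in place of word positions.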

The annotation schemes used in different treebanks can be compared and contrasted on the basis of the following parameters, proposed by Bosco and Lombardo (2004).

a) The number of layers involved

b) The number and the nature of relations annotated

c) The richness of the annotation

d) The explicit representation of semantic information

On the one hand, the Penn Treebank (PTB) (Marcus et al., 1993) uses a mono-stratal (single-layered) annotation scheme that combines the annotation of syntax and semantics on the same level of representation. The syntactic annotation is based on constituency, but it has been enriched with the annotation of a small set of grammatical relations and semantic information. On the other hand, the Prague Dependency Treebank (PDT) uses a multi-stratal annotation scheme that consists of three separate layers: morphological, analytical and tecto-grammatical (or semantic). The NEGRA treebank, however, also uses a mono-stratal annotation scheme, one which combines phrase-structure and dependency representations, allowing the direct representation both of phrases for fixed-word-order constructions and of syntactic dependencies (predicate-argument structures). The PDT uses a richer annotation of the relational structure than the others. Since the number of relations annotated in the NEGRA treebank and the PTB is quite low, their representation of the relational structure is quite poor. Nevertheless, the relational structure can be recovered more easily, all at once, in mono-stratal representations such as those of NEGRA and the PTB than in multi-stratal representations, where the information is spread over several structurally different layers, as in the PDT. The major limitation of mono-stratal representation is that it represents on one level phenomena which require structurally different levels, e.g. it represents semantics and syntax as coordinated rather than disjoint.

Some scholars claim that dependency-based annotation is more suitable for relatively free-word-order languages (Hudson, 1984; Mel’čuk, 1988; Covington, 1990; Bharati et al., 1995), while others make their choice on the basis of application requirements, and in some cases the annotation scheme follows the linguistic tradition. To annotate corpora of relatively fixed-word-order languages like English, the principle of constituency is usually employed. However, in treebanks such as the TIGER Treebank for German (Brants et al., 2002) and the Quranic Arabic Treebank (Dukes & Buckwalter, 2010), dependency is combined with phrase structure grammar (PSG). Also, efforts have recently been made to annotate relatively free-word-order languages like Hindi-Urdu with dependency structure, lexical predicate structure and phrase structure in a coordinated manner (Palmer et al., 2009). Further, a treebank can have multiple representations rooted in different linguistic theories, so as to maintain theory equality rather than theory neutrality. For instance, the Multi-Representational and Multi-Layered Treebank for Hindi-Urdu (Bhatt et al., 2009) has both phrase structure (PS) and dependency structure (DS) representations. In fact, multiple representations are the current state of the art in treebanking, but one still has to start with one type of representation.

4.3. Nature of the Annotation Process

Treebanking primarily involves syntactic parsing and annotation of a POS-tagged corpus, which can be done in different ways. The most commonly used method for developing a treebank is a combination of automatic and manual processing. Some treebanks have been created completely manually, but with taggers and parsers available to automate some of the work, such a method is rarely employed in state-of-the-art treebanking. There are three main techniques for carrying out the annotation process, viz.:

a) Supervised Technique: In this technique, the annotation process is carried out manually by human annotators, preferably syntacticians.

b) Unsupervised Technique: In this technique, the annotation process is carried out automatically by an intelligent system called a syntactic parser (developed without any training data).

c) Semi-supervised Technique: In this technique, the annotation is done partly automatically by a trained parser, and the parses are partly completed or corrected by human intervention.

Traditionally, parsing or syntactic annotation was mostly confined to manual methods, but after the development of more sophisticated grammar formalisms, such as context-free grammars like PSG, it became possible to automate the process of syntactic annotation, either on the basis of a computational grammar, in which hand-crafted grammar rules (morphological, phrasal and sentential) are used to develop a parser, or on the basis of statistical modeling, in which a syntactically annotated electronic corpus is used to train a parser. Hybrid techniques, involving both grammar rules and statistical modeling, are also used to develop parsers. Treebank creation on the basis of automatic parsing, using a probabilistic grammar or statistical modeling (Bod, 1998; Collins, 1999; Charniak, 2000), is desirable for both practical and theoretical reasons, since manual annotation has the disadvantage of being time-consuming, labor-intensive, costly and error-prone. Also, it is difficult to achieve satisfactory consistency both within and between human annotators (Van der Beek et al., 2002). However, for a resource-poor language like Kashmiri, which lacks the annotated data needed to train a parser, the automatic approach is impractical. Sticking to the old method of manual annotation is therefore the only choice, and the treebank so obtained serves as data for training and testing state-of-the-art parsers such as the Stanford Parser (Klein & Manning, 2003), MaltParser (Nivre et al., 2006) or MSTParser (McDonald, 2006). Training results in the induction of a language model, which in turn yields a baseline Kashmiri parser. Once the baseline parser for Kashmiri is ready, it can be employed to parse more and more Kashmiri text automatically and to learn more and more structures by boot-strapping7; only then can labor-intensive manual annotation be avoided. Nevertheless, the automatically annotated corpus still needs to be validated manually. Since automatic syntactic parsing is currently predominantly the domain of machine learning (engineering), where consistency in annotation matters more than granularity,

7 Parse a little, learn a little


i.e. the depth of analysis, the annotation guidelines need to be prepared and followed strictly during the annotation process to avoid frequent inconsistencies and the propagation of errors to other annotation layers.
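One simple way to monitor the consistency between annotators discussed above is to compare two annotators' analyses of the same sentence token by token. The Python sketch below computes the share of tokens on which both annotators chose the same head (unlabeled agreement) and the same head and relation label (labeled agreement). The function name, the data layout and the sample annotations are assumptions made for illustration; this is not a procedure prescribed by the annotation guidelines.

```python
from typing import List, Tuple

Annotation = Tuple[int, str]  # (head position, relation label) for one token


def attachment_agreement(a: List[Annotation], b: List[Annotation]) -> Tuple[float, float]:
    """Return (unlabeled, labeled) agreement between two annotations of one sentence."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotations must cover the same non-empty token sequence")
    # Unlabeled: the two annotators picked the same head for the token.
    same_head = sum(1 for (head_a, _), (head_b, _) in zip(a, b) if head_a == head_b)
    # Labeled: the two annotators picked the same head and the same label.
    same_both = sum(1 for x, y in zip(a, b) if x == y)
    return same_head / len(a), same_both / len(a)


# Hypothetical annotations of a four-token sentence by two annotators.
annotator_1 = [(2, "k2"), (0, "root"), (2, "k4a"), (2, "pof")]
annotator_2 = [(2, "k2"), (0, "root"), (2, "k2"), (3, "pof")]

unlabeled, labeled = attachment_agreement(annotator_1, annotator_2)
print(f"unlabeled agreement: {unlabeled:.2f}, labeled agreement: {labeled:.2f}")
```

Raw agreement of this kind does not correct for chance agreement, so it is best read as a quick diagnostic for spotting guideline items that annotators interpret differently.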

4.4. Representation of Treebank

Treebanking involves not only deep syntactic analysis of a natural-language corpus (sentences) according to a particular grammar formalism but also the representation of the syntactic analyses (trees) in a certain format, so that the annotated information can be read by an algorithm during the training process. A format is, generally, a sort of matrix which represents the various levels of annotated grammatical information in different data types or fields (columns) in such a way that a link is maintained between them. Many such formats have been devised so far, for instance CONLL-X (see Table.2) and the Shakti Standard Format (SSF) (see Table.1). SSF was originally devised for the Shakti machine translation system for Indian languages and is mostly used in India, whereas CONLL-X is a widely used standard. It has ten data types (fields), of which seven are utilized in the analysis. Recently, algorithms have been developed to convert SSF into CONLL so that experiments can be done on a wider range of parsers.

In the CONLL-X format, each word-form and punctuation mark is presented on a separate line. Each word has a numerical address (NA) within the sentence in Column-1. The next column, Column-2, holds the actual word-form (WF), followed by its base form (BF) in Column-3. The morphological description is given both in a short, coarse-grained manner (POS) in Column-4 and as a fine-grained analysis (Morph) in Column-5. The dependency relations (dRel) are marked in Column-7 by indicating the governing word (Head/Root/Regent) using the sentence-internal numerical address of Column-1. The dependency functions (dFn) of the word-forms are presented in Column-8. Columns 6, 9 and 10 are unused and are marked with an underscore (_).
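The column layout described above can be read programmatically with a few lines of Python. The sketch below splits one whitespace-separated ten-column line into named fields. The field names follow the abbreviations used in this section (NA, WF, BF, POS, Morph, dRel, dFn), while the class name, the function name and the transliterated sample row are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConllToken:
    na: int                # Column-1: numerical address within the sentence
    wf: str                # Column-2: word-form
    bf: str                # Column-3: base form
    pos: str               # Column-4: coarse-grained POS
    morph: str             # Column-5: fine-grained morphological analysis
    drel: Optional[int]    # Column-7: address of the governing word (0 = root)
    dfn: Optional[str]     # Column-8: dependency function


def parse_conll_line(line: str) -> ConllToken:
    """Parse one ten-column line; an underscore marks an unused or empty field."""
    cols = line.split()
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, found {len(cols)}")
    return ConllToken(
        na=int(cols[0]),
        wf=cols[1],
        bf=cols[2],
        pos=cols[3],
        morph=cols[4],
        drel=None if cols[6] == "_" else int(cols[6]),
        dfn=None if cols[7] == "_" else cols[7],
    )


# Hypothetical row with a transliterated word-form, for illustration only.
token = parse_conll_line("2 palav palav NN NN.0.0.0 _ 3 Obj _ _")
print(token.wf, token.pos, token.drel, token.dfn)
```

Columns 6, 9 and 10 are read but deliberately ignored, since the section states that they are unused.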

In the present work, the annotated data is represented in SSF (Bharati et al., 2007). SSF consists of four columns: Column-1 (C1) carries the address of the token (1, 2, 3, ..., n); Column-2 (C2) carries the actual tokens, one token per line (see Fig.1); Column-3 (C3) carries the POS category of the node; and Column-4 (C4) carries other features such as the dependency relations. Any further information, such as morph information, can be represented in this column using attribute-value pairs. Therefore, the POS and chunk information of the tokens appears in C3, and the morph, dependency and any other information pertaining to a node appears in C4 (see Table.1).

سفید پلو ٲسۍ آسان حضو‘رن سؠٹھا پسند ۔

<Sentence id='22'>

C1     C2       C3     C4

1      ((       NP     <fs name='NP' drel='k2:VGF'>

1.1    سفید     JJ     <fs name='سفید'>

1.2    پلو      NN     <fs name='پلو'>

       ))

2      ((       VGF    <fs name='VGF'>

2.1    ٲسۍ      VAUX   <fs name='ٲسۍ'>

2.2    آسان     VM     <fs name='آسان'>

       ))

3      ((       NP     <fs name='NP2' drel='k4a:VGF'>

3.1    حضو‘رن   NNP    <fs name='حضو‘رن'>

       ))

4      ((       NP     <fs name='NP3' drel='pof:VGF'>

4.1    سؠٹھا    INTF   <fs name='سؠٹھا'>

4.2    پسند     NN     <fs name='پسند'>

4.3    ۔        SYM    <fs name='۔'>

       ))

</Sentence>

Table.1: An Eight Token Kashmiri Sentence in SSF
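To show how the four SSF columns can be consumed by a program, the following Python sketch reads chunk and token lines of the kind shown in Table.1 and collects the attribute-value pairs of the feature structure in C4. It assumes tab-separated columns and single-quoted attribute values, and it uses transliterated tokens; the function names are hypothetical and the sketch does not cover the full SSF specification.

```python
import re
from typing import Dict, List

# Matches attribute='value' pairs inside an <fs ...> feature structure.
FS_ATTR = re.compile(r"(\w+)='([^']*)'")


def parse_fs(fs: str) -> Dict[str, str]:
    """Turn "<fs name='NP' drel='k2:VGF'>" into {'name': 'NP', 'drel': 'k2:VGF'}."""
    return dict(FS_ATTR.findall(fs))


def read_ssf(lines: List[str]) -> List[dict]:
    """Group token lines under their chunk; '((' opens a chunk and '))' closes it."""
    chunks: List[dict] = []
    for line in lines:
        cols = line.strip().split("\t")
        if cols[0] == "))" or cols == [""]:
            continue  # chunk-closing line or blank line carries no data
        address, token, tag = cols[0], cols[1], cols[2]
        features = parse_fs(cols[3]) if len(cols) > 3 else {}
        if token == "((":
            chunks.append({"address": address, "tag": tag, "fs": features, "tokens": []})
        else:
            chunks[-1]["tokens"].append(
                {"address": address, "form": token, "pos": tag, "fs": features}
            )
    return chunks


# Hypothetical, transliterated version of the first chunk of an SSF sentence.
sample = [
    "1\t((\tNP\t<fs name='NP' drel='k2:VGF'>",
    "1.1\tsafed\tJJ\t<fs name='safed'>",
    "1.2\tpalav\tNN\t<fs name='palav'>",
    "\t))",
]

chunks = read_ssf(sample)
for chunk in chunks:
    print(chunk["tag"], chunk["fs"].get("drel"), [t["form"] for t in chunk["tokens"]])
```

A value such as `k2:VGF` pairs a relation label with the name of the governing chunk, so splitting it on the colon would be a natural first step towards converting such data into a head-and-label format.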


1.NA   2.WF     3.BF     4.POS   5.Morph     6.   7.dRel   8.dFn   9.   10.

1      سفید     سفید     JJ      JJ.0.0.0    _    2        Adj     _    _

2      پلو      پلو      NN      NN.0.0.0    _    3        Obj     _    _

3      ٲس       ٲس       VA      VM.0.0.0    _    0        Root    _    _

4      آسان     آسان     VM      VA.0.0.0    _    3        Aux     _    _

5      حضو‘رن   حضو‘ر    NNP     NNP.0.0.0   _    3        Subj    _    _

6      سؠٹھا    سؠٹھا    INT     INT.0.0     _    7        Intf    _    _

7      پسند     پسند     NN      NN.0.0.0    _    3        pRoot   _    _

8      ۔        ۔        SYM     SYM.0       _    _        _

Table.2: An Eight Token Kashmiri Sentence in CONLL-X Format

4.5. Choice of Annotation Interface

The annotation process for the development of a treebank can’t be accomplished effectively unless a user-friendly annotation interface is available. The annotation interface is generally customized on the basis of the requirements of the annotation scheme. Given the specifications of the treebank to be built, one can search for open-source tools instead of spending resources on developing new ones. In fact, many open-source syntactic and syntactico-semantic annotation tools are available, developed under various research projects throughout the world. Such tools include:

i. Dependency Grammar Annotator (DGA)

This tool has been developed to facilitate the syntactic annotation of text corpora within the formal framework of Dependency Grammar (Tesnière, 1959). DGA is a user-friendly graphical interface which allows the efficient creation and manipulation of syntactic structures. DGA was developed by Marius Popescu of the University of Bucharest under the BALRIC-LING project.

ii. Syntactic Tree Viewer

It is an easy-to-use interface for visualizing or creating simple linguistic trees. It allows syntactic trees to be created and edited and the output to be viewed in string format. It supports visualizing parse trees produced by various parsers, including the Stanford parser and the Charniak parser. It also supports visualizing Penn Treebank trees with slight modification.

iii. Sanchay

It is an open-source platform for carrying out various NLP tasks for South Asian Languages (SALs). It has been extensively used for Indian Languages (ILs) at various NLP research labs, especially at the LTRC Lab, for research projects such as ILMT8, treebank projects (Hindi, Urdu, Telugu & Bangla), PropBank9 and the Bengali Treebank10. The tool has thus been instrumental in creating language resources and carrying out various NLP tasks for ILs. It is generally assumed that Sanchay was devised exclusively to implement PCG; in fact, it can be customized and used irrespective of grammatical framework and formalism, although it is also true that Panini’s PCG was first experimented with and implemented in Sanchay for ILs under the ILMT project.

iv. Cornerstone

It is a PropBank frameset editor developed at the University of Colorado at Boulder. It is platform-independent and supports multiple languages, such as Arabic, Chinese, English, Hindi and Korean. It is worth mentioning that before the development of Cornerstone, Sanchay was used in PropBank (Palmer et al., 2005) for annotating predicate-argument structure. However, Cornerstone is not sufficient for treebanking, where one also needs to annotate beyond predicate-argument structure, e.g. coordinated and embedded clause constructions, sentential modifiers, the internal structure of complex predicates, serial verb constructions and subject and object complement constructions.

Besides the above-mentioned tools, there is also the GATE architecture, which can be customized and used for syntactic annotation as well as other NLP tasks. Finally, it

8 Indian Languages Machine Translation, a consortium project at LTRC Lab, IIIT Hyderabad

9 PropBank (Palmer M, Kingsbury P, Gildea D, 2005) at University of Colorado

10 LDCIL-IIT Kharagpur Bangla Treebank (Sanjay C, Praveen S, Sudeshna S, Devshri R, 2009)


is worth mentioning that if the demands of an annotation scheme are not fulfilled by such open-source tools (as in the case of Cornerstone), even after customization, new annotation tools can be developed, provided funding and technical support are available. Usually, however, annotation schemes are never alien and are developed in consonance with pre-existing schemes and tools, so this situation rarely arises and it is hardly ever necessary to build new annotation tools.

5. Theoretical Preliminaries

Lucien Tesnière, a French linguist, developed in the 1930s a relatively formal and sophisticated theory of dependency grammar (DG), presented in Éléments de syntaxe structurale, for pedagogic purposes. It was first drafted in 1939 but published only later, posthumously, in 1959. Tesnière puts forward his notion of dependency in the following lines:

“[I] La phrase est un ensemble organisé dont les éléments constituants sont les mots. [II] Tout mot qui fait partie d’une phrase cesse par lui-même d’être isolé comme dans le dictionnaire. Entre lui et ses voisins, l’esprit aperçoit des connexions, dont l’ensemble forme la charpente de la phrase. [III] Les connexions structurales établissent entre les mots des rapports de dépendance. Chaque connexion unit en principe un terme supérieur à un terme inférieur. [IV] Le terme supérieur reçoit le nom de régissant. Le terme inférieur reçoit le nom de subordonné. Ainsi dans la phrase Alfred parle [. . .], parle est le régissant et Alfred le subordonné.”

(Tesnière, 1959, p. 11-13)

“[I] The sentence is an organized whole, the constituent elements of which are words. [II] Every word that belongs to a sentence ceases by itself to be isolated as in the dictionary. Between the word and its neighbors, the mind perceives connections, the totality of which forms the structure of the sentence. [III] The structural connections establish dependency relations between the words. Each connection in principle unites a superior term and an inferior term. [IV] The superior term receives the name governor. The inferior term receives the name subordinate. Thus, in the sentence Alfred parle11 [. . .], parle is the governor and Alfred the subordinate.”12 Here ‘parle’ is also the root (the head of the whole clause Alfred parle) of the structural diagram (dependency graph) called a ‘stemma’, which is widely used in the different formalisms of the dependency framework.

11 The French clause “Alfred parle” means “Alfred speaks”

12 Translated from Tesnière (1959, pp. 11–13) by Joakim Nivre (2009)


Dependency relations belong to the structural order, the domain of structural syntax, which is different from the linear order of the spoken or written string of words (Nivre, 2009). A dependency relation holds between a Head (H) and a Dependent (D) in a clause or sentence and is represented by a labeled arc13 (arrow) projecting from the H to its Ds. The criteria for establishing dependency relations and for distinguishing between the H and the D are therefore of paramount importance, not only in the dependency framework but also in other frameworks where the notion of syntactic head plays a pivotal role, including all constituency-based frameworks that belong to some version of X-bar theory (Chomsky, 1970; Jackendoff, 1977). Zwicky (1985) has proposed, among others, the following criteria for distinguishing between an H and a D in a construction (C)14.

i. H determines the semantic category of C, D gives semantic specification.

ii. H determines the syntactic category of C and can often substitute C.

iii. H is obligatory, D is optional.

iv. H selects D and determines whether D is obligatory or optional.

v. The form of D depends on H (government or agreement/concord).

vi. The linear position of D is specified with reference to H.

It is very important to distinguish between syntactic dependencies in endocentric

and exocentric constructions (Bloomfield, 1933). For illustration, consider the

structure of the following sentence, taken from the Wall Street Journal part of

Penn Treebank:

Figure.4: Dependency structure for English sentence15

The Attribute (ATT) relation holding between ‘H’ (noun “markets”) and ‘D’

(adjective “financial”) is an endocentric construction in which the head can substitute for

13 The notational convention used in the above dependency graph is that the arrows point from H to Ds, but there is a competing tradition in the literature according to which arrows point from the Ds to the H (Nivre, 2009).

14 Taken from Hudson (1990, pp. 106-7).

15 One peculiarity of the dependency structure in Figure.4 is that there is an artificial word root before the first word of the sentence. This is a mere technicality, which simplifies both formal definitions and computational implementations. In particular, it is assumed that every real word of the sentence has a syntactic head. Thus, instead of saying that the verb had lacks a syntactic head, it can be said that it is a dependent of the artificial word root (Nivre, 2009).


the entire group of words “financial markets” (phrase or chunk) without
affecting the overall syntactic structure of the sentence. Endocentric
constructions generally satisfy all the above criteria. However, criterion (iv)
is usually considered less relevant, as dependents are always
optional in such constructions.

By contrast, the Prepositional Complement (PC) relation holding between H
(preposition “on”) and the dependent (noun “markets”) is an exocentric
construction in which the head cannot substitute for the entire phrase (on financial
markets). Such constructions fail to meet criterion (ii), at least with
respect to the substitutability of the head for the whole construction (phrase or
chunk), but they may satisfy the rest of the criteria. Further, the subject (SBJ) and
object (OBJ) relations are clearly exocentric, while the remaining ATT relations
(effect → little, effect → on) have a less clear status.
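As a minimal sketch (not part of the original analysis), the labeled arcs discussed above can be stored as (head, label, dependent) triples. The arcs below are reconstructed from the relations named in the text for Figure.4; the label ROOT for the arc leaving the artificial root, the use of word forms as node names, and the helper names are assumptions made for illustration only.

```python
# Arcs as (head, label, dependent); "root" is the artificial root word.
# Word forms serve as node names, which only works because no word repeats.
ARCS = [
    ("root", "ROOT", "had"),          # label ROOT is an assumption
    ("had", "SBJ", "news"),
    ("news", "ATT", "economic"),
    ("had", "OBJ", "effect"),
    ("effect", "ATT", "little"),
    ("effect", "ATT", "on"),
    ("on", "PC", "markets"),
    ("markets", "ATT", "financial"),
]


def head_map(arcs):
    """Map each dependent to its (head, label); reject a second head."""
    heads = {}
    for head, label, dep in arcs:
        if dep in heads:
            raise ValueError(f"{dep!r} has more than one head")
        heads[dep] = (head, label)
    return heads


def path_to_root(word, heads):
    """Follow head links upward; every real word should reach 'root'."""
    path = [word]
    while path[-1] != "root":
        parent = heads[path[-1]][0]
        if parent in path:
            raise ValueError("cycle detected")
        path.append(parent)
    return path


HEADS = head_map(ARCS)
print(path_to_root("financial", HEADS))
```

The two checks correspond to the requirements mentioned in footnote 15: every real word has exactly one syntactic head, and the verb “had” depends on the artificial root rather than lacking a head.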

The contrast between endocentric and exocentric constructions is also
related to the contrast between head-complement and head-adjunct relations. The
former relations (preposition-noun) are exocentric, whereas the latter relations
(adjective-noun) are endocentric. A third type, the head-specifier relation
(determiner-noun), is also exocentric like the head-complement relation,
but there is no clear selection of the dependent element by the head. The contrast
between complements and adjuncts (modifiers) is often defined in terms of
valency, which is a central notion in the theoretical tradition of dependency
grammar. The notion of valency was originally taken from chemistry. It is
usually related to argument structure16. The idea is that the verb ‘H’ imposes
certain requirements on its syntactic dependents that reflect its interpretation as a
semantic predicate. The nouns (Ds) which are the arguments of a predicate (and can
be obligatory or optional in surface syntax) can occur only once with each
predicate, whereas the ‘Ds’ which are adjuncts (and tend to be optional) can occur
more than once with a single predicate. The valency frame of the verb (predicate)
is generally considered to include the ‘Ds’ which are arguments, not adjuncts.
Therefore, in Figure.4, the (SBJ) “news” and (OBJ) “effect” would
generally be considered valency-bound ‘Ds’ of the ‘H’ “had”, while adjectival

16 Argument Structure is inherent property of certain classes of lexemes, particularly verbs (also for nouns and adjectives). The argument structure of verb is called predicate argument

structure.


modifiers of the ‘Hs’ “news” (economic) and “markets” (financial) would be

considered as valency-free ‘Ds’.
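The distinction between valency-bound and valency-free dependents can be illustrated with a small sketch. The valency frame given for “had” (SBJ and OBJ) follows the discussion above; the function name and the simplification that every label outside the frame counts as valency-free are assumptions, not claims about any particular annotation scheme.

```python
from collections import Counter

# Hypothetical valency frames: the argument labels each head licenses.
VALENCY_FRAME = {"had": {"SBJ", "OBJ"}}


def split_dependents(head, dependents):
    """Split (label, word) pairs into valency-bound and valency-free words.

    An argument label may be filled at most once per head, while
    adjunct (valency-free) labels may occur any number of times.
    """
    frame = VALENCY_FRAME.get(head, set())
    counts = Counter(label for label, _ in dependents if label in frame)
    repeated = [label for label, n in counts.items() if n > 1]
    if repeated:
        raise ValueError(f"argument label(s) filled more than once: {repeated}")
    bound = [word for label, word in dependents if label in frame]
    free = [word for label, word in dependents if label not in frame]
    return bound, free


bound, free = split_dependents("had", [("SBJ", "news"), ("OBJ", "effect")])
print(bound, free)
```

Calling the function for “news” with its single ATT dependent returns that dependent as valency-free, since no frame is listed for the noun.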

While head-complement and head-modifier structures have a fairly

straightforward analysis in dependency grammar, there are also many

constructions that have a relatively unclear status. This group includes

constructions that involve function words, such as articles, complementizers and

auxiliary verbs, and apart from that, structures involving prepositional phrases.

For these constructions, there is no general consensus in the tradition of

dependency grammar as to whether they should be analyzed as head-dependent

relations at all and if so, what should be regarded as the head and what should be

regarded as the dependent. For example, some theories regard auxiliary verbs as

heads taking lexical verbs as dependents; other theories make the opposite

assumption; and yet other theories assume that verb chains are connected by

relations that are not dependencies in the usual sense. Another kind of

construction that is problematic for dependency grammar (as for most theoretical

traditions) is coordination. According to Bloomfield (1933), coordination is an

endocentric construction, since it contains not only one but several heads that can

replace the whole construction syntactically. However, this characterization raises

the question of whether coordination can be analyzed in terms of binary

asymmetrical relations holding between a head and a dependent.

6. Chapterization

This dissertation consists of seven chapters. The chapters that follow this introduction are:

Chapter.2: Review of Existing Literature

The chapter surveys the existing literature on various grammar formalisms and

treebanking. It presents a historical view on dependency parsing, tracing its roots

in Indian, Semitic and Hellenic traditions. Some brief history of treebanking is

also traced out. It attempts a link between these old grammatical traditions and the

contemporary practice of natural language parsing & treebanking.

Chapter.3: Developing KashCorpus

The chapter begins with the introduction of philosophical grounds that underlie

the current corpus based research. It gives a brief account of language and other

computational resources that have been developed for Kashmiri. Finally, the

chapter investigates the problems of KashCorpus collection, development,

sanitization & normalization.


Chapter.4: POS Tagging of KashCorpus

The chapter discusses the building of the fundamental layer of annotation for the

dependency treebank of Kashmiri, i.e. parts-of-speech tagging of the selected

portion of Kashmiri corpus. Further, a brief review of various POS tagging

frameworks and tagsets that have been developed for English and Indian

Languages is given. The various issues that have been encountered in the

annotation process and the empirical results are presented in this chapter.

Chapter.5: Chunking of KashCorpus

The chapter discusses the second layer of annotation for building the dependency

treebank of Kashmiri, i.e. the chunking of POS annotated KashCorpus. It presents

the detailed description of various chunks found in Kashmiri. It further gives a

detailed account of various issues and also presents the empirical results.

Chapter.6: Syntactic parsing of KashCorpus

The chapter discusses dependency annotation of the chunked KashCorpus in

detail and presents a detailed account of the dependency treebank of Kashmiri

(KashTreeBank). Further, the language related issues which have been raised

during the annotation process are also discussed. Finally, the results of inter-

annotator agreement are also presented.

Chapter.7: Conclusion

It presents a conclusion of all the research presented in this dissertation.

Chapter.2 Review of Existing Literature

‘Would you tell me, please, which way I ought to go from here?’
‘That depends a good deal on where you want to get to,’ said the Cat.
‘I don't much care where,’ said Alice.
‘Then it doesn't matter which way you go,’ said the Cat.
‘So long as I get somewhere,’ Alice added as an explanation.
‘Oh, you're sure to do that,’ said the Cat, ‘if you only walk long enough.’

Carroll, 2003

1. Introduction

This chapter surveys the existing literature on grammar formalisms,
dependency parsing and treebanking. The chapter is organized in nine sections.
Section two presents various relational-structure (dependency) based grammar
formalisms for treebanking. Section three discusses various modifications to the
notion of VP that account for non-configurationality and justify the use of
dependency-based formalisms. Section four views dependency grammar
from a historical perspective, tracing its roots in ancient and medieval times.
Section five presents the rationale for using DG. Section six describes the notion
of treebanking. Section seven presents the principles involved in treebanking.
Section eight gives a brief account of some dependency treebanks. Finally,
section nine summarizes the chapter.

2. Grammar Formalisms

There is a very close relationship between grammar formalism, syntactic parsing,
syntactic annotation and treebanking. In fact, a treebank is a product of the syntactic
parsing and annotation of a natural language corpus, based on a given grammar
formalism or simply a grammatical model. The syntactic annotation for building a
treebank can be carried out manually, automatically or semi-automatically.
The term 'parsing' is derived from the Latin phrase pars orationis, meaning
“part of speech”. The term refers to both the synthetic (bottom-up)17 and
the analytical (top-down)18 approaches of inquiry into natural language syntax.
In CL and NLP literature, the former is commonly known as dependency based
parsing (DBP), which addresses the following research questions: a. How do
words combine to form sentences? b. How does the bottom-up approach to parsing

17 This bottom-up approach is widely used in Europe (by linguists in Germany, France, Scandinavia, Czechoslovakia, Russia), and by Russians and Slavists in USA (Mel'cuk,

Shaumyan, Nichols). From concrete data (empirical) to abstract categories (rational) or simply from data to theory

18 In 1930 Leonard Bloomfield in the USA developed a top-down approach: Immediate-Constituent Analysis (which turned into PSG, TGG, X-bar Syntax and Minimalism), largely

inspired by the German psychologist Wundt - Percival, "On the Historical Source of Immediate Constituent". From abstract categories (rational) to concrete data (empirical) or

simply from theory to data


help in understanding the nature of language? c. How does the bottom-up approach
facilitate the annotation and capture of grammatical knowledge and ensure
its role in developing real world computational tools and applications? The latter
approach is known as constituency based parsing (CBP), which addresses
similar research questions: a. How is a sentence broken into smaller units
like clauses, phrasal nodes and then terminal nodes (words)? b. How does the
top-down approach of analysis help in understanding the nature of language, and
how does it ensure its role in developing real world computational tools and
applications? Both these approaches include some notion of relational structure,
but it is described in different ways (C. Bosco & V. Lombardo, 2004). Since the
notions of dependency or relational structure are used in the current work,
constituency based formalisms such as PSG, GB & Minimalism are not dealt with
here. However, the notions of grammatical relations and predicate argument
structure are given a proper treatment.

There are several approaches in the literature to explain the grammatical
relationships (GRs) in a clause. These approaches posit GRs as semantic roles, which
include verb-specific roles, e.g. Runner, Killer and Bearer; thematic roles, e.g. Agent,
Patient, Theme, Instrument and Experiencer; and generalized roles like Actor and
Undergoer (Dowty, 1982; Van Valin, 1999). Marantz (1984) describes GRs as the
syntactic counterparts of certain logico-semantic relations such as the predicate-subject
and modifier-modifiee relations. Rappaport & Levin (1988) describe GRs in terms of
purely syntactic relations (SUBJ, DOBJ and IOBJ) and thematic roles. However, the
status of thematic roles (as purely semantic or syntacto-semantic) and the identification of
an appropriate inventory of semantic GRs are not very clear (Leech et al., 1996). It gets
more complicated when we see that purely syntactic relations may bear thematic roles. For
instance, in the sentence “the garden is swarming with vipers,” the subject coincides with
the thematic role of locative instead of the more expected agent (Renzi, 1988). There are no
clear-cut one-to-one correspondences between syntactic relations and semantic roles, and most
theories of grammatical relations distinguish between purely syntactic relations and semantic roles.
The distinction between syntactic and semantic relations with some independence from
morphology is not new and can be traced back to Panini’s Karaka theory. The six Karakas are
semantic relations (agent, object, instrument, destination, source and locus) which are assigned to
the nouns governed by a verb. However, a universally accepted inventory of semantic relations,
also known as thematic or theta-roles, does not yet exist.


2.1. Dependency Grammar (DG)

In contrast to constituency, dependency is a vertical organizational principle

that shows binary asymmetrical relation between a head and its dependents19

(Kruijff, 2002). The basic idea of dependency grammar is that the syntactic

structure is a flat (with no non-terminals) and rooted structure called Stemma

which consists of lexical elements linked by binary asymmetrical relations called

dependencies. The variants of DG which are briefly reviewed here are Structural

Syntax (SS), Functional Dependency Grammar (FDG), Word Grammar (WG),

Meaning Text Theory (MTT) and Paninian Computational Grammar (PCG).

These variants share the major tenets of dependency and propose relation-based

structures for language representation.

i. Structural Syntax (Tesnière, 1959)

It adheres to the long standing notion that syntax is a matter of combinatory

requirements or capabilities of words (i.e. their valency). The fundamental

syntactic building block of the sentence is considered to be a word (token) which

is linked to other words (directly or indirectly) by means of the dependency

relations (for details see Chapter 1).

The main idea behind Tesnière’s model is the notion of dependency which

identifies the syntactic relation existing between two elements within a sentence,

one of them taking the role of governor (or head) and the other of dependent

(régissant and subordonné in the original terminology). He schematizes this

syntactic relation using a tree diagram called Stemma.

In his scheme all words are divided into two classes: full content words

(e.g., nouns, verbs, adjectives, etc), and empty functional words (e.g. determiners,

prepositions, etc). Each full word forms a block which may additionally include

one or more empty words and it is on blocks that operations are applied. He

distinguishes four block categories (or functional labels); nouns, adjectives, verbs

and adverbs. Also, a distinction is made between Actants and Circumstants. The

verb represents the process or state expressed by the clause and all its actants

(representing the participants) are determined by the valence of the verb and have

the functional labels of nouns. On the other hand, the verb’s Circumstants

(representing the circumstances under which the process is taking place, i.e. time,

19 The alternative terms used in the literature are modifier or child for dependent, and modified, governor, regent or parent for head.


manner, location, etc) have the functional labels of adverbs. There are two

operations, Junction and Transference, by means of which it is possible to

construct more complex clauses from simple ones. The junction is employed to

group blocks which are at the same level, i.e. Conjuncts, into a unified entity by

itself attaining the status of a block. The conjuncts belong to the same category

and are horizontally connected (though not always) by means of empty words called

the conjunctions. There are two types of transference operations. The first degree

transference is a changing process which makes a block to change its original

category. This process occurs by means of one or more empty words belonging to

the same block called transferrers. For instance, the category of the word ‘rotten’ in

the construction “rotten food” is transferred from verb to the functional label of

an adjective through the transferrer, the perfective participle -en. The second degree

transference occurs when a simple clause becomes an actant or a circumstant of

another clause, maintaining all its previous lower connections, but changing its

functional label within the main clause.

For example:

1. She believes that he knows it.

2. The man I saw yesterday is here today.

3. You will see him when he comes.

In sentence 1, we have a verb-noun transference by means of the transferrer
‘that.’ The embedded clause (he knows it) takes the functional label of a noun and
becomes the object of the verb. The embedded clause in sentence 2 (I saw yesterday)
is a verb-adjective transference without any transferrer. The temporal clause in sentence
3 is an example of verb-adverb transference where the transferrer is ‘when.’

Actants (arguments) are immediately dominated by the verb and represent the

entities involved in the event, described by the verb (obligatory to fill the valence

frame of the verb). The Circumstants (adjuncts) instead express the bystander’s

role in the event (optional). The first actant corresponds to the Arg-1 (SUBJ), the

second to the Arg-2 (DOBJ), the third to the Arg-3 (IOBJ), as in RG. In SS, the

verbal valency also motivates this sorting of actants. The first actant can be found

in mono/bi/tri-valent verbal nodes (that can take one, two or three actants), the

second only in the bi/tri-valent nodes (that can take two or three actants) and the third

only in the trivalent nodes (that can take three actants). Dependency relations are

annotated to make the function of the nodes explicit. The words of a sentence


together with their dependency relations form the dependency graph in which the

information regarding the dependency structure is explicit while the

information regarding the constituent structure is implicit e.g. a node X with the

sub-tree attached to it can represent the constituent headed by X (X-phrase) and

can express all the important properties of the constituent. Therefore, a sentence

structure can be described as consisting of structural nodes organized

hierarchically by the nodal functions and held together by structural connections.

A structural node is a group of words consisting of only one head and one or more

sub-ordinate words. It is this head of the structural node which carries out the

nodal function. The structure and the meaning of the sentence are theoretically

independent but parallel as the structural connections match with the semantic

connections to negotiate the meaning. In fact, a structural connection is usually

motivated by a semantic connection, i.e. two words are linked by a structural

connection in order to make their semantic connection explicit. Just as the head of

the structural node bears the nodal function, the head of the semantic node bears

the semantic function.
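The observation that constituent structure is implicit in a dependency graph (a node X together with its sub-tree represents the X-phrase) can be sketched as follows. The example reuses the English words of the Chapter 1 illustration; the word order list, the head-to-dependents table and the function names are assumptions introduced purely for illustration.

```python
# Surface word order and a head -> dependents table (illustrative assumptions).
WORDS = ["economic", "news", "had", "little", "effect",
         "on", "financial", "markets"]
CHILDREN = {
    "had": ["news", "effect"],
    "news": ["economic"],
    "effect": ["little", "on"],
    "on": ["markets"],
    "markets": ["financial"],
}


def subtree(node):
    """Return the node together with every word it dominates."""
    nodes = {node}
    for child in CHILDREN.get(node, []):
        nodes |= subtree(child)
    return nodes


def phrase(node):
    """The implicit constituent headed by `node`, in surface order."""
    members = subtree(node)
    return [word for word in WORDS if word in members]


print(phrase("on"))
print(phrase("effect"))
```

Under these assumptions, the phrase headed by “on” is recovered as the three words of the prepositional phrase, and the phrase headed by the root verb is the whole sentence.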

ii. Functional DG (Tapanainen & Järvinen 1997, Tapanainen 1999)

It is a computational implementation of Tesnière (1959), describing the Structural

Syntax (SS) through formal rules. FDG posits that the basic elements of the

syntactic structure are the Nuclei which have mutual connections and every

Nucleus has only one head. The relationship between structure (i.e. syntax) and

semantics is evident in the notion of Nucleus20 which encompasses both the

structural and the semantic node. Since there is a close parallelism between

syntax and semantics, i.e. the syntactic structure depends on the semantic

interpretation rather than on word order or morphological marking, the variation

in word order does not affect the structural analysis of the sentence. The basic

element of FDG is the nucleus which consists of tokens which are words or parts-

of-words of the input sentence. Here, a distinction is made between valency

functions which are unique in the Nucleus (actants) and ambiguous functions

(circumstants).

iii. Word Grammar (Hudson 1990, Hudson 1984)

20 The nucleus is a unique head which coincides with the head of the structural as well as the semantic node and consequently bears both the semantic and the nodal functions. Otherwise the nucleus is said to be dissociated. The most typical example of a dissociated nucleus is the verb group consisting of an auxiliary verb and a main verb; the former bears the nodal function, while the latter bears the semantic function.


WG, primarily developed for English, is a monostratal, non-transformational approach
which uses word-to-word dependencies to show grammatical relations/functions
by explicit labels, e.g. SUBJ and OBJ. It includes two main inheritance
hierarchies: the system of word-classes, which also includes all lexemes and
inflections, and the system of dependency types or grammatical functions. WG
presents language as a network of knowledge, where all the areas of knowledge
are included with no clear-cut boundaries between the ‘internal’ and ‘external’
facts about words.

iv. Meaning Text Theory (Mel’cuk 1988)

MTT was primarily developed for Russian. It provides a rich representation and
analysis of a variety of aspects of language. Natural language is posited as a
logical device that establishes correspondences between the infinite set of possible
meanings and the infinite set of possible texts. The representation of the sentence
consists of various separate components, in particular a semantic component and
a deep-syntactic component. By performing several operations, the semantic component
establishes the correspondence between a sentence and all its synonymous
sentences. The deep-syntactic component establishes the correspondence between
the various syntactic realizations of a sentence.

v. Paninian Computational Grammar (Bharati et al., 1993)

PCG, which is itself a variant of dependency grammar (Kiparsky & Staal, 1969;
Shastri, 1973), has been used for syntactic annotation in the current work. This
model helps to capture the syntacto-semantic relations in a sentence. A sentence is
considered as a series of modifier-modified relations with a primary modified element,
the main verb, which is the root of the dependency tree. The elements which modify the verb
are its arguments, which participate in the action specified by the verb. The relations
of these participants (arguments) with the verb are called Karakas.

2.2. Relational Grammar (RG)

RG is basically motivated by the idea that positing grammatical relations in terms
of linear constituent order and their domination by the VP node seems to be
inadequate for VSO languages like Welsh, in which there can be no VP node
(Perlmutter, 1983), and for free word-order languages like Czech (Dowty, 1982).

RG is primarily concerned with capturing the pure grammatical relations

that constitute predicate argument structure (syntactic) and other relations

(semantic) that are not related to the core arguments (Perlmutter, 1980). The


former includes the three pure grammatical relations S (subject), DO (direct
object) and IO (indirect object), or 1, 2 and 3, respectively. The numbers (1, 2 & 3)21
are assigned to posit a hierarchical organization which is motivated by the
behavioral properties of the relations (head vs. dependent). The latter includes a
set of impure grammatical relations (oblique objects, OO) that have independent
semantic content like Instrumental, Locative, Benefactive, etc. The NPs which are
labeled with pure grammatical relations are called Terms, while the other
NPs/PPs which are labeled with impure grammatical relations (i.e. with their
semantic functions) are called Non-Terms. Figure.1 is a Relational Network which
posits the relational structure of a clause as an abstract universal representation
that remains constant in spite of cross-linguistic morpho-syntactic variations,
i.e. a clause in another language involving the same predicate and the same
participants will be represented by the same relational network.

Figure 1: The Relational Network Showing Dative Shift

RG assumes a universal mapping between thematic and grammatical relations

known as Universal Alignment Hypothesis (UAH): the agent maps on Argument 1

(John), the patient or theme maps on Argument 2 (the book), and the recipient on

Argument 3 (Mary) but the surface form of the clause does not always correspond

to the UAH, for instance, Passive and Dative-Shift constructions. In these cases

several syntactic layers (Strata) are proposed and the surface syntactic form of the

clause is derived through a series of transformations that generate a syntactic form

consistent with UAH.

21 Johnson (1977) used the terms S, DO and IO to describe grammatical relations, but Perlmutter (1980) used the simple numbers 1, 2 & 3.


Figure 2: The Relational Network of Passive Construction

The initial stratum in Figure.2 represents the underlying syntactic structure,
which corresponds exactly to the active form of the clause (John eats the apple),
where the UAH holds because the Agent is 1 (John) and the Theme is 2 (the
apple). But when the passive rule is applied, there is a transformation
which produces a second stratum where the initial 1 loses its syntactic role and
becomes a chomeur22, and the initial 2 becomes 1 (2-to-1 advancement). Since the semantic
relations remain unchanged from the initial stratum, the agent is mapped in this last
stratum to a chomeur, whilst the theme is mapped to 1, thus contrasting with the
UAH. In the representation of dative shift (see Figure.1), by comparing the
initial stratum with the final one, we can observe a phenomenon referred to as 3-to-2
advancement. The recipient, which corresponds to 3 in the initial stratum,
becomes 2 in the final one; consequently, the initial 2 loses its role and becomes a chomeur.
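The two revaluations described above can be sketched as functions from one stratum to the next. This is only an illustration under simplifying assumptions: a stratum is modeled as a dictionary from nominals to relations, the string "chomeur" marks the chomeur relation, and the function names are hypothetical rather than RG terminology.

```python
CHOMEUR = "chomeur"


def passive(stratum):
    """2-to-1 advancement: the initial 2 becomes 1, the initial 1 a chomeur."""
    result = dict(stratum)
    for nominal, relation in stratum.items():
        if relation == 1:
            result[nominal] = CHOMEUR
        elif relation == 2:
            result[nominal] = 1
    return result


def dative_shift(stratum):
    """3-to-2 advancement: the initial 3 becomes 2, the initial 2 a chomeur."""
    result = dict(stratum)
    for nominal, relation in stratum.items():
        if relation == 2:
            result[nominal] = CHOMEUR
        elif relation == 3:
            result[nominal] = 2
    return result


# Initial strata follow the UAH: agent = 1, theme = 2, recipient = 3.
active_initial = {"John": 1, "the apple": 2}
dative_initial = {"John": 1, "the book": 2, "Mary": 3}

print(passive(active_initial))
print(dative_shift(dative_initial))
```

Each function reads the relations from the input stratum and writes them to a copy, so the initial stratum stays available for comparison with the final one, much as the relational networks in Figure.1 and Figure.2 keep both strata visible.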

2.3. Lexical Functional Grammar (LFG)

LFG posits a flat (VP-less) structure for many VSO languages (Kroeger, 1993)
and “free-word order” or “non-configurational” languages like Warlpiri (Simpson
1991). In LFG, the grammatical relations are termed functions (Bresnan, 1982;
Bresnan and Kaplan, 1982). Since LFG does not adhere to the notion of an underlying
abstract syntactic representation and transformational rules, it posits a
representation where the lexicon plays a key role. It postulates three distinct but
interrelated levels of grammar which co-occur in a single representation: Lexical
Structure (LS), Functional Structure (FS) and Constituency Structure (CS). The
LS captures the information about the meaning of the lexical items, the semantic
roles constituting predicate argument structure, and the grammatical functions like
Subject (SUBJ) and Object (OBJ) that are associated with the arguments through
the Lexical Assignment (LA). The LA states that each argument is assigned

22 The term ‘chomeur’ is the French word meaning a jobless person, indicated by * in the clause.


a unique grammatical function (GF). GFs are assigned at the Lexicon-Syntax interface.
For instance, the transitive verb (kick) has a predicate argument structure
that consists of an Agent associated with the SUBJ function and a Theme associated
with the OBJ function. The other levels of the representation, as shown in Figure
3, are called f-structure and c-structure, respectively. Constituency relations vary
cross-linguistically and across constructions within a single language, while the
syntactic functions are universal (invariant) and are represented in a universal format.
Therefore, a number of different c-structures can have a single f-structure,
and it would be possible to derive an f-structure from a c-structure but not vice
versa. The inventory of grammatical functions is different from that of RG. LFG
distinguishes between sub-categorizable (governable) functions, which can be
part of a verb's sub-categorization, like SUBJ, OBJ1 (direct object), OBJ2 (indirect
object), OBL (oblique) and POSS (possessor), and non-sub-categorizable
(non-governable) functions like AJT (adjunct), syntactic FOCUS and TOPIC. Among
the sub-categorizable functions, SUBJ, OBJ1 and OBJ2 are semantically unrestricted,
i.e. they can bear a variety of semantic functions, while OBL and POSS are
semantically restricted and can bear only some particular semantic function. The
non-sub-categorizable functions are used to refer to adjuncts, to the discourse
functions indicating an entity that has already been established in the discourse
context (topic), or to the information about some topical participant that is new in
the context (focus).

Figure. 3: The f-structure of the sentence “A boy handed the teacher a gift.”


LFG represents the f-structure of the sentence in terms of an attribute-value
matrix (as shown in Figure 3) and the c-structure as an augmented constituency tree.
The relationship between c-structure and f-structure is represented by adding
functional information on the tree edges.
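An attribute-value matrix can be approximated by nested dictionaries, as in the following sketch for “A boy handed the teacher a gift.” Since the content of Figure 3 is not reproduced in the running text, the attribute names PRED, TENSE and SPEC, the lexical entry for the verb and the function name are assumptions; the grammatical function labels (SUBJ, OBJ1 for the direct object, OBJ2 for the indirect object) follow the inventory given in this section. The check implements the standard LFG well-formedness conditions of completeness and coherence in a simplified form.

```python
# f-structure as nested dictionaries (attribute names are assumptions).
F_STRUCTURE = {
    "PRED": "hand",
    "TENSE": "past",
    "SUBJ": {"PRED": "boy", "SPEC": "a"},
    "OBJ1": {"PRED": "gift", "SPEC": "a"},        # direct object
    "OBJ2": {"PRED": "teacher", "SPEC": "the"},   # indirect object
}

# Hypothetical lexical entry: governable functions selected by the verb.
SUBCAT = {"hand": ["SUBJ", "OBJ1", "OBJ2"]}
GOVERNABLE = {"SUBJ", "OBJ1", "OBJ2", "OBL", "POSS"}


def well_formed(fs):
    """Check completeness and coherence of an f-structure.

    Completeness: every function selected by the predicate is present.
    Coherence: no governable function is present without being selected.
    """
    required = set(SUBCAT[fs["PRED"]])
    present = {attr for attr in fs if attr in GOVERNABLE}
    return required <= present and present <= required


print(well_formed(F_STRUCTURE))
```

Because dictionary keys are unique, this encoding also reflects the requirement that each grammatical function is assigned to a single argument.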

The notion of grammatical relation occupies a central role in LFG in determining
which of the arguments is semantically selected by a predicate or syntactically
realized, and how. In particular, the lexical level plays a central role through the
mechanism of sub-categorization. The similarities between LFG and RG lie in
the inventory of basic syntactic relations, the relevance of the semantic
level of the sentence, and the use of a form of relational structure where grammatical
relations are the interface between the syntactic and semantic levels of the sentence.
There are also many differences between these approaches: LFG is monostratal
and non-transformational, and it gives a central role to lexical sub-categorization.
Moreover, LFG clearly represents the syntactic interrelation between relational
structure and constituent structure, by assigning a c-structure to the sentence and
assuming that there is some f-structure associated with each node in the c-structure.

3. (Non-) Configurationality and DG

Although the tradition of using syntactic models in linguistics can be traced back to Panini’s work (3rd century BC), the question of which grammatical model one should use is still open. To address this problem, i.e. which framework/formalism one should use for treebank annotation, certain typological features of a language must be taken into account. Such features have very strong repercussions on the encoding of grammatical relations in different languages. Following a similar view, Chomsky (1981) and Hale (1982, 1983) divided the languages of the world into configurational and non-configurational languages (Covington, 1990). Hale (1982, 1983) put forward the following general diagnostic criteria to check whether a language is non-configurational:

i) Variable word-order

ii) Lack of pleonastic NPs (expletives)

iii) Extensive null anaphora (pro-drop)

iv) Syntactically discontinuous constituents

v) Lack of NP movement (passive, raising, etc)


vi) Use of rich case-system

These criteria were later attested by Farmer (1984), Jelinek (1984), Mohnan (1983), Webelhuth (1984) and Speas (1990). On the basis of these criteria, several languages were claimed to be non-configurational, at least to some extent. These languages include Japanese (Chomsky 1981), Warlpiri (Hale 1982), German (Haider, 1982), Hungarian (Kiss, 1987), Hindi (Mohnan, 1983), Kashmiri (Raina, 1991), etc. Such languages meet most of the aforementioned criteria, if not all. On the other hand, some languages like English and French do not meet any of the above criteria. It can be argued that English is totally devoid of such properties whereas Warlpiri possesses most or all of them. However, it appears that there is no clear-cut division of languages into configurational and non-configurational; rather, languages tend to form a continuum from completely-fixed-word-order languages to completely-flexible-word-order languages, with no sharp transition from one type to another (Siewierska, 1988). In this continuum, Warlpiri lies at one extreme (non-configurational) and English at the other (configurational), while the rest of the languages lie in between, each being configurational to a greater or lesser degree. This holds not only for the concept of non-configurationality as a whole but also for the individual properties mentioned above. For instance, the nature of word-order (i.e. completely free/completely rigid) is not a categorical property cross-linguistically. If we correlate these six properties (typological variables) with one another, we can find a high correlation between “variable word-order” and “rich case system”, indicating that these properties go hand in hand in characterizing a language. It is in fact the rich overt case system that allows flexibility in word-order. Therefore, some languages have fixed word-order and others flexible word-order, but this fixity or flexibility is itself a relative rather than a categorical property. What is clear, however, is that the division does exist, though not very sharply, as some languages tend to be fixed word-order languages while others tend to be more variable.

Nevertheless, it has been argued that most languages have partly variable word-order (Covington, 1990). Fixed word-order languages resist scrambling (change in word-order), since it can lead to information distortion (a change in propositional semantics), as is evident from sentences i, ii & iii. Sentence-i carries the proper semantic information, whereas sentence-ii is syntactically anomalous, violating the default SVO word-order of English; it can nevertheless be read as a stylistic or pragmatic variant of sentence-i (in terms of topic-focus or information structure), as the propositional information is still intact. Sentence-iii, however, is syntactically well formed but semantically anomalous, violating the sub-categorization rules of the noun (football) and leading to information distortion.

For example:

i) S [[The fat boy] [[kicked] [a football]]]

ii) S [[kicked] [a football] [The fat boy]]

iii) S [[a football] [kicked] [The fat boy]] **

On the other hand, completely free-word-order languages (e.g. Warlpiri) do not conform to the typical English-type hierarchical clause structure (SUBJ-OBJ asymmetry), i.e. SUBJ as an external (higher) argument and OBJ as an internal (lower) argument, as shown below in a PS rule:

1. S = [NP1 + VP], where VP = V+(OBJ NP2)

We can say that, unlike English, such a language lacks VP and has a flat clause structure (SUBJ-OBJ symmetry), as shown in the following PS rule:

2. S = [NP1 + V + NP2], where both the NPs (1 & 2) are symmetrical.
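The contrast between PS rules 1 and 2 can be made concrete by encoding both clause structures as nested tuples and comparing the depth at which the two NPs sit. This is a minimal sketch; the tuple encoding and the helper names `depth_of` and `is_symmetric` are illustrative assumptions, not part of any formalism discussed here.

```python
# Trees are nested tuples of the form (label, child, ...); leaves are strings.
hierarchical = ("S", ("NP1", "the fat boy"),
                ("VP", ("V", "kicked"), ("NP2", "a football")))   # PS rule 1
flat = ("S", ("NP1", "the fat boy"), ("V", "kicked"),
        ("NP2", "a football"))                                    # PS rule 2


def depth_of(tree, label, depth=0):
    """Return the depth at which `label` first occurs, or None if absent."""
    if not isinstance(tree, tuple):
        return None
    if tree[0] == label:
        return depth
    for child in tree[1:]:
        found = depth_of(child, label, depth + 1)
        if found is not None:
            return found
    return None


def is_symmetric(tree):
    """True when NP1 and NP2 sit at the same depth (no SUBJ-OBJ asymmetry)."""
    return depth_of(tree, "NP1") == depth_of(tree, "NP2")
```

Under rule 1 the object NP is embedded one level deeper than the subject NP, whereas under rule 2 both are immediate daughters of S.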

Moreover, VSO languages (e.g. Semitic & Celtic languages) are also considered problematic for the universal appeal of the notion of VP (Perlmutter, 1983 and Dowty, 1982). There are two approaches to explaining word-order: the Parameter Approach (Chomsky 1991, 1993, 1994, 1995), which holds that languages are partly defined by the head parameter that sets the position of the head of a phrase as either initial or final, and the Universal Base Hypothesis (Kayne, 1994 and Zwart, 1997), which posits that SVO is the basic canonical word-order, underlying even VSO languages and other free-word-order languages. The reason given is that SVO is in consonance with the basic tenets of the X-bar schema (or that the X-bar schema has been optimized for SVO or SOV languages). The notion of basic order refers to the ordering of elements in the representation that expresses the basic meaning relations between the elements in the deep structure (Chomsky 1957). These relations are expressed by an interaction of theta theory and X-bar theory, in which the complement of a verb appears as the sister of V and the subject of V appears as the sister of VP. The perceived word-order in the surface structure of a sentence often deviates from the basic ordering (ibid). However, if a language has only VSO sentences, the basic word-order never surfaces, but all the variations in word-order can be accounted for by a series of movements. It is argued that the importance of the X-bar schema is not only that it regularizes structure but also that the structure defined by it conveys meaning. In the traditional and intuitive sense, a verb has a complement; likewise, the combination of a verb and a complement (VP) is a predicate, requiring a subject. These notions of complement and subject are defined in structural terms: a complement is a sister of a head (V) and a subject is a sister of a predicate (VP). The hypothesis that the function of a noun phrase is defined by its hierarchical position in the syntactic structure is part of the theta theory of generative grammar. Nevertheless, one can raise many questions about this hierarchical structure and the SUBJ-OBJ asymmetry, such as: why is only SVO considered a basic word-order? Why can’t we consider VSO basic and then explain constituent structure? Why can’t there be a clause structure like the following?

3. S = [VP + NP2], where VP = V+SUBJ.NP1

Many such questions have been addressed quite elaborately in various formulations that subscribe in one way or another to X-bar theory. For instance, a distinction was made between the internal (OBJ) and external (SUBJ) arguments of a verb (Williams, 1981). Internal arguments were further divided into direct and indirect objects (Marantz, 1984). The external argument was considered to be generated outside the base (i.e. not dominated by the VP projection), and the internal arguments were the only arguments generated by the base (i.e. dominated by the VP projection). This raised the question of whether the distinction between arguments (external vs. internal) was legitimate, given that both are arguments of the verb. Consider Marantz’s famous idiom-based argument for the internal argument, according to which idioms are formed only with the OBJ and not with the SUBJ, e.g. VP (kick the bucket). The VP-internal Subject Hypothesis (VISH) (Fukui & Speas 1986, Kitagawa 1986, Kuroda 1988, Koopman & Sportiche 1991) showed that these problems could be overcome if we assume that subjects are base-generated as a specifier in the VP and then raised to the specifier of IP. According to VISH, the external argument is like the other arguments of the verb in that it is generated in the domain of its Theta-licenser. As the previous notions of VP were totally changed with VISH, VP-shells were introduced (Larson 1988).

According to the VP-shell formulation, the thematic elements are generated in the lower VP and an empty ‘shell’ of a VP is generated on top of the thematic VP. This theory also helps to maintain the binary branching structure for the dative shift/to-dative constructions and double object constructions (DOC) in English. The Minimalist Programme also maintains that if a verb has several internal arguments, a Larsonian VP-shell must be postulated (Chomsky 1995).

Moreover, if all phrases were required by X-bar notation to have a specifier, why was VP exempted? Why did Spec of IP receive a dual characterization, i.e. sometimes as a Case position in object-raising as in passives and sometimes as a Theta-position (a base-generated position) for the external argument? Even if we look at the nature of VP without X-bar notation, phrases tend to be homogeneous, discrete, and perceptually compact & closed syntactic units that pass the substitution test of constituency (e.g. NP, AdjP, AdvP), but VP as shown in PS rule (1) is a heterogeneous and perceptually non-compact & open syntactic unit with one or two NPs embedded in it. Further, adhering to the notion of constituent structure (with or without X-bar notation) amounts to ignoring potential semantic cues in the constructions, even in variable word-order languages, where there is, more or less, a well-defined system of case markers or pre/post-positions to represent semantic roles. In such cases, a single-layered representation of syntax and semantics is quite possible. However, the representation of syntax (case relations) and semantics (thematic relations) has been a long-standing issue in theoretical syntax, as clearly stated below:

“One of the most important research questions in the history of generative

grammar has been the determination of the domains in which Case and theta

theory apply as distinct, related or disjoint. The main concern is whether

Case is parasitic or derivative of thematic configurations or whether Case

and thematic relations involve different projections/configurations

altogether. Although the research tradition has settled for the disassociation

approach, it has met with variable degrees of success in achieving a complete

severance of the domains in which theta and Case are assigned.” (Richa

2011)

Finally, it can be argued that non-configurational or relatively variable word-order languages can be explained without positing VP (as in X-bar) by positing instead a bare Verb (V) or Verb Group (VG) with symmetrical arguments (SUBJ.NP1 & OBJ.NP2), organized in a flat structure, as shown above in PS rule 2. Such a treatment of these languages is given by dependency theory, which posits a flat organization of verbal arguments and does not assume any notion of deep structure, surface structure or any kind of derivation through movement. As the name indicates, variable word-order languages are non-positional languages, and their arguments and adjuncts mostly carry overt case markers. Hence, the position of the verbal arguments/adjuncts in the sentence does not matter. Their relation with the verb is determined by the morpho-syntactic or semantic cues carried by the case markers or pre/post-positions, not by their position in the construction. For instance, Indian languages (e.g. Hindi, Urdu, Gujarati, Punjabi, Bangla and Kashmiri), Semitic languages (e.g. Arabic and Hebrew) and Slavic languages (e.g. Czech and Russian) are relatively variable word-order languages. They allow scrambling of their constituents without affecting the propositional information of the sentence.
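The claim that relations are read off case markers rather than positions can be illustrated with a toy example. The sketch below uses a deliberately simplified Hindi sentence (raam ne siitaa ko dekhaa, ‘Ram saw Sita’) in which the postposition ne marks the subject and ko marks the object; the two-entry lookup table and the function name `assign_roles` are assumptions made for illustration only and ignore everything else about Hindi case marking.

```python
# Toy mapping from postposition to grammatical relation (illustrative only).
CASE_TO_ROLE = {"ne": "SUBJ", "ko": "OBJ"}


def assign_roles(tokens):
    """Assign a role to each nominal from the marker that follows it,
    ignoring where the nominal stands in the sentence."""
    roles = {}
    for i, token in enumerate(tokens):
        if token in CASE_TO_ROLE and i > 0:
            roles[CASE_TO_ROLE[token]] = tokens[i - 1]
    return roles


canonical = "raam ne siitaa ko dekhaa".split()
scrambled = "siitaa ko raam ne dekhaa".split()
```

Both orders yield the same role assignment, which is the sense in which scrambling leaves the propositional information intact.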

4. A Historical View of DG

Although the interdisciplinary fields of Computational Linguistics (CL) and Natural Language Processing (NLP) are a gift of the modern technological era, the origin of syntactic parsing (syntactic analysis) of natural language, which forms the backbone of various NLP systems, can be traced back to antiquity. The present notions of syntactic parsing and the existing grammar formalisms are the outcome of the accumulation of vast grammatical knowledge which originated in ancient, medieval and modern grammatical traditions all over the world. Butt (2005) gives an elaborate account of various grammatical traditions. The next subsections give a brief history of the notion of dependency analysis and its roots in different grammatical traditions, based on Miriam Butt’s and Svetoslav Marinov’s accounts:

4.1. Indian Tradition (350-250 B.C)

The earliest traces of syntactic analysis can be found in Panini’s grammatical sketch of Sanskrit (350-250 B.C), which was based on a long-standing linguistic tradition in India, rooted in the Vedas of about 500 years earlier (Kruijff, 2002). It falls within the realm of dependency grammar. Panini’s grammar consists of four modules that account for different aspects of language separately, as given in Table 1:

The module called Ashtaadhyaayii deals with the derivation of sentence structure. The derivation of a sentence starts from the semantic level and ends with the formation of the phonological form (Itkonen, 1991). The lexicon contains verbal and nominal stems. The sentence derivation begins with choosing the lexical items from the lexicon and deciding on the karaka-relations that hold between the verbal root and the nominal roots. So, only verbs and nouns play a primary role in sentence construction, and the rest of the parts-of-speech (POS) play secondary or tertiary roles. This is, in fact, the simplest way of representing the sentence structure of a language, particularly in Indian languages. Therefore, in the Paninian perspective, to construct the skeletal structure of a sentence, we primarily need events (or actions/states) and entities. Other elements like modifiers (verbal and nominal) can also be incorporated in the construction, but only to add different semantic shades to the primary predication. As such, they have the least role in the basic syntactic skeleton of a sentence. It is evident that the Paninian relational view is primarily focused on the Verb-Noun relations and the linking case markers/vebhakti.

S. NO.  MODULES         DESCRIPTION                              COVERAGE
01      Ashtaadhyaayii  Describe syntactic rules                 4000 (Approx.)
02      Dhaatupaatha    Describe verbal roots with their         2000 (Approx.)
                        morpho-phonemic and morpho-syntactic
                        properties
03      Ganapaatha      An inventory of lexical items            261 (Approx.)
04      Shivasuutras    Describe the segmental phonology

Table.1. Four Modules of Paninian Grammar (Kiparsky 2002)

The karakas are the six primary syntacto-semantic roles that the nominal roots (arguments) play for their verbal root in well-formed sentential constructions (Kiparsky, 2002). The six karakas are karta (Agent), karma (Goal), sampradaana (Recipient), karana (Instrument), adhikarana (Locative), and apaadaana (Source). The well-formedness of a sentence is assured only when each of the participating nominal entities is assigned a syntacto-semantic role. Karakas act as mediators between the semantic level and the morpho-syntactic level of the sentence structure, subject to the following two constraints (Kiparsky 2002, p. 16):

i) Every karaka must be morpho-syntactically realized (in the form of vebhakti/case marker/postposition).

ii) No karaka may be realized by more than one morphological form/element.


These two constraints can play a pivotal role in establishing a simple mapping

schema between karakas and vebhakti which can prove instrumental while

developing any formalism based on Paninian view.
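The two constraints lend themselves to a mechanical check. The following is a minimal sketch, assuming that an analysis is given as a mapping from each karaka to the list of markers realizing it; the marker names `m1` and `m2` are placeholders and the function `check_karaka_constraints` is a hypothetical helper, not part of any existing annotation tool.

```python
# The six karakas named in the text.
KARAKAS = {"karta", "karma", "sampradaana", "karana", "adhikarana", "apaadaana"}


def check_karaka_constraints(assignments):
    """assignments: karaka -> list of markers (vebhakti) that realize it.
    Returns a list of violations; an empty list means both constraints hold."""
    errors = []
    for karaka, markers in assignments.items():
        if karaka not in KARAKAS:
            errors.append(f"{karaka}: not one of the six karakas")
        elif len(markers) == 0:
            # Constraint (i): every karaka must be realized.
            errors.append(f"{karaka}: not morpho-syntactically realized")
        elif len(markers) > 1:
            # Constraint (ii): at most one realizing form per karaka.
            errors.append(f"{karaka}: realized by more than one form")
    return errors


well_formed = {"karta": ["m1"], "karma": ["m2"]}
ill_formed = {"karta": [], "karana": ["m1", "m2"]}
```

Taken together, the two checks enforce a one-to-one realization of each karaka, which is what makes a simple karaka-to-vebhakti mapping schema feasible.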

To sum up, the Paninian perspective of syntactic analysis (traditional parsing) highlighted some key notions/relations prevailing in contemporary parsing formalisms that fall within the dependency framework, like Meaning Text Theory (Mel’cuk, 1988). Such notions include binary relations holding between a verb root and nominal roots (Itkonen, 1991; Kiparsky, 2002), the rootedness of a sentence, which is due to the central role of the verb (Itkonen, 1991), and the labeled relations (six syntacto-semantic roles; k1, k2…… k6), which are binary in nature (Misra, 1966; Itkonen, 1991; Kiparsky, 2002). It is worth mentioning that in Mel’cuk’s Meaning Text Theory there are also six syntacto-semantic relations (actants; a1, a2…, a6), labeled with the digits 1, 2, 3, 4, 5, 6, like the six Paninian karakas.

4.2. Hellenic Tradition (100 B.C)

Traces of syntactic analysis can also be found in the Greek grammatical tradition (GrGT). In this period, there were two schools, Logicians & Grammarians, involved in the study of the word-classes or parts-of-speech of Greek. Consequently, there were two different views regarding the POS of Greek. The Logicians (Plato, Aristotle, etc) were involved in the analysis of the proposition into logical parts (subject & predicate), so they recognized only two POS categories (V and N). Grammarians like Dionysius Thrax, who wrote Techne, a grammatical sketch of Greek (100 B.C), recognized eight POS categories (Verb, Noun, Adjective, Adverb, Pronoun, Preposition, Conjunction & Particle), an inventory which is still a model for POS tag-sets or word-class classifications across the grammatical traditions of the world.

In the works of the Stoics (300-150 B.C) we find traces of the modern notion of dependency. The Stoics were concerned with the analysis of the spoken utterance, lekton, ‘the thing said’ (Lepschy, 1994). They considered a predicate like graphei, ‘writes’, an incomplete lekton which requires a nominal of some sort to perform the act of writing and become a complete lekton, or an axioma. Ineke Sluiter writes in (Auroux et al, 2000):


“The predicate was called an ‘incomplete lekton’ with a number of slots that

need filling ....” (ibid. p. 378) and “... they (the Stoics) describe interaction of

bodies as occurring in relation to lekta …” (ibid. p. 384).

In the works of Apollonius Dyscolus (200 A.D) we find a more straightforward reference to the notion of dependency. In his view, adverbs complement or diminish the meaning of the verb and are attached to verbs. While adverbs require the verb, verbs do not necessarily require an adverb (Percival, 1990). Both of the authors distinguish between major word-classes (verbs and nouns) and minor word-classes, where the latter serve to support or circumscribe the former. Apollonius, for example, regarded some words as naturally more closely related than others. Prepositions preceded nouns and had to be construed with them. Articles related to nouns, and nouns related to verbs. Conjunctions could not bind a noun and a verb. “In some of these relations there are clear indications of what we now call dependency.” (Lepschy, 1994, p. 99).

The logician Boethius (480-524/6 A.D) was the first to introduce a special term for the supportive function of the minor word-classes (Percival, 1990). In his work on Aristotle’s On Interpretation, he referred to quantifiers, syncategorematic words, as determinations (specifiers). In his De Divisione, he developed the notion of specification further to include not only quantifiers but also words from other word-classes. His term determinatio is generic and refers to the relation of all minor word-classes to the corresponding major word-classes, adding an idea of semantic specification.

In Priscian’s Latin Grammar (500 A.D), which is based on Apollonius’ ideas, rudiments of dependency analysis can be found. According to him, lexis or diction-words are ‘the smallest part of a connected sentence’ (Lepschy, 1994). One word is put in construction with (construitur cum) or requires (exigit) another (Covington, 1984). Given that, a very long sentence can be reduced or collapsed to a very short one, consisting only of a noun and a verb.

The question which worried the grammarians was which of the two elements, Noun or Verb, is logically prior. Ancient grammarians generally considered the noun as prior to the verb, but many Greek and Latin verbs in the first and second person singular mark the subject morphologically. Both Percival (1990) and Lepschy (1994) find support for the view that the verb is prior to the noun, since one can omit the subject in such cases.

To sum up, many of the ideas discussed in the works of ancient grammarians and logicians can be subsumed under the modern understanding of dependency. These include rootedness (i.e. the question of the priority of the noun or the verb), head-modifier relations (e.g. the adverb-verb relation), analysis in terms of words only, as well as a term for the head-dependent relation (determinatio).

4.3. Arabic Tradition (798-928 A.D)

It is in the Arabic Linguistic Tradition (ArLT) that we find the first systematic treatment of syntax, based on the concepts that form the core of contemporary dependency grammar (Bohas et al 1990; Owens, 1988). Siibawaihi (793 A.D) was the main grammarian in ArLT and his seminal work, Al-Kitaab (The Book), is considered the core of Arabic grammatical thought (Itkonen, 1991). He recognized only three parts-of-speech: Nouns (which include adjectives, pronouns, active & passive participles), Verbs and Particles. According to him:

i. Verbs are primarily governors but can be governed by particles.

ii. Particles (i.e. prepositions) can be non-governors or governors of Nouns or Verbs, but they can never be governed.

iii. Nouns can never govern but they can be governed by Verbs or Particles.

The governor-dependent scheme proposed by Siibawaihi accounts for many verbal sentences (i.e. verb + noun), the general principle being that “A unit may govern more than one unit; but it can be governed only by one unit” (Itkonen, 1991, p. 136). It is worth mentioning that nominal sentences (i.e. noun + noun) are not analyzed explicitly in terms of dependency but rather in terms of Topic-Comment. For example:

1) zayd-un   rajul-an
   Topic     Comment
   ‘Zayd is a man’

2) kaana   zayd-un    rajul-an
   was     Zayd-NOM   man-ACC
   ‘Zayd was a man’

However, Siibawaihi proposed a covert auxiliary to account for the dependency structure of such sentences, as in example 2. As far as the positing of a covert element is concerned, Itkonen (1991) considers this to support a transformational grammar approach. Mel’cuk (1988), however, similarly assumes an empty category/element to be the head in copula constructions (N+N cases) of Russian.
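Siibawaihi's governance rules and the one-governor principle can be restated as a small licensing check. This is only a sketch under stated assumptions: the table `CAN_GOVERN` is read off rules i to iii (the Verb-governs-Noun entry is inferred from rule iii rather than quoted), and `check_government` is a hypothetical helper name.

```python
# Which word-class may govern which, as read off rules i-iii.
# Verb -> Noun is an inference from rule iii, not a direct quotation.
CAN_GOVERN = {
    "Verb": {"Noun"},
    "Particle": {"Noun", "Verb"},
    "Noun": set(),
}


def check_government(pos, links):
    """pos: word -> word-class; links: list of (governor, dependent) pairs.
    Returns a list of violations (empty when the analysis is licensed)."""
    errors = []
    governor_of = {}
    for gov, dep in links:
        if pos[dep] not in CAN_GOVERN[pos[gov]]:
            errors.append(f"{pos[gov]} '{gov}' cannot govern {pos[dep]} '{dep}'")
        if dep in governor_of:
            # "A unit ... can be governed only by one unit."
            errors.append(f"'{dep}' has more than one governor")
        governor_of[dep] = gov
    return errors


pos = {"kaana": "Verb", "zayd-un": "Noun", "rajul-an": "Noun"}
licensed = [("kaana", "zayd-un"), ("kaana", "rajul-an")]
unlicensed = [("zayd-un", "rajul-an"), ("kaana", "rajul-an")]
```

In example 2 the overt verb kaana governs both nouns, which the check accepts; a link from one noun to the other is rejected, which is why the verbless sentence in example 1 needs either a Topic-Comment analysis or a covert governor.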


Some scholars have found support only for dependency analyses in the ArLT (Owens, 1988), but others hold that the syntactic analysis of both nominal and verbal sentences, as proposed by Siibawaihi and his followers, is essentially a Bloomfieldian type of IC analysis (Carter, 1973). Itkonen (1991), however, maintains a moderate stance by assuming that Siibawaihi and his followers operated with the two notions which are today known as dependency and constituency, depending on the type of the given structure. Following Owens (1988) and Itkonen (1991), it can be said that the Arab grammarians anticipated the modern definition of dependency. They differentiated between aamil (head) and macmuul (dependent). Single-headedness and Projectivity were two main principles explicitly present in their grammatical analyses.
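In present-day terms, the two principles can be stated over a set of head-dependent arcs. The sketch below assumes that tokens are numbered from 1 and that an artificial root at position 0 governs the main verb; projectivity is then characterized as the absence of crossing arcs, which is a standard formulation in the modern literature and not a claim about how the Arab grammarians themselves stated it. The function names are illustrative.

```python
from collections import Counter


def is_single_headed(arcs):
    """arcs: list of (head, dependent) positions. Each dependent has one head."""
    counts = Counter(dep for _, dep in arcs)
    return all(count == 1 for count in counts.values())


def is_projective(arcs):
    """True when no two arcs cross (root arc from position 0 included)."""
    spans = [tuple(sorted(arc)) for arc in arcs]
    return not any(
        l1 < l2 < r1 < r2 for l1, r1 in spans for l2, r2 in spans
    )


# Four tokens; token 2 is the root, token 3 depends on token 4.
projective_tree = [(2, 1), (0, 2), (4, 3), (2, 4)]
# Token 1 depends on token 3 across the root token 2.
crossing_tree = [(3, 1), (0, 2), (2, 3)]
```

The second tree is single-headed but not projective, since the arc from token 3 to token 1 crosses the arc from the artificial root to token 2.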

4.4. European Tradition (1260-1310 A.D)

According to Covington (1984), the earliest rudiments of dependency analysis in the European Linguistic Tradition (EuLT) are found in the works of the Modistae (1260-1310 A.D). The modistic grammar known as “Grammatica Speculativa” describes how whole sentences can be built up by concatenating words. The terms Suppositum (subject) & Appositum (predicate) were used to denote the syntactic function of the two parts of a basic sentence, the nominal & the verbal (Robins, 1997). The process of sentence formation is divided into three successive steps (Covington, 1984):

i) Constructio: - It involves establishing links between the words.

ii) Congruitas: - It involves application of three well-formedness conditions

on the links.

a. There should be compatibility (agreement) of the modes of

signifying.

b. Every dependens should have a terminans.

c. A suppositum and appositum of finite mood should appear in the

sentence.

iii) Perfectio: - It involves a final check on whether the sentence is complete.

Within each construction there are two grammatical relations:

a) Primum-to-Secundum: Secundum presupposes the presence of Primum23.

23 For Covington (1984) the Primum-to-Secundum relation corresponds to the current notion of dependency.


b) Dependens-to-Terminans: Terminans presupposes the presence of

Dependens24. Dependens is an ‘unsaturated’ element while a terminans

is the element which ‘saturates’ (Lepschy, 1994).

It has been argued that the Dependens–Terminans relation is an extension of Petrus Helias’ concept of Regimen, which according to Law (2003) is actually the concept of Government, where one word forces another to be in a particular form (Covington, 1984).

Still, one big difference from modern dependency theories is that the root node of a dependency graph would typically be the subject nominal for the Modistae, while according to the contemporary formalization it should be the finite verb in the clause.

For the Modist Martin of Dacia (1304 AD), there was only one Primum in the whole sentence and this was the subject (Covington, 1984). Later on, this idea was replaced with a model in which a Primum & a Secundum were identified in every construction, although the criteria for differentiating between the two were not entirely clear (Covington, 1984). For instance, the verb was considered Secundum in subject-verb constructions but Primum in verb-object constructions. Entities & Substances were considered prior to their attributes and therefore Primum. Certain constructions, like coordination and subordinate clauses, however, posed problems for the Modistae, as it is difficult to identify a single element as the Primum in them. According to Svetoslav Marinov (MS.), the two sets of relations, Primum-to-Secundum and Dependens-to-Terminans, receive contradictory interpretations in the literature. It is nevertheless clear that dependency-like analyses were central to the syntactic theory of the Modistae. As mentioned above, some sort of head-dependent dichotomy was present, along with the notion of a root node in the sentence.

Later on in the European Grammatical Tradition, from the mid 14th century AD up to the mid 20th century AD, there was hardly any grammatical notion related to dependency. The only references available in the literature are based on Kruijff (2002) and Percival (1990). However, the main dependency grammar,

24 Covington (1984) and Robins (1997) explicitly point out that the Dependens-to-Terminans relation should not be confused with the present-day notion of dependency. Percival (1990), however, considers the Dependens-to-Terminans dichotomy to correspond to the modern notion of dependent-head asymmetry, and also considers this relation to be another way of capturing Boethius’ notion of “Determination”.


“Elements de syntaxe structurale” (Lucien Tesnere 1959), was published posthumously in French. A number of scholars like Mel’cuk (1988), Graffi (2001), Nivre (2005), etc have summarized the key notions of his work in English, which is instrumental in understanding a work that is otherwise very difficult.

5. Why Dependency Grammar?

According to Covington (2001), the constituency-based approach appears to have been invented only once, by the ancient Stoics, and to have been passed through formal logic to modern linguists. The dependency-based approach, on the other hand, appears to have been invented many times in many places (Covington 2001). Nevertheless, the constituency-based discourse has overshadowed every other view of syntactic representation. Mel’cuk argues that the constituency-based approach is particularly suitable for English, which was the mother tongue of its founding fathers (Mel’cuk, 1988). Furthermore, Mel’cuk summarizes a few reasons why the dependency model is preferable:

i. A phrase-structure tree focuses on the grouping of words, i.e. which words go together in the sentence, but does not give a representation of the relations between the words.

ii. A dependency tree is based on relations. It shows which words are related

and in what way. The sentence is “built out of words, linked by

dependencies”. The relations could be described in more detail by giving

them meaningful labels.

iii. A dependency tree also represents grouping. A phrase is represented by a

word and its entire sub-tree of dependents.

iv. In a phrase-structure tree usually most nodes are nonterminal, representing

intermediate groupings. A dependency tree consists of only terminal

nodes. There is no need for abstract representation of grouping.

v. In a phrase-structure tree the linear order of the nodes is relevant. It must

be kept to retain the meaning of the sentence. In a dependency tree this is

not important. All information is preserved in the, possibly labeled,

connections.
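Points ii to iv can be illustrated with a small dependency analysis of the earlier English example. The sketch below is an assumption-laden illustration: the relation labels (subj, obj, det, mod) are hypothetical, words are identified by their surface form (so it only works when no word is repeated), and the helper names `subtree` and `phrase` are invented for this example.

```python
sentence = ["The", "fat", "boy", "kicked", "a", "football"]

# Labeled arcs as (head, dependent, label); labels are hypothetical.
arcs = [
    ("kicked", "boy", "subj"),
    ("boy", "The", "det"),
    ("boy", "fat", "mod"),
    ("kicked", "football", "obj"),
    ("football", "a", "det"),
]


def subtree(head, arcs):
    """Collect a word together with all of its direct and indirect dependents."""
    words = [head]
    for h, dep, _label in arcs:
        if h == head:
            words.extend(subtree(dep, arcs))
    return words


def phrase(head, arcs, sentence):
    """Recover the phrase headed by `head`, listed in surface order."""
    members = set(subtree(head, arcs))
    return [word for word in sentence if word in members]
```

Every node in this structure is a word of the sentence, and the groupings that a phrase-structure tree would mark with nonterminal nodes are recoverable as subtrees: the subtree of boy yields the subject phrase, and the subtree of kicked yields the whole sentence.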

6. Notion of Treebanking

The improvement in natural language parsing during the last two decades has generally been attributed to the emergence of statistical and machine learning approaches (Collins, 1999; Charniak, 2000). However, such approaches became possible only with the availability of large-scale machine-readable syntactic trees, either handcrafted or automatically generated and manually corrected. The art or science of crafting or generating and organising machine-readable syntactic trees is called treebanking. In the next sub-sections, the concept of a treebank, the principles of treebanking and a review of various dependency treebanks are given.

6.1. Some Background

The term ‘treebank’ was probably introduced by Geoffrey Leech (Sampson 2003). The pioneering work in treebanking started in the early 1970s in Sweden with the inception of Talbanken25 (Teleman 1974; Einarsson, 1976), which was developed at Lund University by manually annotating a Swedish corpus with phrase structure and grammatical functions. However, serious work in this area started in the 1980s, as recounted by Fred Jelinek of IBM in his Lifetime Achievement Award talk at the Association for Computational Linguistics (ACL):

“We were not satisfied with the crude n-gram language model we were using

and were “sure” that an appropriate grammatical approach would be better.

Because we wanted to stick to our data-centric philosophy, we thought that

what was needed as training material was a large collection of parses of

English sentences. We found out that researchers at the University of

Lancaster had hand-constructed a “treebank” under the guidance of

Professors Geoffrey Leech and Geoffrey Sampson (Garside, Leech, and

Sampson 1987). Because we wanted more of this annotation, we

commissioned Lancaster in 1987 to create a treebank for us. Our view was

that what we needed above all was quantity, possibly at some expense of

quality …… We wanted to extract the grammatical language model

statistically and so a large amount of data was required.” (Marcus, 1995)

It was in fact the Linguistic Data Consortium (LDC), established at the University of Pennsylvania, that started massive and sophisticated efforts in developing treebanks for European languages. These efforts were later extended to non-European languages as well. So, there are Penn treebanks for various languages, like the Penn English Treebank (Marcus et al. 1993), the Penn Arabic Treebank (Maamouri et al. 2004), the Penn Chinese Treebank (Xue et al., 2004), etc. The Penn English Treebank is one of the largest and most widely used English-language treebanks and has contributed greatly to the creation of important English NLP resources. Moreover, it is well-documented and the documentation is

25 Talbanken was recently reconstructed into Talbanken-05 (Nivre et al. 2006).


freely available; consequently, it provides a solid template methodology for

researchers attempting to produce treebanks in other languages. Similar efforts

were made at Charles University in Prague and elsewhere, and various dependency

treebanks were created. They include the Prague Dependency Treebank of Czech

(Hajicova & Hajic, 1998; Böhmova et al. 2003), the Turkish Treebank (Oflazer et al.

2003), the Danish Dependency Treebank (Kromann, 2003) and the Turin University

Treebank of Italian (Bosco & Lombardo 2004). The AnnCorra Treebank (Bharati et al.

1995) is a similar effort for Indian languages at the LTRC Lab and,

at present, work is going on in some major Indian languages, e.g. Hindi, Telugu

and Urdu (R. Begum et al., 2008; Vempaty et al., 2010; R. Bhat 2012). Moreover, the

Bangla Treebank26 has been constructed at IIT Kharagpur (S. Chatterji et al.,

2009) and efforts to develop a dependency treebank for Kashmiri

(KashTreeBank27) have already been initiated (S. Bhat, 2012).

6.2. What is a Treebank?

A treebank is a corpus annotated with skeletal syntactic information, such as

POS labels at the word level and syntactic labels beyond the word level (Kristin

Jacque, 2006). More broadly, a treebank is a text corpus annotated with syntactic, semantic and

sometimes even inter-sentential relations (Hajicova et al., 2010). It is essentially

a machine readable repository of annotated syntactic structures of a language that

predominantly serves as a bank of training & testing data for the development of

various computational tools and applications that use some form of supervised

learning, e.g. deep syntactic parsers, chunkers and POS taggers. Although the term

‘treebank’ initially referred to a bare collection of syntactic trees, its

contemporary usage has been extended to corpora with all kinds of structural

annotations, such as constituent structure, functional structure, or predicate-

argument structure (Nivre 2005; Smedt & Volk 2005). Currently, treebanks are

augmented with different types of structural representations, and restricting a

treebank to a single type of structural representation no longer reflects the state of the art.

However, a basic skeletal treebank is a prerequisite for any further

augmentation, such as multiple representations. Earlier treebanking efforts were based

on manual annotation, which is laborious, time-consuming and error-prone.

Such limitations in the manual annotations have led to the development of several

26 Funded by Linguistic Data Consortium for Indian Languages (LDCIL)

27 KashTreeBank started as a summer school project in IIIT Hyderabad Advanced Summer School for Natural Language Processing (IASNLP 2011)


alternative approaches, such as automatic annotation or automatic conversion, but these

approaches do not work for resource-poor languages like Kashmiri.

We have to rely on manual annotation first, as no previous treebank resources

are available for Kashmiri. Starting from scratch with manual methods

is therefore unavoidable until sufficient resources have been created to train a parsing system

for automatic annotation. Meanwhile, a large number of treebanks have been

developed and many are currently under construction. Many treebanks adopt

formats similar to those of the major treebanks, and new models are rarely

devised. For instance, the English dependency treebank (Rambow et al., 2002)

follows the model of the Prague Dependency Treebank but uses a mono-layered

representation centered on the notion of predicate-argument structure instead of

the multi-layered approach of Prague. Similarly, the Spanish Treebank adheres to the

model of the Penn Treebank (Moreno et al., 2000).

7. Dependency Treebanks: A Brief Review

Many languages have relatively free word order, and dependency-based annotation

schemes are generally preferred for treebanking such languages. This is one

reason for the ever-expanding number of dependency treebanks across the

world. Several of these dependency treebanks28 are

briefly reviewed here:

7.1. Prague Dependency Treebank (PDT)

The PDT for Czech is the largest of the existing dependency treebanks. Its

corpus has been annotated on the basis of a multi-layer annotation scheme

consisting of a morphological layer, an analytical (i.e. syntactic) layer and a tecto-

grammatical (i.e. semantic) layer (Hajic, 1998; Bohmova and Hajicova 1999;

Böhmova et al., 2003). It consists of approximately 90,000 sentences from newspaper

articles on diverse topics (e.g. politics, sports, culture) and texts from popular

science magazines, selected from the Czech National Corpus (T. Kakkonen,

2006). There are 3,030 morphological tags in the morphological tagset (Hajic,

1998). The syntactic annotation comprises 23 dependency relations. The

annotation for the three levels has been done separately, by different groups of

annotators. The morphological tagging was performed by two human annotators

28 Treebanks given here are mainly taken from (Kakkonen, 2006)


selecting the appropriate tag from a list proposed by a tagging system. A third

annotator then resolved any differences between the two annotations. The

syntactic annotation was at first done completely manually with the help of

ambiguous morphological tags and a graphical user interface. After the annotation

of about 19,000 sentences, the Collins Lexicalized Stochastic Parser (Nelleke et al.,

1999) was trained on the data and reached about 80% accuracy. Thereafter, the work

of the annotators changed from building the trees from scratch to post-editing

(checking and correcting) the parses assigned by the parser, except for the

analytical functions, which still had to be assigned manually. Other

treebank projects use the same framework developed for the PDT. For

instance, the Prague Arabic Dependency Treebank (Hajic et al., 2004) is a treebank of

Modern Standard Arabic, consisting of around 49,000 tokens of newswire texts

from the Arabic Gigaword and the Penn Arabic Treebank. The Slovene Dependency

Treebank consists of around 500 annotated sentences obtained from the

MULTEXT-East Corpus (Erjavec, 2005).
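
The double-annotation workflow described above for the morphological layer (two annotators choose a tag, a third resolves the differences) can be illustrated with a minimal Python sketch. It is not the tooling used by the PDT project: the function names, the four-token sentence and the tag strings are all hypothetical and serve only to show how the tokens needing adjudication could be collected.

```python
from typing import List, Tuple


def find_disagreements(tokens: List[str],
                       tags_a: List[str],
                       tags_b: List[str]) -> List[Tuple[int, str, str, str]]:
    """Return (position, token, tag_a, tag_b) wherever the two annotators differ."""
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("tokens and both tag sequences must have the same length")
    return [(i, tok, a, b)
            for i, (tok, a, b) in enumerate(zip(tokens, tags_a, tags_b))
            if a != b]


def observed_agreement(tags_a: List[str], tags_b: List[str]) -> float:
    """Share of tokens on which both annotators chose the same tag."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag sequences must be non-empty and of equal length")
    same = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return same / len(tags_a)


# Hypothetical four-token sentence with made-up tags.
tokens = ["w1", "w2", "w3", "w4"]
annotator_a = ["N", "V", "ADJ", "N"]
annotator_b = ["N", "V", "ADV", "N"]

# Only these items would be passed on to the third annotator.
to_adjudicate = find_disagreements(tokens, annotator_a, annotator_b)
agreement = observed_agreement(annotator_a, annotator_b)
```

In such a setup only the items in `to_adjudicate` need the attention of the third annotator, which keeps the adjudication effort proportional to the amount of disagreement.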

7.2. Russian Dependency Treebank

The Dependency Treebank for Russian is based on the Uppsala University Corpus

(Lonngren, 1993). The texts have been collected from contemporary Russian

prose, newspapers, and magazines (Boguslavsky et al., 2000; 2002). The treebank

consists of about 12,000 annotated sentences. There are 78 syntactic relations

(divided into 6 subgroups, e.g. attributive, quantitative, and coordinative). The

annotation is layered, in the sense that different levels of annotation are

independent and can be extracted or processed independently. The treebank has

been developed semi-automatically: a morphological analyzer and a

syntactic parser (Apresjan et al., 1992) produced the initial analyses, which were then post-edited.

7.3. Italian Dependency Treebank

The Italian Dependency Treebank is known as the Turin University Treebank. It consists

of 1,500 sentences, divided into 4 sub-corpora (Bosco, 2000; Lesmo et al., 2002;

Bosco and Lombardo, 2003). The majority of the text comes from the civil law code and

newspaper articles. The annotation format is based on the Augmented Relational

Structure (ARS). The POS tagset consists of 16 categories and 51 subcategories.

There are around 200 dependency types, organized in 5 levels. The scheme

provides the annotator with the possibility of marking a relation as under-

specified if a correct relation type cannot be determined. The annotation process


consists of automatic tokenization, morphological analysis, POS disambiguation

and syntactic parsing (Lesmo et al., 2002).

7.4. German Treebank

The German treebank is known as the TIGER Treebank (Brants et al., 2002). It was developed on the

basis of the NEGRA Corpus (Skut et al., 1998) and consists of complete articles

covering diverse topics collected from a German newspaper. It contains

approximately 50,000 sentences. It combines phrase structure and

dependency annotation in such a way that phrase categories are marked on

non-terminals, POS information on terminals and syntactic functions on the edges.

The syntactic annotation is rather simple and flat29.
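
The hybrid organization described above (phrase categories on non-terminals, POS tags on terminals, syntactic functions on edges) can be pictured with a small Python sketch. This is an illustrative data structure only, not the TIGER project's own format: the class and function names are hypothetical, and the example sentence with its category, POS and function labels is supplied here purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Terminal:
    word: str
    pos: str  # POS information lives on the terminal node


@dataclass
class NonTerminal:
    category: str  # phrase category lives on the non-terminal node
    # Each child is reached through an edge labelled with a syntactic function.
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


Node = Union[Terminal, NonTerminal]


def yield_words(node: Node) -> List[str]:
    """Collect the words under a node in left-to-right child order."""
    if isinstance(node, Terminal):
        return [node.word]
    words: List[str] = []
    for _function, child in node.children:
        words.extend(yield_words(child))
    return words


def functions_of(node: NonTerminal) -> List[str]:
    """Edge labels (syntactic functions) of the immediate children."""
    return [function for function, _child in node.children]


# Illustrative flat analysis of "Die Frau liest das Buch"; labels are assumptions.
subject = NonTerminal("NP", [("NK", Terminal("Die", "ART")),
                             ("NK", Terminal("Frau", "NN"))])
obj = NonTerminal("NP", [("NK", Terminal("das", "ART")),
                         ("NK", Terminal("Buch", "NN"))])
sentence = NonTerminal("S", [("SB", subject),
                             ("HD", Terminal("liest", "VVFIN")),
                             ("OA", obj)])
```

Note how the sentence node has no intermediate verb phrase: subject, head verb and object all hang directly from the same non-terminal, which is the kind of flat structure the footnote refers to.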

7.5. English Dependency Treebank

The Dependency Treebank of English consists of dialogues between a travel

agent and customers (Rambow et al., 2002), and is the only dependency treebank

with spoken language annotation. The treebank has about 13,000 words. The

annotation is a direct representation of lexical predicate-argument structure, thus

arguments and adjuncts are dependents of their predicates and all function words

are attached to their lexical heads. The annotation is done at a single, syntactic

level, without a separate representation of surface syntax, the aim being to keep the

annotation process as simple as possible. The trained annotators have access to an

on-line manual and work from the transcribed speech without access to the speech

files. The dialogues are parsed with a dependency parser, the Supertagger and

Lightweight Dependency Analyzer (Bangalore and Joshi, 1999). The annotators

correct the output of the parser using a graphical tool developed by the Prague

Dependency Treebank project.

7.6. Basque and Danish Dependency Treebanks

The Basque Dependency Treebank (Aduriz et al., 2003) consists of 3,000

manually annotated sentences from newspaper articles. The syntactic tags are

organized as a hierarchy. The annotation is done with the aid of an annotation tool that offers

tree visualization and automatic checking of tag syntax.

The annotation of the Danish Dependency Treebank is based on

the Discontinuous Grammar formalism, which is closely related to Word Grammar

(Kromann, 2003). The treebank consists of 5,540 sentences covering a wide range

of topics. The morpho-syntactically annotated corpus is obtained from the PAROLE

29 Note that hierarchical structure has been avoided and flat structure has been preferred in order to reduce the number of attachment ambiguities.


Corpus (Keson and Norling-Christensen, 2005), thus no morphological analyzer

or POS tagger is applied. The dependency links are marked manually by using a

command-line interface with a graphical parse view.

7.7. Turkish Dependency Treebank

The Turkish treebank is known as the METU-Sabanci Turkish Treebank30. It consists of 5,000

morphologically and syntactically annotated sentences. The treebank is represented in the

XML-based Corpus Encoding Standard format (Anne and Romary, 2003). Due to the

morphological complexity of Turkish, morphological information is encoded as

sequences of inflectional groups (IGs). An IG is a sequence of inflectional

morphemes, divided by derivation boundaries. The dependencies between IGs are

annotated with the following ten link types: subject, object, modifier, possessor,

classifier, determiner, dative adjunct, locative adjunct, ablative adjunct, and

instrumental adjunct. The annotation is done in a semi-automated fashion, though

a lot of manual work is also involved. First, a morphological analyzer based on the

two-level morphology model (Oflazer, 1994) is applied to the texts. The

morphologically analyzed and pre-processed text is input to an annotation tool.

The tagging process requires two steps: morphological disambiguation and

dependency tagging. The annotator selects the correct tag from the list of tags

proposed by the morphological analyzer. After the whole sentence has been

disambiguated, dependency links are specified manually.
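
The idea of inflectional groups separated by derivation boundaries can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the boundary marker `^DB`, the `+`-separated feature notation, the function name and the sample analysis string are all assumed here and are not taken from the treebank documentation.

```python
from typing import List, Tuple


def split_inflectional_groups(analysis: str,
                              boundary: str = "^DB") -> Tuple[str, List[List[str]]]:
    """Split one morphological analysis into its root and its inflectional groups.

    The analysis is assumed to look like 'root+Feat+Feat^DB+Feat+Feat', where
    the boundary marker separates consecutive inflectional groups (IGs).
    """
    first, *rest = analysis.split(boundary)
    root, _, first_features = first.partition("+")
    groups = [first_features.split("+") if first_features else []]
    groups += [part.strip("+").split("+") for part in rest]
    return root, groups


# Hypothetical analysis string in the assumed notation: one derivation, two IGs.
root, igs = split_inflectional_groups("ev+Noun+A3sg+Pnon+Loc^DB+Adj+Rel")
```

Since dependency links are drawn between IGs rather than between whole words, a word with several IGs can take part in several links, for instance receiving dependents on an earlier IG while its last IG attaches to the head.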

7.8. Danish, Portuguese and Estonian Treebanks

The Danish, Portuguese and Estonian treebanks are called Arboretum, Floresta

Sintactica and Arborest, respectively. They are sibling treebanks, of which

Arboretum is the oldest. The treebanks are hybrids with both constituent and

dependency annotation organized into two separate levels. The levels share the

same morphological tagset. The dependency annotation is based on

Constraint Grammar (CG) (Karlsson, 1990) and consists of 28 dependency types. For

creating each of these treebanks, a CG-based morphological analyzer and

parser has been applied. The annotation process consisted of CG parsing of the

texts followed by conversion to constituent format, and manual checking of the

structures. The Danish treebank (Bick, 2003; Bick, 2005) has around 21,600 sentences

annotated with dependency tags, and of those, 12,000 sentences have also been

marked with constituent structures. The annotation is in both TIGER-XML and

30 The corpus for the treebank was obtained from the METU Turkish Corpus (Atalay et al., 2003), hence, the name of the treebank.


PENN export formats. The Portuguese treebank (Afonso et al., 2002) consists of

around 9,500 manually checked and around 41,000 fully automatically annotated

sentences obtained from a newspaper corpus. The Estonian treebank (Bick et al.,

2005) consists of 149 sentences from newspaper articles. The morpho-syntactic

and CG-based surface syntactic annotations are obtained from an existing corpus,

which is converted semi-automatically to the Arboretum-style format.

7.9. AnnCorra : Treebanks for Indian Languages

The AnnCorra (Hyderabad) treebanks for Indian Languages (ILs) are dependency

treebanks which use an indigenous Karaka-theory-based grammatical scheme,

known as Paninian Computational Grammar, for syntactic annotation (Bharati et

al., 1996; Begum et al., 2008). Currently, treebanks of four ILs, namely Hindi,

Urdu, Bangla and Telugu, following this grammatical scheme, are under development.

The Hindi dependency treebank consists of 20,705 sentences, the Urdu dependency

treebank of 3,226 sentences from a newspaper corpus, the Bangla dependency

treebank of 1,279 sentences and the Telugu dependency treebank of

1,635 sentences (Bhat & Sharma 2012; Vempaty et al., 2010), annotated with

linguistic information at the morpho-syntactic (morphological, part-of-speech and

chunk information) and syntactico-semantic (dependency) levels. No published reference is

available for the sizes of the Hindi and Bangla treebanks. The annotation

schemes in all these treebanks consider the verb to be the root of the sentence. The

relationship between a participant and the event/activity/state denoted by the

verb is marked using relations that are called karaka. It has been shown that the

notion of karaka incorporates the local semantics of a verb in a sentence and that

it is syntactico-semantic. Indian languages are morphologically rich and have a

relatively free constituent order. Unlike karaka relations, structural relations like

subject and object are considered less relevant for the grammatical

description of ILs due to the less configurational nature of these languages (Bhat,

1991; Begum et al., 2008).
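
A verb-rooted analysis with karaka-labelled dependents can be sketched in Python as follows. This is a simplified illustration, not the AnnCorra file format: the class and function names are hypothetical, and the romanized Hindi example, the chunk tags and the relation labels `k1` and `k2` are given here as assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    idx: int
    text: str
    tag: str              # chunk label (illustrative)
    head: Optional[int]   # index of the governing chunk; None for the root
    relation: str         # dependency label, e.g. a karaka relation


def find_root(chunks: List[Chunk]) -> Chunk:
    """Return the single chunk without a head; raise if there is not exactly one."""
    roots = [c for c in chunks if c.head is None]
    if len(roots) != 1:
        raise ValueError("a sentence is expected to have exactly one root")
    return roots[0]


def dependents(chunks: List[Chunk], head_idx: int) -> List[Tuple[str, str]]:
    """Return (relation, text) pairs for the chunks attached to the given head."""
    return [(c.relation, c.text) for c in chunks if c.head == head_idx]


# Illustrative example: "raam ne seb khaayaa" ('Ram ate an apple').
sentence = [
    Chunk(1, "raam ne", "NP", head=3, relation="k1"),   # doer of the action
    Chunk(2, "seb", "NP", head=3, relation="k2"),       # object of the action
    Chunk(3, "khaayaa", "VGF", head=None, relation="root"),
]

root = find_root(sentence)
participants = dependents(sentence, root.idx)
```

Because the relations are attached to chunks and not to positions, reordering the two noun chunks would leave the analysis unchanged, which is what makes such a scheme convenient for languages with relatively free constituent order.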

8. Principles of Treebanking

According to Huang (2003), four general principles have been

considered important for the design and development of a treebank. These

principles31 are given below:

8.1. Maximal Resource Sharing

31 Taken from (Huang et al. 2003*) “Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface”


The resources for developing a treebank include the corpus, tools, annotation

schemes, guidelines and human annotators. Since developing these resources

from scratch can be a very expensive and time-consuming process, one should

make maximum use of existing resources wherever they are available. For instance, in

order to achieve maximal resource sharing, the Sinica Treebank (Chen et al. 1996)

has been bootstrapped from existing Chinese computational linguistic resources.

The textual material has been extracted from the tagged Sinica Corpus (ibid).

Moreover, the same research team that carried out the POS annotation of the Sinica

Corpus also annotated the Sinica Treebank, to ensure consistency in the

interpretation of texts and tags.

8.2. Minimal Structural Complexity

The criterion of minimal structural complexity is motivated by the idea that the

annotated structural information can be shared regardless of users’ theoretical

orientations. Theory-internal motivations often require abstract

intermediate phrasal levels, like the intermediate phrasal category X’ in X-bar theory,

and abstract covert phrasal categories, like INFL in GB theory. Although these

phrasal categories are well motivated within the theory, their significance cannot

be maintained across theoretical frameworks. Since the minimal basic-level

structures are shared by all theories, it is better to annotate the information

that is most commonly shared among theories, such as the canonical phrasal

categories.

8.3. Optimal Semantic Information

The most critical issue, for treebanking as well as for the theories related to

NLP, is how much semantic information should be incorporated. The original

Penn Treebank used a purely syntactic approach. A purely semantic approach is yet

to be attempted. There is, however, a third approach, which involves annotating partial

semantic information, especially that encoded in argument relations. It is this third

approach which is shared by most treebanks; e.g. the Prague Dependency

Treebank (Bohmova and Hajicova 1999) and the AnnCorra Treebanks (Bharati et al.,

1994) use a syntactico-semantic approach. In this approach, the thematic relation

between a predicate and an argument is marked, in addition to the grammatical

category. This allows optimal semantic information to be incorporated in a

treebank and subsequently in an NLP system like a syntactic parser.

8.4. Minimal Granularity


Another important parameter is the granularity (depth) of analysis in treebanks.

While some of the earliest syntactically annotated corpora contain information only

about syntactic boundaries, others contain constituent structures (Abeille,

Clement, and Toussenel, 2003), functional dependency structures (Hajic, 1998) or,

in addition to the syntactic structures, predicate-argument structures (Marcus

et al., 1994). The present KashTreeBank (S. Bhat, 2012) contains inter-

chunk dependency relations in addition to POS and chunk labels.

9. Summary

In this chapter, the literature was reviewed with a focus on dependency grammar (DG),

dependency parsing and treebanking. First, different grammar formalisms that

are considered close to DG were briefly presented in order to compare their fundamental

notions with those of DG and to understand the common ground between them, as they

all contrast with constituency-based formalisms like PSG. Sample representations for

each of these formalisms, i.e. for DG, RG and LFG, were also given. A review of PSG-

based formalisms was deliberately avoided, as their representations and notions were

hardly required once it was confirmed that DG-based formalisms are more widely used for

treebanking relatively variable word order languages, for many reasons,

some of which are given in section five of this chapter. After the grammar formalisms,

the notion of non-configurationality was discussed at length, along with some modifications

that were made to the original PSG-based formalisms in order to minimize the operational

apparatus and incorporate notions of dependency, e.g. the incorporation of the VP shell. This

was done to justify the suitability of DG for inflectionally rich languages. Next, the

history of dependency-based representations was charted, its roots were traced and its

development in different grammatical traditions was described. In the next section, the

notion of treebanking was introduced, along with some background on what triggered the

wave of treebank creation, and the notion of a treebank was defined.

Further, some important dependency treebanks were introduced and, finally, the principles

that should govern treebanking efforts were presented.

Chapter.3 Creating Corpus for Kashmiri Treebank

“There are and can exist but two ways of investigating and

discovering truth. The one hurries on rapidly from the senses and particulars to the

most general axioms, and from them derives and discovers the intermediate

axioms. The other

constructs its axioms from the senses and particulars, by ascending continually

and gradually, till it finally arrives at the most general axioms.”

Francis Bacon, Novum Organum, Book 1.19 (1620)

1. Introduction

In rationalistic discourse, competence, the underlying ideal grammatical system in

the mind of a native speaker, is considered the only legitimate source of

grammatical knowledge, and it can be accessed only through the grammatical

intuitions of the native speaker (Chomsky 1956). Although

performance, the actual real-world utterances of a native speaker, is also a source

of grammatical information, it is not considered a legitimate source. It has been

regarded only as an inferior copy of the tacit knowledge, the competence.

In empirical discourse, however, an alternative stance is taken: the real-

world observable and verifiable language, i.e. the actual writings, speech and signs

that come under the purview of performance, is given prime legitimacy for

building a linguistic theory. Since a corpus is a real-world linguistic artifact (written,

spoken or signed) that stores linguistic knowledge, it is extensively used in

empirical research in Linguistics, CL and NLP. As mentioned earlier in Chapter

1, the linguistic knowledge that exists in a corpus is crucial for creating various

NLP tools and applications. Such knowledge can be captured either by building a

computational grammar (hand-crafted linguistic rules) or by annotating a large

electronic corpus to create treebanks. It is from these treebanks that grammatical

knowledge can be induced in a machine by statistical modeling. Hence, the

need for treebanks as an empirical basis for research on grammar is well

established.

Further, corpus-based empirical research, which was little practiced

for a long time after the late 1950s, was almost completely marginalized by the strong

rationalistic discourse and the formalisms developed subsequently. For instance, one of

the pioneers of the Brown Corpus (BC) recounts the response they received to the development of

BC: it was considered “a useless and foolhardy enterprise”, since the intuition of the native

speaker was considered the only legitimate source of grammatical knowledge of a

language, knowledge which could not be obtained from a corpus (Francis, 1992). However, with the

progress in corpus linguistics itself and the achievements in Speech Recognition and

NLP, particularly in Statistical Machine Translation (SMT), it is now a well-established


fact that corpus-based empirical grammar products like treebanks are of crucial

importance not only for linguistic research and language technology (Nivre et al., 2005)

but also for cognitive and historical linguistic studies.

The next section, on corpus linguistics, introduces the notion of corpus

linguistics and also provides its background. Section three discusses

the status of the Kashmiri text corpus. Section four discusses the methodology for

developing the Kashmiri text corpus. Section five looks into various problems

of corpus development, like corpus sanitization, corpus normalization and

tokenization, in general and for creating the Kashmiri corpus (KashCorpus) in

particular. Finally, section six summarizes the chapter.

2. Notion of Corpus Linguistics

The Latin term ‘corpus’ means ‘body’. It was traditionally applied to various

collections of linguistic or non-linguistic items. In linguistics, however, the term

‘corpus’ refers to a finite collection of naturally occurring utterances. A corpus is

a machine-readable, principled and organized collection of text, speech and

sign samples that represent a particular language or a variety of that language

(Leech 1992, Sinclair 1996). Corpus Linguistics, however, is not a branch of

linguistics in the way that interdisciplinary branches like psycholinguistics and

sociolinguistics or core branches like morphology and syntax are; rather, it is an

alternative empirical (corpus-based) methodology that cuts across all the

branches of linguistics.

2.1. Some Background

The term Corpus Linguistics was not much in use until the early 1980s, but it

came into the limelight with the publication of The Recent Trends in English Corpus

Linguistics (Aarts & Meijs, 1984). Corpus-based linguistic research itself

predates the rationalistic generative era (late 1950s), when it was practiced by many

linguists32. Although they used hard copies of text for manual analysis and paper

slips or cards for data storage33, their methodology was purely empirical,

based on real-world data. As mentioned above, the underlying notion of

language in corpus linguistics is an empiricist and probabilistic one, where

language is considered a real-life object which can only be probabilistically

32 Linguists like Firth (1930s), Jespersen (1940s), Franz Boas (1940s), Sapir (1950s), Bloomfield (1950s), Harris (1950s), Fries (1950s), etc. were practicing this empirical brand of linguistic research (see Biber and Finegan 1991: 207)

33 Unsophisticated (Pen & Paper technology)


modelled, i.e. the correspondence between linguistic structures and grammatical

rules is a matter of frequency vis-à-vis probability.

“If it is correct to describe linguistic behavior as rule-governed, this is much

more like the sense in which car-drivers’ behavior is governed by the

Highway Code than the sense in which the behavior of material objects is

governed by the laws of physics, which can never be violated” (Sampson,

1992).

This period, prior to the 1950s, is considered the golden era of old-fashioned

corpus linguistics, which has been termed Early Corpus Linguistics (ECL)

(McEnery & Wilson, 1996). In ECL, the corpus was collected, stored and

analyzed by linguists by hand, using pen and paper as aids. Consequently,

corpora were much smaller than today’s and rarely faultless. The corpus-based

methodology required data storage (memory devices) and processing abilities

which were not available at that time. In the 1950s, under the influence of logical

positivism and behaviorism, several linguists regarded the corpus as the primary

source of linguistic information. The corpus was deemed both necessary and

sufficient for the task at hand, and intuitive evidence was sometimes rejected

altogether. A small number of researchers applying corpus-based

methodology did make weaker claims, suggesting that the purpose of the linguist’s

work is not simply to account for all the utterances included in the corpus but rather

to account for those which are not in the corpus at a given time (Leech, 1992).

In spite of its intrinsic limitations (theoretical and technological), the corpus-

based approach was considered a scientific methodology for language

study. ECL was widespread among linguists until the early 1950s (McEnery

& Wilson, 2001). At the end of the 1950s, the corpus-based empirical method was

severely criticized and almost overshadowed by rationalistic discourse (ibid). The

criticism was partly genuine, given the crude techniques available at that time.

Finally, with the advent of computing machines and their use in corpus

processing, owing to their large storage and computing capacities, modern corpus

linguistics came into existence, and in the early 1960s the first modern corpus, known as the

Brown Corpus, was compiled for American English. Modern corpus

linguistics, known as Computerized Corpus Linguistics (CCL), received further

impetus from the ground-breaking successes in automatic speech recognition and

automatic machine translation, using various techniques of statistical language


modelling. The success in building various NLP applications based on different

modern-day corpora rekindled hope in empiricism and, by the early 1990s, the

spell of rationalism had almost been broken.

2.2. Text and Grammatical Knowledge

Writers write without being conscious that, apart from their intended meaning, they also encode

their grammatical knowledge and mastery of the language in the patterns of the

text. It is well established that this grammatical knowledge can be

harnessed. The grammatical information in a text corpus needs to be annotated

at various levels in order to be used in developing real-world NLP tools and

applications. This involves the direct induction (learning) of linguistic knowledge from

annotated corpora. The annotated corpora used are treebanks, in which the

implicit linguistic information has been made explicit through various levels of

annotation. Several NLP modules like Part-of-Speech Taggers, Chunkers and Parsers,

and various NLP application systems like Machine Translation, Question

Answering or Information Extraction, are trained and tested on treebanks, i.e. the

aforementioned systems learn linguistic knowledge from treebank samples

and their performance is evaluated on held-out samples from the same treebank. Training of the

system consists of two stages: (a) classifying the linguistic structures (i.e. words

and chunks) occurring in the corpus, and (b) assigning them probabilities of

occurrence according to a probabilistic language model.
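
The two training stages named above can be illustrated with a minimal Python sketch that counts how words are classified in an annotated corpus and turns the counts into relative-frequency estimates. It is a deliberately simple example under stated assumptions, not the model used in this work: the function name, the toy tagged sentences and the tag labels are hypothetical, and real systems use richer models with smoothing.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def estimate_tag_probabilities(corpus: List[TaggedSentence]) -> Dict[str, Dict[str, float]]:
    """Estimate P(tag | word) by relative frequency over an annotated corpus."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    # Stage (a): record the class (tag) assigned to each word occurrence.
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Stage (b): convert the counts into probabilities of occurrence.
    probabilities: Dict[str, Dict[str, float]] = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        probabilities[word] = {tag: n / total for tag, n in tag_counts.items()}
    return probabilities


# Toy annotated corpus with made-up tags, for illustration only.
toy_corpus = [
    [("the", "DET"), ("book", "N")],
    [("they", "PRON"), ("book", "V"), ("flights", "N")],
    [("a", "DET"), ("book", "N")],
]

probs = estimate_tag_probabilities(toy_corpus)
```

In this toy corpus the word "book" is tagged as a noun in two of its three occurrences, so the estimate for that reading is two thirds; a word unseen in training receives no estimate at all, which is one reason why the size of the annotated corpus matters.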

3. Status of Kashmiri Text Corpus

The Kashmiri language presents unique challenges to descriptive, theoretical and

historical linguistics. It is a fascinating language not only for linguists who base

their research on rationalism but also for corpus linguists and NLP practitioners

who base their research on empiricism. Though Kashmiri is fairly well explored

from the rationalistic orientation, it is yet to be explored from the empirical perspective.

A brief overview of the existing corpus resources for Kashmiri is given in

this section.

Since the corpus is the primary source of data for empirical research34, corpus building is

to be seen as part and parcel of corpus linguistics, which has become an

essential enterprise for the quantitative analysis and technological development of

34 The method used in empirical research is totally quantitative in nature which, in addition to documenting structural and functional analysis, also stores some numbers like frequency counts and probability weights with the items of analysis. This augmentation with statistical information makes linguistic data more information rich. Information richness and machine readability of such data makes it more preferred data for language technology & NLP research.


any language in the post-1980 scenario. This has led to the development of huge

corpora in many languages of the world, like English, French, German, Arabic and

Chinese, which are hence loosely called resource-rich languages. Other languages,

like most of the South Asian Languages (SALs), still lack such resources on a large scale

and are hence called resource-poor languages. Indian Languages

present a good example of the resource-poor scenario. The work of corpus building in

ILs first started at the individual level thirty-three years ago at Kolhapur University,

where KCIE35 (Shastri 1988) came into being; it consists of approximately one million

words of Indian English with ISCII encoding. The next initiative in this direction

was taken by the Department of Electronics, Govt. of India, in the form of the

TDIL36 project in 1991. The project was launched to develop text corpora of three million words

for all ILs that are included in the 8th Schedule (cf. Ganesan, 1999). The corpora

were compiled from text materials published between 1981 and 1990. For

Kashmiri, Urdu and Sindhi the initiatives were taken at AMU, and similar efforts

were made at different institutions throughout the country. Another effort was

made under the EMILLE project to build multilingual corpora for South Asian

Languages (McEnery et al., 2000), which released a 200,000-word parallel corpus

of English, Urdu, Bengali, Gujarati, Hindi and Punjabi. However, ILs still need

large-scale language resources in order to develop sufficient language technology and

thereby enhance their online representation. To enhance technological

development, many corpus projects have been launched recently, for instance

LDCIL and ILCI37, which are still going on. The former aims at producing quality

annotated corpora for all 22 scheduled ILs, while the latter aims at producing

a parallel corpus in the tourism domain for all major ILs, keeping the Hindi corpus as the

pivot which is translated into the other languages.

Corpus building started on a large scale in Europe and America after the 1980s, with a considerable amount of standardization, whereas in India such efforts started only a decade ago, on a small scale and in isolated projects, with less emphasis on standardization. As a result, resources were created only for a few major languages, and without proper standardization. Consequently, until 2008-2009, in spite of the efforts made under TDIL, there

35 Kolhapur Corpus of Indian English
36 Technology Development for Indian Languages
37 Indian Languages Corpora Initiative


were hardly any language resources for Kashmiri, and hence no corpus-based research on Kashmiri was possible. It was only after some initial efforts in this direction, first at the Central Institute of Indian Languages (CIIL) and then at Kashmir University (KU)38, that some corpus-based studies were made. These corpus-building efforts resulted in some basic language resources and computational tools for Kashmiri, such as a Unicode-compatible font, a text corpus, a POS-annotated corpus, a speech corpus, annotation tools, transliteration tools and some lexical resources like a trilingual dictionary and a frequency dictionary. Besides, C-DAC Pune, which is also involved in the localization of software such as OpenOffice, has developed a software package for all Indian Languages, including Kashmiri, which consists of a word processor, a browser, a spreadsheet, etc. Although a considerable amount of Kashmiri text corpus was built at AMU, more than one million words have been built under LDC-IL39 at CIIL, and about 2-5 lakh words at KU (Bhat 2012), no existing corpus has so far been opened to researchers. Therefore, instead of trying to obtain an existing corpus, new small-scale resources were created for developing KashTreeBank. The next section describes the methodology used in building the Kashmiri corpus.

4. Methodology for Building Kashmiri Text Corpus (KashCorpus)

Theoretically, text corpora can be developed by typing in printed texts, by using OCR, or through speech recognition. OCR and speech technologies are far from perfect, especially for ILs, so the only workable method is to key in the texts. For the development of the Kashmiri Text Corpus (KashCorpus) too, raw text was collected and digitized by entering the data manually in Microsoft Word (.doc format). After procedures such as cleaning and normalization, the corpus is deemed fit for linguistic scrutiny and for different types of annotation. The entire procedure adopted for the development of KashCorpus is explained below, along with the associated issues.

4.1 Planning Corpus

Planning is a very important stage in corpus building; it is, in fact, the decision-making stage. It is here that the source and nature of the text and the purpose for which the corpus is to be built are decided upon. Once the purpose of the

38 At the Department of Linguistics, Kashmir University, under the DIT-funded project Development of Kashmiri Language Technology Tools (see kashmirizaban.com).

39 LDCIL stands for Linguistic Data Consortium for Indian Languages, which is based at CIIL. It is a scheme of MHRD, Govt. of India, with the goal of creating annotated language resources for all ILs (for details see ldcil.org).


corpus is clear, other specifications such as character encoding, text encoding and the format for storage and usage are also laid down. The general practice in treebanking is to use newspapers as the primary source of data, because they are easily available and can be freely downloaded, e.g. the Wall Street Journal (WSJ) part of the Penn Treebank. However, digitization is yet to be achieved for newspapers in most ILs, and where newspapers are digitized at all, they are mostly in image format, which cannot be used directly as a corpus; one can only download or copy them and key in the text. For the current work the situation was even more difficult, as only a few newspapers are available in Kashmiri, and these are hard to obtain and not digitized.

4.2 Selecting Text Domains

Theoretically, one can identify different text domains, such as Aesthetics (Literature and Fine Arts), Natural, Physical and Professional Sciences, Social Sciences, Commerce, Government Documents, etc., which are very important for creating a balanced corpus, but the availability of such domains varies from language to language. Moreover, certain domains, like government documents, medical and tourism texts, have more day-to-day relevance than others. These domains are more useful in developing technology for e-governance and are hence much in demand these days for commercial use in developing various NLP applications. However, such text domains, whether important for building a balanced corpus or for commercial purposes, are not available in Kashmiri. This is because Kashmiri has never been used as an official language or as a medium of instruction40; at present too, the official language of the state is Urdu and the alternative official language is English. Text production is therefore restricted to limited domains41, predominantly literature. Since, as mentioned earlier, the current corpus is meant for developing KashTreeBank, it was decided that newspaper text should be used. The rationale for using a newspaper corpus was not only that it is in line with general practice in the field of treebanking; an additional reason was that the textual material collected from books (Bhat 2012) shows the least degree of standardization, whereas newspapers use comparatively standard forms. However,

40 It is worth mentioning that Kashmiri has been intermittently introduced into and removed from the school curriculum, and it has recently been re-introduced as a school subject. This is probably the reason why most young people are unable to read and write Kashmiri, whereas elders and children are well versed in it.

41 It was observed during fieldwork that there is hardly any text from domains other than Aesthetics (Bhat 2012). The fieldwork was done for LDCIL, and data was collected from 270 books for developing a balanced corpus of Kashmiri.


when the newspaper corpus was initially used on an experimental basis, it was found very difficult to annotate at the sentence level, as the sentences were very complex and lengthy. Consequently, it became very hard to lay down the first version of the annotation guidelines. To overcome this difficulty, some short-story text was also selected and added to the existing corpus. The current KashCorpus consists of the following domains:

S. No   Domain                          Word Count (WC)   %age
01      Short Stories (SS)              3384              7.29
02      News Articles Political (NAP)   14395             31.02
03      News Articles (NA)              7001              15.09
04      News Report Political (NRP)     14263             30.74
05      News Report (NR)                2997              6.45
06      Editorials (ED)                 4354              9.38
        Total WC                        46394

Table.1 Text Domains
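The percentage column of Table.1 follows directly from the word counts. The following is a minimal Python sketch of that arithmetic; the dictionary and variable names are illustrative and are not part of the original workflow.

```python
# Word counts per domain, copied from Table.1.
domain_wc = {
    "SS": 3384,    # Short Stories
    "NAP": 14395,  # News Articles Political
    "NA": 7001,    # News Articles
    "NRP": 14263,  # News Report Political
    "NR": 2997,    # News Report
    "ED": 4354,    # Editorials
}

total_wc = sum(domain_wc.values())

# Share of each domain in the whole corpus, in percent.
shares = {code: 100 * wc / total_wc for code, wc in domain_wc.items()}

for code, share in shares.items():
    print(f"{code:<4}{domain_wc[code]:>7}{share:>8.2f}")
print(f"Total WC: {total_wc}")
```

With conventional rounding the shares of NAP and NR come out as 31.03 and 6.46; the figures 31.02 and 6.45 in the table are consistent with truncation to two decimals rather than rounding.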

4.3 Data Collection

For building KashCorpus, data collection was carried out through fieldwork. As mentioned above, it was not possible to collect newspaper data online, as can be done for English, Urdu, Hindi, Tamil, etc., which have a fairly good online representation. Further, it was decided to use text from Sangarmaal, the only well-known newspaper in Kashmiri, which was earlier a weekly and has recently started daily publication. The other Kashmiri newspapers, kAhvaTh, soon miiraas, arnimaal, miiraas and kA:shur times, are not widely circulated. Sangarmaal too is not a widely read paper, as very few people can read and write Kashmiri, whereas English and Urdu newspapers are widely read in Kashmir. It therefore became necessary to go to the field to collect newspapers. Issues of Sangarmaal covering a period of six months were purchased, and news items, editorials and articles, mainly from the political domain, were marked up. Besides, short stories were also included in the corpus, most of them taken from an anthology of prose used to teach Kashmiri at NRLC. The decision to add short stories was taken at the last stage because, as mentioned above, the average sentence length in the newspaper corpus was found to be high, approx. 27 words, and the sentences are also quite complex. On the other


hand, the average sentence length of the short-story corpus is approx. 12 words, though still with considerable complexity. Moreover, the data to be collected should follow a proper sampling scheme, as is done at LDCIL for building text corpora for all scheduled languages (each nth page from an n-page book, magazine or journal), but in the current case random sampling was done, in which no explicit criterion was followed to choose the text. However, care was taken to use the least possible number of newspapers to avoid wastage. Sample details of the newspaper data collected during the field trip in 2011 are given in Table.1 below.

File ID. Metadata Words Domain

KashCorp 01 ۲٠ تا ۱۴ ، ۲۱ شمار : ۵سنگرمال سرینگر : جلد: ۲٠۱٠جون ،

سی اعتبار مویوس کن: د دور سی ٲوزیر اعظم س ن� ہ کن ن اصل سیسی سوال گال ٲاقتصدی مراعات ہ ٮ� ہ ٲ

اونک ول بنتھ ن�کھاتس تسنگرمال تجزی

204 ED

KashCorp 02 �� م ،۹ تا ۳ ، ۱۷، شمار :۵سنگرمال سرینگر :جلد: ی ۲٠۱٠

ٹھ انن ین سمیر ر گوژھ ڈنج پ ۔۔۔سری ینگر ش ہ ٮ� ہ رشیدبٹ

132 NA

KashCorp

04 مارچ تا۲۹ ، ۱۲ ، شمار :۵سنگرمال سرینگر :جلد:

۲٠۱٠اپریل، :۴ز راونس ژیر د مسل ا ر گن کش ن�ندوستان م ل ہ ن� ہہ ی� ٲ

172 NAP

KashCorp

06 تا۱۶ ، ۹٠ ، شمار : ۴سنگرمال سرینگر :جلد: ، ۲٠٠۹

نومبر۲۲ل سرحد مضبوط کران ت رلن و نس ۍندوستان چھ چ ٲ ۍ س� ی�

216 NRP

KashCorp

19:می ،۳٠ تا ۲۴ ، ۱۹ ، شمار :۵سنگرمال سرینگر :جلد:

۲٠۱٠ن اعزاز ن لکھار د ترین کتابن ٮ�ب ٮ� ن� ہہ

202 NR

Table.1 Metadata of Sample Newspaper corpus
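The average sentence lengths reported in this section (approx. 27 words for newspaper text and approx. 12 words for short stories) are the kind of figure that can be obtained by splitting a text at sentence-final punctuation and counting whitespace-delimited tokens. The sketch below shows one way to do this in Python. It assumes that sentences end with the Arabic full stop (U+06D4) or the Arabic question mark (U+061F), both listed in Table.2, and that a word is any whitespace-delimited token; the function name is hypothetical and the thesis does not state how the averages were actually computed.

```python
import re

# Sentence-final marks assumed here: Arabic full stop (U+06D4) and
# Arabic question mark (U+061F). This is an assumption of the sketch.
SENTENCE_END = re.compile("[\u06D4\u061F]")


def average_sentence_length(text: str) -> float:
    """Mean number of whitespace-delimited tokens per sentence."""
    sentences = [s.split() for s in SENTENCE_END.split(text)]
    sentences = [s for s in sentences if s]  # drop empty trailing pieces
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)


# Placeholder tokens (w1, w2, ...) stand in for real Kashmiri words.
sample = "w1 w2 w3\u06D4 w4 w5 w6 w7 w8\u061F"
print(average_sentence_length(sample))
```

Because the split-orthography discussed later in this chapter inflates the token count, figures computed on untokenized text would be somewhat higher than those computed after tokenization.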


4.4 Character Encoding

These days, Unicode has become the prime choice of character encoding for text corpus creation. Unicode is the universal character encoding standard: it defines a consistent scheme for encoding multilingual text and assigns a numeric value (code point) and a name to each of its characters. Unicode characters are represented in one of three UTF42 encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16) or an 8-bit form (UTF-8). UTF-8 has been designed for ease of use with existing ASCII-based systems. The standard contains more than 1 million code points, most of which are available for the encoding of characters (Allen et al., 2009). The availability of a Unicode-compatible font is a prerequisite for the development of a Unicode-compatible corpus. As mentioned earlier, Kashmiri has only one Unicode-compatible font with minimal issues, i.e. Afan Koshur Naksh (Aadil 2011), which is being used for major NLP-related work in many projects. It has also been used for the development of the current corpus. Table 2 shows the encoding of the Kashmiri characters employed in developing KashCorpus.

42 Unicode Transformation Format; see http://www.unicode.org and http://www.unicode.org/versions/Unicode5.2.0

S. No   Character   Unicode Value     S. No   Character   Unicode Value
1       ا           0627              30      ل           0644
2       ب           0628              31      م           0645
3       پ           067E              32      ن           0646
4       ت           062A              33      و           0648
5       ٹ           0679              34      ہ           06C1
6       ث           062B              35      ھ           06BE
7       ج           062C              36      ء           0621
8       چ           0686              37      ی           06CC
9       ح           062D              38      ے           06D2
10      خ           062E              39      ◌َ           064E
11      د           062F              40      آ           0622
12      ڈ           0688              41      ٲ           0672
13      ذ           0630              43      ◌ِ           0650
14      ر           0631              44      ◌ٖ           0656
15      ڑ           0691              45      ◌ُ           064F
16      ز           0632              46      ◌ٗ           0657
17      ژ           0698              47      ◌ٔ           0654
18      س           0633              48      ◌ٕ           0655
19      ش           0634              49      ◌ٚ           065A
20      ص           0635              50      ◌ٛ           065B
21      ض           0636              51      ◌ٍ           064D
22      ط           0637              52      ،           061B
23      ظ           0638              53      ۔           06D4
24      ع           0639              54      ؟           061F
25      غ           063A              55      ۄ           1732*
26      ف           0641              56      ◌ۭ           1773*
27      ق           0642              57      ۍ           1741*
28      ک           06A9              58      ٮ۪           1646 + 1770*
29      گ           06AF

* The starred values (rows 55-58) are decimal code points (e.g. 1732 = U+06C4); all other values are hexadecimal. Diacritics are shown on a dotted circle (◌).

Table.2 Kashmiri Unicode Chart
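The relationship between the code points in Table.2 and the three UTF encoding forms can be checked with Python's standard `unicodedata` module. The short sketch below is only an illustration: it converts two of the starred (decimal) values of the table into the usual hexadecimal notation, looks up their official character names, and prints how many bytes one character occupies in each encoding form.

```python
import unicodedata

# Starred values in Table.2 are decimal; converting them to hexadecimal
# gives the usual U+XXXX notation.
for decimal_value in (1732, 1741):
    ch = chr(decimal_value)
    print(f"{decimal_value} -> U+{decimal_value:04X} {unicodedata.name(ch)}")

# One code point, three encoding forms: byte length of ALEF (U+0627).
alef = "\u0627"
for form in ("utf-8", "utf-16-le", "utf-32-le"):
    print(form, len(alef.encode(form)))
```

The letter occupies two bytes in UTF-8 and UTF-16 and four bytes in UTF-32, which is why a UTF-8 plain-text file is a compact storage choice for Kashmiri text.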

4.5 Text Encoding

The term text encoding refers to the practice of representing textual and linguistic data in a certain format in a corpus. A standard encoding format provides the greatest possible generality and flexibility (McEnery & Wilson, 1996). XML43 is the emerging standard for data representation and exchange on the World Wide Web (Bray, Paoli & Sperberg-McQueen, 1998). At the fundamental level, XML is a document markup language derived directly from SGML, with various additional features that make it a far more powerful tool for data representation and access. The natural choice these days is therefore to store a corpus in an XML format. An XML format provides the needed standardization, so that a user who is not familiar with the corpus but is familiar with XML DTDs can easily interface with it. For the current KashCorpus, however, no markup language or XML DTDs were

43 Extensible Markup Language


used; instead, the entire corpus was keyed in as plain Word documents (.doc). As the corpus had only one purpose, namely syntactic annotation, it was not necessary to have it in XML format; a plain text (.txt) format in UTF-8 was sufficient. However, the corpus can easily be converted into XML format.
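As an illustration of how simple such a conversion can be, the following Python sketch wraps the text of one corpus file, together with its file ID and domain code, in a small XML record using the standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `text`, `wordcount`, `id`, `domain`) are hypothetical and chosen only for this sketch; no DTD or schema was defined for KashCorpus.

```python
import xml.etree.ElementTree as ET


def to_xml(file_id: str, domain: str, text: str) -> ET.Element:
    """Wrap one plain-text corpus file in a small XML record.

    Element and attribute names are hypothetical, chosen for this sketch.
    """
    doc = ET.Element("document", {"id": file_id, "domain": domain})
    body = ET.SubElement(doc, "text")
    body.text = text
    # Word count as used in the corpus tables: whitespace-delimited tokens.
    ET.SubElement(doc, "wordcount").text = str(len(text.split()))
    return doc


# Placeholder tokens stand in for the Kashmiri text of the file.
record = to_xml("KashCorp01", "ED", "w1 w2 w3")
xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Serializing with `encoding="utf-8"` keeps the XML output in the same encoding as the plain-text files, so no character conversion is involved.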

4.6 Data Entry

Data entry is the cornerstone of any corpus-building endeavor. It is a time-consuming task, especially for a language like Kashmiri, in which people are accustomed to word processors such as InPage, which use different encoding standards and are not compatible with Unicode, and are not yet familiar with using Microsoft Word for the script. The manually marked-up news items and articles from Sangarmaal and the short stories from an anthology were then typed in. It took a professional data inputter 8 days to input 46394 words of Kashmiri newspaper text in Microsoft Word, with an average of 5051 words per day (5-7 hrs). It was found that the typed corpus was unclean, i.e. it contained many typos and spacing problems, and was still unfit for the next level of processing. It is a well-established practice that a corpus must be sanitized and preprocessed before it is put to actual use. A sample of the unclean corpus is given below in Table.3. It contains three parts: a) Metadata (information about the data available in the corpus), b) Data (the text on which the actual work is done) and c) Word Count (the number of words of the actual data, excluding metadata).

METADATA

File ID No.KashCorp11

Newspaper

Details

ہ� : ۵سنگرمال سرینگر : جلد: ی�/ ۹ تا ۳ ، ۱۷ شما ۲٠۱٠ م

News Item

Title

ی�ن و�گ زا ن�ن ل ٮ�ن امکا ن��

چ5لو سیاست ننز ہہ چد� ٮCٹھ مثبت اثر، علیحدگ/ پسن چEس پ ہد� صو�تحا نن ہہ ہر یی �Hش Iہ نرا ام� عمل ت

ہH Kا�گر 5ھن

Item TypeKسنگرمال تجزی

ہمژ ٮI �Cزیراعظم� د�میا� سپز H تا�سHپا Kہ چسسس د�Iا� بھا�ت ت ہتمپھو سا�ک سربرا اجلا KYہ ہK5 �ازد ا ہ] بھوٹا� Eحاس


ھھ] 5Iچ� چیو Eید ہو ہمت۔ حالاYکK دۄش ھو ہد گ ہYک اما�H پٲ نبال سپد ننزع مل ہہ ہھ چHتھ بات ہ5ھ دۄ� ملک� د�میا� جامع Kہ ہنگK پت باہم/ میٹ

Kہ چپتھ ماحول سازگا�ت ہمK با چر� ا H ہتم ہز ہمژ ہدژ ہK دٲ�� ہgم Kہی چی� ہK خا�KY سیکریٹر� ننز ضر�Iت Iاضح Hرا� خٲ�ج/ Iزیر� ت ہہ ہتھ چHتھ با

ہYچ Iٲحد ہK حل Hر چHتھ مسل ننز ہK م ہK پت چ�� د�میا� ملاقات نYہK دۄ� Eیڈ I نر چاHھ و�� ب ننز ہHس شرم اEشیخ شہر� م ہر ہH Kا� تلاش۔ مص Iطریق

�گھس ا�I شرم سظم۔ Iزیر اعظم ڈاHٹر منموہ� سن ہK من ہد محتاط ت ہمت بیا� زیا آا Kہ Yہد Kہ ہمK پت ہK ا ہیK ملاقات ت ہ5ھ ہمش ٲ�، آا KYہ ہد چIتھ قرا�

چیس ہHس و�ۍ �ا ہ� ہمس پو �پتۍ ہI� Kد �}ژ سبب ہK ام/ مخال ہمت ت ھو ہ5ھ� پی Iہ ہبتھ ہک ننز زبردست مخاEفت ہK پا�Eیمنٹس م اEشیخ اعلاYیK پت

Kہ ہK ت ہشٹھ نند موقف ہہ Kہ Eہود Y قس متلYچ ننز ملوث Yفر�Y � ۲٠٠۸ومبر ۲۶جامع مذاHرٲت/ عمل بحال Hر چل� م �/ حم ٮ�C ممب H و�عیس

چنس ہی Kہ YرH Kہ ٮCٹھ تمام ام� عمل �Iٹ چلس پ ہس] معام �اHۍ چاتھ ہI� Kد پاHستا� ہیمK طر} Iن سستۍ مشرIط۔ د چنس ہی Kہ YرH /ٲیIا�H خلاف ٹھو�

/� ہممب Kہ ہیتھۍ حمل ہمتۍ آا ہK سپدا� ہتتھ/ حمل ہI Kاد ننز د�جن چتتھ ملکس م ہ5ھ چیس د�Iا� ہHس و�ۍ �ا ہمس �پتۍ ہز چYIا� Kہ ٮCٹھ IاI یلا Hرا� ت پ

ہمت ا�I ۔ ہد� ننز سپ م

W-241

Table.3 Sample of Unclean KashCorpus

4.7 Copyright Issue

Copyright legislation is one of the serious problems in building and using large-scale text corpora, as authors and publishers protect their rights to their texts through copyright laws. The main concern for the corpus builder is that any text that is to be digitized and included in the corpus will be under copyright protection, so permission has to be obtained for its use. If the corpus is going to be used in the development of systems or applications for commercial purposes, one has to obtain permission and enter into an agreement with the authors or publishers in which a royalty is fixed for each text. However, if the corpus is to be used for research purposes, there is hardly any need to obtain permission or enter into an agreement. Since the current KashCorpus is meant only for research and not for any commercial purpose, permission for using the text has not been taken. The next section describes the procedures involved in polishing the corpus.

5. Preprocessing

Once the inputting of the text is finished and the corpus is ready, it cannot be used directly for annotation; rather, it has to go through some further processing. For instance, the current KashCorpus has been


manually sanitized, normalized and tokenized before being used for POS annotation. All the manipulations done (manually or automatically) to the corpus prior to annotation can collectively be called preprocessing.

5.1. Corpus Sanitization

Corpus cleaning involves proofreading the digitized corpus files for typos, spelling errors and grammatical mistakes. During this process it is necessary to be faithful to the text, as whatever one may think of as a mistake on the part of the writer could in fact be a variation. The causes of the errors and the ways in which they were corrected during sanitization are given below:

a. The inputter's limited expertise in the Kashmiri script, in spite of good typing experience, resulted in many errors and consequently in a more unclean corpus. Moreover, it was found that the highest-scoring day (in terms of number of words per day) was also the day on which the most errors were committed by the data inputter: the percentage of erroneous words was higher than on days with an average word count. Taking the required time therefore seems to be a good strategy, as haste to finish more and more words per day can increase the percentage of erroneous words.

b. Sometimes the poor quality of the print and errors in the original text led to misjudgment of letters or words and consequently to mistakes on the part of the inputter.

c. The Kashmiri script uses many diacritics to represent different phonetic subtleties of the language. Sometimes a diacritic appears above or below one character when it actually belongs to the preceding or following character. It is therefore sometimes hard to decide from the text which letter a diacritic belongs to, unless the native speaker's intuitions are taken into consideration. These apparently misplaced diacritics generally confuse the data inputter and result in many spelling mistakes, i.e. an unclean corpus. For instance, in the word مخالفتکُ (mukhaalfatku), given above in Table.3, the 'pesh' (◌ُ) diacritic appears misplaced, i.e. on the final letter of the word (ک), whereas it actually belongs on the preceding letter (ت), and the actual word is مخالفتُک (mukhaalfatuk). Such mistakes are regular and hence quite predictable.

d. Various key combinations are used to input different characters, e.g. pressing Shift+P types ( ), but sometimes only one key is pressed (e.g. only 'P') and an entirely different character (e.g. پ) gets typed. This has resulted in various errors, which are more or less predictable.

e. Certain diacritics were typed in a consistent way, despite the variations in the text. When two diacritics occur contiguously (one after the other), the first joining two consonants to function as a unit and the second representing the vowel on that unit, it was decided to type the sequence in this order: first the first consonant, second the diacritic representing the vowel, third the second consonant, and fourth the conjoining diacritic. For instance, in the

words ( مت ) & (ہن�دی7777777 ) the splitting occurs when two diacritics (ہن�تی7777777ت& ن ) come contiguously. To avoid this, vowel representing diacritic ( ) is

typed after first consonant (د) which forms a unit with (�) with the help of

a linker & shortening diacritic ( ن ). Therefore the above words were

corrected as & نیتھ ہت ہمت نی ہد and this pattern was followed in the entire process

of cleaning.

f. It was also maintained not to consider aspirated consonants as unit and put

the diacritic after this unit, instead put diacritic on or under the letter

representing the consonant. For instance, in the word ہ5ھ) ), the vowel

representing diacritic ( ) is not actually associated with only (چ), the

word initial letter but the unit (5ھ). However, it is maintained to write after

the first letter (چ) so that (ہ5ھ ) is typed everywhere instead of (چھ).

g. It was also maintained to put the diacritic representing a vowel under the

unpronounced pseudo-character44 instead directly under the preceding

letter representing a consonant which is not a writing convention. For

instance, in the words (Kہ ہK) & (ت ) the letter ,(ت) remains unpronounced and is

used as a supporting characters for the diacritics ( ہ & ) so that they are

44 In Kashmiri a letter () is used at word final position just for the support of the preceding diacritic which can’t stand on its own. In such cases ( ) is a pseudo-character as it doesn’t represent anything of the phonological word.


written as ( & ہ ہ ), respectively. These are mere orthographic conventions

and have nothing to do with phonological rules.

All these types of errors were rectified during cleaning, keeping in mind the principle of faithfulness to the text and some additional decisions to maintain consistency. The next sub-section describes the normalization of the corpus.
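The cleaning described in this sub-section was done manually. Since several of the error types are regular, however, some of them could in principle be flagged automatically before proofreading. The Python sketch below illustrates the idea for one simple case, a diacritic that has lost its base letter and therefore stands at the beginning of a token. The function names are hypothetical, and treating every character of Unicode category Mn (nonspacing mark) as a diacritic is an assumption of the sketch.

```python
import unicodedata


def is_diacritic(ch: str) -> bool:
    """True for nonspacing combining marks (Unicode category Mn)."""
    return unicodedata.category(ch) == "Mn"


def suspicious_tokens(text: str) -> list[str]:
    """Return tokens that start with a diacritic, i.e. a mark with no base."""
    return [tok for tok in text.split() if is_diacritic(tok[0])]


# Illustrative strings built from code points, not attested corpus words:
# U+06A9 (kaf) followed by U+064F (pesh) is well formed; the reverse is not.
good = "\u06A9\u064F"
bad = "\u064F\u06A9"
print(suspicious_tokens(f"{good} {bad}") == [bad])
```

A check of this kind only points the proofreader to candidate errors; deciding which letter a diacritic really belongs to still requires the native speaker's judgment, as noted in (c) above.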

5.2. Corpus Normalization

Corpus normalization involves all the necessary manipulations of the corpus not covered under cleaning and tokenization. It primarily involves filling in omitted diacritics. As mentioned above, the Kashmiri script uses fourteen diacritics (e.g. ی ) to represent different phonetic subtleties of the language. Urdu also uses a modified Perso-Arabic script but drops the three crucial vowel-representing diacritics, namely zer (◌ِ), zabar (◌َ) and pesh (◌ُ). This tendency can also be seen in Kashmiri texts, but it is not as severe as in Urdu texts, where all diacritics are omitted. Like Urdu writers, Kashmiri writers tend to drop these three diacritics, but the remaining ones are essential, specific to Kashmiri, and cannot be inferred from the context. The dropping of diacritics nevertheless creates a big text normalization problem that needs to be taken care of: all diacritics need to be restored in the text, or at least those that are crucial for word identification and disambiguation. This has been done for the current KashCorpus; all the crucial diacritics have been inserted manually. In practice, corpus cleaning and normalization were carried out simultaneously.
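Why dropped diacritics matter for word identification can be shown with a small Python sketch. It removes all nonspacing marks from a list of fully marked forms and reports which bare "skeletons" correspond to more than one marked form; those are the places where a diacritic is crucial for disambiguation. The function names are hypothetical, and the two sample strings are built from code points purely for illustration; they are not claimed to be attested Kashmiri words.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove nonspacing marks (category Mn), keeping the base letters."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def ambiguous_skeletons(words):
    """Group fully marked words by their bare skeleton; keep collisions."""
    groups = defaultdict(set)
    for w in words:
        groups[strip_diacritics(w)].add(w)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Illustrative strings, not attested words: kaf + zabar + re, kaf + pesh + re.
forms = ["\u06A9\u064E\u0631", "\u06A9\u064F\u0631"]
print(ambiguous_skeletons(forms))
```

Both forms collapse to the same two-letter skeleton once the marks are removed, so a reader or a tagger seeing only the skeleton cannot tell them apart without context.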

5.3. Tokenization

A token is a string of characters delimited by single space characters, and tokenization is a preprocessing procedure that removes any disparity between tokens and words, achieving a one-to-one mapping between tokens and words or major grammatical categories by concatenation (joining) or by segmentation, respectively. A natural one-to-one correspondence between a word (simple or complex) and a token has generally been observed in isolating45 and inflectional46

45 They have a low morpheme-per-word ratio; the lower this ratio, the more isolating the language is said to be. Purely isolating languages have a 1:1 word-morpheme ratio, e.g. Mandarin. Languages with a one-to-one correspondence between words and morphemes are therefore said to be isolating.

46 They have a high morpheme-per-word ratio, in contrast to isolating languages; the higher this ratio, the more inflectional the language is said to be, e.g. Indo-European languages (also known as low synthetic languages).


languages but hardly ever in agglutinating languages47, where one token usually corresponds to many grammatical words (POS categories). However, in some inflectional languages, particularly those that use a modified Perso-Arabic script and borrow heavily from Persian, e.g. Urdu and Kashmiri, there is no one-to-one correspondence between words and tokens in many complex words (bound + free morphemes). The root of such a word is written as one token and the affix as a separate token. This is common practice in Kashmiri and Urdu and occurs mostly with affixes borrowed from Persian, as shown in Table.4. In Kashmiri, the practice has been observed even in some simple words, where the two parts of the word are written as two separate tokens and the blank space between them may or may not represent a morpheme boundary. Moreover, case markers, if added to such words, give rise to three-token words, as in examples 4, 5 and 6 of Table.5. Thus the second part of the word may or may not be a bound morpheme, but the third token is certainly a bound morpheme. This orthographic convention of writing bound morphemes or parts of a word as separate tokens, to avoid unacceptable word shapes arising from the context-sensitive48 script, is called split-orthography49. The Kashmiri-specific examples of split-orthography are mostly taken from the corpus sample given in Table.3.

The concept of space as a word boundary is weak in the Urdu script (Durrani and Hussain, 2010), but it is far weaker in the Kashmiri script. A zero width non-joiner, a character that works like the space seen between Roman letters but has no width, is primarily required to generate acceptable word shapes on the one hand and to join the various parts of a word and thus rectify the tokenization problem on the other. It has been

47 They have the highest morpheme-per-word ratio but, additionally, a low degree of fusion of major grammatical categories, e.g. Turkish, Tamil, Malayalam, Telugu, etc. (also known as polysynthetic languages).

48 In such scripts some letters (the joiners, but not the non-joiners) take different shapes upon joining with adjacent letters. A letter can take three possible shapes, at the initial, medial and final positions (contexts) in a concatenated sequence of letters in a word. The letters assuming these three shapes according to context are called joiners. Another set of letters, called non-joiners, do not change their shape in this way: they join only with the letter immediately preceding them and thus have only final and isolated variants. Examples of joiners are the Arabic letters 'te, miim, ye, be, siin' (ت م ی ب س), and examples of non-joiners are the Arabic letters 'vaav' and 're' (و ر ۄ ڑ).

49 The term split-orthography is used because no technical term exists in the literature to denote the splitting tendency in the Perso-Arabic script (an orthographic convention), due to which affixes and roots are written separately and even some roots are written as two tokens, forming multi-token words. The term is in a way a new coinage to describe the tokenization problem of Kashmiri, Urdu, etc. (S. Bhat, 2010 & 2012).


already implemented for Urdu (G. Lehal, 2010), but for Kashmiri it has been implemented only very recently, in a form compatible with windows-08 only. However, instead of the zero width non-joiner, the underscore (_) has been used in the tokenization of the Urdu Dependency Treebank (R. Bhat 2012) and in the manual preprocessing of the Urdu and Kashmiri corpora at LDCIL (S. Bhat 2012). In the current work the hyphen (-) has been used, instead of the underscore or the zero width non-joiner, to join the parts of a word, as shown in examples 1-3 of Table.4 and 10-15 of Table.5.

S. No. Root (Token I) Affix (Token II) Words Urdu Kashmiri

1 aqIl (عقل) -mand (مند) aqIl–mand عقل_مند نند چم عقل-

2 mazmoon -nigar mazmoon–nigaar مضمو�_Yگا� چمضمو�-Yگا�

3 tA:liim -yaaftah tA:liim–yaaftah تعلیم_یا}تھ Kہ ییم-یا}ت Eتٲ

4 khatIm -shudah khatIm–shudah حاصل_شد حٲصل شد

5 hA:sil -kardah hA:sil–kardah حاصل_Hرد ہد حٲصل Hر

6 gonah -gaar gonah–gaar گنا_گا� گۄKY گا�

7 qosuur -vaar qosuur–vaar قصو�_Iا� قصو� Iا�


8 khosh -go khosh–go خوش_گو خوش گو

9 tarqii -paziir tarqii–paziir ترق/_پزیر ییر ترق/ پز

Table.4: Tokenization Problem Common in Kashmiri & Urdu

S. No. Root (Token I) Affix (Token II) Affix (Token III) Words Kashmiri

1 butaan Chi - buuTaan-chi K5ہ بھوٹا�

2 iisvii k’n - isvii-k’n �Cٮ H عیسو�

3 sekretrii Yan - sekretrii-yan چی� سیکریٹر�

4 zimI dA:rii yan zimI-dA:rii-yan چی� ہK دٲ�� ہgم

5 tariiqI Kaar k’n tAriiqI-kaar-k’n �Cٮ H ا�H Kہ طریق

6 fal safI kis fal-safI-kis ہHس Kہ }ل سف

7 paanI van’ - paanI van’ ہYI Kۍ Yپا

8 mukhaal fAts - mukhaal-fAts �}ژ مخال

9 misrI Kis - misrI-kis ہHس ہر مص

10 sapIz mIts - sapIz-mIts ہمژ سپز-

11 vAr’ Yas - vAr’-yas چیس و�ۍ-

12 Ak’ sIY - Ak’-sIY ہس] �اHۍ-

13 pAt’ Mis - pAt’-mis ہمس �پتۍ-


14 anan vA:l’ - anan-vA:l’ ل ۍانن-و ٲ

15 yithI pA:Th’ - yithI-pA:Th’ ٹھ -پ ۍیتھ ٲ ہ

Table.5 Tokenization Problem Specific to Kashmiri
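In the current work the parts of multi-token words were joined manually. The joining convention itself can nevertheless be sketched in a few lines of Python. The sketch below assumes that a list of known multi-token words is available (here three romanized entries taken from Tables 4 and 5) and joins them by greedy longest match; the lookup-based approach, the names `SPLIT_WORDS` and `join_split_words`, and the use of romanized tokens are assumptions made for illustration only.

```python
# Multi-token words to be joined, as sequences of romanized tokens.
# The entries are taken from Tables 4 and 5; the lookup-based approach
# itself is an assumption of this sketch (the thesis joined them manually).
SPLIT_WORDS = {
    ("aqIl", "mand"),
    ("khosh", "go"),
    ("zimI", "dA:rii", "yan"),
}
MAX_LEN = max(len(seq) for seq in SPLIT_WORDS)


def join_split_words(tokens, joiner="-"):
    """Greedy longest-match joining of known multi-token words."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in SPLIT_WORDS:
                out.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:  # no known sequence starts here; keep the token as is
            out.append(tokens[i])
            i += 1
    return out


print(join_split_words(["zimI", "dA:rii", "yan", "khosh", "go"]))
```

Passing `joiner="_"` reproduces the underscore convention mentioned above for the Urdu Dependency Treebank, and `joiner="\u200C"` would insert a zero width non-joiner instead of a visible character.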

6. Summary

In the wake of the current corpus linguistics scenario and the boom in empirical studies, the development of a Kashmiri corpus is the need of the hour. It is needed not only to feed data-hungry research and development initiatives for the technological enhancement of the Kashmiri language, but also to carry out quantitative studies that may uncover facts that have remained unexplored so far owing to the unavailability of a corpus. Although this chapter describes the building of KashCorpus from a specific point of view, i.e. for developing KashTreeBank, the corpus can also be used for other types of studies. This work is the most basic part of a larger attempt at resource creation to put the Kashmiri language on the map of current language technology. Like any other corpus-building endeavor, the creation of KashCorpus was not a straightforward process; there were many issues, such as the selection of text domains and the representativeness of the selected samples, which were scrutinized and resolved before starting the actual work. The other major problems include the unavailability of any online resource from which data could have been obtained, the total absence of commercially important text domains like medical and tourism text, and the lack of well-trained data inputters who are well versed not only in the Perso-Arabic script in general but in the Kashmiri script and its Unicode-based input setup in particular. Data inputters usually use InPage rather than Microsoft Office for Kashmiri input. Finally, several processes were carried out to make the corpus fit for value addition through various types of annotation. These

69

preprocesses include corpus cleaning, normalization and tokenization. Though,

sometimes tokenization is treated as a separate problem in between corpus

building and corpus annotation but in this work it is included as the part of corpus

building as it has been carried out manually along with cleaning and

normalization. In the present form, the KashCorpus is ready to be used for the

future work.

Chapter.4 POS Tagging of KashCorpus

“The definitions of the parts-of-speech are very far from having attained the degree of exactitude found in Euclidean geometry.”

Otto Jespersen, The Philosophy of Grammar, 1924

1. Introduction

Part-of-speech (POS) tagging constitutes the fundamental layer of annotation in treebanking, on which further annotation layers are built. The next layer of annotation is chunking, which is important for determining dependency relations, the most crucial task in building a dependency treebank. The POS category that forms the head of a chunk can be further augmented with crucial morphological information such as PNGC[50] and TAM[51], but in the current work morphological information has not been added, in order to concentrate on inter-chunk dependencies and get the skeletal dependency trees ready. Morphological information can easily be added later to obtain better results in automatic syntactic annotation.

This chapter describes the first level of annotation of KashCorpus, i.e. POS tagging and chunking, together with the associated resources, technicalities and data manipulations that were required to start POS annotation. Section two introduces the notion of POS tagging. Section three briefly discusses the important annotation standards. Section four presents the POS tagsets developed mainly for English and Indian Languages, and elaborates only the most relevant ones. Section five describes the Kashmiri BIS tagset. Sections six, seven, eight and nine cover the requirements, the process, the issues and the guidelines of POS tagging, respectively. Section ten provides the statistical results, and section eleven summarizes the chapter.

50 Person, Number, Gender, Case

51 Tense, Aspect, Mood

2. The Notion of POS Tagging

Jurafsky and Martin (2000) give an elegant account of the notion of parts-of-speech (POS) tagging:

“Words are traditionally grouped into equivalence classes called parts-of-speech

(POS), word classes, morphological classes, or lexical tags. In traditional grammars

there were generally only a few parts of speech (noun, verb, adjective, preposition,

adverb, conjunction, etc.). More recent models have much larger numbers of word

classes (45 for the Penn Treebank (Marcus et al., 1993), 87 for the Brown corpus

(Francis, 1979; Francis and Kučera, 1982), and 146 for the C7 tagset (Garside et al.,

1997)).

The part of speech for a word gives a significant amount of information about

the word and its neighbors. This is clearly true for major categories, (verb

versus noun), but is also true for the many fine distinctions. For example

these tagsets distinguish between possessive pronouns (my, your, his, her, its)

and personal pronouns (I, you, he, me). Knowing whether a word is a

possessive pronoun or a personal pronoun can tell us what words are likely to

occur in its vicinity (possessive pronouns are likely to be followed by a noun,

personal pronouns by a verb). This can be useful in a language model for

speech recognition.”

POS tagging is the process of assigning a part-of-speech tag to every word in continuous text after morphological analysis and grammatical interpretation (Garside, 1995). A set of specially designed tags carrying grammatical information is assigned to words to indicate their part-of-speech category with regard to their use in the text (Leech and Garside, 1982). In other words, POS tagging labels the words of a running corpus with their grammatical categories (optionally with morpho-syntactic features), based on both their form and their contextual function. It is essentially a classification problem in which words are classified against a predefined inventory of parts-of-speech categories called a POS tagset. For morphologically rich languages, it plays the more limited role of syntactic category disambiguation within the pipeline of NLP modules: the morphological analyzer provides all possible POS categories for a word, and the POS tagger disambiguates by selecting the one category that fits the context. POS tagging is the fundamental level of corpus annotation; in fact, it is the first stage on the way to the syntactic annotation needed to develop a treebank. Apart from its role in treebanking, a POS-annotated corpus can by itself be used in a wide range of NLP applications such as information extraction, information retrieval, parsing (shallow as well as deep), machine translation, speech synthesis and speech recognition.
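The division of labour described above, in which an analyzer proposes every possible category and the tagger keeps exactly one, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy English lexicon, the tag names and the two context rules are hypothetical and are not taken from KashCorpus or from any actual tagger.

```python
# Hypothetical toy lexicon standing in for a morphological analyzer:
# each word form maps to every POS category it could carry.
CANDIDATES = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"NOUN", "VERB"},
}


def disambiguate(tokens):
    """Select one tag per token, using the previous tag as context."""
    tags = []
    for token in tokens:
        options = CANDIDATES.get(token.lower(), {"UNK"})
        previous = tags[-1] if tags else None
        if len(options) == 1:
            tag = next(iter(options))      # nothing to disambiguate
        elif previous == "DET" and "NOUN" in options:
            tag = "NOUN"                   # illustrative rule: determiner + noun
        elif previous == "NOUN" and "VERB" in options:
            tag = "VERB"                   # illustrative rule: noun + verb
        else:
            tag = sorted(options)[0]       # deterministic fallback
        tags.append(tag)
    return list(zip(tokens, tags))


print(disambiguate(["The", "can", "rusts"]))
```

A statistical tagger would replace the hand-written rules with probabilities learned from an annotated corpus, but the input and output have the same shape: several candidate categories in, one category out.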

3. Annotation Standards

POS standards provide a framework within which a tagset can be designed to annotate corpora. Deciding whether to choose one of the existing standards or to lay down a new one that draws on them is therefore the first and foremost task in corpus annotation.

Standardization in POS tagset design is important not only for achieving consistency of annotation across related languages and research projects, but also for ensuring maximum sharing and minimum wastage of annotated language resources, particularly in resource-poor scenarios such as that of Indian Languages. For European languages such steps were taken more than a decade ago in the form of EAGLES[52], ELRA[53] and ISLE[54], but for Indian Languages standardization is quite recent (only 3-4 years old) and came into being in the form of the BIS scheme, although there were earlier efforts in this direction in the form of ILPOSTS and ILMT. The EAGLES and BIS POS annotation schemes can be seen as instrumental in bringing consensus among NLP groups with divergent interests and approaches, so that they can take up annotation projects and solve various CL, NLP or LT problems. These two standard frameworks are briefly described below from the POS annotation point of view.

3.1. EAGLES Framework

EAGLES is a widely used framework for POS tagset design whose main aim is the standardisation of the POS tagsets used for annotating corpora of various European Languages. Standardization of tagsets is a very important process, as pointed out by Leech and Wilson (1999: 55-56):

“In the interests of interchangeability and re-usability of annotated corpora, it

is important to avoid a ‘free-for-all’ or ‘re-invention of the wheel’ every time

52 The EAGLES (Expert Advisory Group on Language Engineering Standards) guidelines provide recommendations for standardization of a range of language engineering resources. The recommendations referred to here are solely the guidelines on morpho-syntactic annotation of texts.

53 The European Language Resources Association (ELRA)

54 International Standards for Language Engineering (ISLE)

a new project begins………… At the cross-linguistic level, annotations used

for one language should as far as possible be compatible with annotations

used for another. Compatibility here means that where there are descriptive

categories common in between different languages, these should be

recognised in the annotation scheme and recoverable from the annotations

applied to texts in different languages.”

The EAGLES guidelines provide a set of features and an encoding scheme that different tagsets were supposed to include. For morpho-syntactic annotation, the guidelines specify 1) what is obligatory, 2) what is recommended, and 3) what are optional extensions. At each level, tags are defined as morphosyntactic Attribute-Value (A-V) pairs; for example, gender is an attribute that can take the values masculine, feminine or neuter. These A-V pairs are structured as a hierarchy, though not necessarily a strict one. The property that the EAGLES guidelines suggest as obligatory for any POS tagset is the set of thirteen major word classes: noun, verb, adjective, pronoun/determiner, article, adverb, adposition, conjunction, numeral, interjection, unassigned/unique, residual, and punctuation. The recommended properties are then organised according to these major word classes; e.g. the attribute Type, with values such as Common and Proper, is for nouns, Person, with values First, Second and Third, is for verbs, and Degree, with values Positive, Comparative and Superlative, is for adverbs. The recommended attributes also include number, gender, case, finiteness, tense, voice, and other sub-categorisation features. The optional recommendations consist of similar attributes of lesser applicability, and some additional language-specific values for the recommended attributes.
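The attribute-value organisation described above can be sketched as a small data structure. This is only an illustrative reading of the description, not the EAGLES encoding itself: the class name `Tag`, the `validate` method and the decision to attach gender to nouns are assumptions, and only the attributes named in the text are listed.

```python
from dataclasses import dataclass, field

# Obligatory level: the thirteen major word classes listed in the text.
MAJOR_CLASSES = {
    "noun", "verb", "adjective", "pronoun/determiner", "article", "adverb",
    "adposition", "conjunction", "numeral", "interjection",
    "unassigned/unique", "residual", "punctuation",
}

# Recommended level: a few attributes with their values, organised by word
# class. Attaching gender to nouns is an assumption made for this sketch.
RECOMMENDED = {
    "noun": {
        "type": {"common", "proper"},
        "gender": {"masculine", "feminine", "neuter"},
    },
    "verb": {"person": {"first", "second", "third"}},
    "adverb": {"degree": {"positive", "comparative", "superlative"}},
}


@dataclass
class Tag:
    """A tag as a major word class plus attribute-value pairs."""
    word_class: str
    attributes: dict = field(default_factory=dict)

    def validate(self):
        if self.word_class not in MAJOR_CLASSES:
            raise ValueError(f"unknown word class: {self.word_class}")
        allowed = RECOMMENDED.get(self.word_class, {})
        for attribute, value in self.attributes.items():
            if value not in allowed.get(attribute, set()):
                raise ValueError(f"{attribute}={value} not allowed for {self.word_class}")
        return True


print(Tag("noun", {"type": "proper", "gender": "feminine"}).validate())
```

Language-specific optional extensions would correspond to adding further attributes or values to the `RECOMMENDED` table without touching the obligatory word classes.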

The value of this framework is that it promotes consistency and reusability of linguistic resources across languages and discourages “wheel reinvention”. The main drawback of the EAGLES guidelines, however, is that they cover only a tiny fraction of the world’s languages. As a project of the European Union, EAGLES covers only English, Dutch, German, Danish, French, Spanish, Portuguese, Italian and Greek: nine languages of Western Europe which are, moreover, typologically similar. It is worth mentioning that ILPOSTS, on the basis of which the LDCIL POS tagsets were made for the annotation of Indian Language corpora, was itself based on EAGLES; it can be regarded as an Indian extension of EAGLES. As pointed out by Leech and Wilson (1999: 58):

“It remains to be seen how far these guidelines can be extended, without

substantial revision, to other languages”.

3.2. BIS Framework

BIS is the latest framework for the annotation of Indian Languages and is recognised by the Bureau of Indian Standards (BIS). Its foundation was laid at the first meeting of the POS tagset standardization committee, held at the Department of IT, New Delhi, on 19th Nov. 2009. It evolved by taking insights from earlier efforts (ILPOSTS, ILMT, etc.) in order to bring consensus among the different NLP groups in India. It incorporates the set of POS labels from the ILMT POS tagset (Bharati et al., 2006) and the notion of hierarchical structure from ILPOSTS (Baskaran et al., 2008), but avoids the fine granularity proposed by ILPOSTS.

In line with the ILMT tagset, it assumes separate layers for morphological analysis and POS annotation, for efficient capture of grammatical information and better results in manual as well as automatic annotation. It further holds that the input to the POS tagger (the text corpus) should already have undergone pre-processing. Thus, every token (word) to be assigned a POS tag is a single lexical item, and not a token which internally contains more than one lexical item, as can be seen in agglutinative languages, or less than one, as in languages with split orthography (Bhat, 2010). It also assumes that there must be an MWE identifier layer after POS tagging. Since POS tagging is a lexical-level annotation process, any unit that involves more than one lexical item, such as a conjunct verb or a compound, will not be captured at the POS level.

Therefore, BIS proposes hierarchical and coarse-grained tagsets for all Indian Languages. These tagsets have three levels of hierarchy: Type, Subtype-I and Subtype-II. The first level (Type) includes 11 main categories: Noun, Pronoun, Demonstrative, Adjective, Quantifier, Verb, Adverb, Postposition, Conjunction, Particle, and Residual. The second level (Subtype-I) includes 32 subcategories, and the third level (Subtype-II), which is optional, includes 3 sub-subcategories for verbs only. The main principles[55]

55 The principles are given in ‘Linguistic Resource Standards for POS Tag Set for Indian Languages’. Documentation by D. M. Sharma in May 2010. MS


that were taken into consideration while developing the POS tagsets for the annotation of Indian Language corpora are as follows:

i. The scheme should be generic, i.e. it should work for all the Indian Languages

and shouldn’t be oriented towards any one language or a group of languages.

ii. A layered approach should be followed for annotating various types of

linguistic information available in a text. Each type of information like

morphological, POS and chunk information should be annotated in separate layer.

iii. The scheme should be flexible enough to incorporate or drop a category, either at the top level of the hierarchy or as a sub-category of an existing type, so that the scheme can be extended from one language to another.

iv. The annotation scheme should be annotator friendly, avoiding ambiguous tags that put cognitive load on the annotators and lead to inconsistency in the annotation.

v. The scheme should be mappable with pre-existing annotation schemes of

Indian Languages to avoid the wastage of the resources.

vi. The scheme should support all types of NLP research efforts independent of a

particular technology and development approach.

4. POS Tagsets

The POS tagsets that have been designed for English and Indian Languages are described below.

4.1. POS Tagsets for English

“POS tagging has been a hot research topic since the early 1980s” (Voutilainen, 1999), but the research actually originated in the 1960s for European Languages. In India, however, research on POS tagging is quite recent, and the concept of tagset design and its standardization is therefore also very recent compared with its European and American counterparts. The main efforts in POS tagging resulted in various POS tagsets such as Brown, CLAWS1 and UPenn (mainly designed for English), but these tagsets are mostly simple inventories of tags corresponding to morpho-syntactic features, and they vary greatly in granularity (Hardie, 2004). It is the CLAWS2 and CLAWS7 tagsets that are considered landmarks in the history of tagset design (Leech 1997). CLAWS7 marked an important change in the structure of tagsets, from a flat structure[56] to a hierarchical structure[57].

According to Jurafsky and Martin (2000), “There are a small number of popular tagsets for English, many of which evolved from the 87-tag tagset used for the Brown corpus (Francis, 1979; Francis and Kučera, 1982). Three of the most commonly used are the small 45-tag Penn Treebank tagset (Marcus et al., 1993), the medium-sized 61 tag C5 tagset used by the Lancaster UCREL project's CLAWS (the Constituent Likelihood Automatic Word-tagging System) tagger to tag the British National Corpus (BNC) (Garside et al., 1997), and the larger 146-tag C7 tagset (Leech et al., 1994).” However, irrespective of their popularity, a brief description of several POS tagsets of English and Indian Languages is given below.

4.1.1. CGC[58] Tagset

The earliest work on POS tagging started in the USA with the CGC of Klein and Simmons (1963) for English. The tagset consists of thirty tags, of which only the pronoun tags are decomposable[59]. Their CGC program also outputs information, external to the main tag, on the number of nouns and verbs; it also notes whether a noun is possessive, so that the actual number of categories distinguished is considerably greater (Hardie, 2004). The tagset also incorporates tags for punctuation marks, which are treated as words; it has been pointed out that treating punctuation marks in this manner can be a significant aid in tagging nearby words (Leech, 1997).

4.1.2. TAGGIT Tagset

Klein and Simmons’s work inspired the work of Greene and Rubin (1971)[60]. Their tagset contains 77 POS tags, but their TAGGIT program displays information

56 Simple inventory of unrelated POS tags.

57 The term “hierarchical”, when used for a tagset, means that the categories in that tagset are structured relative to one another. Rather than a large number of independent categories, a hierarchical tagset will contain a small number of categories, each of which contains a number of sub-categories, each of which may contain sub-sub-categories, and so on, in a tree-like structure (ibid).

58 Computational Grammar Coder-CGC (Klein and Simmons 1963) was designed as a component of a parser (in turn a component of a system to synthesize human language behaviour).

59 A tag is considered to be “decomposable” if the string that represents it contains one or more characters that carry the same meaning wherever they occur in other tags of the same tagset. For example, any noun tag which combines an N for “noun” with other characters to indicate other features of the word is decomposable (N.SG.MAS.dir).

60 It was Greene & Rubin’s POS tagset which was used in annotating the Brown Corpus; it was refined slightly at a later stage of this project (see Francis and Kučera 1982: 3-15) and came to be known as the Brown tagset. It consists of 87 tags; allowing for compound tags, the number of potential analyses for any given orthographic form is 179 (Sampson 1987).


regarding the number as an integral part of the main tag itself (ibid). CGC and TAGGIT display some consistent design features. The TAGGIT tagset also incorporates tags for punctuation marks, which are treated as words. Greene and Rubin based the definition of their tags on the syntactic function that a given word form performs in a particular context. The tags show a stronger tendency to be decomposable. For example, in the tag WPO, W stands for wh-word, P for pronoun and O for objective form. However, unlike some later tagsets, this tagset was not hierarchical; nor was the earlier tagset of Klein and Simmons (1963). Both of these early projects also had some means of dealing with ambiguity. Some of the TAGGIT tags were used exclusively for ambiguous words. For example, the CI tag marks a word that is either a subordinating conjunction or a preposition, such as ‘before’. There are also tags for subordinating conjunction (CS) and preposition (IN). The CS and IN tags alone would suffice for an exhaustive classification, but CI is necessary on pragmatic grounds.
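The role of an ambiguity tag such as CI can be pictured with a short sketch. Only the three tags named above are included; the `resolve` function and its single boolean cue are hypothetical and are not a description of how TAGGIT itself worked.

```python
# The one ambiguity tag named in the text and the two basic tags it covers.
AMBIGUITY_TAGS = {"CI": ("CS", "IN")}


def expand(tag):
    """Return the basic categories that a tag may stand for."""
    return AMBIGUITY_TAGS.get(tag, (tag,))


def resolve(tag, followed_by_clause):
    """Hypothetical later step: choose CS before a clause, IN otherwise."""
    options = expand(tag)
    if len(options) == 1:
        return options[0]
    return "CS" if followed_by_clause else "IN"


print(expand("CI"))
print(resolve("CI", followed_by_clause=True))   # e.g. 'before he left'
print(resolve("CI", followed_by_clause=False))  # e.g. 'before noon'
```

The sketch shows why CI is redundant for classification (it always expands to CS or IN) yet convenient in practice: it lets the decision be postponed until more context is available.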

4.1.3. CLAWS[61] Tagset

As mentioned above, POS tagging has been a well-known research topic since the early 1980s. Within the decade from the 1980s to the 1990s, a number of tagsets were devised for English at Lancaster University for use in the CLAWS tagger (Garside 1987). The C1 tagset (also known as the LOB tagset) was used in the annotation of the LOB Corpus. Since this corpus was designed to parallel the structure of the Brown Corpus, the tags were also parallel, and C1 is very similar to the later version of the Brown tagset (Francis and Kučera 1982). The development of the C2[62] tagset was motivated by:

“providing distinct codings for all classes of words, having distinct grammatical behavior, and making the tagset more systematic, in the way, that tags are built up from individual characters” (Sampson 1987).

This means that the C2 tagset (166 tags) became more decomposable and more hierarchical. For example, all verbal tags have V as their first character and, as their second character, either V again (for a main verb) or another character (for an auxiliary verb). The major subsequent developments in the CLAWS tagsets were the C5 and C7 tagsets, developed for the annotation of the BNC (see Leech, Garside & Bryant 1994, Leech 1997b, Garside and Smith 1997). The C7

61 Constituent Likelihood Automatic Word-tagging System

62 The CLAWS2 tagset was the basis for the much larger, much finer-grained SUSANNE Word-tag Set (Sampson 1995: 79-149; circa 360 tags).

tagset (146 tags) is the more fine-grained of the two and can be regarded as a further refinement of the CLAWS2 tagset, whereas the C5 tagset is something of a departure from the others, since it has fewer tags (61); this was intended to make it useful to the largest number of end users (Hardie, 2004). The C5 tagset has, moreover, been characterized as a flat tagset (Cloeren, 1999). In fact, although none of the CLAWS tagsets is laid out in the hierarchical fashion described by Cloeren, the C7 tagset is hierarchical in conceptual terms (Leech, 1997). Furthermore, both C5 and C7 are largely decomposable, C7 again to a greater extent. For example, in the tag PPHO2, the first ‘P’ stands for pronoun, the second ‘P’ for personal, ‘H’ for third person, ‘O’ for objective case and ‘2’ for plural.
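Decomposability can be made concrete by reading a tag character by character. The sketch below covers only the tag PPHO2 and only the glosses given above; the table `C7_POSITIONS` and the function names are illustrative assumptions, and the real C7 tagset contains many more characters per position.

```python
# Position-by-position glosses for PPHO2, as given in the text.
C7_POSITIONS = [
    {"P": "pronoun"},
    {"P": "personal"},
    {"H": "third person"},
    {"O": "objective case"},
    {"2": "plural"},
]


def decompose(tag):
    """Return the feature named by each character of the tag, left to right."""
    if len(tag) > len(C7_POSITIONS):
        raise ValueError(f"tag too long for this sketch: {tag}")
    features = []
    for position, char in enumerate(tag):
        table = C7_POSITIONS[position]
        if char not in table:
            raise ValueError(f"no gloss for {char!r} at position {position}")
        features.append(table[char])
    return features


def in_category(tag, prefix):
    """Conceptual hierarchy: a tag belongs to every category whose code is a prefix of it."""
    return tag.startswith(prefix)


print(decompose("PPHO2"))
print(in_category("PPHO2", "PP"))
```

Reading prefixes as categories is one way to see why a decomposable tagset can be hierarchical in conceptual terms even when it is written out as a flat list of tags.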

4.1.4. UPenn[63] Tagset

The POS tagset used in the Penn Treebank (Marcus et al., 1993) is also based on the Brown Corpus tagset. However, it has been modified in the direction of simplification rather than of greater complexity, as was the case with the CLAWS tagsets (Hardie, 2004). Thus, there are considerably fewer tags (36, excluding the tags for punctuation and other symbols, which are included in the count of 45 quoted above). It makes fewer of what have been described as “lexically recoverable distinctions” (Marcus et al., 1993); for example, the distinction between lexical verbs and the auxiliary verbs (be, do and have) is not retained in this tagset, because it can be recovered from the forms of the words. Information that could be recovered from the parse has also been excluded from the tagset to avoid the risk of inconsistency in tagging. “It is clear that reducing the size of the tagset reduces the chances of such tagging inconsistencies” (ibid).

4.1.5. Lund Tagset

The tagset designed for the annotation of the London-Lund Corpus of Spoken English differs significantly from the Brown Corpus/CLAWS tagset tradition (Svartvik 1990). It is more fine-grained, consisting of just over 200 tags. It has been designed for spoken texts and includes tags for a variety of discourse-element types of adverbs, not usually distinguished in the tagging of written texts, as well as tags for other features of speech such as swearing. For the same reason, it lacks punctuation tags. Moreover, this tagset is hierarchical and decomposable into single characters (or 2-3 character strings) that indicate given features.

63 University of Pennsylvania, USA


4.2. POS Tagsets for Indian Languages

Despite being a relatively new field, research on POS annotation in Indian Languages has also produced a number of tagsets and common frameworks. These include the AU-KBC tagset for Tamil (2001), Hardie's tagset for Urdu (Hardie, 2005), the IIIT-ILMT tagset for Hindi (Bharati et al., 2006), the MSRI-JNU tagset for Sanskrit (Chandra Shekhar, 2007), MSRI-ILPOSTS for Hindi & Bangla (Baskaran et al., 2008), the CSI-HCU tagset for Telugu (Sree R.J et al., 2008), the Nelralec tagset for Nepali (Hardie et al., 2005), the LDCIL tagsets for ILs (Malikarjun et al., 2010; Bhat et al., 2010), the BIS tagsets for all ILs (Ms. 2009), etc. Some of the important POS tagsets relevant to the current work are briefly described below.

4.2.1. EMILLE[64] Tagset for Urdu

Urdu, written in the Perso-Arabic script, offers a different set of challenges in POS tagset design. Hardie (2005) designed the Urdu tagset for the EMILLE project on the basis of the Urdu grammar of Schmidt (1999) and in accordance with the EAGLES guidelines. However, designing a tagset for Urdu was not a straightforward task, particularly with respect to orthographic conventions and the presence of borrowed Arabic and Persian forms, which are structurally quite distinct from the Indo-Aryan forms. The issues highlighted in Hardie (2005) include tokenisation and idiosyncratic features of Urdu. In Urdu orthography, many elements described as suffixes in traditional grammars are actually written as independent tokens. Hence, the arbitrary decision was taken to treat every orthographic space as a word break, even if it occurs within a lexical item. This, however, makes it necessary to have some means of tagging elements that do not constitute a free form (word). For example, "zimmah daar" (responsible) consists of two tokens, a root and a derivational suffix. The same suffix appears fused to the root in other contexts, as in "samajhdaar" (sensible), and further suffixation can take place, as in "zimmah daarii" (responsibility). Against the background of such orthographic conventions, a syntactically null tag has been introduced for a token whose grammatical status depends on the subsequent token, e.g. samajhdaar\JJU and zimmah\LL daar\JJU. The major categories in the Urdu tagset are virtually identical to the equivalent categories defined in EAGLES: Nouns, Pronouns, Verbs, Adjectives, Adverbs, Postpositions and Conjunctions. The tagset handles the tokenization problem (for details see chapter 3) at the POS level and

64 Enabling Minority Language Engineering


thus tries to deal with two separate problems, tokenization and POS tagging, simultaneously.
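How the null tag lets tokenization be recovered from the POS layer can be sketched as follows. The word\TAG notation and the two examples come from the text above; the functions `parse` and `merge_null` are hypothetical helpers written for this illustration and are not part of the EMILLE tools.

```python
def parse(tagged_text):
    """Split a 'word\\TAG word\\TAG ...' string into (word, tag) pairs."""
    pairs = []
    for item in tagged_text.split():
        word, tag = item.rsplit("\\", 1)
        pairs.append((word, tag))
    return pairs


def merge_null(pairs, null_tag="LL"):
    """Attach each null-tagged token to the token that follows it."""
    merged, pending = [], []
    for word, tag in pairs:
        if tag == null_tag:
            pending.append(word)          # wait for the token that carries the category
        else:
            merged.append((" ".join(pending + [word]), tag))
            pending = []
    if pending:                           # a dangling null token is kept as it is
        merged.append((" ".join(pending), null_tag))
    return merged


print(merge_null(parse(r"zimmah\LL daar\JJU")))
print(merge_null(parse(r"samajhdaar\JJU")))
```

After merging, the split form "zimmah daar" and the fused form "samajhdaar" both surface as a single lexical item tagged JJU, which is the sense in which the tagset deals with tokenization and tagging at once.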

4.2.2. ILMT[65] POS Tagset for Hindi

The ILMT POS tagset was developed by the Akshar Bharati Group for annotating Hindi corpora. It is based on the principle of simplicity, with the motivation of extending it as a framework for all ILs. Another important consideration in its design is the division of labour between the POS tagger and the Morph Analyser: the POS tagger is supposed merely to disambiguate the multiple tags generated by the Morph Analyser. Finer distinctions have been avoided in order to keep the number of tags small, which facilitates efficient machine learning and accuracy in automatic annotation. This has resulted in a flat tagset comprising 21 POS tags; other inflectional information associated with the tokens can be obtained from the Morph Analyser. Form-function duality is one of the crucial issues in tagset design. The ILMT tagset is mainly form-based, as pointed out in Bharati et al. (2006): “the syntactic function of a word is not considered for POS tagging.....the word is tagged always according to its lexical category...” Hence, the pragmatic function of a token in context is not considered the primary basis for POS tagging. As far as the tags are concerned, the UPenn tags have been used along with newly devised ones. The most important point is that the tagset innovatively leaves finiteness to be dealt with at the next level of annotation, i.e. at the word-group or chunk level rather than at the token level. Participles and gerunds are tagged as VM (though they function differently), and all auxiliaries are tagged as VAUX. A variable tag (XC) has also been introduced, where X stands for the category of the token that is part of a compound and C stands for compound. Finally, it is worth mentioning that, although form has been chosen as the primary basis for POS annotation, adherence to semantic as well as syntactic functions is often evident in the tagset.

4.2.3. ILPOSTS[66]

It is a POS tagset framework designed to cover the fine-grained morphosyntactic

details of Indian Languages. It proposes a three-level hierarchy of categories,

types and attributes. It has been developed by Microsoft Research India, on the

basis of EAGLES guidelines (Leech & Wilson 1999). Language specific POS

65 Indian Language Machine Translation (ILMT) is a consortium project for developing MT systems for major IL pairs. It has been set up at IIIT Hyderabad and is funded by DIT.

66 Indian Language POS Tagset


tagsets have been customised on the basis of it. Tagsets for Sanskrit (C. Shekhar, 2007) and for Hindi and Bangla (Baskaran et al., 2008) were customised first; later the scheme was refined further and tagsets for all ILs were developed at LDCIL (Malikarjun et al., 2010; Bhat et al., 2010). These tagsets are hierarchical in nature and consist of decomposable tags.

A general guiding principle has been formulated to handle form-function duality. A set of ‘Attributes’ has been devised on the basis of morpho-syntactic or simply orthographic practices; the attributes are marked according to their form, while the ‘Types’ are marked on the basis of their function. It is worth mentioning that ‘Attributes’ are tagged according to their morphological visibility (like tense, aspect, etc.) as well as their semantics (like number, gender, etc.), whereas ‘Types’ are based exclusively on semantics (like common noun, proper noun, etc.). A combination of form and distribution-based function is applied for tagging categories like Demonstrative (DEM), Pronoun (P), Quantifier (JQ), Noun (NC), Noun denoting Space & Time (NST) and Adverb of Location (ALC), while orthographic convention is taken as the basis for annotating Postposition and Case marker. Although finiteness is defined on the basis of inflection for person, number, gender, tense, aspect and mood, the verb is not handled as neatly as in later schemes. Further, where forms are similar, a distributional basis is used for distinguishing and annotating categories such as pronoun and demonstrative, or pronoun and quantifier. A token is tagged as a demonstrative if it is followed by an adjective or a noun, and as a pronoun if it is not. Similarly, a token is tagged as a nominal modifier if it is followed by a noun, and as a noun if it is not. Case marker and Postposition are assumed to be instances of the same phenomenon of marking dependents. However, owing to orthographic conventions, the dependent marker is written in two ways, attached or separate, and these are tagged as case marker and postposition, respectively.

4.2.4. BIS[67] Tagset

Designing and developing a POS tagset is a prerequisite for any POS annotation work, whether carried out in isolation or as an integral part of a larger annotation pipeline such as the one involved in building a treebank. As mentioned above (see Section 3.2), BIS is an annotation framework recognized by the Bureau of Indian Standards. The framework has not adopted the Indic system of descriptive categories; rather, like most of the annotation schemes of the world, it relies on the descriptive categories of the Techne. Thus Dionysius Thrax’s Techne (c. 100 B.C.), a grammatical sketch of Greek, has served as a role model not only for contemporary POS descriptions of European Languages but also for the POS descriptions of South Asian Languages. The Techne includes an inventory of eight POS categories (noun, verb, participle, article, pronoun, preposition, adverb, and conjunction).

67 Bureau of Indian Standards

The POS tagsets recommended by BIS for Indian Languages also use the same basic set of POS categories that was used by earlier tagsets such as ILPOSTS and ILMT. The 32 parts-of-speech categories recommended by BIS for Kashmiri are given in Table.1 (for the detailed tagset see Appendix-I). It is worth mentioning that, at the POS level, the verb subcategories of Kashmiri have been kept in line with Hindi-Urdu, i.e. the fine-grained distinction (finite, non-finite, infinitive) has been avoided and the category verb has been sub-divided only into main verb and auxiliary.

No.  Category                     Tag
1    Noun Common                  N_NN
2    Noun Proper                  N_NNP
3    Noun Locative                N_NST
4    Pronoun Pronominal           PR_PRP
5    Pronoun Reflexive            PR_PRF
6    Pronoun Relative             PR_PRL
7    Pronoun Reciprocal           PR_PRC
8    Pronoun WH                   PR_PRQ
9    Pronoun Indefinite           PR_PID
10   Demonstrative Deictic        DM_DMD
11   Demonstrative Relative       DM_DMR
12   Demonstrative WH             DM_DMQ
13   Demonstrative Indefinite     DM_DMI
14   Verb Main                    V_VM
15   Verb Auxiliary               V_VAUX
16   Conjunction Coordinating     CC_CCD
17   Conjunction Subordinating    CC_CCS
18   Particle Default             RP_RPD
19   Particle Interjection        RP_INJ
20   Particle Intensifier         RP_INTF
21   Particle Negation            RP_NEG
22   Quantifier General           QT_QTF
23   Quantifier Cardinals         QT_QTC
24   Quantifier Ordinals          QT_QTO
25   Residual Foreign-word        RD_RDF
26   Residual Symbol              RD_SYM
27   Residual Punctuation         RD_PUNC
28   Residual Unknown             RD_UNK
29   Residual Echo-words          RD_ECH
30   Adverb Manner                RB_RB
31   Adjective                    JJ_JJ
32   Postposition                 PP_PSP

Table.1. BIS POS Tagset of Kashmiri68
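The two-level structure of the tags in Table.1 (a top-level category and a subtype joined by an underscore) can be illustrated with a minimal sketch. The dictionary below holds only a small subset of the tags for illustration, and the helper names split_tag and is_valid are hypothetical; they are not part of any BIS or Sanchay tool.

```python
# Illustrative subset of the tags listed in Table.1 (not the full tagset).
BIS_TAGS = {
    "N_NN": "Noun Common",
    "N_NNP": "Noun Proper",
    "N_NST": "Noun Locative",
    "V_VM": "Verb Main",
    "V_VAUX": "Verb Auxiliary",
    "PP_PSP": "Postposition",
    "RD_PUNC": "Residual Punctuation",
}


def split_tag(tag):
    """Split a two-level tag such as 'N_NN' into (top-level category, subtype)."""
    top, _, sub = tag.partition("_")
    if not top or not sub:
        raise ValueError(f"not a two-level tag: {tag!r}")
    return top, sub


def is_valid(tag):
    """Return True if the tag is in the (illustrative) inventory above."""
    return tag in BIS_TAGS


top, sub = split_tag("V_VAUX")
print(top, sub, is_valid("V_VAUX"))
```

Keeping the top-level category recoverable from the tag itself is what allows the coarse-grained and fine-grained views of the same annotation to coexist.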

5. Description of Kashmiri BIS POS Tagset

POS categories and subcategories as given in the tagset (Appendix-I) are briefly

discussed below with reference to KashTreeBank Dataset-4.

i. Noun (N)

A noun is an open-class item, or content word, that refers to people, places,
animals, objects, substances, ideas, concepts, feelings, etc. Nouns have the
inherent characteristics of number, gender and case, and they are usually
inflected for such information. Noun is the first top-level category in the BIS
tagset, with three sub-types belonging to level-1 of the hierarchy: Common Noun
(NN), Proper Noun (NNP) and Spatio-temporal Noun (NST).

NN is the first subtype of noun; it includes those nouns that denote classes
rather than particular instances. Most nouns are common nouns, which can be
easily quantified or pluralized, e.g. kitaab (book), gagur (rat), insaan
(human), etc. The common nouns extracted from dataset-4 are given in
Appendix-II. They include not only simple nouns but also multiword expressions
like compound nouns and izaafe. NNP is the second subtype; it includes nouns
that denote particular instances (like person, place and institution names)
rather than general classes. These cannot be quantified or pluralized, e.g.
zA:kir hussain (Zakir Hussain), Jiil-i-Dal (Dal Lake), ladaakh (Ladakh), Kashmir
University, Curfewed Night, etc. In some languages like English, common and
proper nouns can be told apart with the help of orthographic cues like
initial-letter capitalization, but many languages, including the Indian
languages, lack such a luxury. Moreover, proper nouns are also used as common
nouns; hence, separating the two is a very difficult task. The proper nouns
extracted from dataset-4 are given in Appendix-II. They include not only simple
single-token proper nouns but also multi-word/token expressions like compound
nouns, izaafe and other named entities, e.g. company names, institution names,
book names, person names, etc.

68 Note: Two tagsets with considerable differences were developed for Kashmiri on the basis of the BIS format: one at KU (which proposed a fine-grained distinction in verb classification) and the other at LDCIL (which, like Hindi-Urdu, avoided the fine-grained distinction). They were later combined at the National Workshop on BIS & ILCI (2011), held at the LTRC Lab, IIIT Hyderabad. The present tagset of Kashmiri is the same unpublished collaborative work, proposed from the side of LDCIL, and it is used here for the first time in any research.

NST is the third subtype of noun; such nouns are also called Nouns of Location
(Nloc). This subcategory was originally introduced in the ILMT tagset to
register the distinctive nature of some locational nouns which also function as
part of complex postpositions (e.g. ke uupar, ke niiche, etc.) in Indian
languages, but in the current tagset the notion is used a little differently.
Here, NSTs are treated as equivalent to the traditional adverbs of time and
place. Since there is no place for the traditional adverbs of time and place in
this tagset, these have been classified under NST, which basically refers to
particular points in space or time, for example hoteyth/tateyth (there), yeteyth
(here), bronThI (front), peyThI (top), etc. (also see Appendix-II). Fig.1 shows
the frequency distribution of the subcategories of noun and reveals which
subcategory is the most frequent in Kashmiri.

[Bar chart; x-axis: N_NN, N_NNC, N_NNP, N_NNPC, N_NST; y-axis: frequency, 0 to 1400]

Figure.1 Subtype Frequency of Noun
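A subtype frequency distribution of the kind plotted in these figures can be computed from any list of (word, tag) pairs. The following is a minimal sketch; the function name subtype_frequency is hypothetical, and the handful of tokens is a toy sample built from examples in the text, not KashTreeBank data.

```python
from collections import Counter


def subtype_frequency(tagged_tokens, top_level):
    """Count the subtype tags that belong to one top-level category (e.g. 'N')."""
    prefix = top_level + "_"
    return Counter(tag for _, tag in tagged_tokens if tag.startswith(prefix))


# Toy sample for illustration only; these are not corpus counts.
toy_tokens = [
    ("kitaab", "N_NN"),
    ("gagur", "N_NN"),
    ("ladaakh", "N_NNP"),
    ("yeteyth", "N_NST"),
    ("chu", "V_VAUX"),
]

noun_freq = subtype_frequency(toy_tokens, "N")
most_frequent_tag, most_frequent_count = noun_freq.most_common(1)[0]
print(noun_freq, most_frequent_tag, most_frequent_count)
```

Filtering on the top-level prefix keeps the counts for one category apart from the rest of the tagset, which is how each of the per-category figures is organized.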

ii. Pronoun (PR)

A pronoun is a closed-class item which, like a noun, has the inherent property
of being inflected for PNGC and can substitute for a noun or a noun phrase. The
question of whether pronominals should form a separate category or a subtype of
noun has been well explored, and a separate tag for pronouns was adopted because
it is helpful for anaphora resolution. Moreover, a pronoun is not a subtype of
noun but rather a variable which need not necessarily refer to a noun. The
top-level category of pronouns (PR) has six sub-types: Pronominal69 (PRP), e.g.
bI (I), tsI (you), su (he), sw (she), yi (this), ti/hu (that/it), etc.;
Reflexive (PRF), e.g. paanI (herself/himself); Relative (PRL), e.g. yus (who),
yi (which), yeli (when), etc.; Reciprocal (PRC), e.g. paanIvan’ (each other); WH
or Interrogative (PRQ), e.g. kus (who/which), kyaa (what), kar (when); and
Indefinite (PRI), e.g. kahn (someone), kuni (somewhere). It is important to
mention that, unlike in traditional pronominal sub-classifications, the
possessive pronoun has not been introduced as a sub-type in this tagset. The
reason is that possession, as an attribute (genitive), can be marked on the
other sub-types as well, as in his, whose, etc. The pronouns extracted from
dataset-4 are given in Appendix-II. Fig.2 shows the frequency distribution of
the subcategories of pronoun and reveals which subcategory is the most frequent
in Kashmiri.

[Bar chart; x-axis: PR_PRC, PR_PRF, PR_PRI, PR_PRL, PR_PRP, PR_PRQ; y-axis: frequency, 0 to 200]

Figure.2 Subtype Frequency of Pronoun

69 It is a cover term that was originally used in LDCIL tagsets. It includes personal pronouns (I, you, he, etc.) that have persons (+human) as antecedents; pronouns (this, that, it) that have animates (-human) or inanimates as antecedents; and discourse-deictic pronouns (this, that, it) that have a whole proposition as antecedent, e.g. John abused Mary. It is clearly a violation.

iii. Demonstrative (DM)

Demonstratives are closed-class items that perform a deictic70 function for a
noun. A demonstrative is always followed by a noun, a pronoun, an intensifier or
an adjective. Demonstratives are a distinct category of determiners: they can
neither substitute for a noun nor specify a noun, but only point out a noun.
Therefore, one must not confuse them with pronouns or adjectives, though they
resemble pronouns in form and are traditionally treated as adjectives. This is
why demonstratives are posited as a separate top-level category in this tagset.
The category consists of Deictic or Default Demonstratives (DMD), Relative
Demonstratives (DMR), WH-Demonstratives (DMQ) and Indefinite Demonstratives
(DMI), e.g. yi in yi laDkI (this boy), kahn in kahn chiiz (something), kus in
kus insaan (which man), etc. The demonstratives extracted from dataset-4 are
given in Appendix-II. Fig.3 shows the frequency distribution of the
subcategories of demonstrative and reveals which subcategory is the most
frequent in Kashmiri.

[Bar chart; x-axis: DM_DMD, DM_DMI, DM_DMR; y-axis: frequency, 0 to 140]

Figure.3 Subtype Frequency of Demonstrative

iv. Verb (V)

A verb is an open-class item that refers to actions, events, occurrences or
states. Verbs have the inherent properties of tense, aspect, mood or voice and
are inflected for such information. They also show inflections for person,
number, gender and case due to their agreement properties. In the present
tagset, Verb is a top-level category with two subtypes: Verb Main (VM) and Verb
Auxiliary (VAUX). As mentioned above, the finer distinctions of finite,
nonfinite and infinitive have been postponed, to be tackled at the chunk level.
The rationale for using these underspecified tags is that the morphosyntactic
information that determines the status of a verb as finite or nonfinite is
distributed over two or three tokens. Therefore, it is impossible to decide upon
the status of the verb unless all the constituent tokens are taken into
consideration. Examples of verbs include kheyvaan (eating), shong (slept), chu
(is), os (was), etc. The verbs extracted from dataset-4 are given in
Appendix-II. Fig.4 shows the frequency distribution of the subcategories of verb
and reveals which subcategory is the most frequent in Kashmiri.

70 It literally means pointing out.

[Bar chart; x-axis: V_VAUX, V_VM; y-axis: frequency, 0 to 800]

Figure.4 Subtype Frequency of Verb

v. Adjective (JJ)

An adjective is an open-class item that modifies a noun or pronoun by
representing one of its properties. Adjectives agree in number, gender and case
with the nouns they modify; therefore, adjectives (both attributive and
predicative) are inflected for PNGC. In the present tagset there are no further
subtypes, but a distinction is made between simple adjectives and adjectives
which are constituents of compound words or izaafe constructions. The tag for a
simple adjective is JJ, whereas the tag for a constituent adjective is JJC.
Examples include zyuuTh (tall), asIl (fine), byuuTh (waste), vozul (red), etc.
The adjectives extracted from dataset-4 are given in Appendix-II. Fig.5 shows
the frequency distribution of the subcategories of adjective and reveals which
subcategory is the most frequent in Kashmiri.

[Bar chart; x-axis: JJ_JJ, JJ_JJC; y-axis: frequency, 0 to 400]

Figure.5 Subtype Frequency of Adjective

vi. Adverb (RB)

An adverb is an open-class item that modifies a verb. Adverbs form an important
top-level category of this tagset. Unlike adjectives, which agree with the nouns
they modify, adverbs do not agree with the verb they modify. They are
indeclinable, i.e. they do not have any inflectional properties. They are
floating elements in the sentence and do not necessarily occur adjacent to the
verb they modify; their distribution in a sentence varies considerably. In the
present tagset, only the adverb of manner (RB) has been taken into
consideration, as adverbs of time and place have already been classified under
noun as Nloc. The adverbs extracted from dataset-4 are given in Appendix-II.
Fig.6 shows the frequency distribution of the subcategories of adverb and
reveals which subcategory is the most frequent in Kashmiri.

[Bar chart; x-axis: RB_RB, RB_RBC; y-axis: frequency, 0 to 50]

Figure.6 Subtype Frequency of Adverb

vii. Postposition (PSP)

Postpositions are closed-class items which, like prepositions, represent case
relations between a verb and its dependent nouns in a sentence. The forms that
represent case relations are either free forms or bound forms. The free forms
are called pre/postpositions, whereas the bound forms (inflectional categories)
are called case markers. Postpositions, as their name suggests, are always
preceded by nominals (nouns or pronouns) and always trigger obliqueness, either
in their head nominals (common in Indo-Aryan languages) or in the entire noun
phrase (as in Kashmiri). However, in the literature, the notion of
pre/postposition is entangled with the notion of case or case marker. For
instance, a case marker is considered to be a purely syntactic inflectional
category and a pre/postposition an independent word representing semantic
relations, but the fact is that orthographic conventions defy these norms: a
purely syntactic form may be bound (inflectional) in one language and free
(independent) in another. To simplify, all the free forms that represent some
sort of relation (not necessarily semantic) between nominals and a verb, or
between two nominals, are considered postpositions. It is worth mentioning that
Kashmiri has very few relation-representing forms that occur before nouns, e.g.
bamutaabiq Farooq (according to Farooq). Such forms could be considered
prepositions, but in the current tagset they can be classified as postpositions:
because of their negligible frequency, there is no further sub-division in this
category. Postpositions proper are far more frequent, e.g. sund (of), sI:t’
(with), khA:trI (for), etc. Fig.7 shows the frequency of pre/postpositions in
dataset-4.

[Bar chart; x-axis: PP_PSP; y-axis: frequency, up to 700]

Figure.7 Type Frequency of Postposition


viii. Conjunction (CC)

Conjunctions are closed-class items, or function words, that conjoin two or more
lexical items, phrases or clauses. In the current tagset, conjunctions have been
introduced as a top-level category with two sub-types: coordinators (CCD) and
subordinators (CCS). If the conjoining operation is symmetrical, the conjunction
is a coordinator; if it is asymmetrical, the conjunction is a subordinator.
Coordinators form compound sentences, while subordinators form complex
sentences. In the former, the constituent clauses are symmetrical (both are
independent in nature), while in the latter they are asymmetrical (one is the
principal, independent or matrix clause and the other, introduced by the
subordinator, is the subordinate, dependent or embedded clause).

Since conjunctions are indeclinable in nature, they were classified under
particles in the ILMT and ILPOSTS tagsets as well as in the LDCIL tagsets, in
keeping with the definition of particle. However, given their key syntactic
functions, which set them apart from other particles, they have been introduced
as a top-level category in the BIS tagset and are tagged as CC, e.g. tI (and),
zi (that), etc. It is worth mentioning that this decision may be helpful in the
conversion of a dependency treebank into a phrase-structure treebank. The CCs
extracted from dataset-4 are given in Appendix-II. Fig.8 shows the frequency
distribution of the subcategories of conjunction and reveals which subcategory
is the most frequent in Kashmiri.

[Bar chart; x-axis: CC_CCD, CC_CCS; y-axis: frequency, 0 to 250]

Figure.8 Subtype Frequency of Conjunction

ix. Particle (RP)

Particles are closed-class items, or function words, which are generally
indeclinable in nature and have the least significance in a construction.
Particle constitutes a top-level category in the current tagset and has Default
(RPD), Intensifier (INTF), Interjection (INJ) and Negation (NEG) as its
sub-types. There is an elaborate list of particles (Emphatic, Similative,
Dedative, Inclusive, Exclusive, etc.) which have been assigned a single
underspecified label, ‘default’, given that their finer distinction is not very
significant at this level. Particles generally have a limited syntactic function
but encode key semantic and pragmatic information, e.g. seThaa (very), na (no),
hata (hey), etc. Fig.9 shows the frequency distribution of the subcategories of
particle and reveals which subcategory is the most frequent in Kashmiri.

[Bar chart; x-axis: RP_INJ, RP_INTF, RP_NEG, RP_RPD; y-axis: frequency, 0 to 100]

Figure.9 Subtype Frequency of Particles

x. Quantifier (QT)

Quantifiers are also closed-class items, or function words, which quantify
nominals. Quantifier is a top-level category in the current tagset, with General
(QTF), Cardinal (QTC) and Ordinal (QTO) as sub-types, e.g. akh (one), pI:ntsin
(fifth), vaariyaa (lot), etc. General quantifiers are non-numeric quantifiers
that show highness or lowness in the quantum of countable nouns, or simply show
the quantity of mass nouns, whereas cardinals are numeral quantifiers that
specify the quantum of countable nouns numerically. The former are less precise
in quantification than the latter. Ordinals, on the other hand, do not quantify
at all; rather, they specify the position of an item in a series. They modify
nominals and, like adjectives, can occur in both attributive and predicative
position. By form, however, they are generally derivatives of the numerals
(cardinals) of a language. In Kashmiri, QTF and QTO show agreement with their
phrasal heads in terms of case, like their adjective or demonstrative
counterparts. Fig.10 shows the frequency distribution of the subcategories of
quantifier and reveals which subcategory is the most frequent in Kashmiri.

[Bar chart; x-axis: QT_QTC, QT_QTF, QT_QTO; y-axis: frequency, 0 to 140]

Fig.10 Subtype Frequency of Quantifier

xi. Residual (RD)

Residual is not a POS category as such; it has been introduced as a separate
top-level category with five sub-types in the present tagset to accommodate the
remaining elements of the corpus (text) which do not fit into the scheme
discussed so far. Its sub-types are Foreign Word (RDF), Unknown Word (UNK), Echo
Word (ECH), Symbol (SYM) and Punctuation (PUNC). RDF covers words which are
written in another script, whereas UNK covers words which we do not know, are
confused about, or which apparently do not fit anywhere. UNK is thus a kind of
dumping ground for words that we are unable to classify. ECH covers partially
reduplicated non-words that play a definite grammatical role. Symbols are
neither words nor punctuation marks but elements of a text which encode certain
information about some entities and which can prove crucial for named entity
recognition. Punctuation marks are closed-class items, though not words, that
play a crucial grammatical role in organizing a discourse: they mark phrase,
clause and sentence boundaries and sometimes play the role of coordinators.
Fig.11 shows the frequency distribution of the subcategories of residual and
reveals which subcategory is the most frequent in Kashmiri.

[Bar chart; x-axis: RD_ECH, RD_PUNC, RD_UNK; y-axis: frequency, 0 to 350]

Figure.11 Subtype Frequency of Residual

6. Requirements for POS Tagging

There are two main requirements for POS annotation besides the availability of

corpus and POS tagset. These include an annotation interface and the storage

format which are elaborated below:

6.1. POS Annotation Interface

The best way to perform consistent and error-free POS annotation is to use a
specialized, user-friendly interface designed for this purpose. Many POS
annotation interfaces have been developed in India, e.g. the one developed by
MSRI and another developed by LDCIL, but these are POS annotation interfaces
only: no other level of annotation can be carried out with them, and no link can
be maintained between two or more levels. Since the current POS annotation is an
integral part of KashTreeBank, as its first level of annotation, we need a
specialized platform that can consolidate all levels of annotation in a single
format. One such platform is Sanchay71, which has been developed by writing
approximately 300,000 lines of Java code over many years. Sanchay is a
collection of tools and APIs (Application Programming Interfaces) for various
language processing purposes (Singh 2006). It is an open-source platform for
carrying out various NLP tasks for South Asian Languages (SALs). So far, it has
been extensively used for Indian Languages (ILs) at various NLP research labs
for various research projects. The background of Sanchay has been summarized as
follows:

“It has already been used for the creation of POS tagged corpora for several
Indian languages. In fact, the beginning of treebank creation work in India
coincides with that of the beginning of the development of this interface and
much of the treebank annotation work for Indian languages has been
accomplished on various versions of this interface.” Singh (2011)

71 http://sanchay.co.in

The Sanchay Syntactic Annotation (SA) interface, as shown in Fig.12.A &
Fig.12.B, is a specialized interface for syntactic annotation, but it has been
generalized for various kinds of annotation: morphological annotation, POS
tagging, chunking, PSG annotation, dependency annotation, named entity
annotation and PropBank annotation. It was first developed when the preparations
for creating a Hindi treebank were started at LTRC. In fact, the work on
developing the whole platform started with this interface; as pointed out by A.
K. Singh, the developer of Sanchay, “It was not just the first annotation
interface, but also the first graphical user interface in Sanchay” (ibid 2011).

The same interface with the same mechanisms can be used for these

different kinds of annotations. This is made possible by a data representation that

is in terms of threaded trees with feature structures (multiple and/or nested). The

different threads in the base tree allow different layers of annotation (for details

see Singh, 2011).

Sanchay is a generalized platform which needs customization before it can work
for a language that has not yet been included. The customization related to
encoding and font had already been done, but the same remained to be done for
the tagset. The BIS tagset is a recent development, and Sanchay had been
customized for the previous tagsets only; for the current work, it needed to be
customized for the BIS scheme. The properties files (pos-tags.txt,
pos-tags-ben.txt, non-terminals.txt) located in the directory
<Sanchay/workspace/syn-annotation>, which contain the lists of tags for POS
tagging and chunking, need to be customized. These plain text files contain
simple listings of tags in alphabetical order, with one tag per line. In these
files, the ILMT tags were simply replaced with the BIS tags. The tags have been
sorted alphabetically so that Method-3 (typing the first letter of the tag; see
Section 7) can be employed conveniently for POS tagging.
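A tag listing of the kind described (plain text, one tag per line, alphabetical order) can be produced with a few lines of code. The sketch below is only illustrative: the function name write_tag_list is hypothetical, the tags are a small subset of Table.1, and the file is written to a temporary directory rather than to the Sanchay workspace. Whether Sanchay expects anything beyond this simple listing is not confirmed here.

```python
from pathlib import Path
import tempfile


def write_tag_list(tags, path):
    """Write the tags to a UTF-8 text file: alphabetical order, one tag per line."""
    unique_sorted = sorted(set(tags))
    Path(path).write_text("\n".join(unique_sorted) + "\n", encoding="utf-8")
    return unique_sorted


# Demonstration in a temporary directory (assumed location, for illustration).
with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "pos-tags.txt"
    written = write_tag_list(["V_VM", "N_NN", "CC_CCD", "N_NN"], out_file)
    lines_on_disk = out_file.read_text(encoding="utf-8").splitlines()

print(lines_on_disk)
```

Removing duplicates before sorting keeps the listing stable when tags from an older tagset are merged in by hand.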

6.2. Storage Format

The data has to be converted into the storage format before the actual POS
tagging can start, as mentioned in Chapter one. The format encodes the
threaded-tree representation, which allows multiple layers of annotation to be
stored in a single structure, or a single file, that is either readable by
various algorithms or convertible into a format that is. The default format
which Sanchay uses for the storage and linking of various levels of grammatical
information is called SSF72 (for details, see section-3.2, Chapter one).
However, the interface also supports XML and several other formats that are
commonly used for computational purposes such as preparing input data for
Machine Learning tools.
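The four-column layout of SSF mentioned in this section can be pictured with a small sketch. The column order assumed here (address, token, tag, feature structure, separated by tabs) is a simplification and an assumption of this sketch; the sentence wrappers and chunk brackets of actual SSF files are omitted, and the function name to_four_columns is hypothetical.

```python
def to_four_columns(tokens, tags=None):
    """Return one tab-separated row per token: address, token, tag, features.

    Tags default to empty strings (untagged text); the feature column is
    left empty in this sketch.
    """
    if tags is None:
        tags = [""] * len(tokens)
    if len(tags) != len(tokens):
        raise ValueError("tokens and tags must have the same length")
    rows = []
    for address, (token, tag) in enumerate(zip(tokens, tags), start=1):
        rows.append("\t".join([str(address), token, tag, ""]))
    return rows


# Toy two-word input for illustration (not a corpus sentence).
untagged_rows = to_four_columns(["yi", "kitaab"])
tagged_rows = to_four_columns(["yi", "kitaab"], ["DM_DMD", "N_NN"])
print("\n".join(tagged_rows))
```

Because each token keeps its own row and address, later layers of annotation can refer to tokens without altering the text itself.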

To convert the corpus into SSF, the text first needs to be split into separate
sentences so that each sentence occupies a separate line. This was done in MS
Word by using a special case of the “Find and Replace” (CTRL+H) option, in which
the sentence delimiters (۔ and ؟) were replaced with paragraph markers (^p),
giving a one-sentence-per-line arrangement. Secondly, the doc-files need to be
converted to plain txt-files by saving the content in a plain text editor
(Notepad) with UTF-8 encoding. Finally, the resultant text file needs to be
opened in the SA-interface of Sanchay and then saved there. Clicking on the save
button automatically converts the raw text into four-column SSF, provided the
sentences are arranged one per line.
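The manual Find-and-Replace step could also be scripted. The following is a minimal sketch and not the procedure actually used in this work: it splits a text on the two sentence delimiters named above (the full stop ۔, U+06D4, and the question mark ؟, U+061F). The function name one_sentence_per_line is hypothetical, and keeping the delimiter at the end of each sentence is an assumption, controlled by the keep_delimiter parameter.

```python
import re

# Sentence delimiters: Arabic full stop (U+06D4) and Arabic question mark (U+061F).
DELIMITERS = "\u06D4\u061F"


def one_sentence_per_line(text, keep_delimiter=True):
    """Return the text with one sentence per line, split on the delimiters."""
    # The capturing group makes re.split return the delimiters as well.
    parts = re.split("([" + DELIMITERS + "])", text)
    sentences = []
    for i in range(0, len(parts), 2):
        body = parts[i].strip()
        delimiter = parts[i + 1] if i + 1 < len(parts) else ""
        if not body:
            continue  # skip empty stretches between consecutive delimiters
        sentences.append(body + (delimiter if keep_delimiter else ""))
    return "\n".join(sentences)


# Placeholder letters stand in for real sentences.
sample = "A\u06D4 B\u061F C"
with_delims = one_sentence_per_line(sample)
without_delims = one_sentence_per_line(sample, keep_delimiter=False)
print(with_delims)
```

The result can be saved as a UTF-8 text file and opened in the annotation interface in the same way as the manually prepared file.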

The following screenshots, Figures 12.a, 12.b and 12.c, depict the step-wise
opening of Sanchay as well as of the corpus file
<kashmiri_treebank_IASNLP_286.txt>.

Step-I: Open the Sanchay folder and double-click on the Sanchay.bat file, an
executable file which, when run, opens the Sanchay Shell, as shown in Figures
12.a & 12.b.

72 http://shakti.iiit.ac.in


Figure.12.a. Opening of Sanchay Shell

Figure.12.b. Sanchay Shell with Multiple API Tabs

Step-II: Clicking on the SA-button in the Sanchay Shell opens the SA-interface,
as shown in Figure.12.c. This is the only API needed in the entire course of
this work.


Figure.12.c. Sanchay Syntactic Annotation (SA) Interface

Step-III: Clicking on the Open button in the SA-interface opens a small
Browsing-window, in which the Browse button is used to browse for the required
task-file. In this window, one needs to set the language in the language
drop-down list and the encoding in the encoding drop-down list, as shown below
in Figure.12.d.

Figure.12.d. SA-Interface with Browsing Window

Step-IV: Clicking on the Browse button opens a small Open-window, which lists
the files in a particular directory that are in text format and can be opened.
One needs to select the required file by clicking on it and then click on the
Open-button, so that the path of the file is selected in the Browsing-window.
This is shown in Figures 12.e & 12.f.

Figure.12.e. SA-Interface Showing Browsing Window

Step-V: The path (C:\ Users\ Shanu\ Desktop\ Sanchay-16-02-11\

KashDTreeBabk_03-Aug 2013\ 1.kashmiri_treebank_IASNLP_286.tx) of the

required file is selected as shown in the Figure 12.f. The selected file will open as

soon as one clicks on the OK-button. The file will open in the interface in a way

so that only one sentence is displayed at a time as shown in the Figure.13.a.

Figure.12.f. SA-Interface Showing Annotation Task-Setup


Figure.13.a. SA-Interface Showing a Sentence (Before POS Tagging)

7. POS Tagging of KashCorpus

POS tagging can be performed in this SA-interface by four methods, which differ
in terms of ease of use. As shown in Figure.13.a, one sentence is displayed at a
time in vertical order, so that the first word of the horizontal configuration
(right-to-left or left-to-right) corresponds to the topmost word in the vertical
configuration. In SSF, each word is represented by a node. Once a node has been
selected, either by clicking on it or by moving the cursor with the keyboard, it
can be tagged by one of the following methods: (i) by selecting the tag from a
drop-down list, as shown in Figure.13.b; (ii) by right-clicking to get a context
menu, then selecting ‘Node Name’ from the sub-menu and then selecting the tag
from the sub-sub-menu; (iii) by typing the first letter of the tag on the
keyboard one or more times; or (iv) by clicking on a button that has the
intended tag as its label.
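Method (iii) can be pictured as cycling through the alphabetically sorted tags that begin with the typed letter. The sketch below models that reading only; that repeated key presses cycle in exactly this way is an assumption about the interface, and the function name tag_for_keypresses is hypothetical.

```python
def tag_for_keypresses(sorted_tags, letter, presses):
    """Return the tag reached after pressing `letter` `presses` times.

    Assumed behaviour: each press moves to the next tag starting with the
    letter, wrapping around to the first one after the last.
    """
    candidates = [t for t in sorted_tags if t.upper().startswith(letter.upper())]
    if not candidates or presses < 1:
        return None
    return candidates[(presses - 1) % len(candidates)]


# Small alphabetically sorted subset of the tagset, for illustration.
tag_list = sorted(["V_VM", "N_NST", "N_NN", "V_VAUX", "N_NNP"])
print(tag_for_keypresses(tag_list, "n", 2))
```

Under this reading, an alphabetical listing makes the number of key presses needed for a given tag predictable, which would explain why the tag files were sorted.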


Figure.13.b. SA-Interface Showing a Method of POS Tagging

For POS tagging, 812 sentences of KashCorpus were taken and converted into SSF:
226 sentences from the newspaper domain, 286 sentences from short stories and
300 sentences from literary criticism. All 812 sentences were tagged with POS
tags in four phases. Twenty-nine POS tags were assigned across the three
domains, which were divided into four data sets. The results are given in the
next section.


Fig.13.c. SA Interface Showing a POS Annotated Sentence

8. POS Tagging Issues

The POS annotation of the four samples of the Kashmiri corpus raised various
linguistic issues and helped in understanding and solving them. The main issues
are discussed below, and their solutions are given in the form of annotation
guidelines. Statistical information about the various POS categories is a
byproduct of this work and is also given below.

There are some general decisions that need to be taken at the time of designing
a tagset: whether the tagset should be flat or hierarchical, fine-grained or
coarse-grained, form-based or function-based, etc. Only after deciding upon
these dualities can one proceed with the customization of the tagset for a
particular language. However, not everything can be decided at the time of
customization, and certain things cannot be decided categorically, in a binary
yes-no manner, at all. Therefore, some things need to be decided at the time of
the actual corpus annotation, and the decisions need to be documented to form
what is called an annotation guideline. One must keep in mind that the decisions
taken to solve some issues may not be theoretically appealing; they may be mere
shallow ad hoc solutions, meant either to postpone the immediate problem to the
next level or to provide the best possible solution that prevents further
problems. The issues that have been raised and addressed at this level of corpus
annotation are classified under the following headings:

8.1. Fuzzy Items in Complex predicates (FI)

POS categories are hardly like the elements of the periodic table, which always
retain their unique identity. They lose their grammatical identity, i.e. their
morpho-syntactic features, in certain contexts, either due to a neighborhood
effect or due to grammaticalisation. For instance, in some complex predicates
(see Butt for explanation), it is hard to decide upon the grammatical category
of the words other than the light verb (in V2 or V-final position), as in the
examples kor dafah, kor hA:sil, kor pA:dI, darguzar korun, kor tabdiil, kor
fanah, etc. In these examples the words other than the light verb (dafah,
hA:sil, pA:dI, etc.) are most likely to be either adjectives or nouns, although
their nominal features like number, gender and case have been bleached;
nevertheless, they are not as clear as the corresponding words in the following
complex predicates: gov khosh, tuj’ dav, dits kreykh, nyuv kheyth, etc.

In the latter set, it is easy to decide whether these words are adjectives,
nouns or something else, as they clearly retain the nominal features either at
the morphological level or at the semantic level. Here, khosh is an adjective
(mas/fem, agree), dav is a noun (fem), kreykh is a noun (fem) and kheyth is a
verb (participle).

8.2. Zwitter Ion of Natural Language (ZI)

The term Zwitter Ion has been taken from chemistry to illustrate the dual nature
of gerunds. Usually, chemical particles are either positively or negatively
charged at a given instant, but Zwitter Ions, unlike other particles, are of a
dual nature and carry both positive and negative charges simultaneously. By
analogy, nouniness and verbiness are two polar opposites like positive and
negative charges: if a word tends towards being a noun, its verbal properties
have declined, and vice versa. The gerund is the only class of words that
simultaneously retains nominal as well as verbal properties. On the one hand,
gerunds take case markers and function like nominals; on the other hand, they
retain their predicate-argument structure like a typical verb.

Now the question arises: how should a gerund be tagged? Should its form or its
function be taken into consideration in order to classify it? If form is taken
into account, it is a verb, though a nonfinite one; if function is taken into
consideration, it is a noun. It should be noted that in ILPOSTS it was placed
under the category Noun as Verbal Noun, perhaps because the focus was on
function, but in ILMT and subsequently in BIS it has been placed under the
category Verb as gerund, given that by form it is a verb and its
predicate-argument structure frame remains intact, though it can never be
inflected for other typical verbal features like tense, aspect, mood or voice,
e.g. kheyn-I sI:t’, vandn-I kin’, cheyn-as peyTh, marn-an, etc.

Here, on the one hand, kheyn-I, vandn-I and cheyn-as are gerunds in the oblique
form, followed by a postposition like nominals, whereas the gerund marn-an is
not in the oblique form but is inflected with a case marker (-an). On the other
hand, (transitive) gerunds like kheyn-I and cheyn-as can also take their
arguments, as in batI kheyn-I sI:t’ or chai cheyn-as peyTh.

8.3. izaafat Constructions (IC)

These constructions are multiword expressions borrowed from Persian, like
compounds and named entities but with a more coherent internal structure.
Usually, two nouns, or a noun and an adjective, are combined by means of a
marker called “izaafe” to form an izaafat construction. The izaafe behaves like
a genitive in Urdu, but in Persian it behaves more like a linker (see Butt). In
Kashmiri, however, the construction seems to behave more like a compound with a
less conspicuous internal structure, e.g. aab-i hayaat, vaziir-i aazam, hoquuq-o
frA:yiz, dast-i shafaa, habiibi paakh, hoquumat-i hind, shariiq-i hayaat, etc.

The diacritic marker that represents izaafe in Kashmiri is mostly zer (ہ),
unlike in Persian and Urdu, where hamzah (ء), vaav (و) and badii ye (ے) also
represent izaafe. As noted, izaafat constructions are either NN-NN combinations
or NN-JJ combinations. In NN-NN combinations, the diacritic is on the first
element (NN), but it seems to belong to the second element (NN) when the
construction is simplified (nativized) for interpretation. The Kashmiri
newspaper corpus is replete with such expressions. Given the writing conventions
of Arabic, i.e. the omission of diacritics, and their influence on Urdu and
thereby on Kashmiri writing, such markers may or may not be present in the
written expressions, but they are intact in the spoken forms.

Here, the problem is how to tag the two constituent words of an izaafat
construction. Should the words be tagged separately (like aabi/NN Hayaat/NN), or
should the constituents be joined and then tagged together as a unit (like
aabi-hayaat/NN)? In the first case, there will be less clarity in determining
the POS category of the first word, which is marked with izaafe.

8.4. Identification of Proper Nouns (PNI)

As such, the noun as a POS category doesn’t pose any problem, but the inclusion

of common noun, proper noun and Nloc as subcategories has proved confusing

and thus the noun came to be the most debatable category in the tagset. At times,

it becomes very difficult to distinguish between NN and NNP by relying on the

traditional notions. For instance, mevIh (fruit) is not referring to any specific fruit

or is not the name of any fruit, hence, it is NN. Then, by this logic, amb (mango)

or tsuunTh (apple), names of specific fruits, should be NNP, but these are

considered NN. In order to address this issue properly, one needs to go by

some concrete standards that can be generalized with least exceptions. Therefore,

it has been posited that NN is the noun that denotes a class of things, concrete or


abstract (set or sub-set), whereas NNP denotes an instance of a class (member of

set or sub-set), e.g. mango is a name but of a class of different varieties or

instances like Alphanso, Baadaamii, etc. Similarly, different varieties of apples

like Amriican, Chomuuriyah, Deylshas, etc are instances of the class apple, hence,

‘Chomuuriyah’ is NNP and ‘tsuunTh’ is NN. This position solves the problem to

some extent but raises other questions, such as whether zuun (moon) is NN or NNP,

given that other planets also have moons with specific names, and hence

zuun is a class, not an instance. Likewise, in the above examples, Alphanso

mangoes or Chomuuriya apples are also names of classes rather than

particular instances. By the same logic, it can be said that the Alphanso mango tree is

also a class of different Alphanso mango trees and not an instance. Actually,

determining the status of a thing as a class or an instance is a very tough ontological

problem. Looking from the top to the bottom of an ontological tree, an

object like the Alphanso mango seems to be a class, but looking at the same object from the

bottom to the top, it seems to be an instance. It is hard to

determine where one should stop dividing a class into subclasses, sub-subclasses,

etc. in order to take one level as an instance. The problem of indeterminacy

comes to the fore as soon as the definiteness issue creeps into the already vexed

problem of looking for instances. One can ask whether the notion of

NNP incorporates the notion of definiteness or is independent of it, e.g. a person

name, “Umar” is no doubt an NNP but is not definite as the “Umar” can be Umar

Farooq (hurriyat leader), Umar Abdullah (CM), Baba Umar (Journalist), Umar

ibni Khataab (Second Khaliifah), Umar Gull (cricketer) or any other Umar.

Hence, the person names can be themselves a class (indefinite) rather than an

instance (definite). In order to ease out the problem, one can keep the definiteness

at bay from the notion of an instance while looking for ‘instances’ within a class.

Only then, one may be able to distinguish between NN and NNP otherwise the

status of person names or some place names as NNP can be objectionable.

Nevertheless, one can propose various diagnostic features, as given below, to help

in determining whether a noun is NN, NNP or NST.

1. If a noun can’t be pluralized and quantified it is likely to be NNP.

2. If there is room for asking a question about the thing under consideration, such as

‘which thing?’, then the thing is likely to be NN; if there is no room for asking

such a question, the thing is likely to have NNP as its POS category, e.g. aaftaab


(sun) is a specific instance of stars and hence, NNP. There is no room for asking

the question, which sun?
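The two diagnostics above can be read as a simple voting heuristic. The following is only a minimal sketch under stated assumptions: the function name and the three boolean inputs are hypothetical, the judgments themselves must come from the annotator, and diagnostic 1 is interpreted as "can be neither pluralized nor quantified".

```python
def likely_noun_tag(can_pluralize: bool, can_quantify: bool,
                    admits_which_question: bool) -> str:
    """Combine the two diagnostics into a tentative NN/NNP decision.

    All three inputs are annotator judgments (hypothetical parameters,
    not part of any existing tool).
    """
    votes_for_nnp = 0
    # Diagnostic 1: a noun that can be neither pluralized nor quantified
    # is likely to be NNP.
    if not can_pluralize and not can_quantify:
        votes_for_nnp += 1
    # Diagnostic 2: if there is no room for asking 'which thing?',
    # the noun is likely to be NNP.
    if not admits_which_question:
        votes_for_nnp += 1

    if votes_for_nnp == 2:
        return "NNP"
    if votes_for_nnp == 0:
        return "NN"
    # The diagnostics disagree, so the case is left to the annotator.
    return "UNDECIDED"


# aaftaab (sun): no room for 'which sun?' and, by assumption here,
# neither pluralized nor quantified.
print(likely_noun_tag(False, False, False))
# tsuunTh (apple): assumed to pass all three tests.
print(likely_noun_tag(True, True, True))
```

Returning "UNDECIDED" when the diagnostics disagree is a design choice of this sketch; it reflects the fact that the diagnostics only indicate what a noun is likely to be.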

8.5. Named Entities (NEs)

As the name itself suggests, NEs include the names of companies, institutions,

persons, places and things which are multiword in nature. For example: vaziir-i

aazam manmohan singh, islamic university of science and technology, Microsoft

India Private Limited, etc.

The problem with named entities is that they form long chains of words which in

isolation refer to nothing specific but as a whole refer to specific entities. Therefore,

as a whole they are multiword proper nouns though their constituent words can be

of any category. Also, izaafat constructions can be their constituent elements as in

vaziir-i aazam manmohan singh.

The question arises how to tag them at this level. There are two options: one is to

tag each constituent word with its respective POS category, and the second

option is to tag all the constituent words with the same tag used for proper nouns,

because as a whole they are proper nouns.

8.6. Compound Words (CW)

Compound words are also problematic multiword expressions, but they are

comparatively simpler than izaafat constructions and named entities in that the

number of constituent words can’t exceed two, unlike named entities,

and there is no internal linker in them, unlike izaafe. However, they have their own

complexities. They can be endocentric with compositional meaning or exocentric

with non-compositional meaning. Like the outward drift in the meaning of

exocentric compounds, there can be also heterogeneity in their POS

compositionality, i.e. words of two different POS categories can form a new word

which may or may not have the category of one of its constituent words. For

example: Akis/QT akh/QT (pronoun), pA:n’/?? paanai/PR (pronoun), shinI/??

baal/N (noun), gA:r/JJ zimIdaar/JJ (adjective), kheyn/V cheyn/V (noun), khosh/JJ

nasiib/N (adjective), zorI/RB zorI/RB (adverb), As’/?? As’/?? (adverb), heyokun/V

kheyth/V (verb), etc.

Of all the compounds, compound nouns and verbs are far more productive in

Kashmiri and pose more challenges to the annotator. In some compounds as

shown above, the constituent words without POS tags (with ??) are intuitively

difficult to classify as their original form has been changed and reduced to a sort


of bound form (like pA:n’/?? & As’/??). However, some words (like shinI/??)

seem to have assumed the oblique form, the form which a noun assumes under the

influence of following postpositions as in (shinI/N peyTh’/PSP), “shiin” changes

to “shinI”. Therefore, such forms have independent existence unlike (pA:n’/?? &

As’) whose existence is bound to these contexts (compounds) only & do not exist

outside such contexts. Such forms have been classified and tagged on the basis of

their original form vis-à-vis category.

In addition to such problems, compounds in general are like multiword

expressions and therefore it is important to consider whether the constituent

words of a compound should be joined together by some convention, e.g. a dash

(-), to form a single token or kept as such (two tokens) without

joining. If the former approach is followed, they need to be tagged as a whole unit,

which will ignore the category of their constituent words. However, if one follows

the latter approach, they need to be tagged separately as per their respective

categories, which will ignore the category of the entire compound. It is also

important to consider whether the POS information of the individual

constituent words of a compound is more important at this level of annotation or

the POS information of the entire compound. However, if one thinks both are

important, then it must also be kept in mind whether both can be captured at

this level.

8.7. Numeric Dates (NDs)

It has been observed that there occur various instances of dates in the corpus.

They are also like named entities. As they represent particular points in time, it is

quite possible to label them as Nloc but it is a debatable issue whether to classify

them under Nloc or not. A date is the name of a particular point in time, like the

name of a place or a person, whereas the typical temporal Nloc (adverbs of time)

are unnamed. Therefore, dates are classified under proper nouns and not under

Nloc. However, problem arises when they are followed by a case marker which

occurs as separate token, e.g. 16 January 1950 has, 1847 huk, 1947 yas (۱۹۴۷ ، چ777س۱۹۵٠ ہ777ک،۱۹۴۷یس ،). In these examples, -has, -huk and -yas are

basically bound forms but can’t be attached with numerals naturally. Similarly,

sometimes the dates, e.g. 1850 (۱۸۵٠), occur with the symbol (ء) for iisvi (AD) and

the case-marker (یس) yas, e.g. in 1850 iisvi yas (۱۸۵٠ ء یس) manz. It has been


also observed that there are some occurrences of the dates in the corpus where the

initial and final dates representing a period of time are kept in brackets and the

case marker occurs outside the brackets, e.g. (چ777س تام ۱۹۴۷- ۱۹۵٠( or (1842-

1857) as taam. Such cases, though typically a tokenization problem, have been

handled at the POS level, as they were left as such at the time of tokenization

and came to the fore only during the annotation process.
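A minimal sketch of how a stranded case marker could be re-attached to a numeric date with a dash, in line with the joining convention adopted in guideline iv of Section 9. The function name is hypothetical, the marker set contains only the three forms cited in this section and is not exhaustive, and a date is assumed to be a bare four-digit year; bracketed ranges and the ء symbol are not handled.

```python
import re

# Case markers cited in this section; an illustrative, non-exhaustive set.
CASE_MARKERS = {"has", "huk", "yas"}

# Assumption: a numeric date token is a bare four-digit year.
YEAR = re.compile(r"\d{4}")


def join_date_markers(tokens):
    """Join a year token with a following case marker by a dash."""
    joined = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if YEAR.fullmatch(token) and following in CASE_MARKERS:
            # The bound form is attached to the date: 1845 has -> 1845-has
            joined.append(f"{token}-{following}")
            i += 2
        else:
            joined.append(token)
            i += 1
    return joined


print(join_date_markers(["1845", "has", "manz"]))
print(join_date_markers(["16", "January", "1950", "has"]))
```

The joined token can then receive a single NNP tag, while the following postposition keeps its own tag.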

8.8. Underspecified Verbs (UVs)

As aforementioned in the subsection 2.2.6 (d), the fine grained sub-classification

of verb has been avoided given the fact that fine-grained sub-classification is

based on the notion of finiteness, i.e. finite verb, non-finite verb and infinite verb

and the notion itself is controversial at deeper level. It is not only tense that

contributes to finiteness; sometimes aspect, mood and agreement also determine

finiteness. The morpho-syntactic information that constitutes finiteness is usually

distributed in two or three verb tokens (auxiliary and main verbs). For instance, in

the example su chu batI khevaan (he is eating rice), the auxiliary verb (chu)

carries tense (present) & PNG agreement (mas.SG.3rd), and the main verb

(khevaan) carries aspectual information (progressive) in addition to the lexical

semantics. In such sentences, it would be absurd to say that the auxiliary verb

(chu) is finite and the main verb (khevaan) is nonfinite. If only the tense

determines finiteness then all –tense verbs are nonfinite. Then, a very important

question arises: are perfective sentences like “arshidan kheyo

batI” (arshid ate rice) and imperative sentences like “tsI khe batI” (you eat rice)

basically nonfinite clauses? Isn’t it that only de-verbal verb-forms like the

participles or gerunds are basically non-finite?

It is obvious that the main verb in the sentence su chu batI khevaan (he is eating

rice) is also finite despite the fact that it doesn’t carry the tense (-tense) or

PNG information (-PNG). It is verbal in nature and plays a key role in the

sentence by providing the lexical semantics of the action and its aspectual

information, unlike the nonfinite verbs (such as its khey-th form) which are de-

verbal in nature and, thus, play a marginal (modifying) role in the sentences in

which they occur. Therefore, finiteness needs to be determined by taking into

account all the verb tokens (except de-verbal ones) of a sentence, irrespective of

whether the verb tokens are contiguous or non-contiguous in relation to each

other. Keeping in view this complicated nature of finiteness, verb classification


has been kept underspecified at this stage and only two types (main and auxiliary)

have been posited, to avoid resolving the finiteness puzzle at this stage and to

postpone it to the next level of annotation, i.e. chunking level.

8.9. Non-manner Adverbs (NMVs)

The notion of adverb has been simplified by restricting it to only manner adverbs

and putting time and location adverbs under noun as Nloc. However, there are many

words which seem to be adverbs but are neither manner nor locative adverbs.

Since only the manner adverb has been posited in the current POS tagset, the label

needs to be neutralized & expanded to accommodate both manner and non-

manner adverbs, the latter signifying reason, frequency or some quantification, as well as

sentential adverbs, e.g. kyaazi (why), beyi (again), dohdish/ dohai/ rozaanI

(everyday), hameshI (forever), zyadI (more), vaariya (lot), kam (less), shaayad

(perhaps), yaqiinan (surely), lA:ziman (necessarily), etc.

The rationale to include reason, frequency, unique quantification and sentential

adverbs in adverbs is well grounded. The reason word kyaazi (why) modifies

whole clause like the sentential modifiers; shaayad (perhaps), yaqiinan (surely),

lA:ziman (necessarily). Frequency words like dohdish/ dohai/ rozaanI (everyday)

& hameshI (forever) sound like manner adverbs, sort of temporal manner. It is not

that all quantifiers modify verbs but surely some are verb modifiers. Such role of

some quantifiers is more evident when they are used with intransitive verbs, e.g.

zyadI (more) in the sentence, su shong zyadI (slept more); kam (less) in the

sentence, kam osun (s/he laughed less), vaariya (lot) in the sentence vaariya

kheyvun (s/he ate lot), etc.

Another problem related to adverbs is that of being multiword like compounds,

though it is far from compounding. It mostly arises out of writing conventions and

can be handled like other multiword expressions or taken care of at the time of

tokenization. For instance, certain adverbs are composed of two tokens, of which

the first is an adjective or noun and the second mostly a postposition, e.g. Thiikh

pA:Th’ (nicely), khOsh pA:Th’ (happily), vaarI pA:Th’ (safely), dor pA:Th’

(strongly), khushii saan (with happiness), etc.

Not all multiword, or rather multi-token, adverbs are problematic for this level of

annotation, as they can be handled like any other multiword expression, but some,

in which the POS status of the constituent tokens is not clear, are really

challenging, e.g. the status of pA:Th’ in the expressions Thiikh pA:Th’ (nicely),


khOsh pA:Th’ (happily), vaarI pA:Th’ (safely) & dor pA:Th’ (strongly), is not

clear. Although, it has been treated as postposition in some previous annotation

works, it is more likely to be a bound-form. It is, no doubt, a separate token in

corpus but intuitively speaking, it is not a word; rather, it is a part of the preceding

word and is more like an adverbial morpheme except in the instances like misaali

pA:Th’ (for example), where it is clearly a postposition, but its frequency in the

corpus is very low. The instances in which it appears to be a bound-form have

high frequency in the corpus and have not been handled at the time of

tokenization, where a bound-form is usually attached to the preceding token (see

chapter-III). The reasons due to which it has been left as such in tokenization

process are its high frequency and unclear status.

8.10. Paradox in POS Annotation

As mentioned earlier, form-function is one of the important dualities. It is very

crucial for tagset designing as well as corpus annotation. Theoretically one needs

to stick to only one aspect & to carry out the entire task of corpus annotation on

the basis of the same principle without occasionally switching to the alternative

dictum. However, practically it seems to be implausible as the annotations are not

carried out in isolation just for the sake of annotation but the product of

annotation needs to be used for some bigger task ahead and thus, one can’t ignore

the formal aspect of a word and focus entirely on its functional aspect or vice-

versa as demanded by theory. Somehow, both the aspects need to be taken into

account, and one needs to weigh the use of one aspect or the other in a particular

task, along with its use in the tasks ahead, so that a particular aspect can be

ignored if not very important. For instance, on the one hand, demonstratives

wouldn’t have been a POS category if only the formal approach had been

taken into account, as by form demonstratives are actually pronouns but by

function they are demonstratives, e.g. the word su (he) is a pronoun in the sentence

su aav (he came) but the same word is a demonstrative in the sentence su shur aav

(that kid came). Similarly, on the other hand, gerunds wouldn’t have been verbs

but nouns if their formal aspect had been ignored and only the functional

aspect had been taken into account. Therefore, it seems contradictory that

at one point, for positing demonstrative as a POS category, the functional aspect of a

word has been taken into account, but to posit gerunds as a subclass of verbs the

same functional approach has been defied. It is important to mention that such


decisions are a matter of expertise and experience, and one need not follow theory

strictly where doing so would undermine the goal of the task in hand. In the present

task, i.e. POS annotation, the goal is to lay down the foundation of a dependency

treebank which can be further augmented with anaphoric information or other

discourse level information. Thus, this dual or hybrid approach to corpus

annotation is justified. Nevertheless, it can be said that opting for hybridity in corpus

annotation under the influence of some practical usage is indeed paradoxical, as

capturing the functional aspect of words in the corpus is an alternative way of

looking at data and thus, the essence of corpus linguistics vis-à-vis corpus

annotation.
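The functional treatment of pronominal forms, as in su aav versus su shur aav, is stated as a context rule in guideline vi of Section 9. A minimal sketch of that rule follows; the function name is hypothetical, and the tag names VAUX (auxiliary verb) and INTF (intensifier) are assumptions of this sketch, since the text does not give tags for those two categories.

```python
# Tags of the following word that signal a pronoun reading.
# VM and PSP appear in the text; VAUX is an assumed tag name.
PRONOUN_CONTEXT = {"VM", "VAUX", "PSP"}

# Tags of the following word that signal a demonstrative reading.
# QT, JJ and NN appear in the text; INTF is an assumed tag name.
DEMONSTRATIVE_CONTEXT = {"INTF", "QT", "JJ", "NN"}


def tag_pronominal(next_tag):
    """Choose PRP or DMD for a pronominal form from the next word's tag."""
    if next_tag in DEMONSTRATIVE_CONTEXT:
        return "DMD"
    if next_tag in PRONOUN_CONTEXT:
        return "PRP"
    # Contexts not covered by the rule are left to the annotator.
    return None


# su aav (he came): su is followed by a verb.
print(tag_pronominal("VM"))
# su shur aav (that kid came): su is followed by a noun.
print(tag_pronominal("NN"))
```

Returning None for uncovered contexts keeps the sketch from guessing where the guideline is silent.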

9. Guidelines for POS Annotation

Some important guidelines that were framed and followed for POS tagging of

KashCorpus are given below:

i. All NEs are essentially proper nouns (NNPs) as they refer to specific

entities that have been named; however, the name is composed of more

than one word with different POS categories. Actually NEs are phrases

rather than words but need to be handled like words at this level. Since

NEs as a whole are NNPs, all the words composing NEs are tagged as

NNPs, e.g. the NE “vaziiri aazam manmohan singh” is tagged as

“vaziiri/NNP aazam/NNP manmohan/NNP singh/NNP” so that a chain of

NNPs is obtained which can be easily identified in the annotated corpus. This

might look like an absurd decision, and one can argue that the original POS

information is suppressed, but as mentioned earlier, it is a strategy to evade the

problem at this level and to keep track of the problem items so that they can be

handled at another intermediate level dealing with multiword expressions

(MWEs).

ii. CWs are handled slightly differently, though like NEs they too are treated

as MWEs. These are composed of only two words and mostly include

compound nouns, compound adjectives, compound adverbs

(reduplications), etc. The words which form a compound are assigned

their respective POS tags, but in specialized versions suffixed with ‘C’ to indicate

compounding, e.g. compound nouns like “shinI baal, zaril zaal, masI vaal” and

compound adjectives like “khOsh qIsmath, gA:r zoruurii” are tagged as

“shinI/NNC baal/NNC, zaril’/NNC zaal/NNC, masI/NNC vaal/NNC” and


“khOsh/JJC qIsmat/NNC, gA:r/JJC zoruurii/JJC”, respectively. The

overall POS information of the compound word has been suppressed,

unlike in the treatment of NEs, but a ‘C’ marker has been added to the tag to

make the compounds identifiable or extractable. It must be noted that

capturing the compounding information of verbs in this way has been

avoided, given that there are other complications

associated with verbs which are handled at the chunking level. It has

also been avoided in pronouns, as there are very few compound forms among them.

iii. ICs are also handled like CWs and the words which are linked together by

izaafe are tagged with their respective POS categories, ignoring the

change brought about by the izaafe in the word to which they are bound,

e.g. the ICs “aabi hayaat, khuuni jigar, habiibi paakh, hoquuqo faraayiz”

are tagged as “aabi/NNC hayaat/NNC, khuuni/NNC jigar/NNC,

habiibi/NNC paakh/JJC, hoquuqo/NNC faraayiz/NNC”. Here, information

about izaafat has been suppressed because, as in many other cases, it is not much

needed for sentence parsing and ICs behave more like compound words.

iv. NDs are actually a kind of numeral NEs and hence are tagged as NNPs, but

the other complications associated with them have been handled by

joining the bound-forms with the numeric date by a dash (for details, see the

discussion above), e.g. numeric dates like 1845 has manz, 1845 ء yas

manz, (1845 (ء yas manz, etc are tagged as 1845-has/NNP manz/PSP,

1845/NNP ء-yas/NNP manz/PSP, (/PUNC 1845/NNP ء-yas/NNP )/PUNC

manz/PSP. It should be noted that some unexpected tokenization problems,

like the one discussed in the issues above, have been

tackled at this stage as well.

v. As aforementioned, the de-verbal non-finite forms like perfective

participles (-ith-forms) such as kheyth, pArith, shongith, bihith, etc;

progressive participles (-vun-forms) such as zeyvvun, shongvun, natsvun,

asvun, gindvun, khasvun, bozvun, etc and gerundial forms such as

shongun, shongnI, shongas, shongnan, shongnuk, vothnI, natsnas, etc

have been assigned the underspecified POS tag, VM. Verbal non-finite

forms (infinitives like shongun) and verbal finite forms (main verbs) have also

been assigned the same tag, i.e. VM. It is important to mention that


no distinction has been maintained at this level and the reasons for the

same are discussed above.

vi. The pronominal forms which are followed by a verb or postposition have

been assigned the POS tag PRP, but if the same form is followed by an

intensifier, quantifier, adjective or noun, it is a demonstrative and is tagged

as DMD.

vii. Adverbs which are not essentially manner adverbs like those representing

frequency, quantification & reason (beyi, vaariyah, kyaazi, etc.) have also

been tagged as RB like the manner adverbs.

viii. In addition to traditional adverbs of time and space, some vague time-

words like subhas (in morning), shaamas (in evening), dohli (in the day)

and ‘roth’ in “roth kyuth” are also potential NSTs as long as they provide

temporal location for an action/event. But when they are inflected with

genitive marker like subhuk (of the morning), shaamuk (of the evening),

etc and cease to provide temporal location for an action/event, they cease

to be NSTs and are NNs.

ix. Usually, NSTs do not take demonstratives, are marked with locative,

ablative or terminatives, and can’t be pluralized or quantified.

x. Besides, words like “kA:shur or kashmiri” are most likely to be NNP

when used in isolation or with other words in order to refer to a language

but are likely to be NN when used in isolation to refer to people, e.g. “koshur,

PunjA:b’, BangA:l’, etc.” However, they are likely to be JJ when used

with other words, like koshur saqaafat (kashmiri culture), koshur geyav

(kashmiri ghee), etc.
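Guidelines i and ii both rely on tags that make multiword units recoverable from the annotated corpus: a chain of NNPs for named entities and the ‘C’ marker for compounds. The following is a minimal sketch of such an extraction, not the procedure actually used in this work. The function name is hypothetical, the token sequence is an illustrative concatenation of examples from the guidelines rather than a corpus sentence, and the compound tag set lists only the two ‘C’ tags shown above.

```python
from itertools import groupby

# Only the compound tags that appear in the guidelines (assumed incomplete).
COMPOUND_TAGS = {"NNC", "JJC"}


def extract_runs(tagged, is_member):
    """Return maximal runs of two or more adjacent words whose tags
    satisfy is_member, each run joined into one string."""
    runs = []
    for member, group in groupby(tagged, key=lambda pair: is_member(pair[1])):
        if member:
            words = [word for word, _ in group]
            if len(words) > 1:
                runs.append(" ".join(words))
    return runs


# Illustrative token sequence built from the guideline examples.
tagged = [
    ("vaziiri", "NNP"), ("aazam", "NNP"), ("manmohan", "NNP"),
    ("singh", "NNP"), ("shinI", "NNC"), ("baal", "NNC"), ("peyTh", "PSP"),
]

print(extract_runs(tagged, lambda tag: tag == "NNP"))
print(extract_runs(tagged, lambda tag: tag in COMPOUND_TAGS))
```

An explicit tag set is used instead of a test for a final ‘C’ because the punctuation tag PUNC also ends in that letter. Note that two named entities or two compounds standing next to each other would be merged into a single run by this sketch.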

So far, linguistic information, an outcome of the analysis-cum-annotation process,

has been discussed and put forward in the form of a small guideline for problematic

cases. Statistical information, yet another kind of

outcome of the annotation process, is given in the following section.

10. Statistical Results

As aforementioned, the data has been divided into four sets for annotation. In

each dataset, the words have been classified into eleven classes. Table.1. shows

the cumulative frequency of each POS category across all four datasets, while

Fig. 14.a shows the total quantity of each POS category in terms of percentage. It


has been prepared from the frequency Table.1. to show the contribution of each

POS category in making KashCorpus and to compare the percentage of the

categories, in order to get the most frequent and the least frequent POS categories.

[Bar chart: percentage share of each POS category (y-axis 0 to 40). N 34.689, V 18.242, PP 9.077, RD 7.934, JJ 6.434, PR 6.313, CC 6.038, RP 3.476, DM 2.945, QT 2.501, RB 2.348]

Figure.14.a. Total Quantum of POS in terms of (%)

S.No  POS Type (x)  Data Set-1 (f1)  Data Set-2 (f2)  Data Set-3 (f3)  Data Set-4 (f4)  Grand Total fx = (f1+f2+f3+f4)
1     N             953              2296             868              1042             5159
2     V             793              1045             597              278              2713
3     PP            190              665              210              285              1350
4     RD            394              345              251              190              1180
5     JJ            176              384              212              185              957
6     PR            333              234              176              196              939
7     CC            169              313              208              208              898
8     RP            207              115              99               96               517
9     DM            50               146              119              123              438
10    QT            68               183              64               57               372
11    RB            48               48               96               157              349
      Total f =     3381             5774             2900             2817             14872

Table.1. Cumulative frequencies (fx) of POS
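The percentage share of each POS category can be recomputed from the grand totals in Table.1. as fx divided by the overall total, multiplied by 100. The short sketch below does this with the figures copied from the table; the variable names are illustrative only.

```python
# Grand totals (fx) per POS category, copied from Table.1.
grand_totals = {
    "N": 5159, "V": 2713, "PP": 1350, "RD": 1180, "JJ": 957, "PR": 939,
    "CC": 898, "RP": 517, "DM": 438, "QT": 372, "RB": 349,
}

# Overall number of annotated tokens across the four datasets.
total = sum(grand_totals.values())

# Percentage share of each category: 100 * fx / total.
percentages = {tag: 100 * fx / total for tag, fx in grand_totals.items()}

for tag, pct in percentages.items():
    print(f"{tag:>2}  {grand_totals[tag]:>5}  {pct:6.3f}%")

most_frequent = max(percentages, key=percentages.get)
least_frequent = min(percentages, key=percentages.get)
print("most frequent:", most_frequent, "least frequent:", least_frequent)
```

The recomputed shares may differ from the plotted values in the third decimal place, depending on how the chart values were rounded.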

11. Summary

In this chapter, the fundamental layer of annotation, i.e. POS tagging, of

dependency treebank of Kashmiri, has been explored with reference to the four

datasets taken from KashCorpus, discussed in chapter-III. First of all, the task to

be handled in this chapter has been introduced in section.1 and then the notion of

POS tagging has been explained in the beginning of the section.2. In the same

section some important corpus annotation standards have been discussed and the

various existing POS tagsets have been reviewed briefly. Further, not only the

category wise description of Kashmiri POS tagset (used in the current work), has

been given in this section, but also the comparative statistical information about

various sub-categories involved. In section.3, at first, the prerequisites for actual

POS tagging have been discussed which include the annotation interface and a

particular data storage format. The SA-Interface of Sanchay platform has been

used for the current task and the procedure for using the same has been given in


this section, using various snapshots. The storage format called SSF has been also

discussed along with the need to rely on any such format for any annotation

pipeline. Later, the actual POS annotation has been discussed along with the

results, in the form of various linguistic issues raised and their solutions. The

solutions have been presented in the form of a mini-guideline. Finally, statistical

results like the frequency and cumulative frequency of various POS categories

have been given in the same section.

The chapter has overall explored and discussed various annotation schemes and

tools which have been found relevant to the present work and has also laid down

the foundations for building dependency treebank (KashTreeBank), using four

samples of data taken from KashCorpus. The next chapter will address two

further layers of annotation which revolve around the syntactic dependencies.

Chapter.5. Chunking of KashCorpus

Judgments are inherently unreliable because of their unavoidable meta-cognitive overtones, because grammaticality is better described as a graded quantity, and for a host of other reasons.

Edelman and Christianson (2003)

1. Introduction

Chunking is the second level of annotation in developing a dependency treebank

based on HTB guidelines (Bharati et al., 2012). It involves annotating clusters of

words based on local dependencies with predefined chunk labels. The chunk layer

encodes the intermediate level of linguistic information between the POS level

and the dependency level. In fact, it covers all those dependency relations which

dependents form with their head except with the verbal head. Although it covers

all lower level dependencies which do not belong to the argument-adjunct level,


these dependencies are not overtly labeled. However, it is very crucial for

annotation of inter-chunk dependency relations.

This chapter is mainly concerned with describing the second layer of annotation

of KashTreeBank. The second section of this chapter deals with the notion of

chunk, the third section discusses the rationale behind chunking, the fourth

section gives description of chunk tagset, section five describes the process of

manual chunking carried out with the help of Sanchay SA Interface, section six

talks about the issues that were encountered during the annotation process, section

seven presents results, both statistical as well as theoretical. Section eight presents

the guidelines and section nine summarizes the chapter. The next section

discusses the notion of chunk.

2. The Notion of Chunk

The term ‘chunk’ appears similar to the term ‘phrase’ in that both refer to a group

of words, but a chunk and a phrase differ considerably. The former is a general term

which has been widely used across various disciplines for a perceptually compact

group of entities; in linguistics, it refers to a non-recursive group of words. The

latter is purely a syntactic term which refers to constituents which are often

recursive in nature. According to Abney (1991), a chunk consists of

a single content word surrounded by a constellation of function words which

matches a fixed template, e.g. in the Kashmiri noun chunk [huth/DMD baagas/NN

manz/PP].NC (in/PP that/DMD garden/NN), the content word baagas/NN (garden) is

surrounded by the function words huth/DMD (that) and manz/PP (in).

Abney (1995) also defines a chunk as “the non-recursive core of an intra-clausal

constituent, extending from the beginning of the constituent to its head but not including

post-head dependents.” There is psychological evidence for the existence of chunks.

Gee and Grosjean (1983) observe that these are performance structures of word

clustering that emerge from a variety of types of psychological experimental data

such as pause durations in reading and naive sentence diagramming. They argued

that performance structures are best predicted by what they called φ-phrases,

which are created by breaking the input string after each syntactic head that is a

content word. They do not assign syntactic structure to chunks and assume that

pre-nominal adjectives do not qualify as syntactic heads; otherwise, phrases like a

big dog would not comprise one chunk but two. Contrary to that, Abney (1994)

argued that a chunk has syntactic structure which comprises a connected

sub-graph73 of a global parse-tree of a sentence and that the chunks are represented in

terms of major heads which are all content words except those that appear

between a function word and the content word, e.g. ‘proud’ is a major head in ‘a

man proud of his son’ but proud is not a major head in ‘the proud man’ because it

appears between the function word ‘the’ and the content word ‘man’.

However, the practical considerations in implementing a framework on the corpus

samples can lead to a variety of word-constellations that may or may not be

psychologically real chunks as discussed above. Therefore, chunks may not be

compliant with the well-known definitions of a chunk but merely ad-hoc solutions

to more practical problems, e.g. non-contiguity. Thus, a chunk is a sub-tree within

a syntactic phrase structure tree corresponding to nominal, prepositional,

adjectival, adverbial or verbal phrases (Abney, 1991, 1992, 1995) or simply a

word-group based only on local surface information, e.g. the noun group and the

verb group (Bharati et al 1995). Sometimes, even the simplest notion of chunk as

a word group may be problematic (see Bhat, 2012) while handling discontinuity.

3. Rationale for Chunking

It has been already discussed in Chapter-I and Chapter-II that dependency

relations involve asymmetrical grammatical relations, i.e. head-dependent or

modifier-modified relations between words. These relations hold at two levels,

one at chunk level74, between the words of minor POS-class (secondary

dependents) and the word of a major POS-class (secondary head), and other at

sentential level, between the secondary heads (primary dependents) and the

primary head, i.e. finite verb. The rationale behind the division of dependency

relations into two levels is that it allows incorporating the popular notion of

phrase, though crudely, and thereby permits a division of labor in order to

achieve consistency in syntactic annotation. Moreover, at the POS level the

focus was more on the form of words than on the function they perform in a

sentence. Therefore, positing the intermediate chunk level serves to regain the scope of

function, which is the key force in constructing a sentence. However, at the first

level, dependency relations between dependent words and their head word,

constituting secondary modifier-modified relations, have not been labeled

explicitly, instead the cluster of dependent words and the head word has been

73 The parse-tree of a chunk is a sub-graph of the global parse-tree (ibid, 1994).

74 It can be considered equivalent to the popular phrasal level.


annotated with the chunk tags which have been devised based on the notion of

head. Therefore, the relations can be easily predicted by head computing based on

the chunk label, e.g. in an NP chunk, N, and not JJ, DM, QT, RP or PP, will be the head.

Similarly, in RBP, RB will be the head. So there is no need to label relations

explicitly at this level as they can be easily computed from the information

encoded in the tags.
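The head computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `HEAD_CANDIDATES`, the function `find_head`, the particular POS tags listed for each chunk tag, and the choice of the rightmost candidate as head are all hypothetical and are not taken from the annotation guidelines. The simplified POS tags follow the notation of Table.1.

```python
# Assumed mapping from chunk tag to the POS tags that may head it.
# Only the NP and RBP entries are stated in the text; the rest are guesses.
HEAD_CANDIDATES = {
    "NP": {"NN", "NNP", "NNC", "NNPC", "PRP"},
    "RBP": {"RB"},
    "JJP": {"JJ", "QT"},
    "VGF": {"VM"},
    "VGNF": {"VM"},
    "VGNN": {"VM"},
}


def find_head(chunk_tag, tokens):
    """Return the index of the head word in a chunk, or None.

    tokens is a list of (word, pos) pairs. A solitary word is its own
    head. Otherwise the rightmost word whose POS tag is a candidate
    for the chunk tag is chosen (an assumption made for this sketch).
    """
    if len(tokens) == 1:
        return 0
    candidates = HEAD_CANDIDATES.get(chunk_tag, set())
    for index in range(len(tokens) - 1, -1, -1):
        if tokens[index][1] in candidates:
            return index
    return None


if __name__ == "__main__":
    np_chunk = [("Akis", "QT"), ("bADis", "JJ"), ("palas", "NN"), ("peTh", "PP")]
    head = find_head("NP", np_chunk)
    print(np_chunk[head][0])  # the noun, not the quantifier, adjective or postposition
```

With the head known, every other word in the chunk can be attached to it as a dependent without storing an explicit relation label.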

Further, it is worth mentioning that a chunk tag is assigned not only to clusters

of words which are formed by dependency relations but also to clusters which

are formed by non-dependency relations, e.g. the clusters of JJ + N, DM + N and

QT + N are clearly formed by dependency relations and have been tagged as NP

chunks but the clusters of N and PP, PRP and PP, N and RP, V and RP, etc are

not formed by modifier-modified relations (hence non-dependency) and have

been tagged with chunk labels, NP and VGF, respectively. Similarly, predicative

adjectives and quantifiers, non-contiguous adverbs, conjuncts and other

discontinuous elements like the tensed and non-tensed verbal elements have been

assigned chunk tags despite the fact that they do not essentially form a cluster of

words with some dependency or non-dependency relations; rather, they are

solitary elements and have been treated on a par with clusters of words. This

flexibility of treating clusters of words on a par with solitary words is meant to

account for the discontinuity and flexibility in surface word order which is the

hallmark of sentences taken from corpus. Therefore, positing chunk level is not

only important to deal with one set of dependency relations but also to settle most

of the problems of surface level and to smooth the ground for the next level of

annotation. This also brings the notion of chunk closer to the performance

structure proposed in Gee and Grosjean (1983) than the standard notion of phrase,

visible only in NP chunks.

Just as POS tagging is a prerequisite for chunking, chunking is a prerequisite for

deep syntactic parsing vis-à-vis annotation, i.e. for annotating inter-chunk

dependency relations. However, in order to chunk POS annotated data

consistently, a chunk tagset and a chunking interface are a must. The chunk

tagset that has been used in the current work is described in the next section.

4. Description of the Chunk Tagset

Though parsing by chunking is common practice in IL treebanks, there is yet to

be a standardized tagset of chunks and other higher dependency relations for ILs


as there is for POS tagging in the BIS standards. However, there has been some

work on chunking for Indian Languages, particularly for Hindi, Bangla, Urdu and

Telugu (see Bharati et al., 1995; Ray et al., 2003; Singh et al., 2005; Das et al.,

2005; Bharati et al., 2006). The current chunk tagset is based on the POS tagging

and chunking guidelines used in ILMT (Bharati et al., 2006) but the notion of

verb-group as posited in the guidelines has been rejected in the current work as

not applicable to Kashmiri. It has been found that the POS annotated words of

Kashmiri corpus can be grouped and classified into ten chunk categories (Bhat,

2012 & 2013). These ten chunk categories, along with chunk-tags75, are given in

the Table.1 and their description is given below.

4.1 Noun Chunk (NP)

Noun Chunk is the name assigned to a cluster of words which a noun forms with

its dependents such as JJ, DM and QT, or even with PP, which is also considered

its dependent though it is not a modifier like the other dependents. The notion

of noun chunk is similar to that of the noun phrase except that it is a single entity

and cannot be recursive, i.e. it cannot embed any sub-phrase, e.g. kwr-i hund (of

the girl), su boDbaarI bag (that big orchard), Ak-is bAd-is maqaan-as manz (in one

big house), su ti (he too), etc.

The further examples of NPs are given in the Table.1 and Table.3, and the

proportion of NP in the Kashmiri corpus is given in the Figure.3.b.

4.2 Auxiliary Chunk (AUXP)

In Kashmiri, as in other Indo-Aryan (IA) Languages, the tense, aspectual and lexical

information of verbs is distributed over three verb tokens known as tense

auxiliary, modal auxiliary and main verb, respectively, but unlike in those languages these three

verb tokens are non-contiguous, with other elements, particularly the

arguments, intervening between them. Thus, Auxiliary Chunk is the name

assigned to a solitary tense auxiliary or a cluster of tense and aspectual auxiliaries,

both tagged as VAUX at POS level, rather than to the cluster of three verb tokens

that forms a Verb Group (see Bharati, 2006) in Urdu and Hindi. The AUXP tag has

been assigned to such a solitary tense auxiliary or cluster of auxiliary tokens

away from the main verb, e.g. aasi (will), Os-nI (was not), chi-nI aasaan (do not

75 It must be noted that some of the chunks, though conceptually different from those of other ILs, have been assigned the same tags as in other IL treebanks, with the understanding that tags, like words, are arbitrary in nature and there is no point in objecting that Verb Chunk Finite is tagged as VGF instead of VCF, or that Noun Chunks are tagged as NP instead of NC. This was done purely to keep the door open for easy resource sharing.


keep), chi heykaan (can), etc. The further examples of AUXP are given in Table.1

and Table.3, and the proportion of AUXP in the Kashmiri corpus is given in the

Figure.3.b.

4.3 Verb Chunk Finite (VGF)

Verb Chunk Finite is the name assigned to the solitary tense-less or tensed main

verbs, tagged as VM at POS level or to the clusters of RB-VM, VM-RP or RP-

RB-VM. When VM is tense-less, it is either the lexical part of the auxiliary verb

occupying V2 position (or both occupying the V2 and V3 positions) in the

sentence or it is itself a full-fledged verb with both lexical part and the mood

information condensed in a single token, e.g. gatsh (go), khey (eat), chey (drink),

etc. But when it is tensed, tense is either overtly inflected, e.g. in khe-yi (will eat),

che-yi (will drink) and shongi (will sleep), or not inflected at all; in the latter case, it can be

said that tense information is morphologically unmarked or underspecified

but contextually encoded, with aspect providing the most

crucial cue. Another possibility is that aspectual information is conflated with

tense and both tense and aspect are expressed through a single inflection

(portmanteau morpheme). For instance, there are two perfective forms of finite

verbs in Kashmiri, ‘-mut’ form and ‘-ov’ form. The ‘-mut’ forms, e.g. khey-mut

(eat-prf), chey-mut (drink-prf), shong-mut (sleep-prf), etc., co-occur with a tense

auxiliary which occupies the V2 position in the sentence. Therefore, one can easily

determine whether a ‘-mut’ form of the verb is a present or a simple past perfective

form by looking at the V2 position, where its tense information is located. Thus,

the tense and aspectual information is disjunctive in such cases. However, the ‘-ov’

forms, e.g. khey-ov (ate), chey-ov (drank), shong (slept), etc., neither co-occur

with the tense auxiliary at V2 position, nor are such forms inflected with tense

information. Since tense information is underspecified in such forms, it should

have been difficult to determine whether they are present or past

perfectives, but by default native speakers perceive such forms as past perfectives.

So it is evident that in ‘-ov’ forms either ‘-ov’ carries tense information in

addition to aspectual information (hence, portmanteau) or it merely provides a cue

to tense which is encoded in the context. Irrespective of whatever may be the

convincing explanation for this case, such forms have been tagged as VGF, e.g.

natsaan (dances), khe’ (eat), shong (slept), etc. The further examples of VGF are

given in Table.1 and Table.3, and the proportion of VGF in the Kashmiri corpus is given

in the Figure.3.b.

4.4 Verb Chunk Non-finite (VGNF)

Verb Chunk Non-finite is the name that has been assigned to solitary participle

forms, ‘-vol’ forms and clusters of reduplicated progressive forms which are

essentially de-verbal in nature and function as either event or entity

modifiers. Such forms are generally known as non-finite verbs, but non-finite

verbs also include gerunds and infinitives. However, as mentioned in Chapter-IV,

the task of determining finiteness has been avoided at the POS level as the

grammatical information of the verbs is distributed over multiple tokens rather than

being condensed in a single token. The task thus becomes very complex if one

goes by the standard definition of finiteness, but it has been found that the notion

of ‘de-verbal’ forms simplifies the task. This is addressed further in the

forthcoming section on issues. It is important to mention that gerunds and

infinitives, though de-verbal in nature, don’t play a modifying role and thus, are

not tagged as VGNF like other de-verbal forms, mentioned above, e.g. shong-ith

(sleeping), bih-ith (sitting), pakaan pakaan (while walking), etc. The further

examples of VGNF are given in Table.1 and Table.3, and the proportion of VGNF

in Kashmiri corpus is given in the Figure.3.b.

4.5 Verb Chunk Gerund (VGNN)

Verb Chunk Gerund is the name that has been assigned to those de-verbal forms

which function as nominals. These include solitary direct gerundial forms and the

clusters of oblique gerundial forms and the postpositions. Infinitives have been

also tagged as VGNN as they form the argument of the finite verbs like the

gerunds. As aforementioned, such forms are distinguished from the other de-

verbal forms only in terms of their functions, otherwise the constituent verbs of

both the VGNF and VGNN are devoid of any verbal feature, except the argument

structure which is intact in them even if they play non-verbal roles in the

sentence, e.g. shong-nI sI:t’ (because of sleeping), nats-un (dancing), asn-an

(laugh-ERG), etc. The further examples of VGNN are given in Table.1 & Table.3,

and the proportion of VGNN in Kashmiri corpus is given in the Figure.3.b.

4.6 Conjunct Chunk (CCP)

Conjunct Chunk is the name assigned to the conjunctions, both coordinating and

subordinating which have been tagged as CCD and CCS, respectively at POS


level. Most of the sentences in the corpus are compound, complex or compound-

complex in nature in which conjunctions play a key structural role and thus, the

frequency of conjunctions is high in the corpus. Since conjunctions neither have

any modifier-modified relation nor bear any part-whole relation with any

other POS category, they cannot, unlike postpositions, be part of any other chunk.

Therefore, they are solitary and are projected as separate chunks, e.g. tI (and), ya

(or), kinI (or), zi (that). The further examples of CCP are given clearly, in Table.1

& Table.3, and the proportion of CCP in Kashmiri corpus is given in the

Figure.3.b.

4.7 Adjectival Chunk (JJP)

The name Adjectival Chunk has been given to solitary adjectives and

quantifiers, or to adjectival or quantifier clusters like RP-JJ and RP-QT, which cannot

be part of any noun chunk. It is worth mentioning here that, although all

adjectives have been tagged as JJ and all quantifiers have been tagged as QT at

POS level, not all adjectives and quantifiers can be raised to the chunk level as

JJP. Adjectives and quantifiers occur either in attributive position, as part of an

NP, or in predicative position, as solitary elements or clusters. It is only in

predicative position that adjectives and quantifiers have the status of head, as they

do not constitute what are popularly known as discontinuous phrases; they can be

easily posited as adjectival chunks and have been tagged as JJP, e.g. su chu

rut (he is nice). The further examples of JJP are given in Table.1 & Table.3, and

the proportion of JJP in Kashmiri corpus is given in the Figure.3.b.

4.8 Adverbial Chunk (RBP)

The name Adverbial Chunk has been assigned to solitary adverbs or adverbial

clusters (RB-RP) which cannot be part of any verb chunk. It must be noted that

although all adverbs are tagged as RB at POS level, not all can be raised to chunk

level and tagged as RBP, because sometimes they are adjacent to their head and

can be part of VGF; mostly, however, they occur non-contiguously with their head

and are tagged as RBP, e.g. su os vaarI vaarI garI kun pakaan (he was moving

towards home slowly). The further examples of RBP are given in Table.1 &

Table.3, and the proportion of RBP in Kashmiri corpus is given in the Figure.3.b.

4.9 Negation Chunk (NEGP)

The name Negation Chunk has been given to those negative particles that occur

as solitary elements without an obvious head and hence, may be treated as the


heads to be projected as the chunks, e.g. na su yiyi-nI vaapas (no, he won’t come

back). The further examples of NEGP are given in Table.1, and the proportion of

NEGP in Kashmiri corpus is given in the Figure.3.b.

4.10 Other Chunk (BLK)

The name Other Chunk is reserved for all those solitary words or clusters of POS tagged

words which do not fit in the aforementioned chunks. This is like a bag

into which all elements that do not conform to the chunking scheme can be put,

either because they are unrelated to the sentence structure, e.g. serial

numbers, or because they belong to the discourse level, connecting one sentence with

another, e.g. khA:r tAm’ vAn’-nI zahn ti titsh kath (however, he never said

anything like that). The further examples of BLK are given in Table.1, and the

proportion of BLK in Kashmiri corpus is given in the Figure.3.b.

S. No | Chunk Name | Tag | Examples

I | Noun Chunk | NP | [su/DM badI/RP rut/JJ shakhIts/NN] NP (that very good man), [farooq/NNP ti/RP] NP (farooq also), [farooq/NNP nI/RP] NP (not farooq), [Akis/QT bADis/JJ palas/NN peTh/PP] NP (on one big rock)

II | Auxiliary Chunk | AUXP | [chu/VAUX] AUXP (is), [chu/VAUX aasaan/VAUX] AUXP (keeps), [Os/VAUX] AUXP (was), [Os/VAUX rOzaan/VAUX] AUXP (used to), [aav/VAUX] AUXP (was), [yiyi/VAUX] AUXP (will be)

III | Verb Chunk Finite | VGF | [kheyovum/VM] VGF (ate +1st person clitic), [vaarI/RB parihaa/VM] VGF (may/should have read nicely), [dav/VM haz/RP] VGF (run + honorific), [variyaa/RP zorI/RB pakh/VM] VGF (walk very fast), [yiyi-nI/VM kehn/RP] VGF (will not come + emphasis), *[chu/VAUX vonmut/VM] VGF (has said)

IV | Verb Chunk Non-Finite | VGNF | [kheyth/VM] VGNF (after eating), [kheyth/VM cheyth/VM] VGNF (after eating and drinking), [pakaan/VM pakaan/VM] VGNF (while walking), [kheynIvol/VM] VGNF (eater)

V | Verb Chunk Gerund | VGNN | [paknas/VM peyTh/PP] VGNN (for walking), [kheynI/VM sI:t’/PP] VGNN (with eating), [natsnI/VM kin’/PP] VGNN (due to dancing), [khenIch/VM] VGNN (of eating), [kheyon/VM] VGNN (eating/ to eat), [kheynIvol/VM] VGNN (one who eats)

VI | Conjunct Chunk | CCP | [tI/CCD] CCP (and), [yaa/CCD] CCP (or), [kinI/CCD] CCP (or), [natI/CCD] CCP (or), [ki/CCS] CCP (that), [zi/CCS] CCP (that), [yodvai/CCS] CCP (if), [agar/CCS] CCP (if), [magar/CCD] CCP (but)

VII | Adjectival Chunk | JJP | [variyaa/INT rut/JJ] JJP (very good), [pantsah/QC kiluu/NN] JJP (fifty kilo), [pandhA:yim/QO] JJP (fifteenth), [zyuuTh/JJ] JJP (tall or lengthy)

VIII | Adverbial Chunk | RBP | [teyz/RB teyz/RB] RBP (quickly or fast), [zorI/RB ti/RP] RBP (loudly), [lot/RB] RBP (slowly), [ti/RP kyaazi/RB] RBP (because), [chunki/CCS] RBP (because), [tawai/RB] RBP (because of that), [teli/RB] RBP (then)

IX | Negation Chunk | NEGP | [na/RP] NEGP (no), [na/RP saa/RP na/RP] NEGP (no +honorific not)

X | Other | BLK | [khA:r/RP] BLK (however), [teli/RP] BLK (so)

Table.1. Kashmiri Chunk Tagset

5. Chunking POS Tagged Corpus Samples

As aforementioned, chunking is labeling a cluster of POS annotated words (with

an obvious head) or a solitary POS annotated word (which itself acts as a head),

with a higher level tag. During chunking, words have been clustered together and

assigned a particular chunk tag, keeping in view their POS tags, adjacency and

dependency relations between them that make them perceptually closed entities. It

has been done in such a way that each chunk has a definite internal structure, i.e.

words constituting a chunk are asymmetrically related to each other, with one

word as a head and the remaining words as its dependents or in case the word is

solitary, it is itself the head with no dependents. However, there are certain cases

where the word which has been given chunk status is neither a head nor a

dependent as far as semantic dependency is concerned, e.g. AUXP, as discussed

above. The chunking process has been carried out using the same interface which

was used to carry out POS tagging. The chunk layer has been built on the POS

layer as illustrated below in three steps for sentence 43 and sentence 42

(taken from the corpus), given in Table.2 along with their English translations

and chunk information. The POS annotated file in SSF format can be opened

in the Sanchay SA Interface (GUI), as shown in Fig.1.a, in order to carry out

manual chunking.

Kashmiri Sentence 43

Sentence: ٹھ اجار-دری ید خطر ٹیکنالوجی پ ری طاقت چھ پنن ف ن ز جو ٲتم و ٮ� ٲ ٲ ہ ے ہ Iن ۍ یی پراونس ری توان چھ بیین ملکن امن مقصدو خطر جو ٲیژھان ت ن ٲ ہ ہ ۔اجازت دوان

Translation: He said that the atomic powers want a monopoly on the technology for their own benefit and do not let other countries use atomic energy for peaceful purposes.

Chunks: [[ [[ ۍتم_PR_PRP ]]_NP [[ ن نIو_V_VM ]]_VGF [[ ز_CC_CCS ]]_CCP [[ ری جو_JJ_JJ ہطاقت_N_NN ]]_NP [[ ےچھ_V_VAUX ]]_VGF [[ ہپنن_PR_PRF ]]_NP [[ ید ٲف_N_NN ٲخطر_PP_PSP ]]_NP [[ ٹیکنالوجی_N_NN ٹھ ٮ�پ_PP_PSP ]]_NP [[ ری ٲاجار-د_N_NN ]]_NP [[ یژھان_V_VM ]]_VGF [[ ہت_CC_CCD ]]_CCP [[ ہن_RP_NEG چھ_V_VAUX ]]_VGF [[ بیین_JJ_JJ ملکن_N_NN ]]_NP [[ امن_N_NNC مقصدو_N_NNC ٲخطر_PP_PSP ]]_NP [[ ری جو_JJ_JJ یی ٲتوان_N_NN ]]_NP [[ پراونس_V_VM ]]_VGF [[ اجازت_N_NN ]]_NP [[ دوان_V_VM ۔_RD_PUNC ]]_VGF ]]_SSF

Kashmiri Sentence 42

Sentence: دس ملکس نمت ز تم س ن�ایرانک صدر محمود احمدی نژادن چھ و ۍ Iن ۍ میت ت سالمتی ہخالف چھن اقوام متحد-کس تاز قراردادس کا ا KہYن ہ ک ‘ آل بنیمژ ‘ - ۔کونسل چھ امریک ہ ہ ے

Translation: The President of Iran, Mahmood Ahmadinejad, has said that the United Nations' resolution against his country has no significance and that the United Nations has become an instrument in the hands of America.

Chunks: [[ [[ ۍایرانک_N_NNP ]]_NP [[ صدر_N_NNPC محمود_N_NNPC احمدی_N_NNPC نژادن_N_NNPC ]]_NP [[ چھ_V_VAUX نمت نIو_V_VM ]]_VGF [[ ز_CC_CCS ]]_CCP [[ ۍتم_PR_PRP دس ن�س_PP_PSP ]]_NP [[ ملکس_N_NN خالف_PP_PSP ]]_NP [[ ہچھن_V_VAUX ]]_VGF [[ اقوام_N_NNPC متحد-کس_N_NNPC ]]_NP [[ تاز_JJ_JJ قراردادس_N_NN ]]_NP [[ نYہKکا_DM_DMI میت ا_N_NN ]]_NP [[ ہت_CC_CCD ]]_CCP [[ سالمتی_N_NNPC کونسل_N_NNPC ]]_NP [[ ےچھ_V_VAUX ]]_VGF [[ ک - ہامریک_N_NNP ]]_NP [[ ‘_RD_PUNC ہآل_N_NN ]]_NP [[ بنیمژ_V_VM ’_RD_PUNC ۔_RD_PUNC ]]_VGF ]]_SSF

Table.2. Showing Example Sentences


Figure.1.a. SA Interface Showing POS Tagged Sentence

Step-1: In this step, the contiguous words which form a chunk are selected by holding the Control key and clicking on the nodes so that all the contiguous nodes are selected simultaneously, as shown in Fig.1.b. Although the first three chunks (NP, VGF and CCP) consist of solitary words, they have also been chunked following the same steps as shown for the fourth chunk (the highlighted one in Fig.1.b), i.e. by selecting the nodes, adding a layer and changing the name of the layer (chunk name) for the selected nodes.

Figure.1.b. SA Interface Showing Step-1 of Chunking


Step-2: In step two, one can right-click on the selected chunk so that the drop-down list of actions opens, in which the ‘Add Layer’ option can be selected and a new chunk layer can be added in the format shown in Fig.1.c.

Figure.1.c. SA Interface Showing Step-2 in Chunking

Step-3: In this step, the newly added layer has a default tag (NP) which can be easily changed by clicking on the chunk tag itself and pressing the key for the first letter of the desired chunk tag. One can keep pressing the letter key until the desired chunk tag is assigned to the newly added chunk layer, as shown in Fig.1.d.

Figure.1.d. SA Interface Showing Step-3 in Chunking


As shown above, sentence-43 has 26 tokens which have been grouped into 18

chunks and sentence-42 has 29 tokens which have been grouped into 16 chunks.

The ratio of tokens/words to chunks is not very large (approx. 1.6),

which indicates that there is a high frequency of solitary words

that have been given the status of chunk. The 18 chunks of sentence-43, as viewed

in the tree viewer of the interface, are shown in Fig.2.a and Fig.2.b.
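The token-to-chunk ratio quoted above can be checked with a short calculation. The sketch below uses only the four counts stated in the text; the variable names are illustrative.

```python
# (tokens, chunks) per sentence, as stated in the text.
sentences = {
    "sentence-43": (26, 18),
    "sentence-42": (29, 16),
}

# Ratio for each sentence taken separately.
for name, (tokens, chunks) in sentences.items():
    print(f"{name}: {tokens / chunks:.2f} tokens per chunk")

# Ratio over both sentences pooled together: 55 tokens in 34 chunks.
total_tokens = sum(tokens for tokens, _ in sentences.values())
total_chunks = sum(chunks for _, chunks in sentences.values())
overall_ratio = total_tokens / total_chunks
print(f"overall: {overall_ratio:.2f} tokens per chunk")
```

The pooled figure rounds to 1.6, which agrees with the approximate ratio reported in the text; a ratio this close to 1 means that many chunks contain a single word.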

Figure.2.a. SA Interface Showing Chunks in Sentence 43

Figure.2.b. SA Interface Showing Chunks in Sentence 43


<Sentence id='42'>
1 (( NP  1.1 ۍایرانک N_NNP ))
2 (( NP  2.1 صدر N_NNPC  2.2 محمود N_NNPC  2.3 احمدی N_NNPC  2.4 نژادن N_NNPC ))
3 (( VGF  3.1 چھ V_VAUX  3.2 نمت نIو V_VM ))
4 (( CCP  4.1 ز CC_CCS ))
5 (( NP  5.1 ۍتم PR_PRP  5.2 دس ن�س PP_PSP ))
6 (( NP  6.1 ملکس N_NN  6.2 خالف PP_PSP ))
7 (( VGF  7.1 ہچھن V_VAUX ))
8 (( NP  8.1 اقوام N_NNPC  8.2 متحد-کس N_NNPC ))
9 (( NP  9.1 تاز JJ_JJ  9.2 قراردادس N_NN ))
10 (( NP  10.1 نYہKکا DM_DMI  10.2 میت ا N_NN ))
11 (( CCP  11.1 ہت CC_CCD ))
12 (( NP  12.1 سالمتی N_NNPC  12.2 کونسل N_NNPC ))
13 (( VGF  13.1 ےچھ V_VAUX ))
14 (( NP  14.1 ک - ہامریک N_NNP ))
15 (( NP  15.1 ‘ RD_PUNC  15.2 ہآل N_NN ))
16 (( VGF  16.1 بنیمژ V_VM  16.2 ’ RD_PUNC  16.3 ۔ RD_PUNC ))
</Sentence>

<Sentence id='43'>
1 (( NP  1.1 ۍتم PR_PRP ))
2 (( VGF  2.1 ن نIو V_VM ))
3 (( CCP  3.1 ز CC_CCS ))
4 (( NP  4.1 ری جو JJ_JJ  4.2 ہطاقت N_NN ))
5 (( VGF  5.1 ےچھ V_VAUX ))
6 (( NP  6.1 ہپنن PR_PRF ))
7 (( NP  7.1 ید ٲف N_NN  7.2 ٲخطر PP_PSP ))
8 (( NP  8.1 ٹیکنالوجی N_NN  8.2 ٹھ ٮ�پ PP_PSP ))
9 (( NP  9.1 ری ٲاجار-د N_NN ))
10 (( VGF  10.1 یژھان V_VM ))
11 (( CCP  11.1 ہت CC_CCD ))
12 (( VGF  12.1 ہن RP_NEG  12.2 چھ V_VAUX ))
13 (( NP  13.1 بیین JJ_JJ  13.2 ملکن N_NN ))
14 (( NP  14.1 امن N_NNC  14.2 مقصدو N_NNC  14.3 ٲخطر PP_PSP ))
15 (( NP  15.1 ری جو JJ_JJ  15.2 یی ٲتوان N_NN ))
16 (( VGF  16.1 پراونس V_VM ))
17 (( NP  17.1 اجازت N_NN ))
18 (( VGF  18.1 دوان V_VM  18.2 ۔ RD_PUNC ))
</Sentence>

Table.3. Showing Chunked Sentences in SSF
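To illustrate how a chunked sentence of this kind can be read programmatically, the following is a minimal sketch of a reader for a simplified SSF-like layout. It assumes one node per line, whitespace-separated fields and single-word tokens; real SSF files carry further columns (e.g. feature structures) that are ignored here. The function name `parse_ssf`, the sample sentence (built from the transliterated example farooq chu batI khevaan with the simplified tags of Table.1) and its chunk tags are illustrative assumptions, not corpus data.

```python
# Hypothetical sample in a simplified SSF-like layout (not corpus data).
SAMPLE = """<Sentence id='1'>
1 (( NP
1.1 farooq NNP
))
2 (( AUXP
2.1 chu VAUX
))
3 (( NP
3.1 batI NN
))
4 (( VGF
4.1 khevaan VM
))
</Sentence>"""


def parse_ssf(text):
    """Return a list of chunks, each a dict with a tag and (word, pos) tokens."""
    chunks = []
    current = None
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("<"):
            continue  # skip blank lines and <Sentence> markers
        if len(parts) == 3 and parts[1] == "((":
            current = {"tag": parts[2], "tokens": []}  # chunk opens
        elif parts[0] == "))":
            if current is not None:
                chunks.append(current)  # chunk closes
            current = None
        elif current is not None and len(parts) >= 3:
            current["tokens"].append((parts[1], parts[2]))  # address word pos
    return chunks


if __name__ == "__main__":
    for chunk in parse_ssf(SAMPLE):
        words = " ".join(word for word, _ in chunk["tokens"])
        print(chunk["tag"], words)
```

Chunk and token counts, and hence the token-to-chunk ratio, follow directly from the returned list.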

6. Chunking Issues

As discussed above, chunking covers half of the dependency relations, though

they are not explicitly marked with relational labels, which is the general practice in

dependency treebanks and can be clearly seen in the works of Nivre (2009) who

has translated Tesnière’s (1959) seminal work on dependency grammar.

However, when the framework designed for treebanking in ILs

(Bharati et al., 1995, 2006) is applied to Kashmiri, entirely new issues come to the fore. Such

issues have partly stemmed from the underlying theory and partly from the

peculiar morphosyntactic or syntactic properties of Kashmiri that distinguish it


from rest of the ILs and bring it closer to Germanic Languages like German and

Yiddish. The main issues that have been encountered during the manual chunking

of Kashmiri corpus are briefly given below.

6.1. V2 and V3 Phenomena

It has been found that the notion of verb group that was proposed for ILs does not

hold for the Kashmiri corpus because of a unique syntactic feature of the Kashmiri

language known as the V2 phenomenon. The V2 phenomenon occurs in all tensed clauses,

be they matrix or embedded clauses, in both active and passive configurations.

It is due to this phenomenon that the tense auxiliary and the main verb cease to

be contiguous. The tense auxiliary (VAUX) occurs at the second (V2) position

and the main verb (VM) occurs at the final position of the sentence, but if there is

also a modal auxiliary in the sentence, it occupies the third (V3) position. For

example:

farooq/NNP chu/VAUX batI/NN khevaan/VM

Farooq is eating rice.

farooq/NNP chu/VAUX aasaan/VAUX batI/NN khevaan/VM

Farooq keeps eating rice.

However, in an interrogative sentence the tense auxiliary can also occur at the third (V3)

position in the sentence; if the auxiliary carrying aspectual information is also

present, it occurs at the fourth (V4) position. For example:

farooq/NNP kya/WH chu/VAUX reyaazas/NNP divaan/VM

What is Farooq giving to Riyaz?

farooq/NNP kya/WH chu/VAUX aasaan/VAUX reyaazas/NNP divaan/VM

What does Farooq keep giving to Riyaz?
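The positions described in these examples can be read off mechanically from the word/TAG notation. The sketch below is only an illustration: the helper names `parse_tagged`, `verb_positions` and `is_discontinuous` are hypothetical, and position is counted naively in words from the start of the sentence.

```python
def parse_tagged(sentence):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in sentence.split()]


def verb_positions(sentence):
    """Return the 1-based positions of VAUX tokens and of VM tokens."""
    tokens = parse_tagged(sentence)
    aux = [i for i, (_, tag) in enumerate(tokens, start=1) if tag == "VAUX"]
    main = [i for i, (_, tag) in enumerate(tokens, start=1) if tag == "VM"]
    return aux, main


def is_discontinuous(sentence):
    """True when other words intervene between the last VAUX and the first VM."""
    aux, main = verb_positions(sentence)
    return bool(aux) and bool(main) and min(main) - max(aux) > 1


EXAMPLES = [
    "farooq/NNP chu/VAUX batI/NN khevaan/VM",
    "farooq/NNP chu/VAUX aasaan/VAUX batI/NN khevaan/VM",
    "farooq/NNP kya/WH chu/VAUX reyaazas/NNP divaan/VM",
    "farooq/NNP kya/WH chu/VAUX aasaan/VAUX reyaazas/NNP divaan/VM",
]

if __name__ == "__main__":
    for example in EXAMPLES:
        aux, main = verb_positions(example)
        print(aux, main, is_discontinuous(example))
```

In all four examples at least one argument intervenes between the auxiliaries and the main verb, which is why the two cannot be grouped into one contiguous verb chunk.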

The problem with the finite clauses in Kashmiri is that they can’t be easily

chunked like in other IA languages, e.g. Hindi, Urdu or Punjabi, etc, due to the

presence of V2 phenomenon. The tensed verb stands discontinuous from its main

verb as shown in the above examples. Usually, a group or cluster of words is

assigned a chunk label if the words are adjacent or contiguous to each other and

also have an asymmetric relation of dependence with each other or simply share

unequal category status, so that the one with higher status can be projected as the

head but in this case the VAUXs and the VMs are neither adjacent to each other

nor the relationship they hold with each other is dependency relation in the real


sense. Dependencies are essentially modifier-modified relations and between the

discontinuous VAUX and VM in finite clauses, there is hardly any modifier-

modified kind of relation but definitely a part-whole kind of relationship.

Therefore, it is impossible to posit the verb as a chunk in the way noun and adjective chunks

have been posited. Some ad hoc decisions need to be taken to tackle the V2

problem as the language data is far from ideal for our perceived notion

of chunk.

6.2. Headless Head

Adverbs are considered the most floating or movable elements in a sentence.

They frequently occur discontinuously, away from their heads (VMs), at the

beginning, at the final position or elsewhere in the sentence. However,

sometimes they occur adjacent to the VM they modify and thus become part of

VGF, VGNF or VGNN. When adverbs (RB) occur discontinuously, they have no

governing or influencing head adjacent to them and are authority in themselves.

Under such circumstances, RBs can be considered as heads, though they would be

pseudo-heads and would still have a clear-cut dependency relation with their

far-away head, which is also the ultimate head, the root. It is evident that the

dependency relation of lower level (chunk level) has been promoted to the

dependency relation of higher level (argument structure level) to handle

discontinuous verb chunk.

6.3. PP No More a Head

Since there is a well-known notion of functional heads in both constituency-based

and dependency-based frameworks, at least for exocentric

constructions, a distinction is maintained in those frameworks between cases where a cluster of words

(N-PP) has the noun as its head and cases where it has the post/preposition as its head. In other words,

when an N-PP cluster is a noun phrase/chunk and when it is a post/prepositional

phrase/chunk has been properly distinguished. However, no such distinction has

been drawn on the functional basis, i.e. on the basis of functioning as an argument

or an adjunct. In all the clusters of words containing noun, nouns have been

treated as their head but never the pre/postposition, irrespective of the fact that

some of them perform core functions (subject or objects) and many of them

perform mere subsidiary functions (adverbial) in the sentence. This uniformity

has been maintained at this level because the underlying notion of the head in


PCG (Bharati, 1995) is essentially a semantic notion, with few exceptions76. The

function words are devoid of semantics or content and can’t be treated as heads

based on the underlying theory. Therefore, there is no possibility of

pre/postpositional phrases or chunks, unlike what was originally posited in

Bloomfieldian and Post-Bloomfieldian literature for exocentric constructions, as

already discussed in Chapter II, and which later appeared in PSG (Chomsky, 1956) and

DG (Tesnière, 1959). It is worth mentioning that in these works NPs are generally

arguments and PPs are adjuncts, but this distinction has been avoided here for the

sake of theory and has been encoded at the next level of annotation.

6.4. Junction Still a Head

The notion of dependency does not always provide unambiguous solutions when

it comes to exocentric constructions. The dependency representation is at a loss

when it comes to representing the notorious paratactic linguistic phenomena such

as coordination, whose nature is symmetric (two or more conjuncts play the same

role), as opposed to the head-modifier asymmetry of dependencies (Popel et al.,

2013). In other words, coordination is a pending problem of natural languages and

both PSG as well as DG struggle with it (Hudson, 1988, Covington, 1980).

Conjuncts are also exocentric constructions, but unlike the pre/postpositional

phrases/chunks they have not been treated as endocentric constructions,

given their crucial role in the structural organization of sentences.

6.5. Negation and Double Negation

Kashmiri has negative elements in free form like na and nI (no and not) as well as

in bound form like -nI (not) in khe-yi-nI (will not eat), and sometimes there is also

double negation, e.g. khe-yi-nI kehn (will not eat no) and na saa na (no +honorific

not). The negative markers do not belong to this level, and certain negative

particles in double negation constructions (see the above example) which do have

obvious heads are of no concern here and cannot be projected as chunks. However,

some negative particles, either solitary or in clusters (RP-RP), do not have any

obvious heads and themselves have the potential to be heads.

6.6. Discourse Elements

Discourse elements are the particles that have been tagged as particle default

(RPD) at POS level. They conjoin sentences at semantic or discourse level to

76 The strict notion that only lexical items can be heads seems to be diluted by projecting certain chunks from function words, e.g. CCP, NEGP and BLK.


bring cohesion in the text. Since they were extraneous to the existing set of

chunks and, like conjunctions, do not seem to be dependents of any existing semantic head

in spite of being function words, they have been projected as separate chunks (BLK). It must be noted that discourse

elements have also been treated as heads (connectives) in discourse treebanks.

6.7. Relational Confusion

As aforementioned, at chunk level one needs to handle two kinds of grammatical

relations, one lower level dependencies, e.g. between JJ and N or RB and VM and

a kind of part-whole relations, e.g. between N and PP, N and RP, VAUX and VM.

It would be more productive to focus on one type of relation at a

time. Therefore, one needs to keep track of the kind of relation one has to

handle, without confusing dependencies with part-whole relations.
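One simple way to keep the two kinds of relation apart is an explicit lookup keyed on the pair of POS categories involved. The sketch below encodes only the pairs named in this section; the names `DEPENDENCY_PAIRS`, `PART_WHOLE_PAIRS` and `relation_type` are hypothetical, and any pair not listed is reported as unknown rather than guessed.

```python
# Pairs named in the text as lower-level (modifier-modified) dependencies.
DEPENDENCY_PAIRS = {("JJ", "N"), ("RB", "VM")}

# Pairs named in the text as part-whole relations.
PART_WHOLE_PAIRS = {("N", "PP"), ("N", "RP"), ("VAUX", "VM")}


def relation_type(first_pos, second_pos):
    """Classify a pair of POS categories occurring within a chunk."""
    pair = (first_pos, second_pos)
    if pair in DEPENDENCY_PAIRS:
        return "dependency"
    if pair in PART_WHOLE_PAIRS:
        return "part-whole"
    return "unknown"


if __name__ == "__main__":
    for pair in [("JJ", "N"), ("N", "PP"), ("VAUX", "VM"), ("WH", "N")]:
        print(pair, relation_type(*pair))
```

An annotator or a validation script could use such a lookup to process one relation type at a time, as recommended above.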

7. Statistical Results

The quantitative results are given in terms of chunk statistics and qualitative

results are given in terms of a miniature guideline.

The four datasets that had been used for POS tagging have been reduced to only

three datasets by merging second and third ones. These three datasets have been

utilized in chunking which has been carried out by using SA Interface of Sanchay

as aforementioned and chunk frequency of each dataset has been obtained with

the help of the same interface. The frequency distribution table so obtained has

been later used to calculate the cumulative frequency and the percentage of the

chunks. The same data is represented through the bar charts given in Fig.3.a

and 3.b.

Chunk:      AUXP  BLK  CCP  JJP  NEGP  NP    RBP  VGF   VGNF  VGNN
Frequency:  407   50   794  272  9     4556  159  1473  215   190

Figure.3.a. Showing Cumulative Frequency of Chunks

The three datasets consist of 682 POS annotated sentences, which in turn consist of 8125 chunks classified into ten chunk categories. The most frequent chunk is NP and the least frequent chunk is NEGP, as shown in Fig.3.a, where the height of each bar is directly proportional to the frequency of the chunk it represents. The ascending order of frequency of the chunks is therefore as follows:

NEGP < BLK < RBP < VGNN < VGNF < JJP < AUXP < CCP < VGF < NP

That NP is the most frequent chunk and VGF the second most frequent is expected from the empirical facts about POS categories given in Chapter-IV. The chunk statistics reveal an important empirical fact: 27.630% of clauses show the V2 phenomenon, while the remaining 72.369% do not, tense being condensed in the main verb itself. It must be noted, however, that not only tensed verbs have been considered finite: all verbs that have not become de-verbal and that carry aspectual or modal information have been considered finite as well. It is for this reason that a comparatively small percentage of finite clauses has been found with the V2 phenomenon, which could otherwise have been larger. The statistical results reveal another empirical fact: 78.434% of verbs in Kashmiri are finite and 21.565% are non-finite. Among the non-finite forms, 46.913% are gerunds and the remaining 53.086% are other non-finite forms.
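The percentages quoted above can be re-derived from the chunk frequencies. The following Python sketch is only an illustration. It assumes the per-category counts read off the bars of Fig.3.a (they sum to the reported 8125 chunks). It further assumes, as the figures suggest, that the V2 share is the ratio of AUXP to VGF chunks, that the finite share is VGF over all verb chunks, and that the gerund share is VGNN over the non-finite chunks. All variable names are hypothetical.

```python
# Chunk counts as read from Fig.3.a (an assumption of this sketch;
# they add up to the 8125 chunks reported in the text).
CHUNK_COUNTS = {
    "AUXP": 407, "BLK": 50, "CCP": 794, "JJP": 272, "NEGP": 9,
    "NP": 4556, "RBP": 159, "VGF": 1473, "VGNF": 215, "VGNN": 190,
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


total = sum(CHUNK_COUNTS.values())
verb_chunks = CHUNK_COUNTS["VGF"] + CHUNK_COUNTS["VGNF"] + CHUNK_COUNTS["VGNN"]
non_finite = CHUNK_COUNTS["VGNF"] + CHUNK_COUNTS["VGNN"]

# Assumption: each AUXP marks one V2 clause and each VGF one finite clause.
v2_share = percent(CHUNK_COUNTS["AUXP"], CHUNK_COUNTS["VGF"])
finite_share = percent(CHUNK_COUNTS["VGF"], verb_chunks)
gerund_share = percent(CHUNK_COUNTS["VGNN"], non_finite)
np_share = percent(CHUNK_COUNTS["NP"], total)

# Chunk labels sorted from least to most frequent.
ascending = sorted(CHUNK_COUNTS, key=CHUNK_COUNTS.get)

print("total chunks:", total)
print("V2 clauses (%):", round(v2_share, 3))
print("finite verbs (%):", round(finite_share, 3))
print("gerunds among non-finite (%):", round(gerund_share, 3))
print("NP share (%):", round(np_share, 3))
print("ascending order:", " < ".join(ascending))
```

Keeping the raw counts in one place makes it easy to recompute every derived figure when further datasets are chunked.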


The bar diagram in Fig.3.b shows the data represented in Fig.3.a in terms of percentage. It was drawn to reveal the striking quantitative similarities among the three datasets and to put forward a numerical generalization about the percentage of various chunks in the corpus, so that one can reliably claim that NPs constitute more than 50% of chunks in Kashmiri.

Figure.3.b. Showing Relative Proportion of Chunks

8. Chunking Guidelines

The chunking guidelines comprise the decisions that have been taken to resolve the various issues raised while chunking the data. These guidelines can be followed in order to achieve consistency in future chunking tasks.

i. The auxiliaries and main verbs need to be independently projected as chunks (AUXP and VGF), so that the non-adjacency problem can be settled at the next level by positing a relation between them in which VM is the head of VAUX. The solution may sound odd if one is preoccupied with the popular notions of syntax, but it is the surface form that is being accounted for here, through surface level manipulations, without positing the abstract layers and categories that have been the popular practice. Moreover, the purpose here is not to contribute to or challenge any theoretical paradigm but simply to produce a well grounded, data driven grammar which a parser can learn or from which a probabilistic grammar can be extracted.

Chunk:       AUXP   BLK    CCP    JJP    NEGP  NP      RBP    VGF     VGNF   VGNN
Percentage:  5.009  0.615  9.772  3.347  0.11  56.073  1.956  18.129  2.646  2.338

(Data for Figure.3.b)

ii. Though conjunctions can’t be semantic head, it has been worked out that

conjunction should be treated as the head and be projected as a chunk

under the label CCP.

iii. The negative particles have scope on the entire sentence rather than on the

single word or phrase. Therefore, it can be said that they are involved in

sentential negation. Such particles should be projected as chunks under the

label NEGP.

iv. Though discontinuous adverbs have quite a high frequency and occur at long distances from the semantic head, they are, as mentioned earlier, still dependents of the verb at the lower level. They need to be projected as chunks under the label RBP only to handle the discontinuity.

v. Since discourse particles have no role in the internal organization of a sentence, they cannot belong to any of the other chunks proposed in the tagset, which are essential to account for the internal organization of a clause or a sentence. Therefore, they must be projected as separate chunks under the label BLK.

vi. MWEs, which include named entities, compound words and izaafat constructions, are POS level problems that have been handled by concatenating ‘C’ with the tag. Their parts nevertheless remain separate tokens, which can be potentially confusing. Care must be taken that all adjacent or contiguous POS tagged tokens carrying a ‘C’ marked tag are considered one word, so that together they are either a head or a dependent. It should not be seen as a problem that they apparently give rise to very big chunks.

vii. It has been found that discontinuous noun phrases are a rarity, unlike discontinuous verb phrases. Adjectives, however, do occur either as predicative adjectives or as the adjectival component of complex predicates; these are genuinely heads and should be projected as chunks under the label JJP.
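Guideline (vi) lends itself to a small pre-processing step before chunking. The sketch below is an illustration only and not part of the annotation tool. It assumes that the ‘C’ mark is appended to a base tag (e.g. NNC), and the set of base tags, the function names and the placeholder tokens w1 to w4 are all hypothetical. The base-tag check keeps a tag such as CC from being mistaken for a ‘C’ marked tag.

```python
# Hypothetical base tags that may carry the compound mark 'C' (e.g. "NNC").
# The actual inventory is the POS tagset described in Chapter four.
COMPOUNDABLE_BASE_TAGS = {"NN", "NNP", "JJ"}


def is_c_marked(tag):
    """True if the tag is a base tag with 'C' concatenated, e.g. 'NNPC'."""
    return tag.endswith("C") and tag[:-1] in COMPOUNDABLE_BASE_TAGS


def group_c_marked(tagged_tokens):
    """Group each maximal run of adjacent 'C'-marked tokens into one unit.

    tagged_tokens is a list of (word, tag) pairs; the result is a list of
    units, each unit being a list of (word, tag) pairs.
    """
    units, run = [], []
    for word, tag in tagged_tokens:
        if is_c_marked(tag):
            run.append((word, tag))
            continue
        if run:  # a non-marked token closes the current run
            units.append(run)
            run = []
        units.append([(word, tag)])
    if run:  # a run that reaches the end of the sentence
        units.append(run)
    return units


# Placeholder tokens; w1..w4 stand for real words of the corpus.
example = [("w1", "NNPC"), ("w2", "NNPC"), ("w3", "CC"), ("w4", "NN")]
units = group_c_marked(example)
print(units)
```

Each returned unit can then be treated as a single word when deciding whether it is a head or a dependent within a chunk.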

9. Summary

In this chapter, the first part of syntactic annotation, i.e. chunking, has been described. First, the notion of chunk, which seems quite similar to the popular notion of phrase, was dealt with, and the distinction between the two has been discussed. Since chunking is also an annotation task, it requires a tagset and an annotation tool, just like POS annotation. Both the tagset and the tool have been described at length. Each tag has been explained with the help of examples, and the entire process of manual chunking of the previously built POS annotated KashCorpus has been illustrated with the help of example sentences 42 and 43. Snapshots of the tree viewer have also been given along with the chunked data of these example sentences in SSF, to show how chunk projections are created in the interface and how they are actually stored at the back end in SSF. Further, the linguistic issues raised during the process have been discussed elaborately with sufficient examples. Finally, the results of the annotation work have been presented. The empirical results have been given in the form of bar charts, which have also been briefly interpreted. The theoretical results have been given in the form of guidelines covering the main decisions that have been taken in order to resolve various issues.


Chapter.6 Dependency Parsing of KashCorpus

“Unfortunately or luckily, no language is tyrannically consistent . . . All grammars leak.”

Edward Sapir, Language (1921)

1. Introduction

As already mentioned in the Chapter three and Chapter four, treebank is a set of

machine readable parse trees of natural language, encoding the syntactic, semantic

or both types of linguistic information. Dependency treebanks are multi-layered

annotation pipelines and at each layer, a separate but related set of linguistic

information is annotated, in a manner, so that the tags at the lower level facilitate

the annotation at the higher level. The obligatory annotation layers of a

dependency treebank include a POS layer, a chunk layer and a relational layer.

However, further layers of linguistic information like the morphological or

discourse level information can be also added, depending upon the intended utility

of the layer in a treebank.

For the current dependency treebank, only three layers of linguistic information have been taken into consideration. The first layer contains coarse grained hierarchical POS labels for each token/word of a sentence, as discussed in Chapter four. The second layer contains chunk labels for the clusters of words, which have been dealt with in Chapter five. The third layer contains labels for inter-chunk dependency and non-dependency relations, which are dealt with in this chapter. This chapter discusses the dependency parsing/annotation of the already chunked KashCorpus.

Dependency parsing involves labeling head-dependent relations at the lower level as well as at the higher level. The dependency annotation at the lower level, i.e. intra-chunk, has been covered under chunking in the previous chapter. The dependency annotation at the higher level, i.e. at the level of predicate argument structure and hence inter-chunk, is the sole concern of this chapter.

Section two introduces the notion of (deep) syntactic parsing. Section three describes the grammar formalism used as the parsing model. Section four describes the GRs. Section five deals with the annotation of dependencies. Section six is concerned with the issues raised during the annotation process. Section seven provides the statistical results of dependency annotation. Section eight discusses the inter-annotator agreement and the results obtained in the concerned experiment. Finally, section nine summarizes the chapter.


2. Notion of Syntactic Parsing

Generally, parsing refers to the syntactic analysis of an input string and a parser is a programme that parses an input string automatically. According to Grune and Jacobs (2008), parsers are already being used extensively in a number of disciplines: in computer science for compiler construction, database interfaces and artificial intelligence; in linguistics for text analysis, corpora analysis, machine translation and stylistic analysis; in document preparation and conversion; in typesetting chemical formulae; in chromosome recognition, etc. Although the term parsing is derived from the Latin phrase pars orationis, meaning part of speech, it is a technical term used for manual or automatic grammatical analysis. When the grammatical analysis involves word level analysis, it is called morphological parsing; when it involves phrase or chunk level analysis, it is shallow syntactic or simply shallow parsing; and when it involves clause or sentence level analysis, it is deep syntactic parsing or simply syntactic parsing. Similarly, if the analysis belongs to the discourse level, it can be called discourse parsing. In general terms, however, parsing is a cognitive or computational process of taking an input string and generating some sort of structure for it, e.g. generating a parse tree for an input sentence. As far as the end product of syntactic parsing is concerned, it is clear that parsing stays at the heart of treebanking, where the syntactic trees are produced by manual or semi-supervised methods. The notion of syntactic parsing is closely linked to the parsing model, which provides the grammar formalism that determines the nature of the output syntactic trees or graphs. As already mentioned in chapters one and two, there are two main approaches to syntactic parsing. One is based on the popular syntactic notion known as constituency and the other is based on the relatively obscure notion of syntax known as dependency. However, the last decade has shown renewed interest in various varieties of dependency grammar, particularly for parsing text corpora and developing dependency treebanks and parsers. The current work is in line with this resurgent dependency wave. The next section gives a brief account of the Indian version of dependency grammar.

3. Paninian Computational Grammar (PCG)

The goal of the Paninian approach is to construct a theory of human communication, i.e. how natural language is used to convey information to the hearer and how the hearer arrives at the intended meaning. Therefore, grammar is seen as the system of rules that establishes a correspondence between what the speaker intends to say and the corresponding utterance s/he produces, and also between what the hearer hears and the meaning s/he extracts from it. Paninian Grammar (500 B.C) was originally written for Sanskrit, and PCG is actually an attempt to interpret Paninian Grammar in a new light and apply it to all modern IA languages. According to Kiparsky & Staal (1969), PCG (Bharati et al., 1993) is a variant of dependency grammar. It has been used as the parsing model for all treebanks that are being built in India. It is for the same reason that it has also been used in the current syntactic annotation, which is the final level of annotation in building the dependency treebank of Kashmiri. This model helps to capture the syntacto-semantic relations which are instrumental in constructing a sentence. A sentence is considered as a series of modifier-modified relations with a primary modified, the main verb (VM), which is the root of the dependency stemma (graph or tree). The elements which modify the main verb are its arguments and adjuncts, which participate in the action specified by the verb. The relations of these participants with the main verb are called karaka. Since Kashmiri is a highly inflectional language, there are clear cut case markers or postpositions (vibhaktis) on the arguments and adjuncts that participate in an action/event. Such morpho-syntactic cues can be very instrumental in identifying the relation of arguments and adjuncts with their root. To some extent there is a one-to-one relation between the karakas and the case markers/postpositions. However, many constructions found in the corpus defy this expected correspondence. It has been found that such correspondences between karaka and vibhakti, along with TAM features, are very helpful in the syntactic annotation of Indian Languages, which are relatively free word-order in nature (ibid). For illustration consider the following sentence:

raath dits library nish bAshiir-an farooq-as neelofer-as khA:trI akh kitaab.

Yesterday give-PRF library near Bashir-ERG Farooq-DAT Neelofer-DAT

for one book

Yesterday Bashir gave a book to Farooq for Neelofer near library.


Figure.1. Paninian Dependency Graph

In the above sentence, there is an action represented by the finite verb dits (gave), which is also the root of the Paninian stemma shown in Figure.1. Since the verb is ditransitive in nature, it has three valency slots for arguments. Accordingly, there are three arguments represented by three NPs: Bashir is SUB, which has the agentive role and has the kartaa (k1) relation with the root; Farooq is IO, which has the semantic role of recipient or beneficiary and has the sampradaanaa (k4) relation with the root; and kitaab (book) is DO, which has the semantic role of patient and has the karma (k2) relation with the root. Besides these participating NPs, which fill the valency slots of the verb and play the core roles directed by it, there are additional NPs which are external to the predicate argument or sub-categorization frame of the verb and hence play secondary, non-participatory roles. Some of these NPs (which project from NSTs) provide the location of the action diyun (to give). The NP raath (yesterday) provides the temporal location and therefore has the kaala adhikarnaa (k7t) relation with the root, and library nish (near library) provides the spatial location and therefore has the disha adhikarnaa (k7p) relation with the root. However, the NP Neelofer is neither part of the sub-categorization frame nor does it stem from an NST. Hence, it does not provide any information related to direct participation in, or location of, an action or event, but represents an indirect participant which is the purpose of the action. Therefore, Neelofer is a purpose NP which has the Taadarthya (rt) relation with the root. The dependency labels that have been devised based on karakas are given in Figure.2. The description of these karakas is given in section four of this chapter.
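The analysis of the example sentence can be written down as a set of labelled head-dependent pairs. The following Python sketch is a minimal illustration and not the storage format used in the treebank. The relation labels are those named in the text, while the triple layout and the function names are hypothetical.

```python
# (dependent chunk, relation label, head) for the example sentence.
# The labels follow the description in the text.
RELATIONS = [
    ("bAshiir-an", "k1", "dits"),           # SUB, agent
    ("farooq-as", "k4", "dits"),            # IO, recipient
    ("akh kitaab", "k2", "dits"),           # DO, patient
    ("raath", "k7t", "dits"),               # temporal location
    ("library nish", "k7p", "dits"),        # spatial location
    ("neelofer-as khA:trI", "rt", "dits"),  # purpose
]


def find_root(relations):
    """Return the single element that is a head but never a dependent."""
    heads = {head for _, _, head in relations}
    dependents = {dep for dep, _, _ in relations}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots.pop()


def dependents_of(relations, head):
    """Map each relation label to the chunk attached to `head` with it."""
    return {label: dep for dep, label, h in relations if h == head}


root = find_root(RELATIONS)
attached = dependents_of(RELATIONS, root)
print(root)
print(attached)
```

In this flat example every chunk attaches directly to the finite verb, so the root check is trivial; with embedded clauses the same check still verifies that the stemma has one root.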


Figure.2. Grammatical Relations Shown in HTB Guidelines

Keeping aside the limitations and strengths of dependency grammar in general, the criticism leveled against PCG is that it lacks a tight formalism and does not distinguish between arguments and adjuncts. It is a fact that there are hardly any syntactic notions like transitivity or the argument-adjunct distinction either in the original Paninian grammar or in the current PCG. This is because it is essentially a syntacto-semantic theory that has hardly anything to do with the syntactic notions of sub-categorization, argumentation and adjunction. It must be noted, however, that syntactic categories like argument and adjunct can be easily extracted from the dependency labels themselves. Further, the notion of karaka is roughly equivalent to the notion of semantic role, and here the karaka relations are identified through the notions of semantic role, subject and object; otherwise, unless one has a complete hold on Sanskrit, it is impossible to know what a particular karaka is all about.
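The claim that the argument-adjunct distinction can be read off the dependency labels can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the set of argument labels below contains only the three labels that fill valency slots in the example sentence of this section, and treating every other label as an adjunct is a simplification made here for illustration. The names used are hypothetical.

```python
# Labels whose chunks fill the valency slots of the verb in the example
# sentence of this section. Treating all other labels as adjuncts is an
# assumption of this sketch, not a rule stated in the guidelines.
ARGUMENT_LABELS = {"k1", "k2", "k4"}


def split_arguments_and_adjuncts(labelled_chunks):
    """Split (chunk, label) pairs into argument chunks and adjunct chunks."""
    arguments = [c for c, label in labelled_chunks if label in ARGUMENT_LABELS]
    adjuncts = [c for c, label in labelled_chunks if label not in ARGUMENT_LABELS]
    return arguments, adjuncts


sentence = [
    ("Bashir", "k1"), ("Farooq", "k4"), ("kitaab", "k2"),
    ("raath", "k7t"), ("library nish", "k7p"), ("Neelofer", "rt"),
]
arguments, adjuncts = split_arguments_and_adjuncts(sentence)
print("arguments:", arguments)
print("adjuncts:", adjuncts)
```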

4. Description of Relational Labels

So far, the parts-of-speech tagset and the chunk tagset have been described in chapters four and five, respectively; each was a pre-requisite for carrying out the annotation at the respective level. Similarly, in order to carry out annotation at this level, i.e. syntactic annotation, an inventory of grammatical relations (GRs) is needed. The following table presents the set of GRs that have been used in developing the current treebank. These GRs, essentially Sanskritic, are given along with their interpretations and the attachment labels and their variants that have been used at this level of dependency annotation.

S. No | Name of the Relation | Interpretation | Relational Label | Variants
1 | Karta | SUB, Agent, Doer | k1 | pk1, jk1, mk1
2 | Karta Samanadhikarana | SUB complement, predicative JJ.1 | k1s | **
3 | Karma | OBJ, Patient, Goal, Destination | k2 | k2g, k2p
4 | Karma Samanadhikarana | OBJ complement, predicative JJ.2 | k2s | **
5 | Samanadhikarana | Noun Elaboration | rs | rs-k1, rs-k2
6 | Karana | Instrumental | k3 | **
7 | Sampradaana/Anubhava Karta | Recipient, Experiencer, Possessor | k4 | k4a, k4v
8 | Apaadaana | Source, Departure from the source | k5 | k5prk
9 | Vishaya/Kaal/Desha Adhikarana | Time/Space/Elsewhere Locational | k7 | k7t, k7p
10 | Shashthi | Genitive/Possessive | r6 | r6k1, r6k2
11 | Prati | Directional | rd | **
12 | Hetu | Reason/Cause | rh | **
13 | Taadarthya | Purposive | rt | **
14 | Saadrishya | Comparative/Similative | k*u | k1u, k2u, rsm
15 | Upapada Sahakaarakatwa | Associative | ras-* | ras-k1, ras-k2, ras-neg
16 | *** | Duratives | rsp | **
17 | *** | Address Terms | rad | **
18 | Kriyaa Visheshana | Adverbs/Sentential too | adv | sent-adv
19 | *** | Participlised N-Modifiers | nmod | **
20 | *** | Participlised/Gerundial V-Modifiers | vmod* | vmod_Rh, vmod_Inst
21 | *** | Yus/Yuth/Yeli Relative Clauses | *mod_Relc | nmod_Relc, jjmod_Relc, rbmod_Relc
22 | *** | Conjunct of Co-ordination | ccof | **
23 | *** | N/JJ Part of Complex Predicate | pof | **
24 | *** | Tense/Aspectual Fragment of Verb | fragof | **
25 | *** | Enumerator | enm | **

Table.1. Showing Grammatical Relations in KashTreeBank
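For annotation tools or validation scripts it can be convenient to hold the label inventory of Table.1 as data. The Python sketch below is one possible encoding, given only as an illustration: the dictionary layout and the function name main_label are hypothetical, while the labels and variants are copied from the table.

```python
# Main relational labels of Table.1 that have variants.
GR_VARIANTS = {
    "k1": ["pk1", "jk1", "mk1"],
    "k2": ["k2g", "k2p"],
    "rs": ["rs-k1", "rs-k2"],
    "k4": ["k4a", "k4v"],
    "k5": ["k5prk"],
    "k7": ["k7t", "k7p"],
    "r6": ["r6k1", "r6k2"],
    "k*u": ["k1u", "k2u", "rsm"],
    "ras-*": ["ras-k1", "ras-k2", "ras-neg"],
    "adv": ["sent-adv"],
    "vmod*": ["vmod_Rh", "vmod_Inst"],
    "*mod_Relc": ["nmod_Relc", "jjmod_Relc", "rbmod_Relc"],
}

# Main relational labels listed without variants (marked ** in the table).
LABELS_WITHOUT_VARIANTS = [
    "k1s", "k2s", "k3", "rd", "rh", "rt", "rsp",
    "rad", "nmod", "ccof", "pof", "fragof", "enm",
]


def main_label(label):
    """Return the main label of Table.1 under which `label` is listed."""
    if label in GR_VARIANTS or label in LABELS_WITHOUT_VARIANTS:
        return label
    for main, variants in GR_VARIANTS.items():
        if label in variants:
            return main
    raise KeyError("unknown relational label: %s" % label)


print(len(GR_VARIANTS) + len(LABELS_WITHOUT_VARIANTS))
print(main_label("k7t"), main_label("nmod_Relc"), main_label("k3"))
```

A lookup of this kind lets a script reject labels that are not in the inventory and group variant labels under their main relation when computing statistics.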

The twenty five main GRs given in Table.1, though used in developing the current dependency treebank of Kashmiri and those of other ILs (Hindi, Urdu, Telugu and Bangla), are not all dependency relations. Many of them are non-dependency in nature and are very important for accounting for the structure of a sentence. These GRs, dependency or non-dependency, can be divided into eight types, depending upon the nature of the relation they represent. The type wise description of the GRs (dependency and non-dependency, karaka and non-karaka) is given below with detailed examples, so that this description can also serve as guidelines for dependency annotation:

4.1. Type-one GRs

These include all six karaka relations which are the core of Ashtaadhyaayii. They include Karta labeled as k1, pk1, jk1 and mk1; Karma labeled as k2, k2g and k2p; Karna labeled as k3; Sampradaana or Anubhava Karta labeled as k4, k4a and k4v; Apadana labeled as k5 and k5prk; and Adhikarana labeled as k7, k7t and k7p. Type-one GRs also include three non-karaka relations: Karta-samanadhikarana and Karma-samanadhikarana, labeled as k1s and k2s, respectively, and Samanadhikarana, labeled as rs-k1 and rs-k2. The description of each Type-one grammatical relation (GR) is given below:

i. Karta <k1>

It is the most independent of all karakas. The chunk or clause having the k1 relation with the finite chunk is generally the subject with an agentive role, but there are non-agentive instances also. Therefore, karta can be either primary karta, which is volitional in nature, or secondary karta, which is non-volitional in nature. In nominative constructions, it is SUB <k1> which agrees with the AUXP in terms of number and gender. In short, the dependency relations which nominative and ergative marked subjects hold with their respective heads in non-causative active constructions are all karta relations. For example:

bI chu-s batI khey-vaan. (1)

I-NOM be-PRS rice eat-PROG

I am eating rice.

mea khe-yo batI. (2)

I-ERG eat-PRF.SG.MAS rice

I ate rice.

However, there are exceptional cases which may not fall under the

aforementioned criteria. In such cases the SUB may be marked with a case which

is dative by form but not by function. For example:

feroz-as pazi-nI zyaadI davIdav karin’* (3.a)

Feroz-DAT need-NEG more struggle do-INF.SG.FEM

Feroz need not struggle hard.

*farooq-as chi yi kitaab parIn’ (3.b)

Farooq-DAT be-PRS-SG-FEM this book read-INF.SG.FEM

Farooq has to read this book.

Here also, the DREL between the SUB and the verb is marked as k1 even though the SUB is non-volitional, since, as mentioned above, karta can be volitional or non-volitional. Therefore, it must be noted that volitionality (agentiveness) is not the sole criterion for being the karta of a verb, although it is the strongest criterion.

In Kashmiri, passives are formed by a combination of an infinitival oblique verbal

form -nI and a periphrastic auxiliary yun (to come) in perfective form, in which

the internal argument of the transitive verb surfaces as the subject of the sentence.

The agent of the action is not overtly realized and preferably omitted. Therefore,

the agentive phrase is optional. However, if the agent is realized, it is either in the

form of -zaryi or -athi phrase (a kind of by phrase).

farooq-an khuul kuluf. (ACTIVE VOICE) (4)

Farooq-ERG open-PRF lock


Farooq opened the lock.

farooq-ni zAryi aav kuluf khol-nI. (PASSIVE VOICE) (5)

Farooq-GEN by come-PRF lock open-PASS

The lock was opened by Farooq.

ii. Prayojaka, Prayojaya and Madhyastha Karta <pk1, jk1 and mk1>

As in other morphologically rich languages, causative and double causative verbs in Kashmiri are formed by a morphological process: by suffixing –Inaav in single causatives, where there is a causer and a causee, and by doubling the suffix –Inaav in double causatives, where, in addition to the causer and causee arguments, there is one more argument (NP chunk) called the mediator causer. The Prayojaka Karta is the causer NP, the Prayojaya Karta is the causee NP and the Madhyastha Karta is the mediator causer. The dependency relations which the causer, causee and mediator causer NPs hold with the causative verb (root/head) are labelled as pk1, jk1 and mk1, respectively, as shown in (6), (7) and (8). For example:

arshid-an dyaav-InA:v feroz-as athi aijaaz-as kitaab. (6)

Arshid-ERG give-PRF-CAU.SG.FEM feroz-DAT by aijaaz-ACC book

Arshid made Feroz give a book to Aijaaz.

Arshid-an dyaav-InA:v ferozas athi aijaazas kitaab. (7)

Arshid-ERG give-PRF-CAU.SG.FEM feroz-DAT by aijaaz-ACC book

Arshid made Feroz give a book to Aijaaz.

arshidan dyaav-Inaav-InA:v’ feroz-ni zaryi shaanu-vas athi

aijaa-zas akh kitaab. (8)

Arshid-ERG give-PRF-CAU-CAU.SG.FEM feroz-GEN through

shaanuv-DAT by aijaaz-ACC a book

Arshid made feroz give aijaaz a book through shanuv.

It is evident from the above examples that pk1 is the DREL of the ergative marked NP chunk (causer), jk1 is the DREL of the –athi marked NP chunk (causee) and mk1 is the DREL of the –zAryi marked NP chunk (mediator causer) with the root of the sentence, i.e. the causative verb.
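The marking-to-label correspondence just summarized can be expressed as a lookup table. The Python sketch below is illustrative only: the dictionary keys ("ERG", "athi", "zAryi") and the function name are hypothetical conventions chosen here, and the labels for the double causative follow the marking rule as stated in the text.

```python
# Marking on the NP chunk -> DREL it holds with the causative verb.
CAUSATIVE_LABEL_BY_MARKING = {
    "ERG": "pk1",    # ergative marked causer
    "athi": "jk1",   # -athi marked causee
    "zAryi": "mk1",  # -zAryi marked mediator causer
}


def label_causative_participants(participants):
    """Turn (np_chunk, marking) pairs into (np_chunk, label) pairs."""
    return [(np, CAUSATIVE_LABEL_BY_MARKING[marking])
            for np, marking in participants]


# NP chunks of the single causative in (6) and the double causative in (8).
single = label_causative_participants(
    [("arshid-an", "ERG"), ("feroz-as athi", "athi")]
)
double = label_causative_participants(
    [("arshidan", "ERG"), ("feroz-ni zaryi", "zAryi"), ("shaanu-vas athi", "athi")]
)
print(single)
print(double)
```

A table of this kind only covers the three causative participants; the remaining NPs of the clause (e.g. the dative recipient and the object) receive their labels by the criteria given under the respective relations.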


iii. Karta-samanadhikarana <k1s>

It is the DREL which a predicative JJ chunk holds with the verb. Such JJ chunks function as SUB complements. The NP chunks at the predicative position in copular constructions can also be considered SUB complements and labelled as k1s. For example:

farooq chu baDi neyk. (9)

Farooq be-PRS-SG-MAS very pious

Farooq is very pious.

farooq chu scholar. (10)

Farooq be-PRS-SG-MAS scholar

Farooq is a scholar.

iv. Karma <k2>

The OBJ NP chunks which have the semantic role of patient are the Karma, and the DREL they bear with their heads is labelled as k2, irrespective of whether the construction is in active or passive configuration. Finer distinctions have also been drawn within karma and labelled as k2p and k2g, which are dealt with separately. In ergative constructions (active voice), it is the OBJ (k2) which agrees with the verb, as shown in (12). In passive configurations also, the agreeing NP chunk is karma, as shown in (15). In short, the agreeing NP chunks in ergative constructions are karma. These are actually unmarked (accusative) OBJs in both nominative and ergative constructions, and the DRELs they hold with the root are labelled as k2. However, there are instances where a whole clause introduced by a subordinating particle can be treated as the OBJ. It is called Vakya Karma or clausal OBJ and is also labelled as k2, as shown in (13). For example:

farooq chu kitaab par-aan. (11)

Farooq be-PRS-SG-MAS book read-PROG

Farooq is reading a book.

farooq-an chi yim-I kitaab-I pArm-ItsI. (12)

Farooq-ERG be-PRS-SG-FEM this-Pl book-Pl read-PRF-PL.FEM

Farooq has read these books.

farooq-an von zi ta-s chu-nI kahn ti vyatsaan-Iy. (13)


Farooq-ERG say-PRF that he-DAT be-PRS-SG-MAS-NEG no one

impress-PROG-EMP

Farooq said that no one impresses him.

yuh-us A:y’ maah-i-ramzaan-as manz vaariyaa khAzIr bAgraavnI. (15)

This year come-PRF-PL-MAS in ramazaan-DAT lot of dates distribute-

PASS

This year in Ramadan lot of dates have been distributed.

v. Karma <k2p>

The OBJ NP chunks which have the semantic role of goal or destination are also Karma, but the DREL they bear with their heads is labelled as k2p, irrespective of whether the construction is in active or passive configuration. For example:

farooq-as chu garI gatsh-un. (16)

Farooq-DAT be-PRS-SG-MAS home go-INF

Farooq has to go home.

farooq gatsh-i garI. (17)

Farooq go-FUT home

Farooq will go home.

vi. Karma <k2g>

In the sentences with ditransitive verbs where there are no giver-recipient roles,

the second OBJ is called secondary karma and bears k2g DREL with the root. For

example:

yim pagal lukh chi gulzaaras piir sA:b vanaan. (18)

These insane people bePRS-Pl gulzaar-DAT saint call-HAB

These insane people call Gulzar a saint.

It must be noted that the semantics of the verb vanun (to call) presupposes that there is a person who calls, a thing/person that is called and a name by which it is called. All three are nominal in nature and cannot be attributive.

vii. Karma-samanadhikarana <k2s>

The difference between k2g and k2s can be very confusing, given that in both cases the verbs involved, e.g. vanun (to say) and maanun (to believe), are ditransitive in nature. However, in the latter case, as illustrated in (19) and (20), the predicative JJP or NP (in copula constructions) cannot be treated as an argument but as an OBJ complement. The reason for treating them as complements is that they are attributive in nature, carrying the attributes of the OBJ, rather than being nominal in nature like the OBJ, in which case they could have been treated as arguments. They are just like SUB complements. For example:

arshid chu shaanuv-as rut samjaan. (19)

Arshid be-PRS.SG.MAS shaanuv-ACC nice think-HAB

Arshid thinks that Shanuv is nice.

chiin chu hindostaan-as akh taaqathvar muluk maanaan. (20)

Chiin be-PRS-MAS hindustaan-ACC a strong country consider-HAB

China considers India a strong country.

viii. Karna <k3>

In sentences with a transitive verb, Karna is the NP chunk through which the action is carried out by the agent, or which is instrumental in carrying out the action. The instrumental role exists irrespective of the type of sentence configuration, i.e. whether the sentence is in active, passive or WH configuration. However, it should be noted that, unlike the causee in causative constructions, it is not part of the argument structure. The DREL of instrumental marked NP chunks has been labelled as k3. In short, –sI:t’ marked NPs are Karna, although –sI:t’ is ambiguous and shows syncretism between the instrumental and associative roles. For example:

farooq-an khuul kunz-i sI:t’ kuluf. (21)

Farooq-ERG open-PRF key-ABL with lock

Farooq opened the lock with a key.

maaji aaprov bachch-as chamchi sI:t’ batI (21)

Mother-ERG feed-PRF.SG.MAS kid-ACC spoon-ABL with rice

The mother fed the kid rice with a spoon.

kuluf aav kunz-i si:t’ khol-nI. (22)

Lock come-PRF.SG.MAS key-ABL with open-PASS

The lock was opened with key.


ix. Sampradaana <k4>

The OBJ NP chunks which have the semantic role of recipient or beneficiary, or which represent the final destination of an action, are the Sampradaana. In sentences with ditransitive verbs, it is the dative marked NP (IO) which is the recipient, beneficiary or final destination and holds the k4 DREL with the root. It must be noted that the semantic role of recipient is not constrained by the animacy feature in Kashmiri, and even inanimate OBJs can be marked with dative case, as shown in (26). For example:

farooq-an di-ts Suhail-as kitaab parnI khA:trI. (23)

Farooq-ERG give-PRF.SG.FEM Suhail-DAT book for reading

Farooq gave Suhail a book for reading.

tse vAn-ith mea raath rIts kath. (24)

you-ERG tell-PRF-2PC me-DAT yesterday nice talk

You told me yesterday a nice thing.

mea van-iy tse raath rIts kath. (25)

I-ERG tell-PRF-1PC.SG.FEM you-DAT yesterday nice talk

I told you yesterday a nice thing.

darvaaz-as diut-ukh kuluf (26)

door-DAT give-PRF-3PC.PL lock

They locked the door.

sw kitaab chi dAh-an rop-yan yiv-aan. (27)

That book be-PRS.SG.FEM ten-DAT rupee-DAT.PL.FEM come-PROG

That book costs ten rupees.

x. Anubhava Karta <k4a>:

The SUB NP chunk which has the semantic role of a passive experiencer, i.e. one who perceives through a process represented by the (intransitive) verb, is Anubhava Karta. In clauses with perception verbs, it is the perceiving entity. However, it must be noted that perceiving entities in Kashmiri are not constrained by the animacy feature, as shown in (32). In short, it is the dative marked SUB NP chunk which is the experiencer and has the k4a DREL with the root. For example:


mea lAj bochi. (28)

I-DAT.SG hurt-PRF.SG.FEM hunger

I felt hungry.

lADk-as peyi nindIr. (28)

boy-DAT.SG.MAS fall-PRF sleep

The boy fell asleep.

lADk-an tor fiqri (29)

boy-DAT.PL.MAS cross-PRF.SG.MAS understand

The boys understood.

kor-i gov shakh. (30)

girl-DAT.SG.FEM go-PRF.SG.MAS doubt

The girl got doubt.

kor-eyn aav mushuk. (31)

girl-DAT.PL.FEM come-PRF.SG.MAS smell

The girls caught smell.

makaan-as peyi traTh (32)

house-DAT.SG.MAS fall-PRF.SG.FEM lightning

The house was struck by lightning.

dukaan-as log naar. (33)

shop-DAT catch-PRF.SG.MAS fire

The shop caught fire.

tsuunT-is log daag. (34)

apple-DAT catch-PRF.SG.MAS stain

An apple caught stain.

makaan-as peyov pash vAs’ (35)


house-DAT fall-PRF.SG.MAS roof

The roof of the house fell down.

xi. Shashthi Karta <k4v>

It is a unique kind of Anubhava Karta specific to Kashmiri, whose semantic role is entirely different from that of the recipient/beneficiary/destination or the experiencer. This role is a sort of possessor-experiencer, with a clause structure like that of k1s, i.e. the clause structure of predicative adjectives and of copulas which simply state an existential state of affairs. The DREL labelled as k4v holds between a dative SUB, showing some existential state of affairs, and the root. For example:

farooq-as chi sath neychiv (36)

Farooq-DAT be-PRS.PL seven sons

Farooq has seven sons.

bAshiir-as chu sharaarath. (37)

Bashir-DAT be-PRS.SG.MAS anger

Bashir is angry.

farooq-as chu-nI heys-Iy. (38)

Farooq-DAT be-PRS.SG.MAS-NEG consciousness-EMP

Farooq is unconscious.

bAshiir-as chu safeyd mas. (39)

Bashir-DAT be-PRS.SG.MAS white hair

Bashir has white hair.

xii. Apaadaana <k5>

The NP chunk which is the source of an activity, i.e. the point of departure or

starting point of an action or activity, is the Apaadaana. In Kashmiri it is the

ablative marked NP followed by ablative marked location postposition, which

represents source or starting/departure point of an activity. It should be noted that

cases like (41) and (42) might appear confusing. However, only the former, which is

ablative marked, is a genuine example of k5; the latter, which is dative marked, is not. The

difference lies in the ablative marked and the unmarked locative postpositions, i.e.

peyTh-I and peyTh, which is enough to recognize that the former case is an

instance of Apaadaana while the latter can be an instance of a simple locative.

Semantically, it is obvious that the former case carries a sense of

departure (an ablative effect) which is lacking in the latter case. Even so, it would be

semantically anomalous to treat the latter case as a simple spatial locative adverbial,

because the ‘plate’ in (42) can be instrumental or even ablative but cannot be the space where

the action of eating takes place. The DREL which such ablative marked NPs

hold with the root of the clause is labelled as k5. For example:

kul-i peyTh-I pyov Duun pathar.* (40)

Tree-ABL on-ABL fall-PRF.SG.MAS walnut down

A walnut fell down from the tree.

farooq-an cheyi kawl-i manz-I treysh. (41)

Farooq-ERG drink-PRF.SG.FEM bowl-ABL in-ABL water

Farooq drank water from the bowl.

farooq chu kuTh-is manz paleT-as manz batI kheyvaan.* (42)

Farooq be-PRS.SG.MAS room-DAT.SG.MAS in plate-DAT.SG.MAS in

rice eat-PROG

Farooq eats rice in plate in the room.

farooq-as nish-I draav su raath. (43)

Farooq-DAT near-ABL left he yesterday

He left from Farooq (Farooq’s place) yesterday.

su nafar tsol az militaryvaaly-an nish-I (44)

That man run-PRF.SG.MAS today military man-DAT.SG.MAS

from-ABL

That man ran away today from military.
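The surface contrast between ablative marked and unmarked postpositions described above can be sketched as a simple check. This is only an illustrative heuristic and not part of the annotation scheme itself: the function name `is_apaadaana_candidate` is hypothetical, and the sketch assumes that the ablative marker always appears as a final `-I` on the postposition, as in the transliterated examples (40), (41), (43) and (44).

```python
def is_apaadaana_candidate(postposition: str) -> bool:
    """Return True if the postposition carries the ablative marker '-I'.

    Assumption: the ablative marker is written as a final '-I', as in
    peyTh-I, manz-I and nish-I in the examples above.
    """
    return postposition.endswith("-I")


# Ablative marked postpositions, as in (40) and (41): candidates for k5.
print(is_apaadaana_candidate("peyTh-I"))
print(is_apaadaana_candidate("manz-I"))
# Unmarked postposition, as in (42): not a candidate for k5.
print(is_apaadaana_candidate("manz"))
```

A check of this kind can only propose k5; the final decision still rests on the semantic judgement discussed above.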

There are cases where the activity represents a process through which change of

state of a substance occurs. In such cases, the source point from which that change

starts is the NP representing one substance (raw/natural material) which changes

into another substance (the product), represented by another NP. The former NP

which undergoes change is said to be the Prakruti Apaadaana and its DREL with

the root is represented by a variant of k5, i.e. k5prk. The dative marked NP which

represents base substance or raw material is the Prakruti Apaadaana. For example:

guur chu dod-as tsaaman banaavaan. (45)

Milkman be-PRS.SG.MAS milk-DAT.SG.MAS cheese make-PROG

Milkman makes cheese out of milk.

sIts chu kapr-as palav suvvaan. (46)

Tailor be-PRS.SG.MAS cloth-DAT.SG.MAS dress-Pl sew-PROG

Tailor makes dress out of cloth.

In the above examples (45) and (46), it is clear that there are two states of a

substance, the first is the natural or the original state and the second is the finished

or the changed state of the substance. The former is called Prakruti and the latter

one is called Vikruti.

xiii. Desha Adhikarana <k7p>

The NP chunk which denotes a location in space for an action, involving different

participants like Karta, Karma or Karna, is called Desha Adhikarana. These are

not only those cases which are typical spatial locatives and are tagged as NST at

POS level, e.g. yetyth (here), hotyth (there), etc but also those cases which are

typical nouns and are tagged as NN/NNP at POS level, e.g. garI (home), saDakh

(road), kuTh (room), etc. However, both are NPs at chunk level, therefore, any NP

which provides spatial location for an event, action or a state is Desha

Adhikarana. It is generally a dative marked NP followed by a postposition

indicating spatial location. The combination of the –DAT marker and a locative

postposition, e.g. -as peyTh, in Kashmiri has corresponding complex postposition

in Urdu/Hindi, e.g. ke uupar. It has been found that whatever role genitive (ke)

performs in the formation of Hindi/Urdu complex postpositions (where it ceases

to be genitive and the postposition projects itself in non-compositional way), the

dative markers (-i/-yan/-as/-av) in Kashmiri perform the same role but it can’t

form a complex postposition as the dative itself is merely a marker (a bound form)

unlike genitive of Urdu/Hindi which occurs as postposition (a free form).

However, the –DAT in Kashmiri also ceases to be dative when it is followed by

locative postposition, and the GREL it represents has a totally locative interpretation, as

if the markers, –i, –yan, –as, and –av are no more dative markers but mere

obliqueness markers. It must be noted that the status of the dative markers is

controversial in Kashmiri: Koul and Wali (2006) treat them as –DAT, but Emily

Manetta (2008*) treats them as obliqueness markers. The DREL of Desha

Adhikarana with the root of clause is labelled as k7p. For example:

tati Os Farooq kursy-i peyTh bih-ith. (47)

there be-PST.SG.MAS Farooq chair-DAT.SG.FEM

on sit-PART

There Farooq was sitting on the chair.

su chu bon-i tal aaraam karaan. (48)

He be-PRS.SG.MAS chinaar tree-DAT.SG.FEM under relax do-PROG

He is relaxing under the Chinar tree.

farooq chu kuTh-is manz paleT-as manz batI kheyvaan.* (49)

Farooq be-PRS.SG.MAS room-DAT.SG.MAS in plate-DAT.SG.MAS in

rice eat-PROG

Farooq eats rice in plate in the room.

mysuur-as manz chu mosam baDi jaan rozaan. (50)

Mysore-DAT.SG.MAS in be-PRS.SG.MAS weather very

pleasant remain-PROG

In Mysore weather remains very pleasant.

xiv. Kaal Adhikarana <k7t>

The NP chunk which denotes location in time for an event, action or an activity

involving various participants is called Kaal Adhikarana. These not only include

the cases which are typical temporal locatives and are tagged as NST at POS

level, e.g. vun’ (now), patI (later), etc. but also those cases which are typical

nouns and are tagged as NN/NNP at POS level, e.g. shaam-as (in the evening),

pagah (tomorrow), 1950-has manz (in 1950), etc. Both the cases will be NPs at

the chunk level which have adverbial function. Such chunks which provide the

temporal location of an event or an action are usually dative marked NPs. For

example:

subh-as gayizi garI panun. (51)

Morning-DAT go-2PC home own

In the morning go your own home.

1947-has manz gov heyndoshtaan aazaad. (52)

1947-DAT in go-PRF.SG.MAS india free

In 1947 India got freedom.

raath vot farooq garI paantsi baji. (53)

Yesterday reach-PRF.SG.MAS Farooq home five O’clock

Farooq reached home yesterday at five O’clock.

patI khot tamm-is taph. (54)

then rise-PRF.SG.MAS he-DAT fever

Then he caught fever.

xv. Vishaya Adhikarana <k7>

The NP chunk which denotes location of an event, action or an activity elsewhere,

i.e. other than concrete space and time is called Vishaya Adhikarana. Generally,

the NPs which are non-spatial and non-temporal in nature are marked with dative

and followed by locative postposition like Desha and Kaal Adhikarana. Such NPs

include abstract entities that capture some notion of abstract space. The DREL

such NP holds with root is labelled as k7. For example:

kejriiwaal chu az kal surkhi-yan manz aasaan. (55)

Kejriwal be-PRS.SG.MAS today tomorrow headline-DAT.Pl.FEM in

remain-PROG

These days Kejriwal continuously remains in the headlines.

myaa-ni kath-i peyTh khot tAmm-is sakh sharaarath. (56)

My talk-DAT.SG.FEM

on climb-PRF he-DAT.SG.0 very anger

He became very angry on my argument.

siyaast-as manz vasun chu-nI Thiekh. (57)

Politics-DAT in enter-INF be-PRS.SG.MAS -NEG good

To enter into politics is not good.

tAm’-sInd-is khayaal-as manz chi sA:rii panni-panni shaayi Thiekh. (59)

He-GEN-DAT.SG.MAS idea-DAT.SG.MAS in be-PRS.Pl.MAS

own-GEN.SG.FEM good

In his opinion everyone is correct at his place.

9.2. Type Two GRs

It includes only two GRs, r6 and rsp. These are non-karaka but dependency

GRs which hold between two nominals, i.e. between two nouns or between a

pronoun and a noun. Such a relation rarely occurs between two pronouns. The descriptions of the r6

and rsp DRELs are given below.

11.1. Shashthi <r6: r6-k1, r6-k2>

The NP chunks which have a genitive relation with other NP chunks are the

Shashthi. In the Shashthi GR, as in any other dependency relation, there are a

dependent and a head: the possessor and the possessed. The first NP is usually the

possessor and the second NP the possessed; the possessed NP is the

head and the possessor the dependent. This is the only GR that holds between two

nominals and not between a nominal and a verb root.

For example:

farooq-un boD bOy chu polices-as manz. (60)

Farooq-GEN elder brother be-PRS.SG.MAS in the police-DAT

Farooq’s elder brother is working in police.

tAm’ sInz majbuurii ma kariv nazar andaaz. (62)

He of-SG.FEM predicament not do-Pl ignore

Do not ignore his predicament.

kaam-i hInz jaldii kin’ draav su garI. (63)

work-ABL.FEM of-SG.FEM hastiness

Because of urgency of work he went home.

yeti-ch sadakh chi vaariyah kharaab. (64)

here-GEN road be-PRS.SG.MAS very bad

Road of this place is very damaged.

makaan-uk pash peyov vAs’. (65)

house-GEN roof fall-PRF down

The roof of the house fell down.

The label <r6> is an underspecified one which has two realizations or variants,

i.e. r6-k1 and r6-k2. <r6-k1> is assigned if the possessed is k1, i.e. if the head

NP is k1 and the dependent NP is attached to it. Similarly, if an NP is in a genitive

relation with another NP and the head NP is k2, the dependency relation which the

dependent NP has with it is labelled as r6-k2.

For example:

insaan-I sund athI chu vaariyah qismI-chi qaami heyqaan kArith. (66)

Human-ABL.MAS of-SG.MAS hand be-PRS.SG.MAS lot of

type-GEN.Pl.FEM work-Pl able do-NF

Human hand can do many types of work.

In the above sentence (66), the first genitive marked NP is dependent on the NP

which has the k1 relation with the root of the sentence and the second genitive

marked NP is dependent on the NP which has k2 relation with the root.
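The choice between the underspecified label and its two variants can be sketched as a small function. This is an illustrative sketch only: the function name `resolve_r6` and the representation of relations as plain strings are assumptions made for this example, not part of the annotation tools used in this work.

```python
def resolve_r6(head_relation: str) -> str:
    """Pick the r6 variant for a genitive NP from its head's relation to the root."""
    if head_relation == "k1":
        return "r6-k1"
    if head_relation == "k2":
        return "r6-k2"
    # Any other head relation keeps the underspecified label.
    return "r6"


# In (66) the first genitive NP depends on a k1 head and the second on a k2 head.
print(resolve_r6("k1"))
print(resolve_r6("k2"))
```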

11.2. Duratives <rsp>

The NP chunks which indicate the duration (temporal) or span (spatial) of an

event, an action or a state are durative expressions. There are two points in durative

expressions, viz. a point of starting and an end point. The duratives, consisting of

two NPs, may function as temporal, spatial or manner adverbs. The starting point

NP depends upon the end point NP and the DREL between them is labelled as

<rsp>. In Kashmiri, the initiating postposition is the ablative marked locative (peyTh-I)

and the terminating postposition is the dative marked locative (taam). For example:

1982(-I) peyTh-I 2012-as taam ruudus bI baDI khosh. (67)

1982(-ABL) on-ABL 2012-DAT to remain-PRF-1PC.SG.MAS I very

happy

From 1982 to 2012 I remained very happy.

Kashmiri-I peyTh-I kanyaakumaarii taam cha-nI kahn-ti train. (68)

Kashmir-ABL on-ABL Kanyakumari to be-PRS.SG.Fem-NEG any-Emp

train

There is no train from Kashmir to Kanyakumari.
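The attachment described for duratives, in which the starting-point NP depends on the end-point NP, can be sketched as follows. The `Arc` structure and the function `durative_arc` are hypothetical names introduced only for illustration, and the sketch assumes that the postposition is the last whitespace-separated token of each chunk and that the optional ablative marker on the year in (67) is written out.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    """A labelled dependency arc between two chunks (hypothetical structure)."""
    dependent: str
    head: str
    label: str


def durative_arc(start_np: str, end_np: str) -> Arc:
    """Attach the starting-point NP to the end-point NP with the rsp label."""
    # Assumption: the postposition is the last whitespace-separated token.
    if start_np.split()[-1] != "peyTh-I":
        raise ValueError("starting-point NP must end in peyTh-I")
    if end_np.split()[-1] != "taam":
        raise ValueError("end-point NP must end in taam")
    return Arc(dependent=start_np, head=end_np, label="rsp")


# The two NPs of example (67).
arc = durative_arc("1982-I peyTh-I", "2012-as taam")
print(arc)
```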

9.3. Type-Three GRs

These include the rd, rh, rt, ku, ras, rad, rac, rab and rin GRs, which are also non-karaka

dependency relations. Like the Type-One GRs, they hold between the dependent NP and the root of the

clause, unlike the Type-Two GRs, which hold between a

dependent NP and a non-root head. The description of these relations is given

below:

i. Prati <rd>

The NP chunk which indicates the direction of an activity is the Prati or the

directional NP. In Kashmiri, ‘kun’ is the directional postposition and any NP

consisting of directional postposition is Prati. It has been observed that like the

NPs consisting of locative postpositions, the NPs which consist of directional

postpositions are marked with dative. Also, like locative NPs, the directional NPs

are mere adjuncts. It must be noted that there are certain directionals which do not

really show the direction of an activity but show some metaphorical direction as

shown in (72). The GR which such NPs bear with the root is labelled as ‘rd’. For

example:

farooq draav darvaaz-as kun. (69)

Farooq go-PRF door-DAT towards

Farooq went towards the door.

su chu mea kun vuchaan. (70)

He be-PRS.SG.MAS me towards see-PROG

He is looking towards me.

su chu dochun kun pakaan. (71)

He be-PRS.SG.MAS right towards

He moves towards right.

tim luukh A:s’ puliis-as mukhA:lif naarI bA:zii karaan. (72)

Those people be-PST.Pl.MAS opposite protest do-PROG

Those people were protesting against police.

ii. Hetu <rh>

The NP chunk which indicates the reason or cause of an activity is the Hetu or the

reason NP. In Kashmiri, (sI:t’ and kiny) are the reason postpositions and any NP

consisting of reason postposition is Hetu. However, these NPs need to be clearly

distinguished from the instrument NPs which also use (sI:t’) postposition. (sI:t’)

is actually an instance of case syncretism in Kashmiri. The GR which such NPs

bear with the root is labelled as ‘rh’. For example:

tami vajah kiny heyok-nI police-an su band thA:vith. (73)

That reason because of can-PRF.SG.MAS -NEG police-ERG he

close keep-PRF

Because of that reason, the police could not keep him locked up.

tami sabb-I gov mea hana tseyr. (74)

That reason-ABL go-PRF.SG.MAS I little late

Because of that reason I am late.

ami sardii sI:t’ pevos bI ti beymaar. (75)

This cold due to fall-PRF.1P.SG.MAS I too ill

I too have fallen ill due to/because of this cold.

ShoTh Os dagi sI:t’ chwrI chwrI karaan. (76)

I be.PST.SG.MAS pain due to shiver-PROG

I was shivering due to pain.

iii. Taadarthya <rt>

The NP chunk which indicates the purpose of an activity is the Taadarthya or the

purpose NP. In Kashmiri, khA:trI, mokhI, muujuub, and baapath (for) are the

purpose postpositions and any NP consisting of purpose postposition is

Taadarthya. However, sometimes VGNN also performs purposive role as shown

in (80) in which a gerund is marked with purposive case. The GR which such NPs

bear with the verb root is labelled as ‘rt’. For example:

farooq-ni khA:trI pyov yi soorui karun. (77)

Farooq-GEN for fall-PRF.SG.MAS this all do-INF

All this needed to be done for Farooq.

kaam-i baapath aas bI yor. (78)

Work-ABL for come-PRF.SG.MAS I here

I came here for work.

chaan-i mokhI/muujuub gatshi sw shahar. (79)

You-GEN for go-FUT she Srinagar

For you she will go to Srinagar.

batI khe-yth draav su shong-ni. (80)

Rice eat-PART go-PRF he sleep-GER-PUR

Having eaten rice he went for sleeping.

iv. Saadrishya <ku: k1u, k2u>

The NP chunk which indicates the similarity or comparison (expressed through

predication) between two entities is the Saadrishya or the comparand NP. In

Kashmiri, khotI and nish are the comparand postpositions, whereas pA:Th’ and

hiuv or hish are the similative postpositions. Hence, any NP consisting of a

comparand or similative postposition is Saadrishya. However, the NPs with the

pA:Th’ postposition should not be confused with adverbials like Thiekh-pA:Th,

nor should the nish postposition be confused with the locative postposition. The

forms nish and pA:Th’ are ambiguous and can perform two functions in two

contexts. These are actually two more instances of case syncretism in Kashmiri.

The GR which such NPs bear with the root can be labelled as ‘k*u’ but if the

comparison or similarity is with SUB NP, the GR of comparand NP is labelled as

k1u and similarly if it is with OBJ NP, the GR is labelled as k2u. Actually, the

star mark (*) can be seen as a variable mark which is substituted with any karaka,

depending upon the comparee. For example:

koshur treebank chu vuni hindi tI urdu treebank-av khotI/nish

vaariyaa lokut. (81)

Kashmiri treebank be-PRS.SG.MAS yet Hindi and Urdu

treebank-ABL.Pl.MAS as

compared to very small

Kashmiri treebank is yet very small as compared to Hindi and Urdu

treebanks.

farooq chu bAshiir-In’ pA:Th’ rut insaan. (82)

Farooq be-PRS.SG.MAS Bashir-GEN like good person

Farooq is a good person like Bashir.

bAshiir oos deyv hiuv insaan ??? (83)

Bashir be-PST.SG.MAS giant like-SG.MAS person

Bashir was a giant-like man.

faaroq-as baasey sw kuur nowshiin-as hish. (84)

Farooq-DAT feel-PRF.SG.FEM that-SG.FEM girl

Nowsheen-DAT like-SG.Fem

That girl looked like Nowsheen to Farooq.

v. Upapaada-sahakaaraktawa <ras: ras-k1, ras-k2, ras-neg>

The NP chunk which indicates the association of an entity with other entity in

performing an activity is the Upapaada-sahakaaraktawa or the associative NP. In

Kashmiri, sI:t’, saan, heyth, bagA:r and varA:y are the associative postpositions

and any NP consisting of an associative postposition is Upapaada-sahakaaraktawa. The sI:t’ is a

positive associative postposition, whereas bagA:r and varA:y are negative

associative postpositions. The GR which such NPs bear with the verb root can be

labelled as <ras> but when the NP is associative of SUB, it is labelled as <ras-k1>

and when it is an associative of an OBJ, it is labelled as <ras-k2>. However, when

there is any negative associative postposition it is labelled as <ras-neg>.

For example:

su draav pann-is mA:l-is sI:t’ chakr-as. (85)

He go-PRF.SG.MAS own-DAT father-DAT with walk-DAT

He went on a walk with his father.

hindostaan chu chiin-as sI:t’ kath baath karn-I baapath tayaar.* (86)

India be-PRS.SG.MAS China-DAT talk do-GER-GEN for ready

India is ready to talk with China.

farooq-an kheyov tshuunTh deyl heyth/deyl-I saan. (87)

Farooq-ERG eat-PRF.SG.MAS apple peel with/along with

Farooq ate apple with/along with peel.

miltry-vA:l’ A:s’ saaman-I varA:y/ bagA:r natsaan. (88)

Military men be-PST.Pl.MAS weapons-ABL without roaming

Soldiers were roaming around without weapons.
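The labelling of associative NPs described above can be sketched as a decision function. This is only one reading of the description: the names `ras_label` and `NEGATIVE_ASSOCIATIVES` are hypothetical, the postpositions are written with a plain ASCII apostrophe and colon, and the sketch assumes that a negative postposition takes priority over the SUB/OBJ distinction.

```python
# Negative associative postpositions named in the text.
NEGATIVE_ASSOCIATIVES = {"bagA:r", "varA:y"}


def ras_label(postposition: str, associate_of: str = "") -> str:
    """Choose among ras, ras-k1, ras-k2 and ras-neg.

    associate_of is "SUB", "OBJ" or "" when the associated chunk is unknown.
    Assumption: a negative postposition always yields ras-neg.
    """
    if postposition in NEGATIVE_ASSOCIATIVES:
        return "ras-neg"
    if associate_of == "SUB":
        return "ras-k1"
    if associate_of == "OBJ":
        return "ras-k2"
    # Without information about the associated chunk, stay underspecified.
    return "ras"


print(ras_label("sI:t'", "SUB"))
print(ras_label("saan", "OBJ"))
print(ras_label("varA:y"))
```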

vi. Address Terms <rad>

The NP chunk, which indicates addressing to some person, bears a DREL with

the verb root which is labelled as <rad>. Some of the address terms are overtly

marked with vocatives but others are inherently vocative. For example:

moj-ai, mea di-tay batI. (89)

Mother-VOC.SG.MAS me give-2PC.SG.FEM rice.

Mother, give me rice.

farooq-aa, tala yuur’ yi. (90)

Farooq-VOC.SG.MAS tala-MOOD here come

Farooq, you come here.

jinaab, tAm’ von zi bI gatsh-I garI. (91)

Sir, he say-PRF that I go-FUT.1PC.SG home

Sir, he said that he will go home.

hayaa, so kitaab maqlA:v-tha pAr-ith. (92)

Hey-SG.MAS that book finish-WH.SG.MAS read-PART

Hey! You finished that book.

vii. Information Source <rac>

The NP chunk, which indicates the source of an information or point of view

which may or may not be a person, is information source NP. In Kashmiri, the NP

consisting of genitive marked nominal and informational postpositions (mutA:bik

and hisaabI) bears a DREL with the verb root which is labelled as <rac>. For

example:

farooq-ni mutA:bik pazi mea garI vaapas yun. (93)

Farooq-GEN according to should I home back come-GER

According to Farooq I should come back home.

chaa-ni hisaab-I os su apuz vanaan. (94)

You-GEN way-ABL be-PST.SG.MAS he lie tell-PROG

According to you he was telling a lie.

viii. Information Target <rab>

The NP chunk, which indicates an entity towards which all the information of a

proposition is directed, is information target NP. It can be also a clause. In

Kashmiri, the NP consisting of dative marked nominal and the target postposition

mutliq (about) bears a DREL with the verb root which is labelled as <rab>. For

example:

farooq-an vAn’ mea bAshiir-as mutliq akh kath. (95)

Farooq-ERG tell-PRF.SG.FEM me Bashir-DAT about one talk

Farooq told me something about Bashir.

ix. Hurdle <rin> in spite of

The NP chunk, which indicates a hurdle in an activity that has been overcome, is

the hurdle NP. The hurdle can also be a whole clause. In Kashmiri, the NP

consisting of genitive marked nominal and hurdle postposition baavojuud (in spite

of or despite) bears a DREL with the verb root which is labelled as <rin>. For

example:

tAm’sIndi inkaar karnI baavojuud pyo mea tor gatshun. (96)

He-GEN denial in spite of fall-PRF.SG.MAS I go-INF

In spite of his denial, I had to go there.

pareshA:ni-yav baavojuud ruud-us su panIn’ kA:m karaan. (97)

Difficulty-Pl.FEM in spite of keep-PRF.SG.MAS -1PC

he own work do-PROG

In spite of difficulties, he kept doing his work.
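The postpositions named in the descriptions of the Type-Three GRs can be collected into a lookup table, which also makes the syncretic forms visible. This is a sketch for illustration only: the names `POSTPOSITION_READINGS`, `candidate_readings` and `is_ambiguous` are hypothetical, the transliteration uses a plain ASCII apostrophe, and the bracketed entries are informal placeholders for readings outside the Type-Three set rather than labels of the scheme.

```python
# Candidate readings for each postposition, as named in the descriptions above.
# Bracketed entries are informal placeholders, not labels of the scheme.
POSTPOSITION_READINGS = {
    "kun": ["rd"],
    "kiny": ["rh"],
    "sI:t'": ["rh", "ras", "(instrument)"],
    "khA:trI": ["rt"],
    "mokhI": ["rt"],
    "muujuub": ["rt"],
    "baapath": ["rt"],
    "khotI": ["ku"],
    "nish": ["ku", "(locative)"],
    "pA:Th'": ["ku", "(adverbial)"],
    "hiuv": ["ku"],
    "hish": ["ku"],
    "saan": ["ras"],
    "heyth": ["ras"],
    "bagA:r": ["ras-neg"],
    "varA:y": ["ras-neg"],
    "mutA:bik": ["rac"],
    "hisaabI": ["rac"],
    "mutliq": ["rab"],
    "baavojuud": ["rin"],
}


def candidate_readings(postposition: str) -> list:
    """Return the candidate readings, or an empty list for unknown forms."""
    return POSTPOSITION_READINGS.get(postposition, [])


def is_ambiguous(postposition: str) -> bool:
    """A form with more than one reading needs context to be resolved."""
    return len(candidate_readings(postposition)) > 1


print(candidate_readings("kun"))
print(is_ambiguous("sI:t'"))
```

A table of this kind does not cover NPs that express a relation through case marking alone, such as the ablative marked reason NP in (74).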

9.4. Type Four GRs

Type-Four relations essentially capture the adverbial relations of inherent

adverbs and participles. Since the same participles also modify nouns, participial

noun modifiers are also included in this class of GRs. It includes adverbials (adv

and sent-adv), vmod (vmod_Rh and vmod_Inst) and nmod.

i. Adverbial <adv and sent-adv>

The manner adverbs which are discontinuous, and hence do not form a chunk with the

verb but give rise to their own chunk projections tagged as RBPs, depend on the

verbal root and the DREL they bear with the root is labelled as <adv>. Similarly,

the discourse particles which form BLK chunk are considered to be sentential

adverbs and the DRELs they bear with the verb root are labelled as sent-adv. It

must be noted that other traditional adverbs, i.e. adverb of time and place, don’t

form RBP and hence do not bear adv DREL with the root. The examples of RBP

and BLK chunks that bear adv and sent-adv DRELs, respectively, with the root are

given below:

su draav vaarI vaarI garI kun. (98)

He leave-PRF.SG.MAS slowly towards home.

He left towards home slowly.

tim chi yim-I kath-I baar baar karaan. (99)

They be-PRS.Pl.MAS these talk-Pl again do-PROG

They talk it again and again.

yithI-pA:Th’ gatsh-an-nI yim-I kath-I karIn-i. (100)

This like should-2PC-NEG these talk-Pl do-INF.Pl

You shouldn’t talk like this.

ii. Participial Verb Modifier <vmod>

The participial forms which may or may not constitute non-finite clauses are

projected as chunks and are attached to the root in order to show that they bear

modifier relation with it. In Kashmiri, the –ith and –aan forms of verb are

participial forms and constitute VGNF chunks. The –ith forms like shongith

(having slept), bihith (having sat), tsA:p’ith (having chewed), etc. may be

sequential, consequential or instrumental in nature as far as their role is concerned,

whereas the –aan forms like pakaan pakaan (while walking), gindaan gindaan

(while playing), etc. are actually progressive/habitual forms which express

simultaneity on reduplication and function as verb modifiers. The –ith forms are

also reduplicated but when reduplicated they change their form like shongith

(having slept) changes into shong’ shong’ to encode simultaneity and

sequentiality together in more complex way in order to express manner. The

DRELs which all these variants of VGNF bear with the root are labelled with an

underspecified label <vmod>. For example:

batI khe-yth draav su shong-ni. (101)

Rice eat-PART go-PRF he sleep-GER-PUR

Having eaten rice he went for sleeping.

mAr’-mAr’ chu su kA:m karaan. (102)

die-PART-RED be-PRS-SG.MAS he work do-HAB

He works slowly.

pakaan-pakaan os su malaayi kulfii kheyvaan. (103)

Walk-HAB-RED be-PST.SG.MAS he ice cream eat-PROG

While walking he was eating ice cream.

iii. Participial Noun Modifier <nmod>

Some participial forms are projected as VGNF chunks like the verb modifier

participials as mentioned above but are attached to the modified noun instead of

the root to show their modifier relation. In Kashmiri, the –vun, –vol and –ith

forms are noun modifier participials. The –vun forms like asvun (laughing) and

vudvun (flying), –vol forms like asanvol (laughing) and natsanvol (dancing) and –

ith forms like bihith (sitting) and shongith (sleeping) are the noun modifier

participials and the DREL they bear with the nouns is labelled as <nmod>.

For example:

asvun insaan chu saarinIy khosh kar-aan. (104)

smile-PART person be-PRS.SG.MAS all happy do-HAB

All people like a smiling person.

gindan-vol insaan chu chust dusrust rozaan. (105)

Play-PART person be-PRS.SG.MAS healthy remain-HAB

The person who plays remains healthy.

kullis peyTh bihith kaav chu Taav Taav kar-aan. (106)

tree-DAT on sit-PART crow Taav Taav do-PROG

The crow sitting on the tree is crowing.

9.5. Type Five GRs

It includes clausal modifications brought about by relative clauses. Relative

clauses are embedded clauses introduced by relative pronouns. The relative

pronouns have their corresponding pronouns in the matrix clause and therefore,

these clauses are called relative–correlative constructions. In Kashmiri, yus–su,

yuth–tyut, and yithIpA:Th’–tithIpA:Th’ are relative–correlative elements.

Relative clauses modify noun/pronoun, adjective and adverb. The description of

such GRs is given below:

i. Relative clause Nominal Modification <nmod_Relc>

In Kashmiri, the relative element which introduces a relative clause to modify a

nominal is yus and its correlative is su or any other noun. These relative elements

are either the relative pronouns or the relative demonstratives. The DREL which

these relative clauses bear with the non-root nominal head is that of noun

description, which is different from the normal nominal modification brought about by

inherent adjectives or participials. Although the name used for noun

description is nominal modification, it is labelled differently as <nmod_Relc>.

For example:

su nafar yus otrI gov garI yiyi pagah vaapas. (107)

That man who day before yesterday go-PRF.SG.MAS home be-FUT.SG

tomorrow return

The man who went home day before yesterday will return tomorrow.

su nafar yiyi pagah vaapas yus otrI garI gov. (108)

That man be-FUT.SG tomorrow return who day before yesterday home go-PRF.SG.MAS

The man will return tomorrow who went home day before yesterday.

bI chus sw kitaab paraan yath peyTh bavaal os voth-mut. (109)

I be-PRS.SG.MAS .1PC that-SG.FEM book read-PROG

which-DAT on hue and cry be-PST.SG.MAS stand-PRF.SG.MAS

I read that book on which there was hue and cry.

tAm’ dits tas lADkI-as kitaab yAm’ tas mAnj’. (110)

He give-PRF.SG.FEM that-DAT boy-DAT book who-ERG

ask-PRF.SG.MAS him

He gave the book to the boy who asked him for it.

farooq-an lyokh tami qalmI sI:t’ zindagii hund falsafI yami sI:t’

tAm’ mea kA:trI akh chiTh leych-mIts A:s. (111)

Farooq-ERG write-PRF that pen with life of philosophy which with

he-ERG I for one letter write-PRF.SG.FEM be-PST.SG.Fem

Farooq wrote philosophy of life with that pen with which he had

written a letter to me.

ii. Relative clause Adjectival Modification <jjmod_Relc>

The relative elements which introduce relative clauses in order to modify deictic

adjectives are yuth, yith’, yitsh and yitshI and their corresponding correlatives are

tyuth, tith’, titsh and titsI, respectively. For example:

su chu huu-ba-huu tiyuth-ui yuth tas panun mwl os. (112)

He be-PRS.SG.MAS exactly like-that like-which he-DAT

his father be-PST.SG.MAS

He is exactly like that like which his father used to be.

tiyuth-ui jacket An’-zi yuth mea raath on. (113)

like-that-EMP jacket buy-IMP.2PC like-which I buy-PRF yesterday

You buy that kind of jacket like which I bought yesterday.

iii. Relative clause Adverbial Modification <rbmod_Relc>

The relative elements which introduce relative clauses in order to modify an

adverb or deictic adverb are yithI-pA:Th’, yithI-kIn’ and yami tAriiqI and their

corresponding correlatives are tithI-pA:Th’, tithI-kIn’ and tami tAriiqI. For

example:

tithay-pA:Th’ kAri-zi az ti rut dance yithIpA:Th’ raath kor-uth. (114)

Like-that do-IMP.2PC today also nice dance like-which yesterday

do-PRF.SG.2PC

You dance as nicely today also as you did yesterday.

yemi tAriiqI tami mea von tami tAriiqI pyov mea karun. (115)

Like-which she I-DAT tell-PRF like-that had I-DAT do-INF

As she told me I had to do like that.
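The three relative clause relations and their relative–correlative pairs can be summarised in a small lookup. This is an illustrative sketch under stated assumptions: the names `RELATIVE_PAIRS` and `relative_clause_label` are hypothetical, the forms are the citation forms listed in the descriptions above written with a plain ASCII apostrophe, and inflected variants such as yath or yAm’ in the examples are not covered.

```python
# Relative element -> correlative, grouped by the label of the relative clause.
RELATIVE_PAIRS = {
    "nmod_Relc": {"yus": "su"},
    "jjmod_Relc": {
        "yuth": "tyuth",
        "yith'": "tith'",
        "yitsh": "titsh",
        "yitshI": "titsI",
    },
    "rbmod_Relc": {
        "yithI-pA:Th'": "tithI-pA:Th'",
        "yithI-kIn'": "tithI-kIn'",
        "yami tAriiqI": "tami tAriiqI",
    },
}


def relative_clause_label(relative_element: str):
    """Return (label, correlative) for a known relative element, else None."""
    for label, pairs in RELATIVE_PAIRS.items():
        if relative_element in pairs:
            return label, pairs[relative_element]
    return None


print(relative_clause_label("yus"))
print(relative_clause_label("yuth"))
print(relative_clause_label("yami tAriiqI"))
```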

9.6. Type Six GRs

It includes a set of non-dependency relations that hold between two conjoined

elements or clauses of equal status and between two elements or clauses of

unequal status. The former is the coordination relation and the latter is the

subordination relation. These are basically structural relations which keep the

organization of compound, complex, compound-complex and complex-compound

clauses intact; they have nothing to do with modification of the finite verb but have

everything to do with the tying together of one or more chunks on intra-clausal

level or one or more VGF chunks on inter-clausal level. The description of these

relations is given below.

iv. Coordinating Conjunct <ccof>

The CCP chunk which conjoins two chunks within a clause is the head of the

conjoining chunks and is attached to the root of the clause in which they occur as

they both are in symmetrical relation with each other and thus, both are

dependents of the root. However, the CCP chunk which occurs at inter-clausal

level is considered as the root of the compound and compound-complex sentences

that involve many finite clauses. In Kashmiri, CCPs include tI (and), kinI/yaa (or),

etc. It must be noted that sometimes even commas can function as CCPs.

The GR which two chunks or clauses bear with their respective heads, which in

both cases is the CCP, is labelled as <ccof>, as shown in Fig.3 and Fig.4 for the

examples (116) and (117). For example:

farooq tI bAshiirI chi dohdish paanIvan’ chob chob karaan. (116)

Farooq and Bashir be-PRS.Pl everyday each other hit hit do-HAB

Everyday Farooq and Bashir fight with each other.

farooq chu lamaan dwchun tI bAshiirI chu lamaan khovur. (117)

Farooq be-PRS.SG.MAS pull-HAB towards right and Bashir

be-PRS.SG.MAS pull-HAB towards left.

Farooq pulls towards right and Bashir pulls towards left.

v. Sub-ordinating Conjunct <ccof>

Instead of symmetrically conjoining chunks and clauses, some CCP chunks

introduce new chunks or new clauses and thus enter into the embedding phenomenon

by joining two chunks or clauses asymmetrically. The embedded clauses

introduced by complementizers are considered as OBJs, as already discussed, but the

rest of the rooted complementizer clause attaches to the complementizer, which,

rather than the VGF, acts as the syntactic head of the complement clause. Therefore,

the complementizer, e.g. zi and ki (that), is the dependent (OBJ) of the root of the

matrix clause but simultaneously, it is also the head of the complementizer clause, as

shown in Fig.5, which is the graphic representation of (118). The GR the VGF of a

complementizer clause bears with the complementizer is also labelled as ccof.

For example:

farooq-an chu vanaan zi tas chu-nI kahn-ti veytsaan. (118)

Farooq-ERG be-PRS.SG.MAS say-HAB that he-DAT

be-PRS.SG.MAS -NEG no one-EMP impress-HAB

Farooq says that no one impresses him.

Figure.3 Showing Intra-clausal ccof in (116)

Figure.4 Showing Inter-clausal ccof in (117)

Figure.5. Showing Sub-ordinating ccof in (118)
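The ccof attachments shown in the figures can also be written down as plain labelled arcs. This is a minimal sketch and not the representation used by the annotation tools: the functions `coordinate` and `dependents_of` are hypothetical, arcs are simple (dependent, head, label) tuples, and the chunk strings are shortened stand-ins for the chunks of (116) and (118).

```python
def coordinate(ccp, conjuncts):
    """Return one (dependent, head, label) arc per conjunct, headed by the CCP."""
    return [(conjunct, ccp, "ccof") for conjunct in conjuncts]


def dependents_of(arcs, head):
    """List the dependents attached to a given head."""
    return [dependent for dependent, arc_head, _ in arcs if arc_head == head]


# Intra-clausal coordination, as in (116): the CCP tI heads both NP chunks.
intra_clausal = coordinate("tI", ["farooq", "bAshiirI"])

# Subordination, as in (118): the VGF of the embedded clause attaches to the
# complementizer zi with the same ccof label.
subordinate = coordinate("zi", ["chu-nI veytsaan"])

print(dependents_of(intra_clausal, "tI"))
print(subordinate)
```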

9.7. Type Seven GRs

It includes an entirely different set of GRs, pof and fragof, which are actually

innovations to handle certain crucial phenomena like complex predication and the

v2 phenomenon. Without these relations, it would have been difficult to account

for the structures involving such phenomena. The description of these relations

is given below:

i. Part of Verb <pof>

Some nouns, adjectives and participle forms combine with certain verbs which are bleached of their original semantics due to grammaticalisation and are hence called light verbs. Such combinations with a light verb give a pure sense of predication in South Asian languages and are called complex predicates or conjunct verbs (see Butt 2005). The generalized internal structure of these complex predicates is (Noun/Adjective/Participle + verbalizer). As in Hindi/Urdu, complex predicates are productive in Kashmiri, where participles, in addition to nouns and adjectives, are also involved in complex predicate formation. In Kashmiri, the commonly occurring light verbs are karun (to do), niyun (to take), tshunun (to enter), etc. As illustrated above in Figure.3, the GR that these nominals, adjectives and participles bear with the light verb, which is projected as VGF, is a non-dependency relation and is labelled as <pof>. For example:

tAm’ kyaa chu vunyuk taam hA:sil kormut. (119)

He what be-PRS.SG.MAS now till achieved

What has he achieved till now?

achaanak gov su mea broThI kani pA:dI. (120)

Suddenly go-PRF he I front in appear

Suddenly he appeared in front of me.

tAm’ tuj’ su vuchith vwTh. (121)

He lift-FEM he see-PART jump

Having seen him he jumped.

tAm’ kor ti saarinIy bronTh kani zA:hir. (122)

he do-PRF.SG.MAS that everyone front in reveal

He revealed it before everyone.

tAm’ diyut ath savaal-as akh rut javaab. (123)

He give-PRF that question-DAT a nice answer

He gave a nice answer to that question.

sw chi farooq-In sakh tA:riif karaan. (124)

She be-PRS.SG.FEM very praise do-HAB

She praises Farooq very much.

su chu pann’ galtiyi qwbuul karaan. (125)

He be-PRS.SG.MAS his mistake-Pl accepts

He accepts his mistakes.

Although there are various diagnostics for identifying complex predicates (Mohanan 1994; Butt, 2004; Chakrabarty et al., 2007; Bhatt, 2008), identifying them is still not an easy task, and hence their annotation is also confusing. The problem is that it is sometimes difficult to figure out whether the nominal part is an OBJ or not. Intuitively, the nominal appears to be a part of the complex predicate, but syntactically, as far as the sub-categorisation frame is concerned, it appears to be an OBJ, as can be seen in example (123) above.

ii. Fragment of Verb <fragof>

It has been observed that in finite clauses the tensed verbal element (VAUX), which has been projected as an AUXP chunk, occurs at the second position, whereas the un-tensed lexical part (VM), which has been projected as a VGF chunk, occurs at the final position of the clause. This discontinuous occurrence of the tensed and lexical verbal elements is due to the fact that Kashmiri exhibits the V2 phenomenon, like German. Since such elements of the finite verb do not occur contiguously, as they do in other Indo-Aryan languages, they do not form a single verb chunk; instead they form two chunks, AUXP and VGF. VGF is the root of a clause, and most of the other chunks are its dependents and are attached to it as per the current scheme. AUXP, however, is not a modifier of VGF in any sense but a tensed fragment of it which has fallen apart. This ‘fragment-of’ GR is shown by attaching the AUXP chunk to the root, like the dependents, and labelling the relation as <fragof>. For example:

farooq chu tsuunTh kheyvaan. (Active Voice) (126)

Farooq be-PRS.SG.MAS apple eat-PROG

Farooq is eating an apple.

farooq-ni zAryi aav tsuunTh khey-nI . (Passive Voice) (127)

Farooq-GEN by come-PRF apple eat-PASS

An apple was eaten by Farooq.

farooq ch-aa tsuunTh kheyvaan? (Interrogative) (128)

Farooq be-PRS-WH apple eat-HAB/PROG

Is Farooq eating an apple/does Farooq eat an apple?

As mentioned above, Kashmiri exhibits the verb-second (V2) phenomenon, which has been argued by Raina (1991) to be a PF level constraint. In Kashmiri, tensed clauses are subject to the verb-second constraint, due to which the finite verbal element always occurs in the second position, i.e. the position following the first constituent. At the surface level, Kashmiri shows V2 like German, except that in Kashmiri V2 appears in both main and embedded clauses. At the deep level, however, it is argued that the underlying word order of Kashmiri is SOV, as in German, for which the evidence comes from non-finite and relative clauses.

9.8. Type Eight GRs

This type includes a non-grammatical relation which, though of no significance in accounting for the structure of a clause, holds with an element that is part of the sentence in the corpus. It covers enumerators, i.e. serial numbers of sentences, which are non-structural parts. Even though enumerators are of no grammatical significance, they need to be accounted for, as they are integral elements of a corpus. Enumerators can be projected as BLK chunks and can be attached to the root of the clause. So far, these relations have not been labeled, as enumerator elements were not present in the corpus. However, the relation between the enumerator BLK and the root can be labeled as <enm>. For example:

1. akh nafrah chu vat-i pakaan. (129)

1 one man be-PRS.SG.MAS road-DAT walk-PROG

1. One man is walking on the road.

5. Annotating Inter-chunk GR Relations

Marking inter-chunk grammatical relations (dependencies or non-dependencies) involves syntactic parsing and its annotation. The chunked corpus, a set of GRs and the SA Interface are prerequisites for annotating inter-chunk GRs. The chunked Kashmiri corpus, the development of which has been described in chapter five, is used for the current task. The set of GRs given in section four of this chapter provides all the necessary relational labels along with their description and illustration. The same annotation interface (Sanchay SA Interface) which was used for POS level and chunk level annotation has also been used for the current syntactic annotation. The syntactic annotation has been carried out manually in the SA Interface of Sanchay. The entire process of annotation is illustrated with reference to the following example sentence taken from the corpus.

kAshiir-i manz haalI-keyn doh-an manz shAhrii halaaqts-an hund silsilI

teyzn-I kin’ chu salaamtii maahol mutA:sir sapud-mut.

Kashmir-DAT.SG.FEM in recent-GEN.Pl.MAS day-DAT.Pl.MAS in civilian death-GEN.Pl.FEM of-SG.MAS spree intensity-ABL towards be-PRS.SG.MAS security condition affect-PRF.SG.MAS

The security conditions have been affected in Kashmir because of the increase in recent death sprees of civilians.

The various steps that were involved in the annotation of the above sentence are

given below:

Step-1 Opening Chunked Data in SA Interface

The chunked corpus file is opened in the interface which shows POS level nodes

as well as chunk level nodes in SSF format.

Figure.6. SA Interface Showing a Chunked Sentence

Step-2 Opening in Tree Viewer Window

In order to attach various types of chunks to the root and to other non-root heads, the sentence needs to be opened in the tree viewer by clicking on the ‘View Dependency Tree’ button on the right side of the window, indicated by the arrow. Once this is done, the chunks are displayed as below. Each chunk has already been automatically assigned an ID number according to its position in the sentence. Here the chunks are displayed in the same order, but from left to right. The displayed sentence consists of two clauses: one finite clause with VGF (the root) as its ultimate head and one non-finite modifying clause with VGNN as its ultimate head.

Figure.7. SA Interface Displaying Various Chunks

Step-3 Finding Root and Target Chunk

Once the chunks are displayed in the tree viewer, the root of the sentence, i.e. the VGF chunk, needs to be identified so that the rest of its dependents and their relations with it can be annotated. In tensed clauses the root occurs at the final position of the sentence, as shown below in Fig.8 by the arrow. After finding the root of the sentence, the target chunk which can be attached to the root needs to be identified. First of all, the element closest to the verb root, i.e. AUXP, needs to be identified and attached if the clause is tensed; then the NP, JJP or VGNF needs to be identified and attached if the chunk projected as VGF is the light verb of a complex predicate. This needs to be done with first priority in order to get the complete information (inflectional and lexical) about the root, as shown in Fig.9, and so to decide upon its sub-categorization frame. Therefore, the first target chunk in the finite clause of the given sentence (chu salaamtii maahol mutA:sir sapud-mut) was AUXP, which is the tensed part of VGF, and the second target chunk was JJP, which is the adjectival part of the complex predicate.

Figure.8. SA Interface Showing the Identified Root Chunk (VGF)

Figure.9. SA Interface Showing the Identified Target Chunk (AUXP)

Step-4 Drag and Drop of Target Chunk

Once the target chunk (AUXP) is identified, it can be attached to the VGF root by the drag and drop method, as shown in Fig.10 by an arc. This creates an undefined relation between AUXP and the root, as shown by the arrow. The relation needs to be identified and labeled according to the set of labels given in the table.

Figure.10. SA Interface Showing Attaching of the Target Chunk to the Root

Step-5 Choosing Relational Label

In this step, a dropdown list of relational labels can be opened by simply left-clicking on the dependent node, i.e. AUXP. Clicking on the node opens a dialog box, as shown in Fig.11, and clicking on the OK button of the dialog box opens a dropdown list of relational labels, as shown in Fig.12, from which an appropriate label can be chosen.

Figure.11. SA Interface Showing Undefined GR between AUXP and VGF

Figure.12. SA Interface Showing Selected GR between AUXP and the Root

Once the OK button of the dropdown list is clicked upon, the selected label, i.e.

fragof, gets assigned to the undefined relation between AUXP and VGF as shown

in Fig.13 by an arrow.

Figure.13. SA Interface Showing FRAGOF GR between AUXP and the Root

The same procedure is applied to the next target, i.e. the JJP chunk, which is attached to the root with the relational label pof, as shown in Fig.14 with the help of an arrow. Once the complete information about the root is available, it is easy to identify and attach the other dependents, both arguments and adjuncts, and to decide upon their DRELs.

Figure.14. SA Interface Showing partof and fragof Attachments to the Root

Step-6 Annotating Rooted Dependencies

With an idea of the sub-categorization frame of the complex predicate, which is the root of the finite clause, it becomes obvious that the next NP is its argument. There was some confusion as to whether it bears the k1 or the k2 DREL with the root, but initially it seemed to be k2. Therefore, it was attached to the root by the same drag and drop method which was used to attach the other chunks, and the DREL it holds with the root was annotated as k2, as shown in Fig.15 with the help of an arrow.

Figure.15. SA Interface Showing k2, partof and fragof Attachments to the Root

In this way, the syntactic annotation of the finite clause of the given sentence (chu salaamtii maahol mutA:sir sapud-mut) was completed, in which the three chunks AUXP, JJP and NP have been attached to the root VGF with the attachment labels fragof, pof and k2 respectively.

Step-7 Annotating Non-Rooted Dependencies

In this step, the annotation of the non-finite clause (kAshiir-i manz haalI-keyn doh-an manz shAhrii halaaqts-an hund silsilI teyz-nI kin’) of the sentence was taken up. The first point was to identify the head of the entire non-finite clause, i.e. VGNN, and the target chunk that needs to be attached first. However, there were some dependents which, instead of depending on VGNN, were dependents of NPs which in turn were dependents of VGNN. Such cases needed to be taken care of first, so that later one can fully concentrate on the attachments of VGNN and avoid errors in the annotation. Therefore, attaching to VGNN was postponed and the next genitive marked dependent NP was instead attached to its head NP, and the attachment was labeled as r6, as shown in Fig.16. Having finished this, the head NP along with its own attachment was itself attached to the ultimate head of the clause, i.e. VGNN, and k1 was assigned as the attachment label, as shown below in Fig.16 by an arc and in Fig.17 by an arrow.

Figure.16. SA Interface Showing r6 Attachment to the first NP Head

Figure.17. SA Interface Showing k1 Attachment to VGNN Head

Immediately after, another genitive marked NP was encountered, which was also attached to its head NP and assigned the r6 attachment label, as shown in Fig.18 by an arrow. Next, the first NP of the sentence was attached to VGNN, as shown in Fig.18 by an arc, and the attachment label k7p was assigned to the attachment, as shown in Fig.19 by an arrow.

Figure.18. SA Interface Showing r6 Attachment to the Second NP

Figure.19. SA Interface showing Another k7p Attachment to VGNN Head

Finally, the leftover NP along with its genitive attachment was attached to VGNN, as shown in Fig.19 by an arc, and was assigned the attachment label k7t, as shown in Fig.20 by an arrow.

Figure.20. SA Interface showing k7t Attachment to VGNN Head

Step-8 Annotating Inter-clausal Dependencies

By now, there are two parsed clauses, one finite and the other non-finite. As already mentioned several times, the root of a sentence, i.e. VGF, lies in the finite clause, and the non-finite clause just modifies the root. Therefore, the VGNN chunk with all its attachments is attached to the root and assigned the rh attachment label, as shown in Fig.21 by an arrow.

Figure.21. Showing Inter-clausal DREL (rh) between VGNN and the Root
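The attachments made in the steps above can be summarised as a small head/label table. The following is only a minimal sketch: the chunk names (NP1 to NP6), the segmentation of the example sentence into chunks and the helper functions are assumptions made for illustration and are not part of Sanchay or of the SSF format. The labels follow the walkthrough, with the NP attached to the root given as k1, the label it actually bears.

```python
from collections import defaultdict

# dependent chunk -> (head chunk, attachment label)
# Chunk names and boundaries are illustrative assumptions.
ARCS = {
    "NP1":  ("VGNN", "k7p"),     # kAshiir-i manz
    "NP2":  ("NP3",  "r6"),      # haalI-keyn
    "NP3":  ("VGNN", "k7t"),     # doh-an manz
    "NP4":  ("NP5",  "r6"),      # shAhrii halaaqts-an hund
    "NP5":  ("VGNN", "k1"),      # silsilI
    "VGNN": ("VGF",  "rh"),      # teyzn-I kin'
    "AUXP": ("VGF",  "fragof"),  # chu
    "NP6":  ("VGF",  "k1"),      # salaamtii maahol
    "JJP":  ("VGF",  "pof"),     # mutA:sir
    "VGF":  (None,   "root"),    # sapud-mut
}

def roots(arcs):
    """Chunks that have no head; a well-formed tree has exactly one."""
    return [chunk for chunk, (head, _) in arcs.items() if head is None]

def path_to_root(arcs, chunk):
    """Follow head links upwards; raise if a cycle is found."""
    path = [chunk]
    while arcs[path[-1]][0] is not None:
        head = arcs[path[-1]][0]
        if head in path:
            raise ValueError(f"cycle through {head}")
        path.append(head)
    return path

def dependents(arcs):
    """Group (dependent, label) pairs under each head."""
    grouped = defaultdict(list)
    for dep, (head, label) in arcs.items():
        if head is not None:
            grouped[head].append((dep, label))
    return dict(grouped)

if __name__ == "__main__":
    print("root:", roots(ARCS))
    for head, deps in dependents(ARCS).items():
        print(head, "<-", deps)
```

The rooted dependencies are the chunks listed under VGF, while the two r6 attachments are non-rooted, since their heads are NPs rather than the root.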

Once the annotation of the entire sentence is complete, the dependency tree is saved and the tree viewer window is closed. The saved annotated sentence is displayed in the interface as a threaded structure in SSF format, as shown in Fig.22.

Figure.22. Showing Threaded Structure of Syntactically Annotated Sentence

Finally, on opening the threaded structure in the tree viewer in collapsed form, i.e. with collapsed nodes, the dependency tree is displayed as shown in Fig.23. Evaluating the various attachment labels once more before moving on to the next sentence is essential, as the relations can now be seen more clearly. This was done in this case also: the NP attached to the root actually bears the k1 relation, but it was mistakenly labeled as k2. Errors like this can easily be rectified at this stage.

Figure.23. Showing a Complete Dependency Tree in Collapsed Form

The corresponding expanded form of the dependency tree will be as shown in

Fig.24, with all its nodes or sub-trees completely expanded.

Figure.23. Showing a Complete Dependency Tree in Expanded Form

6. Issues of Syntactic Annotation

The crucial issues that have been encountered while annotating the data are

summarized below:

6.1. V2 Phenomenon

V2 phenomenon is the most crucial issue for annotating Kashmiri data. The issue

is discussed with reference to the following example sentence taken from the

current Kashmiri Treebank.

Asi [A:s]AUXP doshvun’ bA:ts-an tam’-sInz seyThaa nikhath [gA:mIts]VGF. (1.a)

we be-PST.SG.MAS two-DAT.EMP husbandwife-DAT.Pl it-GEN lot hatred go-PRF.SG.Fem

We both, husband and wife, had developed a lot of hatred for it.

In examples like the above, the finite verb group [A:s gA:mIts]VGF (had gone) occurs discontinuously as AUXP (A:s) and VGF (gA:mIts) with three intervening NP chunks. As mentioned earlier, the tense auxiliary occurs at the second position in Kashmiri and the main verb at the final position of the sentence. This discontinuous occurrence of AUXP is called the V2 phenomenon, which is similar, with some variation, to that of German and Yiddish. Given the discontinuity in the finite verb group (VGF), the main issue was whether to posit AUXP or VGF as the root of the sentence. Initially, the FRAGP chunk label (used to handle occasional discontinuity in the Hindi treebank) was used for the tense auxiliary, and it was treated as the root of the sentence, given that most treebanks consider the finite verb as the head and that in the generative framework too the finite clause is treated as a tensed phrase. Later on, the decision was taken to change the nomenclature and replace the FRAGP label with AUXP, to mark that the discontinuity is a regular phenomenon and that the notion of verb group, as posited for treebanking in Indian languages, is problematic with respect to Kashmiri data, which is replete with the V2 phenomenon. Also, the previous notion of head vis-à-vis root of the sentence was revised, and VGF instead of AUXP was taken as the head/root of the sentence. This decision was made in consonance with the basic tenet governing PCG, which is that only content words can be heads. Further, it is considered that the grammatical information that gives the impression of finiteness is distributed over two or three tokens, and a single tensed token without its lexical part can’t be considered a finite verb. Therefore, AUXP is considered the tensed part of the lexical element, and the two together constitute the finite verb. Since only lexical elements can be roots, the lexical part of the verb has been assigned VGF and the AUXP is attached to it like any other dependent, though it is not a dependent but a tensed fragment of the lexical verb, and is assigned the fragof attachment label, on the consideration that the AUXP-VGF complex as a whole gives the sense of a single VGF. In short, the V2 phenomenon was tackled by a simple attachment technique, on the assumption that what can’t be grouped together during chunking can at least be attached, with the attachment label indicating its status, as the scheme has no notion of hierarchical organization of the sentence.
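One practical consequence of the fragof attachment is that the discontinuous finite verb group can be put back together from the annotation. The sketch below illustrates this for example (1.a). The tuple layout, the function names and the exact NP chunk boundaries are assumptions made for illustration; the text only states that three NP chunks intervene between AUXP and VGF.

```python
# Chunks of example (1.a) in surface order: (id, chunk tag, text).
# The NP boundaries are an illustrative assumption.
CHUNKS = [
    (1, "NP",   "Asi"),
    (2, "AUXP", "A:s"),
    (3, "NP",   "doshvun’ bA:ts-an"),
    (4, "NP",   "tam’-sInz"),
    (5, "NP",   "seyThaa nikhath"),
    (6, "VGF",  "gA:mIts"),
]

# Only the relation discussed in the text is listed here:
# dependent chunk id -> head chunk id, for arcs labelled fragof.
FRAGOF = {2: 6}

def finite_verb_group(chunks, fragof, root_id):
    """Join the root with its fragof-attached fragments in surface order."""
    text = {cid: words for cid, _, words in chunks}
    parts = [dep for dep, head in fragof.items() if head == root_id]
    parts.append(root_id)
    return " ".join(text[cid] for cid in sorted(parts))

def intervening_chunks(chunks, fragof, root_id):
    """Chunks lying between the tensed fragment and the lexical verb."""
    fragments = [dep for dep, head in fragof.items() if head == root_id]
    if not fragments:
        return []
    first = min(fragments)
    return [c for c in chunks if first < c[0] < root_id]

if __name__ == "__main__":
    print(finite_verb_group(CHUNKS, FRAGOF, root_id=6))
    print(len(intervening_chunks(CHUNKS, FRAGOF, root_id=6)), "chunks intervene")
```

Because the label marks AUXP as a fragment rather than a dependent, a downstream tool can treat the reunited string as the single finite verb without any change to the chunking.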

6.2. Complex Predicates

The problems related to complex predicates in Kashmiri are discussed with reference to the following example sentence, taken from the current Kashmiri Treebank, which contains the complex predicate pasand aasun (to like).

zA:hir chu ki Akis [aasi]VGF akh kitaab pasand tI beykis aasi byaakh kitaab [pasand]NP. (1.b)

obvious is that one will-have one book like and other will-have other book like

It is obvious that one would like one book and the other would like another book.

Identification of complex predicates (CPs) and their extraction is already a complex problem, in which it at times becomes very difficult to identify whether a [light verb + Noun] combination is simply a verb + OBJ combination or a complex predicate, as mentioned earlier. The following criteria have been used, in addition to native speakers’ intuitions, to recognize CPs in Kashmiri.

i. The first criterion is that the verbal element of a CP is semantically bleached and doesn’t retain its original lexical semantics. It is for this reason that it is also called a light verb and more or less functions metaphorically.

ii. The second criterion is that if the (NN/JJ/VM + VM) combination has a single lexical item, a verb, as its translation equivalent in English, it is most likely to be a CP.

iii. Examining the sub-categorization frame of the light verb reveals whether the nominal element is an argument, an adjunct or something else. If it is something else, the combination is more likely to be a complex predicate.

iv. A further criterion is that the nominal, adjectival or participial part of a CP can’t be easily conjoined, whereas OBJs or complements can be easily conjoined.

v. Further, some CPs can be identified by just looking at the non-verbal part to see if it is stripped of agreement features like PNG. If no such features can be perceived there, it most likely forms a complex predicate.

This problem is even more complicated in Kashmiri, where discontinuous CPs are a hallmark of finite clauses. The noun/adjective/participial part of a CP occurs apart from the light verb, which takes the second position due to the V2 phenomenon. The light verb carries only the grammatical features, while the lexical semantics is provided by the noun/adjective/participial part. However, the light verb is not only the tensed element but also the only verbal element, and there is no main verb to provide lexical semantics for predication. Therefore, the light verb is assigned the VGF tag and not AUXP. The noun/adjective/participial parts are simply attached to VGF with the attachment label pof (Part-of). In the above example, the nominal part of the CP, “pasand”, is attached to the light verb “aasi” with the pof attachment label, just as AUXP was attached to VGF. Here again, the discontinuity of the complex predicate is resolved through the attachment technique.

6.3. Pronominal Cliticisation

The problems related to pronominal cliticisation in Kashmiri are discussed with reference to the following example sentences taken from the current Kashmiri Treebank.

yAmi-is yi Ø behtar zon-un ti thov-n-as Ø lekh-ith. (1.c)

who-DAT this Ø better know-PRF.3PC.SG.MAS that

keep-PRF-3PC.DAT Ø write-PART

For whom whatever s/he deemed better s/he kept that in his/her destiny.

yAmi-is yi tAm’ behtar zon-un ti thov-n-as tAm’ le’kh-ith.* (1.d)

who-DAT this better know-PRF.3PC.SG.MAS that

keep-PRF-3PC.DAT write-PART

For whom whatever s/he deemed better s/he kept that in his/her destiny.

bI chus-ai tse vuchaan. (1.e)

I be-PRS.SG.MAS -2PC you see-PROG

I am watching you.

bI chus-ai Ø vuchaan. (1.f)

I be-PRS.SG.MAS -2PC Ø see-PROG

I am watching you.

bI chus-Ø tse vuchaan.* (1.g)

I be-PRS.SG.MAS -2PC Ø see-PROG

I am watching you.

Pronominal clitics are a characteristic morpho-syntactic feature of Kashmiri verbs, like those of Punjabi, Lahnda and Sindhi. There are two types of pronominal clitics in Kashmiri. One type includes those which simply act as agreement markers and do not replace arguments, as shown in examples (1.e) and (1.f). In such cases, the presence or absence of the pronominal argument hardly matters in the presence of the clitic; both can co-exist without making the construction sound odd, but in the absence of the clitic, the pronominal argument makes the construction sound odd, as shown in (1.g). This indicates that in such clauses an artificial argument can be introduced in the slot of PRO drop, even though the information about the argument can be extracted from the clitic itself. However, there are other cases in which the clitic replaces the argument, and the clitic and the pronominal argument can’t co-exist. If artificial pronominal arguments are introduced to fill the slot of PRO drop triggered by the clitics, the argument sounds redundant and the construction looks odd, as shown in (1.c) and (1.d). In example (1.d), the clitic and the argument are simultaneously present in the clause “yAmi-is yi (tAm’) behtar zon-un”, and this is the reason the clause sounds odd. Therefore, in such cases, artificially introducing pronominal arguments is not of much importance. However, in the cases where the pronominal argument and the clitic are mutually compatible and can co-exist, the arguments can be introduced.

7. Statistical Results

As mentioned in chapter five, the three datasets that have been used for the current task consist of 682 POS annotated sentences of varying lengths, taken from three different text domains, i.e. newspaper editorials (ASL = 25 Ws or 15 Cs), short stories (ASL = 11 Ws or 8 Cs) and critical discourse (ASL = 16 Ws or 10 Cs), which are partially parsed into 8125 chunks. The task with which this chapter is concerned is to deduce, annotate and count each inter-chunk GR holding among the 8125 chunks in the 682 structures. In aggregate, 4287 GRs have been found holding among the 8125 chunks under the 682 dependency structures. The 4287 GRs are further classified under 25 labels, each with its frequency count in the three domains and in aggregate, as shown in Table.2. However, the score for each attachment label is given in an underspecified manner, i.e. no separate frequency score is given for the variants.

S.No  Label      Variants                            f1    f2    f3    fx
1     k1         pk1, jk1, mk1                       294   80    230   604
2     k1s        **                                  49    14    64    127
3     k2         k2g, k2p                            213   49    262   524
4     k2s        **                                  7     2     16    25
5     Rs         rs-k1, rs-k2                        3     1     17    21
6     k3         **                                  31    3     24    58
7     k4         k4a, k4v                            55    2     102   159
8     k5         k5prk                               6     0     6     12
9     k7         k7t, k7p                            166   45    120   331
10    r6         r6k1, r6k2                          93    55    151   299
11    Rd         **                                  23    0     3     26
12    Rh         **                                  10    1     16    27
13    Rt         **                                  9     8     18    35
14    k*u        k1u, k2u                            4     0     0     5
15    Ras        ras-k1, ras-k2, ras-neg             6     4     5     15
16    Rsp        **                                  6     8     5     19
17    Rad        **                                  8     0     0     8
18    Adv        sent-adv                            134   9     68    211
19    Nmod       **                                  14    7     31    52
20    Vmod       vmod_Rh, vmod_Inst                  78    7     66    151
21    *mod_Relc  nmod_Relc, jjmod_Relc, rbmod_Relc   27    4     25    56
22    Ccof       **                                  334   76    455   865
23    Pof        **                                  134   49    126   309
24    Fragof     **                                  113   33    203   349
25    Enm        **                                  0     0     0     0
      Total                                          1817  457   2013  4287

Table.2. Showing Frequency Distribution of GRs

The empirical facts given in the pie chart in Fig.24 reveal that ccof is the most frequent GR, covering 20% of the total GRs. Co-ordination and sub-ordination therefore form the bulk of the grammatical operations occurring in Kashmiri text. Fragof constitutes 8% of the total GRs found in Kashmiri, indicating the strength of the V2 phenomenon. Pof constitutes 7% of the total GRs, showing the significant occurrence of complex predicates in Kashmiri. Similarly, k1 constitutes 14% and k2 constitutes 12% of the relational bulk of Kashmiri text, indicating that SUBs and OBJs constitute 26% of GRs in aggregate, which is quite significant. It is interesting to see that, quantitatively, k1, k2, ccof and fragof together cover more than half of the total relational bulk. These facts further reveal that 39-40% of GRs in Kashmiri are karakas and the rest, about 60%, are non-karakas, and that 65% of GRs are dependency relations whereas 35% are non-dependencies, in which 6% are non-rooted dependencies, i.e. the attachments are made with non-root heads (in genitive, participial and relative clause modifiers). 16% of GRs are adverbial modifiers and only 1% of GRs are relative clause modifiers. Finally, it is important to point out that only 30% of GRs belong to the sub-categorization frame and thus represent argument relations, whereas 61% of GRs fall outside the sub-categorization frame and thus represent adjunct relations.
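The shares quoted for the individual labels can be re-derived from the aggregate counts in Table.2. The following is a minimal sketch covering only the five labels whose percentages are stated explicitly; the variable and function names are illustrative.

```python
# Aggregate counts (column fx of Table.2) for the labels discussed in the text.
COUNTS = {"ccof": 865, "k1": 604, "k2": 524, "fragof": 349, "pof": 309}
TOTAL_GRS = 4287  # total number of annotated GRs reported in Table.2

def share(label):
    """Percentage of all GRs carried by one label."""
    return 100 * COUNTS[label] / TOTAL_GRS

def combined_share(labels):
    """Percentage of all GRs carried by a group of labels."""
    return 100 * sum(COUNTS[label] for label in labels) / TOTAL_GRS

if __name__ == "__main__":
    for label in COUNTS:
        print(f"{label:7s}{share(label):5.1f}%")
    print(f"k1 + k2: {combined_share(['k1', 'k2']):.1f}%")
    print(f"k1 + k2 + ccof + fragof: "
          f"{combined_share(['k1', 'k2', 'ccof', 'fragof']):.1f}%")
```

Rounded to whole percentages, these values agree with the figures given in the paragraph above.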

Figure.24 Showing Proportion of Each GR

8. Inter-annotator Agreement

One of the biggest challenges for a treebank project is maintaining consistency in annotations. This includes achieving significant inter-annotator as well as intra-annotator agreement. To check inter-annotator agreement, two independent annotators need to annotate the same data, whereas intra-annotator agreement is achieved if an annotator who encounters the same construction or phenomenon many times during the course of annotation annotates it consistently by sticking to the previous decisions regarding it. Consistency matters because it increases the usefulness of the data for training or testing automatic methods for linguistic investigations. The understanding of various linguistic phenomena and of the annotation guidelines is also often reflected in inter-annotator agreement studies. In order to check the consistency of the annotations in the current treebank, a dataset of 200 sentences was annotated by two annotators who had a proper understanding of the various issues and of the guidelines for the Kashmiri treebank. When the two annotated datasets were compared, a confusion matrix was formulated, as shown in Table.6. The matrix shows which labels were confused with one another and how many times. For example, in the first row of the table, adv is confused with rt once, with vmod twice, with k7p twice, with sent-adv once, with nmod once, with k2 once, with k7 once, with k7t once and with pof once.
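A disagreement table of this kind can be tallied directly from two aligned label sequences. The sketch below is one plausible way of doing so and is not the script used for the thesis; the function name and the toy labels are invented for illustration.

```python
from collections import Counter, defaultdict

def confusion(labels_a, labels_b):
    """Count, for each label of annotator A, the differing labels chosen by B.

    Both lists must be aligned, i.e. position i refers to the same attachment.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same attachments")
    table = defaultdict(Counter)
    for a, b in zip(labels_a, labels_b):
        if a != b:  # only disagreements are recorded
            table[a][b] += 1
    return {label: dict(counts) for label, counts in table.items()}

if __name__ == "__main__":
    # Invented toy data, not taken from the treebank.
    annotator_a = ["adv", "adv", "adv", "k1", "k2", "pof"]
    annotator_b = ["rt", "vmod", "adv", "k2", "k2", "k2"]
    for label, confused_with in confusion(annotator_a, annotator_b).items():
        print(label, confused_with)
```

Each printed line has the same shape as a row of the confusion table: a label followed by the labels it was confused with and their counts.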

Inter-annotator agreement was measured using Cohen’s kappa (Cohen, 1960), which is the most widely used agreement coefficient for annotation tasks with categorical data. Kappa was introduced to the field of computational linguistics by Carletta et al. (1997), and since then many linguistic resources have been evaluated using this metric (Uria et al., 2009; Bond et al., 2008; Yong and Foo, 1999). The kappa statistics show the agreement between the annotators and the reproducibility of their annotated datasets. However, good inter-annotator agreement does not necessarily ensure the accuracy of attachment labels, as the annotators can make similar kinds of mistakes and errors.

The kappa coefficient k is calculated as:

\[ k = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \]

where Pr(a) is the observed agreement between the annotators and Pr(e) is the expected agreement, i.e. the probability that the annotators agree by chance. According to the interpretation scale for kappa values proposed by Landis and Koch (1977), shown in Table.3, the agreement between the two annotators on the dataset used for the evaluation, given in Table.4, is reliable. There is a substantial amount of inter-annotator agreement, which implies that the annotators have a similar understanding of the annotation guidelines and of the linguistic phenomena found in the data. The label attachment score, the agreement on labels only and the agreement on attachments only are given in Table.5.

S.No  Kappa Statistics  Strength of Agreement
1     < 0.00            Poor
2     0.00-0.20         Slight
3     0.21-0.40         Fair
4     0.41-0.60         Moderate
5     0.61-0.80         Substantial
6     0.81-1.00         Almost Perfect

Table.3. Coefficients for the Agreement Rate

Observed Agreement Expected Agreement Kappa Value

0.77738515901060079 0.089149258949418789 0.75559679434126129

Table.4. Kappa Statistics
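The kappa value in Table.4 follows from the observed and expected agreement by the standard definition of Cohen's kappa, k = (Pr(a) - Pr(e)) / (1 - Pr(e)). The sketch below recomputes it and maps it onto the bands of Table.3. The function names are illustrative, and the toy label lists are invented solely to show how the two agreement rates would be obtained from raw annotations.

```python
from collections import Counter

def agreement_rates(labels_a, labels_b):
    """Return (observed, expected) agreement for two aligned label lists."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return observed, expected

def cohen_kappa(observed, expected):
    """Cohen's kappa from observed and chance agreement."""
    return (observed - expected) / (1 - expected)

def strength(kappa):
    """Landis and Koch (1977) bands as listed in Table.3."""
    if kappa < 0.0:
        return "Poor"
    bands = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
             (0.80, "Substantial"), (1.00, "Almost Perfect")]
    for upper, name in bands:
        if kappa <= upper:
            return name
    raise ValueError("kappa cannot exceed 1")

# Values reported in Table.4.
kappa = cohen_kappa(0.77738515901060079, 0.089149258949418789)

# Invented toy labels, only to show the two rates being computed.
toy_a = ["k1", "k1", "k2", "k2"]
toy_b = ["k1", "k2", "k2", "k2"]
toy_observed, toy_expected = agreement_rates(toy_a, toy_b)

if __name__ == "__main__":
    print(round(kappa, 4), strength(kappa))
    print(toy_observed, toy_expected, cohen_kappa(toy_observed, toy_expected))
```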

Label Attachment Score (LAS)      0.5177619893428064
Agreement on Labels (LA)          0.7380106571936057
Agreement on Attachments (UAS)    0.6341030195381883
No Match (NM)                     0.15008880994671403

Table.5. Kappa Statistics
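The four scores in Table.5 can be illustrated with the usual definitions from dependency evaluation, applied here to two annotators rather than to a parser and a gold standard. This is only a sketch under that assumption: the exact counting procedure behind Table.5 is not stated, so the code is not expected to reproduce those values, and the chunk ids and annotations below are invented.

```python
def attachment_agreement(ann_a, ann_b):
    """Compare two annotations given as {chunk_id: (head_id, label)}.

    UAS: same head; LA: same label; LAS: same head and same label;
    NM: neither head nor label agrees.
    """
    if ann_a.keys() != ann_b.keys():
        raise ValueError("both annotations must cover the same chunks")
    n = len(ann_a)
    same_head = same_label = both = neither = 0
    for chunk_id, (head_a, label_a) in ann_a.items():
        head_b, label_b = ann_b[chunk_id]
        head_ok = head_a == head_b
        label_ok = label_a == label_b
        same_head += head_ok
        same_label += label_ok
        both += head_ok and label_ok
        neither += not head_ok and not label_ok
    return {"LAS": both / n, "LA": same_label / n,
            "UAS": same_head / n, "NM": neither / n}

# Invented toy annotations of five chunks; head 0 stands for "no head" (root).
TOY_A = {1: (4, "k1"), 2: (4, "fragof"), 3: (4, "pof"), 4: (0, "root"), 5: (4, "k7t")}
TOY_B = {1: (4, "k2"), 2: (4, "fragof"), 3: (2, "pof"), 4: (0, "root"), 5: (1, "r6")}

if __name__ == "__main__":
    for name, value in attachment_agreement(TOY_A, TOY_B).items():
        print(name, value)
```

In the toy data, chunk 1 agrees on the head only, chunk 3 on the label only, chunks 2 and 4 on both and chunk 5 on neither, which is how the four scores separate the different kinds of disagreement.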

S. NO  Labels  Confusions
1   adv          {'rt': 1, 'vmod': 2, 'k7p': 2, 'sent-adv': 1, 'nmod': 1, 'k2': 1, 'k7': 1, 'k7t': 1, 'pof': 1}
2   ccof         {'k1s': 2, 'rt': 1, 'vmod': 2, 'nmod__relc': 1, 'k2': 1, 'k1': 1, 'pof': 1}
3   fragof       {'pof': 1, 'ccof': 3, 'nmod': 1}
4   k1           {'k1s': 3, 'r6': 1, 'vmod': 1, 'k1u': 1, 'ccof': 1, 'k4v': 7, 'nmod': 2, 'k2': 14, 'pof': 2, 'k4a': 5}
5   k1s          {'k2s': 1, 'nmod': 1, 'k3': 1, 'k2': 12, 'k1': 3, 'k7t': 1, 'pof': 2}
6   k2           {'adv': 2, 'r6': 1, 'k4v': 4, 'k3': 1, 'ccof': 1, 'k1': 8, 'k4': 2, 'pof': 6, 'k4a': 3}
7   k2p          {'k7': 1, 'rh': 1, 'k2g': 1}
8   k2s          {'k2': 2}
9   k4           {'k2': 1, 'k4v': 6, 'k4a': 4, 'k1': 2}
10  k4a          {'k2': 1, 'k1': 2, 'k4': 1}
11  k4v          {'k1': 1, 'k4': 1}
12  k5           {'rd': 1, 'k7p': 1}
13  k7           {'vmod': 1, 'k2': 1, 'k2p': 1, 'k7p': 1, 'k1': 2, 'k7t': 2, 'k5': 1, 'rsp': 3}
14  k7p          {'rd': 1, 'k2p': 1, 'k7': 3, 'k7t': 1}
15  k7t          {'adv': 1, 'k7p': 1, 'k7': 1, 'vmod': 3}
16  nmod         {'vmod': 2, 'rs': 1, 'ccof': 1, 'k2': 1, 'k1': 1, 'k7': 2, 'k5': 1}
17  nmod__k1inv  {'nmod': 1}
18  nmod__k2inv  {'nmod': 1}
19  nmod__relc   {'fragof': 1, 'nmod': 1}
20  pk1          {'k1': 1}
21  pof          {'k2': 7, 'k1': 1, 'vmod': 1}
22  r6           {'k4v': 1, 'r6-k2': 1, 'k1': 1}
23  r6-k2        {'r6': 5}
24  r6v          {'k1s': 1, 'k7p': 1, 'k4v': 1}
25  rad          {'k7p': 2}
26  ras-k1       {'r6': 1, 'k7': 1, 'k4': 1}
27  rbmod        {'ccof': 1}
28  rh           {'k3': 2, 'ccof': 1}
29  rs           {'k2': 3, 'vmod': 1, 'k2s': 2}
30  rt           {'sent-adv': 1, 'rh': 4}
31  vmod         {'adv': 1, 'ras-neg': 1, 'sent-adv': 1, 'ccof': 3, 'pof': 1}

Table.6. Confusion Matrix Showing Disagreement Labels

9. Summary

In this chapter, the most important annotation layer of the dependency treebank of Kashmiri, i.e. syntactic parsing and annotation, has been discussed. First, the notion of parsing was introduced, as it is the key syntactic operation that produces dependency trees from input sentences carrying some degree of previously annotated grammatical information. Since the Paninian grammatical sketch for Sanskrit has been adopted for sentence parsing in IL treebanks, the basic tenets of Paninian Computational Grammar (PCG) were introduced in order to reveal what kind of syntactic parsing is involved in developing a dependency treebank for Kashmiri. As already mentioned, PCG is a reinterpretation of Paninian grammar by one of the leading NLP groups in India, called Bharati. PCG seems to be a blend of ideas flourishing throughout the world in the dependency tradition. It is not as purely Paninian as the name suggests; some key notions like Noun Group and Verb Group also appear in Abney (1996). Moreover, the reinterpretation has been made in terms of modern notions of grammar, either by complete equivalence or by mere approximation, e.g. Karta is interpreted as roughly equivalent to agent or SUB. So it is through the popular notions of agent or SUB that annotators equipped with modern linguistic jargon interpret terms like Karta. For this reason PCG sometimes appears to be merely a matter of ancient labels, which, however, is not the case. The fundamental ideas at the heart of PCG have been taken from the ancient Sanskritic genius, which is more semantics oriented. It is essentially a syntactico-semantic model which incorporates more semantics than syntax, and this is why it lacks popular notions like SUB, OBJ, argument, adjunct, etc. It must be noted, however, that the elements corresponding to such notions can easily be extracted from the treebank, as the semanto-syntactic attachment labels can be classified in terms of popular notions of syntax, e.g. a k1 attachment always attaches an argument, a SUB.
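The extraction of arguments and adjuncts from labelled attachments can be sketched as a simple lookup. This is only an illustration: the k1 entry follows the statement above, whereas the entries for k2, k7t and k7p, the table name `COARSE_CLASS`, the function `extract` and the toy relations are assumptions introduced here and are not taken from the treebank or its guidelines.

```python
# Coarse syntactic class for a few attachment labels.
# Only the k1 entry is stated in the text; the others are assumptions.
COARSE_CLASS = {
    "k1": ("argument", "SUB"),
    "k2": ("argument", "OBJ"),   # assumption
    "k7t": ("adjunct", None),    # assumption
    "k7p": ("adjunct", None),    # assumption
}


def extract(relations, wanted):
    """Return the (dependent, head, label) triples whose coarse class is `wanted`.

    Labels missing from COARSE_CLASS are skipped rather than guessed.
    """
    selected = []
    for dependent, head, label in relations:
        coarse = COARSE_CLASS.get(label)
        if coarse is not None and coarse[0] == wanted:
            selected.append((dependent, head, label))
    return selected


# Hypothetical dependency structure with English placeholder words.
relations = [
    ("Aslam", "ate", "k1"),
    ("apple", "ate", "k2"),
    ("yesterday", "ate", "k7t"),
]

arguments = extract(relations, "argument")
adjuncts = extract(relations, "adjunct")
print(arguments)
print(adjuncts)
```

A complete mapping would have to cover every label in the inventory and would need to be validated against the annotation guidelines.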

Next, the inventory of grammatical relations (GRs) that need to be annotated has been given with their original Paninian terms, their interpretation in modern terms, their labels and their variant labels. Each GR mentioned in the inventory has been described with a variety of examples in such a manner that the description also serves as annotation guidelines. Then, the procedure for annotating the various inter-chunk GRs has been given with graphic illustrations, so that it becomes obvious how all the dependency structures have been produced using the Sanchay SA Interface. The various annotation issues encountered while annotating the Kashmiri corpus were also discussed and illustrated, particularly the V2 phenomenon, which brings Kashmiri closer to the Germanic languages; it is due to the V2 factor in particular that Kashmiri dependency structures differ from the dependency structures of other ILs.

Finally, the notion of inter-annotator agreement has been introduced and an experiment measuring inter-annotator agreement vis-à-vis consistency in the treebank annotation has been reported. The confusion matrix shows the disagreeing or conflicting labels, and the rest of the tables in this section show that the inter-annotator agreement is substantial as per the interpretation scale of Landis and Koch (1977). The observed agreement was 0.777 and the kappa value was 0.7555. In short, the experiment conducted to check the inter-annotator agreement has shown that the annotators agree quite considerably on labels as well as on attachments, which means both have a similar understanding of the issues and the guidelines. It also indicates that there will be considerable consistency in the syntactic annotations of the current treebank.
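For readers who wish to reproduce this kind of measurement, the following is a minimal sketch of observed agreement and kappa, assuming the statistic is Cohen's kappa for two annotators, i.e. kappa = (Po - Pe) / (1 - Pe). The label sequences are hypothetical toy data, not the experimental data, and the function names `cohen_kappa` and `landis_koch` are introduced here only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: share of items given the same label by both annotators.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal proportions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    pe = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if pe == 1.0:
        # Both annotators used one and the same label throughout.
        return po, 1.0
    return po, (po - pe) / (1 - pe)


def landis_koch(kappa):
    """Map a kappa value to the verbal scale of Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Hypothetical labels assigned by two annotators to the same six chunks.
annotator_a = ["k1", "k2", "k1", "k7", "k2", "k1"]
annotator_b = ["k1", "k2", "k2", "k7", "k2", "k1"]

po, kappa = cohen_kappa(annotator_a, annotator_b)
print(round(po, 4), round(kappa, 4), landis_koch(kappa))
```

Under the same formula, the reported values (observed agreement 0.777, kappa 0.7555) would correspond to a chance agreement of roughly 0.088, and `landis_koch(0.7555)` falls in the "substantial" band.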


Chapter.7 Conclusion

“Computers are incredibly fast, accurate and stupid!!! Human beings are incredibly slow, inaccurate and brilliant!!! Together they are powerful beyond imagination.”

Albert Einstein

Corpus-based investigation of natural languages has become a hallmark of contemporary linguistic research. It not only presents an alternative to the popular introspection-based investigations, particularly of natural language syntax, but also gives research a more interdisciplinary and applied orientation. The contemporary age is considered an age of information, in which knowledge creation, dissemination and acquisition are no longer restricted to traditional means and privileged persons; with the invention of the computer, the internet, the World Wide Web and social media, and under the force of globalization, even underdogs can produce, share and disseminate knowledge. Since the vehicles of knowledge, whether at the representational (cognitive) level or at the transactional (communicative) level, are concrete natural languages rather than an ideal natural language (which provides space for notions like universalism and lingua franca, and undermines the potential for creativity in individual languages vis-à-vis the community), the need for online representation of all languages has been keenly felt in the last few decades. This thirst on the part of speech communities cannot be quenched through introspection-based research alone; it needs a boom in empirical research on natural language so that probabilistic methods can be harnessed for technological purposes. Further, the need for human-machine interaction through natural languages has also increased considerably, which has been met by an increase in resource creation (both linguistic and computational) and in interdisciplinary research on natural language, resulting in the inception of entirely new fields of inquiry such as computational linguistics (CL), natural language processing (NLP) and language technology (LT). The current research is an effort of this kind: it creates a small-scale syntactically annotated corpus, i.e. a treebank for Kashmiri, and lays down the basic methodology for creating a large-scale syntactically annotated corpus which can be used for training various NLP algorithms such as syntactic parsers.

However, for creating a syntactically annotated corpus, a grammar formalism is of paramount importance, and one is astonished to see the wide range of competing grammatical models and formalisms. It becomes very difficult to prefer one model over another, as all the models claim the flexibility and universality to cater to a wide range of language data. The choice of framework vis-à-vis grammar formalism is, in itself, an interesting and less explored research area of experimental syntax, which is beyond the scope of this dissertation. Nevertheless, dependency-based representations have been considered more suitable for inflectionally rich (relatively free-word-order) languages, i.e. languages that are less configurational or positional, e.g. Czech, Turkish, Hindi, Urdu and Kashmiri. Since dependency relations are essentially syntactico-semantic in nature and directly encode the predicate-argument structure, i.e. the participatory relations of the various arguments or adjuncts, it has been argued that dependency-based representations are more suitable for annotated resource creation. This is not only because they cater to free word order but also because they are considered more suitable for a number of NLP applications (Covington 1995, Culotta & Sorensen 2004, Reichartz et al. 2009). Further, most languages of the world are inflectionally rich and have relatively free word order (Covington 1995), and thus in most treebanks there is a tendency to take into account morpho-syntactic cues, i.e. obliqueness, overt case markings or relational words (pre/postpositions), during sentence parsing. These cues also provide clear semantic information, crucial for NLP applications like MT. Since Kashmiri is an inflectionally rich language, there are also clear-cut morpho-syntactic cues associated with NPs or VGNNs, i.e. with the entities which are the participants of an action or an event, and these are crucial in sentence parsing. It has been observed that even though there is no hundred percent one-to-one correspondence between the case relations and the case markers or pre/postpositions which mark the dependents, such morpho-syntactic cues, along with TAM features, are very helpful in the syntactic parsing of relatively free-word-order languages. On the other hand, constituency-based representations have been considered better for fixed-word-order languages, where morpho-syntactic cues are minimal and the positions of the constituents dictate their grammatical relations, e.g. English and French.

Finally, it is very important to recognize the advantages and disadvantages of both frameworks when applying them to particular language data. It is equally important to approach the problem of framework selection on the basis of certain research questions: What essentially are these formalisms able to capture? Are they complementary to each other? How can they be helpful in developing treebanks and, subsequently, robust syntactic parsers? However, the choice of framework or formalism in this research is determined not so much by theoretical motivations and other technicalities related to the formalism as by the unavailability of the required resources in the Indian scenario. For instance, if one wishes to use the annotation scheme of the Prague or TIGER Treebank, one needs to be in constant touch with the people actually working in the area in order to obtain resources and expert opinions. So it is quite impractical to use a grammar formalism for which resources are unavailable or not easily accessible. Further, there is no need to reinvent the wheel, as other representations can be added later and algorithms are now available to convert a dependency treebank into a phrase structure one. Therefore, on practical considerations as well as on the basis of the principles for treebanking given in the second chapter, the model of the AnnCorra treebanks, i.e. Hindi, Urdu, Telugu and Bengali, has been followed.

Apart from the grammatical model, the most important requirement for developing the Kashmiri treebank was the primary source data, i.e. a Kashmiri text corpus. The major bottleneck in obtaining the corpus was the unavailability of any online resource, such as a newspaper, from which data could have been obtained. There is a complete vacuum of commercially important text domains (like medical and tourism) in Kashmiri. Therefore, KashCorpus was built for developing KashTreeBank. The selected sets of the corpus were manually pre-processed, i.e. sanitized, normalized and finally tokenized.

The selected sets of the corpus were converted into Shakti Standard Format (SSF) with the help of the Sanchay platform. Its SA Interface was used to build three annotation layers, i.e. the parts-of-speech layer, the chunk layer and the inter-chunk dependency layer. What matters is not merely adding the annotation layers to the corpus but the arrangement in which each lower layer of annotation facilitates the higher one. This arrangement is provided by SSF, which is also important for the machine readability of the dependency trees created during the annotation process. The fundamental annotation layer added to KashCorpus was the POS layer. Each word in a sentence was assigned a POS tag according to the BIS tagset for Kashmiri, which is coarse-grained and hierarchical, consisting of 11 top-level categories and 32 type-level tags. In the process, a total of 14852 words were classified into 11 POS categories with the frequency order N > V > PP > RD > JJ > PR > CC > RP > DM > QT > RB. During the annotation process, several issues were raised and resolved, and annotation guidelines were laid down to achieve consistency and intra-annotator agreement. Finally, the frequency of each category was calculated and cumulative frequencies were obtained.
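The frequency and cumulative frequency calculation over the hierarchical tags can be sketched as follows. This is a minimal illustration, assuming tags are written as in Appendix-I with a double underscore between the top-level category and the subtype (e.g. N__NN); the token list is a hypothetical sample built from Appendix-I examples, and `top_level` is a helper name introduced here.

```python
from collections import Counter
from itertools import accumulate


def top_level(tag):
    """Return the top-level category of a hierarchical tag, e.g. 'N__NN' -> 'N'.

    A tag without a subtype part is returned unchanged.
    """
    return tag.split("__", 1)[0]


# Hypothetical sample of (word, tag) pairs; words are Appendix-I examples.
tagged = [
    ("kul", "N__NN"),
    ("paka:n", "V__VM"),
    ("chu", "V__VAUX"),
    ("ku:r", "N__NN"),
    ("tI", "CC__CCD"),
    ("gu:r", "N__NN"),
]

# Frequency of each top-level category, ranked from most to least frequent.
freq = Counter(top_level(tag) for _, tag in tagged)
ranked = freq.most_common()

# Running (cumulative) totals in the same ranked order.
cumulative = list(accumulate(count for _, count in ranked))

for (category, count), running in zip(ranked, cumulative):
    print(category, count, running)
```

Applied to the full set of tagged words, the ranked list would yield a frequency order of the kind reported above.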

In the second layer of annotation, the same interface was used to add chunk-level information to the same sets of POS-annotated KashCorpus. Earlier there were four sets of corpus from three domains, of which the two sets belonging to the same domain were combined. Therefore, three sets of POS-annotated corpus were chunked based on local dependencies and discontinuities. Clusters of contiguous POS-tagged words that stand in a dependency or part-whole relation with each other were assigned a single chunk label. Chunk labels were assigned not only to groups of words but also to some solitary or discontinuous elements which defy the intuitive notion of a chunk. The V2 phenomenon, which constitutes 5.009% of the grammatical phenomena at the chunk level in the Kashmiri data, was also handled by positing an AUXP chunk. The most crucial issues, related to finiteness and complex predication, were resolved and annotation guidelines were laid down for consistency in chunking. All 14852 POS-annotated words were chunked and classified into 10 chunks. The increasing frequency order of the chunks is NEGP < BLK < RBP < VGNN < VGNF < JJP < CCP < VGF < NP.

Finally, the third layer of linguistic information was added to the three datasets. The 682 POS-annotated sentences of varying lengths from three domains, i.e. newspaper editorials (average sentence length, ASL = 25 words (Ws) or 15 chunks (Cs)), short stories (ASL = 11 Ws or 8 Cs) and critical discourse (ASL = 16 Ws or 10 Cs), were partially parsed into 8125 chunks. The inter-chunk GRs holding among the 8125 chunks were annotated. In aggregate, 4287 GRs were found holding in the 682 dependency structures. The 4287 GRs were further classified under 25 labels. The inter-annotator agreement was also measured for the syntactic annotations and was found to be substantial as per the interpretation scale of Landis and Koch (1977). The observed agreement was 0.777 and the kappa value was 0.7555. In short, the experiment conducted to check the inter-annotator agreement has shown that the annotators agree quite considerably on labels as well as on attachments, which indicates that both annotators have a similar understanding of the issues and the guidelines.

Appendix-I: Showing BIS POS Tagset for Kashmiri

S. No | Category Name: Top level | Category Name: Subtype (level 1) | Annotation Convention | Examples
1 | Noun | | N |
1.1 | | Common | N__NN | gu:r (milk man), kul (tree), ku:r (girl)
1.2 | | Proper | N__NNP | gulI marg (Gulmarg), Pahal gha:m (Pahalgam), huzaif (Huzaif)
1.3 | | Nloc | N__NST | heri (in upper storey), bonI (in lower storey)
2 | Pronoun | | PR |
2.1 | | Pronominal | PR__PRP | su (he-nom), bI (I-nom), tse (you-erg), hom-is (he-dat), yi (this), ti (that)
2.2 | | Reflexive | PR__PRF | panun (self’s-MAS), panIn’ (self’s-FEM)
2.3 | | Relative | PR__PRL | yus (who-SG), yim (who-pl)
2.4 | | Reciprocal | PR__PRC | akh Akis (to one another), pa:nIvan’ (amongst each other)
2.5 | | WH | PR__PRQ | kus (who-SG), kIm (who-pl)
2.6 | | Indefinite | PR__PID | kenh (something), kanh (someone)
3 | Demonstrative | | DM |
 | | Deictic | DM__DMD | hu (he), so` (she), hum (those), yi (this)
 | | Relative | DM__DMR | yus (who-SG), yim (who-pl), yAmy` (who-erg), yimav (who-pl-erg)
 | | WH | DM__DMQ | kus (who), kIm (who)
 | | Indefinite | DM__DMI | kenh (something), kanh (someone)
4 | Verb | | V |
4.1 | | Main | V__VM | paka:n (walks/walking), thovmut (kept), pari (will read), gindun (to play), tulun (to lift), tsalun/davun (to run), gatshith (having gone), gindnuk (of playing)
4.2 | | Auxiliary | V__VAUX | chi/chu (is), Os/A:s (was), a:si (will)
5 | Adjective | | | tshoT (dwarf), z’u:Th (tall), zabar (nice), asIl (good)
6 | Adverb | | | jaldi: (quickly), va:rI va:rI (slowly)
7 | Postposition | | | peTh (on), manz (in), tal (under), nish (near)
8 | Conjunction | | CC |
8.1 | | Co-ordinator | CC__CCD | tI (and), ya:/natI (or), magar (but)
8.2 | | Subordinator | CC__CCS | zi/ki (that), agar (if), zanti (as if), teli/adI (then)
9 | Particles | | RP |
9.1 | | Default | RP__RPD | ti (too), sirif/mAhaz (only), hish/hiuv (like)
9.3 | | Interjection | RP__INJ | alie! Oho!
9.4 | | Intensifier | RP__INTF | seTha: (very), va:riyah (very)
9.5 | | Negation | RP__NEG | na (no), ma (don’t), kehn (not)
10 | Quantifiers | | QT |
10.1 | | General | QT__QTF | kam (little), zya:dI (more), kehn (some)
10.2 | | Cardinals | QT__QTC | akh (one), zI (two), tsor hath (4 hundred)
10.3 | | Ordinals | QT__QTO | Akim (first), doyim (second)
11 | Residuals | | RD |
11.1 | | Foreign word | RD__RDF | It is fine
11.2 | | Symbol | RD__SYM | ، ، ء ، ،،
11.3 | | Punctuation | RD__PUNC | ؛ ! ( ) “ ؟ ، : ۔
11.4 | | Unknown | RD__UNK | ڑگ ،فاین ،از ،اٹ
11.5 | | Echo words | RD__ECH | tre:lI ve:lI (apple and the like), cha:y va:y (tea and the like), batI vatI (rice and the like), ma:z va:z (meat and the like)

Appendix-II: Showing POS Additional Examples Extracted from Dataset-4

N_NN ن، پ77وچھر، ، دور، ریاس77تس، امن، کوشش77 -ک ہوادی، ح77ال ہ 77ددلی، ، مویویس77ی، ب می77د، امک77ان، دور، بی77ان ، و -نگار، مستقبل-ک ہتجزی ۄ ہ - ، بی77ان، مقص77د، را وپن، کتھ-ب77اتھ، وزی77راعظم، خ77وش-آو ژو، گ ےجم ے ن� ٲش77ت- ، پ77زر، وزی77راعظمن، کنوکیش77نس، خط77اب، گ77روپن، کتھ، د عام رگرمیٮو، اس77تعمال، گردی، دوستی، اشتراکس، تعلقات، سر-زمین، س77چھ، خطاب، دل، ، ک ن س، سل -ی ، خوبصورتی، دن ن ۄسلسلس، یقین-د ۍ ٲ ٲ چ ۍ ۍ ہٲ ر، ن، رز، ش77 ر، موس77من، دلن، سالن ر ، خوبص77ٮورتی، ر-چ ، ش حص ٮ� ٲ ٮ� ہ ہ ہ ، ان77د، مس77لک، ح77ل، کتھن، دان ی، م ، تب پ ، بن77دوق، ژھ ، مس77ل ہاند، عالق ٲ ٲ ہ ۄ ہ ہ -در، س، ژک ، ام77ورن، پ77ارٹی، ک77ارکنن، لکھ، تھ نم77ا، ریاس77تک ۍباوتھ، ر ۄ ۍ ، لکن، ر، نی77ا گ، کش ، پالیس77ی، لتھ، م77ذاکراتن، س77 ن ن، ےجنگجٮ77و ی� ہ


، قدم، ، امن -ک ، ریاست -ک ن، تجویز، مسل -ی د، دن ہپیشکش، تشددس، د ہ ہ ہ ہ ۍ ۄ ن، مس77تقبلک، مل77ک، ن، ب77د -ک ، ریاست ج-وٹھ، مسل ، ب ٮ�چیرمین، تنازع ٮ� ہ ہ ٲ ہ ، ر ل77ک، کش77 77ر، م د، ک77تر-بت نی، جدوج ن، قورب ۍمسلس، اندبور، کشر ٲ ٲ ٲ ٮ� ٲ زی، سرحدن، ف77وج، ربتس، ف77وجی، ، سرکار، ت ک ی�بچاو، سرحدس، چھ ہ ۄ ، ر، رپوٹن، ف7وجن، ص7وبن، ٹینک ک ، باس، ملکن، قسمک، ژ ، کتھ ہطرف ٮ� ۄ ہ ہ 77و، ، چھوک ، جنگ 77ر، س77رحدس، ک77ام کن، حمل77و، بت ہبینکر، می77ٹر، گ77اڑ، ٹی ہ ن� د، ش77روع، حق77وقن، ادارو، احتج77اجو، م77وت، نی، رو رحدن، نگ77ر ن�س77 ٲ ن، خ77ونس، گج77و وٹلس، ج گج77و، وایت، ج ، ر نوج77وانن، وری، تھن ن� ن� ٮ� ہ ش، گیس، احتج77اج، افس77رن، ن، الر، ا ، نف77ر، م77ار، پلس77 ، ل77ڑ یش نIت ٲے ہ ن� ، س77رکارن، تحقیق77ات، ، پلس77ن، جم77اژ، پلس 77و-ج77وانن، گ77ول ہک77رکٹ، ن ہ میت، ک، ملکس، قراردادس، ا دعوی، معاملن، تر، پلسس، فوجس، ژ Iن یی، اج77ازت، ط77اقتو، ری، ملکن، توان ید، ٹیکنالوجی، اجار-د ، ف ، طاقت ٲآل ٲ ٲ ہ ہ یی، اس77تمعال، اس77تعمال، کتھن، ب77اوتھ، دورس، م77ذاکرات، پ77وت، ٲتون ری، -د ی77ٮراد، آبچ، حص س، رنگ، ضرورت، ٲمسلن، دل، اند، ملک، میڈیا ہ ھ ٮ چ ، صحت، اث77ر، ، سرکار، تموک، منشیاتک ، تعلیم ۍبحث، حمایت، ریاستک ہ ۍ ر، ، نن ، پرن77اونس، س77نجیدگی، کتھ ٮ�عن77وانس، جمژ، نص77ابس، مض77مون ہ ٲ ، تقریر، قونون، ادارن، کتھ، باند، قونونس، کم، سماج، منشیات- ٲتقریب ٹ، ، کتھ، منیش77اتس، ر ، کث77افت ، کاربار، بیدری، زور، ریاس77ت، تم77اک نIک ہ ہ ٲ ہ د، لی77درو، ، عم77ل، امک77ان، پ ، میٹنگ ٲکمبر، سرکارس، استھواس، رازدان ہ ، وتھ، ق77رار، مالق77ات، ری، ماحول، تالش، لی77ڈرن، مالق77ات -د ہوزیرن، ذم ٲ -یس، موق77ف، نف77رن، 77ک، بتھ، مخ77الفژ، ور ۍبی77ان، پ77ارلیمنٹس، مخالفت ، م، ع77دم-تش77دد-چ ، آزدی، م یی، مع77املس، واویال، ملکس، حمل ہک77ارو ٲ ہ ٲ زی، چن777اون، دھان777دلی، رش777و-خ777ور، تختس، ، منظ777رس، دغاب ، پ ٲوت ے ہ -ستری، نفر، نع77رو، دور، ح77د، ند، زندانن، زنانن، ب ن، پ ےطاقت، چانٹ ٮ� ٮ� س، ، دعوی- ، حق، گصف، بد-قسمتی، حص چ کھ، لو چمصیبت، ستم، د ہ ن� ۄ -ون، ، ژین ل 77و، حقیق77ژ، ورن، س77وتھرس، حق77وقن، پام پی ہورب77ر، رقم، ر ۍ ٲ ۄ ر، دران77دزی، ری، ریڈالرٹ، عم77ارژن، کھ ، تی شت-گرد، 
پھاس، حملچ ٲد ۄ ٲ ، -ون ین، وژھ، زال، معلوم، بارس، ژین ہسرحدو، تصدیق، حکومتن، کارو ہ ٲ


تس، یی، فص77لس، وزی77ر، ژ -اف77ز ٮ�حفاظت، فص77لس، پ77ذیریی، حوص77ل ٲ ٲ ہ ٲ ٲ ، ادارس، ، پھش، خطس، س77ن 77و، تھ77ام ی ن، ور نم77ا نٹس، ر ہنشست، گ ۍ چ ٲ س77تادو، ن، خ77وش-قس77متی، زب77ان، انقالب، عن77وانس، و ۄت77ربیت، دو چ ن، روح، نشورو، یاد، شمٮولیت، مس77لمانن، علمس، اچھ77و، ح77التھ، د د ٲ ٲ ار، ن، اظ دگ، انسان، دم77اغ، وز، کتھ، مج77ال، موض77وعس، ج77ذبک، د چ ۄ ن، ، علومن، عبور، س77ر-پرس77تی، س77رحدو، جنگج77و ثبوت، مستقبل، وت ہ ریخ، ، س77المتی، خط77ر، عالقن، ذک77ر، عالق77و، طالب77انن، اج77را، ت ٲمق77ابل ہ راجس، یادداش77ت، تقس77یم، یستام، لکو، ریذڈنٹن، ت یاداشت، عمارژ، و ن� دان، لکھ، ، وف77د، م ٲمبصر، مقام، لکن، استان، مطالبن، یاداش77تن، دش77 ، ، قبض ، نیچ77وس، مک77انس، س77اعتک ہظلمس، احتج77اج، عم77ارت، مک77ان ۍ ہ ، ج77ون، لکار، ادار، جنگ-بندی، نظ77ر-گ77زر، پروگ77رام ، ا س -د س ، و جای ہ ن�ۄ ہ ۄ ہ ، تقریٮبن، شرکت، دورک، علیحدگی-پسندن، شٹھ_نی77ار، ر ن، سرک ۍدو ٲ چ ، دع77وتس، اعتم77ادک، دورس، ن77وعیتس، کسن ۍپروگرام، ج77والی-یس، پ ٲ ٲ ، ع77زت- ن، کت77اب ، لکھ77ار ، تق77ریی س77وار، د د، ب ہم77احولس، س77وال، پ ٮ� ہ ۄ ٮ� ن� ٲن، اعزاز، عطا، شعر، ، زبانن، کتابن، کتا ٲافزیی، اکیڈیمی، صدر، ینام چ� ٮ ہ ٲ ریخ-دان، لکھ77777777777777777اربو، دانش77777777777777777ورو ٲنق77777777777777777اد، ت ، ت77و، ش، ر -گی77ور، مزاک77راتن، ش ، ج77واب، وز، خ77بر، گت نٹ ، گ ر کھ ٮ�ل ۄ ہ ہ ٲ ۍ ٲ ٮ� ، چھاو، وت77اولی، ، آش، بت ، تجربن، کنڈ ٹ ج، ژکس، لکن، ر، ن ر-چھ ہچھ ۍ ۄ ۄ لک77ارن، ر، خط77رات، وس77یلو، علحی77دگی-پس77ندو، ا -گ ژھ ، پ ط77رز، عمل ٲ ہ ن� ہ رت77الن، احتج77اجن، انکش77افس، ، کھ، خ77برو، منص77وب ، اتھن، پھ نش77ان ہ ۄ ہ ، ج77اداوس، ددار، ن، ٹھر ر، گاڑ نی، بلو، مظ ، کرست ن ۍتاثرات، کرست ٮ� ٲ ٲ ٲ ۍ ٲ ٲ ٹ، -ر نIج777رم، سوس777ایٹی-ین، اک777ثریت، عالقن، امن، س777رکارس، اتھ ہ ، ، زن77دگی، گیس77ک ، سیاستدانن، معمٮ77ولچ ۍنظامچ، جرم، کاژ، ازا، مثال ہ ہ رین، بان7د-بان7د، ین، ش ، ن، چھک7ار، ک77ان 77ژ، گ7ول ش77ل، مخ77الفت، اکژی ٲ ہ ٮ� ، ک ر، می77ڈیا- رت77ال، س77ڑکن، تش77دد، وط ۍشکایژ، سڑکن، دار، جوان، ہ ی� ، میل-واجن ، راز، ش77کل، ش77 ، تولق، سپ ، داستان، سنگ-باز، مذاق ہذرای ٲ ہے ہ ژ، راج، 77تر ، ب ، وقت 
م ، س77 فاظتک، التجا، رازن، گر، ع77رض، گل ، ح ٲزنان ہ ۍ ٮ� ہ و، ، گریس77ت بر، آسمان-کس، چھتس، ٹھر، دلس، پریشن ٮ�خوف، ناگ، ا ۍ ٲ Iن


س، حکن، م77ارکس، ص77 ا، غ کھ، ژور- و، ک ، دای77رس، علی-ج77ا ۄتم، ژکھ ن� چ ٲ ہ ، ناگو، و، حکم ، دیا، کن، حکم، بیٹر ن، رو نی، پاس، پادش رب ہرحم، م ٮ� ے ٲ ر-یقی77نی، 77و، افراتف77ری، غ -ی -ی77ا، تش77دد، ور رن، قون77ونن، دن ٲبرتھ، کھ ۍ ۍ ۄ ، ، تنخ77وا ، مع77امل ، بح77ران ڑت77ال زمن، -کس، آغ77ازس، مل -ی ش77کار، ور ہ ٲ ہ ۍ س77پتالن، ن، ، ریاس77تک، ص77ورتحال، ص77ورتحالس، و پنش77نک، مع77امل ۄ ہ عث، س77رکارک، ربتن، فض77انن، ب م77ارن، ن77ادارن، مش77کالتن، ن ٲلوکن، ب ۄ ٮ� زمن، چن7اوس، ، مل ، مرک7ز، ب7دس، کوشش ، بقایاج7ات، ادا، روپی ٲمطالب ہ ہ ہ ر، تش77ویش، ف77د، کش ، ، پ77ارلیمنٹ-ک ی�وع77د، ف77روری-یس، س77مجھوت ٮ� ہ ہ -کس، ، ممبرو، اتف77اق، ع77الم د، کاز، کوشش وپن، ع ری، اند-بور، گ رفت ن� ٲ ، زش77 ، زی77ر، س ، مس77لمانن، حوص77ل نمایندن، دعوت، سزش، امام، چھ ٲ ہ Kہ ٲ ، سزشن، مس77لمان، ، حکمو، سزشن، دس، امامن، لڈالی ٲلڈاین، لحاظ ہ ٲ ٲ ہ ن، ی، ایمپ77ایر، س77رکار-ک ، حکومتچ، پش77ت-پن ٮ�نا-انصفی، ظلمن، لڈ ٲ ٲے ٲ ، ، وزیراعل77ی -کس، اجالسسس، ریاس77تک ٹس، مفاد، ایوان ۍتعلقاتن، گ ۍ ہ ٮ� ن� ، مف77اداتن، ، اش77ار، مج77را رن، ش77ریکن، وت ، پھ ژ، ن77ال ، جم س77رکار-چ ہ ٮ� ہ ٲ ہ قدمن، ترجمانن، ، پنڈتن، م ٹس، مفادات، کال ن، تعلقاتن، گ ۄنظر، در ہ ٮ� ن� ٮ� ، بنی77اد- و، تح77ریک ہبٹن، نسل-کشی، اتھس، الزام، پن77ڈت، لفظن، در-ک ٮ�، -ین، م77الک ، ور ین، ڈوٹھ-ت ن، ژ ، وتش، طبقن، اتھواس77 ۍپرستی، جام ۍ ن� ۄ ہ دن، ب77وس، 77دس، نمی 77د، ون ، ون ، س77ربرا ن�لیڈر، سوالک، تش77دد، ف77وجک ٲ ۍ ، دستن، ، کوشش، کوششن، جنوری، کوشش حاالتن، وذارتن، زرایو، لٹ ری، رس77تس، گرفت یو، ف ش77ت-گ77ردن، ک77ارو ن، مزید، جنگجو، د ٲجنگجو ٲ چ شتگردی، جنگس، فورسن، اتھواس، آیتن، پاکس77تان، آپریشن، انجام، د ، مرحب77ا، ین، دنی77ا ش77ت-گ77ردن، ک77ارو اعتراض، اتھ77واس، درخاس77ت، د ٲ تھی77ار، مم77برن، نماین77دن، ین، ، دع77وا، ور کھ77ن پروگرامس، پابندی، و ۍ ے ۄ یس، ۍپررگ77رامس، ق77راردادن، ب77وژ-ش77وژ، مش77اورت، چ77یرمینن، ور س، زن77گ، ڈ، کالس، تش77ددس، ، کھ -وٹ ن بی، کھ ہتھیارن، یورینیم، کامی ۄ ہ ہ ۄ ٲ ، برص77غیرس، امنچ، ض77مانت، تل77وار، سمٹس، خطابس، صدرس، تنازع ، ان77داز، -ک ، پاونڈ، مول م متس، نیالم، لچھ، ساس، پونڈن، نیل ہریکارڈ، 
ہ ۍ ٲ س� ، منظ77ور، حص77ل، پ ، ر پی ، گ77ورنر، ر پ ٲص77دی، چان77د، ش77یر، اعالن، ر ے ۄ ۄ ے ۄ


الک، ، ، پیش، بن77دوق-ب77ردارو، گ77ول ، واق 77ژن، سلس77ل الک ن، -ک ح77ال ۍ ہ ہ ٮ� ہ ر، ل777وکھ، ص777ورژن، ایجنس777ی-ین، ڈر، ، اض777طراب، ظ ٲشخص777س، آی ہ -چن، دی، نفس77یاتس، ف77وجکس، رد-ب77دل، ملک 77ژو، آب الک ہاض77طرابک، ٲ کرس، ش77تگردن، افس77پا-کس، ایکٹ، ک ، موجودگی، سوالس، د یی ۄکارو ہ ٲ ، تن77او، مش77ور، -کھ77اتس، دسپوس ہزن77گ، ط77ور، اعتم77اد-سزی، گ77ال ۍ ٲ شٹھنیار،مالقاتسN_NNC وا، حکم77ران، جم77اعت، ، جنت، آب، د، وت علحیدگی، پس ہ ن� ، ڈیرس، خود- ٹ ژو، ژھ ، وزیر، پسندن، کتھ، پسندو، جنگجو، جم ہگاڈ، وت ۄ ٲ ہ

از، اڈ، اڈ، 77رس، می77ڈیا، ذراین، ش77مال، ج ، ٹک ن ری، تج77ویز، زم خت م ہ ی� ٲ ۄ گ، ، ج لک77ارن، کن ن، طلب، علم، ف77وجی، ا ، ورش77 ن�طالب، علمن، گ77ول ہ ٲ ہ ، تعلق77ات، وزی77ر، موص77وفن، ر ۍعلمس، قتل، عامس، امن، مقصدو، باپ ٲ ، پھیر، بچ77او، ، کوچ ، گل ، کار، عمل، کوچ ، سیکریٹری-ین، طریق ہخارج ۍ ہ ہ ، ش77یچھ، 77ن 77و، ک ، سٹیشن، س77رحد، ایجنس77ی-ی ، اڈن، ریلو ۍبجٹ، عمل ے ، ڈاک77ٹر، ٹری77ول، ای77ڈوایزری، ب77اتھ، قس77مت، علم77و، س77یرت، کانفرنس اس77رارس، دل، مل77ٹری، اوب77زروز، عم77ارت، ڈوگ77را، راج-کس، تعلیمی،77رن، اوب77زرور، گ77روپن، اعظم، ج77ون، ربتس، ن، کاربار، پنڈ، پ ل ر، ق ی�م ٲ ٲ یمچ، کلچرل، اکی77ڈمی، ام، تف ، طور، طریقن، پوت، منظرس، اف خارج ، وایس، چانس777لر، س777ینٹرل، یونیورس777ٹی، ڈیوجن777ل، کمش777نر، ۍاعل777ی ، ، لیج ، بت ، ت77وو، شوش77 پ ، زی77و، پی-تس، ر، حریت، کانفرنسک کشم ہ ہ ن�ۄ ۍ ی� ، جن77گ، ، جنگس، س77نگ، ب77ازن، کن وم، منس77ٹری، کن ہمرک77زی، گریل77و، ، ارس77اتس، ٹ77یر-گیس، ، چھرک77او، لک ، آب نج ، و ری، کن 77نز، ب77رد ہاعلی، ک ٲ ٲ ، بی77ٹر، ، داں، فص77ل ک ، چھ ٹھ ہشل، حکمران، رحم، دل، دیا، ساگر، ڈو ہ ٮ� ہ ن� ، -وارن ، کم، م77ذکرات، مس77لمان، نوج77وان، ف77رق ژن، ت77ال ہپتھ77ر، زو، ذ ہ ٲ ہ ٲ ، پتر، پن7ڈت، ب77رادری، گ7ر، خ7بر، اس77مبلی، ر، لچھ ک فسادو، اندری، ژ ٮ� ۄ

، ب777ردارو، ژر،۲٠٠۴مم777بر، رس، اس777لح ، عیس777وی، جم777وں، کش777م ی� ، طالب777ان، کمان777ڈر، وزی777رن، ، ج777ا ، خرج ےعیس777وی-یس، بتھی، الگی ٲ ہ ، نیالم، ، ژھیپ ہکمانڈرن، عالمی، ط77اقتن، ج77وانٹ، چی77ف، آف، اس77ٹافک ۍ

ر، تعلیم، پ77دم، ش77ری، اع77زاز، م77احول، متس، م ہگرس، ریک77ارڈ، ٲ ،۲٠س�


نٹ77ل، کلنکس، ب77اعث، تش77ویش، ملیٹنٹ، تنطمین، ٮ�م77ارچ، ش77امس، ڈ ، سنگر، مال، تجزی مشرقی، ریاستن، ملک، دشمن، عنصرن، ان ۍ ،

N_NNP ، -چ مالی ر، کش بھارت، کستان، پ کستانس، ہپ ی� ٲ ٲ ن، س77اگرن، این- ن، این-س77ی، کش77ر ر، نگ نگر، کش ر ر، س77 ٮ�کش ٲ ی� ی� ی� ی�ندوس77تانچ، اروناچ77ل، ندوس77تان، ندوس77تانس، پاکس77تانس، س77ی-یچ، ، نش77اطس، ندوس77تانن، بی-ایس-ایفک ، ندوس77تانک ۍپ77ردیش، چینس، ۍ ، پاکستان، بھوٹانس، سیاچن، ک، چین-کس، پاکستانک - ، امریک ۍایرانک ہ ۍ ن، مص777ر-کس، مم777بی، ، تمپھ777و، پاکس777تان-ک ٮ�س777رکریک، بھوٹ777انچ ہ

ن،۱۹۱٠ ندوس77تانک، ام77ریک ار، ن، کش77یر، ی77وپی، ب ندوستان-ک کس، ٮ� رن، ، ممبی، بنگلور، چندی-گڈھ، گج77رات، ش ، امریک -چ بھارتن، امریک ہ

، ک - ۍام777ریک ہ ، علی، گ777ڈھ، اس777المچ، اس777المس، پاکس777تانکس،۱۹۸۷ ، پینٹ77ا-گ77ونن، و، تالب77انن، پینٹ77ا-گ77ونن، پاکس77تان ندوس77تانک ہپاکس77تانن، ٮ� ، ک - ، ٹین77گ، گپک77ارس، برط77انی ر س، پینٹا-گنک، سرینگرس، کش77 ۍفاٹا ہ ۍ ٲ ر، ن، اردو، کش77 _ ر_کس، عم77ر، عب77دالل ٲپاک، تعلق77اتن، جم77وں، کش77م چ ی� د، میلولن77گ، ی، ل77ل-د اڑی، پنج77ابی، وی77د-را ندی، ل77داخی، پ ٮ�گوجری، ر، پالی، گیالنی، برس77یلز، اس77الم-آب77ادس، ی77ورپی، پ77ارلیمنٹس، ی�‘کش

،۲٠٠۸، ۲٠٠۳نجیب-آب777اد، پی-ڈی-پی، کانگریس777س، ۍس، س777نگرامک ۱۹۸۹ ،۲٠٠۴، ن ، پی-چدمبرس، اسالم-آب77اد، پاکس77ت ن ندوست ۍ، عیسوی، ٲ ۍ ٲ

، پاکس77تان-کس، قریش77ی-ین، نس77تانک ، افغانس77تانس، افغ ن ۍافغنس77ت ٲ ۍ ٲ ٲ ، چین، ف77رانس، روس، ران، برط77انی ن، ت - نیویارک، ایران-کس، امریک چ ، افغانس77تان، ، بھ77ارتچ جرمنی، ایران، واشنگٹنس، سرینگر، پاکس77تانک ۍ -کس، ندوستانس، امریک بلوچستان، افغانستانس، بھارتچ، واشنگٹنس، ، نیویارک ، افسپاچ ال ک، سوپر، ت اراشٹرا ہلندن، پاٹلن، پاٹل، م ہ ہ ن� ہ ، N_NNPC ، ح77زب، نگھن ن، س77 ہن77و، دل وزی77ر، اعظم، ڈاک77ٹر، منمٮ77و ہ انگیرن، ، ج ر، زرعی، یونیورسٹی، مغل، بادشا دین، شیر، کشم المجا ی� ، علی، ک ۍجھیل، ڈل، نیشنل، کانفرنس77ن، نیش77نل، ک77انفرنس، کانفرنس77 ن، سنگھن، جمٮوں، لبریشن، فرنٹ- محمد، ساگرن، وزیراعظم، منمو

،۱۹۶۲کس، یاسین، ملک، اروناچل، پردیش، چین، جموں، ہ، عیسوی_ک


، وام77ق،۳، الل، چوک-کس، عنایت، خان، ی۲٠۱٠، جنوری، ۸ ، کدل ہ، راز ے د، ۵ف77اروق، ، غ77نی، میمٮوری77ل،۱۱، ف77روری، زا ، ک77دال ہ، ج77ون، راز ے

سٹیڈیمس، طفیل، متو، ناوکس، صدر، محمود، احمدی، نژادن، اق77وام،، ، قریش77ی-ین، س77ارک، پ77یرزاد متح77د-کس، س77المتی، کونس77ل، ش77ا ، رس، س77نگھس، اعالنی س، ش77رم، الش77یخ، ش ، اجالس77 سعیدن، سربرا

، ن، مم777بی، حملن، بھ777ارت،۲٠٠۸، نوم777بر، ۲۶ہن777و، دل ٮ�، عیس777وی-ک ن77د، پ77اک، اس77ٹنٹ، س77یکریٹری، آف، س77ٹیٹ، ف77ار، پبل77ک، س77رکارن، ، نگھ، یوس77ف، رض77ا، گیالنی، کانفرنس ین، س77 ، ک77رول ہافریس، پی، ج ے ے س، اسرار، ادار، تصنیف، و، تحقی77ق، ادارس، گ77ڈھس، ، تھمپو ہراج-دان ہ ، گ77ڈھ، اس77رارس، ذاک77ر، نای77ک، پروفیس77ر، ، پیغمبر، اس77ال ن نمعربی، زب ۍ ٲ ، خ77دام، الق77رآن، موالن77ا، امین، احس77ن، اص77الحی، ، تنظیم کلیم، الل

ک، ح77777د نگھس، گیالنی-یس، م ن، س77777 ہمنم نت Kعیس77777وی-یس،۱۹۲٠ٮ� ، کی، مم77بر، ٹینگن، ، کانفرنس77 ، شیخ، عب77دالل ، اقوام، متحدک ندوستانک ۍ ہ

، ، جن77وری، 1947ہجعف77ر، خ77ان ، زراع77تی۱۹۴۹ہ، عیس77وی، متح77د-چ ، ، ایس، ایم، کرش77نا، اس77الم، وم، منسٹر، پی-چ7دمبرم، خ7ارج تعلقاتن، ن، انٹرنیش77نل، رس، منم77و آباد، س77نگن، س77رینگر_کس، مس77ل، کش77م ی� ہ خن، فیاض، ف77ورک-ل7ور، ج7ان، ، س ن ، می ۄکنوینشن، سینٹرس، کامل، ی ۍ ٲ ہ وس77کتا، تھ77ا، س77رنگ، ی، ار، میں، سمندر، حکیم، جان، غزل، شام، ب ربھجن، س77اغر، جوسیل، شارشمس، بشیر، مرزا، رنگ، رتن، گل77زار، ، ترنم، ریاض، میرا، رخت، سفر، عمر، مجید، نم77بر، رومی، بند، درواز ، سون، ادب، پنجابی، واح77د، قریش77ی، نس77یم، لنک77ر، مب77ارک، ۍبیگ، کٹ ، س، مرک7زی، جی، ک ےگل، دیون7در، ران7ا، اٹ7ل، ڈل7و، آل، ان7ڈیا، ری7ڈیو ن، فضل، الحق، -وال ، چدمبرم، دل ٮ�پالی، میر، واعظ، فاروقن، سنگھ-ن ہ ہ ، ، تنخ77وا -م ژ، ش77یی ک، ریاس77 س، ریاس77تس، نینش77ل، کانفرنس77 - عبدالل ہ چ

، ی77ورپی،۲٠۱٠، ن77و، پ77ارٹی، گ77روپ، کش77میرک، ۲٠٠۹، ۲٠٠۶کمیش77ن، ن، - ، سید، احمد، بخ77اری-ین، رفی7ع، ال7دین، بخ7اری، عب7دالل فت چیونین، ہ 77ک، بش77ارت، ج77نرل، دیپ77ک، ن، پیپل77ز، ڈیموکریٹ چک77انگریس، رمن، بال

س،۲٠کپورن، چ، مارچ، مال، عبدالغنی، برادر، حامد، ک77رزای-ین، متح77د-


س، ٹیپ77و، س77لطان، س، گیالنی-ین، ایجنس77ی، را ن، متح77د چکونس77ل-ک چ ٮ� ، گ77ورنر، ڈی، یش77ونت، راو،۲٠٠۳ ک - ، تریپ77ور ن، اگ77رتل - ، م77الی ۍ، وج ہ چ ے

ر، س77رینگر- ، س77لیم، ڈار، ش ، سوپور، س77ری ال ، ڈاڈسر، ت پاٹلن، تریپور ہ ہ ن� ، س77نگھ، 77و، دل، چی77ف، ویک ر-نگر، عالقس، ن ، کالونی، جوا ےکس، بمن 7و، ، ن ، مس7ل ہآرمڈ، فورسز، سپیشل، پ7اورس، ایکٹس، کمش7یر، وی-ک ے زی ن اعظمن، الزم، ترشی، بیان، ب ، واشنگٹن، آباد-ک ٲدل ٲ ٲ ٮ� ہ N_NST ، ، پت ، تیلی، ییل ، ی77ور، از-ک77ل، ت77و-ت77ام، ی77و-ت77ام-ن ، پت و ہب ہ ہ ہ KہYن ن� ، ، ح777ال ٹھ، اول ، پ ، ات ، ونیس777تام، ون ، ییت ن ، و ، ان777در ہےح777ال، ح777ال ہ ٮ� ہ ہ ہ ۍ ۄ ۍ ے بر، ، ونیس77تام، یوت77ام، ن ن، اور، ی77ورع، تل -ک ال، ون س، اپ77ار، ٮ�ابت77دا ہ ٮ� ۍ چ ، پتھر ، تیتھ ، تیل ےب777777777777777777777777777777777777777777777777777777777777777777ر ہ KہYن Iن ، ک، س77 س، ی ، ونیستام، یو-تام، وں، پرس، ی ، پرس، توتام، یوتام-ن برو ہ ہ KہYن ر پ ز، ٲم ہ ن�PR_PRC ، اکھ-اکس، اکھ، اکس ن -و ۍپان ٲ ہPR_PRF ، ن، پ77ان ، پنن ، پنن 77ن ن، پنن، پن ، پنن، پننس، پنن ، پنن 77ن ہپن ٮ� ہ ۍ ٮ� ہ ۍ ، پننس، خود ےپانPR_PRI ، کی نس ، ک ژھا، کا نKYکی ٲ KہYن ن�PR_PRL ، ، یوس ، تمن، یوس س ، ی س ، ی ، ییم ک، یم، یتھ، ییم ہییم ہ ۄ ہ ۄ ہ ن� د ، یس ، یس، ییم تھ، ییم ن�ییت ہ ٮ�PR_PRP ، ، ام ، ییم ن7777د ، یتھ، تمن، ی ن7777د، تم ، یم، ت ، اس ہاتھ، ی ہ ۍ ۍ ۍ ہ ، ، ام ، امس، ب د ، تم، ت د، تتھ، یی، یمن، اس ، تم777و، تس777 ، س777 ۍت ہ ۍ ن� ہ ہ ن� ۍ س� د، ، ت ، ام ز ن، ت د 77ک، ت دس، تمی ، تمس، تس77 ، سن ، م ن�تس، تم ہ ے ن� ہ ٮ� ن� ہ ن� ۍ ٲ ھے ہ 77777ک، ، امی د، ام ، مز د، ام ، سنس، ت د، س ، ت ز، س77777ان ، ت ی، بی ہت ی� ن� ہہ ٲ ۄ ن� ہ ہ ن� ہ ے ، امک -چ ز، ام و، ت زن، سان ، یمو، ی ، تمو ۍسور ہ ہ ن� ہ ٮ� ن� ہ ے ے

DM_DMD ، یمن، یمن، ، یم ، یم، اتھ، ییم ، تم، ی ، تمن، اتھ، ی ہام ہ ہ ہ] ہ ، یتھ، و ، ی ، اتھی، تم ، یس، امس، ام ، ام ، امی، تتھ، س ، ییم ، ییم ےام ہ ہ ، ی ، یمو، ی ، کن ، ییم اتھ، ام ہ ہ ۍDM_DMI تام، کنس ، کیا- ، کی ، کن ہکا ٲ KہYن ہ KہYنDM_DMR ، یس، ییم ہتمن، یمو، ی ہ


V_VAUX و، گ77ژھن، ، اوس، آمت، ت ، آمت -ن ، چھ، دی ، چھ، چھ ن�ٲچھن ۍ ہ ہ ے ہ ، ، س77پد، چھن س ، آمژ، وژھ، گ77وو، گ77و، -مت ت ، ین، چھن، ہگژھن، یین ۍ ٲ ۍ ۍ ھ ۍ ، 77وان، روزم77ژ، آی ، یتھ، آمژ، س، پیومت، ینس، گوژھ، ی ، یی ہدوان، چھن ٲ ہ ہ ، ، رودمت، آم77ژن، آس س، گی ، آو، اوس77 ، آم77تی، یی -م77ت ، اس ، پیی ہآ ہ ہ ۍ ۍ ہ ے ، چھ، ، آو-ن ک777ون، ووت، رود کن، آمت، ین، آس777ان، ، یک ہچھ777و، ۍ ٮ� ہ ٮ� ہ ہ

، چھ -ن ، چھن، یوان، یی ، گژھ، یین ھےکور-ن ہ ہ ہ ہV_VM نمت، ریم77ژ، و ، ، آمت پدن -چن، ل77وگمت، مان77ان، س77 نIک77رن ہ ہ ہ ہ ن، درتھ، یژھ7ان، س7پدن، ، اس77 ن، کران، چھ، ک77رن ، کرن، و ہکرنچ، چھ Iن ے -وول، تمت، ب77دلن، ین ، چھ، آس77نک، د پدتھ پھلیمژ، س77 77وان، پھ77ا ہکر، ی ن� ہے ن� 77ر، ، ک ، انن ٹ77اون تمت، روز، گژھن، انزراون، کر، ان77ان، ہینس، گومت، و ہ Iن ژھن77اونس، ن، و -ن77و، دتھ، د -ن77و، ژل -ن77و، ڈل ، استھ، پز، ال ٮ�کن-ن ن� ے ے ے ہ ٮ� ہ، ، ک77رن ، رلتھ، میلتھ، دن -وال 77ک، ین ، وون، ونن، کرن ۍس، کرم77ژ، چھن ۍ ہ ہ ہ ٲن، لجمژ، چھکراون ، سوزنس، رلن-وال ہسپدیومت، بناون ٮ� ہ ، 77ک، ، منگن77اومژ، بچن ، منگن77اون نمت، وڑاون ، و ، س77وزن 77نی، ونن ہدوان، رن ہ Iن ہ ہ ، ، ژای ، چل77ون ، م77ارن ن 77نیمژ، پ 77ژان، ب ، ن -م77ت 77ر ، ک -ن س ، -م77ت و ہبن ۍ ہ ہ ٮ� ۍ ۍ ہ ۍ ٲ ۍ ۍ ٲ 77رن- ن، چالو، پ ن، انن-وول، ک77ور، چالو، گن77دن-وال ، ک77رن-وال پد ٮ�س77 ٮ� ۍان، دنس، امت، ووتمت، ب77نیمژ، پ77راونس، ک77رنس، ، د لس، ک77رنک ہۄو ن� ۍ ٲ ران، ، س77پد، س77 کن، ان77زراون ، ، دنچ، چھ 77ن ، یتھ، س77پدن-واجی ہواتن ٮ� ہ ھے ۍ ہ 77ژھ، گن77ڈتھ، س77پزمژ، س77پدنک، ، گ ، دیت، انن یک ، بچتھ، 77ن ہس77وچان، یی ہ ۍ ، ، سپدن، آس77ن ، وچھن، رود، ونان، سپدان، سپدمت ، دن ہدژمژ، کرن، دن ۍ ہ ، 77رن ، گن77ڈتھ، ک تراون یزان، و ، گ ، دژ، کرتھ، برن ناون ، بی ، پکان، آی ہمانن ۍ ٮ� ن� ہ ہ ہ ہ ، ک77ورمت، -م77ت ، لگ اون اونس، س77پدمت، پیم77ژ، ت ، گ77ومت، ت ۍخرچ77اون ۍ ہ ن� ن� ہ -کس، ، ک77رن ، ب77ڑاون -م77ت -نس، لگ وتھ، وون ک77و، بل م77ژ، وتھ، ون ر ہٹھ ہ ۍ ۍ ہ ٲ ٮ� ہ ۍ ٲ متت، مژ، د ، روز، گ -مت ، سمکھن، پاونس، سمکھ -مت ہن�وونمت، سپد ٲ ۍ ۍ ۍ ۍ اوان، سمکھیووس، اوسس، سمکھان، کھنچ، پ ، اوس، آمت، -ون ن�پوش ٮ� ہ ہ ، ژیون، -نم، رودس، لبن ن تھ، و ، و ہبوزان، ورتاونک، ژستھ، گژھان، چھن ہ Iن Iن ہ - ، س77وز ، س77پدن زن ، س ٹاون، لڑن ، تلمت، ، نن ، گاران، بناون - ۍچھا، ڈل ہ ہ Iٮ ہ ہ ہ ہ] ہ

، روزان، رٹ، و لس، ر ، گن77ڈان، ک77رن-و 77ژھن ۍمتھ، روزتھ، ب77نیومت، گ ٲ ٲ ہ


ن، س77رن، س77ونچن، ونن، _ک 77ک، ک77رن ، پکن77اون، گژھن ل ٮ�تھوتھ، واتن-و ہ ۍ ٲ - ، ین -تھ 77ژھ ت، گ تمت، د ، آو، د و گر ، ب کھن ، ل ، پک ا، آس یک ، ہپیی ہے ۍ ہن� ن� ۍ ٲ ٲ ہ ٮ� ہ ہ ن� ہ

77رتھ، ج77ڑتھ، ، ک - یک ک77ان، باس77یوو، وی77ان، ، وچھو، وسان، کڈتھ، ہ]چ ہ ٮ� ہ ہ ، ، واتن77اون، تھ77ون 77ن 77ژھنس، ان ، کرن، گ ہبچوو، ننان، وونمت، کرن، باون ۍ ۍ ، ، چالون، رودمت، اس77تعمال، ک77رنک ل -نس، ک77رن-و ، الی ہتھ77وان، رٹن ۍ ٲ ہ ہ ، باسان، واتن77اون- اون، بناون -ناونچ، ، دپس، پان ، ونن ، چھکن ۍنیران، رٹن ہ ہ ہگتھ، تھ، ل س77 رتھ، ، کھ -م77ت روو، کورن، گنڈ ، ٹھ ٲوول، گژھتھ، دراو، گی ٲ ٲ ۍ ۍ ہ -ت77و، 77ر ، ک ن ، ی ، ب77وزتھ، پ و -ن تھ، ژل ، ، کھ77وژتھ، پپن-وال -واجن ۍین ہ ٮ� ہ ہ ن� ۍ ٲ ہ ٮ� ہ ہ ہ ہ ،اسان، کریوکھ، دژ، نیمتھ، 7777رن ٹتھ، ونن، س7777پدنس، ک7777رنچ، ک 7777ر، و ، ک ، روزن-وول، زینن ا د ٲ ہ ے ن� تمت، س777مکھیوو، دیتکھ، ، گم777ژ، د زراون ہن�عمالونس، ٹ777ال-مٹ777ول، ا ہ ن� ، کھولمت، ین، ین، تل، تھونس، ، سپد، مناون ، تھون، پکناون ہبوومت، لگ ہ - ، روزتھ، ول و ن، ک77ڈن اوو، بنیمت سمژ، آسن، نتھ، ت ، کڈان، دیت، ن ۍپ ۍ ٲ ٮ� ن� ٲ ہ ، ل77وژراوان، وونمت، کرم77ژ، وتھ، س77مکھن ، ت تمت، وومت، ژل ، د ہم77ت ن�ٲ ۍ ہن� ۍ 77ک، تھ77وتھ، ، دن -م77ت ، س77پد ، چھپن، ژلن -م77ت ، سپزمژ، سپز، تر ۍچھڑ ۍ ہ ۍ ۍ ے ، وتھ، کھن ، ک77رنس، پ77ر ، س77پدن وتھ، سپز، س77پد، ک77رن تھمت، بن ، و ہالگن ٲ ۍ ٲ Iن ہ ، 77ن ، ون ، نی 77ژ-م77ژ، کنن ی ، کھنتھ، -ون وتھ، پوش ، پ ، اننچ، اند-ن او، ژھنن ۍت ہ ہ ہ ٲ ہ ۍ ن� ، ن وان، یتھ، آس77 ، پ ، کھ77الن -تھ س ، یتھ، چال ، ل ، ت77یزن ۍگوو، رٹن، ونن ٮ� ہ ے ۍ ٲ Iن ٲ ہ ۍ -متس، سپزمژ ن، سپد ، و -ک ، کرن ۍتلن ۄ ہ ہ ہJJ_JJ ، شت-گردان د تیار، ، گمرا عالمی، ملکی، د، پ نو، یم، ق یمی، ہد ٲ ٲ ٲ -ون، س77ی، پوش 77د-ص77ورت، ختم، ممکن، سی م، ب ثر، ا ش77ور، مت ہپ77ور، م ٲ ٲ وژ، س77رحدی، کش77ر، شمل، ، پھ -ون ر، قونٮونی، روا-دار، اوم، پوش ٲس ٲ ن� ہ ہ ی� یی نی، ، چ یی، ب77اق می ، ایٹمی، ک کچ ، ل ، مضبوط، ف77وجی، بیی ، اصل ی�ثبق ے ٲ ی� ہ ۄ ہ ۍ ہ

77د، -ل ن، ع77ام، چھ77وک ہمختلف، انسنی، کش77ر ٮ� ٲ ر، -۱۲ٲ ر، -۱۳ہو ر،-۱۶ہو ہو می،۱۷ ری، بیین، بیین، خرجی، ج77امع، ب77ا ، ت77از، ج77و -کھل رس، ان -و ٲ ہ ۍ

، آزاد، بحال، واض77ح، س77از- ق فز، ث ، مضر، خصوصی، ن بی، بڈ می، ہب ٮ� ٲ ۍ ٲ ٲ، ٹھ تی، بح77ال، ش77 -مس، م77ذاکر 77ت ہگ7ار، محت77اط، منظ77وم، زبردس77ت، پ ٲ ۍ 77رامن، بی، امن-پس77ند، پ ، م77ذ ، تتھی، یتھ ملوث، ٹھوس، مشروط، روٹ ۍ ہ


ن، ، اژھ 77ژھ وری، ت ، غب، ٹاک77ار، جم ر ، س77رک ن ندوس77ت ، ٮ�تعلیم-یافت ے ٲ ۍ ٲ ۍ ٲ ہ یی، و 77ڈ، بین، زمی77نی، ج، ب ، د ٲیتھ، اصل، نا-کام، بین-االق77وامی، انسن Iن ن� ۍ ٲ س77مندری، متح77رک، خ77وش، چ77الو، ش77روع، ض77روری، مس77رور، تیتھ،، ن ن77نی، پاکس77ت م، ت77یز، ش77اندار، عظیم، قر می، نامور، ابتر، اعظ ۍاسل ٲ ٲ ی� ٲ دی- و، برط77انوی، موج77ود، ص77در، آب د، کشر ٲبیین، صاف، اندرونی، ش ٮ� ٲ ٲ، -پ77ای ل _افزا، ریاس77تی، ب ور، سرکری، کم، تژھ، حوصل ، مش ہوول، کھل ہ ہ ٲ ہ ، 77ژھ ن، ی -م 77ت ، پ 77و و، ن -م 77ت ، پ ، خفی ترین، ب77اق ، شمل، ب ہپر-شکو ٮ� ۍ ے ٮ� ۍ ہے ٲ 77د، -ل ، غل77ط، آیتن، تتھ، چھ77وک ژار، علحیدگی-پس77ند، گریل77و، ب77یی ، پ خفی ہ ن� ہ ، قون77ونی، س77مجی، ٮ77وریت-واجن 77د، گ77وش-گ7ذار، ع77دالتی، جم -ل ک ٲچھ ہ ہ ۄ ، ادا، ش77777777وژ، نم77777777ودار، منظ77777777ور ، ب77777777دل ہےمجرم77777777ان ہ ، -دار، پر-جوش، گرفتار، روشن، مچھ، کار-بند، مجب77ور، ن77وو، ملی، ٲکھان ہ ، بال- ، انتظمی، تعلیمی، تھ7ام اکھ، اقتصدی، ساد، سیود، معش اکھ، ب ٲ ۍ ٲ ٲ ن�ی، 77ر-امن، ش ، ی77ورپی، سس77ت، پ یم ٲجواز، واگذار، درکار، وعد-بن77د، و Iن ، -مل ، منظم، بند، مجب77ور، رل ، گرفتار، زیر، معشی، تبا -وای ہفرضی، ب ہ ٲ ھے ، ر -دخ7ل، کش77 7یین، ب ، ب ، کش7ر ، خود-س7اخت ، رت ، بج -مل ن، رل ۍرت ٲ ھے ۍ ٲ ہ ۍ ہ ہ ٮ� ری، گ77رکلی، ر، گ77رکلی، مرک77زی، اخب -دار، ب77د-ن77ام، س77ینر، کش77 ٲذم ٲ ہ ، بین-االق7ومی، پ77ور، مص77الحتی، ٲحفاظتی، کامیاب، مطل77وب، مش77ترک ٲ ہ ن، چی77نی، س77فارتی، ام77ریکی، واس، اف77زود، ٮ�مزی77د، ن77و، مس77تقل، ن77و ری، -وار، ش ت ، س7777ودی، نای7777اب، ر ن یی، نزدی7777ک، پ د، نیوکلی الو ہ ٮ� ۍ ن�ٲ ٲ ن� ہہ - ثر، نا-معلوم، یژھن، یژھو، کربناک، ناکار، معمولی، تن77ازع ہسالمتی، مت ٲیاب، کنی، مختصر ۍدار، دسJJ_JJC ن، مش77رقی، _وال ، ایٹمی، ط77اقت در ، ا ر، ب77را -نظ ٮ�خ77اص، ب ہ ے Iن ی� ے ٹھ ، خ77777777777777وش، ن77777777777777رم، ش ، ام77777777777777ریکی، خفی ہادر ی� ہ ۍ ، ، رت، سالمتی ےورش، وژھ، مین، سٹریم، بند، رتRB_RB ، نوس77ر، -مونجی، جری، ییی ، دبار، مل ، واپس، عنقریب ہبیی ٲ ہ ے ہ ، ج77ل، تھ س، ک س-ن ، ت ، تیل ، ح77ال میش ، واپس، ، دوری ت دش، ہد ٮ� ہے ہ ہ ۍ س� ۄ ، ٹھ -پ ، 
س77ید، س77یود، یتھ 77ن -ک ، کتھ ر، یک77وٹ ، بظ ، جان، ج77ل، پت ۍبال-وج ٲ ہ ۍ ہ ہ ٲ ہ ن -ک ، بیک-وقت، تتھ میش ر- ، میش ر- ، گ -م ۍگ ٲ ۍ ہ ہ ن� ہ ن� ہ


RB_RBC ، کھل ہسید، سیود، ژھیپ PP_PSP ، ت دس، س77 ز، ز، کن، س77 ، م ، بجای ز، حوال ، م ت ۍد، ن� ہہ ن� ن� ہ ہ ن� ۍ س� ن� ہہ تھ، خطر، د، م77نز، ز، ، 77ن ٹھ، مخلف، ک ٲب77اپتھ، بق77ول، خالف، پ ٮ� ہ ن� ہہ ن� ہہ ۍ ٲ ٮ� ، ٹ د، درمی77ان، خطر، پ ، الیق، باپت، س د ن، د ، د ز، ، ٹھ پ ٮ� ٲ ن� ۍ ن� ہہ ٮ� ن� ہہ ۍ ن� ہ ن� ہہ ے ٮ� و، ب77اوجود، د ، عالو، ، پت ٮ�سان، تام، دور، خ77اطر، موج77وب، نش، نکھ ن� ہہ ہ ہ ، ، رنگ ٹھ ، پ وٮ77ی ند، ، دس، دوران، پت ، س ، تل ۍخالف، نزدیک، طرف ٮ� ۍ ہ ہ ہ ن� ہے ہ ، د ، منز، ط77ور، س77 ٹھ ، پ ت ، ک ست دی، ر ، متلق، ، سبب و ٹھی، ب ۍپ ن� ہ ٮ� ہ ہۄ ے Iن ن� ہہ ہ KہYن ن� ٲ ز د، خظر، ، و ز، ب ز، متعل777777777777777777777ق، س777777777777777777777 ن�م ہہ ٲ ن� ہہ KہYن ن� ن� ن� ، ، ز ، اتھ، س77 ، تل ، ذری ن، کن د، خطر، ت77ل، ب بق، س77 ، مط ون ےدس، ب ن� ہ ہ ہ ۄ ٲ ن� ٲ Kہ ن� ن� ہہ ، باپتھ ، نش کھ ، م ، کن ، غرض ت ، کھ ٹھ پ ہ ۄ ہ ہ ہ ۄ ۍ ٲ ،CC_CCD ، و، -ک ، ن ، کن ، مگ77ر، مگ77ر، ز، وں-گ7و، یتھ، پش ، ی77ا، ت ہبلک ہ ہ ہ ہ ہیتھ-زن، اما-پوز، تCC_CCS ، ، ی77ان رگ ، اگر، ذا، ز، یعن ، ل ، تکیاز، یودو ، تو رگ ےز، Kہ ے ے ے چ خصوصن

RP_INJ ہ]RP_INTF 77ڈ، س77خ، ، ب ژا ، ٹھ ، براب77ر، س ، واری77ا ٹھا بڈ، زی77اد، س سک ہ] ٮ� ٮ�

ٹھ ، س ہ]تیوتا ہ ٮ� RP_NEG ، کی ، ن ، ن ا، ن نYہKک ہ ہ ن�RP_RPD - ، یتھ ، ح77االنک ا ، ک م، او ، ت77ا ش77 ی77و، ، ہسان، محض، ت ہ ن� ے ہ ہ

وی، ام777777ا، زن، ام777777ا_پ777777وز ، ی777777و، کی777777ا ، ص777777رف، ٹھ ہپ ۍ ٲ ، ، تی، وں-گ7و، فق7ط، چھ7را، ی7ودو ، ی7وت، ، زنت رگ ےآیا، وں-گوو، ما، س� ہ Kہ ہوں، بیی

QT_QTC ، اکس، ش77و ےاکھ، کرور، د ن، ۱۵٠٠٠ۄ ،۳٠٠٠، ۳٠٠ۄ، د ن�ے، ت ، یشو ےت ، ستتھ، س77اس، ک77رور، ز،۵ن� -س شویو، اک ، اکس، د ، اک ے، اک ۍ ۄ ہ ۍ

، -م ت تھ، س77 ن، لچھ، س77 ن، د ن، د ڈس، د ، ڈ ش77ون ، د ن ش ، د ن ش د ۍ چ ہ چ ۄ ۍ ۄ ۍ ہۄ ۄ ۍ ۄ ۄ ٹھن، ن، لچھن، -وی، س777777اڈن، ژ ، د ن، ت ، ت و ٲژ Iن ے ھے ن� ٮ� ن� ، ارب،۴۲ن، ۱۶ۄ

ن، ، و چاکی، دچھن ژن،۱۱٠، ۴٠، ش77یٹھ، ۲۱۹، ۳۴۲٠۲، ۲۱۹۱ہ ن، پ77ا ن�، ش ٮ�


ٮتن، تھ، ڈاین، ا، ، د۱۹، ۷٠، ۵٠چ -م ہ

QT_QTF - س، درجن تھ، د ، پ گن گنس، د س، ا ، د 77ت ہاکثر، تم7ام، ی ن� ٮ� ن� ہ ۄ ۄ ن� ے، ، ب77ود، ی77وت، زی77اد ، کھ77رب ح77د، واری77ا ، سری، پور، پ77ور، و ہواد، سارن ہ ٲ ٲ ے ، سری، ژو، کم، کی ژن، زی77اد، کی ، کی ، سر تیا ، مزید، ک ژو، زیاد ٲکی KYن ن� ن� ے ٲ ٲ ے ن� ژن ن، تمام-تر، کی ، واری ، زیاد م، سارو ، کل ن�باق ے ہےQT_QTO یم، اول ن، د -م ی ، د یم ، د م ، ب یم نIب ٮ� ہ Iن Iن ہ ہۍ ہ RD_ECH ، اترو ، وت ہباتھ، باتھ، باتھ، وت ہRD_PUNC ہ، ،، ‘، ’، “، ت : ،’’ ،( ،) ،‘‘ ،۔RD_UNK کی

Appendix-III: Showing a Sample of Syntactically Annotated Sentences in SSF

<document id="">

<head></head>

<Sentence id='1'>

1 (( NP <fs name='NP' drel='r6:NP4'>

1.1 وزیر N_NNPC <fs name='وزیر' cat='n'>

1.2 اعظم N_NNPC <fs name='اعظم' cat='n'>

1.3 ڈاکٹر N_NNPC <fs name='ڈاکٹر' cat=''''>

1.4 نمنمٮ�ہن N_NNPC <fs name='نمنمٮ�ہن ' cat=''''>

1.5 ہ� ہ�نگھن N_NNPC <fs name='ہ� ہ�نگھن ' cat=''''>

))

2 (( NP <fs name='NP2' drel='r6:NP4'>

2.1 وادی N_NN <fs name='وادی' cat='n'>

2.2 ہ� نن ہہ PP_PSP <fs name='ہ� نن ہہ ' cat=''''>

))


3.1 ہ�- ک ہ� حال N_NN <fs name=' ہ�- ک ہ� <'cat='nst 'حال

))

4 (( NP <fs name='NP4' drel='k3:VGF'>

4.1 دور N_NN <fs name='دور' cat='n'>

4.2 س�تۍ PP_PSP <fs name='س�تۍ ' cat=''''>

))

5 (( AUXP <fs name='AUXP' drel='fragof:VGF'>

5.1 ہ� ی#ھن V_VAUX <fs name='ہ� ی#ھن ' cat='v'>


))

6 (( NP <fs name='NP5' drel='k7p:VGNN'>

6.1 ہ%�ا�تس N_NN <fs name='ہ%�ا�تس ' cat='n'>

6.2 م��ن�ز PP_PSPن''''''& <fs name=' م��ن�ز <''''=cat 'ن''''''&

))

7 (( NP <fs name='NP6' drel='k2:VGNN'>

7.1 یمی ٲد JJ_JJ <fs name='یمی ٲد ' cat='adj'>

7.2 یمن نا N_NN <fs name='یمن نا ' cat=''''>

))

8 (( JJP <fs name='JJP' drel='pof:VGNN'>

8.1 ٲقیم JJ_JJ <fs name='ٲقیم ' cat='adj'>

))

9 (( VGNN <fs name='VGNN' drel='r6v:NP7'>

9.1 ن#ن- ہ� کر� V_VM <fs name=' ن#ن- ہ� <''''=cat 'کر�

))


10.1 ن*ن ہ+ ک� N_NN <fs name='ن*ن ہ+ <'cat='n 'ک�

))


11.1 ن�ہ� کا DM_DMI <fs name='ہ��ن <'cat='pn 'کا

11.2 خاص JJ_JJC <fs name='خاص' cat=''''>

11.3 پوچھر N_NN <fs name='پوچھر' cat=''''>

))

12 (( VGF <fs name='VGF' drel='ccof:CCP'>

12.1 یمت و�گ ل V_VM <fs name='یمت و�گ <'cat='v 'ل

))

13 (( CCP <fs name='CCP'>

13.1 ہ� ب0لک CC_CCD <fs name='ہ� ب0لک ' cat='avy'>

))

14 (( AUXP <fs name='AUXP2' drel='fragof:NP10'>

14.1 ہ#ھ V_VAUX <fs name='ہ#ھ ' cat='v'>

))

15 (( NP <fs name='NP9' drel='k1:NP10'>

15.1 اکثر QT_QTF <fs name='اکثر' cat='avy'>

15.2 نگار- ہتجزی N_NN <fs name=' نگار- ہتجزی ' cat=''''>

))


16 (( NP <fs name='NP10' drel='ccof:CCP'>

16.1 ما�ا& V_VM <fs name='&ا�ما' cat='v'>

))

17 (( CCP <fs name='CCP2' drel='csof:NP10'>

17.1 ہز CC_CCS <fs name='ہز ' cat='avy'>

))

18 (( NP <fs name='NP11' drel='k7:VGNN2'>

18.1 ہ�- ک مستقبل N_NN <fs name=' ہ�- ک <'cat='n 'مستقبل

18.2 ہ� ح�ال PP_PSP <fs name='ہ� <''''=cat 'ح�ال

))

19 (( NP <fs name='NP12' drel='ccof:CCP3'>

19.1 ن�ہ� کا DM_DMI <fs name=' ن�ہ� 2کا ' cat='pn'>

19.2 نو JJ_JJ <fs name='نو' cat=''''>

19.3 وۄمی� N_NN <fs name='وۄمی�' cat=''''>

))

20 (( CCP <fs name='CCP3' drel='k1:VGNN2'>

20.1 یا CC_CCD <fs name='یا' cat='avy'>

))


21.1 ہامکا& N_NN <fs name='&ہامکا ' cat='n'>

))

22 (( JJP <fs name='JJP2' drel='pof:VGNN2'>

22.1 ہ; پٲ JJ_JJ <fs name=';ہ <'cat='adj 'پٲ

))

23 (( VGNN <fs name='VGNN2' drel='vmod:VGF2'>

23.1 ہ� ن�پ�� V_VM <fs name='ہ� ن�پ�� ' cat=''''>

23.2 ہ� 0جا� PP_PSP <fs name='ہ� <''''=cat '0جا�

))

24 (( AUXP <fs name='AUXP3' drel='fragof:VGF2'>

24.1 وھے # V_VAUX <fs name='وھے #' cat='v'>

))


25.1 ہ� نام DM_DMD <fs name='ہ� نام ' cat='pn'>

25.2 ہ% ;و N_NN <fs name='%ہ <''''=cat ';و

))

26 (( CCP <fs name='CCP4' drel='k3:VGF2'>


26.1 ہ� ت CC_CCD <fs name='ہ� <'cat='avy 'ت

))

27 (( NP <fs name='NP15' drel='k7t:VGNF'>

27.1 ناتھ PR_PRP <fs name='ناتھ ' cat='pn'>

27.2 من�ز ن''''''& PP_PSP <fs name=' من�ز <''''=cat 'ن''''''&

))

28 (( NP <fs name='NP16' drel='UNDEF:VGNF'>

28.1 ن�ہ� نرو 0 N_NST<fs name='ہ��ن نرو 0' cat='nst'>

28.2 یکن PP_PSP <fs name='یکن ' cat=''''>

))

29 (( VGNF <fs name='VGNF' drel='nmod:NP17'>

29.1 ہ� آامت V_VM <fs name='ہ� آامت ' cat='v'>

))


30.1 ہ� 0یا� N_NN <fs name='ہ� <'cat='n '0یا�

30.2 س�تۍ PP_PSP <fs name=' 2س�تۍ ' cat=''''>

))


31.1 مویویسی N_NN <fs name='مویویسی' cat='n'>

))

32 (( CCP <fs name='CCP5' drel='k1:VGF2'>

32.1 ہ� ت CC_CCD <fs name=' ہ� 2ت ' cat='avy'>

))


33.1 بددلی N_NN <fs name='بددلی' cat='n'>

))

34 (( VGF <fs name='VGF2' drel='rsv:CCP2'>

34.1 ہمژ یہر� V_VM <fs name='ہمژ یہر� ' cat='v'>

34.2 ۔ RD_PUNC <fs name='۔' cat=''''>

))

</Sentence>

<Sentence id='2'>

1 (( NP <fs name='NP' drel='UNDEF:NP2'>

1.1 Eی ہحز N_NNPC <fs name='Eی ہحز ' cat='n'>

1.2 المجاہ��ن N_NNPC <fs name='المجاہ��ن' cat=''''>


1.3 �ا& RP_RPD <fs name='&ا�' cat=''''>

))


2.1 تمام QT_QTF <fs name='تمام' cat='avy'>

2.2 علحیدگی N_NNC <fs name='علحیدگی' cat=''''>

2.3 نن� نپس N_NNC <fs name='نن� نپس ' cat=''''>

2.4 ژو ٲجم N_NN <fs name='ژو ٲجم ' cat=''''>

))


3.1 ی#ھ V_VAUX <fs name='ی#ھ ' cat='v'>

3.2 ی�مت نو و V_VM <fs name='مت�ی نو <''''=cat 'و

))

4 (( CCP <fs name='CCP' drel='rsv:VGF'>

4.1 ہز CC_CCS <fs name='ہز ' cat='avy'>

))

5 (( NP <fs name='NP3' drel='ras-k2:VGNN'>

5.1 تمام QT_QTF <fs name=' 2تمام ' cat='avy'>

5.2 نپن نرو گ N_NN <fs name='نپن نرو <''''=cat 'گ

5.3 س�تۍ PP_PSP <fs name='س�تۍ ' cat=''''>

))

6 (( NP <fs name='NP4' drel='pof:VGNN'>

6.1 0اتھ- نکتھ N_NN <fs name=' 0اتھ- نکتھ ' cat='n'>

))

7 (( VGNN <fs name='VGNN' drel='r6v:NP6'>

7.1 ہ�چ کر V_VM <fs name='چ�ہ <'cat='v 'کر

))


8.1 وزیراعظم N_NN <fs name='وزیراعظم' cat='v'>

8.2 ن�ز س�� ن''''''& PP_PSP <fs name=' ن�ز س�� <''''=cat 'ن''''''&

))

9 (( NP <fs name='NP6' drel='k1:VGF3'>

9.1 آاوے- خ�ش N_NN <fs name=' آاوے- <'cat='psp 'خ�ش

))

10 (( VGF <fs name='VGF2' drel='csof:CCP'>

10.1 وھے # V_VM <fs name='وھے #' cat='v'>

))



11.1 محض RP_RPD <fs name='محض' cat='avy'>

11.2 ناکھ QT_QTC <fs name='ناکھ ' cat=''''>

11.3 ن0یا& N_NN <fs name='&ن0یا ' cat=''''>

))

12 (( NP <fs name='NP8' drel='nmod__Relc:NP7'>

12.1 نیک یم وی � PR_PRL <fs name='نیک یم وی �' cat='pn'>

))


13.1 نمقص� N_NN <fs name='نمقص� ' cat='n'>

))

14 (( JJP <fs name='JJP' drel='ccof:CCP2'>

14.1 محض RP_RPD <fs name=' 2محض ' cat='avy'>

14.2 ملکی JJ_JJ <fs name='ملکی' cat='adj'>

))

15 (( CCP <fs name='CCP2' drel='CCNmod:NP10'>

15.1 ہ� ت CC_CCD <fs name='ہ� <'cat='avy 'ت

))

16 (( JJP <fs name='JJP2' drel='ccof:CCP2'>

16.1 عالمی JJ_JJ <fs name='عالمی' cat='adj'>

))


17.1 عام�- %اے N_NN <fs name=' عام�- <'cat='n '%اے

))

18 (( JJP <fs name='JJP3' drel='pof:VGF3'>

18.1 یگمراہ JJ_JJ <fs name='یگمراہ ' cat='adj'>

))

19 (( VGF <fs name='VGF3' drel='k1:NP9'>

19.1 یر& ک V_VM <fs name='&یر <'cat='v 'ک

19.2 ی#ھ V_VAUX <fs name=' 2ی#ھ ' cat='v'>

19.3 ۔ RD_PUNC <fs name='۔' cat='s'>

))

</Sentence>

<Sentence id='3'>

1 (( NP <fs name='NP' drel='k2:VGF4'>

1.1 پزر N_NN <fs name='پزر'>1.2 ہ� ت RP_RPD <fs name='ہ� <'ت


))


2.1 ی#ھ V_VAUX <fs name='ی#ھ '>

))


3.1 ہ� � PR_PRP <fs name='ہ� �'>))

4 (( CCP <fs name='CCP' drel='ras-k1:NP2'>

4.1 ہز CC_CCS <fs name='ہز '>))


5.1 وز�راعظمن N_NN <fs name='وز�راعظمن'>))


6.1 نو& و V_VM <fs name='&نو <'و))


7.1 شیر N_NNPC <fs name='شیر'>

7.2 کشمی�ر O''''''ی N_NNPC <fs name=' کشمی�ر O''''''ی'>7.3 زرعی N_NNPC <fs name='زرعی'>

7.4 یونیورسٹی N_NNPC <fs name='یونیورسٹی'>7.5 ہ�س نن ہہ PP_PSP <fs name='ہ�س نن ہہ '>

))

8 (( NP <fs name='NP5' drel='k2:VGNF'>

8.1 کن�کی*نس N_NN <fs name='کن�کی*نس'>

))

9 (( NP <fs name='NP6' drel='pof:VGNF'>

9.1 خطاب N_NN <fs name='خطاب'>

))

10 (( VGNF <fs name='VGNF' drel='vmod:VGF4'>

10.1 کرا& V_VM <fs name='&کرا'>

))

11 (( JJP <fs name='JJP' drel='UNDEF:VGF4'>

11.1 ی�تے QT_QTF <fs name='ی�تے '>))

12 (( CCP <fs name='CCP2' drel='rs:JJP'>

12.1 ہز CC_CCS <fs name=' 2ہز '>


))


13.1 با�ۍ PR_PRP <fs name='با�ۍ '>))

14 (( VGF <fs name='VGF3' drel='csof:CCP2'>

14.1 ہ#ھ V_VM <fs name='ہ#ھ '>))


15.1 نمن ہت DM_DMD <fs name='نمن ہت '>15.2 تمام QT_QTF <fs name='تمام'>15.3 گروپن N_NN <fs name='گروپن'>

15.4 ہ�تۍ PP_PSP <fs name='ہ�تۍ '>))

16 (( NP <fs name='NP9' drel='pof:VGNN'>

16.1 نکتھ N_NN <fs name='نکتھ '>

))

17 (( VGNN <fs name='VGNN' drel='UNDEF:VGF4'>

17.1 ہ� کر� V_VM <fs name='ہ� <'کر�

17.2 نپتھ 0ا PP_PSP <fs name='نپتھ <'0ا))

18 (( JJP <fs name='JJP2' drel='pof:VGF4'>

18.1 تیار JJ_JJ <fs name='تیار'>

))

19 (( NP <fs name='NP10' drel='nmod__k1inv:NP8'>

19.1 یم PR_PRP <fs name='یم'>

))

20 (( NP <fs name='NP11' drel='x:VGF4'>

20.1 Rہ 0ق� PP_PSP <fs name='Rہ <'0ق�20.2 ہ� ہتہن PR_PRP <fs name='ہ� ہتہن '>

))


21.1 گردی- شت ہد N_NN <fs name=' گردی- شت ہد '>

21.2 خالف PP_PSP <fs name='خالف'>

))


22.1 ن�ن با V_VM <fs name='ن�ن با '>22.2 ۔ RD_PUNC <fs name='۔'>


))

</Sentence>

<Sentence id='4'>


1.1 بتمۍ PR_PRP <fs name='بتمۍ '>))




3.1 با�ۍ PR_PRP <fs name='با�ۍ '>))

4 (( AUXP <fs name='AUXP' drel='fragof:VGF2'>

4.1 ہ#ھ V_VAUX <fs name='ہ#ھ '>))

5 (( NP <fs name='NP3' drel='ras-k1:VGF2'>

5.1 ن�س پٲکستا N_NNP <fs name='س�ن <'پٲکستا5.2 س�تۍ PP_PSP <fs name='س�تۍ '>

))


6.1 دوستی N_NN <fs name='دوستی'>

))

7 (( CCP <fs name='CCP' drel='k7:VGNF'>

7.1 ہ� ت CC_CCD <fs name='ہ� <'ت))


8.1 نکس ہا+ترا N_NN <fs name='نکس ہا+ترا '>8.2 ٮٮٹھ پ PP_PSP <fs name='ٮٮٹھ <'پ

))

9 (( VGNF <fs name='VGNF' drel='nmod:NP6'>

9.1 ہ%تھ ب; V_VM <fs name='ہ%تھ ب; '>))


10.1 تعلقات N_NN <fs name='تعلقات'>

))

11 (( VGF <fs name='VGF2'>


11.1 ن�ژھا& V_VM <fs name='&ن�ژھا '>11.2 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>

<Sentence id='5'>

1 (( CCP <fs name='CCP' drel='csof:CCP2'>

1.1 نہرگ� CC_CCS <fs name='نہرگ� '>))


2.1 پٲکستا& N_NNP <fs name='&پٲکستا'>))


3.1 ہننۍ نپ PR_PRF <fs name='ہننۍ نپ '>))


4.1 زمین- ہر ن� N_NN <fs name=' زمین- ہر ن� '>))

5 (( NP <fs name='NP4' drel='rd:VGF'>

5.1 0ھا%ت N_NNP <fs name='0ھا%ت'>5.2 ٲمخلف PP_PSP <fs name='ٲمخلف '>

))

6 (( NP <fs name='NP5' drel='rh:VGF'>

6.1 ہ�- گر;ا� ب;ہ*ت JJ_JJ <fs name=' ہ�- گر;ا� ب;ہ*ت '>6.2 سرگرمیٮو N_NN <fs name='سرگرمیٮو'>

6.3 نپتھ 0ا PP_PSP <fs name='نپتھ <'0ا))


7.1 استعمال N_NN <fs name='استعمال'>

))

8 (( VGF <fs name='VGF' drel='ccof:CCP2'>

8.1 ن�& �پ V_VM <fs name='&ن� <'�پ

8.2 ہ�- � ہ� ہ;� V_VAUX <fs name=' ہ�- � ہ� ہ;� '>))

9 (( CCP <fs name='CCP2'>




10.1 ناتھ DM_DMD <fs name='ناتھ '>10.2 نلس ہس ہ�ل N_NN <fs name='نلس ہس ہ�ل '>10.3 م��ن�ز PP_PSPن''''''& <fs name=' م��ن�ز <'ن''''''&

))


11.1 ہ� نپنن PR_PRF <fs name='ہ� نپنن '>))


12.1 ;ہٲ�ۍ- ن�قین N_NN <fs name=' ;ہٲ�ۍ- ن�قین '>))

13 (( JJP <fs name='JJP' drel='pof:VGF2'>

13.1 ہ% و� پ JJ_JJ <fs name='%ہ و� <'پ))

14 (( VGF <fs name='VGF2' drel='ccof:CCP2'>

14.1 ہر نک V_VM <fs name='ہر نک '>

14.2 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>

<Sentence id='6'>

1 (( NP <fs name='NP' drel='k1:VGF2'>

1.1 ک��شی�ر O''''''ی N_NNP <fs name=' ک��شی�ر O''''''ی'>))


2.1 وھے # V_VM <fs name='وھے #'>

))


3.1 ہ� نپنن PR_PRF <fs name='ہ� نپنن '>))

4 (( NP <fs name='NP3' drel='rh:VGF2'>

4.1 خوبصورتی N_NN <fs name='خوبصورتی'>

4.2 ہکنۍ PP_PSP <fs name='ہکنۍ '>

))

5 (( NP <fs name='NP4' drel='k7p:VGF2'>

5.1 ن%س ی; QT_QTF <fs name='ن%س ی; '>


5.2 نہس- � ی;�ۍ N_NN <fs name=' نہس- � ی;�ۍ '>5.3 م��ن�ز PP_PSPن''''''& <fs name=' م��ن�ز <'ن''''''&

))

6 (( JJP <fs name='JJP' drel='k1s:VGF2'>

6.1 شور ہم JJ_JJ <fs name='شور ہم '>

6.2 ، RD_PUNC <fs name='،'>

))


7.1 نت�ے CC_CCS <fs name='نت�ے '>))

8 (( AUXP <fs name='AUXP' drel='fragof:VGF2'>

8.1 ہ#ھ V_VAUX <fs name='ہ#ھ '>))

9 (( NP <fs name='NP5' drel='rsp:VGF2'>

9.1 ہ� نپت N_NST<fs name='ہ� نپت '>9.2 ہ� نوت N_NNC <fs name='ہ� نوت '>9.3 ٮٮٹھ پ PP_PSP <fs name='ٮٮٹھ <'پ

))


10.1 �ٲلٲ�ۍ N_NN <fs name='ۍ�ٲلٲ�'>

))


11.1 یور N_NST <fs name='یور'>

))

12 (( VGF <fs name='VGF2' drel='rh:CCP'>

12.1 ہ��ا& V_VM <fs name='&ہ��ا '>12.2 ہمتۍ آا V_VAUX <fs name='ہمتۍ آا '>12.3 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>

<Sentence id='7'>


1.1 ہ� � PR_PRP <fs name='ہ� �'>))



2.1 وھے # V_VM <fs name='وھے #'>

))


3.1 ہ�- # ہمالی� N_NNP <fs name=' ہ�- # <'ہمالی�))

4 (( NP <fs name='NP3' drel='k7p:VGF'>

4.1 ہ#ھ کۄ N_NN <fs name='ہ#ھ <'کۄ

4.2 م��ن�ز PP_PSPن''''''& <fs name=' م��ن�ز <'ن''''''&))


5.1 ہمژ ن�پھلی پھا V_VM <fs name='ہمژ ن�پھلی <'پھا5.2 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>

<Sentence id='8'>


1.1 ہر یی بک* N_NNP <fs name='ہر یی بک* '>

1.2 ہز نن ہہ PP_PSP <fs name='ہز نن ہہ '>))


2.1 خوبصورتی N_NN <fs name='خوبصورتی'>

2.2 س�تۍ PP_PSP <fs name='س�تۍ '>))

3 (( JJP <fs name='JJP' drel='pof:VGNF'>

3.1 ثر ٲمت JJ_JJ <fs name='ثر ٲمت '>

))

4 (( VGNF <fs name='VGNF' drel='vmod:VGF'>

4.1 ہے ہ�تھ ہپ ن� V_VM <fs name='ہے ہ�تھ ہپ ن� '>))


5.1 ی#ھ V_VM <fs name='ی#ھ '>

))


6.1 مغل N_NNPC <fs name='مغل'>

6.2 0ا;+اہ N_NNPC <fs name='0ا;+اہ'>6.3 جہا�گیر& N_NNPC <fs name='&گیر�جہا'>


))


7.1 ن�تھ PR_PRP <fs name='ن�تھ '>))

8 (( NP <fs name='NP5' drel='k1s:VGNN'>

8.1 جنت N_NNC <fs name='جنت'>

8.2 بنظی�ر- O''''''ےی JJ_JJC<fs name=' بنظی�ر- O''''''ےی '>

))

9 (( VGNN <fs name='VGNN' drel='r6:NP6'>

9.1 ینک آا� V_VM <fs name='ینک آا� '>))



))


11.1 یمت ن�ت ; V_VM <fs name='یمت ن�ت ;'>11.2 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>

<Sentence id='9'>


1.1 ہر یی بک* N_NNP <fs name='ہر یی بک* '>

1.2 نن� یہ PP_PSP <fs name='نن� یہ '>))

2 (( NP <fs name='NP2' drel='k1s:VGF'>

2.1 دل N_NN <fs name='دل'>

))



))


4.1 س��ری�نگر O''''''ی N_NNP <fs name=' س��ری�نگر O''''''ی'>4.2 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>


<Sentence id='10'>


1.1 ہ� � PR_PRP <fs name='ہ� �'>))



))


3.1 ک��شی�ر Oی'''''''''''ی N_NNP <fs name=' ک��شی�ر Oی'''''''''''ی'>3.2 نن� یہ PP_PSP <fs name='نن� یہ '>

))


4.1 دل N_NN <fs name='دل'>

))

5 (( CCP <fs name='CCP' drel='k1s:VGF'>



6.1 ناکھ QT_QTC <fs name='ناکھ '>6.2 م ہا JJ_JJ <fs name='م ہا '>

6.3 ہ� حص N_NN <fs name='ہ� <'حص


</Sentence>

<Sentence id='11'>

1 (( NP <fs name='NP' drel='ccof:CCP'>

1.1 ہل جھی N_NNPC <fs name='ہل <'جھی

1.2 ڈل N_NNPC <fs name='ڈل'>

))

2 (( CCP <fs name='CCP' drel='k1:VGF'>



3.1 ییین ہ�گ N_NNP <fs name='ییین ہ�گ '>


))



))


5.1 ہ�- # ن+ہر N_NN <fs name=' ہ�- # ن+ہر '>))


6.1 خوبصٮورتی N_NN <fs name='خوبصٮورتی'>

6.2 م��ن�ز PP_PSPن''''''& <fs name=' م��ن�ز <'ن''''''&))


7.1 ننس اۄگ QT_QTF <fs name='ننس <'اۄگ7.2 ہ� ;ۄگن QT_QTF <fs name='ہ� <';ۄگن7.3 ��رٮ��ر ''''''ہٮ''''''\ ی N_NN <fs name=' ��رٮ��ر ''''''ہٮ''''''\ ی '>

))




</Sentence>

<Sentence id='12'>

1 (( NP <fs name='NP'>

1.1 نمن م�� N_NN <fs name='نمن <'م��1.2 نن� یہ PP_PSP <fs name='نن� یہ '>

))

2 (( VGNF <fs name='VGNF'>

2.1 یلن �0 V_VM <fs name='یلن �0'>))



4 (( NP <fs name='NP2'>

4.1 ہ;لن N_NN <fs name='ہ;لن '>))

5 (( JJP <fs name='JJP'>


5.1 0راہ JJ_JJC<fs name='0راہ'>))

6 (( VGNF <fs name='VGNF2'>

6.1 وول- ہ''''''ہین�� V_VM <fs name=' وول- ہ''''''ہین�� '>

))


7.1 آب N_NNC <fs name='آب'>

7.2 ہوا N_NNC <fs name='ہوا'>

))

8 (( AUXP <fs name='AUXP'>


))


9.1 ٮٮن �ٲلا� N_NN <fs name='ٮٮن <'�ٲلا�

))


10.1 یور N_NST <fs name='یور'>

10.2 یکن PP_PSP <fs name='یکن '>

))

11 (( VGNF <fs name='VGNF3'>

11.1 ننس ہ� V_VM <fs name='ننس ہ� '>))


12.1 رز N_NN <fs name='رز'>

))

13 (( VGF



</Sentence>

<Sentence id='13'>


1.1 مگر CC_CCD <fs name='مگر'>

))

2 (( NP <fs name='NP' drel='k7t:VGF'>

2.1 کل- از N_NST<fs name=' کل- <'از


))



))


4.1 ر ہش N_NN <fs name='ر ہش '>

))

5 (( NP <fs name='NP3' drel='k7p:VGF'>

5.1 ٮ�تھ نر پ QT_QTF <fs name='ٮ�تھ نر <'پ5.2 ہ� نا� N_NN <fs name='ہ� نا� '>

))


6.1 ی�ہے DM_DMD <fs name='ی�ہے '>6.2 ص�%ت- ن�0 JJ_JJ <fs name=' ص�%ت- ن�0 '>6.3 ہ� علاق N_NN <fs name='ہ� <'علاق

6.4 و� ہی RP_RPD <fs name='و� <'ہی))


7.1 یمت و� گ V_VM <fs name='یمت و� <'گ


</Sentence>

<Sentence id='14'>

1 (( NP <fs name='NP' drel='modn:NP2'>

1.1 یحکمرا& N_NNC <fs name='&یحکمرا '>

1.2 جماعت N_NNC <fs name='جماعت'>

))


2.1 نیشنل N_NNPC <fs name='نیشنل'>2.2 کا�فر�سن N_NNPC <fs name='سن�فر�کا'>

))



3.2 یتمت نو و V_VM <fs name='یتمت نو <'و))





5.1 ک��شی�ر O''''''ی N_NNP <fs name=' ک��شی�ر O''''''ی'>5.2 نن� یہ PP_PSP <fs name='نن� یہ '>

))


6.1 ہ� نمسل N_NN <fs name='ہ� نمسل '>))


7.1 ہز %و V_VM <fs name='ہز <'%و))


8.1 تام- تو N_NST<fs name=' تام- <'تو))

9 (( JJP <fs name='JJP' drel='pof:VGF2'>

9.1 cہ گا N_NNC <fs name='cہ <'گا

9.2 ی%ے نو; ا JJ_JJC<fs name='ی%ے نو; <'ا))


10.1 - ہ�- � تام و� � N_NST<fs name=' - ہ�- � تام و� �'>))

11 (( NP <fs name='NP7' drel='k1:NP8'>

11.1 بندوق N_NN <fs name='بندوق'>

))

12 (( JJP <fs name='JJP2' drel='pof:NP8'>

12.1 ختم JJ_JJ <fs name='ختم'>

))


13.1 نگژھن V_VM <fs name='نگژھن '>


</Sentence>

<Sentence id='15'>



1.1 ک��شی�ر O''''''ی N_NNP <fs name=' ک��شی�ر O''''''ی'>1.2 نن� یہ PP_PSP <fs name='نن� یہ '>

))


2.1 ہ� نمسل N_NN <fs name='ہ� نمسل '>))

3 (( VGNN <fs name='VGNN' drel='k1:VGF'>

3.1 یو& با�ز%ا V_VM <fs name='&یو با�ز%ا '>))



))

5 (( NP <fs name='NP3' drel='rsp:VGF'>

5.1 تیلی N_NST <fs name='تیلی'>

))


6.1 یممکن JJ_JJ <fs name='یممکن '>))


7.1 ہ� ویل � N_NST<fs name='ہ� ویل �'>))



9 (( NP <fs name='NP7' drel='pof:VGF2'>

9.1 ہ� ژھۄپ N_NN <fs name='ہ� <'ژھۄپ))


10.1 ہر ک V_VM <fs name='ہر <'ک


</Sentence>

<Sentence id='16'>






))


3.1 ی ٲہتب N_NN <fs name='ی ٲہتب '>

))


4.1 نا�ا& V_VM <fs name='&ا�نا '>))




6.1 بندوق N_NN <fs name=' 2بندوق '>

))

7 (( AUXP <fs name='VGF2' drel='fragof:VGF3'>

7.1 اوس V_VAUX <fs name='اوس'>

))


8.1 ک��شی�ر O''''''ی N_NNP <fs name=' ک��شی�ر O''''''ی'>8.2 من�ز ن''''''& PP_PSP <fs name=' من�ز <'ن''''''&

))


9.1 نیشنل N_NNPC <fs name='نیشنل'>9.2 کا�فر�س N_NNPC <fs name='س�فر�کا'>

))


10.1 ٲسیسی JJ_JJ <fs name='ٲسیسی '>

10.2 ہ� مٲ;ا� N_NN <fs name='ہ� <'مٲ;ا�10.3 ہز من PP_PSP <fs name='ہز <'من

))

11 (( NP <fs name='NP7' drel='adv:VGNN'>

11.1 نا�� N_NN <fs name='��نا '>11.2 ٮٮتھ ہ PP_PSP <fs name='ٮٮتھ <'ہ

))


12 (( VGNN <fs name='VGNN' drel='rh:VGF3'>

12.1 ہ� ہٹاو� V_VM <fs name='ہ� <'ہٹاو�12.2 ہر خٲط PP_PSP <fs name='ہر <'خٲط

))

13 (( VGF <fs name='VGF3' drel='ccof:CCP'>

13.1 ہ� نا�ن V_VM <fs name='ہ� نا�ن '>13.2 یمت آا V_VAUX <fs name='یمت آا '>13.3 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>

<Sentence id='17'>


1.1 سی- این N_NNP <fs name=' سی- <'این))



))

3 (( NP <fs name='NP2' drel='rsp:VGF'>

3.1 ہ� پت N_NST<fs name='ہ� <'پت3.2 ہ� نوت N_NNC <fs name='ہ� نوت '>3.3 ٮٮٹھے پ PP_PSP <fs name='ٮٮٹھے <'پ

))


4.1 ک��شی�ر O''''''ی N_NNP <fs name=' ک��شی�ر O''''''ی'>4.2 ہ� نن ہہ PP_PSP <fs name='ہ� نن ہہ '>

))


5.1 یلک مس N_NN <fs name='یلک <'مس))


6.1 یو&- ہ� پ�+ JJ_JJ <fs name=' یو&- ہ� <'پ�+6.2 حل N_NN <fs name='حل'>

))


7.1 ن�ژھا& V_VM <fs name='&ن�ژھا '>7.2 یمت آا V_VAUX <fs name='یمت آا '>



</Sentence>

<Sentence id='18'>


1.1 نمن ہت DM_DMD <fs name='نمن ہت '>1.2 بکتھن N_NN <fs name='بکتھن '>

1.3 ��ن�ز ''''''ہن''''''ن ہ PP_PSP <fs name=' ��ن�ز ''''''ہن''''''ن ہ '>

))


2.1 نوتھ 0ا N_NN <fs name='نوتھ <'0ا))


3.1 کر V_VM <fs name='کر'>

))

4 (( NP <fs name='NP3' drel='r6:CCP'>

4.1 نیشنل N_NNPC <fs name='نیشنل'>4.2 ہسکۍ N_NNPCکا�فر� <fs name='ہسکۍ <'کا�فر�

))


5.1 سین�ر ''''''�ی''''''ن ی JJ_JJ <fs name=' سین�ر ''''''�ی''''''ن ی '>

5.2 نما ہر N_NN <fs name='نما ہر '>

))

6 (( CCP <fs name='CCP' drel='modnc:NP8'>



7.1 ہتکۍ %�ا� N_NN <fs name='ہتکۍ <'%�ا�))


8.1 قونٮونی JJ_JJ <fs name='قونٮونی'>

8.2 ام�%& N_NN <fs name='&%ام�'>8.3 نن�ۍ ہ PP_PSP <fs name='نن�ۍ <'ہ

))


9.1 وزیر N_NNC <fs name='وزیر'>


))


10.1 علی N_NNPC <fs name='علی'>

10.2 محم� N_NNPC <fs name='محم�'>10.3 �اگر& N_NNPC <fs name='&اگر�'>

))


11.1 پارٹی N_NN <fs name='پارٹی'>11.2 ٮ�ن نن� ہہ PP_PSP <fs name='ٮ�ن نن� ہہ '>

))


12.1 کا%کنن N_NN <fs name='کا%کنن'>

))



))

14 (( VGNF <fs name='VGF2' drel='vmod:VGF'>



</Sentence>

<Sentence id='19'>


1.1 بتمۍ PR_PRP <fs name='بتمۍ '>))






4.1 تم DM_DMD <fs name='تم'>

4.2 یلکھ N_NN <fs name='یلکھ '>

))

5 (( NP <fs name='NP3' drel='nmod__k1inv:NP2'>


5.1 یم PR_PRL <fs name='یم'>

))


6.1 نکتھ N_NN <fs name='نکتھ '>

6.2 ہتھ 0ا RD_ECH <fs name='ہتھ <'0ا))


7.1 تھۄس N_NN <fs name='تھۄس'>

))


8.1 نا�ا& V_VM <fs name='&ا�نا '>8.2 ہ#ھ V_VAUX <fs name='ہ#ھ '>

))

9 (( AUXP <fs name='AUXP' drel='pof:VGF3'>

9.1 ہ�- � ٮٮکن ہ V_VM <fs name=' ہ�- � ٮٮکن <'ہ))


10.1 ٮ�ن کٲ+ر N_NNP <fs name='ٮ�ن <'کٲ+ر

10.2 نن�ۍ ہہ PP_PSP <fs name='نن�ۍ ہہ '>))

11 (( NP <fs name='NP7' drel='k1s:VGF3'>

11.1 در- ۍژک N_NN <fs name=' در- ۍژک '>

))


12.1 ہ�تھ با V_VM <fs name='ہ�تھ با '>12.2 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>

<Sentence id='20'>

1 (( NP <fs name='NP' drel='ccof:CCP'>

1.1 علحیدگی N_NNC <fs name='علحیدگی'>

1.2 پسن�& N_NNC <fs name='&پسن�'>))

2 (( CCP <fs name='CCP' drel='k1:CCP2'>

2.1 ہ� ت CC_CCD <fs name='ہ� <'ت


))


3.1 جنگجٮ�ہن N_NN <fs name='جنگجٮ�ہن'>

))

4 (( VGF <fs name='VGF' drel='fragof:CCP2'>

4.1 ہز پ V_VM <fs name='ہز <'پ))

5 (( VGF <fs name='VGF2' drel='pof_idiom:NP3'>

5.1 نو- ےال V_VM <fs name=' نو- ےال '>

5.2 نو- ےڈل V_VM <fs name=' نو- ےڈل '>

))

6 (( NP <fs name='NP3' drel='pof_idiom:VGF3'>

6.1 ہ� نپنن PR_PRF <fs name='ہ� نپنن '>6.2 ہ� ہن N_NN <fs name='ہ� <'ہن

))

7 (( VGF <fs name='VGF3' drel='k2u:NP4'>

7.1 نو- ےژل V_VM <fs name=' نو- ےژل '>

7.2 ہ*� ہہ RP_RPD <fs name='�*ہ ہہ '>))


8.1 پالیسی N_NN <fs name='پالیسی'>

))

9 (( NP <fs name='NP5' drel='pof:VGNF'>

9.1 نلتھ N_NN <fs name='نلتھ '>))

10 (( VGNF <fs name='VGNF' drel='vmod:CCP2'>

10.1 ہ;تھ V_VM <fs name='ہ;تھ '>))


11.1 مذاکراتن N_NN <fs name='مذاکراتن'>

))


12.1 ن�گ N_NN <fs name='ن�گ '>))


13.1 ن�ن ی; V_VM <fs name='ن�ن ی; '>


))

14 (( CCP <fs name='CCP2'>

14.1 ہ� ت CC_CCD <fs name=' ہ� 2ت '>

))


15.1 ہر یی بک* N_NN <fs name='ہر یی بک* '>

15.2 نن� یہ PP_PSP <fs name='نن� یہ '>))


16.1 ہ�یاے N_NN <fs name='یاے�ہ '>))


17.1 یو& با�ز%ا V_VM <fs name='&یو با�ز%ا '>17.2 ۔ RD_PUNC <fs name='۔'>

))

</Sentence>
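
The chunk-level dependency relations in the sample above can be read off mechanically, since each chunk-opening line carries a name attribute and, where annotated, a drel attribute of the form relation:head. The following is a minimal Python sketch of such a reader, not the tool used in this work. The names Chunk, parse_chunks and dangling_heads are hypothetical and introduced only for illustration, and the sketch assumes clean, single-quoted SSF lines of the kind shown in Sentence 1.

```python
import re
from typing import List, NamedTuple, Optional


class Chunk(NamedTuple):
    index: str               # chunk number, e.g. "4"
    label: str               # chunk tag, e.g. "NP"
    name: str                # unique chunk name, e.g. "NP4"
    relation: Optional[str]  # dependency label, e.g. "k3"
    head: Optional[str]      # name of the head chunk, e.g. "VGF"


# Matches a chunk-opening line such as: 4 (( NP <fs name='NP4' drel='k3:VGF'>
CHUNK_OPEN = re.compile(r"^(\d+)\s+\(\(\s+(\S+)\s+<fs\s+(.*?)>\s*$")
# Matches one key='value' pair inside the <fs ...> feature structure.
ATTR = re.compile(r"(\w+)='([^']*)'")


def parse_chunks(ssf_text: str) -> List[Chunk]:
    """Collect one Chunk per chunk-opening line; token lines and '))' are skipped."""
    chunks = []
    for line in ssf_text.splitlines():
        match = CHUNK_OPEN.match(line.strip())
        if match is None:
            continue
        index, label, attr_text = match.groups()
        attrs = dict(ATTR.findall(attr_text))
        relation, head = None, None
        if "drel" in attrs:
            # drel has the form relation:head, e.g. k3:VGF
            relation, _, head = attrs["drel"].partition(":")
        chunks.append(Chunk(index, label, attrs.get("name", label), relation, head))
    return chunks


def dangling_heads(chunks: List[Chunk]) -> List[Chunk]:
    """Return chunks whose drel points at a chunk name that is not present."""
    names = {chunk.name for chunk in chunks}
    return [c for c in chunks if c.head is not None and c.head not in names]


# Three chunks taken from Sentence 1 of the sample (assumed to be one sentence's worth).
SAMPLE = """\
4 (( NP <fs name='NP4' drel='k3:VGF'>
4.1 دور N_NN <fs name='دور' cat='n'>
))
12 (( VGF <fs name='VGF' drel='ccof:CCP'>
))
13 (( CCP <fs name='CCP'>
))
"""

for chunk in parse_chunks(SAMPLE):
    print(chunk.name, chunk.relation, chunk.head)
```

A check such as dangling_heads, run per sentence, is one possible way to flag annotations whose head chunk is missing, for instance where a chunk-opening line has been lost in transcription.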


Bibliography

Aarts, J. & Meijs, M. 1984. Corpus Linguistics: Recent Developments in the Use of

Computer Corpora in English Language Research. Amsterdam: Rodopi

Abeille, A. Editor. 2000. Building and Using Syntactically Annotated Corpora.

Kluwer, Dordrecht.

Abney, S. 1989. A Computational Model of Human Parsing. In The Journal of

Psycholinguistic Research. Vol.8.1. Bell Communications Research: Morristown,

NJ.

Abney, S. 1991. Chunks and Dependencies: Bringing Processing Evidence to Bear

on Syntax. MS. University of Tubingen.

Abney, S. 1991. Parsing by Chunks. In Berwick, R. Abney, S. Tenny, C. (eds.),

Corpus-based Methods in Language and Speech. Dordrecht: Kluwer, page 257-278.

Abney, S. 1996. A Grammar of Projections. MS. University of Tubingen

Abney, S. 1996. Partial Parsing via Finite-State Cascades. John Carroll, ed. ESSLLI

Workshop on Robust Parsing. Prague. page 8-15.

Abney, S. 1996. Chunk Stylebook. MS. University of Tubingen.

Khan, A. J. 2006. Urdu/Hindi: An Artificial Divide. Algora Publishing: New York.

Aduriz, I. Aranzabe, M. J. Arriola, J. M. Atutxa, A. Diaz de Ilarraza, A. Garmendia,

A. Oronoz, M. 2003. Construction of Basque Dependency Treebank. In:

Nivre/Hinrichs 2003, page 201-204.

Afonso, S. Bick, E. Haber, R. Santos, D. 2002. A Treebank for Portuguese. In

Proceedings of the Third International Conference on Language Resources and

Evaluation. Las Palmas, Spain, 1698-1703.

Ambati, Bharat Ram, Samar Husain, Joakim Nivre & Rajeev Sangal. 2010. On the

Role of Morphosyntactic Features in Hindi Dependency Parsing. MS. Language

Technologies Research Centre, IIIT-Hyderabad, India & Department of Linguistics

and Philology, Uppsala University, Sweden.

Ambati, Bharat Ram, Pujitha Gade, Chaitanya GSK & Samar Husain. 2009. Effect

of Minimal Semantics on Dependency Parsing. MS. LTRC, IIIT-Hyderabad.


Arppe, Antti, Gaëtanelle Gilquin, Dylan Glynn, Martin Hilpert & Arne Zeschel. 2010.

Cognitive Corpus Linguistics: Five Points of Debate on Current Theory and

Methodology. Corpora Vol. 5.1:1-27. Edinburgh University Press

Atkins, Sue, Jeremy Clear & Nicholas Ostler. 1991. Corpus Design Criteria. Literary

& Linguistic Computing 7:1-16.

Baker, P. et al. 2000. Corpus Linguistics and South Asian Languages: Corpus

Creation and Tool Development. Literary and Linguistic Computing. Vol. 19 (4).

Page 509-524.

Baerman, Matthew & Brown, D. 2013. Case Syncretism. World Atlas of Language

Structures, Eds. Bernard Comrie, Matthew Dryer, David Gil and Martin Haspelmath.

Munich: Max Planck Digital Library.

Bamman, D. & Crane, G. 2006. The Design and Use of a Latin Dependency

Treebank. In Proceedings of TLT, 67-78. FAL MFF UK, Prague.

Bank. 2003. In Proceedings of the 4th International Workshop on Linguistically

Interpreted Corpora (LINC). Budapest, Hungary.

Barnbrook, Geoff. 1996. Language and Computers: A Practical Introduction to the

Computer Analysis of Language. Edinburgh University Press: Edinburgh.

Baskaran S. et al. 2007. Framework for a Common Parts-of-Speech Tagset for Indic

Languages. (Draft) http://research.microsoft.com/~baskaran/POSTagset/

Bayer, Josef. 2008. What is Verb Second? MS. University of Konstanz.

Begum, R., Husain, S., Sharma, D.M., Bai, L. 2008. Developing Verb Frames in

Hindi. In Proceedings of LREC. Marrakech, Morocco.

Begum, R. Husain. S. Dhwaj, A. Sharma, D. M. Bai, L. and R. Sangal. 2008.

Dependency annotation scheme for Indian Languages. In Proceedings of IJCNLP.

Citeseer.

Begum, R. Jindal, K., Jain, A., Husain, S. and Sharma D.M. 2011. Identification of

Conjunct Verbs in Hindi and its Effect on Parsing Accuracy. In Computational

Linguistics and Intelligent Text Processing: 29-40.

Becker, D. Kashif, R. A Study in Urdu Corpus Construction. University of St. Thomas,

Department of Computer Science, University of Minnesota-Twin Cities. U.S.A. Ms.



Bharati, A., Chaitanya, V., Sangal, R. and KV Ramakrishnamacharyulu. 1995.

Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.

Bharati A, D. M. Sharma, L. Bai and R. Sangal. 2006. AnnCorra: Annotating

Corpora Guidelines For POS And Chunk Annotation For Indian Languages. LTRC

Technical Report-31

Bharati A, D. M. Sharma, S. Husain, L. Bai, R. Begam, R. Sangal. 2012. AnnCorra:

Treebanks for Indian Languages Guidelines For Annotation Hindi Treebank. LTRC

Technical Report.

Bharati, A., Bhatia, M., Chaitanya, V. and R. Sangal. 1996. Paninian Grammar

Framework Applied to English. Technical Report TRCS-96-238,

CSE, IIT Kanpur.

Bharati, A. Sangal, R. and D.M. Sharma. 2007. SSF: Shakti Standard Format Guide.

Technical Report, IIIT Hyderabad.

Bharati, A., Sharma, D.M., Husain, S., Bai, L., Begum, R. and R. Sangal. 2009.

Anncorra: Treebanks for Indian Languages Guidelines for Annotating Hindi

Treebank (version–2.0).

Bharati, A., Husain, S., Sharma, D.M., Sangal, R. 2008. A Two-Stage Constraint

Based Dependency Parser for Free Word Order Languages. In Proceedings of the

COLIPS IALP. Chiang Mai, Thailand.

Bhat, D.N.S. 1991. Grammatical Relations: the Evidence Against their Necessity and

Universality. Psychology Press

Bhat S.M. 2012. Building large Scale POS Annotated Corpus for Hindi & Urdu (co-

authored). In Proceedings of Workshop on Indian Language & Data: Resources &

Evaluation (WILDRE), LREC 2012 (Istanbul, Turkey).

Bhat S.M. 2010. Developing Fine-grained Hierarchical POS Tagset for Kashmiri. In

Proceedings of International Conference on Language Development & Computing

Methods ICLDCM, Coimbatore

Bhat, S.M. & Richa, S. 2011. Case Syncretism and Disambiguating Algorithms for

Urdu-Hindi POS Tagging. In Interdisciplinary Journal of Linguistics 4:187-194.

University of Kashmir, Srinagar

Bhat, S.M. 2012. Introducing Kashmiri Dependency Treebank. In Workshop on

Machine Translation and Parsing of Indian Languages (MTPIL), COLING 2012, IIT

Mumbai, Mumbai

Bhat, S.M. 2011. Developing Small Scale Treebank for Kashmiri. In SCONLI-06, at

Banaras Hindu University, Varanasi. Ms.

Bhat, S.M. 2012. Empirical Method of Language Documentation: A Case Study of

Compiling Kashmiri Corpus, In National Seminar on Endangered and Lesser Known

Languages: Issues and Responses, 2012, LU, Lucknow. Ms.

Bhat, S.M. 2013. Manual Chunking and Parsing Kashmiri Text Corpus. In

International Conference of Linguistic Society of India (ICOLSI), Central Institute of

Indian Languages (CIIL), Mysore. Ms.

Bhat, R.A. Bhat, S.M & D. M. Sharma. 2014. Towards Building a Kashmiri

Treebank: Setting up a Treebanking Pipeline. Ms.

Bhat, R. A. & D. M. Sharma. 2012. In Proceedings of the 6th Linguistic Annotation

Workshop, pages 157-165, Jeju, Republic of Korea. Association for Computational

Linguistics

Bhat, R. A. & D. M. Sharma. 2013. Non-projective Structure in Indian Language

Treebanks. Ms.

Bhat R. N. Dardic: What does the Label Denote? BHU, Varanasi. Ms.

Bhatt, R. B. Narasimhan, M. Palmer, O. Rambow, D.M. Sharma, and F. Xia. 2009. A

multi-representational and multi-layered treebank for hindi/urdu. In Proceedings of

the Third Linguistic Annotation Workshop: page 186-189. Association for

Computational Linguistics.

Biber, Douglas. 1993. Representativeness in Corpus Design. Literary and Linguistic

Computing 8.4.

Blake, Barry J. 2004. Case. Cambridge University Press: Cambridge.

Bloomfield, L. 1933. Language. The University of Chicago Press.

Bögel, T. M. Butt, and S. Sulger. 2008. Urdu ezafe and the Morphology-syntax

interface. In Proceedings of LFG ’08.

Bond, F. S. Fujita, and T. Tanaka. 2008. The Hinoki Syntactic and Semantic

Treebank of Japanese. Language Resources and Evaluation 42(2):243–251.

Bod, R. and Scha, R. 1997. Data-oriented Language Processing. In Young and

Bloothooft, pages 137–173.

Bod, R. 2003. Is there Evidence for a Probabilistic Language Faculty? Ms.

Bod, R. Hay, J. & Jannedy, S. (Eds.). 2003. Probabilistic Linguistics. Cambridge,

Massachusetts: MIT Press.

Bod, R. Margaux, S. 2012. Empiricist Solutions to Nativist Puzzles by Means of

Unsupervised TSG. In Proceedings of Workshop on Computational Models of

Language Acquisition and Loss. EACL. Association of Computational Linguistics.

Bosco, C. Lombardo, V. 2004. Dependency and Relational Structure in Treebank

Annotation. In Proceedings of Workshop on Recent Advances in Dependency

Grammar at COLING.

Bosco, C. & Lombardo, V. 2003. A Relation-based Schema for Treebank

Annotation. In Proceedings of the Advances in Artificial Intelligence, 8th Congress

of the Italian Association for Artificial Intelligence, Pisa, Italy.

Bosco, C. & Lombardo, V. 2000. An Annotation Schema for an Italian Treebank. In

Proceedings of the Student Session, 12th European Summer School in Logic,

Language and Information, Birmingham, UK.

Brants, S., S. Dipper, P. Eisenberg, S. Hansen, E. König, W. Lezius, C. Rohrer, G.

Smith & H. Uszkoreit, 2004. TIGER: Linguistic Interpretation of a German Corpus.

In E. Hinrichs and K. Simov (Eds), Research on Language and Computation, Special

Issue. Vol. 2: 597-620.

Brants, S. S. Dipper, S. Hansen, W. Lezius, and G. Smith. 2002. The Tiger Treebank.

In Proceedings of the Workshop on Treebanks and Linguistic Theories: page 24-41.

Brants, T. Wojciech S. and Hans U. 1999. Syntactic Annotation of a German

Newspaper Corpus. In Proceedings of the ATALA Treebank Workshop. Paris, France.

Burkhardt, Petra. 2005. The Syntax–Discourse Interface: Representing and

Interpreting Dependency. John Benjamins Publishing Company:

Amsterdam/Philadelphia

Butt, Miriam. 2005. Theories of Case. Cambridge University Press: Cambridge.

Butt, Rajesh. --- . Verb Movement in Kashmiri. Ms.

Buchholz, S. Marsi, E. 2006. CoNLL-X Shared Task on Multilingual Dependency

Parsing. In Proceedings of Tenth Conference on Computational Language Learning.

Association for Computational Linguistics.

Carroll, John. Guido Minnen. & Ted Briscoe. 1999. In Proceedings of the EACL

Workshop on Linguistically Interpreted Corpora (LINC), Bergen, Norway.

Carletta, J. S. Isard, G. Doherty-Sneddon, A. Isard, J.C. Kowtko, and A.H. Anderson.

1997. The Reliability of a Dialogue structure Coding Scheme. Computational

linguistics 23.1:13–31.

Chatterji, Sanjay, Tanaya Mukherjee Sarkar, Sudeshna Sarkar & Jayshree

Chakraborty. 2009. Karak Relations in Bengali. In Proceedings of 31st All-India

Conference of Linguists (AICL), Hyderabad, India, pp 33-36.

Chatterji, Sanjay, Praveen Sonare, Sudeshna Sarkar & Devshri Roy. 2009. Grammar

driven rules for hybrid Bangla dependency parsing. In Proceedings of ICON09 NLP

Tools Contest: Indian Language Dependency Parsing, Hyderabad, India, pp. 37-41.

Chaudhry, H. and D.M. Sharma. 2011. Annotation and Issues in Building an English

Dependency Treebank.

Chater, N. & Manning, C.D. 2006. Probabilistic Models of Language Processing and

Acquisition. Trends in Cognitive Sciences, 10, pages 335-344.

Chen, K.-J. Luo, C. C. Gao, Z. M. Chang, M. C. Chen, F. Y. & Chen, C. J. 1999. The

CKIP Chinese Treebank. In Journées ATALA sur les Corpus annotés pour la

syntaxe, Talana, Paris VII: 85-96.

Charniak, E. (1993). Statistical Language Learning. MIT Press, Cambridge,

Massachusetts.

Chen, K. J. et al., 2003. Building and Using Parsed Corpora. (A. Abeillé Eds).

KLUWER: Dordrecht.

Chomsky, N. 1981. Lectures on Government and Binding: The Pisa Lectures.

Holland: Foris Publications.

Cloeren, J. 1999. Tagsets. In Syntactic Wordclass Tagging, Hans van Halteren (ed.),

Dordrecht: Kluwer Academic.

Cohen, J. et al. 1960. A Coefficient of Agreement for Nominal Scales. Educational

and Psychological Measurement 20 (1): 37-46.

Collins, M., Jan Hajič, L. Ramshaw and C. Tillmann. 1999. A Statistical Parser for

Czech. In Proceedings of ACL: 505-512.

Collins, M. 1999. Head-driven Statistical Models for Natural Language Parsing. PhD

Thesis, University of Pennsylvania.

Corbett, G., N. M. Fraser, and S. McGlashan, 1993. Heads in Grammatical Theory.

Cambridge University Press, Cambridge.

Covington, M. A. (1984). Syntactic Theory in the High Middle Ages. Cambridge

University Press.

Covington, M. A. (1990a). A dependency parser for variable-word-order languages,

Technical Report AI-1990-01, University of Georgia.

Covington, M. A. (1990b). Parsing Discontinuous Constituents in Dependency

Grammar. Computational Linguistics 16: pages 234-236.

Covington. M. A. 1990. A Dependency Parser for Variable Word Order Languages.

Research Report AI-1990-01, Artificial Intelligence Programmes, University of

Georgia, Athens, Georgia 30602 U.S.A.

Covington. M. A. 1994. Discontinuous Dependency Parsing of Free and Fixed Word

Order. Research Report AI-1994-02, Artificial Intelligence Programmes, University

of Georgia, Athens, Georgia 30602 U.S.A.

Covington. M. A. 2001. A Fundamental Algorithm for Dependency Parsing. In

Proceedings of the 39th Annual ACM Southeast Conference, pages 95–102.

Cowper, Elizabeth. 2002. Finiteness. MS. University of Toronto

Culotta, A. and J. Sorensen. 2004. Dependency tree kernels for relation extraction. In

Proceedings of the 42nd Annual Meeting on Association for Computational

Linguistics: 423. Association for Computational Linguistics.

Dandapat, Sandipan. 2008. Part-of-Speech Tagging for Bengali. Unpublished

Dissertation. IIT Kharagpur.

Durrani, N. and S. Hussain. 2010. Urdu word segmentation. In Human Language

Technologies, The 2010 Annual Conference of the North American Chapter of the

Association for Computational Linguistics: 528–536. Association for Computational

Linguistics.

Dash, N. S. 2010. Corpus Linguistics: A General Introduction. Paper presented at

CIIL, Mysore.

D. Chakrabarty, V. Sarma and P. Bhattacharyya. 2007. Complex Predicates in Indian

Language Wordnets, Lexical Resources and Evaluation Journal, 40 (3-4), 2007.

Debusmann, R. 2004. A Declarative Grammar Formalism for Dependency Grammar.

Dissertation. Universität des Saarlandes.

Dipper, S. 2008. Theory-driven and Corpus-driven Computational Linguistics, and

the Use of Corpora. In Anke Lüdeling and Merja Kytö (eds.), Corpus Linguistics: An

International Handbook. Handbooks of Linguistics and Communication Science, pp.

68-96. Mouton de Gruyter: Berlin.

Dowty, D. 1982. Grammatical Relations and Montague Grammar. In Jacobson, P.

and Pullum, G., Editors, The Nature of Syntactic Representation, pages 79-130. D.

Reidel Publishing Company.

Eide, Kristin M. 2007. Finiteness. Paper presented at 3rd ScanDiaSyn Grand

Meeting, Iceland.

E. Hajicova. 1998. Prague Dependency Treebank: From Analytic to

Tectogrammatical Annotation. In Proc. TSD’98.

Fillmore, C. 1968. The Case for Case. In Universals in Linguistic Theory, E. Bach

and R. T. Harms (eds). Holt, Rinehart and Winston, NY.

Fillmore, Charles J. 1992. Corpus Linguistics or Computer-aided Armchair

Linguistics. In: Directions in Corpus Linguistics. Proceedings of Nobel Symposium,

4-8 August 1991. Ed. by Jan Svartvik. Berlin, New York: Mouton de Gruyter.

Fong, S. Robert C.B. ---. Treebank Parsing and Knowledge of Language: A

Cognitive Perspective. Department of Linguistics and Computer Science, University

of Arizona, Department of EECs, Brain and Cognitive Science, MIT. Ms.

Garside, R. 1987. The CLAWS Word-tagging System. In The Computational

Analysis of English, Garside, Leech and Sampson, (eds). London: Longman.

Garside, R., Leech, G. & McEnery, T. 1997. Corpus Annotation: Linguistic

Information from Computer Text Corpora. London and New York: Longman.

Glynn, Dylan. 2010. Corpus-driven Cognitive Linguistics. A case study in polysemy.

MS. Lund University.

Gries, Stefan. 2011. Corpus data in usage-based linguistics: What’s the right degree

of granularity for the analysis of argument structure constructions? In Mario Brdar,

Stefan Th. Gries, & Milena Žic Fuchs (eds.), Cognitive linguistics: convergence and

expansion, 237-256. John Benjamins: Amsterdam & Philadelphia.

Gries, Stefan. 2012. The Corpus Linguistics: Quantitative Methods. In Carol A.

Chapelle (ed.), The encyclopedia of applied linguistics, 1380-1385. Wiley-

Blackwell: Oxford.

Gruber, J. S. 1965. Studies in Lexical Relations. Ph.D. thesis, MIT.

Gupta, Mridul, Vineet Yadav, Samar Husain & Dipti M Sharma. 2008. A Rule

Based Approach for Automatic Annotation of a Hindi TreeBank. In Proceedings of

the 6th International Conference on Natural Language Processing (ICON-08).

Gupta, Swati. 2004. Aligning Hindi and Urdu Bilingual Corpora for Robust

Projection. M.Sc. Report.

Habash, N. and Owen Rambow (2005). Arabic Tokenization, Morphological

Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the

Conference of American Association for Computational Linguistics (ACL05).

Hajič, J. 1998. Building a Syntactically Annotated Corpus: The Prague Dependency

Treebank. Issues of valency and meaning: 106–132.

Hajič, J. E. Hajicova, M. Holub, P. Pajas, P. Sgall, B. Vidova-Hladka, and V.

Reznickova. 2001. The Current Status of the Prague Dependency Treebank. Lecture

Notes in Artificial Intelligence (LNAI) 2166: 11-20, NY.

Hajicova, E. and M. Ceplov, 2000. Deletions and Their Reconstruction in

Tectogrammatical Syntactic Tagging of Very Large Corpora. In Proceedings of

Coling: 278-284.

Hajicová, E. 1998. Prague Dependency Treebank: From Analytic to

Tectogrammatical Annotation. In Proceedings of TSD ’98: 45–50.

Hajicova, E., A. Abeillé, J. Hajič, J. Mírovský, and Z. Urešová. 2010. Treebank

Annotation. In Nitin Indurkhya and Fred J. Damerau (eds), Handbook of Natural

Language Processing, Second Edition. CRC Press, Taylor and Francis Group, Boca

Raton, FL.

Hammond, Michael. 2003. Programming for Linguists: Perl for Language

Researchers. Blackwell Publishing: Oxford.

Hardie, A. 2003. Developing a Tagset for Automated Part-of-speech Tagging in

Urdu. In Proceedings of the Corpus Linguistics ‘03.

Hardie, A. 2004. The Computational Analysis of Morpho-syntactic Categories in

Urdu. PhD Dissertation, Lancaster University.

Haspelmath, M. 1997. From Space to Time: Temporal Adverbials in the World’s

Languages. LINCOM Studies in Theoretical Linguistics 03. LINCOM EUROPA

München – Newcastle

Herrera, Jesus. 2007. Building Corpora for the Development of a Dependency Parser

for Spanish Using Maltparser. Procesamiento del Lenguaje Natural 39: pages

181-186.

Fillmore, C., P. Kay & M. O’Connor. 1988.

Regularity and Idiomaticity in Grammatical Constructions: The Case of Let Alone.

Language 64: 501-538.

Hudson, R. 1984. Word Grammar. Basil Blackwell, Oxford and New York.

Hudson, R. 1990. English Word Grammar. Basil Blackwell, Oxford and Cambridge.

Hudson, R. 2003. The Psychological Reality of Syntactic Dependency Relations.

MTT, Paris. Ms.

Hudson, R. --- . Discontinuous Phrases in Dependency Grammar. Ms.

Jarvinen, T. 2000. Bank of English and Beyond Hand-crafted Parsers for Functional

Annotation. In Abeille, 2000, pages 43–59.

Kingsbury, P. and Palmer, M. 2002. From Tree-Bank to PropBank. In Proceedings of

LREC, Las Palmas, Spain.

Kingsbury, P., Palmer, M., and Marcus, M. 2002. Adding Semantic Annotation to the

Penn TreeBank. In Proceedings of the Human Language Technology Conference,

San Diego California.

Husain, Samar, Phani Chaitanya, Ganeshwar Rao Dulam, Tariq Khan & Dipti M.

Sharma. 2009. Using Levins Verb Classification for Preposition Sense Selection in

English to Indian Language MT. In Proceedings of the Conference on Language and

Technology 2009 (CLT09), Lahore, Pakistan.

Hussain, M. 1987. Geography of Jammu and Kashmir, Delhi, Rajesh publications.

Jackendoff, R. 1972. Semantic Interpretation in Generative Grammar. MIT Press:

Cambridge.

Jacque, Kristin. 2006. Analysis of a Potential Latin Treebank. MS.

Jurafsky, Daniel & James H. Martin. 1999. Speech and Language Processing: An

Introduction to Natural Language Processing, Computational Linguistics and

Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey.

Kahane, Sylvain. ---. Why to Choose Dependency Rather Than Constituency for

Syntax: A Formal Point of View. MS. Modyco-Université Paris Ouest Nanterre &

CNRS.

Kakkonen, T. 2006. DepAnn - An Annotation Tool for Dependency Treebanks. In

Proceedings of the Eleventh ESSLLI Student Session. Janneke Huitink & Sophia

Katrenko (eds.)

Kakkonen, T. 2006. Dependency Treebanks: Methods, Annotation Schemes and

Tools.

arXiv:cs/0610124v1 [cs.CL], 20 Oct 2006

Keith, A. 2007. The Western Classical Tradition in Linguistics. Equinox Publishing

Ltd, London.

Kidwai, Ayesha. 2007. A Handbook for Research Scholars. URL:

www.jnu.ac.in/SLLCS/SLLCS%20Research%20Manual.pdf

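King, T. H., R. Crouch, S. Riezler, M. Dalrymple and R. Kaplan. 2003. The

PARC700 Dependency Bank. In Proceedings of the 4th International Workshop on

Linguistically Interpreted Corpora (LINC). Budapest, Hungary.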

Kingsbury, P. and M. Palmer. 2002. From treebank to propbank. In Proceedings of

LREC.

Kiparsky, P. ---. On the Architecture of Panini’s Grammar. Stanford University. Ms.

Kiparsky, P. 1994. Paninian Linguistics, Asher R.E., Ed., Encyclopedia of Language

and Linguistics. Oxford, New York

Kiparsky, P. ---. Panini is Slick But He is not Mean. Stanford University. Ms.

Kiparsky, P. 2007. Panini’s Razor. Paris. Ms.

Kiparsky, P. 1979. Panini as a Variationist. MIT Press and Poona University Press.

Kiparsky, P. 1991. On Paninian Studies. Journal of Indian Philosophy, Vol.

19:189-225.

Kiparsky, P. ---. Dvandvas, Blocking, and the Associative: The Bumpy Ride from

Phrase to Word. Ms.

Kiparsky, P. ---. Event Structure and the Perfect. Ms.

Kiparsky, P. ---. The Shift to Head-Initial VP in Germanic. Ms.

Kiparsky, P. ---. Towards a Null Theory of the Passive. Ms.

Kiparsky, P. ---. Grammaticalization as Optimization. Ms.

Klein, D. and C. D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings

of ACL. Japan.

Kolachina, Prasanth, Sudheer Kolachina, Anil Kumar Singh, Samar Husain,

Viswanatha Naidu, Rajeev Sangal & Akshar Bharati. ---- . Grammar Extraction from

Treebanks for Hindi and Telugu. MS. Language Technologies Research Centre, IIIT-

Hyderabad, India

Koul, Omkar N. 2006. Modern Kashmiri Grammar. USA: McNeil Technologies,

Inc.

Krifka, Manfred. 2006. Basic Notions of Information Structure. Interdisciplinary

Studies on Information Structure 06, Féry, Fanselow and Krifka (Eds.)

Kroch, A. Taylor A. --- . Verb Movement in Old and Middle English: Dialect

Variation and Language Contact. Ms

Kroch, A. Taylor A. 2000. Verb-Object Order in Early Middle English. Ms.

Kübler, Sandra, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing.

Synthesis Lectures on Human Language Technologies. Graeme Hirst (Ed) Morgan &

Claypool Publishers.

Kucera, H. 1992. The Odd Couple: The Linguist and the Software Engineer. The

Struggle for High Quality Computerized Language Aids. In Svartvik, pages 401–424.

Kuhlmann, M. and M. Möhl. 2007. Mildly Context Sensitive Dependency Language.

In Proceedings of ACL. Prague, Czech Republic.

Landis, J.R. and G.G. Koch. 1977. The Measurement of Observer Agreement for

Categorical Data. Biometrics: 159–174.

Lawey, Aadil, A. & Nazima, Mehdi. 2011. Development of Unicode Compliant

Kashmiri Font: Issues and Resolutions. In Interdisciplinary Journal of Linguistics

4:195-200. University of Kashmir: Srinagar.

Lee, H. C. N. Huang, J. Gao and X. Fan, 2004. Chinese Chunking with Another

Type of Spec. In Proceedings of SIGHAN: 41-48. Barcelona.

Leech, G. & Wilson, A. 1996. Recommendations for the Morpho-syntactic

Annotation of Corpora. EAGLES Report EAG-TCWG-MAC/R.

Leech, G and Wilson, A. 1999. Standards for Tag-sets. In Syntactic Wordclass

Tagging, Hans van Halteren (ed.), Dordrecht: Kluwer Academic.

Leech, G. 1991. The State of the Art in Corpus Linguistics. In Aijmer, K. and

Altenberg, B., Editors, English Corpus Linguistics: Studies in Honour of Jan

Svartvik, pages 8–29. Longman, London.

Leech, G. 1992. Corpora and Theories of Linguistic Performance. In Svartvik,

1992b, pages 105–122.

Leech, G., Barnett, R., and Kahrel, P. 1996. EAGLES Recommendations for the

Syntactic Annotation of Corpora, eag-tcwg-sasg/1.8 version of 11th March 1996.

http://www.ilc.pi.cnr.it/EAGLES96/segsasg1/segsasg1.html.

Lehal, G.S. 2010. A Word Segmentation System for Handling Space Omission

Problem in Urdu Script. In Proceedings of 23rd International Conference on

Computational Linguistics: 43.

Lesmo, L. and Lombardo, V. 2000. Automatic Assignment of Grammatical

Relations. In Proceedings of LREC, pages 475-482, Athens, Greece.

Litkowski, K. 1999. Question-answering Using Semantic Relation Triples. In

Proceedings of TREC-8, pages 349–356, Gaithersburg MD.

Lombardo, V. and Lesmo, L. 1998. Unit Coordination and Gapping in Dependency

Theory. In Processing of Dependency-based Grammars, COLING-ACL.

Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation.

University of Chicago Press.

Lindquist, Hans. 2009. Corpus Linguistics and the Description of English. Edinburgh

University Press: Edinburgh.

Liberman, M. 2000. Legal, Ethical and Policy Issues Concerning the Recording and

Publication of Primary Language Materials. In Steven Bird and Gary Simons,

(editors).

Lüdeling, Anke & Merja Kytö (eds.). 2009. Corpus Linguistics: An International

Handbook Vol.2. Walter de Gruyter: Berlin.

Manning, C. and H. Schütze. 1999. Foundations of Statistical Natural Language

Processing. MIT.

Steedman M. 2011. Romantics and Revolutionaries: What Theoretical and

Computational Linguists Need to Know About Each Other But We Are Afraid.

Linguistic Issues in Language Technology LILT. CSLI Publications

Marcus, M.P. M.A. Marcinkiewicz, and B. Santorini. 1993. Building a Large

Annotated Corpus of English: The Penn Treebank. Computational linguistics 19 (2):

313–330.

Marantz, A. P. 1984. On the Nature of Grammatical Relations. MIT Press,

Cambridge.

Marcus, M., Kim, G., Marcinkiewicz, M., MacIntyre, R., Bies, A., Ferguson, M.,

Katz, K. and Schasberger, B. 1994. The Penn Treebank: Annotating Predicate

Argument structure. In Proceedings of The Human Language Technology Workshop,

San Francisco. Morgan-Kaufmann.

Marcus, M., Santorini, B., and Marcinkiewicz, M. 1993. Building a Large Annotated

Corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

M. Butt. 2004. The Light Verb Jungle. In G. Aygen, C. Bowern & C. Quinn Eds.

Papers from the GSAS/Dudley House Workshop on Light Verbs. Cambridge, Harvard

Working Papers in Linguistics, p. 1-50.

McEnery, T. and Wilson, A. (1996). Corpus Linguistics. Edinburgh University Press,

Edinburgh.

Masica, C.P. 1993. The Indo-Aryan Languages. Cambridge University Press.

Cambridge, UK

Matthews, P.H. 2007. Syntactic Relations: A Critical Survey. Cambridge University

Press, Cambridge, UK

McDonald, R. F. Pereira, K. Ribarov and J. Hajič. 2005. Non-Projective Dependency

Parsing using Spanning Tree Algorithms. In Proceedings of HLT-EMNLP.

McEnery, Tony & Wilson, A. Corpus Linguistics. Edinburgh University Press:

Edinburgh.

McEnery, A. M., Baker, J. P., Gaizauskas, R. & Cunningham, H. 2000. EMILLE:

Building Corpus of South Asian Languages. Vivek, A Quarterly in Artificial

Intelligence. 13 (3): page 23-32.

Mel’čuk, I. 1979. Studies in Dependency Syntax. Karoma Publishers, Inc.

Mel’čuk, I.A. 1988. Dependency Syntax: Theory and Practice. State University of

New York Press.

Meyers, A. R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R.

Grishman, 2004. The NomBank Project: An Interim Report. In NAACL/HLT 2004

Workshop Frontiers in Corpus Annotation.

Meyers, A. 1995. The NP Analysis of NP. In Papers from the 31st Regional Meeting

of the Chicago Linguistic Society: 329-342.

Meyer, Charles F. 2002. English Corpus Linguistics: An Introduction. Cambridge

University Press: Cambridge.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill Higher Education.

Milicevic, Jasmina. 2006. A Short Guide to the Meaning-Text Linguistic Theory.

Journal of Koralex, vol. 8: 187-233.

Mohanan, T. 1990. Arguments in Hindi. Ph.D. Thesis, Stanford University.

Neumann, Gunter. 1994. A Uniform Computational Model for Natural Language

Parsing and Generation. Doctoral Dissertation. The University of Saarlandes

Nilsson, Peter. --- . An Experimental Study of Nivre’s Parser. Thesis for a diploma in

computer science, Department of computer science, Faculty of science, Lund

University.

Nivre, J. 2003. An Efficient Algorithm for Projective Dependency Parsing. In

Proceedings of the 8th International Workshop on Parsing Technologies (IWPT 03),

pages 149–160, Nancy.

Nivre, J. 2005. Inductive Dependency Parsing of Natural Language Text. PhD thesis,

School of Mathematics and System Engineering, Växjö University.

Nivre, J. and Nilsson, J. (2005). Pseudo-projective Dependency Parsing. In

Proceedings of the 43rd Annual Meeting of the Association for Computational

Linguistics (ACL’05), pages 99–106, Ann Arbor.

Nivre, J. ---. Dependency Grammar and Dependency Parsing. Ms.

Oepen, S. K. Toutanova, S. M. Shieber, C. D. Manning, D. Flickinger, and T. Brants,

2002. The LinGO Redwoods Treebank: Motivation and Preliminary Applications. In

Proceedings of COLING. Taipei, Taiwan.

O'Keeffe, A. & M. McCarthy (eds.). 2010. The Routledge Handbook of Corpus

Linguistics. Routledge: London.

Oflazer, K., B. Say, D.Z. Hakkani-Tür, and G. Tür. 2003. Building a Turkish

Treebank. In Abeillé: 261–277.

Palmer, M., D. Gildea, P. Kingsbury. 2005. The Proposition Bank: An Annotated

Corpus of Semantic Roles. Computational Linguistics 31(1):71-106.

Palmer, M., R. Bhatt, B. Narasimhan, O. Rambow, D.M. Sharma, and F. Xia. 2009.

Hindi syntax: Annotating dependency, lexical predicate-argument structure, and

phrase structure. In Proceedings of 7th International Conference on Natural

Language Processing: 14–17.

Perlmutter, D. M. and P. M. Postal, 1984. The 1-Advancement Exclusiveness Law.

In Studies in Relational Grammar 2. D. M. Perlmutter & C. G. Rosen (eds). Univ. of

Chicago Press.

Perlmutter, D. 1983. Studies in Relational Grammar. University of Chicago Press.

Poesio, M. 1999. Coreference in MATE Deliverable 2.1,

http://www.ims.unistuttgart.de/projekte/mate/mdag/cr/cr_1.html

Piwek, Paul & Kees van Deemter. 2006. Constraint-based Natural Language

Generation: A Survey. Technical Report. The Open University, UK.

Phillips, C. ---. Should we Impeach Armchair Linguists? To Appear in S. Iwasaki

(Ed.) Japanese/Korean Linguistics 17. CSLI Publications. Special Section of Papers

from a Workshop on ‘Progress in Generative Grammar’. Ms.

Poesio, M. 2004. The MATE/GNOME Scheme for Anaphoric Annotation, Revisited.

In Proceedings of SIGDIAL.

Poesio, M. and R. Artstein. 2005. The Reliability of Anaphoric Annotation,

Reconsidered: Taking Ambiguity into Account. In Proceedings of ACL Workshop on

Frontiers in Corpus Annotation.

Polguère, A. & Mel’čuk, Igor A. 2009. Dependency in Linguistic Description. John

Benjamins.

Pustejovsky, J. A. Meyers, M. Palmer, and M. Poesio, 2005. Merging PropBank,

NomBank, TimeBank, Penn Discourse Treebank and Coreference. In ACL

Workshop: Frontiers in Corpus Annotation II: Pie in the Sky.

Rambow, O., Creswell, C., Szekely, R., Taber, H., Walker, M. 2002. A Dependency

Treebank for English. In Proceedings of LREC.

Rajesh Bhatt. 2008. A Lecture at EFLU, Hyderabad.

http://people.umass.edu/bhatt/papers/eflu-aug18.pdf

Reddy, Prashanth, Aswarth Abhilash & Akshar Bharati. 2009. LTAG-spinal

Treebank and Parser for Hindi. In International Conference on Natural Language

Processing (ICON2009).

Reichartz, F., H. Korte, and G. Paass. 2009. Dependency Tree Kernels for Relation

Extraction from Natural Language Text. Machine Learning and Knowledge

Discovery in Databases: 270–285.

Renouf, A. 2002. The Time Dimension in Modern English Corpus Linguistics. In B.

Kettemann & G. Marko (eds.). 2000. Teaching and Learning by doing Corpus

Analysis. Papers from the Fourth International Conference on Teaching and

Language Corpora, Graz, Amsterdam.

Richa. 2011. Hindi Verb Classes & Their Argument Structure Alternations.

Cambridge Scholars Publishing: UK.

Ross, J. R. 1967. Constraints on Variables in Syntax. Doctoral dissertation, MIT.

Robins, R. H. 1967. A Short History of Linguistics. Longman.

Robinson, J. J. (1970). Dependency Structures and Transformational Rules.

Language 46: page 259-285.

Sag, I. A. and J. D. Fodor, 1994. Extraction without Traces. In R. Aranovich, W.

Byrne, S.

Sampson, G. 2005. Quantifying the Shift Towards Empirical Methods. International

Journal of Corpus Linguistics 10 (1)

Sampson, G. 2007. Grammar without Grammaticality. Corpus Linguistics and

Linguistic Theory 3 (1)

Schneider, G. 1998. A Linguistic Comparison of Constituency, Dependency and

Link Grammar. ExtrAns Research Report: Dependency vs. Constituency

Bird, S. and Simons, G. 2001. The OLAC Metadata Set and Controlled

Vocabularies. In Proceedings of ACL/EACL Workshop on Sharing Tools and

Resources for Research and Education. http://arXiv.org/abs/cs/0105030.

Bird, S. and Simons, G. 2001. Seven Dimensions of Portability for Language

Documentation and Description. LDC UPenn http://arxiv.org/abs/cs/0204020v1

Salmon-Alt, S. and L. Romary. 2004. RAF: Towards a Reference Annotation

Framework, LREC.

Santorini, B. 1990. Part-of-speech Tagging Guidelines for the Penn Treebank

Project. Technical Report MS-CIS- 90-47, Department of Computer and Information

Science, University of Pennsylvania.

Sharma, D. M., R. Sangal, L. Bai, R. Begam, and K.V. Ramakrishnamacharyulu.

2007. AnnCorra: TreeBanks for Indian Languages, Annotation Guidelines

(manuscript), IIIT, Hyderabad, India.

Shaumyan, S. 1977. Applicative Grammar as a Semantic Theory of Natural

Language. Chicago Univ. Press.

Shieber, S.M. 1985. Evidence Against the Context-freeness of Natural Language.

Linguistics and Philosophy 8(3): page 333-343.

Singh, A. K. 2008. A Mechanism to Provide Language-encoding Support and an

NLP Friendly Editor. In Proceedings of the third international joint conference on

natural language processing (ijcnlp). Hyderabad, India: Asian Federation of Natural

Language Processing.

Singh, A. K. 2011. A Concise Query Language with Search and Transform

Operations for Corpora with Multiple Levels of Annotation. CoRR,

http://arxiv.org/abs/1108.1966.

Singh, A. K. & Ambati, B. 2010. An Integrated Digital Tool for Accessing Language

Resources. In The Seventh International Conference on Language Resources and

Evaluation (lrec). Malta: The European Language Resources Association (ELRA).

Singh, A. K. 2011. Part-of-Speech Annotation with Sanchay. In Proceedings of

National Seminar on POS Annotation: Issues and Perspectives. LDCIL, CIIL

Mysore.

Skut, Wojciech, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit, 1997. An

Annotation Scheme for Free Word Order Languages. In Proceedings of the Fifth

Conference on Applied Natural Language Processing ANLP-97. Washington, DC.

Simkova, Maria (ed.). 2006. Insight into the Slovak and Czech Corpus Linguistics.

Publishing House of Slovak Academy of Sciences: Bratislava.

Sinclair, John & Ronald Carter (eds.). Trust the Text: Language, Corpus &

Discourse. Routledge: London.

Singh, Anil Kumar, Samar Husain, Harshit Surana, Jagadeesh Gorla, Chinnappa

Guggilla & Dipti Misra Sharma. 2007. Disambiguating Tense, Aspect and Modality

Markers for Correcting Machine Translation Errors. In Proceedings of the

Conference on Recent Advances in Natural Language Processing (RANLP).

Borovets, Bulgaria.

Sinha, Mahesh K. 2009. A Journey from Indian Scripts Processing to Indian

Language Processing. IEEE the Annals of the History of Computing:8-31. IEEE

Computer Society.

Sampson, G. 1992. Probabilistic Parsing. In Svartvik, 1992b, pages 105–122.

Sampson, G. 2000. Thoughts on Two Decades of Drawing Trees. In Abeillé, 2000,

pages 23–41.

Taylor, A., Marcus, M., and Santorini, B. 2000. The Penn Treebank: An Overview.

In Abeillé, 2000, pages 5–22.

Telljohann, H., E. Hinrichs, S. Kübler and H. Zinsmeister. 2005. Stylebook of the

Tübinger Treebank of Written German (TüBa-D/Z).

Teubert, Wolfgang. 2001. Corpus Linguistics and Lexicography. International

Journal of Corpus Linguistics Vol. 6:125-153.

Teubert, Wolfgang. 2005. My Version of Corpus Linguistics. International Journal

of Corpus Linguistics 10.1: 1-13.

Thielen, C. and A. Schiller. 1996. Ein kleines und erweitertes Tagset fürs Deutsche.

Technical report, University of Tübingen. In Feldweg, Lexicographica. Tübingen:

Niemeyer. 193-203.

Tsai, J. L. 2005. A Study of Applying BTM Model on the Chinese Chunk Bracketing.

In LINC-2005, IJCNLP-2005, pp. 21-30.

Uria, L., A. Estarrona, I. Aldezabal, M. Aranzabe, A. Díaz de Ilarraza, and M.

Iruskieta. 2009. Evaluation of the syntactic annotation in EPEC, the reference corpus

for the processing of Basque. Computational Linguistics and Intelligent Text

Processing:72–85.

Uszkoreit, H. 1986. Constraints on Order. Linguistics 24.

Vaidya, A., S. Husain, P. Mannem, and D. Sharma. 2009. A Karaka Based

Annotation Scheme for English. Computational Linguistics and Intelligent Text

Processing: 41–52.

Vempaty, Chaitanya, Naidu, Viswanatha, Husain, Samar, Kiran, Ravi, Bai, Lakshmi,

Sharma, Dipti M., and Sangal, Rajeev. 2010. Issues in Analyzing Telugu Sentences

Towards Building a Telugu Treebank. In Proceedings of CICLING.

Klimes, Vaclav. 2006. Analytical and Tectogrammatical Analysis of a Natural

Language. Ph.D. Thesis. Charles University, Prague.

Van Deemter, K. and R. Kibble, 2001. On Coreferring: Coreference in MUC and

related Annotation schemes. Journal of Computational Linguistics 26(4): 629-637.

Van Der Beek, L., G. Bouma, R. Malouf, and G. Van Noord. 2002. The Alpino

Dependency Treebank. Language and Computers 45(1):8-22.

Van Valin, R. D. 1999. Generalized Semantic Roles and the Syntax-semantics

Interface. In Corblin, F., Dobrovie-Sorin, C., and Marandin, J. M., Editors, Empirical

Issues in Formal Syntax and Semantics 2, pages 373-389. Thesus, The Hague.

Van Valin, R. D. 2001. An Introduction to Syntax. Cambridge University Press,

Cambridge.

Vempaty, Chaitanya, Viswanatha Naidu, Samar Husain, Ravi Kiran, Lakshmi Bai,

Dipti M. Sharma & Rajeev Sangal. 2010. Issues in Analyzing Telugu Sentences

Towards Building a Telugu Treebank. MS. Language Technologies Research Centre,

IIIT-Hyderabad, India. Pages 50-59.

Volodina, Elena. 2008. From Corpus to Language Classroom: Reusing Stockholm

Umeå Corpus in a Vocabulary Exercise Generator SCORVEX. Master's Thesis.

University of Gothenburg.

Wenger, Neven. 2009. The Syntax of Finiteness. Frankfurt a. M.

Woolford, Ellen. 1997. Four Way Accusative Case Systems: Ergative, Nominative,

Objective and Accusative. Natural Language & Linguistic Theory 15:181-227.

Kluwer Academic Publishers: Netherlands.

Xia, F., O. Rambow, R. Bhatt, M. Palmer and D. Sharma. 2009. Towards a Multi-

Representational Treebank. In Proceedings of the 7th Int’l Workshop on Treebanks

and Linguistic Theories (TLT-7).

Xia, F., M. Palmer, N. Xue, M. E. Okurowski, J. Kovarik, F.-D. Chiou, S. Huang,

T. Kroch, and M. Marcus. 2000. Developing Guidelines and Ensuring Consistency

for Chinese Text Annotation. In Proceedings of LREC. Greece.

Xia, F. 2001. Automatic Grammar Generation from Two Different Perspectives. PhD

Thesis, University of Pennsylvania.

Xia, F. and M. Palmer. 2001. Converting Dependency Structures to Phrase

Structures. In Proceedings of the Human Language Technology Conference (HLT-

2001), San Diego CA.

Xue, N., F. Chiou and M. Palmer. 2002. Building a Large-Scale Annotated Chinese

Corpus. In Proceedings of COLING. Taipei, Taiwan.

Xue, N., F. Xia, F.-D. Chiou and M. Palmer, 2005. The Penn Chinese TreeBank:

Phrase Structure Annotation of a Large Corpus. Natural Language Engineering

11(2): 207.

Yong, C. and S.K. Foo. 1999. A Case Study on Inter-annotator Agreement for Word

Sense Disambiguation. Ms.

Zeldes, Amir & Anke Lüdeling (eds.). 2011. Proceedings of Quantitative

Investigations in Theoretical Linguistics 4. Humboldt-Universität zu Berlin.

Zwicky, A. M. 1985. Heads. Journal of Linguistics 21: 1-29.
