18
AUTOMATIC SEGMENTATION AND PART-OF-SPEECH TAGGING FOR TIBETAN: A FIRST STEP TOWARDS MACHINE TRANSLATION PAUL G. HACKETT Over the past few decades, great advances have been made in the field of Natural Language Processing (NLP). Much of this research has focused on dominant world languages such as English, French, German, Spanish. In recent years, Asian languages such as Japanese and Chinese have been added to that list, spurred in part by the increased ease with which text in electronic form can be generated, archived and shared, and by the important role those languages play in international commerce. There has been limited research however, for less economically important languages such as Tibetan. This paper presents what we believe to be the first reported work on Tibetan machine translation (MT). In the construction of a generic MT system, there are three conceptually distinct components: analysis, transfer, and generation. The analysis phase is constituted by part of speech (POS) tagging and parsing as the process of taking a sequence of surface forms and converting them into a meaning-preserving internal representation. In the Tibetan context, there is the prerequisite stage of word- segmentation and sentence boundary detection. We feel that this first phase has been successfully completed. The subsequent sections of this paper present the theoretical background to the research, an overview of the algorithm and its implementation, issues in evaluation, and a discussion of its current application. Background This section of the paper presents a brief overview of the theoretical underpinnings of this research. There are three conceptually distinct components of a simple MT system: analysis of the source data, transfer of the data to the target

AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION ANDPART-OF-SPEECH TAGGING FOR TIBETAN:

A FIRST STEP TOWARDS MACHINE TRANSLATION

PAUL G. HACKETT

Over the past few decades, great advances have been made in thefield of Natural Language Processing (NLP). Much of this researchhas focused on dominant world languages such as English, French,German, Spanish. In recent years, Asian languages such as Japaneseand Chinese have been added to that list, spurred in part by theincreased ease with which text in electronic form can be generated,archived and shared, and by the important role those languages playin international commerce. There has been limited researchhowever, for less economically important languages such asTibetan. This paper presents what we believe to be the first reportedwork on Tibetan machine translation (MT).

In the construction of a generic MT system, there are threeconceptually distinct components: analysis, transfer, and generation.The analysis phase is constituted by part of speech (POS) taggingand parsing as the process of taking a sequence of surface forms andconverting them into a meaning-preserving internal representation.In the Tibetan context, there is the prerequisite stage of word-segmentation and sentence boundary detection. We feel that thisfirst phase has been successfully completed. The subsequentsections of this paper present the theoretical background to theresearch, an overview of the algorithm and its implementation,issues in evaluation, and a discussion of its current application.

Background

This section of the paper presents a brief overview of thetheoretical underpinnings of this research.

There are three conceptually distinct components of a simple MTsystem: analysis of the source data, transfer of the data to the target

Page 2: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 2

mapping, and generation of the target data in appropriate surfaceforms and configuration. The analysis phase is constituted by part ofspeech (POS) tagging and parsing as the process of taking a surfaceform and converting it into a meaning-preserving internalrepresentation. In a theoretical context, analysis can be seen as a setof conceptually distinct processes:

Segmentation: the separation of a possibly insufficientlydifferentiated stream of linguistic datainto meaningfully quantized groups;

Tagging: the assignment of one or more part-of-speech tags to each word;

Parsing: the identification of phrase markers andassignment of a coherent sentencestructure to a given set and sequence oftags;

In practical application however, morphological analysis andsegmentation are aspects of a single process, and hence somesystems blur these distinctions for the sake of efficiency or anadvantage in accuracy.

Segmentation

Segmentation can be performed for a variety of reasons: into wordsfor indexing, into sentences for text summarization, etc. Wuproposed a single general principle for segmentation in his work onChinese which he dubbed the Monotonicity Principle forSegmentation. Stated simply, “[a] valid basic segmentation unit(segment or token) is a substring that no processing stage after thesegmenter needs to decompose.”1 The implication of such aprinciple is that segmentation should be conservative, notprematurely committing to long segments. Wu posited thedefinition of a valid segment as a substring that no applicationwould ever need to decompose for any reason, whether structural orstatistical. While this principle set an upper limit for general-use

1 Wu (1998).

Page 3: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 3

segmentation length, Wu also presented a lower length limitprinciple for application-specific uses: that a segmenter should alsofind the maximum length segment that does not impair accuracy forthat application.

Although a number of researchers2 have advocated a simple"greedy segmentation" algorithm –– longest substring matching to adictionary, Wu and Fung3 identified a danger of greedysegmentation (leading to premature commitment) as being a methodopen to “crossing-segments” errors arising from ambiguousboundaries. As a means of avoiding this danger, Wu likewiseadvocated performing segmentation in tandem with applicationspecific processes such as name-entity labeling, POS tagging,translation, etc., rather than as a pre-processing stage. In a similarvein, Maosong et al., reported an increase in accuracy insegmentation when segmentation and part-of-speech (POS) taggingwere integrated.4

POS Tagging and Parsing

A review of tagging research is provided in Hackett.5Concerning Tibetan however, in brief, Wilson’s presentation ofTibetan grammar is centered around a detailed syntacticclassification of verbs. This classification is based on the principlethat the syntax of a clause or sentence is determined by the verb thatterminates it. In general, Wilson’s verb categories are defined bothsyntactically and semantically as follows:

Class ISyntactic Dimension: Nominative Subjects and Nominative

ComplementsSemantic Dimension: Existential and Linking

2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997).3 Wu and Fung (1994).4 Maosong et al. (1997).5 Hackett (2000), p15ff .

Page 4: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 4

Class IISyntactic Dimension: Nominative Subjects and Locative

QualifiersSemantic Dimension: Existence qualified referentially, or by

location, time, or dispositionClass III

Syntactic Dimension: Nominative Subjects and ObjectiveQualifiers

Semantic Dimension: Reflexive activities qualified bylocation or destination, and rhetorical statements with aqualified existential identity

Class IVSyntactic Dimension: Nominative Subjects and (Non-la

class) Syntactic QualifierSemantic Dimension: Existential verbs indicative of a state

of separation, absence, conjunction, or disjunctionClass V

Syntactic Dimension: Agentive Subjects (i.e., agents &instruments) and Nominative [Direct] Objects

Semantic Dimension: Actions performed by an agent orinstrument on something other than itself

Class VISyntactic Dimension: Agentive Subjects (i.e., agents &

instruments) and Objective [Direct] ObjectsSemantic Dimension: Actions performed by an agent or

instrument on something other than itselfClass VII

Syntactic Dimension: Dative Subjects and NominativeObjects

Semantic Dimension: Indicating need, purpose or potentialbenefit

Class VIIISyntactic Dimension: Locative Subjects and Nominative

ObjectsSemantic Dimension: Conveying possession, or attribution

These classes provide the basic, first-order categorization scheme forTibetan verbs. Subcategorization, which has been shown to

Page 5: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 5

improve parsing accuracy when combined with lexicalization,6 isderivable from a consideration of both the semantic scope of theverb class (as is the case with Class IV verbs7) and, moresignificantly, from statistical patterns of usage, though appears toremain invariant across corpora domains.8 While these patterns ofusage vary not only from verb to verb within each class, with senseand, in the case of Class V, VI, and some VIII verbs, voice.

In English, a number of properties differentiate inchoative (or"non-causative") uses of a verb from causative uses.9 In Tibetan,this alternation of voice between the causative and inchoative isrendered through either the omission of an explicit agent, or theinclusion of a non-sentient agent (an instrument) sometimes with theagent / object sequence reversed for emphasis. Numerous examplesillustrate this alternation.

For example, in its causative alternation the Class V (Agentive-Nominative) verb as “join” or “hold” occurs in the standardagent-object construction: “He is holding the book,”

“He is holding in his mind (i.e., hasmemorized) the teachings,” or “[He] does nothave a good grasp on the measure of the object of negation.” In aninchoative formation however, a sentence built around the Class Vverb can place emphasis on the subject of an action through therepositioning of words in an object-instrument construction coupledwith the lack of a sentient agent. For example:

“renunciation is joined by (i.e., with) the generation of the[altruistic] mind.” Some more examples of Class V Agentive-Nominative verbs in both constructions are:

1. [He] added the troop to hisretinue2. External matter is notsubsumed by (i.e., includedwithin) the continuum of a being

6 Carroll et.al. (1998).7 Wilson (2000) pp.519ff .8 Roland et.al. (2000).9 Radford (1997) p.341ff .

Page 6: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 6

1. The regent ruled Tibet for 20years2. [I] am sustained by faith1. [They] completely cleared thesite2. The friends were separated byanger

As can be seen, there is also a subtle shift in the sense of the verbwith the alternation — a feature observed in other languages as well.

Taking this verb classification scheme, and combining it withWilson's syllable classification10 yields a complete grammaticalformulation suitable for automatic processing. Moreover, since theWilson formulation of Tibetan grammar incorporates Tibetan–English transfer rules into its parsing strategy, successfulidentification of a verb's syntactic class yields sufficient sub-categorization information to perform subject–object identificationand shallow prepositional phrase attachment.

The Algorithm and Its Implementation

Although simple segmentation can be achieved through longestsubstring matching to a dictionary, as noted above, algorithmswhich combine segmentation with POS tagging have been shown toachieve higher accuracy. Statistical methods for POS taggingrequire a pre-segmented, pre-tagged training corpus, but a rule-based POS tagger can be constructed manually without a trainingcorpus. Since no pre-tagged training corpus exists for Tibetan, arule-based tagger for Tibetan was manually constructed for thispurpose.

Resources

Three types of evidence were employed in the construction of thetagger/parser: lexical knowledge of the words in a language, an

10 A detailed description of this formulation of Tibetan grammar is given inWilson (1992).

Page 7: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 7

explicit knowledge representation that reflects what is known aboutthe ways in which those words can be combined, and statisticalevidence of usage patterns gathered from large text corpora. Weassembled this lexical knowledge in two forms: a verb lexicon and anoun/adjective dictionary.

Of the two lexicons created for use by the segmentationalgorithm, the first was a verb lexicon, necessary for identifyingverb-terminated clauses and sentence boundaries. The primarysource for the verb lexicon was an augmented version of Wilson'sverb classification tables. This was supplemented with other verbtables and glossaries.11 The resulting verb lexicon used by thesegmenter contained over 4,800 distinct verbs. Syntactic classinformation was obtained by bootstrapping off of correlations withthe traditional verb categories of , , and :

Wilson Verb Class Traditional categoryI Nominative-nominative (linking) VerbsII Nominative-locative Verbs

2.1 simple verbs of existence2.2 verbs of living2.3 verbs of dependence2.4 verbs expressing attitudes

III Nominative-objective Verbs3.1 verbs of motion3.2 nominative action verbs3.3 rhetorical verbs

IV Nominative-syntactic Verbs12

4.1 separative verbs4.2 verbs of absence

11 ACIP (1994); Das (1915); Kharto (n.d.).12 Here “syntactic” is being used in a specific sense by Wilson to connote “a

syntactic particle that is not a case marking particle”, rather than in its generalsense referring to “units of one or two syllables that have only a grammaticalfunction.” This technical sense referred to here, refers to eight different types ofparticles: introductory, terminating, rhetorical, punctuational,conjunctive/disjunctive, verb modifying, continuational, and adverbial. SeeWilson, pp.663-64.

Page 8: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 8

4.3 conjunctive verbs4.4 disjunctive verbs

V Agentive-nominative VerbsVI Agentive-objective VerbsVII Purposive-nominative Verbs of NecessityVIII Locative-nominative Verbs

8.1 verbs of possession8.2 attributive usage

Because of divergent criteria used in dictionaries for thisassessment,13 this information was then validated against statisticalsampling over a text corpus on a verb-by-verb basis.

The second resource needed was a noun/adjective dictionary,necessary for maximum-length substring matching and suffixstripping. This dictionary was formed by initially combining twoelectronic dictionaries14 and three dissertation glossaries.15

Duplicate entries were collapsed, erroneous entries discarded, and acomparison program was used to compare the resultant dictionaryagainst the verb lexicon and remove collocations containing verbs.The final dictionary thus contained roughly 50,000 entriescomprised of only nouns, adjectives, and noun-adjectivecollocations.

Algorithm

The basic algorithm for the Tibetan POS tagger and segmenter isstraightforward and exploits the two features of the Tibetanlanguage which make it amenable to analysis: the existence ofphrase delimiters and verb termination of sentences and clauses.The process of word-segmentation for Tibetan can be divided intofive successive steps:

13 Some dictionaries and grammars present the tha dad / tha mi dad distinctionin terms of morphology, while others use requisite syntax. Syntax is the criterionadopted here.

14 Hopkins et al. (forth.); Kunsang (1996).15 Dreyfus (1991); Levinson (1994); Zahler (1994).

Page 9: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 9

• Sentence boundary detection• Verb and verb-clause identification• Tagging of case-marking particles• POS tag sequence disambiguation• Longest substring matching for undifferentiated

substrings

We embedded knowledge of some well-understood characteristicsof Tibetan into our segmentation algorithm. Since Tibetan is averb-final language, only the verb lexicon is used to recognizewords at the end of a sentence. Sentence boundaries are not markedas such in written Tibetan, but they can be detected fairly reliably bythe presence of a verb or some other standard marker (a rhetoricalcontinuative or an ornamental terminating particle) that immediatelyprecedes a marked phrase boundary. We used this as the startingpoint for shallow parsing of the input text, exploiting syntacticconstraints to guide the segmentation process.

We used maximum-length substring matching against thenoun/adjective dictionary as a preference criterion for regions inwhich shallow parsing failed to produce a unique segmentation.These ambiguous cases could result from overgeneration (regionsfor which our parser generated multiple analyses) or fromundergeneration (regions for which our parser failed to discoverany syntactic constraints). We used a simple greedy left-to-rightsearch for these regions, removing the longest substring thatappeared in our noun/adjective dictionary. When no substring wasdiscovered, we segmented the leftmost syllable as a word of lengthone and continued.

The parser hinges on sentence boundary detection and verbidentification, as shown in the following pseudo-code:

for each set of phrases with valid sentence termination

for each verb

mark the verb, auxiliaries and Sanskritic adverbs as a

verb phrase

for each remaining substring of unmarked syllables

mark all known case-marking syllables

for each remaining substring of unmarked syllables

perform maximum substring matching against dictionary

Page 10: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 10

The final step has the effect of integrating the greedy longestsubstring matching into the parser itself.

The situation for verb phrases is somewhat more complex. ManyTibetan nouns can be declined into adverbs, a characteristic thatTibetan shares with English. Such adverbial nouns are separatedfrom the verbs they qualify by our parser, so they are segmented asseparate words. There is, however, a second class of Tibetanadverbs that correspond to Sanskrit verbal prefixes: for vi-,

for abhi-, etc. In those instances, adverbs and verbs arebracketed together by our parser and thus segmented as a singleword, reflecting their Sanskritic origin. The last phrase in theexample presented below is an instance of this type of verb phrase.

Complete details of the segmentation and tagging process areprovided in Hackett.16 In this paper we provide a brief example toillustrate the parser’s operation on two Tibetan phrases:

[translated roughly as:"By the hub being broken, similarly the spokes;you should completely and thoroughly guard that."]

The parser reads in the first phrase. Since there are no sentenceterminating particles, and since neither nor are verbs,the second phrase is also read into the working buffer. Since theornamental sentence terminating particle is found embedded ontothe auxiliary verb the two phrases are successfully matched as asentence, and the entire last syllable is marked as the closing portionof a verb phrase. Pattern matching is applied to the string ofsyllables preceding this, identifying the main verb "to guard;protect." That verb is followed by a syntactic particle that connectsit to the auxiliary verb and preceded by two Sanskritic adverbs, and which are each followed by their respective case-markingparticles and . At this point, the entire verb phrase is marked asa unit. The remainder of the sentence is then searched for additionalverb clauses. The verb "to break" is found and marked as a verbphrase with no auxiliaries or adverbs, although the verb phrase does

16 op.cit., p.20ff .

Page 11: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 11

contain an additional syllable rendering it as the gerundive withan embedded instrumental case-marking particle . The entireprepositional phrase is then grouped together and marked.

All remaining lexical, syntactic and case marking particles arethen found and marked: the plural lexical particle , the frozenadverbial phrase consisting of the single syllable adverb "similar", the pronoun , and the single syllableagentive/instrumental case-marking particle . Maximum lengthstring matching is then performed for the remaining untaggedsyllable sequences: , , and . The words "hub, center",

"wheel spoke", and "you" are found and marked as multi-syllable noun phrases and simple nouns.

The verb lexicon yields the syntactic categorization informationfor the verb (Class V) and the parser successfully identifies boththe agent and the nominative object of the verb, with the remainderparsed as a prepositional phrase. The result of the analysis, is thesentence (S) parse tree:

S /\ / \ / \ / \ NP VP / \ \ / \ \ / \ \ PP \ \ / \ \ \ / \ \ \ / \ \ \ / \ \ \ INS NP \ \ /\ / \ NP \ / \ NP ADV /\ \ NP VP /\ \ / \ \ / \ NP PLP \ / \ \/ \ / \ \ / \ \

where, NP indicates a noun phrase, VP a verb phrase, PP aprepositional phrase, INS an instrumental clause, ADV an adverb,and PLP a plural lexical particle.

Page 12: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 12

Evaluation and Implications

The word-segmenter component was evaluated at the task ofdocument indexing for Information Retrieval (IR). Althoughsegmentation is application specific, preliminary analysis indicatedslightly better (though statistically comparable) performance to n-gram based approaches at a known-item IR task, with error analysisplacing segmentation accuracy at 99%.17 The accuracy of the POStagger is also estimated at 99% based on IR error analysis andrandom sampling.

The two dominant sources of error for both POS tagging andsegmentation were identified during analysis as unknown words(i.e., not contained in the lexicon) and ambiguous boundaries. Inthe case of unknown words, nearly all instances observed concernedeither transliterated Sanskrit mantras or proper names. While someresearchers have attempted to compensate for this problem in otherlanguages through the use of a proper name database, it is unclearwhether or not this approach would be satisfactory for Tibetangiven that proper names have considerable overlap with content-bearing words and phrases. With regard to Sanskrit mantras,however, the problem appears worse. Given the plethora ofdifferent mantras, variations in transliteration, and variability inlength and composition, it remains unclear if even an explicit list ofmantras would suffice. Although an alternate approach consistingof the algorithmic identification of morphologically invalid Tibetan(i.e., Sanskritic) syllables would identify some syllables, anyapproach short of full lexicalization could not sufficiently identifyentire phrases given that some Sanskritic syllables are also validTibetan ones. For example, even the simple mantra

contains the valid syllables , , , and . Without themeaning-based determination that, for instance, is not functioningas a rhetorical continuative, it would be very difficult toautomatically parse a sentence containing this phrase correctly, letalone correctly index its constituent words.

The issue of ambiguous boundaries reflects the slightly differentdifficulty of establishing an unbiased baseline assessment of Tibetansentences against which the output of any automatic parser could be

17 See Hackett op.cit. for details.

Page 13: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 13

judged. As an example to illustrate this point, we may consider thefollowing two lines drawn from ⁄›ntideva's Bodhicary›vat›ra:18

The point of ambiguity in the valid parse of this sentence is whetherthe word break in the last line is between and ,or between and . It is interesting to note thatbetween English translators of this stanza there is a divergence ofopinion, however the Tibetan and Mongolian commentators whohave written annotations (mchan) to the text, notably Thog-med-dpal-bzang-po, Agwangdampa, Kun-bzang-chos-grags, and Gzhan-phan-chos-kyi-snang-ba uniformly take the former interpretation.Similarly, the Sanskrit of this verse accords with the former readingas well, and could be taken as a basis for judgment. This exampledemonstrates however, the difficulty of finding standards for theresolution of such grammatical ambiguities in cases for which thereis no Sanskrit original or divergent opinion on the rendering ofearlier compositions, above and beyond the time commitmentrequired to resolve each ambiguity. Consequently, until such fullyparsed test collections are compiled, evaluations of computationalTibetan utilities must remain approximate at best.

Current Applications

Despite the work which remains to bring the project to completion,the presently completed component has a number of viableapplications. Utilization of the POS / segmentation utility allows for:

• Intelligent full-text indexing and searching• Cross-language search and retrieval• Semantic mapping of Tibetan corpora

through the clustering of index terms in ann-dimensional vocabulary space ("LatentSemantic Indexing")

18 Stanza 9.4cd

Page 14: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 14

• Vocabulary analysis• Intelligent spell-checking

The first of these points was explored in detail in Hackett.19 Thesecond application is a natural extension of monolingual indexing.By employing simple word-substitution from a dictionary, a corpuscan be indexed for search and retrieval in a separate language. Thisprocess has been utilized for numerous parallel text and MTgenerated text corpora, allowing users to issue queries in a familiarlanguage and retrieve documents in a different language.

The third application has implications for revising contemporarylibrary classification schemes for Tibetan materials. Through thegeneration of statistically derived index terms for a sufficiently largecorpus of Tibetan materials, a set of keywords could be compiledwhich was immune from any bias or pre-conception on the part ofhuman indexers, in essence, allowing the texts to "speak forthemselves."

The fourth and fifth of these uses points to the utility's applicationto individual texts rather than to a corpus as a whole. Bysegmenting an individual text, a complete vocabulary list may beobtained with minimal effort. The utility in this is demonstrated notonly for simple pedagogical and translation uses, but also inauthorial profiling and verification as previously attempted byValby.20 Another use is in refining Tibetan spell-checkers. Whenused, there are two types of spell-checkers currently employed —morphological rules, and high frequency occurrence syllable lists.Employing the form of segmentation described here, single syllablesegments (identified as such during segmentation due totypographic error) could be compared in context using longest-substring routines with fuzzy-matching criteria against a lexicon.This would allow for a form of "intelligent" spell-checking andcould be used in cases where a corrupt source text is suspected.

19 op.cit.20 Valby (1983).

Page 15: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 15

Conclusion

We are optimistic that future developments are feasible in termsof both the compilation of necessary resources and theimplementation of the relevant aspects of MT theory beingemployed. Moreover, while the working target language of thisproject is English, the modular design of the project would allow theutility to be re-targeted for any other language given an appropriategeneration module. Although there is no projected timetable forcompletion, we feel that approximate machine translation forTibetan to English is attainable within the next few years and willalleviate much of the tedious groundwork in translation, quicklyproviding researchers with a large corpus of first-pass machinetranslated texts.

Page 16: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 16

References

ACIP (Asian Classics Input Project), tshig mdzod chen mo. (1994)limited distribution. Electronic (Tibetan only) edition ofbod rgya tshig mdzod chen mo. [Great Tibetan-ChineseDictionary] 3 vols. Beijing: Nationalities Publishing House(mi rigs dpe skrun khang), 1984; reprinted in 2 vols., 1993.

Agwangdampa (ngag dbang bstan pa, 1814-1885), byang chubsems dpa'i spyod pa la 'jug pa'i mchan bu tshig gsal melong. Reproduced in Choi. Lubsanjab (ed.), WORKS OFAGWANGDAMBA (ºAG-DBAº-BSTAN-PA) New Delhi:Jayyed Pr. (1980), v.1, fol. 1-149a.

Carroll, John, Guido Minnen, and Ted Briscoe, "CanSubcategorisation Probabilities Help a Statistical Parser?" inProceedings of the 6th ACL/SIGDAT Workshop on VeryLarge Corpora, Montreal (1998).

Chen, Keh-Jiann, and Shing-Huan Liu, “Word Identification forMandarin Chinese Sentences,” Proceedings of COLING-92,pp.101-107.

Das, Sarat Chandra, An Introduction to the Grammar of the TibetanLanguage. Darjeeling (1915); Delhi: Motilal Banarsidass(1972).

Dreyfus, Georges, Ontology, Philosophy of Language, andEpistemology in Buddhist Tradition. Ph.D. dissertation.Dept. Religious Studies. University of Virginia (1991).

Gzhan-phan-chos-kyi-snang-ba, gzhen-dga' (1871-1927), byangchub sems dpa' spyod pq la 'jug pa zhes bya ba'i mchan'grel. Reproduced in The Thirteen Great Treatises of Mkhan-po G¤an phan Chos-kyi Snaº-ba, vol. 2, pp. 1-475, Delhi:Jayeed Press, (1987).

Hackett, Paul G., Approaches to Tibetan Information Retrieval:Segmentation vs. n-grams. Master's Thesis. College ofLibrary and Information Services, University of Maryland,College Park (2000).

Hopkins, Jeffrey, et al., Tibetan-Sanskrit-English Dictionary.Ithaca: Snow Lion Publ. forthcoming.

Kharto, Dorje Wangchuk, Thumi: dGongs gTer (The CompleteTibetan Verb Forms). Delhi: Lakshmi Printing Works. n.d.

Page 17: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 17

Kun-bzang-chos-grags (1862?-ca.1940), byang chub sems dpa'ispyod pa la 'jug pa'i tshig 'grel 'jam dbyang bla ma'i zhallung bdud rtsi'i theg pa, in Collected works gsuº bum ofMkhan-po Kun-bzaº-dpal-ldan. Paro, Bhutan: LamaNgodrub, (1982) v.1 pp.1-741.

Kunsang, Erik Pema, Concise Dharma Dictionary. (electronic; workin progress, 1996).

Kwok, K.L., “TREC-5 English and Chinese Retrieval Experimentsusing PIRCS,” Proceedings of the Fifth Text RetrievalConference (TREC-5) Gaithersburg, Md, (November 1997),pp.133-142.

Levinson, Jules Brooks, The Metaphors of Liberation. Ph.D.dissertation. Dept. Religious Studies. University of Virginia(1994).

Maosong, Sun, Shen Dayang, and Huang Changning, “CSeg & Tag1.0: A Practical Word Segmenter and POS Tagger forChinese Texts,” Fifth Conference on Applied NaturalLanguage Processing (ANLP-97), pp.119-126 (1997).

Nguyen, Van Be Hai, Ross Wilkinson, and Justin Zobel, “Cross-language Retrieval in English and Vietnamese,” Proceedingsof the AAAI-97 Spring Symposium on CLIR. 1997.

Radford, Andrew, Syntactic Theory and the Structure of English.Cambridge: Cambridge University Press (1997).

Roland, Douglas, et al., "Verb Subcategorization FrequencyDifferences between Business-News and Balanced Corpora:The Role of Verb Sense" http://www.colorado.edu/linguistics/jurafsky/ACL-COMPCORP2000.pdf.

⁄›ntideva, Bodhicary›vat›ra. Sanskrit published in Leh, Ladakh:Kendrıya Bauddha Vidy›-Sa˙sth›na (1989). Tibetan inbyang chub sems dpa'i spyod pa la 'jug pa. Toh.3871;P.5272.

Thogs-med-dpal-bzang-po (1295-1369), byang chub sems dpa'ispyod pa la 'jug pa'i 'grel legs par bshad pa'i rgya mtsho.Thimbu: kun bzang thob rgyal (1975); Sarnath: Pleasure ofElegant Sayings Pr. (1975).

Valby, James M., The Life and Ideas of the 8th Century A.D. IndianBuddhist Mystic Vimalamitra: A Computer-AssistedApproach to Tibetan Texts. Ph.d. Thesis. Dept. of FarEastern Studies, University of Saskatchewan (1983).

Page 18: AUTOMATIC SEGMENTATION AND PAUL ACKETTph2046/docs/Hackett/Hackett-IATS9.pdfSemantic Dimension: Existential and Linking 2 Chen and Liu (1992); Kwok (1997), Nguyen et al. (1997). 3 Wu

AUTOMATIC SEGMENTATION 18

Wilson, Joe, Translating Buddhism from Tibetan. Ithaca: Snow LionPubl. (1992, 1998).

Wu, Dekai, “A Position Statement on Chinese Segmentation” paperpresented at Chinese Language Processing Workshop,Univ.Penn. 30-June to 2-July 1998.

Wu, Dekai, and Pascale Fung, “Improving Chinese Tokenizationwith Linguistic Filters on Statistical Lexical Acquisition,”ANLP-94, Fourth Conference on Applied Natural LanguageProcessing, Stuttgart.

Zahler, Leah Judith, The Concentrations and Formless Absorptionsin Mahayana Buddhism. Ph.D. dissertation. Dept. ReligiousStudies. University of Virginia (1994).