Arabic Content Processing with Rosette and Microsoft/FAST ESPbasistech.com/whitepapers/Rosette-ESP-integration-EN.pdf · Arabic Content Processing with Rosee and Microso/FAST ESP

January 19, 2010

Arabic Content Processing with Rosetteand Microsoft/FAST ESPby Chrisan Moen

We put the World in the World Wide Web®

ABOUT BASIS TECHNOLOGY

Basis Technology provides soware soluons for text analycs, informaon retrieval,digital forensics, and identy resoluon in over forty languages. Our Rosee® linguiscsplaorm is a widely used suite of interoperable components that power search, businessintelligence, e-discovery, social media monitoring, financial compliance, and otherenterprise applicaons. Our linguiscs team is at the forefront of applied naturallanguage processing using a combinaon of stascal modeling, expert rules, andcorpus-derived data. Our forensics team pioneers beer, faster, and cheaper techniquesto extract forensic evidence, keeping government and law enforcement ahead ofexponenal growth of data storage volumes.

Soware vendors, content providers, financial instuons, and government agenciesworldwide rely on Basis Technology’s soluons for Unicode compliance, languageidenficaon, mullingual search, enty extracon, name indexing, and nametranslaon. Our products and services are used by over 250 major firms, including Cisco,EMC, Exalead/Dassault Systems, Hewle-Packard, Microso, Oracle, and Symantec.Our text analysis products are widely used in the U.S. defense and intelligence industryby such firms as CACI, Lockheed Marn, Northrop Grumman, SAIC, and SRI. We arethe top provider of mullingual technology to web and e-commerce search engines,including Amazon.com, Bing, Google, and Yahoo!.

Company headquarters are in Cambridge, Massachuses, with branch officesin San Francisco, Washington, London, and Tokyo. For more informaon, visitwww.basistech.com.

© 2013 Basis Technology Corporaon. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosee”, and “We put the World in theWorld Wide Web” are registered trademarks of Basis Technology Corporaon. All other trademarks, service marks, and logos used in this document are theproperty of their respecve owners. (2013-02-20)

http://www.basistech.com

Arabic Content Processing with Rosee and Microso/FAST ESP 3

1. INTRODUCTION

The Microso/FAST Enterprise Search Plaorm (ESP) has limited support for processing Arabiccontent. Features such as lemmazaon, named enty extracon, word and name translitaon,etc. are not supported by ESP out-of-the-box. Basis Technology Corporaon’s Rosee linguiscsplaorm provides these features and more.

This document describes a work-in-progress on how Rosee has been integrated with ESP to bringadvanced Arabic search capabilies to ESP.

The remainder of this document is structured as follows: Secon 2 gives an overview of contentprocessing and linguiscs in ESP; secon 3 provides a descripon of the integraon work done;secon 4 presents some screenshots of Arabic search in the integrated soluon; and secon 5suggests further work.

2. CONTENT PROCESSING AND LINGUISTICS IN ESP

Figure 1 is a simplified architecture diagram of ESP that captures some of its major architecturalcomponents. Content is fed into ESP using a pre-built connector, such a FileTraverser (feed fromfilesystem), a JDBC Connector (feed from a database), etc. or through ESP’s Content API.

When content items have been fed into ESP, they are processed prior to indexing. Typical contentprocessing includes format and encoding detecon, encoding normalizaon, format parsing/textextracon (from formats such as HTML, PDF and MS Office), but also linguiscs funconality suchas language idenficaon, sentence and word segmentaon, various normalizaons, part-of-speech tagging, lemmazaon and named enty extracon.

The actual processing being done is defined by a document processing pipeline. A documentprocessing pipeline is a sequence of processing steps applied to the documents fed. Eachprocessing step is done by a document processor which mutates or annotates the document andis typically responsible for a well-defined funconal area.

For example, a language idenficaon document processor typically reads the document in textualform and annotates the document with an aribute named language with an apropriate value, i.e.“ar” for Arabic. 1

The last processing stage in pipelines is typically a output stage that passes the processed andannotated document to the indexer for indexing. Documents are searchable when indexing hasbeen completed.

1 ISO 639 two leer language codes are used.

Arabic Content Processing with Rosee and Microso/FAST ESP4

Queries are sent to the core search engine and undergoes a similar pipelined processing on thequery side by means of a query processing pipeline consisng of query processors. Search resultsare processed in a similar fashion through a result processing pipeline of result processors.

3. ROSETTE INTEGRATION OVERVIEW

Document, query, and result processors are plugins in ESP. Custom plugins can be wrien andintegrated with those already part of ESP and they can be mixed and matched in pipelines. Thisway ESP’s can be extended with new funconality through these plugins.

Document processors are wrien in Python and are loaded by the procserver module in ESP whichdoes the document processing. The query and result processors are wrien in C++ and are loadedby the qrserver module in ESP, which does the query and result processing.

Our high-level strategy for integrang Rosee with ESP is to make make Rosee available to theprocserver run-me and build a prototype document processor that processes content using byinvoking Rosee. The below secons describe this in further detail.

3.1. Python Bindings to RoseeRosee provides bindings to programming languages Java and C/C++. The procserver runme isbased on Python 2.3 and ESP ships with a Python interpreter called Cobra. 2

Python bindings to Rosee’s C++ API has been developed and Rosee funconality is now callablefrom the cobra interpreter and the procserver.

It’s now possible to do to things like

% $FASTSEARCH/bin/cobraCobra 5.3.0.1 (2009-10-16 15:28:19)For FAST ESP 5.3, Build Id build_09101600Copyright (c) 2004-2007 Fast Search & Transfer ASA>>> import rlp

on the command line and start programming Python with Rosee.

The Arabic ar-rlp_sample_alternaves_c.c sample applicaons that ships with Rosee has beenrewrien using Roseeʼs Python bindings to demonstrate how some of the funconality can beused. See Appendix A for sample source code.

The Cobra interpreter dynamically loads the Rosee C++ libraries at import me and so does theprocserver when Rosee is imported in a document processor.

The Rosee Python bindings have been implemented on Linux, but porng the bindings toWindows is expected to be fairly straighorward.

3.2. RoseeLanguageProcessor document processorA sample document processor has been developed with the primary goal of demonstrangRosee and some of its core funconality working as an integrated part of ESP rather than beingcomplete in terms of which linguiscs capabilies it supports.

2 $FASTSEARCH/bin/cobra


The document processor uses the Python bindings to do a ght integraon with Rosee.Microso/FAST makes both their own and 3rd party C/C++ libraries available to documentprocessors through similar bindings.

The sample processor is called RoseeLanguageProcessor and applies Rosee processing ina given language and configuraon to a set of document aributes, which are named parts ofdocument, i.e. tle and body.

RoseeLanguageProcessor tokenizes the text in these document aributes and also finds anyalternave or lemmased forms if they exist. For example, if aribute body is used for processing,body will become a space delimited string of tokens aer processing.

Any variant of the same text with lemmas expanded is output to the lembody aribute and alllemma forms are appended to the form found in the input. For example, if body was “اقا” only,we would get lembody as “اقاق اِ”. The first term was found in the body and all alternave formsare appended.

RoseeLanguageProcessor also extracts named enes from input fields, i.e. body. The extractednamed enes are as follows: 3

• Facility: A man-made structure or architectural enty such as a building, stadium,monument, airport, bridge, factory, or museum.

• Locaon: Name of a geographically defined place such as a connent, body of water,mountain, park, or full address. It also refers to a region that either spans GPE boundaries(such as Middle East, Northeast, West Coast) or is contained within a larger GPE (such asSunni Triangle, Chinatown).

• GPE: An geo-polical enty comprised of three elements: a populaon, a geographiclocaon and a government. A country, state, city, or other locaon that contains both apopulaon and a centralized government.

• Naonality: Reference to a country or region of origin, such as American or Swiss.

• Organizaon: A corporaon, instuon, government agency, or other group of peopledefined by an established organizaonal structure.

• Person: A human idenfied by name, nickname or alias.

• Title: Appellaon associated with a person by virtue of occupaon, office, birth, or as anhonorific.

All named enes extracted are semicolon-separated and are assigned to document aributesin the document being processed that corresponds to their respecve enity names. This way, theenty informaon can be used as a mul-value searchable field or as facets/navigators in searchresults.

See Appendix B for further details on the RoseeLanguageProcessor.

3 Named enty descripons are from the Rosee Applicaon Developer’s Guide.


3.3. ConfiguraonThe RoseeLanguageProcessor reads its configuraon at startup me and follows the same overallconfiguraon approach as other by ESP document processors.

RLP configuraon such as root directory (BT_ROOT), environment and context can be passed on asconfiguraon parameters. The language of interest and a space separated list of fields to processcan also be specified.

Please see RoseeLanguageProcessor’s configuraon descripon below:

<?xml version="1.0"?><!DOCTYPE processors SYSTEM "ProcessorServer.dtd"><processors> <processor> <load module="processors.linguistics.RosetteLanguageProcessor" class="RosetteLanguageProcessor"/> <desc> <brief>Sample processor using the Rosette Language Platform (RLP)</brief> <p>Copy the Input attribute to Output.</p> </desc> <config> <param name="language" value="ar" type="str"/> <param name="attributes" value="body" type="str"/> <param name="root_dir" value="/home/cm/rlp-glue" type="str"/> <param name="environment_file" value="/home/cm/rlp/rlp/etc/rlp-global.xml" type="str"/> <param name="context_file" value="/home/cm/rlp/pyrlp/rlp-test-context.xml" type="str"/> </config> <ops><add/><partialupdate/></ops> </processor></processors>

Note that even though this document primarily discusses Rosee integraon in the context ofArabic, the RoseeLanguageProcessor also supports other languages - simply specify the languageof interest as the language parameter and it will process content accordingly.

3.4. Document processing pipelineThe RoseeLanguageProcessor has been added to a document processing pipeline forprocessing small XML documents. The pipelines is based on FastXML and the pipeline withRoseeLanguageProcess included is called FastXMLRLP.

3.5. Index profileIn ESP, an index-profile defines the search schema, including fields and their search capabilies,navigators, ranking, etc. (Note that fields and document aributes are somemes usedinterchangably in ESP terminology.)

The ESP installaon with Rosee integrated uses a standard index-profile, but fields andcorresponding navigators have been added for the named enes. For example, there’s asearchable field locaons with a corrresponding navigator named locaonsnavigator, and thereare field and navigator pairs defined for all the named enes described above.


4. SEARCH INTERFACE SCREENHOTS

Figure 2 shows ESP’s default search interface for the query اقا (Iraq). The extracted namedenes can be seen as navigators/facets on the le side.

Figure 2: ESP search interface

Figure 3 shows tokenized Arabic text in field generic1 and the same text with all lemmas expandedin field generic2. Both fields are searched by default. A search for اقِ, which is a lemmazed formof اقا (Iraq), will match documents which only contains the former, i.e. in aribute body.

Figure 3: Tokenized Arabic text with and without lemmas


5. ADDITIONAL WORK TO BE CONSIDERED

To further integrate Rosee with ESP, the following work can be considered:

• Split RoseeLanguageProcessor into several document processors, i.e. have one processordo tokenizaon and lemmasaon, and another do named enty extracon

• Build addional document processors that expose addional Rosee linguiscs features, i.e.for language idenficaon, transcripon, name normalizaon, name translaon, etc.

• Integrate Rosee into a query processor that expands into an OR query with lemmazedforms

• Integrate lemmas with dynamic teaser generaon

• Port/compile the Rosee Python bindings to Windows


APPENDIX A — PYTHON VERSION OF AR-RLP_SAMPLE_ALTERNATIVES_C.C#!/home/cm/esp-rlp/bin/cobra#### Python Arabic RLP test#### Christian Moen ([email protected])##import rlpimport sysimport osfrom optparse import OptionParserenvironment = Nonecontext = None#### Initialize RLP and set global variables environment and context##def initialize(root_dir, environment_file, context_file): rlp.BT_RLP_Environment_SetBTRootDirectory(root_dir) global environment environment = rlp.BT_RLP_Environment_Create(); if environment == None: error("could not create empty environment") status = environment.InitializeFromFile(environment_file) if status != rlp.BT_OK: error("could not initialize environment from %s. status code %d" % (environment_file, status)) global context context = rlp.BT_RLP_Environment_GetContextFromFile2(environment, context_file) if context == None: error("could not create context from %s" % context_file)#### Process input file in specified language##def process_file(file, lang): lang_code = rlp.BT_LanguageIDFromISO639(lang) if lang_code == rlp.BT_LANGUAGE_UNKNOWN: error("unknown ISO 639 language code %s" % lang) if lang == "ar": context.SetPropertyValue("com.basistech.arbl.roots", "true") context.SetPropertyValue("com.basistech.arbl.lemmas", "true") context.SetPropertyValue("com.basistech.arbl.alternatives", "true") status = context.ProcessFile(file, lang_code) if status != rlp.BT_OK: error("could not process file %s using language %s" % (file, lang)) print "The rlp root directory now is: %s" % rlp.BT_RLP_Environment_RootDirectory() print "Language: %s(%d)" % (lang, lang_code) # print context.GetStringResult(rlp.BT_RLP_DETECTED_ENCODING) # print "Raw text: " % rlp.bt_xutf16toutf8_2(context.GetUTF16StringResult(rlp.BT_RLP_RAW_TEXT, 0)) token_iter_fact = rlp.BT_RLP_TokenIteratorFactory.Create() token_iter = token_iter_fact.CreateIterator(context) while (token_iter.Next()): token = rlp.bt_xutf16toutf8_2(token_iter.GetToken()) index = token_iter.GetIndex() start = token_iter.GetStartOffset() end = token_iter.GetEndOffset() print "#%i s:%d, e:%d, token:%s" % (index, start, end, token) if token_iter.GetNumberOfAnalyses(): print " Alternative Analyses:" count = 0 while token_iter.NextAnalysis(): count += 1 print "\tNormal\t%d :%s" % (count, rlp.bt_xutf16toutf8_2(token_iter.GetNormalForm())) print "\tPOS\t%d :%s" % (count, token_iter.GetPartOfSpeech()) print "\tStem\t%d :%s" % (count, rlp.bt_xutf16toutf8_2(token_iter.GetStemForm())) lemma = rlp.bt_xutf16toutf8_2(token_iter.GetLemmaForm()) if lemma: print "\tLemma\t%d :%s" % (count, lemma) root = rlp.bt_xutf16toutf8_2(token_iter.GetRootForm()) if root: print "\tRoot\t%d :%s" % (count, lemma) print print


print#### Output error message and exit with status code 1##def error(message): print >> sys.stderr, message sys.exit(1)#### Parse options, initialize RLP and process file##def main(): parser = OptionParser(usage="%prog INPUTFILE", version="%prog 1.0") parser.add_option("-l", "--language", action="store", type="string", metavar="LANG", dest="language", help="use LANG language processing on input (default: \"ar\")") parser.add_option("-r", "--root-dir", action="store", type="string", metavar="DIR", dest="root_dir", help="use DIR as root directory (BT_ROOT)") parser.add_option("-e", "--environment-file", action="store", type="string", metavar="FILE", dest="environment_file", help="use FILE as environment file") parser.add_option("-c", "--context-file", action="store", type="string", metavar="FILE", dest="context_file", help="use FILE as context file") (options, args) = parser.parse_args() if not options.root_dir or not options.environment_file or not options.context_file: parser.print_usage() if len(args) != 1: parser.error("incorrect number of arguments") if options.language: lang = options.language else: lang = "ar" # Arabic is the default language initialize(options.root_dir, options.environment_file, options.context_file) process_file(args[0], lang) if environment: environment.DestroyContext(context)if __name__ == "__main__": main()


APPENDIX B — ROSETTELANGUAGEPROCESSOR.PY## -*- coding: utf-8 -*-#### RosetteLanguageProcessor - Arabic RLP test document processor#### Christian Moen <[email protected])##import rlpimport refrom docproc import Processor, DocumentException, ProcessorStatusclass RosetteLanguageProcessor(Processor.Processor): def ConfigurationChanged(self, attributes): self.language = self.GetParameter('language') self.attributes = re.split('\s+', self.GetParameter('attributes')) self.root_dir = self.GetParameter('root_dir') self.environment_file = self.GetParameter('environment_file') self.context_file = self.GetParameter('context_file') log(log.FLOG_VERBOSE, "Intializing RLP with root %s environment %s and context %s" % (self.root_dir, self.environment_file, self.context_file)) rlp.BT_RLP_Environment_SetBTRootDirectory(self.root_dir) self.environment = rlp.BT_RLP_Environment_Create(); if self.environment == None: raise DocumentException, "Could not create empty environment" status = self.environment.InitializeFromFile(self.environment_file) if status != rlp.BT_OK: raise DocumentException, "Could not initialize environment from %s. status code %d" % (self.environment_file, status) self.context = rlp.BT_RLP_Environment_GetContextFromFile2(self.environment, self.context_file) if self.context == None: raise DocumentException, "Could not create context from %s" % self.context_file self.language_code = rlp.BT_LanguageIDFromISO639(self.language) if self.language_code == rlp.BT_LANGUAGE_UNKNOWN: raise DocumentException, "Unknown ISO 639 language code %s" % self.language if self.language == "ar": self.context.SetPropertyValue("com.basistech.arbl.roots", "true") self.context.SetPropertyValue("com.basistech.arbl.lemmas", "true") self.context.SetPropertyValue("com.basistech.arbl.alternatives", "true") def Process(self, docid, document): log(log.FLOG_VERBOSE, "RosetteLanguageProcessor processing %s" % docid) for attribute in self.attributes: if document.Has(attribute): self.ProcessAttribute(attribute, document) return ProcessorStatus.OK def ProcessAttribute(self, attribute, document): log(log.FLOG_VERBOSE, "Processing attribute: %s" % attribute) value = document.GetValue(attribute)# log(log.FLOG_VERBOSE, 'Value: %s' % value)# log(log.FLOG_VERBOSE, 'Length: %d' % len(value))# log(log.FLOG_VERBOSE, 'Type: %s' % str(type(value)))# log(log.FLOG_VERBOSE, 'LanguageCode: %d' % self.language_code) status = rlp.ProcessBuffer2(self.context, value, len(value), self.language_code, 'UTF8') if status != rlp.BT_OK: log(log.FLOG_ERROR, 'Could not process %s in language %s. Error code %d' % (self.input, self.language, status)) return ProcessorStatus.ERROR token_iter_fact = rlp.BT_RLP_TokenIteratorFactory.Create() token_iter = token_iter_fact.CreateIterator(self.context) newvalue = '' newvalue_with_lemmas = '' while (token_iter.Next()): token = rlp.bt_xutf16toutf8_2(token_iter.GetToken()) newvalue += token + ' ' newvalue_with_lemmas += token + ' ' if token_iter.GetNumberOfAnalyses(): while token_iter.NextAnalysis(): lemma = rlp.bt_xutf16toutf8_2(token_iter.GetLemmaForm()) if lemma: newvalue_with_lemmas += lemma + ' ' document.Set(attribute, newvalue) document.Set('lem' + attribute, newvalue_with_lemmas)


# log(log.FLOG_VERBOSE, '%s : %s' % (attribute, document.GetValue(attribute)))# log(log.FLOG_VERBOSE, '%s : %s' % ('lem' + attribute, document.GetValue('lem' + attribute))) facilities = [] locations = [] GPEs = [] organizations = [] people = [] nationalities = [] titles = [] entity_iter_fact = rlp.BT_RLP_NE_Iterator_Factory.Create() entity_iter = entity_iter_fact.CreateIterator(self.context) while entity_iter.Next(): entity = rlp.bt_xutf16toutf8_2(entity_iter.GetRawNamedEntity()) entity_type = rlp.BT_RLP_NET_ID_TO_STRING(entity_iter.GetType()) # FIXME: The beloe should use entity_iter.GetType() and # rlp.BT_NE_TYPE_FACILITY, etc. constants if entity_type == "FACILITY": facilities.append(entity) elif entity_type == "LOCATION": locations.append(entity) elif entity_type == "GPE": GPEs.append(entity) elif entity_type == "ORGANIZATION": organizations.append(entity) elif entity_type == "PERSON": people.append(entity) elif entity_type == "NATIONALITY": nationalities.append(entity) elif entity_type == "TITLE": titles.append(entity) document.Set("facilities", str.join(';', facilities)) document.Set("locations", str.join(';', locations)) document.Set("gpes", str.join(';', GPEs)) document.Set("organizations", str.join(';', organizations)) document.Set("people", str.join(';', people)) document.Set("nationalities", str.join(';', nationalities)) document.Set("titles", str.join(';', titles)) log(log.FLOG_VERBOSE, 'facilities: %s' % document.Get("facilities")) log(log.FLOG_VERBOSE, 'locations: %s' % document.Get("locations")) log(log.FLOG_VERBOSE, 'gpes: %s' % document.Get("gpes")) log(log.FLOG_VERBOSE, 'organizations: %s' % document.Get("organizations")) log(log.FLOG_VERBOSE, 'people: %s' % document.Get("people")) log(log.FLOG_VERBOSE, 'nationalities: %s' % document.Get("nationalities")) log(log.FLOG_VERBOSE, 'titles: %s' % document.Get("titles"))