Pivotal Greenplum® Text Version 3.3.0 User Guide Rev: 01 © 2019 Pivotal Software, Inc.


Table of Contents

Pivotal® Greenplum® Text 3.3.0 Documentation
Pivotal® GPText 3.3.0 Release Notes
Installing GPText
Upgrading GPText
Introduction to Pivotal GPText
Administering GPText
GPText High Availability
GPText Best Practices
Troubleshooting Hadoop Connection Problems
Working With GPText Indexes
Querying GPText Indexes
Customizing GPText Indexes
Working With GPText External Indexes
Natural Language Processing with GPText Indexes
GPText Function Reference
GPText Management Utilities
GPText and Solr Data Type Mappings
GPText Schema Tables
GPText Configuration Parameters

© Copyright Pivotal Software, Inc, 2013-2019. Version 3.3.0


Pivotal® Greenplum® Text 3.3.0 Documentation

GPText Documentation PDF

Pivotal GPText 3.3.0 Release Notes

Installing Pivotal GPText

Upgrading Pivotal GPText

Using Pivotal GPText

GPText References

Additional Resources

Pivotal Greenplum Database

Apache Solr Web Site

Apache MADlib


Pivotal® GPText 3.3.0 Release Notes

This document contains release information for Pivotal GPText 3.3.0.

Released: June 2019

About Pivotal GPText

Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.

GPText includes the following features:

The GPText database schema provides in-database access to Apache Solr indexing and searching

Build indexes with database data or external documents and search with the GPText API

Custom tokenizers for international text and social media text

A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors

Faceted search results

Term highlighting in results

Natural language processing, including part-of-speech tagging and named entity extraction

Greater emphasis on high availability

The GPText management utility suite includes command-line utilities to perform the following tasks:

Start, stop, and monitor ZooKeeper and GPText nodes

Configure GPText nodes and indexes

Add and delete replicas for index shards

Back up and restore GPText indexes

Recover a GPText node

Expand the GPText cluster by adding GPText nodes

Prerequisites

Installing GPText also installs Apache SolrCloud and, optionally, Apache ZooKeeper.

The following are GPText installation prerequisites.

GPText runs on Red Hat Enterprise Linux 5.x, 6.x, and 7.x.

GPText runs on Greenplum Database version 4.3.6 or higher, Greenplum Database 5, or Greenplum Database 6. Greenplum Database 6 requires at least GPText 3.3.

GPText requires Java 8, OpenJDK 8, Java 11, or OpenJDK 11 to be installed on each host in the Greenplum Database cluster. Add the JRE bin directory to the PATH on all hosts in the cluster.

Install and configure your Greenplum Database system before you install GPText. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.

Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc).

Installing lsof on all cluster hosts is recommended (sudo yum install lsof).

GPText cannot be installed onto a shared NFS mount.

GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.

If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website.
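The arithmetic described above can be sketched in a few lines of shell. This is only an illustration: the function names are hypothetical, and the node count, JVM size, and RAM figures are assumed example values, not recommendations. The result is the RAM figure you would carry into the gp_vmem_protect_limit formulas, not the parameter value itself.

```shell
#!/bin/sh
# Sketch: memory to reserve for GPText on one Greenplum segment host.
# Function names and all numbers are illustrative assumptions.

# gptext_reserved_mb NODES_PER_HOST JVM_MAX_MB
# Prints the number of GPText nodes per host multiplied by the JVM maximum size (MB).
gptext_reserved_mb() {
    echo $(( $1 * $2 ))
}

# ram_left_for_gpdb_mb PHYSICAL_RAM_MB NODES_PER_HOST JVM_MAX_MB
# Prints the physical RAM that remains after subtracting the GPText reservation.
ram_left_for_gpdb_mb() {
    echo $(( $1 - $(gptext_reserved_mb "$2" "$3") ))
}

# Example: 2 GPText nodes per host, a 2048 MB JVM maximum, 65536 MB of RAM.
gptext_reserved_mb 2 2048            # prints 4096
ram_left_for_gpdb_mb 65536 2 2048    # prints 61440
```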

Apache Solr requires a ZooKeeper cluster with a minimum of three nodes (five nodes recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.
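As background to the three-node minimum and five-node recommendation, the following sketch shows the general ZooKeeper majority rule: an ensemble stays available only while more than half of its nodes are running. This is general ZooKeeper behavior rather than something stated in this guide, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of ZooKeeper majority-quorum arithmetic (hypothetical helper names).

# zk_quorum_size N: smallest majority of an N-node ensemble.
zk_quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# zk_tolerated_failures N: nodes that can fail while a majority remains.
zk_tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 3 5; do
    echo "$n nodes: quorum $(zk_quorum_size "$n"), tolerates $(zk_tolerated_failures "$n") failure(s)"
done
```

With three nodes one failure is tolerated; with five nodes two are, which is one reason a larger ensemble may be preferred.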

New Features and Enhancements in GPText 3.3.0

Using GPText 3.3 with Greenplum Database 6.0

GPText 3.3 can be installed on a Greenplum Database 6 system with Java 8 or Java 11.

A GPText binary distribution has been added to Pivotal Network for Red Hat 7/CentOS 7 with Greenplum Database 6.

Using GPText with Greenplum Database 6 differs from using it with earlier Greenplum Database releases in the following ways:

The custom_variable_classes server configuration parameter has been removed in Greenplum Database 6. With earlier Greenplum Database versions, it was necessary to add 'gptext' to this parameter in order to set GPText configuration parameters. Greenplum Database 6 allows you to set configuration parameters in a database session without declaring a variable class.

In Greenplum Database 4 and 5, the default output format for the binary data type bytea is the PostgreSQL escape format, a sequence of ASCII characters with escape sequences where bytes cannot be represented with ASCII. In Greenplum Database 6, the default output format is the hex format, which represents each byte with hexadecimal digits. In Greenplum Database 5, the hex output format can be specified by setting the bytea_output configuration parameter to hex. To produce the same output in Greenplum Database 4, 5, and 6, you can set the bytea_output configuration parameter to escape.
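To make the hex format concrete, the sketch below renders a byte string the way PostgreSQL's hex bytea output does: the prefix \x followed by two hexadecimal digits per byte. It runs outside the database and uses only standard od, tr, and printf; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print stdin in the PostgreSQL hex bytea form (\x + 2 hex digits per byte).
# bytea_hex is a hypothetical helper, not a GPText or Greenplum utility.
bytea_hex() {
    # od emits space-separated hex bytes; tr removes the spaces and newlines.
    printf '\\x%s\n' "$(od -An -v -tx1 | tr -d ' \n')"
}

printf 'GPText' | bytea_hex    # prints \x475054657874
```

With bytea_output set to escape, the same value would instead be shown as printable ASCII with escape sequences for bytes that have no ASCII representation.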

Custom Configuration Directory

A new optional installation parameter, GPTEXT_CUSTOM_CONFIG_DIR, can be set in the gptext_install_config file to specify a directory to store custom configuration files.

By default, GPText saves custom configuration files under the $GPTEXTHOME/share/ directory on each Solr host, for example $GPTEXTHOME/share/external_ .

To specify a different directory to store external configuration files, before you run the GPText installer, uncomment the GPTEXT_CUSTOM_CONFIG_DIR parameter in the gptext_install_config file and specify the full path to the directory. For example:

GPTEXT_CUSTOM_CONFIG_DIR="/home/gpadmin/config_dir"

The gpadmin user must have the OS permissions required to create the directory.

If the parameter is set, the GPText installer creates the custom configuration directory on every Solr host. Configuration files you upload using the gptext-external upload command are stored under this directory on every Solr host, so that Solr can access the external document source from every host.

For example, if the GPTEXT_CUSTOM_CONFIG_DIR parameter is set to /home/gpadmin/config_dir when you install GPText, an s3 configuration with the name s3_conf will be saved in the directory /home/gpadmin/config_dir/external_source/s3/s3_conf on each host.
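The path in the example above suggests the layout <config dir>/external_source/<source type>/<configuration name>. The sketch below builds such a path. Generalizing from the single s3 example is an assumption, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: where an uploaded external-source configuration would be stored,
# assuming the layout seen in the s3 example generalizes to other source types.
# external_config_path CONFIG_DIR SOURCE_TYPE CONFIG_NAME   (hypothetical helper)
external_config_path() {
    # ${1%/} drops one trailing slash from the directory, if present.
    printf '%s/external_source/%s/%s\n' "${1%/}" "$2" "$3"
}

external_config_path /home/gpadmin/config_dir s3 s3_conf
# prints /home/gpadmin/config_dir/external_source/s3/s3_conf
```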

New Features and Enhancements in GPText 3.2.0

The GPText 3.2.0 release provides the following features and enhancements.

Lemmatization

GPText 3.2.0 enables lemmatizing terms in GPText indexes. You can define Solr analysis chains that include the Apache OpenNLP parts-of-speech filter and the new GPText WordNet Lemmatizer filter, which replaces terms with the root form of the term. The WordNet Lemmatizer filter uses a lexical database from the Princeton University WordNet® project to determine the root form.


GPText Configuration Files Location

GPText now saves the configuration files gptext.conf, gptxtenvs.conf, and zookeeper.conf only in the Greenplum Database master and standby master directories. The gptext.conf file is no longer saved in each segment data directory.

Flexible Sharding

By default, GPText creates one Solr index shard for each Greenplum Database primary segment. You can now specify a smaller number of shards by setting the gptext.idx_num_shards parameter to the number of shards you want before you create the index. This works for both regular GPText indexes and external indexes.

When gptext.idx_num_shards is set to the default (0), GPText configures the index to use the Solr implicit router, with one shard per Greenplum Database segment. When the gptext.idx_num_shards parameter is changed to the number of shards desired, GPText creates the index using the Solr compositeId router to route documents to shards. The compositeId router does not support duplicate IDs, so if you set the if_check_id_uniqueness argument to false when you call the gptext.create_index() function, the implicit router is used, and the index will have one shard per Greenplum Database segment.

The content_id column is removed from the output of the gptext.index_status() and gptext.index_summary() functions, since Greenplum Database segments are not always associated with a single index shard.

See Specifying the Number of Shards for more information about this feature.
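The router selection rules in this section can be summarized as a small decision function. This is a sketch of the rules as described here, not GPText code; the function name and argument order are hypothetical.

```shell
#!/bin/sh
# Sketch of the router and shard-count rules described above.
# index_layout IDX_NUM_SHARDS IF_CHECK_ID_UNIQUENESS PRIMARY_SEGMENTS
# Prints "<router> <number of shards>". Hypothetical helper for illustration.
index_layout() {
    if [ "$1" -eq 0 ] || [ "$2" = "false" ]; then
        # Default setting, or duplicate IDs allowed: one shard per primary segment.
        echo "implicit $3"
    else
        echo "compositeId $1"
    fi
}

index_layout 0 true 16     # prints: implicit 16
index_layout 4 true 16     # prints: compositeId 4
index_layout 4 false 16    # prints: implicit 16
```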

gptext-recover Utility

When using the -f (--force) option, the gptext-recover utility now verifies that there are no indexes in a red state before proceeding. If any index is down, the utility exits.

ZooKeeper Upgrade

Apache ZooKeeper included with GPText 3.2.0 has been upgraded to version 3.4.11. This ZooKeeper release includes bug fixes that resolve an inconsistent cluster issue with GPText (MPP-29742).

New Features and Enhancements in GPText 3.1.0

The GPText 3.1.0 release provides the following features and enhancements.

Improvements to aid in developing and testing analyzer chains

The new gptext.list_field_types() function lists the field types defined in the managed-schema configuration file for an index.

The new gptext.get_field_type() function displays the index and query analyzer chains for a field type in JSON format.

The new gptext.analyzer() function shows the index or query analyzer chain output for a given field type and input text. This function is useful for testing and debugging analyzer chains interactively without modifying the index.

Part-of-speech tagging and named entity recognition

GPText includes OpenNLP libraries and analyzer classes to classify indexed terms’ parts-of-speech (POS), and to recognize named entities, such as the names of persons, locations, and organizations (NER). GPText saves NER terms in the field’s terms vector, prepended with a code to identify the type of entity recognized. This allows searching documents by entity type.

The new gptext.ner_terms() function lists NER-tagged terms for documents that match a query.

GPText includes the OpenNLP models for the English language. You can download models for other languages from the OpenNLP web site and use them with GPText.

Other enhancements and fixes


The first argument of the gptext.terms() function, an anytable data type, has been made optional.

Fixed an error where the gptext.partition_status() function displayed partition information for an index after it was dropped.

Apache Solr updated to Solr version 7.3

GPText 3.1.0 includes Apache Solr 7.3. See the following release documents for information about the Solr 7.3 release.

Apache Solr 7.3 Upgrade Notes

Apache Solr 7.3 Release Highlights

The following are GPText changes and Solr usage notes related to the Solr 7.3 upgrade.

GPText server-side components are rebuilt and tested with the new Solr JAR files.

The managed-schema, solrconfig.xml, and other collection configuration files are updated.

The top-level <highlighting> element in solrconfig.xml is now officially deprecated in favor of the equivalent <searchComponent> syntax. This element has been out of use in default Solr installations for several releases already.

The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json, that replica will not be registered. This may affect users who bring up replicas and expect them to be registered automatically as part of a shard. It is possible to revert to the old behavior by setting the property legacyCloud=true in the cluster properties by running the following command in the GPText installation directory:

$ ./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name legacyCloud -val true

With earlier Solr releases, if you dropped an index while a Solr node with a replica of the index was down, the index came back when the down node came back online, and it could not be deleted. Solr 7 fixes this bug, so the GPText workaround for it has been removed.

PointFields are the default numeric types. Solr has implemented *PointField types across the board to replace Trie*-based numeric fields. All Trie* fields are now considered deprecated, and will be removed in Solr 8. If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible. Changing to the new PointField types will require you to re-index your data.

The following spatial-related fields have been deprecated: LatLonType, GeoHashField, FieldType, SpatialTermQueryPrefixTreeFieldType. Use one of these field types instead: LatLonPointSpatialField, SpatialRecursivePrefixTreeField, RptWithGeometrySpatialField.

To improve parameter consistency in the Collections API, the parameter name fromNode for the MOVEREPLICA command and the parameter names source and target for the REPLACENODE command have been deprecated and replaced with sourceNode and targetNode. The old names will continue to work for backwards compatibility, but they will be removed in Solr 8.

The replica core name has changed from <collection_name>_shard#_replica# to <collection_name>_shard#_replica_<node_type>#. For example, demo.wikipedia.articles_shard0_replica1 becomes demo.wikipedia.articles_shard0_replica_n1.
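If you have scripts that refer to replica core names, the renaming pattern above can be expressed as a text substitution. The sketch below only rewrites a name string; it does not touch Solr. The function name is hypothetical, and the default node type n is taken from the example above.

```shell
#!/bin/sh
# Sketch: map a pre-7.3 replica core name to the new naming pattern.
# new_core_name OLD_NAME [NODE_TYPE]   (hypothetical helper; NODE_TYPE defaults to n)
new_core_name() {
    # Rewrite a trailing "_replica<digits>" as "_replica_<node_type><digits>".
    printf '%s\n' "$1" |
        sed 's/_replica\([0-9][0-9]*\)$/_replica_'"${2:-n}"'\1/'
}

new_core_name demo.wikipedia.articles_shard0_replica1
# prints demo.wikipedia.articles_shard0_replica_n1
```

Names that already follow the new pattern are left unchanged, because the expression requires digits directly after "_replica".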

New Features and Enhancements in GPText 3.0.0

GPText 3.0.0 allows adding documents stored in Amazon Web Services S3 buckets to a GPText external index. This enhancement includes changes to enable uploading AWS credentials to ZooKeeper and support for the s3 document source type for the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.

The gptext-state utility with the --index (-i) option now includes the date and time the GPText index was last modified.

New Features and Enhancements in GPText 2.4.0

GPText 2.4.0 allows adding documents stored in an authenticated FTP server to a GPText external index. This enhancement includes changes to add support for the ftp type to the gptext-external upload command-line utility and the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.


New Features and Enhancements in GPText 2.3.1

The gptext-backup command-line utility can now back up GPText indexes to local GPText cluster storage as well as a directory on a shared drive. For local backups, backup metadata and the index configuration files are backed up to the Greenplum Database master data directory and index shards are backed up in the segment data directories on each host.

The gptext-backup utility has a new option to back up just the index configuration files from ZooKeeper, with no index data.

The gptext-restore utility is updated to restore backups created on local cluster storage.

The gptext-restore utility has a new option to restore only the configuration files from a backup. This option loads the configuration files into ZooKeeper and creates an empty GPText index.

New Features and Enhancements in GPText 2.3.0

Revised gptext-config Utility Syntax

The gptext-config command-line utility was revised to have a more user-friendly syntax.

A new list subcommand was added to gptext-config. Use it to list all of the configuration files for a specified GPText index.

$ gptext-config list -i <index-name>

Index Documents in a Hadoop File System (hdfs) Document Source

GPText 2.3.0 enables you to add documents stored in an hdfs system to a GPText external index.

The new gptext-external command-line utility uploads Hadoop configuration and authentication files to a named configuration in ZooKeeper. The utility has subcommands upload, list, and delete to manage the configurations you have uploaded.

The new gptext.external_login() function logs into the hdfs system using the named configuration you have uploaded. You can log in to only one external document source at a time.

Use URLs of the form hdfs://<url> with the gptext.index() and gptext.index_external() functions to add documents to a GPText external index.

Use the new gptext.index_external_dir() function to add all documents in an hdfs directory to a GPText external index.

Log out of the hdfs external document source with the new gptext.external_logout() function.

See Authenticating with an External Document Source for steps to enable access to an hdfs document source.

Known Issues

See the Apache Jira for known issues in Apache Solr.

The following are known issues in GPText. Workarounds are provided when available.

Wildcards in GPText Search Options

Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names. For example, given a table with columns contenta and contentb, specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) correctly returns contenta and contentb, but omits the pseudo-field sum(1,1).

Specifying a wildcard to match all fields (fl=*,sum(1,1)) also omits the pseudo-field.
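One way to catch this issue before running a query is to check an fl value for the combination that triggers it. The sketch below is a rough heuristic with a hypothetical function name: it treats any parenthesis as the sign of a function-style pseudo-field and flags the value when a wildcard is also present.

```shell
#!/bin/sh
# Sketch: flag fl values that combine a wildcard with a function-style
# pseudo-field such as sum(1,1). Heuristic only; hypothetical helper name.
# Returns 0 when the known issue is likely to apply, 1 otherwise.
fl_hits_known_issue() {
    case "$1" in
        *\**\(* | *\(*\**) return 0 ;;
        *) return 1 ;;
    esac
}

for fl in 'contenta,contentb,sum(1,1)' 'cont*,sum(1,1)' '*,sum(1,1)'; do
    if fl_hits_known_issue "$fl"; then
        echo "fl=$fl: pseudo-field may be omitted, list the fields explicitly"
    else
        echo "fl=$fl: ok"
    fi
done
```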

Index Load Failure After Configuration File Error

If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit managed-schema or solrconfig.xml and introduce an XML syntax error or a typo in configuration values.

Workaround:

1. When an index fails to load, check the Solr log to find the cause.

2. If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.

3. If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.

Startup Failure with Large Numbers of Indexes

When there is a large number of Solr cores, SolrCloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.
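To get a feel for how quickly the core count grows, the sketch below multiplies indexes, shards per index, and replicas per shard, on the basis that SolrCloud hosts each replica of each shard as one core. It assumes every index uses the same shard count and replication factor, the function name is hypothetical, and the numbers are illustrative only.

```shell
#!/bin/sh
# Sketch: estimate the total number of Solr cores in a GPText cluster.
# Assumes uniform shard counts and replication factors across indexes.
# solr_core_count INDEXES SHARDS_PER_INDEX REPLICATION_FACTOR   (hypothetical helper)
solr_core_count() {
    echo $(( $1 * $2 * $3 ))
}

# Example: 50 indexes, 16 shards each, replication factor 2.
solr_core_count 50 16 2    # prints 1600
```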

Setting GPText Configuration Parameters Without First Setting custom_variable_classes

In Greenplum Database versions before Greenplum Database 6, if the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:

mydb-# set gptext.replication_factor=4;
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
ERROR: unrecognized configuration parameter "gptext.replication_factor"

In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.

mydb-# show gptext.replication_factor;
 gptext.replication_factor
----------------------------
 0

Beginning with GPText 2.1, the error message is still generated; however, the value saved in ZooKeeper is the value specified in the set command, 4 in the preceding example.

To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:

$ gpconfig -c custom_variable_classes -v 'gptext'

In Greenplum Database 6.0, the custom_variable_classes configuration parameter is removed and custom parameters can be set without errors.


Installing GPText

Prerequisites

The GPText installation includes the installation of Apache SolrCloud and, optionally, Apache ZooKeeper.

If you are installing a new GPText release into an existing GPText system, follow the instructions in Upgrading GPText instead.

The following are GPText installation prerequisites.

Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.

GPText runs on Red Hat Enterprise Linux or CentOS 5.x, 6.x, or 7.x.

GPText cannot be installed onto a shared NFS mount.

Install a Java 8 (1.8) or Java 11 JRE on all hosts in the cluster.

Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc).

Installing lsof on all cluster hosts is recommended (sudo yum install lsof).

GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.

If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website.

Apache Solr requires a ZooKeeper cluster with a minimum of three nodes. You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster with at least three nodes (five nodes recommended) on separate hosts with network connectivity to the Greenplum network.

Install the GPText Binary Distribution

1. On the Greenplum master host, extract the GPText distribution file. For example:

$ cd /home/gpadmin
$ tar xvfz greenplum-text-<version>-<platform>.tar.gz

This creates the directory greenplum-text-<version>-<platform> containing the files gptext_install_config and the GPText installation binary, which has a name in the format greenplum-text-<version>-<platform>.bin.

2. If necessary, grant execute permission to the GPText binary. For example:

$ chmod +x /home/gpadmin/greenplum-text-<version>-<platform>.bin

3. If you are installing GPText in a directory that is only writable by root, such as the default directory /usr/local, perform these steps as root:

a. Source the greenplum_path.sh file in the Greenplum Database installation directory.

# source /usr/local/greenplum-db-<version>/greenplum_path.sh

b. Locate or create a text file containing a list of the names of all hosts where you will install GPText, one per line, including the master and standby host names.

c. Start gpssh, specifying the text file with host names.

# gpssh -f hostlist.txt

d. Create the installation directory and the greenplum-solr directory and set the ownership and permissions. For example, if you are installing GPText in the default directory, /usr/local:

=> mkdir /usr/local/greenplum-text-<version>
=> mkdir /usr/local/greenplum-solr
=> chown gpadmin:gpadmin /usr/local/greenplum-text-<version>
=> chmod 775 /usr/local/greenplum-text-<version>
=> chown gpadmin:gpadmin /usr/local/greenplum-solr
=> chmod 775 /usr/local/greenplum-solr
=> exit

e. Complete the remaining steps as the gpadmin user.

4. Edit the gptext_install_config file to set parameters for the installation. See Set Installation Parameters for details.

5. Run the GPText installation binary as gpadmin on the master server:

$ ./greenplum-text-<version>-<platform>.bin -c <gptext_install_config>

6. Accept the Pivotal license agreement.

Optional Two-Part GPText Installation

The GPText two-part installation installs and deploys the GPText software in separate steps. This gives you the option to install the software files to a read-only, shared directory mounted on all GPText hosts in the cluster, rather than installing the software on every GPText host.

If you install the GPText software onto a shared drive, you must set the GPTEXT_CUSTOM_CONFIG_DIR parameter in the installation configuration file. This parameter specifies a writable directory that exists on every GPText host where GPText can store configuration files for external data sources. See GPText installation parameters for more information about this parameter.

Run the GPText installation in two parts by following the steps in this section.

1. Prepare GPText installation directories as described in steps 1 through 3 in Install the GPText Binary Distribution.

2. Run the GPText installation binary as gpadmin on the master server:

$ ./greenplum-text-<version>.bin -b

Note that the -c <gptext_install_config> option is omitted.

3. Source the GPText environment script in the GPText installation directory:

$ source <gptext-install-dir>/greenplum-text_path.sh

4. Edit the gptext_install_config file to set parameters for the GPText deployment. See Set Installation Parameters for details. Be sure to uncomment and set the GPTEXT_CUSTOM_CONFIG_DIR parameter if you installed the software on a read-only drive.

5. Deploy the GPText cluster with the gptext-deploy command. The command requires the -c option to specify the installation configuration file. Also include the -m option because you installed the GPText software to a shared drive mounted on all GPText hosts. If you do not include -m, gptext-deploy copies the GPText software to all GPText hosts.

$ gptext-deploy -m -c <gptext_install_config>

Set Installation Parameters

A GPText configuration file named gptext_install_config contains parameters to configure the GPText installation. Edit the file and set the parameters as described in the following table.

The GPTEXT_HOSTS and DATA_DIRECTORY installation parameters determine the number of GPText nodes that are deployed. The number of directories included in the DATA_DIRECTORY array is the number of GPText nodes that are created per host.

The GPTEXT_HOSTS parameter determines the number of hosts. If set to the constant "ALLSEGHOSTS", the number of GPText node hosts is the same as the number of Greenplum segment hosts. If GPTEXT_HOSTS is set to an array of host names, the length of the array is the number of GPText node hosts.
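Taken together, these two rules imply that the total number of GPText nodes is the number of hosts multiplied by the number of DATA_DIRECTORY entries. The sketch below works this out for example values. The helper names are hypothetical, and the host and directory lists are written as space-separated strings for portability rather than as the arrays used in gptext_install_config.

```shell
#!/bin/sh
# Sketch: total GPText nodes = hosts x DATA_DIRECTORY entries per host.
# Helper names are hypothetical; lists are space-separated strings here.

# count_words "a b c": number of whitespace-separated entries.
count_words() {
    set -- $1
    echo $#
}

# gptext_node_count HOST_COUNT DATA_DIR_COUNT
gptext_node_count() {
    echo $(( $1 * $2 ))
}

hosts="gptext_h1 gptext_h2 gptext_h3"
data_dirs="/data/primary /data/primary"
gptext_node_count "$(count_words "$hosts")" "$(count_words "$data_dirs")"    # prints 6
```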

The maximum number of GPText nodes is the number of Greenplum Database primary segments. The best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes allowed. For example, if there are eight primary segments per host in the Greenplum Database cluster, the maximum number of GPText nodes per host is eight, but you should test with two or four GPText nodes per host, adjusting the JAVA_OPTS installation parameter to divide the memory reserved for GPText among them.

GPText installation parameters

GPTEXT_HOSTS

An array of host names on which to install GPText, or the constant "ALLSEGHOSTS" to install GPText on all Greenplum Database segment hosts. GPText hosts must be passwordless ssh-accessible by the gpadmin user from all other hosts in the Greenplum cluster.

declare -a GPTEXT_HOSTS=(gptext_h1 gptext_h2 gptext_h3)

GPTEXT_HOSTS="ALLSEGHOSTS"

DATA_DIRECTORY

An array of directory paths where GPText data directories are to be created. The number of directories in the array determines the number of GPText nodes that will be created on each physical host. If GPTEXT_HOSTS lists multiple interfaces per host, the GPText nodes are spread evenly across the interface addresses.

declare -a DATA_DIRECTORY=(/data/primary /data/primary)

GPTEXT_CUSTOM_CONFIG_DIR

The path to a directory where GPText stores uploaded external data source configuration files and custom libraries. If you do not set this parameter, the default is to store these files in the share subdirectory of the GPText installation directory. If you do specify a directory with this parameter, the directory is created on every Solr host in the cluster, and external configuration files and custom libraries will be stored there, leaving the GPText installation directory free from application data.

JAVA_OPTS

Sets the minimum and maximum memory each SolrCloud JVM can use.

JAVA_OPTS="-Xms1024M -Xmx2048M"

GPTEXT_PORT_BASE, GP_MAX_PORT_LIMIT

Set a range of port numbers available to GPText nodes. GPText finds unused ports in the specified range.

GPTEXT_PORT_BASE=18983
GP_MAX_PORT_LIMIT=28983

ZOO_CLUSTER

Whether to deploy a GPText binding ZooKeeper cluster or use an existing ZooKeeper cluster. If set to "BINDING", the installation deploys a ZooKeeper cluster. To use an existing ZooKeeper cluster, set this parameter to a list of ZooKeeper nodes in the format "host1:port,host2:port,host3:port".

ZOO_CLUSTER="BINDING"

ZOO_HOSTS

If ZOO_CLUSTER is set to "BINDING", this parameter is an array of the hosts where the ZooKeeper nodes are to be installed. The array must contain 3, 5, or 7 host names, for example ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5). If you are using a single host for ZooKeeper, specify it multiple times, for example, ZOO_HOSTS=(sdw1 sdw1 sdw1).

declare -a ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5)

ZOO_DATA_DIR

The ZooKeeper data directory, required when ZOO_CLUSTER is set to "BINDING".

ZOO_DATA_DIR="/data/master/"

ZOO_GPTXTNODE

The node path in ZooKeeper for GPText. This parameter is required whether ZOO_CLUSTER is set to "BINDING" or a list of hosts.

ZOO_GPTXTNODE="gptext"

ZOO_PORT_BASE, ZOO_MAX_PORT_LIMIT

A range of port numbers to use for the ZooKeeper cluster. Unused ports are allocated from within this range. The range must contain at least 4000 port numbers.

ZOO_PORT_BASE=2188
ZOO_MAX_PORT_LIMIT=12188

GPTEXT_JAVA_HOME

The home directory of the Java installation to use for ZooKeeper and Solr processes. If not set, the JRE specified in the PATH and JAVA_HOME environment variables will be used.

GPTEXT_JAVA_HOME=/usr/java/jdk1.8.0_131
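Two of the constraints in the table, the ZOO_HOSTS count and the ZooKeeper port range size, are easy to check before running the installer. The sketch below is not part of GPText; the function names are hypothetical, and counting both ends of the port range as usable is an assumption.

```shell
#!/bin/sh
# Sketch: pre-flight checks for two gptext_install_config constraints.
# Hypothetical helpers; not GPText utilities.

# valid_zoo_host_count N: succeeds when N is 3, 5, or 7.
valid_zoo_host_count() {
    case "$1" in
        3|5|7) return 0 ;;
        *) return 1 ;;
    esac
}

# check_zoo_hosts HOST...: validates the number of host names given.
check_zoo_hosts() {
    valid_zoo_host_count $#
}

# valid_zoo_port_range BASE MAX: succeeds when the range holds at least
# 4000 ports (assumption: both BASE and MAX are counted).
valid_zoo_port_range() {
    [ $(( $2 - $1 + 1 )) -ge 4000 ]
}

check_zoo_hosts sdw1 sdw2 sdw3 sdw4 sdw5 && echo "ZOO_HOSTS count ok"
valid_zoo_port_range 2188 12188 && echo "ZooKeeper port range ok"
```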

Starting GPText

First, make sure the GPText command-line utilities are in your path by sourcing the Greenplum Database and GPText environment scripts. It is important to source the GPText environment script each time you source the Greenplum Database script. For example:

$ source /usr/local/greenplum-db-<version>/greenplum_path.sh
$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh

To use GPText in a database, you must first use the gptext-installsql management utility to install the GPText user-defined functions and other objects in the database:

$ gptext-installsql database [database2 ...]

The GPText objects are created in the gptext schema.

The ZooKeeper cluster must be running before you start GPText. If you installed a binding ZooKeeper cluster, start it with the zkManager command-line utility.

$ zkManager start

Start GPText with the gptext-start utility.

$ gptext-start
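The start-up order in this section can be captured as a dry-run helper that prints the commands in sequence without executing them. This is only a sketch: the function name is hypothetical, the paths are placeholders, and the one-time gptext-installsql step is left out because it is run per database rather than at every start.

```shell
#!/bin/sh
# Sketch: print (do not run) the GPText start-up commands in order.
# gptext_start_plan GPDB_HOME GPTEXT_HOME ZOO_CLUSTER   (hypothetical helper)
gptext_start_plan() {
    echo "source $1/greenplum_path.sh"
    echo "source $2/greenplum-text_path.sh"
    # Only a binding ZooKeeper cluster is started with zkManager;
    # an existing external cluster is assumed to be running already.
    if [ "$3" = "BINDING" ]; then
        echo "zkManager start"
    fi
    echo "gptext-start"
}

gptext_start_plan '/usr/local/greenplum-db-<version>' \
                  '/usr/local/greenplum-text-<version>' BINDING
```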

Configure Greenplum Database

GPText configuration parameters are saved in ZooKeeper. You can, however, view and set GPText configuration parameters in a Greenplum Database session using the SHOW and SET commands.

If you are using Greenplum Database 4.3.x or 5.x, you must first declare the GPText custom variable class by adding it to the Greenplum Database custom_variable_classes configuration parameter. The custom_variable_classes parameter is removed in Greenplum Database 6, so this step is unnecessary if you have Greenplum Database 6.

The custom_variable_classes configuration parameter is a comma-separated list of class names. It is unset by default. To see if any custom variable classes have already been configured, run this gpconfig command at the command line.

$ gpconfig -s custom_variable_classes

If no custom variable classes have been set, set the parameter with the following command.


$ gpconfig -c custom_variable_classes -v 'gptext'
20171029:12:29:11:028199 gpconfig:gpsne:gpadmin-[INFO]:-completed successfully

If other classes have been configured, add gptext to the existing list, separated by a comma.

Run gpstop -u to have Greenplum Database reload the configuration file.
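When other classes are already configured, the new value for custom_variable_classes has to keep them. The sketch below builds that value from the current list. It is a string helper only, with a hypothetical name, and it assumes the list contains no spaces; pass its output to gpconfig -c custom_variable_classes -v yourself.

```shell
#!/bin/sh
# Sketch: add a class to a comma-separated list unless it is already present.
# add_variable_class CURRENT_LIST CLASS   (hypothetical helper; assumes no spaces)
add_variable_class() {
    case ",$1," in
        *",$2,"*) printf '%s\n' "$1" ;;       # already present: unchanged
        ",,")     printf '%s\n' "$2" ;;       # list was empty
        *)        printf '%s\n' "$1,$2" ;;    # append to the existing classes
    esac
}

add_variable_class "" gptext              # prints gptext
add_variable_class "foo,bar" gptext       # prints foo,bar,gptext
add_variable_class "foo,gptext" gptext    # prints foo,gptext
```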

View or set GPText Configuration Parameters

When you want to view or set GPText configuration parameters in a psql session, first execute the gptext.version() function to load the GPText configuration parameters into the session.

=# SELECT gptext.version();
            version
--------------------------------
 Greenplum Text Analytics 3.2.0
(1 row)

=# SHOW gptext.idx_delim;
 gptext.idx_delim
------------------
 ,
(1 row)

See Setting GPText Configuration Parameters for more about GPText configuration parameters.

Uninstalling GPText

To uninstall GPText, run the gptext-uninstall utility. You must have superuser permissions on all databases with GPText schemas to run gptext-uninstall.

gptext-uninstall runs only if there is at least one database with a GPText schema.

Execute:

$ gptext-uninstall


Upgrading GPText

Upgrading a GPText system to a new GPText release installs the new GPText software release on all hosts in the Greenplum cluster and then upgrades the GPText system.

Upgrading GPText and Greenplum Database at the Same Time

If you are upgrading to new releases of Greenplum Database and GPText at the same time, follow these steps:

1. Complete the Greenplum Database upgrade first and ensure the database is operational.

2. Run the GPText gptext-migrator utility to migrate your current GPText system to the newly upgraded Greenplum Database system.

3. Ensure that the current version of GPText works with the new Greenplum Database version.

4. Proceed with the GPText upgrade.

Upgrading a GPText Release

Upgrading a GPText release is a two-part process: install the new software release on the Greenplum cluster hosts and then upgrade the existing GPText system. The GPText installer performs the first part, installing the new software. The gptext-upgrade utility performs the second part, upgrading the current GPText system to the new version.

The GPText installer detects an existing GPText system and, after installing the new software release, offers to run the gptext-upgrade utility for you. If you choose to upgrade the GPText system later, you can run the gptext-upgrade utility yourself.

All upgrade tasks are executed on the Greenplum master host as the gpadmin user. The gpadmin user must have write permission in the directory where the new GPText release is to be installed, /usr/local/greenplum-text-<release>-<version> by default.

The Greenplum Database, ZooKeeper, and GPText clusters must be running. The procedure stops and restarts GPText during the upgrade.

Follow these steps:

1. Download the new GPText release for your platform from Pivotal Network.

2. Extract the release package.

$ tar xfz greenplum-text-<version>-<platform>.tar.gz

3. Make sure that ZooKeeper and GPText are running.

$ gptext-state

4. Run the GPText installer.

$ ./greenplum-text-<version>-<platform>.bin

5. The installer prompts you to accept the Pivotal license agreement and to choose and create the installation directory.

6. The installer verifies the environment to ensure that prerequisites are present, such as Python and Java. If any problems are discovered, the installer outputs an error message and stops. Correct the problem identified by the message and run the installer again.

7. After the new software has been installed on the Greenplum cluster, the installer looks for an existing GPText installation. If an existing GPText system is found, the installer asks if you wish to upgrade GPText directly.

If you answer yes, the installer runs the gptext-upgrade script. The gptext-upgrade utility validates the environment to ensure it can complete the upgrade, then executes the upgrade and restarts the GPText system. If any problems are discovered, gptext-upgrade outputs a message and quits. Fix the indicated problems and run the gptext-upgrade utility (at <NEW_GPTEXTHOME>/bin/gptext-upgrade) to complete the GPText system upgrade.

If you answer no, you must run the gptext-upgrade script after the installer completes. See the gptext-upgrade utility reference for instructions.

When upgrading GPText, you do not specify an installation configuration file as you do for the initial GPText installation.

Important: If you answer no or if the gptext-upgrade utility quits without upgrading your software, follow these steps to re-run gptext-upgrade at a later time:

a. Source the greenplum-text_path.sh script in the old GPText installation directory. For example:

$ source /usr/local/greenplum-text-<old-version>/greenplum-text_path.sh

b. Run the gptext-upgrade command from the new GPText installation directory:

$ /usr/local/greenplum-text-<new-version>/bin/gptext-upgrade

8. After the upgrade has completed, source the greenplum-text_path.sh script in the new GPText release directory and run gptext-state healthcheck to verify the GPText system:

$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh
$ gptext-state healthcheck


Introduction to Pivotal GPText

Pivotal GPText enables processing mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides business and project decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search. GPText includes powerful text search as well as support for text analysis. GPText supports business decision making by offering:

Multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which broadens the kinds of information you can find.

Multiple document sources: GPText can index documents stored in Greenplum Database tables or documents retrieved from external stores, such as HTTP or FTP servers, Amazon S3, or Hadoop HDFS. Most document formats are recognized automatically.

Less schema dependence: GPText does not require static schemas to successfully locate information; schemas can change or be quite simple and still return targeted results.

Natural language text processing: GPText provides NLP capabilities with the integrated Apache OpenNLP toolkit.

Text analytics: You can use Apache MADlib in Greenplum Database for advanced machine learning, graph, statistics, and analytics.

This chapter contains the following topics:

GPText System Architecture

GPText Sample Use Case

GPText Workflow

Text Analysis

GPText System Architecture

GPText combines a Greenplum Database cluster with an Apache SolrCloud cluster. Greenplum Database segments and GPText nodes can be deployed on the same hosts or on different hosts with network connectivity.

The following figure shows the process architecture of the combined Greenplum Database and Apache Solr clusters. The figure shows four cluster nodes with four Greenplum segments and four Solr instances deployed on each. An Apache ZooKeeper service manages the SolrCloud cluster. ZooKeeper nodes are deployed on three of the four hosts. Greenplum Database users access SolrCloud services via GPText user-defined functions installed in Greenplum databases and command-line utilities.


The figure omits the Greenplum master host, secondary master, and mirror segments for the Greenplum primary segments.

The Greenplum segments, Solr instances, and ZooKeeper nodes may all be deployed on separate hosts on the same network, depending on application and performance requirements.

The following sections describe how GPText integrates SolrCloud with Greenplum Database and how the two clusters work together to provide parallel text search capabilities in Greenplum Database and maintain high availability.

Greenplum Database Cluster

A Greenplum Database cluster consists of the following components:

A master database instance, executing on a dedicated host, conventionally named mdw. (Not illustrated)

A secondary master instance, on a host conventionally named smdw, acting as a warm standby for the master instance. (Not illustrated)

An array of database primary segment instances and mirrors deployed on segment hosts, by convention sdw1 through sdwn. A segment instance is an independent Postgres database server managing a portion of the distributed data. Each segment has a mirror (not illustrated) on another host in the cluster to provide uninterrupted service in case of a segment or segment host failure. The number of primary segments per host is determined by the hardware configuration (the number and type of processor cores, the amount of physical RAM, local storage capacity, and network capacity) as well as availability and performance requirements.

The Greenplum Database master instance, which stores no user data, coordinates the work of the segment instances. Database users log in to the master instance and submit SQL queries. The master instance creates a plan for executing the query, distributes the work to the segments, and gathers and returns the results to the user.

Apache SolrCloud

Apache Solr is a server providing access to Apache Lucene full-text indexes. Apache SolrCloud is a highly available, fault-tolerant cluster of Apache Solr servers. The term GPText cluster is another way to refer to a SolrCloud cluster deployed by GPText for use with a Greenplum Database system.

A SolrCloud cluster consists of the following components:

An Apache ZooKeeper cluster to manage the SolrCloud cluster. SolrCloud uses ZooKeeper to manage server and index configurations and to coordinate the cluster's activities. GPText can install a ZooKeeper cluster that is bound to the GPText cluster, or it can share an existing ZooKeeper cluster. If GPText installs the ZooKeeper cluster, it can be managed using GPText functions and utilities. The ZooKeeper cluster can be deployed on Greenplum Database cluster hosts or, for best performance, on separate hosts accessible to the Greenplum Database cluster.

Multiple SolrCloud server instances deployed on the Greenplum segment hosts or on other hosts on the same network. Each instance is a JVM process running Solr server. SolrCloud instances use local storage, which may be the same local storage volumes that store Greenplum Database data. The number of SolrCloud instances per host can be the same as the number of Greenplum primary segments per host, but this is not a requirement. The number of instances to execute per host is specified during GPText installation.

GPText provides document indexing and search capabilities for Greenplum Database with user-defined functions (UDFs) that access Solr APIs from within database queries.

GPText UDFs perform the following tasks:

create and manage GPText indexes

provide status information about indexes

insert documents into indexes from database tables or, for GPText external indexes, from documents stored outside of Greenplum Database

search indexes

There are also GPText UDFs and command-line utilities to configure, monitor, and manage the SolrCloud cluster, and to manage replicas, SolrCloud's high-availability mechanism. (More on replicas in the next section.)

Parallelism in GPText Indexing and Searching

SolrCloud distributes document indexes in slices called shards. Each shard is managed by a SolrCloud instance and ZooKeeper ensures that the shards are distributed evenly among the SolrCloud instances. The SolrCloud instances and Greenplum segments are not required to be on the same hosts.

With GPText, the default number of shards for an index is the number of Greenplum Database segments, so that each segment operates on an equal portion of the index. Optionally, a smaller number of shards can be specified when you create a GPText index, allowing indexing workloads to be scaled for performance requirements and resource usage.

High Availability for GPText Indexes

SolrCloud provides high availability by maintaining replicas of shards and providing automatic failover if a shard fails or becomes unavailable. One replica of each shard is the lead replica and any changes to it are applied to the other replicas. The replication factor, which determines the number of replicas to maintain for each shard, is set when the index is created. Replicas may also be added or dropped later using GPText UDFs or command-line utilities.

ZooKeeper determines the locations of shard replicas among the Solr nodes and hosts. When adding a replica using a GPText UDF or command-line utility, the new replica can be explicitly placed on a SolrCloud instance.

GPText Sample Use Case

Forensic financial analysts need to locate communications among corporate executives that point to financial malfeasance in their firm. The analysts use the following workflow:

1. Load the email records into a Greenplum database.

2. Create a Solr index of the email records.

3. Run queries that look for text strings and their authors.

4. Refine the queries until they pair a dummy company name with the top three or four executives corresponding about suspect offshore financial transactions. With this data, the analysts can focus the investigation on specific individuals rather than the thousands of authors in the initial data sample.

GPText Workflow

GPText works with Greenplum Database and Apache SolrCloud to store and index big data for information retrieval (query) purposes. High-level workflows include data loading and indexing, and data querying.

This topic describes the following information:


Data Loading and Indexing Workflow

Querying Data Workflow

Data Loading and Indexing Workflow

The following diagram shows the GPText workflow for loading and indexing data.

All client interaction with the system is through the Greenplum master instance.

1. Load data into your Greenplum Database system. Create a database table to hold data and then add the data to the table. Greenplum provides parallel data loading utilities and protocols that help to transform and load external data in various formats and from various sources. For details, see the Greenplum Database Administrator Guide, at http://gpdb.docs.pivotal.io. You can also create an external index for documents you retrieve from a web server, FTP server, Amazon S3, or HDFS.

2. Create and configure an empty GPText index. Use the gptext.create_index() user-defined function (UDF) to create an empty GPText index for a database table. GPText stores configuration files for the index in ZooKeeper.

3. Customize the index, if desired, by editing the index configuration files with the gptext-config command-line utility. You can customize the way document text is tokenized, filtered, and transformed before storing in the index and how query text is prepared to search the index.

4. Populate the index with data from the database table or external data source. Use the gptext.index() or gptext.index_external() UDF to add data to the index. These UDFs work by dispatching SQL queries to execute on each Greenplum segment. The segments execute the queries and add the results to the index using Solr APIs.

5. Commit changes to the index. Commit changes to the GPText index by calling the gptext.commit_index() UDF. Until the changes are committed, queries executed on the index cannot access any data added to the index with gptext.index(). If needed, uncommitted changes can be rolled back. SolrCloud replicates changes committed to the lead replica to the shards' non-lead replicas.
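Steps 2, 4, and 5 above can be sketched as a shell helper that prints the SQL for one table. Treat this strictly as an illustration: the helper name gptext_index_sql and the column names in the usage note are hypothetical, and the argument lists shown for gptext.create_index(), gptext.index(), and gptext.commit_index() are assumptions that should be checked against the GPText Function Reference before use. The index name is assumed to follow the database.schema.table pattern seen in the examples in this guide.

```shell
# Hypothetical helper: print SQL that creates, populates, and commits a
# GPText index. The UDF argument lists are assumptions; verify them in the
# GPText Function Reference.
gptext_index_sql() {
    db="$1"; schema="$2"; table="$3"; id_col="$4"; search_col="$5"
    index="${db}.${schema}.${table}"   # assumed index naming convention
    cat <<EOF
SELECT * FROM gptext.create_index('${schema}', '${table}', '${id_col}', '${search_col}');
SELECT * FROM gptext.index(TABLE(SELECT * FROM ${schema}.${table}), '${index}');
SELECT * FROM gptext.commit_index('${index}');
EOF
}
```

The output could be reviewed and then piped to psql, for example gptext_index_sql demo twitter message id message_text | psql demo, where id and message_text are placeholder column names.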

Querying Data Workflow

The following diagram shows the high-level GPText query process workflow:

1. A user submits a SQL query designed to search the indexed data. A GPText search query is a SQL SELECT statement on a GPText search UDF that contains full-text search expressions.

2. The Greenplum master dispatches the query to the Greenplum Database segments.

3. Each segment executes the query, using the Solr API to search its index shard. Solr analyzes and executes the search query on the lead replica for the shard.

4. The Greenplum Database segments return the results of the search query to the Greenplum Database master.

5. The Greenplum Database master aggregates the results from all segments and returns them to the client.
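As a rough illustration of step 1, the helper below prints a search statement for a given index and query string. The helper name gptext_search_sql is hypothetical, and the gptext.search() argument list (a TABLE expression, the index name, the search query, and a null filter) is an assumption to be confirmed in the GPText Function Reference. The helper does not escape single quotes in its arguments.

```shell
# Hypothetical helper: print a GPText search statement. The gptext.search()
# argument list is an assumption; verify it in the GPText Function Reference.
gptext_search_sql() {
    index="$1"; query="$2"
    printf "SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), '%s', '%s', null);\n" \
        "$index" "$query"
}
```

For example, gptext_search_sql demo.twitter.message offshore prints a statement that could be run in a psql session connected to the demo database.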

Text Analysis

GPText enables analysis of Solr indexes with Apache MADlib, an open source library for scalable in-database analytics. MADlib provides data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. You can use GPText to perform a variety of MADlib analyses.

Learn more about Apache MADlib at http://madlib.apache.org. A gppkg package for MADlib is available on the Pivotal network at http://network.pivotal.io.

The Apache OpenNLP toolkit provides advanced machine learning tools for tokenizing, recognizing, and tagging natural language text that you can enable for GPText indexing and searching. See Natural Language Processing with GPText Indexes for more information.


Administering GPText

GPText administration includes security considerations, monitoring Solr index statistics, managing and monitoring ZooKeeper, and troubleshooting.

Viewing the Cluster Configuration

GPText deploys Apache ZooKeeper and Apache Solr nodes on hosts in your Greenplum Database network. Each node is a JVM server process listening for requests from other nodes. Use the gptext-state config command to list the host and port for each ZooKeeper and Solr node and the memory configuration for Solr nodes.

$ gptext-state config
20181112:12:38:26:018080 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Cluster Configurations.
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-JVM Min|Max    Xms1024M|Xmx2048M
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Node information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Node Name         Port    Solr Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Zookeeper information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Port   Zookeeper Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-mdw    2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Done.

You don't need these details to use the GPText functions and utilities, but the information can be useful for monitoring and troubleshooting the cluster. For example, you can access the Solr Admin UI by browsing to the URL http://<hostname>:<port> on any Solr node. See Using the Solr Administration Interface for information about the Solr Admin UI.

Changing GPText Server Configuration Parameters

Configuration parameters used with GPText are built into GPText with default values. You set new values for the parameters in a Greenplum Database session using the SET command, the same way you set Greenplum Database session parameters. When you enter the SET command, GPText updates the value in ZooKeeper so that the change persists between database sessions.

With Greenplum Database 4.x and 5.x, a one-time Greenplum Database configuration change is needed so that Greenplum Database allows you to set and display GPText configuration parameters. Until you have performed this step, any attempt to set a GPText parameter results in an "Unrecognized configuration parameter" error. You must declare a custom variable class for GPText.

As the gpadmin user, enter the following commands in a shell:

$ gpconfig -c custom_variable_classes -v 'gptext'
$ gpstop -u

Once this step is completed, you can view and set GPText configuration parameters in psql.

The custom_variable_classes configuration parameter is removed in Greenplum Database 6. You can set custom variables in a database session without error, so this step is not needed for Greenplum Database 6.

To view GPText configuration parameters, you first need to fetch them from ZooKeeper into your Greenplum Database session by executing the gptext.version() UDF.

=# SELECT gptext.version();
                       version
------------------------------------------------------
 Greenplum Text Analytics 3.2.0
(1 row)

Then you can use the SHOW command to display values of the parameters, for example:

=# SHOW gptext.idx_num_shards;
 gptext.idx_num_shards
-----------------------
 0
(1 row)

See GPText Configuration Parameters for a complete list of configuration parameters.

GPText uses the current values of the configuration parameters when you create a new index, so changing a configuration parameter affects new indexes, but does not affect existing indexes.

Change the values of GPText configuration variables using the SET command in a session with a database that contains the GPText schema. The following example sets values for three configuration parameters in a psql session:

=# set gptext.idx_buffer_size=10485760;
SET
=# set gptext.idx_delim='|';
SET
=# set gptext.extension_factor=5;
SET

You can view the new value of a configuration parameter that you have set using the SHOW command:

=# show gptext.idx_delim;
 gptext.idx_delim
------------------
 |
(1 row)

Security and GPText Indexes

GPText security is based on Greenplum Database security. Your privileges to execute GPText functions depend on your privileges for the database table that is the source for the index. For example, if you have SELECT privileges for a table in a Greenplum database, then you have SELECT privileges for an index generated from that table.

Executing GPText functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Greenplum Database Administrator Guide for information about setting privileges.

ZooKeeper Administration

Apache ZooKeeper enables coordination between the Apache Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system. In ZooKeeper, a node (called a znode) can contain data, like a file, and can have child znodes, like a directory. ZooKeeper replicates data between multiple instances deployed as a cluster to provide a highly available, fault-tolerant service. Both Solr and GPText store configuration files and share status by writing data to ZooKeeper znodes. GPText stores information in the /gptext znode. The configuration files for a GPText index are in the /gptext/configs/<index-name> znode.

The number of ZooKeeper instances in the cluster determines how many ZooKeeper node failures the cluster can tolerate and still remain active. The service remains available as long as a clear majority of the nodes are running and able to communicate with each other. To tolerate a failure of n nodes the cluster must have 2n + 1 nodes. A cluster of five nodes, for example, can tolerate two failed nodes.

ZooKeeper is very fast for read requests because it stores data in memory. If ZooKeeper begins to swap memory to disk, Solr and GPText performance will suffer and the services could experience failures, so it is critical to allocate sufficient memory to the ZooKeeper Java processes. To avoid ZooKeeper instances competing with Greenplum Database segments for memory, you should deploy the ZooKeeper instances and Greenplum Database segments on different hosts. The ZooKeeper and Greenplum Database hosts must be on the same network and accessible with passwordless SSH by the gpadmin user. You can use the Greenplum Database gpssh-exkeys utility to share SSH keys between ZooKeeper and Greenplum Database hosts.

You must start the ZooKeeper cluster before you start GPText. When you start GPText, the Solr nodes each load the replicas for indexes they manage. With large numbers of indexes, shards, and replicas, starting up the cluster can generate a very high, atypical load on ZooKeeper. It can take a long time to get all indexes loaded and some ZooKeeper requests may time out waiting for responses. Using the gptext-start --slow_start option starts Solr nodes one at a time, providing a more ordered start-up and limiting the number of concurrent ZooKeeper requests.
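The majority rule for ZooKeeper cluster sizing (2n + 1 nodes to tolerate n failures) can be checked with a little integer arithmetic. The two function names below are hypothetical and are shown only to make the rule concrete.

```shell
# Hypothetical helpers illustrating the ZooKeeper majority rule.

# Failures tolerated by a cluster of N nodes: the largest n with 2n + 1 <= N.
zk_tolerated_failures() {
    nodes="$1"
    echo $(( (nodes - 1) / 2 ))
}

# Minimum cluster size needed to tolerate n failed nodes: 2n + 1.
zk_min_nodes() {
    failures="$1"
    echo $(( 2 * failures + 1 ))
}
```

Note that an even-sized cluster tolerates no more failures than the next smaller odd size: under this rule, four nodes tolerate one failure, the same as three.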


The GPText command-line utility zkManager can be used to monitor the ZooKeeper cluster. If the ZooKeeper cluster is bound to GPText, you can also start and stop the cluster using zkManager.

Checking ZooKeeper Status

Use the zkManager utility from the command line to check the ZooKeeper cluster status. The utility lists the hosts, ports, latency, and follower/leader mode for each ZooKeeper instance. If a node is down, its mode is listed as Down.

To check the ZooKeeper cluster status, run the zkManager state command.

$ zkManager state
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper state process.
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Host   port   Latency min/avg/max   Mode
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2189   0/0/22                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2190   0/0/29                leader
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2188   0/0/27                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Done.

In a database session, you can use the gptext.zookeeper_hosts() function to list the ZooKeeper hosts.

=# SELECT * FROM gptext.zookeeper_hosts();
  host  | port
--------+------
 gpdb51 | 2188
 gpdb51 | 2189
 gpdb51 | 2190
(3 rows)

Starting and Stopping the ZooKeeper Cluster

If the ZooKeeper cluster was installed by the GPText installer, the zkManager utility can start or stop the ZooKeeper cluster. To start the cluster, run the zkManager start command.

$ zkManager start
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper start process
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Starting Zookeeper:
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
20171016:16:14:48:017845 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:16:14:53:017845 zkManager:gpdb:gpadmin-[INFO]:-Done.

To stop ZooKeeper, run the zkManager stop command.

$ zkManager stop
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper stop process.
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Stop Zookeeper:
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
20171016:16:14:09:016499 zkManager:gpdb:gpadmin-[INFO]:-Done.

See the zkManager reference for more information.

Checking SolrCloud Status

You can check the status of the SolrCloud cluster and indexes by running the gptext-state utility from the command line.

To check the state of the GPText nodes and each index, run the gptext-state utility with the -D (--details) option. Example:

$ gptext-state -D
20180615:16:09:24:031986 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check GPText cluster status...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Current GPText Version: 3.0.0
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-All nodes are up and running.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Index state details.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-database   index name                storage state
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.twitter.message      Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.wikipedia.articles   Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Done.

This command reports the status of the GPText nodes and status of each GPText index.

Run gptext-state list to view just the indexes.

The gptext-state healthcheck command checks the GPText configuration files, the index status, required disk space, user privileges, and index and database consistency. By default, the required disk space check passes if there is at least 20% disk free. You can set a different disk free threshold using the --disk_free option. For example:

[gpadmin@gpdb-sandbox ~]$ gptext-state healthcheck --disk_free=25
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Execute healthcheck on GPText cluster!
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText config files ...
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText index status ...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required disk space...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required user privileges...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for indexes and database consistency...
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Done.

See the gptext-state utility reference for additional options.

Recovering GPText Nodes

Use the gptext-recover utility to recover down GPText nodes, for example after a failed Greenplum Database segment host is recovered.

With no arguments, the gptext-recover utility discovers down GPText nodes and restarts them.

With the -f (or --force) option, if a GPText node cannot be restarted and no shards are down, the node is deleted and created again on the same host. Missing replicas are added and the failed node and failed replicas are removed. If the index is in a red state, gptext-recover -f prints a message and exits.

The -H (--new_hosts) option allows recreating down GPText nodes on new hosts that replace failed hosts. The down GPText nodes are deleted and recreated on the new hosts. The argument to the -H option is a comma-separated list of the new hosts that are to replace the failed hosts. The number of new hosts must match the number of failed hosts. If shards are down, the utility advises reindexing. If only some replicas are down, it recreates the replicas on the new hosts and updates gptext.conf.

The -r option recovers replicas, but does not attempt to recover any down nodes.

Note: Before recovering GPText nodes on newly added hosts, ensure that the following GPText prerequisites have been installed on the host:

Java 1.8

Python 2.6

The Linux lsof utility

Viewing Solr Index Statistics

You can view Solr index statistics by running the gptext-state utility from the command line.

To list all GPText indexes, enter the following command at the command line:

gptext-state list

A command line that retrieves all statistics for an index:

gptext-state --index demo.wikipedia.articles

A command line that retrieves the number of documents in an index:

gptext-state --index demo.wikipedia.articles --stats_columns=num_docs

A command line that retrieves num_docs, index size, and the date and time last_modified:

gptext-state --index demo.wikipedia.articles --stats_columns num_docs,size,last_modified

Backing Up and Restoring GPText Indexes

With the gptext-backup management utility, you can back up a GPText index so that, if needed, you can quickly recover from a failure. The backup can be restored to the same GPText system or to another system with the same number of Greenplum Database segments.

The gptext-backup management utility backs up an index and its configuration files to either a shared file system, which must be mounted on and writable by each host in the Greenplum Database cluster, or to local storage on the Greenplum Database master and segment hosts.

Backing Up to a Shared File System

To back up to a shared file system, use the -p (--path) command-line option to specify the location of a directory on the mounted file system and the -n (--name) option to provide a name for the backup. Specify the index to back up with the -i (--index) option.

$ gptext-backup -i <index-name> -p <path> -n <backup-name>

The gptext-backup utility then checks that:

the GPText cluster is up

the shared file system is valid

the backup name specified with the -n option does not already exist in the directory specified with the -p option

The utility creates the new directory and then saves one copy of each index shard to that directory, along with the index's configuration files from ZooKeeper.

To save the configuration files only, with no data, add the -c (--backup_conf) command-line option.

To restore an index from a shared file system, use the gptext-restore management utility. The GPText system you restore to must be on a Greenplum Database cluster with the same number of segments. The database and schema for the index must be present.

The -i (--index) option specifies the name of the GPText index that will be restored. If the index exists, you must first drop it with the gptext.drop_index() user-defined function.

The -p (--path) option specifies the location of the directory containing the backup files, that is, the directory that gptext-backup created on the shared file system.

$ gptext-restore -i <index-name> -p <path>

You can add the -c option to restore only the configuration files to ZooKeeper and create an empty GPText index, without restoring any saved index data.

BackingUptoLocalStorage

©CopyrightPivotalSoftware,Inc,2013-2019 26 3.3.0

Page 27: Pivotal Greenplum Text

To back up to local storage on the Greenplum Database cluster, add the local keyword to the gptext-backup command line.

A local GPText backup has a unique name constructed by appending a timestamp to the index name. You do not use the -n option with local backups.

$ gptext-backup local -i <index-name>

On the master host, in the master data directory by default, the backup utility saves a JSON file with backup metadata and a directory containing the index’s configuration files from ZooKeeper.

The utility backs up each index shard on the Greenplum Database segment host with the GPText node that manages the shard’s lead replica. By default, the shard backup files are saved in a segment data directory.

The gptext-backup command output reports the locations of all backup files.

You can add the -p (--path) option to the gptext-backup command to specify a local directory where the backup will be saved. The directory must be present on every Greenplum Database host and must be writeable by the gpadmin user.

$ gptext-backup local -i <index-name> -p <path>

The backup files will be saved in the specified directory on each host instead of in the Greenplum Database master and segment data directories.

To restore a backup saved to local storage, add the local keyword to the gptext-restore command line and specify the path to the backup directory on the master host.

$ gptext-restore local -p <path>

The <path> is the full path to the directory the gptext-backup command created on the master host, including the timestamp, for example $MASTER_DATA_DIRECTORY/demo.twitter.message_2018-05-08T15:32:21.397779.
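
A local backup directory name can be split back into its index name and timestamp, which may help when choosing which backup to restore. The following Python sketch assumes the name always has the form <index-name>_<timestamp with microseconds>, a format inferred only from the single example path above. The helper parse_backup_dir and the /data/master/gpseg-1 prefix are hypothetical and are not part of GPText.

```python
from datetime import datetime
import os


def parse_backup_dir(path):
    """Split a local backup directory path into (index_name, timestamp).

    Assumes a directory name such as
    demo.twitter.message_2018-05-08T15:32:21.397779
    (format inferred from the example in this guide).
    """
    name = os.path.basename(path.rstrip("/"))
    # The timestamp follows the last underscore; index names may contain underscores.
    index_name, _, stamp = name.rpartition("_")
    timestamp = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
    return index_name, timestamp


# Hypothetical master data directory, for illustration only.
index_name, taken_at = parse_backup_dir(
    "/data/master/gpseg-1/demo.twitter.message_2018-05-08T15:32:21.397779"
)
print(index_name, taken_at.isoformat())
```

Splitting on the last underscore keeps index names that contain underscores intact, since the timestamp portion in the example contains none.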

See gptext-backup for syntax and examples for running gptext-backup. See gptext-restore for syntax and examples for running gptext-restore.

Expanding the GPText Cluster

The gptext-expand management utility adds GPText nodes to the cluster. There are two ways to add nodes:

Add GPText nodes to existing hosts in the cluster. This option increases the number of GPText nodes on each host.

Add GPText nodes to new hosts added by using the Greenplum Database gpexpand management utility to expand the Greenplum Database system.

Adding GPText Nodes to Existing Segment Hosts

To add nodes to existing segment hosts, run the gptext-expand utility with a command like the following:

gptext-expand -e -p /data1/nodes,/data2/nodes

This example adds two GPText nodes to each host.

The -e (--existing) option specifies that nodes are to be added to existing hosts.

The -p (--expand_paths) option provides a list of directories where the new nodes’ data directories are to be created. These should be the same directories that contain the Greenplum Database segment data directories and existing GPText data directories. The number of directories in the list is the number of new nodes that are added.

A directory can be repeated in the directory list multiple times to increase the number of new GPText nodes to create. For example, if there is currently one GPText node per host in the /data1/nodes directory, you could add three nodes with a command like the following:

gptext-expand -e -p /data1/nodes,/data2/nodes,/data2/nodes

This adds one node to the /data1/nodes directory and two nodes to the /data2/nodes directory so there are two GPText nodes in each directory.
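
The counting rule for the -p list (one new node per list entry, repeats allowed) can be checked with a short Python sketch. The function nodes_after_expand is a hypothetical helper written for this illustration; it only models the arithmetic described above and does not call gptext-expand.

```python
from collections import Counter


def nodes_after_expand(existing, expand_paths):
    """Return GPText nodes per directory on each host after gptext-expand -e.

    existing: dict mapping directory -> current node count per host
    expand_paths: the comma-separated value given to the -p option
    """
    counts = Counter(existing)
    # Each occurrence of a directory in the list adds one node there.
    counts.update(p for p in expand_paths.split(",") if p)
    return dict(counts)


result = nodes_after_expand(
    {"/data1/nodes": 1},
    "/data1/nodes,/data2/nodes,/data2/nodes",
)
print(result)
```

For the example in the text, the result shows two nodes in each of the two directories.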

Adding GPText nodes affects new indexes, but not existing indexes. Replicas for new indexes will be distributed across all of the nodes, including both old nodes and the newly created nodes. Replicas for indexes that existed before running gptext-expand are not automatically moved. Rebalancing existing replicas requires reindexing.

Adding GPText Nodes to New Hosts

Check that the following GPText prerequisites are installed on each new host added to the Greenplum Database cluster:

Java 1.8

Python 2.6 or greater

Linux lsof utility

New hosts must be reachable by all hosts in the GPText cluster, including existing hosts and the new hosts you are adding.

After expanding the Greenplum Database cluster with the gpexpand management utility, call gptext-expand with the -H (--new_hosts) option and a list of the new hosts on which to install GPText:

gptext-expand -H newhost1,newhost2

The gptext-expand utility installs GPText binaries on the new hosts and then creates new GPText nodes on the new hosts.

Expanding a Greenplum Database cluster increases the number of segments, so the number of GPText index shards for existing indexes must be increased to equal the new number of segments. This requires reindexing all existing documents. Newly created indexes will automatically be distributed among the new shards.

Troubleshooting

GPText errors are of the following types:

Solr errors

gptext errors

Most of the Solr errors are self-explanatory.

gptext errors are caused by misuse of a function or utility. They provide a message that tells you when you have used an incorrect function or argument.

Monitoring Logs

You can examine the Greenplum Database and Solr logs for more information if errors occur. Greenplum Database logs reside in:

segment-directory/pg_log

Solr logs reside in:

<GPDB path>/solr/logs

Determining Segment Status with gptext-state

Use the gptext-state utility to determine if any primary or mirror segments are down. See gptext-state in the GPText Management Utilities Reference.

GPText High Availability

The GPText high availability feature ensures that you can continue working with GPText indexes as long as each shard in the index has at least one working replica.

A GPText index has one shard for each Greenplum segment, so there is a one-to-one correspondence between Greenplum segments and GPText index shards. The shard managed by a Greenplum segment is an index of the documents that are managed by that segment.

The GPText high availability mechanism is to maintain multiple copies, or replicas, of the shard. The ZooKeeper service that manages SolrCloud chooses a GPText instance (SolrCloud node) for each replica to ensure even distribution and high availability. For each shard, one replica is elected leader and the Greenplum segment associated with the shard operates on this leader replica. The GPText instance managing the leader replica may or may not be on another Greenplum host, so indexing and searching operations are passed over the Greenplum cluster’s interconnect network. SolrCloud replicates changes made to the leader replica to the remaining replicas.

The following figure illustrates the relationships between Greenplum segments and GPText index shards and replicas. The leader replica for each shard is shown in green and the followers are gray.

The number of replicas to create for each shard, the replication factor, is a SolrCloud property. By default, GPText starts SolrCloud with a replication factor of three. The replication factor for each individual index is the value of the SolrCloud replication factor when the index is created. Changing the replication factor does not alter the replication factor for existing indexes.

Greenplum Segment or Host Failure

If a Greenplum primary segment fails and its mirror is activated, GPText functions and utilities continue to access the leader replica. No intervention is needed.

If a host in the cluster fails, both Greenplum and GPText are affected. Mirrors for the Greenplum primary segments located on the failed host are activated on other hosts. SolrCloud elects a new leader replica for affected shards. Because Greenplum segment mirrors and GPText shard replicas are distributed throughout the cluster, a single host failure should not prevent the cluster from continuing to operate. The performance of database queries and indexing operations will be affected until the failed host is recovered and the cluster is brought back into balance.

ZooKeeper Cluster Availability

SolrCloud is dependent on a working, available ZooKeeper cluster. For ZooKeeper to be active, a majority of the ZooKeeper cluster nodes must be up and able to communicate with each other. A ZooKeeper cluster with three nodes can continue to operate if one of the nodes fails, since two is a majority of three. To tolerate two failed nodes, the cluster must have at least five nodes so that the number of working nodes remaining after the failure is a majority. To tolerate n node failures, then, a ZooKeeper cluster must have 2n + 1 nodes. This is why ZooKeeper clusters usually have an odd number of nodes.
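
The majority rule can be worked through with a few lines of Python. This is a minimal sketch of the arithmetic only; the function names are hypothetical and nothing here queries a running ZooKeeper cluster.

```python
def min_ensemble_size(failures_to_tolerate):
    """Smallest ZooKeeper cluster that survives the given number of node failures."""
    return 2 * failures_to_tolerate + 1


def tolerated_failures(ensemble_size):
    """Number of nodes that can fail while a strict majority stays up."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority


for size in (3, 4, 5, 6, 7):
    print(size, "nodes tolerate", tolerated_failures(size), "failure(s)")
```

Running the loop shows that a four-node cluster tolerates no more failures than a three-node cluster, and a six-node cluster no more than a five-node cluster, which is the reason for preferring odd sizes.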

The best practice for a high-availability GPText cluster is a ZooKeeper cluster with five or seven nodes so that the cluster can tolerate two or three failed nodes.

Managing GPText Cluster Health

GPText document indexing and searching services remain available as long as each shard of an index has at least one working replica. To ensure availability in the event of a failure, it is important to monitor the status of the cluster and ensure that all of the index shard replicas are healthy. You can monitor the SolrCloud cluster and indexes using the SolrCloud Dashboard or using GPText functions and management utilities. Access the SolrCloud Dashboard with a web browser on any GPText instance with a URL such as http://sdw3:18983/solr. (The port numbers for GPText instances are set with the GPTEXT_PORT_BASE parameter in the installation parameters file at installation time.)

Refer to the Apache SolrCloud documentation for help using the SolrCloud Dashboard.

Monitoring the Cluster with GPText

The GPText gptext-state management utility allows you to query the state of the GPText cluster and indexes. You can also use gptext.index_status() to view the status of all indexes or a specified index.

To see the GPText cluster state, run the gptext-state command-line utility with the -d option to specify a database that has the GPText schema installed.

gptext-state -d mydb

The utility reports any GPText nodes that are down and lists the status of every GPText index. For each index, the database name, index name, and status are reported. The status column contains “Green”, “Yellow”, or “Red”:

Green – all replicas for all shards are healthy

Yellow – all shards have at least one healthy replica but at least one replica is down

Red – no replicas are available for at least one index shard
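
The three status values follow directly from per-shard replica health. The Python sketch below restates the rule as code so the precedence is explicit (Red is checked before Yellow). It illustrates the definitions above only; it is not how gptext-state is implemented, and the function and its input format are assumptions made for this example.

```python
def index_status(shards):
    """Classify an index from its replica health.

    shards: dict mapping shard name -> list of booleans, one per replica,
            True when the replica is healthy.
    """
    if any(not any(replicas) for replicas in shards.values()):
        return "Red"      # at least one shard has no available replica
    if any(not all(replicas) for replicas in shards.values()):
        return "Yellow"   # every shard is served, but some replica is down
    return "Green"        # all replicas of all shards are healthy


print(index_status({"shard1": [True, True], "shard2": [True, False]}))
```

An index with one failed replica on shard2, as in the call above, is still searchable because shard2 retains a healthy replica.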

To see the distribution of index shards and replicas in the GPText cluster, execute this SQL statement.

SELECT index_name, shard_name, replica_name, node_name FROM gptext.index_summary() ORDER BY node_name;

To list all GPText indexes, run the gptext-state list command.

gptext-state list -d mydb

The gptext-state healthcheck command checks the health of the cluster. The -f flag specifies the percentage of available disk space required to report a healthy cluster. The default is 10.

gptext-state healthcheck -f 20 -d mydb

See gptext-state in the Management Utilities reference for help with additional gptext-state options.

The gptext.index_status() user-defined function reports the status of all GPText indexes or a specified index.

SELECT * FROM gptext.index_status();

Specify an index name to report only the status of that index.

SELECT * FROM gptext.index_status('demo.twitter.message');

Adding and Dropping Replicas

The gptext-replica utility adds or drops a replica of a single index shard. Use the gptext.add_replica() and gptext.delete_replica() user-defined functions to perform the same tasks from within the database.

If a replica of a shard fails, use gptext-replica to add a new replica and then drop the failed replica to bring the index back to “Green” status.

gptext-replica add -i mydb.public.messages -s shard3

Here is the equivalent, using the gptext.add_replica() function:

SELECT * FROM gptext.add_replica('mydb.public.messages', 'shard3');

ZooKeeper determines where the replica will be located, but you can also specify the node where the replica is created:

gptext-replica add -i mydb.public.messages -s shard3 -n sdw3

In the gptext.add_replica() function, add the node name as a third argument.

To drop a replica, call gptext.delete_replica() with the name of the index, the name of the shard, and the name of the replica. You can find the name of the replica by calling gptext.index_status(index_name). The name is in the format core_node<n>. With the gptext-replica drop command, an optional -o flag specifies that the replica is to be deleted only if it is down.

gptext-replica drop -i mydb.public.messages -s shard3 -r core_node4 -o

Here is the equivalent of the above command using the gptext.delete_replica() user-defined function.

SELECT * FROM gptext.delete_replica('mydb.public.messages', 'shard3', 'core_node4', true);

GPText Best Practices

Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.

Other performance problems can result from resource contention between the Greenplum Database, Solr, and ZooKeeper clusters.

This topic discusses GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.

Indexing Large Numbers of Documents

Indexing documents consumes Solr JVM memory. When the index is committed, parts of the memory are released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents are indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory. A soft commit also makes the documents visible in searches. A soft commit does not, however, make the index updates durable; it is still necessary to commit the index with the gptext.commit_index() user-defined function.

You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml file for the index:

$ gptext-config edit -f solrconfig.xml -i <db>.<schema>.<index-name>

The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to reduce the time between automatic commits. For example, the following settings perform an automatic soft commit every 100,000 documents or 10 minutes (600,000 milliseconds).

<autoSoftCommit>
  <maxDocs>100000</maxDocs>
  <maxTime>600000</maxTime>
</autoSoftCommit>

Indexing Very Large Documents

Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.

See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.

Determining the Number of GPText Nodes to Deploy

A GPText node is a Solr instance managed by GPText. The nodes can be deployed on the Greenplum Database cluster hosts or on separate hosts accessible to the Greenplum Database cluster. The number of nodes is configured during GPText installation.

The maximum recommended number of GPText nodes you can deploy is the number of Greenplum Database primary segments. However, the best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes. Use the JAVA_OPTS installation parameter to set memory size for GPText nodes.

A single GPText node per host can easily handle several indexes. Each additional node consumes additional CPU and memory resources, so it is desirable to limit the number of nodes per host. For most GPText installations, a single GPText node per host is sufficient.

If the JVM has a very large amount of memory, however, garbage collection can cause long pauses while the JVM reorganizes memory. Also, the JVM employs a memory address optimization that cannot be used when JVM memory exceeds 32GB, so at more than 32GB, a GPText node loses capacity and performance. Therefore, no GPText node should have more than 32GB of memory.

For example, if you have 48GB memory available for GPText per host, you should deploy two GPText nodes with 24GB memory each. If you have 128GB available, you should deploy at least four JVMs, and more if garbage collection becomes a problem.
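
The sizing guidance reduces to: use the fewest nodes that keep each node at or below 32GB. The Python 3 sketch below applies that rule to the two cases in the text. The function plan_nodes is a hypothetical helper; it gives only the minimum node count and does not account for garbage collection behavior, which may justify more nodes.

```python
import math

MAX_NODE_GB = 32  # per-node ceiling recommended in this section


def plan_nodes(available_gb):
    """Return (node_count, gb_per_node) using the fewest nodes of at most 32 GB."""
    nodes = max(1, math.ceil(available_gb / MAX_NODE_GB))
    return nodes, available_gb / nodes


print(plan_nodes(48))
print(plan_nodes(128))
```

With 48GB the sketch yields two nodes of 24GB each, and with 128GB it yields four nodes of 32GB each, matching the examples above.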

Configure Maximum JVM Heap Size

Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out of memory error will eventually occur.

Monitor performance at startup and during index creation and increase the JVM size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Development Kit, to monitor Java heap usage. If garbage collections are occurring too frequently and freeing too little memory, the JVM heap should be increased.

The JVM size is initially configured during GPText installation by setting the JAVA_OPTS parameter in the installation configuration file. After installation, use the gptext-config jvm command to increase the JVM heap size. For example, this gptext-config jvm command sets the JVM maximum heap option to 4GB:

$ gptext-config jvm -o "-Xmx=4096M"

Manage Indexing and Search Loads

With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.

Terms Queries and Out of Memory Errors

The gptext.terms() function retrieves term vectors from documents that match a query. An out of memory error may occur if the documents are large, or if the query matches a large number of documents on each node. Other factors can contribute to out of memory errors when running a gptext.terms() query, including the maximum memory available to the Solr nodes (-Xmx value in JAVA_OPTS) and concurrent queries.

If you experience out of memory errors with gptext.terms() you can set a lower value for the term_batch_size GPText configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size set to 500. Lowering the value may prevent out of memory errors, but performance of terms queries can be affected.

See GPText Configuration Parameters for help setting GPText configuration parameters.

Configure File System Caching for ZooKeeper

Good Solr performance is dependent on fast response for ZooKeeper requests. ZooKeeper performs best when its database is cached so it does not have to go to disk for lookups. If you find that ZooKeeper JVMs have frequent disk accesses, look for ways to improve file caching or move ZooKeeper disks to faster storage.

The ZooKeeper zkClientTimeout parameter is the length of time a client may go without communicating with ZooKeeper before its session expires.

Troubleshooting Hadoop Connection Problems

This section describes Hadoop-related problems and potential solutions to these issues.

DataNode Access Errors

You may experience Hadoop access errors with GPText if any DataNodes in the Hadoop cluster reside in a multi-homed network. GPText uses an external IP address to access the HDFS NameNode. GPText encounters an error when the NameNode provides an internal IP address for a DataNode. In this situation, additional configuration is required to configure GPText to perform its own DNS resolution of DataNode hostnames.

Perform the following procedure to explicitly configure DNS resolution of DataNode hostnames:

1. Locate a local copy of the Hadoop authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_conf:

$ cd /home/gpadmin/auths/hdfs_conf
$ ls
core-site.xml hdfs-site.xml user.txt

2. Open hdfs-site.xml in the editor of your choice. For example:

$ vi hdfs-site.xml

3. Add the following property block to the file, and then save the file and exit:

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>

This property allows GPText hosts to perform their own DNS resolution of HDFS DataNode hostnames.

4. Re-upload the modified configuration to ZooKeeper. For example, if the hdfs_conf directory includes the authentication configuration files for a Hadoop cluster with <config_name> hdfs_bill_auth:

$ cd ..
$ gptext-external upload -t hdfs -c hdfs_bill_auth -p hdfs_conf

5. Determine the hostname-to-IP address mapping for all DataNodes, and add the associated entries into the /etc/hosts file on all GPText client hosts.
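
For step 5, the /etc/hosts entries follow the usual "address, then hostname" layout. The Python sketch below formats such lines from a mapping you have gathered yourself. The DataNode names and addresses shown are placeholders from documentation-reserved ranges, and hosts_entries is a hypothetical helper; the sketch prints the lines and does not modify /etc/hosts.

```python
def hosts_entries(datanodes):
    """Format /etc/hosts lines from a {hostname: ip_address} mapping."""
    return ["{0}\t{1}".format(ip, host) for host, ip in sorted(datanodes.items())]


# Hypothetical DataNode names and addresses, for illustration only.
example = {
    "datanode1.example.com": "203.0.113.11",
    "datanode2.example.com": "203.0.113.12",
}
print("\n".join(hosts_entries(example)))
```

Review the printed lines and append them to /etc/hosts on each GPText client host using your normal configuration management process.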

Kerberos-Related Errors

The following problems are specific to Hadoop clusters secured with Kerberos.

Clock Skew

A login attempt to a Hadoop cluster secured with Kerberos will fail if clock skew between GPText client hosts and the Kerberos KDC host is too great. In this situation, you may see the following error in the Solr log:

java.io.IOException caused by a KrbException noting “Clock skew too great”

To resolve this situation, ensure that the clocks on the Kerberos KDC host and GPText client hosts are synchronized.

Timeout Errors

A login attempt to a Hadoop cluster secured with Kerberos may fail with timeout errors when the kdc and admin_server settings in the krb5.conf file are specified with a hostname, and the GPText client hosts cannot resolve the hostname. In this situation, you may see one of the following errors in the Solr log:

org.apache.solr.common.SolrException: Failed to login HDFS message caused by a java.io.IOException specifying javax.security.auth.login.LoginException: Receive timed out

java.nio.channels.UnresolvedAddressException with SocketIOWithTimeout referenced in the stack trace

In this situation, you may choose either of the following:

Update the Kerberos krb5.conf file to specify the kdc and admin_server settings using IP addresses, or

Update all GPText hosts to perform their own DNS resolution of the Kerberos KDC server.

If you choose to update the krb5.conf file:

1. Locate a local copy of the Hadoop Kerberos authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_kerb_conf:

$ cd /home/gpadmin/auths/hdfs_kerb_conf
$ ls
core-site.xml hdfs-site.xml keytab krb5.conf user.txt

2. Open krb5.conf in the editor of your choice. For example:

$ vi krb5.conf

3. Replace the KERBEROS block attributes with their equivalent IP addresses and then save the file and exit. For example:

[realms]
  KERBEROS = {
    kdc = <kdc_ipaddress>
    admin_server = <admin_server_ipaddress>
  }

4. Re-upload the modified configuration to ZooKeeper. For example, if the directory named hdfs_kerb_conf includes the authentication configuration files for a Hadoop cluster defined with the <config_name> hdfs_kerb_auth:

$ cd ..
$ gptext-external upload -t hdfs -c hdfs_kerb_auth -p hdfs_kerb_conf

Alternatively, if you choose to configure the GPText hosts to perform their own DNS resolution of the Kerberos KDC server, add an entry for the KDC hostname-to-IP address mapping to the /etc/hosts file on all GPText client hosts.

Working With GPText Indexes

Indexing prepares documents for text analysis and fast query processing. This topic shows you how to create GPText indexes and add documents from Greenplum Database tables to them, and how to maintain and customize indexes for your own applications.

For help indexing and searching documents stored outside of Greenplum Database see Working With GPText External Indexes.

Setting Up the Sample Database

The examples in this documentation work with a demo database containing three database tables, called wikipedia.articles, twitter.message, and store.products. If you want to run the examples yourself, follow the instructions in this section to set up the demo database.

1. Log in to the Greenplum Database master as the gpadmin user and create the demo database.

$ createdb demo

2. Open an interactive shell for executing queries in the demo database.

$ psql demo

3. Create the articles table in the wikipedia schema with the following statements.

CREATE SCHEMA wikipedia;
CREATE TABLE wikipedia.articles (
    id int8 primary key,
    date_time timestamptz,
    title text,
    content text,
    refs text
) DISTRIBUTED BY (id);

4. Create the message table in the twitter schema with the following statements.

CREATE SCHEMA twitter;
CREATE TABLE twitter.message (
    id bigint,
    message_id bigint,
    spam boolean,
    created_at timestamp without time zone,
    source text,
    retweeted boolean,
    favorited boolean,
    truncated boolean,
    in_reply_to_screen_name text,
    in_reply_to_user_id bigint,
    author_id bigint,
    author_name text,
    author_screen_name text,
    author_lang text,
    author_url text,
    author_description text,
    author_listed_count integer,
    author_statuses_count integer,
    author_followers_count integer,
    author_friends_count integer,
    author_created_at timestamp without time zone,
    author_location text,
    author_verified boolean,
    message_url text,
    message_text text
)
DISTRIBUTED BY (id)
PARTITION BY RANGE (created_at)
( START (DATE '2011-08-01') INCLUSIVE
  END (DATE '2011-12-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
CREATE INDEX id_idx ON twitter.message USING btree (id);

5. Create the store.products table with these statements.

CREATE SCHEMA store;
CREATE TABLE store.products (
    id bigint,
    title text,
    category varchar(32),
    brand varchar(32),
    price float
) DISTRIBUTED BY (id);

6. Download test data for the three tables here. Right-click the link, save the file, and then copy it to the gpadmin user’s home directory.

7. Extract the data files with this tar command.

$ tar xvfz gptext-demo-data.tgz

8. Load the wikipedia data into the wikipedia.articles table using the psql \COPY metacommand.

\COPY wikipedia.articles FROM '/home/gpadmin/demo/articles.csv' HEADER CSV;

The articles table now contains text from 23 Wikipedia articles.

9. Load the twitter data into the twitter.message table using the following psql \COPY metacommand.

\COPY twitter.message FROM '/home/gpadmin/demo/twitter.csv' CSV;

The message table now contains 1730 tweets from August to October, 2011.

10. Load the products data into the store.products table with the following psql \COPY metacommand.

\COPY store.products FROM '/home/gpadmin/demo/products.csv' HEADER CSV;

The products table now contains 50 rows. This table is used to demonstrate faceted search queries. See Creating Faceted Search Queries.

Setting up the GPText Command-line Environment

To work with GPText indexes, you must first set up your environment and add the GPText schema to the database containing the documents (Greenplum Database data) you want to index.

To set the environment, log in as the gpadmin user and source the Greenplum Database and GPText environment scripts. The Greenplum Database environment must be set before you source the GPText environment script. For example, if both Greenplum Database and GPText are installed in the /usr/local/ directory, enter these commands:

$ source /usr/local/greenplum-db-<version>/greenplum_path.sh
$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh

With the environment now set, you can access the GPText command-line utilities.

Adding the GPText Schema to a Database

Use the gptext-installsql utility to add the GPText schema to databases containing data you want to index with GPText. You perform this task one time for each database. In this example, the gptext schema is installed into the demo database.

$ gptext-installsql demo

The gptext schema provides user-defined types, tables, views, and functions for GPText. This schema is reserved for GPText. If you create any new objects in the gptext schema, they will be lost when you reinstall the schema or upgrade GPText.

Creating GPText Indexes and Indexing Data

The general steps for creating a GPText index and indexing documents are:

1. Create an empty Solr index

2. Customize the index (optional)

3. Populate the index

4. Commit the index

After you complete these steps, you can create and execute a search query or implement machine learning algorithms. Searching GPText indexes is described in the Querying GPText Indexes topic.

The following steps are completed by executing SQL commands and GPText functions in the database. Refer to the GPText Function Reference for details about the GPText functions described in the following examples.

Create an empty GPText index

A GPText index is an Apache Solr collection containing documents added from a Greenplum Database table. There can be one GPText index per Greenplum Database table. Each row in the database table is a document that can be added to the GPText index.

If the database table is partitioned, there is one GPText index for all partitions. You must specify the root table name when creating the index and adding documents to it. GPText provides search semantics that enable searching partitions efficiently.

A GPText external index is a Solr index for documents that are located outside of Greenplum Database. GPText provides user-defined functions to create external indexes and add documents to them. See Working with GPText External Indexes.

A GPText index, by default, has one Solr shard for each Greenplum Database segment. You can specify fewer shards when you create an index by changing the gptext.idx_num_shards configuration parameter from 0 to the number of shards you want before you create the index. See Specifying the Number of Index Shards for information about using this option.

The gptext.create_index() function creates a new GPText index. This function has two signatures:

gptext.create_index(<schema_name>, <table_name>, <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

or

gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>, <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

The <schema_name> and <table_name> arguments specify the database table that contains the source documents.

The <id_col_name> argument is the name of the table column that contains a unique identifier for each row. The <id_col_name> column can be of type int4, int8, varchar, text, or uuid.

The <def_search_col_name> argument is the name of the table column that contains the content you want to search by default. For example, if you want to index and search just the content column, you can use the first signature and specify the content column name in the <def_search_col_name> argument.

The final, optional argument, <if_check_id_uniqueness>, is a Boolean argument. When true, the default, attempting to add a document with an ID that already exists in the index generates an error. If you set the argument to false, you can add documents with the same ID, but when you search the index all documents with the same ID are returned.

The following command creates an index for the twitter.message table, with the id column as the unique ID field and the message_text column for the default search column:

=# SELECT * FROM gptext.create_index('twitter', 'message', 'id', 'message_text');

To verify that the demo.twitter.message index was created, call gptext.index_status():

=# SELECT * FROM gptext.index_status('demo.twitter.message');
      index_name      | shard_name | shard_state | replica_name | replica_state |                  core                   |    node_name    |        base_url        | is_leader | partitioned | external_index
----------------------+------------+-------------+--------------+---------------+-----------------------------------------+-----------------+------------------------+-----------+-------------+----------------
 demo.twitter.message | shard1     | active      | core_node3   | active        | demo.twitter.message_shard1_replica_n1  | sdw2:18984_solr | http://sdw2:18984/solr | t         | t           | f
 demo.twitter.message | shard1     | active      | core_node5   | active        | demo.twitter.message_shard1_replica_n2  | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
 demo.twitter.message | shard2     | active      | core_node7   | active        | demo.twitter.message_shard2_replica_n4  | sdw2:18983_solr | http://sdw2:18983/solr | f         | t           | f
 demo.twitter.message | shard2     | active      | core_node9   | active        | demo.twitter.message_shard2_replica_n6  | sdw1:18984_solr | http://sdw1:18984/solr | t         | t           | f
 demo.twitter.message | shard3     | active      | core_node11  | active        | demo.twitter.message_shard3_replica_n8  | sdw2:18984_solr | http://sdw2:18984/solr | t         | t           | f
 demo.twitter.message | shard3     | active      | core_node13  | active        | demo.twitter.message_shard3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
 demo.twitter.message | shard4     | active      | core_node15  | active        | demo.twitter.message_shard4_replica_n12 | sdw2:18983_solr | http://sdw2:18983/solr | f         | t           | f
 demo.twitter.message | shard4     | active      | core_node16  | active        | demo.twitter.message_shard4_replica_n14 | sdw1:18984_solr | http://sdw1:18984/solr | t         | t           | f
(8 rows)

This example was executed on a Greenplum Database cluster with four primary segments. Four shards were created, one for each segment, and each shard has two replicas.
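
One way to read the gptext.index_status() output is to check that every shard has exactly one leader replica. The Python sketch below does this for four of the columns from the rows above, copied into a list by hand. The helper leaders_by_shard is hypothetical; in practice you would fetch the rows with a database client of your choice.

```python
from collections import defaultdict

# (shard_name, replica_name, node_name, is_leader), copied from the output above.
rows = [
    ("shard1", "core_node3", "sdw2:18984_solr", True),
    ("shard1", "core_node5", "sdw1:18983_solr", False),
    ("shard2", "core_node7", "sdw2:18983_solr", False),
    ("shard2", "core_node9", "sdw1:18984_solr", True),
    ("shard3", "core_node11", "sdw2:18984_solr", True),
    ("shard3", "core_node13", "sdw1:18983_solr", False),
    ("shard4", "core_node15", "sdw2:18983_solr", False),
    ("shard4", "core_node16", "sdw1:18984_solr", True),
]


def leaders_by_shard(status_rows):
    """Map each shard name to the list of replicas flagged as leader."""
    leaders = defaultdict(list)
    for shard, replica, _node, is_leader in status_rows:
        if is_leader:
            leaders[shard].append(replica)
    return dict(leaders)


for shard, replicas in sorted(leaders_by_shard(rows).items()):
    print(shard, replicas)
```

In this output each of the four shards has a single leader, and the two replicas of every shard are on different hosts (sdw1 and sdw2).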

You can also run the gptext-state -D command-line utility to verify the index was created. See the gptext-state reference for details.

The demo.twitter.message GPText index is configured, by default, to index all columns in the twitter.message database table. You can write search queries that contain criteria using any column in the table.

If you want to index and search a subset of the table columns, you can use the second gptext.create_index() signature, specifying the columns to index in the <p_columns> argument and the data types of those columns in the <p_types> argument. The <p_columns> and <p_types> arguments are text arrays. The id column name and default search column name must be included in the arrays.

Use the second gptext.create_index() signature to create an index for the wikipedia.articles table. This index will allow you to search on the title, content, and refs columns. Note that the id column and default search column are still specified in separate arguments following the <p_columns> and <p_types> arrays.

=# SELECT * FROM gptext.create_index('wikipedia', 'articles', '{id, title, content, refs}', '{long, text_intl, text_intl, text_intl}', 'id', 'content', true);
INFO:  Created index demo.wikipedia.articles
 create_index
--------------
 t
(1 row)

Because the date_time column was omitted from the <p_columns> and <p_types> arrays, it will not be possible to search the wikipedia.articles index on date with the GPText search functions.

Customize the index (optional)

Creating a GPText index generates a set of configuration files for the index. Before you add documents to the index, you can customize the configuration files to change the way data is indexed and stored. You can customize an index later, after you have added documents to it, but you must then reindex the data to take advantage of your customizations.

One common customization is to remap data types for some database columns. In the managed-schema configuration file for an index, GPText maps the data types for each field from the Greenplum Database type to an equivalent Solr data type. GPText applies default mappings (see GPText and Solr Data Type Mappings), but your index may be more effective if you use a different mapping for some fields.

The twitter.message table, for example, has a message_text text column that contains tweets. By default, GPText maps text columns to the Solr text_intl (international text) type. The GPText text_sm (social media text) type is a better mapping for a text column that contains social media idioms such as emoticons.

Follow these steps to remap the message_text field to the text_sm type.

1. Use the gptext-config utility to edit the managed-schema file for the demo.twitter.message index.

$gptext-configedit-idemo.twitter.message-fmanaged-schema

The managed-schema fileloadsinatexteditor(normallyvi).

2. Findthe <field> elementforthe message_text field.

<fieldname="message_text"stored="false"type="text_intl"indexed="true"/>

3. Changethe type attributefrom text_intl to text_sm .

<fieldname="message_text"stored="false"type="text_sm"indexed="true"/>

4. Savethefileandexittheeditor.

There are many other ways to customize a GPText index. For example, you can omit fields from the index by changing the indexed attribute of the <field> element to false, store the contents of the field in the index by changing the stored attribute to true, or use gptext-config to edit the stopwords.txt file to specify additional words to ignore when indexing.

See Customizing GPText Indexes to learn how data type mapping determines how Solr analyzes and indexes field contents and for more ways to customize GPText indexes.

Populate the index

To populate the index, use the table function gptext.index(), which has the following syntax:

SELECT * FROM gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>);

To index all rows in the twitter.message table, execute this command:

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
 dbid | num_docs
------+----------
    2 |      892
    3 |      838
(2 rows)

This command indexes the rows in the wikipedia.articles table.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
 dbid | num_docs
------+----------
    3 |       11
    2 |       12
(2 rows)

The results of this command show that 23 documents from two segments were added to the index.

The first argument of the gptext.index() function is a table expression. TABLE(SELECT * FROM wikipedia.articles) creates a table expression from the articles table, using the table function TABLE.

You can choose the data to index or update by changing the inner SELECT statement in the query so that it returns only the rows you want to index. When adding new documents to an existing index, for example, specify a WHERE clause in the inner SELECT of the gptext.index() call to choose only the new rows to index.

The inner SELECT statement could also be a query on a different table with the same structure, or a result set constructed with an arbitrarily complex join, provided the columns specified in the gptext.create_index() function are present in the results. If you index data from a source other than the table used to create the index, be sure the distribution key for the result set matches the distribution key of the base table. The Greenplum Database SELECT statement has a SCATTER BY clause that you can use to specify the distribution key for the results from a query. See Specifying a distribution key with SCATTER BY for more about the distribution policy and GPText indexes.
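As a minimal sketch of how the inner SELECT can be varied, the Python helper below assembles a gptext.index() statement with an optional WHERE clause and an optional SCATTER BY clause. The helper name build_index_statement, the date cutoff, and the use of id as the distribution column are hypothetical; only the overall statement shape comes from the syntax shown above. The helper does plain string formatting, so it is suitable only for trusted, hard-coded inputs.

```python
def build_index_statement(index_name, source_table, where=None, scatter_by=None):
    """Assemble a gptext.index() call as a SQL string (trusted inputs only)."""
    inner = f"SELECT * FROM {source_table}"
    if where:
        # Restrict the rows to index, e.g. only rows added since the last run.
        inner += f" WHERE {where}"
    if scatter_by:
        # Give the result set the same distribution key as the base table.
        inner += f" SCATTER BY {scatter_by}"
    return f"SELECT * FROM gptext.index(TABLE({inner}), '{index_name}');"


# Hypothetical example: index only rows newer than an assumed cutoff date.
statement = build_index_statement(
    "demo.wikipedia.articles",
    "wikipedia.articles",
    where="date_time > '2017-08-27'",
    scatter_by="id",
)
print(statement)
```

Remember to commit the index after running a statement like this one.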

Commit the index

After you create and populate an index, you commit the index using gptext.commit_index(<index_name>).

This example commits the documents added to the indexes in the previous example.

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)

The gptext.commit_index() function commits any new data added to or deleted from the index since the last commit.

Managing GPText Indexes

GPText provides command-line utilities and functions you can use to perform these GPText management tasks:

Configuring an index

Optimizing an index

Specifying a distribution policy with SCATTER BY

Deleting from an index

Dropping an index

Adding a field to an index

Dropping a field from an index

Listing all indexes

Configuring an index

You can modify your indexing behavior globally by using the gptext-config utility to edit a set of index configuration files. The files you can edit with gptext-config are:

solrconfig.xml – Contains most of the parameters for configuring Solr itself (see http://wiki.apache.org/solr/SolrConfigXml).

managed-schema – Defines the analyzer chains that Solr uses for various different types of search fields (see Text Analyzer Chains).

stopwords.txt – Lists words you want to eliminate from the final index.

protwords.txt – Lists protected words that you do not want to be modified by the analyzer chain. For example, iPhone.

synonyms.txt – Lists words that you want replaced by synonyms in the analyzer chain.

elevate.xml – Moves specific words to the top of your final index.

emoticons.txt – Defines emoticons for the text_sm social media analyzer chain (see The emoticons.txt file).

You can also use gptext-config to move files.

Optimizing an index

The function gptext.optimize_index(<index_name>, <max_segments>) merges all segments into a small number of segments (<max_segments>) for increased efficiency.

Example:

=# SELECT * FROM gptext.optimize_index('demo.wikipedia.articles', 10);
 optimize_index
----------------
 t
(1 row)

Specifying the number of index shards

The maximum number of index shards GPText allows is the same as the number of Greenplum Database segments. However, you can specify fewer shards for an index, and you may find that having fewer shards uses resources more efficiently, without affecting performance significantly.

Solr has two methods of assigning documents to index shards, the implicit router and the compositeId router. With either router, Solr can determine the shard a document belongs to given the document's unique ID. The value of the gptext.idx_num_shards GPText configuration parameter determines which router to use when the index is created.

When the value of gptext.idx_num_shards is 0, the default, GPText creates indexes using the implicit router. With the implicit router, GPText supplies Solr a list of shard names (shard1, shard2, shard3, ...), one shard per Greenplum Database segment. This is the default sharding scheme for GPText.

When the value of gptext.idx_num_shards is an integer greater than 0 (but not greater than the number of Greenplum Database segments), GPText creates indexes using the compositeId router. With this router, GPText supplies the number of shards to create, and Solr determines how to route documents to the shards.

The final, optional argument of the gptext.create_index() function, if_check_id_uniqueness, specifies whether the index allows documents with duplicate IDs. It is true by default; if you set it to false, the index can have multiple versions of the same document and a search can return more than one result for a document. The compositeId router does not support duplicate IDs, so if you set if_check_id_uniqueness to false when you create an index, GPText will always use the implicit router and there will be one shard per Greenplum Database segment.

See Changing GPText Server Configuration Parameters for instructions to set GPText configuration parameters.
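The rules in this section can be summarized as a small decision function. The Python sketch below is only a model of the behavior described above, not GPText code; the function name choose_router and the decision to raise an error for out-of-range values are assumptions.

```python
def choose_router(idx_num_shards, num_segments, check_id_uniqueness=True):
    """Model which Solr router and shard count an index gets, per the text above.

    idx_num_shards      -- value of the gptext.idx_num_shards parameter
    num_segments        -- number of Greenplum Database segments
    check_id_uniqueness -- the if_check_id_uniqueness argument of create_index
    """
    if idx_num_shards < 0 or idx_num_shards > num_segments:
        # Assumption: values outside 0..num_segments are treated as invalid.
        raise ValueError("idx_num_shards must be between 0 and num_segments")
    if idx_num_shards == 0 or not check_id_uniqueness:
        # Default scheme: one shard per Greenplum Database segment.
        return ("implicit", num_segments)
    # Solr decides how to route documents among the requested shards.
    return ("compositeId", idx_num_shards)


print(choose_router(0, 8))          # default setting
print(choose_router(4, 8))          # fewer shards than segments
print(choose_router(4, 8, False))   # duplicate IDs allowed
```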

Specifying a distribution policy with SCATTER BY

The first parameter of gptext.index() is a table expression, such as TABLE(SELECT * FROM wikipedia.articles). The query in this parameter must have the same distribution policy as the table you are indexing so that documents added to the index are associated with the correct Greenplum Database segments. Some queries, however, have no distribution policy or they have a different distribution policy. This could happen if the query is a join of two or more tables or a query on an intermediate (staging) table that is distributed differently than the base table for the index.

To specify a distribution policy for a query result set, the Greenplum Database SELECT statement has a SCATTER BY clause.

TABLE(SELECT * FROM wikipedia.articles SCATTER BY <distrib_id>)

where distrib_id is the name or number of the column used to distribute the base table for the index.

Deleting from an index

You can delete from an index using a query with the function gptext.delete(<index_name>, <query>). This deletes from the index all documents that match the search query. To delete all documents, use the query '*'.

After a successful deletion, execute gptext.commit_index(<index_name>) to commit the change.

This example deletes all documents containing "toxin" in the default search field.

=# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'toxin');
 delete
--------
 t
(1 row)

SELECT * FROM gptext.commit_index('demo.wikipedia.articles');

Example that deletes all documents from the index:

SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*');

Be sure to commit changes to the index after deleting documents.

SELECT * FROM gptext.commit_index('demo.wikipedia.articles');

Dropping an index

You can completely remove an index with the gptext.drop_index(<index_name>) function.

Example:

SELECT * FROM gptext.drop_index('demo.wikipedia.articles');

Adding a field to an index

You can add a field to an existing index using the gptext.add_field() function. For example, you can add a field to the index after a column is added to the underlying database table or you can add a field to index a column that was not specified when the index was created.

GPText maps the Greenplum Database field type to an equivalent Solr data type automatically. See GPText and Solr Data Type Mappings for a table of data type mappings.

CREATE TABLE myarticles (
    id int8 primary key,
    date_time timestamptz,
    title text,
    content text,
    refs text
) DISTRIBUTED BY (id);

SELECT * FROM gptext.create_index('wikipedia', 'myarticles', 'id', 'content', true);
... populate the index ...
SELECT * FROM gptext.commit_index('demo.wikipedia.myarticles');

ALTER TABLE myarticles ADD notes text;
SELECT * FROM gptext.add_field('demo.wikipedia.myarticles', 'notes', false, false);
SELECT * FROM gptext.reload_index('demo.wikipedia.myarticles');

Adding a field to a GPText index requires the base table to be available. If you drop the table after creating the index, you cannot add fields to the index.

Dropping a field from an index

You can drop a field from an existing index with the gptext.drop_field() function. After you have dropped fields, call gptext.reload_index() to reload the index.

Example:

SELECT * FROM gptext.drop_field('demo.wikipedia.myarticles', 'notes');
SELECT * FROM gptext.reload_index('demo.wikipedia.myarticles');

Listing all indexes

You can list all indexes in the GPText cluster using the gptext-state command-line utility. For example:

$ gptext-state -D
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Execute GPText state ...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Check zookeeper cluster state ...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Check GPText cluster status...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Current GPText Version: 2.1.2
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-All nodes are up and running.
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-Index state details.
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-   database    index name                 state
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-   wikipedia   demo.wikipedia.articles    Green
20170822:10:11:28:029752 gptext-state:gpsne:gpadmin-[INFO]:-Done.

Storing field content in an index

Solr can store the contents of columns in the index so that results of a search on the index can include the column contents. This makes it unnecessary to join the search query results with the original table. You can even store the contents of database columns that are not indexed and return that content with search results. GPText returns the additional field content in a buffer added to the search results. Individual fields can be retrieved from this buffer using the gptext.gptext_retrieve_field(), gptext.gptext_retrieve_field_int(), and gptext.gptext_retrieve_field_float() functions.

One design pattern is to store content for all of a table's columns in the GPText index so the database table can then be truncated or dropped. Additional documents can be added to the GPText index later by inserting them into the truncated table, or into a temporary table with the same structure, and then adding them to the index with the gptext.index() function.

To enable storing content in a GPText index, you must edit the managed-schema file for the index. The <field> element for each field has a stored attribute, which defaults to false, except for the unique id field.

Follow these steps to configure the demo.wikipedia.articles index to store content for the title, content, and refs columns.

1. Log in to the master as gpadmin and use gptext-config to edit the managed-schema file.

$ gptext-config edit -i demo.wikipedia.articles -f managed-schema

2. Find the <field> elements for the columns you want to store in the index. Note that <field> elements with names beginning with an underscore are internal fields and should not be modified. The “title”, “content”, and “refs” fields in this example are indexed, but not stored.

<field name="__temp_field" type="intl_text" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" stored="true" type="long" indexed="true"/>
<field name="__pk" stored="true" indexed="true" type="long"/>
<field name="title" stored="false" type="text" indexed="true"/>
<field name="content" stored="false" type="text" indexed="true"/>
<field name="refs" stored="false" type="text" indexed="true"/>

3. For each field you want to store in the index, change the stored attribute from "false" to "true".

<field name="title" stored="true" type="text" indexed="true"/>
<field name="content" stored="true" type="text" indexed="true"/>
<field name="refs" stored="true" type="text" indexed="true"/>

4. Save the file and, if any documents were already added to the index, reindex the table. See Retrieving Stored Field Content for information about retrieving the stored content with GPText query results.

For more about the contents of the managed-schema file and additional ways to customize GPText indexes see Customizing GPText Indexes.

Creating a GPText index for a Greenplum Database partitioned table

Creating a GPText index for a partitioned Greenplum Database table using gptext.create_index() is the same as creating an index for a non-partitioned table. You must supply the name of the root partition, however; if you attempt to create a GPText index for a child partition, the gptext.create_index() function issues an error message.

GPText recognizes a partitioned table and adds a __partition field to the index. Then when you add documents to the index, GPText saves the child partition table name in the __partition field. You can use the __partition field to create GPText queries that search and filter by partition.

Unlike Greenplum Database, which manages child partitions as separate database tables, GPText does not create a separate Solr collection for each database partition because the larger number of Solr cores could adversely affect the capacity and performance of the Solr cluster.

The demo.twitter.message table created in the Setting Up the Sample Database section is a partitioned table. See Searching Partitioned Tables for examples of searching partitions.

Adding and dropping partitions from GPText indexes

You can add new partitions to, and drop partitions from, Greenplum Database partitioned tables. If you have created a GPText index on a partitioned table, when you add or drop partitions in the base database table, you must perform a parallel GPText index operation.

When a new partition is added, the partition can be indexed once the data is in place. You can select rows directly from the newly added child partition table to index the data. First, use the gptext.partition_status() function to find the names of child partition tables.

=# SELECT * FROM gptext.partition_status('demo.twitter.message');
           partition_name           |    inherits_name     | level | cons
------------------------------------+----------------------+-------+------
 demo.twitter.message_1_prt_1       | demo.twitter.message |     1 | ((created_at >= '2011-08-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-09-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_2       | demo.twitter.message |     1 | ((created_at >= '2011-09-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-10-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_3       | demo.twitter.message |     1 | ((created_at >= '2011-10-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-11-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_4       | demo.twitter.message |     1 | ((created_at >= '2011-11-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-12-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_dec2011 | demo.twitter.message |     1 | ((created_at >= '2011-12-01 00:00:00'::timestamp without time zone) AND (created_at < '2112-01-01 00:00:00'::timestamp without time zone))
(5 rows)

In the example above, a new partition with the name twitter.message_1_prt_dec2011 was added to the demo.twitter.message table. The following statements add the data from the new partition to the GPText index and commit the changes.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message_1_prt_dec2011), 'demo.twitter.message');
 dbid | num_docs
------+----------
    3 |      109
    2 |      128
(2 rows)

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)

The name of the new child partition file (excluding the database and schema names) is saved in the __partition field in the index.

When a partition is deleted from a partitioned table, the data from the partition can be deleted from the GPText index by specifying the partition name in the <search> argument of the gptext.delete() function. Be sure to commit the index after deleting the partition.

=# SELECT * FROM gptext.delete('demo.twitter.message', '__partition:message_1_prt_dec2011');
 delete
--------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)
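Because the __partition field holds the child partition name without the database and schema names, the delete query can be derived from the name that gptext.partition_status() reports. The Python sketch below shows that derivation; the helper name partition_delete_query is hypothetical, and it assumes the partition name itself contains no dots.

```python
def partition_delete_query(partition_name):
    """Build the gptext.delete() search query for one child partition.

    partition_name is a qualified name such as
    'demo.twitter.message_1_prt_dec2011'; only the last component is
    stored in the __partition field.
    """
    table_name = partition_name.split(".")[-1]
    return f"__partition:{table_name}"


query = partition_delete_query("demo.twitter.message_1_prt_dec2011")
print(query)
```

The resulting string is what the example above passes as the second argument of gptext.delete().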

Querying GPText Indexes

To retrieve data, you submit a query that performs a search based on criteria that you specify. Simple queries return straightforward results. You can use the default query parser, or specify a different query parser at query time.

Creating a Simple Search Query

After a Solr index is committed, you can run queries with the gptext.search() function, which has this syntax:

gptext.search(<src_table>, <index_name>, <search_query>, <filter_queries>[, <options>])

The <search_query> argument is a text value that contains a Solr query. The <filter_queries> argument is an array of queries that restrict the set of documents to search.

The default Solr Standard Query Parser has a rich query syntax that includes wildcard characters, Boolean operators, proximity and range searches, and fuzzy searches. See The Standard Query Parser at the Solr website for examples.

Solr has additional query processors that you can specify in the <search_query> argument to access additional features. The GPText Universal Query Parser, gptextqp, allows queries that mix features from all of the supported query parsers.

See Selecting a Query Parser for a list of the supported query parsers and how to request them in your queries. See Using the Universal Query Parser for examples using the GPText Universal Query Parser.

The following sections show how to use the gptext.search() function, including example queries that demonstrate Solr search features.

An AND search example with top 5 results

This search finds documents in the wikipedia.articles index that contain both search terms “solar” and “battery”. The 'rows=5' argument is a Solr option that specifies the top 5 results are to be returned from each segment. In a Greenplum Database cluster with two segments, this query returns up to 10 rows.

=# SELECT a.id, a.date_time, a.title, q.score
   FROM wikipedia.articles a, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 'solar AND battery', null, 'rows=5') q
   WHERE q.id::int8 = a.id ORDER BY score DESC;
    id    |       date_time        |        title        |   score
----------+------------------------+---------------------+-----------
 13690575 | 2017-08-24 02:34:00-05 | Solar power         | 2.7128658
  2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid     | 2.5810153
  4711003 | 2017-08-10 18:56:00-05 | Osmotic power       | 2.2073007
    25784 | 2017-08-26 07:10:00-05 | Renewable energy    | 2.1295567
   213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
    27743 | 2017-08-20 15:56:00-05 | Solar energy        | 1.6916461
   608623 | 2017-08-27 03:56:00-05 | Ethanol fuel        | 1.4619896
(7 rows)

See Solr options for more about Solr options.

An OR search example with top 5 results

By using the OR keyword, this search matches more documents than the AND example. The rows=5 Solr option limits the number of rows returned from each segment, so on a cluster with two segments this query returns at most 10 rows.

=# SELECT a.id, a.date_time, a.title, q.score
   FROM wikipedia.articles a, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 'solar OR battery', null, 'rows=5') q
   WHERE q.id::int8 = a.id ORDER BY score DESC;
   id    |       date_time        |        title        |   score
---------+------------------------+---------------------+-----------
 2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid     | 2.5810153
   25784 | 2017-08-26 07:10:00-05 | Renewable energy    | 2.1295567
 2120798 | 2017-01-28 00:59:00-06 | Lithium economy     | 2.0416002
  213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
   27743 | 2017-08-20 15:56:00-05 | Solar energy        | 1.6916461
  608623 | 2017-08-27 03:56:00-05 | Ethanol fuel        | 1.4619896
  533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.0239072
 2988035 | 2017-03-12 06:39:00-05 | Vortex engine       | 0.9519546
  113728 | 2017-08-15 09:59:00-05 | Geothermal energy   | 0.6801035
   55017 | 2017-08-28 19:24:00-05 | Fusion power        | 0.6432224
(10 rows)

Search non-default fields

A GPText index has a default search column, specified when the index is created with the gptext.create_index() function. If you have included additional columns to index, you can reference them in your queries. This query searches for documents with the word “solar” in the title column.

=# SELECT a.id, a.date_time, a.title, q.score
   FROM wikipedia.articles a, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 'title:solar', null, null) q
   WHERE q.id::int8 = a.id ORDER BY score DESC;
    id    |       date_time        |        title        |   score
----------+------------------------+---------------------+-----------
 13690575 | 2017-08-24 02:34:00-05 | Solar power         | 1.6547729
    27743 | 2017-08-20 15:56:00-05 | Solar energy        | 1.6547729
   533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.1132113
   213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 1.1132113
(4 rows)

This example finds documents where the title column matches “Solar power” or “Solar energy”.

=# SELECT a.id, a.date_time, a.title, q.score
   FROM wikipedia.articles a, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 'title:(solar AND (power OR energy))', null, null) q
   WHERE q.id::int8 = a.id;
    id    |       date_time        |    title     |   score
----------+------------------------+--------------+-----------
    27743 | 2017-08-20 15:56:00-05 | Solar energy | 3.3095458
 13690575 | 2017-08-24 02:34:00-05 | Solar power  | 2.9718256
(2 rows)

This example searches for articles that have “photosynthesis” in the content column but that do not have “solar” in the title column.

=# SELECT a.id, a.date_time, a.title, q.score
   FROM wikipedia.articles a, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 'photosynthesis and -title:solar', null, null) q
   WHERE q.id::int8 = a.id ORDER BY score DESC;
    id    |       date_time        |      title       |   score
----------+------------------------+------------------+-----------
    25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.9720955
 53716476 | 2017-08-28 20:40:00-05 | Seaweed fuel     | 1.4240221
 14205946 | 2017-08-28 08:46:00-05 | Algae fuel       | 1.3022419
   608623 | 2017-08-27 03:56:00-05 | Ethanol fuel     | 0.7614042
(4 rows)

Filtering search results

A filter query applies filters to the results returned by the query. The <filter_queries> argument of the gptext.search() function is an array, so you can apply multiple filters to the search results.

The following example finds articles that have the word “nuclear” in the content column and then applies two filter queries to remove articles that have “solar” in the title column and articles that do not have “power” in the title column.

=# SELECT a.id, a.date_time, a.title, q.score
   FROM wikipedia.articles a, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 'nuclear', '{-title:solar,title:power}', null) q
   WHERE q.id::int8 = a.id ORDER BY score DESC;
    id    |       date_time        |      title       |   score
----------+------------------------+------------------+------------
 14090587 | 2017-08-14 14:00:00-05 | Low-carbon power | 1.1897897
    55017 | 2017-08-28 19:24:00-05 | Fusion power     | 1.1753609
 13021878 | 2017-08-09 05:03:00-05 | Geothermal power | 0.99499804
(3 rows)
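The <filter_queries> argument is written as a PostgreSQL-style array literal such as '{-title:solar,title:power}'. When filter queries are assembled by a program, quoting each element avoids surprises with commas, braces, or quotes inside a query. The Python sketch below builds such a literal; the helper name pg_array_literal is hypothetical, and the sketch covers only the array-literal quoting rules, not any escaping that Solr itself may require.

```python
def pg_array_literal(items):
    """Format a list of strings as a PostgreSQL array literal.

    Every element is double-quoted; backslashes and double quotes inside
    an element are backslash-escaped, as array input syntax requires.
    """
    quoted = []
    for item in items:
        escaped = item.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return "{" + ",".join(quoted) + "}"


filters = pg_array_literal(["-title:solar", "title:power"])
print(filters)
```

If the literal is then embedded in a SQL string, any single quotes it contains must also be doubled; passing it as a bound query parameter avoids that step.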

The following example searches the demo.twitter.message table for messages that contain the text “iphone” and either “hate” or “love” and filters for authors who specified English language in their twitter profile.

=# SELECT t.id, q.score, t.author_screen_name, t.message_text
   FROM twitter.message t, gptext.search(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message', '(iphone AND (hate OR love))', '{author_lang:en}', 'rows=5') q
   WHERE t.id = q.id::int4 ORDER BY score DESC;
    id    |   score   | author_screen_name | message_text
----------+-----------+--------------------+--------------
 19424811 | 3.446217  | kennediiscool      | I hate
                                           : iPhones:
 20663075 | 2.9209785 | Hi_imMac           | RT @indigoFKNvanity: I hate the autocorrect on iPhones!!!!!!!!!
 20042822 | 2.9209785 | renadrian          | @KDMC23 ohhhh!!! I hate Iphone Talk!
 20759274 | 2.5128412 | SteLala            | Dropped frutopia on
                                           : My phone... #ciaowaterdamage I hate iPhones.
 19416451 | 2.1448703 | ShayFknShay        | I'm in love with my new iPhone (:
 20350436 | 2.102924  | mahhnamestj        | I absolutely love how fast this phone works. Love the iPhone.
 19284329 | 1.9478481 | popolvuhplaya      | #nowplaying on my iPhone: Daft Punk - "Digital Love"
 19714120 | 1.9478481 | BipolarBearApp     | @ayee_Eddy2011 I love pancakes too! #iPhone #app
 20257190 | 1.6903389 | alasco             | Love my #iphone - only problem now? I want an #Ipad!
 20473459 | 1.379696  | ArniBella          | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #iphone #apple #technology
(10 rows)

Creating Faceted Search Queries

Faceting breaks query results into multiple categories with a count of the number of documents in the index for each category. There are three GPText faceted search functions:

gptext.faceted_field_search() – the categories are the values of one or more fields in the GPText index.

gptext.faceted_query_search() – the categories are a list of search queries.

gptext.faceted_range_search() – the categories are a list of ranges calculated from minimum value, maximum value, and the size of the range (gap).

The examples in this section use the store.products table. See Setting Up the Demo Database for commands to create and load data into this table.

After the table is created and the data loaded, create the GPText index, index the data, and then commit the index as shown in this example.

=# SELECT * FROM gptext.create_index('store', 'products', '{id, title, category, brand, price}',
       '{int, text_intl, string, string, float}', 'id', 'title');
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM store.products), 'demo.store.products');
 dbid | num_docs
------+----------
    2 |       25
    3 |       25
(2 rows)

=# SELECT * FROM gptext.commit_index('demo.store.products');
 commit_index
--------------
 t
(1 row)

Faceting on Fields

With the gptext.faceted_field_search() function, the categories are values of one or more fields in the index. Here is the syntax for the gptext.faceted_field_search() function:

gptext.faceted_field_search(<index_name>, <query>, <filter_queries>, <facet_fields>, <facet_limit>, <minimum>[, <options>])

<index_name> is the name of the GPText index with fields to facet.

<query> is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'.

<filter_queries> is an array of queries that filter documents from the set returned by the <query>, or null if none. Only documents that match all queries in the list are included in the counts.

<facet_fields> is an array of index fields to facet.

<facet_limit> is the maximum number of results to report for any one category. Use -1 to report all results.

<minimum> is the minimum number of results a category must have in order to be included in the results.

This example facets all documents in the demo.store.products index on the category field.

=# SELECT * FROM gptext.faceted_field_search('demo.store.products', '*:*', null, '{category}', -1, 1);
 field_name | field_value  | value_count
------------+--------------+-------------
 category   | Pot          |          11
 category   | Desktops     |          10
 category   | Tablets      |           8
 category   | Monitors     |           7
 category   | Tent         |           6
 category   | Luggage      |           5
 category   | Sleeping Bag |           3
(7 rows)

This example facets all documents on two fields, category and brand. Only facets with a count of 2 or more are included in the results.

=# SELECT * FROM gptext.faceted_field_search('demo.store.products', '*:*', null, '{category, brand}', -1, 2);
 field_name |  field_value   | value_count
------------+----------------+-------------
 brand      | ASUS           |           7
 brand      | Dell           |           5
 brand      | HP             |           4
 brand      | Samsung        |           4
 brand      | Apple          |           2
 brand      | Utopia Kitchen |           2
 brand      | Big Agnes      |           2
 brand      | Yaheetech      |           2
 brand      | Kelty          |           2
 brand      | Huawei         |           2
 category   | Pot            |          11
 category   | Desktops       |          10
 category   | Tablets        |           8
 category   | Monitors       |           7
 category   | Tent           |           6
 category   | Luggage        |           5
 category   | Sleeping Bag   |           3
(17 rows)

The next example uses a filter query to facet the brand field for just the 10 documents with category “Desktops”.

=# SELECT * FROM gptext.faceted_field_search('demo.store.products', '*:*', '{category:Desktops}', '{brand}', -1, 1);
 field_name | field_value | value_count
------------+-------------+-------------
 brand      | Dell        |           5
 brand      | ASUS        |           3
 brand      | HP          |           2
(3 rows)

Faceting on search queries

With the gptext.faceted_query_search() function, the categories are GPText search queries. The counts report the number of documents that match each search query. Here is the syntax for the gptext.faceted_query_search() function:

gptext.faceted_query_search(<index_name>, <query>, <filter_queries>, <facet_queries>);

<index_name> is the name of the GPText index with fields to facet.

<query> is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'.

<filter_queries> is an array of queries that filter documents from the set returned by the <query>, or null if none. Only documents that match all queries in the list are included in the counts.

<facet_queries> is an array of search queries. Each query in the array is a category in the results.

This example reports the number of documents that contain “windows”, “intel”, and both “windows” and “intel” in the default search column (title).

=# SELECT * FROM gptext.faceted_query_search('demo.store.products', '*:*', null, '{windows, intel, windows AND intel}');
    query_name     | value_count
-------------------+-------------
 intel             |           7
 windows           |           4
 windows AND intel |           2
(3 rows)

The facet queries in this example are Solr range queries that define four custom ranges over the price field.

=# SELECT * FROM gptext.faceted_query_search('demo.store.products', '*:*', null, '{price:[* TO 200], price:[201 TO 250], price:[251 TO 300], price:[301 TO *]}');
     query_name     | value_count
--------------------+-------------
 price:[201 TO 250] |           2
 price:[251 TO 300] |           2
 price:[301 TO *]   |          11
 price:[* TO 200]   |          35
(4 rows)

Faceting on Ranges

The gptext.faceted_range_search() function facets a single field in the GPText index into ranges specified with start, end, and gap values. The faceted field must be a numeric type.

gptext.faceted_range_search(<index_name>, <query>, <filter_queries>, <field_name>, <range_start>, <range_end>, <range_gap>[, <options>])

<index_name> is the name of the GPText index with fields to facet.

<query> is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'.

<filter_queries> is an array of queries that filter documents from the set returned by the <query>, or null if none. Only documents that match all queries in the list are included in the results.

<field_name> is the name of the field to facet. The field must have numeric content. The calculated ranges will have the same data type as the field.

<range_start> is the smallest value of the first range category.

<range_end> is the highest value of the top range.

<range_gap> is the size of each range category.

<options> is an optional string containing Solr query options.

This range search example facets the price field into ranges between 0 and 1200 with a gap of 100. The range_value column in the results is a text value, so the ORDER BY clause casts the value to a float type.

=# SELECT * from gptext.faceted_range_search('demo.store.products', '*:*', null, 'price', '0', '1200', '100') ORDER BY range_value::float;
 field_name | range_value | value_count
------------+-------------+-------------
 price      | 0.0         |          23
 price      | 100.0       |          12
 price      | 200.0       |           4
 price      | 300.0       |           6
 price      | 400.0       |           0
 price      | 500.0       |           1
 price      | 600.0       |           1
 price      | 700.0       |           1
 price      | 800.0       |           0
 price      | 900.0       |           1
 price      | 1000.0      |           0
 price      | 1100.0      |           1
(12 rows)
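To make the start, end, and gap arithmetic concrete, the Python sketch below computes the same kind of bucket counts over an in-memory list of numbers. It is a model of range faceting, not GPText or Solr code. It assumes each range includes its lower bound and excludes its upper bound (which is Solr's default for range facets, as far as I know) and that the gap divides the span evenly. The function name facet_ranges and the sample prices are hypothetical.

```python
import math


def facet_ranges(values, start, end, gap):
    """Count values per range [lower, lower + gap), keyed by the lower bound."""
    bucket_count = math.ceil((end - start) / gap)
    counts = {start + i * gap: 0 for i in range(bucket_count)}
    for value in values:
        if start <= value < end:
            # Index of the range this value falls into.
            index = min(int((value - start) // gap), bucket_count - 1)
            counts[start + index * gap] += 1
    return counts


# Hypothetical prices; 1200 falls outside the faceted span and is not counted.
prices = [5, 99.5, 100, 250, 1199, 1200]
for lower, count in facet_ranges(prices, 0, 1200, 100).items():
    print(lower, count)
```

As in the SQL example, empty ranges are still reported with a count of 0.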

Highlighting Search Terms in Query Results

Highlighting inserts markup tags before and after each occurrence of the search terms in a query. For example, if the search term is “iphone”, each occurrence of “iphone” in the field is marked up:

<em>iphone</em>

You can change the default markup strings from <em> and </em> by setting the gptext.hl_pre_tag and gptext.hl_post_tag server configuration options.

Therearetwowaystohighlightsearchterms,dependingonwhetherthefieldtobemarkedupisstoredintheGPTextindex.

Ifthefieldisindexed,butnotstored,youmustjointhesearchresultswiththedatabasetableandusethe gptext.highlight() functiontoapplymarkuptagstothecolumndata.

Ifthefieldisindexedandstored,Solrcanapplythemarkuptagsandreturnthemarked-upfieldintheresultsofthesearchquery.ThisisthesamewayhighlightingworksforGPTextexternalindexes.(SeeHighlightingExternalIndexSearchResults.)UsingthismethodwithregularGPTextindexesrequiresmodifyingthe solrconfig.xml configurationfilefortheindex.

HighlightingTermswithgptext.highlight()Touse gptext.highlight() theindexmusthavebeencreatedwithtermsenabledforthecolumnsthataretobehighlighted.Use gptext.enable_terms() toenabletermvectorsandthenreindexthedataifitwasalreadyindexed.See gptext.enable_terms() intheGPTextFunctionReference.

Thisexampleenablestermsforthe message_text fieldinthe demo.twitter.message index,reindexesthedata,andcommitsthechangestotheindex:

=#SELECT*FROMgptext.enable_terms('demo.twitter.message','message_text');=#SELECT*FROMgptext.index(TABLE(SELECT*FROMtwitter.message),'demo.twitter.message');=#SELECT*FROMgptext.commit_index('demo.twitter.message');

The gptext.highlight() syntaxis:

gptext.highlight(<column_data>,<column_name>,<offsets>)

The <column_data> argumentcontainsthetextdatathatwillbemarkedupwithhighlightingtags.

The <column_name> argumentisthenameofthecorrespondingtablecolumn.

The <offsets> argumentisaGPText hstore typethatcontainskey-valuepairsthatspecifythelocationsofthesearchterminthetextdata.Thisvalueisconstructedbythe gptext.search() functionwhenhighlightingisenabled.Thekeycontainsthecolumnnameandthevalueisacomma-separatedlistofoffsetswherethedataappears.

To enable highlighting in a gptext.search() query, add the hl and hl.fl options in the <options> argument:

hl=true&hl.fl=<field1>,<field2>

Setting the hl=true option enables highlighting for the search. The hl.fl option specifies a list of the field names to highlight.
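When the options string is assembled by a client program, a small helper can keep the format consistent. The following Python sketch is an assumption about how you might build the string on the client side; the function name highlight_options is hypothetical, and only the hl, hl.fl, and rows options shown in this guide are used.

```python
def highlight_options(fields, rows=None):
    """Build a gptext.search() options string that enables highlighting.

    fields is a list of field names for hl.fl; rows, if given, is the
    maximum number of rows to return from each segment.
    """
    parts = []
    if rows is not None:
        parts.append("rows=%d" % rows)
    parts.append("hl=true")
    parts.append("hl.fl=" + ",".join(fields))
    return "&".join(parts)

print(highlight_options(["message_text"], rows=5))
```

The resulting string can be passed as the options argument of gptext.search().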

This example returns up to five rows from each segment with the text “iphone” highlighted in the message_text field.

=# SELECT t.id, gptext.highlight(t.message_text, 'message_text', s.hs)
   FROM twitter.message t,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
            '{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text') s
   WHERE t.id = s.id::int8;
    id    | highlight
----------+---------------------------------------------------------------------------------------------------------------------
 20473459 | i love my iphone 4 but I'm excited to see what the iphone 5 has to offer #gadgets #<em>iphone</em> #apple #technology
 19424811 | I hate
          : <em>iPhones</em>:
 20663075 | RT @indigoFKNvanity: I hate the autocorrect on <em>iPhones</em>!!!!!!!!!
 20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
 20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
 19714120 | @ayee_Eddy2011 I love pancakes too! #<em>iPhone</em> #app
 19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
 19416451 | I'm in love with my new <em>iPhone</em> (:
 20257190 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
 20759274 | Dropped frutopia on
          : My phone... #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)

Warning: Highlighting adds overhead to the query, including index space, indexing time, and search time.


Highlighting Terms in Stored Fields

If the field to be highlighted is stored in the index, Solr can return the field in the search results with markup tags applied. The gptext.highlight() function is not used and it is not necessary to enable terms for the field. This is the default behavior for GPText external indexes, but for regular GPText indexes you must enable it by editing the solrconfig.xml configuration file for the index.

1. Use the gptext-config utility to open the solrconfig.xml configuration file for the index in the editor.

$ gptext-config edit -i demo.twitter.message -f solrconfig.xml

2. Search for <!-- Search Components --> and add the following element.

<searchComponent class="solr.HighlightComponent" name="highlight"/>

3. Search for <requestHandler name="/select" class="solr.SearchHandler">. In the <arr name="components"> child element, change <str>termoffsets</str> to <str>highlight</str>. The complete <requestHandler> entry should be:

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">message_text</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>stats</str>
    <str>debug</str>
  </arr>
</requestHandler>

4. Save your changes.

5. Update the field definitions in the managed-schema configuration file to store the fields that will be highlighted. See Storing Additional Fields in an Index for instructions. Be sure to reindex the data after changing storage options.

The following query searches the message_text field for messages containing the text “iphone” and highlights “iphone” in the text returned in the hs column.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
      '{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
    id    |   score   | hs | rf
----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------+----
 19284329 | 0.8176138 | {"columnValue":[{"name":"message_text","value":"#nowplaying on my \u003cem\u003eiPhone\u003c/em\u003e: Daft Punk - \"Digital Love\""}]} |
 19416451 | 0.9003142 | {"columnValue":[{"name":"message_text","value":"I'm in love with my new \u003cem\u003eiPhone\u003c/em\u003e (:"}]} |
 19424811 | 1.0051261 | {"columnValue":[{"name":"message_text","value":"I hate\n\u003cem\u003eiPhones\u003c/em\u003e:"}]} |
 20042822 | 0.8519347 | {"columnValue":[{"name":"message_text","value":"I hate \u003cem\u003eIphone\u003c/em\u003e Talk!"}]} |
(4 rows)

You can use the gptext.gptext_retrieve_field() function to extract the highlighted text from the columnValue array in the hs column. Compare the previous results to the results from this query.


=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'message_text') message_text
   FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
       '{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
    id    |   score    | message_text
----------+------------+---------------------------------------------------------------------------------------------------------------------
 19424811 |  1.0051261 | I hate
          :            : <em>iPhones</em>:
 20042822 |  0.8519347 | I hate <em>Iphone</em> Talk!
 20350436 |  0.7387052 | Love the <em>iPhone</em>.
 20473459 | 0.59349346 | i love my iphone 4 but I'm excited to see what the iphone 5 has to offer #gadgets #<em>iphone</em> #apple #technology
 20663075 |  0.8519347 | RT @indigoFKNvanity: I hate the autocorrect on <em>iPhones</em>!!!!!!!!!
 19284329 |  0.8176138 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
 19416451 |  0.9003142 | I'm in love with my new <em>iPhone</em> (:
 19714120 |  0.8176138 | #<em>iPhone</em> #app
 20257190 |  0.7095236 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
 20759274 |  0.7095236 | #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)
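If a client application receives the raw hs value instead of calling gptext.gptext_retrieve_field() in SQL, the same extraction can be approximated on the client. The Python sketch below assumes the hs value is the JSON text shown in the earlier results, with a columnValue array of name/value objects; the function name hs_field is hypothetical and this is not the GPText implementation.

```python
import json

def hs_field(hs, name):
    """Return the value for the named field from an hs JSON string.

    Returns None when the field is not present in the columnValue array.
    """
    doc = json.loads(hs)
    for entry in doc.get("columnValue", []):
        if entry.get("name") == name:
            return entry.get("value")
    return None

# Raw string so the \u003c escapes reach the JSON parser unchanged.
hs = r'{"columnValue":[{"name":"message_text","value":"I hate \u003cem\u003eIphone\u003c/em\u003e Talk!"}]}'
print(hs_field(hs, "message_text"))
```

Parsing the value as JSON also decodes the \u003c and \u003e escapes back into the < and > characters of the markup tags.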

Searching Partitioned Tables

A GPText index for a partitioned Greenplum Database table has a __partition field that contains the name of the child partition. When you query the index, you can use the __partition field to restrict the partitions to search.

Search all partitions in an index by calling gptext.search() with the root partition name:

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
      '{!gptextqp}blackberry', null, null);
    id     |   score   | hs | rf
-----------+-----------+----+----
  71559892 |  5.670539 |    |
 127444971 | 5.1496587 |    |
 127024083 | 5.1496587 |    |
  65596365 | 4.4688635 |    |
  79177658 | 4.4688635 |    |
  78934938 | 4.4688635 |    |
 111566417 | 4.4688635 |    |
  65058966 | 3.5941496 |    |
  92240815 |  5.212467 |    |
  38424415 |  4.730712 |    |
  96811329 |  4.730712 |    |
 146782767 |  4.730712 |    |
  41409575 | 4.1019597 |    |
 104198393 | 4.1019597 |    |
  86943734 | 3.2956126 |    |
  89120464 | 3.2956126 |    |
 153181836 | 3.2956126 |    |
 139227011 | 3.2956126 |    |
  20664699 | 2.8236253 |    |
(19 rows)

You can search a single partition by calling gptext.search() with the child partition name. Use the gptext.partition_status(<index_name>) function to see the partition names. For example:

=# SELECT partition_name, level FROM gptext.partition_status('demo.twitter.message');
        partition_name        | level
------------------------------+-------
 demo.twitter.message_1_prt_1 |     1
 demo.twitter.message_1_prt_2 |     1
 demo.twitter.message_1_prt_3 |     1
 demo.twitter.message_1_prt_4 |     1
(4 rows)

This example searches only the demo.twitter.message_1_prt_3 partition:


=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message_1_prt_3',
      '{!gptextqp}blackberry', null);
    id     |   score
-----------+-----------
  71559892 |  5.670539
  79177658 | 4.4688635
  78934938 | 4.4688635
 111566417 | 4.4688635
  92240815 |  5.212467
  96811329 |  4.730712
 104198393 | 4.1019597
  86943734 | 3.2956126
  89120464 | 3.2956126
(9 rows)

You can also specify a partition name or a range of partitions in the query filter argument of the gptext.search() function. This example searches the partitions between message_1_prt_2 and message_1_prt_4.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
      'android', '{__partition:[message_1_prt_2 TO message_1_prt_4]}');
    id     |   score
-----------+-----------
  42474603 |  5.770868
  95666225 |  5.670539
  68701747 | 4.4688635
  56900818 | 4.4688635
 111566417 | 4.4688635
 120764432 | 4.4688635
 115326522 | 4.4688635
  67269000 | 3.5941496
  99959486 |  6.413594
 104293903 | 3.1360807
(10 rows)
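A client program that searches different partition ranges can generate the filter argument rather than hard-coding it. This Python sketch assumes the filter syntax shown in the example above; the function name partition_range_filter is hypothetical.

```python
def partition_range_filter(first, last):
    """Build a gptext.search() filter that restricts a search to the
    partitions from first to last, using the __partition field."""
    return "{__partition:[%s TO %s]}" % (first, last)

print(partition_range_filter("message_1_prt_2", "message_1_prt_4"))
```

The returned string is passed as the fourth (filter) argument of gptext.search().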

Retrieving Stored Field Content

A GPText index does not, by default, store the contents of database columns in the index, with the exception of the unique id column. When you search the index, you must join the search results with the original database table on the id column in order to access other table columns.

You can configure a GPText index to store content of fields when documents are indexed. The additional stored fields can be returned with the search results so that it is unnecessary to join with the original database table. For some applications, you can even delete data from the database table or drop the table after the data has been added to the index.

Retrieve the additional field values in a GPText search by specifying a list of fields in the gptext.search() options argument. In this example, the demo.wikipedia.articles index has been configured to store the content, title, and refs fields, in addition to the id field. See Storing Field Content in an Index for instructions to edit the managed-schema file to store these additional fields. In the options argument, the Solr fl parameter requests that contents of the id and title fields be included in the results.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
      '+grid +economy', null, 'fl=id,title&rows=2');
   id    |   score   | hs | rf
---------+-----------+----+----------------------------------------------------------------------------------------------------------
  533423 | 2.4593863 |    | column_value { name: "id" value: "533423" } column_value { name: "title" value: "Solar water heating" }
 7906908 | 2.0646634 |    | column_value { name: "id" value: "7906908" } column_value { name: "title" value: "Biomass" }
   27743 |  1.823319 |    | column_value { name: "id" value: "27743" } column_value { name: "title" value: "Solar energy" }
  113728 | 1.2235354 |    | column_value { name: "id" value: "113728" } column_value { name: "title" value: "Geothermal energy" }
(4 rows)

To retrieve all fields stored in the index, use the * wildcard for the field list: 'fl=*'.

In the results, the requested fields are packed into a field named rf added to the results. The rf field is a text value containing a structure with the following format:


column_value { name: "<field1_name>" value: "<field1_value>" } [column_value { name: "<field2_name>" value: "<field2_value>" }] ...

The GPText function gptext.gptext_retrieve_field(rf, <column_name>) retrieves a single field value by name from this structure as a text value. GPText provides variations to retrieve the field values as int or float values. If the specified field name does not exist in the rf structure, the function returns NULL.

This example shows how you can use the gptext.gptext_retrieve*() functions to unpack search results into separate result columns.

=# SELECT score,
          gptext.gptext_retrieve_field_int(rf, 'id') id,
          gptext.gptext_retrieve_field(rf, 'title') title,
          substring(gptext.gptext_retrieve_field(rf, 'content'), 1, 15) content
   FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
       '+grid +economy', null, 'fl=*');
   score   |    id    |        title        |     content
-----------+----------+---------------------+-----------------
 2.4593863 |   533423 | Solar water heating | '''Solar water
 2.0646634 |  7906908 | Biomass             | '''Biomass''' i
 2.0444229 | 13690575 | Solar power         | '''Solar power'
  1.823319 |    27743 | Solar energy        | '''Solar energy
 1.2235354 |   113728 | Geothermal energy   | '''Geothermal e
 1.0890164 | 14205946 | Algae fuel          | '''Algae fuel''
(6 rows)
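For clients that read the rf text directly, the structure can also be unpacked outside the database. The Python sketch below is an approximation based only on the rf format shown above; the names rf_field and rf_field_int are hypothetical, and escaped characters inside values are returned as-is rather than decoded.

```python
import re

# One column_value { name: "..." value: "..." } entry; whitespace is optional.
_COLUMN_VALUE = re.compile(
    r'column_value\s*\{\s*name:\s*"([^"]*)"\s*value:\s*"((?:[^"\\]|\\.)*)"\s*\}'
)

def rf_field(rf, name):
    """Return the text value of the named field, or None if it is absent."""
    for field_name, value in _COLUMN_VALUE.findall(rf):
        if field_name == name:
            return value
    return None

def rf_field_int(rf, name):
    """Return the named field converted to int, or None if it is absent."""
    value = rf_field(rf, name)
    return None if value is None else int(value)

rf = ('column_value { name: "id" value: "533423" } '
      'column_value { name: "title" value: "Solar water heating" }')
print(rf_field_int(rf, "id"), rf_field(rf, "title"), rf_field(rf, "refs"))
```

Returning None for a missing field mirrors the NULL result described for gptext.gptext_retrieve_field().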

Selecting a Query Parser

When you submit a query, Solr processes the query using a query parser. There are several Solr query parsers with different capabilities. For example, the ComplexPhraseQueryParser can parse wildcards, and the SurroundQueryParser supports span queries, which find words in the vicinity of a search term in a document.

GPText supports these query parsers:

QParserPlugin, the default GPText query parser. QParserPlugin is a superset of the LuceneQParserPlugin, Solr’s native Lucene query parser. QParserPlugin is a general purpose query parser with broad capabilities. QParserPlugin does not support span queries and handles operator precedence in an unintuitive manner. The support for field selection is also rather weak. See http://wiki.apache.org/solr/SolrQuerySyntax.

ComplexPhraseQueryParser supports wildcards, ORs, ranges, and fuzzies inside phrase queries. See https://issues.apache.org/jira/browse/SOLR-1604.

DisMax (or eDisMax) handles operator precedence in an intuitive manner and is well-suited for user queries since it is similar to popular search engines on the web. See Using the DisMax and Extended DisMax Query Parsers.

SurroundQueryParser supports the family of span queries. See Proximity Search Queries and Surround Query Parser in the Apache Solr Reference Guide.

gptextqp, the GPText Universal Query Parser, can use all of the above query parsers in combination. See Using the Universal Query Parser for more information.

You can specify the query parser to use at query time by setting the Solr defType option in the options argument of the search function or by setting the type as a Solr LocalParam embedded in the query.

This query specifies the dismax query parser in the options argument of the gptext.search() function:

=# SELECT a.title, q.score
   FROM wikipedia.articles a,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '+hydroelectric -solar', null, 'defType=dismax') q
   WHERE a.id = q.id::int8;
         title          |   score
------------------------+-----------
 Forward osmosis        | 0.9552469
 Liquid nitrogen engine | 1.0126935
(2 rows)

The default query parser is specified in the requestHandler definitions in solrconfig.xml. You can edit solrconfig.xml with the management utility command gptext-config edit.


The following query uses the ComplexPhraseQueryParser, setting the type parameter in a Solr LocalParam.

=# SELECT a.title, q.score
   FROM wikipedia.articles a,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '{!type=complexphrase}sequester AND carbon', null, null) q
   WHERE a.id = q.id::int8;
  title  |  score
---------+---------
 Biomass | 3.83572
(1 row)

In the LocalParam, the type= specifier can be omitted because type is the default parameter:

'{!complexphrase}sequester AND carbon'
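When query strings are generated by an application, the LocalParams prefix can be added with a small helper. This Python sketch only assembles text in the short form shown above; the function name with_local_params is hypothetical, and parameter values are assumed not to contain double quotes.

```python
def with_local_params(query, parser, **params):
    """Prefix query with a Solr LocalParams block selecting parser.

    Extra keyword arguments become key="value" pairs inside the block.
    Values are not escaped, so they must not contain double quotes.
    """
    parts = [parser]
    for key, value in params.items():
        parts.append('%s="%s"' % (key, value))
    return "{!" + " ".join(parts) + "}" + query

print(with_local_params("sequester AND carbon", "complexphrase"))
print(with_local_params("nuclear", "edismax", qf="content title links"))
```

The first call reproduces the short form of the ComplexPhraseQueryParser query; the second shows a parser parameter placed in the same block.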

Proximity Search Queries

Proximity search queries find documents that have search terms within a specified distance. The distance is measured as the number of term moves that would be needed to make the terms adjacent.

With the standard query parser, the terms to match are placed in quotes and the distance between them is specified by adding a tilde ~ and an integer after the closing quote. The following search query finds documents with the terms “solar” and “fossil” within five terms of each other.

=# SELECT t.id, s.score, t.title
   FROM wikipedia.articles t,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '"solar fossil"~5', null, null) s
   WHERE s.id::int8 = t.id;
    id    |   score    |      title
----------+------------+------------------
    25784 |  0.4855828 | Renewable energy
 14090587 | 0.30585092 | Low-carbon power
 13690575 | 0.62667537 | Solar power
(3 rows)

The search terms inside the quotes can appear in either order. However, if the terms occur in the opposite order in the document, the distance between them is one greater than if the terms occur in the specified order.

The Surround query parser allows ordered and unordered proximity searches. The W operator specifies an ordered search and the N operator specifies an unordered search. The maximum distance between the terms is specified by prefixing the W or N operator with an integer, for example 3W.

The proximity query can be written with prefix or infix notation.

Prefix notation: '{!surround} 3W(solar, fossil)'

Infix notation: '{!surround} solar 3W fossil'

Here are some proximity query examples using the Surround query parser.

'{!surround} title:2w(solar, heat)'

Searches the title field for the terms “solar” and “heat” within two terms, and in the specified order. This query uses prefix notation. The N and W operators are not case-sensitive.

'{!surround} title:heat 2N solar'

Searches the title field for the terms “heat” and “solar” within two terms, in any order. This query uses infix notation.

'{!surround} title: W(solar, heat)'

Searches the title field for adjacent terms “solar” and “heat”. The default distance is 1, so 1W can be abbreviated to W.

The wikipedia.articles index contains a document with the title “Solar water heating”. The following example search, however, cannot find it.

The Surround query parser does not analyze query text like the other query parsers. GPText indexes are by default built with lowercase and stemming filters, for example, so surround queries containing capital letters or unstemmed terms will return no results.


=# SELECT t.id, s.score, t.title
   FROM wikipedia.articles t,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '{!surround}title:2w(Solar, heating)', null, null) s
   WHERE s.id::int8 = t.id;
 id | score | title
----+-------+-------
(0 rows)

When you rewrite the query to use only lowercase characters and remove the suffix from “heating”, the document is found.

=# SELECT t.id, s.score, t.title
   FROM wikipedia.articles t,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '{!surround}title:2w(solar, heat)', null, null) s
   WHERE s.id::int8 = t.id;
   id   |   score   |        title
--------+-----------+---------------------
 533423 | 1.5434089 | Solar water heating
(1 row)

An easy way to avoid this limitation is to use the GPText Universal Query Parser, which does analyze the query text and also supports the Surround query parser’s proximity syntax.
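If you do send queries to the Surround query parser directly, a client can at least lowercase the terms before building the query. The Python sketch below is a partial workaround under that assumption: it lowercases terms and writes prefix notation, but it does not stem them, so a term such as “heating” must still be supplied in its stemmed form. The function name surround_query is hypothetical.

```python
def surround_query(field, terms, distance=1, ordered=True):
    """Build a Surround prefix-notation proximity query.

    Terms are lowercased because the Surround parser does not analyze
    query text. Stemming is NOT applied; pass stemmed terms yourself.
    """
    operator = "w" if ordered else "n"
    # The default distance is 1, so the integer prefix can be omitted.
    prefix = "" if distance == 1 else str(distance)
    lowered = ", ".join(term.lower() for term in terms)
    return "{!surround} %s:%s%s(%s)" % (field, prefix, operator, lowered)

print(surround_query("title", ["Solar", "heat"], distance=2))
```

The printed query matches the form of the successful example above.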

Using the Universal Query Parser

With the GPText Universal Query Parser, you can perform searches using features from any of the other supported query parsers, combined into one search string. Invoke the Universal Query Parser by setting the Solr type parameter in a Solr LocalParam with this format:

'{!gptextqp}<search_query>'

The search query in the following example includes syntax from three query parsers:

sea* – Complex query with wildcard

2W – Proximity query requesting a maximum of two words distance between the terms “sea*” and “oil” or “fuel”

oil OR fuel – Solr Standard Query Processor

=# SELECT a.title, q.score
   FROM wikipedia.articles a,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '{!gptextqp}sea* 2W (oil OR fuel)', null, null) q
   WHERE a.id = q.id::int8;
    title     |   score
--------------+-----------
 Seaweed fuel | 55.250305
(1 row)

In the following example, title:n(power, geothermal) specifies that the terms “power” and “geothermal” in the title field must be adjacent, but they can occur in either order.

=# SELECT a.title, q.score
   FROM wikipedia.articles a,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '{!gptextqp}title:n(power, geothermal)', null, null) q
   WHERE a.id = q.id::int8;
      title       |  score
------------------+---------
 Geothermal power | 2.05577
(1 row)

This query uses the fuzzy search operator ~ to find articles with titles containing a term similar to “lethiam” and a complex query that finds articles with “ocean” and “wind” in the content.


=# SELECT t.id, score, title
   FROM wikipedia.articles t,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
            '{!gptextqp}title:lethiam~ OR content:(ocean AND wind)', null, null) s
   WHERE t.id = s.id::int8;
    id    |   score    |       title
----------+------------+-------------------
  2120798 |  1.3326647 | Lithium economy
  4711003 |  2.6328268 | Osmotic power
    25784 |  3.3899183 | Renewable energy
    55017 | 0.95579207 | Fusion power
   113728 |  1.3909805 | Geothermal energy
    27743 |   2.114852 | Solar energy
 13690575 |  1.4488393 | Solar power
(7 rows)

Using the DisMax and Extended DisMax Query Parsers

The DisMax query parser supports a subset of the Solr Standard Query Parser syntax. It is useful for queries from end users who are familiar with common search systems, such as Google search. It supports quoted phrases, AND and OR operators, and + and - operators. The Extended DisMax query parser improves upon the DisMax query parser, supporting the full Standard query parser syntax.

The DisMax and Extended DisMax query parser behaviors can be customized at query time by setting parameters in the Solr options argument of the gptext.search() function or as local parameters in the query text. See DisMax Parameters and Extended DisMax Parameters for details. One useful parameter is the qf (query fields) parameter, which specifies a list of fields to search. Using this parameter avoids having to write a query that searches each field individually. For example, instead of writing this query:

'content:nuclear OR title:nuclear OR links:nuclear'

you can write:

'{!edismax qf="content title links"}nuclear'

The following example queries illustrate features of the DisMax and Extended DisMax query parsers.

'{!dismax} +nuclear reactor'

Finds documents containing the term “nuclear” and, optionally, the term “reactor”.

'{!dismax} +"nuclear reactor"'

Finds documents containing the phrase “nuclear reactor”.

'{!dismax} +solar -reactor'

Finds documents containing the term “solar” but not the term “reactor”.

'{!edismax qf="title refs"} solar'

Finds documents with the term “solar” in the title or refs fields.

'{!edismax qf="title"} (solar or renewable) and energy'

Finds the documents with titles “Solar energy” and “Renewable energy”.


Customizing GPText Indexes

GPText saves configuration files for an index in the ZooKeeper /gptext/configs/<index_name> znode, for example /gptext/configs/demo.twitter.message. The configuration files are copied from the $GPTXTHOME/share/gp_index_template/conf directory and modified with information passed in the gptext.create_index() function arguments and the Greenplum Database table definition.

After an index has been created, you can modify the index’s configuration files using the gptext-config command-line utility. You can also edit the template files in the $GPTXTHOME/share/gp_index_template/conf directory so that any new index you create has your customizations.

If you choose to customize the template files in the $GPTXTHOME/share/gp_index_template/conf directory, you should first back up the files so that you can restore the default versions if necessary.

Editing GPText Index Configuration Files

You can edit the index configuration files saved in ZooKeeper using the gptext-config command-line utility with the edit option. You provide the name of the index and the name of the configuration file you want to modify. To edit the managed-schema file for the demo.twitter.message index, for example:

$ gptext-config edit -i demo.twitter.message -f managed-schema

The utility loads the file into an editor, vi by default. You can specify a different editor with the -e option. This command uses the nano editor to edit the stopwords.txt file.

$ gptext-config edit -i demo.twitter.message -f stopwords.txt -e nano

You can use the gptext-config upload command to upload a local configuration file to ZooKeeper. This example uploads a local configuration file named protwords.custom to ZooKeeper, overwriting the existing protwords.txt file.

$ gptext-config upload -i demo.twitter.message -l protwords.custom -f protwords.txt
20171011:11:24:59:030178 gptext-config:gpdb:gpadmin-[INFO]:-Execute GPText config.
20171011:11:25:00:030178 gptext-config:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state...
20171011:11:25:00:030178 gptext-config:gpdb:gpadmin-[INFO]:-Upload file protwords.custom to zookeeper...
20171011:11:25:01:030178 gptext-config:gpdb:gpadmin-[INFO]:-Reloading configuration...
20171011:11:25:02:030178 gptext-config:gpdb:gpadmin-[INFO]:-Modifications to protwords.txt require that all data be reindexed.
20171011:11:25:02:030178 gptext-config:gpdb:gpadmin-[INFO]:-Done.

Use the gptext-config append command to append a local text file to an existing configuration file. For example, you could create an additional list of stop words in a local file stopwords.add and append them to the stopwords.txt file.

$ gptext-config append -i demo.twitter.message -l stopwords.add -f stopwords.txt
20171010:09:52:59:019764 gptext-config:gpdb:gpadmin-[INFO]:-Execute GPText config.
20171010:09:53:00:019764 gptext-config:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state...
20171010:09:53:00:019764 gptext-config:gpdb:gpadmin-[INFO]:-Creating temporary copy of stopwords.txt...
20171010:09:53:01:019764 gptext-config:gpdb:gpadmin-[INFO]:-Appending contents of stopwords.add to stopwords.txt
20171010:09:53:01:019764 gptext-config:gpdb:gpadmin-[INFO]:-Backing up stopwords.txt for index demo.twitter.message...
20171010:09:53:03:019764 gptext-config:gpdb:gpadmin-[INFO]:-Reloading configuration...
20171010:09:53:22:019764 gptext-config:gpdb:gpadmin-[INFO]:-Modifications to stopwords.txt require that all data be reindexed.
20171010:09:53:22:019764 gptext-config:gpdb:gpadmin-[INFO]:-Done.

See the gptext-config command reference for gptext-config command-line options and for descriptions of the files you can edit with gptext-config.
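If you script configuration changes, it can help to assemble the gptext-config command line in one place. The Python sketch below only builds the argument list, using the edit, upload, and append subcommands and the -i, -f, -l, and -e options shown in the examples above; the function name gptext_config_args is hypothetical, and the sketch does not run the utility.

```python
def gptext_config_args(action, index, config_file, local_file=None, editor=None):
    """Return the argument list for a gptext-config command.

    Only the subcommands and options that appear in this guide's
    examples are supported: edit, upload, append, -i, -f, -l, -e.
    """
    if action not in ("edit", "upload", "append"):
        raise ValueError("unsupported action: %s" % action)
    args = ["gptext-config", action, "-i", index]
    if local_file is not None:
        args += ["-l", local_file]
    args += ["-f", config_file]
    if editor is not None:
        args += ["-e", editor]
    return args

print(" ".join(gptext_config_args(
    "upload", "demo.twitter.message", "protwords.txt",
    local_file="protwords.custom")))
```

A list in this form could be passed to Python's subprocess.run() on a host where the utility is installed.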

The managed-schema File

The main configuration file for an index is the managed-schema file. The managed-schema file is an XML file containing definitions for the fields, field types, and analyzer chains that define the contents and behavior of a GPText index.

A field (<field> XML element) maps a Greenplum Database table column to a field in the GPText index.

When editing XML files such as managed-schema, be sure that you save a valid XML document. Invalid XML syntax will cause Solr errors and prevent access to your index.


A field type (<fieldType> XML element) assigns Solr Java classes and analyzer chains that handle a data type to a field.

An analyzer chain (<analyzer> XML element) is a container element that specifies the Java classes that tokenize and filter the content of a field that is to be indexed. An <analyzer> element is a child of a <fieldType> element.
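Because invalid XML prevents access to the index, it can be worth checking a local copy of a configuration file before uploading it. The following Python sketch checks only that the text is well-formed XML using the standard library parser; it does not validate the file against Solr's schema rules, and the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

good = '<schema><field name="description" type="text_intl"/></schema>'
bad = '<schema><field name="description"></schema>'
print(is_well_formed(good), is_well_formed(bad))
```

The second document fails because the <field> element is never closed.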

In addition to the managed-schema file, the Solr configuration files for an index include text files that contain lists of words to treat specially when indexing data, localization files, character set collation maps used for sorting, and a Solr server configuration file.

The following sections provide an overview of the contents of the managed-schema file and the relationships between the XML elements that define fields, field types, and analyzers. By editing the managed-schema file, you can specify at the field level how Solr indexes and stores Greenplum Database data.

For detailed documentation of the contents of the managed-schema file, refer to the comments in the file or to the Apache SolrCloud documentation.

Field Elements

GPText adds field elements to the managed-schema file for columns included when the index was created with the gptext.create_index() function. This example is the definition for a text field named description:

<field name="description" stored="false" type="text_intl" indexed="true"/>

The name attribute is the name of the database column. If the column name is not a valid Solr field name, it is altered to conform.

The stored attribute determines if the content of the field will be stored in the index. If the field is stored in the index, GPText search results can return the content of the field. If the field is not stored, retrieving the field content requires a SQL join.

The type attribute maps the Greenplum Database type to a Solr type, defined in the same file with a <fieldType> element.

The indexed attribute determines whether the field content will be indexed.

The <field> element can have additional attributes used with some types. See the comment after the <fields> element for a complete list of attributes.
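To see at a glance which fields would require a SQL join to retrieve, you can inspect the <field> elements in a local copy of the schema. The Python sketch below assumes a simplified schema fragment with a <schema> root element and reports the fields whose stored attribute is explicitly "false"; the function name unstored_fields and the sample fragment are illustrative only.

```python
import xml.etree.ElementTree as ET

def unstored_fields(schema_xml):
    """Return names of <field> elements declared with stored="false"."""
    root = ET.fromstring(schema_xml)
    return [field.get("name")
            for field in root.iter("field")
            if field.get("stored") == "false"]

# Illustrative fragment; a real managed-schema file contains much more.
schema = """
<schema>
  <field name="id" stored="true" type="long" indexed="true"/>
  <field name="description" stored="false" type="text_intl" indexed="true"/>
</schema>
"""
print(unstored_fields(schema))
```

Fields that omit the stored attribute are not reported, because their behavior then depends on defaults that this sketch does not model.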

Field Types

The type attribute of the <field> element is mapped to the name attribute of a <fieldType> element in the managed-schema file. The <fieldType> element determines how Solr parses and stores a field in the index.

The class attribute maps the field type to a Solr Java class that recognizes and processes the data type. Solr includes many base field types. See GPText and Solr Data Type Mappings for a mapping of Solr types to Greenplum Database types.

You can map a field to a different type by changing the field’s type attribute. For example, to use the GPText social media text analyzer chain, you can change the type of a text field from text_intl to text_sm. Both text_intl and text_sm use the solr.TextField class, but specify different filters in their analyzer chains.

The GPText gptext.list_field_types() function is a convenience function that lets you see the text field types defined in the managed-schema file for an index without having to edit the file. All of the types listed have the class solr.TextField.


=# SELECT * FROM gptext.list_field_types('demo.wikipedia.articles');
     list_field_types
---------------------------
 ancestor_path
 delimited_payloads_float
 delimited_payloads_int
 delimited_payloads_string
 descendent_path
 lowercase
 phonetic_en
 text
 text_ar
 text_bg
 text_ca
 text_cjk
 text_cz
 text_da
 text_de
 text_el
 text_en
 text_en_splitting
 text_en_splitting_tight
 text_es
 text_eu
 text_fa
 text_fi
 text_fr
 text_ga
 text_general
 text_general_rev
 text_gl
 text_hi
 text_hu
 text_hy
 text_icu
 text_id
 text_intl
 text_intl_prev
 text_it
 text_ja
 text_lv
 text_nl
 text_no
 text_pt
 text_ro
 text_ru
 text_sm
 text_sv
 text_th
 text_tr
 text_ws
 text_zhsmart
(49 rows)

To add a custom type, you can add a new field type by implementing Solr Java type interfaces, or you can specify an existing base type and customize it with an analyzer chain, as described in the next section.

Analyzer Chains

An analyzer examines the contents of a field or search query phrase and returns a stream of tokens used to index the field or search the index. The <analyzer> element is a child of a <fieldType> element that specifies how text will be tokenized and processed before it is indexed or applied to a search. An <analyzer> can be of type index or query.

Different indexing and query chains can be defined for indexing and querying operations by adding a type attribute to the <analyzer> element. If no type attribute appears, the chain is applied to both field text that is to be indexed and query text that searches the index.

Field analysis begins with a <tokenizer> that divides the contents of a field into tokens. In Latin-based text documents, the tokens are words or terms. In Chinese, Japanese, and Korean (CJK) documents, the tokens are characters.

The tokenizer can be followed by one or more <filter> elements which are applied in succession. Filters restrict the query results, for example, by removing unnecessary terms (“a”, “an”, “the”), converting term formats, or by performing other actions to ensure that only important, relevant terms appear in the result set. Each filter operates on the output of the tokenizer or filter that precedes it. Solr includes many tokenizers and filters that allow analyzer chains to process different character sets, languages, and transformations. See Analyzers, Tokenizers and Filters for more information.
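The way a chain passes tokens from one stage to the next can be sketched in a few lines of Python. This is a toy model, not Solr code: the tokenizer and filter functions, the stop word set, and the analyze function are all hypothetical stand-ins for the Solr classes named in an <analyzer> element.

```python
STOPWORDS = {"a", "an", "the"}  # illustrative stop word list

def whitespace_tokenizer(text):
    """Split text into tokens on whitespace."""
    return text.split()

def stop_filter(tokens):
    """Drop stop words, ignoring case."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def lowercase_filter(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]

def analyze(text, tokenizer, filters):
    """Run the tokenizer, then apply each filter to the previous output."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("The quick Fox", whitespace_tokenizer, [stop_filter, lowercase_filter]))
```

Changing the order of the filters, or using a different list for index-time and query-time analysis, corresponds to editing the <filter> elements in the index and query chains.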

Field types are assigned analyzers in an index’s managed-schema file. The following example shows the Solr text field type specification:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

An analyzer has only one tokenizer, solr.WhitespaceTokenizerFactory in this example. The tokenizer can be followed by one or more filters executed in succession.

Filters restrict the query results. Each filter operates on the output of the tokenizer or filter that precedes it. For example, the solr.StopFilterFactory filter removes unnecessary terms (“a”, “an”, “the”) from the stream of tokens. The words to filter out of the stream are listed in the stopwords.txt configuration file. You can edit the stopwords.txt file with the gptext-config utility to change the list of words excluded from the index.

There are separate analyzer types for index and query operations. The query analyzer chain in this example includes a solr.SynonymFilterFactory that looks up each token in a file synonyms.txt and, if found, returns the synonym in place of the token.

The analyzer chain can include a “stemmer”, solr.PorterStemFilterFactory in this example. The stemmer employs an algorithm to change words to their “stems”. For example, “confidential”, “confidentiality”, and “confidentis” are all stemmed to “confidenti”. Using a stemmer can dramatically reduce the size of the index, but users executing searches should be aware that some search expressions will not work as expected because of stemming. For example, searching with a wildcard such as "confidential*" will return no matches because the words were stemmed to “confidenti” during indexing. Without a wildcard, the word in the search expression is also stemmed and therefore the search succeeds.

The gptext.get_field_type() convenience function retrieves the field type definition for a field type from the managed-schema file for an index, as a JSON string. This example shows the field type definition for the Solr text field type.

=# SELECT * FROM gptext.get_field_type('demo.wikipedia.articles', 'text');
 field_type
-------------------------------------------------
 {
   "name": "text",
   "class": "solr.TextField",
   "indexAnalyzer": {
     "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
     "filters": [
       {"class": "solr.StopFilterFactory",
        "attributes": [
          {"name": "words", "value": "stopwords.txt"},
          {"name": "ignoreCase", "value": "true"}]},
       {"class": "solr.WordDelimiterFilterFactory",
        "attributes": [
          {"name": "catenateNumbers", "value": "1"},
          {"name": "generateNumberParts", "value": "1"},
          {"name": "splitOnCaseChange", "value": "1"},
          {"name": "generateWordParts", "value": "1"},
          {"name": "catenateAll", "value": "0"},
          {"name": "catenateWords", "value": "1"}]},
       {"class": "solr.LowerCaseFilterFactory"},
       {"class": "solr.KeywordMarkerFilterFactory",
        "attributes": [
          {"name": "protected", "value": "protwords.txt"}]},
       {"class": "solr.PorterStemFilterFactory"}]
   },
   "queryAnalyzer": {
     "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
     "filters": [
       {"class": "solr.SynonymFilterFactory",
        "attributes": [
          {"name": "expand", "value": "true"},
          {"name": "ignoreCase", "value": "true"},
          {"name": "synonyms", "value": "synonyms.txt"}]},
       {"class": "solr.StopFilterFactory",
        "attributes": [
          {"name": "words", "value": "stopwords.txt"},
          {"name": "ignoreCase", "value": "true"}]},
       {"class": "solr.WordDelimiterFilterFactory",
        "attributes": [
          {"name": "catenateNumbers", "value": "0"},
          {"name": "generateNumberParts", "value": "1"},
          {"name": "splitOnCaseChange", "value": "1"},
          {"name": "generateWordParts", "value": "1"},
          {"name": "catenateAll", "value": "0"},
          {"name": "catenateWords", "value": "0"}]},
       {"class": "solr.LowerCaseFilterFactory"},
       {"class": "solr.KeywordMarkerFilterFactory",
        "attributes": [
          {"name": "protected", "value": "protwords.txt"}]},
       {"class": "solr.PorterStemFilterFactory"}]
   },
   "attributes": [
     {"name": "autoGeneratePhraseQueries", "value": "true"},
     {"name": "positionIncrementGap", "value": "100"}]
 }
(1 row)

The gptext.analyzer() function lets you test an analyzer chain for a field without altering the index. It shows the output of the tokenizer and each filter in the chain. You supply the text to analyze and specify whether to test the index or the query analyzer chain. It is useful for testing tokenizers and filters and for troubleshooting search queries that do not return the expected results.

=# SELECT * FROM gptext.analyzer('demo.wikipedia.articles', 'index', 'If You Optimize Everything, You will Always be Unhappy.');
         class          | tokens
------------------------+-----------------------------------------------------------------------------------------------
 WhitespaceTokenizer    | {{"If"},{"You"},{"Optimize"},{"Everything,"},{"You"},{"will"},{"Always"},{"be"},{"Unhappy."}}
 StopFilter             | {{},{"You"},{"Optimize"},{"Everything,"},{"You"},{},{"Always"},{},{"Unhappy."}}
 WordDelimiterFilter    | {{},{"You"},{"Optimize"},{"Everything"},{"You"},{},{"Always"},{},{"Unhappy"}}
 LowerCaseFilter        | {{},{"you"},{"optimize"},{"everything"},{"you"},{},{"always"},{},{"unhappy"}}
 SetKeywordMarkerFilter | {{},{"you"},{"optimize"},{"everything"},{"you"},{},{"always"},{},{"unhappy"}}
 PorterStemFilter       | {{},{"you"},{"optim"},{"everyth"},{"you"},{},{"alwai"},{},{"unhappi"}}
(6 rows)
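The way each stage consumes the output of the previous one can be sketched in Python. This is an illustrative model, not the Solr implementation: the stop word set is an assumption (the real list comes from stopwords.txt), strip_punctuation is only a rough stand-in for the word delimiter filter, and the keyword marker and stemmer stages are left out.

```python
import string

# Assumed stop words; the real list is read from stopwords.txt.
STOP_WORDS = {"a", "an", "the", "if", "will", "be"}


def whitespace_tokenizer(text):
    """Split the input on white space only."""
    return text.split()


def stop_filter(tokens):
    """Replace stop words with None so token positions are preserved."""
    return [None if t is not None and t.lower() in STOP_WORDS else t
            for t in tokens]


def strip_punctuation(tokens):
    """Rough stand-in for the word delimiter filter: trim punctuation."""
    return [t.strip(string.punctuation) if t is not None else None
            for t in tokens]


def lower_case(tokens):
    """Lowercase every remaining token."""
    return [t.lower() if t is not None else None for t in tokens]


def analyze(text):
    """Run the chain and return (stage name, tokens) after every stage."""
    tokens = whitespace_tokenizer(text)
    stages = [("WhitespaceTokenizer", tokens)]
    for name, step in [("StopFilter", stop_filter),
                       ("WordDelimiterFilter", strip_punctuation),
                       ("LowerCaseFilter", lower_case)]:
        tokens = step(tokens)
        stages.append((name, tokens))
    return stages
```

Calling analyze() on the sentence from the example yields one token list per stage, with None playing the role of the empty {} placeholders in the gptext.analyzer() output.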

GPText Text Analyzer Chains

In addition to the text analyzer chains Solr provides, GPText provides the following text analyzer chains:

text_intl, the International Text Analyzer

text_sm, the Social Media Text Analyzer

text_intl, the International Text Analyzer

text_intl is the default GPText analyzer. It is a multiple language text analyzer for text fields. It handles Latin-based words and Chinese, Japanese, and Korean (CJK) characters.


text_intl processes documents as follows.

1. Separates CJK characters from other language text.

2. Identifies currency tokens or symbols that were ignored in the first pass.

3. For any CJK characters, generates a bigram for the CJK character and, for Korean characters only, preserves the original word.

Note that CJK and non-CJK text are treated as separate tokens. Preserving the original Korean word increases the number of tokens in a document.

Following is the definition from the Solr managed-schema template.

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField" name="text_intl" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true"/>
        <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true"/>
        <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>

Following are the analysis steps for text_intl.

1. The analyzer chain for indexing begins with a tokenizer called WorldLexerTokenizerFactory. This tokenizer handles most modern languages. It separates CJK characters from other language text and identifies any currency tokens or symbols.

2. The solr.CJKWidthFilterFactory filter normalizes the CJK characters based on character width.

3. The solr.LowerCaseFilterFactory filter changes all letters to lowercase.

4. The WorldLexerBigramFilterFactory filter generates a bigram for any CJK characters, leaves any non-CJK characters intact, and preserves original Korean-language words. Set the han, hiragana, katakana, and hangul attributes to "true" to generate bigrams for all supported CJK languages.

5. The solr.StopFilterFactory removes common words, such as “a”, “an”, and “the”, which are listed in the stopwords.txt configuration file (see To configure an index). If there are no words in the stopwords.txt file, no words are removed.

6. The solr.KeywordMarkerFilterFactory marks the English words to protect from stemming, using the words listed in the protwords.txt configuration file (see To configure an index). If protwords.txt does not contain a list of words, all words in the document are stemmed.

7. The final filter is the stemmer, in this case solr.PorterStemFilterFactory, a fast stemmer for the English language.

Note: The text_intl analyzer chain for querying is the same as the text analyzer chain for indexing.

An analyzer chain, text, is included in GPText’s Solr managed-schema and is based on Solr’s default analyzer chain. Because its tokenizer splits on white space, text cannot process CJK languages: white space is meaningless for CJK languages. Best practice is to use the text_intl analyzer.

For information about using an analyzer chain other than the default, see Using the text_sm Social Media Analyzer.

GPText Language Processing

The root-level tokenizer, WorldLexerTokenizerFactory, tokenizes international languages, including CJK languages. WorldLexerTokenizerFactory tokenizes languages based on their Unicode points and, for Latin-based languages, white space.


Note: Unicode is the encoding for all text in the Greenplum Database.

The following are sample input to, and output from, GPText. Each line in the output corresponds to a term.

English and CJK input:

₩10 대부분 english 자선 단체는.

English and CJK output:

₩10

대부분

대부

부분

english

자선

단체는

단체

체는
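The Korean terms in the output above can be reproduced with a short Python sketch of the bigram step. It is a simplified model rather than the WorldLexerBigramFilterFactory code: it starts from tokens that are already separated, treats only precomposed Hangul syllables as Korean, and the function names are hypothetical.

```python
def is_hangul(ch):
    """True for precomposed Hangul syllables (U+AC00 to U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def bigram_korean(tokens):
    """Emit each token; for Hangul words also emit their character bigrams."""
    terms = []
    for token in tokens:
        terms.append(token)  # the original word is always preserved
        if all(is_hangul(ch) for ch in token):
            for i in range(len(token) - 1):
                bigram = token[i:i + 2]
                if bigram != token:  # a two-character word is its own bigram
                    terms.append(bigram)
    return terms


if __name__ == "__main__":
    for term in bigram_korean(["₩10", "대부분", "english", "자선", "단체는"]):
        print(term)
```

Keeping the original word next to its bigrams is what makes the Korean term count grow, as noted in the processing steps earlier in this section.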

Bulgarian input:

Cъстав на nарламента: вж. nротоколи

Bulgarian output:

cъстав

на

nарламента

вж

протоколиа

Danish input:

Genoptagelse af sessionen

Danish output:

genoptagelse

af

sessionen

text_intl Filters

The text_intl analyzer uses the following filters:

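The CJKWidthFilterFactory normalizes width differences in CJK characters. It folds fullwidth ASCII variants into the equivalent basic Latin characters and halfwidth Katakana variants into the equivalent fullwidth kana.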

The WorldLexerBigramFilterFactory filter forms bigrams (pairs) of CJK terms that are generated from WorldLexerTokenizerFactory. This filter does not modify non-CJK text. WorldLexerBigramFilterFactory accepts attributes that guide the creation of bigrams for CJK scripts. For example, if the input contains HANGUL script but the hangul attribute is set to false, this filter will not create bigrams for that script. To ensure that WorldLexerBigramFilterFactory creates bigrams as required, set the CJK attributes han, hiragana, katakana, and hangul to true.

text_sm, the Social Media Text Analyzer

The GPText text_sm text analyzer analyzes text from sources such as social media feeds. text_sm consists of a tokenizer and two filters. To configure the text_sm text analyzer, use the gptext-config utility to edit the managed-schema file. See Using the text_sm Social Media Analyzer for details.

text_sm normalizes emoticons: it replaces emoticons with text using the emoticons.txt configuration file. For example, it replaces a happy face emoticon, :-), with the text “happy”.


The following is the definition from the Solr managed-schema template.

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField" name="text_sm" positionIncrementGap="100" termVectors="true" termPositions="true" termOffsets="true">
    <analyzer type="index">
        <tokenizer class="com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory" delimiter="\t" emoticons="emoticons.txt"/>
        <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
        <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory" delimiter="\t" emoticons="emoticons.txt"/>
        <filter class="com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory" delimiter="\t" emoticons="emoticons.txt"/>
        <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory" delimiter="\t" emoticons="emoticons.txt"/>
        <filter class="com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
    </analyzer>
</fieldType>

The Twitter Tokenizer

The Twitter tokenizer extends the English language tokenizer, solr.WhitespaceTokenizerFactory, to recognize the following elements as terms.

Emoticons

Hyperlinks

Hashtag keywords (for example, #keyword)

User references (for example, @username)

Numbers

Floating point numbers

Numbers including commas (for example, 10,000)

Time expressions (for example, 9:30)

The text_sm filters

com.emc.solr.analysis.socialmedia.twitter.EmoticonsClassifierFilterFactory classifies emoticons as happy, sad, or wink. It is based on the emoticons.txt file (one of the files you can edit with gptext-config), and is intended for future use, such as in sentiment analysis.

The TwitterStemFilterFactory

com.emc.solr.analysis.socialmedia.twitter.TwitterStemFilterFactory extends the solr.PorterStemFilterFactory class to bypass stemming of the social media patterns recognized by the twitter.TwitterTokenizerFactory.

The emoticons.txt file

This file contains lists of emoticons for “happy,” “sad,” and “wink.” They are separated by a tab by default. You can change the separation to any character or string by changing the value of delimiter in the social media analyzer chain. The following is a sample line from the text_sm analyzer chain:

<filter class="com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory" delimiter="\t" emoticons="emoticons.txt"/>

Using the text_sm Social Media Analyzer

The Solr managed-schema file created for an index specifies an analyzer to use to index each field. The default analyzer for text fields is text_intl. To specify the text_sm social media analyzer, you use the gptext-config utility to modify the Solr managed-schema for your index.

The steps are:

1. Create an index using gptext.create_index().

2. Use the gptext-config utility to edit the managed-schema file created for the index:

gptext-config edit -f managed-schema -i <index_name>

The managed-schema file contains a <field> element for each text field. For example:

<field name="message_text" stored="false" type="text_intl" indexed="true"/>

The type attribute specifies the analyzer to use. text_intl is the default analyzer.

3. Modify the <field> element for each text field you want to use the GPText social media analyzer and change the type attribute as follows:

<field name="text_search_col" indexed="true" stored="false" type="text_sm"/>

4. Save the managed-schema file.

Using Multiple Analyzer Chains

If you want to index a field using two different analyzer chains simultaneously, you can do this:

Create a new empty index. Then use the gptext-config utility to add a new field to the index that is a copy of the field you are interested in, but with a different name and analyzer chain.

Let us assume that your index, as initially created, includes a field to index named mytext. Also assume that this field will be indexed using the default international analyzer (text_intl).

You want to add a new field to the index’s managed-schema that is a copy of mytext and that will be indexed with a different analyzer (say the text_sm analyzer). To do so, follow these steps:

1. Create an empty index with gptext.create_index().

2. Open the index’s managed-schema file for editing with gptext-config.

3. Add a <field> in the managed-schema for a new field that will use a different analyzer chain. For example:

<field indexed="true" name="mytext2" stored="false" type="text_sm"/>

Because the type of this new field is text_sm, it is indexed using the social media analyzer rather than the default text_intl.

4. Add a <copyField> in managed-schema to copy the original field to the new field. For example:

<copyField dest="mytext2" source="mytext"/>

5. Index and commit as you normally would.


The database column mytext is now in the index twice with two different analyzer chains. One field is mytext, which uses the default international analyzer chain, and the other is the newly created mytext2, which uses the social media analyzer chain.
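The two schema additions can also be expressed as a small Python sketch that operates on an XML string. This only illustrates which elements are added: in practice you edit the file with gptext-config, the SAMPLE_SCHEMA fragment is a cut-down hypothetical schema, and add_copy_of_field is not a GPText function.

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical schema fragment used only for illustration.
SAMPLE_SCHEMA = (
    '<schema name="example">'
    '<field indexed="true" name="mytext" stored="false" type="text_intl"/>'
    '</schema>'
)


def add_copy_of_field(schema_xml, source, dest, dest_type):
    """Add a <field> for dest and a <copyField> that feeds it from source."""
    root = ET.fromstring(schema_xml)
    # New field that is analyzed with a different chain.
    ET.SubElement(root, "field", {
        "indexed": "true",
        "name": dest,
        "stored": "false",
        "type": dest_type,
    })
    # Copy the source field's text into the new field at index time.
    ET.SubElement(root, "copyField", {"dest": dest, "source": source})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_copy_of_field(SAMPLE_SCHEMA, "mytext", "mytext2", "text_sm"))
```

Note that ElementTree drops XML comments when it parses a document, which is one more reason to treat this as a model of the edit rather than a tool for rewriting a real managed-schema file.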

Using Different Analyzer Chains for Individual Fields

You can use different analyzers for individual fields by editing the managed-schema configuration file. For example, if one field contains English text and another contains Chinese language text, you can specify different analyzers for the two fields.

Example

You have a table named email_tbl with the following definition:

create table email_tbl (
    id bigint,
    english_content text,
    chinese_content text,
    timestamp date,
    username text,
    age int,
    ...  -- additional columns that are not indexed
)

You want to index the six columns shown: id, english_content, chinese_content, timestamp, username, and age.

For the column english_content, you want to use the English language analyzer called “text_en” for the text segmentation.

For the column chinese_content, you want to use the international language analyzer named “text_intl”.

Here are steps to implement this example:

1. Create the GPText index for the table.

SELECT * FROM gptext.create_index('public', 'email_tbl', 'id', 'english_content');

2. Modify the analyzer for each column in managed-schema.

$ gptext-config edit -i db.public.email_tbl -f managed-schema

3. Find the element for the english_content field.

<field name="english_content" type="*" indexed="true" stored="true"/>

Change the type attribute to text_en.

<field name="english_content" type="text_en" indexed="true" stored="true"/>

4. Find the element for the chinese_content field.

<field name="chinese_content" type="*" indexed="true" stored="true"/>

Change the type attribute to text_intl.

<field name="chinese_content" type="text_intl" indexed="true" stored="true"/>

5. Index the table.

SELECT * FROM gptext.index(TABLE(SELECT id, english_content, chinese_content, timestamp, username, age FROM email_tbl), 'db.public.email_tbl');

6. Commit the index.

SELECT * FROM gptext.commit_index('db.public.email_tbl');


The field types text_en and text_intl are defined in <fieldType> entries in the managed-schema file and then referenced in the type attribute of the <field> element.

You can define a custom field type by adding a <fieldType> entry with custom analyzers and then setting the field’s type attribute to the name of the custom field type. For example, the following “text_customize” field type is a copy of the “text_en” field type entry with the synonym filter commented out in the index analyzer. This custom field type will apply the synonym filter to queries, but not to the index.

<fieldType name="text_customize" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

A field type can also be customized by adding analyzers as child elements of the <field> element:

<field name="english_content" type="text" indexed="true" stored="false">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</field>


Working With GPText External Indexes

A GPText external index is an Apache Solr index you create in Greenplum Database to index and search documents that reside outside of Greenplum Database. External documents can be of many types, for example, PDF, Microsoft Word, XML, or HTML. Solr recognizes document types automatically, using code included from the Apache Tika project.

GPText supports indexing external documents that are accessible by URL with an HTTP GET request. GPText also supports indexing external documents stored in Hadoop, and provides functions and utilities to specify the required Hadoop configuration and authentication information.

To add external documents to an index, you supply GPText with a list of URLs in an array or as a SQL SELECT statement. The URL becomes the unique id field in the Solr index.

How GPText External Indexes Differ From Regular GPText Indexes

External indexes exist entirely in Solr: there is no associated database table in Greenplum Database. Because of this, the index name does not follow the database.schema.table pattern required for regular GPText indexes. You can choose any name for an external index, but it must not contain periods. You can access a GPText external index from any database in the Greenplum Database system that has the GPText schema installed.

GPText provides the following alternate functions for working with external indexes:

gptext.create_index_external() – create an external index.

gptext.index_external() – add documents to the external index.

gptext.search_external() – search an external index.

gptext.highlight_external() – return fragments of documents with matching search terms highlighted with markup tags.

The distribution policy for a regular GPText index is the same as the underlying Greenplum Database table, so that segments manage the same GPText table data as the Solr index shard. A GPText external index also has one shard per segment, but the documents are distributed among the segments using Solr compositeId routing, which allows Solr to choose the shard for a document. See Shards and Indexing Data in SolrCloud.

A regular GPText index only indexes and stores the database table columns you specify. A GPText external index stores and indexes the textual content of the file, as well as metadata fields that are members of the document type.

When an external document is added to the index, the content of the document is saved in the content field. The content field is stored in the index but it is not indexed.

GPText copies the following fields to the text field, the default search field, which is indexed but not stored.

title

author

description

keywords

content

content_type

resourcename

url

To search the document content, therefore, search the text field, but to retrieve or highlight document contents, use the content field.

The following common metadata fields are indexed and stored:

title

subject

description

comments

author

keywords

category

resourcename

url

content_type

last_modified

links

Read about how Solr and Tika (the “Solr Cell” framework) extract and index document text and metadata at Uploading Data with Solr Cell using Apache Tika. See a list of supported document types at Supported Document Formats.

A dynamic field named meta_* is also indexed and stored. This is a multi-valued field where Solr stores document-type-specific metadata. In search results, this field is returned as a JSON-formatted columnValue string. You can extract individual metadata by name using the gptext.gptext_retrieve_field() function.

Search results for external indexes include all fields saved with the documents, including all metadata fields. You can use the Solr field list option (fl=<field-list>) to limit the fields returned. You can also use SELECT <field-list> FROM gptext.search_external() to limit the fields returned, but it is more efficient to filter out the fields in Solr with the fl option than in the database session.

Authenticating with an External Document Source

The information in this section is applicable only to external document sources that require authentication.

If the external document source that you want to index requires authentication, you must provide the authentication configuration to GPText. You must also use GPText functions to explicitly log in to the external document source before indexing, and log out of the source after indexing completes.

Note: Authenticating is not required for searching an external document.

Uploading a Configuration to ZooKeeper

Before you use GPText to index an external document source that requires authentication, you must upload configuration information to ZooKeeper. Use the gptext-external upload command to upload this information:

gptext-external upload -t <type> -p <config_dir> -c <config_name>

This table describes the options to the gptext-external upload command:

Option          Description

<type>          The type of the external document source. The supported <type>s are ftp, hdfs, and s3.

<config_dir>    The path to a directory that contains the configuration files. The configuration information that you provide in this directory will depend on the external document source <type>.

<config_name>   The name that you assign to the configuration information. You will provide this name when you log in to the external document source.

Note: Retain a local copy of <config_dir>. Should you need to update the configuration, you must edit a local copy of the file(s) and re-upload.

Configuring and Uploading FTP Authentication

You can add documents from an FTP server that requires authentication to a GPText external index. To authenticate with the server, create a configuration directory and add a file to it named login.txt. Add three lines to the login.txt file:

The name of the user to log in to the FTP server.

The password for the FTP user, in clear text.

The maximum number of FTP connections allowed from GPText.

To upload configuration information for an authenticated FTP server:

1. Create a directory for the authentication configuration.


$ mkdir ftp_config

2. Create the login.txt file.

$ touch ftp_config/login.txt

3. Add the FTP user name, password, and the maximum number of FTP connections to create to the login.txt file on separate lines. For example:

$ echo "bill" > ftp_config/login.txt
$ echo "changeme" >> ftp_config/login.txt
$ echo "10" >> ftp_config/login.txt

4. Upload the configuration directory to ZooKeeper using the gptext-external upload command.

$ gptext-external upload -t ftp -p ./ftp_config -c ftp_bill_auth

This command maps the login.txt file in the ftp_config/ directory to the name ftp_bill_auth.

The password is base64-encoded when stored in ZooKeeper. To protect the password, delete the login.txt file after you have uploaded the configuration to ZooKeeper.
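If you prefer to script these steps, the following Python sketch writes the same three-line login.txt and removes it afterwards. It is an illustration under assumptions: the helper names are hypothetical, restricting the file mode is an extra precaution not mentioned by GPText, and the upload itself is still done with the gptext-external upload command shown above.

```python
import os
import tempfile


def write_ftp_login(config_dir, user, password, max_connections):
    """Write login.txt with user, password, and connection limit, one per line."""
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "login.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{user}\n{password}\n{max_connections}\n")
    # The password is clear text, so limit access to the file owner.
    os.chmod(path, 0o600)
    return path


def remove_ftp_login(path):
    """Delete login.txt once the configuration has been uploaded."""
    os.remove(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        login = write_ftp_login(os.path.join(tmp, "ftp_config"), "bill", "changeme", 10)
        with open(login, encoding="utf-8") as f:
            print(f.read())
        remove_ftp_login(login)
```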

Configuring and Uploading Hadoop Authentication

When you access a Hadoop external document source, <config_dir> must include the following configuration files for <type> hdfs:

The core-site.xml and hdfs-site.xml configuration files from the Hadoop server.

A file named user.txt. This file contains a single line identifying the Hadoop user name to use for authentication. If Kerberos is enabled in the Hadoop cluster, the user name in user.txt must identify the Kerberos principal for the user.

If the Hadoop cluster is secured with Kerberos, also include the user’s keytab file and the krb5.conf file for the Kerberos realm.

For example, to upload configuration information for a Hadoop external document store:

1. Create a directory for the authentication configuration files. For example:

$ mkdir hdfs_conf

2. Copy the core-site.xml and hdfs-site.xml configuration files from the Hadoop server to the configuration directory. The location of these files will differ for different Hadoop distributions. For example:

$ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/core-site.xml hdfs_conf/
$ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/hdfs-site.xml hdfs_conf/

3. Construct the user.txt file. For example, if the Hadoop user name is bill:

$ touch hdfs_conf/user.txt
$ echo "bill" > hdfs_conf/user.txt

4. Upload the Hadoop authentication configuration files for user bill to ZooKeeper. For example:

$ gptext-external upload -t hdfs -p ./hdfs_conf -c hdfs_bill_auth

This command maps the configuration information you provided in the hdfs_conf/ directory to the name hdfs_bill_auth.

Configuring and Uploading Amazon S3 Authentication

You can add documents stored as objects in an S3 bucket to a GPText external index. To authenticate with Amazon S3, you upload the credentials for an AWS account with access to the S3 bucket to ZooKeeper.

To upload authentication credentials for Amazon S3:

1. Create a directory for the authentication configuration.


$ mkdir s3_conf

2. Create the credential file.

$ touch s3_conf/credential

3. Add your Amazon account’s AWS access key and AWS secret key to the credential file on separate lines. There must be exactly two lines in the file. For example:

$ echo "<my-access-key>" > s3_conf/credential
$ echo "<my-secret-key>" >> s3_conf/credential

4. Upload the configuration directory you created to ZooKeeper using the gptext-external upload command.

$ gptext-external upload -t 's3' -p /home/gpadmin/s3_conf -c s3_auth
20180619:17:44:21:006505 gptext-external:mdw:gpadmin-[INFO]:-Execute GPText config.
20180619:17:44:21:006505 gptext-external:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20180619:17:44:22:006505 gptext-external:mdw:gpadmin-[INFO]:-Upload '/home/gpadmin/s3_conf' success.
20180619:17:44:22:006505 gptext-external:mdw:gpadmin-[INFO]:-Done.
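Because the credential file must contain exactly two lines, it can be worth checking the file before you upload it. The following Python sketch shows one possible check; the function name is hypothetical and the check reflects only the two-line rule stated above, not any validation that GPText itself performs.

```python
def check_s3_credential(text):
    """Return (access_key, secret_key), or raise ValueError on a bad layout."""
    lines = text.splitlines()
    # Exactly two lines are expected: the access key, then the secret key.
    if len(lines) != 2 or not all(line.strip() for line in lines):
        raise ValueError("credential file must contain exactly two non-empty lines")
    return lines[0].strip(), lines[1].strip()


if __name__ == "__main__":
    # Placeholder values, not real keys.
    print(check_s3_credential("<my-access-key>\n<my-secret-key>\n"))
```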

Logging In/Out of the External Document Source

Prior to indexing, you must explicitly log in to an external document source that requires authentication. Use the gptext.external_login() function for this purpose:

gptext.external_login('<type>', '<type>://<url>', '<config_name>')

The table below describes the arguments to the gptext.external_login() function:

Option            Description

<type>            The type of the external document source. Valid values are ftp, hdfs, and s3.

<type>://<url>    The URL of the external document source <type>.

<config_name>     The <config_name> you provided when you uploaded the authentication configuration with gptext-external upload.

When you invoke the gptext.external_login() function, GPText logs you in to the external document source as the user or account identified in the <config_name> configuration that you uploaded.

For example, to log in to a Hadoop document source using the hdfs_bill_auth authentication configuration you uploaded in the prior section:

SELECT * FROM gptext.external_login('hdfs', 'hdfs://<namenode_host_or_ip>:<hdfs_port>', 'hdfs_bill_auth');

The command to log in to an FTP server is similar.

=# SELECT * FROM gptext.external_login('ftp', 'ftp://<ftpserver_host_or_ip>:<ftp_port>', 'ftp_bill_auth');

It is not necessary to include :<ftp_port> in the URL if the server uses the default FTP port 21.

The connection URL for Amazon S3 has this format:

s3://<s3-endpoint>[/<region>][/]

=# SELECT * FROM gptext.external_login('s3', 's3://s3.us-west-1.amazonaws.com/', 's3_auth');
 external_login
----------------
 t
(1 row)

The <s3-endpoint> is an Amazon S3 endpoint. If the endpoint starts with s3. or s3- and is followed by a region code (for example, s3-us-west-2.amazonaws.com or s3.us-east-1.amazonaws.com) the /<region> part of the URL is optional, and GPText determines the region from the endpoint. The connection URL for an endpoint such as s3.dualstack.us-east-1.amazonaws.com, however, must include the /<region>, for example s3://s3.dualstack.us-east-1.amazonaws.com/us-east-1.
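The region rule can be sketched in Python as follows. This is an approximation written for illustration, not the logic GPText uses: the regular expression encodes an assumed shape for region codes (such as us-west-2 or ap-southeast-1), and both function names are hypothetical.

```python
import re

# Assumption: a region code looks like "us-west-2" or "ap-southeast-1" and
# directly follows the "s3." or "s3-" prefix of the endpoint.
_REGION_ENDPOINT = re.compile(
    r"^s3[.-]([a-z]{2}(?:-[a-z]+)+-\d+)\.amazonaws\.com$"
)


def region_from_endpoint(endpoint):
    """Return the region embedded in the endpoint, or None if there is none."""
    match = _REGION_ENDPOINT.match(endpoint)
    return match.group(1) if match else None


def s3_login_url(endpoint, region=None):
    """Build the connection URL, requiring a region when it cannot be inferred."""
    if region is None and region_from_endpoint(endpoint) is None:
        raise ValueError("endpoint does not identify a region; pass one explicitly")
    url = "s3://" + endpoint
    if region:
        url += "/" + region
    return url
```

With these definitions, s3_login_url("s3.dualstack.us-east-1.amazonaws.com") raises an error until a region is supplied, which mirrors the requirement described above.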

Note: You can log in to only one GPText external document source at a time. You must explicitly log out before you can log in to another external document source.

To log out of an external document source, use the gptext.external_logout('<type>') function. For example, to log out of the Hadoop cluster that you are currently logged in to:

SELECT * FROM gptext.external_logout('hdfs');

Troubleshooting Authenticated Document Stores

If you run into problems logging in to or accessing documents in an authenticated Hadoop external document store, refer to Troubleshooting Hadoop Connection Problems.

Creating External Indexes

Use the gptext.create_index_external() function to create an external index.

This example creates an external index named gptext-docs.

=# SELECT * FROM gptext.create_index_external('gptext-docs');

An external index does not have a corresponding Greenplum Database table, so the index name does not follow the database.schema.table pattern required for regular GPText indexes. The only restriction is that the name for an external index must not contain periods.
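The naming rule is simple enough to check before you call gptext.create_index_external(). The following Python sketch contrasts the two naming schemes; both helpers are hypothetical, and the regular index check is a simplification that ignores quoted identifiers.

```python
def is_external_index_name(name):
    """External index names may be anything non-empty without periods."""
    return bool(name) and "." not in name


def is_regular_index_name(name):
    """Regular index names follow the database.schema.table pattern."""
    parts = name.split(".")
    return len(parts) == 3 and all(parts)
```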

By default, the external index is created with one shard for each Greenplum Database segment. You can specify fewer shards by setting the GPText gptext.idx_num_shards configuration parameter to the number of shards you want before you create the index. See Specifying the Number of Index Shards for more information about this option.

Adding Documents to an External Index

To add external documents to an external index, supply a list of URLs where Solr can retrieve the document to the gptext.index_external() function. URLs may be specified either in an array or as a SQL result set.

A hash of the URL is the document’s ID in the index. If a URL has already been added to the index, the file is not reindexed. If you add two identical files retrieved from different URLs, both files are added to the index.
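The consequence of keying documents by URL rather than by content can be shown with a toy model in Python. This is not how Solr stores documents: the ToyExternalIndex class is hypothetical and SHA-1 is chosen only for the illustration, since the text does not say which hash is used.

```python
import hashlib


class ToyExternalIndex:
    """Toy model of an index whose document ID is a hash of the URL."""

    def __init__(self):
        self.docs = {}

    def add(self, url, content):
        """Add a document; return False if the URL is already indexed."""
        # SHA-1 is an assumption made for this sketch.
        doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if doc_id in self.docs:
            return False  # same URL: the file is not reindexed
        self.docs[doc_id] = content
        return True
```

In this model, adding the same URL twice keeps one document even if the content changed, while identical content fetched from two URLs produces two documents.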

This example adds a single PDF document, specified in an array, to the gptext-docs index.

=# SELECT * FROM gptext.index_external('{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}', 'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    2 |        1
(2 rows)

=# SELECT * FROM gptext.commit_index('gptext-docs');
 commit_index
--------------
 t
(1 row)

This example adds several HTML documents by selecting URLs from a database table.


=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (id bigint, url text) DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES
    (1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'),
    (2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'),
    (3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
    (4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
    (5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
    (6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
    (7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
    (8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'),
    (9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
    (10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
    (11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
    (12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
    (13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
    (14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
    (15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(TABLE(SELECT url FROM gptext_html_docs), 'gptext-docs');
 dbid | num_docs
------+----------
    3 |        6
    2 |        8
(2 rows)
=# SELECT * FROM gptext.commit_index('gptext-docs');
 commit_index
--------------
 t
(1 row)

Toadddocumentsfromanexternaldocumentsourcethatrequiresauthentication,suchashdfsoranftpserver,logintotheexternalsystemwiththegptext.external_login() functionbeforeyouaddthedocuments.Withanauthenticateddocumentsource,youcanaddalldocumentsinadirectory,usingthegptext.external_index_dir() function.Seethe gptext.external_index_dir() functionreferenceforanexample.

Searching GPText External Indexes

You can search GPText external indexes with the standard gptext.search() function or with the gptext.search_external() function. The difference is that the gptext.search() function returns just the id, score, hs, and rf columns, and the gptext.search_external() function by default also includes all of the content and metadata stored in the external index. You can use the Solr fl (field list) option with either function to set the actual fields that are included in the results.

Searching with gptext.search()

This simple gptext.search() example searches for “Solr” in the title field of the gptext-docs external index.

    =# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
         'title:Solr', null, null);
                                id                             |   score   | hs | rf
    -----------------------------------------------------------+-----------+----+----
     http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 |    |
    (1 row)

To see the title of the document that matched the search, you must request the field with an fl option.

    =# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
         'title:Solr', null, 'fl=title');
                                id                             |   score   | hs | rf
    -----------------------------------------------------------+-----------+----+------------------------------------------------------------------------------------------------------------
     http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 |    | {"columnValue":[{"name":"title","value":"GPText and Solr Data Type Mappings |\nPivotal GPText Docs"}]}
    (1 row)

The title field specified in the field list of the Solr options argument is returned in the rf column in a JSON document. If you want to return the title in its own result column, you can use the gptext.gptext_retrieve_field() function to extract the text from the JSON document. The expanded display (\x on) psql option in the following examples makes the results easier to read.

    =# \x on
    Expanded display is on.
    =# SELECT id, score, gptext.gptext_retrieve_field(rf, 'title') title
       FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
         'title:Solr', null, 'fl=title');
    -[ RECORD 1 ]----------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/type_ref.html
    score | 0.9745732
    title | GPText and Solr Data Type Mappings |
          | Pivotal GPText Docs
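If a client application receives the raw rf value instead of calling gptext.gptext_retrieve_field() in the query, the same lookup can be done on the client side. The following Python sketch assumes only the columnValue layout shown in the example output above; the function name retrieve_field is hypothetical and is not part of GPText.

```python
import json


def retrieve_field(rf_json, field_name):
    """Return the value stored under field_name in a columnValue document.

    Returns None when the column is empty or the field is not present.
    """
    if not rf_json:
        return None
    document = json.loads(rf_json)
    for entry in document.get("columnValue", []):
        if entry.get("name") == field_name:
            return entry.get("value")
    return None


# Sample rf value copied from the search result shown earlier.
rf = (r'{"columnValue":[{"name":"title",'
      r'"value":"GPText and Solr Data Type Mappings |\nPivotal GPText Docs"}]}')

print(retrieve_field(rf, "title"))
```

The JSON \n escape is decoded into a real line break, which is why the title spans two lines in the expanded psql display.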

Searching with gptext.search_external()

The gptext.search_external() function, by default, returns a standard set of metadata fields and the content of the document. Depending on the content type of the document, gptext.search_external() returns additional metadata as a JSON document in the meta column.

The following example search returns all fields stored in the gptext-docs index for the document with the word “Installing” in the title field. The content and meta column values in the example results are truncated.

    =# SELECT * FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1),
         'gptext-docs', 'title:Installing', null, null);
    -[ RECORD 1 ]-+----------------------------------------------------------------------------
    id            | http://gptext.docs.pivotal.io/latest/topics/installing.html
    title         | Installing GPText |
                  | Pivotal GPText Docs
    subject       |
    description   |
    comments      |
    author        |
    keywords      |
    category      |
    resourcename  |
    url           |
    content_type  | text/html; charset=UTF-8
    last_modified |
    links         |
    sha256        | F1182EE7D993CB494CAB8480DA47EA2F82DE8F7DCCC4E76745B6FA5FD7E73FC8
    content       | ...
    score         | 1.4449482
    meta          | {"columnValue":[{"name":"meta_a","value":"..."},{"name":"meta_content_encoding","value":"UTF-8"},{"name":"meta_dc_title","value":"Installing GPText |\nPivotal GPText Docs"},{"name":"meta_div","value":"..."},{"name":"meta_form","value":"application/x-www-form-urlencoded,get,/search"},...

You usually only want a subset of the fields in the index. You can specify the fields you want in the SELECT clause or by adding the fl Solr option in the options argument of the gptext.search_external() function. Even if you list the desired fields in the SELECT clause, specifying a field list in the options argument is more efficient because it reduces the amount of data Solr transfers to Greenplum Database.

This example searches for HTML documents that have the word “Indexes” in the title field. A filter query chooses documents with “html” in the content_type field. The field list in the options argument contains just the title field. The id, score, and meta fields are always included in search results.

    =# SELECT id, title, score
       FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
         'title:indexes', '{content_type:*html*}', 'fl=title');
                                   id                                |                 title                  |   score
    -----------------------------------------------------------------+----------------------------------------+-----------
     http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html    | Working With GPText External Indexes | | 1.1593812
                                                                     : Pivotal GPText Docs
     http://gptext.docs.pivotal.io/latest/topics/managed-schema.html | Customizing GPText Indexes |           | 1.1191859
                                                                     : Pivotal GPText Docs
     http://gptext.docs.pivotal.io/latest/topics/indexes.html        | Working With GPText Indexes |          | 0.8013617
                                                                     : Pivotal GPText Docs
     http://gptext.docs.pivotal.io/latest/topics/queries.html        | Querying GPText Indexes |              | 0.8013617
                                                                     : Pivotal GPText Docs
    (4 rows)

Highlighting External Index Search Results

Solr highlighting includes fragments of documents that match a search query in the search results, with the query terms highlighted with markup tags. Fragments are also called snippets or passages.

Highlighting with GPText external indexes is a different process than highlighting with regular GPText indexes. Because the text and all metadata of external documents are stored in an external index, the markup tags can be applied in Solr before returning search results to Greenplum Database. With regular indexes, highlighting can be performed only for fields with terms enabled, and then search results must be joined with the database table so that the gptext.highlight() function can insert the markup tags into the text. You can, however, configure a regular GPText index so that you store the fields in the index and perform highlighting in Solr. This requires editing the index’s solrconfig.xml and managed-schema configuration files. See Highlighting Terms in Stored Fields for steps to enable this configuration.

Solr highlighting is performed by a search handler called a HighlightComponent, configured in the managed-schema configuration file. Solr provides highlighters that work somewhat differently and have different configurable options. GPText uses the Unified Highlighter by default. See Highlighting at the Apache Solr website to learn more about Solr highlighting and the Unified Highlighter.

You can enable highlighting for GPText external indexes in the Solr options argument of a gptext.search() query. Using this method, the highlighted text is returned in a result column named hs, which contains a JSON-formatted array of highlighted fragments. You can access the fragments using the gptext.gptext_retrieve_field() function.

In addition, GPText provides the gptext.highlight_external() function, which unpacks highlighted fragments in the search results into separate columns in the Greenplum Database search result set.

First, let’s look at the results of a search query with highlighting enabled using the Solr options argument in the gptext.search() function. This statement searches the gptext-docs external index for documents containing the term “apache”. The Solr options are:

hl=true – enables highlighting.

hl.fl=content – the content field will be highlighted.

rows=1 – returns just one document per segment.

    =# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
         'apache', '{content_type:*html*}', 'hl=true&hl.fl=content&rows=1');
    -[ RECORD 1 ]--------------------------------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/ha.html
    score | 0.4548784
    hs    | {"columnValue":[{"name":"content","value":"Refer to the \u003cem\u003eApache\u003c/em\u003e SolrCloud documentation for help using the SolrCloud Dashboard.\n\n\n"}]}
    rf    |
    -[ RECORD 2 ]--------------------------------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
    score | 0.05978464
    hs    | {"columnValue":[{"name":"content","value":"Remarks\n\n\nWhen you add an external document to the index, \u003cem\u003eApache\u003c/em\u003e Tika extracts a core set of metadata from the document, the columns listed in the Return type section."}]}
    rf    |
    -[ RECORD 3 ]--------------------------------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
    score | 1.2426406
    hs    | {"columnValue":[{"name":"content","value":"Solr recognizes document types automatically, using code included from the \u003cem\u003eApache\u003c/em\u003e Tika project.\n\n\n"}]}
    rf    |
    -[ RECORD 4 ]--------------------------------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/administering.html
    score | 0.8155949
    hs    | {"columnValue":[{"name":"content","value":"ZooKeeper Administration\n\n\n\u003cem\u003eApache\u003c/em\u003e ZooKeeper enables coordination between the \u003cem\u003eApache\u003c/em\u003e Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system."}]}
    rf    |

In this example the hs column has a single fragment from each of the returned documents. You can use the hl.snippets and hl.fragsize Solr options to set, respectively, the maximum number of fragments to return and the approximate number of characters in each fragment. Other options you can use to control how the Unified Highlighter chooses fragments are hl.bs.type and hl.maxAnalyzedChars. The hl.bs.type option specifies how the highlighter breaks the text into fragments. The default is SENTENCE. Valid choices are SEPARATOR, SENTENCE, WORD, CHARACTER, LINE, or WHOLE. The hl.maxAnalyzedChars option, default 51200, is the maximum number of characters to analyze for highlighting.

See Highlighting in the Solr documentation for tables of options you can set and their default values.
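The Solr options argument is a single string of name=value pairs joined with & characters. When an application assembles several highlighting options, a small helper keeps the string consistent. The Python sketch below is only an illustration: the helper name solr_options is hypothetical, and it applies no escaping because the examples in this guide pass option values verbatim.

```python
def solr_options(options):
    """Join a mapping of Solr option names and values into 'name=value&...'.

    Booleans are written as the lowercase 'true' and 'false' that Solr expects.
    Values are not escaped; pass only values that are safe to use verbatim.
    """
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{name}={value}")
    return "&".join(parts)


highlight_options = solr_options({
    "hl": True,          # enable highlighting
    "hl.fl": "content",  # field to highlight
    "hl.snippets": 3,    # maximum number of fragments per document
    "hl.fragsize": 75,   # approximate fragment size in characters
    "rows": 1,           # one document per segment
})
print(highlight_options)
```

The resulting string can be passed as the last argument of gptext.search().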

For an external index, Solr returns the highlighted fragments in a columnValue array in the hs result column. You can use the gptext.gptext_retrieve_field() function in the SELECT list to extract the fragments from the array.

    =# SELECT id, score, gptext.gptext_retrieve_field(hs, 'content') AS content
       FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'apache',
         '{content_type:*html*}',
         'hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=75&rows=1');
    -[ RECORD 1 ]--------------------------------------------------------------------------------
    id      | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
    score   | 0.05978464
    content | Remarks
            |
            |
            | When you add an external document to the index, <em>Apache</em> Tika extracts a core set of metadata from the document, the columns listed in the Return type section.
    -[ RECORD 2 ]--------------------------------------------------------------------------------
    id      | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
    score   | 1.2426406
    content | Solr recognizes document types automatically, using code included from the <em>Apache</em> Tika project.
            |
            |
            | ,See Highlighting at the <em>Apache</em> Solr website to learn more about Solr highlighting and the Unified Highlighter.
            |
            |
            | ,This statement searches the gptext-docs external index for documents containing the term “<em>apache</em>”.
    -[ RECORD 3 ]--------------------------------------------------------------------------------
    id      | http://gptext.docs.pivotal.io/latest/topics/administering.html
    score   | 0.8155949
    content | ZooKeeper Administration
            |
            |
            | <em>Apache</em> ZooKeeper enables coordination between the <em>Apache</em> Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system.
    -[ RECORD 4 ]--------------------------------------------------------------------------------
    id      | http://gptext.docs.pivotal.io/latest/topics/ha.html
    score   | 0.4548784
    content | Refer to the <em>Apache</em> SolrCloud documentation for help using the SolrCloud Dashboard.
            |

GPText found four documents (1 for each segment) containing the string “apache” and extracted from each document a fragment of the content field containing the term.
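In the raw hs column the markup tags appear as JSON escapes (\u003cem\u003e), while the extracted content shows them as plain <em> tags. The following Python sketch shows that decoding the JSON is enough to recover the tags, and then lists the highlighted terms. It assumes the default <em> tags shown in the output above; the function name highlighted_terms is hypothetical.

```python
import json
import re


def highlighted_terms(hs_json, field_name):
    """Return the text found inside each <em>...</em> pair for one field."""
    document = json.loads(hs_json)
    terms = []
    for entry in document.get("columnValue", []):
        if entry.get("name") == field_name:
            # json.loads has already turned \u003c and \u003e into < and >.
            terms.extend(re.findall(r"<em>(.*?)</em>", entry.get("value", "")))
    return terms


# Shortened hs value in the same shape as the search output shown earlier.
hs = (r'{"columnValue":[{"name":"content","value":'
      r'"\u003cem\u003eApache\u003c/em\u003e ZooKeeper enables coordination '
      r'between the \u003cem\u003eApache\u003c/em\u003e Solr processes."}]}')

print(highlighted_terms(hs, "content"))
```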

Indexing Text Embedded in Images

Using optical character recognition (OCR), you can add text extracted from images to GPText external indexes.

When the open source Tesseract OCR Engine software is installed on the hosts in the Greenplum Database cluster, Apache Tika calls Tesseract to extract text from image files (for example, GIF, TIFF, JPG, or PNG files) and from images embedded in documents such as Word documents. The extracted text is added to the index with the document.

Running OCR on images adds significant overhead and time to GPText indexing operations. You can prevent Tika from running OCR by uninstalling Tesseract or by configuring indexes to not run Tesseract. See Disabling OCR With a Tika Configuration File for the steps to configure GPText indexes to exclude OCR.

Install Tesseract OCR Engine on All Greenplum Database Hosts

You can install Tesseract OCR by compiling and installing the source code or by installing a Tesseract package with yum.

To compile and install Tesseract from source, follow the compilation instructions on the Tesseract OCR GitHub wiki.

To install with a package manager, follow installation instructions on the Tesseract OCR GitHub wiki.

Be sure to install Tesseract on every host in the Greenplum Database cluster.

Indexing Extracted Image Content

You can test the Tesseract OCR engine at the command line with an image file containing embedded text.

    $ tesseract <image-file> ocr-out
    $ cat ocr-out.txt

Any text Tesseract recognizes is saved in the ocr-out.txt file.

To test that Tesseract is called when adding documents to a GPText external index, you can use the gptext.extract_rich_doc() function. This GPText function returns the content Apache Tika extracts from a document but does not add it to the index. You need a GPText external index and a URL for an image file containing text for Tesseract to extract. If the image file is in an authenticated FTP, HDFS, or S3 document store, make sure you first authenticate with gptext.external_login().

This example creates an external index and calls gptext.extract_rich_doc() to extract text from an image file with an HTTP URL.

    =# SELECT * FROM gptext.create_index_external('ocr-test');
    INFO:  Created index ocr-test
     create_index_external
    -----------------------
     t
    (1 row)

    =# SELECT * FROM gptext.extract_rich_doc('ocr-test',
         'http://gptext.docs.pivotal.io/300/ocrtest.png');
                                  stream_name                              | title | author | keywords | created | modified |           content
    -----------------------------------------------------------------------+-------+--------+----------+---------+----------+------------------------------
     http://docs-gptext-develop-staging.cfapps.io/300/graphics/ocrtest.png |       |        |          |         |          | ... (empty lines omitted)
                                                                                                                             : The quick brown fox
                                                                                                                             : jumps over the lazy dog.
                                                                                                                             :
                                                                                                                             : The five boxing wizards jump
                                                                                                                             : quickly.
                                                                                                                             ... (empty lines omitted)
    (1 row)

When Tesseract is installed, Apache Tika automatically runs Tesseract when a document URL references an image file or when an image is embedded in another document type. Text extracted from images is included with documents whenever you call the GPText gptext.index_external() or gptext.index_external_dir() functions.

Disabling OCR With a Tika Configuration File

If you do not need to index text embedded in images for a GPText index, follow these steps to exclude Tesseract OCR from the document indexing process.

1. On the master host, create a tika.xml file with the following content.

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
          <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
      </parsers>
    </properties>

2. Copy the tika.xml file to each Greenplum Database host. This gpscp example copies the tika.xml file to the /home/gpadmin directory on every host listed in the hostlist file.

    $ gpscp -f hostlist tika.xml =:/home/gpadmin

3. Modify the solrconfig.xml configuration file for each external index that does not require OCR.

    $ gptext-config edit -i <external-index-name> -f solrconfig.xml

Search for the following /update/extract request handler, and insert a <str> element containing the path to the tika.xml file as shown here.

    <requestHandler name="/update/extract"
                    startup="lazy"
                    class="com.emc.solr.handler.extraction.SHA256CheckExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">meta_</str>
        <str name="captureAttr">true</str>
      </lst>
      <str name="tika.config">/home/gpadmin/tika.xml</str>
    </requestHandler>
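Before copying tika.xml to every host, it can be worth confirming that the file is well formed and really excludes the Tesseract parser. The Python sketch below is one possible check using only the standard library. The function name excludes_tesseract is hypothetical, and the check looks only for the parser-exclude element from step 1; it does not validate the file against Tika itself.

```python
import xml.etree.ElementTree as ET

TESSERACT_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser"


def excludes_tesseract(tika_xml):
    """Return True if a <parser-exclude> element names the Tesseract parser.

    tika_xml is the content of the configuration file as bytes.
    Raises xml.etree.ElementTree.ParseError if the XML is not well formed.
    """
    root = ET.fromstring(tika_xml)
    for exclude in root.iter("parser-exclude"):
        if exclude.get("class") == TESSERACT_PARSER:
            return True
    return False


# Same content as the tika.xml file created in step 1.
TIKA_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
"""

print(excludes_tesseract(TIKA_XML))
```

To check a file on disk, read it in binary mode and pass the bytes to the function.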

Natural Language Processing with GPText Indexes

Pivotal GPText includes Apache OpenNLP components to allow you to use named entity recognition (NER). Named entities include the names of people, organizations, and locations. OpenNLP also recognizes parts of speech (POS). The OpenNLP libraries and models required for English language recognition are included with GPText. For non-English language documents, you can upload to ZooKeeper any of the other models available from the OpenNLP project.

A GPText index that includes NER and POS tagging must have terms enabled, using the gptext.enable_terms() function. You add a text field definition to the index’s configuration, adding POS and NER filters to the analysis chain after the tokenizer. The filters use the OpenNLP models you specify to recognize entities in documents and classify parts of speech. Tokens recognized are tagged and saved as terms in the field’s term vector.

NER-tagged terms have the format _ner_<entity-type>_<token>, where <entity-type> is the type of entity, for example person or location, and <token> is the text of the token, produced by the tokenizer. Terms are not case-sensitive. Examples of NER-tagged terms are _ner_person, _ner_person_Alan, and _ner_location_boston. A term like _ner_person matches any person, including a more specific term like _ner_person_alan.

The POS English language model uses part-of-speech tags from the University of Pennsylvania (Penn) Treebank project. POS-tagged terms have the format _pos_<tag>, where <tag> is a Penn Treebank part-of-speech tag. Examples of POS-tagged terms are _pos_nn, _pos_vb, and _pos_rb, for nouns, verbs, and adverbs, respectively.
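The two term formats are simple enough to build programmatically when an application generates queries. The Python sketch below follows the formats described above. The function names ner_term and pos_term are hypothetical, and lowercasing is only a normalization choice, since the terms are not case-sensitive.

```python
def ner_term(entity_type, token=None):
    """Build an NER search term: _ner_<entity-type> or _ner_<entity-type>_<token>."""
    term = f"_ner_{entity_type}"
    if token is not None:
        term += f"_{token}"
    return term.lower()


def pos_term(tag):
    """Build a POS search term: _pos_<tag>, where tag is a Penn Treebank tag."""
    return f"_pos_{tag}".lower()


print(ner_term("person"))              # any person
print(ner_term("person", "Alan"))      # a specific person
print(ner_term("location", "boston"))  # a specific location
print(pos_term("NN"))                  # nouns
```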

Enabling NER for GPText Indexes

The example in this section shows how to add NER support to a GPText index. The example works with a table named news_demo in the Greenplum Database demo database.

1. Download the CSV data file for the table to the gpadmin home directory from this link: news_demo.csv.tgz. Extract the news_demo.csv file from the downloaded file with the following command.

    $ tar xvf news_demo.csv.tgz

2. Log in to the demo database with psql and create and load the news_demo table.

    =# CREATE TABLE news_demo (
         id bigint,
         articleid varchar(50),
         news_date date,
         headline text,
         content text)
       DISTRIBUTED BY (id);

Load data into the table from the news_demo.csv data file.

    =# COPY news_demo FROM '/home/gpadmin/news_demo.csv' WITH CSV HEADER;

3. Create the GPText index and enable terms for the content field.

    =# SELECT * FROM gptext.create_index('public', 'news_demo', 'id', 'content');
    =# SELECT * FROM gptext.enable_terms('demo.public.news_demo', 'content');

4. Edit the managed-schema file for the demo.public.news_demo index using the gptext-config utility.

    $ gptext-config edit -i demo.public.news_demo -f managed-schema

Add the following text_opennlp field type definition to the list of <fieldType> elements.

    <fieldType name="text_opennlp" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.OpenNLPTokenizerFactory"
                   sentenceModel="en-sent.bin"
                   tokenizerModel="en-token.bin"/>
        <filter class="solr.OpenNLPPOSFilterFactory"
                posTaggerModel="en-pos-maxent.bin"/>
        <filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
                nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
        <filter class="solr.StopFilterFactory" words="stopwords-ner.txt" ignoreCase="true"/>
        <filter class="com.emc.solr.analysis.opennlp.NERAndTypeAttributeAsSynonymFilterFactory"
                extractType="true" typePrefix="_pos_"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" pattern="^(_ner_|_pos_).+$"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

Find the content field and change the type attribute to "text_opennlp".

    <field name="content" type="text_opennlp" indexed="true" termOffsets="true"
           stored="false" termPositions="true" termPayloads="true" termVectors="true"/>

5. Index the documents and commit the index.

    =# SELECT * FROM gptext.index(TABLE(SELECT * FROM news_demo), 'demo.public.news_demo');
    =# SELECT * FROM gptext.commit_index('demo.public.news_demo');

Example Search Queries for NER-Enabled Indexes

Following are example queries that search for NER-tagged terms in the demo.public.news_demo index.

Retrieve NER person offsets

This query retrieves an array of locations for NER person terms in documents that contain NER persons.

    =# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
         'demo.public.news_demo', '_ner_person', NULL,
         'hl=true&hl.fl=content&rows=10&sort=score desc');

Following are results from this search (with some rows omitted for space).

        id     |   score   | hs | rf
    -----------+-----------+----+----
     842613544 | 0.7074248 | {"fieldOffsets":[{"field":"content","offsets":[{"start":14,"end":28},{"start":40,"end":44},{"start":251,"end":261},{"start":726,"end":736},{"start":896,"end":909},{"start":1093,"end":1103},{"start":1118,"end":1133},{"start":1184,"end":1187},{"start":1188,"end":1194},{"start":1253,"end":1258}]}]} |
     842613572 | 0.7059102 | {"fieldOffsets":[{"field":"content","offsets":[{"start":61,"end":65},{"start":500,"end":512},{"start":547,"end":563},{"start":711,"end":715},{"start":883,"end":896},{"start":965,"end":969},{"start":1065,"end":1078}]}]} |

    (ROWS OMITTED)

     842613594 | 0.5854553 | {"fieldOffsets":[{"field":"content","offsets":[{"start":520,"end":533},{"start":559,"end":564},{"start":968,"end":982},{"start":987,"end":1000},{"start":1509,"end":1512}]}]} |
     842614457 | 0.5810676 | {"fieldOffsets":[{"field":"content","offsets":[{"start":400,"end":423},{"start":723,"end":733},{"start":812,"end":827},{"start":963,"end":970},{"start":1181,"end":1188}]}]} |
    (40 rows)
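Each offsets entry marks where a matching term starts and ends in the field text. The Python sketch below shows how such offsets can be turned into markup, which is conceptually what the gptext.highlight() function does inside the database. It is not the GPText implementation. It assumes that offsets are zero-based character positions with an exclusive end and that spans do not overlap, which is consistent with the first two offsets of document 842613544; the function name apply_offsets is hypothetical.

```python
import json


def apply_offsets(text, hs_json, field_name, open_tag="<em>", close_tag="</em>"):
    """Wrap each offset span of field_name in text with the given tags."""
    document = json.loads(hs_json)
    spans = []
    for field in document.get("fieldOffsets", []):
        if field.get("field") == field_name:
            spans.extend((o["start"], o["end"]) for o in field.get("offsets", []))
    # Work from the end of the text so that earlier offsets remain valid
    # after tags have been inserted.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


# Opening words of document 842613544 and its first two offsets.
content = "WASHINGTON -- William Taylor, President Bush's nominee"
hs = ('{"fieldOffsets":[{"field":"content","offsets":'
      '[{"start":14,"end":28},{"start":40,"end":44}]}]}')

print(apply_offsets(content, hs, "content"))
```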

Retrieve documents containing an NER person term

This query retrieves the content of documents in the news_demo table with terms tagged _ner_person highlighted.

    =# SELECT news_demo.id,
              gptext.highlight(news_demo.content, 'content', hs) AS content,
              s.score
       FROM news_demo,
            gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
              '{!gptextqp}_ner_person', NULL,
              'hl=true&hl.fl=content&rows=10&sort=score desc') s
       WHERE news_demo.id = s.id::bigint
       ORDER BY s.score DESC;

Following are three rows of the output from this command with the headings and empty lines omitted.

    842613544 | WASHINGTON -- <em>William Taylor</em>, President <em>Bush</em>'s nominee to run the nation's deposit insurance system, said bank regulators could have done a better job policing the Bank of Credit & Commerce International. "I think we have learned a series of things," <em>Mr. Taylor</em>, currently the head of bank supervision at the Federal Reserve, told the Senate Banking Committee at his confirmation hearing. "You shouldn't allow someone in the country that doesn't have supervision from a strong home-country supervisor," he said. BCCI, with operations in the Middle East, Africa, Europe and the U.S., fraudulently hid huge losses for months from regulators around the world. No U.S. depositors lost money. Questions about BCCI's actions during <em>Mr. Taylor</em>'s tenure at the Federal Reserve were expected to be the only serious hurdle to his confirmation. But committee members seemed satisfied with his remarks. Sen. <em>Donald Riegle</em> (D., Mich.), chairman of the committee, said that he expects the committee will recommend the confirmation and that the Senate will vote within a few weeks. If confirmed as expected, <em>Mr. Taylor</em> would succeed <em>William Seidman</em>, whose term expires next month. In his testimony, <em>Mr.</em> <em>Taylor</em> said he remains troubled by lingering questions involving <em>BCCI.</em> In the U.S., BCCI was able to evade government prohibitions from purchasing stock in First American Bankshares Inc. by using front men. "I really have difficulty in knowing how we're going to uncover (such) arrangements anywhere," he said. "I really think it's difficult to determine when two people conspire to change the control of an organization." | 0.7074248

    842613572 | WASHINGTON -- The House, in a stunning victory for President <em>Bush</em>, agreed to cut the tax on capital gains, soundly rejecting an alternative proposed by Democratic leaders. After weeks of intense lobbying by both sides, the leadership's plan was defeated by a larger-than-expected 239-190 vote. The convincing margin increases the likelihood that a capital gains cut of some sort could become law this year. The vote was a blow to the House's newly elected Democratic leadership, particularly Speaker <em>Thomas Foley</em> of Washington and Majority Leader <em>Richard Gephardt</em> of Missouri. Both had put their personal prestige on the line to defeat the tax-cut measure, which represented their first major showdown with the <em>Bush</em> administration. Still, fully one-quarter of their membership -- 64 Democrats -- deserted them and sided with a near-solid phalanx of Republicans. Only one Republican, <em>Doug Bereuter</em> of Nebraska, broke ranks and voted, against the wishes of President <em>Bush</em>, for the Democratic alternative. "This was a watershed for us," glowed House Republican Leader <em>Robert Michel</em> of Illinois. | 0.7059102

    842613885 | <em>John Kerry</em>, seizing the chance to define his candidacy before a national television audience with his presidential nomination acceptance speech, took the fight straight to the two areas where President <em>Bush</em> has enjoyed his greatest political strengths: national security and social values. Rather than shying away from ground that has sometimes been shaky for Democrats, Mr. <em>Kerry</em> planted his own flag in a forceful and at times combative speech. "Let there be no mistake: I will never hesitate to use force when it is required," the Massachusetts senator told 4,000 cheering delegates on the final night of the Democratic convention in Boston. "Any attack will be met with a swift and certain response," he continued, attempting to meet widespread and persistent voter questions about whether a Democrat, even a war veteran, is tough enough to lead the country in fighting terrorism. At one point, Mr. <em>Kerry</em> appeared to belittle Mr. <em>Bush</em>'s record as commander in chief, especially his justification for the war in Iraq. "Now I know there are those who criticize me for seeing complexities -- and I do -- because some issues just aren't all that simple," he said. "Saying there are weapons of mass destruction in Iraq doesn't make it so." It was one of several oblique shots Mr. <em>Kerry</em> took at the president and his advisers, even as he also called directly on President <em>Bush</em> to run a positive campaign. Confronting another of his party's vulnerabilities -- a perception that Democrats are out of the cultural mainstream -- Mr. <em>Kerry</em>'s 45-minute speech tackled President <em>Bush</em> on social issues. "It's time for those who talk about family values to start valuing families," he said. | 0.7014734

Retrieve documents containing specified NER person terms

This search returns the content of documents that contain the persons “Alan” and “Bush”, with the names highlighted.

    =# SELECT news_demo.id,
              gptext.highlight(news_demo.content, 'content', hs) AS content,
              s.score
       FROM news_demo,
            gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
              '_ner_person_Alan AND _ner_person_Bush', NULL,
              'hl=true&hl.fl=content&rows=10&sort=score desc') s
       WHERE news_demo.id = s.id::bigint
       ORDER BY s.score DESC;

Retrieve documents containing both NER organization and time terms

This search finds documents that contain both NER organization and time terms, with the terms highlighted.

    =# SELECT news_demo.id,
              gptext.highlight(news_demo.content, 'content', hs) AS content,
              s.score
       FROM news_demo,
            gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
              '_ner_organization AND _ner_time', NULL,
              'hl=true&hl.fl=content&rows=10&sort=score desc') s
       WHERE news_demo.id = s.id::bigint
       ORDER BY s.score DESC;

Following is an example row returned by this query.

    842613848 | NEW YORK -- U.S. oil futures declined Tuesday as traders were reluctant to place big bets while <em>Federal Reserve</em> officials debated the future of the central bank's key economic stimulus program. Light, sweet crude for January delivery settled 26 cents, or 0.3%, lower at $97.22 a barrel on the <em>New York Mercantile Exchange</em>. Nymex prices traded in a narrow range for most of the session as market participants chose to wait until Wednesday <em>afternoon</em> for potential clarity on the <em>Fed</em>'s easy-money policies. "It's a directionless trade," said John Kilduff, founding partner of <em>Again Capital LLC</em>, a New York hedge fund that focuses on energy, referring to the lack of significant price movement. He added, "You can make a strong argument on both sides, and there's a lot of room for the <em>Fed</em> to surprise us either way." Many traders expect the <em>Fed</em> to begin scaling back its so-called quantitative-easing program, in which it buys $85 billion each month in mortgage-backed securities and longer-term <em>Treasury</em> bonds, in the near future. The program has boosted oil prices by weakening the dollar, making crude cheaper to buy with other currencies.

Retrieve documents containing NER person or organization terms

This search returns documents containing an NER person term or an NER organization term, or both, with the terms highlighted.

    =# SELECT news_demo.id,
              gptext.highlight(news_demo.content, 'content', hs) AS content,
              s.score
       FROM news_demo,
            gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
              '_ner_person _ner_organization', NULL,
              'hl=true&hl.fl=content&rows=10&sort=score desc') s
       WHERE news_demo.id = s.id::bigint
       ORDER BY s.score DESC;

Retrieve documents containing NER person and time terms (forward proximity search)

This query performs a proximity search to find documents with a person term followed by a time term within the next seven terms.

    =# SELECT news_demo.id,
              gptext.highlight(news_demo.content, 'content', hs) AS content,
              s.score
       FROM news_demo,
            gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
              '{!gptextqp}(_ner_person 7W _ner_time)', NULL,
              'hl=true&hl.fl=content&rows=10&sort=score desc') s
       WHERE news_demo.id = s.id::bigint
       ORDER BY s.score DESC;

Retrieve documents with a specified NER person and any NER person (unordered proximity search)

Like the previous example, this query performs a proximity search, but the terms can appear in the document in either order and must be within ten terms of each other.

    =# SELECT news_demo.id,
              gptext.highlight(news_demo.content, 'content', hs) AS content,
              s.score
       FROM news_demo,
            gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
              '{!gptextqp}(_ner_person_Taylor 10N _ner_person)', NULL,
              'hl=true&hl.fl=content&rows=10&sort=score desc') s
       WHERE news_demo.id = s.id::bigint
       ORDER BY s.score DESC;
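The two proximity examples differ only in the operator: W requires the first term to come before the second, and N accepts either order. The Python sketch below builds such query strings. The function name proximity_query is hypothetical, and the sketch assumes the terms and the operator are separated by single spaces.

```python
def proximity_query(first, second, distance, ordered=True):
    """Build a {!gptextqp} proximity query for two terms.

    ordered=True uses the W operator (first term, then second term);
    ordered=False uses the N operator (either order).
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    operator = "W" if ordered else "N"
    return f"{{!gptextqp}}({first} {distance}{operator} {second})"


print(proximity_query("_ner_person", "_ner_time", 7))
print(proximity_query("_ner_person_Taylor", "_ner_person", 10, ordered=False))
```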

Adding Lemmatization to an NER Index

GPText lemmatization uses Apache OpenNLP POS (parts of speech) labeling and an English language WordNet® Dictionary to find the root (unconjugated) forms of verbs and uninflected forms of nouns, adjectives, and adverbs. The GPText index stores the lemmatized terms for documents so that a search query can match documents that use the different inflections of the query terms.

To enable lemmatization, you define a field type in the managed-schema configuration file for the index that includes the GPText WordNet Lemmatizer filter in the type’s analyzer chain, and then assign the field type to fields you want to lemmatize. The field type you define must use the Solr OpenNLP tokenizer and POS filter, and the POS filter must appear before the GPText WordNet Lemmatizer filter in the analysis chain.

This example defines the type text_lemm with lemmatization:

1. Edit the managed-schema file for the index.

    $ gptext-config edit -i demo.public.news_demo -f managed-schema

2. Add a field type including the lemmatization filter.

    <fieldType name="text_lemm" class="solr.TextField"
               autoGeneratePhraseQueries="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
                   tokenizerModel="en-token.bin"
                   sentenceModel="en-sent.bin"/>
        <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
        <filter class="com.emc.solr.analysis.wordnet.WordNetLemmatizerFilterFactory"/>
        <filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
      </analyzer>
    </fieldType>

3. Change the field type on fields you want to index with lemmatization and save the edited managed-schema file.

    <field name="content" type="text_lemm" indexed="true" termOffsets="true"
           stored="false" termPositions="true" termPayloads="true" termVectors="true"/>

4. Test the output of the text_lemm analysis chain using the gptext.analyzer() function. For example:

    =# SELECT * FROM gptext.analyzer('demo.public.news_demo', 'index', 'text_lemm',
         'The elves spoke of geese migration and the better Vivaldi libretti.');
              class          | tokens
    -------------------------+---------------------------------------------------------------
     OpenNLPTokenizer        | {{"The"},{"elves"},{"spoke"},{"of"},{"geese"},{"migration"},{"and"},{"the"},{"better"},{"Vivaldi"},{"libretti"},{"."},{},{}}
     OpenNLPPOSFilter        | {{"The"},{"elves"},{"spoke"},{"of"},{"geese"},{"migration"},{"and"},{"the"},{"better"},{"Vivaldi"},{"libretti"},{"."},{},{}}
     WordNetLemmatizerFilter | {{"the"},{"elf"},{"elves"},{"speak"},{"of"},{"goose"},{"migration"},{"and"},{"the"},{"better"},{"good"},{"vivaldi"},{"libretto"},{"."}}
     TypeAsSynonymFilter     | {{"the","@DT"},{"elf","@NNS"},{"elves","@NNS"},{"speak","@VBD"},{"of","@IN"},{"goose","@NN"},{"migration","@NN"},{"and","@CC"},{"the","@DT"},{"better","@JJR"},{"good","@JJR"},{"vivaldi","@NNP"},{"libretto","@NNS"},{".","@."}}
    (4 rows)

The OpenNLPPOSFilter determines the part of speech for each term and the WordNetLemmatizerFilter looks up the root form of the term. Depending on the part of speech, there can be more than one root form for a term. In these cases, the filter adds an additional term to the index for each root form.

5. Reindex the documents.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM news_demo), 'demo.public.news_demo');
=# SELECT * FROM gptext.commit_index('demo.public.news_demo');

Lemmatization Example

The following example demonstrates how lemmatization can improve search results. The table lemm_test contains two documents that use the word “meeting”.

1. Create a new table in the demo database.

=# CREATE TABLE lemm_test (id integer, content text) DISTRIBUTED BY (id);
=# INSERT INTO lemm_test VALUES
     (1, 'They are meeting in London.'),
     (2, 'He held a meeting in London to discuss the problem.');

2. Create a GPText index for the lemm_test table and add the documents to the index. The content column, in this case, will be configured with the default text_intl field type.

=# SELECT gptext.create_index('public', 'lemm_test', 'id', 'content');
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM lemm_test), 'demo.public.lemm_test');
=# SELECT gptext.commit_index('demo.public.lemm_test');

3. Search the index for the string "the meeting in London".

=# SELECT lt.id, q.score, lt.content
   FROM lemm_test lt,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.lemm_test', E'\"the meeting in London\"', null, null) q
   WHERE q.id::int8 = lt.id;

The search finds both documents:

 id |   score   |                       content
----+-----------+-----------------------------------------------------
  2 | 0.5753642 | He held a meeting in London to discuss the problem.
  1 | 0.5753642 | They are meeting in London.
(2 rows)

Using lemmatization during indexing and searching, the result can be more accurate. The row with ID 1 uses “meeting” as the present participle of the verb “meet”. The second record uses “meeting” as a noun. The search query, "the meeting in London", also uses “meeting” as a noun.

1. Edit the managed-schema file for the index.

$ gptext-config edit -i demo.public.lemm_test -f managed-schema

2. Add the text_ner field type.

<fieldType name="text_ner" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="en-token.bin" sentenceModel="en-sent.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
    <filter class="com.emc.solr.analysis.wordnet.WordNetLemmatizerFilterFactory"/>
    <filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
  </analyzer>
</fieldType>

3. Find the field element for the content field and change the type attribute to text_ner.

<field name="__temp_field" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" stored="true" type="int" indexed="true"/>
<field name="__pk" stored="true" indexed="true" type="int"/>
<field name="content" stored="false" type="text_ner" indexed="true"/>

4. Save the edited managed-schema file and then reindex the lemm_test table.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM lemm_test), 'demo.public.lemm_test');
=# SELECT gptext.commit_index('demo.public.lemm_test');

5. Repeat the same search query.

=# SELECT lt.id, q.score, lt.content
   FROM lemm_test lt,
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.lemm_test', E'\"the meeting in London\"', null) q
   WHERE q.id::int8 = lt.id;

This time the search finds only the second document:

 id |   score   | hs | rf
----+-----------+----+----
  2 | 3.6823308 |    |
(1 row)

Customizing NER Field Types

GPText includes the following English language models.

en-ner-date.bin

en-ner-location.bin

en-ner-money.bin

en-ner-organization.bin

en-ner-percentage.bin

en-ner-person.bin

en-ner-time.bin

To specify the models you want to use for an index, edit the managed-schema file for the index and set the nerTaggerModels attribute of the OpenNLPNERFilterFactory filter element in the field type definition.

<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory" nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>

You can download models for other languages at Models for 1.5 Series. Upload the model to ZooKeeper using the gptext-config upload command and then update the nerTaggerModels attribute as shown. For example, to add the Spanish person model:

1. Download the es-ner-person.bin file from Models for 1.5 Series.

2. Upload the es-ner-person.bin file to ZooKeeper.

$ gptext-config upload -i demo.public.news_demo -l es-ner-person.bin -f es-ner-person.bin

3. Edit the managed-schema file for the index.

$ gptext-config edit -i demo.public.news_demo -f managed-schema

4. Add the es-ner-person.bin model to the OpenNLPNERFilterFactory filter for the field. Because it is listed first, Spanish names will be recognized first, and then English names.

<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory" nerTaggerModels="es-ner-person.bin,en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>

5. Save the managed-schema file changes and reindex the documents.

Adding OpenNLP Libraries to Existing GPText Indexes

To use NER with a GPText index created with a version of GPText earlier than GPText 3.1, you must add the OpenNLP libraries to the index’s solrconfig.xml configuration file. These libraries are already present in the solrconfig.xml file for indexes created with GPText 3.1 or later. The installed GPText version must be release 3.1 or later.

Use the gptext-config utility to edit the solrconfig.xml file.

$ gptext-config edit -i <index-name> -f solrconfig.xml

Find the existing <lib> elements and add these elements.

<!-- Add below existing lib settings -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="opennlp.*"/>
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-opennlp-.*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="gptext-analysis-extras-.*\.jar"/>

Save the solrconfig.xml file with these changes.

1. Princeton University. “About WordNet.” WordNet. Princeton University. 2010.

GPText Function Reference

The following functions are available in Pivotal GPText.

Indexing

gptext.create_index() – creates an empty index.

gptext.create_index_external() – creates an index for external documents.

gptext.index() – populates an index.

gptext.index_external() – adds documents to an external index.

gptext.index_external_dir() – adds all documents in a directory in an external document source to a GPText external index.

gptext.commit_index() – finalizes index operations.

gptext.extract_rich_doc() – shows the content extracted from an external document, without actually adding the content to the index.

gptext.recreate_error_table() – recreates the error table that records errors that occur while adding documents to an external index.

Authenticating with External Document Sources

gptext.external_login() – logs in to an external document store that requires authentication.

gptext.external_logout() – logs out of an external document store.

Modifying or Deleting an Index

gptext.add_field() – adds a field to an index.

gptext.delete() – deletes documents matching a search query.

gptext.drop_field() – deletes a field from an index.

gptext.drop_index() – deletes an index.

Search

gptext.search() – searches an index.

gptext.search_count() – returns the number of documents that match a search.

gptext.search_external() – searches a GPText external index.

gptext.gptext_retrieve_field() – extracts a single field from the rf search result column as text.

gptext.gptext_retrieve_field_int() – extracts a single field from the rf search result column and converts it to an integer.

gptext.gptext_retrieve_field_float() – extracts a single field from the rf search result column and converts it to a float.

gptext.highlight() – returns search results with the search term highlighted.

gptext.highlight_external() – applies highlighting to search results from external indexes.

Faceted Search

gptext.faceted_field_search() – search, faceted by fields.

gptext.faceted_query_search() – search, faceted by queries.

gptext.faceted_range_search() – search, faceted by defined ranges.

Working With Terms

gptext.enable_terms() – enables term vectors and positions to allow extracting terms and their positions from text fields.

gptext.ner_terms() – gets terms tagged with Named Entity Recognition (NER) filters from the term vector for a specified field.

gptext.terms() – gets the term vectors for the indexed documents in a Solr index for the specified field.

GPText Index Monitoring

gptext.cluster_status() – shows status of indexes managed by the GPText cluster.

gptext.index_size() – shows the number of documents indexed and total disk space used for GPText indexes.

gptext.index_status() – shows status of replicas for an index or for all indexes.

gptext.partition_status() – lists partitioned indexes and child partitions.

gptext.index_summary() – shows configuration details, status, and metrics for all of the replicas for a specified index or for all GPText indexes.

GPText Index Configuration

gptext.analyzer() – shows the output from each class in the index or query analyzer chain for a given field type and user-supplied input text.

gptext.config_append() – appends the contents of a local file to a ZooKeeper configuration file for an index.

gptext.config_delete() – deletes an index configuration file from ZooKeeper.

gptext.config_get() – displays the contents of a ZooKeeper index configuration file.

gptext.config_list() – lists ZooKeeper configuration files and directories for an index.

gptext.config_upload() – uploads an index configuration file to ZooKeeper.

gptext.get_field_type() – displays the analyzer chain for a field type defined in the configuration for a specified index.

gptext.list_field_types() – lists the text field types defined in the configuration for a specified index.

gptext.reload_index() – reloads Solr configuration files.

GPText Cluster Monitoring and Management

gptext.live_nodes() – lists active Solr nodes.

gptext.version() – returns the version of the GPText installation.

gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.

High Availability

gptext.add_replica() – adds a replica of an index shard.

gptext.delete_replica() – deletes a replica of an index shard.

General Purpose Functions

gptext.count_t() – counts the number of rows in a table.

Privileges

Your privileges to execute the GPText functions depend on your Greenplum Database privileges for the table from which the index is generated. For example, if you have SELECT privileges for a table in the Greenplum database, you have SELECT privileges for an index generated from that table.

Executing index functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Security section of the GPText User’s Guide for information about setting privileges.

The Privileges required section for each of the GPText functions specifies the privileges required to execute that function.

Usage

The gptext functions in this section must be executed as SQL queries in the form:

SELECT * FROM gptext.function();

The examples in this document use a Greenplum database named demo set up as follows:

A table named articles in the wikipedia schema.

A table named message in the twitter schema.

See Setting Up the Sample Database for details about these tables.

Indexing

Indexing functions create, set up, populate, and finalize (commit) Solr indexes.

gptext.create_index()

Creates an empty Solr index.

Syntax

gptext.create_index(<schema_name>, <table_name>, <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

or

gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>, <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

Parameters

<schema_name>

The name of the schema in the Greenplum database.

<table_name>

The name of the table in the Greenplum database. If the table is partitioned, this must be the name of the root table.

<p_columns>

A text array containing the names of the table columns to index. If <p_columns> and <p_types> are omitted, all table columns are indexed.

The columns must be valid columns in the table. The columns identified by the <id_col_name> and <def_search_col_name> must be included in the array.

If the <p_columns> parameter is supplied, the <p_types> parameter must also be supplied. The sizes of the <p_columns> and <p_types> arrays must be the same.

<p_types>

A text array containing the Solr data types of the columns in the <p_columns> array.

Text types can be mapped to the name of an analyzer chain, for example text_intl, text_sm, or any type defined in the managed-schema file. See Map Greenplum Database Data Types to Solr Data Types for equivalent Solr data types for other Greenplum types.

The <p_types> parameter must be supplied if the <p_columns> parameter is supplied.

<id_col_name>

The name of a column in <table_name> that is unique for each row. The column must be of type int4, int8, varchar, text, or uuid.

<def_search_col_name>

The name of the default column to search in <table_name>, if no other column is named in a query.

<if_check_id_uniqueness>

Optional. A Boolean value. The default is true. Set to false to index a table with a non-unique ID field.
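The constraints on the <p_columns> and <p_types> arrays described above can be checked on the client side before calling gptext.create_index(). The following Python sketch is only an illustration: the function name validate_index_columns is hypothetical and is not part of GPText, and GPText performs its own validation regardless.

```python
def validate_index_columns(p_columns, p_types, id_col_name, def_search_col_name):
    """Check the documented constraints on the <p_columns> and <p_types> arrays.

    Hypothetical client-side helper; not a GPText function.
    """
    # The two arrays must be the same size.
    if len(p_columns) != len(p_types):
        raise ValueError("p_columns and p_types must be the same size")
    # The ID column and the default search column must both be indexed.
    for required in (id_col_name, def_search_col_name):
        if required not in p_columns:
            raise ValueError(f"{required!r} must be included in p_columns")
    return True
```

For instance, the arrays used in the second gptext.create_index() example, ["id", "content", "title"] and ["long", "text", "text"] with id and content as the ID and default search columns, satisfy both checks.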

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

A GPText index is a Solr collection.

The name of the index created has the format:

<database_name>.<schema_name>.<table_name>
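As a small illustration of this naming format, the following Python sketch builds the index name that later functions such as gptext.index() expect. The helper name gptext_index_name is hypothetical and is not part of GPText.

```python
def gptext_index_name(database_name, schema_name, table_name):
    """Build an index name in the <database_name>.<schema_name>.<table_name> format.

    Hypothetical helper for illustration only.
    """
    return f"{database_name}.{schema_name}.{table_name}"


# The articles table in the wikipedia schema of the demo database:
print(gptext_index_name("demo", "wikipedia", "articles"))
```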

The contents of the <id_col_name> column should, in most cases, be a unique ID for each row. It must be of type int4, int8, varchar, or text.

The index is broken up into sections called shards. By default, GPText indexes have one shard per Greenplum Database segment. You can create an index with fewer shards by setting the GPText configuration parameter gptext.idx_num_shards to the number of shards before you create the index. See Specifying the Number of Index Shards for more information.

If the <if_check_id_uniqueness> argument is true, the default, a document with an ID matching an existing ID cannot be added to the index.

If the <if_check_id_uniqueness> argument is false, documents with duplicate IDs are allowed to be added to the index. The content of other fields may or may not be the same as existing documents with the same ID. When a query returns multiple documents with the same ID, it is the user’s responsibility to anticipate and handle the multiple documents. For example, a table could have a revision column that is incremented when a new version of a document is added to the index, allowing queries that omit all but the most recent version from search results.
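One way to handle duplicate IDs on the client side is to keep only the row with the highest revision for each ID. The Python sketch below assumes that search results have already been fetched into dictionaries with id and revision keys; both the key names and the latest_revisions helper are hypothetical, chosen to match the revision column example above.

```python
def latest_revisions(rows):
    """Keep only the most recent revision of each document ID.

    rows: iterable of dicts with hypothetical 'id' and 'revision' keys.
    """
    newest = {}
    for row in rows:
        current = newest.get(row["id"])
        # Replace the stored row when this one has a higher revision.
        if current is None or row["revision"] > current["revision"]:
            newest[row["id"]] = row
    return list(newest.values())
```

The same filtering could instead be expressed in the SQL query that joins the search results to the base table.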

GPText does not support indexes with fewer shards than the number of Greenplum Database segments with the <if_check_id_uniqueness> argument set to false.

When the table is partitioned, the GPText index created for the table will contain records for all partitions. If you specify the name of a subpartition table in this function, an error is returned. The index records for documents added to the index have a __partition field containing the name of the child partition table. See Searching Partitioned Tables for syntax to search by partitions.

Populate the new index with gptext.index().

The number of replicas for each shard is determined when the index is created. It is the value of the gptext.replication_factor server configuration parameter, 2 by default.
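To estimate how many shard replicas a new index creates, multiply the number of shards by the replication factor. The following Python sketch assumes the defaults described above (one shard per Greenplum Database segment unless gptext.idx_num_shards is set, and a replication factor of 2); the function name total_replicas is hypothetical.

```python
def total_replicas(num_segments, replication_factor=2, idx_num_shards=None):
    """Estimate the number of shard replicas created for an index.

    Assumes one shard per segment unless idx_num_shards is given.
    Hypothetical helper for capacity planning only.
    """
    num_shards = idx_num_shards if idx_num_shards is not None else num_segments
    return num_shards * replication_factor
```

Under these assumptions, an index on a cluster with 8 segments and the default settings would have 16 replicas in total.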

If the gptext.failover_factor server configuration parameter is set, gptext.create_index() fails if the ratio of the number of GPText nodes that are up to the total number of GPText nodes is less than the gptext.failover_factor value (from 0.0 to 1.0). Index shards can only be created on active GPText nodes, so the gptext.failover_factor parameter prevents overloading the active GPText nodes when too many nodes are down.
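The failover check amounts to comparing a ratio against a threshold. The following Python sketch restates the rule as described; it is not GPText's implementation, and the function name create_index_allowed is hypothetical.

```python
def create_index_allowed(nodes_up, nodes_total, failover_factor):
    """Return True if index creation would pass the failover_factor check.

    Creation fails when nodes_up / nodes_total is less than failover_factor.
    Hypothetical restatement of the documented rule.
    """
    if nodes_total <= 0:
        raise ValueError("nodes_total must be positive")
    if not 0.0 <= failover_factor <= 1.0:
        raise ValueError("failover_factor must be between 0.0 and 1.0")
    return nodes_up / nodes_total >= failover_factor
```

With a failover factor of 0.5 and four GPText nodes, index creation would be allowed while at least two nodes are up.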

To index a partitioned table, specify the name of the root table. The gptext.index() function returns an error if you specify the name of a child partition table.

Examples

Create an index, demo.wikipedia.articles, with content as the default search field.

=# SELECT * FROM gptext.create_index('wikipedia', 'articles', 'id', 'content');

Create an index, demo.wikipedia.articles, with content as the default search field. Index the id, content, and title fields.

=# SELECT * FROM gptext.create_index('wikipedia', 'articles', '{"id","content","title"}', '{"long","text","text"}', 'id', 'content');

gptext.create_index_external()

Creates an empty Solr index for external documents.

Syntax

gptext.create_index_external(<index_name>)

Parameters

<index_name>

The name of the index to create. The name cannot contain periods (.).

Notes

A GPText external index is a Solr index for documents external to Greenplum Database, for example, PDF, Microsoft Word, XML, and HTML files. Unlike regular GPText indexes, external indexes are not associated with a Greenplum Database table, but they can be searched with GPText search functions.

Like regular GPText indexes, an external index by default has one shard per Greenplum Database segment. To create an external index with fewer shards, set the gptext.idx_num_shards configuration parameter to the desired number of shards before you create the index. See Specifying the Number of Index Shards for more information.

Example

=# SELECT * FROM gptext.create_index_external('gptext-docs');

gptext.index()

Populates an index by indexing data in a table.

Syntax

gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>)

Parameters

TABLE(<select-statement>)

The document content to be indexed, with data type anytable.

<index_name>

Name of the index that was created with gptext.create_index() and is to be populated.

Return type

SETOF dbid INT, num_docs BIGINT

where dbid is the dbid of the segment that the documents were sent to, and num_docs is the number of documents that were indexed.

Privileges required

You must have the INSERT or UPDATE privilege to execute this function.

Remarks

<index_name> must have been created with gptext.create_index().

The first argument to gptext.index() is a table expression specifying rows to add to the index. To write this argument, wrap a SELECT statement in a TABLE function to select rows to pass to the gptext.index() function. For example, TABLE(SELECT * FROM articles) is a table expression that indexes all columns and rows of the articles table. The gptext.index() function adds each row of the SELECT result set as a document in the GPText index.

You can selectively index table columns by changing the SELECT list in the query and you can filter rows to index with a WHERE clause.

The SELECT statement does not have to query the base table that was specified when the GPText index was created, but the results of the query must have the same columns and distribution policy. For example, you can index rows from a temporary table with the same definition as the base table. Or you could index the results from a join query that produces the same columns as the base table. If the results have a different distribution key, or no distribution key, as is the case with a join query, you must specify a compatible distribution policy by adding a SCATTER BY <column> clause to the SELECT statement. You can specify the column by name or by column number.

The output of the gptext.index() function is a two-column table with dbid (the Greenplum segment ID) and num_docs (the number of documents added to the index shard for that segment) as the columns.

After adding documents to the index, you must commit the index with gptext.commit_index(<index_name>).

Example

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
 dbid | num_docs
------+----------
    3 |        6
    2 |        5
(2 rows)
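The index-then-commit sequence can also be driven from a client program. The Python sketch below assumes a DB-API cursor that accepts %s placeholders (as psycopg2 does); the helper name index_and_commit is hypothetical. It sums the num_docs column across segments and then commits the index.

```python
def index_and_commit(cursor, table_name, index_name):
    """Index all rows of a table, commit the index, and return the document count.

    cursor: a DB-API cursor using %s placeholders (an assumption, e.g. psycopg2).
    table_name is interpolated into the SQL text, so pass only trusted identifiers.
    """
    cursor.execute(
        f"SELECT * FROM gptext.index(TABLE(SELECT * FROM {table_name}), %s)",
        (index_name,),
    )
    # Each result row is (dbid, num_docs) for one Greenplum segment.
    total = sum(num_docs for _dbid, num_docs in cursor.fetchall())
    # Indexed documents are not searchable until the index is committed.
    cursor.execute("SELECT * FROM gptext.commit_index(%s)", (index_name,))
    return total
```

Applied to the example output above, the per-segment counts of 6 and 5 would give a total of 11 documents.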

gptext.index_external()

Adds documents stored outside of Greenplum Database to a GPText external index.

Syntax

gptext.index_external(<url-list>, <index-name>)

Parameters

<url-list>

A list of URLs for documents to add to the GPText external index. The URLs may be expressed as an array or as a table-valued expression.

<index-name>

The name of the index to which the documents are to be added.

Remarks

If the documents to add to the GPText external index are in a store that requires authentication, use the gptext.external_login() function to log in to the document store before you execute gptext.index_external().

If the document cannot be retrieved at the given URL, or an error occurs while indexing the document, GPText inserts a row in the gptext.error_table table. You can use gptext.recreate_error_table() to create an empty error table before you call gptext.index_external().

The value of the GPText custom server parameter gptext.idx_segment_error_limit (default 10) is the number of errors that can occur on any one segment before the indexing operation is canceled.

When adding a document to an external index, GPText calculates a 256-bit hash on the contents of the document. The hash is stored as a 64-character hexadecimal value in the sha256 field. If you later add a document with a URL matching an existing document in the index, the new document is only added to the index if the newly calculated hash differs from the current value in the sha256 field.
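The change-detection rule can be illustrated with Python's standard hashlib module. This sketch assumes the hash is computed over the raw bytes of the document, which the text does not state explicitly; the function names content_hash and should_reindex are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the content as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


def should_reindex(stored_hash, new_content: bytes) -> bool:
    """Decide whether a document with an already-indexed URL needs re-adding.

    stored_hash: the current sha256 field value, or None if the URL is new.
    """
    return stored_hash is None or content_hash(new_content) != stored_hash
```

A 256-bit digest written in hexadecimal always occupies 64 characters, which matches the size of the sha256 field described above.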

Examples

This example adds a single PDF document to the index gptext-docs.

=# SELECT * FROM gptext.index_external('{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}', 'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    2 |        1
(2 rows)

This example adds multiple HTML documents to the gptext-docs external index by selecting URLs from a database table. Errors will be logged in the gptext.error_table table.

=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (id bigint, url text) DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES
     (1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'),
     (2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'),
     (3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
     (4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
     (5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
     (6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
     (7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
     (8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'),
     (9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
     (10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
     (11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
     (12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
     (13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
     (14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
     (15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(TABLE(SELECT url FROM gptext_html_docs), 'gptext-docs');
 dbid | num_docs
------+----------
    3 |        6
    2 |        8
(2 rows)

=# SELECT * FROM gptext.error_table;
-[ RECORD 1 ]------------------------------------------------------------
error_time | 2000-01-01 00:25:11.282769
index_name | gptext-docs
sqlcmd     |
errmsg     | Code: RUNTIME_ERROR, Message: 'http://gptext.docs.pivotal.io/210/topics/ext-indexes.html.'
rawdata    | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
rawbytes   |

=# SELECT * FROM gptext.commit_index('gptext-docs');
 commit_index
--------------
 t
(1 row)

This example indexes a single HTML file in an Amazon S3 bucket named “gptext”.

=# SELECT gptext.external_login('s3', 's3://s3-us-west-2.amazonaws.com', 'mys3_auth');
 external_login
----------------
 t
(1 row)
=# SELECT * FROM gptext.index_external('{s3://gptext/topics/queries.html}', 'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    5 |        0
    4 |        0
    2 |        1
(4 rows)
=# SELECT gptext.commit_index('gptext-docs');
 commit_index
--------------
 t
(1 row)
=# SELECT gptext.external_logout('s3');
 external_logout
-----------------
 t
(1 row)

gptext.index_external_dir()

Adds all documents in a directory in an external document source to a GPText external index.

Syntax

gptext.index_external_dir(<directory_url>, <index_name>)

Parameters

<directory_url>

The URL for the directory with documents to add to the index.

<index_name>

The name of the GPText external index to which the documents are to be added.

Remarks

The gptext.index_external_dir() function adds the documents in a directory and its subdirectories to a GPText external index.

Log in to the document source using the gptext.external_login() function before you call the gptext.index_external_dir() function.

If you specify a file instead of a directory, an error is added to the gptext.error_table table.

The ID for each file added to the index is the URL for the file in the external document source.

The Apache Tika library discovers the content_type for each file.

The user who logs in to the external document source must have read permissions on the directory. The gptext.index_external_dir() function adds to the index only those documents and subdirectories that the user has permission to read.

Example

This example adds documents from a directory in an hdfs store to the GPText webdocs external index.

=# SELECT * FROM gptext.external_login('hdfs', 'hdfs://myhadoop:9000', 'myhadoop');
 external_login
----------------
 t
(1 row)

=# SELECT * FROM gptext.index_external_dir('hdfs://myhadoop:9000/gptext_web/public/300/', 'webdocs');
 num_docs
----------
       37
(1 row)

=# SELECT * FROM gptext.commit_index('webdocs');
 commit_index
--------------
 t
(1 row)

=# SELECT * FROM gptext.external_logout('hdfs');
 external_logout
-----------------
 t
(1 row)

gptext.commit_index()

Finishes an index operation. The results of an indexing operation are not available until this function is called for the index.

Syntax

gptext.commit_index(<index_name>)

Parameters

<index_name>

The name of the index to commit. If the table is partitioned, this must be the name of the root table.

Return type

boolean

Privileges required

You must have the INSERT, UPDATE, or DELETE privilege to execute this function.

Remarks

Must be called after gptext.index() and gptext.delete().

Example

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)

gptext.extract_rich_doc()

Retrieves metadata and content from an external document but does not add the content to the GPText external index.

Syntax

gptext.extract_rich_doc(<index-name>, <document-url>)

Parameters

<index-name>

The name of the external index to test. Metadata and content are extracted from the document using the current Solr configuration files for this index.

<document-url>

The URL for the document to extract.

Return type

Column        Type
stream_name   text
title         text
author        text
keywords      text
created       text
modified      text
content       text

Remarks

The gptext.extract_rich_doc() function can be used while testing or troubleshooting to verify the metadata and content extracted from external documents. The external document is not added to the GPText external index. For example, you can use this function to test that changes you make to the solrconfig.xml configuration file for the index have the desired effect.

If the document is in a document store that requires authentication, you must first log in using gptext.external_login().

The stream_name column is the same as the <document-url> function parameter. If the document is added to the index, this will be the unique document ID.

The title, author, keywords, created, and modified columns contain metadata dependent on the document type Apache Tika detects.

The content column contains the extracted document content.

Examples

1. View the metadata and content that would be extracted from an HTML document.

=# SELECT * FROM gptext.extract_rich_doc('gptext-docs', 'http://gptext.docs.pivotal.io/latest/topics/intro.html');

2. View the text extracted from an image file when the Tesseract optical character recognition (OCR) engine has been installed on the Greenplum Database cluster.

=# SELECT * FROM gptext.extract_rich_doc('gptext-docs', 'http://gptext.docs.pivotal.io/latest/graphics/ocrtest.png');

See Indexing Text Embedded in Images for information about setting up OCR for external image files and documents with embedded images.

gptext.recreate_error_table()

Drops and recreates the gptext.error_table database table.

Syntax

gptext.recreate_error_table()

Return type

boolean

Remarks

If an error occurs while accessing a document to add to a GPText external index, GPText adds a record to the gptext.error_table table. See gptext.error_table for a description of this table. Users should not drop or modify the table.

Rows added to the gptext.error_table table remain until you use the gptext.recreate_error_table() function to create a new empty table.

If you attempt to execute gptext.recreate_error_table() while the table is in use for an indexing operation, a warning is raised and the function returns false without recreating the table.

Examples

=# SELECT gptext.recreate_error_table();
 recreate_error_table
----------------------
 t
(1 row)

Authenticating with External Document Sources

To add documents from a document source that requires authentication, such as Hadoop, an ftp server, or Amazon S3, log in before adding documents to a GPText index and log out when done.

gptext.external_login()

Logs in to an external document source before adding documents from the source to a GPText external index.

Syntax

gptext.external_login(<type>, <url>, <config-name>)

Parameters

<type>

Identifies the type of the external document source. The valid types are 'ftp', 'hdfs', and 's3'. The type is not case-sensitive.

<url>

The URL of the external document source.

<config-name>

The name of the configuration uploaded with the gptext-external upload utility command.

Remarks

You can log in to only one external document source at a time.

Use the gptext-external list command to list the configurations that have been uploaded.

Example

Log in to an hdfs file system using the myhdfs configuration.

=# SELECT * FROM gptext.external_login('HDFS', 'hdfs://198.51.100.23:19000', 'myhdfs');

Log in to an ftp server using the myftp configuration.

=# SELECT gptext.external_login('ftp', 'ftp://198.51.100.23', 'myftp');

Log in to Amazon S3 using the mys3 configuration.

=# SELECT gptext.external_login('s3', 's3://s3.us-east-2.amazonaws.com', 'mys3');

The Amazon S3 connection URL has the format s3://<s3-endpoint>[/<region>][/]. If the <s3-endpoint> begins with s3- or s3. and is followed by a region code, for example s3-us-east-1.amazonaws.com or s3.ap-southeast-1.amazonaws.com, the /<region> part of the URL is not required. An endpoint like s3.dualstack.us-east-1.amazonaws.com, however, must include the region code at the end: s3://s3.dualstack.us-east-1.amazonaws.com/us-east-1. See Amazon AWS Regions and Endpoints for a list of valid S3 endpoints by region.
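The URL rule above can be sketched as a small Python helper that appends the region only when the endpoint does not already carry it directly after the s3- or s3. prefix. The region pattern used here (two letters, one or more hyphenated words, then a number) is an assumption made for illustration and is not taken from GPText or AWS documentation; the function name s3_login_url is hypothetical.

```python
import re

# Assumed shape of an endpoint whose region code directly follows "s3-" or "s3.".
_REGION_IN_ENDPOINT = re.compile(
    r"^s3[-.]([a-z]{2}(?:-[a-z]+)+-\d+)\.amazonaws\.com$"
)


def s3_login_url(endpoint, region=None):
    """Build an s3:// connection URL, adding /<region> only when it is needed."""
    if _REGION_IN_ENDPOINT.match(endpoint):
        # The region is already part of the endpoint name.
        return f"s3://{endpoint}"
    if region is None:
        raise ValueError(f"endpoint {endpoint!r} requires an explicit region")
    return f"s3://{endpoint}/{region}"
```

With this sketch, s3-us-east-1.amazonaws.com is used as-is, while s3.dualstack.us-east-1.amazonaws.com requires the caller to supply us-east-1 explicitly.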

gptext.external_logout()

Logs out of an external document source after adding documents from the source to a GPText external index.

Syntax

gptext.external_logout(<type>)

Parameters

<type>

Identifies the type of the external document source. The supported types are 'ftp', 'hdfs', and 's3'. The type is not case-sensitive.

Remarks

You can log in to only one external document source at a time. To index documents from another source, you must first call gptext.external_logout() and then log in to the new source with gptext.external_login().

Example

Log out of an hdfs file system.

=# SELECT * FROM gptext.external_logout('HDFS');

Log out of an ftp server.

=# SELECT gptext.external_logout('ftp');

Log out of Amazon S3.

=# SELECT gptext.external_logout('s3');
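Because only one external document source can be logged in at a time, a client program may want to guarantee that the logout always runs, even if indexing fails. The Python sketch below wraps the login and logout calls in a context manager. The execute argument is a hypothetical callable that takes an SQL string and a parameter tuple (for example, a psycopg2 cursor's execute method); the name external_session is also hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def external_session(execute, source_type, url, config_name):
    """Log in to an external document source and always log out afterwards.

    execute: hypothetical callable(sql, params) that runs a statement,
    using %s placeholders (an assumption about the database driver).
    """
    execute(
        "SELECT gptext.external_login(%s, %s, %s)",
        (source_type, url, config_name),
    )
    try:
        yield
    finally:
        # Runs even if the body raises, so another source can be used next.
        execute("SELECT gptext.external_logout(%s)", (source_type,))
```

Calls to gptext.index_external() or gptext.index_external_dir() would go inside the with block.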

Modifying or Deleting an Index

You can change an index by adding or dropping fields, reverting an index to its previous state, or deleting the index.

gptext.add_field()

Adds a field to your schema if the field was added to the database after the index was created.

Syntax

gptext.add_field(<index_name>, <field_name> [, <is_default_search_col> [, <if_enable_terms>]])

Parameters

<index_name>

The name of the index to which you want to add the field. If the table is partitioned, this must be the name of the root table.

<field_name>

The name of the field to be indexed.

<is_default_search_col>

Optional. Boolean value. Set to true to make this field the default search column.

<if_enable_terms>

Optional. Boolean value. Enable terms support on this field when added to the GPText index.

Return type

SETOF boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Call this function for each field you add.

Before and after you add one or more fields, reload the Solr configuration files with gptext.reload_index(). The initial reload_index() call is required because of Solr 4.0 behavior and may not be required in subsequent versions.

After you add one or more fields, you should reindex the table and commit the index with gptext.commit_index().

Example

Adds the field external_links to the index, then recreates, repopulates, and commits the index.

=# ALTER TABLE wikipedia.articles ADD external_links text;
ALTER TABLE
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index
--------------
 t
(1 row)
=# SELECT * FROM gptext.add_field('demo.wikipedia.articles', 'external_links', false, false);
INFO: Add field: external_links for index: demo.wikipedia.articles
 add_field
-----------
 t
(1 row)

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index
--------------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)

gptext.delete()

Deletes all documents that match the search query.

Syntax

gptext.delete(<index_name>, <query>)

Parameters

<index_name>

The name of the index.

<query>

Documents matching this query will be deleted. To delete all documents, use the query '*' or '*:*'.

Return type

boolean

Privileges required

You must have the DELETE privilege to execute this function.

Remarks

After a successful delete, commit the index using gptext.commit_index(<index_name>).

Examples

Delete all documents containing the word “unverified” in the default search field:

=# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'unverified');
 delete
--------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)

Delete all documents from the index:

=# SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*');
 delete
--------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)

gptext.drop_field()

Removes a field from your schema.

Syntax

gptext.drop_field(<index_name>, <field_name>)

Parameters

<index_name>

The name of the index from which to drop the field. If the table is partitioned, this must be the name of the root table.

<field_name>

The name of the field to drop.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Call this function for each field you drop.

Before and after dropping one or more fields, you must reload the Solr configuration files with gptext.reload_index(), then commit the index with gptext.commit_index().

The column __partition in indexes for partitioned database tables cannot be dropped.

The initial reload_index() is required by Solr 4.0 behavior and may not be necessary in subsequent versions.

Example

Drops the field external_links from the index.

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index
--------------
 t
(1 row)

=# SELECT * FROM gptext.drop_field('demo.wikipedia.articles', 'external_links');
INFO:  Drop field: external_links for index: demo.wikipedia.articles
 drop_field
------------
 t
(1 row)

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index
--------------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)

gptext.drop_index()

Removes an index.

Syntax

gptext.drop_index(<index_name>)

Parameters

<index_name>

The name of the index to drop. If the database table is partitioned, this must be the name of the root table.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

A dropped index cannot be recovered.

Example

=# SELECT * FROM gptext.drop_index('demo.wikipedia.articles');
 drop_index
------------
 t
(1 row)

Search

Search functions enable querying an index.

Changing the query parser at query time

When using the search functions, you can change the query parser used by Solr at query time. A different query parser may be required, depending on the nature of the query. See Using Advanced Querying Options for a list of the query parsers GPText supports.

To change the query parser at query time, use the defType Solr option with the gptext.search() function.

To change the query parser for any search function at query time, use the Solr localParams syntax, replacing the <query> term with '{!type=edismax}<query>'.

With the GPText Universal Query Parser, you can use features from any of the other supported query parsers in a single query. To use the Universal Query Parser, replace the <query> term with '{!gptextqp}<query>'. See Using the Universal Query Parser for information and examples.
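The localParams prefixes described above are plain string prefixes on the query text. The following Python sketch shows one way a client application might build them before passing the query to a search function. The helper names with_query_parser and with_universal_parser are hypothetical and are not part of GPText.

```python
def with_query_parser(query: str, parser: str = "edismax") -> str:
    """Prefix a Solr query with a localParams query parser selector."""
    return "{!type=" + parser + "}" + query


def with_universal_parser(query: str) -> str:
    """Prefix a query so that the GPText Universal Query Parser handles it."""
    return "{!gptextqp}" + query


if __name__ == "__main__":
    print(with_query_parser("iphone"))      # {!type=edismax}iphone
    print(with_universal_parser("iphone"))  # {!gptextqp}iphone
```

The resulting string is what you would supply as the <query> argument of a search function.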

gptext.search()

Searches an index.

Syntax

gptext.search(TABLE(<select-statement>), <index_name>, <search_query>, <filter_queries>[, <options>])

Parameters

TABLE(<select-statement>)

A table-valued expression. This is an anytable pseudo-type, specified using the format:

TABLE(<select-statement>)

The Greenplum Database query planner processes the <select-statement> to estimate the number of result rows, but the query is not executed by the body of the search function. You can use the expression TABLE(SELECT 1 SCATTER BY 1) as a simple, syntactically correct value for this argument.

<index_name>

The name of the index to search. If the database table is partitioned, you can specify the name of a sub-partition table to search.

<search_query>

Text value containing a Solr text search query.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.search_scored_result

This is a composite type with the following columns:

Column              Type
id                  text
score               double precision
hs (conditional)    gptexthstore
rf (conditional)    text

The id column is returned as text, even if the <id_col> specified in the gptext.create_index() function is an integer type. If you order results by id or join search results with the original table on id, you must cast the returned id column to the correct integer type in your query. For example, the following search query casts the id returned by the search query to an INT8 type to join with the numeric id column in the wikipedia.articles table and to sort the results numerically. However, the id column in the results is a text value and is therefore displayed left-justified.

SELECT s.id, s.score, a.title
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', '*:*', null) s
WHERE s.id::INT8 = a.id
ORDER BY s.id::INT8;
    id    | score |           title
----------+-------+---------------------------
 25784    |     1 | Renewable energy
 27743    |     1 | Solar energy
 54838    |     1 | Biogas
 55017    |     1 | Fusion power
 65499    |     1 | Soil salinity
 113728   |     1 | Geothermal energy
 213555   |     1 | Solar updraft tower
 533423   |     1 | Solar water heating
 608623   |     1 | Ethanol fuel
 855056   |     1 | Forward osmosis
 2008322  |     1 | Vehicle-to-grid
 2120798  |     1 | Lithium economy
 2988035  |     1 | Vortex engine
 4711003  |     1 | Osmotic power
 7906908  |     1 | Biomass
 13021878 |     1 | Geothermal power
 13690575 |     1 | Solar power
 14090587 |     1 | Low-carbon power
 14205946 |     1 | Algae fuel
 18965585 |     1 | Pressure-retarded osmosis
 22391303 |     1 | Liquid nitrogen engine
 26787212 |     1 | Reverse electrodialysis
 53716476 |     1 | Seaweed fuel
(23 rows)
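The need for the cast can be illustrated outside the database. The following Python sketch uses four ids taken from the output above and compares text ordering with numeric ordering. It is only an analogy: the exact text ordering in Greenplum Database depends on the database collation, which is an assumption here.

```python
# Document ids as gptext.search() returns them: text values, not integers.
ids = ["25784", "113728", "2008322", "54838"]

# Comparable to ORDER BY s.id (text comparison, character by character).
text_order = sorted(ids)

# Comparable to ORDER BY s.id::INT8 (numeric comparison).
numeric_order = sorted(ids, key=int)

print(text_order)     # ['113728', '2008322', '25784', '54838']
print(numeric_order)  # ['25784', '54838', '113728', '2008322']
```

Text ordering places 113728 before 25784 because the character 1 sorts before 2, which is why the example query sorts on s.id::INT8.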

If the <options> parameter is included in gptext.search(), the result includes the hs column, which holds highlighting offsets. This column contains key-value pairs, where the key is the column name and the value is a comma-separated list of offsets to locations where the search term occurs. This data is used by the gptext.highlight() function to add highlighting tags to the column data. If highlighting is not enabled with the 'hl=true' option, the hs column is NULL.

If the fl option is included in the <options> parameter to specify additional fields to add to the result, the rf column contains the additional fields in a formatted text value. The gptext.gptext_retrieve_field() function can be used to extract a single field value from the rf column. There are variants of the gptext.gptext_retrieve_field() function to retrieve integer and float values from the rf column value.

Privileges required

You must have the SELECT privilege to execute this function.

Solr options

Solr queries allow the following optional refinements, specified as an ampersand-delimited list in the options parameter.

defType

The name of the query parser to use for this query.

Example: defType=edismax

rows

The maximum number of rows to return per segment. If omitted, all rows are returned.

Example: rows=100 returns 100 rows per segment, or all rows if there are fewer than 100.

sort

Sorts on a field or score in ascending or descending order.

Examples:

sort=score desc (default if no sort is defined)

sort=date_time asc

sort=date_time asc, score desc sorts on date_time ascending, then on score descending

start

The number of the first record to return.

Examples:

start=0 (default) returned records start with record 0

start=25 returned records start with record 25

hl

Enables highlighting.

Example: hl=true

hl.fl

Comma-separated list of field names to consider when highlighting.

Examples:

hl.fl=message_text

hl.fl=title,content

fl

Comma-separated list of fields to include in search results. The fields must have been set to stored=true in the managed-schema for the index.

Example: fl=title,refs
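The options listed above are combined into a single ampersand-delimited string. The following Python sketch shows one way a client might assemble that string. The helper name build_options is hypothetical, and the sketch assumes that values are passed through unchanged, with no URL encoding, as in the examples in this reference.

```python
def build_options(params: dict) -> str:
    """Join Solr query parameters into an ampersand-delimited options string."""
    return "&".join(f"{name}={value}" for name, value in params.items())


options = build_options({"rows": 5, "hl": "true", "hl.fl": "message_text"})
print(options)  # rows=5&hl=true&hl.fl=message_text
```

The resulting string is what you would pass as the <options> argument of gptext.search().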

Remarks

The output includes a table with columns id (the ID named in gptext.create_index()) and score (the tf-idf score). A column named hs, which contains highlighting offsets, is included if highlighting is specified in the <options> parameter. A column named rf is included when a list of additional fields to include is specified in the <options> parameter.

To change the query parser at query time, specify the defType option in the options parameter list. For example, setting the options parameter to 'rows=100&defType=edismax' limits the output to 100 rows per segment and changes the query parser to edismax.

The TABLE query is planned and affects the estimate for gptext.search(), but does not execute. For example, if your query includes

gptext.search(TABLE(SELECT * FROM t), ...)

the query planner estimates the number of results as the number of rows in t. This can cause the query planner to ignore the use of an index scan. Use a query like TABLE(SELECT 1 SCATTER BY 1) to avoid this issue.

If you do not specify options, gptext.search() returns all rows.

The options separator has changed from comma to ampersand (&) in order to support highlighting. If you do not use highlighting, you can revert to using the comma separator by setting the gptext.search_param_separator configuration parameter to ','.

The Solr option rows specifies the maximum number of rows to return per segment. For example, if you have four segments, 'rows=100' returns at most a total of 400 rows. To limit the number of rows returned for an entire query, set a LIMIT in the SQL query. For example, the following query returns at most 20 rows:

=# SELECT t.id, q.score, t.message_text FROM twitter.message t, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', '*:*', null, null) q WHERE t.id = q.id::int8 LIMIT 20;
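The interaction between the per-segment rows option and a SQL LIMIT is simple arithmetic. The following Python sketch computes the upper bound on the number of returned rows. The function name max_rows_returned is hypothetical, and the sketch assumes every segment has at least rows_per_segment matching documents, which is the worst case.

```python
from typing import Optional


def max_rows_returned(rows_per_segment: int, num_segments: int,
                      sql_limit: Optional[int] = None) -> int:
    """Upper bound on the number of result rows for a search.

    The Solr rows option caps each segment separately. A SQL LIMIT,
    if present, caps the combined result.
    """
    upper_bound = rows_per_segment * num_segments
    if sql_limit is not None:
        upper_bound = min(upper_bound, sql_limit)
    return upper_bound


print(max_rows_returned(100, 4))      # 400
print(max_rows_returned(100, 4, 20))  # 20
```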

The gptexthstore type is a limited form of the Postgres hstore type. It has only the hstore input and output functions implemented, as gptext_hstore_in and gptext_hstore_out.

Examples

Runs a GPText query that looks for Wikipedia articles that contain the term “optimization”, and joins the results to the original Greenplum Database articles table:

=# SELECT a.id, a.title, q.score FROM wikipedia.articles a, gptext.search(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles', 'optimization', null) q WHERE a.id = q.id::int8 ORDER BY score DESC;
    id    |           title           |   score
----------+---------------------------+------------
   213555 | Solar updraft tower       |  1.5528862
 18965585 | Pressure-retarded osmosis | 0.89540845
  7906908 | Biomass                   |  0.8692636
    25784 | Renewable energy          |  0.7473389
   533423 | Solar water heating       |  0.7186527
   608623 | Ethanol fuel              |  0.6943706
    27743 | Solar energy              |  0.6943706
  2008322 | Vehicle-to-grid           |  0.6352812
    55017 | Fusion power              |  0.6347449
 14205946 | Algae fuel                |  0.6347449
 13690575 | Solar power               | 0.58286035
(11 rows)

Returns 5 rows from each segment with the text “iphone” highlighted in the message_text column. This example requires that you have enabled terms on the message_text field in the demo.twitter.message index. See the example in the gptext.enable_terms() reference.

=# SELECT t.id, gptext.highlight(t.message_text, 'message_text', s.hs) message FROM twitter.message t, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', '{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text') s WHERE t.id = s.id::int8;
    id    |                             message
----------+------------------------------------------------------------------
 19714120 | @ayee_Eddy2011 I love pancakes too! #<em>iPhone</em> #app
 19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
 19416451 | I'm in love with my new <em>iPhone</em> (:
 20257190 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
 20759274 | Dropped frutopia on
          : My phone... #ciaowaterdamage I hate <em>iPhones</em>.
 20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #<em>iphone</em> #apple #technology
 19424811 | I hate
          : <em>iPhones</em>
          :
 20663075 | RT @indigoFKNvanity: I hate the autocorrect on <em>iPhones</em>!!!!!!!!!
 20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
 20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
(10 rows)

gptext.search_count()

Returns the number of documents that match the search query.

Syntax

gptext.search_count(<index_name>, <search_query>, <filter_queries>[, <options>])

Parameters

<index_name>

The name of the index.

<search_query>

The search query.

<filter_queries>

A comma-delimited array of filter queries, if any. If none, set this parameter to null.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

bigint

Privileges required

You must have the SELECT privilege to execute this function.

Example

=# SELECT * FROM gptext.search_count('demo.wikipedia.articles', 'bubble', null);
 count
-------
     3
(1 row)

gptext.search_external()

Searches a GPText external index.

Syntax

gptext.search_external(<table-exp>, <index_name>, <search_query>, <filter_queries>[, <options>])

Parameters

<table-exp>

A table-valued expression. Because external indexes are not associated with a database table, this parameter is ignored. An expression like the following is sufficient:

TABLE(SELECT 1 SCATTER BY 1)

<index_name>

The name of the GPText external index to search.

<search_query>

Text value containing a Solr text search query.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.search_external_result

This type has the following columns:

Column          Type
id              text
title           text
subject         text
description     text
comments        text
author          text
keywords        text
category        text
resourcename    text
url             text
content_type    text
last_modified   text
links           text
sha256          text
content         text
score           double precision
meta            text

The last column, meta, is present only if the optional <options> argument is included in the search.

Remarks

When you add an external document to the index, Apache Tika extracts a core set of metadata from the document. These are the columns listed in the Return type section. If any of these core metadata values are not present or do not exist in the document type, the value of the column in the result row is null.

If the <options> argument is supplied, the results contain an additional text column named meta. The meta column contains additional document-type-specific metadata. You can use the gptext.gptext_retrieve_field() function and its variants to extract individual metadata values by name from the meta column.

If the <options> argument contains the fl=<list> Solr option, Solr returns values only for the columns included in <list> and the id, score, and meta columns. Other columns in the result set will have null values. It is more efficient to filter out columns in Solr than to retrieve all columns from Solr and then choose a subset of columns in the SQL SELECT statement.

Examples

Finds HTML documents containing the term “facet”.

=# \x on
=# SELECT id, title FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'facet', '{content_type:*html*}');
-[ RECORD 1 ]--------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
title | GPText Function Reference || Pivotal GPText Docs
-[ RECORD 2 ]--------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/queries.html
title | Querying GPText Indexes || Pivotal GPText Docs
-[ RECORD 3 ]--------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/guc_ref.html
title | GPText Configuration Parameters || Pivotal GPText Docs

Although just two columns are in this result set, the data Greenplum Database receives from Solr includes all core metadata fields, including the content field, which contains the full text of the document. The next example shows how to limit the data transferred from Solr to Greenplum Database with the Solr options argument.

This example lists the id, title, sha256, and score columns from the core metadata and extracts meta_creation_date from the additional metadata supplied for PDF documents in the gptext-docs external index. The fl=title,sha256 Solr option prevents Solr from transferring unneeded fields from the index to Greenplum Database. (The id and score columns are always transferred.)

=# \x on
=# SELECT id, title, sha256, score, gptext.gptext_retrieve_field(meta, 'meta_creation_date') created FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', '*:*', '{content_type:*pdf*}', 'fl=title,sha256');
-[ RECORD 1 ]-------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf
title   | Pivotal GPText 2.1.3 Documentation | Pivotal GPText Docs
sha256  | 2E063DF5037B9ACC6E180681AE6838077BC5F7A362B4A1E67D9D8FF3E4DD7F3D
score   | 1
created | 2017-09-22T17:22:54Z,2017-09-22T17:22:54Z
-[ RECORD 2 ]-------------------------------------------------------------
id      | http://gpdb.docs.pivotal.io/latest/pdf/GPDB510Docs.pdf
title   | Version 5.1.0
sha256  | AF0B71D032C99A6BE817E1FA2FB774EB7B4D47D75A755ABF54F4F60FEBB92FF7
score   | 1
created | 2017-10-31T19:12:41Z,2017-10-31T19:12:41Z

gptext.gptext_retrieve_field()

Retrieves a single field from the rf or meta search result column as a text value.

Syntax

gptext.gptext_retrieve_field(rf | meta, <field_name>)

Parameters

rf | meta

The name of the column in which GPText returns fields. This is rf for search results from regular GPText indexes and meta for search results from GPText external indexes.

<field_name>

The name of the field to retrieve.

Remarks

The fl=<field_list> Solr search option is added to the <options> parameter of the gptext.search() function to request additional stored fields. The additional fields are returned in the results in a column named rf (meta for external indexes). This column value has a format like the following:

column_value { name: "_version" value: "1544714234398507008" } column_value { name: "revision" value: "9.70" } column_value { name: "author" value: "jdough" }

The gptext.gptext_retrieve_field() function extracts the value for a single specified field and returns it as a text value. If there is no field with the specified name in the rf column, it returns NULL.

Storing additional fields in an index requires editing managed-schema to specify the fields that should be stored. See Storing Additional Fields in an Index for instructions.
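To make the rf and meta value format concrete, the following Python sketch extracts a field from a value shaped like the sample in these remarks. It is an illustration of the format only, not the GPText implementation. The exact whitespace in the value is an assumption, so the pattern treats all whitespace as optional, and values containing escaped double quotes are not handled.

```python
import re
from typing import Optional

# Matches one column_value { name: "..." value: "..." } entry.
# Whitespace is optional because the exact spacing is an assumption.
_ENTRY = re.compile(
    r'column_value\s*\{\s*name:\s*"([^"]*)"\s*value:\s*"([^"]*)"\s*\}'
)

SAMPLE_RF = (
    'column_value { name: "_version" value: "1544714234398507008" } '
    'column_value { name: "revision" value: "9.70" } '
    'column_value { name: "author" value: "jdough" }'
)


def retrieve_field(rf: str, field_name: str) -> Optional[str]:
    """Return the text value of field_name, or None if it is not present."""
    for name, value in _ENTRY.findall(rf):
        if name == field_name:
            return value
    return None


def retrieve_field_int(rf: str, field_name: str) -> Optional[int]:
    """Same lookup, with the value converted to an integer."""
    value = retrieve_field(rf, field_name)
    return None if value is None else int(value)


def retrieve_field_float(rf: str, field_name: str) -> Optional[float]:
    """Same lookup, with the value converted to a float."""
    value = retrieve_field(rf, field_name)
    return None if value is None else float(value)


print(retrieve_field(SAMPLE_RF, "author"))          # jdough
print(retrieve_field_float(SAMPLE_RF, "revision"))  # 9.7
print(retrieve_field(SAMPLE_RF, "missing"))         # None
```

The three helpers mirror the text, integer, and float variants of gptext.gptext_retrieve_field(), with None standing in for NULL.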

gptext.gptext_retrieve_field_int()

Retrieves a single field from the rf or meta search result column as an integer value.

Syntax

gptext.gptext_retrieve_field_int(rf | meta, <field_name>)

Parameters

rf | meta

The name of the column containing fields to be retrieved. For regular GPText indexes, it is rf. For GPText external indexes, it is meta.

<field_name>

The name of the integer field to retrieve.

Remarks

The gptext.gptext_retrieve_field_int() function is the same as the gptext.gptext_retrieve_field() function, except that the extracted field value is converted to an integer value.

gptext.gptext_retrieve_field_float()

Retrieves a single field from the rf or meta search result column as a float value.

Syntax

gptext.gptext_retrieve_field_float(rf | meta, <field_name>)

Parameters

rf | meta

The name of the column containing fields to be retrieved. For regular GPText indexes, it is rf. For GPText external indexes, it is meta.

<field_name>

The name of the float field to retrieve.

Remarks

The gptext.gptext_retrieve_field_float() function is the same as the gptext.gptext_retrieve_field() function, except that the extracted field value is converted to a float value.

gptext.highlight()

Highlights terms by inserting markup tags into data.

Syntax

gptext.highlight(<column_data>, <column_name>, <offsets>)

Parameters

<column_data>

The text data from the table which is to be tagged with highlighting tags.

<column_name>

The name of the corresponding column from the table.

<offsets>

A gptexthstore value that contains key-value pairs that indicate the locations of the text to highlight within the column data. See Remarks for information about the gptexthstore data type.

Prerequisite

To use highlighting, term vectors must be enabled before creating the index. To enable term vectors, call gptext.enable_terms() for each field where you want to enable terms, then index or re-index with gptext.index().

Remarks

The offsets parameter is a gptexthstore, where each key is a column name and the value is a comma-separated list of offsets into the column data. This hstore is constructed by gptext.search() when highlighting is enabled in the options parameter. Following is an example of the offsets hstore content:

"field1"=>"0:5,9:14", "field2"=>"13:20"

gptext.highlight() will insert two sets of tags into the field1 data and one set into the field2 data at the indicated offsets.

The gptexthstore type is a limited form of the Postgres hstore type. It has only the hstore input and output functions implemented, as gptext_hstore_in and gptext_hstore_out.

The highlight tags are defined by the gptext.hl_pre_tag and gptext.hl_post_tag server configuration parameters. Their default values are <em> and </em>, respectively.
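The way offsets drive tag insertion can be sketched in a few lines of Python. This is an illustration only, not the GPText implementation. It assumes that each start:end pair is a character offset into the column data with an exclusive end, which this reference does not state, and the function name apply_highlight is hypothetical.

```python
def apply_highlight(text: str, offsets: str,
                    pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    """Wrap each start:end span of text in highlight tags."""
    spans = []
    for pair in offsets.split(","):
        start, end = pair.split(":")
        spans.append((int(start), int(end)))
    # Work from the last span to the first so that earlier offsets
    # remain valid while tags are inserted.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + pre_tag + text[start:end] + post_tag + text[end:]
    return text


print(apply_highlight("hello my world!", "0:5,9:14"))
# <em>hello</em> my <em>world</em>!
```

With the offsets value "0:5,9:14", two sets of tags are inserted, matching the field1 case described above.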

Example

See the highlighting example in the gptext.search() reference.

gptext.highlight_external()

Highlights terms in search results from external indexes by inserting markup tags.

Syntax

gptext.highlight_external(<table_exp>, <index>, <search_query>, <filter_queries>[, <options>])

Parameters

<table_exp>

A table expression, ignored for external indexes. An expression such as TABLE(SELECT 1 SCATTER BY 1) is sufficient.

<index>

Name of the index containing data to highlight.

<search_query>

Text value containing a Solr text search query.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Remarks

The gptext.highlight_external() function searches a GPText external index and encloses the search terms in markup tags in the returned results.

Example

Search for and highlight the terms “zookeeper” and “solr” in HTML documents.

=# SELECT id, content FROM gptext.highlight_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', '{!gptextqp}zookeeper AND solr', '{content_type:*html*}', 'rows=2');
-[ RECORD 1 ]-------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/administering.html
content | includes security considerations, monitoring <em>Solr</em> index statistics, managing and monitoring <em>ZooKeeper</em>
-[ RECORD 2 ]-------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/performance.html
content | problems can result from resource contention between the Greenplum Database, <em>Solr</em>, and <em>ZooKeeper</em> clusters
-[ RECORD 3 ]-------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/ha.html
content | /topics/utility_ref.html GPText Management Utilities | rect/210/topics/type_ref.html GPText and <em>Solr</em>
-[ RECORD 4 ]-------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/indexes.html
content | /topics/utility_ref.html GPText Management Utilities | rect/210/topics/type_ref.html GPText and <em>Solr</em>

Faceted Search

Faceting breaks up a search result into multiple categories, showing counts for each.

gptext.faceted_field_search()

The faceted_field_search() function breaks search results into field name categories.

Syntax

gptext.faceted_field_search(<index_name>, <query>, <filter_queries>, <facet_fields>, <facet_limit>, <minimum>[, <options>])

Parameters

<index_name>

The name of the index.

<query>

Query statement. Use *:* to query for all results.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<facet_fields>

An array of field names to facet. Use Greenplum Database array notation.

<facet_limit>

Maximum number of results to be returned for each aggregation (facet).

<minimum>

Minimum number of results required before an aggregation (facet) will be returned. Enter 0 to return all facets.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_field_result

This is a composite type with the following columns:

Column         Type
field_name     text
field_value    text
value_count    bigint

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

None.

Examples

Facet all tweets on the spam and truncated fields.

=# SELECT * FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, '{spam,truncated}', 2, 0);
 field_name | field_value | value_count
------------+-------------+-------------
 spam       | true        |        1730
 truncated  | false       |        1705
 truncated  | true        |          25
(3 rows)

Facet all tweets on author_id, with a limit of five facets and a minimum of two tweets. Selects five authors with at least two tweets.

=# SELECT * FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, '{author_id}', 5, 2);
 field_name | field_value | value_count
------------+-------------+-------------
 author_id  | 102185050   |           9
 author_id  | 202305785   |           2
 author_id  | 64111799    |           2
 author_id  | 45326213    |           2
 author_id  | 195035308   |           2
(5 rows)
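The effect of the <facet_limit> and <minimum> arguments can be sketched in Python. This is a simplified model, not the Solr implementation. It assumes facet values are ordered by descending count before the limit is applied, and the per-author counts in the sample are hypothetical.

```python
def select_facets(value_counts: dict, facet_limit: int, minimum: int) -> list:
    """Keep values with at least `minimum` matches, most frequent first,
    and return at most `facet_limit` of them."""
    kept = [(value, count) for value, count in value_counts.items()
            if count >= minimum]
    # The sort is stable, so values with equal counts keep their input order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:facet_limit]


# Hypothetical per-author tweet counts.
counts = {"a": 9, "b": 2, "c": 1, "d": 2, "e": 2, "f": 2, "g": 2}
print(select_facets(counts, facet_limit=5, minimum=2))
# [('a', 9), ('b', 2), ('d', 2), ('e', 2), ('f', 2)]
```

Author c is dropped by the minimum, and author g is dropped by the limit.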

gptext.faceted_query_search()

The faceted_query_search() function breaks search results into categories defined by queries that you provide.

Syntax

gptext.faceted_query_search(<index_name>, <query>, <filter_queries>, <facet_queries>[, <options>])

Parameters

<index_name>

The name of the index.

<query>

Query statement. Use *:* to query for all results.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<facet_queries>

Type: text[]. Required. An array of facet queries.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_query_result

This is a composite type with the following columns:

Column         Type
query_name     text
value_count    bigint

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

None.

Example

This example uses Solr queries to divide Twitter authors into three classes based on number of followers.

=# SELECT * FROM gptext.faceted_query_search('demo.twitter.message', '*:*', null, '{author_followers_count:[0 TO 5],author_followers_count:[6 TO 10],author_followers_count:[11 TO *]}');
            query_name            | value_count
----------------------------------+-------------
 author_followers_count:[0 TO 5]  |          39
 author_followers_count:[11 TO *] |        1632
 author_followers_count:[6 TO 10] |          36
(3 rows)
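The <facet_queries> argument in this example is a Greenplum Database text array literal whose elements are Solr range queries. The following Python sketch shows one way a client might build that literal. The helper names range_query and array_literal are hypothetical, and the sketch assumes the elements contain no commas, braces, double quotes, or backslashes, which would need quoting in an array literal.

```python
def range_query(field: str, low, high) -> str:
    """Build a Solr range query on a field. Use '*' for an open end."""
    return f"{field}:[{low} TO {high}]"


def array_literal(items) -> str:
    """Build a text[] literal from items that need no quoting."""
    return "{" + ",".join(items) + "}"


facet_queries = array_literal([
    range_query("author_followers_count", 0, 5),
    range_query("author_followers_count", 6, 10),
    range_query("author_followers_count", 11, "*"),
])
print(facet_queries)
```

The printed value is the same array literal that the example passes as the fourth argument.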

gptext.faceted_range_search()

The faceted_range_search() function breaks search results into range categories over a numeric or date field, with ranges defined by the <range_start>, <range_end>, and <range_gap> arguments.

Syntax

gptext.faceted_range_search(<index_name>, <query>, <filter_queries>, <field_name>, <range_start>, <range_end>, <range_gap>[, <options>])

Parameters

<index_name>

The name of the index.

<query>

Query statement. Use *:* to query for all results.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<field_name>

The name of the field on which to facet.

<range_start>

Beginning of the range.

<range_end>

End of the range.

<range_gap>

Size of range increment, a text value.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_range_result

This is a composite type with the following columns:

Column         Type
field_name     text
range_value    text
value_count    bigint

Privileges required

You must have the SELECT privilege to execute this function.

Example

Facet on date range from midnight August 1, 2011 to midnight November 1, 2011, with a 7-day gap.

=# SELECT * FROM gptext.faceted_range_search('demo.twitter.message', '*:*', null, 'created_at', '2011-08-01T00:00:00Z', '2011-11-01T00:00:00Z', '+7DAY');
 field_name |     range_value      | value_count
------------+----------------------+-------------
 created_at | 2011-08-01T00:00:00Z |           0
 created_at | 2011-08-08T00:00:00Z |           0
 created_at | 2011-08-15T00:00:00Z |           0
 created_at | 2011-08-22T00:00:00Z |          52
 created_at | 2011-08-29T00:00:00Z |         189
 created_at | 2011-09-05T00:00:00Z |         545
 created_at | 2011-09-12T00:00:00Z |           0
 created_at | 2011-09-19T00:00:00Z |         109
 created_at | 2011-09-26T00:00:00Z |          69
 created_at | 2011-10-03T00:00:00Z |          59
 created_at | 2011-10-10T00:00:00Z |         206
 created_at | 2011-10-17T00:00:00Z |         147
 created_at | 2011-10-24T00:00:00Z |         112
 created_at | 2011-10-31T00:00:00Z |          94
(14 rows)
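The 14 rows in this output follow from the range arguments: a new range starts every 7 days from <range_start> for as long as the start falls before <range_end>. The following Python sketch reproduces the range_value column. The function name range_buckets is hypothetical, and the sketch takes the gap as a plain number of days rather than parsing the Solr '+7DAY' syntax.

```python
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def range_buckets(range_start: str, range_end: str, gap_days: int) -> list:
    """List the start of each range from range_start up to,
    but not including, range_end."""
    current = datetime.strptime(range_start, FORMAT)
    end = datetime.strptime(range_end, FORMAT)
    buckets = []
    while current < end:
        buckets.append(current.strftime(FORMAT))
        current += timedelta(days=gap_days)
    return buckets


buckets = range_buckets("2011-08-01T00:00:00Z", "2011-11-01T00:00:00Z", 7)
print(len(buckets))              # 14
print(buckets[0], buckets[-1])   # 2011-08-01T00:00:00Z 2011-10-31T00:00:00Z
```

The last range starts on October 31 because the next start, November 7, falls after the end of the range.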

Working with Terms

The optional terms component saves terms output by the Solr index analyzer chain to a vector in the index. Enabling terms is required to use parts-of-speech and named entity recognition.

gptext.enable_terms()

Enables term vectors and positions to allow extracting terms and their positions from fields of data type text.

Syntax

gptext.enable_terms(<index_name>, <field_name>)

Parameters

<index_name>

The name of the index for which you want to enable terms.

<field_name>

The name of the field for which you want to enable terms.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Solr can mark terms and their positions in documents when indexing. This capability is disabled by default. Use gptext.enable_terms() to enable the capability.

Call gptext.enable_terms() for each field where you want to enable terms.

After calling this function, you must index or re-index with gptext.index().

Examples

=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
WARNING:  Enable terms for field: message_text of index: demo.twitter.message successfully. Reindex data needed.
 enable_terms
--------------
 t
(1 row)

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
 dbid | num_docs
------+----------
    3 |      947
    2 |     1020
(2 rows)

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)

gptext.ner_terms()

Gets NER-tagged terms for a text field configured for NER (Named Entity Recognition) from documents in a Solr index that match a query.

Syntax

gptext.ner_terms(TABLE(<select-statement>), <index_name>, <field_name>, <search_query>, <filter_queries>[, <options>])

gptext.ner_terms(<index_name>, <field_name>, <search_query>, <filter_queries>[, <options>])

Parameters

TABLE(<select-statement>)

A table expression that specifies a SELECT statement. Specify in the format:

TABLE(SELECT * FROM <src_table>)

This parameter is ignored and can be omitted.

<index_name>

The name of the index to query for NER terms.

<field_name>

The name of the NER-enabled text field to query for NER terms.

<search_query>

A query that matches documents to include in the results.

<filter_queries>

A comma-delimited array of filter queries, if any. If none, set this parameter to null.

<options>

An optional, comma-delimited list of Solr query parameters.

Return type

SETOF gptext.ner_term_info

This is a composite type with the following columns:

Column         Type
id             text
term           text
entity_type    text
frequency      integer

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

The gptext.ner_terms() function displays the NER-tagged terms in a field’s terms vector for documents that match a query. The <fieldType> of the text field must be configured in the managed-schema for the index with the OpenNLP tokenizers and filters. See Using Named Entity Recognition with GPText for NER configuration instructions.

NER-tagged terms have the form _ner_<entity-type>_<token-text>. Terms that do not have the _ner_ prefix are not included in the output of this function. Use the gptext.terms() function to retrieve all terms from matched documents, including NER-tagged terms.

The output of the gptext.ner_terms() function contains one row for each tagged entity in documents that match the search. The term column contains the term with the _ner_<entity-type>_ prefix removed. The entity_type column contains the type of the entity, and the frequency column is the number of times the term appears in the document.

Terms with the same entity type, such as person, organization, and location, that occur in consecutive positions in the document are concatenated to create a compound term in the output. For example, if the terms _ner_organization_federal, _ner_organization_reserve, and _ner_organization_board appear in consecutive positions in the document, the compound term federal reserve board appears in the output. When you perform a search on tagged terms and include highlighting, the compound term is highlighted, rather than each of the consecutive terms.
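The concatenation of consecutive tagged terms can be sketched in Python. This is a simplified model of the behavior described in these remarks, not the GPText implementation. It assumes the input is the list of raw terms in document order and that entity type names contain no underscore. The function names are hypothetical.

```python
PREFIX = "_ner_"


def parse_ner_term(raw: str):
    """Split '_ner_<entity-type>_<token-text>' into (entity_type, token).
    Returns None for a term without the _ner_ prefix."""
    if not raw.startswith(PREFIX):
        return None
    entity_type, _, token = raw[len(PREFIX):].partition("_")
    return entity_type, token


def compound_terms(raw_terms: list) -> list:
    """Merge runs of consecutive tagged terms that share an entity type."""
    compounds = []
    run_type, run_tokens = None, []
    for raw in raw_terms + [None]:  # the trailing None flushes the last run
        parsed = parse_ner_term(raw) if raw is not None else None
        if parsed is not None and parsed[0] == run_type:
            run_tokens.append(parsed[1])
            continue
        if len(run_tokens) > 1:
            compounds.append((run_type, " ".join(run_tokens)))
        run_type, run_tokens = (parsed[0], [parsed[1]]) if parsed else (None, [])
    return compounds


terms = ["the", "_ner_organization_federal", "_ner_organization_reserve",
         "_ner_organization_board", "said"]
print(compound_terms(terms))
# [('organization', 'federal reserve board')]
```

The sketch returns only the compound terms. The single terms that make up a compound also appear as their own rows in the function output.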

Example

Get the NER-tagged terms for the content field from the document with id 842613485.

=# SELECT * FROM gptext.ner_terms('demo.public.news_demo', 'content', 'id:842613485', null);
    id     |             term              | entity_type  | frequency
-----------+-------------------------------+--------------+-----------
 842613485 | alan murray                   | person       |         1
 842613485 | citicorp information services | organization |         1
 842613485 | morgan stanley                | organization |         1
 842613485 | stephen roach                 | person       |         1
 842613485 | mr. murray                    | person       |         1
 842613485 | alan                          | person       |         1
 842613485 | mr.                           | person       |         1
 842613485 | murray                        | person       |         2
 842613485 | roach                         | person       |         1
 842613485 | stephen                       | person       |         1
 842613485 | board                         | organization |         1
 842613485 | citicorp                      | organization |         1
 842613485 | consumer                      | organization |         1
 842613485 | fed                           | organization |         1
 842613485 | federal                       | organization |         1
 842613485 | information                   | organization |         1
 842613485 | morgan                        | organization |         1
 842613485 | reserve                       | organization |         1
 842613485 | services                      | organization |         1
 842613485 | stanley                       | organization |         1
 842613485 | federal reserve board         | organization |         1
(21 rows)

gptext.terms()

Gets the term vectors for the specified field from documents in a Solr index. You can use gptext.terms() to create terms tables.

Syntax

gptext.terms(TABLE(<select-statement>), <index_name>, <field_name>, <search_query>, <filter_queries>[, <options>])

gptext.terms(<index_name>, <field_name>, <search_query>, <filter_queries>[, <options>])

Parameters

TABLE(<select-statement>)

A table expression that specifies a SELECT statement. Specify in the format:

TABLE(SELECT 1 SCATTER BY 1)

This parameter is ignored and can be omitted.

<index_name>

The name of the index to query for terms.

<field_name>

The name of the field to query for terms.

<search_query>

A query that the document must match.

<filter_queries>

A comma-delimited array of filter queries, if any. If none, set this parameter to null.

<options>

An optional, comma-delimited list of Solr query parameters.

Return type

SETOF gptext.term_info

This is a composite type with the following columns:

Column     Type
id         text
term       text
positions  integer[]

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

To enable using gptext.terms(), execute the GPText function gptext.enable_terms(), then reindex with gptext.index().

If the field has been tagged with Named Entity Recognition (NER) or Parts-of-Speech (POS) filters, the tagged terms appear in the gptext.terms() output in raw format, for example _ner_person_david. Use the gptext.ner_terms() function to view only NER-tagged terms.

Examples

This example creates a terms table from the output of the gptext.terms() function.

=# CREATE TABLE twitter.terms AS
   SELECT * FROM gptext.terms('demo.twitter.message', 'message_text', 'iphone', null)
   DISTRIBUTED BY (id);
SELECT 5385

GPText Index Monitoring

These functions provide statistics and status for indexes managed by the GPText cluster.

gptext.cluster_status()

Shows the status of indexes managed by the GPText cluster.

Syntax

gptext.cluster_status()

Return Type

SETOF gptext.cluster_status_result

This is a composite type with the following columns:

Column               Type
index_name           text
max_shards_per_node  integer
router               text
replication_factor   integer
auto_add_replicas    boolean
znode_version        integer
config_name          text
partitioned          boolean

Example

=# SELECT * FROM gptext.cluster_status();
       index_name        | max_shards_per_node |  router  | replication_factor | auto_add_replicas | znode_version |       config_name       | partitioned
-------------------------+---------------------+----------+--------------------+-------------------+---------------+-------------------------+-------------
 demo.twitter.message    |                   4 | implicit |                  2 | f                 |             8 | demo.twitter.message    | t
 demo.wikipedia.articles |                   4 | implicit |                  2 | f                 |             8 | demo.wikipedia.articles | f
(2 rows)

gptext.index_size()

Shows the number of documents indexed and total disk space used for GPText indexes.

Syntax

gptext.index_size([<index_name>])

Parameters

<index_name>

The name of the index. Optional. Returns sizes for all indexes if no index is specified.

Return Type

SETOF gptext.index_size_result

This is a composite type with the following columns:

Column         Type
index_name     text
num_docs       integer
size_in_bytes  bigint

Examples

=# SELECT * FROM gptext.index_size();
       index_name        | num_docs | size_in_bytes
-------------------------+----------+---------------
 demo.wikipedia.articles |       23 |        500515
 demo.twitter.message    |     1730 |        767118
 gptext-docs             |       16 |        618231
(3 rows)

=# SELECT * FROM gptext.index_size('demo.wikipedia.articles');
       index_name        | num_docs | size_in_bytes
-------------------------+----------+---------------
 demo.wikipedia.articles |       23 |        500515
(1 row)

gptext.index_status()

Shows status of replicas for a specified index or for all indexes.

Syntax

gptext.index_status([<index_name>])

Parameters

<index_name>

The name of the index. Optional. Returns status for all indexes if no index is specified.

Return Type

SETOF gptext.index_status_result

This is a composite type with the following columns:

Column          Type
index_name      text
shard_name      text
shard_state     text
replica_name    text
replica_state   text
core            text
node_name       text
base_url        text
is_leader       boolean
partitioned     boolean
external_index  boolean

Examples

1. Show status for a single index.

=# SELECT * FROM gptext.index_status('demo.wikipedia.articles');
       index_name        | shard_name | shard_state | replica_name | replica_state |                    core                    |    node_name    |        base_url        | is_leader | partitioned | external_index
-------------------------+------------+-------------+--------------+---------------+--------------------------------------------+-----------------+------------------------+-----------+-------------+----------------
 demo.wikipedia.articles | shard1     | active      | core_node3   | active        | demo.wikipedia.articles_shard1_replica_n1  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
 demo.wikipedia.articles | shard1     | active      | core_node5   | active        | demo.wikipedia.articles_shard1_replica_n2  | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
 demo.wikipedia.articles | shard2     | active      | core_node7   | active        | demo.wikipedia.articles_shard2_replica_n4  | sdw1:18984_solr | http://sdw1:18984/solr | f         | f           | f
 demo.wikipedia.articles | shard2     | active      | core_node9   | active        | demo.wikipedia.articles_shard2_replica_n6  | sdw2:18984_solr | http://sdw2:18984/solr | t         | f           | f
 demo.wikipedia.articles | shard3     | active      | core_node11  | active        | demo.wikipedia.articles_shard3_replica_n8  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
 demo.wikipedia.articles | shard3     | active      | core_node13  | active        | demo.wikipedia.articles_shard3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
 demo.wikipedia.articles | shard4     | active      | core_node15  | active        | demo.wikipedia.articles_shard4_replica_n12 | sdw1:18984_solr | http://sdw1:18984/solr | t         | f           | f
 demo.wikipedia.articles | shard4     | active      | core_node16  | active        | demo.wikipedia.articles_shard4_replica_n14 | sdw2:18984_solr | http://sdw2:18984/solr | f         | f           | f
(8 rows)

2. Show status for all GPText indexes.

=# SELECT * FROM gptext.index_status();
       index_name        | shard_name | shard_state | replica_name | replica_state |                    core                    |    node_name    |        base_url        | is_leader | partitioned | external_index
-------------------------+------------+-------------+--------------+---------------+--------------------------------------------+-----------------+------------------------+-----------+-------------+----------------
 demo.public.news_demo   | shard1     | active      | core_node3   | active        | demo.public.news_demo_shard1_replica_n1    | sdw2:18984_solr | http://sdw2:18984/solr | t         | f           | f
 demo.public.news_demo   | shard1     | active      | core_node5   | active        | demo.public.news_demo_shard1_replica_n2    | sdw1:18984_solr | http://sdw1:18984/solr | f         | f           | f
 demo.public.news_demo   | shard2     | active      | core_node7   | active        | demo.public.news_demo_shard2_replica_n4    | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
 demo.public.news_demo   | shard2     | active      | core_node8   | active        | demo.public.news_demo_shard2_replica_n6    | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
 demo.twitter.message    | shard1     | active      | core_node3   | active        | demo.twitter.message_shard1_replica_n1     | sdw2:18984_solr | http://sdw2:18984/solr | t         | t           | f
 demo.twitter.message    | shard1     | active      | core_node5   | active        | demo.twitter.message_shard1_replica_n2     | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
 demo.twitter.message    | shard2     | active      | core_node7   | active        | demo.twitter.message_shard2_replica_n4     | sdw2:18983_solr | http://sdw2:18983/solr | f         | t           | f
 demo.twitter.message    | shard2     | active      | core_node9   | active        | demo.twitter.message_shard2_replica_n6     | sdw1:18984_solr | http://sdw1:18984/solr | t         | t           | f
 demo.twitter.message    | shard3     | active      | core_node11  | active        | demo.twitter.message_shard3_replica_n8     | sdw2:18984_solr | http://sdw2:18984/solr | t         | t           | f
 demo.twitter.message    | shard3     | active      | core_node13  | active        | demo.twitter.message_shard3_replica_n10    | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
 demo.twitter.message    | shard4     | active      | core_node15  | active        | demo.twitter.message_shard4_replica_n12    | sdw2:18983_solr | http://sdw2:18983/solr | f         | t           | f
 demo.twitter.message    | shard4     | active      | core_node16  | active        | demo.twitter.message_shard4_replica_n14    | sdw1:18984_solr | http://sdw1:18984/solr | t         | t           | f
 demo.wikipedia.articles | shard1     | active      | core_node3   | active        | demo.wikipedia.articles_shard1_replica_n1  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
 demo.wikipedia.articles | shard1     | active      | core_node5   | active        | demo.wikipedia.articles_shard1_replica_n2  | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
 demo.wikipedia.articles | shard2     | active      | core_node7   | active        | demo.wikipedia.articles_shard2_replica_n4  | sdw1:18984_solr | http://sdw1:18984/solr | f         | f           | f
 demo.wikipedia.articles | shard2     | active      | core_node9   | active        | demo.wikipedia.articles_shard2_replica_n6  | sdw2:18984_solr | http://sdw2:18984/solr | t         | f           | f
 demo.wikipedia.articles | shard3     | active      | core_node11  | active        | demo.wikipedia.articles_shard3_replica_n8  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
 demo.wikipedia.articles | shard3     | active      | core_node13  | active        | demo.wikipedia.articles_shard3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
 demo.wikipedia.articles | shard4     | active      | core_node15  | active        | demo.wikipedia.articles_shard4_replica_n12 | sdw1:18984_solr | http://sdw1:18984/solr | t         | f           | f
 demo.wikipedia.articles | shard4     | active      | core_node16  | active        | demo.wikipedia.articles_shard4_replica_n14 | sdw2:18984_solr | http://sdw2:18984/solr | f         | f           | f
(20 rows)

gptext.partition_status()

Lists indexes on partitioned tables or child partition names in the current Greenplum database.

Syntax

gptext.partition_status([<index_name>])

Parameters

<index_name>

Optional. Returns partition status for all indexes if no index is specified.

Return Type

SETOF gptext.partition_status_result

This is a composite type with the following columns:

Column          Type
partition_name  text
inherits_name   text
level           integer
cons            text

Example

List partition status for an index.

=# SELECT partition_name, inherits_name, level FROM gptext.partition_status('demo.twitter.message');
        partition_name        |    inherits_name     | level
------------------------------+----------------------+-------
 demo.twitter.message_1_prt_1 | demo.twitter.message |     1
 demo.twitter.message_1_prt_2 | demo.twitter.message |     1
 demo.twitter.message_1_prt_3 | demo.twitter.message |     1
 demo.twitter.message_1_prt_4 | demo.twitter.message |     1
(4 rows)

Remarks

The gptext.partition_status() function can only list the index partitions for tables in the current Greenplum database.

gptext.index_summary()

Shows replica status and core statistics for a specified index or for all indexes.

Syntax

gptext.index_summary([<index_name>])

Parameters

<index_name>

Optional. Returns information for all indexes if no index is specified.

Return Type

SETOF gptext.index_summary_result

This is a composite (row) type with the following columns:

Column                  Type
index_name              text
shard_name              text
shard_state             text
replica_name            text
replica_state           text
core                    text
node_name               text
base_url                text
is_leader               boolean
partitioned             boolean
external_index          boolean
core_name               text
instance_dir            text
data_dir                text
config                  text
schema                  text
start_time              text
uptime                  bigint
num_docs                integer
max_docs                integer
delete_docs             integer
index_heap_usage_bytes  bigint
version                 bigint
segment_count           integer
current                 boolean
has_deletions           boolean
directory               text
last_modified           text
size_in_bytes           bigint
size                    text

Example

List the Solr node, number of documents, and size of the leader replica for each shard of an index.

=# SELECT index_name, shard_name, node_name, num_docs, size_in_bytes
   FROM gptext.index_summary('demo.twitter.message')
   WHERE is_leader;
      index_name      | shard_name |    node_name    | num_docs | size_in_bytes
----------------------+------------+-----------------+----------+---------------
 demo.twitter.message | shard4     | sdw2:18983_solr |      417 |        295873
 demo.twitter.message | shard3     | sdw2:18984_solr |      449 |        302987
 demo.twitter.message | shard2     | sdw2:18983_solr |      449 |        311868
 demo.twitter.message | shard1     | sdw2:18984_solr |      415 |        282736
(4 rows)

GPText Index Configuration

The functions in this section provide configuration information and metrics for GPText indexes and help to configure indexes.

gptext.analyzer()

Shows the output from each class in the index or query analyzer chain for a given field type and user-supplied input text.

Syntax

gptext.analyzer(<index_name>, {'index' | 'query'}, <field_type>, <input>)

Parameters

<index_name>

The name of the index.

{'index' | 'query'}

Show output for the index analysis chain or the query analysis chain.

<field_type>

The field type to analyze.

<input>

A text string to pass through the analyzer.

Return Type

Text

Example

=# SELECT gptext.analyzer('demo.wikipedia.articles', 'index', 'text_intl', 'Chopin takes a wayward sidestep into a delicious series of upward flying scales');
 analyzer
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 (WorldLexerTokenizer,"{{""Chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""upward""},{""flying""},{""scales""}}")
 (CJKWidthFilter,"{{""Chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""upward""},{""flying""},{""scales""}}")
 (LowerCaseFilter,"{{""chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""upward""},{""flying""},{""scales""}}")
 (WorldLexerBigramFilter,"{{""chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""upward""},{""flying""},{""scales""}}")
 (StopFilter,"{{""chopin""},{""takes""},{},{""wayward""},{""sidestep""},{},{},{""delicious""},{""series""},{},{""upward""},{""flying""},{""scales""}}")
 (SetKeywordMarkerFilter,"{{""chopin""},{""takes""},{},{""wayward""},{""sidestep""},{},{},{""delicious""},{""series""},{},{""upward""},{""flying""},{""scales""}}")
 (PorterStemFilter,"{{""chopin""},{""take""},{},{""wayward""},{""sidestep""},{},{},{""delici""},{""seri""},{},{""upward""},{""fly""},{""scale""}}")
(7 rows)

Remarks

Each row of the gptext.analyzer() function result is a single text column containing the name of the analyzer class and a list of the tokens the class produced from the output of the previous class.

You can use this function to compare the output produced by analyzer chains configured for different field types and to test your analyzer configuration.
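To compare analyzer chains programmatically, each result row can be split into the class name and its token list. The Python sketch below is one possible way to do this on the client side. It assumes the row text has exactly the shape shown in the example above, and that tokens contain no commas, braces, or double quotes. The function name parse_analyzer_row is hypothetical and is not part of GPText.

```python
import re


def parse_analyzer_row(row):
    """Split one gptext.analyzer() row into (class_name, tokens).

    Positions where a filter removed the token (shown as {} in the
    output) are returned as None so that positions stay aligned.
    """
    row = row.strip()
    if not (row.startswith("(") and row.endswith(")")):
        raise ValueError("expected a parenthesized row value")
    class_name, _, quoted = row[1:-1].partition(",")
    # The token list is double-quoted, with "" standing for one quote.
    if quoted.startswith('"') and quoted.endswith('"'):
        quoted = quoted[1:-1].replace('""', '"')
    inner = quoted[1:-1]  # drop the outer braces of the array
    tokens = []
    for group in re.findall(r"\{([^{}]*)\}", inner):
        tokens.append(group.strip('"') if group else None)
    return class_name, tokens


if __name__ == "__main__":
    row = '(StopFilter,"{{""chopin""},{""takes""},{},{""wayward""}}")'
    print(parse_analyzer_row(row))
```

Keeping removed positions as None makes it straightforward to line up the output of two filters and see which stage dropped or changed a token.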

gptext.config_append()

Appends the contents of a local file to a ZooKeeper index configuration file.

Syntax

gptext.config_append(<index_name>, <local_config_file>[, <index_config_file>])

Parameters

<index_name>

The name of the index to configure.

<local_config_file>

The path and file name of a local file that you will append to the index configuration file.

<index_config_file>

Optional. The name of the ZooKeeper configuration file to which you will append the local file. If you omit this parameter, the function appends the local file to a file of the same name that resides in the top-level ZooKeeper directory.

Return Type

boolean

Example

Append the local file /home/gpadmin/stopwords.add to the top-level ZooKeeper file stopwords.txt for index demo.wikipedia.articles:

=# SELECT * FROM gptext.config_append('demo.wikipedia.articles', '/home/gpadmin/stopwords.add', 'stopwords.txt');
 config_append
---------------
 t
(1 row)

gptext.config_delete()

Deletes an index configuration file from ZooKeeper.

Syntax

gptext.config_delete(<index_name>, <index_config_file>)

Parameters

<index_name>

The name of the index that has the file to delete.

<index_config_file>

The ZooKeeper configuration file to delete. Include the path if the file does not reside at the top-level directory.

Return Type

boolean

Example

Delete the file named stopwords.add from the top-level configuration directory for the index demo.wikipedia.articles:

=# select * from gptext.config_delete('demo.wikipedia.articles', 'stopwords.add');
 config_delete
---------------
 t
(1 row)

gptext.config_get()

Displays the contents of a ZooKeeper index configuration file.

Syntax

gptext.config_get(<index_name>, <index_config_file>)

Parameters

<index_name>

The name of the index that has the file you want to display.

<index_config_file>

The ZooKeeper configuration file to display. Include the path if the file does not reside at the top-level ZooKeeper directory for the index.

Return Type

text

Example

Display the contents of synonyms.txt for the index demo.wikipedia.articles:

=# select * from gptext.config_get('demo.wikipedia.articles', 'synonyms.txt');
 config_get
----------------------------------------------------------------------------
 # The ASF licenses this file to You under the Apache License, Version 2.0
 # (the "License"); you may not use this file except in compliance with
 # the License.  You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.

 #-----------------------------------------------------------------------
 #some test synonym mappings unlikely to appear in real input text
 aaa => aaaa
 bbb => bbbb1 bbbb2
 ccc => cccc1,cccc2
 a\=>a => b\=>b
 a\,a => b\,b
 fooaaa,baraaa,bazaaa

 # Some synonym groups specific to this example
 GB,gib,gigabyte,gigabytes
 MB,mib,megabyte,megabytes
 Television, Televisions, TV, TVs
 #notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
 #after us won't split it into two words.

 # Synonym mappings can be used for spelling correction too
 pixima => pixma

(1 row)

gptext.config_list()

Lists the ZooKeeper configuration files and directories for an index.

Syntax

gptext.config_list(<index_name>, [<index_config_path>,] <is_recursive>)

Parameters

<index_name>

The name of the index that has the files and directories you want to list.

<index_config_path>

Optional. A specific directory in the ZooKeeper configuration that you want to list. Omit this option to list configuration files and directories in the top-level directory.

<is_recursive>

Optional. A boolean value that determines whether the function recursively lists files and directories that are present in subdirectories.

Return Type

SETOF text

Examples

List ZooKeeper configuration files and directories only in the top-level directory for the index demo.wikipedia.articles:

=# select * from gptext.config_list('demo.wikipedia.articles', false);
 config_list
-----------------------------
 currency.xml
 mapping-FoldToASCII.txt
 managed-schema
 protwords.txt
 scripts.conf
 synonyms.txt
 managed_schema
 stopwords.txt
 velocity
 admin-extra.html
 aggconfig.xml
 emoticons.txt
 solrconfig.xml
 elevate.xml
 xslt
 mapping-ISOLatin1Accent.txt
 spellings.txt
 lang
(18 rows)

List ZooKeeper configuration files in the ZooKeeper lang subdirectory for demo.wikipedia.articles:

=# select * from gptext.config_list('demo.wikipedia.articles', 'lang', false);
 config_list
--------------------------
 lang/contractions_it.txt
 lang/contractions_ca.txt
 lang/stemdict_nl.txt
 lang/stopwords_hy.txt
 lang/stopwords_no.txt
 lang/stopwords_id.txt
 [...]
(39 rows)

List all configuration files and directories for demo.wikipedia.articles:

=# select * from gptext.config_list('demo.wikipedia.articles', true);
 config_list
----------------------------------
 currency.xml
 mapping-FoldToASCII.txt
 managed-schema
 protwords.txt
 scripts.conf
 synonyms.txt
 managed_schema
 stopwords.txt
 velocity
 velocity/doc.vm
 velocity/suggest.vm
 velocity/hit.vm
 [...]
(86 rows)

gptext.config_upload()

Uploads an index configuration file to ZooKeeper, replacing any existing file of the same name.

Syntax

gptext.config_upload(<index_name>, <local_config_file>[, <index_config_file>])

Parameters

<index_name>

The name of the index to configure.

<local_config_file>

The path and file name of a local file that you want to upload to ZooKeeper for the index. The function uploads this file to a file of the same name in the top-level ZooKeeper directory for the index, unless you include the <index_config_file> option to change the path or file name.

<index_config_file>

Optional. The destination path for the file in ZooKeeper. If you omit this parameter, the function uploads the local file to the top-level ZooKeeper directory for the index.
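The rule for choosing the destination file can be stated as a small helper. This Python sketch only restates the defaulting behavior described in the parameter text and does not call GPText. The function name upload_destination is hypothetical, and it assumes the local path uses forward slashes.

```python
import posixpath


def upload_destination(local_config_file, index_config_file=None):
    """Return the ZooKeeper path, relative to the index configuration, that receives the upload.

    With no <index_config_file>, the file keeps its own name and lands in
    the top-level configuration directory for the index.
    """
    if index_config_file is not None:
        return index_config_file
    return posixpath.basename(local_config_file)


if __name__ == "__main__":
    print(upload_destination("/home/gpadmin/stopwords.txt"))
    print(upload_destination("/home/gpadmin/stopwords_japanese.txt",
                             "lang/stopwords_ja.txt"))
```

The two calls correspond to uploading a replacement for a top-level file and to uploading a local file under a different name in a subdirectory.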

Return Type

boolean

Examples

Upload the local file /home/gpadmin/stopwords.txt to ZooKeeper, overwriting the existing stopwords.txt file for the index demo.wikipedia.articles:

=# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords.txt');
 config_upload
---------------
 t
(1 row)

Upload the local file /home/gpadmin/stopwords_japanese.txt to ZooKeeper, overwriting the file lang/stopwords_ja.txt for the index demo.wikipedia.articles:

=# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords_japanese.txt', 'lang/stopwords_ja.txt');
 config_upload
---------------
 t
(1 row)

gptext.get_field_type()

Displays the analyzer chain for a field type defined in the configuration for a specified index.

Syntax

gptext.get_field_type(<index_name>, <field_type>)

Parameters

<index_name>

The name of the index.

<field_type>

The name of a field type defined in the managed-schema configuration file for the index.

Example

=# SELECT gptext.get_field_type('demo.wikipedia.articles', 'text');
 get_field_type
-------------------------------------------------
 {
   "name": "text",
   "class": "solr.TextField",
   "indexAnalyzer": {
     "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
     "filters": [
       {"class": "solr.StopFilterFactory",
        "attributes": [
          {"name": "words", "value": "stopwords.txt"},
          {"name": "ignoreCase", "value": "true"}]},
       {"class": "solr.WordDelimiterFilterFactory",
        "attributes": [
          {"name": "catenateNumbers", "value": "1"},
          {"name": "generateNumberParts", "value": "1"},
          {"name": "splitOnCaseChange", "value": "1"},
          {"name": "generateWordParts", "value": "1"},
          {"name": "catenateAll", "value": "0"},
          {"name": "catenateWords", "value": "1"}]},
       {"class": "solr.LowerCaseFilterFactory"},
       {"class": "solr.KeywordMarkerFilterFactory",
        "attributes": [
          {"name": "protected", "value": "protwords.txt"}]},
       {"class": "solr.PorterStemFilterFactory"}]},
   "queryAnalyzer": {
     "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
     "filters": [
       {"class": "solr.SynonymFilterFactory",
        "attributes": [
          {"name": "expand", "value": "true"},
          {"name": "ignoreCase", "value": "true"},
          {"name": "synonyms", "value": "synonyms.txt"}]},
       {"class": "solr.StopFilterFactory",
        "attributes": [
          {"name": "words", "value": "stopwords.txt"},
          {"name": "ignoreCase", "value": "true"}]},
       {"class": "solr.WordDelimiterFilterFactory",
        "attributes": [
          {"name": "catenateNumbers", "value": "0"},
          {"name": "generateNumberParts", "value": "1"},
          {"name": "splitOnCaseChange", "value": "1"},
          {"name": "generateWordParts", "value": "1"},
          {"name": "catenateAll", "value": "0"},
          {"name": "catenateWords", "value": "0"}]},
       {"class": "solr.LowerCaseFilterFactory"},
       {"class": "solr.KeywordMarkerFilterFactory",
        "attributes": [
          {"name": "protected", "value": "protwords.txt"}]},
       {"class": "solr.PorterStemFilterFactory"}]},
   "attributes": [
     {"name": "autoGeneratePhraseQueries", "value": "true"},
     {"name": "positionIncrementGap", "value": "100"}]
 }

(1 row)

Remarks

You can use the gptext.list_field_types() function to list the text field types defined in the configuration for an index.

See Customizing GPText Indexes for information about text analyzer chains.

gptext.list_field_types()

Lists available field types in the managed-schema configuration file for a specified GPText index.

Syntax

gptext.list_field_types(<index_name>)

Parameters

<index_name>

The name of the index.

Return Type

SETOF text

Example

List the field types defined in the managed-schema configuration file for the demo.wikipedia.articles index.

=# SELECT gptext.list_field_types('demo.wikipedia.articles');
 list_field_types
---------------------------
 ancestor_path
 delimited_payloads_float
 delimited_payloads_int
 delimited_payloads_string
 descendent_path
 lowercase
 phonetic_en
 text
 text_ar
 text_bg
 text_ca
 text_cjk
 text_cz
 text_da
 text_de
 text_el
 text_en
 text_en_splitting
 text_en_splitting_tight
 text_es
 text_eu
 text_fa
 text_fi
 text_fr
 text_ga
 text_general
 text_general_rev
 text_gl
 text_hi
 text_hu
 text_hy
 text_icu
 text_id
 text_intl
 text_intl_prev
 text_it
 text_ja
 text_lv
 text_nl
 text_no
 text_pt
 text_ro
 text_ru
 text_sm
 text_sv
 text_th
 text_tr
 text_ws
 text_zhsmart
(49 rows)

gptext.reload_index()

Reloads Solr configuration files if they have been modified.

Syntax

gptext.reload_index(<index_name>)

Parameters

<index_name>

Optional. The name of the index for which to reload the configuration files.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

None.

Example

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index
--------------
 t
(1 row)

GPText Cluster Monitoring and Management

These functions provide information about the GPText cluster operation and status.

gptext.live_nodes() – lists active Solr nodes.

gptext.version() – returns version of GPText installation.

gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.

gptext.live_nodes()

Lists active Solr nodes and their up or down state.

Syntax

gptext.live_nodes()

Return Type

SETOF gptext.live_nodes_result

This is a composite type with the following columns:

Column    Type
host      text
port      bigint
data_dir  text
status    text

Example

=# SELECT * FROM gptext.live_nodes();
  host  | port  |      data_dir       | status
--------+-------+---------------------+--------
 gpdb51 | 18983 | /data/gpdata1/solr0 | u
 gpdb51 | 18984 | /data/gpdata2/solr0 | u
(2 rows)

Remarks

The status column can be u (up) or d (down).

gptext.version()

Returns the version of your GPText installation.

Syntax

SELECT * FROM gptext.version()

Parameters

None.

Return type

text

Privileges required

You do not need any privileges to execute this function.

Example

=# SELECT * FROM gptext.version();
            version
--------------------------------
 Greenplum Text Analytics 2.1.3
(1 row)

gptext.zookeeper_hosts()

Returns a list of ZooKeeper hosts and ports.

Syntax

gptext.zookeeper_hosts()

Return type

text

Remarks

This function returns a comma-separated list of ZooKeeper nodes in the format <host-name>:<port>.

Example

=# SELECT * FROM gptext.zookeeper_hosts();
  host  | port
--------+------
 gpdb51 | 2188
 gpdb51 | 2189
 gpdb51 | 2190
(3 rows)

High Availability

gptext.add_replica()

Adds a replica of an index shard.

Syntax

gptext.add_replica(<index_name>, <shard_name>[, <node_name>])

Parameters

<index_name>

Name of the index. If the index is for a partitioned database table, this must be the name of the root table.

<shard_name>

Name of the shard to replicate.

<node_name>

Name of the node where the replica is to be added. Optional. If omitted, SolrCloud chooses the node.

Return type

boolean

Remarks

This function is used by the GPText management utility gptext-replica add.

The value of the gptext.replication_factor configuration parameter when an index is created determines how many replicas are created for each shard. In a Greenplum system, there are the same number of shards as there are Greenplum segments. The number of replicas created for a new index is the number of segments times the value of the gptext.replication_factor configuration parameter, 2 by default. The replicas are distributed evenly among the live GPText nodes.

Replicas consume space on the host where they are created, so they are usually only created to replace a replica that has failed or become unavailable or to relocate a replica to another GPText instance. When adding replicas, you should maintain equal distribution of replicas among the GPText nodes and avoid placing multiple replicas for the same shard on the same host.

The total number of replicas for an index that can be placed on each GPText node is set when the index is created. (In Solr, this is the MaxShardsPerNode parameter.) GPText sets this limit by calculating the number of replicas to create per node and adding an additional factor, specified in the gptext.extension_factor server configuration parameter. This parameter can be set between 0 and 10; the default value is 2. Since the limit is set when the index is created, it is recommended to set the gptext.extension_factor parameter to a higher number to allow new replicas to be created when necessary.
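The arithmetic in these remarks can be worked through with a short Python sketch. It is a reading of the text above rather than GPText's actual code: rounding the per-node count up when replicas do not divide evenly among nodes is an assumption, and the function names are hypothetical.

```python
import math


def replicas_for_new_index(num_segments, replication_factor=2):
    """Total replicas created for a new index: one shard per segment,
    times gptext.replication_factor (2 by default)."""
    return num_segments * replication_factor


def max_replicas_per_node(num_segments, num_nodes,
                          replication_factor=2, extension_factor=2):
    """Per-node replica limit fixed at index creation time.

    Assumption: the per-node count is rounded up when the replicas do
    not divide evenly among the live nodes.
    """
    if not 0 <= extension_factor <= 10:
        raise ValueError("gptext.extension_factor must be between 0 and 10")
    total = replicas_for_new_index(num_segments, replication_factor)
    per_node = math.ceil(total / num_nodes)
    return per_node + extension_factor


if __name__ == "__main__":
    # 4 segments, 4 GPText nodes, default parameters
    print(replicas_for_new_index(4))
    print(max_replicas_per_node(4, 4))
```

With four segments, four GPText nodes, and the default parameter values, this gives 8 replicas, 2 per node, and a limit of 4 per node. That is consistent with the sample output for gptext.cluster_status() and gptext.index_status() earlier in this reference, where four-shard indexes on four Solr nodes report max_shards_per_node of 4. With gptext.extension_factor set to 0, the limit would equal the initial per-node count, leaving no room to add a replica on a node later.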

Example

=# SELECT * FROM gptext.add_replica('demo.wikipedia.articles', 'shard1');
 success |                core_name
---------+-----------------------------------------
 t       | demo.wikipedia.articles_shard1_replica3
(1 row)

gptext.delete_replica()

Deletes a named replica from the specified index and shard.

Syntax

gptext.delete_replica(<index_name>, <shard_name>, <replica_name>[, <only_if_down>])

Parameters

<index_name>

Name of the index.

<shard_name>

Name of the shard that contains the replica to delete.

<replica_name>

Name of the replica to remove.

<only_if_down>

Optional. When true, no action is taken if the replica is active. Default is false.

Return type

boolean

Remarks

Use the gptext.index_status() function to find the name of the replica to drop. Names are in the format core_nodeX, where X is a number.

This function is called from the gptext-replica drop management utility.

Examples

1. Delete the core_node5 replica if it is down.

=# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5', true);
ERROR: Delete replica failed: Attempted to remove replica : demo.wikipedia.articles/shard1/core_node5 with onlyIfDown='true', but state is 'active'.

2. Delete the core_node5 replica even if it is active.

=# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5');
 success
---------
 t
(1 row)

General Purpose Functions

gptext.count_t()

Counts the number of records in a table.

Syntax

gptext.count_t(<table_name>)

Parameters

<table_name>

Name of the table for which to count records.

Return type

integer

Privileges required

You need SELECT privileges on <table_name> to execute this function.

Example

=# SELECT * FROM gptext.count_t('demo.wikipedia.articles');
 count_t
---------
      23
(1 row)

GPText Management Utilities

Management utilities are GPText command-line utilities that are used to manage the GPText cluster. The utilities must be run on the Greenplum master as the gpadmin user.

To ensure the GPText command-line utilities can be found on the path, source the Greenplum Database and GPText environment scripts. The Greenplum Database environment must be set before you source the GPText environment script. For example, if both Greenplum Database and GPText are installed in the /usr/local/ directory, enter these commands:

$ source /usr/local/greenplum-db-<version>/greenplum_path.sh
$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh

Help

To get help for a utility, specify the flag -h or --help. A short help message displays with a list of parameters.

Debugging

To get verbose output for debugging a utility, specify the flags -v or --verbose.

GPText Utilities

gptext-backup – backs up a GPText index to a shared file system.

gptext-config – performs GPText configuration options.

gptext-expand – adds new GPText nodes to existing hosts in the cluster.

gptext-external – manages configurations for external data sources.

gptext-installsql – installs or removes the gptext schema and user-defined functions in Greenplum databases.

gptext-migrator – installs the GPText binaries into an upgraded Greenplum Database system.

gptext-recover – restarts GPText nodes that are down.

gptext-replica – adds or drops a replica of an index shard.

gptext-restore – restores a GPText index from a backup on a shared file system.

gptext-start – starts or restarts the GPText cluster.

gptext-state – displays the state of the GPText cluster and indexes.

gptext-stop – shuts down the GPText cluster.

gptext-uninstall – uninstalls GPText, including data and installed files, and ZooKeeper nodes if they were installed with the GPText installer.

gptext-upgrade – upgrades a GPText system to a new GPText version.

zkManager – checks the ZooKeeper cluster state. If ZooKeeper was installed with GPText, zkManager can start or stop the ZooKeeper cluster.

gptext-backup

Backs up a GPText index to local storage on the Greenplum Database cluster or to a shared file system.

Syntax

gptext-backup -h

gptext-backup [-P <pool>] local [-p <path>] -i <index> [-v]

gptext-backup [-P <pool>] -p <path> -i <index> [-n <name>] [-v]

gptext-backup [-P <pool>] -c -p <path> -i <index> [-v]

Parameters

-h
--help

Displays a usage message and exits.

-P <pool>
--pool <pool>

Sets the number of threads to add to the worker pool. If not specified, the gptext-backup utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-backup makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-c
--backup_conf

Back up configuration files only. The -c option cannot be used with the local keyword or the -n option.

local

Save backup to local storage.

-p <path>
--path <path>

If the local keyword is included, this is the directory where the utility creates the backup. If no path is specified, backup files are created in the Greenplum Database master and segment data directories. The backup name and locations of the backup files are reported in the output of the command.

If the local keyword is omitted, this is the path on the shared file system where the backup will be saved. The file system must be accessible from all hosts in the cluster and it must be readable and writable by the gpadmin user.

-i <index_name>
--index <index_name>

The name of the GPText index to back up.

-n <backup_name>
--name <backup_name>

A name for the backup on the shared file system. The -n option cannot be used with the local keyword or the -c option.

Notes

Back up an index so that you can restore it to a different GPText system or to avoid having to reindex if the existing index becomes corrupted.

A full GPText index backup includes index configuration files from ZooKeeper and index data up to the last transaction committed with the gptext.commit_index() function. Each index shard is backed up separately.

You can optionally back up just the index configuration files using the -c option.

You can back up an index to a shared file system or to local Greenplum Database cluster storage.

Back Up to Local Greenplum Database Cluster Storage

Use the gptext-backup local command to back up a GPText index to local storage.

For local backups, the utility generates a backup name in the format <index-name>_<timestamp>, for example demo.wikipedia.articles_2018-05-07T17:13:32.064427. The --name option is not allowed with local backups.

On the master host, gptext-backup creates a JSON file, <backup-name>.json, containing metadata about the backup, and a directory, <backup-name>, containing copies of the ZooKeeper configuration files for the index. The default backup directory on the master host is the Greenplum Database master data directory, specified by the MASTER_DATA_DIRECTORY environment variable.


gptext-backup local writes a backup file for each index shard on the host with the GPText node managing the lead replica for the shard. These files have names in the format snapshot.<index-name>_shard<n>_<timestamp>, for example snapshot.demo.wikipedia.articles_shard1_2018-05-07T17:13:32.064427. By default, these files are saved in the segment data directories.

You can specify a backup directory for the backup with the --path (-p) option. The directory must exist on all hosts in the cluster and be writable by the gpadmin user.

This example backs up the demo.wikipedia.articles index to the default backup locations.

$ gptext-backup local -i demo.wikipedia.articles
20180504:12:35:07:006126 gptext-backup:mdw:gpadmin-[INFO]:-Execute GPText cluster backup.
20180504:12:35:08:006126 gptext-backup:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180504:12:35:10:006126 gptext-backup:mdw:gpadmin-[INFO]:-Check status of index demo.wikipedia.articles...
20180504:12:35:10:006126 gptext-backup:mdw:gpadmin-[INFO]:-Executing backup...
20180504:12:35:10:006126 gptext-backup:mdw:gpadmin-[INFO]:-Processing......
20180504:12:35:11:006126 gptext-backup:mdw:gpadmin-[INFO]:-Recording metadata of index "demo.wikipedia.articles" into "/data/gpmaster/gpseg-1/demo.wikipedia.articles_2018-05-04T12:35:10.594013.json"
20180504:12:35:11:006126 gptext-backup:mdw:gpadmin-[INFO]:-Backing up configuration of index "demo.wikipedia.articles" into "/data/gpmaster/gpseg-1/demo.wikipedia.articles_2018-05-04T12:35:10.594013"
20180504:12:35:12:006126 gptext-backup:mdw:gpadmin-[INFO]:-Backup "demo.wikipedia.articles_shard*_2018-05-04T12:35:10.594013" is located in "/data/gpdata1/primary" on each host
20180504:12:35:12:006126 gptext-backup:mdw:gpadmin-[INFO]:-Done.
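The naming formats described for local backups can be taken apart or assembled with ordinary shell parameter expansion. The sketch below is illustrative only: the function names are hypothetical, and it assumes, based on the examples in this section, that the timestamp never contains an underscore.

```shell
# Split a local backup name of the form <index-name>_<timestamp>.
# Assumption: the timestamp has no "_", so the last "_" is the separator.
backup_index_name() { printf '%s\n' "${1%_*}"; }
backup_timestamp()  { printf '%s\n' "${1##*_}"; }

# Build the per-shard file name snapshot.<index-name>_shard<n>_<timestamp>.
# $1 = index name, $2 = shard number, $3 = timestamp
snapshot_file_name() { printf 'snapshot.%s_shard%s_%s\n' "$1" "$2" "$3"; }
```

Because the split is made at the last underscore, index names that themselves contain underscores are still recovered intact.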

Back Up to a Shared File System

If you back up to a shared file system, the shared file system must be mounted on all hosts with GPText nodes and must be writable by the gpadmin user. The file system could be, for example, an NFS mount or an SSH server with sshfs support. The file system must be configured and accessible before you execute the gptext-backup utility, and it must accept connections from each host in the cluster.

The gptext-backup utility creates a new subdirectory at the specified path with the backup name specified. The command fails if the directory already exists.

When the backup is complete, the backup directory contains the following:

backup.info
A text file containing three comma-separated strings: the database name, schema name, and index name for the index that was backed up.

backup.properties
A text file with properties that describe the backup, such as the date and time the backup started, the name of the backup, and the names of the Solr collection and collection configuration.

zk_backup
A directory containing the following files:

collection_state.json – a JSON file describing the status of the Solr collection.

configs/<collection-name>/ – a directory containing copies of the Solr configuration files stored in ZooKeeper for the index, for example managed-schema, solrconfig.xml, protwords.txt, stopwords.txt.

snapshot.shard0 … snapshot.shardN
A directory for each shard, with the files containing content of the shard.
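Since backup.info is described as three comma-separated strings, a script can read it back to confirm which index a shared-file-system backup belongs to. This is a sketch under assumptions: the helper name read_backup_info is hypothetical, and the exact file layout (a single line with no padding around the commas) is inferred from the description rather than confirmed.

```shell
# Print the database, schema, and index recorded in backup.info.
# $1 = path to the backup directory on the shared file system
# (read_backup_info is an illustrative helper, not a GPText utility.)
read_backup_info() {
    [ -r "$1/backup.info" ] || return 1
    # Split the first line on commas; tolerate a missing trailing newline.
    IFS=, read -r db schema index < "$1/backup.info" || true
    printf 'database=%s schema=%s index=%s\n' "$db" "$schema" "$index"
}
```

For a backup created with -p /mnt/nas -n twitter, the function would be called as read_backup_info /mnt/nas/twitter.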

If the backup fails, for example because there is insufficient disk space, an error message is displayed, but the backup directory is not removed. Be sure to remove the backup directory before restarting the backup.

The following example backs up the demo.twitter.message index to the twitter subdirectory on the /mnt/nas share.

$ gptext-backup -i demo.twitter.message -p /mnt/nas -n twitter
20180508:16:34:02:027794 gptext-backup:mdw:gpadmin-[INFO]:-Execute GPText cluster backup.
20180508:16:34:03:027794 gptext-backup:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180508:16:34:03:027794 gptext-backup:mdw:gpadmin-[INFO]:-Validate shared file system.
20180508:16:34:06:027794 gptext-backup:mdw:gpadmin-[INFO]:-Backup index: demo.twitter.message, into shared FS '/mnt/nas', as name: twitter.
20180508:16:34:06:027794 gptext-backup:mdw:gpadmin-[INFO]:-Processing.......
20180508:16:34:08:027794 gptext-backup:mdw:gpadmin-[INFO]:-Index backup successfully.
20180508:16:34:08:027794 gptext-backup:mdw:gpadmin-[INFO]:-Done.


Back Up Configuration Files Only

The gptext-backup option -c (--backup-conf) creates a backup of just the GPText index configuration files from ZooKeeper. You can use the -p (--path) option to specify the directory where the backup is to be created. If you omit the -p option, the backup is created in the master data directory ($MASTER_DATA_DIRECTORY). The configuration files are saved in a directory with a name in the format <index-name>_<timestamp>.

This example creates a backup of the configuration files for the demo.wikipedia.articles index. The files are saved in the default location. The backup name, which you will need to restore the backup, is reported in the output.

$ gptext-backup -c -i demo.wikipedia.articles
20180508:17:17:26:000781 gptext-backup:mdw:gpadmin-[INFO]:-Execute GPText cluster backup.
20180508:17:17:27:000781 gptext-backup:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180508:17:17:29:000781 gptext-backup:mdw:gpadmin-[INFO]:-Recording metadata of index "demo.wikipedia.articles" into "/data/gpmaster/gpseg-1/demo.wikipedia.articles_2018-05-08T17:17:29.197515.json"
20180508:17:17:29:000781 gptext-backup:mdw:gpadmin-[INFO]:-Backing up configuration of index "demo.wikipedia.articles" into "/data/gpmaster/gpseg-1/demo.wikipedia.articles_2018-05-08T17:17:29.197515"
20180508:17:17:30:000781 gptext-backup:mdw:gpadmin-[INFO]:-Done.

gptext-config

Performs GPText configuration tasks:

Edit, append, upload, or list configuration files in ZooKeeper

Revert configuration file changes in ZooKeeper

Edit JVM configuration options

Upload jar files to the GPText home directory

Syntax

gptext-config -h | --help

gptext-config edit [-P <pool>] -f <file_name> -i <index_name> [-r] [-e <command>]

gptext-config list [-P <pool>] -i <index_name> [--recursive]

gptext-config upload [-P <pool>] -l <path/local_file_name> -f <path/zookeeper_file_name> [-i <index_name>]

gptext-config append [-P <pool>] -l <local_append_file> -f <file_name> -i <index_name>

gptext-config jar [-P <pool>] -l <path/jar_file>

gptext-config jvm [-P <pool>] -o <jvm_options>

Parameters

The -f parameter is optional with gptext-config append and gptext-config upload. If you omit -f with the append command, then the local file is appended to a file of the same name in the top-level ZooKeeper directory for the index. If you omit -f with the upload command, then the utility uploads the local file to a file of the same name in the top-level ZooKeeper directory for the index.

-h, --help
Displays a usage message and exits.

-i <index-name>, --index=<index-name>
Name of the index. If the index is for a partitioned table, you must specify the root partition name.

-P <pool>, --pool <pool>
Sets the number of threads to add to the worker pool. If not specified, the gptext-config utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-config makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-f <filename>, --file=<filename>
The name of a ZooKeeper configuration file to edit, append, or upload. The -i option must be included to specify the index. The following files are supported:

solrconfig.xml – Contains most of the parameters for configuring Solr itself (see Configuring solrconfig.xml at the Apache Solr website).

schema.xml – Defines the analyzer chains that Solr uses for various different types of search fields (see Setting up Text Analyzer Chains).

stopwords.txt – Lists words you want to eliminate from the final index. You can also edit language-specific stop words by specifying a file name in the format stopwords_language_code.txt, where language_code is a two-character code such as en, fr, or es.

protwords.txt – Lists protected words that you do not want to be modified by the analyzer chain. For example, <iPhone>.

synonyms.txt – Lists words that you want replaced by synonyms in the analyzer chain.

emoticons.txt – Defines emoticons for the text_sm social media analyzer chain (see gptext-start).

currency.txt – Defines exchange rates between one currency and another (see Working with Currencies and Exchange Rates at the Apache Solr website).

jar_file – The name of a jar file to upload to <GPText_Install_Directory>/lib/.

-e <command>, --editor=<command>
Editor to use. Choices are any editor that takes a file name on the command line as a parameter, for example vi, vim, emacs, or nano. If absent, vi is used.

-l <filename>
The full path of a local file to append or upload to a ZooKeeper configuration file.
gptext-config append appends the named file to a configuration file and distributes the resulting files. This uses the -f and -i parameters. -f names the configuration file to which you want to append the file named (including local path) with the -l parameter.
gptext-config upload uploads the specified local file to ZooKeeper. Specify the destination ZooKeeper file name with the -f option and specify the index name with the -i option. If you omit the -i option you must supply the full path to the file in ZooKeeper with the -f option, for example -f /gptext/configs/demo.wikipedia.articles/managed-schema.
When used with the gptext-config jar command, -l must specify a local jar file to upload to the <GPText_Install_Directory>/lib/ directory.

--recursive
Optional. Use with the gptext-config list command to recursively list all configuration files and directories available in subdirectories. By default, gptext-config list displays only those index configuration files and directory names in the top-level ZooKeeper directory for an index.

-r, --revert <filename>
Revert the named ZooKeeper file to the previous version.

-o "<JVM_Options>"
Modifies JVM options. To ensure that the JVMs are restarted after changing JVM options, restart the GPText cluster using the gptext-stop and gptext-start utilities.

Notes

Use the gptext-config utility to modify, add, or list configuration files for a specified index.

Never edit the template configuration files. If you do, every index you create after editing the templates will be created with your modified versions. Use the gptext-config utility to ensure that you are editing the configuration files for your index, rather than the template configuration files.

gptext-config automatically reindexes after modifying files if the configuration changes require it.

If you use the -f (--file) parameter to edit one of the index configuration files, GPText automatically places the edited file in its proper directory.

To move an index configuration file from the local file system to the index configuration directory in all of the segments, use the upload command and specify the local file with the -l option and the destination ZooKeeper file with the -f (--file) option.

Examples

1. Edit the managed-schema file in index demo.wikipedia.articles, using the vi editor:

gptext-config edit -f managed-schema -i demo.wikipedia.articles -e vi

2. Append the local file stopwords.add to stopwords.txt in index demo.wikipedia.articles:

gptext-config append -l stopwords.add -f stopwords.txt -i demo.wikipedia.articles

3. Revert file managed-schema in index demo.wikipedia.articles after editing it:

gptext-config edit -f managed-schema -i demo.wikipedia.articles -r

4. Upload the local file custom.txt to the ZooKeeper file custom.conf in index demo.wikipedia.articles:

gptext-config upload -l custom.txt -f custom.conf -i demo.wikipedia.articles

5. List all available configuration files for the index demo.wikipedia.articles:

gptext-config list -i demo.wikipedia.articles --recursive

6. Upload the jar file text.jar to the lib directory in the GPText home directory:

gptext-config jar -l text.jar

7. Set JVM options:

gptext-config jvm -o "-Xms256M -Xmx400M"
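Because JVM option changes take effect only after the cluster is restarted, it can help to lay out the full sequence before running it. The sketch below only prints the commands and executes nothing. The function name jvm_change_plan is hypothetical, and showing gptext-stop and gptext-start without options is an assumption; check each utility's own reference for the options you need.

```shell
# Print the commands for changing JVM options and restarting the cluster.
# Nothing is executed; review the output and run the commands yourself.
# $1 = JVM options string, for example "-Xms256M -Xmx400M"
jvm_change_plan() {
    printf 'gptext-config jvm -o "%s"\n' "$1"
    printf 'gptext-stop\n'
    printf 'gptext-start\n'
}
```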

gptext-expand

Expands a GPText cluster by adding new GPText nodes to existing hosts in a GPText cluster or to hosts added by the Greenplum Database gpexpand management utility. Replicas for indexes created after the new GPText nodes are added will be distributed across the new and existing nodes. Documents must be reindexed to rebalance replicas on existing hosts or, after expanding the Greenplum cluster, to redistribute the index to new shards.

Syntax

gptext-expand -h

gptext-expand [-P <pool>] -e -p <paths> [-d <database>] [-v]

gptext-expand [-P <pool>] -H <new-hosts> [-d <database>] [-v]

Parameters

-h, --help
Displays a usage message and exits.

-P <pool>, --pool <pool>
Sets the number of threads to add to the worker pool. If not specified, the gptext-expand utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-expand makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-e, --existing
Adds GPText nodes to existing hosts in the GPText cluster. The -p option must also be supplied to specify the data directories for the new nodes.

-p, --expand_paths
Specifies paths to directories where the new GPText nodes' data directories are to be created. These directories should be parallel to the Greenplum Database segment data directories. If there is more than one directory, place them in a comma-delimited list, for example -p /data1/nodes,/data1/nodes,/data2/nodes. Required when expanding on existing hosts.

-H, --new_hosts
Specifies the new hosts on which GPText is to be installed. Place multiple host names in a comma-delimited list, for example -H host1,host2,host3. See Notes for requirements for new hosts.

-d, --database
Specifies the name of a database containing the GPText schema. If the gptext-expand utility fails to find a database containing the GPText schema because the user cannot access a database, use this option to specify an accessible database that contains the GPText schema.

-v, --verbose
Displays debug output.

Notes

The -p and -d options cannot be used together.

When new hosts are added to the Greenplum Database cluster, ensure that the following GPText prerequisites are installed before running gpexpand:

Java 1.8
Python 2.6 or greater
Linux lsof utility

All hosts in the cluster must be able to reach the new and existing hosts.

Existing replicas are not automatically redistributed. To rebalance replicas among the expanded GPText cluster, you must reindex.

When expanding to new hosts, you must reindex to redistribute the index among existing and new shards.
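A quick way to check the prerequisite list on a new host is to test whether each command is on the PATH. The following is a minimal sketch, not a GPText feature: the function name missing_tools is hypothetical, and it checks only that the commands are present, not that Java is version 1.8 or that Python is 2.6 or greater.

```shell
# Print each named command that is not found on the PATH of this host.
# Returns 0 when every command is present, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example, to be run on each new host:
#   missing_tools java python lsof
```

Version checks still need to be done separately, for example by inspecting the output of each tool's own version option.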

gptext-external

Manages configurations in ZooKeeper for external document sources.

Syntax

gptext-external -h

gptext-external [-P <pool>] upload -t <type> -c <config-name> -p <config-dir> [-d <database>] [-v]

gptext-external [-P <pool>] list -t <type> [-d <database>] [-v]

gptext-external [-P <pool>] delete -t <type> -c <config-name> [-d <database>] [-v]

Parameters

-h, --help
Displays a usage message and exits.

-P <pool>, --pool <pool>
Sets the number of threads to add to the worker pool. If not specified, the gptext-external utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-external makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-t, --type
Specifies the type of the external document source. The valid types are 'ftp', 'hdfs', or 's3'.

-c, --cluster
A name for this external document source configuration. Use this name to reference the configuration in the gptext.external_login() function.

-p, --path
Specifies the path to a directory containing the configuration files to upload to ZooKeeper. See the Notes section for a list of the required configuration files.

-d, --database
Specifies the name of a database containing the GPText schema. If the gptext-external utility fails to find a database containing the GPText schema because the user cannot access a database, use this option to specify an accessible database that contains the GPText schema.

-v, --verbose
Displays debug output.

Notes

To index documents stored in an external document source that requires authentication, such as a Hadoop file system (hdfs), ftp server, or Amazon S3, you first upload the configuration and authentication files GPText needs to connect to the document source. Assemble the files in a local directory and then use the gptext-external upload command to upload the contents of the directory to ZooKeeper.

ftp

For ftp connections, create a local directory containing a single file, login.txt.

The login.txt file has three lines:

Line 1: The name of the user to log in to the ftp server.

Line 2: The user's password. Enter the clear-text password in this file. The password is base64-encoded when GPText stores it in ZooKeeper.

Line 3: The maximum number of connections to create to the FTP server, an integer. If this line is omitted, each GPText node connects to the FTP server. If the number of connections exceeds the server's maximum allowed connections, GPText FTP connections will fail.
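Since login.txt holds a clear-text password, it is worth creating the file with restrictive permissions. The sketch below writes the three-line layout described above. The function name write_ftp_login and the example values are hypothetical, and the choice of permissions is a suggestion rather than a GPText requirement.

```shell
# Write login.txt for an ftp document source into a private directory.
# $1 = directory, $2 = ftp user, $3 = clear-text password,
# $4 = optional maximum number of connections (integer)
write_ftp_login() {
    mkdir -p "$1" || return 1
    chmod 700 "$1" || return 1
    (
        umask 077   # login.txt holds a clear-text password
        printf '%s\n%s\n' "$2" "$3" > "$1/login.txt" || exit 1
        if [ -n "${4:-}" ]; then
            printf '%s\n' "$4" >> "$1/login.txt" || exit 1
        fi
    )
}
```

After a call such as write_ftp_login ./ftpconf alice secret 4, the directory can be passed to gptext-external upload with -t ftp and then removed.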

hdfs

For hdfs connections, create a local directory containing the following files.

The core-site.xml and hdfs-site.xml configuration files from the Hadoop server.

A file named user.txt. This file contains a single line identifying the user name to use to log in to Hadoop. The user must have read permission in hdfs for the documents you want to index. If Kerberos is enabled in the Hadoop cluster, the user name is the name of the Kerberos principal for the user.

If the Hadoop cluster is secured with Kerberos, also include the user's keytab file and the krb5.conf file for the Kerberos realm.

s3

For s3 connections, create a local directory containing a single file, credential. Add two lines to the credential file:

Line 1: The AWS access key id.

Line 2: The AWS secret access key.


Upload the configuration files with the gptext-external upload command:

$ gptext-external upload -t <type> -c <config-name> -p <config-dir>

To protect the user's credentials, you can delete the local configuration directory after you upload it.

To make changes to configuration files, edit the configuration files in the local directory and upload the directory again.

Run gptext-external list -t <type> to list configurations of the specified type.

Run gptext-external delete -t <type> -c <config-name> to remove the configuration from ZooKeeper.

gptext-installsql

Installs or removes the gptext schema and user-defined functions in databases.

Syntax

gptext-installsql -h

gptext-installsql [-c] [-v] <db_name> [<db2_name> ...]

Parameters

-c, --clean
Removes the gptext schema and UDFs from the specified databases.

-h, --help
Displays a usage message and exits.

-v, --verbose
Displays debug output.

Notes

The gptext schema is reserved for use by GPText. The gptext-installsql utility drops and recreates the schema. If you add any database objects to the schema they will be lost when you reinstall the schema or upgrade the GPText system.

The gptext schema cannot be installed in the system databases postgres, template0, or template1.

Examples

1. Install GPText UDFs in the gpadmin and demo databases.

$ gptext-installsql gpadmin demo
20170927:11:06:11:024015 gptext-installsql:gpdb:gpadmin-[INFO]:-Install GPText udf...
20170927:11:06:11:024015 gptext-installsql:gpdb:gpadmin-[INFO]:-Creating 'gptext' schema and UDFs in database gpadmin...
20170927:11:06:11:024015 gptext-installsql:gpdb:gpadmin-[INFO]:-Creating 'gptext' schema and UDFs in database demo...
20170927:11:06:12:024015 gptext-installsql:gpdb:gpadmin-[INFO]:-Validating gptext installation
20170927:11:06:12:024015 gptext-installsql:gpdb:gpadmin-[INFO]:-Done.


2. Delete GPText UDFs in database gpadmin.

$ gptext-installsql --clean gpadmin
20170927:11:10:34:024325 gptext-installsql:gpdb:gpadmin-[INFO]:-Clean GPText udf...
20170927:11:10:35:024325 gptext-installsql:gpdb:gpadmin-[INFO]:-Connecting to database gpadmin
20170927:11:10:35:024325 gptext-installsql:gpdb:gpadmin-[INFO]:-Dropping 'gptext' schema and UDFs...
20170927:11:10:35:024325 gptext-installsql:gpdb:gpadmin-[INFO]:-Validating clean operation
20170927:11:10:35:024325 gptext-installsql:gpdb:gpadmin-[INFO]:-Done.

gptext-migrator

Migrates the current GPText system into an upgraded Greenplum Database cluster.

Syntax

gptext-migrator [-h | --help]

gptext-migrator [-P <pool>] [-v | --verbose]

Notes

The gptext-migrator utility relocates the current GPText system to a new Greenplum Database release.

The utility determines the destination Greenplum Database release from the environment. If the GPText system has already been migrated, or if the destination Greenplum release is unsupported, gptext-migrator outputs a message and quits.

If you are upgrading GPText and Greenplum Database at the same time, complete the Greenplum Database upgrade first, and then use gptext-migrator to add the current GPText version to the new Greenplum Database installation. Finally, use gptext-upgrade to upgrade the system to the new GPText version.

Parameters

-h, --help
Displays a usage message and exits.

-P <pool>, --pool <pool>
Sets the number of threads to add to the worker pool. If not specified, the gptext-migrator utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-migrator makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-v, --verbose
Displays debug output.

gptext-recover

Recovers GPText nodes.

Syntax

gptext-recover -h

gptext-recover [-P <pool>] -f [-v]

gptext-recover [-P <pool>] -H <new-host1>,<new-host2>,... [-v]

gptext-recover [-P <pool>] -r [-v]

Parameters

-h, --help
Displays a usage message and exits.

-P <pool>, --pool <pool>
Sets the number of threads to add to the worker pool. If not specified, the gptext-recover utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-recover makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-f, --force
Forces recovery for any GPText nodes that are down. If the node is unrecoverable, deletes the node, creates a new node, and recreates replicas.

-H, --new-hosts
Recover down nodes on new hosts, for example "host1,host2". The number of new hosts must be equal to the number of failed hosts.

-r, --index_replicas
Recover replicas, but do not recover any down nodes.

-v, --verbose
Displays debug output.

Notes

The -f and -H options cannot be used at the same time.

If shards are down, gptext-recover advises you to reindex.

If no shards are down, gptext-recover restores any replicas that are down.

The -f option forces Solr to drop any nodes that are down and create new ones on the same hosts. This option should be used only if you are unable to restart nodes using gptext-start [-s]. The -f option requires that most of the nodes and replicas are healthy. If any index is in a red state (see gptext-state), gptext-recover -f will display a message and exit.

If any GPText nodes recovered using the -f or -H options fail to start, the replicas cannot be recovered. If this should happen, resolve the startup problem with the newly created nodes, and then recover the replicas using the gptext-recover -r option. It is important to recover replicas when all GPText nodes are healthy so that replicas will be distributed evenly among the nodes.
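The -H option requires as many new hosts as there are failed hosts, which is easy to verify before calling the utility. The following sketch is illustrative only: the function names and the host names in the usage note are hypothetical, and the list of failed hosts is something you supply yourself (for example from the output of gptext-state).

```shell
# Count the entries in a comma-delimited host list.
count_hosts() {
    [ -n "$1" ] || { echo 0; return; }
    printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Succeed only when the replacement list is as long as the failed list.
# $1 = failed hosts, $2 = new hosts (both comma-delimited)
same_host_count() {
    [ "$(count_hosts "$1")" -eq "$(count_hosts "$2")" ]
}
```

A guard such as same_host_count "sdw1,sdw2" "sdw3,sdw4" can be placed in front of a gptext-recover -H sdw3,sdw4 call in an operations script.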

gptext-replica

Adds or deletes a replica for an index shard.

Syntax

gptext-replica -h

gptext-replica add -i <index-name> -s <shard> [-n <node>]

gptext-replica drop -i <index-name> -s <shard> -r <replica> [-o]

Parameters

-h, --help
Displays a usage message and exits.

-i <index-name>, --index=<index-name>
Name of the index.

-f <filename>, --file=<filename>
The name of a file to edit, append, or upload. The -i option must be included to specify the index. The following files are supported:

solrconfig.xml – Contains most of the parameters for configuring Solr itself (see SolrConfigXml).

schema.xml – Defines the analyzer chains that Solr uses for various different types of search fields (see Setting up Text Analyzer Chains).

stopwords.txt – Lists words you want to eliminate from the final index. You can also edit language-specific stop words by specifying a file name in the format stopwords_language_code.txt, where language_code is a two-character code such as en, fr, or es.

protwords.txt – Lists protected words that you do not want to be modified by the analyzer chain. For example, <iPhone>.

synonyms.txt – Lists words that you want replaced by synonyms in the analyzer chain.

emoticons.txt – Defines emoticons for the text_sm social media analyzer chain. See gptext-start.

currency.txt – Defines exchange rates between one currency and another (see Working with Currencies and Exchange Rates at the Apache Solr website).

jar_file – The name of a jar file to upload to <GPText_Install_Directory>/lib/.

-e <command>, --editor=<command>
Editor to use. Choices are any editor that takes a file name on the command line as a parameter, for example vi, vim, emacs, or nano. If absent, vi is used.

-a <filename>, --append=<filename>
Appends a named file to a configuration file and distributes the resulting files. Requires the -f and -i parameters. -f names the configuration file to which you want to append the file named (including local path) with the -a parameter.

-r, --revert <filename>
Revert named file to previous version.

-i <index>, --index=<index>
Required. The name of the index.

-s <shard>, --shard=<shard>
Required. The name of the shard to add a replica to.

-n <node>, --node=<node>
Optional. The node where the replica is to be added.

-r <replica>, --replica=<replica>
Required for the drop command only. The name of the replica to drop.

-o, --onlyifdown
Optional. Used only with the drop command. Only drop the replica if it is down.

Notes

To find the name of a replica to drop, check gptext.index_status(). The name is core_nodeX where X is a number.

Examples

1. Add a replica for index demo.wikipedia.articles in shard shard0, on node node1.

gptext-replica add -i demo.wikipedia.articles -s shard0 -n node1

2. Drop the replica named core_node3 for index demo.wikipedia.articles in shard shard0 if the replica is down.

gptext-replica drop -i demo.wikipedia.articles -s shard0 -r core_node3 -o
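A script that drops replicas can check that a value copied from gptext.index_status() has the core_nodeX shape before passing it to -r. This is a small sketch with a hypothetical function name. It validates only the form of the name, not whether such a replica exists.

```shell
# Succeed when the argument has the form core_node<N>, N being digits.
is_replica_name() {
    case "$1" in
        core_node*[!0-9]* | core_node) return 1 ;;  # non-digit suffix or no number
        core_node[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}
```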

gptext-restore

Restores a GPText index from a backup saved to local storage on the Greenplum Database cluster or to a shared file system mounted on all Greenplum Database cluster hosts.

Syntax

gptext-restore -h

gptext-restore [-P <pool>] -c -p <path> [-v]

gptext-restore [-P <pool>] local -p <backup-name> [-v]

Parameters

-h, --help
Displays a usage message and exits.

-P <pool>, --pool <pool>
Sets the number of threads to add to the worker pool. If not specified, the gptext-restore utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-restore makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-c
Restore index configuration and create an empty index.

local
Restore an index that was backed up to local GPText cluster storage. If the local keyword is not included, the index is restored from a shared file system mounted on all hosts.

-p <path>, --path <path>
The path to the backup directory on each host.

Notes

Use the gptext-restore utility to restore a GPText index backup. You can restore the backup to a new GPText system, or you can restore the backup to the same system in order to recover from a corrupted GPText index. With the -c option, you can restore the configuration files and create an empty index without restoring the index data from the backup.

The index you are restoring must not exist. The gptext-restore utility creates a new index and reloads the backed up data into it. If you are restoring in order to repair a corrupted index, you must first delete the existing index with the gptext.drop_index() function. If the index you want to restore exists, gptext-restore outputs an error and quits.

Restore From Local GPText Cluster Storage

Use the gptext-restore local command to restore a GPText index from local storage. Supply the path to the backup directory on the master host using the --path (-p) option. The argument to the --path option is the path to the backup directory that was created with gptext-backup, including the timestamp.
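Before starting a local restore, it can be useful to confirm that the backup directory and its metadata file are both present on the master host. The sketch below relies on the layout described for gptext-backup local (a <backup-name> directory next to a <backup-name>.json file). The function name local_backup_present is hypothetical.

```shell
# Check that a local backup on the master host has both parts:
# the <backup-name> directory and the <backup-name>.json metadata file.
# $1 = path to the backup directory, including the timestamp
local_backup_present() {
    [ -d "$1" ] && [ -f "$1.json" ]
}
```

Quoting the path, as in local_backup_present "gptext-backups/demo.store.products_2018-05-11T10:17:54.493448", avoids having to backslash-escape the colons in the timestamp.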

The following example restores a backup that was created using this gptext-backup command: gptext-backup local -i demo.store.products -p gptext-backups


$ gptext-restore local -p gptext-backups/demo.store.products_2018-05-11T10\:17\:54.493448
20180511:11:04:31:026221 gptext-restore:mdw:gpadmin-[INFO]:-Execute GPText cluster restore.
20180511:11:04:32:026221 gptext-restore:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180511:11:04:32:026221 gptext-restore:mdw:gpadmin-[INFO]:-Reading metadata from file /home/gpadmin/gptext-backups/demo.store.products_2018-05-11T10:17:54.493448.json...
20180511:11:04:32:026221 gptext-restore:mdw:gpadmin-[INFO]:-Executing restore...
20180511:11:04:32:026221 gptext-restore:mdw:gpadmin-[INFO]:-Creating index demo.store.products...
20180511:11:04:35:026221 gptext-restore:mdw:gpadmin-[INFO]:-Add replica into shard shard2 for index demo.store.products.
20180511:11:04:35:026221 gptext-restore:mdw:gpadmin-[INFO]:-Processing......
20180511:11:04:37:026221 gptext-restore:mdw:gpadmin-[INFO]:-The replica is added, data recovering....
20180511:11:04:37:026221 gptext-restore:mdw:gpadmin-[INFO]:-Data recovered, replica becomes active....
20180511:11:04:38:026221 gptext-restore:mdw:gpadmin-[INFO]:-Restoring replica demo.store.products_shard2_replica1 from backup demo.store.products_shard2_2018-05-11T10:17:54.493448...
20180511:11:04:38:026221 gptext-restore:mdw:gpadmin-[INFO]:-Add replica into shard shard3 for index demo.store.products.
20180511:11:04:38:026221 gptext-restore:mdw:gpadmin-[INFO]:-Processing......
20180511:11:04:40:026221 gptext-restore:mdw:gpadmin-[INFO]:-The replica is added, data recovering....
20180511:11:04:40:026221 gptext-restore:mdw:gpadmin-[INFO]:-Data recovered, replica becomes active....
20180511:11:04:41:026221 gptext-restore:mdw:gpadmin-[INFO]:-Restoring replica demo.store.products_shard3_replica1 from backup demo.store.products_shard3_2018-05-11T10:17:54.493448...
20180511:11:04:41:026221 gptext-restore:mdw:gpadmin-[INFO]:-Add replica into shard shard0 for index demo.store.products.
20180511:11:04:41:026221 gptext-restore:mdw:gpadmin-[INFO]:-Processing......
20180511:11:04:43:026221 gptext-restore:mdw:gpadmin-[INFO]:-The replica is added, data recovering....
20180511:11:04:43:026221 gptext-restore:mdw:gpadmin-[INFO]:-Data recovered, replica becomes active....
20180511:11:04:44:026221 gptext-restore:mdw:gpadmin-[INFO]:-Restoring replica demo.store.products_shard0_replica1 from backup demo.store.products_shard0_2018-05-11T10:17:54.493448...
20180511:11:04:44:026221 gptext-restore:mdw:gpadmin-[INFO]:-Add replica into shard shard1 for index demo.store.products.
20180511:11:04:44:026221 gptext-restore:mdw:gpadmin-[INFO]:-Processing......
20180511:11:04:46:026221 gptext-restore:mdw:gpadmin-[INFO]:-The replica is added, data recovering....
20180511:11:04:47:026221 gptext-restore:mdw:gpadmin-[INFO]:-Data recovered, replica becomes active....
20180511:11:04:47:026221 gptext-restore:mdw:gpadmin-[INFO]:-Restoring replica demo.store.products_shard1_replica1 from backup demo.store.products_shard1_2018-05-11T10:17:54.493448...
20180511:11:04:47:026221 gptext-restore:mdw:gpadmin-[INFO]:-Processing......
20180511:11:04:48:026221 gptext-restore:mdw:gpadmin-[INFO]:-Adding 1 replica(s) to demo.store.products_shard2...
20180511:11:04:48:026221 gptext-restore:mdw:gpadmin-[INFO]:-Add replica into shard shard2 for index demo.store.products.
20180511:11:04:48:026221 gptext-restore:mdw:gpadmin-[INFO]:-Processing......
20180511:11:04:51:026221 gptext-restore:mdw:gpadmin-[INFO]:-The replica is added, data recovering........
20180511:11:04:55:026221 gptext-restore:mdw:gpadmin-[INFO]:-Data recovered, replica becomes active....
20180511:11:04:55:026221 gptext-restore:mdw:gpadmin-[INFO]:-Adding 1 replica(s) to demo.store.products_shard3...
20180511:11:04:55:026221 gptext-restore:mdw:gpadmin-[INFO]:-Add replica into shard shard3 for index demo.store.products.
20180511:11:04:55:026221 gptext-restore:mdw:gpadmin-[INFO]:-Processing......
20180511:11:04:57:026221 gptext-restore:mdw:gpadmin-[INFO]:-The replica is added, data recovering........
20180511:11:05:01:026221 gptext-restore:mdw:gpadmin-[INFO]:-Data recovered, replica becomes active....
20180511:11:05:02:026221 gptext-restore:mdw:gpadmin-[INFO]:-Adding 1 replica(s) to demo.store.products_shard0...
20180511:11:05:02:026
221gptext-restore:mdw:gpadmin-[INFO]:-Addreplicaintoshardshard0forindexdemo.store.products.20180511:11:05:02:026221gptext-restore:mdw:gpadmin-[INFO]:-Processing......20180511:11:05:05:026221gptext-restore:mdw:gpadmin-[INFO]:-Thereplicaisadded,datarecovering........20180511:11:05:09:026221gptext-restore:mdw:gpadmin-[INFO]:-Datarecovered,replicabecomesactive....20180511:11:05:09:026221gptext-restore:mdw:gpadmin-[INFO]:-Adding1replica(s)todemo.store.products_shard1...20180511:11:05:09:026221gptext-restore:mdw:gpadmin-[INFO]:-Addreplicaintoshardshard1forindexdemo.store.products.20180511:11:05:09:026221gptext-restore:mdw:gpadmin-[INFO]:-Processing......20180511:11:05:12:026221gptext-restore:mdw:gpadmin-[INFO]:-Thereplicaisadded,datarecovering........20180511:11:05:16:026221gptext-restore:mdw:gpadmin-[INFO]:-Datarecovered,replicabecomesactive....20180511:11:05:16:026221gptext-restore:mdw:gpadmin-[INFO]:-Done.

This example restores the configuration files and creates the GPText index without reloading the data. Note that the local keyword is omitted.


$ gptext-restore -c -p gptext-backups/demo.store.products_2018-05-11T10\:17\:54.493448
20180511:11:16:50:028171 gptext-restore:mdw:gpadmin-[INFO]:-Execute GPText cluster restore.
20180511:11:16:51:028171 gptext-restore:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180511:11:16:51:028171 gptext-restore:mdw:gpadmin-[INFO]:-Reading metadata from file /home/gpadmin/gptext-backups/demo.store.products_2018-05-11T10:17:54.493448.json...
20180511:11:16:51:028171 gptext-restore:mdw:gpadmin-[INFO]:-Executing restore...
20180511:11:16:51:028171 gptext-restore:mdw:gpadmin-[INFO]:-Creating index demo.store.products...
20180511:11:16:59:028171 gptext-restore:mdw:gpadmin-[INFO]:-Done.

Restore From a Shared File System

Use the --path option to restore a backup from a shared file system. The shared file system must be mounted on all hosts with GPText nodes and must be readable by the gpadmin user. Each host in the cluster must be able to access the file system.

The GPText index to restore must not already exist.

The following example restores the demo.twitter.message index from a shared file system mounted on each host at /mnt/nas. The backup was created with the name twitter, so the backup files were saved in the /mnt/nas/twitter directory.

$ gptext-restore --path /mnt/nas/twitter
20180510:17:22:46:008054 gptext-restore:mdw:gpadmin-[INFO]:-Execute GPText cluster restore.
20180510:17:22:48:008054 gptext-restore:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180510:17:22:48:008054 gptext-restore:mdw:gpadmin-[INFO]:-Validate shared file system.
20180510:17:22:50:008054 gptext-restore:mdw:gpadmin-[INFO]:-Restore index: demo.twitter.message, from shared FS '/mnt/nas', backup name: twitter.
20180510:17:22:50:008054 gptext-restore:mdw:gpadmin-[INFO]:-Processing.......................
20180510:17:23:10:008054 gptext-restore:mdw:gpadmin-[INFO]:-Checking leader replicas of collection demo.twitter.message..........
20180510:17:23:16:008054 gptext-restore:mdw:gpadmin-[INFO]:-Validate replica state........
20180510:17:23:19:008054 gptext-restore:mdw:gpadmin-[INFO]:-Index restore successfully.
20180510:17:23:19:008054 gptext-restore:mdw:gpadmin-[INFO]:-Done.
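The shared file system requirements above can be checked before starting a restore. The following is a minimal Python sketch, not part of GPText: the helper names backup_dir and readable_dir are hypothetical, and the check covers only the host where it runs, so it would have to be repeated on every host with GPText nodes, as the gpadmin user.

```python
import os
import posixpath


def backup_dir(mount_point, backup_name):
    """Join the mount point and the backup name, e.g. /mnt/nas + twitter."""
    return posixpath.join(mount_point, backup_name)


def readable_dir(path):
    """True if path is a directory the current user can list and read."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    path = backup_dir("/mnt/nas", "twitter")
    print(path, "readable" if readable_dir(path) else "not readable on this host")
```

The resulting path is the value that would be passed to the --path option.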

gptext-start

Starts or restarts the GPText cluster.

Syntax

gptext-start -h

gptext-start [-P <pool>] [-r] [-s] [-v]

Parameters

Parameter Description

-h | --help   Displays a usage message and exits.

-P <pool> | --pool <pool>   Sets the number of threads to add to the worker pool. If not specified, the gptext-start utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-start makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-r | --restart   Restarts the GPText cluster.

-s | --slow_start   Restarts the GPText cluster by starting nodes one at a time.

-v | --verbose   Displays debug output.

Notes

The gptext-start -r command calls the solr restart command to stop and restart all of the Solr instances in the cluster. The GPText utility determines if the processes are running before it completes, but it cannot verify that all of the Solr processes were stopped. If it is important to be certain that Solr processes were stopped, for example if you have changed the JVM options, use gptext-stop followed by gptext-start instead of gptext-start -r.

The -s (--slow_start) option is recommended if you have a large number of indexes. By default, when a Solr cluster starts, all of the cluster's nodes are started at once. With a large number of indexes, the number of initial ZooKeeper requests can result in timeout errors and possibly prevent the cluster from starting in a clean state. With the -s option, GPText performs a rolling start, starting nodes one at a time, to reduce ZooKeeper contention and allow a more stable startup. If you have more than 50 indexes and do not specify the -s option, gptext-start displays a warning message and requires you to confirm. With the -s option, gptext-start does not return until all nodes have been started; without the -s option, the gptext-start command returns immediately.
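The two notes above amount to a small decision rule, which can be sketched in Python. This is an illustration only: plan_start is a hypothetical helper, not a GPText API, and it merely assembles the command lines described in this section (gptext-start, its -s and -r options, and gptext-stop). The threshold of 50 indexes is the point at which gptext-start asks for confirmation.

```python
SLOW_START_THRESHOLD = 50  # more than 50 indexes triggers the gptext-start warning


def plan_start(num_indexes, restart=False, jvm_options_changed=False):
    """Return the commands to run, in order, as a list of argv lists."""
    start = ["gptext-start"]
    if num_indexes > SLOW_START_THRESHOLD:
        start.append("-s")  # rolling start, one node at a time
    if not restart:
        return [start]
    if jvm_options_changed:
        # gptext-start -r cannot verify that Solr processes stopped,
        # so stop explicitly and then start.
        return [["gptext-stop"], start]
    return [start + ["-r"]]


if __name__ == "__main__":
    for argv in plan_start(120, restart=True, jvm_options_changed=True):
        print(" ".join(argv))
```

Keeping the plan as data (argv lists) makes it possible to review the commands before passing them to a process runner.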

Examples

1. Start the GPText cluster.

gptext-start

2. Restart the GPText cluster.

gptext-start -r

gptext-state

Displays the state of the GPText cluster and GPText indexes.

Syntax

gptext-state -h

gptext-state [-P <pool>] [-d <db-name>] [-D] [-v]

gptext-state [-P <pool>] -i <index-name> [-d <db-name>] [-c <col1,...>] [-v]

gptext-state [-P <pool>] list [-d <db-name>] [-v]

gptext-state [-P <pool>] healthcheck [-d <db-name>] [-f <percent>] [-v]

gptext-state [-P <pool>] configs

gptext-state [-P <pool>] stats [-i <index-name>]

Parameters

Parameter Description

-h | --help   Displays a usage message and exits.

-P <pool> | --pool <pool>   Sets the number of threads to add to the worker pool. If not specified, the gptext-state utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-state makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-d <db-name> | --database=<db-name>   The name of a database containing the GPText schema. gptext-state searches all databases for the functions it needs to run. If the user does not have access permission to the database it begins with, it fails. In this case, use the --database= parameter to specify an accessible database to search.

-D | --details   Lists the status of each GPText index. When omitted, gptext-state lists counts of the indexes with Green, Yellow, and Red statuses.

-i <index-name> | --index=<index-name>   The name of an index. Displays statistics for the specified index. If the <index-name> is a root or child partition, displays any parent or child partitions. This option cannot be used with the list or healthcheck subcommands.

-c <column-list> | --stats_columns=<column-list>   Used with the -i or --index option, specifies a comma-separated list of statistics to display. The list may contain replication_factor, max_shards_per_node, num_docs, size_in_bytes, and last_modified. If no -c or --stats_columns option is supplied, all five statistics are displayed.

-f <diskfree> | --disk_free=<diskfree>   Used with the healthcheck command, specifies the percentage of free disk space required per host to report a healthy GPText cluster. The default is 10.

Notes

All parameters are optional, except that -i (--index) is required when you specify -c (--stats_columns).

If you specify a subpartition name with the -i option, gptext-state displays the name of the parent table or partition from which the partition inherits. If you specify the name of a table or partition with child partitions, gptext-state lists them.

When executed with no arguments, gptext-state displays the GPText version and counts of indexes in the Green, Yellow, and Red states.

A Green state means that all shards and replicas are healthy.

A Yellow state means that all shards are available, but one or more replicas are down.

A Red state means that one or more shards are down.

With the -D (--details) option specified, gptext-state lists all GPText indexes with the columns database, index_name, and state. The state column displays the status of the index as Green, Yellow, or Red.

The gptext-state configs command shows the host names, ports, and data directories for the Solr and ZooKeeper nodes. It also shows the configured minimum and maximum JVM memory sizes for GPText nodes.

If any index has a Yellow or Red status, gptext-state returns a non-zero value.
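The Green, Yellow, and Red definitions can be restated as a short Python sketch. It is not how gptext-state is implemented; it only encodes the three rules above under one stated assumption: a shard counts as down when none of its replicas is up. The function names and the input shape (shard name mapped to a list of replica up/down flags) are hypothetical, and the value 1 stands in for the unspecified non-zero return value.

```python
def index_state(shards):
    """shards maps a shard name to a list of booleans, one per replica (True = up)."""
    if any(not any(replicas) for replicas in shards.values()):
        return "Red"     # assumption: a shard with no live replica is down
    if any(not all(replicas) for replicas in shards.values()):
        return "Yellow"  # every shard is available, but some replica is down
    return "Green"       # all shards and replicas are healthy


def exit_status(states):
    """Zero only when every index is Green; 1 is a placeholder non-zero value."""
    return 0 if all(state == "Green" for state in states) else 1


if __name__ == "__main__":
    print(index_state({"shard0": [True, True], "shard1": [True, False]}))
```

Because the utility returns a non-zero value for any Yellow or Red index, a monitoring script can rely on the exit status alone and parse the -D output only when it needs to name the affected indexes.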

Examples

1. Show the GPText cluster state.

$ gptext-state
20161216:14:01:32:029224 gptext-state:gpsne:gpadmin-[INFO]:-Check zookeeper cluster state...
20161216:14:01:32:029224 gptext-state:gpsne:gpadmin-[INFO]:-Check GPText cluster status...
20161216:14:01:33:029224 gptext-state:gpsne:gpadmin-[INFO]:-Current GPText Version: 2.0.0
20161216:14:01:33:029224 gptext-state:gpsne:gpadmin-[INFO]:-All nodes are up and running.
20161216:14:01:34:029224 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20161216:14:01:34:029224 gptext-state:gpsne:gpadmin-[INFO]:-Index state.
20161216:14:01:34:029224 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20161216:14:01:34:029224 gptext-state:gpsne:gpadmin-[INFO]:-state   index count
20161216:14:01:34:029224 gptext-state:gpsne:gpadmin-[INFO]:-Green   4

2. Show the GPText cluster state with details, specifying demo as a database containing the GPText schema.

$ gptext-state -D -d demo
20170929:15:18:21:000872 gptext-state:gpdb:gpadmin-[INFO]:-Execute GPText state...
20170929:15:18:21:000872 gptext-state:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state...
20170929:15:18:21:000872 gptext-state:gpdb:gpadmin-[INFO]:-Check GPText cluster status...
20170929:15:18:21:000872 gptext-state:gpdb:gpadmin-[INFO]:-Current GPText Version: 2.1.3
20170929:15:18:21:000872 gptext-state:gpdb:gpadmin-[INFO]:-All nodes are up and running.
20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:------------------------------------------------
20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:-Index state details.
20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:------------------------------------------------
20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:-database   index name   state
20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:-demo   demo.twitter.message   Green
20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:-demo   demo.wikipedia.articles   Green
20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:-Done.

3. Show the GPText cluster configuration.

$ gptext-state configs
20181112:12:38:26:018080 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Cluster Configurations.
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-JVM Min|Max   Xms1024M|Xmx2048M
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Node information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Node Name   Port   Solr Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Zookeeper information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Port   Zookeeper Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-mdw   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Done.

4. Show replication_factor and num_docs statistics for the GPText index demo.wikipedia.articles. Specify demo as the database with the GPText schema.

$ gptext-state -i demo.wikipedia.articles -c replication_factor,num_docs -d demo
20170927:13:00:31:030421 gptext-state:gpdb:gpadmin-[INFO]:-Execute GPText state...
20170927:13:00:31:030421 gptext-state:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state...
20170927:13:00:31:030421 gptext-state:gpdb:gpadmin-[INFO]:-Check GPText cluster statistics...
20170927:13:00:33:030421 gptext-state:gpdb:gpadmin-[INFO]:-Replicas Up: 5
20170927:13:00:33:030421 gptext-state:gpdb:gpadmin-[INFO]:------------------------------------------------
20170927:13:00:33:030421 gptext-state:gpdb:gpadmin-[INFO]:-Index demo.wikipedia.articles statistics.
20170927:13:00:33:030421 gptext-state:gpdb:gpadmin-[INFO]:------------------------------------------------
20170927:13:00:33:030421 gptext-state:gpdb:gpadmin-[INFO]:-replication_factor   num_docs
20170927:13:00:33:030421 gptext-state:gpdb:gpadmin-[INFO]:-2   23
20170927:13:00:33:030421 gptext-state:gpdb:gpadmin-[INFO]:-Done.

5. List all indexes.

$ gptext-state list
20170929:15:19:02:001023 gptext-state:gpdb:gpadmin-[INFO]:-Execute GPText state...
20170929:15:19:03:001023 gptext-state:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state...
20170929:15:19:03:001023 gptext-state:gpdb:gpadmin-[INFO]:----------------------------------------------------------
20170929:15:19:03:001023 gptext-state:gpdb:gpadmin-[INFO]:-Index list
20170929:15:19:03:001023 gptext-state:gpdb:gpadmin-[INFO]:----------------------------------------------------------
20170929:15:19:03:001023 gptext-state:gpdb:gpadmin-[INFO]:-demo.twitter.message
20170929:15:19:03:001023 gptext-state:gpdb:gpadmin-[INFO]:-demo.wikipedia.articles
20170929:15:19:03:001023 gptext-state:gpdb:gpadmin-[INFO]:-Done.

6. List configurations.

20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Execute GPText state...
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state...
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Cluster Configurations.
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:----------------------------------------------------------
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-JVM Min|Max   Xms256M|Xmx1024M
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Node information
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:----------------------------------
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Host   Node Name   Port   Solr Dir
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-sdw1   sdw1:18983   18983   /data/primary/solr0
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-sdw2   sdw2:18984   18984   /data/primary/solr1
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Zookeeper information
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:----------------------------------
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Host   Port   Zookeeper Dir
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-smdw   2188   /data/master/zoo0
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-sdw1   2189   /data/master/zoo1
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-sdw2   2190   /data/master/zoo2
20180822:20:38:39:005702 gptext-state:gpdb:gpadmin-[INFO]:-Done.

7. Perform a health check with a 20% free disk requirement.

$ gptext-state healthcheck -f 20 -d demo
20170927:13:03:53:030843 gptext-state:gpdb:gpadmin-[INFO]:-Execute GPText state...
20170927:13:03:53:030843 gptext-state:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state...
20170927:13:03:53:030843 gptext-state:gpdb:gpadmin-[INFO]:-Execute healthcheck on GPText cluster!
20170927:13:03:53:030843 gptext-state:gpdb:gpadmin-[INFO]:-Check GPText binary and utilities version match...
20170927:13:03:53:030843 gptext-state:gpdb:gpadmin-[INFO]:-GOOD
20170927:13:03:53:030843 gptext-state:gpdb:gpadmin-[INFO]:-Check GPText config files...
20170927:13:03:55:030843 gptext-state:gpdb:gpadmin-[INFO]:-GOOD
20170927:13:03:55:030843 gptext-state:gpdb:gpadmin-[INFO]:-Check GPText index status...
20170927:13:03:55:030843 gptext-state:gpdb:gpadmin-[INFO]:-GOOD
20170927:13:03:55:030843 gptext-state:gpdb:gpadmin-[INFO]:-Checking for required disk space...
20170927:13:03:56:030843 gptext-state:gpdb:gpadmin-[INFO]:-GOOD
20170927:13:03:56:030843 gptext-state:gpdb:gpadmin-[INFO]:-Checking for required user privileges...
20170927:13:03:57:030843 gptext-state:gpdb:gpadmin-[INFO]:-GOOD
20170927:13:03:57:030843 gptext-state:gpdb:gpadmin-[INFO]:-Checking for indexes and database consistency...
20170927:13:03:58:030843 gptext-state:gpdb:gpadmin-[INFO]:-GOOD
20170927:13:03:58:030843 gptext-state:gpdb:gpadmin-[INFO]:-Done.

8. Check the status of a partitioned table.

$ gptext-state -i demo.twitter.message
20180615:15:49:33:029252 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state...
20180615:15:49:34:029252 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180615:15:49:34:029252 gptext-state:mdw:gpadmin-[INFO]:-Check GPText cluster statistics...
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-Replicas Up: 8
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-Index demo.twitter.message statistics.
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-replication_factor   max_shards_per_node   num_docs   size in bytes   last_modified
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-241730821125   2018-06-15T20:34:12.660Z
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-Child partition indexes:
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-demo.twitter.message_1_prt_1
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-demo.twitter.message_1_prt_2
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-demo.twitter.message_1_prt_3
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-demo.twitter.message_1_prt_4
20180615:15:49:35:029252 gptext-state:mdw:gpadmin-[INFO]:-Done.

9. List statistics for all GPText indexes.

$ gptext-state stats
20180615:15:52:34:029808 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state...
20180615:15:52:35:029808 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180615:15:52:36:029808 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:15:52:36:029808 gptext-state:mdw:gpadmin-[INFO]:-Index Statistics.
20180615:15:52:36:029808 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:15:52:36:029808 gptext-state:mdw:gpadmin-[INFO]:-index name   num_docs   size in bytes
20180615:15:52:36:029808 gptext-state:mdw:gpadmin-[INFO]:-demo.twitter.message   1730821125
20180615:15:52:36:029808 gptext-state:mdw:gpadmin-[INFO]:-demo.wikipedia.articles   23554240
20180615:15:52:36:029808 gptext-state:mdw:gpadmin-[INFO]:-Done.
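All of the utilities in these examples write log lines with the same prefix: a date, a time, a process ID, then the utility name, host, and user, a level in square brackets, and the message. A script that post-processes this output could split the fields with a regular expression. The Python sketch below is an assumption-laden illustration, not a GPText facility: the pattern is inferred only from the sample output in this section (for instance, a six-digit process ID), and parse_line is a hypothetical name.

```python
import re

# Prefix layout inferred from the examples:
# YYYYMMDD:HH:MM:SS:PID utility:host:user-[LEVEL]:-message
LOG_LINE = re.compile(
    r"^(?P<date>\d{8}):(?P<time>\d{2}:\d{2}:\d{2}):(?P<pid>\d{6})\s*"
    r"(?P<utility>[^:]+):(?P<host>[^:]+):(?P<user>[^:]+?)"
    r"-\[(?P<level>[A-Z]+)\]:-(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = "20170929:15:18:22:000872 gptext-state:gpdb:gpadmin-[INFO]:-Done."
    print(parse_line(sample))
```

Filtering on the level field, or on messages such as the per-index rows printed by the -D option, is then a matter of ordinary dictionary access.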

gptext-stop

Stops the GPText cluster nodes.

Syntax

gptext-stop -h

gptext-stop [-P <pool>] [-v] [-f]

Parameters

Parameter Description

-h | --help   Displays a usage message and exits.

-P <pool> | --pool <pool>   Sets the number of threads to add to the worker pool. If not specified, the gptext-stop utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-stop makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-v | --verbose   Displays debug output.

-f | --force   Forcefully stops all Solr processes.

Examples

1. Stop the GPText cluster.

gptext-stop

2. Force stop the GPText cluster.

gptext-stop -f

gptext-uninstall

Uninstalls GPText, including data and installed files. Uninstalls ZooKeeper nodes if they were installed with the GPText installer.

Stops any running GPText instances.

Deletes all Solr directories in segment directories.

Deletes the installation directory.

Removes all GPText schemas and indexes from all databases.

Uninstalls ZooKeeper if it was installed with the GPText installer.

Syntax

gptext-uninstall -h | --help

gptext-uninstall [-P <pool>] [-v | --verbose]

Parameters

Parameter Description

-h | --help   Displays a usage message and exits.

-P <pool> | --pool <pool>   Sets the number of threads to add to the worker pool. If not specified, the gptext-uninstall utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-uninstall makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-v | --verbose   Displays debug output.

Notes

To use gptext-uninstall, you must have superuser permissions on all databases with GPText schemas.

gptext-uninstall runs only if there is at least one database with a GPText schema.

Examples

1. Uninstall GPText.

gptext-uninstall

gptext-upgrade

Upgrades the current GPText system to a new GPText release.

Syntax

gptext-upgrade [-h | --help]

gptext-upgrade [-P <pool>] [-f <upgrade_file> | --file=<upgrade_file>] [-c | --base_check] [-v | --verbose]

Parameters

Parameter Description

-h | --help   Displays a usage message and exits.

-P <pool> | --pool <pool>   Sets the number of threads to add to the worker pool. If not specified, the gptext-upgrade utility determines the number of worker threads needed. If ssh/scp connections are refused or dropped because the configured maximum number of connections has been reached, use this option to set the thread pool size to a lower number, reducing the number of concurrent connections gptext-upgrade makes. If you set the pool size too high, the utility displays a message and you must confirm to continue with the maximum number of threads allowed.

-f <upgrade_file> | --file <upgrade_file>   Provides the path to the upgrade file. The default upgrade file is $GPPERFMONHOME/share/upgrade.yaml.

-c | --base_check   By default, gptext-upgrade checks that the GPText environment can be upgraded and reports any items that must be corrected before upgrading. When the -c or --base_check option is supplied, the environment check is omitted.

-v | --verbose   Displays debug output when executing the command.

Notes

The upgrade_file is a YAML-formatted script defining actions to upgrade a GPText system from a previous release to the current release. The file is not intended to be edited by users. If the upgrade_file does not contain support for the previous GPText release, gptext-upgrade outputs an error message and exits.

zkManager

Checks the ZooKeeper cluster state. If ZooKeeper was installed with GPText, zkManager can start or stop the ZooKeeper cluster.

Syntax

zkManager [-h | --help]

zkManager state [-v | --verbose]

zkManager start [-v | --verbose]

zkManager stop [-v | --verbose] [-f | --force]

Parameters

Parameter Description

-h | --help   Displays a usage message and exits.

-f | --force   When used with the stop command, performs a forced stop.

-v | --verbose   Displays debug output when executing the command.

Notes

The zkManager start and zkManager stop commands are only available if the ZooKeeper cluster was installed by the GPText installer.

By default, all gptext-* utilities check the ZooKeeper cluster state. If the cluster is not healthy, the ZooKeeper state information is displayed to warn the user.

The nc (netcat) command must be installed on the master host. Run nc in a terminal to ensure the command is installed.
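As an alternative to running nc by hand, the presence of the command can be tested from a script. The following Python sketch uses only the standard library; command_available is a hypothetical helper name, and the check reflects the PATH of the user who runs it, so it should be run as the same user that runs zkManager on the master host.

```python
import shutil


def command_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if command_available("nc"):
        print("nc is installed")
    else:
        print("nc was not found on PATH; install netcat before using zkManager")
```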

Examples

1. Start the ZooKeeper cluster, if ZooKeeper was installed by the GPText binary:

zkManager start

2. Stop the ZooKeeper cluster, if ZooKeeper was installed by the GPText binary:

zkManager stop

3. Force stop the ZooKeeper cluster, if ZooKeeper was installed by the GPText binary:

zkManager stop -f

4. Check the state of the ZooKeeper cluster:

$ zkManager state
20160603:14:17:01:307386 zkManager:gpdb-sandbox:gpadmin-[INFO]:-Execute zookeeper state process.
20160603:14:17:01:307386 zkManager:gpdb-sandbox:gpadmin-[INFO]:-Host   port   Latency min/avg/max   Mode
20160603:14:17:01:307386 zkManager:gpdb-sandbox:gpadmin-[INFO]:-gpdb-sandbox.localdomain   2188   0/0/17   follower
20160603:14:17:01:307386 zkManager:gpdb-sandbox:gpadmin-[INFO]:-gpdb-sandbox.localdomain   2189   0/0/17   leader
20160603:14:17:01:307386 zkManager:gpdb-sandbox:gpadmin-[INFO]:-gpdb-sandbox.localdomain   2190   0/0/70   follower
20160603:14:17:06:307386 zkManager:gpdb-sandbox:gpadmin-[INFO]:-Done.


GPText and Solr Data Type Mappings

The following table maps Greenplum Database data types to Solr data types.

If a Greenplum Database data type is not listed, it is a text type in Solr.

If a Greenplum Database data type is an array, it is mapped to a multi-value type in Solr. For example, INT[] maps to a multi-value int Solr field.

Greenplum Database Type   Solr Type

bigint long

bit string

bool boolean

bytea binary

char string

date tdate

float4 float

float8 double

int int

int2 int

int4 int

int8 long

interval string

money string

name string

numeric double

point point

smallint int

text text

time string

timestamp tdate

timestamptz tdate

timetz string

uuid uuid

varbit string

varchar text
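The mapping rules can be expressed as a small lookup, shown here as a Python sketch. The dictionary repeats the table above; the function name solr_type and its return shape (Solr type plus a multi-value flag) are illustrative choices, not a GPText API. Unlisted types fall back to text and a trailing [] marks an array, as described in the introduction to this table.

```python
SOLR_TYPES = {
    "bigint": "long", "bit": "string", "bool": "boolean", "bytea": "binary",
    "char": "string", "date": "tdate", "float4": "float", "float8": "double",
    "int": "int", "int2": "int", "int4": "int", "int8": "long",
    "interval": "string", "money": "string", "name": "string",
    "numeric": "double", "point": "point", "smallint": "int", "text": "text",
    "time": "string", "timestamp": "tdate", "timestamptz": "tdate",
    "timetz": "string", "uuid": "uuid", "varbit": "string", "varchar": "text",
}


def solr_type(gp_type):
    """Return (solr_type, is_multi_value) for a Greenplum Database type name."""
    name = gp_type.strip().lower()
    multi_value = name.endswith("[]")  # arrays map to multi-value Solr fields
    if multi_value:
        name = name[:-2]
    return SOLR_TYPES.get(name, "text"), multi_value  # unlisted types become text


if __name__ == "__main__":
    for gp in ("INT[]", "timestamptz", "inet"):
        print(gp, "->", solr_type(gp))
```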


GPText Schema Tables

The gptext schema includes tables that GPText uses to manage the GPText cluster and to log GPText activities.

gptext.admin_history

GPText writes a record to the gptext.admin_history table when the following actions occur:

create or drop a GPText index

add or drop a field in a GPText index

back up or restore a GPText index

add or remove a ZooKeeper role on a GPText node

Column   Type   Description

time   timestamp without time zone   The time the action occurred.

user   character varying(64)   The name of the Greenplum Database role that performed the action.

action   text   A text message describing the action.

gptext.gptext_envs

The gptext.gptext_envs table is an external table containing rows with values for GPText environment variables. Currently, the only GPText environment variable is $GPTXTHOME, which is the GPText installation directory. The source for the rows in this table is the CSV file $MASTER_DATA_DIRECTORY/gptxtenvs.conf on the master host and the standby master host.

Column   Type   Description

envname   text   The name of an environment variable.

value   text   The value of the environment variable.

gptext.error_table

GPText writes a record in the gptext.error_table when a request to add a document to a GPText external index fails. Rows remain in the table until you call gptext.recreate_error_table to drop and recreate the table.

Column   Type   Description

error_time   timestamp without time zone   The time the error occurred.

index_name   text   The name of the external index.

sqlcmd   text   Text of the SQL statement, if any.

errmsg   text   The message text of the error that occurred.

rawdata   text   Data associated with the error, for example the document URL.

rawbytes   bytea   Binary data associated with the error, if any.

gptext.solr_instances

The gptext.solr_instances table is an external table with a row for each Solr instance. The source for the rows in this table is the CSV file $MASTER_DATA_DIRECTORY/gptext.conf on the master host and the standby master host.

Column   Type   Description

id   integer   Unique id for the Solr instance.

host   text   Name of the host where the instance is running.

port   integer   Port number of the Solr instance.

solrdir   text   Path to the Solr instance's data directory.

zoocluster   text   A list of ZooKeeper nodes.

gptext.zoo_cluster

The gptext.zoo_cluster table is an external table with one row for each ZooKeeper node. The source for the rows in this table is the CSV file $MASTER_DATA_DIRECTORY/zookeeper.conf on the master host and the standby master host.

Column   Type   Description

id   integer   The unique id of the ZooKeeper node.

host   text   Name of the host where the ZooKeeper node is running.

port   integer   Port number of the ZooKeeper instance.

data_directory   text   Path to the ZooKeeper node's data directory.
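ZooKeeper clients conventionally address an ensemble with a comma-separated list of host:port pairs. The Python sketch below builds such a string from rows shaped like those of gptext.zoo_cluster. It is an illustration under stated assumptions: the rows are typed in by hand rather than queried, the id values are made up, the helper name zk_connect_string is hypothetical, and nothing here claims that the zoocluster column of gptext.solr_instances uses exactly this format.

```python
def zk_connect_string(rows):
    """rows: iterable of (id, host, port, data_directory) tuples.

    Returns 'host:port,host:port,...' ordered by id.
    """
    return ",".join(f"{host}:{port}" for _id, host, port, _dir in sorted(rows))


if __name__ == "__main__":
    # Hosts, ports, and directories taken from the gptext-state configs example;
    # the ids are illustrative.
    rows = [
        (2, "sdw2", 2189, "/data/zoo/zoo0"),
        (1, "mdw", 2189, "/data/zoo/zoo0"),
        (3, "sdw1", 2189, "/data/zoo/zoo0"),
    ]
    print(zk_connect_string(rows))
```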


GPText Configuration Parameters

GPText configuration parameters can be overridden by setting a new value in a Greenplum Database session. Changes made to configuration parameters only affect future GPText operations; existing indexes use the parameter values that were set when they were created.

See Changing GPText Server Configuration Parameters for information about changing configuration parameters and examples.

The following table lists the GPText configuration parameters with their defaults and value constraints.

Parameter   Description   Min   Max   Default

admin_timeout   Timeout, in seconds, for admin requests (create_index, etc.).   30   INT_MAX   3600

commit_timeout   Timeout, in seconds, for prepare commit and commit operations.   30   INT_MAX   3600

delete_timeout   Timeout, in seconds, for delete requests.   30   INT_MAX   3600

extension_factor   Maximum number of replicas that can be added for an index per GPText node after the index is created.   0   10   2

facet_timeout   Timeout, in seconds, for faceting queries.   30   INT_MAX   3600

failover_factor   Minimum ratio of Solr nodes that must be up in order to create a new index (Solr Nodes Up / Total Solr Nodes).   0.0   1.0   0.8

hl_post_tag   Markup that gptext.highlight() inserts after terms in search results.   '</em>'

hl_pre_tag   Markup that gptext.highlight() inserts before terms in search results.   '<em>'

idx_buffer_size   Size of indexing buffer in bytes.   4096   67108864   134217728

idx_delim   Delimiter to use during indexing.   comma   ','

idx_encapsulator   The character optionally used to surround values to preserve characters such as the CSV separator or whitespace.   quote   '"'

idx_escape   Escape character to use for indexing.   backslash   '\\'

idx_num_shards   The number of shards to create for indexes that use the Solr CompositeId router. The default, 0, is equal to the number of Greenplum Database segments.   0

index_timeout   Timeout, in seconds, for receiving response to indexing operation.   30   INT_MAX   3600

optimize_timeout   Timeout, in seconds, for optimize operations.   30   INT_MAX   3600

ping_timeout   Timeout, in seconds, for ping requests.   30   INT_MAX   120

replication_factor   The number of replicas per shard for a newly created index.   0   10   2

replication_timeout   Timeout, in seconds, for replication operations (backup, restore).   30   INT_MAX   43200

rollback_timeout   Timeout, in seconds, for rollback operations.   30   INT_MAX   3600

search_batch_size   Batch size for search requests.   1   INT_MAX   2500000

search_buffer_size   Buffer size for search results, in bytes.   4096   67108864   16777216

search_param_separator   Delimiter to use in the options parameter of the gptext.search() UDF.   '&'

search_post_buffer_size   Post buffer size for search, in bytes.   512   4194304   4096

search_timeout   Timeout, in seconds, for searches.   30   INT_MAX   600

stats_timeout   Timeout, in seconds, for obtaining statistics.   30   INT_MAX   600

idx_segment_error_limit   Limit for indexing errors per segment. If this value is exceeded on any segment, the indexing operation is stopped.   1   INT_MAX   10

terms_batch_size   Batch size for terms operations.   1   INT_MAX   1000
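The value constraints in this table lend themselves to a simple range check before a parameter is changed in a session. The Python sketch below is illustrative only and rests on two assumptions: that the three values listed for a parameter are its minimum, maximum, and default, in that order, and that INT_MAX is the largest 32-bit signed integer. The names LIMITS, check_value, and can_create_index are hypothetical; only a few parameters are included. The last function restates the failover_factor rule, treating the factor as an inclusive minimum ratio.

```python
INT_MAX = 2**31 - 1  # assumption: 32-bit signed maximum

# (min, max, default) for a few parameters, read from the table above
LIMITS = {
    "search_timeout": (30, INT_MAX, 600),
    "ping_timeout": (30, INT_MAX, 120),
    "replication_factor": (0, 10, 2),
    "failover_factor": (0.0, 1.0, 0.8),
}


def check_value(name, value):
    """True if value lies within the documented range for the parameter."""
    low, high, _default = LIMITS[name]
    return low <= value <= high


def can_create_index(nodes_up, total_nodes, failover_factor=0.8):
    """True if the ratio of running Solr nodes meets the failover factor."""
    return total_nodes > 0 and nodes_up / total_nodes >= failover_factor


if __name__ == "__main__":
    print(check_value("search_timeout", 10))  # below the 30-second minimum
    print(can_create_index(3, 5))             # 0.6 is below the default 0.8
```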
