Contentssupport.sas.com/documentation/onlinedoc/textaddons/13/irsag.pdf · 1.1 What Is SAS Information Retrieval Studio? .....1 1.2 Benefits of Using SAS Information Retrieval Studio

Contents

About This Book
Audience
Prerequisites
Conventions

What’s New in SAS Information Retrieval Studio 1.3

1 About SAS Information Retrieval Studio
1.1 What Is SAS Information Retrieval Studio?
1.2 Benefits of Using SAS Information Retrieval Studio
1.3 How Does SAS Information Retrieval Studio Work?
1.4 How Does SAS Information Retrieval Studio Fit into the SAS Product Line?
1.5 How to Get Help for SAS Information Retrieval Studio
1.6 What is a Document?
1.7 Architecture

2 SAS Information Retrieval Studio Interface
2.1 Your First Look at the SAS Information Retrieval Studio User Interface
2.2 Access the SAS Information Retrieval Studio User Interface
2.3 Viewing the Overview Pane
2.3.1 Overview of the Overview Pane
2.3.2 The Log Tab
2.4 Viewing the Web Crawler Pane
2.4.1 Overview of the Web Crawler Pane
2.4.2 The Buttons
2.4.3 The Status Tab
2.4.4 The Configuration Tab
2.4.4.A The Main Tabs in the Configuration Tab
2.4.4.B The General Settings Tab
2.4.4.C The Entry Points Tab
2.4.4.D The Scope Tab
2.4.4.E The Filename Extensions Tab
2.4.4.F The Credentials Tab
2.4.4.G The Log Tab

2.5 Viewing the File Crawler Pane
2.5.1 Overview of the File Crawler Pane
2.5.2 The Buttons
2.5.3 The Status Tab
2.5.4 The Configuration Tab
2.5.4.A The Main Tabs in the Configuration Tab
2.5.4.B The General Settings Pane
2.5.4.C The Paths Pane
2.5.4.D The Paths to Exclude Pane
2.5.4.E The Filename Extensions Pane
2.6 Viewing the Feed Crawler Pane
2.6.1 Overview of the Feed Crawler Pane
2.6.2 The Buttons
2.6.3 The Status Tab
2.6.4 The Configuration Tab
2.6.4.A The Main Tabs in the Configuration Tab
2.6.4.B The General Settings Tab
2.6.4.C The Feeds Tab
2.6.5 The Log Tab
2.7 Viewing the Proxy Server Pane
2.7.1 Overview of the Proxy Server Pane
2.7.2 The Buttons
2.7.3 The Status Tab
2.7.4 The Configuration Tab
2.7.5 The Log Tab
2.8 Viewing the Pipeline Server Pane
2.8.1 Overview of the Pipeline Server Tab
2.8.2 The Buttons
2.8.3 The Status Tab
2.8.4 The Document Processors Tab
2.8.5 The Document Inspector Tab
2.8.6 The Log Tab
2.9 Viewing the Indexing Server Pane
2.9.1 Overview of the Indexing Server Pane
2.9.2 The Buttons
2.9.3 The Status Tab
2.9.4 The Configuration Tab
2.9.5 The Log Tab

iv SAS Information Retrieval Studio: Administrator’s Guide

2.10 Viewing the Query Server Pane
2.10.1 Overview of the Query Server Tab
2.10.2 The Buttons
2.10.3 The Status Tab
2.10.4 The Log Tab
2.11 Viewing the Query Web Server Pane
2.11.1 Overview of the Query Web Server Tab
2.11.2 The Buttons
2.11.3 The Status Tab
2.11.4 The Configuration Tab
2.11.4.A The Main Tabs in the Configuration Tab
2.11.4.B The Matching Tab
2.11.4.C The Sorting Tab
2.11.4.D The Labels Tab
2.11.4.E The Match Formatting Tab
2.11.4.F The Theme Tabs
2.11.5 The Log Tab
2.12 Viewing the Query Statistics Server Pane
2.12.1 Overview of the Query Statistics Server Pane
2.12.2 The Buttons
2.12.3 The Status Tab
2.12.4 The Query Statistics Tab
2.12.4.A The Buttons
2.12.4.B The Most Frequent Queries Tab
2.12.4.C The Most Frequent Queries without Matches Tab
2.12.4.D The Hourly Query Rate Tab
2.12.4.E The Daily Query Rate Tab
2.12.4.F The Monthly Query Rate Tab
2.12.5 The Log Tab
2.13 The Add Document Processor Windows
2.13.1 Overview of Document Processor Windows
2.13.2 Access Document Processor Window
2.13.3 The Document Processor: add_field Window
2.13.4 The Document Processor: content_categorization Wizard
2.13.4.A Overview of the content_categorization Document Processor
2.13.4.B Configure SAS Content Categorization Server
2.13.4.C Specify the Projects
2.13.4.D Specify Input
2.13.4.E Specify Categories

2.13.4.F Specify Concepts
2.13.4.G Specify Facts
2.13.5 The Document Processor: default_mime_type_from_url Window
2.13.6 The Document Processor: default_title_from_url Window
2.13.7 The Document Processor: document_converter Window
2.13.8 The Document Processor: export_csv Window
2.13.9 The Document Processor: export_to_files Window
2.13.10 The Document Processor: export_to_odbc Window
2.13.11 The Document Processor: export_to_sentiment_analysis_workbench Window
2.13.12 The Document Processor: extract_abstract Window
2.13.13 The Document Processor: extract_pdate Window
2.13.14 The Document Processor: heuristic_parse_html Window
2.13.15 The Document Processor: invalidate_duplicates_by_url Window
2.13.16 The Document Processor: match_and_copy Window
2.13.17 The Document Processor: modify_field_name Window
2.13.18 The Document Processor: parse_html Window
2.13.19 The Document Processor: parse_xml Window
2.13.20 The Document Processor: send Window
2.13.21 The Document Processor: strip_html Window
2.13.22 The Document Processor: substitute Window
2.14 Miscellaneous Windows
2.14.1 The Import Settings Window
2.14.2 The Export Settings Window
2.14.3 The Select an HTTP Proxy Window
2.14.4 The Add Entry Point Window
2.14.5 The Edit Entry Point Window
2.14.6 The Add Feed Window
2.14.7 The Edit Feed Window
2.14.8 The Add Scope Rule Window
2.14.9 The Edit Scope Rule Window
2.14.10 The Add Filename Extension Window
2.14.11 The Edit Filename Extension Window
2.14.12 The Add Credential Window
2.14.13 The Edit Credential Window
2.14.14 The Add Path Window
2.14.15 The Edit Path Window
2.14.16 The Add Path to Exclude Window

2.14.17 The Edit Path to Exclude Window
2.14.18 The Add Extension Window
2.14.19 The Edit Extension Window
2.14.20 The Add Backend Window
2.14.21 The Edit Backend Window
2.14.22 The Add Field Window for the Indexing Server
2.14.23 The Add Field Window for the Query Web Server Matching Pane
2.14.24 The Edit Field Window for the Query Web Server Matching Pane
2.14.25 The Add Field Window: Query Web Server Labels Pane
2.14.26 The Edit Field Window: Query Web Server Labels
2.14.27 The Color Box Window
2.14.28 Status Windows
2.14.28.A Overview of Status Windows
2.14.28.B The Confirmation Window
2.14.28.C The Error Window

3 Choosing Your Components
3.1 Before You Choose Your Components
3.2 Choosing a Crawler
3.3 Purposes of the Proxy Server
3.4 Choosing Document Processors in the Pipeline Server
3.4.1 Overview of the Pipeline Server
3.4.2 Choosing a Document Processor
3.4.3 The Export Operations Performed by the Pipeline Server
3.5 How the Indexing Server Works
3.6 Querying the Index
3.6.1 Overview of the Querying
3.6.2 Using the Query Server
3.6.3 Using the Query Web Server
3.6.4 Using the Query Statistics Server
3.7 Defining Labels for Facetted Search
3.8 After You Choose Your Components
3.9 Exporting and Importing Component Specifications

4 Sample Configurations
4.1 Why You Want to Understand Sample Configurations
4.2 Before You Use a Sample Configuration to Create Your Own Application

4.3 Sample Configurations That Use the Web Crawler
4.3.1 A Web Crawler, Indexing, and Searching Configuration
4.3.2 The Web Crawler with Exporting and Indexing Processes
4.4 A Sample Configuration That Uses the File Crawler
4.5 A Sample Configuration That Uses the Feed Crawler

5 Configuring the Web Crawler
5.1 Overview of the Web Crawler
5.2 Configuring the Web Crawler
5.2.1 Overview of Configuring the Web Crawler
5.2.2 Specify the General Settings
5.2.3 Specify Entry Points for the Web Crawler
5.2.4 Specify the Scope of the Crawl
5.2.5 Exclude Certain Types of Files
5.2.6 Specify Access Information for Password-Protected Sites
5.3 Run the Web Crawler
5.4 Troubleshoot with the Log File

6 Configuring the File Crawler
6.1 Overview of the File Crawler
6.2 Configure the File Crawler
6.2.1 Overview of Configuring the File Crawler
6.2.2 Specify the General Settings
6.2.3 Specify the Paths to Crawl
6.2.4 Specify the Paths to Exclude
6.2.5 Specify the Types of Files to Return
6.3 Run the File Crawler
6.4 Troubleshoot with the Log File

7 Configuring the Feed Crawler
7.1 Overview of the Feed Crawler
7.2 Configure the Feed Crawler
7.2.1 Overview of Configuring the Feed Crawler
7.2.2 Specify the General Settings
7.2.3 Specify the Feeds
7.3 Run the Feed Crawler
7.4 Troubleshoot with the Log File

8 Configuring the Proxy Server

8.1 Overview of the Proxy Server
8.2 View the Status of the Proxy Server and Input Files
8.3 Configure the Proxy Server
8.4 Run the Proxy Server
8.5 Troubleshoot with the Log File

9 Configuring the Pipeline Server
9.1 Overview of the Pipeline Server
9.1.1 Processing Documents and Related SAS Applications
9.1.1.A How Document Processing and Export Operations Work Together
9.1.1.B Process Documents
9.1.1.C Export Processed Documents
9.2 Configuring the Pipeline Server
9.2.1 Overview of the Document Processors
9.2.2 Checking Program Installations
9.2.3 Configure the Document Processors
9.3 See Input Documents with the Document Inspector
9.4 Add a New Field to Input Documents
9.5 Match Categories, Concepts, and Facts
9.6 Export Categories and Concept Matches
9.7 Advanced Installation
9.8 Run the Pipeline Server
9.9 Troubleshoot with the Log File

10 Creating Facetted Search Labels Using content_categorization
10.1 Before You Begin Using This Example
10.1.1 How the content_categorization Document Processor Creates Facetted Search Labels
10.1.2 Using Related Programs to Define Labels
10.1.3 Mapping to Labels
10.1.4 Before You Build Your SAS Content Categorization Studio Project
10.1.5 Before You Use the Example in This Chapter
10.2 Creating a Sample Project
10.2.1 Access the Projects on SAS Content Categorization Server
10.2.2 Add Projects

10.2.3 Determine the Input, Matching, and Output
10.2.3.A How Input Documents Are Handled
10.2.3.B Specify Input Fields
10.2.3.C Specify Categories
10.2.3.D Specify Concepts
10.2.3.E Specify Facts
10.2.4 Specify Output
10.2.5 Apply content_categorization to Input Documents
10.3 Seeing the Results in the Query Interface

11 Configuring the Indexing Server
11.1 Overview of the Indexing Server
11.2 Configure an Index
11.3 Changes That Affect the Indexing Server
11.4 Run the Indexing Server
11.5 Troubleshoot with the Log File

12 Configuring the Query Server
12.1 Overview of the Query Server
12.2 Run the Query Server
12.3 Troubleshoot with the Log File

13 Configuring the Query Web Server
13.1 Overview of the Query Web Server
13.2 Choosing How Search Returns Are Displayed
13.2.1 Displays with or without Labels
13.2.2 No Labels Example
13.2.3 Hierarchical Labels Example
13.2.4 No Hierarchical Display Example
13.2.5 Flattened Hierarchical Example
13.3 Configure the Query Web Server
13.3.1 Overview of Configuring the Query Web Server
13.3.2 Specify the Server Port
13.3.3 Specify How Matching Is Performed
13.3.3.A Match Types
13.3.3.B Select a Match Type
13.3.4 Specify How Matches Are Sorted
13.3.5 Specify Labels for Facetted Search
13.3.6 Specify the Formatting for the Matches

13.3.7 Specify the Theme of the Search Window
13.3.7.A Theme Overview
13.3.7.B Specify the Colors of the Search Window
13.3.7.C Load New Images into the Search Window
13.4 Run the Query Web Server
13.5 Troubleshoot with the Log File

14 Configuring the Query Statistics Server
14.1 Overview of the Query Statistics Server
14.2 Run the Query Statistics Server
14.3 View the Query Statistics for a Selected Time Period
14.3.1 Overview of Time Period Views
14.3.2 See the Statistics for Today
14.3.3 See the Statistics for This Month
14.3.4 See the Statistics for This Year
14.3.5 See the Statistics for All Time
14.4 After You View the Query Data
14.5 Troubleshoot with the Log File

Appendixes

A Regular Expressions and XML Field Extraction File
A.1 Regular Expressions
A.2 XML File Field Extraction File Format

B Recommended Reading

C Glossary

Index

SAS Information Retrieval Studio: Administrator’s Guide xi


About This Book

Audience

SAS Information Retrieval Studio is designed for the following administrators:

- Persons who install the software.
- Persons who determine what components are used in the custom information retrieval application that your organization requires.
- Persons who choose the components, and their configurations, for information retrieval.
- Persons who design the search window that is used by end users to query the index.

You could be assigned one of these functions, or all of them.

You can use SAS Information Retrieval Studio with other SAS products. This documentation focuses on tasks that define and configure the information retrieval application and the search interface.

Prerequisites

Here are the prerequisites for using SAS Information Retrieval Studio:

- SAS Information Retrieval Studio installed on your machine.
- A supported browser installed on your desktop client.
- Access to data sources.
- (Optional) Rules such as category rules and concept definitions created in other SAS applications.

If you have any questions about whether you are ready to use SAS Information Retrieval Studio, contact your system administrator.


Conventions

This manual uses the following typographical conventions:

Convention Description

TGM_ROOT The root directory where SAS Information Retrieval Studio is installed, typically the following:

Windows: C:/Program Files/SAS Information Retrieval Studio
UNIX: /opt/SAS Information Retrieval Studio

.xml Code examples are shown in a fixed-width font.

Start button The labels for user interface controls are shown in a bold, sans-serif font.

www.sas.com The hypertext links are shown in a light blue, fixed-width font, and are underlined.

This manual contains instructional text that is subject to change.


What’s New in SAS Information Retrieval Studio 1.3

New and enhanced features in SAS Information Retrieval Studio include the following:

- SAS licensing replaces the Teragram license.

- The content_categorization Document Processor wizard replaces the categorizer, concept_extractor, and contextual_extractor processors.

- The add_field Document Processor enables you to add a field with a constant value to each input document.

- The export_to_files document processor now enables you to mark pre-escaped fields for XML documents. Use this processor to create nested XML tags.

- The parse_xml document processor can now be instantiated multiple times. This feature enables you to support multiple document schemas. This processor can also copy the original URL of the compound document into each resulting, split document.

- The export_csv document processor now supports a non-escaped output mode.

- Entry point quota control is now available for the web crawler. This feature enables seed-only crawling.

- The match_and_copy document processor is similar to the substitute document processor. Use the match_and_copy document processor to write the output to a different field from the input.

- The default fields ctime, mtime, and atime are included in the Input fields to exclude field for the content categorization document processor. This setting prevents these timestamps from being processed by SAS Content Categorization Server.

- The passwords in the web crawler Credentials pane are now obscured.
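
The behavior described above for the add_field and match_and_copy document processors can be pictured with a minimal Python sketch. This is not the product's implementation. The function names, the dictionary representation of a document, the field names, and the use of Python regular expressions are all assumptions made for illustration.

```python
import re

def add_field(document, name, value):
    """Return a copy of the document with a constant-valued field added."""
    updated = dict(document)
    updated[name] = value
    return updated

def match_and_copy(document, source, target, pattern):
    """Copy the first match of pattern in the source field into a different target field.

    The source field is left unchanged. If nothing matches, no field is added.
    """
    updated = dict(document)
    match = re.search(pattern, updated.get(source, ""))
    if match:
        updated[target] = match.group(0)
    return updated

# Hypothetical input document.
doc = {"url": "http://example.com/news/2013/item.html", "body": "Quarterly results"}
doc = add_field(doc, "source", "web")
doc = match_and_copy(doc, "url", "year", r"\d{4}")
print(doc["source"], doc["year"])  # web 2013
```

The point of the sketch is the contrast drawn in the list: the output of the match goes to a new field, and the input field keeps its original value.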


1 About SAS Information Retrieval Studio

- What Is SAS Information Retrieval Studio?
- Benefits of Using SAS Information Retrieval Studio
- How Does SAS Information Retrieval Studio Work?
- How Does SAS Information Retrieval Studio Fit into the SAS Product Line?
- How to Get Help for SAS Information Retrieval Studio
- What is a Document?
- Architecture

1.1 What Is SAS Information Retrieval Studio?

In many organizations, diverse information consumers need to quickly access specific data. In an environment where data, and its types, grow exponentially, there is a need to automate the related processes. SAS Information Retrieval Studio combines several key technologies to provide a comprehensive solution to data collection, indexing, searching, and so on. These technologies are bundled into one customizable product.

Easy information retrieval

The web, feed, and file crawlers gather the documents that you specify according to your parameters. Documents are chunks of text, with or without markup tags, gathered from the Internet, feeds, and databases. These chunks of text can be processed by the document processors that parse, convert, categorize, extract concepts and facts, and so on. The documents can then be sent to the index or to another program such as SAS Sentiment Analysis Workbench. If indexed, the documents can be searched by your end users.

Build a custom information retrieval pipeline

Choose to build an information retrieval system that is customized to meet the needs of your organization. You can choose all, or some, of the following components:

Crawlers

Choose the web, feed, or file crawlers to gather documents from the Web, feeds, and file systems, respectively.

Pipeline server

Choose your document processors that parse, categorize, extract concepts, locate facts, convert documents into text, and so on. These processors can also hand the gathered documents to other applications such as SAS Sentiment Analysis Workbench.

Indexing server

Choose how, and whether, input documents are indexed. End users can search indexed documents using a customized search window that runs on the query web server.

Query web server

Specify how the matching documents are returned in the search window, the appearance of this window, and how end users can navigate the returns.

Query statistics server

See the counts for the entered queries according to various time frames.

Easy component customization

Easy-to-use windows and wizards simplify the process of customizing the information retrieval components that you choose. These panes also provide log files, statistics, information about the processes involved, and data on documents in the pipeline.
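
As a rough mental model of how the components listed above hand documents to one another, consider the following Python sketch. Every function here is a hypothetical stand-in written for illustration. None of the names corresponds to a real SAS Information Retrieval Studio interface, and the word-level index is a deliberate simplification.

```python
from collections import defaultdict

def crawl():
    """Stand-in for a crawler: yields documents as dictionaries of fields."""
    yield {"url": "file:///docs/a.txt", "body": "Retrieval of indexed text"}
    yield {"url": "file:///docs/b.txt", "body": "Sentiment in customer text"}

def lowercase_body(document):
    """Stand-in for a document processor: normalizes the body field."""
    return {**document, "body": document["body"].lower()}

def build_index(documents):
    """Stand-in for the indexing server: maps each word to the URLs that contain it."""
    index = defaultdict(set)
    for document in documents:
        for word in document["body"].split():
            index[word].add(document["url"])
    return index

def query(index, term):
    """Stand-in for the query server: returns matching URLs in sorted order."""
    return sorted(index.get(term.lower(), set()))

index = build_index(lowercase_body(d) for d in crawl())
print(query(index, "text"))       # ['file:///docs/a.txt', 'file:///docs/b.txt']
print(query(index, "Sentiment"))  # ['file:///docs/b.txt']
```

Each stage consumes the output of the previous one, which is why a stage such as indexing can be left out when the documents are destined for another program instead.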


1.2 Benefits of Using SAS Information Retrieval Studio

SAS Information Retrieval Studio provides the following benefits:

Empowers business owners by locating data

SAS Information Retrieval Studio includes functionality that is designed to fit your organization’s requirements. Use this program to locate, process, index, and customize a search window for your data. See various types of informational statistics.

Improves the business value of IT and the corporate data that it manages

SAS Information Retrieval Studio provides you with easy, self-service access to the information contained in your documents. Use SAS Information Retrieval Studio to locate, process, index, and search your data.

Saves money on training and support costs

SAS Information Retrieval Studio is so simple that you can quickly become self-sufficient, with minimal IT support and no need for extensive training. Once you start using SAS Information Retrieval Studio, you are no longer dependent on the IT staff.

1.3 How Does SAS Information Retrieval Studio Work?

SAS Information Retrieval Studio is an application that anyone can use to locate documents on the Internet or in file systems. You can specify how these documents are processed, and then index them or send them to another SAS program. If you choose to index the documents, end users can query this data in the search window that you customize. Use the SAS Information Retrieval Studio window to select the crawler and document processors. Also determine whether the documents are indexed or sent to another SAS application and customize a search window. All of these processes are optional. You can specify the components that you want to use, configure the components, and enable your end users to perform faceted search using labels.

1.4 How Does SAS Information Retrieval Studio Fit into the SAS Product Line?

As an integral part of the SAS product line, SAS Information Retrieval Studio provides crawlers, indexing, and searching capabilities. These functionalities facilitate the processes of information retrieval and management. Use these capabilities with the following SAS products, among others:

Export document collections to SAS Sentiment Analysis Workbench and SAS Text Miner

Export the files that the web and feed crawlers gather in SAS Information Retrieval Studio to SAS Sentiment Analysis Workbench. Here you can see reports about overall sentiment. Analysts can also see and review individual documents in SAS Sentiment Analysis Workbench. You can also export to SAS Text Miner to locate topics and themes in your input documents.

Category, concept, and fact extraction

SAS Information Retrieval Studio enables you to deploy the rules defined in SAS Content Categorization Studio and SAS Contextual Extraction Studio to your gathered documents.

Document conversion

Use SAS Document Conversion to extract text from input files such as Adobe PDF and Microsoft Office files.

1.5 How to Get Help for SAS Information Retrieval Studio

Select Help --> Contents or Help --> Using this Window.


1.6 What is a Document?

A document consists of a single text. For example, a document can be any of the following:

- an HTML page
- a Microsoft Word file
- a PDF file
- one row in a CSV file or a database
- one article or summary in a feed

In SAS Information Retrieval Studio, each document is represented as a configurable set of fields. Each field has a name and a value. Unnecessary fields can be either left empty or omitted from the document.
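
One way to picture a document as a set of named fields is a mapping from field name to value. The Python sketch below is an illustration only. The helper name make_document and the field names url, title, body, author, and id are assumptions, not names defined by the product.

```python
def make_document(**fields):
    """Build a document as a set of named fields, omitting fields that have no value."""
    return {name: value for name, value in fields.items() if value is not None}

# An HTML page and one row of a CSV file, each represented as a set of fields.
page = make_document(url="http://example.com/", title="Home", body="<p>Welcome</p>", author=None)
row = make_document(id="42", body="one row from a CSV file")

print(sorted(page))  # ['body', 'title', 'url']
print(sorted(row))   # ['body', 'id']
```

The two documents carry different fields, which reflects the statement that the set of fields is configurable and that unneeded fields can be omitted.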

1.7 Architecture

Use the architecture diagram below to gain an overview of the application processes that you can choose to use in your customized configuration.

Figure 1-1 SAS Information Retrieval Studio


2 SAS Information Retrieval Studio Interface

- Your First Look at the SAS Information Retrieval Studio User Interface
- Access the SAS Information Retrieval Studio User Interface
- Viewing the Overview Pane
- Viewing the Web Crawler Pane
- Viewing the File Crawler Pane
- Viewing the Feed Crawler Pane
- Viewing the Proxy Server Pane
- Viewing the Pipeline Server Pane
- Viewing the Indexing Server Pane
- Viewing the Query Server Pane
- Viewing the Query Web Server Pane
- Viewing the Query Statistics Server Pane
- The Add Document Processor Windows
- Miscellaneous Windows

2.1 Your First Look at the SAS Information Retrieval Studio User Interface

The SAS Information Retrieval Studio user interface provides the workspace necessary to configure the crawlers and servers that you select to gather and process information. The gathered documents are processed according to the specifications that you set and sent either to the indexing server or to another program. For example, choose to send your documents to SAS Sentiment Analysis Workbench where the sentiment that they express can be aggregated. If you want your end users to be able to query the documents that your crawlers locate, choose to index these documents.

Use the windows in SAS Information Retrieval Studio to choose your components according to the tasks that you want to perform:

1. Start, stop, configure, and monitor the status of the web, file, and feed crawlers. Specify the parameters for each type of crawl. A crawl is defined as the entire run of the crawler, instead of a single-page download.

2. Determine how input documents are processed and where they are sent using the pipeline server.

3. Specify how input documents are stored in the index by determining what fields are indexed and how the information in these fields is handled.

4. Choose the type of search that is available to end users and how matching and sorting are determined. You can also determine whether and how labels that facilitate faceted search are made available to end users.

5. Monitor the status of input queries using the query statistics server.

2.2 Access the SAS Information Retrieval Studio User Interface

To access the SAS Information Retrieval Studio user interface, complete this step:

Select Start --> Programs --> SAS Information Retrieval Studio --> Administration.


Display 2.1 SAS Information Retrieval Studio User Interface

Use the components of this pane as specified below:

Table 2-1: Components of the Main Window

Component Description

Refresh button

By default, Auto-refresh is selected. Click to select Refresh. With the default setting, the status of the components is updated every few seconds. If you have a slow connection, you can disable the auto-refresh functionality. In this case, click Refresh to update the status of any of the components of SAS Information Retrieval Studio.

Overview

Find information for the application and the import and export operations that apply to selected components of the application.

Web Crawler

Start, stop, and configure the web crawler.

File Crawler

Start, stop, and configure the file crawler.

Feed Crawler

Start, stop, and configure the feed crawler.

Proxy Server

Start, stop, and configure the proxy server. Also see and search a log file of output for this server.

Pipeline Server

Start, stop, and configure the pipeline server. Observe the progress of gathered documents in the Status pane and see the specified log file output for this server. Use this component to specify the processors that act on each input document. These processors act on the document using the specified operation or pass the document to another component.


Indexing Server

Start, stop, and configure the indexing server. Use the indexing server if you plan to perform search operations using SAS Information Retrieval Studio.

Query Server

Start, stop, and configure the query server. The query server passes queries to the index from the search window and hands the results back to the query web server.

Query Web Server

Start, stop, and configure the query web server. Specify how matching and sorting operations are performed. Format the labels, document matches, and the theme of the search window.

Query Statistics Server

Start and stop the query statistics server. See the most frequent queries submitted, those search terms that are not matched, and query rates for specified time periods.

To adjust the width of these panes, drag the icon located between the two main panes.
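
The kinds of figures that the Query Statistics Server row describes, namely the most frequent queries and the search terms that are not matched, can be illustrated with a short Python sketch. The log format, the function names, and the sample queries are hypothetical. The sketch does not show how the server itself stores or computes its statistics.

```python
from collections import Counter

# Hypothetical query log: (query text, number of matching documents).
query_log = [("sas", 12), ("text mining", 4), ("sas", 12), ("teragramm", 0), ("sas", 12)]

def most_frequent(log, n):
    """Return the n most frequently submitted queries with their counts."""
    return Counter(text for text, _ in log).most_common(n)

def unmatched(log):
    """Return the distinct queries that matched no documents, in sorted order."""
    return sorted({text for text, hits in log if hits == 0})

print(most_frequent(query_log, 1))  # [('sas', 3)]
print(unmatched(query_log))         # ['teragramm']
```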


2.3 Viewing the Overview Pane

2.3.1 Overview of the Overview Pane

For information about this pane, see Section 2.2 Access the SAS Information Retrieval Studio User Interface on page 8.

2.3.2 The Log Tab

See the log pane that describes the operations performed.

Display 2.2 Log Pane

Number of lines

(default is 20) see this maximum number of timestamped lines of text that form the log for the proxy server. Click the increase or decrease control to reset the limit.

Retrieve button

see the log file contents in the Log pane below.


Text to highlight

enter the term that you are seeking to match in the input document.

Find button

click to highlight the matched text in the log pane below.

Log pane

see the specified number of lines in the log file here.
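
The controls of the Log tab can be thought of as two small operations: limit the log to a number of lines, and then mark the text to highlight. The Python sketch below illustrates that idea. It assumes that the most recent lines are the ones shown, which the text above does not state, and it uses >> << markers as a plain-text stand-in for highlighting. The function names are hypothetical.

```python
def retrieve(log_lines, number_of_lines=20):
    """Return at most the last number_of_lines lines of the log (assumed: newest lines)."""
    return log_lines[-number_of_lines:] if number_of_lines > 0 else []

def find(lines, text_to_highlight):
    """Mark each occurrence of the text with >> << in place of visual highlighting."""
    if not text_to_highlight:
        return list(lines)
    return [line.replace(text_to_highlight, f">>{text_to_highlight}<<") for line in lines]

# Hypothetical timestamped log with 31 lines.
log = [f"2013-01-01 00:00:{i:02d} fetched page {i}" for i in range(30)]
log.append("2013-01-01 00:00:30 error: timeout")

shown = retrieve(log, 20)
print(len(shown))                # 20
print(find(shown, "error")[-1])  # 2013-01-01 00:00:30 >>error<<: timeout
```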

2.4 Viewing the Web Crawler Pane

2.4.1 Overview of the Web Crawler Pane

The web crawler searches the Web and returns Web pages according to the parameters that you specify. The operations that are available when you click the Web Crawler tab are explained in the following subsections.

Display 2.3 Web Crawler Pane


2.4.2 The Buttons

The buttons that are available in the Web Crawler pane are listed below from left to right:

Start

begin the crawl.

Stop

end the crawl.

Apply Changes

modify the behavior of the web crawler according to the changes that you made in this pane.

Revert

return to the last applied settings.

2.4.3 The Status Tab

See whether the web crawler is running.

Display 2.4 Status Pane


2.4.4 The Configuration Tab

2.4.4.A The Main Tabs in the Configuration Tab

The web crawler does not run until it is configured. Specify the settings for the web crawler, the points where the crawler enters the Web, the limits of its crawl, and the file types that it returns. You can also specify the permissions necessary to access password-protected sites.

General Settings

specify how the web crawler runs. For more information, see Section 2.4.4.B The General Settings Tab on page 15.

Entry Points

enter the starting URLs. The web crawler starts at these Web addresses and follows their links to gather documents. For more information, see Section 2.4.4.C The Entry Points Tab on page 18.

Scope

allow, or exclude, Web addresses from the crawling process. In either case, specify patterns with regular expressions, or a list of specific URLs. For more information, see Section 2.4.4.D The Scope Tab on page 20.

Filename Extensions

use the default list of excluded file types, or define your own list to include, or exclude, from the crawl. For more information, see Section 2.4.4.E The Filename Extensions Tab on page 21.

Credentials

enter the necessary names and passwords for password-protected sites where access is prevented without this information. When you specify this data, you enable the pages on these sites to be collected. For more information, see Section 2.4.4.F The Credentials Tab on page 23.

2.4.4.B The General Settings Tab

Specify the settings that determine the overall way that the web crawler runs.

Display 2.5 General Settings Pane

Use the components of this window to specify the general settings.

Table 2-2: General Settings Pane Components

Component Description

HTTP proxy

Specify the proxy server here, or click Auto-detect as explained below.

Auto-detect button

Access the Select an HTTP Proxy window where you can choose a proxy server. For more information, see Section 2.14.3 The Select an HTTP Proxy Window on page 120.


Quota (files)

(Default: 25) Click the increase or decrease control to change the file limit for the web crawler.

Quota (megabytes)

(Default: 1000) Click the increase or decrease control to change the megabyte limit for the crawler. This is the maximum total size of all collected files.

Number of downloader threads

(Default: 1) Click the increase or decrease control to change the total number of threads that can be created.

Sleep interval (seconds)

(Default: 1) Click the increase or decrease control to change the number of seconds that the web crawler pauses between page downloads.

Timeout (seconds)

(Default: 300) Click the increase or decrease control to change the number of seconds the web crawler waits before it stops attempting to download a specific page.

Maximum number of retries

(Default: 3) Click the increase or decrease control to change the highest number of times the crawler can attempt to download a page before it tries the next one.

Retry delay (seconds)

(Default: 300) Click the increase or decrease control to change the highest number of seconds that the web crawler waits before it reattempts to download a page.

Respect robots.txt

(Default: Yes) Click to select No. The robots.txt standard enables Web site authors to request that crawlers (robots) avoid downloading some portions of their site. Select No to ignore this request.


Find links in Javascript and Flash

(Default: Yes) Click to select No. Select No to prohibit the web crawler from returning links that it finds in either of these two types of code.

Link traversal order

(Default: Breadth first) Click to select Depth first. Breadth first means that the top layer of linked pages at one site is gathered before the links to the next layer are followed. If you select Depth first, the links on the first page are followed before the crawler goes to the second page, and so on.
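
The difference between the two link traversal orders, and the effect of the file quota, can be seen on a small in-memory link graph. The Python sketch below is a toy model written for illustration. The graph, the function name crawl, and its parameters are hypothetical and do not represent the web crawler's actual algorithm.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages that it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl(entry_point, order="breadth", quota=25):
    """Visit pages from entry_point in the given order, stopping at the file quota."""
    frontier = deque([entry_point])
    seen = {entry_point}
    visited = []
    while frontier and len(visited) < quota:
        # Breadth first takes the oldest queued page; depth first takes the newest.
        page = frontier.popleft() if order == "breadth" else frontier.pop()
        visited.append(page)
        links = LINKS.get(page, [])
        # For depth first, queue links in reverse so that the first link is followed first.
        for link in (links if order == "breadth" else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("home", "breadth"))           # ['home', 'a', 'b', 'a1', 'a2', 'b1']
print(crawl("home", "depth"))             # ['home', 'a', 'a1', 'a2', 'b', 'b1']
print(crawl("home", "breadth", quota=3))  # ['home', 'a', 'b']
```

With a small quota, the traversal order decides which pages are collected before the limit is reached, which is why the two settings interact.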


2.4.4.C The Entry Points Tab

Specify the Web addresses, or URLs, where the web crawler begins its crawl. You can also specify the quota, or maximum number of files that can be downloaded.

Display 2.6 Entry Points Pane

URL

see the list of the entry points to the Web for the web crawler. The crawler accesses the URLs in the order in which they are listed in the Entry Points pane.

Hint: This ordering can affect the number of documents returned when you enter a quota for the web crawler or for each URL. For more information about the quota for the web crawler, see Section 2.4.4.B The General Settings Tab on page 15.

Quota

see the maximum number of files that can be collected for each Web address. When you specify the quota for a URL and the web crawler, the smaller of the two numbers applies. In other words, if you specify 100 for the web crawler and 35 for the selected URL, only 35 documents can be downloaded for this URL. On the other hand, if you specify 100 documents for this URL and 35 for the web crawler, only 35 documents can be downloaded for this URL.

Add

access the Add Entry Point window where you can specify a Web address to begin the crawl. See Section 2.14.4 The Add Entry Point Window on page 121.

Remove

delete the selected entry point and quota from the address pane.

Edit

access the Edit Entry Point window to make changes to the selected Web address. See Section 2.14.5 The Edit Entry Point Window on page 125.
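
The quota rule described for the Entry Points pane, where the smaller of the crawler quota and the URL quota applies, reduces to taking a minimum. The Python sketch below restates the two cases from the text. The function name is hypothetical, and the handling of a URL with no quota of its own is an assumption.

```python
def effective_quota(crawler_quota, url_quota=None):
    """Return the file limit that applies to one entry point.

    Assumption: if no per-URL quota is set, the crawler quota applies on its own.
    """
    if url_quota is None:
        return crawler_quota
    return min(crawler_quota, url_quota)

print(effective_quota(100, 35))  # 35
print(effective_quota(35, 100))  # 35
```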


2.4.4.D The Scope Tab

Specify the Web addresses, or URLs, that are allowed in, or excluded from, the crawl.

Display 2.7 Scope Pane

URL Pattern

specify a pattern for matching URLs.

Match Type

see prefix or regular expression. Both are patterns. Select prefix to specify a match at the beginning of the URL. Select regular expression when you want to use Teragram regular expressions to specify the pattern of the searchable URLs. Regular expressions enable you to apply greater precision to the collection operation.

Action

see Allow or Exclude. These actions either permit, or prevent, the download of pages whose URLs match the specified pattern.

Add

access the Add Scope Rule window. Scope determines the links that the crawler follows. If the URL has the specified prefix, the links in this URL are followed only if no other scope rules exclude this URL. See Section 2.14.8 The Add Scope Rule Window on page 130.

Remove

delete the selected URL pattern and its attributes.

Edit

access the Edit Scope Rule window. See Section 2.14.9 The Edit Scope Rule Window on page 132.
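
The interplay of URL pattern, match type, and action in the Scope pane can be sketched in Python as follows. This is an illustration under stated assumptions. Python's re module stands in for Teragram regular expressions, whose syntax may differ. The policy that a URL is in scope when some Allow rule matches and no Exclude rule matches is an interpretation of the text above. The rule list and function names are hypothetical.

```python
import re

# Hypothetical scope rules: (URL pattern, match type, action).
RULES = [
    ("http://example.com/", "prefix", "allow"),
    (r".*/private/.*", "regex", "exclude"),
]

def matches(url, pattern, match_type):
    """Test one rule. A prefix matches at the beginning of the URL."""
    if match_type == "prefix":
        return url.startswith(pattern)
    return re.fullmatch(pattern, url) is not None

def in_scope(url, rules=RULES):
    """Assumed policy: some Allow rule matches and no Exclude rule matches."""
    allowed = any(matches(url, p, t) for p, t, a in rules if a == "allow")
    excluded = any(matches(url, p, t) for p, t, a in rules if a == "exclude")
    return allowed and not excluded

print(in_scope("http://example.com/docs/index.html"))     # True
print(in_scope("http://example.com/private/notes.html"))  # False
print(in_scope("http://other.org/"))                      # False
```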

2.4.4.E The Filename Extensions Tab

Specify the file extensions that are excluded or included.

Note: If you specify one file type as included, only the specified file types are returned.

Display 2.8 Filename Extensions Pane

Extension

see the list of file types that are listed by their file extension.

Action

see the status of each file type, that is, whether this extension is excluded or allowed. If the file type is specified as Allow, the crawler can return this type of page. If you enable at least one type of file to be returned, only those files with the Allow operation are returned.

Add

access the Add Filename Extension window to add an extension that is allowed or prohibited. See Section 2.14.10 The Add Filename Extension Window on page 133.

Remove

delete the selected file type.

Edit

access the Edit Filename Extension window to make a change to the file extension or the operation. See Section 2.14.11 The Edit Filename Extension Window on page 134.
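
The note in this section, that specifying one file type as included means only the included types are returned, changes how the list is read. The Python sketch below illustrates that rule. The function name, the dictionary of rules, and the sample extensions are hypothetical, and the case-insensitive comparison is an assumption.

```python
import os

def extension_allowed(filename, rules):
    """Decide whether a file can be returned, given {extension: "allow" | "exclude"} rules."""
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    allowed = {ext for ext, action in rules.items() if action == "allow"}
    if allowed:
        # Once any extension is marked Allow, only allowed extensions are returned.
        return extension in allowed
    return rules.get(extension) != "exclude"

exclude_only = {"zip": "exclude", "exe": "exclude"}
with_allow = {"zip": "exclude", "html": "allow"}

print(extension_allowed("report.pdf", exclude_only))  # True
print(extension_allowed("tools.zip", exclude_only))   # False
print(extension_allowed("report.pdf", with_allow))    # False
print(extension_allowed("index.HTML", with_allow))    # True
```

In the second rule set, report.pdf is no longer returned even though pdf is not excluded, because an Allow entry is present.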


2.4.4.F The Credentials Tab

Specify the sign-in information that enables you to gain access to the specified password-protected sites.

Display 2.9 Credentials Pane

Site

see a list of Web sites that are password-protected.

Username

see the name of the user for each password-protected site.

Password

see the password assigned to each user. The password is the second component of the credentials required for HTTP authentication.

Add

access the Add Credential window to add a password-protected site to crawl. See Section 2.14.12 The Add Credential Window on page 135.

Remove

delete the selected site with its credentials.

Edit

access the Edit Credential window to make a change to the specifications for the password-protected site. See Section 2.14.13 The Edit Credential Window on page 136.
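
To show how a user name and password pair can serve as credentials for HTTP authentication, the Python sketch below builds the header value used by the HTTP Basic scheme. Whether the web crawler uses the Basic scheme, or another scheme such as Digest, is not stated in this guide, so treat the scheme as an assumption. The function name and the sample credentials are illustrative.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header from a credential pair."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

# Sample credentials, not taken from any real site.
print(basic_auth_header("Aladdin", "open sesame"))  # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```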


2.4.4.G The Log Tab

This pane provides information about the web crawler operations. For more information, see Section 2.3.2 The Log Tab on page 11.

2.5 Viewing the File Crawler Pane

2.5.1 Overview of the File Crawler Pane

The file crawler searches the files on your file system and returns these files. The operations that are available when you click the File Crawler tab are explained in the following subsections.

Display 2.10 File Crawler Pane

2.5.2 The Buttons

The buttons that are available in the File Crawler pane are the same buttons that are available for the web crawler. For more information, see Section 2.4.2 The Buttons on page 13.


2.5.3 The Status Tab

See whether the file crawler is running.

Display 2.11 Status Pane

2.5.4 The Configuration Tab

2.5.4.A The Main Tabs in the Configuration Tab

The file crawler does not run until it is configured. Specify the settings for the file crawler, the points where the crawler enters a file system, the limits of its crawl, and the file types that it returns.

Display 2.12 Configuration Tab

General Settings

specify how the file crawler runs, the date range for returned documents, and how XML files are handled. For more information, see Section 2.5.4.B The General Settings Pane on page 26.

Paths

specify the directories that the file crawler accesses. For more information, see Section 2.5.4.C The Paths Pane on page 28.

Paths to Exclude

exclude certain directories from the crawl. For more information, see Section 2.5.4.D The Paths to Exclude Pane on page 29.

Filename Extensions

(optional) specify the types of files that can be returned. If you specify any file types, only those types are returned. Leave this pane empty if you want the file crawler to return all file types. For more information, see Section 2.5.4.E The Filename Extensions Pane on page 30.

2.5.4.B The General Settings Pane

Specify the settings that determine the overall way that the file crawler runs.

Display 2.13 General Settings Pane

Maximum file size (megabytes)

(default is 10) click the up or down arrow to reset the limit for the size of each file.

Oldest date

click to access the calendar, where you can select the earliest date allowed for the files that the crawler returns.

Crawl continuously

(default setting is No) click to select Yes. If you leave the default selection, start the crawler yourself each time you update the files in your file system.

Encapsulate XML files

(default setting is No) click to select Yes and keep XML files intact throughout processing.

By default, XML files are passed by the pipeline server with top-level tags turned into similarly named fields in the document. If you want to exercise more control over these fields, set this specification to Yes. This setting enables you to turn nested tags into fields. When you make this selection, also specify the parse_xml document processor in the pipeline server.
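
The difference between turning only top-level tags into fields and turning nested tags into fields can be illustrated with a short Python sketch. This is not the product's code and does not reproduce the parse_xml document processor. The function names, the sample XML, and the dotted field-naming scheme are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def top_level_fields(xml_text):
    """Map each top-level child tag to one field holding its flattened text."""
    root = ET.fromstring(xml_text)
    # itertext() joins the text of nested tags, so nested structure is lost.
    return {child.tag: "".join(child.itertext()).strip() for child in root}


def nested_fields(xml_text, sep="."):
    """Map every leaf tag to a field named by its path below the root.

    The dotted naming scheme is hypothetical. Repeated tags overwrite
    one another in this simple sketch.
    """
    fields = {}

    def walk(node, name):
        children = list(node)
        if not children:
            fields[name] = (node.text or "").strip()
        for child in children:
            walk(child, name + sep + child.tag)

    for child in ET.fromstring(xml_text):
        walk(child, child.tag)
    return fields


# Hypothetical sample document.
SAMPLE = (
    "<doc><title>Report</title>"
    "<author><name>Ann</name><org>SAS</org></author></doc>"
)
```

With the sample above, the first function produces a single author field that runs the nested values together, while the second produces separate author.name and author.org fields.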

2.5.4.C The Paths Pane

See the addresses that the crawler can use to gather documents. You add these paths using the Add button.

Display 2.14 Paths Pane

Add

access the Add Path window. See Section 2.14.14 The Add Path Window on page 137.

Remove

delete the selected path.

Edit

access the Edit Path window to change the text that specifies an address. See Section 2.14.15 The Edit Path Window on page 138.

2.5.4.D The Paths to Exclude Pane

See the paths that are not crawled. Use this pane to identify exceptions to the list specified in the Paths pane. These exceptions are subdirectories or files, and they are not crawled.

Display 2.15 Paths to Exclude Pane

Add

access the Add Path to Exclude window. See Section 2.14.16 The Add Path to Exclude Window on page 138.

Remove

delete the selected path.

Edit

access the Edit Path to Exclude window. See Section 2.14.17 The Edit Path to Exclude Window on page 139.

2.5.4.E The Filename Extensions Pane

Specify the file extensions that can be returned. If you enter any file extensions, only files with those extensions are crawled. Leave this pane empty if you want the file crawler to return all file types.

Display 2.16 Filename Extensions Pane

Add

access the Add Filename Extension window. See Section 2.14.10 The Add Filename Extension Window on page 133.

Remove

delete the selected file type.

Edit

access the Edit Filename Extension window. See Section 2.14.11 The Edit Filename Extension Window on page 134.
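
Taken together, the file crawler settings act as a series of filters on each candidate file. The following Python sketch shows one way to read them. It is an illustration under stated assumptions, not the crawler's implementation: the function and parameter names are hypothetical, paths are treated as POSIX-style strings, a megabyte is taken to be 1024 x 1024 bytes, and the Oldest date setting is assumed to be compared with the file's modification date.

```python
from datetime import date
from pathlib import PurePosixPath


def is_under(path, root):
    """True if path equals root or lies somewhere below it."""
    path, root = PurePosixPath(path), PurePosixPath(root)
    return path == root or root in path.parents


def should_return(path, size_bytes, modified, *, paths, excluded=(),
                  extensions=(), max_megabytes=10, oldest=None):
    """Decide whether one file passes all of the configured limits."""
    if not any(is_under(path, p) for p in paths):
        return False                      # outside every entry in Paths
    if any(is_under(path, p) for p in excluded):
        return False                      # listed in Paths to Exclude
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    if extensions and suffix not in extensions:
        return False                      # an extension list is set and excludes it
    if size_bytes > max_megabytes * 1024 * 1024:
        return False                      # larger than Maximum file size
    if oldest is not None and modified < oldest:
        return False                      # older than Oldest date (assumed meaning)
    return True


# Hypothetical configuration and candidate file.
config = dict(paths=["/data/docs"], excluded=["/data/docs/tmp"],
              extensions={"pdf", "txt"}, oldest=date(2012, 1, 1))
accepted = should_return("/data/docs/a/report.PDF", 2_000_000,
                         date(2013, 5, 1), **config)
```

An empty extension list lets every file type through, which matches the instruction to leave the Filename Extensions pane empty when all file types are wanted.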

2.6 Viewing the Feed Crawler Pane

2.6.1 Overview of the Feed Crawler Pane

The feed crawler collects frequently updated documents on the Web. For example, the feed crawler collects pages from blogs and from forums. It can also collect pages such as press releases. Unlike the web crawler, the feed crawler collects only the Web documents that come from feeds.

The operations that are available when you click the Feed Crawler tab are explained in the following subsections.

Display 2.17 Feed Crawler Pane

2.6.2 The Buttons

The buttons that are available in the Feed Crawler pane are the same buttons that are available for the web crawler. For more information, see Section 2.4 Viewing the Web Crawler Pane on page 12.

2.6.3 The Status Tab

See whether the feed crawler is running.

Display 2.18 Status Pane

2.6.4 The Configuration Tab

2.6.4.A The Main Tabs in the Configuration Tab

The feed crawler does not run until it is configured. Specify the settings for the feed crawler, the URLs where this crawler enters the Internet, and the limits of its crawl.

Display 2.19 Configuration Pane

General Settings

specify how the feed crawler runs, the server, and other information necessary to the crawl. For more information, see Section 2.6.4.B The General Settings Tab on page 33.

Feeds

specify one or more feeds to crawl. For more information, see Section 2.6.4.C The Feeds Tab on page 34.

2.6.4.B The General Settings Tab

Use the General Settings tab for the feed crawler to configure the feed server and to specify information specific to the crawl.

Display 2.20 General Settings Tab

HTTP proxy

specify the server that you are accessing here, or click Auto-detect as explained below.

Auto-detect

access the Select an HTTP Proxy window where you can choose a proxy server or enter the address for this server. For more information, see Section 2.14.3 The Select an HTTP Proxy Window on page 120.

Crawl continuously

(default setting is Yes) click to select No. If you select Yes, fresh content is always available. For more information about this setting, see Section 7.2 Configure the Feed Crawler on page 218.

Recrawl interval

(default is 600) click the up or down arrow to select another wait time in seconds.

User agent

(default agent is SAS Feed Crawler) enter the name of a third-party feed crawler, if you choose.

2.6.4.C The Feeds Tab

Use the Feeds tab to specify the feed addresses, whether links are followed, and other information for the feed crawler.

Display 2.21 Feeds Tab

Feed URL

specify the Web address for a feed.

Follow Links

choose whether the feed crawler should crawl links in the selected feed. For more information, see Section 2.14.3 The Select an HTTP Proxy Window on page 120.

Add

access the Add Feed window where you can choose the address for the feed that you want to crawl. For more information, see Section 2.14.6 The Add Feed Window on page 126.

Remove

delete the selected feed URL.

Edit

access the Edit Feed window. See Section 2.14.7 The Edit Feed Window on page 129.
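
The effect of the Follow Links choice can be sketched in Python for a single RSS 2.0 feed. The sketch is illustrative only: it assumes that following links means fetching the page behind each item's link, which this guide does not spell out, and it performs no network access. The function name, the field names in the returned documents, and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET


def plan_feed_crawl(feed_xml, follow_links):
    """Return (documents, urls_to_fetch) for one RSS 2.0 feed document."""
    documents, urls_to_fetch = [], []
    for item in ET.fromstring(feed_xml).iter("item"):
        link = (item.findtext("link") or "").strip()
        documents.append({
            "title": (item.findtext("title") or "").strip(),
            "url": link,
            "body": (item.findtext("description") or "").strip(),
        })
        # Assumed meaning of Follow Links: also queue the linked page.
        if follow_links and link:
            urls_to_fetch.append(link)
    return documents, urls_to_fetch


# Hypothetical feed content.
FEED = """<rss version="2.0"><channel><title>News</title>
<item><title>First</title><link>http://example.com/1</link>
<description>One</description></item>
<item><title>Second</title><link>http://example.com/2</link>
<description>Two</description></item>
</channel></rss>"""

documents, urls_to_fetch = plan_feed_crawl(FEED, follow_links=True)
```

A crawler that runs continuously would repeat this step for each feed after the recrawl interval elapses.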

2.6.5 The Log Tab

See Section 2.3.2 The Log Tab on page 11.

2.7 Viewing the Proxy Server Pane

2.7.1 Overview of the Proxy Server Pane

The proxy server is an intermediary server. The proxy server takes the documents gathered by the crawler and sends them to the pipeline server for processing. As an intermediary server, the proxy server provides two benefits:

First, the proxy server enables you to pause the flow of documents. The incoming documents form a queue until the server is restarted. Use the pause operation to perform maintenance on the system without interrupting the crawlers.

Second, you can choose to specify more than one pipeline server. In this case, the pipeline servers run on different machines, and the proxy server sends a copy of each incoming document to each machine. In case of hardware failure, these servers serve as mirrors.

The operations that are available when you click the Proxy Server tab are explained in the following subsections.

Display 2.22 Proxy Server Pane

2.7.2 The Buttons

The Start, Stop, and Apply Changes buttons work for the proxy server like they work for the crawlers. For more information, see Section 2.4 Viewing the Web Crawler Pane on page 12.

Pause

stop the proxy server temporarily.

Resume

restart the proxy server operations.

2.7.3 The Status Tab

Use the Status tab to see whether the proxy server is running.

Display 2.23 Status Pane

Documents received

see the number of documents handed to the proxy server.

Documents processed

see the number of documents handled by the proxy server.

Documents queued

see the number of documents waiting to be processed by the proxy server.

Last document received

see the timestamp of the latest document that the proxy server accepted.

Last document processed

see the timestamp of the latest document that the proxy server handed to another server.
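
The relationship between pausing, queuing, and the counters on the Status tab can be modeled with a small Python sketch. It is a conceptual model, not the proxy server's implementation: the class and method names are hypothetical, and plain callables stand in for pipeline servers.

```python
from collections import deque


class ProxySketch:
    """Toy model of a proxy that queues documents and mirrors them."""

    def __init__(self, backends):
        self.backends = list(backends)  # callables standing in for pipeline servers
        self.queue = deque()
        self.paused = False
        self.received = 0               # Documents received
        self.processed = 0              # Documents processed

    @property
    def queued(self):                   # Documents queued
        return len(self.queue)

    def receive(self, document):
        self.received += 1
        self.queue.append(document)
        self._drain()

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self._drain()

    def _drain(self):
        # Forward queued documents in arrival order unless paused.
        while self.queue and not self.paused:
            document = self.queue.popleft()
            for send in self.backends:
                send(dict(document))    # each backend receives its own copy
            self.processed += 1


# Two lists stand in for a local pipeline server and its mirror.
primary, mirror = [], []
proxy = ProxySketch([primary.append, mirror.append])
proxy.receive({"id": 1})
proxy.pause()
proxy.receive({"id": 2})
proxy.receive({"id": 3})
counters_while_paused = (proxy.received, proxy.processed, proxy.queued)
proxy.resume()
```

While the model is paused, crawlers can keep handing over documents. The queue grows, and nothing is lost when forwarding resumes.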

2.7.4 The Configuration Tab

Use the Configuration pane to specify the pipeline server where the proxy server sends a copy of each input document. You can choose to specify several pipeline servers running on different ports.

You can use multiple pipeline servers for mirroring. To perform this operation, configure the mirror servers with the same specifications that you set for the local pipeline server.

You can also set up multiple pipeline servers to perform different sets of operations on your input documents.

Display 2.24 Configuration Pane

Host

see the name of the pipeline server. By default, when this server is running, the information for the local machine appears in the Configuration pane.

Port

see the number of the port where the pipeline server is running.

Status

see whether the pipeline server is running.

Add

click to access the Add Backend window. Here you can specify additional pipeline servers. For more information, see Section 2.14.20 The Add Backend Window on page 141.

Remove

delete the selected server.

Edit

click to access the Edit Backend window. Here you can change the pipeline server. For more information, see Section 2.14.21 The Edit Backend Window on page 142.

2.7.5 The Log Tab

This pane provides information about the proxy server operations. For more information, see Section 2.3.2 The Log Tab on page 11.

2.8 Viewing the Pipeline Server Pane

2.8.1 Overview of the Pipeline Server Tab

The pipeline server is used to analyze, modify, and export each document before it is sent to the indexer. The operations that are available when you click the Pipeline Server tab are explained in the following subsections.

Display 2.25 Pipeline Server Pane

2.8.2 The Buttons

The Start and Stop buttons work for the pipeline server in the same ways that they work for the crawlers. For more information, see Section 2.4.2 The Buttons on page 13.

Apply Changes

click if the indexing server is running. The indexing server is restarted, and the changes take effect for the new index.

2.8.3 The Status Tab

See whether the pipeline server is running and observe the document processing stages.

Display 2.26 Status Pane

Pipeline Stage

see the four stages of the pipeline server:

Overall

see the documents that have completed all of the document processing XML parsing stages.

Document processing

see how the document processors are acting on input documents.

Sending to the indexer

see the documents that are in the indexing process.

Pending

see the number of documents that are in the queue for each pipeline stage, with the exception of Overall.

Finished

see the number of documents that completed the processing operations in each stage.

Last busy time

see the latest processing date and time.

2.8.4 The Document Processors Tab

Document processors act on the documents in the pipeline. They analyze the data in input documents, perform various operations on the documents, and export the data. Each document consists of a set of field-value pairs. These operations are performed before the documents are added to the index. In some cases, document processors pass documents directly to another application, such as SAS Sentiment Analysis Workbench, for analysis.

Display 2.27 Document Processors Pane

Add

click to access the Add Document Processor window. For more information, see Section 2.13 The Add Document Processor Windows on page 73.

Remove

delete the selected processor.

Edit

access the Document Processor window that is specific to the selected operation. You can make some changes here. To make a more comprehensive set of changes including changes to the caption, see Section 13.3.5 Specify Labels for Facetted Search on page 321.

Move Up

reorder the document processing operations by moving the selected operation up one level.

Move Down

reorder the document processing operations by moving the selected operation down one level.

Note: Order the document processors according to the operations that are necessary to achieve the desired results. For more information, see Section 9.2 Configuring the Pipeline Server on page 235.
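
Why the order of document processors matters can be seen in a minimal Python sketch in which each processor receives and returns a set of field-value pairs. The two processors below are hypothetical stand-ins and are not processors that ship with the product.

```python
import re


def run_pipeline(document, processors):
    """Apply each processor in list order to a copy of the document."""
    for process in processors:
        document = process(dict(document))
    return document


def strip_markup(doc):
    # Hypothetical processor: crude tag removal, for illustration only.
    doc["body"] = re.sub(r"<[^>]+>", "", doc["body"])
    return doc


def count_words(doc):
    # Hypothetical processor: adds a field derived from the body field.
    doc["word_count"] = len(doc["body"].split())
    return doc


document = {"body": '<p class="x">Hello world</p>'}
clean_first = run_pipeline(document, [strip_markup, count_words])
count_first = run_pipeline(document, [count_words, strip_markup])
```

When markup is stripped first, the word count reflects only the visible text. When the count runs first, fragments of markup are counted as words, so the same two processors produce different field values.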

2.8.5 The Document Inspector Tab

The Document Inspector pane enables you to see all of the versions of the input document for each stage of the pipeline, simultaneously. At each stage of the pipeline, the original document changes, but you can still see its original text. This snapshot operation is available for one document at a time. It is available only when the documents are in the pipeline server.

Display 2.28 Document Inspector Pane

Take Snapshot

click this button to see the document in the various pipeline stages.

Cancel

click this button to stop the snapshot operation.

Processing Stage

see a document. Click on a document field to see the document number in the Document pane.

Document

see the number of this document. Click on the document number to see the field names for this document in the Field pane.

Field

see the field names in this document. Click on a field to see the contents of the selected document in the Document Inspector pane.

Document Inspector pane

see the contents of the chosen field of the selected document at a specific stage.
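
The idea behind a snapshot, one copy of the document's fields per stage that can then be browsed by stage and field, can be sketched in Python as follows. The sketch is conceptual and does not describe how the Document Inspector is implemented. The stage names, the helper functions, and the sample document are hypothetical.

```python
def take_snapshot(document, stages):
    """Return [(stage_name, fields_after_stage), ...], starting with the input."""
    current = dict(document)
    snapshot = [("input", dict(current))]
    for name, process in stages:
        current = process(dict(current))
        snapshot.append((name, dict(current)))
    return snapshot


def field_at(snapshot, stage_name, field):
    """Look up one field's contents as they were after the named stage."""
    for name, fields in snapshot:
        if name == stage_name:
            return fields.get(field)
    raise KeyError(stage_name)


# Hypothetical stages: each takes and returns a dict of field-value pairs.
stages = [
    ("lowercase_title", lambda d: {**d, "title": d["title"].lower()}),
    ("add_length", lambda d: {**d, "length": len(d["body"])}),
]
snapshot = take_snapshot({"title": "Annual Report", "body": "Sales rose."}, stages)
```

Because every stage keeps its own copy, the original text stays visible even after later stages change or add fields.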

2.8.6 The Log Tab

This pane provides information about the pipeline server operations. For more information, see Section 2.3.2 The Log Tab on page 11.

2.9 Viewing the Indexing Server Pane

2.9.1 Overview of the Indexing Server Pane

The indexing server builds a searchable index of the input documents. If you want to enable end users to search the collected documents, build an index.

Each document in the index consists of a set of field-value pairs. These fields are populated when they are matched to similarly named fields in the documents passed to the indexing server by the pipeline server. You can build one index at a time from the documents provided by the continuously running crawlers. Use the different types of index fields for querying.
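
How index fields are populated from similarly named document fields can be pictured with a toy inverted index in Python. This is a simplification for illustration, not the indexing server's data structure. The class name and methods are hypothetical. The field names title, date, and body are the defaults listed for the Configuration tab.

```python
from collections import defaultdict

# Default field names listed for the indexing server's Configuration tab.
INDEX_FIELDS = ("title", "date", "body")


class IndexSketch:
    """Toy index that keeps only the configured fields of each document."""

    def __init__(self, field_names=INDEX_FIELDS):
        self.field_names = tuple(field_names)
        self.documents = []
        self.postings = defaultdict(set)  # (field, term) -> document ids

    def add(self, document):
        doc_id = len(self.documents)
        # Only fields whose names match the index configuration are kept.
        stored = {f: document[f] for f in self.field_names if f in document}
        self.documents.append(stored)
        for field, value in stored.items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def lookup(self, field, term):
        return sorted(self.postings.get((field, term.lower()), ()))


index = IndexSketch()
first_id = index.add({"title": "Quarterly Report", "body": "sales rose",
                      "author": "Ann"})
```

In this sketch, the author field is dropped because no index field has that name. A field that should be searchable has to be added to the index configuration before the index is built.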

Display 2.29 Indexing Server Pane

2.9.2 The Buttons

The Start, Stop, and Apply Changes buttons that are available in the indexing server pane are the same buttons that are available for the web crawler. For more information, see Section 2.4.2 The Buttons on page 13.

Delete Index

remove the existing index. A new index can be built with the new configuration after you restart the crawler.

Revert

click when the indexing server is running and the existing index is deleted. A new index can be built when new documents are input.

2.9.3 The Status Tab

See whether the indexing server is running.

Display 2.30 Indexing Server Status Pane

2.9.4 The Configuration Tab

The indexing server can be reconfigured according to your specifications.

Display 2.31 Indexing Server Configuration Pane

Field Name

add to, or delete from, the list of field names entered by default. The default list includes title, date, and body.

Functionality

add to, or delete from, the list of uses for these fields. For more information, see Section 2.14.22 The Add Field Window for the Indexing Server on page 143.

Add

access the Add Field window where you add fields to the index according to the purpose that they are intended to serve. For more information, see Section 2.14.22 The Add Field Window for the Indexing Server on page 143.

Remove

delete the selected field from the index configuration. Use this button to change the configuration of the next index that is built. Any changes do not affect the current index.

Edit

access the Edit Field window where you make changes to the fields that you added to the index. For more information, see Section 2.14.22 The Add Field Window for the Indexing Server on page 143.

Language

optimize the index for the selected language. Click to select another language.

2.9.5 The Log Tab

This pane provides information about the indexing server operations. For more information, see Section 2.3.2 The Log Tab on page 11.

2.10 Viewing the Query Server Pane

2.10.1 Overview of the Query Server Tab

The query server uses the index built by the indexing server to locate matching documents in response to queries. To pass queries to the query server, use the query web server, which is one of the components of SAS Information Retrieval Studio, or use a custom application that you write with the query API. For more information about the query API, see the SAS Search and Indexing: User and C and Java API Guide.

Display 2.32 Query Server Pane

2.10.2 The Buttons

The Start and Stop buttons work for the query server like they work for the crawlers. For more information, see Section 2.4.2 The Buttons on page 13.

2.10.3 The Status Tab

See whether the query server is running.

Display 2.33 Query Server Status Pane

2.10.4 The Log Tab

This pane provides information about the query server operations. For more information, see Section 2.3.2 The Log Tab on page 11.

2.11 Viewing the Query Web Server Pane

2.11.1 Overview of the Query Web Server Tab

Use the query web server to format the search window that the end user sees. Also configure this server to display the search results with or without labels, and to specify other ways of rendering the search results.

Display 2.34 Query Web Server Pane

2.11.2 The Buttons

The Start, Stop, and Apply Changes buttons work for the query web server like they work for the crawlers. For more information, see Section 2.4.2 The Buttons on page 13.

Revert

restore the default settings.

2.11.3 The Status Tab

See whether the query web server is running.

Display 2.35 Query Web Server Pane

2.11.4 The Configuration Tab

2.11.4.A The Main Tabs in the Configuration Tab

Specify the settings for the query web server in the Configuration tab. You configure the query web server when you specify the parser, the processors, and the order for document processing.

Display 2.36 Configuration Pane

Server Port

(default is 9100) click the up or down arrow to select another port where the query web server runs.

Matching tab

specify the types of searches that the end user can input, the fields the user can specify, and the weight of each field. For more information, see Section 2.11.4.B The Matching Tab on page 51.

Sorting tab

specify how the matching documents are ranked and their parameters. For more information, see Section 2.11.4.C The Sorting Tab on page 53.

Labels tab

specify labels when you choose to enable facetted search. Facetted search uses a web-like system of related labels to enable users to intuitively locate the results that they seek. For more information, see Section 2.11.4.D The Labels Tab on page 56.

Match Formatting tab

specify how documents that match the query are displayed in the list of results. For more information, see Section 2.11.4.E The Match Formatting Tab on page 58.

Theme tab

specify the look and feel of the query interface. For more information, see Section 2.11.4.F The Theme Tabs on page 60.

2.11.4.B The Matching Tab

Use the Matching pane to specify the types of searches users can use to query the index. Before you specify the fields in the index that can be searched and the weight of the matches returned, consider the simple and advanced search types.

Select the simple, or fsearch, query syntax, and the user enters words and phrases. The user can also mark these words and quoted phrases as required or excluded by prefixing them with plus (+) or minus (-) signs.

Select the advanced, or bsearch, query syntax, and the user enters field names as part of a query expression. The words and phrases can be combined with Boolean, positional, and counting operators.

For more information about searching, see the SAS Information Retrieval Studio: User’s Guide.
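
The simple query style described above, words and quoted phrases with optional plus or minus prefixes, can be illustrated with a small Python parser. This is a sketch of the idea only. It is not the fsearch grammar, and the function name and the regular expression are assumptions made for illustration.

```python
import re

# An optional sign, then either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_simple_query(query):
    """Split a query into required, excluded, and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase or word
        if not term:
            continue  # skip empty quoted phrases
        key = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[key].append(term)
    return parsed


parsed = parse_simple_query('+crawler -"web server" index "file system"')
```

For the sample query, crawler is required, the phrase web server is excluded, and index and the phrase file system are optional.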

Display 2.37 Matching Pane

Search type

(default setting is Simple) click to select Advanced.

Field Name

(default is Body) see the names of the fields to search with input queries.

Weight

(default is 1) see the scaling factor that sets the importance of one field relative to another.

Add

access the Add Field window. For more information, see Section 2.14.23 The Add Field Window for the Query Web Server Matching Pane on page 146.

Remove

delete the selected field.

Edit

access the Edit Field window. For more information, see Section 2.14.24 The Edit Field Window for the Query Web Server Matching Pane on page 147.

2.11.4.C The Sorting Tab

Use the Sorting tab to specify how documents are ranked when multiple matches are returned to a query. The fields in the Sorting pane change according to the Sort type selection that you choose. At this time, only the combinations that appear based on specific selections are possible. The default selection, Relevancy, is shown below:

Display 2.38 Sorting Pane

To use the components of this pane, see the following table:

Table 2-3: Sorting Pane Components

Component Selection Description

Sort type

Specify how the matching documents are ranked according to the metrics that you choose, for example, Cosine Weight (the only metric that is part of relevancy by default) and Freshness Weight.

Relevancy

Use the weight of specified fields in combination with the following metrics to determine the best returns. These metrics are Cosine Weight, Proximity Weight, Position Weight, Density Weight, and Freshness Weight.

Number of matching terms

Returns the matching document with the highest number of terms that match those in the query syntax. You can select a tiebreaker to determine a match when two or more documents meet this threshold.

Number of matching fields

Returns the matching document with the highest number of matching fields. You can also choose a tiebreaker in cases where there are two, or more, matching documents.

Date (newest first)

Returns the matching document with the most recent date. The tiebreaker is Order added to the index.

Date (oldest first)

Returns the matching document with the earliest date. The tiebreaker is Order added to the index.

Field value (largest first)

Returns the matching documents with the highest numeric value stored in a field, for example, price or rating. You determine the field in Sort field. Choose a tiebreaker for two or more matching documents.

Field value (smallest first)

Returns the matching documents with the lowest numeric value stored in a field, for example, price or rating. You determine the field in Sort field. Choose a tiebreaker for two or more matching documents.

Field value (alphabetical)

Returns the matching documents with matches in alphabetical order. You determine the field in Sort field. Choose a tiebreaker for two, or more, matching documents.

Order added to the index

Select to return first the matching document that was added to the index first. This applies when two or more documents meet the match requirements.

Tiebreaker

Click to make a different selection. See the relevant Sort type above. There can be as many as three tiebreaker fields, depending on the selection that you make in the Sort type field.

Specify the following weights according to your priorities. For example, if density is more important than any of the other metrics, specify the highest number for Density Weight.

Cosine Weight

(Default: 1) Click the up or down arrow to change this metric that weights frequently occurring terms more highly than those that are infrequent. This metric also takes noise words into consideration. (Noise words are the words that appear with enough frequency that they are ranked down.)

Proximity Weight

(Default: 0) Click the up or down arrow to specify how to weight matching query terms that are located close together in the document.

Position Weight

(Default: 0) Click the up or down arrow to change the weight assigned to matches on words located close to the beginning of the document.

Density Weight

(Default: 0) Click the up or down arrow to change the metric that balances the number of matched query terms with the total number of words in the matching document. The number of match instances is measured as a percentage of the document.

Freshness Weight

(Default is 0) Click or to change the number that determines the age of the matching document. This metric combines several factors besides the age of the document.
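The following Python sketch illustrates how weights and a tiebreaker might interact. It assumes that relevance is a weighted sum of per-metric scores, which this guide does not confirm. The Match class, the price field, the metric scores, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical weights keyed by the metric names in the Sorting pane.
# The defaults mirror the table: cosine is 1, the others are 0.
DEFAULT_WEIGHTS = {
    "cosine": 1.0,
    "proximity": 0.0,
    "position": 0.0,
    "density": 0.0,
    "freshness": 0.0,
}


@dataclass
class Match:
    name: str
    price: float                                 # example numeric sort field
    metrics: dict = field(default_factory=dict)  # per-metric scores


def relevance(match, weights=DEFAULT_WEIGHTS):
    """Weighted sum of metric scores (an assumed combination rule)."""
    return sum(w * match.metrics.get(k, 0.0) for k, w in weights.items())


def sort_largest_first(matches, weights=DEFAULT_WEIGHTS):
    """Sort by field value (largest first); break ties by relevance."""
    return sorted(matches, key=lambda m: (-m.price, -relevance(m, weights)))


docs = [
    Match("a", 10.0, {"cosine": 0.2, "density": 0.9}),
    Match("b", 10.0, {"cosine": 0.8, "density": 0.1}),
    Match("c", 25.0, {"cosine": 0.1}),
]
# Default weights: the tie at price 10 is broken by the cosine score.
print([m.name for m in sort_largest_first(docs)])
# A high density weight reverses the order of the two tied documents.
print([m.name for m in sort_largest_first(docs, {"cosine": 1.0, "density": 5.0})])
```

In this sketch, raising one weight above the others changes only the order of documents that tie on the sort field, which matches the role of a tiebreaker described above.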


2.11.4.D The Labels Tab

Specify labels that enable your end users to use facetted search to intuitively locate the information that they seek. For more information about facetted search, see Section 3.7 Defining Labels for Facetted Search on page 163.

Display 2.39 Labels Pane

Field Name

see the names of the fields that you entered with the Add button.

Caption

see the label that you added with the Add button.

Add

access the Add Field window. For more information, see Section 2.14.25 The Add Field Window: Query Web Server Labels Pane on page 148.

Remove

delete the selected field.


Edit

access the Edit Field window. For more information, see Section 2.14.26 The Edit Field Window: Query Web Server Labels on page 150.

Move Up

change the location of the selected field. Click to move the selected field up one level in the display shown on the search results page.

Move Down

change the location of the selected field. Click to move the selected field down one level in the hierarchical taxonomy.

Maximum number of related labels

leave the default setting of 10, or specify a new maximum number of labels that can be displayed in response to a query.


2.11.4.E The Match Formatting Tab

Specify how matching documents are displayed in the results list.

Display 2.40 Match Formatting Pane

Use the components of this pane as explained below:

Table 2-4: Match Formatting Pane Components

Component Description

Title source

(Default: Text field) Click to select HTML field or None. Use the title fields in this pane to identify the type of field where the title of the document is located in the input document. Select None if you do not want to use the title fields for search. (When you select None, the Title field disappears.)

Title field

(Default: title) Click to select a different info field in the index.


Use filename when document has no title

(Default: Yes) Click to select No.

Abstract source

(Default: Concordance) Click to select Text field or HTML field. Use the abstract source fields to locate the summary of the input document. Use the Concordance selection if you want to enable hit highlighting. Hit highlighting bolds the matched query term in an input document.

Link source

(Default: Text field) Click to select None, URL, or HTML field. The link fields specify the type of fields that provide a path to the input text.

Link prefix

Modify the URL at display time for the purposes of passing an argument from your own CGI script.

Link suffix

Modify the URL at display time for the purposes of passing an argument from your own CGI script.

Note: For more information about link suffixes and prefixes, see Section 13.3.6 Specify the Formatting for the Matches on page 326.

Add keywords to PDF links

(Default: Yes) Click to select No. Modify the URLs of PDF files to instruct Adobe Reader to highlight the search terms in an input document. This operation functions like a concordance, but works for the entire document, not only the abstract in the results list.

MIME type source

(Default: None) Click to select Text field. This field specifies where the name for the document format is located.

Date source

(Default: None) Click to select Date or Text field. This field is used to locate the source of the document date.
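The link settings in this table can be pictured with a short Python sketch. It rests on two assumptions that this guide does not confirm: that the link prefix and suffix are added verbatim before and after the stored URL, and that keywords are added to PDF links through the search open parameter that Adobe Reader accepts in a URL fragment. The function names and the CGI path are hypothetical.

```python
from urllib.parse import quote


def format_link(url, prefix="", suffix=""):
    """Wrap a stored URL with a prefix and a suffix at display time."""
    return f"{prefix}{url}{suffix}"


def add_pdf_keywords(url, terms):
    """Append a search fragment so that a PDF viewer can highlight terms."""
    if not url.lower().endswith(".pdf") or not terms:
        return url                      # leave non-PDF links unchanged
    words = quote(" ".join(terms))      # spaces become %20
    return f'{url}#search="{words}"'


# Route a result link through a hypothetical tracking script.
print(format_link("http://example.com/a.html",
                  prefix="/cgi-bin/track.cgi?target=",
                  suffix="&src=search"))

# Ask the PDF viewer to highlight the query terms in the whole document.
print(add_pdf_keywords("http://example.com/doc.pdf", ["heart", "rate"]))
```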


2.11.4.F The Theme Tabs

Specify the settings for the query web server user interface in the Theme tab and its related tabs. For more information about formatting the search window, see www.w3.org.

Display 2.41 Theme Pane

Title

(default is SAS Information Retrieval Studio) enter a new name to change the name that appears in the search window that end users see when they query the index.

Font

(default is sans-serif) enter a new font type into this field to change the display look of the title. For example, enter Times New Roman.


Font size

(default is 10) change the size of the display letters.

Use pop-up menus

(default setting is Yes) click to select No. Select No to disable the pop-up menus functionality for older browsers that do not support JavaScript.

Colors tab

specify the colors for the user interface. (See www.w3.org for more information.)

Display 2.42 Colors Pane


Use the operations in this pane as follows:

Table 2-5: Colors Pane Components

Component Description

Header background color

(Default: Custom) Click to select an image. You can also open the Color Box window to change the header color.

Note: For more information about the Color Box window, see Section 2.14.27 The Color Box Window on page 151.

Header text color

(Default: Custom) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. You can also open the Color Box window to select the text color for headers.

Link color

(Default: Custom) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. You can also open the Color Box window to select the color of the links.

Visited Link color

(Default: Custom) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. You can also open the Color Box window to select the color for your visited links.


Hover Link color

(Default: Custom) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. You can also open the Color Box window to select the color of the links when a user moves the cursor over them.

Menu border color

(Default: WindowFrame) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. This selection specifies how the color is applied to menus.

Menu unselected background color

(Default: Window) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. This selection specifies how the color is applied to menus.

Menu unselected text color

(Default: WindowText) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. This selection specifies how the color is applied to unselected text in menus.

Menu selected background color

(Default: Highlight) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. This selection specifies how the color is applied to the background of menus.


Menu selected text color

(Default: HighlightText) Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. This selection specifies how the color is applied to the text in menus.

Reset to Default button

Click to restore the default settings.

Images tab

load the images that you plan to use for the search window into the work/query-web-server subdirectory of your installation directory.

Left header image

(default is None) click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.

Right header image

(default is sas.png) click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.


2.11.5 The Log Tab

This pane provides information about the query web server operations. For more information, see Section 2.3.2 The Log Tab on page 11.

2.12 Viewing the Query Statistics Server Pane

2.12.1 Overview of the Query Statistics Server Pane

The query statistics server enables you to see the query terms that were submitted and information about when these terms were entered.

Display 2.43 Query Statistics Server Pane

2.12.2 The Buttons

The Start and Stop buttons work for the query statistics server as they do for the crawlers. For more information, see Section 2.4.2 The Buttons on page 13.

2.12.3 The Status Tab

See whether the query statistics server is running.


Display 2.44 Query Statistics Server Status Pane

2.12.4 The Query Statistics Tab

2.12.4.A The Buttons

Specify the settings for the query statistics server in the Query Statistics tab. You can see the various query analytics run by SAS Information Retrieval Studio when you use this pane.

Display 2.45 Query Statistics Pane

Use the components of this pane as follows:

Table 2-6: Query Statistics Pane Components

Component Description

Today

Click to see the date of the current day in the Year, Month, and Day fields.

This Month

Click to see the current month in the Month field and the current day in the Day field. The Year field is unavailable.

This Year

Click to see the current year in the Year field. The Day and Month fields are inaccessible.


All Time

Click to see the statistics for all time. The Year, Month, and Day fields and the Previous and Next buttons become inaccessible.

Year field

Click to select a year. You can select any year back to 1980, or leave the default selection (--).

Month field

Click to select a month.

Day field

Click to select a day.

Previous

Click to select the preceding date. For example, if you selected 2010, 2009 appears in the Year field.

Next

Click to select the following date. For example, if you selected 2010, 8, and 20, the next day, 21, appears in the Day field.


2.12.4.B The Most Frequent Queries Tab

Use this pane to see the terms that were searched most often during the selected time period.

Display 2.46 Most Frequent Queries Pane

Query

see a list of the submitted search terms, ranked from the most to the least frequently submitted.

Number of Occurrences

see the total number of times that each query was submitted.


2.12.4.C The Most Frequent Queries without Matches Tab

Use this pane to see the terms that were searched most often during the selected time period but were not matched in the input documents.

Display 2.47 Most Frequent Queries without Matches Pane

Query

see a list of the input search terms that are unmatched in the index. This list is ordered from the highest to the lowest number of entries.

Number of Occurrences

see the total number of times that each query was submitted.


2.12.4.D The Hourly Query Rate Tab

See the number of queries submitted during each hour of the day for the selected time period.

Display 2.48 Hourly Query Rate Pane

Hour

see each of the 24 hours.

Number of Queries

see the total number of queries submitted each hour.


2.12.4.E The Daily Query Rate Tab

See the number of queries submitted for each day of the week in this pane.

Display 2.49 Daily Query Rate Pane

Day

see a list of the seven days of the week.

Number of Queries

see the total number of queries submitted each day.


2.12.4.F The Monthly Query Rate Tab

See the number of queries submitted for each month of the year in this pane.

Display 2.50 Monthly Query Rate Pane

Month

see a list of the 12 months of the year.

Number of Queries

see the total number of queries submitted each month.
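The statistics shown in the Query Statistics tabs can be pictured as simple counts over a query log. The Python sketch below is an illustration only: the log layout, the sample entries, and the function names are hypothetical, and the product's own storage and computation are not documented here.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical query log: (timestamp, query text, number of matches).
LOG = [
    (datetime(2010, 8, 20, 9, 15), "sas", 12),
    (datetime(2010, 8, 20, 9, 40), "text mining", 0),
    (datetime(2010, 8, 21, 14, 5), "sas", 7),
    (datetime(2010, 9, 1, 9, 30), "text mining", 0),
    (datetime(2010, 9, 1, 16, 0), "sas", 9),
]


def most_frequent(log, only_unmatched=False):
    """Queries ranked from the most to the least frequently submitted."""
    counts = Counter(
        query for _, query, hits in log
        if not only_unmatched or hits == 0
    )
    return counts.most_common()


def rate_by(log, part):
    """Number of queries per hour of the day, day of the week, or month."""
    keys = {
        "hour": lambda t: t.hour,
        "day": lambda t: DAYS[t.weekday()],
        "month": lambda t: t.month,
    }
    return dict(Counter(keys[part](stamp) for stamp, _, _ in log))


print(most_frequent(LOG))                       # Most Frequent Queries
print(most_frequent(LOG, only_unmatched=True))  # ... without Matches
print(rate_by(LOG, "hour"))                     # Hourly Query Rate
print(rate_by(LOG, "day"))                      # Daily Query Rate
print(rate_by(LOG, "month"))                    # Monthly Query Rate
```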

2.12.5 The Log Tab

This pane provides information about the query statistics server operations. For more information, see Section 2.3.2 The Log Tab on page 11.


2.13 The Add Document Processor Windows

2.13.1 Overview of Document Processor Windows

Use the Add Document Processor window to perform a processing operation on input documents. For example, use these windows to remove mark-up tags, categorize, extract concepts, and so on.

You can add more than one processor in order to perform several types of operations on a single input document. The document processors act on the incoming documents in the order specified.

Note: The order of the document processing operations is important. For this reason, perform operations such as parse_html and heuristic_parse_html before operations such as categorizer.
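The effect of processor order can be shown with a toy pipeline in Python. The two functions below are crude stand-ins written for this illustration and are not the product's processors; they share only their names with the operations mentioned in the note.

```python
import re


def parse_html(doc):
    """Strip mark-up tags from the body (a crude stand-in)."""
    doc = dict(doc)
    doc["body"] = re.sub(r"<[^>]+>", " ", doc["body"])
    return doc


def categorizer(doc):
    """Toy categorizer: tags a document when 'table' occurs in the body."""
    doc = dict(doc)
    doc["categories"] = ["Furniture"] if "table" in doc["body"].lower() else []
    return doc


def run_pipeline(doc, processors):
    """Apply each processor in the order listed."""
    for process in processors:
        doc = process(doc)
    return doc


page = {"body": "<table><tr><td>Quarterly sales</td></tr></table>"}

# Tags are removed first, so the categorizer sees only the text.
print(run_pipeline(page, [parse_html, categorizer])["categories"])  # []

# Wrong order: the categorizer matches the <table> tag itself.
print(run_pipeline(page, [categorizer, parse_html])["categories"])  # ['Furniture']
```

When the categorizer runs before the mark-up is removed, it can match words inside the tags, which is why the parsing operations belong earlier in the list.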

2.13.2 Access Document Processor Window

To access a Document Processor window, complete these steps:

1. Select Pipeline Server --> Configuration --> Document Processors.

2. Click Add in the Document Processors pane for the pipeline server. The Add Document Processor window appears.


3. Select one of the following processors:

add_field

add a new field with the value that you specify to each input document. This field has one name and one value that is the same for every indexed document. For more information, see Section 2.13.3 The Document Processor: add_field Window on page 77.

content_categorization

match categories, concepts, and facts in input documents. For more information, see Section 2.13.4 The Document Processor: content_categorization Wizard on page 78. Use this document processor to create labels to use in facetted search. You can see these labels in the labels pane in the Query Web Server Configuration pane.

default_mime_type_from_url

return the document type that is located in the address fields of input documents. For more information, see Section 2.13.5 The Document Processor: default_mime_type_from_url Window on page 95.

default_title_from_url

return the document title from the Web address of any input documents. For more information, see Section 2.13.6 The Document Processor: default_title_from_url Window on page 95.

document_converter

deploy SAS Document Conversion to change incoming files, such as Adobe PDF and Microsoft Office documents, into text. For more information, see Section 2.13.7 The Document Processor: document_converter Window on page 96.

export_csv

use this selection to transfer document text into a comma-separated format that can be exported into another program such as SAS Text Miner or Excel. For more information, see Section 2.13.8 The Document Processor: export_csv Window on page 97.

export_to_files

choose to save each document to a separate file. The name of the file is based on a hash of its contents. For more information, see Section 2.13.9 The Document Processor: export_to_files Window on page 100.

export_to_odbc

Use the Document Processor: export_to_odbc window to send the documents directly to a database. For more information, see Section 2.13.10 The Document Processor: export_to_odbc Window on page 102.

export_to_sas_sentiment_analysis_workbench

send the document to SAS Sentiment Analysis Workbench. For more information, see Section 2.13.11 The Document Processor: export_to_sentiment_analysis_workbench Window on page 104.

extract_abstract

generate an abstract for the document based on the first 25 to 50 words in the body of the input text. For more information, see Section 2.13.12 The Document Processor: extract_abstract Window on page 106.

extract_pdate

normalize the date to a specific format that is understood by SAS Search and Indexing. For more information, see Section 2.13.13 The Document Processor: extract_pdate Window on page 107.

heuristic_parse_html

separate the body of an HTML document from its tags. This is a more advanced version of parse_html. heuristic_parse_html searches for paragraphs of text without many links and extracts these bodies of text. For more information, see Section 2.13.14 The Document Processor: heuristic_parse_html Window on page 108.

invalidate_duplicates_by_url

stop more than one document with the same Web address from being returned. For more information, see Section 2.13.15 The Document Processor: invalidate_duplicates_by_url Window on page 110.

match_and_copy

This is similar to the substitute document processor. Use the match_and_copy document processor to write the output to a different field than the input field. For more information, see Section 2.13.16 The Document Processor: match_and_copy Window on page 110. Also see Section 2.13.22 The Document Processor: substitute Window on page 116.

modify_field_name

change the name of a field in an input document. For more information, see Section 2.13.17 The Document Processor: modify_field_name Window on page 112.

parse_html

separate the text from the HTML mark-up tags. For more information, see Section 2.13.18 The Document Processor: parse_html Window on page 112 and strip_html below.

Use this operation when you have an HTML document and you want to extract the body of this document, and possibly the metadata.

parse_xml

separate the text from the XML mark-up tags. For more information, see Section 2.13.19 The Document Processor: parse_xml Window on page 114.

send

save each input document to each pipeline server. For more information, see Section 2.13.20 The Document Processor: send Window on page 115.

strip_html

return only the text without the HTML mark-up tags. For more information, see Section 2.13.21 The Document Processor: strip_html Window on page 116 and parse_html above.

Use strip_html when you have a field that contains some HTML code that you want to convert into plain text. For example, use it when input XML documents contain HTML code.

substitute

exchange all occurrences of one term, tag, or other attribute for another. For more information, see Section 2.13.22 The Document Processor: substitute Window on page 116.

4. Click OK and the appropriate Document Processor window appears. For more information, see Section 2.13 The Add Document Processor Windows on page 73. The selected operation appears in the Document Processors pane.
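Three of the processors listed above can be pictured in a few lines of Python. These are sketches of the described behavior, not the product's code. The hash algorithm (SHA-1), the .txt extension, the 50-word limit, and the rule that the first document for a Web address is the one kept are all assumptions made for this illustration.

```python
import hashlib


def export_filename(text, extension=".txt"):
    """Name a file after a hash of its contents (SHA-1 is an assumption)."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return digest + extension


def extract_abstract(body, max_words=50):
    """Use the leading words of the body as the abstract."""
    return " ".join(body.split()[:max_words])


def invalidate_duplicates_by_url(docs):
    """Keep only the first document seen for each Web address."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["url"] not in seen:
            seen.add(doc["url"])
            unique.append(doc)
    return unique


docs = [
    {"url": "http://example.com/a", "body": "First version of the page."},
    {"url": "http://example.com/a", "body": "Second copy of the page."},
    {"url": "http://example.com/b", "body": "Another page."},
]
for doc in invalidate_duplicates_by_url(docs):
    print(export_filename(doc["body"]), extract_abstract(doc["body"], 3))
```

Because the file name depends only on the contents, two documents with identical text map to the same file name in this sketch.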

2.13.3 The Document Processor: add_field Window

Use the Document Processor: add_field window to add a field with a constant string to each document passed to the index. For example, use this operation to assign each indexed document an identifier that marks it as part of a specific collection of documents.

To use the Document Processor: add_field window, complete these steps:

1. Select add_field in the Add Document Processor window and click Next. The first Document Processor: add_field window appears.

2. Enter a field name into field. For example, enter corporate_documents.

Note: Field names can be entered only in lowercase letters.

3. Enter a value into value. For example, enter MyCompanyName.

4. Click Finish.
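The result of these steps can be pictured with a minimal Python sketch. The function is a hypothetical stand-in for the add_field processor; it uses the example field name and value from the steps above and enforces the lowercase rule from the note.

```python
def add_field(doc, name, value):
    """Return a copy of doc with one constant field added."""
    if name != name.lower():
        raise ValueError("field names must be lowercase")
    return {**doc, name: value}


docs = [{"title": "Annual report"}, {"title": "Press release"}]
tagged = [add_field(d, "corporate_documents", "MyCompanyName") for d in docs]

# Every document now carries the same field name and the same value.
for doc in tagged:
    print(doc)
```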


2.13.4 The Document Processor: content_categorization Wizard

2.13.4.A Overview of the content_categorization Document Processor

Use the Document Processor: content_categorization window to specify the categories, concepts, and facts that can be matched in indexed documents. You can also specify the labels that are used for facetted search using this wizard. These processors automatically populate the index and the query web server components with index fields and labels for facetted search. You can specify these labels or use the default settings.

2.13.4.B Configure SAS Content Categorization Server

The categories, concepts, and facts are applied by SAS Content Categorization Server. For this reason, you configure the server to work with the content_categorization Document Processor.

To configure SAS Content Categorization Server, complete these steps:

1. Select content_categorization in the Add Document Processor window and click Next. The Document Processor: content_categorization window appears.

2. (Optional) By default, the name of the server where SAS Content Categorization Server is running is specified in the Hostname field. For example, see localhost. You can enter a different server name if SAS Content Categorization Server is running on another server.


3. (Optional) By default, the port number for the specified server is entered in Port. For example, see 6500. You can select a different port number.

4. (Optional) By default, 10 is entered into Timeout. You can select a different number. This is the number of seconds that the Pipeline Server waits before it stops the download process.

5. Click Next to save these settings.
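Before you save these settings, it can help to confirm that something is listening at the configured host and port. The Python sketch below is a generic TCP check and not a SAS client API; it only opens and closes a connection. The function name is hypothetical, and the defaults repeat the example values from the steps above.

```python
import socket


def is_reachable(hostname="localhost", port=6500, timeout=10):
    """Return True if a TCP connection opens within timeout seconds."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unknown hosts, and timeouts.
        return False


if __name__ == "__main__":
    print(is_reachable("localhost", 6500, timeout=10))
```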


2.13.4.C Specify the Projects

SAS Content Categorization Server applies categories, concepts, and facts to documents in the SAS Information Retrieval Studio Pipeline Server.

To specify the projects, complete these steps:

1. The Document Processor: content_categorization window appears. This window lists the projects and their types. The categories, concepts, and facts that are applied by the pipeline server are limited to those that are specified in the projects that you specify.

2. Click Add. The Document Processor: content_categorization window appears.

3. (Optional) By default, Categorization is selected in Type. This is true if categories are part of the taxonomy in one of the SAS Content Categorization Studio projects that you uploaded to SAS Content Categorization Server. Alternatively, Concept extraction or Contextual extraction is selected. Click to make a different selection.

4. (Optional) By default, a project is selected. For example, see Lifestyle. Click to select a different project. For example, select Sports. This selection limits the available categories, concepts, and facts to those in the project.

5. Click OK and the project appears in the Document Processor: content_categorization window.

6. (Optional) Repeat Step 2 through Step 5 above until you have added all of your projects.

7. Click Next to save your changes.


2.13.4.D Specify Input

SAS Content Categorization Server applies the categories, concepts, and facts to input fields. These matches are labelled or exported as output.

To specify the input and output processing, complete these steps:

1. After you complete Step 7. on page 81, the Document Processor: content_categorization wizard appears.

2. (Optional) By default, Input Fields is blank. Enter any field names that you want to search for matches for your categories, concepts, and facts. If you leave this field blank, all of the fields are searched with the exception of any fields entered into Input fields to exclude.

3. (Optional) By default, Input Fields to exclude contains metadata fields. Enter any field names that you want to exclude from the search for your categories, concepts, and facts.

Hint: To ensure that excess time-stamped fields are not sent to SAS Content Categorization Server, leave the ctime, atime, and mtime fields in Input Fields to exclude. These field names represent created, accessed, and modified dates for a file. For more information, see Section 2.8.5 The Document Inspector Tab on page 42.

4. (Optional) Click Finish to save your changes.
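The field selection described in these steps can be sketched in Python as follows. The sketch assumes that an explicit list of input fields is used exactly as entered, which this guide does not state, and it uses only the three time-stamp fields named in the hint as the default exclusions. The function and constant names are hypothetical.

```python
# Assumed default exclusions: only the fields named in the hint above.
DEFAULT_EXCLUDED = frozenset({"ctime", "atime", "mtime"})


def fields_to_search(doc, input_fields=(), excluded=DEFAULT_EXCLUDED):
    """Pick the fields whose text is sent for categorization."""
    if input_fields:
        # Explicit list: search only these fields (an assumption).
        return {f: doc[f] for f in input_fields if f in doc}
    # Blank list: search every field except the excluded ones.
    return {f: v for f, v in doc.items() if f not in excluded}


doc = {"title": "Report", "body": "Sales rose.", "mtime": "2010-08-20"}
print(fields_to_search(doc))            # title and body, without mtime
print(fields_to_search(doc, ["body"]))  # body only
```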


2.13.4.E Specify Categories

Select all of the categories in the SAS Content Categorization Studio projects that you uploaded to SAS Content Categorization Server. These category rules are applied by SAS Content Categorization Server to the documents in the Pipeline Server running in SAS Information Retrieval Studio.

Note: When you select categories, you select all of the categories in all of the projects. However, when you select concepts and facts, you can choose all or some of the concepts and facts that exist in the selected projects.

To specify that all of the categories in the uploaded projects can be used by SAS Content Categorization Server, complete these steps:

1. Click Categories to access the Categories pane.

2. (Optional) By default, categories is entered into the Field name field. You can enter a new field name.

3. (Optional) By default, Categories is entered into Caption. You can enter a new caption name to change the label for facetted search. For more information, see Section 9.5 Match Categories, Concepts, and Facts on page 246 and Chapter 10: Creating Facetted Search Labels Using content_categorization.

4. (Optional) By default, %c is entered into Format for the category name. You can enter a new format that might include %% for a literal percent sign. You can also use x as a modifier to request XML escaping. For example, enter %xc.

5. (Optional) Enter a regular expression into the Category name pattern field.

6. (Optional) Enter a string into the Category name replacement field. This string is a constant value that replaces all of the category names.

7. (Optional) By default, ; (semicolon) appears in Separator. Enter a new separator such as a comma (,).

8. (Optional) By default, the highest number of categories that can be matched in any single input document is 15. You can change this default selection in Max categories.

9. Click Finish.
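As a rough illustration of how the Format, Separator, and Max categories settings might combine, the following Python sketch expands a format string for each matched category and joins the results. The function names (format_category, build_field) and the exact expansion rules are assumptions made for illustration; they are not part of the product.

```python
import re
from xml.sax.saxutils import escape


def format_category(fmt: str, name: str) -> str:
    """Expand %c, %xc, and %% in a hypothetical category format string."""
    def expand(match: re.Match) -> str:
        modifier, symbol = match.group(1), match.group(2)
        if symbol == "%":
            return "%"  # %% yields a literal percent sign
        # %c stands for the category name; the x modifier requests XML escaping
        return escape(name) if modifier == "x" else name

    return re.sub(r"%(x?)([c%])", expand, fmt)


def build_field(names, fmt="%c", separator=";", max_categories=15):
    """Join at most max_categories formatted names with the separator."""
    return separator.join(format_category(fmt, n) for n in names[:max_categories])
```

For example, build_field(["Sports", "R&D"], fmt="%xc", separator=",") would produce one comma-separated value in which the ampersand is XML-escaped.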

2.13.4.F Specify Concepts

Select the concepts in the SAS Content Categorization Studio projects that you uploaded to SAS Content Categorization Server. These concept definitions are applied by SAS Content Categorization Server to the documents in the Pipeline Server running in SAS Information Retrieval Studio.

To specify some of the concepts, complete the following steps. (If you want to specify all of the concepts for all of the selected projects, see Step 1 and then go to Step 10.)

1. Click Concepts to access the Concepts pane. Use this pane to add all of the concepts and contextual extraction concepts. The concepts that are specified by PREDICATE or SEQUENCE definitions are added in the Facts pane.

2. Click Add. The Document Processor: content_categorization window appears. Use this pane to specify the settings for each individual concept. These settings override the settings specified for all of the concepts in the Concepts tab.

3. Click to select a concept in the Concept field. For example, select SPORTS from the drop-down menu.

4. (Optional) By default, sports is entered into Field name. You can enter a new field name.

5. (Optional) By default, Sports is entered into the Caption field. You can enter a new caption name to change the label for facetted search. For more information, see Section 9.5 Match Categories, Concepts, and Facts on page 246.

6. (Optional) By default, % is entered into the Format field for the concept name. You can also use any of the following symbols:

Table 2-7: Default Format Symbols

Symbol    Description
%c        Match the concept name.
%p        Add to %c to include the path with the concept name.
%m        Match the text.
%i        Match the information associated with the entity, or the match text if no information is available.
%I        Match the information associated with the entity unconditionally.
%%        Match the literal percent sign.
x         Use as a modifier, such as in %xc to request XML escaping.

If you want to output nested XML tags, specify the format for these tags such as <body>%xi</body>. For more information, see Section 2.13.9 The Document Processor: export_to_files Window on page 100.

7. (Optional) By default, ; (semicolon) appears in the Default separator field. Enter a new separator such as a comma (,).

8. Click Ok. If you want to make changes to your edits, click Copy Defaults.

9. (Optional) Repeat Step 2 on page 85 through Step 8 on page 86 until you have added all of the concepts that you want to deploy in SAS Information Retrieval Studio.

10. (Optional) By default, concepts is entered into Default field name. You can enter a new field name.

11. (Optional) By default, Concepts is entered into Default caption. You can enter a new caption name to change the label for facetted search. For more information, see Section 9.5 Match Categories, Concepts, and Facts on page 246 and Chapter 10.

12. (Optional) By default, %c: %i is entered into the Default format field for the concept name. You can also use any of the following symbols:

Table 2-8: Default Format Symbols

Symbol    Description
%c        Match the concept name.
%p        Exclude the path.
%m        Match the text.
%i        Match the information associated with the entity, or the match text if no information is available.
%%        Match the literal percent sign.
x         Use as a modifier, such as in %xc to request XML escaping.

13. (Optional) By default, ; (semicolon) appears in the Default separator field. Enter a new separator such as a comma (,).

14. (Optional) By default, the highest number of concepts that can be matched in any single input document is 15. Use the arrow buttons to change this default selection in the Max concepts field.

15. Click Finish.

2.13.4.G Specify Facts

Select the facts in the SAS Content Categorization Studio projects that you uploaded to SAS Content Categorization Server. The arguments in the PREDICATE and SEQUENCE rules of the fact definitions are applied by SAS Content Categorization Server to input documents.

To specify some of the facts, complete the following steps. (If you want to specify all of the facts for all of the selected projects, see Step 1 and then go to Step 13 on page 93. In other words, when you leave facts in Default field name, all of the facts in the project are selected by default.)

1. Click Facts to access the Facts pane. (Facts are the contextual extraction concepts that are defined by at least one PREDICATE or SEQUENCE rule. Unlike other concept rules, PREDICATE or SEQUENCE rules have arguments.)

2. Click Add to specify a fact with its field name and caption for facetted search.

3. Click to select a fact in the Fact field. Facts are contextual extraction concepts that contain at least one PREDICATE or SEQUENCE rule. For example, select SIDE_EFFECT from the drop-down menu.

4. (Optional) When you select a fact using Step 3. above, the Field name is automatically entered. For example, see sideeffect. Enter a new name if you choose.

5. (Optional) When you select a fact, the Caption field is automatically filled in. For example, see Side Effect. Enter a new name if you choose.

6. (Optional) By default, the format for the matched fact is entered into the Format field. For example, see the following format:

SIDE_EFFECT(drug: %v{drug}, sideeffect: %v{sideeffect})

In this example, drug and sideeffect are the returned arguments for the fact SIDE_EFFECT if these arguments are matched. The match strings for the arguments are %v{drug} and %v{sideeffect}. If there is more than one PREDICATE or SEQUENCE rule in the definition with these same arguments, this line is specified for all rules. If there are any other types of rules in the definition, this fact also appears as a concept when you select the Concepts tab.

7. (Optional) By default, %n: %v is entered into the Argument format field for the argument format. You can also use any of the following symbols:

Table 2-9: Default Argument Format Symbols

Symbol      Description
%f          Match the fact name.
%a          Match a formatted list of arguments.
            Note: If you do not specify the argument symbol, the Argument format field, even when specified, does not apply.
%v{name}    Output the value for a specific argument.
%m          Match the text.
%s          Return the concordance list.
            Note: If you do not specify the concordance, the concordance is not returned. This is true even when you specify the Concordance type and Surrounding words in the Facts pane.
%%          Match the literal percent sign.
x           Use as a modifier, such as in %xf to request XML escaping.

Argument List:
%n          Match the argument name.
x           Use as a modifier, such as in %xf to request XML escaping.

8. (Optional) By default, ; (semicolon) appears in the Default separator field. Enter a new separator such as a comma (,).

9. (Optional) By default, the highest number of concepts that can be matched in any single input document is 15. Use the arrow buttons to change this default selection in the Max concepts field.

10. (Optional) Click Copy Defaults to insert the entries from the main Facts tab into all of the fields with the exception of the Fact field.

11. Click Ok.

12. (Optional) Repeat Step 2 on page 90 through Step 11 on page 92 until you have added all of the facts that you want to apply in SAS Information Retrieval Studio.

13. (Optional) By default, facts is entered into the Default field name field. You can enter a new field name.

14. (Optional) By default, Facts is entered into the Default caption field. You can enter a new caption name to change the label for facetted search. For more information about facetted search, see Chapter 10.

15. (Optional) By default, %f(%a) is entered into the Default format field for the concept name. You can edit this entry using any of the symbols in Table 2-9 on page 91 with the exception of %v{name}.

Note: Unless you specify %a, no arguments are called. This is true even if you make entries in the Default argument format field.

16. (Optional) By default, %n: %v is entered into the Default argument format field for the concept name. You can edit this entry using the %n, %v, %%, and the x modifier symbols in Table 2-9 on page 91.

17. (Optional) By default, , (comma) is entered in the Default separator field. You can enter a new separator such as a semicolon (;).

18. (Optional) By default, Surrounding words is selected in Concordance type. Click to select Full sentence.

Notes: Concordance refers to the surrounding text that is returned with the match. When you select Full sentence, the Surrounding words field disappears. If you do not specify the concordance using %s, the concordance is not returned. This is true even when you specify the Concordance type and Surrounding words in the Facts pane.

19. (Optional) By default, 10 is selected in the Surrounding words field. Use the arrow buttons to change this default selection.

20. (Optional) By default, the highest number of facts that can be matched in any single input document is 15. Use the arrow buttons to change this default selection in the Max facts field.

21. Click Finish to save your selections.
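The relationship between the fact format and the argument format can be sketched in Python as follows. This is a minimal illustration, assuming that %a expands to the arguments rendered with the argument format and that %v{name} expands to one named argument. The function format_fact and its arg_separator parameter are hypothetical and do not represent the product's implementation.

```python
import re


def format_fact(fmt, fact_name, args, match_text="",
                arg_format="%n: %v", arg_separator=", "):
    """Expand a hypothetical fact format string.

    args maps argument names to matched values, e.g. {"drug": "aspirin"}.
    """
    def format_args():
        # Each argument is rendered with the argument format (%n, %v, %%).
        def expand_arg(name, value):
            table = {"%n": name, "%v": value, "%%": "%"}
            return re.sub(r"%[nv%]", lambda m: table[m.group(0)], arg_format)
        return arg_separator.join(expand_arg(n, v) for n, v in args.items())

    def expand(m):
        token = m.group(0)
        if token == "%%":
            return "%"
        if token == "%f":
            return fact_name
        if token == "%a":
            return format_args()
        if token == "%m":
            return match_text
        # %v{name}: value of one named argument, empty if it was not matched
        return args.get(m.group(1), "")

    return re.sub(r"%%|%[fam]|%v\{(\w+)\}", expand, fmt)
```

With the arguments drug and sideeffect, the format %f(%a) and the explicit format SIDE_EFFECT(drug: %v{drug}, sideeffect: %v{sideeffect}) yield the same string in this sketch, whereas a format without %a omits the arguments entirely.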

2.13.5 The Document Processor: default_mime_type_from_url Window

Use the Document Processor: default_mime_type_from_url window to return the document type from the address fields of any input documents. To perform this operation, SAS Information Retrieval Studio looks for the filename extension in the URL of an input document.

To use the Document Processor: default_mime_type_from_url window, complete these steps:

1. Select default_mime_type_from_url in the Add Document Processor window and click Next. The Document Processor: default_mime_type_from_url window appears.

2. Leave the default specification, mimetype, or enter a new field name into mime-type-field.

3. Leave the default specification, id, or enter the new field name into url-field.
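The idea of deriving a document type from the filename extension in a URL can be sketched with the Python standard library. This is an illustration only: the function default_mime_type is hypothetical, and the choice to keep an existing MIME type value is an assumption, not documented behavior.

```python
import mimetypes
from urllib.parse import urlparse


def default_mime_type(doc, url_field="id", mime_type_field="mimetype"):
    """Fill the MIME type field from the URL's filename extension if it is empty."""
    if doc.get(mime_type_field):
        return doc  # assumption: an existing value is left untouched
    path = urlparse(doc.get(url_field, "")).path
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed:
        doc[mime_type_field] = guessed
    return doc
```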

2.13.6 The Document Processor: default_title_from_url Window

Use the Document Processor: default_title_from_url window to return the document title using the Web address of the input documents.

To use the Document Processor: default_title_from_url window, complete these steps:

1. Select default_title_from_url in the Add Document Processor window and click Next. The Document Processor: default_title_from_url window appears.

2. Leave the default specification, title, or enter a new field that specifies what field is searched to locate the title of the input document. If the document has no value for the field, the value of the URL field is used.

3. Leave the default specification, id, or enter the field where the document title can be located into url-field.
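The fallback described in Step 2 can be pictured with a short Python sketch. The function default_title is hypothetical; it only illustrates the rule that the URL field's value is used when the title field has no value.

```python
def default_title(doc, title_field="title", url_field="id"):
    """Use the URL field's value as the title when the title field is empty."""
    if not doc.get(title_field):
        doc[title_field] = doc.get(url_field, "")
    return doc
```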

2.13.7 The Document Processor: document_converter Window

Use the Document Processor: document_converter window to extract plain text from other types of file formats, such as Adobe PDF and Microsoft Word documents. This document processor uses SAS Document Conversion.

To use the Document Processor: document_converter window, complete these steps:

1. Select document_converter in the Add Document Processor window and click Next. The Document Processor: document_converter window appears.

2. Replace the default specification, localhost:54321, with a string naming the server and its port in the server field. (The server name and port number are separated by a colon [:]).

3. Leave the default specification, mimetype, or enter a new field where the document type is found into mime-type-field. This field specifies that non-ASCII text can be formatted into text.

4. Leave the default entry id, or specify a different field where the identification information for the file can be located into filename-field.

5. Leave the default specification, raw, in input-field. With this setting, the document processor gets the content in the body and title fields.

6. Leave the default specification, body, or enter a new location for the output information into output-field. The output field is where the plain text version of the document is stored.

2.13.8 The Document Processor: export_csv Window

Use the Document Processor: export_csv window to save documents to a .csv file under the column headings that you specify with fields. (CSV stands for comma-separated values.) You can also specify categories, concepts, and facts instead of fields when new files are created. Export these files with escaped or nonescaped characters for use in SAS Text Miner, Base SAS, Microsoft Excel, and so on.

To use the Document Processor: export_csv window, complete these steps:

1. Select export_csv in the Add Document Processor window and click Next. The Document Processor: export_csv window appears.

2. Rename the default comma-separated file, articles.csv, or leave this entry in the filename field. If you add %s to the entry in this field, the file is timestamped.

3. Leave the default entry 1, if you want to append to an existing file with the specified name in the append field. This operation takes place when the pipeline is restarted. Enter 0 to overwrite an existing file when the pipeline is restarted.

4. Leave the default entry 0 in the new-file-after-n-rows field, or enter 1 if you appended %s to the entry in the filename field.

Notes: This field, like the following four fields, controls when a file is closed and another file is accessed. The operations in Step 4 through Step 8 are not mutually exclusive. For this reason, a new file is started when any of the enabled conditions is true.

5. Leave the default entry 0 in the new-file-after-n-idle-seconds field, or enter 1 if you appended %s to the entry in the filename field. A new file is created after the pipeline server is idle for the specified number of seconds.

6. Leave the default entry 0 in the new-file-after-n-seconds field, or enter 1 if you appended %s to the entry in the filename field. A new file is created after the specified number of seconds.

7. Leave the default entry 0 in the new-file-each-hour field. Enter 1 if you appended %s to the entry in the filename field and a new file is created every hour.

8. Leave the default entry 0 in the new-file-each-day field. Enter 1 if you appended %s to the entry in the filename field and a new file is created every day.

9. Leave the default entries id, title, and body as a comma-separated list of field names that corresponds to the columns in the CSV file. Alternatively, edit, or enter a new list of field names into the columns field.

If you plan to use categorization, specify categories. If you want to perform concept extraction, list the concepts in this field. This includes the contextual extraction concepts and facts.

10. Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.

11. Leave the default specification 0 in the cleanup-white-space field if you want to remove any new lines in the document. This operation makes it easier to parse the document text. Alternatively, enter 1 to keep these lines in the document as it is parsed.

12. Leave the default comma (,) that is entered into the delimiter field. You can enter another character that is used to delimit the fields in the output file.

13. Leave the default 1 setting in the excel-quoting field. Set this number to 0 for nonescaped output.

14. Leave the default utf-8 encoding specification for input files in the encoding field.
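To make the effect of the main settings more concrete, the following Python sketch writes documents to a CSV file with the standard csv module and shows how several new-file conditions can be combined. It is a sketch under stated assumptions: export_rows, should_start_new_file, and all of their parameters are hypothetical names, the sketch always uses Excel-style quoting, and treating 0 as "condition disabled" is an assumption.

```python
import csv


def export_rows(docs, filename="articles.csv", columns=("id", "title", "body"),
                delimiter=",", append=True, collapse_newlines=False,
                encoding="utf-8"):
    """Write one CSV row per document, using the listed fields as columns."""
    mode = "a" if append else "w"  # append to or overwrite an existing file
    with open(filename, mode, newline="", encoding=encoding) as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        for doc in docs:
            values = [str(doc.get(col, "")) for col in columns]
            if collapse_newlines:
                # Replace runs of white space, including new lines, with one space.
                values = [" ".join(v.split()) for v in values]
            writer.writerow(values)


def should_start_new_file(rows_written, idle_seconds, open_seconds,
                          max_rows=0, max_idle_seconds=0, max_seconds=0):
    """Return True when any enabled (nonzero) condition is met."""
    return any([
        max_rows and rows_written >= max_rows,
        max_idle_seconds and idle_seconds >= max_idle_seconds,
        max_seconds and open_seconds >= max_seconds,
    ])
```

Because the conditions are combined with a logical OR, enabling more than one of them in this sketch starts a new file as soon as the first condition becomes true.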

2.13.9 The Document Processor: export_to_files Window

Use the Document Processor: export_to_files window to save each document to a separate file. To use the Document Processor: export_to_files window, complete these steps:

1. Select export_to_files in the Add Document Processor window and click Next. The Document Processor: export_to_files window appears.

2. Leave the default selection, work/export-to-files, or change this folder, in the directory field. SAS Information Retrieval Studio sends each document to a separate file in the specified directory, based on a hash of its contents.

3. Leave the default specification, xml, or enter text into the format field.

4. Enter a comma-separated list of field names to include in the output file. If you leave fields blank, all of the document fields appear in the output file.

5. Leave the default selections raw and mimetype in fields-to-exclude. You can also specify different field names. The text in these fields does not appear in the output file.

6. (Optional) When you want to output nested XML tags, enter the name of the field whose value contains the XML syntax into xml-preescaped-fields.

This field name is listed in the Field name of the Document Processor: content_categorization window. For example, organization. For more information, see Section 2.13.4.F Specify Concepts on page 84. This field name is commonly used when XML escaping is requested in the Format field of the Document Processor: content_categorization window. See the following example:

7. Leave the default specification, article, if you specified XML for the output format. If you are using text as the output format, enter a different document tag type into the document-tag field.

8. Leave the default utf-8 encoding specification for input files in the encoding field.

9. Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.
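The interaction between fields, fields-to-exclude, and xml-preescaped-fields can be sketched as follows. This Python example is an illustration only: export_document is a hypothetical function, and the SHA-1 hash, the .xml extension, and the XML layout are assumptions rather than the product's actual output format.

```python
import hashlib
import os
from xml.sax.saxutils import escape


def export_document(doc, directory="work/export-to-files", fields=None,
                    fields_to_exclude=("raw", "mimetype"),
                    xml_preescaped_fields=(), document_tag="article",
                    encoding="utf-8"):
    """Write one document as XML to a file named after a hash of its content."""
    names = fields or list(doc)  # no field list means all document fields
    parts = [f"<{document_tag}>"]
    for name in names:
        if name in fields_to_exclude or name not in doc:
            continue
        value = str(doc[name])
        if name not in xml_preescaped_fields:
            value = escape(value)  # pre-escaped fields already hold XML
        parts.append(f"  <{name}>{value}</{name}>")
    parts.append(f"</{document_tag}>")
    data = "\n".join(parts).encode(encoding)

    os.makedirs(directory, exist_ok=True)
    # The hash algorithm and the file extension are illustrative choices.
    path = os.path.join(directory, hashlib.sha1(data).hexdigest() + ".xml")
    with open(path, "wb") as handle:
        handle.write(data)
    return path
```

Listing a field such as organization in xml_preescaped_fields keeps its nested tags intact, whereas every other field value is XML-escaped before it is written.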

2.13.10 The Document Processor: export_to_odbc Window

Use the Document Processor: export_to_odbc window to send the documents directly to a database.

To use the Document Processor: export_to_odbc window, complete these steps:

1. Select export_to_odbc in the Add Document Processor window and click Next. The Document Processor: export_to_odbc window appears.

2. (Optional) Enter the name of the ODBC driver into the connection-string field.

Note: Consult your database documentation for details before you use this step and Step 6. below.

3. Enter the name of the database table into the table field.

4. Use the table-init field to specify the operation that is performed if the database table already exists. If drop is specified, the existing table is removed, allowing a new table to be created. The columns in the new table replace those in the old. If set to truncate, the existing table is preserved, but all of the rows in it are deleted.

5. Leave the default entries for the columns in the database table, or specify new fields in the columns field.

6. Leave the default entry id in the merge-column field if you want to add new rows with a merge operation. If you use the merge operation, specify the name of a column. (Also see the note above.)

7. Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.

8. Leave the default setting of 1024 in the max-length field. You can also change the highest number of characters permitted in the value specified for a single database column.
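The table-init, merge-column, and max-length settings can be illustrated with a small Python sketch. It uses the standard sqlite3 module as a stand-in for an ODBC database so that the example runs without a driver. The function export_documents is hypothetical, and mapping the merge operation to INSERT OR REPLACE is an assumption; consult your database documentation for the actual merge behavior.

```python
import sqlite3


def export_documents(conn, docs, table="documents",
                     columns=("id", "title", "body"),
                     table_init=None, merge_column="id", max_length=1024):
    """Store documents in a table, loosely following the settings above."""
    col_list = ", ".join(columns)
    column_defs = ", ".join(f"{c} TEXT" for c in columns)
    if table_init == "drop":
        conn.execute(f"DROP TABLE IF EXISTS {table}")  # remove the old table
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"({column_defs}, UNIQUE ({merge_column}))"
    )
    if table_init == "truncate":
        conn.execute(f"DELETE FROM {table}")  # keep the table, delete its rows

    placeholders = ", ".join("?" for _ in columns)
    for doc in docs:
        # Values longer than max_length characters are cut off.
        values = [str(doc.get(c, ""))[:max_length] for c in columns]
        # A row with the same merge-column value replaces the existing row.
        conn.execute(
            f"INSERT OR REPLACE INTO {table} ({col_list}) VALUES ({placeholders})",
            values,
        )
    conn.commit()
```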

2.13.11 The Document Processor: export_to_sentiment_analysis_workbench Window

Use the Document Processor: export_to_sentiment_analysis_workbench window to send the documents directly to SAS Sentiment Analysis Workbench.

To use the Document Processor: export_to_sentiment_analysis_workbench window, complete these steps:

1. Select export_to_sentiment_analysis_workbench in the Add Document Processor window and click Next. The Document Processor: export_to_sentiment_analysis_workbench window appears.

2. Leave the default setting, localhost, or enter a new server name into the hostname field.

3. Leave the default setting, 4000, or enter a new number into the port field.

4. Enter the name of the SAS Sentiment Analysis Workbench project into project-name.

5. Enter the name of the output folder into document-set-name.

6. Leave the default selection, docid, or delete this entry in docid-field. If this field is empty, a unique identifier is automatically generated for each input document.

7. Leave the default selection, link, or delete this entry in link-field. If this field is empty, a unique link to each document is automatically generated.

8. Leave the createtime entry in createtime-field to specify the time that the document was created. You can also specify another field for this entry. In either case, the format of the contents of this field is specified in the createtime-format field below.

9. Specify the format of the matching createtime data in the createtime-format field. For example, %m/%d/%Y %I:%M:%S %p for SAS Sentiment Analysis Workbench, or %Y%m%d for SAS Search and Indexing.

10. Leave the default specification, title, or enter the field where the name of the document is located into title-field.

11. Leave the default specification, author, or enter the field where the name of the person who wrote the document is located into author-field.

12. Leave the default specification, geolocation, or enter the field that specifies where the location is found into geolocation-field.

13. Leave the default specification, source, or enter the field that specifies where the document originates into source-field.

14. Leave the default specification, body, or enter the field where the text of the document is located into body-field.

15. Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.
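The two createtime-format examples in Step 9 resemble strptime-style format strings. Assuming that they follow the usual strptime conventions, the following Python snippet shows what such formats describe. The sample timestamp is invented for illustration.

```python
from datetime import datetime

WORKBENCH_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # example format from Step 9
COMPACT_FORMAT = "%Y%m%d"                  # second example format from Step 9

# Parse an invented timestamp written in the first format.
created = datetime.strptime("03/15/2012 02:30:45 PM", WORKBENCH_FORMAT)
print(created.isoformat())               # 2012-03-15T14:30:45
print(created.strftime(COMPACT_FORMAT))  # 20120315
```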

2.13.12 The Document Processor: extract_abstract Window

Use the Document Processor: extract_abstract window to generate the text in the <abstract> tag. This document processor takes approximately the first 25 to 50 words from an existing field in a document, such as the body field, to generate the abstract. (This location typically contains summary information for a technical or scientific document.)

The abstract functions like the concordance if the document is sent to the search engine. However, the abstract is static and therefore independent of any query, whereas the concordance is query-specific. For this reason, the concordance is available only when a search operation is performed.

To use the Document Processor: extract_abstract window, complete these steps:

1. Select extract_abstract in the Add Document Processor window and click Next. The Document Processor: extract_abstract window appears.

2. Leave the default specification, body, or enter a new source for the <body> tag into source-field.

3. Leave the default specification, abstract, or enter the name of the format tag where the document summary can be located into abstract-field.
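A static abstract of this kind can be approximated in a few lines of Python. The function extract_abstract and its max_words parameter are hypothetical; the product's exact cutoff rule within the 25 to 50 word range is not specified here, so the sketch simply takes the first 50 words.

```python
def extract_abstract(doc, source_field="body", abstract_field="abstract",
                     max_words=50):
    """Copy roughly the first max_words words of the source field into the abstract field."""
    words = str(doc.get(source_field, "")).split()
    doc[abstract_field] = " ".join(words[:max_words])
    return doc
```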

2.13.13 The Document Processor: extract_pdate Window

Use the Document Processor: extract_pdate window to convert the date value in an input document into the pdate format. The pdate format is understood by the search operation.

To use the Document Processor: extract_pdate window, complete these steps:

1. Select extract_pdate in the Add Document Processor window and click Next. The Document Processor: extract_pdate window appears.

2. Leave the default specification, date, or enter a new source for the document date into date-field. This date is converted into the pdate for the search operation.

3. (Optional) Enter the strptime format of the date field in the input document into date-format. If this field is left empty, the RFC822 format (Internet text message) is used.

4. Leave the default specification, pdate, or define a new field where the pdate is stored in new-field.
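The choice between an explicit strptime format and the RFC 822 fallback in Step 3 can be sketched in Python. The function parse_document_date is hypothetical, and the sketch stops at a parsed date value because the internal encoding of the pdate format is not described in this section.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_document_date(value, date_format=""):
    """Parse a date string with a strptime format, or as RFC 822 if none is given."""
    if date_format:
        return datetime.strptime(value, date_format)
    return parsedate_to_datetime(value)  # RFC 822 style, as used in e-mail headers
```

For instance, parse_document_date("2012-03-15", "%Y-%m-%d") uses the explicit format, while parse_document_date("Thu, 15 Mar 2012 08:12:31 +0000") falls back to RFC 822 parsing.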

2.13.14 The Document Processor: heuristic_parse_html Window

Use the Document Processor: heuristic_parse_html window to separate the body of an HTML document from its tags. This operation skips sections of the document that it determines to be navigation sections.

To use the Document Processor: heuristic_parse_html window, complete these steps:

1. Select heuristic_parse_html in the Add Document Processor window and click Next. The Document Processor: heuristic_parse_html window appears.

2. Leave the default specification, raw, or enter a new body field into input-field. The raw specification returns the text in the title and body fields.

3. Leave the default specification, title, or enter a new title field into title-output-field. The plain text of the title is output to this field.

4. Leave the default specification, body, or enter a new body field into body-output-field. The plain text of the body field is output to this field.

5. The entry 1 in require-mime-type specifies that matching documents have a mimetype of HTML. If you enter 0, this field is not required.

6. Leave the mimetype entry in mime-type-field, or specify a different field.

7. The entry 1 in the base64-input field specifies that the text is encoded in the mime content transfer encoding. If you enter 0, this encoding is not used.
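The following Python sketch illustrates the general flow of decoding base64 input and separating title and body text from HTML tags. It is a simplified stand-in: parse_html and _TextExtractor are hypothetical names, the UTF-8 decoding is an assumption, and the sketch does not attempt the heuristic that skips navigation sections.

```python
import base64
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect title text and visible body text, ignoring script and style."""

    def __init__(self):
        super().__init__()
        self.title_parts, self.body_parts = [], []
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to the matching open tag to tolerate sloppy markup.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or {"script", "style"} & set(self._stack):
            return
        target = self.title_parts if "title" in self._stack else self.body_parts
        target.append(text)


def parse_html(doc, input_field="raw", title_output_field="title",
               body_output_field="body", base64_input=True):
    """Fill the title and body output fields with plain text from the input field."""
    raw = doc.get(input_field, "")
    if base64_input:
        raw = base64.b64decode(raw).decode("utf-8", errors="replace")
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    doc[title_output_field] = " ".join(extractor.title_parts)
    doc[body_output_field] = " ".join(extractor.body_parts)
    return doc
```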

2.13.15 The Document Processor: invalidate_duplicates_by_url Window

Use the Document Processor: invalidate_duplicates_by_url window to run a checksum operation that prevents duplicate documents with the same Web address from being stored. You can also specify a file in which to store the URL checksums so that duplicates continue to be tracked even if a restart operation is performed on the pipeline server.

To use the Document Processor: invalidate_duplicates_by_url window, complete these steps:

1. Select invalidate_duplicates_by_url in the Add Document Processor window and click Next. The Document Processor: invalidate_duplicates_by_url window appears.

2. Leave the default specification, id, or enter a new field where the URL is stored into the url_field.

3. (Optional) Enter the name of the checksum file into the checksumfile field. When you enter this name, duplicate URLs continue to be eliminated even after the pipeline is restarted.
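
The duplicate check described above can be pictured with a short Python sketch. It is not the product's implementation: the class name, the SHA-1 checksum, and the one-checksum-per-line file layout are assumptions made for illustration. Only the idea (one checksum per URL, optionally persisted so that it survives a restart) comes from the steps above.

```python
import hashlib
import os


class UrlDeduplicator:
    """Hypothetical stand-in for invalidate_duplicates_by_url."""

    def __init__(self, url_field="id", checksum_file=None):
        self.url_field = url_field
        self.checksum_file = checksum_file
        self.seen = set()
        # Reload checksums saved by an earlier run, if a file was given.
        if checksum_file and os.path.exists(checksum_file):
            with open(checksum_file, encoding="utf-8") as handle:
                self.seen = {line.strip() for line in handle if line.strip()}

    def is_duplicate(self, document):
        """Return True if the URL of this document was already seen."""
        url = document[self.url_field]
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return True
        self.seen.add(digest)
        # Persist the checksum so that it survives a restart (assumed layout).
        if self.checksum_file:
            with open(self.checksum_file, "a", encoding="utf-8") as handle:
                handle.write(digest + "\n")
        return False
```

Without a checksum file, the set of seen URLs in this sketch lives only in memory, which mirrors the reason that the checksumfile field exists.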

2.13.16 The Document Processor: match_and_copy Window

Use the Document Processor: match_and_copy window in ways that are similar to the Document Processor: substitute window. However, the match_and_copy window enables you to write the output to a field that is different from the input field.

To use the Document Processor: match_and_copy window, complete these steps:

1. Select match_and_copy in the Add Document Processor window and click Next. The Document Processor: match_and_copy window appears.

2. Enter the name of the field to be located into input-field.

3. Specify the pattern of the regular expression for this field in the pattern field.

4. Enter the name of the field where the output is placed into output-field.

5. (Optional) If the format field contains a value, the value controls how each match is formatted. Otherwise, the matches are copied in full.

6. (Optional) The append parameter controls whether these values are added to the end of an existing value for the output field, or these values replace an existing value.

7. (Optional) By default the semicolon character (;) is entered into the separator field. You can enter a different character, or a string, if the output-field is used as a label or sent to the index.

8. Click Finish.
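
The following Python sketch illustrates how the fields in the steps above could interact. It is an approximation under stated assumptions: the function name is hypothetical, Python's re module stands in for the product's regular expression engine, and the format value is assumed to use Python group references such as \1. The product's actual format syntax might differ.

```python
import re


def match_and_copy(document, input_field, pattern, output_field,
                   fmt=None, append=False, separator=";"):
    """Copy regular expression matches from input_field into output_field."""
    text = document.get(input_field, "")
    matches = []
    for match in re.finditer(pattern, text):
        # With a format, expand group references; otherwise copy the
        # whole match.
        matches.append(match.expand(fmt) if fmt else match.group(0))
    if not matches:
        return document
    value = separator.join(matches)
    existing = document.get(output_field)
    # append=True adds to the end of an existing value; otherwise the
    # existing value is replaced.
    if append and existing:
        value = existing + separator + value
    document[output_field] = value
    return document


doc = {"body": "Contact a@x.com or b@y.org"}
match_and_copy(doc, "body", r"(\w+)@(\w+)\.\w+", "users", fmt=r"\1")
```

In this sketch the input field is left unchanged, and only the output field receives the matches.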

2.13.17 The Document Processor: modify_field_name Window

Use the Document Processor: modify_field_name window to change a field name.

To use the Document Processor: modify_field_name window, complete these steps:

1. Select modify_field_name in the Add Document Processor window and click Next. The Document Processor: modify_field_name window appears.

2. Enter the name of the field that you want to change into oldname.

3. Enter the new name for the field into newname.

4. Click Finish.

2.13.18 The Document Processor: parse_html Window

Use the Document Processor: parse_html window to extract the contents of an input HTML document. To use the Document Processor: parse_html window, complete these steps:

1. Select parse_html in the Add Document Processor window and click Next. The Document Processor: parse_html window appears.

2. Leave the default specification raw, or enter a new location where this processor can locate the unmodified document data, into input-field. If you leave raw, the parse HTML tool extracts the text and puts it into title-output-field and the body-output-field.

3. Leave the default specification title, or enter a new title field into title-output-field. The plain text of the title is output to this field. If you leave this field empty, no title text is output.

4. Leave the default specification body, or enter a new body field into body-output-field. The plain text of the body field is output to this field. If you leave this field empty, no body text is output.

5. Leave the default specification 0 in output-metadata, to specify that additional information in other fields is output. (The text in the body and title fields is always output.) For example, description and keywords might be output. The fields that are used for output depend on the meta field types that appear in the HTML documents.

6. The entry 1 in the require-mime-type field specifies that the mimetype field is required. If you enter 0, this field is not required.

7. Leave the mimetype entry in mime-type-field, or specify a different type field.

8. The entry 1 in the base64-input field specifies that the text is encoded in the mime content transfer encoding. If you enter 0, this encoding is not used.
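
To make the roles of input-field, title-output-field, body-output-field, and output-metadata concrete, here is a minimal Python sketch built on the standard library html.parser module. It is an illustration, not the product's parser: the function and class names are hypothetical, and the MIME type check and Base64 decoding from the steps above are omitted.

```python
from html.parser import HTMLParser


class _TitleBodyParser(HTMLParser):
    """Collects title text, body text, and named meta tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts, self.body_parts, self.meta = [], [], {}
        self._in_title = self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside script or style

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") and attributes.get("content"):
                self.meta[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_body:
            self.body_parts.append(data)


def parse_html(document, input_field="raw", title_output_field="title",
               body_output_field="body", output_metadata=False):
    parser = _TitleBodyParser()
    parser.feed(document[input_field])
    parser.close()
    # An empty output field name means that the text is not output.
    if title_output_field:
        document[title_output_field] = " ".join("".join(parser.title_parts).split())
    if body_output_field:
        document[body_output_field] = " ".join("".join(parser.body_parts).split())
    if output_metadata:
        # Output fields depend on the meta tags found in the document.
        for name, content in parser.meta.items():
            document.setdefault(name, content)
    return document
```

As in the steps above, the metadata fields that this sketch produces (for example, description or keywords) depend entirely on the meta tags that appear in the input document.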

2.13.19 The Document Processor: parse_xml Window

Use the Document Processor: parse_xml window to extract the contents of an input XML document. You can instantiate this document processor multiple times in order to support multiple document schemas.

To use the Document Processor: parse_xml window, complete these steps:

1. Select parse_xml in the Add Document Processor window and click Next. The Document Processor: parse_xml window appears.

2. Leave the default specification, raw, in input-field and the document processor gets the text from the document.

3. Leave the default specification mimetype in mime-type-field. Only documents with a mimetype of XML are processed.

4. (Optional) Enter the name of the file that tells the application how to treat fields in input documents. The template-filename field specifies the name and location of this file.

5. (Optional) Enter the name of the tag in input XML documents that contains the string that is the identifier for output documents into the copy-url-from-field.

6. (Optional) Enter the name of the tag in output documents that contains the identifying string for these documents into the copy-url-to-field.

7. (Optional) Enter the name of the file that tells the application how to treat fields in input documents. The template-filename specifies the name and location of this file.

For more information, and to see an example of this file format, see Section A.2 XML File Field Extraction File Format on page 353.

2.13.20 The Document Processor: send Window

Use the Document Processor: send window to pass each document to another pipeline server. This operation is used when you want to deploy multiple pipeline servers.

To use the Document Processor: send window, complete these steps:

1. Select send in the Add Document Processor window and click Next. The Document Processor: send window appears.

2. Enter a new server name into the host field.

3. Enter a new number into the port field.

Note: Change these settings to prevent an endless loop.

4. Leave the default entry id in id-field, or enter the name of a new field where the document identifier is located.

5. Leave the default setting 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to send each instance of a document to another instance of the pipeline.

2.13.21 The Document Processor: strip_html Window

Use the Document Processor: strip_html window when you have a field that contains some HTML code that you want to convert into plain text. For example, use it when input XML documents contain HTML code. This operation leaves the textual contents and removes the markup tags. (If the entire document is in HTML, use the parse_html operation instead.)

To use the Document Processor: strip_html window, complete these steps:

1. Select strip_html in the Add Document Processor window and click Next. The Document Processor: strip_html window appears.

2. Leave the body field, or add new fields that are separated by commas into Fields. These are the fields where the HTML tags are stripped in order to return the text that they contain.
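
As a rough illustration of this operation, the Python sketch below removes markup from a comma-separated list of fields and keeps the text. The function and class names are hypothetical, and the standard library html.parser module stands in for whatever the product uses internally.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps character data and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(document, fields="body"):
    """Strip tags from each field named in the comma-separated list."""
    for name in (field.strip() for field in fields.split(",")):
        if name in document:
            parser = _TextOnly()
            parser.feed(document[name])
            parser.close()
            document[name] = "".join(parser.parts)
    return document


doc = {"body": "<p>Price: <b>5 &amp; up</b></p>", "title": "<i>Offer</i>"}
strip_html(doc, "body, title")
```

Character references such as &amp;amp; are decoded to plain characters in this sketch. Whether the product does the same is an assumption.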

2.13.22 The Document Processor: substitute Window

Use the Document Processor: substitute window to perform regular expression substitutions.

To use the Document Processor: substitute window, complete these steps:

1. Select substitute in the Add Document Processor window and click Next. The Document Processor: substitute window appears.

2. Enter the name of the first regular expression field to be located into Field.

3. Specify the pattern of the regular expression in the Pattern field.

4. Enter the replacement for the first regular expression field into replacement.

For more information about regular expressions, see Appendix A.
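
The effect of the Field, Pattern, and replacement entries can be sketched in a few lines of Python. The function name is hypothetical, and Python's re module is used as a stand-in, so pattern and replacement syntax might differ in detail from the syntax described in Appendix A.

```python
import re


def substitute(document, field, pattern, replacement):
    """Replace every match of pattern in the named field, in place."""
    if field in document:
        document[field] = re.sub(pattern, replacement, document[field])
    return document


doc = {"body": "Tel: 555-1234"}
substitute(doc, "body", r"\d{3}-\d{4}", "[number]")
```

Unlike a copy operation, this sketch overwrites the field that it reads from.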

2.14 Miscellaneous Windows

2.14.1 The Import Settings Window

Use the Import Settings window to specify the XML file to import and the affected components of SAS Information Retrieval Studio. When you select this operation, you choose to modify the selected components of SAS Information Retrieval Studio with the settings that you import.

To access and use the Import Settings window, complete these steps:

1. Click Import Settings in the Overview pane.

The Import Settings window appears.

2. Enter the name of the file that you want to import into the Filename field. For example, enter ProjASettings.xml.

3. (Default setting: all components are selected) Deselect any of the components that you do not want to modify with the imported file in the Components section. For example, deselect Feed crawler, Indexing server, and Query web server.

4. Click OK to save these settings.

2.14.2 The Export Settings Window

Use the Export Settings window to save the settings for the components that you configured in SAS Information Retrieval Studio as an XML file. You can then import this file to use it with another project.

To access and use the Export Settings window, complete these steps:

1. Click Export Settings in the Overview pane.

The Export Settings window appears.

2. Enter the name of the file that you want to export into the Filename field.

3. Click OK to save these settings.

2.14.3 The Select an HTTP Proxy Window

When you select the HTTP proxy server, you choose a server that is not the proxy server for SAS Information Retrieval Studio. The HTTP proxy server is an intermediary between the crawler and the Web site. It evaluates requests before passing them to the web server.

To access and use the Select an HTTP Proxy window, complete these steps:

1. Select Configuration --> General Settings in the Web Crawler pane.

2. Click Auto-detect and the Select an HTTP Proxy window appears.

3. Choose an HTTP Proxy. For example, select HTTPProxyname.

4. Click OK and the server appears in the HTTP proxy field.

2.14.4 The Add Entry Point Window

Use the Add Entry Point window to specify the URL that the web crawler uses to begin its Web crawl. You can also limit the scope of the crawl and the number of files that are downloaded from this site.

To access and use the Add Entry Point window, complete these steps:

1. Click the Configuration tab in the Web Crawler pane.

2. Click the Entry Points tab.

3. Click Add. The Add Entry Point window appears.

4. Enter a Web address into the URL field. For example, enter www.sas.com.

Hint: The http:// part of the address is automatically inserted for you after you click OK.

5. (Optional) Leave the default selection Yes in the Add to scope field or click to select No. Unless there are scope rules, the crawler follows all links found on the entry point page, the links found on those pages, and so on. Scope rules limit the links that the crawler follows. Use this feature to constrain the crawl to a single site, or section of the site. In other words, the scope rule follows the way that many Web pages are laid out. When you leave Yes selected, the URL is automatically added to the Scope tab in the web crawler Configuration pane. For more information, see Section 2.4.4.D The Scope Tab on page 20.

6. (Optional) Reset the number in the Quota field. For example, specify 90000. When you specify a quota for the links from the entry point, the overall quota for the crawler, or this number, applies.

7. Click OK and this address appears in the entry points list.

2.14.5 The Edit Entry Point Window

Use the Edit Entry Point window to change the URL or Quota that you added using the Add Entry Point window.

To access and use the Edit Entry Point window, complete these steps:

1. Select an entry point and click Edit in the Entry Points tab of the Configuration pane for the web crawler. The Edit Entry Point window appears.

2. Enter your changes into the URL field. For example, enter http://.*\.sas\.com.

3. (Optional) Reset the number in the Quota field.

4. Click OK to save these settings.

2.14.6 The Add Feed Window

The feed crawler collects postings, whether full texts or summaries, from both RSS and Atom feeds. Use the Add Feed window to add the URL for a feed to the Feeds pane for the feed crawler. In order to select a URL, you locate the Web page where the feed is located and copy the entire feed URL.

If you choose to collect summaries, select Yes in the Follow links field of the Add Feed window.

Hint: If the feed parser collects summaries, select the parse_html document processor in the pipeline server. You can also specify a custom document processor to handle the pages returned from these links. However, the follow links operation does not perform recursively like it does for the web crawler.

To obtain a feed URL, rather than a page URL, complete these steps:

1. Locate the Web page with the orange box that symbolizes an RSS feed. For example, http://support.sas.com/community/rss/.

2. Click the feed icon located to the left of the feed that you want. For example, Press Releases.

3. The feed page appears.

4. Copy the feed URL from the URL field in the browser. For example, copy http://www.sas.com/news/preleases/SASRecentPress.xml.

After you copy the URL for an RSS feed using the steps above, complete these steps:

1. Select Feed Crawler --> General Settings --> Feeds. The Feeds pane appears.

2. Click Add and the Add Feed window appears.

3. Paste the copied RSS feed URL into the Feed URL field.

4. Leave the default setting No, or click to select Yes in the Follow links field. (If you are collecting summaries, select Yes.)

5. Click OK and the URL appears in the Feeds pane.
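
To show what a feed URL gives the crawler to work with, the following Python sketch reads the items of an RSS 2.0 document and, when Follow links is set to Yes, lists the links to fetch. It is an illustration only: the function names are hypothetical, Atom feeds (which use namespaced entry elements) are not handled, and the sample feed is invented.

```python
import xml.etree.ElementTree as ET


def read_rss_items(feed_xml):
    """Return title, link, and summary for each item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        })
    return items


def links_to_follow(items, follow_links):
    """With follow_links set, return each item link once, without recursion."""
    if not follow_links:
        return []
    return [item["link"] for item in items if item["link"]]


# Invented sample feed for illustration.
SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>First post</title><link>http://example.com/1</link>
<description>Short summary</description></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

items = read_rss_items(SAMPLE)
```

The links are returned as a flat list because, as the hint above states, the follow links operation does not recurse the way that the web crawler does.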

2.14.7 The Edit Feed Window

Use the Edit Feed window to change a URL that you added to the Feeds pane. To edit a feed URL, complete these steps:

1. Click Edit in the Feeds tab of the Configuration pane for the feed crawler. The Edit Feed window appears.

2. Place your cursor into the Feed URL field and make any necessary changes.

3. Leave the default setting No, or click to select Yes in the Follow links field.

4. Click OK to save these entries.

2.14.8 The Add Scope Rule Window

Use the Add Scope Rule window to set limits for the pages that the web crawler can follow on the Internet.

To access and use the Add Scope Rule window, complete these steps:

1. Click Add in the Scope tab of the Configuration pane for the web crawler. The Add Scope Rule window appears.

2. Enter a Web address, or enter a regular expression to define a matching pattern for URLs, into the URL Pattern field.

A URL pattern is a pattern that matches against URLs. It is not a URL itself. SAS Information Retrieval Studio supports two types of patterns. The first is the prefix pattern, which matches against the beginning of the URL. For example, the prefix pattern http://www.sas.com matches http://www.sas.com/technologies/analytics/index.html. The second is a regular expression pattern, which must match the whole URL but supports wildcards and other operators.

3. Leave the default selection Prefix in the Match type field unless you specified a regular expression in the URL Pattern field. In this case, click to select Regular expression.

4. Leave the default selection Allow in the Action field unless you want to exclude URLs that match this pattern from the crawl. In this case, click to select Exclude.

5. Click OK and this address appears under the URL Pattern heading.
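
The difference between the two match types, and the effect of the Allow and Exclude actions, can be sketched in Python. The names below are hypothetical, and the way that rules are combined (an Exclude match always wins, and a URL must match an Allow rule when any Allow rule exists) is an assumption. This guide does not spell out the crawler's precedence rules.

```python
import re
from dataclasses import dataclass


@dataclass
class ScopeRule:
    url_pattern: str
    match_type: str = "prefix"  # "prefix" or "regex"
    action: str = "allow"       # "allow" or "exclude"

    def matches(self, url):
        if self.match_type == "prefix":
            # A prefix pattern matches the beginning of the URL.
            return url.startswith(self.url_pattern)
        # A regular expression pattern must match the whole URL.
        return re.fullmatch(self.url_pattern, url) is not None


def in_scope(url, rules):
    """Assumed policy: Exclude wins; otherwise an Allow rule must match."""
    if any(r.action == "exclude" and r.matches(url) for r in rules):
        return False
    allow_rules = [r for r in rules if r.action == "allow"]
    if not allow_rules:
        return True  # no Allow rules: nothing restricts the crawl
    return any(r.matches(url) for r in allow_rules)


rules = [
    ScopeRule("http://www.sas.com"),
    ScopeRule(r"http://.*\.sas\.com/private/.*", "regex", "exclude"),
]
```

With these two rules, a page under http://www.sas.com/technologies/ is in scope, whereas anything under a /private/ path on a sas.com host is not.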

2.14.9 The Edit Scope Rule Window

Use the Edit Scope Rule window to change the limits of the web crawler’s Internet search.

To access and use the Edit Scope Rule window, complete these steps:

1. Click Edit in the Scope tab of the Configuration pane for the web crawler. The Edit Scope Rule window appears.

2. Use Step 2. through Step 5. in Section 2.14.8 The Add Scope Rule Window on page 130 to make any necessary changes.

2.14.10 The Add Filename Extension Window

Use the Add Filename Extension window to create a list of file types that should specifically be excluded from, or included in, the crawl. If you choose to include one or more file extension types, all others are excluded. For this reason, if you want to include all file types, do not use the steps below.

Note: This matching is case-sensitive.

To access and use the Add Filename Extension window, complete these steps:

1. Click Add in the Filename Extensions tab of the Configuration pane for the web crawler. The Add Filename Extension window appears.

2. Enter the file extension that you want to return or to exclude in the Extension field. For example, enter gif.

3. Leave the default selection. If you click and select Exclude, the selected file type is not returned.

4. Click OK and the extension appears in the Filename Extensions pane.
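
The include and exclude behavior described in this section can be summarized in a small Python sketch. The function name is hypothetical. The case-sensitive comparison and the rule that any included extension excludes all others come from the text above, and the rest is an assumption.

```python
import posixpath
from urllib.parse import urlparse


def extension_allowed(url, allowed=(), excluded=()):
    """Decide whether a URL passes the filename extension lists."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".")
    # Matching is case-sensitive, so "GIF" and "gif" are different entries.
    if extension in excluded:
        return False
    # Once at least one extension is included, all others are excluded.
    if allowed:
        return extension in allowed
    return True
```

For example, excluding gif in this sketch still lets a file named logo.GIF through, which is why the note about case sensitivity matters.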

2.14.11 The Edit Filename Extension Window

Use the Edit Filename Extension window to change the Allow or Exclude operation for the selected file type.

Note: This matching is case-sensitive.

To access and use the Edit Filename Extension window, complete these steps:

1. Click Edit in the Filename Extensions tab of the Configuration pane for the web crawler. The Edit Filename Extension window appears.

2. Use Step 2. through Step 4. in Section 2.14.10 The Add Filename Extension Window on page 133 to make any changes.

2.14.12 The Add Credential Window

Use the Add Credential window to set the user name and password that are required to access a password-protected site that you want to crawl.

To access and use the Add Credential window, complete these steps:

1. Click Add in the Credentials tab of the Configuration pane for the web crawler. The Add Credential window appears.

2. Enter the address for a Web site that requires credentials into the Site field.

3. Enter the name of the user into the Username field.

4. Enter the password for this user into the Password field.

5. Click OK and these entries appear in the Credentials pane.

2.14.13 The Edit Credential Window

Use the Edit Credential window to make changes to the information that you entered in the Add Credential window. Use this window to narrow the crawling scope. In other words, crawl everything but the specified file or directory.

To access and use the Edit Credential window, complete these steps:

1. Click Edit in the Credentials tab of the Configuration pane for the web crawler. The Edit Credential window appears.

2. Use Step 2. through Step 5. in Section 2.14.12 The Add Credential Window on page 135 to make any changes.

2.14.14 The Add Path Window

Use the Add Path window to specify the location of a file or directory for the file crawler. You can use either relative or absolute paths. Relative paths are resolved relative to the component that uses them. However, absolute paths are recommended for accurate search returns.

To access and use the Add Path window, complete these steps:

1. Click Add in the Paths tab of the Configuration pane for the file crawler. The Add Path window appears.

2. Enter a file or directory name into the Path field. For Windows file shares, use Universal Naming Convention (UNC) paths instead of local paths.

3. Click OK and this entry appears in the Paths pane.

2.14.15 The Edit Path Window

Use the Edit Path window to make a change to the selected path.

To access and use the Edit Path window, complete these steps:

1. Select a path and click Edit in the Paths tab of the Configuration pane for the file crawler. The Edit Path window appears.

2. Use Step 2. through Step 3. in Section 2.14.14 The Add Path Window on page 137 to make any necessary changes.

2.14.16 The Add Path to Exclude Window

Use the Add Path to Exclude window to deny the file crawler access to a file or directory. This window enables you to limit the scope of the crawl.

To access and use the Add Path to Exclude window, complete these steps:

1. Click Add in the Paths to Exclude tab of the Configuration pane for the file crawler. The Add Path to Exclude window appears.

2. Enter a path to the files that the file crawler should not access in the Path field.

3. Click OK and this entry appears in the Paths to Exclude pane.
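
The effect of an excluded path can be sketched as a check that a candidate file is either the excluded entry itself or located somewhere beneath it. The function name is hypothetical, the sketch assumes absolute POSIX-style paths, and the product's own matching rules (for example, for UNC paths) might differ.

```python
from pathlib import PurePosixPath


def is_excluded(path, excluded_paths):
    """Return True if path equals, or lies under, any excluded path."""
    candidate = PurePosixPath(path)
    for entry in excluded_paths:
        excluded = PurePosixPath(entry)
        # Compare whole path components, so /data/private2 is not treated
        # as part of /data/private.
        if candidate == excluded or excluded in candidate.parents:
            return True
    return False
```

Comparing path components rather than raw string prefixes avoids excluding sibling directories whose names merely start with the same characters.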

2.14.17 The Edit Path to Exclude Window

Use the Edit Path to Exclude window to make a change in the directory path.

To access and use the Edit Path to Exclude window, complete these steps:

1. Click Edit in the Paths to Exclude tab of the Configuration pane for the file crawler. The Edit Path to Exclude window appears.

2. Use Step 2. through Step 3. in Section 2.14.16 The Add Path to Exclude Window on page 138 to make any changes.

2.14.18 The Add Extension Window

Use the Add Extension window to limit the access of the file crawler to the list of files specified in this pane. If you enable at least one type of file to be returned, only these files are returned.

To access and use the Add Extension window, complete these steps:

1. Click Add in the Filename Extensions tab of the Configuration pane for the file crawler. The Add Extension window appears.

2. Enter a string into the Extension field. For example, enter html.

3. Click OK and this entry appears in the Filename Extensions pane.

2.14.19 The Edit Extension Window

Use the Edit Extension window to limit file crawler access to the list of files specified in this pane.

To access and use the Edit Extension window, complete these steps:

1. Click Edit in the Filename Extensions tab of the Configuration pane for the file crawler. The Edit Extension window appears.

2. Use Step 2. through Step 3. in Section 2.14.18 The Add Extension Window on page 139 to make any necessary changes.

2.14.20 The Add Backend Window

Use the Add Backend window to add a pipeline server to the configuration pane for the proxy server.

To access and use the Add Backend window, complete these steps:

1. Click Add in the Configuration pane for the proxy server. The Add Backend window appears.

2. Enter a new string into the Host field. For example, enter newhost.

3. Reset the number in the Port field. For example, specify 9008.

4. Click OK and the new server information appears in the configuration pane.

2.14.21 The Edit Backend Window

Use the Edit Backend window to change information about the pipeline server that appears in the proxy server configuration pane.

To access and use the Edit Backend window, complete these steps:

1. Click Edit in the Configuration pane for the proxy server. The Edit Backend window appears.

2. Use Step 2. through Step 4. in Section 2.14.20 The Add Backend Window on page 141 to make any necessary changes.

2.14.22 The Add Field Window for the Indexing Server

Use the Add Field window to define the fields of the document stored in the index. You also use this window to change the specifications that you entered when you click Edit in the Configuration pane of the Indexing Server.

To access and use the Add Field window, complete these steps:

1. Click Add in the Configuration pane of the Indexing Server and the Add Field window appears.

2. Enter the field name into the Name field. You can specify any field name that appears in your documents.

3. Click to choose one of the following selections:

Searching

(Default) Search for words that match the input query terms in this field. This choice is equivalent to the standard functionality.

Label

(default) select this type of usage for facetted search. For more information about facetted search labels, see Section 9.5 Match Categories, Concepts, and Facts on page 246.

Display and Sorting

display the matching URLs according to the sorting type that you select. Sort the results alphabetically, or numerically, instead of by relevancy. This selection corresponds to marking the field as info.

Identification

choose this selection to identify the field that contains the individual identification number for each document. Each document in the index requires a unique identifier. If a new document is added that has the same identification number as an old document, the new document replaces the old document.

Custom

select one, or more, of the following field types:

Standard

make this field a regular field. This selection enables searching within this field.

Boolean

enable searches with Boolean operators. Boolean fields require an exact match on a Boolean field in a document. If the entire contents of a Boolean field are equal to the term, in a byte-for-byte manner, there is a match. In other words, the case, punctuation, and whitespace characters of the field must be identical to those of the query term.

Info

make this field an information field. The information field is used to pass static data. Information fields are not modified and they cannot be matched by a query.

URL

contains either a Web address or a unique string.

Date

contains a field that represents the date of the document.

Number

contains an integer value that is associated with the document, such as a price or rating. Use this selection for range-based query constraints at query time, if you choose to use the query API. For more information about the query API, see the SAS Search and Indexing: User and C and Java API Guide.

4. Click OK to add this field to the list of fields specified for this index.
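
The field behaviors described above can be illustrated with a small sketch. This is not the SAS Search and Indexing API: the TinyIndex class, its method names, and the sample documents are hypothetical, and the word-level matching used for standard fields is only an assumption that stands in for the product's search behavior. The sketch shows three points from the text: a document with an existing ID replaces the old document, a Boolean field matches only on byte-for-byte equality, and a Number field supports range constraints.

```python
from dataclasses import dataclass, field


@dataclass
class TinyIndex:
    """Toy in-memory index keyed by the Identification field (hypothetical)."""
    id_field: str = "id"
    docs: dict = field(default_factory=dict)

    def add(self, doc: dict) -> None:
        # A document whose ID already exists replaces the older document.
        self.docs[doc[self.id_field]] = doc

    def search_standard(self, name: str, term: str) -> list:
        # Word-level, case-insensitive match. This is an assumption that
        # stands in for the product's standard search behavior.
        term = term.lower()
        return [d for d in self.docs.values()
                if term in str(d.get(name, "")).lower().split()]

    def search_boolean(self, name: str, term: str) -> list:
        # Boolean fields match only when the whole field equals the term
        # byte for byte (case, punctuation, and whitespace included).
        return [d for d in self.docs.values()
                if str(d.get(name, "")).encode() == term.encode()]

    def search_number_range(self, name: str, low: int, high: int) -> list:
        # Number fields hold integers, which allows range constraints.
        return [d for d in self.docs.values()
                if isinstance(d.get(name), int) and low <= d[name] <= high]


index = TinyIndex()
index.add({"id": "doc-1", "body": "Quarterly sales report", "status": "Final", "rating": 3})
index.add({"id": "doc-2", "body": "Draft sales plan", "status": "final ", "rating": 5})
# Same ID as the first document, so the first document is replaced.
index.add({"id": "doc-1", "body": "Revised sales report", "status": "Final", "rating": 4})

print(len(index.docs))
print([d["id"] for d in index.search_standard("body", "SALES")])
print([d["id"] for d in index.search_boolean("status", "Final")])
print([d["id"] for d in index.search_number_range("rating", 4, 5)])
```

In this sketch, the status value "final " (lowercase, with a trailing space) does not match the Boolean query "Final", although a standard search would treat the two words as equivalent.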

2.14.23 The Add Field Window for the Query Web Server Matching Pane

Use the Add Field window to add fields to the Matching pane of the query web server.

Note: The only fields that are available are those added to the index with search functionality.

To access and use the Add Field window, complete these steps:

1. Click Add in the Matching pane of the Query Web Server and the Add Field window appears.

2. Leave the default selection, such as id, to search the identification fields of the input documents, or choose another field from the drop-down list. The fields in this drop-down list are added in the pipeline server.

3. Leave the default Weight value 1, or change the value to reset the weight assigned to this field. The weight is a number that is relative to the weights of the other matching fields and is used to prioritize matches.

4. Click OK to save this field in the Matching pane.
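
To see how relative weights can prioritize matches, consider the following sketch. The scoring formula (weight multiplied by the number of term hits in each field) is an assumption chosen for illustration; the guide does not state the ranking formula that the query server uses. The function name, field names, and sample documents are hypothetical.

```python
def weighted_score(doc: dict, query: str, weights: dict) -> float:
    """Sum, over the matching fields, weight * number of query-term hits."""
    terms = query.lower().split()
    score = 0.0
    for name, weight in weights.items():
        words = str(doc.get(name, "")).lower().split()
        hits = sum(words.count(term) for term in terms)
        score += weight * hits
    return score


# A hit in the title counts three times as much as a hit in the body.
weights = {"title": 3, "body": 1}
docs = [
    {"id": "a", "title": "Crawler setup", "body": "Configure the crawler quota and scope"},
    {"id": "b", "title": "Index fields", "body": "The crawler sends documents and the crawler is paused"},
]

ranked = sorted(docs, key=lambda d: weighted_score(d, "crawler", weights), reverse=True)
for doc in ranked:
    print(doc["id"], weighted_score(doc, "crawler", weights))
```

Document a has fewer occurrences of the term than document b, but it ranks first because one of its hits is in the more heavily weighted title field. Only the ratio between the weights matters in this sketch, which mirrors the statement that the weight is relative to the other matching fields.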

2.14.24 The Edit Field Window for the Query Web Server Matching Pane

Use the Edit Field window to change field entries in the query web server Matching pane. To access and use the Edit Field window, complete these steps:

1. Click Edit in the Configuration pane of the Query Web Server and the Edit Field window appears.

2. Use Step 2. through Step 3. in Section 2.14.23 The Add Field Window for the Query Web Server Matching Pane on page 146 as necessary.

2.14.25 The Add Field Window: Query Web Server Labels Pane

Use the Add Field window to add labels to the query web server for use with facetted search. To access and use the Add Field window, complete these steps:

1. Click Add in the Labels pane of the Query Web Server and the Add Field window appears.

2. Select a field from the drop-down list. Any field in the index that has label functionality is available in this list. For example, categories is added as a document processor to the pipeline server.

3. Enter a label name into the Caption field. This term appears for any matches on the categories. If you added concepts in the Pipeline Server tab, you could specify a different label for each concept.

4. Leave the default selection No in the Hierarchical field, or select Yes or Flattened. No displays the list view, Yes displays the tree view, and Flattened displays the tree in a list format.

Note: You can specify Yes or Flattened in the Hierarchical field for categories only. Concepts and facts do not have parent-child relationships.

5. Leave the default selection No in the Display counts field, or select Yes to see the number of matching values for each label field.

6. Leave the default Show in matches value 0, or change this value to reset the number of labels found in each individual matching document that are displayed in the results list.

7. Click OK to save these entries in the Labels pane.
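
The three Hierarchical selections can be pictured with the following sketch. It is an interpretation, not the product's rendering code: the render_labels function is hypothetical, and the sketch assumes that category names carry their parent path with a "/" separator, which the guide does not specify. It shows one plausible reading of a list view (No), a tree view (Yes), and a tree shown in list format (Flattened).

```python
def render_labels(paths, mode="No"):
    """Render category paths such as 'Sports/Tennis' in one of three modes."""
    if mode == "No":
        # List view: one entry per label, using only the label's own name.
        return sorted({path.split("/")[-1] for path in paths})
    if mode == "Flattened":
        # Tree shown as a list: every node appears once, with its full path.
        nodes = set()
        for path in paths:
            parts = path.split("/")
            for depth in range(1, len(parts) + 1):
                nodes.add("/".join(parts[:depth]))
        # Sorting by path components keeps children directly after parents.
        return sorted(nodes, key=lambda node: node.split("/"))
    if mode == "Yes":
        # Tree view: children are indented beneath their parents.
        return ["  " * node.count("/") + node.split("/")[-1]
                for node in render_labels(paths, "Flattened")]
    raise ValueError(f"unknown mode: {mode}")


paths = ["Sports/Tennis", "Sports/Golf", "Finance"]
for mode in ("No", "Yes", "Flattened"):
    print(mode, render_labels(paths, mode))
```

Because concepts and facts have no parent-child relationships, only the list view applies to them, which is consistent with the note that Yes and Flattened are available for categories only.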

2.14.26 The Edit Field Window: Query Web Server Labels

Use the Edit Field window to make changes to the labels in the query web server for use with facetted search. To access and use the Edit Field window, complete these steps:

1. Select a label in the Labels pane.

2. Click Edit and the Edit Field window appears.

3. Use Step 2. through Step 7. in Section 2.14.25 The Add Field Window: Query Web Server Labels Pane on page 148 as necessary.

2.14.27 The Color Box Window

Use the color box window to make changes to the colors that you use for the query web server user interface. To access and use the color box window, complete these steps:

1. Select Query Web Server --> Configuration --> Theme --> Colors.

2. Click in the Color pane and the color box window appears.

3. Select to make a color change.

4. Click a color box beneath Basic to select a color.

5. See the colors that you previously selected in the Recently used boxes.

6. Click Custom Colors to access the expanded version of the color box window.

7. See the color range for the selected color in the large pane.

8. Slide the buttons up or down to select a new color.

9. Change the value in the Red field to reset the default number 112.

10. Change the value in the Green field to reset the default number 138.

11. Change the value in the Blue field to reset the default number 116.

12. See the newly selected color in the New pane.

13. See the existing color in the Current pane.

14. See the hexadecimal color code that corresponds with the selected color for the Web page.

15. Click OK to save this color in the selected field.
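
The relationship between the Red, Green, and Blue values and the hexadecimal color code can be worked out as follows. This sketch assumes that each channel is an integer from 0 to 255, as is usual for RGB color selectors; the rgb_to_hex function is a hypothetical helper and not part of the product.

```python
def rgb_to_hex(red: int, green: int, blue: int) -> str:
    """Convert three 0-255 channel values to a #RRGGBB color code."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each channel must be between 0 and 255")
    # Each channel becomes two uppercase hexadecimal digits.
    return f"#{red:02X}{green:02X}{blue:02X}"


# The default values from the steps above: Red 112, Green 138, Blue 116.
print(rgb_to_hex(112, 138, 116))  # prints #708A74
```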

2.14.28 Status Windows

2.14.28.A Overview of Status Windows

Some, but not all, of the status windows that appear in SAS Information Retrieval Studio are displayed in this section. Use the status windows in this application to understand the processes, to catch errors, and to make changes to your application.

2.14.28.B The Confirmation Window

Use the Delete Index window to remove an index from the server. To access and use the Delete Index window, complete these steps:

1. Click Delete Index in the Indexing Server pane and the Delete Index window appears.

2. Click OK and the index that you have compiled is deleted.

2.14.28.C The Error Window

The Error window appears when an operation cannot be completed. This window contains a string providing relevant information. See the example below:

Display 2.51 Error Window

Click OK to close this window and make your changes.

3 Choosing Your Components

- Before You Choose Your Components
- Choosing a Crawler
- Purposes of the Proxy Server
- Choosing Document Processors in the Pipeline Server
- How the Indexing Server Works
- Querying the Index
- Defining Labels for Facetted Search
- After You Choose Your Components
- Exporting and Importing Component Specifications

3.1 Before You Choose Your Components

SAS Information Retrieval Studio enables you, the administrator, to create a custom document acquisition and processing application. You choose the components that fit the requirements of your organization and you configure each of these components. For example, if you choose to use the web crawler, you install SAS Web Crawler before installing SAS Information Retrieval Studio. If you want to perform search and indexing operations, you install SAS Search and Indexing before you install SAS Information Retrieval Studio.

You also choose the order for the document processors and decide whether to index the processed texts or to send them to another application. For information about sample configurations, see Chapter 4: Sample Configurations.

3.2 Choosing a Crawler

Choose a crawler to locate and return the documents that contain the information that you seek. Crawlers crawl the Internet and your corporate files. Each returned document could be a Web page, a blog post, or a file that is posted to the Internet or on your local machine. Each document is a unit of textual data. For example, a document can be an HTML page, a Microsoft Word or a PDF file, one row in a CSV file, or an article or summary in a feed.

In SAS Information Retrieval Studio, each document is represented as a configurable set of fields. Each field has a name and a value. The value in this name-value pair might be returned as binary data (for example, Word documents). Each document has an associated ID tag that identifies the input document as it is collected, processed, and output by SAS Information Retrieval Studio. There are three types of crawlers in SAS Information Retrieval Studio:

Web crawler

crawls the Web, according to the parameters that you set. These parameters define the types of documents and information that you seek, and they also limit the scope of the crawl. The scope, or breadth and depth of the crawl, prevents the crawler from attempting to return every document that appears on the Internet. When you limit the scope of the Web crawl, you optimize the crawl and minimize the time that it takes to return this data. You can also specify the credentials that are necessary to access password-protected sites.

File crawler

crawls fileshares on your organization’s network, or your local machine, for the types of files that you specify. You input the parameters of the crawl to limit the retrieval operation to the document types and paths that you select, or exclude.

Feed crawler

use the feed crawler when you want to obtain blog posts, user forum pages, and other trending data such as press releases.

3.3 Purposes of the Proxy Server

All deployments of SAS Information Retrieval Studio require the proxy server. The proxy server controls the flow of documents. As an intermediary server, the proxy server copies the collected data to multiple places. This copy process prevents the loss of data in the case of hardware failure. The proxy server sends the same set of documents to each pipeline server.

The proxy server performs two basic types of operations. These functions make it an integral link between the crawlers and the servers that together form your customized SAS Information Retrieval Studio application. For this reason, the proxy server is not optional.

First, the proxy server enables you to pause the flow of incoming documents from a crawler. When you stop the flow, the incoming documents form a queue until you instruct the proxy server to resume operations. You can use the pause operation to perform maintenance without interrupting crawling. Second, the proxy server enables you to specify a list of pipeline servers running on different machines. A copy of each incoming document can be sent to each of these servers.

Use the proxy server to send the same set of documents to multiple pipeline servers:

- To create mirrors, specify the same configuration for each server. The servers that run on different machines act as mirrors in case of hardware failure.

- To create mirrors for multiple types of document processors, specify identical document processing capabilities for each type of pipeline server. In case of hardware failure, the input documents are saved in another pipeline server.
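
The two proxy server operations described in this section, pausing the incoming flow and copying each document to several pipeline servers, can be sketched as follows. This is a conceptual model only. The ProxySketch and PipelineStub classes and their methods are hypothetical and do not correspond to the product's interfaces or network protocol.

```python
from collections import deque


class PipelineStub:
    """Stands in for a pipeline server running on some machine."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def process(self, doc):
        self.received.append(doc)


class ProxySketch:
    """Queues documents while paused and copies each one to every backend."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.paused = False
        self.queue = deque()

    def submit(self, doc):
        # Crawlers keep submitting documents even while the proxy is paused.
        if self.paused:
            self.queue.append(doc)
        else:
            self._forward(doc)

    def pause(self):
        self.paused = True

    def resume(self):
        # Drain the queue in arrival order, then accept documents directly.
        self.paused = False
        while self.queue:
            self._forward(self.queue.popleft())

    def _forward(self, doc):
        # Each backend gets its own copy, so the servers act as mirrors.
        for backend in self.backends:
            backend.process(dict(doc))


a, b = PipelineStub("a"), PipelineStub("b")
proxy = ProxySketch([a, b])
proxy.submit({"id": "doc-1"})
proxy.pause()                      # for example, during maintenance
proxy.submit({"id": "doc-2"})      # held in the queue, crawling continues
queued = len(proxy.queue)
proxy.resume()
print(queued, [d["id"] for d in a.received], a.received == b.received)
```

The queue is what allows maintenance without interrupting crawling: nothing is dropped while the proxy is paused, and both stubs end up with the same documents in the same order.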

3.4 Choosing Document Processors in the Pipeline Server

3.4.1 Overview of the Pipeline Server

Use the pipeline server to specify the document processors that act on input documents. For example, strip the markup tags from HTML documents, or convert Microsoft Word documents into text. These document processors take the input documents and prepare them to be used by another component of SAS Information Retrieval Studio or another application.

After the input documents are processed by the pipeline server, choose to export the data or build a searchable index of the documents using the indexing server. Use the pipeline server to pass the input documents to the selected program or to the indexing server.

3.4.2 Choosing a Document Processor

All deployments of SAS Information Retrieval Studio that process incoming documents require the pipeline server. Input documents can be processed and sent to the index or they can be passed to another application. For example, HTML documents collected by the web crawler are passed to the proxy server and then to the pipeline server. In the pipeline server, you can select the document processors that act on this document before it is indexed or sent to another application.

For example, if a document is in HTML format, it requires processing before the text is used by another application. Use the parse_html or heuristic_parse_html processor to perform HTML processing before the pipeline server performs any additional document processing operations on the input document.

Use the following document processing operations before you index a document:

categorizer

match category rule terms that appear in one, or more, fields of an input document. These rules are found in the categories project running on SAS Content Categorization Server.

concept_extractor

extract any matching concepts from an input document. These terms are located in the specified concepts project running on SAS Content Categorization Server.

contextual_extractor

return the matching contextual extraction concepts and facts in the specified project running on SAS Content Categorization Server.

default_mime_type_from_url

determine the type of the original, input document based on the filename extension found in its Web address.

default_title_from_url

name documents that lack a title based on their Web addresses.

document_converter

change the format of incoming files, such as Adobe PDF and Microsoft Office documents, into text using the SAS Document Conversion application.

extract_abstract

obtain the summary for each document.

extract_pdate

normalize the document dates into a format understood by SAS Search and Indexing.

heuristic_parse_html

separate the body of an HTML document from its tags using a heuristic algorithm. This operation searches for paragraphs of text without many tags and extracts these bodies of text.

invalidate_duplicates_by_url

prevent multiple copies of the same document from being returned.

modify_field_name

change the name of a field in an input document.

parse_html

separate the text from the HTML mark-up tags in an input HTML document.

parse_xml

separate the text from the XML mark-up tags in an input XML document.

send

save each input document to each pipeline server.

strip_html

remove the mark-up tags and return only the text from an HTML-formatted field in a document.

substitute

use regular expressions to modify fields that match a specified pattern. Regular expressions are used to locate patterns. For more information, see Appendix A.
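
The idea of an ordered list of document processors, each acting on the fields of a document, can be sketched as follows. The functions borrow the names of the processors described above, but they are simplified stand-ins written for this illustration and do not reproduce the product's implementations or options. In particular, the tag-removal regular expression is naive, and the sample URL and phone-number pattern are made up.

```python
import re
from html import unescape


def strip_html(doc, field_name="body"):
    # Naive tag removal for illustration; real HTML parsing is more involved.
    text = re.sub(r"<[^>]+>", " ", doc.get(field_name, ""))
    doc[field_name] = " ".join(unescape(text).split())
    return doc


def default_title_from_url(doc):
    # Name a document that lacks a title after the last part of its URL.
    if not doc.get("title"):
        doc["title"] = doc["url"].rstrip("/").rsplit("/", 1)[-1]
    return doc


def substitute(pattern, replacement, field_name="body"):
    # Build a processor that rewrites text matching a regular expression.
    compiled = re.compile(pattern)

    def processor(doc):
        doc[field_name] = compiled.sub(replacement, doc.get(field_name, ""))
        return doc

    return processor


def run_pipeline(docs, processors):
    seen_urls = set()
    for doc in docs:
        if doc["url"] in seen_urls:
            continue  # stand-in for invalidate_duplicates_by_url
        seen_urls.add(doc["url"])
        doc = dict(doc)
        for processor in processors:  # processors run in the listed order
            doc = processor(doc)
        yield doc


processors = [strip_html, default_title_from_url, substitute(r"\d{3}-\d{4}", "[phone]")]
incoming = [
    {"url": "http://example.com/news/launch.html",
     "body": "<p>Call <b>555-0100</b> &amp; ask.</p>"},
    {"url": "http://example.com/news/launch.html", "body": "<p>duplicate</p>"},
]
results = list(run_pipeline(incoming, processors))
print(results)
```

The order matters in this sketch for the same reason that it matters in the pipeline server: the substitute step operates on the text that strip_html leaves behind, so the markup has to be removed first.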

3.4.3 The Export Operations Performed by the Pipeline Server

Use the pipeline server to perform the following export operations. The export operation is a function of the pipeline server. You can configure each of the following operations for the pipeline server:

export_csv

transfer document text into a comma-separated format. The columns in this file match the fields that are processed and placed into the output file. The identification (ID) field in this file tracks the URL that identifies the document. The output can be used by many applications. For example, use this file type to place a document into a spreadsheet.

export_to_files

save each document to a separate file whose name is based on a hash of its contents.

export_to_odbc

send documents to a database or to another ODBC provider.

export_to_sentiment_analysis_workbench

pass the information directly to the SAS Sentiment Analysis Workbench application for further analysis.

Note: By default, when SAS Search and Indexing is installed, documents are sent to the index if they are not exported to other applications.
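
Two of the export operations lend themselves to a short illustration: writing documents as comma-separated rows whose columns match the document fields, and deriving a file name from a hash of the document contents. The sketch below uses only the Python standard library. The function names echo the processor names but are hypothetical, and the choice of SHA-1 and of the .txt extension is an assumption, because the guide does not state which hash or naming scheme export_to_files uses.

```python
import csv
import hashlib
import io


def export_csv(docs, fields):
    """Return CSV text with one column per field and one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for doc in docs:
        writer.writerow({name: doc.get(name, "") for name in fields})
    return buffer.getvalue()


def hashed_filename(doc, extension=".txt"):
    """Derive a file name from the document body (SHA-1 is an assumption)."""
    digest = hashlib.sha1(doc["body"].encode("utf-8")).hexdigest()
    return digest + extension


docs = [
    # The id field holds the URL that identifies the document.
    {"id": "http://example.com/a", "title": "Hello, world", "body": "First document"},
    {"id": "http://example.com/b", "title": "Second", "body": "Another document"},
]

csv_text = export_csv(docs, ["id", "title", "body"])
print(csv_text)
print(hashed_filename(docs[0]))
```

A name based on a content hash means that two documents with identical contents map to the same file, while any change in the contents produces a different file name.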

3.5 How the Indexing Server Works

After the gathered documents are processed, they are passed to the indexing server that builds a searchable index of the documents. Each document in the index consists of a set of fields. These fields are populated with the data in the matched fields of the documents passed to the indexing server by the pipeline server. Use the different field types in the index to enable different types of queries. If you choose to enable facetted search, you can also specify fields as labels and enable intuitive search.

3.6 Querying the Index

3.6.1 Overview of Querying

The query server and the query web server work together to obtain queries and pass them to the index. You can format the query web server to display the matched documents in the search window. To see a statistical analysis of these queries, use the selections that are available for the query statistics server.

3.6.2 Using the Query Server

The query server controls the flow of queries to the index and the matches that are returned to input queries. You can see a log file for the query server if you want to run a check on the server.

You can design a Web page that enables users to input queries and to obtain search results. However, you can also use the query server with an application that does not require an interface to search the index. In this case, you can write a custom program to provide a connection between the query server and your application.

3.6.3 Using the Query Web Server

The query web server provides the capabilities that are necessary to customize a Web-browser interface for the query server. Use this window to specify the following types of parameters:

Searches

specify simple or advanced searches. Simple searches match an input term. Advanced searches match field names and take various types of operators. Advanced searches limit the results and return them more accurately. You can also specify facetted search using labels. These labels enable users to follow related search terms to intuitively locate the results that they seek.

Sort the results

decide whether to return search results based on relevancy, date, or the number of matching terms or fields.

Format the user interface and results

select the way that results are displayed in the custom user interface that you design. You can also specify themes and colors.

3.6.4 Using the Query Statistics Server

The query statistics server enables you to monitor the queries entered from the query server. Choose to perform this monitoring by specifying a date range. You can choose to see any, or all, of the following statistics:

Most frequent queries

see a list of the most frequent query terms and the number of occurrences for each term.

Most frequent queries without matches

see a list of the most frequent query terms that did not locate results in the index. You can also see the number of occurrences for each term.

Query rate by hour

see the number of queries for each hour in a day.

Query rate by day

see the number of queries for each day of the week.

Query rate by month

see the number of queries for each month in a year.

Query rate for all time

see the number of queries since you installed the SAS Information Retrieval Studio application.
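
The statistics listed above amount to grouping a query log in different ways over a date range. The following sketch computes them from a small in-memory log. The log format (a time, a query string, and a match count for each entry) and the query_statistics function are assumptions made for this example; the guide does not describe how the query statistics server stores its data.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def query_statistics(log, start, end):
    """Summarize logged queries whose timestamp falls within [start, end]."""
    selected = [entry for entry in log if start <= entry["time"] <= end]
    return {
        "most_frequent": Counter(e["query"] for e in selected).most_common(),
        "most_frequent_without_matches": Counter(
            e["query"] for e in selected if e["matches"] == 0).most_common(),
        "rate_by_hour": dict(Counter(e["time"].hour for e in selected)),
        "rate_by_day": dict(Counter(WEEKDAYS[e["time"].weekday()] for e in selected)),
        "rate_by_month": dict(Counter(e["time"].month for e in selected)),
        "total": len(selected),
    }


log = [
    {"time": datetime(2013, 3, 4, 9, 15), "query": "crawler quota", "matches": 12},
    {"time": datetime(2013, 3, 4, 9, 40), "query": "facetted search", "matches": 0},
    {"time": datetime(2013, 3, 5, 14, 5), "query": "crawler quota", "matches": 12},
    {"time": datetime(2013, 4, 1, 9, 0), "query": "facetted search", "matches": 0},
    {"time": datetime(2013, 5, 1, 9, 0), "query": "out of range", "matches": 3},
]

stats = query_statistics(log, datetime(2013, 3, 1), datetime(2013, 4, 30))
for name, value in stats.items():
    print(name, value)
```

The last log entry falls outside the requested date range, so it does not contribute to any of the statistics.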

3.7 Defining Labels for Facetted Search

Facetted search enables end users to query the index using clusters of related labels to intuitively locate the information that they seek. The labels that appear in the interface are those that occur most frequently in the matching documents. End users can navigate between labels without using the Back button in the Web interface or breadcrumbs. Instead, users select meaningful terms and navigate by using them to refine their query.

Labels are used to identify matching categories and concepts in indexed documents. In other words, labels use the names of the SAS Content Categorization Studio categories and concepts to cluster matching documents. You specify these labels when you create a SAS Content Categorization Studio project and upload it to SAS Content Categorization Server.

Users can click on labels to see related documents, or select a document to view related labels. These webs of links, provided by the document classifications, are an alternative to the linear paths provided by breadcrumbs, or to the Back button in the browser.
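
The two label behaviors described above, showing the labels that occur most frequently in the matching documents and refining the result set when a user selects a label, can be sketched in a few lines. The document structure (a list of labels on each document) and the function names are hypothetical; they illustrate the concept and do not reflect how the query web server is implemented.

```python
from collections import Counter


def top_labels(matching_docs, limit=3):
    """Return the labels that occur most often in the matching documents."""
    counts = Counter(label for doc in matching_docs for label in doc["labels"])
    return counts.most_common(limit)


def refine(matching_docs, label):
    """Keep only the documents that carry the selected label."""
    return [doc for doc in matching_docs if label in doc["labels"]]


matches = [
    {"id": 1, "labels": ["Finance", "Europe"]},
    {"id": 2, "labels": ["Finance", "Asia"]},
    {"id": 3, "labels": ["Finance", "Europe", "Energy"]},
    {"id": 4, "labels": ["Sports"]},
]

print(top_labels(matches, 2))          # labels offered for the initial query
refined = refine(matches, "Europe")    # the user selects the Europe label
print([doc["id"] for doc in refined])
print(top_labels(refined, 5))          # labels offered after refinement
```

After each selection the label counts are recomputed over the narrower result set, which is what lets users move between related labels instead of stepping back through a linear history.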

3.8 After You Choose Your Components

After you select the SAS Information Retrieval Studio components that fit the document retrieval, processing, and search requirements of your organization, you can construct your application. When you design the application, it is important to consider the order of the processes involved. For example, you cannot use the query web server interface to search documents that are not indexed. These specifications, as well as all of the information necessary to configure each component of SAS Information Retrieval Studio, are explained in the following chapters. It is necessary only to review the chapters that discuss the components that you choose to use.

3.9 Exporting and Importing Component Specifications

After you develop your SAS Information Retrieval Studio application, you can export the specifications for your components. When you choose to use this process, you create an XML file that can be imported into another project. Use this process to create a new project using the old settings for some, or all, of the SAS Information Retrieval Studio components. For more information, see Section 2.14.1 The Import Settings Window on page 118 and Section 2.14.2 The Export Settings Window on page 119.

4 Sample Configurations

- Why You Want to Understand Sample Configurations
- Before You Use a Sample Configuration to Create Your Own Application
- Sample Configurations That Use the Web Crawler
- A Sample Configuration That Uses the File Crawler
- A Sample Configuration That Uses the Feed Crawler

4.1 Why You Want to Understand Sample Configurations

When you understand how some of the sample configurations for SAS Information Retrieval Studio work, you have a better idea of how to build a customized application. For this reason, these examples include the types of processes, specifications, and purposes for these configurations. Specific examples of the settings for the necessary document processors in the pipeline server are also included. Use the pipeline server to select the document processors that act on input documents and prepare them to be handled by another server or application.

Use these samples to understand how the various components work together, and then configure the components for the application that meets the requirements of your organization.

4.2 Before You Use a Sample Configuration to Create Your Own Application

Sample configurations are examples that are designed to be changed to meet your organization’s requirements. It is important to understand the operations that are necessary when you develop an application and then choose to make changes. All of the following information is contained in the appropriate chapters that follow this chapter. For your convenience, a summary of these operations is outlined below:

- Make sure that your document processors are listed in the order of logical operations:

a. Normalize input text. For example, place parse_html, heuristic_parse_html, or document_converter first in the list of document processors. Each of these processors extracts the text from the input document.

b. Process the text. For example, categorize, or extract concepts, extract an abstract, and so on. These processors act only on normalized text.

c. Export the documents to a SAS application or to third-party software. Send only the processed and normalized text that can be used by an index (by default, if you install SAS Search and Indexing, the documents are indexed) or by other applications. These applications include SAS Text Miner and SAS Sentiment Analysis Workbench.

- Click Apply Changes before you leave the tab for any component where you make changes.

- Delete the index if you want all of the gathered documents to be indexed according to the changed settings. If you do not delete the index, the documents that were indexed according to the old settings remain in the index. The documents that are added after you save your changes to the new index are indexed according to the new settings.

- If necessary, stop and restart the web, file, or feed crawler that is running. If you delete the index, stop and restart the web, file, or feed crawler that you chose to build the index.

Hint: When you restart the crawling process, the documents that were previously gathered are collected again.

- If you decide to check the results of your index by entering query terms using the search window, consider the path and scope of your crawl. In other words, if you limit your crawl to SAS documents, do not expect to enter medical terms and locate matches in these documents.
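
The ordering rule in the checklist above (normalize, then process, then export) can be expressed as a simple check on a list of document processor names. The following sketch is a hypothetical helper and not a feature of SAS Information Retrieval Studio. The stage assignments cover only the processors that the guide names in this context, and unlisted processors are ignored rather than guessed at.

```python
# Stage numbers follow the checklist: 0 = normalize, 1 = process, 2 = export.
STAGES = {
    "parse_html": 0,
    "heuristic_parse_html": 0,
    "document_converter": 0,
    "categorizer": 1,
    "concept_extractor": 1,
    "contextual_extractor": 1,
    "extract_abstract": 1,
    "export_csv": 2,
    "export_to_files": 2,
    "export_to_odbc": 2,
    "export_to_sentiment_analysis_workbench": 2,
}


def order_problems(processors):
    """Report processors that appear after a processor from a later stage."""
    problems = []
    highest_stage, highest_name = -1, None
    for name in processors:
        stage = STAGES.get(name)
        if stage is None:
            continue  # processors not covered by the checklist are skipped
        if stage < highest_stage:
            problems.append(f"{name} should come before {highest_name}")
        else:
            highest_stage, highest_name = stage, name
    return problems


good = ["parse_html", "categorizer", "extract_abstract", "export_csv"]
bad = ["categorizer", "parse_html", "export_csv"]
print(order_problems(good))
print(order_problems(bad))
```

In the second list, the categorizer would run on text that still contains HTML markup, which is the situation that the checklist is meant to prevent.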

4.3 Sample Configurations That Use the Web Crawler

4.3.1 A Web Crawler, Indexing, and Searching Configuration

For this configuration, using the web crawler with several other SAS Information Retrieval Studio components, make sure that the following components are installed:

- SAS Information Retrieval Studio
- SAS Search and Indexing
- SAS Web Crawler

Optional application: A category, concepts, or contextual extraction project developed in SAS Content Categorization Studio and loaded on SAS Content Categorization Server.

Note: It is necessary to choose an HTML document processor when you use the web crawler.

To set up a simple project that crawls the Web, builds an index, and configures the query server, complete these steps:

1. Select Web Crawler --> Configuration --> General Settings.

2. Click Auto-detect and the Select an HTTP Proxy window appears. Use this window to select the proxy server that is located between the web crawler and the Internet.

3. Select a server. For example, choose my.default.proxy.server.

4. Click OK and the selected server appears in the HTTP proxy field of the General Settings pane.

5. (Optional) Make changes to the web crawler. For example, use these steps:

a. Increase the number of files that the web crawler can collect from 25 to 3000 in the Quota (files) field.

b. Increase the total size of the files that can be collected to 3000 megabytes in the Quota (megabytes) field.

c. If you increase the Number of downloader threads, the web crawler can access more files quickly. However, too many threads can overwhelm the site that the web crawler is crawling.

6. Click OK and the server appears in the General Settings pane.

7. Select Configuration --> Entry Points to specify the Web site where the web crawler begins its Internet crawl.

8. Click Add and the Add Entry Point window appears.

9. Enter the Web address that the web crawler uses to enter the Internet into the URL field.

10. (Optional) To limit this crawl to the specified site, select Add to scope. If you specify at least one permitted site, every other site is excluded. For more information, see Section 5.2.4 Specify the Scope of the Crawl on page 198.

11. (Optional) Change the limit on the number of files downloaded from this site in the Quota field using the up or down arrow. The lesser of the number entered into this field and the number in the Quota field in the General Settings pane applies.

12. Click OK and this address appears in the Entry Points pane.

13. (Optional) To limit the file types that are returned, click the Filename Extensions tab. For more information, see Section 5.2.5 Exclude Certain Types of Files on page 202.

14. (Optional) To specify user names and passwords for password-protected sites, click the Credentials tab. For more information, see Section 5.2.6 Specify Access Information for Password-Protected Sites on page 203.

15. Click Apply Changes to save the new web crawler configuration.

Note: Do not start the web crawler until you have configured all of the components for your application. If you start the web crawler before you configure the indexing server, delete and rebuild the index.

16. Select Pipeline Server --> Document Processors. The Document Processors pane appears.

17. Click Add and the Add Document Processor window appears.

18. Select heuristic_parse_html.

Hint: You could also select parse_html but the heuristic_parse_html processor uses a heuristic to exclude navigation text.

19. Click Next and the Document Processor: heuristic_parse_html window appears.

20. Leave the default settings or make changes. For more information about these fields, see Section 2.13.14 The Document Processor: heuristic_parse_html Window on page 108.

21. Click Finish and the document processor that you select appears in the Document Processors tab.

22. Repeat Step 17 through Step 21 until you have added all of the document processors that you require. For example, if you want to add labels to enable facetted search, use the content_categorization document processor. For more information, see Section 2.13.4 The Document Processor: content_categorization Wizard on page 78.

In this example, both categories and concepts are added to enable facetted search.

23. Select a document processor and click Edit to make a change to a document processor. If you want to change the functionality of a category or concept field, see Step 26. below.

Note: Be sure to make the heuristic_parse_html document processor the first in the list in the Document Processors pane.

24. Click Apply Changes.

25. Click the Configuration tab in the Indexing Server.

26. (Optional) Leave the default settings or click Edit to make changes to the functionality of an index field.

Hint: If you added fields such as categories or concepts, these field names automatically appear in the Configuration pane. The categories field, and each concept field, has the Label functionality.

27. (Optional) By default, the index is optimized for the English language. To change this setting, click the drop-down arrow and select another language from the drop-down menu that appears.

28. Select Query Web Server --> Configuration --> Matching. Use this pane to set the priorities for field matches. Weights are a relative setting. The priority value that you specify for each field is determined only in relationship to other matched fields in a document.

29. (Optional) To add a field and specify its weight, click Add, and the Add Field window appears.

30. Click the drop-down arrow to select a field that appears in the Configuration pane. For example, select title in the Name field.

31. (Optional) To change the default setting 1 in the Weight field, click the up or down arrow.

32. Click OK and this field and weight appear in the Matching pane.

33. Select the Labels pane to see all of the selected categories, concepts, and facts.

34. (Optional) Select a field and click Edit to make changes to this field. For example, use the Edit Field window that appears to change the caption, or label, name. For more information, see Section 2.14.25 The Add Field Window: Query Web Server Labels Pane on page 148.

35. Click the up or down arrow to change the default setting 10 in the Maximum number of related labels field. This is the highest number of related labels that can be displayed in response to a query. The end user sees these labels after entering a query into the SAS Information Retrieval Studio search window.

36. Click Apply Changes.

37. Select the Web Crawler pane and click Start.

38. Select Query Web Server --> Status.

39. Click the blue hyperlink and the search window appears.

40. Enter a query into the search field in the SAS Information Retrieval Studio search window that appears. For example, enter analytics.

41. Click Search and see the labels that match the returned documents on the left side of the search window. On the right side see the matching documents with links to the full text for each document.

42. To see the statistics for queries, click the Query Statistics Server tab. For more information, see Section 14.3 View the Query Statistics for a Selected Time Period on page 341.
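Step 28 describes field weights as relative settings. The following Python sketch illustrates what "relative" means under a simple, assumed scoring model: a document's score is the sum of the weights of the fields in which the query matched. This model, the function names `score` and `rank`, and the sample documents are assumptions made for illustration; the guide does not describe the scoring formula that the query server actually uses.

```python
def score(matched_fields, weights):
    # Assumed model: add up the weights of the fields where the query matched.
    # Fields without an explicit weight fall back to the default weight of 1.
    return sum(weights.get(field, 1) for field in matched_fields)


def rank(docs, weights):
    # docs maps a document name to the set of fields that matched the query.
    return sorted(docs, key=lambda name: score(docs[name], weights), reverse=True)


# Hypothetical query results: which fields matched in each document.
docs = {"doc_a": {"title"}, "doc_b": {"body"}, "doc_c": {"title", "body"}}

weights = {"title": 3, "body": 1}
scaled = {field: weight * 10 for field, weight in weights.items()}

# Scaling every weight by the same factor leaves the ranking unchanged,
# because only the ratio between the weights matters.
print(rank(docs, weights))
print(rank(docs, scaled))
```

Under this model, setting title to 3 and body to 1 has the same effect on ordering as setting them to 30 and 10.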

4.3.2 The Web Crawler with Exporting and Indexing Processes

You can send the same set of documents, collected by the web crawler, to an index and SAS Text Miner. To perform these operations, configure the pipeline server with the document processors appropriate to the index and to the export operation.

4.4 A Sample Configuration That Uses the File Crawler

For this configuration, using several SAS Information Retrieval Studio components and processes, make sure that the following components are installed:

- SAS Information Retrieval Studio
- SAS Search and Indexing
- SAS Document Conversion

Optional application: A category, concepts, or contextual extraction project developed in SAS Content Categorization Studio and loaded on SAS Content Categorization Server.

Note: It is necessary to choose the document_converter processor for this configuration.

To set up a simple project that crawls the files on your machine and exports these files, complete the following steps:

1. Select File Crawler --> Configuration --> Paths.

2. Click Add to add one, or more, paths to the Paths pane. The Add Path window appears.

3. Select a directory. For example, enter \\MyComputer\Documents\FolderA.

4. Click OK and the path appears in the Paths pane.

5. (Optional) Repeat Step 2 through Step 4 until you have added all of your paths.

6. Click any setting in the other tabs in the Configuration pane that you want to use to configure the file crawler.

7. Click Apply Changes to save the new file crawler configuration.

Note: Do not start the file crawler until all of the configuration processes are complete. If you start the file crawler before configuring components such as the indexing server, delete the index and rebuild it.

8. Select Pipeline Server --> Document Processors. The Document Processors pane appears.

9. Click Add and the Add Document Processor window appears.

10. Select document_converter. This document processor extracts plain text from input document formats such as Microsoft Office and Adobe PDF files. This document processor is relevant for the file crawler, but it can also be used with the web crawler after the parse_html document processor is used.

11. Click Next and the Document Processor: document_converter window appears.

12. (Optional) Change any of the settings in this window. For more information, see Section 2.13.7 The Document Processor: document_converter Window on page 96.

13. Click Finish and the selected document processor appears in the Document Processors pane.

14. Click Add and select export_to_files in the Add Document Processor window that appears.

15. Click Next and the Document Processor: export_to_files window appears.

16. (Optional) Add a field name such as body to fields.

Note: If you add one field, only the specified field is included. In this example, the body field was selected in the Document Processor: document_converter window.

17. (Optional) Make any other changes to the fields in the Document Processor: export_to_files window. For more information, see Section 2.13.9 The Document Processor: export_to_files Window on page 100.

18. Click Finish and the document processor that you added appears in the Document Processors pane.

19. Repeat Step 14 through Step 18 until you have added all of the document processors required. For example, if you want to add labels to enable facetted search, see Section 2.13.4 The Document Processor: content_categorization Wizard on page 78.

Note: If you add any additional document processors, be sure to move them up above the export_to_files document processor in the Document Processors pane.

20. Click Edit to make any changes to your fields.

21. Click Apply Changes.
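The ordering rule in the note above (additional processors go above export_to_files) can be pictured with a minimal Python sketch. It does not show how the pipeline server is implemented. The functions below are hypothetical stand-ins named after the processors, and a document is modeled as a plain dictionary. The sketch only illustrates why a processor placed after the export step cannot influence what is exported.

```python
# Hypothetical stand-ins for document processors; each takes and returns a dict.
exported = []  # snapshots written by the export step


def document_converter(doc):
    # Pretend to extract plain text from a binary format into the body field.
    doc["body"] = doc["raw"].decode("utf-8")
    return doc


def add_labels(doc):
    # Pretend categorization: attach a label when a keyword appears in the body.
    doc["categories"] = ["analytics"] if "analytics" in doc.get("body", "") else []
    return doc


def export_to_files(doc):
    # Snapshot the document exactly as it looks at this point in the pipeline.
    exported.append(dict(doc))
    return doc


def run_pipeline(processors, doc):
    # Apply the processors in list order, top to bottom.
    for processor in processors:
        doc = processor(doc)
    return doc


# Labels are added before the export, so the exported copy contains them.
run_pipeline([document_converter, add_labels, export_to_files],
             {"raw": b"SAS analytics overview"})

# Labels are added after the export, so the exported copy lacks them.
run_pipeline([document_converter, export_to_files, add_labels],
             {"raw": b"SAS analytics overview"})

for snapshot in exported:
    print(sorted(snapshot))
```

The first snapshot includes a categories field and the second does not, which is the effect that the ordering note guards against.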

4.5 A Sample Configuration That Uses the Feed Crawler

For this configuration, using several SAS Information Retrieval Studio components and processes, make sure that the following components are installed:

- SAS Information Retrieval Studio
- SAS Search and Indexing

Optional application: A category, concepts, or contextual extraction project developed in SAS Content Categorization Studio and loaded on SAS Content Categorization Server.

Note: It is necessary to choose an HTML document processor for this configuration.

When you set up the feed crawler, you can choose to return either summaries or full-length texts. If the feed provides only summaries, you can enable the feed crawler to follow the links contained in the summaries to the full text of each article. The steps below enable this capability.

To set up the feed crawler, complete the following steps:

1. Select Feed Crawler --> Configuration.

2. Click Add and the Add Feed window appears.

3. Access your Web browser and locate the Web page with the orange box that symbolizes an RSS feed. For example, http://support.sas.com/community/rss/.

4. Click the RSS icon located to the left of the feed that you want. For example, click the icon next to Media Coverage.

5. The feed page appears.

6. Copy the feed URL from the URL field in the browser. Paste this URL into the Feed URL field in the Add Feed window. For example, copy http://www.sas.com/news/mediacoverage/SASRecentMediaCoverage.xml into the Feed URL field.

7. The RSS feed shown in the example above consists of summaries of news articles. For this reason, click the drop-down arrow to select Yes in the Follow links field in the Add Feed window.

8. Click OK in the Add Feed window.

9. Select Pipeline Server --> Document Processors and the Document Processors pane appears.

10. Click Add and the Add Document Processor window appears.

11. Select parse_html. In this example, summaries are collected and the feed crawler is instructed to follow links to the HTML pages that are linked to each summary. (See Step 7. on page 187 where Yes is selected in the Follow Links drop-down menu.)

12. Click Next and the Add document Processor: parse_html window appears.

13. (Optional) Make any changes that you choose.

14. Click Finish. The parse_html document processor appears in the Document Processors pane.

15. Click Apply Changes.

Note: You can also add custom document processors to perform operations on the input feed text. For example, when the Follow links selection is enabled in the Add Feed window, documents that contain both a post and a list of comments or replies are returned. If you want to separate each post into a separate document, write a site-specific document processor. Use this document processor instead of parse_html.
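To make the difference between a summary feed and followed links concrete, here is a minimal Python sketch that reads an RSS 2.0 document with the standard library and lists the URLs that would be fetched when links are followed. The sample feed, the example.com addresses, and the function names `feed_items` and `urls_to_fetch` are invented for illustration and do not represent the feed crawler's own code or API.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample; real feeds carry more elements per item.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://www.example.com/stories/1.html</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/stories/2.html</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def feed_items(feed_xml):
    # Return one dict per <item>, holding the title, summary text, and link.
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "summary": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def urls_to_fetch(items, follow_links):
    # With follow_links off, only the summaries in the feed are used.
    if not follow_links:
        return []
    return [item["link"] for item in items if item["link"]]


items = feed_items(SAMPLE_FEED)
print(urls_to_fetch(items, follow_links=True))
```

The pages behind the returned URLs are HTML, which is why an HTML document processor such as parse_html is needed when Follow links is set to Yes.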

5 Configuring the Web Crawler

- Overview of the Web Crawler
- Configuring the Web Crawler
- Run the Web Crawler
- Troubleshoot with the Log File

5.1 Overview of the Web Crawler

The SAS Web Crawler is controlled by SAS Information Retrieval Studio. The web crawler searches the Internet and returns the documents that it locates, according to the parameters that you set. You specify the types of files to return, the Web addresses where the collection process begins, and the scope of the crawl. You can also specify the user names and passwords that are necessary to crawl password-protected sites.

The web crawler passes the documents that it collects to the proxy server that sends them to the pipeline server. According to the processing that the pipeline server performs, the documents can be sent to an application, database, or to the indexing server where they can be queried.

After the web crawler collects the maximum number of pages allowed, it stops running. You can restart the web crawler at any time.

Notes: If you plan to crawl blogs, user forums, or other time-sensitive data such as press releases, use the feed crawler instead.

5.2 Configuring the Web Crawler

5.2.1 Overview of Configuring the Web Crawler

You configure the web crawler in stages, or according to the parameters set up for each tab in the Configuration pane. Each tab, or set of configurations, defines a specific aspect of the crawler. This section is set up as a how-to guide, but it also contains the background information that is necessary to set the specific parameters for each tab.

Display 5.1 Web Crawler Configuration Pane

Use each of the following sections to configure your web crawler, with one exception: the Credentials information is necessary only when you choose to crawl password-protected sites.

After you make all of your changes, click Apply Changes in the Web Crawler pane. If the web crawler is running when you click this button, the Restart Web Crawler window appears.

Display 5.2 Restart Web Crawler Window

Click Yes. If the web crawler is not running, click Start.

5.2.2 Specify the General Settings

You configure the web crawler to specify how the crawl and download operations work. As you work through each of the steps below, the appropriate background information is included.

To specify the parameters for the web crawler, complete these steps:

1. Click the General Settings tab in the Web Crawler pane.

2. Click Auto-detect to access the Select an HTTP Proxy window.

a. Select a proxy server. For example, choose MyHTTPProxyServer.

The HTTP proxy server is a server that is an intermediary between the crawler and the Web site. The HTTP proxy server is not the proxy server for SAS Information Retrieval Studio. The HTTP proxy server evaluates requests before passing them to the web server.

b. Click OK and this server appears in the HTTP proxy field.

3. Click the up or down arrow to change the default setting of 25 in the Quota (files) field. This is the maximum number of files collected by the web crawler.

4. Click the up or down arrow to change the default limit of 1000 in the Quota (megabytes) field. This is the maximum total size, in megabytes, of all of the collected documents.

5. Click the up or down arrow to change the total number of threads that can be created in the Number of downloader threads field. For example, change this setting to 16. (The default setting is 1.) The more threads you specify, the faster the download process becomes. However, a higher number of threads can also overwhelm a site and shut it down.

6. Click the up or down arrow to change the number of seconds that the web crawler rests between page downloads in the Sleep interval field. (The default setting is 1.) This setting enables the web crawler to be polite. In other words, a single thread does not overwhelm a site with download requests. However, the sleep interval does not protect a site when you use many threads. For example, 100 threads operating on 5-second sleep intervals could potentially launch 100 requests simultaneously to a site.

7. Click the up or down arrow to change the number of seconds before the web crawler stops trying to download a page in the Timeout field. (The default setting is 300.)

8. Click the up or down arrow to change the number of times that the web crawler tries to download a page before it stops in the Maximum number of retries field. (The default setting is 3.)

9. Click the up or down arrow to change the highest number of seconds that the web crawler waits before it tries to download a page again in the Retry delay field. (The default setting is 300.)

10. Click the drop-down arrow to select No in the Respect robots.txt field. (The default setting is Yes.) Select No to ignore a Web site author’s request not to crawl specific portions of a site.

11. Click the drop-down arrow to select No in the Find links in Javascript and Flash field to prohibit the web crawler from following links found in either of these types of code. (The default setting is Yes.)

12. Click the drop-down arrow to select Depth first in the Link traversal order field. (The default setting is Breadth first.)

In the breadth-first mode, the crawler searches all of the links in the point-of-entry page. The crawler then searches all of the links in the first layer of child pages. The crawler repeats this process for the second layer of child pages, and so on, until it has crawled all of the links related to the point-of-entry page. This is a first in, first out operation.

In depth-first mode, the crawler follows one set of links through all of its children. The crawler then backtracks to the next child page and crawls the links of its children, and so on. This process is repeated until all of the links in a page are crawled. This operation drills deep and then backtracks, repeatedly.
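The two traversal orders can be compared with a short Python sketch. It is a generic illustration of breadth-first and depth-first traversal over a small, made-up link graph, not the web crawler's actual code. The page names and the function `crawl_order` are hypothetical.

```python
from collections import deque

# Made-up link graph: page -> links found on that page, in page order.
LINKS = {
    "entry": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def crawl_order(entry, links, depth_first=False):
    frontier = deque([entry])
    seen = {entry}
    order = []
    while frontier:
        # Breadth first takes the oldest URL (first in, first out);
        # depth first takes the newest URL (last in, first out).
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        children = links.get(page, [])
        if depth_first:
            # Push in reverse so the first link on the page is visited first.
            children = reversed(children)
        for child in children:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


print(crawl_order("entry", LINKS))
print(crawl_order("entry", LINKS, depth_first=True))
```

In breadth-first order, both child pages of the entry point are visited before any grandchild. In depth-first order, page a and all of its children are visited before page b.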

5.2.3 Specify Entry Points for the Web Crawler

After you specify the general settings for the web crawler, add the entry points. Entry points are the Web addresses that are used by the crawler to begin its crawl. Unless you specify otherwise in the Scope pane, the entire entry point site and all of its links are crawled. For example, if you want to crawl the SAS Web site, you could enter www.sas.com.

To specify the entry points for the web crawler, complete these steps:

1. Click the Entry Points tab in the Web Crawler pane.

2. Click Add and the Add Entry Point window appears.

3. Enter the Web address for the first site into the URL field.

4. (Optional) To add this address to the Scope pane as an allowed site for the crawl, leave the default selection Yes in the Add to scope rules field. If you do not want to add this address to the Scope pane, click the drop-down arrow and select No.

Note: If you add any URL patterns to the Scope pane, all of the other URLs are excluded from the crawl.

5. By default, the Quota (files) field is set to 100000000. Click the up or down arrow to change this number.

6. Click OK and the Web address that you entered appears in the Entry Points pane.

7. Repeat this process until you have added all of your URLs to the Entry Points pane.

5.2.4 Specify the Scope of the Crawl

After you specify one or more entry points, you can add a list of permitted and excluded sites. By default, the Scope pane is empty, which means that all of the Web addresses on the Internet are allowed. If you specify at least one permitted site, every other site is excluded. This is true whether you specify that a site is part of the scope of the crawl within this pane or in the Add Entry Point window. You can also exclude a segment of a site that you permitted. For example, you could use the Scope pane to limit the crawl to the SAS publications pages and exclude the new books pages from the crawl.

The steps below continue the example from Section 5.2.3 Specify Entry Points for the Web Crawler on page 196: you limit the crawl to one section of the SAS Web site and, within the publications section, exclude any new books.

To specify the scope of the web crawl, complete these steps:

1. Click the Scope tab in the Web Crawler pane.

2. Click Add and the Add Scope Rule window appears.

3. Enter a pattern for a URL, or a regular expression, into the URL Pattern field. Both enable the web crawler to match patterns. For example, type https://support.sas.com/pubscat/complete.jsp.

Note: For information about how to write regular expressions, see Section A.1 Regular Expressions on page 353.

4. Leave the default setting Prefix in the Match type field, or click the drop-down arrow to select Regular Expression. This setting tells the web crawler how to use the characters entered in the URL Pattern field.

A prefix match is one that matches against the beginning of the URL.

5. Leave the default setting Allow in the Action field. (Click the drop-down arrow to select Exclude if you do not want the crawler to download pages from this URL.)

This URL appears in the Add Scope pane.

6. Click Add and a new Add Scope Rule window appears.

7. Continuing with this example, type https://support.sas.com/pubscat/newbooks.jsp into the URL Pattern field.

8. Leave the default selection Prefix.

9. Click the drop-down arrow to select Exclude in the Action field.

10. Click OK and see the complete list of included and excluded URLs. In this example, the web crawler searches only the publications pages of the SAS Web site. It does not search the pages that list recent books.
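The scope behavior described in this section can be sketched in Python as follows. The `ScopeRule` class and the `in_scope` function are hypothetical names used for illustration. Two details are assumptions rather than documented behavior: an Exclude rule is taken to win over an Allow rule when both match, and a regular expression is taken to match anywhere in the URL.

```python
import re
from dataclasses import dataclass


@dataclass
class ScopeRule:
    pattern: str
    match_type: str = "prefix"  # "prefix" or "regex"
    action: str = "allow"       # "allow" or "exclude"

    def matches(self, url):
        if self.match_type == "prefix":
            # A prefix match compares against the beginning of the URL.
            return url.startswith(self.pattern)
        # Assumption: a regular expression may match anywhere in the URL.
        return re.search(self.pattern, url) is not None


def in_scope(url, rules):
    # Assumption: an Exclude rule takes precedence over any Allow rule.
    if any(r.action == "exclude" and r.matches(url) for r in rules):
        return False
    allow_rules = [r for r in rules if r.action == "allow"]
    if not allow_rules:
        # No Allow rules: every address that is not excluded is in scope.
        return True
    # At least one Allow rule: every other site is excluded.
    return any(r.matches(url) for r in allow_rules)


rules = [
    ScopeRule("https://support.sas.com/pubscat/complete.jsp"),
    ScopeRule("https://support.sas.com/pubscat/newbooks.jsp", action="exclude"),
]

for url in (
    "https://support.sas.com/pubscat/complete.jsp?page=2",
    "https://support.sas.com/pubscat/newbooks.jsp",
    "https://www.example.com/",
):
    print(url, in_scope(url, rules))
```

With an empty rule list, `in_scope` returns True for every URL, which corresponds to the empty Scope pane.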

5.2.5 Exclude Certain Types of Files

After you specify the scope of the crawl, you might want to limit the types of files that are returned by the web crawler. For example, you could exclude files that contain programs or images.

To specify the file types that are excluded from a crawl, complete these steps:

1. Click the Filename Extensions tab in the Web Crawler pane.

2. Click the Add button to access the Add Filename Extension window.

3. Enter the extension of the file type that you want to exclude into the Extension field.

Note: The file type extensions are case-sensitive.

4. Click the drop-down arrow to select Exclude to prevent the crawler from gathering this type of file. (The default setting is Allow.) If you set at least one file type to Allow, only the file types with the Allow specification are returned.
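The filename extension rules can be sketched in Python under a few stated assumptions. The function names are hypothetical, extensions are assumed to be entered without a leading period, and the extension is assumed to be taken from the path portion of the URL. The comparison is deliberately case-sensitive, as the note in Step 3 states.

```python
import posixpath
from urllib.parse import urlparse


def extension_of(url):
    # Take the extension from the URL path, ignoring any query string.
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lstrip(".")


def extension_permitted(url, rules):
    """rules maps an extension (case-sensitive) to "allow" or "exclude"."""
    ext = extension_of(url)
    if rules.get(ext) == "exclude":
        return False
    allowed = {e for e, action in rules.items() if action == "allow"}
    if allowed:
        # Once any extension is set to Allow, only allowed types are returned.
        return ext in allowed
    return True


exclude_programs = {"exe": "exclude"}
print(extension_permitted("http://www.example.com/setup.exe", exclude_programs))
print(extension_permitted("http://www.example.com/setup.EXE", exclude_programs))

only_pdf = {"pdf": "allow"}
print(extension_permitted("http://www.example.com/guide.pdf", only_pdf))
print(extension_permitted("http://www.example.com/index.html", only_pdf))
```

Because matching is case-sensitive, a rule for exe does not cover EXE in this sketch. Covering both spellings would require two entries.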

5.2.6 Specify Access Information for Password-Protected Sites

Some sites are password-protected. To crawl these sites, you provide the web crawler with the information that it requires to download these pages.

To specify the sites and the user and password information that the web crawler uses to download pages, complete these steps:

1. Click the Credentials tab in the Web Crawler pane.

2. Click the Add button to access the Add Credential window.

3. Enter the URL followed by a colon (:) and the port number for the host into the Site field. For example, enter www.medscape.com:80.

4. Enter the name of a user who has access to this site into the Username field. For example, enter UserMD.

5. Enter the password for this user into the Password field. For example, enter mdpassword. When you enter this password, each character of the password is represented by the asterisk symbol (*) in the Credentials pane.

6. Click OK and this site with its credentials is added to the Credentials pane.

7. Click Apply Changes in the Web Crawler pane.
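The format of the Site field (a host, a colon, and a port number) can be checked before you enter it. The following Python sketch shows one way to do so. The function name `parse_site` is hypothetical and is not part of SAS Information Retrieval Studio.

```python
def parse_site(site):
    """Split a Site value such as 'www.medscape.com:80' into host and port."""
    host, sep, port = site.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {site!r}")
    return host, int(port)
```

With the example value from Step 3, `parse_site("www.medscape.com:80")` yields the host `www.medscape.com` and the port `80`. A value without a port raises an error.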


5.3 Run the Web Crawler

After you configure the web crawler, you can run it. You should configure all of the components that you plan to use before you run the web crawler. Click Apply Changes after you modify any of the default settings for these components.

To start, restart, and stop the web crawler, complete any of these actions:

- Click Start in the Web Crawler pane.

The appropriate message appears in the Status pane after any of these operations.

- Click Stop to stop the crawl.

- (Optional) If you make any changes to the configuration while the web crawler is running, click Apply Changes. The Restart Web Crawler window appears.

Click Yes. If the web crawler is not running, click Start.

- Click Revert to return to the last applied settings.


5.4 Troubleshoot with the Log File

The Log pane enables you to see a history of the operations performed by the web crawler. Use the contents of the Log pane when you require customer support.

To access and use the Log pane, complete these steps:

1. Click the Log tab in the Web Crawler pane.

2. (Optional) Change the default setting of 20 in the Number of lines field. This field specifies the maximum number of timestamped lines that are displayed for the searchable log file in this pane.

3. Click Retrieve to display the specified number of lines in the log file.

4. (Optional) Enter a search term into the Text to highlight field. For example, enter sas.

5. Click Find to locate all instances of the entered term in this pane.
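The Retrieve and Find operations above behave much like taking the tail of a log file and then searching it. The Python sketch below illustrates that idea. It assumes that the most recent lines are the ones displayed and that the search is case-sensitive. Neither point is stated in this guide. The function names are hypothetical.

```python
from collections import deque


def retrieve(lines, number_of_lines=20):
    """Keep at most the last `number_of_lines` lines of the log."""
    return list(deque(lines, maxlen=number_of_lines))


def find(lines, term):
    """Return (line_index, column) for every occurrence of term."""
    hits = []
    if not term:
        return hits
    for i, line in enumerate(lines):
        start = line.find(term)
        while start != -1:
            hits.append((i, start))
            start = line.find(term, start + len(term))
    return hits
```

For example, `find(retrieve(log_lines), "sas")` returns the position of each match of `sas` within the displayed lines.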


6 Configuring the File Crawler

- Overview of the File Crawler
- Configure the File Crawler
- Run the File Crawler
- Troubleshoot with the Log File

6.1 Overview of the File Crawler

The file crawler crawls your organization’s file system and returns documents according to the parameters that you set. These specifications include the paths to crawl, file types to return, and whether the crawl is continuous. They also include the oldest date and maximum file size that can be returned. The file crawler passes the documents to the proxy server, which passes them to the pipeline server. According to the processing that the pipeline server performs, the files can be sent to an application, to a database, or to the indexing server where they can be queried.

6.2 Configure the File Crawler

6.2.1 Overview of Configuring the File Crawler

You configure the file crawler using the four tabs in the Configuration pane. Each stage, or set of configurations, defines a specific aspect of the crawler. This section is set up as a how-to guide, but it also contains the background information that is necessary to set the specific parameters for each stage.


Display 6.1 File Crawler Configuration Pane

Use each of the following sections to configure your file crawler. After each change, click Apply Changes in the File Crawler pane. If the file crawler is running when you click this button, the Restart File Crawler window appears.

Click Yes.

6.2.2 Specify the General Settings

You configure the file crawler to specify how the crawl and download operations work for the files that the file crawler collects. As you work through each of the steps below, the appropriate background information is included.

To specify the parameters for the file crawler, complete these steps:


1. Click the General Settings tab in the File Crawler pane.

2. If necessary, change the default setting of 10 that is specified in the Maximum file size field. Increasing or decreasing this number changes the maximum size of the documents that are collected. For example, you might want to gather white papers but not books.

3. Use the calendar for the Oldest date field to select the first date for the crawl. Documents that have creation dates before the date specified in the Oldest date field are not collected by the file crawler.

4. Click to select Yes in the Crawl continuously field. (The default setting is No.) Choose to continuously crawl your file system only when it is constantly updated.

5. Click to select Yes in the Encapsulate XML files field. (The default setting is No.) With the default setting, only top-level XML tags are turned into fields. If you select Yes, you can exert more control over this process. For example, you can turn nested fields into tags. If you select Yes, also select the parse_xml document processor.
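The effect of the Maximum file size and Oldest date settings can be pictured as a simple filter. The Python sketch below is an illustration only. The function name `is_collected` is hypothetical, the unit of the size value is not given in this section and is left to the caller, and treating a file that is exactly at the limit as collected is an assumption.

```python
from datetime import date


def is_collected(size, created, max_file_size, oldest_date):
    """Return True if a file passes both general settings.

    size and max_file_size must use the same unit (assumed, not documented).
    Files created before oldest_date are not collected.
    """
    return size <= max_file_size and created >= oldest_date
```

For example, with a maximum file size of 10 and an oldest date of 2010-01-01, a file of size 3 that was created on 2010-09-17 is collected, but a file of size 40 or a file created in 2009 is not.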

6.2.3 Specify the Paths to Crawl

After you specify the general settings for the file crawler, you can enter a list of paths to crawl. When you specify a list of paths to crawl, no other paths are crawled. These paths should be absolute instead of relative. For Windows fileshares, use Universal Naming Convention (UNC) names instead of local paths.

To specify the paths for the file crawl, complete these steps:

1. Click the Paths tab.

2. Click Add and the Add Path window appears.

3. Enter an absolute path into the Path field. If you specify a Windows fileshare, enter a name that is written according to UNC conventions.

4. Click OK and the path appears in the Paths pane.

5. Repeat this process until you have added all of the paths that you want crawled.


6.2.4 Specify the Paths to Exclude

After you specify the general settings, you can enter a list of paths that should not be crawled. This pane enables you to specify limits within the crawl that you set in the Paths pane. For example, choose to exclude the Trash folder on your local computer. Or choose one or more subdirectories to exclude from the crawl. These paths should be absolute instead of relative. For Windows fileshares, use Universal Naming Convention (UNC) names instead of local paths.

To specify the paths that the file crawler does not crawl, complete these steps:

1. Click the Paths to Exclude tab.

2. Click Add and the Add Path to Exclude window appears.

3. Enter an absolute path into the Path field. If you specify a Windows fileshare, enter a name that is written according to UNC conventions.

4. Click OK and the path appears in the Paths to Exclude pane.

5. Repeat this process until you have added all of the paths that you want to exclude.
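The interaction between the Paths and Paths to Exclude tabs can be sketched as follows. This Python example is a rough model of the rules described in the two preceding sections, not the file crawler's code. The function name `is_crawled` and the fileshare names in the usage note are hypothetical.

```python
from pathlib import PureWindowsPath


def is_crawled(file_path, paths, paths_to_exclude):
    """Return True if file_path is under a crawl path and not excluded."""
    target = PureWindowsPath(file_path)
    roots = [PureWindowsPath(p) for p in paths]
    excluded = [PureWindowsPath(p) for p in paths_to_exclude]
    for entry in roots + excluded:
        # Both lists require absolute paths (UNC names for fileshares).
        if not entry.is_absolute():
            raise ValueError(f"path must be absolute: {entry}")

    def under(root):
        return target == root or root in target.parents

    return any(under(r) for r in roots) and not any(under(e) for e in excluded)
```

For example, if the crawl path is `\\fileserver\share\docs` and the excluded path is `\\fileserver\share\docs\Trash`, a file in `\\fileserver\share\docs\reports` is crawled, and a file in the Trash folder is not.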

6.2.5 Specify the Types of Files to Return

You can choose to limit the types of files returned by the crawl. If you do not specify any files to return, all of the files that the file crawler locates are sent to the proxy server. To specify the file types for the file crawl, complete these steps:

1. Click the Filename Extensions tab.

2. Click Add and the Add Filename Extension window appears.


3. Enter a file extension into the Extension field. For example, enter txt or png. If you specify any file extension, only those file types are returned. No other files are collected.

4. Click to select Exclude in the Action field. (The default setting is Allow.)

5. Click OK and the extension appears in the Filename Extensions pane.

6. Repeat Step 2 through Step 5 until you have added all of the file extension types that you want returned.

6.3 Run the File Crawler

After you configure the file crawler, you can run it. You should also configure all of the components that you plan to use before you run the file crawler. Click Apply Changes after you modify the default settings for any of these components.

To start, restart, and stop the file crawler, complete any of these steps:

- Click Start in the File Crawler pane.

The appropriate message appears in the Status pane after any of these operations.


- (Optional) If you make any changes to the configuration while the file crawler is running, click Apply Changes. The Restart File Crawler window appears.

Click Yes. If the file crawler is not running, click Start.

- (Optional) Click Revert to return to the last applied settings.

- To stop the crawl, click Stop.

6.4 Troubleshoot with the Log File

The Log pane enables you to see a history of the operations performed by the file crawler. Use the contents of the Log pane when you require customer support.

To access and use the Log pane, complete these steps:


1. Click the Log tab in the File Crawler pane.

2. (Optional) Change the default setting of 20 in the Number of lines field. This field specifies the maximum number of timestamped lines that are displayed for the searchable log file in this pane.

3. Click Retrieve to display the specified number of lines in the log file.

4. (Optional) Enter a search term into the Text to highlight field. For example, enter filename.

5. Click Find to locate all instances of the entered term in this pane.


7 Configuring the Feed Crawler

- Overview of the Feed Crawler
- Configure the Feed Crawler
- Run the Feed Crawler
- Troubleshoot with the Log File

7.1 Overview of the Feed Crawler

The feed crawler crawls the Internet for frequently updated content and returns these documents according to the parameters that you set. Like the web crawler, the feed crawler is used only for Web content. Unlike the web crawler, the feed crawler seeks newly updated information in the form of a Web feed. You specify the parameters for the Web address where the feed crawler begins its crawl and determine whether it follows links and crawls continuously. The feed crawler passes the documents to the proxy server, which passes them to the pipeline server. According to the processing that the pipeline server performs, the files can be sent to an application, to a database, or to the indexing server where they can be queried. For example, the feed crawler is used to gather documents that express sentiment from blogs and customer reviews for SAS Sentiment Analysis Workbench.


7.2 Configure the Feed Crawler

7.2.1 Overview of Configuring the Feed Crawler

You configure the feed crawler in stages, or according to the parameters set up for each tab in the Configuration pane. Each tab, or set of configurations, defines a specific aspect of the crawler. This section is set up as a how-to guide, but it also contains the background information that is necessary to set the specific parameters for each stage.

Display 7.1 Feed Crawler Configuration Pane

Use each of the following sections to configure your feed crawler.

After you make all of your changes, click Apply Changes in the Feed Crawler pane. If the feed crawler is running when you click this button, the Restart Feed Crawler window appears.

Click Yes. If the feed crawler is not running, click Start.

7.2.2 Specify the General Settings

You configure the feed crawler to specify the location of the feed. As you work through each of the steps below, the appropriate background information for each setting is included.


To specify the parameters for the feed crawler, complete these steps:

1. Select Configuration --> General Settings in the Feed Crawler pane.

2. Click Auto-detect to access the Select an HTTP Proxy window.

a. Select a proxy server. For example, choose MyHTTPProxyServer.

The HTTP proxy server is a server that is an intermediary between the crawler and the Web site. The HTTP proxy server is not the proxy server for SAS Information Retrieval Studio. The HTTP proxy server evaluates requests before passing them to the web server.

b. Click OK and this server appears in the HTTP proxy field.

3. Click to select No in the Crawl continuously field. (The default setting is Yes.) The crawler seeks updated items posted to the feed over time, unless this operation is prohibited.


4. If necessary, change the default setting of 600 seconds in the Recrawl interval field.

5. (Optional) Enter a different name into the User agent field if you want to change the name of this crawler.

7.2.3 Specify the Feeds

The feed crawler collects postings, whether full texts or summaries, from both RSS and Atom feeds. For more information, see Section 2.14.5 The Edit Entry Point Window on page 125.

You specify the feed URLs and whether links are followed in the Feeds tab. To perform these operations, complete these steps:

1. Select Configuration --> Feeds in the Feed Crawler pane.


2. Click Add to access the Add Feed window.

a. Paste an address for a feed into the Feed URL field. For example, choose http://www.sas.com/success/SASRecentSuccess.xml.

b. Click to select No in the Follow links field. (The default setting is Yes.) This setting specifies whether links from the Web address set in the Feed URL field are crawled. If you select Yes, these links might lead to other feeds.

There are two common types of feeds: full content feeds and summary-only feeds. In a full content feed, all of the information that you seek is present in the feed itself. In a summary-only feed, only a brief description of the content is passed. In this case, the link is followed, like a traditional Web page link, to locate the rest of the content. If you want to crawl summary-only feeds, select Yes in the Follow links field. Also select the parse_html document processor in the pipeline server. However, unlike the web crawler, the follow links operation is not recursive.

c. Click OK and this information appears in the Feeds pane.

3. Enter the Web address that you want to crawl into the Feed URL field.
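The difference between reading a feed and following its links can be sketched in Python. The example below parses a small, made-up RSS 2.0 document with the standard library and, when `follow_links` is set, records the one link per posting that would be fetched. It does not download anything, it does not handle Atom feeds, and it is not how the feed crawler is implemented. The names `SAMPLE_RSS`, `read_feed`, and `to_fetch` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up feed with one full posting and one summary-only posting.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>Full posting</title>
    <link>http://example.com/posts/1</link>
    <description>The complete text of the posting is here.</description>
  </item>
  <item>
    <title>Summary only</title>
    <link>http://example.com/posts/2</link>
    <description>A brief description.</description>
  </item>
</channel></rss>"""


def read_feed(rss_text, follow_links=True):
    """Return one dict per <item> in an RSS 2.0 document."""
    postings = []
    for item in ET.fromstring(rss_text).iter("item"):
        posting = {
            "title": item.findtext("title", default=""),
            "text": item.findtext("description", default=""),
        }
        if follow_links:
            # The link is followed once; links on that page are not followed.
            posting["to_fetch"] = item.findtext("link")
        postings.append(posting)
    return postings
```

With `follow_links=False`, only the text that is present in the feed itself is kept, which is sufficient for a full content feed.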


7.3 Run the Feed Crawler

After you configure the feed crawler, you can run it. You should configure all of the components that you plan to use before you run the feed crawler. Click Apply Changes after you modify any of the default settings for these components.

To start, restart, and stop the feed crawler, complete any of these steps:

- Click Start in the Feed Crawler pane.

The appropriate message appears in the Status pane after any of these operations.

- (Optional) If you make any changes to the configuration while the feed crawler is running, click Apply Changes. The Restart Feed Crawler window appears.

Click Yes. If the feed crawler is not running, click Start.

- (Optional) Click Revert to return to the last applied settings.

- To stop the crawl, click Stop.


7.4 Troubleshoot with the Log File

The Log pane enables you to see a history of the operations performed by the feed crawler. Use the contents of the Log pane when you require customer support.

To access and use the Log pane, complete these steps:

1. Click the Log tab in the Feed Crawler pane.

2. (Optional) Change the default setting of 20 in the Number of lines field. This field specifies the maximum number of timestamped lines that are displayed for the searchable log file in this pane.

3. Click Retrieve to display the specified number of lines in the log file.

4. (Optional) Enter a search term into the Text to highlight field. For example, enter close.

5. Click Find to locate all instances of the entered term in this pane.


8 Configuring the Proxy Server

- Overview of the Proxy Server
- View the Status of the Proxy Server and Input Files
- Configure the Proxy Server
- Run the Proxy Server
- Troubleshoot with the Log File

8.1 Overview of the Proxy Server

The proxy server sends the documents that it receives from one or more crawlers to the pipeline server for processing. The proxy server can also pass the same set of documents to a pipeline server that was set up with a second installation of SAS Information Retrieval Studio. When the proxy server passes the documents that it receives, it passes the same set to each server. This functionality makes it possible for you to perform different operations on the same set of documents on the respective servers. Because the proxy server is an intermediary server, you configure only those specifications that are necessary for it to pass documents to another server. Use the proxy server for the following purposes:

- Pause the flow of documents if you want to perform maintenance on one or more of the components in your application.

- Send the documents to multiple pipeline servers for the following reasons:

  - Create mirrors. These are pipeline servers that perform identical processing operations on the input documents. Multiple servers are used for backup purposes in case of hardware failure.

  - Use the same set of documents for multiple purposes. In this case, send the input documents to pipeline servers that are configured differently. For example, send the documents to one pipeline server for indexing and searching. Send this same set of documents to a second pipeline server that analyzes the sentiment located in them.

You can find information about the number of documents at different stages in this server and see a log file.

For all of these reasons, the proxy server is an integral part of any customized configuration of SAS Information Retrieval Studio.
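The two purposes described above, pausing the flow of documents and sending the same set of documents to every pipeline server, can be modeled in a few lines of Python. This is a simplified sketch for illustration and is not the proxy server's implementation. The class name `ProxySketch`, its counters, and its queue are hypothetical. In particular, the queue here holds documents that are waiting to be forwarded while the sketch is paused.

```python
from collections import deque


class ProxySketch:
    """Forward every received document to each configured back end."""

    def __init__(self, backends):
        self.backends = backends  # callables that accept one document
        self.queue = deque()
        self.paused = False
        self.received = 0
        self.processed = 0

    def receive(self, doc):
        self.received += 1
        self.queue.append(doc)
        self._flush()

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self._flush()

    def _flush(self):
        # While paused, documents stay in the queue.
        while self.queue and not self.paused:
            doc = self.queue.popleft()
            for send in self.backends:
                send(doc)  # each back end gets the same document
            self.processed += 1
```

For example, if one back end indexes documents and a second back end analyzes sentiment, both receive identical input, and a document that arrives during a pause is delivered to both after the sketch resumes.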

8.2 View the Status of the Proxy Server and Input Files

Use the Status pane to see whether the proxy server is running and where the input files are in the various processing stages. This pane provides view-only displays that show the current statistics for the proxy server. You can also use the Status pane to troubleshoot any backups in the input process.

By default, the proxy server is running. If you add any servers in the Configuration pane, click Apply Changes. You can then see the statistics for these operations in the Status pane.

To see whether the proxy server is running and to see the statistics for this server, complete the following steps:


1. Click Status in the Proxy Server pane.

2. See the number of documents that were input to the proxy server in the Documents received field. For example, 25 documents were received.

In this example, the Quota (files) setting was set in the General Settings pane of the Configuration pane at 25 for the web crawler. This is the only crawler in this configuration. This crawler has returned the maximum number of allowed documents.

3. See the number of documents that the proxy server sent to the pipeline server in the Documents processed field. For example, see 25.

4. See the number of documents that are waiting to be received by the proxy server in the Documents queued field. For example, see 0.

5. See the date and time that the last document entered this server in the Last documents received field. For example, 2010-09-17 is the date and 12:46 is the time.

6. See the date and time that the last document was processed by this server in the Last documents processed field. See the example in Step 5.

7. Click Refresh if the maximum number of documents specified has not been returned.

If you see a discrepancy, you can use the Log pane to see the connections and errors that might be the cause. For more information, see Section 8.5 Troubleshoot with the Log File on page 230.


8.3 Configure the Proxy Server

When you configure the proxy server, you can either add a new pipeline server or you can change the settings for the default pipeline server. This server appears by default in the Configuration pane under the Host heading. Use the Configuration pane to add pipeline servers to the proxy server or to change the Host, Port, or Status settings. You can also add multiple pipeline servers. Choose to add these servers for backup purposes or to specify different types of processes for input documents.

Note: The same input documents are passed to each pipeline server.

To add a new pipeline server to the proxy server, complete the following steps:

1. Click Configuration in the Proxy Server pane.


2. Click Add and the Back-end Server window appears. Use this window to add another pipeline server to the customized application that you are building.

3. Enter the name of the machine into the Host field. For example, enter Mirror1.

4. If the default setting of 9004 in the Port field is incorrect, change it. For example, change the port to 9100.

5. Click OK and the new server is added to the Configuration pane. (The new server is automatically started and its status is running.)

6. (Optional) Repeat Step 2 through Step 5 to add more servers to your pipeline.

7. Click Apply Changes in the Proxy Server pane.

8.4 Run the Proxy Server

If you make any configuration changes to the proxy server, or to another component of SAS Information Retrieval Studio, you can restart the proxy server. (By default, the proxy server is always running.) Click Apply Changes after you modify the default settings for any of these components.

To start, stop, pause, resume, or apply changes to the proxy server, complete any of these steps:


- If you have stopped the proxy server for any reason, click Start in the Proxy Server pane.

The appropriate message appears in the Status pane after any of these operations.

- (Optional) Click Stop and the proxy server ceases its running process.

- (Optional) Click Pause to interrupt the running process.

- If you have stopped or paused the proxy server, click Resume.

- If you make any changes to the configuration while the proxy server is running, click Apply Changes.

8.5 Troubleshoot with the Log File

The Log pane enables you to see the history of the operations performed by the proxy server. Use the contents of the Log pane when you require customer support.

To see this Log pane, complete these steps:


1. Click Log in the Proxy Server pane.

2. Use the default selection, Connections, or click to select Errors.

3. If necessary, change the default setting of 20 in the Number of lines field. This field specifies the number of lines that are displayed for the searchable log file in this pane.

4. Click Retrieve to see the specified number of lines in the log file pane below.

5. Enter the text that you want to locate in the Text to highlight field. For example, enter 10.

6. Click Find to see these terms, highlighted in bold font, in the log file pane. For example, see each instance of 10 highlighted in the dates and find it in the queue.


9 Configuring the Pipeline Server

- Overview of the Pipeline Server
- Configuring the Pipeline Server
- See Input Documents with the Document Inspector
- Add a New Field to Input Documents
- Match Categories, Concepts, and Facts
- Export Categories and Concept Matches
- Advanced Installation
- Run the Pipeline Server
- Troubleshoot with the Log File

9.1 Overview of the Pipeline Server

9.1.1 Processing Documents and Related SAS Applications

9.1.1.A How Document Processing and Export Operations Work Together

The pipeline server enables you to select document processors that act on input documents to prepare these texts to be handled by another server or application. These processes are known as normalization, analysis, and export operations. For example, normalization includes the process of stripping Web documents of their HTML markup tags and using SAS Document Conversion on documents collected by the file crawler. You can then use SAS Content Categorization Studio to analyze input documents. Finally, export documents to SAS programs such as SAS Sentiment Analysis Workbench and SAS Text Miner.

Note: Before you can analyze or export your documents, make sure that any required software is installed and running.

For more information about installing these software applications, see SAS Information Retrieval Studio: Installation Guide or the installation guide for each SAS application that you want to use.
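To make the normalization step more concrete, the following Python sketch strips HTML markup tags from a Web document by using the standard library. It only illustrates the general idea of tag stripping. It is not the parse_html document processor, and the names `TagStripper` and `strip_tags` are hypothetical. Unlike a real processor, it also keeps any script or style text that the page contains.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_text):
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    # Join the text pieces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())
```

For example, `strip_tags("<h1>Books</h1><p>Recent <b>SAS</b> titles</p>")` returns the plain text `Books Recent SAS titles`.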

9.1.1.B Process Documents

The pipeline server performs many operations that are integral to document handling and processing. These normalization, analysis, and export operations include category matching, concept extraction, contextual extraction, document conversion, and exporting documents to other applications. The analysis operations of the SAS Content Categorization Studio document processors are also used for the labels associated with faceted search. These labels, or captions, are specific to the categorization, concepts extraction, or contextual extraction matching technologies in SAS Content Categorization Studio. When you choose to create labels, a series of windows enables you to track an input document field. You can track this field from the crawler through the pipeline and indexing servers and into the query web server. You can see the results in input documents when a query is entered in the search page.

Make sure that the following programs are installed and running before you try to process documents using them:

SAS Content Categorization Server

identifies categories, concepts, and facts from SAS Content Categorization Studio and SAS Contextual Extraction Studio. Make sure that the taxonomies that you want to apply to the documents input to SAS Information Retrieval Studio are uploaded to SAS Content Categorization Server before you configure the pipeline server.

SAS Document Conversion

extracts plain text from documents such as Microsoft Word and PDF files.

9.1.1.C Export Processed Documents

Export the documents that were collected by a crawler to the following programs:

SAS Content Categorization Studio

uses input files for training and testing purposes.

SAS Sentiment Analysis Workbench

analyzes the sentiment in input documents.

SAS Text Miner

identifies topics and themes in input documents.

9.2 Configuring the Pipeline Server

9.2.1 Overview of the Document Processors

An input document is defined as one chunk of text returned as the result of a crawl. This text can be a news article, a file received from the file crawler, or a PDF document. Each of these documents is processed using the operations that you specify before the document is passed to another server. By default, if SAS Search and Indexing is installed, all input documents go to the indexing server. This is true even if the documents are also sent to other applications.

Choose your document processors according to the operations that you want to perform:

First, consider the crawlers that you defined and the document types that they are configured to return in order to normalize the input text. For example, the web crawler can return PDF and Microsoft Word documents in addition to HTML documents. For this reason, choose a processor such as parse_html or heuristic_parse_html to strip the HTML tags from the text. You can also select the document_converter operation to extract text from documents such as Microsoft Word and PDF files.

Second, if you choose to use the feed crawler, you might select invalidate_duplicates_by_url. This operation ensures that only one copy of a document is passed to another process. This document processor is important for applications such as SAS Sentiment Analysis Workbench where the freshness of the document matters.

Third, choose the content_categorization document processor if you want to enable facetted search using the categorizer, concept, or contextual extraction processors. You can also use these processors to categorize and extract concepts and facts from your input documents before passing them to another operation.

Fourth, use the export_csv and the export_to_files processors to export the normalized (and analyzed) documents in a format that can be used by another application. To send documents directly to SAS Sentiment Analysis Workbench, specify export_to_sas_sentiment_analysis_workbench.
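The effect of removing duplicates by URL can be pictured with a short sketch. The Python below is not the product's implementation of invalidate_duplicates_by_url. The function name drop_duplicates_by_url, the dictionary-based document model, and the choice to keep the first copy seen are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

Document = Dict[str, str]  # hypothetical model: field name -> field value


def drop_duplicates_by_url(docs: Iterable[Document]) -> List[Document]:
    """Keep only the first document seen for each value of the url field."""
    seen = set()
    unique = []
    for doc in docs:
        url = doc.get("url")
        if url is not None and url in seen:
            continue  # a copy with this URL was already passed on
        if url is not None:
            seen.add(url)
        unique.append(doc)
    return unique


# Hypothetical feed in which the same URL is delivered twice.
feed = [
    {"url": "http://example.com/a", "body": "first copy"},
    {"url": "http://example.com/a", "body": "second copy"},
    {"url": "http://example.com/b", "body": "another article"},
]
unique_docs = drop_duplicates_by_url(feed)
```

In this sketch only two documents continue to the next step, which is the behavior that matters for a downstream application that should not count the same article twice.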

Note: You can also add deployment-specific document processors by placing them into the bin/postrpocessors subdirectory of your installation.

By default, when SAS Search and Indexing is installed, all input documents go to the indexing server. This is true even if the documents are also sent to other applications.

After you consider these available operations, use the Add Document Processor window to add and configure your document processors. You can choose to use one document processor, or you can build a pipeline that orders several processors. For example, use the heuristic_parse_html operation to extract paragraphs of text without their HTML tags. The next processor in the pipeline might be the export_to_files processor that enables you to export the file in XML or in text format. In either case, you can specify whether the document stops here in the pipeline or goes to the indexing server.

The operations that you specify in the Document Processors pane occur in the same order that they are listed in this pane. You can specify the document processors in any order and use the Move up and Move down buttons to reorder these operations. If document processing operations are incorrectly ordered, unexpected results might occur.
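To see why the order of processors matters, consider the following minimal sketch. It is a conceptual model only, not the pipeline server's code. The names run_pipeline, strip_tags, and count_words are hypothetical stand-ins, and a document is modeled as a Python dictionary of fields.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, str]
Processor = Callable[[Document], Document]


def strip_tags(doc: Document) -> Document:
    """Stand-in for an HTML-stripping step: copy raw into body without tags."""
    out = dict(doc)
    out["body"] = re.sub(r"<[^>]+>", "", doc.get("raw", "")).strip()
    return out


def count_words(doc: Document) -> Document:
    """Stand-in for an analysis step that reads the body field."""
    out = dict(doc)
    out["word_count"] = str(len(doc.get("body", "").split()))
    return out


def run_pipeline(doc: Document, processors: List[Processor]) -> Document:
    """Apply the processors in the order in which they are listed."""
    for processor in processors:
        doc = processor(doc)
    return doc


source = {"raw": "<p>one two three</p>"}
good = run_pipeline(source, [strip_tags, count_words])
bad = run_pipeline(source, [count_words, strip_tags])
```

With the normalization step first, the analysis step sees a populated body field and counts three words. With the order reversed, the analysis step runs before body exists and counts zero, which is the kind of unexpected result that incorrect ordering produces.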

9.2.2 Checking Program Installations

Document processors are specified and configured within the pipeline server. If you choose to use one of the following processing operations, make sure that the necessary application is running:

- SAS Document Conversion: If you want to process documents such as Microsoft Word and PDF files, install and run this application before you specify this document processor.

- SAS Content Categorization Server: Identify categories, concepts, and facts from SAS Content Categorization Studio and SAS Contextual Extraction Studio. Make sure that the taxonomies that you want to apply to the documents input to SAS Information Retrieval Studio are uploaded to SAS Content Categorization Server. Run SAS Content Categorization Server before you configure the pipeline server. Before you use SAS Content Categorization Server, create projects using SAS Content Categorization Studio and SAS Contextual Extraction Studio.

You can also export documents to another SAS program after you specify the document processor and start the program.

- SAS Content Categorization Studio: Use the files that you export from SAS Information Retrieval Studio for training and testing purposes.

- SAS Sentiment Analysis Workbench: Analyze the sentiment expressions in input documents.

- SAS Text Miner: Identify entities.

For more information, see the SAS Information Retrieval Studio: Installation Guide.

9.2.3 Configure the Document Processors

To add the parse_html processor, or to use this section as an example of how to add a different processor, complete these steps:

1. Select Pipeline Server --> Configuration --> Document Processors.

2. Click Add in the Document Processors pane. The Add Document Processor window appears.

3. Select parse_html.

4. Click Next and the Document Processor: parse_html window appears.

5. Leave the default specification, raw, or enter a new field name in the input-field. The processor uses this field to obtain the unmodified document data. The raw field name identifies the field where the original, unmodified content of the HTML document was placed.

6. Leave the default specification, title, in the title-output-field. You can also enter a different field name where the processor stores the plain text of the document title.

7. Leave the default specification, body, in the body-output-field. You can enter a different field where the processor stores the body text located in the input document. This field is used by other applications, such as SAS Content Categorization Studio, when they are part of the processing pipeline.

8. Change the default entry to 1 in the output-metadata field to have this processor populate other fields, such as keywords and description, with values taken from the HTML document.

9. The entry 1 in the require-mime-type field specifies that a document is checked to ensure that it is an HTML document. If you enter 0, this check is not required.

10. Leave the mimetype entry in the mime-type-field, or specify a different field.

11. The entry 1 in the base64-input field specifies that the text is preserved in the MIME content transfer encoding. If you enter 0, this encoding is not used.

12. Click Finish to save these settings.

13. (Optional) Continue adding the document processors to the pipeline.

14. (Optional) To make changes to the specifications for a processor, click Edit.

15. (Optional) To change the ordering of the processors in the pipeline, click Move up, or Move down until the order is correct.

Note: For more information, see Section 2.8.4 The Document Processors Tab on page 41.
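The settings in the steps above can be read as parameters of a single transformation. The following Python sketch illustrates that reading with the standard html.parser module. It is not the parse_html processor itself. The function parse_html_sketch, the helper class, the sample page, and the decision to leave a document untouched when the MIME type check fails are assumptions. The parameter names mirror the field names in the window.

```python
import base64
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects title text, body text, and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title_parts, self.body_parts, self.meta = [], [], {}
        self._stack = []  # open title/body elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in ("title", "body"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            target = self.title_parts if self._stack[-1] == "title" else self.body_parts
            target.append(data)


def parse_html_sketch(doc, input_field="raw", title_output_field="title",
                      body_output_field="body", output_metadata=False,
                      require_mime_type=True, mime_type_field="mimetype",
                      base64_input=True):
    # Assumption: a document that fails the MIME type check is passed through.
    if require_mime_type and "html" not in doc.get(mime_type_field, ""):
        return doc
    data = doc[input_field]
    if base64_input:
        data = base64.b64decode(data).decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(data)
    parser.close()
    out = dict(doc)
    out[title_output_field] = " ".join(" ".join(parser.title_parts).split())
    out[body_output_field] = " ".join(" ".join(parser.body_parts).split())
    if output_metadata:
        out.update(parser.meta)  # for example, keywords and description
    return out


# Hypothetical input document with base64-encoded HTML in the raw field.
page = ('<html><head><title>Example Page</title>'
        '<meta name="keywords" content="alpha, beta"></head>'
        '<body><h1>Heading</h1><p>Some text.</p></body></html>')
doc = {"mimetype": "text/html",
       "raw": base64.b64encode(page.encode("utf-8")).decode("ascii")}
parsed = parse_html_sketch(doc, output_metadata=True)
```

After the call, the sketch has written the plain-text title and body into their output fields and, because output_metadata is enabled, copied the keywords metadata into a field of the same name.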

9.3 See Input Documents with the Document Inspector

Use the Document Inspector pane to see all of the versions of the input document. You can see each version, simultaneously, at each stage in the pipeline server. The original document changes at each stage of the pipeline, but you can still see its original text. This snapshot operation is available for one document at a time, but only when the documents are moving through the pipeline server. In this pane, you can see each document as it moves, whether it is intact or split into multiple documents.

Display 9-1 Viewing a Document in the Document Inspector Pane

To use the Document Inspector pane to see a document, use the following steps:

1. Click Take Snapshot.

2. Click on a document processing operation that appears in the Processing Stage pane. For example, click on heuristic_parse_html. A document number appears in the Document pane.

3. Click the number in the Document pane and the fields in this document appear in the Field pane. For example, click on 1.

4. Click on one of the fields that the document consists of in the Field pane. For example, click on body.

5. See the contents of the selected field in the Document Inspector pane. For example, see http://money.cnn.com/2010/01/21/technology/sas_best_companies.fortune/.

9.4 Add a New Field to Input Documents

You can add a new field, with a constant value, to each of the input documents. Use this feature to assign the same field to each indexed document. For example, you might add a field that specifies that the documents are indexed from a particular source, during a specific time period, or for other defining purposes. When you choose to add a field, you also specify the alphanumeric string that is assigned to each document.

To add a field to each input document, complete these steps:

1. Select Pipeline Server --> Document Processors. The Document Processors pane appears.

2. Click Add. The Add Document Processor window appears.

3. Select add_field.

4. Click Next. The Document Processor: add_field window appears.

5. Enter the name of the field that you want to add to each input document into field. For example, type Date.

6. Enter the value that populates the added field into the value field. For example, type 062011.

7. Click Finish to see this addition in the Document Processors pane.

8. Click Stop to halt the Pipeline Server.

9. Click Apply Changes.

10. Perform Step 8 through Step 9 above for the crawler that you are using.

11. Perform Step 8 through Step 9 above for the Proxy Server.

12. Perform Step 8 through Step 9 above for the Indexing Server.

13. Select Pipeline Server --> Document Inspector.

14. Click Take Snapshot.

15. Click Processing Stage to see a list of the document processors. Select a processor. For example, click add_field. The document number appears in the Document pane.

16. Click the document number under Document to see the fields for this document in the Field pane. For example, click 1 in the Document pane to see concepts, Date, promotion, and id in the Field pane.

17. Click a field in the Field pane to see the related information in the empty pane. For example, click Date to see that the value 062011 that you assigned to the add_field processor was assigned to document 1.
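Conceptually, the add_field processor stamps every document with the same field and value. The following Python sketch shows that idea with the Date and 062011 values from the steps above. The function and the dictionary-based document model are illustrative assumptions, not the product's code.

```python
from typing import Dict, Iterable, List

Document = Dict[str, str]  # hypothetical model: field name -> field value


def add_field(docs: Iterable[Document], field: str, value: str) -> List[Document]:
    """Return copies of the documents with the same field/value pair added."""
    return [{**doc, field: value} for doc in docs]


docs = [{"id": "1", "body": "first"}, {"id": "2", "body": "second"}]
tagged = add_field(docs, "Date", "062011")
```

Every document in tagged now carries a Date field with the value 062011, which corresponds to what the Document Inspector shows for document 1 in the last step.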

9.5 Match Categories, Concepts, and Facts

You can match categories, concepts, and facts in input documents using the content_categorization Document Processor. You use this processor to specify the categories, and classifier and grammar concepts, that you created and defined in SAS Content Categorization Studio. The concepts that you define in the SAS Content Categorization Studio add-on program SAS Contextual Extraction Studio are used as concepts or facts. Any concept that is developed in SAS Contextual Extraction Studio and specified with a PREDICATE or SEQUENCE rule is a fact. The content_categorization Document Processor is the client for SAS Content Categorization Server. The categories, concepts, and facts are applied by SAS Content Categorization Server to the documents processed by SAS Information Retrieval Studio.

The following example uses concepts. If you want to use categories or facts, make the appropriate substitutions. Also see Chapter 10: Creating Facetted Search Labels Using content_categorization. This chapter uses the Document Processor: content_categorization wizard to create labels for facetted search.

To map concepts to labels, complete these steps:

1. Select Pipeline Server.

2. Click to access the Document Processors pane.

3. Click Add and the Add Document Processor window appears.

4. Select content_categorization. The Document Processor: content_categorization window appears.

5. (Optional) By default, the name of the server where SAS Content Categorization Server is running is specified in the Hostname field. For example, see localhost. You can enter a different server name if SAS Content Categorization Server is running on another server.

6. (Optional) By default, the port number for the specified server is entered in the Port field. For example, see 6500. Click or to select a different port number.

7. (Optional) By default, 10 is entered into the Timeout field. Click or to select a different number. This is the number of seconds that the Pipeline Server waits before this server stops attempting to download an input field.

8. Click Next. The Document Processor: content_categorization window appears. Use this window to add any of the projects that are uploaded to SAS Content Categorization Server to SAS Information Retrieval Studio.

9. Click Add and the Document Processor: content_categorization window appears.

10. (Optional) Click and select Concept extraction unless this processor is already selected.

11. (Optional) Click and select a project that you added, unless the project that you want to use is already selected. For example, select Entities.

12. Click Ok and the project appears in the Document Processor: content_categorization window. For example, see Entities under Project and Concept extraction under Type. Your selection limits the available concepts to those in the project.

13. (Optional) Continue to add projects using Step 9. through Step 12. The concepts in each of the projects that you select are available to match your input documents.

14. Click Next. The Document Processor: content_categorization window appears. By default, the Input tab is displayed.

15. (Optional) Enter the fields in your input documents where you want to locate matches for your concepts. Enter these fields as a comma-separated list into the Input fields field. If you leave this field blank, all fields, with the exception of those listed in the Input fields to exclude field, are searched.

16. (Optional) By default, fields that contain information about the document are listed in the Input fields to exclude field. You can add additional fields, or delete fields from this list:

id,url,feed_url,raw,mimetype,date,pdate,source,promotion,ctime,atime,mtime

If you edit this list, be sure to insert a comma (,) between each field.

17. (Optional) If you make any changes, click Finish to save these edits.

18. Click Concepts and the Concepts pane appears.

19. Click Add to specify the concepts that are matched in input documents. The Document Processor: content_categorization window appears.

20. Click to select the concept that you want to match in the Concept field. For example, select LOCATION.

Hint: Only the concepts that are part of the selected project are available in the drop-down menu that appears.

21. (Optional) By default, the name of the concept is entered into the Field name field. For example, see location. Enter a new name, if you choose.

22. (Optional) By default, the name of the label for the facetted search is entered into the Caption field. For example, see Location. Enter a new caption, if you choose. For more information about facetted search, see Chapter 10: Creating Facetted Search Labels Using content_categorization.

23. (Optional) By default, %c: %i is entered into the Format field. These symbols indicate that information about the concept (%c) followed by information about the entity (%i) is output. You can enter different symbols from those listed in Table 9-1.

24. (Optional) By default, ; (the semicolon) is used to separate the output fields. Enter a different separator character if you choose.

Table 9-1: Default Format Symbols

Symbol   Description
%c       Match the concept name.
%p       Add to %c to include the path with the concept name.
%m       Match the text.
%i       Match the information associated with the entity, or the match text if no information is available.
%I       Match the information associated with the entity unconditionally.
%%       Match the literal percent sign.
x        Use as a modifier, such as in %xc, to request XML escaping.

25. (Optional) Click Copy Defaults to revert to the concepts entries in the Concepts tab.

26. Click Ok to save your changes. The Document Processor: content_categorization window appears.

27. See the newly entered concept with its field name, and caption. For example, see Location under Concept, location under Field name, and Basketball Player under Caption.

28. (Optional) If you want to continue to add concepts, click Add. Repeat Step 19 through Step 26 on page 253 until you have added all of the concepts that you want to use for facetted search.

Note: By default, you can add a maximum of 10 concepts to the project. To change this number, see Section 13.2.1 Displays with or without Labels on page 310.

29. (Optional) By default, concepts is entered into the Default field name field. You can choose to enter a different name into this field.

30. (Optional) By default, Concepts is entered into the Default caption field. You can choose to enter a different name into this field.

31. (Optional) By default, %c: %i is entered into the Default format field. You can choose to enter different symbols into this field. You can edit this entry using any of the symbols in Table 9-1 on page 252 with the exception of %I.

32. (Optional) By default, ; (semicolon) is entered into the Default separator field. Enter a different separator character if you choose.

33. (Optional) By default, 15 is entered into the Max concepts field. This is the highest number of concepts that can be located in an input document. Click or to enter a different number.

34. Click Finish to save these settings.

35. If an index was in the process of building while you added captions for your concepts, the Delete Index window might appear:

36. Click Yes to delete the index.

37. When this operation is complete, see that the name that you entered into Field name appears in the Configuration pane of the indexing server.

38. Click Start in the main Pipeline Server window to restart the Pipeline Server.

39. When you click the Add button in the Matching pane of the query web server, you can select this field in the Add Field window. This caption name appears as a field in the Matching pane of the Query Web Server.

This caption also appears in the user interface when a matching term is located in an input document.
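Two of the settings in this procedure lend themselves to a short illustration: the choice of fields to search (Input fields and Input fields to exclude) and the expansion of the format symbols in Table 9-1. The Python sketch below models both. It is an interpretation of the descriptions above, not the behavior of SAS Content Categorization Server. The function names are hypothetical, %p is not modeled, an explicit Input fields list is assumed to take precedence over the exclude list, and %I is assumed to produce an empty string when no information is available.

```python
import re
from xml.sax.saxutils import escape

DEFAULT_EXCLUDE = ("id,url,feed_url,raw,mimetype,date,pdate,source,"
                   "promotion,ctime,atime,mtime")


def fields_to_search(doc, input_fields="", exclude=DEFAULT_EXCLUDE):
    """Names of the document fields that would be searched for matches."""
    wanted = [f.strip() for f in input_fields.split(",") if f.strip()]
    if wanted:
        # Assumption: an explicit list is used as given.
        return [f for f in wanted if f in doc]
    excluded = {f.strip() for f in exclude.split(",") if f.strip()}
    return [f for f in doc if f not in excluded]


def expand_format(fmt, concept, match_text, info=None):
    """Expand %c, %m, %i, %I, %% and the x modifier (XML escaping)."""
    values = {
        "c": concept,
        "m": match_text,
        "i": info if info else match_text,  # fall back to the match text
        "I": info or "",                    # assumption: empty when no info
        "%": "%",
    }

    def substitute(match):
        text = values[match.group(2)]
        return escape(text) if match.group(1) else text

    return re.sub(r"%(x?)([cmiI%])", substitute, fmt)


# Hypothetical document and matches for a LOCATION concept.
doc = {"id": "1", "url": "http://example.com", "title": "Trip",
       "body": "Flying to Cary and Raleigh"}
searched = fields_to_search(doc)
labels = ";".join(expand_format("%c: %i", "LOCATION", m)
                  for m in ["Cary", "Raleigh"])
```

With the default exclude list, only the title and body fields are searched in this sketch. With the default format and separator, the two matches are rendered as a single string in which each entry has the form concept name, colon, match text, and the entries are joined by a semicolon.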

9.6 Export Categories and Concept Matches

You can export matches on your category and concept fields using file, CSV, and ODBC operations.

To export matched category and concept fields, complete these steps:

1. Use the steps in Section 9.5 Match Categories, Concepts, and Facts on page 246.

2. (Optional) If you plan to export your matched fields without indexing them, deselect the Label field in the index check box.

3. Deselect the Label field in the query web server check box.

4. Select one of the following operations:

File export

fields are exported as files

CSV export

fields are exported in comma-separated format

ODBC export

fields are exported into a database

5. Click Finish.
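As a rough picture of what a comma-separated export of matched fields might contain, the following Python sketch writes selected fields of each document as one CSV row. The layout (one row per document, one column per exported field, a header row) is an assumption for illustration and is not taken from the export_csv processor. The function name and sample documents are hypothetical.

```python
import csv
import io


def export_matches_csv(docs, fields):
    """Write the chosen fields of each document as one CSV row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for doc in docs:
        # Documents without a match get an empty cell for that field.
        writer.writerow({name: doc.get(name, "") for name in fields})
    return buffer.getvalue()


docs = [
    {"id": "1", "location": "LOCATION: Cary;LOCATION: Raleigh", "body": "..."},
    {"id": "2", "body": "no match"},
]
csv_text = export_matches_csv(docs, ["id", "location"])
```

Only the id and location columns are written, so fields that are not selected for export, such as body, do not appear in the output.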

9.7 Advanced Installation

When you choose to use the advanced installation, you can configure two or more pipeline servers. When you choose this type of SAS Information Retrieval Studio configuration, you can perform some of the document processing operations on one server. This pipeline server can send a copy of the processed documents to another pipeline server where more document processors can act on them.

9.8 Run the Pipeline Server

By default, the pipeline server is running. Configure all of the components that you plan to use. Click Apply Changes after you modify any of the default settings for these components. Perform these operations before you view the statistics for the pipeline server.

To start, restart, and stop the pipeline server, complete any of these steps:

- Click Start in the Pipeline Server pane.

The appropriate message appears in the Status pane after any of these operations.

See the progress of the input documents in the Status pane:

a. The Overall - Pending table cell is always empty.

b. See how many documents have finished all of the processing operations in the Overall - Finished table cell. For example, 32.

c. See how many XML documents are in the process of having their XML tags removed in the XML parsing - Pending table cell. For example, 1.

d. See how many XML documents have completed the process of XML tag removal in the XML parsing - Finished table cell. For example, 31.

e. See how many documents are in the pipeline process in the Document processing - Pending table cell. For example, 22.

f. See how many documents have completed all of the pipeline operations in the Document processing - Finished table cell. For example, 8.

g. See how many documents are going to the indexing server in the Sending to the indexer - Pending table cell. For example, 7.

h. See how many documents have completed the indexing process in the Sending to the indexer - Finished table cell. For example, 0.

- (Optional) If you make any changes to the configuration while the pipeline server is running, click Apply Changes.

- (Optional) Click Revert to return to the last applied settings.

- (Optional) Click Refresh to see any changes in this pane.

- To stop the pipeline server, click Stop.

9.9 Troubleshoot with the Log File

The Log pane enables you to see the operations performed by the pipeline server. Use the contents of the Log pane when you require customer support.

To access and use the Log pane, complete these steps:

1. Click the Log tab in the Pipeline Server pane.

2. Use the default selection Connections. Click to select Errors.

3. Click or to change the default setting of 20 in the Number of lines field. This field specifies the number of lines that are displayed for the searchable log file in this pane.

4. Click Retrieve to display the specified number of lines in the log file.

5. (Optional) Enter a search term into the Text to highlight field. For example, enter target machine.

6. Click Find to locate all instances of this term in this pane.
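The Retrieve and Find operations resemble taking the tail of a log and searching it for a term. The following Python sketch illustrates that idea only. It assumes that Number of lines selects the most recent lines and that the search ignores case, neither of which is stated above. The function names and the sample log lines are hypothetical and are not actual pipeline server output.

```python
def retrieve_lines(log_text, number_of_lines=20):
    """Return the last number_of_lines lines of the log text."""
    lines = log_text.splitlines()
    return lines[-number_of_lines:] if number_of_lines > 0 else []


def find_term(lines, term):
    """Return (line number, line) pairs for lines that contain the term."""
    needle = term.lower()
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if needle in line.lower()]


# Hypothetical log content for illustration.
sample_log = "\n".join([
    "connection opened",
    "no response from target machine",
    "connection closed",
])
recent = retrieve_lines(sample_log, number_of_lines=2)
hits = find_term(recent, "target machine")
```

Here the two most recent lines are retrieved, and the search reports the one line that contains the term target machine.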

10 Creating Facetted Search Labels Using content_categorization

- Before You Begin Using This Example

- Creating a Sample Project

- Seeing the Results in the Query Interface

10.1 Before You Begin Using This Example

10.1.1 How the content_categorization Document Processor Creates Facetted Search Labels

Facetted search applies identifying labels to matched documents. These labels enable you to intuitively navigate to the documents that match your input query terms. Unlike traditional search, facetted search enables you to search more intuitively and quickly because the matching texts are pre-organized. (You can also apply in-line tagging using these labels. This tagging can be used by a third-party program at this time.)

Labels are values within fields. These fields can have display names that are specified in the Caption field in the Document Processor: content_categorization windows. Captions do not have formatting restrictions, unlike internal field names that can contain only lowercase English letters.

Figure 10.1 Example of Facetted Labels

10.1.2 Using Related Programs to Define Labels

When you define labels for facetted search, you use the following programs:

- SAS Content Categorization Studio

- (Optional) SAS Contextual Extraction Studio

- SAS Content Categorization Server

Use the following architectural diagram to gain an overview of these applications:

Figure 10.1 Architecture for Facetted Label Creation

Define your labels using the categories and concepts that you specify in SAS Content Categorization Studio with or without SAS Contextual Extraction Studio. Labels apply the matching requirements set by the rules that define categories and concepts. Labels also enable facetted search operations in the query interface of SAS Information Retrieval Studio.

Use SAS Content Categorization Studio alone to develop categories that identify documents based on their subject matter. Also define concepts that locate relevant terms based on rules that are specified by lists of matching terms or parts of speech and other symbols.

Display 10.1 SAS Content Categorization Studio

When you use the add-on SAS Contextual Extraction Studio application with SAS Content Categorization Studio, you can define LITI concepts. These concepts increase matching precision (matches only the relevant texts) and recall (matches all of the relevant texts). LITI concepts differ from the classifier and grammar concepts in SAS Content Categorization Studio because you can mix rule types within a single definition.

Contextual Extraction, or LITI, concepts can also include facts. Facts are rules that are defined by arguments. Arguments are defined by concepts that are related if they are matched by the fact rule. For this reason, facts return related pieces of information in input documents. For example, define facts when you want to identify relationships between drugs, symptoms, and gender.
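The terms precision and recall can be made concrete with a small calculation. The following Python sketch is not part of any SAS product; the function name and the document identifiers are hypothetical and serve only to illustrate the two measures.

```python
from __future__ import annotations


def precision_recall(matched: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of matched items.

    precision: share of the matched items that are relevant.
    recall:    share of the relevant items that were matched.
    """
    true_positives = len(matched & relevant)
    precision = true_positives / len(matched) if matched else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
relevant_docs = {"doc1", "doc2", "doc3", "doc4"}
matched_docs = {"doc1", "doc2", "doc5"}

precision, recall = precision_recall(matched_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

A rule set that matches fewer irrelevant texts raises precision. A rule set that misses fewer relevant texts raises recall.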


Display 10.2 Two Facts in One LITI Concept

Note: Rules appear on only one line. The rules that appear on more than one line in this example are spaced only for illustrative purposes.

When you use facts as labels, you can specify the string that is returned for the label. Each string contains terms that are custom filled according to the matched text.

10.1.3 Mapping to Labels

You can enter the names of the labels for the categories, concepts, and facts that you want to serve as navigation tags for facetted search. These labels link to one or more matched documents in the query interface. For this reason, the names of the labels and the rules that define each taxonomy node should be part of a schema that reflects appropriate ways of searching. For example, if you want drugs to be part of the taxonomy for your SAS Content Categorization Studio project, you might also define the SIDE_EFFECTS, GENDER, and DISEASE concepts.

Use the Document Processor: content_categorization wizard to select categorization, concept, and fact extraction processors. These processors locate matching terms in the input text, or within the document fields that you specify, and return matches.


When you choose categories, SAS Information Retrieval Studio applies all of the categories in the selected project to input texts. Although the default selections for concepts and facts are the same, you can select specific facts and concepts to apply.

All LITI concept definitions that include PREDICATE and SEQUENCE rules are treated as facts. If a LITI concept definition contains one or more facts and other concept rules, the facts and the concepts are applied separately. The default settings in the Document Processor: content_categorization wizard return the matched fact and concept rules for each LITI definition under the same label name. For this reason, consider renaming either the fact or the concept label and field name for each LITI definition that contains a concept and a fact. Choose the display selection that works best for your end users:

Display 10.3 Example of Default Setting


Display 10.4 Facts and Concepts Labeled Differently

PREDICATE and SEQUENCE rules match two, or more, concepts to provide otherwise overlooked relationships in input documents. The related matches are facts. For this reason, facts consist of at least two arguments. In the example above, the arguments for SIDE_EFFECT are drug and sideeffect. These arguments match Topamax and restlessness, respectively, in the input document.
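As a rough illustration of the structure described above, a matched fact can be thought of as a fact name plus a set of named argument values. The Python sketch below is a hypothetical data model, not the representation that SAS Content Categorization Server uses. The class name `FactMatch` and the `label` method are assumptions introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FactMatch:
    """One matched fact: a fact name plus its named argument values (hypothetical model)."""

    name: str
    arguments: dict[str, str]

    def __post_init__(self) -> None:
        # The text states that facts consist of at least two arguments.
        if len(self.arguments) < 2:
            raise ValueError("a fact needs at least two arguments")

    def label(self) -> str:
        """Render the match as NAME(arg: value, ...)."""
        args = ", ".join(f"{key}: {value}" for key, value in self.arguments.items())
        return f"{self.name}({args})"


match = FactMatch("SIDE_EFFECT", {"drug": "Topamax", "sideeffect": "restlessness"})
print(match.label())  # SIDE_EFFECT(drug: Topamax, sideeffect: restlessness)
```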

10.1.4 Before You Build Your SAS Content Categorization Studio Project

Before you develop, or choose to use an existing, SAS Content Categorization Studio project, consider the types of labels that you want to display in the query interface. The category, concept, and fact names that you define in SAS Content Categorization Studio are the default settings for SAS Information Retrieval Studio. (You can also write a custom string that displays these names.) For this reason, use care when specifying names and writing PREDICATE and SEQUENCE rules that specify terms that are visible to the end user.

Also use care when writing rules that return many matches. For example, you might develop a SAS Content Categorization Studio project that includes an EMAIL concept. This concept might contain rules defined by regular expressions that are designed to return all e-mail accounts within internal company documents. The inclusion of this EMAIL concept might not be appropriate for a facetted search on the Web.

Before you upload a SAS Content Categorization Studio project to SAS Content Categorization Server, check the Project Settings - Misc tab of SAS Content Categorization Studio. If there are entries in the XML Default Fields field, remove these entries and leave the XML Default Fields field blank. (These fields apply to categories and to LITI concepts and facts. For this reason, grammar and classifier concepts in SAS Content Categorization Studio are matched regardless of the field entries. Other matches that should occur might not.)

Display 10.5 Project Settings - Misc Tab


Use care when changing rules and uploading projects to avoid propagating the same rule or its variations. For example, you might upload a SAS Content Categorization Studio project to SAS Content Categorization Server. If you change a concept definition and upload the same project with a new name to the server, both rules are available for matching. This is true if you add both projects to your SAS Information Retrieval Studio project using the Document Processor: content_categorization wizard. In other words, matches might be made on concept definitions where one or more definitions are specified using an outdated rule. This behavior can occur because SAS Information Retrieval Studio consolidates all of the rules for categories, concepts, and facts with the same names.

Naming also affects LITI facts and concepts. For example, you might have a LITI concept definition that includes both fact and concept definitions. See the example below:

Figure 10.2 Facts and Concept Rules in One Concept Definition

Note: For the purposes of this example only, each fact rule appears on two lines.

In this example, if matches occurred for both the facts and concepts, all of these matches would return a match on the SIDE_EFFECT concept. However, you can use the content categorization document processor to specify different names for the concept and fact matches.
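The consolidation behavior described in this section can be pictured as a merge that is keyed only by definition name. The following Python sketch is a simplified, hypothetical model of that behavior and is not SAS code. The project name `MedicalProj_v2` and the rule strings are placeholders, not real rule syntax.

```python
from __future__ import annotations

from collections import defaultdict


def consolidate(projects: dict[str, dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge the rules of several projects, keyed only by definition name.

    Because the project name is ignored, an outdated rule and its revision
    end up under the same name and are both available for matching.
    """
    merged: dict[str, list[str]] = defaultdict(list)
    for definitions in projects.values():
        for definition_name, rules in definitions.items():
            merged[definition_name].extend(rules)
    return dict(merged)


# Placeholder rule strings; "MedicalProj_v2" is a hypothetical re-upload
# of the same project under a new name.
projects = {
    "MedicalProj": {"SIDE_EFFECT": ["old rule"]},
    "MedicalProj_v2": {"SIDE_EFFECT": ["revised rule"]},
}
print(consolidate(projects))  # {'SIDE_EFFECT': ['old rule', 'revised rule']}
```

Under this model, giving the concept or the fact a different field name in the document processor is what keeps their matches apart.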


10.1.5 Before You Use the Example in This Chapter

Before you follow the example in this chapter, install the following programs:
- SAS Content Categorization Studio
- (Optional) SAS Contextual Extraction Studio
- SAS Content Categorization Server

Note: This chapter provides one example of how labels are mapped to concepts and facts and viewed in the query interface for SAS Information Retrieval Studio. For more general information, see Section 9.5 Match Categories, Concepts, and Facts on page 246.

To understand how to use these applications with SAS Information Retrieval Studio, complete the following steps:

1. Develop a sample SAS Content Categorization Studio project with, or without, SAS Contextual Extraction Studio concepts and facts. For more information, see SAS Content Categorization Studio: User’s Guide and SAS Contextual Extraction Studio: User’s Guide.


2. Use the Build menu to build, compile, and upload the relevant categories and concepts projects to SAS Content Categorization Server. For more information, see SAS Content Categorization Studio: Administrator’s Guide.

3. Specify the name of the project in the Upload window that appears. The entry in the Server Project Name field can be unique for the SAS Information Retrieval Studio project.

4. (If you uploaded your project a while ago) Select Start --> Programs --> SAS Content Categorization Server to make sure that the server is running.


5. Configure a sample SAS Information Retrieval Studio project using the content categorization document processor that references your sample SAS Content Categorization Studio project. See the following sections of this chapter for step-by-step directions.


6. Check the matching results using the Document Inspector tab. (Always click Take Snapshot before you start a crawler.)

7. Select Start --> Programs --> SAS Information Retrieval Studio --> Query Interface.

8. Enter a query term such as side effects.


9. Click Search to see the results. The facetted search labels appear on the left side of the query interface.

10. (Optional) Check your search results against the original project to ensure that the expected results occur.

10.2 Creating a Sample Project

10.2.1 Access the Projects on SAS Content Categorization Server

Use the Document Processor: content_categorization window to specify the location where SAS Content Categorization Server is running. The category, concept, and fact definitions are uploaded in projects to SAS Content Categorization Server. The content_categorization Document Processor is the client for SAS Content Categorization Server. The category, concept, and fact definitions are applied by SAS Content Categorization Server to the documents processed by SAS Information Retrieval Studio.

Note: The following steps apply when text documents are input to SAS Information Retrieval Studio. If HTML or XML documents are input, use the appropriate parser. For example, add the parse_html document processor.

To specify the location of SAS Content Categorization Server, complete these steps:

1. Select Pipeline Server --> Document Processors.

Hint: The Pipeline Server can either be running or stopped.


2. Click Add. The Add Document Processor window appears.

3. Select content_categorization.

4. Click Next. The Document Processor: content_categorization window appears.

5. (Optional) By default, the name of the server where SAS Content Categorization Server is running is specified in the Hostname field. For example, see localhost. Enter a different server name if SAS Content Categorization Server is running on another server.

6. (Optional) By default, the port number for the specified server is entered into the Port field. For example, see 6500. Click the up or down arrow to select a different port number.

7. (Optional) By default, 10 is entered into the Timeout field. Click the up or down arrow to select a different number of seconds that the pipeline server waits before it stops trying to complete a matching operation.


8. Click Next. You can now add your projects to SAS Content Categorization Server.
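Before you rely on the Hostname and Port settings above, it can help to confirm that something is listening at that address. The Python sketch below is a generic TCP reachability check and is not a SAS utility. The function name `server_reachable` is hypothetical, and the timeout here applies only to opening the connection, not to the matching operation that the Timeout field controls.

```python
import socket


def server_reachable(host: str = "localhost", port: int = 6500, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # localhost and 6500 mirror the example values shown for the Hostname and Port fields.
    print(server_reachable(timeout=2.0))
```

A result of True shows only that the port accepts connections. It does not confirm that the expected projects are loaded on the server.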

10.2.2 Add Projects

Use this section to specify the projects that you uploaded to SAS Content Categorization Server that are used by SAS Information Retrieval Studio. You use these projects to select the categories, concepts, and facts that SAS Information Retrieval Studio applies to input documents. For this reason, you select the project that you want to use for each type of label source.

To add projects, complete the following steps:

1. Click Add in the Document Processor: content_categorization window that appears after you click Next in Step 8 above.

The Document Processor: content_categorization window appears.

2. (Optional) By default, Categorization is selected in the Type field. This is true if you uploaded a categories project to SAS Content Categorization Server. Otherwise, Concept extraction or Contextual extraction is selected. Click the drop-down arrow to change the default selection.

3. (Optional) By default, a project such as Sample is selected in the Project field. Click the drop-down arrow to select a different project that is running on SAS Content Categorization Server with the appropriate matching technology. For example, if Sample is selected, only categories are available for matching. This is true because Categorization is selected in the Type field and Sample was uploaded as a categories project.

4. Click Ok and the project appears in the Document Processor: content_categorization window.

5. (Optional) Repeat Step 1 through Step 4 until you have added all of the projects and their matching types. For example, add MedicalProj to include concepts. Add MedicalProj2 to match LITI concepts and facts. If you have multiple projects for a specific matching technology, you can upload all of these projects.

6. Click Next.


10.2.3 Determine the Input, Matching, and Output

10.2.3.A How Input Documents Are Handled

The content_categorization document processor enables you to specify the input fields, the matching field names, and how the fields are labeled or exported. These specifications determine how the content of input documents is handled.

10.2.3.B Specify Input Fields

Input documents such as HTML and XML documents contain fields, some of which are for informational purposes only. You can choose to limit the fields that are searched for categories, concepts, and facts. You can also exclude some fields. When you specify input fields, none of the unlisted fields are searched.

Before you use the steps below, consider the types of documents that you plan to input. These document types determine the fields to include or exclude.

To select the input fields, complete these steps after you click Next in Step 6 of Section 10.2.2. The Document Processor: content_categorization window appears. By default, the Input tab is selected.

1. (Optional) By default, the Input Fields field is blank. Use a comma-separated list to specify any field names that you want to search for matches for your categories, concepts, and facts. If you leave this field blank, all of the fields are searched with the exception of any fields entered into the Input Fields to exclude field. If you specify any fields, only the listed fields are searched.


2. (Optional) By default, the Input Fields to exclude field contains these fields:

id,url,feed_url,raw,mimetype,date,pdate,source,promotion,ctime,atime,mtime

Using a comma-separated format, you can edit this list.

3. (Optional) Click Finish to save your changes.
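The interaction between the Input Fields list and the Input Fields to exclude list can be summarized in a few lines of logic. The Python sketch below is an illustration of the rules stated in the steps above, not the product's implementation. The function names are hypothetical, and the sketch assumes that the exclusion list is not consulted when specific input fields are listed, which the text does not state explicitly.

```python
from __future__ import annotations

# Default value of the Input Fields to exclude field, as listed above.
DEFAULT_EXCLUDED = "id,url,feed_url,raw,mimetype,date,pdate,source,promotion,ctime,atime,mtime"


def parse_field_list(value: str) -> list[str]:
    """Split a comma-separated field list, ignoring blanks and surrounding spaces."""
    return [name.strip() for name in value.split(",") if name.strip()]


def fields_to_search(
    document: dict[str, str],
    input_fields: str = "",
    excluded: str = DEFAULT_EXCLUDED,
) -> list[str]:
    """Return the names of the document fields that would be searched."""
    include = parse_field_list(input_fields)
    if include:
        # If any fields are specified, only the listed fields are searched.
        return [name for name in document if name in include]
    # Otherwise all fields are searched except the excluded ones.
    exclude = set(parse_field_list(excluded))
    return [name for name in document if name not in exclude]


# Hypothetical document with four fields.
doc = {"id": "42", "url": "http://example.com", "title": "Topamax", "body": "..."}
print(fields_to_search(doc))                        # ['title', 'body']
print(fields_to_search(doc, input_fields="title"))  # ['title']
```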

10.2.3.C Specify Categories

Categories define the information that is located in input documents by specifying the subject matter of the documents. When you select categories, unlike concepts and facts, all of the categories in the project are applied to input documents.

To add categories to the project, complete these steps:

1. Click Categories to access the Categories pane.

2. (Optional) By default, categories is entered into the Field name field. You can enter a new field name.

3. (Optional) By default, Categories is entered into the Caption field. You can enter a new caption name for facetted search.


4. (Optional) By default, %c is entered into the Format field for each individual category name. You can enter a new format that might include %% for a literal percent sign. You can also use x as a modifier to request XML escaping. For example, enter %xc.

5. (Optional) Enter a regular expression into the Category name pattern field. Regular expressions specify the pattern for the category name.

6. (Optional) Enter a string into the Category name replacement field. This string is a constant value that replaces each of the individual category names with the name that you specify here.

7. (Optional) By default, ; (semicolon) appears in the Separator field. Enter a new separator such as a comma (,) for the matched categories.

8. (Optional) By default, the highest number of categories that can be matched in any single input document is 15. Click the up or down arrow to change this default selection in the Max categories field.

Hint: This field specifies the number of categories that have the highest numbers of matches. For example, matches might occur for 25 categories. However, the results for this example are displayed only for the 15 categories with the most matches in input documents.

9. Click Finish.
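The category settings above (Format, Category name pattern, Category name replacement, Separator, and Max categories) combine into a single output string per document. The Python sketch below shows one plausible reading of how they interact. It is not the product's implementation, the function names are hypothetical, and treating the pattern and replacement as a regular-expression substitution is an assumption.

```python
from __future__ import annotations

import re
from xml.sax.saxutils import escape


def format_category(name: str, fmt: str = "%c") -> str:
    """Expand %c (category name) and %% (literal percent); an x modifier requests XML escaping."""
    def expand(match: re.Match) -> str:
        modifier, symbol = match.group(1), match.group(2)
        if symbol == "%":
            return "%"
        return escape(name) if modifier else name

    return re.sub(r"%(x?)([c%])", expand, fmt)


def label_categories(
    match_counts: dict[str, int],
    fmt: str = "%c",
    separator: str = ";",
    max_categories: int = 15,
    name_pattern: str | None = None,
    name_replacement: str = "",
) -> str:
    """Keep the categories with the most matches, format each one, and join them."""
    # Sort by descending match count; ties are broken alphabetically (an assumption).
    top = sorted(match_counts, key=lambda n: (-match_counts[n], n))[:max_categories]
    if name_pattern:
        # Assumption: the pattern and replacement act like a regex substitution.
        top = [re.sub(name_pattern, name_replacement, n) for n in top]
    return separator.join(format_category(n, fmt) for n in top)


# Hypothetical match counts for one input document.
counts = {"Health": 5, "R&D": 3, "Sports": 1}
print(label_categories(counts, fmt="%xc", max_categories=2))  # Health;R&amp;D
```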

10.2.3.D Specify Concepts

Concepts identify metadata, or data on information. You specify concepts to locate specific types of information in input documents using SAS Content Categorization Studio. You can add the classifier and grammar concepts in SAS Content Categorization Studio or any of the concept types in SAS Contextual Extraction Studio. However, concepts that include SEQUENCE and PREDICATE rules in their definitions are added as Facts. After you upload these concepts to SAS Content Categorization Server, the projects that contain these rules can be applied by SAS Information Retrieval Studio.


Matches for any of the concepts that you specify explicitly appear in the table at the top of the Document Processor: content_categorization window. These matches appear in the specified format and are placed into the specified output field. Matches for any other concepts that are not in the table are assigned the default format. The text of these matches appears in the Default field name.

You can also choose to exclude concepts from matching. For example, exclude all of the matches that are not specified when you leave the Default field name empty in the Concepts tab. If you want to exclude one or more specific concepts from the output, leave the Field name blank when you specify those concepts.

To add concepts to the project, complete these steps:

1. Click Concepts to access the Concepts pane. You can use this pane to add all of the concepts and contextual extraction concepts. If any of the LITI concepts include PREDICATE or SEQUENCE rules, these rules are matched as facts. Access these facts using the Facts pane.

2. Click Add. The Document Processor: content_categorization window appears. Use this pane to specify the settings for each individual concept. These settings override the specifications for all of the concepts in the Concepts pane.

3. Click in the Concept field to select a concept from the available projects. For example, select SIDE_EFFECT from the drop-down menu.

4. (Optional) After you select a concept in Step 3, the name of the selected concept appears in the Field name field.

In this example, the concept SIDE_EFFECT also contains PREDICATE and SEQUENCE rules. For this reason, SIDE_EFFECT appears in the Facts drop-down list also. In order to avoid ambiguity in the search results, you can choose to rename either the concept or the fact. In this example, negativeeffects is entered.

5. (Optional) The name of the selected concept appears in the Caption field after you make a selection in the Concept field. You can enter a new caption name such as Negative Effects. For more information, see Section 9.5 Match Categories, Concepts, and Facts on page 246. You can also use the sample project in Chapter 4: Sample Configurations.

Note: If you change the Field name field, also change the name that appears in the Caption field.


6. (Optional) By default, % is entered into the Format field for the concept name. You can also use any of the symbols in Table 10-1.

7. (Optional) By default, ; (semicolon) appears in the Separator field. You can choose to enter a new separator such as a comma (,).

8. (Optional) Repeat Step 2 through Step 7 until you have added all of your concepts.

9. Click Ok. If you want to reload the default settings, click Copy Defaults.

Table 10-1: Concept Output Format Symbols

Symbol   Description
%c       Output the concept name.
%p       Output the concept name with its path. You can specify %c to include the path with the concept name.
%m       Output the matched text.
%i       Output the information associated with the entity, or the match text if no information is available.
%I       Output the information associated with the entity unconditionally.
%%       Output the literal percent sign.
x        Use as a modifier, such as in %xc, to request XML escaping.


10. See the concepts in the Concepts tab. Make sure that you loaded all of the concepts that you want to use in your project.

11. (Optional) By default, concepts is entered into the Default field name field. You can enter a new field name.

12. (Optional) By default, Concepts is entered into the Default caption field. You can enter a new caption name for facetted search.

13. (Optional) By default, %c: %i is entered into the Default format field for the concept name. You can edit this entry using any of the symbols in Table 10-1 on page 284.

14. (Optional) By default, ; (semicolon) appears in the Default separator field. You can enter a new separator such as a comma (,).

15. (Optional) By default, the highest number of concepts that can be matched in any single input document is 15. Click the up or down arrow to change this default selection in the Max concepts field.

Hint: This field specifies the number of concepts that have the highest numbers of matches. For example, matches might occur for 25 concepts. However, the results for this example are displayed only for the 15 concepts with the most matches in input documents.

16. Click Finish.
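The concept format symbols in Table 10-1 behave like a small template language. The Python sketch below shows how such a format string might be expanded for one concept match. It illustrates the documented symbols only and is not the product's implementation. The `ConceptMatch` class, the path value, and the function name are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class ConceptMatch:
    """Hypothetical model of one concept match."""

    name: str       # concept name, for %c
    path: str       # concept name with its path, for %p (path layout is an assumption)
    text: str       # matched text, for %m
    info: str = ""  # information associated with the entity, for %i and %I


def format_concept(match: ConceptMatch, fmt: str = "%c: %i") -> str:
    """Expand the symbols of Table 10-1; an x modifier (as in %xc) requests XML escaping."""
    values = {
        "c": match.name,
        "p": match.path,
        "m": match.text,
        "i": match.info or match.text,  # falls back to the matched text
        "I": match.info,                # output unconditionally, even if empty
    }

    def expand(token: re.Match) -> str:
        modifier, symbol = token.group(1), token.group(2)
        if symbol == "%":
            return "%"
        value = values[symbol]
        return escape(value) if modifier else value

    return re.sub(r"%(x?)([cpmiI%])", expand, fmt)


m = ConceptMatch(name="SIDE_EFFECT", path="MEDICAL/SIDE_EFFECT", text="restlessness")
print(format_concept(m))            # SIDE_EFFECT: restlessness
print(format_concept(m, "%p=%xm"))  # MEDICAL/SIDE_EFFECT=restlessness
```

The default format used here, %c: %i, is the value that the text gives for the Default format field.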

10.2.3.E Specify Facts

Facts match two, or more, concepts to provide otherwise overlooked relationships in input documents. Facts consist of at least two arguments and are defined by PREDICATE and SEQUENCE rules. If a contextual extraction concept contains either a PREDICATE or a SEQUENCE rule, the rule is treated as a fact by SAS Information Retrieval Studio. Facts are automatically separated from contextual extraction concepts when these concepts are uploaded to SAS Content Categorization Server. The arguments and the matched values appear in the query interface.

Matches for any of the facts that you specify explicitly appear in the table at the top of the Document Processor: content_categorization window. These matches have the specified format. Matches for any other facts that are not in the table are assigned the default format. The text of these matches appears in the Default field name.

You can also choose to exclude facts from matching. For example, exclude all of the matches that are not specified when you leave the Default field name empty in the Facts tab. If you want to specify one or more facts to exclude, leave the Field name blank when you specify the excluded facts.

To add facts to the project, complete these steps:


1. Click Facts to access the Facts pane.


2. Click Add and the Document Processor: content_categorization window appears.

3. Click to select a fact in the Fact field. Facts are contextual extraction concepts that contain at least one PREDICATE or SEQUENCE rule. For example, select SIDE_EFFECT from the drop-down menu.

4. (Optional) When you select a fact in Step 3, the Field name field is filled in automatically. For example, see sideeffect. Enter a new name if you choose.

5. (Optional) When you select a fact, the Caption field is filled in automatically. For example, see Side Effect. Enter a new name if you choose.

6. (Optional) By default, the format for the matched fact is entered into the Format field. This is the argument string that fills in the document matches as a label. For example, see the following format:

SIDE_EFFECT(drug: %v{drug}, sideeffect: %v{sideeffect})


7. In this example, the SIDE_EFFECT concept has two arguments, drug and sideeffect. You can use the symbols in Table 10-2 to edit this field.

8. (Optional) By default, %n: %v is entered into the Argument format field. You can also use any of the symbols in Table 10-2.

9. (Optional) By default, a , (comma) appears in the Argument separator field. Enter a new separator such as a period (.).

10. (Optional) By default, a ; (semicolon) appears in the Separator field. Enter a new separator such as a hyphen (-).

11. (Optional) Repeat Step 2 through Step 10 until you have added all of your facts.

Table 10-2: Fact Output Format Symbols

Symbol     Description
%f         Output the fact name.
%a         Output a formatted list of arguments. Note: If you do not specify the argument symbol, the Argument format field, even when specified, does not apply.
%v{name}   Output the value for a specific argument.
%m         Output the matched text.
%s         Return the concordance list. Note: If you do not specify the concordance, the concordance is not returned. This is true even when you specify the Concordance type and Surrounding words in the Facts pane.
%%         Output the literal percent sign.
x          Use as a modifier, such as in %xf, to request XML escaping.

The following symbols appear in the format string of the argument for a matched fact.

Symbol     Description
%n         Output the argument name for the arguments that comprise the definition.
%v         Output the value for the specified argument.


12. Click Ok. If you want to use the same settings specified in the Facts tab, click Copy Defaults. The Facts tab appears.

13. (Optional) By default, facts is entered into the Default field name field. You can enter a new field name.

Note: If you do not change the default entry, facts, in the Default field name field in the Facts tab, all of the concepts are matched.

14. (Optional) By default, Facts is entered into the Default caption field. You can enter a new caption name for facetted search.

15. (Optional) By default, %f(%a) is entered into the Default format field for the fact name. You can edit this entry using any of the symbols in Table 10-2 on page 289 with the exception of %v{name}.


Note: Unless you specify %a, no arguments are called. This is true even if you make entries in the Default argument format field.

16. (Optional) By default, %n: %v is entered into the Default argument format field for the concept name. You can edit this entry using the %n, %v, %%, and the x modifier symbols in Table 10-2 on page 289.

17. (Optional) By default, , (comma) is entered in the Default separator field. You can enter a new separator such as a semicolon (;).

18. (Optional) By default, Surrounding words is selected in Concordance type. Click to select Full sentence.

Concordance refers to the surrounding text that is returned with the match. When you select Full sentence, the Surrounding words field disappears. If you do not specify the concordance using %s, the concordance is not returned. This is true even when you specify the Concordance type and Surrounding words in the Facts pane.

19. (Optional) By default, 10 is selected in the Surrounding words field. Change this default selection as needed.

20. (Optional) By default, the highest number of facts that can be matched in any single input document is 15. Change this default selection in the Max facts field as needed.

Hint: This field specifies the number of facts that have the highest numbers of matches. For example, matches might occur for 25 facts. However, the results for this example are displayed only for the 15 facts with the most matches in input documents.

21. Click Finish to save your selections.
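The way the Default format, Default argument format, and Default separator fields combine can be illustrated with a short Python sketch. This is not the product's implementation. The `render_fact` function and the `fact` dictionary are hypothetical names introduced only for illustration, and the sketch assumes that `%a` joins each formatted argument with the separator and that the `x` modifier XML-escapes the substituted value, as Table 10-2 describes.

```python
import re
from xml.sax.saxutils import escape

# "%%", or "%" plus an optional x modifier plus either v{name} or a single symbol.
_TOKEN = re.compile(r"%(?:(%)|(x?)(?:v\{(\w+)\}|([famsnv])))")


def _expand(fmt, lookup):
    """Substitute every recognized symbol in fmt with the text that lookup returns."""
    def substitute(match):
        if match.group(1):
            return "%"  # %% is the literal percent sign
        xml, arg_name, symbol = match.group(2), match.group(3), match.group(4)
        key = ("v", arg_name) if arg_name is not None else symbol
        text = lookup(key)
        if text is None:
            return match.group(0)  # symbol not valid in this context: keep as typed
        return escape(text) if xml else text
    return _TOKEN.sub(substitute, fmt)


def render_fact(fact, fmt="%f(%a)", arg_fmt="%n: %v", sep=","):
    """Render one matched fact; fact is a hypothetical dict, not a product structure."""
    args = fact["args"]

    def render_arg(name, value):
        # Only %n and %v are meaningful inside the argument format.
        return _expand(arg_fmt, lambda key: {"n": name, "v": value}.get(key))

    def lookup(key):
        if isinstance(key, tuple):  # %v{name}: value of one named argument
            return args.get(key[1])
        if key == "a":  # %a: formatted argument list joined by the separator
            return sep.join(render_arg(n, v) for n, v in args.items())
        return {"f": fact["name"], "m": fact["text"], "s": fact["concordance"]}.get(key)

    return _expand(fmt, lookup)


fact = {
    "name": "acquisition",
    "args": {"buyer": "Acme", "target": "A&B"},
    "text": "Acme bought A&B",
    "concordance": "Last year Acme bought A&B for cash",
}
print(render_fact(fact))             # acquisition(buyer: Acme,target: A&B)
print(render_fact(fact, fmt="%f"))   # acquisition
print(render_fact(fact, fmt="%xm"))  # Acme bought A&amp;B
```

The second call mirrors the note above: without `%a` in the format, the argument format has no effect, whatever its value.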


10.2.4 Specify Output

After you specify the input and matching requirements, choose how the matched fields are treated.

1. Click Output to access the Output window.

2. (Optional) If you do not want to enable facetted search, deselect Label field in the index. (This field applies only to the index.) When you deselect this check box, you can use the label fields for another purpose.

3. (Optional) If you do not want to enable facetted search, deselect Label field in the query web server. This field applies only to the query web server pane. If you are using a custom query interface, you might select this operation. In this case, the Label field in the query web server operation is irrelevant.

4. (Optional) If you want to export these fields as files, select File export.

5. (Optional) If you want to export these fields in comma-separated format, select CSV export. Choose this selection to export your files into programs such as SAS Text Miner or Microsoft Excel.

6. (Optional) If you want to export these fields to a database through an ODBC connection, select ODBC export.


7. Click Finish and see the categories field listed in the Document Processors pane.

10.2.5 Apply content_categorization to Input Documents

After you specify the content_categorization document processor, you can apply these operations to input documents. To apply content_categorization to input documents, complete these steps:


1. Click Stop, Apply Changes, and Start to apply changes and to restart the Document Processor.

2. Begin with the selected crawler and work down through the list of components clicking Stop, Apply Changes, and Start.

You can also perform any of these operations:

- Click Edit in the Document Processors pane. You can follow any of the steps in Section 10.2.1 Access the Projects on SAS Content Categorization Server on page 274 through Section 10.2.4 Specify Output on page 292.

- Click Move up or Move down to reorder your document processors.

- Select Pipeline Server --> Document Inspector. Click Take Snapshot to see the results.

Hint: The Document Inspector captures the next document that is sent to the pipeline server.


Click Start in the Document Inspector pane (below the Document Processors pane) before you start a crawl.

10.3 Seeing the Results in the Query Interface

You can see, and test, the results of the content categorization document processor when you use the query interface. For comprehensive directions, see SAS Information Retrieval Studio: User’s Guide.

To test the results of the content categorization document processor that you defined, complete these steps:

1. Select Start --> Program --> SAS Information Retrieval Studio --> Query Interface.

2. Enter a search term into the blank field to the left of the Search button.


3. Click Search.

4. See the results.


11 Configuring the Indexing Server

- Overview of the Indexing Server
- Configure an Index
- Changes That Affect the Indexing Server
- Run the Indexing Server
- Troubleshoot with the Log File

11.1 Overview of the Indexing Server

The indexing server works like the index that is located in the back of a textbook. The index is a list of unique words and the locations where these words occur. Unless you specify another application, all of the documents that are collected by the crawlers are automatically sent to the indexing server.

The unique fields in the index are populated by the data from the input documents. These fields are specified when you use the Document Processor windows in the Pipeline Server pane. The various types of fields in the index are used for different query functionalities.

You can select a language when you build an index. Your language selection does not prevent documents written in other languages from being indexed, but it does optimize the index for the selected language. The index matches only words in the document.

It is important to remember that you do not change the existing index, but that you can configure the next index that is built. For this reason, the Apply Changes button deletes the current index and configures the new index.

The index can be searched during the build process, or after the index is built.
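The textbook-index analogy can be made concrete with a minimal Python sketch of an inverted index: a mapping from each unique word to the places where it occurs. This only illustrates the general idea. The `build_index` function, the document IDs, and the tokenization rule are assumptions made for the example and say nothing about how the indexing server stores its data.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each unique lowercase word to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Assumption: words are runs of letters, digits, or underscores.
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return index


docs = {
    "doc1": "The crawler collects documents",
    "doc2": "Documents are sent to the index",
}
index = build_index(docs)
print(index["documents"])  # [('doc1', 3), ('doc2', 0)]
```

A query for a word then becomes a dictionary lookup instead of a scan of every document, which is the reason for building the index before end users search.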


11.2 Configure an Index

You configure an index in order to specify the fields that are used for search operations. You also determine how the information that is located in these fields is stored in the index.

To configure an index, complete these steps:

1. Select Indexing Server --> Configuration.

See the list of field names that are the default selections for the index. For example, see id, title, date, and so on.

- Click Remove to delete an entry in the current index configuration.

- Click Edit to make changes to the purposes specified for a field.

- Click Add to enter a new field name with its functionality. You can enter any field name that is found in any of the input documents. It is not necessary for every document to contain each of the specified fields.


2. When you click Add or Edit, the Add Field window appears.

For detailed explanations for each of the functionalities that are available in this window, see Table 11-1 below:

Table 11-1: Field Functionalities

Field Type             Purpose
Searching              (Default) Search for words that match the input query terms. This selection is equivalent to the standard function.
Label                  Select for facetted search, only. This selection is equivalent to marking the field as both standard and Boolean. For more information about facetted search labels, see Section 9.5 Match Categories, Concepts, and Facts on page 246.
Display and Sorting    Sort the results alphabetically, or numerically, instead of by relevancy. The field is returned with the URL of each document in the results list. This choice is equivalent to marking the field as info.
Identification         Identify a document. This field corresponds to marking the field as URL. It is not necessary for this field to contain a standard-compliant URL. However, it is necessary for this field to contain a unique string.
Custom                 Choose one, or more, of the following selections:
                       - Standard: Make this field a regular field.
                       - Info: Make this field an information field.
                       - Boolean: Use Boolean, counting, and positional operators.
                       - URL: Specify either a Web address or a unique string.

3. Click to select a language for index optimization. However, the index is built with the returned documents regardless of language.

4. Use the Add Field window to add, and make changes to, all of the fields in the index.

5. Click Apply Changes to delete the current index and to set the configuration for the new index.
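The equivalences that Table 11-1 states between field types and the four underlying markings (standard, info, Boolean, and URL) can be summarized in a small Python sketch. The `FieldFlag` enumeration, the `FIELD_TYPES` mapping, and the `flags_for` helper are hypothetical names used only to restate the table; they are not part of any SAS interface.

```python
from enum import Flag, auto


class FieldFlag(Flag):
    """The four markings named in the Custom row of Table 11-1."""
    STANDARD = auto()
    INFO = auto()
    BOOLEAN = auto()
    URL = auto()


# Each predefined field type, expressed as the markings the table says it equals.
FIELD_TYPES = {
    "Searching": FieldFlag.STANDARD,
    "Label": FieldFlag.STANDARD | FieldFlag.BOOLEAN,
    "Display and Sorting": FieldFlag.INFO,
    "Identification": FieldFlag.URL,
}


def flags_for(field_type, custom=None):
    """Return the markings for a field type; Custom needs an explicit choice."""
    if field_type == "Custom":
        if not custom:
            raise ValueError("Custom requires one or more markings")
        return custom
    return FIELD_TYPES[field_type]


print(FieldFlag.BOOLEAN in flags_for("Label"))      # True
print(FieldFlag.BOOLEAN in flags_for("Searching"))  # False
```

Read this way, Custom is the general case, and the other four field types are shortcuts for common combinations.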


11.3 Changes That Affect the Indexing Server

If you choose to build an index, many other operations affect the index. For example, see the following list of operations:

- Starting and stopping a crawler affects the flow of documents to the server. For example, if you stop the crawler and then restart it, the same documents are collected.

- Some of the document processing operations in the pipeline server specify the names of the fields passed to the indexing server. If you make a change to one of these document processors while the indexing server is running, click Apply Changes.

- If you change the field names, types, and functionalities that you specify in the Configuration pane of the indexing server, the index is affected.

Whenever you make a change to any of these operations, the current index is not affected. These changes can affect only the new index. For this reason, you have two choices:

- Click the Delete Index button to remove the existing index. A new index can be built with the specified changes after you restart the crawler.

- Click the Apply Changes button when the indexing server is running. The existing index is deleted and the indexing server is restarted so that a new index can be built.

For example, if you make changes to fields in the pipeline server, they can affect the indexing server.


11.4 Run the Indexing Server

By default, the indexing server is running. Configure all of the components that you plan to use. Click Apply Changes after you modify any of the default settings for these components. You might also want to delete the existing index after you make changes, and before end users enter queries.

To start, restart, and stop the indexing server, complete any of these steps:

- Click Start in the Indexing Server pane.

The appropriate message appears in the Status pane after any of these operations.

- To stop the server, click Stop.

- (Optional) If you make any changes that affect the index, click Delete Index. This operation removes the old index. For example, if you add a title field to the list of indexed fields, a new index might be necessary.

- (Optional) Click Apply Changes if the indexing server is running. This operation deletes the old index and restarts the indexing server. A new index can be built with this set of changes.

- (Optional) Click Revert to return to the last applied settings.


11.5 Troubleshoot with the Log File

The indexing server log pane enables you to locate information about the operations of the indexing server. Use the contents of the Log pane when you require customer support.

To access and use the log pane, complete these steps:

1. Click the Log tab in the Indexing Server pane.

2. Change the default setting of 20 in the Number of lines field. This field specifies the number of lines that are displayed for the searchable log file in this pane.

3. Click Retrieve to display the specified number of lines in the log file.

4. (Optional) Enter a search term into the Text to highlight field. For example, enter ID.

5. Click Find to locate all instances of this term in this pane.
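The Retrieve and Find steps above amount to taking the last lines of a log and locating a term in them. A minimal Python sketch of that idea follows. The `retrieve` and `find` functions are hypothetical, the sample log lines are invented for the example, and the case-sensitive matching is an assumption, because the guide does not say how the Log pane compares text.

```python
def retrieve(log_text, number_of_lines=20):
    """Return the last number_of_lines lines of the log (20 mirrors the default)."""
    if number_of_lines <= 0:
        return []
    return log_text.splitlines()[-number_of_lines:]


def find(lines, term):
    """Return (line_index, line) pairs for every retrieved line containing term."""
    # Assumption: a plain, case-sensitive substring match.
    return [(i, line) for i, line in enumerate(lines) if term in line]


# Invented sample log: every odd-numbered entry mentions an ID.
log = "\n".join(
    f"entry {i} ID={i}" if i % 2 else f"entry {i}" for i in range(30)
)
lines = retrieve(log, 5)
print(find(lines, "ID"))
# [(0, 'entry 25 ID=25'), (2, 'entry 27 ID=27'), (4, 'entry 29 ID=29')]
```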


12 Configuring the Query Server

- Overview of the Query Server
- Run the Query Server
- Troubleshoot with the Log File

12.1 Overview of the Query Server

The query server serves the queries that it receives from the query web server to the index. The query server then returns the matched documents from the index to the query web server. The query web server displays these matches to the end user according to the parameters that you specify. The query server is merely the conduit that passes queries and results and logs these interactions with both servers.

12.2 Run the Query Server

The query server uses the index built by the indexing server to locate matching documents in response to queries. To pass queries to the query server, use the query web server or a custom application that you write using the Query API.

By default, the query server is running. Configure all of the components that you plan to use in SAS Information Retrieval Studio. Click Apply Changes after you modify any of the default settings for these components. The query server does not require updates.

To start and stop the query server, complete any of these steps:


- Click Start in the Query Server pane.

The appropriate message appears in the Status pane after both the start and stop operations.

- To stop the server, click Stop.

12.3 Troubleshoot with the Log File

The log pane enables you to see information about the query processing operations of the query server. Use the contents of the Log pane when you require customer support.

To use this log pane, complete these steps:

1. Select Query Server --> Log.


2. Select a new value for Number of lines. The default setting is 20. For example, choose 25 to see more lines.

3. Click Retrieve to display this number of lines in the blank pane below.

4. (Optional) Enter the terms that you want to locate in the Text to highlight field. For example, enter INDEX.

5. Click Find to display this text in the log file.


13 Configuring the Query Web Server

- Overview of the Query Web Server
- Choosing How Search Returns Are Displayed
- Configure the Query Web Server
- Run the Query Web Server
- Troubleshoot with the Log File

13.1 Overview of the Query Web Server

Use the query web server to specify the fields that are searched when an end user enters a query, and how this information is matched and prioritized. The links to the matches are displayed according to the selections that you choose. For example, you can specify the URLs, the text that is displayed, and which fields in the input document are searched to return this text. You can also specify labels to make facetted search possible. Facetted search enables users to locate the information that they seek by moving in an intuitive, instead of a linear, progression. These labels appear, in a hierarchical or flat layout, to the left of the search returns, which are displayed in list format on the right.

You can also choose the display settings and design the query window for your end users. When you use the query web server to specify the look and feel of the search window, choose the banner, colors, and other components for this window. You can also access the search window through the link provided in the Status pane of the query web server window.


Display 13-1 SAS Information Retrieval Studio Search Window

13.2 Choosing How Search Returns Are Displayed

13.2.1 Displays with or without Labels

You can customize the way that end users see and navigate the matches that are located for their input query terms. These display selections enable the end user to navigate the returned documents and to optimize search within the returns to locate the results that they seek. To specify how labels are displayed for categories, concepts, and facts, complete these steps:


1. Select Query Web Server --> Configuration --> Labels.

2. Click Add and the Add Field window appears.

3. Click in the Hierarchical field to choose a hierarchical, non-hierarchical, or a flattened display of the labels. In this example, Yes is selected to enable a hierarchical display of the categories.


4. Make any other changes and click OK to see this selection in the Labels pane.

See the following examples that include search windows that do, and do not, display labels.

13.2.2 No Labels Example

If you choose to use no labels, search results are displayed in list format. To see the matched document, select the blue hyperlink that appears to the right of the number in the ordered list. (Returns are ordered according to the specifications that you select in the Sorting pane of the Query Web Server.)

Display 13.1 No Labels

13.2.3 Hierarchical Labels Example

If you specify a hierarchical ordering of labels, the matching sections of the taxonomy for categories are displayed. This is a matched portion of the same taxonomy that appears in the Taxonomy pane of the SAS Content Categorization Studio user interface. You can also choose to see a count of the matching documents.


Note: Concepts and facts do not have a parent-child relationship. For this reason, the hierarchical specification does not work for them.

Display 13.2 Hierarchical Taxonomy of Matched Labels

You can click the left mouse button on a hyperlink label to make one of the following selections:

Require
    The path to the selected label appears below the search box. The displayed documents match both the query term and the selected label. If you specify more than one label, the documents match the query term and the selected labels. In this case, each path is appended with a plus sign (+).

Exclude
    One label, preceded by the minus sign (-), appears in the SAS Information Retrieval Studio search window. The displayed documents match the query term, but not the selected labels.

View
    One label appears in the SAS Information Retrieval Studio search window, which displays all of the matching documents for this label only. Bolded matches for existing query terms no longer appear below the document links on the right side of the search window.

Remove
    This operation is available for the label, or path, appearing at the top of the SAS Information Retrieval Studio search window. This is the only selection available after you use any of the above operations.
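The effect of the Require and Exclude selections on the result list can be sketched as a set filter. The Python below is only an illustration of the behavior described above. The `filter_by_labels` function, the document dictionaries, and the label paths are hypothetical, and the input list is assumed to already contain only documents that match the query term.

```python
def filter_by_labels(documents, required=(), excluded=()):
    """Keep documents that carry every required label and none of the excluded ones."""
    required, excluded = set(required), set(excluded)
    return [
        doc for doc in documents
        if required <= doc["labels"] and not (excluded & doc["labels"])
    ]


# Hypothetical query matches, each with the label paths matched in the document.
matches = [
    {"url": "doc1", "labels": {"Vehicles/Cars", "Vehicles/Cars/Repair"}},
    {"url": "doc2", "labels": {"Vehicles/Cars"}},
    {"url": "doc3", "labels": {"Vehicles/Trucks"}},
]

required_only = filter_by_labels(matches, required=["Vehicles/Cars"])
print([d["url"] for d in required_only])  # ['doc1', 'doc2']

narrowed = filter_by_labels(
    matches, required=["Vehicles/Cars"], excluded=["Vehicles/Cars/Repair"]
)
print([d["url"] for d in narrowed])  # ['doc2']
```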

13.2.4 No Hierarchical Display Example

If you select No for the hierarchical display selections of the related taxonomy, the matched categories are displayed with slashes (/). These slashes indicate the paths, or parent-child relationships, that exist in the SAS Content Categorization Studio taxonomy that they match. The hierarchical view does not work for concepts and facts, because they do not exist in a parent-child relationship.

Display 13.3 No Hierarchy


13.2.5 Flattened Hierarchical Example

If you select a flattened display of the related taxonomy, the matched categories are displayed. However, the full path to that category does not appear.

Note: You can also use the Require, Exclude, View, and Remove operations. For more information, see Section 13.2.3 Hierarchical Labels Example on page 312.

Display 13.4 Flattened Hierarchy

Hover the mouse over a category or concept to see the hierarchy, or parent-child relationships existing in SAS Content Categorization Studio.
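The three values of the Hierarchical field (Yes, No, and Flattened) differ only in how a matched category path is rendered. The following Python sketch illustrates the difference under stated assumptions: the `display_labels` function and the sample paths are hypothetical, and indentation stands in for the tree that the search window draws.

```python
def display_labels(paths, hierarchical="No"):
    """Render matched category paths for one setting of the Hierarchical field."""
    ordered = sorted(paths, key=lambda p: p.split("/"))
    if hierarchical == "No":
        # Full slash-separated path for each matched category.
        return ordered
    if hierarchical == "Flattened":
        # Only the category itself, without the path that leads to it.
        return [path.rsplit("/", 1)[-1] for path in ordered]
    if hierarchical == "Yes":
        # One line per taxonomy node, indented by depth, each node shown once.
        lines, seen = [], set()
        for path in ordered:
            parts = path.split("/")
            for depth in range(len(parts)):
                node = tuple(parts[: depth + 1])
                if node not in seen:
                    seen.add(node)
                    lines.append("  " * depth + parts[depth])
        return lines
    raise ValueError(f"unknown setting: {hierarchical!r}")


paths = ["Vehicles/Cars/Antique", "Vehicles/Cars/Repair", "Vehicles/Trucks"]
print(display_labels(paths, "No"))
# ['Vehicles/Cars/Antique', 'Vehicles/Cars/Repair', 'Vehicles/Trucks']
print(display_labels(paths, "Flattened"))
# ['Antique', 'Repair', 'Trucks']
print("\n".join(display_labels(paths, "Yes")))
```

Because concepts and facts have no parent-child relationships, their paths would have a single segment, and all three settings would render them the same way in this sketch.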


13.3 Configure the Query Web Server

13.3.1 Overview of Configuring the Query Web Server

You configure the query web server using the configuration sets in each pane of the Configuration tab. Use this pane to specify the type of search and how results are sorted. If you enable facetted search, you also specify the captions that appear as label names. Choose the fields where matches can be located and design the appearance of the SAS Information Retrieval Studio search window. This section is set up as a how-to guide, but it also contains the background information that is necessary for each tab.

Display 13.5 Query Web Server Configuration Pane


13.3.2 Specify the Server Port

The Server Port field displays the number of the port where the query web server is running. The default is 9100.

Display 13.6 Query Web Server Port

To change the query web server port, select a new port number that is not already in use.

13.3.3 Specify How Matching Is Performed

13.3.3.A Match Types

Use the Matching pane to specify how the terms that are matched to the query are located. Each document is indexed as a field-value pair. When you leave the default selection Simple selected, you can specify the fields that are searched in the index.

There are two types of searches that you can specify for your end users:

Simple (fsearch)
    Specify the query fields and enable end users to mark required words and quoted phrases by prefixing them with plus (+) and minus (-) signs. When you make this selection, you specify the indexing fields that are searched.

Advanced (bsearch)
    Enable end users to specify the query fields. Query terms can be combined when you specify the following operators:
    - Boolean operators such as AND, OR, and NOT add precision to your search.
    - Positional words such as SENT and PAR specify that matches are located only if the specified words appear in the same sentence or paragraph, respectively.
    - Counting words such as MINOC_n count the number of matched words. A match occurs only when at least n matching words appear in the input document.

    When you specify Advanced search, you do not select any index fields. Instead, you choose the sorting and weights for matches.
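The prefix syntax of Simple search can be illustrated with a small query parser. This Python sketch is not the fsearch implementation. The `parse_simple_query` function is a hypothetical name, and the sketch assumes that a plus sign marks a term as required, that a minus sign marks it as excluded, and that unprefixed terms are optional, which is a common convention that this guide does not spell out.

```python
import re

# An optional + or - sign, followed by a quoted phrase or a single word.
_TERM = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_simple_query(query):
    """Split a query into (required, excluded, optional) lists of terms."""
    required, excluded, optional = [], [], []
    buckets = {"+": required, "-": excluded, "": optional}
    for sign, phrase, word in _TERM.findall(query):
        term = phrase or word
        if term:  # skip empty quoted phrases such as ""
            buckets[sign].append(term)
    return required, excluded, optional


print(parse_simple_query('+cars -"car repair" antique'))
# (['cars'], ['car repair'], ['antique'])
```

A hyphen inside a word, as in e-mail, is left alone, because the sign is recognized only at the start of a term.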

13.3.3.B Select a Match Type

To determine the match type, complete these steps:

1. Select Query Web Server --> Configuration --> Matching.

2. Leave the default selection, Simple search, or click to select Advanced.


3. Click Add and the Add Field window appears.

4. Leave the default selection such as title. Click to select a different field. Each of these fields is listed in the Configuration pane of the indexing server. In other words, the fields that appear in this drop-down menu are also in the index.

5. Select a new Weight. For example, choose 5 to weight matches that are located in the body field more heavily than those in the title field. The weight setting is relative across all fields.

6. Click OK to see this field in the Configuration pane.

7. Click Remove in the Matching pane to delete a field.

8. Click Edit to make a change in one of the fields.

13.3.4 Specify How Matches Are Sorted

Matches are sorted by date, field values, the number of matching terms or fields, or by relevancy. You can affect relevancy when you specify a weight for a field that ranks matches on one field higher than those on another field. For more information, see Section 13.3.3 Specify How Matching Is Performed on page 317. The following set of steps provides information about sorting by relevancy. To use the other selections that are available in this pane, see Section 2.11.4.C The Sorting Tab on page 53.

To specify how matches are sorted, complete these steps:


1. Select Query Web Server --> Configuration --> Sorting.

2. Leave the default selection Relevancy, or click to choose a different selection in the Sort type drop-down menu. The selection that you make in this field determines the fields that are displayed below this field.

3. Specify the following weights according to your priorities. In other words, if the density weight is more important than any of the other weights, specify the highest weight number for this field:

a. (Default selection is 1) Select a new Cosine Weight. This metric assigns the highest weights to the most frequently occurring terms. It takes noise words into consideration. (Noise words are the words that appear with enough frequency that they are ranked down.)

b. (Optional) Click to add Proximity Weight to the relevancy metric. Specify higher numbers when multiple query terms appear close to each other in a document.

c. (Optional) Click to add Position Weight to the relevancy metric. Specify higher weights for query terms that are matched at the beginning of a document.


d. (Optional) Click to add Density Weight to the relevancy metric. Density measures the proximity of matches as a percentage of the document size.

e. (Optional) Click to add Freshness Weight to the relevancy metric. Freshness enables you to combine date sorting with the other measures.

4. Click Apply Changes before you select another pane.
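The guide does not state how the five weights are combined into one relevancy score. One plausible reading, shown here purely as an assumption, is a weighted average of per-metric scores. In the Python sketch below, the `RelevancyWeights` class, the `relevancy` function, and the sample scores are all hypothetical, and only the Cosine Weight default of 1 comes from the steps above.

```python
from dataclasses import dataclass


@dataclass
class RelevancyWeights:
    """Weights from the Sorting pane; only the cosine default (1) is from the guide."""
    cosine: float = 1.0
    proximity: float = 0.0
    position: float = 0.0
    density: float = 0.0
    freshness: float = 0.0


def relevancy(scores, weights):
    """Assumed combination: weighted average of per-metric scores in [0, 1]."""
    total = sum(vars(weights).values())
    if total == 0:
        return 0.0
    weighted = sum(
        weight * scores.get(name, 0.0) for name, weight in vars(weights).items()
    )
    return weighted / total


# Density is given three times the influence of the cosine metric.
weights = RelevancyWeights(cosine=1, density=3)
score = relevancy({"cosine": 0.4, "density": 0.8}, weights)
print(round(score, 3))  # 0.7
```

Under this assumption, only the ratios between the weights matter, which is consistent with the advice to give the highest number to the metric that matters most.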

13.3.5 Specify Labels for Facetted Search

Labels cluster matching documents. Labels apply the matched categories and concepts that occur most frequently in the matched documents. You can see a general label that uses the caption that you specify in the search window. Beneath this label, you can see a taxonomy, or list, of matched categories and concepts. You choose whether to use categories, concepts, or both. You make this choice in the Document Processor window of the pipeline server.

You specify labels when you want to enable facetted search in the search window for your end users. Facetted search enables your users to intuitively search and locate documents that match the input word. For example, if a user enters the word cars, all of the documents that match cars are returned. Related labels such as car parts, car repair, and antique cars might also appear. These labels, or categories and concepts, enable the end user to see related terms that are also matched in the returned documents.

Use the Labels pane to specify the caption for the categories and concepts that users see. For categories, the caption replaces the Top node that you see in the SAS Content Categorization Studio Taxonomy pane. You specify a caption for each concept, or SAS Information Retrieval Studio uses the name of the concept, by default. These captions are applied to index fields in order to rename them with user-friendly text. For more information about the fields to labels process, see Section 3.7 Defining Labels for Facetted Search on page 163.

Note: Only index fields where the specified functionality is Label can be accessed in this window and can have a caption.

To specify labels, complete these steps:

1. Select Query Web Server --> Configuration --> Labels.

2. Click Add and the Add Field window appears.

3. Leave the default field selection in Field name. For example, if you selected categories as the field with Label functionality in the Indexing Server pane, you can leave this default selection. The only selections that are available in this field are those with the Label functionality.

4. Enter a new name for the label into the Caption field. For example, enter Categories, which begins with an uppercase letter.

5. Leave the default selection No in the Hierarchical field, or click to select either Yes or Flattened. To see the types of results that are displayed for these selections, see Section 13.2 Choosing How Search Returns Are Displayed on page 310.

6. By default, Yes is selected in the Display counts field. Click to select No. If you select No, the numbers of matching documents do not appear to the right of the labels in the SAS Information Retrieval Studio search window.

7. Select a number of matching fields in the Show in matches field. For example, choose to display the three categories with the highest number of matches in the SAS Information Retrieval Studio search window. (If there are more than the specified number of fields, the term and other information appears. This term is appended to the list to indicate that the display is incomplete.)

8. Click Move Up to relocate a match on this field when it is displayed in the search results window.

9. Click Move Down to relocate a match on this field when it is displayed in the search results window.

10. Change the value in the Maximum number of related labels to display field.

11. Click OK to save these settings.

12. Click the link in the Status tab to see the results of your changes in the search window.
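The display settings in the steps above can be pictured with a short sketch. It formats one label group under its caption, limits the list to a chosen number of entries, optionally shows the document counts, and appends a marker when the list is cut short. The function name, the parameter names, and the marker text are hypothetical stand-ins for the Caption, Show in matches, and Display counts settings.

```python
def format_label_group(caption, label_counts, show=3, display_counts=True,
                       overflow_marker="..."):
    """Return the display lines for one label group.

    label_counts maps a label name to its number of matching documents.
    overflow_marker is a placeholder for the term that the search window
    appends when the display is incomplete.
    """
    # Highest counts first; ties are broken alphabetically for stable output.
    ranked = sorted(label_counts.items(), key=lambda item: (-item[1], item[0]))
    lines = [caption]
    for name, count in ranked[:show]:
        lines.append(f"  {name} ({count})" if display_counts else f"  {name}")
    if len(ranked) > show:
        lines.append(f"  {overflow_marker}")
    return lines


counts = {"car parts": 3, "car repair": 2, "antique cars": 1}
for line in format_label_group("Categories", counts, show=2):
    print(line)
```

Setting display_counts to False in this sketch corresponds to selecting No in the Display counts field.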

13.3.6 Specify the Formatting for the Matches

You can choose how matched information appears to the end user. For example, use the Match Formatting pane to specify how the links and fields for matching documents are displayed in the SAS Information Retrieval Studio search results window. This pane also enables you to specify the sources and the allowed prefix and suffix for the displayed links.

To specify the formatting for the matching documents returned to queries, complete these steps:

1. Select Query Web Server --> Configuration --> Match Formatting.

2. (Default is Text field) Click to select HTML field in the Title source field. This field identifies the location of the title in the input document. Select None if you choose not to display the document title. In this case, the Title field disappears and No Title is displayed for each match.

3. (Default is title) Click in the Title field to select from the info fields in the index. This is the field where the document title can be located.

4. (Default is Yes) Click to select No in the Use filename when document has no title field. Use this selection to generate a title for the document when the title field in an input document is empty.

5. (Default is Concordance) Click to select Text field or HTML field in the Abstract source field. Use these fields to locate the type of field where the summary of the input document is located. If you select None or Concordance, the Abstract field disappears.

Select Concordance to enable hit highlighting. Matched query terms in an input document appear in bold.

6. (Default is Title) Click to select another field in Abstract field. This is the field where the abstract can be located.

7. (Default is URL) Click to select None or Text field in Link Source field. This field specifies the link to the matched document.

8. Enter a string to prepend to the URL, before it is displayed, into the Link prefix field. Specify how to modify the prefix of the URL at display time for the purposes of passing an argument from your own CGI script.

9. Enter a string to append to the URL, before it is displayed, into the Link suffix field. Specify how to modify the suffix of the URL at display time for the purposes of passing an argument from your own CGI script.

For example, the link field might contain the unique identifier 12345. However, the browser does not understand this string. In this case, set the link prefix to http://host/script?id= and the link suffix to &format=html. The browser now sees the link as http://host/script?id=12345&format=html. The only other requirements are that the CGI script exists and that the browser can render this type of ID.

10. (Default is Yes) Click to select No in the Add keywords to PDF links field. When you leave the default selection Yes, you instruct Adobe Reader to highlight the search terms in an input .pdf document. This operation functions like a concordance, but works for the entire document, not only the abstract in the results list.

11. (Default is None) Click to select Text field in the MIME type source field. This field is used to locate the type of the input document.

12. (Default is None) Click to select Date or Text field in the Date source field. This field is used to locate the creation date of the matched document.
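Two of the settings above lend themselves to a short sketch: the link prefix and suffix, and the bold highlighting that the Concordance selection applies to matched query terms. The function names are hypothetical, and the highlighting uses simple whole-word matching as an assumption. The product might match terms differently.

```python
import re


def build_link(link_value, prefix="", suffix=""):
    """Wrap the stored link value with the configured prefix and suffix.

    The value is inserted as is. This sketch does not URL-encode it.
    """
    return f"{prefix}{link_value}{suffix}"


def highlight_terms(abstract, query_terms):
    """Wrap whole-word, case-insensitive matches of the query terms in <b> tags."""
    terms = [re.escape(term) for term in query_terms if term]
    if not terms:
        return abstract
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", abstract)


print(build_link("12345", "http://host/script?id=", "&format=html"))
# http://host/script?id=12345&format=html

print(highlight_terms("Antique cars and car parts", ["car"]))
# Antique cars and <b>car</b> parts
```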

13.3.7 Specify the Theme of the Search Window

13.3.7.A Theme Overview

There are three parts to the Theme pane. The first four fields provide the specifications for the matched documents. The Colors pane enables you to select the colors for the search window. The Images pane enables you to specify the images that replace the existing title bar.

Display 13.7 SAS Information Retrieval Studio Search Window

Determine the look and feel of the SAS Information Retrieval Studio search window.

To specify the theme of the search window, complete these steps:

1. Select Query Web Server --> Configuration --> Theme.

2. Leave the default selection SAS Information Retrieval Studio, or enter a new name into the Title field.

3. Leave the default selection sans-serif, or enter a new display font into the Font field.

4. Leave the default selection 10, or select a new size for the display letters in the Font size field. For example, choose 12 to display the search returns in a larger font size.

5. (Default is Yes) Click to select No in the Use popup menus field. Select No when you want to disable pop-up menus for older browsers that cannot use JavaScript.

6. Click the link in the Status tab to see the results of your changes in the search window.

13.3.7.B Specify the Colors of the Search Window

Determine the colors that are displayed in the SAS Information Retrieval Studio search window. For example, change the background color and the visited link colors to appear in red.

Display 13.8 New Colors Specified for the Search Window

For more information about formatting the search window, see http://www.w3.org/TR/CSS/ui.html#system-colors.

To specify the colors in the search interface, complete these steps:

1. Select the Query Web Server --> Configuration --> Theme --> Colors.

2. Leave the default selection Custom in the Header background color field. Click to select the location that you want to color. You can also click to access the color box window and select a color such as red.

Note: For more information about the Color Box window, see Section 2.14.27 The Color Box Window on page 151.

3. Leave the default selection Custom in the Header text color field. Click to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on. You can also click to access the color box window to select another color.

4. Leave the default selection Custom in the Link color field. Click to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on. You can also click to access the color box window and select a color such as red.

5. Leave the default selection Custom in the Visited link color field. Click to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on. You can also click to access the color box window and select another color.

6. Leave the default selection Custom in the Hover link color field. Click to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on. You can also click to access the color box window and select a color such as red.

7. Leave the default selection Window in the Menu border color field. Click to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on.

8. Leave the default selection Window in the Menu unselected background color field. Click to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on. Click to access the color box window and select a different color.

9. Leave the default selection Custom in the Menu unselected text color field. Click to select Custom, ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on.

10. Leave the default selection GrayText in the Menu selected background color field. Click to select Custom, ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on.

11. Leave the default selection HighlightText in the Menu selected text color field. Click to select Custom, ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on.

12. (Optional) Click Reset to Default to revert to the standard SAS Information Retrieval Studio settings.

13. Click Apply Changes before you select another pane.

14. Click the link in the Status tab to see the results of your changes.
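The named selections in the steps above, such as Window, GrayText, and HighlightText, are CSS system color keywords. The following sketch shows how a set of menu color settings could be turned into style rules. The dictionary keys, the CSS class names, and the function name are hypothetical. Only the default values are taken from the steps above, and the style sheet that the search window actually uses is not documented here.

```python
# Defaults as listed in the steps above. The keys are illustrative names.
MENU_COLOR_DEFAULTS = {
    "menu_border": "Window",
    "menu_unselected_background": "Window",
    "menu_selected_background": "GrayText",
    "menu_selected_text": "HighlightText",
}


def menu_css(overrides=None):
    """Build CSS rules from the defaults plus any custom color overrides."""
    colors = {**MENU_COLOR_DEFAULTS, **(overrides or {})}
    rules = [
        f".menu {{ border-color: {colors['menu_border']}; }}",
        f".menu-item {{ background-color: "
        f"{colors['menu_unselected_background']}; }}",
        f".menu-item.selected {{ background-color: "
        f"{colors['menu_selected_background']}; "
        f"color: {colors['menu_selected_text']}; }}",
    ]
    return "\n".join(rules)


print(menu_css({"menu_selected_background": "red"}))
```

Calling menu_css() with no overrides in this sketch plays the role of Reset to Default.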

13.3.7.C Load New Images into the Search Window

You can upload images, or borders, into the search window that your end users see. Before you use the steps below, make sure that you load your images into the work/query-web-server subdirectory of your installation directory.

Note: PNG, JPEG, and GIF images are all supported.

To upload images or borders, complete these steps:

1. Select the Query Web Server --> Configuration --> Theme --> Images.

2. Leave the default selection None in the Left header image field. Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.

3. Leave the default selection sas.png in the Right header image field. Click to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.

4. Click Apply Changes.

5. Click the link in the Status tab to see the results of your changes in the search window.
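Before you open the Images pane, it can help to confirm which files in the work/query-web-server subdirectory are in a supported format. The following sketch lists them. The function name and the exact set of file extensions are assumptions. The guide states only that PNG, JPEG, and GIF images are supported.

```python
from pathlib import Path

# Assumed extensions for the supported PNG, JPEG, and GIF formats.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def available_header_images(install_dir):
    """Return the names of supported image files under
    <install_dir>/work/query-web-server, sorted by name."""
    image_dir = Path(install_dir) / "work" / "query-web-server"
    if not image_dir.is_dir():
        return []
    return sorted(
        path.name
        for path in image_dir.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Pass your own installation directory to available_header_images. The function returns an empty list when the subdirectory does not exist.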

13.4 Run the Query Web Server

By default, the query web server is running. Configure all of the components that you plan to use. Click Apply Changes after you modify any of the default settings for these components.

To start, restart, and stop the query web server, complete any of these steps:

- Click Start in the Query Web Server pane.

The appropriate message appears in the Status pane after any of these operations.

- Click the link to the machine where the query web server is running to see the SAS Information Retrieval Studio search window.

- (Optional) If you make any changes to the configuration while the indexing server is running, click Apply Changes.

- (Optional) Click Revert to return to the last applied settings.

- To stop the server, click Stop.

13.5 Troubleshoot with the Log File

The Log pane enables you to see information about the queries entered by an end user in the search pane. Use the contents of the Log pane when you require customer support.

To use this Log pane, complete these steps:

1. Select Query Web Server --> Log.

2. Select a new value in the Number of lines field. The default setting is 20. For example, choose 25 to see more lines.

3. Click Retrieve to display this number of lines in the blank pane below.

4. (Optional) Enter the terms that you want to locate in the Text to highlight field. For example, enter query.

5. Click Find to display this text in the log file.
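If you have direct access to a log file on disk, the retrieve and find operations in the steps above can be approximated with a short script. This sketch assumes a plain-text log file. The function names are hypothetical, and the location of the log file is not given in this section, so the path is left to the caller.

```python
from collections import deque


def last_lines(log_path, number_of_lines=20):
    """Return the last number_of_lines lines of a text log file."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=number_of_lines)]


def find_text(lines, text):
    """Return (position, line) pairs for the lines that contain the text,
    ignoring case. Positions start at 1 within the retrieved lines."""
    needle = text.lower()
    return [(position, line)
            for position, line in enumerate(lines, start=1)
            if needle in line.lower()]
```

For example, find_text(last_lines(path, 25), "query") mirrors retrieving 25 lines and then searching them for the word query.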

14 Configuring the Query Statistics Server

- Overview of the Query Statistics Server

- Run the Query Statistics Server

- View the Query Statistics for a Selected Time Period

- After You View the Query Data

- Troubleshoot with the Log File

14.1 Overview of the Query Statistics Server

The query statistics server monitors the queries that end users enter into the SAS Information Retrieval Studio search window. The query statistics server tracks and displays information such as the most frequent query terms, query terms that did not return matching documents, and other time-related information. This server also provides the query analytics that enable you to troubleshoot or to see numbers that interpret the flow of traffic through the SAS Information Retrieval Studio search window.

You can select the time periods and types of data that you want to see when you use the panes in this window. See the most frequent queries or monitor the flow of traffic hour by hour. Some choices provide access to all of the tabs in this pane. Other selections limit the panes that you can see to those that apply to your selection.

Display 14-1 Query Statistics Pane

14.2 Run the Query Statistics Server

By default, the query statistics server is running.

To start and stop the query statistics server, complete any of these steps:

- Click Start in the Query Statistics Server pane.

The appropriate message appears in the Status pane after either of these operations.

- To stop the server, click Stop.

14.3 View the Query Statistics for a Selected Time Period

14.3.1 Overview of Time Period Views

By default, the Query Statistics pane displays the four periods of time that you can use to see related analytics. Use the Today, This Month, This Year, and All Time buttons in this tab to select a time period. You can then select one of the following panes: Most Frequent Queries, Most Frequent Queries Without Matches, Hourly Query Rate, Daily Query Rate, or Monthly Query Rate. The availability of these panes depends on the time period button that you click. For example, when you click All Time, you see all of the available tabs. If you click Today, the first three tabs are available.
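The relationship between the time period buttons and the accessible tabs, as described in this chapter, can be summarized in a small lookup. The names TABS, ACCESSIBLE_TAB_COUNT, and tabs_for are illustrative only.

```python
TABS = [
    "Most Frequent Queries",
    "Most Frequent Queries Without Matches",
    "Hourly Query Rate",
    "Daily Query Rate",
    "Monthly Query Rate",
]

# Number of leading tabs that remain accessible for each button.
ACCESSIBLE_TAB_COUNT = {
    "Today": 3,
    "This Month": 4,
    "This Year": 5,
    "All Time": 5,
}


def tabs_for(period):
    """Return the tabs that remain accessible for the selected time period."""
    return TABS[:ACCESSIBLE_TAB_COUNT[period]]


print(tabs_for("Today"))
```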

14.3.2 See the Statistics for Today

When you click Today, the Most Frequent Queries, Most Frequent Queries Without Matches, and the Hourly Query Rate tabs remain accessible.

To see the query statistics for searches performed today or yesterday, complete these steps:

1. Select Query Statistics Server --> Query Statistics.

2. Click Today to see the screen shown above. Today’s date is displayed by Year, Month, and Day. For example, 2010, 9, 15.

3. (Optional) Click Previous to see the date assigned to yesterday and to see these results. For example, 2010, 9, 14.

4. Click the Most Frequent Queries tab to see the query terms and number of times that these words were entered into the search window.

5. See the query with the highest number of entries at the top of the list under Query. For example, see the word sas.

6. See the total count for the number of entries under Number of Occurrences. For example, sas was entered 73 times.

7. Click the Most Frequent Queries Without Matches tab to see the search terms that are not located.

8. See any search terms that were not matched by the searched corpus under Query. For example, see the term produc.

Hint: In this example, this term is also listed in the Most Frequent Queries pane.

9. Under Number of Occurrences, see the number of times this search term was input by end users. For example, produc was entered one time.

10. Click the Hourly Query Rate tab to see the query traffic over the current 24-hour period.

11. See each Hour and the Number of Queries. For example, see 8 am, 17 and 9 am, 12.
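The three tabs described in the steps above correspond to three simple tallies over the day's queries. The following sketch computes them from a list of query records. The record layout (timestamp, query text, number of matches) and the function name are assumptions. The query statistics server's own storage format is not described in this guide.

```python
from collections import Counter
from datetime import date, datetime


def todays_statistics(records, day):
    """Tally one day's queries.

    records is an iterable of (timestamp, query, match_count) tuples.
    Returns three Counters: most frequent queries, most frequent queries
    without matches, and the number of queries per hour of the day.
    """
    todays = [record for record in records if record[0].date() == day]
    most_frequent = Counter(query for _, query, _ in todays)
    without_matches = Counter(query for _, query, matches in todays if matches == 0)
    hourly_rate = Counter(timestamp.hour for timestamp, _, _ in todays)
    return most_frequent, without_matches, hourly_rate


# Hypothetical sample records.
records = [
    (datetime(2010, 9, 15, 8, 5), "sas", 12),
    (datetime(2010, 9, 15, 8, 30), "sas", 12),
    (datetime(2010, 9, 15, 9, 0), "produc", 0),
    (datetime(2010, 9, 14, 9, 0), "sas", 12),
]

frequent, unmatched, hourly = todays_statistics(records, date(2010, 9, 15))
print(frequent.most_common(1))  # [('sas', 2)]
print(dict(unmatched))          # {'produc': 1}
```

As in the Hint above, a query without matches (produc in this sample) is also counted among the most frequent queries.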

14.3.3 See the Statistics for This Month

When you click This Month, the Most Frequent Queries, Most Frequent Queries Without Matches, Hourly Query Rate, and the Daily Query Rate tabs remain accessible.

To see the query statistics for searches performed this month, complete these steps:

1. Select Query Statistics Server --> Query Statistics.

2. Click This Month.

3. See the screen shown above. The date for this month is displayed by Year and Month. For example, 2010 and 9.

4. Repeat Step 3 through Step 11 in Section 14.3.2 See the Statistics for Today.

5. Click Daily Query Rate.

6. See each Day of the week and the Number of Queries input for that day. For example, see Wednesday, 74.

14.3.4 See the Statistics for This Year

When you click This Year, all of the tabs remain accessible.

To see the query statistics for searches performed this year, complete these steps:

1. Select Query Statistics Server --> Query Statistics.

2. Click This Year.

3. See the date that is displayed in year format in Year. For example, 2010.

4. Repeat Step 3 through Step 11 in Section 14.3.2 See the Statistics for Today.

5. Repeat Step 5 and Step 6 in Section 14.3.3 See the Statistics for This Month for the Daily Query Rate tab.

6. Click Monthly Query Rate to see the total number of queries that were input during each month of the selected year.

7. See each Month and the Number of Queries for that month. For example, see September, 97.
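The Daily Query Rate and Monthly Query Rate tabs group the same queries by day of the week and by month. The following sketch shows both groupings over a list of query timestamps. The function names and the sample data are hypothetical, and the English day and month names assume the default locale.

```python
import calendar
from collections import Counter
from datetime import datetime


def daily_query_rate(timestamps):
    """Number of queries per day of the week, Monday first."""
    counts = Counter(ts.weekday() for ts in timestamps)
    return {calendar.day_name[day]: counts[day]
            for day in range(7) if counts[day]}


def monthly_query_rate(timestamps, year):
    """Number of queries per month of the given year."""
    counts = Counter(ts.month for ts in timestamps if ts.year == year)
    return {calendar.month_name[month]: counts[month]
            for month in range(1, 13) if counts[month]}


# Hypothetical query timestamps.
stamps = [
    datetime(2010, 9, 14, 9, 0),
    datetime(2010, 9, 15, 8, 0),
    datetime(2010, 9, 15, 9, 0),
    datetime(2010, 8, 2, 10, 0),
]

print(daily_query_rate(stamps))
print(monthly_query_rate(stamps, 2010))
```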

14.3.5 See the Statistics for All Time

When you click All Time, all of the tabs remain accessible.

To see the query statistics for searches performed from the time when your end users began to query until now, complete these steps:

1. Select Query Statistics Server --> Query Statistics.

2. Click All Time.

3. See that no date is displayed and the Previous and Next buttons are not accessible for this date selection.

4. Repeat Step 4 through Step 11 in Section 14.3.2 See the Statistics for Today.

5. Repeat Step 5 and Step 6 in Section 14.3.3 See the Statistics for This Month for the Daily Query Rate tab. The number for each day of the week matches the total number of input queries received on each of the respective weekdays over the course of the year.

Hint: The statistics are preserved when you restart the application, but not when SAS Information Retrieval Studio is reinstalled.

6. Repeat Step 6 and Step 7 in Section 14.3.4 See the Statistics for This Year for the Monthly Query Rate tab.

14.4 After You View the Query Data

The query statistics help you to understand whether changes should be made to the current configuration of SAS Information Retrieval Studio. For example, you can find the following information:

- Discover your peak query hours, days, and months. If performance should be increased, consider adding additional hardware or network bandwidth.

- Discover any changes that should be made to the index. For example, see whether queries without matches might be matched if an additional field is added to the index.

- See whether the most frequent query terms adequately match the searched corpus. If not, add a new link.

14.5 Troubleshoot with the Log File

The query statistics server Log pane enables you to see the history of the queries entered by an end user in the search pane. Use the contents of the Log pane when you require customer support.

To use this Log pane, complete these steps:

1. Select Query Statistics Server --> Log.

2. Select a new value in the Number of lines field. The default setting is 20. For example, choose 25 to see more lines.

3. Click Retrieve to display this number of lines in the blank pane below.

4. (Optional) Enter the terms that you want to locate in the Text to highlight field. For example, enter time.

5. Click Find to display this text in the log file.

Appendixes

- Appendix A: Regular Expressions and XML Field Extraction File

- Appendix B: Recommended Reading

- Appendix C: Glossary

Appendix A: Regular Expressions and XML Field Extraction File

- Regular Expressions

- XML File Field Extraction File Format

A.1 Regular Expressions

The document processors, file crawler, and feed crawler use the Python compatible equivalent of PCRE. The query web server uses the Java compatible equivalent of PCRE. The web crawler uses the SAS wrapper for PCRE. For more information, see the following pages:

- PCRE: http://www.pcre.org/

- Python compatible equivalent: http://docs.python.org/library/re.html

- Java compatible equivalent: http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
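For the components that use the Python compatible syntax, you can try out a pattern with the standard Python re module before you enter it in the interface. The pattern and the sample paths below are invented for illustration. Whether a given field applies the pattern to the whole string or searches within it is not stated here, so the sketch anchors the pattern explicitly.

```python
import re

# Hypothetical pattern: PDF or HTML files that sit under a docs directory.
pattern = re.compile(r"^.*/docs/.*\.(pdf|html?)$", re.IGNORECASE)

candidates = [
    "/data/docs/guide.pdf",
    "/data/docs/index.HTM",
    "/data/images/logo.png",
    "/data/docs/readme.txt",
]

matches = [path for path in candidates if pattern.search(path)]
print(matches)
# ['/data/docs/guide.pdf', '/data/docs/index.HTM']
```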

A.2 XML File Field Extraction File Format

Use this section when you want to extract the contents of a specific XML document. For more information, see Section The Document Processor: parse_xml Window on page 114.

Example A.1: Original Document

    <article>
      <content>foo bar</content>
      <tsrc>unwanted garbage</tsrc>
      <thumbnail>
        <tsrc>http://img.com/</tsrc>
      </thumbnail>
    </article>

Suppose you want to extract the value of the content field, and the value of the tsrc field in the thumbnail field. In order to extract only the tsrc field that is located inside the thumbnail field, specify the following syntax:

    <article>
      <content />
      <tsrc index="no" />
      <thumbnail index="no">
        <tsrc />
      </thumbnail>
    </article>

In this example, the attribute index has the value "no". This value specifies that the parser does not add the value of this field to its list of documents.

The default value of the index attribute is "yes". This specification means that every field in the input XML that does not have the index attribute remains in the document.
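The effect of the field extraction file on the document in Example A.1 can be reproduced with a short script. This sketch is one reading of the rules above: it walks the extraction file and the document in parallel and keeps the text of each field unless that field is marked index="no". The function name and the path notation are hypothetical, and the parse_xml document processor might behave differently in cases that this example does not cover.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<article>
  <content>foo bar</content>
  <tsrc>unwanted garbage</tsrc>
  <thumbnail>
    <tsrc>http://img.com/</tsrc>
  </thumbnail>
</article>"""

EXTRACTION_FILE = """<article>
  <content />
  <tsrc index="no" />
  <thumbnail index="no">
    <tsrc />
  </thumbnail>
</article>"""


def extract(spec_node, doc_node, path="", out=None):
    """Collect (path, text) pairs for the fields that the spec keeps."""
    if out is None:
        out = []
    here = f"{path}/{doc_node.tag}"
    text = (doc_node.text or "").strip()
    # The index attribute defaults to "yes" when it is absent.
    if spec_node.get("index", "yes") != "no" and text:
        out.append((here, text))
    # Descend only into the child fields that the spec names.
    for spec_child in spec_node:
        for doc_child in doc_node.findall(spec_child.tag):
            extract(spec_child, doc_child, here, out)
    return out


fields = extract(ET.fromstring(EXTRACTION_FILE), ET.fromstring(DOCUMENT))
for field_path, value in fields:
    print(field_path, "=", value)
# /article/content = foo bar
# /article/thumbnail/tsrc = http://img.com/
```

The top-level tsrc value is skipped because that field is marked index="no", while the tsrc field inside thumbnail is kept.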

Appendix B: Recommended Reading

The following books are recommended as companion guides:

- SAS Information Retrieval Studio: Installation Guide: Install SAS Information Retrieval Studio and prerequisite software.

- SAS Information Retrieval Studio: User’s Guide: Use the search window that an administrator customized to query the index built in SAS Information Retrieval Studio.

- SAS Sentiment Analysis Studio: User’s Guide: Create a SAS Sentiment Analysis Studio project, test, and upload it to SAS Sentiment Analysis Server.

- SAS Sentiment Analysis Server: Administrator’s Guide: Automate the process of applying the rules that you define in SAS Sentiment Analysis Studio to your input documents.

- SAS Sentiment Analysis Workbench: Installation Guide: Install SAS Sentiment Analysis Workbench and prerequisite software.

- SAS Sentiment Analysis Workbench: Administrator’s Guide: Set up SAS Sentiment Analysis Studio projects, add users, and specify the files to be used. These files include SAS Sentiment Analysis Studio and SAS Content Categorization Studio files.

- SAS Sentiment Analysis Workbench: User’s Guide: Review and edit the automated analyses and create reports with graphs that illustrate these analyses.

- SAS Content Categorization: User’s Guide: Create a SAS Content Categorization Studio project, test, and upload to SAS Content Categorization Server.

- SAS Content Categorization Studio: Quick Start Guide: Advanced users can learn how to expeditiously set up a SAS Content Categorization Studio project.

- SAS Content Categorization: Installation Guide: Install SAS Content Categorization Server.

- SAS Content Categorization Server: Administrator’s Guide: Understand how SAS Content Categorization Server applies the .mco and .concepts files to input documents. Program this application using the Java language.

- SAS Contextual Extraction Studio: Administrator’s Guide: Use this add-on application to SAS Content Categorization Studio to write complex concept definitions that can include multiple rule types within a single definition.

- SAS Contextual Extraction Studio: Installation Guide: Install SAS Contextual Extraction Studio.

- SAS Document Conversion: Developer’s Guide: Use this C API for SAS Document Conversion to convert documents in formats such as Adobe PDF and Microsoft Office into text.

- Use the language book that applies to the language that you use to create your project. Each of the SAS world language books contains a comprehensive list of part-of-speech tags.

- SAS offers instructor-led training and self-paced e-learning courses to help you get started with the SAS add-in, learn how the SAS add-in works with the other products in the SAS Enterprise Intelligence Platform, and learn how to run stored processes in the SAS add-in. For more information about the courses available, see support.sas.com/training.

For a complete list of SAS publications, see the current SAS Publishing Catalog. To order the most current publications or to receive a free copy of the catalog, contact a SAS representative at

SAS Publishing Sales
SAS Campus Drive
Cary, NC 27513
Telephone: (800) 727-3228*
Fax: (919) 677-8166
E-mail: [email protected]
Web address: support.sas.com/pubs

* For other SAS Institute business, call (919) 677-8000.

Customers outside the United States should contact their local SAS office.

Appendix C: Glossary

Boolean operators

specifies words, such as AND, OR, and NOT, that construct logical definitions to locate the matches that you seek.
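
As a rough illustration of how Boolean operators combine simple term matches into a logical definition, the following Python sketch evaluates AND, OR, and NOT against the words of a document. The `matches` function and the tuple-based rule format are hypothetical; they are not the rule syntax of any SAS product.

```python
def matches(rule, words):
    """Evaluate a nested Boolean rule against a set of lowercase words.

    A rule is either a plain term (str) or a tuple whose first item is
    "AND", "OR", or "NOT". This format is illustrative only.
    """
    if isinstance(rule, str):
        return rule.lower() in words
    op, *args = rule
    if op == "AND":
        return all(matches(arg, words) for arg in args)
    if op == "OR":
        return any(matches(arg, words) for arg in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")


doc_words = set("the crawler downloads pages from the web site".split())

# crawler AND (web OR feed) AND NOT database
rule = ("AND", "crawler", ("OR", "web", "feed"), ("NOT", "database"))
result = matches(rule, doc_words)
print(result)
```

The rule matches only when every AND branch holds, so adding the word "database" to the document would turn the match into a non-match.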

caption

specifies an alternative version of a label field name that is displayed to the end user during facetted search. Also see label.

categorization

concisely defines the subject matter of a document; in other words, the main idea or subject of the document.

checksum operation

eliminates a duplicate document according to the type of operation that you specify. For example, choose to remove old documents, or eliminate documents with the same URL.
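
The idea of removing duplicates by a chosen criterion can be sketched in Python as follows. This is a minimal sketch, assuming that documents are dictionaries with `url` and `body` fields; the `drop_duplicates` and `content_checksum` helpers are hypothetical and do not describe how the product implements its checksum operations.

```python
import hashlib


def content_checksum(doc):
    """Return a SHA-256 checksum of the document body (illustrative choice)."""
    return hashlib.sha256(doc["body"].encode("utf-8")).hexdigest()


def drop_duplicates(docs, key):
    """Keep the first document seen for each key value and drop the rest."""
    seen = set()
    kept = []
    for doc in docs:
        value = key(doc)
        if value not in seen:
            seen.add(value)
            kept.append(doc)
    return kept


docs = [
    {"url": "http://example.com/a", "body": "hello"},
    {"url": "http://example.com/a", "body": "hello again"},
    {"url": "http://example.com/b", "body": "hello"},
]

# Two possible "types of operation": same URL, or same content.
by_url = drop_duplicates(docs, key=lambda d: d["url"])
by_content = drop_duplicates(docs, key=content_checksum)
print(len(by_url), len(by_content))
```

Passing the criterion as a `key` function keeps the duplicate test separate from the bookkeeping, so another criterion can be swapped in without changing the loop.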

concept

specifies any of the following to locate in an input document: a string, a token, or an argument. A concept locates the metadata of the input document.

contextual extraction

specifies concepts and facts.

corpus

specifies one set of documents. For multiple sets, see corpora.

corpora

specifies multiple sets of training documents. See corpus for one set.

crawl

an entire run of a crawler, instead of a single page download.

definition

defines a concept. There can be many rules for each concept definition. This term is also used interchangeably with rule. See rule.

document

is a unit of textual data. For example, a document can be an HTML page, a Microsoft Word file, a PDF file, or one row in a CSV file or a database. A document can also be an article or summary in a feed.

In SAS Information Retrieval Studio, each document is represented as a configurable set of fields. Each field has a name and a value. Unnecessary fields can be either left empty or omitted from the document.
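
To make the field-based view of a document concrete, here is a minimal Python sketch that models a document as a mapping from field names to values and omits fields that are empty. The `make_document` helper and the field names are assumptions for illustration; they are not the product's internal representation.

```python
def make_document(**fields):
    """Build a document as a mapping of field name to value.

    Fields whose value is None or an empty string are omitted.
    """
    return {name: value for name, value in fields.items()
            if value not in (None, "")}


doc = make_document(
    url="http://example.com/a.html",
    title="Example page",
    body="Some text",
    abstract="",  # unnecessary field, left empty and therefore omitted
)
print(sorted(doc))
```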

dominance

when an object is dominant, it is mentioned more frequently than the other comparable objects that you defined in the Products tab of SAS Sentiment Analysis Studio.

facetted search

applies identifying labels to matched documents. These labels enable you to intuitively navigate to the documents that match your input query terms.

fact

links two or more concepts to reveal otherwise overlooked relationships in input documents.

filter

criteria that restrict the data that is displayed in a graph.

hash

changes a string of characters into a value that can be indexed. The hash process expedites the search process.
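
The following Python sketch shows the general idea of hashing a string into a value that can be used for indexing. The `bucket_for` function, the bucket count, and the choice of SHA-256 are assumptions made for illustration; the guide does not state which hash function the product uses.

```python
import hashlib


def bucket_for(text, buckets=16):
    """Map a string to a bucket number in the range 0..buckets-1."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % buckets


# Build a tiny index: bucket number -> set of terms stored in that bucket.
index = {}
for term in ["crawler", "index", "query", "crawler"]:
    index.setdefault(bucket_for(term), set()).add(term)

# A lookup inspects only one bucket instead of scanning every term.
found = "query" in index.get(bucket_for("query"), set())
print(found)
```

Because the same string always produces the same bucket number, a search can go directly to the right bucket, which is what makes hashed lookups fast.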

label

specifies the value of the field that is passed to the query web server for each match that appears within the search window. Also see caption.

metadata

identifies information about information.

MIME

is an acronym for Multipurpose Internet Mail Extensions. Non-ASCII messages are formatted using MIME to be sent over the Internet.

MIME type

is the format of the input document.

noise words

appear with enough frequency that they are ranked down in the metrics for weight.

polite

means that a single thread does not overwhelm a site with download requests, but respects the robots.txt standard. This standard enables Web site developers to specify portions of their sites that should not be crawled.
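
A robots.txt check of this kind can be sketched with the Python standard library. This is only an illustration of the robots.txt standard, not of the product's crawler; the robots.txt content, the URLs, and the user agent name `example-crawler` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content that closes one part of a site to all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("example-crawler", "http://example.com/index.html")
blocked = parser.can_fetch("example-crawler", "http://example.com/private/report.html")
print(allowed, blocked)
```

A polite crawler would run such a check before each download and skip any URL for which the check fails.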

precision

measures the accuracy of the model. It reflects the percentage of documents that were correctly classified.

prominence

indicates where the information about the product is located in the document. The information can appear primarily in the top 20%, or in the bottom 80%, of the selected document.

raw

specifies the original, unmodified content that was placed into an HTML document using this identifying field name.

recall

measures the number of documents that are a match for the definition out of those texts that were successfully returned.
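
The glossary entries for precision and recall do not give formulas. The Python sketch below uses the standard information-retrieval definitions, which are an assumption here and might differ in detail from the way the product computes these measures: precision is the share of returned documents that are relevant, and recall is the share of relevant documents that are returned. The `precision_recall` function and the document IDs are hypothetical.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from two collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents were returned; three documents are actually relevant.
p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(p, r)
```

In this example two of the four returned documents are relevant, and two of the three relevant documents are returned.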

rule

defines a category. Unless you use SAS Contextual Extraction Studio, only one rule defines each category. This term is also used interchangeably with definition. See definition.

sentiment

expresses feeling, or like or dislike.

string

refers to a group of words or characters that you specify for a rule.

Index

A

- Abstract source: defined, 59
- Action heading: defined, 20
- Add Backend window: usage, 141
- Add button: defined, 19, 20, 22, 23, 28, 29, 30, 38, 42, 47, 53, 56
- Add Credential window: usage, 135
- Add Entry Point window: usage, 121
- Add Extension window: usage, 139
- Add Field window: usage, 143, 146, 148
- Add Filename Extension window: usage, 133
- Add HTTP Proxy window: 120; usage, 118, 119, 120
- Add keywords to PDF links: defined, 59
- Add Path to Exclude window: usage, 138
- Add Path window: usage, 137
- Add Scope Rule window: usage, 130
- add_field: document processor, 74
- All Time button: defined, 67

- Apply Changes button: defined, 13; pipeline server, 40
- Auto-detect button: defined, 15, 33; feed crawler, 33, 34

B

- bsearch: defined, 317

C

- Caption heading: defined, 56
- categorizer: defined, 158
- color box window: usage, 151
- colors: search window, 331
- Colors tab: defined, 61
- concept_extractor: defined, 159
- Configuration pane: defined, 37
- content_categorization: document processor, 74
- contextual_extractor: defined, 159
- Cosine Weight: defined, 55
- Crawl continuously: defined, 27, 33
- Credentials tab: web crawler, 14

D

- Date: Sort tab, 54
- Date source: defined, 59
- Day: defined, 71
- Day field: defined, 67
- default_mime_type_from_url: defined, 159; document processor, 74
- default_title_from_url: defined, 159; document processor, 74
- Delete Index button: defined, 45, 303
- Delete Index window: usage, 153
- Density Weight: defined, 55
- document: defined, 44
- Document processing heading: defined, 40
- Document Processor: add_field window, 77; content_categorization window, 78, 274; default_mime_type_from_url window, 95; default_title_from_url window, 95; document_converter window, 96; export_csv window, 97; export_to_files window, 100; export_to_odbc window, 102; export_to_sentiment_analysis_workbench window, 104; extract_abstract window, 106; extract_pdate window, 107; heuristic_parse_html window, 108; invalidate_duplicates_by_url window, 110; match_and_copy window, 111; modify_field_name window, 112; parse_html window, 112; parse_xml window, 114; strip_html window, 115, 116; substitute window, 117
- document processors: choose, 235
- document_converter: defined, 159; document processor, 74
- Documents processed: defined, 36
- Documents queued: defined, 36
- Documents received: defined, 36

E

- Edit Backend window: usage, 142
- Edit button: defined, 19, 21, 22, 23, 28, 29, 30, 34, 39, 42, 47, 53, 57
- Edit Credential window: usage, 136
- Edit Entry Point window: usage, 125
- Edit Extension window: usage, 140
- Edit Field window: usage, 147, 150
- Edit Filename Extension window: usage, 134
- Edit Path to Exclude window: usage, 139
- Edit Path window: usage, 138
- Encapsulate XML files: define, 27
- entry points: defined, 196

- Entry Points tab: usage, 196; web crawler, 14
- Export Settings window: 119
- export_csv: defined, 160; document processor, 74
- export_to_files: defined, 160; document processor, 74
- export_to_odbc: defined, 160; document processor, 75
- export_to_sas_sentiment_analysis_workbench: document processor, 75
- export_to_sentiment_analysis_workbench: defined, 161
- Extension: defined, 21
- extract_abstract: defined, 159; document processor, 75
- extract_pdate: defined, 159; document processor, 75

F

- facetted search: defined, 163
- feed crawler: configure, 218; defined, 9, 157; Feeds pane, 34; General Settings pane, 33; operations, 31; run, 222; usage, 217
- Feed URL: feed crawler, 34

- Feeds tab: feed crawler, 32
- Field Name heading: defined, 46, 56
- Field Name tab: defined, 52
- Field value: 54; Sort tab, 54
- file crawler: configure, 207; defined, 9, 156; general settings, 208; run, 213
- Filename Extensions tab: file crawler, 26; web crawler, 14
- Find button: defined, 12
- Finished heading: defined, 41
- flattened hierarchy: search returns, 315
- Follow Links: feed crawler, 34
- Font: defined, 60
- Font size: defined, 61
- formatting: query web server, 162
- Freshness Weight: defined, 55
- fsearch: defined, 317
- Functionality heading: defined, 46

G

- General Settings tab: feed crawler, 32; file crawler, 26; web crawler, 14, 193

H

- Header background color: defined, 62
- heuristic_parse_html: defined, 159; document processor, 75
- Host: defined, 38
- Hour: defined, 70
- HTTP proxy: defined, 15; feed crawler, 33

I

- Import Settings window: 118
- index: configure, 298; defined, 161; input documents, 8
- indexing server: defined, 10; run, 302; usage, 297
- input documents: index, 8
- invalidate_duplicates_by_url: defined, 159; document processor, 75

L

- labels: defined, 163; hierarchy, 312; navigation tools, 264; usage, 321
- Labels tab: defined, 51
- Last busy time heading: defined, 41
- Last document processed: defined, 37
- Last document received: defined, 37
- Left header image: defined, 64
- Link prefix: defined, 59
- Link source: defined, 59
- Link suffix: defined, 59
- Link traversal order: defined, 17
- log file: entire application, 11; feed crawler, 223; file crawler, 214; indexing server, 303; pipeline server, 260; proxy server, 12, 230; query server, 306; query web server, 337, 349; troubleshoot, 223

M

- Match Formatting tab: defined, 51; usage, 326

- match type: select, 318
- Match Type heading: defined, 20
- match_and_copy: document processor, 75
- matches: sort, 319
- Matching tab: defined, 50; usage, 317
- Maximum file size (megabytes): defined, 26
- Maximum number of related labels: defined, 57
- Maximum number of retries: defined, 16
- Menu selected background color: defined, 63
- Menu selected text color: defined, 64
- Menu unselected background color: defined, 63
- Menu unselected text color: defined, 63
- MIME type source: defined, 59
- modify_field_name: defined, 159; document processor, 76
- Month: defined, 72
- Month field: defined, 67
- most frequent queries: defined, 162; query statistics server, 162
- most frequent queries without matches: query statistics server, 163
- Move Down button: defined, 42, 57

Move Up button
    defined ........ 42, 57

N

Next button
    defined ........ 67

no hierarchy
    search returns ........ 314

no labels
    search returns ........ 312

Number of downloader threads
    defined ........ 16

Number of lines
    defined ........ 11

Number of matching fields
    Sort tab ........ 54

Number of matching terms
    Sort tab ........ 54

Number of Occurrences
    defined ........ 69

Number of Occurrences heading
    defined ........ 68

Number of Queries
    defined ........ 70, 71, 72

O

Oldest date
    defined ........ 27

operation history
    log file ........ 206

Order added to the index
    Sort tab ........ 55

Overall heading
    defined ........ 40

P

parse_html
    defined ........ 160
    document processor ........ 76

parse_xml
    defined ........ 160
    document processor ........ 76

Password
    defined ........ 23

password-protected sites
    crawl ........ 203

Paths tab
    file crawler ........ 26

paths to crawl
    specify ........ 209

paths to exclude
    file crawler ........ 211

Paths to Exclude tab
    file crawler ........ 26

Pending heading
    defined ........ 41

pipeline server
    defined ........ 9
    operations ........ 234
    run ........ 258

Pipeline Server tab
    operations ........ 39

Pipeline Stage
    stages ........ 40

Port
    defined ........ 38

Position Weight
    defined ........ 55

Previous button
    defined ........ 67

processes
    order ........ 164

Proximity Weight
    defined ........ 55

proxy server
    configure ........ 228
    defined ........ 9, 35, 157
    operations ........ 157
    run ........ 229
    status ........ 226
    usage ........ 225

Q

queries ........ 8

Query
    defined ........ 69

Query heading
    defined ........ 68

query rate by day
    defined ........ 163

query rate by hour
    defined ........ 163

query rate by month
    defined ........ 163

query rate for all time
    defined ........ 163

query rates
    query statistics server ........ 163

query server
    defined ........ 10, 48, 305
    usage ........ 305

query statistics
    all queries ........ 348
    see ........ 341, 344
    this year ........ 346
    usage ........ 349

Query Statistics pane
    usage ........ 341

query statistics server
    defined ........ 10
    run ........ 340
    usage ........ 339

query web server
    configure ........ 316
    defined ........ 10
    run ........ 336
    usage ........ 309

Quota (files)
    defined ........ 16

Quota (megabytes)
    defined ........ 16

Quota heading
    defined ........ 18

R

Recrawl interval
    defined ........ 33

Refresh button
    defined ........ 9, 13

Relevancy
    Sort tab ........ 54

Remove button
    defined ........ 19, 21, 22, 23, 28, 29, 30, 34, 38, 42, 47, 53, 56

Reset to Default button
    defined ........ 64

Respect robots.txt
    defined ........ 16

Retrieve button
    defined ........ 11

Retry delay (seconds)
    defined ........ 16

Revert button
    defined ........ 13
    indexing server ........ 45

Right header image
    defined ........ 64

robots.txt
    defined ........ 16

S

sample project
    set up ........ 168, 179

SAS Content Categorization Server ........ 234, 237
    install ........ 234, 235, 237
    taxonomies ........ 234, 237

SAS Content Categorization Studio ........ 235, 237
    concepts and categories ........ 234

SAS Contextual Extraction Studio
    concepts and facts ........ 234, 237

SAS Document Conversion
    install ........ 234

SAS Sentiment Analysis Workbench
    install ........ 235, 237

SAS Text Miner ........ 235, 237
    install ........ 235, 237

scope
    defined ........ 198

Scope tab
    defined ........ 14

search
    query web server ........ 162
    type ........ 8

search box
    customize ........ 310

Search type
    defined ........ 52

send
    defined ........ 160
    document processor ........ 76

Sending to the indexer heading
    defined ........ 40

Server Port field
    defined ........ 50
    specify ........ 317

Site
    defined ........ 23

Sleep interval (seconds)
    defined ........ 16

sort
    matches ........ 319

sort the results
    query web server ........ 162

Sort type
    defined ........ 53

Sort Type field
    defined ........ 55

Sorting tab
    defined ........ 51, 53

Start button
    defined ........ 13

Status
    defined ........ 38

status tab
    defined ........ 36

Stop button
    defined ........ 13

strip_html
    defined ........ 160
    document processor ........ 76

substitute
    defined ........ 160
    document processor ........ 76

T

Take Snapshot
    usage ........ 43

Text to highlight
    defined ........ 12

Theme pane
    usage ........ 329

Theme tab
    defined ........ 51

This Month button
    defined ........ 66

This Year button
    defined ........ 66

Tiebreaker
    defined ........ 55

Timeout (seconds)
    defined ........ 16

Title
    defined ........ 60

Title field
    defined ........ 58

Title source
    defined ........ 58

Today button
    defined ........ 66

types of files
    limit ........ 202

U

URL heading
    defined ........ 18

URL Pattern heading
    defined ........ 20

Use pop-up menus
    defined ........ 61

User agent
    defined ........ 33

Username
    defined ........ 23

W

web crawler
    configure ........ 192
    defined ........ 9, 156
    run ........ 205
    specify operations ........ 193

Web Crawler pane
    operations ........ 12

Weight tab
    defined ........ 52

X

XML document
    extract contents ........ 353

Y

Year field
    defined ........ 67
