Information Retrieval
JOSA Data Science Bootcamp
Kais Hassan
● Chief Data Officer @ Altibbi.com
○ Data Science
○ BI
● Created several domain specific search solutions
● Previously Assistant Professor @ PSUT
● PhD in Computer Science from England (Medical Imaging)
Agenda
Introduction to IR and search
● Unstructured text, document-based storage
● Search Engines vs. Databases
● Inverted Index
Intro to Lucene/Solr
● Available open source search libraries and engines
● Architectural diagram for Lucene and Solr
Solr basics
● Hands-on implementation of the first Solr collection
● Indexing (example: XML files)
● Retrieving information from Solr - basic queries and parameters
Custom field data types
● Copy fields
● Analysis chain: Analyzers, Tokenizers and Character Filters
● Analyzers: case sensitivity, lemmatization, stemming, synonyms, shingles
Exercise
● Autocomplete using n-grams
Solr @ Altibbi
● Real-life examples
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
– These days we frequently think first of web search, but there are many other cases:
• Corporate knowledge bases
• Text classification
• Text clustering
Information Retrieval
Basic assumptions of Information Retrieval
• Document-based storage/collection: a set of self-contained documents; all of the data for a document is stored in the document itself, not in related tables as it would be in a relational database
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
How good are the retrieved docs?
▪ Precision : Fraction of retrieved docs that are relevant to the user’s information need
▪ Recall : Fraction of relevant docs in collection that are retrieved
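The two measures above can be computed over sets of document IDs; a minimal Python sketch (the doc IDs below are hypothetical, purely for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved docs that are actually relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 docs retrieved, 5 docs relevant, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
# p = 3/4 = 0.75, r = 3/5 = 0.6
```

Note the tension the example exposes: retrieving more documents can only raise recall, but usually lowers precision.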
IR vs. databases:
Structured vs. unstructured data
• Structured data tends to refer to information in “tables”
– Typically allows numerical range and exact-match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries including operators
– More sophisticated “concept” queries, e.g.,
• find all web pages dealing with drug abuse
• Classic model for searching text documents
• An estimated 85% of the world’s data is unstructured
The Inverted Index - key data structure in IR
Stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Normalization
– Map text and query terms to the same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
Inverted index construction
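The construction steps above can be sketched as a toy Python implementation (deliberately simplified: whitespace tokenization, lowercasing as the only normalization, a tiny stop list, and no stemming):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text):
    # Tokenization + normalization: split on whitespace,
    # strip surrounding punctuation, lowercase each token.
    return [tok.strip(".,!?").lower() for tok in text.split()]

def build_inverted_index(docs):
    # Map each term to the sorted list of doc IDs containing it
    # (the postings list).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if term and term not in STOP_WORDS:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "The cat sat.", 2: "A cat and a dog."}
index = build_inverted_index(docs)
# index["cat"] == [1, 2]; "the" and "a" are dropped as stop words
```

A real engine like Lucene adds positions, term frequencies, and compressed on-disk postings, but the core mapping from term to postings list is the same.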
What is Lucene?
➔ High performance, scalable, full-text search library
➔ Focus: indexing + searching documents
◆ A “Document” is just a list of name+value pairs
➔ No crawlers or document parsing
➔ Flexible text analysis (tokenizers + token filters)
➔ 100% Java, no dependencies, no config files
Both Solr and Elasticsearch are built on top of it.
What is Solr?
• A full-text search server based on Lucene
• XML/HTTP, JSON interfaces
• Faceted search (category counting)
• Flexible data schema to define types and fields
• Hit highlighting
• Configurable advanced caching
• Index replication
• Written in Java
Solr Architectural Diagram
Solr Terminology
core: a physical instance of a Lucene index (the index files) along with all the Solr configuration files,
i.e. an index with a given schema that holds a set of documents.
collection: a logical index in a SolrCloud cluster, associated with a config set stored in ZooKeeper.
In non-distributed (standalone) Solr, a core is sometimes referred to as a collection.
Understanding Solr Directory Structure
bin: bash files to control Solr
contrib: additional plugins (e.g. clustering)
dist: Solr libraries
docs: documentation and tutorial
example: sample data and configuration
licenses: software licenses used in Solr
Server Folder
contexts + etc + lib + modules: Jetty folders
logs: Solr and Jetty log files
resources: logging configuration
scripts: utility files for ZooKeeper and MapReduce
solr: solr.home directory, contains core directories
solr-webapp: Solr server + admin tool
Solr Important Environment Variables
solr.install.dir: The location where you extracted the Solr installation.
solr.solr.home (SolrHome): contains core configuration and data; must also contain solr.xml (configuration for Solr).
By default it is located at solr.install.dir/server/solr, but it can be changed to any location.
Exercise 1: getting started with Solr
Prerequisites:
1. Java 7 or higher is installed and JAVA_HOME is set
2. You have downloaded Solr 5.4.1 (tgz for Linux, zip for Windows)
3. A good text editor (anything but Notepad)
4. Downloaded bootcamp_config + nytimes_facebook_statuses.csv
Starting/Stopping Solr
1. cd to the extracted Solr folder
2. To start: bin/solr start (Linux) or bin\solr.cmd start (Windows)
○ Solr will start and listen on port 8983
○ bin/solr start -help will show start options (useful for changing defaults)
3. To stop: bin/solr stop
Creating a Solr core
After starting Solr, you can create a core either by:
1. the bin/solr create command, or
2. creating a folder inside solr.home containing:
a. core.properties (containing core configuration such as name=$CORE_NAME)
b. a conf folder containing at least solrconfig.xml and schema.xml
c. then loading the core using the API or Solr Admin (or restarting Solr)
➔ We will use the create command in this session
➔ Make sure you have copied the bootcamp_config folder to solr.install.dir/server/solr/configsets
bin/solr create -c hellosolr -d bootcamp_config
Why a Custom Configuration?
❖ The create command with the default confdir copies configuration from data_driven_schema_configs, a managed (schemaless) schema with field-guessing support and dynamic fields enabled. It is good for quick prototyping, but I prefer to choose my field types manually.
❖ The basic_configs configuration: its schema.xml and solrconfig.xml contain a lot of configuration and comments you won’t need at first, which can be a bit overwhelming to start with.
➢ That said, the comments are good documentation and I encourage you to read them at some stage.
Looking at the hellosolr core folder
● core.properties file: contains the core name and other configuration, see https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
● data folder: contains the Lucene index files
● conf folder: configuration for the core; inside it:
○ schema.xml: main configuration file for defining fields, text analysis, etc.
○ solrconfig.xml: configuration for request handlers, data, caching, etc.
Live demo explaining important parts of these files
Solr Admin - Demo
Indexing NYTimes Facebook Statuses 1
● 33k NYTimes Facebook statuses in CSV format
● Add the following fields to schema.xml:
<field name="status_message" type="text_en" indexed="true" stored="true" />
<field name="link_name" type="text_en" indexed="true" stored="true" />
<field name="status_type" type="string" indexed="true" stored="true" />
<field name="status_link" type="string" indexed="true" stored="true" />
<field name="status_published" type="tdate" indexed="true" stored="true" />
<field name="num_likes" type="tint" indexed="true" stored="true" />
<field name="num_comments" type="tint" indexed="true" stored="true" />
<field name="num_shares" type="tint" indexed="true" stored="true" />
Indexing NYTimes Facebook Statuses 2
● Reload the core via Solr Admin
● Index documents via the post util
bin/post -c hellosolr nytimes_facebook_statuses.csv
● If all is good, you should have 33,295 documents in your index
You can add documents to Solr via:
● Data Import Handler (recommended)
● the post util
● APIs
● ManifoldCF (not sure it is worth it if you don’t have diverse inputs)
NYTimes Basic Queries
Add the following request handler to solrconfig.xml + reload the core
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="mm">2</str>
    <str name="fl">*,score</str>
    <str name="qf">status_message^9.0 link_name^3.0</str>
    <str name="q.alt">*:*</str>
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.limit">20</str>
    <str name="facet.field">status_type</str>
    <str name="indent">true</str>
  </lst>
  <lst name="invariants">
    <str name="rows">10</str>
    <str name="wt">json</str>
  </lst>
</requestHandler>
Basic Queries
The most basic query request for Solr is as follows:
http://ServerName:Port/solr/coreName/select?q=QueryString
To find china in the previously mentioned schema:
http://localhost:8983/solr/hellosolr/search?q=china
Looking closer at the request, notice the q parameter.
q: the main query for the request. If you assign q=*:*, it will return all documents.
edismax Query Parser - 1
The default query parser that comes with Solr is somewhat limited.
To use a more advanced query parser, use edismax.
mm (Minimum 'Should' Match): this parameter is useful when searching for several words. For example:
mm=1 (at least one word in the query must match)
mm=2 (at least two words in the query must match)
edismax Query Parser - 2
Notice the result difference between the following queries
http://localhost:8983/solr/hellosolr/search?q=china+jordan&mm=1
and
http://localhost:8983/solr/hellosolr/search?q=china+jordan&mm=2
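The effect of mm can be mimicked outside Solr with a small sketch (illustrative only; Solr’s real mm parameter also supports percentages and conditional specs like "2<75%"):

```python
def matches_mm(query_terms, doc_terms, mm):
    """A document matches if at least mm of the distinct query terms occur in it."""
    return sum(t in doc_terms for t in set(query_terms)) >= mm

# Hypothetical tokenized document that mentions china but not jordan.
doc = {"china", "warns", "on", "trade"}
matches_mm(["china", "jordan"], doc, 1)  # True: one term ("china") matches
matches_mm(["china", "jordan"], doc, 2)  # False: "jordan" is missing
```

This is why mm=1 returns more results than mm=2 for the query "china jordan": the first accepts documents mentioning either term, the second requires both.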
Field Definitions
• Field attributes: name, type, indexed, stored, multiValued

<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="false"/>
<field name="price" type="sfloat" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
Fields
▪ Fields may:
▪ Be indexed or not
▪ Indexed fields may or may not be analyzed (i.e., tokenized with an Analyzer)
▪ Non-analyzed fields treat the entire value as a single token (useful for URLs, paths, dates, social security numbers, ...)
▪ Be stored or not
▪ Useful for fields that you’d like to display to users
▪ Optionally store term vectors
▪ Like a positional index on the field’s terms
▪ Useful for highlighting, finding similar documents, categorization
copyField
• Copies one field to another at index time
• Use case #1: Analyze the same field in different ways
– copy into a field with a different analyzer
– boost exact-case matches

<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>

• Use case #2: Index multiple fields into a single searchable field
Custom Field Types
In Solr you can create custom field types which specify the text analysis pipeline:
<fieldType name="my_arabi" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ar.txt" />
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
Tokenizers and TokenFilters
Analyzers are typically composed of Tokenizers and TokenFilters.
● Tokenizer: controls how your text is tokenized; there can be only one Tokenizer in each Analyzer
● TokenFilter: mutates and manipulates the stream of tokens
Solr lets you mix and match Tokenizers and TokenFilters in schema.xml to define Analyzers.
Most factories have customization options.
Notable Token(izers|Filters) - 1/2
WhitespaceTokenizerFactory: creates tokens by splitting the text on whitespace
StandardTokenizerFactory: General purpose tokenizer that strips extraneous characters
LowerCaseFilterFactory: Lowercases the letters in each token
TrimFilterFactory: Trims whitespace at either end of a token.
● Example: " Kittens! ", "Duck" ==> "Kittens!", "Duck".
PatternReplaceFilterFactory: applies a regex replacement to each token
● Example: pattern="([^a-z])" replacement="" (strips non-letter characters)
Notable Token(izers|Filters) - 2/2
StopFilterFactory
SynonymFilterFactory
EdgeNGramFilterFactory: creates edge n-grams (prefixes of the token, from minGramSize up to maxGramSize characters)
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian" (assuming the token was lowercased by an earlier filter)
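The filter’s core behavior can be reproduced with a short Python sketch (a simplification of what the Lucene filter does per token):

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Emit prefixes of the token from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

edge_ngrams("nigerian")
# ['ni', 'nig', 'nige', 'niger', 'nigeri', 'nigeria', 'nigerian']
```

Because every emitted gram is a prefix of the original token, a short query like "nig" matches the indexed grams directly, which is what makes this filter useful for autocomplete.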
For a list of available Filters
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Analysis Tool
Part of Solr Admin that allows you to enter text and see how it would be analyzed for a given field (or field type).
Displays step-by-step information for analyzers configured using Solr factories:
- the token stream produced by the Tokenizer
- how the token stream is modified by each TokenFilter
- how the tokens produced when indexing compare with the tokens produced when querying
Helpful in deciding which Tokenizer/TokenFilters you want to use for each field based on your goals.
Hands-on: Tokenizers and Filters
Live Demo
Exercise - Autocomplete using n-grams
Requirements:
1) Match from the edge of the field. E.g. if the document field is "مرض السكري" (Arabic for "diabetes") and the query is "مرض ال", it will match, but "السكري" will not match.
2) Match any word in the input field, with implicit truncation. This means that the field "مرض السكري" will be matched by the query "السكري". We use this to get partial matches, but these should be boosted lower.
Tip: WordDelimiterFilterFactory + EdgeNGramFilterFactory
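One possible schema.xml sketch for requirement 1 (the field type name and the exact gram sizes below are illustrative assumptions, not the official solution):

```xml
<fieldType name="text_edge_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: keep the whole field as one token, then emit edge n-grams -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <!-- Query time: no n-grams, so the query is matched as a plain prefix -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

For requirement 2, a second field type would tokenize on words first (e.g. WhitespaceTokenizerFactory plus the WordDelimiterFilterFactory from the tip) before the edge n-gram filter, and that field would get a lower boost in qf.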
Solr @ Altibbi
Live Demo
Further Reading
The “Apache Solr Reference Guide” is always handy.