code4lib 2011 preconference: What's New in Solr (since 1.4.1)


What's New in Solr?

code4lib 2011 preconference, Bloomington, IN

presented by Erik Hatcher of Lucid Imagination

about me

spoken at several code4lib conferences

Keynoted Athens '07 along with the pioneering Solr preconference,

Providence '09, "Rising Sun"

pre-conferenced Asheville '10, "Solr Black Belt"

co-authored "Lucene in Action", first edition; ghost/toast on second edition

Lucene and Solr committer.

library world claims to fame: founded and named Blacklight; original developer on Collex and the Rossetti Archive search

now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc

abstract

The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.

LIA2 - Lucene in Action

Published: July 2010 - http://www.manning.com/lucene/

New in this second edition:

Performing hot backups

Using numeric fields

Tuning for indexing or searching speed

Boosting matches with payloads

Creating reusable analyzers

Adding concurrency with threads

Four new case studies, and more

Version Number

Which one ya talking 'bout, Willis?

3.1? 4.0?? TRUNK??

playing with fire

index format changes to be expected

reindexing recommended/required

Solr/Lucene merged development codebases

releases should occur in lock-step moving forward

dependencies

November 2009: Solr 1.4 (Lucene 2.9.1)

June 2010: Solr 1.4.1 (Lucene 2.9.3)

Spring 2011(?): Solr 3.1 (Lucene 3.1)

TRUNK: Solr 4.x (Lucene TRUNK)

lucene

per-segment field cache, etc

Unicode and analysis improvements throughout

Analysis "attributes"

AutomatonQuery: RegexpQuery, WildcardQuery

flexible indexing

and so much more!

README

Reindex!

Upgrade SolrJ libraries too (javabin format changed)

Read Lucene and Solr's CHANGES.txt files for all the details

Analysis

UAX, using ICU

CollationKey

PatternReplaceCharFilter

KeywordMarkerFilterFactory, StemmerOverrideFilterFactory

Standard tokenization

ClassicTokenizer: old StandardTokenizer

StandardTokenizer: now uses Unicode text segmentation specified by UAX#29

UAX29URLEmailTokenizer

maxTokenLength: default=255

PathHierarchyTokenizer

delimiter: default=/

replace: default=<delimiter>

"/foo/bar" => [/foo] [/foo/bar]

CollationKeyFilter

A filter that lets one specify:

A system collator associated with a locale, or

A collator based on custom rules

This can be used to change the sort order for non-English languages, as well as to modify the collation sequence for certain languages. You must use the same CollationKeyFilter at both index time and query time for correct results. Also, the JVM vendor and version (including patch version) on the slave should be exactly the same as on the master (or indexer) for consistent results.

http://wiki.apache.org/solr/UnicodeCollation

see also: ICUCollationKeyFilter
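As a sketch of what such a collated field type looks like (the field type name and locale here are illustrative; see the UnicodeCollation wiki page for the full recipe):

```xml
<!-- schema.xml sketch: a sort field collated for German at primary strength -->
<fieldType name="sort_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
            language="de"
            strength="primary"/>
  </analyzer>
</fieldType>
```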

ICU

International Components for Unicode

ICUFoldingFilter

ICUNormalizer2Filter

name=nfc|nfkc|nfkc_cf

mode=compose|decompose

filter

ICUFoldingFilter

Accent removal, case folding, canonical duplicates folding, dashes folding, diacritic removal (including stroke, hook, descender), Greek letterforms folding, Han Radical folding, Hebrew Alternates folding, Jamo folding, Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native digit folding, No-break folding, Overline folding, Positional forms folding, Small forms folding, Space folding, Spacing Accents folding, Subscript folding, Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding, Vertical forms folding, Width folding

Additionally, Default Ignorables are removed, and text is normalized to NFKC.

All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.

ICUTransformFilter

id: specific transliterator identifier from ICU's Transliterator#getAvailableIDs() (required)

direction=forward|reverse

Examples:

Traditional-Simplified: 簡化字 => 简化字

Cyrillic-Latin: Российская Федерация => Rossijskaâ Federaciâ
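A sketch of how the filter slots into an analyzer chain (surrounding fieldType omitted; "Cyrillic-Latin" is one of ICU's registered transliterator ids):

```xml
<!-- schema.xml sketch: transliterate Cyrillic text to Latin during analysis -->
<filter class="solr.ICUTransformFilterFactory" id="Cyrillic-Latin"/>
```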

Tom Burton-West's latest

ICU

shingles

query parser

ABC -> [A] [B] [C] or [AB] [BC]...
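The bracketed notation above can be mimicked with a toy sketch (plain Python, not the actual shingle filter):

```python
def shingles(tokens, size=2):
    """Toy mimic of a shingle (word n-gram) filter over a token stream."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

print(shingles(["A", "B", "C"]))  # ['A B', 'B C']
```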

highlighter

old configuration style deprecated; now configured as a standard search component

FastVectorHighlighter

FastVectorHighlighter

if termVectors="true", termPositions="true", and termOffsets="true"

and hl.useFastVectorHighlighter=true

hl.fragListBuilder

hl.fragmentsBuilder
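Putting the requirements together, a sketch (the field name content is illustrative):

```xml
<!-- schema.xml sketch: the highlighted field must index term vectors
     with positions and offsets for FastVectorHighlighter to kick in -->
<field name="content" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Then request highlighting with hl=true&hl.fl=content&hl.useFastVectorHighlighter=true.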

spatial

JTeam's plugin: packaged for easy deployment

Solr trunk capabilities

many distance functions

What's missing?

geo faceting? scoring by distance? distance pseudo-field?

All units in kilometers, unless otherwise specified

Spatial field types

Point: n-dimensional, must specify dimension (default=2), represented by N subfields internally

LatLon: latitude,longitude, represented by two subfields internally, single valued only

GeoHash: single string representation of lat/lon

Spatial query parsers

geofilt: exact filtering

bbox: uses (trie) range queries

Parameters:

sfield: spatial field

pt: reference point

d: distance
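Both parsers take the same three parameters; a sketch filtering to documents whose store field (as in the example schema) lies within 5 km of a point:

```
fq={!geofilt sfield=store pt=39.19,-86.43 d=5}   exact distance filter
fq={!bbox sfield=store pt=39.19,-86.43 d=5}      faster bounding-box approximation
```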

field collapsing/grouping

backwards compatibility mode?

http://wiki.apache.org/solr/FieldCollapsing

group=true

group.field / group.func / group.query

rows / start: for groups, not documents

group.limit: number of results per group

group.offset: offset into doclist of each group

sort: how to sort groups, by top document in each group

group.sort: how to sort docs within each group

group.format: grouped | simple

group.main=true|false: if true, grouped results are returned flattened as the main result list (implies the simple format)

faceting works as normal

not distributed savvy yet
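A sketch request combining several of these parameters (the manufacturer and price fields are hypothetical):

```
/select?q=camera&group=true&group.field=manufacturer&group.limit=3&group.sort=price+asc
```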

query parsing

TextField: autoGeneratePhraseQueries="true"

generates phrase queries when a single query term analyzes to multiple tokens

{!raw|term|field f=$f}...

Recall why we needed {!raw} from last year

<fieldType .../> - use one string, one numeric (and one text?)

<field name="..."/>

table for numeric and for string (and text?):

{!raw f=$f} | TermQuery(...)

{!term f=$f} | ...

{!field f=$f} | ...

Which to use when? {!raw} works for strings just fine, but best to migrate to the generally safer/wiser {!term} for future-proofing.

{!term f=field}

fq={!term f=weight}1.5

dismax

q.op or schema.xml's <solrQueryParser defaultOperator="[AND|OR]"/> defaults mm to 0% (OR) or 100% (AND)

#code4lib: issues with non-analyzed fields in qf

edismax

Supports full lucene query syntax in the absence of syntax errors

supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode

When there are syntax errors, improved smart partial escaping of special characters is done to prevent them... in this mode, fielded queries, +/-, and phrase queries are still supported.

Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.

advanced stopword handling... stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.

Supports the "boost" parameter... like the dismax bf param, but multiplies the function query instead of adding it in

Supports pure negative nested queries... so a query like +foo (-foo) will match all documents
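A sketch request pulling a few of these together (field names hypothetical; note boost multiplies rather than adds, unlike bf):

```
/select?defType=edismax&q=solr in action&qf=title^2+text&boost=log(popularity)
```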

function queries

termfreq, tf, docfreq, idf, norm, maxdoc, numdocs

{!func}termfreq(text,ipod)

standard java.util.Math functions

faceting

per-segment, single-valued fields:

facet.method=fcs (field cache per segment)

facet.field={!threads=-1}field_name

threads=0: direct execution

threads=-1: thread per segment

speeds up single and multivalued method=fc, especially for deep paging with facet.offset

date faceting improvements, generalized for numeric ranges too

can now exclude main query q={!tag=main}the+query&facet.field={!ex=main}category

pivot/grid/matrix/tree faceting

is this also "hierarchical faceting"? it depends!

pivot faceting output

/select?q=*:*&rows=0&facet=on&facet.pivot=cat,popularity,inStock&facet.pivot=popularity,cat

spell checking

DirectSolrSpellChecker

no external index needed, uses automaton on main index

spellcheck config

solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textgen</str>
  <!-- a spellchecker that uses no auxiliary index -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="minPrefix">1</str>
  </lst>
</searchComponent>

spellcheck handler

solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

spellcheck response

{ 'responseHeader'=>{
    'status'=>0,
    'QTime'=>10,
    'params'=>{
      'indent'=>'on',
      'wt'=>'ruby',
      'q'=>'ipud bluck'}},
  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]},
  'spellcheck'=>{
    'suggestions'=>[
      'ipud',{
        'numFound'=>1,
        'startOffset'=>0,
        'endOffset'=>4,
        'suggestion'=>['ipod']},
      'bluck',{
        'numFound'=>1,
        'startOffset'=>5,
        'endOffset'=>10,
        'suggestion'=>['black']},
      'collation','ipod black']}}

http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on

autosuggest

new "spellcheck" component, builds TST

collates query

can check if collated suggestions yield results, optionally, providing hit count

suggest config

solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textgen</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">
      org.apache.solr.spelling.suggest.jaspell.JaspellLookup
    </str>
    <str name="field">suggest</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

schema.xml:

<field name="suggest" type="textgen" indexed="true" stored="false"/>
<copyField source="name" dest="suggest"/>

suggest handler

solrconfig.xml:

<requestHandler class="solr.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.count">10</str>
    <str name="rows">0</str>
    <str name="spellcheck.maxCollationTries">20</str>
    <str name="spellcheck.maxCollations">10</str>
    <str name="spellcheck.collateExtendedResults">true</str>
  </lst>
  <arr name="components">
    <str>query</str> <!-- to allow suggestion hit counts to be returned -->
    <str>spellcheck</str>
  </arr>
</requestHandler>

suggest response

{ 'responseHeader'=>{
    'status'=>0,
    'QTime'=>2},
  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]},
  'spellcheck'=>{
    'suggestions'=>[
      'ip',{
        'numFound'=>1,
        'startOffset'=>0,
        'endOffset'=>2,
        'suggestion'=>['ipod']},
      'collation',[
        'collationQuery','ipod',
        'hits',3,
        'misspellingsAndCorrections',[
          'ip','ipod']]]}}

http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on

sort

by function

&q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc

but still can't get value of function back

unless you force it to be the score somehow
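One such workaround is to make the function itself the relevancy score, so the distance comes back in the score pseudo-field (a sketch using the geodist example above):

```
q={!func}geodist()&sfield=store&pt=39.194564,-86.432947&fl=*,score&sort=score asc
```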

clustering component

now works out-of-the-box; all Apache license compatible

supports distributed search

debug=true

debug=true|all|timing|query|results

debug=results&debug.explain.structured=true

structured explain

'debug'=>{
  'explain'=>{
    'doc1'=>{
      'match'=>true,
      'value'=>0.076713204,
      'description'=>'fieldWeight(title:solr in 0), product of:',
      'details'=>[
        { 'match'=>true,
          'value'=>1.0,
          'description'=>'tf(termFreq(title:solr)=1)'},
        { 'match'=>true,
          'value'=>0.30685282,
          'description'=>'idf(docFreq=1, maxDocs=1)'},
        { 'match'=>true,
          'value'=>0.25,
          'description'=>'fieldNorm(field=title, doc=0)'}]}}}}

http://localhost:8983/solr/select?q=title:solr&debug.explain.structured=true&debug=results&wt=ruby&indent=on

SolrCloud

shared/central config and core/shard management via ZooKeeper

built-in load balancing, and infrastructure for future SolrCloud work

/update/json

solrconfig.xml:

<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>

curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' -d '{ "add": { "doc": { "id" : "MyTestDocument", "title" : "This is just a test" } }}'

wt=csv

Writes only docs (no response header or response extras) in CSV format

Roundtrippable with /update/csv

provided all fields are stored
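A sketch of the round trip (field names hypothetical): select with wt=csv, then POST the output back through /update/csv:

```
/select?q=*:*&wt=csv&fl=id,title,price
```

Only stored fields can be written out, so the round trip is lossless only when every field in fl is stored.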

UIMA

Unstructured Information Management Architecture

http://uima.apache.org/

New update processor chain, augmenting incoming documents from a UIMA annotator pipeline

http://wiki.apache.org/solr/SolrUIMA

(solr|lucene)-dev

ant [idea|eclipse]

go!

http://wiki.apache.org/solr/HowToContribute

works in progress

some interesting open issues (with patches):

PayloadTermQuery

XMLQueryParser plugin

join

{!join from=$f to=$t}

insert <what Yonik said>

https://issues.apache.org/jira/browse/SOLR-2272

Lucid (Imagination)

What's Lucid done for you lately?

Yonik, Mark, Grant, Hoss: Lucene and Solr performance, faceting, grouping, join query, spatial, Mahout, ORP, PMC, etc, etc, etc

Other technical staff involved in mailing list assistance, bug reporting, contributing patches (hi Lance, Erick, Jay, Tom, Grijesh, Tomas....)

extended dismax, join, faceting performance improvements

LucidWorks Enterprise

Hoss Simplicity

http://www.lucidimagination.com/blog/2011/01/21/solr-powered-isfdb-part1/

http://www.lucidimagination.com/blog/2011/01/28/solr-powered-isfdb-part-2/

LucidWorks Enterprise

"lucid" query parser

click boosting

tunable norms, per-field

role filtering

administrative UI

REST API

Data sources, crawlers, and scheduling

Alerts

http://www.lucidimagination.com/enterprise-search-solutions/lucidworks

Community Questions

fire away!

resources

duh!: #code4lib

lucene.apache.org/solr

search.lucidimagination.com/?q=<your query>

Q&A: faceting

why is paging through facets the way it is?

short-circuits on enum

Community:

- The state of Extended DisMax, and what Lucene features remain incompatible with it.

- Any developments on faceting (I've implemented the standard workaround to the "unknown facet list size" problem...  but I'd still love to be able to know exactly how long the lists are)

- Hierarchical documents in Solr -- I haven't followed the conversations closely, but I gather that this topic is gaining some momentum in the Solr community.

contact info

erik.hatcher @ lucidimagination . com

http://www.lucidimagination.com

webinars, documentation

LucidFind: search.lucidimagination.com

search mailing list posts, wiki pages, web sites, our blog, etc for latest Lucene/Solr assistance

re: code4lib