64
Rapid Prototyping with Solr uberconf - July 14, 2011 Presented by Erik Hatcher [email protected] Lucid Imagination http://www.lucidimagination.com

Rapid Prototyping with Solr

Embed Size (px)

Citation preview

Page 1: Rapid Prototyping with Solr

Rapid PrototypingwithSolr

uberconf - July 14, 2011Presented by Erik Hatcher

[email protected] Imagination

http://www.lucidimagination.com

Page 2: Rapid Prototyping with Solr

About me...

• Co-author, “Lucene in Action”

• Commiter, Lucene and Solr

• Lucene PMC and ASF member

• Member of Technical Staff / co-founder, Lucid Imagination

Page 3: Rapid Prototyping with Solr

About Lucid Imagination...

• Lucid Imagination provides commercial-grade support, training, high-level consulting and value-added software for Lucene and Solr.

• We make Lucene ‘enterprise-ready’ by offering:

• Free, certified, distributions and downloads.

• Support, training, and consulting.

• LucidWorks Enterprise, a commercial search platform built on top of Solr.

Page 4: Rapid Prototyping with Solr

AbstractGot data? Let's make it searchable! Rapid Prototyping with Solr will demonstrate getting documents into Solr quickly, provide some tips in adjusting Solr's schema to match your needs better, and finally will discuss how to showcase your

data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and

debugging. Even after all that, there will be enough time left to outline the next steps in developing your search

application and taking it to production.

Page 5: Rapid Prototyping with Solr

• An open source Java-based IR library with best practice indexing and query capabilities, fast and lightweight search and indexing.

• 100% Java (.NET, Perl and other versions too).

• Stable, mature API.

• Continuously improved and tuned over more than 10 years.

• Cleanly implemented, easy to embed in an application.

• Compact, portable index representation.

• Programmable text analyzers, spell checking and highlighting.

• Not a crawler or a text extraction tool.

What is Lucene?

Page 6: Rapid Prototyping with Solr

• Created by Doug Cutting in 1999

• built on ideas from search projects Doug created at Xerox PARC and Apple.

• Donated to the Apache Software Foundation (ASF) in 2001.

• Became an Apache top-level project in 2005.

• Has grown and morphed through the years and is now both:

• A search library.

• An ASF Top-Level Project (TLP) encompassing several sub-projects.

• Lucene and Solr "merged" development in early 2010.

Lucene's History

Page 7: Rapid Prototyping with Solr

• An open source search engine.

• Indexes content sources, processes query requests, returns search results.

• Uses Lucene as the "engine", but adds full enterprise search server features and capabilities.

• A web-based application that processes HTTP requests and returns HTTP responses.

• Initially started in 2004 and developed by CNET as an in-house project to add search capability for the company website.

• Donated to ASF in 2006.

What is Solr?

Page 8: Rapid Prototyping with Solr

• There’s more than one answer!

• The current, released, stable version is 3.3

• The development release is referred to as “trunk”.

• This is where the new, less tested work goes on

• Also referred to as 4.0

• LucidWorks Enterprise is built on a trunk snapshot + additional features.

What Version of Solr?

Page 9: Rapid Prototyping with Solr

Why prototype?

• Demonstrate Solr can handle your needs

• Stake/purse-holder buy-in

• It's quick, easy, and fun!

• The User Interface is the app

Page 10: Rapid Prototyping with Solr

Workflow

• Ingest data

• Use

• Refine config/interactions, repeat

Page 11: Rapid Prototyping with Solr

Got Data?

• Rich text files?

• Databases?

• Feeds (Atom/RSS/XML)?

• 3rd party repositories? (SharePoint, Documentum, ...)

• CSV!!!!

Page 12: Rapid Prototyping with Solr

User Interface

Page 13: Rapid Prototyping with Solr

• Download Solr

• http://lucene.apache.org/solr

• "Install" it

• unzip or tar -xvf

• Start it

• cd example; java -jar start.jar

Getting Started

Page 14: Rapid Prototyping with Solr

e.g. Conference Attendees

First Name,Last Name,Company,Title,Work Country

Erik,Hatcher,Lucid Imagination,"Member, Technical Staff", USA

.

.

.

Page 15: Rapid Prototyping with Solr

First Try

curl "http://localhost:8983/solr/update/csv?stream.file=attendees.csv"

undefined field First Name

Page 16: Rapid Prototyping with Solr

Dynamic Fields

<dynamicField name="*_s" type="string" indexed="true" stored="true"/><dynamicField name="*_t" type="text" indexed="true" stored="true"/>

Page 17: Rapid Prototyping with Solr

Second try

curl "http://localhost:8983/solr/update/csv? stream.file=attendees.csv&fieldnames=first_s,last_s,company_s,title_t,country_s&header=true"

Document [null] missing required field: id

Page 18: Rapid Prototyping with Solr

uniqueKey

• Optional, Solr-specific, feature

• generally "string" type

• schema.xml: <uniqueKey>id</uniqueKey>

• adds of existing id'd documents updates (delete + add)

Page 19: Rapid Prototyping with Solr

id

curl "http://localhost:8983/solr/update/csv?stream.file=attendees.csv

&fieldnames=first_s,id,company_s,title_t,co untry_s&header=true"

<?xml version="1.0" encoding="UTF-8"?><response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">40</int> </lst></response>

Page 20: Rapid Prototyping with Solr

Tada!

Page 21: Rapid Prototyping with Solr

Schema tinkering

• Removed all example field definitions

• Uncomment and adjust catch-all dynamic field:

• <dynamicField name="*" type="string" multiValued="false"/>

• Ensure uniqueKey is appropriate

• unusual in this example, disabled it

• Make every document/field fully searchable!

• <copyField source="*" dest="text"/>

Page 22: Rapid Prototyping with Solr

After adjusting config...

• Restart Solr

• Or... reload the core (when in multicore mode)

Page 23: Rapid Prototyping with Solr

Clean import

# Delete all documentscurl "http://localhost:8983/solr/update?stream.body=%3Cdelete%3E%3Cquery %3E*:*%3C/query%3E%3C/delete%3E&commit=true"

# Index your datacurl "http://localhost:8983/solr/update/csv?commit=true&stream.file=EuroCon2010.csv&fieldnames=first,last, company,title,country&header=true"

Page 24: Rapid Prototyping with Solr

Facetshttp://localhost:8983/solr/browse?facet.field=country

Page 25: Rapid Prototyping with Solr

Value Normalization

• http://localhost:8983/solr/update/csv? commit=true&stream.file=attendees.csv&fieldnames=first,last,company,title,country&header=true&f.country.map=Great +Britain:United+Kingdom

Page 26: Rapid Prototyping with Solr

Polishing

• Customize request handler mappings

• Edit templates

• hit display

• header/footer

• style

Page 27: Rapid Prototyping with Solr

/browse<requestHandler name="/browse" class="solr.SearchHandler"> <lst name="defaults"> <str name="wt">velocity</str><str name="v.layout">layout</str> <str name="v.template">browse</str>

<str name="rows">10</str><str name="fl">*,score</str>

<str name="defType">lucene</str><str name="q">*:*</str> <str name="debugQuery">true</str> <str name="hl">on</str><str name="hl.fl">title</str> <str name="hl.fragsize">0</str> <str name="hl.alternateField">title</str> <str name="facet">on</str> <str name="facet.mincount">1</str> <str name="facet.missing">true</str> </lst> <lst name="appends"><str name="facet.field">country</str></lst></requestHandler>

Page 28: Rapid Prototyping with Solr

hit.vm

<div class="result-document"> <p>$doc.getFieldValue('first') $doc.getFieldValue('last')</p> <p>$!doc.getFieldValue('title'), $!doc.getFieldValue('company')</p> <p>$!doc.getFieldValue('country')</p></div>

Page 29: Rapid Prototyping with Solr

Voila!

Page 30: Rapid Prototyping with Solr

Adding bells and whistles

• jQuery

• <script type="text/javascript" src="/solr/admin/jquery.js"/>

• TreeMap

• <script type="text/javascript" src="/scripts/treemap.js"/>

Page 31: Rapid Prototyping with Solr

TreeMap code<script type="text/javascript"> function onLoad() { jQuery("#treemap-country").treemap(640,480, {}); }</script>----------------------------<body onload="onLoad();">----------------------------<table id="treemap-country">#foreach($facet in $response.getFacetField('country').values) <tr> <td>#if($facet.name)$esc.html($facet.name)#else&lt;Unspecified&gt;#end</td> <td>$facet.count</td> <td>#if($facet.name)$esc.html($facet.name)#{else}Unspecified#end</td> </tr>#end</table>

Page 32: Rapid Prototyping with Solr

TreeMap

Page 33: Rapid Prototyping with Solr

Ajax fun: giveaways

• Add a "static" templated page

• jQuery Ajax request

• snippet templated output

Page 34: Rapid Prototyping with Solr

"static" pagesolrconfig.xml<requestHandler name="/giveaways" class="solr.DumpRequestHandler"> <lst name="defaults"> <str name="wt">velocity</str> <str name="v.template">giveaways</str> <str name="v.layout">layout</str> </lst></requestHandler>

giveaways.vm<input type="button" value="Pick a Winner" onClick="javascript:$ ('#winner').load('/solr/generate_winner?sort=random_' + new Date().getTime() + '+asc');"> <h2>And the winner is...</h2> <center><font size="20"><div id="winner"></div></font></center>

Page 35: Rapid Prototyping with Solr

fragment templatesolrconfig.xml<requestHandler name="/generate_winner" class="solr.SearchHandler"> <!-- sort=random_... required --> <lst name="defaults"> <str name="wt">velocity</str> <str name="v.template">winner</str> <str name="rows">1</str> <str name="fl">first,last</str> <str name="defType">lucene</str> <str name="q">*:* -company:"Lucid Imagination" -company:"Stone Circle Productions"</str> </lst></requestHandler>

winner.vm#set($winner=$response.results.get(0))$winner.getFieldValue('first') $winner.getFieldValue('last')

Page 36: Rapid Prototyping with Solr

And the winner is...

Page 37: Rapid Prototyping with Solr

e.g. data.gov

Page 38: Rapid Prototyping with Solr

Data.gov CSV catalogURL,Title,Agency,Subagency,Category,Date Released,Date Updated,Time Period,Frequency,Description,Data.gov Data Category Type,Specialized Data Category Designation,Keywords,Citation,Agency Program Page,Agency Data Series Page,Unit of Analysis,Granularity,Geographic Coverage,Collection Mode,Data Collection Instrument,Data Dictionary/Variable List,Applicable Agency Information Quality Guideline Designation,Data Quality Certification,Privacy and Confidentiality,Technical Documentation,Additional Metadata,FGDC Compliance (Geospatial Only),Statistical Methodology,Sampling,Estimation,Weighting,Disclosure Avoidance,Questionnaire Design,Series Breaks,Non-response Adjustment,Seasonal Adjustment,Statistical Characteristics,Feeds Access Point,Feeds File Size,XML Access Point,XML File Size,CSV/TXT Access Point,CSV/TXT File Size,XLS Access Point,XLS File Size,KML/KMZ Access Point,KML File Size,ESRI Access Point,ESRI File Size,Map Access Point,Data Extraction Access Point,Widget Access Point"http://www.data.gov/details/4","Next Generation Radar (NEXRAD) Locations","Department of Commerce","National Oceanic and Atmospheric Administration","Geography and Environment","1991","Irregular as needed","1991 to present","Between 4 and 10 minutes","This geospatial rendering of weather radar sites gives access to an historical archive of Terminal Doppler Weather Radar data and is used primarily for research purposes. The archived data includes base data and derived products of the National Weather Service (NWS) Weather Surveillance Radar 88 Doppler (WSR-88D) next generation (NEXRAD) weather radar. Weather radar detects the three meteorological base data quantities: reflectivity, mean radial velocity, and spectrum width. From these quantities, computer processing generates numerous meteorological analysis products for forecasts, archiving and dissemination. There are 159 operational NEXRAD radar systems deployed throughout the United States and at selected overseas locations. At the Radar Operations Center (ROC) in Norman OK, personnel from the NWS, Air Force, Navy, and FAA use this distributed weather radar system to collect the data needed to warn of impending severe weather and possible flash floods; support air traffic safety and assist in the management of air traffic flow control; facilitate resource protection at military bases; and optimize the management of water, agriculture, forest, and snow removal. This data set is jointly owned by the National Oceanic and Atmospheric Administration, Federal Aviation Administration, and Department of Defense.","Raw Data Catalog",...

Page 39: Rapid Prototyping with Solr
Page 40: Rapid Prototyping with Solr

Debugginghttp://localhost:8983/solr/data.gov?q=searching&debugQuery=true

Page 41: Rapid Prototyping with Solr

Mapping field values

• CSV update handler can map field values

• &f.privacy_and_confidentiality.map=YES:Yes&f.data_quality_certification.map=YES:Yes

Page 42: Rapid Prototyping with Solr

Splitting keywords• CSV handler: f.keywords.split=true

• stored values are split, multivalued

• Or via schema

• Stored value remains as in original, single valued

<fieldType name="comma_separated" class="solr.TextField" omitNorms="true"> <analyzer> <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/> </analyzer></fieldType>...<field name="keywords" type="comma_separated" indexed="true" stored="true"/>

Page 43: Rapid Prototyping with Solr

Suggest• Suggest terms as user types in search box

• Technique: jQuery autocomplete, Solr’s TermsComponent, Velocity template

http://localhost:8983/solr/terms?terms.fl=suggest&terms.prefix=sola&terms.sort=count&wt=velocity&v.template=suggest

#foreach($t in $response.response.terms.suggest)$t.key#end

Page 44: Rapid Prototyping with Solr

Suggest schema<fieldType name="suggestable" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> </analyzer></fieldType>

...

<field name="suggest" type="suggestable" indexed="true" stored="false" multiValued="true"/>

Page 45: Rapid Prototyping with Solr

Custom pages

• Document detail page

• Multiple query intersection comparison with Venn visualization

Page 46: Rapid Prototyping with Solr

Document detailhttp://localhost:8983/solr/data.gov/document?id=http%3A%2F%2Fwww.data.gov%2Fdetails%2F61

Page 47: Rapid Prototyping with Solr

Document detail detailsolrconfig.xml<requestHandler name="/data.gov/document" class="solr.SearchHandler"> <lst name="defaults"> <str name="wt">velocity</str> <str name="v.template">document</str> <str name="v.layout">layout</str> <str name="title">Data.gov data set</str> <str name="q">{!raw f=id v=$id}</str> </lst></requestHandler>

document.vm#set($doc= $response.results.get(0))<span><a href="$doc.getFieldValue('id')">$doc.getFieldValue('id')</a></span>

<table>#foreach($fieldname in $doc.fieldNames) <tr> <td>$fieldname:</td> <td> #foreach($value in $doc.getFieldValues($fieldname)) $esc.html($value) #end </td> </tr>#end</table>

Page 48: Rapid Prototyping with Solr

Query intersection

• Just showing off.... how easy it is to do something with a bit of visual impact

• Compare three independent queries, intersecting them in a Venn diagram visualization

Page 49: Rapid Prototyping with Solr
Page 50: Rapid Prototyping with Solr

Compare static pagesolrconfig.xml<requestHandler name="/data.gov/compare" class="solr.DumpRequestHandler"> <lst name="defaults"> <str name="wt">velocity</str> <str name="v.template">compare</str> <str name="v.layout">layout</str> <str name="title">Data.gov Query Comparison</str> </lst></requestHandler> compare.vm

<script type="text/javascript"> function generate_venn() { var a=encodeURIComponent($("#a").val()); var b=encodeURIComponent($("#b").val()); var c=encodeURIComponent($("#c").val()); var ab='('+a+')+AND+('+b+')'; var ac='('+a+')+AND+('+c+')'; var bc='('+b+')+AND+('+c+')'; var abc='('+a+')+AND+('+b+')+AND+('+c+')';

$('#venn').load('/solr/select?q=*:*&wt=velocity&v.template=venn&rows=0&facet=on&facet.query={!key=a}'+a+'&facet.query={!key=b}'+b+'&facet.query={!key=c}'+c+'&facet.query={!key=intersect_ab}'+ab+'&facet.query={!key=intersect_ac}'+ac+'&facet.query={!key=intersect_bc}'+bc+'&facet.query={!key=intersect_abc}'+abc+'&q_a='+a+'&q_b='+b+'&q_c='+c+'&q_ab='+ab+'&q_ac='+ac+'&q_bc='+bc+'&q_abc='+abc); return false; }</script>

<form action="#" id="compare_form" onsubmit="return generate_venn()"> A: <input type="text" name="a" id="a" value="health"/> B: <input type="text" name="b" id="b" value="weather"/> C: <input type="text" name="c" id="c" value="ozone"/> <input type="submit"/></form>

<div id="venn"></div>

Page 51: Rapid Prototyping with Solr

Venn chartvenn.vm#set($values = $response.response.facet_counts.facet_queries)#set($params = $response.responseHeader.params)

<img src="http://chart.apis.google.com/chart?chs=600x400&cht=v&chd=t:$values.a,$values.b,$values.c,$values.intersect_ab,$values.intersect_ac,$values.intersect_bc,$values.intersect_abc&chdl=$esc.url($params.q_a)|$esc.url($params.q_b)|$esc.url($params.q_c)"/>

<ul> <li>A: <a href="/solr/data.gov?q={!lucene}$params.q_a">$params.q_a</a> ($values.a)</li> <li>B: <a href="/solr/data.gov?q={!lucene}$params.q_b">$params.q_b</a> ($values.b)</li> <li>C: <a href="/solr/data.gov?q={!lucene}$params.q_c">$params.q_c</a> ($values.c)</li> <li>A&B: <a href="/solr/data.gov?q={!lucene}$params.q_ab">$params.q_ab</a> ($values.intersect_ab)</li> <li>A&C: <a href="/solr/data.gov?q={!lucene}$params.q_ac">$params.q_ac</a> ($values.intersect_ac)</li> <li>B&C: <a href="/solr/data.gov?q={!lucene}$params.q_bc">$params.q_bc</a> ($values.intersect_bc)</li> <li>A&B&C: <a href="/solr/data.gov?q={!lucene}$params.q_abc">$params.q_abc</a> ($values.intersect_abc)</li></ul>

Page 52: Rapid Prototyping with Solr

Solritas

• Pronounced: so-LAIR-uh-toss

• Celeritas is a Latin word, translated as "swiftness" or "speed". It is often given as the origin of the symbol c, the universal notation for the speed of light - http://en.wikipedia.org/wiki/Celeritas

• VelocityResponseWriter - simply passes the Solr response through the Apache Velocity templating engine

• http://wiki.apache.org/solr/VelocityResponseWriter

Page 53: Rapid Prototyping with Solr

Solr Flare

• Ruby on Rails plugin

• facet field detection, autosuggest, saved search, inverted facets, pie charts, Simile Timeline and Exhibit integration

• Useful for rapid prototyping

• See Flare's big brother, Blacklight, for production quality

Page 54: Rapid Prototyping with Solr

Tang on Flare

Page 55: Rapid Prototyping with Solr

• UVA radiation = blacklight

• libraries are much more than books

• opinionated

• Ruby on Rails: best choice for an extensible user interface development framework

Page 56: Rapid Prototyping with Solr

Blacklight @ UVa

Page 57: Rapid Prototyping with Solr

Blacklight @ Stanford

Page 58: Rapid Prototyping with Solr

Blacklight @ AgNIC

Page 59: Rapid Prototyping with Solr

Prototyping Tools

• CSV update handler - /update/csv

• Schema Browser

• Solritas, Flare, Blacklight, or...

• just HTML+JavaScript (wt=json)

Page 60: Rapid Prototyping with Solr

Test

• Performance

• Scalability

• Relevance

• Automate all of the above, start baselines early, avoid regressions

Page 61: Rapid Prototyping with Solr

Then what?• Script the indexing process: full & delta

• Work with real users on actual needs

• Integrate with production systems

• Iterate on schema enhancements, configuration tweaks such as caching

• Deploy to staging/production environments and work at scale: collection size, real queries and performance, hardware and JVM settings

Page 62: Rapid Prototyping with Solr

LucidFind

http://www.lucidimagination.com/search/?q=user+interface

Page 63: Rapid Prototyping with Solr

For more information...• http://www.lucidimagination.com

• LucidFind

• search Lucene ecosystem: mailing lists, wikis, JIRA, etc

• http://search.lucidimagination.com

• Getting started with LucidWorks Enterprise:

• http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise

• http://lucene.apache.org/solr - wiki, e-mail lists

Page 64: Rapid Prototyping with Solr

Thank You!