Covers the new Apache Lucene 4 spatial module. Includes Solr usage info. Applicable to ElasticSearch too. Presented at the 2012 Open Source Search in Government conference by Basis Technology.
© 2012 The MITRE Corporation. All rights reserved.
LUCENE 4 SPATIAL
2012 Basis Technology Open Source Search Conference
Presented by David Smiley, MITRE

About David Smiley
• Working at MITRE for 12 years
  • web development, Java, search
  • 3 Solr apps, 1 Endeca
• Published 1st book on Solr; then 2nd edition (2009, 2011)
• Apache Lucene / Solr committer (2012)
  • Specializing in spatial
• Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011)
• Taught Solr classes at MITRE (2010, 2011, 2012)
• Solr search consultant within MITRE and its sponsors, and privately via OpenSource Connections
What is Spatial Search?
Primary features:
• Spatial filter query
• Spatial distance sorting
• Spatial distance relevancy (i.e. spatial query score)
NOT “geocoding”, which resolves a place name such as “Boston” to its latitude and longitude
Typical use-case:
1. Index a location for each Lucene document given a latitude & longitude
2. Then search for matching documents by a circle (point-radius) or bounding box
3. Then sort results by distance
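The three steps above can be sketched without any search library. This is only an illustration of the point-radius filter and distance sort, not how Lucene implements them; the sample documents, the function names, and the Earth-radius constant are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def search_by_circle(docs, center_lat, center_lon, radius_km):
    """Keep documents inside the circle, then return their ids nearest-first."""
    hits = []
    for doc_id, (lat, lon) in docs.items():
        distance = haversine_km(center_lat, center_lon, lat, lon)
        if distance <= radius_km:
            hits.append((distance, doc_id))
    hits.sort()
    return [doc_id for _, doc_id in hits]


# Step 1: "index" one location per document (hypothetical sample data).
docs = {
    "new_york": (40.7128, -74.0060),
    "cambridge": (42.3736, -71.1097),
    "boston": (42.3601, -71.0589),
}

# Steps 2 and 3: point-radius filter around Boston, then sort by distance.
results = search_by_circle(docs, 42.3601, -71.0589, 50.0)
```

A real index avoids computing the distance for every document, which is what the strategies on the later slides are about.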

History of Spatial for Lucene & Solr
• 2007: Local-Lucene
  • by Patrick O’Leary (AOL)
• 2009-09: LL -> Lucene spatial contrib in Lucene 2.9.0
  • Local-Lucene graduates to an official Lucene contrib module
• 2009-12: Spatial Search Plugin (SSP) for Solr
  • by Chris Male (JTeam -> Orange11, ElasticSearch)
• 2010-10: SOLR-2155, a geohash prefix tree filter
  • by David Smiley (MITRE)
• 2011-01: Lucene Spatial Playground (LSP)
  • by Ryan McKinley (Voyager GIS), David, and Chris
• 2011-03: Solr 3.1 new spatial features
  • by Grant Ingersoll and Yonik Seeley (LucidWorks)
• 2012-03: LSP -> Lucene 4 spatial module + Spatial4j
  • replaces former Lucene spatial contrib module

Lucene Spatial Committers
• David Smiley, MITRE
  • Bedford, MA
• Chris Male, ElasticSearch
  • New Zealand
• Ryan McKinley, Voyager GIS
  • Oakland, CA

Breakdown of Spatial Components
• Spatial4j: 43%
• Lucene spatial: 36%
• Solr adapters: 6%
• Misc: 16%
Total: 4,781 Non-Comment Source Statements (without javadocs or tests)

Spatial4j: It’s all about the shapes
• Shapes
  • Types: Point, Rectangle, Circle, Polygon
  • Geospatial & Euclidean/2D implementations
  • Intersection: within, contains, intersects, disjoint
• Distance and area math utilities
• Input/Output serialization to Well Known Text (WKT)
  • Ex: POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))
• ASL licensed project independent of Apache on GitHub
• Requires JTS (3rd party LGPL) for polygon & WKT support
• Ported to .NET as Spatial4n and used by RavenDB
  • by Itamar Syn-Hershko
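To make the four intersection outcomes listed above concrete, here is a minimal sketch for axis-aligned rectangles in the Euclidean/2D case. It is not the Spatial4j API: the `Rect` type and `relate` function are hypothetical names, and geospatial concerns such as dateline wrapping are ignored.

```python
from collections import namedtuple

# Hypothetical rectangle type for this sketch; not a Spatial4j class.
Rect = namedtuple("Rect", "min_x min_y max_x max_y")


def relate(a, b):
    """Describe how rectangle a relates to rectangle b.

    Returns one of "disjoint", "within", "contains" or "intersects".
    Identical rectangles are reported as "within" because that test runs first.
    """
    if a.max_x < b.min_x or a.min_x > b.max_x or a.max_y < b.min_y or a.min_y > b.max_y:
        return "disjoint"
    if a.min_x >= b.min_x and a.max_x <= b.max_x and a.min_y >= b.min_y and a.max_y <= b.max_y:
        return "within"
    if b.min_x >= a.min_x and b.max_x <= a.max_x and b.min_y >= a.min_y and b.max_y <= a.max_y:
        return "contains"
    return "intersects"


small = Rect(1, 1, 2, 2)
large = Rect(0, 0, 5, 5)
shifted = Rect(4, 4, 8, 8)
far = Rect(10, 10, 11, 11)
```

For example, `relate(small, large)` yields "within" while `relate(large, small)` yields "contains"; the two relations are mirror images of each other.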

Lucene 4 Spatial Module
• There isn’t one best way to implement spatial indexing for all use-cases
  • Index just points, or other shapes too? Which?
  • Multiple shapes per field?
  • Query by Intersection? Contains? Within? Equals? Disjoint? …
  • Distance sorting? Query boost by distance?
    • Or more exotic shape relevancy like overlap percentage?
  • Trade off shape precision for speed?
• Multiple SpatialStrategy implementations:
  • RecursivePrefixTreeStrategy and TermQueryPrefixTreeStrategy
  • PointVectorStrategy
  • BBoxStrategy (currently in trunk, not 4x)
  • JtsGeoStrategy (in Spatial4j/LSP)
Names subject to change!

Strategy: PointVector
• Similar to Solr’s PointType / LatLonType
  • X & Y trie double fields; caching via FieldCache
• Characteristics
  • Indexes points (only)
  • Single-valued field (no multi)
  • Query by rectangle or circle (only)
    • Circle uses FieldCache (requires memory)
    • Circle does bbox pre-filter for performance
  • Relations: Intersects, Within (only)
  • Exact precision for x & y coordinates and query shape
  • Distance sort
    • Uses FieldCache (requires memory)
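The "bbox pre-filter" idea for circle queries can be sketched as follows: a cheap rectangle test rejects most points before the exact distance check runs. This is a plain-Python illustration in Euclidean 2D, not the PointVectorStrategy code; the function name and sample points are assumptions, and in a real index the rectangle test would presumably be range queries on the X and Y fields rather than a loop.

```python
import math


def circle_query(points, cx, cy, radius):
    """Return ids of points within `radius` of (cx, cy), using a bbox pre-filter."""
    # Bounding box of the query circle.
    min_x, max_x = cx - radius, cx + radius
    min_y, max_y = cy - radius, cy + radius
    matches = []
    for doc_id, (x, y) in points.items():
        # Cheap rejection: anything outside the box cannot be inside the circle.
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue
        # Exact check, needed for points in the box corners outside the circle.
        if math.hypot(x - cx, y - cy) <= radius:
            matches.append(doc_id)
    return sorted(matches)


# Hypothetical sample data: "corner" passes the bbox test but fails the circle test.
points = {
    "origin": (0.0, 0.0),
    "inside": (0.5, 0.5),
    "corner": (0.9, 0.9),
    "far": (5.0, 5.0),
}
matched = circle_query(points, 0.0, 0.0, 1.0)
```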

Strategy: RecursivePrefixTree
• Grid / Tile / Trie / Prefix-Tree based
  • With recursive descent algorithm
  • Or TermQueryPrefixTree alternative
• Choose Geohash (geo only) or Quad tree
• The most mature strategy to date
• The current evolution of SOLR-2155
Potential rename to GridFilterSpatialStrategy
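To show why a geohash works as a prefix tree, the sketch below encodes a point with the standard public geohash algorithm and lists the prefixes of the result; each shorter prefix names a larger enclosing grid cell. It does not reproduce the Lucene module's code, and `prefix_terms` is a hypothetical helper showing one way such cells could be turned into index terms.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, length):
    """Encode a lat/lon point (degrees) as a geohash of `length` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def prefix_terms(geohash):
    """All prefixes of a geohash, from the coarsest cell to the finest."""
    return [geohash[:i] for i in range(1, len(geohash) + 1)]


cell = geohash_encode(57.64911, 10.40744, 5)
terms = prefix_terms(cell)
```

Two points that share a geohash prefix fall in the same grid cell at that level, so a "cell contains point" test reduces to a string-prefix or term match.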

Strategy: RecursivePrefixTree
• Characteristics:
  • Indexes all shapes
  • Variable precision of shape edges
    • Highly precise shapes other than point won’t scale
    • LineStrings possibly not precise enough for your needs
  • Multi-valued field support
  • Query by any shape
    • Variable precision for query shape
    • Highest precision usually scales
  • Relations: Intersects (only)
  • Distance sort (w/ multi-value support)
    • Warning: immature, won’t scale
    • Uses significant amounts of memory
  • Fast spatial filtering; no cache needed
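The "variable precision" trade-off can be illustrated with a recursive descent over a quad tree: cells disjoint from the query are pruned, cells fully inside it are emitted as-is, and boundary cells are subdivided until a maximum level is reached. This is a simplified sketch of the general technique, not the RecursivePrefixTreeStrategy code; the unit-square world, the A/B/C/D child labels, and the function name are assumptions.

```python
def cover(query, max_level, cell=(0.0, 0.0, 1.0, 1.0), token="", level=0):
    """Return quad-tree cell tokens that approximate a query rectangle.

    `query` and `cell` are (min_x, min_y, max_x, max_y) tuples. Cells that only
    touch the query along an edge are treated as disjoint.
    """
    qx0, qy0, qx1, qy1 = query
    cx0, cy0, cx1, cy1 = cell
    # Prune: this cell does not overlap the query at all.
    if cx1 <= qx0 or cx0 >= qx1 or cy1 <= qy0 or cy0 >= qy1:
        return []
    within = qx0 <= cx0 and cx1 <= qx1 and qy0 <= cy0 and cy1 <= qy1
    # Emit the cell if it lies entirely inside the query, or if the precision
    # limit is reached (the cell then over-approximates the query edge).
    if level > 0 and (within or level >= max_level):
        return [token]
    mid_x = (cx0 + cx1) / 2
    mid_y = (cy0 + cy1) / 2
    children = [
        ("A", (cx0, mid_y, mid_x, cy1)),  # top-left
        ("B", (mid_x, mid_y, cx1, cy1)),  # top-right
        ("C", (cx0, cy0, mid_x, mid_y)),  # bottom-left
        ("D", (mid_x, cy0, cx1, mid_y)),  # bottom-right
    ]
    tokens = []
    for label, child in children:
        tokens.extend(cover(query, max_level, child, token + label, level + 1))
    return tokens


coarse = cover((0.0, 0.0, 0.3, 0.3), max_level=1)
exact = cover((0.0, 0.0, 0.75, 0.5), max_level=2)
```

A lower `max_level` yields fewer, larger cells (faster, less precise); a higher one yields more, smaller cells along the shape's edge.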

Strategy: BBox
• Implemented with 4 doubles & 1 boolean
• Ported from ESRI Open Source GeoPortal
• Characteristics:
  • Indexes rectangles (only)
  • Single-valued field (no multi)
  • Query by rectangle (only)
  • Supports all relations: Intersects, Within, Contains, …
  • Distance sort from box center
    • Uses FieldCache (requires memory)
  • Area overlap sorting
    • Sort results by percentage overlap between query and indexed boxes
    • Uses FieldCache (requires memory)
  • Note: FieldCache needs are somewhat high
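Area overlap sorting can be sketched as below. The slide does not give the exact formula, so this example assumes "percentage overlap" means the intersection area divided by the query box's area; the real BBoxStrategy may weigh things differently. Function names and sample boxes are hypothetical.

```python
def overlap_fraction(query, indexed):
    """Fraction (0.0 to 1.0) of the query box's area covered by the indexed box.

    Boxes are (min_x, min_y, max_x, max_y) tuples. This definition is an
    assumption for the sketch, not the documented BBoxStrategy formula.
    """
    width = min(query[2], indexed[2]) - max(query[0], indexed[0])
    height = min(query[3], indexed[3]) - max(query[1], indexed[1])
    if width <= 0 or height <= 0:
        return 0.0  # boxes do not overlap
    query_area = (query[2] - query[0]) * (query[3] - query[1])
    if query_area <= 0:
        return 0.0  # degenerate query box
    return (width * height) / query_area


def sort_by_overlap(query, boxes):
    """Return document ids ordered by descending overlap with the query box."""
    return sorted(boxes, key=lambda doc_id: overlap_fraction(query, boxes[doc_id]), reverse=True)


query_box = (0.0, 0.0, 10.0, 10.0)
boxes = {
    "none": (20.0, 20.0, 30.0, 30.0),
    "half": (5.0, 0.0, 15.0, 10.0),
    "full": (0.0, 0.0, 10.0, 10.0),
}
ranking = sort_by_overlap(query_box, boxes)
```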

Strategy: JtsGeoStrategy
• Stores any JTS geometry in Lucene 4’s DocValues
  • Stores WKB, the binary counterpart of WKT
  • Full vector geometry is retained for search
• DocValues is mostly a better FieldCache
  • Faster loading into memory
  • Can be disk resident or in memory
• Characteristics:
  • Indexes any shape
  • Single-valued field but can be MultiPoint, MultiPolygon, etc.
  • Query by any shape
    • Uses DocValues (memory use optional)
  • Supports all relations: intersect, within, contains, …
  • No sorting
  • Experimental / immature status

Solr Adapters
• Configuration:
    <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
               spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
               distErrPct="0.025" maxDistErr="0.000009" />
    <field name="geo" type="geo" indexed="true" stored="true" multiValued="true" />
• Adding data:
    <field name="geo">43.17614,-90.57341</field>
    <field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
• Search Filter
    fq=geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
• Distance Sort
    sort=query($sortsq) asc&sortsq={! score=distance v=$sq}&sq=store:"Intersects(Circle(54.729696,-98.525391 d=10))"
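When the Search Filter parameter above is sent over HTTP, its quotes, spaces and parentheses need URL encoding. The sketch below builds the filter string and encodes it with the Python standard library only. The helper name and the match-all `q=*:*` are assumptions for illustration; the filter syntax itself is copied from the slide.

```python
from urllib.parse import urlencode, parse_qs


def build_filter_params(field, lat, lon, distance):
    """Build Solr request parameters for a circle filter on a spatial field.

    The Intersects(Circle(...)) syntax follows the slide; q=*:* (match all
    documents) is an assumption added so the request is complete.
    """
    shape = f"Intersects(Circle({lat},{lon} d={distance}))"
    return {"q": "*:*", "fq": f'{field}:"{shape}"'}


params = build_filter_params("geo", 54.729696, -98.525391, 10)

# URL-encode for use after the "?" of a Solr select request.
query_string = urlencode(params)
```

The encoded string can be appended to whatever select URL the Solr instance exposes; decoding it with `parse_qs` returns the original filter unchanged.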

Future Possibilities
• Solr:
  • Filter out points in multi-valued field from search results not matching filter
  • Heatmap/grid faceting spatial summarization
• Spatial-Temporal search
  • 3d (x,y,t) point shapes, and “track” shape queries
• Support any query shape for all Strategies
• PrefixTreeStrategy:
  • More efficient binary grid encoding; use Hilbert Curve order
  • Better multi-value point caches
  • Cache-less sort of top-N results
  • More query relations: Contains, Within
• Configurable DocValues vs. FieldCache choice
• Choose floats or configurable bits instead of forcing doubles
• CircleStrategy

Thank you!
• References
  • Lucene 4 spatial javadocs
    • https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/spatial/
  • Spatial4j at GitHub
    • https://github.com/spatial4j/spatial4j (spatial4j.com redirect)
    • http://spatial4j.16575.n6.nabble.com
  • Solr
    • http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
• Contact me:
  • David Smiley