50
About Solr People as A Search Problem Thursday, May 26, 2011

Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Embed Size (px)

DESCRIPTION

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011 Search oriented architectures are obvious approaches for web pages, emails, documents, and other text based entities. Often with traditional structured data, text searching is “added on” to the traditional Boolean queries in relational stores. When Jazzed was initiated we wanted search to be front and center. When we evaluated Solr we realized we could take the opposite approach “add on” Boolean components to textual searches. This hybrid query approach makes transitioning to flexible ranking easy and straightforward.

Citation preview

Page 1: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

About SolrPeople as A Search Problem

Thursday, May 26, 2011

Page 2: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

About Me

• Building websites since 1996, Java since 1997

• Prior web search experience• Building and scaling eHarmony

products since 2002

Thursday, May 26, 2011

Page 3: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

What is Jazzed

• Subscription Based Dating Site

• Incubated by eHarmony

Thursday, May 26, 2011

Page 4: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

What is Jazzed

• Create a profile• Search for others• View their photos• Privately

Communicate

Thursday, May 26, 2011

Page 5: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

What is Jazzed

• Create a profile• Search for others• View their photos• Privately

Communicate

Thursday, May 26, 2011

Page 6: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

What is Jazzed

• Create a profile• Search for others• View their photos• Privately

Communicate

Thursday, May 26, 2011

Page 7: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

What is Jazzed

• Create a profile• Search for others• View their photos• Privately

Communicate

Thursday, May 26, 2011

Page 8: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

How is it different?

• Covers broader range of relationships• Easy to get started• Real profiles screened by machine and

humans• Fast, effective search oriented tools

Thursday, May 26, 2011

Page 9: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Jazzed Stats

• Started Fall 2009• Beta Summer 2010• Launched October 2010• 100,000s of Profiles• 1,000s of Searches Daily

Thursday, May 26, 2011

Page 10: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Jazzed Architecture

• Event-driven SOA• REST, JSON, EIP, Not-only-SQL• Technology incubation

Thursday, May 26, 2011

Page 11: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Tech Stack

• Java 6, Spring 3, Jersey 1.1, JMS (AQMP)

• RHEL 4, Oracle 11g, Voldemort 0.81, Solr 1.4.1, NFS

Thursday, May 26, 2011

Page 12: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Thursday, May 26, 2011

Page 13: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Thursday, May 26, 2011

Page 14: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Not Covered

• Distributed Search• Caching Strategies• Data Import• Analyzers/Tokenizers

Thursday, May 26, 2011

Page 15: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Why Lucene?

• Proven Solid IR library• Prefer Open Source Solutions• Not Only SQL• Flexible Ranking • Pluggable

Thursday, May 26, 2011

Page 16: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Why Solr

• Performant, Extensible, RESTful Service• Configuration, Schema, Multicores• Admin Interface• Replication, Backups, Monitoring

Thursday, May 26, 2011

Page 17: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Open Source

• Strengthens Engineering Team• Be apart of great community• Not Brochure-ware

Thursday, May 26, 2011

Page 18: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Not Only SQL

• One solution does not fit all• Prefer availability over consistency• Horizontal Scaling over Vertical

Thursday, May 26, 2011

Page 19: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Flexible Ranking

• Query Strategies• Boolean Algebra• Vector Space Analysis• Hybrids

• Extensive Function Support• Index and Query Boosting

Thursday, May 26, 2011

Page 20: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

...Oh My!

• Standard Plugins - Geospatial*, Faceting, Spelling, MoreLikeThis

• Full Text with Highlighted Results• Client agnostic

Thursday, May 26, 2011

Page 21: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Inevitable Question

• “Does it scale?”• Solr POC Benchmark

• 10 Million profiles• >200 queries/sec under 100ms 90th• Default tuning until 5 million profiles

Thursday, May 26, 2011

Page 22: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Profile Service

• RESTful Hybrid Data Service• Public, Private, Attributes• Event Producer

Thursday, May 26, 2011

Page 23: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Profiles

• Mostly structured• Categories - Eye Color, Desired

Ethnicity• Dates - Birthdate• Numbers - Coordinates, Age Range• Text -Name, Headline

Thursday, May 26, 2011

Page 24: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Inverting People

• Stored as an inverted index

• Index random accessed by term

Term DocumentMALE 1, 3, 5, 7, 9

FEMALE 2, 4, 6, 8, 10HAIR_RED 8

HAIR_BLOND 1, 2, 5, 6EYE_BLUE 1, 2, 3, 10

EYE_BROWN 4, 5, 6, 7, 8, 9fun 1, 3, 7, 9

funny 2, 4, 6, 10beach 1, 2, 3, 4, 5, 6, 7, 8

Thursday, May 26, 2011

Page 25: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Schema Design

• Single “Table”• One-to-many = multi-value fields• Individual vs Composite Fields

• copyTo and have both!

Thursday, May 26, 2011

Page 26: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Field considerations

• Stored or not• Indexed or not• Multivalued - desires fields• Type

Thursday, May 26, 2011

Page 27: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Solr Types Used

• tdate, tint, tfloat* - birthdate, loginAt• text - all text• string - id, non indexed text• random - good for random sorts• enum - for all enumerations

The ‘t’ is for Trie

Thursday, May 26, 2011

Page 28: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Data Duplication

• By function - numberPhotos & hasPhotos

• By relationship - hiddenBy & hidden• By analysis - name & text

Thursday, May 26, 2011

Page 29: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Saving Profiles

• Updating is in memory operation• No partial updates• Commit means flush index changes• Autocommit on maxDocs, maxTime or

both

Thursday, May 26, 2011

Page 30: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Why Also Voldemort

• Private profiles can not be stale• Many fields not searchable or viewable

by others• Isolate queries from fetch by id

Thursday, May 26, 2011

Page 31: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Querying

• Superset of Lucene• Efficient Range Queries• Multiple Query Handlers

• Dismax, Boost, Geo

Thursday, May 26, 2011

Page 32: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Recall vs Precision

• Focus on recall when corpus is small• Precision once it is at critical mass

Thursday, May 26, 2011

Page 33: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Boolean Queries

• Default operator set to AND• +gender:FEMALE +seeking:MALE

+eyeColor:EYE_BLUE +hairColor:(HAIR_RED, HAIR_BLONDE)

• Sort order is important

Thursday, May 26, 2011

Page 34: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Hybrid Queries

• Default operator set to OR• +gender:FEMALE +seeking:MALE

eyeColor:EYE_BLUE hairColor:(HAIR_RED, HAIR_BLONDE)

Thursday, May 26, 2011

Page 35: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Why you’re lucky if you like redheads

• Inverse Document Frequency (IDF)

• Rarer is favored over more common

• More fields matched = higher ranking

1.Blue eyed, redheads2.Blue eyed, blonds3.Redheads4.Blonds

Thursday, May 26, 2011

Page 36: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Boosting

• Query time by importance• eyeColor:EYE_BLUE^2

hairColor:HAIR_BLOND

Thursday, May 26, 2011

Page 37: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Filter Fields

• Useful for roles and other lists

• -hidden:(2 4 6)

id hidden

1 2, 4, 6

2 1

Thursday, May 26, 2011

Page 38: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Filter Fields

• Useful for roles and other lists

• -hidden:(2 4 6)• -hiddenBy:1

id hidden

1 2, 4, 6

2 1

id hiddenBy1 22 14 16 1

Thursday, May 26, 2011

Page 39: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Date Math

• Simplifies query preprocessing• +birthDate:[NOW/DAY+1DAY-36YEAR

TO NOW/DAY-25YEAR]

Thursday, May 26, 2011

Page 40: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Date Math

• Simplifies query preprocessing• +birthDate:[NOW/DAY+1DAY-36YEAR

TO NOW/DAY-25YEAR]

Between 25 and 35 years old

Thursday, May 26, 2011

Page 41: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Distance Searching

• lat, lon, distance• SolrLocal by Patrick O’Leary• Additional overhead ~90ms per query• Superceded in Solr 3.1

Thursday, May 26, 2011

Page 42: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Testing Queries

• Log queries and ids returned• Version your search strategies• Improve one thing at a time

Thursday, May 26, 2011

Page 43: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Geo Service

• Read-mostly service• Fields - Postal Code, Country,

State, Cities, Lat, Lon• Usage - Registration

Validation, City Selection

Thursday, May 26, 2011

Page 44: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Operations

• Servlet container and filesystem• Jetty 6, 64 Java 6 JVM• 8G Heap -XX:+UseCompressedOops

Thursday, May 26, 2011

Page 45: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Operations

• Active/Passive • Layer 7 Load balancing• Nightly snapshots• Eventually SolrCloud

Thursday, May 26, 2011

Page 46: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Multicore

• Run multiple schemas on the same• Hot swappable for backwards

compatible changes• private / public profiles

Thursday, May 26, 2011

Page 47: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Security

• No security provided• At minimum secure

your UpdateHandler• Separate Cores

<delete><query>*:*</query>

</delete>

Thursday, May 26, 2011

Page 48: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Future

• Solr 3.1• Mutual Matching• Faceting / Guided Search• Incorporating spelling• Hierarchies, categories, better ranking

models

Thursday, May 26, 2011

Page 49: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Faceting

• Returns counts with query results

• Efficient • Guides the user

toward precision

Thursday, May 26, 2011

Page 50: Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

Thank [email protected]

Twitter: @jtuberville

Thursday, May 26, 2011