
The Seven Deadly Sins of Solr


DESCRIPTION

Etsy is using Solr and Lucene to serve queries at a rate of more than 8 billion per year (and growing). In this case study, we will describe how Etsy has integrated Solr/Lucene into our continuous deployment infrastructure, allowing for Solr configuration, Java-based indexers, and query parsing logic to go from passing tests to production code in minutes.


Page 1: The Seven Deadly Sins of Solr
Page 2: The Seven Deadly Sins of Solr

© Lucid Imagination, Inc.

Introductions…

Who the hell am I?
  Jay Hill, Lucid Imagination
  7 years Lucene experience
  4 years Solr experience
  Author of Lucid Training
  SME for Lucid Certification

Who the hell are you?
  New to search?
  New to Lucene/Solr?
  Battle-tested veterans?

Page 3: The Seven Deadly Sins of Solr

We'll Leave Time For Q&A

Who's doing what?
  Solr 3.1?
  Solr 1.4.1?
  Nightly build?
  Solr 1.3 or older?
Are there any specific problems you're having?
Meanwhile, interrupt, ask questions as we go, etc.

Page 4: The Seven Deadly Sins of Solr

A Brief Word About Lucid Imagination

Lucid Imagination: The commercial company supporting Lucene/Solr open source search.
Founded by:
  Yonik Seeley – Creator of Solr
  Erik Hatcher – Co-author, Lucene In Action
  Grant Ingersoll – Apache PMC Chair
  Marc Krellenstein – Lucid CTO
Staff includes 9 Lucene/Solr committers
Training, certification, support, LucidWorks Enterprise

Page 5: The Seven Deadly Sins of Solr

Lucid Customers (That I've Worked With)

Page 6: The Seven Deadly Sins of Solr

…On To The Sinning!

Page 7: The Seven Deadly Sins of Solr

Sins As Anti-Patterns?

"Sorta kinda"
  Specify Nothing (Sloth)
  Creeping Featuritis (Greed)
  Blowhard Jamboree (Pride)
  Boat Anchor (Lust)
  Not Invented Here (Envy)
  Phatware (Gluttony)
  Emperor's New Clothes (Wrath)

Page 8: The Seven Deadly Sins of Solr

Sins Can Contradict One Another!

You'll notice that many of the "sins" we see will be the exact opposite of others
Just as some of us tend towards laziness, others towards excess
Sometimes: "Look before you leap."
Other times: "He who hesitates is lost."
In Solr (or any search app), one size never fits all

Page 9: The Seven Deadly Sins of Solr

"I don't know and I don't care."

Page 10: The Seven Deadly Sins of Solr

Sloth

"We aren't really into open source."
Lack of commitment to Solr and/or the search application itself
Not developing in-house Solr expertise
Not paying enough attention to JVM settings, garbage collection, and RAM allocation.

Page 11: The Seven Deadly Sins of Solr

Sloth

Neglecting to get familiar with the source code
  It is open source after all!
Not taking the time to understand the main parts of Solr:
  Request handlers
  Search components
  Query parsers
    Extend QParserPlugin class
  ValueSource & ValueSourceParser – custom functions
    New pseudo-fields in 4.x
  Response writers

Page 12: The Seven Deadly Sins of Solr

Sloth

Not keeping up with new features and developments in Lucene and Solr
  CHANGES.txt – use "diff" to keep up on changes
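One way to act on the "diff" tip, as a minimal sketch: compare the CHANGES.txt of the release you run against the one from a newer release and list only the added lines. The two text excerpts below are hypothetical stand-ins, not real CHANGES.txt content, and the helper name is made up for illustration.

```python
import difflib


def new_change_lines(old_text: str, new_text: str) -> list:
    """Return the lines that appear in the newer CHANGES.txt but not the older one."""
    diff = list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="CHANGES.txt (old)",
            tofile="CHANGES.txt (new)",
            lineterm="",
        )
    )
    # The first two diff lines are the "---" / "+++" file headers; skip them
    # and keep only added lines, with the leading "+" removed.
    return [line[1:] for line in diff[2:] if line.startswith("+")]


# Hypothetical excerpts standing in for two releases' CHANGES.txt files.
old_changes = "Release A\n* Feature one\n"
new_changes = "Release B\n* Feature two\nRelease A\n* Feature one\n"

added = new_change_lines(old_changes, new_changes)
print(added)
```

In practice you would read the two files from disk; the command-line `diff` tool does the same job without any code.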

Page 13: The Seven Deadly Sins of Solr

Sloth

New features in Solr 3.1:
  Solr spatial
  Edismax query parser
    NOT experimental!
  Dynamic metadata extraction via UIMA
  Numeric range faceting (like date faceting)
  Lucene RAMDirectoryFactory available
  Faceting performance improvements
  Spellcheck and Terms components now work for distributed search
  Suggester component – better autosuggest!
    Can add custom dict., phrases, etc.

Page 14: The Seven Deadly Sins of Solr

Sloth

New features coming in Solr 4.x:
  Lucene DocumentsWriterPerThread (DWPT)
    Moving towards "real time"
  UpdateHandler upgrade to work with real-time
  Field collapsing/grouping
  Pivot facets
  SolrCloud (ZooKeeper)
  Fuzzy queries 100 times faster
  Pseudo fields via functions
  Relevancy function queries: tf, idf, docFreq, norm, …

Page 15: The Seven Deadly Sins of Solr

Sloth: The Path To Salvation

Commit to the project and to learning Solr
Stay up to date on Solr changes
Stay current with ongoing releases
Get familiar with the source code
Spend some time to understand the main configuration files:
  solrconfig.xml
  schema.xml
Read through the entire Solr Wiki once every so often
Develop in-house Solr expertise

Page 16: The Seven Deadly Sins of Solr

Save a penny, lose a customer.

Page 17: The Seven Deadly Sins of Solr

Greed

Skimping on resources such as:
  RAM
    "Here's a quarter buddy, go buy some RAM!"
  Storage space
You will get what you pay for!
  …on the other hand, not every company has "deep pockets"

Page 18: The Seven Deadly Sins of Solr

Greed

Trying to "squeeze by", indexing to, and searching on, the same server

[Diagram labels: Indexing; Searches; Shards (Indexers); Slave/Searchers; Load Balancer]

Page 19: The Seven Deadly Sins of Solr

Greed

Not making the effort to find the right balance between precision and recall
Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the system?
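The two definitions above reduce to simple set arithmetic. A minimal sketch, assuming you already have a hand-judged set of relevant document IDs for a test query (the IDs below are invented for illustration):

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Compute (precision, recall) for one query from sets of document IDs."""
    hits = len(returned & relevant)  # relevant documents that were actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents returned, 3 documents known to be relevant.
returned_ids = {"doc1", "doc2", "doc3", "doc4"}
relevant_ids = {"doc2", "doc4", "doc7"}

p, r = precision_recall(returned_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67). Returning more results tends to raise recall at the cost of precision, which is the balance the slide refers to.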

Page 20: The Seven Deadly Sins of Solr

Greed

A few thoughts about relevance:
  Get feedback from domain experts
  Is it better to have lots of results with less precision, or fewer, more targeted results?
  Different sites will have very different requirements

Page 21: The Seven Deadly Sins of Solr

Greed: The Path To Salvation

Pry open your wallet – don't be cheap
You don't have to push the envelope
Find the right balance between recall and precision
Don't push for more results over precision – unless that is a clear requirement (sometimes it is)

Page 22: The Seven Deadly Sins of Solr

"What could possibly go wrong?"

Page 23: The Seven Deadly Sins of Solr

Pride

Reinventing the wheel
  "Why don't we just write our own search libraries?"
  Nobody has a use case like us – right?
  "We need to change the scoring algorithms."

Page 24: The Seven Deadly Sins of Solr

Pride

Thinking you can "do it all" in Solr
  Solr is rarely a good choice as a system of record (SOR)
Consider other tools to work with Solr:
  Nutch
  Mahout
  OpenNLP
  Google Connector Framework
  Your own code

Page 25: The Seven Deadly Sins of Solr

Pride

Stubbornly refusing to use resources such as the mailing lists:
  Solr user list: solr-[email protected]
  Solr developer list: [email protected]
  Lucene user list: java-[email protected]
LucidFind: http://www.lucidimagination.com/search/

Page 26: The Seven Deadly Sins of Solr

Pride

"I will not yield!"
  Trying to "win battles" on the mailing lists
  Good Karma – be a good citizen in the community

Page 27: The Seven Deadly Sins of Solr

Pride: The Path To Salvation

Ask for help when needed
Let the business needs define the project – don't let the tail wag the dog
Get a feel for the Solr community and respect the experience of others
Your situation, while possibly unique, is probably not completely dissimilar to others. Learn from the pioneers and Solr veterans

Page 28: The Seven Deadly Sins of Solr

"Someone stop me!"

Page 29: The Seven Deadly Sins of Solr

Lust

Obsessing over unimportant details too early in the project
  Agile approach is well suited to Solr development – iterate!
Trying to "push the envelope"
  Necessary sometimes, but it's not called the "bleeding edge" without reason
  "Ease in" to major changes
Too much attention to JVM settings
  Solr experts are not usually JVM/GC experts

Page 30: The Seven Deadly Sins of Solr

Lust

"Anti-greed" – Committing too many resources to Solr
  Make sure the OS has plenty of RAM to cache files, etc.
"If one is good, a dozen must be better!"
  As much as possible, try to get a sense of what your query volume will be, and don't just throw money at building a monstrous farm of searchers
  Solr has proven to be much more efficient than some large, commercial search solutions

Page 31: The Seven Deadly Sins of Solr

Lust

Blood from a turnip:
  Trying some absurd new technique, "just because"
RAMDirectoryFactory – not a secret way to faster indexing/searching
  No disk-backed persistence
  Usually not worth it
  …but you never know…
Research first before going "extreme"

Page 32: The Seven Deadly Sins of Solr

Lust

No need to index millions of docs for development
Better to work with small sets of data while getting started.
Don't worry too much about field types as you get started. Get data in the index, then analyze and refine.

Page 33: The Seven Deadly Sins of Solr

Lust: The Path To Salvation

Use an agile approach – start simply, build your application slowly, iterate
Deal with the low-hanging fruit first
Measure twice, cut once
Don't miss the forest for the trees – no need to obsess over details in the early stages
Do some due diligence before trying unorthodox approaches
Get a small sample of data indexed without worrying about field types, then refine in iterations

Page 34: The Seven Deadly Sins of Solr

"If we had some bacon we could have some bacon and eggs – if we had some eggs."

Page 35: The Seven Deadly Sins of Solr

Envy

Adding "cool" features you see on other sites, but don't really need
  Keep it "lean and mean", especially to start
  Resist the urge to include the "kitchen sink"

Page 36: The Seven Deadly Sins of Solr

Envy

You too can master dismax!
  Don't be afraid of dismax/edismax
  Lots of controls to learn, but also lots of power
  Flexibility to search multiple fields
  Boost different fields
  Boost phrase fields (pf) higher than query fields (qf)
  Use boost queries (bq) and function queries (bf)
  Most intimidating params:
    tie
    mm
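To make the dismax parameters less intimidating, here is a rough sketch of a request that uses each one named above. The field names (`title`, `description`, `in_stock`, `created`) and all boost values are hypothetical; only the parameter names (`defType`, `qf`, `pf`, `bq`, `bf`, `tie`, `mm`) are Solr's.

```python
from urllib.parse import urlencode


def dismax_params(user_query: str) -> dict:
    """Build request parameters for an edismax search (illustrative values only)."""
    return {
        "defType": "edismax",
        "q": user_query,
        # Query fields: search both fields, weighting title matches higher.
        "qf": "title^2 description",
        # Phrase fields: boosted higher than qf, so whole-phrase matches rank first.
        "pf": "title^10 description^5",
        # Boost query: nudge in-stock items up without filtering others out.
        "bq": "in_stock:true^3",
        # Boost function: favor recently created documents.
        "bf": "recip(ms(NOW,created),3.16e-11,1,1)",
        # Tie breaker: how much lower-scoring field matches add to the best one.
        "tie": "0.1",
        # Minimum should match: up to 2 terms all required, beyond that 75%.
        "mm": "2<75%",
    }


params = dismax_params("red shoes")
print("/select?" + urlencode(params))
```

Starting from a plain `qf` and adding one parameter at a time makes it easier to see what each control does to the ranking.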

Page 37: The Seven Deadly Sins of Solr

Envy

Spatial search – seems complicated, but major sites make it look easy
Now, in Solr 3.1 – it is easy! You can:
  Store spatial data in your index
  Filter by distance
  Sort by distance
  Boost/bias by distance
  Facet by distance
Also consider: Search-based navigation such as "Show me in-stock items only"
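As a sketch of how the filter-by-distance and sort-by-distance items might look in a request, assuming a location field named `store` (hypothetical) indexed with a spatial field type, and using the `geofilt` filter and `geodist()` function from Solr's spatial support:

```python
from urllib.parse import urlencode


def nearby_params(query: str, lat: float, lon: float, radius_km: float,
                  field: str = "store") -> dict:
    """Build parameters that filter to a radius and sort nearest-first."""
    return {
        "q": query,
        "fq": "{!geofilt}",       # keep only documents within d km of pt
        "sfield": field,          # the spatial field to measure against
        "pt": f"{lat},{lon}",     # center point as "lat,lon"
        "d": str(radius_km),      # radius in kilometers
        "sort": "geodist() asc",  # closest documents first
    }


params = nearby_params("coffee", 45.15, -93.85, 5)
print("/select?" + urlencode(params))
```

Boosting and faceting by distance build on the same `sfield` and `pt` parameters but are left out here to keep the example short.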

Page 38: The Seven Deadly Sins of Solr

Envy: The Path To Salvation

Focus on your requirements, don't try to add "bells and whistles" you don't need
Don't be hesitant to dive into the power of dismax/edismax
Take advantage of new features such as Solr spatial, if those features will add value to the end user experience

Page 39: The Seven Deadly Sins of Solr

"A fat stomach never breeds fine thoughts."

Page 40: The Seven Deadly Sins of Solr

Gluttony

“Staying fit and trim” is usually good practice when designing and running Solr applications
  Once again – keep it "lean and mean"
A lot of these issues cross over into the “Sloth” category
  The effort needed to keep your configuration and data efficiently managed is not considered important
Don't lose control of your configuration files
  Remove unnecessary elements
  Version control all configuration files

Page 41: The Seven Deadly Sins of Solr

Gluttony

Slim down those "bloated" queries:

q="red shoes"& accountId=(12343 OR 338899 OR 554443 OR 243445 OR 55442 OR 3330899 OR 59927 OR 3888999 OR 549 OR 440293579 34201 OR 339917 OR 300191 OR 339338 OR 109823 OR 679176 OR 31407815 OR 3001756 OR 134322 OR 311123 OR 987888 OR 997181 OR 771819 OR 100292 OR 3389474 OR 5505759 OR 2459577 OR 4499957 OR 1996571 OR 559590 OR 220299 OR 4404872 OR 151510 OR 66017 OR 666 OR 113459 OR 890575 OR 505725 OR 330393 OR 349940 OR 4094994 OR 1245995 OR 2459959 OR 4255909 OR 899955 OR 7878899 OR 100999 … ∞ )
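A query like the one above is easy to spot once you look for it. The sketch below is one hypothetical way to flag such queries during a periodic review by counting OR-joined clauses; the threshold of 20 is an arbitrary assumption, not a Solr setting.

```python
import re


def count_or_clauses(query: str) -> int:
    """Count terms joined by OR (number of ORs plus one), or 0 if there is no OR."""
    ors = len(re.findall(r"\bOR\b", query))
    return ors + 1 if ors else 0


def is_bloated(query: str, max_clauses: int = 20) -> bool:
    """Flag queries whose OR list exceeds an (arbitrary) clause budget."""
    return count_or_clauses(query) > max_clauses


small_query = 'q="red shoes"&accountId=(12343 OR 338899 OR 554443)'
big_query = "accountId=(" + " OR ".join(str(n) for n in range(50)) + ")"

print(count_or_clauses(small_query), is_bloated(small_query))
print(count_or_clauses(big_query), is_bloated(big_query))
```

What to do with a flagged query depends on the application; the point here is only to find candidates worth reviewing.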

Page 42: The Seven Deadly Sins of Solr

Gluttony

Stay in shape – Flex Your Solr Muscles!
  Keep up on new features
  Training, when appropriate
  Certification
  Contribute!
  Follow the user lists
  Refactor when new features can help
  Keep up to date on new releases

Page 43: The Seven Deadly Sins of Solr

Gluttony: The Path To Salvation

Keep configuration files clean and trim. Remove unused elements
Periodically review queries to make sure they are efficient
Refactor when necessary – keep your application fit and trim

Page 44: The Seven Deadly Sins of Solr

"Hope is the denial of reality."

Page 45: The Seven Deadly Sins of Solr

Wrath

Wrath - usually synonymous with anger, but…
Let’s use an older definition here:
  “A vehement denial of the truth, both to others and in the form of self-denial and impatience.”
Step back every now and then and look objectively at your application

Page 46: The Seven Deadly Sins of Solr

Wrath

Resist the push to rush to production…

Page 47: The Seven Deadly Sins of Solr

Wrath

Ignoring new Solr releases
  OK to wait until a release is proven
  But getting too far behind makes upgrading more painful with each release
We don't have time to do it right, but we always have time to fix it

Page 48: The Seven Deadly Sins of Solr

Wrath

Ignoring complaints about results relevance
Disregarding feedback from stakeholders
  Remember – the point of your search application is to support the business, not to "build cool stuff"
Not taking advantage of log files
  Consider mining log files, storing data in relational DB for generating reports
  Capturing user queries and query counts can be extremely useful
  Can also be used for query-based autosuggest (not just indexed terms)
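A minimal sketch of the log-mining idea: pull the `q` parameter out of each request line and count how often each query occurs. The log line layout used here (`params={...} hits=...`) is an assumption about how your Solr request log looks, so adjust the pattern to match your own files; the sample lines are invented.

```python
import re
from collections import Counter
from urllib.parse import parse_qs

# Assumed layout: "... params={q=red+shoes&rows=10} hits=12 status=0 QTime=3"
PARAMS_RE = re.compile(r"params=\{(.*)\}\s+hits=")


def count_queries(log_lines) -> Counter:
    """Count user queries (the q parameter) across request log lines."""
    counts = Counter()
    for line in log_lines:
        match = PARAMS_RE.search(line)
        if not match:
            continue  # not a search request (e.g. an update), skip it
        params = parse_qs(match.group(1))
        for q in params.get("q", []):
            counts[q.strip().lower()] += 1  # normalize case and whitespace
    return counts


# Hypothetical log lines for illustration.
sample_log = [
    "INFO: [] webapp=/solr path=/select params={q=red+shoes&rows=10} hits=12 status=0 QTime=3",
    "INFO: [] webapp=/solr path=/select params={q=Red+Shoes} hits=12 status=0 QTime=2",
    "INFO: [] webapp=/solr path=/select params={q=blue+hat} hits=0 status=0 QTime=1",
    "INFO: [] webapp=/solr path=/update params={} status=0 QTime=5",
]

counts = count_queries(sample_log)
print(counts.most_common(2))
```

The resulting counts could be loaded into a relational table for reporting, or used as the source for query-based autosuggest.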

Page 49: The Seven Deadly Sins of Solr

Wrath: The Path To Salvation

Keep your version of Solr up to date
  OK to wait "awhile", but don't skip versions
Seek and embrace feedback from business and domain experts
Constantly gauge and improve relevance as an ongoing task
Avoid the push to release too soon (as best you can)
Take advantage of log files to understand what users are doing, and what is not working well

Page 50: The Seven Deadly Sins of Solr

Search, and you shall find!