Efficient full-text search in databases
Andrew Aksyonoff, Peter Zaitsev
Percona Ltd.
shodan (at) shodan.ru
www.rit2007.ru
Search in databases?
  Databases are continually growing
    “Everyone” has 1M records
    10-100M record databases are not that rare
    1B+ record databases that require full-text search do exist (the most prominent example is Google)
  Open-source DBMSes are widely used
    We will talk about MySQL
    “The word on the street” is that other DBMSes have similar problems
  Unfortunately, built-in solutions are not good enough for full-text search
    Especially if something beyond “just” full-text search is required…
Types of special requirements
  “Just” search is a key requirement, but…
    Surprisingly, it is rarely the only requirement (in the DBMS world)
    That is rather a Web search engine task
  Additional sorting is frequently required
    On a value other than relevance, for instance on product price
  Additional filtering is frequently required
    For instance, by product category or posting author ID
  Match grouping is frequently required
    For instance, by date or by data source (e.g. site) ID
  What do built-in solutions offer?
Built-in MySQL FTS
  Pro – built-in, updates “instantly”
  Con – scales poorly
  Con – ignores word positions
    This causes ranking issues
    This causes phrase search to be slow
  Con – only 1 FT index per query (columns…)
  Con – does not interoperate with other indexes
    I.e. WHERE, ORDER/GROUP BY, LIMIT clauses are handled separately and “manually”
  Conclusion – it is often unacceptable
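Why ignoring word positions makes phrase search slow can be illustrated with a toy inverted index. The sketch below is an assumption-laden illustration, not MySQL's or any other engine's actual implementation; the function names and sample documents are made up. Without positions, the index can only narrow the search down to candidate documents, and every candidate body must then be re-scanned. With positions, the phrase can be checked from the index alone.

```python
from collections import defaultdict

# Toy documents (hypothetical data for illustration only).
docs = {
    1: "full text search in databases",
    2: "search the full database text",
}

def build_plain_index(docs):
    """word -> set of doc ids; word positions are discarded."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def build_positional_index(docs):
    """word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_plain(index, docs, phrase):
    """No positions: intersect doc sets, then re-scan every candidate body."""
    words = phrase.split()
    if not words:
        return []
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(d for d in candidates if f" {phrase} " in f" {docs[d]} ")

def phrase_positional(index, phrase):
    """With positions: verify adjacency from the index, no body access."""
    words = phrase.split()
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    candidates = set.intersection(*(set(p) for p in postings))
    hits = []
    for d in candidates:
        starts = postings[0][d]
        if any(all(s + i in postings[i][d] for i in range(1, len(words)))
               for s in starts):
            hits.append(d)
    return sorted(hits)

plain = build_plain_index(docs)
positional = build_positional_index(docs)
plain_hits = phrase_plain(plain, docs, "full text")
positional_hits = phrase_positional(positional, "full text")
```

Both documents contain both words, so the position-less index yields two candidates and has to read both bodies; the positional index rejects document 2 without touching its text.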
External engines shootout
  We tested a number of open-source solutions that are well known to us
    Let the vendors advertise commercial solutions themselves
  MySQL FTS
  mnoGoSearch, http://mnogosearch.org/
    Designed for the Web, but can do databases too (htdb)
  Lucene, http://lucene.apache.org/
    Popular Java full-text search library
  Sphinx, http://sphinxsearch.com/
    Designed for full-text search in databases from day one
Benchmarking results
  ~3.5M records, ~5 GB text (from Wikipedia)
  mnoGoSearch dropped out of the race
    More details in the EuroOSCON 2006 talk by Peter Zaitsev

                            MySQL   Lucene   Sphinx
  Indexing time, min         1627      176       84
  Index size, MB             3011     6328     2850
  Match all, ms/q             286       30       22
  Match phrase, ms/q         3692       29       21
  Match bool top-20, ms/q      24       29       13
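To make the table easier to compare, the figures can be expressed as ratios against one engine. The short sketch below only does arithmetic on the numbers from the slide; the function name `ratios_vs` and the choice of Sphinx as the baseline are ours.

```python
# Benchmark figures copied from the table above (lower is better everywhere).
results = {
    "Indexing time, min":      {"MySQL": 1627, "Lucene": 176,  "Sphinx": 84},
    "Index size, MB":          {"MySQL": 3011, "Lucene": 6328, "Sphinx": 2850},
    "Match all, ms/q":         {"MySQL": 286,  "Lucene": 30,   "Sphinx": 22},
    "Match phrase, ms/q":      {"MySQL": 3692, "Lucene": 29,   "Sphinx": 21},
    "Match bool top-20, ms/q": {"MySQL": 24,   "Lucene": 29,   "Sphinx": 13},
}

def ratios_vs(results, baseline="Sphinx"):
    """For every metric, divide each engine's value by the baseline's value."""
    return {
        metric: {engine: round(value / row[baseline], 1)
                 for engine, value in row.items()}
        for metric, row in results.items()
    }

ratios = ratios_vs(results)
for metric, row in ratios.items():
    print(f"{metric:26s}", row)
```

The largest gap is in phrase matching, which is consistent with the earlier point that an index without word positions makes phrase search slow.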
Existing solutions
  mnoGoSearch
    Con – indexing and searching time issues
    FATAL – did not complete indexing 5 GB in 24 hours
  Lucene
    Pro – “instant” index updates
    Pro – wildcard, fuzzy searches
    Con – integration cost (this is a Java library)
    Con – filtering implementation (searching speed)
    Con – no support for grouping
  Sphinx
    Con – “monolithic” indexes
    Pro – everything else
Sphinx – overview
  External solution for database search
  Two principal programs
    Indexer, used for re-indexing FT indexes
    Searchd, the search daemon
  Easy integration
    Built-in support for MySQL, PostgreSQL
    Provides APIs for PHP, Python, Perl, Ruby, etc.
    Provides a MySQL Storage Engine
  High speed
    Indexing speed – 4-10 MB/sec
    Searching speed – avg 20-30 ms/q @ 5 GB, 3.5M docs
Sphinx – ideology
  Indexes locally available databases
  “A-la SQL” document structure supported from day one
    Up to 256 full-text fields
    Any number of attributes (integer/timestamp/etc.)
  “Fast re-indexing instead of slow searching”
    A non-updateable index format was initially chosen to maximize searching speed
    But then it turned out that re-indexing is very fast, too
    For partial updates we can still use re-indexing: rebuild “partial” (delta) indexes once per N minutes
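The "partial" (delta) re-indexing idea can be sketched with two immutable toy indexes. This is a minimal illustration under stated assumptions, not Sphinx's index format or configuration: the class `ToyIndex`, the helper names, and the rule "delta covers rows with an ID above the main index's maximum ID" are ours, and the sketch assumes rows are only appended with increasing IDs (updates and deletes are not handled).

```python
class ToyIndex:
    """Immutable once built: maps each word to the set of doc ids containing it."""

    def __init__(self, rows):
        rows = list(rows)
        # Highest document ID covered by this index (0 if it is empty).
        self.max_id = max((doc_id for doc_id, _ in rows), default=0)
        postings = {}
        for doc_id, text in rows:
            for word in text.lower().split():
                postings.setdefault(word, set()).add(doc_id)
        self.postings = postings

    def search(self, word):
        return self.postings.get(word.lower(), set())

def rebuild_main(table):
    """Full re-index of the whole table: done rarely."""
    return ToyIndex(sorted(table.items()))

def rebuild_delta(table, main):
    """Re-index only rows added since the main index was built: done every N minutes."""
    return ToyIndex((i, t) for i, t in sorted(table.items()) if i > main.max_id)

def search(word, main, delta):
    """A query consults both indexes and unions the results."""
    return sorted(main.search(word) | delta.search(word))

# Hypothetical table contents.
table = {1: "sphinx search", 2: "mysql database"}
main = rebuild_main(table)

table[3] = "sphinx delta index"          # new row arrives after the full re-index
delta = rebuild_delta(table, main)       # cheap: touches only row 3
found = search("sphinx", main, delta)
```

Neither index is ever updated in place; freshness comes from rebuilding the small delta often and the large main index occasionally.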
Sphinx – searching
  Quality
    Always accounts for word positions, not just frequencies
  Scalability
    Up to 50-100 GB per 1 CPU
    Supports distributed searches
    Distributed indexes are fully transparent to the client application
  Examples
    Boardreader.com – 500M+ records, 550+ GB text, 12 CPU cluster
    Mininova.org – not many records (less than 1M), but 2-3M searches per day
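A distributed search that stays transparent to the client can be pictured as "ask every shard for its best matches, then merge". The sketch below is a toy model only: the shard layout, the term-frequency weight, and the function names are assumptions for illustration and say nothing about how searchd actually ranks or communicates.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, word, limit):
    """Return up to `limit` (weight, doc_id) pairs from one shard, best first.

    The weight is a plain term count, chosen only to keep the example small.
    """
    hits = [(text.lower().split().count(word), doc_id)
            for doc_id, text in shard.items()]
    return heapq.nlargest(limit, (h for h in hits if h[0] > 0))

def search_distributed(shards, word, limit=3):
    """Query all shards in parallel and merge their partial top-N lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda s: search_shard(s, word, limit), shards))
    merged = heapq.nlargest(limit, (hit for part in partial for hit in part))
    return [doc_id for _weight, doc_id in merged]

# Two hypothetical shards holding disjoint sets of documents.
shards = [
    {1: "sphinx sphinx sphinx", 2: "mysql"},
    {3: "sphinx search", 4: "sphinx sphinx"},
]
top = search_distributed(shards, "sphinx", limit=2)
```

The caller sees a single ranked list and never needs to know how many shards were involved; each shard only has to return its own top `limit` matches for the merged top `limit` to be correct.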
Sphinx – advanced features
  Sorting
    On any attribute combination, SQL-like syntax
  Filtering matches with a condition
    Performed at the earliest possible searching stage, for speed
    Attributes are always either kept in RAM, or copied multiple times all over the index in the required order, for speed
    Fun fact – sometimes a full scan of all matches with filtering on the Sphinx side is several times faster than the corresponding MySQL SELECT query, and is used in production instead…
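The combination of early filtering and attribute sorting can be sketched as a small pipeline. This is an illustrative toy under stated assumptions, not Sphinx code: the attribute names `category` and `price`, the sample data, and the sort order "price ascending, then weight descending" are all made up for the example.

```python
import heapq

# Per-document attributes held in RAM, keyed by document ID (hypothetical data).
attributes = {
    1: {"category": 5, "price": 300},
    2: {"category": 5, "price": 100},
    3: {"category": 7, "price": 50},
    4: {"category": 5, "price": 100},
}

def search(matches, category, limit):
    """Filter full-text matches by attribute, then keep the best `limit`.

    matches: iterable of (doc_id, weight) pairs from the full-text stage.
    Order: price ascending, then relevance weight descending.
    """
    def passing():
        for doc_id, weight in matches:
            attrs = attributes[doc_id]          # in-RAM lookup, no database access
            if attrs["category"] != category:
                continue                        # rejected before any sorting work
            yield (attrs["price"], -weight, doc_id)

    best = heapq.nsmallest(limit, passing())    # bounded top-N selection
    return [doc_id for _price, _neg_weight, doc_id in best]

matches = [(1, 10), (2, 7), (3, 9), (4, 4)]
cheapest = search(matches, category=5, limit=2)
```

Document 3 is cheapest overall but is dropped by the filter before it can compete for a place in the result, so only documents that can actually be returned cost any sorting work.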
Sphinx – advanced features
  Grouping
    On any attribute
    Performed in fixed RAM
    Performed approximately (!)
    Performed quite efficiently (compared to MySQL etc.)
  Query words highlighting
    Special service, which needs document bodies and the query passed to it
  MySQL Storage Engine
    Can be used for especially complex queries on the MySQL side which cannot be run fully on the Sphinx side
    Can be used to simplify integration
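The link between "fixed RAM" and "approximately" in the grouping bullets can be shown with a toy counter that is allowed to hold only a bounded number of groups. This is a hypothetical sketch, not the algorithm Sphinx uses: the eviction rule (drop the currently smallest group) and the name `group_count` are our assumptions.

```python
def group_count(keys, max_groups):
    """Count matches per group key while holding at most `max_groups` counters.

    keys: iterable with one group key per matching document.
    When the table is full, the smallest group is evicted, so counts may be
    underestimated and some groups may be missing from the result.
    """
    counts = {}
    for key in keys:
        if key in counts:
            counts[key] += 1
        elif len(counts) < max_groups:
            counts[key] = 1
        else:
            victim = min(counts, key=counts.get)   # evict the smallest group
            del counts[victim]
            counts[key] = 1
    return counts

# One group key per match, e.g. a site ID for every matching post.
keys = ["a", "a", "b", "c", "b", "a"]
exact = group_count(keys, max_groups=10)    # enough room: exact counts
approx = group_count(keys, max_groups=2)    # bounded memory: approximate counts
```

With room for only two groups, group "b" is evicted once and later re-created, so its count comes out lower than the true value, and group "c" is lost entirely. Memory use, in exchange, never grows with the number of distinct groups.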
Conclusions
  Large and very large databases require external solutions for full-text search
  There are a number of requirements for such solutions beyond “just” searching (filtering, grouping, etc.)
  There are a number of open-source solutions that meet these requirements to different degrees
  For most tasks, try Sphinx, http://sphinxsearch.com/