
Efficient full-text search in databases
Andrew Aksyonoff, Peter Zaitsev
Percona Ltd.
shodan (at) shodan.ru



www.rit2007.ru

Search in databases?

Databases are continually growing
    “Everyone” has 1M records
    10-100M record databases are not that rare
    1B+ record databases that require full-text search do exist (the most prominent example is Google)
Open-source DBMSes are widely used
    We will talk about MySQL
    “The word on the street” is that other DBMSes have similar problems
Unfortunately, built-in solutions are not good enough for full-text search
    Especially so if something beyond “just” full-text search is required…


Types of special requirements

“Just” search is a key requirement, but…
    Surprisingly, it is rather rare on its own (in the DBMS world)
    It is more of a Web search engine task
Additional sorting is frequently required
    On a value other than relevance, for instance product price
Additional filtering is frequently required
    For instance, by product category or posting author ID
Match grouping is frequently required
    For instance, by date or by data source (e.g. site) ID
What do built-in solutions offer?


Built-in MySQL FTS

Pro – built-in, updates “instantly”
Con – scales poorly
Con – ignores word positions
    This causes ranking issues
    This causes phrase search to be slow
Con – only 1 FT index per query (columns…)
Con – does not interoperate with other indexes
    I.e. WHERE, ORDER/GROUP BY, LIMIT clauses are handled separately and “manually”
Conclusion – it is often unacceptable
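
To illustrate why word positions matter for phrase search, here is a minimal Python sketch of a positional inverted index. It is a toy model written for this transcript, not the internals of MySQL FTS or of any engine discussed here; the function names and sample documents are made up. With positions stored in the index, a phrase can be verified from the index alone, whereas an engine that keeps only word frequencies typically has to re-read the candidate documents.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} (a toy positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents that contain the phrase words consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in candidates:
        # A start position of the first word must be followed by the rest, in order.
        later = [set(index[w][doc_id]) for w in words[1:]]
        if any(all(start + i + 1 in positions for i, positions in enumerate(later))
               for start in index[words[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample documents.
docs = {
    1: "full text search in databases",
    2: "search the full database text",
    3: "efficient full text search",
}
index = build_positional_index(docs)
print(phrase_search(index, "full text"))
```

Document 2 contains both words but not next to each other, so only the position check can exclude it.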


External engines shootout

We tested a number of open-source solutions that are well known to us
    Let the vendors advertise commercial solutions themselves
MySQL FTS
mnoGoSearch, http://mnogosearch.org/
    Designed for the Web, but can do databases too (htdb)
Lucene, http://lucene.apache.org/
    Popular Java full-text search library
Sphinx, http://sphinxsearch.com/
    Designed for full-text search in databases from day one


Benchmarking results

~3.5M records, ~5 GB of text (from Wikipedia)
mnoGoSearch dropped out of the race
More details in the EuroOSCON 2006 talk by Peter Zaitsev

                            MySQL   Lucene   Sphinx
Indexing time, min           1627      176       84
Index size, MB               3011     6328     2850
Match all, ms/q               286       30       22
Match phrase, ms/q           3692       29       21
Match bool top-20, ms/q        24       29       13
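
The figures above are easier to compare as ratios. The short Python sketch below only restates the benchmark numbers from this slide and divides each by the Sphinx value; the helper name `ratios_vs` is made up for illustration.

```python
# Benchmark figures copied from the slide (lower is better in every row).
results = {
    "Indexing time, min":      {"MySQL": 1627, "Lucene": 176,  "Sphinx": 84},
    "Index size, MB":          {"MySQL": 3011, "Lucene": 6328, "Sphinx": 2850},
    "Match all, ms/q":         {"MySQL": 286,  "Lucene": 30,   "Sphinx": 22},
    "Match phrase, ms/q":      {"MySQL": 3692, "Lucene": 29,   "Sphinx": 21},
    "Match bool top-20, ms/q": {"MySQL": 24,   "Lucene": 29,   "Sphinx": 13},
}

def ratios_vs(results, baseline):
    """Express every value as a multiple of the baseline engine's value."""
    return {
        metric: {engine: round(value / row[baseline], 1)
                 for engine, value in row.items()}
        for metric, row in results.items()
    }

ratios = ratios_vs(results, "Sphinx")
for metric, row in ratios.items():
    print(f"{metric:26s} {row}")
```

On these numbers, MySQL phrase matching is roughly 176 times slower than Sphinx and MySQL indexing roughly 19 times slower, while Lucene stays within about 1.4 to 2.2 times of Sphinx on every row.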


Existing solutions

mnoGoSearch
    Con – indexing and searching time issues
    FATAL – did not complete indexing 5 GB in 24 hours
Lucene
    Pro – “instant” index updates
    Pro – wildcard, fuzzy searches
    Con – integration cost (it is a Java library)
    Con – filtering implementation (searching speed)
    Con – no support for grouping
Sphinx
    Con – “monolithic” indexes
    Pro – everything else


Sphinx – overview

External solution for database search
Two principal programs
    Indexer, used for re-indexing FT indexes
    Searchd, the search daemon
Easy integration
    Built-in support for MySQL, PostgreSQL
    Provides APIs for PHP, Python, Perl, Ruby, etc.
    Provides a MySQL Storage Engine
High speed
    Indexing speed – 4-10 MB/sec
    Searching speed – avg 20-30 ms/q @ 5 GB, 3.5M docs


Sphinx – ideology

Indexes locally available databases
“A-la SQL” document structure supported from day one
    Up to 256 full-text fields
    Any number of attributes (integer/timestamp/etc.)
“Fast re-indexing instead of slow searching”
    A non-updateable index format was initially chosen to maximize searching speed
    But then it turned out that re-indexing is very fast, too
    For partial updates, we can still use re-indexing: rebuild “partial” (delta) indexes once per N minutes
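
The delta-index idea can be sketched in a few lines of Python. This is a toy in-memory model written for this transcript, not Sphinx code: the class and method names are made up, and it assumes document IDs only grow, so that "everything above the highest ID seen by the last full rebuild" identifies the new rows.

```python
def build_index(docs):
    """Build a word -> set(doc_id) index from scratch (no in-place updates)."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

class MainPlusDelta:
    """Toy model: a large, rarely rebuilt main index plus a small delta index."""

    def __init__(self, docs):
        self.docs = dict(docs)      # stands in for the database table
        self.reindex_main()

    def reindex_main(self):
        # Full rebuild: covers every row, run rarely.
        self.main = build_index(self.docs)
        self.main_max_id = max(self.docs, default=0)
        self.delta = {}

    def add_document(self, doc_id, text):
        # New rows go to the database only; the indexes are not touched.
        self.docs[doc_id] = text

    def reindex_delta(self):
        # Cheap rebuild over rows added since the last full rebuild,
        # run once per N minutes.
        new_docs = {i: t for i, t in self.docs.items() if i > self.main_max_id}
        self.delta = build_index(new_docs)

    def search(self, word):
        # A query consults both indexes and unites the results.
        word = word.lower()
        return sorted(self.main.get(word, set()) | self.delta.get(word, set()))

store = MainPlusDelta({1: "sphinx search", 2: "mysql database"})
store.add_document(3, "sphinx delta index")
before = store.search("sphinx")   # document 3 is not indexed yet
store.reindex_delta()
after = store.search("sphinx")    # document 3 is now found via the delta index
print(before, after)
```

The new document becomes searchable only after the next delta rebuild, which is the price paid for keeping both index formats non-updateable.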


Sphinx – searching

Quality
    Always accounts for word positions, not just frequencies
Scalability
    Up to 50-100 GB per 1 CPU
    Supports distributed searches
    Distributed indexes are fully transparent to the client application
Examples
    Boardreader.com – 500M+ records, 550+ GB text, 12 CPU cluster
    Mininova.org – not many records (less than 1M), but 2-3M searches per day
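
One way to picture a distributed search that stays transparent to the client is a fan-out followed by a merge: the same query goes to every shard, each shard returns its own best matches, and the partial lists are merged into one result. The Python sketch below shows only that idea under simplifying assumptions; it runs the shards sequentially in one process, uses a precomputed, made-up weight per document instead of real ranking, and its function names are hypothetical rather than part of the Sphinx API.

```python
import heapq
from itertools import islice

def search_shard(shard, word, limit):
    """Return up to `limit` (weight, doc_id) matches from one shard, best first."""
    matches = [(weight, doc_id)
               for doc_id, (text, weight) in shard.items()
               if word in text.split()]
    return heapq.nlargest(limit, matches)

def distributed_search(shards, word, limit):
    """Send the query to every shard and merge the partial top lists."""
    partial_results = [search_shard(shard, word, limit) for shard in shards]
    # Each partial list is already sorted best first, so a k-way merge suffices.
    merged = heapq.merge(*partial_results, reverse=True)
    return [doc_id for _weight, doc_id in islice(merged, limit)]

# Two hypothetical shards: doc_id -> (text, precomputed weight).
shards = [
    {1: ("sphinx search engine", 0.9), 2: ("mysql server", 0.5)},
    {3: ("sphinx indexer", 0.7), 4: ("sphinx searchd daemon", 0.95)},
]
top = distributed_search(shards, "sphinx", 2)
print(top)
```

The caller sees a single ranked list and does not need to know how many shards were involved.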


Sphinx – advanced features

Sorting
    On any attribute combination, SQL-like syntax
Filtering matches with a condition
    Performed at the earliest possible searching stage, for speed
    Attributes are always either kept in RAM or copied multiple times all over the index in the required order, for speed
Fun fact – sometimes a full scan of all matches with filtering on the Sphinx side is many times faster than the corresponding MySQL SELECT query, and is used in production instead…
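
The sorting and filtering described above can be sketched as follows, using the product price and product category examples from the requirements slide. This is a toy Python model, not Sphinx internals: the `search` function, the posting list, and the attribute table are all made up. It shows the two points the slide makes, namely that a match is rejected as soon as it is found, and that the attribute lookup is a plain in-memory access.

```python
def search(postings, attrs, word, category=None, sort_by="price", limit=20):
    """Match `word`, filter on an attribute while matching, then sort by an attribute."""
    matches = []
    for doc_id in postings.get(word, ()):
        row = attrs[doc_id]             # attributes live in RAM: a dictionary lookup
        if category is not None and row["category"] != category:
            continue                    # rejected early: never sorted or returned
        matches.append(doc_id)
    # Sort on an attribute instead of relevance; doc_id breaks ties deterministically.
    matches.sort(key=lambda doc_id: (attrs[doc_id][sort_by], doc_id))
    return matches[:limit]

# Hypothetical data: one posting list and a per-document attribute table.
postings = {"phone": [1, 2, 3, 4]}
attrs = {
    1: {"category": 10, "price": 300},
    2: {"category": 20, "price": 100},
    3: {"category": 10, "price": 150},
    4: {"category": 10, "price": 150},
}
filtered = search(postings, attrs, "phone", category=10)
unfiltered = search(postings, attrs, "phone")
print(filtered, unfiltered)
```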


Sphinx – advanced features

Grouping
    On any attribute
    Performed in fixed RAM
    Performed approximately (!)
    Performed quite efficiently (compared to MySQL etc.)
Query word highlighting
    A special service, which needs the document bodies and the query passed to it
MySQL Storage Engine
    Can be used for especially complex queries on the MySQL side which cannot be run fully on the Sphinx side
    Can be used to simplify integration
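
The grouping trade-off above, fixed RAM in exchange for approximate results, can be illustrated with a bounded group table. The Python sketch below uses a deliberately simple scheme chosen for this transcript (when the table is full, evict the smallest group); it is not the algorithm Sphinx uses, and the function name and sample matches are made up.

```python
def group_counts_bounded(matches, key, max_groups):
    """Count matches per group while keeping at most `max_groups` groups in memory."""
    counts = {}
    for match in matches:
        group = key(match)
        if group in counts:
            counts[group] += 1
        elif len(counts) < max_groups:
            counts[group] = 1
        else:
            # Table is full: evict the smallest group (lowest key on ties).
            victim = min(counts, key=lambda g: (counts[g], g))
            del counts[victim]
            counts[group] = 1       # earlier matches of an evicted group are lost
    return counts

# Hypothetical matches as (doc_id, site_id) pairs, grouped by site_id.
matches = [(101, 1), (102, 1), (103, 1), (104, 2), (105, 3), (106, 2), (107, 2)]
approx = group_counts_bounded(matches, key=lambda m: m[1], max_groups=2)
exact = group_counts_bounded(matches, key=lambda m: m[1], max_groups=10)
print(approx, exact)
```

With room for only two groups, site 2 is evicted once and ends up with a count of 2 instead of 3, and site 3 is dropped entirely; with enough room the counts are exact.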


Conclusions

Large and very large databases require external solutions for full-text search
There are a number of requirements for such solutions beyond “just” searching (filtering, grouping, etc.)
There are a number of open-source solutions that meet these requirements to different degrees
For most tasks, try Sphinx, http://sphinxsearch.com/