PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0
PGroonga
Make PostgreSQL fast full text search platform for all languages!
Kouhei Sutou, ClearCode Inc.
PGConf.ASIA 2016
2016-12-03

PostgreSQL and me
Some of my patches have been merged

Patches
BUG #13840: pg_dump generates unloadable SQL
BUG #14160: DROP ACCESS METHOD IF EXISTS isn't implemented
They were found while developing PGroonga

PGroonga dev style
When there are problems in related projects, including PostgreSQL:
We fix the problems in those projects instead of working around them in PGroonga

PostgreSQL and FTS
PostgreSQL has an out-of-the-box full text search feature
It has some problems...
We fixed them with PGroonga instead of fixing PostgreSQL

Because...
1. Our approach is different from PostgreSQL's approach
2. PostgreSQL provides a plugin system
   Implementing as a plugin is the PostgreSQL way!

PG FTS problem
Many languages aren't supported
e.g.: Asian languages
Japanese, Chinese and more

FTS for Japanese (1)
SELECT
  to_tsvector('japanese', '');
-- ERROR: text search configuration
--        "japanese" does not exist
-- LINE 2: to_tsvector('japanese',
--         ^

FTS for Japanese (2)
CREATE EXTENSION pg_trgm;
SELECT '' % '';  -- substring
--  ?column?
-- ----------
--  f          Must be "t"!
-- (1 row)

Existing solution
pg_bigm

pg_bigm
An extension
Similar to pg_trgm
Operator class for GIN

pg_bigm: Usage
CREATE INDEX index ON table
  USING gin (column gin_bigm_ops);
-- USING gin: Use GIN
-- gin_bigm_ops: Specify the operator class

pg_bigm: Demerit
Slow for large documents
(Normally, we want to use FTS for large documents)
Because it needs "recheck"

"recheck"
"Exact" sequential search after "loose" index search
The larger the text, the slower
Text size = document size * number of documents

Benchmark
[Chart: elapsed time in seconds (shorter is better) by number of hits (311, 14706, 20389) for pg_bigm; two bars are labeled "Slow"]
Data: Japanese Wikipedia (many records and large documents)
N records: about 0.9 million
Average text size: 6.7 KiB

New solution

PGroonga
Pronunciation: píːzí:lúnɡά
An extension
Index and operator classes
Not operator classes for GIN

PGroonga layer
Layer           Existing modules               PGroonga
Operator class  textsearch, pg_trgm, pg_bigm   PGroonga
Index           GIN                            PGroonga

Benchmark
[Chart: elapsed time in seconds (shorter is better) by number of hits (311, 14706, 20389) for PGroonga and pg_bigm; two bars are labeled "Fast"]
Data: Japanese Wikipedia (many records and large documents)
N records: about 0.9 million
Average text size: 6.7 KiB

Wrap up (1)
PostgreSQL's full text search doesn't support Asian languages
pg_bigm and PGroonga support all languages

Wrap up (2)
Many hits case:
pg_bigm is slow
PGroonga is fast

Why is PGroonga fast?
Doesn't need "recheck"
Is "recheck" really slow?
See one more benchmark result

Benchmark
[Chart: elapsed time in seconds (shorter is better) by number of hits (0 to 500000) for PGroonga and pg_bigm; two points are labeled "Slow" and one is labeled "Fast for many hits!"]
Query: ""
Data: Japanese Wikipedia (many records and large documents)
N records: about 0.9 million
Average text size: 6.7 KiB

Why is pg_bigm fast?
Query is ""
Point: 2 characters
pg_bigm doesn't need "recheck" for a 2-character query
It means that "recheck" is slow

N-gram and "recheck"
N-gram approach needs "phrase search" when the query has N+1 or more characters
N=2 for pg_bigm, N=3 for pg_trgm
GIN needs "recheck" for "phrase search"
Phrase search
Phrase search is "token search" and "position check"
Tokens must exist and be ordered
OK: "car at" for "car at" query
NG: "at car" for "car at" query
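
The OK/NG cases above can be checked with a small sketch. It only illustrates "token search" plus "position check" on whitespace-separated words; the function name `phrase_match` is made up for this example and is not part of PostgreSQL or PGroonga.

```python
def phrase_match(text: str, phrase: str) -> bool:
    """Return True if the phrase's tokens appear in text, adjacent and in order."""
    tokens = text.split()
    query = phrase.split()
    if not query:
        return False
    # Token search: every query token must exist somewhere in the text.
    if any(token not in tokens for token in query):
        return False
    # Position check: the query tokens must appear consecutively, in order.
    width = len(query)
    return any(tokens[start:start + width] == query
               for start in range(len(tokens) - width + 1))


print(phrase_match("car at", "car at"))  # OK case: True
print(phrase_match("at car", "car at"))  # NG case: False
```

In the NG case both tokens exist, so the token search alone would accept the text; only the position check rejects it.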

N-gram and phrase search
1. Split text to tokens
   "cat" -> "ca", "at"
2. Search all tokens
   "ca" and "at" exist: Candidate!
3. Check appearance positions
   "ca" then "at": Found!
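
The three steps can be sketched in Python for a single document. This is a minimal model with N=2, not the code of any extension; `bigrams` and `find_phrase` are hypothetical names.

```python
def bigrams(text: str) -> list[str]:
    """Split text into overlapping 2-character tokens (N=2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def find_phrase(document: str, query: str) -> bool:
    doc_tokens = bigrams(document)
    # 1. Split the query into tokens.
    query_tokens = bigrams(query)
    if not query_tokens:
        return False
    # 2. Search all tokens: the document is a candidate only if all exist.
    if not all(token in doc_tokens for token in query_tokens):
        return False
    # 3. Check appearance positions: tokens must be consecutive and ordered.
    width = len(query_tokens)
    return any(doc_tokens[i:i + width] == query_tokens
               for i in range(len(doc_tokens) - width + 1))


print(bigrams("cat"))                # ['ca', 'at']
print(find_phrase("cat", "cat"))     # True
print(find_phrase("at car", "cat"))  # False: "ca" and "at" exist, but not in order
```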

N-gram and GIN: Create
Documents
ID  Text
10  cat
20  at car
Tokenize
10: "ca", "at"
20: "at", "t ", " c", "ca", "ar"
GIN
Token  Posting list
"ca"   10,20
"at"   10,20
"t "   20
" c"   20
"ar"   20
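
The posting lists above can be reproduced with a few lines of Python. This is only a model of an index that stores document IDs, not how GIN or pg_bigm is implemented; `bigrams` and `build_doc_id_index` are names invented for the sketch.

```python
from collections import defaultdict


def bigrams(text: str) -> list[str]:
    """Split text into overlapping 2-character tokens (N=2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_doc_id_index(documents: dict[int, str]) -> dict[str, list[int]]:
    """Map each token to the sorted IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in bigrams(text):
            postings[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in postings.items()}


documents = {10: "cat", 20: "at car"}
index = build_doc_id_index(documents)
for token, doc_ids in index.items():
    print(repr(token), doc_ids)
```

Note that the posting lists keep no information about where in the document each token appeared.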

N-gram and GIN: Search
Query: cat
Tokenize: "ca", "at"
Search GIN: "ca" AND "at"
Token  Posting list
"ca"   10,20
"at"   10,20
"t "   20
" c"   20
"ar"   20
Candidates: 10,20
Appearance position check against the documents (Point: Out of GIN)
ID  Text
10  cat
20  at car

GIN and phrase search
Phrase search needs position check
GIN doesn't support position check
GIN needs "recheck": Slow!
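
A rough sketch of why "recheck" costs more as documents grow: the index only narrows the search down to candidates, and every candidate's text must then be scanned. The substring test below stands in for the exact check; all names are hypothetical and the real GIN/pg_bigm code paths differ.

```python
def bigrams(text: str) -> list[str]:
    """Split text into overlapping 2-character tokens (N=2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def search_with_recheck(index, documents, query):
    """Return (candidates, hits) for a doc-ID-only index."""
    tokens = bigrams(query)
    if not tokens:
        return [], []
    # "Loose" index search: AND the posting lists of all query tokens.
    candidates = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        candidates &= set(index.get(token, []))
    # "Exact" recheck: scan the full text of every candidate document.
    hits = [doc_id for doc_id in sorted(candidates)
            if query in documents[doc_id]]
    return sorted(candidates), hits


documents = {10: "cat", 20: "at car"}
index = {"ca": [10, 20], "at": [10, 20], "t ": [20], " c": [20], "ar": [20]}
candidates, hits = search_with_recheck(index, documents, "cat")
print(candidates)  # [10, 20]
print(hits)        # [10]
```

The cost of the recheck loop is proportional to the total text size of the candidates, which matches the "many hits, large documents" case in the benchmark.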

Why is PGroonga fast?
PGroonga uses N-gram by default
But doesn't need "recheck"

Why no "recheck"?
PGroonga uses full inverted index

Full inverted index
Including position

Inverted index diff
Documents
ID  Text
10  cat
20  at car
Tokenize (token: position)
10: "ca": 1, "at": 2
20: "at": 1, "t ": 2, " c": 3, "ca": 4, "ar": 5
Full: Doc ID + pos
Token  Posting list
"ca"   10:1,20:4
"at"   10:2,20:1
"t "   20:2
" c"   20:3
"ar"   20:5
Not full: Only doc ID
Token  Posting list
"ca"   10,20
"at"   10,20
"t "   20
" c"   20
"ar"   20
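
Both kinds of posting lists can be built from the same tokenization. The sketch below is a model only (it does not reflect Groonga's on-disk format); positions are 1-based to match the slide, and the function names are invented.

```python
from collections import defaultdict


def bigrams(text: str) -> list[str]:
    """Split text into overlapping 2-character tokens (N=2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_full_index(documents):
    """Full inverted index: token -> list of (doc ID, position)."""
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(bigrams(text), start=1):
            postings[token].append((doc_id, position))
    return dict(postings)


def drop_positions(full_index):
    """Not-full inverted index: token -> sorted doc IDs only."""
    return {token: sorted({doc_id for doc_id, _ in entries})
            for token, entries in full_index.items()}


documents = {10: "cat", 20: "at car"}
full = build_full_index(documents)
not_full = drop_positions(full)
print(full["ca"], full["at"])          # [(10, 1), (20, 4)] [(10, 2), (20, 1)]
print(not_full["ca"], not_full["at"])  # [10, 20] [10, 20]
```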

N-gram/PGroonga: Search
Query: cat
Tokenize: "ca": 1, "at": 2
Search PGroonga: "ca" AND "at"
Token  Posting list
"ca"   10:1,20:4
"at"   10:2,20:1
"t "   20:2
" c"   20:3
"ar"   20:5
Appearance position check (Point: In PGroonga)
Result: 10
Documents
ID  Text
10  cat
20  at car
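
With positions in the posting lists, the phrase check can be answered from the index alone, without reading document text. A minimal sketch using the posting lists from the slide; `phrase_search` is a hypothetical name and this is not PGroonga's actual algorithm.

```python
def bigrams(text: str) -> list[str]:
    """Split text into overlapping 2-character tokens (N=2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def phrase_search(full_index, query):
    """Return IDs of documents where the query tokens appear consecutively."""
    tokens = bigrams(query)
    if not tokens:
        return []
    hits = set()
    for doc_id, start in full_index.get(tokens[0], []):
        # The k-th following token must sit at position start + k
        # in the same document.
        if all((doc_id, start + k) in full_index.get(token, [])
               for k, token in enumerate(tokens[1:], start=1)):
            hits.add(doc_id)
    return sorted(hits)


# Posting lists copied from the slide: token -> [(doc ID, position), ...]
full_index = {
    "ca": [(10, 1), (20, 4)],
    "at": [(10, 2), (20, 1)],
    "t ": [(20, 2)],
    " c": [(20, 3)],
    "ar": [(20, 5)],
}
print(phrase_search(full_index, "cat"))  # [10]
```

Document 20 is rejected because "at" does not appear at position 5, right after "ca" at position 4; the document text itself is never read.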

Wrap up
N-gram needs phrase search
Full inverted index provides fast phrase search
GIN isn't full inverted index
PGroonga uses full inverted index

FTS and English(*)
Normally, N-gram isn't used for English FTS
N-gram is slower than the word based approach (textsearch approach)
Stemming/stop words can't be used with N-gram
(*) English: alphabet based languages

PGroonga and English
PGroonga uses N-gram by default
Is PGroonga slow for English?
No. Similar to textsearch

PGroonga: Search
[Chart: elapsed time in milliseconds (shorter is better) by query ("PostgreSQL OR MySQL", "database", "America") for PGroonga and textsearch]
Data: English Wikipedia (many records and large documents)
N records: about 5.3 million
Average text size: 6.4 KiB

PGroonga's N-gram
Variable size N-gram
Continuous alphabet characters are 1 token (= word based approach)
Hello -> "Hello", not "He", "el", ...
Non-alphabet characters are 2-gram
"" -> "", "", ...
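
The variable size idea can be sketched as follows. This is a deliberately simplified model, not Groonga's tokenizer: it treats only ASCII letters as "alphabet", drops whitespace, and ignores the many other rules a real tokenizer has. The function names and the sample strings are assumptions made for the example.

```python
def is_alphabet(ch: str) -> bool:
    """Simplification: only ASCII letters count as alphabet characters."""
    return ch.isascii() and ch.isalpha()


def variable_ngram(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        j = i
        if is_alphabet(text[i]):
            # A run of continuous alphabet characters becomes one token.
            while j < len(text) and is_alphabet(text[j]):
                j += 1
            tokens.append(text[i:j])
        else:
            # A run of other characters is split into 2-grams.
            while (j < len(text) and not text[j].isspace()
                   and not is_alphabet(text[j])):
                j += 1
            run = text[i:j]
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[k:k + 2] for k in range(len(run) - 1))
        i = j
    return tokens


print(variable_ngram("Hello"))         # ['Hello']
print(variable_ngram("Hello 東京都"))  # ['Hello', '東京', '京都']
```

Because English words stay whole, the number of tokens to look up per query is similar to the word based approach, which is consistent with the benchmark result above.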

Wrap up (1)
PGroonga's search is fast for all languages
Including text that mixes alphabet based languages and Asian languages (textsearch doesn't support the mixed case)

Wrap up (2)
PGroonga makes PostgreSQL fast full text search platform for all languages!

More about PGroonga
Performance
JSON support
Replication

Performance
Search and update
Index only scan
Direct Groonga search
Index creation

Search and update
Doesn't decrease search performance while updating

Characteristics
[Chart: search throughput versus update throughput, one panel for PGroonga and one for GIN]
PGroonga: Keeps search performance during many updates
GIN: Decreases search performance while updating

Update and lock
Update without read locks
Write locks are required

GIN: Read/Write
Conn1: INSERT start
Conn2: SELECT start
Conn2: Blocked
Conn1: INSERT finish
Conn2: SELECT finish
GIN: Slow down!

PGroonga: Read/Write
Conn1: INSERT start
Conn2: SELECT start
Conn1: INSERT finish
Conn2: SELECT finish
PGroonga: No slow down!

Stably fast
GIN has intermittent performance drops
For details: "GIN pending list"
PGroonga keeps search fast
PGroonga keeps the index up to date

Index only scan
GIN: Not supported
PGroonga: Supported

Even faster search
Direct Groonga search is even faster
Groonga: The full text search engine PGroonga uses

Direct Groonga search
[Chart: elapsed time in milliseconds (shorter is better) by query ("PostgreSQL OR MySQL", "database", "America") for PGroonga, Groonga and textsearch]
Groonga is 30x faster than others
Data: English Wikipedia (many records and large documents)
N records: about 5.3 million
Average text size: 6.4 KiB

Index creation time
[Chart: index creation elapsed time in hours (shorter is better) by module (PGroonga, textsearch, pg_trgm); one bar is labeled "2x faster than textsearch"]
Data: English Wikipedia
Size: about 33 GiB
Max text size: 1 MiB

Performance: Wrap up
Keeps search fast during updates
Supports index only scan
Direct Groonga search is even faster
Fast index creation

JSON support
Supports full text search
Target: All texts in JSON
Not only a text in a path (GIN supports only this style)

JSON: FTS
INSERT INTO logs (record)
  VALUES ('{"host": "app1"}'),
         ('{"message": "app is down"}');
SELECT * FROM logs
  WHERE record @@ 'string @@ "app"';
--            record
-- ----------------------------
--  {"host": "app1"}
--  {"message": "app is down"}
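
What "all texts in JSON" means can be illustrated outside the database. The sketch below walks every string value in a record regardless of its path and uses a plain substring test as a stand-in for full text search; it does not use PGroonga, and the third record is a made-up non-matching example.

```python
import json


def iter_strings(value):
    """Yield every string value found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from iter_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child)


def matches(record_json: str, keyword: str) -> bool:
    """True if any string value in the record contains the keyword."""
    record = json.loads(record_json)
    return any(keyword in text for text in iter_strings(record))


logs = [
    '{"host": "app1"}',
    '{"message": "app is down"}',
    '{"host": "db1"}',  # hypothetical record that should not match
]
hits = [record for record in logs if matches(record, "app")]
print(hits)
```

Both matching records are found even though the keyword sits under different keys, which is the difference from searching only the text at one path.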

JSON: Wrap up
Supports full text search against all texts in JSON

Replication
Supported with PostgreSQL 9.6!
PostgreSQL 9.6 ships "generic WAL"
Third party indexes can support WAL generation

Implementation
1. Master: Encode action logs as MessagePack
2. Master: Write the action logs to WAL
3. Slaves: Read the action logs and apply them

Overview
Master: INSERT -> PGroonga updates the PGroonga DB
Master: PGroonga appends action logs to the index file via the generic WAL API
WAL: Carries the action logs from master to slave
Slave: SELECT -> PGroonga applies pending action logs on SELECT

Action log: "action"
{
  "_action": ACTION_ID
}
# ACTION_ID: 0: INSERT
# ACTION_ID: 1: CREATE_TABLE
# ACTION_ID: 2: CREATE_COLUMN
# ACTION_ID: 3: SET_SOURCES

Action log: INSERT
{
  "_action": 0,
  "_table": "TABLE_NAME",
  "ctid": PACKED_CTID_VALUE,
  "column1": COLUMN1_VALUE,
  ...
}

Action log: Logs
{"_action": ACTION_ID, ...}
{"_action": ACTION_ID, ...}
{"_action": ACTION_ID, ...}
...

Write action logs
Index file: Each page holds a header followed by action logs

Apply action logs
Index file (blocks 1 to 4) -> Apply -> PGroonga DB
Applied offset (Block# + Offset)
Before: (2,10)
After: (2,50)
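
The applied offset lets a reader resume where it stopped instead of replaying every log. A minimal model, under loud assumptions: pages are plain byte strings without headers, block numbers are 0-based list indexes, and newline-delimited JSON stands in for MessagePack. `apply_pending` is a hypothetical name, not a PGroonga function.

```python
import json


def apply_pending(pages, applied, apply):
    """Apply action logs stored after `applied` and return the new offset.

    pages:   list of bytes objects, one per block (page headers omitted)
    applied: (block index, byte offset) of the first unapplied action log
    apply:   callback that receives each decoded action log
    """
    start_block, start_offset = applied
    for block in range(start_block, len(pages)):
        page = pages[block]
        position = start_offset if block == start_block else 0
        while position < len(page):
            end = page.index(b"\n", position)
            apply(json.loads(page[position:end]))
            position = end + 1
        applied = (block, position)
    return applied


pages = [b'{"_action": 1}\n', b'{"_action": 2}\n']
seen = []
offset = apply_pending(pages, (0, 0), seen.append)

# New action logs arrive later; only they are applied on the next call.
pages[1] += b'{"_action": 0}\n'
offset = apply_pending(pages, offset, seen.append)
print(seen, offset)
```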

Action log: Why msgpack?
Because MessagePack supports streaming unpack
It's useful for stopping action log application when WAL is only partially applied on slaves
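
The benefit of streaming unpack can be shown with the standard library alone. The sketch below uses concatenated JSON objects and `json.JSONDecoder.raw_decode` as a stand-in for MessagePack: it decodes every complete record and stops cleanly at an incomplete tail. The table and column names are made up, and `decode_available` is a hypothetical helper.

```python
import json


def decode_available(buffer: str, position: int = 0):
    """Decode complete records starting at `position`.

    Returns (records, next position). Decoding stops at the first record
    that is not completely available yet.
    """
    decoder = json.JSONDecoder()
    records = []
    while position < len(buffer):
        try:
            record, end = decoder.raw_decode(buffer, position)
        except json.JSONDecodeError:
            break  # incomplete tail: wait until more data has been applied
        records.append(record)
        position = end
    return records, position


first = '{"_action": 1, "_table": "memos"}'
second = '{"_action": 0, "_table": "memos", "content": "hello"}'
stream = first + second

# Only part of the second record has arrived so far.
received = stream[:len(first) + 10]
records, position = decode_available(received)
print(records, position)

# Once the rest is available, decoding resumes from the saved position.
more, position = decode_available(stream, position)
print(more, position)
```

The saved position plays the same role as the applied offset: a partially written record is neither applied nor lost.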

Replication: Wrap up
Supported with PostgreSQL 9.6!
Concept: Action logs on WAL
It'll be a useful pattern for indexes stored outside PostgreSQL storage

Wrap up (1)
PostgreSQL doesn't support FTS for all languages
PGroonga supports FTS for all languages

Wrap up (2)
PGroonga is stably fast
PGroonga supports FTS for all texts in JSON

Wrap up (3)
PGroonga supports replication
PostgreSQL 9.6 is required

Wrap up (4)
PGroonga makes PostgreSQL fast full text search platform for all languages!

See also
https://pgroonga.github.io/
Tutorial: /tutorial/
Install: /install/
Reference: /reference/ (includes replication and benchmark docs)
Community: /community/