73
PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0 PGroonga Make PostgreSQL fast full text search platform for all languages! Kouhei Sutou ClearCode Inc. PGConf.ASIA 2016 2016-12-03

Pgroonga – PostgreSQLを全言語対応高速全文検索プラットフォームに!

  • Upload
    lykhanh

  • View
    280

  • Download
    1

Embed Size (px)

Citation preview

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroongaMake PostgreSQL

    fast full text search platformfor all languages!

    Kouhei Sutou ClearCode Inc.PGConf.ASIA 2016

    2016-12-03

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PostgreSQL and mePostgreSQL

    Some my patches are

    merged

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Patches

    BUG #13840: pg_dump generates unloadable SQLpg_dumpSQL

    BUG #14160: DROP ACCESS METHOD IF EXISTS isn't impl.DROP ACCESS METHOD IF EXISTS

    They are found while developing PGroongaPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroonga dev stylePGroonga

    When there are problems in related projects including PostgreSQLPostgreSQL

    We fix these problems in these projects instead of choosing workaround in PGroongaPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PostgreSQL and FTSPostgreSQL

    PostgreSQL has out-of-the-box full text search featurePostgreSQL

    It has some problems...

    We fixed them by PGroongaPGroonga

    instead of fixing PostgreSQL PostgreSQL

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Because...

    Our approach is different from PostgreSQL's approachPGroongaPostgreSQL

    1.

    PG provides plugin systemPostgreSQL

    Implementing as a plugin is PostgreSQL way!PostgreSQL

    2.

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PG FTS problemPostgreSQL

    Many langs aren't supported

    e.g.: Asian languages

    Japanese, Chinese and more

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    FTS for Japanese11

    SELECT to_tsvector('japanese', '');-- ERROR: text search configuration-- "japanese" does not exist-- LINE 2: to_tsvector('japanese',-- ^

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    FTS for Japanese22

    CREATE EXTENSION pg_trgm;SELECT '' % '';-- substring-- ?column? -- ------------ f Must be "t"!-- (1 row)

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Existing solution

    pg_bigm

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    pg_bigm

    An extension

    Similar to pg_trgmpg_trgm

    Operator class for GINGIN

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    pg_bigm: Usagepg_bigm

    CREATE INDEX index ON table USING gin (column gin_bigm_ops);-- Use GIN Specify op class

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    pg_bigm: Demeritpg_bigm

    Slow for large document(Normally, we want to use FTS for large document)

    Because it needs "recheck"recheck

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    "recheck"

    "Exact" seq. search after"loose" index search

    The larger text, the slower

    text = doc size * N docs = *

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Benchmark

    0

    0.5

    1

    1.5

    2

    2.5

    3

    311 14706 20389

    Data: Japanese Wikipedia(Many records and large documents)N records: About 0.9millionsAverage text size: 6.7KiB

    Slow

    Slow

    Elapsed time (sec)

    (Shorter is better)

    N hits

    pg_bigm

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    New solution

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroonga

    Pronunciation: pz:ln

    An extension

    Index and operator classes

    Not operator classes for GINGIN

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroonga layer

    GIN

    textsearchpg_trgmpg_bigm

    Index

    Operatorclass

    PGroonga

    PGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Benchmark

    0

    0.5

    1

    1.5

    2

    2.5

    3

    311 14706 20389

    Data: Japanese Wikipedia(Many records and large documents)N records: About 0.9millionsAverage text size: 6.7KiB

    Fast Fast

    Elapsed time (sec)

    (Shorter is better)

    N hits

    PGroonga pg_bigm

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up11

    PostgreSQL doesn't support Asian languagesPostgreSQL

    pg_bigm and PGroonga support all languagespg_bigmPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up22

    Many hits case:

    pg_bigm is slowpg_bigm

    PGroonga is fastPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Why is PGroonga fast?PGroonga

    Doesn't need "recheck"recheck

    Is "recheck" really slow?recheck

    See one more benchmark result

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Benchmark

    0

    0.5

    1

    1.5

    2

    2.5

    3

    0 100000 200000 300000 400000 500000

    Data: Japanese Wikipedia(Many records and large documents)N records: About 0.9millionsAverage text size: 6.7KiB

    Slow

    Slow

    Fast for many hits!

    Query: ""

    Elapsed time (sec)

    (Shorter is better)

    N hits

    PGroonga pg_bigm

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Why is pg_bigm fast?pg_bigm

    Query is ""

    Point: 2 characters2

    pg_bigm doesn't need "recheck" for 2 chars querypg_bigm2recheck

    It means that "recheck" is slowrecheck

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    N-gram and "recheck"N-gramrecheck

    N-gram approach needs "phrase search" when query has N or more charactersN+1

    N=2 for pg_bigm, N=3 for pg_trgmpg_bigmN=2pg_trgmN=3

    GIN needs "recheck" for "phrase search"GINrecheck

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Phrase search

    Phrase search is "token search" and "position check"

    Tokens must exist and be ordered

    OK: "car at" for "car at" query

    NG: "at car" for "car at" query

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    N-gram and phrase searchSplit text to tokens

    "cat""ca","at"

    1.

    Search all tokens

    "ca" and "at" exist: Candidate!

    2.

    Check appearance pos.

    "ca" then "at": Found!

    3.

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    N-gram and GIN: CreateN-gramGIN

    GIN

    "ca","at"

    Tokenize

    Documents

    catat car

    10

    20

    ID Text"ca""at""t "

    Token Posting list10,20

    10,20

    20" c""ar"

    2020

    "at","t "," c","ca","ar"

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    N-gram and GIN: SearchN-gramGIN

    "ca""at""t "

    Token Posting listGIN

    10,20

    10,20

    20

    cat Query

    "ca","at"Tokenize

    AND

    catat car

    10

    20

    DocumentsID Text

    10,20

    Candidates" c""ar"

    2020

    Search

    Appearance position check(Point: Out of GIN)

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    GIN and phrase searchGIN

    Phrase search needs position check

    GIN doesn't support position checkGIN

    GIN needs "recheck"Slow!GINrecheck

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Why is PGroonga fast?PGroonga

    PGroonga uses N-gram by defaultPGroongaN-gram

    But doesn't need "recheck"PGroongarecheck

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Why no "recheck"?recheck

    PGroonga usesfull

    inverted indexPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Full inverted index

    Including position

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Inverted index diff

    catat car

    1020

    Documents

    Full: Doc ID + pos

    "ca""at""t "

    ID Text

    Token Posting list

    20:2

    10:2,20:110:1,20:4

    "ca","at"1 2

    "at","t "," c","ca","ar"1 2 3 4 5

    Tokenize

    Not full: Only doc ID

    " c""ar"

    20:3

    20:5

    "ca""at""t "

    Token Posting list

    20

    10,2010,20

    " c""ar"

    20

    20

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    N-gram/PGroonga: SearchN-gramPGroonga

    "ca""at""t "

    Token Posting listPGroonga

    10:1,20:4

    10:2,20:1

    20:2

    cat Query

    TokenizeAND

    catat car

    10

    20

    DocumentsID Text 10Result

    " c""ar"

    20:320:5

    Search

    Appearance position check(Point: In PGroonga)

    "ca","at"1 2

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up

    N-gram needs phrase searchN-gram

    Full inverted index provides fast phrase search

    GIN isn't full inverted indexGIN

    PGroonga uses full inverted indexPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    FTS and English(*)

    Normally, N-gram isn't used for English FTSN-gram

    N-gram is slower than word based approach (textsearch approach)N-gramtextsearch

    Stemming/stop word can't be usedN-gram

    (*) EnglishAlphabet based languages

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroonga and EnglishPGroonga

    PGroonga uses N-gram by defaultPGroongaN-gram

    Is PGroonga slow for English?PGroonga

    No. Similar to textsearchtextsearch

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroonga: SearchPGroonga

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    PostgreSQL OR MySQL database America

    Data: English Wikipedia(Many records and large docs)N records: About 5.3millionsAverage text size: 6.4KiB

    Elapsed time (ms)

    (Shorter is better)

    Query

    PGroonga textsearch

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroonga's N-gram

    Variable size N-gramN-gram

    Continuous alphabets are 1 token(= word based approach)1=

    Hello"Hello" not "He","el",

    No alphabet is 2-gram2-gram

    "","",

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up11

    PGroonga's search is fast for all languagesPGroonga

    Including alphabet based languages and Asian languages mixed case(textsearch doesn't support mixed case)textsearch

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up22

    PGroonga makes PostgreSQLfast full text search platform

    for all languages!PGroongaPostgreSQL

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    More about PGroongaPGroonga

    Performance

    JSON supportJSON

    Replication

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Performance

    Search and update

    Index only scan

    Direct Groonga searchGroonga

    Index creation

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Search and update

    Doesn't decrease search performance while updating

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Characteristics

    Sear

    ch t

    hrou

    ghpu

    t

    Update throughput

    PGroonga

    Sear

    ch t

    hrou

    ghpu

    t

    Update throughput

    GIN

    Keepsearch performancewhile many updates

    Decreasesearch performancewhile updating

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Update and lock

    Update without read locks

    Write locks are required

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    GIN: Read/WriteGIN

    Conn1Conn2

    INSERTstart

    SELECTstart

    Blocked

    INSERTfinish

    SELECTfinish

    GIN

    Slow down!

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    PGroonga: Read/WritePGroonga

    Conn1Conn2

    INSERTstart

    SELECTstart

    INSERTfinish

    SELECTfinish

    PGroonga

    No slow down!

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Fast stably

    GIN has intermittent performance decrementsGIN

    For details:"GIN pending list"GIN pending list

    PGroonga keeps fast searchPGroonga

    PGroonga keeps index latestPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Index only scan

    GIN: Not supportedGIN

    PGroonga: SupportedPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    More faster search

    Direct Groonga search is more fasterGroonga

    Groonga: Full text search engine PGroonga usesGroongaPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Direct Groonga searchGroonga

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    PostgreSQL OR MySQL database America

    Data: English Wikipedia(Many records and large docs)N records: About 5.3millionsAverage text size: 6.4KiBGroonga is 30x faster than others

    Elapsed time (ms)

    (Shorter is better)

    Query

    PGroonga Groonga textsearch

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Index creation time

    0

    0.5

    1

    1.5

    2

    2.5

    3 Data: English WikipediaSize: About 33GiBMax text size: 1MiB

    2x fasterthan textsearch

    Elapsed time (hour)

    (Shorter is better)

    Module

    Index creationPGroonga textsearch pg_trgm

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Performance: Wrap up

    Keep fast search with update

    Support index only scan

    Direct Groonga search is more fasterGroonga

    Fast index creation

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    JSON supportJSON

    Support full text search

    Target: All texts in JSONJSON

    Not only a text in a path(GIN supports only this style)GIN

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    JSON: FTSJSON

    INSERT INTO logs (record) VALUES ('{"host": "app1"}'), ('{"message": "app is down"}');SELECT * FROM logs WHERE record @@ 'string @@ "app"'-- record-- ---------------------------- {"host": "app1"}-- {"message": "app is down"}

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    JSON: Wrap upJSON

    Support full text search against all texts in JSONJSON

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Replication

    Support with PostgreSQL 9.6!PostgreSQL 9.6

    PostgreSQL 9.6 ships "generic WAL"PostgreSQL 9.6generic WAL

    Third party index can support WAL generationWAL

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Implementation

    Master: Encode action logs as MessagePackMessagePack

    1.

    Master: Write the action logs to WALWAL

    2.

    Slaves: Read the action logs and apply them

    3.

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Overview

    Master PostgreSQL

    Index file

    PGroonga DB

    INSERT PGroonga

    Update

    Append action logsvia generic WAL API

    Action log

    Slave

    Apply pending action logson SELECT

    SELECT

    WAL

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Action log: "action"

    { "_action": ACTION_ID}# ACTION_ID: 0: INSERT# ACTION_ID: 1: CREATE_TABLE# ACTION_ID: 2: CREATE_COLUMN# ACTION_ID: 3: SET_SOURCES

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Action log: INSERTINSERT

    { "_action": 0, "_table": "TABLE_NAME", "ctid": PACKED_CTID_VALUE, "column1": COLUMN1_VALUE, ...}

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Action log: Logs

    {"_action": ACTION_ID, ...}{"_action": ACTION_ID, ...}{"_action": ACTION_ID, ...}...

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Write action logs

    Index file Page

    Header

    Actionlogs

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Apply action logs

    Index file PGroonga DB

    Applied offset(Block#+Offset)

    (2,10)

    1 2

    3 4Apply

    (2,50) AfterBefore

    1050

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Action log: Why msgpack?msgpack

    Because MessagePack supports streaming unpackMessagePack

    It's useful to stop applying action logs when WAL is applied partially on slavesWAL

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Replication: Wrap up

    Support with PostgreSQL 9.6!PostgreSQL 9.6

    Concept: Action logs on WALWAL

    It'll be an useful pattern for out of PostgreSQL storage indexPostgreSQL

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up11

    PostgreSQL doesn't support FTS for all languagesPostgreSQL

    PGroonga supports FTS for all languagesPGroonga

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up22

    PGroonga is fast stablyPGroonga

    PGroonga supports FTS for all texts in JSONPGroongaJSON

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up33

    PGroonga supports replicationPGroonga

    PostgreSQL 9.6 is requiredPostgreSQL 9.6

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    Wrap up44

    PGroonga makes PostgreSQLfast full text search platform

    for all languages!PGroongaPostgreSQL

  • PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0

    See also

    https://pgroonga.github.io/

    Tutorial: /tutorial/

    Install: /install/

    Reference: /reference/Includes replication doc and benchmark docs

    Community: /community/

    https://pgroonga.github.io/https://pgroonga.github.io/tutorial/https://pgroonga.github.io/install/https://pgroonga.github.io/reference/https://pgroonga.github.io/community/