Quick Search algorithm and strstr, Cybozu Labs, 2012/3/31, MITSUNARI Shigeo (@herumi), x86/x64 optimization seminar 3 (#x86opti)

Quick Search algorithm and strstr


DESCRIPTION

strstr with SSE4.2 is faster than quick search


Page 1: Quick Search algorithm and strstr

Quick Search algorithm and strstr

Cybozu Labs

2012/3/31 MITSUNARI Shigeo(@herumi)

x86/x64 optimization seminar 3(#x86opti)

Page 2: Quick Search algorithm and strstr

Agenda

Quick Search

vs. strstr of gcc on Core2Duo

vs. strstr of gcc on Xeon

fast strstr using strchr of gcc

vs. my implementation on Xeon

restriction

vs. strstr of VC2011 beta on Core i7

feature of pcmpestri

range version of strstr


Page 3: Quick Search algorithm and strstr

Quick Search algorithm (1/2)

Simplified and improved Boyer-Moore algorithm

initialized table for "this is"

How to initialize table for given string [str, str + len)


(row = high nibble of the character code, column = low nibble)

     0 1 2 3 4 5 6 7 8 9 A B C D E F
0    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
1    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
2    3 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
3    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
4    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
5    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
6    8 8 8 8 8 8 8 8 6 2 8 8 8 8 8 8
7    8 8 8 1 7 8 8 8 8 8 8 8 8 8 8 8
8    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
9    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
A    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
B    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
C    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
D    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
E    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
F    8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8

char  't'  'h'  'i'  's'  ' '  other
skip  +7   +6   +2   +1   +3   +8

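int tbl_[256];
// A character that is not in the key shifts by len + 1 (the 8s in the table for "this is").
void init(const char *str, int len)
{
    std::fill(tbl_, tbl_ + 256, len + 1);
    for (int i = 0; i < len; i++) {
        // cast so that bytes >= 0x80 do not give a negative index
        tbl_[static_cast<unsigned char>(str[i])] = len - i;
    }
}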
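As a quick check of the table for "this is", the following sketch rebuilds the shift table with a local array instead of the slide's tbl_ member. The names Table and makeTable are made up for this illustration, and the default shift of len + 1 is taken from the 8s shown in the table.

```cpp
#include <array>
#include <cstddef>

using Table = std::array<int, 256>;

// Build the Quick Search shift table for key[0..len).
// Hypothetical helper for illustration only.
Table makeTable(const char *key, int len)
{
    Table tbl;
    tbl.fill(len + 1); // characters that do not occur in the key
    for (int i = 0; i < len; i++) {
        // later occurrences overwrite earlier ones, so the rightmost wins
        tbl[static_cast<unsigned char>(key[i])] = len - i;
    }
    return tbl;
}
```

For example, makeTable("this is", 7)['i'] is 2 because the rightmost 'i' sits two bytes before the end of the key.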

Page 4: Quick Search algorithm and strstr

Quick Search algorithm (2/2)

Searching phase

simple and fast

see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html


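// Returns a pointer to the first match in [begin, end), or end if there is none.
// Reads begin[len_], which is *end at the last position, so that byte must be readable.
const char *find(const char *begin, const char *end)
{
    while (begin <= end - len_) {
        if (memcmp(str_, begin, len_) == 0) return begin;
        // cast so that bytes >= 0x80 (e.g. UTF-8) do not give a negative index
        begin += tbl_[static_cast<unsigned char>(begin[len_])];
    }
    return end;
}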
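The searching phase can also be tried on its own. The sketch below is a self-contained variant that works on std::string and returns an index; quickSearch is a hypothetical name, and unlike the slide's find() it stops before reading the byte just past the end of the text.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of key in text,
// or std::string::npos. Hypothetical helper for illustration only.
std::size_t quickSearch(const std::string &text, const std::string &key)
{
    const std::size_t n = text.size();
    const std::size_t m = key.size();
    if (m == 0) return 0;
    if (m > n) return std::string::npos;

    // shift table: m + 1 for absent characters, m - i for key[i]
    std::array<std::size_t, 256> shift;
    shift.fill(m + 1);
    for (std::size_t i = 0; i < m; i++) {
        shift[static_cast<unsigned char>(key[i])] = m - i;
    }

    std::size_t pos = 0;
    while (pos + m <= n) {
        if (text.compare(pos, m, key) == 0) return pos;
        if (pos + m == n) break; // there is no byte after the window
        // the byte just after the window decides how far to shift
        pos += shift[static_cast<unsigned char>(text[pos + m])];
    }
    return std::string::npos;
}
```

Searching "that this is it" for "this is" takes three window positions (0, 2, 5) instead of six.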

Page 5: Quick Search algorithm and strstr

Benchmark

2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8

33MB UTF-8 text

Qs (Quick Search) is faster for long substrings

Remark: the text is assumed not to contain '\0', as strstr requires

[Figure: cycles/byte to find substring (axis 0 to 10); series: strstr, org Qs; "fast" direction marker]

Page 6: Quick Search algorithm and strstr

A little modification of Qs

avoid memcmp

// before: compare with memcmp
const char *find(const char *begin, const char *end)
{
    while (begin <= end - len_) {
        if (memcmp(str_, begin, len_) == 0) return begin;
        begin += tbl_[static_cast<unsigned char>(begin[len_])];
    }
    return end;
}

// after (Qs'): compare byte by byte
const char *find(const char *begin, const char *end)
{
    while (begin <= end - len_) {
        for (size_t i = 0; i < len_; i++) {
            if (str_[i] != begin[i]) goto NEXT;
        }
        return begin;
    NEXT:
        begin += tbl_[static_cast<unsigned char>(begin[len_])];
    }
    return end;
}

Page 7: Quick Search algorithm and strstr

Benchmark again

2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8

33MB UTF-8 text

Modified Qs (Qs') is even faster

Should we use the modified Qs'?

[Figure: cycles/byte to find substring (axis 0 to 10); series: strstr, org Qs, Qs'; "fast" direction marker]

Page 8: Quick Search algorithm and strstr

strstr on gcc 4.6 with SSE4.2

Xeon X5650 2.67GHz on Linux

strstr with SSE4.2 is faster than Qs' for substrings shorter than 9 bytes

Is gcc's strstr the fastest implementation?

[Figure: cycles/byte to find substring (axis 0 to 5); series: strstr, Qs'; "fast" direction marker]

Page 9: Quick Search algorithm and strstr

strstr implementation by strchr

First find a candidate location with strchr, then verify that the rest of the key matches

strchr of gcc with SSE4.2 is fast

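const char *mystrstr_C(const char *str, const char *key)
{
    size_t len = strlen(key);
    if (len == 0) return str; // same as strstr; also keeps len - 1 from wrapping around
    while (*str) {
        const char *p = strchr(str, key[0]); // candidate: the first byte matches
        if (p == 0) return 0;
        // verify the rest; note that memcmp may read past the terminating '\0' of str
        if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
        str = p + 1;
    }
    return 0;
}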

Page 10: Quick Search algorithm and strstr

strstr vs. mystrstr_C

Xeon X5650 2.67GHz + gcc 4.6.1

mystrstr_C is 1.5 ~ 3 times faster than strstr

except for "ko-re-wa" (in UTF-8)

possibly a penalty for the many bad candidates that strchr returns
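One way to see where bad candidates come from: every code point in U+3000 to U+3FFF, which covers hiragana and katakana, starts with the byte 0xE3 in UTF-8, so a strchr on the first byte of a Japanese key stops at nearly every character of Japanese text. The sketch below counts first-byte candidates against full matches. It assumes "ko-re-wa" is the hiragana これは, and Counts and countCandidates are hypothetical names.

```cpp
#include <cstddef>
#include <cstring>

struct Counts {
    std::size_t candidates; // positions where the first byte of key matches
    std::size_t matches;    // positions where the whole key matches
};

// Walk the text the way a strchr-based search does and count both events.
Counts countCandidates(const char *text, const char *key)
{
    Counts c = { 0, 0 };
    const std::size_t len = std::strlen(key);
    if (len == 0) return c;
    for (const char *p = std::strchr(text, key[0]); p != nullptr;
         p = std::strchr(p + 1, key[0])) {
        c.candidates++;
        if (std::strncmp(p, key, len) == 0) c.matches++;
    }
    return c;
}
```

For a 7-character sentence such as これはペンです, all 7 characters are candidates for the key これは, but only one of them is a match.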


[Figure: cycles/byte to find substring (axis 0 to 10); series: strstr, Qs', my_strstr_C; "fast" direction marker]

Page 11: Quick Search algorithm and strstr

real speed of SSE4.2 (pcmpistri)

my_strstr is always faster than Qs'

2 ~ 4 times faster than strstr of gcc

[Figure: cycles/byte to find substring (axis 0 to 10); series: strstr, Qs', my_strstr_C, my_strstr; "fast" direction marker]

Page 12: Quick Search algorithm and strstr

Implementation of my_strstr (1/2)

https://github.com/herumi/opti/blob/master/str_util.hpp

written in Xbyak(for my convenience)

Main loop


// a : rax(or eax), c : rcx(or ecx)
// input a : ptr to text
//       key : ptr to key
// use save_a, save_key, c
movdqu(xm0, ptr [key]); // xm0 = *key
L(".lp");
pcmpistri(xmm0, ptr [a], 12); // 12(1100b) = [equal ordered:unsigned:byte]
jbe(".headCmp");
add(a, 16);
jmp(".lp");
L(".headCmp");
jnc(".notFound");

Page 13: Quick Search algorithm and strstr

Implementation of my_strstr (2/2)

Compare the tail in ".headCmp"

...
add(a, c); // get position
mov(save_a, a); // save a
mov(save_key, key); // save key
L(".tailCmp");
movdqu(xm1, ptr [save_key]);
pcmpistri(xmm1, ptr [save_a], 12);
jno(".next");
js(".found");
// rare case
add(save_a, 16);
add(save_key, 16);
jmp(".tailCmp");
L(".next");
add(a, 1);
jmp(".lp");
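The control flow of the two code fragments can be followed more easily in a scalar model. The sketch below is not the Xbyak implementation: headCandidate imitates, as far as I understand the instruction, what one "equal ordered" pcmpistri step reports for a 16-byte block (the first offset where the head of the key matches, with a match running off the end of the block accepted as a partial match), and the tail comparison is replaced by a plain strncmp. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Scalar stand-in for one 16-byte "equal ordered" step.
// Returns the first offset i in [0, 16) where key matches text + i
// (partial matches at the end of the block count), or 16 if none.
// *ended is set when the terminating '\0' lies inside the block.
int headCandidate(const char *text, const char *key, bool *ended)
{
    std::size_t textLen = 0;
    while (textLen < 16 && text[textLen] != '\0') textLen++;
    *ended = textLen < 16;
    std::size_t keyLen = 0;
    while (keyLen < 16 && key[keyLen] != '\0') keyLen++;
    for (std::size_t i = 0; i < 16; i++) {
        bool ok = true;
        for (std::size_t j = 0; j < keyLen && i + j < 16; j++) {
            if (i + j >= textLen || text[i + j] != key[j]) { ok = false; break; }
        }
        if (ok) return static_cast<int>(i);
    }
    return 16;
}

// Same loop structure as the slides: skip 16 bytes while there is no
// candidate, otherwise verify the candidate and retry one byte later.
const char *modelStrstr(const char *text, const char *key)
{
    if (*key == '\0') return text;
    const std::size_t keyLen = std::strlen(key);
    for (;;) {
        bool ended = false;
        const int c = headCandidate(text, key, &ended);
        if (c == 16) {
            if (ended) return nullptr; // ".notFound"
            text += 16;                // "add(a, 16)"
            continue;
        }
        const char *cand = text + c;   // "add(a, c)"
        if (std::strncmp(cand, key, keyLen) == 0) return cand;
        text = cand + 1;               // ".next"
    }
}
```

Unlike the real code, this model never reads past the terminating '\0', which is why it is safe but says nothing about speed.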

Page 14: Quick Search algorithm and strstr

Pros and Cons of my_strstr

Pros

very fast

Is combining this implementation with Qs the fastest?

No, the overhead is mostly larger (variable address offset)

Cons

may access up to 16 bytes beyond the end of the text

almost no problem except at a page boundary

workaround: allocate memory with a margin


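[Figure: addresses around a 4KiB page boundary (... FF7 to FFF, then 000 to 003 ...); the text ends near the end of a readable page, the next page is not readable, and the pcmpistri access that extends past the end of the text causes a violation]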
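Whether a 16-byte load is risky can be decided from the address alone. The following sketch checks if a read of n bytes at p touches a second page; crossesPage is a hypothetical helper, and the 4096-byte page size is an assumption taken from the 4KiB page in the figure.

```cpp
#include <cstddef>
#include <cstdint>

// True if reading n bytes starting at p touches more than one page.
// pageSize must be a power of two (4096 is assumed here).
bool crossesPage(const void *p, std::size_t n = 16, std::size_t pageSize = 4096)
{
    if (n == 0) return false;
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t offset = addr & (pageSize - 1); // offset inside the page
    return offset + n > pageSize;
}
```

A caller could use such a check to fall back to byte-wise code for the last block, as an alternative to allocating the text with a 16-byte margin.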

Page 15: Quick Search algorithm and strstr

strstr of Visual Studio 11

almost the same speed as my_strstr

of course, safe to use

i7-2600 3.4GHz + Windows 7 + VS 11 beta

[Figure: cycles/byte to find substring (axis 0 to 8); series: strstr, Qs', my_strstr; "fast" direction marker]

Page 16: Quick Search algorithm and strstr

All benchmarks on i7-2600

find "ko-re-wa" in 33MiB text

the results strongly depend on the text and the key

[Figure: time relative to Qs(gcc) (axis 0 to 14); series: strstr(before SSE4.2), Qs(gcc), Qs'(gcc), strstr(gcc;SSE4.2), strstr(VC;SSE4.2), my_strstr(SSE4.2); "fast" direction marker]

Page 17: Quick Search algorithm and strstr

range version of strstr

strstr cannot be used for strings that contain '\0'

use std::string::find()

but it is not optimized for SSE4.2

findStr_C: a naive but fast implementation in C

str_util.hpp provides findStr with SSE4.2

4 ~ 5 times faster than findStr_C on i7-2600 + VC11

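const char *findStr_C(const char *begin, const char *end, const char *key, size_t keySize)
{
    if (keySize == 0) return begin;
    while (begin + keySize <= end) {
        // search only the positions where the whole key still fits before end
        const size_t n = static_cast<size_t>(end - begin) - keySize + 1;
        const char *p = static_cast<const char*>(memchr(begin, key[0], n));
        if (p == 0) break;
        if (memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
        begin = p + 1;
    }
    return end;
}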
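The difference between the NUL-terminated and the range interface shows up as soon as the text contains '\0'. A minimal sketch with hypothetical helper names: strstr stops at the embedded '\0', while std::string::find searches the whole buffer.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// A 7-byte buffer "abc\0def" with an embedded NUL.
std::string makeTextWithNul()
{
    return std::string("abc\0def", 7);
}

// strstr only sees the bytes before the first '\0'.
bool foundByStrstr(const std::string &text, const char *key)
{
    return std::strstr(text.c_str(), key) != nullptr;
}

// std::string::find searches all text.size() bytes.
std::size_t foundByFind(const std::string &text, const std::string &key)
{
    return text.find(key);
}
```

With this buffer, "def" is invisible to strstr but is found at index 4 by std::string::find.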

Page 18: Quick Search algorithm and strstr

feature of pcmpestri

very complex mnemonics


L(".lp");
pcmpestri(xmm0, ptr [p], 12);
lea(p, ptr [p + 16]);
lea(d, ptr [d - 16]);
ja(".lp");
jnc(".notFound");
// compare leading str...

pcmpestri xmm0, ptr [p], 12

input
xmm0 : head of key
rax : keySize
p : pointer to text
rdx : text size

output
rcx : pos of key if found
CF : if found
ZF : end of text
SF : end of key
OF : all match

lea does not change the carry flag, so ja / jnc still see the result of pcmpestri

Page 19: Quick Search algorithm and strstr

Difference between Xeon and i7

main loop of my_strstr


L(".lp");
pcmpistri(xmm0, ptr [a], 12);
if (isSandyBridge) {
    lea(a, ptr [a + 16]);
    ja(".lp");
} else {
    jbe(".headCmp");
    add(a, 16);
    jmp(".lp");
    L(".headCmp");
}
jnc(".notFound");
// get position
if (isSandyBridge) {
    lea(a, ptr [a + c - 16]);
} else {
    add(a, c);
}

a little faster on i7

1.1 times faster on Xeon

Page 20: Quick Search algorithm and strstr

other features of str_util.hpp

strchr_any(text, key) [or findChar_any]

returns a pointer to the first occurrence in the text of any character of key

same speed as strchr by using SSE4.2

max length of key is 16

strchr_range(text, key) [or findChar_range]

returns a pointer to the first occurrence of a character in the ranges [key[0], key[1]], [key[2], key[3]], ...

also the same speed as strchr, and max len(key) = 16


// search character position of '?', '#', '$', '!', '/', ':'
strchr_any(text,"?#$!/:");

// search character position of [0-9], [a-f], [A-F]
strchr_range(text,"09afAF");
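The semantics of the two functions can be written down as scalar reference code. The sketch below is not how str_util.hpp implements them (that uses SSE4.2); findCharAnyRef and findCharRangeRef are hypothetical names, and returning a null pointer when nothing is found is an assumption, since the slide does not state the not-found result.

```cpp
#include <cstddef>
#include <cstring>

// First occurrence in text of any character of key; this is what
// the standard strpbrk does.
const char *findCharAnyRef(const char *text, const char *key)
{
    return std::strpbrk(text, key);
}

// First occurrence in text of a character c with key[2k] <= c <= key[2k+1]
// for some pair k. A trailing unpaired key character is ignored.
const char *findCharRangeRef(const char *text, const char *key)
{
    const std::size_t pairs = std::strlen(key) / 2;
    for (const char *p = text; *p != '\0'; p++) {
        const unsigned char c = static_cast<unsigned char>(*p);
        for (std::size_t i = 0; i < pairs; i++) {
            const unsigned char lo = static_cast<unsigned char>(key[i * 2]);
            const unsigned char hi = static_cast<unsigned char>(key[i * 2 + 1]);
            if (lo <= c && c <= hi) return p;
        }
    }
    return nullptr; // assumption: not found is reported as a null pointer
}
```

Such reference versions are mainly useful for checking an optimized implementation against many random inputs.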