strstr with SSE4.2 is faster than quick search
Quick Search algorithm and strstr
Cybozu Labs
2012/3/31 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 3(#x86opti)
Agenda
Quick Search
vs. strstr of gcc on Core2Duo
vs. strstr of gcc on Xeon
fast strstr using strchr of gcc
vs. my implementation on Xeon
restriction
vs. strstr of VC2011 beta on Core i7
feature of pcmpestri
range version of strstr
2012/3/31 #x86opti /20 2
Quick Search algorithm (1/2)
Simplified and improved Boyer-Moore algorithm
Initialized table for "this is" (rows: high nibble of the byte, columns: low nibble)

     0 1 2 3 4 5 6 7 8 9 A B C D E F
  0  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  1  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  2  3 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  3  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  4  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  5  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  6  8 8 8 8 8 8 8 8 6 2 8 8 8 8 8 8
  7  8 8 8 1 7 8 8 8 8 8 8 8 8 8 8 8
  8  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  9  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  A  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  B  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  C  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  D  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  E  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
  F  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8

  char  't'  'h'  'i'  's'  ' '  other
  skip  +7   +6   +2   +1   +3   +8

How to initialize the table for a given string [str, str + len)

    int tbl_[256];
    void init(const char *str, int len)
    {
        // default skip is len + 1 (8 for "this is")
        std::fill(tbl_, tbl_ + 256, len + 1);
        for (int i = 0; i < len; i++) {
            // unsigned index keeps bytes >= 0x80 (UTF-8) in range
            tbl_[static_cast<unsigned char>(str[i])] = len - i;
        }
    }
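As a worked check of the table for "this is", the sketch below rebuilds the skip values and can be compared against the char/skip row. The helper name makeSkipTable and the std::vector return type are illustrative assumptions, not part of the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): build the Quick Search skip table.
std::vector<int> makeSkipTable(const char *str, size_t len)
{
    assert(len > 0);
    // Bytes that do not occur in the key skip the whole window plus one.
    std::vector<int> tbl(256, static_cast<int>(len) + 1);
    for (size_t i = 0; i < len; i++) {
        // A later occurrence overwrites an earlier one, so the smallest
        // safe skip for each byte is the one that remains.
        tbl[static_cast<unsigned char>(str[i])] = static_cast<int>(len - i);
    }
    return tbl;
}
```

For "this is" (len = 7), 'i' and 's' each occur twice; the later occurrence wins, which is why their skips are +2 and +1 rather than +5 and +4.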
Quick Search algorithm (2/2)
Searching phase
simple and fast
see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html
    const char *find(const char *begin, const char *end)
    {
        while (begin <= end - len_) {
            if (memcmp(str_, begin, len_) == 0) return begin;
            // skip by the byte just after the window;
            // *end must be readable (e.g. the terminating '\0')
            begin += tbl_[static_cast<unsigned char>(begin[len_])];
        }
        return end;
    }
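The searching phase can be put together with the table as a self-contained sketch. The class name QuickSearch and its interface are hypothetical. Unlike the find() on the slide, this variant stops before reading the byte at end, so the range [begin, end) does not have to be followed by a readable byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical wrapper class; the names are not from the slides.
class QuickSearch {
    std::string key_;
    std::vector<size_t> tbl_;
public:
    explicit QuickSearch(const std::string& key)
        : key_(key), tbl_(256, key.size() + 1)
    {
        assert(!key.empty());
        for (size_t i = 0; i < key_.size(); i++) {
            tbl_[static_cast<unsigned char>(key_[i])] = key_.size() - i;
        }
    }
    // Returns a pointer to the first match in [begin, end), or end if none.
    const char *find(const char *begin, const char *end) const
    {
        const size_t len = key_.size();
        while (static_cast<size_t>(end - begin) >= len) {
            if (std::memcmp(key_.data(), begin, len) == 0) return begin;
            // The window touches the end: there is no next byte to look at.
            if (static_cast<size_t>(end - begin) == len) break;
            begin += tbl_[static_cast<unsigned char>(begin[len])];
        }
        return end;
    }
};
```

A call such as QuickSearch("this is").find(text, text + size) returns the match position, or text + size when the key does not occur.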
Benchmark
2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
33MB UTF-8 text
Qs (Quick Search) is faster for long substrings
Remark: the text is assumed not to contain '\0', as strstr requires
[Chart: cycles/byte to find each substring (axis 0 to 10); series: strstr, org Qs; an arrow marks the fast direction]
A little modification of Qs
avoid memcmp
Before (memcmp):

    const char *find(const char *begin, const char *end)
    {
        while (begin <= end - len_) {
            if (memcmp(str_, begin, len_) == 0) return begin;
            begin += tbl_[static_cast<unsigned char>(begin[len_])];
        }
        return end;
    }

After (inline comparison loop):

    const char *find(const char *begin, const char *end)
    {
        while (begin <= end - len_) {
            for (size_t i = 0; i < len_; i++) {
                if (str_[i] != begin[i]) goto NEXT;
            }
            return begin;
        NEXT:
            begin += tbl_[static_cast<unsigned char>(begin[len_])];
        }
        return end;
    }
Benchmark again
2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
33MB UTF-8 text
Modified Qs (Qs') is even faster
Should we use modified Qs'?
[Chart: cycles/byte to find each substring (axis 0 to 10); series: strstr, org Qs, Qs'; an arrow marks the fast direction]
strstr on gcc 4.6 with SSE4.2
Xeon X5650 2.67GHz on Linux
strstr with SSE4.2 is faster than Qs' for substrings shorter than 9 bytes
Is gcc's strstr the fastest implementation?
[Chart: cycles/byte to find each substring (axis 0 to 5); series: strstr, Qs'; an arrow marks the fast direction]
strstr implementation by strchr
First find a candidate location with strchr, then verify it with memcmp
strchr of gcc with SSE4.2 is fast
    const char *mystrstr_C(const char *str, const char *key)
    {
        size_t len = strlen(key);
        if (len == 0) return str; // avoid len - 1 underflow
        while (*str) {
            // candidate: next occurrence of the first key byte
            const char *p = strchr(str, key[0]);
            if (p == 0) return 0;
            // verify the rest of the key
            if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
            str = p + 1;
        }
        return 0;
    }
strstr vs. mystrstr_C
Xeon X5650 2.67GHz + gcc 4.6.1
mystrstr_C is 1.5 ~ 3 times faster than strstr
except for “ko-re-wa”(in UTF-8)
probably a penalty from many bad candidates
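One way to see why a UTF-8 key could produce many bad candidates: every hiragana and katakana character starts with the byte 0xE3 in UTF-8, so a strchr for the key's first byte stops at nearly every character of Japanese text. The sketch below counts the candidates that fail verification before the first real match. The function name is hypothetical, and reading "ko-re-wa" as the hiragana string これは is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical diagnostic (not from the slides): number of strchr candidates
// rejected before the first real match, or before the end of the text.
size_t countRejectedCandidates(const char *str, const char *key)
{
    const size_t len = std::strlen(key);
    assert(len > 0);
    size_t rejected = 0;
    while (const char *p = std::strchr(str, key[0])) {
        if (std::strncmp(p, key, len) == 0) break; // real match found
        rejected++;                                // first byte matched, the rest did not
        str = p + 1;
    }
    return rejected;
}
```

With a text of five other hiragana followed by これは, all five leading characters are rejected candidates, whereas an ASCII key such as "this is" only stops at the occasional 't'.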
[Chart: cycles/byte to find each substring (axis 0 to 10); series: strstr, Qs', my_strstr_C; an arrow marks the fast direction]
Real speed of SSE4.2 (pcmpistri)
my_strstr is always faster than Qs’
2 ~ 4 times faster than strstr of gcc
[Chart: cycles/byte to find each substring (axis 0 to 10); series: strstr, Qs', my_strstr_C, my_strstr; an arrow marks the fast direction]
Implementation of my_strstr (1/2)
https://github.com/herumi/opti/blob/master/str_util.hpp
written in Xbyak(for my convenience)
Main loop
    // a : rax(or eax), c : rcx(or ecx)
    // input a : ptr to text
    //       key : ptr to key
    // use save_a, save_key, c
        movdqu(xm0, ptr [key]); // xm0 = *key
    L(".lp");
        pcmpistri(xmm0, ptr [a], 12); // 12(1100b) = [equal ordered:unsigned:byte]
        jbe(".headCmp");
        add(a, 16);
        jmp(".lp");
    L(".headCmp");
        jnc(".notFound");
Implementation of my_strstr (2/2)
Compare the tail in ".headCmp"
    ...
        add(a, c); // get position
        mov(save_a, a); // save a
        mov(save_key, key); // save key
    L(".tailCmp");
        movdqu(xm1, ptr [save_key]);
        pcmpistri(xmm1, ptr [save_a], 12);
        jno(".next");
        js(".found");
        // rare case
        add(save_a, 16);
        add(save_key, 16);
        jmp(".tailCmp");
    L(".next");
        add(a, 1);
        jmp(".lp");
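To make the branch conditions easier to follow, here is a scalar model of what pcmpistri computes with imm8 = 12 (equal ordered, unsigned bytes), written from the instruction's documented behaviour. It is only a sketch for reasoning about the flags; the struct and function names are hypothetical and this is not how str_util.hpp is implemented.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical scalar model of pcmpistri(key, text, 12); not the author's code.
struct CmpResult {
    int index;  // value left in ecx: first candidate position, 16 if none
    bool cf;    // CF: a candidate was found in this chunk
    bool zf;    // ZF: the text chunk contains '\0' (end of text)
    bool sf;    // SF: the key chunk contains '\0' (end of key)
    bool of;    // OF: the candidate starts at position 0
};

CmpResult modelEqualOrdered(const char *key16, const char *text16)
{
    assert(key16 && text16);
    // Implicit lengths: number of bytes before the first '\0', capped at 16.
    int keyLen = 0, textLen = 0;
    while (keyLen < 16 && key16[keyLen] != '\0') keyLen++;
    while (textLen < 16 && text16[textLen] != '\0') textLen++;

    CmpResult r = { 16, false, textLen < 16, keyLen < 16, false };
    for (int j = 15; j >= 0; j--) {  // j: start position in the text chunk
        bool match = true;
        // Key bytes are compared only while they fit inside the chunk, so a
        // key running past byte 15 still yields a candidate to verify later.
        for (int i = 0; i < 16 - j && i < keyLen && match; i++) {
            match = (j + i < textLen) && key16[i] == text16[j + i];
        }
        if (match) { r.index = j; r.cf = true; }  // lowest j is kept last
    }
    r.of = r.cf && r.index == 0;
    return r;
}

// Hypothetical helper: load a C string into a zero-padded 16-byte chunk.
void loadChunk(char (&dst)[16], const char *src)
{
    std::strncpy(dst, src, 16);  // pads with '\0'; no terminator if src has 16+ bytes
}
```

In these terms, jbe(".headCmp") leaves the main loop when CF or ZF is set, jnc(".notFound") then handles the case where the text ended without a candidate, and in the tail comparison jno(".next") rejects a candidate that does not match at position 0 while js(".found") accepts it once the key has ended.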
Pros and Cons of my_strstr
Pros
very fast
Would combining this implementation with Qs be the fastest?
No, the overhead is usually larger (variable address offset)
Cons
may read up to 16 bytes beyond the end of the text
almost never a problem, except at a page boundary
workaround: allocate memory with a margin
[Figure: the text ends near the end of a readable 4KiB page (offsets FF7 to FFF); a 16-byte pcmpistri access starting there runs into the following non-readable page (offsets 000 to 003) and causes an access violation]
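The margin workaround and the page-boundary condition can be sketched as follows. Both helpers are hypothetical (they are not part of str_util.hpp), and the check assumes 4KiB pages as in the figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: copy the text and append `margin` zero bytes, so a
// 16-byte load that starts inside the text never leaves the allocation.
std::vector<char> copyWithMargin(const std::string& text, size_t margin = 16)
{
    assert(margin >= 16);
    std::vector<char> buf(text.size() + margin, '\0');
    std::copy(text.begin(), text.end(), buf.begin());
    return buf;
}

// Hypothetical check, assuming 4KiB pages: does a 16-byte load starting at
// addr touch the next page?
bool loadCrossesPage(std::uintptr_t addr)
{
    return (addr & 0xFFF) > 0x1000 - 16;
}
```

A load starting at page offset FF0 still fits in the page; any offset from FF1 to FFF reads into the next one, which is only a problem when that page is not readable.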
strstr of Visual Studio 11
almost same speed as my_strstr
and, of course, safe to use
i7-2620 3.4GHz + Windows 7 + VS 11beta
[Chart: cycles/byte to find each substring (axis 0 to 8); series: strstr, Qs', my_strstr; an arrow marks the fast direction]
All benchmarks on i7-2600
find "ko-re-wa" in 33MiB text
the results depend strongly on the text and the key
[Chart: rate relative to the timing of Qs(gcc) (axis 0 to 14); series: strstr(before SSE4.2), Qs(gcc), Qs'(gcc), strstr(gcc;SSE4.2), strstr(VC;SSE4.2), my_strstr(SSE4.2); an arrow marks the fast direction]
Range version of strstr
strstr cannot be used for strings containing '\0'
use std::string::find() instead
but it is not optimized for SSE4.2
a naive but fast implementation in C
str_util.hpp provides findStr with SSE4.2
4 ~ 5 times faster than findStr_C on i7-2600 + VC11
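    const char *findStr_C(const char *begin, const char *end, const char *key, size_t keySize)
    {
        if (keySize == 0) return begin;
        while (begin + keySize <= end) {
            // look for key[0] only where a full match still fits in [begin, end)
            const char *p = static_cast<const char*>(memchr(begin, key[0], end - begin - keySize + 1));
            if (p == 0) break;
            if (memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
            begin = p + 1;
        }
        return end;
    }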
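To illustrate why a range version is needed, the sketch below searches a buffer that contains an embedded '\0'. It uses std::search from the standard library as a stand-in for a range search; it is neither findStr nor findStr_C, and the helper name findRange is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper built on std::search; returns end when key is not found.
const char *findRange(const char *begin, const char *end, const std::string& key)
{
    assert(begin <= end);
    return std::search(begin, end, key.begin(), key.end());
}
```

For a buffer holding "abc", a '\0', then "def", strstr stops at the embedded terminator and reports no match for "def", while the range search finds it at offset 4.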
Feature of pcmpestri
very complex mnemonics
    L(".lp");
        pcmpestri(xmm0, ptr [p], 12);
        lea(p, ptr [p + 16]);
        lea(d, ptr [d - 16]);
        ja(".lp");
        jnc(".notFound");
        // compare leading str...

pcmpestri xmm0, ptr [p], 12
  inputs
    xmm0 : head of key
    rax  : keySize
    p    : pointer to text
    rdx  : text size
  outputs
    rcx  : position of key if found
    CF   : set if found
    ZF   : end of text
    SF   : end of key
    OF   : all match
lea does not change the carry flag, so ja and jnc still test the result of pcmpestri
Difference between Xeon and i7
main loop of my_strstr
    L(".lp");
        pcmpistri(xmm0, ptr [a], 12);
        if (isSandyBridge) {
            lea(a, ptr [a + 16]);
            ja(".lp");
        } else {
            jbe(".headCmp");
            add(a, 16);
            jmp(".lp");
    L(".headCmp");
        }
        jnc(".notFound");
        // get position
        if (isSandyBridge) {
            lea(a, ptr [a + c - 16]);
        } else {
            add(a, c);
        }
a little faster on i7
1.1 times faster on Xeon
Other features of str_util.hpp
strchr_any(text, key)[or findChar_any]
returns a pointer to the first occurrence of any character of key in the text
same speed as strchr by using SSE4.2
max length of key is 16
strchr_range(txt, key)[or findChar_range]
returns a pointer to the first occurrence of a character in range [key[0], key[1]], [key[2], key[3]], ...
also same speed as strchr and max len(key) = 16
    // search character position of '?', '#', '$', '!', '/', ':'
    strchr_any(text,"?#$!/:");
    // search character position of [0-9], [a-f], [A-F]
    strchr_range(text,"09afAF");
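The described behaviour of the two functions can be pinned down with scalar reference versions. These are sketches for checking results, not the SSE4.2 code in str_util.hpp; the Ref names are hypothetical, and returning a null pointer when nothing matches is an assumption modelled on strchr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical reference for strchr_any: first byte of text that occurs in key.
const char *strchrAnyRef(const char *text, const char *key)
{
    assert(std::strlen(key) <= 16);  // max key length stated on the slide
    for (; *text; text++) {
        if (std::strchr(key, *text)) return text;
    }
    return 0;  // assumption: null when no byte matches
}

// Hypothetical reference for strchr_range: first byte of text that lies in
// one of the inclusive ranges [key[0], key[1]], [key[2], key[3]], ...
const char *strchrRangeRef(const char *text, const char *key)
{
    const size_t n = std::strlen(key);
    assert(n % 2 == 0 && n <= 16);
    for (; *text; text++) {
        const unsigned char c = static_cast<unsigned char>(*text);
        for (size_t i = 0; i < n; i += 2) {
            const unsigned char lo = static_cast<unsigned char>(key[i]);
            const unsigned char hi = static_cast<unsigned char>(key[i + 1]);
            if (lo <= c && c <= hi) return text;
        }
    }
    return 0;  // assumption: null when no byte matches
}
```

For example, with the key "?#$!/:" the first hit in "http://a/b?q" is the ':' at offset 4, and with the key "09afAF" the first hex digit in "xyz-7f" is the '7' at offset 4.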