
Count-Min Tree Sketch: Approximate counting for NLP tasks


Page 1

Count-Min Tree Sketch: Approximate counting for NLP
Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand, Abdul Mouhamadsultane
Presenter: Guillaume Pitel, June 9, 2016
www.exensa.com


Page 2

A bit of context: Why do we need to count?

Data analysis platform: eXenGine. It processes different kinds of data, mostly text.

We need to create relevant cross-features; to do that, we need to count the occurrences of all possible cross-features. For text data, a particular kind of cross-feature is known as the n-gram.

There are many different measures for deciding whether an n-gram is interesting. All of them require counting the occurrences of the cross-feature and of the features themselves (i.e. counting bigrams and the words inside bigrams).

Counting exactly is easy and distributable, but very slow because of memory usage. Keeping the whole data structure containing the counts in memory is impossible, so one has to resort to huge map/reduce jobs with joins.

Page 3

A bit of context: What kind of data are we talking about?

Google N-grams:

    tokens                  1,024 billion
    sentences                  95 billion
    1-grams (count > 200)      14 million
    2-grams (count > 40)      314 million
    3-grams                   977 million
    4-grams                   1.3 billion
    5-grams                   1.2 billion

Page 4

A bit of context: What kind of data are we talking about?

(Figure: Zipfian distribution of n-gram frequencies [Le Quan et al. 2003].)

Page 5

A bit of context: What kind of measures are we talking about?

PMI (pointwise mutual information), TF-IDF, and LLR (log-likelihood ratio).
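
For instance, PMI relies directly on the counts the sketch must store (the standard definition, with c(.) the occurrence counts and N the total number of tokens):

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ) = log( N * c(x, y) / (c(x) * c(y)) )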

Page 6

A bit of context: Summary / Goals

• Many counts: we need to store a large number of counts.
• Logarithms in measures: we care about the order of magnitude.
• Fast and memory controlled: we don't want distributed memory for the counts.
• Zipfian counts: many very small counts that will be filtered out later.

Page 7

A bit of context: Summary / Goals

All of the above points to the same conclusion: we can use probabilistic structures.

Page 8

Count-Min Sketch: a probabilistic data structure to store counts [Cormode & Muthukrishnan 2005]
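
As a reference point, a minimal Count-Min Sketch in Python (an illustrative sketch only; the width, depth and hashing scheme are arbitrary choices, not those used in eXenGine):

    import hashlib

    class CountMinSketch:
        def __init__(self, width=1 << 20, depth=4):
            self.width, self.depth = width, depth
            self.tables = [[0] * width for _ in range(depth)]

        def _cells(self, item):
            # One salted hash per row, reduced modulo the row width.
            for row in range(self.depth):
                digest = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
                yield row, int.from_bytes(digest[:8], "little") % self.width

        def update(self, item, count=1):
            for row, idx in self._cells(item):
                self.tables[row][idx] += count

        def query(self, item):
            # The row-wise minimum upper-bounds the true count.
            return min(self.tables[row][idx] for row, idx in self._cells(item))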

Page 9

Count-Min Sketch: a probabilistic data structure to store counts

Conservative update: improve the CMS by updating only the cells holding the minimum value.
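
In code, conservative update only raises the cells that are below the new estimate (a sketch reusing the CountMinSketch above):

    class ConservativeCMS(CountMinSketch):
        def update(self, item, count=1):
            cells = list(self._cells(item))
            estimate = min(self.tables[row][idx] for row, idx in cells) + count
            # Cells already above the new estimate were inflated by
            # collisions; leaving them untouched reduces overestimation.
            for row, idx in cells:
                if self.tables[row][idx] < estimate:
                    self.tables[row][idx] = estimate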

Page 10

Count-Min Log Sketch: a probabilistic data structure to store logarithmic counts

[Pitel & Fouquier, 2015]: the same idea as [Talbot, 2009], applied inside a Count-Min Sketch.

Instead of regular 32-bit counters, we use 8- or 16-bit "Morris" counters that count logarithmically.

Since the counts end up inside logarithms anyway, the error on PMI/TF-IDF/… is almost the same, but we can afford more counters.

However, a count of 1 still uses the same amount of memory as a count of 10,000. Also, at some point the error stops improving with space (there is an inherent residual error).
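
The flavor of a "Morris" counter, for readers who have not met one (a hedged sketch; the base and counter width used in the actual CMLS may differ):

    import random

    class MorrisCounter:
        def __init__(self, base=1.08):   # base 1.08 is an illustrative choice
            self.base, self.c = base, 0

        def increment(self):
            # Increment the stored exponent with probability base**-c,
            # so c only grows logarithmically with the true count.
            if random.random() < self.base ** -self.c:
                self.c += 1

        def estimate(self):
            # Standard estimate for this probabilistic update rule.
            return (self.base ** self.c - 1) / (self.base - 1)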

Page 11

Count-Min Tree Sketch: a Count-Min Sketch with shared counters

Idea: use a hierarchical storage where the most significant bits are shared between counters.

Somewhat similar to TOMB counters [Van Durme, 2009], except that overflow is managed very differently.

Page 12

Tree Shared Counters: sharing most significant bits

Or: how can we store counts with an average approaching 4 bits per counter?

An 8-counter structure:
• A tree is made of three kinds of storage: counting bits, barrier bits, and a spire (not required, except for performance).
• Several layers alternate counting bits and barrier bits.
• Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter.

(Figure: the 8-counter tree structure, showing the base layer, counting bits, barrier bits, and the spire.)

Page 13

Tree Shared Counters: sharing most significant bits

Or: how can we store counts with an average approaching 4 bits per counter?

• 8 counters in 30 bits, plus a spire.
• Without a spire, n bits can only count up to a limited maximum value.
• Many small shared counters with spires are more efficient than one large shared counter.

(Figure: the same 8-counter tree structure as on page 12.)

Page 14

Tree Shared Counters: reading values

• A counter stops at the first ZERO barrier bit.
• When two barrier paths meet, there is a conflict.
• Barrier length (b) is evaluated in unary.
• Counting bits (c) are evaluated in a more classical way.

(Figure: reading two counters from the tree, e.g. b=2/c=110 and b=4/c=01011001; a conflict occurs between counters 4 and 7.)
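
Schematically, reading a counter boils down to a unary scan of the barrier bits followed by a binary read of the counting bits (a simplified sketch; the real layout interleaves the counters within the tree):

    def read_counter(barrier_bits, counting_bits):
        # Barrier length b: number of consecutive 1s before the first 0.
        b = 0
        while b < len(barrier_bits) and barrier_bits[b] == 1:
            b += 1
        # Counting bits c: read as an ordinary binary number.
        c = 0
        for bit in counting_bits:
            c = (c << 1) | bit
        return b, c

    # First example from the figure: b=2 / c=110
    print(read_counter([1, 1, 0], [1, 1, 0]))   # -> (2, 6)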

Page 15

Tree Shared Counters: incrementing (counter 5)

(Figure: incrementing counter 5, steps 0 to 2.)

Page 16

Tree Shared Counters: incrementing (counter 5), continued

(Figure: incrementing counter 5, steps 3 to 5.)

Page 17

Tree Shared Counters: incrementing (counter 5), continued

(Figure: incrementing counter 5, final state. Annotation: "A bit at that level is worth …", higher layers carrying exponentially larger weights.)
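
A toy model of the growth mechanism only (an assumption-laden sketch, not one of the actual increment procedures): when the count no longer fits in the bits reachable with the current barrier length, the barrier grows and the next layer's bits become usable.

    def toy_increment(b, c, layer_widths):
        # b: barrier length, c: stored count,
        # layer_widths[i]: counting bits this counter can use at layer i.
        c += 1
        # When c no longer fits in the bits reachable with barrier
        # length b, extend the barrier so the next layer opens up.
        while c >= (1 << sum(layer_widths[:b + 1])):
            b += 1
        return b, c

    b, c = 0, 0
    for _ in range(20):
        b, c = toy_increment(b, c, [1, 1, 1, 1, 1])
    print(b, c)   # -> 4 20: the barrier has grown four times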

Page 18

Count-Min Tree Sketches: experiments and results

• 140M tokens from English Wikipedia*
• 14.7M words (unigrams + bigrams)
• Reference counts stored in an UnorderedMap: 815 MiB

Perfect storage size: suppose we had a perfect hash function and stored the counts in 32-bit counters. For 14.7M words, this amounts to 59 MiB.

Performance: our implementation of a CMTS using <[(128,128),(64,64),…],32> counters is on par with a native UnorderedMap.

We use 3-layer sketches (a good performance/precision tradeoff).

* We preferred to test our counters with a large number of parameter settings rather than with a large corpus, so we limited ourselves to 5% of Wikipedia.

Page 19

Count-Min Tree Sketches: results

(Figure: Average Relative Error.)

Page 20

Count-Min Tree Sketches: results

(Figure: RMSE.)

Page 21

Count-Min Tree Sketches: results

(Figure: RMSE on PMI.)

Page 22

Count-Min Tree Sketch: are CMTS really useful in real life?

1. CMTS are better on the whole vocabulary, but what happens if we skip the least frequent words/bigrams?
2. CMTS are better on average, but what happens quantile by quantile?

Page 23

Count-Min Tree Sketches: results

(Figure: PMI error per quantile; sketches at 50% of perfect size, evaluation limited to f > 10^-7.)

Page 24

Count-Min Tree Sketches: results

(Figure: relative error per log2-quantile; sketches at 50% of perfect size, evaluation limited to f > 10^-7.)

Page 25

Conclusion: where are we?

CMTS significantly outperforms the other methods for storing and updating Zipfian counts, and does so very efficiently.

Because most of the time spent in sketch accesses goes to memory access, its speed is on par with the other methods.

• Main drawback: at very high (and impractical anyway) pressures, below 10% of the perfect storage size, the error skyrockets.
• Other drawback: the implementation is not straightforward. We have devised at least 4 different ways of incrementing the counters.

Merging (and thus distributing) is easy once you can read and set a counter.
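
A minimal cell-wise merge sketch, assuming hypothetical read_counter/set_counter accessors (not the eXenGine API) that decode and re-encode one tree-shared counter:

    def merge_into(dst, src):
        # Cell-wise merge of two same-shaped sketches: the merged
        # estimate of every counter is the sum of the two estimates.
        for row in range(dst.depth):
            for idx in range(dst.width):
                total = dst.read_counter(row, idx) + src.read_counter(row, idx)
                dst.set_counter(row, idx, total)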

Page 26

Conclusion: where are we going?

Dynamic sizing: we are working on a CMTS version that can grow automatically (more layers added below).

Pressure control: when we detect that the pressure becomes too high, we can divide and subsample to stop collisions from cascading.

An open-source Python package is on its way.