Noise Tolerant Learning
Presented by Aviad Maizels
Based on:
- "Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model" – Avrim Blum, Adam Kalai and Hal Wasserman
- "A Generalized Birthday Problem" – David Wagner
- "Hard-Core Predicates for Any One-Way Function" – O. Goldreich and L. A. Levin
- "Simulated Annealing and Boltzmann Machines" – Emile Aarts and Jan Korst
void Agenda() {
  do {
    A few sentences about codes
    The opposite problem
    Learning with noise
    The k-sum problem
    Can we do it faster?
    Annealing
  } while (!understandable);
}
void fast_introduction_to_LECC() {
The communication channel may disrupt the original data:
[Figure: binary symmetric channel – each bit arrives intact with probability 1−p and is flipped with probability p]
Proposed solution: encode messages to
give some protection against errors.
void fast_introduction_to_LECC()(Continue – terminology)
Linear codes:
- Fixed-size block code
- Additive closure (the sum of two codewords is again a codeword)
- A code is tagged using two parameters (n,k): k – data size, n – encoded word size
[Figure: Source → Encoder → Channel (+ noise): msg = u1u2…uk becomes codeword = x1x2…xn]
void fast_introduction_to_LECC()(Continue – terminology)
Systematic code – original data appears directly inside the codeword.
[Figure: codeword layout – k data bits followed by n−k redundancy bits]
Generating matrix (G) – a matrix s.t. multiplying a message by it outputs the encoded word.
- Number of rows == space dimension (k)
- Every codeword can be represented as a linear combination of G's rows (see the sketch below).
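For instance, here is a minimal encoding sketch in Python. The (7,4) Hamming-style G below is a standard textbook example, not taken from the talk:

import numpy as np

# A systematic (7,4) generator matrix: the first 4 columns carry the data,
# the last 3 columns are parity (redundancy).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def encode(msg):
    # codeword = msg * G (mod 2); every codeword is a GF(2) linear
    # combination of G's rows
    return (np.array(msg, dtype=np.uint8) @ G) % 2

print(encode([1, 0, 1, 1]))  # -> [1 0 1 1 0 1 0]; first 4 bits = the message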
void fast_introduction_to_LECC()(Continue – terminology)
Hamming distance – the number of places in which two vectors differ; denoted dist(x,y)
Hamming weight – the number of places in a vector that differ from zero; denoted wt(x)
Minimum distance of a linear code – the minimum weight of any non-zero codeword
[Figure: the {0,1}³ cube with vertices 000, 001, 010, 011, 100, 101, 110, 111, drawn twice to illustrate distance and weight]
void fast_introduction_to_LECC()(Continue – terminology)
Perfect code (t) – every vector has Hamming distance ≤ t from a unique codeword.
[Figure: Channel → Decoder: received word = x + e, error vector e = e1e2…en; target: recover msg′ = msg]
void fast_introduction_to_LECC()(Continue – terminology)
Complete decoding – the acceptance regions around the codewords together contain all the vectors of length n.
...
}
void the_opposite_problem() {
Decoding linear (n,k) codes in the presence of random noise, when k > O(log n), in poly(n) time. (The case k = O(log n) is trivial: enumerate all 2^k codewords.)
In !(coding-theory) terms: given a finite set of codewords (examples) of length n, their labels, and a codeword y, find/learn the label of y, in the presence of random noise, in poly(n) time.
void the_opposite_problem()(Continue – Main idea)
Without noise:
- Any vector can be written as a linear combination of previously seen examples.
- Deducing the vector's label can be done in the same way.
So… all we need is to find a basis in order to deduce the label of any new example (see the sketch below).
Q: Is it the same in the presence of noise?
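A sketch of the noise-free case in Python – a toy GF(2) solver; the names here are illustrative:

import numpy as np

def solve_gf2(A, b):
    # Solve A v = b (mod 2) by Gauss-Jordan elimination over GF(2);
    # A is s x n, b has length s; returns one solution, assuming consistency.
    A, b = A.copy() % 2, b.copy() % 2
    s, n = A.shape
    pivots, row = [], 0
    for col in range(n):
        piv = next((r for r in range(row, s) if A[r, col]), None)
        if piv is None:
            continue
        A[[row, piv]] = A[[piv, row]]
        b[[row, piv]] = b[[piv, row]]
        for r in range(s):
            if r != row and A[r, col]:
                A[r] ^= A[row]     # XOR = addition mod 2
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    v = np.zeros(n, dtype=np.uint8)
    for r, col in enumerate(pivots):
        v[col] = b[r]              # free variables stay 0
    return v

# Noise-free parity learning: labels are exact inner products <x, v> mod 2.
rng = np.random.default_rng(0)
n = 16
v_true = rng.integers(0, 2, n, dtype=np.uint8)
X = rng.integers(0, 2, (4 * n, n), dtype=np.uint8)
labels = ((X @ v_true) % 2).astype(np.uint8)
v_hat = solve_gf2(X, labels)
assert np.array_equal((X @ v_hat) % 2, labels)  # v_hat explains every label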
void the_opposite_problem()(Continue – Main idea)
Well… no.
Summing examples actually boosts the noise: given a noise rate η < ½, the XOR of s examples' labels is correct only with probability ½ + ½(1−2η)^s, i.e., its noise rate is ½ − ½(1−2η)^s.
The plan, then: write basis vectors as a sum of a small number of examples, and the new sample as a linear combination of the above.
}
void learning_with_noise() {
Concept – a boolean function over the input space (usually c : {0,1}^n → {0,1})
Concept class – a set of concepts
World model:
- There is a fixed noise rate η < ½
- A fixed probability distribution D over the input space
- … and an unknown concept c
- The algorithm may ask for a labeled example (x, l)
void learning_with_noise()(Continue – Goal & Parity)
Goal: find an ε-approximation of c – a function h s.t. Pr_{x←D}[h(x) = c(x)] ≥ 1 − ε
Parity function: defined by a corresponding vector v ∈ {0,1}^n. The function is then given by the rule f(x) = x·v (mod 2).
[Figure: a k-bit example x = 1010111… fed to the concept c]
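A small sketch of this world model in Python (names and parameters are illustrative):

import numpy as np

rng = np.random.default_rng(1)

def parity_concept(v):
    # the concept c(x) = <x, v> mod 2 defined by the vector v
    return lambda x: int(x @ v) % 2

def noisy_example(c, n, eta):
    # draw x uniformly from {0,1}^n; flip the label with probability eta
    x = rng.integers(0, 2, n, dtype=np.uint8)
    return x, c(x) ^ int(rng.random() < eta)

v = rng.integers(0, 2, 8, dtype=np.uint8)
c = parity_concept(v)
x, label = noisy_example(c, 8, eta=0.1)  # label == c(x) with probability 0.9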
void learning_with_noise()(Continue – Preliminaries)
Efficiently learnable: a concept class C is E.L. in the presence of random classification noise under distribution D if:
- ∃ an algorithm A s.t. ∀ ε > 0, δ > 0, noise rate η < ½, and concept c ∈ C, A produces an ε-approximation of c with probability at least 1 − δ when given access to D-random examples.
- A must run in time polynomial in n, 1/ε, 1/δ and in 1/(½ − η).
void learning_with_noise()(Continue – Goal)
We'll show that the length-k parity problem, for noise rate η < ½, can be solved with computation time and total number of examples 2^{O(k/log k)}.
Observe the behavior of the noise when we’re adding up examples:
void learning_with_noise()(Continue – Noise behavior)
For label i, denote: p_i = probability of a noisy bit, q_i = probability of a correct bit; p_i + q_i = 1.
Define the bias s_i = q_i − p_i = 1 − 2p_i = 2q_i − 1, so s_i ∈ [−1, 1].
XORing two labels: p_3 = p_1·q_2 + p_2·q_1 ; q_3 = p_1·p_2 + q_1·q_2
⇒ s_3 = q_3 − p_3 = s_1·s_2
Hence, for noise rate η, the XOR l_1 ⊕ l_2 ⊕ … ⊕ l_{2^k} will have the correct value with probability ½ + ½(1−2η)^{2^k}.
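A quick Monte Carlo check of the rule s3 = s1·s2 (Python, illustrative parameters):

import numpy as np

rng = np.random.default_rng(2)
eta1, eta2, N = 0.1, 0.2, 200_000
e1 = rng.random(N) < eta1    # True where the first label is flipped
e2 = rng.random(N) < eta2    # True where the second label is flipped
e3 = e1 ^ e2                 # the XOR label is wrong iff exactly one was flipped
print(1 - 2 * e3.mean())                 # empirical bias s3, ~0.48
print((1 - 2 * eta1) * (1 - 2 * eta2))   # s1 * s2 = 0.48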
void learning_with_noise()(Continue – Idea)
Main idea: draw many more examples than needed, in order to write basis vectors as a sum of a relatively small number of examples.
- If η < ½, the XOR of ω(log n) labels will be polynomially indistinguishable from random, so sums must stay short.
- We can repeat the process to boost reliability.
void learning_with_noise()(Continue – Definitions)
A few more definitions: write k = a·b.
- V_i – the subspace of {0,1}^{ab} consisting of vectors whose last i blocks are zeroed
- i-sample – a set of independent vectors that are uniformly distributed over V_i
[Figure: a k-bit vector 1010111… split into blocks 1, 2, …, a of b bits each]
void learning_with_noise()(Continue – Main construction)
Construction: given an i-sample of size s, we construct an (i+1)-sample of size at least s − 2^b in time O(s).
Behold: i-sample = {x_1,…,x_s}.
- Partition the x's based on their (a−i)-th block (we'll get at most 2^b partitions).
- For each non-empty partition, pick a random vector, add it to the other vectors in its partition, and then discard it.
Result: vectors z_1,…,z_m, m ≥ s − 2^b, where:
- the (a−i)-th block is zeroed out
- the z_j are independent and uniformly distributed over V_{i+1}
A sketch of one such step follows.
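One BKW-style elimination step in Python (block indexing and names are illustrative; applying it a−1 times yields the (a−1)-sample described above):

import numpy as np

def eliminate_block(sample, labels, block, b, rng):
    # Zero out b-bit block `block` (0-indexed) of every vector by XORing
    # it with a randomly chosen representative of its partition.
    buckets = {}
    for x, l in zip(sample, labels):
        buckets.setdefault(bytes(x[block * b:(block + 1) * b]), []).append((x, l))
    out_x, out_l = [], []
    for bucket in buckets.values():
        rep_x, rep_l = bucket.pop(rng.integers(len(bucket)))  # discard the representative
        for x, l in bucket:
            out_x.append(x ^ rep_x)   # the chosen block is now all zeros
            out_l.append(l ^ rep_l)   # labels add mod 2, so the noise grows
    return out_x, out_l               # at most 2**b buckets => at least len(sample) - 2**b survive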
void learning_with_noise()(Continue – Algorithm)
Algorithm (finding the 1st bit):
- Ask for a·2^b labeled examples
- Apply the construction (a−1) times to get an (a−1)-sample; note |(a−1)-sample| ≥ 2^b
- There is a ≈ 1−1/e chance that the vector (1,0,…,0) will be a member of the (a−1)-sample. If it's not there, we do it again with new labeled examples (the expected number of repetitions is constant)
Note: we've written (1,0,…,0) as a sum of 2^{a−1} examples, boosting the noise rate to ½ − ½(1−2η)^{2^{a−1}} (its label is correct with probability ½ + ½(1−2η)^{2^{a−1}}).
void learning_with_noise()(Continue – Observations)
Observations:
- We found the first bit of our new sample using a number of examples and computation time poly((1/(1−2η))^{2^a}, 2^b)
- We can shift all examples to determine the remaining bits
- Fixing a = (½)·log k and b = 2k/log k will give the desired 2^{O(k/log k)} bound for a constant noise rate η
}
void the_k_sum_problem() {
The key to improving the above algorithm is to find a better way to solve a problem similar to "k-sum".
Problem: given k lists L_1,…,L_k of elements, drawn uniformly and independently from {0,1}^n, find x_1 ∈ L_1, …, x_k ∈ L_k s.t. x_1 ⊕ x_2 ⊕ … ⊕ x_k = 0.
Note: a solution to the "k-sum" problem exists with good probability if |L_1|·|L_2|·…·|L_k| ≫ 2^n (similar to the birthday paradox).
void the_k_sum_problem()(Continue – Wagner’s Algorithm - Definitions)
Preliminary definitions and observations:
- low_l(x) – the l least significant bits of x
- L_1 ⋈_l L_2 – contains all pairs from L_1 × L_2 that agree on the l least significant bits
- If low_l(x_1 ⊕ x_2) = 0 and low_l(x_3 ⊕ x_4) = 0 then low_l(x_1 ⊕ x_2 ⊕ x_3 ⊕ x_4) = 0, and Pr[x_1 ⊕ x_2 ⊕ x_3 ⊕ x_4 = 0] = 2^l/2^n
- Implementing the join (⋈_l) operation:
  - Hash join: stores one list and scans through the other; (|L_1| + |L_2|) steps, O(|L_1| + |L_2|) storage
  - Merge join: sorts and scans the two sorted lists; O(max(|L_1|,|L_2|)·log(max(|L_1|,|L_2|))) time
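The hash-join variant, sketched in Python (elements as plain ints; illustrative):

from collections import defaultdict

def hash_join(L1, L2, l):
    # All pairs (x1, x2) in L1 x L2 with low_l(x1 ^ x2) == 0,
    # i.e. the two elements agree on their l least significant bits.
    mask = (1 << l) - 1
    table = defaultdict(list)
    for x1 in L1:                  # store one list...
        table[x1 & mask].append(x1)
    return [(x1, x2)               # ...then scan through the other
            for x2 in L2
            for x1 in table[x2 & mask]]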
void the_k_sum_problem()(Continue – Wagner’s Algorithm – Simple case)
The 4-list case:
- Extend the lists until each contains 2^l elements
- Generate a new list L_12 of values x_1 ⊕ x_2 s.t. low_l(x_1 ⊕ x_2) = 0, and a new list L_34 in the same way
- Search for matches between L_12 and L_34
[Figure: join tree – L_1 ⋈_l L_2 and L_3 ⋈_l L_4 feed a final join producing {(x_1,…,x_4) : x_1 ⊕ … ⊕ x_4 = 0}]
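The 4-list case end to end, as a runnable Python sketch (parameters illustrative; the hash join from above is inlined so the snippet is self-contained):

import random
from collections import defaultdict

def join_low(L1, L2, l):
    # pairs agreeing on the l least significant bits (hash join, as above)
    mask, table = (1 << l) - 1, defaultdict(list)
    for x1 in L1:
        table[x1 & mask].append(x1)
    return [(x1, x2) for x2 in L2 for x1 in table[x2 & mask]]

def four_list_xor(n):
    l = n // 3
    Ls = [[random.getrandbits(n) for _ in range(1 << l)] for _ in range(4)]
    # L12: values x1^x2 whose low l bits vanish, remembering the parents
    L12 = {x1 ^ x2: (x1, x2) for x1, x2 in join_low(Ls[0], Ls[1], l)}
    for x3, x4 in join_low(Ls[2], Ls[3], l):
        if x3 ^ x4 in L12:           # all n bits match => x1^x2^x3^x4 == 0
            return (*L12[x3 ^ x4], x3, x4)
    return None  # ~1 match expected when l = n/3; retry with fresh lists if needed

print(four_list_xor(24))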
void the_k_sum_problem()(Continue – Wagner’s Algorithm)
Observations:
- Pr[low_l(x_i ⊕ x_j) = 0] = 1/2^l when 1 ≤ i < j ≤ 4 and x_i, x_j are chosen uniformly at random
- E[|L_ij|] = (|L_i|·|L_j|)/2^l = 2^{2l}/2^l = 2^l
- The expected number of elements common between L_12 and L_34 that yield the desired solutions is |L_12|·|L_34|/2^{n−l} (l ≥ n/3 will give us at least 1)
Complexity: O(2^{n/3}) time and space
void the_k_sum_problem()(Continue – Wagner’s Algorithm)
Improvements:
- We don't need the low l bits to be zero: we can fix them to any value α (e.g., by XOR-shifting one of the lists by α before the join, L_i ⋈_l (L_j ⊕ α))
- The value 0 in x_1 ⊕ … ⊕ x_k = 0 can be replaced with a constant c of our choice (by replacing L_k with L_k′ = L_k ⊕ c)
- If k > k′, the complexity of the "k-sum" problem can be no larger than the complexity of the "k′-sum" problem (just pick arbitrary x_{k′+1},…,x_k, define c = x_{k′+1} ⊕ … ⊕ x_k and use the "k′-sum" algorithm to find a solution for x_1 ⊕ … ⊕ x_{k′} = c)
⇒ we can solve the "k-sum" problem with complexity at most O(2^{n/3}) for all k ≥ 4
void the_k_sum_problem()(Continue – Wagner’s Algorithm)
Extending the 4-list case:
- Create a complete binary tree of depth log k; at depth h we'll use l_h = h·n/(1 + log k)
- So we'll get an algorithm that requires O(k·2^{n/(1+log k)}) time and space
- Note: if k is not a power of 2, we'll take k′ = 2^{⌊log k⌋} – the largest power of 2 less than k – and afterwards use the list-elimination trick
}
void can_we_do_it_better_?() {
But… maybe there's a problem with the approach? Yes.
- How many samples do we really need to get a solution with good probability? k + log k − log(−ln(1−ε))
- Do we even need a basis? Yes & no…
- Can we do it without scanning the whole space? Yes
- Do we need the best solution? No
void can_we_do_it_better_?()(Continue – Sampling space)
To have a solution we need k linearly independent vectors in our sampling space S. So…
We'll want P(s,k) ≥ 1 − ε, where ε ∈ [0,1]; this gives |sampling space| = O(k + log k + f(ε)).

P(s,k) = [(2^s − 1)(2^s − 2)⋯(2^s − 2^{k−1})] / (2^s)^k = ∏_{j=0..k−1} (1 − 2^{j−s})

P(s,k) ≥ ((2^s − 2^{k−1}) / 2^s)^k = (1 − 2^{k−1−s})^k ≈ e^{−k·2^{k−1−s}}

⇒ P(s,k) ≥ 1 − ε whenever |S| ≥ k + log k − log(−ln(1−ε)) − 1
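A quick empirical check of P(s,k) – the chance that s random vectors span {0,1}^k – via GF(2) rank (Python, illustrative):

import numpy as np

def gf2_rank(M):
    # rank of a 0/1 matrix over GF(2), by elimination
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        piv = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if piv is None:
            continue
        M[[rank, piv]] = M[[piv, rank]]
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]
        rank += 1
    return rank

rng = np.random.default_rng(3)
k, trials = 20, 2000
for s in (k, k + 2, k + 5):
    hits = sum(gf2_rank(rng.integers(0, 2, (s, k), dtype=np.uint8)) == k
               for _ in range(trials))
    print(s, hits / trials)   # climbs toward 1 once s is a few bits past k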
}
void annealing() {
Annealing: the physical process of heating up a solid until it melts, followed by slowly cooling it down into a state of perfect lattice.
Problem′: finding, among a potentially very large number of solutions, a solution with minimal cost.
Note: we don't even need the minimal-cost solution – just one whose noise rate is below our threshold.
void annealing()(Continue – Combinatorial optimization)
Some definitions:
- The set of solutions to the combinatorial problem is taken as the set of states S′. Note: in our case the states are subsets of the sample S, so |S′| can be as large as 2^{|S|}.
- The price function is the energy E : S′ → R that we minimize.
- The transition probability between neighboring states depends on their energy difference and an external temperature T.
void annealing()(Continue – Pseudo code algorithm)
Set T to a high temperature
Choose an arbitrary initial state c
Loop:
- Select a neighbor c′ of c; set ΔE = E(c′) − E(c)
- If ΔE < 0 then move to c′, else move to c′ with probability exp(−ΔE/T)
- Do the two steps above several more times
- Decrease T
Wait long enough and cross fingers… (preferably more than 2)
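That loop as a generic Python skeleton; the neighbor/energy callbacks are placeholders you supply, not the talk's actual cost function:

import math
import random

def anneal(initial, neighbor, energy,
           T0=10.0, cooling=0.95, phases=100, steps_per_phase=50):
    # Generic simulated annealing over user-supplied state space.
    c, T = initial, T0
    best, best_e = c, energy(c)
    for _ in range(phases):
        for _ in range(steps_per_phase):
            c2 = neighbor(c)
            dE = energy(c2) - energy(c)
            # always accept improvements; accept uphill moves with prob e^{-dE/T}
            if dE < 0 or random.random() < math.exp(-dE / T):
                c = c2
                if energy(c) < best_e:
                    best, best_e = c, energy(c)
        T *= cooling              # decrease the temperature each phase
    return best

# Toy usage: minimize (x-3)^2 over the integers
print(anneal(0, lambda x: x + random.choice((-1, 1)), lambda x: (x - 3) ** 2))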
void annealing()(Continue – Problems)
Problems:
- Not all states can yield our new sample (only the ones containing at least one vector from S\basis).
- The probability that a "capable" state will yield the zero vector is 1/2^k.
- The probability that some 1 ≤ j ≤ k vectors from S will yield a solution is 1 − (1 − 1/2^k)^(|S| choose j). Note: when |S| ≈ k the expression above approaches zero.
void annealing()(Continue – Reduction)
Idea: sample a little more than is needed: |S| = O(c·k). Assign each vector its Hamming weight and sort S by it.
Reduction:
- Spawning the next generation: all the states which include a vector whose Hamming weight is ≤ 2·wt(l)
Cost: f(l) = Σ_i 1/(wt(l_i)+1)^m over the vectors l_i of state l, where m > 1
Transition probability:
P(l ∈ Gen(S′, i)) = 1 if f(l) ≤ f(i), and exp((f(i) − f(l))/T) if f(l) > f(i)
void annealing()(Continue – Convergence & Complexity ??)
Complexity: O(τ·L·ln|S′|)
where L denotes the number of steps to reach quasi-equilibrium in each phase, τ denotes the computation time of a transition, and ln(|S′|) denotes the number of phases needed to reach an accepted solution, using a polynomial-time cooling schedule.
}
"I don't even see the code anymore… all I can see now are blondes, brunettes, redheads…" – Cypher ("The Matrix")
void appendix()([GL])
Theorem: suppose we have oracle access to a random process b_x : {0,1}^n → {0,1}, so that
Pr_{r ∈ {0,1}^n}[b_x(r) = b(x,r)] ≥ ½ + ε
where the probability is taken uniformly over the internal coin tosses of b_x and all possible choices of r, and b(x,r) denotes the inner product mod 2 of x and r.
Then we can, in time polynomial in n/ε, output a list of strings that contains x with probability at least ½.
void appendix()(Continue – [GL] – highway)
How??
1st way (to extract x_i):
Suppose s(x) = Pr_r[b_x(r) = b(x,r)] ≥ 3/4 + ε (hmmm??)
The probability that both b_x(r) = b(x,r) and b_x(r⊕e_i) = b(x,r⊕e_i) hold is at least 1 − 2·(1/4 − ε) = ½ + 2ε, and in that case
b_x(r) ⊕ b_x(r⊕e_i) = b(x,r) ⊕ b(x,r⊕e_i) = b(x,e_i) = x_i
but…
void appendix()(Continue – [GL] – better way)
2nd way:
Idea: guess b(x,r) by ourselves.
Problem: we need the guess to be correct for polynomially many r's simultaneously.
Solution: generate polynomially many r's so that they are "sufficiently" random, yet we can still guess all of them with non-negligible probability.
void appendix()(Continue – [GL] – better way)
Construction:
- Select l = log(poly(n)+1) strings uniformly in {0,1}^n and denote them s_1,…,s_l
- Guess b(x,s_1),…,b(x,s_l); the probability that all guesses are correct is 2^{−l} = 1/poly(n)
- Assign to each non-empty subset J ⊆ {1,…,l} the vector r^J = ⊕_{j∈J} s_j
- Note that: b(x,r^J) = b(x, ⊕_{j∈J} s_j) = ⊕_{j∈J} b(x,s_j)
- Try all possibilities for the guesses b(x,s_1),…,b(x,s_l) and output a list of 2^l candidates z ∈ {0,1}^n for x
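A toy check of the linearity the construction relies on – a sketch in Python, not the full GL list-decoder; parameters are illustrative:

import itertools
import numpy as np

rng = np.random.default_rng(4)
n, l = 16, 4
x = rng.integers(0, 2, n, dtype=np.uint8)        # the hidden string
s = rng.integers(0, 2, (l, n), dtype=np.uint8)   # s_1, ..., s_l

def b(x, r):
    return int(x @ r) % 2                        # inner product mod 2

sigma = [b(x, s[j]) for j in range(l)]           # our l guesses, assumed correct

# By linearity, the guess for every r^J then follows for free:
for size in range(1, l + 1):
    for J in itertools.combinations(range(l), size):
        rJ = np.bitwise_xor.reduce(s[list(J)], axis=0)   # r^J = XOR of s_j, j in J
        assert b(x, rJ) == sum(sigma[j] for j in J) % 2
print("all", 2 ** l - 1, "derived guesses are consistent")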