Noise Tolerant Learning
Presented by Aviad Maizels
Based on:
- "Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model" – Avrim Blum, Adam Kalai and Hal Wasserman
- "A Generalized Birthday Problem" – David Wagner
- "Hard-Core Predicates for Any One-Way Function" – O. Goldreich and L. A. Levin
- "Simulated Annealing and Boltzmann Machines" – Emile Aarts and Jan Korst
void Agenda() {
  do {
    A few sentences about codes
    The opposite problem
    Learning with noise
    The k-sum problem
    Can we do it faster?
    Annealing
  } while (!understandable);
}
void fast_introduction_to_LECC() {
The communication channel may disrupt the original data:
[Figure: binary symmetric channel – each bit arrives intact with probability 1−p and is flipped with probability p]
Proposed solution: encode messages to
give some protection against errors.
void fast_introduction_to_LECC()(Continue – terminology)
Linear codes:
- Fixed-size block code
- Additive closure (the sum of two codewords is again a codeword)
- A code is tagged using two parameters (n,k): k – data size, n – encoded word size
[Figure: Source → Encoder → Channel (+ noise): msg = u1u2…uk becomes codeword = x1x2…xn]
void fast_introduction_to_LECC()(Continue – terminology)
Systematic code – original data appears directly inside the codeword.
[Figure: codeword layout – k data bits followed by n−k redundancy bits]
Generating matrix (G) – a matrix s.t. multiplying a message by it outputs the encoded word.
- Number of rows == space dimension (k)
- Every codeword can be represented as a linear combination of G's rows (see the sketch below).
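For instance, here is a minimal encoding sketch in Python. The (7,4) Hamming-style G below is a standard textbook example, not taken from the talk:

import numpy as np

# A systematic (7,4) generator matrix: the first 4 columns carry the data,
# the last 3 columns are parity (redundancy).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def encode(msg):
    # codeword = msg * G (mod 2); every codeword is a GF(2) linear
    # combination of G's rows
    return (np.array(msg, dtype=np.uint8) @ G) % 2

print(encode([1, 0, 1, 1]))  # -> [1 0 1 1 0 1 0]; first 4 bits = the message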
void fast_introduction_to_LECC()(Continue – terminology)
Hamming distance – the number of places in which two vectors differ; denoted dist(x,y)
Hamming weight – the number of places in a vector that differ from zero; denoted wt(x)
Minimum distance of a linear code – the minimum weight of any non-zero codeword
[Figure: the {0,1}³ cube with vertices 000, 001, 010, 011, 100, 101, 110, 111, drawn twice to illustrate distance and weight]
void fast_introduction_to_LECC()(Continue – terminology)
Perfect code (t) – every vector has Hamming distance ≤ t from a unique codeword.
[Figure: Channel → Decoder: received word = x + e, error vector e = e1e2…en; target: recover msg′ = msg]
void fast_introduction_to_LECC()(Continue – terminology)
Complete decoding – the acceptance regions around the codewords together contain all the vectors of length n.
...
}
void the_opposite_problem() {
Decoding linear (n,k) codes in the presence of random noise, when k > O(log n), in poly(n) time. (The case k = O(log n) is trivial: enumerate all 2^k codewords.)
In !(coding-theory) terms: given a finite set of codewords (examples) of length n, their labels, and a codeword y, find/learn the label of y, in the presence of random noise, in poly(n) time.
void the_opposite_problem()(Continue – Main idea)
Without noise:
- Any vector can be written as a linear combination of previously seen examples.
- Deducing the vector's label can be done in the same way.
So… all we need is to find a basis in order to deduce the label of any new example (see the sketch below).
Q: Is it the same in the presence of noise?
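A sketch of the noise-free case in Python – a toy GF(2) solver; the names here are illustrative:

import numpy as np

def solve_gf2(A, b):
    # Solve A v = b (mod 2) by Gauss-Jordan elimination over GF(2);
    # A is s x n, b has length s; returns one solution, assuming consistency.
    A, b = A.copy() % 2, b.copy() % 2
    s, n = A.shape
    pivots, row = [], 0
    for col in range(n):
        piv = next((r for r in range(row, s) if A[r, col]), None)
        if piv is None:
            continue
        A[[row, piv]] = A[[piv, row]]
        b[[row, piv]] = b[[piv, row]]
        for r in range(s):
            if r != row and A[r, col]:
                A[r] ^= A[row]     # XOR = addition mod 2
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    v = np.zeros(n, dtype=np.uint8)
    for r, col in enumerate(pivots):
        v[col] = b[r]              # free variables stay 0
    return v

# Noise-free parity learning: labels are exact inner products <x, v> mod 2.
rng = np.random.default_rng(0)
n = 16
v_true = rng.integers(0, 2, n, dtype=np.uint8)
X = rng.integers(0, 2, (4 * n, n), dtype=np.uint8)
labels = ((X @ v_true) % 2).astype(np.uint8)
v_hat = solve_gf2(X, labels)
assert np.array_equal((X @ v_hat) % 2, labels)  # v_hat explains every label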
void the_opposite_problem()(Continue – Main idea)
Well… no.
Summing examples actually boosts the noise: given a noise rate η < ½, the XOR of s examples' labels is correct only with probability ½ + ½(1−2η)^s, i.e., its noise rate is ½ − ½(1−2η)^s.
The plan, then: write basis vectors as a sum of a small number of examples, and the new sample as a linear combination of the above.
}
void learning_with_noise() {
Concept – a boolean function over the input space (usually c : {0,1}^n → {0,1})
Concept class – a set of concepts
World model:
- There is a fixed noise rate η < ½
- A fixed probability distribution D over the input space
- … and an unknown concept c
- The algorithm may ask for a labeled example (x, l)
void learning_with_noise()(Continue – Goal & Parity)
Goal: find an ε-approximation of c – a function h s.t. Pr_{x←D}[h(x) = c(x)] ≥ 1 − ε
Parity function: defined by a corresponding vector v ∈ {0,1}^n. The function is then given by the rule f(x) = x·v (mod 2).
[Figure: a k-bit example x = 1010111… fed to the concept c]
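A small sketch of this world model in Python (names and parameters are illustrative):

import numpy as np

rng = np.random.default_rng(1)

def parity_concept(v):
    # the concept c(x) = <x, v> mod 2 defined by the vector v
    return lambda x: int(x @ v) % 2

def noisy_example(c, n, eta):
    # draw x uniformly from {0,1}^n; flip the label with probability eta
    x = rng.integers(0, 2, n, dtype=np.uint8)
    return x, c(x) ^ int(rng.random() < eta)

v = rng.integers(0, 2, 8, dtype=np.uint8)
c = parity_concept(v)
x, label = noisy_example(c, 8, eta=0.1)  # label == c(x) with probability 0.9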
void learning_with_noise()(Continue – Preliminaries)
Efficiently learnable: a concept class C is E.L. in the presence of random classification noise under distribution D if:
- ∃ an algorithm A s.t. ∀ ε > 0, δ > 0, noise rate η < ½, and concept c ∈ C, A produces an ε-approximation of c with probability at least 1 − δ when given access to D-random examples.
- A must run in time polynomial in n, 1/ε, 1/δ and in 1/(½ − η).
void learning_with_noise()(Continue – Goal)
We'll show that the length-k parity problem, for noise rate η < ½, can be solved with computation time and total number of examples 2^{O(k/log k)}.
Observe the behavior of the noise when we’re adding up examples:
void learning_with_noise()(Continue – Noise behavior)
For label i, denote: p_i = probability of a noisy bit, q_i = probability of a correct bit; p_i + q_i = 1.
Define the bias s_i = q_i − p_i = 1 − 2p_i = 2q_i − 1, so s_i ∈ [−1, 1].
XORing two labels: p_3 = p_1·q_2 + p_2·q_1 ; q_3 = p_1·p_2 + q_1·q_2
⇒ s_3 = q_3 − p_3 = s_1·s_2
Hence, for noise rate η, the XOR l_1 ⊕ l_2 ⊕ … ⊕ l_{2^k} will have the correct value with probability ½ + ½(1−2η)^{2^k}.
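A quick Monte Carlo check of the rule s3 = s1·s2 (Python, illustrative parameters):

import numpy as np

rng = np.random.default_rng(2)
eta1, eta2, N = 0.1, 0.2, 200_000
e1 = rng.random(N) < eta1    # True where the first label is flipped
e2 = rng.random(N) < eta2    # True where the second label is flipped
e3 = e1 ^ e2                 # the XOR label is wrong iff exactly one was flipped
print(1 - 2 * e3.mean())                 # empirical bias s3, ~0.48
print((1 - 2 * eta1) * (1 - 2 * eta2))   # s1 * s2 = 0.48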
void learning_with_noise()(Continue – Idea)
Main idea: draw many more examples than needed, in order to write basis vectors as a sum of a relatively small number of examples.
- If η < ½, the XOR of ω(log n) labels will be polynomially indistinguishable from random, so sums must stay short.
- We can repeat the process to boost reliability.
void learning_with_noise()(Continue – Definitions)
A few more definitions: write k = a·b.
- V_i – the subspace of {0,1}^{ab} consisting of vectors whose last i blocks are zeroed
- i-sample – a set of independent vectors that are uniformly distributed over V_i
[Figure: a k-bit vector 1010111… split into blocks 1, 2, …, a of b bits each]
void learning_with_noise()(Continue – Main construction)
Construction: given an i-sample of size s, we construct an (i+1)-sample of size at least s − 2^b in time O(s).
Behold: i-sample = {x_1,…,x_s}.
- Partition the x's based on their (a−i)-th block (we'll get at most 2^b partitions).
- For each non-empty partition, pick a random vector, add it to the other vectors in its partition, and then discard it.
Result: vectors z_1,…,z_m, m ≥ s − 2^b, where:
- the (a−i)-th block is zeroed out
- the z_j are independent and uniformly distributed over V_{i+1}
A sketch of one such step follows.
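One BKW-style elimination step in Python (block indexing and names are illustrative; applying it a−1 times yields the (a−1)-sample described above):

import numpy as np

def eliminate_block(sample, labels, block, b, rng):
    # Zero out b-bit block `block` (0-indexed) of every vector by XORing
    # it with a randomly chosen representative of its partition.
    buckets = {}
    for x, l in zip(sample, labels):
        buckets.setdefault(bytes(x[block * b:(block + 1) * b]), []).append((x, l))
    out_x, out_l = [], []
    for bucket in buckets.values():
        rep_x, rep_l = bucket.pop(rng.integers(len(bucket)))  # discard the representative
        for x, l in bucket:
            out_x.append(x ^ rep_x)   # the chosen block is now all zeros
            out_l.append(l ^ rep_l)   # labels add mod 2, so the noise grows
    return out_x, out_l               # at most 2**b buckets => at least len(sample) - 2**b survive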
void learning_with_noise()(Continue – Algorithm)
Algorithm (finding the 1st bit):
- Ask for a·2^b labeled examples
- Apply the construction (a−1) times to get an (a−1)-sample; note |(a−1)-sample| ≥ 2^b
- There is a ≈ 1−1/e chance that the vector (1,0,…,0) will be a member of the (a−1)-sample. If it's not there, we do it again with new labeled examples (the expected number of repetitions is constant)
Note: we've written (1,0,…,0) as a sum of 2^{a−1} examples, boosting the noise rate to ½ − ½(1−2η)^{2^{a−1}} (its label is correct with probability ½ + ½(1−2η)^{2^{a−1}}).
void learning_with_noise()(Continue – Observations)
Observations:
- We found the first bit of our new sample using a number of examples and computation time poly((1/(1−2η))^{2^a}, 2^b)
- We can shift all examples to determine the remaining bits
- Fixing a = (½)·log k and b = 2k/log k will give the desired 2^{O(k/log k)} bound for a constant noise rate η
}
void the_k_sum_problem() {
The key to improving the above algorithm is to find a better way to solve a problem similar to "k-sum".
Problem: given k lists L_1,…,L_k of elements, drawn uniformly and independently from {0,1}^n, find x_1 ∈ L_1, …, x_k ∈ L_k s.t. x_1 ⊕ x_2 ⊕ … ⊕ x_k = 0.
Note: a solution to the "k-sum" problem exists with good probability if |L_1|·|L_2|·…·|L_k| ≫ 2^n (similar to the birthday paradox).
void the_k_sum_problem()(Continue – Wagner’s Algorithm - Definitions)
Preliminary definitions and observations:
- low_l(x) – the l least significant bits of x
- L_1 ⋈_l L_2 – contains all pairs from L_1 × L_2 that agree on the l least significant bits
- If low_l(x_1 ⊕ x_2) = 0 and low_l(x_3 ⊕ x_4) = 0 then low_l(x_1 ⊕ x_2 ⊕ x_3 ⊕ x_4) = 0, and Pr[x_1 ⊕ x_2 ⊕ x_3 ⊕ x_4 = 0] = 2^l/2^n
- Implementing the join (⋈_l) operation:
  - Hash join: stores one list and scans through the other; (|L_1| + |L_2|) steps, O(|L_1| + |L_2|) storage
  - Merge join: sorts and scans the two sorted lists; O(max(|L_1|,|L_2|)·log(max(|L_1|,|L_2|))) time
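The hash-join variant, sketched in Python (elements as plain ints; illustrative):

from collections import defaultdict

def hash_join(L1, L2, l):
    # All pairs (x1, x2) in L1 x L2 with low_l(x1 ^ x2) == 0,
    # i.e. the two elements agree on their l least significant bits.
    mask = (1 << l) - 1
    table = defaultdict(list)
    for x1 in L1:                  # store one list...
        table[x1 & mask].append(x1)
    return [(x1, x2)               # ...then scan through the other
            for x2 in L2
            for x1 in table[x2 & mask]]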
void the_k_sum_problem()(Continue – Wagner’s Algorithm – Simple case)
The 4-list case:
- Extend the lists until each contains 2^l elements
- Generate a new list L_12 of values x_1 ⊕ x_2 s.t. low_l(x_1 ⊕ x_2) = 0, and a new list L_34 in the same way
- Search for matches between L_12 and L_34
[Figure: join tree – L_1 ⋈_l L_2 and L_3 ⋈_l L_4 feed a final join producing {(x_1,…,x_4) : x_1 ⊕ … ⊕ x_4 = 0}]
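The 4-list case end to end, as a runnable Python sketch (parameters illustrative; the hash join from above is inlined so the snippet is self-contained):

import random
from collections import defaultdict

def join_low(L1, L2, l):
    # pairs agreeing on the l least significant bits (hash join, as above)
    mask, table = (1 << l) - 1, defaultdict(list)
    for x1 in L1:
        table[x1 & mask].append(x1)
    return [(x1, x2) for x2 in L2 for x1 in table[x2 & mask]]

def four_list_xor(n):
    l = n // 3
    Ls = [[random.getrandbits(n) for _ in range(1 << l)] for _ in range(4)]
    # L12: values x1^x2 whose low l bits vanish, remembering the parents
    L12 = {x1 ^ x2: (x1, x2) for x1, x2 in join_low(Ls[0], Ls[1], l)}
    for x3, x4 in join_low(Ls[2], Ls[3], l):
        if x3 ^ x4 in L12:           # all n bits match => x1^x2^x3^x4 == 0
            return (*L12[x3 ^ x4], x3, x4)
    return None  # ~1 match expected when l = n/3; retry with fresh lists if needed

print(four_list_xor(24))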
void the_k_sum_problem()(Continue – Wagner’s Algorithm)
Observations:
- Pr[low_l(x_i ⊕ x_j) = 0] = 1/2^l when 1 ≤ i < j ≤ 4 and x_i, x_j are chosen uniformly at random
- E[|L_ij|] = (|L_i|·|L_j|)/2^l = 2^{2l}/2^l = 2^l
- The expected number of elements common between L_12 and L_34 that yield the desired solutions is |L_12|·|L_34|/2^{n−l} (l ≥ n/3 will give us at least 1)
Complexity: O(2^{n/3}) time and space
void the_k_sum_problem()(Continue – Wagner’s Algorithm)
Improvements:
- We don't need the low l bits to be zero: we can fix them to any value α (e.g., by XOR-shifting one of the lists by α before the join, L_i ⋈_l (L_j ⊕ α))
- The value 0 in x_1 ⊕ … ⊕ x_k = 0 can be replaced with a constant c of our choice (by replacing L_k with L_k′ = L_k ⊕ c)
- If k > k′, the complexity of the "k-sum" problem can be no larger than the complexity of the "k′-sum" problem (just pick arbitrary x_{k′+1},…,x_k, define c = x_{k′+1} ⊕ … ⊕ x_k and use the "k′-sum" algorithm to find a solution for x_1 ⊕ … ⊕ x_{k′} = c)
⇒ we can solve the "k-sum" problem with complexity at most O(2^{n/3}) for all k ≥ 4
void the_k_sum_problem()(Continue – Wagner’s Algorithm)
Extending the 4-list case:
- Create a complete binary tree of depth log k; at depth h we'll use l_h = h·n/(1 + log k)
- So we'll get an algorithm that requires O(k·2^{n/(1+log k)}) time and space
- Note: if k is not a power of 2, we'll take k′ = 2^{⌊log k⌋} – the largest power of 2 less than k – and afterwards use the list-elimination trick
}
void can_we_do_it_better_?() {
But… maybe there's a problem with the approach? Yes.
- How many samples do we really need to get a solution with good probability? k + log k − log(−ln(1−ε))
- Do we even need a basis? Yes & no…
- Can we do it without scanning the whole space? Yes
- Do we need the best solution? No
void can_we_do_it_better_?()(Continue – Sampling space)
To have a solution we need k linearly independent vectors in our sampling space S. So…
We'll want P(s,k) ≥ 1 − ε, where ε ∈ [0,1]; this gives |sampling space| = O(k + log k + f(ε)).

P(s,k) = [(2^s − 1)(2^s − 2)⋯(2^s − 2^{k−1})] / (2^s)^k = ∏_{j=0..k−1} (1 − 2^{j−s})

P(s,k) ≥ ((2^s − 2^{k−1}) / 2^s)^k = (1 − 2^{k−1−s})^k ≈ e^{−k·2^{k−1−s}}

⇒ P(s,k) ≥ 1 − ε whenever |S| ≥ k + log k − log(−ln(1−ε)) − 1
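A quick empirical check of P(s,k) – the chance that s random vectors span {0,1}^k – via GF(2) rank (Python, illustrative):

import numpy as np

def gf2_rank(M):
    # rank of a 0/1 matrix over GF(2), by elimination
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        piv = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if piv is None:
            continue
        M[[rank, piv]] = M[[piv, rank]]
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]
        rank += 1
    return rank

rng = np.random.default_rng(3)
k, trials = 20, 2000
for s in (k, k + 2, k + 5):
    hits = sum(gf2_rank(rng.integers(0, 2, (s, k), dtype=np.uint8)) == k
               for _ in range(trials))
    print(s, hits / trials)   # climbs toward 1 once s is a few bits past k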
}
void annealing() {
Annealing: the physical process of heating up a solid until it melts, followed by slowly cooling it down into a state of perfect lattice.
Problem′: finding, among a potentially very large number of solutions, a solution with minimal cost.
Note: we don't even need the minimal-cost solution – just one whose noise rate is below our threshold.
void annealing()(Continue – Combinatorial optimization)
Some definitions:
- The set of solutions to the combinatorial problem is taken as the set of states S′. Note: in our case the states are subsets of the sample S, so |S′| can be as large as 2^{|S|}.
- The price function is the energy E : S′ → R that we minimize.
- The transition probability between neighboring states depends on their energy difference and an external temperature T.
void annealing()(Continue – Pseudo code algorithm)
Set T to a high temperature
Choose an arbitrary initial state c
Loop:
- Select a neighbor c′ of c; set ΔE = E(c′) − E(c)
- If ΔE < 0 then move to c′, else move to c′ with probability exp(−ΔE/T)
- Do the two steps above several more times
- Decrease T
Wait long enough and cross fingers… (preferably more than 2)
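That loop as a generic Python skeleton; the neighbor/energy callbacks are placeholders you supply, not the talk's actual cost function:

import math
import random

def anneal(initial, neighbor, energy,
           T0=10.0, cooling=0.95, phases=100, steps_per_phase=50):
    # Generic simulated annealing over user-supplied state space.
    c, T = initial, T0
    best, best_e = c, energy(c)
    for _ in range(phases):
        for _ in range(steps_per_phase):
            c2 = neighbor(c)
            dE = energy(c2) - energy(c)
            # always accept improvements; accept uphill moves with prob e^{-dE/T}
            if dE < 0 or random.random() < math.exp(-dE / T):
                c = c2
                if energy(c) < best_e:
                    best, best_e = c, energy(c)
        T *= cooling              # decrease the temperature each phase
    return best

# Toy usage: minimize (x-3)^2 over the integers
print(anneal(0, lambda x: x + random.choice((-1, 1)), lambda x: (x - 3) ** 2))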
void annealing()(Continue – Problems)
Problems:
- Not all states can yield our new sample (only the ones containing at least one vector from S\basis).
- The probability that a "capable" state will yield the zero vector is 1/2^k.
- The probability that some 1 ≤ j ≤ k vectors from S will yield a solution is 1 − (1 − 1/2^k)^(|S| choose j). Note: when |S| ≈ k the expression above approaches zero.
void annealing()(Continue – Reduction)
Idea: sample a little more than is needed: |S| = O(c·k). Assign each vector its Hamming weight and sort S by it.
Reduction:
- Spawning the next generation: all the states which include a vector whose Hamming weight is ≤ 2·wt(l)
Cost: f(l) = Σ_i 1/(wt(l_i)+1)^m over the vectors l_i of state l, where m > 1
Transition probability:
P(l ∈ Gen(S′, i)) = 1 if f(l) ≤ f(i), and exp((f(i) − f(l))/T) if f(l) > f(i)
void annealing()(Continue – Convergence & Complexity ??)
Complexity: O(τ·L·ln|S′|)
where L denotes the number of steps to reach quasi-equilibrium in each phase, τ denotes the computation time of a transition, and ln(|S′|) denotes the number of phases needed to reach an accepted solution, using a polynomial-time cooling schedule.
}
"I don't even see the code anymore… all I can see now are blondes, brunettes, redheads…" – Cypher ("The Matrix")
void appendix()([GL])
Theorem: suppose we have oracle access to a random process b_x : {0,1}^n → {0,1}, so that
Pr_{r ∈ {0,1}^n}[b_x(r) = b(x,r)] ≥ ½ + ε
where the probability is taken uniformly over the internal coin tosses of b_x and all possible choices of r, and b(x,r) denotes the inner product mod 2 of x and r.
Then we can, in time polynomial in n/ε, output a list of strings that contains x with probability at least ½.
void appendix()(Continue – [GL] – highway)
How??
1st way (to extract x_i):
Suppose s(x) = Pr_r[b_x(r) = b(x,r)] ≥ 3/4 + ε (hmmm??)
The probability that both b_x(r) = b(x,r) and b_x(r⊕e_i) = b(x,r⊕e_i) hold is at least 1 − 2·(1/4 − ε) = ½ + 2ε, and in that case
b_x(r) ⊕ b_x(r⊕e_i) = b(x,r) ⊕ b(x,r⊕e_i) = b(x,e_i) = x_i
but…
void appendix()(Continue – [GL] – better way)
2nd way:
Idea: guess b(x,r) by ourselves.
Problem: we need the guess to be correct for polynomially many r's simultaneously.
Solution: generate polynomially many r's so that they are "sufficiently" random, yet we can still guess all of them with non-negligible probability.
void appendix()(Continue – [GL] – better way)
Construction:
- Select l = log(poly(n)+1) strings uniformly in {0,1}^n and denote them s_1,…,s_l
- Guess b(x,s_1),…,b(x,s_l); the probability that all guesses are correct is 2^{−l} = 1/poly(n)
- Assign to each non-empty subset J ⊆ {1,…,l} the vector r^J = ⊕_{j∈J} s_j
- Note that: b(x,r^J) = b(x, ⊕_{j∈J} s_j) = ⊕_{j∈J} b(x,s_j)
- Try all possibilities for the guesses b(x,s_1),…,b(x,s_l) and output a list of 2^l candidates z ∈ {0,1}^n for x
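A toy check of the linearity the construction relies on – a sketch in Python, not the full GL list-decoder; parameters are illustrative:

import itertools
import numpy as np

rng = np.random.default_rng(4)
n, l = 16, 4
x = rng.integers(0, 2, n, dtype=np.uint8)        # the hidden string
s = rng.integers(0, 2, (l, n), dtype=np.uint8)   # s_1, ..., s_l

def b(x, r):
    return int(x @ r) % 2                        # inner product mod 2

sigma = [b(x, s[j]) for j in range(l)]           # our l guesses, assumed correct

# By linearity, the guess for every r^J then follows for free:
for size in range(1, l + 1):
    for J in itertools.combinations(range(l), size):
        rJ = np.bitwise_xor.reduce(s[list(J)], axis=0)   # r^J = XOR of s_j, j in J
        assert b(x, rJ) == sum(sigma[j] for j in J) % 2
print("all", 2 ** l - 1, "derived guesses are consistent")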