1
Computational Learning Theory and Kernel Methods
Tianyi Jiang, March 8, 2004
2
General Research Question
“Under what conditions is successful learning possible and impossible?”
“Under what conditions is a particular learning algorithm assured of learning successfully?”
-Mitchell, ‘97
3
Computational Learning Theory
1. Sample Complexity
2. Computational Complexity
3. Mistake Bound
-Mitchell, ‘97
4
Problem Setting
Instance Space: X, with a stable distribution D
Concept Class: C, s.t. c: X → {0,1}
Hypothesis Space: H
General Learner: L
5
Error of a Hypothesis
[Figure: instance space with positive and negative examples, showing the regions where c and h disagree]
6
PAC Learnability
True Error: error_D(h) ≡ Pr_{x∈D}[ c(x) ≠ h(x) ]
Difficulties in getting 0 error:
1. Multiple hypotheses may be consistent with the training examples
2. Training examples can mislead the Learner
7
PAC-Learnable
Learner L will output a hypothesis h with probability (1 − δ) s.t. error_D(h) ≤ ε
in time that is polynomial in 1/ε, 1/δ, n, and size(c)
where 0 < ε < 1/2 and 0 < δ < 1/2
n = size of a training example
size(c) = encoding length of c in C
8
Consistent Learner & Version Space
Consistent Learner – Outputs hypotheses that perfectly fit the training data whenever possible
Version Space: VS_{H,E} = { h ∈ H | (∀⟨x, c(x)⟩ ∈ E) h(x) = c(x) }
VS_{H,E} is ε-exhausted with respect to c and D if: (∀h ∈ VS_{H,E}) error_D(h) < ε
9
Version Space
[Figure: Hypothesis space H (ε = .21). Six hypotheses, each labeled with its training error r and true error: (r=.2, error=.1), (r=.1, error=.3), (r=.4, error=.3), (r=.3, error=.2), (r=0, error=.2), (r=0, error=.1). The version space VS_{H,E} consists of the hypotheses with r = 0.]
10
Sample Complexity for Finite Hypothesis Spaces
Theorem (ε-exhausting the version space):
If H is finite, the probability that VS_{H,E} is NOT ε-exhausted (with respect to c) is at most:
|H| e^{−εm}
where E is a sequence of m ≥ 1 independently drawn random examples of some target concept c, and 0 ≤ ε ≤ 1
11
Upper bound on sufficient number of training examples
If we set the probability of failure below some level δ,
|H| e^{−εm} ≤ δ
then…
m ≥ (1/ε)(ln|H| + ln(1/δ))
… however, this is too loose a bound due to the |H| term
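As a minimal sketch (not from the original slides), this bound can be evaluated directly; the function name and the example values of |H|, ε, and δ are illustrative assumptions:

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Sufficient examples for a consistent learner over a finite hypothesis
    space: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 1000 hypotheses, error at most 0.05 with probability 0.99
print(pac_sample_bound(1000, epsilon=0.05, delta=0.01))   # -> 231
```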
12
Agnostic Learning
What if concept c ∉ H?
Agnostic Learner: simply finds the h with min. training error
Find upper bound on m s.t.
error_D(h_best) ≤ error_E(h_best) + ε
Where hbest = h with lowest training error
13
Upper bound on sufficient number of training examples when error_E(h_best) ≠ 0
From Chernoff bounds, we have:
Pr[ error_D(h) > error_E(h) + ε ] ≤ e^{−2mε²}
then…
Pr[ (∃h ∈ H) error_D(h) > error_E(h) + ε ] ≤ |H| e^{−2mε²}
thus…
m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
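A hedged Python sketch of this agnostic bound, analogous to the one above (the function name and the sample numbers are illustrative):

```python
import math

def agnostic_sample_bound(h_size, epsilon, delta):
    """Sufficient m so that, with prob. 1-delta, the hypothesis with lowest
    training error has true error within epsilon of its training error:
    m >= (1 / (2*eps^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2))

print(agnostic_sample_bound(1000, epsilon=0.05, delta=0.01))  # -> 2303
```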
14
Example:
Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1 95% of the time?
|H| = ?   ε = ?   δ = ?
15
Example:
Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1 95% of the time?
|H| = 3^10 (each of the 10 literals can appear positively, appear negated, or be absent)
ε = .1, δ = .05
m ≥ (1/ε)(n·ln 3 + ln(1/δ)) = (1/.1)(10·ln 3 + ln(1/.05)) ≈ 140
16
Sample Complexity for Infinite Hypothesis Spaces
Consider a subset of instances S ⊆ X and an h ∈ H; h imposes a dichotomy on S, i.e. two subsets: {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}
Thus for any instance set S, there are 2|S| possible dichotomies.
Definition: A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some h ∈ H consistent with that dichotomy
17
3 Instances Shattered by 8 Hypotheses
Instance Space X
18
Vapnik-Chervonenkis Dimension
Definition: VC(H) is the size of the largest finite subset of X shattered by H.
If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞
For any finite H, VC(H) ≤ log_2|H|
19
Example of VC Dimension
Along a line…
In a plane…
20
VC Dimension Example 2
21
VC Dimension in R^n
Theorem: Consider some set of m points in Rn. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes iff the position vectors of the remaining points are linearly independent.
So the VC dimension of the set of oriented hyperplanes in R^10 is ?
22
Bounds on m with VC Dimension
Upper Bound:
m ≥ (1/ε)(4 log_2(2/δ) + 8·VC(H)·log_2(13/ε))
(recall VC(H) ≤ log_2|H|)
Lower Bound:
m ≥ max[ (1/ε)·log(1/δ), (VC(C) − 1)/(32ε) ]
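A small sketch (my own illustration, not from the slides) that evaluates both bounds; the choice of VC dimension, ε, δ, and base-2 logarithms here are assumptions for the example:

```python
import math

def vc_upper_bound(vc_h, epsilon, delta):
    """Sufficient m: (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_h * math.log2(13 / epsilon)) / epsilon)

def vc_lower_bound(vc_c, epsilon, delta):
    """Necessary m: max[(1/eps)*log2(1/delta), (VC(C)-1)/(32*eps)]."""
    return math.ceil(max(math.log2(1 / delta) / epsilon, (vc_c - 1) / (32 * epsilon)))

# e.g. a concept class with VC dimension 10, eps = 0.1, delta = 0.05
print(vc_upper_bound(10, 0.1, 0.05), vc_lower_bound(10, 0.1, 0.05))
```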
23
Mistake Bound Model of Learning
“How many mistakes will the learner make in its predictions before it learns the target concept?”
The best algorithm, in the worst-case scenario (hardest target concept, hardest training sequence), will make Opt(C) mistakes, where
VC(C) ≤ Opt(C) ≤ log_2(|C|)
24
Linear Support Vector Machines
Consider a binary classification problem:
Training data: {x_i, y_i}, i = 1, …, l;  y_i ∈ {−1, +1};  x_i ∈ R^d
Points x that lie on the separating hyperplane satisfy: w·x + b = 0
where w is normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w
25
Linear Support Vector Machine, Definitions
Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) example
Margin of a separating hyperplane = d+ + d− = 1/||w|| + 1/||w|| = 2/||w||
Constraints:
x_i·w + b ≥ +1  for y_i = +1
x_i·w + b ≤ −1  for y_i = −1
which combine into: y_i(x_i·w + b) − 1 ≥ 0, ∀i
26
Linear Separating Hyperplane for the Separable Case
27
Problem of Maximizing the Margins
H1 and H2 (the hyperplanes x·w + b = +1 and x·w + b = −1) are parallel, with no training points between them
Thus we reformulate the problem as:
Maximize the margin by minimizing ||w||²
s.t. y_i(x_i·w + b) − 1 ≥ 0, ∀i
28
Ties to Least Squares
[Figure: least-squares fit of y against x with intercept b]
y = f(x) = w·x + b
Loss Function: L(w, b) = Σ_{i=1..l} (y_i − w·x_i − b)²
29
Lagrangian Formulation
1. Transform constraints into Lagrange multipliers
2. Training data will only appear in the form of dot products
Let α_i, i = 1, …, l, be positive Lagrange multipliers
We have the Lagrangian:
L_P = ½||w||² − Σ_{i=1..l} α_i y_i(x_i·w + b) + Σ_{i=1..l} α_i
s.t. α_i ≥ 0
30
Transform the convex quadratic programming problem
Observations: minimizing L_P w.r.t. w and b, while simultaneously requiring that
∂L_P/∂α_i = 0 for all i, subject to α_i ≥ 0,
is a convex quadratic programming problem that can be more easily solved in its dual form
31
Transform the convex quadratic programming problem – the Dual
L_P's Dual: maximize L_P subject to the constraints that the gradients of L_P w.r.t. w and b vanish, and that α_i ≥ 0
∂L_P/∂w = 0  ⇒  w = Σ_i α_i y_i x_i
∂L_P/∂b = 0  ⇒  Σ_i α_i y_i = 0
Substituting these back gives the dual objective:
L_D = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i·x_j
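To make the dual concrete, here is a small sketch (not from the slides) that solves this quadratic program for a toy 2-D dataset, assuming the cvxopt package is available; the data points and the 1e-6 support-vector threshold are made up for illustration:

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

# Toy, hypothetical 2-D training set (linearly separable)
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

# Dual in cvxopt's standard form:  min 1/2 a'Pa + q'a  s.t.  Ga <= h, Aa = b
P = matrix(np.outer(y, y) * (X @ X.T))            # P_ij = y_i y_j x_i . x_j
q = matrix(-np.ones(l))                           # maximize sum(a) = minimize -sum(a)
G = matrix(-np.eye(l)); h = matrix(np.zeros(l))   # alpha_i >= 0
A = matrix(y.reshape(1, -1)); b = matrix(0.0)     # sum_i alpha_i y_i = 0

solvers.options['show_progress'] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).flatten()

# Recover the primal solution: w = sum_i alpha_i y_i x_i, b from a support vector
w = (alpha * y) @ X
sv = alpha > 1e-6
b_val = np.mean(y[sv] - X[sv] @ w)
print("alphas:", alpha.round(3), "w:", w.round(3), "b:", round(b_val, 3))
```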
32
Observations about the Dual
L_D = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i·x_j
• There is a Lagrange multiplier α_i for every training point
• In the solution, points for which α_i > 0 are called "support vectors"; they lie on either H1 or H2
• Support vectors are critical elements of the training set: they lie closest to the decision boundary
• If all other points were removed or moved around (without crossing H1 or H2), the same separating hyperplane would be found
33
Prediction
• Solving the SVM problem is equivalent to finding a solution to the Karush-Kuhn-Tucker (KKT) conditions (the KKT conditions are satisfied at the solution of any constrained optimization problem of this form)
Once we have solved for w and b, we predict the class of x as sign(w·x + b)
34
Linear SVM: The Non-Separable Case
We account for outliers by introducing slack variables ξ_i:
x_i·w + b ≥ +1 − ξ_i  for y_i = +1
x_i·w + b ≤ −1 + ξ_i  for y_i = −1
ξ_i ≥ 0, ∀i
We penalize outliers by changing the cost function to:
min ½||w||² + C Σ_i ξ_i
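A hedged usage sketch of the soft-margin formulation (not part of the slides), assuming scikit-learn is available; the toy data, the single deliberate outlier, and the choice C = 1.0 are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Toy data with one outlier; slack variables let it be misclassified at cost C
X = np.array([[2.0, 2.0], [2.5, 1.5], [1.8, 2.2],
              [0.5, 0.5], [1.0, 0.0], [2.2, 1.8]])
y = np.array([1, 1, 1, -1, -1, -1])   # last point sits on the "wrong" side

# C trades margin width against the total slack sum_i xi_i
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", clf.support_vectors_)
print("prediction for (1.5, 1.5):", clf.predict([[1.5, 1.5]]))
```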
35
Example of Linear SVM with slacks
36
Linear SVM Classification Examples
Linearly Separable Linearly Non-Separable
37
Nonlinear SVM
Observation: data appear as dot products in the training problem
So we can use a mapping function Φ to map the data into a high-dimensional space where the points are linearly separable:
Φ: R^d → H
To make things easier, we define a kernel function K s.t.
K(x_i, x_j) = Φ(x_i)·Φ(x_j)
38
Nonlinear SVM (cont.)
Kernel functions can compute dot products in the high-dimensional space without explicitly working with Φ
Example: K(x_i, x_j) = e^{−||x_i − x_j||² / (2σ²)}
Rather than computing w, we make a prediction on x via:
f(x) = Σ_{i=1..N_S} α_i y_i Φ(s_i)·Φ(x) + b = Σ_{i=1..N_S} α_i y_i K(s_i, x) + b
where the s_i are the N_S support vectors
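A small sketch (my own illustration, not from the slides) of this kernelized decision function, assuming a Gaussian kernel with σ = 1 and made-up support vectors, multipliers, and bias:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def predict(x, support_vecs, alphas, labels, b, sigma=1.0):
    """f(x) = sign( sum_i alpha_i y_i K(s_i, x) + b ) -- no explicit w or Phi needed."""
    s = sum(a * y * rbf_kernel(sv, x, sigma)
            for a, y, sv in zip(alphas, labels, support_vecs))
    return np.sign(s + b)

# Hypothetical solution of the dual: two support vectors with their multipliers
support_vecs = [np.array([2.0, 2.0]), np.array([0.5, 0.5])]
alphas, labels, b = [0.8, 0.8], [+1, -1], 0.0
print(predict(np.array([1.8, 1.9]), support_vecs, alphas, labels, b))   # -> 1.0
```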
39
Example of mapping
Image, in H, of the square [−1,1] × [−1,1] ⊂ R² under the mapping Φ
40
Example Kernel Functions
Kernel functions must satisfy Mercer's condition, or, more simply, the Hessian matrix
H_ij = y_i y_j K(x_i, x_j)
must be positive semidefinite (non-negative eigenvalues).
Example Kernels:
K(x_i, x_j) = (x_i·x_j + 1)^p
K(x_i, x_j) = tanh(κ x_i·x_j − δ)
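A quick numerical check of this condition (my own sketch, not from the slides): build the Hessian for a polynomial kernel on a few sample points and verify that its eigenvalues are non-negative; the sample points, labels, and degree p = 2 are assumptions.

```python
import numpy as np

def poly_kernel(a, b, p=2):
    return (np.dot(a, b) + 1.0) ** p

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])

# H_ij = y_i y_j K(x_i, x_j); Mercer's condition <=> H is positive semidefinite
K = np.array([[poly_kernel(a, b) for b in X] for a in X])
H = np.outer(y, y) * K
# all eigenvalues should be >= 0 (up to floating-point round-off)
print("eigenvalues:", np.round(np.linalg.eigvalsh(H), 4))
```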
41
Nonlinear SVM Classification Examples (Degree 3 Polynomial Kernel)
Linearly Separable Linearly Non-Separable
42
Multi-Class SVM
1. One-against-all (sketched below)
2. One-against-one (majority vote)
3. One-against-one (DAGSVM)
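As a sketch of the first scheme (one-against-all), assuming scikit-learn's binary SVC and a made-up 3-class toy set: train one classifier per class against the rest and predict the class with the largest decision value.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [0.0, 2.0], [0.1, 2.2]])
y = np.array([0, 0, 1, 1, 2, 2])   # three classes

# One-against-all: one binary SVM per class (that class = +1, everything else = -1)
classifiers = {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1))
               for c in np.unique(y)}

def predict(x):
    # pick the class whose classifier is most confident (largest decision value)
    return max(classifiers, key=lambda c: classifiers[c].decision_function([x])[0])

print(predict([1.9, 2.1]))   # -> 1
```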
43
Global Solution and Uniqueness
• Every local solution is also global (property of any convex programming problem)
• Solution is guaranteed unique if the objective function is strictly convex (Hessian matrix is positive definite)
44
Complexity and Scalability
Curse of dimensionality:
1. The proliferation of parameters causes intractable complexity
2. The proliferation of parameters causes overfitting
SVMs circumvent these via the use of:
1. Kernel functions (the kernel trick), which compute the needed dot products at O(d_L) cost
2. Support vectors, which focus the solution on the "boundary"
45
Structural Risk Minimization
Empirical Risk:
R_emp(α) = (1/(2l)) Σ_{i=1..l} |y_i − f(x_i, α)|
Expected Risk:
With probability 1 − η,
R(α) ≤ R_emp(α) + √[ (h(log(2l/h) + 1) − log(η/4)) / l ]
where h is the VC dimension of the family of functions f(·, α)
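A hedged Python sketch of these two quantities (my own illustration, not from the slides); the labels, predictions, VC dimension h, and confidence η below are made-up values:

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """R_emp = (1/(2l)) * sum_i |y_i - f(x_i)|  with labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.abs(y_true - y_pred).sum() / (2.0 * len(y_true))

def vc_confidence(h, l, eta=0.05):
    """The confidence term added to R_emp to bound the expected risk."""
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

y_true = [1, 1, -1, -1, 1, -1, 1, -1]
y_pred = [1, -1, -1, -1, 1, -1, 1, 1]           # two mistakes -> R_emp = 0.25
print(empirical_risk(y_true, y_pred), round(vc_confidence(h=3, l=len(y_true)), 3))
```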
46
Structural Risk Minimization
Nested subsets of functions, ordered by VC dimensions