Mathematics and Statistics for International Development Engineering
Yukihiko Yamasita
Dept. of International Development Engineering
Tokyo Institute of Technology
Contents
1. Linear algebra
2. Statistical estimation
3. Optimization
1 Introduction
1.1 Purpose
To learn basic mathematics and statistics for international development engineering.

1.2 Linear algebra
• Eigenvalue problem
  – Physical analysis
  – Statistical optimization (dimensionality reduction)
• Singular value decomposition (SVD)
  – Approximation of a matrix
  – Regularization (rank reduction)
• Generalized inverses of a matrix
  – Inverse filter
  – Limit of regularization
• Octave (GPL software for linear algebra)
1.3 Statistical estimation
Estimate the origin from results.
• Random variable
• Mean and variance
• Well-known probability distributions
• Characterization of the normal distribution
• Coordinate transforms
• Test (kentei)
• Estimator (suiteiryou), estimate (suiteichi)
• Unbiased estimators
• Cramer-Rao lower bound for unbiased estimators
• Statistical learning theory
1.4 Optimization
• Optimization is very important. It is used for calculating
  – statistical estimation,
  – economic planning,
  – environmental assessment.
• Gradient methods
  – Maximum gradient method
  – Conjugate gradient method
• Newton's method (quasi-Newton methods)
  – Second-order derivatives are considered.
• Conditional optimization
• Support vector machine for pattern recognition
1.5 Global maximum vs. local maximum
• We can only see our neighborhood.
  – Assume that we optimize f (x1, x2, . . . , xN) and search 10 points for each xi ⇒ calculation at 10^N points ⇒ impossible when N is large.
• There is no general method to solve this problem. Heuristic methods:
  – Inertia term: α · (new change) + (1 − α) · (previous change)
  – Homotopy method: a simple function → a complex function
  – Simulated annealing
  – GA (genetic algorithm)
1.6 Schedule
10/8 Introduction, projection matrix
10/15 Eigenvalue problem, singular value problem
10/22 Generalized inverse
10/29 Octave (program for linear algebra)
11/5 Probability I
11/6 Probability II
11/19 Normal distribution
11/26 Test and estimation I
12/3 Test and estimation II
12/10 Cramer-Rao lower bound
12/17 Statistical learning theory
12/24 Maximum gradient method, conjugate gradient method
1/7 Quasi-Newton's method
1/14 Conditional optimization
1/25 Support vector machine
2 Eigenvalue problem
2.1 Range and null space of a matrix
Definition
• Rn : n-dimensional real vector space
• S is a subspace if and only if for any vectors x and y in S and any scalars α and β, αx + βy is in S.
• R(A) ≡ {Ax | x ∈ Rm} : the range of an (n,m)-matrix A
• N(A) ≡ {x ∈ Rm | Ax = 0} : the null space of an (n,m)-matrix A
• R(A) and N(A) are subspaces.
(Proof.) If x, y ∈ R(A), there exist f and g such that x = A f and y = Ag. Then, αx + βy = αA f + βAg = A(α f + βg) yields αx + βy ∈ R(A).
If x, y ∈ N(A), we have Ax = 0 and Ay = 0. Then, A(αx + βy) = 0 so that αx + βy ∈ N(A).

Problem 1. Prove the underlined equation A(αx + βy) = 0.
2.2 Projection matrix (projector)
Definition 1. A matrix P is a projection matrix if and only if P2 = P.
For a projection matrix P, we have
Px = P2x = P3x = P4x = · · ·
Lemma 1. For a projection matrix P, we have
R(P) = N(I − P), (1)
N(P) = R(I − P), (2)
R(P) ∩ N(P) = {0}, (3)
R(P) + N(P) = Rn. (4)
Remarks:
• The sum of the subspaces R(P) and N(P) is Rn.
• The intersection of R(P) and N(P) is only {0}.
• P is the projection matrix onto R(P) along N(P).
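As a quick numerical check of Lemma 1, the sketch below uses Python/NumPy rather than the Octave introduced later in these notes; the matrix P is a small hypothetical oblique projector, not one from the text.

```python
import numpy as np

# A small hypothetical oblique projector: P^2 = P but P^T != P.
P = np.array([[1.0, 1.0],
              [0.0, 0.0]])
assert np.allclose(P @ P, P)                 # idempotent: P^2 = P

x = np.array([3.0, 2.0])
# Px = P^2 x = P^3 x = ...
assert np.allclose(P @ x, P @ (P @ x))

# R(P) = N(I - P): any vector Pg is annihilated by I - P.
I = np.eye(2)
g = np.array([1.0, -4.0])
assert np.allclose((I - P) @ (P @ g), 0.0)

# Decomposition R^n = R(P) + N(P): x = Px + (I - P)x.
assert np.allclose(P @ x + (I - P) @ x, x)
```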
(Proof.) Assume that f ∈ R(P). There exists g such that f = Pg and

(I − P) f = (I − P)Pg = Pg − PPg = Pg − Pg = 0.

Thus we have
R(P) ⊂ N(I − P). (5)
Conversely, assume that f ∈ N(I − P). We have (I − P) f = 0 and f = P f . Then f ∈ R(P) holds and it follows that
R(P) ⊃ N(I − P). (6)
Equations (5) and (6) yield eq. (1).

(I − P)(I − P) = I − P − P + P2 = I − P (7)

yields that I − P is also a projection matrix. Substituting I − P for P in eq. (1), we have eq. (2).
When f ∈ R(P), there exists g such that f = Pg. Then, we have
f = Pg = PPg = P f
Assume that f ∈ R(P) ∩ N(P) = R(P) ∩ R(I − P). Then, f = P f and f = (I − P) f hold, so that f = P f = P(I − P) f = (P − PP) f = O f = 0, and eq. (3) follows. In general, we have R(A + B) ⊂ R(A) + R(B) for any matrices A and B; then eq. (4) holds. QED (quod erat demonstrandum).

Problem 2. Prove the underlined formula R(A + B) ⊂ R(A) + R(B).
2.2.1 Orthogonal projection matrix
Definition 2. For vectors x and y and subspaces S1 and S2,
• x ⊥ y : ⟨x, y⟩ = 0,
• x ⊥ S1 : ⟨x, y⟩ = 0 for any y ∈ S1,
• S1 ⊥ S2 : ⟨x, y⟩ = 0 for any x ∈ S1 and any y ∈ S2,
• S⊥ = {x ∈ Rn | x ⊥ S}.
Lemma 2. Let S1 and S2 be subspaces. If S1 ⊂ S2, then
S1⊥ ⊃ S2⊥.
(Proof.) Since S1 ⊂ S2, y ∈ S2⊥ yields that y ∈ S1⊥. QED.
Lemma 3. For a subspace S we have
S⊥⊥ = S.
For the proof, the existence of the orthogonal projection has to be proved (difficult).
Lemma 4. Let S1 and S2 be subspaces. We have
(S1 ∩ S2)⊥ = S1⊥ + S2⊥ (= {x + y | x ∈ S1⊥, y ∈ S2⊥}).
(Proof.)
(S1 ∩ S2)⊥ ⊃ S1⊥ and (S1 ∩ S2)⊥ ⊃ S2⊥ yield (S1 ∩ S2)⊥ ⊃ S1⊥ + S2⊥.
(S1⊥ + S2⊥)⊥ ⊂ (S1⊥)⊥ = S1 and (S1⊥ + S2⊥)⊥ ⊂ (S2⊥)⊥ = S2 yield
(S1⊥ + S2⊥)⊥ ⊂ S1 ∩ S2 and S1⊥ + S2⊥ ⊃ (S1 ∩ S2)⊥. QED.
Lemma 5. Let AT be the transpose of A. We have
R(AT) = N(A)⊥, (8)
N(AT) = R(A)⊥. (9)
(Proof.) For all x ∈ R(A)⊥, we have ⟨x, Ay⟩ = 0 for all y ∈ Rm
⇒ ⟨AT x, y⟩ = 0 for all y ⇒ AT x = 0 ⇒ x ∈ N(AT) ⇒ R(A)⊥ ⊂ N(AT).
Conversely, for all x ∈ N(AT) and all y ∈ Rm,
0 = ⟨AT x, y⟩ = ⟨x, Ay⟩ ⇒ x ∈ R(A)⊥ ⇒ N(AT) ⊂ R(A)⊥.
Then, eq. (9) is proved. The rest of this lemma is clear. QED.
Problem 3. Prove the underlined sentence.
Hint: Use (AT)T = A and the fact that S = T yields S⊥ = T⊥.
Lemma 6. PT is the projection matrix onto N(P)⊥ along R(P)⊥.
(Proof.)
PT PT = (P2)T = PT
and Lemma 5 yield the lemma. QED.
Definition 3. A matrix P is an orthogonal projection matrix if and only if
PT P = P. (10)
Remark: Since P = PT yields N(P) = N(PT) = R(P)⊥, it can indeed be called the orthogonal projection matrix.
Remark: Equation (10) is equivalent to the following system of equations:
P2 = P, (11)
PT = P. (12)
(Proof.) When PT P = P, we have PT = (PT P)T = PT P = P, and hence P2 = PT P = P. Conversely, eqs. (11) and (12) yield PT P = PP = P. QED.
Lemma 7. For any x ∈ Rn, Px provides the nearest element of R(P).
(Proof.) For any element y ∈ R(P), Py = y and PT(I − P) = PP − P = O yield
∥x − y∥2 = ∥x − Px + Px − y∥2 = ∥(I − P)x + P(x − y)∥2
= ⟨(I − P)x + P(x − y), (I − P)x + P(x − y)⟩
= ∥(I − P)x∥2 + 2⟨(I − P)x, P(x − y)⟩ + ∥P(x − y)∥2
= ∥x − Px∥2 + 2⟨PT(I − P)x, x − y⟩ + ∥Px − Py∥2
= ∥x − Px∥2 + ∥Px − y∥2
≥ ∥x − Px∥2.
The inequality ∥x − y∥2 ≥ ∥x − Px∥2 implies that Px is the nearest to x in R(P). QED.
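Lemma 7 can also be illustrated numerically. A minimal NumPy sketch: a random 5×2 matrix is a hypothetical example whose range plays the role of R(P), and Q is an orthonormal basis of that range obtained by QR.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(A)            # orthonormal basis of R(A)
P = Q @ Q.T                        # orthogonal projector onto R(A)

assert np.allclose(P.T @ P, P)     # the defining condition P^T P = P
assert np.allclose(P @ P, P)       # hence P^2 = P
assert np.allclose(P.T, P)         # and P^T = P

x = rng.standard_normal(5)
best = np.linalg.norm(x - P @ x)
# Px is at least as close to x as any other element y = Pg of R(P).
for _ in range(100):
    y = P @ rng.standard_normal(5)
    assert np.linalg.norm(x - y) >= best - 1e-12
```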
2.3 Eigenequation
A : a symmetric real (n, n)-matrix.
Eigenequation:
Ax = λx (13)
λ : eigenvalue
x : eigenvector (x ≠ 0)
There exist n solutions (including degeneration):
eigenvalues: λ1 ≥ λ2 ≥ · · · ≥ λn
eigenvectors: u1, u2, . . . , un
Aui = λi ui
We can choose {ui}_{i=1}^{n} as an orthonormal basis (ONB):
⟨ui, u j⟩ = δi j = 1 (i = j), 0 (i ≠ j). (14)
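These properties can be checked with a numerical eigensolver. A small NumPy sketch: the symmetric matrix A is a random example, and `numpy.linalg.eigh` returns eigenvalues in ascending order, so they are reversed to match λ1 ≥ · · · ≥ λn.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # a symmetric real matrix

lam, U = np.linalg.eigh(A)             # eigh: eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]         # reorder as lambda_1 >= ... >= lambda_n

for i in range(4):
    # A u_i = lambda_i u_i
    assert np.allclose(A @ U[:, i], lam[i] * U[:, i])
# the eigenvectors form an ONB: <u_i, u_j> = delta_ij
assert np.allclose(U.T @ U, np.eye(4))
```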
(Proof.) From eq. (13), we have
(A − λI)x = 0, (15)
where I is the unit matrix. Since x ≠ 0, A − λI must not be regular. Therefore,
det(A − λI) = 0. (16)
This is a polynomial of degree n with respect to λ, so there exist n solutions in C (the set of complex numbers). We denote a solution in C by λ. For this λ, there exists x = (xi) in Cn such that x ≠ 0 and
(A − λI)x = 0. (17)
We prove λ ∈ R. Let x̄i denote the complex conjugate of xi. Since Ai j = A ji ∈ R, taking the complex conjugate of eq. (17) gives ∑_{i=1}^{n} A ji x̄i = λ̄ x̄ j, and we have
λ∥x∥2 = λ ∑_{i=1}^{n} |xi|2 = ∑_{i=1}^{n} λ xi x̄i = ∑_{i=1}^{n} ∑_{j=1}^{n} Ai j x j x̄i
= ∑_{j=1}^{n} x j ∑_{i=1}^{n} A ji x̄i = ∑_{j=1}^{n} x j λ̄ x̄ j = λ̄ ∑_{j=1}^{n} |x j|2 = λ̄ ∥x∥2
so that λ = λ̄ and λ ∈ R. Since
∑_{j=1}^{n} Ai j (x j + x̄ j)/2 = λ(xi + x̄i)/2 (18)
and (xi + x̄i)/2 is the real part of xi, we may assume that all xi are real numbers without loss of generality.
From the above discussion, we know that at least one pair of a real eigenvalue and a real eigenvector exists. Let λn = λ, un = x/∥x∥, and let {vi}_{i=1}^{n−1} be an orthonormal system (ONS) of the subspace which is orthogonal to un. We have ⟨vi, v j⟩ = δi j and ⟨vi, un⟩ = 0. We define the (n, n − 1)-matrix
V = (v1 v2 · · · vn−1).
It is clear that
VT V = In−1 and VT un = 0.
Since
(VT AV)T = VT AT (VT)T = VT AV,
VT AV is a symmetric (n − 1, n − 1)-matrix.
Hereafter, we use mathematical induction to prove the theorem. We assume that VT AV has eigenvalues λ1, . . . , λn−1 and a set of corresponding eigenvectors u′1, . . . , u′n−1 which is an orthonormal basis (ONB) in Rn−1 (each u′i is an (n − 1)-dimensional vector). Then we have
(VT AV)u′i = λi u′i
and
⟨un, AVu′i⟩ = ⟨Aun, Vu′i⟩ = λn⟨VT un, u′i⟩ = λn⟨0, u′i⟩ = 0.
un ⊥ AVu′i yields AVu′i ∈ R(V), and
λi(Vu′i) = VVT AVu′i = ∑_{j=1}^{n−1} ⟨AVu′i, v j⟩v j = AVu′i,
since {v j}_{j=1}^{n−1} is an ONS of R(V). Then we have
A(Vu′i) = λi(Vu′i) (i = 1, 2, . . . , n − 1)
and
⟨Vu′i, Vu′j⟩ = ⟨VT Vu′i, u′j⟩ = ⟨u′i, u′j⟩ = δi j, ⟨Vu′i, un⟩ = 0.
Therefore, λ1, . . . , λn and Vu′1, . . . , Vu′n−1, un satisfy the condition. QED.
Remark. When λi ≠ λ j, then ⟨ui, u j⟩ = 0:
λi⟨ui, u j⟩ = ⟨Aui, u j⟩ = ⟨ui, AT u j⟩ = ⟨ui, Au j⟩ = λ j⟨ui, u j⟩.
Then, (λi − λ j)⟨ui, u j⟩ = 0, and therefore ⟨ui, u j⟩ = 0 if λi ≠ λ j.
Remark. When λi = λ j = λ, every linear combination of ui and u j is an eigenvector. Since Aui = λui and Au j = λu j, for real numbers α and β
A(αui + βu j) = αAui + βAu j = αλui + βλu j = λ(αui + βu j).
Such vectors span an eigenspace.
Let U = (u1 u2 · · · un). We have
UT U = UUT = I,
UT AU = diag(λ1, . . . , λn), A = U diag(λ1, . . . , λn) UT,
and
A = ∑_{i=1}^{n} λi ui uiT = λ1 u1u1T + · · · + λn ununT.
(Proof of the last equation.) Since {ui} is an ONB, for any x we can write x = ∑_i ⟨x, ui⟩ui. Then,
Ax = A ∑_i ⟨x, ui⟩ui = ∑_i ⟨x, ui⟩Aui = ∑_i λi⟨x, ui⟩ui.
On the other hand, for vectors a and b we have abT x = ⟨x, b⟩a; then
(∑_{i=1}^{n} λi ui uiT) x = ∑_i λi⟨x, ui⟩ui = Ax.
QED.
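The rank-one expansion above can be verified numerically. A NumPy sketch with a random symmetric 3×3 example:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
A = (B + B.T) / 2                      # a symmetric example matrix

lam, U = np.linalg.eigh(A)
# Rebuild A as the sum of rank-one terms lambda_i u_i u_i^T.
A_sum = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(3))
assert np.allclose(A_sum, A)
# Equivalently A = U diag(lam) U^T.
assert np.allclose(U @ np.diag(lam) @ U.T, A)
```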
2.3.1 Principal component analysis (PCA)
Here, PCA for sample data is explained.
• f1, f2, . . . , fK : samples in RN.
• u1, u2, . . . , uN : an ONB.
• For an input vector f , the i-th component is given as ⟨ f , ui⟩ui.
• Choose {ui}_{i=1}^{N} such that the sum of the first to M-th components provides the best approximation of f in the mean-square sense over the samples. That is, for any integer M (1 ≤ M ≤ N), minimize
J = ∑_{k=1}^{K} ∥ ∑_{i=1}^{M} ⟨ fk, ui⟩ui − fk ∥2 (20)
with respect to {ui}_{i=1}^{N}.
• The decomposition by such {ui}_{i=1}^{N} is called PCA, and {ui}_{i=1}^{N} is called a PCA basis.
We define the sample correlation matrix as
R = (1/K) ∑_{k=1}^{K} fk fkT. (21)
R is a symmetric N × N matrix.

Problem 4. Prove that R is symmetric.

Theorem 1. Let λ1 ≥ λ2 ≥ . . . ≥ λN and v1, v2, . . . , vN be the eigenvalues and corresponding eigenvectors of R, respectively. Then {vi}_{i=1}^{N} is a PCA basis.

(Proof.) Assume that {ui}_{i=1}^{N} is a PCA basis. Let P = ∑_{i=1}^{M} ui uiT. P is an orthogonal projection matrix. Then, we have
J = ∑_{k=1}^{K} ∥P fk − fk∥2 = ∑_{k=1}^{K} (∥P fk∥2 − 2⟨P fk, fk⟩ + ∥ fk∥2)
= ∑_{k=1}^{K} (∥P fk∥2 − 2⟨PT P fk, fk⟩ + ∥ fk∥2)
= −∑_{k=1}^{K} ∥P fk∥2 + ∑_{k=1}^{K} ∥ fk∥2.
Therefore, we maximize ∑_{k=1}^{K} ∥P fk∥2. For any vector x, the (i, i) element of xxT is xixi, so it is clear that
tr(xxT) = ∑_{i=1}^{N} xixi = ∥x∥2.
Then,
∑_{k=1}^{K} ∥P fk∥2 = ∑_{k=1}^{K} tr[(P fk)(P fk)T] = tr[P (∑_{k=1}^{K} fk fkT) PT] = K tr(PRPT).
Since R = ∑_{i=1}^{N} λi vi viT,
K tr(PRPT) = K tr[P (∑_{i=1}^{N} λi vi viT) PT] = K ∑_{i=1}^{N} λi ∥Pvi∥2.
Since λ1 ≥ λ2 ≥ · · · ≥ λN,
0 ≤ ∥Pvi∥2 ≤ 1, ∑_{i=1}^{N} ∥Pvi∥2 = M,
K ∑_{i=1}^{N} λi ∥Pvi∥2 is maximized when
∥Pvi∥ = 1 (i ≤ M), 0 (i > M).
Then, R(P) should be spanned by {vi}_{i=1}^{M} for any M, and the theorem holds. QED.
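Theorem 1 can be illustrated numerically: the eigenvector basis of R gives a smaller J than an arbitrary rank-M orthogonal projector. A NumPy sketch with hypothetical sample data (the scaling matrix just makes the directions unequally important):

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, M = 200, 5, 2
# hypothetical samples f_1, ..., f_K as rows of F
F = rng.standard_normal((K, N)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])

R = F.T @ F / K                     # sample correlation matrix, eq. (21)
_, V = np.linalg.eigh(R)
V = V[:, ::-1]                      # columns in descending eigenvalue order

P = V[:, :M] @ V[:, :M].T           # projector onto the top-M eigenvectors
J = np.sum(np.linalg.norm(F @ P - F, axis=1) ** 2)   # eq. (20)

# Any other rank-M orthogonal projector gives an error J at least as large.
Q, _ = np.linalg.qr(rng.standard_normal((N, M)))
J_other = np.sum(np.linalg.norm(F @ (Q @ Q.T) - F, axis=1) ** 2)
assert J <= J_other + 1e-9
```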
2.4 Singular value decomposition (SVD)
A : an (n,m)-matrix (may be non-square or unsymmetric).
We want to decompose A similarly to the symmetric case. AAT is a symmetric (n, n)-matrix, because
(AAT)T = (AT)T AT = AAT.
Then, AAT has eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn ≥ 0 and corresponding eigenvectors u1, u2, . . . , un such that {ui}_{i=1}^{n} is an ONB, and we have
AAT ui = λi ui.
Indeed λi ≥ 0 (i = 1, 2, . . . , n) because
λi∥ui∥2 = ⟨λiui, ui⟩ = ⟨AAT ui, ui⟩ = ⟨AT ui, AT ui⟩ = ∥AT ui∥2.
Similarly, AT A has eigenvalues µ1 ≥ µ2 ≥ . . . ≥ µm ≥ 0 and corresponding eigenvectors w1, w2, . . . , wm. We have
AT A(AT ui) = AT (AAT ui) = λi AT ui.
Then, λi and AT ui are an eigenvalue and a corresponding eigenvector of AT A, respectively. Similarly, µi and Awi are an eigenvalue and a corresponding eigenvector of AAT. Then, we have µi = λi for every i = 1, 2, . . . , min(m, n).
Now we assume that λ1 ≥ λ2 ≥ · · · ≥ λk > 0 and λk+1 = · · · = λn = 0. Let
vi = AT ui/√λi
for i ≤ k, and let {vi}_{i=k+1}^{m} be an arbitrary orthonormal system in the subspace {x | AT Ax = 0}. Then, for any i ≤ k and any j ≤ k, we have
⟨vi, v j⟩ = ⟨AT ui/√λi, AT u j/√λ j⟩ = ⟨AAT ui, u j⟩/(√λi √λ j) = (λi/(√λi √λ j)) ⟨ui, u j⟩ = δi j.
As we will prove later, we have {x | AT Ax = 0} = {x | Ax = 0} = N(A). It follows that Avi = 0 for i > k. Then, for any i ≤ k and j > k, we have
⟨vi, v j⟩ = ⟨AT ui/√λi, v j⟩ = ⟨ui/√λi, Av j⟩ = ⟨ui/√λi, 0⟩ = 0.
Then, for any i, j ≤ m we have
⟨vi, v j⟩ = δi j.
When i ≤ k, the definition of vi and λi > 0 yield
AT ui = √λi vi,
and
Avi = AAT ui/√λi = λi ui/√λi = √λi ui.
We summarize the above (setting λi = 0 for i > k) as
AT ui = √λi vi (i ≤ min(m, n)), AT ui = 0 (else), (22)
Avi = √λi ui (i ≤ min(m, n)), Avi = 0 (else). (23)
We define the (n, n)- and (m,m)-matrices
U = (u1 u2 · · · un), (24)
V = (v1 v2 · · · vm). (25)
Since UT U = I and VT V = I, we have
UUT = UT U = I, VVT = VT V = I,
and
UT AV = UT (Av1 Av2 · · · Avm) = UT (√λ1 u1 √λ2 u2 · · · √λmin(m,n) umin(m,n) 0 · · · 0).
Hence UT AV is the (n,m) diagonal matrix whose (i, i) elements are √λi for i ≤ min(m, n) and whose other elements are zero (with extra zero columns when m > n, and extra zero rows when m < n).
Then, we have
A = U Λ1/2 VT, (26)
where Λ1/2 denotes the (n,m) diagonal matrix with diagonal elements √λ1, √λ2, . . . , √λmin(m,n), and
A = ∑_{i=1}^{k} √λi ui viT = √λ1 u1v1T + √λ2 u2v2T + · · · + √λk ukvkT. (27)
(Proof.) Both sides map each v j identically:
Av j = √λ j u j, (∑_{i=1}^{k} √λi ui viT) v j = √λ j u j (with √λ j = 0 for j > k),
and {vi}_{i=1}^{m} is an ONB. QED.
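The relation between the SVD and the eigenvalues of AAT can be checked numerically; `numpy.linalg.svd` returns the singular values σi = √λi directly. A sketch with a random 4×6 example:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 6))          # a non-square example matrix

U, s, Vt = np.linalg.svd(A)              # singular values s_i = sqrt(lambda_i)
lam = np.linalg.eigvalsh(A @ A.T)[::-1]  # eigenvalues of AA^T, descending
assert np.allclose(s ** 2, lam)

# Rank-one expansion A = sum_i sqrt(lambda_i) u_i v_i^T, eq. (27).
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A_sum, A)
```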
Let the (n, k)-matrix U′ and the (m, k)-matrix V′ be
U′ = (u1, . . . , uk), V′ = (v1, . . . , vk),
respectively. Then
U′T AV′ = diag(√λ1, √λ2, . . . , √λk)
and
A = U′ diag(√λ1, √λ2, . . . , √λk) V′T.

Problem 5. Prove that µi and Awi are an eigenvalue and an eigenvector of AAT, respectively.
2.5 Generalized inverses of a matrix
For an (n,m)-matrix A (which may not be regular), we define:

Type      | Defining equation | Notation for X
1-inverse | AX A = A          | A(1) or A−
2-inverse | X AX = X          | A(2)
3-inverse | (AX)T = AX        | A(3)
4-inverse | (X A)T = X A      | A(4)

A matrix satisfying several conditions is denoted by listing them, e.g. A(1), A(1,2).
A† denotes A(1,2,3,4) and is called the Moore-Penrose generalized inverse of A.
A{1} : the set of all 1-inverse matrices of A.
A{1,2} : the set of all 1,2-inverse matrices of A.
2.5.1 1-inverse
Consider a linear equation
Ax = y. (28)
If eq. (28) has a solution (⇔ y ∈ R(A)), one is given as
x = A−y
with a 1-inverse A− of A.
(Proof.) Since y ∈ R(A), we can write y = Az with a vector z. Then we have
A(A−y) = AA−Az = Az = y. QED.
Since
(A−A)2 = A−AA−A = A−A, (AA−)2 = AA−AA− = AA−,
A−A and AA− are projection matrices.
In general, 1-inverse matrices are not unique. A general form of 1-inverses is given as
X = A− + W − A−AW AA− (29)
with an arbitrary matrix W.
(Proof.)
AX A = A(A− + W − A−AW AA−)A = AA−A + AW A − AA−AW AA−A = A + AW A − AW A = A.
For any 1-inverse B of A, let W = B − A−; then, using ABA = A and AA−A = A, we have
X = A− + (B − A−) − A−A(B − A−)AA− = B − (A−(ABA)A− − A−(AA−A)A−) = B − (A−AA− − A−AA−) = B.
It follows that eq. (29) is a general form. QED.

R(A) = R(AA−), (30)
N(A) = N(A−A). (31)
(Proof.)
R(A) = R(AA−A) ⊂ R(AA−) ⊂ R(A)
yields R(A) = R(AA−). Similarly,
N(A) = N(AA−A) ⊃ N(A−A) ⊃ N(A).
QED.
2.5.2 1,2-inverse
rank(A) = rank(A(1,2)).
(Proof.)
rank(A) = rank(AA(1,2)A) ≤ rank(A(1,2))
and
rank(A(1,2)) = rank(A(1,2)AA(1,2)) ≤ rank(A).
QED.
Lemma 8.
R(A(1,2)) = R(A(1,2)A), (32)
N(A(1,2)) = N(AA(1,2)). (33)
(Proof of eq. (32).) Since
R(A(1,2)) = R(A(1,2)AA(1,2)) ⊂ R(A(1,2)A) ⊂ R(A(1,2)),
R(A(1,2)) = R(A(1,2)A) holds. Eq. (33) can be proved similarly. QED.

Problem 6. Prove R(AB) ⊂ R(A) and N(AB) ⊃ N(B) for any matrices A and B.

For any x and y in R(A(1,2)), we have A(1,2)Ax = x and A(1,2)Ay = y. In this case, Ax = Ay implies x = y because
y = A(1,2)Ay = A(1,2)Ax = x.
Similarly, for any a and b in R(A) such that A(1,2)a = A(1,2)b, we have a = b. Therefore, there is a one-to-one relation between R(A(1,2)) and R(A).
We have the freedom of choosing R(A(1,2)) and N(A(1,2)).
2.5.3 1,2,3,4-inverse
1,3-inverse: N(AA(1,3)) = R(A)⊥ : minimum mean square error.
For a vector y,
∥Ax − y∥ → min. ⇒ x = A(1,3)y.
1,4-inverse: R(A(1,4)A) = N(A)⊥ : minimum norm.
For a vector y ∈ R(A),
∥x∥ → min. subject to Ax = y ⇒ x = A(1,4)y.
The 1,2,3,4-inverse has all of these features: N(A†) = R(A)⊥, R(A†) = N(A)⊥.
Theorem 2. The 1,2,3,4-inverse is unique.
(Proof.) A†, the 1,2,3,4-inverse of A, satisfies AA†A = A, A†AA† = A†, (AA†)T = AA†, and (A†A)T = A†A. Suppose another matrix B satisfies the same equations: ABA = A, BAB = B, (AB)T = AB, and (BA)T = BA. Then we have
B = BAB = BAA†AB = B(A†)T AT BT AT = B(A†)T (ABA)T
= B(A†)T AT = B(AA†)T = BAA† = BAA†AA†
= AT BT AT (A†)T A† = AT (A†)T A† = (A†A)T A† = A†AA† = A†.
QED.
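The four defining conditions can be checked numerically for `numpy.linalg.pinv`, which computes A†. A sketch with a random example matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 5))
Ap = np.linalg.pinv(A)                 # Moore-Penrose inverse A^dagger

assert np.allclose(A @ Ap @ A, A)      # 1-inverse: A X A = A
assert np.allclose(Ap @ A @ Ap, Ap)    # 2-inverse: X A X = X
assert np.allclose((A @ Ap).T, A @ Ap) # 3-inverse: (A X)^T = A X
assert np.allclose((Ap @ A).T, Ap @ A) # 4-inverse: (X A)^T = X A
```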
For every y ∈ Rn, there exists x such that
∥Ax − y∥ → min. and, among such minimizers, ∥x∥ → min.
This x is given by x = A†y.
2.5.4 Calculation of generalized inverses
A− : sweep-out method, LU-decomposition.
A† :
• Using 1-inverses: A† = AT (AAT)− A (AT A)− AT.
• SVD: When
A = ∑_{i=1}^{k} µi ui viT
with µi ≠ 0 (i = 1, 2, . . . , k), then
A† = ∑_{i=1}^{k} (1/µi) vi uiT. (34)
• Regularization:
A† = lim_{ε→0} AT (AAT + εI)−1 = lim_{ε→0} (AT A + εI)−1 AT.

Problem 7. Prove eq. (34). (Examine the 1-, 2-, 3-, and 4-inverse conditions.)
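The regularization formula can be illustrated numerically: for small ε both regularized expressions are close to A†. A NumPy sketch (ε = 10⁻⁸ is an arbitrary small example value):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
Ap = np.linalg.pinv(A)

eps = 1e-8
# Both regularized forms converge to A^dagger as eps -> 0.
R1 = A.T @ np.linalg.inv(A @ A.T + eps * np.eye(4))
R2 = np.linalg.inv(A.T @ A + eps * np.eye(3)) @ A.T
assert np.allclose(R1, Ap, atol=1e-5)
assert np.allclose(R2, Ap, atol=1e-5)
```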
2.5.5 N(A) = N(AT A)
Lemma 9. We have
N(AT A) = N(A), (35)
R(AAT) = R(A). (36)
(Proof.) It is clear that N(AT A) ⊃ N(A). Assume that AT Ax = 0. Then, we have
∥Ax∥2 = ⟨Ax, Ax⟩ = ⟨AT Ax, x⟩ = 0.
∥Ax∥ = 0 yields Ax = 0 and N(AT A) ⊂ N(A). This yields eq. (35).
Since we have N(AAT) = N(AT) by substituting AT for A, Lemma 5 yields that
R(A) = N(AT)⊥ = N(AAT)⊥ = R(AAT).
QED.
2.6 Octave
To do calculations of linear algebra:
• Make a program by ourselves.
• Use packages → faster
  – LAPACK (Linear Algebra PACKage): linear equations, eigenvalue and singular value decompositions
  – BLAS (Basic Linear Algebra Subprograms): adding and multiplying matrices
  – ATLAS (Automatically Tuned Linear Algebra Software): generates BLAS
• Use Octave (or Matlab) → easier
  – They use the packages for matrix operations → faster
  – Their programming language is realized by an interpreter → slower
⇒ It is important to understand their features.

Problem 8. Explain BLAS. (You can find explanations on the web.)
• Substitute 3 into a variable x.octave: > x = 3
• ';' at the end of a statement suppresses displaying the result.
octave: > x = 3;
• Substitute the row vector (1, 2, 3) and the column vector (1, 2, 3)T into the variables a and b, respectively.
octave: > a = [1 2 3]
octave: > b = [1;2;3]
• Substitute the matrix with rows (1, 0, 7), (2, 2, 9), (3, −1, 5) into a variable A.
octave: > A = [1 0 7 ; 2 2 9 ; 3 -1 5]
• In order not to display the results, add ; at the end of sentence.octave: > B = [1 8 -1 ; 3 2 1 ; 3 1 -1];
• Transpositionoctave: > A’
octave: > a’
• Add vectors or matrices.
octave: > A + B
octave: > a' + b
If the sizes do not match, an error message is displayed.
• Multiply vectors or matrices.
octave: > A * B
octave: > A * b
octave: > a * b
• Inner product of c = (2,−1, 4)T and d = (1, 5,−3)T
octave: > c = [2 ; -1 ; 4];
octave: > d = [1 5 -3]’;
octave: > c’ * d
• X = A−1 is given byoctave: > X = inv(A)
• In order to solve a linear equation Ax = b,octave: > x = A\b
• Check the result.octave: > A * x
• The equation can also be solved by using the inverse matrix.
octave: > x = inv(A) * b;
A\b is faster than inv(A) * b.
• Multiplying or dividing individual elements, that is, Cij = Aij × Bij or Cij = Aij/Bij, is given by
octave: > C = A .* B
octave: > C = A ./ B
• Add from 1 to 100octave: > sum = 0;
octave: > for i = 1:100
> sum = sum + i;
> end
octave: > sum
•Write the following program in the file yama1.msumm = 0;
for i = 1:100
summ = summ + i;
end
summ
and inputoctave: > yama1
• function : File name : yama3.mfunction [a b] = yama3(c, d, e)
....................................
• Etc.
  – Display by more : more on, more off
  – Continue the statement after 'Enter' : '...'
  – Zero matrix : zeros(3), zeros(2,3)
  – Matrix of which all elements are one : ones(3), ones(2,3)
  – Size of a matrix : size(A)
  – Refer to / substitute an element of a matrix : A(2,3), A(1,3) = 5
  – Refer to / substitute a part of a matrix :
A = [ 1 2 3 4 5 ; 6 7 8 9 10 ; 11 12 13 14 15; ...
16 17 18 19 20]
A(:,2:4)
A(3:4,2:3) = [-1 -2 ; -4 -5]
  – Eigenvalue expansion : eig(A)
  – SVD : svd(A)
  – Sum : sum(A)
  – Average : mean(a)
  – Variance : var(a)
  – Norm : norm(a)
  – Elementary functions : sin(x), cos(x), tan(x), asin(x), acos(x), atan(x), sinh(x), . . . , exp(x), log(x)
  – Comment line : %
  – String : "This is a string." or 'This is a string.'
  – File I/O : printf, fopen, fclose, fprintf, fscanf, fread, fwrite
• Complex numbers : 1.2 + 3.1i
• Use help or the web.
2.7 Demo
2.7.1 Matrix multiplication
C program (mmult.c):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

double a[1000 * 1000];
double b[1000 * 1000];
double c[1000 * 1000];

int main(void)
{
    int i, j, k, n = 1000;
    double sum;                     /* accumulator must be double, not int */

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            a[i + n * j] = rand();
            b[i + n * j] = rand();
        }
    }
    printf("Finish set \n");

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0;
            for (k = 0; k < n; k++)
                sum += a[i + n * k] * b[k + n * j];
            c[i + n * j] = sum;
        }
    }
    printf("Finish multiplication %f \n", c[0]);
    return 0;
}
Compile and execute
gcc -O3 mmult.c
./a.out
Matlab Program 1 (mmult.m)
a = rand(1000);
b = rand(1000);
printf("Finish setting random values to matrices \n");
c = a * b;
printf("Finish multiplication %f \n", c(1,1));
Execution
octave: > mmult
Let’s compare the calculation time.
2.7.2 Image processing
Input image: orgImg = imread("barb.jpg");
Display image: imshow(orgImg);
Output image: imwrite(empImg, "empImg.jpg");
3×3 image filter (filter2.m):
OutImage(i, j) = ∑_{k=−1}^{1} ∑_{l=−1}^{1} A(k, l) InImage(i + k, j + l)
A(k, l) is a filter. Examples:
horizontal edge detection:
 1/8 0 -1/8
 2/8 0 -2/8
 1/8 0 -1/8
vertical edge detection:
 1/8 2/8 1/8
 0 0 0
 -1/8 -2/8 -1/8
Laplacian:
 0 -1 0
 -1 4 -1
 0 -1 0
function out = imgFilter(filter, inImg)
out = zeros(size(inImg));
c = size(inImg, 3);
[yi xi] = size(inImg);
xi = xi/c;
[yf xf] = size(filter);
xfh = floor((xf-1) / 2);
yfh = floor((yf-1) / 2);
extendImg = zeros(yi + yf, xi + xf, c);
extendImg((1+yfh):(yi+yfh), (1+xfh):(xi+xfh),:) = inImg;
% Extension
for x = 1:xfh
extendImg((1+yfh):(yi+yfh), x, :) = inImg(:,1,:);
extendImg((1+yfh):(yi+yfh), xi+xf-x,:) = inImg(:,xi,:);
endfor
for y = 1:yfh
extendImg(y, :, :) = extendImg(yfh+1,:,:);
extendImg(yi+yf-y, :, :) = extendImg(yfh+1,:,:);
endfor
for y = 1:yf
for x = 1:xf
out(:,:,:) = out(:,:,:) + filter(y,x) * ...
extendImg((1+yf-y):(yi+yf-y),(1+xf-x):(xi+xf-x),:);
endfor
endfor
Example (imgProc.m)
% input image
orgImg = imread("barb.jpg");
% display the input image
figure(1)
imshow(orgImg)
% calculate edge image
divFilt = [0 -1 0 ; -1 4 -1 ; 0 -1 0];
edgeImg = imgFilter(divFilt, orgImg);
% display emphasized image
empImg = edgeImg + orgImg;
empImg = empImg .* (empImg >= 0 & empImg <= 255) ...
+ 255 * (empImg > 255);
figure(2)
imshow(empImg);
% display filter image
edgeImg = edgeImg + 128;
edgeImg = edgeImg .* (edgeImg >= 0 & edgeImg <= 255) ...
+ 255 * (edgeImg > 255);
figure(3)
imshow(edgeImg / 255);
imwrite(empImg, "empImg.jpg");
imwrite(edgeImg, "edgeImg.jpg");
3 Probability
3.1 Probability space
• Probability is defined by using set theory.
Definition of a probability space (V, F, P):
V : a set (sample space)
F : a σ-algebra, that is, a set of subsets of V satisfying
1. V ∈ F.
2. If A ∈ F, then V − A ∈ F.
3. If Ai ∈ F, then ∪_{i=1}^{∞} Ai ∈ F.
• A ∈ F is called an 'event'.
P : a probability measure on F satisfying
(a measure is a mapping from subsets to real numbers)
1. P(A) ≥ 0 for every A ∈ F.
2. P(V) = 1.
3. If Ai ∈ F and Ai ∩ A j = ∅ for i ≠ j, then
P(∪_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai).
Independence
Two events A and B are called independent if and only if
P(A ∩ B) = P(A)P(B).
Conditional probability
The conditional probability of A under the condition of B is given by
P(A|B) = P(A ∩ B)/P(B).
A random variable X is a mapping from V to a set of numbers W (integers, real numbers) such that for any x ∈ W we have
{v | X(v) ≤ x} ∈ F.
In this case, {v | X(v) ≤ x} is an event.
The cumulative distribution function (CDF) F(x) is defined by
F(x) = P({v ∈ V | X(v) ≤ x}).
The variable x is called a realized value of the random variable X.
Assume that the range of a random variable X is included in the set of integers. Since {v | X(v) = x} is an event, the probability distribution P(x) is defined by
P(x) = P({v ∈ V | X(v) = x}).
(The same letter P is used for the measure and the probability distribution.)
3.2 Random variables whose range is the set of integers
Example (dice):
V = {1, 2, 3, 4, 5, 6}
X(v) = v : we can identify V and W.
P(1) = P({1}) = F(1) = 1/6
P({1, 2, 3}) = F(3) = 1/2
P(∅) = 0
P({1, 2, 3, 4, 5, 6}) = F(6) = 1
P({1, 3, 5} | {1, 2, 3, 4}) = P({1, 3, 5} ∩ {1, 2, 3, 4})/P({1, 2, 3, 4}) = P({1, 3})/P({1, 2, 3, 4}) = (1/3)/(2/3) = 1/2
3.3 Random variables whose range is R
CDF:
F(x) = P({v ∈ V | X(v) ≤ x}).
Probability density function (p.d.f.): p(x).
When there exists a function p(x) such that
p(x)dx = P({v ∈ V | x < X(v) ≤ x + dx}) = F(x + dx) − F(x) (37)
with the approximation of first order in dx, we have
p(x) = dF(x)/dx. (38)
(This p.d.f. is not the portable document format.)
We denote the set {v ∈ V | a < X(v) ≤ b} by (a, b].
Remarks:
(x, x + dx] = (V − (−∞, x]) ∩ (−∞, x + dx],
so (x, x + dx] is indeed an event.
It is clear that we can handle the probability on V as if V were R.
The expectation of g(X), where g is a function on R, is defined as
E[g(X)] = ∫_R g(x)dP. (39)
When the density function exists, it is written as
E[g(X)] = ∫_R g(x)p(x)dx. (40)
In eq. (39), when P is a probability measure, the definition of the integral (simplified; we assume g(x) ≥ 0 for brevity) is given as
∫_R g(x)dP = lim_{h→0} ∑_{l=0}^{∞} lh · P({x | lh < g(x) ≤ (l + 1)h}). (41)
Let F be a CDF; the expectation can also be defined as a Stieltjes integral. Consider a division a = x0 < x1 < x2 < · · · < xn = b, and let ε = max_{l=0,1,...,n−1}(xl+1 − xl) and ξl ∈ (xl, xl+1]. Then
∫_R g(x)dF = lim_{a→−∞} lim_{b→∞} lim_{ε→0} ∑_{l=0}^{n−1} g(ξl)(F(xl+1) − F(xl)). (42)
Average (or mean):
E[X] = ∫_R x dP = ∫_R x p(x)dx. (43)
Variance:
V(X) ≡ E[(X − E[X])2] = E[X2 − 2E[X]X + (E[X])2]
= E[X2] − 2E[X]E[X] + (E[X])2
= E[X2] − (E[X])2. (44)
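Eq. (44) can be illustrated with samples: the two expressions for the variance agree. A NumPy sketch using hypothetical exponentially distributed samples:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=100_000)   # samples of a random variable X

mean = x.mean()
var_def = ((x - mean) ** 2).mean()             # E[(X - E[X])^2]
var_alt = (x ** 2).mean() - mean ** 2          # E[X^2] - (E[X])^2
assert np.isclose(var_def, var_alt)
```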
3.4 Two random variables whose range is R
Consider two random variables X1(v) and X2(v) from V to R. For x = (x1, x2) ∈ R2,
CDF:
F(x1, x2) = P({v ∈ V | X1(v) ≤ x1, X2(v) ≤ x2}).
We denote {v ∈ V | a1 < X1(v) ≤ b1, a2 < X2(v) ≤ b2} by (a1, b1] × (a2, b2].
(Joint) probability density function (p.d.f.):
When there exists a function p(x1, x2) such that
p(x1, x2)dx1dx2 = P((x1, x1 + dx1] × (x2, x2 + dx2]) (45)
with the approximation of first order in dx1, dx2, we have
p(x1, x2) = ∂2F(x1, x2)/∂x1∂x2. (46)
It is clear that we can handle the probability on V as if V were R2.
Marginal distribution
The CDFs of the marginal distributions are defined by
F1(x1) = P({v | X1(v) ≤ x1}), (47)
F2(x2) = P({v | X2(v) ≤ x2}), (48)
and their p.d.f.s are defined by
p1(x1) = ∂F1(x1)/∂x1 = ∫_R p(x1, x2)dx2, (49)
p2(x2) = ∂F2(x2)/∂x2 = ∫_R p(x1, x2)dx1. (50)
X1 and X2 are said to be independent if and only if
P({v ∈ V | X1(v) ≤ x1, X2(v) ≤ x2}) = P({v ∈ V | X1(v) ≤ x1}) P({v ∈ V | X2(v) ≤ x2}),
or
F(x1, x2) = F1(x1)F2(x2).
By differentiating F(x1, x2) = F1(x1)F2(x2), the condition of independence in terms of p.d.f.s is given as
p(x1, x2) = p1(x1)p2(x2).
Conditional distribution: under the condition that X2 ≤ x2, the probability that X1 ≤ x1 is
F(x1|x2) = F(x1, x2)/F2(x2).
If the probability of X1 ∈ (x1, x1 + dx1] under the condition that X2 ∈ (x2, x2 + dx2] can be written as
p(x1|x2)dx1,
then p(x1|x2) is called the conditional p.d.f. of X1 under the condition that X2 = x2, and we have
p(x1|x2) = p(x1, x2)/p2(x2).
Expectation:
E[ f (X1, X2)] = ∫_R ∫_R f (x1, x2)dP = ∫_R ∫_R f (x1, x2)p(x1, x2)dx1dx2. (51)
Average (or mean): E[X1] and E[X2], or in vector form
E[(X1, X2)T] = (E[X1], E[X2])T.
Variance:
E[(X1 − E[X1])2] = E[X1^2] − (E[X1])2,
E[(X2 − E[X2])2] = E[X2^2] − (E[X2])2.
Covariance:
E[(X1 − E[X1])(X2 − E[X2])] = E[X1X2] − E[X1]E[X2].
Variance-covariance matrix (rows separated by ';'):
V(X) ≡ [ E[(X1 − E[X1])2], E[(X1 − E[X1])(X2 − E[X2])] ; E[(X2 − E[X2])(X1 − E[X1])], E[(X2 − E[X2])2] ]
= E[ (X1 − E[X1])2, (X1 − E[X1])(X2 − E[X2]) ; (X2 − E[X2])(X1 − E[X1]), (X2 − E[X2])2 ]
= E[ X1X1, X1X2 ; X2X1, X2X2 ] − [ E[X1]E[X1], E[X1]E[X2] ; E[X2]E[X1], E[X2]E[X2] ]. (52)
Or it can be written with vectors:
V(X) = E[ (X1 − E[X1], X2 − E[X2])T (X1 − E[X1], X2 − E[X2]) ]. (53)
Problem 9. Prove eq.(52).
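Eqs. (52) and (53) can be checked on samples: the two forms of the (sample) variance-covariance matrix coincide. A NumPy sketch with hypothetical correlated data:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((2, 10_000))
X[1] += 0.5 * X[0]                     # make X1 and X2 correlated

m = X.mean(axis=1, keepdims=True)      # sample mean vector E[X]
# V(X) = E[(X - E[X])(X - E[X])^T], sample analogue of eq. (53).
V = (X - m) @ (X - m).T / X.shape[1]
# Equivalent form E[X X^T] - E[X] E[X]^T, sample analogue of eq. (52).
V2 = X @ X.T / X.shape[1] - m @ m.T
assert np.allclose(V, V2)
```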
3.5 n random variables whose range is R
CDF:
F(x1, x2, . . . , xn) = P({v ∈ V | X1(v) ≤ x1, X2(v) ≤ x2, . . . , Xn(v) ≤ xn}).
We denote {v ∈ V | a1 < X1(v) ≤ b1, a2 < X2(v) ≤ b2, . . . , an < Xn(v) ≤ bn} by (a1, b1] × (a2, b2] × · · · × (an, bn].
Probability density function (p.d.f.):
When there exists a function p(x1, x2, . . . , xn) such that
p(x1, x2, . . . , xn)dx1dx2 . . . dxn = P((x1, x1 + dx1] × (x2, x2 + dx2] × · · · × (xn, xn + dxn]) (54)
with the approximation of first order in dx1, dx2, . . . , dxn, we have
p(x1, x2, . . . , xn) = ∂nF(x1, x2, . . . , xn)/∂x1∂x2 . . . ∂xn. (55)
Marginal and conditional distributions, and independence, are defined similarly to the case of two random variables.
Expectation:
E[ f (X1, X2, . . . , Xn)] = ∫_R ∫_R · · · ∫_R f (x1, x2, . . . , xn)dP
= ∫_R ∫_R · · · ∫_R f (x1, x2, . . . , xn)p(x1, x2, . . . , xn)dx1dx2 · · · dxn. (56)
Average (or mean): E[X1], E[X2], . . . , E[Xn], or in vector form
E[(X1, X2, . . . , Xn)T] = (E[X1], E[X2], . . . , E[Xn])T.
Variance-covariance matrix: let X = (X1, X2, . . . , Xn)T. Then
V(X) ≡ E[ (X − E[X])(X − E[X])T ]
= E[ X XT ] − E[X] E[X]T,
whose (i, j) elements are E[XiX j] and E[Xi]E[X j], respectively.
3.6 Examples of distributions
Normal distribution (Section 6).
Successive binary trials:
event S : probability p,
event T : probability 1 − p.
Binomial distribution (Bernoulli trials)
Consider that event S occurs k times in n trials. The probability is given as
nCk p^k (1 − p)^{n−k}. (57)
Geometric distribution
After event T occurs k − 1 times continuously, event S occurs (TT· · ·TS). The probability is given as
(1 − p)^{k−1} p. (58)
Pascal distribution
At the n-th trial, the k-th event S occurs. The probability is given as
n−1Cn−k p^k (1 − p)^{n−k}. (59)
Poisson distribution
In the binomial distribution, fix np = λ and take the limit n → ∞; then
p(k) = e^{−λ} λ^k / k!. (60)
Beta distribution
Distribution of the p-th smallest value among p + q − 1 uniform random variables on [0, 1]:
p(x) = (1/B(p, q)) x^{p−1}(1 − x)^{q−1}, (61)
where
B(p, q) = ∫_0^1 x^{p−1}(1 − x)^{q−1} dx. (62)
Exponential distribution
p(x) = λe^{−λx}. (63)
Gamma distribution
Sum of exponentially distributed variables. The standard gamma distribution is given by
p(x) = (1/Γ(γ)) x^{γ−1} e^{−x}. (64)
Gamma function:
Γ(γ) = ∫_0^∞ t^{γ−1} e^{−t} dt. (65)
(Γ(n) = (n − 1)! when n is a positive integer.)
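The Poisson limit can be illustrated numerically: for large n with np = λ fixed, the binomial probability (57) approaches eq. (60). A sketch with example values λ = 3, k = 2:

```python
from math import comb, exp, factorial

lam, k = 3.0, 2
poisson = exp(-lam) * lam ** k / factorial(k)         # eq. (60)

n = 100_000
p = lam / n                                           # fix np = lambda
binom = comb(n, k) * p ** k * (1 - p) ** (n - k)      # eq. (57)
assert abs(binom - poisson) < 1e-4
```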
3.7 Coordinate transforms
Consider two coordinates x = (x1, x2, . . . , xn) and x′ = (x′1, x′2, . . . , x′n) related by a coordinate transform:
x′ = f (x). (66)
When the p.d.f. of a distribution in x is given by p(x), the p.d.f. of the distribution in x′ is given by
p′(x′) = p( f −1(x′)) |det(∂x′i/∂x j)|−1. (67)
(Proof.)
p′(x′)dx′1dx′2 · · · dx′n
= P({x′ | x′ ∈ [x′1, x′1 + dx′1] × [x′2, x′2 + dx′2] × · · · × [x′n, x′n + dx′n]})
= P({x | f (x) ∈ [x′1, x′1 + dx′1] × · · · × [x′n, x′n + dx′n]})
= P( f −1([x′1, x′1 + dx′1] × · · · × [x′n, x′n + dx′n]) )
= p( f −1(x′)) · (volume of f −1([x′1, x′1 + dx′1] × · · · × [x′n, x′n + dx′n]))
= p( f −1(x′)) |det(∂ fi/∂x j)|−1 dx′1dx′2 · · · dx′n.
QED.
Remark: (∂ fi/∂x j) is the Jacobian matrix.
Consider a two-dimensional random variable (X1, X2) whose p.d.f. is given by p(x1, x2). Assume that the random variable Y is given by
Y = X1 + X2.
Let q(y) be the p.d.f. of Y. Then, we have
q(y) = ∫_{−∞}^{∞} p(x1, y − x1)dx1 = ∫_{−∞}^{∞} p(y − x2, x2)dx2. (68)
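Eq. (68) can be checked in a simple case: for X1 and X2 independent and uniform on [0, 1], the convolution gives the triangular density q(y) = y on [0, 1] and 2 − y on [1, 2]. A NumPy sketch evaluating the integral by a Riemann sum:

```python
import numpy as np

# X1, X2 independent and uniform on [0,1]: p(x1, x2) = p1(x1) p2(x2),
# so eq. (68) becomes the convolution q(y) = int p1(y - x2) p2(x2) dx2.
dx = 1e-4
x = np.arange(0.0, 1.0, dx)          # grid over the support of X2

def q(y):
    # numerical evaluation of eq. (68) for uniform marginals:
    # the integrand is 1 where 0 <= y - x <= 1, else 0
    p1 = ((y - x >= 0.0) & (y - x <= 1.0)).astype(float)
    return np.sum(p1) * dx

# compare against the triangular density
assert abs(q(0.5) - 0.5) < 1e-3
assert abs(q(1.0) - 1.0) < 1e-3
assert abs(q(1.5) - 0.5) < 1e-3
```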
83
(Proof.) Consider a coordinate transform:
y = x_1 + x_2,
x_2 = x_2.
Its Jacobian determinant is
det( ∂y/∂x_1  ∂y/∂x_2 ; ∂x_2/∂x_1  ∂x_2/∂x_2 ) = det( 1 1 ; 0 1 ) = 1.
Then, we have dy dx_2 = dx_1 dx_2.
(This equality implies that the area of the small parallelogram is dy dx_2.) Let p_1(y, x_2) be the p.d.f. of the joint distribution of y and x_2. Since q(y) is given by the marginal distribution of p_1(y, x_2) = p(x_1, x_2),
q(y) = ∫_{−∞}^{∞} p_1(y, x_2) dx_2 = ∫_{−∞}^{∞} p(y − x_2, x_2) dx_2.
QED.
84
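Eq.(68) can also be verified numerically. A sketch (an added illustration, not from the notes) using two independent Uniform[0, 1] variables, whose sum has the triangular density q(y) = y on [0, 1] and 2 − y on [1, 2]:

```python
def p_joint(x1, x2):
    # joint p.d.f. of two independent Uniform[0,1] random variables
    return 1.0 if (0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0) else 0.0

def q(y, n=40001):
    # Riemann-sum approximation of eq.(68): q(y) = integral of p(x1, y - x1) dx1
    dx = 4.0 / (n - 1)
    return sum(p_joint(-1.0 + i * dx, y - (-1.0 + i * dx)) for i in range(n)) * dx

# triangular density: q(0.5) = 0.5, q(1.0) = 1.0, q(1.5) = 0.5
print(q(0.5), q(1.0), q(1.5))
```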
4 Normal distribution
The normal distribution is the most widely used distribution.
• Observation error
• Processing error
• Result of examination
Gaussian function:
p(x) = (1/√(2π)) e^{−x²/2}.
We will provide characterizations of the normal distribution.
Problem 10. Explain an application where the normal distribution is used.
85
4.1 Probability density function of normal distribution
1-dimensional standard normal distribution:
p(x) = (1/√(2π)) e^{−x²/2}. (69)
1-dimensional normal distribution:
p(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}. (70)
n-dimensional standard multivariate normal distribution:
p(x) = (1/√((2π)^n)) e^{−(x_1² + x_2² + · · · + x_n²)/2} = (1/√((2π)^n)) e^{−∥x∥²/2}. (71)
n-dimensional multivariate normal distribution:
p(x) = (1/√((2π)^n |Σ|)) e^{−⟨x−µ, Σ^{−1}(x−µ)⟩/2}. (72)
86
4.2 Gauss distribution
Let p(x; µ) be a p.d.f. on R with a parameter µ ∈ R. Assume:
(i) The p.d.f. p(x; µ) depends only on x − µ. That is,
p(x; µ) = p(x − µ). (73)
(ii) The maximum likelihood estimator of µ is the average of the sample data. (We assume that the sample data are independent.)
Then, with some σ > 0, we have
p(x; µ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}. (74)
(Proof.) Let X = (X_1, X_2, . . . , X_n) be the sampled data. The likelihood is given as
L(X; µ) = p(X_1; µ) p(X_2; µ) · · · p(X_n; µ). (75)
87
The maximum likelihood estimator µ̂ is the µ that maximizes L(X; µ). Condition (ii) implies
µ̂ = (1/n) Σ_{i=1}^n X_i. (76)
The assumption is described as: for any x = (x_1, x_2, . . . , x_n),
0 = ∂L(x; µ)/∂µ |_{µ=µ̂} = ∂/∂µ [p(x_1 − µ) · · · p(x_n − µ)] |_{µ=µ̂}
= −L(x; µ̂) Σ_{i=1}^n p′(x_i − µ)/p(x_i − µ) |_{µ=µ̂},
where we redefine µ̂ = (1/n) Σ_{i=1}^n x_i. Then we have
Σ_{i=1}^n p′(x_i − µ̂)/p(x_i − µ̂) = 0. (77)
88
Let
t_i = x_i − µ̂ = x_i − (1/n) Σ_{j=1}^n x_j.
It is clear that
Σ_{i=1}^n t_i = 0,
and
Σ_{i=1}^n p′(t_i)/p(t_i) = 0.
Lemma 10. If a differentiable function f(t) satisfies Σ_{i=1}^n f(t_i) = 0 for any t_i such that Σ_{i=1}^n t_i = 0, then f(t) = at with a constant a.
89
(Proof.) It follows that f(−t_2 − t_3) + f(t_2) + f(t_3) = 0. Differentiating the equation with respect to t_3, we have
−f′(−t_2 − t_3) + f′(t_3) = 0.
Letting t_2 = −t_3, for any t_3 we have
f′(t_3) = f′(0) = const = a.
Then,
f(t) = at + b
with a constant b. We have
f(−t_2 − t_3) + f(t_2) + f(t_3) = 3b = 0,
so b = 0. This completes the proof. QED.
90
From Lemma 10, we have
p′(t)/p(t) = at,
and it yields that
log p(t) = (1/2) a t² + c.
Then, we have
p(t) = e^{(1/2) a t² + c}.
Let σ² = −1/a; then ∫_{−∞}^{∞} p(x; µ) dx = 1 yields
p(x; µ) = p(x − µ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}.
QED.
91
4.3 Maxwell distribution
The Maxwell distribution is the velocity distribution of an ideal gas.
Let us consider two independent random variables X_1 and X_2. Their joint p.d.f. is denoted by p_X(x_1, x_2). For θ ≠ 0, ±π/2, ±π, . . ., we define two random variables by linear combinations of X_1 and X_2 as
Y_1 = X_1 cos θ + X_2 sin θ, (78)
Y_2 = −X_1 sin θ + X_2 cos θ. (79)
Their joint p.d.f. is denoted by p_Y(y_1, y_2). We assume that the random variables Y_1 and Y_2 are independent. Then, the distribution of X_1 and X_2 must be normal.
Since the transform is a rotation, we have
p_X(x_1, x_2) = p_Y(y_1, y_2). (80)
Since the distribution is also independent with respect to Y_1 and Y_2, we have
p_X(x_1, x_2) = a(x_1) b(x_2),
p_Y(y_1, y_2) = c(y_1) d(y_2),
92
where a, b, c, d are functions. It is clear that
∂²/(∂x_1 ∂x_2) log p_X = 0, (81)
∂²/(∂y_1 ∂y_2) log p_Y = 0. (82)
We have
∂/∂y_1 = (∂x_1/∂y_1) ∂/∂x_1 + (∂x_2/∂y_1) ∂/∂x_2 = cos θ ∂/∂x_1 − sin θ ∂/∂x_2,
∂/∂y_2 = (∂x_1/∂y_2) ∂/∂x_1 + (∂x_2/∂y_2) ∂/∂x_2 = sin θ ∂/∂x_1 + cos θ ∂/∂x_2.
It follows that
0 = ∂²/(∂y_1 ∂y_2) log p_Y
= sin θ cos θ (∂²/∂x_1² − ∂²/∂x_2²) log p_X + (cos² θ − sin² θ) ∂²/(∂x_1 ∂x_2) log p_X
93
= sin θ cos θ (∂²/∂x_1² − ∂²/∂x_2²) log p_X. (83)
From eq.(83), since sin θ cos θ ≠ 0, and using the separated form of p_X, we have
d²/dx_1² log a(x_1) = d²/dx_2² log b(x_2) = 2α = const.
By integrating the equation, we have
log a(x_1) = α x_1² + β_1 x_1 + γ_1,
log b(x_2) = α x_2² + β_2 x_2 + γ_2,
and
p_X(x_1, x_2) = e^{α(x_1² + x_2²) + β_1 x_1 + β_2 x_2 + γ_1 + γ_2}.
It is clear that we can rewrite this expression (with α < 0) as
p(x_1, x_2) = (1/(2πσ²)) e^{−((x_1−µ_1)² + (x_2−µ_2)²)/(2σ²)}. (84)
94
4.3.1 Equilibrium
The p.d.f. of an ideal gas does not change after a collision.
Definition:
(x_1, y_1) : velocity of particle 1 before the collision
(x_2, y_2) : velocity of particle 2 before the collision
(x_3, y_3) : velocity of particle 1 after the collision
(x_4, y_4) : velocity of particle 2 after the collision
Conservation of momentum:
x_1 + x_2 = x_3 + x_4, (85)
y_1 + y_2 = y_3 + y_4. (86)
Conservation of energy:
x_1² + y_1² + x_2² + y_2² = x_3² + y_3² + x_4² + y_4². (87)
Probability of collision:
probability from (1, 2) to (3, 4) : C_i p(x_1, y_1) p(x_2, y_2)
probability from (3, 4) to (1, 2) : C_o p(x_3, y_3) p(x_4, y_4)
95
Since the mutual speed is the same between (1, 2) and (3, 4), we have C_i = C_o. Then equilibrium is expressed by
p(x_1, y_1) p(x_2, y_2) = p(x_3, y_3) p(x_4, y_4) (88)
for any x_1, y_1, . . . , x_4, y_4 that satisfy the conservation laws. For small dx and dy, let x_1 = x + dx, x_2 = x, x_3 = x + dx, x_4 = x, y_1 = y + dy, y_2 = y, y_3 = y, and y_4 = y + dy. Since such x_i, y_i (i = 1, 2, 3, 4) clearly satisfy the conservation laws, eq.(88) has to hold.
p(x + dx, y + dy) p(x, y) = p(x + dx, y) p(x, y + dy). (89)
By Taylor expansion up to second order, we have
[p + (∂p/∂x) dx + (∂p/∂y) dy + (1/2)(∂²p/∂x²)(dx)² + (∂²p/∂x∂y)(dx)(dy) + (1/2)(∂²p/∂y²)(dy)²] p
= [p + (∂p/∂x) dx + (1/2)(∂²p/∂x²)(dx)²] [p + (∂p/∂y) dy + (1/2)(∂²p/∂y²)(dy)²].
96
In this equation, the zeroth- and first-order terms cancel. The remaining second-order terms give
[p (∂²p/∂x∂y) − (∂p/∂x)(∂p/∂y)] dx dy = 0.
Then, we have
p (∂²p/∂x∂y) − (∂p/∂x)(∂p/∂y) = 0. (90)
From the equation, we have
∂/∂x ∂/∂y log p = ∂/∂x ((1/p)(∂p/∂y)) = (1/p²)[p (∂²p/∂x∂y) − (∂p/∂x)(∂p/∂y)] = 0,
so that eqs.(81) and (82) hold, since equilibrium does not depend on the choice of coordinates. Therefore p(x, y) has to be normal.
Conversely, if the distribution is normal, then conservation of energy makes it clear that eq.(88) holds.
97
4.4 Maxwell-Boltzmann distribution
4.4.1 Boltzmann distribution
Definition:
n : the number of particles
M : the number of boxes
ε_i : the energy of box i
ε : the total energy
n_i : the number of particles in box i
Assume that we can distinguish each particle. When we fix n_i for each box, the number of cases G is given as
G = n! / (n_1! n_2! · · · n_M!). (91)
98
We maximize G under the condition that the total number of particles and the total energy are fixed. Then, maximize G subject to
Σ_{i=1}^M n_i = n, (92)
Σ_{i=1}^M ε_i n_i = ε. (93)
Stirling's asymptotic formula is given as
log n! ≃ n(log n − 1). (94)
By using this formula, we have
log G ≃ n(log n − 1) − Σ_{i=1}^M n_i(log n_i − 1) = n log n − Σ_{i=1}^M n_i log n_i.
We use Lagrange's method:
J = n log n − Σ_{i=1}^M n_i log n_i + α(n − Σ_{i=1}^M n_i) + β(ε − Σ_{i=1}^M ε_i n_i). (95)
99
By differentiating J with respect to n_i, we have
0 = ∂J/∂n_i = −log n_i − 1 − α − β ε_i.
Then,
n_i = e^{−β ε_i − α − 1}, (96)
where α and β are determined by eqs.(92) and (93).
4.4.2 Maxwell-Boltzmann distribution
Since the energy is often given by a squared term such as ε_i ∼ v_i², we have
n_i = e^{−β v_i² − α − 1}. (97)
This implies the normal distribution.
100
4.4.3 Maximum entropy
Maximize the entropy of the p.d.f. p(x),
−∫_{−∞}^{∞} p(x) log p(x) dx, (98)
under the conditions that its variance is constant,
∫_{−∞}^{∞} x² p(x) dx = σ², (99)
and that p(x) is a p.d.f.,
∫_{−∞}^{∞} p(x) dx = 1. (100)
By using Lagrange's method, we obtain
p(x) = (1/(√(2π) σ)) e^{−x²/(2σ²)}.
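As a numerical cross-check of the maximum-entropy property (an added illustration; SciPy is assumed): among distributions with the same variance, the normal has the largest differential entropy.

```python
from math import sqrt, log, pi, e
from scipy.stats import norm, uniform, laplace

var = 1.0
h_norm = float(norm(scale=sqrt(var)).entropy())          # = 0.5*log(2*pi*e*var)
h_unif = float(uniform(scale=sqrt(12 * var)).entropy())  # variance = scale^2/12
h_lapl = float(laplace(scale=sqrt(var / 2)).entropy())   # variance = 2*scale^2

assert abs(h_norm - 0.5 * log(2 * pi * e * var)) < 1e-12
print(h_norm, h_unif, h_lapl)  # h_norm is the largest
```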
101
4.5 Central limit theorem
Definition:
p_X(x) : an arbitrary p.d.f. (with some conditions)
X_1, X_2, . . . , X_n : independently sampled data from p_X(x)
µ : its average (= ∫ x p_X(x) dx)
σ² : its variance (= ∫ (x − µ)² p_X(x) dx)
Consider a random variable Y given by
Y = (X_1 + X_2 + · · · + X_n − nµ) / (√n σ). (101)
It converges to a standard normal distribution as n → ∞. ('Standard' means that its average and variance are 0 and 1, respectively.)
p(y) = (1/√(2π)) e^{−y²/2}. (102)
102
Characteristic function (Fourier transform of the p.d.f.)
The characteristic function ϕ_X(ω) of p(x) is defined as
ϕ_X(ω) = ∫_{−∞}^{∞} e^{iωx} p(x) dx. (103)
We have
E[X^n] = (−i)^n d^n/dω^n ϕ_X(ω) |_{ω=0}. (104)
(Proof.)
d^n/dω^n ϕ_X(ω) |_{ω=0} = ∫_{−∞}^{∞} (d^n/dω^n e^{iωx}) p(x) dx |_{ω=0}
= i^n ∫_{−∞}^{∞} x^n e^{iωx} p(x) dx |_{ω=0}
= i^n ∫_{−∞}^{∞} x^n p(x) dx
= i^n E[X^n].
QED.103
When Y = X_1 + X_2, we have
p_Y(y) = ∫ p_{X_1}(y − x) p_{X_2}(x) dx
and
ϕ_Y(ω) = ϕ_{X_1}(ω) · ϕ_{X_2}(ω).
Let
Z = (X − µ) / (√n σ). (105)
Then,
p_Z(z) = √n σ p_X(√n σ z + µ). (106)
We have
µ_Z = E[Z] = (E[X] − µ)/(√n σ) = 0, (107)
σ_Z² = E[Z²] = E[(X − µ)²]/(nσ²) = 1/n. (108)
104
The Taylor expansion of ϕ_Z(ω) yields
ϕ_Z(ω) = ϕ_Z(0) + dϕ_Z/dω |_{ω=0} ω + (1/2!) d²ϕ_Z/dω² |_{ω=0} ω² + · · · (109)
= ϕ_Z(0) + i E[Z] ω + (i²/2!) E[Z²] ω² + · · · (110)
= 1 + i · 0 · ω − (1/(2n)) ω² + · · ·
= 1 − ω²/(2n) + · · · . (111)
Since
Y = Z_1 + Z_2 + · · · + Z_n,
we have
ϕ_Y(ω) = [ϕ_Z(ω)]^n = [1 − ω²/(2n) + ω³ O(n^{−3/2})]^n.
Since
lim_{n→∞} (1 + x/n)^n = e^x,
105
we have
lim_{n→∞} ϕ_Y(ω) = e^{−ω²/2}. (112)
This is the characteristic function of the standard normal distribution. Since the Fourier transform of a Gauss function is a Gauss function, the central limit theorem is proved.
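The convergence in eq.(101) can be illustrated by a small Monte Carlo experiment (an added sketch; the exponential distribution with µ = σ = 1 is an arbitrary choice):

```python
import random
random.seed(0)

n, trials = 100, 5000
ys = []
for _ in range(trials):
    s = sum(random.expovariate(1.0) for _ in range(n))  # mu = 1, sigma = 1
    ys.append((s - n) / n ** 0.5)                       # Y of eq.(101)

mean_y = sum(ys) / trials
var_y = sum(y * y for y in ys) / trials - mean_y ** 2
frac_below0 = sum(y <= 0 for y in ys) / trials  # should approach Phi(0) = 0.5
```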
4.6 Limit of the binomial distribution
Its probability is given as
nCk p^k (1 − p)^{n−k}. (113)
When n is very large, it is approximated by the normal distribution whose average is np and whose variance is np(1 − p) (Laplace's theorem).
106
4.7 Coordinate transform and standard normal distribution
A normal distribution:
p(x) = (1/√((2π)^n |Σ|)) e^{−⟨x−µ, Σ^{−1}(x−µ)⟩/2}.
Define a random variable X′ as
X′ = Σ^{−1/2}(X − µ).
For a symmetric non-negative definite matrix Σ, Σ^{1/2} is defined as the symmetric non-negative definite matrix satisfying (Σ^{1/2})² = Σ:
Σ = U diag(λ_1, . . . , λ_n) U^T,  Σ^{1/2} = U diag(√λ_1, . . . , √λ_n) U^T,
where λ_i ≥ 0 and U is an orthogonal matrix.
107
We have
X = Σ^{1/2} X′ + µ,  ∂x′_i/∂x_j = Σ^{−1/2}.
Then, the standard normal density function
p′(x′) = |det(Σ^{−1/2})|^{−1} p(Σ^{1/2} x′ + µ) = (1/√((2π)^n)) e^{−∥x′∥²/2} (114)
is derived from the coordinate transform.
Mahalanobis distance: the distance in the transformed space, measured with the vectors of the original space, as
d(x, y)² = ⟨x − y, Σ^{−1}(x − y)⟩.
Problem 11. Prove det(Σ^{1/2}) = √det(Σ) = √(λ_1 λ_2 · · · λ_n).
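The construction of Σ^{1/2} by eigendecomposition can be sketched in NumPy (an added illustration with an arbitrary 2×2 covariance matrix):

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])            # symmetric non-negative definite
lam, U = np.linalg.eigh(Sigma)            # Sigma = U diag(lam) U^T
sqrt_Sigma = U @ np.diag(np.sqrt(lam)) @ U.T

# (Sigma^{1/2})^2 = Sigma
assert np.allclose(sqrt_Sigma @ sqrt_Sigma, Sigma)
# det(Sigma^{1/2}) = sqrt(det(Sigma))  (Problem 11)
assert np.isclose(np.linalg.det(sqrt_Sigma), np.sqrt(np.linalg.det(Sigma)))
```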
108
4.8 Distributions related to the normal distribution
4.8.1 χ²-distribution
Let X_1, X_2, . . . , X_r be independent random variables with standard normal distributions. The χ²-distribution with r degrees of freedom is the distribution of
X = X_1² + X_2² + · · · + X_r². (115)
Its p.d.f. is given as
p_{χ²_r}(x) = (1/(2^{r/2} Γ(r/2))) x^{r/2−1} e^{−x/2}, (116)
where Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx. (Γ(n) = (n − 1)!)
Properties:
E[X] = r,  V[X] = 2r,  argmax_x p_{χ²_r}(x) = r − 2 (for r ≥ 2).
109
Theorem 3. (Cochran's theorem)
X_1, X_2, . . . , X_n : independent random variables with standard normal distributions.
Y_1, Y_2, . . . , Y_m : independent random variables with standard normal distributions which are given by linear combinations of X_1, X_2, . . . , X_n as
(Y_1, . . . , Y_m)^T = A (X_1, . . . , X_n)^T,
where A is an (m, n)-matrix. Then,
X_1² + X_2² + · · · + X_n² − (Y_1² + Y_2² + · · · + Y_m²)
follows a χ²-distribution with (n − m) degrees of freedom.
110
Example
Let X_1, X_2, . . . , X_n be independent samples from a normal distribution with average µ and variance σ² (N(µ, σ²)). Let
µ̂ = (1/n) Σ_{i=1}^n X_i. (117)
The distribution of µ̂ is N(µ, σ²/n).
(1/σ²) Σ_{i=1}^n (X_i − µ̂)² = (1/σ²) Σ_{i=1}^n [(X_i − µ) − (µ̂ − µ)]²
= (1/σ²) Σ_{i=1}^n [(X_i − µ) − (1/n) Σ_{j=1}^n (X_j − µ)]²
111
= (1/σ²) [Σ_{i=1}^n (X_i − µ)² − (2/n) Σ_{i=1}^n Σ_{j=1}^n (X_i − µ)(X_j − µ) + (1/n) (Σ_{i=1}^n (X_i − µ))²]
= Σ_{i=1}^n ((X_i − µ)/σ)² − (Σ_{i=1}^n (X_i − µ)/(σ√n))².
Since
(X_i − µ)/σ : N(0, 1), (118)
Σ_{i=1}^n (X_i − µ)/(σ√n) : N(0, 1), (119)
and Σ_{i=1}^n (X_i − µ)/(σ√n) is a linear combination of the (X_i − µ)/σ,
⇒ we can use Cochran's theorem.
112
Then, (1/σ²) Σ_{i=1}^n (X_i − µ̂)² follows the χ²-distribution with n − 1 degrees of freedom. We have
E[(1/σ²) Σ_{i=1}^n (X_i − µ̂)²] = n − 1.
An estimator of σ² is given as
σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − µ̂)². (120)
This is an unbiased estimator of σ². That is, the average of the estimator coincides with the true value:
E[σ̂²] = E[(1/(n − 1)) Σ_{i=1}^n (X_i − µ̂)²] = (σ²/(n − 1)) E[(1/σ²) Σ_{i=1}^n (X_i − µ̂)²] (121)
= (σ²/(n − 1)) (n − 1) = σ². (122)
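The unbiasedness of eq.(120) can be checked by simulation (an added sketch; N(2, 9) and n = 5 are arbitrary choices):

```python
import random
random.seed(1)

mu_true, n, trials = 2.0, 5, 40000   # true distribution N(2, 9)
avg_n1 = avg_n = 0.0
for _ in range(trials):
    xs = [random.gauss(mu_true, 3.0) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    avg_n1 += ss / (n - 1)    # eq.(120): divide by n - 1
    avg_n += ss / n           # biased version: divide by n
avg_n1 /= trials
avg_n /= trials
print(avg_n1, avg_n)  # roughly 9 and 9*(n-1)/n = 7.2
```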
113
4.9 F-distribution
Definition
S_1 : a random variable with a χ²-distribution of d_1 degrees of freedom.
S_2 : a random variable with a χ²-distribution of d_2 degrees of freedom.
S_1 = X_1² + X_2² + · · · + X_{d_1}²
S_2 = Y_1² + Y_2² + · · · + Y_{d_2}²
where X_1, . . . , X_{d_1} and Y_1, . . . , Y_{d_2} are independent random variables with standard normal distributions.
F : a random variable (F(d_1, d_2)-distribution) defined as
F = (S_1/d_1) / (S_2/d_2).
Its p.d.f. is given by (B(x, y) is the beta function)
p_F(x) = (1/(x B(d_1/2, d_2/2))) (d_1 x/(d_1 x + d_2))^{d_1/2} (1 − d_1 x/(d_1 x + d_2))^{d_2/2}.
114
4.10 t-distribution (Student's distribution)
Definition
X_i (i = 1, 2, . . . , n) : random variables from a normal distribution
µ : the average of the normal distribution
σ² : the variance of the normal distribution
X̄ and σ̂² are defined as
X̄ = (1/n) Σ_{i=1}^n X_i,  σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².
T : a random variable (t-distribution) defined as
T = (X̄ − µ)/(σ̂/√n).
Its p.d.f. is given by
p_T(t) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + t²/ν)^{−(ν+1)/2},
where ν = n − 1 (degrees of freedom).
115
5 Test
5.1 Example (two-sided test)
• p : the probability that the front of a coin appears.
• Hypothesis : p = 1/2.
• Toss the coin 20 times.
• The front appears 15 times. ⇒ Is the hypothesis correct?
• If the hypothesis is correct, the probability of this outcome is
20C15 (1/2)^{15} (1/2)^5 ≃ 0.015.
• Since the probability is very low, the hypothesis does not seem to be correct.
• In statistics this procedure is called a test.
• If we assume that the appearance of 15 fronts implies the hypothesis is incorrect, then so do 16, 17, 18, 19, and 20 fronts.
• We evaluate the probability not of exactly 15 fronts but of not less than 15 fronts.
• Furthermore, when the back appears not less than 15 times, we should assume that the hypothesis is incorrect as well.
116
• The probability that the front or the back appears not less than 15 times is
2 × (20C15 + 20C16 + 20C17 + 20C18 + 20C19 + 20C20) (1/2)^{20} ≃ 0.0414.
Octave function: k = 15
binocdf(20 - k, 20, 1.0 - 0.5) + binocdf(20 - k, 20, 0.5)
• p-value : the probability of the region where we should reject the hypothesis if we reject it for the observed data.
k        20           19           18           17           16
p-value  1.91 × 10⁻⁶  4.01 × 10⁻⁵  4.02 × 10⁻⁴  2.58 × 10⁻³  1.18 × 10⁻²
k        15           14           13           12           11           10
p-value  4.14 × 10⁻²  1.15 × 10⁻¹  2.63 × 10⁻¹  5.03 × 10⁻¹  8.24 × 10⁻¹  1.0
• Significance level : if the p-value is less than the significance level, the hypothesis is rejected.
• The hypothesis p = 1/2 is rejected at significance level 0.05.
• The hypothesis p = 1/2 is not rejected at significance level 0.01.
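The same two-sided p-value can be computed with SciPy instead of Octave (an added sketch):

```python
from scipy.stats import binom

n, k = 20, 15
# P(K >= k) + P(K <= n - k): fronts or backs at least 15 times in 20 tosses
p_value = binom.sf(k - 1, n, 0.5) + binom.cdf(n - k, n, 0.5)
print(p_value)  # about 0.0414
```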
117
5.2 Example (one-sided test)
• p : the probability of a defective product.
• Hypothesis : p = 1/1000 (or less).
• Among 10000 products, we found 17 defective products.
• The probability that not less than 17 defective products are produced is given by the CDF of the binomial distribution:
binocdf(10000 - 17, 10000, 1.0 - 0.001) = 0.026977
• The hypothesis p = 1/1000 is rejected at significance level 0.05.
• The hypothesis p = 1/1000 is not rejected at significance level 0.01.
• In this case we do not consider the opposite side.
• For example, assume that we could not find any defective product.
• Even if this is rare under the hypothesis p = 1/1000 and p should then be smaller, we do not reject the hypothesis.
• When p should be larger, the hypothesis is rejected. For
binocdf(10000 - 17, 10000, 1.0 - 0.0011) = 0.055823
the hypothesis p = 1.1/1000 is not rejected at significance level 0.05.
118
Problem 12. Fill the three ?.binocdf(x, n, p) = 1.0 - binocdf(?, ?, ?)
5.3 Critical region
• Critical region : the region whose probability coincides with the significance level and where the hypothesis should be rejected.
• For the first example (two-sided test), let K and µ be the number of appearances of the front and its average. Here, µ = 10. Then the critical region at significance level 0.05 is given by
|K − µ| ≥ 5.
p-value = 0.041389 for |K − µ| ≥ 5.
p-value = 0.11532 for |K − µ| ≥ 4.
• For the second example (one-sided test), the critical region at significance level 0.05 is given by
K ≥ 16.
Its p-value is 0.048654.
119
5.4 Power
• Null hypothesis : the hypothesis that may be rejected. (Ex. p = 1/2)
• Alternative hypothesis : the negation of the null hypothesis. (Ex. p ≠ 1/2)
• The p-value is calculated by using the null hypothesis.
• Power : the probability that the null hypothesis is rejected when it is not correct. It depends on p.
• Power function : the power for given parameters. (Ex. of parameter: p)
• For the first example, let the significance level be 0.05. When the front appears not more than 5 or not less than 15 times (|K − µ| ≥ 5), the null hypothesis is rejected.
• Then its probability is given by using the CDF of the binomial distribution.
Octave description: binocdf(5, 20, p) + binocdf(5, 20, 1-p)
p      0.0 or 1.0  0.1 or 0.9  0.2 or 0.8  0.3 or 0.7  0.4 or 0.6  0.5
power  1.0         0.98875     0.80421     0.41641     0.12721     0.041389
Even if p = 0.5, the hypothesis p = 0.5 is rejected with probability 0.041389.
120
5.5 Number of trials
• When the number of trials n is very large, the binomial distribution can be approximated by the normal distribution:
average : µ = np
variance : σ² = np(1 − p)
p.d.f.:
p(k) = (1/(√(2π) σ)) e^{−(k−µ)²/(2σ²)}
• Meaning : for 0 ≪ m ≪ k, the probability that the front appears from k to k + m − 1 times is approximately given by p(k) m.
• The transformed random variable
X = (K − µ)/σ
follows the standard normal distribution approximately.
121
• Let µ_0 and σ_0 be µ and σ when p = 1/2, respectively:
µ_0 = 0.5 n,  σ_0 = 0.5 √n.
• The critical region at significance level α can be described by
|X| = |K − µ_0|/σ_0 ≥ x_{α/2}.
• x_{α/2} is given by
∫_{x_{α/2}}^{∞} (1/√(2π)) e^{−x²/2} dx = α/2.
• When α = 0.05, x_{α/2} ≃ 1.96.
• The previous relation can be shown by the following equation:
∫_{−∞}^{µ_0 − σ_0 x_{α/2}} (1/(√(2π) σ_0)) e^{−(k−µ_0)²/(2σ_0²)} dk + ∫_{µ_0 + σ_0 x_{α/2}}^{∞} (1/(√(2π) σ_0)) e^{−(k−µ_0)²/(2σ_0²)} dk = α.
122
• The value of the power function at p is given by
∫_{−∞}^{µ_0 − σ_0 x_{α/2}} p(k) dk + ∫_{µ_0 + σ_0 x_{α/2}}^{∞} p(k) dk
= ∫_{−∞}^{µ_0 − σ_0 x_{α/2}} (1/(√(2π) σ)) e^{−(k−µ)²/(2σ²)} dk + ∫_{µ_0 + σ_0 x_{α/2}}^{∞} (1/(√(2π) σ)) e^{−(k−µ)²/(2σ²)} dk
= ∫_{−∞}^{(µ_0 − µ − σ_0 x_{α/2})/σ} (1/√(2π)) e^{−x²/2} dx + ∫_{(µ_0 − µ + σ_0 x_{α/2})/σ}^{∞} (1/√(2π)) e^{−x²/2} dx
= ∫_{−∞}^{(µ_0 − µ − σ_0 x_{α/2})/σ} (1/√(2π)) e^{−x²/2} dx + ∫_{−∞}^{(µ − µ_0 − σ_0 x_{α/2})/σ} (1/√(2π)) e^{−x²/2} dx.
• The power function for significance level 0.05 and p = 0.6:
n      20       100      200      260      300      400      500
power  0.14019  0.51631  0.81228  0.90165  0.93762  0.98133  0.99483
• When p = 0.6 and we have 260 trials, the probability that we can reject p = 0.5 at significance level 0.05 is 0.90165.
123
Octave commands for power function
n = 260
p = 0.6
xa = 1.96
mu0 = 0.5 * n;
mu = p * n;
sigma0 = 0.5 * sqrt(n);
sigma = sqrt(p * (1-p) * n);
normcdf((mu0 - mu - sigma0 * xa) /sigma) ...
+ normcdf((mu - mu0 - sigma0 * xa) /sigma)
Octave functions
Beta Distr.                  betapdf        betacdf
Binomial Distr.              binopdf        binocdf
Cauchy Distr.                cauchy_pdf     cauchy_cdf
Chi-Square Distr.            chi2pdf        chi2cdf
Univariate Discrete Distr.   discrete_pdf   discrete_cdf
124
Empirical Distr.             empirical_pdf  empirical_cdf
Exponential Distr.           exppdf         expcdf
F-Distr.                     fpdf           fcdf
Gamma Distr.                 gampdf         gamcdf
Geometric Distr.             geopdf         geocdf
Hypergeometric Distr.        hygepdf        hygecdf
Kolmogorov-Smirnov Distr.    N.A.           kolmogorov_smirnov_cdf
Laplace Distr.               laplace_pdf    laplace_cdf
Logistic Distr.              logistic_pdf   logistic_cdf
Log-Normal Distr.            lognpdf        logncdf
Pascal Distr.                nbinpdf        nbincdf
Univariate Normal Distr.     normpdf        normcdf
Poisson Distr.               poisspdf       poisscdf
t- (Student) Distr.          tpdf           tcdf
Univariate Discrete Distr.   unidpdf        unidcdf
Uniform Distr.               unifpdf        unifcdf
Weibull Distr.               wblpdf         wblcdf
125
5.6 Theory of test
5.6.1 Most powerful test
• p(x; θ) is a p.d.f. of X with a parameter θ.
• The null hypothesis : θ = θ_0.
• w_α : the critical region for the significance level α:
P(X ∈ w_α; θ_0) ≤ α.
• The power when θ = θ_1 (≠ θ_0) is given by
L(θ_1) = P(X ∈ w_α; θ_1).
• We would like to maximize L(θ_1) with respect to w_α.
• Test function :
ϕ(x) = 1 (x ∈ w_α), 0 (x ∉ w_α).
• For a probabilistic test, we consider ϕ(x) to be a function such that 0 ≤ ϕ(x) ≤ 1.
• Probabilistic test : the hypothesis is rejected probabilistically for the same sample x. The value of ϕ(x) is the rejection rate for x.
126
• The problem is to maximize the power
∫ ϕ(x) p(x; θ_1) dx (123)
under the condition that
∫ ϕ(x) p(x; θ_0) dx = α. (124)
• With a parameter λ, we have
∫ ϕ(x) p(x; θ_1) dx = ∫ ϕ(x) [p(x; θ_1) − λ p(x; θ_0)] dx + λα.
• This is maximized when
ϕ(x) = 1 (p(x; θ_1) − λ p(x; θ_0) > 0), 0 (p(x; θ_1) − λ p(x; θ_0) < 0). (125)
• For a function given by (125), by choosing λ we can make it satisfy eq.(124).
• Let ϕ∗(x) be such a function. For any ϕ(x) which satisfies (124), we have
∫ ϕ∗(x) p(x; θ_1) dx ≥ ∫ ϕ(x) p(x; θ_1) dx
127
and
∫ ϕ∗(x) p(x; θ_0) dx = α.
⇒ The most powerful test function (Neyman-Pearson fundamental lemma).
• Example : Consider the normal distribution N(µ, 1).
Null hypothesis : µ = µ_0 ≡ 0
Alternative hypothesis : µ = µ_1 > 0
(the case where we do not have to consider µ < 0).
• From eq.(125), the critical region is given by
(1/√(2π)) e^{−(x−µ_1)²/2} > λ (1/√(2π)) e^{−x²/2}
for µ_1 > 0. The inequality is equivalent to x ≥ β for a constant β.
• From the condition (124), the critical region is given by
x ≥ x_α.
(one-sided test)
128
5.6.2 Unbiased test
• In the previous example, ϕ(x) does not depend on µ_1 (> 0).
⇒ Uniformly most powerful test (UMP test)
• When the alternative hypothesis is only µ = µ_1 ≠ 0, there is no UMP test. (If µ_1 < 0, the critical region is given by X ≤ −x_α.)
• L(µ) is defined by
L(µ) = E[ϕ(x); µ] = ∫ ϕ(x) p(x; µ) dx = (probability of rejection at µ).
• Unbiased test : for the null hypothesis µ = µ_0, there exists α such that
L(µ_0) ≤ α, (126)
L(µ) ≥ α for any µ ≠ µ_0. (127)
• Under the conditions (126) and (127), we maximize L(µ) with respect to ϕ(x).
129
• Since L(µ) should be minimum at µ = µ_0, the equations to be solved are
∫ ϕ(x) p(x; µ_0) dx = α, (128)
d/dµ ∫ ϕ(x) p(x; µ) dx |_{µ=µ_0} = 0. (129)
• The solution for the example is given by the two-sided test of a normal distribution:
|X − µ_0| ≥ x_{α/2}.
• Its critical region does not depend on µ. ⇒ UMP unbiased test
130
5.6.3 Likelihood ratio test
• Likelihood ratio test : a test using the likelihood ratio
p(x; θ_0)/p(x; θ_1)
for a given sample x.
• Its critical region is given by
p(x; θ_0)/max_θ p(x; θ) ≤ c. (130)
The bound c is chosen so that the probability of the region given by eq.(130) coincides with the significance level α.
131
5.6.4 Composite hypothesis
• A random variable depends on two parameters θ and η.
• Null hypothesis : θ = θ_0 but η is arbitrary.
• Such an η is called a nuisance parameter.
• Similar test : the probability of rejection is a constant α for any η:
P(X ∈ w; θ = θ_0, η) = α for any η,
where w is the critical region.
• Example : the t-test (discussed later): the mean θ = µ is tested but the standard deviation η = σ is not.
• Likelihood ratio test:
max_η p(x; θ_0, η) / max_{θ,η} p(x; θ, η) ≤ c.
The bound c is chosen so that the probability of the region equals the significance level.
132
5.6.5 χ²-test
• Sample set : extracted independently from normal distributions whose variances are equal to one.
• Null hypothesis : the averages of the normal distributions are the same.
• Alternative hypothesis : they are not equal.
• α : significance level
• n : number of samples in the set
• X_i : samples
• µ̂ : sample mean
• σ̂² : sample variance
• We have
µ̂ = (1/n) Σ_{i=1}^n X_i,  σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − µ̂)².
133
• We define χ²_{n−1,α} by
∫_{χ²_{n−1,α}}^{∞} p_{χ²_{n−1}}(x) dx = α.
• Under the null hypothesis, the distribution of (n − 1)σ̂² is the χ²-distribution with n − 1 degrees of freedom.
• Then, the critical region at significance level α is given by
σ̂² ≥ χ²_{n−1,α}/(n − 1).
• If the sample variance is large, the hypothesis that all the means of the normal distributions are the same does not seem to be correct, because the probability that σ̂² is that large is very small.
134
5.6.6 Pearson’s χ2-test• A sample Xi (i = 1, 2, . . . , n) takes m kinds of values k (k = 1, 2, . . . ,m).• Sample set : extracted independently from a distribution (iid).• Null hypothesis : the probability that Xi takes the value k is given by pk.• Alternative hypothesis : they are not pk.• n : number of samples in the set• Yk (k = 1, 2, . . . ,m) : the number of samples that takes the value k.
Yk =n∑
i=1δXi, k.
Yk are random values.• Note that Yk are not independent since they are constrainted by
∑mk=1 Yk =
n.
135
Theorem 4. The distribution of the following random variable
W =m∑
k=1
(Yk − npk)2
npk(131)
converges the χ2-distribution of degree of freedom m − 1 as n→ ∞.
Example• Null hypothesis : Probability of women and man are 1/2.• 193 men and 155 women exist. Then, we have n = 348, np = 174, and
w =(193 − 174)2
174+
(155 − 174)2
174= 4.4024.
• Then, the p-value is given by1 − (CDF of χ2 of degree of freedom 1 at 4.4024) = 0.035888.• The null hypothesis is rejected at significance level 0.05.• The null hypothesis is not rejected at significance level 0.01.
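The example can be reproduced with scipy.stats.chisquare (an added sketch):

```python
from scipy.stats import chisquare

observed = [193, 155]          # men, women
expected = [174, 174]          # n = 348 with p = (1/2, 1/2)
res = chisquare(observed, f_exp=expected)
print(res.statistic, res.pvalue)  # w ≈ 4.149, p ≈ 0.0417
```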
136
(Proof of Theorem)
From Laplace's theorem, Y_k converges to a normal distribution whose mean and variance are n p_k and n p_k(1 − p_k) as n → ∞. We define random variables Z_k by
Z_k = (Y_k − n p_k)/√(n p_k).
It is clear that Z_k converges to a normal distribution whose mean and variance are 0 and 1 − p_k. We calculate the covariance between Z_k and Z_l (k ≠ l):
E[Z_k Z_l] = E[((Y_k − n p_k)/√(n p_k)) ((Y_l − n p_l)/√(n p_l))]
= (1/(n√(p_k p_l))) E[Y_k Y_l − n p_k Y_l − n p_l Y_k + n p_k n p_l]
= (1/(n√(p_k p_l))) (E[Y_k Y_l] − n p_k n p_l − n p_l n p_k + n p_k n p_l)
= (1/(n√(p_k p_l))) (E[Y_k Y_l] − n² p_k p_l).
137
Since δ_{X_i,k} δ_{X_i,l} = 0 (k ≠ l), we have
E[Y_k Y_l] = E[(Σ_{i=1}^n δ_{X_i,k})(Σ_{j=1}^n δ_{X_j,l})]
= Σ_{i=1}^n E[δ_{X_i,k} δ_{X_i,l}] + Σ_{i=1}^n Σ_{j≠i} E[δ_{X_i,k} δ_{X_j,l}]
= n(n − 1) E[δ_{X_1,k} δ_{X_2,l}] = n(n − 1) p_k p_l.
Then, we have
E[Z_k Z_l] = (1/(n√(p_k p_l))) [n(n − 1) p_k p_l − n² p_k p_l] = −√(p_k p_l).
We can say that (Z_1 Z_2 . . . Z_m)^T converges to an m-dimensional random variable with a multivariate normal distribution whose variances and covariances are given by 1 − p_k and −√(p_k p_l), respectively.
Let G = (G_1 G_2 . . . G_m)^T be an m-dimensional random variable with a multivariate standard normal distribution (their variances and covariances are 1 and
138
0, respectively). We define random variables H_k as
H_k = G_k − √(p_k) Σ_{a=1}^m √(p_a) G_a.
It is clear that the distribution of H is a multivariate normal distribution whose variances and covariances are respectively given by
E[H_k²] = E[G_k²] − 2√(p_k) Σ_{a=1}^m √(p_a) E[G_k G_a] + p_k Σ_{a=1}^m Σ_{b=1}^m √(p_a p_b) E[G_a G_b]
= E[G_k²] − 2√(p_k) Σ_{a=1}^m √(p_a) δ_{k,a} + p_k Σ_{a=1}^m Σ_{b=1}^m √(p_a p_b) δ_{a,b}
= 1 − 2p_k + p_k Σ_{a=1}^m p_a
= 1 − p_k,
139
and
E[H_k H_l] = E[G_k G_l] − √(p_l) Σ_{a=1}^m √(p_a) E[G_k G_a] − √(p_k) Σ_{a=1}^m √(p_a) E[G_a G_l] + √(p_k p_l) Σ_{a=1}^m Σ_{b=1}^m √(p_a p_b) E[G_a G_b]
= 0 − √(p_l p_k) − √(p_k p_l) + √(p_k p_l) Σ_{a=1}^m p_a
= −√(p_l p_k).
Therefore, we can say that Z_k converges to H_k and W converges to Σ_{k=1}^m H_k².
Problem 13. 342, 189, and 155 students like A, B, and C, respectively. Test the null hypothesis: the probabilities are 1/2, 1/4, and 1/4, respectively.
140
We have
Σ_{k=1}^m H_k² = Σ_{k=1}^m [G_k − √(p_k) Σ_{a=1}^m √(p_a) G_a]²
= Σ_{k=1}^m G_k² − 2 Σ_{k=1}^m G_k √(p_k) Σ_{a=1}^m √(p_a) G_a + Σ_{k=1}^m p_k [Σ_{a=1}^m √(p_a) G_a]²
= Σ_{k=1}^m G_k² − [Σ_{a=1}^m √(p_a) G_a]².
Furthermore, we have
E[(Σ_{a=1}^m √(p_a) G_a)²] = Σ_{a=1}^m Σ_{b=1}^m √(p_a)√(p_b) E[G_a G_b] = Σ_{a=1}^m p_a = 1.
Then, Cochran's theorem yields that the distribution of Σ_{k=1}^m H_k² is the χ²-distribution with m − 1 degrees of freedom. QED.
141
5.6.7 F-test
• Two sample sets : extracted independently from two normal distributions.
• Null hypothesis : the variances of the two sets are equal.
• Alternative hypothesis : the variances of the two sets are not equal.
• α : significance level
• n_1 : number of samples in set 1
• n_2 : number of samples in set 2
• σ̂_1² : estimated variance of set 1
• σ̂_2² : estimated variance of set 2
• We assume the case σ̂_1² > σ̂_2².
• f = σ̂_1²/σ̂_2² (its distribution is F(n_1 − 1, n_2 − 1)).
We define f_α by
∫_{f_α}^{∞} p_F(x) dx = α.
If f > f_α, the null hypothesis is rejected at significance level α.
142
5.6.8 t-test
• Two sample sets : extracted independently from two normal distributions whose variances are the same.
• Null hypothesis : the averages of the two sets are equal.
• Alternative hypothesis : the averages of the two sets are not equal.
(One-sided test : µ_1 − µ_2 > 0. Two-sided test : µ_1 − µ_2 ≠ 0.)
• α : significance level
• n_1 : number of samples in set 1
• n_2 : number of samples in set 2
• µ̂_1 : estimated average of set 1
• µ̂_2 : estimated average of set 2
• σ̂_1² : estimated variance of set 1
• σ̂_2² : estimated variance of set 2
• A random variable (t-distribution with n_1 + n_2 − 2 degrees of freedom):
t = (µ̂_1 − µ̂_2) / (σ̂ √(1/n_1 + 1/n_2)),
143
where
σ̂² = ((n_1 − 1)σ̂_1² + (n_2 − 1)σ̂_2²) / (n_1 + n_2 − 2).
We define t_α by
∫_{t_α}^{∞} p_T(x) dx = α.
If t > t_α (one-sided test) or |t| > t_{α/2} (two-sided test), the null hypothesis is rejected at significance level α.
If the variances of the two normal distributions are different, the following t approximately obeys a t-distribution:
t = (µ̂_1 − µ̂_2) / √(σ̂_1²/n_1 + σ̂_2²/n_2).
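The pooled two-sample t statistic can be cross-checked against scipy.stats.ttest_ind with equal_var=True; a sketch with made-up illustration data (not from the notes):

```python
from scipy.stats import ttest_ind

set1 = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]
set2 = [4.4, 4.8, 4.5, 4.2, 4.7]
res = ttest_ind(set1, set2, equal_var=True)

# recompute t with the pooled-variance formula from the text
n1, n2 = len(set1), len(set2)
m1, m2 = sum(set1) / n1, sum(set2) / n2
s1 = sum((x - m1) ** 2 for x in set1) / (n1 - 1)
s2 = sum((x - m2) ** 2 for x in set2) / (n2 - 1)
pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
t_manual = (m1 - m2) / (pooled * (1 / n1 + 1 / n2)) ** 0.5

assert abs(t_manual - res.statistic) < 1e-10
```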
144
5.6.9 Test using the p-value
χ²-distribution : ∫_{χ²}^{∞} p_{χ²}(x) dx
F-distribution : ∫_{f}^{∞} p_F(x) dx
t-distribution : ∫_{t}^{∞} p_T(x) dx
If the p-value is smaller than 0.05 or 0.01, the null hypothesis is rejected.
145
6 Estimation
6.1 Point estimation
• Assume that the probability depends on a parameter (bosuu in Japanese).
• Statistical estimation : estimate the parameter from observed data.
• Consider the probability p that the front of a coin appears as a parameter.
• Assume that we toss a coin n times and its front appears X times (X is a random variable).
• It is natural to estimate p by
p̂_1 = X/n. (132)
• p̂_1 is also a random variable because X is a random variable.
• An estimator is a random variable (or a method) for estimation.
• An estimate is an estimated value.
146
6.2 Bias
• Bias : the difference between the mean of an estimator and the true value:
E[p̂] − p.
• Unbiased estimator : an estimator whose bias is zero (E[p̂] = p).
• Biased estimator : an estimator whose bias is not zero (E[p̂] ≠ p).
• Estimator (132) is an unbiased estimator:
E[p̂_1] = E[X/n] = np/n = p.
• Consider an estimator
p̂_2 = (X + 1)/(n + 2). (133)
Its mean is given by
E[p̂_2] = E[(X + 1)/(n + 2)] = (np + 1)/(n + 2).
If p ≠ 1/2, p̂_2 is a biased estimator (E[p̂_2] ≠ p).
147
6.3 Variance of an estimator
• Variance of an estimator:
V(p̂) = E[(p̂ − E[p̂])²].
• For estimators, a small variance is better than a large one.
• Example:
V(p̂_1) = (1/n²) V(X) = p(1 − p)/n,
V(p̂_2) = (1/(n + 2)²) V(X) = n p(1 − p)/(n + 2)².
The variance of p̂_2 is smaller than that of p̂_1. (However, p̂_2 is a biased estimator.)
148
6.4 Mean square error
• The mean square error of an estimator is defined by
mse(p̂) = E[(p̂ − p)²].
• We have
mse(p̂) = E[(p̂ − p)²] = E[(p̂ − E[p̂] + E[p̂] − p)²] = (E[p̂] − p)² + V(p̂). (134)
• The meaning of this equation is
(mean square error) = (bias)² + (variance).
• Bias-variance trade-off.
Problem 14. Show (134).
149
• Example:
mse(p̂_1) = p(1 − p)/n,
mse(p̂_2) = ((np + 1)/(n + 2) − p)² + n p(1 − p)/(n + 2)²
= (−(n − 4)p² + (n − 4)p + 1)/(n + 2)².
When n = 5, p = 0.4, we have mse(p̂_1) = 0.0480 and mse(p̂_2) = 0.0253. In this case, the mse of p̂_2 is smaller than that of p̂_1.
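The decomposition (134) gives the two mean square errors in closed form; a sketch reproducing the numbers above:

```python
def mse_p1(n, p):
    # p1 = X/n is unbiased, so its mse is its variance
    return p * (1 - p) / n

def mse_p2(n, p):
    # p2 = (X+1)/(n+2): (bias)^2 + variance, eq.(134)
    bias = (n * p + 1) / (n + 2) - p
    var = n * p * (1 - p) / (n + 2) ** 2
    return bias ** 2 + var

m1, m2 = mse_p1(5, 0.4), mse_p2(5, 0.4)
print(m1, m2)  # 0.048 and about 0.0253
```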
150
6.5 Interval estimation
• Estimate an interval which includes the parameter.
• For the previous example, estimate the lower bound u(X) and the upper bound v(X) of p from an observed datum X:
u(X) ≤ p ≤ v(X). (135)
• Confidence coefficient 1 − α:
P(u(X) ≤ p ≤ v(X)) ≥ 1 − α
under the assumption that the true parameter is p. More precisely,
P(ω | u(X(ω)) ≤ p ≤ v(X(ω))) ≥ 1 − α.
• In many cases, we let α = 0.05 (the confidence coefficient is 0.95).
• Condition (135) is rewritten as
c(p) ≤ X ≤ d(p),
where c(p) and d(p) are the inverse functions of v(X) and u(X), respectively.
151
• This formula can be written as
P(X ≤ c(p) or X ≥ d(p)) ≤ α. (137)
• This formula is the same as in a test. Then, we can use the boundary of a critical region to decide c(p) and d(p).
• From c(p) and d(p), we can obtain an interval [u(X), v(X)] for p whose confidence coefficient is 1 − α.
• Example: toss a coin n times (n is large). The front appears X times.
• The distribution is approximated by a normal distribution whose average and variance are np and np(1 − p).
• According to the two-sided test of a normal distribution, the interval is given by
|X − np| ≤ x_{α/2} √(np(1 − p)).
• It can be written as
np − x_{α/2} √(np(1 − p)) ≤ X ≤ np + x_{α/2} √(np(1 − p)).
• By solving the above inequality with respect to p, we have the confidence
153
interval:
(2nX + n x²_{α/2} − √(−4nX² x²_{α/2} + 4n²X x²_{α/2} + n² x⁴_{α/2})) / (2(n² + n x²_{α/2})) ≤ p ≤ (2nX + n x²_{α/2} + √(−4nX² x²_{α/2} + 4n²X x²_{α/2} + n² x⁴_{α/2})) / (2(n² + n x²_{α/2})).
• Let p_0 be any probability in the confidence interval. We can say the hypothesis p = p_0 is not rejected at significance level 5%.
• Note that this does not mean that the probability that p is included in the confidence interval is 1 − α. (The probability statement is about X.)
• Example: n = 100 and X = 40. Let α = 0.05 (x_{α/2} = 1.96). The confidence interval of confidence coefficient 0.95 is given by
0.310 ≤ p ≤ 0.498.
• Region estimation : for the case where the number of parameters is more than one.
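Solving the quadratic inequality gives the interval in closed form (this is the Wilson score interval); a sketch reproducing the example n = 100, X = 40:

```python
from math import sqrt

n, X, z = 100, 40, 1.96          # z = x_{alpha/2} for alpha = 0.05
phat = X / n
denom = 1 + z * z / n
center = (phat + z * z / (2 * n)) / denom
half = (z / denom) * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
lower, upper = center - half, center + half
print(lower, upper)  # about 0.309 and 0.498
```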
154
6.6 Theory of estimation
Definition:
• X : an observed vector : an n-dimensional random variable.
x : a realized value of X.
• p(x; θ) : the p.d.f. with a K-dimensional vector θ (parameters).
• Θ : the parameter space (⊂ R^K).
• E_θ, V_θ : the expectation and the variance of the distribution with p(x; θ).
• θ̂(X) : an estimator of θ (a random variable).
• Unbiased estimator :
E_θ[θ̂(X)] = θ.
• Uniformly minimum variance unbiased estimator (UMVUE) : θ̂∗(X) is an unbiased estimator such that for any unbiased estimator θ̂(X) we have
V(θ̂∗(X)) ≤ V(θ̂(X)).
155
6.6.1 Cramer-Rao lower bound
Theorem 5.
1. Θ is an open set of R^K.
2. A = {x | p(x; θ) > 0} does not depend on θ.
3. For all x ∈ A and all θ ∈ Θ, the derivatives ∂/∂θi log p(x; θ) (i = 1, 2, . . . , K) exist and are finite.
4. The differentiations and the integrations are commutative.
5. We define a (K, K)-matrix I (Fisher's information matrix) as
(I)ij = Eθ[ (∂/∂θi log p(X; θ)) (∂/∂θj log p(X; θ)) ].
6. For a sample X, we define a random variable T as
T = (t1(X), . . . , tr(X))^T.
156
We assume that T is an unbiased estimator of
g(θ) = (g1(θ), . . . , gr(θ))^T.
7. A matrix D(θ) is defined as
(D(θ))ij = ∂gi(θ)/∂θj.
Then we have
Vθ(T) − D(θ)(I(θ))⁻¹D(θ)^T ≥ O. (138)
Remark. If we estimate θ itself, gi(θ) = θi. Since D(θ) is a unit matrix, we then have
V(θ̂) − (I(θ))⁻¹ ≥ O.
157
(Proof of the 1-dimensional case.)
We have
Eθ[ ∂/∂θ log p(X; θ) ] = ∫ (1/p(x; θ)) (∂p(x; θ)/∂θ) p(x; θ) dx
= ∂/∂θ ∫ p(x; θ) dx = ∂/∂θ 1 = 0,
and
Eθ[ (T − Eθ[T]) ∂/∂θ log p(X; θ) ] = Eθ[ (T − g(θ)) ∂/∂θ log p(X; θ) ]
= ∫ t(x) (∂/∂θ log p(x; θ)) p(x; θ) dx − g(θ) Eθ[ ∂/∂θ log p(X; θ) ]
= ∂/∂θ ∫ t(x) p(x; θ) dx = ∂/∂θ Eθ[T] = ∂/∂θ g(θ).
158
Then, by the Cauchy–Schwarz inequality,
(∂g(θ)/∂θ)² = ( Eθ[ (T − Eθ[T]) (∂/∂θ log p(X; θ)) ] )²
≤ Eθ[ (T − Eθ[T])² ] Eθ[ (∂/∂θ log p(X; θ))² ]
= Vθ(T) I(θ).
It follows that
Vθ(T) ≥ (∂g(θ)/∂θ) I(θ)⁻¹ (∂g(θ)/∂θ). (139)
QED.
• Efficient estimator : an unbiased estimator whose variance achieves the Cramer-Rao lower bound.
• Every efficient estimator is a UMVUE.
159
6.6.2 Maximum likelihood estimator (MLE)
• p(x; θ) : a probability distribution or a p.d.f. with a parameter θ.
• Likelihood function : L(θ) = p(x; θ).
  We reconsider θ as a variable and x as a parameter.
• Log likelihood function : log L(θ).
• The maximum likelihood estimator (MLE) is given by
θ̂ = argmaxθ L(θ) = argmaxθ log L(θ).
• Example : estimate the mean and variance of a normal distribution by MLE.
– X1, X2, . . . , Xn : i.i.d. (independent and identically distributed) random variables of a normal distribution.
  x1, x2, . . . , xn : their realized values.
– The likelihood function L(µ, σ) is given by
L(µ, σ) = Π_{i=1}^{n} (1/(√(2π)σ)) e^{−(xi−µ)²/(2σ²)}.
160
– Its log likelihood is given by
log L(µ, σ) = −(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)² − (n/2) log 2π − n log σ.
– By letting the derivatives be zero, we obtain the equations for the MLE:
∂/∂µ log L(µ, σ) = 0,  ∂/∂σ log L(µ, σ) = 0.
– The MLE is given by
µ̂ = (1/n) Σ_{i=1}^{n} Xi, (140)
σ̂² = (1/n) Σ_{i=1}^{n} (Xi − µ̂)². (141)
Problem 15. Prove eqs.(140) and (141).
161
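A short Python sketch of eqs.(140) and (141) on made-up realized values (not from the slides): the MLE of the mean is the sample mean, and the MLE of the variance uses the 1/n factor.

```python
# MLE of a normal distribution's mean and variance, eqs.(140)-(141):
# mu_hat = (1/n) * sum(x_i), sigma2_hat = (1/n) * sum((x_i - mu_hat)^2).
# Note the 1/n (not 1/(n-1)) factor: the MLE of the variance is biased.
xs = [2.0, 3.0, 5.0, 7.0, 3.0]     # made-up realized values
n = len(xs)
mu_hat = sum(xs) / n
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in xs) / n
# here mu_hat = 4.0 and sigma2_hat = 3.2
```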
7 Statistics
7.1 Sufficient statistic
• X1, X2, . . . , Xn : observed data (random variables).
  x1, x2, . . . , xn : their realized values.
• Let t be a function of n variables X1, X2, . . . , Xn.
• Statistic : a random variable T defined by
T = t(X1, X2, . . . , Xn).
• θ : a parameter of a distribution.
• Assume that the conditional probability or p.d.f.
p(x1, x2, . . . , xn | T = t; θ)
does not depend on θ for any t. ⇒ T is called a sufficient statistic.
• The condition can be written as
p(x1, x2, . . . , xn; θ) = p(x1, x2, . . . , xn | T = t) p(t; θ).
• It is enough to know T to estimate θ.
162
• Example: Estimate the mean of a normal distribution by its samples.
T = X1 + X2 + · · · + Xn
is a sufficient statistic, where X1, X2, . . . , Xn are i.i.d. random variables.
(i.i.d. : independent and identically distributed)
• Consider its two-dimensional case and let
T = (1/√2)(X1 + X2),  U = (1/√2)(X1 − X2),
and let t and u be the realized values of T and U, respectively.
• Then, we have (we assume σ² = 1)
p(x1, x2; µ) = p(x1; µ) p(x2; µ)
= (1/√(2π)) e^{−(x1−µ)²/2} · (1/√(2π)) e^{−(x2−µ)²/2}
= (1/(2π)) e^{−((x1−x2)/√2)²/2 − ((x1+x2)/√2 − √2µ)²/2}
= (1/√(2π)) e^{−u²/2} · (1/√(2π)) e^{−(t−√2µ)²/2},
163
and
p(x1, x2 | T = t; µ) = p(t, u | T = t; µ) = (1/√(2π)) e^{−u²/2}.
It does not depend on µ.
• T1, T2, . . . , Tm are sufficient statistics with respect to θ if:
p(x1, x2, . . . , xn; θ) = p(x1, x2, . . . , xn | T1 = t1, T2 = t2, . . . , Tm = tm) p(t1, t2, . . . , tm; θ).
• Rao-Blackwell's Theorem
– T is a sufficient statistic of a parameter θ.
– θ̂ is an unbiased estimator of θ. (θ̂ = θ̂(X1, X2, . . . , Xn))
– We define a new estimator as
θ̂∗ = E[θ̂ | T].
– Then θ̂∗ is also an unbiased estimator of θ and we have
V(θ̂∗) ≤ V(θ̂).
(A better estimator can be obtained.)
164
8 Regression
8.1 Method of least squares
• Data : (c1, x1), (c2, x2), . . . , (cn, xn).
• For example, ci is the income and xi is the consumption of person i.
• We may assume that their relation can be expressed by a first-order polynomial:
xi = α + βci.
• It will not hold strictly. Let ui be the error from the relation:
ui = xi − (α + βci).
• We often minimize the squared sum of the ui:
Q = Σi ui² = Σi (xi − α − βci)².
• We have
Q = nα² − 2α Σi (xi − βci) + Σi (xi − βci)². (142)
165
• With respect to α, eq.(142) is minimized when
α̂ ≡ (1/n) Σi (xi − βci) = x̄ − βc̄, (143)
where
c̄ = (1/n) Σi ci,  x̄ = (1/n) Σi xi.
• We substitute α̂ into α in eq.(142):
Q = β² Σi (ci − c̄)² − 2β Σi (ci − c̄)(xi − x̄) + Σi (xi − x̄)². (144)
Problem 16. Prove eq.(144).
• β̂ is given by
β̂ = Σi (ci − c̄)(xi − x̄) / Σi (ci − c̄)². (145)
⇒ Linear regression (method of least squares).
166
• Let
x̂i = α̂ + β̂ci.
Then, we have
Q = Σi (xi − x̂i)².
• If Q is small compared to Σi (xi − x̄)², the assumption seems to be correct.
• We define r² as
r² = 1 − Q / Σi (xi − x̄)².
If r² is near to 1, the assumption seems to be correct.
• By substituting β̂ into β in eq.(144), we have
Q = Σi (xi − x̄)² − (Σi (ci − c̄)(xi − x̄))² / Σi (ci − c̄)²,
and
r² = (Σi (ci − c̄)(xi − x̄))² / (Σi (xi − x̄)² Σi (ci − c̄)²).
167
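A Python sketch of eqs.(143), (145), and r² on made-up data (the values of cs and xs are illustrative, not from the slides):

```python
# Least-squares fit x = alpha + beta*c via eqs.(143) and (145),
# plus the coefficient of determination r^2.
cs = [1.0, 2.0, 3.0, 4.0, 5.0]          # explanatory variable (e.g. income)
xs = [2.1, 3.9, 6.2, 7.8, 10.1]         # dependent variable (e.g. consumption)

n = len(cs)
c_bar = sum(cs) / n
x_bar = sum(xs) / n
s_cx = sum((c - c_bar) * (x - x_bar) for c, x in zip(cs, xs))
s_cc = sum((c - c_bar) ** 2 for c in cs)
s_xx = sum((x - x_bar) ** 2 for x in xs)

beta_hat = s_cx / s_cc                  # eq.(145)
alpha_hat = x_bar - beta_hat * c_bar    # eq.(143)
r2 = s_cx ** 2 / (s_xx * s_cc)          # r^2 near 1: the linear model fits
```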
8.2 Linear model
• Xi (i = 1, 2, . . . , n) : random variables of normal distributions whose variances are σ². (The distributions of the Xi are different.)
• θj (j = 1, 2, . . . , m) : parameters to decide the means (m < n):
E[Xi] = Σ_{j=1}^{m} Cij θj,
where Cij is the (i, j)-element of a given matrix C.
• We define random variables as
Z = Σ_{i=1}^{n} Xi²,  Yj = Σ_{i=1}^{n} Cij Xi.
• (Z, Y1, . . . , Ym) are the sufficient statistics of σ, θ1, θ2, . . . , θm.
168
• We have
E[Yj] = E[ Σ_{i=1}^{n} Cij Xi ] = Σ_{i=1}^{n} Cij E[Xi] = Σ_{i=1}^{n} Σ_{k=1}^{m} Cij Cik θk.
• θ̂j : the estimator of θj satisfies
Yj = Σ_{k=1}^{m} Σ_{i=1}^{n} Cij Cik θ̂k. (146)
• We define
X = (X1, X2, . . . , Xn)^T,  Y = (Y1, Y2, . . . , Ym)^T,
Θ = (θ1, θ2, . . . , θm)^T,  Θ̂ = (θ̂1, θ̂2, . . . , θ̂m)^T.
• Eq.(146) yields the normal equation:
C^T C Θ̂ = D Θ̂ = Y, (147)
where D = C^T C.
169
• The estimator of Θ is given by
Θ̂ = D⁻¹Y. (148)
• Θ̂ is an unbiased estimator. Since E[X] = CΘ, we have
E[Θ̂] = D⁻¹E[Y] = D⁻¹C^T E[X] = D⁻¹DΘ = Θ.
• The variance covariance matrix of Y, V(Y), is given by
V(Y) = E[(Y − E[Y])(Y − E[Y])^T] = C^T E[(X − E[X])(X − E[X])^T] C = C^T (σ²I) C = σ²D.
• The variance covariance matrix of Θ̂ (used in regression) is given by
V(Θ̂) = E[(Θ̂ − E[Θ̂])(Θ̂ − E[Θ̂])^T] = D⁻¹E[(Y − E[Y])(Y − E[Y])^T]D⁻¹ = D⁻¹(σ²D)D⁻¹ = σ²D⁻¹.
• We define random variables as
Ui = Xi − E[Xi],  Vi = θ̂i − θi.
170
• We also define
U = (U1, U2, . . . , Un)^T (= X − E[X]),  V = (V1, V2, . . . , Vm)^T (= Θ̂ − Θ).
• E[U] = 0, E[V] = 0, and
V = D⁻¹C^T U.
• Since E[X] = CΘ, the residue Q is given by
Q = Σ_{i=1}^{n} (Xi − Σ_{j=1}^{m} Cij θ̂j)² = Σi (Ui − Σ_{j=1}^{m} Cij Vj)²
= Σ_{i=1}^{n} Ui² − Σ_{j=1}^{m} Σ_{k=1}^{m} Djk Vj Vk.
• V(V) (= V(Θ̂)) is given by σ²D⁻¹. Therefore, the distribution of
(1/σ²) Σ_{j=1}^{m} Σ_{k=1}^{m} Djk Vj Vk
is a χ²-distribution of m degrees of freedom.
171
• Since the Vi are linear combinations of the Ui,
Q/σ² = (1/σ²) Σi Ui² − (1/σ²) Σj Σk Djk Vj Vk
obeys the χ²-distribution of (n − m) degrees of freedom.
• Since Y = C^T X and Θ̂ = D⁻¹C^T X, we also have
Q = ∥X − CΘ̂∥² = Σ_{i=1}^{n} Xi² − Σ_{j=1}^{m} Yj θ̂j.
• The estimator of σ² is given by
σ̂² = (1/(n − m)) ( Σ_{i=1}^{n} Xi² − Σ_{j=1}^{m} Yj θ̂j ).
172
8.3 Theory of regression
• Linear regression equation :
E(X) = α + βc.
• c : explaining variable
• X : dependent variable
• α, β : regression coefficients
• The problem is to estimate the regression coefficients.
• Let Xi be the sample at ci.
We consider that the mean and variance of Xi are given by α + βci and σ², respectively.
• ci is not a random variable.
• The linear regression equation can be rewritten as
X = α + βc + U.
• U : error. The distribution of U is assumed to be identical for all sample points and its mean is zero.
173
• Use the linear model. We consider θ1 = α, θ2 = β,
C = ( 1 c1
      1 c2
      ... ...
      1 cn ).
• We have
Y1 = Σ_{i=1}^{n} Ci1 Xi = Σ_{i=1}^{n} Xi,
Y2 = Σ_{i=1}^{n} Ci2 Xi = Σ_{i=1}^{n} ci Xi,
D = ( n    nc̄
      nc̄  Σ_{i=1}^{n} ci² ),
D⁻¹ = (1 / Σ_{i=1}^{n}(ci − c̄)²) ( (1/n)Σ_{i=1}^{n} ci²  −c̄
                                      −c̄                   1  ).
174
• The estimators of α and β are given by
( α̂, β̂ )^T = D⁻¹ ( Σ_{i=1}^{n} Xi, Σ_{i=1}^{n} ci Xi )^T. (149)
Eq.(149) is equivalent to eqs.(143) and (145).
• The estimator of σ² (the variance of Xi) is given by
σ̂² = (1/(n − 2)) Q = (1/(n − 2)) ( Σ_{i=1}^{n} Xi² − α̂ Σ_{i=1}^{n} Xi − β̂ Σ_{i=1}^{n} ci Xi )
= (1/(n − 2)) ( Σi (Xi − X̄)² − (Σi (ci − c̄)(Xi − X̄))² / Σi (ci − c̄)² ).
• The variance covariance matrix of α̂ and β̂ is given by
V(α̂, β̂) = ( σ²α̂          Cov(α̂, β̂)
              Cov(α̂, β̂)   σ²β̂        ) = σ² D⁻¹,
where Cov(α̂, β̂) = E[(α̂ − α)(β̂ − β)].
175
• Its estimator is given by
V̂(α̂, β̂) = ( σ̂²α̂           Côv(α̂, β̂)
               Côv(α̂, β̂)    σ̂²β̂        ) = σ̂² D⁻¹.
• Hypothesis : α = α0. Then
(α̂ − α0) / √(σ̂²α̂)
obeys the t-distribution of n − 2 degrees of freedom.
⇒ We can test the hypothesis or calculate its confidence interval.
• Hypothesis β = β0 :
(β̂ − β0) / √(σ̂²β̂)
also obeys the t-distribution of n − 2 degrees of freedom.
176
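A hedged Python sketch of testing β = β0 in simple linear regression (the data and β0 below are made up for illustration): it computes β̂, the residual sum of squares Q, σ̂² = Q/(n−2), the estimated variance of β̂, and the t-statistic to be compared with the t-distribution with n − 2 degrees of freedom.

```python
import math

# t-statistic for the hypothesis beta = beta0 (section 8.3 formulas).
cs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # made-up explanatory variable
xs = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]     # made-up dependent variable
beta0 = 0.0

n = len(cs)
c_bar = sum(cs) / n
x_bar = sum(xs) / n
s_cc = sum((c - c_bar) ** 2 for c in cs)
s_cx = sum((c - c_bar) * (x - x_bar) for c, x in zip(cs, xs))
s_xx = sum((x - x_bar) ** 2 for x in xs)

beta_hat = s_cx / s_cc
Q = s_xx - s_cx ** 2 / s_cc             # residual sum of squares
sigma2_hat = Q / (n - 2)                # estimator of sigma^2
var_beta = sigma2_hat / s_cc            # sigma_hat^2 * (D^{-1})_{22}
t_stat = (beta_hat - beta0) / math.sqrt(var_beta)
# compare t_stat with the t-distribution of n - 2 degrees of freedom
```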
9 Statistical learning theory
9.1 Framework
• z : pattern
• Patterns are observed randomly.
• F(z) : cumulative probability distribution function
• When z is a real number, F(z) is defined as the probability that a pattern Z satisfying Z ≤ z is observed.
Eg(z) ≡∫
g(z)dF(z). (150)
If F(z) is differentiable, its probability density function (p.d.f) f (x) is givenby
f (z) =dFdz.
(The probability of the case that a pattern in [z, z+dz] is observed is f (z)dz.)
177
• When we use the p.d.f., the expectation of g(z) is given as
Eg(z) = ∫ g(z) f(z) dz.
(By using F(z) instead of f(z), expectations with respect to discrete values such as dice can be expressed.)
• Assumption: each pattern has its category.
• Assumption: a family of (classification) functions dα(z) (α ∈ Λ) that return a category for an input pattern z.
• Λ is the set of indices α of the classification functions.
⇔ Each classification function corresponds to an element α in Λ.
• Assumption: a risk function is given as follows:
R(z, y) ≡ { 0 (the category of pattern z is y)
            1 (else). }
• We define a function Q(α, z) with a classification function dα(z):
Q(α, z) ≡ R(z, dα(z)) = { 0 (the result of dα(z) is correct)
                          1 (the result of dα(z) is incorrect). }
178
• R(α) : the expectation of the risk of a classification function dα:
R(α) = Ez Q(α, z) = ∫ Q(α, z) dF(z) (= ∫ Q(α, z) f(z) dz).
The problem is to obtain the classification function dα that minimizes R(α).
R(α) is equal to the probability that dα classifies incorrectly (Q(α, z) = 1) when patterns appear according to the probability distribution function F(z).
179
• For learning, we can use only a finite number of patterns.
• L : the number of learning patterns,
z1, z2, . . . , zL.
• Assumption: we know the category of each learning pattern.
• The risk of a classification function for the learning patterns is calculated by
Remp(α) = (1/L) Σ_{i=1}^{L} Q(α, zi).
We call it the empirical risk.
• The difference between the two risks is evaluated in statistical learning theory:
R(α) = ∫ Q(α, z) dF(z),  Remp(α) = (1/L) Σ_{i=1}^{L} Q(α, zi).
⇒ Estimate the ensemble expectation from the sample expectation.
181
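A hedged Python illustration of the two risks (the toy problem, threshold 0.5, and noise level 0.1 are made up, not from the slides): a fixed classifier is applied to labeled random patterns whose labels are flipped with probability 0.1, so its ensemble risk R is 0.1 by construction, and the empirical risk Remp estimates it.

```python
import random

# Empirical risk Remp versus ensemble risk R for one fixed classifier.
random.seed(0)

def d(value):                  # a fixed classification function d_alpha
    return 1 if value > 0.5 else 0

def Q(value, label):           # risk: 1 iff d misclassifies the pattern
    return 0 if d(value) == label else 1

L = 2000
patterns = []
for _ in range(L):
    v = random.random()
    label = 1 if v > 0.5 else 0
    if random.random() < 0.1:  # label noise: flip with probability 0.1
        label = 1 - label
    patterns.append((v, label))

R_true = 0.1                                    # ensemble risk, by design
R_emp = sum(Q(v, y) for v, y in patterns) / L   # empirical risk
```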
• Application: evaluation of the ERM principle
Definition 4. ERM (empirical risk minimization) principle
Use the dα that minimizes Remp(α).
• Learning patterns appear according to a p.d.f. f(z).
• Estimate the expectation of Q(α, z) from Q(α, zi) for learning patterns zi.
182
9.2 Expectation and supremum
• The expectation of g(z1, z2, . . . , zL) with respect to z1, z2, . . . , zL :
Ez1,z2,...,zL g(z1, z2, . . . , zL) ≡ ∫∫ · · · ∫ g(z1, z2, . . . , zL) dF(z1) dF(z2) · · · dF(zL).
• The supremum of g(z1, z2, . . . , zL) with respect to z1, z2, . . . , zL :
sup_{z1,z2,...,zL} g(z1, z2, . . . , zL).
• For any distribution and any function g, we have
Ez1,z2,...,zL g(z1, z2, . . . , zL) ≤ sup_{z1,z2,...,zL} g(z1, z2, . . . , zL).
9.3 Entropy, annealed entropy, growth function
• NΛ(z1, z2, . . . , zL) : the number of different (Q(α, z1), Q(α, z2), . . . , Q(α, zL)) when we fix (z1, z2, . . . , zL) and vary α ∈ Λ.
• That is, the number of different risk arrays of the classification functions for fixed input patterns.
183
• Entropy :
HΛ(L) = Ez1,z2,...,zL ln NΛ(z1, z2, . . . , zL)
• Annealed entropy :
HΛann(L) = ln Ez1,z2,...,zL NΛ(z1, z2, . . . , zL)
• Growth function :
GΛ(L) = ln max_{z1,z2,...,zL} NΛ(z1, z2, . . . , zL)
• We have the following relation:
HΛ(L) ≤ HΛann(L) ≤ GΛ(L).
Problem 17. Prove
ln(αx1 + (1 − α)x2) ≥ α ln x1 + (1 − α) ln x2
for x1 (> 0), x2 (> 0), and α (0 ≤ α ≤ 1).
184
Hint for Problem 17
Let f(x) = ln x. It is clear that
d²f(x)/dx² < 0 (151)
for all x > 0. We define a function as
F(x) = f(x) − [ ((f(x2) − f(x1))/(x2 − x1)) (x − x1) + f(x1) ].
We have
F(x1) = F(x2) = 0,
F(αx1 + (1 − α)x2) = f(αx1 + (1 − α)x2) − (α f(x1) + (1 − α) f(x2)),
dF(x)/dx = df(x)/dx − (f(x2) − f(x1))/(x2 − x1),
and
d²F(x)/dx² = d²f(x)/dx². (152)
185
From eqs.(151) and (152), dF(x)/dx is monotone decreasing. Then, the number of extreme points of F(x) (x1 ≤ x ≤ x2) is one at most. From the intermediate value theorem, there exists c (x1 ≤ c ≤ x2) such that
dF(x)/dx |_{x=c} = df(x)/dx |_{x=c} − (f(x2) − f(x1))/(x2 − x1) = 0.
Then, you can prove F(x) ≥ 0 (x1 ≤ x ≤ x2) from the above relations.
(End of Hint)
9.3.1 Basic inequality
The following theorem is one of the most important theorems for statistical learning theory.
186
Theorem 6. We have the following relation (basic inequality):
P{ sup_{α∈Λ} |R(α) − Remp(α)| > ε } < 4 exp[ ( HΛann(2L)/L − (ε − 1/L)² ) L ].
• The left-hand side is the probability that learning patterns (z1, z2, . . . , zL) appear with which the maximum absolute difference between the ensemble risk and the empirical risk among all α ∈ Λ is larger than ε.
• Since HΛann(2L) ≤ GΛ(2L), the above inequality holds even if we replace HΛann(2L) by GΛ(2L).
• Let η be the right-hand side of the basic inequality:
η = 4 exp[ ( HΛann(2L)/L − (ε − 1/L)² ) L ].
Solving it with respect to ε, we have
ε = √( (HΛann(2L) − ln(η/4)) / L ) + 1/L.
187
Corollary 1. We have the following relation with probability 1 − η:
|R(α) − Remp(α)| ≤ √( (HΛann(2L) − ln(η/4)) / L ) ≤ √( (GΛ(2L) − ln(η/4)) / L ).
Definitions:
Z2L = (z1, . . . , zL, zL+1, . . . , z2L),
Z1 = (z1, . . . , zL),  Z2 = (zL+1, . . . , z2L),
ν(α, Z1) = (1/L) Σ_{i=1}^{L} Q(zi, α),  ν(α, Z2) = (1/L) Σ_{i=L+1}^{2L} Q(zi, α),
ρ(α, Z2L) = |ν(α, Z1) − ν(α, Z2)|,
π(α, Z1) = | ∫ Q(α, z) dF − ν(α, Z1) | = |R(α) − Remp(α)|.
Lemma 11. (Basic Lemma)
P{ sup_{α∈Λ} π(α, Z1) > ε } < 2 P{ sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L }.
188
(Proof.)
First, we fix α∗ and consider the case that the samples satisfy R(α∗) > ν(α∗, Z1). We denote the probability of a case C under that condition by P1{C}. That is, P1{C} ≡ P{C | R(α∗) > ν(α∗, Z1)}. Then, for any α∗ ∈ Λ, we have the following:
P1{ sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L }
≥ P1{ (sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L) ∧ (π(α∗, Z1) > ε) }
≥ P1{ (ρ(α∗, Z2L) > ε − 1/L) ∧ (π(α∗, Z1) > ε) }
= P1{ (|ν(α∗, Z1) − ν(α∗, Z2)| > ε − 1/L) ∧ (|R(α∗) − ν(α∗, Z1)| > ε) }
= P1{ (|ν(α∗, Z1) − ν(α∗, Z2)| > ε − 1/L) ∧ (R(α∗) − ν(α∗, Z1) > ε) }. (153)
189
The following logical relation is clear:
(ν(α∗, Z2) > R(α∗) − 1/L) ∧ (R(α∗) − ν(α∗, Z1) > ε)
⇒ (−ν(α∗, Z1) + ν(α∗, Z2) > ε − 1/L) ∧ (R(α∗) − ν(α∗, Z1) > ε)
⇒ (|ν(α∗, Z1) − ν(α∗, Z2)| > ε − 1/L) ∧ (R(α∗) − ν(α∗, Z1) > ε).
In this relation, it is clear that the probability that the last condition is true is not less than that of the first condition:
P1{ (|ν(α∗, Z1) − ν(α∗, Z2)| > ε − 1/L) ∧ (R(α∗) − ν(α∗, Z1) > ε) }
≥ P1{ (ν(α∗, Z2) > R(α∗) − 1/L) ∧ (π(α∗, Z1) > ε) }.
190
Since Z1 and Z2 are sampled independently,
P1{ (ν(α∗, Z2) > R(α∗) − 1/L) ∧ (π(α∗, Z1) > ε) }
= P1{ ν(α∗, Z2) > R(α∗) − 1/L } · P1{ π(α∗, Z1) > ε }.
• For a sample z, the probability that Q(α∗, z) = 1 is R(α∗).
• Consider a coin whose probability of showing a head in one toss is R(α∗).
• When we toss it L times and it shows heads Lν(α∗, Z2) times, the probability that Lν(α∗, Z2) is greater than LR(α∗) − 1 is greater than 1/2. (This is a property of the binomial distribution.)
• The first probability on the right-hand side of the above equality is the same as this probability and is greater than 1/2.
From these relations we have
P1{ sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L } > (1/2) P1{ π(α∗, Z1) > ε }.
191
Next, we consider the case R(α∗) ≤ ν(α∗, Z1). We denote the probability under this condition by P2. Then, considering the case ν(α∗, Z2) < R(α∗) + 1/L, we can similarly prove
P2{ sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L } > (1/2) P2{ π(α∗, Z1) > ε }.
When we recall the calculation of a marginal distribution, P{R(α∗) > ν(α∗, Z1)} + P{R(α∗) ≤ ν(α∗, Z1)} = 1 yields
P1{C} · P{R(α∗) > ν(α∗, Z1)} + P2{C} · P{R(α∗) ≤ ν(α∗, Z1)} = P{C}
for any case C. This completes the proof. Q.E.D.
(Proof of the basic inequality)
• Denote the operations of permutation of the 2L samples by Ti (i = 1, 2, . . . , (2L)!).
• Since the samples are drawn independently, the probability is not changed by a permutation:
P{ sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L } = P{ sup_{α∈Λ} ρ(α, TiZ2L) > ε − 1/L }.
192
• When we fix the samples Z2L, the number of different sets of (Q(α, z1), Q(α, z2), . . . , Q(α, z2L)) for changing α ∈ Λ is given by NΛ(z1, · · · , z2L).
• We denote their representatives by Λ∗(Z2L).
• The number of elements in Λ∗(Z2L) is NΛ(z1, · · · , z2L).
• We define the step function θ(x) as
θ(x) = { 1 (x > 0)
         0 (x ≤ 0). }
• The following relation is clear:
P{h(z) > 0} = ∫_{z | h(z)>0} dF(z) = ∫ θ(h(z)) dF(z) = E[θ(h(z))].
• Then we have
P{ sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L } = (1/(2L)!) Σ_{i=1}^{(2L)!} P{ sup_{α∈Λ∗(Z2L)} ρ(α, TiZ2L) > ε − 1/L }
193
= (1/(2L)!) Σ_{i=1}^{(2L)!} P{ sup_{α∈Λ∗(Z2L)} ρ(α, TiZ2L) − ε + 1/L > 0 }
= (1/(2L)!) Σ_{i=1}^{(2L)!} E[ θ( sup_{α∈Λ∗(Z2L)} ρ(α, TiZ2L) − ε + 1/L ) ]
≤ (1/(2L)!) Σ_{i=1}^{(2L)!} E[ Σ_{α∈Λ∗(Z2L)} θ( ρ(α, TiZ2L) − ε + 1/L ) ]
= E[ Σ_{α∈Λ∗(Z2L)} (1/(2L)!) Σ_{i=1}^{(2L)!} θ( ρ(α, TiZ2L) − ε + 1/L ) ].
• Assume that among Q(α, z1), Q(α, z2), . . . , Q(α, z2L), m of them are 1 and the others (2L − m of them) are 0.
194
• From the definition, ρ(α, Z2L) can be written as
ρ(α, Z2L) = | (1/L) Σ_{i=1}^{L} Q(zi, α) − (1/L) Σ_{i=L+1}^{2L} Q(zi, α) |. (154)
• Assume that in the first and the second terms in eq.(154), k and m − k of the Q(zi, α) are 1, respectively.
• Then, we have
ρ(α, Z2L) = | k/L − (m − k)/L |.
• The learning patterns are reordered by Ti.
θ( ρ(α, TiZ2L) − ε + 1/L )
is 1 if and only if k, the number of Q(zi, α) equal to 1 in the first sum in ρ(α, TiZ2L), satisfies
| k/L − (m − k)/L | > ε − 1/L.
195
• nCk denotes the number of combinations taking k from n.
A set of integers K is defined by
K = { k | | k/L − (m − k)/L | > ε − 1/L }.
• We have
(1/(2L)!) Σ_{i=1}^{(2L)!} θ( ρ(α, TiZ2L) − ε + 1/L ) = Σ_{k∈K} (mCk · 2L−mCL−k) / 2LCL. (155)
• Consider m red and 2L − m blue balls, and split them into two bags so that each bag includes L balls. The right-hand side of eq.(155) is the probability that the first bag contains k red balls for some k ∈ K.
• We can prove that the following inequality holds for any m:
Σ_{k∈K} (mCk · 2L−mCL−k) / 2LCL < 2 exp[ −(ε − 1/L)² L ].
196
Then, the following holds and the basic inequality is proven:
P{ sup_{α∈Λ} |R(α) − Remp(α)| > ε }
≤ 2 P{ sup_{α∈Λ} ρ(α, Z2L) > ε − 1/L }
≤ 2 E[ Σ_{α∈Λ∗(Z2L)} (1/(2L)!) Σ_{i=1}^{(2L)!} θ( ρ(α, TiZ2L) − ε + 1/L ) ]
< 2 E[ Σ_{α∈Λ∗(Z2L)} 2 exp( −(ε − 1/L)² L ) ]
= 4 E[ NΛ(z1, . . . , z2L) ] exp( −(ε − 1/L)² L )
= 4 exp[ ( HΛann(2L)/L − (ε − 1/L)² ) L ].
197
9.4 VC (Vapnik-Chervonenkis) dimension
• For the growth function of a set of classifiers
GΛ(L) = ln max_{z1,...,zL} NΛ(z1, . . . , zL),
we have the following theorem.
Theorem 7. For any integer L, we have GΛ(L) = L ln 2, or there exists an integer h (the VC dimension) such that
GΛ(L) { = L ln 2 (L ≤ h)
        ≤ ln( Σ_{i=0}^{h} LCi ) (L > h). }
• When the number of samples is less than or equal to h, there exists a classifier that splits all samples correctly for any labeling of the samples.
198
Lemma 12. If we have
NΛ(z1, z2, . . . , zL) > Σ_{i=0}^{m} LCi,
there exists a subsequence of samples z∗1, z∗2, . . . , z∗m+1 such that
NΛ(z∗1, z∗2, . . . , z∗m+1) = 2^{m+1}.
Now, we prove Theorem 7. Let
h = argmax_L ( GΛ(L) = L ln 2 ).
Let {z1, z2, . . . , zh} be the set of h elements that achieves the maximum. It is clear that for any L ≤ h we have
GΛ(L) = L ln 2
with any L elements in {z1, z2, . . . , zh}, because NΛ(z1, z2, . . . , zh) = 2^h means any separation of the set is possible by choosing α ∈ Λ.
199
Suppose that for some L (> h) we have
GΛ(L) > ln Σ_{i=0}^{h} LCi;
then from Lemma 12 there exist h + 1 elements such that
NΛ(z∗1, z∗2, . . . , z∗h+1) = 2^{h+1}.
Then, we have
GΛ(h + 1) = (h + 1) ln 2.
This equality contradicts the choice of h.
200
When the VC dimension h is defined for Λ, we have
HΛann(L) ≤ GΛ(L) ≤ ln( Σ_{i=0}^{h} LCi ) ≤ h(1 + ln(L/h)).
Then we have the following theorem and corollary.
(We omit the proof of the last inequality.)
Theorem 8.
P{ sup_{α∈Λ} |R(α) − Remp(α)| > ε } < 4 exp[ ( h(1 + ln(2L/h))/L − (ε − 1/L)² ) L ].
201
9.4.1 VC dimension and risk
Corollary 2. With probability 1 − η the following relation holds:
R(α) − Remp(α) ≤ √( (h(1 + ln(2L/h)) − ln(η/4)) / L ).
202
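A small Python evaluation of the bound in Corollary 2 (the numbers h, L, η below are made up for illustration): it shows that the bound shrinks as the number of learning patterns L grows for a fixed VC dimension.

```python
import math

# The risk bound of Corollary 2:
# R(alpha) - Remp(alpha) <= sqrt((h*(1 + ln(2L/h)) - ln(eta/4)) / L)
def vc_bound(h, L, eta):
    return math.sqrt((h * (1 + math.log(2 * L / h)) - math.log(eta / 4)) / L)

b1 = vc_bound(h=10, L=1000, eta=0.05)     # few learning patterns
b2 = vc_bound(h=10, L=100000, eta=0.05)   # many learning patterns
# b2 < b1: more patterns give a tighter guarantee
```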
9.4.2 Linear classifiers and the VC dimension
• The VC dimension of the set of all linear classifiers in Rⁿ is n + 1.
203
9.4.3 Margin and the VC dimension
• Assume that the norms of the patterns are upper bounded. The larger the margin, the smaller the VC dimension is.
204
(Lemma 12)
If we have
NΛ(z1, z2, . . . , zL) > Σ_{i=0}^{m} LCi,
there exists a subsequence z∗1, z∗2, . . . , z∗m+1 such that
NΛ(z∗1, z∗2, . . . , z∗m+1) = 2^{m+1}.
(Proof of Lemma 12)
We define φ(n, L) as
φ(n, L) = Σ_{i=0}^{n−1} LCi.
We define LCi = 0 if i > L. It is clear that
φ(1, L) = LC0 = 1.
205
When L < n we have
φ(n, L) = Σ_{i=0}^{n−1} LCi = Σ_{i=0}^{L} LCi = (1 + 1)^L = 2^L.
The property of the combinations
LCi = L−1Ci + L−1Ci−1
yields
φ(n, L) = φ(n, L − 1) + φ(n − 1, L − 1).
Let n = m + 1, and we prove this lemma with respect to n by mathematical induction.
First, let n = 1. If we have
NΛ(z1, . . . , zL) > φ(1, L) = 1,
then for at least one z∗ ∈ {z1, . . . , zL} there exist α0, α1 ∈ Λ such that
Q(z∗, α0) = 0, Q(z∗, α1) = 1.
206
This relation yields
NΛ(z∗) = 2.
Then, Lemma 12 for n = 1 holds.
Next, assume that when n ≤ n0, Lemma 12 holds for any L and any Λ, and consider the case n = n0 + 1. When L < n0 + 1, the assumption of Lemma 12 is written as
NΛ(z1, . . . , zL) > φ(n0 + 1, L) = 2^L.
However, since NΛ(z1, . . . , zL) is not larger than 2^L, the assumption does not hold and the lemma holds for this case.
Then, we assume that when L ≤ L0, Lemma 12 holds for any Λ, and consider the case L = L0 + 1. Now, the assumption of Lemma 12,
NΛ(z1, . . . , zL0+1) > φ(n0 + 1, L0 + 1),
holds. We consider the following two cases.
207
case 1 : NΛ(z1, . . . , zL0) > φ(n0 + 1, L0)
case 2 : NΛ(z1, . . . , zL0) ≤ φ(n0 + 1, L0)
For case 1, since this lemma holds for n = n0 + 1, L = L0 by the assumption of the induction, samples z∗1, . . . , z∗n0+1 satisfying
NΛ(z∗1, . . . , z∗n0+1) = 2^{n0+1}
are included in {z1, . . . , zL0} (⊂ {z1, . . . , zL0, zL0+1}), and Lemma 12 holds.
Consider case 2. We consider the equivalence classes of the values (Q(z1, α), . . . , Q(zL0, α)) with respect to Λ. That is, we split Λ as
Λ = Λ1 ∪ Λ2 ∪ · · · ∪ ΛNΛ(z1,...,zL0),
where α, β ∈ Λi (i = 1, 2, . . . , NΛ(z1, . . . , zL0)) implies Q(zj, α) = Q(zj, β) for any j = 1, 2, . . . , L0. Next, we define K1 and K2 as
K1 : the number of Λi that include α, β such that Q(zL0+1, α) ≠ Q(zL0+1, β),
K2 : the number of Λi such that for any α, β ∈ Λi we have Q(zL0+1, α) = Q(zL0+1, β).
208
Then, we have
NΛ(z1, . . . , zL0) = K1 + K2,
NΛ(z1, . . . , zL0, zL0+1) = 2K1 + K2,
and
NΛ(z1, . . . , zL0, zL0+1) = K1 + NΛ(z1, . . . , zL0).
Let us extract an element αi from each Λi that satisfies the condition for K1, and let Λ∗ be the set of all such αi. From the rule of choosing the Λi, the values (Q(z1, α), . . . , Q(zL0, α)) for the elements in Λ∗ are different from each other. Then, we have
NΛ∗(z1, . . . , zL0) = K1.
When K1 > φ(n0, L0), since Lemma 12 holds for n = n0, L = L0, and Λ = Λ∗, there exists a subsequence z∗1, . . . , z∗n0 such that
NΛ∗(z∗1, . . . , z∗n0) = 2^{n0}.
209
Furthermore, the condition of K1 yields
2^{n0+1} ≥ NΛ(z∗1, . . . , z∗n0, zL0+1) ≥ 2 NΛ∗(z∗1, . . . , z∗n0) = 2^{n0+1}.
Then, Lemma 12 holds.
When K1 ≤ φ(n0, L0), the condition of case 2 yields
NΛ(z1, . . . , zL0, zL0+1) = K1 + NΛ(z1, . . . , zL0) ≤ φ(n0, L0) + φ(n0 + 1, L0) = φ(n0 + 1, L0 + 1).
This relation implies that the assumption of Lemma 12 does not hold.
This completes the proof. QED.
9.5 References
• Vladimir N. Vapnik, "Statistical Learning Theory," Wiley, New York, 1998.
• K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, Vol. 12, No. 2, pp. 181-201, 2001.
210
10.1 Gradient method
Assume that J is continuously differentiable. Let
x = (x1, . . . , xn)^T,  a = (a1, . . . , an)^T.
The Taylor expansion at x = a is given as
J(x1, . . . , xn) = J(a1, . . . , an) + ∂J/∂x1|_{x=a} (x1 − a1) + · · · + ∂J/∂xn|_{x=a} (xn − an)
+ (terms of order two and higher).
We define
J1(x1, . . . , xn) = J(a1, . . . , an) + ∂J/∂x1|_{x=a} (x1 − a1) + · · · + ∂J/∂xn|_{x=a} (xn − an). (156)
J1 expresses the tangential plane of J at x = a.
212
Gradient of J:
∇J = (∂J/∂x1, . . . , ∂J/∂xn)^T.
J1(x) = J(a) + ⟨∇J|_{x=a}, x − a⟩.
∇J|_{x=a} is perpendicular to the contour line of J.
In R^{n+1}, the vector (∇J|_{x=a}, −1) is perpendicular to the tangential plane J1(x):
⟨(∇J|_{x=a}, −1), (x − a, J1(x) − J1(a))⟩ = 0.
Gradient method: search for the minimum point by using ∇J.
214
10.1.1 Maximum gradient method
It searches for the minimum point in the direction −∇J. For brevity, ∇J|_{x=a} is denoted by ∇J[a].
Algorithm
1. x1 : an initial point. Let k = 1.
2. Calculate ∇J[xk].
3. If ∥∇J[xk]∥ is small enough, then stop.
4. Search for the (local) minimum point in the direction −∇J[xk] (linear search), and let xk+1 be that point.
5. k ← k + 1.
6. Go to 2.
If xk+1 is a local minimum of the linear search, we have
∇J[xk+1] ⊥ ∇J[xk].
215
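A hedged Python sketch of the maximum gradient method on a made-up quadratic test function; for brevity a fixed step size replaces the linear search of step 4.

```python
# Steepest descent for J(x, y) = (x - 1)^2 + 4*(y + 2)^2,
# whose minimum point is (1, -2). Fixed step size instead of line search.
def grad_J(x, y):
    return (2 * (x - 1), 8 * (y + 2))

x, y = 0.0, 0.0                # initial point
step = 0.1
for k in range(200):
    gx, gy = grad_J(x, y)
    if (gx * gx + gy * gy) ** 0.5 < 1e-8:   # stop: gradient small enough
        break
    x, y = x - step * gx, y - step * gy     # move against the gradient
# (x, y) converges to the minimum point (1, -2)
```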
Calculation of ∇J
• Analytic
• Difference:
∇J ≃ ( (J(x1+δ, x2, . . . , xn) − J(x1−δ, x2, . . . , xn)) / (2δ)
       ...
       (J(x1, x2, . . . , xn+δ) − J(x1, x2, . . . , xn−δ)) / (2δ) ).
216
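The central-difference formula above can be sketched as a small Python helper (the test function is made up):

```python
# Central-difference approximation of the gradient of f at point xs.
def num_grad(f, xs, delta=1e-5):
    g = []
    for i in range(len(xs)):
        up = list(xs); up[i] += delta
        dn = list(xs); dn[i] -= delta
        g.append((f(up) - f(dn)) / (2 * delta))   # (f(x+d) - f(x-d)) / 2d
    return g

f = lambda v: v[0] ** 2 + 3 * v[0] * v[1]   # analytic grad: (2x + 3y, 3x)
g = num_grad(f, [1.0, 2.0])
# g is close to the analytic gradient (8, 3) at (1, 2)
```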
Linear search
F(t) = J(a − t∇J[a]).
In general, the minimum point of F(t) is not the minimum point of J(x).
A simple algorithm for linear search
1. Choose initial values t1, t2, and t3 such that t1 < t2 < t3, F(t1) > F(t2), and F(t3) > F(t2).
2. Fit a quadratic function to t1, t2, and t3. Let t0 be the minimum point of the quadratic function fitting the three points.
3. If t0 < t2 and F(t0) < F(t2), then t1 ← t1, t2 ← t0, t3 ← t2.
   If t0 < t2 and F(t0) > F(t2), then t1 ← t0, t2 ← t2, t3 ← t3.
   If t0 > t2 and F(t0) < F(t2), then t1 ← t2, t2 ← t0, t3 ← t3.
   If t0 > t2 and F(t0) > F(t2), then t1 ← t1, t2 ← t2, t3 ← t0.
4. Go to 2.
217
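One quadratic-fit step of this linear search can be sketched in Python as follows (only the fitting step of item 2 is shown, not the full bracketing loop; the test function is made up):

```python
# Fit a parabola through (t1,F(t1)), (t2,F(t2)), (t3,F(t3)) and return
# the minimum point t0 of that parabola (parabolic interpolation).
def quad_fit_min(t1, t2, t3, F):
    f1, f2, f3 = F(t1), F(t2), F(t3)
    num = (t2 - t1) ** 2 * (f2 - f3) - (t2 - t3) ** 2 * (f2 - f1)
    den = (t2 - t1) * (f2 - f3) - (t2 - t3) * (f2 - f1)
    return t2 - 0.5 * num / den

F = lambda t: (t - 1.7) ** 2 + 0.3      # true minimum at t = 1.7
t0 = quad_fit_min(0.0, 1.0, 3.0, F)
# F itself is quadratic, so the fit is exact and t0 = 1.7
```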
10.1.2 Conjugate gradient method for a quadratic function
First, we consider a quadratic function in Rⁿ. For a symmetric matrix A we define
J(x) = (1/2)⟨Ax, x⟩ + ⟨b, x⟩ + c = (1/2) Σi Σj Aij xi xj + Σi bi xi + c.
A non-negative (definite) matrix, A ≥ 0 : ⟨Ax, x⟩ ≥ 0 for all x ⇔ λi ≥ 0 for all eigenvalues λi.
A positive (definite) matrix, A > 0 : ⟨Ax, x⟩ > 0 for all x (≠ 0) ⇔ λi > 0 for all eigenvalues λi.
A > 0 : a unique minimum point
−A > 0 : a unique maximum point
A ≥ 0, b ∈ R(A) : minimum points
−A ≥ 0, b ∈ R(A) : maximum points
A ≥ 0 or −A ≥ 0, b ∉ R(A) : no minimum or maximum point
Neither A ≥ 0 nor −A ≥ 0 : a saddle point
219
From here, we assume that A > 0. Then, we have
J(x) = (1/2)⟨x + A⁻¹b, A(x + A⁻¹b)⟩ + c − (1/2)⟨b, A⁻¹b⟩.
(Proof.)
(1/2)⟨x + A⁻¹b, A(x + A⁻¹b)⟩ + c − (1/2)⟨b, A⁻¹b⟩
= (1/2)⟨x, Ax⟩ + (1/2)⟨x, AA⁻¹b⟩ + (1/2)⟨A⁻¹b, Ax⟩ + (1/2)⟨A⁻¹b, AA⁻¹b⟩ + c − (1/2)⟨b, A⁻¹b⟩
= (1/2)⟨x, Ax⟩ + ⟨b, x⟩ + c = J(x).
QED.
Since A > 0, J(x) is minimum if and only if x + A⁻¹b = 0, that is, x = −A⁻¹b.
220
• It is a hard task to calculate A⁻¹.
• We want to extend the method to non-quadratic optimization problems.
• The maximum gradient method is not efficient: even for a quadratic problem, infinitely many iterations are necessary in general.
• By the conjugate gradient method, we can obtain a solution within n iterations when the problem is an n-dimensional quadratic function (neglecting calculation errors).
xk : the result of the k-th iteration.
sk : the direction of the linear search from xk to xk+1.
sk is determined by using ∇J[xk]. We have
xk+1 = xk + αksk, (157)
where αk is a real number. For fixed xk and sk, the αk that minimizes J(xk + αksk)
221
is given by
αk = −⟨Axk + b, sk⟩ / ⟨sk, Ask⟩. (158)
(Proof.)
J(x + αs) = (1/2)⟨x + αs, A(x + αs)⟩ + ⟨b, x + αs⟩ + c
= (1/2)⟨s, As⟩α² + ⟨Ax + b, s⟩α + (1/2)⟨x, Ax⟩ + ⟨b, x⟩ + c.
QED.
Definition 5. si and sj are A-conjugate : ⟨Asi, sj⟩ = 0 if i ≠ j.
If A > 0 and si (≠ 0) (i = 1, 2, · · · , n) are A-conjugate, then the si are linearly independent.
222
(Proof.)
Suppose that the si (≠ 0) (i = 1, 2, · · · , n) are not linearly independent. Then there exist l and real numbers ξj (j ≠ l) such that
sl = Σ_{j≠l} ξj sj.
For any k ≠ l, we have
0 = ⟨sl, Ask⟩ = ⟨ Σ_{j≠l} ξj sj, Ask ⟩ = ξk⟨sk, Ask⟩.
Since ⟨Ask, sk⟩ > 0 holds, we have ξk = 0 for all k ≠ l, and hence sl = 0. This contradicts sl ≠ 0, so the si are linearly independent. QED.
223
Let [s1, s2, · · · , sk] be the subspace spanned by s1, s2, · · · , sk.
Let x1 + [s1, s2, · · · , sk] = {x1 + x | x ∈ [s1, s2, · · · , sk]} (a linear variety).
We have
∇J[x] = Ax + b.
(Proof.)
Since
J(x) = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Aij xi xj + Σ_{i=1}^{n} bi xi + c,
Aij = Aji yields
(∇J[x])k = ∂J/∂xk = Σ_{i=1}^{n} Aki xi + bk = (Ax + b)k. (159)
QED.
Problem 18. Derive eq.(159) in more detail.
224
Theorem 9. When eqs.(157) and (158) are used and the sj are A-conjugate, we have
i) ⟨∇J[xk+1], sj⟩ = 0 for j = 1, 2, · · · , k.
ii) ⟨∇J[x1], sk⟩ = ⟨∇J[x2], sk⟩ = . . . = ⟨∇J[xk], sk⟩.
iii) xk+1 minimizes J(x) in x1 + [s1, s2, · · · , sk]. (xk+1 = xk + αksk)
Problem 19. Explain the meaning of ⟨∇J[xk+1], sk⟩ = 0.
(Proof.)
i) From eq.(158),
⟨∇J[xk+1], sk⟩ = ⟨Axk+1 + b, sk⟩ = ⟨A(xk + αksk) + b, sk⟩ = αk⟨Ask, sk⟩ + ⟨Axk + b, sk⟩ = 0.
For j = 1, 2, . . . , k − 1, we have
225
⟨∇J[xk+1], sj⟩ = ⟨Axk+1 + b, sj⟩ = ⟨ A( xj+1 + Σ_{i=j+1}^{k} αi si ) + b, sj ⟩
= ⟨Axj+1 + b, sj⟩ + Σ_{i=j+1}^{k} αi⟨Asi, sj⟩ = ⟨∇J[xj+1], sj⟩ = 0.
ii)
⟨∇J[xj], sk⟩ = ⟨Axj + b, sk⟩ = ⟨ A( xk − Σ_{i=j}^{k−1} αi si ) + b, sk ⟩
= ⟨Axk + b, sk⟩ − Σ_{i=j}^{k−1} αi⟨Asi, sk⟩ = ⟨∇J[xk], sk⟩.
iii) Any x ∈ x1 + [s1, s2, . . . , sk] can be expressed as
x = x1 + β1s1 + β2s2 + . . . + βksk
226
with β1, β2, . . . , βk ∈ R. Then,
J(x) = (1/2)⟨ x1 + Σ_{i=1}^{k} βisi, A( x1 + Σ_{i=1}^{k} βisi ) ⟩ + ⟨ b, x1 + Σ_{i=1}^{k} βisi ⟩ + c
= (1/2) Σ_{i=1}^{k} Σ_{j=1}^{k} βiβj⟨si, Asj⟩ + Σ_{i=1}^{k} βi⟨Ax1 + b, si⟩ + (1/2)⟨x1, Ax1⟩ + ⟨b, x1⟩ + c
= Σ_{i=1}^{k} [ (1/2)⟨si, Asi⟩ βi² + ⟨Ax1 + b, si⟩ βi ] + (1/2)⟨x1, Ax1⟩ + ⟨b, x1⟩ + c,
where the A-conjugacy ⟨si, Asj⟩ = 0 (i ≠ j) is used in the last step.
227
Therefore, it becomes minimum if and only if
βi = −⟨Ax1 + b, si⟩ / ⟨si, Asi⟩. (160)
Since Ax1 + b = ∇J[x1] and Axi + b = ∇J[xi], (ii) in Theorem 9 yields
αi = −⟨Axi + b, si⟩/⟨si, Asi⟩ = −⟨∇J[xi], si⟩/⟨si, Asi⟩ = −⟨∇J[x1], si⟩/⟨si, Asi⟩ = −⟨Ax1 + b, si⟩/⟨si, Asi⟩ = βi. (161)
Since
xk+1 = x1 + Σ_{i=1}^{k} αisi = x1 + Σ_{i=1}^{k} βisi, (162)
xk+1 minimizes J(x) in x1 + [s1, . . . , sk]. QED.
Since x1 + [s1, . . . , sn] = Rⁿ, xn+1 minimizes J(x) in Rⁿ.
228
How to calculate s1, s2, . . . , sk : s1 is given by the gradient of J at x1:
s1 = −∇J[x1]. (163)
And we assume that
sk+1 = −∇J[xk+1] + Σ_{j=1}^{k} λ_j^{k+1} sj. (164)
Since sk+1 and sj (j = 1, 2, . . . , k) are A-conjugate,
0 = ⟨sk+1, Asj⟩ = ⟨ −∇J[xk+1] + Σ_{i=1}^{k} λ_i^{k+1} si, Asj ⟩ = −⟨∇J[xk+1], Asj⟩ + λ_j^{k+1}⟨sj, Asj⟩.
Then, we have
λ_j^{k+1} = ⟨∇J[xk+1], Asj⟩ / ⟨sj, Asj⟩.
229
Theorem 10. For j = 1, 2, . . . , k − 1, we have
λ_j^{k+1} = 0. (165)
(Proof.)
From eq.(164), for j = 1, 2, . . . , k,
∇J[xj] = −sj + Σ_{i=1}^{j−1} λ_i^{j} si.
Then, for j = 1, 2, . . . , k,
⟨∇J[xk+1], ∇J[xj]⟩ = ⟨ ∇J[xk+1], −sj + Σ_{i=1}^{j−1} λ_i^{j} si ⟩
= −⟨∇J[xk+1], sj⟩ + Σ_{i=1}^{j−1} λ_i^{j} ⟨∇J[xk+1], si⟩ = 0. (166)
230
Since
∇J[xj+1] − ∇J[xj] = Axj+1 + b − (Axj + b) = A(xj+1 − xj) = A(xj + αjsj − xj) = αj Asj,
we have
Asj = (∇J[xj+1] − ∇J[xj]) / αj. (167)
For j = 1, 2, . . . , k − 1, eq.(166) yields
λ_j^{k+1} = ⟨∇J[xk+1], Asj⟩/⟨sj, Asj⟩ = ⟨∇J[xk+1], ∇J[xj+1] − ∇J[xj]⟩ / (αj⟨sj, Asj⟩)
= ⟨∇J[xk+1], ∇J[xj+1]⟩/(αj⟨sj, Asj⟩) − ⟨∇J[xk+1], ∇J[xj]⟩/(αj⟨sj, Asj⟩) = 0.
QED.
Since λ_1^{k+1} = λ_2^{k+1} = · · · = λ_{k−1}^{k+1} = 0, we define λk by
λk = λ_k^{k+1}.
231
Algorithm of the conjugate gradient method for a quadratic function.
1. x1 : initial vector, s1 = −(Ax1 + b), k = 1.
2. If ∥Axk + b∥ is small enough, stop.
3. αk = −⟨Axk + b, sk⟩ / ⟨sk, Ask⟩.
4. xk+1 = xk + αksk.
5. λk = ⟨Axk+1 + b, Ask⟩ / ⟨sk, Ask⟩.
6. sk+1 = −(Axk+1 + b) + λksk.
7. k = k + 1.
8. Go to 2.
232
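The algorithm above can be sketched in Python on a made-up 2×2 positive definite matrix; for an n-dimensional quadratic function it terminates within n iterations:

```python
# Conjugate gradient for J(x) = (1/2)<Ax,x> + <b,x>; the minimum point
# solves Ax + b = 0. A and b below are illustrative values.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [-1.0, -2.0]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x = [0.0, 0.0]                                       # step 1: initial vector
g = [gi + bi for gi, bi in zip(mat_vec(A, x), b)]    # gradient Ax + b
s = [-gi for gi in g]                                # s1 = -(Ax1 + b)
for k in range(2):                                   # n = 2 iterations
    As = mat_vec(A, s)
    alpha = -dot(g, s) / dot(s, As)                  # step 3
    x = [xi + alpha * si for xi, si in zip(x, s)]    # step 4
    g = [gi + bi for gi, bi in zip(mat_vec(A, x), b)]
    lam = dot(g, As) / dot(s, As)                    # step 5
    s = [-gi + lam * si for gi, si in zip(g, s)]     # step 6
# x now satisfies Ax + b = 0, i.e. x = -A^{-1} b = (1/11, 7/11)
```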
10.1.3 Conjugate gradient method for a non-quadratic function
Extension to a non-quadratic case.
• αk can be obtained by linear search (xk+1 = xk + αksk).
• We can use ∇J.
• However, we cannot use Ask, so λk cannot be obtained directly.
Theorem 11. (Fletcher-Reeves' equation)
When J is a quadratic function, we have
λk = ∥∇J[xk+1]∥² / ∥∇J[xk]∥².
233
(Proof.)From eq.(167), we have
αk⟨sk, Ask⟩ = ⟨sk,∇J[xk+1] − ∇J[xk]⟩= ⟨sk,−∇J[xk]⟩= ⟨−∇J[xk] + λk−1sk−1,−∇J[xk]⟩= ∥∇J[xk]∥2, (168)
and eq.(166) yields
⟨∇J[x_{k+1}], ∇J[x_{k+1}] − ∇J[x_k]⟩ = ∥∇J[x_{k+1}]∥². (169)
Then, we have
λ_k = ⟨∇J[x_{k+1}], ∇J[x_{k+1}] − ∇J[x_k]⟩ / (α_k ⟨s_k, As_k⟩) = ∥∇J[x_{k+1}]∥² / ∥∇J[x_k]∥². (170)
Algorithm of the conjugate gradient method for a non-quadratic function.
1. x_1 : initial vector, s_1 = −∇J[x_1], k = 1.
2. If ∥∇J[x_k]∥ is small enough, stop.
3. Solve the line search problem of minimizing F(α_k) = J(x_k + α_k s_k) with respect to α_k, and let x_{k+1} = x_k + α_k s_k.
4. λ_k = ∥∇J[x_{k+1}]∥² / ∥∇J[x_k]∥².
5. s_{k+1} = −∇J[x_{k+1}] + λ_k s_k.
6. k = k + 1.
7. Go to 2.
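A sketch of the algorithm above, using the Fletcher-Reeves ratio and a crude backtracking line search. The test function is a made-up strictly convex example with minimizer (1, −0.5); the restart every n steps is a common practical addition and is not part of the algorithm as stated:

```python
import math

# Fletcher-Reeves conjugate gradient for a non-quadratic function.

def f(x):
    return math.cosh(x[0] - 1.0) + math.cosh(x[1] + 0.5)

def grad(x):
    return [math.sinh(x[0] - 1.0), math.sinh(x[1] + 0.5)]

def line_search(x, s):
    # crude backtracking: halve alpha until f strictly decreases
    alpha, fx = 1.0, f(x)
    while alpha > 1e-16:
        if f([xi + alpha * si for xi, si in zip(x, s)]) < fx:
            return alpha
        alpha *= 0.5
    return 0.0

def fr_cg(x, iters=200):
    g = grad(x)
    s = [-gi for gi in g]
    for k in range(iters):
        alpha = line_search(x, s)                      # step 3
        x = [xi + alpha * si for xi, si in zip(x, s)]
        g_new = grad(x)
        gg_new = sum(gi * gi for gi in g_new)
        if gg_new < 1e-20:
            break
        lam = gg_new / sum(gi * gi for gi in g)        # step 4, eq.(170)
        if (k + 1) % len(x) == 0:
            lam = 0.0                                  # periodic restart (practical extra)
        s = [-gi + lam * si for gi, si in zip(g_new, s)]  # step 5
        g = g_new
    return x

x = fr_cg([3.0, 2.0])
```

Note that λ_k no longer needs As_k: only gradient norms enter, which is the point of Theorem 11.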
10.2 Newton's method
Taylor expansion:
J(x_1, . . . , x_n) = J(a_1, . . . , a_n) + Σ_{i=1}^{n} ∂J/∂x_i |_{x=a} (x_i − a_i)
+ (1/2!) Σ_{i=1}^{n} Σ_{j=1}^{n} ∂²J/∂x_i∂x_j |_{x=a} (x_i − a_i)(x_j − a_j)
+ (terms of order 3 and higher). (171)
Hessian matrix (Hesse):
(∇²J)_{ij} = ∂²J/∂x_i∂x_j.
∇²J is a symmetric matrix. We define the following notation:
∇²J[a] = ∇²J|_{x=a}.
Eq.(171) can be described as
J(x) = J(a) + ⟨∇J[a], x − a⟩ + (1/2!)⟨x − a, ∇²J[a](x − a)⟩ + · · · .
We define J_2, the approximation of J by a quadratic function, as
J_2(x) = J(a) + ⟨∇J[a], x − a⟩ + (1/2!)⟨x − a, ∇²J[a](x − a)⟩.
If ∇²J[a] > 0, the minimum point of J_2(x) is given as
x_min = −(∇²J[a])^{−1}(∇J[a] − ∇²J[a]a) = a − (∇²J[a])^{−1}∇J[a]. (172)
Algorithm of Newton's method
1. x_1 : initial vector, k = 1.
2. Calculate ∇J[x_k] and ∇²J[x_k].
3. If ∥∇J[x_k]∥ is small, stop.
4. x_{k+1} = x_k − (∇²J[x_k])^{−1}∇J[x_k].
5. k = k + 1.
6. Go to 2.
This method is equivalent to the Newton-Raphson method for solving the following equation:
∇J[x] = 0. (173)
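A minimal sketch of the iteration x_{k+1} = x_k − (∇²J[x_k])⁻¹∇J[x_k]. The test function is a made-up separable example, J(x) = cosh(x_1) + x_1 + cosh(x_2) − x_2, whose minimizer satisfies sinh(x_1) = −1 and sinh(x_2) = 1:

```python
import math

# Newton's method on a 2-variable problem; the 2x2 Newton system is
# solved by Cramer's rule rather than forming the inverse explicitly.

def grad(x):
    return [math.sinh(x[0]) + 1.0, math.sinh(x[1]) - 1.0]

def hess(x):
    # the Hessian of this separable J is diagonal
    return [[math.cosh(x[0]), 0.0], [0.0, math.cosh(x[1])]]

def solve2(H, g):
    # solve H d = g for a 2x2 matrix H
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(g[0] * H[1][1] - g[1] * H[0][1]) / det,
            (g[1] * H[0][0] - g[0] * H[1][0]) / det]

def newton(x, tol=1e-12, iters=50):
    for _ in range(iters):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            break
        d = solve2(hess(x), g)
        x = [xi - di for xi, di in zip(x, d)]   # x_{k+1} = x_k - H^{-1} g
    return x

x = newton([0.0, 0.0])
# converges to (-asinh(1), asinh(1))
```

Near the solution the error is roughly squared at every step (quadratic convergence), which is why so few iterations are needed.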
10.2.1 Quasi-Newton's method
H_k : a matrix that approximates the inverse of the Hessian matrix, (∇²J)^{−1}.
Algorithm of quasi-Newton's method (except for obtaining H_k)
1. x_1 : initial vector, H_1 = I, k = 1.
2. Calculate ∇J[x_k].
3. If ∥∇J[x_k]∥ is small, stop.
4. s_k = −H_k∇J[x_k] : the direction for the line search.
5. Solve the line search problem of minimizing F(α_k) = J(x_k + α_k s_k) with respect to α_k, and let x_{k+1} = x_k + α_k s_k.
6. Calculate H_{k+1} by using (x_{k+1} − x_k), (∇J[x_{k+1}] − ∇J[x_k]), and H_k.
7. k = k + 1 and go to 2.
Condition for obtaining H_k : Let
p_k = x_{k+1} − x_k = α_k s_k, (174)
q_k = ∇J[x_{k+1}] − ∇J[x_k]. (175)
Consider the case that J is a quadratic function, that is,
J(x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩ + c. (176)
Then,
q_k = Ax_{k+1} + b − (Ax_k + b) = A(x_{k+1} − x_k) = ∇²J[x_k] p_k. (177)
If ∇²J[x_k] is invertible, we have
(∇²J[x_k])^{−1} q_k = p_k. (178)
Quasi-Newton condition for H_{k+1} :
H_{k+1} q_k = p_k. (179)
Theorem 12. Assume that
J(x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩ + c. (180)
Let x_k be the result of quasi-Newton's method at the k-th iteration. We assume that A and H_k (k = 1, 2, . . . , n) are positive definite (n, n)-matrices and
∇J[x_k] ≠ 0 (k = 1, 2, . . . , n), (181)
H_{k+1} q_j = p_j (j = 1, 2, . . . , k; k = 1, 2, . . . , n). (182)
Then the following relations hold.
(i) s_1, . . . , s_n are A-conjugate and linearly independent (s_k = −H_k∇J[x_k]).
(ii) x_{n+1} is the minimum point of J and H_{n+1} = A^{−1}.
(Proof.) Since J is a quadratic function, it follows that
q_j = Ax_{j+1} + b − (Ax_j + b) = α_j As_j, (183)
and
p_j = x_{j+1} − x_j = α_j s_j. (184)
Eq.(182) yields that
α_j s_j = p_j = H_{k+1} q_j = H_{k+1} α_j As_j = α_j H_{k+1} As_j.
Then, we have
H_{k+1} As_j = s_j (j = 1, 2, . . . , k). (185)
(i)
s_1 = −H_1∇J[x_1] = −(Ax_1 + b),
s_2 = −H_2∇J[x_2] = −H_2(Ax_2 + b).
It follows that
⟨s_2, As_1⟩ = ⟨−H_2∇J[x_2], As_1⟩ = −⟨Ax_2 + b, H_2 As_1⟩ = −⟨Ax_2 + b, s_1⟩
= −⟨A(x_1 + α_1 s_1) + b, s_1⟩ = −α_1⟨s_1, As_1⟩ − ⟨Ax_1 + b, s_1⟩
= (⟨Ax_1 + b, s_1⟩ / ⟨s_1, As_1⟩)⟨s_1, As_1⟩ − ⟨Ax_1 + b, s_1⟩ = 0.
Suppose that s_1 and s_2 are not linearly independent. Then A > 0 and ⟨s_2, As_1⟩ = 0 yield that s_1 = 0 or s_2 = 0. Since we assume that H_k > 0, this contradicts ∇J[x_1] ≠ 0 and ∇J[x_2] ≠ 0. Then, s_1 and s_2 are linearly independent.
Assume that s_1, . . . , s_k are A-conjugate and linearly independent. Since s_1, . . . , s_k are A-conjugate, Theorem 9 yields
⟨∇J[x_{k+1}], s_j⟩ = 0 (j = 1, 2, . . . , k).
Eq.(185) yields that for j = 1, 2, . . . , k,
⟨s_{k+1}, As_j⟩ = ⟨−H_{k+1}∇J[x_{k+1}], As_j⟩ = −⟨∇J[x_{k+1}], H_{k+1} As_j⟩ = −⟨∇J[x_{k+1}], s_j⟩ = 0.
Then, s_1, . . . , s_{k+1} are A-conjugate (mathematical induction).
Suppose that s_1, . . . , s_{k+1} are not linearly independent. Then, for real numbers λ_1, . . . , λ_{k+1}, not all zero, we have
Σ_{i=1}^{k+1} λ_i s_i = 0.
Then,
0 = ⟨As_{k+1}, Σ_{i=1}^{k+1} λ_i s_i⟩ = Σ_{i=1}^{k+1} λ_i ⟨s_{k+1}, As_i⟩ = λ_{k+1}⟨s_{k+1}, As_{k+1}⟩,
and A > 0 yields that λ_{k+1} = 0. It follows that
Σ_{i=1}^{k} λ_i s_i = 0.
This contradicts the assumption that s_1, . . . , s_k are linearly independent. The mathematical induction yields (i).
(ii) From (iii) in Theorem 9 and (i), we know that x_{n+1} is the minimum point of J. Since for i = 1, 2, . . . , n
H_{n+1} As_i = s_i,
and s_1, . . . , s_n are linearly independent, we have
H_{n+1} A = I, i.e., H_{n+1} = A^{−1}.
QED.
Problem 20. Prove that a positive definite (N, N)-matrix A is regular (it has an inverse matrix).
Hint: First, prove that Ax = 0 implies x = 0; this implies that A is injective. Second, for a basis {e_i}_{i=1}^{N} of R^N, prove that {Ae_i}_{i=1}^{N} is also a basis of R^N; this implies that A is surjective. We can prove it by showing that Ae_i (i = 1, 2, . . . , N) are linearly independent (using the linear independence of e_i (i = 1, 2, . . . , N)). Then, because A is injective and surjective, it is regular.
10.2.2 Formula for obtaining Hk
We assume that
H_{k+1} = H_k + ∆H_k. (186)
Since it is preferable that the formula for ∆H_k be simple, we consider the case that the rank of ∆H_k is 1 or 2.
Formula of rank 1. We can let
∆H_k = α u_k u_k^T
for a real number α and a vector u_k. The quasi-Newton condition yields that
H_k q_k + α u_k u_k^T q_k = p_k. (187)
Then, we have
α ⟨u_k, q_k⟩ u_k = p_k − H_k q_k. (188)
In order to satisfy this equation, we let
u_k = p_k − H_k q_k, (189)
α = 1 / ⟨u_k, q_k⟩. (190)
It follows that
H_{k+1} = H_k + (p_k − H_k q_k)(p_k − H_k q_k)^T / ⟨p_k − H_k q_k, q_k⟩. (191)
This formula has the disadvantage that even if H_k > 0, H_{k+1} may not be positive definite when ⟨p_k − H_k q_k, q_k⟩ < 0.
Formula of rank 2 (Davidon-Fletcher-Powell (DFP) formula). Let
∆H_k = α u_k u_k^T + β v_k v_k^T.
The quasi-Newton condition yields that
H_k q_k + α u_k u_k^T q_k + β v_k v_k^T q_k = p_k. (192)
Then, we have
H_k q_k + α ⟨u_k, q_k⟩ u_k + β ⟨v_k, q_k⟩ v_k = p_k. (193)
There are an infinite number of solutions. Here, we let
u_k = p_k, (194)
α = 1 / ⟨u_k, q_k⟩, (195)
v_k = −H_k q_k, (196)
β = 1 / ⟨v_k, q_k⟩. (197)
It follows that
H_{k+1} = H_k + p_k p_k^T / ⟨p_k, q_k⟩ − H_k q_k q_k^T H_k / ⟨q_k, H_k q_k⟩. (198)
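A sketch of the DFP update on a quadratic J(x) = ½⟨x, Ax⟩ + ⟨b, x⟩ with exact line search, using made-up example data; after n steps the minimizer is reached and H equals A⁻¹, as Theorem 12 predicts:

```python
# DFP quasi-Newton iteration, eq.(198), on a 2-variable quadratic.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def outer_scaled(u, v, c):
    # c * u v^T as a list-of-lists matrix
    return [[c * ui * vj for vj in v] for ui in u]

A = [[3.0, 1.0], [1.0, 2.0]]      # hypothetical SPD matrix, A^{-1} = [[0.4,-0.2],[-0.2,0.6]]
b = [1.0, -1.0]
n = 2
x = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]      # H_1 = I
g = [gi + bi for gi, bi in zip(matvec(A, x), b)]
for _ in range(n):
    s = [-si for si in matvec(H, g)]          # s_k = -H_k grad J[x_k]
    As = matvec(A, s)
    alpha = -dot(g, s) / dot(s, As)           # exact line search for a quadratic
    x_new = [xi + alpha * si for xi, si in zip(x, s)]
    g_new = [gi + bi for gi, bi in zip(matvec(A, x_new), b)]
    p = [a - c for a, c in zip(x_new, x)]     # p_k, eq.(174)
    q = [a - c for a, c in zip(g_new, g)]     # q_k, eq.(175)
    Hq = matvec(H, q)
    U = outer_scaled(p, p, 1.0 / dot(p, q))        # + p p^T / <p, q>
    V = outer_scaled(Hq, Hq, -1.0 / dot(q, Hq))    # - Hq (Hq)^T / <q, Hq>
    H = [[H[i][j] + U[i][j] + V[i][j] for j in range(n)] for i in range(n)]
    x, g = x_new, g_new
# Now Ax + b = 0 (x = (-0.6, 0.8)) and H = A^{-1}
```

Since H is symmetric, H q qᵀ H = (Hq)(Hq)ᵀ, which is what the `V` term computes.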
Theorem 13. We assume that (H_1)^T = H_1 > 0 and ∇J[x_k] ≠ 0 (k = 1, 2, . . . , n). For the quasi-Newton's method using the DFP formula, we have the following relations.
(i) H_k > 0 and (H_k)^T = H_k for k = 1, 2, . . . , n.
(ii) When J is a quadratic function
J(x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩ + c, (199)
s_1, . . . , s_n are A-conjugate to each other and
H_{k+1} q_j = p_j (200)
for j = 1, 2, . . . , k and k = 1, 2, . . . , n.
(Proof.) (i) (H_k)^T = H_k is clear.
Assume that H_k > 0. We will prove H_{k+1} > 0. There exists a symmetric positive definite matrix D_k such that
H_k = D_k D_k.
Indeed, we can write H_k = U diag(λ_1, . . . , λ_n) U^T with λ_i > 0 (i = 1, 2, . . . , n), and let D_k = U diag(√λ_1, . . . , √λ_n) U^T. Since U U^T = U^T U = I, it is clear that D_k = D_k^T and H_k = D_k D_k.
For any z,
⟨z, H_{k+1} z⟩ = ⟨z, H_k z⟩ + ⟨z, p_k⟩² / ⟨p_k, q_k⟩ − ⟨z, H_k q_k⟩² / ⟨q_k, H_k q_k⟩
= ∥D_k z∥² + ⟨z, p_k⟩² / ⟨p_k, q_k⟩ − ⟨D_k z, D_k q_k⟩² / ∥D_k q_k∥²
= (∥D_k z∥² ∥D_k q_k∥² − ⟨D_k z, D_k q_k⟩²) / ∥D_k q_k∥² + ⟨z, p_k⟩² / ⟨p_k, q_k⟩
≥ ⟨z, p_k⟩² / ⟨p_k, q_k⟩. (201)
If p_k = 0, then 0 = p_k / α_k = s_k = −H_k∇J[x_k]. Since H_k > 0, ∇J[x_k] = 0. This contradiction yields p_k ≠ 0.
If q_k = 0, then ∇J[x_{k+1}] = ∇J[x_k]. Let F(t) = J[x_k + t s_k].
Since t = α_k is an extreme point of F(t), we have
0 = dF/dt |_{t=α_k} = ⟨s_k, ∇J[x_k + α_k s_k]⟩ = −⟨H_k∇J[x_k], ∇J[x_{k+1}]⟩. (202)
Then, ∇J[x_{k+1}] = ∇J[x_k] yields
⟨∇J[x_{k+1}], H_k∇J[x_k]⟩ = ⟨∇J[x_k], H_k∇J[x_k]⟩ = 0. (203)
This contradicts ∇J[x_k] ≠ 0 and H_k > 0. Then, q_k ≠ 0.
Since
dF/dt |_{t=0} = ⟨s_k, ∇J[x_k]⟩ = −⟨H_k∇J[x_k], ∇J[x_k]⟩ < 0,
t = 0 is not the minimum point and we have α_k > 0. Then, eq.(202) yields
⟨p_k, q_k⟩ = ⟨α_k s_k, ∇J[x_{k+1}] − ∇J[x_k]⟩ = −α_k⟨∇J[x_k], s_k⟩ = α_k⟨∇J[x_k], H_k∇J[x_k]⟩ > 0. (204)
Therefore,
⟨z, H_{k+1} z⟩ ≥ 0.
Assume that ⟨z, H_{k+1} z⟩ = 0. From eq.(201), we have ⟨z, p_k⟩ = 0 and |⟨D_k z, D_k q_k⟩| = ∥D_k z∥ ∥D_k q_k∥, so the equality condition of the Cauchy-Schwarz inequality yields
D_k z = β D_k q_k
with a real number β. Since D_k > 0, we have
z = β q_k.
Then ⟨z, p_k⟩ = 0 yields β⟨p_k, q_k⟩ = 0, and eq.(204) yields that β = 0 and z = β q_k = 0. Therefore, we have H_{k+1} > 0.
(ii) From the definition (with H_1 = I) we have
H_2 = I + p_1 p_1^T / ⟨p_1, q_1⟩ − q_1 q_1^T / ⟨q_1, q_1⟩.
Then,
H_2 q_1 = q_1 + p_1 − q_1 = p_1.
We assume that s_1, . . . , s_k are A-conjugate and H_{k+1} q_j = p_j (j = 1, 2, . . . , k), and will prove that s_1, . . . , s_{k+1} are A-conjugate and H_{k+2} q_l = p_l (l = 1, 2, . . . , k + 1).
From (i) in Theorem 9, for j = 1, 2, . . . , k we have
α_j⟨s_{k+1}, As_j⟩ = ⟨−H_{k+1}∇J[x_{k+1}], A(α_j s_j)⟩ = −⟨H_{k+1}∇J[x_{k+1}], A p_j⟩
= −⟨∇J[x_{k+1}], H_{k+1} A p_j⟩ = −⟨∇J[x_{k+1}], H_{k+1} q_j⟩
= −⟨∇J[x_{k+1}], p_j⟩ = −α_j⟨∇J[x_{k+1}], s_j⟩ = 0. (205)
Since α_j ≠ 0, ⟨s_{k+1}, As_j⟩ = 0 for j = 1, 2, . . . , k. Then, s_1, s_2, . . . , s_k, s_{k+1} are A-conjugate.
For j = 1, 2, . . . , k we have
⟨p_{k+1}, q_j⟩ = ⟨p_{k+1}, A p_j⟩ = α_{k+1} α_j ⟨s_{k+1}, As_j⟩ = 0, (206)
⟨q_{k+1}, H_{k+1} q_j⟩ = ⟨q_{k+1}, p_j⟩ = ⟨A p_{k+1}, p_j⟩ = α_{k+1} α_j ⟨s_{k+1}, As_j⟩ = 0. (207)
Then,
H_{k+2} q_j = H_{k+1} q_j + (⟨p_{k+1}, q_j⟩ / ⟨p_{k+1}, q_{k+1}⟩) p_{k+1} − (⟨q_{k+1}, H_{k+1} q_j⟩ / ⟨q_{k+1}, H_{k+1} q_{k+1}⟩) H_{k+1} q_{k+1}
= H_{k+1} q_j = p_j. (208)
Furthermore,
H_{k+2} q_{k+1} = H_{k+1} q_{k+1} + p_{k+1} − H_{k+1} q_{k+1} = p_{k+1}. (209)
This completes the proof. QED.
Note that (i) holds also for non-quadratic problems.
Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula
This formula is now the most popular.
H_{k+1} = p_k p_k^T / ⟨p_k, q_k⟩ + (I − p_k q_k^T / ⟨p_k, q_k⟩) H_k (I − q_k p_k^T / ⟨p_k, q_k⟩) (210)
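A sketch of quasi-Newton's method with the BFGS update, eq.(210), and a crude backtracking line search. The test function is a made-up strictly convex example with minimizer (1, −2); the curvature check ⟨p, q⟩ > 0 guards the update in the same way eq.(204) guarantees it for the DFP formula:

```python
import math

# BFGS quasi-Newton method on a 2-variable non-quadratic function.

def f(x):
    return math.cosh(x[0] - 1.0) + math.cosh(x[1] + 2.0)

def grad(x):
    return [math.sinh(x[0] - 1.0), math.sinh(x[1] + 2.0)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def bfgs_update(H, p, q):
    # H <- (I - p q^T / <p,q>) H (I - q p^T / <p,q>) + p p^T / <p,q>
    n = len(p)
    rho = 1.0 / sum(pi * qi for pi, qi in zip(p, q))
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    L = [[I[i][j] - rho * p[i] * q[j] for j in range(n)] for i in range(n)]
    R = [[I[i][j] - rho * q[i] * p[j] for j in range(n)] for i in range(n)]
    LH = [[sum(L[i][k] * H[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    LHR = [[sum(LH[i][k] * R[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    return [[LHR[i][j] + rho * p[i] * p[j] for j in range(n)] for i in range(n)]

def bfgs(x, iters=100):
    n = len(x)
    H = [[float(i == j) for j in range(n)] for i in range(n)]   # H_1 = I
    g = grad(x)
    for _ in range(iters):
        if max(abs(gi) for gi in g) < 1e-10:
            break
        s = [-si for si in matvec(H, g)]        # search direction
        alpha, fx = 1.0, f(x)
        while alpha > 1e-16 and f([xi + alpha * si for xi, si in zip(x, s)]) >= fx:
            alpha *= 0.5                        # backtracking line search
        x_new = [xi + alpha * si for xi, si in zip(x, s)]
        g_new = grad(x_new)
        p = [a - c for a, c in zip(x_new, x)]
        q = [a - c for a, c in zip(g_new, g)]
        if sum(pi * qi for pi, qi in zip(p, q)) > 1e-12:   # curvature condition
            H = bfgs_update(H, p, q)
        x, g = x_new, g_new
    return x

x = bfgs([4.0, 1.0])
```

As H approaches the inverse Hessian, the full step α = 1 is accepted and the iteration behaves like Newton's method near the solution.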
11 Conditional Optimization Problem
f : R^n → R : the function to be optimized
g : R^n → R^m : a vector of functions for the conditions,
g(x) = (g_1(x), g_2(x), . . . , g_m(x))^T.
Equality condition: subject to g(x) = 0, minimize f(x).
Inequality condition: subject to g(x) ≥ 0, minimize f(x).
• m : the number of conditions.
• g(x) ≥ 0 is equivalent to g_1(x) ≥ 0, g_2(x) ≥ 0, . . . , g_m(x) ≥ 0.
• An equality condition can be transformed into inequality conditions:
subject to g(x) ≥ 0 and −g(x) ≥ 0, minimize f(x).
11.1 Lagrange's method of undetermined coefficients
Problem (equality conditions): subject to
g_1(x_1, x_2, . . . , x_n) = 0,
· · ·
g_m(x_1, x_2, . . . , x_n) = 0, (211)
minimize
f(x_1, x_2, . . . , x_n), (212)
where f, g_1, . . . , g_m are continuously differentiable. Let
A = ( ∂g_1/∂x_1 · · · ∂g_m/∂x_1 ; · · · ; ∂g_1/∂x_n · · · ∂g_m/∂x_n ) (213)
(rows separated by semicolons; the (i, j) entry of A is ∂g_j/∂x_i).
Theorem 14. (Lagrange's method of undetermined coefficients) Assume that rank(A) = m for every (x_1, . . . , x_n) satisfying eq.(211). If (x_1, . . . , x_n) is a stationary point of f among those satisfying eq.(211), then the derivatives of
J(x_1, . . . , x_n, λ_1, . . . , λ_m) = f(x_1, . . . , x_n) − λ_1 g_1(x_1, . . . , x_n) − · · · − λ_m g_m(x_1, . . . , x_n) (214)
with respect to x_1, . . . , x_n, λ_1, . . . , and λ_m are all 0.
λ_1, . . . , λ_m are called the Lagrange (undetermined) coefficients.
(Proof.) We use the following implicit function theorem. (For its proof, see a textbook on calculus.)
Proposition 1. Assume that x = a = (a_1, . . . , a_n) satisfies eq.(211) and
det( ∂g_1/∂x_1|_{x=a} · · · ∂g_m/∂x_1|_{x=a} ; · · · ; ∂g_1/∂x_m|_{x=a} · · · ∂g_m/∂x_m|_{x=a} ) ≠ 0. (215)
Then, there exist continuously differentiable functions
h_1(x_{m+1}, . . . , x_n), . . . , h_m(x_{m+1}, . . . , x_n)
on a neighborhood of (a_{m+1}, . . . , a_n) such that
g_j(h_1(x_{m+1}, . . . , x_n), . . . , h_m(x_{m+1}, . . . , x_n), x_{m+1}, . . . , x_n) = 0 (216)
for all j = 1, 2, . . . , m. Furthermore, any (x_1, . . . , x_n) satisfying eq.(211) on the neighborhood of a can be written as
(h_1(x_{m+1}, . . . , x_n), . . . , h_m(x_{m+1}, . . . , x_n), x_{m+1}, . . . , x_n).
Example. For a function g(x_1, x_2), we assume that g(a_1, a_2) = 0 and
∂g/∂x_1 |_{x_1=a_1, x_2=a_2} ≠ 0.
Then there exists a function x_1 = f(x_2) in the neighborhood of x_2 = a_2 such that
g(f(x_2), x_2) = 0.
That is, g(x_1, x_2) = 0 implicitly expresses a function x_1 = f(x_2).
Since rank(A) = m, by exchanging variables (rows in A), we can assume that eq.(215) holds at a stationary point. Then, we have h_i(x_{m+1}, . . . , x_n) (i = 1, 2, . . . , m) satisfying eq.(216). We define F(x_{m+1}, . . . , x_n) and G_j(x_{m+1}, . . . , x_n) as
F(x_{m+1}, . . . , x_n) = f(h_1(x_{m+1}, . . . , x_n), . . . , h_m(x_{m+1}, . . . , x_n), x_{m+1}, . . . , x_n), (217)
G_j(x_{m+1}, . . . , x_n) = g_j(h_1(x_{m+1}, . . . , x_n), . . . , h_m(x_{m+1}, . . . , x_n), x_{m+1}, . . . , x_n) (218)
for j = 1, 2, . . . , m. Then, we have
G_j(x_{m+1}, . . . , x_n) = 0 (219)
for any (x_{m+1}, . . . , x_n) in the neighborhood of the stationary point. We also have
∂F/∂x_k = 0 (220)
at the stationary point for k = m + 1, . . . , n.
Eqs.(220) and (219) yield that
∂F/∂x_k = Σ_{i=1}^{m} (∂f/∂x_i)(∂h_i/∂x_k) + ∂f/∂x_k = 0, (221)
∂G_j/∂x_k = Σ_{i=1}^{m} (∂g_j/∂x_i)(∂h_i/∂x_k) + ∂g_j/∂x_k = 0 (222)
for j = 1, 2, . . . , m and k = m + 1, . . . , n.
By using matrices, we can describe eq.(221) as
( ∂F/∂x_{m+1}, . . . , ∂F/∂x_n )^T
= ( ∂h_1/∂x_{m+1} · · · ∂h_m/∂x_{m+1} ; · · · ; ∂h_1/∂x_n · · · ∂h_m/∂x_n ) ( ∂f/∂x_1, . . . , ∂f/∂x_m )^T + ( ∂f/∂x_{m+1}, . . . , ∂f/∂x_n )^T = 0, (223)
and eq.(222) as
( ∂G_1/∂x_{m+1} · · · ∂G_m/∂x_{m+1} ; · · · ; ∂G_1/∂x_n · · · ∂G_m/∂x_n )
= ( ∂h_1/∂x_{m+1} · · · ∂h_m/∂x_{m+1} ; · · · ; ∂h_1/∂x_n · · · ∂h_m/∂x_n ) ( ∂g_1/∂x_1 · · · ∂g_m/∂x_1 ; · · · ; ∂g_1/∂x_m · · · ∂g_m/∂x_m )
+ ( ∂g_1/∂x_{m+1} · · · ∂g_m/∂x_{m+1} ; · · · ; ∂g_1/∂x_n · · · ∂g_m/∂x_n ) = O. (224)
Let
H = ( ∂h_1/∂x_{m+1} · · · ∂h_m/∂x_{m+1} ; · · · ; ∂h_1/∂x_n · · · ∂h_m/∂x_n ),
p = ( ∂f/∂x_1, . . . , ∂f/∂x_m )^T,  p′ = ( ∂f/∂x_{m+1}, . . . , ∂f/∂x_n )^T,
and
Q = ( ∂g_1/∂x_1 · · · ∂g_m/∂x_1 ; · · · ; ∂g_1/∂x_m · · · ∂g_m/∂x_m ),
Q′ = ( ∂g_1/∂x_{m+1} · · · ∂g_m/∂x_{m+1} ; · · · ; ∂g_1/∂x_n · · · ∂g_m/∂x_n ).
Then, eqs.(221) and (222) can be written as
H p + p′ = 0, (225)
H Q + Q′ = O. (226)
Since Q is a regular matrix, let
λ = (λ_1, . . . , λ_m)^T = Q^{−1} p. (227)
From eq.(226), we have
H = −Q′ Q^{−1}. (228)
From eq.(225), since p = Qλ,
0 = p′ + H p = p′ + (−Q′ Q^{−1}) Q λ = p′ − Q′ λ. (229)
From p − Qλ = 0 and p′ − Q′λ = 0, it follows that
∂J/∂x_k = ∂f/∂x_k − λ_1 ∂g_1/∂x_k − · · · − λ_m ∂g_m/∂x_k = 0 (230)
for k = 1, 2, . . . , n. Furthermore,
∂J/∂λ_l = −g_l = 0 (231)
for l = 1, 2, . . . , m. QED.
Example
g(x_1, x_2) = 2x_1 + x_2 − 1, (232)
f(x_1, x_2) = x_1² + 2x_2². (233)
Since
A = ( 2 ; 1 )
implies that rank(A) = 1, we can use
J = x_1² + 2x_2² − λ(2x_1 + x_2 − 1).
∂J/∂x_1 = 2x_1 − 2λ = 0,
∂J/∂x_2 = 4x_2 − λ = 0,
∂J/∂λ = −(2x_1 + x_2 − 1) = 0.
The answer is given as x_1 = 4/9, x_2 = 1/9. (234)
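The stationarity system of this example can be checked in exact arithmetic. From 2x_1 − 2λ = 0 and 4x_2 − λ = 0 we get x_1 = λ and x_2 = λ/4, and the constraint then fixes λ:

```python
from fractions import Fraction

# Solving the example's stationarity system by substitution:
#   2*x1 - 2*lam = 0,  4*x2 - lam = 0,  2*x1 + x2 - 1 = 0.
lam = Fraction(1) / (2 + Fraction(1, 4))   # from 2*lam + lam/4 = 1
x1 = lam
x2 = lam / 4
# x1 = 4/9, x2 = 1/9, agreeing with eq.(234)
```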
Bad example
g(x_1, x_2) = (2x_1 + x_2 − 1)², (235)
f(x_1, x_2) = x_1² + 2x_2². (236)
Let
J = x_1² + 2x_2² − λ(2x_1 + x_2 − 1)².
∂J/∂x_1 = 2x_1 − 4λ(2x_1 + x_2 − 1) = 0, (237)
∂J/∂x_2 = 4x_2 − 2λ(2x_1 + x_2 − 1) = 0, (238)
∂J/∂λ = −(2x_1 + x_2 − 1)² = 0. (239)
By substituting eq.(239) into eqs.(237) and (238), we have x_1 = 0, x_2 = 0. (240)
But then eq.(239) implies
1 = 0 → contradiction.
This is because
A = ( 4(2x_1 + x_2 − 1) ; 2(2x_1 + x_2 − 1) ) = ( 0 ; 0 )
when (x_1, x_2) satisfies eq.(239), so that rank(A) = 0.
→ Lagrange's method cannot be applied.
Problem 21. Solve the following optimization problem by Lagrange's method.
g(x_1, x_2) = x_1 x_2 − 1, (241)
f(x_1, x_2) = x_1² + 4x_1 x_2 + 4x_2². (242)
Example. Consider the following optimization problem for a matrix B:
g(B) = BA − C, (243)
f(B). (244)
NG:
J(B) = f(B) − λ tr((BA − C)(BA − C)^T) = f(B) − λ Σ_i Σ_j |(BA − C)_{ij}|². (245)
OK:
J(B) = f(B) − tr(Λ(BA − C)^T) = f(B) − Σ_i Σ_j (Λ)_{ij}(BA − C)_{ij}. (246)
The squared term cannot be used for the condition in Lagrange's method.
However, we often use
J(x) = f(x) − λ g(x) + a (g(x))², (247)
where a is a constant that improves the stability of the numerical calculation.
11.2 Inequality condition
Problem: subject to
g_1(x_1, x_2, . . . , x_n) ≥ 0,
· · ·
g_m(x_1, x_2, . . . , x_n) ≥ 0, (248)
minimize
f(x_1, x_2, . . . , x_n), (249)
where f, g_1, . . . , g_m are continuously differentiable.
11.2.1 KKT (Karush-Kuhn-Tucker) condition
We define the Lagrange function as
J(x_1, . . . , x_n, λ_1, . . . , λ_m) = f(x_1, . . . , x_n) − λ_1 g_1(x_1, . . . , x_n) − · · · − λ_m g_m(x_1, . . . , x_n) (250)
with Lagrange coefficients λ_1, . . . , λ_m.
Theorem 15. (KKT condition) If (x_1, . . . , x_n) is a stationary point of f among those satisfying eq.(248), then at that point we have
∂J/∂x_i = 0 (251)
for i = 1, 2, . . . , n, and we have
λ_1 ≥ 0, . . . , λ_m ≥ 0,
g_1(x_1, x_2, . . . , x_n) ≥ 0, . . . , g_m(x_1, x_2, . . . , x_n) ≥ 0,
λ_1 g_1(x_1, x_2, . . . , x_n) = 0, . . . , λ_m g_m(x_1, x_2, . . . , x_n) = 0. (252)
• When λ_i > 0, g_i(x) = 0.
⇒ The minimum point is on the boundary of the condition given by g_i.
• When λ_i = 0, g_i(x) ≥ 0.
⇒ The minimum point is a minimum even without the condition on g_i(x).
11.3 Saddle point theorem and dual problem
• Criteria (248) and (249) can be written as
min_{(x_1,...,x_n)} max_{(λ_1≥0,...,λ_m≥0)} J(x_1, . . . , x_n, λ_1, . . . , λ_m).
Remark: the number of variables increases, but the conditions become simpler.
(Proof) We have
max_{(λ_1≥0,...,λ_m≥0)} [ f(x_1, . . . , x_n) − λ_1 g_1(x_1, . . . , x_n) − · · · − λ_m g_m(x_1, . . . , x_n) ]
= f(x_1, . . . , x_n) if g_1(x_1, . . . , x_n) ≥ 0, . . . , g_m(x_1, . . . , x_n) ≥ 0, and ∞ otherwise.
Since the above value is minimized with respect to x_1, . . . , x_n and f(x_1, . . . , x_n) < ∞, its minimum point has to satisfy the conditions g_1(x_1, . . . , x_n) ≥ 0, . . . , g_m(x_1, . . . , x_n) ≥ 0. QED.
Theorem 16. (Saddle point theorem) We have
min_{(x_1,...,x_n)} max_{(λ_1≥0,...,λ_m≥0)} J(x_1, . . . , x_n, λ_1, . . . , λ_m) = max_{(λ_1≥0,...,λ_m≥0)} min_{(x_1,...,x_n)} J(x_1, . . . , x_n, λ_1, . . . , λ_m).
(We can commute minimization and maximization.)
Remark: it does not hold if the domain of the variables is discrete.
x \ λ |  1  |  2  | max_λ
  1   | 0.3 | 0.5 | 0.5
  2   | 0.7 | 0.4 | 0.7
min_x | 0.3 | 0.4 |
min_x max_λ = 0.5, max_λ min_x = 0.4.
• By using the saddle point theorem, we can rewrite the primal problem to its dual problem.
11.4 Dual problem of linear programming
Definition of a linear programming problem:
x : n-dimensional variable
A : (m, n)-matrix
b : m-dimensional vector
c : n-dimensional vector
Minimize
⟨c, x⟩ (253)
subject to
Ax − b ≥ 0, (254)
x ≥ 0. (255)
Let y = (y_1, y_2, · · · , y_m)^T be the vector of Lagrange's undetermined coefficients. The Lagrange function can be written as
J(x, y) = ⟨c, x⟩ − y_1(Ax − b)_1 − · · · − y_m(Ax − b)_m (256)
= ⟨c, x⟩ − ⟨y, Ax − b⟩. (257)
Then, the conditional optimization problem can be solved by
min_{x≥0} max_{y≥0} J(x, y).
Since we have min_x f(x) = −max_x (−f(x)), the saddle point theorem yields
min_{x≥0} max_{y≥0} J(x, y) = max_{y≥0} min_{x≥0} J(x, y) = −min_{y≥0} max_{x≥0} (−J(x, y)). (258)
Since we have
−J(x, y) = −⟨c, x⟩ + ⟨y, Ax⟩ − ⟨y, b⟩ = −⟨b, y⟩ − ⟨x, (−A^T y + c)⟩, (259)
eq.(258) is given by
−min_{y≥0} max_{x≥0} ( −⟨b, y⟩ − x_1(−A^T y + c)_1 − · · · − x_n(−A^T y + c)_n ). (260)
Let us now regard y as the variables and x as the Lagrange coefficients. Then, eq.(260) is interpreted as minimizing
−⟨b, y⟩ (261)
subject to
−A^T y + c ≥ 0, (262)
y ≥ 0. (263)
The signs of the optimal values of the two problems are opposite. We rewrite the latter problem as maximizing
⟨b, y⟩ (264)
subject to
A^T y − c ≤ 0, (265)
y ≥ 0. (266)
This is called the dual problem of the linear programming problem.
Let us consider the linear programming problem obtained by combining eqs.(253)-(255) and (261)-(263).
Minimize
⟨c, x⟩ − ⟨b, y⟩ (267)
subject to
Ax − b ≥ 0, −A^T y + c ≥ 0, x ≥ 0, y ≥ 0. (268)
Because x and y at the solution of the primal problem coincide with those of the dual problem, this problem is minimized if and only if both problems are solved. Since the optimal values of eqs.(253) and (261) have opposite signs, the minimum value of eq.(267) has to be zero. Therefore, the minimization problem can be solved by
⟨c, x⟩ − ⟨b, y⟩ = 0, Ax − b ≥ 0, −A^T y + c ≥ 0, x ≥ 0, y ≥ 0. (269)
The linear programming problem is thus transformed into a system of linear inequalities.
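A tiny numerical illustration of standard LP duality with made-up data: at the optimum of the primal (min ⟨c, x⟩ s.t. Ax − b ≥ 0, x ≥ 0) and of the dual (max ⟨b, y⟩ s.t. Aᵀy ≤ c, y ≥ 0), the two objective values agree.

```python
# LP duality check on a hand-solvable example (all data hypothetical).

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

A = [[2.0, 1.0], [1.0, 3.0]]
b = [4.0, 6.0]
c = [1.0, 1.0]

# Optimal vertices found by hand, intersecting the active constraints:
x = [1.2, 1.6]   # solves 2*x1 + x2 = 4 and x1 + 3*x2 = 6
y = [0.4, 0.2]   # solves 2*y1 + y2 = 1 and y1 + 3*y2 = 1

primal = dot(c, x)   # 2.8
dual = dot(b, y)     # 2.8
```

Both points are feasible, and since weak duality gives ⟨c, x⟩ ≥ ⟨b, y⟩ for any feasible pair, the equality of the two values certifies optimality.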
11.5 Penalty method
11.5.1 External penalty method
Equality condition: subject to g(x) = 0, minimize f(x).
Let
ψ_1(t) = |t|, (270)
and we define
J_1(x; µ) = f(x) + µ ψ_1(g(x)). (271)
Algorithm
1. µ : initial positive value.
2. Minimize J_1(x; µ) with respect to x.
3. Increase µ.
4. Go to 2.
Inequality condition: subject to g(x) ≥ 0, minimize f(x).
Let
ψ_2(t) = 0 (t ≥ 0), −t (t < 0), (272)
and we define
J_2(x; µ) = f(x) + µ ψ_2(g(x)). (273)
11.5.2 Internal penalty method
Inequality condition: subject to g(x) ≥ 0, minimize f(x).
Let
ψ_3(t) = 1/t (274)
for t > 0, and we define
J_3(x; µ) = f(x) + µ ψ_3(g(x)). (275)
Algorithm
1. µ : initial positive value.
2. Minimize J_3(x; µ) with respect to x.
3. Decrease µ > 0.
4. Go to 2.
11.6 Example : Support Vector Machine (SVM)
11.6.1 Optimum hyperplane classifier (OHC)
• SVM, OHC : two-class classifiers.
• A pattern belongs to one of two classes labeled by +1 or −1.
• Ω_{+1}, Ω_{−1} : the sets of all patterns whose label is +1 and −1, respectively.
• The classification function of OHC:
d(z) = ⟨w, z⟩ + θ (276)
with a vector w and a scalar θ.
• Linear classifier: for an input vector z, if d(z) > 0, z is classified to Ω_{+1}; if d(z) < 0, z is classified to Ω_{−1}.
• The boundary for classification is a hyperplane perpendicular to w.
• {z_l, y_l}_{l=1}^{L} : training data; z_l : training pattern; y_l : the label of z_l.
First we assume that {z_l, y_l}_{l=1}^{L} can be classified linearly.
• That is, there exist w and θ such that
d(z_l) ≥ 1 if y_l = 1,
d(z_l) ≤ −1 if y_l = −1
(d(z) = ⟨w, z⟩ + θ).
• It can be rewritten as: for any l,
y_l d(z_l) ≥ 1,
or
y_l (⟨w, z_l⟩ + θ) ≥ 1. (277)
• The margin of d(z) is defined as half of the distance between the hyperplanes d(z) = 1 and d(z) = −1.
• The margin M is given as
M = 1 / ∥w∥.
• Maximizing the margin, the basic optimization problem of OHC is given as: subject to
y_l(⟨w, z_l⟩ + θ) − 1 ≥ 0 (l = 1, 2, . . . , L), (278)
minimize
∥w∥². (279)
11.6.2 Soft margin
• The training patterns may not be linearly separable.
• It may be better to increase the margin even if it provides an incorrect answer for a small number of training patterns.
• We introduce variables ξ_l ≥ 0 (l = 1, 2, . . . , L) to weaken the condition as
y_l(⟨w, z_l⟩ + θ) ≥ 1 − ξ_l. (280)
• If ξ_l = 0, eq.(280) is equivalent to eq.(277).
• If ξ_l > 0, the margin is softened for z_l.
• Especially if ξ_l > 1, the training pattern may be misclassified.
⇒ We have to introduce an optimization criterion that keeps the ξ_l small.
• The optimization problem with respect to w and θ for learning is given as: subject to
y_l(⟨w, z_l⟩ + θ) − 1 + ξ_l ≥ 0, (281)
ξ_l ≥ 0, (282)
minimize
(1/2)∥w∥² + C Σ_{l=1}^{L} ξ_l, (283)
where C is a constant.
• The smaller C is, the softer the margin is, and the more misclassified training patterns are allowed.
11.6.3 Solution of OHC
Lagrange function:
L_P = (1/2)∥w∥² + C Σ_{k=1}^{L} ξ_k − Σ_{k=1}^{L} µ_k ξ_k − Σ_{k=1}^{L} α_k { y_k(⟨w, z_k⟩ + θ) − 1 + ξ_k }, (284)
where α_1, α_2, . . . , α_L, µ_1, µ_2, . . . , µ_L are Lagrange coefficients.
• The saddle point theorem yields
min_{w, θ, ξ} max_{α≥0, µ≥0} L_P = max_{α≥0, µ≥0} min_{w, θ, ξ} L_P,
where α ≥ 0 and µ ≥ 0 mean α_1 ≥ 0, . . . , α_L ≥ 0 and µ_1 ≥ 0, . . . , µ_L ≥ 0, respectively.
• There is no condition for the minimization over w = (w_1, . . . , w_N), θ, and ξ = (ξ_1, . . . , ξ_L).
• The minimization can be done by setting the partial derivatives with respect to w_1, . . . , w_N, θ, ξ_1, . . . , ξ_L to zero.
• Then, we have
∂L_P/∂w_n = w_n − Σ_{k=1}^{L} α_k y_k (z_k)_n = 0, (285)
∂L_P/∂θ = −Σ_{k=1}^{L} α_k y_k = 0, (286)
∂L_P/∂ξ_l = C − µ_l − α_l = 0. (287)
• We substitute them into eq.(284):
L_P = (1/2)⟨Σ_{l=1}^{L} α_l y_l z_l, Σ_{k=1}^{L} α_k y_k z_k⟩ + Σ_{k=1}^{L} ξ_k(C − µ_k − α_k)
− Σ_{k=1}^{L} α_k y_k ⟨Σ_{l=1}^{L} α_l y_l z_l, z_k⟩ − Σ_{k=1}^{L} α_k y_k θ + Σ_{k=1}^{L} α_k
= Σ_{k=1}^{L} α_k − (1/2) Σ_{l=1}^{L} Σ_{k=1}^{L} α_l α_k y_k y_l ⟨z_l, z_k⟩ ≡ L_D. (288)
• We thus obtain the dual problem L_D.
• In order to obtain the SVM, we have to solve
max_{α≥0, µ≥0} L_D
under the conditions eqs.(285), (286), and (287).
• Since the variables in L_D are only the α_l, the restrictions are simplified.
– Eq.(285) (w_n − Σ_{k=1}^{L} α_k y_k (z_k)_n = 0) is not a restriction for maximizing L_D, since w_n is neither a variable of L_D nor used in the other restrictions. It is used to calculate w_n from the α_l at the optimum of L_D. The classification function can be written as
d(z) = Σ_{l=1}^{L} α_l y_l ⟨z_l, z⟩ + θ. (289)
– Eq.(286) (−Σ_{k=1}^{L} α_k y_k = 0) remains as a restriction.
– As for eq.(287) (C − µ_l − α_l = 0), µ_l is not a variable of L_D but is used in the restriction µ_l ≥ 0. Therefore, it induces a restriction on α_l:
α_l = C − µ_l ≤ C.
µ_l at the optimum point can also be calculated from α_l.
Optimization problem to be solved: subject to
0 ≤ α_l ≤ C (l = 1, 2, . . . , L), Σ_{l=1}^{L} α_l y_l = 0, (290)
maximize
L_D = Σ_{l=1}^{L} α_l − (1/2) Σ_{l=1}^{L} Σ_{k=1}^{L} α_l α_k y_l y_k ⟨z_l, z_k⟩. (291)
– The only variables of the optimization problem are the α_l.
– The dimension of this problem is the number of samples.
– w can be calculated by eq.(285):
w = Σ_{k=1}^{L} α_k y_k z_k.
– θ is obtained by using the KKT condition.
• KKT condition of the primal problem. For l = 1, 2, . . . , L,
∂L_P/∂w_n = w_n − Σ_{k=1}^{L} α_k y_k (z_k)_n = 0, (292)
∂L_P/∂θ = −Σ_{k=1}^{L} α_k y_k = 0, (293)
∂L_P/∂ξ_l = C − µ_l − α_l = 0, (294)
y_l(⟨w, z_l⟩ + θ) − 1 + ξ_l ≥ 0, (295)
ξ_l ≥ 0, (296)
α_l ≥ 0, µ_l ≥ 0, (297)
α_l { y_l(⟨w, z_l⟩ + θ) − 1 + ξ_l } = 0, (298)
µ_l ξ_l = 0, (299)
where w_n and (z_k)_n are the n-th components of w and z_k, respectively.
The KKT conditions imply:
• 0 < α_l < C
– Then 0 < µ_l < C yields ξ_l = 0, and we have y_l(⟨w, z_l⟩ + θ) − 1 = 0.
⇒ z_l is on the boundary of the margin.
– Such a point is called a support vector (SV). (Sometimes z_l with α_l = C is also called an SV.)
– θ can be obtained from the SVs:
θ = −⟨w, z_l⟩ + y_l.
• α_l = 0
– µ_l = C yields ξ_l = 0, and we have y_l(⟨w, z_l⟩ + θ) − 1 ≥ 0.
⇒ z_l is included in the correct half space, outside the margin region.
• α_l = C
– µ_l = 0 yields ξ_l ≥ 0, and we have y_l(⟨w, z_l⟩ + θ) − 1 + ξ_l = 0.
⇒ z_l is included in the wrong half space or in the margin region.
11.6.4 Support Vector Machine (SVM)
• SV′ : the set of z_l with α_l > 0.
• Classification function:
d(x) = Σ_{k=1}^{L} α_k y_k ⟨z_k, x⟩ + θ = Σ_{z_k ∈ SV′} α_k y_k ⟨z_k, x⟩ + θ.
• The sum is taken not over all training patterns but only over the SVs.
⇒ Linear support vector machine.
• The computational cost of classification is reduced.
• We do not have to calculate the following parameters for obtaining d(x).
– µ_l can be obtained from eq.(294):
µ_l = C − α_l
(µ_l ≥ 0 is satisfied).
– ξ_l can be obtained from eq.(298):
if α_l ≠ C, ξ_l = 0;
if α_l = C, ξ_l = −y_l(⟨w, z_l⟩ + θ) + 1.
• Remark: we cannot substitute eqs.(298) and (299) into L_P for the optimization problem, since they hold only at a stationary point.
⇒ If we substituted them, the derivatives could not be obtained correctly.
11.6.5 VC dimension
• Recall the VC (Vapnik-Chervonenkis) dimension:
the maximum number of points that the classifiers can shatter for any labeling of them.
• The VC dimension of linear classifiers in n-dimensional space is n + 1.
11.6.6 Kernel method
• After patterns are mapped by a nonlinear function Φ into a very high (or infinite) dimensional space, linear classification is performed.
Example: Φ(x_1, x_2) = (x_1², x_2², √2 x_1 x_2, √2 x_1, √2 x_2, 1).
• However, the cost of calculation in a high or infinite dimensional space by using Φ(x) is huge.
• In many cases, only by the inner product
⟨Φ(x), Φ(z)⟩ = k(x, z)
in the space, we can construct a classifier and classify input patterns without direct calculation in the space.
• For the previous example
Φ((x_1, x_2)) = (x_1², x_2², √2 x_1 x_2, √2 x_1, √2 x_2, 1),
its kernel function is given by
k(x, z) = ⟨Φ(x), Φ(z)⟩ = x_1² z_1² + x_2² z_2² + 2x_1 x_2 z_1 z_2 + 2x_1 z_1 + 2x_2 z_2 + 1
= (x_1 z_1 + x_2 z_2 + 1)² = (⟨x, z⟩ + 1)².
⇒ Simple.
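The identity above is easy to verify numerically: the explicit feature map Φ and the closed-form kernel (⟨x, z⟩ + 1)² give the same value for any pair of inputs.

```python
import math
import random

# Check that <Phi(x), Phi(z)> = (<x, z> + 1)^2 for
# Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1).

def phi(x):
    r2 = math.sqrt(2.0)
    return [x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2 * x[0], r2 * x[1], 1.0]

def k(x, z):
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2

random.seed(0)
for _ in range(100):
    x = [random.uniform(-2, 2), random.uniform(-2, 2)]
    z = [random.uniform(-2, 2), random.uniform(-2, 2)]
    lhs = sum(a * b for a, b in zip(phi(x), phi(z)))
    assert abs(lhs - k(x, z)) < 1e-9
```

Evaluating k costs a 2-dimensional inner product, while the explicit map works in 6 dimensions; this gap is what the kernel method exploits, and it becomes dramatic for higher degrees.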
Definition 6. A Mercer kernel function is defined as a function satisfying the following conditions.
• Symmetric: k(x, z) = k(z, x).
• Positive definite: for any function h(z), we have
∫∫ k(x, z) h(x) h(z) dx dz ≥ 0.
Theorem 17. Let k(x, z) be a Mercer kernel function. Then, there exists a mapping Φ(x) such that
⟨Φ(x), Φ(z)⟩ = k(x, z).
Kernel method
• When we fix a Mercer kernel function, we can calculate inner products between patterns mapped by Φ without calculating Φ.
• Many classifiers do not need Φ(x) itself but only the inner products.
• We can extend such classifiers to nonlinear versions.
Examples of kernel functions:
k(x, y) = ⟨x, y⟩ (linear kernel), (300)
k(x, y) = (⟨x, y⟩ + 1)^p (polynomial kernel), (301)
k(x, y) = exp(−∥x − y∥² / (2σ²)) (Gaussian kernel). (302)
11.6.7 (Nonlinear) SVM
• The classification function and the criterion of the linear SVM are expressed only by inner products.
⇒ The kernel method can be applied.
• Classification function:
d(x) = Σ_{z_l ∈ SV} α_l y_l k(z_l, x) + θ. (303)
• The optimization problem for learning is given as: subject to
0 ≤ α_l ≤ C (l = 1, 2, . . . , L), Σ_{l=1}^{L} α_l y_l = 0, (304)
maximize
L_D = Σ_{l=1}^{L} α_l − (1/2) Σ_{l=1}^{L} Σ_{k=1}^{L} α_l α_k y_l y_k k(z_l, z_k) (305)
with respect to the α_l.