L4: Решающие деревья
Introduction to Data Science
Lecture 3. Rule-Based Models
Nikolay Anokhin, Mikhail Firulik
March 16, 2014
Lecture plan
Decision trees
Problem
Given:
a training sample of profiles of several tens of thousands of people
- sex (binary)
- age (numeric)
- education (nominal)
- and 137 more features
- presence of an interest in cosmetics
Task:
For an advertising campaign, determine the characteristics of people who are interested in cosmetics
Obama or Clinton?
A good day for a round of golf
Decision regions
Recursive algorithm
decision_tree(X_N):
    if X_N satisfies the leaf criterion:
        create the current node N as a leaf
        choose a suitable class C_N
    else:
        create the current node N as an internal node
        split X_N into subsamples
        for each subsample X_j:
            n = decision_tree(X_j)
            add n to N as a child
    return N
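The recursion above can be sketched in runnable form. This is a minimal illustration, not the lecture's implementation: it assumes numeric features, binary splits at midpoints, Gini impurity, and a leaf criterion of "no split lowers impurity, or the node has at most `min_size` objects". The names `best_split`, `predict` and `min_size` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate cut point: midpoint of neighbours
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def decision_tree(X, y, min_size=1):
    """Build a tree as nested dicts, following the recursive scheme."""
    split = None if len(y) <= min_size else best_split(X, y)
    if split is None:  # leaf criterion (an assumption of this sketch)
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    f, t = split
    node = {"leaf": False, "feature": f, "threshold": t, "children": []}
    for keep in (lambda v: v <= t, lambda v: v > t):
        idx = [i for i, row in enumerate(X) if keep(row[f])]
        child = decision_tree([X[i] for i in idx], [y[i] for i in idx], min_size)
        node["children"].append(child)
    return node

def predict(node, row):
    """Walk from the root to a leaf and return its class."""
    while not node["leaf"]:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["children"][0 if go_left else 1]
    return node["label"]
```

For example, `decision_tree([[1], [2], [3], [4]], ["a", "a", "b", "b"])` produces a root that splits feature 0 at 2.5 with two pure leaves.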
CART
Classification And Regression Trees
1. How is a split made?
2. Into how many children should each node be split?
3. Which leaf criterion should be chosen?
4. How can a tree that is too large be shortened?
5. How is the class of each leaf chosen?
6. What should be done if some values are missing?
Node purity
Problem
Choose a method that splits a node into two or more children in the best possible way
The key concept is the impurity of a node.
1. Misclassification
   $i(N) = 1 - \max_k p(x \in C_k)$
2. Gini
   $i(N) = 1 - \sum_k p^2(x \in C_k) = \sum_{i \neq j} p(x \in C_i)\, p(x \in C_j)$
3. Information entropy
   $i(N) = -\sum_k p(x \in C_k) \log_2 p(x \in C_k)$
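The three impurity measures can be written as short functions of the class probabilities in a node. This is a sketch for illustration; the function names and the convention that a zero probability contributes nothing to the entropy are assumptions of the example.

```python
import math

def misclassification(p):
    """1 - max_k p_k."""
    return 1.0 - max(p)

def gini(p):
    """1 - sum_k p_k^2."""
    return 1.0 - sum(q * q for q in p)

def entropy(p):
    """-sum_k p_k log2 p_k, skipping zero probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)

for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, misclassification(probs), gini(probs), entropy(probs))
```

All three are largest for the uniform distribution and equal to zero for a pure node.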
Information theory
Amount of information ~ "degree of surprise"
$h(x) = -\log_2 p(x)$
Information entropy: $H[x] = E[h(x)]$
$H[x] = -\sum p(x) \log_2 p(x)$ or $H[x] = -\int p(x) \log_2 p(x)\,dx$
Exercise
Given a random variable x that takes 4 values with equal probabilities 1/4, and a random variable y that takes 4 values with probabilities {1/2, 1/4, 1/8, 1/8}. Compute H[x] and H[y].
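One way to work the exercise is to substitute the probabilities directly into the discrete entropy formula, using $-\log_2 \tfrac{1}{2^j} = j$:

```latex
\begin{align*}
H[x] &= -\sum_{i=1}^{4} \tfrac{1}{4} \log_2 \tfrac{1}{4}
      = 4 \cdot \tfrac{1}{4} \cdot 2 = 2 \text{ bits} \\
H[y] &= \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3
      = 1.75 \text{ bits}
\end{align*}
```

The uniform variable has the larger entropy, which matches the "degree of surprise" reading: its outcomes are harder to guess.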
Choosing the best split
Criterion
Choose the feature and the cut point that maximize the decrease in impurity
$\Delta i(N, N_L, N_R) = i(N) - \frac{N_L}{N} i(N_L) - \frac{N_R}{N} i(N_R)$
Remarks
- Choosing the boundary for numeric features: the midpoint?
- Decisions are made locally: there is no guarantee of a globally optimal solution
- In practice, the choice of impurity does not strongly affect the result
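The criterion and the remarks can be illustrated on a single numeric feature. The sketch below takes midpoints between neighbouring values as candidate cut points (one possible answer to the first remark, assumed here) and accepts the impurity function as a parameter, so two measures can be compared. The ages and labels are made-up illustrative data.

```python
import math
from collections import Counter

def class_probs(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))

def impurity_decrease(impurity, parent, left, right):
    """Delta i(N, N_L, N_R) for a binary split."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

def best_threshold(values, labels, impurity):
    """Return (largest impurity decrease, threshold) over midpoint cuts."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (0.0, None)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = impurity_decrease(impurity, labels, left, right)
        if gain > best[0]:
            best = (gain, t)
    return best

ages = [18, 22, 25, 31, 40, 52]      # hypothetical data
interested = [1, 1, 1, 0, 0, 0]      # hypothetical labels
print(best_threshold(ages, interested, gini))
print(best_threshold(ages, interested, entropy))
```

On this toy sample both measures pick the same cut point, 28.0, and differ only in the scale of the reported decrease.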
If the split is not binary
The natural choice when splitting into B children
$\Delta i(N, N_1, \ldots, N_B) = i(N) - \sum_{k=1}^{B} \frac{N_k}{N} i(N_k) \to \max$
This criterion favors large B. Modification:
$\Delta i_B(N, N_1, \ldots, N_B) = \frac{\Delta i(N, N_1, \ldots, N_B)}{-\sum_{k=1}^{B} \frac{N_k}{N} \log_2 \frac{N_k}{N}} \to \max$
(gain ratio impurity)
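A small numeric sketch shows why the normalization helps. It uses entropy as the impurity (an assumption of the example; the formula itself does not fix the measure) and expects at least two non-empty children so that the denominator is positive. The function name `gain_and_ratio` is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(labels, groups):
    """groups: one list of labels per child. Returns (gain, gain ratio)."""
    n = len(labels)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups)
    return gain, gain / split_info

labels = [1, 1, 0, 0]
print(gain_and_ratio(labels, [[1, 1], [0, 0]]))      # two children
print(gain_and_ratio(labels, [[1], [1], [0], [0]]))  # four children
```

Both splits achieve the same raw impurity decrease, but the four-way split has twice the denominator, so its gain ratio is half as large.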
Using several features
Practice
Problem
Compute the best binary split of the root node on a single feature using Gini impurity.
No.  Sex  Education  Job  Cosmetics
1    M    Higher     Yes  No
2    M    Secondary  No   No
3    M    None       Yes  No
4    M    Higher     No   Yes
1    F    None       No   Yes
2    F    Higher     Yes  Yes
3    F    Secondary  Yes  No
4    F    Secondary  No   Yes
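One way to check a hand calculation for this exercise is to enumerate every "value versus the rest" split and compute its Gini decrease. The rows below are transcribed from the table, with translated labels; the helper names are hypothetical, and for the three-valued education feature the one-versus-rest splits cover all binary partitions.

```python
from collections import Counter

ROWS = [  # (sex, education, job, cosmetics)
    ("M", "Higher", "Yes", "No"),
    ("M", "Secondary", "No", "No"),
    ("M", "None", "Yes", "No"),
    ("M", "Higher", "No", "Yes"),
    ("F", "None", "No", "Yes"),
    ("F", "Higher", "Yes", "Yes"),
    ("F", "Secondary", "Yes", "No"),
    ("F", "Secondary", "No", "Yes"),
]
COLUMNS = ["Sex", "Education", "Job"]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(column, value):
    """Gini decrease of the split 'column == value' versus the rest."""
    parent = [r[3] for r in ROWS]
    inside = [r[3] for r in ROWS if r[column] == value]
    outside = [r[3] for r in ROWS if r[column] != value]
    n = len(parent)
    return (gini(parent)
            - len(inside) / n * gini(inside)
            - len(outside) / n * gini(outside))

gains = {(COLUMNS[c], v): gini_gain(c, v)
         for c in range(len(COLUMNS))
         for v in sorted({r[c] for r in ROWS})}
for split, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(split, round(gain, 4))
```

On this table the splits on sex and on job tie for the largest decrease, so either can serve as the root split.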
When to stop splitting
Split stopping criteria
- never
- use a validation sample
- set a minimum node size
- set a threshold $\Delta i(N) > \beta$
- statistical approach
$\chi^2 = \sum_{k=1}^{2} \frac{\left(n_{kL} - \frac{N_L}{N} n_k\right)^2}{\frac{N_L}{N} n_k}$
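The statistic compares the split with a random one. The sketch below reads the symbols as follows, which is an interpretation since the slide does not define them: n_k is the number of class-k objects in the node, n_kL is how many of them go to the left child, and N_L/N is the fraction of all objects sent left. The function name is hypothetical.

```python
def chi_squared(n_left_by_class, n_by_class):
    """Chi-squared deviation of a split from a random split of the node."""
    frac_left = sum(n_left_by_class) / sum(n_by_class)  # N_L / N
    return sum(
        (n_kl - frac_left * n_k) ** 2 / (frac_left * n_k)
        for n_kl, n_k in zip(n_left_by_class, n_by_class)
    )

# A node with 50 objects of each class, half of which go left.
print(chi_squared([25, 25], [50, 50]))  # split ignores the class
print(chi_squared([40, 10], [50, 50]))  # split separates the classes
```

A value near zero means the split is indistinguishable from a random one, which is a reason to stop. Comparing the statistic with a chi-squared critical value (for example 3.84 at the 5% level with one degree of freedom) is one common way to set the cutoff; the slide does not prescribe a particular level.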
Shortening the tree
Pruning (a.k.a. cutting off branches)
1. Grow the "full" tree $T_0$
2. At each step, replace the "weakest" internal node with a leaf
$R_\alpha(T_k) = \mathrm{err}(T_k) + \alpha\,\mathrm{size}(T_k)$
3. For a given $\alpha$, from the resulting sequence
$T_0 \succ T_1 \succ \ldots \succ T_r$
choose the tree $T_k$ that minimizes $R_\alpha(T_k)$
The value of $\alpha$ is chosen using a test sample or CV
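Step 3 is a one-line minimization once the error and size of each tree in the sequence are known. The sketch below uses a made-up sequence of four trees to show how the selected tree shrinks as the penalty grows; the numbers and the name `select_tree` are hypothetical.

```python
def select_tree(candidates, alpha):
    """candidates: list of (name, err, size). Return the minimizer of R_alpha."""
    return min(candidates, key=lambda c: c[1] + alpha * c[2])

# Hypothetical pruning sequence: error grows as the tree shrinks.
sequence = [
    ("T0", 0.02, 15),
    ("T1", 0.05, 7),
    ("T2", 0.10, 3),
    ("T3", 0.30, 1),
]

for alpha in (0.0, 0.01, 0.05, 0.2):
    print(alpha, select_tree(sequence, alpha)[0])
```

With no penalty the full tree wins, and each increase of the penalty moves the choice toward a smaller tree. In practice the error term would be measured on held-out data or by cross-validation, as the slide notes.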
Which class to assign to the leaves
1. Simplest case: the class with the largest number of objects
2. Discriminative case: the probability $p(C_k \mid x)$
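Both options can be read off the training labels that reach a leaf. In this sketch the class probabilities are estimated by relative frequencies, which is an assumption of the example; the function name is hypothetical.

```python
from collections import Counter

def leaf_prediction(labels):
    """Return (majority class, estimated class probabilities) for a leaf."""
    counts = Counter(labels)
    n = len(labels)
    majority = counts.most_common(1)[0][0]
    probabilities = {label: count / n for label, count in counts.items()}
    return majority, probabilities

print(leaf_prediction(["yes", "yes", "no", "yes"]))
```

The probability form keeps more information: two leaves with the same majority class can still differ in how confident they are.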
Computational complexity
The sample consists of n objects described by m features
Assumptions
1. Nodes are split roughly in half
2. The tree has log n levels
3. The features are binary
Training. For a node with k training objects:
Computing impurity for one feature: O(k)
Choosing the splitting feature: O(mk)
Total: $O(mn) + 2\,O\left(m \frac{n}{2}\right) + 4\,O\left(m \frac{n}{4}\right) + \ldots = O(mn \log n)$
Prediction. O(log n)
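The training total can be read level by level. Under the stated assumptions, level $\ell$ of the tree has $2^{\ell}$ nodes with about $n / 2^{\ell}$ objects each, so every level costs the same:

```latex
\sum_{\ell=0}^{\log_2 n - 1} 2^{\ell} \cdot O\!\left(m \frac{n}{2^{\ell}}\right)
= \sum_{\ell=0}^{\log_2 n - 1} O(mn)
= O(mn \log n)
```

Prediction only follows one root-to-leaf path, hence $O(\log n)$ for a balanced tree.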
Missing values
- Remove the objects from the sample
- Treat absence as a separate category
- Compute impurity while skipping missing values
- Surrogate splits: split on a second feature so that the result is as similar as possible to the primary split
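Surrogate split
$c_1$: $x_1 = (0, 7, 8)^T$, $x_2 = (1, 8, 9)^T$, $x_3 = (2, 9, 0)^T$, $x_4 = (4, 1, 1)^T$, $x_5 = (5, 2, 2)^T$
$c_2$: $y_1 = (3, 3, 3)^T$, $y_2 = (6, 0, 4)^T$, $y_3 = (7, 4, 5)^T$, $y_4 = (8, 5, 6)^T$, $y_5 = (9, 6, 7)^T$
Exercise
Compute the second surrogate split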
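A surrogate can be found by brute force: for every other feature, try each midpoint threshold in both directions and count how many objects go to the same side as under the primary split. The sketch below makes two assumptions that the slide does not state: the primary split is "first feature <= 5.5" (the cut that maximizes the Gini decrease on these ten points), and similarity is measured by the plain count of agreeing objects.

```python
POINTS = [  # three features per object, transcribed from the slide
    (0, 7, 8), (1, 8, 9), (2, 9, 0), (4, 1, 1), (5, 2, 2),  # class c1
    (3, 3, 3), (6, 0, 4), (7, 4, 5), (8, 5, 6), (9, 6, 7),  # class c2
]

# Assumption: the primary split sends "feature 0 <= 5.5" to the left child.
PRIMARY_FEATURE, PRIMARY_THRESHOLD = 0, 5.5

def best_surrogate(feature):
    """Return (agreement, threshold, left_is_low) for the best split on `feature`."""
    goes_left = [p[PRIMARY_FEATURE] <= PRIMARY_THRESHOLD for p in POINTS]
    values = sorted({p[feature] for p in POINTS})
    best = (0, None, None)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        # Objects routed the same way when low values go left.
        same = sum((p[feature] <= t) == g for p, g in zip(POINTS, goes_left))
        # Flipping the direction turns every disagreement into an agreement.
        for agreement, left_is_low in ((same, True), (len(POINTS) - same, False)):
            if agreement > best[0]:
                best = (agreement, t, left_is_low)
    return best

for feature in (1, 2):
    print(feature, best_surrogate(feature))
```

Under these assumptions the third feature (index 2) is the closest match, agreeing with the primary split on 8 of 10 objects at threshold 3.5, and the second feature (index 1) comes next with 7 of 10.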
The cosmetics problem
X[0] <= 26.5000, entropy = 0.999935785529, samples = 37096
    X[2] <= 0.5000, entropy = 0.987223228214, samples = 10550
        entropy = 0.9816, samples = 8277, value = [3479, 4798]
        entropy = 0.9990, samples = 2273, value = [1095, 1178]
    X[6] <= 0.5000, entropy = 0.998866839115, samples = 26546
        entropy = 0.9951, samples = 16099, value = [8714, 7385]
        entropy = 0.9995, samples = 10447, value = [5085, 5362]
X0: age, X4: incomplete higher education, X6: sex
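The numbers in the tree can be cross-checked from the leaf class counts alone. The sketch below recomputes the entropies and the impurity decrease of the root split; grouping the first two leaves under the left child and the last two under the right child is inferred from the sample counts (8277 + 2273 = 10550 and 16099 + 10447 = 26546).

```python
import math

def entropy(counts):
    """Entropy in bits of a list of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

leaves = [[3479, 4798], [1095, 1178], [8714, 7385], [5085, 5362]]
for value in leaves:
    print(sum(value), round(entropy(value), 4))

left = [a + b for a, b in zip(leaves[0], leaves[1])]
right = [a + b for a, b in zip(leaves[2], leaves[3])]
root = [a + b for a, b in zip(left, right)]

n = sum(root)
gain = (entropy(root)
        - sum(left) / n * entropy(left)
        - sum(right) / n * entropy(right))
print(n, entropy(root), gain)
```

The root split lowers the entropy by well under 0.01 bits, which suggests that a depth-two tree on these features separates the two classes only weakly.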
Regression problems
Impurity of node N
$i(N) = \sum_{y \in N} (y - \bar{y})^2$
Assigning a prediction to the leaves
- Mean value
- Linear model
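For regression the split search looks the same as for classification, with the sum of squared deviations in place of the class impurity. This sketch minimizes the sum of the two children's impurities, a common choice that is assumed here rather than taken from the slide; the data and function names are hypothetical.

```python
def sse(ys):
    """Sum of squared deviations from the mean (node impurity)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_regression_split(xs, ys):
    """Return (total child impurity, threshold) for one numeric feature."""
    pairs = sorted(zip(xs, ys))
    best = (sse(ys), None)  # baseline: no split
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal feature values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        total = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if total < best[0]:
            best = (total, t)
    return best

xs = [1, 2, 3, 10, 11, 12]            # hypothetical feature values
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]   # hypothetical targets
print(best_regression_split(xs, ys))
```

With the mean as the leaf prediction, each child here predicts its own constant exactly, so the remaining impurity is zero.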
Beyond CART
ID3: Iterative Dichotomiser 3
- Nominal features only
- The number of children of a node = the number of values of the splitting feature
- The tree grows to its maximum height
C4.5: an improvement on ID3
- Numeric features as in CART, nominal features as in ID3
- When a value is missing, all children are used
- Shortens the tree by removing unnecessary predicates from the rules
C5.0: an improvement on C4.5
- Proprietary
Decision trees. Summary
+ Easy to interpret. Visualization (nya!)
+ Any input data
+ Multiclass out of the box
+ Prediction in O(log n)
+ Amenable to statistical analysis
- Prone to overfitting
- Greedy and unstable
- Perform poorly with class imbalance
Key figures
- Claude Elwood Shannon (information theory)
- Leo Breiman (CART, RF)
- John Ross Quinlan (ID3, C4.5, C5.0)
Other rule-based models
- Market Basket, Association Rules, A-Priori
- Logical inference, FOL
Conclusion
"Binary Trees give an interesting and often illuminating way of looking at
the data in classification or regression problems. They should not be used
to the exclusion of other methods. We do not claim that they are always
better. They do add a flexible nonparametric tool to the data analyst's
arsenal."
Breiman, Friedman, Olshen, Stone
Problem
Predict the household income category from user profiles using a decision tree (the sklearn implementation).
Quality metric:
$\mu = \frac{\mathrm{accuracy}}{\max_k P(C_k)}$
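The metric divides accuracy by the share of the most frequent class, so a model that always predicts that class scores 1. In the sketch below, $P(C_k)$ is estimated from the true labels of the evaluation set, which is an assumption; the function name and the example labels are hypothetical.

```python
from collections import Counter

def normalized_accuracy(y_true, y_pred):
    """Accuracy divided by the share of the most frequent true class."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    majority_share = max(Counter(y_true).values()) / n
    return accuracy / majority_share

y_true = ["low", "low", "low", "high"]
print(normalized_accuracy(y_true, ["low"] * 4))  # majority-class baseline
print(normalized_accuracy(y_true, y_true))       # perfect predictions
```

Values above 1 mean the model beats the majority-class baseline, and the largest attainable value depends on how imbalanced the classes are.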
Reward: any ready-made DT implementation may be used in the homework
Homework 2
Decision trees
Implement
- the CART algorithm for a regression problem
- the CART algorithm for a classification problem
Support: different impurity measures, split stopping, pruning (+)
Key dates
- By 2014/03/22 00:00: choose a problem and a person responsible within the group
- By 2014/03/29 00:00: submit the solution
Thank you!
Feedback