
Learn to Implement Python Feature Selection in 30 Minutes


James CC Huang

Disclaimer
• Implementation only
• No math
• No statistics

Source: Internet

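Warming Up
• I heard that nobody asks questions at this kind of sharing session (putting the speaker on the spot)
• The original session lasted only 40 minutes, but today's sharing was given 2 hours
• A test of my memory and comprehension
• The speaker mentioned a lot of terms but did not cover implementation (there could not have been time)
• I implemented the examples in Python
• I hope that if you are like me, not into theory, math, or statistics, you can go home and copy-paste your way to feature selection with scikit-learn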

Reinventing the Wheel?

Source: P.60 http://www.slideshare.net/tw_dsconf/ss-62245351

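Doing Machine Learning and Deep Learning…
• Do you actually need to understand the underlying math, statistics, and theory?
• Promoting and popularizing Machine Learning / Deep Learning
• Ease of use of the tools and rapid development
• There are opinions on both sides
• For: on working toward the Master Algorithm, "… you would think this requires heavy mathematics and rigorous theoretical work. It does not; what it requires instead is stepping back from the abstruse mathematical theory in order to see the overall pattern of learning phenomena." (The Master Algorithm, Chinese edition 大演算, p. 40)
• Against: Deep Neural Networks - A Developmental Perspective (slides, video)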

My Takeaways from the 2014–2016 Taiwan Data Science "Enthusiasts" Conference

1. My experience eating the boxed lunches three years in a row
2. What I dreamed up after hearing the 2016 talk "Feature Engineering in Machine Learning"

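Three Years of Evolution
• More and more attendees
• [Unscientific eyeball estimate] The average age of attendees keeps rising XD
• More and more content, more and more sessions
• A shift in who the speakers are: more professors and people from research institutions
• The term "Deep Learning" appears far more often
• $$ Increasingly expensive
• Moving toward a user-pays model
• Some paid courses will also continue to be offered
• The boxed lunches have not evolved (always the same few vendors)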

http://datasci.tw/agenda.php

Feature Engineering in Machine Learning Session (Speaker: 李俊良 )

Source: http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning

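Can Feature Engineering Identify a Writing Style?
• Rowling wrote a novel under a pseudonym; sales soared after she was revealed as the author http://www.bbc.com/zhongwen/trad/uk_study/2013/07/130714_rowling_novel
• "One review called the new book, The Cuckoo's Calling, a 'brilliantly talented debut', and another praised this male author for describing women's clothing so skillfully."
• "… the novel, published (3 months ago), has sold 1,500 copies. But Amazon reported that after 12 noon on Sunday the book's sales soared, with growth as high as 500,000%."
• Original slides, p. 14 (Source: http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning)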

Find Word / Doc Similarity with Deep Learning
Using word2vec and Gensim (Python)

Goal (or Problem to Solve)
• Problem: Tech Support (TS) engineers want to "precisely" categorize support cases. The task is currently performed manually by TS engineers.
• Goal: Automatically categorize support cases.
• What I have:
  • 156 classified cases (with "so-called" correct issue categories)
  • Support cases in the database
• Challenges:
  • Based on the data currently available, supervised classification algorithms can't be applied.
  • Clustering may not fully achieve the goal.
  • What about Deep Learning?

Gensim (word2vec implementation in Python)
# Assumes gensim 4.x, where TaggedDocument replaces the older LabeledSentence
from os import listdir
import gensim
TaggedDocument = gensim.models.doc2vec.TaggedDocument

# One tag per document: the file names in the 2016 corpus
docLabels = [f for f in listdir("../corpora/2016/") if f.endswith('.txt')]

# Read each file once so the corpus can be iterated more than once
data = []
for doc in docLabels:
    with open("../corpora/2016/" + doc, 'r') as f:
        data.append(f.read())

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

Gensim (Cont'd)
it = LabeledLineSentence(data, docLabels)

# The learning rate decays linearly from alpha to min_alpha over the 10 epochs
model = gensim.models.Doc2Vec(alpha=0.025, min_alpha=0.005, epochs=10)
model.build_vocab(it)
model.train(it, total_examples=model.corpus_count, epochs=model.epochs)

# find most similar support case (tags are the file names in docLabels)
print(model.dv.most_similar("00111105.txt"))
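For readers wondering what "most similar" means here: gensim's similarity queries rank items by the cosine similarity between their learned vectors. The sketch below reproduces that ranking in plain Python. Every vector value and every case ID other than 00111105 is made up for illustration; none of it is output from the model above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_tag, vectors, topn=2):
    """Rank every other tag by cosine similarity to the query vector."""
    query = vectors[query_tag]
    scores = [
        (tag, cosine_similarity(query, vec))
        for tag, vec in vectors.items()
        if tag != query_tag
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-dimensional document vectors (gensim defaults to 100 dimensions).
doc_vectors = {
    "00111105": [0.9, 0.1, 0.0],
    "case_a": [0.8, 0.2, 0.1],    # points in nearly the same direction
    "case_b": [0.0, 1.0, 0.0],    # mostly unrelated direction
    "case_c": [-0.9, -0.1, 0.0],  # opposite direction
}

ranking = most_similar("00111105", doc_vectors)
for tag, score in ranking:
    print(tag, round(score, 3))
```

Because cosine similarity ignores vector length, two cases can score as similar even if one document is much longer than the other.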

Rumor Has It
• With Deep Learning you don't need to do feature selection, because deep learning will decide for you automatically
• From Wikipedia (https://en.wikipedia.org/wiki/Deep_learning):
  • "One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction."
• Is it really that magical?

Feature selection for Iris Dataset as Example
• Iris dataset attributes
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   -- Iris Setosa
   -- Iris Versicolour
   -- Iris Virginica

Feature Selection - LASSO
>>> from sklearn.linear_model import Lasso
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> clf = Lasso(alpha=0.01)
>>> sfm = SelectFromModel(clf, threshold=0.25)
>>> sfm.fit(X, y)
>>> n_features = sfm.transform(X).shape[1]
>>> print(n_features)
2

Selected features: petal width & petal length
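As a rough picture of what SelectFromModel does with threshold=0.25: it fits the estimator, then keeps the features whose absolute coefficient is at least the threshold. The plain-Python sketch below mimics only that filtering rule. The coefficient values are hypothetical placeholders, not the coefficients of the Lasso fit above.

```python
def select_by_threshold(coefficients, threshold):
    """Keep the features whose absolute coefficient is >= threshold."""
    return [name for name, coef in coefficients.items() if abs(coef) >= threshold]

# Hypothetical coefficients, chosen only to illustrate the rule.
coefs = {
    "sepal length (cm)": 0.0,
    "sepal width (cm)": -0.04,
    "petal length (cm)": 0.30,
    "petal width (cm)": 0.45,
}

selected = select_by_threshold(coefs, threshold=0.25)
print(selected)
```

Raising the threshold keeps fewer features, so the value is worth tuning rather than copying.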

Feature Selection - LASSO (Cont'd)
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X = scaler.fit_transform(X)
>>> names = iris["feature_names"]
>>> lasso = Lasso(alpha=0.01, positive=True)
>>> lasso.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(float(x), 4), lasso.coef_), names), reverse=True))
[(0.472, 'petal width (cm)'), (0.3105, 'petal length (cm)'), (0.0, 'sepal width (cm)'), (0.0, 'sepal length (cm)')]
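StandardScaler rescales each feature to zero mean and unit variance, which puts the Lasso coefficients on a comparable scale. Below is a minimal plain-Python sketch of that transformation for a single column, using the population standard deviation as StandardScaler does. The sample values are made up, and the sketch assumes the column is not constant.

```python
import statistics

def standardize(column):
    """Return the z-scores of a column: (value - mean) / population std."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population std (divides by n, not n - 1)
    return [(value - mean) / std for value in column]

# Hypothetical petal widths in cm, for illustration only.
petal_widths = [0.2, 0.4, 1.3, 1.5, 2.5]
scaled = standardize(petal_widths)
print([round(value, 4) for value in scaled])
```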

Feature Selection – Random Forest
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestRegressor
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> names = iris["feature_names"]
>>> rf = RandomForestRegressor()
>>> rf.fit(X, y)
>>> # exact importances vary from run to run unless random_state is set
>>> print(sorted(zip(map(lambda x: round(float(x), 4), rf.feature_importances_), names), reverse=True))
[(0.5073, 'petal width (cm)'), (0.4787, 'petal length (cm)'), (0.0091, 'sepal width (cm)'), (0.0049, 'sepal length (cm)')]
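One possible way to turn these importances into an actual selection is to keep the top-ranked features until their cumulative importance passes a cutoff. Both this heuristic and the 0.95 cutoff are assumptions for illustration, not something the talk prescribes. The importance values are the ones printed above.

```python
def select_by_cumulative_importance(importances, cutoff):
    """Keep top-ranked features until their summed importance reaches cutoff."""
    ranked = sorted(importances.items(), key=lambda pair: pair[1], reverse=True)
    selected, total = [], 0.0
    for name, importance in ranked:
        selected.append(name)
        total += importance
        if total >= cutoff:
            break
    return selected

# Importances as printed on the slide (they sum to 1).
importances = {
    "sepal length (cm)": 0.0049,
    "sepal width (cm)": 0.0091,
    "petal length (cm)": 0.4787,
    "petal width (cm)": 0.5073,
}

selected = select_by_cumulative_importance(importances, cutoff=0.95)
print(selected)
```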

Dimension Reduction - PCA
>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA as pca
>>> from sklearn.preprocessing import StandardScaler
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X = StandardScaler().fit_transform(X)
>>> sklearn_pca = pca(n_components=2)
>>> sklearn_pca.fit_transform(X)
>>> print(sklearn_pca.components_)
[[ 0.52237162 -0.26335492  0.58125401  0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]]

petal width & petal length
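Each row of components_ holds the loadings of one principal component on the four standardized features, in the order sepal length, sepal width, petal length, petal width. The plain-Python sketch below ranks the features by absolute loading on the first component, using the values printed above. Reading the largest loadings as "most influential" is an informal interpretation, not a formal feature-selection rule.

```python
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# First row of components_ as printed on the slide.
pc1 = [0.52237162, -0.26335492, 0.58125401, 0.56561105]

# Rank features by the magnitude of their loading; the sign only gives direction.
ranked = sorted(zip(feature_names, pc1), key=lambda pair: abs(pair[1]), reverse=True)
for name, loading in ranked:
    print(f"{name}: {loading:+.4f}")

# Principal components are unit vectors, so the squared loadings sum to about 1.
squared_norm = sum(value * value for value in pc1)
print(round(squared_norm, 4))
```

Note that sepal length loads almost as heavily as the two petal features, so the first component is a blend of features rather than a pick of two.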

There are many others…
This talk simply implements the methods that the original speaker mentioned.

I have done the easy ones; the hard ones are left for you to explore ~

Reference

Gensim
• https://radimrehurek.com/gensim/index.html

HoG (Histogram of Oriented Gradients)
• Python code example: http://scikit-image.org/docs/dev/auto_examples/plot_hog.html

An Introduction to Variable and Feature Selection
• Authors: Isabelle Guyon and Andre Elisseeff
• PDF download: http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf