Using Browsing Behavior Log to Predict User’s Gender
Rick, Kent, Koi
Overview
● Huge data burns money
  o 28 million PV / day
  o 7.7 million UV / day
  o 4.4 billion articles in total
  o 4.7 million registered users in total
● Only 2% log in. Who are the other 98%?
Problem Definition
• Use the history data of only the 2% of logged-in users to predict the gender of the other 98% of users
[Diagram: Training Data → Train Model → Model; Unknown Cookie → Use Model to Predict → Gender Result]
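Training Flow
[Diagram: Raw Log → Selection → Target Data → Preprocessing → Transformation → Transformed Data → Data Mining → Pattern]
• Training data: browsing records of logged-in users from the last three months, limited to users who viewed at least two different articles
• Feature Extraction
• Feature Selection
• The Naïve Bayes algorithm is used to generate the prediction model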
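The selection rule for training users (at least two different articles viewed) can be sketched as below. This is a minimal illustration, not the authors' pipeline: the function name `select_training_users` and the `(user_id, article_id)` log layout are assumptions, and the three-month window and logged-in filter are assumed to have been applied before this step.

```python
from collections import defaultdict

def select_training_users(click_log, min_distinct_articles=2):
    """Return user ids that viewed at least `min_distinct_articles` different articles.

    `click_log` is an iterable of (user_id, article_id) pairs; this layout is
    an assumption for illustration, not the deck's actual log schema.
    """
    articles_by_user = defaultdict(set)
    for user_id, article_id in click_log:
        articles_by_user[user_id].add(article_id)
    return {
        user_id
        for user_id, articles in articles_by_user.items()
        if len(articles) >= min_distinct_articles
    }

# Made-up sample log: u1 viewed two different articles, u2 viewed one article twice.
sample_log = [("u1", "a1"), ("u1", "a2"), ("u2", "a1"), ("u2", "a1"), ("u3", "a3")]
selected = select_training_users(sample_log)
```

Counting distinct articles with a set means repeated clicks on the same article do not qualify a user.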
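Prediction Flow
[Diagram: Raw Log → Selection → Predict Data → Preprocessing → Transformation → Transformed Data → Naive Bayes → Pattern]
• Prediction data: browsing records of non-logged-in users from the last three months, about 98% of the site's data
• The Naïve Bayes algorithm is used to predict gender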
Naive Bayes Formula
Put bluntly, it mostly comes down to seeing which one appears more often!
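The formula shown on this slide did not survive extraction. For reference, the standard naive Bayes decision rule with a Bernoulli likelihood (the variant the deck links to in scikit-learn) can be written as follows. The symbols $y$ (gender class), $x_i$ (binary feature $i$), and $p_{iy}$ are notation introduced here, not taken from the slides.

```latex
\hat{y} = \arg\max_{y \in \{\text{male},\,\text{female}\}} P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = p_{iy}^{\,x_i} \, (1 - p_{iy})^{1 - x_i}
```

Here $p_{iy}$ is estimated as the (smoothed) fraction of training users of class $y$ for whom feature $i$ is present, which is why the rule amounts to counting how often each feature appears per gender.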
Naive Bayes in Python Scikit-learn
http://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes
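A minimal sketch of the linked `BernoulliNB` classifier on binary features follows. The toy matrix, the three feature columns, and the reading of label 1 as male are all assumptions for illustration; the deck only says gender is encoded as 1 or 2.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary feature matrix: rows are users, columns are hypothetical features
# (for example a finance category, a beauty category, a fashion category).
X_train = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])
# Labels use the deck's 1/2 encoding; treating 1 as male is an assumption.
y_train = np.array([1, 1, 1, 2, 2, 2])

model = BernoulliNB(alpha=1.0)  # alpha=1.0 is Laplace (add-one) smoothing
model.fit(X_train, y_train)

X_new = np.array([[1, 0, 0], [0, 1, 1]])
predicted = model.predict(X_new)            # one label per row
probabilities = model.predict_proba(X_new)  # one row of class probabilities per user
```

`predict_proba` is useful when only confident predictions should be acted on, for example by applying a probability threshold before assigning a gender to a cookie.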
Raw Data (Matrix? )
Training Data Set Overview
Item | Description | Comment
Date | 20150223 ~ 20150424 |
Total Click Counts | 10908692 |
Login User | Male: 149403, Female: 229448 |
Feature | Before: 2543240, After: 508648 | use chi-square as feature selection
Feature Extraction
• Category Feature -> Binary Feature
• Example

Before:
Feature Name | Feature Value
Article Type | A, B, C, D, E

After:
Feature Name | Feature Value
Article Type - A | 0, 1
Article Type - B | 0, 1
Article Type - C | 0, 1
Article Type - D | 0, 1
Article Type - E | 0, 1
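The category-to-binary expansion above can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical; the output keys follow the slide's "Article Type - A" naming.

```python
def one_hot(value, categories, prefix="Article Type"):
    """Expand one categorical value into a dict of 0/1 indicator features."""
    return {f"{prefix} - {category}": int(value == category) for category in categories}

categories = ["A", "B", "C", "D", "E"]
encoded = one_hot("C", categories)
```

In scikit-learn, `DictVectorizer` or `OneHotEncoder` perform the same expansion and return a sparse matrix, which matters when the number of categories is large.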
Features List
Feature Name | Description | Example
gender | the gender of the logged-in user | 1 or 2
cat | the article's category | 旅遊 (Travel)
url | the blog URL | http://kittyfish.pixnet.net/blog/post/345566174
ariticle_author | the blog's author | kittyfish
article_id | the blog's unique id | 345566174
hours | the hour of the click event | 6
refers | | http://www.google.com/
country | the country predicted from the IP address | tw
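In the example row, the author and article id both appear inside the blog URL, so they could be derived from it. The sketch below assumes the pattern `http://<author>.pixnet.net/blog/post/<article_id>`, inferred from that single example; the function name `parse_blog_url` is hypothetical and other URL shapes are not handled.

```python
from urllib.parse import urlparse

def parse_blog_url(url):
    """Split a blog post URL into (author, article_id).

    Assumes http://<author>.pixnet.net/blog/post/<article_id>, a pattern
    inferred from the one example on the slide.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    author = host.split(".")[0]  # first label of the host name
    path_parts = [part for part in parsed.path.split("/") if part]
    article_id = path_parts[-1] if path_parts else ""
    return author, article_id

author, article_id = parse_blog_url("http://kittyfish.pixnet.net/blog/post/345566174")
```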
But... Too Many Features (burning money again)
• T = 2,450,000 x 2,543,240
• Many features are irrelevant for prediction
• 2,543,240 features
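To put the size of T in perspective, the arithmetic below reads it as a matrix of 2,450,000 rows by 2,543,240 feature columns. What the rows represent is not stated on the slide, and the 1 byte per cell figure is an assumption chosen only to make the scale concrete.

```python
# Dimensions taken from the slide; the row/column reading is an assumption.
n_rows = 2_450_000
n_features = 2_543_240

n_cells = n_rows * n_features

# Assumption for illustration: a dense matrix storing 1 byte per cell.
dense_terabytes = n_cells / 1e12
```

That is about 6.2 trillion cells. Because one-hot features are mostly zero, a sparse format such as `scipy.sparse.csr_matrix`, which stores only non-zero entries, is the usual way to hold a matrix like this.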
Feature Selection - Chi Square
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
http://www.slideshare.net/parth241989/chi-square-test-16093013

Chi Square Value | Dependence with Result
Large | High
Small | Low

• 2,543,240 features -> 508,648 features
• Precision 74% -> 81%
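A minimal sketch of chi-square feature selection with scikit-learn follows. The toy data and `k=2` are assumptions for illustration. The deck's reduction keeps exactly 20% of the features (508,648 of 2,543,240), but how the authors chose that cutoff is not stated.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy binary features: columns 0 and 1 separate the two classes perfectly,
# column 2 is only weakly related to the label.
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
])
y = np.array([1, 1, 1, 2, 2, 2])

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)

scores = selector.scores_                    # chi-square statistic per feature
kept = selector.get_support(indices=True)    # column indices that were kept
```

Features with a large chi-square value depend strongly on the label and are kept; the weakly related column is dropped. `SelectPercentile` is the scikit-learn equivalent when the cutoff is expressed as a percentage rather than a count.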
Which Features Are Important?
feature_name | male_prob | female_prob | male_count | female_count | total | prob_distance
cat_財經企管 (Finance and Business) | 0.137798 | 0.045564 | 20587 | 10454 | 31041 | 0.184468
cat_美容彩妝 (Beauty and Makeup) | 0.062211 | 0.137009 | 9294 | 31436 | 40730 | 0.149596
cat_時尚流行 (Fashion and Trends) | 0.079325 | 0.151936 | 11851 | 34861 | 46712 | 0.145221
cat_親子育兒 (Parenting) | 0.079640 | 0.133178 | 11898 | 30557 | 42455 | 0.107076
cat_心情日記 (Mood Diary) | 0.180942 | 0.231797 | 27033 | 53185 | 80218 | 0.101709
cat_國外旅遊 (Overseas Travel) | 0.152288 | 0.194490 | 22752 | 44625 | 67377 | 0.084403
author_XXXXX | 0.049975 | 0.009037 | 7466 | 2073 | 9539 | 0.081877
cat_食譜分享 (Recipe Sharing) | 0.054607 | 0.093596 | 8158 | 21475 | 29633 | 0.077978
cat_圖文創作 (Illustration and Writing) | 0.085483 | 0.122831 | 12771 | 28183 | 40954 | 0.074696
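The slide does not define `prob_distance`. In the rows shown, it matches twice the absolute difference between `male_prob` and `female_prob`, to within rounding. The sketch below checks that reading on the first three rows; the formula is inferred from the table, not stated by the authors.

```python
# (feature_name, male_prob, female_prob, prob_distance) copied from the table.
rows = [
    ("cat_財經企管", 0.137798, 0.045564, 0.184468),
    ("cat_美容彩妝", 0.062211, 0.137009, 0.149596),
    ("cat_時尚流行", 0.079325, 0.151936, 0.145221),
]

def prob_distance(male_prob, female_prob):
    # Inferred formula, not stated on the slide.
    return 2 * abs(male_prob - female_prob)

# Largest gap between the inferred formula and the value printed in the table.
max_error = max(
    abs(prob_distance(male, female) - listed) for _, male, female, listed in rows
)
```

Under this reading, a feature ranks high when one gender clicks it at a much higher rate than the other, regardless of which gender that is.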
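Which Features Are Important?
• Article category alone gives a first indication of gender tendency
• Certain specific authors and articles are particularly useful for identifying male users
• Male click distributions show stronger specific tendencies than female ones. This agrees with the later online experiment using GA, where prediction precision for males was higher than for females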
Feature Distribution
A small number of features are highly influential, but the long tail of the remaining features still matters: it helps gain the last few percentage points.
Prediction Set Data Analysis
Feature | Intersection/Training | Intersection/Prediction
hour | 100.00% | 91.67%
author | 94.37% | 7.79%
country | 100.00% | 2.46%
category | 100.00% | ???
article | 84.53% | 2.64%
referer | 94.50% | 8.76%
Real-World Results
Live Experiment on PIXNET Falcon (Advertisement) System
Validation by Google Analytics
● Is GA the ground truth ("God")?
● How to use it?
[Diagram labels: UGD says Male, UGD says Female; GA Set 1, GA Set 2; each set split into GA says Male / GA says Female]
[Diagram: a non-registered user → Classification Model → Prediction]
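Prediction Set Data Analysis
• Because the Prediction Data is far larger than the Training Data, the intersection ratio is quite high when the Training Set is used as the denominator
• With the Prediction Data as the denominator, however, the intersection ratios for Article, Author, Country, and Referer are all below 10%, as shown in the figure below
• For Article and Author, this is because PIXNET users' reading is concentrated on particular articles; other articles get very few clicks, and some have never been viewed by anyone else
[Figure: Venn diagrams of Prediction Set vs. Training Set, one for Article, Author, and Referrer and one for Hour & Category]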
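The two intersection ratios discussed above differ only in their denominator. A minimal sketch with made-up author ids shows how the same overlap can look high against the training set and low against the prediction set; the function name `overlap_ratios` and the sample values are assumptions for illustration.

```python
def overlap_ratios(training_values, prediction_values):
    """Return (intersection/training, intersection/prediction) as fractions."""
    training = set(training_values)
    prediction = set(prediction_values)
    shared = training & prediction
    return len(shared) / len(training), len(shared) / len(prediction)

# Made-up example: 4 authors seen in training, 40 seen in prediction, 3 shared.
training_authors = {"a1", "a2", "a3", "a4"}
prediction_authors = {"a1", "a2", "a3"} | {f"p{i}" for i in range(37)}

over_training, over_prediction = overlap_ratios(training_authors, prediction_authors)
```

A low intersection/prediction ratio means most feature values seen at prediction time were never observed in training, so they contribute nothing to the model's decision.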
Implementation - System Architecture
Implementation - Technology Inventory List
Technology Tool | Purpose
Scikit-learn | Machine learning library
Redis | Cookie profile database
Python | Programming language
Celery | Scheduling framework
Redshift | Data warehouse for large raw data
Django & REST framework | Build API service for internal systems
Implementation - Performance Tuning
● CPU
  o Batch Prediction: 1000x speed-up
  o Parallel Process: full usage of multiple cores, 8x speed-up (Python)
● Memory
  o Garbage Collection
  o Python - del
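Batch prediction means calling the model once per block of rows instead of once per row, which avoids per-call overhead. The sketch below illustrates the idea with a toy `BernoulliNB` model; the data, the batch size, and the helper name `predict_in_batches` are assumptions, and it does not attempt to reproduce the speed-up figures from the slide.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy model trained on random binary data (illustration only).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 20))
y_train = rng.integers(1, 3, size=200)  # labels 1 or 2
model = BernoulliNB().fit(X_train, y_train)

X_new = rng.integers(0, 2, size=(1000, 20))

def predict_in_batches(model, X, batch_size=256):
    """Call model.predict once per batch of rows and join the results."""
    outputs = []
    for start in range(0, X.shape[0], batch_size):
        outputs.append(model.predict(X[start:start + batch_size]))
    return np.concatenate(outputs)

batched = predict_in_batches(model, X_new)
```

Bounding the batch size also caps peak memory per call, and independent batches can be handed to separate worker processes to use several cores.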
Reference
● http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
● https://www.iperceptions.com/~/media/files/knowledge/whitepapers/iperceptionsintentrecognitionenginewhitepaperfeb2014v13.ashx
● A Two-Stage Ensemble of Diverse Models for Advertisement ...
● http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
● Why use naive Bayes: http://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf
● Unbias: http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf