Using browsing behavior history to predict user’s gender presentation

Using Browsing Behavior Log to Predict User’s Gender

Rick, Kent, Koi

Overview
● Huge data burns money
  o 28 million PV / day
  o 7.7 million UV / day
  o 4.4 billion articles in total
  o 4.7 million registered users in total
● Only 2% of users log in. Who are the other 98%?

Problem Definition
• Use the history data of the 2% of users who log in to predict the gender of the other 98%

[Diagram: Training Data → Train Model → Model; Unknown Cookie → Model (user model to predict) → Gender Result]

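Training Flow

[Diagram: Raw Log → Selection → Target Data (training data) → Preprocessing → Transformation → Transformed Data → Data Mining → Pattern]

Selection: take the browsing records of logged-in users from the last three months, keeping only users who viewed at least two different articles.

Data Mining: use the Naïve Bayes algorithm to build the prediction model.

• Feature Extraction
• Feature Selection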

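Prediction Flow

[Diagram: Raw Log → Selection → Predict Data → Preprocessing → Transformation → Transformed Data → Naive Bayes Pattern]

Selection: take the browsing records of non-logged-in users from the last three months, which account for about 98% of all site data.

Prediction: use the Naïve Bayes algorithm to predict gender.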

Naive Bayes Formula

Put simply, it just checks which one shows up more often!
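For reference, the standard Naive Bayes decision rule can be written as follows. The symbols are assumptions chosen for illustration: g is the gender class and x_1, ..., x_n are the binary features observed for one cookie.

```latex
\hat{g} = \arg\max_{g \in \{\text{male},\, \text{female}\}} \; P(g) \prod_{i=1}^{n} P(x_i \mid g)
```

Each P(x_i | g) is estimated from how often feature x_i occurs among training users of gender g, which is why the rule comes down to comparing occurrence counts.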

Naive Bayes in Python Scikit-learn

http://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes
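A minimal sketch of Bernoulli Naive Bayes in scikit-learn on a toy binary matrix. The feature columns and the six training rows are made up for illustration, and the mapping 1 = male, 2 = female is an assumption; the slides only say the label is 1 or 2.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary matrix: one row per logged-in user, one column per feature.
# Hypothetical columns: cat_finance, cat_beauty, cat_fashion
X_train = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])
# Gender labels encoded as 1 or 2 (assumed here: 1 = male, 2 = female).
y_train = np.array([1, 1, 1, 2, 2, 2])

model = BernoulliNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
model.fit(X_train, y_train)

# Two unknown cookies described with the same three features.
X_new = np.array([[1, 0, 0], [0, 1, 1]])
predicted = model.predict(X_new)
probabilities = model.predict_proba(X_new)
print(predicted)
print(probabilities.round(3))
```

`predict_proba` returns one column per class, which is useful when only confident predictions should be acted on.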

Raw Data (Matrix?)

Training Data Set Overview

Item | Description | Comment
Date | 20150223 ~ 20150424 |
Total Click Counts | 10908692 |
Login User | Male: 149403, Female: 229448 |
Feature | Before: 2543240, After: 508648 | Uses chi-square for feature selection

Feature Extraction
• Category Feature -> Binary Feature
• Example

Before:

Feature Name | Feature Value
Article Type | A, B, C, D, E

After:

Feature Name | Feature Value
Article Type - A | 0, 1
Article Type - B | 0, 1
Article Type - C | 0, 1
Article Type - D | 0, 1
Article Type - E | 0, 1
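The category-to-binary expansion in the example above can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical; scikit-learn's `OneHotEncoder` or `DictVectorizer` do the same job on real data.

```python
def one_hot(feature_name, value, categories):
    """Expand one categorical value into binary features, one per category."""
    return {f"{feature_name} - {c}": int(value == c) for c in categories}

ARTICLE_TYPES = ["A", "B", "C", "D", "E"]

# An article of type C sets exactly one of the five binary features to 1.
encoded = one_hot("Article Type", "C", ARTICLE_TYPES)
print(encoded)
```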

Features List

Feature Name | Description | Example
gender | The gender of the logged-in user | 1 or 2
cat | The article’s category | 旅遊 (Travel)
url | A blog URL | http://kittyfish.pixnet.net/blog/post/345566174
article_author | The blog’s author | kittyfish
article_id | The blog’s unique ID | 345566174
hours | The hour of the click event | 6
refers | | http://www.google.com/
country | The country predicted from the IP address | tw
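A rough sketch of how one click record might be turned into binary feature names, using the example values from the features list. The `cat_` and `author_` prefixes match names that appear later in the deck; the other prefixes, the record layout, and the helper `extract_features` are assumptions for illustration.

```python
from urllib.parse import urlparse

def extract_features(click):
    """Turn one click record into a set of binary feature names."""
    parsed = urlparse(click["url"])
    author = parsed.hostname.split(".")[0]               # kittyfish.pixnet.net -> kittyfish
    article_id = parsed.path.rstrip("/").split("/")[-1]  # last path segment
    referer_host = urlparse(click["refers"]).hostname
    return {
        f"cat_{click['cat']}",
        f"author_{author}",
        f"article_{article_id}",
        f"hour_{click['hours']}",
        f"referer_{referer_host}",
        f"country_{click['country']}",
    }

click = {
    "cat": "旅遊",
    "url": "http://kittyfish.pixnet.net/blog/post/345566174",
    "hours": 6,
    "refers": "http://www.google.com/",
    "country": "tw",
}
features = extract_features(click)
print(sorted(features))
```

Every distinct author, article, and referer becomes its own feature under this scheme, so the feature count grows quickly with traffic.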

But... Too Many Features (burning money again)

• T = 2,450,000 x 2,543,240
• Many features are irrelevant for prediction

2,543,240 features
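To get a feel for the size of T, the arithmetic can be worked out directly. This sketch assumes T is a matrix with 2,450,000 rows and one column per feature, and that a dense layout would spend 1 byte per cell; neither detail is stated on the slide.

```python
rows = 2_450_000       # first dimension of T on the slide
features = 2_543_240   # number of binary features before selection

cells = rows * features
dense_terabytes = cells / 1e12   # assuming 1 byte per cell
print(f"{cells:,} cells, about {dense_terabytes:.2f} TB if stored densely")
```

Since almost every cell is 0, a sparse representation and a smaller feature set are the natural ways to bring this down.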

Feature Selection – Chi Square

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
http://www.slideshare.net/parth241989/chi-square-test-16093013

Chi Square Value | Dependence with Result
Large | High
Small | Low

• 2543240 features -> 508648 features
• Precision 74% -> 81%
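A small sketch of chi-square feature selection with scikit-learn on toy data. The deck does not say which selector was used; `SelectPercentile` is an assumption here, chosen because 508648 is exactly 20% of 2543240. The toy example keeps 40% so that two of its five features survive.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

# Toy data: 6 users x 5 binary features. Features 0 and 1 follow the label,
# features 2 to 4 are spread evenly across both classes.
X = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 1, 0],
])
y = np.array([1, 1, 1, 2, 2, 2])

# A larger chi-square score means stronger dependence on the label.
scores, p_values = chi2(X, y)

selector = SelectPercentile(chi2, percentile=40)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
print(scores, kept, X_selected.shape)
```

On the real data the equivalent call would be `percentile=20`, followed by training Naive Bayes on the reduced matrix.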


Which Features Are Important?

feature_name | male_prob | female_prob | male_count | female_count | total | prob_distance
cat_財經企管 (Finance & Business) | 0.137798 | 0.045564 | 20587 | 10454 | 31041 | 0.184468
cat_美容彩妝 (Beauty & Makeup) | 0.062211 | 0.137009 | 9294 | 31436 | 40730 | 0.149596
cat_時尚流行 (Fashion & Trends) | 0.079325 | 0.151936 | 11851 | 34861 | 46712 | 0.145221
cat_親子育兒 (Parenting & Childcare) | 0.079640 | 0.133178 | 11898 | 30557 | 42455 | 0.107076
cat_心情日記 (Mood Diary) | 0.180942 | 0.231797 | 27033 | 53185 | 80218 | 0.101709
cat_國外旅遊 (Overseas Travel) | 0.152288 | 0.194490 | 22752 | 44625 | 67377 | 0.084403
author_XXXXX | 0.049975 | 0.009037 | 7466 | 2073 | 9539 | 0.081877
cat_食譜分享 (Recipe Sharing) | 0.054607 | 0.093596 | 8158 | 21475 | 29633 | 0.077978
cat_圖文創作 (Illustrated Writing) | 0.085483 | 0.122831 | 12771 | 28183 | 40954 | 0.074696
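The prob_distance column is not defined in the deck. The listed values are consistent, to within rounding, with twice the absolute difference between male_prob and female_prob, so the sketch below uses that formula as an inferred assumption and checks it against three rows of the table.

```python
def prob_distance(male_prob, female_prob):
    """Twice the absolute gap between the two per-gender probabilities.

    Inferred from the table values; not a definition given on the slides.
    """
    return 2 * abs(male_prob - female_prob)

rows = [
    # (feature_name, male_prob, female_prob, listed prob_distance)
    ("cat_財經企管", 0.137798, 0.045564, 0.184468),
    ("cat_美容彩妝", 0.062211, 0.137009, 0.149596),
    ("author_XXXXX", 0.049975, 0.009037, 0.081877),
]
for name, male_prob, female_prob, listed in rows:
    computed = prob_distance(male_prob, female_prob)
    leaning = "male" if male_prob > female_prob else "female"
    print(f"{name}: computed={computed:.6f} listed={listed:.6f} leans {leaning}")
```

Sorting features by this distance puts the most gender-skewed ones first, which matches the ordering of the table.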

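Which Features Are Important?
• Category alone already gives a first indication of gender tendency.
• Certain specific authors and articles are especially useful for identifying male users.
• Male click distributions show stronger specific tendencies than female ones. This agrees with the later online experiment using GA, where prediction precision was higher for males than for females.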

Feature Distribution

A small number of features are highly influential, but the long tail of the remaining features still matters: it helps gain the last few percentage points.

Prediction Set Data Analysis

Feature | Intersection/Training | Intersection/Prediction
hour | 100.00% | 91.67%
author | 94.37% | 7.79%
country | 100.00% | 2.46%
category | 100.00% | ???
article | 84.53% | 2.64%
referer | 94.50% | 8.76%

Real-World Record

Live Experiment on PIXNET Falcon (Advertisement) System

Validation by Google Analytics
● Is GA God?
● How to use it?

[Diagram: a non-registered user → Classification Model → Prediction. Cookies where UGD says Male and cookies where UGD says Female go into GA Set 1 and GA Set 2; within each set, GA says Male or GA says Female.]
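One way to summarize a comparison between the model's predictions and Google Analytics labels is precision per predicted gender. The sketch below uses eight made-up cookies, not data from the experiment, and the helper `precision_by_gender` is hypothetical.

```python
def precision_by_gender(predicted, reference):
    """For each predicted label, the share of those predictions the reference agrees with."""
    result = {}
    for label in set(predicted):
        picked = [ref for pred, ref in zip(predicted, reference) if pred == label]
        result[label] = sum(ref == label for ref in picked) / len(picked)
    return result

# Hypothetical labels for eight cookies.
model_says = ["male", "male", "male", "male", "female", "female", "female", "female"]
ga_says = ["male", "male", "male", "female", "female", "female", "male", "male"]

precision = precision_by_gender(model_says, ga_says)
print(precision)
```

Treating GA as the reference only measures agreement with GA; it does not prove either side is correct.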

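Prediction Set Data Analysis
• Because the prediction data is far larger than the training data, the intersection ratio is quite high when the training set is the denominator.
• With the prediction data as the denominator, however, the intersection ratios for Article, Author, Country, and Referer are all below 10%, as shown in the figure below.
• For Article and Author, this is because PIXNET users' reading is concentrated on specific articles; the other articles get very few clicks, and some have never been viewed by anyone else.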

[Figure: overlap of the Prediction Set and the Training Set, with one panel for Article, Author, Referrer and one panel for Hour & Category]
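The two intersection ratios can be sketched with plain sets. This assumes the ratios are computed over distinct feature values rather than over clicks, which the deck does not specify; the author IDs are made up.

```python
def intersection_ratios(training_values, prediction_values):
    """Share of each set's distinct values that also appear in the other set."""
    training, prediction = set(training_values), set(prediction_values)
    shared = training & prediction
    return len(shared) / len(training), len(shared) / len(prediction)

# Hypothetical author IDs: the prediction set is ten times larger.
training_authors = ["a1", "a2", "a3", "a4"]
prediction_authors = ["a1", "a2", "a3"] + [f"b{i}" for i in range(37)]

over_training, over_prediction = intersection_ratios(training_authors, prediction_authors)
print(f"{over_training:.2%} of training, {over_prediction:.2%} of prediction")
```

The same three shared authors give 75% over the small training set but only 7.5% over the large prediction set, the same asymmetry the table shows.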

Implementation - System Architecture

Implementation - Technology Inventory List

Technology Tool | Purpose
Scikit-learn | Machine learning library
Redis | Cookie profile database
Python | Programming language
Celery | Scheduling framework
Redshift | Data warehouse for large raw data
Django & REST framework | Build API service for internal systems

Implementation - Performance Tuning
● CPU
  o Batch Prediction
    - 1000x speed-up
  o Parallel Process
    - Full use of multiple cores: 8x speed-up
    - Python
● Memory
  o Garbage Collection
  o Python - del
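A minimal sketch of the batch-prediction idea: call the model once per chunk of rows instead of once per row, and release each chunk when done. The helper names and the stand-in predictor are hypothetical; in practice `predict_batch` would be a vectorized call such as a scikit-learn model's `predict`.

```python
import gc

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_in_batches(predict_batch, rows, batch_size):
    """Call the model once per batch instead of once per row."""
    results = []
    for batch in chunked(rows, batch_size):
        results.extend(predict_batch(batch))
        del batch      # drop the reference so the slice can be reclaimed
    gc.collect()       # ask the garbage collector to run once at the end
    return results

calls = []

def fake_predict_batch(batch):
    # Stand-in for a real vectorized model call.
    calls.append(len(batch))
    return ["male" if sum(row) % 2 else "female" for row in batch]

rows = [[i, 1] for i in range(10)]
labels = predict_in_batches(fake_predict_batch, rows, batch_size=4)
print(calls, labels)
```

Because the chunks are independent, they can also be handed to worker processes (for example with `multiprocessing.Pool.map`) to use all cores.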

Reference
● http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
● https://www.iperceptions.com/~/media/files/knowledge/whitepapers/iperceptionsintentrecognitionenginewhitepaperfeb2014v13.ashx
● A Two-Stage Ensemble of Diverse Models for Advertisement ... : https://kaggle2.blob.core.windows.net/competitions/kddcup2012/2748/media/NTU.pdf
● http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
● Why use naive bayes: http://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf
● Unbias: http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf








