Introduction to New Features and Use Cases of Hivemall
Research Engineer Makoto YUI @myui
2016/03/30 Treasure Data Tech Talk
http://eventdots.jp/event/583226
Who am I?
➢ 2015.04: Joined Treasure Data, Inc. as the 1st Research Engineer in Treasure Data
➢ 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
➢ 2009.03: Ph.D. in Computer Science from NAIST
➢ Head of the TD mountaineering club
➢ 3 members (including 1 ghost member)
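Treasure Data supports ML-as-a-Service
[Figure: Treasure Data service overview. Labels: data sources (web logs, app logs, sensors, CRM, ERP, RDBMS, SQL Server); Treasure Data Collectors (Treasure Agent, Embulk, mobile SDK, JS SDK, embedded); batch analysis, ad hoc analysis, and Machine Learning; output via API, ODBC/JDBC, and PUSH to analysis tools, data visualization and sharing, and other products.]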
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall
What is Hivemall
[Figure: where Hivemall sits in the Hadoop stack, top layer first]
Machine Learning: Hivemall
Query Processing: Hive / Pig
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez DAG processing, MRv2
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS
Hivemall’s Vision: ML on SQL
Classification with Mahout
    CREATE TABLE lr_model AS
    SELECT
      feature, -- reducers perform model averaging in parallel
      avg(weight) as weight
    FROM (
      SELECT logress(features, label, ..) as (feature, weight)
      FROM train
    ) t -- map-only task
    GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
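The comments in the query describe a two-phase pattern: each map task trains its own model on its share of the rows, then reducers average the weights per feature. The following is a minimal Python sketch of that pattern only, not of Hivemall's implementation. The function names, the toy data, and the SGD settings are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_partition(rows, epochs=20, eta=0.1):
    """Train logistic regression with SGD on one partition (the map-only step).

    rows is a list of (features, label) pairs, where features is a dict
    {feature_name: value} and label is 0 or 1. Returns (feature, weight)
    pairs, mirroring what the inner SELECT emits.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            margin = sum(weights[f] * v for f, v in features.items())
            predicted = 1.0 / (1.0 + math.exp(-margin))
            for f, v in features.items():
                weights[f] += eta * (label - predicted) * v
    return list(weights.items())


def average_models(partition_outputs):
    """Average weights per feature (the GROUP BY feature / avg(weight) step)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for pairs in partition_outputs:
        for feature, weight in pairs:
            totals[feature] += weight
            counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}


# Hypothetical toy data, split as if two mappers each received one partition.
partition_a = [({"bias": 1.0, "clicks": 1.0}, 1), ({"bias": 1.0, "clicks": 0.0}, 0)]
partition_b = [({"bias": 1.0, "clicks": 1.0}, 1), ({"bias": 1.0, "visits": 1.0}, 0)]

lr_model = average_models([train_partition(p) for p in (partition_a, partition_b)])
```

A feature that appears in only one partition is averaged over that partition alone, which matches how avg(weight) behaves over the emitted rows.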
List of Features in Hivemall v0.3.x
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Top-k query processing
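Two of the feature engineering items above, feature hashing and feature scaling, are easy to show in a few lines. This is a rough Python sketch of the general techniques, not Hivemall's UDFs. The function names are hypothetical, and the MD5-based hash is an assumption chosen for reproducibility rather than the hash Hivemall uses.

```python
import hashlib
import statistics


def hash_feature(name, num_buckets=2 ** 24):
    """Map a feature name to a bucket index in [0, num_buckets)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(features, num_buckets=2 ** 24):
    """Convert {name: value} into {bucket_index: value}, summing on collisions."""
    hashed = {}
    for name, value in features.items():
        index = hash_feature(name, num_buckets)
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed


def rescale(values):
    """Min-max normalization: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]
```

Hashing bounds the feature space to a fixed number of buckets without keeping a dictionary of feature names, at the cost of occasional collisions.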
Features supported in Hivemall v0.4.0
1. RandomForest
 • classification, regression
2. Factorization Machine
 • classification, regression (factorization)
Features supported in Hivemall v0.4.1-alpha
1. NLP Tokenizer (morphological analysis)
 • Kuromoji
2. Mini-batch Gradient Descent
3. RandomForest scalability improvements
Treasure Data is operating Hivemall v0.4.1-alpha.6
The above features are already supported
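For readers unfamiliar with item 2, mini-batch gradient descent updates the weights once per small batch of rows, using the average gradient of the batch, instead of once per row. The sketch below shows the idea for logistic regression in plain Python. It makes no claim about how Hivemall implements it, and the function name, batch size, learning rate, and data are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def minibatch_gd(rows, num_features, batch_size=2, epochs=500, eta=0.5):
    """Logistic regression where each update uses the mean gradient of one batch."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            gradient = [0.0] * num_features
            for x, y in batch:
                error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
                for i, xi in enumerate(x):
                    gradient[i] += error * xi
            # One weight update per batch, not per row.
            for i in range(num_features):
                weights[i] -= eta * gradient[i] / len(batch)
    return weights


# Hypothetical rows: x[0] is a constant bias term, x[1] is the only real feature.
rows = [([1.0, 0.0], 0), ([1.0, 1.0], 1), ([1.0, 0.2], 0), ([1.0, 0.9], 1)]
weights = minibatch_gd(rows, num_features=2)
```

Averaging over a batch smooths the noise of single-row updates while still updating far more often than full-batch gradient descent.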
Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
 • Freakout Inc. and more
 • Replaced Spark MLlib w/ Hivemall at company X
 http://www.slideshare.net/masakazusano75/sano-hmm-20150512
➢ Gender prediction of Ad click logs
 • Scaleout Inc.
 http://eventdots.jp/eventreport/458208
➢ Value prediction of Real estates
 • Livesense
 http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
➢ Churn Detection
 • OISIX
 http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
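Churn prediction for a membership service
• Recurring purchases by 100,000 members drive the company's overall revenue and profit, but the company had no way to identify members at risk of churning in advance or to prevent them from leaving
• Machine learning without specialist knowledge of statistics
• Granting points to members on the churn prediction list halved the churn rate
• Surfaced the measures and events that carry churn risk, and made it possible to identify the characteristic behavior of members who do not churn
• Also achieved indirect service improvements, such as changing the UI according to the degree of risk
• Used machine learning to build a list of customers likely to churn in the coming month, based on data from the past month
• Specifically, each step (building the training table -> normalization -> building the model -> logistic regression) was implemented simply as queries using TD + Hivemall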
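The last two steps of that pipeline, normalization followed by logistic-regression scoring, can be sketched in Python as below. This is only an illustration of the flow: the member records, column names, weights, and bias are invented for the example and do not come from the OISIX case, where the steps were written as TD + Hivemall queries.

```python
import math


def min_max_normalize(rows, columns):
    """Rescale each listed column to [0, 1] so features share a common range."""
    normalized = [dict(row) for row in rows]
    for column in columns:
        values = [row[column] for row in rows]
        lo, hi = min(values), max(values)
        for row in normalized:
            row[column] = 0.0 if hi == lo else (row[column] - lo) / (hi - lo)
    return normalized


def churn_probability(row, weights, bias):
    """Logistic regression prediction: sigmoid of the weighted feature sum."""
    margin = bias + sum(weights[c] * row[c] for c in weights)
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical members with last month's behavior.
members = [
    {"member_id": "a", "days_since_last_order": 2, "complaints": 0},
    {"member_id": "b", "days_since_last_order": 25, "complaints": 2},
    {"member_id": "c", "days_since_last_order": 10, "complaints": 0},
]
# Hypothetical weights standing in for a trained model.
weights = {"days_since_last_order": 3.0, "complaints": 2.0}
bias = -2.5

normalized = min_max_normalize(members, list(weights))
# Highest predicted churn risk first.
churn_list = sorted(
    ((churn_probability(row, weights, bias), row["member_id"]) for row in normalized),
    reverse=True,
)
```

The sorted list plays the role of the churn prediction list: the members at the top are the candidates for direct measures.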
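[Figure: data used for prediction and resulting measures. Data used for prediction (from Web and Mobile): attribute information, behavior logs, complaint information, acquisition source, service usage information. Direct measures: granting points, care calls. Indirect measures: guiding members toward a successful experience, UI changes.]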
Factorization Machines
Context information (e.g., time) can be considered
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
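To make the model concrete, here is a small Python sketch of the degree-2 factorization machine prediction from the cited paper: a global bias, a linear term, and pairwise interactions weighted by dot products of per-feature latent vectors. The second function uses the paper's linear-time reformulation of the pairwise term. The feature layout and all parameter values are made up for illustration, and the code is not Hivemall's implementation.

```python
def fm_predict_naive(x, w0, w, v):
    """y = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, computed directly."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


def fm_predict_fast(x, w0, w, v):
    """Same prediction in O(k * n), using
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2).
    """
    n, k = len(x), len(v[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical layout: [user=alice, user=bob, item=movie1, item=movie2, scaled hour of day]
x = [1.0, 0.0, 0.0, 1.0, 0.5]
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.05, 0.4]
v = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1], [0.2, 0.3]]

prediction = fm_predict_fast(x, w0, w, v)
```

The last entry of x is the context feature: because it interacts with the user and item features through the latent vectors, the same user and item pair can score differently at different times.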
Features to be supported in Hivemall v0.4.1
1. NLP Tokenizer (morphological analysis)
 • Kuromoji integration was requested by Company R
2. Mini-batch Gradient Descent
3. RandomForest scalability improvements
4. Recommendation for Implicit Feedback Dataset
 • Useful where only positive feedback is available
 • BPR: Bayesian Personalized Ranking from Implicit Feedback, Proc. UAI, 2009.
Planned to release v0.4.1 in April.
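The idea behind BPR in item 4 is to learn from pairs: for a given user, an item with positive feedback should score higher than an item without it. A minimal Python sketch of one stochastic gradient step on the BPR criterion with matrix-factorization scores follows. The function name, vector values, learning rate, and regularization constant are assumptions for illustration, and this is not the planned Hivemall API.

```python
import math


def bpr_step(user_vec, pos_vec, neg_vec, eta=0.05, reg=0.01):
    """One SGD step maximizing ln(sigmoid(x_ui - x_uj)) with L2 regularization.

    Scores are dot products of latent vectors, so
    x_ui - x_uj = user_vec . (pos_vec - neg_vec). Vectors are updated in place.
    Returns the ranking loss -ln(sigmoid(x_ui - x_uj)) measured before the
    update (the regularization term is not included in the returned value).
    """
    x_uij = sum(u * (p - n) for u, p, n in zip(user_vec, pos_vec, neg_vec))
    loss = math.log(1.0 + math.exp(-x_uij))
    g = 1.0 / (1.0 + math.exp(x_uij))  # equals 1 - sigmoid(x_uij)
    for f in range(len(user_vec)):
        u, p, n = user_vec[f], pos_vec[f], neg_vec[f]
        user_vec[f] += eta * (g * (p - n) - reg * u)
        pos_vec[f] += eta * (g * u - reg * p)
        neg_vec[f] += eta * (-g * u - reg * n)
    return loss


# Hypothetical latent vectors for one user, one clicked item, one unseen item.
user = [0.1, -0.05, 0.02, 0.08]
clicked = [0.05, 0.1, -0.02, 0.03]
unseen = [-0.04, 0.02, 0.06, -0.01]

losses = [bpr_step(user, clicked, unseen) for _ in range(200)]
```

In practice the triples (user, positive item, negative item) would be sampled from the whole dataset; repeating a single triple here just makes the decrease in loss easy to observe.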
Features to be supported in Hivemall v0.4.2
1. Gradient Tree Boosting
 • classifier, regression
 • based on Smile https://github.com/haifengl/smile/
2. Field-aware Factorization Machine
 • classification, regression (factorization)
Planned to release v0.4.2 in June
Features to be supported in Hivemall v0.5
1. Mix server on Apache YARN
 • Service for parameter sharing among workers
 [Figure: data-parallel training. Learners 1 to N each train on a partition of the training examples and exchange parameters to produce the trained model.]
2. Online LDA
 • topic modeling, clustering
3. XGBoost Integration
4. Generalized Linear Model
 • Ridge/Elastic net/Lasso regularization
 • Supports various loss functions
5. Alternating Direction Method of Multipliers (ADMM) convex optimization
6. t-SNE Dimension Reduction
Analytics Workflow
Machine learning workflows can be simplified using our new workflow engine, named Digdag
    +main:
      +prepare:
        _parallel: true
        +train:
          td>: ./tasks/train_join.sql
        +test:
          td>: ./tasks/test_join.sql
      +quantify:
        td>: ./tasks/train_quantify.sql
      +model_test_quantify:
        _parallel: true
        +model:
          td>: ./tasks/make_model.sql
        +test_quantify:
          td>: ./tasks/test_quantify.sql
      +pred:
        td>: ./tasks/prediction.sql
CLI version will be released soon. Stay tuned!
Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall’s Positioning
 • For SQL users that need ML
 • Ease of use and scalability in mind
Treasure Data provides ML-as-a-Service using Hivemall
Major development leaps in v0.4
 • RandomForest
 • Factorization Machine
More will follow in v0.4.1 and later
Blog article about Hivemall
http://blog-jp.treasuredata.com/
A series on solving Kaggle challenges using TD, Hivemall, Jupyter, and Pandas-TD