Hivemall: Machine Learning Library for Apache Hive/Spark
Research Engineer Makoto YUI (油井誠) @myui
2016/09/09 HadoopCon 16, Taipei
Little about me..
Ø 2015.04~ Research Engineer at Treasure Data, Inc.
  • My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
Ø 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan (産業技術総合研究所)
  • Developed Hivemall as a personal research project
Ø 2009.03 Ph.D. in Computer Science from NAIST
  • Majored in Parallel Data Processing, not ML then
Ø Visiting scholar in CWI, Amsterdam and Univ. Edinburgh
Treasure Data
Hiro Yoshikawa, CEO
Kaz Ota, CTO
Sada Furuhashi, Chief Architect
Open source business veteran
Founder - world's largest Hadoop group
Invented Fluentd, MessagePack
TODAY: 100+ Employees, 30M+ funding
2015: New office in Seoul, Korea
2013: New office in Tokyo, Japan
2012: Founded in Mountain View, CA
Investors
Jerry Yang, Yahoo! Founder
Bill Tai, Angel Investor
Yukihiro Matsumoto, Ruby Inventor
Sierra Ventures - Tim Guleri, Enterprise Software
Scale Ventures - Andy Vitus, B2B SaaS
We Open-source! TD invented..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Streaming Query Processor
• Machine learning on Hadoop
• Workflow engine (Beta): digdag.io
Our technology users
Point: Microsoft Operations Management Suite and Google Cloud Platform (Kubernetes) are using Fluentd for log collection
Treasure Data's Solution
Big Data Stats in TD
Our Customers
Ad-tech: Agency / Trading Desk, DMP / DSP, Ad-Network
IoT: 三菱重工 (Mitsubishi Heavy Industries)
EC, Media, Game/SNS: Gaming, e-Commerce, Internet Service
Other domain: Retail, Finance, Technology, Telecommunication, Maker
Agenda
1. What is Hivemall (introduction)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
5. Future roadmap
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall
Hivemall's Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, MESOS
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
Hivemall's Vision: ML on SQL
Classification with Mahout
  CREATE TABLE lr_model AS
  SELECT
    feature, -- reducers perform model averaging in parallel
    avg(weight) as weight
  FROM (
    SELECT logress(features, label, ..) as (feature, weight)
    FROM train
  ) t -- map-only task
  GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines work well when features are sparse and categorical.
List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items
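To make the idea behind top-k recommendation concrete, here is a minimal plain-Python sketch of "keep the k best-scored items per group". It is not Hivemall's implementation; the function name is borrowed from the slide, while the argument layout (group, score, item) and the sample data are assumptions made for illustration.

```python
import heapq
from collections import defaultdict

def each_top_k(k, rows):
    """Keep the k highest-scored items for every group.

    rows: iterable of (group, score, item) tuples (hypothetical layout).
    Returns {group: [(score, item), ...]} sorted by score, best first.
    """
    groups = defaultdict(list)
    for group, score, item in rows:
        groups[group].append((score, item))
    return {g: heapq.nlargest(k, pairs) for g, pairs in groups.items()}

# Illustrative user/item scores (made-up values).
scores = [
    ("u1", 0.9, "itemA"), ("u1", 0.2, "itemB"), ("u1", 0.7, "itemC"),
    ("u2", 0.4, "itemA"), ("u2", 0.8, "itemD"),
]
top2 = each_top_k(2, scores)
```

In a recommendation setting the group would typically be a user and the score a predicted rating or similarity.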
Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
  • Problem (minne.com): Recommendation using hot items is hard in a hand-crafted product market because each creator sells only a few single items (which soon go out of stock)
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
• User score calculation
  • Algorithm: Regression
  • Klout (klout.com, influencer marketing)
  • bit.ly/klout-hivemall
Churn Detection of Monthly Payment Service
OISIX, a leading food delivery service company in Japan, used Hivemall's Logistic Regression to get churn probability
Churn rate dropped almost by half by giving gift points to customers predicted to leave
Motivation - Why a new ML framework?
Machine Learning frameworks out there that run with Hadoop
• Mahout?
• Vowpal Wabbit? (w/ Hadoop streaming)
• Spark MLlib?
• 0xdata H2O?
• Cloudera Oryx?
Quick Poll: How many people in this room are using them?
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
Raw Data (HDFS, S3) -> Extract-Transform-Load -> Feature Vector file -> Machine Learning
Feature Vector example: height: 173cm, weight: 60kg, age: 34, gender: man, …
• Extract-Transform-Load: Need to do expensive data preprocessing (Joins, Filtering, and Formatting of Data that does not fit in memory)
• Machine Learning: Do not scale. Have to learn R/Python APIs
Hivemall's Vision: ML on SQL (again)
Classification with Mahout
  CREATE TABLE lr_model AS
  SELECT
    feature, -- reducers perform model averaging in parallel
    avg(weight) as weight
  FROM (
    SELECT logress(features, label, ..) as (feature, weight)
    FROM train
  ) t -- map-only task
  GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
Hivemall on Apache Spark
Installation is very easy as follows:
  $ spark-shell --packages maropu:hivemall-spark:0.0.6
How Hivemall works in training
Implemented machine learning algorithms as User-Defined Table generating Functions (UDTFs)
A UDTF is a function that returns a relation
Training table: tuple<label, array<features>>
  +1,<1,2> .. +1,<1,7,9>
  -1,<1,3,9> .. +1,<3,8>
-> train (UDTF): emits tuple<feature, weights>
-> Shuffle by feature
-> param-mix
-> Prediction model: Relation<feature, weights>
● Resulting prediction model is a relation of feature and its weight
● # of mappers and reducers are configurable
Parallelism is Powerful
Parameter averaging (bagging)
Training table: tuple<label, features>
  +1,<1,2> .. +1,<1,7,9>
  -1,<1,3,9> .. +1,<3,8>
  -1,<2,7,9> .. +1,<3,8>
  -1,<2,7,9> .. +1,<3,8>
train, train -> array<weight>
MIX
train, train -> array<weight>
Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps
  SET hivevar:xtimes=3;
  CREATE VIEW training_x3
  as
  SELECT
    *
  FROM (
    SELECT
      amplify(${xtimes}, *) as (rowid, label, features)
    FROM
      training
  ) t
  CLUSTER BY rand();
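The effect of the view above can be pictured with a small plain-Python sketch: every training row is repeated xtimes times, then the copies are shuffled so that a single pass over the data sees each example several times in random order. This is only an illustration of the idea, not Hivemall's code; the helper names, the seed parameter and the sample rows are assumptions.

```python
import random

def amplify(xtimes, rows):
    """Yield each input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row

def amplify_and_shuffle(xtimes, rows, seed=None):
    """Repeat the rows, then shuffle them (stands in for CLUSTER BY rand())."""
    out = list(amplify(xtimes, rows))
    random.Random(seed).shuffle(out)
    return out

# (rowid, label, features) rows, made up for illustration.
training = [(1, 1.0, ["a", "b"]), (2, -1.0, ["c"])]
training_x3 = amplify_and_shuffle(3, training, seed=42)
```

An online learner fed with training_x3 behaves roughly like one that ran three epochs over the original table.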
How to use Hivemall
Machine Learning
  Training: Feature Vector + Label -> Prediction Model
  Prediction: Feature Vector + Prediction Model -> Label
Steps: Data preparation, Feature Engineering, Training, Prediction
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
  Create external table e2006tfidf_train (
    rowid int,
    label float,
    features ARRAY<STRING>
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ","
  STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
Transforming a label value to a value between 0.0 and 1.0
  create view e2006tfidf_train_scaled
  as
  select
    rowid,
    rescale(label, ${min_label}, ${max_label}) as label,
    features
  from
    e2006tfidf_train;
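Min-max normalization maps a value v in [min, max] to (v - min) / (max - min). The plain-Python sketch below shows that arithmetic; the function name mirrors the query, but the handling of min equal to max (returning 0.5) and the sample labels are assumptions made for illustration.

```python
def rescale(value, min_value, max_value):
    """Min-max scale value into the range [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumption: no spread to scale against
    return (value - min_value) / (max_value - min_value)

# Made-up label values with min_label = -7.0 and max_label = 1.0.
labels = [-7.0, -3.0, 1.0]
scaled = [rescale(v, -7.0, 1.0) for v in labels]
```

The minimum and maximum have to be computed over the training labels beforehand, which is why the query takes them as variables.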
How to use Hivemall - Training
Training by logistic regression
  CREATE TABLE lr_model AS
  SELECT
    feature,
    avg(weight) as weight
  FROM (
    SELECT logress(features, label, ..)
      as (feature, weight)
    FROM train
  ) t
  GROUP BY feature
• Inner query: map-only task to learn a prediction model
• GROUP BY feature: Shuffle map-outputs to reducers by feature
• avg(weight): Reducers perform model averaging in parallel
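The three stages of this query can be mimicked in plain Python: each partition trains its own model with one SGD pass, emits (feature, weight) pairs, and the pairs are averaged per feature. This is a sketch under stated assumptions, not the logress implementation: features are treated as binary indicators, labels are 0 or 1, and the learning rate eta and the sample partitions are made up.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_partition(rows, eta=0.1):
    """One SGD pass of logistic regression over a single partition.

    rows: list of (features, label); features is a list of feature names
    (implicit value 1.0), label is 0 or 1. Returns (feature, weight) pairs.
    """
    w = defaultdict(float)
    for features, label in rows:
        p = sigmoid(sum(w[f] for f in features))
        for f in features:
            w[f] += eta * (label - p)  # gradient step on the logistic loss
    return list(w.items())

def average_by_feature(emitted):
    """Equivalent of GROUP BY feature with avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two partitions standing in for two map tasks.
partitions = [
    [(["a", "b"], 1), (["c"], 0)],
    [(["a"], 1), (["b", "c"], 0)],
]
emitted = [fw for part in partitions for fw in train_partition(part)]
model = average_by_feature(emitted)
```

Because the partitions never talk to each other while training, the map stage parallelizes trivially; the averaging step is what reconciles the independently learned models.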
How to use Hivemall - Training
Training of Confidence Weighted Classifier
  CREATE TABLE news20b_cw_model1 AS
  SELECT
    feature,
    voted_avg(weight) as weight
  FROM (
    SELECT
      train_cw(features, label) as (feature, weight)
    FROM
      news20b_train
  ) t
  GROUP BY feature
• voted_avg: Vote to use negative or positive weights for avg
  e.g. +0.7, +0.3, +0.2, -0.1, +0.7
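One way to read "vote to use negative or positive weights" is: count the positive and the negative weights for a feature, then average only the side that wins the vote. The plain-Python sketch below follows that reading; it is an assumption about the behaviour rather than Hivemall's code, and the tie-breaking rule (ties go to the negative side) is made up for illustration.

```python
def voted_avg(weights):
    """Average only the weights on the majority sign."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: a tie falls to the negative side.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)

# The weights shown on the slide: four positive votes against one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

With the slide's weights the single negative value is discarded and the four positive values are averaged, instead of letting the outlier pull a plain mean down.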
How to use Hivemall - Prediction
  CREATE TABLE lr_predict
  as
  SELECT
    t.rowid,
    sigmoid(sum(m.weight)) as prob
  FROM
    testing_exploded t LEFT OUTER JOIN
    lr_model m ON (t.feature = m.feature)
  GROUP BY
    t.rowid
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
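The join-based prediction can be sketched in plain Python: look up the weight of every (rowid, feature) pair in the model, sum the matched weights per row, and pass the sum through the sigmoid. This is an illustration only; the sample model and test rows are made up, and treating a row with no matching feature as a sum of 0.0 is a simplification (in SQL the sum over only NULLs would itself be NULL).

```python
import math
from collections import defaultdict

def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature) pairs, one per feature.
    model: {feature: weight}. Returns {rowid: probability}."""
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        # Unmatched features contribute nothing, like NULLs ignored by sum().
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: 1.0 / (1.0 + math.exp(-s)) for rowid, s in sums.items()}

model = {"a": 1.0, "b": -1.0}
testing_exploded = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probs = predict(testing_exploded, model)
```

In the SQL version the lookup is a distributed join, so neither the test data nor the model has to fit on a single machine.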
Real-time prediction
Batch Training on Hadoop: Feature Vector + Label -> Machine Learning -> Prediction Model
Export prediction models
Online Prediction on RDBMS: Feature Vector + Prediction Model -> Label
bit.ly/hivemall-rtp
RandomForest in Hivemall
Ensemble of Decision Trees
Training of RandomForest
Prediction of RandomForest
Future of Hivemall
Hivemall will become Apache Hivemall (?) Now on voting though..
Apache Incubation status
Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  Ø Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  Ø Hivemall on Apache Pig
  Ø Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  Ø Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>
Project mentors
Nominated Mentors
• Reynold Xin <Databricks, ASF member> Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member> Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member> Apache Spark PMC member
Champion
• Roman Shaposhnik <Pivotal, ASF member> Apache Bigtop/Incubator PMC member
Roadmap
• Possibly enter Apache Incubator soon
• IP clearance and project/repository site setup
• Contribution guideline
• Create a "who uses Hivemall" list
• More documentation!
Sept to Nov
• Initial Apache Release will be Dec (or late Nov?)
Coming New Features - already merged in Master
✓ Hivemall on Spark 2.0 w/ Dataframe support
✓ XGBoost support
Please refer to bit.ly/hivemall-xgboost for detail
Coming New Features - already merged in Master
✓ ChangeFinder
• Efficient algorithm for finding change points and outliers from time series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
Coming New Features - already merged in Master
✓ Various Evaluation Metrics
• PR#326
Other new features in progress
• v0.5-beta{1,2} release (Oct-Nov)
✓ one-hot encoding
✓ Field-aware Factorization Machines
✓ Kernelized Passive Aggressive
✓ Generalized Linear Model
✓ Optimizer framework including ADAM
✓ L1/L2 regularization
✓ Gradient Tree Boosting
✓ Online LDA
Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
Ø For SQL users that need ML
Ø For those already using Hive
Ø Ease of use and scalability in mind
Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
We welcome your contributions to Apache Hivemall
Any feature request or questions?
#hivemall