Who am I?
• ITO Ryuichi (@amaya382)
• Graduate School of Information Science and Technology, Osaka University (’16-)
• Accelerating graph processing engines: concurrency control, hardware-aware optimization
• (a little) Natural language processing: conversation systems with context consistency
❤ Scala, C#
About Hivemall
• A scalable machine learning library running on Apache Hive (+Spark, Pig)
  • Developed by @myui and others as an OSS
  • Joined Apache Incubator 🎉
• Can use many features via HQL (Hive Query Language, like SQL)
  • Classification
    • Perceptron, AdaGradRDA, Soft Confidence Weighted, etc.
  • Recommendation
    • Matrix Factorisation, Factorisation Machine, etc.
  • Utilities
    • Feature engineering, Additional array operations, etc.
  • etc.
[Image: Hivemall logo, annotated "Cute Logo!"]
About Hivemall(cont.)
• How does Hivemall work on Hive?
  • Hivemall is a set of UDFs (User-Defined Functions)
    • UDF: projection, one entry -> one entry
    • UDTF (Table-generating): some entries -> some entries
    • UDAF (Aggregate): all entries -> one entry
  • Features are defined as UDFs by implementing the Java interfaces provided by Hive
  • Loading the Hivemall jar file makes these extra functions available in HQL
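The three function shapes above can be sketched in plain Java. This is only an illustration of the input/output cardinalities; it does not use Hive's UDF interfaces, and the class and method names (UdfKinds, sigmoid, explodeWords, average) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the three shapes; NOT Hive's actual UDF API.
class UdfKinds {
    // UDF-like: one input value -> one output value.
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // UDTF-like: one input row -> zero or more output rows.
    public static List<String> explodeWords(String line) {
        List<String> rows = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                rows.add(word);
            }
        }
        return rows;
    }

    // UDAF-like: all input rows -> one output value.
    public static double average(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return values.isEmpty() ? 0.0 : sum / values.size();
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0.0));
        System.out.println(explodeWords("a b  c"));
        System.out.println(average(Arrays.asList(1.0, 2.0, 3.0)));
    }
}
```

In Hive itself, each shape corresponds to a different interface to implement, which is why the existing Java-level tests shown later differ so much between function kinds.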
About Hivemall(cont.)
• Example: Training by logistic regression
• HQL only; no need to be familiar with programming. (And HQL (Hive) is already close to the data!)
    CREATE TABLE model AS
    SELECT feature, AVG(weight) AS weight
    FROM (
      SELECT logress(features, label, ...) AS (feature, weight)
      FROM train_data
    ) t
    GROUP BY feature
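The outer part of the query (GROUP BY feature with AVG(weight)) can be sketched in plain Java as follows. The sketch assumes that the inner logress call may emit several (feature, weight) rows for the same feature, for example one per parallel task; the slide does not state this, and the names ModelAveraging, Row and averageByFeature are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of "SELECT feature, AVG(weight) ... GROUP BY feature".
class ModelAveraging {
    /** One (feature, weight) row, as a learner might emit it. */
    public static final class Row {
        final String feature;
        final double weight;

        public Row(String feature, double weight) {
            this.feature = feature;
            this.weight = weight;
        }
    }

    /** Averages the weights of all rows that share a feature name. */
    public static Map<String, Double> averageByFeature(List<Row> rows) {
        // Per feature: element 0 holds the running sum, element 1 the row count.
        Map<String, double[]> acc = new TreeMap<>();
        for (Row r : rows) {
            double[] a = acc.computeIfAbsent(r.feature, k -> new double[2]);
            a[0] += r.weight;
            a[1] += 1.0;
        }
        Map<String, Double> model = new TreeMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            model.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return model;
    }

    public static void main(String[] args) {
        // Made-up rows: "age" is emitted twice, "height" once.
        List<Row> rows = Arrays.asList(
                new Row("age", 0.5), new Row("height", -1.0), new Row("age", 1.5));
        System.out.println(averageByFeature(rows)); // {age=1.0, height=-1.0}
    }
}
```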
Benchmark
• Based on benchm-ml (https://github.com/szilard/benchm-ml)
  • Several pre-defined test cases w/ prepared data set
    1. Logistic Regression
    2. Random Forest
    3. Boosting
    4. Deep Learning
  • Several hyperparameters
  • Already tested with several tools (e.g. R, Python-sklearn, Spark, etc.)
• Tried in this talk: 1. Logistic Regression and 2. Random Forest
NOTE: basically a common environment is used, but some cases use different environments. For more details, see the benchm-ml project.
Benchmark(cont.)
• Environment
  • Amazon Web Services
    • EMR (Elastic MapReduce)
    • m3.xlarge*3 + c3.xlarge*3
    • Hadoop: Amazon 2.7.2
    • Tez: 0.8.4
    • Hive: 2.1.0
    • Hivemall: 0.4.2-RC2
  • Misc.
    • Basically six-way parallel processing, matching the number of instances
Benchmark - Logistic Regression
• Using logress() on Hivemall
• Hivemall is relatively slow and its AUC is low (NOTE: Hivemall’s logress uses SGD)
  • But we can be confident in its scalability
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/1-linear/6-hivemall
[Results table (Time[sec] / AUC[%]) not reproduced here. Annotations on the slide: ✖, 10x, 12.5x, 1.3x, 4.9x, 3.9x, ±0, +0.4]
High initial overhead caused by Hive
Benchmark - Random Forest(1)
• Using train_randomforest_classifier() on Hivemall
• Setting (1): 500 trees, three variables
• Hivemall does fairly well up to 0.1M, but cannot process 1M
  • Need to tune environment and parameters
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/2-rf/7-hivemall
[Results table (Time[sec] / AUC[%]) not reproduced here. Annotation on the slide: "Amazing…"]
Benchmark - Random Forest(2)
• Using train_randomforest_classifier() on Hivemall
• Setting (2): 100 trees, max depth 20
• Hivemall does well up to 1M, but cannot process 10M
  • Need to tune environment and parameters
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/z-other-tools/10-hivemall
[Results table (Time[sec] / AUC[%]) not reproduced here.]
Add several new features
• systemtest module
• Feature binning
• Feature selection
• Some Spark integrations
Add new features - systemtest
• What’s systemtest?
  • Testing framework for UDFs
  • Can also be applied to other applications based on UDFs
• Tests already exist, don't they? Why is it needed?
  • Yes, but the existing tests...
    • Cannot actually run on Hive; they only run as Java programs
    • Make it difficult to write comprehensive tests
      • e.g. in a UDAF, there are several workflows depending on the kind of function, data set and environment
    • Make it difficult to use existing resources
    • Have low extensibility, etc.
Add new features - systemtest
• Example: a part of an existing test
    // Needlessly long initialization
    final SignalNoiseRatioUDAF snr = new SignalNoiseRatioUDAF();
    final ObjectInspector[] OIs = new ObjectInspector[] {
        ObjectInspectorFactory.getStandardListObjectInspector(
            PrimitiveObjectInspectorFactory.writableDoubleObjectInspector),
        ObjectInspectorFactory.getStandardListObjectInspector(
            PrimitiveObjectInspectorFactory.writableIntObjectInspector)};
    final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator evaluator =
        (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator) snr.getEvaluator(
            new SimpleGenericUDAFParameterInfo(OIs, false, false));
    evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, OIs);
    final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer agg =
        (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer)
            evaluator.getNewAggregationBuffer();
    evaluator.reset(agg);
    ... // a lot omitted here
    // Many needless conversions
    for (int i = 0; i < features.length; i++) {
        final List<IntWritable> labelList = new ArrayList<IntWritable>();
        for (int label : labels[i]) {
            labelList.add(new IntWritable(label));
        }
        evaluator.iterate(agg,
            new Object[] {WritableUtils.toWritableList(features[i]), labelList});
    }
    final List<DoubleWritable> resultObj = (List<DoubleWritable>) evaluator.terminate(agg);
    ... // a lot omitted here
    Assert.assertArrayEquals(answer, result, 1e-5);
• And it does not run on Hive; it is only a logical test!!
Add new features - systemtest
• Solution
  • New module based on JUnit, HiveRunner and td-client-java
• What can it do?
  • Short and unified initialization
  • Write and combine HQL
  • Run on local Hive and also on remote Treasure Data with the same code
  • Testbed is prepared and cleaned up automatically
  • Easy to use external resources, e.g. TSV files
  • Tests are defined literally (as HQL), yet can be run with a debugger
  • Useful DSL
Add new features - systemtest
• How does it work?
[Diagram components: User; Test code; SystemTestTeam; interface: SystemTestRunner; implementations: HiveSystemTestRunner (HiveRunner) and TDSystemTestRunner (Treasure Data)]
1. Write tests based on the SystemTestRunner interface
2. Read the initialization and execute it via the implementations of SystemTestRunner
  • Works based on JUnit @ClassRule
  • Prepares a database specialized for each test class
  • Uses external resources as needed
3. Execute the first test
  • Works based on JUnit @Rule
  • Runs as HQL and checks the return values
  • Rewrites DSL & HQL for each environment
4. Reset testbeds
  • Works based on JUnit @Rule
  • Drops temporary tables
5. Execute the second test
6. Reset testbeds
  …repeat for all tests (works based on JUnit @Rule)
7. Finalize the test
  • Works based on JUnit @ClassRule
  • Drops the temporary database and disconnects
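The order of the steps above can be sketched without JUnit as a small simulation. This is only a sketch of the lifecycle under stated assumptions: the Runner interface, LoggingRunner and runAll are hypothetical stand-ins, not the systemtest API. Class-level setup and teardown play the role of @ClassRule, and the per-test reset plays the role of @Rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation of the lifecycle order only; hypothetical names, no JUnit involved.
class LifecycleSketch {
    /** Hypothetical stand-in for a test runner bound to one environment. */
    public interface Runner {
        void initClass(List<String> log);            // step 2: once per test class
        void runTest(String name, List<String> log); // steps 3 and 5: each test
        void resetTestbed(List<String> log);         // steps 4 and 6: after each test
        void finishClass(List<String> log);          // step 7: once per test class
    }

    /** Runner that only records what a real runner would do. */
    public static final class LoggingRunner implements Runner {
        private final String env;

        public LoggingRunner(String env) {
            this.env = env;
        }

        @Override
        public void initClass(List<String> log) {
            log.add(env + ": prepare database");
        }

        @Override
        public void runTest(String name, List<String> log) {
            log.add(env + ": run " + name);
        }

        @Override
        public void resetTestbed(List<String> log) {
            log.add(env + ": drop temporary tables");
        }

        @Override
        public void finishClass(List<String> log) {
            log.add(env + ": drop database and disconnect");
        }
    }

    /** Runs all tests in order, resetting after each one and finalizing once at the end. */
    public static List<String> runAll(Runner runner, List<String> tests) {
        List<String> log = new ArrayList<>();
        runner.initClass(log);
        try {
            for (String test : tests) {
                try {
                    runner.runTest(test, log);
                } finally {
                    runner.resetTestbed(log); // reset even if the test failed
                }
            }
        } finally {
            runner.finishClass(log); // finalize even if a test failed
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : runAll(new LoggingRunner("hive"), Arrays.asList("snr", "chi2"))) {
            System.out.println(line);
        }
    }
}
```

The try/finally nesting mirrors the guarantee the rules give: the testbed is reset after every test and the database is dropped once at the end, regardless of test outcome.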
Add new features - systemtest
• Example: initialization (no omission!)
    private static SystemTestCommonInfo ci = new SystemTestCommonInfo(HogeTest.class);

    // Common initialization with external data
    private static HQBase createIrisTable = HQ.uploadByResourcePathAsNewTable(
        "iris0", ci.initDir + "iris0.csv",
        new LinkedHashMap<String, String>() {{
            put("a", "double");
            put("b", "double");
            put("c", "double");
            put("d", "double");
            put("c0", "int");
            put("c1", "int");
            put("c2", "int");
        }});

    // Testbed-specific initialization
    @ClassRule
    public static HiveSystemTestRunner hRunner = new HiveSystemTestRunner(ci) {{
        initBy(createIrisTable);
        initBy(HQ.fromStatements(""
            + "CREATE TEMPORARY FUNCTION transpose_and_dot as 'hivemall.tools.matrix.TransposeAndDotUDAF';"
            + "CREATE TEMPORARY FUNCTION array_sum as 'hivemall.tools.array.ArraySumUDAF';"
            + "CREATE TEMPORARY FUNCTION array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';"
            + "CREATE TEMPORARY FUNCTION chi2 as 'hivemall.ftvec.selection.ChiSquareUDF';"
            + "CREATE TEMPORARY FUNCTION snr as 'hivemall.ftvec.selection.SignalNoiseRatioUDAF';"));
    }};

    @ClassRule
    public static TDSystemTestRunner tRunner = new TDSystemTestRunner(ci) {{
        initBy(createIrisTable);
    }};

    // Set common runner
    @Rule
    public SystemTestTeam team = new SystemTestTeam(hRunner);
Add new features - systemtest
• Example: test cases(1) (no omission!)
• Tests execute on clean testbeds, using the database created by the initialization
    // Run on HiveRunner
    @Test
    public void snr() throws Exception {
        team.set(HQ.fromStatement(""
            + "WITH iris AS ("
            + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)"
            + "SELECT snr(X, Y)"
            + "FROM iris"),
            "$ANSWER");
        team.run();
    }

    // Run on HiveRunner and Treasure Data
    @Test
    public void chi2() throws Exception {
        team.add(tRunner);
        team.set(HQ.fromStatement(""
            + "WITH iris AS ("
            + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0),"
            + "stats AS ("
            + " SELECT"
            + " transpose_and_dot(Y, X) AS observed,"
            + " array_sum(X) AS feature_count,"
            + " array_avg(Y) AS class_prob"
            + " FROM"
            + " iris),"
            + "test AS ("
            + " SELECT"
            + " transpose_and_dot(class_prob, feature_count) AS expected"
            + " FROM"
            + " stats)"
            + "SELECT"
            + " chi2(observed, expected) AS x "
            + "FROM"
            + " test JOIN stats"),
            "$ANSWER");
        team.run();
    }
Add new features - systemtest
• Example: test cases(2)
@Test
public void someTest0() throws Exception {
    final String tableName = "color";
    team.initBy(HQ.uploadByResourcePathAsNewTable(
            tableName, ci.initDir + "color.tsv",
            new LinkedHashMap<String, String>() {{
                put("name", "string");
                put("red", "int");
                put("green", "int");
                put("blue", "int");
            }}));
    team.set(HQ.fromStatement(""
            + "SELECT CONCAT('rgb(', red, ',', green, ',', blue, ')') FROM "
            + tableName + " u LEFT JOIN color c on u.favorite_color = c.name"),
            "rgb(255,165,0)\trgb(255,192,203)");
    team.run();
}

@Test
public void someTest1() throws Exception {
    team.set(HQ.autoMatchingByFileName("hoge"), ci);
    team.run();
}
Test-specific initialization; it can also be chained
Use HQL and answers written in external files
Add new features - systemtest
• More details?
  • https://github.com/myui/hivemall/issues/323
  • https://github.com/myui/hivemall/pull/336
  • And systemtest/README.md
Add new features - feature binning
• What’s feature binning?
  • A method to divide quantitative variables into meaningful categorical variables
Add new features - feature binning
• How does it work?
  • [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
[Figure: build_bins and feature_binning]
Add new features - feature binning
• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • Uses percentiles internally so that all bins cover equal areas of the distribution
Add new features - feature binning
• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
• What’s auto_shrink?
  • Small or skewed data sets sometimes produce void (empty) bins -> Exception!
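The void-bin problem can be illustrated with a minimal sketch. It is not Hivemall's implementation: the class name, the nearest-rank percentile rule, the -Infinity/+Infinity end boundaries, and the behaviour attributed to auto_shrink (dropping a repeated boundary instead of failing) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and behaviour are assumptions, not Hivemall's code.
class BuildBinsSketch {

    /**
     * Returns bin boundaries [-Infinity, q1, ..., +Infinity] built from
     * nearest-rank percentiles, so each bin holds about the same number of values.
     */
    static double[] buildBins(double[] values, int numOfBins, boolean autoShrink) {
        if (values.length == 0 || numOfBins < 2) {
            throw new IllegalArgumentException("need at least one value and two bins");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);

        List<Double> bounds = new ArrayList<>();
        bounds.add(Double.NEGATIVE_INFINITY);
        for (int i = 1; i < numOfBins; i++) {
            // Nearest-rank percentile, computed with integer ceiling division.
            int rank = (i * sorted.length + numOfBins - 1) / numOfBins;
            double q = sorted[rank - 1];
            double last = bounds.get(bounds.size() - 1);
            if (q == last) {
                // A repeated quantile would create a void (empty) bin.
                if (!autoShrink) {
                    throw new IllegalStateException(
                            "void bin detected; enable autoShrink or reduce numOfBins");
                }
                continue; // assumed auto_shrink behaviour: drop the duplicate boundary
            }
            bounds.add(q);
        }
        bounds.add(Double.POSITIVE_INFINITY);
        return bounds.stream().mapToDouble(Double::doubleValue).toArray();
    }
}
```

With the values 1..6 and three bins this yields [-Infinity, 2.0, 4.0, Infinity]. With the skewed input {1, 1, 1, 1, 1, 9} the second quantile repeats the first, so the sketch throws unless autoShrink is true, in which case it returns the two bins [-Infinity, 1.0, Infinity].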
Add new features - feature binning
• [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
  • Distributes each variable into a bin according to its value
[Figure: feature_binning example. Age:17 -> 17 is between -Infinity and 18.0, so it goes into bin 0 (out of bin 0, bin 1, bin 2, …)]
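The lookup in the Age:17 example can be sketched as below. This is only an illustration: the class name, the boundary values other than 18.0, and the choice to treat each bin as (lower, upper] are assumptions, not taken from Hivemall.

```java
// Illustrative sketch only; the interval convention (lower, upper] is an assumption.
class FeatureBinningSketch {

    /**
     * Returns the index i of the bin with bounds[i] < value <= bounds[i + 1].
     * The bounds are expected to start with -Infinity and end with +Infinity.
     */
    static int binIndex(double value, double[] bounds) {
        for (int i = 0; i < bounds.length - 1; i++) {
            if (value <= bounds[i + 1]) {
                return i;
            }
        }
        // Only reachable for NaN or for bounds that do not end with +Infinity.
        throw new IllegalArgumentException("value is not covered by the bounds");
    }
}
```

With the hypothetical boundaries {-Infinity, 18.0, 30.0, +Infinity}, an age of 17 falls into bin 0, 25 into bin 1, and 64 into bin 2. A linear scan keeps the sketch short; a binary search would be the natural choice for many bins.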
Add new features - feature binning
• More details?
  • https://github.com/myui/hivemall/issues/319
  • https://github.com/myui/hivemall/pull/322
Add new features - feature selection
• What’s feature selection?
  • A generic term for methods that select meaningful features
  • Used as a preprocessing step for machine learning
• Why is it used?
  • Improve results
  • Shorten learning time
  • Make the set of features human-understandable
Add new features - feature selection
• Kinds of feature selection
  • Use variance
  • Use Chi-square value (implemented)
  • Use SNR (Signal Noise Ratio) (implemented)
  • mRMR (minimum Redundancy Maximum Relevance)
  • etc.
Add new features - feature selection
• Feature selection using Chi-square value
  • Calculating the Chi-square value needs both observed values and expected values (= hypothesis)
  • Observed: aggregated features of each class
  • Expected: expected values calculated under the assumption that features and classes are independent
  • Calculate the Chi-square value
  • Select the top-k features
Chi-square
Add new features - feature selection
• How does it work on Hivemall?
  • [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
Add new features - feature selection
• [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • Utility for matrix calculation; a generic UDAF
[Figure: the matrices X^T and Y]
You might think matrix multiplication requires repeated passes over the data… but it can be calculated incrementally!
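The incremental idea can be sketched as follows, assuming that transpose_and_dot(X, Y) yields the product of X transposed and Y. That product equals the sum, over all rows, of the outer product of a row of X with the matching row of Y, so an aggregate function only needs one small accumulator per partition. The class and method names are hypothetical and only mirror the usual iterate/merge shape of an aggregation.

```java
// Illustrative sketch only; not Hivemall's implementation.
class TransposeAndDotSketch {

    // Accumulator of shape (#columns of X, #columns of Y).
    private final double[][] acc;

    TransposeAndDotSketch(int xCols, int yCols) {
        acc = new double[xCols][yCols];
    }

    /** Adds the outer product of one row pair (x, y) to the accumulator. */
    void iterate(double[] x, double[] y) {
        for (int i = 0; i < acc.length; i++) {
            for (int j = 0; j < acc[i].length; j++) {
                acc[i][j] += x[i] * y[j];
            }
        }
    }

    /** Adds a partial result computed elsewhere, e.g. on another partition. */
    void merge(TransposeAndDotSketch other) {
        for (int i = 0; i < acc.length; i++) {
            for (int j = 0; j < acc[i].length; j++) {
                acc[i][j] += other.acc[i][j];
            }
        }
    }

    double[][] result() {
        return acc;
    }
}
```

Feeding the rows (1, 2) and (3, 4) of X together with the rows (1, 0) and (0, 1) of Y gives [[1, 3], [2, 4]], which is X transposed, as expected when Y is the identity matrix. Each row is visited once and no full matrix is ever held in memory.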
Add new features - feature selection
• [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • Calculates the Chi-square value and the p-value
  • χ² = Σ (observed − expected)² / expected
  • Calculates the p-value from the value above and the Chi-square distribution
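A minimal sketch of the Chi-square step, using the standard Pearson statistic: for each feature, sum (observed − expected)² / expected over the classes. The expected matrix is built the same way as in the chi2 test query shown earlier, as the outer product of class_prob and feature_count. Class and method names are hypothetical, every expected value is assumed to be positive, and the p-value is left out because it needs a Chi-square distribution function that the Java standard library does not provide.

```java
// Illustrative sketch only; not Hivemall's implementation.
class Chi2Sketch {

    /** expected[c][f] = classProb[c] * featureCount[f]: classes and features are independent. */
    static double[][] expected(double[] classProb, double[] featureCount) {
        double[][] e = new double[classProb.length][featureCount.length];
        for (int c = 0; c < classProb.length; c++) {
            for (int f = 0; f < featureCount.length; f++) {
                e[c][f] = classProb[c] * featureCount[f];
            }
        }
        return e;
    }

    /** chi2[f] = sum over classes c of (observed[c][f] - expected[c][f])^2 / expected[c][f]. */
    static double[] chi2(double[][] observed, double[][] expected) {
        int features = observed[0].length;
        double[] result = new double[features];
        for (int c = 0; c < observed.length; c++) {
            for (int f = 0; f < features; f++) {
                double diff = observed[c][f] - expected[c][f];
                result[f] += diff * diff / expected[c][f];
            }
        }
        return result;
    }
}
```

For two classes with probability 0.5 each and feature counts (40, 40), the expected matrix is [[20, 20], [20, 20]]. An observed matrix of [[10, 20], [30, 20]] then gives Chi-square values of 10.0 for the first feature and 0.0 for the second, so the first feature depends more strongly on the class.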
Add new features - feature selection
• [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
  • Selects the top-k elements from X according to importance_list
  • Generic UDF
NOTE: The current implementation expects importance_list and k to be the same for every row
[Figure: example with k = 2]
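The selection step can be sketched as below. The names are hypothetical, the importance values are taken as double (for example Chi-square values) rather than int, and keeping the selected elements in their original column order is an assumption of this sketch, not a documented property of select_k_best.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

// Illustrative sketch only; not Hivemall's implementation.
class SelectKBestSketch {

    /** Returns the k elements of x with the highest importance, in original column order. */
    static double[] selectKBest(double[] x, double[] importance, int k) {
        if (x.length != importance.length || k < 0 || k > x.length) {
            throw new IllegalArgumentException("array lengths must match and 0 <= k <= length");
        }
        int[] topK = IntStream.range(0, x.length)
                .boxed()
                // Highest importance first; the stable sort keeps the lower index on ties.
                .sorted(Comparator.comparingDouble((Integer i) -> importance[i]).reversed())
                .limit(k)
                .mapToInt(Integer::intValue)
                .sorted() // restore the original column order
                .toArray();
        return Arrays.stream(topK).mapToDouble(i -> x[i]).toArray();
    }
}
```

For x = (5.1, 3.5, 1.4, 0.2), importance = (10, 3, 116, 67) and k = 2, the third and fourth columns score highest, so the result is (1.4, 0.2).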
Add new features - feature selection
• Feature selection using SNR
  • Aggregate the mean and variance of each feature for each class
  • On termination, calculate the Signal Noise Ratio between every combination of classes, for each feature
  • Sum up the Signal Noise Ratios for each feature
Signal Noise Ratio
Add new features - feature selection
• How does it work on Hivemall?
• [UDAF] snr(X::array<number>, label::array<int>)::array<double>
Add new features - feature selection
• [UDAF] snr(X::array<number>, label::array<int>)::array<double>
  • Aggregates the variance with Chan’s method
  • Calculates the Signal Noise Ratio and sums it up for each feature
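A minimal sketch of both steps for a single feature. The merge follows the pairwise update of Chan et al. for combining partial means and variances. The SNR definition used here, |mean_i − mean_j| / (sd_i + sd_j) summed over all pairs of classes, and the use of the sample standard deviation are assumptions for illustration, as are all names.

```java
// Illustrative sketch only; not Hivemall's implementation.
class SnrSketch {

    /** Running count, mean and sum of squared deviations (m2) of one feature in one class. */
    static final class Stats {
        long n;
        double mean;
        double m2;

        /** Adds one value (Welford's online update). */
        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        /** Merges another partial aggregate (pairwise update of Chan et al.). */
        void merge(Stats other) {
            if (other.n == 0) {
                return;
            }
            if (n == 0) {
                n = other.n;
                mean = other.mean;
                m2 = other.m2;
                return;
            }
            long total = n + other.n;
            double delta = other.mean - mean;
            mean += delta * other.n / total;
            m2 += other.m2 + delta * delta * n * other.n / total;
            n = total;
        }

        /** Sample standard deviation; 0 when fewer than two values were seen. */
        double sd() {
            return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
        }
    }

    /** Sums |mean_i - mean_j| / (sd_i + sd_j) over all pairs of classes for one feature. */
    static double snr(Stats[] perClass) {
        double sum = 0.0;
        for (int i = 0; i < perClass.length; i++) {
            for (int j = i + 1; j < perClass.length; j++) {
                double denom = perClass[i].sd() + perClass[j].sd();
                if (denom > 0) { // skip pairs without any spread to avoid dividing by zero
                    sum += Math.abs(perClass[i].mean - perClass[j].mean) / denom;
                }
            }
        }
        return sum;
    }
}
```

If one class contains the values 1, 2, 3 and another 5, 6, 7 for a feature, the means are 2 and 6 and both standard deviations are 1, so the SNR of that feature is 4 / 2 = 2. Because merge only needs the count, mean and m2 of each part, partial aggregates from different workers can be combined without revisiting the rows.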
Add new features - feature selection
• More details?
  • https://github.com/myui/hivemall/issues/338
  • https://github.com/myui/hivemall/pull/352
Add new features - spark integration
• Integrated feature selection into the Spark module
• Improved the build flow to resolve binary incompatibility between spark-1.6 and spark-2.0