Machine Learning with Spark
Scikit-Learn Cheat Sheet
Load basic dependencies
inputCsvDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/csv
ouputParquetDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/
import java.util.Base64
import java.nio.charset.StandardCharsets
encB64: (str: String)String
decB64: (str: String)String
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.hadoop.fs.FileStatus
listS3: (s3Path: String)Array[org.apache.hadoop.fs.FileStatus]
ls3: (s3FolderPath: String)Unit
rm3: (s3Path: String)Boolean
s3a://bigpicture-guild/nyctaxi/sample_1_month/csv/trip_data_and_fare.csv.gz [967.71 MiB]
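The S3 helpers above come from the connect_s3 notebook, whose body is not shown here. A minimal sketch of what they might look like, assuming the standard Hadoop FileSystem API and that sc is the active SparkContext:

import java.net.URI
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// list the objects under an S3 path
def listS3(s3Path: String): Array[FileStatus] =
  FileSystem.get(new URI(s3Path), sc.hadoopConfiguration).listStatus(new Path(s3Path))

// print each object with its size in MiB, as in the listing above
def ls3(s3FolderPath: String): Unit =
  listS3(s3FolderPath).foreach(s => println(f"${s.getPath} [${s.getLen / 1048576.0}%.2f MiB]"))

// delete a path recursively
def rm3(s3Path: String): Boolean =
  FileSystem.get(new URI(s3Path), sc.hadoopConfiguration).delete(new Path(s3Path), true)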
Read taxi data as a DataFrame from Parquet
%run "/meetup/kickoff/connect_s3"
// read Parquet files
val parquetTable = sqlContext.read.parquet(ouputParquetDir)
val toDouble = udf[Double, Float](_.toDouble)
val taxiData = parquetTable.withColumn("tip_amount_d", toDouble(parquetTable.col("tip_amount")))
taxiData.registerTempTable("ml_nyc_taxi")

%sql SELECT * FROM ml_nyc_taxi

Showing the first 1000 rows.
medallion                        hack_license                     vendor_id rate_code store_and_fwd_flag pickup_datetime
2D4B95E2FA7B2E85118EC5CA4570FA58 CD2F522EEE1FF5F5A8D8B679E23576B3 CMT       1         N                  2013-01-07T15:33:28.000+0000
0C5296F3C8B16E702F8F2E06F5106552 D2363240A9295EF570FC6069BC4F4C92 CMT       1         N                  2013-01-07T22:25:46.000+0000
312E0CB058D7FC1A6494EDB66D360CD2 7B5156F38990963332B33298C8BAE25E CMT       1         N                  2013-01-05T11:54:49.000+0000
DD98E2C3AF5C47B4449F720ECC5778D4 79807332B275653A2473554C7328500A CMT       1         N                  2013-01-02T06:58:08.000+0000
0B57B9633A2FECD3D3B1944AFC7471CF CCD4367B417ED6634D986F573A552A62 CMT       1         N                  2013-01-07T14:46:55.000+0000
Scatter plot for tip amount and fare amount
%sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50
[Scatter plot: tip_amount (y) vs. fare_amount (x); sample based on the first 1000 rows.]
Transformation of data with standard dataframe operations

import org.apache.spark.mllib.linalg.{Vector, Vectors}
val toVec = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) }
val trainingData = taxiData
  .filter(toDouble(taxiData.col("tip_amount")) > 0.0)
  .withColumn("label", toDouble(taxiData.col("tip_amount")))
  .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))

The pipeline concept of Spark ML
A Pipeline chains Transformers and Estimators.
A Transformer can also be an estimator from a previously trained model.
Important for easily:
- training with different model parameters, e.g. for cross-validation
- training with different test and training data (train-validation split)
- repeating the transformation steps before estimation
Watch out for KeystoneML (http://keystone-ml.org), an ML pipeline framework with a richer set of operators on Spark.
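To make the cross-validation point concrete, here is a minimal sketch (not part of the original notebook) of wrapping a pipeline in Spark ML's CrossValidator. It assumes the pipeline, linearRegressionEstimator, and trainingTaxiData variables defined further down in this notebook; the grid values are hypothetical.

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// try several regularization settings; each candidate re-runs the whole pipeline,
// so the transformation steps are repeated automatically (hypothetical values)
val paramGrid = new ParamGridBuilder()
  .addGrid(linearRegressionEstimator.regParam, Array(0.1, 0.3))
  .addGrid(linearRegressionEstimator.elasticNetParam, Array(0.0, 0.8))
  .build()

val crossValidator = new CrossValidator()
  .setEstimator(pipeline)  // the whole pipeline acts as the estimator
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = crossValidator.fit(trainingTaxiData)  // picks the best parameter combination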
SQL transformer:
Select and filter the relevant data
import org.apache.spark.ml.feature.SQLTransformer
val taxiDataSelector = new SQLTransformer().setStatement(
  "SELECT tip_amount_d as label, passenger_count, fare_amount FROM ml_nyc_taxi WHERE tip_amount_d > 0")
val selectedTaxiData = taxiDataSelector.transform(taxiData)
VectorAssembler:
Transform the data into labeled data as needed for ML estimators
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors

val trainingDataAssembler = new VectorAssembler()
  .setInputCols(Array("passenger_count", "fare_amount"))
  .setOutputCol("features")

val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
assembledTaxiData.select("label", "features").show()
+------------------+----------+
|             label|  features|
+------------------+----------+
|1.2000000476837158| [1.0,5.5]|
| 4.199999809265137|[1.0,20.5]|
| 5.900000095367432|[1.0,29.0]|
| 5.380000114440918|[1.0,21.0]|
| 1.399999976158142| [6.0,6.5]|
|               1.0| [1.0,5.0]|
|              1.25| [1.0,4.5]|
|               3.0|[6.0,26.0]|
|               1.0|[1.0,14.5]|
|1.2999999523162842| [1.0,6.5]|
| 1.899999976158142| [5.0,9.5]|
|1.6200000047683716| [1.0,6.5]|
| 1.899999976158142| [1.0,9.0]|
|               2.0|[1.0,22.0]|
|               6.0|[1.0,25.0]|
|3.5999999046325684|[1.0,17.5]|
|1.2000000476837158| [1.0,6.0]|
|               7.5|[1.0,24.5]|
Initialize the estimator
import org.apache.spark.ml.regression.LinearRegression

// Create a LinearRegression instance. This instance is an Estimator.
val linearRegressionEstimator = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Print out the parameters, documentation, and any default values.
println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")
LinearRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )
import org.apache.spark.ml.regression.LinearRegression
linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd
Split the data into training and test set
val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)
Set up the transformation and estimation PIPELINE
import org.apache.spark.ml.{Pipeline, PipelineModel}
val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))
Use the pipeline to train the model
// Learn a LinearRegression model using the whole pipeline.
// val lrModel = linearRegressionEstimator.fit(trainingData)
val lrModel = pipeline.fit(trainingTaxiData)
Predict with the trained model on the test data
display(lrModel.transform(testTaxiData)
  .select("label", "prediction"))
[Scatter plot: label (y) vs. prediction (x); sample based on the first 1000 rows.]
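Beyond eyeballing the scatter plot, the fit can be quantified. A minimal sketch (not part of the original notebook) using Spark ML's RegressionEvaluator on the test predictions:

import org.apache.spark.ml.evaluation.RegressionEvaluator

val predictions = lrModel.transform(testTaxiData)
val rmse = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")  // root mean squared error; "mae" and "r2" also work
  .evaluate(predictions)
println(s"RMSE on test data: $rmse")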
How to get started with Spark ML

Set up your laptop (16+ GB RAM recommended)
mac$ brew install apache-spark
or get a Databricks Community Edition notebook (wait list)
Get data
- Join an ML competition and get BIG data from Kaggle
- Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016
Visualize the data (Databricks or Zeppelin notebook: https://zeppelin.incubator.apache.org/)
Throw some algorithms at it!
- have a coffee
- and maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
- read the Kaggle competition forums and blogs
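As a first experiment once Spark is installed, something like the following runs locally in spark-shell. This is a sketch using the sample data that ships with the Spark distribution; the path may differ depending on your install.

import org.apache.spark.ml.regression.LinearRegression

// sample_linear_regression_data.txt ships with the Spark distribution under data/mllib/
val training = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3)
val model = lr.fit(training)
println(s"RMSE on training data: ${model.summary.rootMeanSquaredError}")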
Graphs from the Panama Papers