Machine Learning with Spark
Scikit-Learn Cheat Sheet
Load basic dependencies
inputCsvDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/csv
ouputParquetDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/
import java.util.Base64
import java.nio.charset.StandardCharsets
encB64: (str: String)String
decB64: (str: String)String
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.hadoop.fs.FileStatus
listS3: (s3Path: String)Array[org.apache.hadoop.fs.FileStatus]
ls3: (s3FolderPath: String)Unit
rm3: (s3Path: String)Boolean
s3a://bigpicture-guild/nyctaxi/sample_1_month/csv/trip_data_and_fare.csv.gz [967.71 MiB]
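The S3 helpers above come from the connect_s3 notebook, whose body is not shown here. A minimal sketch of what they might look like, assuming the standard Hadoop FileSystem API and that sc is the active SparkContext:

import java.net.URI
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// list the objects under an S3 path
def listS3(s3Path: String): Array[FileStatus] =
  FileSystem.get(new URI(s3Path), sc.hadoopConfiguration).listStatus(new Path(s3Path))

// print each object with its size in MiB, as in the listing above
def ls3(s3FolderPath: String): Unit =
  listS3(s3FolderPath).foreach(s => println(f"${s.getPath} [${s.getLen / 1048576.0}%.2f MiB]"))

// delete a path recursively
def rm3(s3Path: String): Boolean =
  FileSystem.get(new URI(s3Path), sc.hadoopConfiguration).delete(new Path(s3Path), true)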
Read taxi data as a DataFrame from Parquet
%run "/meetup/kickoff/connect_s3"
// read Parquet files
val parquetTable = sqlContext.read.parquet(ouputParquetDir)
val toDouble = udf[Double, Float](_.toDouble)
val taxiData = parquetTable.withColumn("tip_amount_d", toDouble(parquetTable.col("tip_amount")))
taxiData.registerTempTable("ml_nyc_taxi")

%sql SELECT * FROM ml_nyc_taxi

Showing the first 1000 rows.
medallion                        hack_license                     vendor_id rate_code store_and_fwd_flag pickup_datetime
2D4B95E2FA7B2E85118EC5CA4570FA58 CD2F522EEE1FF5F5A8D8B679E23576B3 CMT       1         N                  2013-01-07T15:33:28.000+0000
0C5296F3C8B16E702F8F2E06F5106552 D2363240A9295EF570FC6069BC4F4C92 CMT       1         N                  2013-01-07T22:25:46.000+0000
312E0CB058D7FC1A6494EDB66D360CD2 7B5156F38990963332B33298C8BAE25E CMT       1         N                  2013-01-05T11:54:49.000+0000
DD98E2C3AF5C47B4449F720ECC5778D4 79807332B275653A2473554C7328500A CMT       1         N                  2013-01-02T06:58:08.000+0000
0B57B9633A2FECD3D3B1944AFC7471CF CCD4367B417ED6634D986F573A552A62 CMT       1         N                  2013-01-07T14:46:55.000+0000
Scatter plot for tip amount and fare amount
%sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50
[Scatter plot: tip_amount (y) vs. fare_amount (x); sample based on the first 1000 rows.]
Transformation of data with standard dataframe operations

import org.apache.spark.mllib.linalg.{Vector, Vectors}
val toVec = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) }
val trainingData = taxiData
  .filter(toDouble(taxiData.col("tip_amount")) > 0.0)
  .withColumn("label", toDouble(taxiData.col("tip_amount")))
  .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))

The pipeline concept of Spark ML
A Pipeline chains Transformers and Estimators.
A Transformer can also be an estimator from a previously trained model.
Important for easily:
- training with different model parameters, e.g. for cross-validation
- training with different test and training data (train-validation split)
- repeating the transformation steps before estimation
Watch out for KeystoneML (http://keystone-ml.org), an ML pipeline framework with a richer set of operators on Spark.
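To make the cross-validation point concrete, here is a minimal sketch (not part of the original notebook) of wrapping a pipeline in Spark ML's CrossValidator. It assumes the pipeline, linearRegressionEstimator, and trainingTaxiData variables defined further down in this notebook; the grid values are hypothetical.

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// try several regularization settings; each candidate re-runs the whole pipeline,
// so the transformation steps are repeated automatically (hypothetical values)
val paramGrid = new ParamGridBuilder()
  .addGrid(linearRegressionEstimator.regParam, Array(0.1, 0.3))
  .addGrid(linearRegressionEstimator.elasticNetParam, Array(0.0, 0.8))
  .build()

val crossValidator = new CrossValidator()
  .setEstimator(pipeline)  // the whole pipeline acts as the estimator
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = crossValidator.fit(trainingTaxiData)  // picks the best parameter combination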
SQL transformer:
Select and filter the relevant data
import org.apache.spark.ml.feature.SQLTransformer
val taxiDataSelector = new SQLTransformer().setStatement(
  "SELECT tip_amount_d as label, passenger_count, fare_amount FROM ml_nyc_taxi WHERE tip_amount_d > 0")
val selectedTaxiData = taxiDataSelector.transform(taxiData)
VectorAssembler:
Transform the data into labeled data as needed for ML estimators
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors

val trainingDataAssembler = new VectorAssembler()
  .setInputCols(Array("passenger_count", "fare_amount"))
  .setOutputCol("features")

val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
assembledTaxiData.select("label", "features").show()
+------------------+----------+
|             label|  features|
+------------------+----------+
|1.2000000476837158| [1.0,5.5]|
| 4.199999809265137|[1.0,20.5]|
| 5.900000095367432|[1.0,29.0]|
| 5.380000114440918|[1.0,21.0]|
| 1.399999976158142| [6.0,6.5]|
|               1.0| [1.0,5.0]|
|              1.25| [1.0,4.5]|
|               3.0|[6.0,26.0]|
|               1.0|[1.0,14.5]|
|1.2999999523162842| [1.0,6.5]|
| 1.899999976158142| [5.0,9.5]|
|1.6200000047683716| [1.0,6.5]|
| 1.899999976158142| [1.0,9.0]|
|               2.0|[1.0,22.0]|
|               6.0|[1.0,25.0]|
|3.5999999046325684|[1.0,17.5]|
|1.2000000476837158| [1.0,6.0]|
|               7.5|[1.0,24.5]|
Initialize the estimator
import org.apache.spark.ml.regression.LinearRegression

// Create a LinearRegression instance. This instance is an Estimator.
val linearRegressionEstimator = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Print out the parameters, documentation, and any default values.
println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")
LinearRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )
import org.apache.spark.ml.regression.LinearRegression
linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd
Split the data into training and test set
val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)
Set up the transformation and estimation PIPELINE
import org.apache.spark.ml.{Pipeline, PipelineModel}
val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))
Use the pipeline to train the model
// Learn a LinearRegression model using the whole pipeline.
// val lrModel = linearRegressionEstimator.fit(trainingData)
val lrModel = pipeline.fit(trainingTaxiData)
Predict with the trained model on the test data
display(lrModel.transform(testTaxiData)
  .select("label", "prediction"))
[Scatter plot: label (y) vs. prediction (x); sample based on the first 1000 rows.]
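Beyond eyeballing the scatter plot, the fit can be quantified. A minimal sketch (not part of the original notebook) using Spark ML's RegressionEvaluator on the test predictions:

import org.apache.spark.ml.evaluation.RegressionEvaluator

val predictions = lrModel.transform(testTaxiData)
val rmse = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")  // root mean squared error; "mae" and "r2" also work
  .evaluate(predictions)
println(s"RMSE on test data: $rmse")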
How to get started with Spark ML

Set up your laptop (16+ GB RAM recommended)
mac$ brew install apache-spark
or get a Databricks Community Edition notebook (wait list)
Get data
- Join an ML competition and get BIG data from Kaggle
- Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016
Visualize the data (Databricks or Zeppelin notebook: https://zeppelin.incubator.apache.org/)
Throw some algorithms at it!
- have a coffee
- and maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
- read the Kaggle competition forums and blogs
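As a first experiment once Spark is installed, something like the following runs locally in spark-shell. This is a sketch using the sample data that ships with the Spark distribution; the path may differ depending on your install.

import org.apache.spark.ml.regression.LinearRegression

// sample_linear_regression_data.txt ships with the Spark distribution under data/mllib/
val training = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3)
val model = lr.fit(training)
println(s"RMSE on training data: ${model.summary.rootMeanSquaredError}")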
Graphs from the Panama Papers