Azure Machine Learning
– Other Topics
Microsoft Taiwan
Technical Evangelist
吳宏彬
8/25/2016
What Is the R Language?
Open Source “lingua franca”: Analytics, Computing, Modeling
Global Community: Millions of users
Ecosystem: 7000+ Algorithms, Test Data & Evaluations
Scalability: Can be Scaled to Big Data, Big Analytics
Polls of data miners and analytics professionals on their software choices since 2007
Source: http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
R is developed and contributed to by the open source community
CRAN – the Comprehensive R Archive Network, the package repository of R
7500+ packages, covering all aspects of statistical analysis, machine learning, natural language processing …
Still growing exponentially
Free!
Source: http://r4stats.com/2014/04/07/r-continues-its-rapid-growth/
1. Seasonal ARIMA
2. Non-Seasonal ARIMA
3. Seasonal ETS
4. Non-Seasonal ETS
5. Average of Seasonal ETS and Seasonal ARIMA
Mean Error (ME) - Average forecasting error (an error is the difference between the
predicted value and the actual value) on the test dataset
Root Mean Squared Error (RMSE) - The square root of the average of squared errors of
predictions made on the test dataset.
Mean Absolute Error (MAE) - The average of absolute errors
Mean Percentage Error (MPE) - The average of percentage errors
Mean Absolute Percentage Error (MAPE) - The average of absolute percentage errors
Mean Absolute Scaled Error (MASE)
Symmetric Mean Absolute Percentage Error (sMAPE)
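The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used in the slides. It assumes an error is the actual value minus the predicted value (flipping the sign only changes ME and MPE), that no actual value is zero, that MASE is scaled by the in-sample mean absolute error of a naive forecast on the training series, and that sMAPE uses the 200 * |a - p| / (|a| + |p|) form. The function name `forecast_errors` and the `season` parameter are hypothetical.

```python
import math

def forecast_errors(actual, predicted, train, season=1):
    """Return forecast accuracy metrics for a test set as a dict.

    Assumptions (not from the slides): error = actual - predicted,
    all actual values are nonzero, and `train` has more than `season` points.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    pct_errors = [100.0 * e / a for e, a in zip(errors, actual)]
    mae = sum(abs(e) for e in errors) / n
    # MASE scale: mean absolute error of the (seasonal) naive forecast on the training data.
    naive = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive) / len(naive)
    smape_terms = [200.0 * abs(a - p) / (abs(a) + abs(p))
                   for a, p in zip(actual, predicted)]
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": mae,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
        "MASE": mae / scale,
        "sMAPE": sum(smape_terms) / n,
    }

metrics = forecast_errors(actual=[100.0, 200.0], predicted=[110.0, 180.0],
                          train=[80.0, 90.0, 100.0])
```

A MASE above 1 means the forecast did worse on the test set than the naive forecast did on the training data; in this toy example it is 1.5.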
Feature | CRAN R | Microsoft R Open | Microsoft R Server
Data size | In-memory | In-memory | In-memory or disk-based
Speed of analysis | Single-threaded | Multi-threaded | Multi-threaded, parallel processing across 1:N servers
Support | Community | Community | Community + commercial
Analytic breadth & depth | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
License | Open source | Open source | Commercial license; supported release with indemnity
Supports standard Python library types such as pandas data frames and NumPy arrays.
Python code executes on Anaconda 2.1, which comes with close to 200 of the most common Python packages (such as NumPy, SciPy, and scikit-learn).
Outputs can include images generated with Matplotlib.
KNN
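The slide names k-nearest neighbors (KNN) without showing the demo code, so what follows is only a sketch of the technique itself: classify a query point by majority vote among its k closest training points. In the Anaconda environment described above one would normally reach for scikit-learn; this version sticks to the standard library to stay self-contained. The function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair every training label with its Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count labels among the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]
near_a = knn_predict(points, labels, (2, 2))
near_b = knn_predict(points, labels, (8.5, 8.5))
```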
What is Spark?
Data is growing faster than processing
speeds
Only solution is to parallelize data
processing on large clusters
Example: HDInsight
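The idea of parallelizing data processing can be illustrated without a cluster. The sketch below is not Spark code: it splits the input into partitions, processes each partition independently, and merges the partial results, with threads on one machine standing in for cluster nodes. All function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(records, num_partitions):
    """Split a list into at most `num_partitions` contiguous chunks."""
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def count_words(partition):
    """Map step: count the words in one partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines, num_partitions=4):
    """Count words by processing partitions concurrently, then merging."""
    partitions = split_into_partitions(lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:  # reduce step: merge the partial results
        total.update(partial)
    return total

word_counts = parallel_word_count(
    ["spark is fast", "Spark is expressive", "hadoop storage"]
)
```

Because each partition is processed without reference to the others, the same structure carries over when the partitions live on different machines.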
Fast, expressive cluster computing system compatible with Apache
Hadoop
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
• In-memory computing primitives
• General computation graphs
Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Spark was initially started by Matei Zaharia at UC Berkeley AMPLab
in 2009, was open sourced in 2010 and donated to Apache in 2013
Up to 100× faster
Often 2-10× less code
Spark for Azure HDInsight
[Diagram labels: clients (decision makers), Spark cluster (five Spark nodes), storage layer]
Programming Spark
• Using the Spark shell to run interactive queries
• Using the Spark shell to run Spark SQL queries
• Using a standalone Scala program
• Spark Notebooks
  • Zeppelin – for Scala users
  • Jupyter – for Python users
2015 System
Human Error Rate 4%
Speech Recognition could reach human parity in the next 3 years
Microsoft used deep learning to win first place in every ImageNet 2015 competition category
ImageNet Classification top-5 error (%)
Competition | Entry | Top-5 error (%)
ILSVRC 2010 | NEC America | 28.2
ILSVRC 2011 | Xerox | 25.8
ILSVRC 2012 | AlexNet | 16.4
ILSVRC 2013 | Clarifai | 11.7
ILSVRC 2014 | VGG | 7.3
ILSVRC 2014 | GoogLeNet | 6.7
ILSVRC 2015 | MSRA ResNet | 3.5
All five of Microsoft's entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation
CNTK At the Heart: Computational Networks
• A generalization of machine learning models that can be described as a series of computational steps
  • E.g., DNN, CNN, RNN, LSTM, DSSM, Seq2Seq, log-linear model
• Representation:
  • A list of computational nodes denoted as n = {node name : operation name}
  • The parent-children relationship describing the operands: {n : c_1, ..., c_{K_n}}
  • K_n is the number of children of node n; for leaf nodes K_n = 0
  • Order of the children matters: e.g., XY is different from YX
  • Given the inputs (operands), the value of the node can be computed
• Can flexibly describe deep learning models
  • Adopted by many other popular tools as well
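The representation described above can be sketched directly as two dictionaries plus a recursive evaluator. This illustrates the idea only and is not CNTK's implementation; the operation names ("input", "times", "plus") and the helper functions are hypothetical. The XY and YX nodes share the same operands in a different order and evaluate to different matrices.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matadd(a, b):
    """Add two equally sized matrices element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# n = {node name : operation name}; "input" marks leaf nodes (hypothetical name).
nodes = {"X": "input", "Y": "input", "XY": "times", "YX": "times", "Z": "plus"}

# {n : c_1, ..., c_Kn}: ordered operands of each node; leaves have K_n = 0.
children = {"X": [], "Y": [], "XY": ["X", "Y"], "YX": ["Y", "X"], "Z": ["XY", "YX"]}

operations = {"times": matmul, "plus": matadd}

def evaluate(node, inputs, cache=None):
    """Compute a node's value from its operands, evaluating children first."""
    cache = {} if cache is None else cache
    if node not in cache:
        if not children[node]:  # leaf: the value is supplied as an input
            cache[node] = inputs[node]
        else:
            operands = [evaluate(child, inputs, cache) for child in children[node]]
            cache[node] = operations[nodes[node]](*operands)
    return cache[node]

inputs = {"X": [[1, 2], [3, 4]], "Y": [[0, 1], [1, 0]]}
xy = evaluate("XY", inputs)  # [[2, 1], [4, 3]]
yx = evaluate("YX", inputs)  # [[3, 4], [1, 2]]
z = evaluate("Z", inputs)    # [[5, 5], [5, 5]]
```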
“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”
Theano only supports 1 GPU
Achieved with the 1-bit gradient quantization algorithm
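The slide names 1-bit gradient quantization without describing it. The sketch below shows the commonly described scheme, under stated assumptions: each gradient value is reduced to a single bit (its sign after adding the error left over from the previous step), each bit is reconstructed as the mean of its group, and the quantization error is carried forward to the next minibatch. It is not CNTK's code, and details such as how values are grouped are assumptions. The function name `one_bit_quantize` is hypothetical.

```python
def one_bit_quantize(gradient, residual):
    """Quantize a gradient to one bit per value with error feedback.

    Returns (bits, reconstructed, new_residual). Only `bits` and the two
    reconstruction values would need to be exchanged between workers.
    """
    # Add the error carried over from the previous minibatch.
    corrected = [g + r for g, r in zip(gradient, residual)]
    positives = [v for v in corrected if v >= 0]
    negatives = [v for v in corrected if v < 0]
    pos_value = sum(positives) / len(positives) if positives else 0.0
    neg_value = sum(negatives) / len(negatives) if negatives else 0.0
    bits = [v >= 0 for v in corrected]
    reconstructed = [pos_value if bit else neg_value for bit in bits]
    # What quantization lost is remembered and added to the next gradient.
    new_residual = [v - q for v, q in zip(corrected, reconstructed)]
    return bits, reconstructed, new_residual

gradient = [0.5, -0.25, 1.5, -0.75]
bits, reconstructed, residual = one_bit_quantize(gradient, [0.0] * len(gradient))
```

Carrying the residual forward means no gradient information is permanently discarded; it is only delayed, which is what makes such aggressive compression workable across multiple GPUs.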
[Chart: speed comparison (samples/second), higher = better; note: December 2015. Tools compared: CNTK, Theano, TensorFlow, Torch 7, Caffe. Configurations: 1 GPU, 1 x 4 GPUs, 2 x 4 GPUs (8 GPUs). Vertical axis: 0 to 80,000.]
* TensorFlow added distributed compute support in April 2016
Microsoft researcher Slawek Smyl won CIF 2016 using an LSTM neural network
Powered by CNTK
CIF Competition 2016 – Final Results
• Contestant 1 – Slawek Smyl (LSTM-based NN on deseasonalized data)
• Contestant 2 – Slawek Smyl (weighted average of my 3 methods)
• Contestant 3 – prof. Sven Crone (Multilayer Perceptron with a thorough feature search)
• Contestant 4 - Mikhail Artyukhov (previous competition winner, ensemble models)
• Contestant 5 - Joerg Wichard, Bayer Healthcare AG (Adaptive Forecasting Strategy with Hybrid Ensemble Models)
• Contestant 6 – Slawek Smyl (LSTM-based NN)
CNTK Demo
CNTK Architecture
[Architecture diagram; recoverable labels:]
• CN Builder, Lambda, CN Description (arrows: Use, Build)
• IDataReader: task-specific reader (arrow: Load, features & labels)
• ILearner: SGD, AdaGrad, etc. (arrow: Get data)
• IExecutionEngine: CPU/GPU (arrows: Evaluate, Compute Gradient)
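The division of labor in the architecture diagram (a reader that loads features and labels, an execution engine that evaluates the network and computes gradients, a learner that drives training with an update rule such as SGD) can be sketched as three small classes. The class names echo the diagram's IDataReader, IExecutionEngine, and ILearner labels, but every method name and the one-parameter model y = w * x are hypothetical; this is not CNTK's interface.

```python
class ListDataReader:
    """Stands in for IDataReader: serves (features, labels) minibatches."""
    def __init__(self, features, labels, batch_size):
        self.features, self.labels, self.batch_size = features, labels, batch_size

    def minibatches(self):
        for start in range(0, len(self.features), self.batch_size):
            end = start + self.batch_size
            yield self.features[start:end], self.labels[start:end]

class LinearEngine:
    """Stands in for IExecutionEngine: evaluates y = w * x and its gradient."""
    def evaluate(self, w, xs):
        return [w * x for x in xs]

    def compute_gradient(self, w, xs, ys):
        # Gradient of the mean squared error with respect to w.
        preds = self.evaluate(w, xs)
        return sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)

class SgdLearner:
    """Stands in for ILearner: gets data, asks for gradients, applies SGD."""
    def __init__(self, reader, engine, learning_rate):
        self.reader, self.engine, self.learning_rate = reader, engine, learning_rate

    def train(self, w, epochs):
        for _ in range(epochs):
            for xs, ys in self.reader.minibatches():
                w -= self.learning_rate * self.engine.compute_gradient(w, xs, ys)
        return w

reader = ListDataReader([1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 9.0, 12.0], batch_size=2)
learned_w = SgdLearner(reader, LinearEngine(), learning_rate=0.05).train(0.0, epochs=50)
```

Because the learner touches data and computation only through the other two objects, swapping in a different reader or update rule leaves the rest unchanged, which is the low-coupling property the deck attributes to CNTK.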
(1) Kai Chen and Qiang Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, Shanghai, China.
CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux
CNTK is extensible thanks to its low-coupling modular design: adding new readers (with the new reader design) and new computation nodes is easy
The network definition language, macros, and model editing language (as well as Python and C++ bindings in the future) make network design and modification easy
Compared to other tools CNTK has a great balance between efficiency, performance, and flexibility
microsoft.com/cognitive
Criterion | Mahout | Spark ML | Azure ML | R Server
Shared Service | No | No | Yes | No
Deployment Model | PaaS | PaaS | PaaS | IaaS
Extensibility | High | High | Medium | High
Deployment Complexity | Medium | High | Low | Medium
Cost | High | High | Low | High
Programming Languages | Java/Scala | Scala/Java/Python | Python/R | R
Algorithms | Limited (growing) | MLlib/scikit | Many (scikit/CRAN) | Many (CRAN)
Scalability | High | High | Medium | Medium
xgboost Vowpal Wabbit
Rattle
CNTK
*Copy
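On-demand in the cloud · All kinds of data · Rapidly deployed services · Data sharing and collaboration
Open · Supports the complete data analysis workflow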
https://gallery.cortanaintelligence.com/
The only vendor offering a complete solution, from data ingestion through generating actions to data presentation
ConnectR
• High-speed & direct connectors
• Available for:
  • High-performance XDF
  • SAS, SPSS, delimited & fixed format text data files
  • Hadoop HDFS (text & XDF)
  • Teradata Database & Aster
  • EDWs and ADWs
  • ODBC
ScaleR
• Ready-to-use high-performance big data, big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
• Available on:
  • Windows Servers
  • Red Hat and SuSE Linux Servers
  • Teradata Database
  • Cloudera Hadoop
  • Hortonworks Hadoop
  • MapR Hadoop
R+CRAN
• Open source R interpreter
• R 3.2.2
• Freely-available huge range of R algorithms
• Algorithms callable by RevoR
• 100% compatible with existing R scripts, functions and packages
RevoR
• Performance-enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions
R Open / Microsoft R Server
DeployR
DevelopR
Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
• rxDataStep
• rxExec
• PEMA-R API
Custom Algorithms
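PEMA stands for parallel external memory algorithm: the data is processed one chunk at a time, each chunk yields a small intermediate result, and the intermediate results are combined into the final answer. That is how statistics can be computed on data that does not fit in memory or that sits on several nodes. The sketch below shows the pattern for mean and variance. It is plain Python, not the PEMA-R API or ScaleR, and all function names are hypothetical.

```python
import statistics

def summarize_chunk(chunk):
    """Intermediate result for one chunk: (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2

def combine(a, b):
    """Merge two intermediate results without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def chunked_mean_variance(chunks):
    """Return (mean, sample variance) over all chunks, one chunk at a time."""
    total = None
    for chunk in chunks:
        if not chunk:
            continue
        part = summarize_chunk(chunk)
        total = part if total is None else combine(total, part)
    if total is None or total[0] < 2:
        raise ValueError("need at least two observations")
    n, mean, m2 = total
    return mean, m2 / (n - 1)

chunk_mean, chunk_var = chunked_mean_variance([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]])
```

Since `combine` only needs the per-chunk summaries, the chunks can be summarized on different cores or servers and merged in any order.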
Additional Resources
• CNTK: https://github.com/Microsoft/CNTK
  • Contains all the source code and example setups
  • You may understand better how CNTK works by reading the source code
  • New features are added constantly
• How to contact:
  • CNTK team: ask a question on CNTK GitHub!
  • Alexey:
    • Email: [email protected]
    • LinkedIn: https://www.linkedin.com/in/alexeykamenev