Neural Networks and Deep Learning

NEURAL NETWORKS AND DEEP LEARNING
ASIM JALIS

GALVANIZE

INTRO

ASIM JALIS
Galvanize/Zipfian, Data Engineering
Cloudera, Microsoft, Salesforce
MS in Computer Science from University of Virginia

GALVANIZE PROGRAMS
Program                      Duration
Data Science Immersive       12 weeks
Data Engineering Immersive   12 weeks
Web Developer Immersive      6 months
Galvanize U                  1 year

TALK OVERVIEW

WHAT IS THIS TALK ABOUT?
Using Neural Networks and Deep Learning
To recognize images
By the end of the class you will be able to create your own deep learning systems

HOW MANY PEOPLE HERE HAVE USED NEURAL NETWORKS?

HOW MANY PEOPLE HERE HAVE USED MACHINE LEARNING?

HOW MANY PEOPLE HERE HAVE USED PYTHON?

DEEP LEARNING

WHAT IS MACHINE LEARNING
Self-driving cars
Voice recognition
Facial recognition

HISTORY OF DEEP LEARNING

HISTORY OF MACHINE LEARNING
Input     Features  Algorithm  Output
Machine   Human     Human      Machine
Machine   Human     Machine    Machine
Machine   Machine   Machine    Machine

FEATURE EXTRACTION
Traditionally, data scientists had to define features by hand
Deep learning systems are able to extract features themselves

DEEP LEARNING MILESTONES
Years   Theme
1980s   Backpropagation invented, allowing multi-layer Neural Networks
2000s   SVMs, Random Forests, and other classifiers overtook NNs
2010s   Deep Learning reignited interest in NNs

IMAGENET
AlexNet, submitted to the ImageNet ILSVRC challenge in 2012, is partly responsible for the renaissance.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used Deep Learning techniques.
They combined this with GPUs and some other techniques.
The result was a neural network that could classify images, such as photos of cats and dogs, into 1,000 categories.
It had a top-5 error rate of about 16%, compared to 26% for the runner-up.

Ilya Sutskever, Alex Krizhevsky, Geoffrey Hinton

INDEED.COM/SALARY

MACHINE LEARNING

MACHINE LEARNING AND DEEP LEARNING

Deep Learning fits inside Machine Learning
Deep Learning is a Machine Learning technique
The two share techniques for evaluating and optimizing models

WHAT IS MACHINE LEARNING?
Inputs: Vectors or points of high dimensions
Outputs: Either binary vectors or continuous vectors
Machine Learning finds the relationship between them
Uses statistical techniques

SUPERVISED VS UNSUPERVISED
Supervised: Data needs to be labeled
Unsupervised: Data does not need to be labeled

TECHNIQUES
Classification
Regression
Clustering
Recommendations
Anomaly detection

CLASSIFICATION EXAMPLE: EMAIL SPAM DETECTION

Start with large collection of emails, labeled spam/not-spam
Convert email text into vectors of 0s and 1s: 1 if a word occurs, 0 if it does not
These are called inputs or features
Split data set into training set (70%) and test set (30%)
Use algorithm like Random Forest to build model
Evaluate model by running it on test set and capturing success rate
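
The first steps above (turning text into 0/1 features and splitting 70/30) can be sketched in plain Python. The toy emails, the helper names, and the whitespace tokenizer are all assumptions made for illustration; a real pipeline would use a proper tokenizer and then hand the vectors to a classifier such as Random Forest.

```python
import random

def build_vocabulary(texts):
    """Collect every distinct lowercase word, sorted for a stable column order."""
    return sorted({word for text in texts for word in text.lower().split()})

def to_feature_vector(text, vocabulary):
    """Return 1 for each vocabulary word that occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled emails: 1 = spam, 0 = not spam.
emails = [
    ("win free money now", 1), ("meeting at noon", 0),
    ("free offer click now", 1), ("lunch at noon", 0),
    ("claim your free prize", 1), ("project status update", 0),
    ("cheap money offer", 1), ("see you at lunch", 0),
    ("win a prize now", 1), ("status meeting moved", 0),
]

vocabulary = build_vocabulary([text for text, _ in emails])
rows = [(to_feature_vector(text, vocabulary), label) for text, label in emails]
train_rows, test_rows = train_test_split(rows)
```

Each row is now a (features, label) pair, which is the shape most classifiers expect for training and evaluation.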

CLASSIFICATION ALGORITHMS
Neural Networks
Random Forest
Support Vector Machines (SVM)
Decision Trees
Logistic Regression
Naive Bayes

CHOOSING ALGORITHM
Evaluate different models on data
Look at the relative success rates
Use rules of thumb: some algorithms work better on some kinds of data

CLASSIFICATION EXAMPLES
Is this tumor benign or cancerous?
Is this lead profitable or not?
Who will win the presidential elections?

CLASSIFICATION: POP QUIZ
Is classification supervised or unsupervised learning?

Supervised because you have to label the data.

CLUSTERING EXAMPLE: LOCATE CELL PHONE TOWERS

Start with GPS coordinates of all cell phone users
Represent data as vectors
Locate towers in biggest clusters

CLUSTERING EXAMPLE: T-SHIRTS
What size should a t-shirt be?
Everyone’s real t-shirt size is different
Lay out all sizes and cluster
Target large clusters with XS, S, M, L, XL

CLUSTERING: POP QUIZ
Is clustering supervised or unsupervised?

Unsupervised because no labeling is required

RECOMMENDATIONS EXAMPLE: AMAZON

Model looks at user ratings of books
Viewing a book triggers implicit rating
Recommend user new books

RECOMMENDATION: POP QUIZ
Are recommendation systems supervised or unsupervised?

Unsupervised

REGRESSION
Like classification
Output is continuous instead of one from k choices

REGRESSION EXAMPLES
How many units of product will sell next month
What will student score on SAT
What is the market price of this house
How long before this engine needs repair

REGRESSION EXAMPLE: AIRCRAFT PART FAILURE

Cessna collects data from airplane sensors
Predict when part needs to be replaced
Ship part to customer’s service airport

REGRESSION: QUIZ
Is regression supervised or unsupervised?

Supervised

ANOMALY DETECTION EXAMPLE: CREDIT CARD FRAUD

Train model on good transactions
Anomalous activity indicates fraud
Can pass transaction down to human for investigation

ANOMALY DETECTION EXAMPLE: NETWORK INTRUSION

Train model on network login activity
Anomalous activity indicates threat
Can initiate alerts and lockdown procedures

ANOMALY DETECTION: QUIZ
Is anomaly detection supervised or unsupervised?

Unsupervised because we only train on normal data

FEATURE EXTRACTION
Converting data to feature vectors
Natural Language Processing
Principal Component Analysis
Auto-Encoders

FEATURE EXTRACTION: QUIZ
Is feature extraction supervised or unsupervised?

Unsupervised

MACHINE LEARNING WORKFLOW

DEEP LEARNING USED FOR
Feature Extraction
Classification
Regression

HISTORY OF MACHINE LEARNING
Input     Features  Algorithm  Output
Machine   Human     Human      Machine
Machine   Human     Machine    Machine
Machine   Machine   Machine    Machine

DEEP LEARNING FRAMEWORKS

DEEP LEARNING FRAMEWORKS
TensorFlow: NN library from Google
Theano: Low-level GPU-enabled tensor library
Torch7: NN library, uses Lua for binding, used by Facebook and Google
Caffe: NN library by the Berkeley Vision and Learning Center (BVLC)
Nervana: Fast GPU-based machines optimized for deep learning

DEEP LEARNING FRAMEWORKS
Keras, Lasagne, Blocks: NN libraries that make Theano easier to use
CUDA: Programming model for using GPUs in general-purpose programming
cuDNN: NN library by Nvidia based on CUDA, can be used with Torch7, Caffe
Chainer: NN library that uses CUDA

DEEP LEARNING PROGRAMMING LANGUAGES

All the frameworks support Python, except Torch7, which uses Lua for its binding language

TENSORFLOW
TensorFlow originally developed by Google Brain Team
Allows using GPUs for deep learning algorithms
Single processor version released in 2015
Multiple processor version released in March 2016

KERAS
Supports Theano and TensorFlow as back-ends
Provides deep learning API on top of TensorFlow
TensorFlow provides low-level matrix operations

TENSORFLOW: GEOFFREY HINTON, JEFF DEAN

KERAS: FRANCOIS CHOLLET

NEURAL NETWORKS

WHAT IS A NEURON?

Receives signal on synapse
When triggered, sends signal on axon

MATHEMATICAL NEURON

Mathematical abstraction, inspired by biological neuron
Either on or off based on sum of input

MATHEMATICAL FUNCTION

Neuron is a mathematical function
Adds up (weighted) inputs and applies sigmoid (or other function)
This determines if it fires or not
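
A minimal sketch of such a neuron, assuming a sigmoid activation and a bias term (the neuron's base signal). The function names and the example weights are illustrative, not taken from any library.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Add up the weighted inputs plus the bias, then apply the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical weights: strong positive weights push the output toward 1 ("fires").
quiet = neuron([0.0, 0.0], weights=[10.0, 10.0], bias=-5.0)
firing = neuron([1.0, 1.0], weights=[10.0, 10.0], bias=-5.0)
```

An output near 1 can be read as the neuron firing and an output near 0 as the neuron staying off.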

WHAT ARE NEURAL NETWORKS?
Biologically inspired machine learning algorithm
Mathematical neurons arranged in layers
Accumulate signals from the previous layer
Fire when signal reaches threshold

NEURAL NETWORKS

NEURON INCOMING
Each neuron receives signals from neurons in previous layer
Signal affected by weight
Some are more important than others
Bias is the base signal that the neuron receives

NEURON OUTGOING
Each neuron sends its signal to the neurons in the next layer
Signals affected by weight

LAYERED NETWORK

Each layer looks at features identified by previous layer

US ELECTIONS

ELECTIONS
Consider the elections
This is a gated system
A way to aggregate different views

HIGHEST LEVEL: STATES

NEXT LEVEL: COUNTIES

ELECTIONS
Is this a Neural Network?
How many layers does it have?

NEURON LAYERS
The nomination is the last layer, layer N
States are layer N-1
Counties are layer N-2
Districts are layer N-3
Individuals are layer N-4
Individual brains have even more layers

GRADIENT DESCENT

TRAINING: HOW DO WE IMPROVE?

Calculate error from desired goal
Increase weight of neurons that voted right
Decrease weight of neurons that voted wrong
This will reduce error

GRADIENT DESCENT
This algorithm is called gradient descent
Think of error as function of weights

FEED FORWARD
Also called forward propagation or forward prop
Initialize inputs
Calculate activation of each layer
Calculate activation of output layer
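
A small sketch of forward prop, assuming sigmoid neurons and a network stored as a list of (weights, biases) pairs, one pair per layer. The layer sizes and numbers are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """weights[j] holds the incoming weights of neuron j in this layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Pass the activation through each layer in turn and return the last one."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2]], [0.05]),
]
output = feed_forward([1.0, 0.5], layers)
```

The same function serves for training (to compute the error) and for prediction once the weights are fixed.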

BACK PROPAGATION
Use forward prop to calculate the error
Error is function of all network weights
Adjust weights using gradient descent
Repeat with next record
Keep going over training set until convergence
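
The loop above can be sketched for a tiny network with one hidden layer. Everything specific here is an assumption for illustration: sigmoid neurons, squared error, a fixed number of passes instead of a convergence check, the OR function as training data, and the starting weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward prop: return the hidden activations and the single output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def total_error(data, params):
    """Sum of squared errors over the whole training set."""
    return sum((forward(x, *params)[1] - target) ** 2 for x, target in data)

def train(data, params, learning_rate=0.5, epochs=2000):
    w_hidden, b_hidden, w_out, b_out = params
    for _ in range(epochs):
        for x, target in data:  # one gradient step per record
            hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
            # Error gradient at the output neuron (chain rule through the sigmoid).
            delta_out = (output - target) * output * (1 - output)
            # Propagate the error back to each hidden neuron.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                            for j in range(len(hidden))]
            for j in range(len(hidden)):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                b_hidden[j] -= learning_rate * delta_hidden[j]
                for i in range(len(x)):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
            b_out -= learning_rate * delta_out
    return w_hidden, b_hidden, w_out, b_out

# Hypothetical training set: the logical OR of two inputs.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
params = ([[0.5, -0.4], [0.3, 0.8]], [0.0, 0.0], [0.6, -0.3], 0.0)
error_before = total_error(data, params)
params = train(data, params)
error_after = total_error(data, params)
```

Comparing `error_before` with `error_after` shows whether the weight adjustments actually reduced the error on the training set.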

HOW DO YOU FIND THE MINIMUM IN AN N-DIMENSIONAL SPACE?

Take a step in the direction of steepest descent.
That direction is the negative of the gradient, the vector of partial derivatives of the error with respect to each weight.
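
A minimal sketch of this idea on a two-dimensional bowl. The error function, the learning rate, the step count, and the use of a numerical gradient are all assumptions chosen to keep the example short; real networks compute the gradient analytically with back propagation.

```python
def numerical_gradient(f, point, eps=1e-6):
    """Estimate each partial derivative with a central difference."""
    gradient = []
    for i in range(len(point)):
        up = point[:]
        up[i] += eps
        down = point[:]
        down[i] -= eps
        gradient.append((f(up) - f(down)) / (2 * eps))
    return gradient

def gradient_descent(f, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, the direction of steepest descent."""
    point = list(start)
    for _ in range(steps):
        gradient = numerical_gradient(f, point)
        point = [p - learning_rate * g for p, g in zip(point, gradient)]
    return point

def error(weights):
    """Hypothetical error surface with its minimum at (3, -1)."""
    return (weights[0] - 3.0) ** 2 + (weights[1] + 1.0) ** 2

minimum = gradient_descent(error, [0.0, 0.0])
```

The same loop works in any number of dimensions because the gradient always has one component per weight.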

PUTTING ALL THIS TOGETHER
Use forward prop to activate
Use back prop to train
Then use forward prop to test

TYPES OF NEURONS

SIGMOID

TANH

RELU

BENEFITS OF RELU
Popular
Accelerates convergence by 6x (Krizhevsky et al)
Operation is faster since it is linear not exponential
Can die by going to zero

Pro: Sparse matrix
Con: Network can die

LEAKY RELU
Pro: Does not die
Con: Matrix is not sparse
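
For reference, the four activation functions named above can be written as one-liners. The leaky slope of 0.01 is a common choice but is an assumption here, not a value from the slides.

```python
import math

def sigmoid(z):
    """Output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output in (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Zero for negative input, identity for positive input."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative input keeps a small non-zero output."""
    return z if z > 0 else slope * z
```

ReLU returns exactly zero for every negative input, which is where both the sparsity and the risk of a neuron dying come from; the leaky variant gives up the exact zeros to keep a small gradient.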

SOFTMAX
Final layer of network used for classification
Turns output into probability distribution
Normalizes output of neurons to sum to 1
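
A short sketch of softmax. Subtracting the largest score before exponentiating is a standard numerical-stability trick added here; it does not change the result. The example scores are made up.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of a 3-class network.
probabilities = softmax([2.0, 1.0, 0.1])
```

The largest score gets the largest probability, and the ordering of the classes is preserved.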

HYPERPARAMETER TUNING

PROBLEM: OIL EXPLORATION
Drilling holes is expensive
We want to find the biggest oilfield without wasting money on duds
Where should we plant our next oilfield derrick?

PROBLEM: NEURAL NETWORKS
Testing hyperparameters is expensive
We have an N-dimensional grid of parameters
How can we quickly zero in on the best combination of hyperparameters?

HYPERPARAMETER EXAMPLE
How many layers should we have
How many neurons should we have in hidden layers
Should we use Sigmoid, Tanh, or ReLU
Should we initialize

ALGORITHMS
Grid
Random
Bayesian Optimization

GRID
Systematically search entire grid
Remember best found so far

RANDOM
Randomly search the grid
Remember the best found so far
Bergstra and Bengio’s result and Alice Zheng’s explanation (see References)
60 random samples gets you within top 5% of grid search with 95% probability
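
The arithmetic behind the 60-sample claim can be checked directly, assuming each random sample independently has a 5% chance of landing in the top 5% of configurations. The function name is illustrative.

```python
def probability_of_hitting_top(fraction, samples):
    """Chance that at least one of the samples lands in the top fraction.

    The chance that every sample misses is (1 - fraction) ** samples,
    so the chance of at least one hit is one minus that.
    """
    return 1.0 - (1.0 - fraction) ** samples

p = probability_of_hitting_top(0.05, 60)
```

With 60 samples the probability comes out slightly above 95%, and it does not depend on how many hyperparameters there are.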

BAYESIAN OPTIMIZATION
Balance between explore and exploit
Exploit: test spots within explored perimeter
Explore: test new spots in random locations
Balance the trade-off

SIGOPT
YC-backed SF startup
Founded by Scott Clark
Raised $2M
Sells cloud-based proprietary variant of Bayesian Optimization

BAYESIAN OPTIMIZATION PRIMER
Bayesian Optimization Primer by Ian Dewancker, Michael McCourt, Scott Clark
See References

OPEN SOURCE VARIANTS
Open source alternatives:

Spearmint
Hyperopt
SMAC
MOE

PRODUCTION

DEPLOYING
Phases: training, deployment
Training phase run on back-end servers
Optimize hyper-parameters on back-end
Deploy model to front-end servers, browsers, devices
Front-end only uses forward prop and is fast

SERIALIZING/DESERIALIZING MODEL

Back-end: Serialize model + weights
Front-end: Deserialize model + weights

HDF5
Keras serializes model architecture to JSON
Keras serializes weights to HDF5
Serialization model for hierarchical data
APIs for C++, Python, Java, etc.
https://www.hdfgroup.org/
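
The back-end/front-end split can be illustrated with the standard library alone. This is only a sketch of the idea: JSON stands in for HDF5 as the weight format, and the dictionary layout, layer names, and function names are all hypothetical rather than Keras's own format or API.

```python
import json
import math

def serialize_model(architecture, weights):
    """Back-end: produce one string for the architecture and one for the weights."""
    return json.dumps(architecture), json.dumps(weights)

def deserialize_model(architecture_json, weights_json):
    """Front-end: rebuild the architecture and the weights from the two strings."""
    return json.loads(architecture_json), json.loads(weights_json)

def predict(architecture, weights, inputs):
    """Forward prop only: apply each sigmoid layer listed in the architecture."""
    activation = inputs
    for layer in architecture["layers"]:
        layer_weights = weights[layer["name"]]
        activation = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activation)) + b)))
            for row, b in zip(layer_weights["kernel"], layer_weights["bias"])
        ]
    return activation

# Hypothetical trained model with made-up weights.
architecture = {"layers": [{"name": "hidden", "units": 2}, {"name": "output", "units": 1}]}
weights = {
    "hidden": {"kernel": [[0.5, -0.4], [0.3, 0.8]], "bias": [0.0, 0.1]},
    "output": {"kernel": [[0.6, -0.3]], "bias": [0.05]},
}

architecture_json, weights_json = serialize_model(architecture, weights)
loaded_architecture, loaded_weights = deserialize_model(architecture_json, weights_json)
prediction = predict(loaded_architecture, loaded_weights, [1.0, 0.0])
```

The front-end never needs the training code: it only needs the two serialized artifacts and a forward pass.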

DEPLOYMENT EXAMPLE: CANCER DETECTION

Rhobota.com’s cancer detecting iPhone app
Developed by Bryan Shaw after his son’s illness
Model built on back-end, deployed on iPhone
iPhone detects retinal cancer

DEEP LEARNING

WHAT IS DEEP LEARNING?
Deep Learning is a learning method that can train the system with more than 2 or 3 non-linear hidden layers.

WHAT IS DEEP LEARNING?
Machine learning techniques which enable unsupervised feature learning and pattern analysis/classification.
The essence of deep learning is to compute representations of the data.
Higher-level features are defined from lower-level ones.

HOW IS DEEP LEARNING DIFFERENT FROM REGULAR NEURAL NETWORKS?
Training neural networks requires applying gradient descent on millions of dimensions.
This is intractable for large networks.
Deep learning places constraints on neural networks.
This allows them to be solvable iteratively.
The constraints are generic.

AUTO-ENCODERS

WHAT ARE AUTO-ENCODERS?
An auto-encoder is a learning algorithm
It applies backpropagation and sets the target values to be equal to its inputs
In other words it trains itself to do the identity transformation

WHY DOES IT DO THIS?
Auto-encoder places constraints on itself
E.g. it restricts the number of hidden neurons
This allows it to find a good representation of the data

IS THE AUTO-ENCODER SUPERVISED OR UNSUPERVISED?

It is unsupervised.
The data is unlabeled.

WHAT ARE CONVOLUTIONAL NEURAL NETWORKS?

Feedforward neural networks
Connection pattern inspired by visual cortex

CONVOLUTIONAL NEURAL NETWORKS

CNNS
The convolutional layer’s parameters are a set of learnable filters
Every filter is small along width and height
During the forward pass, each filter slides across the width and height of the input, producing a 2-dimensional activation map
As we slide across the input we compute the dot product between the filter and the input
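
The sliding dot product can be sketched for a single-channel image and a single filter, assuming stride 1 and no padding. As in most neural network libraries, the filter is not flipped, so strictly speaking this is cross-correlation. The image and the filter values are made up.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position."""
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    out_height = len(image) - kernel_height + 1
    out_width = len(image[0]) - kernel_width + 1
    activation_map = []
    for r in range(out_height):
        row = []
        for c in range(out_width):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_height)
                for j in range(kernel_width)
            ))
        activation_map.append(row)
    return activation_map

# Hypothetical 3x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
activation_map = convolve2d(image, kernel)  # [[0, 2, 0], [0, 2, 0]]
```

The map is non-zero only in the column where the edge sits, wherever that column happens to be in the image.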

CNNS
Intuitively, the network learns filters that activate when they see a specific type of feature anywhere
In this way it creates translation invariance

CONVNET EXAMPLE

Zero-Padding: the boundaries of the input are padded with 0s
Stride: how far the filter moves at each step of the convolution
Parameter sharing: each filter uses the same weights at every position of the input
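
Padding and stride together determine the size of the activation map. A small helper, assuming a square input and filter and the usual formula (W - F + 2P) / S + 1; the function name is illustrative.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Width (or height) of the activation map: (W - F + 2P) / S + 1."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not fit the padded input evenly")
    return span // stride + 1

# A 5x5 input with 1 pixel of zero-padding, a 3x3 filter, and stride 2.
size = conv_output_size(input_size=5, filter_size=3, padding=1, stride=2)  # 3
```

If the division is not exact, the chosen stride does not tile the padded input, which this sketch reports as an error rather than silently truncating.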

CONVNET EXAMPLE
From http://cs231n.github.io/convolutional-networks/

WHAT IS A POOLING LAYER?
The pooling layer reduces the resolution of the image further
It tiles the output area with 2x2 mask and takes the maximum activation value of the area
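
A minimal sketch of 2x2 max pooling with non-overlapping tiles, assuming the activation map has even width and height. The input values are made up.

```python
def max_pool_2x2(activation_map):
    """Keep the maximum of each non-overlapping 2x2 tile."""
    pooled = []
    for r in range(0, len(activation_map) - 1, 2):
        row = []
        for c in range(0, len(activation_map[0]) - 1, 2):
            row.append(max(
                activation_map[r][c], activation_map[r][c + 1],
                activation_map[r + 1][c], activation_map[r + 1][c + 1],
            ))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 activation map.
activation_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(activation_map)  # [[4, 2], [2, 8]]
```

The 4x4 map shrinks to 2x2, halving the resolution in each direction while keeping the strongest activation in each tile.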

REVIEW
keras/examples/mnist_cnn.py

Recognizes hand-written digits
By combining different layers

RECURRENT NEURAL NETWORKS

RNNS
RNNs capture patterns in time series data
Constrained by shared weights across neurons
Each neuron observes different times
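
The weight sharing can be sketched with a one-unit recurrent network: the same three parameters are reused at every time step, and the hidden state carries information forward. The tanh activation, scalar state, and parameter values are assumptions for illustration.

```python
import math

def rnn_forward(sequence, w_input, w_hidden, bias):
    """Apply the same weights at every time step, carrying the hidden state along."""
    hidden = 0.0
    states = []
    for x in sequence:
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states

# Hypothetical sequence: one pulse followed by silence.
states = rnn_forward([1.0, 0.0, 0.0], w_input=1.0, w_hidden=0.5, bias=0.0)
```

The pulse at the first step still influences the later states, but its effect shrinks at each step, which hints at why plain RNNs struggle with long lags.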

LSTMS
Long Short Term Memory networks
Plain RNNs struggle with long time lags between events
LSTMs can pick up patterns separated by big lags
Used for speech recognition

RNN EFFECTIVENESS
Andrej Karpathy uses LSTMs to generate text
Generates Shakespeare, Linux Kernel code, mathematical proofs.
See http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNN INTERNALS

LSTM INTERNALS

CONCLUSION

REFERENCES
Bayesian Optimization Primer by Dewancker et al
https://sigopt.com/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf

Random Search for Hyper-Parameter Optimization by Bergstra and Bengio
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

Evaluating Machine Learning Models by Alice Zheng
http://www.oreilly.com/data/free/files/evaluating-machine-learning-models.pdf

REFERENCES
Dropout by Hinton et al
https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

Understanding LSTM Networks by Chris Olah
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Multi-scale Deep Learning for Gesture Detection and Localization by Neverova et al
http://www.uoguelph.ca/~gwtaylor/publications/neverova2014multi-scale.pdf

Unreasonable Effectiveness of RNNs by Karpathy
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

QUESTIONS
