Machine Learning
by John Paul Mueller and Luca Massaron
Machine Learning For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2016 by John Wiley & Sons, Inc., Hoboken, New Jersey
Media and software compilation copyright © 2016 by John Wiley & Sons, Inc. All rights reserved.
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2016940023
ISBN: 978-1-119-24551-3
ISBN 978-1-119-24577-3 (ebk); ISBN ePDF 978-1-119-24575-9 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
Contents at a Glance

Introduction . . . . 1

Part 1: Introducing How Machines Learn . . . . 7
CHAPTER 1: Getting the Real Story about AI . . . . 9
CHAPTER 2: Learning in the Age of Big Data . . . . 23
CHAPTER 3: Having a Glance at the Future . . . . 35

Part 2: Preparing Your Learning Tools . . . . 45
CHAPTER 4: Installing an R Distribution . . . . 47
CHAPTER 5: Coding in R Using RStudio . . . . 63
CHAPTER 6: Installing a Python Distribution . . . . 89
CHAPTER 7: Coding in Python Using Anaconda . . . . 109
CHAPTER 8: Exploring Other Machine Learning Tools . . . . 137

Part 3: Getting Started with the Math Basics . . . . 145
CHAPTER 9: Demystifying the Math Behind Machine Learning . . . . 147
CHAPTER 10: Descending the Right Curve . . . . 167
CHAPTER 11: Validating Machine Learning . . . . 181
CHAPTER 12: Starting with Simple Learners . . . . 199

Part 4: Learning from Smart and Big Data . . . . 217
CHAPTER 13: Preprocessing Data . . . . 219
CHAPTER 14: Leveraging Similarity . . . . 237
CHAPTER 15: Working with Linear Models the Easy Way . . . . 257
CHAPTER 16: Hitting Complexity with Neural Networks . . . . 279
CHAPTER 17: Going a Step beyond Using Support Vector Machines . . . . 297
CHAPTER 18: Resorting to Ensembles of Learners . . . . 315

Part 5: Applying Learning to Real Problems . . . . 331
CHAPTER 19: Classifying Images . . . . 333
CHAPTER 20: Scoring Opinions and Sentiments . . . . 349
CHAPTER 21: Recommending Products and Movies . . . . 369

Part 6: The Part of Tens . . . . 383
CHAPTER 22: Ten Machine Learning Packages to Master . . . . 385
CHAPTER 23: Ten Ways to Improve Your Machine Learning Models . . . . 391

INDEX . . . . 399
Table of Contents

INTRODUCTION . . . . 1
    About This Book . . . . 1
    Foolish Assumptions . . . . 2
    Icons Used in This Book . . . . 3
    Beyond the Book . . . . 4
    Where to Go from Here . . . . 5

PART 1: INTRODUCING HOW MACHINES LEARN . . . . 7

CHAPTER 1: Getting the Real Story about AI . . . . 9
    Moving beyond the Hype . . . . 10
    Dreaming of Electric Sheep . . . . 11
        Understanding the history of AI and machine learning . . . . 12
        Exploring what machine learning can do for AI . . . . 13
        Considering the goals of machine learning . . . . 13
        Defining machine learning limits based on hardware . . . . 14
    Overcoming AI Fantasies . . . . 15
        Discovering the fad uses of AI and machine learning . . . . 16
        Considering the true uses of AI and machine learning . . . . 16
        Being useful; being mundane . . . . 18
    Considering the Relationship between AI and Machine Learning . . . . 19
    Considering AI and Machine Learning Specifications . . . . 20
    Defining the Divide between Art and Engineering . . . . 20

CHAPTER 2: Learning in the Age of Big Data . . . . 23
    Defining Big Data . . . . 24
    Considering the Sources of Big Data . . . . 25
        Building a new data source . . . . 26
        Using existing data sources . . . . 27
        Locating test data sources . . . . 28
    Specifying the Role of Statistics in Machine Learning . . . . 29
    Understanding the Role of Algorithms . . . . 30
        Defining what algorithms do . . . . 30
        Considering the five main techniques . . . . 30
    Defining What Training Means . . . . 32

CHAPTER 3: Having a Glance at the Future . . . . 35
    Creating Useful Technologies for the Future . . . . 36
        Considering the role of machine learning in robots . . . . 36
        Using machine learning in healthcare . . . . 37
        Creating smart systems for various needs . . . . 37
        Using machine learning in industrial settings . . . . 38
        Understanding the role of updated processors and other hardware . . . . 39
    Discovering the New Work Opportunities with Machine Learning . . . . 39
        Working for a machine . . . . 40
        Working with machines . . . . 41
        Repairing machines . . . . 41
        Creating new machine learning tasks . . . . 42
        Devising new machine learning environments . . . . 42
    Avoiding the Potential Pitfalls of Future Technologies . . . . 43
PART 2: PREPARING YOUR LEARNING TOOLS . . . . 45

CHAPTER 4: Installing an R Distribution . . . . 47
    Choosing an R Distribution with Machine Learning in Mind . . . . 48
    Installing R on Windows . . . . 49
    Installing R on Linux . . . . 56
    Installing R on Mac OS X . . . . 57
    Downloading the Datasets and Example Code . . . . 59
        Understanding the datasets used in this book . . . . 59
        Defining the code repository . . . . 60

CHAPTER 5: Coding in R Using RStudio . . . . 63
    Understanding the Basic Data Types . . . . 64
    Working with Vectors . . . . 66
    Organizing Data Using Lists . . . . 66
    Working with Matrices . . . . 67
        Creating a basic matrix . . . . 68
        Changing the vector arrangement . . . . 69
        Accessing individual elements . . . . 69
        Naming the rows and columns . . . . 70
    Interacting with Multiple Dimensions Using Arrays . . . . 71
        Creating a basic array . . . . 71
        Naming the rows and columns . . . . 72
    Creating a Data Frame . . . . 74
        Understanding factors . . . . 74
        Creating a basic data frame . . . . 76
        Interacting with data frames . . . . 77
        Expanding a data frame . . . . 79
    Performing Basic Statistical Tasks . . . . 80
        Making decisions . . . . 80
        Working with loops . . . . 82
        Performing looped tasks without loops . . . . 84
        Working with functions . . . . 85
        Finding mean and median . . . . 85
        Charting your data . . . . 87

CHAPTER 6: Installing a Python Distribution . . . . 89
    Choosing a Python Distribution with Machine Learning in Mind . . . . 90
        Getting Continuum Analytics Anaconda . . . . 91
        Getting Enthought Canopy Express . . . . 92
        Getting pythonxy . . . . 93
        Getting WinPython . . . . 93
    Installing Python on Linux . . . . 93
    Installing Python on Mac OS X . . . . 94
    Installing Python on Windows . . . . 96
    Downloading the Datasets and Example Code . . . . 99
        Using Jupyter Notebook . . . . 100
        Defining the code repository . . . . 101
        Understanding the datasets used in this book . . . . 106

CHAPTER 7: Coding in Python Using Anaconda . . . . 109
    Working with Numbers and Logic . . . . 110
        Performing variable assignments . . . . 112
        Doing arithmetic . . . . 113
        Comparing data using Boolean expressions . . . . 115
    Creating and Using Strings . . . . 117
    Interacting with Dates . . . . 118
    Creating and Using Functions . . . . 119
        Creating reusable functions . . . . 119
        Calling functions . . . . 121
        Working with global and local variables . . . . 123
    Using Conditional and Loop Statements . . . . 124
        Making decisions using the if statement . . . . 124
        Choosing between multiple options using nested decisions . . . . 125
        Performing repetitive tasks using for . . . . 126
        Using the while statement . . . . 127
    Storing Data Using Sets, Lists, and Tuples . . . . 128
        Creating sets . . . . 128
        Performing operations on sets . . . . 128
        Creating lists . . . . 129
        Creating and using tuples . . . . 131
    Defining Useful Iterators . . . . 132
    Indexing Data Using Dictionaries . . . . 134
    Storing Code in Modules . . . . 134

CHAPTER 8: Exploring Other Machine Learning Tools . . . . 137
    Meeting the Precursors SAS, Stata, and SPSS . . . . 138
    Learning in Academia with Weka . . . . 140
    Accessing Complex Algorithms Easily Using LIBSVM . . . . 141
    Running As Fast As Light with Vowpal Wabbit . . . . 142
    Visualizing with Knime and RapidMiner . . . . 143
    Dealing with Massive Data by Using Spark . . . . 144
PART 3: GETTING STARTED WITH THE MATH BASICS . . . . 145

CHAPTER 9: Demystifying the Math Behind Machine Learning . . . . 147
    Working with Data . . . . 148
        Creating a matrix . . . . 150
        Understanding basic operations . . . . 152
        Performing matrix multiplication . . . . 152
        Glancing at advanced matrix operations . . . . 155
        Using vectorization effectively . . . . 155
    Exploring the World of Probabilities . . . . 158
        Operating on probabilities . . . . 159
        Conditioning chance by Bayes' theorem . . . . 160
    Describing the Use of Statistics . . . . 163

CHAPTER 10: Descending the Right Curve . . . . 167
    Interpreting Learning As Optimization . . . . 168
        Supervised learning . . . . 168
        Unsupervised learning . . . . 169
        Reinforcement learning . . . . 169
        The learning process . . . . 170
    Exploring Cost Functions . . . . 173
    Descending the Error Curve . . . . 174
    Updating by Mini-Batch and Online . . . . 177

CHAPTER 11: Validating Machine Learning . . . . 181
    Checking Out-of-Sample Errors . . . . 182
        Looking for generalization . . . . 183
    Getting to Know the Limits of Bias . . . . 184
    Keeping Model Complexity in Mind . . . . 186
    Keeping Solutions Balanced . . . . 188
        Depicting learning curves . . . . 189
    Training, Validating, and Testing . . . . 191
    Resorting to Cross-Validation . . . . 191
    Looking for Alternatives in Validation . . . . 193
    Optimizing Cross-Validation Choices . . . . 194
        Exploring the space of hyper-parameters . . . . 195
    Avoiding Sample Bias and Leakage Traps . . . . 196
        Watching out for snooping . . . . 198

CHAPTER 12: Starting with Simple Learners . . . . 199
    Discovering the Incredible Perceptron . . . . 200
        Falling short of a miracle . . . . 200
        Touching the nonseparability limit . . . . 202
    Growing Greedy Classification Trees . . . . 204
        Predicting outcomes by splitting data . . . . 204
        Pruning overgrown trees . . . . 208
    Taking a Probabilistic Turn . . . . 209
        Understanding Naïve Bayes . . . . 209
        Estimating response with Naïve Bayes . . . . 212
PART 4: LEARNING FROM SMART AND BIG DATA . . . . 217

CHAPTER 13: Preprocessing Data . . . . 219
    Gathering and Cleaning Data . . . . 220
    Repairing Missing Data . . . . 221
        Identifying missing data . . . . 221
        Choosing the right replacement strategy . . . . 222
    Transforming Distributions . . . . 225
    Creating Your Own Features . . . . 227
        Understanding the need to create features . . . . 227
        Creating features automatically . . . . 228
    Compressing Data . . . . 230
    Delimiting Anomalous Data . . . . 232

CHAPTER 14: Leveraging Similarity . . . . 237
    Measuring Similarity between Vectors . . . . 238
        Understanding similarity . . . . 238
        Computing distances for learning . . . . 239
    Using Distances to Locate Clusters . . . . 240
        Checking assumptions and expectations . . . . 241
        Inspecting the gears of the algorithm . . . . 243
    Tuning the K-Means Algorithm . . . . 244
        Experimenting K-means reliability . . . . 245
        Experimenting with how centroids converge . . . . 247
    Searching for Classification by K-Nearest Neighbors . . . . 251
    Leveraging the Correct K Parameter . . . . 252
        Understanding the k parameter . . . . 252
        Experimenting with a flexible algorithm . . . . 253

CHAPTER 15: Working with Linear Models the Easy Way . . . . 257
    Starting to Combine Variables . . . . 258
    Mixing Variables of Different Types . . . . 264
    Switching to Probabilities . . . . 267
        Specifying a binary response . . . . 267
        Handling multiple classes . . . . 270
    Guessing the Right Features . . . . 271
        Defining the outcome of features that don't work together . . . . 271
        Solving overfitting by using selection . . . . 272
    Learning One Example at a Time . . . . 274
        Using gradient descent . . . . 275
        Understanding how SGD is different . . . . 275

CHAPTER 16: Hitting Complexity with Neural Networks . . . . 279
    Learning and Imitating from Nature . . . . 280
        Going forth with feed-forward . . . . 281
        Going even deeper down the rabbit hole . . . . 283
        Getting Back with Backpropagation . . . . 286
    Struggling with Overfitting . . . . 289
        Understanding the problem . . . . 289
        Opening the black box . . . . 290
    Introducing Deep Learning . . . . 293

CHAPTER 17: Going a Step beyond Using Support Vector Machines . . . . 297
    Revisiting the Separation Problem: A New Approach . . . . 298
    Explaining the Algorithm . . . . 299
        Getting into the math of an SVM . . . . 301
        Avoiding the pitfalls of nonseparability . . . . 302
    Applying Nonlinearity . . . . 303
        Demonstrating the kernel trick by example . . . . 305
        Discovering the different kernels . . . . 306
    Illustrating Hyper-Parameters . . . . 308
    Classifying and Estimating with SVM . . . . 309

CHAPTER 18: Resorting to Ensembles of Learners . . . . 315
    Leveraging Decision Trees . . . . 316
        Growing a forest of trees . . . . 317
        Understanding the importance measures . . . . 321
    Working with Almost Random Guesses . . . . 324
        Bagging predictors with Adaboost . . . . 324
    Boosting Smart Predictors . . . . 327
        Meeting again with gradient descent . . . . 328
    Averaging Different Predictors . . . . 329
PART 5: APPLYING LEARNING TO REAL PROBLEMS . . . . 331

CHAPTER 19: Classifying Images . . . . 333
    Working with a Set of Images . . . . 334
    Extracting Visual Features . . . . 338
    Recognizing Faces Using Eigenfaces . . . . 340
    Classifying Images . . . . 343

CHAPTER 20: Scoring Opinions and Sentiments . . . . 349
    Introducing Natural Language Processing . . . . 349
    Understanding How Machines Read . . . . 350
        Processing and enhancing text . . . . 352
        Scraping textual datasets from the web . . . . 357
        Handling problems with raw text . . . . 360
    Using Scoring and Classification . . . . 362
        Performing classification tasks . . . . 362
        Analyzing reviews from e-commerce . . . . 365

CHAPTER 21: Recommending Products and Movies . . . . 369
    Realizing the Revolution . . . . 370
    Downloading Rating Data . . . . 371
        Trudging through the MovieLens dataset . . . . 371
        Navigating through anonymous web data . . . . 373
        Encountering the limits of rating data . . . . 374
    Leveraging SVD . . . . 375
        Considering the origins of SVD . . . . 376
        Understanding the SVD connection . . . . 377
        Seeing SVD in action . . . . 378
PART 6: THE PART OF TENS . . . . 383

CHAPTER 22: Ten Machine Learning Packages to Master . . . . 385
    Cloudera Oryx . . . . 386
    CUDA-Convnet . . . . 386
    ConvNetJS . . . . 387
    e1071 . . . . 387
    gbm . . . . 388
    Gensim . . . . 388
    glmnet . . . . 388
    randomForest . . . . 389
    SciPy . . . . 389
    XGBoost . . . . 390

CHAPTER 23: Ten Ways to Improve Your Machine Learning Models . . . . 391
    Studying Learning Curves . . . . 392
    Using Cross-Validation Correctly . . . . 393
    Choosing the Right Error or Score Metric . . . . 394
    Searching for the Best Hyper-Parameters . . . . 395
    Testing Multiple Models . . . . 395
    Averaging Models . . . . 396
    Stacking Models . . . . 396
    Applying Feature Engineering . . . . 397
    Selecting Features and Examples . . . . 397
    Looking for More Data . . . . 398

INDEX . . . . 399
Introduction
The term machine learning has all sorts of meanings attached to it today, especially after Hollywood's (and others') movie studios have gotten into the picture. Films such as Ex Machina have tantalized the imaginations of moviegoers the world over and made machine learning into all sorts of things that it really isn't. Of course, most of us have to live in the real world, where machine learning actually does perform an incredible array of tasks that have nothing to do with androids that can pass the Turing Test (fooling their makers into believing they're human).

Machine Learning For Dummies provides you with a view of machine learning in the real world and exposes you to the amazing feats you really can perform using this technology. Even though the tasks that you perform using machine learning may seem a bit mundane when compared to the movie version, by the time you finish this book, you realize that these mundane tasks have the power to impact the lives of everyone on the planet in nearly every aspect of their daily lives. In short, machine learning is an incredible technology — just not in the way that some people have imagined.
About This Book

The main purpose of Machine Learning For Dummies is to help you understand what machine learning can and can't do for you today and what it might do for you in the future. You don't have to be a computer scientist to use this book, even though it does contain many coding examples. In fact, you can come from any discipline that heavily emphasizes math because that's how this book focuses on machine learning. Instead of dealing with abstractions, you see the concrete results of using specific algorithms to interact with big data in particular ways to obtain a certain, useful result. The emphasis is on useful because machine learning has the power to perform a wide array of tasks in a manner never seen before.
Part of the emphasis of this book is on using the right tools. This book uses both Python and R to perform various tasks. These two languages have special features that make them particularly useful in a machine learning setting. For example, Python provides access to a huge array of libraries that let you do just about anything you can imagine and more than a few you can't. Likewise, R provides an ease of use that few languages can match. Machine Learning For Dummies helps you understand that both languages have their role to play and gives examples of when one language works a bit better than the other to achieve the goals you have in mind.
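To give you an early taste of the concrete, code-first approach described above, here is a minimal sketch of one of the simplest learners covered later in the book: a perceptron (the subject of Chapter 12) trained on the logical AND function. This sketch uses plain Python only; the function name, learning rate, and epoch count are illustrative choices for this example, not code from the book itself.

```python
def train_perceptron(samples, epochs=10, lr=0.1):
    """Learn weights and a bias for two binary inputs using the perceptron rule:
    nudge the weights toward each example the model misclassifies."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - prediction  # -1, 0, or +1
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

# The AND truth table as (inputs, label) pairs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)

# After training, the learned line separates the single positive case.
for (x1, x2), target in data:
    assert (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == target
```

Because AND is linearly separable, the perceptron rule is guaranteed to converge here; later chapters show what happens when that guarantee breaks down and how richer libraries handle the harder cases.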