DESCRIPTION
Data Science is a comparatively new field, and as such it is constantly changing as new techniques, tools, and problems emerge every day. Traditionally, education has taken a top-down approach in which courses are developed on the scale of years and committees approve curricula based on what might be the most theoretically complete approach. This is at odds, however, with an evolving industry that needs data scientists faster than they can be (traditionally) trained. If we are to sustainably push the field of Data Science forward, we must collectively figure out how best to scale this type of education. At Zipfian I have seen (and felt) firsthand what works (and what doesn't) when tools and theory are combined in a classroom environment. This talk will be a narrative about the lessons learned trying to integrate high-level theory with practical application, how leveraging the Python ecosystem (numpy, scipy, pandas, scikit-learn, etc.) has made this possible, and what happens when you treat curriculum like product (and the classroom like a team).
Citation preview
On Building a Data Science Curriculum
November 23rd, 2014
Jonathan Dinu
Director of Education, Galvanize
[email protected]
@clearspandex
Questions? tweet @galvanize
Formerly
+
Currently
Challenge
The Challenge
Tools
Framework/Library
Big Data (scalability)
Small Data
Bespoke Code
Cloudera ML
Mahout
MLlib (AMPLab)
H2O (0xdata)
C/C++
MapReduce (Streaming)
MapReduce (Java)
Cascading/Crunch
Pig/Hive
Vowpal Wabbit
Giraph
GraphLab
Spark
Storm
CRAN
R
Python
Java
scikit-learn
pandas
mlpack
Weka
NumPy
JavaScript
Obligatory Name Drop
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
Locally
requests
BeautifulSoup4
pandas
pymongo
scikit-learn/NLTK
Flask
At Scale
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or Mortar (w/ Python UDF)
Snakebite (HDFS)
MLlib (pySpark)
Flask
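As a rough illustration of the Vectorization, Train, and Model stages on a single machine, the sketch below chains a bag-of-words vectorizer and a Naive Bayes classifier with scikit-learn. The four documents, their labels, and the choice of classifier are made up for this example and are not taken from the talk; the Acquisition, Parse, and Storage stages (requests, BeautifulSoup4, pymongo) are skipped so the snippet runs without a network or database.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-parsed documents standing in for scraped text.
docs = [
    "the team won the match with a late goal",
    "the striker scored a goal in the final match",
    "parliament passed the budget vote",
    "the senate vote on the budget was delayed",
]
labels = ["sports", "sports", "politics", "politics"]

# Vectorization (word counts) and Train (Naive Bayes) chained into one Model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

# The fitted model is what an Expose step (for example a Flask route) would call.
prediction = model.predict(["a late goal decided the match"])[0]
print(prediction)
```

Keeping the vectorizer and the classifier in one pipeline object means the same transformation is applied at training and prediction time, which is one reasonable way to hand a single artifact to the Expose stage.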
Now do that in 8 weeks
Intuition
Iteration 0: Intuition
Content
Source: Metacademy
Bottom Up Approach
Source: Coursera
Source: UC Berkeley Masters
Not Everybody Learns This Way
Issues
• Not Enough Context
• Not Enough Concept Overlap
• Takes too much Time
• Nothing Happens in a Vacuum
Digression
Not Just for Data Science
(relevant to learning any complex subject)
Experience
Iteration 1: Experience
Theory
Mathematics & Statistics
Mathematics
Statistical Analysis
Distributions (Binomial, Poisson, etc.)
Summary Statistics (Mean, Variance, etc.)
Hypothesis Testing
Bayesian Analysis
Linear Algebra (Matrix Factorization)
Calculus (Integrals, Derivatives, etc.)
Graph Theory
Probability/Combinatorics
Worth the Upfront Investment
Theory
Technique
Distributed Computing
Supervised (SVM, Random Forest)
NLP / Information Retrieval
Algorithms & Data Structures
Data Visualization
Data Munging
Machine Learning & Software Engineering
Machine Learning
Software Engineering
Validation, Model Comparison
Unsupervised (K-means, LDA)
Just ask them!
Network
(the students)
Context is King
• Iris Dataset Classification: “Domesticated Data” (learn the tools/theory)
• NYT Topic Modeling: “Wild Data” (learn the application)
• Real-time Fraud Scoring Service: Simulated Case Study (learn the process)
• Personal Capstone Project: Greenfield Project (learn the practice/art)
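For a sense of scale, an Iris Dataset Classification exercise can be only a few lines of scikit-learn. The sketch below is one possible version, not the actual course exercise: the k-nearest-neighbors classifier, the 70/30 split, and the fixed random seed are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The Iris data ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the flowers, stratified so each species stays represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Assumed model choice: classify by the 5 nearest training examples.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Fraction of held-out flowers classified correctly.
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the data is clean and the task is small, the student's attention can stay on the tooling and on the train/validate workflow rather than on data cleaning.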
Theory
Application
Synthesis
$$$ PROFIT!!
Just ask them! (and be flexible)
Treat them like customers (because they are)
Always Validate!
Metrics
Iteration 2: Data!
METRICS
METRICS EVERYWHERE
Saturday, April 9, 2011
Metrics
• Commits
• Pull Requests
• Passing Tests
• Etc.
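One way such metrics could be put to use is a small per-cohort summary. The sketch below is purely illustrative: the record layout, the student identifiers, the numbers, and the 60% threshold are all hypothetical and do not come from the talk.

```python
from statistics import mean

# Hypothetical weekly activity records; names and numbers are made up.
records = [
    {"student": "a", "commits": 42, "pull_requests": 5, "tests_passed": 18, "tests_total": 20},
    {"student": "b", "commits": 17, "pull_requests": 2, "tests_passed": 9, "tests_total": 20},
    {"student": "c", "commits": 30, "pull_requests": 4, "tests_passed": 20, "tests_total": 20},
]


def pass_rate(record):
    """Fraction of the exercise's tests that currently pass for one student."""
    return record["tests_passed"] / record["tests_total"]


# Cohort-level view: average pass rate across all students.
cohort_pass_rate = mean(pass_rate(r) for r in records)

# Student-level view: flag anyone below an (assumed) 60% pass rate.
needs_attention = [r["student"] for r in records if pass_rate(r) < 0.6]

print(f"cohort pass rate: {cohort_pass_rate:.2f}")
print("check in with:", needs_attention)
```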
Curriculum as Product
Learning Techniques
Industry Techniques
Source: http://en.wikipedia.org/wiki/Extreme_programming
Source: http://lostechies.com/scottreynolds/2009/10/07/how-we-do-things-tdd-bdd/
Code Reviews
Source: http://agile.dzone.com/articles/re-pair-programming
Our House
@Zipfian (now Galvanize)
source: http://www.sebastienmillon.com/Rainbow-Immersion-Therapy-Art-Print-15
Methodology
Community
Education
Industry
Meetup
Student Groups
Corporate Training
• Outcomes focused
• Project-based curriculum using real datasets
• Guest lectures from leaders in the field
• Mock interviews and hiring preparation
• Full instructional staff + personal mentorship
Employment
Source: http://www.nerdwallet.com/nerdscholar/grad_surveys/highest-employment-rates
Highest Employment Rates (2012)
University of Massachusetts-Amherst School of Nursing: 98%
Georgetown University McDonough School of Business: 94%
Michigan State University College of Nursing: 92%
Syracuse University School of Architecture: 90%
University of Massachusetts-Amherst Isenberg School of Management: 90%
Michigan State University School of Hospitality Business: 89%
New York University: 88%
Boston College Connell School of Nursing: 88%
Boston College Carroll School of Management: 87%
Case Western Reserve University Frances Payne Bolton School of Nursing: 86%
U.S. News and World Report Ranking
1. Princeton University
2. Harvard University
3. Yale University
4. Columbia University
5. Stanford University
6. University of Chicago
7. Duke University
8. MIT
9. University of Pennsylvania
10. California Institute of Technology
Timeline
Data Science Immersive
Weeks: 0, 8, 10.5, 12
STRUCTURED CURRICULUM
HIRING DAY
CAPSTONE PROJECT
GRADUATION
INTERVIEWS
Industry Student Projects
What We Look For
• Working knowledge of programming
• Background in a quantitative discipline
• Comfortable with mathematics and statistics
• Child-like curiosity
Our Students
Educational Background (bar chart): BS, MS, PhD
Disciplines (bar chart): Software Engineering, Analysts, Finance/Economics, Engineering, Physics, Physical Sciences, Mathematics, Statistics, Astronomy, Linguistics, Professional Poker
Data Science Immersive
Masters in Data Science
Data Engineering Immersive
Weekend Workshops
Immersive
Masters
(not to scale)
Masters of Science - 1 year (Starts in Spring)
http://www.galvanizeu.com/request-info
Goals
Get Involved
• Present a guest lecture or share a data story
• Donate datasets and propose projects
• Sponsor a scholarship
• Attend our Hiring Day
We’re Hiring!
• Full-time Instructors
• TAs
• Mentors (volunteer)
Questions?
Thank You!
Jonathan Dinu
Director of Education, Galvanize
[email protected]
@clearspandex