As presented at Great Wide Open on 02 April 2014 in Atlanta, GA: http://www.gwoapp.com/events/open-source-software-for-data-scientists
Harvard Business Review called it "the sexiest job of the 21st century." These days, data scientists face an onslaught of companies pitching products that promise to solve all their problems. Is there such a thing as a "silver bullet" for data science, and is it worth the hefty price tag? This talk briefly discusses what data science is, argues that open source software is usually the right choice for data scientists, and examines some of the leading OSS tools for data science available today. Topics include statistical analysis, data mining, machine learning, natural language processing, and data visualization. Additional materials are provided on the presentation's companion website: oss4ds.com
Open Source Software for Data Scientists
Charlie Greenbacker, Director of Data Science 02 Apr 2014
Altamira Technologies Corporation 2014
Agenda
■ What is a Data Scientist?
■ Why use Open Source Software?
■ Survey of Open Source Software Tools:
¤ Statistical Analysis
¤ Data Mining
¤ Machine Learning
¤ Natural Language Processing
¤ Social Network Analysis
¤ Data Visualization
About me: @greenbacker Theories: popular tripe Methods: sloppy Conclusions: highly questionable
photo: Columbia Pictures
Best reason for not finishing PhD
@ExploreAltamira
What is a Data Scientist?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Paul Cooper, ITProPortal.com
“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
[Diagram: "Data Science" drawing on Computer Programming, Mathematics & Analytic Methodology, Distributed Computing & Big Data, Domain Knowledge & Communication Skills, etc.]
[Diagram labels: Statistical Analysis, Data Mining, Machine Learning, Natural Language Processing, Social Network Analysis, Data Visualization]
Why use Open Source Software?
photo: Karen (https://flic.kr/p/5njby2)
THERE ARE NO SILVER BULLETS.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.
photo: Valugi (http://bit.ly/1jrvVBC)
BUDGETS DON’T SCALE.
Survey of OSS Tools
Statistical Analysis
■ Name: R
■ Creator: Gentleman, Ihaka, et al.
■ License: GPL Version 2
■ Website: r-project.org
■ Source: cran.us.r-project.org/src/base/
■ Features:
¤ Language & environment for statistical computing & viz
¤ Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
¤ 5000+ packages available in CRAN repository
Data Mining
■ Name: Pandas
■ Creator: Wes McKinney, et al.
■ License: BSD 3-Clause License
■ Website: pandas.pydata.org
■ Source: github.com/pydata/pandas
■ Features:
¤ Data analysis workflow in Python
¤ DataFrame object for fast manipulation & indexing
¤ Tools for reading & writing data between formats
¤ Label-based slicing, indexing, and subsetting of data
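The Pandas features listed above can be sketched in a few lines. This is only an illustration: the CSV contents and all variable names are made up for the example, and it assumes a reasonably recent pandas release rather than the 2014 version discussed in the talk.

```python
import io
import json

import pandas as pd

# Toy data invented for this sketch; not from the talk.
csv_text = """name,category,units
alpha,a,10
beta,b,4
gamma,a,7
delta,b,12
"""

# Reading: parse CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Label-based indexing: use the "name" column as the row labels.
indexed = df.set_index("name")
units = indexed.loc["gamma", "units"]                 # single cell by label
subset = indexed.loc[["alpha", "delta"], ["units"]]   # rows and columns by label

# Subsetting with a boolean condition.
large = indexed[indexed["units"] > 5]

# Fast manipulation: aggregate units per category.
totals = indexed.groupby("category")["units"].sum()

# Writing: convert the filtered rows to JSON, a different format than the input.
json_text = large.reset_index().to_json(orient="records")
records = json.loads(json_text)

print(units)
print(totals)
print(records)
```

The same DataFrame can be written back out with `to_csv`, so one object covers the read, reshape, and write steps of a typical workflow.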
Data Mining
■ Name: Impala
■ Creator: Cloudera
■ License: Apache License 2.0
■ Website: impala.io
■ Source: github.com/cloudera/impala
■ Features:
¤ MPP query engine implemented on Hadoop
¤ Low latency, high concurrency SQL & BI queries
¤ Same interfaces as Apache Hive, but ~24x faster
¤ Written in C++; does not use MapReduce
Machine Learning
■ Name: Mahout
■ Creator: ASF
■ License: Apache License 2.0
■ Website: mahout.apache.org
■ Source: svn.apache.org/viewvc/mahout
■ Features:
¤ Distributed/scalable ML library for Hadoop
¤ Classification, Clustering, Collaborative filtering
¤ Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
■ Name: Scikit-learn
■ Creator: Cournapeau, et al.
■ License: BSD 3-Clause License
■ Website: scikit-learn.org
■ Source: github.com/scikit-learn/scikit-learn
■ Features:
¤ ML library for Python built on NumPy, SciPy, matplotlib
¤ Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
¤ SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
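As a rough sketch of how several of the Scikit-learn pieces named above fit together, the snippet below chains PCA (dimensionality reduction) with a k-NN classifier and scores the pipeline by cross-validation. It assumes a current scikit-learn release, where cross-validation lives in `sklearn.model_selection` (the module layout in 2014 was different), and it uses the iris dataset bundled with the library. The parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Reduce the four features to two components, then classify with k-NN.
# n_components=2 and n_neighbors=5 are illustrative values, not tuned.
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation; each score is the accuracy on one held-out fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```

Putting the PCA step inside the pipeline means it is refit on each training fold, so the held-out fold never influences the projection.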
Machine Learning + NLP
■ Name: Mallet
■ Creator: UMass (McCallum, et al.)
■ License: Common Public License 1.0
■ Website: mallet.cs.umass.edu
■ Source: hg-iesl.cs.umass.edu/hg/mallet
■ Features:
¤ Java-based “Machine Learning for Language Toolkit”
¤ Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
¤ Efficient implementation of LDA for topic modeling
Natural Language Processing
■ Name: NLTK
■ Creator: Bird, Loper, et al.
■ License: Apache License 2.0
■ Website: nltk.org
■ Source: github.com/nltk/nltk
■ Features:
¤ Natural Language Toolkit for Python
¤ Built-in support for dozens of corpora & trained models
¤ Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
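A minimal sketch of the tokenization and stemming features mentioned above. It deliberately sticks to NLTK components that need no downloaded corpora or models (a regex-based tokenizer and the Porter stemmer); the sample sentence is invented for the example. Tagging and parsing generally require fetching extra data with `nltk.download`, so they are left out here.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Sample sentence invented for this sketch.
text = "Data scientists are analyzing tagged corpora."

# Regex-based tokenizer: splits runs of word characters from punctuation.
tokens = wordpunct_tokenize(text)

# Porter stemmer: strips common English suffixes from each token.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```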
Natural Language Processing
■ Name: Stanford CoreNLP
■ Creator: Stanford NLP Group
■ License: GPL Version 2
■ Website: nlp.stanford.edu/software/corenlp.shtml
■ Source: github.com/stanfordnlp/CoreNLP
■ Features:
¤ Suite of high-quality, Java-based NLP tools
¤ Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
¤ Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
■ Name: CLAVIN
■ Creator: Berico Technologies
■ License: Apache License 2.0
■ Website: clavin.io
■ Source: github.com/Berico-Technologies/CLAVIN
■ Features:
¤ Extracts location names from text, resolves to gazetteer
¤ Employs context-based geospatial entity resolution
¤ ~75% accuracy, processes 1M documents per hour
¤ Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
■ Name: NetworkX
■ Creator: Los Alamos National Lab
■ License: BSD 3-Clause License
■ Website: networkx.github.io
■ Source: github.com/networkx/networkx
■ Features:
¤ Python structures for graphs, digraphs, & multigraphs
¤ Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
¤ Provides standard graph algorithms & analysis metrics
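To make the NetworkX features above concrete, here is a small sketch that builds an undirected graph and a directed graph and computes a few standard metrics. The node names and edges are invented for the example, and it assumes a recent NetworkX release.

```python
import networkx as nx

# Toy social network invented for this sketch: a triangle plus a short tail.
G = nx.Graph()
G.add_edges_from([
    ("ann", "bob"),
    ("bob", "cat"),
    ("cat", "ann"),
    ("cat", "dan"),
    ("dan", "eve"),
])

# Structure: number of neighbors per node.
degree = dict(G.degree())

# Analysis metrics: betweenness centrality and clustering coefficient.
betweenness = nx.betweenness_centrality(G)
clustering = nx.clustering(G)

# Standard graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "ann", "eve")

# Directed graphs use the same API, but edges have a direction.
D = nx.DiGraph()
D.add_edge("ann", "bob")
forward = D.has_edge("ann", "bob")
backward = D.has_edge("bob", "ann")

print(degree)
print(betweenness)
print(path)
```

In this toy graph "cat" bridges the triangle and the tail, which is why it ends up with the highest betweenness score.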
Social Network Analysis
■ Name: Gephi
■ Creator: UTC France
■ License: GPL Version 3
■ Website: gephi.org
■ Source: github.com/gephi/gephi
■ Features:
¤ Network analysis and visualization package written in Java
¤ Dynamic network analysis with temporal filtering
¤ Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
Data Visualization
■ Name: D3.js
■ Creator: Mike Bostock
■ License: BSD 3-Clause License
■ Website: d3js.org
■ Source: github.com/mbostock/d3
■ Features:
¤ JavaScript library based on HTML, SVG, and CSS
¤ Binds data to DOM & enables transformations
¤ ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
■ Name: Lumify
■ Creator: Altamira
■ License: Apache License 2.0
■ Website: lumify.io
■ Source: github.com/altamiracorp/lumify
■ Features:
¤ Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
¤ Integrates structured data, text, images, video
¤ Cell-level security & access controls
¤ Live, shared collaborative workspaces
Final Thought…
Save your $$$ for:
■ People
¤ salaries, training, etc.
■ Resources
¤ hardware, AWS, etc.
■ Proprietary software
¤ if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
open source software for data scientists
oss4ds.com
Charlie Greenbacker | @greenbacker oss4ds.com