Hiring data scientists sure is expensive. One way to afford top talent is to stop throwing your money away on costly "big data" software that over-promises and under-delivers. This talk will offer an opinionated definition of data science, argue why free & open source software is usually the right choice for data scientists, and describe some of the leading free & open source software tools for data science available today.
Put Down That Checkbook! Big Data without the Big Bucks
Charlie Greenbacker Director of Data Science
Altamira Technologies Corporation
Agenda
• What is a Data Scientist?
• Why use Open Source Software (OSS)?
• Survey of OSS Tools for Data Science
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
Best reason for not finishing PhD
@ExploreAltamira
WHAT IS A DATA SCIENTIST?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
Paul Cooper, ITProPortal.com http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Computer Programming
Mathematics & Analytic Methodology
Distributed Computing & Big Data
Data Science
Statistical Analysis
Data Mining
Machine Learning
Natural Language Processing
Social Network Analysis
Data Visualization
Domain Knowledge & Communication Skills
etc.
Altamira Technologies Corporation 2014
WHY USE OSS?
What is Open Source Software (OSS)?
The Open Source Definition:
1. Free Redistribution
2. Source Code
3. Derived Works
more: opensource.org
photo: Karen (https://flic.kr/p/5njby2)
THERE ARE NO SILVER BULLETS.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.
photo: Valugi (http://bit.ly/1jrvVBC)
BUDGETS DON’T SCALE.
SURVEY OF OSS TOOLS FOR DATA SCIENCE
Statistical Analysis
Name: R
Creator: Gentleman, Ihaka, et al.
License: GPL Version 2
Website: r-project.org
Source: cran.us.r-project.org/src/base/
Features:
– Language & environment for statistical computing & viz
– Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
– 5000+ packages available in CRAN repository
Data Mining
Name: Pandas
Creator: Wes McKinney, et al.
License: BSD 3-Clause License
Website: pandas.pydata.org
Source: github.com/pydata/pandas
Features:
– Data analysis workflow in Python
– DataFrame object for fast manipulation & indexing
– Tools for reading & writing data between formats
– Label-based slicing, indexing, and subsetting of data
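A minimal sketch of the Pandas features listed above: reading data, label-based indexing, subsetting, and writing to another format. The city names and population figures are made up for illustration; they are not from the talk.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for this example only.
CSV_TEXT = """city,year,population
Springfield,2013,100
Springfield,2014,110
Shelbyville,2013,80
Shelbyville,2014,90
"""

# Read CSV text into a DataFrame (pd.read_csv also accepts file paths).
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Label-based indexing: use the city name as the row label.
by_city = df.set_index("city")
springfield = by_city.loc["Springfield"]  # all rows labeled "Springfield"

# Subsetting: boolean mask on rows, then select one column by label.
pop_2014 = df.loc[df["year"] == 2014, "population"]

# Writing between formats: serialize the same table as JSON records.
as_json = df.to_json(orient="records")

print(springfield)
print(pop_2014.tolist())
print(as_json)
```

The same DataFrame can be written out with other `to_*` methods (for example `to_csv`), which is what makes it handy as a hub between formats.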
Data Mining
Name: Impala
Creator: Cloudera
License: Apache License 2.0
Website: impala.io
Source: github.com/cloudera/impala
Features:
– MPP query engine implemented on Hadoop
– Low latency, high concurrency SQL & BI queries
– Same interfaces as Apache Hive, but ~24x faster
– Written in C++; does not use MapReduce
Machine Learning
Name: Mahout
Creator: ASF
License: Apache License 2.0
Website: mahout.apache.org
Source: svn.apache.org/viewvc/mahout
Features:
– Distributed/scalable ML library for Hadoop
– Classification, Clustering, Collaborative filtering
– Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
Name: Scikit-learn
Creator: Cournapeau, et al.
License: BSD 3-Clause License
Website: scikit-learn.org
Source: github.com/scikit-learn/scikit-learn
Features:
– ML library for Python built on NumPy, SciPy, matplotlib
– Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
– SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
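A minimal sketch of how several of the scikit-learn pieces named above (preprocessing, PCA, k-NN, cross-validation) fit together. It assumes a recent scikit-learn release, where cross-validation lives in `sklearn.model_selection`, and uses the bundled iris dataset purely as a stand-in; the parameter values are arbitrary illustrations.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled toy dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Chain preprocessing, dimensionality reduction, and a classifier.
model = make_pipeline(
    StandardScaler(),                     # preprocessing: zero mean, unit variance
    PCA(n_components=2),                  # dimensionality reduction: 4 -> 2 features
    KNeighborsClassifier(n_neighbors=5),  # k-NN classification
)

# Model selection: 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean())
```

Wrapping the steps in one pipeline means the scaler and PCA are refit on each training fold, so the cross-validation score is not contaminated by the held-out data.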
Machine Learning + NLP
Name: Mallet
Creator: UMass (McCallum, et al.)
License: Common Public License 1.0
Website: mallet.cs.umass.edu
Source: hg-iesl.cs.umass.edu/hg/mallet
Features:
– Java-based “Machine Learning for Language Toolkit”
– Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
– Efficient implementation of LDA for topic modeling
Natural Language Processing
Name: NLTK
Creator: Bird, Loper, et al.
License: Apache License 2.0
Website: nltk.org
Source: github.com/nltk/nltk
Features:
– Natural Language Toolkit for Python
– Built-in support for dozens of corpora & trained models
– Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
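A minimal sketch of two of the NLTK capabilities listed above, tokenization and stemming. It deliberately sticks to `wordpunct_tokenize` and `PorterStemmer`, which are rule-based and need no corpus download; the sample sentence is made up for illustration.

```python
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence, invented for this example.
text = "Data scientists are analyzing data sets and building models."

# Tokenization: split into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(text.lower())

# Keep alphabetic tokens only (drops the trailing period).
words = [t for t in tokens if t.isalpha()]

# Stemming: reduce inflected forms to a common stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in words]

# Count how often each stem occurs.
counts = Counter(stems)
print(counts.most_common(3))
```

Tagging, parsing, and the bundled corpora generally require fetching extra data with `nltk.download(...)` first, which is why they are left out of this sketch.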
Natural Language Processing
Name: Stanford CoreNLP
Creator: Stanford NLP Group
License: GPL Version 2
Website: nlp.stanford.edu/software/corenlp.shtml
Source: github.com/stanfordnlp/CoreNLP
Features:
– Suite of high-quality, Java-based NLP tools
– Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
– Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
Name: CLAVIN
Creator: Berico Technologies
License: Apache License 2.0
Website: clavin.io
Source: github.com/Berico-Technologies/CLAVIN
Features:
– Extracts location names from text, resolves to gazetteer
– Employs context-based geospatial entity resolution
– ~75% accuracy, processes 1M documents per hour
– Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
Name: NetworkX
Creator: Los Alamos National Lab
License: BSD 3-Clause License
Website: networkx.github.io
Source: github.com/networkx/networkx
Features:
– Python structures for graphs, digraphs, & multigraphs
– Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
– Provides standard graph algorithms & analysis metrics
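A minimal sketch of the NetworkX workflow described above: build a small graph, then compute a few standard metrics. The five-person network is a made-up toy example, not data from the talk.

```python
import networkx as nx

# Hypothetical toy social network: a triangle with a two-person chain attached.
G = nx.Graph()
G.add_edges_from([
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),  # triangle
    ("cat", "dan"), ("dan", "eve"),                  # chain hanging off "cat"
])

# Standard analysis metrics.
degree = dict(G.degree())                   # number of neighbors per node
betweenness = nx.betweenness_centrality(G)  # how often a node sits on shortest paths
clustering = nx.clustering(G)               # fraction of a node's neighbors that are linked

# Standard graph algorithm: shortest path between two people.
path = nx.shortest_path(G, "ann", "eve")

print("most central:", max(betweenness, key=betweenness.get))
print("path:", path)
```

Swapping `nx.Graph` for `nx.DiGraph` or `nx.MultiGraph` gives the directed and multi-edge structures mentioned in the feature list, with largely the same API.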
Social Network Analysis
Name: Gephi
Creator: UTC France
License: GPL Version 3
Website: gephi.org
Source: github.com/gephi/gephi
Features:
– Network analysis and visualization package for Java
– Dynamic network analysis with temporal filtering
– Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
Data Visualization
Name: D3.js
Creator: Mike Bostock
License: BSD 3-Clause License
Website: d3js.org
Source: github.com/mbostock/d3
Features:
– JavaScript library based on HTML, SVG, and CSS
– Binds data to DOM & enables transformations
– ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
Name: Lumify
Creator: Altamira
License: Apache License 2.0
Website: lumify.io
Source: github.com/altamiracorp/lumify
Features:
– Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
– Integrates structured data, text, images, video
– Cell-level security & access controls
– Live, shared collaborative workspaces
Final Thought…
Save your $$$ for:
• People – salaries, training, etc.
• Resources – hardware, AWS, etc.
• Proprietary software – if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
Springer’s
open source software for data scientists
oss4ds.com
Charlie Greenbacker @greenbacker | oss4ds.com