
Put Down That Checkbook! - Big Data without the Big Bucks


DESCRIPTION

Hiring data scientists sure is expensive. One way to afford top talent is to stop throwing your money away on costly "big data" software that over-promises and under-delivers. This talk will offer an opinionated definition of data science, argue why free & open source software is usually the right choice for data scientists, and describe some of the leading free & open source software tools for data science available today.


Page 1: Put Down That Checkbook! - Big Data without the Big Bucks
Page 2: Put Down That Checkbook! - Big Data without the Big Bucks

Put Down That Checkbook! Big Data without the Big Bucks

Charlie Greenbacker
Director of Data Science

Altamira Technologies Corporation

Page 3: Put Down That Checkbook! - Big Data without the Big Bucks

Agenda

•  What is a Data Scientist?
•  Why use Open Source Software (OSS)?
•  Survey of OSS Tools for Data Science

Page 4: Put Down That Checkbook! - Big Data without the Big Bucks

About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable

photo: Columbia Pictures

Page 5: Put Down That Checkbook! - Big Data without the Big Bucks

Best reason for not finishing PhD

Page 6: Put Down That Checkbook! - Big Data without the Big Bucks

@ExploreAltamira

Page 7: Put Down That Checkbook! - Big Data without the Big Bucks

WHAT IS A DATA SCIENTIST?

Page 8: Put Down That Checkbook! - Big Data without the Big Bucks
Page 9: Put Down That Checkbook! - Big Data without the Big Bucks
Page 10: Put Down That Checkbook! - Big Data without the Big Bucks
Page 11: Put Down That Checkbook! - Big Data without the Big Bucks

credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)

Page 12: Put Down That Checkbook! - Big Data without the Big Bucks

“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”

Paul Cooper, ITProPortal.com http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/

Page 13: Put Down That Checkbook! - Big Data without the Big Bucks

Computer Programming

Mathematics & Analytic Methodology

Distributed Computing & Big Data

Data Science

Statistical Analysis

Data Mining

Machine Learning

Natural Language Processing

Social Network Analysis

Data Visualization

Domain Knowledge & Communication Skills

etc.

Altamira Technologies Corporation 2014

Page 14: Put Down That Checkbook! - Big Data without the Big Bucks

WHY USE OSS?

Page 15: Put Down That Checkbook! - Big Data without the Big Bucks

What is Open Source Software (OSS)?

The Open Source Definition:

1.  Free Redistribution
2.  Source Code
3.  Derived Works

more: opensource.org

Page 16: Put Down That Checkbook! - Big Data without the Big Bucks

WHY USE OSS?

Page 17: Put Down That Checkbook! - Big Data without the Big Bucks

photo: Karen (https://flic.kr/p/5njby2)

THERE ARE NO SILVER BULLETS.

Page 18: Put Down That Checkbook! - Big Data without the Big Bucks

photo: Paul Inkles (https://flic.kr/p/e2QMS5)

IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.

Page 19: Put Down That Checkbook! - Big Data without the Big Bucks

photo: Valugi (http://bit.ly/1jrvVBC)

BUDGETS DON’T SCALE.

Page 20: Put Down That Checkbook! - Big Data without the Big Bucks

SURVEY OF OSS TOOLS FOR DATA SCIENCE

Page 21: Put Down That Checkbook! - Big Data without the Big Bucks

Statistical Analysis
Name: R
Creator: Gentleman, Ihaka, et al.
License: GPL Version 2
Website: r-project.org
Source: cran.us.r-project.org/src/base/
Features:

–  Language & environment for statistical computing & viz
–  Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
–  5000+ packages available in CRAN repository

Page 22: Put Down That Checkbook! - Big Data without the Big Bucks

Data Mining
Name: Pandas
Creator: Wes McKinney, et al.
License: BSD 3-Clause License
Website: pandas.pydata.org
Source: github.com/pydata/pandas
Features:

–  Data analysis workflow in Python
–  DataFrame object for fast manipulation & indexing
–  Tools for reading & writing data between formats
–  Label-based slicing, indexing, and subsetting of data
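The Pandas features above can be sketched in a few lines. This is only an illustration: the sales table, its column names, and the threshold are invented for the example and are not from the talk.

```python
import io

import pandas as pd

# Hypothetical sample data, invented purely for illustration.
csv_text = """city,year,sales
Austin,2013,120
Austin,2014,150
Boston,2013,90
Boston,2014,110
"""

# Read CSV text into a DataFrame with a labeled (city, year) index.
df = pd.read_csv(io.StringIO(csv_text), index_col=["city", "year"])

# Label-based slicing: every row whose city label is "Austin".
austin = df.loc["Austin"]

# Label-based indexing: one value addressed by its (city, year) label.
boston_2014 = df.loc[("Boston", 2014), "sales"]

# Subsetting with a boolean mask on a column.
big_years = df[df["sales"] > 100]

# Write the subset out in a different format (JSON records).
json_text = big_years.reset_index().to_json(orient="records")
print(json_text)
```

The same DataFrame can be written with `to_csv` or other `to_*` writers, which is the "between formats" point on the slide.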

Page 23: Put Down That Checkbook! - Big Data without the Big Bucks

Data Mining
Name: Impala
Creator: Cloudera
License: Apache License 2.0
Website: impala.io
Source: github.com/cloudera/impala
Features:

–  MPP query engine implemented on Hadoop
–  Low latency, high concurrency SQL & BI queries
–  Same interfaces as Apache Hive, but ~24x faster
–  Written in C++; does not use MapReduce

Page 24: Put Down That Checkbook! - Big Data without the Big Bucks

Machine Learning
Name: Mahout
Creator: ASF
License: Apache License 2.0
Website: mahout.apache.org
Source: svn.apache.org/viewvc/mahout
Features:

–  Distributed/scalable ML library for Hadoop
–  Classification, Clustering, Collaborative filtering
–  Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.

Page 25: Put Down That Checkbook! - Big Data without the Big Bucks

Machine Learning
Name: Scikit-learn
Creator: Cournapeau, et al.
License: BSD 3-Clause License
Website: scikit-learn.org
Source: github.com/scikit-learn/scikit-learn
Features:

–  ML library for Python built on NumPy, SciPy, matplotlib
–  Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
–  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
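As a rough sketch of how several of these Scikit-learn pieces fit together, the example below chains preprocessing, PCA, and k-NN, then scores the chain with cross-validation. It assumes a recent scikit-learn release (where cross-validation lives in `sklearn.model_selection`) and uses the bundled Iris dataset; the parameter values are arbitrary choices for illustration, not recommendations from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Preprocessing -> dimensionality reduction -> classification in one pipeline.
model = make_pipeline(
    StandardScaler(),                     # preprocessing: zero mean, unit variance
    PCA(n_components=2),                  # dimensionality reduction to 2 components
    KNeighborsClassifier(n_neighbors=5),  # k-NN classifier (k chosen arbitrarily)
)

# Model selection: 5-fold cross-validated accuracy for the whole pipeline.
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy: %.3f" % scores.mean())
```

Wrapping the steps in a pipeline means the scaler and PCA are refit on each training fold, so the held-out fold never leaks into preprocessing.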

Page 26: Put Down That Checkbook! - Big Data without the Big Bucks

Machine Learning + NLP
Name: Mallet
Creator: UMass (McCallum, et al.)
License: Common Public License 1.0
Website: mallet.cs.umass.edu
Source: hg-iesl.cs.umass.edu/hg/mallet
Features:

–  Java-based “Machine Learning for Language Toolkit”
–  Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
–  Efficient implementation of LDA for topic modeling

Page 27: Put Down That Checkbook! - Big Data without the Big Bucks

Natural Language Processing
Name: NLTK
Creator: Bird, Loper, et al.
License: Apache License 2.0
Website: nltk.org
Source: github.com/nltk/nltk
Features:

–  Natural Language Toolkit for Python
–  Built-in support for dozens of corpora & trained models
–  Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
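A minimal sketch of NLTK tokenization, stemming, and classification follows. It deliberately uses only components that need no corpus download (`RegexpTokenizer`, `PorterStemmer`, `NaiveBayesClassifier`); the four training sentences, the labels, and the `features` helper are invented for illustration.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")  # tokenization: runs of word characters
stemmer = PorterStemmer()            # stemming: e.g. "libraries" and "library" share a stem


def features(text):
    """Hypothetical helper: bag-of-stems feature dict for one document."""
    tokens = tokenizer.tokenize(text.lower())
    return {stemmer.stem(tok): True for tok in tokens}


# Tiny invented training set; a real task would need far more data.
train = [
    ("free open source tools", "oss"),
    ("open source libraries are free", "oss"),
    ("expensive licensed software", "proprietary"),
    ("costly proprietary licenses", "proprietary"),
]

# Classification: train a naive Bayes model on the stemmed features.
classifier = NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)

label = classifier.classify(features("open source library"))
print(label)
```

Taggers, parsers, and the bundled corpora work the same way but first require fetching their data with `nltk.download`.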

Page 28: Put Down That Checkbook! - Big Data without the Big Bucks

Natural Language Processing
Name: Stanford CoreNLP
Creator: Stanford NLP Group
License: GPL Version 2
Website: nlp.stanford.edu/software/corenlp.shtml
Source: github.com/stanfordnlp/CoreNLP
Features:

–  Suite of high-quality, Java-based NLP tools
–  Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
–  Includes models for English, Chinese, Arabic, German

Page 29: Put Down That Checkbook! - Big Data without the Big Bucks

NLP + Geospatial Analysis
Name: CLAVIN
Creator: Berico Technologies
License: Apache License 2.0
Website: clavin.io
Source: github.com/Berico-Technologies/CLAVIN
Features:

–  Extracts location names from text, resolves to gazetteer
–  Employs context-based geospatial entity resolution
–  ~75% accuracy, processes 1M documents per hour
–  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org

Page 30: Put Down That Checkbook! - Big Data without the Big Bucks

Social Network Analysis
Name: NetworkX
Creator: Los Alamos National Lab
License: BSD 3-Clause License
Website: networkx.github.io
Source: github.com/networkx/networkx
Features:

–  Python structures for graphs, digraphs, & multigraphs
–  Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
–  Provides standard graph algorithms & analysis metrics
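To make the NetworkX feature list concrete, here is a small sketch that builds an undirected graph and computes a few standard metrics. The five-person network is made up for the example.

```python
import networkx as nx

# Hypothetical five-person social network, invented for illustration.
G = nx.Graph()
G.add_edges_from([
    ("ann", "bob"),
    ("ann", "cat"),
    ("bob", "cat"),
    ("cat", "dan"),
    ("dan", "eve"),
])

# Standard analysis metrics, each returned as a {node: score} dict.
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
clustering = nx.clustering(G)

# A standard graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "ann", "eve")

# The node that sits on the most shortest paths between other nodes.
broker = max(betweenness, key=betweenness.get)
print(broker, path)
```

Swapping `nx.Graph` for `nx.DiGraph` or `nx.MultiGraph` gives the directed and multi-edge variants the slide mentions.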

Page 31: Put Down That Checkbook! - Big Data without the Big Bucks

Social Network Analysis
Name: Gephi
Creator: UTC France
License: GPL Version 3
Website: gephi.org
Source: github.com/gephi/gephi
Features:

–  Network analysis and visualization package for Java
–  Dynamic network analysis with temporal filtering
–  Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.

Page 32: Put Down That Checkbook! - Big Data without the Big Bucks

Data Visualization
Name: D3.js
Creator: Mike Bostock
License: BSD 3-Clause License
Website: d3js.org
Source: github.com/mbostock/d3
Features:

–  JavaScript library based on HTML, SVG, and CSS
–  Binds data to DOM & enables transformations
–  ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.

Page 33: Put Down That Checkbook! - Big Data without the Big Bucks

Fusion, Analysis, and Visualization
Name: Lumify
Creator: Altamira
License: Apache License 2.0
Website: lumify.io
Source: github.com/altamiracorp/lumify
Features:

–  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
–  Integrates structured data, text, images, video
–  Cell-level security & access controls
–  Live, shared collaborative workspaces

Page 34: Put Down That Checkbook! - Big Data without the Big Bucks
Page 35: Put Down That Checkbook! - Big Data without the Big Bucks

Final Thought…

Save your $$$ for:

People
–  salaries, training, etc.

Resources
–  hardware, AWS, etc.

Proprietary software
–  if no viable OSS alternative exists

photo: Brett Weinstein (http://bit.ly/1dHXvqJ)

FINAL THOUGHT

Springer’s

Page 36: Put Down That Checkbook! - Big Data without the Big Bucks

open source software for data scientists

oss4ds.com

Page 37: Put Down That Checkbook! - Big Data without the Big Bucks

Charlie Greenbacker @greenbacker | oss4ds.com