View
1
Download
0
Category
Preview:
Citation preview
Data Management for
Big Data Analytics
23 - 31 July PKU
What do data analysts do?
PROGRAMMING LANGUAGES
SQL70%
R57%
PYTHON54%
BASH24%JAVA18%
JAVASCRIPT17%
VISUAL BASIC / VBA13%
C++9%
SCALA8%
C#8%
C8%
SAS5%
PERL5%
RUBY3%
GO1%
OCTAVE2%
MATLAB9%
SALARY MEDIAN AND IQR (US DOLLARS)
Range/Median
Lang
uage
s
SHARE OF RESPONDENTS
0 50K 100K 150K 200K
GoOctave
RubySASPerl
C#C
ScalaMatlab
C++Visual Basic/VBA
JavaScriptJavaBash
PythonR
SQL
Source: O’Reilly Data Science Salary Survey 2016
Source: O’Reilly Data Science Salary Survey 2015
0%
10%
20%
30%
40%
50%
60%
70%
Couc
hbas
eG
oG
raph
Lab
Ora
cle
Exas
cale
Aste
r Dat
a (T
erad
ata)
Ora
cle
BIAm
azon
Dyn
amoD
BIB
MVo
wpa
l Wab
bit
EMC/
Gre
enpl
umKN
IME
Spotfir
eM
ahou
tN
eo4J
Mat
hem
atic
aSt
ata
Stor
mVe
rtic
aM
icro
stra
tegy
Pent
aho
Qlik
View
Cogn
osLI
BSVM
Map
RRu
byN
etez
za (I
BM)
Splu
nkG
oogl
e Bi
gQue
ry/F
usio
n Ta
bles
Rapi
dMin
erSA
P H
ANA
Busi
ness
Obj
ects
IBM
DB2
Goo
gle
Imag
e Ch
arts
API
Mat
lab
Redi
sG
oogl
e Ch
art T
ools
/ Im
age
API
C#H
orto
nwor
ksCa
ssan
dra
Tera
data
SPSSPe
rlAm
azon
Ela
stic
Map
Redu
ce (E
MR)
Hba
seW
eka
Amaz
on R
edSh
iftPigC
SQLi
teSc
ala
Pow
erPi
vot
C++
SAS
Apac
he H
adoo
pM
ongo
DB
Visu
al B
asic
/VBA
Clou
dera
Spar
kH
ive
Hom
egro
wn
anal
ysis
tool
sD
3O
racl
ePo
stgr
eSQ
LJa
vaM
atpl
otlib
(Pyt
hon)
Java
Scrip
tTa
blea
uM
icro
soft
SQ
L Se
rver
ggpl
otPy
thon
: num
py, s
cipy
, sci
kit-
lear
nM
ySQ
LRPy
thon
Exce
lSQ
L
TOOLSLANGUAGES, DATA PLATFORMS, ANALYTICS
Shar
e of
Res
pond
ents
Tool: language, data platform, analytics Tool: language, data platform, analytics
Source: O’Reilly Data Science Salary Survey 2014
4 Two exceptions were “Natural Language/Text Processing” and “Networks/Social GraphProcessing,"” which are less tools than they are types of data analysis.
bases, Hadoop distributions, visualization applications, businessintelligence (BI) programs, operating systems, or statistical pack‐ages.4 One hundred and fourteen tools were present on the list, butover 200 more were manually entered in the “other” fields.
Figure 1-10. Most commonly used tools
Just as in the previous year’s salary survey, SQL was the most com‐monly used tool (aside from operating systems); even with the rapidinflux of new data technology, there is no sign that SQL is going
12 | 2014 Data Science Salary Survey
Source: O’Reilly Data Science Salary Survey 2013
3. Correlations were tested using a Pearson’s chi square test with p=.05.
That SQL/RDB is the top bar is no surprise: accessing data is the meatand potatoes of data analysis, and has not been displaced by othertools. The preponderance of R and Python usage is more surprising—operating systems aside, these were the two most commonly usedindividual tools, even above Excel, which for years has been the go-tooption for spreadsheets and surface-level analysis. R and Python arelikely popular because they are easily accessible and effective opensource tools for analysis. More traditional statistical programs such asSAS and SPSS were far less common than R and Python.
By counting tool usage, we are only scratching the surface: who exactlyuses these tools? In comparing usage of R/Python and Excel, we hadhypothesized that it would be possible to categorize respondents asusers of one or the other: those who use a wider variety of tools, largelyopen source, including R, Python, and some Hadoop, and those whouse Excel but few tools beside it.
Python and R correlate with each other—a respondent who uses oneis more likely to use the other—but neither correlates with Excel (neg‐atively or positively): their usage (joint or separate) does not predictwhether a respondent would also use Excel. However, if we look at allcorrelations between all pairs of tools, we can see a pattern that, to anextent, divides respondents. The significant positive correlations canbe drawn as edges between tools as nodes, producing a graph with twomain clusters.3
Tool Usage | 7
Future data scientists’ favorite tools
wrangling analytics
wrangling analytics(up to 80%)
Challenges• 4 Vs of big data
• Volume - data is large
• Variety - data comes in different formats
• Veracity - data is not there/uncertain/dirty
• Velocity - speed of change
Volume challenges
• Even scanning data can take hours/days/weeks
• How to get to the right data?
• Precise answers to queries are impossible
• Hence need to approximate
Variety challenges• Different data formats
• Still much of the data is stored in relational DBMSs
• But other models catch up
• Most active these days is graph data
• We will look at it a lot
Veracity challenges• How to deal with uncertainty or incompleteness?
Relational databases (SQL) are really bad at it
• What to do if data is structured under a different schema, or different bits of data reside in different databases? (Data exchange and integration)
• What to do if data is supplemented with additional knowledge, e.g., an ontology, to compensate for missing data?
Course structure• A quick of reminder of the basics of relational
databases
• SQL: how well do we understand it?
• Volume: Conjunctive queries (many-way joins)
• optimisation
• approximation
• Volume: scale independence
• Variety: graph databases
• theoretical languages
• property graphs in Neo4j
• Variety: RDF data
• Variety: tree-structured data (XML) - depending on time
• Veracity: incomplete information and correct answers
• Veracity: data integration and exchange
• Veracity: answering queries with the help of ontologies
Evaluation• Project
• There is a long list of papers on the web
• Choose one of them
• The goal is to write an essay, 6-8 pages
• It must present a summary of the paper that would be understood by someone who has not read the paper
• It should also provide some of your own ideas or further investigation about the paper
• Examples:
• analysis of the followup literature to see how these ideas were used
• ideas on improving algorithms in the paper, perhaps in some special cases
• an implementation of a theoretical algorithm to see how it performs
2-stage process• Stage 1: this week (and weekend), choose the
paper, read it quickly, and decide what you want to do in addition to summarising its ideas
• Have a quick presentation Tuesday 31 July (during the last 2 hours of the course)
• The do the proper writeup and email it to me, libkin@gmail.com, before To Be Announced
Recommended