@sanjacloud
How To: Big Data Pipeline Using Open Source Software
and Exclusively European Servers & Services
https://infocus.emc.com/william_schmarzo/the-big-data-storymap/ (Big Data Storymap, Dell EMC, early 2013)
@thoughtkettle
Build or Buy?
It depends.
● team/company structure
● privacy/regulations/compliance
● money
● vendor lock-in
Make it a conscious decision.
What does Big Data need?
● infrastructure
● software
● people
Infrastructure - Core
● actual (Big) Data
● databases
● “cloud” aka some computers
Infrastructure - Actions
● pulling data in
● data repository
● processing
● storage
● visualisation
Infrastructure
Sources
Integration
Store
Processing
Persistence
Visualisations
What does Big Data need?
● infrastructure
● software
● people
Software
● querying
● learning
● acting
Querying
● conceptually close to BI & Excel
● ad-hoc analysis of data
● now: queries across multiple sources
Store
Persistence
Batch Processing
● great for exploratory work
● store first, ask questions later
● high latency, high throughput
“I have 4 different web servers and want to aggregate visitor logs.”
Example
10.0.0.1 - [04/May/2017:09:25:59 +0200] "GET / HTTP/1.1" 200 7356 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" "-"
JSON:
{
  "ip": "10.0.0.1",
  "date": "04/May/2017:09:25:59 +0200",
  "request": "GET / HTTP/1.1",
  "status_code": 200,
  "bytes": 7356,
  "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
}
Typical Web Access Log
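The log-line-to-JSON step above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: the regular expression, the `parse_line` helper, and the choice to drop the referrer field are assumptions made to mirror the JSON keys shown on the slide.

```python
import json
import re

# Sample line copied from the slide (it has a single "-" after the IP).
SAMPLE = (
    '10.0.0.1 - [04/May/2017:09:25:59 +0200] "GET / HTTP/1.1" 200 7356 "-" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" "-"'
)

# Assumption: standard combined-format logs carry two identity fields ("- -"),
# so the second one is treated as optional here.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+(?: \S+)? \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status_code>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Turn one access-log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status_code"] = int(record["status_code"])
    # A "-" in the size field means no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


print(json.dumps(parse_line(SAMPLE), indent=2))
```

Returning `None` for lines that do not match lets the caller decide whether to skip or log malformed input instead of stopping the whole import.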
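● How many successful visits did we have?
  # cat log | awk '{if ($8 == "200") print}' | wc -l
● How many unique visitors did we get?
  # cat log | awk '{print $1}' | sort -u | wc -l
● What’s our most successful page?
  # cat log | awk '{top[$6]+=1} END{for (x in top) print top[x] " " x}' | sort -rn | head -n 1
(Field numbers match the sample log line, which has a single "-" after the IP; the standard combined format has two, so there the status is $9 and the path is $7.)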
Basic Unix Commands
Good Old Days
MapReduce: Store First
MapReduce: Ask Questions
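The "store first, ask questions later" idea can be illustrated with a toy map/shuffle/reduce pass in plain Python. This is only a single-process sketch of the programming model, not a distributed framework; the stored `records` and the function names are hypothetical. The question asked here is "which page had the most successful hits?".

```python
from collections import defaultdict

# Hypothetical records that were stored earlier, before any question was known.
records = [
    {"request": "GET / HTTP/1.1", "status_code": 200},
    {"request": "GET /about HTTP/1.1", "status_code": 200},
    {"request": "GET / HTTP/1.1", "status_code": 404},
    {"request": "GET / HTTP/1.1", "status_code": 200},
]


def map_phase(items):
    """Map: emit a (path, 1) pair for every successful request."""
    for item in items:
        parts = item["request"].split()
        if item["status_code"] == 200 and len(parts) >= 2:
            yield parts[1], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(records)))
top_page = max(counts, key=counts.get)
print(counts, top_page)
```

In a real cluster the map and reduce functions run in parallel on many machines and the shuffle moves data between them, which is where the high latency and high throughput of batch processing come from.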
Software
● querying
● learning
● acting
(Machine) Learning
● train mathematical models on data
● use models to quickly find similar occurrences
● extrapolate from previous trends
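As a very small illustration of "train a model, then extrapolate from previous trends", the sketch below fits a straight line to daily visit counts with ordinary least squares and predicts the next day. The numbers are made up for the example, and a linear fit is just one possible model; the talk does not prescribe a specific one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: visits per day over five days.
days = [1, 2, 3, 4, 5]
visits = [110, 120, 130, 140, 150]

slope, intercept = fit_line(days, visits)   # "training"
forecast_day_6 = slope * 6 + intercept      # "extrapolating"
print(slope, intercept, forecast_day_6)
```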
Software
● querying
● learning
● acting
Acting
● real-time reactions on incoming records
● computation when you already know the question
● very low latency
Stream Processing
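Stream processing handles each record as it arrives, with the question fixed in advance. The sketch below keeps a sliding window of failed requests per IP and raises an alert once a threshold is reached. The class name, the record fields, the 60-second window and the threshold of 3 are all assumptions for illustration; a production setup would use a dedicated stream processing engine rather than a Python loop.

```python
from collections import defaultdict, deque


class FailureAlert:
    """Flag an IP once it produces too many failed requests within a time window."""

    def __init__(self, window_seconds=60, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.recent = defaultdict(deque)  # ip -> timestamps of recent failures

    def process(self, record):
        """Handle one incoming record; return True if it triggers an alert."""
        if record["status_code"] < 400:
            return False
        times = self.recent[record["ip"]]
        times.append(record["timestamp"])
        # Drop failures that have fallen out of the window.
        while times and times[0] <= record["timestamp"] - self.window_seconds:
            times.popleft()
        return len(times) >= self.threshold


# Hypothetical incoming events, timestamps in seconds.
events = [
    {"ip": "10.0.0.1", "timestamp": 0, "status_code": 200},
    {"ip": "10.0.0.2", "timestamp": 1, "status_code": 401},
    {"ip": "10.0.0.2", "timestamp": 2, "status_code": 401},
    {"ip": "10.0.0.1", "timestamp": 3, "status_code": 404},
    {"ip": "10.0.0.2", "timestamp": 4, "status_code": 401},
    {"ip": "10.0.0.2", "timestamp": 120, "status_code": 401},
]

detector = FailureAlert()
alert_times = [e["timestamp"] for e in events if detector.process(e)]
print(alert_times)
```

Only a small amount of state per IP is kept in memory, which is what makes the very low latency possible: no stored data set has to be scanned to answer the question.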
In practice
● often a combination of querying, learning, acting
● example: fraud detection
  ○ ad-hoc queries to find fraud occurrences
  ○ training a model on the query output
  ○ using stream processing to quickly nail down new attempts
What does Big Data need?
● infrastructure
● software
● people
BI vs. Data Science
● BI: Answer a question.
● Data Science: Find future opportunities.
Don’t jump to conclusions.
Thank you!