@sanjacloud How To: Big Data Pipeline Using Open Source Software and Exclusively European Servers & Services

@sanjacloud - CH Open · @sanjacloud How To: Big Data Pipeline Using Open Source Software and Exclusively European Servers & Services


@sanjacloud

How To: Big Data Pipeline Using Open Source Software and Exclusively European Servers & Services


@thoughtkettle

Build or Buy?


@thoughtkettle

It depends.
● team/company structure
● privacy/regulations/compliance
● money
● vendor lock-in

Make it a conscious decision.


@thoughtkettle

What does Big Data need?
● infrastructure
● software
● people


@sanjacloud

Infrastructure - Core
● actual (Big) Data
● databases
● “cloud” aka some computers


@thoughtkettle

Infrastructure - Actions
● pulling data in
● data repository
● processing
● storage
● visualisation


@thoughtkettle

Infrastructure
● Sources
● Integration
● Store
● Processing
● Persistence
● Visualisations


@thoughtkettle

What does Big Data need?
● infrastructure
● software
● people


@thoughtkettle

Software
● querying
● learning
● acting


@thoughtkettle

Querying
● conceptually close to BI & Excel
● ad-hoc analysis of data
● now: queries across multiple sources


@thoughtkettle

Store

Persistence


@thoughtkettle

Batch Processing
● great for exploratory work
● store first, ask questions later
● high latency, high throughput


@sanjacloud

Example

“I have 4 different web servers and want to aggregate visitor logs.”


@sanjacloud

Typical Web Access Log

10.0.0.1 - - [04/May/2017:09:25:59 +0200] "GET / HTTP/1.1" 200 7356 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" "-"

JSON:
{
  "ip": "10.0.0.1",
  "date": "04/May/2017:09:25:59 +0200",
  "request": "GET / HTTP/1.1",
  "status_code": 200,
  "bytes": 7356,
  "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
}
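
The step from a raw access-log line to a JSON record could be sketched in Python as follows. This is only an illustration, not the tooling used in the talk: the pattern assumes an nginx/Apache-style combined layout like the one on the slide, and the names `LOG_PATTERN`, `parse_line` and `SAMPLE` are made up for this example.

```python
import json
import re

# Assumed layout: ip, identity/user fields ("-" when empty), [timestamp],
# "request", status, bytes, "referer", "user agent". Anything between the IP
# and the opening bracket of the timestamp is skipped.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) [^\[]*\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status_code>\d{3}) (?P<bytes>\d+) '
    r'"[^"]*" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> dict:
    """Turn one access-log line into a dict with typed numeric fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    record = match.groupdict()
    record["status_code"] = int(record["status_code"])
    record["bytes"] = int(record["bytes"])
    return record


# Sample line in the layout shown on the slide.
SAMPLE = (
    '10.0.0.1 - - [04/May/2017:09:25:59 +0200] "GET / HTTP/1.1" 200 7356 '
    '"-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) '
    'Gecko/20100101 Firefox/40.1" "-"'
)

record = parse_line(SAMPLE)
print(json.dumps(record, indent=2))
```

Keeping the status code and byte count as integers rather than strings makes later aggregation (sums, filters on status ranges) simpler.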


@sanjacloud

Basic Unix Commands

● How many successful visits did we have?
# cat log | awk '{if ($9 == "200") print}' | wc -l

● How many unique visitors did we get?
# cat log | awk '{print $1}' | sort -u | wc -l

● What’s our most successful page?
# cat log | awk '{top[$7]+=1} END{for (x in top) print x ":" top[x]}' | sort -t: -k2 -nr | head -n 1
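
For comparison, the same three questions can be answered in a few lines of Python. This is a rough sketch under assumptions: it splits each line on whitespace the way awk does and takes field 1 as the client IP, field 7 as the request path and field 9 as the status code (the combined log format). The function name `summarise` and the sample lines are invented for illustration.

```python
from collections import Counter


def summarise(lines):
    """Return (successful visits, unique visitors, most requested page)."""
    successful = 0
    visitors = set()
    pages = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip malformed or truncated lines
        visitors.add(fields[0])   # awk $1: client IP
        pages[fields[6]] += 1     # awk $7: request path
        if fields[8] == "200":    # awk $9: status code
            successful += 1
    top_page = pages.most_common(1)[0] if pages else None
    return successful, len(visitors), top_page


# Made-up sample lines, only the IP, path and status differ.
TEMPLATE = '{ip} - - [04/May/2017:09:25:59 +0200] "GET {path} HTTP/1.1" {status} 7356 "-" "-" "-"'
sample_log = [
    TEMPLATE.format(ip="10.0.0.1", path="/", status=200),
    TEMPLATE.format(ip="10.0.0.2", path="/about", status=200),
    TEMPLATE.format(ip="10.0.0.1", path="/missing", status=404),
    TEMPLATE.format(ip="10.0.0.3", path="/", status=200),
]

successful, unique_visitors, top_page = summarise(sample_log)
print(successful, unique_visitors, top_page)
```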


@sanjacloud

Good Old Days


@thoughtkettle

MapReduce: Store First


@thoughtkettle

MapReduce: Ask Questions
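
To make the "store first, ask questions later" idea concrete, here is a toy, single-process sketch of the MapReduce pattern applied to the four-web-server example. It only imitates the map, shuffle and reduce phases in memory; a real framework would distribute them across machines. The server names, the log lines and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (page, 1) pair for every successful request."""
    fields = line.split()
    if len(fields) >= 9 and fields[8] == "200":
        yield fields[6], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def run_job(logs_by_server):
    """Run map over every stored line, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(line)
        for lines in logs_by_server.values()
        for line in lines
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


def make_line(ip, path, status):
    # Helper that builds a made-up log line in the combined format.
    return f'{ip} - - [04/May/2017:09:25:59 +0200] "GET {path} HTTP/1.1" {status} 7356 "-" "-" "-"'


# Step 1 (store first): logs collected from four hypothetical servers.
logs_by_server = {
    "web1": [make_line("10.0.0.1", "/", 200), make_line("10.0.0.2", "/about", 200)],
    "web2": [make_line("10.0.0.3", "/", 200)],
    "web3": [make_line("10.0.0.1", "/missing", 404)],
    "web4": [make_line("10.0.0.4", "/", 200), make_line("10.0.0.4", "/about", 200)],
}

# Step 2 (ask questions): successful hits per page across all servers.
hits = run_job(logs_by_server)
print(hits)
```

Because the raw lines stay stored, a different question only needs a different `map_phase`; the shuffle and reduce steps remain unchanged.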


@thoughtkettle

Software
● querying
● learning
● acting


@thoughtkettle

(Machine) Learning
● train mathematical models on data
● use models to quickly find similar occurrences
● extrapolate from previous trends
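
As a minimal illustration of "train a model, then extrapolate", the sketch below fits a straight line to past values with ordinary least squares and uses it to predict the next one. The daily visit counts are invented and deliberately perfectly linear; real data would be noisier and would usually call for a proper library. `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Train: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Extrapolate: evaluate the fitted line at a new x."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: visits per day over five days.
days = [1, 2, 3, 4, 5]
visits = [120, 135, 150, 165, 180]

model = fit_line(days, visits)
forecast = predict(model, 6)
print(model, forecast)
```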


@thoughtkettle

Software
● querying
● learning
● acting


@thoughtkettle

Acting
● real-time reactions on incoming records
● computation when you already know the question
● very low latency


@thoughtkettle

Stream Processing
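
In contrast to batch jobs, a stream processor evaluates a question that is fixed in advance against each record as it arrives. The sketch below imitates this with a Python generator and a small sliding window; it is a single-process stand-in for a real streaming engine. The question ("are at least 3 of the last 5 responses server errors?"), the record layout and the name `error_alerts` are all assumptions made for the example.

```python
from collections import deque


def error_alerts(records, window=5, threshold=3):
    """Yield a record as soon as `threshold` of the last `window`
    records (including it) carry a 5xx status code."""
    recent = deque(maxlen=window)  # old entries drop out automatically
    for record in records:
        recent.append(record["status_code"] >= 500)
        if sum(recent) >= threshold:
            yield record


# Made-up incoming records; in practice these would arrive over time.
statuses = [200, 500, 200, 503, 500, 200, 200, 200]
stream = ({"id": i, "status_code": s} for i, s in enumerate(statuses))

alerts = list(error_alerts(stream))
print([a["id"] for a in alerts])
```

Only the last few records are kept in memory, which is what keeps per-record latency low regardless of how much data has already passed through.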


@thoughtkettle

In practice
● often a combination of querying, learning, acting
● example: fraud detection
  ○ ad-hoc queries to find fraud occurrences
  ○ training a model on the query output
  ○ using stream processing to quickly nail down new attempts
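
The three fraud-detection steps can be strung together in a deliberately simplified sketch. Everything here is hypothetical: the transaction records and amounts are made up, and the "model" is just a one-dimensional nearest-centroid rule (the midpoint between the mean legitimate amount and the mean fraudulent amount), chosen only to keep the example short.

```python
from statistics import mean

# Made-up historical transactions with known outcomes.
HISTORY = [
    {"amount": 20.0, "fraud": False},
    {"amount": 35.0, "fraud": False},
    {"amount": 50.0, "fraud": False},
    {"amount": 15.0, "fraud": False},
    {"amount": 900.0, "fraud": True},
    {"amount": 1100.0, "fraud": True},
]


def query_by_label(history, fraud):
    """Querying: ad-hoc selection of past occurrences by label."""
    return [t["amount"] for t in history if t["fraud"] == fraud]


def train_threshold(legit_amounts, fraud_amounts):
    """Learning: midpoint between the two class means."""
    return (mean(legit_amounts) + mean(fraud_amounts)) / 2


def flag_stream(transactions, threshold):
    """Acting: test each incoming transaction against the model."""
    for t in transactions:
        if t["amount"] > threshold:
            yield t


threshold = train_threshold(
    query_by_label(HISTORY, fraud=False),
    query_by_label(HISTORY, fraud=True),
)

# Made-up incoming transactions.
incoming = [
    {"id": 1, "amount": 40.0},
    {"id": 2, "amount": 980.0},
    {"id": 3, "amount": 25.0},
    {"id": 4, "amount": 700.0},
]
flagged = [t["id"] for t in flag_stream(incoming, threshold)]
print(threshold, flagged)
```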


@thoughtkettle

What does Big Data need?
● infrastructure
● software
● people


@thoughtkettle

BI: Answer a question.

Data Science: Find future opportunities.


@thoughtkettle

Don’t jump to conclusions.


@sanjacloud

Thank you!
