IBM offers four distinct data retrieval technologies: the traditional RDBMS, which primarily relies on indexes to speed access; the new BLU Acceleration columnar compression database; IBM PureData System for Analytics (IBM Netezza), which deploys racks of Field Programmable Gate Array (FPGA) processors to parse the data; and IBM InfoSphere BigInsights, which is the IBM distribution of Hadoop. Choices are good to have, but how do you choose which technology to apply to a particular business use case? In this session, you learn how these techniques differ, including their relative strengths and weaknesses, to help you make an informed choice.


Page 1: offers four distinct data retrieval technologies: the


1

Page 2: offers four distinct data retrieval technologies: the

2

Page 3: offers four distinct data retrieval technologies: the

3

Page 4: offers four distinct data retrieval technologies: the

4

Page 5: offers four distinct data retrieval technologies: the

5

Page 6: offers four distinct data retrieval technologies: the

6

Page 7: offers four distinct data retrieval technologies: the

http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.perf.doc%2Fdoc%2Fc0005424.html

7

Page 8: offers four distinct data retrieval technologies: the

8

Page 9: offers four distinct data retrieval technologies: the

9

Page 10: offers four distinct data retrieval technologies: the

10

Page 11: offers four distinct data retrieval technologies: the

11

Page 12: offers four distinct data retrieval technologies: the

12

Page 13: offers four distinct data retrieval technologies: the

Netezza_under_the_hood Feinsmith (page 8)

What we see here is how we increased the scan speeds. Each drive in the N2001 is capable of delivering about 130 megabytes per second of throughput (compared to approximately 120 MB/sec in the N1001).

What we had before was a 1 to 1 to 1 ratio: one drive to one FPGA core to one CPU core. The speed of the drive was the limiting factor in how fast the FPGA core could process the data, because the FPGA core could handle way more than that and the CPU core even more than the FPGA core.

So we now have more than one drive per FPGA core and per CPU core. Using basic math, we have about 2 1/2 drives per FPGA core and per CPU core. So that drives up the speed that data can be scanned and delivered to the FPGA for processing up to about 325 megabytes per second. If you add in the 4x compression that's going to get up to around 1300 megabytes per second.

In the N2001 we now have faster FPGA cores that can process about 1000 megabytes per second. So we are no longer I/O starved.

So we're now delivering about 2 1/2 times as much data per second to the FPGA core and to the CPU core, and that is fundamentally how we increased the scan speed and the performance of this system.
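
The scan-rate arithmetic in these notes can be checked with a few lines of Python. This is only a sketch that multiplies the approximate figures quoted above; the constant and function names are made up for illustration and do not come from any Netezza tool.

```python
# Approximate figures quoted in the notes; all names here are illustrative.
DRIVE_MB_PER_SEC = 130   # per-drive throughput in the N2001
DRIVES_PER_CORE = 2.5    # drives per FPGA core / CPU core
COMPRESSION_RATIO = 4    # the "4x compression" mentioned in the notes


def raw_scan_rate(drive_mb_per_sec, drives_per_core):
    """MB/sec of compressed data read from disk for one FPGA core."""
    return drive_mb_per_sec * drives_per_core


def effective_scan_rate(raw_mb_per_sec, compression_ratio):
    """MB/sec of uncompressed data represented by that raw rate."""
    return raw_mb_per_sec * compression_ratio


raw = raw_scan_rate(DRIVE_MB_PER_SEC, DRIVES_PER_CORE)
effective = effective_scan_rate(raw, COMPRESSION_RATIO)
print(f"raw: {raw} MB/sec, effective: {effective} MB/sec")
```

With the quoted inputs this gives the 325 MB/sec and 1300 MB/sec figures stated in the notes.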

13

Page 14: offers four distinct data retrieval technologies: the

14

Page 15: offers four distinct data retrieval technologies: the

Netezza Bootcamp (page 62).

15

Page 16: offers four distinct data retrieval technologies: the

Netezza Bootcamp (page 133).

16

Page 17: offers four distinct data retrieval technologies: the

Netezza Bootcamp (page 227).

17

Page 18: offers four distinct data retrieval technologies: the

Netezza Bootcamp (page 228).

18

Page 19: offers four distinct data retrieval technologies: the

19

Page 20: offers four distinct data retrieval technologies: the

IZAS_zEnterprise_Analytics Favero, et al (page 21)

20

Page 21: offers four distinct data retrieval technologies: the

Favero, et al (page 69)

21

Page 22: offers four distinct data retrieval technologies: the

22

Page 23: offers four distinct data retrieval technologies: the

23

Page 24: offers four distinct data retrieval technologies: the

Schiefer (page 4)

24

Page 25: offers four distinct data retrieval technologies: the

Schiefer (page 10)

25

Page 26: offers four distinct data retrieval technologies: the

Schiefer (page 16)

26

Page 27: offers four distinct data retrieval technologies: the

Schiefer (page 14)

27

Page 28: offers four distinct data retrieval technologies: the

Schiefer (page 18)

28

Page 29: offers four distinct data retrieval technologies: the

29

Page 30: offers four distinct data retrieval technologies: the

30

Page 31: offers four distinct data retrieval technologies: the

Positioning guidelines between Netezza and BLU are stated in slide 30, and I believe an additional factor is database size. Given that BLU is a single-node solution, it is unlikely that a DW bigger than 10 TB will fit in RAM. Some tables can yield 10X compression savings, but in practice a user is likely to see a lower compression ratio. A server with 1 TB of RAM is one of the larger configurations for BLU, so a DW bigger than 10 TB will probably not fit. Although BLU does not require all data to reside in RAM, performance will degrade if swapping occurs. – Nin Lei
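
The sizing argument in the comment above is simple multiplication, sketched here in Python. The 1 TB and 10X figures come from the comment; the lower ratios of 5 and 3 are hypothetical values chosen only to show how the ceiling drops, and the function name is illustrative. The sketch also assumes all RAM is available for table data, which overstates what fits in practice.

```python
def max_raw_data_tb(ram_tb, compression_ratio):
    """Uncompressed data volume (TB) whose compressed form fits in RAM.

    Simplifying assumption: every byte of RAM holds compressed table data,
    with nothing reserved for working memory or the operating system.
    """
    return ram_tb * compression_ratio


RAM_TB = 1  # the larger BLU configuration mentioned in the comment

# 10X is the best case quoted above; 5 and 3 are hypothetical lower ratios.
for ratio in (10, 5, 3):
    print(f"{ratio}X compression: up to {max_raw_data_tb(RAM_TB, ratio)} TB fits in RAM")
```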

31

Page 32: offers four distinct data retrieval technologies: the

32

Page 33: offers four distinct data retrieval technologies: the

33

Page 34: offers four distinct data retrieval technologies: the

DW611 (page 2‐7)

34

Page 35: offers four distinct data retrieval technologies: the

DW611 (chapter 3)

35

Page 36: offers four distinct data retrieval technologies: the

DW611 (chapter 3)

36

Page 37: offers four distinct data retrieval technologies: the

DW611 (chapter 3)

37

Page 38: offers four distinct data retrieval technologies: the

DW611 (chapter 3)

Regarding slide 37, both DPF and MapReduce (MR) use a shared nothing architecture. As such, both architectures reduce data movement by having the threads or map tasks primarily access data from the local node. Certainly there will be data distribution in some queries, but both programming models attempt to minimize data movement. One of the issues with MR is that it takes 10 to 20 seconds to spawn MR jobs. Queries that take a second or two in DPF will take much longer in MR. That's the reason most vendors (IBM BigSQL, Cloudera Impala, EMC HAWQ, Hortonworks Stinger) have abandoned MR and implemented their own data distribution mechanism. – Nin Lei
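
To put the job-spawn overhead in perspective, here is a rough Python sketch using the figures from the comment above. It assumes, as a simplification, that the query work itself takes the same time under both models, so the only difference is the 10 to 20 second startup cost. The function names are illustrative.

```python
def mr_elapsed(query_seconds, spawn_seconds):
    """Elapsed time under MR: job startup plus the query work itself."""
    return spawn_seconds + query_seconds


def slowdown(query_seconds, spawn_seconds):
    """How many times longer the query takes once startup is added."""
    return mr_elapsed(query_seconds, spawn_seconds) / query_seconds


QUERY_SECONDS = 2  # "a second or two" in DPF, per the comment

# 10 and 20 seconds are the spawn times quoted in the comment.
for spawn in (10, 20):
    total = mr_elapsed(QUERY_SECONDS, spawn)
    factor = slowdown(QUERY_SECONDS, spawn)
    print(f"spawn {spawn}s: {total}s elapsed, {factor:.0f}x the DPF time")
```

Under that assumption a 2 second query takes 12 to 22 seconds, so the fixed startup cost dominates short interactive queries.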

38

Page 39: offers four distinct data retrieval technologies: the

39

Page 40: offers four distinct data retrieval technologies: the

40

Page 41: offers four distinct data retrieval technologies: the

41

Page 42: offers four distinct data retrieval technologies: the

42

Page 43: offers four distinct data retrieval technologies: the

Frank Fillmore is the Founder and President of The Fillmore Group, Inc. (TFG), a Premier IBM Business Partner specializing in zAnalytics.

Since 1987, The Fillmore Group has delivered technical services to clients worldwide including government, commercial, and not‐for‐profit enterprises.

A knowledgeable and engaging speaker, Frank has presented at many regional and national events. In 1998 he became a DB2 Gold Consultant, and in 2009 was named an inaugural InfoSphere Information Champion.

Frank's core areas of competency include replication, federation, data interoperability, InfoSphere, and technical project management. Frank oversees a staff of DB2 consultants and contributes his technical expertise to The Fillmore Group's growing software sales business.

43