Parallel Database Systems

Instructor: Dr. Yingshu Li
Student: Chunyu Ai


Main Message

Technology trends give us many processors and storage units, inexpensively.

To analyze large quantities of data:
– sequential (regular) access patterns are 100x faster
– parallelism is 1000x faster (trades time for money)
– relational systems show many parallel algorithms


(Figure: a building block with a 1k SPECint CPU, 500 GB of disc, and 2 GB of RAM.)

Implications of Hardware Trends

Large disc farms will be inexpensive ($10k/TB).

Large RAM databases will be inexpensive ($1k/GB).

Processors will be inexpensive.

So the building block will be a processor with large RAM, lots of disc, and lots of network bandwidth.

Why Parallel Access To Data?

At 10 MB/s, scanning 1 terabyte takes 1.2 days.

With 1,000-way parallelism, the same 1 terabyte scan takes 1.5 minutes.

Parallelism: divide a big problem into many smaller ones to be solved in parallel.


Implication of Hardware Trends: Clusters

Future servers are CLUSTERS of processors and discs.

(Figure: a cluster node with a CPU, 50 GB of disc, and 5 GB of RAM.)

Thesis: Many Little will win over Few Big.

Parallel Database Architectures

Parallelism: Performance is the Goal

The goal is to get "good" performance.

Law 1: a parallel system should be faster than a serial system.

Law 2: a parallel system should give near-linear scaleup, near-linear speedup, or both.

Speed-Up and Scale-Up

Speedup: a fixed-size problem executing on a small system is given to a system that is N times larger.

– Measured by: speedup = (small system elapsed time) / (large system elapsed time)
– Speedup is linear if the ratio equals N.

Scaleup: increase the size of both the problem and the system.

– An N-times larger system is used to perform an N-times larger job.
– Measured by: scaleup = (small system, small problem elapsed time) / (big system, big problem elapsed time)
– Scaleup is linear if the ratio equals 1.
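The two ratios above can be sketched as a pair of helper functions (a minimal sketch; the timings and function names are hypothetical, not from the lecture):

```python
# Minimal sketch of the two metrics above; all timings are hypothetical.

def speedup(small_sys_time, large_sys_time):
    """Same job: elapsed time on the small system over elapsed
    time on the N-times-larger system. Linear if it equals N."""
    return small_sys_time / large_sys_time

def scaleup(small_job_time, big_job_time):
    """N-times bigger job on an N-times bigger system.
    Linear if it equals 1."""
    return small_job_time / big_job_time

N = 8  # hypothetical: the large system is 8x the small one
s = speedup(800.0, 100.0)   # 8.0 -> linear speedup
c = scaleup(100.0, 100.0)   # 1.0 -> linear scaleup
print(s, c)
```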

Kinds of Parallel Execution

Pipeline: any sequential program feeds its output to the next sequential program, so the stages run concurrently.

Partition: inputs merge M ways and outputs split N ways; many copies of the same sequential program run, one per partition.

The Drawbacks of Parallelism

Speedup = OldTime / NewTime

(Figures: speedup vs. processors & discs. The good speedup curve tracks linearity; the bad speedup curves show no parallelism benefit, or fall away because of three factors: startup, interference, and skew.)

Startup: creating processes, opening files, optimization.

Interference: device (CPU, disc, bus) and logical (lock, hotspot, server, log, ...).

Communication: sending data among nodes.

Skew: if tasks get very small, variance exceeds service time.
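These three factors can be folded into a toy cost model (purely illustrative; the formula and constants are my assumptions, not from the lecture) that shows why real speedup curves rise and then fall:

```python
# Toy model (assumed, not from the slides): elapsed time on n processors
# is startup (grows with n) + per-processor work (shrunk by n, inflated
# by skew) + interference (contention also grows with n).

def modeled_speedup(n, work=1000.0, startup=5.0, interference=0.5, skew=1.2):
    new_time = startup * n + (work / n) * skew + interference * n
    return work / new_time   # Speedup = OldTime / NewTime

for n in (1, 10, 100):
    print(n, round(modeled_speedup(n), 2))
# Speedup peaks at moderate n, then startup and interference win.
```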

Parallelism: Speedup & Scaleup

Speedup: same job (100 GB), more hardware, less time.

Scaleup: bigger job (100 GB to 1 TB), more hardware, same time.

Transaction scaleup: more clients and servers (1k clients to 10k clients), same response time.


Database Systems "Hide" Parallelism

Automate system management via tools:
• data placement
• data organization (indexing)
• periodic tasks (dump / recover / reorganize)

Automatic fault tolerance:
• duplex & failover
• transactions

Automatic parallelism:
• among transactions (locking)
• within a transaction (parallel execution)

Automatic Data Partitioning

Split a SQL table across a subset of nodes & disks. Partition within the set by:
• Range (A...E | F...J | K...N | O...S | T...Z): good for equi-joins, range queries, and group-by
• Hash: good for equi-joins
• Round Robin: good to spread load

Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.
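The three strategies can be sketched as routing functions that map a row's key to one of N partitions (an illustrative sketch; the boundaries and function names are my own, not from the lecture):

```python
# Sketch (assumed helper names, not from the slides): the three
# partitioning strategies for routing rows to N partitions by key.

N = 5  # number of partitions/nodes

def range_partition(key, boundaries=("E", "J", "N", "S")):
    """Range: partition 0 holds A...E, partition 1 holds F...J, etc."""
    for i, upper in enumerate(boundaries):
        if key[:1].upper() <= upper:
            return i
    return len(boundaries)  # last partition (T...Z)

def hash_partition(key):
    """Hash: the same key always lands on the same partition."""
    return hash(key) % N

_rr = 0
def round_robin_partition(_key=None):
    """Round robin: ignore the key, just spread load evenly."""
    global _rr
    p = _rr % N
    _rr += 1
    return p

print(range_partition("Gray"))   # 1, i.e. the F...J partition
```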

Index Partitioning

Hash indices partition by hash. B-tree indices partition as a forest of trees, one tree per range. The primary index clusters the data.

(Figure: example partitions 0...9 | 10..19 | 20..29 | 30..39 | 40.. and A..C | D..F | G...M | N...R | S..Z.)

Partitioned Execution

(Figure: a table range-partitioned A...E | F...J | K...N | O...S | T...Z, with one Count operator per partition feeding a final Count.)

This spreads computation and IO among processors. Partitioned data gives NATURAL parallelism.

N x M Way Parallelism

(Figure: five range partitions A...E | F...J | K...N | O...S | T...Z each run Sort then Join, with Merge operators combining the streams.)

N inputs, M outputs, no bottlenecks: partitioned data gives partitioned and pipelined data flows.

Blocking Operators = Short Pipelines

An operator is blocking if it does not produce any output until it has consumed all its input.

Examples: Sort, Aggregates, Hash-Join (reads all of one operand).

Blocking operators kill pipeline parallelism and make partition parallelism all the more important.
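The contrast can be sketched with generators (an illustrative sketch; the operator names are my own): a streaming operator yields rows as they arrive, while a blocking one like sort must drain its whole input first.

```python
# Sketch (illustrative, not from the slides): a pipelined operator
# emits rows as they arrive; a blocking operator (sort) buffers
# the entire input before emitting its first row.

def scan(rows):
    for r in rows:           # streaming source
        yield r

def select_gt(rows, threshold):
    for r in rows:           # pipelined: emits as soon as a row qualifies
        if r > threshold:
            yield r

def sort_op(rows):
    buffered = list(rows)    # blocking: must drain the entire input...
    buffered.sort()
    yield from buffered      # ...before the first output row appears

pipeline = sort_op(select_gt(scan([5, 1, 9, 3]), 2))
print(list(pipeline))        # [3, 5, 9]
```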

Database Load Template

The load template has three blocked phases:
1. Scan the source (tape file, SQL table, or process) and sort it into runs.
2. Merge the runs.
3. Insert into the table and into each index (Index 1, Index 2, Index 3).

Parallel Aggregates

For each aggregate function, we need a decomposition strategy:
• count(S) = Σ count(s(i)), ditto for sum()
• avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
• and so on...

For groups, sub-aggregate the groups close to the source and drop the sub-aggregates into a hash river.
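The avg decomposition above can be sketched in two phases (an illustrative sketch with a hypothetical partition layout):

```python
# Sketch (assumed data, not from the slides): a global avg built from
# per-partition sub-aggregates, avg(S) = (Σ sum(s(i))) / (Σ count(s(i))).

partitions = [[3, 5], [8], [2, 4, 6]]   # hypothetical s(i) partitions

# Phase 1: each node computes local sub-aggregates close to the source.
subs = [(sum(p), len(p)) for p in partitions]   # (sum, count) pairs

# Phase 2: combine the sub-aggregates globally.
total_sum = sum(s for s, _ in subs)
total_count = sum(c for _, c in subs)
print(total_sum / total_count)   # 28 / 6
```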

(Figure: per-partition Count operators over A...E | F...J | K...N | O...S | T...Z feeding a global Count.)

Parallel Sort

M inputs, N outputs. A scan (or other source) feeds a river that is range- or hash-partitioned; sub-sorts generate runs, then the runs are merged. Disk and merge are not needed if the sort fits in memory.

Sort is the benchmark from hell for shared-nothing machines: net traffic equals disk bandwidth, and there is no data filtering at the source.
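The range-partitioned case can be sketched as follows (an illustrative in-memory sketch; real parallel sorts run the sub-sorts on separate nodes and spill runs to disk). With range partitioning, every key in partition i precedes every key in partition i+1, so the sorted partitions simply concatenate:

```python
# Sketch (illustrative, not from the slides): range-partition a "river"
# of rows across sub-sorts, then concatenate the sorted partitions.

def range_partition_sort(rows, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in rows:                          # the partitioned river
        i = sum(r > b for b in boundaries)  # index of r's range
        partitions[i].append(r)
    for p in partitions:                    # each sub-sort is independent
        p.sort()
    out = []
    for p in partitions:                    # concatenation, no global merge
        out.extend(p)
    return out

print(range_partition_sort([9, 2, 7, 1, 8, 4], boundaries=[3, 6]))
# [1, 2, 4, 7, 8, 9]
```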

Hash Join: Combining Two Tables

Hash the smaller table into N buckets (hope N = 1).
– If N = 1, read the larger table and hash-probe into the smaller.
– Else, hash the outer to disk, then do a bucket-by-bucket hash join.

Purely sequential data behavior. Always beats sort-merge and nested-loops joins unless the data is clustered. Good for equi-, outer-, and exclusion joins. Lots of papers; hashing reduces skew.

(Figure: the left table and right table hashed into hash buckets.)
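The N = 1 case can be sketched directly (an illustrative sketch with hypothetical tables and column names): build an in-memory hash table on the smaller relation, then stream the larger one past it.

```python
# Sketch (illustrative, not from the slides): the N = 1 hash join,
# with the smaller table hashed in memory and the larger table probed.

def hash_join(small, large, key_small, key_large):
    """Equi-join: build a hash table on the smaller relation, then
    stream the larger relation past it (purely sequential IO)."""
    buckets = {}
    for row in small:                       # build phase
        buckets.setdefault(row[key_small], []).append(row)
    out = []
    for row in large:                       # probe phase
        for match in buckets.get(row[key_large], []):
            out.append({**match, **row})
    return out

# Hypothetical tables:
dept = [{"dno": 1, "dname": "eng"}, {"dno": 2, "dname": "ops"}]
emp = [{"ename": "ai", "dno": 1}, {"ename": "li", "dno": 1}]
print(hash_join(dept, emp, "dno", "dno"))
```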

Q&A

Thank you!

