
Flash Memory Database Systems and In-Page Logging


Page 1: Flash Memory Database Systems and In-Page Logging

KOCSEA’09, Las Vegas, December 2009

Flash Memory Database Systems and In-Page Logging

Bongki Moon
Department of Computer Science
University of Arizona, Tucson, AZ 85721, U.S.A.
[email protected]

In collaboration with Sang-Won Lee (SKKU) and Chanik Park (Samsung)

Flash Talk

Page 2: Flash Memory Database Systems and In-Page Logging


Magnetic Disk vs Flash SSD

Champion for 50 years: Seagate ST340016A, 40 GB, 7200 RPM

New challengers: Samsung Flash SSD, 128 GB, 2.5/1.8 inch; Intel X25-M Flash SSD, 80 GB, 2.5 inch

Page 3: Flash Memory Database Systems and In-Page Logging


Past Trend of Disk

• From 1983 to 2003 [Patterson, CACM 47(10) 2004]
  - Capacity increased about 2,500 times (0.03 GB -> 73.4 GB)
  - Bandwidth improved 143.3 times (0.6 MB/s -> 86 MB/s)
  - Latency improved 8.5 times (48.3 ms -> 5.7 ms)

Year                 1983          1990             1994             1998             2003
Product              CDC 94145-36  Seagate ST41600  Seagate ST15150  Seagate ST39102  Seagate ST373453
Capacity             0.03 GB       1.4 GB           4.3 GB           9.1 GB           73.4 GB
RPM                  3600          5400             7200             10000            15000
Bandwidth (MB/sec)   0.6           4                9                24               86
Media diameter (in)  5.25          5.25             3.5              3.0              2.5
Latency (msec)       48.3          17.1             12.7             8.8              5.7
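
The improvement factors quoted above follow directly from the 1983 and 2003 columns of the table; a quick sanity check in Python:

```python
# Quick check of the 1983 -> 2003 improvement factors quoted above,
# using values taken from the table.
capacity = (0.03, 73.4)      # GB
bandwidth = (0.6, 86.0)      # MB/s
latency = (48.3, 5.7)        # ms

print(f"Capacity:  {capacity[1] / capacity[0]:,.0f}x")     # ~2,447x ("about 2500 times")
print(f"Bandwidth: {bandwidth[1] / bandwidth[0]:.1f}x")    # ~143.3x
print(f"Latency:   {latency[0] / latency[1]:.1f}x")        # ~8.5x
```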

Page 4: Flash Memory Database Systems and In-Page Logging


I/O Crisis in OLTP Systems

• I/O becomes the bottleneck in OLTP systems
  - OLTP systems process a large number of small random I/O operations

• Common practice to close the gap
  - Use a large disk farm to exploit I/O parallelism
    • Tens or hundreds of disk drives per processor core
    • (E.g.) IBM Power 596 Server: 172 15k-RPM disks per processor core
  - Adopt short-stroking to reduce disk latency
    • Use only the outer tracks of disk platters

• Other concerns are raised too
  - Wasted capacity of disk drives
  - Increased energy consumption

• Then, what happens 18 months later? To catch up with Moore's law and keep CPU and I/O balanced (Amdahl's law), must the number of spindles be doubled again?

Page 5: Flash Memory Database Systems and In-Page Logging


Flash News in the Market

• Sun Oracle Exadata Storage Server [Sep 2009]
  - Each Exadata cell comes with 384 GB of flash cache

• MySpace dumped disk drives [Oct 2009]
  - Went all-flash, cutting power use by 99%

• Google Chrome ditched disk drives [Nov 2009]
  - SSD is the key to its 7-second boot time

• Gordon at UCSD/SDSC [Nov 2009]
  - 64 TB RAM, 256 TB flash, 4 PB disk

• IBM hooked up with Fusion-io [Dec 2009]
  - SSD storage appliance for the System X server line

Page 6: Flash Memory Database Systems and In-Page Logging


Flash for Database, Really?

• Immediate benefit for some DB operations
  - Reduce commit-time delay by fast logging
  - Reduce read time for multi-versioned data
  - Reduce query processing time (sort, hash)

• What about the big, fat tables?
  - Random scattered I/O is very common in OLTP
    • Can flash SSDs, with their slow random writes, handle this?

Page 7: Flash Memory Database Systems and In-Page Logging


Transactional Log

(Figure: DBMS storage layout: SQL queries flow through the system buffer cache to the database table space, temporary table space, transaction (redo) log, and rollback segments. This slide highlights the transaction log.)

Page 8: Flash Memory Database Systems and In-Page Logging


Commit-time Delay by Logging

• Write-Ahead Logging (WAL)
  - A committing transaction force-writes its log records
  - This makes the latency hard to hide
  - With a separate disk for logging:
    • No seek delay, but half a revolution of the spindle on average
    • 4.2 msec (7200 RPM), 2.0 msec (15k RPM)
  - With a flash SSD: about 0.4 msec

• Commit-time delay remains a significant overhead
  - Group commit helps, but the delay doesn't go away altogether

• How much commit-time delay?
  - On average, 8.2 msec (HDD) vs 1.3 msec (SSD): a 6-fold reduction
    • TPC-B benchmark with 20 concurrent users

(Figure: transactions T1 … Tn issue SQL that updates page Pi in the buffer; log records accumulate in the log buffer and are force-written to the LOG device, separate from the DB device.)
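
The 4.2 and 2.0 msec figures above are just half a spindle revolution; a small sketch of that arithmetic (the SSD latency and TPC-B numbers are measured values from the slide, not derived):

```python
# Average rotational delay for a forced log write on a dedicated log disk:
# no seek, but half a revolution of the spindle on average.
def half_revolution_ms(rpm: float) -> float:
    return 60_000.0 / rpm / 2.0        # 60,000 ms per minute, half a turn

print(f"7200 RPM: {half_revolution_ms(7200):.1f} ms")    # ~4.2 ms
print(f"15k RPM:  {half_revolution_ms(15000):.1f} ms")   # ~2.0 ms
# A flash SSD log write takes about 0.4 ms (measured), which is what drives the
# 8.2 ms vs 1.3 ms average commit delay observed in the TPC-B experiment.
```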

Page 9: Flash Memory Database Systems and In-Page Logging


Rollback Segments

(Figure: the same DBMS storage layout as on the Transactional Log slide, now highlighting the rollback segments.)

Page 10: Flash Memory Database Systems and In-Page Logging


MVCC Rollback Segments

• Multi-Version Concurrency Control (MVCC)
  - An alternative to traditional lock-based CC
  - Supports read consistency and snapshot isolation
  - Oracle, PostgreSQL, Sybase, SQL Server 2005, MySQL

• Rollback Segments
  - Each transaction is assigned to a rollback segment
  - When an object is updated, its current value is recorded in the rollback segment sequentially (in append-only fashion)
  - To fetch the correct version of an object, check whether it has been updated by other transactions

Page 11: Flash Memory Database Systems and In-Page Logging


MVCC Write Pattern

• Write requests from the TPC-C workload
  - Concurrent transactions generate multiple streams of append-only traffic in parallel (streams roughly 1 MB apart)
  - The HDD has to move its disk arm very frequently
  - The SSD suffers no penalty, since this pattern avoids its no-in-place-update limitation

Page 12: Flash Memory Database Systems and In-Page Logging


MVCC Read Performance

• To support MV read consistency, I/O activity increases
  - A long chain of old versions may have to be traversed on each access to a frequently updated object

• Read requests are scattered randomly
  - Old versions of an object may be stored in several rollback segments
  - With an SSD, a 10-fold read-time reduction was not surprising

(Figure: version chain for object A across transactions T0, T1, T2: (1) A: 200 -> 100, (2) A: 100 -> 50; the superseded versions 200A and 100A are stored in different rollback segments, alongside versions of other objects (100B, …C), so reconstructing an older snapshot of A may touch several rollback segments.)
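
The figure's example (A goes from 200 to 100, then from 100 to 50, with the superseded values landing in rollback segments) can be read back by walking the version chain. A minimal sketch with hypothetical names, not any particular engine's implementation:

```python
# Hypothetical sketch of an MVCC read through rollback segments: each update
# pushes the old value onto the object's version chain, and a reader walks the
# chain back until it reaches a version committed before its snapshot.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: int
    writer_ts: int               # commit timestamp of the writing transaction
    prev: Optional["Version"]    # older version, kept in some rollback segment

def snapshot_read(head: Version, reader_ts: int) -> int:
    v = head
    while v is not None and v.writer_ts > reader_ts:
        v = v.prev               # each hop may hit a different rollback segment (random I/O)
    if v is None:
        raise LookupError("snapshot too old")
    return v.value

# A = 200 initially, then updated to 100, then to 50.
a = Version(50, 2, Version(100, 1, Version(200, 0, None)))
print(snapshot_read(a, reader_ts=0))   # 200: a reader that started before the first update
print(snapshot_read(a, reader_ts=1))   # 100: a reader between the two updates
```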

Page 13: Flash Memory Database Systems and In-Page Logging


Database Table Space

(Figure: the same DBMS storage layout, now highlighting the database table space.)

Page 14: Flash Memory Database Systems and In-Page Logging


Workload in the Table Space

• TPC-C workload
  - Exhibits little locality and little sequentiality
    • Mix of small/medium/large read-write and read-only (join) transactions
  - Highly skewed
    • 84% (75%) of accesses go to 20% of the tuples (pages)

• Write caching is not as effective as read caching
  - The physical read/write ratio is much lower than the logical read/write ratio

• All bad news for flash memory SSDs
  - Due to no in-place updates and asymmetric read/write speeds
  -> The In-Page Logging (IPL) approach [SIGMOD’07]

Page 15: Flash Memory Database Systems and In-Page Logging


In-Page Logging (IPL)

• Key ideas of the IPL approach
  - Changes are written to a log instead of being applied in place
    • Avoids frequent write and erase operations
  - Log records are co-located with their data pages
    • No need to write them sequentially to a separate log region
    • Current data can be read more efficiently than with sequential logging
  - The DBMS buffer and storage managers work together

Page 16: Flash Memory Database Systems and In-Page Logging


Design of the IPL

• Logging on a per-page basis, in both memory and flash
  - An in-memory log sector can be associated with a buffer frame in memory
    • Allocated on demand when a page becomes dirty
  - An in-flash log segment is allocated in each erase unit
    • The log area is shared by all the data pages in the erase unit

(Figure: in the database buffer, each in-memory 8 KB data page is updated in place and has a 512 B in-memory log sector; on flash, each 128 KB erase unit holds 15 data pages of 8 KB each plus an 8 KB log area of 16 sectors.)
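
The sizes in the figure fully determine the IPL layout; a small sketch of those constants (names are illustrative, values are the ones shown on the slide):

```python
# Flash layout assumed by IPL on this slide (names are illustrative).
ERASE_UNIT          = 128 * 1024   # one flash erase unit (block): 128 KB
DATA_PAGE           = 8 * 1024     # one data page: 8 KB
LOG_SECTOR          = 512          # one in-memory log sector per dirty buffer frame: 512 B
DATA_PAGES_PER_UNIT = 15           # 15 data pages per erase unit ...

LOG_AREA = ERASE_UNIT - DATA_PAGES_PER_UNIT * DATA_PAGE
assert LOG_AREA == 8 * 1024                  # ... leaving an 8 KB log area
assert LOG_AREA // LOG_SECTOR == 16          # shared as 16 log sectors
```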

Page 17: Flash Memory Database Systems and In-Page Logging


IPL Write

(Figure: updates, inserts, and deletes modify 8 KB pages in the buffer pool in place (update-in-place) while physiological log records accumulate in 512 B sectors; both the data pages and the log sectors map to the same 128 KB block in the flash data block area.)

• Data pages in memory
  - Are updated in place, and
  - Physiological log records are written to their in-memory log sectors

• The in-memory log sector is written to the in-flash log segment when
  - The data page is evicted from the buffer pool, or
  - The log sector becomes full

• When a dirty page is evicted, its content is not written to flash memory
  - The previous version on flash remains intact
  - Data pages and their log records are physically co-located in the same erase unit
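
A minimal sketch of this write path, with hypothetical buffer-manager hooks rather than the authors' actual code:

```python
# Hypothetical IPL write path: update the cached page in place, append a
# physiological log record to the frame's 512 B log sector, and flush only the
# log sector (never the page itself) when it fills up or the page is evicted.
LOG_SECTOR_BYTES = 512

class BufferFrame:
    def __init__(self, page_id: int, page: bytes):
        self.page_id = page_id
        self.page = bytearray(page)   # in-memory copy, updated in place
        self.log = []                 # pending physiological log records
        self.log_bytes = 0

def apply_update(frame: BufferFrame, offset: int, data: bytes, flush_log_sector) -> None:
    frame.page[offset:offset + len(data)] = data               # 1) update in place
    frame.log.append(("update", frame.page_id, offset, data))  # 2) log the action
    frame.log_bytes += 8 + len(data)                           #    rough record size
    if frame.log_bytes >= LOG_SECTOR_BYTES:                    # 3) sector full?
        flush_log_sector(frame)        # write 512 B into the erase unit's log area
        frame.log, frame.log_bytes = [], 0

def evict(frame: BufferFrame, flush_log_sector) -> None:
    # The dirty page content is NOT written back; the stale version on flash
    # stays intact and only the pending log records are flushed.
    if frame.log:
        flush_log_sector(frame)
        frame.log, frame.log_bytes = [], 0
```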

Page 18: Flash Memory Database Systems and In-Page Logging


IPL Read

• When a page is read from flash, the current version is computed on the fly

(Figure: to read Pi, the original copy is fetched from the 120 KB data area (15 pages) together with all log records belonging to Pi from the 8 KB log area (16 sectors) (I/O overhead); the physiological actions are then applied to the copy to re-construct the current in-memory version in the buffer pool (CPU overhead).)
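
A matching sketch of the read path, using the same hypothetical log-record format as the write sketch above:

```python
# Hypothetical IPL read: fetch the stale page plus all of its log records from
# the same erase unit, then replay the physiological actions to rebuild the
# current version in memory.
def read_current_page(page_id, read_page_from_flash, read_log_records) -> bytes:
    page = bytearray(read_page_from_flash(page_id))        # original (stale) copy: I/O
    for kind, pid, offset, data in read_log_records(page_id):
        if kind == "update" and pid == page_id:            # replay: CPU overhead
            page[offset:offset + len(data)] = data
    return bytes(page)                                     # current in-memory copy
```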

Page 19: Flash Memory Database Systems and In-Page Logging


IPL Merge

• When all free log sectors in an erase unit are consumed
  - Log records are applied to the corresponding data pages
  - The current data pages are copied into a new erase unit
    • Consumes, erases, and releases only one erase unit

(Figure: merging physical flash block B_old into B_new: the 15 up-to-date data pages are written into B_new, whose 8 KB (16-sector) log area starts out clean, and B_old can then be erased.)
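
A sketch of the merge itself, again with hypothetical helpers for the flash operations:

```python
# Hypothetical IPL merge: replay the erase unit's log area onto each of its 15
# data pages, write the up-to-date pages into a freshly allocated erase unit
# (whose 8 KB log area starts out clean), then erase and release the old unit.
def merge_erase_unit(old_unit, allocate_unit, erase_and_release):
    new_unit = allocate_unit()                          # B_new: one clean erase unit
    for pid in old_unit.page_ids:
        page = bytearray(old_unit.read_page(pid))       # stale copy from B_old
        for kind, rec_pid, offset, data in old_unit.read_log_records(pid):
            if kind == "update" and rec_pid == pid:     # apply pending log records
                page[offset:offset + len(data)] = data
        new_unit.write_page(pid, bytes(page))           # up-to-date page into B_new
    erase_and_release(old_unit)                         # B_old can now be erased
    return new_unit
```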

Page 20: Flash Memory Database Systems and In-Page Logging


Industry Response

• Common in enterprise-class SSDs
  - Multi-channel, inter-command parallelism
    • Throughput rather than raw bandwidth; write-followed-by-read patterns
  - Command queuing (SATA-II NCQ)
  - Large RAM buffer (with super-capacitor backup)
    • Even up to 1 MB per GB of capacity
    • Write-back caching, controller data (mapping, wear leveling)
  - Fat provisioning (up to ~20% of capacity)

• Impressive improvement

  Prototype/Product   EC SSD   X-25M   15k-RPM Disk
  Read (IOPS)          10500   20000            450
  Write (IOPS)          2500    1200            450
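
Put against the 15k-RPM disk in the same table, the gap is easy to quantify:

```python
# Random-I/O rates from the table above, relative to a single 15k-RPM disk.
iops = {"EC SSD": (10500, 2500), "X-25M": (20000, 1200), "15k-RPM disk": (450, 450)}
disk_read, disk_write = iops["15k-RPM disk"]
for name, (reads, writes) in iops.items():
    print(f"{name:12s}  read {reads / disk_read:5.1f}x   write {writes / disk_write:5.1f}x")
# EC SSD: ~23x the random reads and ~5.6x the random writes of the disk;
# X-25M: ~44x the reads but only ~2.7x the writes.
```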

Page 21: Flash Memory Database Systems and In-Page Logging


EC-SSD Architecture

• Parallel/interleaved operations
  - 8 channels, 2 packages per channel, 4 chips per package
  - Two-plane page write, block erase, and copy-back operations

(Figure: EC-SSD block diagram: a SATA-II host interface, ARM9 main controller, and 128 MB DRAM in front of a flash controller with ECC, driving NAND flash packages over 8 channels.)

Page 22: Flash Memory Database Systems and In-Page Logging


Concluding Remarks

• Recent advances cope with random I/O better
  - Write IOPS is 100x higher than with an early SSD prototype
  - TPS is 1.3~2x higher than RAID0 over 8 HDDs for a read-write TPC-C workload, with much less energy consumed

• Writes still lag behind: IOPS_Disk < IOPS_SSD-Write << IOPS_SSD-Read
  - IOPS_SSD-Read / IOPS_SSD-Write = 4 ~ 17

• A lot more issues to investigate
  - Flash-aware buffer replacement, I/O scheduling, energy
  - Fluctuation in performance, tiered storage architecture
  - Virtualization, and much more …

Page 23: Flash Memory Database Systems and In-Page Logging


Questions?

• For more of our flash memory work
  - In-Page Logging [SIGMOD’07]
  - Logging, Sort/Hash, Rollback [SIGMOD’08]
  - SSD Architecture and TPC-C [SIGMOD’09]
  - In-Page Logging for Indexes [CIKM’09]

• Even more? www.cs.arizona.edu/~bkmoon