Exploiting Your File System to Build Robust & Efficient Workflows


Jason Johnson

Good afternoon! Title.

Begins. Database server? Video Encoder? Where to go from here? Up or Down. Understand the Abstraction. Go Down to Physical.

What is /dev/sdc, anyway?

Common. Big Virtual Disk. RAID Controller. Couple Drives. Not set it and forget it.

The Hard Disk Drive

Platters, Spindle, Actuator, Actuator Coil, Actuator Arm, Heads

Basic Platter Geometry

Cylinder-Head-Sector (obsolete)
Logical Block Addressing, LBA

Sectors. Cluster. Track. Cylinder. 512, 2K, 4K. CHS obsolete. Giant String of Sectors.
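To make the "giant string of sectors" idea concrete, here is a minimal sketch of the conventional CHS-to-LBA conversion. The function name and the geometry values (16 heads, 63 sectors per track) are illustrative assumptions, not figures from the talk.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Flatten a cylinder/head/sector triple into a single block address.

    CHS sectors are numbered from 1, while LBA starts at 0.
    """
    if sector < 1 or sector > sectors_per_track:
        raise ValueError("sector must be in 1..sectors_per_track")
    if head < 0 or head >= heads_per_cylinder:
        raise ValueError("head must be in 0..heads_per_cylinder-1")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


# Hypothetical legacy geometry: 16 heads, 63 sectors per track.
print(chs_to_lba(0, 0, 1, 16, 63))  # first sector on the disk
print(chs_to_lba(1, 0, 1, 16, 63))  # first sector of the second cylinder
```

With LBA the drive hides its real geometry, so tools only ever deal with the flat index on the left-hand side.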

What is /dev/sdc, anyway?

Disk Array Controller. Break it Down.

The Disk Array Controller

Adaptec 5405Z

PCIe x8

1.2 GHz Dual Core RAID on Chip (ROC)

128-1024 MB Battery-Backed DDR

1-4 GB NAND

Up to 256 SATA or SAS HDDs

arcconf

DDR Flushes to NAND. Configuration Tool. Stunned. Discrete GPU for your File System.

Write Caching

Data Corruption?

(hands)

Fallible. Disable All Caching. Eliminate Class of Errors.

...you *must* disable the individual hard disk write cache in order to ensure to keep the file system intact after a power failure.

XFS.org FAQ

Initial 1MB Sector Alignment

Sector Size   Starting Sector   Drive Type
512 B         2048              SATA & SAS
2 KB          512               SSD
4 KB          256               Advanced Format & SSD

blockdev --getpbsz /dev/sdc
blockdev --getss /dev/sdc

Aligning IO on a hard disk RAID
http://www.mysqlperformanceblog.com/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/

Sector Alignment. 1MB Offset. Room for Partition Table. Use These Tools. Verify Correctness.
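The starting sectors on this slide follow from dividing the 1 MB offset by the sector size. A small sketch of that arithmetic, where the function name `starting_sector` is made up for illustration:

```python
ALIGNMENT_BYTES = 1024 * 1024  # the 1 MB initial offset from the slide


def starting_sector(sector_size_bytes, alignment_bytes=ALIGNMENT_BYTES):
    """Return the first LBA that sits exactly on the alignment boundary."""
    if alignment_bytes % sector_size_bytes != 0:
        raise ValueError("sector size must divide the alignment offset evenly")
    return alignment_bytes // sector_size_bytes


# Sector sizes as reported by `blockdev --getpbsz` / `blockdev --getss`.
for sector_size in (512, 2048, 4096):
    print(f"{sector_size:>5} B sectors -> start at sector {starting_sector(sector_size)}")
```

Feeding in the value reported for your own device is a quick way to check the start sector that the partitioning tool shows.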

(s)gdisk

Sectors. Clearly Communicating in LBA. Logical Size. Offset. Check. All Makes Sense. Not Scary.

Tuning the File System

Disable Caching

Tools: sysbench, iozone, iostat, vmstat

Start Simple

Apply Increasing Parallel I/O

ext2, ext3, ext4, xfs, btrfs, zfs?

Graph Everything

No Caching. Sysbench. 16 Data-Bearing Disks. Hardware Controller. XFS Designed for This! But... verify through testing.

arcconf

arcconf \
  create 1 logicaldrive \
  stripesize 256 \
  wcache wt \
  rcache roff \
  max 0 \
  0 3 \
  0 4 \
  ...
  0 18

Stripe Size 256k. Entire Width. Cache Disabled. Add Physical Drives. All Controller Brands Different.

sysbench, fileio

sysbench \
  --num-threads=[8-1024] \
  --test=fileio \
  --file-total-size=10G \
  --file-test-mode=rndwr \
  --file-fsync-all=on \
  --file-num=64 \
  --file-block-size=16384 \
  [prepare|run|cleanup]

What are we comparing? EXT4 vs. XFS. Naive. Naive External. Tuned. Tuned External.

EXT4, mkfs

mkfs.ext4 /dev/sdc1

mke2fs \
  -b 4096 \
  -O journal_dev \
  /dev/sdb1 32768

mkfs.ext4 \
  -b 4096 \
  -E stride=4,stripe_width=16 \
  -J device=/dev/sdb1 \
  /dev/sdc1

mount -o noatime,stripe=16 \
  /dev/sdc1 \
  /mnt/data

From Naive to Modestly Tuned. Stripe Width. Stride. 128MB external journal. Mount requires extra information.
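The stride and stripe width values depend on the RAID geometry. The sketch below uses the definitions in the mke2fs man page: stride is the number of file system blocks per stripe unit, and stripe width is typically stride times the number of data-bearing disks. The 16 KiB stripe unit is an assumption borrowed from the `su=16k` value on the XFS slide, and the function name is hypothetical. Both results are counted in file system blocks.

```python
def ext4_stripe_options(stripe_unit_bytes, block_size_bytes, data_disks):
    """Return (stride, stripe_width), both in file system blocks."""
    if stripe_unit_bytes % block_size_bytes != 0:
        raise ValueError("stripe unit must be a multiple of the block size")
    stride = stripe_unit_bytes // block_size_bytes   # blocks per disk chunk
    stripe_width = stride * data_disks               # blocks per full stripe
    return stride, stripe_width


# Assumed geometry: 16 KiB per-disk stripe unit, 4 KiB blocks, 16 data disks.
stride, stripe_width = ext4_stripe_options(16 * 1024, 4096, 16)
print(f"-E stride={stride},stripe_width={stripe_width}")

# 32768 journal blocks of 4096 bytes is the 128 MB external journal.
print(32768 * 4096 // (1024 * 1024), "MB journal")
```

Under these assumptions the formula gives a stripe width of 64 blocks, whereas the slide passes 16. The deck does not say how its value was derived, so check this against your own controller's geometry before copying either number.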

I/O Requests per Second, EXT4

Review Graph

Latency, EXT4

Review Graph

XFS, mkfs

mkfs.xfs /dev/sdc1

mkfs.xfs \
  -d sw=16,su=16k \
  -l logdev=/dev/sdb1,size=128m,su=256k \
  /dev/sdc1

mount -o noatime,logdev=/dev/sdb1,logbufs=8,logbsize=256k \
  /dev/sdc1 \
  /mnt/data

Again, From Naive to Modestly Tuned. Stripe Width. Stripe Unit Size. External Journal Device. Additional Information Needed by Mount.

I/O Requests per Second, XFS

Review Graph

Latency, XFS

Review Graph

What are we looking for?

I/O Requests per Second... per Drive & Reasonable Latency

We want 330 IOPS. Our $$$.

XFS vs. EXT4, Latency

Neck and Neck. Slight Advantage at 256 Threads.

XFS vs. EXT4, per Drive

Noticeable Advantage at 256 Threads. 5,200 IOPS. Reached Practical Limit Early. Fully Tuned? sysctl for XFS?
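The per-drive view is a simple division. This sketch assumes the 5,200 IOPS figure is the array-wide total and that it is spread over the 16 data-bearing disks mentioned earlier. The notes do not state either point explicitly, so treat the result as illustrative.

```python
TARGET_IOPS_PER_DRIVE = 330  # "We want 330 IOPS"


def iops_per_drive(total_iops, data_drives):
    """Average I/O requests per second that each data-bearing drive serves."""
    return total_iops / data_drives


per_drive = iops_per_drive(5200, 16)
print(f"{per_drive:.0f} IOPS per drive, "
      f"{per_drive / TARGET_IOPS_PER_DRIVE:.1%} of the {TARGET_IOPS_PER_DRIVE} IOPS target")
```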

Scenario 1, Efficiency

MySQL

Predictable Write Pattern. We Make It Efficient.

MySQL Write Pattern

InnoDB Pages. Linux Pages. Unit of Work. XFS Allocation Groups. EXT4 Metadata Groupings.

MySQL Configuration

System Variable                 Value
innodb_io_capacity              5000
innodb_thread_concurrency       256
innodb_write_io_threads         192
innodb_read_io_threads          64
innodb_log_file_size            32M
innodb_log_files_in_group       32
innodb_buffer_pool_size         10GB
innodb_buffer_pool_instances    10

MySQL System Variables

Review The Configuration. Plug-in Values from Sysbench. Google MySQL System Variables. Explain Values.

sysbench, mysql

sysbench \
  --num-threads=[32|64|128|256] \
  --test=oltp \
  --oltp-test-mode=nontrx \
  --oltp-nontrx-mode=insert \
  --oltp-table-size=100000 \
  --max-requests=10000000 \
  [prepare|run|cleanup]

Small Benchmark. 10 Million Inserts, One Transaction Each. Ramp up Threads. Test EXTERNALLY. Deadlock Potential. Spin-locks Contending with Benchmark.

Transactions per Second

+96.28%

+125.37%

+102.29%

+69.43%

15,962.79/s @ 16ms

9,421.29/s @ 27ms

Transactions per Second. For the Percentage Increase Folks. For the Real Figures Folks. 125% increase (in some cases). High-End nearing 16,000.
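For anyone who wants to tie the percentage labels back to the raw figures, here is a small worked example. It assumes the two labelled rates (9,421.29/s and 15,962.79/s) are the before and after values at the same thread count. The slide does not say so, but the result lines up with one of the percentage labels.

```python
def percent_increase(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


baseline_tps = 9421.29    # transactions per second, from the slide
tuned_tps = 15962.79      # transactions per second, from the slide

print(f"{percent_increase(baseline_tps, tuned_tps):+.2f}%")  # prints +69.43%
```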

Before Next Section.
Who has written code like this? (hands) Scanning. Race Condition. Creation Behind Us.
There is a better way!

inotify

Event Mask          Fired when...
IN_ACCESS           File was accessed (read)
IN_ATTRIB           Metadata changed
IN_CLOSE_WRITE      File opened for writing was closed
IN_CLOSE_NOWRITE    File not opened for writing was closed
IN_CREATE           File/directory created in watched directory
IN_DELETE           File/directory deleted from watched directory
IN_DELETE_SELF      Watched file/directory was itself deleted
IN_MODIFY           File was modified
IN_MOVE_SELF        Watched file/directory was itself moved
IN_MOVED_FROM       File moved out of watched directory
IN_MOVED_TO         File moved into watched directory
IN_OPEN             File was opened

It Can Tell Us. No Scanning or Polling. No Races.
IN_CLOSE_WRITE, IN_MOVED_TO

Event Stream. No Races. File System Obeys Rules. Can't Move Files Being Written.
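As a rough illustration of the event stream, the sketch below watches a directory for IN_CLOSE_WRITE and IN_MOVED_TO. It calls the Linux inotify system calls through ctypes so that it needs nothing beyond the standard library. This is a swapped-in approach, not the pyinotify API, and the helper names `watch_directory` and `read_events` are made up here. It is Linux-only.

```python
import ctypes
import os
import struct

# Event mask bits as defined in <sys/inotify.h>.
IN_CLOSE_WRITE = 0x00000008
IN_MOVED_TO = 0x00000080

# struct inotify_event header: int wd; uint32_t mask, cookie, len;
_EVENT_HEADER = struct.Struct("iIII")

# dlopen(NULL) exposes the libc symbols already loaded into this process.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.inotify_init.argtypes = []
_libc.inotify_init.restype = ctypes.c_int
_libc.inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]
_libc.inotify_add_watch.restype = ctypes.c_int


def watch_directory(path, mask=IN_CLOSE_WRITE | IN_MOVED_TO):
    """Return an inotify file descriptor that reports `mask` events in `path`."""
    fd = _libc.inotify_init()
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    if _libc.inotify_add_watch(fd, os.fsencode(path), mask) < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, os.strerror(err))
    return fd


def read_events(fd):
    """Block until events arrive, then return them as (mask, filename) pairs."""
    data = os.read(fd, 64 * 1024)
    events, offset = [], 0
    while offset < len(data):
        _wd, mask, _cookie, name_len = _EVENT_HEADER.unpack_from(data, offset)
        offset += _EVENT_HEADER.size
        # The name field is NUL-padded up to name_len bytes.
        name = data[offset:offset + name_len].rstrip(b"\0")
        offset += name_len
        events.append((mask, os.fsdecode(name)))
    return events
```

A consumer would call `read_events` in a loop and act only on files that have been closed after writing or moved into place, which is what removes the need for scanning or polling.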

inotify in [language]

Language   Source
Python     pip install pyinotify
PHP        pecl install inotify
Go         Go's exp repository
Ruby       gem install rb-inotify
C          #include <sys/inotify.h>

Go's extracted from stdlib. FreeBSD's kqueue.

Scenario 2, Robustness

A Custom Message Queue

Internally. RabbitMQ. Every Datacenter. Worldwide.
Simpler. File-Based. RESTful. 200,000 Concurrency. Insane Burst-able Throughput.

Message Queue Architecture

Familiar? SMTP or Maildir.
Fall Over.
Inbox. Partial Content. Source & Victim Locked. Fetching Serialized.
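One way to keep partial content out of an inbox is the Maildir convention: write each message into a staging directory, then rename it into place. The sketch below borrows Maildir's `tmp` and `new` directory names. The `deliver` function is hypothetical, and the deck does not describe the speaker's actual on-disk layout.

```python
import os
import tempfile


def deliver(queue_dir, name, payload):
    """Write `payload` under tmp/, then atomically rename it into new/."""
    tmp_dir = os.path.join(queue_dir, "tmp")
    new_dir = os.path.join(queue_dir, "new")
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(new_dir, exist_ok=True)

    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # make the bytes durable before exposing them
        final_path = os.path.join(new_dir, name)
        # rename() within one file system is atomic: readers see the whole
        # message or nothing, and a watcher on new/ gets IN_MOVED_TO.
        os.rename(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

Both directories have to live on the same file system for the rename to be atomic.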

I/O Serialization

Request Must Know How to Respond. But... ONE I/O THREAD?!!

I/O Serialization

hash(queue_id) % num_threads

Predictable, Simple Hash. Tenants CAN & WILL Clobber, Though Sticky.
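A minimal sketch of the `hash(queue_id) % num_threads` assignment. CRC32 is a stand-in chosen here because it is stable across processes, unlike Python's built-in string `hash()`. The deck does not say which hash the real system uses, and the function and queue names are hypothetical.

```python
import zlib


def io_thread_for(queue_id, num_threads):
    """Pick the I/O thread that owns all reads and writes for `queue_id`."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return zlib.crc32(queue_id.encode("utf-8")) % num_threads


NUM_THREADS = 8
for queue_id in ("tenant-a/orders", "tenant-a/audit", "tenant-b/orders"):
    print(queue_id, "-> I/O thread", io_thread_for(queue_id, NUM_THREADS))
```

Because the mapping is deterministic, every request for a queue lands on the same thread (the "sticky" part), so I/O is serialized per queue while different queues can run in parallel. With more queues than threads, some queues necessarily share a thread, which is where tenants can clobber one another.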

Message Queue Architecture

64K

Notification-based Movement. Serialized I/O per-queue. Parallel I/O per-server. Highly Available Front-End. Decoupled Delivery. Basic UNIX Command Maintenance.
Learn From MySQL. Random Writes & Random Size. ZFS & ZIL. kqueue vs. inotify. Fix One.

Summary

Caching

File system choice

Benchmarking w/ sysbench

Efficiency through proper configuration

Robustness through cooperation & decoupling

Discovering & understanding your write pattern

Benchmark & Graph everything

Never Assume Anything (atime, stripe width, etc.)

Jason Johnson

https://github.com/jasonjohnson
http://www.slideshare.net/jasonajohnson

A Case for Redundant Arrays of Inexpensive Disks (RAID)
http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf

Practical File System Design
http://www.nobius.org/~dbg/practical-file-system-design.pdf

XFS Papers and Documentation
http://xfs.org/index.php/XFS_Papers_and_Documentation

Kernel Documentation on File Systems
https://www.kernel.org/doc/Documentation/filesystems/

MySQL Performance Blog
http://www.mysqlperformanceblog.com/

MySQL DBA
http://mysqldba.blogspot.com/

MySQL Server System Variables
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html