Exploiting Your File System to Build Robust & Efficient Workflows
Jason Johnson [email protected]
Good afternoon!
Title
Begins
Database server?
Video Encoder?
Where to go from here?
Up or Down
Understand the Abstraction
Go Down to Physical
What is /dev/sdc, anyway?
Common
Big Virtual Disk
RAID Controller
Couple Drives
Not set it and forget it
The Hard Disk Drive
Platters
Spindle
Actuator
Actuator Coil
Actuator Arm
Heads
Basic Platter Geometry
Cylinder-Head-Sector (obsolete)
Logical Block Addressing, LBA
Sectors
Cluster
Track
Cylinder
512, 2K, 4K
CHS obsolete
Giant String of Sectors
What is /dev/sdc, anyway?
Disk Array Controller
Break it Down
The Disk Array Controller
Adaptec 5405Z
PCIe x8
1.2 GHz Dual Core RAID on Chip (ROC)
128-1024 MB Battery-Backed DDR
1-4 GB NAND
Up to 256 SATA or SAS HDDs
arcconf
DDR Flushes to NAND
Configuration Tool
Stunned
Discrete GPU for your File System
Write Caching
Data Corruption?
(hands)
Fallible
Disable All Caching
Eliminate Class of Errors
...you *must* disable the individual hard disk write cache in order to ensure to keep the file system intact after a power failure.
XFS.org FAQ
Initial 1MB Sector Alignment
Sector Size   Starting Sector   Drive Type
512 B         2048              SATA & SAS
2 KB          512               SSD
4 KB          256               Advanced Format & SSD
blockdev --getpbsz /dev/sdc
blockdev --getss /dev/sdc
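The starting sectors in the table follow from dividing a 1 MiB offset by the sector size. A minimal sketch of that arithmetic, where the function names are illustrative and not part of any tool shown in the talk:

```python
ONE_MIB = 1024 * 1024  # 1 MiB alignment boundary, in bytes


def starting_sector(sector_size, offset=ONE_MIB):
    """Return the first sector (LBA) that begins exactly `offset` bytes into the disk."""
    if offset % sector_size != 0:
        raise ValueError("offset is not a whole number of sectors")
    return offset // sector_size


def is_aligned(start_sector, sector_size, boundary=ONE_MIB):
    """True when a partition starting at `start_sector` begins on a `boundary` multiple."""
    return (start_sector * sector_size) % boundary == 0


if __name__ == "__main__":
    for size in (512, 2048, 4096):
        print(size, starting_sector(size))
```

Feeding the sector size reported by blockdev into `is_aligned` together with the partition's start sector is one way to verify an existing layout.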
Aligning IO on a hard disk RAID
http://www.mysqlperformanceblog.com/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
Sector Alignment
1MB Offset
Room for Partition Table
Use These Tools
Verify Correctness
(s)gdisk
Sectors
Clearly Communicating in LBA
Logical Size
Offset
Check.
All Makes Sense
Not Scary
Tuning the File System
Disable Caching
Tools: sysbench, iozone, iostat, vmstat
Start Simple
Apply Increasing Parallel I/O
ext2, ext3, ext4, xfs, btrfs, zfs?
Graph Everything
No Caching
Sysbench
16 Data-Bearing Disks
Hardware Controller
XFS Designed for This!
But... verify through testing.
arcconf
arcconf \
  create 1 logicaldrive \
  stripesize 256 \
  wcache wt \
  rcache roff \
  max 0 \
  0 3 \
  0 4 \
  ...
  0 18
Stripe Size 256k
Entire Width
Cache Disabled
Add Physical Drives
All Controller Brands Different
sysbench, fileio
sysbench \
  --num-threads=[8-1024] \
  --test=fileio \
  --file-total-size=10G \
  --file-test-mode=rndwr \
  --file-fsync-all=on \
  --file-num=64 \
  --file-block-size=16384 \
  [prepare|run|cleanup]
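One way to apply increasing parallel I/O is to script the sweep over thread counts. The sketch below only builds the command lines and does not run them. The flags are the ones on the slide; the doubling step between 8 and 1024 and the helper names are assumptions.

```python
def thread_counts(low=8, high=1024):
    """Doubling sequence from low to high inclusive (an assumed step pattern)."""
    counts = []
    n = low
    while n <= high:
        counts.append(n)
        n *= 2
    return counts


def fileio_command(threads, phase):
    """Build the argv for one sysbench fileio invocation using the slide's flags."""
    return [
        "sysbench",
        f"--num-threads={threads}",
        "--test=fileio",
        "--file-total-size=10G",
        "--file-test-mode=rndwr",
        "--file-fsync-all=on",
        "--file-num=64",
        "--file-block-size=16384",
        phase,
    ]


def sweep():
    """Yield prepare, run and cleanup commands for each thread count in turn."""
    for threads in thread_counts():
        for phase in ("prepare", "run", "cleanup"):
            yield fileio_command(threads, phase)


if __name__ == "__main__":
    for argv in sweep():
        print(" ".join(argv))
```

Each argv list could be handed to `subprocess.run` on the benchmark host, with the output captured for graphing.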
What are we comparing?
EXT4 vs. XFS
Naive
Naive External
Tuned
Tuned External
EXT4, mkfs
mkfs.ext4 /dev/sdc1
mke2fs \
  -b 4096 \
  -O journal_dev \
  /dev/sdb1 32768
mkfs.ext4 \
  -b 4096 \
  -E stride=4,stripe_width=16 \
  -J device=/dev/sdb1 \
  /dev/sdc1
mount -o noatime,stripe=16 \
  /dev/sdc1 \
  /mnt/data
From Naive to Modestly Tuned
Stripe Width
Stride
128MB external journal
Mount requires extra information
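The mke2fs man page describes stride as the number of file system blocks written to one disk before moving to the next, and stripe_width as the number of blocks in a full stripe, typically stride times the number of data-bearing disks. A small sketch of that calculation, assuming the controller's 256k stripe is the entire width across 16 data-bearing disks, as the arcconf notes suggest. The function name is illustrative.

```python
def ext4_stripe_options(full_stripe_kib, data_disks, block_size=4096):
    """Return (stride, stripe_width) in file system blocks, per the mke2fs definitions."""
    full_stripe_bytes = full_stripe_kib * 1024
    if full_stripe_bytes % data_disks != 0:
        raise ValueError("full stripe does not divide evenly across the disks")
    chunk_bytes = full_stripe_bytes // data_disks  # bytes written per disk
    if chunk_bytes % block_size != 0:
        raise ValueError("per-disk chunk is not a whole number of blocks")
    stride = chunk_bytes // block_size
    stripe_width = stride * data_disks
    return stride, stripe_width


if __name__ == "__main__":
    print(ext4_stripe_options(256, 16))
```

Under these assumptions the stride comes out at 4, matching the slide, while the stripe width comes out at 64 blocks rather than the 16 passed on the slide. It is worth checking which value suits your array and confirming it with a benchmark.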
I/O Requests per Second, EXT4
Review Graph
Latency, EXT4
Review Graph
XFS, mkfs
mkfs.xfs /dev/sdc1
mkfs.xfs \
  -d sw=16,su=16k \
  -l logdev=/dev/sdb1,size=128m,su=256k \
  /dev/sdc1
mount -o noatime,logdev=/dev/sdb1,logbufs=8,logbsize=256k \
  /dev/sdc1 \
  /mnt/data
Again, From Naive to Modestly Tuned
Stripe Width
Stripe Unit Size
External Journal Device
Additional Information Needed by Mount
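For mkfs.xfs, su is the per-disk stripe unit and sw is the number of stripe units in a full stripe, which is the number of data-bearing disks. A minimal sketch that derives the -d option string from the array geometry, assuming a 256k full stripe over 16 disks. The helper name is hypothetical.

```python
def xfs_data_options(full_stripe_kib, data_disks):
    """Build the value for mkfs.xfs -d from full stripe size (KiB) and data disk count."""
    if full_stripe_kib % data_disks != 0:
        raise ValueError("full stripe does not divide evenly across the disks")
    su_kib = full_stripe_kib // data_disks  # stripe unit written to each disk
    return f"sw={data_disks},su={su_kib}k"


if __name__ == "__main__":
    print(xfs_data_options(256, 16))
```

With those inputs the result is the same sw=16,su=16k pair used in the tuned command.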
I/O Requests per Second, XFS
Review Graph
Latency, XFS
Review Graph
What are we looking for?
I/O Requests per Second... per Drive
&
Reasonable Latency
We want 330 IOPS
Our $$$
XFS vs. EXT4, Latency
Neck and Neck
Slight Advantage at 256 Threads
XFS vs. EXT4, per Drive
Noticeable Advantage at 256 Threads
5,200 IOPS
Reached Practical Limit Early
Fully Tuned?
sysctl for XFS?
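The per-drive figure is the array total divided by the number of data-bearing disks. A short sketch, assuming the 5,200 IOPS in the notes is the total for the 16-disk array (the slide does not say so explicitly):

```python
def iops_per_drive(total_iops, data_disks):
    """Average I/O requests per second handled by each data-bearing disk."""
    return total_iops / data_disks


if __name__ == "__main__":
    per_drive = iops_per_drive(5200, 16)
    print(per_drive)  # 325.0, just under the 330 IOPS target
```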
Scenario 1, Efficiency
MySQL
Predictable Write Pattern
We Make It Efficient
MySQL Write Pattern
InnoDB Pages
Linux Pages
Unit of Work
XFS Allocation Groups
EXT4 Metadata Groupings
MySQL Configuration
System Variable                 Value
innodb_io_capacity              5000
innodb_thread_concurrency       256
innodb_write_io_threads         192
innodb_read_io_threads          64
innodb_log_file_size            32M
innodb_log_files_in_group       32
innodb_buffer_pool_size         10GB
innodb_buffer_pool_instances    10
MySQL System Variables
Review The Configuration
Plug-in Values from Sysbench
Google MySQL System Variables
Explain Values
sysbench, mysql
sysbench \
  --num-threads=[32|64|128|256] \
  --test=oltp \
  --oltp-test-mode=nontrx \
  --oltp-nontrx-mode=insert \
  --oltp-table-size=100000 \
  --max-requests=10000000 \
  [prepare|run|cleanup]
Small Benchmark
10 Million Inserts, One Transaction Each
Ramp up Threads
Test EXTERNALLY
Deadlock Potential
Spin-locks Contending with Benchmark
Transactions per Second
+96.28%
+125.37%
+102.29%
+69.43%
15,962.79/s @ 16ms
9,421.29/s @ 27ms
Transactions per Second
For the Percentage Increase Folks
For the Real Figures Folks
125% increase (in some cases)
High-End nearing 16,000
Before Next Section
-----
Who has written code like this?
(hands)
Scanning
Race Condition
Creation Behind Us
-----
There is a better way!
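The percentage labels can be reproduced from the raw throughput figures. A small sketch using the two absolute numbers on the slide, assuming 9,421.29/s is the untuned baseline and 15,962.79/s the tuned result for the same thread count:

```python
def percent_increase(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a percentage."""
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    print(round(percent_increase(9421.29, 15962.79), 2))  # 69.43
```

The result agrees with the +69.43% label, which suggests those two figures belong to that data point.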
inotify
Event Mask          Fired when...
IN_ACCESS           File was accessed (read)
IN_ATTRIB           Metadata changed
IN_CLOSE_WRITE      File opened for writing was closed
IN_CLOSE_NOWRITE    File not opened for writing was closed
IN_CREATE           File/directory created in watched directory
IN_DELETE           File/directory deleted from watched directory
IN_DELETE_SELF      Watched file/directory was itself deleted
IN_MODIFY           File was modified
IN_MOVE_SELF        Watched file/directory was itself moved
IN_MOVED_FROM       File moved out of watched directory
IN_MOVED_TO         File moved into watched directory
IN_OPEN             File was opened
It Can Tell Us
No Scanning or Polling
No Races
-----
IN_CLOSE_WRITE
IN_MOVED_TO
Event Stream
No Races
File System Obeys Rules
Can't Move Files Being Written
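On the producer side, one common way to get a race-free hand-off is to write the file in a staging directory and rename it into the watched directory only once it is complete. The talk does not show producer code, so the sketch below is an assumption: the `publish` helper, its parameters and the directory layout are hypothetical, and the staging directory must sit on the same file system as the inbox for the rename to be atomic.

```python
import os
import tempfile


def publish(inbox_dir, name, payload, staging_dir):
    """Write `payload` to a staging file, then atomically rename it into `inbox_dir`."""
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before the move
        final_path = os.path.join(inbox_dir, name)
        os.rename(tmp_path, final_path)  # a watcher on inbox_dir sees IN_MOVED_TO
    except BaseException:
        # Do not leave partial files behind in staging if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A watcher on the inbox then never observes partial content, because the file only appears there after it has been closed.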
inotify in [language]
Language   Source
Python     pip install pyinotify
PHP        pecl install inotify
Go         Go's exp repository
Ruby       gem install rb-inotify
C          #include <sys/inotify.h>
Go's extracted from stdlib
FreeBSD's kqueue
Scenario 2, Robustness
A Custom Message Queue
Internally
RabbitMQ
Every Datacenter
Worldwide
-----
Simpler
File-Based
RESTful
200,000 Concurrency
Insane Burst-able Throughput
Message Queue Architecture
Familiar?
SMTP or Maildir
-----
Fall Over
-----
Inbox
Partial Content
Source & Victim Locked
Fetching Serialized
I/O Serialization
Request Must Know How to Respond
But... ONE I/O THREAD?!!
I/O Serialization
hash(queue_id) % num_threads
Predictable, Simple Hash
Tenants CAN & WILL Clobber, Though Sticky
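A minimal sketch of the hash(queue_id) % num_threads mapping. The talk does not name a hash function, so CRC32 from the standard library stands in here as an assumption; it is used in place of Python's built-in hash() because string hashes are randomized per process, which would break stickiness across restarts. The function names are illustrative.

```python
import zlib


def io_thread_for(queue_id, num_threads):
    """Map a queue to one I/O thread index; the same queue always gets the same thread."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return zlib.crc32(queue_id.encode("utf-8")) % num_threads


def partition(queue_ids, num_threads):
    """Group queue IDs by their assigned thread, preserving arrival order within a group."""
    groups = {i: [] for i in range(num_threads)}
    for queue_id in queue_ids:
        groups[io_thread_for(queue_id, num_threads)].append(queue_id)
    return groups
```

Because every request for a queue lands on the same thread, I/O is serialized per queue while different queues proceed in parallel. Two busy tenants can still hash to the same thread and contend with each other.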
Message Queue Architecture
64K
Notification-based Movement
Serialized I/O per-queue
Parallel I/O per-server
Highly Available Front-End
Decoupled Delivery
Basic UNIX Command Maintenance
-----
Learn From MySQL
Random Writes & Random Size
ZFS & ZIL
kqueue vs. inotify
Fix One
Summary
Caching
File system choice
Benchmarking w/ sysbench
Efficiency through proper configuration
Robustness through cooperation & decoupling
Discovering & understanding your write pattern
Benchmark & Graph everything
Never Assume Anything (atime, stripe width, etc.)
Jason Johnson [email protected]
https://github.com/jasonjohnson
http://www.slideshare.net/jasonajohnson
A Case for Redundant Arrays of Inexpensive Disks (RAID)
http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf
Practical File System Design
http://www.nobius.org/~dbg/practical-file-system-design.pdf
XFS Papers and Documentation
http://xfs.org/index.php/XFS_Papers_and_Documentation
Kernel Documentation on File Systems
https://www.kernel.org/doc/Documentation/filesystems/
MySQL Performance Blog
http://www.mysqlperformanceblog.com/
MySQL DBA
http://mysqldba.blogspot.com/
MySQL Server System Variables
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html