© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Kevin Stinson, Senior Software Engineer, Big Data Services,
Quantcast Corporation
D.J. Hanson, Director of Infrastructure, Smartsheet
November 30, 2016
Case Study: How Startups like
Smartsheet and Quantcast Accelerate
Innovation and Growth with Amazon S3
STG309
What to Expect from the Session
• Quick overview of Quantcast’s MapReduce System
• Changes made to move to AWS and Amazon S3
• Problems we encountered on the way and their
resolutions
A little bit about Quantcast
• Uses real-time data about consumer behavior to
significantly improve the relevancy of digital advertising
• Over 100 billion bids and 40 PB of data processed per
day
• 180 engineers globally across San Francisco, Seattle,
Singapore, and London
• We’re hiring – [email protected]
MapReduce at Quantcast
QFS – Quantcast’s distributed file system
• Open sourced - https://github.com/quantcast/qfs
• Written in C++
• Compatible with Hadoop 0.23 and higher, Hive, Spark,
Storm, etc.
• Supports replication, erasure coding, tiered storage, and
rack awareness
QFS - continued
• Many of our internal tools assume data is on QFS
• Quantcast has more than 17 PB of data stored in QFS
Basic QFS setup
[Diagram: a QFS client communicates with the metaserver and with chunkservers; each chunkserver manages tiered storage across RAM, SSD, and disk.]
Quantflow – Quantcast’s MapReduce system
• Over 40 PB processed daily
• Heavily relies on QFS
• Uses a QFS instance tiered with RAM disks and SSDs for
intermediate data
• Bundled with control and monitoring systems such as
ZooKeeper and Ganglia
Moving Quantflow to AWS
Adding Amazon S3 support to QFS
• Uses an S3 bucket as a block device
• Replication and erasure coding are not used on the S3
tier, since S3 already provides durability
• Makes S3 appear as just another tier in QFS
• I/O performance comparable to other S3-based file
systems such as EMRFS
• Supports fast renames and deletes
• Usable with standard Hadoop and Hadoop-friendly tools
(see the sketch below)
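For illustration, a minimal sketch of reaching QFS from Hadoop tooling through the qfs:// scheme registered by the QFS Hadoop plugin; the metaserver host and port here are hypothetical:
$ hadoop fs -D fs.qfs.impl=com.quantcast.qfs.hadoop.QuantcastFileSystem \
    -D fs.qfs.metaServerHost=qfs-meta.example.com \
    -D fs.qfs.metaServerPort=20000 \
    -ls qfs://qfs-meta.example.com:20000/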
QFS setup with S3 bucket
[Diagram: same as the basic setup, with an S3 bucket attached as an additional storage tier alongside the chunkservers' RAM, SSD, and disk tiers.]
Changes to Quantflow
• Repackaged Quantflow for easier installation on a fresh
Amazon EC2 cluster
• Some important services run on dedicated instances, but
all MapReduce workers can run on spot instances
Data flow
• The ends of the data pipeline are generally QFS on S3
• Intermediate data lives on QFS using tiered RAM disks,
SSDs, or Amazon EBS volumes, with replication or
erasure coding
• Direct access to QFS data in the data center is possible
but limited by bandwidth and cost-control concerns
Copying data to QFS on S3
• Copied 8 PB of data center data to S3, both as a backup
and as input for AWS Quantflow jobs
• Done as a copy from one QFS instance to another (see
the sketch below)
• The process took weeks to complete
• The major bottleneck was the 20 Gb/sec link between the
data center and AWS
• We still copy 120–150 TB/day
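One way to express such an instance-to-instance copy (a sketch, assuming the QFS Hadoop plugin is configured on both clusters; hosts and paths are hypothetical) is Hadoop's distcp over the qfs:// scheme:
$ hadoop distcp \
    qfs://dc-meta.example.com:20000/data/logs \
    qfs://aws-meta.example.com:20000/data/logs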
Issues and Resolutions
Low S3 performance
• Initial tests of Quantflow in AWS ran slower than
expected
• S3 throughput hit an apparent cap at 20–30 GB/sec
• Adding more EC2 instances to the Quantflow cluster did
not improve performance
• Tests accessing S3 directly had the same problem
Finding the cause
• It took us two months to find the root cause, even with
help from AWS engineers
• A tcpdump showed that 8% of traffic was DNS queries
• A parallel DNS query benchmark showed our internal
DNS server achieved only 200 QPS vs. 10,000 QPS
using Amazon DNS (see the sketch below)
• All DNS queries went to a DNS server on a t2.micro
instance – a legacy of our data center setup
• S3 uses short DNS TTLs for load balancing
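A minimal sketch of the kind of check involved, using dig; the resolver 10.0.0.2 stands in for whichever DNS server is under test:
$ # The second column of the answer is the TTL – S3 keeps it short for load balancing
$ dig +noall +answer s3.amazonaws.com @10.0.0.2
$ # Crude parallel benchmark: 2,000 queries, 64 in flight; QPS ≈ 2000 / elapsed seconds
$ time seq 2000 | xargs -P 64 -I{} dig +short s3.amazonaws.com @10.0.0.2 >/dev/null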
Fixing the problem
• We configured dnscache on worker nodes to forward
DNS queries for S3 endpoints to Amazon DNS (see the
sketch below)
• We achieved 75 GB/sec with 3,200 concurrent
processes on 200 c3.8xlarge instances with dnscache;
100 GB/sec is easily achievable by using c4.8xlarge and
adding a few more instances
• Using Amazon VPC DNS should also work
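A rough sketch of that dnscache change (paths and the resolver address 10.0.0.2 are assumptions; dnscache sends queries for a domain to the addresses listed in root/servers/<domain>):
$ # Forward everything under amazonaws.com to the Amazon-provided resolver
$ echo 10.0.0.2 > /service/dnscache/root/servers/amazonaws.com
$ svc -t /service/dnscache  # daemontools: restart dnscache to pick up the change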
Improvement from DNS caching change
[Chart: comparison of S3 read performance on 200 × c3.8xlarge, 16 workers/instance, 64 MB × 16 objects, Boto2 APIs, 3,200 concurrent processes – throughput rose from 22 GB/sec with a single DNS forwarder to 74 GB/sec with dnscache.]
Checklist to enable 100 GB/sec with S3
• Use multipart upload with large-enough object sizes (see
the sketch after this list)
• Use well-distributed object keys
• Have enough DNS capacity to sustain 10,000 QPS
• Allow S3 time (and data) to partition the bucket
• Pay attention to instance types and their network
bandwidth
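A sketch for the first two items using AWS CLI settings (the s3 configuration keys are real AWS CLI options; the sizes, bucket, and key scheme are illustrative):
$ aws configure set default.s3.multipart_threshold 64MB
$ aws configure set default.s3.multipart_chunksize 64MB
$ aws configure set default.s3.max_concurrent_requests 100
$ # Well-distributed keys (2016-era guidance): lead with a hash prefix, not a timestamp
$ aws s3 cp part-00042 s3://my-bucket/$(echo -n part-00042 | md5sum | cut -c1-2)/part-00042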
Tools that helped
• dig, tcpdump, boto with logging
• AWS CLI, S3 bucket logging
• Parallel execution tools like GXP cluster shell
• Try micro-benchmarking before checking the whole
stack (see the example below)
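For example, the tcpdump observation above boils down to something like this (the interface name and packet count are placeholders):
$ # Capture 100k packets, then count how many are DNS
$ tcpdump -n -i eth0 -c 100000 -w sample.pcap
$ tcpdump -n -r sample.pcap 'udp port 53' | wc -l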
Quantflow spot fleet issues
• Getting a large spot fleet of more capable instances can
be difficult, take a long time, or cost more than expected
• With availability and pricing changes, we may want a
mixture of several different spot instance types and the
ability to drop or lose instances
• Because intermediate data is stored locally, losing
instances can cause job failures
A workaround
• Request multiple smaller spot fleets
• Tell QFS that each fleet is its own virtual rack (see the
sketch after this list)
• QFS will try to spread the data across racks
• With N-way replication, up to N−1 fleets can be lost
• With QFS's standard 6+3 erasure coding, up to 3 fleets
can be lost while using less space than 4-way
replication
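A rough sketch of both steps; the fleet config file is hypothetical, and the rack-prefix parameter comes from QFS's annotated configuration (check the QFS docs for the exact format):
$ # Request one of several smaller fleets; repeat with different instance types
$ aws ec2 request-spot-fleet --spot-fleet-request-config file://fleet-a.json
$ # Map each fleet's subnet onto its own QFS rack in the metaserver config
$ cat >> MetaServer.prp <<'EOF'
metaServer.rackPrefixes = 10.0.1. 1 10.0.2. 2 10.0.3. 3
EOF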
Parting thoughts
• An easily overlooked part of your setup can have a large
impact on performance
• As we started using AWS services at larger scale, we
hit a number of account limitations, such as instance
limits and total provisioned SSD limits
• If your performance levels off, ask your friendly AWS
liaison whether an account limitation is the issue
The Smartsheet use case
In the before times…
During the Waywhen.
Game changers disrupt prior assumptions
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e
PRE 1feed5-1337-d00d-2ba5e/
2016-12-25 13:29:10 1048576 1feed5-1337-d00d-2ba5e
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/
PRE mobile/
PRE thumbs/
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/mobile/
2016-11-22 10:17:21 0
2016-11-22 10:17:34 165342 400.jpg
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/thumbs/
2016-11-22 10:15:23 0
2016-11-22 10:17:13 455 20.png
2016-11-22 10:17:13 169722 400.png
2016-11-22 10:17:12 494804 700.png
A dirty trick – Objects aren’t paths
$ ls -la
drwxrwxr-x. 2 djhanson djhanson 4096 Dec 25 10:38 bar
-rw-rw-r--. 1 djhanson djhanson 0 Dec 25 10:37 foo.bar
-rw-rw-r--. 1 djhanson djhanson 0 Dec 25 10:37 foo.baz
-rw-rw-r--. 1 djhanson djhanson 0 Dec 25 10:37 foo.qux
$ mv foo.bar bar # Works: the directory exists
$ mv foo.baz baz # Oops, not what we wanted!
$ mv foo.qux qux/ # Fails appropriately.
mv: cannot move `foo.qux' to `qux/': Not a directory
$ find .
.
./bar
./bar/foo.bar # desired state
./baz # not what we wanted: foo.baz was renamed, not moved into a directory
./foo.qux # proper error condition – the file never moved
The power of the trailing slash
$ aws s3 cp s3://icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/ ./picture
$ md5sum ./picture
d41d8cd98f00b204e9800998ecf8427e
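# d41d8cd9… is the MD5 of an empty file – the trailing slash matched the zero-byte folder-marker object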
$ aws s3 cp s3://icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e ./picture
$ md5sum ./picture
1cdb80e2693da95e7fa647895d6277c8
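# Without the slash, we fetched the real 1 MB object that shares the folder's name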
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e
PRE 1feed5-1337-d00d-2ba5e/
2016-12-25 13:29:10 1048576 1feed5-1337-d00d-2ba5e
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/
PRE mobile/
PRE thumbs/
2016-12-25 13:29:44 18 meta.json
Take care when operating against paths
X-AMZ-META-FTW
X-AMZ-META-BILLTO: ALICE
X-AMZ-META-CREATOR: BOB
X-AMZ-META-STYLE: CLASSIFIED
X-AMZ-META-RELATIONSHIP: COMPLICATED
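For illustration, user-defined metadata like the headers above can be attached at upload time and read back with head-object (the local file name is a placeholder; the bucket and key reuse the earlier example):
$ aws s3 cp 400.png s3://icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/thumbs/400.png \
    --metadata billto=alice,creator=bob
$ aws s3api head-object --bucket icanhazbukkit \
    --key 1f/ee/1feed5-1337-d00d-2ba5e/thumbs/400.png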
A caveat about consistency
• In 2016, S3 offered read-after-write consistency for new
objects but only eventual consistency for overwrites,
deletes, and LIST operations – code that treats keys as
paths must tolerate stale listings
Related Sessions
• For more about Quantcast’s experiences with other AWS
services, check out DAT310 - Building Real-Time
Campaign Analytics Using AWS Services
• For more info on S3, check out STG303 - Deep Dive on
Amazon S3
Thank you!
Remember to complete
your evaluations!