Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experience

Macy’s: Why Your Database Decision Directly Impacts Customer Experience
October 5, 2016
Peter Connolly, Senior Architect @ Macys.com

Agenda

• Background/Problem Statement
• Options
• Data Models
• Migration
• Performance
• Business Outcomes
• Retrospectives

Background & Problem Statement

Macys.com, San Francisco

About Us – Macys.com
• ~1,000 full-time employees
  • ~500 technology employees
• Most server side apps are Java-based
  • Heavy reliance on Spring Framework
• 6th largest eCommerce site
• >$4 billion in sales last year
  • 5-25% growth year-over-year, depending on product category
• More investment in dot-com properties

About Me – Peter Connolly

• At Macys.com for 5 years
• Senior Application Architect
• Was lead architect on Catalog Migration
• Currently one of two lead architects on multi-data center expansion

About the Macys.com Catalog
• Transformed customer experience by serving product, inventory and store data to website, mobile, store, and partner applications
• Customer experience needed to keep up with growth in business:
  • 10x growth in data
  • Move from a read-mostly model to one which could handle near-real-time updates
  • Move into multiple data centers
• Real-time RESTful data services application

Some History
• Macy’s online catalog was based on a 3NF RDBMS
• Write throughput was adequate for online catalog
• Read performance was way too slow – customer experience impact
• Had to add an in-memory data grid
  • That further increased refresh times – customer experience impact
• Both were very expensive, proprietary packages
• And Catalog loaded before Search Index
• Nightly refresh times were ~6hrs and growing

Technical Problems
• Scalability Problems – Customer Experience Impact
  • Heavily normalized relational database
• Caching – Customer Experience Impact
  • Local caches still put too much load on DB
  • Large latency gap between front cache hit vs. miss
  • Large updates led to GC problems
• Operational Problems
  • Limited time window to pre-warm cache
  • Needed second cache cluster for failover
  • Complicated release process
  • Expensive to scale out distributed cache

Business Problems
• Limits on the size of the Macy’s catalog
  • Could not add large amounts of store-only data
• Long nightly refresh ⇒ Catalog not up-to-date
• In-memory DataGrid failures
  • Large grid clusters had stability problems
  • Large hit in response times if grid went down
  • Created a back-up grid to deal with failures, but…
  • This led to even longer refresh times

Options Explored

Options Evaluated - 2013

• Denormalized Relational DB: DB2, Oracle
• Document DB: MongoDB
• Columnar DB: DataStax Enterprise (Apache Cassandra™), Couchbase, ActiveSpaces (TIBCO)
• Graph: Neo4J
• Object: Versant
• Any selection had to have commercial support

Options Short List

• MongoDB
  • Feature-rich, JSON document DB
• DataStax Enterprise (Apache Cassandra™)
  • Scalable, true peer-to-peer
• ActiveSpaces
  • TIBCO Proprietary. In-memory key/value datagrid
  • Existing relationship with TIBCO
• Vendors assisted setup and running benchmarks
  • DataStax, 10Gen & TIBCO

POC Benchmark Environment

• Amazon EC2
• 5 servers
  • Same servers used for each test
• Had the 3 DB vendors assist with setup and execution of benchmarks
• Modeled benchmarks after our own retail inventory use cases
• C3.2xlarge instances
  • 8 vCPU, 15GB RAM, 2x80GB SSD
• Baseline of ~150MM inventory records

POC Results - Summary
DataStax Enterprise (Apache Cassandra™)

Data Models

Data Model Requirements

• Model hierarchical domain objects
• Aggregate data to minimize reads
• No reads before writes
• Readable by 3rd party tools
  • (e.g., cqlsh, Dev Center)
• DataStax pivotal in modeling process

Data Model based on Lists

 id | name       | upcId    | upcColor         | upcAttrId    | upcAttrName                       | upcAttrValue    | upcAttrRefId
----+------------+----------+------------------+--------------+-----------------------------------+-----------------+--------------
 11 | Nike Pants | [22, 33] | ['White', 'Red'] | [44, 55, 66] | ['ACTIVE', 'PROMOTION', 'ACTIVE'] | ['Y', 'Y', 'N'] | [22, 22, 33]
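A minimal Python sketch of how an application might read the list-based row above. The dictionary stands in for one row as a driver could return it, and the helper names `color_for_upc` and `attrs_for_upc` are hypothetical, not from the talk. It shows that the parallel lists are tied together only by position, so a reader has to zip them back together.

```python
# One row of the list-based model, copied from the example table.
row = {
    "id": 11,
    "name": "Nike Pants",
    "upcId": [22, 33],
    "upcColor": ["White", "Red"],
    "upcAttrId": [44, 55, 66],
    "upcAttrName": ["ACTIVE", "PROMOTION", "ACTIVE"],
    "upcAttrValue": ["Y", "Y", "N"],
    "upcAttrRefId": [22, 22, 33],
}


def color_for_upc(row, upc_id):
    """Hypothetical helper: colors match UPC ids purely by list position."""
    return row["upcColor"][row["upcId"].index(upc_id)]


def attrs_for_upc(row, upc_id):
    """Hypothetical helper: keep the attributes whose upcAttrRefId is upc_id."""
    columns = zip(row["upcAttrId"], row["upcAttrName"],
                  row["upcAttrValue"], row["upcAttrRefId"])
    return {name: value for _attr_id, name, value, ref in columns
            if ref == upc_id}


print(color_for_upc(row, 22), attrs_for_upc(row, 22))
```

Because the association is positional, inserting or removing one element means rewriting every parallel list consistently.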

Data Model based on Maps

 id | name       | upcColor                 | upcAttrName                                   | upcAttrValue                | upcAttrRefId
----+------------+--------------------------+-----------------------------------------------+-----------------------------+--------------------------
 11 | Nike Pants | {22: 'White', 33: 'Red'} | {44: 'ACTIVE', 55: 'PROMOTION', 66: 'ACTIVE'} | {44: 'Y', 55: 'Y', 66: 'N'} | {44: 22, 55: 22, 66: 33}
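For comparison, a small Python sketch of reading the map-based row. The dictionary again stands in for one returned row, and `upc_view` is a hypothetical helper name. With maps, each value is looked up by UPC id or attribute id instead of by list position.

```python
# One row of the map-based model, copied from the example table.
row = {
    "id": 11,
    "name": "Nike Pants",
    "upcColor": {22: "White", 33: "Red"},
    "upcAttrName": {44: "ACTIVE", 55: "PROMOTION", 66: "ACTIVE"},
    "upcAttrValue": {44: "Y", 55: "Y", 66: "N"},
    "upcAttrRefId": {44: 22, 55: 22, 66: 33},
}


def upc_view(row, upc_id):
    """Hypothetical helper: gather color and attributes for one UPC by key."""
    attrs = {
        row["upcAttrName"][attr_id]: row["upcAttrValue"][attr_id]
        for attr_id, ref in row["upcAttrRefId"].items()
        if ref == upc_id
    }
    return {"upcId": upc_id, "color": row["upcColor"][upc_id], "attrs": attrs}


print(upc_view(row, 33))
```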

Data Model based on Compound Key & JSON

CREATE TABLE product (
    id int,       -- product id
    upcId int,    -- 0 means product row,
                  -- otherwise upc row
    object text,  -- JSON object
    review text,  -- JSON object
    PRIMARY KEY(id, upcId)
);

 id | upcId | object                                                                            | review
----+-------+-----------------------------------------------------------------------------------+--------------------------------------
 11 |     0 | {"id" : "11", "name" : "Nike Pants"}                                              | {"avgRating" : "5", "count" : "567"}
 11 |    22 | {"color" : "White", "attr" : [{"id": "44", "name": "ACTIVE", "value": "Y"}, ...]} | null
 11 |    33 | {"color" : "Red", "attr" : [{"id": "66", "name": "ACTIVE", "value": "N"}]}        | null
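A hedged Python sketch of how the rows of one partition could be folded back into a single product object. The tuples stand in for the result of reading all rows with id = 11, the attributes elided with "..." on the slide are simply left out, and `assemble_product` and the `upcs` key are hypothetical names chosen for illustration.

```python
import json

# Stand-in for the rows of partition id = 11: (id, upcId, object, review).
rows = [
    (11, 0, '{"id" : "11", "name" : "Nike Pants"}',
     '{"avgRating" : "5", "count" : "567"}'),
    (11, 22, '{"color" : "White", "attr" : '
             '[{"id": "44", "name": "ACTIVE", "value": "Y"}]}', None),
    (11, 33, '{"color" : "Red", "attr" : '
             '[{"id": "66", "name": "ACTIVE", "value": "N"}]}', None),
]


def assemble_product(rows):
    """Hypothetical helper: upcId 0 is the product row, all others are UPC rows."""
    product = None
    upcs = {}
    for _product_id, upc_id, obj, review in rows:
        doc = json.loads(obj)
        if upc_id == 0:
            product = doc
            product["review"] = json.loads(review) if review else None
        else:
            upcs[upc_id] = doc
    if product is None:
        raise ValueError("partition has no product row (upcId = 0)")
    product["upcs"] = upcs
    return product


print(assemble_product(rows)["name"])
```

Since `id` is the partition key and `upcId` the clustering column, the product row and all of its UPC rows live in the same partition, which is what lets one read return everything the sketch needs.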

Data Load Performance Comparison

July 2013

JSON vs. primitive column types

• Significantly reduces storage overhead because of better ratio of payload / storage metadata
• Improves throughput and latency
• Supports complex hierarchical structures
• But it loses in partial reads / updates
• Complicates schema versioning
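To illustrate the partial-update trade-off, here is a minimal Python sketch under stated assumptions: a plain dictionary stands in for the product table, and `set_attr_value` is a hypothetical helper. With the document stored as JSON text, changing one attribute means parsing and rewriting the whole document rather than updating a single column.

```python
import json

# Stand-in for the table: (id, upcId) -> JSON text held in the "object" column.
table = {
    (11, 33): '{"color": "Red", "attr": '
              '[{"id": "66", "name": "ACTIVE", "value": "N"}]}',
}


def set_attr_value(table, key, attr_id, new_value):
    """Hypothetical helper: change one attribute inside a stored JSON document."""
    doc = json.loads(table[key])       # read and parse the whole document
    for attr in doc["attr"]:
        if attr["id"] == attr_id:
            attr["value"] = new_value
    table[key] = json.dumps(doc)       # write the whole document back
    return doc


set_attr_value(table, (11, 33), "66", "Y")
print(table[(11, 33)])
```

A writer that already holds the complete object can skip the read and overwrite the row directly; the cost only appears when a single field has to change in isolation.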

Migration Path

Phase 0 – Starting Point

All services were Hessian binary HTTP protocol

Phase 1 – Adding New to Old

All new services are RESTful JSON

Same RESTful JSON services built on old platform

DataStax Enterprise

Phase 2 – All Feeds Going to New

DataStax Enterprise

Phase 3 – Target: Retire Old Platform

DataStax Enterprise

Changing Engines
Hybrid with Legacy
Complete Switchover

Business Outcomes

Customer Experience Improvements

• Much faster Catalog refreshes
  • RDBMS + DataGrid refresh times ~3 hours
  • Cassandra refresh times ~½ hour
• Moving to Incremental Refreshes
  • Refresh times will be even faster: less than ½ hour
• Capacity for Catalog growth
  • Ability to scale number of products/UPCs
  • Ability to scale numbers of requests per second linearly
  • Currently migrating Store Catalog online

DataStax Integrations

• DSE Java Client
  • CQL support
  • Intelligent request routing
• DSE and Apache Spark™ Integration
  • Lets Apache Spark™ use Cassandra as datastore
  • Added 4 nodes to each cluster for Spark analytics
  • Gives us real-time analytics on Catalog & Inventory
  • Great for problem resolution

Retrospectives

What Worked Well

• Partnership with DataStax: Modeling/Performance
• CQL3 easy to migrate from SQL background
• Stable: Haven’t lost a node in production
• Able to reduce our reliance on caching in app tier
• Documentation is good
• Query tracing is helpful
• Cassandra Cluster Manager (https://github.com/pcmanus/ccm)

Problems encountered with Cassandra

• Certain CQL queries brought down cluster
  • DataStax fixed this (phew)
• Deleting and creating a keyspace …
• Lengthy compaction
• Need to understand underlying storage
• OpsCenter Performance Charts

Our own problems

• Small cluster ⇒ all servers must perform well
• Lack of versioning of JSON schema
• How to handle exports
  • 5% CPU would spike to 30% during exports
• Under-allocated test environments (> 1 vCPU)
• Non-primary key access

Future Plans

Future Plans
• Upgrade DSE 4.8.4 → 5.0.x (2017)
  • Apache Cassandra 2.1.12 → 3.0.7+
  • (Requires RHEL 7.2)
• Currently trialing DSE 5.0.1 (C* 3.0.7)
  • For Store & Online Catalog & Inventory POC
• Multi-Datacenter
• Using Mutagen for schema management
• JSON schema versioning
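One common way to approach JSON schema versioning is to embed a version number in each document and upgrade older documents on read. The Python sketch below is purely illustrative and not the approach described in the talk: the `schemaVersion` field, the `load_review` function, and the v1 to v2 change (review numbers stored as integers rather than strings) are all assumptions.

```python
import json

CURRENT_VERSION = 2  # assumed latest version of the hypothetical review schema


def _v1_to_v2(doc):
    """Hypothetical change: v2 stores review numbers as integers, not strings."""
    upgraded = dict(doc)
    upgraded["avgRating"] = int(doc["avgRating"])
    upgraded["count"] = int(doc["count"])
    upgraded["schemaVersion"] = 2
    return upgraded


# Maps a document version to the function that upgrades it by one step.
UPGRADERS = {1: _v1_to_v2}


def load_review(text):
    """Parse a review document and upgrade it step by step to CURRENT_VERSION."""
    doc = json.loads(text)
    version = doc.get("schemaVersion", 1)  # unversioned documents count as v1
    while version < CURRENT_VERSION:
        doc = UPGRADERS[version](doc)
        version = doc["schemaVersion"]
    return doc


print(load_review('{"avgRating" : "5", "count" : "567"}'))
```

Upgrading on read lets old and new documents coexist in the same column, so a schema change does not require rewriting every row at once.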

Stuff we’d like to see in DSE

• DSE Kibana
  • Augment Spark’s query capability with an interactive GUI
• Replication listeners
  • Detect when replication is complete

Questions?

Contacts & Resources
Macy’s
• Peter Connolly: Senior Architect @ Macys.com
• Contact: https://www.linkedin.com/in/peter-connolly-0a544a
DataStax
• Allene Jue: Product Marketing
• Email: [email protected]
Resources
• www.macys.com
• www.datastax.com

© 2015 DataStax, All Rights Reserved.

Before we go…a few reminders

• Download DSE 5.0, available today! https://academy.datastax.com/downloads/welcome
• Check out upcoming webinars here: http://www.datastax.com/resources/webinars
• Become a DataStax Professional Community Member: http://academy.datastax.com/community

Thank You!

Appendix

POC Results - Initial Load

 Operation       | Metric      | TIBCO ActiveSpaces 1.0        | Apache Cassandra 1.2.10      | 10Gen MongoDB 2.0
-----------------+-------------+-------------------------------+------------------------------+------------------------------
 Initial Load    | Throughput  | 72,000 TPS (32 nodes), 34 min | 65,000 TPS (5 nodes), 38 min | 20,000 TPS (5 nodes), ?? min
                 |             | 98,000 TPS (40 nodes), 25 min |                              | (Did not complete)
                 |             |                               |                              | processed ~23MM records
                 | CPU         | 62.4%                         | 31%                          |
                 | Memory      | 83.0%                         | 59%                          |
                 | I/O, Disk 1 | 36%                           | 42%                          |
                 | I/O, Disk 2 | 35%                           | 14%                          |
 Upsert (Writes) | Throughput  | 4,000 TPS                     | 4,000 TPS                    | (Did not complete)
                 | Avg Latency | 3.57 ms                       | 3.2 ms                       | tests failed
                 | CPU         | 0.6%                          | 3.7%                         |
                 | Memory      | 71%                           | 77%                          |
                 | I/O, Disk 1 | 18%                           | 0.3%                         |
                 | I/O, Disk 2 | 17%                           | 2.2%                         |
 Read            | Throughput  | 400 TPS                       | 400 TPS                      | (Did not complete)
                 | Avg Latency | 2.54 ms                       | 3.23 ms                      | tests failed
                 | CPU         | 0.06%                         | 0.02%                        |
                 | Memory      | 62.4%                         | 47%                          |
                 | I/O         | 0%                            | 3.7%                         |

Initial Load: ~148MM Records, same datacenter (availability zone)
Upsert (Writes): ActiveSpaces sync writes vs. Cassandra async
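As a sanity check on the initial-load figures, the short Python sketch below recomputes load time from record count and throughput, and derives a per-node rate. The per-node numbers are derived here and do not appear on the slide, and what counts as a "node" may differ between products, so treat them as a rough comparison only.

```python
RECORDS = 148_000_000  # "~148MM Records" from the initial-load test

# (throughput in TPS, node count) for the runs that completed.
runs = {
    "ActiveSpaces (32 nodes)": (72_000, 32),
    "ActiveSpaces (40 nodes)": (98_000, 40),
    "Cassandra (5 nodes)": (65_000, 5),
}


def load_minutes(records, tps):
    """Minutes needed to load `records` at a steady `tps`."""
    return records / tps / 60


def tps_per_node(tps, nodes):
    """Average throughput contributed by each node."""
    return tps / nodes


for name, (tps, nodes) in runs.items():
    print(f"{name}: {load_minutes(RECORDS, tps):.0f} min, "
          f"{tps_per_node(tps, nodes):,.0f} TPS per node")
```

The recomputed times round to the reported 34, 25, and 38 minutes, so the throughput and duration figures are consistent with each other.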

POC Results - Summary

- DataStax Enterprise (Apache Cassandra™) & ActiveSpaces
  - Very close
- MongoDB
  - Failed tests

YMMV! Your mileage may (will probably) vary
