Developing a fast and scalable application for your fancy new startup is hard. Many factors can slow a website down, such as network latency, web server configuration or large assets, but as any developer who has dealt with high volumes knows, the real bottleneck is usually the database. In recent years a number of NoSQL solutions have come to the rescue, each with its own pros and cons. Apache Cassandra is one of the most widely used and mature "Big Data" NoSQL databases, and is deployed on several projects by tech giants like Twitter, eBay and Netflix thanks to its extremely high throughput, automatic replication and decentralization. In this session I'll talk about how to leverage Apache Cassandra's best features and data modeling best practices so that your web applications can handle huge traffic peaks, using open source tools such as Zend Framework and phpcassa, and I'll describe a large e-commerce project currently running on Cassandra.
@akira28
Scalable PHP web applications with Apache Cassandra
Andrea De Pirro
About me
• Co-founder at Yameveo
• 9+ years developing in PHP
• 2+ years experience with Apache Cassandra
• Zend Framework Certified Engineer
Yameveo
Founded in 2012 in Barcelona, Yameveo is a young, dynamic and international company specialised in e-commerce and web application development.
www.yameveo.com @Yameveo
What we will talk about
• Apache Cassandra
• Data Modeling
• Cassandra & PHP
• Case study
Apache Cassandra
Apache Cassandra is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.
Apache Cassandra documentation
Why Cassandra
• Open Source (enterprise distribution also available)
• Linearly scalable
• Fault-tolerant
• Fully distributed
• Highly performant
• Flexible data model
Cassandra Uses
• Web analytics
• Web applications
• Transaction logging
• Data collection
• …
Architecture
CAP Theorem
A distributed system can guarantee only two of:
1. Consistency: all nodes see the same data at the same time
2. Availability: the guarantee that every request receives a response about whether it was successful or failed
3. Partition Tolerance: the system continues to operate despite message loss or failure of part of the system
Architecture
• Ring
• All nodes are identical peers, each with a unique token
• Intra-ring communication via the "Gossip" protocol
• Tokens range from 0 to 2^127 - 1 (RandomPartitioner)
Partitioning
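To make the ring idea concrete, here is a minimal sketch of how a row key could be mapped to a node. It is a toy model, not Cassandra's implementation: the ring is shrunk to 32 bits so plain PHP integers suffice, and the node names, the tokens and the helper functions toyToken() and ownerOf() are all made up for illustration.

```php
<?php
// Toy illustration only: real Cassandra tokens are far larger; here the
// ring is shrunk to 32 bits so plain PHP integers are enough.

/**
 * Hash a row key onto the toy ring (first 8 hex digits of its MD5).
 */
function toyToken($rowKey)
{
    return hexdec(substr(md5($rowKey), 0, 8));
}

/**
 * Return the node that owns a token: the first node whose token is
 * greater than or equal to it, wrapping around to the first node.
 *
 * @param int   $token
 * @param array $ring  node name => token, sorted by ascending token
 */
function ownerOf($token, array $ring)
{
    foreach ($ring as $node => $nodeToken) {
        if ($token <= $nodeToken) {
            return $node;
        }
    }
    reset($ring);
    return key($ring); // past the last token: wrap around the ring
}

// Four hypothetical nodes with evenly spaced tokens.
$ring = array(
    'node1' => 0x3FFFFFFF,
    'node2' => 0x7FFFFFFF,
    'node3' => 0xBFFFFFFF,
    'node4' => 0xFFFFFFFF,
);

foreach (array('211', '432', '12') as $rowKey) {
    printf("row %s -> %s\n", $rowKey, ownerOf(toyToken($rowKey), $ring));
}
```

Because ownership depends only on the hash of the row key, any node can work out where a row lives without asking a central coordinator.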
Data Modeling
Data Model
• Cluster
• Keyspace
• Column Family
• Super Column
• Composite Columns
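One way to picture the hierarchy above is as nested maps. The sketch below models a keyspace as a plain PHP array; the 'Users' column family, its row keys and its values are hypothetical and only there to show the nesting.

```php
<?php
// Hypothetical keyspace: names and values are made up for illustration.
$keyspace = array(
    // column family => rows
    'Users' => array(
        // row key => columns (name => value); rows need not share columns
        'andrea' => array('email' => 'andrea@example.com', 'city' => 'Barcelona'),
        'luke'   => array('email' => 'luke@example.com'),
    ),
);

// Reading one column means walking column family -> row key -> column name.
echo $keyspace['Users']['andrea']['city'], "\n";

// Rows are sparse: 'luke' simply has no 'city' column.
var_dump(isset($keyspace['Users']['luke']['city']));
```

A super column would add one more level of nesting between the row key and the columns.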
Data Modeling Problems
• No support for joins or subqueries
• Limited support for aggregation
• Ordering is done per-partition
• Ordering is specified at table creation time
Data Modeling Best Practices
• Don’t think of a relational table
• Model column families around query patterns
• De-normalize and duplicate for read performance
• Storing values in column names is perfectly OK
• Leverage wide rows for ordering, grouping, and filtering
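As a rough illustration of the last two points, the sketch below stores the value to sort on (a price) inside the column name of a single wide row. The prices and deal ids are borrowed from the case study later in the talk; the "price|dealId" column layout and the zero padding are assumptions made for this example.

```php
<?php
// One wide row per day; the column names carry the values we want ordered.
// The "price|dealId" layout is an assumption made for this example.
$row = array(); // stands in for the row with key '20140621'
$row[sprintf('%05d|%s', 549, '12')] = '';
$row[sprintf('%05d|%s', 29, '211')] = '';
$row[sprintf('%05d|%s', 19, '432')] = '';

// Cassandra keeps columns sorted by name within a row; ksort() mimics that.
// Zero padding makes string order match numeric order.
ksort($row, SORT_STRING);

// A "slice" of the first two columns gives the two cheapest deals.
$cheapest = array_slice(array_keys($row), 0, 2);
print_r($cheapest);
```

The column values stay empty here because the column name already holds everything the query needs.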
Some Numbers
Cassandra & PHP
Apache Thrift
Thrift is an interface definition language and binary communication protocol that is used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development".
Wikipedia
PhpCassa
• Open Source
• Uses the Thrift protocol
• Compatible with Cassandra 0.7 through 1.2
• Optional C extension for improved performance
https://github.com/thobbs/phpcassa
"require": { "thobbs/phpcassa": "v1.1.0" }
Examples
    use phpcassa\Connection\ConnectionPool;
    use phpcassa\ColumnFamily;
    use phpcassa\SuperColumnFamily;
Opening Connections
    $pool = new ConnectionPool('Keyspace1');
Create a column family object
    $users = new ColumnFamily($pool, 'Standard1');
    $super = new SuperColumnFamily($pool, 'Super1');
Inserting
    $users->insert('key', array('column1' => 'value1', 'column2' => 'value2'));
Querying
    $users->get('key'); // returns an array
    $users->multiget(array('key1', 'key2')); // returns an array of arrays
Removing
    $users->remove('key1'); // removes whole row
    $users->remove('key1', 'column1'); // removes 'column1'
Case Study
Flash Deals website
• 5 Apache servers
• 32 GB of RAM
• 8 CPUs
• 6 Cassandra nodes
• 4+ million visits/month
• 17+ million pages/month
• 600 GB of data
Requirement
• The client wanted a new way to navigate the website: deal attributes
• Millions of deals (hundreds of new and expiring deals every day)
• Dozens of stores and categories
• Performance is key!
How We Solved It
• New deals arrive every day, so queries are based on date and attributes
• Leverage Cassandra wide rows to create indexes
• Use Cassandra multiget whenever possible
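The talk only shows the read side of these indexes. A possible write side, sketched under the assumption that every deal gets one index column per attribute per day, could look like the following. The function buildIndexEntries() is hypothetical; the "attributeId|date" row key and the 'true' marker value follow the AttributesDeals layout used in the case study.

```php
<?php
/**
 * Build the index columns to write for one deal (hypothetical helper).
 *
 * @param string $dealId       row key of the deal in the Deals CF
 * @param array  $attributeIds attribute ids attached to the deal
 * @param string $date         day the deal is live, formatted as Ymd
 * @return array index row key => columns to insert
 */
function buildIndexEntries($dealId, array $attributeIds, $date)
{
    $entries = array();
    foreach ($attributeIds as $attributeId) {
        // Row key is "attributeId|date"; the deal id is the column name.
        $entries[$attributeId . '|' . $date] = array($dealId => 'true');
    }
    return $entries;
}

$entries = buildIndexEntries('211', array(21, 20, 114), '20140621');
print_r($entries);

// With phpcassa, each entry could then be written with something like:
// foreach ($entries as $rowKey => $columns) {
//     $attributesDeals->insert($rowKey, $columns);
// }
```

Duplicating the deal id into several index rows costs extra writes, which is the usual Cassandra trade-off in exchange for cheap reads.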
Deals CF
RowKey  name                price  attributes   …
211     Miyagi Sushi        29     [21,20,114]
432     Mos Eisley Cantina  19     [21,20]
12      iPhone 5 32GB       549    [7]
…       …                   …
Attributes CF
RowKey  name         keyword
21      Restaurants  restaurants
114     Japanese     japanese
20      Barcelona    barcelona
7       Technology   tech
Cities CF
RowKey  name       attributeid  …
1       Madrid     12
8       Barcelona  20
32      Amsterdam  81
Urls CF
RowKey                           attributes  city  …
/restaurants/barcelona           [21]        8
/restaurants/barcelona/japanese  [21,114]    8
/tech                            [7]         -
/restaurants                     [21]        -
…                                …           …
AttributesDeals CF
RowKey        211   432   12    …
21|20140621   true  true  -
114|20140621  true  -     -
20|20140621   true  true  -
7|20140621    -     -     true
…             …     …     …
Code (Controller)
    /**
     * List deals action
     * e.g. /restaurants/barcelona/japanese
     */
    public function dealsAction()
    {
        $path = $this->getUrlPath(); // cleaned query string

        $url = $this->manager->getUrl($path);
        $attributes = Zend_Json::decode($url['attributes']);
        // URLs such as /tech have no city column
        $cityId = isset($url['city']) ? $url['city'] : null;
        $deals = $this->manager->getDeals($attributes, $cityId);
        $this->view->assign('deals', $deals);
        …
    }
Code (Manager)
    /**
     * Retrieves the URL row containing attributes and city info
     *
     * @param string $path
     * @return array $url
     */
    public function getUrl($path)
    {
        $pool = new ConnectionPool('Keyspace');
        $urls = new ColumnFamily($pool, 'Urls');
        $url = array(); // returned as-is if the lookup fails
        try {
            $url = $urls->get($path);
        } catch (Exception $e) {
            …
        }
        return $url;
    }
Code (Manager)
    /**
     * Retrieves the deals matching the given attributes and city
     *
     * @param array $attributes
     * @param int $cityId
     * @return array $deals
     */
    public function getDeals($attributes, $cityId)
    {
        $pool = new ConnectionPool('Keyspace');
        $dealsCF = new ColumnFamily($pool, 'Deals');
        $deals = array(); // returned as-is if the lookup fails
        if (!empty($cityId)) {
            // a city is just another attribute to filter on
            $attributes[] = $this->getAttributeIdByCity($cityId);
        }
        try {
            $dealsIds = $this->getDealsIdsByAttributes($attributes);
            $deals = $dealsCF->multiget($dealsIds);
        } catch (Exception $e) {
            …
        }
        return $deals;
    }
Code (Manager)
    /**
     * Retrieves an array of deal ids given an array of attribute ids
     *
     * @param array $attributes
     * @return array $dealsIds
     */
    protected function getDealsIdsByAttributes($attributes)
    {
        $dealsIds = array();
        $dealsGroups = array();
        $date = date('Ymd');
        $pool = new ConnectionPool('Keyspace'); // same keyspace as in getDeals()
        $attributesDeals = new ColumnFamily($pool, 'AttributesDeals');
        foreach ($attributes as $attributeId) {
            $attributeKey = "$attributeId|$date";
            // the column names of each index row are the deal ids
            $dealsGroups[] = array_keys($attributesDeals->get($attributeKey));
        }
        $countGroups = count($dealsGroups);
        if ($countGroups > 1) {
            // keep only the deals that appear under every attribute
            $dealsIds = call_user_func_array('array_intersect', $dealsGroups);
        } elseif ($countGroups == 1) {
            $dealsIds = reset($dealsGroups);
        }
        return $dealsIds;
    }
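The intersection step can be tried without a cluster. The sketch below replaces the AttributesDeals column family with an in-memory array holding the sample rows from the AttributesDeals CF table; the function name dealsIdsByAttributes() and the early return for a missing index row are assumptions of this example, not the code used in the project.

```php
<?php
// In-memory stand-in for the AttributesDeals CF, using the sample rows.
// Each row key is "attributeId|date"; each column name is a deal id.
$attributesDeals = array(
    '21|20140621'  => array('211' => 'true', '432' => 'true'),
    '114|20140621' => array('211' => 'true'),
    '20|20140621'  => array('211' => 'true', '432' => 'true'),
    '7|20140621'   => array('12' => 'true'),
);

/**
 * Return the ids of the deals that carry every given attribute on a date.
 */
function dealsIdsByAttributes(array $index, array $attributes, $date)
{
    $groups = array();
    foreach ($attributes as $attributeId) {
        $rowKey = $attributeId . '|' . $date;
        // A missing index row means no deal can match all attributes.
        if (!isset($index[$rowKey])) {
            return array();
        }
        $groups[] = array_keys($index[$rowKey]);
    }
    if (count($groups) === 0) {
        return array();
    }
    $ids = array_shift($groups);
    foreach ($groups as $group) {
        $ids = array_intersect($ids, $group);
    }
    return array_values($ids); // re-index so the result is a plain list
}

// /restaurants/barcelona/japanese -> attributes 21 and 114, plus Barcelona (20)
print_r(dealsIdsByAttributes($attributesDeals, array(21, 114, 20), '20140621'));
// /restaurants/barcelona -> attribute 21, plus Barcelona (20)
print_r(dealsIdsByAttributes($attributesDeals, array(21, 20), '20140621'));
```

Each attribute costs one row read, and the resulting ids can then be fetched from the Deals CF with a single multiget.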
Cassandra future (and present)
• New PHP driver wrapping the C++ driver
• Cassandra 2.0
• CQL 3.0
Resources
• www.yameveo.com
• http://planetcassandra.org
• https://github.com/thobbs/phpcassa
• http://www.hakkalabs.co/articles/cassandra-data-modeling-guide
• http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
• http://www.slideshare.net/DataStax/cassandra-community-webinar-introduction-to-apache-cassandra-12
• http://www.geroba.com/cassandra/apache-cassandra-byteorderedpartitioner/
Questions?
Thanks!
joind.in/10865 lanyrd.com/scxyhk
www.yameveo.com
@akira28 @Yameveo
http://bit.ly/andreadepirro