Politecnico di Milano
NoSQL databases
Elisabetta Di [email protected]
30/03/2017
Lecture for the course: Big Data Technologies
Credits to Marco Scavuzzo
What is big data?
Big Data is a collection of very huge data sets with a great diversity [Chen & Zhang 2014]
Big data should be thought of as a process — how to get to new insights, how to turn them into action, resulting in business value [Gartner 2015]
Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization [Gartner 2012]
RDBMS vs NoSQL
New requirements
A new size for data
- In 2009 Google was processing 24 Petabytes per day
- In 2009 Facebook reported storing about 60 million images
- The Internet Archive stores 2 Petabytes of data
Data instances can differ from one another
- Some fields may be missing
- Some other fields may have different types
- Example:
    Videocamera, 10 megapixel, 100$
    Apples, 3 Kg, 4$
Correlation between data is not defined a priori but discovered a posteriori
Data come at high speed
RDBMS: assumptions and benefits
Assumptions
- Well-defined structure for data, known when data is stored in the DBMS
- Data is dense and uniform
- Indexes can be defined a priori and used for queries
- Data stays within a few Gigabytes
Benefits
- It uses the lowest amount of disk space
- It is a well-understood model and query language
- It can support a wide variety of use cases
- It has schema-enforced data consistency
Issues
Assumptions are not valid anymore!
- Well-defined structure for data, known when data is stored in the DBMS
  - Relationships between data are not known in some cases
- Data is dense and uniform
  - Videocamera, 10 megapixel, 100$
  - Apples, 3 Kg, 4$
- Data stays within a few Gigabytes
  - 24 Petabytes per day… 60 million images…
Issues
RDBMS are end-user oriented
- The database takes care of data aggregation based on queries
- The database provides transactional guarantees, schemas, and referential integrity
Today
- Less interest in aggregation managed by databases
  - We want to control aggregation and build parallel computation for this purpose
- Possibility to control integrity and validity of data at the application level
What is NoSQL?
No use of SQL (or of some specific constructs) as query language:
- Manage large volumes of data that do not necessarily follow a fixed schema
- Data is partitioned among different machines and JOIN operations are not usable
ACID guarantees may be relaxed:
- E.g., eventual consistency
- Transactions limited to single data items
Distributed, fault-tolerant architecture
- Data held in a redundant manner on several servers
- Horizontal scalability
NoSQL characteristics
Data Model and CRUD operations
Key-Value
Document-based
Column-based
Graph-based
Distributed management of data and queries
Partitioning
Replication
PACELC
Result
RDBMS focuses
- Schema
- Relations between entities
- Transactions
- Integrity checks
- Rich query language
NoSQL focuses
- Light schema
- Aggregations
- Focus on eventual consistency
- Data partitioning
- Basic Create Read Update Delete (CRUD) operations
An example with relations
Student ID  Student Name
132         Giovanni Rossi
145         Ginevra Bianchi
150         Chiara Bassi

Course ID  Course Name  Instructor
123        SE           EDN
134        DB 1         LT
167        Math         GL

Student ID  Course ID  Date        Score
132         123        10/06/2013  25
132         134        11/06/2013  26
145         123        10/06/2013  30
NoSQL databases and data denormalization (1) – key-value approach
Key         Value
132course1  Giovanni Rossi, 123, SE, EDN, 10/06/2013, 25
132course2  Giovanni Rossi, 134, DB1, LT, 11/06/2013, 26
145course1  Ginevra Bianchi, 123, SE, EDN, 10/06/2013, 30
150         Chiara Bassi
NoSQL databases and data denormalization (2) – document-based approach
132
Student: {name: 'Giovanni Rossi', id: 132, Exams: [
  {id: 123, name: 'SE', instructor: 'EDN', date: '10/06/2013', score: 25},
  {id: 134, name: 'DB1', instructor: 'LT', date: '11/06/2013', score: 26}]}
145
Student: {name: 'Ginevra Bianchi', id: 145, Exams: [
  {id: 123, name: 'SE', instructor: 'EDN', date: '10/06/2013', score: 30}]}
150
Student: {name: 'Chiara Bassi', id: 150}
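The step from the three relations to one document per student can be sketched in Python as a join performed once, at write time. This is only an illustrative sketch: the variable names and the `build_student_documents` helper are assumptions made for the example, not part of the lecture material.

```python
# Relational tables from the example, as plain Python structures.
students = [(132, "Giovanni Rossi"), (145, "Ginevra Bianchi"), (150, "Chiara Bassi")]
courses = {123: ("SE", "EDN"), 134: ("DB1", "LT"), 167: ("Math", "GL")}
exams = [
    (132, 123, "10/06/2013", 25),
    (132, 134, "11/06/2013", 26),
    (145, 123, "10/06/2013", 30),
]


def build_student_documents(students, courses, exams):
    """Denormalize: embed each student's exams (with course data) in one document."""
    docs = {}
    for sid, name in students:
        docs[sid] = {"Student": {"name": name, "id": sid}}
    for sid, cid, date, score in exams:
        cname, instructor = courses[cid]
        exam = {"id": cid, "name": cname, "instructor": instructor,
                "date": date, "score": score}
        # The Exams array is created only when the first exam is found.
        docs[sid]["Student"].setdefault("Exams", []).append(exam)
    return docs


docs = build_student_documents(students, courses, exams)
print(docs[150])  # a student with no exams carries no "Exams" field
```

Course data (name, instructor) is copied into every exam entry, which is the duplication that denormalization accepts in exchange for reading a whole student career with a single lookup.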
NoSQL databases and data denormalization (3) – column-based approach
          | StudentData            | ExamData
Row Key   | S ID  Student Name     | C ID  C Name  Instructor  Date        Score
CBNN      | 150   Chiara Bassi     |
GBSE      | 145   Ginevra Bianchi  | 123   SE      EDN         10/06/2013  30
GRDB      | 132   Giovanni Rossi   | 134   DB 1    LT          11/06/2013  26
GRSE      | 132   Giovanni Rossi   | 123   SE      EDN         10/06/2013  25
StudentData and ExamData are column families
Data are ordered by row key
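NoSQL databases and data denormalization (4) – graph-based approach
[Figure: graph with student nodes GR, GB, CB, course nodes SE, DB1 and instructor nodes EDN, LT]
Edges:
- GR -[Passed, 25, 10/06/2013]-> SE
- GR -[Passed, 26, 11/06/2013]-> DB1
- GB -[Passed, 30, 10/06/2013]-> SE
- SE -[Taught by]-> EDN
- DB1 -[Taught by]-> LT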
Key Value CRUD operations
Query operations are limited to
- put(key, value)
- get(key)
- delete(key)
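A minimal in-memory sketch of this interface, in Python, shows how little the store knows about its values. The `KeyValueStore` class is a toy written for illustration and does not mirror the API of any specific product.

```python
class KeyValueStore:
    """Toy in-memory key-value store exposing only put/get/delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Create or overwrite: the value is opaque to the store.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("132course1", "Giovanni Rossi, 123, SE, EDN, 10/06/2013, 25")
store.put("150", "Chiara Bassi")
print(store.get("132course1"))
```

Because the store cannot look inside a value, a question such as "who passed SE?" has to be answered by the application, which must fetch and parse the values itself.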
Document-based CRUD operations (MongoDB)
db.collection.find(<query filter>, <projection>)
- <query filter> -> {<field1>: <value1>, …}
- <projection> -> {<field1>: 1, <field2>: 1} includes both field1 and field2 in the result set
- Writing {<field1>: 0, <field2>: 0} instead excludes field1 and field2 from the result set
Examples
  db.newDB.find()
  db.newDB.find({"Student.name": "Giovanni Rossi"})
  db.newDB.find({"Student.Exams.name": "SE"})
  db.newDB.find({"Student.Exams.name": "SE"}, {"Student.Exams.instructor": 0})
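To make the meaning of a dotted field in a query filter concrete, here is a pure-Python emulation of equality filters only. It is a rough sketch for teaching purposes, not how MongoDB is implemented; `values_at`, `find` and `new_db` are names invented for the example, and projections are left out.

```python
def values_at(node, parts):
    """Yield every value reachable by following the path parts, fanning out over lists."""
    if not parts:
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from values_at(item, parts)
    elif isinstance(node, dict) and parts[0] in node:
        yield from values_at(node[parts[0]], parts[1:])


def find(collection, query=None):
    """Return the documents in which every dotted field of the filter equals its value."""
    query = query or {}
    return [
        doc for doc in collection
        if all(want in values_at(doc, path.split(".")) for path, want in query.items())
    ]


new_db = [
    {"Student": {"name": "Giovanni Rossi", "id": 132, "Exams": [
        {"id": 123, "name": "SE", "instructor": "EDN", "score": 25},
        {"id": 134, "name": "DB1", "instructor": "LT", "score": 26}]}},
    {"Student": {"name": "Ginevra Bianchi", "id": 145, "Exams": [
        {"id": 123, "name": "SE", "instructor": "EDN", "score": 30}]}},
    {"Student": {"name": "Chiara Bassi", "id": 150}},
]

for doc in find(new_db, {"Student.Exams.name": "SE"}):
    print(doc["Student"]["name"])
```

The path "Student.Exams.name" matches a document as soon as any element of the Exams array has that name, which is why the whole student document is returned and not just the matching exam.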
Indexes
Support efficient execution of queries
- Without indexes a whole collection needs to be scanned
- With indexes the search can be limited to a subset of documents
- The index stores the value of a specific field or set of fields, ordered by the value of the field
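The difference between a full scan and an index lookup can be illustrated with a sorted list and binary search. This is a simplified sketch in Python (real databases typically use B-tree-like structures); all names are made up for the example.

```python
import bisect

docs = [
    {"id": 132, "name": "Giovanni Rossi"},
    {"id": 145, "name": "Ginevra Bianchi"},
    {"id": 150, "name": "Chiara Bassi"},
]


def full_scan(docs, name):
    """Without an index: examine every document."""
    return [d for d in docs if d["name"] == name]


# The index: (field value, position of the document), ordered by field value.
name_index = sorted((d["name"], pos) for pos, d in enumerate(docs))


def index_lookup(docs, index, name):
    """With an index: binary search, then read only the matching documents."""
    lo = bisect.bisect_left(index, (name,))
    hits = []
    while lo < len(index) and index[lo][0] == name:
        hits.append(docs[index[lo][1]])
        lo += 1
    return hits


print(index_lookup(docs, name_index, "Ginevra Bianchi"))
```

Both functions return the same documents; the indexed version touches a number of index entries that grows logarithmically with the collection size, at the price of keeping the index up to date on every write.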
MongoDB - examples of indexes
Create Index on any field in the document
  // 1 means ascending, -1 means descending
  db.newDB.createIndex({"Student.name": 1})
  db.newDB.createIndex({"Student.Exams.name": -1})
Issue
What if we want to look for all students attending a certain course?
  db.newDB.find({"Student.Exams.name": "SE"})
… but … we have to retrieve all Student documents (the aggregates containing the whole career of each student)
Do we have another option?
- Create aggregates by Course, not by Student
- Adopt a column-based approach
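The first option, aggregating by Course instead of by Student, can be sketched as a regrouping of the same data. The Python below is an illustration under assumed names (`student_docs`, `aggregates_by_course`); the shape of the course document is one possible choice, not one prescribed by the slides.

```python
student_docs = [
    {"Student": {"name": "Giovanni Rossi", "id": 132, "Exams": [
        {"id": 123, "name": "SE", "instructor": "EDN", "date": "10/06/2013", "score": 25},
        {"id": 134, "name": "DB1", "instructor": "LT", "date": "11/06/2013", "score": 26}]}},
    {"Student": {"name": "Ginevra Bianchi", "id": 145, "Exams": [
        {"id": 123, "name": "SE", "instructor": "EDN", "date": "10/06/2013", "score": 30}]}},
    {"Student": {"name": "Chiara Bassi", "id": 150}},
]


def aggregates_by_course(student_docs):
    """Re-aggregate the same data around courses instead of students."""
    by_course = {}
    for doc in student_docs:
        student = doc["Student"]
        for exam in student.get("Exams", []):
            course = by_course.setdefault(exam["name"], {
                "id": exam["id"], "name": exam["name"],
                "instructor": exam["instructor"], "Students": [],
            })
            course["Students"].append({
                "id": student["id"], "name": student["name"],
                "date": exam["date"], "score": exam["score"],
            })
    return by_course


course_docs = aggregates_by_course(student_docs)
print([s["name"] for s in course_docs["SE"]["Students"]])
```

With this layout "all students of a course" is a single document read, while "the whole career of a student" now requires touching several course documents: the aggregate boundary decides which query is cheap.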
Column-based CRUD operations (HBase)
  create 'Students', 'StudentData', 'ExamData'
  put 'Students', 'CBNN', 'StudentData:S ID', '150'
  put 'Students', 'CBNN', 'StudentData:Student Name', 'Chiara Bassi'
  put 'Students', 'GBSE', 'StudentData:S ID', '145'
  put 'Students', 'GBSE', 'StudentData:Student Name', 'Ginevra Bianchi'
  put 'Students', 'GBSE', 'ExamData:Score', '30'
  …
Column-based CRUD operations (HBase)
  get 'Students', 'GBSE'
  -> you get all data concerning key GBSE
  scan 'Students', {COLUMNS => ['StudentData:Student Name', 'ExamData:Score']}
  -> you get all data in the table concerning columns Student Name and Score
  scan 'Students', {STARTROW => 'GR'}
  -> the scan starts at the first row key >= 'GR' and runs to the end of the table; here this returns exactly the rows whose key starts with 'GR' (all exams of Giovanni Rossi), because no other row key sorts after them
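The behaviour of a scan over row keys kept in sorted order can be emulated with binary search. The Python sketch below is not HBase code: `scan` and `prefix_scan` are hypothetical helpers that only mimic the idea of a start row and an exclusive stop row.

```python
import bisect

# Rows of the Students table, indexed by row key (only two columns shown).
rows = {
    "CBNN": {"StudentData:Student Name": "Chiara Bassi"},
    "GBSE": {"StudentData:Student Name": "Ginevra Bianchi", "ExamData:Score": "30"},
    "GRDB": {"StudentData:Student Name": "Giovanni Rossi", "ExamData:Score": "26"},
    "GRSE": {"StudentData:Student Name": "Giovanni Rossi", "ExamData:Score": "25"},
}
sorted_keys = sorted(rows)  # the store keeps rows ordered by key


def scan(start_row="", stop_row=None):
    """Row keys in [start_row, stop_row); no stop row means 'to the end of the table'."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if stop_row is None else bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


def prefix_scan(prefix):
    """Rows whose key starts with a non-empty prefix: stop just past the last such key."""
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return scan(prefix, stop)


print(scan("GR"))         # start row only
print(scan("GB"))         # also returns the GR rows that sort after GB
print(prefix_scan("GB"))  # bounded scan: only keys starting with GB
```

A start row alone is enough for the 'GR' example only because those keys happen to be the last ones in the table; for any other prefix the scan needs an upper bound as well.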
Row-key design considerations in HBase
Having an efficient HBase system depends on how the row key is chosen
Depending on the data access pattern, you will need to design your key accordingly in order to achieve better performance (per table)
- Write intensive: random row keys
- Read intensive: sequential row keys
- Ex.: App. for time series analysis -> sequential row keys
NoSQL characteristics
Data Model and CRUD operations
Key-Value
Document-based
Column-based
Graph-based
Distributed management of data and queries
Partitioning
Replication
PACELC
Data partitioning
Distributed RDBMS are often based on a shared-disk architecture
- Limited scalability
NoSQL systems focus on a shared-nothing architecture
Partitioning
Adopted when
- data exceeds the capacity of a single machine
- traffic grows -> data accesses need to be load-balanced
Vertical Partitioning
Horizontal Partitioning, also called Sharding
Partitioning and key-values
Data are independent from each other -> easy to distribute them
Key         Value
132course1  Giovanni Rossi, 123, SE, EDN, 10/06/2013, 25
132course2  Giovanni Rossi, 134, DB1, LT, 11/06/2013, 26
145course1  Ginevra Bianchi, 123, SE, EDN, 10/06/2013, 30
150         Chiara Bassi
How to shard?
How to shard? (cont)
On nodes joining or leaving, data have to be redistributed -> inefficient
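The cost of redistribution can be estimated with a small experiment. The Python sketch below assumes the scheme in question is "hash of the key modulo the number of nodes" and compares it with consistent hashing, a well-known remedy covered by the first link in the bibliography; node names, key names and the number of virtual nodes are arbitrary choices for the example.

```python
import bisect
import hashlib


def h(value):
    """Stable hash (Python's built-in hash() of strings is randomized per process)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def mod_owner(key, n_nodes):
    """Naive placement: hash of the key modulo the number of nodes."""
    return h(key) % n_nodes


class Ring:
    """Consistent hashing: a key belongs to the next node point clockwise on the ring."""

    def __init__(self, nodes, vnodes=100):
        self._points = sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self._hashes = [p for p, _ in self._points]

    def owner(self, key):
        i = bisect.bisect(self._hashes, h(key)) % len(self._points)
        return self._points[i][1]


keys = [f"key{i}" for i in range(10000)]

# Going from 4 to 5 nodes with hash mod N.
moved_mod = sum(mod_owner(k, 4) != mod_owner(k, 5) for k in keys) / len(keys)

# Same change with consistent hashing.
before = Ring(["n0", "n1", "n2", "n3"])
after = Ring(["n0", "n1", "n2", "n3", "n4"])
moved_ring = sum(before.owner(k) != after.owner(k) for k in keys) / len(keys)

print(f"hash mod N: {moved_mod:.0%} of keys move; consistent hashing: {moved_ring:.0%}")
```

With modulo placement most keys change owner when a node joins, whereas on the ring only the keys taken over by the new node move, roughly one fifth of them in this setup.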
How to shard? (cont)
Sharding approaches
Hash-sharding
- See the previous slides
- No need for a coordinator, all nodes can compute the distribution
- Can support only get operations, no scans
Range-sharding
- Data are grouped based on value ranges and distributed accordingly
- A coordinator needs to manage the assignments and redistribution
- Scans within a certain value range can be local to a partition
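Range-sharding can be illustrated with a sorted list of split points. In this Python sketch the split points "G" and "N" are hypothetical values that a coordinator would maintain; the helper names are invented for the example.

```python
import bisect

# Hypothetical split points: partition 0 holds keys < "G",
# partition 1 holds keys in ["G", "N"), partition 2 holds the rest.
split_points = ["G", "N"]


def partition_of(key):
    """Partition that stores a given key."""
    return bisect.bisect_right(split_points, key)


def partitions_for_range(start, stop):
    """Partitions that a scan over [start, stop) has to touch."""
    first = partition_of(start)
    last = bisect.bisect_left(split_points, stop)
    return list(range(first, last + 1))


print(partition_of("CBNN"), partition_of("GBSE"), partition_of("GRSE"))
print(partitions_for_range("GR", "GS"))  # all exams of Giovanni Rossi
print(partitions_for_range("A", "Z"))    # a scan over the whole key space
```

A scan over a narrow key range is served by a single partition, which hash-sharding cannot offer because it scatters neighbouring keys across nodes.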
Sharding approaches
Entity-group sharding
- Its goal is to enable single-partition transactions on co-located data
- Entity groups are explicitly defined by the application or derived by analysing transactions
Sharding and HBase
          | StudentData            | ExamData
Row Key   | S ID  Student Name     | C ID  C Name  Instructor  Date        Score
CBNN      | 150   Chiara Bassi     |
GBSE      | 145   Ginevra Bianchi  | 123   SE      EDN         10/06/2013  30
GRDB      | 132   Giovanni Rossi   | 134   DB 1    LT          11/06/2013  26
GRSE      | 132   Giovanni Rossi   | 123   SE      EDN         10/06/2013  25
StudentData and ExamData are column families
Data are ordered by row key
Sharding and HBase – physical view on data organization
StudentData column family
Row key  Column key                Timestamp      Cell value
CBNN     StudentData:S ID          1273516197868  150
CBNN     StudentData:Student Name  1273516197865  Chiara Bassi
GBSE     StudentData:S ID          1073516197865  145
GBSE     StudentData:Student Name  1273516197886  Ginevra Bianchi
…        …                         …              …

ExamData column family
Row key  Column key     Timestamp      Cell value
GBSE     ExamData:C ID  1373516197849  123
…        …              …              …
Sharding and HBase
[Figure: Logical Architecture. Distributed, persistent partitions of a BigTable. Source: © Hortonworks Inc. 2011, "Architecting the Future of Big Data", page 9]
- Table A (rows a to p) is split into Region 1, Region 2, Region 3 and Region 4
- Region Server 7 hosts: Table A, Region 1; Table A, Region 2; Table G, Region 1070; Table L, Region 25
- Region Server 86 hosts: Table A, Region 3; Table C, Region 30; Table F, Region 160; Table F, Region 776
- Region Server 367 hosts: Table A, Region 4; Table C, Region 17; Table E, Region 52; Table P, Region 1116
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
HBase Architecture
HBase physical view
HBase creates a separate HFile for each Column Family
- Queries for a single row that span multiple CFs require HBase to reconstruct the row from multiple HFiles
- Queries requesting multiple rows and a single Column Family perform much better
Keys need to be chosen with care
- They are the basis for defining regions
- If you use timestamp-based keys, all data in a certain time interval will be in the same region
- This may overload that region in case of batch queries or store operations
Replication
Adopted to:
- increase fault-tolerance (mitigate disasters)
- load balance traffic occurring on the same data subset (minimize latency)
Data replication approaches:
1. Inter data centers (multiple geographic zones)
   a. Active-Passive (one passive data center used just for backups and reads)
   b. Active-Active (both data centers accept reads and writes)
2. Intra data center
Replication
Non-linearizable write operations - example
Replication strategies
- Data updates sent to all replicas at the same time
- Data updates sent to a master first
- Data updates sent to an arbitrary location first
Data updates sent to all replicas at the same time
Assuming concurrent updates
- Replicas may choose different update orders => potential inconsistency
- Consensus protocol in place => increase of latency
Data updates sent to a master first
Replication is synchronous
- => Increased latency, which will depend on the slowest node
Replication is asynchronous
- If reads are allowed from all nodes => inconsistency can occur (e.g., PNUTS)
- If reads are allowed only at the master node => no inconsistency, latency increases
A combination of synchronous and asynchronous
- E.g., DynamoDB, Cassandra, Riak
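The latency side of these three options can be captured by a very simplified model. The Python sketch below assumes that the master contacts all replicas in parallel and that the combined strategy means "wait for k acknowledgements, propagate to the rest in the background"; that reading, the function name and the timings are assumptions made for illustration.

```python
def write_latency(master_ms, replica_ms, acks_required):
    """Latency seen by the client when the master waits for `acks_required` replica acks.

    acks_required = 0                -> asynchronous replication
    acks_required = len(replica_ms)  -> fully synchronous replication
    anything in between              -> a combination of the two
    """
    if acks_required == 0:
        return master_ms
    # Replicas are contacted in parallel, so the k-th fastest ack decides.
    return master_ms + sorted(replica_ms)[acks_required - 1]


replica_ms = [5, 12, 90]  # made-up round-trip times to three replicas, in milliseconds

print("asynchronous:", write_latency(2, replica_ms, 0))
print("synchronous: ", write_latency(2, replica_ms, 3))
print("combination: ", write_latency(2, replica_ms, 2))
```

The synchronous case is dominated by the slowest replica, while waiting for only some acknowledgements keeps the slow replica off the critical path at the cost of a window in which it serves stale data.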
Data updates sent to an arbitrary location first
Similar to master-slave
- If synchronous replication => latency can be further increased in case of simultaneous updates
PACELC (Abadi 2012)
PACELC (pronounced "pass-elk") = if the network is Partitioned, then trade off Availability and Consistency; Else, trade off Latency and Consistency
CAP theorem for abnormal cases + LC trade-offs for normal operation
DBMS and CAP theorem (Brewer)
Three possible guarantees
- Consistency: all nodes see the same data at the same time
- Availability: every request receives a response about whether it was successful or failed
- Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system
CAP theorem: it is impossible for a distributed computer system to simultaneously provide all three guarantees (proven by Gilbert and Lynch, 2002)
[Figure: the three CAP guarantees: Consistency, Availability, Partition tolerance]
AP (Availability & Partition tolerance) = CAP-Availability system example
CP (Consistency & Partition tolerance) = CAP-Consistency system example
CAP Theorem in summary
When a network partition occurs
- Preserve CAP-Availability: the app keeps writing to the database -> data not replicated -> not CAP-Consistent
- Preserve CAP-Consistency, either by:
  - rejecting operations on any replica, or
  - accepting ops. on R1 and stopping ops. on R2 (or vice versa) until the partition is resolved and replicas are in sync (who decides which replica can accept requests?)
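The two behaviours can be played out on a toy cluster of two replicas. The Python sketch below is a deliberately naive model: the `Cluster` class, the "AP"/"CP" mode labels and the static choice of R1 as the replica that keeps accepting operations are all assumptions made for the example (the slide leaves open who takes that decision).

```python
class Replica:
    def __init__(self):
        self.data = {}


class Cluster:
    """Two replicas, R1 and R2; mode is 'AP' or 'CP' (labels chosen for this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.r1, self.r2 = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        other = self.r2 if replica is self.r1 else self.r1
        if not self.partitioned:
            replica.data[key] = value
            other.data[key] = value  # replicated right away
            return True
        if self.mode == "AP":
            replica.data[key] = value  # accepted, but the other replica cannot see it
            return True
        if replica is self.r1:  # CP: only R1 keeps accepting operations
            replica.data[key] = value
            return True
        return False  # CP: R2 rejects the write

    def read(self, replica, key):
        if self.partitioned and self.mode == "CP" and replica is self.r2:
            raise RuntimeError("R2 is unavailable until the partition is resolved")
        return replica.data.get(key)


ap = Cluster("AP")
ap.partitioned = True
ap.write(ap.r2, "x", 1)
print(ap.read(ap.r1, "x"), ap.read(ap.r2, "x"))  # the two replicas disagree

cp = Cluster("CP")
cp.partitioned = True
print(cp.write(cp.r2, "x", 1), cp.write(cp.r1, "x", 1))  # R2 refuses, R1 accepts
```

In the AP cluster every request is answered but the replicas diverge; in the CP cluster clients never observe divergent data because R2 stops answering until it can catch up with R1.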
Back to the non-linearizable example
Here we are focusing on reducing latency
Other scenarios are possible
PACELC exemplified
PA/EL: upon Partitions, favors Availability; Else, Latency
- DynamoDB, Cassandra, Riak
PC/EC: upon Partitions, favors Consistency; Else, Consistency
- HBase, VoltDB
PA/EC: upon Partitions, favors Availability; Else, Consistency
- MongoDB
PC/EL: upon Partitions, favors Consistency; Else, Latency
- PNUTS
Scalability vs expressiveness
[Figure: database families positioned along two axes, Expressiveness and Scalability: In-Memory Key-Value, Key-Value, Column-Based, Document-based, Graph DBs, RDBMS]
Annotations in the figure:
- Optimized query. Projections, sorting, dynamic queries, indexes, transactions, triggers …
- ACID, lock, 2PC, low scalability
- Horizontal and vertical partitioning, auto reconciling, P2P
- Sharding, load balancer, strict consistency on request, lock-free transactions
Database as a Service (DaaS)
Database as a Service (DaaS)
Relational Databases: Google Cloud SQL, Amazon RDS
- Cross data-center replication
NoSQL
- Google Datastore
  - Lightweight Transactions (2PC), Secondary Indexes
- Azure Tables, Amazon DynamoDB, Google BigTable
- Azure DocumentDB
Different SLA levels, manageable trade-offs between consistency, latency and availability
Database as a Service (DaaS)
Automated scaling
Fault tolerance
- Nodes and replicas are managed by the cloud operator
- Automated failure recovery
Low maintenance
- Managed updates at different layers (bare metal, software patches, etc.)
Geographic distribution
- To increase durability and decrease latency
Accessibility (always on)
Lack of well defined SLAs for cloud services
Our experience with DaaS
- Throughput for read and write operations?
- How many parallel reads/writes?
- What is the incoming and outgoing bandwidth?
- What is the meaning of errors notified by the DaaS?
In most cases there is limited documentation
Our experience
TABLE I: Migrations preserving eventual consistency

                                  From GAE Datastore to Azure Tables    From Azure Tables to GAE Datastore
                                  dataset #a  dataset #b  dataset #c    dataset #a
Source size (MB)                  16          64          512           -
# of Entities                     36940       147758      1182062       36940
Migration time (s)                1098        4270        34111         13101
Entities throughput (ent/s)       33.643      34.604      34.653        2.820
Queued data (MB)                  81.98       336.73      2709.80       93.50
Extraction & Conversion time (s)  31          120         768           24
Queued data throughput (KB/s)     2707.985    2873.446    3613.067      3989.513
Avg. %CPU usage                   4.749       3.947       4.111         0.605
Moreover, it is exposed as a web service and can be easily used to configure and control a migration process.

In [20] and [23], we conducted several tests to measure the performance of our initial implementation. In particular, we migrated data generated by a proof-of-concept application called Meeting in the Cloud (MiC) [36] (see footnote 2). MiC was built to be deployed and to exploit the services of two clouds, Azure and Google App Engine (GAE) (see footnote 3). More specifically, when it is deployed on GAE it uses GAE Datastore and supports reads according to an eventual consistency semantics. Instead, when it is deployed on Azure, it uses Azure Tables and supports reads according to a strong consistency semantics. The adoption of two different semantics in the use of the two DaaS therefore allowed us to experiment with data migration in both conditions.

Data are stored by the MiC application in the form of entities (both in GAE Datastore and in Azure Tables; an entity is similar to the concept of row in relational databases) whose average size is 754 bytes. In Tables I and II we report the results obtained when migrating these entities among these two DaaS. Multiple data sets have been considered and tests were performed three times (tables report average figures). Source size indicates the size in Megabytes of the entities we are migrating in that particular test (whose number is denoted with # of Entities); Migration time reports the time, in seconds, needed to complete that particular migration; Entities throughput shows the ratio between the number of migrated entities and the migration time; Queued data indicates the total size of the entities (in the meta-model representation format) that have been stored inside the queue during the whole migration process; Extraction and conversion time shows the time needed to extract all the entities from the source DaaS, convert them into the meta-model representation and store them in the queue; Queued data throughput is the ratio of the two previous rows; finally, Avg. %CPU usage reports the average CPU percentage measured. In order to conduct these tests, we deployed the migration system on a Google Compute Engine (an IaaS platform) Virtual Machine (VM), physically hosted by the Google data-center in Western-Europe (WE), with two virtual CPUs and 7.5GB of RAM, running a Linux-based operating system.
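The two derived metrics can be recomputed from the raw figures of Table I. The short Python check below uses dataset #a (GAE Datastore to Azure Tables); the function names are invented for the example and the conversion 1 MB = 1024 KB is an assumption, chosen because it matches the reported values.

```python
def entities_throughput(num_entities, migration_time_s):
    """Entities throughput: migrated entities divided by migration time (ent/s)."""
    return num_entities / migration_time_s


def queued_data_throughput_kb_s(queued_mb, extraction_time_s):
    """Queued data throughput: queued data over extraction and conversion time (KB/s)."""
    return queued_mb * 1024 / extraction_time_s


# dataset #a, GAE Datastore -> Azure Tables (Table I)
print(round(entities_throughput(36940, 1098), 3))
print(round(queued_data_throughput_kb_s(81.98, 31), 3))
```

Under these assumptions the results agree with the 33.643 ent/s and 2707.985 KB/s reported in the table.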
Test results, reported in the left part of Table I, showed that we were able to migrate entities from GAE Datastore towards Azure Tables, preserving eventual consistency, at an average rate of 34 entities/s, independently from the dataset size. In order to verify the reverse migration and to check the correctness of our migration process, we transferred data, already migrated, back to GAE Datastore. During these experiments we experienced several undocumented errors thrown by the DaaS when migrating more than 70,000 entities towards GAE Datastore, under various working conditions, and with different data sets. For this reason, we report here only the experiment concerning 36,940 entities (see the right part of Table I). In this case we checked that the migration was conservative by verifying that the number of migrated entities was the same and that their content was identical to the one we migrated in the first test. The migration time, however, was increased by a factor of 11.93. This is because entities managed according to an eventual consistency policy and, thus, belonging to different Partition Groups in the intermediate format, are required to have different Ancestors in Datastore. Thus, every write requires not only the creation of the entity in the database but also the creation of a corresponding ancestor (this corresponds to two write operations plus one read needed to check if that ancestor is not already existing).

Footnote 2: GitHub repository - https://github.com/deib-polimi/mic-backend
Footnote 3: Recently, it has been extended to be deployed also on other PaaS level clouds.
Table II reports the results achieved when migrating entities from Azure Tables to GAE Datastore, and vice versa, preserving strong consistency. As data to be handled according to the strong consistency semantics in Datastore have to be connected to the same ancestor, the migration time towards GAE Datastore includes not only the write operations but also all reads needed to retrieve the ancestor to be used for grouping the entities. As regards data migration back into Azure Tables, the results we obtained, in terms of throughput, are similar to the respective ones, when preserving eventual consistency, since no computationally intensive operation is performed to maintain strong consistency in Azure Tables (it is just a matter of setting the same Partition Key when translating entities contained in the same meta-model Partition Group). Of course, also in this case we verified that the migration was conservative.

In both scenarios, the use of CPU by Hegira4Cloud was always negligible. However, the limited throughput we obtained in all cases proved that this initial prototype is not suitable to manage Big Data problems.

TABLE II: Migrations preserving strong consistency

                                    From Azure Tables to GAE Datastore    From GAE Datastore to Azure Tables
                                    dataset #d  dataset #e  dataset #f    dataset #d  dataset #e  dataset #f
# of Entities                       9235        36940       55410         9235        36940       55410
Migration time (s)                  1402        5340        8599          275         1098        1645
Entities throughput (ent/s)         6.587       6.918       6.444         33.581      33.643      33.684
Queued data (MB)                    22.40       89.61       134.41        22.97       93.50       140.33
Extraction and Conversion time (s)  10          30          41            8           31          45
Queued data throughput (KB/s)       2294.067    3058.627    3357.047      2940.16     3088.51     3193.29
Avg. %CPU usage                     1.509       1.139       0.957         4.932       4.746       4.564

VI. IMPROVING HEGIRA4CLOUD THROUGH ACTION RESEARCH

Given the lack of tools for design-level analysis and optimization of our framework, we decided to address the problem of improving the performance of Hegira4Cloud through action research. More specifically, we decided to explore the solution space through the development of various experiments and associated prototypes. Given the inexplicable and undocumented behavior we experienced while writing on GAE Datastore, we decided to perform our experiments referring to the migration from Datastore to Azure Tables and not vice versa. In our space exploration approach we went through the steps presented in the following subsections.
A. Testing the read and write DaaS limits

The read and write throughput from/to the DaaS under consideration can certainly be considered as an upper bound for any migration system. For this reason, our investigation started with an analysis of such throughput. We could not find any information on this in the documentation offered by GAE Datastore and Azure Tables. In [37] authors measured Azure Tables performance by issuing CRUD (i.e. create, read, update, and delete) operations, but since those tests were performed five years ago, they do not necessarily reflect the current situation. Thus, we decided to perform our own experiments.

TABLE III: Azure Tables entities throughput.

EAS       ETS (MB)  #threads  Tt (s)  X (ent/s)  X (KB/s)
754 Byte  106       10        217.3   680.0      499.5
754 Byte  106       20        121.3   1218.1     894.8
754 Byte  106       40        102.3   1444.4     1061.0
754 Byte  106       60        122.0   1211.1     889.7
754 Byte  106       192       127.3   1160.7     852.7
1KB       152       10        230.0   642.4      676.7
1KB       152       20        156.0   947.2      997.7
1KB       152       40        129.0   1145.4     1206.6
4KB       580       10        244.0   605.6      2434.1
4KB       580       20        148.0   998.4      4013.0
4KB       580       40        139.0   1063.0     4272.8
4KB       580       60        120.0   1231.3     4949.3
4KB       580       192       116.6   1267.2     5093.7
As regards the read throughput from the source DaaS, we can safely say that it is not a bottleneck for the migration system, since, based on the preliminary results we obtained in Tables I and II, Hegira4Cloud is able to read and process data at a throughput up to 1,539 entities/s (given by the ratio between the number of migrated entities and the extraction and conversion time). For this reason, we do not go further in testing the read limits.

Instead, for what concerns the write throughput towards Azure Tables, we developed a tool which allows us to measure the maximum number of parallel writes an application is able to make in the same Azure Table account. In the experiments, we considered entities having different average size (754 Byte, 1KB and 4KB), trying to understand how entity size impacts Azure Tables write speed. For the tests we used an Azure VM with the following configuration: Ubuntu Server 12.04, located in Microsoft WE data-center, with 4 CPU and 7 GB of RAM. Moreover, we used the log4j library to compute test durations and a custom library to measure the size of objects stored in the queue and the size of Azure Tables entities [38]. Table III summarizes the test results. For each run we transferred a constant number of entities (147,758 entities) and we varied the following parameters:
• The entity average size EAS, as well as the entities total size ETS.
• The number of consumer threads #threads writing in parallel to Azure Tables.
As a result of the tests we collected the following output performance metrics:
1) The overall time Tt required for writing entities in Azure Tables.
2) The maximum throughput X achieved for data transfer. We report both entities/s and KB/s values.
Each of the values in the table is the average obtained from three runs with the same inputs.

We found that Azure Tables maximum write throughput varied between 600 and 1,444 entities/s, depending on the number of threads writing in parallel on it. While it seems we reached a limit in terms of writable entities per second, the throughput measured in KB/s kept increasing with the entities average size, proving that the database limits the number of operations per second and not the throughput in terms of bytes per second. These values, if compared with the results we obtained with our preliminary version of Hegira4Cloud in Section V (up to 34.6 ent/s), show that the target DaaS is not the system bottleneck and the limited throughput we obtained with Hegira4Cloud is due to some other cause. Hence, we conducted further tests.
B. Checking if the bandwidth is a system bottleneck

The next step towards the identification of the bottleneck consists in measuring the maximum network bandwidth available for migration.

For these tests, we deployed two VMs on Azure and one on Google Compute Engine. More specifically, the Azure VMs had the following configuration: Ubuntu Server 12.04, located in the same Microsoft WE data center (not in the same virtual network), with 4 CPU cores and 7 GB RAM. The Google VM had the following characteristics: Debian 7, hosted in Google WE data-center, with 2 virtual CPUs and 7.5GB of RAM.

On each of these VMs we installed the iperf tool [39], able to measure the maximum bandwidth on IP networks among
Polyglot persistence
Polyglot Persistence
In recent years the idea of polyglot persistence has emerged and become popular
Use the appropriate data model for different parts of the persistence layer
- Relational databases for structured, tabular data
- Document databases for un/semi-structured data
- Key-Value databases for hash tables
- Graph databases for highly linked referential data
This brings data consistency and duplication issues
Polyglot Persistence
[Figure: example of a polyglot persistent application; each part of the application is annotated with its storage needs]
- Rapid access for reads and writes. No need to be durable.
- Needs transactional updates. Tabular structure fits data.
- Needs high availability across multiple locations. Can merge inconsistent writes.
- Rapidly traverse links between friends, product purchases, and ratings.
- High volume of writes on multiple nodes.
- Large-scale analytics on large cluster.
- SQL interfaces work well with reporting tools.
- Lots of reads, infrequent writes. Products make natural aggregate.
Multi-model NoSQL databases
Polyglot persistence may bring data consistency and duplication issues
A polyglot persistent application requires managing and orchestrating different complex systems (operational complexity)
Multi-model databases combine a document store, a KV store and a graph database in one database engine
- They provide a single query language and API
- The most common implementations (ArangoDB, OrientDB) map the document and graph data models to the Key-Value data model
Conclusions on NoSQL
Conclusions
RDBMS pros:
- ACID transactions (when needed) make development easier
- Most SQL code is portable to other SQL databases
- Typed columns and integrity constraints validate data before it is added to the DB, which increases data quality
RDBMS cons:
- ORM layer can be complex
- RDBMS are difficult to scale out
- Sharding requires complex application code and will be operationally inefficient
- Difficult to store un/semi-structured data
Conclusions
NoSQL pros:
- Data modeling can be an iterative process
- Linear scaling occurs as nodes are added to the cluster
- Lower operational costs are obtained by autosharding
- Native integration of Map/Reduce frameworks and full-text search engines
- No need for ORM layers
- Easy and efficient storage of highly variable data
Conclusions
NoSQL "cons":
- Implicit schema at the application level
- Applications need to check for consistency and integrity constraints
- Not possible to express stored procedures (except for HBase and MongoDB)
- No transactions (across multiple objects), conflict resolution must be done by the client application
Conclusions
NoSQL "cons":
- ACID transactions are limited to just one element (row, document, entity, etc.), in contrast with RDBMS
  - 2nd generation NoSQL or NewSQL databases try to cope with this problem
- Data models and query languages are proprietary and create vendor lock-in
- Data structure is chosen (denormalization) upfront, based on the queries that will be expressed. If queries change, data also need to be changed (exception: graph DBs) -> Map/Reduce
Good candidate projects for NoSQL databases
Strategic
- Company core business (competitive advantage)
- Not utility projects (i.e. they aren't central to the competitive advantage of the company)
  - Too risky to adopt and low benefits
AND
Rapid time to market
- To maximize the development productivity
AND/OR
Data intensive
- Lots of data
- Low latency and high availability
- Lots of traffic: reads or writes
Thanks for your attention...
Bibliography
https://medium.com/@sent0hil/consistent-hashing-a-guide-go-implementation-fe3421ac3e8f
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
http://www.christof-strauch.de/nosqldbs.pdf
http://bigdatauniversity.com/courses/course/view.php?id=572&justenroled=1
http://www.n10k.com/blog/hbase-for-architects/
[Chen & Zhang 2014] C.L. Philip Chen, Chun-Yang Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Information Sciences 275 (2014) 314–347.
[Gartner 2015] Frank Buytendijk, Thomas W. Oestreich, Organizing for Big Data Through Better Process and Governance, Gartner report, March 2015.
[Abadi et al 2013] The Beckman Report on Database Research, http://beckman.cs.wisc.edu/beckman-report2013.pdf
http://docs.mongodb.org/
Joshua Shinavier, http://www.slideshare.net/joshsh/texas-linuxfestival2014
Dan McCreary, Ann Kelly, Making Sense of NoSQL: A Guide for Managers and the Rest of Us
Pramod J. Sadalage, Martin Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence
http://martinfowler.com/bliki/PolyglotPersistence.html