Database Expert Q&A from 2600hz and Cloudant

This is the Expert Q&A from 2600hz and Cloudant on Database in Telecom. If you are a service provider, MSP or anyone running a VoIP switch, you should definitely check this out.

Powerful, Distributed, API Communications

Call-in Number: 513.386.0101 Pin 705-705-141

Expert Q&A: Database Edition

May 31st, 2013

Welcome

Our Panelists

Joshua Goldbard

Marketing Ninja, 2600hz, Moderator

Darren Schreiber

Founder, 2600hz

Sam Bisbee

Cloudant

Database: It’s all good until it isn’t

Some background…

What is Database?

• A Record of things Remembered or Forgotten

• Used to be Unbelievably hard, now it’s just hard sometimes

• Modern Databases are amazingly resilient

• Failure Mode still requires lots of attention

• In Distributed Environments…

• Database is inextricably linked to the network

• The network is always unreliable if it is public

Masters and Slaves

• Databases have to Replicate

• Most Databases use a form of Master-Slave Relationship to manage replication and dedupe

• Masters are where new data is entered

• Then it’s mirrored out to the Slaves for storage

• If you lose access to the original Master, you can convert a Slave into a Master and restore operation
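
The master-slave flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` and `Cluster` classes are hypothetical, writes are mirrored synchronously, and promotion is triggered by hand. Real databases typically replicate asynchronously and handle many more failure cases.

```python
from typing import Dict, List


class Node:
    """A hypothetical database node holding a key/value copy of the data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}


class Cluster:
    """Toy master-slave cluster: writes enter at the master, then mirror out."""

    def __init__(self, master: Node, slaves: List[Node]) -> None:
        self.master = master
        self.slaves = slaves

    def write(self, key: str, value: str) -> None:
        # New data is entered on the master first...
        self.master.data[key] = value
        # ...then mirrored out to every slave (synchronously, for simplicity).
        for slave in self.slaves:
            slave.data[key] = value

    def promote(self, slave_name: str) -> None:
        # The old master is assumed lost: convert the named slave into the master.
        new_master = next(s for s in self.slaves if s.name == slave_name)
        self.slaves.remove(new_master)
        self.master = new_master


cluster = Cluster(Node("db1"), [Node("db2"), Node("db3")])
cluster.write("ext100_forward", "555-0100")
cluster.promote("db2")  # db1 is unreachable, so db2 takes over
cluster.write("ext100_forward", "555-0111")
```

After the promotion, new writes enter at `db2` and are mirrored to the remaining slave, `db3`.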

Durability

Other Replication Strategies

• Other strategies exist, such as…

• Master-Master (What 2600hz Uses)

• Tokenized Exchange

• Time-delimited

• The most popular methods tend to be Master-Slave or Master-Master

Each Database has its advantages and tradeoffs. Once again, there is no Magic Bullet.

Failure and Quorum

• When A Database needs to elect a new master…

• There are many different strategies

• Most involve the concept of quorum (figuring out where the greatest number of copies reside)

• Once Quorum is established, a new master is elected and (hopefully) operation can resume

• Quorum is different in Master-Master (Explain)
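
One common way to read "where the greatest number of copies reside" is a strict-majority rule. The sketch below assumes that rule; the function name and node names are made up for illustration, and actual databases use their own election protocols.

```python
from typing import List, Optional, Set


def partition_with_quorum(partitions: List[Set[str]], cluster_size: int) -> Optional[Set[str]]:
    """Return the partition holding a strict majority of nodes, or None."""
    needed = cluster_size // 2 + 1  # strict majority of the full cluster
    for members in partitions:
        if len(members) >= needed:
            return members
    return None


# A five-node cluster split 3/2: only the three-node side may elect a master.
majority = partition_with_quorum([{"db1", "db2", "db3"}, {"db4", "db5"}], cluster_size=5)

# A four-node cluster split 2/2: neither side has quorum, so nobody is elected.
tie = partition_with_quorum([{"db1", "db2"}, {"db3", "db4"}], cluster_size=4)
```

Under this rule an even split leaves the cluster without a master, which is one reason odd cluster sizes are often preferred.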

CAP Theorem

Databases can have (at most) 2 out of 3 of the following:

• Consistency

• Availability

• Partition Tolerance

Modern Database Management is balancing between Consistency and Availability because all modern networks are unreliable

Examples of Databases

What is Important in a Database?

• Reliable Storage of Data?

• Fast Retrieval of Data?

• Fast Saving of Data?

• Resilience during failures?

• <other>

Examples

• Buying tickets from Ticketmaster

• What’s important and why?

• Withdrawing money from a bank?

• Storing Call Forwarding Settings?

• Storing a List of Favorite Stocks?

Each Scenario has a different set of requirements and constraints. There is no silver bullet; if you could write one database for all these scenarios, you’d be rich.

Which Database is Better?

• STUPID QUESTION

• But I thought there were no stupid questions?

• This is the only stupid question.

• The fight of which database is better is almost always silly

• Databases are a tool, to get a job done

• Like the previous examples, each job is different

• Each database stresses different pros/cons

Let’s Get Technical!

Trouble With Databases

• HUGE TOPIC (We’re only going to cover a little)

• Network Partitions

• Layer 1 disasters

• Flapping Internet (Special Class of Network Partitions)

Network Partitions

• Common in Distributed Databases

• When Databases lose contact with each other they can partition

• Caused by unreliable or faulty network connections

• Databases can behave very weirdly when in partitions

Arguably, most of what a database admin does is prepare for network partitions and how to resolve them.

Network without Partitions

Network with Partitions

Split-Brain

• During a partition, some databases will elect N masters, one for each partition in the network.

• When the partition is fixed, unless there is a pre-defined restoral procedure, there will be conflicts

• Databases have all kinds of strategies for handling WAN Split-brain failure, but you should understand them

Key Takeaway: No Database is perfect. Understand the automation but also understand the manual intervention procedure.
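
To make the conflict problem concrete, here is a small Python sketch of what a restoral procedure has to detect once the partition heals. It assumes a simple three-way comparison against the last state both sides agreed on; `find_conflicts` and the sample keys are hypothetical, and real databases track revisions in their own ways.

```python
from typing import Dict, Optional, Tuple


def find_conflicts(
    base: Dict[str, str], side_a: Dict[str, str], side_b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return keys that both partitions changed, and changed differently."""
    conflicts = {}
    for key in side_a.keys() | side_b.keys():
        original = base.get(key)
        a, b = side_a.get(key), side_b.get(key)
        # A change on only one side can simply be copied over; a conflict
        # needs both sides to differ from the original and from each other.
        if a != b and a != original and b != original:
            conflicts[key] = (a, b)
    return conflicts


# State before the partition, then the state each master ended up with.
base = {"ext100_forward": "555-0100", "ext200_forward": "555-0200"}
side_a = {"ext100_forward": "555-0111", "ext200_forward": "555-0200"}
side_b = {"ext100_forward": "555-0122", "ext200_forward": "555-0299"}

conflicts = find_conflicts(base, side_a, side_b)
```

Here only `ext100_forward` is a true conflict. `ext200_forward` changed on one side only, so it can be merged automatically, while the conflicting key needs a policy or a human decision.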

Layer 1 Failures

• Rut Roh

• Actual Physical Disaster

• No easy way out except…

• Don’t be in a Datacenter that’s hit by a disaster

OR

• Be Nimble enough to Evade Disaster

Evading Disaster

• We’re not Magicians, we can’t simply predict disasters

• The next best thing is being able to move and move fast

• Kazoo requires one line of code to move

• Kazoo moves fast

• Moving the Database fast is awesome (Thanks BigCouch!)

During Hurricane Sandy, we cut our Datacenters away from Downtown New York to a Datacenter above the 100 year flood plain on the East Coast. Result: No Downtime.

No Silver Bullets

• Layer 1 disasters are a humbling experience

• Don’t rely on DataCenters in the Path of a Storm

• Flooding will brick datacenters that have generators below ground

• To avoid being powerless in a disaster…

• Plan, Test, Analyze, Repeat

• Check out Netflix Simian Army for examples of tests

Flapping

• Is it up? Is it Down? Around and Around it Goes, where it stops nobody knows…

• Flapping Internet is a special case of network partition or lost connectivity

• Flapping connections lose contact with other servers and then appear to come back online before going off again

Why is this bad?

Fixing Flapping

• I’m trying to fix a partition

• The Network keeps going up and down

• As I repair my cluster, it keeps starting to repair and failing (by attempting to reintegrate the unreliable nodes)

Flapping nodes make everything awful
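
One possible mitigation, sketched here as an assumption rather than anything the slides prescribe, is to hold a node out of the cluster until it has stayed up for a full stability window. The `FlapGuard` class and its parameters are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict


class FlapGuard:
    """Only allow a node to rejoin after it has been continuously up for a while."""

    def __init__(self, stable_seconds: float) -> None:
        self.stable_seconds = stable_seconds
        self.up_since: Dict[str, float] = {}

    def report(self, node: str, is_up: bool, now: float) -> None:
        if is_up:
            # Remember when the current "up" streak started.
            self.up_since.setdefault(node, now)
        else:
            # Any drop resets the streak, so a flapping node never qualifies.
            self.up_since.pop(node, None)

    def safe_to_rejoin(self, node: str, now: float) -> bool:
        started = self.up_since.get(node)
        return started is not None and now - started >= self.stable_seconds


guard = FlapGuard(stable_seconds=60)
guard.report("db3", True, now=0)
guard.report("db3", False, now=10)  # the link drops again
guard.report("db3", True, now=20)   # and comes back

too_early = guard.safe_to_rejoin("db3", now=70)    # up for only 50 seconds
stable_now = guard.safe_to_rejoin("db3", now=80)   # up for a full 60 seconds
```

The tradeoff is slower recovery: a healthy node also has to wait out the window before repair resumes.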

Why is the Network Difficult?

“Detecting network failures is hard. Since our only knowledge of the other nodes passes through the network, delays are indistinguishable from failure. This is the fundamental problem of the network partition: latency high enough to be considered a failure. When partitions arise, we have no way to determine what happened on the other nodes: are they alive? Dead? Did they receive our message? Did they try to respond? Literally no one knows. When the network finally heals, we'll have to re-establish the connection and try to work out what happened–perhaps recovering from an inconsistent state.”

-Kyle Kingsbury, Aphyr.com
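
The point that delays are indistinguishable from failure shows up directly in a timeout-based failure detector. The following Python sketch assumes a simple heartbeat scheme; the function, node names, and timeout value are illustrative only.

```python
from typing import Dict, Set


def suspected_nodes(last_heartbeat: Dict[str, float], now: float, timeout: float) -> Set[str]:
    """Return every node whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


# db1 is healthy, db2 is alive but its heartbeats are delayed, db3 has crashed.
last_heartbeat = {"db1": 99.0, "db2": 93.0, "db3": 40.0}

suspects = suspected_nodes(last_heartbeat, now=100.0, timeout=5.0)
```

Both `db2` and `db3` end up suspected. From this side of the network, the slow node and the dead node look exactly the same.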

What does 2600hz use?

• Cloudant BigCouch

• NoSQL Database

• Master-Master

• Very sensibly designed for our use case

Why BigCouch?

DEMANDS

1. On the Fly Schema Changes

2. Scale in a distributed fashion

3. Configuration changes will happen as we grow

4. Has to be equipment agnostic

5. Accessible Raw Data View

6. Simple to Install and Keep up

7. It can’t fail, ergo Fault-Tolerance

8. Multi-Master writes

9. Simple (to cluster, to backup, to replicate, to split)

TRADEOFFS

1. Eventual Consistency is OK

2. Nodes going offline randomly

3. Multi-server only

Why are we ok with these tradeoffs? They suit our use case.

Let’s take some time to pontificate about Database at scale…

What are the first things you think of when you get errors reported from the Database? What’s your Thought Process?

Recap

• Database is where you put stuff

• You want your Database not to die

• 2600hz uses BigCouch because it’s really awesome technology

• Great for our Use Case

• Easy to Administrate

• Resilient and quick-to-restore

QUESTIONS???
