Database Expert Q&A from 2600hz and Cloudant

This is the Expert Q&A from 2600hz and Cloudant on Database in Telecom. If you are a service provider, MSP or anyone running a VoIP switch, you should definitely check this out.

Published in: Technology

  1. 1. Powerful, Distributed, API Communications. Call-in Number: 513.386.0101, Pin 705-705-141. Expert Q&A: Database Edition, May 31st, 2013
  2. 2. Welcome
  3. 3. Our Panelists: Joshua Goldbard, Marketing Ninja, 2600hz (Moderator); Darren Schreiber, Founder, 2600hz; Sam Bisbee, Cloudant
  4. 4. Database: It’s all good until it isn’t
  5. 5. Some background…
  6. 6. What is Database? • A Record of things Remembered or Forgotten • Used to be Unbelievably hard, now it’s just hard sometimes • Modern Databases are amazingly resilient • Failure Mode still requires lots of attention • In Distributed Environments… • Database is inexorably linked to the network • The network is always unreliable if public
  7. 7. Masters and Slaves • Databases have to Replicate • Most Databases use a form of Master-Slave Relationship to manage replication and dedupe • Masters are where new data is entered • Then it’s mirrored out to the Slaves for storage • If you lose access to the original Master, you can convert a Slave into a Master and restore operation. Durability
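The replication and failover flow on this slide can be sketched roughly as follows. This is a toy in-memory model, not how any particular database implements it: the `Cluster` class, its method names, and the choice to copy every write to every slave synchronously are all assumptions made for illustration.

```python
class Cluster:
    """Toy master-slave cluster (hypothetical; for illustration only)."""

    def __init__(self, slave_count: int) -> None:
        self.master: dict[str, str] = {}
        self.slaves: list[dict[str, str]] = [{} for _ in range(slave_count)]

    def write(self, key: str, value: str) -> None:
        # New data is entered on the master...
        self.master[key] = value
        # ...then mirrored out to every slave (assumed synchronous here).
        for slave in self.slaves:
            slave[key] = value

    def fail_over(self) -> None:
        # The master is lost: convert the first slave into the new master.
        if not self.slaves:
            raise RuntimeError("no slave left to promote")
        self.master = self.slaves.pop(0)


cluster = Cluster(slave_count=2)
cluster.write("call_forward", "555-0100")
cluster.fail_over()
```

After `fail_over()`, the promoted slave still holds the `call_forward` record, which is the sense in which replication buys durability.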
  8. 8. Other Replication Strategies • Other strategies exist, such as… • Master-Master (What 2600hz Uses) • Tokenized Exchange • Time-delimited • The most popular methods tend to be Master-Slave or Master-Master. Each Database has its advantages and tradeoffs. Once again, there is no Magic Bullet.
  9. 9. Failure and Quorum • When A Database needs to elect a new master… • There are many different strategies • Most involve the concept of quorum (figuring out where the greatest number of copies reside) • Once Quorum is established, a new master is elected and (hopefully) operation can resume • Quorum is different in Master-Master (Explain)
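One common reading of quorum is a strict majority of the cluster. The sketch below assumes that reading; the function names and the tie-break rule (lowest node name wins) are hypothetical choices, and real databases use more elaborate election protocols.

```python
from typing import Optional


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True when strictly more than half of the cluster is reachable."""
    return reachable > cluster_size // 2


def elect_master(reachable_nodes: list[str], cluster_size: int) -> Optional[str]:
    """Pick a new master only if this group of nodes holds quorum."""
    if not has_quorum(len(reachable_nodes), cluster_size):
        return None  # minority side: refuse to elect, avoiding two masters
    # Assumed deterministic tie-break: the lowest node name wins.
    return min(reachable_nodes)


print(elect_master(["db3", "db1", "db2"], cluster_size=5))
print(elect_master(["db4", "db5"], cluster_size=5))
```

With five nodes split three and two, only the three-node side can elect a master; the two-node side gets `None` and waits for the partition to heal.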
  10. 10. CAP Theorem: Databases can have (at most) 2 out of 3 of the following: • Consistency • Availability • Partition Tolerance. Modern Database Management is balancing between Consistency and Availability because all modern networks are unreliable
  11. 11. Examples of Databases
  12. 12. What is Important in a Database? • Reliable Storage of Data? • Fast Retrieval of Data? • Fast Saving of Data? • Resilience during failures? • <other>
  13. 13. Examples • Buying tickets from Ticketmaster • What’s important and why? • Withdrawing money from a bank? • Storing Call Forwarding Settings? • Storing a List of Favorite Stocks? Each Scenario has a different set of requirements and constraints. There is no silver bullet; if you could write one database for all these scenarios, you’d be rich.
  14. 14. Which Database is Better? • STUPID QUESTION • But I thought there were no stupid questions? • This is the only stupid question. • The fight of which database is better is almost always silly • Databases are a tool, to get a job done • Like the previous examples, each job is different • Each database stresses different pros/cons
  15. 15. Let’s Get Technical!
  16. 16. Trouble With Databases • HUGE TOPIC (We’re only going to cover a little) • Network Partitions • Layer 1 disasters • Flapping Internet (Special Class of Network Partitions)
  17. 17. Network Partitions • Common in Distributed Databases • When Databases lose contact with each other they can partition • Caused by unreliable or faulty network connections • Databases can behave very weirdly when in partitions. Arguably, most of what a database admin does is prepare for network partitions and how to resolve them.
  18. 18. Network without Partitions
  19. 19. Network with Partitions
  20. 20. Split-Brain • During a partition, some databases will elect N masters, one for each partition in the network. • When the partition is fixed, unless there is a pre-defined restoral procedure, there will be conflicts • Databases have all kinds of strategies for handling WAN Split-brain failure, but you should understand them. Key Takeaway: No Database is perfect. Understand the automation but also understand the manual intervention procedure.
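To make the conflict problem concrete, here is a minimal sketch of what has to be checked when a partition heals. It assumes each side's data is a plain dictionary and that a snapshot from before the split is available; `find_conflicts` and the sample keys are hypothetical, and real databases track revisions rather than comparing whole values.

```python
def find_conflicts(before_split: dict, side_a: dict, side_b: dict) -> dict:
    """Return keys that both sides changed, to different values, while split."""
    conflicts = {}
    for key in set(before_split) | set(side_a) | set(side_b):
        original = before_split.get(key)
        a = side_a.get(key)
        b = side_b.get(key)
        # A change on only one side merges cleanly; the same change on both
        # sides is harmless. Only diverging changes need a resolution policy.
        if a != original and b != original and a != b:
            conflicts[key] = (a, b)
    return conflicts


before = {"call_forward": "555-0100", "voicemail": "on"}
east = {"call_forward": "555-0111", "voicemail": "on"}
west = {"call_forward": "555-0122", "voicemail": "off"}
print(find_conflicts(before, east, west))
```

Here `voicemail` changed on one side only and merges cleanly, while `call_forward` was changed on both sides and needs either an automated rule or a human to pick the winner.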
  21. 21. Layer 1 Failures
  22. 22. Layer 1 Failures • Rut Roh • Actual Physical Disaster • No easy way out except… • Don’t be in a Datacenter that’s hit by a disaster OR • Be Nimble enough to Evade Disaster
  23. 23. Evading Disaster • We’re not Magicians, we can’t simply predict disasters • The next best thing is being able to move and move fast • Kazoo requires one line of code to move • Kazoo moves fast • Moving the Database fast is awesome (Thanks BigCouch!) During Hurricane Sandy, we cut our Datacenters away from Downtown New York to a Datacenter above the 100-year flood plain on the East Coast. Result: No Downtime.
  24. 24. No Silver Bullets • Layer 1 disasters are a humbling experience • Don’t rely on DataCenters in the Path of a Storm • Flooding will brick datacenters that have generators below ground • To avoid being powerless in a disaster… • Plan, Test, Analyze, Repeat • Check out Netflix Simian Army for examples of tests
  25. 25. Flapping • Is it up? Is it Down? Around and Around it Goes, where it stops nobody knows… • Flapping Internet is a special case of network partition or lost connectivity • Flapping connections lose contact with other servers and then appear to come back online before going off again. Why is this bad?
  26. 26. Fixing Flapping • I’m trying to fix a partition • The Network keeps going up and down • As I repair my cluster, it keeps starting to repair and failing (by attempting to reintegrate the unreliable nodes). Flapping nodes make everything awful
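One way to keep a flapping node from repeatedly derailing a repair is to damp it: only let the node back into the cluster after it has stayed up for several consecutive health checks. The slides do not describe a specific fix, so the sketch below is an assumption; `FlapDamper` and its threshold are hypothetical names and values.

```python
class FlapDamper:
    """Readmit a node only after N consecutive successful health checks."""

    def __init__(self, required_stable_checks: int = 3) -> None:
        self.required = required_stable_checks
        self.streak: dict[str, int] = {}

    def record(self, node: str, is_up: bool) -> bool:
        """Record one health check; return True if the node may rejoin."""
        if is_up:
            self.streak[node] = self.streak.get(node, 0) + 1
        else:
            # Any failure resets the streak, so a flapping node never
            # accumulates enough good checks to be reintegrated.
            self.streak[node] = 0
        return self.streak[node] >= self.required


damper = FlapDamper(required_stable_checks=3)
checks = [True, True, False, True, True, True]
print([damper.record("db2", up) for up in checks])
```

The node in this run is only readmitted on the last check, after three clean results in a row. The tradeoff is slower recovery for nodes that are genuinely healthy again.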
  27. 27. Why is the Network Difficult? “Detecting network failures is hard. Since our only knowledge of the other nodes passes through the network, delays are indistinguishable from failure. This is the fundamental problem of the network partition: latency high enough to be considered a failure. When partitions arise, we have no way to determine what happened on the other nodes: are they alive? Dead? Did they receive our message? Did they try to respond? Literally no one knows. When the network finally heals, we’ll have to re-establish the connection and try to work out what happened–perhaps recovering from an inconsistent state.” -Kyle Kingsbury, Aphyr.com
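The point that delays are indistinguishable from failure shows up as soon as you write a timeout-based failure detector. The sketch below is a deliberately naive one; `classify_node` and the two-second timeout are assumptions for illustration.

```python
def classify_node(last_heartbeat: float, now: float, timeout: float) -> str:
    """Classify a peer using nothing but the time since its last heartbeat."""
    silence = now - last_heartbeat
    # The detector cannot see *why* the peer is silent: a crashed node and a
    # healthy node behind a slow link look exactly the same from here.
    return "suspected" if silence > timeout else "alive"


# A crashed node and a merely slow node, both silent for 2.5 seconds.
crashed = classify_node(last_heartbeat=0.0, now=2.5, timeout=2.0)
slow_but_healthy = classify_node(last_heartbeat=0.0, now=2.5, timeout=2.0)
print(crashed, slow_but_healthy)
```

Both calls receive identical inputs and so must return the same answer. Raising the timeout reduces false alarms but delays the detection of real failures.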
  30. 30. What does 2600hz use? • Cloudant BigCouch • NoSQL Database • Master-Master • Very sensibly designed for our use case
  31. 31. Why BigCouch? DEMANDS: 1. On the Fly Schema Changes 2. Scale in a distributed fashion 3. Configuration changes will happen as we grow 4. Has to be equipment agnostic 5. Accessible Raw Data View 6. Simple to Install and Keep up 7. It can’t fail, ergo Fault-Tolerance 8. Multi-Master writes 9. Simple (to cluster, to TRADEOFFS: 1. Eventual Consistency is OK 2. Nodes going offline randomly 3. Multi-server only. Why are we ok with these tradeoffs? They suit our use case.
  32. 32. Let’s take some time to pontificate about Database at scale… What are the first things you think of when you get errors reported from the Database? What’s your Thought Process?
  33. 33. Recap • Database is where you put stuff • You want your Database not to die • 2600hz uses BigCouch because it’s really awesome technology • Great for our Use Case • Easy to Administrate • Resilient and quick-to-restore
  34. 34. QUESTIONS???