Lessons Learned Migrating 2+ Billion Documents at Craigslist

The slides from my 2011 MongoSF talk of the same name

Published in: Technology

  1. Lessons Learned from Migrating 2+ Billion Documents at Craigslist<br />Jeremy Zawodny<br />jzawodn@craigslist.org<br />Jeremy@Zawodny.com<br />http://blog.zawodny.com/<br />
  2. Outline<br />Recap last year’s MongoSV Talk<br />The Archive, Why MongoDB, etc.<br />http://www.10gen.com/video/mongosv2010/craigslist<br />The Infrastructure<br />The Lessons<br />Wishlist<br />Q&A<br />
  3. Craigslist Numbers<br />2 data centers<br />~500 servers<br />~100 MySQL servers<br />~700 cities, worldwide<br />~1 billion hits/day<br />~1.5 million posts/day<br />
  4. Archive: Where Data Goes To Die<br />Live Numbers<br />~1.75M posts/day<br />~14 day avg. lifetime<br />~60 day retention<br />~100M posts<br />We keep all postings<br />Users reuse postings<br />Daily archive migration<br />Internal query tools<br />
  5. Archive Pain<br />Coupled Schemas<br />Big Indexes<br />Hardware Failures<br />Replication Lag<br />Poor Search<br />Human Time Costs<br />
  6. MongoDB Wins<br />Scalable<br />Fast<br />Friendly<br />Proven<br />Pragmatic<br />Approachable<br />
  7. MongoDB Details<br />Plan for 5 billion documents<br />Average size: 2KB<br />3 Replica sets, 3 Servers each<br />Deploy to 2 datacenters<br />Same deployment in each datacenter<br />Posting ID is sharding key<br />
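
The planning numbers on slide 7 imply a rough storage footprint. The sketch below works through that arithmetic. It assumes 2KB means 2,048 bytes, an even spread of documents across shards, and no index or storage overhead; the slides state none of these, so treat the result as a lower bound.

```python
# Back-of-the-envelope sizing from the planning numbers on slide 7.
PLANNED_DOCS = 5_000_000_000   # "Plan for 5 billion documents"
AVG_DOC_BYTES = 2 * 1024       # "Average size: 2KB" (assumed to mean 2,048 bytes)
SHARDS = 3                     # "3 Replica sets"
MEMBERS_PER_SET = 3            # "3 Servers each"


def sizing(docs=PLANNED_DOCS, avg_bytes=AVG_DOC_BYTES,
           shards=SHARDS, members=MEMBERS_PER_SET):
    """Return raw data volume figures in bytes, ignoring indexes and overhead."""
    raw = docs * avg_bytes
    return {
        "raw_bytes": raw,                         # one logical copy of the data
        "per_shard_bytes": raw / shards,          # assumes an even distribution
        "replicated_bytes_per_datacenter": raw * members,  # every member holds a copy
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value / 1024**4:.2f} TiB")
```

Each server would need to hold roughly a third of the raw data, which is far more than RAM on typical hardware of the time. That is the situation the hardware lesson on slide 9 describes.
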
  8. MongoDB Architecture<br />Typical Sharding with Replica Sets<br />(external sphinx full-text indexers not pictured)<br />[Diagram: 4 clients, 3 mongos routers, 3 config servers, and 3 shards (shard001, shard002, shard003), each a replica set]<br />
  9. Lesson: Know Your Hardware<br />MongoDB on blades really sucks<br />Single 10k RPM disks can’t take it when data is noticeably larger than RAM<br />Mongo operations can hit the client timeout (30 sec default)<br />Even once-a-minute cron jobs start to spew<br />Lots of time wasted in development environment, trying different kernels, tuning, etc.<br />Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons<br />
  10. Lesson: Replica Sets Rock<br />Lots of reboots happened during dev environment troubleshooting<br />Each time, one of the remaining nodes took over<br />No “reclone”, no config file or DNS changes<br />Stuff “just worked” while nodes bounced up and down<br />
  11. Lesson: Know Your Data<br />MongoDB is UTF-8<br />Some of our older data is decidedly NOT UTF-8<br />We had lots of sloppy encoding issues, and we had to clean them all up.<br />Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh.<br />This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you. Get your encoding house in order!<br />
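
The encoding lesson on slide 11 suggests checking every record before a long load starts rather than finding out 12-36 hours in. A minimal sketch of such a pre-flight check follows. The function names and the cp1252 fallback are assumptions for illustration; the slides do not say which legacy encodings were involved.

```python
from typing import Iterable, List, Tuple


def to_utf8_text(raw: bytes, fallback: str = "cp1252") -> Tuple[str, bool]:
    """Decode raw bytes, returning (text, was_valid_utf8).

    Strict UTF-8 is tried first. On failure the bytes are decoded with the
    fallback encoding (an assumed legacy encoding), substituting U+FFFD for
    any byte that encoding cannot map.
    """
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace"), False


def find_bad_rows(rows: Iterable[Tuple[int, bytes]]) -> List[Tuple[int, str]]:
    """Return (row_id, repaired_text) for every row that was not valid UTF-8."""
    bad = []
    for row_id, raw in rows:
        text, was_valid = to_utf8_text(raw)
        if not was_valid:
            bad.append((row_id, text))
    return bad


if __name__ == "__main__":
    sample = [(1, "café".encode("utf-8")), (2, b"caf\xe9")]
    print(find_bad_rows(sample))
```

Running a scan like this over the source data first gives a list of rows to repair before the load begins.
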
  12. Lesson: Know Your Data Size<br />MongoDB has a document size limit<br />4MB in 1.6.x, 16MB in 1.8.x<br />What to do with outliers?<br />In our case, trim off some useless data.<br />But going from relational to document means this sort of problem is easy to have. One parent, many children.<br />It’d be nice if this was easier to change, but clients have it hard-coded too.<br />Compression would help, of course.<br />
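
Slide 12 mentions trimming useless data from outlier documents. One way to sketch that step is below. It is an illustration under assumptions, not the approach used at Craigslist: the field names are hypothetical, and the size is estimated from the JSON encoding, which only approximates the BSON size MongoDB actually enforces. A driver's BSON encoder would give the exact figure.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # 16MB limit quoted on the slide for 1.8.x


def approx_size(doc) -> int:
    """Rough size in bytes via UTF-8 JSON; a proxy, NOT the exact BSON size."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def trim_to_fit(doc, droppable, limit=MAX_DOC_BYTES):
    """Drop fields from `droppable`, in order, until the document fits.

    Returns a trimmed copy; the input is left untouched. Raises ValueError
    if the document is still too large once every droppable field is gone.
    """
    trimmed = dict(doc)
    for field in droppable:
        if approx_size(trimmed) <= limit:
            break
        trimmed.pop(field, None)
    if approx_size(trimmed) > limit:
        raise ValueError("document still exceeds limit after trimming")
    return trimmed


if __name__ == "__main__":
    # Hypothetical posting with an oversized, low-value field.
    posting = {"_id": 1, "body": "x" * 10, "debug_log": "y" * 1000}
    print(trim_to_fit(posting, ["debug_log"], limit=200))
```

Because the estimate is approximate, it makes sense to pass a limit comfortably below the real one.
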
  13. Lesson: Know Your Data Types<br />Field type conversions can be expensive to do after the fact!<br />MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious<br />This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789”<br />http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod<br />
  14. Data Types, continued<br />“If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.”<br />Do you know how to do that in your language of choice?<br />Some drivers may make a “guess” that gets it right most of the time.<br />
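
The advice quoted on slide 14, to convert data to the expected types before sending it, can be sketched as a small coercion step. The schema and field names below are hypothetical; the point is only that "123456789" becomes 123456789 before it ever reaches the database.

```python
# Hypothetical schema: field name -> the type the application expects to store.
SCHEMA = {"posting_id": int, "price": float, "title": str}


def coerce(doc, schema=SCHEMA):
    """Return a copy of doc with schema fields converted to their expected types.

    Fields missing from the document, or set to None, are left alone. A value
    that cannot be converted raises ValueError, which fails the record early
    instead of storing it under the wrong type.
    """
    out = dict(doc)
    for field, expected in schema.items():
        value = out.get(field)
        if value is not None and not isinstance(value, expected):
            out[field] = expected(value)
    return out


if __name__ == "__main__":
    row = {"posting_id": "123456789", "price": "25", "title": 42}
    print(coerce(row))
```

Applying the same coercion to query values keeps lookups and stored data consistent, which avoids the indexing mismatch described on slide 13.
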
  15. Lesson: Know Some Sharding<br />The Balancer can be your frenemy<br />Initial insert rate: 8,000/sec<br />Later drops to 200/sec<br />Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again<br />Pre-split your data if possible<br />http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/<br />
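
Pre-splitting, as recommended on slide 15, means creating chunk boundaries before the bulk load so the balancer has less data to move afterwards. The sketch below only computes evenly spaced split points and builds the documents for MongoDB's `split` admin command. It assumes a numeric shard key with roughly uniform IDs, and the `archive.postings` namespace is hypothetical. Placing the resulting empty chunks on the intended shards is a separate step not shown here; the linked blog post describes the procedure the author used.

```python
def split_points(min_id, max_id, chunks):
    """Return chunks - 1 evenly spaced integer boundaries in (min_id, max_id)."""
    span = max_id - min_id
    if chunks < 2 or span < chunks:
        raise ValueError("need at least 2 chunks and a span of at least `chunks`")
    # Integer arithmetic keeps the boundaries exact for very large IDs.
    return [min_id + (span * i) // chunks for i in range(1, chunks)]


def split_commands(namespace, points, key="_id"):
    """Build one `split` admin command document per boundary.

    Each document has the shape {"split": "<db>.<collection>", "middle": {key: value}}
    and would be sent to a mongos as an admin command.
    """
    return [{"split": namespace, "middle": {key: point}} for point in points]


if __name__ == "__main__":
    # Hypothetical ID range and namespace, for illustration only.
    for command in split_commands("archive.postings", split_points(0, 1000, 4)):
        print(command)
```
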
  16. Lesson: Know Some Replica Sets<br />Replica Set re-sync requires index rebuilds on the secondary<br />Most painful when a secondary is down too long and can’t catch up using the oplog<br />Typically during high write volumes<br />In a large data set, the index rebuilding can take a couple of days, even without many indexes<br />What if you lose another node while that is happening?<br />
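
Whether a secondary can catch up from the oplog, as slide 16 discusses, depends on how much write history the oplog holds. A rough way to estimate that window is sketched below. The oplog size and the safety margin are hypothetical values; the write rate and document size reuse figures from the slides, and treating every oplog entry as one average-sized document is a simplifying assumption.

```python
def oplog_window_hours(oplog_bytes, writes_per_sec, avg_entry_bytes):
    """Estimate how many hours of writes fit in the oplog before it wraps."""
    bytes_per_sec = writes_per_sec * avg_entry_bytes
    return oplog_bytes / bytes_per_sec / 3600


def can_catch_up(downtime_hours, window_hours, safety=0.8):
    """True if the outage fits inside the window, with an assumed safety margin."""
    return downtime_hours < window_hours * safety


if __name__ == "__main__":
    # Hypothetical 50 GiB oplog; 8,000 writes/sec and 2KB entries from the slides.
    window = oplog_window_hours(50 * 1024**3, 8000, 2 * 1024)
    print(f"oplog window: {window:.2f} hours")
    print("30 minute outage recoverable:", can_catch_up(0.5, window))
```

At bulk-load write rates the window can shrink to under an hour, which is why a secondary that goes down during heavy writes is the one most likely to need a full re-sync.
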
  17. MongoDB Wishlist<br />Replica set node re-sync without index rebuilding<br />Record (or field) compression (not everyone uses a filesystem that offers compression)<br />Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.)<br />Hash-based sharding (coming soon?)<br />Cluster snapshot/backup tool<br />
  18. craigslist is hiring!<br />send resumes to: z@craigslist.org<br />Plain Text or PDF, no Word Docs!<br />Front-end Engineering<br />HTML, CSS, JavaScript, jQuery<br />(Mobile too)<br />Network Administration<br />Routers, switches, load balancers, etc.<br />Back-end Engineering<br />Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc.<br />Systems Administration<br />Help keep all those systems running.<br />
  19. craigslist is hiring!<br />send resumes to: z@craigslist.org<br />Plain Text or PDF, no Word Docs!<br />Laid back, non-corporate environment<br />Engineering-driven culture<br />Lots of interesting technical challenges<br />Easy SF commute<br />Excellent benefits and pay<br />High-impact work<br />Millions use craigslist daily<br />
