SlideShare a Scribd company logo
1 of 19
Lessons Learned from Migrating 2+ Billion Documents at Craigslist Jeremy Zawodny jzawodn@craigslist.org Jeremy@Zawodny.com http://blog.zawodny.com/
Outline Recap last year’s MongoSV Talk The Archive, Why MongoDB, etc. http://www.10gen.com/video/mongosv2010/craigslist The Infrastructure The Lessons Wishlist Q&A
Craigslist Numbers 2 data centers ~500 servers ~100 MySQL servers ~700 cities, worldwide ~1 billion hits/day ~1.5 million posts/day
Archive: Where Data Goes To Die Live Numbers ~1.75M posts/day ~14 day avg. lifetime ~60 day retention ~100M  posts We keep all postings Users reuse postings Daily archive migration Internal query tools
Archive Pain Coupled Schemas Big Indexes Hardware Failures Replication Lag Poor Search Human Time Costs
MongoDB Wins Scalable Fast Friendly Proven Pragmatic Approachable
MongoDB Details Plan for 5 billion documents Average size: 2KB 3 Replica sets, 3 Servers each Deploy to 2 datacenters Same deployment in each datacenter Posting ID is sharding key
MongoDB Architecture Typical Sharding with Replica Sets (external sphinx full-text indexers not pictured) config client client client client config config mongos mongos mongos shard001 shard003 shard002 replica set replica set replica set
Lesson: Know Your Hardware MongoDB on blades really sucks Single 10k RPM disks can’t take it when data is noticeably larger than RAM Mongo operations can hit the client timeout (30 sec default) Even minutely cron jobs start to spew Lots of time wasted in development environment, trying different kernels, tuning, etc. Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons
Lesson: Replica Sets Rock Lots of reboots happened during dev environment troubleshooting Each time, one of the remaining nodes took over No “reclone” no config file or DNS changes Stuff “just worked” while nodes bounced up and down
Lesson: Know Your Data MongoDB is UTF-8 Some of our older data is decidedly NOT UTF-8 We have lots of sloppy encoding issues to clean up.  But we had to clean them all up. Start data load.  Wait 12-36 hours.  Witness fail.  Fix code.  Start over.  Sigh. This is a combination of having been sloppy and having old data.  Even with a lot less history, this can bite you.  Get your encoding house in order!
Lesson: Know Your Data Size MongoDB has a doc size limits 4MB in 1.6.x, 16MB in 1.8.x What to do with outliers? In our case, trim off some useless data. But going from relational to document means this sort of problem is easy to have.  One parent, many children. It’d be nice if this was easier to change, but clients have it hard-coded too. Compression would help, of course.
Lesson: Know Your Data Types Field Types and Conversions can be expensive to do after the fact! MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789” http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod
Data Types, continued “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.” Do you know how to do that in your language of choice? Some drivers may make a “guess” that gets it right most of the time.
Lesson: Know SomeSharding The Balancer can be your frenemy Initial insert rate: 8,000/sec Later drops to 200/sec Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again Pre-split your data if possible http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
Lesson: Know Some Replica Sets Replica Set re-sync requires index rebuilds on the secondary Most painful when a slave is down too long and can’t catch up using the oplog Typically during high write volumes In a large data set, the index rebuilding can take a couple of days w/out many indexes What if you lose another while that is happening?
MongoDBWishlist Replica set node re-sync without out index rebuilding Record (or field) compression (not everyone uses a filesystem that offers compression) Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.) Hash-based sharding (coming soon?) Cluster snapshot/backup tool
craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Front-end Engineering HTML, CSS, JavaScript, jQuery (Mobile too) Network Administration Routers, switches, load balancers, etc. Back-end Engineering Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc. Systems Administration Help keep all those systems running.
craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Laid back, non-corporateenvironment Engineering driven culture Lots of interesting technical challenges Easy SF commute Excellent benefits and pay High-impact work Millions use craigslist daily

More Related Content

What's hot

Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB ClusterMongoDB
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at ScaleMongoDB
 
最近の PowerShell について
最近の PowerShell について最近の PowerShell について
最近の PowerShell についてKazuki Takai
 
MongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World ExamplesMongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World ExamplesMike Friedman
 
Best practices on building data lakes and lake formation
Best practices on building data lakes and lake formationBest practices on building data lakes and lake formation
Best practices on building data lakes and lake formationJohn Varghese
 
Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...
Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...
Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...Kazuki Takai
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBNodeXperts
 
Alfresco: Implementing secure single sign on (SSO) with OpenSAML
Alfresco: Implementing secure single sign on (SSO) with OpenSAMLAlfresco: Implementing secure single sign on (SSO) with OpenSAML
Alfresco: Implementing secure single sign on (SSO) with OpenSAMLJ V
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101MongoDB
 
An introduction to MongoDB
An introduction to MongoDBAn introduction to MongoDB
An introduction to MongoDBCésar Trigo
 
Migrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMigrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMongoDB
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교Woo Yeong Choi
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBLee Theobald
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
PostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | EdurekaPostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | EdurekaEdureka!
 
(Ficon2016) #2 침해사고 대응, 이렇다고 전해라
(Ficon2016) #2 침해사고 대응, 이렇다고 전해라(Ficon2016) #2 침해사고 대응, 이렇다고 전해라
(Ficon2016) #2 침해사고 대응, 이렇다고 전해라INSIGHT FORENSIC
 

What's hot (20)

Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB Cluster
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
最近の PowerShell について
最近の PowerShell について最近の PowerShell について
最近の PowerShell について
 
MongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World ExamplesMongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World Examples
 
Best practices on building data lakes and lake formation
Best practices on building data lakes and lake formationBest practices on building data lakes and lake formation
Best practices on building data lakes and lake formation
 
Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...
Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...
Insight into Azure Active Directory #02 - Azure AD B2B Collaboration New Feat...
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Alfresco: Implementing secure single sign on (SSO) with OpenSAML
Alfresco: Implementing secure single sign on (SSO) with OpenSAMLAlfresco: Implementing secure single sign on (SSO) with OpenSAML
Alfresco: Implementing secure single sign on (SSO) with OpenSAML
 
Mongo DB
Mongo DB Mongo DB
Mongo DB
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101
 
An introduction to MongoDB
An introduction to MongoDBAn introduction to MongoDB
An introduction to MongoDB
 
Migrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMigrating to MongoDB: Best Practices
Migrating to MongoDB: Best Practices
 
Redis introduction
Redis introductionRedis introduction
Redis introduction
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Mongo DB
Mongo DBMongo DB
Mongo DB
 
MEAN Stack
MEAN StackMEAN Stack
MEAN Stack
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
PostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | EdurekaPostgreSQL Tutorial For Beginners | Edureka
PostgreSQL Tutorial For Beginners | Edureka
 
(Ficon2016) #2 침해사고 대응, 이렇다고 전해라
(Ficon2016) #2 침해사고 대응, 이렇다고 전해라(Ficon2016) #2 침해사고 대응, 이렇다고 전해라
(Ficon2016) #2 침해사고 대응, 이렇다고 전해라
 

Viewers also liked

Webinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDBWebinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDBBoxed Ice
 
Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.Dhaval Dalal
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Jeremy Zawodny
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Migrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at WordnikMigrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at WordnikTony Tam
 
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...MongoDB
 
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB
 
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQLWebinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQLMongoDB
 
Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011Ted Naleid
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistJeremy Zawodny
 
MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016Norberto Leite
 
Production deployment
Production deploymentProduction deployment
Production deploymentMongoDB
 
Managing Big Data with MySQL
Managing Big Data with MySQLManaging Big Data with MySQL
Managing Big Data with MySQLmwasaha mwagambo
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Jeremy Zawodny
 
Social Media Trends - Content Curation
Social Media Trends - Content CurationSocial Media Trends - Content Curation
Social Media Trends - Content CurationChris Mikulin
 
Sphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLSphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLNguyen Van Vuong
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitTyler Treat
 

Viewers also liked (20)

Webinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDBWebinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDB
 
Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Migrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at WordnikMigrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at Wordnik
 
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
 
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
 
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQLWebinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
 
Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011
 
Tayra
TayraTayra
Tayra
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
 
SphinxSearch
SphinxSearchSphinxSearch
SphinxSearch
 
MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016
 
Production deployment
Production deploymentProduction deployment
Production deployment
 
Managing Big Data with MySQL
Managing Big Data with MySQLManaging Big Data with MySQL
Managing Big Data with MySQL
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012
 
Social Media Trends - Content Curation
Social Media Trends - Content CurationSocial Media Trends - Content Curation
Social Media Trends - Content Curation
 
Sphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLSphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQL
 
Benchmark slideshow
Benchmark slideshowBenchmark slideshow
Benchmark slideshow
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profit
 

Similar to Lessons Learned Migrating 2+ Billion Documents at Craigslist

MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge ShareingPhilip Zhong
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewPierre Baillet
 
MongoDB Pros and Cons
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Consjohnrjenson
 
Why Wordnik went non-relational
Why Wordnik went non-relationalWhy Wordnik went non-relational
Why Wordnik went non-relationalTony Tam
 
Mongo db transcript
Mongo db transcriptMongo db transcript
Mongo db transcriptfoliba
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)emiltamas
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterChris Henry
 
MongoDB 2.4 and spring data
MongoDB 2.4 and spring dataMongoDB 2.4 and spring data
MongoDB 2.4 and spring dataJimmy Ray
 
Silicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDBSilicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDBManish Pandit
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxsarah david
 
Mdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchMdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchDaniel M. Farrell
 
From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)MongoSF
 
how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfsarah david
 

Similar to Lessons Learned Migrating 2+ Billion Documents at Craigslist (20)

MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge Shareing
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of view
 
MongoDB Pros and Cons
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Cons
 
Why Wordnik went non-relational
Why Wordnik went non-relationalWhy Wordnik went non-relational
Why Wordnik went non-relational
 
Hadoop bank
Hadoop bankHadoop bank
Hadoop bank
 
Look Ma! No more blobs
Look Ma! No more blobsLook Ma! No more blobs
Look Ma! No more blobs
 
Mongo db transcript
Mongo db transcriptMongo db transcript
Mongo db transcript
 
Open source Technology
Open source TechnologyOpen source Technology
Open source Technology
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb Cluster
 
MongoDB 2.4 and spring data
MongoDB 2.4 and spring dataMongoDB 2.4 and spring data
MongoDB 2.4 and spring data
 
Silicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDBSilicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDB
 
MongoDB
MongoDBMongoDB
MongoDB
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptx
 
Mdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchMdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_search
 
disertation
disertationdisertation
disertation
 
From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)
 
Whynosql
WhynosqlWhynosql
Whynosql
 
how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdf
 

Recently uploaded

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Recently uploaded (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Lessons Learned Migrating 2+ Billion Documents at Craigslist

  • 1. Lessons Learned from Migrating 2+ Billion Documents at Craigslist Jeremy Zawodny jzawodn@craigslist.org Jeremy@Zawodny.com http://blog.zawodny.com/
  • 2. Outline Recap last year’s MongoSV Talk The Archive, Why MongoDB, etc. http://www.10gen.com/video/mongosv2010/craigslist The Infrastructure The Lessons Wishlist Q&A
  • 3. Craigslist Numbers 2 data centers ~500 servers ~100 MySQL servers ~700 cities, worldwide ~1 billion hits/day ~1.5 million posts/day
  • 4. Archive: Where Data Goes To Die Live Numbers ~1.75M posts/day ~14 day avg. lifetime ~60 day retention ~100M posts We keep all postings Users reuse postings Daily archive migration Internal query tools
  • 5. Archive Pain Coupled Schemas Big Indexes Hardware Failures Replication Lag Poor Search Human Time Costs
  • 6. MongoDB Wins Scalable Fast Friendly Proven Pragmatic Approachable
  • 7. MongoDB Details Plan for 5 billion documents Average size: 2KB 3 Replica sets, 3 Servers each Deploy to 2 datacenters Same deployment in each datacenter Posting ID is sharding key
  • 8. MongoDB Architecture Typical Sharding with Replica Sets (external sphinx full-text indexers not pictured) config client client client client config config mongos mongos mongos shard001 shard003 shard002 replica set replica set replica set
  • 9. Lesson: Know Your Hardware MongoDB on blades really sucks Single 10k RPM disks can’t take it when data is noticeably larger than RAM Mongo operations can hit the client timeout (30 sec default) Even minutely cron jobs start to spew Lots of time wasted in development environment, trying different kernels, tuning, etc. Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons
  • 10. Lesson: Replica Sets Rock Lots of reboots happened during dev environment troubleshooting Each time, one of the remaining nodes took over No “reclone” no config file or DNS changes Stuff “just worked” while nodes bounced up and down
  • 11. Lesson: Know Your Data MongoDB is UTF-8 Some of our older data is decidedly NOT UTF-8 We have lots of sloppy encoding issues to clean up. But we had to clean them all up. Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh. This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you. Get your encoding house in order!
  • 12. Lesson: Know Your Data Size MongoDB has a doc size limits 4MB in 1.6.x, 16MB in 1.8.x What to do with outliers? In our case, trim off some useless data. But going from relational to document means this sort of problem is easy to have. One parent, many children. It’d be nice if this was easier to change, but clients have it hard-coded too. Compression would help, of course.
  • 13. Lesson: Know Your Data Types Field Types and Conversions can be expensive to do after the fact! MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789” http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod
  • 14. Data Types, continued “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.” Do you know how to do that in your language of choice? Some drivers may make a “guess” that gets it right most of the time.
  • 15. Lesson: Know SomeSharding The Balancer can be your frenemy Initial insert rate: 8,000/sec Later drops to 200/sec Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again Pre-split your data if possible http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
  • 16. Lesson: Know Some Replica Sets Replica Set re-sync requires index rebuilds on the secondary Most painful when a slave is down too long and can’t catch up using the oplog Typically during high write volumes In a large data set, the index rebuilding can take a couple of days w/out many indexes What if you lose another while that is happening?
  • 17. MongoDBWishlist Replica set node re-sync without out index rebuilding Record (or field) compression (not everyone uses a filesystem that offers compression) Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.) Hash-based sharding (coming soon?) Cluster snapshot/backup tool
  • 18. craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Front-end Engineering HTML, CSS, JavaScript, jQuery (Mobile too) Network Administration Routers, switches, load balancers, etc. Back-end Engineering Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc. Systems Administration Help keep all those systems running.
  • 19. craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Laid back, non-corporateenvironment Engineering driven culture Lots of interesting technical challenges Easy SF commute Excellent benefits and pay High-impact work Millions use craigslist daily