Scaling up to 30M users - The Wix Story
1. Scaling up to 30M users
Scaling Software, Scaling Data & Scaling People
The Wix Experience
Devcon TLV Feb 2013
Aviran Mordo
Server Group Manager
Wix
@aviranm
3. Wix in Numbers
• Wix was founded in 2006
• 30M registered users from most countries
• Over 1,000,000 new users every month
• ~1,000,000 new websites every month
• Over 150 TByte of user media files
– More than 1 billion user media files
– More than 1.5 TByte of files uploaded daily
• Over 300 Servers in 2+1 datacenters + Google + Amazon
4. Wix Initial Architecture
[Diagram: Wix (Tomcat) and MySQL DB]
• Tomcat, Hibernate, custom web framework
– Everything generated from HBM files
– Built for fast development
– Stateful login (Tomcat session), EHCache, file uploads
– Not considering performance, scalability, fast feature rollout, evaluate
– It reflected the fact that we didn’t really know what our business was
– We knew that we would need to replace it when we grew
– However, we failed to understand how difficult that can be!
HTML 5
Flash
2006 2007 2008 2009 2010 2011 2012 2013
5. Wix Initial Architecture
After two years, we have found out that
• Our initial architecture allowed us to progress very fast
• However, as we progressed, we slowed down
• So, we learned that
– Don’t worry about ‘building it right from the start’ – you won’t
– You are going to replace stuff you are building in the initial stages
– Be ready to do it
– Get it up to customers as fast as you can. Get feedback. Evolve.
– Our mistake was not planning for gradual re-write
– Build for gradual re-write as you learn the problems and find the right
solutions
6. Distributed Cache
Next we added EHCache as Hibernate 2nd-level cache
• Why?
– Because it was in the design
• How was it?
– Black Box cache
– How do we know the state of our system?
– How to invalidate the cache?
– When to invalidate it?
– How does “operations” manage the cache?
• Did we really need it? No!
• We eventually dropped it
7. Editor & Public Segments
• The Challenge - Updates to our Server imposed downtime on our customers’ websites
– Any Server or Database update had the potential to bring down all Wix sites
– This was a symptom of a larger issue
• The Server served two different concerns
– Wix Users editing websites
– Viewing Wix Sites, the sites created by the Wix editor
• The two concerns require different SLAs
– Wix Sites should never have downtime!
– Wix Sites should work as fast as possible, always!
– However, an editing system does not require this level of SLA
8. Editor & Public Segments
• The two concerns evolve independently
– Releases of editing features should have no impact on the operation of existing Wix sites!
• Our Solution
– Split the Server into two Segments – Public and Editor
[Diagram: Public (Tomcat) with Public DB; Editor (Tomcat) with Editor DB]
• The Public segment serves the websites of Wix users
– Has a mostly read-only usage pattern – only updated when a site is published
– Simple publishing system
– Simple and read-only means it is easier to have a higher SLA and DRP
– MySQL used as NoSQL – single large table with XML text fields
• The Editor segment
– Exposes the Wix Editing APIs, as well as user account and gallery management APIs
– Has a different release schedule from the Public segment
9. Editor & Public Segments
What we have learned
• MySQL is a damn good NoSQL engine
– Our public DB was (mainly) one huge table
– Queries & Updates are by primary key
– Instead of relations, we use text/xml or text/json columns
– No updates for Blobs – immutable data
– No Transactions
• Use an indirection table that points to the blob table
– Insert a new blob value, update the pointer to the new blob, then delete the old blob asynchronously
• MySQL auto-generated keys cause problems
– Locks on key generation
– Require a single instance to generate keys
• We use GUID keys
– Can be generated by any client
– No locks in key value generation
– Enabler for Master-Master replication
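The blob indirection and GUID-key ideas above can be sketched roughly as follows. This is a minimal illustration, not Wix's actual schema: the table names (`sites`, `blobs`), the function names, and the use of SQLite in place of MySQL (so the example is self-contained) are all assumptions.

```python
import sqlite3
import uuid


def open_store():
    # In-memory SQLite stands in for MySQL here (assumption, for a runnable sketch).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE blobs (blob_id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    # Indirection table: one row per site, pointing at the current blob.
    db.execute("CREATE TABLE sites (site_id TEXT PRIMARY KEY, blob_id TEXT NOT NULL)")
    return db


def publish(db, site_id, document):
    """Insert a new immutable blob, then repoint the site to it.

    Returns the previous blob id (or None) so it can be deleted later.
    """
    new_blob = str(uuid.uuid4())  # key generated by the client, no DB-side counter
    db.execute("INSERT INTO blobs (blob_id, body) VALUES (?, ?)", (new_blob, document))
    row = db.execute("SELECT blob_id FROM sites WHERE site_id = ?", (site_id,)).fetchone()
    old_blob = row[0] if row else None
    db.execute(
        "INSERT OR REPLACE INTO sites (site_id, blob_id) VALUES (?, ?)",
        (site_id, new_blob),
    )
    db.commit()
    return old_blob


def read_site(db, site_id):
    # Two lookups, both by primary key: pointer first, then the blob.
    row = db.execute("SELECT blob_id FROM sites WHERE site_id = ?", (site_id,)).fetchone()
    if row is None:
        return None
    blob = db.execute("SELECT body FROM blobs WHERE blob_id = ?", (row[0],)).fetchone()
    return blob[0] if blob else None


def delete_blob(db, blob_id):
    # In a real system this would run asynchronously, after the pointer update.
    db.execute("DELETE FROM blobs WHERE blob_id = ?", (blob_id,))
    db.commit()
```

Because a blob is never updated in place, a reader sees either the old document or the new one, and the stale blob can be cleaned up whenever convenient.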
10. Wix on Managed Hosting
• Co-Location
– Own and maintain your own hardware
– Provisioning == buy and deliver your new server
– Reliable software on reliable hardware
• Managed Hosting
– Lease both hardware and maintenance
– Overnight provisioning
– Reliable software on reliable hardware
• Cloud
– Instantly lease hardware
– Instant provisioning
– Unlimited resources
– Reliable software on unreliable hardware
11. Wix Media Segment
• The Challenge – Our static storage reached over 500 GByte of small files
– The “upload to app server, post-process files, copy to lighttpd server, serve by lighttpd” pattern proved inefficient, slow and error-prone
– Disk IO became slow and inefficient as the number of files increased
– We needed a solution that could grow with us in HTTP connections and in number of files
– We needed control over caching and HTTP headers
• We needed dynamic image manipulations
– Rebuilding a few million media files is not simple
12. Prospero – Wix Media Storage
• Our Solution
– Lighttpd based
– Sharded on the file name
– Two copies of each file
• Example: get 37D815B5.jpg
– Go to the servers for the range containing 37
– Fall back to the other copy if not found
• Ranges shown in the diagram: 00-1f, 20-3f, 40-5f, 60-7f
– Servers shown, reached over HTTP: 0.static to 7.static, two per range
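Sharding on the file name with a fallback copy might look roughly like the sketch below. The range-to-server mapping is read off the slide's diagram and is an assumption (in particular, the second range is taken to be 20-3f and each range is assumed to map to two servers); the function names and the in-memory `stores` dictionary standing in for HTTP requests are hypothetical.

```python
# Assumed mapping: first two hex characters of the file name select a range,
# and each range is served by two servers holding the two copies of a file.
RANGE_SERVERS = [
    (0x00, 0x1F, ("0.static", "1.static")),
    (0x20, 0x3F, ("2.static", "3.static")),
    (0x40, 0x5F, ("4.static", "5.static")),
    (0x60, 0x7F, ("6.static", "7.static")),
]


def servers_for(file_name):
    """Return the servers responsible for a file, primary first."""
    prefix = int(file_name[:2], 16)
    for low, high, servers in RANGE_SERVERS:
        if low <= prefix <= high:
            return servers
    raise LookupError(f"no range configured for prefix {file_name[:2]!r}")


def fetch(file_name, stores):
    """Try each responsible server in order; fall back if the file is missing.

    `stores` maps server name -> {file name: bytes} and stands in for an
    HTTP GET against that server.
    """
    for server in servers_for(file_name):
        data = stores.get(server, {}).get(file_name)
        if data is not None:
            return server, data
    return None
```

For example, `servers_for("37D815B5.jpg")` selects the 20-3f range because the name starts with `37`, and `fetch` only reaches the second server when the first copy is missing.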
13. Prospero – Wix Media Storage
• Dynamic Image processing
– Picture Pyramid
– Picture resize, crop and sharpen “on the fly”
– Thumbnail generation
• Eventual Consistency solutions scale
– But you have to build for when eventual consistency is not consistent
• Media files caching headers are critical
– max-age, ETag, If-Modified-Since, etc.
– Think about how to tune those parameters for media files to fit your specific needs
• We tried Amazon S3 and Google for secondary storage
– However, Amazon proved unreliable (connections, availability)
• We found that using a CDN in front of Prospero is very effective
• Initially, files were stored on the filesystem
• We added a Tokyo Tyrant backend for small files (marked T in the architecture diagram)
• We added a Memcached (Redis) layer for “in transit” files (marked M in the architecture diagram)
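To make the caching-header point above concrete, here is a minimal sketch of a media response that sets `Cache-Control` and `ETag` and answers a conditional request with 304. It is not Prospero's code: the function names, the content-hash ETag, and the default `max_age` value are assumptions, and the `If-None-Match` check is a simple exact match that ignores weak validators and comma-separated lists.

```python
import hashlib


def make_etag(body):
    # Strong ETag derived from the content, so it changes only when the bytes change.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body, request_headers, max_age=86400):
    """Return (status, headers, payload) for a media file request."""
    etag = make_etag(body)
    headers = {
        "ETag": etag,
        "Cache-Control": f"public, max-age={max_age}",
    }
    # Conditional request: the client already holds this exact version.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    return 200, headers, body
```

A client or CDN that revalidates with the stored ETag gets an empty 304 instead of the full file, which is where most of the bandwidth saving for media files comes from.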
14. Prospero – Wix Media Storage
• Our current architecture
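– get 37D815B5.jpg goes to the CDN first
– If not in CDN: Austin
– First fallback: Chicago
– Second fallback: Google Cloud Storage
[Diagram: the Austin and Chicago datacenters each show M (Memcached) and T (Tokyo Tyrant) layers, with server groups labeled x36, x36 and x32]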
15. CDN
• Use a CDN!
• CDN acts as a great connection manager
– We have CDN hit ratios of over 99.9%
• Use the “Cache Killer” pattern
– http://static.wix.com/client/css/viewer.css?v=327
– http://static.wix.com/client/1.3.2/css/viewer.css
– Makes flushing files from the CDN redundant
– Enabler for longer caching periods
• There are many vendors
– We started with 1 CDN vendor
– We are now working with two CDN vendors
– Different CDN vendors have advantages in different geographies
• Tune HTTP Headers per CDN Vendor
– CDN Vendors interpret HTTP headers differently
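The two “Cache Killer” URL forms shown above (a version query parameter and a version path segment) can be generated with a small helper like the one below. This is only a sketch; the function names are hypothetical and the version values are taken from the example URLs on the slide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_version_param(url, version):
    """Add or replace a `v` query parameter, e.g. viewer.css?v=327."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["v"] = str(version)
    return urlunsplit(parts._replace(query=urlencode(query)))


def with_version_path(base, version, asset_path):
    """Put the version in the path, e.g. /client/1.3.2/css/viewer.css."""
    return f"{base.rstrip('/')}/{version}/{asset_path.lstrip('/')}"
```

Either way, a new release produces a new URL, so the CDN never has to be told to drop the old file and the old URL can be cached for a long time.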
16. Development Velocity
• The Challenge – Our codebase became large and entangled
– Feature rollout became harder over time, requiring longer and longer manual
regression
– The longer the regression was, the harder it became to make “a good release”
– Strange full-table scans queries generated by Hibernate, which we still have no
idea what code is responsible for…
• The solution
– Mid 2010 – Wix Framework – modern base libraries
– Beginning 2011 – CI / CD / TDD techniques + DevOps culture
– Mid 2011 – Scala
– SOA Architecture (not WSDL)
[Timeline 2006-2013: Flash, HTML 5, Framework, Scala, CI / CD / TDD + DevOps]
17. People are the key
• Train the people you already have
– We sent our entire QA department to learn Java
– Developers learn TDD and CI/CD methodologies.
• Hiring the right people is key to success
– Hire only the best developers (only seniors)
– Don’t count only on the interview, you need to test actual coding
– Anyone who interviews can drop a candidate
– Hire people who will challenge you (no “yes men”)
– Get people you can trust with “root” access to production
• Never stop hiring
– If we find an excellent person we will create a position for them even if we do not have one open
• Wix is doubling its size every year
– Yes we are currently hiring.
– We are considering hiring and training junior developers
18. Wix’s CI / CD / TDD + DevOps model
• Abandon the “VERSION” paradigm – move to a feature-centric life
• Make small and frequent releases as soon as possible
– Today we release about 10 times a day, gaining velocity
• Empower the developer
– The developer is responsible from product idea to 100,000 active users
– Remove every obstacle in the developer’s path
– Big cultural change from waterfall – affects the whole company
– The developer is responsible for the operation of their own app
• Automate everything – CI/CD/TDD
– CI – Continuous Integration
– CD – Continuous Delivery / Deployment
– TDD – Automated unit-tests, integration tests, GUI tests
• Measure Everything (The lean startup way)
– A/B test every new feature
– Monitor real KPIs (business, not CPU)
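One common way to A/B test a feature, sketched here only as an illustration, is to assign each user to a variant deterministically by hashing the user id together with the experiment name. The slides do not say how Wix assigns users; the function name, the SHA-256 hashing and the percentage split are all assumptions.

```python
import hashlib


def variant(user_id, experiment, percent_b=50):
    """Deterministically assign a user to variant "A" or "B".

    The same (user_id, experiment) pair always lands in the same bucket,
    and different experiments split users independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "B" if bucket < percent_b else "A"
```

Because the assignment depends only on its inputs, no per-user state has to be stored, and business KPIs can later be compared between the two groups.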
19. CI / CD @ Wix – Release Process
• Make an RC
– Runs build, unit-tests, integration tests
20. CI / CD @ Wix – Release Process
• Deploy as GA
– Using Chef, Noah, Artifactory
– Runs Self-Tests
21. CI / CD @ Wix – Release Process
• Monitor
– Deployment, NewRelic, App-Info, Recent Events
• Rollback
22. Products we’ve built (partial list)
• Wix Mobile
• Wix HTML5
– Full HTML 5 support – total rewrite of our Flash product
• Third Party Applications (TPAs)
– With over 200,000 installations in the first 3 months
• Answers
– Wix’s unique support system
• Wix Billing System (PCI Compliant)
– Support complex business models for TPAs
– Support diverse geo eCommerce
• eCommerce
– Based on Magento
• BI
[Timeline 2006-2013: Flash, HTML 5, Mobile, Answers, App Builder, TPA, Billing]