2. Flipkart in recent times
● Leading eCommerce player in India
○ 10M page visits, 2M shipments a day
○ 30 million products across more than 70 categories
○ Big Billion Days ($300M sales, top-ranked app on Google Play Store)
○ Ping - social collab in eCommerce
○ Progressive web app - native-like experience in the browser (debuted at #chromedevsummit 2015)
4. Data at Flipkart - Data Platform
● 6 TB new data ingested daily
○ 30 TB on sale days
● 1100 Raw Streams
● 3 Billion Raw events in a day
● 0.6 PB data processed daily
● 10K Hadoop jobs daily
● 3000 Report views daily
6. Data migration needs, challenges
● User path systems
○ Minimize downtime. Site & app downtime is visible to users
■ Data - mostly eventually consistent
○ Session Data - Avoid User logout, Service scale : 250K RPS
○ Promise Data (Stock, Serviceability) - Avoid out-of-stock (OOS) and over-booking (consistency matters)
○ Live Orders - Accept orders, Let customers checkout & pay (consistency matters)
○ User Accounts - No data loss. Change velocity is low
● Order path systems
○ Availability is not a constraint; throughput and data durability are
■ Data - strong durability, consistent
○ Current orders being fulfilled
○ Warehouse stock inwarding, movement
7. Data migration needs, challenges
● User Insights
○ Inter DC data bandwidth limitations (1Gbps shared link)
○ 130 TB (Snappy-compressed) data in HBase; derived data (Insights) is much smaller, though
● MySQL instance footprint - 600+
● Flipkart Data Platform
○ Data publishers/consumers not moving together
■ Data consumers could move earlier than the publishers, vice-versa
○ Migrating a couple of PB of data over the network is not feasible
○ Consistency for raw, prepared and reporting data
8. Migration planning and execution
● Most useful tool - Google spreadsheets & docs!
○ Inventory of systems in each business cluster - split by service, backing data store
○ Defined data migration recipes and SME group for each data store type
■ Advised on IaaS constructs (instance types), PaaS integration (service discovery), data migration strategy (export vs. live replication); built tooling
○ Create cutover sequence and interdependencies
■ e.g. Catalog → Search → Cart/CO → Mobile apps
○ Wrote a playbook for each cutover activity, including checklists and verification of data export/restore
● Program-managed a plan that touched 1000+ systems and much of the 1000-member engineering org
10. Hacks, Tools and Utilities
● “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway” -- Andrew S. Tanenbaum (Computer Networks, 4th ed., p. 91)
○ We used disks instead to move User Insights data (stored in HBase)
■ Moved snapshots of derived/computed data over the wire (relatively small)
■ Avoided HBase export; instead transferred HFiles onto disks using a custom ‘distcp’-like tool that knapsack'ed ~40K files into 6 disks (see the packing sketch after this list). Open sourced as: https://github.com/flipkart-incubator/blueshift
■ Disks shipped to the new DC
■ Transferred HFiles into HDFS using Blueshift
■ Imported HFiles into HBase using HBase Bulk Load
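The "knapsack"-style placement mentioned above can be pictured as a largest-first greedy packing of HFiles onto a fixed set of disks. The sketch below is purely illustrative and is not the Blueshift tool linked above; the disk capacity, file names and sizes are invented.

```java
import java.util.*;

// Illustrative sketch (not the Blueshift implementation): largest-first greedy
// packing of HFile sizes onto a fixed number of disks, the "knapsack"-style
// placement the slide refers to. Capacity and file sizes are hypothetical.
public class HFileDiskPacker {

    static final long DISK_CAPACITY_BYTES = 8L * 1024 * 1024 * 1024 * 1024; // assume 8 TB disks

    public static List<List<String>> pack(Map<String, Long> fileSizes, int diskCount) {
        // Sort files by size, largest first, so big HFiles are placed before small ones.
        List<Map.Entry<String, Long>> files = new ArrayList<>(fileSizes.entrySet());
        files.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        long[] used = new long[diskCount];
        List<List<String>> disks = new ArrayList<>();
        for (int i = 0; i < diskCount; i++) disks.add(new ArrayList<>());

        for (Map.Entry<String, Long> f : files) {
            // Place each file on the least-loaded disk that still has room for it.
            int best = -1;
            for (int d = 0; d < diskCount; d++) {
                if (used[d] + f.getValue() <= DISK_CAPACITY_BYTES
                        && (best == -1 || used[d] < used[best])) {
                    best = d;
                }
            }
            if (best == -1) throw new IllegalStateException("File does not fit: " + f.getKey());
            disks.get(best).add(f.getKey());
            used[best] += f.getValue();
        }
        return disks;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("hfile-0001", 2_000_000_000L); // hypothetical HFile sizes
        sizes.put("hfile-0002", 7_500_000_000L);
        sizes.put("hfile-0003", 1_200_000_000L);
        System.out.println(pack(sizes, 6));
    }
}
```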
11. Migrating live User sessions - dual write
● Cold data in HBase (9 TB compressed), hot data in Memcached (1 TB)
● Live read-writes on Memcached, async batched writes to HBase
● Migration via dual writes (sketched after this list)
○ Fresh Memcached cluster in new DC
○ Added this cluster as another batched write destination in old DC
○ Data move initiated 21 days before the actual cutover to allow for catch-up
○ HBase data was exported using standard snapshotting and periodic incremental CopyTable
○ Batch interval reduced from 10 minutes to 1 minute during cutover for aggressive copying
● No user logout or session loss after cutover
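A minimal sketch of the dual-write arrangement described above, under assumed interfaces: live reads and writes hit the hot store (Memcached in the slide, stood in for here by an in-memory map), while session mutations are also buffered and flushed on a batch interval to every registered sink, old-DC and new-DC cold stores alike.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch only: a hypothetical BatchSink abstraction stands in for the old-DC HBase
// store and the new-DC cluster added as an extra batched write destination.
public class DualWriteSessionStore {

    /** Hypothetical abstraction over a batched write destination. */
    interface BatchSink {
        void writeBatch(Map<String, byte[]> sessions);
    }

    private final Map<String, byte[]> hotCache = new ConcurrentHashMap<>(); // stands in for Memcached
    private final Map<String, byte[]> pending = new ConcurrentHashMap<>();  // buffer for async writes
    private final List<BatchSink> sinks;

    DualWriteSessionStore(List<BatchSink> sinks, long batchIntervalSeconds) {
        this.sinks = sinks;
        // The batch interval is a parameter: the slide mentions dropping it from
        // 10 minutes to 1 minute during cutover for a more aggressive copy.
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::flush, batchIntervalSeconds,
                                      batchIntervalSeconds, TimeUnit.SECONDS);
    }

    void put(String sessionId, byte[] session) {
        hotCache.put(sessionId, session);  // synchronous hot write serving live traffic
        pending.put(sessionId, session);   // queued for the async, batched dual write
    }

    byte[] get(String sessionId) {
        return hotCache.get(sessionId);
    }

    private void flush() {
        if (pending.isEmpty()) return;
        Map<String, byte[]> batch = new HashMap<>(pending);
        for (Map.Entry<String, byte[]> e : batch.entrySet()) {
            pending.remove(e.getKey(), e.getValue()); // keep entries updated since the copy
        }
        for (BatchSink sink : sinks) {
            sink.writeBatch(batch);        // old DC and new DC each receive the same batch
        }
    }
}
```

In this shape, adding the fresh new-DC cluster is just one more BatchSink, and tightening the batch interval during cutover is a constructor argument.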
12. Migrating Product catalog data
● Data modelled as Entities & Relationships : clients have “Views” of this data
● Views expressed as JSON DSL
● Raw data exported from HBase and Elasticsearch and copied to the new DC
● Required a solution that could migrate updates after initial move
○ Developed a JSON diff library that could work over 100 million views (usage sketched after this list). Open sourced: https://github.com/flipkart-incubator/zjsonpatch
■ Diffs are applied in order - important for DC move
○ Bandwidth consumption for applying updates dropped from 800 Mbps to 13-14 Mbps
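A small usage sketch of the diff-and-patch flow, assuming zjsonpatch's JsonDiff.asJson / JsonPatch.apply API over Jackson JsonNodes; the catalog "view" content below is invented. The idea: the source DC computes a JSON Patch per updated view and ships only the (much smaller) patch, and the target DC applies patches strictly in order on the previously copied views.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.flipkart.zjsonpatch.JsonDiff;
import com.flipkart.zjsonpatch.JsonPatch;

public class ViewDiffRelay {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Hypothetical before/after versions of a catalog "view" document.
        JsonNode oldView = mapper.readTree("{\"sku\":\"ABC\",\"price\":499,\"inStock\":true}");
        JsonNode newView = mapper.readTree("{\"sku\":\"ABC\",\"price\":479,\"inStock\":true}");

        // Source DC: compute the patch (a list of add/remove/replace operations).
        JsonNode patch = JsonDiff.asJson(oldView, newView);
        System.out.println("patch to ship: " + patch);

        // Target DC: apply patches in order on the copy made during the initial move.
        JsonNode updatedView = JsonPatch.apply(patch, oldView);
        System.out.println("updated view: " + updatedView);
    }
}
```

Shipping patches rather than whole views is what drops the bandwidth from hundreds of Mbps to the 13-14 Mbps mentioned above.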
14. Application Relay bridge over Kafka queues
● RESTBus: orchestrates all Order fulfillment systems
● Pattern: locally committed messages in MySQL, relayed over Kafka to HTTP endpoints (sketched after this list)
● Bridge over 2 DCs, with destinations resolved from ELB endpoints
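The relay pattern on this slide resembles a transactional-outbox relay. The sketch below is not RESTBus itself, and the table name, topic, DSN and broker address are assumptions; it only shows the shape: a message row committed locally in MySQL alongside the state change, then polled in commit order and published to Kafka for delivery to an HTTP endpoint resolved from the destination DC's load balancer.

```java
import java.sql.*;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of a "locally committed messages relayed over Kafka" poller (not RESTBus).
public class OutboxRelay {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/orders", "app", "secret"); // hypothetical DSN

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-new-dc:9092");            // hypothetical brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Poll locally committed, not-yet-relayed messages in commit order.
            try (Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT id, payload FROM outbox_messages WHERE relayed = 0 ORDER BY id")) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    String payload = rs.getString("payload");
                    // Relay over Kafka; a downstream consumer would POST this to the
                    // HTTP endpoint resolved from the old- or new-DC ELB.
                    producer.send(new ProducerRecord<>("restbus.events", String.valueOf(id), payload));
                    try (PreparedStatement mark = db.prepareStatement(
                            "UPDATE outbox_messages SET relayed = 1 WHERE id = ?")) {
                        mark.setLong(1, id);
                        mark.executeUpdate();
                    }
                }
            }
        }
    }
}
```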
15. 2-way sync of Kafka streams
● [Diagram: mirroring bridge between the Old DC and New DC Kafka clusters, syncing streams in both directions]
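The diagram suggests a MirrorMaker-style bridge in each direction. As a rough illustration only (broker addresses, the topic name and the "mirror." prefix are assumptions, and a real deployment would more likely run Kafka MirrorMaker), a single-direction mirroring loop looks like the sketch below; running one instance per direction gives the 2-way sync, with the prefix preventing records from being mirrored back and forth in a loop.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Simplified one-direction mirror: consume from the source cluster, produce to the
// destination cluster under a prefixed topic name so the reverse mirror skips it.
public class SimpleMirror {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "kafka-old-dc:9092");   // source cluster (hypothetical)
        cProps.put("group.id", "dc-mirror");
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "kafka-new-dc:9092");   // destination cluster (hypothetical)
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("events"));               // mirror only source-local topics
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    // Prefix mirrored topics so the opposite-direction mirror does not copy them back.
                    producer.send(new ProducerRecord<>("mirror." + r.topic(), r.key(), r.value()));
                }
                consumer.commitSync();
            }
        }
    }
}
```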
16. Copying data across clusters
● Only copied raw data (~200 TB compressed)
○ All prepared and reporting data is regenerated from raw data
● Verification utilities to check data correctness in both clusters (one such check is sketched below)
● Ran the full data platform stack in both places for over 2 weeks, until all data publishers and consumers had moved
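One plausible shape for such a verification utility, assuming both clusters expose HDFS and the copied raw data sits at the same path in each; the cluster URIs and data path are assumptions. It compares file sizes and HDFS file checksums, which are only directly comparable when both clusters use matching block and checksum settings; real checks would likely also compare record counts of the regenerated prepared data.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch: walk the raw-data tree in the old cluster and verify each file in the new cluster.
public class RawDataVerifier {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem oldFs = FileSystem.get(URI.create("hdfs://old-dc-nn:8020"), conf);
        FileSystem newFs = FileSystem.get(URI.create("hdfs://new-dc-nn:8020"), conf);
        Path root = new Path("/data/raw");                        // hypothetical raw-data root

        RemoteIterator<LocatedFileStatus> it = oldFs.listFiles(root, true);
        while (it.hasNext()) {
            LocatedFileStatus src = it.next();
            Path samePath = new Path(src.getPath().toUri().getPath()); // same absolute path on new cluster
            if (!newFs.exists(samePath)) {
                System.out.println("MISSING in new DC: " + samePath);
                continue;
            }
            long srcLen = src.getLen();
            long dstLen = newFs.getFileStatus(samePath).getLen();
            FileChecksum a = oldFs.getFileChecksum(src.getPath());
            FileChecksum b = newFs.getFileChecksum(samePath);
            boolean same = srcLen == dstLen && a != null && a.equals(b);
            System.out.println((same ? "OK        " : "MISMATCH  ") + samePath);
        }
    }
}
```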