Democratization of Data @Indix


The story of why and how we built the internal data pipeline platform at Indix and how it has changed the way we access our own data.

Published in: Technology

  1. 1. Democratization of Data: Why and how we built an internal data pipeline platform @Indix
  2. 2. About me: Manoj Mahalingam, Principal Engineer @Indix
  3. 3. Six Business Critical Indexes: People, Documents, Businesses, Places, Products, Connected Devices
  4. 4. Indix - The “Google Maps” of Products ● Google Maps: enabling businesses to build location-aware software. ~3.6 million websites use Google Maps. ● Indix: enabling businesses to build product-aware software. Indix catalogs over 2.1 billion product offers.
  5. 5. Data Pipeline @Indix (diagram) ● Crawling Pipeline: Seed, Crawl, Parse. Input: brand & retailer websites. Output: crawl data. ● Feeds Pipeline: Transform, Clean, Connect. Input: brand & retailer feeds. Output: feed data. ● Data Pipeline (ML): Dedupe, Classify, Extract Attributes, Standardize, Match, Aggregate. ● Indexing Pipeline: Analyze, Derive, Join. Produces the Real Time Index and the Search & Analytics Index. ● Outputs: Indix Product Catalog, Customizable Feeds, API (Bulk & Synchronous), Product Data Transformation Service.
  6. 6. Democratization of Data: Enable everyone in the organization to know what data is available, and then understand and work with it.
  7. 7. At Indix, we have and work with a lot of data.
  8. 8. Scale of Data @ Indix: 2.1 billion product URLs, 8 TB of HTML data crawled daily, 1 billion unique products, 7,000 categories, 120 billion price points, 3,000 sites
  9. 9. ● We have data in different shapes and sizes. ● HTML pages, Thrift and Avro records. ● And also the usual suspects - CSVs and plain text data.
  10. 10. ● Datasets can be in TBs or a few hundred KBs. ● A few billion records or a couple of hundred.
  11. 11. But...the data’s potential couldn’t be realized
  12. 12. Data wasn’t discoverable ● The biggest problem was in knowing what data exists and where. ● Some of the data was in S3. Some in HDFS. Some in Google Sheets. ● There was no way to know when or how frequently the data was updated.
  13. 13. The schema wasn’t readily known ● The schema of the data, as expected, kept changing and it was difficult to keep track of which version of data had which schema. ● While Thrift and Avro alleviate this to an extent, access to data wasn’t simple, especially for non-engineers.
  14. 14. Writing code limited scope ● We use Scalding and Spark for our MR jobs. Having to code and tweak the jobs limited who could write and run these jobs. ● “Readymade” jobs may not allow the tweaks people need, affecting productivity and increasing dependencies. ● Having to write code and ship jars hinders ad hoc data experimentation.
  15. 15. Cost control wasn’t trivial ● While data came in various sizes and shapes, what people did with the data also varied - some use cases needed a sample of the data, while others wanted aggregations on the entire data. ● It wasn’t trivial to handle all the different workloads while minimizing costs. ● There was also the problem of ad hoc jobs starving production jobs in our existing Hadoop clusters.
  16. 16. Goals of Internal Data Pipeline Platform ● Enable easy discovery of data. ● Allow schema to be transparent and easy to create while also allowing introspection. ● Minimal coding - have prebuilt transformations for common tasks and enable a SQL-based workflow.
  17. 17. Goals of Internal Data Pipeline Platform ● UI and wizard-based workflow to enable ANYONE in the organization to run pipelines and extract data. ● Manage underlying clusters and resources transparently while optimizing for costs. ● Support data experimentation and also production / customer use cases.
  18. 18. MDA - Marketplace of Datasets and Algorithms
  19. 19. Tech Stack
  20. 20. MDA - DEMO!!!
  21. 21. MDA with our Data Pipeline (diagram) ● Pipeline stages: Dedup, Classify, Brand, Attributes, Match
  22. 22. MDA with our Data Pipeline (diagram) ● Pipeline stages: Dedup, Classify, Brand, Attributes, Match ● Enrich Data: feed data from customer, Classify, Brand, feed output to customer
  23. 23. MDA for ML Training Data (diagram) ● Stages: Filter, Sample, Preprocess, Training Data
  24. 24. Notebooks

      // Set up the MDA client
      import com.indix.holonet.core.client.SDKClient

      val host = ""
      val port = 80
      val client = SDKClient(host, port, spark)

      // Create a DataFrame from any MDA dataset
      val df = client.toDF("Indix", "PriceHistoryProfile")

  25. 25. Timeline ● Dec 2015: Start work on MDA. ● Mar 2016: First release. ● Jul 2016: Lot more transforms including sampling, full Hive SQL support and UX fixes. ● Late 2016: Performance improvements, Spark and infra upgrades. ● Early 2017: Completely redesign the UI based on over a year of feedback and learnings. GraphQL for the UI. ● June 2017: Ability to run pipelines in customer’s cloud infra. ● Aug 2017: First closed preview of MDA for a customer.
  26. 26. What does the future hold? ● We are far from done - things like automatic schema inference and better caching are already planned. ● And, as in the original vision, make it fully self-serve for our customers (internal and external). ● Integration with other tools out there like Superset. ● Open source as much as possible. First cut -
  27. 27. Questions? I blog at Twitter and most other platforms @manojlds
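
The discovery and schema problems described on slides 12 and 13 (not knowing what data exists, where it lives, when it was updated, or which version has which schema) can be illustrated with a small catalog model. This is a minimal sketch in plain Scala, not MDA's actual design: the names `Field`, `DatasetVersion`, `Dataset`, `Catalog`, and the example paths are all hypothetical and do not come from the slides.

```scala
import java.time.LocalDate

// Hypothetical catalog model for illustration; not taken from MDA.
final case class Field(name: String, dataType: String)

final case class DatasetVersion(
  version: Int,
  location: String,      // where the data lives, e.g. an S3 or HDFS path
  updatedOn: LocalDate,  // when this version was written
  schema: List[Field]
)

final case class Dataset(name: String, versions: List[DatasetVersion]) {
  // Highest-numbered version, if any version has been registered.
  def latest: Option[DatasetVersion] = versions.sortBy(_.version).lastOption
}

object Catalog {
  // Case-insensitive search over dataset names and schema field names.
  def search(datasets: List[Dataset], term: String): List[Dataset] = {
    val t = term.toLowerCase
    datasets.filter { d =>
      d.name.toLowerCase.contains(t) ||
      d.versions.exists(v => v.schema.exists(f => f.name.toLowerCase.contains(t)))
    }
  }

  // Fields that appear in `to` but not in `from`.
  def addedFields(from: DatasetVersion, to: DatasetVersion): List[Field] =
    to.schema.filterNot(f => from.schema.contains(f))
}

object CatalogDemo extends App {
  // Example data; bucket names and paths are made up.
  val v1 = DatasetVersion(1, "s3://example-bucket/prices/v1", LocalDate.of(2017, 1, 1),
    List(Field("productId", "string"), Field("price", "double")))
  val v2 = v1.copy(
    version = 2,
    location = "s3://example-bucket/prices/v2",
    updatedOn = LocalDate.of(2017, 2, 1),
    schema = v1.schema :+ Field("currency", "string"))

  val prices = Dataset("prices", List(v1, v2))
  val stores = Dataset("stores", List(
    DatasetVersion(1, "hdfs:///example/stores", LocalDate.of(2017, 1, 15),
      List(Field("storeId", "string")))))

  // Find datasets that mention "currency" anywhere in their schemas.
  println(Catalog.search(List(prices, stores), "currency").map(_.name))
  // Show what changed in the schema between two versions.
  println(Catalog.addedFields(v1, v2))
}
```

Keeping location, update date, and schema together per version is one way to answer "what exists, where, and in which shape" from a single lookup; a real platform would presumably back this with a persistent store rather than in-memory lists.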