DoneDeal - Data Platform
April 2016
Martin Peters (martin@donedeal.ie / @martinbpeters)
DoneDeal Analytics Team Manager
If you don’t understand the details of your business you are
going to fail.
If we can keep our competitors focused on us while we stay
focused on the customer, ultimately we’ll turn out all right.
- Jeff Bezos, Amazon
What do these companies have in common?
Data is …
With the right set of information, you can make business decisions with higher
levels of confidence, as you can audit and attribute the data you used for the
decision-making process.
- Krish Krishnan, 2014
one of our biggest assets.
Business Intelligence 101
For small companies the gap is often filled with custom ad hoc solutions with limited and rather static reporting capability.
What and why BI?
As a company grows, the Availability, Accuracy and Accessibility requirements of data increase.
Some terminology: ETL process
Extraction:
Extracts data from homogeneous or heterogeneous data sources.
Transformation:
Process, blend, merge and conform the data.
Loading:
Store in the proper format or structure for the purposes of querying and analysis.
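The three steps above can be sketched in a few lines. This is only an illustration, assuming two hypothetical in-memory sources (a CSV export and a JSON event list) and a plain Python list standing in for the warehouse table; none of these names or values come from the deck.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two heterogeneous sources.
CSV_SOURCE = "ad_id,price\n1,100\n2,250\n"
JSON_SOURCE = '[{"ad_id": 1, "views": 30}, {"ad_id": 2, "views": 12}]'


def extract():
    """Extraction: read each source into plain Python records."""
    ads = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    views = json.loads(JSON_SOURCE)
    return ads, views


def transform(ads, views):
    """Transformation: conform types and merge the two sources on ad_id."""
    views_by_ad = {v["ad_id"]: v["views"] for v in views}
    return [
        {
            "ad_id": int(a["ad_id"]),
            "price": int(a["price"]),
            "views": views_by_ad.get(int(a["ad_id"]), 0),
        }
        for a in ads
    ]


def load(rows, table):
    """Loading: append rows to the structure that will be queried."""
    table.extend(rows)
    return table


warehouse_table = load(transform(*extract()), [])
print(warehouse_table)
```

Keeping each step a separate function means a new source only touches `extract` and `transform`, while `load` stays unchanged.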
April 2015 - April 2016
Timeline: 2014-2017
[Timeline figure spanning 2014, 2015, 2016 and 2017. Labels, in order: Silo'd Data; Manual/Error Prone Blending; Value of BI/Data not understood; Platform Design; Implementation; Storage Layer; Batch Layer; Traditional BI; Serving Layer; Speed Layer; Real Time Analytics.]
Business Goals & Objectives
1. Build a future proof data analytics platform that will scale with the company over the next 5 years.
2. Take ownership of our data. Collect more data.
3. Replace existing reporting tool.
4. Provide a holistic view of our users (buyers and sellers), ads, products.
5. Use our data in a smarter manner and provide recommendations in a timely fashion.
Apollo Team
Data Engineer
Data Analyst
Architect
DevOps
BI Consultants
Solution Architect
• Analytics Platform that includes Event Streaming, Data Consolidation, Cleansing & Warehousing, Data Visualisation, Business Intelligence and Data Product Delivery.
• Apollo brings agility and flexibility to our data model; data ownership is key and allows us to blend data more conveniently.
Apollo Principles
Project Principles:
1. System must scale but costs grow more slowly
2. Occam’s Razor
3. Analytics and core platforms are independent
4. Monitoring of platform is key
5. Low maintenance
Data Principles:
1. Accurate, Available, Accessible
2. Ownership - Business & Technical
3. Standardised across teams
4. Integrity
5. Identifiable - primary source and globally unique identifier
Apollo Architectural Principles
www.slideshare.net/AmazonWebServices/big-data-architectural-patterns-and-best-practices-on-aws
• Decoupled “data bus”
• Use the right tool/service for the job
➡ Data structure, latency, throughput, access patterns
• Use Lambda architecture ideas
➡ Immutable (append-only), batch, [speed, serving] layers
• Leverage AWS Managed Services
➡ Scalable/elastic, available, reliable, secure, no/low admin
• Big data != Big Cost
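The Lambda architecture ideas listed above (an immutable append-only log, a batch layer, a speed layer and a serving layer) can be illustrated with a toy example. This is a sketch in plain Python, not the Apollo implementation; the event shape, the `cutoff` value and all function names are assumptions made for illustration.

```python
from collections import Counter

# Immutable, append-only master dataset: events are only ever added.
master_log = []


def append_event(event):
    """Record an event; existing entries are never updated or deleted."""
    master_log.append(event)


def batch_view(log, up_to):
    """Batch layer: recompute counts from scratch over events before the cutoff."""
    return Counter(e["type"] for e in log if e["ts"] < up_to)


def speed_view(log, since):
    """Speed layer: count only recent events the last batch run has not seen."""
    return Counter(e["type"] for e in log if e["ts"] >= since)


def serve(batch, speed):
    """Serving layer: answer queries by merging the batch and speed views."""
    return batch + speed


for ts, kind in [(1, "ad_view"), (2, "ad_view"), (3, "search"), (5, "ad_view")]:
    append_event({"ts": ts, "type": kind})

cutoff = 4  # hypothetical time of the last batch run
totals = serve(batch_view(master_log, cutoff), speed_view(master_log, cutoff))
print(dict(totals))
```

Because the log is never mutated, the batch view can always be rebuilt from scratch, which is what makes errors in derived data recoverable.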
Tools/Services in Production
[Architecture diagram. Surviving labels: Data Science, Business Users.]
ETL Architecture: Custom Build Pipeline
[Diagram: E, T and L stages, each labelled with a Summary.]
ETL: Control over complex dependencies
• Allows control of ETL pipelines with complex dependencies
• Easy plug-in of new datasource
• Orchestration with Data Pipeline and Common Status or Summary Files
• Idempotent Pipeline
• Historical data extracted as simulated stream
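One way to read "idempotent pipeline" together with "status or summary files" is that each step writes a small summary when it finishes and is skipped if that summary already exists. The sketch below illustrates that idea only; the file naming, the summary fields and the `run_step` helper are hypothetical and are not taken from the deck.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, day, work, out_dir):
    """Run one pipeline step for one day unless its summary file already exists."""
    summary_path = Path(out_dir) / f"{name}_{day}.summary.json"
    if summary_path.exists():
        # Already done: re-running is a no-op, which makes the step idempotent.
        return json.loads(summary_path.read_text())
    summary = {"step": name, "day": day, "rows": work(day), "status": "OK"}
    summary_path.write_text(json.dumps(summary))
    return summary


calls = []


def fake_extract(day):
    """Hypothetical unit of work; records each real execution."""
    calls.append(day)
    return 42


with tempfile.TemporaryDirectory() as tmp:
    first = run_step("extract", "2016-04-01", fake_extract, tmp)
    second = run_step("extract", "2016-04-01", fake_extract, tmp)

print(first == second, calls)
```

A downstream step could check for the upstream summary before starting, which is one simple way to express dependencies between steps.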
ETL: By the numbers
• Extraction
- 4000 days processed
- 7 different data sources
- 14 domains
- 13 event types
• Orchestration
- 1200 processing days
- 4GB/day
- 3 Environments
- 15 data pipelines
• Data Lake
- 11M events streamed/day
- 3 million files
- 3 TB of data stored over 7 buckets
• Redshift
- 7B records in production
- 6 Schemas (core and aggregate)
- 86 Tables in core schema
Kinesis Streams
• 1 Stream with 4 Shards
• Data retention of 24hrs
• KCL on EC2 writes data to S3 ready for Spark
• Max size of 1MB data blob
• 1,000 records/sec per shard write
• 5 transactions/sec read or 2MB/sec
• Server side API Logging from 7 application servers using Log4JAppender
• Event Buffering at source [in progress]
[Diagram labels: Put records, Requests]
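The per-shard limits above give a quick sizing check. The arithmetic below uses the figures quoted in this deck (4 shards, 1,000 records/sec per shard on write, 2MB/sec per shard on read, 11M events streamed per day). It assumes all events pass through the one stream and are spread evenly over the day; real traffic will have peaks well above the average.

```python
SHARDS = 4
WRITE_RECORDS_PER_SEC_PER_SHARD = 1_000
READ_MB_PER_SEC_PER_SHARD = 2
EVENTS_PER_DAY = 11_000_000
SECONDS_PER_DAY = 24 * 60 * 60

write_capacity = SHARDS * WRITE_RECORDS_PER_SEC_PER_SHARD  # records/sec
read_capacity_mb = SHARDS * READ_MB_PER_SEC_PER_SHARD  # MB/sec
average_rate = EVENTS_PER_DAY / SECONDS_PER_DAY  # events/sec, daily average
utilisation = average_rate / write_capacity  # fraction of write capacity used

print(f"write capacity: {write_capacity} records/sec")
print(f"read capacity: {read_capacity_mb} MB/sec")
print(f"average rate: {average_rate:.1f} events/sec")
print(f"average write utilisation: {utilisation:.1%}")
```

On these assumptions the average load is a small fraction of the write capacity, so the headroom is there for peaks rather than for the daily mean.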
S3
• Simple Storage Service provides secure, highly-scalable, durable cloud storage
• Native support for Spark, Hive
S3
• A strongly defined naming convention
• YYYY/MM/DD prefix used
• Avro format used for OLTP data, JSON otherwise
- probably the right choice (schema evolution), although we haven’t taken advantage of it yet.
• Allows easy retrieval of data from a particular time period
• Easy to maintain and browse
• Handling of summaries from E, T & L steps
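A date-based prefix makes it cheap to address one day or a range of days. The sketch below shows one possible key layout; only the YYYY/MM/DD part comes from the slide, while the leading `source` segment, the file name and both helper functions are assumptions for illustration.

```python
from datetime import date, timedelta


def object_prefix(source, day):
    """Build the daily prefix, e.g. 'events/2016/04/05/' (source segment is assumed)."""
    return f"{source}/{day.strftime('%Y/%m/%d')}/"


def object_key(source, day, filename):
    """Build a full object key under the daily prefix."""
    return object_prefix(source, day) + filename


def prefixes_for_range(source, start, end):
    """List the daily prefixes covering start..end inclusive."""
    return [
        object_prefix(source, start + timedelta(days=i))
        for i in range((end - start).days + 1)
    ]


key = object_key("events", date(2016, 4, 5), "part-0001.json")
prefixes = prefixes_for_range("events", date(2016, 3, 31), date(2016, 4, 1))
print(key)
print(prefixes)
```

Zero-padded year/month/day segments sort lexicographically in date order, so listing a prefix returns a time period's objects together.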
Spark on EMR
• AWS’s managed Hadoop framework that can interact with data from S3, DynamoDB, etc.
• Apache Spark - Fast, general purpose engine for large-scale in-memory data processing. Runs on Hadoop/EMR and can read from S3.
• PySpark + SparkSQL was the focus in Apollo.
• Streaming and ML will be the focus in the months ahead.
Spark on EMR
• Spark is easy; performant Spark code is hard and time consuming
• DataFrame API exclusively
• Developing Spark applications in a local environment with a limited-size dataset differs significantly from running Spark on EMR (e.g. joins, unions etc.)
• Don’t pre-optimize
• Naive joins to be avoided
• Spark UI is invaluable to test performance (both locally and on EMR) and to understand the underlying mechanism of Spark
• Some scaling of Spark on EMR, settled on memory optimised instances r3.2xlarge (8 vCPUs, 61GB RAM).
Data Pipeline + Simple Notification Service
• Data Pipeline is a service to reliably process and move data between AWS applications (e.g. S3, EMR, DynamoDB)
• Pipelines run on schedule and alarms are issued with Simple Notification Service (SNS)
• EMR/Spark used for compute and EC2 used for loading data into Redshift
• Debugging can be a challenge
Redshift
• Dense Compute or Dense Storage?
- Single ds2.xlarge instance
- Right balance between storage/memory/compute and cost/hr
• Strict ETL, no transformation is carried out in DW, an Append Only Strategy
- Leverage power and scalability of EMR and Insert speed of Redshift
- No Updates in DW, Drop and Recreate
• Tuning is a time consuming task & requires rigorous testing.
• Define Sort, Distribution, Interleaved keys as early as possible.
• Reserved Nodes will be used in future
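The "drop and recreate" and "define keys early" points can be combined in a small helper that generates the DDL for a table. This is a sketch only: the schema, table and column names are hypothetical, and the helper just builds SQL strings using Redshift's `DISTKEY` and `SORTKEY` / `INTERLEAVED SORTKEY` table attributes rather than connecting to a cluster.

```python
def recreate_table_sql(schema, table, columns, dist_key, sort_keys, interleaved=False):
    """Return DROP + CREATE statements that declare distribution and sort keys up front."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    sort_kind = "INTERLEAVED SORTKEY" if interleaved else "SORTKEY"
    return [
        f"DROP TABLE IF EXISTS {schema}.{table};",
        f"CREATE TABLE {schema}.{table} (\n  {cols}\n)\n"
        f"DISTKEY ({dist_key})\n"
        f"{sort_kind} ({', '.join(sort_keys)});",
    ]


# Hypothetical fact table; names are placeholders, not taken from the deck.
statements = recreate_table_sql(
    "cmdev",
    "fact_ad_view",
    [("ad_id", "BIGINT"), ("view_date", "DATE")],
    dist_key="ad_id",
    sort_keys=["view_date"],
)
for statement in statements:
    print(statement)
```

Generating the DDL from one place keeps the key choices explicit and reviewable, which matters when changing them later means rebuilding the table.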
Redshift schemas by environment:
Schema | Test | Dev | Prod
Core | cmtest | cmdev | cmprod
Agg | agtest | agdev | agprod
[The slide shows this grid twice, annotated with "read permissions".]
Kimball Star Schema: Conformed dimensions across all data sources
Tableau on EC2
• Tableau Server runs on EC2 (c3.2xlarge) inside AWS Environment.
• Tableau Desktop used to develop dashboards that are published to the server.
• Connection to Redshift Data Warehouse - JDBC/ODBC Connector.
• Maps support is poor for countries outside the US
http://www.slideshare.net/AmazonWebServices/analytics-on-the-cloud-with-tableau-on-aws
Up Next?
• Increase number of data streams/Remove dependence on OLTP
• Traditional BI/Reporting - More dashboards
• [In progress] Data Products with Spark ML/Amazon ML, DynamoDB, Lambda & API Gateway
• Trials of Kinesis Firehose, Kinesis Analytics, Quicksight
• Improved Code Deployment with Code Pipeline and Code Commit
DoneDeal Image Service Upgrade
• Image Storage & Transforming moved to AWS
• Over 4.5M images migrated to S3
• ECS + ELB used for image resizing
• Autoscaling group enables adding new image sizes
• We now run docker in production thanks to ECS
• Investigating uses for AWS Lambda and image processing
For more info: @davidconde
DoneDeal Dynamic Test Environments
• QA can now run any feature branch of DoneDeal directly from our CI server
• Uses Jenkins / Docker (Machine + Compose) / EC2 & Route 53
• Enables rapid testing without server contention
• Also used by the mobile team to develop against & test new APIs
For more info: @davidconde
Q&A Session
Nigel Creighton
CTO at DNM
Martin Peters
BI Manager at DoneDeal
DoneDeal - AWS Data Analytics Platform

More Related Content

What's hot

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionDataWorks Summit/Hadoop Summit
 
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRCost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRProvectus
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm DataWorks Summit/Hadoop Summit
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseDataWorks Summit
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperVasu S
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 

What's hot (19)

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to Production
 
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRCost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 

Similar to DoneDeal - AWS Data Analytics Platform

Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Web Services
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauSam Palani
 
IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeTorsten Steinbach
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSAWS User Group Kochi
 
AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...
AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...
AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...Amazon Web Services
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Amazon Web Services
 
Oracle Database in-Memory Overivew
Oracle Database in-Memory OverivewOracle Database in-Memory Overivew
Oracle Database in-Memory OverivewMaria Colgan
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataAshnikbiz
 

Similar to DoneDeal - AWS Data Analytics Platform (20)

Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lake
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
 
AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...
AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...
AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish...
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Oracle Database in-Memory Overivew
Oracle Database in-Memory OverivewOracle Database in-Memory Overivew
Oracle Database in-Memory Overivew
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 

Recently uploaded

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
DoneDeal - AWS Data Analytics Platform

  • 1. DoneDeal - Data Platform, April 2016. Martin Peters (martin@donedeal.ie / @martinbpeters), DoneDeal Analytics Team Manager
  • 2. If you don’t understand the details of your business you are going to fail. If we can keep our competitors focused on us while we stay focused on the customer, ultimately we’ll turn out all right. - Jeff Bezos, Amazon
  • 3. What do these companies have in common?
  • 4. Data is … one of our biggest assets. "With the right set of information, you can make business decisions with higher levels of confidence, as you can audit and attribute the data you used for the decision-making process." - Krish Krishnan, 2014
  • 5. Business Intelligence 101: For small companies the gap is often filled with custom ad hoc solutions with limited and rather static reporting capability.
  • 6. What and why BI? As a company grows, the Availability, Accuracy and Accessibility requirements of data increase.
  • 7. Some terminology: ETL process
      - Extraction: extracts data from homogeneous or heterogeneous data sources.
      - Transformation: process, blend, merge and conform the data.
      - Loading: store in the proper format or structure for the purposes of querying and analysis.
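As a rough illustration of the three steps, the sketch below extracts rows from two heterogeneous sources (a CSV string and a JSON string), conforms them to one record shape, and loads them into an in-memory store keyed by id. The sample data and field names (`ad_id`, `price`, `id`, `price_eur`) are made up for illustration and are not DoneDeal's schema.

```python
import csv
import io
import json

# Hypothetical sample inputs standing in for two heterogeneous sources.
CSV_SOURCE = "ad_id,price\n1,100\n2,250\n"
JSON_SOURCE = '[{"id": 3, "price_eur": "75"}]'


def extract():
    """Extract: read raw rows from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: blend both sources and conform them to one record shape."""
    conformed = [{"ad_id": int(r["ad_id"]), "price": float(r["price"])} for r in csv_rows]
    conformed += [{"ad_id": int(r["id"]), "price": float(r["price_eur"])} for r in json_rows]
    return conformed


def load(records, store):
    """Load: keep records in a structure suited to querying (here, a dict by id)."""
    for record in records:
        store[record["ad_id"]] = record
    return store


warehouse = load(transform(*extract()), {})
```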
  • 8. April 2015 - April 2016
  • 9. Timeline: 2014-2017. Stages shown across 2014, 2015, 2016 and 2017: siloed data; manual, error-prone blending; value of BI/data not understood; platform design; implementation; storage layer; batch layer; traditional BI; serving layer; speed layer; real-time analytics.
  • 10. Business Goals & Objectives
      1. Build a future-proof data analytics platform that will scale with the company over the next 5 years.
      2. Take ownership of our data. Collect more data.
      3. Replace existing reporting tool.
      4. Provide a holistic view of our users (buyers and sellers), ads, products.
      5. Use our data in a smarter manner and provide recommendations in a timely fashion.
  • 11. Apollo Team: Data Engineer, Data Analyst, Architect, DevOps, BI Consultants, Solution Architect
      - Analytics Platform that includes Event Streaming, Data Consolidation, Cleansing & Warehousing, Data Visualisation, Business Intelligence and Data Product Delivery.
      - Apollo brings agility and flexibility to our data model; data ownership is key and allows us to blend data more conveniently.
  • 12. Apollo Principles
      Project Principles:
      1. System must scale but costs grow more slowly
      2. Occam's Razor
      3. Analytics and core platforms are independent
      4. Monitoring of platform is key
      5. Low maintenance
      Data Principles:
      1. Accurate, Available, Accessible
      2. Ownership - Business & Technical
      3. Standardised across teams
      4. Integrity
      5. Identifiable - primary source and globally unique identifier
  • 13. Apollo Architectural Principles (www.slideshare.net/AmazonWebServices/big-data-architectural-patterns-and-best-practices-on-aws)
      - Decoupled "data bus"
      - Use the right tool/service for the job: data structure, latency, throughput, access patterns
      - Use Lambda architecture ideas: immutable (append-only), batch, [speed, serving] layers
      - Leverage AWS Managed Services: scalable/elastic, available, reliable, secure, no/low admin
      - Big data != Big Cost
  • 15. ETL Architecture: Custom Build Pipeline. [Diagram: E, T and L stages, each with a Summary]
  • 16. ETL: Control over complex dependencies
      - Allows control of ETL pipelines with complex dependencies
      - Easy plug-in of new data source
      - Orchestration with Data Pipeline and common status or summary files
      - Idempotent pipeline
      - Historical data extracted as simulated stream
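One possible way to combine summary files with idempotency is sketched below: each step writes a summary file when a day succeeds, and a rerun skips any day that already has one. This is only an illustration; the slides do not describe the real implementation. The file layout and the names `summary_path`, `run_step` and `replay_history` are hypothetical, and a local temporary directory stands in for S3.

```python
import json
import tempfile
from pathlib import Path


def summary_path(root: Path, step: str, day: str) -> Path:
    # Hypothetical layout: <root>/<step>/<YYYY-MM-DD>.summary.json
    return root / step / f"{day}.summary.json"


def run_step(root: Path, step: str, day: str, work) -> bool:
    """Run `work(day)` at most once per (step, day); return True if it ran."""
    marker = summary_path(root, step, day)
    if marker.exists():
        return False  # already processed, so a rerun is a no-op
    rows = work(day)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(json.dumps({"step": step, "day": day, "rows": rows}))
    return True


def replay_history(root: Path, step: str, days, work) -> int:
    """Feed historical days through the same step, oldest first."""
    return sum(run_step(root, step, day, work) for day in sorted(days))


calls = []


def fake_extract(day):
    # Stand-in for real extraction work; records that it was called.
    calls.append(day)
    return 42  # pretend row count


root = Path(tempfile.mkdtemp())
first = replay_history(root, "extract", ["2016-04-02", "2016-04-01"], fake_extract)
second = replay_history(root, "extract", ["2016-04-02", "2016-04-01"], fake_extract)
```

The second call does no work because both days already have summary files, which is the property that makes a failed pipeline safe to restart.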
  • 17. ETL: By the numbers
      - Extraction: 4000 days processed; 7 different data sources; 14 domains; 13 event types
      - Orchestration: 1200 processing days; 4GB/day; 3 environments; 15 data pipelines
      - Data Lake: 11M events streamed/day; 3 million files; 3 TB of data stored over 7 buckets
      - Redshift: 7B records in production; 6 schemas (core and aggregate); 86 tables in core schema
  • 18. Kinesis Streams
      - 1 stream with 4 shards
      - Data retention of 24hrs
      - KCL on EC2 writes data to S3 ready for Spark
      - Max size of 1MB per data blob
      - 1,000 records/sec per shard write
      - 5 transactions/sec read or 2MB/sec
      - Server-side API logging from 7 application servers using Log4JAppender
      - Event buffering at source [in progress]
      [Chart: Put Records requests]
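A back-of-the-envelope check of the stream's write headroom, using only figures quoted in the slides (4 shards, 1,000 records/sec per shard, 11M events streamed per day). It assumes events arrive evenly over the day, which real traffic will not, so treat the result as an average rather than a peak estimate.

```python
SHARDS = 4
WRITE_RECORDS_PER_SEC_PER_SHARD = 1_000
EVENTS_PER_DAY = 11_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Total write capacity of the stream in records per second.
write_capacity = SHARDS * WRITE_RECORDS_PER_SEC_PER_SHARD

# Average load if the daily volume were spread evenly (an assumption).
average_rate = EVENTS_PER_DAY / SECONDS_PER_DAY

# How many times the average load fits into the write capacity.
headroom = write_capacity / average_rate

print(f"capacity: {write_capacity} records/sec")
print(f"average load: {average_rate:.0f} records/sec")
print(f"headroom: {headroom:.0f}x")
```

On these numbers the average load is about 127 records/sec against a capacity of 4,000, so the record-count limit leaves room for traffic peaks well above the daily average.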
  • 19. S3
      - Simple Storage Service provides secure, highly-scalable, durable cloud storage
      - Native support for Spark, Hive
  • 20. S3
      - A strongly defined naming convention
      - YYYY/MM/DD prefix used
      - Avro format used for OLTP data, JSON otherwise. Avro is probably the right choice (schema evolution), although we haven't taken advantage of that yet.
      - Allows easy retrieval of data from a particular time period
      - Easy to maintain and browse
      - Handling of summaries from E, T & L steps
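A minimal sketch of how a YYYY/MM/DD prefix makes time-range retrieval straightforward: listing the prefixes for a date range gives exactly the key prefixes to read. The leading `source` component and the function names are assumptions for illustration; the slides only state that a YYYY/MM/DD prefix is used.

```python
from datetime import date, timedelta
from typing import List


def day_prefix(source: str, day: date) -> str:
    """Build a <source>/YYYY/MM/DD/ prefix for one day (source part is hypothetical)."""
    return f"{source}/{day:%Y/%m/%d}/"


def prefixes_for_range(source: str, start: date, end: date) -> List[str]:
    """List the prefixes covering start..end inclusive, oldest first."""
    days = (end - start).days
    return [day_prefix(source, start + timedelta(days=i)) for i in range(days + 1)]


prefixes = prefixes_for_range("events", date(2016, 3, 31), date(2016, 4, 2))
```

Zero-padded components also sort lexicographically in date order, which is part of what keeps a bucket easy to browse.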
  • 21. Spark on EMR
      - EMR: AWS's managed Hadoop framework that can interact with data from S3, DynamoDB, etc.
      - Apache Spark: fast, general-purpose engine for large-scale in-memory data processing. Runs on Hadoop/EMR and can read from S3.
      - PySpark + SparkSQL was the focus in Apollo.
      - Streaming and ML will be the focus in the months ahead.
  • 22. Spark on EMR
      - Spark is easy; performant Spark code is hard and time-consuming
      - DataFrame API exclusively
      - Developing Spark applications in a local environment with a limited-size dataset differs significantly from running Spark on EMR (e.g. joins, unions etc.)
      - Don't pre-optimize
      - Naive joins to be avoided
      - Spark UI is invaluable for testing performance (both locally and on EMR) and for understanding the underlying mechanism of Spark
      - Some scaling of Spark on EMR; settled on memory-optimised instances r3.2xlarge (8 vCPUs, 61GB RAM).
  • 23. Data Pipeline + Simple Notification Service
      - Data Pipeline is a service to reliably process and move data between AWS applications (e.g. S3, EMR, DynamoDB)
      - Pipelines run on schedule and alarms are issued with Simple Notification Service (SNS)
      - EMR/Spark used for compute and EC2 used for loading data into Redshift
      - Debugging can be a challenge
  • 24. Redshift
      - Dense Compute or Dense Storage?
          - Single ds2.xlarge instance
          - Right balance between storage/memory/compute and cost/hr
      - Strict ETL: no transformation is carried out in the DW, an append-only strategy
          - Leverage power and scalability of EMR and insert speed of Redshift
          - No updates in DW; drop and recreate
      - Tuning is a time-consuming task & requires rigorous testing.
      - Define sort, distribution and interleaved keys as early as possible.
      - Reserved Nodes will be used in future
      - Kimball Star Schema: conformed dimensions across all data sources
      Schemas per environment (diagram label: read permissions):
               Test     Dev     Prod
        Core   cmtest   cmdev   cmprod
        Agg    agtest   agdev   agprod
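The append-only and drop-and-recreate ideas can be sketched as follows. This is a toy stand-in, not the Apollo code: `sqlite3` plays the role of the warehouse, plain Python plays the role of the EMR/Spark compute step, and the table names `core_ads` and `agg_ads_by_section` are invented for the example.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse, not Redshift
conn.execute("CREATE TABLE core_ads (ad_id INTEGER, section TEXT)")


def append_core(rows):
    """Append-only load: new rows are inserted, existing rows are never updated."""
    conn.executemany("INSERT INTO core_ads VALUES (?, ?)", rows)


def reload_aggregate(all_rows):
    """Aggregate outside the warehouse, then drop and recreate the table."""
    counts = Counter(section for _, section in all_rows)
    conn.execute("DROP TABLE IF EXISTS agg_ads_by_section")
    conn.execute("CREATE TABLE agg_ads_by_section (section TEXT, ads INTEGER)")
    conn.executemany(
        "INSERT INTO agg_ads_by_section VALUES (?, ?)", sorted(counts.items())
    )


day1 = [(1, "cars"), (2, "cars")]
day2 = [(3, "farming")]

append_core(day1)
reload_aggregate(day1)
append_core(day2)
reload_aggregate(day1 + day2)  # rebuilt from scratch, no UPDATE statements

summary = conn.execute(
    "SELECT section, ads FROM agg_ads_by_section ORDER BY section"
).fetchall()
core_count = conn.execute("SELECT COUNT(*) FROM core_ads").fetchone()[0]
```

Rebuilding the aggregate from scratch trades extra compute for simplicity: a rerun always produces the same table, with no partial-update states to reason about.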
  • 25. Tableau on EC2 (http://www.slideshare.net/AmazonWebServices/analytics-on-the-cloud-with-tableau-on-aws)
      - Tableau Server runs on EC2 (c3.2xlarge) inside AWS environment.
      - Tableau Desktop used to develop dashboards that are published to the server.
      - Connection to Redshift Data Warehouse: JDBC/ODBC connector.
      - Maps support is poor for countries outside the US
  • 26. Up Next?
      - Increase number of data streams/remove dependence on OLTP
      - Traditional BI/Reporting: more dashboards
      - [In progress] Data Products with Spark ML/Amazon ML, DynamoDB, Lambda & API Gateway
      - Trials of Kinesis Firehose, Kinesis Analytics, QuickSight
      - Improved code deployment with CodePipeline and CodeCommit
  • 27. DoneDeal Image Service Upgrade (for more info: @davidconde)
      - Image storage & transforming moved to AWS
      - Over 4.5M images migrated to S3
      - ECS + ELB used for image resizing
      - Autoscaling group enables adding new image sizes
      - We now run Docker in production thanks to ECS
      - Investigating uses for AWS Lambda and image processing
  • 28. DoneDeal Dynamic Test Environments (for more info: @davidconde)
      - QA can now run any feature branch of DoneDeal directly from our CI server
      - Uses Jenkins / Docker (Machine + Compose) / EC2 & Route 53
      - Enables rapid testing without server contention
      - Also used by the mobile team to develop against & test new APIs
  • 29. Q&A Session: Nigel Creighton (CTO at DNM), Martin Peters (BI Manager at DoneDeal)