SlideShare a Scribd company logo
1 of 76
Download to read offline
November 13, 2014 | Las Vegas | NV 
Roy Ben-Alta (Big Data Analytics Practice Lead, AWS) Thomas Barthelemy (Software Engineer, Coursera)
Amazon 
Redshift 
Amazon 
EMR 
Amazon 
EC2 
Process & Analyze 
Store 
AWS Direct 
Connect 
Amazon 
S3 
Amazon 
Kinesis 
Amazon 
Glacier 
AWS 
Import/Export 
DynamoDB 
Collect
Billing 
DB 
CSV files 
CRMDB 
Extract 
Transform 
Load 
DWH 
DB 
DB 
DB 
ETL
Amazon 
Redshift 
Amazon 
RDS 
Amazon S3 as 
Data Lake 
Amazon 
DynamoDB 
Amazon S3 as Amazon EMR 
Landing Zone 
AWS Data Pipeline 
Data 
Feeds 
Amazon EC2
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp- program-pipeline.html
Amazon S3 as Data 
Lake/Landing Zone 
Amazon EMR 
as ETL Grid 
Amazon Redshift – 
Production DWH 
Visualization 
Logs
Define 
aws datapipeline create-pipeline –name myETL--unique-id token 
--output: df-09222142C63VXJU0HC0A 
Import 
aws datapipeline put-pipeline-definition --pipeline-id df- 09222142C63VXJU0HC0A --pipeline-definition /home/repo/etl_reinvent.json 
Activate 
aws datapipeline activate-pipeline --pipeline-id df-09222142C63VXJU0HC0A
copy weblogs 
from 's3://staging/' 
credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>' 
delimiter ','
Coursera Big Data Analytics Powered by AWS 
Thomas Barthelemy 
SWE, Data Infrastructure 
thomas@coursera.org
Overview 
●About Coursera 
●Phase 1: Consolidate data 
●Phase 2: Get users hooked 
●Phase 3: Increase reliability 
●Looking forward
Coursera at a Glance
About Coursera 
●Platform for Massive Open Online Courses 
●Universities create the content 
●Content free to the public
Coursera Stats 
●~10 million learners 
●+110 university partners 
●>200 courses open now 
●~170 employees
The value of data at Coursera 
●Making strategic and tactical decisions 
●Studying pedagogy
Becoming More Data-Driven 
●Since the early days, Coursera has understood the value of data 
oFounders came from machine learning 
oMany of the early employees researched with the founders 
●Cost of data access was high 
oeach analysis requires extraction, pre-processing 
odata only available to data scientists + engineers
Phase 1: 
Consolidate data
Sources 
●MySQL 
oSite data 
oCourse data sharded across multiple database 
●Cassandra increasingly used for course data 
●Logged event data 
●External APIs 
Classes 
Classes 
SQL 
Sharded SQL 
NoSQL 
Logs 
External APIs
(Obligatory meme)
What platforms to use? 
●Amazon Redshifthad glowing recommendations 
●AWS Data Pipelinehas native support for various Amazon services
ETL development was slow :( 
●Slow to do the following in the console: 
ocreate one pipeline 
ocreate similar pipelines 
oupdate existing pipelines
Solution: Programmatically create pipelines 
●Break ETL into reusable steps 
oExtract from variety of sources 
oTransform data with Amazon EMR, Amazon EC2, or within Amazon Redshift 
oLoad data principally into Amazon Redshift 
●Use Amazon S3 as intermediate state for Amazon EMR-or Amazon EC2-based transformations 
●Map stepsto set of AWS Data Pipeline objects
Example step: extract-rds 
●Parameters: hostname, database name, SQL 
●Creates pipeline objects: 
S3 Node can be used by many other step types
Example load definition 
steps: 
-type:extract-from-rds 
sql:| SELECT instructor_id, course_id, rank 
FROM courses_instructorincourse; 
hostname:maestro-read-replica 
database:maestro 
-type:load-into-staging-table 
table: staging.maestro_instructors_sessions 
-type:reload-prod-table 
source:staging.maestro_instructors_sessions 
destination:prod.instructors_sessions 
Extract data from Amazon RDS 
Load intermediate table in Amazon Redshift 
Reload target table with new data
ETL –Amazon RDS 
SQL 
AmazonS3 
AmazonRedshift 
Extract 
Load
ETL –Sharded RDS 
AmazonS3 
AmazonS3 
AmazonRedshift 
SQL 
AmazonS3 
AmazonEMR 
AmazonS3 
AmazonEMR 
AmazonEMR 
Transform 
Extract 
Load
ETL –Logs 
Logs 
Amazon EC2 
Amazon S3 
Amazon S3 
AmazonEMR 
AmazonRedshift 
Amazon S3 
Transform 
Load 
Extract
Reporting Model, Dec 2013
Reporting Model, Sep 2014
AWS Data Pipeline 
●Easily handles starting/stopping of resources 
●Handles permissions, roles 
●Integrates with other AWS services 
●Handles “flow” of data, data dependencies
Dealing with large pipelines 
●Monolithic pipelines hard to maintain 
●Moved to making pipelines smaller 
oHooray modularity! 
●If pipeline B depended on pipeline A, just schedule it later 
oAdd a time buffer just to be safe
Setting cross-pipeline dependencies 
●Dependencies accomplished using a script that would wait until dependencies finished 
oShellCommandActivityto the rescue
The beauty of ShellCommandActivity 
●You can use it anywhere 
oAccomplish tasks that have no corresponding activity type 
oOverride native AWS Data Pipeline support if it does not meet your needs
ETL library 
●Install on machine as first step of each pipeline 
oWith ShellCommandActivity 
●Allows for even more modularity
Phase 2: 
Get users hooked
We have data. Now what? 
●Simply collecting data will not make a company data-driven 
●First step: make data easier for the analysts to use
Certifying a version of the truth 
●Data Infrastructure team creates a 3NF model of the data 
●Model is designed to be as interpretable as possible
Example: Forums 
●Full-word names for database objects 
●Join keys made obvious through universally- unique column names 
oe.g. session_id joins to session_id 
●Auto-generated ER-diagram, so documentation is up-to-date
Some power users need data faster! 
●Added new schemas for developers to load data 
●Help us to remain data driven during early phases of… 
oProduct release 
oAnalyzing new data
Lightweight SQL interface 
●Access via browser 
oUsers don’t need database credentials or special software 
●Data exportable as CSV 
●Auto-complete for DB object names 
●Took ~2 days to make
Lightweight SQL interface
But what about people who don’t know SQL? 
●Reporting tools to the rescue! 
●In-house dashboard framework to support: 
oSupport 
oGrowth 
oData Infrastructure 
oMany other teams... 
●Tableau for interactive analysis
But what about people who don’t know SQL?
Phase 3: 
Increase reliability
If you build it, they will come 
●Built ecosystem of data consumers 
oAnalysts 
oBusiness users 
oProduct team
Maintaining the ecosystem 
●As we accumulate consumers, need to invest effort in... 
oModel stability 
oData quality 
oData availability
Persist Amazon Redshift logs to keep track of usage 
●Leverage database credentials to understand usage patterns 
●Know whom to contact when 
oNeed to change the schema 
oNeed to add a new set of tables similar to an existing set (e.g. assessments)
Data Quality 
●Quality of source systems (GIGO) 
oEncourage source system to fix issues 
●Quality of transformations 
oResponsibility of the data infrastructure team
Quality Transformations 
●Automated QA checks as part of ETL 
oAgain, use ShellCommandActivity 
●Key factors 
oCounts 
oValue comparisons 
oModeling rules (primary keys)
Ensuring High Reliability 
●Step retries catch most minor hiccups 
●On-call system to recover failed pipelines 
●Persist DB logs to keep track of load times 
oDelayed loads can also alert on-call devs 
oBonus: users know how fresh the data is 
oBonus: can keep track of success rate 
●AWS very helpful in debugging issues
Looking Forward
Current bottlenecks 
●Complex ETL 
oHow do we easily ETL complex, nested structures? 
●Developer bottleneck 
oHow do we allow users to access raw data more quickly?
Amazon S3 as principal data store 
●Amazon S3 to store raw data as soon as it’s produced (data lake) 
●Amazon EMR to process data 
●Amazon Redshift to remain as analytic platform
Thank you!
Want to learn more? 
●github.com/coursera/dataduct 
●http://tech.coursera.org 
●thomas@coursera.org
http://blogs.aws.amazon.com/bigdata/ http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html
http://bit.ly/awsevals

More Related Content

What's hot

Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsThomas Sykes
 
Introduction to Azure Data Factory
Introduction to Azure Data FactoryIntroduction to Azure Data Factory
Introduction to Azure Data FactorySlava Kokaev
 
Azure Data Factory Introduction.pdf
Azure Data Factory Introduction.pdfAzure Data Factory Introduction.pdf
Azure Data Factory Introduction.pdfMaheshPandit16
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
Azure data factory
Azure data factoryAzure data factory
Azure data factoryDavid Giard
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure DatabricksDustin Vannoy
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSAmazon Web Services
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksDatabricks
 
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 

What's hot (20)

Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
 
Introduction to Sagemaker
Introduction to SagemakerIntroduction to Sagemaker
Introduction to Sagemaker
 
Azure Synapse Analytics
Azure Synapse AnalyticsAzure Synapse Analytics
Azure Synapse Analytics
 
Introduction to Azure Data Factory
Introduction to Azure Data FactoryIntroduction to Azure Data Factory
Introduction to Azure Data Factory
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Azure Data Factory Introduction.pdf
Azure Data Factory Introduction.pdfAzure Data Factory Introduction.pdf
Azure Data Factory Introduction.pdf
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
 
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 

Viewers also liked

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012Amazon Web Services
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Amazon Web Services
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Serverjoeharris76
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Amazon Web Services
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesAmazon Web Services
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)Amazon Web Services
 
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best PracticesAmazon Web Services
 
Ruby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examplesRuby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examplesNiranjan Sarade
 
Cluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython NotebookCluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython NotebookRandy Zwitch
 
SQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSISSQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSISMarc Leinbach
 
Aws meetup ssm
Aws meetup ssmAws meetup ssm
Aws meetup ssmAdam Book
 

Viewers also liked (20)

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
 
AWS_Data_Pipeline
AWS_Data_PipelineAWS_Data_Pipeline
AWS_Data_Pipeline
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
 
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
Ruby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examplesRuby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examples
 
Cluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython NotebookCluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
 
Sei dmm-intro1
Sei dmm-intro1Sei dmm-intro1
Sei dmm-intro1
 
SQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSISSQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSIS
 
Aws meetup ssm
Aws meetup ssmAws meetup ssm
Aws meetup ssm
 

Similar to (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722Amazon Web Services
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901WeCloudData
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)Amazon Web Services Korea
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platformDavid Walker
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase Türkiye
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - DatalakeLam Le
 
Windows on AWS
Windows on AWSWindows on AWS
Windows on AWSDatavail
 
Amazon Redshift For Data Analysts
Amazon Redshift For Data AnalystsAmazon Redshift For Data Analysts
Amazon Redshift For Data AnalystsCan Abacıgil
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCMark Smith
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowDatabricks
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
Building prediction models with Amazon Redshift and Amazon ML
Building prediction models with  Amazon Redshift and Amazon MLBuilding prediction models with  Amazon Redshift and Amazon ML
Building prediction models with Amazon Redshift and Amazon MLJulien SIMON
 

Similar to (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014 (20)

AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platform
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik Platform
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
Windows on AWS
Windows on AWSWindows on AWS
Windows on AWS
 
Amazon Redshift For Data Analysts
Amazon Redshift For Data AnalystsAmazon Redshift For Data Analysts
Amazon Redshift For Data Analysts
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Building prediction models with Amazon Redshift and Amazon ML
Building prediction models with  Amazon Redshift and Amazon MLBuilding prediction models with  Amazon Redshift and Amazon ML
Building prediction models with Amazon Redshift and Amazon ML
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

  • 1. November 13, 2014 | Las Vegas | NV Roy Ben-Alta (Big Data Analytics Practice Lead, AWS) Thomas Barthelemy (Software Engineer, Coursera)
  • 2. Amazon Redshift Amazon EMR Amazon EC2 Process & Analyze Store AWS Direct Connect Amazon S3 Amazon Kinesis Amazon Glacier AWS Import/Export DynamoDB Collect
  • 3. Billing DB CSV files CRMDB Extract Transform Load DWH DB DB DB ETL
  • 4.
  • 5. Amazon Redshift Amazon RDS Amazon S3 as Data Lake Amazon DynamoDB Amazon S3 as Amazon EMR Landing Zone AWS Data Pipeline Data Feeds Amazon EC2
  • 6.
  • 7.
  • 8.
  • 10. Amazon S3 as Data Lake/Landing Zone Amazon EMR as ETL Grid Amazon Redshift – Production DWH Visualization Logs
  • 11.
  • 12. Define aws datapipeline create-pipeline –name myETL--unique-id token --output: df-09222142C63VXJU0HC0A Import aws datapipeline put-pipeline-definition --pipeline-id df- 09222142C63VXJU0HC0A --pipeline-definition /home/repo/etl_reinvent.json Activate aws datapipeline activate-pipeline --pipeline-id df-09222142C63VXJU0HC0A
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19. copy weblogs from 's3://staging/' credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>' delimiter ','
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28. Coursera Big Data Analytics Powered by AWS Thomas Barthelemy SWE, Data Infrastructure thomas@coursera.org
  • 29. Overview ●About Coursera ●Phase 1: Consolidate data ●Phase 2: Get users hooked ●Phase 3: Increase reliability ●Looking forward
  • 30. Coursera at a Glance
  • 31. About Coursera ●Platform for Massive Open Online Courses ●Universities create the content ●Content free to the public
  • 32. Coursera Stats ●~10 million learners ●+110 university partners ●>200 courses open now ●~170 employees
  • 33. The value of data at Coursera ●Making strategic and tactical decisions ●Studying pedagogy
  • 34. Becoming More Data-Driven ●Since the early days, Coursera has understood the value of data oFounders came from machine learning oMany of the early employees researched with the founders ●Cost of data access was high oeach analysis requires extraction, pre-processing odata only available to data scientists + engineers
  • 36. Sources ●MySQL oSite data oCourse data sharded across multiple database ●Cassandra increasingly used for course data ●Logged event data ●External APIs Classes Classes SQL Sharded SQL NoSQL Logs External APIs
  • 38. What platforms to use? ●Amazon Redshifthad glowing recommendations ●AWS Data Pipelinehas native support for various Amazon services
  • 39. ETL development was slow :( ●Slow to do the following in the console: ocreate one pipeline ocreate similar pipelines oupdate existing pipelines
  • 40. Solution: Programmatically create pipelines ●Break ETL into reusable steps oExtract from variety of sources oTransform data with Amazon EMR, Amazon EC2, or within Amazon Redshift oLoad data principally into Amazon Redshift ●Use Amazon S3 as intermediate state for Amazon EMR-or Amazon EC2-based transformations ●Map stepsto set of AWS Data Pipeline objects
  • 41. Example step: extract-rds ●Parameters: hostname, database name, SQL ●Creates pipeline objects: S3 Node can be used by many other step types
  • 42. Example load definition steps: -type:extract-from-rds sql:| SELECT instructor_id, course_id, rank FROM courses_instructorincourse; hostname:maestro-read-replica database:maestro -type:load-into-staging-table table: staging.maestro_instructors_sessions -type:reload-prod-table source:staging.maestro_instructors_sessions destination:prod.instructors_sessions Extract data from Amazon RDS Load intermediate table in Amazon Redshift Reload target table with new data
  • 43. ETL –Amazon RDS SQL AmazonS3 AmazonRedshift Extract Load
  • 44. ETL –Sharded RDS AmazonS3 AmazonS3 AmazonRedshift SQL AmazonS3 AmazonEMR AmazonS3 AmazonEMR AmazonEMR Transform Extract Load
  • 45. ETL –Logs Logs Amazon EC2 Amazon S3 Amazon S3 AmazonEMR AmazonRedshift Amazon S3 Transform Load Extract
  • 48. AWS Data Pipeline ●Easily handles starting/stopping of resources ●Handles permissions, roles ●Integrates with other AWS services ●Handles “flow” of data, data dependencies
  • 49. Dealing with large pipelines ●Monolithic pipelines hard to maintain ●Moved to making pipelines smaller oHooray modularity! ●If pipeline B depended on pipeline A, just schedule it later oAdd a time buffer just to be safe
  • 50. Setting cross-pipeline dependencies ●Dependencies accomplished using a script that would wait until dependencies finished oShellCommandActivityto the rescue
  • 51. The beauty of ShellCommandActivity ●You can use it anywhere oAccomplish tasks that have no corresponding activity type oOverride native AWS Data Pipeline support if it does not meet your needs
  • 52. ETL library ●Install on machine as first step of each pipeline oWith ShellCommandActivity ●Allows for even more modularity
  • 53. Phase 2: Get users hooked
  • 54. We have data. Now what? ●Simply collecting data will not make a company data-driven ●First step: make data easier for the analysts to use
  • 55. Certifying a version of the truth ●Data Infrastructure team creates a 3NF model of the data ●Model is designed to be as interpretable as possible
  • 56. Example: Forums ●Full-word names for database objects ●Join keys made obvious through universally- unique column names oe.g. session_id joins to session_id ●Auto-generated ER-diagram, so documentation is up-to-date
  • 57. Some power users need data faster! ●Added new schemas for developers to load data ●Help us to remain data driven during early phases of… oProduct release oAnalyzing new data
  • 58. Lightweight SQL interface ●Access via browser oUsers don’t need database credentials or special software ●Data exportable as CSV ●Auto-complete for DB object names ●Took ~2 days to make
  • 60. But what about people who don’t know SQL? ●Reporting tools to the rescue! ●In-house dashboard framework to support: oSupport oGrowth oData Infrastructure oMany other teams... ●Tableau for interactive analysis
  • 61. But what about people who don’t know SQL?
  • 62. Phase 3: Increase reliability
  • 63. If you build it, they will come ●Built ecosystem of data consumers oAnalysts oBusiness users oProduct team
  • 64. Maintaining the ecosystem ●As we accumulate consumers, need to invest effort in... oModel stability oData quality oData availability
  • 65. Persist Amazon Redshift logs to keep track of usage ●Leverage database credentials to understand usage patterns ●Know whom to contact when oNeed to change the schema oNeed to add a new set of tables similar to an existing set (e.g. assessments)
  • 66. Data Quality ●Quality of source systems (GIGO) oEncourage source system to fix issues ●Quality of transformations oResponsibility of the data infrastructure team
  • 67. Quality Transformations ●Automated QA checks as part of ETL oAgain, use ShellCommandActivity ●Key factors oCounts oValue comparisons oModeling rules (primary keys)
  • 68. Ensuring High Reliability ●Step retries catch most minor hiccups ●On-call system to recover failed pipelines ●Persist DB logs to keep track of load times oDelayed loads can also alert on-call devs oBonus: users know how fresh the data is oBonus: can keep track of success rate ●AWS very helpful in debugging issues
  • 70. Current bottlenecks ●Complex ETL oHow do we easily ETL complex, nested structures? ●Developer bottleneck oHow do we allow users to access raw data more quickly?
  • 71. Amazon S3 as principal data store ●Amazon S3 to store raw data as soon as it’s produced (data lake) ●Amazon EMR to process data ●Amazon Redshift to remain as analytic platform
  • 73. Want to learn more? ●github.com/coursera/dataduct ●http://tech.coursera.org ●thomas@coursera.org
  • 74.