Autoscaling Spark on AWS EC2 - 11th Spark London meetup
Rafal Kwasny
Autoscaling Spark for Fun and Profit 11th Spark Meetup London
Technology
Recommended
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
Rahul Pathak - Amazon EMR BigData AWS IL meetup presentation Feb 2 2015
Scaling your analytics with Amazon EMR
Israel AWS User Group
Amazon EMR is a managed service that makes it easy for customers to use big data frameworks and applications like Apache Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon’s highly scalable object storage service. In this session, we will introduce Amazon EMR and the greater Apache Hadoop ecosystem, and show how customers use them to implement and scale common big data use cases such as batch analytics, real-time data processing, interactive data science, and more. Then, we will walk through a demo to show how you can start processing your data at scale within minutes.
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Amazon Web Services
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services
Organizations often need to quickly analyze large amounts of data, such as logs generated from a wide variety of sources and formats. However, traditional approaches require a lot of time and effort designing complex data transformation and loading processes and configuring data warehouses. Using AWS, you can start querying your datasets within minutes. In this session you will learn how you can deploy a managed Presto environment in minutes to interactively query log data using standard ANSI SQL. Presto is a popular open source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR.
Interactively Querying Large-scale Datasets on Amazon S3
Amazon Web Services
Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data-store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot instances to scale your transient infrastructure effectively.
Big data with amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services
Quick intro to and walkthrough of the AWS Elastic MapReduce (EMR) service. Part of a larger course at http://bit.ly/get-hadoop
Hadoop in the cloud with AWS' EMR
rICh morrow
Enterprises are starting to deploy large scale Hadoop clusters to extract value out of the data that they are generating. These clusters often span hundreds of nodes. To speed up the time to value, a lot of the newer deployments are happening in AWS, moving from the traditional on-premises, bare-metal world. Cloudera supports just such deployments. In this session, Cloudera shares the lessons learned and best practices for deploying multi-tenant Hadoop clusters in AWS. They will cover what reference deployments look like, what services are relevant for Hadoop deployments, network configurations, instance types, backup and disaster recovery considerations, and security considerations. They will also talk about what works well, what doesn't, and what has to be done going forward to improve the operability of Hadoop on AWS.
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
Amazon Web Services
Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.
(BDT305) Amazon EMR Deep Dive and Best Practices
Amazon Web Services
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Web Services
Elasticsearch has quickly become the leading open source technology for scaling search and building document services. Many software providers have come to rely on it to serve the needs of high-performance, production applications. In this talk, we’ll go deep on lessons learned from three years in production, scaling from a few shards to more than 100 spread across hundreds of nodes on AWS to serve real-time queries against hundreds of millions of documents. Attendees will learn:
* How to capacity plan for ES on AWS
* How to scale and reshard on AWS with zero downtime
* What AWS and ES metrics to collect and alert on
* Tips on day to day ES operations
Session sponsored by SignalFx.
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
Amazon Web Services
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
Best Practices for Using Apache Spark on AWS
Amazon Web Services
Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way. In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data-store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot instances to scale your transient infrastructure effectively.
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
Amazon Web Services
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient. Speakers: Ian Meyers, AWS Solutions Architect; Ian McDonald, IT Director, SwiftKey
Deep Dive - Amazon Elastic MapReduce (EMR)
Amazon Web Services
Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Amazon Web Services
Learning Objectives:
- Learn how to run Amazon EMR clusters on Spot instances and significantly reduce the cost of processing vast amounts of data on managed Hadoop clusters
- Understand key EC2 Spot Instances concepts and common usage patterns for maximum scale and cost optimization for Big Data workloads
- See a few customer examples that show how to leverage the full scale of the AWS cloud for faster results
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Amazon Web Services
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
Amazon Web Services
Learning Objectives:
- Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data across dynamically scalable Amazon EC2 instances.
- Learn how using EC2 Spot can significantly reduce the cost of running your clusters.
- Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters.
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Amazon Web Services
Introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of Spot EC2 instances to reduce costs, and other Amazon EMR architectural best practices.
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to set up an Amazon EMR job flow to analyse application logs and perform Hive queries against them. We also review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a Map/Reduce job. Finally we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service. See a recording of the webinar based on this presentation on YouTube here: Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/ See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
Amazon EMR Masterclass
Amazon Web Services
Spark Integration Architecture for restaurant data