Autoscaling Spark on AWS EC2 - 11th Spark London meetup

Rafal Kwasny

Autoscaling Spark for Fun and Profit 11th Spark Meetup London

Autoscaling Spark for Fun and Profit
Rafal Kwasny
11th Spark London Meetup
2015-11-26

Who am I
•DevOps
•Built a few platforms in my life
•Mostly adtech, in-game analytics for Sony PlayStation
•Currently advising investment banks
•CTO, Entropy Investments

How do you run Spark?
•Who runs on AWS?
•Who uses EMR?

Overview
•Typical architecture for AWS
•How autoscaling works
•Scripts to make your life easier

Typical architecture for AWS
•Generate some data
•Store it in S3
•Or store it in a message queue
•Use your favourite tool for ETL
•Ship it back to S3
•Or send it somewhere
•Ways to run the cluster:
–EMR
–spark-ec2
–Build a cluster from scratch

Map-reduce is about quickly writing very inefficient code and then running it at massive scale
(C) Someone

Problem
•EC2 is a pay-for-what-you-use model
•You just have to decide how many resources you want to use before starting a cluster

Most common problems while running on EC2
Scaling up
•My team needs a new cluster, how big should it be?
Scaling down
•Did I shut down the DEV cluster before leaving the office on Friday evening?

Types of scaling
Vertical scaling - “Let’s get a bigger box”
•Change instance type
•Change EBS parameters
Horizontal scaling - “Just add more nodes”

Autoscaling
•Automatic resizing based on demand
•Define minimum/maximum instance count
•Define when scaling should occur
•Use metrics
•Run your jobs and don’t worry about infrastructure
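The bullets above can be sketched as a small decision function. This is only an illustration of the idea, not how AWS implements it: the `ScalingPolicy` class, its field names and the thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy; names and thresholds are illustrative only."""
    min_instances: int
    max_instances: int
    scale_out_above: float  # metric value (e.g. average CPU %) that adds nodes
    scale_in_below: float   # metric value that removes nodes
    step: int = 1           # instances added or removed per decision


def desired_capacity(current: int, metric: float, policy: ScalingPolicy) -> int:
    """Return the new instance count for one evaluation of the metric."""
    if metric > policy.scale_out_above:
        target = current + policy.step
    elif metric < policy.scale_in_below:
        target = current - policy.step
    else:
        target = current
    # Never leave the configured minimum/maximum range.
    return max(policy.min_instances, min(policy.max_instances, target))


policy = ScalingPolicy(min_instances=2, max_instances=10,
                       scale_out_above=70.0, scale_in_below=20.0, step=2)
print(desired_capacity(4, 85.0, policy))   # busy cluster: grows to 6
print(desired_capacity(10, 95.0, policy))  # already at the maximum: stays at 10
print(desired_capacity(2, 5.0, policy))    # idle cluster: stays at the minimum, 2
```

The clamp on the last line of `desired_capacity` is what "define minimum/maximum instance count" amounts to: the metric can only move the cluster size within that range.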

Using RAM/local SSDs for caching
Only saving output into S3

Autoscaling components
•AMI - machine image with Spark installed
•Launch configuration - defines:
–AMI
–Instance type
–Instance storage
–Public IP
–Security groups
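As a rough sketch, the launch configuration fields listed above map onto the keyword arguments of boto3's `create_launch_configuration` call. The helper names, the AMI ID, the security group ID, the instance type, the device name and the region below are placeholders assumed for illustration; none of them come from the talk.

```python
def launch_configuration_params(name, ami_id, instance_type,
                                security_group_ids, public_ip=True,
                                spot_price=None):
    """Build keyword arguments for boto3's create_launch_configuration."""
    params = {
        "LaunchConfigurationName": name,
        "ImageId": ami_id,                       # AMI with Spark installed
        "InstanceType": instance_type,
        "SecurityGroups": list(security_group_ids),
        "AssociatePublicIpAddress": public_ip,
        # Expose the first instance-store disk; the device name is an assumption.
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
        ],
    }
    if spot_price is not None:
        params["SpotPrice"] = str(spot_price)    # the API expects a string
    return params


def create_launch_configuration(params, region="eu-west-1"):
    """Send the request to AWS; needs boto3 installed and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3
    client = boto3.client("autoscaling", region_name=region)
    return client.create_launch_configuration(**params)


params = launch_configuration_params(
    name="spark-worker-lc",              # hypothetical name
    ami_id="ami-00000000",               # placeholder AMI ID
    instance_type="r3.xlarge",           # illustrative choice
    security_group_ids=["sg-00000000"],  # placeholder security group
)
print(sorted(params))
```

Only `launch_configuration_params` runs here; `create_launch_configuration` is left uncalled because it talks to a real AWS account.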

Autoscaling components
•Autoscaling group
–Launch configuration
–Availability zones
–VPC details
–Min/max servers
–When to scale
–Metrics/health checks
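A minimal sketch of how the autoscaling group settings above could be expressed as arguments for boto3's `create_auto_scaling_group`. The helper names, group name, subnet IDs, region and sizes are assumptions for the example. The subnets stand in for both the availability zones and the VPC details, since each subnet belongs to one zone of one VPC.

```python
def auto_scaling_group_params(name, launch_configuration, subnet_ids,
                              min_size, max_size, desired=None):
    """Build keyword arguments for boto3's create_auto_scaling_group."""
    if not 0 <= min_size <= max_size:
        raise ValueError("expected 0 <= min_size <= max_size")
    if desired is None:
        desired = min_size                  # start small, let scaling grow it
    if not min_size <= desired <= max_size:
        raise ValueError("desired capacity must lie between min and max")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_configuration,
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        "VPCZoneIdentifier": ",".join(subnet_ids),  # comma-separated subnet IDs
        "HealthCheckType": "EC2",                   # instance status checks
        "HealthCheckGracePeriod": 300,              # seconds before first check
    }


def create_auto_scaling_group(params, region="eu-west-1"):
    """Send the request to AWS; needs boto3 installed and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3
    client = boto3.client("autoscaling", region_name=region)
    return client.create_auto_scaling_group(**params)


group = auto_scaling_group_params(
    name="spark-workers",                          # hypothetical group name
    launch_configuration="spark-worker-lc",        # hypothetical name
    subnet_ids=["subnet-aaaa0000", "subnet-bbbb0000"],  # placeholder subnets
    min_size=2,
    max_size=20,
)
print(group["VPCZoneIdentifier"])
```

"When to scale" is not covered by this call. In boto3 it would be a separate scaling policy tied to a metric alarm, which is left out to keep the sketch short.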

Putting it all together
Then you can run your job

Complicated?
•AWS provides a lot of services

spark-cloud
•Better scripts to start Spark clusters on EC2
•Alpha version
•https://github.com/entropyltd/spark-cloud

What’s inside spark-cloud
Building AMIs with Packer
Packer is a tool for creating machine and container images for multiple platforms from a single source configuration.
Supports AWS, DigitalOcean, Docker, OpenStack, Parallels, QEMU, VirtualBox, VMware

Current functionality
•Start cluster
•Shutdown cluster
•But more to come :)

Spot instances
–On-Demand: $1.400
–Spot: $0.15
–89% cheaper
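The 89% figure follows directly from the two prices on the slide. A short check, where the function name is just an illustrative choice:

```python
def spot_saving_percent(on_demand_price: float, spot_price: float) -> float:
    """Percentage saved by paying the spot price instead of on-demand."""
    return (on_demand_price - spot_price) / on_demand_price * 100.0


saving = spot_saving_percent(1.400, 0.15)
print(f"{saving:.0f}% cheaper")  # prints "89% cheaper"
```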

Summary
•Spark and EC2 is a very common combination
•Because it makes your life easier
•And cheaper
•spark-cloud script will help you
•You can just worry about writing good Spark code!

Amazon S3 Tips
•Don’t use s3n://
•Use s3a:// with Hadoop 2.6
–Parallel rename, especially important for committing output
–Supports IAM authentication
–No “xyz_$folder$” files
–Input seek
–Multipart upload (no 5 GB limit)
–Error recovery and retry
More info: https://issues.apache.org/jira/browse/HADOOP-10400
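For existing jobs, switching usually means changing the scheme in the paths. A small helper along these lines could do it; the function name and the bucket path are made up for the example.

```python
def to_s3a(uri: str) -> str:
    """Rewrite an s3n:// URI to s3a://; leave any other URI untouched."""
    old_prefix = "s3n://"
    if uri.startswith(old_prefix):
        return "s3a://" + uri[len(old_prefix):]
    return uri


print(to_s3a("s3n://example-bucket/logs/2015/11/26/"))
# prints "s3a://example-bucket/logs/2015/11/26/"
print(to_s3a("hdfs:///tmp/output"))
# prints "hdfs:///tmp/output"
```

In a Spark job the rewritten path would then be passed to the usual read call, for example `sc.textFile(to_s3a(path))`, assuming the cluster's Hadoop build has the s3a connector on its classpath.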

Why not EMR?
•Why pay for EMR? It costs more than a spot instance
•Vendor lock-in and proprietary libraries
•netlib-java

These concerns have prompted calls for comprehensive regulations to ensure the safe and equitable use of AI technologies. Artificial intelligence has also played a significant role in healthcare, particularly highlighted during the COVID-19 pandemic, where it was used in drug discovery, vaccine development, and analyzing the spread of the virus. The capabilities of AI in healthcare are vast, ranging from medical diagnostics to personalized medicine, demonstrating the technology's potential to revolutionize fields beyond just technical or consumer applications. In conclusion, AI continues to be a rapidly evolving field with significant implications for various aspects of society. The development from theoretical concepts to real-world applications illustrates both the potential benefits and the challenges that come with integrating advanced technologies into everyday life. The ongoing discussion about AI ethics and regulation underscores the importance of managing these technologies responsibly to maximize their their benefits while minimizing potential harms.

Artificial Intelligence: Facts and Myths

Joaquim Jorge

Building Digital Trust in a Digital Economy Veronica Tan, Director - Cyber Security Agency of Singapore Apidays Singapore 2024: Connecting Customers, Business and Technology (April 17 & 18, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...

apidays

BooK Now Call us at +918448380779 to hire a gorgeous and seductive call girl for sex. Take a Delhi Escort Service. The help of our escort agency is mostly meant for men who want sexual Indian Escorts In Delhi NCR. It should be noted that any impersonator will get 100 attention from our Young Girls Escorts in Delhi. They will assume the position of reliable allies. VIP Call Girl With Original Photos Book Tonight +918448380779 Our Cheap Price 1 Hour not available 2 Hours 5000 Full Night 8000 TAG: Call Girls in Delhi, Noida, Gurgaon, Ghaziabad, Connaught Place, Greater Kailash Delhi, Lajpat Nagar Delhi, Mayur Vihar Delhi, Chanakyapuri Delhi, New Friends Colony Delhi, Majnu Ka Tilla, Karol Bagh, Malviya Nagar, Saket, Khan Market, Noida Sector 18, Noida Sector 76, Noida Sector 51, Gurgaon Mg Road, Iffco Chowk Gurgaon, Rajiv Chowk Gurgaon All Delhi Ncr Free Home Deliver

08448380779 Call Girls In Civil Lines Women Seeking Men

Delhi Call girls

Finology Group – Insurtech Innovation Award 2024

The Digital Insurer

🐬 The future of MySQL is Postgres 🐘

RTylerCroy

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

Delhi Call girls

In this session, we will delve into strategic approaches for optimizing knowledge management within Microsoft 365, amidst the evolving landscape of Copilot. From leveraging automatic metadata classification and permission governance with SharePoint Premium, to unlocking Viva Engage for the cultivation of knowledge and communities, you will gain actionable insights to bolster your organization's knowledge-sharing initiatives. In this session, we will also explore how to facilitate solutions to enable your employees to find answers and expertise within Microsoft 365. You will leave equipped with practical techniques and a deeper understanding of how there is more to effective knowledge management than just enabling Copilot, but building actual solutions to prepare the knowledge that Copilot and your employees can use.

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Drew Madelung

Imagine a world where information flows as swiftly as thought itself, making decision-making as fluid as the data driving it. Every moment is critical, and the right tools can significantly boost your organization’s performance. The power of real-time data automation through FME can turn this vision into reality. Aimed at professionals eager to leverage real-time data for enhanced decision-making and efficiency, this webinar will cover the essentials of real-time data and its significance. We’ll explore: FME’s role in real-time event processing, from data intake and analysis to transformation and reporting An overview of leveraging streams vs. automations FME’s impact across various industries highlighted by real-life case studies Live demonstrations on setting up FME workflows for real-time data Practical advice on getting started, best practices, and tips for effective implementation Join us to enhance your skills in real-time data automation with FME, and take your operational capabilities to the next level.

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

Safe Software

Scaling API-first – The story of a global engineering organization

Radu Cotescu

In an era where artificial intelligence (AI) stands at the forefront of business innovation, Information Architecture (IA) is at the core of functionality. See “There’s No AI Without IA” – (from 2016 but even more relevant today) Understanding and leveraging how Information Architecture (IA) supports AI synergies between knowledge engineering and prompt engineering is critical for senior leaders looking to successfully deploy AI for internal and externally facing knowledge processes. This webinar be a high-level overview of the methodologies that can elevate AI-driven knowledge processes supporting both employees and customers. Core Insights Include: Strategic Knowledge Engineering: Delve into how structuring AI's knowledge base is required to prevent hallucinations, enable contextual retrieval of accurate information. This will include discussion of gold standard libraries of use cases support testing various LLMs and structures and configurations of knowledge base. Precision in Prompt Engineering: Learn the art of crafting prompts that direct AI to deliver targeted, relevant responses, thereby optimizing customer experiences and business outcomes. Unified Approach for Enhanced AI Performance: Explore the intersection of knowledge and prompt engineering to develop AI systems that are not only more responsive but also aligned with overarching business strategies. Guiding Principles for Implementation: Equip yourself with best practices, ethical guidelines, and strategic considerations for embedding these technologies into your business ecosystem effectively. This webinar is designed to empower business and technology leaders with the knowledge to harness the full potential of AI, ensuring their organizations not only keep pace with digital transformation but lead the charge. Join us to map a roadmap to fully leverage Information Architecture (IA) and AI chart a course towards a future where AI is a key pillar of strategic innovation and business success.

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx

Earley Information Science

Three things you will take away from the session: • How to run an effective tenant-to-tenant migration • Best practices for before, during, and after migration • Tips for using migration as a springboard to prepare for Copilot in Microsoft 365 Main ideas: Migration Overview: The presentation covers the current reality of cross-tenant migrations, the triggers, phases, best practices, and benefits of a successful tenant migration Considerations: When considering a migration, it is important to consider the migration scope, performance, customization, flexibility, user-friendly interface, automation, monitoring, support, training, scalability, data integrity, data security, cost, and licensing structure Next Wave: The next wave of change includes the launch of Copilot, which requires businesses to be prepared for upcoming changes related to Copilot and the cloud, and to consolidate data and tighten governance ShareGate: ShareGate can help with pre-migration analysis, configurable migration tool, and automated, end-user driven collaborative governance

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff

sammart93

08448380779 Call Girls In Greater Kailash - I Women Seeking Men

Delhi Call girls


Autoscaling Spark on AWS EC2 - 11th Spark London meetup

1. Autoscaling Spark for Fun and Profit. Rafal Kwasny, 11th Spark London Meetup, 2015-11-26

2. Who am I •DevOps •Built a few platforms in my life •Mostly adtech, in-game analytics for Sony PlayStation •Currently advising investment banks •CTO, Entropy Investments

3. How do you run Spark? •Who runs on AWS? •Who uses EMR?

4. So how to use autoscaling on AWS?

5. Overview •Typical architecture for AWS •How autoscaling works •Scripts to make your life easier

6. Typical architecture for AWS

7. Typical architecture for AWS: Generate some data

8. Typical architecture for AWS: Store it in S3

9. Typical architecture for AWS: or store it in a message queue

10. Typical architecture for AWS: Use your favourite tool for ETL

11. Typical architecture for AWS: Ship it back to S3

12. Typical architecture for AWS: Or send it somewhere

13. Typical architecture for AWS: - EMR - spark-ec2 - build cluster from scratch

14. Map-reduce is about quickly writing very inefficient code and then running it at massive scale (C) Someone

15. Problem •EC2 is a pay-for-what-you-use model •You just have to decide how many resources you want to use before starting a cluster

16. Problem: most common problems while running on EC2. Scaling up •My team needs a new cluster, how big should it be? Scaling down •Did I shut down the DEV cluster before leaving the office on Friday evening?

17. How to automate scaling?

18. Types of scaling. Vertical scaling - „Let’s get a bigger box” •Change instance type •Change EBS parameters. Horizontal scaling - „Just add more nodes”

19. Autoscaling •Automatic resizing based on demand •Define minimum/maximum instance count •Define when scaling should occur •Use metrics •Run your jobs and don’t worry about infrastructure
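
The rules on this slide (a minimum and maximum instance count, plus a metric that decides when to scale) can be sketched in a few lines of Python. This is only an illustration of the idea, not how AWS implements it: the `ScalingPolicy` class, the thresholds and the one-node-at-a-time step are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy; field names and thresholds are illustrative."""
    min_instances: int
    max_instances: int
    scale_up_above: float    # metric value above which one node is added
    scale_down_below: float  # metric value below which one node is removed


def desired_capacity(current: int, metric: float, policy: ScalingPolicy) -> int:
    """Return the next instance count, always clamped to [min, max]."""
    if metric > policy.scale_up_above:
        target = current + 1
    elif metric < policy.scale_down_below:
        target = current - 1
    else:
        target = current
    return max(policy.min_instances, min(policy.max_instances, target))


# Assumed example values: 2 to 10 workers, driven by average CPU utilisation.
policy = ScalingPolicy(min_instances=2, max_instances=10,
                       scale_up_above=75.0, scale_down_below=20.0)

print(desired_capacity(4, 90.0, policy))   # busy cluster: grows by one node
print(desired_capacity(10, 90.0, policy))  # busy, but already at the maximum
print(desired_capacity(2, 5.0, policy))    # idle, but already at the minimum
```

Clamping after the decision keeps the group inside its bounds no matter what the metric reports, which is what makes "don't worry about infrastructure" safe on both the cost side and the capacity side.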

20. Architecture with autoscaling

21. Using RAM/local SSDs for caching Only saving output into S3

22. Fault recovery

23. Autoscaling components •AMI - machine image with installed Spark •Launch configuration - defines: •AMI •instance type •instance storage •public IP •security groups

24. Autoscaling components •Autoscaling group •launch configuration •availability zones •VPC details •min/max servers •when to scale •metrics/health checks
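
To make the two components concrete, here is a minimal Python sketch that assembles them as plain dictionaries. The keys follow the parameter names of the AWS Auto Scaling API calls `CreateLaunchConfiguration` and `CreateAutoScalingGroup`; everything else (the helper functions, the AMI ID, group names, subnets and sizes) is a placeholder assumed for illustration. The sketch makes no AWS calls.

```python
def launch_configuration(name, ami_id, instance_type, security_groups):
    """Collect the launch configuration fields listed on slide 23."""
    return {
        "LaunchConfigurationName": name,
        "ImageId": ami_id,                  # AMI with Spark installed
        "InstanceType": instance_type,
        # Instance storage: expose the first local (ephemeral) disk.
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
        ],
        "AssociatePublicIpAddress": True,   # public IP
        "SecurityGroups": list(security_groups),
    }


def autoscaling_group(name, launch_config, zones, subnets, min_size, max_size):
    """Collect the autoscaling group fields listed on slide 24."""
    if not 0 <= min_size <= max_size:
        raise ValueError("need 0 <= min_size <= max_size")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config["LaunchConfigurationName"],
        "AvailabilityZones": list(zones),
        "VPCZoneIdentifier": ",".join(subnets),  # VPC details: subnet IDs
        "MinSize": min_size,
        "MaxSize": max_size,
        "HealthCheckType": "EC2",                # health checks
        "HealthCheckGracePeriod": 300,           # seconds, assumed value
    }


# All identifiers below are made-up placeholders.
lc = launch_configuration("spark-worker-lc", "ami-00000000",
                          "r3.xlarge", ["sg-00000000"])
asg = autoscaling_group("spark-workers", lc,
                        zones=["eu-west-1a", "eu-west-1b"],
                        subnets=["subnet-aaaa", "subnet-bbbb"],
                        min_size=2, max_size=10)
print(asg["LaunchConfigurationName"], asg["MinSize"], asg["MaxSize"])
```

The "when to scale" part is not shown here; on AWS it is usually attached separately as scaling policies triggered by metric alarms.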

25. Putting it all together: then you can run your job

26. Complicated? •AWS provides a lot of services

27. spark-cloud •Better scripts to start Spark clusters on EC2 •Alpha version •https://github.com/entropyltd/spark-cloud

28.

29.

30.

31.

32.

33.

34.

35.

36.

37.

38. What’s inside spark-cloud: building AMIs through Packer. Packer is a tool for creating machine and container images for multiple platforms from a single source configuration. Supports AWS, DigitalOcean, Docker, OpenStack, Parallels, QEMU, VirtualBox, VMware

39. Current functionality •Start cluster •Shut down cluster •But more to come :)

40. Spot instances

41. Spot instances –On-Demand: $1.400 –Spot: $0.15 –89% cheaper
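
The 89% figure follows directly from the two prices on the slide. A short Python check of the arithmetic (the function name is just a label chosen for this example):

```python
def savings_percent(on_demand: float, spot: float) -> float:
    """Percentage saved by paying the spot price instead of on-demand."""
    return (on_demand - spot) / on_demand * 100.0


on_demand_price = 1.400  # from the slide
spot_price = 0.15        # from the slide

# (1.400 - 0.15) / 1.400 = 0.8928..., i.e. roughly 89%
print(f"{savings_percent(on_demand_price, spot_price):.0f}% cheaper")  # prints "89% cheaper"
```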

42. Summary •Spark and EC2 are a very common combination •Because it makes your life easier •And cheaper •spark-cloud script will help you •You can just worry about writing good Spark code!

43. Thank You rafal@entropy.be

44.

45. Amazon S3 Tips •Don’t use s3n:// •Use s3a:// with Hadoop 2.6 –Parallel rename, especially important for committing output –Supports IAM authentication –No "xyz_$folder$" files –Input seek –Multipart upload (no 5GB limit) –Error recovery and retry. More info: https://issues.apache.org/jira/browse/HADOOP-10400
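
If existing job configs still carry `s3n://` paths, switching them over is mostly a matter of changing the URI scheme. A minimal Python sketch, assuming paths are held as plain strings; the helper name and the bucket are made up for the example, and the cluster itself still needs an S3A-capable Hadoop build as noted on the slide.

```python
def to_s3a(uri: str) -> str:
    """Rewrite an s3n:// URI to s3a://; leave any other URI untouched."""
    old_prefix = "s3n://"
    if uri.startswith(old_prefix):
        return "s3a://" + uri[len(old_prefix):]
    return uri


# Hypothetical input and output locations.
paths = ["s3n://example-bucket/logs/2015/11/26/", "s3a://example-bucket/out/"]
print([to_s3a(p) for p in paths])
```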

46. Why not EMR? •Why pay for EMR? It costs more than a spot instance •Vendor lock-in and proprietary libraries •netlib-java

Editor's Notes

How many of you use spark in production?
Single source of data; very good durability & availability. Offloading storage complexity to AWS.
Parquet: columnar store. Standard supported by Spark, Hive, Presto, Impala. Optimised for: column-based aggregations. Not optimised for: `select *` type queries, INSERT/UPDATEs.
When on EC2 you have 2 main options: spark-ec2 scripts, or EMR (Elastic Map-Reduce).
No HDFS, no state on workers.
