Anya Bida is a senior member of technical staff working on Spark tuning at Salesforce. She has a PhD from Mayo Clinic and a BS from Johns Hopkins. This talk covers DevOps concepts for data scientists, such as handling infrastructure failures when running Spark jobs. It gives an overview of Spark operations like map, reduceByKey, and saveAsTextFile, and discusses best practices for avoiding common Spark and HDFS failures: high availability, sufficient disk space, well-chosen partition counts, and persisting or checkpointing data.
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When Running Spark
2. About Anya (she/her)
Sr. Member of Technical Staff (SRE)
Salesforce Production Engineering
Salesforce Einstein Platform
Co-organizer SF Big Analytics
Spark Tuning
• Cheat-sheet
• Talks
Previously at Alpine Data, SRI
PhD Mayo Clinic, BS Johns Hopkins
@anyabida1
5. Just Enough DevOps for Data Scientists
Part II: Handling Infra Failures When Running Spark
abida@salesforce.com
@anyabida1
Anya Bida, SRE at Salesforce
13. https://spark.apache.org/examples.html
How to avoid potential HDFS failures
- Use high availability for the namenode
- Plenty of disk space for hdfs
- Plenty of disk space per disk
- Block replication = 3
- Monitor disk I/O, network connectivity
- Correct permissions
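As a sketch, a couple of the settings above map onto hdfs-site.xml like this (the property names are stock Hadoop; the reserved-space value is an illustrative assumption, not a recommendation):

```xml
<configuration>
  <!-- Keep three copies of every block (the Hadoop default) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Reserve space per disk so DataNode volumes never fill completely
       (10 GB here is an example value; size it for your workload) -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10737418240</value>
  </property>
  <!-- NameNode high availability is configured via dfs.nameservices and
       dfs.ha.namenodes.* in this same file; see the Hadoop HA docs -->
</configuration>
```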
21. Tasks run on executors
Apache Spark
How to avoid common task failures
- Use the default retry & exponential backoff settings
- Spark is tolerant of single- and multi-node failures
- Spark 2.2+ tolerates single-disk failures, even on non-RAID commodity hardware
- Optimize the number of partitions
- Beware of data skew & dirty data
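The retry knobs referenced above live in Spark configuration. A hedged spark-defaults.conf sketch showing the stock defaults (these values are Spark's documented defaults, listed for orientation, not as tuning advice):

```properties
spark.task.maxFailures              4    # task attempts before the job fails
spark.stage.maxConsecutiveAttempts  4    # stage retries before aborting
spark.shuffle.io.maxRetries         3    # shuffle fetch retries (Netty)
spark.shuffle.io.retryWait          5s   # wait between shuffle fetch retries
```

Leaving these at their defaults is usually right; raising them papers over flaky infrastructure rather than fixing it.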
23. RDD Re-use: Cache vs. Persist vs. Checkpoint vs. Local Checkpoint

|                                                 | Cache | Persist     | Checkpoint        | Local Checkpoint |
| Stored in local memory                          | Yes   | MEM levels  | No                | Yes              |
| Stored on local disk                            | No    | DISK levels | No                | Yes              |
| Stored on HDFS / S3                             | No    | No          | Yes (specify dir) | No               |
| Writes available if executor is decommissioned? | No    | No          | Yes               | No               |
| Writes available after the job finishes?        | No    | No          | Yes               | No               |
| Preserves lineage graph?                        | Yes   | Yes         | No                | No               |

Persist to improve speed, Checkpoint to improve fault tolerance
Key Messages
In the Fourth Industrial Revolution, artificial intelligence, robotics, and the Internet of Things (IoT) are transforming the customer experience.
The lines between the physical and digital worlds are blurring, especially with technology like voice command, autonomous cars, and smart devices that keep you connected and always-on.
Talk Track: Over the last 300 years, our world has seen incredible innovation and unprecedented technological change. Today, we're entering the Fourth Industrial Revolution, where artificial intelligence, robotics, and the Internet of Things (IoT) are transforming the customer experience.
AI is making devices and apps smarter
The lines between the physical and digital worlds are blurring
We are seeing incredible new products and services — like connected motorcycles and connected coolers
You can see this change in our everyday lives: shopping via voice command, autonomous cars, and smart devices that keep you connected and always-on.
Transition: These next-generation technologies are connecting us to our customers in a whole new way. And customer expectations, in turn, are changing.
Salesforce Einstein is serving 475 million predictions per day, and growing. So how do we do this from an infra perspective?
What DevOps actually IS:
-- a cross-section of infrastructure concerns
-- all the things data scientists need to support themselves at scale
When to use each:
- Persist(useDisk): when partitions cannot fit into memory, and when jobs are slow, e.g. due to network delays
- Checkpoint: when the cost to recompute is high, when there's no time to recompute on failure, or when jobs fail due to OOM or network interruptions
- Local checkpoint: when the lineage graph is very long
CloudWatch stores all of my cluster and host metrics by default. So when I've got a huge job running and my cluster memory drops significantly, I can watch that happen in CloudWatch.
Ganglia is nice for building dashboards and configuring monitors for my clusters.
I should mention logs too: the Spark History Server lets me view logs even for clusters that have been terminated.
This has been “Just enough devops for data scientists”