SlideShare a Scribd company logo
1 of 47
Download to read offline
Project Zen: Improving Apache
Spark for Python Users
Hyukjin Kwon
Databricks Software Engineer
Hyukjin Kwon
▪ Apache Spark Committer / PMC
▪ Major Koalas contributor
▪ Databricks Software Engineer
▪ @HyukjinKwon in GitHub
Agenda
What is Project Zen?
Redesigned Documentation
PySpark Type Hints
Distribution Option for PyPI Users
Roadmap
What is Project Zen?
Python Growth
68%
of notebook
commands on
Databricks are in
Python
PySpark Today
▪ Documentation difficult to
navigate
▪ All APIs under each module is listed in single page
▪ No other information or classification
▪ Lack of information
▪ No Quickstart page
▪ No Installation page
▪ No Introduction
PySpark documentation
PySpark Today
▪ IDE unfriendly
▪ Dynamically defined functions reported as missing
functions
▪ Lack of autocompletion support
▪ Lack of type checking support
▪ Notebook unfriendly
▪ Lack of autocompletion support
Missing import in IDE
Autocompletion in IDE
Autocompletion in Jupyter
PySpark Today
▪ Less Pythonic
▪ Deprecated Python built-in instance support when creating a DataFrame
>>> spark.createDataFrame([{'a': 1}])
/.../session.py:378: UserWarning: inferring schema from dict is deprecated, please use
pyspark.sql.Row instead
PySpark Today
▪ Missing distributions with other Hadoop versions in PyPI
▪ Missing Hadoop 3 distribution
▪ Missing Hive 1.2 distribution
PyPI DistributionApache Mirror
PySpark Today
▪ Inconsistent exceptions and warnings
▪ Unclassified exceptions and warnings
>>> spark.range(10).explain(1, 2)
Traceback (most recent call last):
...
Exception: extended and mode should not be set together.
The Zen of Python
PEP 20 -- The Zen of Python
Project Zen (SPARK-32082)
▪ Be Pythonic
▪ The Zen of Python
▪ Python friendly
▪ Better and easier use of PySpark
▪ Better documentation
▪ Clear exceptions and warnings
▪ Python type hints: autocompletion, static type checking and error detection
▪ More options for pip installation
▪ Better interoperability with other Python libraries
▪ pandas, pyarrow, NumPy, Koalas, etc.
▪ Visualization
Redesigned Documentation
Problems in PySpark Documentation
(Old) PySpark documentation
▪ Everything in few pages
▪ Whole module in single page w/o
classification
▪ Difficult to navigate
▪ Very long to stroll down
▪ Virtually no structure
▪ No other useful pages
▪ How to start?
▪ How to ship 3rd party packages together?
▪ How to install?
▪ How to debug / setup an IDE?
New PySpark Documentation
New PySpark documentation
New PySpark Documentation
New user guide page
New PySpark Documentation
Search
Other pages
Top menu
Contents in
the current
page
Sub-titles in
the current
page
New API Reference
New API reference page
New API Reference
New API reference page
└── module A
├── classification A
…
└── classification ...
New API Reference
New API reference page
New API Reference
New API reference page
Table for each classification
New API Reference
New API reference page
New API Reference
New API reference page
Each page for each API
Quickstart
Quickstart page
Live Notebook
Move to live notebook (Binder integration)
Live Notebook
Live notebook (Binder integration)
Other New Pages
New useful pages
PySpark Type Hints
What are Python Type Hints?
def greeting(name):
return 'Hello ' + name
Typical Python codes
def greeting(name: str) -> str:
return 'Hello ' + name
Python codes with type hints
def greeting(name: str) -> str: ...
Stub syntax (.pyi file)
Why are Python Type Hints good?
▪ IDE Support
▪ Notebook Support
▪ Documentation
▪ Static error detection
Before type hints
After type hints
Why are Python Type Hints good?
▪ IDE Support
▪ Notebook Support
▪ Documentation
▪ Static error detection
Before type hints
After type hints
Why are Python Type Hints good?
▪ IDE Support
▪ Notebook Support
▪ Documentation
▪ Static error detection
Before type hints
After type hints
Why are Python Type Hints good?
▪ IDE Support
▪ Notebook Support
▪ Documentation
▪ Static error detection
Static error detection
https://github.com/zero323/pyspark-stubs#motivation
Python Type Hints in PySpark
Built-in in the upcoming Apache Spark 3.1!
Community support: zero323/pyspark-stubs
User facing APIs only
Stub (.pyi) files
Installation Option for PyPI Users
PyPI Distribution
PySpark on PyPI
PyPI Distribution
▪ Multiple distributions available
▪ Hadoop 2.7 and Hive 1.2
▪ Hadoop 2.7 and Hive 2.3
▪ Hadoop 3.2 and Hive 2.3
▪ Hive 2.3 without Hadoop
▪ PySpark distribution in PyPI
▪ Hadoop 2.7 and Hive 1.3
Multiple distributions
in Apache Mirror
One distribution in PyPI
New Installation Options
HADOOP_VERSION=3.2 pip install pyspark
HADOOP_VERSION=2.7 pip install pyspark
HADOOP_VERSION=without pip install pyspark
PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install
Spark with Hadoop 3.2
Spark with Hadoop 2.7
Spark without Hadoop
Spark downloading from the
specified mirror
Why not pip --install-options?
Ongoing issues in pip
Roadmap
Roadmap
▪ Migrate to NumPy documentation style
▪ Better classification
▪ Better readability
▪ Widely used
"""Specifies some hint on the current
:class:`DataFrame`.
:param name: A name of the hint.
:param parameters: Optional parameters.
:return: :class:`DataFrame`
"""Specifies some hint on the
current :class:`DataFrame`.
Parameters
----------
name : str
A name of the hint.
parameters : dict, optional
Optional parameters
Returns
-------
DataFrame
Numpydoc stylereST style
Roadmap
▪ Standardize warnings and exceptions
▪ Classify the exception and warning types
▪ Python friendly messages instead of JVM stack trace
>>> spark.range(10).explain(1, 2)
Traceback (most recent call last):
...
Exception: extended and mode should not be set together.
Plain Exception being thrown
Roadmap
▪ Interoperability between NumPy,
Koalas, other libraries
▪ Common features in DataFrames
▪ NumPy universe function
▪ Visualization and plotting
▪ Make a chart from Spark DataFrame
Re-cap
Re-cap
▪ Python and PySpark are becoming more and more popular
▪ PySpark documentation is redesigned with many new pages
▪ Auto-completion and type checking in IDE and notebooks
▪ PySpark download options in PyPI
Re-cap: What’s next?
▪ Migrate to NumPy documentation style
▪ Standardize warnings and exceptions
▪ Visualization
▪ Interoperability between NumPy, Koalas, other libraries
Question?

More Related Content

What's hot

Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsDataWorks Summit/Hadoop Summit
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...Spark Summit
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit
 
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringSputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringDatabricks
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!Helium makes Zeppelin fly!
Helium makes Zeppelin fly!DataWorks Summit
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Spark Summit
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Spark Summit
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
 
Scaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at LyftScaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at LyftDatabricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesDatabricks
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 

What's hot (20)

Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringSputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!Helium makes Zeppelin fly!
Helium makes Zeppelin fly!
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
 
Scaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at LyftScaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at Lyft
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot Instances
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 

Similar to Project Zen: Improving Apache Spark for Python Users

PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupHolden Karau
 
SciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programmingSciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programmingSamuel Lampa
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent
 
python full stack course in hyderabad...
python full stack course in hyderabad...python full stack course in hyderabad...
python full stack course in hyderabad...sowmyavibhin
 
python full stack course in hyderabad...
python full stack course in hyderabad...python full stack course in hyderabad...
python full stack course in hyderabad...sowmyavibhin
 
Developing Brilliant and Powerful APIs in Ruby & Python
Developing Brilliant and Powerful APIs in Ruby & PythonDeveloping Brilliant and Powerful APIs in Ruby & Python
Developing Brilliant and Powerful APIs in Ruby & PythonSmartBear
 
State of Big Data on ARM64 / AArch64 - Apache Bigtop
State of Big Data on ARM64 / AArch64 - Apache BigtopState of Big Data on ARM64 / AArch64 - Apache Bigtop
State of Big Data on ARM64 / AArch64 - Apache BigtopGanesh Raju
 
Icinga 2010 at CeBIT
Icinga 2010 at CeBITIcinga 2010 at CeBIT
Icinga 2010 at CeBITIcinga
 
GitLab Frontend and VueJS at GitLab
GitLab Frontend and VueJS at GitLabGitLab Frontend and VueJS at GitLab
GitLab Frontend and VueJS at GitLabFatih Acet
 
Logs aggregation and analysis
Logs aggregation and analysisLogs aggregation and analysis
Logs aggregation and analysisDivante
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesDatabricks
 
Python Dependency Management - PyconDE 2018
Python Dependency Management - PyconDE 2018Python Dependency Management - PyconDE 2018
Python Dependency Management - PyconDE 2018Patrick Muehlbauer
 
Apache Jena Elephas and Friends
Apache Jena Elephas and FriendsApache Jena Elephas and Friends
Apache Jena Elephas and FriendsRob Vesse
 
Chef on SmartOS
Chef on SmartOSChef on SmartOS
Chef on SmartOSEric Saxby
 
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...Sascha Junkert
 
Django: Beyond Basics
Django: Beyond BasicsDjango: Beyond Basics
Django: Beyond Basicsarunvr
 
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)DataWorks Summit
 
Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.Junichi Ishida
 
ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300Timothy Spann
 

Similar to Project Zen: Improving Apache Spark for Python Users (20)

PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March Meetup
 
SciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programmingSciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programming
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
python full stack course in hyderabad...
python full stack course in hyderabad...python full stack course in hyderabad...
python full stack course in hyderabad...
 
python full stack course in hyderabad...
python full stack course in hyderabad...python full stack course in hyderabad...
python full stack course in hyderabad...
 
Developing Brilliant and Powerful APIs in Ruby & Python
Developing Brilliant and Powerful APIs in Ruby & PythonDeveloping Brilliant and Powerful APIs in Ruby & Python
Developing Brilliant and Powerful APIs in Ruby & Python
 
State of Big Data on ARM64 / AArch64 - Apache Bigtop
State of Big Data on ARM64 / AArch64 - Apache BigtopState of Big Data on ARM64 / AArch64 - Apache Bigtop
State of Big Data on ARM64 / AArch64 - Apache Bigtop
 
Icinga 2010 at CeBIT
Icinga 2010 at CeBITIcinga 2010 at CeBIT
Icinga 2010 at CeBIT
 
GitLab Frontend and VueJS at GitLab
GitLab Frontend and VueJS at GitLabGitLab Frontend and VueJS at GitLab
GitLab Frontend and VueJS at GitLab
 
Sppp presentation
Sppp presentationSppp presentation
Sppp presentation
 
Logs aggregation and analysis
Logs aggregation and analysisLogs aggregation and analysis
Logs aggregation and analysis
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial Pipelines
 
Python Dependency Management - PyconDE 2018
Python Dependency Management - PyconDE 2018Python Dependency Management - PyconDE 2018
Python Dependency Management - PyconDE 2018
 
Apache Jena Elephas and Friends
Apache Jena Elephas and FriendsApache Jena Elephas and Friends
Apache Jena Elephas and Friends
 
Chef on SmartOS
Chef on SmartOSChef on SmartOS
Chef on SmartOS
 
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
 
Django: Beyond Basics
Django: Beyond BasicsDjango: Beyond Basics
Django: Beyond Basics
 
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
 
Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.
 
ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Project Zen: Improving Apache Spark for Python Users

  • 1. Project Zen: Improving Apache Spark for Python Users Hyukjin Kwon Databricks Software Engineer
  • 2. Hyukjin Kwon ▪ Apache Spark Committer / PMC ▪ Major Koalas contributor ▪ Databricks Software Engineer ▪ @HyukjinKwon in GitHub
  • 3. Agenda What is Project Zen? Redesigned Documentation PySpark Type Hints Distribution Option for PyPI Users Roadmap
  • 5. Python Growth 68% of notebook commands on Databricks are in Python
  • 6. PySpark Today ▪ Documentation difficult to navigate ▪ All APIs under each module is listed in single page ▪ No other information or classification ▪ Lack of information ▪ No Quickstart page ▪ No Installation page ▪ No Introduction PySpark documentation
  • 7. PySpark Today ▪ IDE unfriendly ▪ Dynamically defined functions reported as missing functions ▪ Lack of autocompletion support ▪ Lack of type checking support ▪ Notebook unfriendly ▪ Lack of autocompletion support Missing import in IDE Autocompletion in IDE Autocompletion in Jupyter
  • 8. PySpark Today ▪ Less Pythonic ▪ Deprecated Python built-in instance support when creating a DataFrame >>> spark.createDataFrame([{'a': 1}]) /.../session.py:378: UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead
  • 9. PySpark Today ▪ Missing distributions with other Hadoop versions in PyPI ▪ Missing Hadoop 3 distribution ▪ Missing Hive 1.2 distribution PyPI DistributionApache Mirror
  • 10. PySpark Today ▪ Inconsistent exceptions and warnings ▪ Unclassified exceptions and warnings >>> spark.range(10).explain(1, 2) Traceback (most recent call last): ... Exception: extended and mode should not be set together.
  • 11. The Zen of Python PEP 20 -- The Zen of Python
  • 12. Project Zen (SPARK-32082) ▪ Be Pythonic ▪ The Zen of Python ▪ Python friendly ▪ Better and easier use of PySpark ▪ Better documentation ▪ Clear exceptions and warnings ▪ Python type hints: autocompletion, static type checking and error detection ▪ More options for pip installation ▪ Better interoperability with other Python libraries ▪ pandas, pyarrow, NumPy, Koalas, etc. ▪ Visualization
  • 14. Problems in PySpark Documentation (Old) PySpark documentation ▪ Everything in few pages ▪ Whole module in single page w/o classification ▪ Difficult to navigate ▪ Very long to stroll down ▪ Virtually no structure ▪ No other useful pages ▪ How to start? ▪ How to ship 3rd party packages together? ▪ How to install? ▪ How to debug / setup an IDE?
  • 15. New PySpark Documentation New PySpark documentation
  • 16. New PySpark Documentation New user guide page
  • 17. New PySpark Documentation Search Other pages Top menu Contents in the current page Sub-titles in the current page
  • 18. New API Reference New API reference page
  • 19. New API Reference New API reference page └── module A ├── classification A … └── classification ...
  • 20. New API Reference New API reference page
  • 21. New API Reference New API reference page Table for each classification
  • 22. New API Reference New API reference page
  • 23. New API Reference New API reference page Each page for each API
  • 25. Live Notebook Move to live notebook (Binder integration)
  • 26. Live Notebook Live notebook (Binder integration)
  • 27. Other New Pages New useful pages
  • 29. What are Python Type Hints? def greeting(name): return 'Hello ' + name Typical Python codes def greeting(name: str) -> str: return 'Hello ' + name Python codes with type hints def greeting(name: str) -> str: ... Stub syntax (.pyi file)
  • 30. Why are Python Type Hints good? ▪ IDE Support ▪ Notebook Support ▪ Documentation ▪ Static error detection Before type hints After type hints
  • 31. Why are Python Type Hints good? ▪ IDE Support ▪ Notebook Support ▪ Documentation ▪ Static error detection Before type hints After type hints
  • 32. Why are Python Type Hints good? ▪ IDE Support ▪ Notebook Support ▪ Documentation ▪ Static error detection Before type hints After type hints
  • 33. Why are Python Type Hints good? ▪ IDE Support ▪ Notebook Support ▪ Documentation ▪ Static error detection Static error detection https://github.com/zero323/pyspark-stubs#motivation
  • 34. Python Type Hints in PySpark Built-in in the upcoming Apache Spark 3.1! Community support: zero323/pyspark-stubs User facing APIs only Stub (.pyi) files
  • 37. PyPI Distribution ▪ Multiple distributions available ▪ Hadoop 2.7 and Hive 1.2 ▪ Hadoop 2.7 and Hive 2.3 ▪ Hadoop 3.2 and Hive 2.3 ▪ Hive 2.3 without Hadoop ▪ PySpark distribution in PyPI ▪ Hadoop 2.7 and Hive 1.3 Multiple distributions in Apache Mirror One distribution in PyPI
  • 38. New Installation Options HADOOP_VERSION=3.2 pip install pyspark HADOOP_VERSION=2.7 pip install pyspark HADOOP_VERSION=without pip install pyspark PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install Spark with Hadoop 3.2 Spark with Hadoop 2.7 Spark without Hadoop Spark downloading from the specified mirror
  • 39. Why not pip --install-options? Ongoing issues in pip
  • 41. Roadmap ▪ Migrate to NumPy documentation style ▪ Better classification ▪ Better readability ▪ Widely used """Specifies some hint on the current :class:`DataFrame`. :param name: A name of the hint. :param parameters: Optional parameters. :return: :class:`DataFrame` """Specifies some hint on the current :class:`DataFrame`. Parameters ---------- name : str A name of the hint. parameters : dict, optional Optional parameters Returns ------- DataFrame Numpydoc stylereST style
  • 42. Roadmap ▪ Standardize warnings and exceptions ▪ Classify the exception and warning types ▪ Python friendly messages instead of JVM stack trace >>> spark.range(10).explain(1, 2) Traceback (most recent call last): ... Exception: extended and mode should not be set together. Plain Exception being thrown
  • 43. Roadmap ▪ Interoperability between NumPy, Koalas, other libraries ▪ Common features in DataFrames ▪ NumPy universe function ▪ Visualization and plotting ▪ Make a chart from Spark DataFrame
  • 45. Re-cap ▪ Python and PySpark are becoming more and more popular ▪ PySpark documentation is redesigned with many new pages ▪ Auto-completion and type checking in IDE and notebooks ▪ PySpark download options in PyPI
  • 46. Re-cap: What’s next? ▪ Migrate to NumPy documentation style ▪ Standardize warnings and exceptions ▪ Visualization ▪ Interoperability between NumPy, Koalas, other libraries