Koalas provides a pandas-like API for Apache Spark. It lets pandas code scale out to large datasets by translating pandas operations into Spark SQL logical plans. This deck covers how Koalas DataFrames compare to PySpark DataFrames, converting between the two, the available types of default index, Spark I/O functionality, the Spark accessor for reaching Spark APIs directly, and a demo.
2. About Us
Takuya Ueshin
Software Engineer @ Databricks
- Apache Spark committer and PMC member
- Focusing on Spark SQL and PySpark
- Koalas contributor
Haejoon Lee
Software Engineer @ Databricks
- Koalas contributor
3. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
4. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
5. What’s Koalas?
Announced April 24, 2019
Provides a drop-in replacement for pandas
- enabling efficient scaling out to hundreds of worker nodes
For pandas users
- Scale out the pandas code using Koalas
- Make learning PySpark much easier
For PySpark users
- Become more productive with pandas-like functions
6. pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem
- NumPy
- Matplotlib
- scikit-learn
[Chart: Stack Overflow Trends]
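That integration means pandas objects interoperate directly with NumPy and the rest of the ecosystem. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Build a DataFrame straight from a NumPy array
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# Typical pandas-style manipulation: derive a new column element-wise
df["sum"] = df["x"] + df["y"]
print(df["sum"].tolist())  # -> [1, 5, 9]
```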
7. Apache Spark
De facto unified analytics engine for large-scale data processing
- Streaming
- ETL
- ML
Originally created at UC Berkeley by Databricks’ founders
PySpark API for Python; also API support for Scala/Java, R and SQL
8. Koalas DataFrame and PySpark DataFrame
Koalas DataFrame
- Follows the structure of pandas and provides pandas APIs
- Implements an index/identifier
- Translates pandas APIs into a logical plan of Spark SQL
- The plan is then optimized and executed by the Spark SQL engine
PySpark DataFrame
- More compliant with the relations/tables in relational databases
- Does not have unique row identifiers
9. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
10. Conversion from PySpark DataFrame
spark_df.to_koalas()
- Attached to Spark DataFrames when Koalas is imported
- index_col parameter
- Indicates which columns should be used as the index
- If not specified, a “default index” is attached
12. Conversion to PySpark DataFrame
koalas_df.to_spark()
- Also koalas_df.spark.frame()
- index_col parameter
- Indicates the column names to use for the index
- If not specified, the index columns are lost
14. Index and Default Index
- Koalas manages a group of columns as an index.
- The index behaves the same as pandas’.
- to_koalas() has index_col parameter to specify index columns.
- If no index is specified when creating a Koalas DataFrame:
it attaches a “default index” automatically.
- Koalas has 3 types of “default index”.
- Each “default index” has Pros and Cons.
15. Comparison of Default Index Types
Configurable by the option “compute.default_index_type”
Type                 | Distributed computation     | Map-side operation                  | Continuous increment | Performance
sequence             | No, in a single worker node | No, requires a shuffle              | Yes                  | Bad for large datasets
distributed-sequence | Yes                         | Yes, but requires another Spark job | Yes                  | Good enough
distributed          | Yes                         | Yes                                 | No                   | Good
16. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
17. Using Spark I/O
Functions to read/write data use Spark I/O under the hood.
- ks.read_csv / DataFrame.to_csv
- ks.read_json / DataFrame.to_json
- ks.read_parquet / DataFrame.to_parquet
- ks.read_sql_table
- ks.read_sql_query
index_col parameter is available to specify the index columns.
Take keyword arguments for additional Spark I/O options.
18. Using Spark I/O
Koalas-specific I/O functions:
- ks.read_table / DataFrame.to_table
- ks.read_spark_io / DataFrame.to_spark_io
- ks.read_delta / DataFrame.to_delta
index_col parameter is available.
Take keyword arguments for additional Spark I/O options.
19. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
20. Spark accessor
Provides functions to leverage the existing PySpark APIs more easily.
- transform/apply for using Spark APIs directly.
- Series.spark.transform
- Series.spark.apply
- DataFrame.spark.apply
21. Spark accessor
Provides functions to leverage the existing PySpark APIs more easily.
- Check the underlying Spark data type or schema
- Series.spark.data_type
- DataFrame.spark.schema / print_schema
- Check the execution plan
- DataFrame.spark.explain
- Cache the DataFrame
- DataFrame.spark.cache
- Hints
- DataFrame.spark.hint
23. Getting started
- Pre-installed in Databricks Runtime (7.1 and higher)
- pip install koalas
- conda install -c conda-forge koalas
- GitHub: github.com/databricks/koalas
- Docs: https://koalas.readthedocs.io/en/latest/
- A 10-minute tutorial in a live Jupyter notebook is available from the docs
- blog posts
- 10 Minutes from pandas to Koalas on Apache Spark
https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
- Interoperability between Koalas and Apache Spark
https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html
24. Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Contributing is very easy:
koalas.readthedocs.io/en/latest/development/contributing.html