Koalas provides a pandas-like API for Apache Spark. It lets pandas code scale out to large datasets by translating pandas operations into Spark SQL logical plans. This deck covers how Koalas DataFrames compare to PySpark DataFrames, converting between the two, the available types of default index, Spark I/O functionality, the Spark accessor for reaching Spark APIs directly, and a demo.
2. About Us
Takuya Ueshin
Software Engineer @ Databricks
- Apache Spark committer and PMC member
- Focusing on Spark SQL and PySpark
- Koalas contributor
Haejoon Lee
Software Engineer @ Databricks
- Koalas contributor
3. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
4. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
5. What’s Koalas?
Announced April 24, 2019
Provides a drop-in replacement for pandas
- enabling efficient scaling out to hundreds of worker nodes
For pandas users
- Scale out the pandas code using Koalas
- Make learning PySpark much easier
For PySpark users
- Become more productive with pandas-like functions
6. pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem
- NumPy
- Matplotlib
- scikit-learn
[Chart: Stack Overflow Trends]
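That integration means pandas objects interoperate directly with NumPy and the rest of the ecosystem. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Build a DataFrame straight from a NumPy array
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# Typical pandas-style manipulation: derive a new column element-wise
df["sum"] = df["x"] + df["y"]
print(df["sum"].tolist())  # -> [1, 5, 9]
```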
7. Apache Spark
De facto unified analytics engine for large-scale data processing
- Streaming
- ETL
- ML
Originally created at UC Berkeley by Databricks’ founders
PySpark API for Python; also API support for Scala/Java, R and SQL
8. Koalas DataFrame and PySpark DataFrame
Koalas DataFrame
- Follows the structure of pandas and provides pandas APIs
- Implements an index/identifier
- Translates pandas APIs into a logical plan of Spark SQL
- The plan is then optimized and executed by the Spark SQL engine
PySpark DataFrame
- More compliant with the relations/tables in relational databases
- Does not have unique row identifiers
9. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
10. Conversion from PySpark DataFrame
spark_df.to_koalas()
- Attached to Spark DataFrames when Koalas is imported
- index_col parameter
- Indicates which columns should be used as the index
- If not specified, a “default index” is attached
12. Conversion to PySpark DataFrame
koalas_df.to_spark()
- Also koalas_df.spark.frame()
- index_col parameter
- Indicates the column names to use for the index
- If not specified, the index columns are lost
14. Index and Default Index
- Koalas manages a group of columns as an index.
- The index behaves the same as pandas’.
- to_koalas() has index_col parameter to specify index columns.
- If no index is specified when creating a Koalas DataFrame:
it attaches a “default index” automatically.
- Koalas has 3 types of “default index”.
- Each “default index” has Pros and Cons.
15. Comparison of Default Index Types
Configurable by the option “compute.default_index_type”
Type                 | Distributed computation     | Map-side operation                  | Continuous increment | Performance
sequence             | No, in a single worker node | No, requires a shuffle              | Yes                  | Bad for large datasets
distributed-sequence | Yes                         | Yes, but requires another Spark job | Yes                  | Good enough
distributed          | Yes                         | Yes                                 | No                   | Good
16. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
17. Using Spark I/O
Functions to read/write data use Spark I/O under the hood.
- ks.read_csv / DataFrame.to_csv
- ks.read_json / DataFrame.to_json
- ks.read_parquet / DataFrame.to_parquet
- ks.read_sql_table
- ks.read_sql_query
index_col parameter is available to specify the index columns.
Take keyword arguments for additional Spark I/O options.
18. Using Spark I/O
Koalas-specific I/O functions:
- ks.read_table / DataFrame.to_table
- ks.read_spark_io / DataFrame.to_spark_io
- ks.read_delta / DataFrame.to_delta
index_col parameter is available.
Take keyword arguments for additional Spark I/O options.
19. Agenda
Introduction of Koalas
- pandas
- PySpark
Conversion from and to PySpark
- Index and Default Index
Spark I/O
- pandas
- Koalas specific
Spark accessor
Demo
20. Spark accessor
Provides functions to leverage the existing PySpark APIs more easily.
- transform/apply for using Spark APIs directly.
- Series.spark.transform
- Series.spark.apply
- DataFrame.spark.apply
21. Spark accessor
Provides functions to leverage the existing PySpark APIs more easily.
- Check the underlying Spark data type or schema
- Series.spark.data_type
- DataFrame.spark.schema / print_schema
- Check the execution plan
- DataFrame.spark.explain
- Cache the DataFrame
- DataFrame.spark.cache
- Hints
- DataFrame.spark.hint
23. Getting started
- Pre-installed in Databricks Runtime (7.1 and higher)
- pip install koalas
- conda install -c conda-forge koalas
- GitHub: github.com/databricks/koalas
- Docs: https://koalas.readthedocs.io/en/latest/
- A 10-minute tutorial in a live Jupyter notebook is available from the docs
- blog posts
- 10 Minutes from pandas to Koalas on Apache Spark
https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
- Interoperability between Koalas and Apache Spark
https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html
24. Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Contributing is very easy:
koalas.readthedocs.io/en/latest/development/contributing.html