Apache Iceberg Presentation for the St. Louis Big Data IDEA

Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format for large analytic datasets that works with engines such as Hive and Spark.

© 2021 Cloudera, Inc. All rights reserved.
What is Apache Iceberg?
• Efficient Table Format
– Hidden Partitioning
– Schema Evolution
– Time Travel
• Presto, Hive, Spark
• Created at Netflix (2017)
• Used at Adobe, Apple, LinkedIn, Experian
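
As a rough illustration of hidden partitioning, the sketch below derives a partition value from a timestamp column with a day transform (days since 1970-01-01), so a reader filters on the timestamp and never names a partition column. This is a toy in-memory model, not Iceberg's implementation; the rows and the `write`/`scan` helpers are hypothetical.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)

def day_transform(ts: datetime) -> int:
    """Map a timestamp to its partition value: whole days since 1970-01-01 (UTC)."""
    return (ts.astimezone(timezone.utc).date() - EPOCH).days

def write(rows):
    """Group rows by the derived partition value; writers never supply it themselves."""
    partitions = {}
    for row in rows:
        partitions.setdefault(day_transform(row["ts"]), []).append(row)
    return partitions

def scan(partitions, start, end):
    """Return rows with start <= ts < end, reading only partitions that can match."""
    lo, hi = day_transform(start), day_transform(end)
    touched = [d for d in partitions if lo <= d <= hi]
    rows = [r for d in touched for r in partitions[d] if start <= r["ts"] < end]
    return rows, touched

# Hypothetical sample data.
rows = [
    {"id": 1, "ts": datetime(2021, 2, 1, 9, 30, tzinfo=timezone.utc)},
    {"id": 2, "ts": datetime(2021, 2, 1, 23, 59, tzinfo=timezone.utc)},
    {"id": 3, "ts": datetime(2021, 2, 2, 0, 0, tzinfo=timezone.utc)},
]
partitions = write(rows)
matched, touched = scan(
    partitions,
    datetime(2021, 2, 2, tzinfo=timezone.utc),
    datetime(2021, 2, 3, tzinfo=timezone.utc),
)
```

The query predicate is expressed on `ts` only; the partition pruning falls out of applying the same transform to the filter bounds.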

What are the Challenges?
• Data Scalability
• Atomicity
• Performance Degradation
• Complexity
• Object Stores
• Storage and Compute
• File System (Listing)
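
To make the atomicity challenge concrete, here is a minimal sketch of an optimistic commit: a writer prepares new table metadata and then swaps a single "current metadata" pointer, retrying if another writer got there first. It assumes an in-process lock stands in for whatever atomic operation a real catalog provides; `Catalog`, `commit`, and the metadata file names are hypothetical.

```python
import threading

class CommitConflict(Exception):
    """Raised when a commit keeps losing the race to other writers."""

class Catalog:
    """Toy catalog holding one pointer to the table's current metadata file."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None

    def current(self):
        return self._current

    def swap(self, expected, new):
        """Point the table at `new` only if it still points at `expected`."""
        with self._lock:
            if self._current != expected:
                return False
            self._current = new
            return True

def commit(catalog, make_metadata, retries=3):
    """Build new metadata from the current version and swap it in atomically."""
    for _ in range(retries):
        base = catalog.current()
        new = make_metadata(base)
        if catalog.swap(base, new):
            return new
    raise CommitConflict(f"gave up after {retries} attempts")

catalog = Catalog()
first = commit(catalog, lambda base: "v1.metadata.json")
# A writer that still believes the table is empty loses the swap.
stale_swap_succeeded = catalog.swap(None, "v1-other.metadata.json")
```

Readers see either the old pointer or the new one, never a half-written table, which is the property the slide's "Atomicity" bullet refers to.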

Architecture
[Diagram: compute engines (Spark, Presto), the Iceberg table format, and storage (HDFS, object store)]

Architecture
[Diagram: Snapshot (01) and Snapshot (02) each point to a manifest list; each manifest list points to manifests; each manifest points to data files]
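
The snapshot, manifest list, manifest, and file layers in the diagram can be sketched as a small data model. The version below is a simplified assumption (real manifests also carry statistics and partition data); the `Table` class and the file names are hypothetical. It shows two things: a new snapshot reuses earlier manifests, and files are found by walking metadata instead of listing directories.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Manifest:
    data_files: tuple  # paths of the data files this manifest tracks

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: tuple  # manifests that make up this version of the table

@dataclass
class Table:
    snapshots: dict = field(default_factory=dict)
    current_id: Optional[int] = None

    def append(self, new_files):
        """Commit a snapshot that reuses prior manifests and adds one for new_files."""
        previous = ()
        if self.current_id is not None:
            previous = self.snapshots[self.current_id].manifest_list
        next_id = (self.current_id or 0) + 1
        manifests = previous + (Manifest(tuple(new_files)),)
        self.snapshots[next_id] = Snapshot(next_id, manifests)
        self.current_id = next_id
        return next_id

    def plan_files(self, snapshot_id=None):
        """Collect data files for a snapshot by reading metadata only."""
        chosen = self.current_id if snapshot_id is None else snapshot_id
        snapshot = self.snapshots[chosen]
        return [f for m in snapshot.manifest_list for f in m.data_files]

table = Table()
s1 = table.append(["a.parquet"])
s2 = table.append(["b.parquet", "c.parquet"])
latest = table.plan_files()        # files visible in the current snapshot
as_of_s1 = table.plan_files(s1)    # time travel: files visible in snapshot 1
```

Because old snapshots stay intact, reading an earlier `snapshot_id` is all that time travel needs in this model.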

Initial Setup
• Catalogs
– Working with SQL
– System Information

Spark
Adding a Catalog
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
Creating a Table
CREATE TABLE local.db.table (id bigint, data string) USING iceberg;
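
The catalog settings above follow a pattern: `spark.sql.catalog.<name>` names the catalog class, and `spark.sql.catalog.<name>.<option>` sets its options. As a hedged convenience sketch, the helper below assembles the same `spark-sql` invocation from a dictionary so the settings can be reviewed or reused; `spark_sql_command` is a hypothetical helper, and the current working directory stands in for `$PWD`.

```python
import os
import shlex

PACKAGE = "org.apache.iceberg:iceberg-spark3-runtime:0.11.0"

def spark_sql_command(package, conf):
    """Build the spark-sql argument list: one --conf flag per key=value pair."""
    args = ["spark-sql", "--packages", package]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

conf = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    # Session catalog backed by the Hive metastore.
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.type": "hive",
    # Extra catalog named "local", stored as files under a warehouse directory.
    "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.local.type": "hadoop",
    "spark.sql.catalog.local.warehouse": os.path.join(os.getcwd(), "warehouse"),
}

cmd = spark_sql_command(PACKAGE, conf)
print(shlex.join(cmd))  # shell-quoted command line, ready to paste into a terminal
```

The catalog name `local` is what makes `local.db.table` resolve in the `CREATE TABLE` statement.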

Hive
Add the jar file
add jar /path/to/iceberg-hive-runtime.jar;
Create an External Table
CREATE EXTERNAL TABLE table_a
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://some_bucket/some_path/table_a';

References
Apache Iceberg: https://iceberg.apache.org/
Project Nessie: https://projectnessie.org/
Hive/Iceberg Integration: https://github.com/ExpediaGroup/hiveberg
Partitioning: https://developer.ibm.com/technologies/artificial-intelligence/articles/the-why-and-how-of-partitioning-in-apache-iceberg/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform
Iceberg Explained: https://thenewstack.io/apache-iceberg-a-different-table-design-for-big-data/

Slide 1 (title slide): Apache Iceberg, Scott Shaw
