Keeping Spark on Track:
Productionizing Spark for ETL
Kyle Pistor
kyle@databricks.com
Miklos Christine
mwc@databricks.com
$ whoami
Kyle Pistor
– SA @ Databricks
– 100s of Customers
– Focus on ETL and big data warehousing using Apache Spark
– BS/MS - EE

Miklos Christine
– SA @ Databricks!
– Previously: Systems Engineer @ Cloudera
– Deep Knowledge of Big Data Stack
– Apache Spark Enthusiast
Agenda
1. ETL: Why ETL?
2. Schemas
3. Metadata: Best Practices (FS Metadata)
4. Performance: Tips & Tricks
5. Error Handling
Why ETL?
Goal
• Transform raw files into a more efficient binary format such as Parquet (see the sketch below)
Benefits
• Performance improvements, statistics
• Space savings
• Standard APIs to different sources (CSV, JSON, etc.)
• More robust queries
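A minimal sketch of the conversion this talk is about; the paths are hypothetical:

# read raw JSON text, write an efficient binary format
df = spark.read.json("myRawData/")
df.write.parquet("myParquetData/")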
Agenda
1. ETL: Why ETL?
2. Schemas
3. Metadata: Best Practices (FS Metadata)
4. Performance: Tips & Tricks
5. Error Handling
Schema Handling
Common Raw Data
• Delimited (CSV, TSV, etc.)
• JSON
Infer Schema?
• Easy and quick
Specify Schema
• Faster
• More robust
JSON Record #1
{
"time": 1486166400,
"host": "my.app.com",
"eventType": "test",
"event": {
"message": "something happened",
"type": "INFO"
}
}
spark.read.json("record1.json").schema
root
|-- event: struct (nullable = true)
| |-- message: string (nullable = true)
| |-- type: string (nullable = true)
|-- eventType: string (nullable = true)
|-- host: string (nullable = true)
|-- time: long (nullable = true)
JSON Record #2
{
"time": 1486167800,
"host": "my.app.com",
"eventType": "test",
"event": {
"message": "Something else happened",
"type": "INFO",
"ip" : "59.16.12.0",
"browser" : "chrome",
"os" : "mac",
"country" : "us"
}
}
JSON Record #1 & #2
spark.read.json("record*.json").schema
root
|-- event: struct (nullable = true)
| |-- message: string (nullable = true)
| |-- type: string (nullable = true)
| |-- ip: string (nullable = true)
| |-- browser: string (nullable = true)
| |-- os: string (nullable = true)
| |-- country: string (nullable = true)
|-- eventType: string (nullable = true)
|-- host: string (nullable = true)
|-- time: long (nullable = true)
JSON: Generic Specified Schema (StructType => MapType)

from pyspark.sql.types import *

customSchema = StructType([
    StructField("time", TimestampType(), True),
    StructField("host", StringType(), True),
    StructField("eventType", StringType(), True),
    StructField("event", MapType(StringType(), StringType()), True)
])

Print Schema:
spark.read.json("record*.json").printSchema()

Specify Schema:
spark.read.schema(customSchema).json("record*.json")
Specify Schemas!
Faster
• No scan to infer the schema
More Flexible
• Easily handle evolving or variable-length fields (see the map-lookup sketch below)
More Robust
• Handle type errors at ETL time instead of at query time
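Illustrative: with event stored as a MapType, keys that appear later (like "browser" in record #2) become simple map lookups, and missing keys just return null:

df = spark.read.schema(customSchema).json("record*.json")
# map lookup; record #1 has no "browser" key, so it yields null
df.select(df["event"]["browser"]).show()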
Agenda
1. ETL: Why ETL?
2. Schemas
3. Metadata: Best Practices (FS Metadata)
4. Performance: Tips & Tricks
5. Error Handling
Filesystem Metadata Management
Common Source FS Metadata
• Partitioned by arrival time ("path/yyyy/MM/dd/HH")
• Many small files (KBs to 10s of MBs)
• Too little data per partition
...myRawData/2017/01/29/01
...myRawData/2017/01/29/02
Backfill - Naive Approach
Backfill
• Use only wildcards
Why this is a poor approach
• Spark (currently) is not aware of the existing partitions (yyyy/MM/dd/HH)
• Expensive full shuffle

df = spark.read.json("myRawData/2016/*/*/*/*")
# ... derive a "date" column, other transformations ...
df.write.partitionBy("date").parquet("myParquetData")
Backfill - Scalable Approach
List of Paths
• Create a list of paths to backfill
• Use FS ls commands
Iterate over the list to backfill
• Backfill each path
Operate in parallel
• Use multiple threads: Scala .par, Python multithreading (see the sketch below)

dirList.foreach(x => convertToParquet(x))     // serial
dirList.par.foreach(x => convertToParquet(x)) // parallel

def convertToParquet(path: String) = {
  val df = spark.read.json(path)
  // ... transformations ...
  // target path derived from the source path (illustrative)
  df.coalesce(20).write.mode("overwrite").parquet(path.replace("myRawData", "myParquetData"))
}
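For the Python side, a minimal thread-pool sketch; dir_list, the pool size, and the target-path scheme are assumptions mirroring the Scala version:

from concurrent.futures import ThreadPoolExecutor

def convert_to_parquet(path):
    df = spark.read.json(path)
    # ... transformations ...
    df.coalesce(20).write.mode("overwrite").parquet(path.replace("myRawData", "myParquetData"))

# run up to 8 backfill conversions concurrently on the shared SparkSession
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(convert_to_parquet, dir_list))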
Source Path Considerations
Directory
• Specify a single date partition instead of separate year/month/day levels (see the sketch below)
Block Sizes
• Read: larger Parquet files are okay
• Write: 500MB-1GB
Blocks / GBs per Partition
• Beware of over-partitioning
• ~30GB per partition
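Illustrative only: deriving one date column from the epoch time field seen in the JSON records, then partitioning the write by it:

from pyspark.sql.functions import from_unixtime, to_date

# time is epoch seconds; from_unixtime -> timestamp string, to_date -> DateType
df = df.withColumn("date", to_date(from_unixtime(df["time"])))
df.write.partitionBy("date").parquet("myParquetData")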
Agenda
1. ETL: Why ETL?
2. Schemas
3. Metadata: Best Practices (FS Metadata)
4. Performance: Tips & Tricks
5. Error Handling
Performance Optimizations
• Understand how Spark interprets null values
– nullValue: specifies a string that indicates a null value; any fields matching this string are set to null in the DataFrame

df = spark.read.options(header='true', inferSchema='true',
                        nullValue='N').csv('people.csv')

• Spark understands its own null data type
– Users must translate their null representations to Spark's native null type (see the sketch below)
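A minimal sketch of that translation, assuming missing ids arrive as empty strings (as in the test data below):

from pyspark.sql import functions as F

# replace empty-string ids with true nulls before joining
df = df.withColumn("id", F.when(F.col("id") == "", F.lit(None)).otherwise(F.col("id")))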
Test Data

Empty-string ids:        Null ids:
id   name                id    name
     foo                 null  bar
2    bar                 null  bar
     bar                 3     foo
11   foo                 15    foo
     bar                 2     foo
Performance: Join Key
● Spark's WholeStageCodeGen will attempt to filter null values within the join key
● Ex: Empty String Test Data

Performance: Join Key
● The run time decreased from 13 seconds to 3 seconds
● Ex: Null Values for Empty Strings
DataFrame Joins
df = spark.read.options(header='true').csv('/mnt/mwc/csv_join_empty')
df_duplicate = df.join(df, df['id'] == df['id'])
df_duplicate.printSchema()
root
|-- id: string (nullable = true)
|-- name1: string (nullable = true)
|-- id: string (nullable = true)
|-- name2: string (nullable = true)
DataFrame Joins
df = spark.read.options(header='true').csv('/mnt/mwc/csv_join_empty')
df_id = df.join(df, 'id')
df_id.printSchema()
root
|-- id: string (nullable = true)
|-- name1: string (nullable = true)
|-- name2: string (nullable = true)
Optimizing Apache Spark SQL Joins:
https://spark-summit.org/east-2017/events/optimizing-apache-spark-sql-joins/
Agenda
1. ETL: Why ETL?
2. Schemas
3. Metadata: Best Practices (FS Metadata)
4. Performance: Tips & Tricks
5. Error Handling
Error Handling: Corrupt Records

display(json_df)

_corrupt_record                                         eventType  host         time
null                                                    test1      my.app.com   1486166400
null                                                    test2      my.app.com   1486166402
null                                                    test3      my.app2.com  1486166404
{"time": 1486166406, "host": "my2.app.com", "event_ }  null       null         null
null                                                    test5      my.app2.com  1486166408
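One way to surface these rows (a sketch; the input path is hypothetical): read in the default PERMISSIVE mode, name the corrupt-record column explicitly, and filter on it:

json_df = (spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("myRawData/2017/01/29/*"))

# cache first: recent Spark versions disallow queries over raw files that
# reference only the internal corrupt-record column
json_df.cache()
corrupt_df = json_df.filter("_corrupt_record IS NOT NULL")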
Error Handling: UDFs
Py4JJavaError: An error occurred while calling o270.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 59.0 failed 4 times, most
recent failure: Lost task 0.3 in stage 59.0 (TID 607, 172.128.245.47, executor 1):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
File "/databricks/spark/python/pyspark/worker.py", line 92, in <lambda>
mapper = lambda a: udf(*a)
...
File "<ipython-input-10-b7bf56c9b155>", line 7, in add_minutes
AttributeError: type object 'datetime.datetime' has no attribute 'timedelta'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
Error Handling: UDFs
● Spark UDF Example

from datetime import datetime, timedelta

def add_minutes(start_date, minutes_to_add):
    return datetime.combine(start_date, datetime.min.time()) + timedelta(minutes=int(minutes_to_add))

● Best Practices
○ Verify the input data types from the DataFrame (see the sketch below)
○ Sample data to verify the UDF
○ Add test cases
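Applying it as a Spark UDF might look like this; a sketch where the column names "start_date" and "mins" are hypothetical:

from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

add_minutes_udf = udf(add_minutes, TimestampType())
df = df.withColumn("end_time", add_minutes_udf(df["start_date"], df["mins"]))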
Error Handling: Input Files
● Identify each record's input source file

from pyspark.sql.functions import *
df = df.withColumn("fname", input_file_name())

display(spark.sql("select *, input_file_name() from companies"))

city         zip    fname
LITTLETON    80127  dbfs:/user/hive/warehouse/companies/part-r-00000-8cf2a23f-f3b5-4378-b7b0-72913fbb7414.gz.parquet
BOSTON       02110  dbfs:/user/hive/warehouse/companies/part-r-00000-8cf2a23f-f3b5-4378-b7b0-72913fbb7414.gz.parquet
NEW YORK     10019  dbfs:/user/hive/warehouse/companies/part-r-00000-8cf2a23f-f3b5-4378-b7b0-72913fbb7414.gz.parquet
SANTA CLARA  95051  dbfs:/user/hive/warehouse/companies/part-r-00000-8cf2a23f-f3b5-4378-b7b0-72913fbb7414.gz.parquet
Thanks
Enjoy The Snow
https://databricks.com/try-databricks
