© 2022, Amazon Web Services, Inc. or its Affiliates.
Chie Hayashida
Solution Architect
2022/01
AWS Glue Spark
Performance Tuning
Self Introduction
Chie Hayashida
Solution Architect, Amazon Web Services Japan
The target audience of this deck
o People who have completed the AWS Glue tutorial or have equivalent knowledge
o People who have written Apache Spark applications
o People who would like to improve their existing AWS Glue jobs
o The code in this deck is all PySpark, because many AWS Glue users choose PySpark
o This deck covers Glue 2.0 (Spark 2.4.3) and Glue 3.0 (Spark 3.1.1)
Agenda
• Architecture of Apache Spark
• AWS Glue specific features (performance related)
• How to proceed with performance tuning of AWS Glue Spark
• Basic strategy for tuning AWS Glue Spark jobs
• Tuning Patterns for AWS Glue Spark Jobs
AWS Glue and Apache Spark
AWS Glue is a serverless service built around a managed Apache Spark engine, a crawler, a data catalog, and a scheduler.

[Architecture diagram]
1) The crawler crawls the data sources.
2) The data catalog manages the metadata.
3) Jobs are triggered manually, by schedule, or by event.
4) The job extracts data from the input data source.
5) The job runs the transformation and loads the data to the target data source and other AWS services.
Architecture of Apache Spark
Architecture of Apache Spark (cluster mode)
• The driver divides a job into one or more tasks and assigns them to executors (the cluster manager allocates the resources).
• On a single worker node, more than one executor can be started.
• More than one task can be executed on a single executor.

[Diagram: Driver Program (SparkContext) ↔ Cluster Manager ↔ Worker Nodes, each running executors that hold tasks and a cache]
Architecture of Spark (cluster mode)
1) The SparkContext requests the resources needed by the application from the cluster manager.
2) The cluster manager launches the executors required for the job on each worker.
3) The driver divides the processing into tasks and assigns them to the executors.
4) Each executor runs its assigned tasks and informs the driver program when a task completes; data is exchanged between tasks as necessary.
5) After 3) and 4) have been repeated several times, the result of the processing is returned.

[Diagram: the same driver / cluster manager / worker-node layout, annotated with steps 1)–5)]
How data is processed
• The data being processed is defined as a distributed collection called an RDD
• An RDD is made up of one or more “partitions”
• A partition is processed by a single “task”
• In practice, Spark code is usually written through the DataFrame interface, which treats the data as a typed table

[Diagram: files in S3 are read into RDD partitions, transformed from RDD to RDD, and written back to files in S3]
Components of Apache Spark
(component stack, top to bottom)

Spark SQL | Spark ML | Structured Streaming | GraphX
DataFrame / Catalyst Optimizer
Spark Core (RDD)
RDD and DataFrame

RDD data image:
[
  [1, Bob, 24],
  [2, Alice, 48],
  [3, Ken, 10],
  ...
]

DataFrame data image:
col1  col2   col3
1     Bob    24
2     Alice  48
3     Ken    10
...   ...    ...

With both interfaces, the code is written as if the data were a local list/table, but the actual data is distributed across multiple servers. DataFrame is a high-level API on top of RDDs, and processing described with DataFrames is internally executed as RDD operations.
Lazy evaluation
• Spark processing comes in two types: “transformations” and “actions”
• When an “action” is executed, all the preceding processing that the action depends on is performed
• A series of processes executed by an “action” is called a “job”
• Note that a Spark “job” here is different from a Glue job

>>> df1 = spark.read.csv(…)
>>> df2 = spark.read.json(…)
>>> df3 = df1.filter(…)
>>> df4 = spark.read.csv(…)
>>> df5 = df2.join(df3, …)
>>> df5.count()   # Action

Up to the line before count(), no actual processing is done; at count(), the preceding processing is executed for the first time. The df4 line is not a dependency of df5.count(), so it is not executed by this action.
Examples of transformations and actions

Transformations (data generation and processing):
• select() — select columns
• read — load data
• filter() — filter data
• groupBy() — aggregate by group
• sort() — sort data

Actions (output a processing result):
• count() — count the number of records
• write — write to the file system
• collect() — collect all data on the driver node
• show() — view a sample of the data
• describe() — view data statistics
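As a minimal sketch (the path and column names are illustrative), transformations only build up the execution plan, and the action at the end triggers it:

df = spark.read.csv('s3://bucket/input/', header=True)        # transformation: nothing runs yet
df2 = df.filter(df['price'] > 500).select('item', 'price')    # transformations: still lazy
print(df2.count())                                            # action: the whole pipeline executes here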
Spark Applications
o A Spark application consists of multiple jobs.

glueContext = GlueContext(SparkContext.getOrCreate(conf))
spark = glueContext.spark_session
df1 = spark.read.json(...)
df1.show() # job1
df1.filter(df1.col1 == 'a').write.parquet(...) # job2
df1.filter(df1.col2 == 'b').write.parquet(...) # job3

An application is the set of processes executed in a single GlueContext (or SparkContext).
Shuffle and Stage

df2 = df1.filter('price > 500').groupBy('item').sum().withColumn('bargain', F.col('sum(price)') * 0.8)

• Stages are divided at shuffle boundaries
• Multiple tasks are processed concurrently within one stage

[Diagram: the partitions of Stage 1 feed the partitions of Stage 2 across a shuffle; each partition is processed by one task]
Processing with and without shuffling (exchange of data between tasks)

Example without shuffling:
df2 = df1.filter('price > 500')

df1:                    df2:
item     price          item     price
beef     1300           beef     1300
pork     200            chicken  700
chicken  700
fish     400

Example with shuffling:
df2 = df1.groupBy('item').sum()

df1:            df2:
item  num       item  num
beef  2         beef  6
pork  3         pork  9
beef  4
pork  5
Processing with and without shuffling (exchange of data
between tasks)
Processing without shuffling
• read
• filter()
• withColumn()
• UDF
• coalesce()
Processing with shuffling
• join()
• groupBy()
• orderBy()
• repartition()
Optimization with the Catalyst Query Optimizer
• Processing described with DataFrames is converted by the optimizer into an optimized RDD execution plan before it is executed.

df1 = spark.read.csv('s3://path/to/data1')
df2 = spark.read.parquet('s3://path/to/data2')
df3 = df1.join(df2, df1.col1 == df2.col1)
df4 = df3.filter(df3.col2 == 'meat').select(df3.col3, df3.col5)
df4.show()

[Diagram: the logical plan scans data1 (10GB) and data2 (50GB), joins them (60GB), then filters (20GB); the optimized physical plan pushes the filter below the join so the join handles 20GB; with storage-side optimization (predicate pushdown and column pruning) the scans themselves shrink to 10GB/20GB]
Architecture of PySpark
• DataFrame processing written in PySpark is converted to JVM code (via Py4J) and executed by the executors.
• UDF code written in Python is executed in a Python worker process on each executor, per task.

[Diagram: the driver runs Python attached to the SparkContext through Py4J; on each worker, executor tasks call out to Python worker processes for Python UDFs]
Introduction to AWS Glue specific features
Data Catalog
• The data catalog holds the metadata (table names, column names, S3 paths, and so on) needed to access data sources such as S3 and databases from Glue, Athena, Redshift Spectrum, etc.
• There are three ways to create metadata in the data catalog: the crawler, the Glue API, and DDL (from Athena/EMR/Redshift Spectrum).
• Amazon DynamoDB, Amazon S3, Amazon Redshift, Amazon RDS, JDBC-connectable databases, Kinesis, HDFS, etc. can be specified as data sources.
• There is no metastore database to manage.

[Diagram: the crawler saves metadata about the data sources (DynamoDB, S3, Redshift, RDS, JDBC-connectable DBs) into the data catalog; connected services (Glue ETL, Athena, Redshift Spectrum, EMR, and other Hive-replacement applications) access ① the metadata and then ② the data itself]
DynamicFrame
An AWS Glue abstraction that absorbs the inherent complexity of ETL on raw data.
• As a component, it sits at the same level as DataFrame; the two can be converted into each other (the fromDF and toDF functions).
• It can leave the decision among multiple possible types for later (the choice type).
• A DynamicFrame refers to the entire dataset; a DynamicRecord refers to a single row of data.

[Diagram: DynamicFrame sits alongside DataFrame/Catalyst Optimizer on top of Spark Core (RDDs); a DynamicFrame resembles a semi-structured table of self-describing records]
Choice Type
A DynamicFrame can hold more than one type when multiple types are found in a column.
• The resolveChoice method can be used to resolve the types.

Example schema containing a choice type:
root
|-- uuid: string
|-- deviceid: choice
|    |-- long
|    |-- string

The deviceid column contains both long and string data (e.g. the long 1234 is mixed with the string "1234").

resolveChoice options:
• cast — cast to a single type (deviceid: long)
• project — keep only one of the types and discard the other (deviceid: long)
• make_cols — keep each type in a separate column (deviceid_long, deviceid_string)
• make_struct — convert to a struct type that holds both types

With a plain DataFrame, if more than one type is present, processing may be interrupted partway through and have to be redone.
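A minimal sketch of resolving the choice type above with a cast (the column name comes from the example):

dyf = dyf.resolveChoice(specs=[('deviceid', 'cast:long')])  # or 'make_cols', 'project:long', 'make_struct'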
ETL processing that takes advantage of the characteristics of DynamicFrame and DataFrame
• DynamicFrame is good for ETL processing; DataFrame is good for table processing.
• Perform data input/output and the associated ETL processing with DynamicFrame, and table manipulation with DataFrame.
• By using DynamicFrame when loading, you can optimize data loading with the AWS Glue catalog, load only differential data, and handle semi-structured data with the choice type.
• Table operations such as JOIN are performed with DataFrame.
• Use the toDF and fromDF functions for mutual conversion between DynamicFrame and DataFrame. (No data is copied; the conversion cost is within a few milliseconds.)
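A minimal sketch of that flow (the database, table, and join-key names are illustrative):

from awsglue.dynamicframe import DynamicFrame

dyf = glueContext.create_dynamic_frame.from_catalog(database='my_db', table_name='my_table')
df = dyf.toDF()                                             # DynamicFrame -> DataFrame
df = df.join(other_df, 'id')                                # table operations on the DataFrame
dyf_out = DynamicFrame.fromDF(df, glueContext, 'dyf_out')   # back to a DynamicFrame for output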
Bookmark function
A function to process only the delta data in recurring ETL jobs.
• It uses file timestamps to process only the data that was not handled by the previous job run, preventing duplicate processing and duplicate data.

[Diagram: of the files under s3://path/to/data, already-processed files are not loaded; only unprocessed files are the load target]
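Bookmark state is tracked per transformation_ctx. A minimal sketch, assuming the job was created with job bookmarks enabled (names illustrative):

dyf = glueContext.create_dynamic_frame.from_catalog(
    database = 'my_database',
    table_name = 'my_table',
    transformation_ctx = 'datasource0')  # the bookmark is keyed on this name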
How to proceed with performance tuning of AWS Glue ETL
The Performance Tuning Cycle
1. Determine performance goals.
2. Measure the metrics.
3. Identify bottlenecks.
4. Reduce the impact of the bottlenecks.
5. Repeat steps 2 to 4 until the goal is achieved.
6. The performance goal is achieved.
Understand the characteristics of AWS Glue/Apache Spark
• Distributed processing
  • There are tuning patterns, such as shuffling and data skew, that do not exist in single-process applications.
• Lazy evaluation
  • Because Spark evaluates lazily, an error may be caused not by the API executed just before the error occurred, but by an API written earlier.
• Impact of the optimizer
  • Because Spark optimizes processing internally, it can be difficult to tell which part of the script corresponds to the processing you see in the Spark UI. Check multiple metrics to find the cause.
Spark parameters in AWS Glue
• Spark itself can be tuned through parameters at job execution time, but AWS Glue is a serverless service, so tune according to AWS Glue best practices first, before reaching for Spark parameters.
• Test thoroughly when changing the value of a Spark parameter.
Metrics to check
• Spark UI (Spark History Server)
  • Shows the details of Spark processing.
• Executor logs and driver log
  • Show the stdout/stderr logs of the executors and the driver.
• AWS Glue job metrics
  • Show the CPU, memory, and other resource usage of each executor and the driver node.
• Statistics obtained from the Spark API
  • Show samples and statistics of the intermediate data.
Setting up a job to do the tuning
To get the logs needed for tuning, enable the monitoring options in the Add job screen.
Spark UI
Spark History Server
Spark event logs can be visualized by running the Spark History Server.
There are several ways to launch it:
• Using AWS CloudFormation
• Using Docker to launch it on a local PC
• Downloading Apache Spark onto a local PC or EC2 and starting the Spark History Server
• Using an EMR cluster
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
Example of launching with Docker
o Once the Docker container has started, access http://localhost:18080 in your browser.

$ docker build -t glue/sparkui:latest .
$ docker run -itd -p 18080:18080 \
    -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS \
       -Dspark.history.fs.logDirectory=s3a://path_to_eventlog \
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID \
       -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" \
    glue/sparkui:latest \
    "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"

Specify the S3 path of the Spark history logs in spark.history.fs.logDirectory.
Spark History Server
Check the duration of each application, then click an application to see its details.
List of jobs executed by the application
Check the Completed Jobs and Failed Jobs lists, and look for jobs that fail or take a long time.
Checking the contents of a job
Identify stages that are failing or taking a long time.
Check the contents of the Stage
Check the data size in the stage. In the event timeline, if the bars differ greatly in length, skew is occurring and the work is not sufficiently distributed.
Checking the contents of the Stage (continued)
Check the details of what is taking so long: the task time for each executor, and whether data spills to disk. If there is spill to disk, selecting a worker node with larger memory can solve it. In addition to the event timeline, data skew can also be seen in the Summary Metrics and the Tasks list.
View details of tasks that are failing
Click “details” to learn more, and view the log of the executor where the failure occurred. For AWS Glue ETL, note the executor ID and check the corresponding CloudWatch log group.
Environmental settings for Spark application runtime
o Spark options, dependencies, and so on.
List of Driver and Executor nodes
If the cluster is still running and reachable from the History Server, the logs of each driver/executor can be viewed here.
Checking the Spark SQL Query Execution Plan
You can see the actual execution plan, which is more accurate than the output of the explain API.
Executor and Driver logs
Check Log groups in CloudWatch
Executor Logs
Driver Log
AWS Glue Job Metrics
Check the resource usage of Executor and Driver
You can also create a CloudWatch dashboard and add other metrics to it.
Spark API
Check the trend of data during processing with commands
o Use the following APIs to check the trend of the intermediate data during processing and feed it into your tuning strategy.
o Note that the ones that are actions (count(), show(), describe().show(), agg()) will slow the job down if used too many times.
• count()
• printSchema()
• show()
• describe([cols*]).show()
• explain()
• df.agg(approx_count_distinct(df.col))
Check the trend of data during processing with APIs
df.count()
o Check the number of records.
o Use df.groupBy('col_name').count() to check for skew.
Check the trend of data during processing with APIs
df.printSchema()
o Check the schema information of the DataFrame.
Check the trend of data during processing with APIs
df.show()
o The number of records to be displayed can be specified by using
df.show(5).
Check the trend of data during processing with APIs
df.describe([cols*]).show()
o The statistics for each column can be seen.
Check the trend of data during processing with APIs
df.explain()
o Check the execution plan that the optimizer created.
Check the trend of data during processing with APIs
df.agg(approx_count_distinct(df.col))
o Check the cardinality of a column
o Fast because HyperLogLog is used
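A minimal usage sketch (the column name is illustrative):

from pyspark.sql.functions import approx_count_distinct
df.agg(approx_count_distinct(df['device_id'])).show()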
AWS Glue ETL Performance Tuning Patterns
Basic strategy for tuning AWS Glue ETL jobs
• Use the new version.
• Reduce the data I/O load.
• Minimize shuffling.
• Speed up per-task processing.
• Parallelize.
Use the new version
Use the new version
o When a Spark application is slow, simply moving the job to the latest runtime version may speed it up.
o Both AWS Glue and Apache Spark are evolving in every way, not just performance. Use the newest version you can.
https://aws.amazon.com/jp/blogs/big-data/introducing-aws-glue-3-0-with-optimized-apache-spark-3-1-runtime-for-faster-data-integration/
Using AWS Glue 2.0 and 3.0 to reduce startup time
AWS Glue 2.0 dramatically reduced the time required to launch AWS Glue ETL jobs.
• A cold start used to take around 8–10 minutes in AWS Glue 0.9/1.0.
• In AWS Glue 2.0 and later, startup takes less than 1 minute.

[Diagram: timeline comparing AWS Glue 1.0 (long start-up time plus execution time) with AWS Glue 2.0 (minimal start-up time plus execution time)]
AWS Glue 2.0+ integrated scheduling and provisioning
1. Run the AWS Glue job.
2. The job manager submits the job to a virtual cluster.
3. Spark schedules tasks to executors; the job starts when the first executor is ready.
4. The virtual cluster grows dynamically across Availability Zones.
5. Spark utilizes the new executors as they come up.

Results: reduced start time, reduced start variance, and graceful degradation (jobs can run on reduced capacity).
Minimize data I/O
How to minimize the data I/O load
• Read only the data you need.
• Control the amount of data read in one task.
• Choose the right compression format.
Using Apache Parquet
• A column-oriented format whose data layout suits analytical applications
• Data types are preserved
• Compresses effectively
• Speeds up aggregation by skipping unnecessary data using metadata
• The Spark engine has built-in integration that lets it use Apache Parquet efficiently
https://parquet.apache.org/documentation/latest/
Partition Filtering and Filter Pushdown
Reduce the amount of data to be read.
• Partition filtering
  • Reads only the files in the partitions matched by the filter or WHERE clause.
  • Available with Text/CSV/JSON/ORC/Parquet.
• Filter pushdown
  • Reads only the blocks that match the filter or WHERE clause on columns that are not partition columns.
  • AWS Glue applies this automatically when Parquet is used.
Partition Filter and Filter Pushdown
[Diagram: partition filtering prunes whole partition directories; filter pushdown skips non-matching blocks within the files that are read]
Partition Filter and Filter Pushdown
Partition filtering can be used when a partition directory layout exists. For DataFrame and DynamicFrame writes, partition directories can be created with the partitionBy option:

df.write.parquet(path=path, partitionBy='col_name')

It may be more efficient to place the columns used most frequently in filter clauses at the higher partition levels.
Using push_down_predicate in DynamicFrame
Reads only the files in the partitions that match the filter or WHERE condition when reading a DynamicFrame from the AWS Glue data catalog.

partitionPredicate = "(product_category == 'Video')"
datasource = glue_context.create_dynamic_frame.from_catalog(
    database = "githubarchive_month",
    table_name = "data",
    push_down_predicate = partitionPredicate)
Choose a compression codec based on your application
• A compression codec can be selected when writing data.
• There is a trade-off between compression ratio and compression/decompression speed.
• Files compressed with bzip2 (and indexed lzo) can be split when read, but gzip-compressed files(*) cannot; snappy files are splittable in practice when used inside container formats such as Parquet.
• Uncompressed files need no compression/decompression time, but data transfer may become the bottleneck.
• If processing speed is important to you, choose snappy or lzo.

ex. df.write.csv("path/to/csv", compression="gzip")

                   gzip    bzip2    lzo              snappy
File extension     .gz     .bz2     .lzo             .snappy
Compression level  High    Highest  Average          Average
Speed              Medium  Slow     Fast             Fast
CPU usage          Medium  High     Low              Low
Splittable         No(*)   Yes      Yes, if indexed  No

(*) Parquet files remain splittable even when gzip-compressed, because compression is applied per block.
Store data in appropriate file sizes
• Data read/write tasks are basically tied to a single file (if the format is splittable, one file can be split across multiple tasks).
• The recommended file size for AWS Glue is 128 MB–512 MB.

When the files are too small:
• Overhead from a large number of small tasks.

When one file holds large unsplittable data:
• The whole file must be processed by a single node and may not fit in memory.
• No distributed processing; the other executors sit idle.

[Diagram: many small files spawn many tiny tasks, while one huge unsplittable file keeps a single executor busy and the rest unused]
Using Bounded Execution with DynamicFrame
When there is a lot of unprocessed data, bounded execution can be combined with job bookmarking to process it in chunks instead of reading all of it at once.

glueContext.create_dynamic_frame.from_catalog(
    database = "database",
    table_name = "table_name",
    redshift_tmp_dir = "",
    transformation_ctx = "datasource0",
    additional_options = {
        "boundedFiles": "500"          # must be a string
        # "boundedSize": "1000000000"  # unit is bytes
    }
)
Using DynamicFrame's groupFiles and groupSize
• Eliminate overhead by reading small files together in a single task.
• Useful for processing data that is output every few minutes, e.g. by Kinesis Data Firehose.
• Use the groupFiles option to group files within an S3 partition, and the groupSize option to specify the size of each group to read.

df = glueContext.create_dynamic_frame_from_options(
    's3', {'paths': ['s3://s3path/'],
           'recurse': True, 'groupFiles': 'inPartition', 'groupSize': '1048576'},
    format='json')

note: groupFiles is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc.
Number of files and processing time for DataFrame
and DynamicFrame
Using DynamicFrame S3ListImplementation
• If there are many small files, the large number of objects to list can cause driver OOM.
• When useS3ListImplementation is True, S3 list results are read and processed in batches of 1,000, which keeps driver memory from being exhausted by the S3 listing.

datasource = glue_context.create_dynamic_frame.from_catalog(
    database = "my_database",
    table_name = "my_table",
    push_down_predicate = partitionPredicate,
    additional_options = {"useS3ListImplementation": True})
Set the Partition Index
When reading a DataFrame through the AWS Glue catalog from a data source with many partitions built from multiple partition keys, setting a partition index reduces the time needed to fetch the matching partitions when the query has filter or WHERE clauses on those keys.
https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
The difference of query planning time between using
Partition Index and not using Partition Index
https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
Parallel Data Reading in DataFrame JDBC Connections
• By default, spark.read.jdbc() accesses the target database from only one executor.
• For parallel reading, partitionColumn, lowerBound, upperBound, and numPartitions must be specified. The partitionColumn must be a numeric, date, or timestamp column.

df = (spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "sample")
      .option("partitionColumn", "col1")
      .option("lowerBound", 1)
      .option("upperBound", 100000)
      .option("numPartitions", 100)
      .option("fetchsize", 1000)
      .load())
Parallel data reading in DynamicFrame JDBC connections
o To read data from a JDBC connection in parallel as a DynamicFrame, specify hashfield/hashexpression.
o With hashfield, string and other non-numeric columns can also be used as partition columns.

glueContext.create_dynamic_frame.from_catalog(
    database = "my_database",
    table_name = "my_table_name",
    transformation_ctx = "my_transformation_context",
    additional_options = {'hashfield': 'customer_name', 'hashpartitions': '5'})

https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
Minimize shuffling
Minimize shuffling
• Make good use of the cache.
• Apply filter processing in the earliest stage possible.
• Order joins so that intermediate data stays small.
• Optimize the join strategy.
• Remove data skew.
• Use window processing instead of aggregating data via self-joins.
Minimize shuffling
Processing described with DataFrames is optimized by the Catalyst Optimizer. However, the optimizer is not perfect:
• If there is a cache() in between, optimization across the cache boundary does not happen.
• Spark 2.4.3, used in AWS Glue 1.0 and 2.0, disables the cost-based optimizer by default.
Shuffling can be reduced by manually changing the order and strategy of filters and joins.
Make good use of cache
• When branching the processing of a single DataFrame into multiple outputs, you can prevent recomputation by inserting cache() just before the branch.
• Note that it may sometimes be faster not to cache, and that excessive caching consumes local disk space.

[Diagram: without cache(), the lineage up to df2 is executed twice, once per branch; with cache() just before the branch, it is executed only once]
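A minimal sketch (the names and paths are illustrative):

df2 = df1.join(other_df, 'id').cache()                             # cache just before the branch
df2.filter(df2['col'] == 'a').write.parquet('s3://bucket/out_a/')  # computes df2 and fills the cache
df2.filter(df2['col'] == 'b').write.parquet('s3://bucket/out_b/')  # reuses the cached df2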
Make good use of cache
• By default, cached data is stored in the memory allocated for caching, and whatever does not fit in memory is stored on the local disk.
• Users can choose to store it in memory only or on disk only.

Example of caching in memory only: df.persist(StorageLevel.MEMORY_ONLY)
(StorageLevel comes from the pyspark package; cache() itself always uses the default level.)
Delete cache that is no longer in use
• A cached DataFrame continues to occupy memory and local disk space.
• Save memory and disk space by dropping the cache when the DataFrame is no longer needed.

df.unpersist()
Perform filter processing in the first stage as much as possible
Because the optimizer does not move filters across a cache() boundary, place filters before cache() to reduce the amount of data held and recomputed during processing.
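A minimal sketch (the column name is illustrative):

df_small = df.filter(df['price'] > 500).cache()  # cache only the filtered data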
Work out the order of joins
• The end result is the same, but the size of the intermediate DataFrames differs.
• In Glue 3.0 (Spark 3.1.1), the cost-based optimizer takes the amount of data into account and optimizes the join order for you.

Example: left-joining df1 (4,000 rows) with df2 (1,000 rows) first yields a 4,000-row intermediate df4, which is then joined with df3 (10 rows) to produce df5 (10 rows). Left-joining df1 with df3 first yields a 10-row intermediate df4, so the subsequent join with df2 handles far less data while producing the same df5 (10 rows).
Using join in different ways
Sort Merge Join
• Distribute both tables by their join keys, sort them, and then join them.
• Suitable for joining large tables together.
Broadcast Join
• Send (broadcast) the smaller table to every executor and join it against each partition of the other table.
• Suitable when one table is much smaller than the other.
Shuffle Hash Join
• Distribute both tables by their join keys and join them without sorting.
• Suitable for joins between tables that are not so large.
Using join in different ways
• By default, if a table is smaller than the value of spark.sql.autoBroadcastJoinThreshold (default 10MB), a Broadcast Join is used.
• The join strategy in use can be seen in the Spark UI or with explain().
• If join performance is the bottleneck, changing the join strategy manually may improve performance.

from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), df1['col1'].eqNullSafe(df2['col1'])).explain()

== Physical Plan ==
BroadcastHashJoin [coalesce(col1#6, )], [coalesce(col1#21, )], Inner, BuildRight, (col1#6 <=> col1#21)
:- LocalTableScan [first_name#5, col1#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
   +- LocalTableScan [col1#21, col2#22, population#23]

https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html#broadcast-hint-for-sql-queries
coalesce
Partitions may become fragmented during processing, for example when:
• loading a large number of small files, or
• performing groupBy on high-cardinality columns.
In such cases it is better to merge the partitions before the next step to reduce the overhead of subsequent processing.
Because repartition() involves a shuffle, coalesce() is often preferable; however, since coalesce() simply merges partitions, the resulting data may be unbalanced.
Glue 3.0 has a feature called Adaptive Query Execution that automatically optimizes the number of partitions by coalescing.

df.repartition(2)   # full shuffle, evenly sized partitions
df.coalesce(2)      # no shuffle, merges existing partitions
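A minimal sketch of coalescing after a high-cardinality aggregation (the names and partition count are illustrative):

df_agg = df.groupBy('high_cardinality_key').count()
df_agg.coalesce(50).write.parquet('s3://bucket/out/')  # merge small post-shuffle partitions before writing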
Use Window processing instead of self join and data aggregation
o When aggregate data is created from a single log dataset and then joined back to it, the load of the join can be eliminated by using window processing instead.

Before (aggregate + self join):
df_agg = df.groupBy('gender', 'age').agg(
    F.mean('height').alias('avg_height'), F.mean('weight').alias('avg_weight'))
df = df.join(df_agg, on=['gender', 'age'])

After (window):
w = Window.partitionBy('gender', 'age')
df = df.withColumn('avg_height', F.mean(F.col('height')).over(w)) \
       .withColumn('avg_weight', F.mean(F.col('weight')).over(w))
Speed up per-task processing
Using Scala
Most DataFrame operations written in PySpark are internally converted to JVM code, but the following are slower from Python:
• Processing written at the RDD level
  • RDD code is not optimized by the optimizer.
• Processing that uses UDFs
  • Discussed on the following slides.
If these are the bottleneck, rewriting them in Scala will speed things up.
Avoid UDFs in PySpark
Performance issues:
• Serialization and piping between the JVM and the Python process occur for each batch of rows.
• The memory of the Python process is not managed by the JVM.

[Diagram: for each task, the JVM physical operator hands batches of rows to a PythonRunner, which serializes them and pipes them to a Python worker; the worker invokes the UDF and the results are deserialized back into the JVM]
Using Pandas UDF over Python UDF
Python UDF
• Serialization/deserialization is done by pickling.
• Data is fetched block by block, but the UDF is invoked row by row.
Pandas UDF
• Serialization/deserialization is done with Apache Arrow.
• Both the data fetch and the UDF execution operate on many rows at once (vectorized).
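A minimal Pandas UDF sketch (type-hint style, available in Glue 3.0 / Spark 3.x; the function and column names are illustrative):

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def c_to_f(c: pd.Series) -> pd.Series:
    # operates on whole batches of rows as pandas Series, exchanged via Apache Arrow
    return c * 9.0 / 5.0 + 32.0

df = df.withColumn('temp_f', c_to_f(df['temp_c']))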
Performance differences between Python UDF, Pandas UDF, and the Spark API in AWS Glue ETL
[Chart: execution time (s) compared across Python UDF, Pandas UDF, and the native Spark API]
Parallelize
Dealing with Skewness in Data
If data volume varies between partitions, the load concentrates on the few tasks that process the large partitions, delaying the whole job.

When does it happen?
• When the input file sizes are uneven
• When joining on a key whose record counts differ greatly between key values
• When df.groupBy() is performed on a key with uneven record counts per value
Addressing data skew
• Make file sizes uniform when creating the input files.
• Repartition.
• Broadcast join.
• Salting.
Dealing with Skewness
Repartition
If the subsequent processing is not key-based (e.g. partitioning and storing data by date, or window processing by key), repartitioning resolves the skew.

df.repartition(200)

[Diagram: 3 skewed partitions are redistributed into 200 evenly sized partitions]
Dealing with Skewness
broadcast join
If one DataFrame is small enough to fit entirely in one executor while the other holds huge data with a skewed join-key column, using a broadcast join as the join strategy can improve processing performance (see the sketch below).

[Diagram: with a sort merge join, all 1,000 “beef” rows are shuffled into a single partition and joined there while the “pork” partition stays small; with a broadcast join, the small price table (beef 100, pork 500) is sent to every partition, so each 500-row partition joins locally without shuffling]
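A minimal sketch of forcing the broadcast (the table and key names are illustrative):

from pyspark.sql.functions import broadcast
df_joined = df_big.join(broadcast(df_small), 'item')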
Dealing with Skewness
Salting
For a join between two sufficiently large datasets where one side is skewed, a “salt” can be used to eliminate the load bias: append a random suffix to the join key on the skewed side, and clone each row on the other side once per possible suffix so that every salted key still finds its match.

[Diagram: the skewed partition of table A is split across salted keys, and the corresponding partition of table B is cloned to match]
Dealing with Skewness
Salting
o In Glue 2.0 (Spark 2.4.3), users need to write the salting code manually.
o In Glue 3.0 (Spark 3.1.1), a feature called Adaptive Query Execution performs the skew join dynamically.

# Salt the skewed column of the large table
df_big = df_big.withColumn(
    'shop_salted',
    F.concat(df_big['shop'], F.lit('_'), (F.floor(F.rand() * numPartition) + 1).cast('string')))

# Explode the smaller table across all salt values
df_medium = df_medium.withColumn(
    'salt', F.explode(F.array([F.lit(i) for i in range(1, numPartition + 1)])))
df_medium = df_medium.withColumn(
    'shop_exploded', F.concat(df_medium['shop'], F.lit('_'), df_medium['salt'].cast('string')))

# Join on the salted keys
df_join = df_big.join(df_medium, df_big['shop_salted'] == df_medium['shop_exploded'])

https://spark.apache.org/docs/latest/sql-performance-tuning.html#optimizing-skew-join
Assigning Incremental IDs with Performance in Mind
• If the window function row_number() is used to assign consecutive incremental IDs to all records, processing is slow because it aggregates over all records.
• monotonically_increasing_id() assigns incremental IDs without that aggregation, by allowing the IDs to be non-contiguous across partitions.

row_number():                  1 2 3 4 5 6 7 8 9 10 11
monotonically_increasing_id(): 1 2 3 6 7 8 9 10 13 14 15

df = df.withColumn('id', F.row_number().over(Window.partitionBy('col1').orderBy('col2')))
df = df.withColumn('id', F.monotonically_increasing_id())
Selecting a Worker Type for AWS Glue
• The processing power allocated at job execution time is measured in DPUs (Data Processing Units).
• 1 DPU = 4 vCPU, 16 GB memory.
• Each worker type has a different resource capacity and configuration.

Worker Type  DPUs/Worker  Executors/Worker  Memory/Worker  Disk/Worker
Standard     1            2                 16GB           50GB
G.1X         1            1                 16GB           64GB
G.2X         2            1                 32GB           128GB

[Diagram: Standard packs two executors into one DPU per worker; G.1X gives one executor a full DPU; G.2X gives one executor two DPUs]
Ideal resource usage
• Ideally, resources are used evenly and without waste by all executors.
• If not, there is likely room for tuning.
• Select the initial worker type by predicting the resource profile from the processing involved. For example:
  • CPU usage is likely to be high when there are complex UDFs and similar operations.
  • Memory usage is likely to be high when the shuffle size grows large, such as when joining large amounts of data.
Trade-off between number of workers and job execution time
As long as there is enough parallelism to keep the resources effectively utilized, job execution time can be reduced without increasing total cost by adding workers (twice the workers for half the time costs the same).

[Charts: AWS Glue ETL job execution time versus number of workers; with sufficient parallelism, execution time falls roughly in proportion as workers are added]
Summary
• This deck introduced tuning patterns for AWS Glue Spark ETL jobs.
• AWS Glue can process large amounts of data with high performance as-is, but it can be tuned to achieve even higher performance and scalability.
• Tuning means checking metrics to identify bottlenecks and eliminating their causes.
 
202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdf
202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdf202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdf
202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdfAmazon Web Services Japan
 
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介Amazon Web Services Japan
 
Amazon QuickSight の組み込み方法をちょっぴりDD
Amazon QuickSight の組み込み方法をちょっぴりDDAmazon QuickSight の組み込み方法をちょっぴりDD
Amazon QuickSight の組み込み方法をちょっぴりDDAmazon Web Services Japan
 
マルチテナント化で知っておきたいデータベースのこと
マルチテナント化で知っておきたいデータベースのことマルチテナント化で知っておきたいデータベースのこと
マルチテナント化で知っておきたいデータベースのことAmazon Web Services Japan
 
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチAmazon Web Services Japan
 
パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介
パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介
パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介Amazon Web Services Japan
 
202202 AWS Black Belt Online Seminar Amazon Connect Customer Profiles
202202 AWS Black Belt Online Seminar Amazon Connect Customer Profiles202202 AWS Black Belt Online Seminar Amazon Connect Customer Profiles
202202 AWS Black Belt Online Seminar Amazon Connect Customer ProfilesAmazon Web Services Japan
 
Amazon Game Tech Night #24 KPIダッシュボードを最速で用意するために
Amazon Game Tech Night #24 KPIダッシュボードを最速で用意するためにAmazon Game Tech Night #24 KPIダッシュボードを最速で用意するために
Amazon Game Tech Night #24 KPIダッシュボードを最速で用意するためにAmazon Web Services Japan
 
202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨
202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨
202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨Amazon Web Services Japan
 
[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介
[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介
[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介Amazon Web Services Japan
 
202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介
202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介
202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介Amazon Web Services Japan
 

More from Amazon Web Services Japan (20)

202205 AWS Black Belt Online Seminar Amazon VPC IP Address Manager (IPAM)
202205 AWS Black Belt Online Seminar Amazon VPC IP Address Manager (IPAM)202205 AWS Black Belt Online Seminar Amazon VPC IP Address Manager (IPAM)
202205 AWS Black Belt Online Seminar Amazon VPC IP Address Manager (IPAM)
 
202205 AWS Black Belt Online Seminar Amazon FSx for OpenZFS
202205 AWS Black Belt Online Seminar Amazon FSx for OpenZFS202205 AWS Black Belt Online Seminar Amazon FSx for OpenZFS
202205 AWS Black Belt Online Seminar Amazon FSx for OpenZFS
 
202204 AWS Black Belt Online Seminar AWS IoT Device Defender
202204 AWS Black Belt Online Seminar AWS IoT Device Defender202204 AWS Black Belt Online Seminar AWS IoT Device Defender
202204 AWS Black Belt Online Seminar AWS IoT Device Defender
 
Infrastructure as Code (IaC) 談義 2022
Infrastructure as Code (IaC) 談義 2022Infrastructure as Code (IaC) 談義 2022
Infrastructure as Code (IaC) 談義 2022
 
202204 AWS Black Belt Online Seminar Amazon Connect を活用したオンコール対応の実現
202204 AWS Black Belt Online Seminar Amazon Connect を活用したオンコール対応の実現202204 AWS Black Belt Online Seminar Amazon Connect を活用したオンコール対応の実現
202204 AWS Black Belt Online Seminar Amazon Connect を活用したオンコール対応の実現
 
202204 AWS Black Belt Online Seminar Amazon Connect Salesforce連携(第1回 CTI Adap...
202204 AWS Black Belt Online Seminar Amazon Connect Salesforce連携(第1回 CTI Adap...202204 AWS Black Belt Online Seminar Amazon Connect Salesforce連携(第1回 CTI Adap...
202204 AWS Black Belt Online Seminar Amazon Connect Salesforce連携(第1回 CTI Adap...
 
Amazon Game Tech Night #25 ゲーム業界向け機械学習最新状況アップデート
Amazon Game Tech Night #25 ゲーム業界向け機械学習最新状況アップデートAmazon Game Tech Night #25 ゲーム業界向け機械学習最新状況アップデート
Amazon Game Tech Night #25 ゲーム業界向け機械学習最新状況アップデート
 
20220409 AWS BLEA 開発にあたって検討したこと
20220409 AWS BLEA 開発にあたって検討したこと20220409 AWS BLEA 開発にあたって検討したこと
20220409 AWS BLEA 開発にあたって検討したこと
 
202202 AWS Black Belt Online Seminar AWS Managed Rules for AWS WAF の活用
202202 AWS Black Belt Online Seminar AWS Managed Rules for AWS WAF の活用202202 AWS Black Belt Online Seminar AWS Managed Rules for AWS WAF の活用
202202 AWS Black Belt Online Seminar AWS Managed Rules for AWS WAF の活用
 
202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdf
202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdf202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdf
202203 AWS Black Belt Online Seminar Amazon Connect Tasks.pdf
 
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介
 
Amazon QuickSight の組み込み方法をちょっぴりDD
Amazon QuickSight の組み込み方法をちょっぴりDDAmazon QuickSight の組み込み方法をちょっぴりDD
Amazon QuickSight の組み込み方法をちょっぴりDD
 
マルチテナント化で知っておきたいデータベースのこと
マルチテナント化で知っておきたいデータベースのことマルチテナント化で知っておきたいデータベースのこと
マルチテナント化で知っておきたいデータベースのこと
 
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
 
パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介
パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介
パッケージソフトウェアを簡単にSaaS化!?既存の資産を使ったSaaS化手法のご紹介
 
202202 AWS Black Belt Online Seminar Amazon Connect Customer Profiles
202202 AWS Black Belt Online Seminar Amazon Connect Customer Profiles202202 AWS Black Belt Online Seminar Amazon Connect Customer Profiles
202202 AWS Black Belt Online Seminar Amazon Connect Customer Profiles
 
Amazon Game Tech Night #24 KPIダッシュボードを最速で用意するために
Amazon Game Tech Night #24 KPIダッシュボードを最速で用意するためにAmazon Game Tech Night #24 KPIダッシュボードを最速で用意するために
Amazon Game Tech Night #24 KPIダッシュボードを最速で用意するために
 
202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨
202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨
202202 AWS Black Belt Online Seminar AWS SaaS Boost で始めるSaaS開発⼊⾨
 
[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介
[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介
[20220126] JAWS-UG 2022初頭までに葬ったAWSアンチパターン大紹介
 
202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介
202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介
202111 AWS Black Belt Online Seminar AWSで構築するSmart Mirrorのご紹介
 

Recently uploaded

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

202201 AWS Black Belt Online Seminar Apache Spark Performance Tuning for AWS Glue

  • 12. © 2021, Amazon Web Services, Inc. or its Affiliates. 12 RDD and DataFrame RDD data architecture image [ [1, Bob, 24], [2, Alice, 48], [3, Ken, 10], … ] DataFrame data architecture image col1 col2 col3 1 Bob 24 2 Alice 48 3 Ken 10 … … … With both interfaces, the code is written as if the data is processed as a list/table, but the actual data is distributed across multiple servers. DataFrame is a high-level API for RDDs, and processing described in a DataFrame is internally executed as an RDD.
  • 13. © 2021, Amazon Web Services, Inc. or its Affiliates. 13 Lazy evaluation • There are two types of Spark processing: “action” and “transformation” • When an “action” is executed, all the previous processing necessary for the action is performed • A series of processes executed by an “action” is called a “job” • Note that a “job” here is different from a Glue job. >>> df1 = spark.read.csv(…) >>> df2 = spark.read.json(…) >>> df3 = df1.filter(…) >>> df4 = spark.read.csv(…) >>> df5 = df2.join(df3, …) >>> df5.count() Action Up to this point, no actual processing is done. At this point, the preceding processes are executed for the first time. The df4 process is not a dependency of df5.count(), so it is not executed in this action.
  • 14. © 2022, Amazon Web Services, Inc. or its Affiliates. Examples of transformations and actions Transform: data generation and processing • select() • Selecting a column • read • Loading data • filter() • Data Filtering • groupBy() • Aggregation by group • sort() • Sorting data Action: Output the processing result • count() • Counting the number of records • write • Exporting to the file system • collect() • Collect all data on Driver node • show() • View a sample of the data • describe() • View data statistics
  • 15. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark Applications o A Spark application consists of multiple jobs.
    glueContext = GlueContext(SparkContext.getOrCreate(conf))
    spark = glueContext.spark_session
    df1 = spark.read.json(...)
    df1.show()  # job1
    df1.filter(df1.col1 == 'a').write.parquet(...)  # job2
    df1.filter(df1.col2 == 'b').write.parquet(...)  # job3
An application is the set of processes executed in a single GlueContext (or SparkContext).
  • 16. © 2020, Amazon Web Services, Inc. or its Affiliates. Shuffle and Stage df2 = df1.filter(df1.price > 500).groupBy("item").sum().withColumn("bargain", F.col("sum(price)") * 0.8) • Stages are divided at shuffle boundaries • Multiple tasks are processed concurrently within one stage (Diagram: Stage1 and Stage2, each consisting of partitions processed by parallel tasks)
  • 17. © 2021, Amazon Web Services, Inc. or its Affiliates. 17 Processing with and without shuffling (exchange of data between tasks) Example without shuffling: df2 = df1.filter(df1.price > 500); each partition can be filtered independently (df1: beef 1300, pork 200, chicken 700, fish 400 → df2: beef 1300, chicken 700). Example with shuffling: df2 = df1.groupBy('item').sum(); rows with the same key must be gathered from all partitions (df1: beef 2, pork 3, beef 4, pork 5 → df2: beef 6, pork 9).
  • 18. © 2022, Amazon Web Services, Inc. or its Affiliates. Processing with and without shuffling (exchange of data between tasks) Processing without shuffling • read • filter() • withColumn() • UDF • coalesce() Processing with shuffling • join() • groupBy() • orderBy() • repartition()
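As a quick check, explain() shows whether an operation introduces an Exchange (shuffle) into the physical plan. A minimal sketch, assuming a DataFrame df with item and price columns:
    df.filter(df.price > 500).explain()  # no Exchange in the plan: narrow transformation
    df.groupBy('item').sum().explain()   # plan contains Exchange hashpartitioning(item, ...)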
  • 19. © 2022, Amazon Web Services, Inc. or its Affiliates. Optimization with Catalyst Query Optimizer • Processes described in a DataFrame are converted into an optimized plan by the optimizer before being executed as RDDs.
    df1 = spark.read.csv('s3://path/to/data1')
    df2 = spark.read.parquet('s3://path/to/data2')
    df3 = df1.join(df2, df1.col1 == df2.col1)
    df4 = df3.filter(df3.col2 == 'meat').select(df3.col3, df3.col5)
    df4.show()
(Diagram: the logical plan scans data1 (10GB) and data2 (50GB), joins (60GB), then filters (20GB); the physical plan pushes the filter below the join; with storage-side optimization, predicate pushdown and column pruning shrink the scans themselves.)
  • 20. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of PySpark • Processing written with the PySpark DataFrame API is converted to JVM code via Py4J. • UDFs written in Python are executed by a separate Python worker process on each Executor, per task. (Diagram: the Python driver communicates with the SparkContext through Py4J; each Executor launches Python workers for its tasks.)
  • 21. © 2022, Amazon Web Services, Inc. or its Affiliates. Introduction to AWS Glue specific features
  • 22. © 2021, Amazon Web Services, Inc. or its Affiliates. Data Catalog • The Data Catalog holds the metadata (table names, column names, S3 paths and so on) necessary to access data sources such as S3 and databases from Glue, Athena, Redshift Spectrum, etc. • There are three ways to create metadata in the Data Catalog: crawlers, the Glue API, and DDL (Athena/EMR/Redshift Spectrum). • Amazon DynamoDB, Amazon S3, Amazon Redshift, Amazon RDS, JDBC-connectable databases, Kinesis, HDFS, etc. can be specified as data sources. • No need to manage a metastore database. (Diagram of data catalog usage: crawlers save metadata from the data sources; connected services such as Glue ETL, Athena, Redshift Spectrum, EMR, and Hive-alternative applications (1) access the metadata and (2) access the data.)
  • 23. © 2022, Amazon Web Services, Inc. or its Affiliates. DynamicFrame AWS Glue functionality that absorbs the inherent complexity of ETL on raw data • As a component, it sits at the same layer as DataFrame; the two can be converted to each other with the fromDF and toDF functions. • Leaves columns with multiple possible types to be resolved later (Choice type) • A DynamicFrame refers to the entire dataset, while a DynamicRecord refers to a single row of data. (Diagram: AWS Glue DynamicFrame sits alongside Spark DataFrame/Catalyst Optimizer on top of Spark Core (RDDs); its data structure resembles a semi-structured table of records.)
  • 24. © 2022, Amazon Web Services, Inc. or its Affiliates. Choice Type DynamicFrame is able to hold multiple types when they are found in a single column • The ResolveChoice method can be used to resolve such a column to a single type. Example of a choice-type schema: root |-- uuid: string |-- deviceid: choice |-- long |-- string The deviceid column has both long and string data (e.g. the column contains both the long value 1234 and the string "1234"). ResolveChoice offers four actions: project (keep a single type and discard the others), cast (cast all values to a single type), make_cols (keep each type in a separate column, e.g. deviceid_long and deviceid_string), and make_struct (convert to a struct type holding each type). With DataFrame, if more than one type is present, processing may be interrupted and have to be redone.
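A minimal sketch of resolving the column, assuming a DynamicFrame dyf whose deviceid column is a choice type:
    # Cast every value in the choice column to a single long type
    resolved = dyf.resolveChoice(specs=[('deviceid', 'cast:long')])
    # Or keep each type in its own column: deviceid_long / deviceid_string
    split = dyf.resolveChoice(specs=[('deviceid', 'make_cols')])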
  • 25. © 2022, Amazon Web Services, Inc. or its Affiliates. ETL processing that takes advantage of the characteristics of DynamicFrame and DataFrame • DynamicFrame is good for ETL processing; DataFrame is good for table operations. • Perform data input/output and the associated ETL processing with DynamicFrame, and table manipulation with DataFrame. By using DynamicFrame when loading, you can optimize data loading with the AWS Glue catalog, load only differential data, and handle semi-structured data with the Choice type. Table operations such as JOIN are performed in DataFrame. Use the toDF and fromDF functions for conversion between DynamicFrame and DataFrame. (No data copying is done; the conversion cost is within a few milliseconds.)
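For illustration, a hedged sketch of the round trip (dyf, other_df, and the output name are hypothetical):
    from awsglue.dynamicframe import DynamicFrame

    df = dyf.toDF()                    # DynamicFrame -> DataFrame for table operations
    joined = df.join(other_df, 'key')  # e.g. a JOIN done on the DataFrame side
    dyf_out = DynamicFrame.fromDF(joined, glueContext, 'dyf_out')  # back for writing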
  • 26. © 2022, Amazon Web Services, Inc. or its Affiliates. Bookmark function A function to process only the delta data in steady-state ETL • Uses file timestamps to process only the data that previous jobs have not yet processed, preventing duplicate processing and duplicate data. Processed data (not loaded) Unprocessed data (load target) df = spark.read.parquet('s3://path/to/data') s3://path/to/data
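A hedged sketch of a bookmark-enabled job (job bookmarks must also be enabled in the job settings; the database and table names are hypothetical):
    from awsglue.job import Job

    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database='my_db', table_name='my_table',
        transformation_ctx='datasource0')  # transformation_ctx keys the bookmark state
    # ... transform and write ...
    job.commit()  # persists the bookmark so the next run reads only new data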
  • 27. © 2022, Amazon Web Services, Inc. or its Affiliates. How to proceed with performance tuning of AWS Glue ETL
  • 28. © 2022, Amazon Web Services, Inc. or its Affiliates. The Performance Tuning Cycle 1. Determine performance goals. 2. Measure the metrics. 3. Identify bottlenecks. 4. Reduce the impact of the bottlenecks. 5. Repeat steps 2 to 4 until the goal is achieved. 6. Achieve the performance goal.
  • 29. © 2022, Amazon Web Services, Inc. or its Affiliates. Understand the characteristics of AWS Glue/Apache Spark • Distributed processing • There are tuning patterns such as shuffling and data skew that do not exist in single-process applications. • Lazy evaluation • Because Spark evaluates lazily, an error may be caused by an API call written earlier in the script, not by the one executed just before the error surfaced. • Impact of the optimizer • Since Spark processing is optimized internally, it can be difficult to map what you see in the Spark UI back to a specific part of the script. You need to check multiple metrics to find the cause.
  • 30. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark parameters in AWS Glue • Spark itself can be tuned through parameters at job execution time, but because AWS Glue is a serverless service, tune according to AWS Glue best practices first, before reaching for Spark parameters. • Test thoroughly whenever you change the value of a Spark parameter.
  • 31. © 2022, Amazon Web Services, Inc. or its Affiliates. Metrics to check • Spark UI(Spark History Server) • It shows the details of Spark processing. • Executor Log and Driver Log • It shows stdout/stderr logs of executors and a driver • AWS Glue Job metrics • It shows the CPU, memory, and other resource usage status of each executor and driver node • Statistics obtained from the Spark API • It shows samples and statistical values of intermediate data
  • 32. © 2022, Amazon Web Services, Inc. or its Affiliates. Setting up a job to do the tuning In order to get the logs needed for tuning, you need to check the box to use Monitoring Options in the Add job screen.
  • 33. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark UI
  • 34. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark History Server Spark event logs can be visualized by running the Spark History Server. There are several ways to launch the Spark History Server: • Using AWS CloudFormation • Using Docker to launch it on a local PC • Downloading Apache Spark to a local PC or EC2 instance and starting the Spark History Server • Using an EMR cluster https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
  • 35. © 2022, Amazon Web Services, Inc. or its Affiliates. Example of launching with Docker o Once the Docker container has started, access http://localhost:18080 in your browser.
    $ docker build -t glue/sparkui:latest .
    $ docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS \
        -Dspark.history.fs.logDirectory=s3a://path_to_eventlog \
        -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID \
        -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" \
        -p 18080:18080 glue/sparkui:latest \
        "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
Specify the S3 path of the Spark history logs in spark.history.fs.logDirectory.
  • 36. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark History Server Click and check the details of each application. Check the duration of each application.
  • 37. © 2022, Amazon Web Services, Inc. or its Affiliates. List of jobs executed by the application Completed Jobs Failed Jobs Check jobs which take a long time.
  • 38. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the contents of a job Identify Stages that are failing or taking a long time.
  • 39. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the contents of the Stage. If the bar lengths in the event timeline differ, skew is occurring and the work isn't sufficiently distributed. Check the data size in the Stage.
  • 40. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the contents of the Stage (continued) Check the details of what is taking so long, and the Task Time for each Executor. If there is a spill to disk, selecting a worker node type with more memory can solve it. In addition to the Event Timeline above, data skew can be seen in Summary Metrics and Tasks.
  • 41. © 2022, Amazon Web Services, Inc. or its Affiliates. View details of failing tasks Click on details to learn more. View the log of the Executor where the failure occurs. For AWS Glue ETL, note the Executor ID and check the corresponding CloudWatch Log group.
  • 42. © 2022, Amazon Web Services, Inc. or its Affiliates. Environmental settings for Spark application runtime o Spark options, dependencies, and so on.
  • 43. © 2022, Amazon Web Services, Inc. or its Affiliates. List of Driver and Executor nodes If the cluster is running and accessible from the History Server, the logs of each Driver/Executor can be seen.
  • 44. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the Spark SQL Query Execution Plan You can see the actual execution plan, which is more accurate than the explain API.
  • 45. © 2022, Amazon Web Services, Inc. or its Affiliates. Executor and Driver logs
  • 46. © 2022, Amazon Web Services, Inc. or its Affiliates. Check Log groups in CloudWatch Executor Logs Driver Log
  • 47. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue Job Metrics
  • 48. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the resource usage of Executor and Driver You can also create a Dashboard for CloudWatch and add other metrics.
  • 49. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark API
  • 50. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with commands o Use the following commands to check the trend of the intermediate data during processing and use the results to guide your tuning strategy. o Note that the commands that trigger actions (count(), show(), describe(...).show()) will slow the job down if used too many times. • count() • printSchema() • show() • describe([cols*]).show() • explain() • df.agg(approx_count_distinct(df.col))
  • 51. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.count() o Check the number of records. o Use df.groupBy('col_name').count() to check for skew.
  • 52. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.printSchema() o Check the schema information of the DataFrame.
  • 53. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.show() o The number of records to be displayed can be specified by using df.show(5).
  • 54. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.describe([cols*]).show() o The statistics for each column can be seen.
  • 55. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.explain() o Check the execution plan which optimizer created.
  • 56. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.agg(approx_count_distinct(df.col)) o Check the cardinality in the columns o Fast because HyperLogLog is used
  • 57. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue ETL Performance Tuning Pattern
  • 58. © 2022, Amazon Web Services, Inc. or its Affiliates. Basic strategy for tuning AWS Glue ETL jobs • Use the new version • Reduce the data I/O load. • Minimize shuffling. • Speed up per-task processing. • Parallelize
  • 59. © 2022, Amazon Web Services, Inc. or its Affiliates. Use the new version
  • 60. © 2022, Amazon Web Services, Inc. or its Affiliates. Use the new version o When the Spark application is slow, simply replacing the job execution environment with the latest version may speed up the process. o Both AWS Glue and Apache Spark are evolving in every way, not just in performance. Use the newest version possible. https://aws.amazon.com/jp/blogs/big-data/introducing-aws-glue-3-0-with-optimized-apache-spark-3-1-runtime-for-faster-data-integration/
  • 61. © 2022, Amazon Web Services, Inc. or its Affiliates. Using AWS Glue 2.0 and 3.0 to reduce startup time Dramatically reduced the time required to launch AWS Glue ETL jobs • A cold start used to take on the order of 10 minutes in AWS Glue 0.9/1.0. • In AWS Glue 2.0, it takes less than 1 minute. time Start-up time Execution time Execution time AWS Glue 1.0 AWS Glue 2.0
  • 62. © 2021, Amazon Web Services, Inc. or its Affiliates. 2. Submit job to virtual cluster AWS Glue 2.0+ integrated scheduling and provisioning 1. Run AWS Glue job 3. Spark schedules tasks to executors Job manager 4. Dynamically grow virtual clusters 5. Spark utilizes new executors AZ1 AZ2 Job starts when first executor is ready Reduced start time Reduced start variance Jobs run on reduced capacity Graceful degradation
  • 63. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize data I/O
  • 64. © 2022, Amazon Web Services, Inc. or its Affiliates. How to minimize the data I/O load • Read only the data you need. • Control the amount of data read in one task. • Choose the right compression format.
  • 65. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Apache Parquet • Column-oriented format whose data arrangement is suitable for analytical workloads • Data types are preserved • Compresses effectively • Skips unnecessary data and answers some aggregations from metadata • The Spark engine has efficient, built-in integration with Apache Parquet https://parquet.apache.org/documentation/latest/
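A minimal sketch of the write/read path (the bucket and paths are hypothetical; snappy is Spark's default Parquet codec):
    # Write as Parquet
    df.write.mode('overwrite').parquet('s3://my-bucket/sales/')
    # Columnar read: only the referenced columns are scanned
    spark.read.parquet('s3://my-bucket/sales/').select('item', 'price').show()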
  • 66. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filtering and Filter Pushdown Reduce the amount of data to be read • Partition Filtering • The ability to read only the files in the partitions matched by the filter or Where clause • Available in Text/CSV/JSON/ORC/Parquet • Filter Pushdown • The ability to read only the blocks that can match the filter or Where clause, for columns not used as partition columns • AWS Glue applies this automatically when Parquet is used.
  • 67. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filter and Filter Pushdown Filter Pushdown Partition Filtering
  • 68. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filter and Filter Pushdown Partition Filtering can be used when a partition directory structure has been created; for DataFrame and DynamicFrame writes, partition directories can be created with the partitionBy option as follows: df.write.parquet(path=path, partitionBy='col_name') Placing the columns used most frequently in filter clauses at higher levels of the partition hierarchy tends to be more efficient.
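For illustration, a sketch of the read side (the bucket and layout are hypothetical, assuming data partitioned by a year column):
    df = spark.read.parquet('s3://bucket/table/')
    # Partition filtering: only files under year=2021/ are listed and read
    df_2021 = df.filter(df.year == '2021')
    # Filter pushdown: Parquet row groups whose min/max statistics exclude the predicate are skipped
    df_cheap = df.filter(df.price < 100)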
  • 69. © 2022, Amazon Web Services, Inc. or its Affiliates. Using push_down_predicate in DynamicFrame Read only the files in the partitions that match the filter or where condition when reading a DynamicFrame from the AWS Glue data catalog.
    partitionPredicate = "(product_category == 'Video')"
    datasource = glue_context.create_dynamic_frame.from_catalog(
        database = "githubarchive_month",
        table_name = "data",
        push_down_predicate = partitionPredicate)
  • 70. © 2022, Amazon Web Services, Inc. or its Affiliates. Choose a compression codec based on your application • The compression codec can be selected when writing data. • Trade-off between compression ratio and compression/decompression speed • Files compressed with bzip2 (and lzo, if indexed) can be split and processed in parallel when read, but files compressed with gzip(*) or snappy cannot. • Uncompressed files do not require compression/decompression time, but data transfer cost may become a bottleneck. • If processing speed is important to you, choose snappy or lzo. ex. df.write.csv("path/to/csv", compression="gzip") (*) Parquet files can still be split even when gzip-compressed.
                     gzip    bzip2    lzo              snappy
    File extension   .gz     .bz2     .lzo             .snappy
    Compression      High    Highest  Average          Average
    Speed            Medium  Slow     Fast             Fast
    CPU usage        Medium  High     Low              Low
    Splittable       No(*)   Yes      Yes, if indexed  No
  • 71. © 2022, Amazon Web Services, Inc. or its Affiliates. Store data in appropriate file sizes. • Data read/write tasks are basically tied to a single file. (If the file is splittable, one file can be split across multiple tasks.) • The recommended file size for AWS Glue is 128MB-512MB. When the data is too small: • Overhead from a large number of small tasks. When one file holds large unsplittable data: • The data must be handled by a single task and may not fit into the memory of a single node. • No distributed processing.
  • 72. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Bounded Execution with DynamicFrame When there is a lot of data to be read, Bounded Execution can be combined with Job Bookmarks to split the processing instead of reading all the unprocessed data at once.
    glueContext.create_dynamic_frame.from_catalog(
        database = "database",
        tableName = "table_name",
        redshift_tmp_dir = "",
        transformation_ctx = "datasource0",
        additional_options = {
            "boundedFiles": "500",         # needs to be a string
            # "boundedSize": "1000000000"  # unit is bytes
        }
    )
  • 73. © 2022, Amazon Web Services, Inc. or its Affiliates. Using DynamicFrame's groupFiles and groupSize • Eliminate overhead by reading small files together in a single task. • Useful for processing data that is output every few minutes by Kinesis Data Firehose. • Use the groupFiles option to group the files in an S3 partition, and the groupSize option to specify the size of each group to be read.
    df = glueContext.create_dynamic_frame_from_options(
        's3',
        {'paths': ['s3://s3path/'],
         'recurse': True,
         'groupFiles': 'inPartition',
         'groupSize': '1048576'},
        format='json')
note: groupFiles is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc.
  • 74. © 2022, Amazon Web Services, Inc. or its Affiliates. Number of files and processing time for DataFrame and DynamicFrame
  • 75. © 2022, Amazon Web Services, Inc. or its Affiliates. Using DynamicFrame S3ListImplementation • If there are many small files, the large number of tasks can cause driver OOM. • When useS3ListImplementation is True, S3 list results are read and processed in batches of 1,000, which prevents the driver memory from being exhausted by S3 listing.
    datasource = glue_context.create_dynamic_frame.from_catalog(
        database = "my_database",
        table_name = "my_table",
        push_down_predicate = partitionPredicate,
        additional_options = {"useS3ListImplementation": True})
  • 76. © 2022, Amazon Web Services, Inc. or its Affiliates. Set the Partition Index When reading from the AWS Glue catalog against a table with many partitions built from multiple partition keys, setting a Partition Index reduces the time needed to fetch the matching partitions for queries with filter or where clauses on those keys. https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
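A hedged sketch of creating an index through boto3 (the names are hypothetical; the index keys must be partition keys of the table):
    import boto3

    glue = boto3.client('glue')
    glue.create_partition_index(
        DatabaseName='my_database',
        TableName='my_table',
        PartitionIndex={'IndexName': 'year_month_idx',
                        'Keys': ['year', 'month']})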
  • 77. © 2022, Amazon Web Services, Inc. or its Affiliates. The difference of query planning time between using Partition Index and not using Partition Index https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
  • 78. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallel Data Reading in DataFrame JDBC Connections • spark.read.jdbc() accesses the target database from only one Executor by default. • For parallel reading, a partition column, lowerBound, upperBound, and numPartitions must be specified. The partition column must be of numeric, date, or timestamp type.
    df = spark.read.jdbc(
        url=jdbcUrl,
        table="sample",
        column="col1",        # partition column: numeric, date, or timestamp
        lowerBound=1,
        upperBound=100000,
        numPartitions=100,
        properties=connectionProperties)  # e.g. include "fetchsize": "1000"
  • 79. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallel data reading in DynamicFrame JDBC connections o If you want to read data from a JDBC connection as a DynamicFrame, specify hashfield/hashexpression. o With hashfield, string and other non-numeric columns can also be used as partition columns.
    glueContext.create_dynamic_frame.from_catalog(
        database = "my_database",
        tableName = "my_table_name",
        transformation_ctx = "my_transformation_context",
        additional_options = {
            'hashfield': 'customer_name',
            'hashpartitions': '5'
        })
https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
  • 80. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling
  • 81. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling • Make good use of cache • Perform filter processing in the first stage as much as possible. • Devise the order of joins to keep the data small. • Optimize join strategy • Remove data skew • Use Window processing instead of data processing by self join
  • 82. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling The processing described with DataFrames is optimized by the Catalyst Optimizer. However, it is not perfect in the following respects: • If there is a cache() in between, optimization across that boundary does not work. • Spark 2.4.3, used in AWS Glue 1.0 and 2.0, disables the cost-based optimizer by default. Shuffling can be reduced by manually adjusting the order and strategy of filters and joins.
  • 83. © 2022, Amazon Web Services, Inc. or its Affiliates. Make good use of cache • When branching the processing of a single DataFrame to produce multiple outputs, you can prevent recomputation by inserting cache() just before the branch. • Note that it can sometimes be faster not to cache, and that heavy use of cache consumes local disk space. (Diagram: without cache(), the process that creates df2 runs twice, once per branch; with cache(), it runs only once.)
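A minimal sketch of the pattern (the transformation and output paths are hypothetical):
    df2 = df1.filter(df1.price > 0)   # some expensive shared transformation
    df2.cache()                       # materialized on the first action, reused by the second
    df2.filter(df2.col1 == 'a').write.parquet('s3://bucket/out_a/')
    df2.filter(df2.col1 == 'b').write.parquet('s3://bucket/out_b/')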
  • 84. © 2022, Amazon Web Services, Inc. or its Affiliates. Make good use of cache • By default, cached data is stored in the memory initially allocated for caching, and whatever does not fit in memory is stored on the local disk. • Users can choose to store in memory only or on disk only. Example of caching in memory only (cache() takes no arguments; use persist() with a storage level):
    from pyspark import StorageLevel
    df.persist(StorageLevel.MEMORY_ONLY)
  • 85. © 2022, Amazon Web Services, Inc. or its Affiliates. Delete cache that is no longer in use • A cached Dataframe will continue to occupy memory and local disk space. • Save memory and disk space by deleting the Dataframe cache when it is no longer needed. df.unpersist()
  • 86. © 2022, Amazon Web Services, Inc. or its Affiliates. Perform filter processing in the first stage as much as possible The filter process can be placed before cache() to reduce the amount of data during processing.
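For illustration (the column and predicate are hypothetical):
    # Filter first so the cache holds only the rows the later stages need
    df_small = df.filter(df.category == 'Video')
    df_small.cache()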
  • 87. © 2022, Amazon Web Services, Inc. or its Affiliates. Devise the order of joins • The end result is the same, but the size of the intermediate DataFrames differs. • In Glue 3.0 (Spark 3.1.1), the cost-based optimizer takes the amount of data into account and optimizes the join order. (Diagram: df1 (4000 rows) left join df2 (1000 rows) yields df4 (4000 rows), which joined with df3 (10 rows) yields df5 (10 rows); joining df1 with df3 first yields df4 (10 rows), and the following left join with df2 yields the same df5 (10 rows) with a much smaller intermediate.)
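Sketched in code under the row counts above (the frames and key are hypothetical); the join that shrinks the data comes first:
    df4 = df1.join(df3, 'key')           # 4000 rows x 10 rows -> ~10 rows
    df5 = df4.join(df2, 'key', 'left')   # small intermediate left-joined to 1000 rows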
  • 88. © 2022, Amazon Web Services, Inc. or its Affiliates. Using join in different ways Sort Merge Join • Distribute the two tables to be joined by their respective keys, sort them, and then join them. • Suitable for joining large tables together. Broadcast Join • Transfer one table to all Executors, and distribute the other table to all Executors and join them. • Suitable for when one table is smaller than the other. Shuffle Hash Join • Distribute the two tables to be joined and join them without sorting. • Suitable for joins between tables that are not so large.
  • 89. © 2022, Amazon Web Services, Inc. or its Affiliates. Using join in different ways • By default, if a table's size is less than or equal to the value of spark.sql.autoBroadcastJoinThreshold (default 10MB), Broadcast Join is used. • The join strategy in use can be seen in the Spark UI or with explain(). • If join performance is the bottleneck, changing the join strategy manually may improve it.
    from pyspark.sql.functions import broadcast
    df1.join(broadcast(df2), df1['col1'].eqNullSafe(df2['col1'])).explain()

    == Physical Plan ==
    BroadcastHashJoin [coalesce(col1#6, )], [coalesce(col1#21, )], Inner, BuildRight, (col1#6 <=> col1#21)
    :- LocalTableScan [first_name#5, col1#6]
    +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
       +- LocalTableScan [col1#21, col2#22, population#23]
https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html#broadcast-hint-for-sql-queries
  • 90. © 2022, Amazon Web Services, Inc. or its Affiliates. coalesce For the following reasons, partitions may become fragmented during processing: • Loading a large number of small files • Performing groupBy on columns with high cardinality In such cases, merging partitions before the next step reduces the overhead of subsequent processing. Since repartition() involves a shuffle, coalesce() is often preferable; note, however, that because coalesce() simply merges adjacent partitions, the resulting data may be skewed. Glue 3.0 has a new feature called Adaptive Query Execution that automatically optimizes the number of partitions. (Diagram: df.repartition(2) redistributes evenly with a shuffle; df.coalesce(2) merges partitions without one.)
  • 91. © 2022, Amazon Web Services, Inc. or its Affiliates. Use Window processing instead of self join and data aggregation o When aggregate data is created from a single log dataset and joined back to it, the join can be eliminated by using Window processing.
    # Self-join version: aggregate, then join back
    df_agg = df.groupBy('gender', 'age').agg(
        F.mean('height').alias('avg_height'),
        F.mean('weight').alias('avg_weight'))
    df = df.join(df_agg, on=['gender', 'age'])

    # Window version: no join needed
    w = Window.partitionBy('gender', 'age')
    df = df.withColumn('avg_height', F.mean(F.col('height')).over(w)) \
           .withColumn('avg_weight', F.mean(F.col('weight')).over(w))
  • 92. © 2022, Amazon Web Services, Inc. or its Affiliates. Speed up per-task processing
  • 93. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Scala Most DataFrame operations can be written in PySpark and are internally converted to JVM code, but the following are slower when written in Python. If they are the bottleneck, Scala will speed up the process. • Code written directly against RDDs • Processing written on RDDs is not optimized by the optimizer. • Code that uses UDFs • Described later.
  • 94. © 2022, Amazon Web Services, Inc. or its Affiliates. Avoid UDFs in PySpark Performance issues • Serialization and piping to the Python process occur for each iterator. • The memory of the Python process is not controlled by the JVM. (Diagram: the JVM physical operator pipes batches of rows to a Python worker, which deserializes them, invokes the UDF, and serializes the results back.)
  • 95. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Pandas UDFs over Python UDFs Python UDF • Serialization/deserialization is done by pickling. • Data is fetched block by block, but the UDF is invoked row by row. Pandas UDF • Serialization/deserialization is done by Apache Arrow. • Both the data fetch and the UDF invocation operate on many rows at once.
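A minimal sketch of the two styles (runs on both Spark 2.4 and 3.1; df and its temp_f column are hypothetical):
    import pandas as pd
    from pyspark.sql.functions import udf, pandas_udf, PandasUDFType
    from pyspark.sql.types import DoubleType

    # Python UDF: invoked row by row, values pickled between the JVM and Python
    @udf(DoubleType())
    def to_celsius_py(f):
        return (f - 32) * 5.0 / 9.0

    # Pandas UDF: invoked on whole pd.Series batches exchanged via Apache Arrow
    @pandas_udf(DoubleType(), PandasUDFType.SCALAR)
    def to_celsius_pd(f):
        return (f - 32) * 5.0 / 9.0

    df = df.withColumn('temp_c', to_celsius_pd(df['temp_f']))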
  • 96. © 2022, Amazon Web Services, Inc. or its Affiliates. Performance differences between Python UDF, Pandas UDF, and Spark API in AWS Glue ETL Execution Time(s)
  • 97. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallelize
• 98. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness in data
If data volume varies between partitions, the load concentrates on the few tasks that process the large partitions, and those tasks become the bottleneck for processing time.
When does it happen?
• When the sizes of the input files are uneven
• When joining on a key whose record counts differ greatly between key values
• When df.groupBy() is performed on a key with an uneven record distribution
A quick way to check for key skew is shown below.
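A minimal sketch for spotting a skewed key distribution; the column name 'shop' is an illustrative assumption:

    from pyspark.sql import functions as F

    # Count records per join/group key; a few very large counts indicate skew
    df.groupBy('shop').count().orderBy(F.desc('count')).show(10)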
• 99. © 2022, Amazon Web Services, Inc. or its Affiliates. Addressing data skew
• Ensure that file sizes are uniform when creating the input files
• Repartition
• Broadcast join
• Salting
• 100. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: repartition
If the subsequent processing is not key-by-key (unlike, for example, writing data partitioned by date, or window processing per key), repartition resolves the skew. df.repartition(200)
[Diagram: 3 skewed partitions redistributed evenly into 200 partitions]
• 101. © 2021, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: broadcast join
If one DataFrame is small enough that all its data fits in a single executor, and the other DataFrame is huge with a skewed join key column, switching the join strategy to a Broadcast Join can improve processing performance.
[Diagram: with Sort Merge Join, all 1000 rows for the skewed key 'beef' shuffle into one partition; with Broadcast Join, the small price table is copied to every partition and the large table's rows stay evenly distributed at 500 rows each]
• 102. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: salting
In a join between two data sets that are both too large to broadcast, where one side is skewed, "salting" the key can eliminate the load imbalance.
[Diagram: the skewed partition of Table A is split by appending a random salt to the key, and the corresponding partition of Table B is cloned once per salt value]
• 103. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: salting
o In Glue 2.0 (Spark 2.4.3), users need to write the salting code manually.
o In Glue 3.0 (Spark 3.1.1), a new feature called Adaptive Query Execution handles skew joins dynamically (see the sketch after this slide).

    # Salt the skewed column of the large DataFrame
    df_big = df_big.withColumn(
        'shop_salted',
        F.concat(df_big['shop'], F.lit('_'),
                 (F.floor(F.rand() * numPartition) + 1).cast('string')))

    # Explode the small DataFrame so every salt value has a matching row
    df_medium = df_medium.withColumn(
        'salt',
        F.explode(F.array([F.lit(i) for i in range(1, numPartition + 1)])))
    df_medium = df_medium.withColumn(
        'shop_exploded',
        F.concat(df_medium['shop'], F.lit('_'), df_medium['salt'].cast('string')))

    # Join on the salted keys
    df_join = df_big.join(
        df_medium, df_big['shop_salted'] == df_medium['shop_exploded'])

https://spark.apache.org/docs/latest/sql-performance-tuning.html#optimizing-skew-join
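On Glue 3.0, a similar effect can often be achieved by enabling AQE's skew-join handling instead of hand-written salting. A minimal sketch; note that in Spark 3.1.1 AQE is off by default, so it must be enabled explicitly:

    # Enable Adaptive Query Execution and its skew-join optimization
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # Skewed partitions are detected and split automatically at runtime
    df_join = df_big.join(df_medium, on='shop')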
• 104. © 2022, Amazon Web Services, Inc. or its Affiliates. Assigning incremental IDs with performance in mind
• If the window function row_number() is used to assign contiguous incremental IDs to all records, processing is slow because it requires aggregating across all records.
• monotonically_increasing_id() assigns incremental IDs without aggregation, at the cost of allowing gaps between partitions.

    # Contiguous IDs (slow: requires global ordering over all records)
    df = df.withColumn('id', F.row_number().over(Window.orderBy('col2')))

    # Non-contiguous but fast: no aggregation required
    df = df.withColumn('id', F.monotonically_increasing_id())

[Diagram: row_number() produces 1,2,3,...,11 with no gaps; monotonically_increasing_id() produces 1,2,3,6,7,8,... with gaps across partitions]
• 105. © 2022, Amazon Web Services, Inc. or its Affiliates. Selecting a Worker Type for AWS Glue
• The processing power allocated at job execution time is called a DPU (Data Processing Unit).
• 1 DPU = 4 vCPU, 16 GB memory
• Each Worker Type has a different resource capacity and configuration (see the sketch after this slide for how it is specified).

Worker Type | DPUs/worker | Executors/worker | Memory/worker | Disk/worker
Standard    | 1           | 2                | 16 GB         | 50 GB
G.1X        | 1           | 1                | 16 GB         | 64 GB
G.2X        | 2           | 1                | 32 GB         | 128 GB

[Diagram: Worker Type configurations — Standard: one DPU per worker shared by two executors; G.1X: one DPU and one executor per worker; G.2X: two DPUs and one executor per worker]
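The Worker Type and worker count are set when the job is created or updated. A minimal boto3 sketch; the job name, IAM role, and script location are hypothetical placeholders:

    import boto3

    glue = boto3.client('glue')
    glue.create_job(
        Name='my-etl-job',                # hypothetical job name
        Role='MyGlueServiceRole',         # hypothetical IAM role
        GlueVersion='3.0',
        WorkerType='G.1X',                # Standard | G.1X | G.2X
        NumberOfWorkers=10,
        Command={'Name': 'glueetl',
                 'ScriptLocation': 's3://my-bucket/scripts/job.py'})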
• 106. © 2022, Amazon Web Services, Inc. or its Affiliates. Ideal resource usage
• Ideally, resources are used evenly across all executors, with none idle or wasted.
• If not, there is likely room for tuning.
• Choose the initial Worker Type by predicting the resource profile from the nature of the processing.
• Examples:
• CPU usage tends to be high when there are complex UDFs or other heavy transformations.
• Memory usage tends to be high when the shuffle size is large, such as when joining large amounts of data.
• 107. © 2022, Amazon Web Services, Inc. or its Affiliates. Trade-off between number of workers and job execution time
As long as there is enough parallelism for the added resources to be used effectively, increasing the number of workers shortens job execution time without increasing total cost (more workers running for a proportionally shorter time).
[Charts: AWS Glue ETL job execution time versus number of workers (5 and 10)]
• 108. © 2022, Amazon Web Services, Inc. or its Affiliates. Summary
• Introduced tuning patterns for AWS Glue Spark ETL jobs.
• AWS Glue can process large amounts of data with high performance as-is, but it can be tuned to achieve even higher performance and scalability.
• Tuning requires checking metrics to identify bottlenecks and eliminating their causes.