In the past few years, data warehousing went through a radical transition from using click-based ETL tools to using code for defining data pipelines. In this process, the field of “data engineering” was born, Python became the dominant language for describing data integration pipelines, and Apache Airflow emerged as the dominant framework in the field. However, for most companies that don’t operate at the scale of Airbnb, Airflow is overkill when the task is to integrate a few GB or TB of data. In this talk, I will introduce Mara as a lightweight, opinionated ETL framework halfway between Airflow and plain Python scripts, with a focus on transparency and complexity reduction. It condenses the learnings from 6 years of building data warehouses for more than 20 of the portfolio companies of Project A. I will guide you through some of the design decisions behind the platform and some general learnings for setting up successful data engineering teams.
2. All the data of the company in one place
Data is
the single source of truth
easy to access
documented
embedded into the organisation
Integration of different domains
Main challenges
Consistency & correctness
Changeability
Complexity
Transparency
Data warehouse = integrated data
Nowadays required for running a business
[Diagram: application databases, events, CSV files and APIs are integrated into the DWH (orders, users, products, price histories, emails, clicks, operation events, …), which in turn feeds reporting, CRM, marketing, search, pricing, …]
5. Avoid click-tools
hard to debug
hard to change
hard to scale with team size / data complexity / data volume
Data pipelines as code
SQL files, Python & shell scripts
Structure & content of the data warehouse are the result of running code
Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
Make changing and testing things easy
Apply standard software engineering best practices
Megabytes: plain scripts
Petabytes: Apache Airflow
In between: Mara
10. Example pipeline
pipeline = Pipeline(
    id="pypi",
    description="Builds a PyPI downloads cube using the public ..")

# ..

pipeline.add(
    Task(id="transform_python_version", description="..",
         commands=[
             ExecuteSQL(sql_file_name="transform_python_version.sql")
         ]),
    upstreams=['read_download_counts'])

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download_counts", description="..",
        sql_statement="SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download_counts.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer",
               "transform_python_version"])
ETL pipelines as code
Pipeline = list of tasks with dependencies between them. Task = list of commands
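For orientation, here is a minimal, self-contained pipeline in the same style. The import paths are an assumption based on the current mara_pipelines package naming (the package was called data_integration at the time of this talk), so treat this as a sketch rather than the exact code from the slides.

# Minimal sketch of a pipeline with one task and one command.
# Import paths are an assumption (current mara_pipelines naming);
# adapt them to the installed package version.
from mara_pipelines.pipelines import Pipeline, Task
from mara_pipelines.commands.sql import ExecuteSQL

pipeline = Pipeline(
    id="demo",
    description="Smallest possible pipeline: one task with one SQL command")

pipeline.add(
    Task(id="create_schema",
         description="Creates an empty target schema",
         commands=[ExecuteSQL(sql_statement="CREATE SCHEMA IF NOT EXISTS demo_tmp;")]))

# Running it from code (function location also assumed):
# from mara_pipelines.ui.cli import run_pipeline
# run_pipeline(pipeline)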
11. Target of computation
CREATE TABLE m_dim_next.region (
region_id SMALLINT PRIMARY KEY,
region_name TEXT NOT NULL UNIQUE,
country_id SMALLINT NOT NULL,
country_name TEXT NOT NULL,
_region_name TEXT NOT NULL
);
Do computation and store result in table
WITH raw_region
AS (SELECT DISTINCT
country,
region
FROM m_data.ga_session
ORDER BY country, region)
INSERT INTO m_dim_next.region
SELECT
row_number()
OVER (ORDER BY country, region ) AS region_id,
CASE WHEN (SELECT count(DISTINCT country)
FROM raw_region r2
WHERE r2.region = r1.region) > 1
THEN region || ' / ' || country
ELSE region END AS region_name,
dense_rank() OVER (ORDER BY country) AS country_id,
country AS country_name,
region AS _region_name
FROM raw_region r1;
INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
Speed up subsequent transformations
SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['_region_name', 'country_name',
'region_id']);
SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['country_id', 'region_id']);
ANALYZE m_dim_next.region;
PostgreSQL as a data processing engine
Leave data in DB, Tables as (intermediate) results of processing steps
12. Execute query
ExecuteSQL(sql_file_name="preprocess-ad.sql")
cat app/data_integration/pipelines/facebook/preprocess-ad.sql
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning
psql --username=mloetzsch --host=localhost --echo-all
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
Read file
ReadFile(file_name="country_iso_code.csv",
         compression=Compression.NONE,
         target_table="os_data.country_iso_code",
         mapper_script_file_name="read-country-iso-codes.py",
         delimiter_char=";")
cat "dwh-data/country_iso_code.csv"
| .venv/bin/python3.6 "app/data_integration/pipelines/load_data/read-country-iso-codes.py"
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning
psql --username=mloetzsch --host=localhost --echo-all
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
--command="COPY os_data.country_iso_code FROM STDIN WITH CSV
DELIMITER AS ';'"
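The mapper script referenced above (read-country-iso-codes.py) is not shown here. A minimal sketch of what such a stdin-to-stdout mapper could look like, assuming a made-up two-column layout (country name; ISO code):

#!/usr/bin/env python3
# Hypothetical mapper script: reads the raw ';'-separated file from stdin,
# normalises each row and writes ';'-separated CSV to stdout, which the
# ReadFile command then pipes into "COPY ... FROM STDIN".
# The column layout (country name, ISO code) is an assumption.
import csv
import sys

reader = csv.reader(sys.stdin, delimiter=';')
writer = csv.writer(sys.stdout, delimiter=';')

for row in reader:
    if len(row) < 2:
        continue  # skip empty or malformed lines
    country_name, iso_code = row[0].strip(), row[1].strip().upper()
    writer.writerow([country_name, iso_code])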
Copy from other databases
Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
     target_table="os_data.product",
     replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
              "@@client@@": "kfzteile24 GmbH"})
cat app/data_integration/pipelines/load_data/pdm/load-product.sql
| sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/
kfzteile24 GmbH/g"
| sed 's/$/$/g;s/$/$/g' | (cat && echo ';')
| (cat && echo ';
go')
| sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning
psql --username=mloetzsch --host=localhost --echo-all
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
--command="COPY os_data.product FROM STDIN WITH CSV HEADER"
Shell commands as interface to data & DBs
Nothing is faster than a unix pipe
13. Read a set of files
pipeline.add(
    ParallelReadFile(
        id="read_download",
        description="Loads PyPI downloads from pre_downloaded csv files",
        file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
        read_mode=ReadMode.ONLY_NEW,
        compression=Compression.GZIP,
        target_table="pypi_data.download",
        delimiter_char="\t", skip_header=True, csv_format=True,
        file_dependencies=read_download_file_dependencies,
        date_regex="^(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/",
        partition_target_table_by_day_id=True,
        timezone="UTC",
        commands_before=[
            ExecuteSQL(
                sql_file_name="create_download_data_table.sql",
                file_dependencies=read_download_file_dependencies)
        ]))
Split large joins into chunks
pipeline.add(
    ParallelExecuteSQL(
        id="transform_download",
        description="Maps downloads to their dimensions",
        sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer"])
Incremental & parallel processing
You can’t join all clicks with all customers at once
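The chunk parameter function referenced in these examples (etl_tools.utils.chunk_parameter_function) is not shown in the slides. As an illustration only, such a function could be as simple as the following sketch; the actual implementation in etl-tools may differ:

# Illustrative sketch of a chunk parameter function (not the actual
# etl_tools.utils.chunk_parameter_function). ParallelExecuteSQL calls it
# once and then runs the sql_statement once per returned parameter tuple,
# substituting @chunk@ with the chunk number. A downstream SQL function
# such as pypi_tmp.insert_download(chunk) would then only process rows
# whose key maps to that chunk, e.g. via a modulo on an id column.
def chunk_parameter_function():
    number_of_chunks = 10  # assumption; typically read from project config
    return [(chunk,) for chunk in range(number_of_chunks)]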