SlideShare a Scribd company logo
1 of 16
Download to read offline
@martin_loetzsch
Dr. Martin Loetzsch
PyData Berlin September meetup
https://github.com/mara
Lightweight ETL pipelines with Mara
All the data of the company in one place


Data is
the single source of truth
easy to access
documented
embedded into the organisation
Integration of different domains





















Main challenges
Consistency & correctness
Changeability
Complexity
Transparency
!2
Data warehouse = integrated data
@martin_loetzsch
Nowadays required for running a business
application
databases
events
csv files
apis
reporting
crm
marketing
…
search
pricing
DWH orders
users
products
price 

histories
emails
clicks
…
…
operation

events
!3
Data Engineering
@martin_loetzsch
!4
Which technology?
@martin_loetzsch
Avoid click-tools
hard to debug
hard to change
hard to scale with team size/ data complexity / data volume 

Data pipelines as code
SQL files, python & shell scripts
Structure & content of data warehouse are result of running code 

Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
!5
Make changing and testing things easy
@martin_loetzsch
Apply standard software engineering best practices
Megabytes
Plain scripts
Petabytes
Apache Airflow
In between
Mara
!6
Mara: Things that worked for us
@martin_loetzsch
Open source (MIT license)
m_dim marketing
!7
Fokus: complex data
@martin_loetzsch
Many sources, complex business logic
!8
Mara
@martin_loetzsch
Runnable app
Integrates PyPI project download stats with 

Github repo events
!9
Example: Python project stats data warehouse
@martin_loetzsch
https://github.com/mara/mara-example-project
Example pipeline
pipeline = Pipeline(
id="pypi",
description="Builds a PyPI downloads cube using the public ..”)
# ..
pipeline.add(
Task(id=“transform_python_version", description=‘..’,
commands=[
ExecuteSQL(sql_file_name="transform_python_version.sql")
]),
upstreams=['read_download_counts'])
pipeline.add(
ParallelExecuteSQL(
id=“transform_download_counts", description=“..”,
sql_statement=“SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
parameter_function=etl_tools.utils.chunk_parameter_function,
parameter_placeholders=["@chunk@"],
commands_before=[
ExecuteSQL(sql_file_name="transform_download_counts.sql")
]),
upstreams=["preprocess_project_version", "transform_installer",
"transform_python_version"])
!10
ETL pipelines as code
@martin_loetzsch
Pipeline = list of tasks with dependencies between them. Task = list of commands
Target of computation


CREATE TABLE m_dim_next.region (

region_id SMALLINT PRIMARY KEY,

region_name TEXT NOT NULL UNIQUE,

country_id SMALLINT NOT NULL,

country_name TEXT NOT NULL,

_region_name TEXT NOT NULL

);



Do computation and store result in table


WITH raw_region
AS (SELECT DISTINCT
country,

region

FROM m_data.ga_session

ORDER BY country, region)



INSERT INTO m_dim_next.region
SELECT
row_number()
OVER (ORDER BY country, region ) AS region_id,

CASE WHEN (SELECT count(DISTINCT country)
FROM raw_region r2
WHERE r2.region = r1.region) > 1
THEN region || ' / ' || country
ELSE region END AS region_name,
dense_rank() OVER (ORDER BY country) AS country_id,
country AS country_name,
region AS _region_name
FROM raw_region r1;

INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');

Speedup subsequent transformations


SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['_region_name', ‘country_name',
'region_id']);



SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['country_id', 'region_id']);



ANALYZE m_dim_next.region;
!11
PostgreSQL as a data processing engine
@martin_loetzsch
Leave data in DB, Tables as (intermediate) results of processing steps
Execute query


ExecuteSQL(sql_file_name=“preprocess-ad.sql")
cat app/data_integration/pipelines/facebook/preprocess-ad.sql 
| PGTZ=Europe/Berlin PGOPTIONS=—client-min-messages=warning 
psql --username=mloetzsch --host=localhost --echo-all 
-—no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl

Read file


ReadFile(file_name=“country_iso_code.csv",
compression=Compression.NONE,
target_table="os_data.country_iso_code",
mapper_script_file_name=“read-country-iso-codes.py",
delimiter_char=“;")
cat "dwh-data/country_iso_code.csv" 
| .venv/bin/python3.6 "app/data_integration/pipelines/load_data/
read-country-iso-codes.py" 
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning 
psql --username=mloetzsch --host=localhost --echo-all 
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl 
--command="COPY os_data.country_iso_code FROM STDIN WITH CSV
DELIMITER AS ';'"

Copy from other databases


Copy(sql_file_name="pdm/load-product.sql", source_db_alias=“pdm",
target_table=“os_data.product",
replace={"@@db@@": "K24Pdm", "@@dbschema@@": “ps",
"@@client@@": "kfzteile24 GmbH"})

cat app/data_integration/pipelines/load_data/pdm/load-product.sql 
| sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/
kfzteile24 GmbH/g" 
| sed 's/$/$/g;s/$/$/g' | (cat && echo ';') 
| (cat && echo ';
go') 
| sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv 
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning 
psql --username=mloetzsch --host=localhost --echo-all 
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl 
--command="COPY os_data.product FROM STDIN WITH CSV HEADER"
!12
Shell commands as interface to data & DBs
@martin_loetzsch
Nothing is faster than a unix pipe
Read a set of files


pipeline.add(
ParallelReadFile(
id="read_download",
description="Loads PyPI downloads from pre_downloaded csv
files",
file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
read_mode=ReadMode.ONLY_NEW,
compression=Compression.GZIP,
target_table="pypi_data.download",
delimiter_char="t", skip_header=True, csv_format=True,
file_dependencies=read_download_file_dependencies,
date_regex="^(?P<year>d{4})/(?P<month>d{2})/(?
P<day>d{2})/",
partition_target_table_by_day_id=True,
timezone="UTC",
commands_before=[
ExecuteSQL(
sql_file_name="create_download_data_table.sql",
file_dependencies=read_download_file_dependencies)
]))
Split large joins into chunks
pipeline.add(
ParallelExecuteSQL(
id="transform_download",
description="Maps downloads to their dimensions",
sql_statement="SELECT
pypi_tmp.insert_download(@chunk@::SMALLINT);”,
parameter_function=
etl_tools.utils.chunk_parameter_function,
parameter_placeholders=["@chunk@"],
commands_before=[
ExecuteSQL(sql_file_name="transform_download.sql")
]),
upstreams=["preprocess_project_version",
"transform_installer"])
!13
Incremental & parallel processing
@martin_loetzsch
You can’t join all clicks with all customers at once
Consider Mara for your next ETL project
@martin_loetzsch
!14
!15
Refer us a data person, earn 1000€
@martin_loetzsch
Also sales people, developers, product managers, etc.
Thank you
@martin_loetzsch
!16

More Related Content

What's hot

Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentialsqureshihamid
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflakeSivakumar Ramar
 
data platform on kubernetes
data platform on kubernetesdata platform on kubernetes
data platform on kubernetes창언 정
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
MySQL Cluster - visão geral
MySQL Cluster - visão geralMySQL Cluster - visão geral
MySQL Cluster - visão geralMySQL Brasil
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know SnowflakeKnoldus Inc.
 
Introduccion a Cassandra
Introduccion a CassandraIntroduccion a Cassandra
Introduccion a CassandraStratebi
 
Delta lake - des data lake fiables a grande échelle
Delta lake - des data lake fiables a grande échelleDelta lake - des data lake fiables a grande échelle
Delta lake - des data lake fiables a grande échellefrançois de Buttet
 
Hints for a successful hfs to zfs migration
Hints for a successful hfs to zfs migrationHints for a successful hfs to zfs migration
Hints for a successful hfs to zfs migrationsatish090909
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at ScaleMongoDB
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
zenoh: The Edge Data Fabric
zenoh: The Edge Data Fabriczenoh: The Edge Data Fabric
zenoh: The Edge Data FabricAngelo Corsaro
 
SAP integration sample payloads for Azure Logic Apps
SAP integration sample payloads for Azure Logic AppsSAP integration sample payloads for Azure Logic Apps
SAP integration sample payloads for Azure Logic AppsDavid Burg
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databasesArangoDB Database
 
Domain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data MeshDomain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data Meshconfluent
 
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformScyllaDB
 

What's hot (20)

Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentials
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflake
 
data platform on kubernetes
data platform on kubernetesdata platform on kubernetes
data platform on kubernetes
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
MySQL Cluster - visão geral
MySQL Cluster - visão geralMySQL Cluster - visão geral
MySQL Cluster - visão geral
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
 
Introduccion a Cassandra
Introduccion a CassandraIntroduccion a Cassandra
Introduccion a Cassandra
 
Delta lake - des data lake fiables a grande échelle
Delta lake - des data lake fiables a grande échelleDelta lake - des data lake fiables a grande échelle
Delta lake - des data lake fiables a grande échelle
 
Hints for a successful hfs to zfs migration
Hints for a successful hfs to zfs migrationHints for a successful hfs to zfs migration
Hints for a successful hfs to zfs migration
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
zenoh: The Edge Data Fabric
zenoh: The Edge Data Fabriczenoh: The Edge Data Fabric
zenoh: The Edge Data Fabric
 
SAP integration sample payloads for Azure Logic Apps
SAP integration sample payloads for Azure Logic AppsSAP integration sample payloads for Azure Logic Apps
SAP integration sample payloads for Azure Logic Apps
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
 
Domain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data MeshDomain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data Mesh
 
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
 

Similar to Lightweight ETL pipelines with mara (PyData Berlin September Meetup)

Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with PythonMartin Loetzsch
 
Data infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companiesData infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companiesMartin Loetzsch
 
Informatica
InformaticaInformatica
Informaticamukharji
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalCaserta
 
Database Management System
Database Management SystemDatabase Management System
Database Management SystemAbishek V S
 
Informatica Interview Questions & Answers
Informatica Interview Questions & AnswersInformatica Interview Questions & Answers
Informatica Interview Questions & AnswersZaranTech LLC
 
Experience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data PlatformExperience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data PlatformBob Ward
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
PASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep DivePASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep DiveTravis Wright
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & HadoopAhmed Gamil
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)Amazon Web Services
 
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minIRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minDigendra Vir Singh (DV)
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
Data ware house architecture
Data ware house architectureData ware house architecture
Data ware house architectureDeepak Chaurasia
 
Stored-Procedures-Presentation
Stored-Procedures-PresentationStored-Procedures-Presentation
Stored-Procedures-PresentationChuck Walker
 
Technological insights behind Clusterpoint database
Technological insights behind Clusterpoint databaseTechnological insights behind Clusterpoint database
Technological insights behind Clusterpoint databaseClusterpoint
 
Air Line Management System | DBMS project
Air Line Management System | DBMS projectAir Line Management System | DBMS project
Air Line Management System | DBMS projectAniketHandore
 
2006 DDD4: Data access layers - Convenience vs. Control and Performance?
2006 DDD4: Data access layers - Convenience vs. Control and Performance?2006 DDD4: Data access layers - Convenience vs. Control and Performance?
2006 DDD4: Data access layers - Convenience vs. Control and Performance?Daniel Fisher
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 

Similar to Lightweight ETL pipelines with mara (PyData Berlin September Meetup) (20)

Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with Python
 
Data infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companiesData infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companies
 
Informatica
InformaticaInformatica
Informatica
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobal
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
 
Informatica Interview Questions & Answers
Informatica Interview Questions & AnswersInformatica Interview Questions & Answers
Informatica Interview Questions & Answers
 
Experience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data PlatformExperience SQL Server 2017: The Modern Data Platform
Experience SQL Server 2017: The Modern Data Platform
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
PASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep DivePASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep Dive
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
 
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minIRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Data ware house architecture
Data ware house architectureData ware house architecture
Data ware house architecture
 
Stored-Procedures-Presentation
Stored-Procedures-PresentationStored-Procedures-Presentation
Stored-Procedures-Presentation
 
Technological insights behind Clusterpoint database
Technological insights behind Clusterpoint databaseTechnological insights behind Clusterpoint database
Technological insights behind Clusterpoint database
 
Air Line Management System | DBMS project
Air Line Management System | DBMS projectAir Line Management System | DBMS project
Air Line Management System | DBMS project
 
2006 DDD4: Data access layers - Convenience vs. Control and Performance?
2006 DDD4: Data access layers - Convenience vs. Control and Performance?2006 DDD4: Data access layers - Convenience vs. Control and Performance?
2006 DDD4: Data access layers - Convenience vs. Control and Performance?
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 

Recently uploaded

Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsSachinPawar510423
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documents
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 

Lightweight ETL pipelines with mara (PyData Berlin September Meetup)

  • 1. @martin_loetzsch Dr. Martin Loetzsch PyData Berlin September meetup https://github.com/mara Lightweight ETL pipelines with Mara
  • 2. All the data of the company in one place 
 Data is the single source of truth easy to access documented embedded into the organisation Integration of different domains
 
 
 
 
 
 
 
 
 
 
 Main challenges Consistency & correctness Changeability Complexity Transparency !2 Data warehouse = integrated data @martin_loetzsch Nowadays required for running a business application databases events csv files apis reporting crm marketing … search pricing DWH orders users products price 
 histories emails clicks … … operation
 events
  • 5. Avoid click-tools hard to debug hard to change hard to scale with team size/ data complexity / data volume 
 Data pipelines as code SQL files, python & shell scripts Structure & content of data warehouse are result of running code 
 Easy to debug & inspect Develop locally, test on staging system, then deploy to production !5 Make changing and testing things easy @martin_loetzsch Apply standard software engineering best practices Megabytes Plain scripts Petabytes Apache Airflow In between Mara
  • 6. !6 Mara: Things that worked for us @martin_loetzsch Open source (MIT license)
  • 7. m_dim marketing !7 Fokus: complex data @martin_loetzsch Many sources, complex business logic
  • 9. Runnable app Integrates PyPI project download stats with 
 Github repo events !9 Example: Python project stats data warehouse @martin_loetzsch https://github.com/mara/mara-example-project
  • 10. Example pipeline pipeline = Pipeline( id="pypi", description="Builds a PyPI downloads cube using the public ..”) # .. pipeline.add( Task(id=“transform_python_version", description=‘..’, commands=[ ExecuteSQL(sql_file_name="transform_python_version.sql") ]), upstreams=['read_download_counts']) pipeline.add( ParallelExecuteSQL( id=“transform_download_counts", description=“..”, sql_statement=“SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);", parameter_function=etl_tools.utils.chunk_parameter_function, parameter_placeholders=["@chunk@"], commands_before=[ ExecuteSQL(sql_file_name="transform_download_counts.sql") ]), upstreams=["preprocess_project_version", "transform_installer", "transform_python_version"]) !10 ETL pipelines as code @martin_loetzsch Pipeline = list of tasks with dependencies between them. Task = list of commands
  • 11. Target of computation 
 CREATE TABLE m_dim_next.region (
 region_id SMALLINT PRIMARY KEY,
 region_name TEXT NOT NULL UNIQUE,
 country_id SMALLINT NOT NULL,
 country_name TEXT NOT NULL,
 _region_name TEXT NOT NULL
 );
 
 Do computation and store result in table 
 WITH raw_region AS (SELECT DISTINCT country,
 region
 FROM m_data.ga_session
 ORDER BY country, region)
 
 INSERT INTO m_dim_next.region SELECT row_number() OVER (ORDER BY country, region ) AS region_id,
 CASE WHEN (SELECT count(DISTINCT country) FROM raw_region r2 WHERE r2.region = r1.region) > 1 THEN region || ' / ' || country ELSE region END AS region_name, dense_rank() OVER (ORDER BY country) AS country_id, country AS country_name, region AS _region_name FROM raw_region r1;
 INSERT INTO m_dim_next.region VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
 Speedup subsequent transformations 
 SELECT util.add_index( 'm_dim_next', 'region', column_names := ARRAY ['_region_name', ‘country_name', 'region_id']);
 
 SELECT util.add_index( 'm_dim_next', 'region', column_names := ARRAY ['country_id', 'region_id']);
 
 ANALYZE m_dim_next.region; !11 PostgreSQL as a data processing engine @martin_loetzsch Leave data in DB, Tables as (intermediate) results of processing steps
  • 12. Execute query 
 ExecuteSQL(sql_file_name=“preprocess-ad.sql") cat app/data_integration/pipelines/facebook/preprocess-ad.sql | PGTZ=Europe/Berlin PGOPTIONS=—client-min-messages=warning psql --username=mloetzsch --host=localhost --echo-all -—no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
 Read file 
 ReadFile(file_name=“country_iso_code.csv", compression=Compression.NONE, target_table="os_data.country_iso_code", mapper_script_file_name=“read-country-iso-codes.py", delimiter_char=“;") cat "dwh-data/country_iso_code.csv" | .venv/bin/python3.6 "app/data_integration/pipelines/load_data/ read-country-iso-codes.py" | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl --command="COPY os_data.country_iso_code FROM STDIN WITH CSV DELIMITER AS ';'"
 Copy from other databases 
 Copy(sql_file_name="pdm/load-product.sql", source_db_alias=“pdm", target_table=“os_data.product", replace={"@@db@@": "K24Pdm", "@@dbschema@@": “ps", "@@client@@": "kfzteile24 GmbH"})
 cat app/data_integration/pipelines/load_data/pdm/load-product.sql | sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/ kfzteile24 GmbH/g" | sed 's/$/$/g;s/$/$/g' | (cat && echo ';') | (cat && echo '; go') | sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl --command="COPY os_data.product FROM STDIN WITH CSV HEADER" !12 Shell commands as interface to data & DBs @martin_loetzsch Nothing is faster than a unix pipe
  • 13. Read a set of files 
 pipeline.add( ParallelReadFile( id="read_download", description="Loads PyPI downloads from pre_downloaded csv files", file_pattern="*/*/*/pypi/downloads-v1.csv.gz", read_mode=ReadMode.ONLY_NEW, compression=Compression.GZIP, target_table="pypi_data.download", delimiter_char="t", skip_header=True, csv_format=True, file_dependencies=read_download_file_dependencies, date_regex="^(?P<year>d{4})/(?P<month>d{2})/(? P<day>d{2})/", partition_target_table_by_day_id=True, timezone="UTC", commands_before=[ ExecuteSQL( sql_file_name="create_download_data_table.sql", file_dependencies=read_download_file_dependencies) ])) Split large joins into chunks pipeline.add( ParallelExecuteSQL( id="transform_download", description="Maps downloads to their dimensions", sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);”, parameter_function= etl_tools.utils.chunk_parameter_function, parameter_placeholders=["@chunk@"], commands_before=[ ExecuteSQL(sql_file_name="transform_download.sql") ]), upstreams=["preprocess_project_version", "transform_installer"]) !13 Incremental & parallel processing @martin_loetzsch You can’t join all clicks with all customers at once
  • 14. Consider Mara for your next ETL project @martin_loetzsch !14
  • 15. !15 Refer us a data person, earn 1000€ @martin_loetzsch Also sales people, developers, product managers, etc.