SlideShare a Scribd company logo
1 of 34
Download to read offline
Luigi
Batch data processing in
Python

Elias Freider
freider@spotify.com
Background
Why did we build Luigi?


We have a lot of data
 Billions of log messages (several TBs) every day
 Usage and backend stats, debug information
 What we want to do
   AB-testing
   Music recommendations
   Monthly/daily/hourly reporting
   Business metric dashboards
 We experiment a lot – need quick development cycles
How to do it?

Hadoop
  Data storage
  Clean, filter, join and aggregate data
Postgres
  Semi-aggregated data for dashboards
Cassandra
  Time series data
Defining a data flow
A bunch of data processing tasks with inter-dependencies
Example: Artist Toplist
Input is a list of Timestamp, Track, Artist tuples




                                 Artist
      Streams
                              Aggregation




                                  Top 10             Database
Example: Artist Toplist
Naive (non-luigi) approach
Errors will occur
If a failed step leaves behind broken data, we want to clean that up
Avoid duplicate work
The second step fails, you fix it, then you want to resume
Parametrization
To use data flows as command line tools
Repeat!
You want to run the dataflow for a set of similar inputs
Introducing Luigi
A Python framework for data flow definition and execution




Main features

  Dead-simple dependency definitions (think GNU make)
  Emphasis on Hadoop/HDFS integration
  Atomic file operations
  Powerful task templating using OOP
  Data flow visualization
  Command line integration
Luigi - Aggregate Artists
Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running   AggregateArtists()
INFO: [pid 74375] Done      AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Task templates and targets
Reduces code-repetition by utilizing Luigi’s object oriented data model




Luigi comes with pre-implemented Tasks for...

  Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
  Running Hive and (soon) Pig queries
  Inserting data sets into Postgres


Writing new ones are as easy as defining an interface and
implementing run()
Hadoop MapReduce
Built-in Hadoop Streaming Python framework




Features

 Very slim interface
 Fetches error logs from Hadoop cluster and displays them to the user
 Class instance variables can be referenced in MapReduce code, which makes it
 easy to supply extra data in dictionaries etc. for map side joins
 Easy to send along Python modules that might not be installed on the cluster
 Basic support for secondary sort
 Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
Hadoop MapReduce
Built-in Hadoop Streaming Python framework
Attach Python modules
So you can use local modules remotely on the cluster
Local state is transferred
Set up local variables for map-side joins etc.
Completing the top list
Top 10 artists - Wrapped arbitrary Python code
Database support
Basic functionality for exporting to Postgres. Cassandra support is in the works
Running it all...
   DEBUG: Checking if ArtistToplistToDatabase() is complete
   INFO: Scheduled ArtistToplistToDatabase()
   DEBUG: Checking if Top10Artists() is complete
   INFO: Scheduled Top10Artists()
   DEBUG: Checking if AggregateArtists() is complete
   INFO: Scheduled AggregateArtists()
   DEBUG: Checking if Streams() is complete
   INFO: Done scheduling tasks
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 3
   INFO: [pid 74811] Running   AggregateArtists()
   INFO: [pid 74811] Done      AggregateArtists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 2
   INFO: [pid 74811] Running   Top10Artists()
   INFO: [pid 74811] Done      Top10Artists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 1
   INFO: [pid 74811] Running   ArtistToplistToDatabase()
   INFO: Done writing, importing at 2013-03-13 15:41:09.407138
   INFO: [pid 74811] Done      ArtistToplistToDatabase()
   DEBUG: Asking scheduler for work...
   INFO: Done
   INFO: There are no more tasks to run at this time
Data flow visualization
Light-weight scheduler daemon with a web interface
The results
Imagine how cool this would be with real data...
Task Parameters
Class variables with some magic




Tasks have implicit __init__




   Generates command line interface with typing and documentation
   $ python dataflow.py AggregateArtists --date 2013-03-05
Task Parameters
Combined usage example
Task Parameters
Makes it really easy to create aggregates over time
Multiple workers
 Basic multi-processing



$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
Large data flows
            (Screenshot from web interface)




                                                                                                                                                                                                                                                                                                                  EndSongCleaned                                                                       EndSongCleaned
                                                                                                                                                                                                                                                                                                                  (date=2013-03-16)                                                                   (date=2013-03-15)




                                                                                                                                                                                                AggregateTracks                                                                                                             AggregateTracks                                                 UserLocationDay                                  UserLocationDay              UserLocationDay        UserLocationDay         UserLocationDay           UserLocationDay               UserLocationDay      UserLocationDay     UserLocationDay     UserLocationDay     UserLocationDay
                                                                                                                                                                                                   (test=False,                MasterMetadata                                                                                    (test=False,                                                    (test=False,                                   (test=False,                (test=False,           (test=False,            (test=False,               (test=False,                    (test=False,      (test=False,        (test=False,        (test=False,        (test=False,
                                                                                                                                                                                                date=2013-03-16,              (date=2013-03-15)                                                                            date=2013-03-15,                                                 date=2013-03-16,                                 date=2013-03-14,            date=2013-03-13,        date=2013-03-08,        date=2013-03-12,          date=2013-03-07,              date=2013-03-11,     date=2013-03-10,    date=2013-03-09,    date=2013-03-15,    date=2013-03-06,
                                                                                                                                                                                                test_users=False)                                                                                                          test_users=False)                                                test_users=False)                                test_users=False)           test_users=False)       test_users=False)       test_users=False)         test_users=False)             test_users=False)    test_users=False)   test_users=False)   test_users=False)   test_users=False)




                                                                                                                                                                                              JoinArtistGids                                 JoinArtistGids                                                                                                                                                                                                                                                                                       UserLocationPeriod             UserLocationPeriod
                                                                                                                                                                                               (test=False,                                     (test=False,                                                                                                                                                                                                                                                                                         (test=False,                     (test=False,
                                                                                                                                                                                            date=2013-03-16,                               date=2013-03-15,                                                                                                                                                                                                                                                                                           period=10,                        period=10,
                                                                                                                                                                                               nshards=12,                                   nshards=12,                                                                                                                                                                                                                                                                                          date=2013-03-17,               date=2013-03-16,
                                                                                                                                                                                            test_users=False)                              test_users=False)                                                                                                                                                                                                                                                                                      test_users=False)              test_users=False)




  AggregateUserMatrices       AggregateUserMatrices       AggregateUserMatrices                    AggregateUserMatrices               AggregateUserMatrices      AggregateUserMatrices
                                                                                                                                                                                            AggregateByArtists      AggregateByArtists                                        AggregateByArtists               AggregateByArtists               AggregateByArtists             AggregateByArtists                AggregateByArtists            AggregateByArtists           AggregateByArtists      AggregateByArtists       AggregateByArtists
       (test=False,                (test=False,                (test=False,                              (test=False,                        (test=False,              (test=False,
                                                                                                                                                                                               (test=False,            (test=False,                ArtistFollows                 (test=False,                     (test=False,                      (test=False,                  (test=False,                        (test=False,                (test=False,                  (test=False,            (test=False,            (test=False,
    date=2013-03-16,            date=2013-03-14,            date=2013-03-13,                          date=2013-03-15,                   date=2013-03-12,           date=2013-03-11,
                                                                                                                                                                                            date=2013-03-16,        date=2013-03-13,            (date=2013-03-14)             date=2013-03-14,                 date=2013-03-09,                 date=2013-03-08,               date=2013-03-15,                  date=2013-03-12,              date=2013-03-11,              date=2013-03-07,        date=2013-03-10,         date=2013-03-06,
index_version=1362433077,   index_version=1362433077,   index_version=1362433077,               index_version=1362433077,           index_version=1362433077,   index_version=1362433077,
                                                                                                                                                                                            test_users=False)       test_users=False)                                         test_users=False)                test_users=False)                test_users=False)              test_users=False)                 test_users=False)             test_users=False)             test_users=False)       test_users=False)       test_users=False)
    test_users=False)           test_users=False)           test_users=False)                         test_users=False)                  test_users=False)          test_users=False)




                                                                              AccumulateUserMatrices                         AccumulateUserMatrices                                                                                                     AccumulateByArtists                                                                                                                                                                                         AccumulateByArtists
                                                                                    (test=False,                                   (test=False,                                                                                                                (test=False,                                                                                                                                                                                             (test=False,
                                                                                                                                                                                                                                                                                            JoinArtistGids                                                                JoinArtistGids                                              JoinArtistGids                                                                                                                 JoinArtistGids
                                                                                 date=2013-03-16,                               date=2013-03-15,                                                                                                         date=2013-03-16,                                                          PlaylistVectors                                                                                                                   date=2013-03-15,
                                                                                                                                                                                                                                                                                                (test=False,                                                               (test=False,                                                (test=False,                                                                                                                   (test=False,
                                                                          index_version=1362433077,                         index_version=1362433077,                                                                                                    test_users=False,                                                           (test=False,                                                    RelatedArtistsTC                                                test_users=False,
                                                                                                                                                                                                                                                                                          date=2013-03-14,                                                              date=2013-03-13,                                             date=2013-03-12,                                                                                                               date=2013-03-11,
                                                                                 test_users=False,                              test_users=False,                                                                                                          max_days=90,                                                           date=2013-03-14,                                                              ()                                                    max_days=90,
                                                                                                                                                                                                                                                                                            nshards=12,                                                                   nshards=12,                                                  nshards=12,                                                                                                                    nshards=12,
                                                                                 decay_factor=0.99,                             decay_factor=0.99,                                                                                                      decay_factor=0.99,                                                index_version=1362433077)                                                                                                                 decay_factor=0.99,
                                                                                                                                                                                                                                                                                         test_users=False)                                                              test_users=False)                                            test_users=False)                                                                                                              test_users=False)
                                                                                      days=5,                                        days=5,                                                                                                                    days=10,                                                                                                                                                                                                 days=10,
                                                                              build_from_scratch=True)                       build_from_scratch=True)                                                                                                build_from_scratch=True)                                                                                                                                                                                    build_from_scratch=True)




                                                                                                                                                                                                                                                                                                                            UserRecs                                                                                 UserRecs
                                                                                                                                                                                                                                                                                                                            (test=False,                                                                         (test=False,
                                                                                                                                                                                                                                                                                                                        date=2013-03-17,                                                                   date=2013-03-16,
                                                                                                                                                                                                                                                                                                                           rec_days=5,                                                                           rec_days=5,
                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                          exp_days=10,                                                                          exp_days=10,
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-15,
                                                                                                                                                                                                                                                                                                                        test_users=False,                                                                  test_users=False,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                       force_updates=False,                                                              force_updates=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
                                                                                                                                                                                                                                                                                                                     build_from_scratch=True,                                                          build_from_scratch=True,
                                                                                                                                                                                                                                                                                                                index_path=/spotify/discover/index,                                                index_path=/spotify/discover/index,
                                                                                                                                                                                                                                                                                                                       index_version=None,                                                                index_version=None,
                                                                                                                                                                                                                                                                                                                     FOLLOWS_SCORE=5.0)                                                                 FOLLOWS_SCORE=5.0)




                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-16,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)




                                                                                                                                                                                                                                                                                                                                                     IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                    (date=2013-03-17,
                                                                                                                                                                                                                                                                                                                                                     test_users=False,
                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
Really large...
Visualization needs some work
Error notifications
Great for automated execution
Process Synchronization
Prevents two identical tasks from running simultaneously


                                luigid
                       Simple task synchronization




             Data flow 1                   Data flow 2




                        Common dependency
                             Task
What we use Luigi for
Not just another Hadoop streaming framework!

 Hadoop Streaming
 Java Hadoop MapReduce
 Hive
 Pig
 Local (non-hadoop) data processing
 Import/Export data to/from Postgres
 Insert data into Cassandra
 scp/rsync/ftp data files and reports
 Dump and load databases
Others using it with Scala MapReduce and MRJob as well
Luigi is open source

http://github.com/spotify/luigi
Originated at Spotify
Based on many years of experience with data processing
Recent contributions by Foursquare and Bitly
Open source since September 2012
Thank you!
For more information feel free to reach out at
http://github.com/spotify/luigi

Oh, and we’re hiring – http://spotify.com/jobs



Elias Freider
freider@spotify.com

More Related Content

What's hot

SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」Takuto Wada
 
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路Rick Hwang
 
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Takahiro Inoue
 
PostgreSQLアンチパターン
PostgreSQLアンチパターンPostgreSQLアンチパターン
PostgreSQLアンチパターンSoudai Sone
 
PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介
[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介
[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介Preferred Networks
 
VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)
VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)
VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようShinsuke Sugaya
 
組織にテストを書く文化を根付かせる戦略と戦術
組織にテストを書く文化を根付かせる戦略と戦術組織にテストを書く文化を根付かせる戦略と戦術
組織にテストを書く文化を根付かせる戦略と戦術Takuto Wada
 
C16 45分でわかるPostgreSQLの仕組み by 山田努
C16 45分でわかるPostgreSQLの仕組み by 山田努C16 45分でわかるPostgreSQLの仕組み by 山田努
C16 45分でわかるPostgreSQLの仕組み by 山田努Insight Technology, Inc.
 
MySQL5.6でGTIDを試してそっと閉じた
MySQL5.6でGTIDを試してそっと閉じたMySQL5.6でGTIDを試してそっと閉じた
MySQL5.6でGTIDを試してそっと閉じたEmma Haruka Iwao
 
その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)
その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)
その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)NTT DATA Technology & Innovation
 
Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)
Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)
Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)Hiro H.
 
モバイル向けニューラルネットワーク推論エンジンの紹介
モバイル向けニューラルネットワーク推論エンジンの紹介モバイル向けニューラルネットワーク推論エンジンの紹介
モバイル向けニューラルネットワーク推論エンジンの紹介卓然 郭
 
OpenAI FineTuning を試してみる
OpenAI FineTuning を試してみるOpenAI FineTuning を試してみる
OpenAI FineTuning を試してみるiPride Co., Ltd.
 
PostgreSQL のイケてるテクニック7選
PostgreSQL のイケてるテクニック7選PostgreSQL のイケてるテクニック7選
PostgreSQL のイケてるテクニック7選Tomoya Kawanishi
 
組み込みLinuxでのGolangのススメ
組み込みLinuxでのGolangのススメ組み込みLinuxでのGolangのススメ
組み込みLinuxでのGolangのススメTetsuyuki Kobayashi
 
これから Haskell を書くにあたって
これから Haskell を書くにあたってこれから Haskell を書くにあたって
これから Haskell を書くにあたってTsuyoshi Matsudate
 

What's hot (20)

SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
 
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
 
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
 
PostgreSQLアンチパターン
PostgreSQLアンチパターンPostgreSQLアンチパターン
PostgreSQLアンチパターン
 
PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLの統計情報について(第26回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介
[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介
[GTCJ2018]CuPy -NumPy互換GPUライブラリによるPythonでの高速計算- PFN奥田遼介
 
VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)
VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)
VSCodeで作るPostgreSQL開発環境(第25回 PostgreSQLアンカンファレンス@オンライン 発表資料)
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみよう
 
組織にテストを書く文化を根付かせる戦略と戦術
組織にテストを書く文化を根付かせる戦略と戦術組織にテストを書く文化を根付かせる戦略と戦術
組織にテストを書く文化を根付かせる戦略と戦術
 
Guide To AGPL
Guide To AGPLGuide To AGPL
Guide To AGPL
 
C16 45分でわかるPostgreSQLの仕組み by 山田努
C16 45分でわかるPostgreSQLの仕組み by 山田努C16 45分でわかるPostgreSQLの仕組み by 山田努
C16 45分でわかるPostgreSQLの仕組み by 山田努
 
MySQL5.6でGTIDを試してそっと閉じた
MySQL5.6でGTIDを試してそっと閉じたMySQL5.6でGTIDを試してそっと閉じた
MySQL5.6でGTIDを試してそっと閉じた
 
その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)
その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)
その Pod 突然落ちても大丈夫ですか!?(OCHaCafe5 #5 実験!カオスエンジニアリング 発表資料)
 
Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)
Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)
Linuxにて複数のコマンドを並列実行(同時実行数の制限付き)
 
モバイル向けニューラルネットワーク推論エンジンの紹介
モバイル向けニューラルネットワーク推論エンジンの紹介モバイル向けニューラルネットワーク推論エンジンの紹介
モバイル向けニューラルネットワーク推論エンジンの紹介
 
OpenAI FineTuning を試してみる
OpenAI FineTuning を試してみるOpenAI FineTuning を試してみる
OpenAI FineTuning を試してみる
 
PostgreSQL のイケてるテクニック7選
PostgreSQL のイケてるテクニック7選PostgreSQL のイケてるテクニック7選
PostgreSQL のイケてるテクニック7選
 
組み込みLinuxでのGolangのススメ
組み込みLinuxでのGolangのススメ組み込みLinuxでのGolangのススメ
組み込みLinuxでのGolangのススメ
 
MapReduce入門
MapReduce入門MapReduce入門
MapReduce入門
 
これから Haskell を書くにあたって
これから Haskell を書くにあたってこれから Haskell を書くにあたって
これから Haskell を書くにあたって
 

Recently uploaded

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...itnewsafrica
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 

Recently uploaded (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 

Luigi PyData presentation

  • 1. Luigi Batch data processing in Python Elias Freider freider@spotify.com
  • 2. Background Why did we build Luigi? We have a lot of data Billions of log messages (several TBs) every day Usage and backend stats, debug information What we want to do AB-testing Music recommendations Monthly/daily/hourly reporting Business metric dashboards We experiment a lot – need quick development cycles
  • 3. How to do it? Hadoop Data storage Clean, filter, join and aggregate data Postgres Semi-aggregated data for dashboards Cassandra Time series data
  • 4. Defining a data flow A bunch of data processing tasks with inter-dependencies
  • 5. Example: Artist Toplist Input is a list of Timestamp, Track, Artist tuples Artist Streams Aggregation Top 10 Database
  • 6. Example: Artist Toplist Naive (non-luigi) approach
  • 7. Errors will occur If a failed step leaves behind broken data, we want to clean that up
  • 8. Avoid duplicate work The second step fails, you fix it, then you want to resume
  • 9. Parametrization To use data flows as command line tools
  • 10. Repeat! You want to run the dataflow for a set of similar inputs
  • 11. Introducing Luigi A Python framework for data flow definition and execution Main features Dead-simple dependency definitions (think GNU make) Emphasis on Hadoop/HDFS integration Atomic file operations Powerful task templating using OOP Data flow visualization Command line integration
  • 12. Luigi - Aggregate Artists
  • 13. Luigi - Aggregate Artists Run on the command line: $ python dataflow.py AggregateArtists DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74375] Running AggregateArtists() INFO: [pid 74375] Done AggregateArtists() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 14. Task templates and targets Reduces code-repetition by utilizing Luigi’s object oriented data model Luigi comes with pre-implemented Tasks for... Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files Running Hive and (soon) Pig queries Inserting data sets into Postgres Writing new ones are as easy as defining an interface and implementing run()
  • 15. Hadoop MapReduce Built-in Hadoop Streaming Python framework Features Very slim interface Fetches error logs from Hadoop cluster and displays them to the user Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins Easy to send along Python modules that might not be installed on the cluster Basic support for secondary sort Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
  • 16. Hadoop MapReduce Built-in Hadoop Streaming Python framework
  • 17. Attach Python modules So you can use local modules remotely on the cluster
  • 18. Local state is transferred Set up local variables for map-side joins etc.
  • 19. Completing the top list Top 10 artists - Wrapped arbitrary Python code
  • 20. Database support Basic functionality for exporting to Postgres. Cassandra support is in the works
  • 21. Running it all... DEBUG: Checking if ArtistToplistToDatabase() is complete INFO: Scheduled ArtistToplistToDatabase() DEBUG: Checking if Top10Artists() is complete INFO: Scheduled Top10Artists() DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 3 INFO: [pid 74811] Running AggregateArtists() INFO: [pid 74811] Done AggregateArtists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 74811] Running Top10Artists() INFO: [pid 74811] Done Top10Artists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74811] Running ArtistToplistToDatabase() INFO: Done writing, importing at 2013-03-13 15:41:09.407138 INFO: [pid 74811] Done ArtistToplistToDatabase() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 22. Data flow visualization Light-weight scheduler daemon with a web interface
  • 23. The results Imagine how cool this would be with real data...
  • 24. Task Parameters Class variables with some magic Tasks have implicit __init__ Generates command line interface with typing and documentation $ python dataflow.py AggregateArtists --date 2013-03-05
  • 26. Task Parameters Makes it really easy to create aggregates over time
  • 27. Multiple workers Basic multi-processing $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  • 28. Large data flows (Screenshot from web interface) EndSongCleaned EndSongCleaned (date=2013-03-16) (date=2013-03-15) AggregateTracks AggregateTracks UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay (test=False, MasterMetadata (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, (date=2013-03-15) date=2013-03-15, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-08, date=2013-03-12, date=2013-03-07, date=2013-03-11, date=2013-03-10, date=2013-03-09, date=2013-03-15, date=2013-03-06, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) JoinArtistGids JoinArtistGids UserLocationPeriod UserLocationPeriod (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-15, period=10, period=10, nshards=12, nshards=12, date=2013-03-17, date=2013-03-16, test_users=False) test_users=False) test_users=False) test_users=False) AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, ArtistFollows (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-16, date=2013-03-13, (date=2013-03-14) date=2013-03-14, date=2013-03-09, date=2013-03-08, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-07, date=2013-03-10, date=2013-03-06, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) AccumulateUserMatrices AccumulateUserMatrices AccumulateByArtists AccumulateByArtists (test=False, (test=False, (test=False, (test=False, JoinArtistGids JoinArtistGids JoinArtistGids JoinArtistGids date=2013-03-16, date=2013-03-15, date=2013-03-16, PlaylistVectors date=2013-03-15, (test=False, (test=False, (test=False, (test=False, index_version=1362433077, index_version=1362433077, test_users=False, (test=False, RelatedArtistsTC test_users=False, date=2013-03-14, date=2013-03-13, date=2013-03-12, date=2013-03-11, test_users=False, test_users=False, max_days=90, date=2013-03-14, () max_days=90, nshards=12, nshards=12, nshards=12, nshards=12, decay_factor=0.99, decay_factor=0.99, decay_factor=0.99, index_version=1362433077) decay_factor=0.99, test_users=False) test_users=False) test_users=False) test_users=False) days=5, days=5, days=10, days=10, build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) UserRecs UserRecs (test=False, (test=False, date=2013-03-17, date=2013-03-16, rec_days=5, rec_days=5, IngestUserRecs exp_days=10, exp_days=10, (date=2013-03-15, test_users=False, test_users=False, test_users=False, force_updates=False, force_updates=False, ttl=604800) build_from_scratch=True, build_from_scratch=True, index_path=/spotify/discover/index, index_path=/spotify/discover/index, index_version=None, index_version=None, FOLLOWS_SCORE=5.0) FOLLOWS_SCORE=5.0) IngestUserRecs (date=2013-03-16, test_users=False, ttl=604800) IngestUserRecs (date=2013-03-17, test_users=False, ttl=604800)
  • 30. Error notifications Great for automated execution
  • 31. Process Synchronization Prevents two identical tasks from running simultaneously luigid Simple task synchronization Data flow 1 Data flow 2 Common dependency Task
  • 32. What we use Luigi for Not just another Hadoop streaming framework! Hadoop Streaming Java Hadoop MapReduce Hive Pig Local (non-hadoop) data processing Import/Export data to/from Postgres Insert data into Cassandra scp/rsync/ftp data files and reports Dump and load databases Others using it with Scala MapReduce and MRJob as well
  • 33. Luigi is open source http://github.com/spotify/luigi Originated at Spotify Based on many years of experience with data processing Recent contributions by Foursquare and Bitly Open source since September 2012
  • 34. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi Oh, and we’re hiring – http://spotify.com/jobs Elias Freider freider@spotify.com