Python for Business Intelligence


Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
python business intelligence




Results

Q/A and articles with Java solution references (not listed here)
Why?
Overview

■ Traditional Data Warehouse
■ Python and Data
■ Is Python Capable?
■ Conclusion
Business Intelligence

people, technology, processes
Data Sources | Extraction, Transformation, Loading | Analysis and Presentation
Data Governance
Technologies and Utilities
Traditional Data Warehouse

■ Extracting data from the original sources
■ Quality assuring and cleaning data
■ Conforming the labels and measures in the data to achieve consistency across the original sources
■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards.

Source: Ralph Kimball – The Data Warehouse ETL Toolkit
[Diagram: data flow through the warehouse layers]
Source Systems: structured documents, databases, APIs
Staging Area: Temporary Staging Area; staging (L0)
Operational Data Store: relational (L1)
Datamarts: dimensional (L2)
real time = daily
Multi-dimensional Modeling

■ aggregation browsing
■ slicing and dicing
■ business / analyst’s point of view, regardless of physical schema implementation
Facts

[Diagram labels: measurable; fact; fact data cell; most detailed information]

[Diagram: cube with the dimensions location, type, time]
Dimension

■ provide context for facts
■ used to filter queries or reports
■ control scope of aggregation of facts
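
The three roles of a dimension listed above can be sketched in a few lines of plain Python. This is only an illustration: the fact records, the `amount` measure and the `aggregate` helper are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical fact table: each fact is the most detailed record,
# with one measure ("amount") and three dimension keys.
FACTS = [
    {"location": "NYC", "type": "goods", "year": 2011, "amount": 100.0},
    {"location": "NYC", "type": "services", "year": 2012, "amount": 50.0},
    {"location": "Boston", "type": "goods", "year": 2012, "amount": 70.0},
    {"location": "Boston", "type": "goods", "year": 2012, "amount": 30.0},
]


def aggregate(facts, measure, by=None, where=None):
    """Sum `measure`, optionally filtered by `where` and grouped by `by`."""
    where = where or {}
    totals = defaultdict(float)
    for fact in facts:
        # A dimension used as a filter narrows the scope of aggregation.
        if any(fact[dim] != value for dim, value in where.items()):
            continue
        # A dimension used for grouping provides context for the sums.
        key = fact[by] if by is not None else "total"
        totals[key] += fact[measure]
    return dict(totals)
```

For example, `aggregate(FACTS, "amount", by="type", where={"year": 2012})` sums the amounts per type for 2012 only.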
Pentaho
Python and Data
community perception*

*as of Oct 2012
Scientific & Financial
Python
Scientific Data

       T1[s]     T2[s]     T3[s]     T4[s]
P1    112,68    941,67    171,01    660,48
P2     96,15    306,51    725,88    877,82
P3    313,39    189,31     41,81    428,68
P4    760,62    983,48    371,21    281,19
P5    838,56     39,27    389,42    231,12

n-dimensional array of numbers
Assumptions

■ data is mostly numbers
■ data is neatly organized...
■ … in one multi-dimensional array
Business Data

■ multiple snapshots of one source
■ multiple representations of same data
■ categories are changing
❄
Is Python Capable?
     very basic examples
Data Pipes with SQLAlchemy

■ connection: create_engine
■ schema reflection: MetaData, Table
■ expressions: select(), insert()
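# Written against the SQLAlchemy API of 2012 (bound MetaData, autoload=True)
from sqlalchemy import create_engine, MetaData, Table

# Source: reflect the existing table definition from the database
src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)

# Target: an empty table definition, filled in when the schema is cloned
target_engine = create_engine("postgresql://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)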
clone schema:

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()




copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)
magic used:

metadata reflection
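
The clone-and-copy steps above rely on SQLAlchemy's reflection. As a rough, dependency-free sketch of the same idea, the example below uses the standard-library `sqlite3` module in place of SQLAlchemy and reads the column definitions with `PRAGMA table_info`. The `clone_table` helper is hypothetical, and it handles only column names and declared types (no keys, defaults or constraints).

```python
import sqlite3


def clone_table(src, target, table):
    """Copy the definition and rows of `table` from `src` to `target`."""
    # "Reflection": read column names and declared types from the source.
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
    # `table` is interpolated into SQL, so it must be a trusted identifier.
    columns = [(row[1], row[2]) for row in
               src.execute("PRAGMA table_info({})".format(table))]
    if not columns:
        raise ValueError("no such table: {}".format(table))

    # Clone the schema in the target database.
    column_defs = ", ".join('"{}" {}'.format(name, ctype)
                            for name, ctype in columns)
    target.execute('CREATE TABLE "{}" ({})'.format(table, column_defs))

    # Copy the data.
    placeholders = ", ".join("?" for _ in columns)
    insert = 'INSERT INTO "{}" VALUES ({})'.format(table, placeholders)
    rows = src.execute('SELECT * FROM "{}"'.format(table))
    target.executemany(insert, rows)
    target.commit()
    return len(columns)
```

Called with two `sqlite3` connections, `clone_table(src, target, "data")` returns the number of columns it cloned.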
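text file (CSV) to table:

import csv
from sqlalchemy import Column, String

reader = csv.reader(file_stream)

# The first row holds the column names
columns = next(reader)

for column in columns:
    table.append_column(Column(column, String))

table.create()

insert = table.insert()

for row in reader:
    insert.execute(row)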
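
The same CSV-to-table idea can be tried without a database server. The sketch below swaps SQLAlchemy for the standard-library `csv` and `sqlite3` modules; the `load_csv` helper, the `contracts` table name and the sample data are made up for illustration.

```python
import csv
import io
import sqlite3


def load_csv(connection, table, file_stream):
    """Create `table` from the CSV header and insert the remaining rows."""
    reader = csv.reader(file_stream)
    columns = next(reader)  # first row: column names

    # Every column is stored as text, like the String columns on the slide.
    # Names are interpolated into SQL, so the file must be trusted.
    column_defs = ", ".join('"{}" TEXT'.format(name) for name in columns)
    connection.execute('CREATE TABLE "{}" ({})'.format(table, column_defs))

    placeholders = ", ".join("?" for _ in columns)
    insert = 'INSERT INTO "{}" VALUES ({})'.format(table, placeholders)
    rows = list(reader)
    connection.executemany(insert, rows)
    connection.commit()
    return len(rows)


# Usage with an in-memory database and a made-up two-row file.
connection = sqlite3.connect(":memory:")
sample = io.StringIO("region,amount\nEU,10\nUS,20\n")
loaded = load_csv(connection, "contracts", sample)
```

Because all columns are text, numeric fields still need a transformation step before they can be aggregated.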
Simple T from ETL

# (target field, source transformation)
transformation = [
    ('fiscal_year',        {"function": int,
                            "field": "fiscal_year"}),
    ('region_code',        {"mapping": region_map,
                            "field": "region"}),
    ('borrower_country',   None),
    ('project_name',       None),
    ('procurement_type',   None),
    ('major_sector_code',  {"mapping": sector_code_map,
                            "field": "major_sector"}),
    ('major_sector',       None),
    ('supplier',           None),
    ('contract_amount',    {"function": currency_to_number,
                            "field": "total_contract_amount"}),
]
Transformation

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()
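
The slides do not show `transform` itself. One possible reading is sketched below, assuming that each rule is either `None` (copy the source field with the same name) or a dictionary with the keys `field`, `function` and `mapping`. The helper `currency_to_number`, the contents of `region_map` and the sample row are hypothetical.

```python
def transform(row, transformation):
    """Build a target record from a source `row` (a dict)."""
    result = {}
    for target_field, rule in transformation:
        if rule is None:
            # No rule: copy the source field with the same name.
            result[target_field] = row[target_field]
            continue
        value = row[rule.get("field", target_field)]
        if "function" in rule:
            value = rule["function"](value)
        elif "mapping" in rule:
            value = rule["mapping"][value]
        result[target_field] = value
    return result


# Hypothetical helpers and sample data, for illustration only.
def currency_to_number(text):
    return float(text.replace("$", "").replace(",", ""))


region_map = {"AFRICA": "AFR", "EUROPE": "EUR"}

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]

row = {"fiscal_year": "2012", "region": "EUROPE", "supplier": "ACME",
       "total_contract_amount": "$1,250.50"}
record = transform(row, transformation)
```

Keeping the rules in a plain list means the transformation is itself data: it can be stored, reviewed and reused as metadata.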
OLAP with Cubes

Model
{
    "name": "My Model",
    "description": ....,
    "cubes": [...],
    "dimensions": [...]
}

cubes: measures
dimensions: levels, attributes, hierarchy
[Diagram labels: logical, physical ❄]
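# 1. load the logical model
model = load_model("model.json")

# 2. create a workspace on top of the SQL backend
workspace = create_workspace("sql",
                             model,
                             url="sqlite:///data.sqlite")

# 3. pick a cube
cube = model.cube("sales")

# 4. get an aggregation browser for the cube
browser = workspace.browser(cube)

[Diagram: Application ∑ on top of cubes, Aggregation Browser, backend]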
drill-down:

result = browser.aggregate(cell,
                           drilldown=["sector"])

for row in result.table_rows("sector"):
    row.record["amount_sum"]   # aggregated measure
    row.label                  # label of the sector
    row.key                    # key of the sector
whole cube (Total):

cell = Cell(cube)
browser.aggregate(cell)

drill down by date (2006 2007 2008 2009 2010):

browser.aggregate(cell,
                  drilldown=["date"])

slice to 2010, then drill down by date (Jan Feb Mar Apr May ...):

cut = PointCut("date", [2010])
cell = cell.slice(cut)

browser.aggregate(cell,
                  drilldown=["date"])
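
What the cut and the drill-down do to the numbers can be imitated in plain Python. The sketch below is not the Cubes API: `DATED_FACTS` and `aggregate_dates` are hypothetical, and the date dimension is reduced to a two-level year and month path.

```python
# Hypothetical facts; "date" is a (year, month) path in a year > month hierarchy.
DATED_FACTS = [
    {"date": (2009, 12), "amount": 5.0},
    {"date": (2010, 1), "amount": 10.0},
    {"date": (2010, 1), "amount": 2.5},
    {"date": (2010, 2), "amount": 20.0},
]


def aggregate_dates(facts, cut=(), drilldown=False):
    """Sum "amount" within the cell selected by `cut` (a date path prefix).

    With `drilldown`, group by the level just below the cut:
    years for the whole cube, months for a cut at one year.
    """
    cut = tuple(cut)
    depth = len(cut)
    if drilldown and depth >= 2:
        raise ValueError("no level below month to drill down to")
    result = {}
    for fact in facts:
        path = fact["date"]
        if path[:depth] != cut:
            continue  # this fact lies outside the cell
        key = path[depth] if drilldown else "total"
        result[key] = result.get(key, 0.0) + fact["amount"]
    return result
```

With no cut the function returns the grand total; with `drilldown=True` it returns one sum per year; with `cut=(2010,)` and `drilldown=True` it returns one sum per month of 2010.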
How can Python be Useful
just the Language
 ■ saves maintenance resources
 ■ shortens development time
 ■ saves you from going insane
[Diagram: the warehouse layers again (Source Systems, Staging Area, Operational Data Store, Datamarts), annotated: faster]
[Diagram: the BI overview again (Data Sources; Extraction, Transformation, Loading; Analysis and Presentation; Data Governance; Technologies and Utilities), annotated: faster, advanced, understandable, maintainable]
Conclusion
BI is about… people, technology, processes

don’t forget metadata
Future

who is going to fix your COBOL Java tool if you have only Python guys around?

Python is capable, let’s start
Thank You

Twitter: @Stiivi
DataBrewery blog: blog.databrewery.org
Github: github.com/Stiivi
