SlideShare a Scribd company logo
1 of 21
Download to read offline
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Scott Hajek
@jscotthajek
A Modern Interface for Data
Science on Postgres/Greenplum
Cover w/ Image
Executive Summary
■ Who am I?
■ Pure SQL: not the preferred interface
for sophisticated data scientists
■ Maturity and scale of SQL-based
systems are what enterprise DS
demands
■ Data Frames: A better abstraction that
can be layered on top of SQL systems
○ Ibis: Python implementation of
Data Frames for SQL and large
data platforms
Who is Scott Hajek?
● Data Science consulting for 5+ years
● Senior Data Scientist for Pivotal
● Specialty in Linguistics and Natural
Language Processing
● Many industries:
○ Banking, Telecom, Manufacturing
● Problems tackled
○ Entity resolution, info extraction,
optimization, anomaly detection,
e-comm surveillance
Personas
● DBA
● Application developer
● Business Analyst
● Data Scientist
○ Operates on large data sets interactively
○ Uses advanced statistics and machine
learning techniques
○ Sophisticated programmer
○ Appreciates good abstractions
Different kinds of users of a
database system
COBOL:
IDENTIFICATION DIVISION.
PROGRAM-ID. BINSRCH1.
The binary search
reads every input record
after looking up the employee’s month of hire on a
table,
by a sequential search, it writes it out to an
output file
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
...
Python vs R
Abstraction
● Today popular
languages have
abstractions like
DataFrames
● Not just about fewer
lines of code
● Easier to reason about
Rob Story, bit.ly/python-vs-R-vs-cobol
vs COBOL
which(sapply(dataframe,
function(x) any(month == "January")))
dataframe[dataframe.month == "January"]
Good Abstractions in Math
Matrix notation makes multiple linear regression digestible
● Simple linear regression
● Multiple regression
without matrix
notation
● Multiple regression
with matrix notation
Good Abstractions for Tabular Data
SQL
● Pros
○ Well-defined standards, very familiar in enterprise
○ Declarative language
○ Analytic operations available
○ Abstractions for tables, columns, windows
● Cons
○ Verbose
○ Difficult to compose complex queries and
transformations by hand
○ Difficult to represent subqueries and intermediate
result sets
PL /
Good Abstractions for Tabular Data
Data Frames (df)
● Tabular data structures with named columns
● Different types allowed in different columns
● Easy to select subset of columns
● Analytic operations available:
○ Arithmetic, joins, maps, filters, aggregate & window
functions
● Popular with Data Scientists:
○ Easily hooks into programming languages
○ Flexible for interactive data exploration
○ Represent sub-queries as variables → clear data
flow through pipelines
Good Abstractions in Data Science Packages
Model
Tra mo
from sklearn.ensemble import
RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, y)
10
Abstractions for training
Good Abstractions in Data Science Packages
Pre t al
clf.predict(new_data)
Regression
11
Model
Abstractions for predicting
Challenges with Writing Complex SQL Directly
App developers use frameworks to avoid writing SQL directly
● Tedious, error prone
● Insecure (SQL injection)
● Significant rewrite needed if you switch the flavor of database, especially if you’re
utilizing advanced features of databases.
● Instead they use Object-Relational Mapping (ORMs) to generate SQL
○ Spring or ActiveRecord from Ruby on Rails
Reconsidering SQL-based Platforms
Relational DB Management Systems have a lot to offer
● Stability
● SQL is the most common and familiar language in the enterprise
● Analytical capabilities
● MPP variants offer massive scale in storage and processing
What is Ibis?
“A [Python] pandas-like deferred expression system, with
first-class SQL support”
● Pandas: Python package with DataFrame abstraction, staple for data scientists
● Same code can work on multiple data platforms
● Deferred expression → lazy evaluation
○ Define complex pipeline of transformations, represented as an object
○ Can inspect properties of the end result without evaluating
○ Allows type/error checking client side before sending job to server/cluster
○ Make bad code fail fast!
○ Gives query/execution optimizer the full picture → better plans
● Developed by Wes McKinney, Phillip Cloud, and community
Ibis in Action
Establish a connection object
Create an object that refers to a
table
Table object contains information
about the schema
Ibis in Action
Columns and aggregation
● Column selection looks like pandas
● Methods for defining aggregation and
computation (e.g. sum)
● Computation deferred until final step
when execution is called
Ibis in Action
Joins
● Define join in object-oriented fashion
● Potential columns and types are
known before actually evaluation
Ibis in Action
Making Ibis More Versatile
To cover the full range of DS tasks in Postgres/Greenplum, ibis needs
some further development
● Ability to create and use user-defined functions (UDFs)
● Ability to create a table and save the results to it
● Data science modeling abstractions that use ibis table objects as input
Let’s round out ibis and give
Postgres/Greenplum a modern
data science interface
References
● Ibis project: docs.ibis-project.org
● Greenplum.org
● Pandas.pydata.org
● Scikit-learn.org
● pivotal.io/pivotal-data-science

More Related Content

What's hot

Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019VMware Tanzu
 
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...VMware Tanzu
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...VMware Tanzu
 
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...Databricks
 
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019VMware Tanzu
 
Democratizing Data
Democratizing DataDemocratizing Data
Democratizing DataDatabricks
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNJosh Patterson
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Databricks
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scaleHenry Saputra
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBOkbajda
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingDatabricks
 
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Databricks
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...Databricks
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta CachingReal-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta CachingDatabricks
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
 

What's hot (20)

Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
 
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
 
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
 
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
 
Democratizing Data
Democratizing DataDemocratizing Data
Democratizing Data
 
Greenplum-PXF November 2018
Greenplum-PXF November 2018Greenplum-PXF November 2018
Greenplum-PXF November 2018
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud Streaming
 
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta CachingReal-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 

Similar to A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Pullareddy_tavva_resume.doc
Pullareddy_tavva_resume.docPullareddy_tavva_resume.doc
Pullareddy_tavva_resume.docT Pulla Reddy
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Resume_sukanta_updated
Resume_sukanta_updatedResume_sukanta_updated
Resume_sukanta_updatedSukanta Saha
 
Resume_APRIL_updated
Resume_APRIL_updatedResume_APRIL_updated
Resume_APRIL_updatedSukanta Saha
 
Resume april updated
Resume april updatedResume april updated
Resume april updatedSukanta Saha
 
Sujit lead plsql
Sujit lead plsqlSujit lead plsql
Sujit lead plsqlSujit Jha
 
Enabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringEnabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringJames Densmore
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureFei Chen
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Parfenov Vladimir br
Parfenov Vladimir brParfenov Vladimir br
Parfenov Vladimir brVlad Parfenov
 
Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)Asfand Hassan
 

Similar to A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019 (20)

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Pullareddy_tavva_resume.doc
Pullareddy_tavva_resume.docPullareddy_tavva_resume.doc
Pullareddy_tavva_resume.doc
 
Resume 11 2015
Resume 11 2015Resume 11 2015
Resume 11 2015
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Resume_sukanta_updated
Resume_sukanta_updatedResume_sukanta_updated
Resume_sukanta_updated
 
Resume_APRIL_updated
Resume_APRIL_updatedResume_APRIL_updated
Resume_APRIL_updated
 
Resume april updated
Resume april updatedResume april updated
Resume april updated
 
Resume
ResumeResume
Resume
 
Resume (1)
Resume (1)Resume (1)
Resume (1)
 
SURENDRANATH GANDLA4
SURENDRANATH GANDLA4SURENDRANATH GANDLA4
SURENDRANATH GANDLA4
 
Uday Job Profile
Uday Job ProfileUday Job Profile
Uday Job Profile
 
Sujit lead plsql
Sujit lead plsqlSujit lead plsql
Sujit lead plsql
 
Ravi_Shrivas_CV
Ravi_Shrivas_CVRavi_Shrivas_CV
Ravi_Shrivas_CV
 
Enabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringEnabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data Engineering
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Parfenov Vladimir br
Parfenov Vladimir brParfenov Vladimir br
Parfenov Vladimir br
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
 
Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)
 
My C.V
My C.VMy C.V
My C.V
 

More from VMware Tanzu

What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItVMware Tanzu
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023VMware Tanzu
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleVMware Tanzu
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023VMware Tanzu
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductVMware Tanzu
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready AppsVMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And BeyondVMware Tanzu
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023VMware Tanzu
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptxVMware Tanzu
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchVMware Tanzu
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishVMware Tanzu
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVMware Tanzu
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - FrenchVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023VMware Tanzu
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootVMware Tanzu
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerVMware Tanzu
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeVMware Tanzu
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsVMware Tanzu
 

More from VMware Tanzu (20)

What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
 

Recently uploaded

Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 

Recently uploaded (20)

Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 

A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019

  • 1.
  • 2. © Copyright 2019 Pivotal Software, Inc. All rights Reserved. Scott Hajek @jscotthajek A Modern Interface for Data Science on Postgres/Greenplum
  • 3. Cover w/ Image Executive Summary ■ Who am I? ■ Pure SQL: not the preferred interface for sophisticated data scientists ■ Maturity and scale of SQL-based systems are what enterprise DS demands ■ Data Frames: A better abstraction that can be layered on top of SQL systems ○ Ibis: Python implementation of Data Frames for SQL and large data platforms
  • 4. Who is Scott Hajek? ● Data Science consulting for 5+ years ● Senior Data Scientist for Pivotal ● Specialty in Linguistics and Natural Language Processing ● Many industries: ○ Banking, Telecom, Manufacturing ● Problems tackled ○ Entity resolution, info extraction, optimization, anomaly detection, e-comm surveillance
  • 5. Personas ● DBA ● Application developer ● Business Analyst ● Data Scientist ○ Operates on large data sets interactively ○ Uses advanced statistics and machine learning techniques ○ Sophisticated programmer ○ Appreciates good abstractions Different kinds of users of a database system
  • 6. COBOL: IDENTIFICATION DIVISION. PROGRAM-ID. BINSRCH1. The binary search reads every input record after looking up the employee’s month of hire on a table, by a sequential search, it writes it out to an output file ENVIRONMENT DIVISION. INPUT-OUTPUT SECTION. FILE-CONTROL. ... Python vs R Abstraction ● Today popular languages have abstractions like DataFrames ● Not just about fewer lines of code ● Easier to reason about Rob Story, bit.ly/python-vs-R-vs-cobol vs COBOL which(sapply(dataframe, function(x) any(month == "January"))) dataframe[dataframe.month == "January"]
  • 7. Good Abstractions in Math Matrix notation makes multiple linear regression digestible ● Simple linear regression ● Multiple regression without matrix notation ● Multiple regression with matrix notation
  • 8. Good Abstractions for Tabular Data SQL ● Pros ○ Well-defined standards, very familiar in enterprise ○ Declarative language ○ Analytic operations available ○ Abstractions for tables, columns, windows ● Cons ○ Verbose ○ Difficult to compose complex queries and transformations by hand ○ Difficult to represent subqueries and intermediate result sets PL /
  • 9. Good Abstractions for Tabular Data Data Frames (df) ● Tabular data structures with named columns ● Different types allowed in different columns ● Easy to select subset of columns ● Analytic operations available: ○ Arithmetic, joins, maps, filters, aggregate & window functions ● Popular with Data Scientists: ○ Easily hooks into programming languages ○ Flexible for interactive data exploration ○ Represent sub-queries as variables → clear data flow through pipelines
  • 10. Good Abstractions in Data Science Packages Model Tra mo from sklearn.ensemble import RandomForestClassifier clf = RandomForestClassifier() clf.fit(X, y) 10 Abstractions for training
  • 11. Good Abstractions in Data Science Packages Pre t al clf.predict(new_data) Regression 11 Model Abstractions for predicting
  • 12. Challenges with Writing Complex SQL Directly App developers use frameworks to avoid writing SQL directly ● Tedious, error prone ● Insecure (SQL injection) ● Significant rewrite needed if you switch the flavor of database, especially if you’re utilizing advanced features of databases. ● Instead they use Object-Relational Mapping (ORMs) to generate SQL ○ Spring or ActiveRecord from Ruby on Rails
  • 13. Reconsidering SQL-based Platforms Relational DB Management Systems have a lot to offer ● Stability ● SQL is the most common and familiar language in the enterprise ● Analytical capabilities ● MPP variants offer massive scale in storage and processing
  • 14. What is Ibis? “A [Python] pandas-like deferred expression system, with first-class SQL support” ● Pandas: Python package with DataFrame abstraction, staple for data scientists ● Same code can work on multiple data platforms ● Deferred expression → lazy evaluation ○ Define complex pipeline of transformations, represented as an object ○ Can inspect properties of the end result without evaluating ○ Allows type/error checking client side before sending job to server/cluster ○ Make bad code fail fast! ○ Gives query/execution optimizer the full picture → better plans ● Developed by Wes McKinney, Phillip Cloud, and community
  • 15. Ibis in Action Establish a connection object Create an object that refers to a table Table object contains information about the schema
  • 16. Ibis in Action Columns and aggregation ● Column selection looks like pandas ● Methods for defining aggregation and computation (e.g. sum) ● Computation deferred until final step when execution is called
  • 17. Ibis in Action Joins ● Define join in object-oriented fashion ● Potential columns and types are known before actually evaluation
  • 19. Making Ibis More Versatile To cover the full range of DS tasks in Postgres/Greenplum, ibis needs some further development ● Ability to create and use user-defined functions (UDFs) ● Ability to create a table and save the results to it ● Data science modeling abstractions that use ibis table objects as input
  • 20. Let’s round out ibis and give Postgres/Greenplum a modern data science interface
  • 21. References ● Ibis project: docs.ibis-project.org ● Greenplum.org ● Pandas.pydata.org ● Scikit-learn.org ● pivotal.io/pivotal-data-science