3. Executive Summary
■ Who am I?
■ Pure SQL: not the preferred interface
for sophisticated data scientists
■ Maturity and scale of SQL-based
systems are what enterprise DS
demands
■ Data Frames: A better abstraction that
can be layered on top of SQL systems
○ Ibis: Python implementation of
Data Frames for SQL and large
data platforms
4. Who is Scott Hajek?
● Data Science consulting for 5+ years
● Senior Data Scientist for Pivotal
● Specialty in Linguistics and Natural
Language Processing
● Many industries:
○ Banking, Telecom, Manufacturing
● Problems tackled
○ Entity resolution, info extraction,
optimization, anomaly detection,
e-comm surveillance
5. Personas
Different kinds of users of a database system
● DBA
● Application developer
● Business Analyst
● Data Scientist
○ Operates on large data sets interactively
○ Uses advanced statistics and machine learning techniques
○ Sophisticated programmer
○ Appreciates good abstractions
6. Python vs R vs COBOL
Abstraction
● Today's popular languages have abstractions like DataFrames
● Not just about fewer lines of code
● Easier to reason about
COBOL:
IDENTIFICATION DIVISION.
PROGRAM-ID. BINSRCH1.
* The binary search reads every input record
* after looking up the employee’s month of hire on a table,
* by a sequential search, it writes it out to an output file
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
...
R:
which(sapply(dataframe,
function(x) any(month == "January")))
Python:
dataframe[dataframe.month == "January"]
Rob Story, bit.ly/python-vs-R-vs-cobol
7. Good Abstractions in Math
Matrix notation makes multiple linear regression digestible
● Simple linear regression
● Multiple regression
without matrix
notation
● Multiple regression
with matrix notation
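The equations on this slide did not survive extraction. As a sketch in common textbook notation (the symbols below are assumptions and may differ from those on the original slide), the three forms can be written as:

```latex
% Simple linear regression: one predictor
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

% Multiple regression without matrix notation: p predictors, written term by term
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n

% Multiple regression with matrix notation:
% X is the n x (p+1) design matrix whose first column is all ones
\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}
```

The matrix form says the same thing as the term-by-term form, but the number of predictors no longer appears in the notation.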
8. Good Abstractions for Tabular Data
SQL
● Pros
○ Well-defined standards, very familiar in enterprise
○ Declarative language
○ Analytic operations available
○ Abstractions for tables, columns, windows
● Cons
○ Verbose
○ Difficult to compose complex queries and
transformations by hand
○ Difficult to represent subqueries and intermediate
result sets
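To illustrate the last two cons, here is a small sketch using Python's built-in sqlite3 module as a stand-in database. The `sales` table and its values are made up for illustration. The per-region January totals are an intermediate result that has to be written out twice inside the query text:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 'January', 100.0),
        ('east', 'February', 50.0),
        ('west', 'January', 30.0),
        ('west', 'February', 20.0);
""")

# Regions whose January total is above the average January total.
# The intermediate result (January totals per region) appears twice,
# once as the outer source and once inside the AVG subquery.
query = """
    SELECT region, total
    FROM (
        SELECT region, SUM(amount) AS total
        FROM sales
        WHERE month = 'January'
        GROUP BY region
    ) AS jan
    WHERE total > (
        SELECT AVG(total)
        FROM (
            SELECT region, SUM(amount) AS total
            FROM sales
            WHERE month = 'January'
            GROUP BY region
        ) AS jan_again
    )
    ORDER BY region
"""
rows = con.execute(query).fetchall()
print(rows)
```

A WITH clause can give the intermediate result a name, but that name still lives inside a single query string rather than in the host program.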
9. Good Abstractions for Tabular Data
Data Frames (df)
● Tabular data structures with named columns
● Different types allowed in different columns
● Easy to select subset of columns
● Analytic operations available:
○ Arithmetic, joins, maps, filters, aggregate & window
functions
● Popular with Data Scientists:
○ Easily hooks into programming languages
○ Flexible for interactive data exploration
○ Represent sub-queries as variables → clear data
flow through pipelines
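A minimal sketch of these points using pandas (assumed to be installed). The `sales` data is made up for illustration. Each intermediate result is an ordinary variable that can be inspected on its own:

```python
import pandas as pd

# Named columns, with different types in different columns.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "month": ["January", "February", "January", "February"],
    "amount": [100.0, 50.0, 30.0, 20.0],
})

# Selecting a subset of columns.
subset = sales[["region", "amount"]]

# Each step of the pipeline is a variable, so the data flow is explicit.
january = sales[sales["month"] == "January"]           # filter
jan_totals = january.groupby("region")["amount"].sum()  # aggregate
above_avg = jan_totals[jan_totals > jan_totals.mean()]  # reuse the intermediate

print(above_avg.to_dict())
```

Because `jan_totals` is a variable, it is computed once and reused in the final comparison.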
10. Good Abstractions in Data Science Packages
Abstractions for training
[Diagram: Model]
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, y)
11. Good Abstractions in Data Science Packages
Abstractions for predicting
[Diagram: Regression; Model]
clf.predict(new_data)
12. Challenges with Writing Complex SQL Directly
App developers use frameworks to avoid writing SQL directly
● Tedious, error prone
● Insecure (SQL injection)
● Significant rewrites are needed if you switch database flavors, especially if you
rely on advanced database features.
● Instead, they use Object-Relational Mappers (ORMs) to generate SQL
○ Spring or ActiveRecord from Ruby on Rails
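The SQL injection risk can be sketched with Python's built-in sqlite3 module. The `users` table and the hostile input are made up for illustration. Splicing user input into the SQL text changes the meaning of the query, while a placeholder keeps the input as plain data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (name TEXT, is_admin INTEGER);
    INSERT INTO users VALUES ('alice', 1), ('bob', 0);
""")

user_input = "nobody' OR '1'='1"  # hostile value, e.g. typed into a form

# Unsafe: the input becomes part of the SQL text, so the WHERE clause
# turns into  name = 'nobody' OR '1'='1'  and matches every row.
unsafe_sql = "SELECT name FROM users WHERE name = '" + user_input + "'"
unsafe_rows = con.execute(unsafe_sql).fetchall()

# Safer: the value is passed separately through a placeholder,
# so it is only ever compared as a string.
safe_rows = con.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

Frameworks that generate SQL typically bind values this way, which is one reason hand-assembled query strings are discouraged.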
13. Reconsidering SQL-based Platforms
Relational DB Management Systems have a lot to offer
● Stability
● SQL is the most common and familiar language in the enterprise
● Analytical capabilities
● MPP variants offer massive scale in storage and processing
14. What is Ibis?
“A [Python] pandas-like deferred expression system, with
first-class SQL support”
● Pandas: Python package with DataFrame abstraction, staple for data scientists
● Same code can work on multiple data platforms
● Deferred expression → lazy evaluation
○ Define complex pipeline of transformations, represented as an object
○ Can inspect properties of the end result without evaluating
○ Allows type/error checking client side before sending job to server/cluster
○ Make bad code fail fast!
○ Gives query/execution optimizer the full picture → better plans
● Developed by Wes McKinney, Phillip Cloud, and community
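The idea of a deferred expression can be sketched with a toy in plain Python. This is not Ibis's API: the classes `Table`, `Column`, and `Sum` are hypothetical and exist only to show how an expression object can be inspected, type-checked, and compiled to SQL before anything runs. SQLite stands in for the database.

```python
import sqlite3


class Table:
    """Hypothetical table expression: a name plus a schema, no data."""

    def __init__(self, name, schema):
        self.name = name
        self.schema = dict(schema)  # column name -> type name

    def __getitem__(self, column):
        if column not in self.schema:
            # Fail fast on the client: no query has been sent anywhere.
            raise KeyError(f"no column {column!r} in table {self.name!r}")
        return Column(self, column)


class Column:
    def __init__(self, table, name):
        self.table = table
        self.name = name
        self.type = table.schema[name]

    def sum(self):
        if self.type not in ("int", "float"):
            raise TypeError(f"cannot sum {self.type} column {self.name!r}")
        return Sum(self)


class Sum:
    """A deferred aggregation: it describes work but has not done any."""

    def __init__(self, column):
        self.column = column
        self.type = column.type  # result type is known without evaluating

    def compile(self):
        return f'SELECT SUM("{self.column.name}") FROM "{self.column.table.name}"'

    def execute(self, con):
        return con.execute(self.compile()).fetchone()[0]


sales = Table("sales", {"region": "str", "amount": "float"})
total = sales["amount"].sum()   # builds an expression; nothing runs yet
print(total.type)               # inspect the result type without evaluating
print(total.compile())          # the SQL that would be sent

try:
    sales["region"].sum()       # bad code is rejected before any execution
except TypeError as exc:
    print("rejected before execution:", exc)

# Only now is a database involved.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 100.0), ("west", 30.0)])
result = total.execute(con)
print(result)
```

The type error surfaces while the expression is being built, before a connection even exists. That is the fail-fast behavior the bullets describe.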
15. Ibis in Action
Establish a connection object
Create an object that refers to a
table
Table object contains information
about the schema
16. Ibis in Action
Columns and aggregation
● Column selection looks like pandas
● Methods for defining aggregation and
computation (e.g. sum)
● Computation deferred until final step
when execution is called
17. Ibis in Action
Joins
● Define join in object-oriented fashion
● Potential columns and types are
known before actual evaluation
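How the columns and types of a join can be known before evaluation can be sketched in plain Python. The function `join_schema` and the two schemas are hypothetical and do not reflect how Ibis is implemented. The output schema is derived from the input schemas alone, without touching any data:

```python
def join_schema(left, right, on):
    """Return the output schema of an inner equi-join, or raise if it is invalid.

    `left` and `right` map column names to type names.
    """
    if on not in left or on not in right:
        raise KeyError(f"join key {on!r} must exist in both tables")
    if left[on] != right[on]:
        raise TypeError(
            f"join key {on!r} has mismatched types: {left[on]} vs {right[on]}"
        )
    out = dict(left)
    for name, type_name in right.items():
        if name == on:
            continue  # keep a single copy of the join key
        # Disambiguate clashing names with a suffix (an arbitrary choice here).
        out[name + "_right" if name in out else name] = type_name
    return out


orders = {"order_id": "int", "customer_id": "int", "amount": "float"}
customers = {"customer_id": "int", "name": "str"}

joined = join_schema(orders, customers, on="customer_id")
print(joined)
```

A mismatched or missing join key is reported immediately, before any rows are read.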
19. Making Ibis More Versatile
To cover the full range of DS tasks in Postgres/Greenplum, ibis needs
some further development
● Ability to create and use user-defined functions (UDFs)
● Ability to create a table and save the results to it
● Data science modeling abstractions that use ibis table objects as input
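For reference, the first two capabilities look roughly like this at the database level. The sketch uses Python's built-in sqlite3 module as a stand-in. Postgres/Greenplum define UDFs differently (with CREATE FUNCTION), and none of this is Ibis API. The `messages` table and the `word_count` function are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE messages (id INTEGER, body TEXT);
    INSERT INTO messages VALUES (1, 'wire the funds today'), (2, 'lunch?');
""")


def word_count(text):
    """Hypothetical UDF: number of whitespace-separated tokens."""
    return len(text.split())


# Register the Python function so SQL can call it by name (one argument).
con.create_function("word_count", 1, word_count)

# Save the result of a query that uses the UDF into a new table.
con.execute("""
    CREATE TABLE message_stats AS
    SELECT id, word_count(body) AS n_words
    FROM messages
""")

stats = con.execute(
    "SELECT id, n_words FROM message_stats ORDER BY id"
).fetchall()
print(stats)
```

A data frame layer would need to expose both steps, registering the function and materializing the result, as operations on its table objects.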
20. Let’s round out ibis and give
Postgres/Greenplum a modern
data science interface