6. What are the problems to solve?
> We couldn’t visualize data in HDFS directly using
dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results to an
interactive DB for quick response
(PostgreSQL, Redshift, etc.)
> Interactive DBs cost more and are far less scalable
> Some data are not stored in HDFS
> We need to copy the data into HDFS to analyze
8. Batch analysis platform: HDFS, Hive
Visualization platform: PostgreSQL, etc.; Commercial BI Tools; Dashboard
Daily/Hourly Batch (Hive to PostgreSQL, etc.)
Interactive query (BI Tools / Dashboard to PostgreSQL, etc.)
✓ Less scalable
✓ Extra cost
✓ Extra work to manage 2 platforms
✓ Can’t query against “live” data directly
14. What can Presto do?
> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL
> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity through Prestogres
> Query across multiple data sources such as
Hive, HBase, Cassandra, or even internal DBs
> Plugin mechanism
> Integrate batch analysis + visualization
into a single data analysis platform
15. Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> who run 30,000+ queries every day
> processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb, Qubole
> Presto as a Service
17. The problems of using Presto with
BI tools
> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlikView, Chart.IO, ...
> JasperSoft, Pentaho, MotionBoard, ...
> ODBC/JDBC is VERY COMPLICATED
> A mature implementation takes a LONG time
• psqlODBC: 58,000 lines
• postgresql-jdbc: 62,000 lines
• mysql-connector-odbc: 27,000 lines
• mysql-connector-j: 101,000 lines
18. A solution
> Creates a PostgreSQL protocol gateway
> Uses PostgreSQL’s stable ODBC / JDBC driver
19. Other possible designs were…
a) MySQL protocol + libdrizzle:
> Drizzle provides a well-designed library to implement
MySQL protocol server.
> Proof-of-concept worked well:
• trd-gateway - MySQL protocol gateway for Hive
> Difficulties: clients assume the server is MySQL, but
• syntax mismatches: MySQL uses `…` while Presto uses "…"
• function mismatches: DAYOFMONTH(…) vs EXTRACT(day…)
b) PostgreSQL + Foreign Data Wrapper (FDW):
> JOIN and aggregation pushdown are not available
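The syntax and function mismatches listed under design (a) can be illustrated with a small translation sketch. The function name `mysql_to_presto` is hypothetical and is not part of trd-gateway or Presto. It covers only the two mismatches named on the slide, using regular expressions, and it would also rewrite backticks inside string literals. A real gateway would need a full SQL parser, which is part of why this design was difficult.

```python
import re

def mysql_to_presto(sql: str) -> str:
    """Translate two MySQL-isms into Presto-style SQL (illustrative only)."""
    # `identifier` -> "identifier" (naive: does not skip string literals)
    sql = re.sub(r"`([^`]*)`", r'"\1"', sql)
    # DAYOFMONTH(expr) -> EXTRACT(day FROM expr); non-nested arguments only
    sql = re.sub(r"(?i)\bDAYOFMONTH\(([^()]*)\)", r"EXTRACT(day FROM \1)", sql)
    return sql

print(mysql_to_presto("SELECT DAYOFMONTH(`ts`) FROM `access`"))
```

Every MySQL-specific function or quoting rule a client emits would need its own rule like these, so the rule set grows with each BI tool supported.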
20. Other possible designs were…
c) PostgreSQL + H2 database + patch:
> H2 is an embedded database engine written in Java
> H2 has a PostgreSQL protocol implementation in Java
> Difficulties:
• System catalog implementation is incomplete
(pg_class, pg_namespace, pg_proc, etc.)
d) Reusing PostgreSQL protocol impl.:
> Difficulties:
• complete implementation of system catalogs was too difficult
21. Prestogres design
pgpool-II + PostgreSQL + PL/Python
> pgpool-II is a PostgreSQL protocol middleware for
replication, failover, load-balancing, etc.
> pgpool-II already has some useful code
(parsing SQL, rewriting SQL, hacking system catalogs, …)
> Basic idea:
• Rewrite queries at pgpool-II and run Presto queries using PL/Python
select count(1) from access
→ rewrite! →
select * from python_func('select count(1) from access')
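The rewrite step can be sketched in a few lines. This is only an illustration of the idea, not pgpool-II's actual rewriting code. `wrap_for_plpython` is a hypothetical helper name, and `python_func` is the placeholder function name used on the slide. The sketch assumes PostgreSQL's standard string-literal rule, where a single quote inside a literal is escaped by doubling it.

```python
def wrap_for_plpython(query: str, func_name: str = "python_func") -> str:
    """Wrap a client query as the string argument of a PL/Python function call."""
    # Double any single quotes so the query survives as a SQL string literal
    escaped = query.replace("'", "''")
    return f"select * from {func_name}('{escaped}')"

print(wrap_for_plpython("select count(1) from access"))
```

The client's query text is never parsed by PostgreSQL itself; it travels as an opaque string to the function, which can hand it to Presto.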
30. system catalog!
psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
psql → pgpool-II: "Q" SELECT * FROM pg_class;
"Query against a system catalog!" (Meta-query)
pgpool-II → PostgreSQL: SELECT setup_system_catalog('presto.local:8080', 'hive')
PL/Python function defined at prestogres.py
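One plausible reading of `setup_system_catalog`, consistent with the next slide's query against `information_schema.columns`, is that it mirrors Presto's table definitions as empty PostgreSQL tables so that catalog queries such as `SELECT * FROM pg_class` see them. The sketch below shows only the DDL-generation part. It is not the code in prestogres.py: `create_table_statements` and `TYPE_MAP` are hypothetical names, and the type mapping is an assumption.

```python
# Assumed Presto -> PostgreSQL type mapping; unknown types fall back to varchar
TYPE_MAP = {
    "varchar": "varchar",
    "bigint": "bigint",
    "double": "double precision",
    "boolean": "boolean",
}

def create_table_statements(columns):
    """Build CREATE TABLE statements from (table, column, presto_type) rows,
    the shape of a result from information_schema.columns."""
    tables = {}  # dicts keep insertion order, so tables appear as first seen
    for table, column, presto_type in columns:
        pg_type = TYPE_MAP.get(presto_type, "varchar")
        tables.setdefault(table, []).append(f'"{column}" {pg_type}')
    return [f'CREATE TABLE "{t}" ({", ".join(cols)});' for t, cols in tables.items()]

rows = [
    ("users", "id", "bigint"),
    ("users", "name", "varchar"),
    ("events", "score", "double"),
]
for stmt in create_table_statements(rows):
    print(stmt)
```

The tables hold no data; they exist so that PostgreSQL's own system catalogs describe the Presto schema to ODBC/JDBC clients.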
31. psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
psql → pgpool-II: "Q" SELECT * FROM pg_class;
"Query against a system catalog!" (Meta-query)
pgpool-II → PostgreSQL: SELECT setup_system_catalog('presto.local:8080', 'hive')
PostgreSQL → Presto: SELECT * FROM information_schema.columns
In PostgreSQL:
> CREATE TABLE access_logs;
> CREATE TABLE users;
> CREATE TABLE events;
…
32. psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
psql → pgpool-II: "Q" SELECT * FROM pg_class;
pgpool-II → PostgreSQL: "Q" SELECT * FROM pg_class;
"Query against a system catalog!" (Meta-query)
33. psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
psql → pgpool-II: "Q" select count(*) from access_logs;
"Query against a regular table!" (Presto Query)
34. psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
psql → pgpool-II: "Q" select count(*) from access_logs;
"Query against a regular table!" (Presto Query)
pgpool-II → PostgreSQL: SELECT start_presto_query(…
'select count(*) from access_logs')
PL/Python function defined at prestogres.py
35. psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
psql → pgpool-II: "Q" select count(*) from access_logs;
"Query against a regular table!" (Presto Query)
pgpool-II → PostgreSQL: SELECT start_presto_query(…
'select count(*) from access_logs')
1. start the query on Presto
2. define a function to fetch the result
> CREATE TYPE result_type (c0_ BIGINT);
> CREATE FUNCTION fetch_results
RETURNS SETOF result_type
…
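Step 2 can be sketched as DDL generation from the result columns that Presto reports. This is not the actual `start_presto_query` implementation. `result_ddl` is a hypothetical helper, the function body is a placeholder, and `plpython3u` is an assumed language name (older PostgreSQL installations use `plpythonu`). The slide abbreviates the statements; PostgreSQL's composite-type syntax is `CREATE TYPE name AS (...)`.

```python
def result_ddl(columns, type_name="result_type", func_name="fetch_results",
               body="return []  # placeholder: real code would fetch rows from Presto"):
    """Return (CREATE TYPE, CREATE FUNCTION) statements for a query's result shape.

    columns: list of (column_name, postgres_type) pairs.
    """
    cols = ", ".join(f'"{name}" {pg_type}' for name, pg_type in columns)
    create_type = f"CREATE TYPE {type_name} AS ({cols});"
    create_func = (
        f"CREATE FUNCTION {func_name}() RETURNS SETOF {type_name} AS $$\n"
        f"{body}\n"
        f"$$ LANGUAGE plpython3u;"
    )
    return create_type, create_func

type_ddl, func_ddl = result_ddl([("c0_", "BIGINT")])
print(type_ddl)
print(func_ddl)
```

Declaring the row type per query gives the follow-up `SELECT * FROM fetch_results()` a concrete column list, which PostgreSQL needs in order to describe the result set to the client.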
36. psql → pgpool-II → PostgreSQL → Presto
$ psql -U me mydb
psql → pgpool-II: "Q" select count(*) from access_logs;
"Query against a regular table!" (Presto Query)
pgpool-II → PostgreSQL: "Q" SELECT * FROM fetch_results();
PL/Python function defined by start_presto_query
"Q" RAISE EXCEPTION …
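The walkthrough in slides 30 to 36 depends on pgpool-II deciding whether an incoming query is a meta-query (against a system catalog) or a Presto query (against a regular table). The sketch below is a regex stand-in for that decision. The real gateway relies on pgpool-II's SQL parsing code, and the rule used here (table names starting with `pg_`, or qualified with `pg_catalog.` or `information_schema.`) is an assumption. `is_meta_query` and `route` are hypothetical names.

```python
import re

# Assumption: tables under these schemas, or named pg_*, are system catalogs
SYSTEM_SCHEMAS = ("pg_catalog.", "information_schema.")

def is_meta_query(sql: str) -> bool:
    """Return True if any table named after FROM or JOIN looks like a system catalog."""
    tables = re.findall(r"(?i)\b(?:from|join)\s+([\w.]+)", sql)
    return any(
        t.lower().startswith("pg_") or t.lower().startswith(SYSTEM_SCHEMAS)
        for t in tables
    )

def route(sql: str) -> str:
    """Label a query the way the slides do: meta-query or Presto query."""
    return "meta-query" if is_meta_query(sql) else "presto-query"

print(route("SELECT * FROM pg_class;"))
print(route("select count(*) from access_logs;"))
```

A query with no FROM clause, such as a bare function call, falls through to the Presto path in this sketch; handling such cases is one reason a real parser is preferable.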
42. Faked current_database()
delete from pg_catalog.pg_proc where
proname='current_database';
create function pg_catalog.current_database()
returns name as $$
begin
return 'faked_name'::name;
end
$$ language plpgsql stable strict;