*This talk was first presented at http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/225673273/*
Enterprise users today demand the ability to glean insights from their disparate data spread across varied transactional and analytics sources; hence, analytics application developers need the ability to connect to varied data & compute engines such as Spark, Flink, Cassandra, etc.
A key pain point for developers is the lack of a uniform API across data & compute engines, a limitation which adversely impacts developer productivity, while also restricting dataflow across different engines. DDF (Distributed DataFrame) is a simple but powerful API above and across multiple engines. Using DDF, developers reap significant benefits including (1) a unified and highly productive API for data/compute access, (2) the ability to process data at-source, bypassing the absolute requirement for a Hadoop data lake, and (3) future-proofing against rapidly shifting economics of specific data engines.
To date, DDF has been implemented on Spark, Flink, and other engines. In this talk we demonstrate, for the first time, a business-analyst-friendly real-time data exploration and visualization system working directly with Flink. We will show how a business user can ask natural-language questions of their data and get real-time answers from Flink, in the form of visual charts and tables. We will also show interaction with the DDF-on-Flink API at the developer level, share the challenges and lessons learned in realizing this vision on Flink, and compare and contrast that with the same experience on Spark.
Speakers:
Christopher Nguyen, Founder and CEO, Adatao
Rohit Rai, Founder and CEO of Tuplejump
@arimoinc @pentagoniac http://ddf.io
The Solution: DDF Data Integration
[Architecture diagram: DDF exposes Scala, Java, Python, and R APIs as a uniform layer above multiple engines. Data in memory: Spark, Flink, Ignite, Presto. Data at rest: HDFS, data warehouses and databases. Via an Enterprise Data Bus: S3, Redshift, BigQuery, Cassandra, RDBMS.]
Benefits of DDF Data Integration
§ FOR DATA ENGINEERS
• Unified API across data sources and engines: HDFS, S3, Cassandra, Redshift, BigQuery, RDBMS, Salesforce, Spark, Flink, Ignite …
§ FOR DATA SCIENTISTS
• Uniform high-level DataFrame abstractions: ETL, ML, Streaming
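The "unified API" idea above can be sketched as a factory that hands back an engine-specific implementation behind one interface, in the spirit of `DDFManager.get("flink")`. The following is a toy Java model, not the real DDF library; `QueryEngine`, `EngineFactory`, and the stub result strings are invented for illustration.

```java
import java.util.Map;
import java.util.function.Function;

// One interface that application code programs against, regardless of engine.
interface QueryEngine {
    String runSql(String sql);
}

class EngineFactory {
    // Hypothetical registry of engine backends; the real DDF wires up
    // engine modules (Spark, Flink, JDBC, ...) rather than stub lambdas.
    private static final Map<String, Function<String, String>> ENGINES = Map.of(
        "flink", sql -> "flink-result:" + sql,
        "spark", sql -> "spark-result:" + sql
    );

    static QueryEngine get(String name) {
        Function<String, String> impl = ENGINES.get(name);
        if (impl == null) {
            throw new IllegalArgumentException("unknown engine: " + name);
        }
        return impl::apply; // adapt the backend to the uniform interface
    }
}
```

Application code such as `EngineFactory.get("flink").runSql("select * from airline")` stays byte-for-byte identical when the engine string changes to `"spark"`, which is the engine-agnostic property DDF aims for.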
DDF API in a Nutshell
// To start working with an engine:
DDFManager manager = DDFManager.get("flink"); // or "spark"
// Then, data can be loaded into a DDF as follows:
DDF table = manager.sql2ddf("select * from airline");
// ETL / transform: add a derived column
table = table.transform("dist = round(distance/2, 2)");
// Train a k-means model via MLlib, then apply it for prediction
KMeansModel kmeansModel = (KMeansModel) table.ML.train("kmeans", 5, 5).getRawModel();
DDF prediction = table.ML.applyModel(kmeansModel, false, true);
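The transform step above evaluates the expression `round(distance/2, 2)` for every row to produce the derived `dist` column. As a plain-Java illustration of what that expression computes per row (this is not DDF code; the class and method names are invented for the example):

```java
// Per-row semantics of the transform expression "dist = round(distance/2, 2)":
// halve the distance, then round to 2 decimal places.
class TransformDemo {
    static double dist(double distance) {
        return Math.round(distance / 2.0 * 100.0) / 100.0;
    }
}
```

For example, `TransformDemo.dist(1234.567)` halves to 617.2835 and rounds to 617.28.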
DDF: Where is it heading?
§ More Engines: DBs & DWs: BigQuery, Cassandra, Teradata, Presto, Ignite
§ Enterprise Data Bus to seamlessly move data across sources
§ Richer APIs
Get Started with DDF
§ Increase your productivity & build engine-agnostic apps
• Build your analytics apps on existing modules
• Flink, Spark, JDBC
§ Expand possibilities. Contribute to DDF
• Enrich existing plugins: Data APIs, ML APIs...
• Add new DDF plugins:
• BigQuery, Cassandra
• Marketo
• Ignite, Presto
§ Spread the word!
www.ddf.io/gettingstarted
Collaborative Predictive Intelligence via DDF-on-Flink (Distributed DataFrame)
Christopher Nguyen, PhD (CEO & Co-Founder, Arimo)
Rohit Rai (CEO, Tuplejump)
Bringing BigApps to Flink