*This talk was first presented at http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/225673273/*
Enterprise users today demand the ability to glean insights from their disparate data spread across varied transactional and analytics sources; hence, analytics application developers need the ability to connect to varied data & compute engines such as Spark, Flink, Cassandra, etc.
A key pain point for developers is the lack of a uniform API across data & compute engines, a limitation which adversely impacts developer productivity, while also restricting dataflow across different engines. DDF (Distributed DataFrame) is a simple but powerful API above and across multiple engines. Using DDF, developers reap significant benefits including (1) a unified and highly productive API for data/compute access, (2) the ability to process data at-source, bypassing the absolute requirement for a Hadoop data lake, and (3) future-proofing against rapidly shifting economics of specific data engines.
To date, DDF has been implemented on Spark, Flink, and other engines. In this talk we demonstrate, for the first time, a business-analyst-friendly real-time data exploration and visualization system working directly with Flink. We will show how a business user can ask natural-language questions of their data and get real-time answers from Flink, in the form of visual charts and tables. We will also show interaction with the DDF-on-Flink API at the developer level, share the challenges and lessons learned in realizing this vision on Flink, and compare and contrast that with the same experience on Spark.
Speakers:
Christopher Nguyen, Founder and CEO, Adatao
Rohit Rai, Founder and CEO of Tuplejump
@arimoinc @pentagoniac http://ddf.io
The Solution: DDF Data Integration
[Architecture diagram: DDF exposes Scala, Java, Python, and R APIs as a uniform layer above multiple engines. Data in memory: Spark, Flink, Ignite, Presto. Data at rest: HDFS, data warehouses and databases. Via an Enterprise Data Bus: S3, Redshift, BigQuery, Cassandra, RDBMS.]
Benefits of DDF Data Integration
§ FOR DATA ENGINEERS
• Unified API across data sources and engines: HDFS, S3, Cassandra, Redshift, BigQuery, RDBMS, Salesforce, Spark, Flink, Ignite …
§ FOR DATA SCIENTISTS
• Uniform high-level DataFrame abstractions: ETL, ML, Streaming
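The "unified API" idea above can be sketched as a factory that hands back an engine-specific implementation behind one interface, in the spirit of `DDFManager.get("flink")`. The following is a toy Java model, not the real DDF library; `QueryEngine`, `EngineFactory`, and the stub result strings are invented for illustration.

```java
import java.util.Map;
import java.util.function.Function;

// One interface that application code programs against, regardless of engine.
interface QueryEngine {
    String runSql(String sql);
}

class EngineFactory {
    // Hypothetical registry of engine backends; the real DDF wires up
    // engine modules (Spark, Flink, JDBC, ...) rather than stub lambdas.
    private static final Map<String, Function<String, String>> ENGINES = Map.of(
        "flink", sql -> "flink-result:" + sql,
        "spark", sql -> "spark-result:" + sql
    );

    static QueryEngine get(String name) {
        Function<String, String> impl = ENGINES.get(name);
        if (impl == null) {
            throw new IllegalArgumentException("unknown engine: " + name);
        }
        return impl::apply; // adapt the backend to the uniform interface
    }
}
```

Application code such as `EngineFactory.get("flink").runSql("select * from airline")` stays byte-for-byte identical when the engine string changes to `"spark"`, which is the engine-agnostic property DDF aims for.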
DDF API in a Nutshell
// To start working with an engine:
DDFManager manager = DDFManager.get("flink"); // or "spark"
// Then, data can be loaded into a DDF as follows:
DDF table = manager.sql2ddf("select * from airline");
// ETL / transform: add a derived column
table = table.transform("dist = round(distance/2, 2)");
// Train a k-means model via MLlib, then apply it for prediction
KMeansModel kmeansModel = (KMeansModel) table.ML.train("kmeans", 5, 5).getRawModel();
DDF prediction = table.ML.applyModel(kmeansModel, false, true);
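The transform step above evaluates the expression `round(distance/2, 2)` for every row to produce the derived `dist` column. As a plain-Java illustration of what that expression computes per row (this is not DDF code; the class and method names are invented for the example):

```java
// Per-row semantics of the transform expression "dist = round(distance/2, 2)":
// halve the distance, then round to 2 decimal places.
class TransformDemo {
    static double dist(double distance) {
        return Math.round(distance / 2.0 * 100.0) / 100.0;
    }
}
```

For example, `TransformDemo.dist(1234.567)` halves to 617.2835 and rounds to 617.28.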
DDF: Where is it heading?
§ More Engines: DBs & DWs: BigQuery, Cassandra, Teradata, Presto, Ignite
§ Enterprise Data Bus to seamlessly move data across sources
§ Richer APIs
Get Started with DDF
§ Increase your productivity & build engine-agnostic apps
• Build your analytics apps on existing modules
• Flink, Spark, JDBC
§ Expand possibilities. Contribute to DDF
• Enrich existing plugins: Data APIs, ML APIs...
• Add new DDF plugins:
• BigQuery, Cassandra
• Marketo
• Ignite, Presto
§ Spread the word!
www.ddf.io/gettingstarted
Collaborative Predictive Intelligence via DDF-on-Flink (Distributed DataFrame)
Christopher Nguyen, PhD (CEO & Co-Founder, Arimo)
Rohit Rai (CEO, Tuplejump)
Bringing BigApps to Flink