Minerva: Drill Storage Plugin for IPFS
Run SQL queries on data in IPFS
Build a big data storage blockchain (BDSC)
Problems with Public Dataset Analytics
A Present-day Workflow:
1. Pinpoint the real address of a dataset, typically an HTTP link;
2. Download the dataset in a client-server mode;
3. Configure a computation environment for big data analysis;
4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.
Problems with Public Dataset Analytics
Workflow:
1. Pinpoint the real address of a dataset;
2. Download the dataset;
3. Set up a computation environment powerful enough for big data analysis;
4. Prepare the data, e.g. converting file formats, implementing basic analysis algorithms.
Caveats:
• Links may expire over time due to temporary server failure or permanent website shutdown.
• The dataset might be polluted (there is no way to tell whether it is the dataset you need).
• A single website cannot host all the datasets.
Problems with Public Dataset Analytics
Workflow:
1. Locate the dataset, typically via an HTTP link;
2. Download the dataset in a client-server mode;
3. Set up a computation environment powerful enough for big data analysis;
4. Prepare the data, e.g. converting file formats, implementing basic analysis algorithms.
Caveats:
• Datasets are usually huge and take a long time to download.
• Client-server mode is not bandwidth efficient.
• Data files are usually packaged and compressed into a single dataset archive. A user interested in only part of the dataset has to download all of it.
Problems with Public Dataset Analytics
Workflow:
1. Locate the dataset, typically via an HTTP link;
2. Download the dataset;
3. Configure a computation environment for big data analysis;
4. Prepare the data, e.g. converting file formats, implementing basic analysis algorithms.
Caveats:
• Expensive storage and computation resources are necessary for large-scale data analytics.
• Maintenance and management overhead consumes enormous human resources.
Problems with Public Dataset Analytics
Workflow:
1. Locate the dataset, typically via an HTTP link;
2. Download the dataset;
3. Set up a computation environment powerful enough for big data analysis;
4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.
Caveats:
• Datasets from different origins and different areas of research come in different formats and structures.
• The users of datasets might not be proficient in programming.
• Repetitive work in data analytics is inevitable when many users happen to process the same dataset.
IPFS [1] to the Rescue
• Decentralization: no single point of failure
• Collaboration: sharing resources as well as reusing code in the community
• Fine-grained content addressing [2]: get exactly what you need
1: https://ipfs.io/
2: Datasets can be split into blocks, and only the blocks of interest need processing.
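The idea behind fine-grained content addressing can be sketched in a few lines. This is a simplified illustration, not how IPFS encodes identifiers: a hex SHA-256 digest stands in for a real CID, and the block size, the toy dataset, and the helper names (`split_into_blocks`, `block_id`, `build_store`) are assumptions made for the example.

```python
import hashlib


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a dataset into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_id(block: bytes) -> str:
    """Stand-in for a CID: the identifier is derived from the content itself."""
    return hashlib.sha256(block).hexdigest()


def build_store(data: bytes, block_size: int) -> tuple[list[str], dict[str, bytes]]:
    """Return the ordered block identifiers and an id -> block lookup table."""
    blocks = split_into_blocks(data, block_size)
    ids = [block_id(b) for b in blocks]
    return ids, dict(zip(ids, blocks))


# Toy dataset: 100 short rows, split into 256-byte blocks.
dataset = b"".join(f"row-{i:04d}\n".encode() for i in range(100))
ids, store = build_store(dataset, block_size=256)

# A reader interested in the second block fetches only that block
# and can verify it against the identifier it asked for.
wanted = ids[1]
fetched = store[wanted]
verified = block_id(fetched) == wanted
```

Because the identifier is computed from the bytes, a reader can check that a fetched block is the one requested, no matter which peer served it.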
Drill [1], the Distributed Query Engine
• Compatibility: supporting standard SQL statements
• Flexibility: no metastore, no schema, non-relational data
• Scalability: enabling user-defined functions
• Locality-awareness: pushing processing into the nearby datastores
1: https://drill.apache.org/
Drill and IPFS Combined
Drill and IPFS collocation: a distributed network of nodes, each of which runs Drill and IPFS simultaneously.
[Figure: architecture diagram. Labels: Localhost; Peers on network; P2P Network; Storage; Planner; Reader / Writer; Query engine; Version & format management; Qri [1]; libp2p [2].]
1: https://qri.io/
2: https://libp2p.io/
Query Explained: Read
SQL statement that “reads” data:
SELECT *
FROM ipfs.`/ipfs/QmAce…f2a/employee.json`
[Figure labels: SQL input; Drill query interface; Foreman; IPFS CID [1] of the dataset being queried.]
1: Content Identifier, CID. https://github.com/ipld/specs/blob/master/block-layer/CID.md
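A read statement like the one on this slide could be submitted programmatically through Drill's REST interface. The sketch below only builds the request and does not send it. The `ipfs` storage plugin name comes from the slide. The host, the port 8047 (Drill's default web port), the helper names, and the placeholder CID are assumptions for illustration.

```python
import json
import urllib.request


def build_read_query(cid: str, path: str) -> str:
    """Compose the SELECT statement in the shape shown on the slide."""
    return f"SELECT * FROM ipfs.`/ipfs/{cid}/{path}`"


def build_drill_request(sql: str, host: str = "localhost",
                        port: int = 8047) -> urllib.request.Request:
    """Wrap a SQL string in a POST request for Drill's /query.json endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical placeholder, not a real CID.
EXAMPLE_CID = "QmExampleCid"

sql = build_read_query(EXAMPLE_CID, "employee.json")
request = build_drill_request(sql)
# Sending it would be: urllib.request.urlopen(request)
```

Whichever Drillbit receives the request acts as the foreman for that query.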
Query Explained: Read
IPFS object resolution:
ipfs object links QmAce…f2a
Links: the CIDs of the objects (chunks) contained in the “top” object.
[Figure labels: SQL input; Foreman.]
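Object resolution amounts to walking the links of the top object until only leaf chunks remain. The sketch below models that walk on an in-memory dictionary that stands in for the output of `ipfs object links`. The identifiers in `toy_dag` and the function name `resolve_leaves` are made up for the example and do not come from the plugin.

```python
def resolve_leaves(dag: dict[str, list[str]], cid: str) -> list[str]:
    """Return the leaf chunks reachable from `cid`, in link order.

    `dag` maps an object identifier to the identifiers it links to.
    An object with no links is treated as a leaf chunk.
    """
    links = dag.get(cid, [])
    if not links:
        return [cid]
    leaves: list[str] = []
    for child in links:
        leaves.extend(resolve_leaves(dag, child))
    return leaves


# Toy object graph: "top" links to an intermediate object and one chunk.
toy_dag = {
    "top": ["mid-1", "chunk-3"],
    "mid-1": ["chunk-1", "chunk-2"],
}

leaves = resolve_leaves(toy_dag, "top")
```

The resulting list of chunk identifiers is what the next step, provider resolution, takes as input.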
Query Explained: Read
IPFS provider resolution:
ipfs dht findprovs QmFHq…32T
Drillbits running IPFS that can provide the data pieces receive the Drill execution plan from the foreman.
[Figure labels: SQL input; Foreman; DHT; peers A, B, C, D; Drill execution plan sent to peer nodes.]
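Once the providers of each chunk are known, the foreman has to decide which peer processes which chunk. The slides do not say how Minerva makes this choice, so the sketch below uses a simple greedy rule (give each chunk to its least-loaded provider) purely as an assumption. The provider table stands in for the output of `ipfs dht findprovs`, and all names are hypothetical.

```python
def assign_chunks(providers_by_chunk: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each peer to the chunks it should process.

    Greedy rule (an assumption, not Minerva's documented behaviour):
    each chunk goes to the provider with the fewest chunks so far,
    with ties broken by peer name.
    """
    assignment: dict[str, list[str]] = {}
    for chunk in sorted(providers_by_chunk):
        candidates = providers_by_chunk[chunk]
        if not candidates:
            raise LookupError(f"no provider found for {chunk}")
        peer = min(candidates, key=lambda p: (len(assignment.get(p, [])), p))
        assignment.setdefault(peer, []).append(chunk)
    return assignment


# Toy provider table for four chunks and peers A to D.
providers = {
    "chunk-1": ["A", "B"],
    "chunk-2": ["A"],
    "chunk-3": ["B", "C"],
    "chunk-4": ["A", "D"],
}

plan = assign_chunks(providers)
```

Every chunk ends up on a peer that already stores it, which matches the locality-aware processing the deck describes.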
Query Explained: Read
Parts of the results are returned to the foreman.
Results are returned to the user.
[Figure labels: SQL input; Results; Foreman; peers A, B, C, D.]
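The final read step is a scatter-gather: each peer computes a partial result on its local chunks and the foreman combines them. The sketch below illustrates this for a `select count(*)` style query, using threads to stand in for remote peers. The data, the peer names, and the function names are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_rows(local_chunks: list[list[dict]]) -> int:
    """Work done on one peer: count the rows in the chunks it stores."""
    return sum(len(chunk) for chunk in local_chunks)


def foreman_count(fragments: dict[str, list[list[dict]]]) -> int:
    """Work done on the foreman: run the fragments and merge the partial counts."""
    if not fragments:
        return 0
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(count_rows, fragments.values()))
    return sum(partials)


# Toy layout: each peer holds some chunks, and each chunk is a list of rows.
fragments = {
    "A": [[{"id": 1}, {"id": 2}]],
    "B": [[{"id": 3}]],
    "C": [[{"id": 4}, {"id": 5}, {"id": 6}]],
    "D": [[]],
}

total = foreman_count(fragments)
```

Only the small partial counts travel back to the foreman, not the chunks themselves.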
Query Explained: Write
SQL statement that “writes” data:
CREATE IPFSTABLE ipfs.`create` AS (
  SELECT *
  FROM ipfs.`/ipfs/QmAce…f2a/employee.json`
  ORDER BY `id` DESC
)
Actual data operations happen on the node that stores the data locally.
Partial CIDs of the new data pieces are sent back to the foreman.
Partial CIDs are reassembled into a single CID and returned to the user.
[Figure labels: SQL input; Result; Foreman; DHT; peers A, B, C, D.]
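Reassembling partial CIDs into a single CID can be pictured as hashing an ordered manifest of the child identifiers. The sketch below is a deliberately simplified model: real IPFS builds a DAG node that links to the children, whereas here a hex SHA-256 digest stands in for every CID. The function names and the sample data are assumptions.

```python
import hashlib


def fake_cid(data: bytes) -> str:
    """Stand-in for a CID: hex SHA-256 of the content."""
    return hashlib.sha256(data).hexdigest()


def root_cid(partial_cids: list[str]) -> str:
    """Derive one identifier from the ordered list of partial identifiers."""
    manifest = "\n".join(partial_cids).encode("utf-8")
    return fake_cid(manifest)


# Each peer writes its piece of the new table and reports an identifier.
partials = [fake_cid(b"rows 1-100"), fake_cid(b"rows 101-200")]

# The foreman combines them into the single identifier returned to the user.
root = root_cid(partials)
```

The order of the partial identifiers matters, since the statement on the slide sorts the rows: the same pieces in a different order produce a different root.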
User Defined Functions
• Format conversion programs and common analysis
algorithms can be implemented in the form of User
Defined Functions (UDF) and distributed along with the
datasets.
• Drill can invoke these UDFs using their CIDs, in the same
way it locates a dataset on IPFS.
Code Structure
[Figure: code structure diagram. Labels: IPFS DAG/DHT API; IPFS Object API.]
A Query Example
Performance Evaluation
• A 6-node cluster on a cloud service provider, each node with 8 GB of RAM and a 4-core CPU
• IPFS running in private network mode
• Query file size: 100 MB to 1 GB
• Queries: simple ones such as select *, select count(*)
• Response time: 2 to 10 s
• Transactions per second: ~10
Performance Evaluation
[Figure: query completion time under different chunk sizes (left) and parallelization widths (right). Dataset 1: 67 MB; Dataset 2: 190 MB.]
Possible Applications
• An easy MPP cluster with Minerva
• Decentralized data sharing system
• Big data analysis for other Dapps running on IPFS
Problems To Be Solved
• Performance
  • DHT operations take too much time, especially over the Internet.
  • IPFS limits blocks to at most 4 MB, resulting in an enormous number of blocks for huge datasets.
• Write operations are incomplete
  • The last step, reassembling the partial CIDs, is not yet implemented.
• Stability
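To see why the block size limit matters, the number of blocks per dataset can be computed directly. The sketch below uses the 4 MB limit and the dataset sizes quoted in this deck and assumes 1 MB = 2**20 bytes. The function name is made up for the example.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def block_count(dataset_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to hold a dataset (ceiling division)."""
    return -(-dataset_bytes // block_bytes)


BLOCK_LIMIT = 4 * MB  # the limit stated on the slide

blocks_67mb = block_count(67 * MB, BLOCK_LIMIT)         # Dataset 1
blocks_190mb = block_count(190 * MB, BLOCK_LIMIT)       # Dataset 2
blocks_1gb = block_count(1024 * MB, BLOCK_LIMIT)        # largest query file
blocks_1tb = block_count(1024 * 1024 * MB, BLOCK_LIMIT)  # a "huge" dataset
```

Under these assumptions a 1 GB file needs 256 blocks and a 1 TB dataset needs 262,144, each of which has to be located through the DHT.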
THANK YOU FOR YOUR TIME!
GitHub: github.com/bdchain/Minerva
