Minerva: Drill Storage Plugin for IPFS

Run SQL query on data in IPFS
Build a big data storage blockchain (BDSC)

Problems with Public Dataset Analytics
A Present-day Workflow:
1. Pinpoint the real address of a dataset, typically an HTTP link;
2. Download the dataset in a client-server mode;
3. Configure a computation environment for big data analysis;
4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.

Workflow:
1. Pinpoint the real address of a dataset;
2. Download the dataset;
3. Set up a computation environment powerful enough for big data analysis;
4. Prepare the data, e.g. converting file formats, implementing basic analysis algorithms.
Caveats (step 1):
• Links may expire over time due to temporary server failure or permanent website shutdown.
• The dataset might be polluted (there is no way to tell whether it is the dataset you need).
• A single website cannot host all the datasets.

Workflow:
1. Locate the dataset, typically via an HTTP link;
2. Download the dataset in a client-server mode;
Caveats (step 2):
• Datasets are usually huge, demanding a long download time;
• Client-server mode is not bandwidth efficient;
• Data files are usually packaged and compressed in a single dataset archive. A user interested in only part of the dataset has to download all of it.

Workflow:
1. Locate the dataset, typically via an HTTP link;
3. Configure a computation environment for big data analysis;
Caveats (step 3):
• Expensive storage and computation resources are necessary for large-scale data analytics;
• Maintenance and management overhead consumes enormous human resources.

Workflow:
1. Locate the dataset, typically via an HTTP link;
4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.
Caveats (step 4):
• Datasets from different origins and different areas of research come in different formats and structures.
• The users of datasets might not be proficient in programming;
• Repetitive work in data analytics is inevitable when many users happen to process the same dataset.

IPFS [1] to the Rescue
• Decentralization: no single point of failure
• Collaboration: sharing resources as well as reusing code in the community
• Fine-grained content addressing [2]: get exactly what you need
1: https://ipfs.io/
2: Datasets can be split into blocks, and only the blocks of interest need processing.
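
The idea behind footnote 2 can be illustrated with a minimal Python sketch. It is not how IPFS encodes real CIDs: a plain SHA-256 hex digest stands in for the CID, and the helper names `chunk`, `address`, and `build_index` are hypothetical, introduced only for this illustration.

```python
import hashlib

def chunk(data, size):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def address(block):
    """Stand-in for a CID: real CIDs wrap the digest in multihash/multibase encoding."""
    return hashlib.sha256(block).hexdigest()

def build_index(data, size):
    """Map each block's content address to the block itself."""
    return {address(b): b for b in chunk(data, size)}

# Hypothetical dataset: a small CSV generated in memory.
data = b"id,name\n" + b"".join(f"{i},user{i}\n".encode() for i in range(1000))
blocks = chunk(data, 1024)
index = build_index(data, 1024)

# A reader interested in the third block only needs that block's address;
# the fetched bytes can be verified by re-hashing them.
wanted = address(blocks[2])
fetched = index[wanted]
assert address(fetched) == wanted
```

Because the address is derived from the content, the same block always has the same address regardless of which peer serves it, and a reader can check what it received.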

Drill [1], the Distributed Query Engine
• Compatibility: supporting standard SQL statements
• Flexibility: no metastore, no schema, non-relational data
• Scalability: enabling user-defined functions
• Locality-awareness: pushing processing into the nearby datastores
1: https://drill.apache.org/

Drill and IPFS Combined
Drill and IPFS collocation:
A distributed network of nodes, each of which runs Drill and IPFS simultaneously.
[Architecture diagram. Labels: Query engine (Planner, Reader / Writer); Storage (Localhost, Peers on network); P2P Network (libp2p [2]); Version & format management (Qri [1])]
1: https://qri.io/
2: https://libp2p.io/

Query Explained: Read
SQL statement that “reads” data:
SELECT *
FROM ipfs.`/ipfs/QmAce…f2a/employee.json`
QmAce…f2a is the IPFS CID [1] of the dataset being queried.
[Diagram: SQL input → Drill query interface → Foreman]
1: Content Identifier, CID. https://github.com/ipld/specs/blob/master/block-layer/CID.md
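
A read query like the one on this slide could be submitted programmatically through Drill's REST interface. The following is a minimal sketch, assuming a Drillbit reachable at `localhost:8047` (Drill's default web port) and a storage plugin registered under the name `ipfs`, as in the slide. The helper names and the `<CID>` placeholder are hypothetical; the real CID is elided in the slide and is not filled in here.

```python
import json
import urllib.request

# Assumption: a local Drillbit with the REST API on its default port.
DRILL_URL = "http://localhost:8047/query.json"

def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying a SQL query for Drill."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_query(sql):
    """Send the query and return the result rows (requires a running Drillbit)."""
    with urllib.request.urlopen(build_query_request(sql)) as resp:
        return json.load(resp)["rows"]

# <CID> is a placeholder for the dataset's content identifier.
SQL = "SELECT * FROM ipfs.`/ipfs/<CID>/employee.json`"
request = build_query_request(SQL)
```

Calling `run_query(SQL)` only works against a live Drill installation with the plugin enabled; building the request separately keeps the sketch inspectable without one.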

IPFS object resolution:
ipfs object links QmAce…f2a
Links: CIDs of the objects (chunks) contained in the “top” object
[Diagram: SQL input → Foreman]
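
The object resolution step can be pictured as a walk over a small DAG. The sketch below uses an in-memory dictionary in place of real `ipfs object links` calls; the identifiers (`top`, `dirA`, `chunk1`, ...) and the function `leaf_chunks` are hypothetical stand-ins, not real CIDs or Minerva code.

```python
def leaf_chunks(dag, cid):
    """Return the leaf objects reachable from `cid`, in link order.

    `dag` maps an object's identifier to the identifiers it links to;
    an object with no links is treated as a data chunk.
    """
    links = dag.get(cid, [])
    if not links:
        return [cid]
    leaves = []
    for child in links:
        leaves.extend(leaf_chunks(dag, child))
    return leaves

# Toy DAG: the "top" object links to an intermediate object and a chunk.
dag = {
    "top": ["dirA", "chunk3"],
    "dirA": ["chunk1", "chunk2"],
}

chunks = leaf_chunks(dag, "top")
```

Each leaf identifier found this way is what the foreman would then look up in the DHT to find providers.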

IPFS provider resolution:
ipfs dht findprovs QmFHq…32T
Providers: Drillbits running IPFS (nodes A, B, C, D) that can provide the data pieces
The Drill execution plan is sent to these peer nodes.
[Diagram: SQL input → Foreman → DHT → nodes A, B, C, D]
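
Once the providers of each chunk are known, the foreman has to decide which node scans which chunk. The slides do not describe the policy Minerva uses, so the sketch below shows one plausible approach only: a greedy assignment to the least-loaded provider. The function `assign_chunks` and the chunk names are hypothetical.

```python
def assign_chunks(providers):
    """Assign each chunk to one of its providers, preferring the least-loaded node.

    `providers` maps a chunk identifier to the nodes able to serve it.
    Returns a mapping from node to the chunks it should scan.
    """
    load = {}
    plan = {}
    for chunk_id in sorted(providers):
        candidates = sorted(providers[chunk_id])
        if not candidates:
            raise ValueError(f"no provider found for {chunk_id}")
        # Ties are broken alphabetically because candidates are sorted.
        node = min(candidates, key=lambda n: load.get(n, 0))
        plan.setdefault(node, []).append(chunk_id)
        load[node] = load.get(node, 0) + 1
    return plan

# Hypothetical result of provider lookups for four chunks.
providers = {
    "chunk1": ["A", "B"],
    "chunk2": ["A"],
    "chunk3": ["B", "C"],
    "chunk4": ["A", "D"],
}

plan = assign_chunks(providers)
```

Assigning work only to nodes that already hold a chunk is what keeps processing local to the data, in line with the locality-awareness point made earlier in the deck.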

Parts of the results are returned to the foreman.
The results are returned to the user.
[Diagram: nodes A, B, C, D → Foreman → Results]
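
The final gather step can be sketched for a `select count(*)` style query: each node counts the records it holds, and the foreman sums the partial counts. This is an illustration of the scatter-gather pattern only, with threads standing in for remote Drillbits; the names `partial_count`, `foreman_count`, and the sample records are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_count(records):
    """Work done on one node: count the records in its local fragment."""
    return sum(1 for _ in records)

def foreman_count(fragments):
    """Run the partial counts in parallel and combine them into one result."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(partial_count, fragments.values()))

# Hypothetical fragments held by three nodes; node D holds no matching records.
fragments = {
    "A": [{"id": 1}, {"id": 2}],
    "B": [{"id": 3}],
    "D": [],
}

total = foreman_count(fragments)
```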

Query Explained: Write
SQL statement that “writes” data:
CREATE IPFSTABLE ipfs.`create` AS (
SELECT *
FROM ipfs.`/ipfs/QmAce…f2a/employee.json`
ORDER BY `id` DESC
)
• Actual data operations happen on the node that stores the data locally.
• Partial CIDs of the new data pieces are sent back to the foreman.
• The partial CIDs are reassembled into a single CID and returned to the user.
[Diagram: SQL input → Foreman → DHT → nodes A, B, C, D → Foreman → Result]

User Defined Functions
• Format conversion programs and common analysis
algorithms can be implemented in the form of User
Defined Functions (UDF) and distributed along with the
datasets.
• Drill can invoke these UDFs using their CIDs, in the same
way it locates a dataset on IPFS.

Code Structure
IPFS DAG/DHT API
IPFS Object API

Performance Evaluation
• A 6-node cluster on a cloud service provider, each node with 8GB RAM and a 4-core CPU
• IPFS running in private network mode
• Query file size: 100MB-1GB
• Queries: simple queries such as select * and select count(*)
• Response time: 2-10s
• Transactions per second: ~10

Performance Evaluation
Query completion time under different chunk sizes (left) and
parallelization width (right). Dataset 1: 67MB, Dataset 2: 190MB.

Possible Applications
• An easy MPP cluster with Minerva
• Decentralized data sharing system
• Big data analysis for other DApps running on IPFS

Problems To Be Solved
• Performance
  • DHT operations take too much time, especially on the Internet.
  • IPFS limits blocks to 4MB at most, resulting in an enormous number of blocks for huge datasets.
• Write operations are incomplete
  • The last step, reassembling the partial CIDs, is not yet implemented.
• Stability
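
To put the block-count concern in numbers, here is a small Python calculation. The 4MB cap is the figure quoted in the slide; the 256 KiB chunk size is an assumption based on the commonly used default IPFS chunker, and `block_count` is a hypothetical helper.

```python
def block_count(dataset_bytes, block_bytes):
    """Number of blocks needed to hold a dataset (integer ceiling division)."""
    return -(-dataset_bytes // block_bytes)

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# A 1 GiB dataset at the 4MB cap quoted in the slide.
at_cap = block_count(1 * GIB, 4 * MIB)

# The same dataset at an assumed default chunk size of 256 KiB.
at_default = block_count(1 * GIB, 256 * KIB)

print(at_cap, at_default)  # prints "256 4096"
```

Since each block may require its own DHT provider lookup, the block count feeds directly into the DHT latency problem listed above.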

THANK YOU FOR YOUR TIME!
GitHub: github.com/bdchain/Minerva
