Alluxio-FUSE as a data access layer for Dask

Alluxio Day III
Exploring Alluxio & Dask integration
2020-04-27

whoami
Peter Roelants
Machine Learning Engineering Lead
@Aspect Analytics
@PeterRoelants

Outline
1. Aspect Analytics
2. Use case: Mass Spectrometry Imaging
3. Dask
4. Alluxio
5. Data access via FUSE POSIX API

Aspect Analytics
A brief overview of Aspect Analytics.
more info at https://aspect-analytics.com/

Software company dedicated to Mass Spectrometry Imaging bioinformatics
We build software tools to support clients’ workflows (off-the-shelf and custom)
Leverage the full potential of MSI data in high-throughput settings
Beyond bioinformatics: data analysis embedded in integrated platform solution

Mass Spectrometry Imaging
What data are we working with?

Mass spectrometry
● Measures the abundance of molecular weights in a sample.
● Output is a mass spectrum:
○ Histogram of molecular weights in sample.
more info at https://aspect-analytics.com/media/blog/2020-05-30-introduction-to-mass-spectrometry-data-analysis/

Mass spectrometry imaging
Measure spatial distribution of molecular masses over a slice of tissue.

Mass spectrometry imaging workflow
Overlay tissue slice with virtual grid of "pixels".

Mass spectrometry imaging workflow
Measure mass spectrum at each "pixel".

MSI data structure: 3D tensor
500 x 500 pixels
100,000 to 1,000,000 mass bins
⇒ 100GB - 1TB per data set
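
The size estimate can be checked with simple arithmetic. This is a minimal sketch that assumes a dense cube with 4 bytes per intensity value (for example float32); the slides do not state the data type, and the function name is only illustrative.

```python
def cube_size_bytes(rows, cols, mass_bins, bytes_per_value=4):
    """Size in bytes of a dense rows x cols x mass_bins data cube."""
    return rows * cols * mass_bins * bytes_per_value


GB = 10**9  # decimal gigabyte

# 4 bytes per value is an assumption, not a figure from the slides.
small = cube_size_bytes(500, 500, 100_000)
large = cube_size_bytes(500, 500, 1_000_000)

print(small / GB, large / GB)  # 100.0 1000.0
```

With these assumptions the two ends of the range come out at 100 GB and 1 TB.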

Use case
Unsupervised analysis of mass spectral images to help with biomarker discovery.
⇒ new diagnostic tests
Illustration from "Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry", Nico Verbeeck, Richard M. Caprioli, Raf Van de Plas, 2019.

Use case
● Spatial localisation of biomolecules
● Region of interest analysis (images shown)
● Clinical diagnostics
more info at https://aspect-analytics.com/media/blog/2020-05-31-introduction-to-msi-data-analysis/

Data challenges
• Interactively explore data
  • Slice and subset data without loading full data-cube into memory.
• Distributed machine learning
  • Find patterns and extract features from multiple large data-cubes.
• Process huge data arrays
  • Parallel processing
  • Out-of-core
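
Slicing without loading the full data-cube can be sketched with a chunked Dask array. The cube below is a scaled-down, zero-filled stand-in (shape, chunk sizes and indices are hypothetical); a real cube would be backed by files rather than `da.zeros`.

```python
import dask.array as da

# Hypothetical scaled-down cube: 50 x 50 pixels, 2,000 mass bins.
# da.zeros is lazy: no chunk is allocated until compute() is called.
cube = da.zeros((50, 50, 2_000), chunks=(25, 25, 500), dtype="float32")

# One ion image: every pixel at a single mass bin.
ion_image = cube[:, :, 1234]

# One spectrum: every mass bin at a single pixel.
spectrum = cube[10, 20, :]

# Only the chunks that overlap each slice are materialised.
ion_image_np = ion_image.compute()
spectrum_np = spectrum.compute()

print(ion_image_np.shape, spectrum_np.shape)  # (50, 50) (2000,)
```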

Why Dask
Dask is like Apache Spark in Python with support for distributed data arrays.
• Parallel processing of data array chunks.
• Integration with Python machine learning ecosystem.
• Integration with our existing Python algorithms.
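
Parallel processing of array chunks could look like the sketch below. The data is random and small, and the two reductions (a mean spectrum and a per-pixel total intensity image) are illustrative choices, not taken from the talk.

```python
import dask
import dask.array as da
import numpy as np

# Small random stand-in for an MSI cube (pixels x pixels x mass bins).
rng = np.random.default_rng(0)
data = rng.random((40, 40, 1_000), dtype=np.float32)

# Split the cube into chunks that Dask can process in parallel.
cube = da.from_array(data, chunks=(20, 20, 250))

# Both reductions are computed chunk by chunk and then combined.
mean_spectrum = cube.mean(axis=(0, 1))  # average spectrum over all pixels
total_image = cube.sum(axis=2)          # summed intensity per pixel

# Computing both together lets Dask share the chunk reads between them.
mean_spectrum_np, total_image_np = dask.compute(mean_spectrum, total_image)

print(mean_spectrum_np.shape, total_image_np.shape)  # (1000,) (40, 40)
```

The results are plain NumPy arrays, so they can be passed on to existing Python code unchanged.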
https://docs.dask.org/en/latest/spark.html

Why Dask
• Delayed compute that can be dynamically scheduled.
• Diagnostics dashboard.
https://docs.dask.org/. Figure from http://matthewrocklin.com/slides/dask-scipy-2016.html
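
Delayed compute can be illustrated with `dask.delayed`. The three functions below are hypothetical stand-ins (the loader fabricates a tiny spectrum instead of reading from storage); the point is only that building the graph runs nothing until `compute()` is called.

```python
from dask import delayed


@delayed
def load_spectrum(pixel_id):
    # Stand-in for reading one spectrum from storage (hypothetical).
    return [float(pixel_id + k) for k in range(4)]


@delayed
def total_intensity(spectrum):
    return sum(spectrum)


@delayed
def brightest(totals):
    return max(totals)


# Building the task graph: nothing is executed yet.
totals = [total_intensity(load_spectrum(i)) for i in range(3)]
result = brightest(totals)

# The scheduler now runs the graph; independent tasks may run in parallel.
value = result.compute()
print(value)  # 14.0
```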

Why Alluxio
Data access layer
• Not Python-specific
  • Our platform user application is built on Clojure.
• Standardized access via FUSE POSIX API
  • More on this in later slides
• Distributed and tiered caching layer
  • Download once, use multiple times
  • Share between different processes and services
• Centralized access to data
  • Analytics code does not need to deal with different storage implementations.
  • Avoid keeping object store credentials on client services.

Why Alluxio
Deployable in various scenarios
• Deployable in cloud as well as on-prem
  • Through Docker & Kubernetes
• Long-lived vs short-lived deployments
  • Long-running Alluxio server for continuous data access (e.g. to provide data for notebook server)
  • Short-term Alluxio deployment for ad-hoc computations (e.g. to run a set of analyses on a new dataset)
• Integrate in automated testing.

Dask & Alluxio
Using Alluxio as a data access layer for Dask.

Dask & Alluxio
● Dataset access to Dask is provided via Alluxio FUSE
● Alluxio worker only loads data that is required locally
● Alluxio worker keeps data in cache
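
From the Dask side, a dataset behind a FUSE mount is just a file path. The sketch below uses a temporary directory as a stand-in for the Alluxio-FUSE mount point and a small `.npy` file as the dataset; the file name, format and chunk sizes are assumptions, and it presumes the mount supports memory-mapped reads.

```python
import os
import tempfile

import dask.array as da
import numpy as np

# Stand-in for the Alluxio-FUSE mount point (hypothetical; the real path
# depends on where Alluxio-FUSE is mounted in the deployment).
mount_dir = tempfile.mkdtemp()
path = os.path.join(mount_dir, "cube.npy")

# Write a small example cube so the sketch is runnable.
values = np.arange(20 * 20 * 100, dtype=np.float32).reshape(20, 20, 100)
np.save(path, values)

# Memory-map the file: nothing is read until data is actually accessed.
mapped = np.load(path, mmap_mode="r")
cube = da.from_array(mapped, chunks=(10, 10, 50))

# Reading one spectrum only touches the chunks that contain it.
spectrum = cube[3, 4, :].compute()

print(spectrum[0], spectrum[-1])  # 6400.0 6499.0
```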

Alluxio FUSE
Data access via a POSIX API.

FUSE
Filesystem in Userspace (FUSE)
Filesystem:
● Expose virtual files
● POSIX filesystem API
Userspace
● Refers to all code that is run by the user (outside the operating system's kernel).
FUSE allows creating filesystems without needing to modify OS kernel code.
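
Because a FUSE filesystem exposes the POSIX file API, ordinary file calls work on it without any storage-specific client. A minimal sketch, using a temporary directory as a stand-in for a FUSE mount point (the file name and helper function are hypothetical):

```python
import os
import tempfile


def read_range(path, offset, length):
    """Read `length` bytes starting at `offset` with plain file calls."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for a directory served by a FUSE filesystem (hypothetical).
with tempfile.TemporaryDirectory() as mount_dir:
    path = os.path.join(mount_dir, "spectra.bin")
    with open(path, "wb") as f:
        f.write(bytes(range(256)))

    listing = os.listdir(mount_dir)   # directory listing
    size = os.stat(path).st_size      # file metadata
    chunk = read_range(path, 16, 4)   # open, seek, read

print(listing, size, chunk)  # ['spectra.bin'] 256 b'\x10\x11\x12\x13'
```

The same calls would apply unchanged to a path under a real mount, which is what makes the access layer language-independent.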

Share Alluxio-FUSE via Bind-Mount
● Each service has its own Docker environment.
● FUSE Filesystem connects Alluxio with Analytics platform via a bind-mount.

Some anecdotal results
• We have custom Alluxio containers to reduce image size.
• It takes 30s to 1 min to spin up the Alluxio services with FUSE.
• Dask reading from S3 through FUSE (without caching):
  • 30% slower compared to the native Dask S3 integration.
• Reading large files with Dask from local Alluxio cache:
  • 10x speedup compared to reading from S3 each time.
  • Enabling FUSE kernel caching gave another 3x speedup when reading.
https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html#:~:text=Tuning%20mount%20options
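
Read speedups of this kind can be measured with a small timing harness. The sketch below is not the benchmark behind the numbers above; it times two sequential passes over a temporary file that stands in for a large file under the Alluxio-FUSE mount (file name, size and block size are arbitrary assumptions).

```python
import os
import tempfile
import time


def timed_read(path, block_size=1 << 20):
    """Read a file sequentially; return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    return total, time.perf_counter() - start


# Stand-in for a large file under the FUSE mount (hypothetical).
with tempfile.TemporaryDirectory() as mount_dir:
    path = os.path.join(mount_dir, "cube.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(8 * 1024 * 1024))

    first_bytes, first_s = timed_read(path)    # first pass
    second_bytes, second_s = timed_read(path)  # repeat pass, may hit caches

print(f"first: {first_s:.4f}s, second: {second_s:.4f}s")
```

On a real mount, the first pass would be run against uncached data and the second after the data is cached, with each configuration repeated several times.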
