This document discusses integrating Alluxio with Dask for processing large mass spectrometry imaging data. Alluxio is used as a distributed caching layer via its FUSE POSIX API to provide standardized access to datasets from Dask. This allows Dask to process data in parallel across compute nodes without needing to load full datasets into memory. Initial results found a 10x speedup when reading cached data from Alluxio versus directly from S3 storage each time.
3. Outline
1. Aspect Analytics
2. Use case: Mass Spectrometry Imaging
3. Dask
4. Alluxio
5. Data access via FUSE POSIX API
4. Aspect Analytics
A brief overview of Aspect Analytics.
more info at https://aspect-analytics.com/
5. Aspect Analytics
Software company dedicated to Mass Spectrometry Imaging bioinformatics
We build software tools to support clients’ workflows (off-the-shelf and custom)
Leverage the full potential of MSI data in high-throughput settings
Beyond bioinformatics: data analysis embedded in integrated platform solution
7. Mass spectrometry
● Measures the abundance of molecular weights in a sample.
● Output is a mass spectrum:
○ Histogram of molecular weights in sample.
more info at https://aspect-analytics.com/media/blog/2020-05-30-introduction-to-mass-spectrometry-data-analysis/
11. MSI data structure: 3D tensor
500 x 500 pixels
100,000 to 1,000,000 mass bins
⇒ 100GB - 1TB per data set
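The size estimate above can be reproduced with a short calculation. This is a sketch that assumes a dense cube with one 4-byte (float32) intensity value per pixel and mass bin; the slides do not state the storage format, so the value width is an assumption.

```python
# Back-of-the-envelope size of a dense MSI data cube.
# Assumption (not from the slides): one float32 (4 bytes) per value.
BYTES_PER_VALUE = 4


def cube_size_bytes(rows, cols, mass_bins, bytes_per_value=BYTES_PER_VALUE):
    """Return the size in bytes of a dense rows x cols x mass_bins cube."""
    return rows * cols * mass_bins * bytes_per_value


small = cube_size_bytes(500, 500, 100_000)
large = cube_size_bytes(500, 500, 1_000_000)

print(f"{small / 1e9:.0f} GB")   # 100 GB
print(f"{large / 1e12:.0f} TB")  # 1 TB
```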
12. Use case
Unsupervised analysis of mass spectral images to help with biomarker discovery.
⇒ new diagnostic tests
Illustration from "Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry", Nico Verbeeck, Richard M. Caprioli, Raf Van de Plas, 2019.
13. Use case
more info at https://aspect-analytics.com/media/blog/2020-05-31-introduction-to-msi-data-analysis/
● Spatial localisation of biomolecules
● Region of interest analysis (images shown)
● Clinical diagnostics
14. Data challenges
• Interactively explore data
• Slice and subset data without loading the full data cube into memory.
• Distributed machine learning
• Find patterns and extract features from multiple large data cubes.
• Process huge data arrays
• Parallel processing
• Out-of-core
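The slicing and out-of-core requirements above map onto chunked Dask arrays. The following is a minimal sketch on a small synthetic cube; the shape, chunk sizes, and region of interest are illustrative assumptions, not values from the talk.

```python
import dask
import dask.array as da

# Small synthetic stand-in for an MSI cube: (rows, cols, mass bins).
# Each chunk is an independent block that can be processed in parallel.
cube = da.random.random((50, 50, 2_000), chunks=(25, 25, 500))

# Slicing is lazy: these lines only build a task graph.
ion_image = cube[:, :, 1_000]                           # one mass bin as a 50 x 50 image
roi_spectrum = cube[10:20, 10:20, :].mean(axis=(0, 1))  # mean spectrum over a region

# Only the chunks that overlap the requested slices are materialized.
image, spectrum = dask.compute(ion_image, roi_spectrum)
print(image.shape, spectrum.shape)
```

For a real data cube the chunk shape would be chosen so that a single chunk fits comfortably in worker memory.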
16. Why Dask
Dask is like Apache Spark in Python with support for distributed data arrays.
• Parallel processing of data array chunks.
• Integration with Python machine learning ecosystem.
• Integration with our existing Python algorithms.
more info at https://docs.dask.org/en/latest/spark.html
17. Why Dask
• Delayed compute that can be dynamically scheduled.
• Diagnostics dashboard.
more info at https://docs.dask.org/. Figure from http://matthewrocklin.com/slides/dask-scipy-2016.html
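Delayed compute can be illustrated with `dask.delayed`. In this sketch the functions `load_chunk`, `chunk_sum`, and `combine` are hypothetical placeholders for real loading and analysis steps; nothing runs until `compute()` is called, at which point the scheduler decides the execution order.

```python
from dask import delayed


@delayed
def load_chunk(i):
    # Hypothetical loader: returns ten consecutive integers per chunk.
    return list(range(i * 10, (i + 1) * 10))


@delayed
def chunk_sum(values):
    return sum(values)


@delayed
def combine(partials):
    return sum(partials)


# Building the graph is cheap; no chunk has been loaded yet.
partials = [chunk_sum(load_chunk(i)) for i in range(4)]
total = combine(partials)

# The scheduler now runs independent chunks in parallel where possible.
result = total.compute()
print(result)  # 780
```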
19. Why Alluxio
Data access layer
• Not Python-specific
• Our user-facing platform application is built in Clojure.
• Standardized access via FUSE POSIX API
• More on this in later slides
• Distributed and Tiered Caching layer
• Download once, use multiple times
• Share between different processes and services
• Centralized access to data
• Analytics code does not need to deal with different storage implementations.
• Avoid keeping object store credentials on client services.
20. Why Alluxio
Deployable in various scenarios
• Deployable in cloud as well as on-prem
• Through Docker & Kubernetes
• Long-lived vs short-lived deployments
• Long-running Alluxio server for continuous data access (e.g. to provide data for a notebook server)
• Short-term Alluxio deployment for ad-hoc computations (e.g. to run a set of analyses on a new dataset)
• Integrate in automated testing.
25. Dask & Alluxio
● Dask accesses datasets via Alluxio FUSE
● Each Alluxio worker only loads the data that is required locally
● Each Alluxio worker keeps that data in its cache
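Because the FUSE mount looks like a normal directory, a file on it can be wrapped as a lazy Dask array. The sketch below assumes a raw float32 cube stored in row-major order, which is a hypothetical layout; a temporary directory stands in for the Alluxio FUSE mount point, whose real path depends on the deployment.

```python
import os
import tempfile

import dask.array as da
import numpy as np

# Stand-in for the Alluxio FUSE mount point (the real path is deployment specific).
mount_dir = tempfile.mkdtemp()
path = os.path.join(mount_dir, "cube.f32")

# Write a tiny example cube: (rows, cols, mass bins), hypothetical raw layout.
shape = (20, 20, 100)
np.arange(np.prod(shape), dtype=np.float32).reshape(shape).tofile(path)

# memmap does not read the file up front; bytes are fetched when chunks are accessed.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
cube = da.from_array(mm, chunks=(10, 10, 100))

# Total intensity per pixel, computed chunk by chunk.
tic_image = cube.sum(axis=2).compute()
print(tic_image.shape)
```

This runs with the default local scheduler. On a distributed cluster every Dask worker would need the same mount path, and per-chunk loader functions may be a better fit than a shared memmap.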
27. FUSE
Filesystem in Userspace (FUSE)
Filesystem:
● Expose virtual files
● POSIX filesystem API
Userspace
● Refers to all code that is run by the user (outside the operating system's kernel).
FUSE makes it possible to create filesystems without modifying OS kernel code.
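In practice, a POSIX filesystem API means that client code needs nothing beyond ordinary file operations. The sketch below uses a temporary directory as a stand-in for a FUSE mount point and a hypothetical file name; the same calls would apply to files exposed through a real mount.

```python
import os
import tempfile

# Stand-in for a FUSE mount point; the file is created here only for the demo.
mount_dir = tempfile.mkdtemp()
path = os.path.join(mount_dir, "spectrum.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)))

# Standard directory listing and metadata calls.
print(os.listdir(mount_dir))
size = os.stat(path).st_size

# Random access: seek to an offset and read only a small window.
with open(path, "rb") as f:
    f.seek(100)
    window = f.read(16)

print(size, len(window))
```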
30. Share Alluxio-FUSE via Bind-Mount
● Each service has its own Docker environment.
● The FUSE filesystem connects Alluxio with the analytics platform via a bind-mount.
32. Some anecdotal results
• We have custom Alluxio containers to reduce image size.
• It takes 30s to 1 min to spin up the Alluxio services with FUSE.
• Dask reading from S3 through FUSE (without caching):
• 30% slower compared to the native Dask S3 integration.
• Reading large files with Dask from local Alluxio cache:
• 10x speedup compared to reading from S3 each time.
• Enabling FUSE kernel caching gave another 3x speedup when reading.
more info at https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html#:~:text=Tuning%20mount%20options
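The slides do not describe how these numbers were measured. One rough way to compare a first (cold) read with a repeated (cached) read is to time sequential reads of the same file, as in the sketch below. The helper `read_throughput_mb_s` is hypothetical, and a local temporary file stands in for a file on the FUSE mount, so the numbers it prints say nothing about Alluxio itself.

```python
import os
import tempfile
import time


def read_throughput_mb_s(path, block_size=4 * 1024 * 1024):
    """Read a file sequentially and return the observed throughput in MB/s."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return total / 1e6 / elapsed


# Stand-in data file; on a real setup this would be a path under the FUSE mount.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(8 * 1024 * 1024))

first = read_throughput_mb_s(path)   # may hit storage
second = read_throughput_mb_s(path)  # likely served from a cache
print(f"first read: {first:.0f} MB/s, second read: {second:.0f} MB/s")
```

A fair comparison would repeat each measurement several times and clear or bypass caches between cold runs.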