Ray and Its Growing Ecosystem

Richard Liaw, Anyscale, @richliaw
Travis Addair, Uber, @TravisAddair

Overview of talk
● Overview of Ray
● Ray’s ecosystem integrations
● Uber Open Source + Ray

What is Ray?
Mission: Simplify distributed computing.

Relation to Ecosystem
● Similar in nature: Dask, Celery, Erlang, Akka, gRPC
● Runs on top of AWS, GCP, Azure, Kubernetes, your laptop, …
● Compatible with Python ecosystem:
○ NumPy
○ Pandas
○ TensorFlow
○ PyTorch
○ spaCy, …

Distributed Systems are not New
● HPC (1980s)
● Web (1990s)
● Big Data (2000s)
● Deep Learning (2010s)

No Longer Isolated Workloads
● HPC
● Deep Learning
● Microservices
● Big Data
[Figure: animation builds rearranging the four workload areas; the final build adds a "?"]

API
Functions -> Tasks
import numpy as np
def read_array(file):
    # read array "a" from "file" (np.load assumes a .npy file)
    a = np.load(file)
    return a
def add(a, b):
    return np.add(a, b)

API
Functions -> Tasks
import numpy as np
import ray
@ray.remote
def read_array(file):
    # read array "a" from "file" (np.load assumes a .npy file)
    a = np.load(file)
    return a
@ray.remote
def add(a, b):
    return np.add(a, b)
id1 = read_array.remote("/input1")
id2 = read_array.remote("/input2")  # second input path is assumed
id3 = add.remote(id1, id2)
ray.get(id3)
[Figure: task graph. read_array -> id1, read_array -> id2, add(id1, id2) -> id3]

API
Classes -> Actors
import ray
@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value
c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
ray.get([id4, id5])

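API
Functions -> Tasks
@ray.remote
def read_array(file):
    # read array "a" from "file" (np.load assumes a .npy file)
    a = np.load(file)
    return a
@ray.remote(num_gpus=1)
def add(a, b):
    return np.add(a, b)
Classes -> Actors
@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value
c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
ray.get([id4, id5])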

Ecosystem
● Native Libraries and Third Party Libraries
● ML Training Libraries
● Auto ML Libraries
● Cloud ML Platforms
● Ray is becoming the go-to framework for scaling libraries

Ray at Uber
Travis Addair, Uber

Ecosystem - Horovod
- Fast and easy distributed training for any framework
- Run Horovod on Ray
- Any cloud provider or k8s with Ray cluster launcher
- Hyperparameter search integration for Horovod
- Benefits of ecosystem (data processing, serving)
- Integration released in Horovod 0.20
- ~400 lines of code

Ecosystem - Ludwig
- Code-free deep learning (Auto ML)
- Given inputs and outputs, Ludwig builds the right model for any task

Ludwig: scalability challenges
- Single worker for preprocessing
- Whole dataset must fit in memory (Pandas)
- Hyperparameter Optimization
  - Optimize over preprocessing (feature engineering)
  - Optimize over model params
  - Optimize over model architecture (encoders / decoders)

Ludwig: conventional ML workflow

Challenges with ML workflows
- Rewrite major sections of the code:
  - Pandas -> Spark transformers
  - Maintain two distinct code paths
- Each step is heavyweight and allocates heterogeneous infra
  - Airflow
- But what about hyperparameter optimization?
  - Dynamic process
  - Difficult to model using static workflow definitions

Examining the Ray ecosystem
Dask
- Drop-in replacement for Pandas
- Pure-Python data processing (low overhead, easy debugging)
- GPU acceleration with RAPIDS / cuDF
Horovod
- Framework-agnostic distributed training (TensorFlow, PyTorch, MXNet)
- Supports fault tolerance and auto-scaling
- Flexible: no restrictions on the structure of the training code
Ray
- Brings everything together as a single infra layer
- Provides scalable hyperparameter optimization and serving natively

Looking forward: minimizing I/O

Check us out on GitHub
Horovod:
- https://horovod.ai/
- https://github.com/horovod/horovod
Ludwig:
- https://ludwig.ai/
- https://github.com/uber/ludwig

Getting Involved
Things you can do now:
● pip install ray
● Join Ray Slack: https://rb.gy/fntume
● Browse docs: docs.ray.io
