Alluxio Global Online Meetup
March 30, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Vasista Polali, boolean UG
Bin Fan, Alluxio
Alluxio is an open-source data orchestration platform that can be deployed on multiple platforms. However, integrating Alluxio into an existing data architecture while adhering to the minimally required DevOps principles and meeting organizational standards takes considerable thought and experience.
This presentation covers the best practices and techniques we used to build a cluster with open-source Alluxio on AWS EKS for one of our clients, making it scalable, reliable, and secure by adopting Kubernetes RBAC.
Our speaker Vasista Polali will show you how to:
- Bootstrap an EKS cluster in AWS with Terraform.
- Deploy open-source Alluxio in a namespace with persistence in AWS EFS.
- Scale the Alluxio worker nodes up and down as DaemonSets by scaling the EKS nodes with Terraform.
- Access data with an S3 mount.
- Control access to Alluxio with Kubernetes port-forwarding, the "setfacl" functionality, and Kubernetes service accounts.
- Re-use the data/metadata in the persistence layer on a new cluster. (We need to do a bit more experimenting; this will be included based on the outcome.)
2. About me
Vasista Polali
Founder @ boolean UG
Berlin, Germany
http://booleancomputing.com
Email: vasista.polali@booleancomputing.com
3. Agenda
• Bootstrap an EKS cluster in AWS with Terraform.
• Deploy open-source Alluxio in a namespace with persistence in AWS EFS.
• Scale the Alluxio worker nodes up and down as DaemonSets by scaling the EKS nodes with Terraform.
• Access data with an S3 mount.
• Control access to Alluxio with the "setfacl" functionality and Kubernetes service accounts.
• Re-use the metadata in the persistence layer on a new cluster.
4. Use case:
Data security:
• There was a need to share data stored in different storage systems, such as AWS S3 and Azure Blob Storage, as well as in different buckets in the same object store controlled by different teams.
• This caused a lot of data movement and time delays while getting approvals from data security around making multiple copies of data, access control, and data retention and deletion owing to GDPR, not to mention the additional ETL development effort and maintenance.
Data sharing and intermediate data persistence:
• There was a need for data sharing between the various Spark jobs in ETL and iterative machine learning workflows, where intermediate data had to be written back to the storage systems and re-ingested by the subsequent steps, causing higher processing time, increased data transfer, and increased costs.
Fault tolerance:
• Crashes of long-running jobs cause loss of in-memory data, and persisting intermediate results to storage increases processing time, causing pain.
5. MVP
The goal was to build a cloud-native data sharing system by taking open-source Alluxio and wrapping it in a set of processes that adhere to the minimally required enterprise-wide standards and DevOps principles: security, automation, infrastructure as code, continuous improvement and deployment, and short lead times.
In scope:
• Everything that open-source Alluxio provides out of the box.
• Should be cloud native and deployable on AWS EKS.
Out of scope:
• No forking, customization, or maintenance of open-source code.
7. AWS EKS - quick look
Amazon Elastic Kubernetes Service (Amazon EKS) gives you the flexibility to start, run, and scale Kubernetes applications in the AWS cloud or on-premises.
8. Environment:
• AWS EKS with an assigned minimum size of 5 nodes.
• Infrastructure as code provisioned by Terraform.
• A workspace in Terraform Cloud to plan and apply the configuration and maintain remote state storage.
• An Auto Scaling group to provision and maintain EC2 instance capacity.
• CI/CD pipeline with GitHub Actions.
• Official Alluxio Docker images.
• A Kubernetes Persistent Volume mounted on AWS EFS.
[Diagram: bootstrap AWS EKS with Terraform; Terraform Cloud maintains and applies the configuration; CI/CD with GitHub Actions]
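The bootstrap step can be sketched in Terraform roughly as follows. This is an illustrative fragment only: the community terraform-aws-modules/eks module, the organization and workspace names, and the instance sizes are all assumptions, not the client's actual configuration.

```hcl
# Illustrative sketch only: remote state in Terraform Cloud plus an EKS
# cluster whose node group autoscales; every name here is a placeholder.
terraform {
  backend "remote" {
    organization = "example-org"
    workspaces {
      name = "alluxio-eks"
    }
  }
}

module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "alluxio-eks"
  subnets      = module.vpc.private_subnets
  vpc_id       = module.vpc.vpc_id

  worker_groups = [
    {
      instance_type        = "m5.xlarge"
      asg_min_size         = 5    # the assigned minimum of 5 nodes
      asg_desired_capacity = 5
      asg_max_size         = 10   # headroom for scaling the workers up
    }
  ]
}
```

A `terraform plan` and `apply` run from the Terraform Cloud workspace then creates the cluster and keeps the state remote.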
9. Deploy Open Source Alluxio on Kubernetes
• Deploy alluxio-master as a StatefulSet, which, when scaling the master pods, provides guarantees about the ordering and uniqueness of those pods.
• Deploy worker pods as a DaemonSet, which guarantees that one worker is running on each node.
• Scalability: as nodes are added to the cluster, worker pods are added to them; as nodes are removed from the cluster, those pods are garbage collected.
• Set the Alluxio configuration properties through a Kubernetes ConfigMap.
• Deploy Persistent Volume Claims with AWS EFS as the persistent volume, provisioned by a StorageClass.
• None of the services are exposed to the outside world; they remain accessible only on the internal network.
[Diagram: three nodes; alluxio-master-0 and alluxio-master-1 run as a StatefulSet, with one Alluxio worker per node as a DaemonSet; a ConfigMap and a Persistent Volume are shared across the cluster]
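The EFS persistence piece can be sketched as Kubernetes manifests. The provisioner assumes the AWS EFS CSI driver is installed; the names, namespace, and size are illustrative placeholders, not the actual deployment.

```yaml
# Illustrative sketch only: an EFS-backed StorageClass and a claim for the
# Alluxio master journal; names, namespace, and size are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alluxio-journal
  namespace: alluxio
spec:
  accessModes: ["ReadWriteMany"]   # EFS supports many-node read/write
  storageClassName: efs-sc
  resources:
    requests:
      storage: 10Gi
```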
10. The idea was to provide a unified namespace for accessing and processing data
Mount an S3 bucket to an Alluxio fs directory:
• Create a CI/CD action to run alluxio fs mount:
alluxio fs mount --option aws.accessKeyId=<aws key> --option aws.secretKey=<aws secret> <alluxio dir> s3://<s3 bucket>
• Create a service user in IAM with access to the S3 bucket.
• Store the service user's security credentials as secrets in the CI/CD tool.
• Create a Kubernetes user and role to port-forward a pod, limited to the namespace that Alluxio is deployed in.
• Generate a kubeconfig file for the Kubernetes user.
• Generate the GitHub Actions workflow YAML to run the setup on merge to master.
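Put together, the merge-to-master setup might look like the following GitHub Actions sketch. Everything here — the job and secret names, the kubeconfig path, the mount directory, and the bucket — is a placeholder, not the actual pipeline.

```yaml
# Illustrative workflow sketch; all names, paths, and secrets are placeholders.
name: mount-s3-to-alluxio
on:
  push:
    branches: [master]
jobs:
  mount:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2   # repo containing the Alluxio binaries
      - name: Port-forward to the Alluxio master
        run: |
          kubectl --kubeconfig kubeconfig port-forward alluxio-master-0 19998:19998 &
      - name: Mount the bucket
        run: |
          ./bin/alluxio fs mount \
            --option aws.accessKeyId=${{ secrets.AWS_ACCESS_KEY_ID }} \
            --option aws.secretKey=${{ secrets.AWS_SECRET_ACCESS_KEY }} \
            /mnt/data s3://example-bucket
```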
11. Implementation:
Master branch flow:
1. Check out a branch and add the list of the directories to be mounted.
2. Create a pull request.
3. Merge to master on approval from the data owners.
4. The merge triggers the workflow.
Workflow actions:
• kubectl port-forward <alluxio-master-pod> on 19998; the Alluxio master will be available on localhost:19998.
• Check out the repo with the Alluxio binaries.
• Provide the AWS credentials of the AWS service user from secrets.
• Run alluxio fs mount.
12. Control access to data in Alluxio fs:
The idea was to implement access control for data security on the mounted data.
• Data mounted from S3, and other directories in open-source Alluxio, are accessible only by the user that created them.
• Create a CI/CD action to run the alluxio fs setfacl command:
alluxio fs setfacl -R -m user:<user>:<permissions> <dir>
• Set POSIX permissions, in the form rwx, for a user that needs access to data in an Alluxio directory.
• alluxio fs setfacl can also be used to remove permissions with the -x flag:
alluxio fs setfacl -R -x user:<user> <dir>
• Generate the GitHub Actions workflow YAML to run the setup on merge to master.
13. Implementation:
Master branch flow:
1. Check out a branch and add the list of users, directories, and corresponding permissions to be applied, or the list of users and corresponding permissions to be removed.
2. Create a pull request.
3. Merge to master on approval from the data owners.
4. The merge triggers the workflow.
Workflow actions:
• kubectl port-forward <alluxio-master-pod> on 19998; the Alluxio master will be available on localhost:19998.
• Check out the repo with the Alluxio binaries.
• Provide the AWS credentials of the AWS service user from secrets.
• Run alluxio fs setfacl to add or remove permissions.
14. The idea was to establish a mechanism to process data with Spark on Kubernetes while adhering to Alluxio fs permissions
Run an application in cluster deploy mode with Spark on Kubernetes to process the data in Alluxio fs:
• Build a Spark Docker image, "adduser" the user who needs to access the data in Alluxio, and set it as the default user with the USER instruction.
• Make sure the default user in the Spark container has the necessary access permissions for the data in Alluxio.
• Build the Spark image along with the application jar, or provide the jar as a mount and access it with the local:// scheme while executing spark-submit.
• Alternatively, place the artifact in S3 or another storage of choice and have Spark download it at runtime.
• The driver and the number of executors specified during submission are created as pods in the specified namespace.
• The user submitting the Spark application is also the user of the client process accessing Alluxio fs, and is therefore only allowed to perform operations based on their access privileges.
• Any other user in the client process without the requisite access to the data is rejected.
[Diagram: a GitHub Actions workflow builds the Spark Docker image and pushes it to ECR; spark-submit with --master pointing at the Kubernetes API creates the driver pod, which creates the executor pods; the pods pull the image from ECR and access Alluxio]
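The submission step could look roughly like the sketch below. The API server address, namespace, service account, ECR image, jar path, and Alluxio URI are all illustrative placeholders, not the actual invocation.

```shell
# Illustrative only: cluster-mode spark-submit against the Kubernetes API.
# Every name, address, and path here is a placeholder.
spark-submit \
  --master k8s://https://<EKS_API_SERVER>:443 \
  --deploy-mode cluster \
  --name etl-job \
  --conf spark.kubernetes.namespace=alluxio \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com/spark:latest \
  --conf spark.executor.instances=3 \
  local:///opt/spark/jars/etl-job.jar \
  alluxio://alluxio-master-0:19998/mnt/data
```

The jar is addressed with the local:// scheme because it is baked into (or mounted in) the image, as described in the bullets above.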
15. The idea was to make the setup scalable, resizing the Alluxio cluster on demand to save costs.
• Alluxio workers run as a DaemonSet in the EKS cluster. For every additional node added to the cluster, Kubernetes spins up an Alluxio worker as a DaemonSet pod on that node.
• Scaling down works the same way.
• This allows cluster capacity to be moved up and down on demand in an automated way.
• Using DevOps processes to bootstrap and control the EKS and Alluxio cluster size makes the system flexible, resulting in optimal use of resources and reduced costs.
• The /journal folder of the Alluxio master is persisted in EFS through a Persistent Volume Claim. Keeping the EFS file system up and running allows us to spin up Alluxio clusters on demand with all the mount points and user-permission metadata intact, and to tear them down after the data processing has finished, saving time and effort and reducing costs. This is most beneficial when we need big clusters to run computations on huge batches of data running into terabytes.
• Time-bound operations with scheduled GitHub Actions workflows.
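The time-bound operations mentioned above could be driven by a scheduled workflow trigger, sketched below; the cron expressions are illustrative, and GitHub Actions cron schedules run in UTC.

```yaml
# Illustrative schedule trigger for scale-up/scale-down workflows;
# the times are placeholders (cron runs in UTC).
on:
  schedule:
    - cron: "0 6 * * 1-5"   # scale up on weekday mornings
    - cron: "0 20 * * 1-5"  # scale down in the evenings
```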
16. Other options
• Adapting this process for centralized operations and data security teams.
• Opening up the EKS cluster to spark-submit jobs in client mode, where the driver runs on a remote system, and controlling user access to data in Alluxio.