This document summarizes a serverless Clojure and machine learning application for prototyping. It describes the team building the application, the serverless AWS infrastructure using Terraform, CI/CD pipelines with GitHub Actions, the Clojure applications including a dashboard SPA and event processor, and natural language processing using Haystack and SageMaker batch transforms. The application allows users to upload documents and questionnaires, select questions, and trigger inferences to find answers from the documents using ML models.
4. The Team
Toni: Fullstacker with an ML angle
● The main dev in the project
Clojurians, Koodiklinikka: @tvaisanen
Kimmo: Long-time Clojure enthusiast, likes to dabble in data projects
● The mentor in the project
https://twitter.com/KimmoKoskinen & https://github.com/viesti
6. Serverless Infrastructure
● AWS Organizations
○ AWS Root account and AWS account per client
■ Separate infra for each client
○ Terraform modules for logical parts of the infra
● DynamoDB used as the database; ML runs on demand with SageMaker
○ Serverless infra requires a serverless database
○ GPUs on demand
● API Gateway
○ Frontend uses AWS services directly via API Gateway
● Terraform defining everything
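As a sketch, the per-client layout described above might look like the following Terraform fragment. The module and variable names are invented for illustration, not taken from the project; the DynamoDB on-demand billing mode is what makes the database "serverless":

```hcl
# Hypothetical sketch: one Terraform module per logical part of the infra,
# instantiated in each client's AWS account (names are illustrative).
module "dashboard" {
  source = "./modules/dashboard"
  client = var.client_name
}

module "event_processor" {
  source = "./modules/event-processor"
  client = var.client_name
}

# Serverless database: a DynamoDB table with on-demand capacity.
resource "aws_dynamodb_table" "app" {
  name         = "${var.client_name}-app"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "pk"

  attribute {
    name = "pk"
    type = "S"
  }
}
```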
8. CI / CD
● GitHub Actions
○ Build & Test
○ Deploys the development environment
■ GitHub Actions assumes an AWS IAM Role
● Production deploy
○ Publish build
■ Triggered by new tag push
■ Publish versioned release artifacts to S3
○ Deployment
■ Manually triggered workflow
■ Artifacts are downloaded from S3 and deployed with Terraform
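The tag-triggered publish step could be sketched as a GitHub Actions workflow along these lines. The workflow name, role ARN, bucket, and build commands are placeholders; the `id-token: write` permission is what lets the job assume an AWS IAM role via OIDC:

```yaml
# Illustrative sketch of the publish workflow (all names are placeholders).
name: publish
on:
  push:
    tags: ['v*']
permissions:
  id-token: write   # lets the job assume an AWS IAM role via OIDC
  contents: read
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/deploy-role
          aws-region: eu-west-1
      - run: bb build
      - run: aws s3 cp target/app.jar "s3://release-bucket/app-${GITHUB_REF_NAME}.jar"
```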
9. Clojure Applications
● Dashboard
○ ClojureScript Reagent Single Page Application
● Dashboard Backend
○ Node/ClojureScript Lambda
■ Presigns S3 URLs for upload and download
■ Node.js for faster cold starts
● Event Processor
○ JVM/Clojure Lambda
○ Processes events from services such as:
■ SES, SQS, S3, SageMaker, etc.
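The real Event Processor is a JVM/Clojure Lambda; as a rough illustration of the dispatch idea, here is a toy router in Python keyed on the `eventSource` field that AWS Lambda event records carry. The handler bodies are made up:

```python
# Toy sketch of the Event Processor's dispatch idea (the real Lambda is
# JVM/Clojure and handles SES, SQS, S3 and SageMaker events).

def event_source(record):
    # Lambda event records identify their origin in "eventSource", e.g. "aws:s3".
    return record.get("eventSource", "unknown")

def handle(event):
    """Route each record in a Lambda event to a handler keyed by its source."""
    handlers = {
        "aws:s3": lambda r: f"s3 object: {r['s3']['object']['key']}",
        "aws:sqs": lambda r: f"sqs message: {r['body']}",
    }
    results = []
    for record in event.get("Records", []):
        handler = handlers.get(event_source(record))
        if handler:
            results.append(handler(record))
    return results
```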
11. Clojure Tooling
● ClojureScript
○ Shadow-CLJS
■ Builds the Dashboard SPA and the Lambda that runs on Node.js
● JVM/Clojure
○ deps.edn for project configuration and
○ depstar for building the uberjar
● Babashka
○ Build, test and release tasks
○ bb.edn files stay small; task code is required into them and shared
● Kaocha for testing
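An illustrative bb.edn along these lines, where the file stays small because the task code is required from a shared namespace (all names here are invented):

```clojure
;; Illustrative bb.edn sketch (namespace and task names are made up).
{:paths ["bb"]
 :tasks
 {:requires ([tasks.build :as build])
  build   {:doc "Build the uberjar"  :task (build/uberjar)}
  test    {:doc "Run Kaocha tests"   :task (build/kaocha)}
  release {:doc "Publish artifacts"  :task (build/release)}}}
```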
16. Natural Language Processing
● NLP tasks are based on having source material against which natural language queries are run
○ Natural language texts (policy files)
○ Question-answer pairs (FAQ items)
● The source material collection is called the document store, which keeps the data in an SQLite database
● In addition to the DB, the document store has another component (a FAISS index) that stores the vectorized representations (embeddings) of the text passages
17. Natural Language Processing
“FAISS (Facebook AI Similarity Search) is a library that allows developers to quickly search for
embeddings of multimedia documents that are similar to each other. It solves limitations of
traditional query search engines that are optimized for hash-based searches, and provides more
scalable similarity search functions.”
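A toy sketch of the document-store idea described above, assuming SQLite holds the passages while a vector index holds their embeddings. A plain Python list with cosine similarity stands in for FAISS here, and the embeddings in the test are made up:

```python
import math
import sqlite3

# Toy document store: texts live in SQLite; a vector index (FAISS in the real
# system, a plain list here) holds their embeddings for similarity search.

class DocumentStore:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE passages (id INTEGER PRIMARY KEY, text TEXT)")
        self.index = []  # (passage_id, embedding) pairs; FAISS stand-in

    def add(self, text, embedding):
        cur = self.db.execute("INSERT INTO passages (text) VALUES (?)", (text,))
        self.index.append((cur.lastrowid, embedding))

    def query(self, embedding):
        """Return the text of the passage whose embedding is most similar."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        best_id, _ = max(self.index, key=lambda e: cosine(embedding, e[1]))
        row = self.db.execute(
            "SELECT text FROM passages WHERE id = ?", (best_id,)).fetchone()
        return row[0]
```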
22. User Workflow
● User uploads a questionnaire file (Excel)
○ This triggers the Event Processor Lambda
○ The file is transformed to JSON and stored for later use
● A UI tool enables the user to
○ select the question rows and
○ pick the property columns
● The selection is saved and stored for later use
● User can also upload pre-answered questions
○ to be added to the document store where the answers are searched for
● User can trigger inference
○ An event is sent to SQS, which fires a Lambda that triggers a SageMaker Batch Transform Job 🧠
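The row/column selection step can be sketched as a small transformation over the JSON form of the uploaded questionnaire. Field names and shapes here are invented for illustration:

```python
# Toy sketch of the UI selection step: the Excel sheet has already been
# converted to a JSON list of rows; the user picks which rows are questions
# and which columns carry properties.

def apply_selection(rows, question_rows, question_col, property_cols):
    """Build question records from the selected rows and columns."""
    selection = []
    for i in question_rows:
        row = rows[i]
        selection.append({
            "question": row[question_col],
            "properties": {col: row[col] for col in property_cols},
        })
    return selection
```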
24. Inference Workflow
- On startup the Batch Transform Job fetches
- Policy files from S3
- FAQ items from DynamoDB
- Initializes the document store
- Pre-processes the policy files
- Creates the embeddings
- Starts a web server (Flask)
- SageMaker reads the questions from S3
- SageMaker writes the answers to S3
- A Put notification is triggered on each new object
- The Event Processor listens to these events and writes the results to DynamoDB
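The answer-writing step can be sketched as pairing each question with its best-matching passage. In the real job an ML model served by SageMaker does the answering and S3 carries the input and output; here a pluggable similarity function and plain JSON stand in:

```python
import json

# Toy sketch of the batch-transform answer step: look up the best passage for
# each question and emit the answers as JSON. The similarity function is a
# stand-in for the real ML model.

def answer_questions(questions, passages, similarity):
    """Pair each question with its best-matching passage, as a JSON string."""
    answers = []
    for q in questions:
        best = max(passages, key=lambda p: similarity(q, p))
        answers.append({"question": q, "answer": best})
    return json.dumps(answers)
```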
27. Closing Thoughts
● The project continues; the next phase reveals how this actually works :grimacing:
○ But there are other angles, too
● Pros
○ Interesting technology
○ Exploratory coding
○ Full stack: Infra, Backend, Frontend, ML, Design, UX, you name it!
● Cons
○ Complexity creeping in; how to maintain it all…
○ See last pros bullet :D
● Learnings
○ Using tools that fit the job is good
○ ML & serverless are not too difficult with Clojure