Django Application Monitoring with Sentry, ELK and Prometheus
By Ridwan Fadjar Septian
Cloud Infrastructure Engineer at NiceDay Nederland B.V.
PyCon ID 2021
Introduction
- My name is Ridwan Fadjar Septian
- Living in Bandung, Indonesia
- My career journey:
- 2014 - 2016, Web Programmer using PHP
- 2016 - 2017, Backend Engineer using Django
- 2017, Backend Engineer on a big data project using AWS Lambda, AWS Kinesis, AWS
EMR + PySpark and AWS S3 for the data lake. Also Cloud Infrastructure Engineer
- 2017 - 2018, Backend Engineer using Django. Also Cloud Infrastructure Engineer
at NiceDay Nederland B.V.
- 2018 - Current, Cloud Infrastructure Engineer at NiceDay Nederland B.V., mostly
working with Google Cloud Platform
- My favorites
- Programming languages: Python and JavaScript
- Web frameworks: Django
- Operating system: Linux
- My interests: Open Source Projects, AI, DevOps, Cloud Infrastructure, Software Engineering,
IT Governance, IT Security, Computer Networking, etc.
Overview
A. Company Background
- NiceDay Nederland B.V.
- Online mental healthcare provider since 2014
- Covers the national market in the Netherlands
- Planning to expand into the international market
- Aiming to become a leader in mental healthcare services, competing with
other companies in the national sector
- Based in Rotterdam, NL
- Branch office in Bandung, ID
- +/- 50 employees in Rotterdam and Bandung combined
- From diverse nationalities and backgrounds
- Visit us here -> https://nicedaynederland.nl/en/home-en/
B. Problems
● How to provide secure services?
● How to ensure availability of our services?
● How to build a better security practice?
● How to give a better experience to our users (therapists and clients)?
C. Goals
● Why do we need monitoring and logging systems?
○ We want to give our users a secure mental healthcare service
○ A highly available service for our users
○ Compliance with national, regional and international security standards
■ NEN 7510-2:2017 (the Netherlands’ national standard for health
information system security)
■ GDPR (the European Union’s data protection regulation)
■ ISO 27001:2013 (international standard for information security
management systems)
○ A better experience for our users (therapists and clients)
Architectures
D. Architectures of Our Application - An Overview
E. Monitoring and Logging Architectures Overview
E. Architectures - Sentry 10
E. Architectures - Elasticsearch and Kibana
E. Architectures - Prometheus, AlertManager and OpsGenie
E. Architectures - Prometheus and Grafana
Current Implementation
F. Current Implementation - Elasticsearch + Kibana
● Elasticsearch + Kibana
○ Functions
■ Managing logs from Docker containers and hosts
■ Weekly log inspection
● Measure the performance of our services (e.g. Apdex)
● Find errors in Docker container logs or system logs
■ Root cause analysis on system or application logs per incident
■ Service endpoint deprecation
■ etc.
○ Ability
■ Retain all logs for years (long term)
■ Fast queries on various logs over a wide time range
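The Apdex score mentioned above can be computed from response times pulled out of the logs. The sketch below is a minimal illustration of the standard Apdex formula, not the team's actual Kibana query; the function name and the sample threshold are assumptions.

```python
def apdex(response_times, threshold):
    """Return the Apdex score for a list of response times (seconds).

    Samples at or below `threshold` count as satisfied, samples up to
    4 * threshold count as tolerating (half weight), the rest as frustrated.
    """
    if not response_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


if __name__ == "__main__":
    # Hypothetical samples with a 0.5 s target threshold.
    print(apdex([0.1, 0.3, 0.6, 2.5], threshold=0.5))
```

A score close to 1.0 means most requests met the target, so a weekly drop in the score is a cheap signal that a service is getting slower.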
F. Current Implementation - Elasticsearch + Kibana (2)
● Deployment
○ Managed services at Elastic Cloud
○ Previously, we used Logstash to ingest Filebeat logs. Now Filebeat
sends logs to Elasticsearch directly
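For a log shipper to pick up Django logs from a container, the application usually only has to write structured lines to the console. The sketch below shows one possible Django `LOGGING` setting that emits one JSON object per line. The `JsonFormatter` class and the field names are assumptions for illustration; the slides do not show the team's real logging configuration.

```python
import json
import logging
import logging.config


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# Would live in settings.py; Django hands this dict to logging.config.dictConfig.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": JsonFormatter}},
    "handlers": {
        # StreamHandler writes to stderr by default, which Docker captures.
        "console": {"class": "logging.StreamHandler", "formatter": "json"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

if __name__ == "__main__":
    logging.config.dictConfig(LOGGING)
    logging.getLogger("django.request").error("payment failed for order %s", 42)
```

Keeping one JSON object per line means the fields can be indexed separately instead of being searched as free text.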
F. Current Implementation - Sentry 10
● Sentry 10
○ Functions
■ Manage bugs / exceptions from our Django, Python, React.js and
React Native projects
● Bug management for every release
■ Performance analytics tools for developers
■ Root cause analysis on application code level
● Bug tracing
○ Ability
■ Retain caught exceptions for years (long term)
F. Current Implementation - Sentry 10 (2)
● Deployment
○ Self-hosted on Google Cloud Platform
■ 3 VM instances hosting the Sentry 10 containers, managed by container
orchestration
● e2-standard-4: 4 vCPUs, 16 GB of RAM
■ Cloud SQL as the Sentry 10 database, storing its event records
■ Cloud Storage to host Sentry 10 data
○ Sentry 10 is quite complex. It requires Apache Kafka and ClickHouse
as its new data stores.
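On the application side, a Django project typically reports to a self-hosted Sentry through the `sentry-sdk` package. The sketch below is one way to wire it up from `settings.py`. The environment variable names (`SENTRY_DSN`, `SENTRY_ENVIRONMENT`, `APP_RELEASE`, `SENTRY_TRACES_SAMPLE_RATE`) are assumptions, not the team's real configuration, and `init_sentry` needs `sentry-sdk` installed.

```python
import os


def sentry_options(environ=os.environ):
    """Build keyword arguments for sentry_sdk.init from environment variables.

    All variable names here are hypothetical.
    """
    return {
        # A missing DSN leaves the SDK disabled, which is handy for local runs.
        "dsn": environ.get("SENTRY_DSN") or None,
        "environment": environ.get("SENTRY_ENVIRONMENT", "development"),
        # Tagging events with a release enables per-release bug tracking.
        "release": environ.get("APP_RELEASE") or None,
        # Fraction of requests traced for performance monitoring (0.0 to 1.0).
        "traces_sample_rate": float(environ.get("SENTRY_TRACES_SAMPLE_RATE", "0.0")),
        # Keep personal data out of events unless explicitly needed.
        "send_default_pii": False,
    }


def init_sentry(environ=os.environ):
    """Initialise Sentry for Django; intended to be called from settings.py."""
    import sentry_sdk
    from sentry_sdk.integrations.django import DjangoIntegration

    sentry_sdk.init(integrations=[DjangoIntegration()], **sentry_options(environ))
```

Keeping the options in a plain function makes the configuration easy to unit test without a running Sentry instance.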
F. Current Implementation - Prometheus
● Prometheus + Grafana
○ Function
■ OKR evaluation
● Weekly
● Every 6 months
■ Root cause analysis using server and application metrics
○ Ability
■ Retain resource and application metrics for a month (short term)
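Application metrics reach Prometheus by being exposed over HTTP in its text exposition format, which Prometheus then scrapes. The sketch below renders a labelled counter in that format using only the standard library, to show what a scrape returns. It is illustrative only: the `Counter` class and the metric name are assumptions, label-value escaping is omitted, and a real Django project would more likely rely on a client library such as `prometheus_client`.

```python
class Counter:
    """A tiny labelled counter that renders Prometheus text exposition format."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}

    def inc(self, amount=1, **labels):
        # Sort labels so the same label set always maps to the same series.
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests_total = Counter("django_http_requests_total", "Total HTTP requests.")
    requests_total.inc(method="GET", status="200")
    requests_total.inc(method="POST", status="500")
    print(requests_total.render(), end="")
```

A Django view or middleware would increment such a counter per request and serve the rendered text on a metrics endpoint.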
F. Current Implementation - Prometheus (2)
● Prometheus + Alert Manager + OpsGenie
○ Function
■ Services uptime monitoring
● Service performance, i.e. whether it is getting slower
■ VMs status monitoring
● Memory
● CPU
● Disk/IO
● Uptime
● etc.
○ Ability
■ Faster alerting to the Infrastructure Team
● Alerts can arrive in under 1 minute or under 5 minutes
○ SMS
○ Push Notification
○ Phone Call
● OpsGenie will keep your phone ringing until you respond.
F. Current Implementation - Prometheus (3)
● Deployment
○ Self-hosted on Google Cloud Platform
■ Single VM instance hosting Prometheus and Alert Manager
● e2-standard-2: 2 vCPUs, 8 GB of RAM
■ Grafana is deployed on our container orchestration platform, co-hosted with other
services used by the infrastructure team.
F. Current Implementation - Security
We secure the deployment of Prometheus, Elasticsearch + Kibana and Sentry by
applying these measures:
- Deploy those tools in a private network
- Only the Infrastructure team has access to those tools for management purposes
- Every user of those tools has least privilege.
- Only a few people are superadmins, for administration purposes.
- Access to the private network requires 2FA
F. Current Implementation vs The History Behind it
- Back in 2017, we used New Relic as our monitoring tool.
- But its capability for storing logs from our servers and Docker containers wasn’t
satisfying. Therefore, we built a self-hosted Elasticsearch cluster
- The alerting system wasn’t satisfying either. So we built our own alerting system using
self-hosted Prometheus
- Finally, we found that Sentry 9 was simpler than New Relic for managing exceptions
from our application. So we built our bug management on Sentry 9
- 2019, Sentry and Prometheus moved to Google Cloud Platform, still self-hosted
- We faced networking issues with our local cloud provider, so we were deploying our
infrastructure in an unstable environment.
- 2019, Elasticsearch + Kibana upgraded
- We moved Elasticsearch and Kibana to Elastic Cloud because the log volume we managed
was nearly 1 TB and it was really hard to scale. Moreover, networking was one of the
main problems with that local cloud provider
- 2020, Sentry upgraded from version 9 to 10
- We moved to Sentry 10 because we wanted to use the APM provided by this new
version. But we still self-host it on Google Cloud Platform. Sentry
Cloud is quite expensive for us, as it is charged per developer in our company.
Usage Examples
G. Usage examples - Prometheus (3 slides; images not included in this transcript)
G. Usage examples - Prometheus + OpsGenie
G. Usage examples - Elasticsearch + Kibana
G. Usage examples - Sentry 10 (6 slides; images not included in this transcript)
Impacts
H. Impacts
● Those tools help us to provide secure services
○ Prometheus + OpsGenie
■ Warn us when SSL certificates are about to expire.
○ Elasticsearch + Kibana
■ Weekly log inspection
● Anomalies in HTTP requests coming to our services
○ Calls to unknown endpoints
○ Unusual request volumes exceeding the
normal requests per second.
● Find suspicious SSH sessions from anyone outside our
whitelisted users
● Find suspicious scripts being executed by cron
● Find commands executed by whitelisted users which might
put our services in danger
○ Sentry
■ Find parts of the application that might lead to bugs
○ etc.
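The certificate expiry warning can be illustrated with a small check. The slides do not say how the team implements it in Prometheus, so the sketch below is only a standalone approximation using Python's `ssl` module; the function names, the host name and the 14-day threshold are assumptions.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Open a TLS connection and return the server certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Hypothetical host and threshold; requires network access.
    remaining = days_until_expiry(fetch_not_after("example.com"))
    if remaining < 14:
        print(f"certificate expires in {remaining:.1f} days")
```

Exported as a gauge, the remaining days can drive an alert rule that pages the team well before the certificate lapses.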
H. Impacts (2)
● Those tools help us to ensure availability of our services
○ Prometheus + OpsGenie
■ Faster response to incidents in our infrastructure, 24/7
■ Improve our infrastructure by keeping it optimized and efficient
● Reduce cost for underperforming VMs
■ Detect unapplied migration scripts in the backend service
● Unapplied migrations might crash the backend service if we can’t detect them early
○ Elasticsearch + Kibana
■ Highly available log inspection to help root cause analysis when an incident
happens
● Find error output in Docker container logs across our
Docker-based services
● Find error output in system logs across our servers
■ We don’t have to SSH into our servers to find system error logs
■ We don’t have to check Docker logs to find service error logs
○ Sentry
■ We can configure Sentry to send OpsGenie alerts, triggered when
an exception is caught in our services.
○ etc.
H. Impacts (3)
● Those tools help us to build a better security practice
○ Elasticsearch + Kibana
■ Highly available log inspection for further root cause analysis
after an incident that happened last week or last month
○ Prometheus + Grafana
■ Monitor incident response performance through various
sources
● MTTA, mean time to acknowledge
● MTTR, mean time to resolve
● MTBF, mean time between failures
● 99PTA, 99th percentile time to acknowledge
● 99PTR, 99th percentile time to resolve
■ Decide on better strategies every new OKR period.
● For example, the infrastructure team maintains its workflows
related to NiceDay security practice
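The incident response metrics listed above can be derived from three timestamps per incident. The sketch below is a minimal illustration under stated assumptions: the `Incident` record is hypothetical, MTBF is taken as the mean gap between one incident's resolution and the next incident's start (definitions vary), and percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Incident:
    opened: float        # seconds since epoch when the alert fired
    acknowledged: float  # when someone acknowledged it
    resolved: float      # when it was resolved


def mean(values):
    return sum(values) / len(values)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def response_metrics(incidents):
    """Compute MTTA, MTTR, MTBF and 99th percentiles, all in seconds."""
    incidents = sorted(incidents, key=lambda i: i.opened)
    tta = [i.acknowledged - i.opened for i in incidents]
    ttr = [i.resolved - i.opened for i in incidents]
    # Time between one incident being resolved and the next one starting.
    gaps = [b.opened - a.resolved for a, b in zip(incidents, incidents[1:])]
    return {
        "mtta": mean(tta),
        "mttr": mean(ttr),
        "mtbf": mean(gaps) if gaps else None,
        "p99_tta": percentile(tta, 99),
        "p99_ttr": percentile(ttr, 99),
    }
```

Tracking the 99th percentile alongside the mean shows the worst-case response, which an average can hide.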
H. Impacts (4)
● Those tools help us to give a better experience to our users
○ Sentry
■ Faster debugging for developers in their codebases
● They can see how an exception was produced through the stack trace
visualization
● They can see which release an exception was caught in
● They can find the line where an exception was raised
● For example, the backend team can debug the Django and Celery codebase
more easily and faster
● Etc.
○ Elasticsearch + Kibana
■ Improve the backend service through performance analysis
■ Backend service endpoint deprecation
■ Help developers find performance bottlenecks in the service
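Endpoint deprecation relies on knowing who still calls an old endpoint. The sketch below counts hits on deprecated URL prefixes from already parsed access-log entries. The `(path, client)` tuple shape, the prefixes and the client names are assumptions; in practice this would be a Kibana query rather than a script.

```python
from collections import Counter


def deprecated_hits(requests, deprecated_prefixes):
    """Count requests per (deprecated prefix, client).

    `requests` is an iterable of (path, client) tuples taken from access logs.
    """
    hits = Counter()
    for path, client in requests:
        for prefix in deprecated_prefixes:
            if path.startswith(prefix):
                hits[(prefix, client)] += 1
                break  # count each request once
    return hits


if __name__ == "__main__":
    # Hypothetical log entries and prefix.
    sample = [
        ("/api/v1/sessions/", "mobile-app"),
        ("/api/v2/sessions/", "web-app"),
        ("/api/v1/sessions/7/", "mobile-app"),
    ]
    for (prefix, client), count in deprecated_hits(sample, ["/api/v1/"]).items():
        print(prefix, client, count)
```

An endpoint can be removed safely once its count stays at zero over the observation window.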
H. Impacts (5)
● Other impacts
○ Stay compliant with security standards as assurance to
clients.
○ Management can see an overview of service status when they
need it
○ Management can see that in-house teams and products are
improving
○ etc.
I. Best Practices
● Prometheus + OpsGenie: refine your alerting rules periodically so they better suit
your team’s needs
● Whichever tool you use
○ Enforce a least-privilege setup
■ Give people only what they need. Don’t assign roles that their
tasks do not require
○ Enable two-factor authentication whenever possible
○ Set up a process in your team to manage all the credentials you hold
■ You might use password managers (e.g. 1Password, Dashlane,
Bitwarden, LogMeOnce, etc.)
■ Manage secret key and password rotation to keep your monitoring
infrastructure secure
○ Evaluate your team’s security-related processes
■ Threats might also come from inside. For example:
● Bugs from the development team
● Human error when performing a particular task on the infrastructure
○ Connect to your logging infrastructure over a private connection
■ Use a secure approach to connect to your third-party logging services
○ Deploy and manage your logging infrastructure in a private network
■ For example, separate the monitoring and logging private network
from the warehouse, staging and production private networks.
Let’s wrap up
By enabling monitoring and logging systems, we might be able to:
● provide secure services
● ensure availability of our services
● build a better security practice
● give a better experience to our users
References
● Sentry
○ https://develop.sentry.dev/self-hosted/
○ https://docs.sentry.io/product/
● Elastic Cloud
○ https://www.elastic.co/guide/index.html
○ https://www.elastic.co/guide/en/kibana/current/index.html
● Prometheus
○ https://prometheus.io/docs/prometheus/latest/getting_started/
○ https://prometheus.io/docs/alerting/latest/alertmanager/
○ https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-prometheus/
● Security Practices, especially for Monitoring and Logging
○ https://sre.google/sre-book/table-of-contents/
○ NEN 7510-2:2017 - 12.4 Reporting and monitoring ->
https://www.webtoolmanagementsystemen.nl/en/ViewDocumentSection/d873e9df-44ae-413b-8564-7ca7df60bde1/d873e9df-44ae-413b-8564-7ca7df60bde1/255021a3-1c42-4700-98f6-7f04eb16274f#8f13d102-3e26-4580-a20c-f4ae375725cb
○ ISO 27001:2013 - Annex A - A.12 Operations Security - A.12.4 Logging and Monitoring
Special Thanks!
● PyCon Indonesia 2021 who made this possible!
● Kurnia Jaya Eliazar, Team Manager at NiceDay, for reviewing my slides and
giving amazing feedback
● NiceDay Infrastructure Team, who gave me unlimited chances to implement
and improve NiceDay infrastructure
● Former Ebizu Data Team, who gave me a lot of chances to explore
AWS and Python application development on a Big Data project.
● Bramandityo Prabowo, who taught me Python, Linux, Django and
many other things in college
Keep in touch
● Reach me at
○ E-mail: ridwanbejo@gmail.com
○ LinkedIn: https://www.linkedin.com/in/ridwan-fadjar-79781756/
○ GitHub: https://github.com/ridwanbejo
○ Google Scholar: https://scholar.google.com/citations?hl=en&user=edU-dL8AAAAJ
Q & A

More Related Content

What's hot

Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Flink Forward
 
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Bharath Sudharsan
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Spark Summit
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Spark Summit
 

What's hot (20)

Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 
Introduction to High Performance Computing
Introduction to High Performance ComputingIntroduction to High Performance Computing
Introduction to High Performance Computing
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
Optimized placement in Openstack for NFV
Optimized placement in Openstack for NFVOptimized placement in Openstack for NFV
Optimized placement in Openstack for NFV
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
 
My Dissertation 2016
My Dissertation 2016My Dissertation 2016
My Dissertation 2016
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
Web application
Web applicationWeb application
Web application
 
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
 
High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...
 
Connecting kafka message systems with scylla
Connecting kafka message systems with scylla   Connecting kafka message systems with scylla
Connecting kafka message systems with scylla
 
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
Hadoop summit 2016
Hadoop summit 2016Hadoop summit 2016
Hadoop summit 2016
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
 
Simulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningSimulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightning
 

Similar to Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitoring with sentry, elk and prometheus

Similar to Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitoring with sentry, elk and prometheus (20)

Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
 
Building a Distributed & Automated Open Source Program at Netflix
Building a Distributed & Automated Open Source Program at NetflixBuilding a Distributed & Automated Open Source Program at Netflix
Building a Distributed & Automated Open Source Program at Netflix
 
#RADC4L16: An API-First Archives Approach at NPR
#RADC4L16: An API-First Archives Approach at NPR#RADC4L16: An API-First Archives Approach at NPR
#RADC4L16: An API-First Archives Approach at NPR
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Google Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data editionGoogle Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data edition
 
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open Source
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
vinay-mittal-new
vinay-mittal-newvinay-mittal-new
vinay-mittal-new
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and Heroku
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
 
NGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEA
NGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEANGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEA
NGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEA
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Designing for operability and managability
Designing for operability and managabilityDesigning for operability and managability
Designing for operability and managability
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
Controlled Evolution with Puppet and AWS
Controlled Evolution with Puppet and AWSControlled Evolution with Puppet and AWS
Controlled Evolution with Puppet and AWS
 

More from Ridwan Fadjar

More from Ridwan Fadjar (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
PyCon ID 2023 - Ridwan Fadjar Septian.pdf
PyCon ID 2023 - Ridwan Fadjar Septian.pdfPyCon ID 2023 - Ridwan Fadjar Septian.pdf
PyCon ID 2023 - Ridwan Fadjar Septian.pdf
 
Cloud Infrastructure automation with Python-3.pdf
Cloud Infrastructure automation with Python-3.pdfCloud Infrastructure automation with Python-3.pdf
Cloud Infrastructure automation with Python-3.pdf
 
GraphQL- Presentation
GraphQL- PresentationGraphQL- Presentation
GraphQL- Presentation
 
Bugs and Where to Find Them (Study Case_ Backend).pdf
Bugs and Where to Find Them (Study Case_ Backend).pdfBugs and Where to Find Them (Study Case_ Backend).pdf
Bugs and Where to Find Them (Study Case_ Backend).pdf
 
Introduction to Elixir and Phoenix.pdf
Introduction to Elixir and Phoenix.pdfIntroduction to Elixir and Phoenix.pdf
Introduction to Elixir and Phoenix.pdf
 
CS meetup 2020 - Introduction to DevOps
CS meetup 2020 - Introduction to DevOpsCS meetup 2020 - Introduction to DevOps
CS meetup 2020 - Introduction to DevOps
 
Why Serverless?
Why Serverless?Why Serverless?
Why Serverless?
 
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
 
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
 
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseA Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
 
Mongodb intro-2-asbasdat-2018-v2
Mongodb intro-2-asbasdat-2018-v2Mongodb intro-2-asbasdat-2018-v2
Mongodb intro-2-asbasdat-2018-v2
 
Mongodb intro-2-asbasdat-2018
Mongodb intro-2-asbasdat-2018Mongodb intro-2-asbasdat-2018
Mongodb intro-2-asbasdat-2018
 
Mongodb intro-1-asbasdat-2018
Mongodb intro-1-asbasdat-2018Mongodb intro-1-asbasdat-2018
Mongodb intro-1-asbasdat-2018
 
Resftul API Web Development with Django Rest Framework & Celery
Resftul API Web Development with Django Rest Framework & CeleryResftul API Web Development with Django Rest Framework & Celery
Resftul API Web Development with Django Rest Framework & Celery
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
 
Kisah Dua Sejoli: Arduino & Python
Kisah Dua Sejoli: Arduino & PythonKisah Dua Sejoli: Arduino & Python
Kisah Dua Sejoli: Arduino & Python
 
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
 
Modul pelatihan-django-dasar-possupi-v1
Modul pelatihan-django-dasar-possupi-v1Modul pelatihan-django-dasar-possupi-v1
Modul pelatihan-django-dasar-possupi-v1
 
Membuat game-shooting-dengan-pygame
Membuat game-shooting-dengan-pygameMembuat game-shooting-dengan-pygame
Membuat game-shooting-dengan-pygame
 

Recently uploaded

TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc
 

Recently uploaded (20)

How to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in PakistanHow to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in Pakistan
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdf
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...

Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitoring with sentry, elk and prometheus

  • 1. Django Application Monitoring with Sentry, ELK and Prometheus By Ridwan Fadjar Septian Cloud Infrastructure Engineer at NiceDay Nederland B.V. PyCon ID 2021
  • 2. Introduction
      - My name is Ridwan Fadjar Septian
      - Living in Bandung, Indonesia
      - My career journey:
          - 2014 - 2016, Web Programmer using PHP
          - 2016 - 2017, Backend Engineer using Django
          - 2017, Backend Engineer on a big data project using AWS Lambda, AWS Kinesis, AWS EMR + PySpark and AWS S3 for the data lake. Also Cloud Infrastructure Engineer
          - 2017 - 2018, Backend Engineer using Django. Also Cloud Infrastructure Engineer at NiceDay Nederland B.V.
          - 2018 - current, Cloud Infrastructure Engineer at NiceDay Nederland B.V., mostly working with Google Cloud Platform
      - My favorites
          - Programming languages: Python and JavaScript
          - Web framework: Django
          - Operating system: Linux
      - My interests: Open Source Projects, AI, DevOps, Cloud Infrastructure, Software Engineering, IT Governance, IT Security, Computer Networking, etc.
  • 4. A. Company Background
      - NiceDay Nederland B.V.
          - Online mental healthcare provider since 2014
          - Covers the national market in the Netherlands
          - Planning to expand into the international market
          - Aiming to become a leader in mental healthcare services, competing with other companies in the national sector
      - Based in Rotterdam, NL
      - Branch office in Bandung, ID
      - +/- 50 employees, Rotterdam and Bandung combined
      - Employees come from diverse nationalities and backgrounds
      - Read more about us here -> https://nicedaynederland.nl/en/home-en/
  • 5. B. Problems
      ● How to provide secure services?
      ● How to ensure the availability of our services?
      ● How to build a better security practice?
      ● How to give a better experience to our users (therapists and clients)?
  • 6. C. Goals
      ● Why do we need monitoring and logging systems?
          ○ We are trying to give our users a secure mental healthcare service
          ○ Highly available service for our users
          ○ Compliance with national, regional and international security standards
              ■ NEN 7510-2:2017 (the Netherlands' national standard for health information system security)
              ■ GDPR (regional data protection regulation of the European Union)
              ■ ISO 27001:2013 (international standard for information security management systems)
          ○ Better user experience for our users (therapists and clients)
  • 8. D. Architectures of Our Application - An Overview
  • 9. E. Monitoring and Logging Architectures Overview
  • 10. E. Architectures - Sentry 10
  • 11. E. Architectures - Elasticsearch and Kibana
  • 12. E. Architectures - Prometheus, AlertManager and OpsGenie
  • 13. E. Architectures - Prometheus and Grafana
  • 15. F. Current Implementation - Elasticsearch + Kibana
      ● Elasticsearch + Kibana
          ○ Functions
              ■ Managing logs from Docker containers and hosts
              ■ Weekly log inspection
                  ● Measure the performance of our services (e.g. Apdex)
                  ● Find errors in Docker container logs or system logs
              ■ Root cause analysis on system or application logs per incident
              ■ Service endpoint deprecation
              ■ etc.
          ○ Abilities
              ■ Retain all logs for more than a year (long term)
              ■ Fast queries on various logs over wide time ranges
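The Apdex score mentioned above can be computed from response times with the standard Apdex formula: requests at or below a threshold T count as satisfied, requests between T and 4T count as tolerating (half weight), and the rest count as frustrated. The sketch below is only an illustration; the sample values and the 0.5 second threshold are assumptions, and the slides do not say how the score is actually derived from the logs.

```python
def apdex(response_times, threshold):
    """Return the Apdex score for response times given in seconds.

    satisfied:  t <= T
    tolerating: T < t <= 4T (counts for half)
    frustrated: t > 4T (counts for nothing)
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times (seconds), e.g. taken from one week of access logs.
samples = [0.12, 0.30, 0.45, 0.90, 1.60, 2.50]
score = apdex(samples, threshold=0.5)
print(round(score, 3))
```

With these samples, three requests are satisfied, two are tolerating and one is frustrated, so the score is (3 + 2/2) / 6.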
  • 16. F. Current Implementation - Elasticsearch + Kibana (2)
      ● Deployment
          ○ Managed service at Elastic Cloud
          ○ Previously, we used Logstash to ingest Filebeat logs. Now Filebeat sends logs to Elasticsearch directly
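On the application side, one common way to make container logs easy to query in Elasticsearch is to emit one JSON object per line, which a log shipper such as Filebeat can be configured to parse. The slides do not describe the log format in use, so the following is only a sketch under that assumption; the logger name and field names are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# In a container the stream would normally be sys.stdout; a buffer is used
# here so the output can be inspected.
buffer = io.StringIO()
log = build_logger("backend.example", buffer)  # hypothetical logger name
log.info("request finished in %d ms", 42)
line = json.loads(buffer.getvalue())
print(line["level"], line["message"])
```

In a Django project the same formatter class could be referenced from the `LOGGING` setting instead of being wired up by hand.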
  • 17. F. Current Implementation - Sentry 10
      ● Sentry 10
          ○ Functions
              ■ Manage bugs / exceptions from our Django, Python, React.js and React Native projects
                  ● Bug management for every release
              ■ Performance analytics tools for developers
              ■ Root cause analysis at the application code level
                  ● Bug tracing
          ○ Abilities
              ■ Retain caught exceptions for years (long term)
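For a Django project, reporting exceptions and performance data to Sentry usually comes down to one `sentry_sdk.init(...)` call in the settings module. The sketch below assumes the options are read from environment variables; the variable names, the DSN host and the release string are hypothetical and not taken from the talk. The helper that builds the options is kept separate so it can be checked without the SDK installed.

```python
import os


def build_sentry_options(env):
    """Build keyword arguments for sentry_sdk.init from an environment mapping."""
    return {
        "dsn": env.get("SENTRY_DSN"),
        "environment": env.get("SENTRY_ENVIRONMENT", "development"),
        # Tagging events with a release enables per-release bug management.
        "release": env.get("APP_RELEASE"),
        # Fraction of requests traced for performance monitoring (0.0 disables it).
        "traces_sample_rate": float(env.get("SENTRY_TRACES_SAMPLE_RATE", "0.0")),
        # Avoid sending personal data by default; relevant for healthcare services.
        "send_default_pii": False,
    }


def init_sentry(env=os.environ):
    """Initialise the Sentry SDK with the Django integration.

    Imports are local so this module can be loaded without sentry-sdk installed.
    """
    import sentry_sdk
    from sentry_sdk.integrations.django import DjangoIntegration

    sentry_sdk.init(integrations=[DjangoIntegration()], **build_sentry_options(env))


# Example with hypothetical values; a real DSN comes from the Sentry project settings.
options = build_sentry_options({
    "SENTRY_DSN": "https://publickey@sentry.example.internal/1",
    "APP_RELEASE": "backend@1.2.3",
    "SENTRY_TRACES_SAMPLE_RATE": "0.1",
})
print(options["release"], options["traces_sample_rate"])
```

Calling `init_sentry()` from `settings.py` would then be enough for unhandled exceptions in views to be reported.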
  • 18. F. Current Implementation - Sentry 10 (2)
      ● Deployment
          ○ Self-hosted (on-premise) on Google Cloud Platform
              ■ 3 VM instances hosting Sentry 10 containers, managed by container orchestration
                  ● e2-standard-4: 4 vCPUs, 16 GB of RAM
              ■ Cloud SQL as the Sentry 10 database, storing its event records
              ■ Cloud Storage to host Sentry 10 data
          ○ Sentry 10 is quite complex. It requires Apache Kafka and ClickHouse as its new data stores.
  • 19. F. Current Implementation - Prometheus
      ● Prometheus + Grafana
          ○ Functions
              ■ OKR evaluation
                  ● Weekly
                  ● Every 6 months
              ■ Root cause analysis using server and application metrics
          ○ Abilities
              ■ Retain resource and application metrics for a month (short term)
  • 20. F. Current Implementation - Prometheus (2)
      ● Prometheus + Alertmanager + OpsGenie
          ○ Functions
              ■ Service uptime monitoring
                  ● Whether service performance is getting slower
              ■ VM status monitoring
                  ● Memory
                  ● CPU
                  ● Disk I/O
                  ● Uptime
                  ● etc.
          ○ Abilities
              ■ Faster alerting for the Infrastructure Team
                  ● Alerts may arrive in under 1 minute or within 5 minutes
                      ○ SMS
                      ○ Push notification
                      ○ Phone call
                  ● OpsGenie will keep your phone ringing until you respond.
  • 21. F. Current Implementation - Prometheus (3)
      ● Deployment
          ○ Self-hosted (on-premise) on Google Cloud Platform
              ■ Single VM instance hosting Prometheus and Alertmanager
                  ● e2-standard-2: 2 vCPUs, 8 GB of RAM
              ■ Grafana is deployed on our container orchestration platform, co-hosted with other services used by the infrastructure team.
  • 22. F. Current Implementation - Security
      We secure the deployment of Prometheus, Elasticsearch + Kibana and Sentry by applying these measures:
      - Deploy those tools inside a private network
      - Only the Infrastructure team has access to those tools for management purposes
      - Every user of those tools has least privileges.
      - Only a few people are superadmins for administration purposes.
      - Access to the private network requires 2FA
  • 23. F. Current Implementation vs The History Behind It
      - Back in 2017, we used New Relic as our monitoring tool.
          - But its capability for storing logs from our servers and Docker containers wasn't satisfying. Therefore, we built an on-premise Elasticsearch cluster
          - The alerting system wasn't satisfying either. So we built our own alerting system using on-premise Prometheus
          - Finally, we found that Sentry 9 was simpler than New Relic for managing exceptions from our application. So we built our bug management on Sentry 9
      - 2019, Sentry and Prometheus moved to Google Cloud Platform, still self-hosted (on-premise)
          - We faced networking issues with our local cloud provider, which left our infrastructure running in an unstable environment.
      - 2019, Elasticsearch + Kibana upgraded
          - We moved Elasticsearch and Kibana to Elastic Cloud because the log volume we managed was nearly 1 TB and it was really hard to scale. Moreover, networking issues were one of the main problems with that local cloud provider
      - 2020, Sentry upgraded from version 9 to 10
          - We moved to Sentry 10 because we wanted to use the APM provided by this new version. But we still deploy it on-premise on Google Cloud Platform. Sentry Cloud is quite expensive because it is charged per developer in our company.
  • 25. G. Usage examples - Prometheus
  • 26. G. Usage examples - Prometheus
  • 27. G. Usage examples - Prometheus
  • 28. G. Usage examples - Prometheus + OpsGenie
  • 29. G. Usage examples - Elasticsearch + Kibana
  • 30. G. Usage examples - Sentry 10
  • 31. G. Usage examples - Sentry 10
  • 32. G. Usage examples - Sentry 10
  • 33. G. Usage examples - Sentry 10
  • 34. G. Usage examples - Sentry 10
  • 35. G. Usage examples - Sentry 10
  • 37. H. Impacts
      ● Those tools help us provide secure services
          ○ Prometheus + OpsGenie
              ■ Warn us when SSL certificates are about to expire.
          ○ Elasticsearch + Kibana
              ■ Weekly log inspection
                  ● Anomalies in HTTP requests coming to our services
                      ○ Calls to unknown endpoints
                      ○ Unusual numbers of requests, exceeding the normal requests per second.
                  ● Find suspicious SSH sessions by anyone other than our whitelisted users
                  ● Find suspicious scripts executed by cron
                  ● Find commands executed by whitelisted users that might put our services in danger
          ○ Sentry
              ■ Find parts of the application that might lead to bugs
          ○ etc.
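The slides do not say how the certificate expiry warning is implemented. As an illustration of the underlying calculation only, the sketch below turns a certificate's `notAfter` field into "days remaining" using the standard library; the 14-day warning window and the fixed dates are assumptions chosen for the example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Return whole days between `now` and a certificate's notAfter string."""
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).days


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Read the notAfter field of the certificate served by hostname."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with a fixed certificate date and a fixed "now".
now = datetime(2018, 1, 1, tzinfo=timezone.utc)
remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", now)
should_warn = remaining < 14  # hypothetical warning window
print(remaining, should_warn)
```

For a live check, `days_until_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))` would give the days remaining for a real host.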
  • 38. H. Impacts (2)
      ● Those tools help us ensure the availability of our services
          ○ Prometheus + OpsGenie
              ■ Faster response to incidents in our infrastructure, 24/7
              ■ Improve our infrastructure by keeping it optimized and efficient
                  ● Reduce the cost of underperforming VMs
              ■ Detect unapplied migration scripts in the backend service
                  ● These might crash the backend service if we don't detect them early
          ○ Elasticsearch + Kibana
              ■ Highly available log inspection to help root cause analysis when an incident happens
                  ● Find error output in Docker container logs across our Docker-based services
                  ● Find error output in system logs across our servers
              ■ We don't have to SSH into our servers to find system error logs
              ■ We don't have to check Docker logs to find service error logs
          ○ Sentry
              ■ We can configure Sentry to send OpsGenie alerts, triggered when an exception is caught in our services.
          ○ etc.
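How unapplied migrations are detected is not shown in the slides. In Django, one way to count them is to ask `MigrationExecutor` for the plan needed to reach the latest migrations; the resulting number could then be published as a metric and alerted on. The sketch below assumes that approach. The fake executor classes exist only to demonstrate the helper without a database and are not part of Django.

```python
def count_unapplied(executor):
    """Count migrations that still need to run to reach the latest state.

    `executor` is expected to behave like
    django.db.migrations.executor.MigrationExecutor.
    """
    targets = executor.loader.graph.leaf_nodes()
    plan = executor.migration_plan(targets)  # list of (migration, backwards) pairs
    return len(plan)


def count_unapplied_for_default_db():
    """Same count against the project's default database (needs configured Django)."""
    from django.db import connections
    from django.db.migrations.executor import MigrationExecutor

    return count_unapplied(MigrationExecutor(connections["default"]))


# Stand-ins for illustration only; a real run would use MigrationExecutor.
class _FakeGraph:
    def leaf_nodes(self):
        return [("backend", "0003_add_field")]


class _FakeLoader:
    graph = _FakeGraph()


class _FakeExecutor:
    loader = _FakeLoader()

    def migration_plan(self, targets):
        return [("0002_add_table", False), ("0003_add_field", False)]


pending = count_unapplied(_FakeExecutor())
print(pending)
```

A value above zero after a deployment would indicate that the code and the database schema are out of step.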
  • 39. H. Impacts (3)
      ● Those tools help us build a better security practice
          ○ Elasticsearch + Kibana
              ■ Highly available log inspection for further root cause analysis after an incident that happened last week or last month
          ○ Prometheus + Grafana
              ■ Monitor incident response performance through various sources
                  ● MTTA, mean time to acknowledge
                  ● MTTR, mean time to resolve
                  ● MTBF, mean time between failures
                  ● 99PTA, 99th percentile time to acknowledge
                  ● 99PTR, 99th percentile time to resolve
              ■ Decide on better strategies every new OKR period.
                  ● For example, the infrastructure team maintains its workflows related to NiceDay security practice
  • 40. H. Impacts (4)
      ● Those tools help us give a better experience to our users
          ○ Sentry
              ■ Faster debugging for developers in their codebases
                  ● They can see how an exception was produced through detailed stack trace visualization
                  ● They can see which release an exception was caught in
                  ● They can find the line where an exception was caught
                  ● For example, the backend team can debug the Django and Celery codebase more easily and quickly
                  ● etc.
          ○ Elasticsearch + Kibana
              ■ Improve the backend service based on performance analysis
              ■ Backend service endpoint deprecation
              ■ Help developers find performance bottlenecks in the service
  • 41. H. Impacts (5)
      ● Other impacts
          ○ Stay compliant with security standards, as assurance to clients.
          ○ Management can see an overview of service status when they need it
          ○ Management can see that in-house teams and products are improving
          ○ etc.
  • 42. I. Best Practices
      ● Prometheus + OpsGenie: refine your alerting rules periodically so they better fit your team's needs
      ● Whichever tool you use
          ○ Enforce a least privilege setup
              ■ Give people only what they need. Don't assign roles that their tasks don't require
          ○ Enable two-factor authentication whenever possible
          ○ Set up a process in your team for managing all credentials you are responsible for
              ■ You might use password managers (e.g. 1Password, Dashlane, Bitwarden, LogMeOnce, etc.)
              ■ Manage secret key and password rotation to keep your monitoring infrastructure secure
          ○ Evaluate the security-related processes in your team
              ■ Threats might also come from inside. For example:
                  ● Bugs from the development team
                  ● Human error when performing a particular task on the infrastructure
          ○ Connect to your logging infrastructure over a private connection
              ■ Use a secure approach to connect to your third-party logging services
          ○ Deploy and manage your logging infrastructure inside a private network
              ■ For example, separate the monitoring and logging private network from the warehouse, staging and production private networks.
  • 43. Let’s wrap up By enabling monitoring and logging systems, we might be able to: ● provide secure services ● ensure availability of our services ● build a better security practice ● give better experience for our users
  • 44. References
      ● Sentry
          ○ https://develop.sentry.dev/self-hosted/
          ○ https://docs.sentry.io/product/
      ● Elastic Cloud
          ○ https://www.elastic.co/guide/index.html
          ○ https://www.elastic.co/guide/en/kibana/current/index.html
      ● Prometheus
          ○ https://prometheus.io/docs/prometheus/latest/getting_started/
          ○ https://prometheus.io/docs/alerting/latest/alertmanager/
          ○ https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-prometheus/
      ● Security Practices, especially for Monitoring and Logging
          ○ https://sre.google/sre-book/table-of-contents/
          ○ NEN 7510-2:2017 - 12.4 Reporting and monitoring -> https://www.webtoolmanagementsystemen.nl/en/ViewDocumentSection/d873e9df-44ae-413b-8564-7ca7df60bde1/d873e9df-44ae-413b-8564-7ca7df60bde1/255021a3-1c42-4700-98f6-7f04eb16274f#8f13d102-3e26-4580-a20c-f4ae375725cb
          ○ ISO 27001:2013 - Annex A - A.12 Operations Security - A.12.4 Logging and Monitoring
  • 45. Special Thanks!
      ● PyCon Indonesia 2021, who made this possible!
      ● Kurnia Jaya Eliazar, Team Manager at NiceDay, for reviewing my slides and giving amazing feedback
      ● NiceDay Infrastructure Team, who gave me unlimited chances to implement and improve NiceDay infrastructure
      ● Former Ebizu Data Team, who gave me many chances to explore AWS and Python application development on a big data project.
      ● Bramandityo Prabowo, who taught me Python, Linux, Django and many other things in college
  • 46. Keep in touch ● Reach me at ○ E-mail: ridwanbejo@gmail.com ○ LinkedIn: https://www.linkedin.com/in/ridwan-fadjar-79781756/ ○ Github: https://github.com/ridwanbejo ○ Google Scholar: https://scholar.google.com/citations?hl=en&user=edU-dL8AAAAJ
  • 47. Q & A