SlideShare a Scribd company logo
1 of 17
Big Data at Aadhaar

Dr. Pramod K Varma       Regunath Balasubramaian
pramod.uid@gmail.com        regunathb@gmail.com
Twitter: @pramodkvarma       Twitter: @RegunathB
Aadhaar at a Glance




         2
India
• 1.2 billion residents
   – 640,000 villages, ~60% lives under $2/day
   – ~75% literacy, <3% pays Income Tax, <20% banking
   – ~800 million mobile, ~200-300 mn migrant workers


• Govt. spends about $25-40 bn on direct subsidies
   – Residents have no standard identity document
   – Most programs plagued with ghost and multiple
     identities causing leakage of 30-40%


                             3
Vision
• Create a common “national identity” for every
  “resident”
  – Biometric backed identity to eliminate duplicates
  – “Verifiable online identity” for portability


• Applications ecosystem using open APIs
  – Aadhaar enabled bank account and payment platform
  – Aadhaar enabled electronic, paperless KYC



                             4
Aadhaar System
• Enrolment
  –   One time in a person’s lifetime
  –   Minimal demographics
  –   Multi-modal biometrics (Fingerprints, Iris)
  –   12-digit unique Aadhaar number assigned


• Authentication
  – Verify “you are who you claim to be”
  – Open API based
  – Multi-device, multi-factor, multi-modal
                               5
Architecture Principles
• Design for scale
   – Every component needs to scale to large volumes
   – Millions of transactions and billions of records
   – Accommodate failure and design for recovery
• Open architecture
   – Use of open standards to ensure interoperability
   – Allow the ecosystem to build libraries to standard APIs
   – Use of open-source technologies wherever prudent
• Security
   – End to end security of resident data
   – Use of open source
   – Data privacy handling (API and data anonymization)


                                    6
Designed for Scale
• Horizontal scalability for all components
   –   “Open Scale-out” is the key
   –   Distributed computing on commodity hardware
   –   Distributed data store and data partitioning
   –   Horizontal scaling of “data store” a must!
   –   Use of right data store for right purpose
• No single point of bottleneck for scaling
• Asynchronous processing throughout the system
   – Allows loose coupling various components
   – Allows independent component level scaling

                              7
Enrolment Volume
• 600 to 800 million UIDs in 4 years
   – 1 million a day
   – 200+ trillion matches every day!!!
• ~5MB per resident
   – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
   – About 30 TB I/O every day
   – Replication and backup across DCs of about 5+ TB of incremental
     data every day
   – Lifecycle updates and new enrolments will continue for ever
• Additional process data
   – Several million events on an average moving through async
     channels (some persistent and some transient)
   – Needing complete update and insert guarantees across data stores

                                    8
Authentication Volume
• 100+ million authentications per day (10 hrs)
   – Possible high variance on peak and average
   – Sub second response
   – Guaranteed audits
• Multi-DC architecture
   – All changes needs to be propagated from enrolment data stores to
     all authentication sites
• Authentication request is about 4 K
   –   100 million authentications a day
   –   1 billion audit records in 10 days (30+ billion a year)
   –   4 TB encrypted audit logs in 10 days
   –   Audit write must be guaranteed

                                       9
Open APIs
• Aadhaar Services
  – Core Authentication API and supporting Best
    Finger Detection, OTP Request APIs
  – New services being built on top
• Aadhaar Open Standards for Plug-n-play
  – Biometric Device API
  – Biometric SDK API
  – Biometric Identification System API
  – Transliteration API for Indian Languages
                         10
Implementation




       11
Patterns & Technologies
• Principles
    • POJO based application implementation
    • Light-weight, custom application container
    • Http gateway for APIs

• Compute Patterns
   • Data Locality
   • Distribute compute (within a OS process and across)

• Compute Architectures
   • SEDA – Staged Event Driven Architecture
   • Master-Worker(s) Compute Grid

• Data Access types
   • High throughput streaming : bio-dedupe, analytics
   • High volume, moderate latency : workflow, UID records
   • High volume , low latency : auth, demo-dedupe,
                                 search – eAadhaar, KYC
Aadhaar Data Stores
                                           (Data consistency challenges..)
Shard        Shard           Shard        Shard
  0            2               6            9
                                                                                            Low latency indexed read (Documents per sec),
                                                             Solr cluster                   Low latency random search (Documents per sec)
Shard       Shard          Shard            (all enrolment records/documents
  a           d              f
                                                  – selected demographics only)


    Shard        Shard
      1            2
                               Shard                                                 Low latency indexed read (Documents per sec),
                                 3
                                                   Mongo cluster                     High latency random search (seconds per read)
   Shard        Shard                  (all enrolment records/documents
     4            5                               – demographics + photo)


                                                                                                 Low latency indexed read (milli-seconds
                       Enrolment
   UID master             DB                                                   MySQL             per read),
    (sharded)                           (all UID generated records - demographics only,          High latency random search (seconds per
                                                        track & trace, enrolment status )        read)


                                                               HBase                High read throughput (MB per sec),
 Region      Region         Region      Region             (all enrolment           Low-to-Medium latency read (milli-seconds per read)
 Ser. 1      Ser. 10        Ser. ..     Ser. 20
                                                     biometric templates)

 Data
Node 1
             Data
            Node 10
                             Data
                            Node ..
                                           Data
                                          Node 20
                                                                  HDFS               High read throughput (MB per sec),
                                                           (all raw packets)         High latency read (seconds per read)



 LUN 1       LUN 2         LUN 3       LUN 4                                         Moderate read throughput,
                                                                     NFS             High latency read (seconds per read)
                                                  (all archived raw packets)
Aadhaar Architecture
                       • Real-time monitoring using Events


• Work distribution
  using SEDA &
  Messaging
• Ability to scale
  within JVM and
  across
• Recovery through
  check-pointing


• Sync Http based
  Auth gateway
• Protocol Buffers &
  XML payloads
• Sharded clusters

                                        • Near Real-time data delivery to warehouse
                                        • Nightly data-sets used to build
                                          dashboards, data marts and reports
Deployment Monitoring
Learnings
• Make everything API based
• Everything fails
  (hardware, software, network, storage)
  – System must recover, retry transactions, and sort of self-
    heal
• Security and privacy should not be an afterthought
• Scalability does not come from one product
• Open scale out is the only way you should go.
  – Heterogeneous, multi-vendor, commodity
    compute, growing linear fashion. Nothing else can
    adapt!
                              16
Thank You!
Dr. Pramod K Varma            Regunath Balasubramaian
pramod.uid@gmail.com             regunathb@gmail.com
Twitter: @pramodkvarma            Twitter: @RegunathB




                         17

More Related Content

What's hot

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 

What's hot (20)

Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Hadoop at aadhaar
Hadoop at aadhaarHadoop at aadhaar
Hadoop at aadhaar
 
Observability at Spotify
Observability at SpotifyObservability at Spotify
Observability at Spotify
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Time Series Data with InfluxDB
Time Series Data with InfluxDBTime Series Data with InfluxDB
Time Series Data with InfluxDB
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 

Viewers also liked

Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)
Ali Raw
 

Viewers also liked (13)

Srikanth Nadhamuni
Srikanth NadhamuniSrikanth Nadhamuni
Srikanth Nadhamuni
 
practical risks in aadhaar project and measures to overcome them
practical risks in aadhaar project and measures to overcome thempractical risks in aadhaar project and measures to overcome them
practical risks in aadhaar project and measures to overcome them
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
 
Building tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsBuilding tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systems
 
Building the Flipkart phantom
Building the Flipkart phantomBuilding the Flipkart phantom
Building the Flipkart phantom
 
Uid
UidUid
Uid
 
Facebook style notifications using hbase and event streams
Facebook style notifications using hbase and event streamsFacebook style notifications using hbase and event streams
Facebook style notifications using hbase and event streams
 
What database
What databaseWhat database
What database
 
Unique identification authority of india uid
Unique identification authority of india   uidUnique identification authority of india   uid
Unique identification authority of india uid
 
E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres
 
Aadhaar
AadhaarAadhaar
Aadhaar
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantage
 
Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)Authentication(pswrd,token,certificate,biometric)
Authentication(pswrd,token,certificate,biometric)
 

Similar to Aadhaar at 5th_elephant_v3

Key-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscanaKey-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscana
Matteo Baglini
 
ModeShape 3 overview
ModeShape 3 overviewModeShape 3 overview
ModeShape 3 overview
Randall Hauch
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
George Ang
 
Millions quotes per second in pure java
Millions quotes per second in pure javaMillions quotes per second in pure java
Millions quotes per second in pure java
Roman Elizarov
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
DATAVERSITY
 

Similar to Aadhaar at 5th_elephant_v3 (20)

Real time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
Real time monitoring-alerting: storing 2Tb of logs a day in ElasticsearchReal time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
Real time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch
 
DRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
 
Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis
 
Key-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscanaKey-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscana
 
ModeShape 3 overview
ModeShape 3 overviewModeShape 3 overview
ModeShape 3 overview
 
SSD Performance Benchmarking
SSD Performance BenchmarkingSSD Performance Benchmarking
SSD Performance Benchmarking
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Redis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetupRedis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetup
 
Data engineering
Data engineeringData engineering
Data engineering
 
Traitement d'événements
Traitement d'événementsTraitement d'événements
Traitement d'événements
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
Openstack swift - VietOpenStack 6thmeeetup
Openstack swift - VietOpenStack 6thmeeetupOpenstack swift - VietOpenStack 6thmeeetup
Openstack swift - VietOpenStack 6thmeeetup
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
 
Millions quotes per second in pure java
Millions quotes per second in pure javaMillions quotes per second in pure java
Millions quotes per second in pure java
 
Introduction to near real time computing
Introduction to near real time computingIntroduction to near real time computing
Introduction to near real time computing
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
You suck at Memory Analysis
You suck at Memory AnalysisYou suck at Memory Analysis
You suck at Memory Analysis
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Aadhaar at 5th_elephant_v3

  • 1. Big Data at Aadhaar Dr. Pramod K Varma Regunath Balasubramaian pramod.uid@gmail.com regunathb@gmail.com Twitter: @pramodkvarma Twitter: @RegunathB
  • 2. Aadhaar at a Glance 2
  • 3. India • 1.2 billion residents – 640,000 villages, ~60% lives under $2/day – ~75% literacy, <3% pays Income Tax, <20% banking – ~800 million mobile, ~200-300 mn migrant workers • Govt. spends about $25-40 bn on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40% 3
  • 4. Vision • Create a common “national identity” for every “resident” – Biometric backed identity to eliminate duplicates – “Verifiable online identity” for portability • Applications ecosystem using open APIs – Aadhaar enabled bank account and payment platform – Aadhaar enabled electronic, paperless KYC 4
  • 5. Aadhaar System • Enrolment – One time in a person’s lifetime – Minimal demographics – Multi-modal biometrics (Fingerprints, Iris) – 12-digit unique Aadhaar number assigned • Authentication – Verify “you are who you claim to be” – Open API based – Multi-device, multi-factor, multi-modal 5
  • 6. Architecture Principles • Design for scale – Every component needs to scale to large volumes – Millions of transactions and billions of records – Accommodate failure and design for recovery • Open architecture – Use of open standards to ensure interoperability – Allow the ecosystem to build libraries to standard APIs – Use of open-source technologies wherever prudent • Security – End to end security of resident data – Use of open source – Data privacy handling (API and data anonymization) 6
  • 7. Designed for Scale • Horizontal scalability for all components – “Open Scale-out” is the key – Distributed computing on commodity hardware – Distributed data store and data partitioning – Horizontal scaling of “data store” a must! – Use of right data store for right purpose • No single point of bottleneck for scaling • Asynchronous processing throughout the system – Allows loose coupling various components – Allows independent component level scaling 7
  • 8. Enrolment Volume • 600 to 800 million UIDs in 4 years – 1 million a day – 200+ trillion matches every day!!! • ~5MB per resident – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!) – About 30 TB I/O every day – Replication and backup across DCs of about 5+ TB of incremental data every day – Lifecycle updates and new enrolments will continue for ever • Additional process data – Several million events on an average moving through async channels (some persistent and some transient) – Needing complete update and insert guarantees across data stores 8
  • 9. Authentication Volume • 100+ million authentications per day (10 hrs) – Possible high variance on peak and average – Sub second response – Guaranteed audits • Multi-DC architecture – All changes needs to be propagated from enrolment data stores to all authentication sites • Authentication request is about 4 K – 100 million authentications a day – 1 billion audit records in 10 days (30+ billion a year) – 4 TB encrypted audit logs in 10 days – Audit write must be guaranteed 9
  • 10. Open APIs • Aadhaar Services – Core Authentication API and supporting Best Finger Detection, OTP Request APIs – New services being built on top • Aadhaar Open Standards for Plug-n-play – Biometric Device API – Biometric SDK API – Biometric Identification System API – Transliteration API for Indian Languages 10
  • 12. Patterns & Technologies • Principles • POJO based application implementation • Light-weight, custom application container • Http gateway for APIs • Compute Patterns • Data Locality • Distribute compute (within a OS process and across) • Compute Architectures • SEDA – Staged Event Driven Architecture • Master-Worker(s) Compute Grid • Data Access types • High throughput streaming : bio-dedupe, analytics • High volume, moderate latency : workflow, UID records • High volume , low latency : auth, demo-dedupe, search – eAadhaar, KYC
  • 13. Aadhaar Data Stores (Data consistency challenges..) Shard Shard Shard Shard 0 2 6 9 Low latency indexed read (Documents per sec), Solr cluster Low latency random search (Documents per sec) Shard Shard Shard (all enrolment records/documents a d f – selected demographics only) Shard Shard 1 2 Shard Low latency indexed read (Documents per sec), 3 Mongo cluster High latency random search (seconds per read) Shard Shard (all enrolment records/documents 4 5 – demographics + photo) Low latency indexed read (milli-seconds Enrolment UID master DB MySQL per read), (sharded) (all UID generated records - demographics only, High latency random search (seconds per track & trace, enrolment status ) read) HBase High read throughput (MB per sec), Region Region Region Region (all enrolment Low-to-Medium latency read (milli-seconds per read) Ser. 1 Ser. 10 Ser. .. Ser. 20 biometric templates) Data Node 1 Data Node 10 Data Node .. Data Node 20 HDFS High read throughput (MB per sec), (all raw packets) High latency read (seconds per read) LUN 1 LUN 2 LUN 3 LUN 4 Moderate read throughput, NFS High latency read (seconds per read) (all archived raw packets)
  • 14. Aadhaar Architecture • Real-time monitoring using Events • Work distribution using SEDA & Messaging • Ability to scale within JVM and across • Recovery through check-pointing • Sync Http based Auth gateway • Protocol Buffers & XML payloads • Sharded clusters • Near Real-time data delivery to warehouse • Nightly data-sets used to build dashboards, data marts and reports
  • 16. Learnings • Make everything API based • Everything fails (hardware, software, network, storage) – System must recover, retry transactions, and sort of self- heal • Security and privacy should not be an afterthought • Scalability does not come from one product • Open scale out is the only way you should go. – Heterogeneous, multi-vendor, commodity compute, growing linear fashion. Nothing else can adapt! 16
  • 17. Thank You! Dr. Pramod K Varma Regunath Balasubramaian pramod.uid@gmail.com regunathb@gmail.com Twitter: @pramodkvarma Twitter: @RegunathB 17