3. India
• 1.2 billion residents
– 640,000 villages, ~60% live on under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile connections, ~200-300 mn migrant workers
• Govt. spends about $25-40 bn on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple
identities causing leakage of 30-40%
4. Vision
• Create a common “national identity” for every
“resident”
– Biometric backed identity to eliminate duplicates
– “Verifiable online identity” for portability
• Applications ecosystem using open APIs
– Aadhaar enabled bank account and payment platform
– Aadhaar enabled electronic, paperless KYC
5. Aadhaar System
• Enrolment
– One time in a person’s lifetime
– Minimal demographics
– Multi-modal biometrics (Fingerprints, Iris)
– 12-digit unique Aadhaar number assigned
• Authentication
– Verify “you are who you claim to be”
– Open API based
– Multi-device, multi-factor, multi-modal
6. Architecture Principles
• Design for scale
– Every component needs to scale to large volumes
– Millions of transactions and billions of records
– Accommodate failure and design for recovery
• Open architecture
– Use of open standards to ensure interoperability
– Allow the ecosystem to build libraries to standard APIs
– Use of open-source technologies wherever prudent
• Security
– End to end security of resident data
– Use of open source
– Data privacy handling (API and data anonymization)
7. Designed for Scale
• Horizontal scalability for all components
– “Open Scale-out” is the key
– Distributed computing on commodity hardware
– Distributed data store and data partitioning
– Horizontal scaling of “data store” a must!
– Use of right data store for right purpose
• No single point of bottleneck for scaling
• Asynchronous processing throughout the system
– Allows loose coupling of the various components
– Allows independent component level scaling
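The data-partitioning point above can be sketched as a hash-based shard router. This is only an illustration: the shard count, the key format, and the function name `shard_for` are assumptions, not details of the Aadhaar implementation.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count, chosen only for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process for strings and would route inconsistently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    # Hypothetical keys; any stable record identifier would do.
    for key in ("enrolment-0001", "enrolment-0002", "enrolment-0003"):
        print(key, "-> shard", shard_for(key))
```

With plain modulo hashing, changing the shard count remaps most keys, so range-based partitioning or consistent hashing is often preferred when shards are added frequently.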
8. Enrolment Volume
• 600 to 800 million UIDs in 4 years
– 1 million a day
– 200+ trillion matches every day!!!
• ~5MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
– About 30 TB I/O every day
– Replication and backup across DCs of about 5+ TB of incremental
data every day
– Lifecycle updates and new enrolments will continue forever
• Additional process data
– Several million events on average moving through async
channels (some persistent and some transient)
– Needing complete update and insert guarantees across data stores
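A back-of-the-envelope check of the match count, assuming every new enrolment is compared once against every record already enrolled (1:N de-duplication). The gallery sizes below are assumptions chosen for illustration; the slide states only the 1 million/day rate, the 600 to 800 million target, and the 200+ trillion figure.

```python
DAILY_ENROLMENTS = 1_000_000  # from the slide: 1 million a day


def matches_per_day(gallery_size: int, daily: int = DAILY_ENROLMENTS) -> int:
    """Matches needed per day if each new enrolment is checked against the gallery."""
    return daily * gallery_size


if __name__ == "__main__":
    # Illustrative gallery sizes (assumptions), up to the 800 million target.
    for gallery in (200_000_000, 600_000_000, 800_000_000):
        trillions = matches_per_day(gallery) / 1e12
        print(f"gallery of {gallery:,}: {trillions:,.0f} trillion matches/day")
```

Under this simple model the 200 trillion mark corresponds to a gallery of about 200 million residents, and the daily match count keeps growing as enrolment proceeds.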
9. Authentication Volume
• 100+ million authentications per day (10 hrs)
– Possible high variance on peak and average
– Sub second response
– Guaranteed audits
• Multi-DC architecture
– All changes need to be propagated from enrolment data stores to
all authentication sites
• Authentication request is about 4 KB
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB encrypted audit logs in 10 days
– Audit write must be guaranteed
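The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes one audit record per authentication and reads the request size as 4,000 bytes; both are assumptions made here so that the numbers can be reproduced.

```python
AUTHS_PER_DAY = 100_000_000  # from the slide
ACTIVE_HOURS = 10            # from the slide: "(10 hrs)"
REQUEST_BYTES = 4_000        # about 4 K per request, read here as 4,000 bytes
AUDITS_PER_AUTH = 1          # assumption: one audit record per request


def avg_requests_per_second() -> float:
    """Average load if all daily traffic arrives within the active hours."""
    return AUTHS_PER_DAY / (ACTIVE_HOURS * 3600)


def audit_records(days: int) -> int:
    """Number of audit records accumulated over the given number of days."""
    return AUTHS_PER_DAY * AUDITS_PER_AUTH * days


def audit_bytes(days: int) -> int:
    """Audit log volume if each record is about the size of a request."""
    return audit_records(days) * REQUEST_BYTES


if __name__ == "__main__":
    print(f"average load: {avg_requests_per_second():,.0f} requests/sec")
    print(f"audit records in 10 days: {audit_records(10):,}")
    print(f"audit records in a year: {audit_records(365):,}")
    print(f"audit volume in 10 days: {audit_bytes(10) / 1e12:.1f} TB")
```

The average works out to roughly 2,800 requests per second; since the slide warns of high variance between peak and average, capacity would need to be sized well above that.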
10. Open APIs
• Aadhaar Services
– Core Authentication API and supporting Best
Finger Detection, OTP Request APIs
– New services being built on top
• Aadhaar Open Standards for Plug-n-play
– Biometric Device API
– Biometric SDK API
– Biometric Identification System API
– Transliteration API for Indian Languages
12. Patterns & Technologies
• Principles
– POJO based application implementation
– Light-weight, custom application container
– HTTP gateway for APIs
• Compute Patterns
– Data locality
– Distribute compute (within an OS process and across)
• Compute Architectures
– SEDA – Staged Event Driven Architecture
– Master-Worker(s) compute grid
• Data Access Types
– High throughput streaming: bio-dedupe, analytics
– High volume, moderate latency: workflow, UID records
– High volume, low latency: auth, demo-dedupe, search – eAadhaar, KYC
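A minimal sketch of the SEDA idea, assuming each stage owns an input queue and its own pool of worker threads so that stages can be sized independently. The stage names, the `Stage` class, and the placeholder handlers are hypothetical; the deck itself describes a Java/JVM system, and this Python version only illustrates the pattern.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker to shut down


class Stage:
    """One SEDA stage: an input queue drained by its own worker threads."""

    def __init__(self, name, handler, workers=2, downstream=None):
        self.name = name
        self.handler = handler
        self.downstream = downstream
        self.inbox = queue.Queue()
        self.threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]
        for thread in self.threads:
            thread.start()

    def _run(self):
        while True:
            event = self.inbox.get()
            if event is _STOP:
                break
            result = self.handler(event)
            if self.downstream is not None:
                self.downstream.inbox.put(result)

    def close(self):
        """Stop the workers once the events already queued are processed."""
        for _ in self.threads:
            self.inbox.put(_STOP)
        for thread in self.threads:
            thread.join()


def run_pipeline(packets):
    """Push packets through validate -> dedupe -> persist (placeholder stages)."""
    results, lock = [], threading.Lock()

    def store(event):
        with lock:
            results.append(event)
        return event

    persist = Stage("persist", store, workers=1)
    dedupe = Stage("dedupe", lambda e: {**e, "duplicate": False},
                   workers=3, downstream=persist)
    validate = Stage("validate", lambda e: {**e, "valid": True},
                     workers=2, downstream=dedupe)
    for packet in packets:
        validate.inbox.put(packet)
    for stage in (validate, dedupe, persist):  # close upstream stages first
        stage.close()
    return results


if __name__ == "__main__":
    print(len(run_pipeline([{"id": i} for i in range(10)])), "events processed")
```

Because stages communicate only through queues, a slow stage (here `dedupe`) can be given more workers without touching the others. Results may arrive out of order, since several workers drain each queue.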
13. Aadhaar Data Stores
(Data consistency challenges..)
• Solr cluster, sharded (all enrolment records/documents – selected demographics only; shards 0, 2, 6, 9, a, d, f in the diagram)
– Low latency indexed read (documents per sec)
– Low latency random search (documents per sec)
• Mongo cluster, sharded (all enrolment records/documents – demographics + photo; shards 1-5 in the diagram)
– Low latency indexed read (documents per sec)
– High latency random search (seconds per read)
• MySQL: UID master (sharded) and Enrolment DB (all UID generated records – demographics only, track & trace, enrolment status)
– Low latency indexed read (milli-seconds per read)
– High latency random search (seconds per read)
• HBase, Region Servers 1..20 (all enrolment biometric templates)
– High read throughput (MB per sec)
– Low-to-medium latency read (milli-seconds per read)
• HDFS, Data Nodes 1..20 (all raw packets)
– High read throughput (MB per sec)
– High latency read (seconds per read)
• NFS, LUNs 1-4 (all archived raw packets)
– Moderate read throughput
– High latency read (seconds per read)
14. Aadhaar Architecture
• Real-time monitoring using Events
• Work distribution using SEDA & Messaging
• Ability to scale within JVM and across
• Recovery through check-pointing
• Sync Http based Auth gateway
• Protocol Buffers & XML payloads
• Sharded clusters
• Near Real-time data delivery to warehouse
• Nightly data-sets used to build dashboards, data marts and reports
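Recovery through check-pointing can be sketched as follows, assuming a worker records its position after each item so that a restart resumes where it left off. The file-based checkpoint, the JSON layout, and the function names are assumptions for illustration, not the mechanism used in the actual system.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item (0 if no checkpoint exists)."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_index": next_index}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, path)


def process_all(items, handler, checkpoint_path):
    """Process items in order, resuming after the last checkpointed item."""
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        handler(items[index])
        save_checkpoint(checkpoint_path, index + 1)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    process_all(["a", "b", "c"], print, path)
    print("next index:", load_checkpoint(path))
```

This gives at-least-once processing: a crash between `handler` and `save_checkpoint` causes that item to be handled again on restart, so the handler should be idempotent.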
16. Learnings
• Make everything API based
• Everything fails
(hardware, software, network, storage)
– System must recover, retry transactions, and self-heal
• Security and privacy should not be an afterthought
• Scalability does not come from one product
• Open scale-out is the only way to go
– Heterogeneous, multi-vendor, commodity compute, growing in a
linear fashion. Nothing else can adapt!
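The "everything fails, so retry" learning can be illustrated with a small retry helper using exponential backoff and jitter. This is a generic sketch, not the retry logic of the Aadhaar system; the function name `retry` and its parameters are assumptions.

```python
import random
import time


def retry(operation, attempts=5, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Between tries, waits base_delay * 2**n (capped at max_delay), scaled by
    random jitter so that many clients do not retry in lock-step.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_write():
        # Hypothetical operation that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise IOError("transient failure")
        return "written"

    print(retry(flaky_write), "after", state["calls"], "calls")
```

Retrying is only safe when the operation is idempotent or the receiver can detect duplicates; otherwise a retried transaction may be applied twice.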
17. Thank You!
Dr. Pramod K Varma
pramod.uid@gmail.com
Twitter: @pramodkvarma

Regunath Balasubramaian
regunathb@gmail.com
Twitter: @RegunathB