Hadoop at aadhaar

Hadoop at Aadhaar
(Data Store, OLTP & OLAP)
github.com/regunathb
RegunathB
Bangalore Hadoop Meetup

Enrolment Data
•
600 to 800 million UIDs in 4 years
– 1 million a day with transaction, durability guarantees
– 350+ trillion matches every day
•
~5MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted)
– About 30 TB I/O every day
– Replication and backup across DCs of about 5+ TB of incremental
data every day
– Lifecycle updates and new enrolments will continue for ever
•
Enrolment data moves from very hot to cold, needing
multi-layered storage architecture
•
Additional process data
– Several million events on an average moving through async
channels (some persistent and some transient)
– Needing insert and update guarantees across data stores
2

Authentication Data
•
100+ million authentications per day (10 hrs)
– Possible high variance on peak and average
– Sub second response
– Guaranteed audits
•
Multi-DC architecture
– All changes needs to be propagated from enrolment data stores to
all authentication sites
•
Authentication request is about 4 K
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB encrypted audit logs in 10 days
– Audit write must be guaranteed
3

Aadhaar Data Stores
Mongo cluster
(all enrolment records/documents
– demographics + photo)
Shard
1
Shard
4
Shard
5
Shard
2
Shard
3 Low latency indexed read (Documents per sec),
High latency random search (seconds per read)
MySQL
(all UID generated records - demographics only,
track & trace, enrolment status )
Low latency indexed read (milli-
seconds per read),
High latency random search (seconds
per read)
UID master
(sharded)
Enrolment
DB
Solr cluster
(all enrolment records/documents
– selected demographics only)
Low latency indexed read (Documents per sec),
Low latency random search (Documents per sec)
Shard
0
Shard
2
Shard
6
Shard
9
Shard
a
Shard
d
Shard
f
HDFS
(all raw packets)
Data
Node 1
Data
Node 10
Data
Node ..
High read throughput (MB per sec),
High latency read (seconds per read)
Data
Node 20
HBase
(all enrolment
biometric templates)
Region
Ser. 1
Region
Ser. 10
Region
Ser. ..
High read throughput (MB per sec),
Low-to-Medium latency read (milli-seconds per read)Region
Ser. 20
NFS
(all archived raw packets)
Moderate read throughput,
High latency read (seconds per read)
LUN 1 LUN 2 LUN 3 LUN 4

Systems Architecture
•
Work distribution
using SEDA &
Messaging
•
Ability to scale
within JVM and
across
•
Recovery through
check-pointing
•
Sync Http based
Auth gateway
•
Protocol Buffers &
XML payloads
•
Sharded clusters
•
Near Real-time data delivery to warehouse
•
Nightly data-sets used to build dashboards,
data marts and reports
•
Real-time monitoring using Events

Enrolment Biometric Middleware
•
Distribute, Reconcile biometric data extraction and de-dup
requests across multiple vendors (ABISs)
•
Biometric data de-referencing/read service(Http) over
sharded HDFS and NFS
– Serves bulk of the HDFS read requests (25TB per day)
– Locate data from multiple HDFS clusters
●
Sharded by read/write patterns : New, Archive,
Purge
•
Calculates and maintains Volume allocation, SLA breach
thresholds of ABISs
– Thresholds stored in ZK and pushed to middleware
nodes
6

Event Streams & Sinks
•
Event framework supporting different interaction/data
durability patterns
– P2P, Pub-Sub
– Intra-JVM and Queue destinations - Durable / Non-Durable
– Fire & Forget, Ack. after processing
•
Event Sinks
– Ephemeral data consumed by counters, metrics (dashboard)
– Rolling file appenders that push data to HDFS
●
Primary mechanism for delivering raw fact data from
transactional systems to the warehouse staging area
7

Data Analysis
•
Statistical analysis from millions of events
– View into quality of enrolments – e.g. Enrolment
Agencies, Operators
– Feature introduction – e.g. Based on avg. time taken for
biometric capture, demographic data input
– Enrolment volumes – e.g. By Registrar, Agency,
Operator etc
●
Useful in fraud detection
•
Goal to share anonymized data sets for use by industry and
academia – information transparency
•
Various reports – Self-serve, Canned, Operational and/or
Aggregates
8

UID BI Platform
Data Analysis architecture
9
Data Access Framework
UIDAI Systems
Events
(Rabbit MQ)
Server DB
(MySQL)
Hadoop HDFS
Data Warehouse (HDFS/Hive)
Event CSV
Fact DataDimension Data
Datasets
On-Demand Datasets
Datamarts
(MySQL)
Raw Data
Dimension Data
(MySQL)
Pig
Pentaho Kettle
Hive
Pentaho Kettle
Canned Reports Dashboard
Self-service
Analytics
Pentaho BI
FusionCharts
E-mail/Portal/Others

Hadoop stack summary
•
CDH2 (Enrolment, Analysis), CDH3(Authentication)
•
Data Store
– HDFS : Enrolment, Events, Audit Logs, Warehouse
– HBase : Biometric templates used in Authentication
•
Coordination/Config
– ZK : Biometric middleware thresholds
•
Analysis
– Pig : ETL for loading analysis data from staging to atomic
warehouse
– Hive : Dataset generation framework
10

Learnings
•
Watch out for“too many small files”. HDFS is better suited for
fewer but large files
•
Data loss from HDFS in spite of having 3 replica copies – maybe
fixed in releases post CDH2?
•
Give careful consideration to HBase table design – row key
primarily to avoid region-server hot-spotting
•
Hive data (HDFS files) does not handle duplicate records – can
be an issue if data injestion is replayed for data sets
– Hive over Hbase is a viable alternative
11

References
•
Aadhaar Portal :
https://portal.uidai.gov.in/uidwebportal/dashboard.do
•
Data Portal :
https://data.uidai.gov.in/uiddatacatalog/dataCatalogHom
e.do
•
Analytics whitepaper :
http://uidai.gov.in/images/FrontPageUpdates/uid_doc_30
012012.pdf
12

Hadoop at aadhaar

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (13)

Similar to Hadoop at aadhaar

Similar to Hadoop at aadhaar (20)

Recently uploaded

Recently uploaded (20)

Hadoop at aadhaar