Building a Data Lake on S3 for IoT Workloads

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Building a Data Lake on Amazon Object Storage for IoT Workloads
PD Dutta
Sr. Product Manager, Amazon S3

What We Will Cover
• Principles of IoT
• Why object storage and Amazon S3?
• How can you architect an IoT solution on Amazon S3 and Glacier?

Principles of IoT
Agility Scalability Cost Security

IoT Verticals
§ Healthcare & Life Sciences
§ Smart Home
§ Manufacturing & Logistics
§ Automotive
§ Agriculture
§ Retail

Common IoT trends: Why Object Storage?
§ Scale
§ Performance at Scale
§ Durability and Security
§ Analytics on unstructured data
§ Cost-Effective

Amazon S3 for IoT Workloads
§ Scalable
§ Virtually Unlimited number of objects
§ Very high bandwidth – no aggregate throughput limit
§ Increased Data Protection
§ Highly available – can tolerate AZ failure
§ Designed for 99.999999999% durability
§ Secure – SSL, client/server-side encryption at rest
§ Cost-Effective:
§ Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
§ No need to run compute clusters for storage

IoT Data Lifecycle
§ Data Collection and Ingestion
§ Data Processing
§ Data Storage – durability, availability and security
§ Data Analytics and Visualization

IoT Data Lake with S3, AWS IoT, Kinesis; Query in Place by Athena/Spectrum
[Architecture diagram with numbered steps 1-10. Components shown: IoT devices sending MQTT messages; AWS IoT gateway; IoT Rules Engine; Kinesis Firehose (batch incoming files, JSON inputs); raw data in S3; Kinesis Streams/Analytics (stream/detect anomaly); Kinesis Firehose; processed data in S3; batch processing and analysis; S3 Lifecycle Policies; Amazon Glacier; Athena or Redshift Spectrum; Amazon QuickSight.]

Use Case: Improve Driver Safety with Connected Cars

Data Collection and Processing
§ Collection:
§ AWS IoT – process and route IoT MQTT messages to AWS endpoints
§ E.g. Sensor data - temperature, humidity, sound levels collected in a JSON payload
§ AWS IoT Rules Engine – select, process and send data to other AWS services such as Amazon Kinesis to set up delivery streams
§ Ingestion:
§ Set up different Firehose streams to ingest the data into S3
§ Batch and encrypt the data as it is ingested into S3
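
As a rough illustration of the collection and ingestion steps above, the sketch below serializes sensor readings to JSON and joins them with newlines, which is the shape a batched object takes when records are delivered with a newline separator. The field names (`device_id`, `temperature`, `humidity`, `sound_level`) and the helper names are hypothetical and not taken from the deck.

```python
import json


def make_payload(device_id, temperature, humidity, sound_level):
    """Serialize one sensor reading as a JSON string (hypothetical field names)."""
    reading = {
        "device_id": device_id,
        "temperature": temperature,
        "humidity": humidity,
        "sound_level": sound_level,
    }
    return json.dumps(reading, sort_keys=True)


def batch_records(payloads, separator="\n"):
    """Join JSON payloads into one newline-delimited batch."""
    return separator.join(payloads) + separator


def parse_batch(batch):
    """Read a newline-delimited batch back into a list of dicts."""
    return [json.loads(line) for line in batch.splitlines() if line]


batch = batch_records([
    make_payload("device-1", 21.5, 40, 55),
    make_payload("device-2", 19.0, 62, 48),
])
rows = parse_batch(batch)
```

Keeping one JSON document per line means downstream readers can split the batched object without a streaming JSON parser.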

Steps to Collect and Ingest Data
Delivery stream 1: Batch raw data
  Name: IoT-Source
  S3 bucket: <your unique name>-kinesis
  S3 prefix: /source/<key-name-randomizer>/
Delivery stream 2: Batch Output Device Data
  Name: IoT-Destination-Data
  S3 prefix: /data/<key-name-randomizer>/
Delivery stream 3: Batch Processed Data
  Name: IoT-Destination-Aggregate
  S3 prefix: /aggregate/<key-name-randomizer>/
Set up AWS IoT Rule to receive and forward incoming data
  Name: IoT_to_Firehose
  Attribute: *
  Topic Filter: /sbs/devicedata/#
  Add Action: Send messages to an Amazon Kinesis Firehose stream (select IoT-Source-Stream from dropdown)
  Separator: "\n (newline)"
{
  "sql": "SELECT * FROM 'my-topic'",
  "ruleDisabled": false,
  "awsIotSqlVersion": "2016-03-23",
  "actions": [{
    "firehose": {
      "roleArn": "arn:aws:iam::123456789012:role/my-iot-role",
      "deliveryStreamName": "my-stream-name"
    }
  }]
}
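
The rule settings listed above can also be assembled programmatically. The sketch below is one possible way to build the same payload as a Python dictionary; the function name `build_firehose_rule` is hypothetical, the SQL version `2016-03-23` is assumed to be the one in use, and the `separator` field is assumed to correspond to the console's Separator setting. No AWS call is made.

```python
import json


def build_firehose_rule(topic_filter, role_arn, stream_name, separator="\n"):
    """Build a topic rule payload that forwards every attribute to Firehose."""
    return {
        "sql": "SELECT * FROM '{}'".format(topic_filter),
        "ruleDisabled": False,
        "awsIotSqlVersion": "2016-03-23",  # assumed SQL version
        "actions": [
            {
                "firehose": {
                    "roleArn": role_arn,
                    "deliveryStreamName": stream_name,
                    "separator": separator,  # newline between delivered records
                }
            }
        ],
    }


rule = build_firehose_rule(
    "/sbs/devicedata/#",
    "arn:aws:iam::123456789012:role/my-iot-role",
    "IoT-Source-Stream",
)
print(json.dumps(rule, indent=2))
```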

Processing incoming Data
§ Real-time data processing and ETL with Kinesis
§ Standard SQL queries to extract specific components from the incoming data
stream
§ Deliver to S3 through separate Kinesis Firehose streams in 1-15 minute intervals
§ aws kinesisanalytics add-application-output \
    --application-name <Name of Analytics Application> \
    --current-application-version-id <number> \
    --application-output 'Name=DESTINATION_SQL_BASIC_STREAM,KinesisFirehoseOutput={ResourceARN=<ARN of IoT-Data-Stream>,RoleARN=<ARN of Analytics application>},DestinationSchema={RecordFormatType=CSV}'
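
The nested `--application-output` argument is easy to mistype, so it can help to generate it. The sketch below builds the output description as a dictionary and renders it in the `key=value,{...}` shorthand used in the command above. The helper names and the example ARNs are hypothetical placeholders, and the renderer only covers the nesting needed here.

```python
def build_output(firehose_arn, role_arn,
                 name="DESTINATION_SQL_BASIC_STREAM", record_format="CSV"):
    """Describe one application output that delivers to a Firehose stream."""
    return {
        "Name": name,
        "KinesisFirehoseOutput": {"ResourceARN": firehose_arn, "RoleARN": role_arn},
        "DestinationSchema": {"RecordFormatType": record_format},
    }


def to_cli_shorthand(value):
    """Render a nested dict as key=value pairs, wrapping nested dicts in braces."""
    if isinstance(value, dict):
        return ",".join(
            "{}={}".format(
                key,
                "{" + to_cli_shorthand(item) + "}" if isinstance(item, dict) else item,
            )
            for key, item in value.items()
        )
    return str(value)


output = build_output(
    "arn:aws:firehose:us-west-2:123456789012:deliverystream/IoT-Data-Stream",  # placeholder
    "arn:aws:iam::123456789012:role/my-analytics-role",  # placeholder
)
shorthand = to_cli_shorthand(output)
print(shorthand)
```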

Encrypting the data in Kinesis
§ Data written with the PutRecord or PutRecords API can be encrypted using an AWS KMS master key
§ Uses the 256-bit Advanced Encryption Standard (AES-256)
§ Set up using the Kinesis management console or the SDK
§ Audit using CloudTrail

S3: Storage Tiered To Your Requirements
§ Durable: 99.999999999%
§ Available: S3 99.99%, S3-IA 99.9%
§ Performant: low latency, high throughput
§ Scalable: elastic capacity, no preset limits
§ Secure: SSE, client encryption, IAM integration
§ Event Notifications: SQS, SNS, and Lambda
§ Versioning: keep multiple copies automatically
§ Cross Region Replication
§ Common Namespace: define storage class per object
§ Lifecycle:
§ S3: "Hot" data (active and/or temporary data)
§ S3-IA: "Warm" data (infrequently accessed data)
§ Glacier: "Cold" data (archive and compliance data)

{
  "Rules": [
    {
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 365
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 60,
          "StorageClass": "GLACIER"
        }
      ],
      "Prefix": "",
      "Expiration": {
        "ExpiredObjectDeleteMarker": true
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 60,
          "StorageClass": "GLACIER"
        }
      ],
      "ID": "S3 to S3IA to Glacier with Recycle Bin S3 + GL"
    }
  ]
}
S3 30 Days (Very Active Data)
S3-IA 30-60 Days (Infrequently Accessed)
Glacier 60+ Days (Rarely Accessed)
Clean up
Glacier 60+ Days (Rarely Restored)
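
To make the tiering schedule concrete, the sketch below maps an object's age in days to the storage class the policy above would place it in, using the 30-day and 60-day thresholds from the rule. It is a simplified model with hypothetical helper names: real lifecycle transitions run asynchronously, so the exact moment of a transition may differ.

```python
# (age threshold in days, storage class) pairs taken from the lifecycle rule
TRANSITIONS = [(30, "STANDARD_IA"), (60, "GLACIER")]
NONCURRENT_EXPIRATION_DAYS = 365


def storage_class_for_age(age_days, transitions=TRANSITIONS):
    """Return the storage class a current object version would be in at a given age."""
    current = "STANDARD"
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


def noncurrent_version_expired(noncurrent_days):
    """Return True once a noncurrent version is old enough to be deleted."""
    return noncurrent_days >= NONCURRENT_EXPIRATION_DAYS


for age in (0, 30, 60):
    print(age, storage_class_for_age(age))
```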

Storage Management for S3
§ Cross-Region Replication
§ Lifecycle Policy
§ Data Classification & Management
§ Event Notifications
§ S3 CloudWatch Metrics
§ S3 Inventory
§ Audit with object-level AWS CloudTrail Data Events
§ S3 Analytics
§ Storage classes: Standard, Standard - Infrequent Access, Amazon Glacier

Securing your data on S3
§ Data in S3 is secure by default – with ACL, IAM and bucket policies
§ Additional security with SSL endpoints, Server Side Encryption (SSE), SSE with customer-provided keys (SSE-C), or SSE-KMS for data at rest
§ Multi-factor encryption
§ Create bucket policies to enforce object encryption

Bucket Policy Examples Using Server Side Encryption
Using SSE-KMS
{
  "Version": "2012-10-17",
  "Id": "PutObjPolicy",
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<bucket_name>/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<bucket_name>/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}
Using SSE-S3
{
  "Version": "2012-10-17",
  "Id": "PutObjPolicy",
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<bucket_name>/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    },
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<bucket_name>/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}
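
The effect of the two Deny statements can be reasoned about with a small model. The sketch below is a simplified stand-in for policy evaluation, not the actual IAM engine: it takes the request headers of an upload and reports whether the policy would reject it. The function name `is_put_denied` is hypothetical.

```python
HEADER = "x-amz-server-side-encryption"


def is_put_denied(headers, required="aws:kms"):
    """Return True if an upload with these headers would be denied.

    Mirrors the two statements: a missing header is denied (Null condition),
    and a header with any other value is denied (StringNotEquals condition).
    """
    value = headers.get(HEADER)
    if value is None:
        return True  # DenyUnEncryptedObjectUploads
    return value != required  # DenyIncorrectEncryptionHeader


print(is_put_denied({}))                      # no encryption header
print(is_put_denied({HEADER: "aws:kms"}))     # matches the SSE-KMS policy
print(is_put_denied({HEADER: "AES256"}, required="AES256"))  # SSE-S3 policy
```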

Performance: What we have told customers
• Use a key-naming scheme with randomness at the beginning for high TPS
– Most important if you regularly exceed 100 TPS on a bucket
– Avoid starting with a date or monotonically increasing numbers
– Consider adding a hash or reversed timestamp (ssmmhhddmmyy)
• Don’t do this…
<my_bucket>/2017_10_06-164533125.jpg
<my_bucket>/2017_10_06-164533126.jpg
<my_bucket>/2017_10_06-164533127.jpg
<my_bucket>/2017_10_06-164533128.jpg
<my_bucket>/2017_10_06-164533129.jpg
<my_bucket>/2017_10_06-164533130.jpg
<my_bucket>/2017_10_06-164533131.jpg
<my_bucket>/2017_10_06-164533132.jpg
<my_bucket>/2017_10_06-164533133.jpg
<my_bucket>/2017_10_06-164533134.jpg
<my_bucket>/2017_10_06-164533135.jpg
<my_bucket>/2017_10_06-164533136.jpg

Distributing key names
• …because this is going to happen if you don’t
[Diagram: transactions 1, 2, … N and four partitions]

• Add randomness to the beginning of the key name…
<my_bucket>/521335461-2017_11_13.jpg
<my_bucket>/465330151-2017_11_13.jpg
<my_bucket>/987331160-2017_11_13.jpg
<my_bucket>/465765461-2017_11_13.jpg
<my_bucket>/125631151-2017_11_13.jpg
<my_bucket>/934563160-2017_11_13.jpg
<my_bucket>/532132341-2017_11_13.jpg
<my_bucket>/565437681-2017_11_13.jpg
<my_bucket>/234567460-2017_11_13.jpg
<my_bucket>/456767561-2017_11_13.jpg
<my_bucket>/345565651-2017_11_13.jpg
<my_bucket>/431345660-2017_11_13.jpg

• …so your transactions can be distributed across the partitions
[Diagram: transactions 1, 2, … N distributed across four partitions (Best Practice)]
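
Two ways of adding randomness to the start of a key name, as suggested above, are sketched below: a short hash prefix and a reversed timestamp in `ssmmhhddmmyy` order. The function names, the choice of MD5 and the four-character prefix length are illustrative assumptions, not a prescribed scheme.

```python
import hashlib
from datetime import datetime


def hashed_key(name, prefix_len=4):
    """Prefix a key with the first hex characters of its MD5 digest."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "{}-{}".format(digest[:prefix_len], name)


def reversed_timestamp_key(moment, suffix):
    """Prefix a key with the timestamp written in ssmmhhddmmyy order."""
    return moment.strftime("%S%M%H%d%m%y") + "-" + suffix


# Sequential names like the ones in the "Don't do this" list
names = ["2017_10_06-1645331{}.jpg".format(n) for n in range(25, 37)]
keys = [hashed_key(name) for name in names]
stamp_key = reversed_timestamp_key(datetime(2017, 10, 6, 16, 45, 33), "photo.jpg")
```

The hash prefix is deterministic, so a reader that knows the original name can recompute the full key without a lookup table.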

Analyzing with Amazon Athena
• Amazon Athena: Analyze data directly from Amazon S3 using standard SQL
§ No loading of data
§ Query data in its raw format - no ETL required
§ Stream data directly from Amazon S3
§ Take advantage of Amazon S3 durability and availability

Amazon Athena is Fast
§ Tuned for performance; automatically parallelizes queries
§ Results are streamed to console and also stored on S3
§ Improve Query performance
§ Data Partitioning benefits: reduce amount of data scanned, reduce costs
§ Prefer Hive-compatible partition naming
§ [column_name = column_value]
§ e.g. s3://athena-examples/logs/year=2017/month=5/
§ Simple partition naming is also supported
§ e.g. s3://athena-examples/logs/2017/5/
§ Encryption and Athena
§ Athena can read from encrypted S3 buckets (SSE-S3, SSE-KMS, CSE-KMS)
§ Athena can write results to encrypted S3 buckets
§ In-transit encryption between S3 and Athena and between Athena resources
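
The two partition naming styles above differ only in how each path segment is written. A minimal sketch that builds both forms follows; the helper names are hypothetical, and the bucket and values are the ones from the examples.

```python
def hive_partition_prefix(base, partitions):
    """Build a Hive-compatible prefix from (column_name, column_value) pairs."""
    parts = ["{}={}".format(column, value) for column, value in partitions]
    return base.rstrip("/") + "/" + "/".join(parts) + "/"


def simple_partition_prefix(base, values):
    """Build a simple prefix that carries only the partition values."""
    return base.rstrip("/") + "/" + "/".join(str(value) for value in values) + "/"


hive_style = hive_partition_prefix(
    "s3://athena-examples/logs", [("year", 2017), ("month", 5)]
)
simple_style = simple_partition_prefix("s3://athena-examples/logs", [2017, 5])
print(hive_style)
print(simple_style)
```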

Athena Example – Different File Formats
• SELECT count(*) as count FROM taxi_rides_csv
• Run time: 20.06 seconds, Data scanned: 207.54GB, count: 1,310,911,060
• SELECT count(*) as count FROM taxi_rides_parquet
• Run time: 5.76 seconds, Data scanned: 0KB, count: 2,870,781,820
• SELECT * FROM taxi_rides_csv limit 1000
• Run time: 3.13 seconds, Data scanned: 328.82MB
• SELECT * FROM taxi_rides_parquet limit 1000
• Run time: 1.13 seconds, Data scanned: 5.2MB
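
The figures above can be turned into ratios to compare the two formats. The sketch below only does arithmetic on the numbers quoted on the slide; the variable names are illustrative.

```python
def ratio(before, after):
    """Return how many times larger `before` is than `after`."""
    return before / after


# Run times in seconds and data scanned in MB, as quoted above
count_speedup = ratio(20.06, 5.76)    # count(*) on CSV vs Parquet
limit_speedup = ratio(3.13, 1.13)     # SELECT * ... limit 1000 on CSV vs Parquet
scan_reduction = ratio(328.82, 5.2)   # data scanned by the limit query

print(round(count_speedup, 2), round(limit_speedup, 2), round(scan_reduction, 1))
```

Because Athena charges by data scanned, the scan ratio matters as much as the run-time ratio when comparing formats.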

Analyzing with Redshift Spectrum
§ Put new or existing Parquet, CSV or ORC files in an S3 folder
s3://<my-bucket>/data/<key-name-randomizer>/<temperature-parq>/
§ Define an external schema in Redshift and point it to an external catalog (Data Catalog or Hive metastore)
create external schema if not exists s3
from data catalog database 'default' region 'us-west-2'
(or from hive metastore database 'default' uri '172.12.34.56' port 9083)
iam_role 'arn:aws:iam::123456789999:role/Redshift-S3'
§ Define an external table under the external schema, pointed to an S3 location
create external table s3.<temperature-parq>
(L_ORDERKEY BIGINT, ….)
stored as parquet
location 's3://<my-bucket>/data/<key-name-randomizer>/<temperature-parq>/'
§ Run a query from Redshift against data in S3
SELECT DEVICE_TYPE, COUNT(*)
FROM s3.<temperature-parq>
WHERE x_HUMIDITY > '75%'
GROUP BY DEVICE_TYPE

Thank You
