Amazon Cloud Drive's plan to offer a low-cost, unlimited storage service presented a major engineering challenge. In this session, you learn how the Amazon Cloud Drive team designed and optimized its storage back end on Amazon S3 to handle millions of users while containing infrastructure costs. The lead engineers share details of how they built the service for massive scale and the steps they regularly take to increase performance and efficiency. They also describe proven scaling and optimization techniques learned from experience.
2. What to expect from the session
• What is Amazon Cloud Drive?
• Key Challenges
• Services design & architecture
• Content store deep dive
• Lessons learned
3. What Is Amazon Cloud Drive?
• Unlimited cloud storage from Amazon for consumers
• Subscription based storage plans
Unlimited photos
Unlimited photo storage, plus
5 GB for videos and files for just
$11.99 per year.
Unlimited everything
Securely store all of your photos,
videos, files and documents for just
$59.99 per year.
https://www.amazon.com/clouddrive/
4. How do I use it from anywhere and any device?
Amazon Apps for photos and files
Mobile Computer
Mac PC Web
https://www.amazon.com/clouddrive/apps
5. What’s in it for developers & partners?
Reach millions of customers
RESTful APIs
Android & iOS SDKs
Revenue sharing
https://developer.amazon.com/public/apis/experience/cloud-drive/
6. A growing partner ecosystem
Access to millions of Amazon customers
Revenue-sharing for developers and partners (new!)
https://www.amazon.com/clouddrive/apps
7. Key challenges?
• Unlimited storage
• Millions of users
• Billions of files
• Variety of content (photos/videos/docs)
• Variety of metadata
• Flexible indexing & querying
• Terabytes of logs
9. Amazon Cloud Drive service architecture
[Architecture diagram. Components shown: Users and Apps; Amazon ELB; Amazon Cloud Drive service (Amazon EC2); Content Store (Amazon S3); Metadata Store (Amazon DynamoDB); Asynchronous Pipeline (Amazon Kinesis stream, message queue); Content Processing (Amazon Elastic Transcoder); Indexing & Query; Analytics; Notifications]
10. What does Cloud Drive store in Amazon S3?
• Customer content
• Derived content
• Transcoded videos
• Thumbnails of videos, documents
• Log files
• Dynamic configuration
• DynamoDB backups
• Using the publicly available AWS Java SDK
11. Storing customer content
• Single Amazon S3 bucket per geographical region
• Billions of objects per content bucket
• Randomly generated keys
• Keys are stored in Amazon DynamoDB
• Avoids hot key prefixes
• No list operations
• Amazon S3 server-side encryption
• AES 256
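The key scheme on this slide can be sketched as follows. This is a minimal illustration, not the team's implementation: the 128-bit hex key format, the function names, and the in-memory dictionary standing in for the Amazon DynamoDB metadata table are all assumptions.

```python
import secrets

# Hypothetical stand-in for the DynamoDB table that maps a file to its S3 key.
metadata_store = {}


def new_content_key() -> str:
    """Return a random 128-bit key as 32 hex characters (format is an assumption).

    Random keys spread objects evenly across the key space, so no prefix
    becomes hot.
    """
    return secrets.token_hex(16)


def register_upload(customer_id: str, node_id: str) -> str:
    """Generate a key and record it, so the object is found by lookup, never by listing."""
    key = new_content_key()
    metadata_store[(customer_id, node_id)] = key
    return key


def lookup_key(customer_id: str, node_id: str) -> str:
    """Resolve the S3 key for a file from the metadata store."""
    return metadata_store[(customer_id, node_id)]
```

Because every key is recorded at upload time, the service never needs a list operation on a bucket holding billions of objects.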
12. Managing log files
• Cloud Drive consists of 800+ servers in 3 AWS regions
• More during peak load times
• 200GB+ logs per hour
• Delivered to Timber log archiving service
• Timber encrypts and stores in Amazon S3
13. Log file types
• Application logs
• Time-stamped and severity-tagged messages
• Service logs
• Amazon-wide standard format
• Record per service invocation
• Source for metrics
• Wire logs
17. Coordinating dynamic configuration
• Dynamic values like feature toggles
• Enable feature for test customers
• Dial capabilities up from 0% -> 100%
• Configuration files stored in S3
• Servers poll for changes using HTTP HEAD (GetObjectMetadata)
• File is reloaded only if ETag has changed
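The polling and dial-up behavior above might look roughly like the sketch below. The `head` and `get` callables are hypothetical stand-ins for the S3 GetObjectMetadata and GetObject calls, and hashing the customer ID into 100 buckets is an assumed way to implement the 0% -> 100% dial; the slides do not say how customers are selected.

```python
import hashlib
from typing import Callable, Optional


class ConfigPoller:
    """Reload a configuration file only when its ETag changes."""

    def __init__(self, head: Callable[[], str], get: Callable[[], bytes]):
        self._head = head      # returns the current ETag (cheap HEAD request)
        self._get = get        # returns the file body (full GET request)
        self._etag: Optional[str] = None
        self.config: Optional[bytes] = None
        self.reloads = 0

    def poll(self) -> bool:
        """Check the ETag; fetch the body only if it differs. Return True on reload."""
        etag = self._head()
        if etag == self._etag:
            return False
        self.config = self._get()
        self._etag = etag
        self.reloads += 1
        return True


def in_rollout(customer_id: str, percent: int) -> bool:
    """Place each customer in a stable bucket 0..99; enable the feature below `percent`."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent
```

A stable hash means a customer who sees a feature at 10% keeps seeing it as the dial moves toward 100%.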
18. Challenge 1/6: Upload size variation
• Uploads vary widely in size
• Text files to VM images
• Even images vary from 10K GIFs to 20MB RAW
• Maintain reasonable performance for all file sizes
• Prevent large files from causing resource starvation
19. Challenge 1/6: Upload size variation
• Solution: Size-aware upload logic
• Size < 15MB: PUT object
• Upload performed by the request thread
• Size larger or unknown: multipart upload API
• Parts uploaded by a thread pool with a blocking queue in front
• Fixed-size 5MB parts
• 50GB file size limit, due to 10,000 part limit for multipart API
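The size thresholds on this slide, and the arithmetic behind the file size limit, can be worked through in a short sketch. The function name and return shape are illustrative assumptions, and MB is taken as 2^20 bytes, which makes the limit 5 MB x 10,000 parts = 52,428,800,000 bytes, or roughly 50 GB.

```python
from typing import Optional

MB = 1024 * 1024                      # assumption: binary megabytes
SINGLE_PUT_LIMIT = 15 * MB            # below this, one PUT on the request thread
PART_SIZE = 5 * MB                    # fixed multipart part size
MAX_PARTS = 10_000                    # multipart API part limit
MAX_FILE_SIZE = PART_SIZE * MAX_PARTS # roughly 50 GB


def plan_upload(size: Optional[int]) -> dict:
    """Pick an upload strategy from the declared size (None means unknown)."""
    if size is not None and size < SINGLE_PUT_LIMIT:
        return {"strategy": "put", "parts": 1}
    if size is None:
        # Unknown length: stream parts until the body ends.
        return {"strategy": "multipart", "parts": None}
    if size > MAX_FILE_SIZE:
        raise ValueError("file exceeds the multipart size limit")
    parts = (size + PART_SIZE - 1) // PART_SIZE  # integer ceiling division
    return {"strategy": "multipart", "parts": parts}
```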
20. Challenge 2/6: Rapid upload availability
• Content should be available as soon as possible
• But some content processing takes time
• Solution: a mix of synchronous, asynchronous, and optimistic synchronous processing
21. Challenge 2/6: Rapid upload availability
• Metadata extraction from images and videos
• Quick
• Largely independent of file size
• Mode: synchronous
22. Challenge 2/6: Rapid upload availability
• Video transcoding
• Necessary for playback on different devices
• Time consuming and size dependent
• We use the Amazon Elastic Transcoder service
• Mode: asynchronous
23. Challenge 2/6: Rapid upload availability
• Document transformation to PDF
• Timing is unpredictable
• Try synchronous with a timeout
• If timeout, queue SQS message for async processing
• Mode: optimistic synchronous
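The "try synchronous with a timeout, then queue" pattern from this slide can be sketched as below. The in-process `queue.Queue` is a stand-in for the SQS queue, and `convert`, `process_optimistically`, and the pool size are hypothetical names and values chosen for illustration.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

executor = ThreadPoolExecutor(max_workers=4)  # pool size is an assumption
async_queue = queue.Queue()                   # stand-in for the SQS queue


def process_optimistically(doc_id, convert, timeout_s):
    """Try the conversion inline; if it overruns the timeout, queue it instead."""
    future = executor.submit(convert, doc_id)
    try:
        return ("done", future.result(timeout=timeout_s))
    except FutureTimeout:
        # cancel() only helps if the task has not started; a running task
        # is simply abandoned here and redone by the asynchronous worker.
        future.cancel()
        async_queue.put(doc_id)
        return ("queued", None)
```

The caller can respond to the upload immediately in both cases; only the availability of the derived PDF differs.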
24. Challenge 3/6: Intermittent connections
• Clients may have slow and intermittent connections to our service
• Especially mobile devices
• This makes uploading a large file in a single HTTP request difficult
• But multipart upload APIs are complex
• Needlessly so for the happy path, where a single request succeeds
• Solution: Resumable uploads
25. Challenge 3/6: Intermittent connections
• Client attempts large upload
• If it fails mid-stream, Cloud Drive saves the transmitted bytes
• Leveraging existing Amazon S3 multipart upload
• Client queries for resumption point
• Client resumes upload
• HTTP Content-Range header
• Cloud Drive completes multipart upload
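The resumption flow above can be illustrated with a small server-side sketch. The in-memory buffer stands in for the parts already saved through Amazon S3 multipart upload, and the class and method names are hypothetical; only the `Content-Range` syntax (`bytes first-last/total`) comes from HTTP itself.

```python
import re
from typing import Tuple

_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+)$")


def parse_content_range(header: str) -> Tuple[int, int, int]:
    """Parse 'bytes first-last/total' into (first, last, total)."""
    match = _CONTENT_RANGE.match(header.strip())
    if not match:
        raise ValueError("malformed Content-Range")
    first, last, total = (int(g) for g in match.groups())
    if first > last or last >= total:
        raise ValueError("inconsistent Content-Range")
    return first, last, total


class ResumableUpload:
    """Tracks how many bytes of one upload have been saved so far."""

    def __init__(self, total_size: int):
        self.total_size = total_size
        self.received = bytearray()  # stand-in for saved multipart parts

    @property
    def resume_offset(self) -> int:
        """The resumption point a client queries for after a failure."""
        return len(self.received)

    def append(self, content_range: str, body: bytes) -> bool:
        """Save the bytes that arrived; return True once the upload is complete."""
        first, last, total = parse_content_range(content_range)
        if total != self.total_size or first != self.resume_offset:
            raise ValueError("range does not start at the resumption point")
        if len(body) > last - first + 1:
            raise ValueError("body longer than declared range")
        # The body may be shorter than declared if the connection dropped.
        self.received.extend(body)
        return self.resume_offset == self.total_size
```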
26. Challenge 3/6: Intermittent connections
• Problem: Can’t use instance profile credentials from different instances for a single multipart upload
27. Challenge 3/6: Intermittent connections
• We used the AWS Security Token Service (STS) to provide consistent credentials for each step of the upload
• Amazon S3 presigned URLs are another option
• http://amzn.to/1FLeoii
28. Challenge 4/6: Download size variation
• Like uploads, downloads vary widely in size
• Maintain reasonable performance for all file sizes
• Prevent large requests from causing resource starvation
• Solution: Size-aware download logic
29. Challenge 4/6: Download size variation
• Small downloads (<5MB)
• Single GET object
• In the request thread
• Retry once on failure
• This covers 90% of our customers’ files
30. Challenge 4/6: Download size variation
• Large downloads
• Custom parallel download logic for large files
• 5MB part size (range requests)
• Dedicated thread pool with blocking queue to avoid affecting uploads and small-file downloads
• Connection reuse
• Single retry on failure or timeout
• Uses Apache HTTPClient
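The large-download path can be sketched as ranged requests issued from a dedicated pool, each retried once. This is an illustration only: `fetch_range` is a hypothetical callable standing in for an HTTP range GET against Amazon S3, and Python's `ThreadPoolExecutor` uses an unbounded work queue, so the bounded blocking queue from the slide is not modeled here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

PART_SIZE = 5 * 1024 * 1024  # 5 MB parts, as on the slide


def byte_ranges(size: int, part_size: int = PART_SIZE) -> List[Tuple[int, int]]:
    """Split [0, size) into inclusive (first, last) byte ranges."""
    return [
        (start, min(start + part_size, size) - 1)
        for start in range(0, size, part_size)
    ]


def fetch_with_retry(fetch_range: Callable[[int, int], bytes],
                     first: int, last: int) -> bytes:
    """Fetch one range, retrying exactly once on any failure."""
    try:
        return fetch_range(first, last)
    except Exception:
        return fetch_range(first, last)  # a second failure propagates


def parallel_download(size: int,
                      fetch_range: Callable[[int, int], bytes],
                      pool: ThreadPoolExecutor,
                      part_size: int = PART_SIZE) -> bytes:
    """Download all ranges on the dedicated pool and reassemble them in order."""
    futures = [
        pool.submit(fetch_with_retry, fetch_range, first, last)
        for first, last in byte_ranges(size, part_size)
    ]
    return b"".join(f.result() for f in futures)
```

Keeping this pool separate from the request threads is what prevents a few very large downloads from starving uploads and small-file downloads.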
31. Challenge 5/6: Thumbnails of large images
• High traffic for thumbnails of images
• 3000+ requests per second
• Image thumbnails generated on-the-fly
• Thumbnails of large images are expensive
• Large object to download from Amazon S3
• More time to generate thumbnail
32. Challenge 5/6: Thumbnails of large images
[Diagram: Content Bucket, Cloud Drive, Thumbnail Bucket]
• Solution: Create an intermediate JPEG thumbnail and cache it in Amazon S3
33. Challenge 5/6: Thumbnails of large images
• Cache in S3 bucket with 48 hour expiry
• Key on hash of customer id + image id + image version
• 2k X 2k JPEG, ~1MB
• Cache candidates:
• JPEG, PNG, TIFF >10MB
• All other images (primarily RAW)
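The cache key and candidate rules above might be expressed as in this sketch. The choice of SHA-256, the `/` separator, and the function names are assumptions; the slide only states that the key is a hash of customer ID, image ID, and image version. Reading "all other images" as cached regardless of size is also an interpretation.

```python
import hashlib

MB = 1024 * 1024
COMMON_FORMATS = {"jpeg", "png", "tiff"}  # cached only when larger than 10 MB


def cache_key(customer_id: str, image_id: str, version: int) -> str:
    """Derive the cache object key; a new image version yields a new key."""
    raw = f"{customer_id}/{image_id}/{version}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def is_cache_candidate(image_format: str, size_bytes: int) -> bool:
    """Decide whether an intermediate thumbnail should be cached for this image."""
    if image_format.lower() in COMMON_FORMATS:
        return size_bytes > 10 * MB
    return True  # other formats (primarily RAW) are always candidates
```

Including the version in the key means an edited image never serves a stale intermediate, and the old entry simply ages out with the 48-hour expiry.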
34. Challenge 6/6: Large direct downloads
• No on-the-fly transformations to large files
• Downloading to the server’s disk first doesn’t make sense
• Redirect to a short-lived Amazon S3 presigned URL
35. Takeaways
• Amazon S3 is flexible
• Not just for big data, but also for caching and coordinating configuration
• Selection of Amazon S3 keys is important
• Upload and download strategies depend on file size and workflow
• First fallacy of distributed computing: the network is reliable
• Retrying upload and download requests may be appropriate
• Limit retries
36. Final Thoughts
Experience Amazon Cloud Drive
amazon.com/clouddrive
Build Apps with Amazon Cloud Drive API
developer.amazon.com/public/apis/experience/cloud-drive
Earn revenue & reach millions of Amazon customers
http://tinyurl.com/Cloud-drive-revenue