This document discusses using AWS services such as Kinesis, CloudSearch, DynamoDB, Redshift, and EMR to process streaming data and run analytics. It provides code snippets for initializing the services, ingesting and analyzing data with Spark on EMR, and scaling the infrastructure as needed. It also covers updating event data in DynamoDB based on geospatial proximity and lists the SDKs available for interacting with AWS from different programming languages.
9. Amazon SQS vs. Amazon Kinesis

Amazon SQS
- Auto-scaling
  => Simple to set up and operate; easy to deploy a new version to a new queue
- "At least once" delivery
  => Easy to start with a single worker

Amazon Kinesis
- Shard provisioning
  => More cost effective at high scale, once you have tuned the system
- Multiple "exactly once, in order" delivery
  => A set of dedicated workers, working at different intervals and performing different operations
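A practical consequence of the delivery semantics above: SQS is "at least once", so a consumer must tolerate duplicate messages. A minimal single-worker polling loop, sketched with boto3 (the queue name and the process() handler are placeholders, not from the deck):

import boto3

def process(body):
    print("processing:", body)  # placeholder handler -- make it idempotent

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='events-queue')  # hypothetical queue name

while True:
    # long-poll for up to 10 messages at a time
    for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
        process(message.body)  # "at least once" delivery: duplicates are possible
        message.delete()       # ack only after successful processing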
11. from flask import Flask, request
application = Flask(__name__)

# main entry point for SQS, accepting only POST requests
@application.route("/sqs/", methods=['POST'])
def sqs():
    application.logger.debug('Message was received for processing!')
    doc = parse_request(request)
    load_to_cloudsearch(doc)
    put_into_dynamodb(doc)
    return ""  # OK
12. def load_to_cloudsearch(doc):
    # index the document (g.domain is the CloudSearch domain kept on Flask's app context)
    doc_serv = g.domain.get_document_service()
    doc_serv.add(doc['id'], doc)
    application.logger.debug('Inserting docId: %s', doc['id'])
    # send the index batch to CloudSearch
    try:
        doc_serv.commit()
    except CommitMismatchError as e:
        application.logger.error('CommitMismatchError raised')
        for msg in e.errors:
            application.logger.error('Error: %s', msg)
        raise
    finally:
        doc_serv.clear_sdf()  # clear the SDF (batch buffer) for the next iteration
13. def put_into_dynamodb(doc):
    itemData = copy.deepcopy(doc)  # I want different fields in DynamoDB
    del itemData['day']  # TS is enough, day is only for faceting
    # using geohashing for the DynamoDB lookup index
    geojson_location = '{{"coordinates": [{0}, {1}], "type": "Point"}}'.format(doc['latitude'], doc['longitude'])
    itemData['location'] = geojson_location
    geo_server_url = "http://geo-server.elasticbeanstalk.com/wl-dynamodb-geo?point={0},{1}".format(doc['latitude'], doc['longitude'])
    itemData['geohash'] = int(requests.get(geo_server_url).content)
    itemData['geobox'] = itemData['geohash'] // 10000000000000  # truncate the geohash to a coarse "geobox"
    # PUT into the DynamoDB table
    item = Item(g.eventsTable, data=itemData)
    item.save()
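For intuition on the geobox line: the geo-server returns a long integer geohash, and dropping its 13 least-significant digits leaves a coarse prefix shared by nearby points. With a made-up geohash value:

geohash = 5158113071829979530       # hypothetical value returned by the geo-server
geobox = geohash // 10000000000000  # drop the 13 least-significant digits
print(geobox)                       # 515811 -- the geobox value checked on slide 39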
30. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// Define the schema using a case class.
case class Event(event_id: String, time: String, latitude: Float, longitude: Float)
// Create an RDD of Event objects from an S3 "folder" and register it as a table.
val events = sc.textFile("s3://spark-bucket-demo/spark/events").map(_.split(",")).
  map(p => Event(p(0), p(1), p(2).trim.toFloat, p(3).trim.toFloat))
events.registerAsTable("events")
// SQL statements can be run by using the SQL methods provided by sqlContext.
val oct = sql("SELECT event_id FROM events WHERE time >= '2014-10-01' AND time <= '2014-10-31'")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
oct.map(t => "event-id: " + t(0)).collect().foreach(println)
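The SchemaRDD/registerAsTable API above is Spark 1.1-era and has since been replaced by DataFrames. A rough PySpark equivalent of the same query, assuming the same comma-separated layout:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read the CSV events and name the columns to match the case class
events = (spark.read.csv("s3://spark-bucket-demo/spark/events")
          .toDF("event_id", "time", "latitude", "longitude"))
events.createOrReplaceTempView("events")

oct_events = spark.sql(
    "SELECT event_id FROM events WHERE time >= '2014-10-01' AND time <= '2014-10-31'")
for row in oct_events.collect():
    print("event-id: " + row[0])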
39. // mark an event as closed only if the report comes from the same geo-box
Table table = dynamo.getTable(TABLE_NAME);
table.updateItem("event-id", "7982e605-dc7d-4199-bc3e-d449733932e2",
    // update expression ("status" is a reserved word, so it goes through a name map)
    "set #s = :status",
    // condition expression
    "geobox = :geobox",
    new NameMap().with("#s", "status"),
    new ValueMap()
        .withString(":status", "close")
        .withInt(":geobox", 515811)
);
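The same conditional update sketched in Python with boto3 (the table name is an assumption; the key and values come from the slide):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('events')  # table name is a placeholder

try:
    table.update_item(
        Key={'event-id': '7982e605-dc7d-4199-bc3e-d449733932e2'},
        UpdateExpression='SET #s = :status',
        ConditionExpression='geobox = :geobox',
        ExpressionAttributeNames={'#s': 'status'},  # "status" is a reserved word
        ExpressionAttributeValues={':status': 'close', ':geobox': 515811},
    )
except ClientError as e:
    # a failed condition just means the report came from a different geo-box
    if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
        raise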
40. SDKs
Java
Python (boto)
PHP
.NET
Ruby
Node.js
iOS
Android
AWS Toolkit for Visual Studio
AWS Toolkit for Eclipse
AWS Tools for Windows PowerShell
AWS CLI
JavaScript (new!)
48. Learn from AWS big data experts
Start-to-finish posts on analyzing and visualizing big data:
blogs.aws.amazon.com/bigdata
49. Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals