This document discusses using AWS services such as Kinesis, CloudSearch, DynamoDB, Redshift, and EMR to process streaming data and run analytics. It provides code snippets for initializing the services, ingesting and analyzing data with Spark on EMR, and scaling the infrastructure as needed. It also covers updating event data in DynamoDB based on geospatial proximity and lists the SDKs available for interacting with AWS from different programming languages.
9. Amazon SQS vs. Amazon Kinesis

Amazon SQS
- Auto-scaling
  => Simple to set up and operate; easy to deploy a new version to a new queue
- "At least once" delivery
  => Easy to start with a single worker

Amazon Kinesis
- Shard provisioning
  => More cost effective at high scale, once you have tuned the system
- Multiple "exactly once, in order" delivery
  => A set of dedicated workers, working at different intervals and performing different operations
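A practical consequence of the delivery semantics above: SQS is "at least once", so a consumer must tolerate duplicate messages. A minimal single-worker polling loop, sketched with boto3 (the queue name and the process() handler are placeholders, not from the deck):

import boto3

def process(body):
    print("processing:", body)  # placeholder handler -- make it idempotent

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='events-queue')  # hypothetical queue name

while True:
    # long-poll for up to 10 messages at a time
    for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
        process(message.body)  # "at least once" delivery: duplicates are possible
        message.delete()       # ack only after successful processing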
11. from flask import Flask, request
application = Flask(__name__)

# main entry point for SQS, accepting only POST requests
@application.route("/sqs/", methods=['POST'])
def sqs():
    application.logger.debug('Message was received for processing!')
    doc = parse_request(request)
    load_to_cloudsearch(doc)
    put_into_dynamodb(doc)
    return ""  # OK
12. def load_to_cloudsearch(doc):
    # index the document (g.domain is the CloudSearch domain kept on Flask's app context)
    doc_serv = g.domain.get_document_service()
    doc_serv.add(doc['id'], doc)
    application.logger.debug('Inserting docId: %s', doc['id'])
    # send the index batch to CloudSearch
    try:
        doc_serv.commit()
    except CommitMismatchError as e:
        application.logger.error('CommitMismatchError raised')
        for msg in e.errors:
            application.logger.error('Error: %s', msg)
        raise
    finally:
        doc_serv.clear_sdf()  # clear the SDF (batch buffer) for the next iteration
13. def put_into_dynamodb(doc):
    itemData = copy.deepcopy(doc)  # I want different fields in DynamoDB
    del itemData['day']  # TS is enough, day is only for faceting
    # using geohashing for the DynamoDB lookup index
    geojson_location = '{{"coordinates": [{0}, {1}], "type": "Point"}}'.format(doc['latitude'], doc['longitude'])
    itemData['location'] = geojson_location
    geo_server_url = "http://geo-server.elasticbeanstalk.com/wl-dynamodb-geo?point={0},{1}".format(doc['latitude'], doc['longitude'])
    itemData['geohash'] = int(requests.get(geo_server_url).content)
    itemData['geobox'] = itemData['geohash'] // 10000000000000  # truncate the geohash to a coarse "geobox"
    # PUT into the DynamoDB table
    item = Item(g.eventsTable, data=itemData)
    item.save()
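For intuition on the geobox line: the geo-server returns a long integer geohash, and dropping its 13 least-significant digits leaves a coarse prefix shared by nearby points. With a made-up geohash value:

geohash = 5158113071829979530       # hypothetical value returned by the geo-server
geobox = geohash // 10000000000000  # drop the 13 least-significant digits
print(geobox)                       # 515811 -- the geobox value checked on slide 39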
30. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// Define the schema using a case class.
case class Event(event_id: String, time: String, latitude: Float, longitude: Float)
// Create an RDD of Event objects from an S3 "folder" and register it as a table.
val events = sc.textFile("s3://spark-bucket-demo/spark/events").map(_.split(",")).
  map(p => Event(p(0), p(1), p(2).trim.toFloat, p(3).trim.toFloat))
events.registerAsTable("events")
// SQL statements can be run by using the SQL methods provided by sqlContext.
val oct = sql("SELECT event_id FROM events WHERE time >= '2014-10-01' AND time <= '2014-10-31'")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
oct.map(t => "event-id: " + t(0)).collect().foreach(println)
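The SchemaRDD/registerAsTable API above is Spark 1.1-era and has since been replaced by DataFrames. A rough PySpark equivalent of the same query, assuming the same comma-separated layout:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read the CSV events and name the columns to match the case class
events = (spark.read.csv("s3://spark-bucket-demo/spark/events")
          .toDF("event_id", "time", "latitude", "longitude"))
events.createOrReplaceTempView("events")

oct_events = spark.sql(
    "SELECT event_id FROM events WHERE time >= '2014-10-01' AND time <= '2014-10-31'")
for row in oct_events.collect():
    print("event-id: " + row[0])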
39. // mark an event as closed only if the report comes from the same geo-box
Table table = dynamo.getTable(TABLE_NAME);
table.updateItem("event-id", "7982e605-dc7d-4199-bc3e-d449733932e2",
    // update expression ("status" is a reserved word, so it goes through a name map)
    "set #s = :status",
    // condition expression
    "geobox = :geobox",
    new NameMap().with("#s", "status"),
    new ValueMap()
        .withString(":status", "close")
        .withInt(":geobox", 515811)
);
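The same conditional update sketched in Python with boto3 (the table name is an assumption; the key and values come from the slide):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('events')  # table name is a placeholder

try:
    table.update_item(
        Key={'event-id': '7982e605-dc7d-4199-bc3e-d449733932e2'},
        UpdateExpression='SET #s = :status',
        ConditionExpression='geobox = :geobox',
        ExpressionAttributeNames={'#s': 'status'},  # "status" is a reserved word
        ExpressionAttributeValues={':status': 'close', ':geobox': 515811},
    )
except ClientError as e:
    # a failed condition just means the report came from a different geo-box
    if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
        raise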
40. SDKs
Java
Python (boto)
PHP
.NET
Ruby
Node.js
iOS
Android
AWS Toolkit for Visual Studio
AWS Toolkit for Eclipse
AWS Tools for Windows PowerShell
AWS CLI
JavaScript (new!)
48. Learn from AWS big data experts
Start-to-finish posts on analyzing and visualizing big data:
blogs.aws.amazon.com/bigdata
49. Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals