Learn how to quickly create a highly scalable solution using AWS. We introduce the benefits and challenges you may face. We discuss scope and establish realistic expectations, budgets, and constraints for these types of projects. Finally, we end with a demo of website event tracking and analysis.
A Primer for Your Next Data Science Proof of Concept on the Cloud
1. A Primer for Your Next Data Science Proof of Concept on the Cloud
2. Your Presenters
Alton Alexander
About Me : Data scientist. PhD dropout with a love for
solving real world business problems. Experience delivering
solutions with big data, machine learning, and statistics for
marketing, manufacturing, and finance industries.
Affiliations : Front Analytics Consulting
Connect : altonalexander
alexanderalton@gmail.com
Matt Davies
About Me : “Big Data” architect / engineer with clients in
retail, healthcare, e-commerce, insurance, and government.
Primarily focused on operationalizing complex data-driven
solutions.
Affiliations : Xpert Data Solutions
Connect : 4mattdavies
matt@xds.io
Site : http://xds.io
3. Agenda
● Scoping the problem and solution
● Discuss pros / cons of starting with a cloud solution
● Establishing realistic expectations, budgets, constraints, etc…
● Hands-on demo
● Q&A
5. Scoping the Problem and Solution
● What are you trying to solve?
● What data might be helpful in answering the question(s)?
● Are there specific techniques which are known to work well?
● Do you need to use BI tools and/or export data?
● What are your timelines?
● What are your resources?
6. Competing on Analytics
Product / Customer / Operations
● Who are your potential customers?
● What do they want?
● Brand loyalty?
● What’s next?
● What motivates customers?
● Which channels work best?
● What else do they need?
● What is a “customer”?
● How long will X function?
● How much product waste?
● Will Y be cheaper?
e.g.
● Customer segmentation / profiles
● Market analysis
● Channel attribution
● Keywords
● Engagement
● Enrichment
● Churn
● Conversion
● Offers
● Sentiment
● Profiling
● Yield optimization
● Failure rates
● Futures
7. Pipeline diagram: data collectors, bulk store, batch process, live store, api service, ui
Datacenter and/or AWS
“Analytic”
queues & oltp: *SQS, redis or couch, mongodb, rdbms
olap: mongo, Hbase
Thrift/Protobuf/AVRO
sockets style: *messagepack based, netty
kafka
*kinesis
apps: *elastic beanstalk, ec2/vm + load balance
emitters: *messagepack
*s3, ebs, *HDFS, cassandra, columnar
on file system: M/R based (pig, hive), Graph based
off file system: anything language
diy json: mongodb, *BaaSes, postgres
column: Hbase/Impala, *cassandra
graph: cypher, gremlin
search: elasticsearch
*stupid-simple-n-scale
cloud/dc apps: ec2/vm + load balance, *elastic beanstalk
sql-ish: Phoenix, Cassandra
graph: Gremlin, Cypher
search: elasticsearch
d3: nvd3.js - simple, d3.js - complex, dc.js - dimensions
putting the long-A in OLAP
8. Pros / Cons of Starting With a Cloud Solution
Cons:
● Data sensitivity
● Less control
● Unfamiliarity with terminology and/or design
● The on-prem world is very different from the cloud: terms, risk factors, skillsets
● Data movement can be difficult
● Cloud “tax”
Pros:
● Elasticity
● Scalability
● Speed of implementation
● Focus on business problem
● Can easily create multiple instances for tests
● Less management
● Strong security
● No network or datacenter barriers
● Strong industry adoption
9. Pros / Cons of Starting With a Cloud Solution
General Challenges
● POC -> Production can be difficult
● Security is widely misunderstood
● Skillsets: When to hire, develop, consult, outsource
10. Establishing Realistic Expectations, Budgets, Constraints, Etc...
Use Case : Clear objective with identified stakeholders.
Sufficient Time : Discovery is such a large part of these projects that a projection like “this legacy project took X hrs, so it will take Y in big data” is not reliable.
Iterate : Like all software projects, it is usually better to iterate than to have large waterfall deployments.
Review : What went well, what failed, where is our technical debt?
11. Establishing Realistic Expectations, Budgets, Constraints, Etc...
Budgets
● Know how the cloud provider makes money
● Start lean
● “Leave No Trace”
Constraints
● Time
● Complexity
● Resources
12. Establishing Realistic Expectations, Budgets, Constraints, Etc...
● Sleep on it
● Are you a solution in search of a need?
● Use the scientific method
● Involve yourself in the community
● Hire a consultant
18. Connect to the master and Monitor Cluster
Be sure to configure your security group settings and use a private key to log in:
ssh -i ~/hackathon.pem hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
Let’s set up an SSH tunnel so we can see what is happening on the cluster:
● Hadoop, Ganglia, and other applications publish user interfaces as websites
hosted on the master node. For security reasons, these websites are only
available on the master node's local webserver (http://localhost:port) and are not
published on the Internet.
ssh -i ~/hackathon.pem -ND 8157 hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
● http://ec2-52-91-26-92.compute-1.amazonaws.com
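The tunnel command above can also be assembled in a script, which helps when the master node's hostname changes with every new cluster. What follows is only a minimal sketch: the function name build_tunnel_command and its parameters are hypothetical, and it assumes the same hadoop user and SOCKS port 8157 used on this slide. The flags -N (no remote command) and -D (dynamic SOCKS forwarding) are written separately here, which is equivalent to the combined -ND form.

```python
import os


def build_tunnel_command(key_path, host, user="hadoop", socks_port=8157):
    """Return the argv list for an SSH SOCKS tunnel to the master node.

    Hypothetical helper: the defaults mirror the slide, not an official API.
    """
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),  # expand "~" ourselves; no shell is involved
        "-N",                                # do not run a remote command
        "-D", str(socks_port),               # open a local SOCKS proxy on this port
        "%s@%s" % (user, host),
    ]


cmd = build_tunnel_command(
    "~/hackathon.pem",
    "ec2-52-91-26-92.compute-1.amazonaws.com",
)
print(" ".join(cmd))

# To actually open the tunnel (it blocks until interrupted), one option is:
#   import subprocess
#   subprocess.run(cmd)
```

With the tunnel open, the browser still has to be pointed at the SOCKS proxy on localhost:8157 before the web interfaces on the master node become reachable.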
19. Configure Hive to query JSON
Set up the Hive table to query the underlying JSON files (see notes)
/* ---[ A tool to automate creation of Hive JSON schemas ]--- */
One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large, complex, highly nested JSON document is quite tedious.
I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema.
How to get data in and out.
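To make the schema-generation idea concrete, here is a rough sketch that derives a Hive table definition from one sample JSON record. It is not the tool linked above and makes no claim about how that tool works. The function names hive_type and create_table_ddl, the table name events, and the bucket s3://example-bucket/events/ are all hypothetical. It also assumes that arrays are homogeneous, that integers map to bigint, and that the top-level record is a JSON object.

```python
import json

# Class name of the openx JSON SerDe referenced on this slide.
SERDE_CLASS = "org.openx.data.jsonserde.JsonSerDe"


def hive_type(value):
    """Map a parsed JSON value to a Hive type string (illustrative mapping)."""
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # Assumption: the first element decides the type; empty arrays default to string.
        return "array<%s>" % (hive_type(value[0]) if value else "string")
    if isinstance(value, dict):
        if not value:
            return "map<string,string>"  # assumption: fallback for empty objects
        fields = ",".join("%s:%s" % (k, hive_type(v)) for k, v in value.items())
        return "struct<%s>" % fields
    return "string"  # strings and nulls


def create_table_ddl(table, sample_json, location):
    """Build a CREATE EXTERNAL TABLE statement from one sample record."""
    record = json.loads(sample_json)
    columns = ",\n".join("  `%s` %s" % (k, hive_type(v)) for k, v in record.items())
    return (
        "CREATE EXTERNAL TABLE %s (\n%s\n)\n"
        "ROW FORMAT SERDE '%s'\n"
        "LOCATION '%s';" % (table, columns, SERDE_CLASS, location)
    )


sample = '{"event": "click", "ts": 1, "user": {"id": 7, "tags": ["a"]}, "ok": true, "score": 0.5}'
print(create_table_ddl("events", sample, "s3://example-bucket/events/"))
```

A single sample record can miss optional fields, so a generated schema like this is best treated as a starting point to review by hand.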
20. Bootstrap our cluster
Now we can bootstrap our cluster to load additional libraries and functions on all the nodes.
We are going to bootstrap with Python, NLP, and the Stanford library so we can pick out keywords in each record.
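As a rough illustration of what "picking out keywords in each record" can mean, the sketch below ranks the words in a record by frequency after dropping common stopwords. It is a deliberately simple stand-in built on the Python standard library, not the Stanford library mentioned above. The function name extract_keywords and the small stopword list are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real NLP library ships a far larger one.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


record = "The cluster loads the cloud data and the cloud scales"
print(extract_keywords(record, top_n=2))
```

On the cluster, a function like this would run once per record; a frequency count ignores grammar and context, which is where a dedicated NLP library would replace this stand-in.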