2. Hi I’m Bobby (bobby@apache.org)
Low Latency Data Processing Architect at Yahoo.
› My team and I provide Apache Storm as a service to Yahoo.
› We also maintain Spark at Yahoo, but that is another talk.
Thursday June 5th @ 11:50 Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
› And we get to play around with deep learning and online machine learning too.
Committer and PMC/PPMC member for
› Apache Storm incubating
› Apache Hadoop
› Apache Spark
› Apache Tez incubating
4. Storm Concepts
1. Streams
› Unbounded sequence of tuples
2. Spout
› Source of Stream
› E.g. Read from Twitter streaming API
3. Bolts
› Processes input streams and produces new streams
› E.g. Functions, Filters, Aggregation, Joins
4. Topologies
› Network of spouts and bolts
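The four concepts above can be sketched in plain Python. This is not the Storm API: the function names, the sample sentences, and the stop-word list are all assumptions made for illustration, and a real spout would never run out of tuples.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout() -> Iterator[Tuple[str]]:
    """Spout: the source of a stream (a finite stand-in for an unbounded feed)."""
    for sentence in ["storm on yarn", "storm is secure", "yarn runs storm"]:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt (function): consumes a stream of sentences, emits a stream of words."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def filter_bolt(stream: Iterable[Tuple[str]],
                stop_words=frozenset({"is", "on"})) -> Iterator[Tuple[str]]:
    """Bolt (filter): drops tuples whose word is in the stop-word set."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Counter:
    """Bolt (aggregation): counts how often each word was seen."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


def run_topology() -> Counter:
    """Topology: the network that wires the spout into the bolts."""
    return count_bolt(filter_bolt(split_bolt(sentence_spout())))


word_counts = run_topology()
```

Each stage only sees tuples from the stage before it, which is the property that lets Storm spread spouts and bolts across many workers.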
11. Authentication By Type
6/17/2014
HTTP – Using HTTP Authentication or with a Custom Java Servlet Filter.
Thrift – Kerberos (Possibly through a forwarded TGT)
ZooKeeper
› Kerberos for system processes (Because there is a keytab available)
› A shared secret for worker processes with MD5SUM in ZK.
File System – OS user/group + FS permissions.
Worker to Worker – Can use encryption with a shared secret, but we really need to add SASL Auth.
External Services (like HBase) – Sorry, it is up to you (sort of …)
13. Credentials Push
(Authenticating with External Services)
APIs to deliver credentials to a Topology.
ICredentialsListener – informed of credentials updates.
IAutoCredentials – automatically include credentials to push.
ICredentialsRenewer – renew credentials.
Push new Credentials
› storm upload_credentials
› StormSubmitter.pushCredentials
AutoTGT – push forwardable TGT to topology.
› Also logs you into Hadoop/HBase if needed
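The push-and-notify flow above can be sketched in plain Python. Nothing here is the Storm API: `TopologySketch`, `ExternalServiceClient`, and the `service.token` key are hypothetical names. The listener callback stands in for the role the slide gives ICredentialsListener, and `push_credentials` stands in for a credentials push.

```python
from typing import Callable, Dict, List, Optional

Credentials = Dict[str, str]


class TopologySketch:
    """Hypothetical stand-in for a running topology that accepts credential pushes."""

    def __init__(self, initial: Credentials) -> None:
        self._credentials: Credentials = dict(initial)
        self._listeners: List[Callable[[Credentials], None]] = []

    def add_listener(self, listener: Callable[[Credentials], None]) -> None:
        # A new listener is handed the current credentials right away.
        self._listeners.append(listener)
        listener(dict(self._credentials))

    def push_credentials(self, new: Credentials) -> None:
        # Replace the stored credentials and inform every listener.
        self._credentials = dict(new)
        for listener in self._listeners:
            listener(dict(self._credentials))


class ExternalServiceClient:
    """Hypothetical component that logs in again whenever its token changes."""

    def __init__(self) -> None:
        self.token: Optional[str] = None
        self.logins = 0

    def on_credentials(self, credentials: Credentials) -> None:
        token = credentials.get("service.token")
        if token != self.token:
            self.token = token
            self.logins += 1  # a real client would re-authenticate here


topology = TopologySketch({"service.token": "t1"})
client = ExternalServiceClient()
topology.add_listener(client.on_credentials)
topology.push_credentials({"service.token": "t2"})
```

The point of the pattern is that a long-running topology can pick up fresh credentials without being restarted.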
14. Authorization
IAuthorizer plugin allows you to decide what is and isn’t allowed
SimpleACLAuthorizer for Nimbus.
Different roles for users
› Administrators can do anything.
› Supervisors
› Users
Topology can configure access to itself as well (rebalance).
DRPCSimpleACLAuthorizer for DRPC.
Can configure client and topology users per function.
Can default open or closed.
Topology can also whitelist users to view info through UI and Logviewer
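The per-function DRPC rules above (client users, topology users, default open or closed) can be sketched in plain Python. This is an illustration only, not the DRPCSimpleACLAuthorizer implementation: the class names, the method names, and the choice of sets for both user lists are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class FunctionAcl:
    """Who may call a DRPC function and who may serve it (hypothetical shape)."""
    client_users: FrozenSet[str]
    topology_users: FrozenSet[str]


@dataclass
class DrpcAclSketch:
    """Hypothetical per-function ACL check with a configurable default."""
    acls: Dict[str, FunctionAcl] = field(default_factory=dict)
    default_open: bool = False  # applies to functions that have no ACL entry

    def permit_client(self, user: str, function: str) -> bool:
        acl = self.acls.get(function)
        if acl is None:
            return self.default_open
        return user in acl.client_users

    def permit_topology(self, user: str, function: str) -> bool:
        acl = self.acls.get(function)
        if acl is None:
            return self.default_open
        return user in acl.topology_users


closed = DrpcAclSketch(
    acls={"exclaim": FunctionAcl(frozenset({"alice"}), frozenset({"bobby"}))},
    default_open=False,
)
wide_open = DrpcAclSketch(default_open=True)
```

With `default_open=False`, a function that nobody has configured is unreachable, which is the safer default on a shared cluster.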
16. Multi-tenant Scheduler
Admins set resource allotments per user instead of per topology
› Users decide how to divide up their resources per topology
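A minimal sketch of per-user allotments in plain Python. The unit (whole nodes), the function name `admit`, and the first-come admission order are assumptions; the slide does not say how the real scheduler treats a user who asks for more than the allotment.

```python
from typing import Dict, List


def admit(allotments: Dict[str, int], user: str,
          requests: Dict[str, int]) -> List[str]:
    """Return the topologies that fit inside the user's allotment.

    allotments: nodes the admin granted to each user.
    requests:   nodes the user asked for, per topology, in submission order.
    """
    remaining = allotments.get(user, 0)  # unknown users get nothing
    admitted: List[str] = []
    for topology, nodes in requests.items():
        if nodes <= remaining:
            admitted.append(topology)
            remaining -= nodes
    return admitted


allotments = {"bobby": 10}
admitted = admit(allotments, "bobby", {"clicks": 6, "ads": 3, "ml": 4})
```

The admin only reasons about the total of 10 nodes for the user; how those nodes are split between `clicks`, `ads`, and `ml` is the user's decision.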
21. Storm on YARN
Currently
A standalone Storm cluster running on YARN
Has some hacks to avoid port conflicts
No security
No recovery if AM goes down
24. What’s Next?
(If you see anything you like we are hiring…)
Nimbus HA/Recovery.
Long lived secure processes in YARN.
Ephemeral ports for Storm.
Combine the AM and Nimbus.
Do we need a Supervisor if we have a Node Manager?
Possibly run as Unmanaged AMs and Proxy Users.
Elasticity for Storm topologies.
Resource aware scheduling/requests in Storm.
Network aware scheduling in YARN and Storm.
Automatic fetching of delegation tokens like Oozie
27. Why Not…
No need for a religious war; there are lots of good options out there and we picked one.
Apache Spark Streaming
We started before Spark Streaming was a possibility.
Storm is currently more advanced in many areas, but not in all.
› Fault Tolerance (I can turn it off in Storm)
S4
The community for Storm was more active
Fault Tolerance (I can turn it on in Storm)