Abstract:
Join Nick Piette, Director of Evangelism at Talend, as he brings you a deep, technical discussion on the real-world data pipeline that underlies modern sports. Working from real-time instrumentation data collected during play, and using open source tools, Nick will show you how to produce meaningful analytics results in minutes. If you are using Kafka, Spark, or any real-time data science technologies, or even if you are just trying to get a better understanding of them, this event is for you.
Speaker’s bio:
Nick Piette is the Director of Evangelism at Talend. He has spent the last eight years helping enterprises with many different data processing challenges. Nick enjoys sharing the most compelling big data use cases that are changing the world.
Slide 4
So what?
• Cool use case and all, but what's the value?
• Real-time streams from robotic manufacturing (Audi, Ford, BMW, Toyota)
• Real-time traffic analysis for smart cities / theme parks (Denver, Cincinnati, London, Disney, Universal)
• Real-time mechanical data from devices (aircraft – Air France, windmills – GE)
• And before you discount this whole sports thing:
• The UK tax office collects 1.3B pounds (~2B USD) in taxes each year from EPL teams
• Greater than the GDP of the bottom 25% of all countries
• 95 billion dollars wagered annually on NFL and college football
• #1 on the Forbes 2000 list by a lot…
Slide 10
From Seen To Described
Gigs of video data become KB/MB of description data
Most applications that do the conversion are proprietary,
but we're seeing investment in the space by the usual suspects
Slide 11
Phone home?
Data tends to be JSON or XML
ONVIF standard for security cameras
Messaging vs. web services?
Slide 17
Challenges
• The camera array sends a feed of 25 frames per second
• Each frame captures the x, y, z coordinates of every player
• A live feed of sports data is actually pretty serious Big Data!
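The bullets above imply a surprisingly modest raw payload once video has been reduced to coordinates. A back-of-the-envelope sketch; the 22-players-plus-ball count and the float64 encoding are assumptions for illustration, not figures from the talk:

```python
# Rough rate estimate for an optical tracking feed like the one described.
FPS = 25                      # frames per second (from the slide)
TRACKED_OBJECTS = 23          # 22 players + the ball (assumption)
COORDS = 3                    # x, y, z per object
BYTES_PER_COORD = 8           # float64 (assumption)

samples_per_second = FPS * TRACKED_OBJECTS * COORDS
bytes_per_second = samples_per_second * BYTES_PER_COORD
bytes_per_match = bytes_per_second * 90 * 60   # a 90-minute match

print(f"{samples_per_second} coordinate samples/s")
print(f"{bytes_per_second / 1024:.1f} KiB/s raw, "
      f"{bytes_per_match / 1024**2:.1f} MiB per match (before JSON overhead)")
```

The raw numbers are small, but the JSON encoding, per-frame metadata, and the need to process every frame with low latency are what make the feed "serious Big Data" in practice.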
Slide 19
Kafka Background
Distributed Streaming Platform
• It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
• It lets you store streams of records in a fault-tolerant way.
• It lets you process streams of records as they occur.
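The three capabilities above all come from one abstraction: an append-only, partitioned log that each consumer group reads at its own committed offset. A toy in-memory analogue (not real Kafka; a real deployment would use a client library such as kafka-python against a broker) shows why one log can serve as both messaging and storage:

```python
class ToyLog:
    """Minimal in-memory analogue of a Kafka topic partition:
    an append-only log; each consumer group tracks its own offset."""

    def __init__(self):
        self._records = []          # the durable, ordered log
        self._offsets = {}          # consumer group -> next offset to read

    def publish(self, record):
        self._records.append(record)        # records are never mutated or removed

    def poll(self, group, max_records=10):
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)   # commit the new offset
        return batch

log = ToyLog()
log.publish({"player": 7, "x": 12.3, "y": 4.5, "z": 0.0})
log.publish({"player": 9, "x": 50.1, "y": 22.0, "z": 0.0})

print(log.poll("analytics"))   # analytics group reads both records
print(log.poll("archiver"))    # archiver group replays from offset 0
```

Because reads never remove records, a second consumer group ("archiver" above) can replay the same stream from the beginning, which is how Kafka doubles as fault-tolerant storage.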
Slide 20
Spark Background
• Fast and general engine for large-scale data processing
• Developed in response to processing limitations with MapReduce
• 10x faster than MapReduce on disk
• 100x faster than MapReduce in memory
• Has a stack of libraries including Spark Streaming & MLlib (machine learning)
• Runs everywhere: on Hadoop or standalone
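A key enabler of the speedups quoted above is Spark's lazy evaluation: transformations such as map and filter only record a plan, and nothing executes until an action like collect forces the whole pipeline. A minimal pure-Python analogue of that model (a toy for illustration, not the PySpark API):

```python
class ToyRDD:
    """Tiny illustration of Spark's lazy-transformation model:
    map/filter only record the work; collect() triggers it."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []       # deferred transformations

    def map(self, fn):
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):              # the action: runs the whole pipeline
        out = self._data
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

# Hypothetical player speeds in km/h; keep only sprints, convert to m/s
speeds_kmh = [31.2, 8.4, 27.0, 5.1]
sprints = ToyRDD(speeds_kmh).filter(lambda v: v > 25).map(lambda v: round(v / 3.6, 1))
print(sprints.collect())
```

Deferring work this way lets the real engine fuse steps, skip unneeded data, and keep intermediate results in memory, which is where the 100x figure comes from.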
Slide 22
Next Step: From Analysis to Prediction
Team stats
• Who is most likely to score next?
• Which team is going to win?
Individual player stats
• Which players need a rest / to be benched?
• Which players are being traded? (bring in historical data)
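As a sketch of the "who needs a rest" question, one simple heuristic (entirely hypothetical, not from the talk) flags a player whose recent rolling workload drops well below their early-game baseline:

```python
def fatigue_flags(distances_per_min, window=3, drop=0.7):
    """Flag periods where a player's rolling average distance falls below
    `drop` x their early-game baseline. Illustrative heuristic only."""
    baseline = sum(distances_per_min[:window]) / window
    flags = []
    for i in range(window, len(distances_per_min) + 1):
        recent = sum(distances_per_min[i - window:i]) / window
        flags.append(recent < drop * baseline)
    return flags

# Metres covered per minute (hypothetical numbers): output declines sharply
player = [120, 118, 122, 119, 90, 75, 70]
print(fatigue_flags(player))
```

A production model would learn thresholds from historical tracking data rather than hard-coding them, but the shape of the computation, a streaming window over per-player aggregates, is exactly what Spark Streaming is built for.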
Slide 23
Free Trial: Talend Big Data Sandbox
• A ready-to-run Docker environment
• A step-by-step expert guide: the cookbook
• Real-world scenarios using Spark, Kafka, MapReduce & NoSQL
• IoT Analytics
• Real-time Recommendation
• Clickstream Analysis
• Weblog Analysis
• EDW Offload
www.talend.com/BigDataSandbox
Hit the Easy Button for Hadoop, Spark and Machine Learning
Slide 24
The NEW Talend Community
• An active community
• 80,000 visitors/week
• 3M total downloads
• Engaged members
• Individual members & partners
• Active user groups
• 1,000+ components built by the community
Slide 25
Talend Data Masters Awards
• Share your Talend story & win $1,500 for your favorite charity
• Deadline: July 28th
• https://info.talend.com/datamasters2017all.html
Editor's Notes
More often than not, the data people analyze today is volatile: it comes and goes, it's analyzed and gone.
The idea was that you needed to download all of Twitter to do anything of value with social analytics, but that's not true… there's an API for that.
The thing is, data analytics is important to every organization, no matter the size, so "big" is different for everyone; it's not just the volume but also the velocity and variety of the data.
Who here is a sports fan? Big fantasy league players here?
Big data is an interesting marketing term.
The 4.5 trillion frames per second is the FASTEST slow motion camera to date, it is used to capture the moments leading up to, during and after a chemical reaction… not something we’d need for a goal line review but it certainly exemplifies the big data challenge we are presenting.
If you were to watch this manually, it would take you hundreds of thousands of years to process… hope you didn't have plans.
NFL Zebra – RFIDs in jerseys – force impact, speed, concussion rates
NBA, you’d think they could keep the traveling down to a minimum
Goal Line technology
There is a lot of value in the data created behind this… influence outcomes even by a small fraction and we're talking about millions.
Now we’re going to break this challenge up into two sections, the first will cover all aspects of the image collection and video processing, the second covers the analytics
The first question that needs to be asked when architecting a solution for processing video and image data is: what do I need to solve the problem? A lot of architectural decisions depend on the answer.
Is the challenge to identify that what I am seeing is a car? Do I need to know what color it is? Or the model? Or, in the case of video, can I tell the difference between one car and another? Perhaps I am just getting a general flow of traffic on a highway, or am I trying to estimate a competitor's market share by measuring the ratio of my car brand vs. theirs within a given area?
Almost all video and image processing pipelines look like this.
We’re capturing the raw video format and they compressing / encoding.
Next we process the video to extract relevant metadata and then pass that information further downstream to our analytical process. There are a lot of questions as to where and when to do certain steps and we’ll walk though them in the following slides.
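The capture, encode, extract-metadata, analytics flow described here can be sketched as chained generator stages; the frame sizes and the 50x encoding ratio below are placeholder numbers, not measurements:

```python
def capture(frames):
    """Stand-in for the camera array: yields raw frames as they arrive."""
    for f in frames:
        yield f

def encode(stream):
    """Stand-in for lossy encoding (H.264/HEVC): shrinks each frame."""
    for f in stream:
        yield {"frame": f["id"], "size_kb": f["size_kb"] // 50}

def extract_metadata(stream):
    """Video -> small description records passed downstream to analytics."""
    for f in stream:
        yield {"frame": f["frame"], "desc": f"{f['size_kb']} KB encoded"}

raw = [{"id": i, "size_kb": 6000} for i in range(3)]   # hypothetical raw frame sizes
for record in extract_metadata(encode(capture(raw))):
    print(record)
```

Because each stage pulls from the previous one lazily, frames flow through one at a time, which mirrors how a streaming pipeline avoids buffering whole videos between steps.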
This makes a very strong argument for processing and handling it as locally as possible to work with that high bandwidth.
18.88 Mbps in most urban areas, with even higher available at a premium.
The FCC recently found that 39% of rural populations lack target levels of speed: 25 Mbps for downloads and 3 Mbps for uploads.
This impacts things like smart farming and smart agriculture.
Some HD video cameras output uncompressed video, whereas others compress the video using a lossy method such as MPEG or H.264; H.265 is also picking up.
HEVC was developed with the goal of providing twice the compression efficiency of the previous standard, H.264 / AVC
At an identical level of visual quality, HEVC enables video to be compressed to a file that is about half the size (or half the bit rate) of AVC.
When compressed to the same file size or bit rate as AVC, HEVC delivers significantly better visual quality.
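To make the bandwidth trade-off concrete, here is an illustrative check of whether a single HD stream fits the upload figures quoted in these notes; the 8 Mbps AVC bitrate is an assumed typical value for 1080p, not a number from the talk:

```python
# Illustrative order-of-magnitude bitrates, not measured values.
AVC_1080P_MBPS = 8.0                  # assumed H.264 bitrate for a 1080p stream
HEVC_1080P_MBPS = AVC_1080P_MBPS / 2  # HEVC roughly halves the bitrate at equal quality

urban_uplink_mbps = 18.88             # urban figure quoted in these notes
rural_uplink_mbps = 3.0               # FCC rural upload target quoted above

for name, uplink in [("urban", urban_uplink_mbps), ("rural", rural_uplink_mbps)]:
    print(f"{name}: AVC fits={AVC_1080P_MBPS <= uplink}, "
          f"HEVC fits={HEVC_1080P_MBPS <= uplink}")
```

Note that even with HEVC's halving, the assumed 4 Mbps stream still exceeds the 3 Mbps rural upload target, which reinforces the argument for processing the video as locally as possible.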
NFL stadiums tend to have hundreds to thousands of servers within the stadium devoted to encoding and metadata processing.
The usual suspects,
Amazon,
Google,
Microsoft,
IBM …. Just to name a few
While a lot of the camera hardware vendors will provide this processing capability, I did a check and there are some 30+ APIs available out there to handle the video processing. This is likely the most complex and use-case-specific process, and I have yet to find a one-size-fits-all API.
But as discussed, as work continues on codec compression and infrastructure improves upload bandwidth, we might get to the point where this discussion becomes moot.
In short, the better we get at lossless compression, the more flexible we can be in this step… where's Pied Piper when you need him?
So with that in mind, I'd like to show you how you could build a process like this. We're going to take the Google Vision API for a little spin: I'm going to gather you up, take a picture that I'll post on Twitter, and pull it down using Talend to analyze with the Google Vision API. It will spit out some interesting results and hopefully recognize you all as people and see your faces.
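Under the hood, that analysis step is a POST to the public Vision API endpoint (https://vision.googleapis.com/v1/images:annotate). A sketch of building that request body for label and face detection; the placeholder bytes stand in for the actual photo pulled from Twitter:

```python
import base64
import json

def build_annotate_request(image_bytes):
    """Build the JSON body for a POST to
    https://vision.googleapis.com/v1/images:annotate
    asking for labels and faces (v1 REST request shape)."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [
                {"type": "LABEL_DETECTION", "maxResults": 10},
                {"type": "FACE_DETECTION"},
            ],
        }]
    }

body = build_annotate_request(b"\x89PNG...placeholder image bytes...")
print(json.dumps(body, indent=2)[:80], "...")
```

Sending this body (with an API key) returns label annotations and face geometry as JSON, which is exactly the kind of compact description data the earlier slides contrast with raw video.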
So we just covered how to architect something to handle video processing, discussed some of the trade-offs around locality of service, and finished off with a demo highlighting some of the work cloud companies like Google are doing to democratize the video and image metadata gathering process.
So now let's focus on the analytical side. Where we left off from the video processing architecture was that the video data had been converted into a metadata representation. We're going to want to work with that in a more general analytical setting.
So going back to our conversation earlier about sports analytics and the gobs of money it brings in, we see coaches, analysts, even the average sports viewer looking for insight into their favorite players, looking for ways to optimize their strategy to improve success.
In the case we have here, which is focused on data collected from the EPL, players are often running all over the place, and identifying when they are getting tired can be important intel for both teams. When you have players playing well into their 40s, you want to make sure one of them isn't going to break a hip or something…
The NFL is doing similar fact-finding with regard to force impact analysis. With so much attention on concussion rates and effects, you bet everyone is making sure they keep their $120 million franchise player safe and healthy.
Here's just an example of what is in the JSON information we receive; while it's not the 4.5 trillion frames per second from earlier, it adds up fast.
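Since the actual payload isn't reproduced in these notes, here is a hypothetical frame record of the kind slide 17 describes (25 fps, x/y/z per player); every field name below is invented for illustration:

```python
import json

# Hypothetical frame payload: the schema is invented for illustration,
# since the real feed's JSON is not shown in the deck.
frame_json = """{
  "frame": 41872,
  "timestamp_ms": 1674900,
  "players": [
    {"id": 7,  "team": "home", "x": 12.3, "y": 4.5,  "z": 0.0},
    {"id": 10, "team": "away", "x": 50.1, "y": 22.0, "z": 0.0}
  ]
}"""

frame = json.loads(frame_json)
# Index positions by player id, the first step most analytics would take
positions = {p["id"]: (p["x"], p["y"], p["z"]) for p in frame["players"]}
print(frame["frame"], positions)
```

At 25 such records per second per match, parsing and indexing like this is the natural first stage of the Kafka-to-Spark pipeline described earlier.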
Consistent Growth
1,500 members in the new Community.Talend.com INTERNAL ONLY
3M total downloads of Talend software to date since the company was founded (includes TOS + evals)
In 2016, we had 360,000 total downloads, up 14% since 2015 (total downloads include TOS + evals)
Engaged members:
Members: Our community members are “strategic partners” in solving data challenges—not just Talend challenges.
Talend Advocates: Small-to-medium SIs and VARs are some of the greatest Talend champions in the community. They share their technical expertise, and by sharing their knowledge, they get visibility and find new customers.
Thought Leaders: We’re about to launch a new Discussion Board about IoT/Smart Cities. By comparison, competitors use their forum for product support only.
The health of a community is measured by engagement, not just growth.
User Groups:
Not only do we have community members who actively respond to questions on the forum…
… we also have customers who are creating and managing user groups around the world (US, UK, Germany, France, Belgium, Switzerland, and India)
Our user groups in Portland, Maine, and Vancouver, Canada were launched by customers, and so were many others.
The Community Team is launching one NEW user group per quarter. In 2017, we plan to add new user groups in Chicago, Dallas, Toronto, and Atlanta. Vancouver was launched in Q1.
Every day, we have about 400 online concurrent users.
Monetization:
Both Talend and the Talend partners know how to monetize the community.
Talend has been converting open source customers (e.g. the Judicial Court of California, Mogo Finance Technology) from Open Studio to the commercial version, Talend Data Integration.
And partners who are active on the community are finding new business (some of the most active members are SI partners)
Criteria
Creativity and uniqueness of use
Scope and complexity of project
Business transformation and improvement
Timeline
We are accepting entries until July 28, 2017. Hurry and send your entries now!
Winners will be notified in September.
Winners will be announced in November.
Eligibility Requirements
Award winners should be willing to have their story shared publicly on Talend web site (company logo, video and case study) and promoted on social media and in press announcements.