How to build your own Delve: combining machine learning, big data and SharePoint

#SPSBE11
Joris Poelmans
April 18th, 2015

Agenda
• Introduction to Delve
• Office Graph
• Big Data and Machine Learning
• Building your own Delve - architectural concept

Delve – Search and Discovery Across O365
Powered by Office Graph
• Stay in the Know: discover new information tailored to you from your network
• Find What You Need: find just the right results from any source and take action
• Discover New Connections: connect with the right experts and learn more about their content

What is the Office Graph?
• Nodes: User, Documents, People, Conversations
• Edges: Manager, Direct report, Works with, Shared with me, Viewed by me, Trending around me, Presented to me, Liked by me
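
A minimal sketch of how the relationships listed above could be held in a small labelled graph and queried. The class `TinyGraph`, the helper `viewed_by_colleagues`, the edge label spellings, and the sample users and documents are all hypothetical; this is not the Office Graph API, only an illustration of the node-and-edge idea.

```python
from collections import defaultdict


class TinyGraph:
    """Toy labelled graph: nodes are strings, each edge carries an action label."""

    def __init__(self):
        # source node -> list of (action label, target node)
        self._edges = defaultdict(list)

    def add_edge(self, source, action, target):
        self._edges[source].append((action, target))

    def targets(self, source, action):
        """Return the targets reachable from `source` over edges labelled `action`."""
        return [t for a, t in self._edges[source] if a == action]


def viewed_by_colleagues(graph, user):
    """Documents viewed by the people `user` works with, minus ones `user` already viewed."""
    already_seen = set(graph.targets(user, "ViewedByMe"))
    result = []
    for colleague in graph.targets(user, "WorksWith"):
        for doc in graph.targets(colleague, "ViewedByMe"):
            if doc not in already_seen and doc not in result:
                result.append(doc)
    return result


# Hypothetical sample data.
graph = TinyGraph()
graph.add_edge("alice", "Manager", "bob")
graph.add_edge("alice", "WorksWith", "carol")
graph.add_edge("alice", "ViewedByMe", "budget.xlsx")
graph.add_edge("carol", "ViewedByMe", "roadmap.pptx")
graph.add_edge("carol", "ViewedByMe", "budget.xlsx")

print(viewed_by_colleagues(graph, "alice"))  # ['roadmap.pptx']
```

Walking two edge types in sequence ("Works with", then "Viewed by me") is the kind of traversal that turns raw relationships into a discovery feed.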

Signals sent from Delve, Exchange, O365, …
Click person
Modify/Save
Elevate
Share
Follow
Like
Comments
Email
Ignore
Presented to
Shown document
Open document
Shown board
… and more

Content and signals from across O365 auto-populate the Office Graph
Insights are derived with machine learning to drive proactive and intelligent experiences

Big data is what happened when the cost of storing user data became cheaper than making the decision to throw it away

Transactions + Interactions + Observations = Big Data
[Figure: data volume grows from megabytes to petabytes with increasing data variety and complexity]
• ERP (megabytes): Purchase detail, Purchase record, Payment record
• CRM (gigabytes): Offer details, Support contacts, Customer touches, Segmentation
• WEB (terabytes): Web logs, Offer history, A/B testing, Dynamic pricing, Affiliate networks, Search marketing, Behavioral targeting, Dynamic funnels
• Big Data (petabytes): User generated content, Mobile web, SMS/MMS, Sentiment, External demographics, HD video, audio, images, Speech to text, Product/service logs, Social interactions & feeds, Business data feeds, User click stream, Sensors / RFID / devices, Spatial & GPS coordinates

Big Data Core Technology landscape
• Hadoop
  – Hadoop/HDFS + MapReduce
  – Key Big Data technology
• MPP
  – Appliances: Teradata, Microsoft PDW/APS, Oracle BDA X4-2
• NoSQL
  – New paradigm for storing data
  – 100+ non-SQL DBs and growing
• NewSQL
  – Supports SQL querying
  – Internal architecture different from classic DBs

Modern Data Architecture
• Apache Hadoop is an open source framework that supports data-intensive distributed applications
  – Uses HDFS storage to enable applications to work with 1000s of nodes and petabytes of data using a scale-out model
  – Uses MapReduce to process data
  – Inspired by Google: MapReduce and the Google File System
  – Related projects: HBase, Hive, Mahout, Pig, Sqoop, Ambari, Storm, ZooKeeper, and many more

HDFS and MapReduce in a nutshell
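
To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using the classic word count. It does not use Hadoop or HDFS; the function names and the two sample "blocks" are hypothetical, and on a real cluster each phase would run in parallel across data nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical input blocks, standing in for file blocks stored on different nodes.
blocks = ["big data and SharePoint", "big data and machine learning"]
counts = word_count(blocks)
print(counts["big"], counts["sharepoint"])  # 2 1
```

Because map and reduce only ever see one line or one key at a time, the same functions can be spread over many machines without changing their logic.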

Hadoop components
• Layers: Hadoop core, Data access, Data integration, Operations
• Hadoop core: Distributed Storage (HDFS), Distributed Processing (MapReduce)
• Other components shown: Hive, Pig, HBase, HCatalog, Data Integration (ODBC/Sqoop/REST/Flume), Mahout, Pegasus, RHadoop, Oozie, Ambari, ZooKeeper, Storm, Kafka
http://jopx.blogspot.be/2015/03/overview-of-apache-hadoop-components-in.html

Microsoft Azure HDInsight
• Hadoop in Azure
• Based on Hortonworks Data Platform
• Able to leverage Azure Blob Storage
• Pay per use model
• Supports HBase as a NoSQL columnar database on Azure Blobs
• Supports Storm for stream processing
[Diagram: Name Node, Job Tracker, and four Data Nodes with Task Trackers; HMaster, Coordination, and four Region Servers]

Hive
• Hadoop feature to perform data warehouse operations
• HiveQL
  – High-level, SQL-like language, abstraction over MapReduce
  – Supports equi-joins
  – Schema on read, NOT schema on write
  – Automatically invokes MapReduce jobs
  – Much simpler than using MapReduce directly
• Metadata store
  – Contains descriptions of tables
• Acts as a bridge to many BI products which expect tabular data
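
"Schema on read" is easier to see in a small example. The sketch below is plain Python, not Hive or HiveQL: the raw text is stored as-is, and column names and types are applied only when the data is read. The log format, the `SCHEMA` list, and the helper names are hypothetical. The final query is roughly what a grouped count with a filter would express in a SQL-like language.

```python
import csv
import io

# Raw data as it might land in storage: no schema is enforced when it is written.
RAW_LOG = "alice,budget.xlsx,view\nbob,roadmap.pptx,like\ncarol,budget.xlsx,view\n"

# The schema lives apart from the data (compare the metadata store) and is applied at read time.
SCHEMA = [("user", str), ("document", str), ("action", str)]


def read_with_schema(raw_text, schema):
    """Apply column names and types while reading; the stored text is never changed."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


def count_by(rows, column):
    """Count rows per distinct value of `column`."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Filter on action, then group by document and count.
views = [row for row in read_with_schema(RAW_LOG, SCHEMA) if row["action"] == "view"]
views_per_document = count_by(views, "document")
print(views_per_document)  # {'budget.xlsx': 2}
```

A different schema could be applied to the same stored text later without rewriting it, which is the practical difference from schema on write.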

Machine learning
finding the needle in the haystack
• Formal definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” - Tom M. Mitchell
• Another definition: “The goal of machine learning is to program computers to use example data or past experience to solve a given problem.” – Introduction to Machine Learning, 2nd Edition, MIT Press
• ML often involves two primary techniques:
  – Supervised Learning: Finding the mapping between inputs and outputs using correct values to “train” a model
  – Unsupervised Learning: Finding patterns in the input data (similar to Density Estimates in Statistics)

Vision Analytics
Recommendation engines
Advertising analysis
Weather forecasting for business planning
Social network analysis
Legal discovery and document archiving
Pricing analysis
Fraud detection
Churn analysis
Equipment monitoring
Location-based tracking and services
Personalized Insurance

Some retailers profit
… by predicting major changes in your life.

Steps to build a machine learning solution

Typical machine learning algorithms
• Clustering (k-means, orthogonal partitioning,…)
• Association rule learning (Apriori)
• Regression (linear/logistic)
• Recommendation engines
• Classification (C4.5, decision trees, SVM, Naïve Bayes, AdaBoost, Random Forest, …)
• Similarity matching
• Neural networks
• Bayesian networks
• Genetic algorithms
• Ensembles
See http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
And http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf and
http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms

Doing recommendations – some approaches
• Collaborative filtering
• Feature based recommendations
• K-nearest neighbours

Collaborative filtering
• A set of items (books, beers, blogposts, …)
• Ratings from users
• Recommended items based on your ratings and other people’s ratings
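
One simple way to turn the three ingredients above into recommendations is item co-occurrence ("people who liked X also liked Y"). The sketch below is only one of several collaborative filtering variants; the like threshold, the sample ratings, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations

LIKE_THRESHOLD = 4  # assumption: a rating of 4 or 5 counts as a "like"

# Hypothetical ratings: user -> item -> rating on a 1-5 scale.
RATINGS = {
    "ann": {"book": 5, "beer": 4, "blogpost": 2},
    "ben": {"book": 4, "beer": 5, "podcast": 5},
    "cat": {"book": 5, "podcast": 4, "blogpost": 1},
    "eve": {"book": 4, "podcast": 5},
    "dan": {"book": 5},
}


def liked(user_ratings):
    """Items the user rated at or above the like threshold."""
    return {item for item, rating in user_ratings.items() if rating >= LIKE_THRESHOLD}


def co_occurrence(ratings):
    """Count, for each ordered item pair (a, b), how many users liked both."""
    counts = Counter()
    for user_ratings in ratings.values():
        for a, b in permutations(sorted(liked(user_ratings)), 2):
            counts[(a, b)] += 1
    return counts


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by how often they co-occur with items the user liked."""
    mine = ratings[user]
    my_likes = liked(mine)
    scores = Counter()
    for (a, b), n in co_occurrence(ratings).items():
        if a in my_likes and b not in mine:
            scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]


print(recommend(RATINGS, "dan"))  # ['podcast', 'beer']
```

Note that the item descriptions are never inspected here; only other people's ratings drive the result.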

Feature based recommendations
• Use user’s ratings of items
  – Create an algorithm to define which features (metadata) of items the user likes
• Requires detailed information about items - content based
  – An item can be a person as well – see “People you may know”
• Most approaches combine “feature based” and “collaborative filtering”
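
A minimal sketch of the feature based idea: derive a per-feature preference from the user's own ratings, then score unseen items by their features. The documents, tags, and the mean-centred averaging scheme are assumptions chosen for illustration, not a prescribed algorithm.

```python
# Hypothetical item metadata: item -> set of feature tags.
ITEM_FEATURES = {
    "doc_budget": {"finance", "excel"},
    "doc_forecast": {"finance", "report"},
    "doc_party": {"social"},
    "doc_audit": {"finance", "excel", "report"},
    "doc_picnic": {"social", "report"},
}


def build_profile(user_ratings, item_features):
    """Average mean-centred rating per feature; positive means the user tends to like it."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    totals, counts = {}, {}
    for item, rating in user_ratings.items():
        for feature in item_features[item]:
            totals[feature] = totals.get(feature, 0.0) + (rating - mean)
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


def score(item, profile, item_features):
    """Sum the learned preference of every feature the item carries."""
    return sum(profile.get(feature, 0.0) for feature in item_features[item])


def recommend(user_ratings, item_features):
    """Rank the items the user has not rated yet, best match first."""
    profile = build_profile(user_ratings, item_features)
    unseen = [item for item in item_features if item not in user_ratings]
    return sorted(unseen, key=lambda item: score(item, profile, item_features), reverse=True)


my_ratings = {"doc_budget": 5, "doc_forecast": 4, "doc_party": 1}
print(recommend(my_ratings, ITEM_FEATURES))  # ['doc_audit', 'doc_picnic']
```

Unlike collaborative filtering, this works for a brand-new item nobody has rated, as long as its metadata is filled in.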

K-Nearest Neighbours (Classification approach)
• Find ratings from people similar to you and see what they liked
  – Use similarity functions (Minkowski distance, RMSE, Pearson Correlation Coefficient, …)
• Take the average ratings of the k people most similar to you
  – Display the items with the highest averages
• Conclusion: requires a solid background in Math and Statistics
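
The steps above can be sketched in a few lines, here using the Pearson correlation coefficient as the similarity function. The ratings, user names, and the choice of k = 2 are hypothetical, and a production system would need to handle sparse data and ties more carefully.

```python
from math import sqrt

# Hypothetical ratings: user -> item -> rating on a 1-5 scale.
RATINGS = {
    "me": {"doc_a": 5, "doc_b": 1, "doc_c": 4},
    "u1": {"doc_a": 5, "doc_b": 2, "doc_c": 5, "doc_d": 5, "doc_e": 2},
    "u2": {"doc_a": 4, "doc_b": 1, "doc_c": 4, "doc_d": 4, "doc_e": 3},
    "u3": {"doc_a": 1, "doc_b": 5, "doc_c": 2, "doc_d": 1, "doc_e": 5},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated; 0.0 when undefined."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return 0.0
    xs = [a[item] for item in shared]
    ys = [b[item] for item in shared]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def nearest_neighbours(ratings, user, k):
    """The k users whose ratings correlate most strongly with `user`."""
    ranked = sorted(
        ((pearson(ratings[user], other), name) for name, other in ratings.items() if name != user),
        reverse=True,
    )
    return [name for _, name in ranked[:k]]


def recommend(ratings, user, k=2):
    """Average the neighbours' ratings for items `user` has not rated, highest first."""
    sums, counts = {}, {}
    for name in nearest_neighbours(ratings, user, k):
        for item, rating in ratings[name].items():
            if item not in ratings[user]:
                sums[item] = sums.get(item, 0) + rating
                counts[item] = counts.get(item, 0) + 1
    averages = {item: sums[item] / counts[item] for item in sums}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


print(recommend(RATINGS, "me"))  # [('doc_d', 4.5), ('doc_e', 2.5)]
```

Here `u3` rates in the opposite direction to `me`, so it is left out of the neighbourhood and does not drag the averages down.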

Machine Learning and Data Scientists
Developing predictive analytics and machine learning must be simpler; today it requires specialized skills:
• Data management
• Data exploration
• Math & statistics
• Domain expertise
• Machine learning
• Software development
• Data visualization
65% of enterprises feel they have a strategic shortage of data scientists, a role many did not know existed 12 months ago …

Microsoft Azure Machine Learning

Microsoft Azure Machine Learning (Ctd.)
• Personalized Workspace
  – Combine R modules with Microsoft’s best in class algorithms running Xbox and Bing
  – Work with anyone, anywhere by simply sharing the workspace
• Easy Access to All Data
  – Drop desktop data sets into the built-in storage space
  – Bring in cloud data with the ease of a drop down
• Deploy Models as Web Services
  – Operationalize in minutes and refine models at the speed of the market
• Partner Tools
  – ML partners enjoy SDK access for robust solutions
• Components: Microsoft Azure Machine Learning Studio, Microsoft Azure Machine Learning API service, Microsoft Azure Machine Learning SDK

Building your own Delve - high level architecture
[Diagram: data pipeline, from event producers to presentation]
• Event producers: Web logs, Documents & metadata
• Transform
• Long-term storage: Azure SQL Database & Azure Storage
• Predictive Analytics: Azure Machine Learning
• Presentation and action
Diagram label: On premise

Building your own Delve – remarks
• Graph technology left out for simplicity
  – Take a look at Neo4j or Pegasus on Hadoop if you are interested
• Not very realistic to rebuild Delve, but possible to define point solutions
• If you still go ahead
  – Think about the end-to-end data pipeline
  – Fast track with the Recommendations API in the Azure DataMarket: http://datamarket.azure.com/dataset/amla/recommendations
  – Cache recommendations for performance and cost optimization
  – Learn R or Python to extend AzureML capabilities
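
The caching remark above can be sketched as a small time-to-live cache placed in front of whatever service produces the recommendations. The class `RecommendationCache`, the `fake_fetch` stand-in, and the 60-second TTL are hypothetical; a real deployment would call the actual scoring endpoint and would probably use a shared cache rather than process memory.

```python
import time


class RecommendationCache:
    """Keep recommendations per user for a fixed time so the paid API is called less often."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch      # callable: user_id -> list of recommendations
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # user_id -> (expires_at, recommendations)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: no remote call
        recommendations = self._fetch(user_id)
        self._entries[user_id] = (now + self._ttl, recommendations)
        return recommendations


# Hypothetical stand-in for the remote recommendation service.
calls = []


def fake_fetch(user_id):
    calls.append(user_id)
    return [f"doc-for-{user_id}"]


fake_now = [0.0]
cache = RecommendationCache(fake_fetch, ttl_seconds=60, clock=lambda: fake_now[0])

cache.get("alice")
cache.get("alice")       # served from the cache
fake_now[0] = 61.0       # move the fake clock past the TTL
cache.get("alice")       # expired, so the service is called again
print(len(calls))  # 2
```

The TTL is the trade-off knob: a longer value lowers cost and latency, a shorter one keeps recommendations closer to the latest signals.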

Online Resources
• www.coursera.org (MOOC)
• Microsoft Virtual Academy
  – http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-azure-machine-learning
  – http://www.microsoftvirtualacademy.com/training-courses/implementing-big-data-analysis
• Cloud Data Science process - http://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-how-to-create-machine-learning-service/
• Blogs
  – http://blogs.msdn.com/b/benjguin/
  – http://hortonworks.com/blog/
  – http://blogs.msdn.com/b/bigdatasupport/
  – http://blogs.msdn.com/b/big_data_france/
  – http://blogs.msdn.com/b/brian_swan/
  – http://blogs.msdn.com/b/mwinkle/
  – http://blogs.msdn.com/b/avkashchauhan/
  – http://blogs.msdn.com/b/carlnol/
  – http://blogs.technet.com/b/machinelearning/
