Contact details:
Jen.Stirrup@datarelish.com
In a world where the HiPPO (Highest Paid Person’s Opinion) is final, how can we use technology to drive the organisation towards data-driven decision making as part of its organisational DNA? R provides a rich range of machine learning functionality, but that richness needs to be exposed in a way that is accessible to decision makers. Using Data Storytelling with R, we can imprint data in the culture of the organisation by making it easily accessible to everyone, including decision makers. Together, the insights and process of machine learning are combined with data visualisation to help organisations derive value and insights from big and little data.
9. “You have to start with the truth. The truth is the only way that we can get anywhere. Because any decision-making that is based upon lies or ignorance can't lead to a good conclusion.”
Julian Assange, Wikileaks
16. What Is Big Data?
[Figure: data sources by era, plotted against Volume and against Velocity - Variety - Variability]
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Web 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Storage cost per GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07
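To make the falling cost of storage concrete, the short Python sketch below takes the per-gigabyte prices quoted on this slide and works out what one terabyte would have cost in each year. The function name and the choice of 1 TB = 1,000 GB are illustrative assumptions, not part of the original slide.

```python
# Storage cost per GB in US dollars, as quoted on the slide.
COST_PER_GB = {1980: 190_000, 1990: 9_000, 2000: 15, 2010: 0.07}

# Assumption: decimal units, so 1 TB = 1,000 GB.
GB_PER_TB = 1_000


def cost_of_terabyte(year: int) -> float:
    """Return the cost in USD of storing 1 TB at that year's price per GB."""
    return COST_PER_GB[year] * GB_PER_TB


for year in sorted(COST_PER_GB):
    print(f"{year}: ${cost_of_terabyte(year):,.2f} per TB")
```

The same lookup can be extended with newer price points if they are available from a trusted source.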
17. The world’s data
[Chart: the world’s data, analog versus digital, 1985 to 2020]
Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote:
https://www.youtube.com/watch?v=DZW1-euLaQ4&feature=youtu.be&t=17m10s
18. The world’s data
[Chart: the world’s data, analog versus digital, 1985 to 2020. Storage labels: CD, DVD / Blu-ray, digital tape, PC / device, datacenters (cloud)]
Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote:
https://www.youtube.com/watch?v=DZW1-euLaQ4&feature=youtu.be&t=17m10s
19. Connected data
[Chart: analog, digital and connected data, 1985 to 2020. Storage labels: CD, DVD / Blu-ray, digital tape, PC / device, datacenters (cloud)]
Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote:
https://www.youtube.com/watch?v=DZW1-euLaQ4&feature=youtu.be&t=17m10s
20. Connected data
[Chart: analog, digital and connected data, 1985 to 2020. Labels: PC / mobile, cloud / IoT]
Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote:
https://www.youtube.com/watch?v=DZW1-euLaQ4&feature=youtu.be&t=17m10s
22. Embracing data transforms business
It is central to outperforming competitors
Agriculture, Education, Manufacturing, Aerospace, Financial, Automotive, Government, Retail, Healthcare
Credit: http://download.microsoft.com/documents/en-us/making_the_right_analytics_investments_whitepaper.pdf
23. Challenges of the modern data platform
Inefficiencies from fragmented architecture
• Disparate systems and processes
• Multiple tools and skillsets
• Siloed insights on disconnected data
• High cost of ownership
[Diagram labels: Relational, Beyond relational, On-premises, Cloud]
Credit: http://download.microsoft.com/documents/en-us/making_the_right_analytics_investments_whitepaper.pdf
24. Business Analytics & Data Management Platform
Business Analytics: Power BI, Cortana Analytics, Azure IoT
Data Management: Azure SQL DB, Azure SQL DW, Azure Data Lake, SQL Server 2016, Analytics Platform System
[Diagram axes: Relational / Beyond relational; On-premises / Cloud]
Credit: http://download.microsoft.com/documents/en-us/making_the_right_analytics_investments_whitepaper.pdf
36. In “about five to eight seconds, someone’s going to make the decision of do they devote any more time to looking at what you’ve got in front of them or do they move on to the next thing.”
Cole Nussbaumer
StorytellingwithData.com
From: http://cxcafe.maritzcx.com/storytelling-with-data-dashboarding-with-cole-nussbaumer/
37. London Cholera Map – John Snow
1854. London. Cholera strikes. In just
10 days, over 500 people have been
killed in one neighborhood. The
mysterious cluster of deaths is
especially terrifying because no one
understands the source.
No one besides John Snow, an
epidemiologist who realized the water
supply was spreading the disease.
38. London Cholera Map – John Snow
He plotted every death on a map with
ingenious mapped bar charts (see left)
and was able to show that the closer
people lived to the Broad Street water
pump, the greater the number of
deaths.
The information helped convince the
public a true sewage system was
needed and spurred the city to action.
39. Gapminder – Hans Rosling
The Swedish scientist Hans Rosling had
been working with developmental data
for over 30 years – but it took a great
visualization and a 2007 TED talk for him
to share his passion with the world.
His original viz (now one of many) shows
the relationship between income and life
expectancy. The data is simple but
Rosling’s visual storytelling has allowed
him to spread his passion for this
fascinating, overlooked data to millions.
41. War Mortality – Florence Nightingale
1855. The Crimea. Britain is fighting a
battle with both Russia and disease.
As a nurse, how do you convince an
army to invest in hospitals and
healthcare instead of guns and
ammunition?
Florence Nightingale told her story
with data by showing the staggering
number of deaths due to preventable
disease (shown in blue/grey). After
this viz, sanitation became a major
priority for the British Army.
43. Consider the kind of data story you have.
• Distribution
• Part to Whole
• Correlation
• Time Series
• Compare Categories
• Ranking
Image credit: Column Five Media’s Visage Data Visualization 101
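As a rough illustration of how a story type can steer the choice of visual, the Python sketch below maps each of the six story types on this slide to one commonly used chart. The story types come from the slide; the chart suggestions and the `suggest_chart` helper are assumptions made for illustration, not recommendations taken from the original deck or its image source.

```python
# Story types are from the slide. The chart suggestions are common
# conventions assumed here for illustration only.
CHART_FOR_STORY = {
    "distribution": "histogram",
    "part to whole": "stacked bar chart",
    "correlation": "scatter plot",
    "time series": "line chart",
    "compare categories": "bar chart",
    "ranking": "sorted bar chart",
}


def suggest_chart(story: str) -> str:
    """Return a suggested chart for a story type (hypothetical helper)."""
    key = story.strip().lower()
    if key not in CHART_FOR_STORY:
        raise ValueError(f"Unknown data story: {story!r}")
    return CHART_FOR_STORY[key]


print(suggest_chart("Time Series"))
```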
55. Tipping Point to NoSQL
• New Paradigm
• Large Data Sets
• Scalability
• Social Media
• Structured / Unstructured Data
56. What is NoSQL?
• Any database that is not relational
• Non-Relational SQL
• Not ‘No SQL’
• But Not Only SQL
57. Where is NOSQL used?
Cassandra used on:
Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco
Hadoop used on:
Amazon Web Services, Pentaho, Yahoo!, The New York Times
CouchDB used on:
CERN, BBC, Interactive Mediums
MongoDB used on:
Foursquare, bit.ly, SourceForge, Fotopedia, Joomla Ads
Riak used on:
Widescript, Western Communications, Ask Sponsored Listings
62. Key-Value Stores
• Keys are used to access blobs of data
• Video
• Images..
• A key uniquely identifies each record.
• Dictionaries have records that are stored and
retrieved using a key.
• It is fast because the key uniquely identifies
each record.
• Data is a single opaque collection
[Diagram: table of Key / Value pairs]
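The bullets above can be sketched in a few lines of Python. This is a toy in-memory store, assuming a plain dictionary as the backing structure; the `KeyValueStore` class and the sample keys are hypothetical and do not represent any particular NoSQL product.

```python
class KeyValueStore:
    """Toy in-memory key-value store; each value is an opaque blob of bytes."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only files it under the key.
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # Lookup is fast because the key uniquely identifies the record.
        return self._data[key]

    def delete(self, key: str) -> None:
        del self._data[key]


store = KeyValueStore()
store.put("video:42", b"\x00\x01 raw video bytes")
store.put("image:7", b"\x89PNG raw image bytes")
print(len(store.get("video:42")), "bytes stored under video:42")
```

Note that there is no way to query by what is inside a value; the key is the only access path, which is the trade-off the slide describes.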
63. Locker Analogy
• Keys are used to access blobs of data
• Video
• Images..
• A key uniquely identifies each record.
• Dictionaries have records that are stored and
retrieved using a key.
• The Value is simply an object.
64. Graph Store
• Data is stored in nodes, which have properties
• They are connected by critical relationships
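A minimal Python sketch of the idea follows: nodes carry properties, and named relationships connect them. The `Node` class, the relationship names and the sample data are all hypothetical, chosen only to illustrate the structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)
    # Outgoing relationships as (relationship_type, target_node) pairs.
    edges: list = field(default_factory=list, repr=False, compare=False)

    def relate(self, rel_type: str, target: "Node") -> None:
        self.edges.append((rel_type, target))

    def neighbours(self, rel_type: str) -> list:
        """Names of the nodes reached by following one relationship type."""
        return [target.name for rel, target in self.edges if rel == rel_type]


alice = Node("Alice", {"role": "analyst"})
bob = Node("Bob", {"role": "decision maker"})
report = Node("Q3 report", {"format": "dashboard"})

alice.relate("WROTE", report)
alice.relate("REPORTS_TO", bob)

print(alice.neighbours("WROTE"))
```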
65. Documents
• Data stored in nested hierarchies
• Logical data remains stored together as a unit
• Any item in the document can be queried
• Pros: No object-relational mapping layer, ideal
for search
• Cons: Complex to implement, incompatible with
SQL
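To illustrate "logical data remains stored together as a unit" and "any item in the document can be queried", the Python sketch below keeps an order and its line items in one nested structure and reads values by path. The `find` helper, the dotted-path syntax and the sample order are assumptions for illustration, not the query language of any specific document database.

```python
# One order, stored as a single nested document (hypothetical sample data).
order = {
    "order_id": 1001,
    "customer": {
        "name": "Ada",
        "address": {"city": "London", "country": "UK"},
    },
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}


def find(doc, path: str):
    """Follow a dotted path such as 'customer.address.city' into a document.

    Numeric path parts index into lists, e.g. 'items.1.sku'.
    """
    current = doc
    for part in path.split("."):
        current = current[int(part)] if isinstance(current, list) else current[part]
    return current


print(find(order, "customer.address.city"))
print(find(order, "items.1.sku"))
```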
66. Database Availability Online
Database Availability Means
CAP Theorem (BASE vs ACID)
Partitioning and Replication
Replication Diagram
“Ring” of Consistent Hashing
Next …. → Database Integrity
67. What is Database Availability?
● High Availability: the database and application are available
during scheduled periods; the system is temporarily down only
during maintenance periods.
● Continuous Operation: the system is available all the time
with no scheduled outages.
● Continuous Availability: a combination of HA and CO;
data is always available, and maintenance is done
without shutting down the system.
69. ACID and BASE
ACID
Atomicity: All or nothing
Consistency: Any transaction should result in valid tables
Isolation: Concurrent transactions do not interfere with each other
Durability: Committed data will survive a system failure
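The "all or nothing" property can be demonstrated with Python's built-in sqlite3 module. This is a small sketch using an in-memory SQLite database as a stand-in for any relational engine; the `accounts` table, the `transfer` function and the balances are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
print(ok, failed, balances(conn))
```

In the failed transfer the credit to bob has already been applied when the debit violates the CHECK constraint, so the rollback has to undo it; the final balances show that neither half of the transfer survived.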
70. BASE
BASE
Basically Available - system seems to work all the time
Soft State - it doesn't have to be consistent all the time
Eventually Consistent - becomes consistent at some later
time
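The three BASE properties can be sketched with a toy simulation in Python. This is an assumption-laden illustration, not a real replication protocol: the class names, the single "primary" replica and the explicit `sync` step are all hypothetical simplifications.

```python
class Replica:
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes land on one replica at once and reach the others on sync()."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet propagated to the other replicas

    def write(self, key, value):
        # Basically available: the write is accepted immediately.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Soft state: a replica may return stale or missing data.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Eventually consistent: propagate pending writes in order.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("stock", 10)
stale = store.read("stock", replica_index=2)  # read before propagation
store.sync()
fresh = store.read("stock", replica_index=2)  # read after propagation
print(stale, fresh)
```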
71. Scalability
Vertical scale
Improving the server specification by adding more processors, RAM, and storage devices. Limited and expensive.
Horizontal scale
Adding more cheap computers as servers. This requires sharding and partitioning, which are hard to implement and expensive with relational databases (RDBMS).
72. Partitioning
Sharing the data between different nodes (data hosts)
Each node placed on a ring
Advantage : ability to scale incrementally
Issues : non-uniform data distribution
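The ring described here can be sketched as a small consistent-hashing example in Python. It is a minimal illustration under stated assumptions: SHA-256 is an arbitrary choice of hash function, each node gets a single position on the ring (real systems often use several virtual positions per node to even out the distribution), and the node and key names are hypothetical.

```python
import bisect
import hashlib
from collections import Counter


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (hash choice is an assumption)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (ring_position(node), node))

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        positions = [pos for pos, _ in self._ring]
        index = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]

before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")  # scale out by one node
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print("keys per node:", Counter(after.values()))
print("keys moved when node-d joined:", len(moved))
```

Only keys that now belong to the new node change owner, which is what makes incremental scaling possible; the per-node counts printed at the end also show the non-uniform distribution the slide lists as an issue.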
74. •NoSQL solutions need to solve real-world
business problems
•Search
•High Availability
•Agility
75. • Big Data is not the same as NoSQL.
• NoSQL is more than dealing with big datasets.
• NoSQL includes concepts that can be managed by a single processor
• However, big data problems are a primary use case for NoSQL.
76. One or many databases?
One Database
• Easy to understand
• Easy to set up and configure
• Easy to administer
• Single source
• Limited scalability
79. Big Data Problems
[Diagram: types of big data problems. Labels: Read-mostly; Read-write; Documents; Full Text; Event Log; Real Time; Batch; Graph; Transactions]
80. Why do databases fail?
• Anything that can go wrong, will go wrong – Murphy’s
law.
• Human error
• Network failure
• Hardware failure
• Security
81. What can we do to support Hadoop?
• Hadoop helps manage and process large datasets
• Hadoop provides linear scalability
• Hadoop brings computing logic to the data rather than
bringing the data to computing logic.
82. Hadoop Clustering basics
•Hadoop uses a cluster for data storage and
computation purposes.
•It is used to write and run distributed applications that
process huge amounts of data
83. What is the purpose of Hive?
Hive is a data warehousing system for Hadoop
To meet the needs of businesses, data scientists, analysts and BI professionals
Data, Summarized
Fit a structure onto data
Data, Analyzed
Analysis of Large Datasets stored in Hadoop File Systems
SQL-Like language called HiveQL
Custom mappers and reducers when HiveQL isn’t enough
86. What can Hive offer you?
Hive can help with a range of business problems:
• Log Processing
• Predictive Modelling
• Hypothesis testing
• And Business Intelligence
87. Hive is not a replacement for SQL
So don’t throw out your SQL Server instances!
• Hive is for processing large data sets that may span hundreds, or even thousands, of machines
• Hive has a high overhead for starting a job: it translates queries to MapReduce (MR) jobs, so it takes time
• Unlike SQL Server, Hive does not cache data
• Hive performance tuning is mainly Hadoop performance tuning
• Similarity of the query engine, but different architectures for different purposes
88. HiveQL
Hive QL is a SQL-like language
It outputs naturally occurring groups for further analysis
Easy Data Summarization
Large Datasets, summarized
Fit a structure onto data
Analysis of Large Datasets stored in Hadoop file systems
SQL-Like language called HiveQL
Custom mappers and reducers when HiveQL isn’t enough
89. HiveQL Queries like SQL Queries?
Similarities in Syntax and Features
Similar features
SELECT
FROM
WHERE
GROUP BY / HAVING
Table Aliases
Computed Columns
90. HiveQL Queries like SQL Queries?
Similarities in Syntax and Features
Similar features
Aggregate Functions
Nested Select
CASE
LIKE / RLIKE
JOIN
ORDER BY / SORT BY
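The shared features listed on slides 89 and 90 can be seen together in one query. The Python sketch below runs it against an in-memory SQLite database as a stand-in, because no Hive cluster is assumed here; the table names, column names and sample rows are hypothetical. The query uses only features from the lists above (SELECT, FROM, WHERE, GROUP BY / HAVING, table aliases, computed columns, aggregate functions, CASE, LIKE, JOIN, ORDER BY) and should need little or no change for HiveQL, although it has not been run against Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE page_views (user_id INTEGER, url TEXT, seconds INTEGER);
    INSERT INTO users VALUES (1, 'UK'), (2, 'UK'), (3, 'SE');
    INSERT INTO page_views VALUES
        (1, '/blog/r-intro', 40), (1, '/blog/hive', 80),
        (2, '/blog/hive', 20), (3, '/about', 5), (3, '/blog/pig', 65);
    """
)

QUERY = """
SELECT u.country,
       COUNT(*) AS view_count,
       SUM(v.seconds) / 60.0 AS minutes,
       CASE WHEN SUM(v.seconds) >= 100 THEN 'engaged' ELSE 'casual' END AS segment
FROM page_views v
JOIN users u ON v.user_id = u.user_id
WHERE v.url LIKE '/blog/%'
GROUP BY u.country
HAVING COUNT(*) >= 1
ORDER BY view_count DESC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```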
91. How does Hive work?
Hive as a Translation Tool
Compiles and executes queries
Hive translates the SQL Query to a Map Reduce Job
These are chained together
Queries are compiled and executed
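To give a feel for what "translates the SQL query to a MapReduce job" means, the Python sketch below hand-writes the map, shuffle and reduce steps for a query equivalent to SELECT page, COUNT(*) ... GROUP BY page. It is a conceptual illustration only: the function names and sample rows are hypothetical, and this is not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical input rows, standing in for records read from HDFS.
ROWS = [
    {"page": "/home", "user": "a"},
    {"page": "/blog", "user": "b"},
    {"page": "/home", "user": "c"},
    {"page": "/blog", "user": "a"},
    {"page": "/home", "user": "b"},
]


def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(ROWS)))
print(counts)
```

A more complex query would need several such jobs, with the output of one feeding the next, which is what "chained together" refers to above.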
92. How does Hive work?
Hive as a structuring Tool
Creates a schema around the data
Tables stored in Directories
Hive Tables
Rows and columns, like SQL tables
Hive Metastore
Namespace with a set of tables
Holds table definitions
Physical Layout
Column Types
Partition Information
93. Hive and SQL Data Types
Hive          SQL
Tinyint       Tinyint
SmallInt      Smallint
Int           Int
BigInt        BigInt
Boolean       Bit (setting as NOT NULL)
Float         Real
Double        Float
BigDecimal    Decimal
94. Hive and SQL Data Types
Hive          SQL
String        Char, varchar, nvarchar, ntext, text, image
Binary        binary
Timestamp     Timestamp (note that this is being deprecated), RowVersion
96. Power View and Power Map: Different Tools for Different Jobs
Power View
• Highly Visual Design Experience
• Power View is an interactive, ad hoc query and visualization experience.
• It is for business question ‘mystery’ solving
Power Map
• Power Map is a new 3D visualization add-in for Excel helping you to analyse geographical and temporal data
• Mapping
• Exploring
• Interacting
97. Hive and Pig: Similarities
Hive and Pig are great at crunching large amounts of
data from HDFS to database
Both compile to Map Reduce jobs
Pig is Procedural, Hive is Declarative
Hive is much closer to SQL in terms of querying – this can be a
good or a bad thing!
98. Hive and Pig: Differences
Pig | Hive
Procedural | Declarative
Fits cleanly into pipeline paradigm; no need for temporary tables | Temporary tables are ubiquitous but can be disjointed; may involve clean up
Greater control over dataflow: checkpoints; naturally handles splitting of data streams | SQL expects one result and works towards it; handles trees but not splits
Optimizing done by developer | Hive optimisation is passed to the Hive Query Optimizer
99. Hive and Pig: When are they best used?
Different Tools with Different Jobs
Pig is akin to SSIS
Great for dataflows and automated batch jobs
Hive is akin to ad-hoc, analytics SQL Queries
Results that make sense of the data
100. Why, Who & How of Power BI
More Specialized
BI Pros
Power Users
Decision Makers
Business Analysts
Information Workers
Self-Service
• Power Pivot
• Power View
• Power Query
• Power Map
Clients
• Excel Services
• Office Professional
103. Microsoft Power BI for Office 365
1 in 4 enterprise customers on Office 365
1 Billion Office Users
Analyze, Visualize, Share, Find
Q&A, Mobile, Discover
Scalable | Manageable | Trusted
104. Power Query
Enable self-service data discovery, query, transformation and mashup experiences for Information Workers, via
Excel and PowerPivot
Discovery and connectivity to a wide range of data sources, spanning volume as well as variety of
data.
Highly interactive and intuitive experience for rapidly and iteratively building queries over any
data source, any size.
Consistency of experience, and parity of query capabilities over all data sources.
Joins across different data sources; ability to create custom views over data that can then be
shared with team/department.
111. Power View – Business Mysteries, Solved
Power View is an interactive data exploration,
visualization, and presentation experience
Highly visual design experience
Rich meta-driven interactivity
Presentation-ready at all times
It delivers intuitive ad-hoc reporting for business
users
112. Introducing Power View
It is now also available in Excel 2013, and with new features:
• Maps
• Pie charts
• Hierarchies
• KPIs
• Drill down/Drill up
• Report styles, themes and text resizing
• Backgrounds with images
• Hyperlinks
• Printing
113. Power View in Excel
[Architecture diagram. Components: Excel; Database server; SQL AS (Tabular); Power View; SQL RS; ADOMD.NET; SQL AS (PowerPivot)]
114. Power View in SharePoint
[Architecture diagram. Components: Browser; SharePoint web server; SharePoint app server; Database server; SQL AS (PowerPivot); SQL AS (Tabular); SQL RS Add-In; SQL RS; Power View]
116. What Is Power Map?
Power Map for Microsoft Excel enables information workers to discover and share new insights
from geographical and temporal data through three-dimensional storytelling.
117. Power Map: Steps to 3D insights
Map Data
• Data in Excel
• Geo-Code
• 3D and 3 Visuals
Discover Insights
• Play over Time
• Annotate points
• Capture scenes
Share Stories
• Cinematic Effects
• Interactive Tours
• Share Workbook