Data science, with its specialized tools and knowledge, has long been the forte of data scientists. Yet even data scientists find it hard to get access to data that may live in different data stores across the organization. To unleash the power of data and gain valuable insights, machine learning needs to be made easily consumable by various stakeholders, and access to data made simpler. As an organization's data volumes continue to grow, delivering these insights in real time is a complex challenge to solve.
This session will provide an overview of an approach to building a scalable solution in which machine and deep learning, and access to data, are made far more consumable and simpler by the fastest SQL-on-Hadoop engine on the planet, a rich data scientist toolset, and an infrastructure that can deliver the responsiveness needed for production environments.
Speakers:
Pandit Prasad, Program Director, IBM
Ashutosh Mate, Global Senior Solutions Architect, IBM
3. 3
Operationalizing machine learning and getting actionable insights from
disparate sources of data have been a huge challenge
[Diagram: the Data Integration/Data Engineering team, the Data Science team, the Application Development team, and the Line of Business each work against their own stores (IBM Db2 among them) — data still lives in silos. To operationalize machine learning, the organization needs to act fast: ACT NOW!]
5. 5
With Big SQL, Amy's team can save time on execution and
enhance productivity
[Diagram: Chris, the data engineer, performs one-time development against IBM Big SQL, which brings together Product Details, Sales Campaign, and Customer Details data. On top of the same engine, Ryan, the data scientist, uses Spark integration, and Nick, the application developer, uses application integration, alongside federation.]
6. 6
Democratize Data Science and Machine Learning
• Data Ingestion — Virtualize disparate data sources like Hadoop, RDBMS, and Object Stores (S3) to join data in a single query
• Data Transformation / Data Science / Machine Learning — Manipulate data and operationalize data science models written in various languages
• Data Visualization — Perform data discovery, analyze, and visualize business results in notebooks or other BI tools
8. 8
Do you have any of these challenges?
• Want to modernize your EDW without long and costly migration efforts
• Offloading historical data from Oracle, Db2, or Netezza because you are reaching capacity
• Need to operationalize machine learning
• Need to query, optimize, and integrate multiple data sources from one single endpoint
• Slow query performance for SQL workloads
• Lack the skill set to migrate data from an RDBMS to Hadoop/Hive
9. 9
Big Fish Games – Uses Big SQL to combine disparate data to drive product
innovation through customer feedback with the use of analytics
“The ability to answer complicated questions with data from
disparate sources will allow our analysts to focus on
answering business questions without having to worry
about where the data lives or waiting on a project to
perform the data integration for them.” -- David Darden,
BI Engineering Manager, Big Fish Games
Business need:
• Understand which product features resonate the best with
the gaming community
• Increase cross-sell and up-sell opportunities
Solution:
• Combines structured (customer) data from PureData System
for Analytics with semi-structured (game log) data in Hadoop
Beta Experience and Outcomes:
Puts users in charge of data analysis - Access to data without
technology getting in the way
Faster insights – fast data movement to Hadoop,
3X faster than Sqoop
Leverage existing skills - IBM Big SQL enables and leverages
existing SQL skill set
10. 10
Southwest Power Pool – Uses Big SQL to federate and reuse applications
while taking first steps in establishing an enterprise data lake
Business need:
• Near term requirements: Offload less frequently used data
• Long term vision: Build a data lake infrastructure
Solution:
• Offloaded cold data to Hadoop and reused applications
• Combine cold data on Hadoop with hot data on Netezza
to derive insights
Business Benefits:
Application portability – SQL compatibility enables reuse
of application with minimal business query modifications -
Support for Netezza functions
Federation – query capabilities between Netezza and
Hadoop
Leverage existing skills - IBM Big SQL enables and
leverages existing SQL skill set
Big SQL provides SQL query, federation, transactional,
performance, and security capabilities, which will combine
with streaming data, governance, and BI analytics in later
phases.
11. 11
EY – Query data from different sources using Big SQL to help prevent fraud
Business need:
• Analyze data to quickly detect fraud threats
• Forensic data analytics is seen as a key capability to
invest in
Solution:
• EY is now able to detect potential threats before they
escalate
Business Benefits:
High performance helps queries to run in minutes, not
hours, helping clients rapidly identify and eliminate threats
Gathers data from multiple sources and applies real-
time analytics to identify hidden patterns and anomalies
Ability to process a wide variety of data types, from
journal entries and payment streams to email, news feeds
and social media.
EY is a global leader in assurance, tax, transaction and advisory services.
The insights and quality services EY delivers help build trust and confidence
in the capital markets and in economies the world over, and help to build a
better working world for EY's people, clients and communities.
Transformation
EY provides its clients with comprehensive protection against fraud and security risks.
IBM Analytics helps EY rapidly detect potential threats before they escalate.
12. 12
Vestas – Leverages complex queries and performance capabilities to turn
climate into capital with Big data using IBM Big SQL
“In our development strategy, we see growing our library in
the range of 18 to 24 petabytes of data. And while it’s fairly
easy to build that library, we needed to make sure that we
could gain knowledge from that data.”
— Lars Christian Christensen, vice president, Vestas Wind Systems
The transformation:
Successful analysis resulted in a 97% decrease in response
times for the wind forecasting information used to pinpoint optimal
turbine placement, maximizing power generation and reducing
energy costs.
Business Benefits:
High performance helps queries run in minutes, not
hours, delivering wind-forecasting insights rapidly
Complex query processing helps manage and analyze
weather and location data for calculating the right location
for turbines
Business need:
• Need to process complex queries on large volume of data
• Analyze wind forecasting information for ideal placement of
turbines
Solution:
• Analyzed petabytes of data using complex queries on weather
and location data for turbine placements
• Successful placements of turbines led to increased customer
ROI
13. 13
"In a half-day workshop, we were able to show the
chemical company how big data analytics work and were
able to identify four new customers."
—Dr. Michael Kowolenko, Senior Research Scholar, Poole
College of Management, NC State University
The transformation: The Poole College of Management at NC
State University is developing the next generation of data-
driven decision makers, utilizing an IBM big data solution based
on PowerLinux technology. This system allows its students to
effectively manage and analyze large volumes of structured
and unstructured data from a variety of sources.
NC State University Poole College of Management – Helping businesses
uncover new opportunities with IBM on PowerLinux
Business need:
• Make data-driven decisions on which new courses to add
• Enable students to get real-world experience with Big Data
Solution:
• Created a curriculum that enables students to apply Big
data analytics to real-world problems
• Help businesses identify new opportunities
14. 14
Major North American Food Retailer implements HDP on IBM POWER
Business need:
• Gain a competitive advantage by retaining and analyzing
their store level loyalty program data
• Bring outsourced analytics back in-house
Solution:
• Consolidation of client transaction data into a Hortonworks
Data Platform on Linux on IBM Power Systems.
• SAP Customer Activity Repository (CAR) application,
powered by SAP HANA, connected to the data lake to
enable real-time insights.
Business Benefits:
• More efficient and flexible in-store experiences for their
clients to increase client loyalty and purchases.
Time to Value
HDP 2.6 running on a cluster of 9 IBM Power System servers
Full solution deployed by IBM lab services and an IBM Business
Partner in < 2 weeks
Trial to production in 2 months
15. 15
What is IBM Big SQL?
Big SQL is the only SQL-on-Hadoop
solution to understand SQL syntax from
other vendors and products, including
Oracle, IBM Db2, and Netezza.
For this reason, Big SQL is the ultimate
hybrid engine to optimize EDW workloads
on an open Hadoop platform
17. 17
Big SQL queries heterogeneous systems in a single query — the only SQL-on-Hadoop engine that virtualizes more than 10
different data sources: RDBMS, NoSQL, HDFS, or Object Store
Big SQL Fluid Query (federation) — data sources virtualized:
Oracle, Microsoft SQL Server, Teradata, Db2, Netezza (PDA), Informix,
Hive, HBase, HDFS, Object Store (S3), WebHDFS
Big SQL allows query federation by virtualizing data sources and processing where data resides
Hortonworks Data Platform (HDP)
Data Virtualization
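As a sketch of what "processing where data resides" looks like in practice (the schema, table, and server names here are hypothetical, and a federated server with a nickname is assumed to be configured already), a single Big SQL statement can join a local Hive table with a remote RDBMS table:

```sql
-- Hypothetical example: bigsql.sales_hive is a Hive table on HDFS;
-- ora.customers is a nickname over a table in a federated Oracle server.
SELECT c.customer_id,
       c.region,
       SUM(s.amount) AS total_sales
FROM   bigsql.sales_hive AS s          -- data in HDFS, managed by Hive
JOIN   ora.customers     AS c          -- data living in Oracle, via federation
       ON s.customer_id = c.customer_id
GROUP  BY c.customer_id, c.region;
```

Big SQL pushes down what it can to the remote source and joins the results, so the analyst never needs to know where each table physically lives.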
18. 18
§ Easy porting of enterprise applications
§ Ability to work seamlessly with Business Intelligence tools like Cognos to
gain insights
§ Big SQL integrates with Information Governance Catalog by enabling easy
shared imports to InfoSphere Metadata Asset Manager, which allows you to:
- Analyze assets
- Utilize assets in jobs
- Designate stewards for the assets
[Diagram: Oracle SQL, Db2 SQL, and Netezza SQL dialects all flow into Big SQL — SQL syntax tolerance (ANSI SQL compliant) — which in turn feeds Cognos Analytics and InfoSphere Metadata Asset Manager.]
Big SQL is a synergetic SQL engine that offers SQL compatibility, portability, and
the collaborative ability to run composite analysis on data
Data Offloading and Analytics
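To illustrate the syntax tolerance (table and column names here are hypothetical), Big SQL accepts Oracle-style built-ins alongside their ANSI equivalents, which eases porting of existing application SQL:

```sql
-- Oracle-style NVL and DECODE are tolerated next to ANSI COALESCE and CASE,
-- so ported queries typically run with little or no modification.
SELECT order_id,
       NVL(discount, 0)                  AS discount,       -- Oracle style
       COALESCE(discount, 0)             AS discount_ansi,  -- ANSI style
       DECODE(status, 'O', 'Open',
                      'C', 'Closed',
                           'Unknown')    AS status_text
FROM   sales.orders;
```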
19. 19
[Diagram: row-level security separates BRANCH_A and BRANCH_B data, while FINANCE (the security admin role) retains full visibility; role-based access control enables separation of duties and audit.]
Row and Column Level Security
Big SQL offers row- and column-level access control (RCAC) and role-based access control (RBAC), among other security settings
Data Security
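A sketch of the branch scenario in the diagram, using Db2-style RCAC statements that Big SQL inherits (the BANK.ACCOUNTS table, its columns, and the role names are hypothetical):

```sql
-- Rows: FINANCE sees everything; BRANCH_A users see only branch 'A' rows.
CREATE PERMISSION bank.branch_rows ON bank.accounts
   FOR ROWS WHERE VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
              OR (VERIFY_ROLE_FOR_USER(SESSION_USER, 'BRANCH_A') = 1
                  AND branch = 'A')
   ENFORCED FOR ALL ACCESS
   ENABLE;

-- Columns: mask the SSN for everyone outside FINANCE.
CREATE MASK bank.ssn_mask ON bank.accounts FOR COLUMN ssn RETURN
   CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
        THEN ssn
        ELSE 'XXX-XX-' || SUBSTR(ssn, 8, 4)
   END
   ENABLE;

-- Nothing is enforced until access control is activated on the table.
ALTER TABLE bank.accounts
   ACTIVATE ROW ACCESS CONTROL
   ACTIVATE COLUMN ACCESS CONTROL;
```

Because the rules live in the engine rather than in each application, every query path (BI tool, notebook, JDBC application) sees the same filtered view.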
20. 20
PERFORMANCE — Big SQL 5.0 is 3.2x faster than Spark SQL 2.1
(4 concurrent streams); snapshot of the 100TB Hadoop-DS benchmark
I/O (vs Spark) — Big SQL reads 12x less data and writes 30x less data
COMPRESSION — 60% space saved with Parquet
AVERAGE CPU USAGE — 76.4%
MAX I/O THROUGHPUT — read 4.4 GB/sec, write 2.8 GB/sec
WORKING QUERIES
Big SQL’s Performance at a Glance
Leads performance metrics on high volumes of data and concurrent streams
21. 21
Right Tool for the Right Job
Not Mutually Exclusive. Hive, Big SQL & Spark SQL can co-exist and complement each other in a cluster
Big SQL — Federation, complex queries, high concurrency, enterprise ready,
application portability, all open source file formats.
Ideal tool for BI data analysts and production workloads.
Spark SQL — Machine learning, data exploration, simpler SQL.
Ideal tool for data scientists and discovery.
Hive — In-memory cache, geospatial analytics, ACID capabilities, fast ingest.
Ideal tool for simple BI data analysis and production workloads.
22. 22
Summary – Get more for Less with Big SQL
§ Big SQL is really a powerful runtime that makes access to Hive tables fast and secure.
§ Big SQL supports ANSI SQL 2003, 2008, and even parts of SQL 2011!
§ Though SQL for HBase can be achieved with projects like Apache Phoenix, Big SQL provides
seamless access to both HBase and Hive tables with the ability to join them too!
§ Big SQL can start working with existing Hive and HBase tables in Hadoop
§ With Big SQL Nicknames, you can provide seamless access to remote data sources to allow
users to see and experiment with data – without the time and cost associated with building
ingest processes.
§ All of this capability, provided with a single driver, one connection, and a single robust,
consistent, ANSI-compliant SQL dialect – with unified security managed for all object types.
Big SQL will save you Time and Money
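The one-time setup behind a Nickname follows the Db2 federation pattern; a sketch for an Oracle source, where the wrapper options, server name, credentials, and remote table are all placeholders:

```sql
-- Hypothetical one-time federation setup for an Oracle source.
CREATE WRAPPER net8;                                -- Oracle client wrapper
CREATE SERVER orasrv TYPE oracle VERSION 12 WRAPPER net8
       OPTIONS (NODE 'ORA_TNS_ENTRY');              -- TNS entry is a placeholder
CREATE USER MAPPING FOR USER SERVER orasrv
       OPTIONS (REMOTE_AUTHID 'appuser', REMOTE_PASSWORD '*****');
CREATE NICKNAME ora.customers FOR orasrv."APP"."CUSTOMERS";
-- From here on, ora.customers can be queried and joined like a local table.
```

Once defined, the nickname lets users explore the remote data immediately, deferring the decision of whether a full ingest pipeline is ever needed.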
24. 24
Innovation Pervasive in the Design
Power Systems S822LC for Big Data
Not Just Another Intel Server
NVIDIA: Tesla K80 GPU Accelerator
Linux by Red Hat: RHEL 7.2
Mellanox: InfiniBand/Ethernet connectivity in and out of the server
HGST: optional NVMe adapters
Alpha Data with Xilinx FPGA: optional CAPI accelerator
Broadcom: optional PCIe adapters
QLogic: optional Fibre Channel PCIe
Samsung: SSDs and NVMe
Hynix, Samsung, Micron: DDR4
IBM: POWER8 CPU
25. 25
IBM Power S822LC for Big Data and Hortonworks Combine to Deliver Leadership in
Hadoop Environments
• Hive/Tez performance results are based on IBM internal testing of 10 queries (simple, medium, and complex) with varying runtimes running against a 10TB database. The tests were run on 10x IBM Power System S822LC for Big Data
(20 cores / 40 threads, 2x POWER8 2.92GHz, 256 GB memory, RHEL 7.2, HDP 2.5.3) compared to the published x86/Hortonworks results running on 10x AWS d2.8xlarge EC2 nodes running HDP 2.5; details can be found at
https://hortonworks.com/blog/apache-hive-going-memory-computing/ . Data as of February 28, 2017.
• Big SQL performance results are based on IBM internal testing of all 50 TPC-DS queries selected by Hortonworks. There is a major difference with the 10 longest-running queries, based on the 10TB result the Hortonworks team achieved
with 10x AWS d2.8xlarge EC2 data nodes running HDP 2.5.2 (details can be found at https://hortonworks.com/blog/apache-hive-going-memory-computing/). 11x S822LC for Big Data Power servers were used as data nodes running
Big SQL and IOP 4.2.5. Data as of July 12, 2017.
• Conducted under laboratory conditions; individual results can vary based on workload size, use of storage subsystems, and other conditions.
• POWER8 and Hortonworks deliver 1.70x the throughput
compared to Hortonworks Hive/Tez running on x86
– 70% more QpH based on the average response time –
complete the same amount of work with fewer system
resources
– 41% reduction on average in query response time –
reduced response time enables making business
decisions faster.
• IBM Big SQL on IBM Power Systems can deliver 3.5x faster
query times on average for the most complex queries
26. 26
Data lake with IBM Spectrum Scale
Unleash new storage economies on a global scale.
Consolidate all your unstructured data storage on Spectrum Scale with unlimited and painless scaling of capacity and performance.
[Diagram: a global namespace powered by IBM Spectrum Scale serves file (POSIX, NFS, SMB), block (iSCSI), object (Swift, S3), transparent HDFS, and OpenStack (Cinder, Glance, Manila) access for client workstations, users and applications, compute farms, and both traditional and new-generation applications. Automated data placement and migration spans flash, disk, tape, shared-nothing clusters, and a transparent cloud tier; worldwide data distribution (R/W) across Sites A, B, and C; encryption; AFM-DR to a DR site; Spectrum Scale RAID on JBOD/JBOF; compression; 4000+ clients.]
27. 27
Why Spectrum Scale for Big Data & Analytics
• Extreme scalability with a parallel file system architecture —
no centralized metadata node bottleneck; every node in the cluster
can serve both data and metadata
• Global namespace that can span geographies
• Active-active replicas of data for real-time global collaboration
• Reduce datacenter footprint with the industry's best in-place analytics
• True software-defined storage that can be purchased as software only
(IBM Spectrum Scale) or as a pre-integrated system
(IBM Elastic Storage Server, ESS)
• Access to the data using any of the industry-standard protocols:
NFS, SMB, POSIX, Object, HDFS API
28. 28
Reduce data center footprint with Spectrum Scale
Multiple copies with an HDFS-based workflow:
Raw data is written to ext4, then moved and copied into HDFS for Hadoop
analysis jobs, leaving copies in both HDFS and ext4 for traditional
applications. Data scientists waste days just copying data to HDFS; the
copy process can take hours or days, so results are eventually based on
stale data. Data protection is also costly — HDFS defaults to 3-way
replication. Example: for 5PB of data, HDFS requires 15PB of storage.
Spectrum Scale in-place analytics (no copies required):
Applications write directly to the Hadoop path over NFS/SMB/Object/POSIX,
and Hadoop analysis jobs direct-read the same single version through the
HDFS APIs — one version of the data, no copies required. IBM ESS software
RAID eliminates the need for 3-way replication, with just a 30% extra
storage requirement. (Erasure coding in HDFS has limitations and is good
only for cold data.)
Example: for 5PB of data, ESS requires 6.5PB of storage.
29. 29
HDP and IBM Systems – Better Together
1 Flexibility – Richest family of Linux servers to match your workload's scale and reliability needs
2 Performance and Price/Performance – Leading performance for SQL and Spark workloads
Ø 1.70x the throughput compared to Hortonworks running on x86 and a 3x price/performance
guarantee
3 Designed for Cognitive + AI – Obtain your ML/DL results faster with AI on Power servers
Ø PowerAI is the only commercial offering containing all key deep learning frameworks
§ Caffe, TensorFlow, Torch, Theano, OpenBLAS, NCCL, NVIDIA DIGITS*
4 TCO at Scale with HDP on Power Systems:
Ø Host mixed application workloads on a single global filesystem with IBM Spectrum Scale
Ø Up to 3x reduction of storage and compute infrastructure moving to Power Systems and IBM
Elastic Storage Server vs commodity scale-out x86
5 Mission Critical Support
Ø Stable, trusted Hadoop platform on proven Power Systems with outstanding client support
30. 30
Check out the Breakout Sessions
1. Scaling Data Science on Big Data — Wed, 9/20 @ 11:00 AM, Room C2.3
2. Ingesting Data at Blazing Speed using Apache ORC — Wed, 9/20 @ 4:20 PM, Room C4.7
3. Open metadata and governance with Apache Atlas — Wed, 9/20 @ 5:10 PM, Room C4.6
4. Empowering YOU with Democratized Data Access, Data Science and Machine Learning — Wed, 9/20 @ 6:00 PM, Room C4.5
5. Breaching the 100TB mark with SQL over Hadoop — Thurs, 9/21 @ 2:20 PM, Room C2.3
6. Apache Spark, Apache Zeppelin and Data Science — Thurs, 9/21 @ 6:00 PM, Room C4.5
Visit the IBM Booth for More Information!