SQL Server and Big Data 
Projects in the Real World 
Mark Kromer 
Pentaho Big Data Analytics Product Manager 
@mssqldude 
@kromerbigdata 
http://www.kromerbigdata.com
What we’ll (try to) cover today 
1. The Big Data Technology Landscape 
2. Big Data Analytics 
3. Three Big Data Analytics Scenarios: 
❯ Digital Marketing Analytics 
• Hadoop, Aster Data, SQL Server 
❯ Sentiment Analysis 
• MongoDB, SQL Server 
❯ Data Refinery 
• Hadoop, MPP, SQL Server, Pentaho 
4. SQL Server in the Big Data world
Big Data 101 
3 V’s 
❯ Volume – Terabyte records, transactions, tables, files 
❯ Velocity – Batch, near-time, real-time (analytics), streams 
❯ Variety – Structured, unstructured, semi-structured, and all of the above in a mix 
Text Processing 
❯ Techniques for processing and analyzing unstructured (and structured) LARGE files 
Analytics & Insights 
Distributed File System & Programming
• Batch Processing 
• Commodity Hardware 
• Data Locality, no shared storage 
• Scales linearly 
• Great for large text file processing, not so great on small files 
• Distributed programming paradigm 
Hadoop 1.x
Hadoop 1 vs Hadoop 2 
HADOOP 1.0 
MapReduce 
(cluster resource management 
& data processing) 
© Hortonworks Inc. 2014 
HDFS 
(redundant, reliable storage) 
HADOOP 2.0 
YARN 
MapReduce 
(data processing) 
Others 
(cluster resource management) 
HDFS2 
(redundant, highly-available & reliable storage) 
Single Use System 
Batch Apps 
Multi Purpose Platform 
Batch, Interactive, Online, Streaming, …
YARN: Taking Hadoop Beyond Batch 
Applications Run Natively in Hadoop 
YARN (Cluster Resource Management) 
HDFS2 (Redundant, Reliable Storage) 
BATCH 
(MapReduce) 
INTERACTIVE 
(Tez) 
STREAMING 
(Storm, S4, …) 
GRAPH 
(Giraph) 
IN-MEMORY 
(Spark) 
HPC MPI 
(OpenMPI) 
ONLINE 
(HBase) 
OTHER 
(Search) 
(Weave…) 
Store ALL DATA in one place… 
Interact with that data in MULTIPLE WAYS 
with Predictable Performance and Quality of Service
YARN Eco-system 
Applications Powered by YARN 
Apache Giraph – Graph Processing 
Apache Hama – BSP 
Apache Hadoop MapReduce – Batch 
Apache Tez – Batch/Interactive 
Apache S4 – Stream Processing 
Apache Samza – Stream Processing 
Apache Storm – Stream Processing 
Apache Spark – Iterative applications 
Elastic Search – Scalable Search 
Cloudera Llama – Impala on YARN 
DataTorrent – Data Analysis 
HOYA – HBase on YARN 
Frameworks Powered By YARN 
Apache Twill 
REEF by Microsoft 
Spring support for Hadoop 2
Apache Spark 
High-Speed In-Memory Analytics over Hadoop 
● Open Source 
● Alternative to MapReduce for certain applications 
● A low latency cluster computing system 
● For very large data sets 
● May be 100 times faster than MapReduce for 
– Iterative algorithms 
– Interactive data mining 
● Used with Hadoop / HDFS 
● Released under the Apache License 2.0 (early releases used a BSD license)
Popular Hadoop Distributions 
Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
Popular NoSQL Distributions 
Transactional-based, not analytics schemas
Popular MPP Distributions 
Big Data as distributed, scale-out, sharded data stores
Big Data Analytics Web Platform – Reference Architecture 1
Sentiment Analysis 
Reference Architecture 2 
MongoDB 
Hadoop 
PDW 
Big Data Platforms 
Social Media Sources 
Data Orchestration 
Data Mining 
OLAP Cubes 
Data Models 
Analytical Models 
OLAP Analytics Tools, Reporting Tools, Dashboards
Streamlined Data Refinery 
Reference Architecture 3 
Transactions – Batch & Real-time 
Enrollments & Redemptions 
Location, Email, Other Data 
Hadoop Cluster 
Analytics 
Reports 
Data Orchestration
Big Data Analytics
Big Data Analytics 
Core Tenets 
• Distributed Data (Data Locality) 
❯ HDFS / MapReduce 
❯ YARN / Tez 
❯ Replicated / Sharded Data 
• MPP Databases 
❯ Vertica, Aster, Microsoft, Greenplum … In-database analytics that can scale out with distributed processing across nodes 
• Distributed Analytics 
❯ SAS: “Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.” 
http://www.sas.com/resources/whitepaper/wp_46345.pdf 
• In-memory Analytics 
❯ Microsoft PowerPivot (Tabular models) 
❯ SAP HANA 
❯ Tableau
SQL on Hadoop 
Hortonworks and Cloudera DW Engine Approaches
SQL on Hadoop Landscape 
Gartner Research on SQL on Hadoop 
Not Quite Real Time 
Many vendors market their SQL interfaces to Hadoop as providing so-called "real-time access" to 
data stored in a Hadoop cluster … SQL on Hadoop provides a purely interactive data query and data 
manipulation experience – faster than batch, but not truly real time. In the case of Hadoop and the types 
of tasks it performs, we define interactive time frames as between 30 milliseconds and 10 minutes. 
If your usage truly needs real time, a different set of technologies and vendors may be required.
SQL on Hadoop 
Vendor Perspective: MapR 
Batch SQL 
Hive is used primarily for queries on very large data sets and large ETL jobs. The queries can take anywhere between a few minutes to several 
hours depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver 
interactive query capabilities in future. MapR ships and supports Apache Hive today. 
Interactive SQL 
Technologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on 
Hadoop-scale datasets. The response times vary between milliseconds to minutes depending on the query complexity. Users expect SQL-on-Hadoop 
technologies to support common BI tools such as Tableau and MicroStrategy (to name a couple) for reporting and ad-hoc queries. MapR 
supports customers using Impala on the MapR distribution of Hadoop today. Apache Drill will be available Q2 2014.
MapReduce Framework (Map) 
using Microsoft.Hadoop.MapReduce; 
using System.Text.RegularExpressions; 

public class TotalHitsForPageMap : MapperBase 
{ 
    // expected, pagePos and hit are not defined on the slide; they are assumed 
    // to be members declared elsewhere in the class. 
    public override void Map(string inputLine, MapperContext context) 
    { 
        context.Log(inputLine); 
        // Split the log line on runs of whitespace. 
        var parts = Regex.Split(inputLine, @"\s+"); 
        if (parts.Length != expected) // only take records with all values 
        { 
            return; 
        } 
        context.EmitKeyValue(parts[pagePos], hit); 
    } 
}
MapReduce Framework (Reduce & Job) 
using System; 
using System.Collections.Generic; 
using System.Linq; 
using Microsoft.Hadoop.MapReduce; 

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase 
{ 
    // Sum the hit values emitted by the mapper for one page. 
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context) 
    { 
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString()); 
    } 
} 

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner> 
{ 
    // Input and output locations are read from environment variables. 
    public override HadoopJobConfiguration Configure(ExecutorContext context) 
    { 
        var retVal = new HadoopJobConfiguration(); 
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT"); 
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT"); 
        retVal.DeleteOutputFolder = true; 
        return retVal; 
    } 
}
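The map, shuffle and reduce steps that the C# classes above hand to Hadoop can be sketched locally in plain Python. This is only an illustration of the data flow, not the Hadoop SDK: the field count (EXPECTED_FIELDS), the page position (PAGE_POS) and the sample log lines are assumptions, since the slide does not show those values.

```python
from collections import defaultdict

# Assumptions (not shown on the slide): a valid record has 6 whitespace-
# separated fields and the page URI is the 5th one.
EXPECTED_FIELDS = 6
PAGE_POS = 4


def map_line(line):
    """Map step: emit (page, 1) for a well-formed log line, else nothing."""
    parts = line.split()
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], 1)]


def reduce_hits(page, counts):
    """Reduce step: sum the hit counts emitted for one page."""
    return page, sum(counts)


def total_hits(lines):
    # Shuffle step: group mapper output by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for page, hit in map_line(line):
            grouped[page].append(hit)
    return dict(reduce_hits(page, counts) for page, counts in grouped.items())


# Hypothetical log lines, made up for this sketch.
sample_log = [
    "2013-05-07 12:11:00 10.0.0.1 GET /index.html 200",
    "2013-05-07 12:11:02 10.0.0.2 GET /about.html 200",
    "2013-05-07 12:11:05 10.0.0.1 GET /index.html 200",
    "malformed line",
]
hits = total_hits(sample_log)
print(hits)
```

On a cluster the grouping is done by the framework across nodes; here it is a single in-memory dictionary, which is enough to see why the reducer only ever receives the values for one key at a time.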
Get Data into Hadoop 
Hadoop file system shell commands to access data in HDFS 
Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv 
List files in HDFS: 
c:\Hadoop>hadoop fs -ls /import 
Found 1 items 
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv 
View file in HDFS: 
c:\Hadoop>hadoop fs -cat /import/sales.csv 
Kromer,123,5,55 
Smith,567,1,25 
Jones,123,9,99 
James,11,12,1 
Johnson,456,2,2.5 
Singh,456,1,3.25 
Yu,123,1,11 
Now, we can work on the data with MapReduce, Hive, Pig, etc.
Use Hive for Data Schema and Analysis 
create external table ext_sales 
( 
    lastname string, 
    productid int, 
    quantity int, 
    sales_amount float 
) 
row format delimited fields terminated by ',' 
stored as textfile 
location '/user/makromer/hiveext/input'; 

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
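Once the table exists, a typical analysis would be an aggregate such as SELECT productid, SUM(sales_amount) FROM ext_sales GROUP BY productid. As a rough sketch of what that query computes, the same grouping can be done in Python over the sales.csv rows listed earlier in the deck. The function names are illustrative only, and the column order is taken from the table definition above.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the sales.csv listing shown earlier in the deck.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""


def load_sales(text):
    """Apply the ext_sales column layout to comma-delimited text."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        lastname, productid, quantity, amount = record
        rows.append({
            "lastname": lastname,
            "productid": int(productid),
            "quantity": int(quantity),
            "sales_amount": float(amount),
        })
    return rows


def sales_by_product(rows):
    """Equivalent of GROUP BY productid with SUM(sales_amount)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["productid"]] += row["sales_amount"]
    return dict(totals)


totals = sales_by_product(load_sales(SALES_CSV))
print(totals)
```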
Sqoop 
Data transfer to & from Hadoop & SQL Server 
sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1 
> hadoop fs -cat /user/mark/customers/part-m-00000 
> 5,Bob Smith 
sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3 
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec) 
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Role of NoSQL in a Big Data Analytics Solution 
‣ Use NoSQL to store data quickly without the overhead of RDBMS 
‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few 
‣ Why NoSQL? 
‣ In the world of “Big Data” 
‣ “Schema later” 
‣ Ignore ACID properties 
‣ Drop data into key-value store quick & dirty 
‣ Worry about query & read later 
‣ Why NOT NoSQL? 
‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface 
‣ SQL Server and NoSQL 
‣ Not a natural fit 
‣ Use HDFS or your favorite NoSQL database 
‣ Consider turning off SQL Server locking mechanisms 
‣ Focus on writes, not reads (read uncommitted)
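The "schema later" idea in the list above can be sketched in a few lines of Python: the write path stores whatever arrives, and structure is applied only when the data is read. The in-memory list stands in for a key-value or document store, and the function and field names are made up for this sketch.

```python
import json

raw_store = []  # stand-in for a key-value / document store


def ingest(event):
    """Write path: serialize and append, with no schema validation."""
    raw_store.append(json.dumps(event))


def read_field(name, default=None):
    """Read path: the 'schema' is applied only when the data is queried."""
    return [json.loads(raw).get(name, default) for raw in raw_store]


ingest({"user": "a", "text": "great product"})
ingest({"user": "b", "rating": 4})  # different shape, still accepted
ratings = read_field("rating")
print(ratings)
```

The trade-off is the one the slide points at: writes stay cheap, while every reader has to cope with records that may lack the field it wants.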
MongoDB and Enterprise IT Stack 
Applications 
CRM, ERP, Collaboration, Mobile, BI 
Data Management 
Online Data Offline Data 
Hadoop EDW 
Management & Monitoring 
Security & Auditing 
RDBMS 
RDBMS 
Infrastructure 
OS & Virtualization, Compute, Storage, Network
General document per customer per account 
{ 
  _id: ObjectId("4e2e3f92268cdda473b628f6"), 
  sourceIDs: { 
    ABCSystemIDPart1: 8397897, 
    ABCSystemIDPart2: 2937430, 
    ABCSystemIDPart3: 932018 }, 
  accountType: "Checking", 
  accountOwners: [ 
    { firstName: "John", 
      lastName: "Smith", 
      contactMethods: [ 
        { type: "phone", subtype: "mobile", number: 8743927394 }, 
        { type: "mail", address: "58 3rd St.", city: … } ], 
      possibleMatchCriteria: { 
        govtID: 2938932432, fullName: "johnsmith", dob: … } }, 
    { firstName: "Anne", 
      maidenName: "Collins", 
      lastName: "Smith", … } ], 
  openDate: ISODate("2013-02-15 10:00"), 
  accountFeatures: { Overdraft: true, APR: 20, … } 
} 
OR creditCardNumber: 8392384938391293 
OR mortgageID: 2374389 
OR policyID: 18374923
Text Search Example 
(e.g. address typo so do fuzzy match) 
// Text search for address filtered by first name and NY 
> db.ticks.runCommand( 
    "text", 
    { search: "vanderbilt ave. vander bilt", 
      filter: { name: "Smith", 
                city: "New York" } })
Aggregate: Total Value of Accounts 
// Find total value of each customer’s accounts for a given RM (or Agent) sorted by value 
db.accts.aggregate( 
    { $match: { relationshipManager: "Smith" } }, 
    { $group: 
        { _id: "$ssn", 
          totalValue: { $sum: "$value" } } }, 
    { $sort: { totalValue: -1 } } )
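To make the three pipeline stages concrete, here is a rough Python equivalent that runs over a list of dictionaries instead of a MongoDB collection. The sample documents are hypothetical; only the field names (relationshipManager, ssn, value) come from the pipeline on the slide.

```python
from collections import defaultdict

# Hypothetical documents; field names follow the pipeline on the slide.
accts = [
    {"ssn": "111", "relationshipManager": "Smith", "value": 100.0},
    {"ssn": "222", "relationshipManager": "Smith", "value": 250.0},
    {"ssn": "111", "relationshipManager": "Smith", "value": 50.0},
    {"ssn": "333", "relationshipManager": "Jones", "value": 999.0},
]


def total_value_by_customer(docs, manager):
    totals = defaultdict(float)
    for doc in docs:
        if doc["relationshipManager"] == manager:  # $match
            totals[doc["ssn"]] += doc["value"]     # $group with $sum
    # $sort: { totalValue: -1 } means descending by total value
    return sorted(
        ({"_id": ssn, "totalValue": total} for ssn, total in totals.items()),
        key=lambda d: d["totalValue"],
        reverse=True,
    )


result = total_value_by_customer(accts, "Smith")
print(result)
```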
SQL Server Big Data – Data Loading 
Amazon HDFS & EMR Data Loading 
Amazon S3 Bucket
SQL Server Big Data Environment 
SQL Server Database 
❯ SQL 2012 Enterprise Edition 
❯ Page Compression 
❯ 2012 Columnar Compression on Fact Tables 
❯ Clustered Index on all tables 
❯ Auto-update statistics asynchronously 
❯ Partition Fact Tables by month and archive data with sliding window technique 
❯ Drop all indexes before nightly ETL load jobs 
❯ Rebuild all indexes when ETL completes 
SQL Server Analysis Services 
❯ SSAS 2012 Enterprise Edition 
❯ 2008 R2 OLAP cubes partition-aligned with DW 
❯ 2012 in-memory tabular cubes 
❯ All access through MSMDPUMP or SharePoint
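The sliding window technique mentioned for the fact tables can be pictured as a fixed-size queue of monthly partitions: each new month is added at one end and the oldest month is switched out for archiving at the other. The sketch below models only that bookkeeping in Python; it does not issue any SQL Server partition commands, and the function name and month labels are assumptions made for illustration.

```python
from collections import deque


def slide_window(partitions, new_month, window_size):
    """Add the newest monthly partition and return the months switched out.

    partitions: deque of 'YYYY-MM' labels, oldest first (modified in place).
    """
    partitions.append(new_month)
    archived = []
    while len(partitions) > window_size:
        archived.append(partitions.popleft())
    return archived


# Hypothetical three-month window.
active = deque(["2014-01", "2014-02", "2014-03"])
archived = slide_window(active, "2014-04", window_size=3)
print(list(active), archived)
```

Because the window size stays constant, the fact table holds a bounded number of partitions and the nightly index drop and rebuild work on a predictable amount of data.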
SQL Server Big Data Analytics Features 
Columnstore 
Sqoop adapter 
PolyBase 
Hive 
In-memory analytics 
Scale-out MPP 
SQL Server APS
Pentaho Big Data Analytics 
DBA | ETL/BI Developer | Business Users & Executives | Analysts & Data Scientists 
Pentaho Business Analytics 
Enterprise & Interactive Reporting | Interactive Analysis | Dashboards | Predictive Analytics 
DIRECT ACCESS 
Data Integration 
Instaview | Visual Map Reduce 
OPERATIONAL DATA | BIG DATA | PUBLIC/PRIVATE CLOUDS | DATA STREAM
Pentaho Big Data Analytics 
Accelerate the time to big data value 
• Full continuity from data access to decisions – complete data integration & analytics for any big data store 
• Faster development, faster runtime – visual development, distributed execution 
• Instant and interactive analysis – no coding and no ETL required
Product Components 
Pentaho Analyzer 
• Visual data exploration 
• Ad hoc analysis 
• Interactive charts & visualizations 
Pentaho Data Integration 
• Visual development for big data 
• Broad connectivity 
• Data quality & enrichment 
• Integrated scheduling 
• Security integration 
Pentaho Dashboards 
• Self-service dashboard builder 
• Content linking & drill through 
• Highly customized mash-ups 
Pentaho Data Mining & Predictive Analytics 
• Model construction & evaluation 
• Learning schemes 
• Integration with 3rd-party models using PMML 
Pentaho Enterprise & Interactive Reports 
• Both ad hoc & distributed reporting 
• Drag & drop interactive reporting 
• Pixel-perfect enterprise reports 
Pentaho for Big Data MapReduce & Instaview 
• Visual interface for developing MapReduce 
• Self-service big data discovery 
• Big data access for data analysts
Pentaho Interactive Analysis & Data Discovery 
Highly Flexible Advanced Visualizations 
❯ Simple, easy-to-use visual data exploration 
❯ Web-based thin client; in-memory caching 
❯ Rich library of interactive visualizations 
• Geo-mapping, heat grids, scatter plots, bubble charts, line over bar and more 
• Pluggable visualizations 
❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMSs 
❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached

More Related Content

What's hot

Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
Ā 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
Ā 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
Ā 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics HadoopMishika Bharadwaj
Ā 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
Ā 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core conceptsMaryan Faryna
Ā 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoopVishwajeet Jadeja
Ā 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
Ā 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
Ā 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
Ā 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
Ā 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
Ā 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
Ā 
Pentaho Analytics on MongoDB
Pentaho Analytics on MongoDBPentaho Analytics on MongoDB
Pentaho Analytics on MongoDBMark Kromer
Ā 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
Ā 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
Ā 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
Ā 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014Stratebi
Ā 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation17aroumougamh
Ā 

What's hot (20)

Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
Ā 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Ā 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
Ā 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
Ā 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
Ā 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
Ā 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
Ā 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
Ā 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
Ā 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
Ā 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
Ā 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
Ā 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Ā 
Pentaho Analytics on MongoDB
Pentaho Analytics on MongoDBPentaho Analytics on MongoDB
Pentaho Analytics on MongoDB
Ā 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
Ā 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
Ā 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
Ā 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
Ā 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Ā 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation
Ā 

Viewers also liked

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackAndrew Brust
Ā 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
Ā 
Big Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace ImagesBig Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace ImagesMark Kromer
Ā 
Pentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopPentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopMark Kromer
Ā 
ETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft AzureETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft AzureMark Kromer
Ā 
Azure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsAzure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsMark Kromer
Ā 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
Ā 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
Ā 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseHenk van der Valk
Ā 
MongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous DataMongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous DataMongoDB
Ā 
PSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL ServerPSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL ServerMark Kromer
Ā 
Microsoft Event Registration System Hosted on Windows Azure
Microsoft Event Registration System Hosted on Windows AzureMicrosoft Event Registration System Hosted on Windows Azure
Microsoft Event Registration System Hosted on Windows AzureMark Kromer
Ā 
What's new in SQL Server 2012 for philly code camp 2012.1
What's new in SQL Server 2012 for philly code camp 2012.1What's new in SQL Server 2012 for philly code camp 2012.1
What's new in SQL Server 2012 for philly code camp 2012.1Mark Kromer
Ā 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMark Kromer
Ā 
MEC Data sheet
MEC Data sheetMEC Data sheet
MEC Data sheetMark Kromer
Ā 
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerPhilly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerMark Kromer
Ā 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL ServerMark Kromer
Ā 
Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data SolutionsMark Kromer
Ā 
SQL Saturday Paris 2015 - Polybase
SQL Saturday Paris 2015 - PolybaseSQL Saturday Paris 2015 - Polybase
SQL Saturday Paris 2015 - PolybaseRomain Casteres
Ā 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache HadoopSuman Saurabh
Ā 

Viewers also liked (20)

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Ā 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
Ā 
Big Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace ImagesBig Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace Images
Ā 
Pentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopPentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and Hadoop
Ā 
ETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft AzureETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft Azure
Ā 
Azure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsAzure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analytics
Ā 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
Ā 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Ā 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
Ā 
MongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous DataMongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous Data
Ā 
PSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL ServerPSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL Server
Ā 
Microsoft Event Registration System Hosted on Windows Azure
Microsoft Event Registration System Hosted on Windows AzureMicrosoft Event Registration System Hosted on Windows Azure
Microsoft Event Registration System Hosted on Windows Azure
Ā 
What's new in SQL Server 2012 for philly code camp 2012.1
What's new in SQL Server 2012 for philly code camp 2012.1What's new in SQL Server 2012 for philly code camp 2012.1
What's new in SQL Server 2012 for philly code camp 2012.1
Ā 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Ā 
MEC Data sheet
MEC Data sheetMEC Data sheet
MEC Data sheet
Ā 
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerPhilly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Ā 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
Ā 
Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data Solutions
Ā 
SQL Saturday Paris 2015 - Polybase
SQL Saturday Paris 2015 - PolybaseSQL Saturday Paris 2015 - Polybase
SQL Saturday Paris 2015 - Polybase
Ā 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
Ā 

Similar to Big Data Analytics with Hadoop, MongoDB and SQL Server

Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
Ā 
Ų¹ŲµŲ± Ś©Ł„Ų§Ł† ŲÆŲ§ŲÆŁ‡ŲŒ Ś†Ų±Ų§ Łˆ Ś†ŚÆŁˆŁ†Ł‡ŲŸ
Ų¹ŲµŲ± Ś©Ł„Ų§Ł† ŲÆŲ§ŲÆŁ‡ŲŒ Ś†Ų±Ų§ Łˆ Ś†ŚÆŁˆŁ†Ł‡ŲŸŲ¹ŲµŲ± Ś©Ł„Ų§Ł† ŲÆŲ§ŲÆŁ‡ŲŒ Ś†Ų±Ų§ Łˆ Ś†ŚÆŁˆŁ†Ł‡ŲŸ
Ų¹ŲµŲ± Ś©Ł„Ų§Ł† ŲÆŲ§ŲÆŁ‡ŲŒ Ś†Ų±Ų§ Łˆ Ś†ŚÆŁˆŁ†Ł‡ŲŸdatastack
Ā 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
Ā 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
Ā 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
Ā 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_pptjerrin joseph
Ā 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big dataSATOSHI TAGOMORI
Ā 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
Ā 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
Ā 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
Ā 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
Ā 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
Ā 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
Ā 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a NutshellAnthony Thomas
Ā 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
Ā 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
Ā 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data ConceptsAhmed Salman
Ā 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
Ā 

Big Data Analytics with Hadoop, MongoDB and SQL Server

  • 1. SQL Server and Big Data Projects in the Real World Mark Kromer Pentaho Big Data Analytics Product Manager @mssqldude @kromerbigdata http://www.kromerbigdata.com
  • 2. What we’ll (try to) cover today 1. The Big Data Technology Landscape 2. Big Data Analytics 3. Three Big Data Analytics Scenarios: ❯ Digital Marketing Analytics • Hadoop, Aster Data, SQL Server ❯ Sentiment Analysis • MongoDB, SQL Server ❯ Data Refinery • Hadoop, MPP, SQL Server, Pentaho 4. SQL Server in the Big Data world
  • 3. Big Data 101 3 V’s ❯ Volume – Terabyte records, transactions, tables, files ❯ Velocity – Batch, near-time, real-time (analytics), streams ❯ Variety – Structured, unstructured, semi-structured, and all the above in a mix Text Processing ❯ Techniques for processing and analyzing unstructured (and structured) LARGE files Analytics & Insights Distributed File System & Programming
  • 4. Hadoop 1.x • Batch Processing • Commodity Hardware • Data Locality, no shared storage • Scales linearly • Great for large text file processing, not so great on small files • Distributed programming paradigm
  • 5. Hadoop 1 vs Hadoop 2 © Hortonworks Inc. 2014 HADOOP 1.0: MapReduce (cluster resource management & data processing), HDFS (redundant, reliable storage); Single Use System: Batch Apps. HADOOP 2.0: MapReduce and Others (data processing), YARN (cluster resource management), HDFS2 (redundant, highly-available & reliable storage); Multi Purpose Platform: Batch, Interactive, Online, Streaming, …
  • 6. YARN: Taking Hadoop Beyond Batch © Hortonworks Inc. 2014 Applications Run Natively in Hadoop YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, S4, …) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) ONLINE (HBase) OTHER (Search) (Weave…) Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
  • 7. YARN Eco-system © Hortonworks Inc. 2014 Applications Powered by YARN Apache Giraph – Graph Processing Apache Hama – BSP Apache Hadoop MapReduce – Batch Apache Tez – Batch/Interactive Apache S4 – Stream Processing Apache Samza – Stream Processing Apache Storm – Stream Processing Apache Spark – Iterative applications Elastic Search – Scalable Search Cloudera Llama – Impala on YARN DataTorrent – Data Analysis HOYA – HBase on YARN Frameworks Powered By YARN Apache Twill REEF by Microsoft Spring support for Hadoop 2
  • 8. Apache Spark High-Speed In-Memory Analytics over Hadoop ● Open Source ● Alternative to MapReduce for certain applications ● A low latency cluster computing system ● For very large data sets ● May be 100 times faster than MapReduce for – Iterative algorithms – Interactive data mining ● Used with Hadoop / HDFS ● Originally released under the BSD License; Apache License 2.0 since joining the Apache Software Foundation
  • 9. Popular Hadoop Distributions Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
  • 10. Popular NoSQL Distributions Transactional-based, not analytics schemas
  • 11. Popular MPP Distributions Big Data as distributed, scale-out, sharded data stores
  • 12. Big Data Analytics Web Platform – Reference Architecture 1
  • 13. Sentiment Analysis Reference Architecture 2 MongoDB Hadoop PDW Big Data Platforms Social Media Sources Data Orchestration Data Mining OLAP Cubes Data Models Analytical Models OLAP Analytics Tools, Reporting Tools, Dashboards
  • 14. Streamlined Data Refinery Reference Architecture 3 Transactions – Batch & Real-time Enrollments & Redemptions Location, Email, Other Data Hadoop Cluster Analytics Reports Data Orchestration
  • 16. Big Data Analytics Core Tenets • Distributed Data (Data Locality) ❯ HDFS / MapReduce ❯ YARN / Tez ❯ Replicated / Sharded Data • MPP Databases ❯ Vertica, Aster, Microsoft, Greenplum … In-database analytics that can scale out with distributed processing across nodes • Distributed Analytics ❯ SAS: “Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.” http://www.sas.com/resources/whitepaper/wp_46345.pdf • In-memory Analytics ❯ Microsoft PowerPivot (Tabular models) ❯ SAP HANA ❯ Tableau
  • 17. SQL on Hadoop Hortonworks and Cloudera DW Engine Approaches
  • 18. SQL on Hadoop Landscape Gartner Research on SQL on Hadoop: Not Quite Real Time. Many vendors market their SQL interfaces to Hadoop as providing so-called "real-time access" to data stored in a Hadoop cluster … SQL on Hadoop provides a purely interactive data query and data manipulation experience: faster than batch, but not truly real time. In the case of Hadoop and the types of tasks it performs, we define interactive time frames as between 30 milliseconds and 10 minutes. If your usage truly needs real time, a different set of technologies and vendors may be required.
  • 19. SQL on Hadoop Vendor Perspective: MapR Batch SQL: Hive is used primarily for queries on very large data sets and large ETL jobs. The queries can take anywhere from a few minutes to several hours depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver interactive query capabilities in the future. MapR ships and supports Apache Hive today. Interactive SQL: Technologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on Hadoop-scale datasets. The response times vary from milliseconds to minutes depending on the query complexity. Users expect SQL-on-Hadoop technologies to support common BI tools such as Tableau and MicroStrategy (to name a couple) for reporting and ad-hoc queries. MapR supports customers using Impala on the MapR distribution of Hadoop today. Apache Drill will be available Q2 2014.
  • 20. MapReduce Framework (Map) using Microsoft.Hadoop.MapReduce; using System.Text.RegularExpressions; public class TotalHitsForPageMap : MapperBase { public override void Map(string inputLine, MapperContext context) { context.Log(inputLine); var parts = Regex.Split(inputLine, @"\s+"); if (parts.Length != expected) /* only take records with all values */ { return; } context.EmitKeyValue(parts[pagePos], hit); } }
  • 21. MapReduce Framework (Reduce & Job) using System; using System.Collections.Generic; using System.Linq; using Microsoft.Hadoop.MapReduce; public class TotalHitsForPageReducerCombiner : ReducerCombinerBase { public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context) { context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString()); } } public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner> { public override HadoopJobConfiguration Configure(ExecutorContext context) { var retVal = new HadoopJobConfiguration(); retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT"); retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT"); retVal.DeleteOutputFolder = true; return retVal; } }
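The map and reduce steps on slides 20 and 21 can be sketched in plain Python. This is a rough single-process illustration of the pattern, not the Microsoft .NET SDK API. The constants `EXPECTED_FIELDS` and `PAGE_POS` are assumptions standing in for the slide's undefined `expected` and `pagePos`, and the three-field log layout is hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical layout: the slide does not define `expected` or `pagePos`.
EXPECTED_FIELDS = 3  # assumed number of whitespace-separated fields per line
PAGE_POS = 1         # assumed index of the page URL within a line


def map_line(line):
    """Emit (page, 1) for well-formed lines and skip the rest, as the mapper does."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], 1)]


def reduce_hits(pairs):
    """Sum the emitted counts per key, like the reducer/combiner."""
    totals = defaultdict(int)
    for page, hits in pairs:
        totals[page] += hits
    return dict(totals)


def total_hits(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = [kv for line in lines for kv in map_line(line)]
    return reduce_hits(pairs)
```

Because summing is associative, the same reduce function can also act as a combiner on each node's partial output, which is presumably why the slide uses a single `ReducerCombinerBase` class for both roles.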
  • 22. Get Data into Hadoop Linux-style shell commands to access data in HDFS Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv List files in HDFS: c:\Hadoop>hadoop fs -ls /import Found 1 items -rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv View file in HDFS: c:\Hadoop>hadoop fs -cat /import/sales.csv Kromer,123,5,55 Smith,567,1,25 Jones,123,9,99 James,11,12,1 Johnson,456,2,2.5 Singh,456,1,3.25 Yu,123,1,11 Now, we can work on the data with MapReduce, Hive, Pig, etc.
  • 23. Use Hive for Data Schema and Analysis create external table ext_sales ( lastname string, productid int, quantity int, sales_amount float ) row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input'; LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
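As a rough illustration of what the `ext_sales` schema buys you, the sketch below applies the same column names and types to the sample rows from slide 22 and totals sales per product. It approximates a query such as `SELECT productid, SUM(sales_amount) FROM ext_sales GROUP BY productid`; that query is an assumed example and does not appear on the slides, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the sales.csv listing on slide 22.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""


def load_sales(text):
    """Apply the ext_sales column names and types to each comma-delimited row."""
    rows = []
    for lastname, productid, quantity, sales_amount in csv.reader(io.StringIO(text)):
        rows.append({
            "lastname": lastname,
            "productid": int(productid),
            "quantity": int(quantity),
            "sales_amount": float(sales_amount),
        })
    return rows


def sales_by_product(rows):
    """Group by productid and sum sales_amount."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["productid"]] += row["sales_amount"]
    return dict(totals)
```

Hive applies the schema when the file is read rather than when it is written, so the raw CSV in HDFS stays untouched; the Python version mirrors that by parsing the text only at query time.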
  • 24. Sqoop Data transfer to & from Hadoop & SQL Server sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1 > hadoop fs -cat /user/mark/customers/part-m-00000 > 5,Bob Smith sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec) 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
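When Sqoop jobs are scripted, it can help to build the argument list in code so that every long option keeps its double dash. The helper below is a minimal sketch, assuming the same options the slide uses; `sqoop_command` is a hypothetical name, and the function only assembles the command without running it.

```python
import shlex


def sqoop_command(tool, connect, username, password, table,
                  mappers=1, export_dir=None):
    """Build the argument list for a `sqoop import` or `sqoop export` call."""
    if tool not in ("import", "export"):
        raise ValueError("tool must be 'import' or 'export'")
    args = [
        "sqoop", tool,
        "--connect", connect,
        "--username", username,
        "--password", password,
        "--table", table,
        "-m", str(mappers),  # number of parallel map tasks
    ]
    if tool == "export":
        if export_dir is None:
            raise ValueError("export needs an HDFS --export-dir")
        args += ["--export-dir", export_dir]
    return args


if __name__ == "__main__":
    cmd = sqoop_command("export", "jdbc:sqlserver://localhost", "sqoop",
                        "password", "customers",
                        export_dir="/user/mark/data/employees3")
    print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting problems.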
  • 25. Role of NoSQL in a Big Data Analytics Solution ‣ Use NoSQL to store data quickly without the overhead of RDBMS ‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few ‣ Why NoSQL? ‣ In the world of “Big Data” ‣ “Schema later” ‣ Ignore ACID properties ‣ Drop data into key-value store quick & dirty ‣ Worry about query & read later ‣ Why NOT NoSQL? ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface ‣ SQL Server and NoSQL ‣ Not a natural fit ‣ Use HDFS or your favorite NoSQL database ‣ Consider turning off SQL Server locking mechanisms ‣ Focus on writes, not reads (read uncommitted)
  • 26. MongoDB and Enterprise IT Stack Applications CRM, ERP, Collaboration, Mobile, BI Data Management Online Data Offline Data Hadoop EDW Management & Monitoring Security & Auditing RDBMS RDBMS Infrastructure OS & Virtualization, Compute, Storage, Network
  • 27. General document per customer per account { _id: ObjectId("4e2e3f92268cdda473b628f6"), sourceIDs: { ABCSystemIDPart1: 8397897, ABCSystemIDPart2: 2937430, ABCSystemIDPart3: 932018 }, accountType: "Checking", accountOwners: [ { firstName: "John", lastName: "Smith", contactMethods: [ { type: "phone", subtype: "mobile", number: 8743927394 }, { type: "mail", address: "58 3rd St.", city: … } ], possibleMatchCriteria: { govtID: 2938932432, fullName: "johnsmith", dob: … } }, { firstName: "Anne", maidenName: "Collins", lastName: "Smith", … } ], openDate: ISODate("2013-02-15 10:00"), accountFeatures: { Overdraft: true, APR: 20, … } } OR creditCardNumber: 8392384938391293 OR mortgageID: 2374389 OR policyID: 18374923
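To show how an application might walk a nested document like this one, here is a small Python sketch. The dictionary copies only the fields that the slide spells out in full (elided values are left out), `_id` is kept as a plain string instead of an `ObjectId`, and `contact_numbers` is a hypothetical helper.

```python
# Subset of the slide's document; fields elided with "…" on the slide are omitted.
account = {
    "_id": "4e2e3f92268cdda473b628f6",  # an ObjectId on the slide, a string here
    "sourceIDs": {
        "ABCSystemIDPart1": 8397897,
        "ABCSystemIDPart2": 2937430,
        "ABCSystemIDPart3": 932018,
    },
    "accountType": "Checking",
    "accountOwners": [
        {
            "firstName": "John",
            "lastName": "Smith",
            "contactMethods": [
                {"type": "phone", "subtype": "mobile", "number": 8743927394},
                {"type": "mail", "address": "58 3rd St."},
            ],
        },
        {"firstName": "Anne", "maidenName": "Collins", "lastName": "Smith"},
    ],
}


def contact_numbers(doc, subtype="mobile"):
    """Collect phone numbers of the given subtype across all account owners."""
    numbers = []
    for owner in doc.get("accountOwners", []):
        # Owners without contactMethods are valid: the schema is flexible.
        for method in owner.get("contactMethods", []):
            if method.get("type") == "phone" and method.get("subtype") == subtype:
                numbers.append(method["number"])
    return numbers
```

The `.get(..., [])` defaults reflect the "schema later" trade-off: readers of the data have to tolerate missing fields, because nothing enforces them at write time.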
  • 28. Text Search Example (e.g. address typo so do fuzzy match) // Text search for address filtered by first name and NY > db.ticks.runCommand( "text", { search: "vanderbilt ave. vander bilt", filter: { name: "Smith", city: "New York" } })
  • 29. Aggregate: Total Value of Accounts // Find total value of each customer’s accounts for a given RM (or Agent) sorted by value db.accts.aggregate( { $match: { relationshipManager: "Smith" } }, { $group: { _id: "$ssn", totalValue: { $sum: "$value" } } }, { $sort: { totalValue: -1 } } )
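The three pipeline stages on this slide (`$match`, `$group`, `$sort`) can be traced in plain Python to see what each one contributes. This is an in-memory sketch of the logic only, not a MongoDB client call; the function name is hypothetical, and any account records passed to it are assumed to carry the `relationshipManager`, `ssn` and `value` fields the slide's query refers to.

```python
from collections import defaultdict


def total_value_by_customer(accounts, relationship_manager):
    """Mimic $match -> $group/$sum -> $sort over a list of account dicts."""
    totals = defaultdict(float)
    for acct in accounts:
        # $match: keep only accounts handled by the given relationship manager
        if acct.get("relationshipManager") == relationship_manager:
            # $group with $sum: accumulate value per customer (keyed by ssn)
            totals[acct["ssn"]] += acct["value"]
    results = [{"_id": ssn, "totalValue": total} for ssn, total in totals.items()]
    # $sort: { totalValue: -1 } means descending order
    results.sort(key=lambda r: r["totalValue"], reverse=True)
    return results
```

Placing the filter first matters in the real pipeline too: matching before grouping lets the server discard irrelevant documents before it does any aggregation work.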
  • 30. SQL Server Big Data – Data Loading Amazon HDFS & EMR Data Loading Amazon S3 Bucket
  • 31. SQL Server Big Data Environment SQL Server Database ❯ SQL 2012 Enterprise Edition ❯ Page Compression ❯ 2012 Columnar Compression on Fact Tables ❯ Clustered Index on all tables ❯ Auto-update Stats Async ❯ Partition Fact Tables by month and archive data with sliding window technique ❯ Drop all indexes before nightly ETL load jobs ❯ Rebuild all indexes when ETL completes SQL Server Analysis Services ❯ SSAS 2012 Enterprise Edition ❯ 2008 R2 OLAP cubes partition-aligned with DW ❯ 2012 cubes in-memory tabular cubes ❯ All access through MSMDPUMP or SharePoint
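The sliding window technique mentioned for the fact tables comes down to simple date arithmetic: keep a fixed number of monthly boundaries, and each month retire the oldest while adding the next. The sketch below models only that bookkeeping, with hypothetical function names. In SQL Server the corresponding operations would typically be a partition switch-out for archiving plus merge and split of the partition function's range boundaries, which are not shown here.

```python
from datetime import date


def next_month(d):
    """First day of the month following d."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)


def monthly_boundaries(first_month, count):
    """Return `count` consecutive month-start boundaries beginning at first_month."""
    if count < 1:
        raise ValueError("count must be at least 1")
    boundaries = [date(first_month.year, first_month.month, 1)]
    while len(boundaries) < count:
        boundaries.append(next_month(boundaries[-1]))
    return boundaries


def slide_window(boundaries):
    """Drop the oldest month for archiving and append the next month.

    Returns (archived_boundary, new_boundaries); the window size is unchanged.
    """
    oldest, kept = boundaries[0], boundaries[1:]
    return oldest, kept + [next_month(boundaries[-1])]
```

Keeping the window a constant size is what makes the nightly index drop and rebuild predictable: the amount of active data the ETL touches stays roughly the same from month to month.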
  • 32. SQL Server Big Data Analytics Features Columnstore Sqoop adapter PolyBase Hive In-memory analytics Scale-out MPP SQL Server APS
  • 33. Pentaho Big Data Analytics DBA ETL/BI Developer Business Users & Executives Analysts & Data Scientists Enterprise & Interactive Reporting Pentaho Business Analytics Interactive Analysis Dashboards Predictive Analytics DIRECT ACCESS Data Integration Instaview | Visual Map Reduce OPERATIONAL DATA BIG DATA PUBLIC/PRIVATE CLOUDS DATA STREAM
  • 34. Pentaho Big Data Analytics Accelerate the time to big data value • Full continuity from data access to decisions – complete data integration & analytics for any big data store • Faster development, faster runtime – visual development, distributed execution • Instant and interactive analysis – no coding and no ETL required
  • 35. Product Components Pentaho Analyzer • Visual data exploration • Ad hoc analysis • Interactive charts & visualizations Pentaho Data Integration • Visual development for big data • Broad connectivity • Data quality & enrichment • Integrated scheduling • Security integration Pentaho Dashboards • Self-service dashboard builder • Content linking & drill through • Highly customized mash-ups Pentaho Data Mining & Predictive Analytics • Model construction & evaluation • Learning schemes • Integration with 3rd party models using PMML Pentaho Enterprise & Interactive Reports • Both ad hoc & distributed reporting • Drag & drop interactive reporting • Pixel-perfect enterprise reports Pentaho for Big Data MapReduce & Instaview • Visual Interface for Developing MR • Self-service big data discovery • Big data access to Data Analysts
  • 36. Pentaho Interactive Analysis & Data Discovery Highly Flexible Advanced Visualizations ❯ Simple, easy-to-use visual data exploration ❯ Web-based thin client; in-memory caching ❯ Rich library of interactive visualizations • Geo-mapping, heat grids, scatter plots, bubble charts, line over bar and more • Pluggable visualizations ❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMSs ❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached