1. Michael Rys
Principal Program Manager, Big Data @ Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com
U-SQL Federated Distributed Queries
3. Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
• Filters
• Joins
[Diagram: U-SQL queries issued from Azure Data Lake Analytics against Azure Data Lake Storage, Azure Storage Blobs, Azure SQL DB, Azure SQL Data Warehouse, and Azure SQL in VMs]
4. Federated queries
• Minimize data proliferation through data consolidation
• Same U-SQL over all Azure data (WASB, SQL Azure)
• Efficient and reliable execution strategies
• Striving to maintain semantic equivalence
• Design choices based on requirements:
• Schema-less design
• fast time-to-query and exploratory analysis
• Schematized design
• protect applications from data source changes
• Advanced federated query capabilities:
• Built-in decisions to optimize for performance
• push downs of joins, predicates, projection
• Control when and what to push down
• Prevent data source overload
• Provide control over semantics
5. Data sources and external tables
• Secure credential management
• Data sources to manage connections and remoting of queries
• Schematized design: external tables to provide early bound tables for federated queries
Create secret in PowerShell
New-AzureRMDataLakeAnalyticsCatalogSecret
Create credential
CREATE CREDENTIAL Secret
WITH USER_NAME = "user@server", IDENTITY = "Secret";
Create external data source on
• Azure SQL DB
• Azure SQL DW
• SQL Server in Azure VM
CREATE DATA SOURCE SQL_PATIENTS FROM SQLSERVER WITH
( PROVIDER_STRING =
"Database=DB;Trusted_Connection=False;Encrypt=False"
, CREDENTIAL = Secret
, REMOTABLE_TYPES = (bool, byte, short, string, DateTime)
);
External tables (optional)
CREATE EXTERNAL TABLE sql_patients (
[custkey] int,
[name] string,
[address] string
) FROM SQL_PATIENTS LOCATION "dbo.patients";
6. Federated queries
• Queries have to be in a different script from data source
• Pass-through queries to execute remote language
• Schema-less design: query data source location
• Schematized design: query external tables
• Semantics of federated queries close to U-SQL and C#
Pass-Through Query
@alive_patients =
SELECT *
FROM EXTERNAL SQL_PATIENTS EXECUTE @"
SELECT name
, CASE WHEN is_alive = 1
THEN 'Alive' ELSE 'Deceased' END AS status
, address, nationkey, phone
FROM dbo.patients";
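A pass-through result is an ordinary U-SQL rowset once it comes back, so it can be processed further with U-SQL expressions. The following is a minimal sketch under that assumption: it reuses the SQL_PATIENTS data source from the previous slide and the remote columns named above, while the output path is hypothetical and chosen only for illustration.

```
// Sketch only: run in a separate script from the one that created SQL_PATIENTS.
@alive_patients =
    SELECT *
    FROM EXTERNAL SQL_PATIENTS EXECUTE @"
        SELECT name
             , CASE WHEN is_alive = 1
                    THEN 'Alive' ELSE 'Deceased' END AS status
        FROM dbo.patients";

// Aggregate the returned rows locally with U-SQL.
@status_counts =
    SELECT status,
           COUNT(*) AS patient_count
    FROM @alive_patients
    GROUP BY status;

OUTPUT @status_counts
TO "/output/patient_status.csv"   // hypothetical output path
USING Outputters.Csv(outputHeader: true);
```

The T-SQL inside EXECUTE runs on the remote source as written; only the GROUP BY step in this sketch is expressed in U-SQL.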
Query Data Source Location
@patients = SELECT *
FROM EXTERNAL master.SQL_PATIENTS LOCATION "dbo.patients";
Query External Tables
@patients = SELECT * FROM EXTERNAL master.dbo.sql_patients;
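Because an external table behaves like a local, early-bound table, its rows can be joined with data already stored in the U-SQL catalog. The sketch below assumes the sql_patients external table defined on the previous slide; the local table master.dbo.visits and its columns custkey and visit_date are hypothetical names introduced only for this example, as is the output path.

```
// Sketch only: master.dbo.visits is a hypothetical local U-SQL table
// with columns (custkey int, visit_date DateTime).
@patients =
    SELECT * FROM EXTERNAL master.dbo.sql_patients;

@visits =
    SELECT custkey, visit_date
    FROM master.dbo.visits;

// Join remote patient rows with local visit rows on the shared key.
@patient_visits =
    SELECT p.name,
           p.address,
           v.visit_date
    FROM @patients AS p
         INNER JOIN @visits AS v
         ON p.custkey == v.custkey;

OUTPUT @patient_visits
TO "/output/patient_visits.csv"   // hypothetical output path
USING Outputters.Csv(outputHeader: true);
```

Note that the join condition uses the C# equality operator ==, in line with the U-SQL and C# semantics of federated queries.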
Execution
• U-SQL Semantics
• Pushes predicates and even joins based on remotable types
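To make the push-down point concrete, here is a minimal sketch of a query whose predicate is a candidate for remote evaluation. It assumes the sql_patients external table and the REMOTABLE_TYPES list from the earlier slides; the literal value and the output path are made up for illustration. Whether the predicate is actually pushed is decided by the optimizer, so treat this as an illustration of an eligible shape rather than a guarantee.

```
// Sketch only: assumes the external table sql_patients
// (columns custkey, name, address) exists in master.dbo.
@patients =
    SELECT * FROM EXTERNAL master.dbo.sql_patients;

// name is a string, and string appears in the data source's
// REMOTABLE_TYPES list, so this comparison is eligible to be
// evaluated on the remote SQL source instead of locally.
@matching =
    SELECT name, address
    FROM @patients
    WHERE name == "Smith";   // illustrative value

OUTPUT @matching
TO "/output/matching_patients.csv"   // hypothetical output path
USING Outputters.Csv();
```

A filter on custkey would be a different case under the example data source, because int is not in the REMOTABLE_TYPES list shown there.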
DATA SOURCE: Represents a remote data source such as Azure SQL Database. You have to specify all the details (connection string, credentials, etc.) required to connect to it and issue queries.
EXTERNAL TABLE: A local table, with columns defined in C# types, that redirects queries issued against it to the remote table it is based on. U-SQL does the type conversion automatically. External tables let you impose a specific schema on the remote data, shielding you from remote schema changes. You can issue queries that join external and local tables.
PASS-THROUGH queries: These queries are issued directly against the remote data source in the syntax of the remote data source (for example, T-SQL for Azure SQL Database).
REMOTABLE_TYPES: For every external data source you have to specify the list of ‘remotable types’. This list constrains the types of queries that will be remoted. Ex: REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal);
LAZY METADATA LOADING: Here the remote data is schematized only when the query is actually issued to the remote data source. Your program must be able to deal with remote schema changes.