Big Data Processing with Spark and .NET - Microsoft Ignite 2019
3. Apache Spark is a fast, open-source analytics engine for big data and machine learning
Improves efficiency through:
• General computation graphs beyond map/reduce
• In-memory computing primitives
Allows developers to scale out their user code and write in their language of choice:
• Rich APIs in Java, Scala, Python, R, SparkSQL, etc.
• Batch processing, streaming, and interactive shell
Available on Azure via Azure Synapse, Azure Databricks, Azure HDInsight, and IaaS/Kubernetes
4. .NET Developers 💖 Apache Spark…
• A lot of big data-usable business logic (millions of lines of code) is written in .NET!
• Expensive and difficult to translate into Python/Scala/Java!
• Locked out from big data processing due to lack of .NET support in OSS big data solutions
• In a recently conducted .NET developer survey (> 1000 developers), more than 70% expressed interest in Apache Spark!
• Would like to tap into the OSS ecosystem for code libraries, support, and hiring
5. Goal: .NET for Apache Spark is aimed at providing .NET developers a first-class experience when working with Apache Spark.
Non-Goal: Converting existing Scala/Python/Java Spark developers.
6. We are developing it in the open!
Contributions to foundational OSS projects:
• Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284,
SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737,
ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887,
ARROW-5908, ARROW-6314, ARROW-6682
• Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to
Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance
.NET for Apache Spark is open source
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Version 0.6 released Oct 2019
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
7. .NET provides full-spectrum Spark support
• Spark DataFrames with SparkSQL: works with Spark v2.3.x/v2.4.x and includes ~300 SparkSQL functions
• Grouped Map
• Delta Lake
• .NET Spark UDFs
• Batch & streaming: including Spark Structured Streaming and all Spark-supported data sources
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1/v3.x, and includes C#/F# support
• Data Science: including access to ML.NET and an interactive notebook with C# REPL
• Speed & productivity: performance-optimized interop, as fast as or faster than PySpark; support for HW vectorization
https://github.com/dotnet/spark/examples
8. 0.6
DataStreamWriter.PartitionBy()
RelationalGroupedDataset.Mean(), Max(), Avg(), Min(), Agg(), Count()
SparkSession.*Session(), Range(), Conf()
UDF with Row as a parameter
Delta Lake’s DeltaTable
SparkSession.Catalog
UDF with Array.Map as a return type
UDF debugging
Vector & GroupedMap UDF
spark.yarn.archives support
Compatibility check for Microsoft.Spark.Worker
AssemblyLoader enhancement for loading UDFs
Resolver signer fix
Arrow & Pickling perf improvement
Arcade build infrastructure
TPC-H update with Arrow
DataStreamWriter.Trigger
ComplexTypes.MapType
Support for Spark 2.3.*, Spark 2.4.[1/2/4]
Worker binaries for MacOS
UDF with dependent types
DataFrameReader.Load()
Source link for NuGet package
SparkFile
.NET for Apache Spark
12. What is happening when you write .NET Spark code?
• A .NET program uses the DataFrame and SparkSQL APIs through .NET for Apache Spark, which builds a Spark operation tree.
• Did you define a .NET UDF?
• No: regular execution path (no .NET runtime during execution). Same speed as with Scala Spark.
• Yes: interop between Spark and .NET. Faster than with PySpark.
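The decision on this slide can be sketched as a small toy model. This is only an illustration of the idea, not how .NET for Apache Spark or Spark is implemented: the `Op` class, the `is_dotnet_udf` flag, and the function names are all hypothetical, and the sketch is in plain Python so it runs without a Spark installation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """One node in a toy operation tree (hypothetical, not a Spark class)."""
    name: str
    is_dotnet_udf: bool = False
    children: List["Op"] = field(default_factory=list)


def uses_dotnet_udf(op: Op) -> bool:
    """Return True if this node or any descendant is marked as a .NET UDF."""
    return op.is_dotnet_udf or any(uses_dotnet_udf(c) for c in op.children)


def execution_path(plan: Op) -> str:
    """Pick the path the slide describes for a given operation tree."""
    if uses_dotnet_udf(plan):
        return "interop"  # Spark <-> .NET interop is involved
    return "regular"      # no .NET runtime during execution


# A plan built only from DataFrame/SparkSQL-style operations.
plain = Op("select", children=[Op("filter", children=[Op("read")])])

# The same shape of plan with a (hypothetical) .NET UDF in the middle.
with_udf = Op(
    "select",
    children=[Op("my_udf", is_dotnet_udf=True, children=[Op("read")])],
)

print(execution_path(plain))
print(execution_path(with_udf))
```

The point of the model is that a single UDF anywhere in the tree is enough to bring the .NET runtime into execution, while a plan made only of built-in operations never needs it.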
13. Works everywhere!
• Cross platform: Windows, Ubuntu, macOS
• Cross cloud: Azure & AWS Databricks, AWS EMR Spark, Azure HDI Spark, Azure Synapse
• Installed out of the box on Azure HDI Spark and Azure Synapse
• Installation docs on GitHub
14. What’s next?
• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, Type Provider)
• Out-of-box experiences (Azure Synapse, Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
Go to https://github.com/dotnet/spark and let us know what is important to you!
15. Call to action: Engage, use & guide us!
Useful links:
• http://github.com/dotnet/spark
• https://www.nuget.org/packages/Microsoft.Spark
• https://aka.ms/GoDotNetForSpark
• https://docs.microsoft.com/dotnet/spark
Website:
• https://dot.net/spark (Request a Demo!)
Starter Videos .NET for Apache Spark 101:
• Watch on YouTube
• Watch on Channel 9
Available out-of-box on
Azure Synapse & Azure HDInsight Spark
Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark
You & .NET