This presentation on PySpark will help you understand what PySpark is, the features of PySpark, and how Spark with Python compares to Spark with Scala. Then you will learn the various PySpark contents - SparkConf, SparkContext, SparkFiles, RDD, StorageLevel, DataFrames, Broadcast, and Accumulator - and get an idea of the various subpackages in PySpark. Finally, you will look at a demo that uses PySpark SQL to analyze Walmart stock data. Now, let's dive into learning PySpark in detail.
1. What is PySpark?
2. PySpark Features
3. PySpark with Python and Scala
4. PySpark Contents
5. PySpark Subpackages
6. Companies using PySpark
7. Demo using PySpark
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course, you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
PySpark Tutorial | PySpark Tutorial For Beginners | Apache Spark With Python Tutorial | Simplilearn
2. What’s in it for you?
1. What is PySpark?
2. PySpark Features
3. PySpark with Python and Scala
4. PySpark Contents
5. PySpark Subpackages
6. Companies using PySpark
7. Demo using PySpark
11. Spark with Python and Scala

Criteria: Performance
Python: Python is slower than Scala when used with Spark.
Scala: Spark is written in Scala, so it integrates well with Spark and is faster than Python.

Criteria: Learning curve
Python: Python has a simple syntax and, being a high-level language, is easy to learn.
Scala: Scala has a complex syntax, hence it is not as easy to learn.

Criteria: Code readability
Python: Readability, maintenance, and familiarity of code are better in the Python API.
Scala: Scala is a sophisticated language; developers need to pay close attention to the readability of their code.

Criteria: Data science libraries
Python: Python provides a rich set of libraries for data visualization and model building.
Scala: Scala lacks data science libraries and tools for data visualization.
17. PySpark – SparkConf

SparkConf provides the configuration for running a Spark application.

The following code block shows the signature of the SparkConf class for PySpark:

class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)

Following are some of the most commonly used attributes of SparkConf:

set(key, value) – to set a configuration property
setMaster(value) – to set the master URL
setAppName(value) – to set an application name
get(key, defaultValue=None) – to get the configuration value of a key
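As an illustration, here is a minimal sketch that builds a SparkConf with these attributes and reads a value back; the app name, master URL, and spark.executor.memory setting are placeholder values, not part of the original slides.

from pyspark import SparkConf, SparkContext

# Build a configuration: the setters are chainable and return the same SparkConf
conf = SparkConf() \
    .setAppName("ConfDemo") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "1g")

# get() returns the stored value, or the default if the key is unset
print(conf.get("spark.executor.memory", "512m"))

# Pass the configuration when creating the SparkContext
sc = SparkContext(conf=conf)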
20. PySpark – SparkContext

SparkContext is the main entry point in any Spark program.

[Data flow diagram: in both local and cluster mode, the Python driver communicates with the JVM-based SparkContext over a local socket via Py4J; the SparkContext dispatches work to the Spark Workers, and each worker JVM launches Python subprocesses, exchanging data with them through pipes. In local mode, data is read from the local file system.]
21. PySpark – SparkContext

The code below shows the signature of the PySpark SparkContext class and the parameters it can take:

class pyspark.SparkContext(
    master = None,
    appName = None,
    sparkHome = None,
    pyFiles = None,
    environment = None,
    batchSize = 0,
    serializer = PickleSerializer(),
    conf = None,
    gateway = None,
    jsc = None,
    profiler_cls = <class 'pyspark.profiler.BasicProfiler'>
)
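For a concrete illustration, here is a minimal sketch of creating a SparkContext with the master and appName parameters; the values chosen are placeholders.

from pyspark import SparkContext

# "local" runs Spark in-process on a single thread
sc = SparkContext(master="local", appName="FirstApp")

# Use the context to distribute a small collection and run a job
total = sc.parallelize([1, 2, 3, 4]).sum()
print(total)  # 10

# Stop the context; only one SparkContext may be active at a time
sc.stop()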
25. PySpark – SparkFiles

SparkFiles allows you to upload your files using sc.addFile and get their path on a worker using SparkFiles.get.

SparkFiles contains the following classmethods:

get(filename)
getRootDirectory()

getRootDirectory() returns the path to the root directory, which contains the files added through SparkContext.addFile().

from pyspark import SparkContext
from pyspark import SparkFiles

finddistance = "/home/Hadoop/examples/finddistance.R"
finddistancename = "finddistance.R"

sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)
print("Absolute path -> %s" % SparkFiles.get(finddistancename))
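As a companion sketch for getRootDirectory(), assuming a local file /tmp/config.txt exists (the path is hypothetical):

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFiles demo")
sc.addFile("/tmp/config.txt")  # hypothetical file; must exist when run

# Every file added via sc.addFile() is placed under this root directory
print(SparkFiles.getRootDirectory())
print(SparkFiles.get("config.txt"))  # resolves to <root>/config.txt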
28. PySpark – RDD

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

An RDD supports two kinds of operations:

Transformations – operations (such as map, filter, join, union) that are performed on an RDD and yield a new RDD containing the result.
Actions – operations (such as reduce, first, count) that return a value after running a computation on an RDD.
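To make the distinction concrete, here is a minimal sketch with illustrative data: filter is a transformation that lazily defines a new RDD, while count and collect are actions that actually trigger the computation.

from pyspark import SparkContext

sc = SparkContext("local", "transformations and actions")
nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: builds a new RDD, nothing runs yet
evens = nums.filter(lambda x: x % 2 == 0)

# Actions: run the computation and return values to the driver
print(evens.count())    # 3
print(evens.collect())  # [2, 4, 6]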
29. PySpark – RDD

Creating a PySpark RDD:

class pyspark.RDD(
    jrdd,
    ctx,
    jrdd_deserializer = AutoBatchedSerializer(PickleSerializer())
)

A PySpark program to return the number of elements in an RDD:

from pyspark import SparkContext

sc = SparkContext("local", "count app")
words = sc.parallelize([
    "scala",
    "java",
    "hadoop",
    "spark",
    "akka",
    "spark vs hadoop",
    "pyspark",
    "pyspark and spark"
])
counts = words.count()
print("Number of elements in RDD -> %i" % counts)
32. PySpark – StorageLevel

StorageLevel decides whether an RDD should be stored in memory, on disk, or both.

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)

from pyspark import SparkContext
import pyspark

sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())

Output: Disk Memory Serialized 2x Replicated
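As a sketch of the constructor shown above, a StorageLevel can also be built by hand; the flag combination below (disk only, serialized, one replica) mirrors the built-in DISK_ONLY level and is chosen purely for illustration.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "custom storagelevel")
rdd = sc.parallelize(range(10))

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
disk_only = StorageLevel(True, False, False, False, 1)
rdd.persist(disk_only)
print(rdd.getStorageLevel())  # Disk Serialized 1x Replicated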
36. PySpark – DataFrames

A DataFrame in PySpark is a distributed collection of rows with named columns.

Characteristics shared with RDDs:
• Immutable in nature
• Lazy evaluation
• Distributed

Ways to create a DataFrame in Spark (as sketched below):
• Using different data formats
• Loading data from an existing RDD
• Programmatically specifying a schema
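A minimal sketch of these creation paths using SparkSession, the usual entry point for DataFrames; the column names, rows, and file path are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df demo").getOrCreate()

# 1. From an existing RDD (or local collection), with named columns
rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 25)])
df1 = spark.createDataFrame(rdd, ["name", "age"])

# 2. From a data format such as JSON (hypothetical path)
# df2 = spark.read.json("/path/to/people.json")

# 3. By programmatically specifying a schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df3 = spark.createDataFrame([("carol", 41)], schema)

df1.show()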
39. PySpark – Broadcast and Accumulator

A broadcast variable allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

A broadcast variable is created with SparkContext.broadcast():

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
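To show the intended use, here is a sketch where a lookup table is broadcast once and read inside tasks via .value, instead of being shipped with every task closure; the table contents are illustrative.

from pyspark import SparkContext

sc = SparkContext("local", "broadcast demo")

# Broadcast a read-only lookup table; each executor caches it locally
countries = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
names = codes.map(lambda c: countries.value.get(c, "unknown")).collect()
print(names)  # ['India', 'United States', 'India']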
41. PySpark – Broadcast and Accumulator

Accumulators are variables that are only added to through an associative and commutative operation.

class pyspark.Accumulator(aid, value, accum_param)

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator app")
num = sc.accumulator(10)

def f(x):
    global num
    num += x

rdd = sc.parallelize([20, 30, 40, 50])
rdd.foreach(f)
final = num.value
print("Accumulated value is -> %i" % final)

Output: Accumulated value is -> 150