This presentation on PySpark will help you understand what PySpark is, the features of PySpark, and how Spark with Python compares to Spark with Scala. Then you will learn the various PySpark contents - SparkConf, SparkContext, SparkFiles, RDD, StorageLevel, DataFrames, Broadcast, and Accumulator - and get an idea of the various subpackages in PySpark. Finally, you will look at a demo that uses PySpark SQL to analyze Walmart stock data. Now, let's dive into learning PySpark in detail.
1. What is PySpark?
2. PySpark Features
3. PySpark with Python and Scala
4. PySpark Contents
5. PySpark Subpackages
6. Companies using PySpark
7. Demo using PySpark
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course, you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
PySpark Tutorial | PySpark Tutorial For Beginners | Apache Spark With Python Tutorial | Simplilearn
2. What’s in it for you?
1. What is PySpark?
2. PySpark Features
3. PySpark with Python and Scala
4. PySpark Contents
5. PySpark Subpackages
6. Companies using PySpark
7. Demo using PySpark
11. Spark with Python and Scala

Criteria: Performance
Python: Python is slower than Scala when used with Spark.
Scala: Spark is written in Scala, so it integrates well with Spark and is faster than Python.

Criteria: Learning curve
Python: Python has a simple syntax and, being a high-level language, is easy to learn.
Scala: Scala has a complex syntax, hence it is not as easy to learn.

Criteria: Code readability
Python: Readability, maintenance, and familiarity of code are better in the Python API.
Scala: Scala is a sophisticated language; developers need to pay close attention to the readability of their code.

Criteria: Data science libraries
Python: Python provides a rich set of libraries for data visualization and model building.
Scala: Scala lacks data science libraries and tools for data visualization.
17. PySpark – SparkConf

SparkConf provides the configuration for running a Spark application.

The following code block shows the signature of the SparkConf class for PySpark:

class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)

Following are some of the most commonly used attributes of SparkConf:

set(key, value) – to set a configuration property
setMaster(value) – to set the master URL
setAppName(value) – to set an application name
get(key, defaultValue=None) – to get the configuration value of a key
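As an illustration, here is a minimal sketch that builds a SparkConf with these attributes and reads a value back; the app name, master URL, and spark.executor.memory setting are placeholder values, not part of the original slides.

from pyspark import SparkConf, SparkContext

# Build a configuration: the setters are chainable and return the same SparkConf
conf = SparkConf() \
    .setAppName("ConfDemo") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "1g")

# get() returns the stored value, or the default if the key is unset
print(conf.get("spark.executor.memory", "512m"))

# Pass the configuration when creating the SparkContext
sc = SparkContext(conf=conf)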
20. PySpark – SparkContext

SparkContext is the main entry point in any Spark program.

[Data flow diagram: in both local and cluster mode, the Python driver communicates with the JVM-based SparkContext over a local socket via Py4J; the SparkContext dispatches work to the Spark Workers, and each worker JVM launches Python subprocesses, exchanging data with them through pipes. In local mode, data is read from the local file system.]
21. PySpark – SparkContext

The code below shows the signature of the PySpark SparkContext class and the parameters it can take:

class pyspark.SparkContext(
    master = None,
    appName = None,
    sparkHome = None,
    pyFiles = None,
    environment = None,
    batchSize = 0,
    serializer = PickleSerializer(),
    conf = None,
    gateway = None,
    jsc = None,
    profiler_cls = <class 'pyspark.profiler.BasicProfiler'>
)
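For a concrete illustration, here is a minimal sketch of creating a SparkContext with the master and appName parameters; the values chosen are placeholders.

from pyspark import SparkContext

# "local" runs Spark in-process on a single thread
sc = SparkContext(master="local", appName="FirstApp")

# Use the context to distribute a small collection and run a job
total = sc.parallelize([1, 2, 3, 4]).sum()
print(total)  # 10

# Stop the context; only one SparkContext may be active at a time
sc.stop()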
25. PySpark – SparkFiles

SparkFiles allows you to upload your files using sc.addFile and get their path on a worker using SparkFiles.get.

SparkFiles contains the following classmethods:

get(filename)
getRootDirectory()

getRootDirectory() returns the path to the root directory, which contains the files added through SparkContext.addFile().

from pyspark import SparkContext
from pyspark import SparkFiles

finddistance = "/home/Hadoop/examples/finddistance.R"
finddistancename = "finddistance.R"

sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)
print("Absolute path -> %s" % SparkFiles.get(finddistancename))
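As a companion sketch for getRootDirectory(), assuming a local file /tmp/config.txt exists (the path is hypothetical):

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFiles demo")
sc.addFile("/tmp/config.txt")  # hypothetical file; must exist when run

# Every file added via sc.addFile() is placed under this root directory
print(SparkFiles.getRootDirectory())
print(SparkFiles.get("config.txt"))  # resolves to <root>/config.txt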
28. PySpark – RDD

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

An RDD supports two kinds of operations:

Transformations – operations (such as map, filter, join, union) that are performed on an RDD and yield a new RDD containing the result.
Actions – operations (such as reduce, first, count) that return a value after running a computation on an RDD.
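To make the distinction concrete, here is a minimal sketch with illustrative data: filter is a transformation that lazily defines a new RDD, while count and collect are actions that actually trigger the computation.

from pyspark import SparkContext

sc = SparkContext("local", "transformations and actions")
nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: builds a new RDD, nothing runs yet
evens = nums.filter(lambda x: x % 2 == 0)

# Actions: run the computation and return values to the driver
print(evens.count())    # 3
print(evens.collect())  # [2, 4, 6]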
29. PySpark – RDD

Creating a PySpark RDD:

class pyspark.RDD(
    jrdd,
    ctx,
    jrdd_deserializer = AutoBatchedSerializer(PickleSerializer())
)

A PySpark program to return the number of elements in an RDD:

from pyspark import SparkContext

sc = SparkContext("local", "count app")
words = sc.parallelize([
    "scala",
    "java",
    "hadoop",
    "spark",
    "akka",
    "spark vs hadoop",
    "pyspark",
    "pyspark and spark"
])
counts = words.count()
print("Number of elements in RDD -> %i" % counts)
32. PySpark – StorageLevel

StorageLevel decides whether an RDD should be stored in memory, on disk, or both.

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)

from pyspark import SparkContext
import pyspark

sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())

Output: Disk Memory Serialized 2x Replicated
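As a sketch of the constructor shown above, a StorageLevel can also be built by hand; the flag combination below (disk only, serialized, one replica) mirrors the built-in DISK_ONLY level and is chosen purely for illustration.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "custom storagelevel")
rdd = sc.parallelize(range(10))

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
disk_only = StorageLevel(True, False, False, False, 1)
rdd.persist(disk_only)
print(rdd.getStorageLevel())  # Disk Serialized 1x Replicated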
36. PySpark – DataFrames

A DataFrame in PySpark is a distributed collection of rows with named columns.

Characteristics shared with RDDs:
• Immutable in nature
• Lazy evaluation
• Distributed

Ways to create a DataFrame in Spark (as sketched below):
• Using different data formats
• Loading data from an existing RDD
• Programmatically specifying a schema
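A minimal sketch of these creation paths using SparkSession, the usual entry point for DataFrames; the column names, rows, and file path are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df demo").getOrCreate()

# 1. From an existing RDD (or local collection), with named columns
rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 25)])
df1 = spark.createDataFrame(rdd, ["name", "age"])

# 2. From a data format such as JSON (hypothetical path)
# df2 = spark.read.json("/path/to/people.json")

# 3. By programmatically specifying a schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df3 = spark.createDataFrame([("carol", 41)], schema)

df1.show()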
39. PySpark – Broadcast and Accumulator

A broadcast variable allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

A broadcast variable is created with SparkContext.broadcast():

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
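To show the intended use, here is a sketch where a lookup table is broadcast once and read inside tasks via .value, instead of being shipped with every task closure; the table contents are illustrative.

from pyspark import SparkContext

sc = SparkContext("local", "broadcast demo")

# Broadcast a read-only lookup table; each executor caches it locally
countries = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
names = codes.map(lambda c: countries.value.get(c, "unknown")).collect()
print(names)  # ['India', 'United States', 'India']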
41. PySpark – Broadcast and Accumulator

Accumulators are variables that are only added to through an associative and commutative operation.

class pyspark.Accumulator(aid, value, accum_param)

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator app")
num = sc.accumulator(10)

def f(x):
    global num
    num += x

rdd = sc.parallelize([20, 30, 40, 50])
rdd.foreach(f)
final = num.value
print("Accumulated value is -> %i" % final)

Output: Accumulated value is -> 150