Learning PySpark Ch 06. Introducing the ML Package

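1 Spark ML Package
The MLlib package supports RDD-based machine learning, whereas the ML package supports DataFrame-based machine learning.
The official name of Spark ML is the 'MLlib DataFrame-based API'. Because DataFrames have advantages over RDDs in data loading, execution plan optimization, and API uniformity across languages, it is the primary API for machine learning as of Spark 2.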
https://spark.apache.org/docs/latest/ml-guide.html#machine-learning-library-mllib-guide


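1.1 Features Provided by the Spark MLlib and ML Packages
ML Algorithms: provides machine learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: provides feature extraction, transformation, dimensionality reduction, and feature selection
Pipelines: supports building the processing steps of a machine learning workflow, evaluating them, and tuning their parameters
Persistence: provides saving and loading of algorithms, Pipelines, and models
Utilities: provides linear algebra, statistics, and data handling functions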


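1.2 Comparison with the MLlib Package
Item | MLlib package | ML package
Package path | pyspark.mllib | pyspark.ml
Supported data structure | RDD | DataFrame
Other | The only package that can train a model while receiving streaming data. No new features are added in Spark 2.x, and at the time of the announcement it was expected to be removed in Spark 3 (maintenance mode). | The package to use by default when working with PySpark.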
https://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api


2 Spark ML Pipelines
The Spark MLlib package provides a standardized API that makes it easy to combine multiple algorithms into a single Pipeline (or workflow).
The concept is reportedly borrowed from the scikit-learn (sklearn) project.
DataFrame: the Spark ML API uses DataFrames created through Spark SQL, which lets machine learning algorithms be applied to a wide variety of data types.
Transformer: an algorithm that converts one DataFrame into another DataFrame with columns appended. A fitted Spark ML Pipeline is itself a Transformer: it takes a DataFrame and produces a new DataFrame that contains the results of its transformations and predictions.
Estimator: an algorithm that is fit on a DataFrame (typically one prepared by Transformers) to produce a model.
Pipeline: chains multiple Transformers and Estimators together as Stages to form a machine learning workflow.
Parameter: all Transformers and Estimators share a common API for specifying parameters.
https://spark.apache.org/docs/latest/ml-pipeline.html#main-concepts-in-pipelines
The Transformer and Estimator algorithms are implemented as Python classes.


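2.1 Transformers
A set of abstract and concrete classes for algorithms that transform or extract features, and for trained models.
Every Transformer class implements a transform function.
It takes a DataFrame as input and produces a new DataFrame with one or more columns appended.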
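As a minimal sketch of this contract in plain Python (not the Spark API), rows can be modeled as dicts. The hypothetical ToyTokenizer class below returns new rows with one extra column and leaves its input untouched, which is the behavior described above for transform.

```python
# Toy illustration of the Transformer contract (plain Python, NOT pyspark).
# A "DataFrame" is modeled as a list of dict rows; ToyTokenizer is a
# hypothetical name used only for this sketch.

class ToyTokenizer:
    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Return NEW rows with one added column; the input is not modified.
        return [
            {**row, self.output_col: row[self.input_col].lower().split()}
            for row in rows
        ]


docs = [{"id": 0, "text": "a b c Spark"}, {"id": 1, "text": "Hadoop MapReduce"}]
tokenized = ToyTokenizer("text", "words").transform(docs)

print(tokenized[0])  # the original columns plus the new "words" column
print(docs[0])       # unchanged: {'id': 0, 'text': 'a b c Spark'}
```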


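2.2 Estimators
A set of abstract and concrete classes that abstract the process of training an algorithm on data.
Every Estimator class implements a fit function.
It takes a DataFrame as input and returns a model, which is a Transformer.
For example, the Estimator class LogisticRegression returns a trained LogisticRegressionModel when its fit function is called, and the returned LogisticRegressionModel is a Transformer.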
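The relationship between an Estimator and the model it returns can be sketched in plain Python (not the Spark API). ToyMeanThreshold and ToyMeanThresholdModel are hypothetical names invented for this illustration: fit learns a value from the data and hands it to a model object, and that model exposes transform.

```python
# Toy illustration of the Estimator -> Model (Transformer) contract.
# Plain Python, NOT pyspark; all class names are hypothetical.

class ToyMeanThresholdModel:
    """The fitted model: a Transformer that holds the learned state."""

    def __init__(self, threshold, input_col, output_col):
        self.threshold = threshold
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Append a prediction column: 1.0 if the value exceeds the threshold.
        return [
            {**r, self.output_col: 1.0 if r[self.input_col] > self.threshold else 0.0}
            for r in rows
        ]


class ToyMeanThreshold:
    """The Estimator: fit() learns from data and returns a model."""

    def __init__(self, input_col="x", output_col="prediction"):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [r[self.input_col] for r in rows]
        mean = sum(values) / len(values)
        return ToyMeanThresholdModel(mean, self.input_col, self.output_col)


train = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
model = ToyMeanThreshold().fit(train)                # Estimator -> Model
scored = model.transform([{"x": 2.5}, {"x": 4.0}])   # the Model is a Transformer

print(model.threshold)  # 3.0
print(scored)
```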


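2.3 Pipeline
A Pipeline defines the stage-by-stage execution of the algorithms that process data and train on it.
Each Stage is either a Transformer or an Estimator.
A configured Pipeline is itself an Estimator, and it is run (trained) through its fit function.
The resulting PipelineModel is a Transformer and is used after training to produce results from the model.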


2.3.1 Pipeline Example
Split each document into words
Convert each document's words into a feature vector
Train a prediction model using the feature vectors and labels
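The three steps above can be sketched in plain Python to show how fitting a pipeline walks its stages. This is a toy model, not Spark's implementation: every class name is hypothetical, rows are dicts, crc32 stands in for Spark's hash function, and the "classifier" simply learns the majority label.

```python
# Toy sketch of how a pipeline's fit() walks its stages. Plain Python, NOT
# pyspark: rows are dicts, and every class name here is hypothetical.
import zlib


class ToyTokenizer:                      # Transformer: text -> words
    def transform(self, rows):
        return [{**r, "words": r["text"].split()} for r in rows]


class ToyHashingTF:                      # Transformer: words -> count vector
    def __init__(self, num_features=8):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0.0] * self.num_features
            for w in r["words"]:
                # crc32 is a stable hash; Spark uses a different hash function.
                vec[zlib.crc32(w.encode()) % self.num_features] += 1.0
            out.append({**r, "features": vec})
        return out


class ToyMajorityModel:                  # Model returned by fit(): a Transformer
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**r, "prediction": self.label} for r in rows]


class ToyMajorityClassifier:             # Estimator: learns the majority label
    def fit(self, rows):
        labels = [r["label"] for r in rows]
        return ToyMajorityModel(max(set(labels), key=labels.count))


class ToyPipelineModel:                  # Fitted pipeline: Transformers only
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):     # Estimator: fit it, keep its model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)  # feed the next stage
        return ToyPipelineModel(fitted)


training = [
    {"text": "a b c d e spark", "label": 1.0},
    {"text": "b d", "label": 0.0},
    {"text": "spark f g h", "label": 1.0},
    {"text": "spark streaming", "label": 1.0},
]
stages = [ToyTokenizer(), ToyHashingTF(), ToyMajorityClassifier()]
model = ToyPipeline(stages).fit(training)
result = model.transform([{"text": "spark i j k"}])
print(result[0]["words"], result[0]["prediction"])
```

Note that the fitted object holds only Transformers: the Estimator stage has been replaced by the model that its fit call returned.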


Training process
Blue boxes are Transformers, red boxes are Estimators
https://spark.apache.org/docs/latest/img/ml-Pipeline.png


Process at test or usage time
https://spark.apache.org/docs/latest/img/ml-PipelineModel.png


2.4 Parameters
Transformers and Estimators provide a uniform API for passing parameters.
Transformers and Estimators hold Param objects as attributes. Each Param carries documentation that describes the parameter, and it serves as the key when passing parameters to the Transformer or Estimator.
In PySpark, a ParamMap is a dict of (parameter, value) pairs.
Parameters can be passed when the class instance is created, afterwards through setters, or as a ParamMap when calling the fit or transform function.
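The three ways of passing parameters interact through a simple precedence order: defaults are overridden by values set on the instance, which are overridden by a map passed at call time. The sketch below illustrates that order in plain Python. ToyParams and its methods are hypothetical, and it uses string keys for brevity, whereas a PySpark ParamMap is keyed by Param objects.

```python
# Toy sketch of parameter resolution order. Plain Python, NOT pyspark;
# ToyParams and its method names are hypothetical and only mimic the idea
# that values resolve as: defaults < values set on the instance < map
# passed at call time.

class ToyParams:
    def __init__(self, **kwargs):
        self._defaults = {"maxIter": 100, "regParam": 0.0}
        self._set = {}
        self.set_params(**kwargs)

    def set_params(self, **kwargs):
        for name, value in kwargs.items():
            if name not in self._defaults:
                raise KeyError(f"unknown parameter: {name}")
            self._set[name] = value
        return self

    def extract_param_map(self, extra=None):
        # Later dicts win: defaults, then instance values, then call-time map.
        return {**self._defaults, **self._set, **(extra or {})}


lr = ToyParams(maxIter=10)                     # passed at construction
lr.set_params(regParam=0.01)                   # passed later through a setter
print(lr.extract_param_map())                  # {'maxIter': 10, 'regParam': 0.01}
print(lr.extract_param_map({"maxIter": 30}))   # {'maxIter': 30, 'regParam': 0.01}
print(lr.extract_param_map())                  # the call-time map did not persist
```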


2.5 Example 1: Estimator, Transformer, and Param
https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param


Creating a Spark Session - connecting to a Spark cluster
In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('spark://192.168.1.233:7077').getOrCreate()
spark
Out[1]:
SparkSession - in-memory
SparkContext
Version: v2.4.4
Master: spark://192.168.1.233:7077
AppName: pyspark-shell


Creating a Spark Session - local mode, Google Colab
In [ ]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
In [ ]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
In [ ]:
import findspark
findspark.init("spark-2.4.4-bin-hadoop2.7")  # SPARK_HOME
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()


Initializing the DataFrame and Estimator
Transformers and Estimators have an explainParams function that describes the parameters they accept.
explainParams shows each parameter's description and default value, and once parameters have been set it also shows each parameter's current value.
In [2]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters")
print(lr.explainParams())
LogisticRegression parameters
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must beequal with 1 for binomial regression, or the number oflasses for multinomial regression. (undefined)
maxIter: max number of iterations (>= 0). (default: 100, current: 10)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
regParam: regularization parameter (>= 0). (default: 0.0, current: 0.01)
standardization: whether to standardize the training features before fitting the model. (default: True)
threshold: Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p]. (default: 0.5)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. (undefined)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 1e-06)
upperBoundsOnCoefficients: The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
upperBoundsOnIntercepts: The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. (undefined)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)


Training the Estimator 1
Calling the fit function runs training and returns a Transformer. The parameters used for training can be inspected with the extractParamMap function.
In [3]:
# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)
# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
model1.extractParamMap()
Model 1 was fit using parameters:
Out[3]:
{Param(parent='LogisticRegression_4e776cc5a93f', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2,
 Param(parent='LogisticRegression_4e776cc5a93f', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0,
 Param(parent='LogisticRegression_4e776cc5a93f', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto',
 Param(parent='LogisticRegression_4e776cc5a93f', name='featuresCol', doc='features column name'): 'features',
 Param(parent='LogisticRegression_4e776cc5a93f', name='fitIntercept', doc='whether to fit an intercept term'): True,
 Param(parent='LogisticRegression_4e776cc5a93f', name='labelCol', doc='label column name'): 'label',
 Param(parent='LogisticRegression_4e776cc5a93f', name='maxIter', doc='maximum number of iterations (>= 0)'): 10,
 Param(parent='LogisticRegression_4e776cc5a93f', name='predictionCol', doc='prediction column name'): 'prediction',
 Param(parent='LogisticRegression_4e776cc5a93f', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'): 'probability',
 Param(parent='LogisticRegression_4e776cc5a93f', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 'rawPrediction',
 Param(parent='LogisticRegression_4e776cc5a93f', name='regParam', doc='regularization parameter (>= 0)'): 0.01,
 Param(parent='LogisticRegression_4e776cc5a93f', name='standardization', doc='whether to standardize the training features before fitting the model'): True,
 Param(parent='LogisticRegression_4e776cc5a93f', name='threshold', doc='threshold in binary classification prediction, in range [0, 1]'): 0.5,
 Param(parent='LogisticRegression_4e776cc5a93f', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)'): 1e-06}


Training the Estimator 2
Parameters can also be passed as a ParamMap when calling the fit or transform function.
Note that when passing parameters as a ParamMap, the dictionary keys are the Param attributes of the Transformer or Estimator themselves (for example lr.maxIter), not strings.
In [4]:
# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.
# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)
# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
model2.extractParamMap()
Model 2 was fit using parameters:
Out[4]:
{Param(parent='LogisticRegression_4e776cc5a93f', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2,
 Param(parent='LogisticRegression_4e776cc5a93f', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0,
 Param(parent='LogisticRegression_4e776cc5a93f', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto',
 Param(parent='LogisticRegression_4e776cc5a93f', name='featuresCol', doc='features column name'): 'features',
 Param(parent='LogisticRegression_4e776cc5a93f', name='fitIntercept', doc='whether to fit an intercept term'): True,
 Param(parent='LogisticRegression_4e776cc5a93f', name='labelCol', doc='label column name'): 'label',
 Param(parent='LogisticRegression_4e776cc5a93f', name='maxIter', doc='maximum number of iterations (>= 0)'): 30,
 Param(parent='LogisticRegression_4e776cc5a93f', name='predictionCol', doc='prediction column name'): 'prediction',
 Param(parent='LogisticRegression_4e776cc5a93f', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'): 'myProbability',
 Param(parent='LogisticRegression_4e776cc5a93f', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 'rawPrediction',
 Param(parent='LogisticRegression_4e776cc5a93f', name='regParam', doc='regularization parameter (>= 0)'): 0.1,
 Param(parent='LogisticRegression_4e776cc5a93f', name='standardization', doc='whether to standardize the training features before fitting the model'): True,
 Param(parent='LogisticRegression_4e776cc5a93f', name='threshold', doc='threshold in binary classification prediction, in range [0, 1]'): 0.55,
 Param(parent='LogisticRegression_4e776cc5a93f', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)'): 1e-06}


Testing the Trained Model
The model returned by training an Estimator is a Transformer, so its transform function can be used to make predictions on test data or real data.
In [5]:
# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])
# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()
for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))
features=[-1.0,1.5,1.3], label=1.0 -> prob=[0.057073041710340625,0.9429269582896593], prediction=1.0
features=[3.0,2.0,-0.1], label=0.0 -> prob=[0.923852231170412,0.07614776882958792], prediction=0.0
features=[0.0,2.2,-1.5], label=1.0 -> prob=[0.10972776114779774,0.8902722388522022], prediction=1.0


2.6 Example 2: Pipeline
Configuring the Pipeline Stages and Training the Pipeline
Initialize the Transformers and Estimators that make up the Pipeline, and define the processing order through the stages argument when creating the Pipeline.
For the stages to run in sequence, the output column of each Stage must be connected to the input column of the next Stage. A Pipeline is an Estimator, so calling its fit function runs training and returns a PipelineModel.
In [6]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
# LogisticRegression takes its input column through featuresCol, not inputCol.
lr = LogisticRegression(featuresCol="features", maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Fit the pipeline to training documents.
model = pipeline.fit(training)
model
Out[6]: PipelineModel_7958720c3c4e


The returned PipelineModel is a Transformer, so its transform function can be used to make predictions on test data or real data.
In [7]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])
# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))
(4, spark i j k) --> prob=[0.1596407738787475,0.8403592261212525], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476744,0.16216743145232562], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976037,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000


3 Scenario: Predicting Infant Survival
1. Download the data
2. Load the data
3. Create the Transformers
4. Create the Estimator
5. Create the Pipeline
6. Train the model
7. Measure model performance
8. Save the model
9. Tune hyperparameters


3.1 Downloading the Data
http://tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz
In [8]:
!wget http://tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz
--2019-12-23 15:33:43--
Resolving tomdrabas.com... 162.241.253.147
Connecting to tomdrabas.com|162.241.253.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 364560 (356K) [application/x-gzip]
Saving to: 'births_transformed.csv.gz.1'
0K .......... .......... .......... .......... .......... 14% 132K 2s
50K .......... .......... .......... .......... .......... 28% 266K 1s
100K .......... .......... .......... .......... .......... 42% 11.4M 1s
150K .......... .......... .......... .......... .......... 56% 11.4M 0s
200K .......... .......... .......... .......... .......... 70% 273K 0s
250K .......... .......... .......... .......... .......... 84% 11.3M 0s
300K .......... .......... .......... .......... .......... 98% 11.3M 0s
350K ...... 100% 10.1M=0.8s
2019-12-23 15:33:45 (463 KB/s) - 'births_transformed.csv.gz.1' saved [364560/364560]


3.2 Loading the Data
Load the downloaded CSV file into a Spark DataFrame.
In [9]:
from pyspark.sql import types
labels = [
    ('INFANT_ALIVE_AT_REPORT', types.IntegerType()), ('BIRTH_PLACE', types.StringType()),
    ('MOTHER_AGE_YEARS', types.IntegerType()), ('FATHER_COMBINED_AGE', types.IntegerType()),
    ('CIG_BEFORE', types.IntegerType()), ('CIG_1_TRI', types.IntegerType()),
    ('CIG_2_TRI', types.IntegerType()), ('CIG_3_TRI', types.IntegerType()),
    ('MOTHER_HEIGHT_IN', types.IntegerType()), ('MOTHER_PRE_WEIGHT', types.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', types.IntegerType()), ('MOTHER_WEIGHT_GAIN', types.IntegerType()),
    ('DIABETES_PRE', types.IntegerType()), ('DIABETES_GEST', types.IntegerType()),
    ('HYP_TENS_PRE', types.IntegerType()), ('HYP_TENS_GEST', types.IntegerType()),
    ('PREV_BIRTH_PRETERM', types.IntegerType())
]
schema = types.StructType([
    types.StructField(e[0], e[1], False) for e in labels  # the third argument is `nullable`
])
births = spark.read.csv('births_transformed.csv.gz', header=True, schema=schema)
print(f'Schema: \n{births}\n')
print(f'Data sample: \n{births.take(2)}')
Schema:
DataFrame[INFANT_ALIVE_AT_REPORT: int, BIRTH_PLACE: string, MOTHER_AGE_YEARS: int, FATHER_COMBINED_AGE: int, CIG_BEFORE: int, CIG_1_TRI: int, CIG_2_TRI: int, CIG_3_TRI: int, MOTHER_HEIGHT_IN: int, MOTHER_PRE_WEIGHT: int, MOTHER_DELIVERY_WEIGHT: int, MOTHER_WEIGHT_GAIN: int, DIABETES_PRE: int, DIABETES_GEST: int, HYP_TENS_PRE: int, HYP_TENS_GEST: int, PREV_BIRTH_PRETERM: int]
Data sample:
[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0), Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=22, FATHER_COMBINED_AGE=29, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=65, MOTHER_PRE_WEIGHT=180, MOTHER_DELIVERY_WEIGHT=198, MOTHER_WEIGHT_GAIN=18, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0)]


3.3 Creating the Transformers
3.3.1 Creating the OneHotEncoder
Create a Transformer that one-hot encodes the birth place information.
Note that pyspark.ml.feature.OneHotEncoder, used here, is itself deprecated as of Spark 2.3
(replacement class: pyspark.ml.feature.OneHotEncoderEstimator)
(Deprecated) https://spark.apache.org/docs/latest/ml-features#onehotencoder-deprecated-since-230
https://spark.apache.org/docs/latest/ml-features#onehotencoderestimator
OneHotEncoder does not work on StringType columns; it expects numeric category indices, so the birth place information, currently a StringType, must be converted (here to IntegerType) before it can be used.
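To make the encoding itself concrete, here is a minimal plain-Python sketch of one-hot encoding with a dropLast-style option (the dropLast parameter appears in the explainParams output of this section). The one_hot helper is hypothetical and only illustrates the idea: when the last category is dropped, the vector has one slot fewer and the last category is represented by all zeros.

```python
# Toy sketch of one-hot encoding with a dropLast-style option. Plain Python,
# NOT pyspark; one_hot is a hypothetical helper.

def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values.

    With drop_last=True the vector has num_categories - 1 slots and the last
    category is represented by all zeros.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


print(one_hot(1, 10))                    # 9 slots, position 1 set
print(one_hot(9, 10))                    # last category -> all zeros
print(one_hot(9, 10, drop_last=False))   # 10 slots, position 9 set
```

Under the assumption of ten category indices, the first call gives a 9-slot vector with position 1 set, which has the same shape as the BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}) value that appears in the pipeline output of section 3.6.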


Converting the column type and checking the OneHotEncoder parameters
In [10]:
from pyspark.ml import feature
births = births.withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE'].cast(types.IntegerType()))
encoder = feature.OneHotEncoder()
print(encoder.explainParams())
dropLast: whether to drop the last category (default: True)
inputCol: input column name. (undefined)
outputCol: output column name. (default: OneHotEncoder_dee08023bdfc__output)


Parameter passing method 1
Pass the parameters when instantiating the class
feature.OneHotEncoder(inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC')


Parameter passing method 2-1
Pass the parameters one at a time through setter functions after instantiation
encoder.setInputCol('BIRTH_PLACE_INT')
encoder.setOutputCol('BIRTH_PLACE_VEC')


Parameter passing method 2-2
Pass the parameters all at once through the setParams function after instantiation
encoder_param_map = dict(
    inputCol='BIRTH_PLACE_INT',
    outputCol='BIRTH_PLACE_VEC'
)
encoder.setParams(**encoder_param_map)


Parameter passing method 3
Only when calling transform or fit directly, not when running through a Pipeline
Build a ParamMap keyed by the attributes of the class instance
encoder_param_map = {
    encoder.inputCol: 'BIRTH_PLACE_INT',
    encoder.outputCol: 'BIRTH_PLACE_VEC'
}
encoder.transform(df, encoder_param_map)


Passing the parameters
Pass the parameters through the setParams setter function
In [11]:
encoder_param_map = dict(
    inputCol='BIRTH_PLACE_INT',
    outputCol='BIRTH_PLACE_VEC'
)
encoder.setParams(**encoder_param_map)
Out[11]: OneHotEncoder_dee08023bdfc


3.3.2 Creating the VectorAssembler
VectorAssembler is a Transformer that takes a set of feature columns and produces a single vector. Here it builds the vector on which the logistic regression that predicts INFANT_ALIVE_AT_REPORT (infant survival) is trained.
https://spark.apache.org/docs/latest/ml-features#vectorassembler
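The assembling step can be illustrated with a small plain-Python sketch. The assemble and to_sparse helpers are hypothetical and are not the Spark implementation. They show scalar columns and an already-encoded vector column being concatenated, in order, into one flat feature vector, with a shortened three-slot one-hot vector standing in for the real one.

```python
# Toy sketch of what a vector assembler does. Plain Python, NOT pyspark;
# assemble and to_sparse are hypothetical helpers.

def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)   # vector column
        else:
            features.append(float(value))              # scalar column
    return features


def to_sparse(vec):
    """Return (size, {index: value}) keeping only the non-zero entries."""
    return len(vec), {i: v for i, v in enumerate(vec) if v != 0.0}


row = {
    "MOTHER_AGE_YEARS": 13,
    "MOTHER_HEIGHT_IN": 66,
    "BIRTH_PLACE_VEC": [0.0, 1.0, 0.0],   # shortened one-hot vector
}
cols = ["MOTHER_AGE_YEARS", "MOTHER_HEIGHT_IN", "BIRTH_PLACE_VEC"]
features = assemble(row, cols)

print(features)             # [13.0, 66.0, 0.0, 1.0, 0.0]
print(to_sparse(features))  # (5, {0: 13.0, 1: 66.0, 3: 1.0})
```

The same arithmetic explains the size of the features vector in this scenario: 15 scalar input columns plus a 9-slot one-hot vector give 24 entries.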


Checking the VectorAssembler parameters
In [12]:
features_creator = feature.VectorAssembler()
print(features_creator.explainParams())
handleInvalid: How to handle invalid data (NULL and NaN values). Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the output). Column lengths are taken from the size of ML Attribute Group, which can be set using `VectorSizeHint` in a pipeline before `VectorAssembler`. Column lengths can also be inferred from first rows of the data since it is safe to do so but only in case of 'error' or 'skip'). (default: error)
inputCols: input column names. (undefined)
outputCol: output column name. (default: VectorAssembler_56a86e780ed3__output)


Passing the parameters
In [13]:
features_creator_param_map = dict(
    inputCols=[col[0] for col in labels[2:]] + [encoder_param_map['outputCol']],
    outputCol='features'
)
features_creator.setParams(**features_creator_param_map)
Out[13]: VectorAssembler_56a86e780ed3


3.4 Creating the Estimator
Create a LogisticRegression Estimator that predicts infant survival from the vector produced by the Transformers.


Checking the LogisticRegression parameters
In [14]:
from pyspark.ml import classification
logistic = classification.LogisticRegression()
print(logistic.explainParams())
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must beequal with 1 for binomial regression, or the number oflasses for multinomial regression. (undefined)
maxIter: max number of iterations (>= 0). (default: 100)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
regParam: regularization parameter (>= 0). (default: 0.0)
standardization: whether to standardize the training features before fitting the model. (default: True)
threshold: Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p]. (default: 0.5)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. (undefined)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 1e-06)
upperBoundsOnCoefficients: The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
upperBoundsOnIntercepts: The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. (undefined)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)


Passing the parameters
In [15]:
logistic_param_map = dict(
    maxIter=10,
    regParam=0.01,
    labelCol='INFANT_ALIVE_AT_REPORT'
)
logistic.setParams(**logistic_param_map)
Out[15]: LogisticRegression_bd8adb7506ea


3.5 Defining the Pipeline
Define a Pipeline that runs the Transformers and the Estimator defined above stage by stage.
In [16]:
pipeline = Pipeline(stages=[encoder, features_creator, logistic])


3.6 Training the Model
Split the data into training and test sets
Train the model
Predict on the test data
In [17]:
births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)
model = pipeline.fit(births_train)
predicted = model.transform(births_test)
print(f'{type(model)}, {type(predicted)}')
<class 'pyspark.ml.pipeline.PipelineModel'>, <class 'pyspark.sql.dataframe.DataFrame'>


Checking the Pipeline results
The results of the Transformers and the Estimator defined in the Pipeline have been added to the DataFrame as columns.
In [18]:
births_test.take(1)
Out[18]: [Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1)]
In [19]:
predicted.take(1)
Out[19]: [Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0573, -1.0573]), probability=DenseVector([0.7422, 0.2578]), prediction=0.0)]


3.7 Measuring Model Performance
The pyspark.ml.evaluation package contains a variety of classes for measuring model performance.
In [20]:
from pyspark.ml import evaluation
evaluator = evaluation.BinaryClassificationEvaluator()
print(evaluator.explainParams())
labelCol: label column name. (default: label)
metricName: metric name in evaluation (areaUnderROC|areaUnderPR) (default: areaUnderROC)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)


In [21]:
evaluator_param_map = {
    evaluator.rawPredictionCol: 'probability',
    evaluator.labelCol: 'INFANT_ALIVE_AT_REPORT',
    evaluator.metricName: 'areaUnderROC'
}
evaluator.evaluate(predicted, evaluator_param_map)
Out[21]: 0.7401301847095617


In [22]:
evaluator_param_map.update({
    evaluator.metricName: 'areaUnderPR'
})
evaluator.evaluate(predicted, evaluator_param_map)
Out[22]: 0.7139354342365674
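For intuition about what the areaUnderROC metric measures, the following plain-Python sketch computes it from its pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The area_under_roc helper is hypothetical, and this is not how Spark computes the metric internally.

```python
# Toy sketch of area under the ROC curve. Plain Python, NOT pyspark;
# area_under_roc is a hypothetical helper that uses the pairwise-ranking
# definition rather than whatever Spark does internally.

def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]   # predicted probability of class 1
print(area_under_roc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.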


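3.8 Saving the Model
Both a Pipeline and a PipelineModel can be saved.
Saving a Pipeline stores the process used to create a PipelineModel, so after loading it you can train with that process. Saving a PipelineModel stores the fitted model, so after loading it you can use it for prediction.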


Saving the Pipeline
In [23]:
pipeline_path = './infant_onehotencoder_logistic_pipeline'
pipeline.write().overwrite().save(pipeline_path)


In [24]:
!ls -lR ./infant_onehotencoder_logistic_pipeline
./infant_onehotencoder_logistic_pipeline:
total 0
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 stages
./infant_onehotencoder_logistic_pipeline/metadata:
total 1
-rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
-rw-r--r-- 1 lunal lunal 262 Dec 23 15:34 part-00000
./infant_onehotencoder_logistic_pipeline/stages:
total 0
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 0_OneHotEncoder_dee08023bdfc
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 1_VectorAssembler_56a86e780ed3
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 2_LogisticRegression_bd8adb7506ea
./infant_onehotencoder_logistic_pipeline/stages/0_OneHotEncoder_dee08023bdfc:
total 0
./infant_onehotencoder_logistic_pipeline/stages/0_OneHotEncoder_dee08023bdfc/metadata:
total 1
./infant_onehotencoder_logistic_pipeline/stages/1_VectorAssembler_56a86e780ed3:
total 0
./infant_onehotencoder_logistic_pipeline/stages/1_VectorAssembler_56a86e780ed3/metadata:
total 1
./infant_onehotencoder_logistic_pipeline/stages/2_LogisticRegression_bd8adb7506ea:
total 0
./infant_onehotencoder_logistic_pipeline/stages/2_LogisticRegression_bd8adb7506ea/metadata:
total 1


Saving the PipelineModel
In [25]:
model_path = './infant_onehotencoder_logistic_pipelinemodel'
model.write().overwrite().save(model_path)


In [26]:
!ls -lR ./infant_onehotencoder_logistic_pipelinemodel
./infant_onehotencoder_logistic_pipelinemodel:
total 0
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 stages
./infant_onehotencoder_logistic_pipelinemodel/metadata:
total 1
./infant_onehotencoder_logistic_pipelinemodel/stages:
total 0
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 0_OneHotEncoder_dee08023bdfc
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 1_VectorAssembler_56a86e780ed3
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 2_LogisticRegression_bd8adb7506ea
./infant_onehotencoder_logistic_pipelinemodel/stages/0_OneHotEncoder_dee08023bdfc:
total 0
./infant_onehotencoder_logistic_pipelinemodel/stages/0_OneHotEncoder_dee08023bdfc/metadata:
total 1
./infant_onehotencoder_logistic_pipelinemodel/stages/1_VectorAssembler_56a86e780ed3:
total 0
./infant_onehotencoder_logistic_pipelinemodel/stages/1_VectorAssembler_56a86e780ed3/metadata:
total 1
./infant_onehotencoder_logistic_pipelinemodel/stages/2_LogisticRegression_bd8adb7506ea:
total 0
drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 data
./infant_onehotencoder_logistic_pipelinemodel/stages/2_LogisticRegression_bd8adb7506ea/data:
total 8
-rw-r--r-- 1 lunal lunal 4307 Dec 23 15:34 part-00000-a81692dd-4119-41b8-86ad-b297c922197e-c000.snappy.parquet


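Example: loading and using the saved Pipeline and PipelineModel
# Reload the unfitted Pipeline, train it again, and predict on the test set.
loaded_pipeline = Pipeline.load(pipeline_path)
(loaded_pipeline
    .fit(births_train)
    .transform(births_test)
    .take(1))

# Reload the fitted PipelineModel and use it directly for prediction.
from pyspark.ml import PipelineModel
loaded_pipeline_model = PipelineModel.load(model_path)
predicted = loaded_pipeline_model.transform(births_test)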


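3.9 Hyperparameter Tuning
This section uses grid search and cross validation to find the best hyperparameters.
The search is run by passing an Estimator and an Evaluator to pyspark.ml.tuning.CrossValidator, together with the set of candidate hyperparameters for the grid search, which is built with pyspark.ml.tuning.ParamGridBuilder.
In the earlier Pipeline, the data had to pass through the Transformer stages before reaching the Estimator stage. Here the data is therefore preprocessed first by a Pipeline made up of only those Transformer stages, and the preprocessed result is what gets passed to the CrossValidator.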
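
To make the size of the search concrete, here is a minimal plain-Python sketch of what a two-parameter grid expands to. It does not use Spark. The candidate values are the ones passed to ParamGridBuilder in this section, the variable names are made up for the example, and the fold count assumes CrossValidator is left at its default numFolds of 3.

```python
from itertools import product

# Candidate values, matching the grid defined in this section.
max_iter_values = [2, 10, 50]
reg_param_values = [0.01, 0.05, 0.3]

# Every combination of the two lists becomes one candidate setting.
param_grid = [
    {'maxIter': max_iter, 'regParam': reg_param}
    for max_iter, reg_param in product(max_iter_values, reg_param_values)
]

# Assumption: numFolds is left at the CrossValidator default of 3.
num_folds = 3
fits_during_search = len(param_grid) * num_folds

print(len(param_grid))     # 9
print(fits_during_search)  # 27
```

Each added parameter multiplies the number of candidates, so the cost of the search grows quickly as the grid widens.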


Defining the cross validation
In [31]:
from pyspark.ml import tuning

logistic = classification.LogisticRegression(
    labelCol='INFANT_ALIVE_AT_REPORT'  # label column, as used with the evaluator earlier
)
evaluator = evaluation.BinaryClassificationEvaluator(
    rawPredictionCol='probability',
    labelCol='INFANT_ALIVE_AT_REPORT'
)
# Wrap the chained calls in parentheses so they can span several lines.
grid = (tuning.ParamGridBuilder()
    .addGrid(logistic.maxIter, [2, 10, 50])
    .addGrid(logistic.regParam, [0.01, 0.05, 0.3])
    .build())
cv = tuning.CrossValidator(
    estimator=logistic,
    evaluator=evaluator,
    estimatorParamMaps=grid
)


Transformer
In [32]:
pipeline = Pipeline(stages=[encoder, features_creator])
transformer = pipeline.fit(births_train)


In [33]:
cv_model = cv.fit(transformer.transform(births_train))
cv_model
Out[33]: CrossValidatorModel_ea1cb3e53e4d


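Predicting on the test data and evaluating performance with the cross-validation result
The best hyperparameters found by cross validation are used to predict on the test data, and the resulting performance is then checked.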


In [34]:
predicted = cv_model.transform(transformer.transform(births_test))


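In [35]:
evaluator_param_map = {
    evaluator.rawPredictionCol: 'probability',
    evaluator.labelCol: 'INFANT_ALIVE_AT_REPORT',
    evaluator.metricName: 'areaUnderROC'
}
evaluator.evaluate(predicted, evaluator_param_map)
Out[35]: 0.7404959803309813


In [36]:
evaluator_param_map.update({
    evaluator.metricName: 'areaUnderPR'
})
evaluator.evaluate(predicted, evaluator_param_map)
Out[36]: 0.7157971108486731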


Checking the grid search results
The result for each parameter combination passed to the cross validation can be inspected.


In [37]:
for params, metric in zip(cv_model.getEstimatorParamMaps(), cv_model.avgMetrics):
    # Key each value by the parameter's name rather than by the Param object.
    params_map = {key.name: value for key, value in params.items()}
    print(f'params={params_map}, metric={metric}')
params={'maxIter': 2, 'regParam': 0.01}, metric=0.6967531825356931
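
Once the per-combination metrics are listed, picking the winner is a simple arg-max. The sketch below shows that step in plain Python. The helper name best_param_map and the three toy entries are hypothetical, and the toy metric values are placeholders for illustration, not results from this notebook.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return the (params, metric) pair with the best average metric."""
    pairs = list(zip(param_maps, avg_metrics))
    if not pairs:
        raise ValueError('no grid search results to compare')
    pick = max if larger_is_better else min
    return pick(pairs, key=lambda pair: pair[1])


# Toy values for illustration only; they are NOT results from this notebook.
toy_param_maps = [
    {'maxIter': 2, 'regParam': 0.01},
    {'maxIter': 10, 'regParam': 0.05},
    {'maxIter': 50, 'regParam': 0.3},
]
toy_metrics = [0.61, 0.70, 0.66]

best_params, best_metric = best_param_map(toy_param_maps, toy_metrics)
print(best_params, best_metric)  # {'maxIter': 10, 'regParam': 0.05} 0.7
```

With the notebook's objects, the same helper could be fed the name-keyed dictionaries built in the loop above together with cv_model.avgMetrics. The model refit with the winning combination is available as cv_model.bestModel.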

