Learning PySpark Ch 06. Introducing the ML Package

1. The Spark ML package
2. Spark ML Pipelines
3. Scenario: predicting infant survival

https://gist.github.com/lunalcni/00483f89c0035d0fedd00f3f9c803820

  1. 1. 1  The Spark ML package
      Whereas the MLlib package supports RDD-based machine learning, the ML package supports DataFrame-based machine learning. The official name of Spark ML is the 'MLlib DataFrame-based API'. Because DataFrames have advantages over RDDs in Spark for data loading, execution-plan optimization, and API uniformity across languages, it is the primary API for machine learning as of Spark 2.
      https://spark.apache.org/docs/latest/ml-guide.html#machine-learning-library-mllib-guide
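  2. 2. 1.1  Features provided by the Spark MLlib and ML packages
      - ML Algorithms: machine learning algorithms such as classification, regression, clustering, and collaborative filtering
      - Featurization: feature extraction, transformation, dimensionality reduction, and feature selection
      - Pipelines: support for building the processing steps of a machine learning workflow, evaluating them, and tuning their parameters
      - Persistence: saving and loading algorithms, Pipelines, and models
      - Utilities: linear algebra, statistics, and data handling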
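  3. 3. 1.2  Comparison with the MLlib package

      | MLlib package | Item | ML package |
      | pyspark.mllib | Package path | pyspark.ml |
      | RDD | Supported data structure | DataFrame |
      | The only package that can train a model while receiving streaming data | Other | The package to use by default when working with PySpark |
      | Receives no new features in Spark 2.x (maintenance mode); the Spark 2.x documentation expected it to be removed in Spark 3 | | |

      https://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api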
  4. 4. 2  Spark ML Pipelines
      The Spark MLlib package provides a standardized API for easily combining multiple algorithms into a single Pipeline (or workflow). The concept is said to be borrowed from the scikit-learn (sklearn) project.
      - DataFrame: the Spark ML API uses DataFrames created through Spark SQL, which lets machine learning algorithms be applied to a variety of data types.
      - Transformer: an algorithm that converts a DataFrame into another DataFrame with added columns. The result of fitting a Spark ML Pipeline is also a Transformer: it produces a new DataFrame containing the transformations and predictions for the DataFrame it receives.
      - Estimator: an algorithm that trains on a DataFrame (typically one produced by Transformers) and produces a model.
      - Pipeline: chains a set of Transformers and Estimators as stages to build a machine learning workflow.
      - Parameter: every Transformer and Estimator shares a common API for receiving parameters.
      https://spark.apache.org/docs/latest/ml-pipeline.html#main-concepts-in-pipelines
      The Transformer and Estimator algorithms are implemented as Python classes.
  5. 5. 2.1  Transformers
      A set of abstract and concrete classes for algorithms that transform or extract features, and for trained models. Transformer classes implement a transform method, which takes a DataFrame and produces a new DataFrame with one or more columns added.
  6. 6. 2.2  Estimators
      A set of abstract and concrete classes that abstract the process of training an algorithm on data. Estimator classes implement a fit method, which takes a DataFrame and returns a model that is a Transformer.
      Example: LogisticRegression, an Estimator class, returns a trained LogisticRegressionModel when fit is called, and the returned LogisticRegressionModel is a Transformer.
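  7. 7. 2.3  Pipeline
      Used to define the stage-by-stage execution of the algorithms that process data and train on it. Each stage is either a Transformer or an Estimator. An assembled Pipeline is itself an Estimator and is run, or trained, through its fit method. The PipelineModel it returns is a Transformer and is used to produce results from the model after training.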
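The Transformer, Estimator, and Pipeline contract described above can be sketched in plain Python. This is a toy model under stated assumptions, not PySpark's implementation: rows are dicts standing in for DataFrame rows, and every class name below (AddLength, MeanThreshold, ToyPipeline, and so on) is hypothetical.

```python
# Toy model of the Transformer / Estimator / Pipeline contract.
# None of these classes belong to PySpark; rows are plain dicts.

class AddLength:
    """Transformer: appends a 'length' column computed from 'text'."""
    def transform(self, rows):
        return [{**row, "length": len(row["text"])} for row in rows]


class MeanThresholdModel:
    """Model (a Transformer) produced by MeanThreshold.fit()."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**row, "prediction": float(row["length"] > self.threshold)}
                for row in rows]


class MeanThreshold:
    """Estimator: learns the mean of 'length' and returns a fitted model."""
    def fit(self, rows):
        mean = sum(row["length"] for row in rows) / len(rows)
        return MeanThresholdModel(mean)


class ToyPipelineModel:
    """Fitted pipeline: every stage is now a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Pipeline: itself an Estimator whose fit() returns a ToyPipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it, keep the model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [{"text": "ab"}, {"text": "abcdef"}]
model = ToyPipeline([AddLength(), MeanThreshold()]).fit(train)
result = model.transform([{"text": "abcdefgh"}, {"text": "a"}])
print(result)
```

The point of the sketch is the shape of the API: fitting replaces each Estimator stage with the model it produced, so the fitted pipeline consists only of Transformers.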
  8. 8. 2.3.1  Pipeline example
      - Split each document into words
      - Convert each document's words into a feature vector
      - Train a prediction model from the feature vectors and labels
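The first two steps of this example (splitting text into words, then turning words into a fixed-size feature vector) can be illustrated with the hashing trick. The sketch below is a plain-Python assumption of how such a step might look; the function names are hypothetical, and zlib.crc32 is only a deterministic stand-in for whatever hash function the real library uses.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashed_term_frequencies(words, num_features=16):
    """Map each word to a bucket index and count occurrences per bucket."""
    counts = {}
    for word in words:
        # Stand-in hash: chosen for determinism in this sketch only.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


features = hashed_term_frequencies(tokenize("spark hadoop spark"))
print(features)
```

Because the vector size is fixed in advance, no vocabulary has to be learned, at the cost of occasional collisions between different words.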
  9. 9. Training process
      Blue boxes are Transformers, red boxes are Estimators.
      https://spark.apache.org/docs/latest/img/ml-Pipeline.png
  10. 10. Process at test time or in use
      https://spark.apache.org/docs/latest/img/ml-PipelineModel.png
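  11. 11. 2.4  Parameters
      Transformers and Estimators provide a uniform API for passing parameters. Each Transformer and Estimator holds Param objects as attributes. A Param carries built-in documentation describing the parameter and serves as the key when passing parameters to a Transformer or Estimator.
      A ParamMap is a set of (parameter, value) pairs; in PySpark it takes the form of a dict.
      Parameters are passed when the class instance is created, afterwards through setters, or as a ParamMap when fit or transform is called.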
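The three ways of supplying parameters imply an order of precedence: defaults are overridden by explicitly set values, which in turn are overridden by a ParamMap passed at call time. A minimal sketch of that merge, assuming string keys for brevity (in PySpark the keys are Param objects) and a hypothetical ToyParams class:

```python
# Toy illustration of parameter precedence; ToyParams is hypothetical
# and is not PySpark's Params implementation.

class ToyParams:
    def __init__(self, **user_supplied):
        self._defaults = {"maxIter": 100, "regParam": 0.0}
        self._set = dict(user_supplied)    # values given at construction

    def set(self, name, value):
        self._set[name] = value            # setter used after construction
        return self

    def resolve(self, param_map=None):
        """Merge defaults < explicitly set values < per-call param map."""
        merged = dict(self._defaults)
        merged.update(self._set)
        merged.update(param_map or {})
        return merged


params = ToyParams(maxIter=10).set("regParam", 0.01)
print(params.resolve())
print(params.resolve({"maxIter": 30}))
```

Note that the per-call map does not modify the instance: resolving again without it yields the previously set values.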
  14. 14. 2.5  Example 1: Estimator, Transformer, and Param
      https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param
      Creating a Spark session: connecting to a Spark cluster

      In [1]:
          from pyspark.sql import SparkSession

          spark = SparkSession.builder.master('spark://192.168.1.233:7077').getOrCreate()
          spark

      Out[1]:
          SparkSession - in-memory
          SparkContext
          Version v2.4.4
          Master spark://192.168.1.233:7077
          AppName pyspark-shell
          Spark UI
  18. 18. Creating a Spark session: local mode, Google Colab

      In [ ]:
          !apt-get install openjdk-8-jdk-headless -qq > /dev/null
          !wget -q http://www-eu.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
          !tar xf spark-2.4.4-bin-hadoop2.7.tgz
          !pip install -q findspark

      In [ ]:
          import os

          os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
          os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

      In [ ]:
          import findspark
          findspark.init("spark-2.4.4-bin-hadoop2.7")  # SPARK_HOME

          from pyspark.sql import SparkSession

          spark = SparkSession.builder.master("local[*]").getOrCreate()
  20. 20. Initializing the DataFrame and the Estimator
      Transformers and Estimators have an explainParams method that describes the parameters they accept. explainParams shows each parameter's description and default value and, once parameters have been passed, each parameter's current value.

      In [2]:
          from pyspark.ml.linalg import Vectors
          from pyspark.ml.classification import LogisticRegression

          # Prepare training data from a list of (label, features) tuples.
          training = spark.createDataFrame([
              (1.0, Vectors.dense([0.0, 1.1, 0.1])),
              (0.0, Vectors.dense([2.0, 1.0, -1.0])),
              (0.0, Vectors.dense([2.0, 1.3, 1.0])),
              (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

          # Create a LogisticRegression instance. This instance is an Estimator.
          lr = LogisticRegression(maxIter=10, regParam=0.01)

          # Print out the parameters, documentation, and any default values.
          print("LogisticRegression parameters")
          print(lr.explainParams())

      Output:
          LogisticRegression parameters
          aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
          elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
          family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
          featuresCol: features column name. (default: features)
          fitIntercept: whether to fit an intercept term. (default: True)
          labelCol: label column name. (default: label)
          lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
          lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. (undefined)
          maxIter: max number of iterations (>= 0). (default: 100, current: 10)
          predictionCol: prediction column name. (default: prediction)
          probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
          rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
          regParam: regularization parameter (>= 0). (default: 0.0, current: 0.01)
          standardization: whether to standardize the training features before fitting the model. (default: True)
          threshold: Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match. e.g. if threshold is p, then thresholds must be equal to [1-p, p]. (default: 0.5)
          thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. (undefined)
          tol: the convergence tolerance for iterative algorithms (>= 0). (default: 1e-06)
          upperBoundsOnCoefficients: The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
          upperBoundsOnIntercepts: The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal with 1 for binomial regression, or [...]
  22. 22. Training the Estimator 1
      Calling fit runs the training and returns a Transformer. The parameters used during training can be inspected with the extractParamMap method.

      In [3]:
          # Learn a LogisticRegression model. This uses the parameters stored in lr.
          model1 = lr.fit(training)

          # Since model1 is a Model (i.e., a transformer produced by an Estimator),
          # we can view the parameters it used during fit().
          # This prints the parameter (name: value) pairs, where names are unique IDs for this
          # LogisticRegression instance.
          print("Model 1 was fit using parameters: ")
          model1.extractParamMap()

      Output:
          Model 1 was fit using parameters:

      Out[3]:
          {Param(parent='LogisticRegression_4e776cc5a93f', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2,
           Param(parent='LogisticRegression_4e776cc5a93f', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0,
           Param(parent='LogisticRegression_4e776cc5a93f', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto',
           Param(parent='LogisticRegression_4e776cc5a93f', name='featuresCol', doc='features column name'): 'features',
           Param(parent='LogisticRegression_4e776cc5a93f', name='fitIntercept', doc='whether to fit an intercept term'): True,
           Param(parent='LogisticRegression_4e776cc5a93f', name='labelCol', doc='label column name'): 'label',
           Param(parent='LogisticRegression_4e776cc5a93f', name='maxIter', doc='maximum number of iterations (>= 0)'): 10,
           Param(parent='LogisticRegression_4e776cc5a93f', name='predictionCol', doc='prediction column name'): 'prediction',
           Param(parent='LogisticRegression_4e776cc5a93f', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'): 'probability',
           Param(parent='LogisticRegression_4e776cc5a93f', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 'rawPrediction',
           Param(parent='LogisticRegression_4e776cc5a93f', name='regParam', doc='regularization parameter (>= 0)'): 0.01,
           Param(parent='LogisticRegression_4e776cc5a93f', name='standardization', doc='whether to standardize the training features before fitting the model'): True,
           Param(parent='LogisticRegression_4e776cc5a93f', name='threshold', doc='threshold in binary classification prediction, in range [0, 1]'): 0.5,
           Param(parent='LogisticRegression_4e776cc5a93f', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)'): 1e-06}
  24. 24. Training the Estimator 2
      Parameters can also be passed as a ParamMap when fit or transform is called. Note that the keys of the dictionary are the Param attributes of the Transformer or Estimator themselves (for example lr.maxIter), not strings.

      In [4]:
          # We may alternatively specify parameters using a Python dictionary as a paramMap
          paramMap = {lr.maxIter: 20}
          paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
          paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

          # You can combine paramMaps, which are python dictionaries.
          paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
          paramMapCombined = paramMap.copy()
          paramMapCombined.update(paramMap2)

          # Now learn a new model using the paramMapCombined parameters.
          # paramMapCombined overrides all parameters set earlier via lr.set* methods.
          model2 = lr.fit(training, paramMapCombined)
          print("Model 2 was fit using parameters: ")
          model2.extractParamMap()

      Output:
          Model 2 was fit using parameters:

      Out[4]:
          {Param(parent='LogisticRegression_4e776cc5a93f', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2,
           Param(parent='LogisticRegression_4e776cc5a93f', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0,
           Param(parent='LogisticRegression_4e776cc5a93f', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto',
           Param(parent='LogisticRegression_4e776cc5a93f', name='featuresCol', doc='features column name'): 'features',
           Param(parent='LogisticRegression_4e776cc5a93f', name='fitIntercept', doc='whether to fit an intercept term'): True,
           Param(parent='LogisticRegression_4e776cc5a93f', name='labelCol', doc='label column name'): 'label',
           Param(parent='LogisticRegression_4e776cc5a93f', name='maxIter', doc='maximum number of iterations (>= 0)'): 30,
           Param(parent='LogisticRegression_4e776cc5a93f', name='predictionCol', doc='prediction column name'): 'prediction',
           Param(parent='LogisticRegression_4e776cc5a93f', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'): 'myProbability',
           Param(parent='LogisticRegression_4e776cc5a93f', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 'rawPrediction',
           Param(parent='LogisticRegression_4e776cc5a93f', name='regParam', doc='regularization parameter (>= 0)'): 0.1,
           Param(parent='LogisticRegression_4e776cc5a93f', name='standardization', doc='whether to standardize the training features before fitting the model'): True,
           Param(parent='LogisticRegression_4e776cc5a93f', name='threshold', doc='threshold in binary classification prediction, in range [0, 1]'): 0.55,
           Param(parent='LogisticRegression_4e776cc5a93f', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)'): 1e-06}
  26. 26. Testing the trained model
      The model returned by training an Estimator is a Transformer, so its transform method can make predictions on test data or real data.

      In [5]:
          # Prepare test data
          test = spark.createDataFrame([
              (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
              (0.0, Vectors.dense([3.0, 2.0, -0.1])),
              (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

          # Make predictions on test data using the Transformer.transform() method.
          # LogisticRegression.transform will only use the 'features' column.
          # Note that model2.transform() outputs a "myProbability" column instead of the usual
          # 'probability' column since we renamed the lr.probabilityCol parameter previously.
          prediction = model2.transform(test)
          result = prediction.select("features", "label", "myProbability", "prediction").collect()

          for row in result:
              print("features=%s, label=%s -> prob=%s, prediction=%s"
                    % (row.features, row.label, row.myProbability, row.prediction))

      Output:
          features=[-1.0,1.5,1.3], label=1.0 -> prob=[0.057073041710340625,0.9429269582896593], prediction=1.0
          features=[3.0,2.0,-0.1], label=0.0 -> prob=[0.923852231170412,0.07614776882958792], prediction=0.0
          features=[0.0,2.2,-1.5], label=1.0 -> prob=[0.10972776114779774,0.8902722388522022], prediction=1.0
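To make the role of the threshold parameter (0.55 for model2) concrete, the sketch below maps a pair of class probabilities to a prediction. It is a plain-Python illustration that assumes a strict "greater than" comparison on the class-1 probability; the function name predict_binary is hypothetical, and the probabilities are rounded copies of the output above.

```python
def predict_binary(probability, threshold=0.5):
    """Return 1.0 when the class-1 probability exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0


# [P(class 0), P(class 1)] for the three test rows, rounded to four places.
probabilities = [
    [0.0571, 0.9429],
    [0.9239, 0.0761],
    [0.1097, 0.8903],
]
predictions = [predict_binary(p, threshold=0.55) for p in probabilities]
print(predictions)  # [1.0, 0.0, 1.0]
```

Raising the threshold makes the model more reluctant to predict class 1; with these three rows the predictions would be the same at 0.5 and at 0.55.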
  29. 29. 2.6  Example 2: Pipeline
      Configuring the Pipeline stages and training the Pipeline
      Initialize the Transformers and Estimators that make up the Pipeline, and define the processing order through the stages argument when creating the Pipeline. For the stages to run in sequence, the output column of each stage must be wired to the input column of the next. A Pipeline is an Estimator, so calling fit trains it and returns a PipelineModel.

      In [6]:
          from pyspark.ml import Pipeline
          from pyspark.ml.classification import LogisticRegression
          from pyspark.ml.feature import HashingTF, Tokenizer

          # Prepare training documents from a list of (id, text, label) tuples.
          training = spark.createDataFrame([
              (0, "a b c d e spark", 1.0),
              (1, "b d", 0.0),
              (2, "spark f g h", 1.0),
              (3, "hadoop mapreduce", 0.0)
          ], ["id", "text", "label"])

          # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
          tokenizer = Tokenizer(inputCol="text", outputCol="words")
          hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
          # LogisticRegression takes featuresCol (it has no inputCol parameter).
          lr = LogisticRegression(featuresCol="features", maxIter=10, regParam=0.001)
          pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

          # Fit the pipeline to training documents.
          model = pipeline.fit(training)
          model

      Out[6]:
          PipelineModel_7958720c3c4e
  31. 31. Testing the trained model
      The returned PipelineModel is a Transformer, so its transform method can make predictions on test data or real data.

      In [7]:
          # Prepare test documents, which are unlabeled (id, text) tuples.
          test = spark.createDataFrame([
              (4, "spark i j k"),
              (5, "l m n"),
              (6, "spark hadoop spark"),
              (7, "apache hadoop")
          ], ["id", "text"])

          # Make predictions on test documents and print columns of interest.
          prediction = model.transform(test)
          selected = prediction.select("id", "text", "probability", "prediction")
          for row in selected.collect():
              rid, text, prob, prediction = row
              print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

      Output:
          (4, spark i j k) --> prob=[0.1596407738787475,0.8403592261212525], prediction=1.000000
          (5, l m n) --> prob=[0.8378325685476744,0.16216743145232562], prediction=0.000000
          (6, spark hadoop spark) --> prob=[0.06926633132976037,0.9307336686702395], prediction=1.000000
          (7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000
  32. 32. 3  Scenario: predicting infant survival
      1. Download the data
      2. Load the data
      3. Create the Transformers
      4. Create the Estimator
      5. Create the Pipeline
      6. Train the model
      7. Measure model performance
      8. Save the model
      9. Tune hyperparameters
  34. 34. 3.1  Downloading the data
      http://tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz

      In [8]:
          !wget http://tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz

      Output:
          --2019-12-23 15:33:43--  http://tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz
          Resolving tomdrabas.com... 162.241.253.147
          Connecting to tomdrabas.com|162.241.253.147|:80... connected.
          HTTP request sent, awaiting response... 200 OK
          Length: 364560 (356K) [application/x-gzip]
          Saving to: 'births_transformed.csv.gz.1'

               0K .......... .......... .......... .......... ..........  14%  132K 2s
              50K .......... .......... .......... .......... ..........  28%  266K 1s
             100K .......... .......... .......... .......... ..........  42% 11.4M 1s
             150K .......... .......... .......... .......... ..........  56% 11.4M 0s
             200K .......... .......... .......... .......... ..........  70%  273K 0s
             250K .......... .......... .......... .......... ..........  84% 11.3M 0s
             300K .......... .......... .......... .......... ..........  98% 11.3M 0s
             350K ......                                                 100% 10.1M=0.8s

          2019-12-23 15:33:45 (463 KB/s) - 'births_transformed.csv.gz.1' saved [364560/364560]
  36. 36. 3.2  Loading the data
      Load the downloaded CSV file into a Spark DataFrame.

      In [9]:
          from pyspark.sql import types

          labels = [
              ('INFANT_ALIVE_AT_REPORT', types.IntegerType()),
              ('BIRTH_PLACE', types.StringType()),
              ('MOTHER_AGE_YEARS', types.IntegerType()),
              ('FATHER_COMBINED_AGE', types.IntegerType()),
              ('CIG_BEFORE', types.IntegerType()),
              ('CIG_1_TRI', types.IntegerType()),
              ('CIG_2_TRI', types.IntegerType()),
              ('CIG_3_TRI', types.IntegerType()),
              ('MOTHER_HEIGHT_IN', types.IntegerType()),
              ('MOTHER_PRE_WEIGHT', types.IntegerType()),
              ('MOTHER_DELIVERY_WEIGHT', types.IntegerType()),
              ('MOTHER_WEIGHT_GAIN', types.IntegerType()),
              ('DIABETES_PRE', types.IntegerType()),
              ('DIABETES_GEST', types.IntegerType()),
              ('HYP_TENS_PRE', types.IntegerType()),
              ('HYP_TENS_GEST', types.IntegerType()),
              ('PREV_BIRTH_PRETERM', types.IntegerType())
          ]

          schema = types.StructType([
              types.StructField(e[0], e[1], False) for e in labels  # third argument: `nullable`
          ])

          births = spark.read.csv('births_transformed.csv.gz', header=True, schema=schema)

          print(f'Schema: \n{births}\n')
          print(f'Data sample: \n{births.take(2)}')

      Output:
          Schema:
          DataFrame[INFANT_ALIVE_AT_REPORT: int, BIRTH_PLACE: string, MOTHER_AGE_YEARS: int, FATHER_COMBINED_AGE: int, CIG_BEFORE: int, CIG_1_TRI: int, CIG_2_TRI: int, CIG_3_TRI: int, MOTHER_HEIGHT_IN: int, MOTHER_PRE_WEIGHT: int, MOTHER_DELIVERY_WEIGHT: int, MOTHER_WEIGHT_GAIN: int, DIABETES_PRE: int, DIABETES_GEST: int, HYP_TENS_PRE: int, HYP_TENS_GEST: int, PREV_BIRTH_PRETERM: int]

          Data sample:
          [Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0),
           Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=22, FATHER_COMBINED_AGE=29, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=65, MOTHER_PRE_WEIGHT=180, MOTHER_DELIVERY_WEIGHT=198, MOTHER_WEIGHT_GAIN=18, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0)]
  38. 38. 3.3  Creating the Transformers
      3.3.1  Creating the OneHotEncoder
      Create a Transformer that one-hot encodes the birth place. Note that the pyspark.ml.feature.OneHotEncoder used here has been deprecated since Spark 2.3 (replacement class: pyspark.ml.feature.OneHotEncoderEstimator). In Spark 3.0 and later, OneHotEncoderEstimator was renamed to OneHotEncoder.
      (Deprecated) https://spark.apache.org/docs/latest/ml-features#onehotencoder-deprecated-since-230
      https://spark.apache.org/docs/latest/ml-features#onehotencoderestimator
      OneHotEncoder only works on numeric columns, so the birth place, which is currently a StringType, has to be converted (here to IntegerType) before the encoder can be used.
  40. 40. Converting the column type and checking the OneHotEncoder parameters

      In [10]:
          from pyspark.ml import feature

          births = births.withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE'].cast(types.IntegerType()))

          encoder = feature.OneHotEncoder()
          print(encoder.explainParams())

      Output:
          dropLast: whether to drop the last category (default: True)
          inputCol: input column name. (undefined)
          outputCol: output column name. (default: OneHotEncoder_dee08023bdfc__output)
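The dropLast parameter listed above determines the size of the encoded vector. The plain-Python sketch below illustrates the idea with a hypothetical one_hot helper that returns a sparse (size, {position: value}) pair; the figure of 10 categories is an assumption made for the example.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return (size, {position: 1.0}) as a sparse one-hot encoding."""
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    # With drop_last, the last category is encoded as the all-zeros vector.
    active = {index: 1.0} if index < size else {}
    return size, active


print(one_hot(1, 10))                    # (9, {1: 1.0})
print(one_hot(9, 10))                    # (9, {})
print(one_hot(9, 10, drop_last=False))   # (10, {9: 1.0})
```

Dropping the last category keeps the encoded columns from summing to one for every row, which avoids a redundant column in linear models.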
  41. 41. Passing parameters, method 1
      Pass the parameters when the class instance is created.

          feature.OneHotEncoder(inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC')
  43. 43. Passing parameters, method 2-1
      Create the class instance, then pass parameters one at a time through setter methods.

          encoder = feature.OneHotEncoder()
          encoder.setInputCol('BIRTH_PLACE_INT')
          encoder.setOutputCol('BIRTH_PLACE_VEC')

      Passing parameters, method 2-2
      Create the class instance, then pass all parameters at once through setParams.

          encoder = feature.OneHotEncoder()
          encoder_param_map = dict(
              inputCol='BIRTH_PLACE_INT',
              outputCol='BIRTH_PLACE_VEC'
          )
          encoder.setParams(**encoder_param_map)
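  44. 44. Passing parameters, method 3
      Not available when running through a Pipeline; limited to calling transform or fit directly.
      Build a ParamMap keyed by the Param attributes of the class instance.

          encoder = feature.OneHotEncoder()
          # The keys are Param objects, so a dict literal is required;
          # dict(encoder.inputCol=...) is not valid Python.
          encoder_param_map = {
              encoder.inputCol: 'BIRTH_PLACE_INT',
              encoder.outputCol: 'BIRTH_PLACE_VEC'
          }
          encoder.transform(df, encoder_param_map)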
  46. 46. Passing the parameters
      Pass the parameters through the setParams setter.

      In [11]:
          encoder_param_map = dict(
              inputCol='BIRTH_PLACE_INT',
              outputCol='BIRTH_PLACE_VEC'
          )
          encoder.setParams(**encoder_param_map)

      Out[11]:
          OneHotEncoder_dee08023bdfc
  49. 49. 3.3.2  Creating the VectorAssembler
      VectorAssembler is a Transformer that takes a set of feature columns and produces a single vector. Here it builds the vector on which the logistic regression predicting INFANT_ALIVE_AT_REPORT (infant survival) is trained.
      https://spark.apache.org/docs/latest/ml-features#vectorassembler
      Checking the VectorAssembler parameters

      In [12]:
          features_creator = feature.VectorAssembler()
          print(features_creator.explainParams())

      Output:
          handleInvalid: How to handle invalid data (NULL and NaN values). Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the output). Column lengths are taken from the size of ML Attribute Group, which can be set using `VectorSizeHint` in a pipeline before `VectorAssembler`. Column lengths can also be inferred from first rows of the data since it is safe to do so but only in case of 'error' or 'skip'). (default: error)
          inputCols: input column names. (undefined)
          outputCol: output column name. (default: VectorAssembler_56a86e780ed3__output)
  51. 51. Passing the parameters

      In [13]:
          features_creator_param_map = dict(
              inputCols=[col[0] for col in labels[2:]] + [encoder_param_map['outputCol']],
              outputCol='features'
          )
          features_creator.setParams(**features_creator_param_map)

      Out[13]:
          VectorAssembler_56a86e780ed3
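Conceptually, the assembler concatenates the listed columns, in order, into one flat feature vector: scalar columns contribute one entry each and vector columns contribute all of their entries. A plain-Python sketch of that idea follows; the assemble function is hypothetical, and the three-element BIRTH_PLACE_VEC is a shortened, made-up value used only to keep the example small.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)   # vector column
        else:
            features.append(float(value))              # scalar column
    return features


row = {
    "MOTHER_AGE_YEARS": 13,
    "FATHER_COMBINED_AGE": 99,
    "BIRTH_PLACE_VEC": [0.0, 1.0, 0.0],   # shortened, hypothetical vector
}
features = assemble(row, ["MOTHER_AGE_YEARS", "FATHER_COMBINED_AGE", "BIRTH_PLACE_VEC"])
print(features)  # [13.0, 99.0, 0.0, 1.0, 0.0]
```

The order of inputCols therefore fixes the position of every feature in the output vector, which matters when interpreting model coefficients.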
  54. 54. 3.4  Creating the Estimator
      Create the LogisticRegression Estimator that predicts infant survival from the vector produced by the Transformers.
      Checking the LogisticRegression parameters

      In [14]:
          from pyspark.ml import classification

          logistic = classification.LogisticRegression()
          print(logistic.explainParams())

      Output:
          aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
          elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
          family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
          featuresCol: features column name. (default: features)
          fitIntercept: whether to fit an intercept term. (default: True)
          labelCol: label column name. (default: label)
          lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
          lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. (undefined)
          maxIter: max number of iterations (>= 0). (default: 100)
          predictionCol: prediction column name. (default: prediction)
          probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
          rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
          regParam: regularization parameter (>= 0). (default: 0.0)
          standardization: whether to standardize the training features before fitting the model. (default: True)
          threshold: Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match. e.g. if threshold is p, then thresholds must be equal to [1-p, p]. (default: 0.5)
          thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. (undefined)
          tol: the convergence tolerance for iterative algorithms (>= 0). (default: 1e-06)
          upperBoundsOnCoefficients: The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
          upperBoundsOnIntercepts: The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. (undefined)
          weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)
  56. 56. Passing the parameters

      In [15]:
          logistic_param_map = dict(
              maxIter=10,
              regParam=0.01,
              labelCol='INFANT_ALIVE_AT_REPORT'
          )
          logistic.setParams(**logistic_param_map)

      Out[15]:
          LogisticRegression_bd8adb7506ea
  58. 58. 3.5  Defining the Pipeline
      Define a Pipeline that runs the Transformers and the Estimator defined above stage by stage.

      In [16]:
          from pyspark.ml import Pipeline

          pipeline = Pipeline(stages=[encoder, features_creator, logistic])
  60. 60. 3.6  Training the model
      - Split the data into training and test sets
      - Train the model
      - Predict on the test data

      In [17]:
          births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)
          model = pipeline.fit(births_train)
          predicted = model.transform(births_test)

          print(f'{type(model)}, {type(predicted)}')

      Output:
          <class 'pyspark.ml.pipeline.PipelineModel'>, <class 'pyspark.sql.dataframe.DataFrame'>
  63. 63. Checking the Pipeline output
      The results of the Transformers and the Estimator defined in the Pipeline are added to the DataFrame as new columns.
      In [18]: executed in 632ms, finished 15:34:03 2019-12-23
      births_test.take(1)
      Out[18]: [Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1)]
      In [19]: executed in 347ms, finished 15:34:03 2019-12-23
      predicted.take(1)
      Out[19]: [Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0573, -1.0573]), probability=DenseVector([0.7422, 0.2578]), prediction=0.0)]
  65. 65. 3.7  Measuring model performance
      The pyspark.ml.evaluation package provides several classes for measuring model performance.
      In [20]: executed in 10ms, finished 15:34:03 2019-12-23
      from pyspark.ml import evaluation
      evaluator = evaluation.BinaryClassificationEvaluator()
      print(evaluator.explainParams())
      labelCol: label column name. (default: label)
      metricName: metric name in evaluation (areaUnderROC|areaUnderPR) (default: areaUnderROC)
      rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
  67. 67. In [21]: executed in 1.08s, finished 15:34:10 2019-12-23
      evaluator_param_map = {
          evaluator.rawPredictionCol: 'probability',
          evaluator.labelCol: 'INFANT_ALIVE_AT_REPORT',
          evaluator.metricName: 'areaUnderROC'
      }
      evaluator.evaluate(predicted, evaluator_param_map)
      Out[21]: 0.7401301847095617
      In [22]: executed in 616ms, finished 15:34:11 2019-12-23
      evaluator_param_map.update({
          evaluator.metricName: 'areaUnderPR'
      })
      evaluator.evaluate(predicted, evaluator_param_map)
      Out[22]: 0.7139354342365674
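areaUnderROC can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it that way in plain Python. It uses the pairwise-ranking formulation rather than whatever curve-based computation Spark performs internally, and the function name and toy data are made up for illustration.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(toy_labels, toy_scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives some context for the scores of about 0.74 reported above.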
  68. 68. 3.8  Saving the model
      Both a Pipeline and a PipelineModel can be saved. Saving a Pipeline stores the process used to produce a PipelineModel, so it can be loaded and fitted again. Saving a PipelineModel stores the trained model, so it can be loaded and used for prediction.
  71. 71. Saving the Pipeline
      In [23]: executed in 791ms, finished 15:34:12 2019-12-23
      pipeline_path = './infant_onehotencoder_logistic_pipeline'
      pipeline.write().overwrite().save(pipeline_path)
      In [24]: executed in 51ms, finished 15:34:12 2019-12-23
      !ls -lR ./infant_onehotencoder_logistic_pipeline
      ./infant_onehotencoder_logistic_pipeline:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 stages
      ./infant_onehotencoder_logistic_pipeline/metadata:
      total 1
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 262 Dec 23 15:34 part-00000
      ./infant_onehotencoder_logistic_pipeline/stages:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 0_OneHotEncoder_dee08023bdfc
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 1_VectorAssembler_56a86e780ed3
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 2_LogisticRegression_bd8adb7506ea
      ./infant_onehotencoder_logistic_pipeline/stages/0_OneHotEncoder_dee08023bdfc:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      ./infant_onehotencoder_logistic_pipeline/stages/0_OneHotEncoder_dee08023bdfc/metadata:
      total 1
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 295 Dec 23 15:34 part-00000
      ./infant_onehotencoder_logistic_pipeline/stages/1_VectorAssembler_56a86e780ed3:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      ./infant_onehotencoder_logistic_pipeline/stages/1_VectorAssembler_56a86e780ed3/metadata:
      total 1
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 563 Dec 23 15:34 part-00000
      ./infant_onehotencoder_logistic_pipeline/stages/2_LogisticRegression_bd8adb7506ea:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      ./infant_onehotencoder_logistic_pipeline/stages/2_LogisticRegression_bd8adb7506ea/metadata:
      total 1
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 552 Dec 23 15:34 part-00000
  74. 74. Saving the PipelineModel
      In [25]: executed in 1.34s, finished 15:34:13 2019-12-23
      model_path = './infant_onehotencoder_logistic_pipelinemodel'
      model.write().overwrite().save(model_path)
      In [26]: executed in 39ms, finished 15:34:13 2019-12-23
      !ls -lR ./infant_onehotencoder_logistic_pipelinemodel
      ./infant_onehotencoder_logistic_pipelinemodel:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 stages
      ./infant_onehotencoder_logistic_pipelinemodel/metadata:
      total 1
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 272 Dec 23 15:34 part-00000
      ./infant_onehotencoder_logistic_pipelinemodel/stages:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 0_OneHotEncoder_dee08023bdfc
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 1_VectorAssembler_56a86e780ed3
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 2_LogisticRegression_bd8adb7506ea
      ./infant_onehotencoder_logistic_pipelinemodel/stages/0_OneHotEncoder_dee08023bdfc:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      ./infant_onehotencoder_logistic_pipelinemodel/stages/0_OneHotEncoder_dee08023bdfc/metadata:
      total 1
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 295 Dec 23 15:34 part-00000
      ./infant_onehotencoder_logistic_pipelinemodel/stages/1_VectorAssembler_56a86e780ed3:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      ./infant_onehotencoder_logistic_pipelinemodel/stages/1_VectorAssembler_56a86e780ed3/metadata:
      total 1
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 563 Dec 23 15:34 part-00000
      ./infant_onehotencoder_logistic_pipelinemodel/stages/2_LogisticRegression_bd8adb7506ea:
      total 0
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 data
      drwxrwx---+ 1 lunal lunal 0 Dec 23 15:34 metadata
      ./infant_onehotencoder_logistic_pipelinemodel/stages/2_LogisticRegression_bd8adb7506ea/data:
      total 8
      -rw-r--r-- 1 lunal lunal 0 Dec 23 15:34 _SUCCESS
      -rw-r--r-- 1 lunal lunal 4307 Dec 23 15:34 part-00000-a81692dd-4119-41b8-86ad-b297c922197e-c000.snappy.parquet
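  75. 75. Example: loading and using a Pipeline and a PipelineModel
      from pyspark.ml import Pipeline
      loaded_pipeline = Pipeline.load(pipeline_path)
      # A loaded Pipeline is unfitted, so it must be fitted before it can transform data.
      loaded_pipeline.fit(births_train).transform(births_test).take(1)

      from pyspark.ml import PipelineModel
      loaded_pipeline_model = PipelineModel.load(model_path)
      # A loaded PipelineModel is already trained and can predict directly.
      predicted = loaded_pipeline_model.transform(births_test)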
  76. 76. 3.9  Hyperparameter tuning
      Use grid search and cross-validation to find the best hyperparameters. The search is run by passing the Estimator and the Evaluator to pyspark.ml.tuning.CrossValidator, together with the hyperparameter grid built with pyspark.ml.tuning.ParamGridBuilder. In the Pipeline built earlier, the data had to pass through the Transformer stages before reaching the Estimator stage, so here the data is first preprocessed with a Pipeline made up only of Transformers, and that result is what gets passed to the CrossValidator.
  78. 78. Defining cross-validation
      In [31]: executed in 10ms, finished 15:35:32 2019-12-23
      from pyspark.ml import tuning
      logistic = classification.LogisticRegression(
          labelCol='INFANT_ALIVE_AT_REPORT'
      )
      evaluator = evaluation.BinaryClassificationEvaluator(
          rawPredictionCol='probability',
          labelCol='INFANT_ALIVE_AT_REPORT'
      )
      # Parentheses allow the builder calls to be chained across lines.
      grid = (
          tuning.ParamGridBuilder()
          .addGrid(logistic.maxIter, [2, 10, 50])
          .addGrid(logistic.regParam, [0.01, 0.05, 0.3])
          .build()
      )
      cv = tuning.CrossValidator(
          estimator=logistic,
          evaluator=evaluator,
          estimatorParamMaps=grid
      )
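The grid is the Cartesian product of the listed values, so three maxIter values and three regParam values give nine candidate settings. A plain-Python sketch of that expansion with itertools.product follows. It mirrors the idea only and does not call ParamGridBuilder; grid_spec and param_maps are illustrative names.

```python
from itertools import product

# Same candidate values as in the cell above, keyed by parameter name.
grid_spec = {
    "maxIter": [2, 10, 50],
    "regParam": [0.01, 0.05, 0.3],
}

names = list(grid_spec)
param_maps = [
    dict(zip(names, combo))               # one dict per candidate setting
    for combo in product(*grid_spec.values())
]

for param_map in param_maps:
    print(param_map)
print(len(param_maps))
```

Each added parameter multiplies the number of settings, and every setting is trained once per fold, so the grid size drives the total fitting time.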
  81. 81. Transformer
      In [32]: executed in 3ms, finished 15:35:33 2019-12-23
      pipeline = Pipeline(stages=[encoder, features_creator])
      transformer = pipeline.fit(births_train)
      In [33]: executed in 26.5s, finished 15:36:00 2019-12-23
      cv_model = cv.fit(transformer.transform(births_train))
      cv_model
      Out[33]: CrossValidatorModel_ea1cb3e53e4d
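CrossValidator evaluates each parameter setting on several train/validation folds and averages the metric over the folds. The sketch below shows only the fold bookkeeping in plain Python, assuming three folds (the CrossValidator default when numFolds is not set) and a simple round-robin assignment. Spark assigns rows to folds randomly, so kfold_indices is a hypothetical illustration rather than its algorithm.

```python
def kfold_indices(n_rows, n_folds):
    """Return (training, validation) index lists, one pair per fold."""
    folds = []
    for k in range(n_folds):
        validation = [i for i in range(n_rows) if i % n_folds == k]
        training = [i for i in range(n_rows) if i % n_folds != k]
        folds.append((training, validation))
    return folds

folds = kfold_indices(9, 3)
for training, validation in folds:
    print("train:", training, "validate:", validation)
```

Every row is used for validation exactly once and for training in the remaining folds.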
  85. 85. Predicting on the test data and evaluating performance with the cross-validation result
      Predict on the test data with the best hyperparameters found by cross-validation, then check the performance.
      In [34]: executed in 303ms, finished 15:36:04 2019-12-23
      predicted = cv_model.transform(transformer.transform(births_test))
      In [35]: executed in 458ms, finished 15:36:05 2019-12-23
      evaluator_param_map = {
          evaluator.rawPredictionCol: 'probability',
          evaluator.labelCol: 'INFANT_ALIVE_AT_REPORT',
          evaluator.metricName: 'areaUnderROC'
      }
      evaluator.evaluate(predicted, evaluator_param_map)
      Out[35]: 0.7404959803309813
      In [36]: executed in 420ms, finished 15:36:06 2019-12-23
      evaluator_param_map.update({
          evaluator.metricName: 'areaUnderPR'
      })
      evaluator.evaluate(predicted, evaluator_param_map)
      Out[36]: 0.7157971108486731
  87. 87. Checking the grid search results
      The result for each parameter combination passed to cross-validation through the grid can be inspected.
      In [37]: executed in 4ms, finished 15:36:07 2019-12-23
      for params, metric in zip(cv_model.getEstimatorParamMaps(), cv_model.avgMetrics):
          # Map each Param object to its name so the output is readable.
          params_map = {key.name: value for key, value in params.items()}
          print(f'params={params_map}, metric={metric}')
      params={'maxIter': 2, 'regParam': 0.01}, metric=0.6967531825356931
      params={'maxIter': 2, 'regParam': 0.05}, metric=0.6968703672986438
      params={'maxIter': 2, 'regParam': 0.3}, metric=0.6975771498455402
      params={'maxIter': 10, 'regParam': 0.01}, metric=0.7378261331858156
      params={'maxIter': 10, 'regParam': 0.05}, metric=0.7327721800928406
      params={'maxIter': 10, 'regParam': 0.3}, metric=0.7224361941836431
      params={'maxIter': 50, 'regParam': 0.01}, metric=0.738711211455581
      params={'maxIter': 50, 'regParam': 0.05}, metric=0.733370894256066
      params={'maxIter': 50, 'regParam': 0.3}, metric=0.7195023591773146
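Since a larger areaUnderROC is better, the best setting is the one with the highest average metric. Using the values printed above, a quick plain-Python check follows. The results list is copied from that output, and the variable names are illustrative.

```python
# (params, average metric) pairs copied from the printed grid search output.
results = [
    ({"maxIter": 2, "regParam": 0.01}, 0.6967531825356931),
    ({"maxIter": 2, "regParam": 0.05}, 0.6968703672986438),
    ({"maxIter": 2, "regParam": 0.3}, 0.6975771498455402),
    ({"maxIter": 10, "regParam": 0.01}, 0.7378261331858156),
    ({"maxIter": 10, "regParam": 0.05}, 0.7327721800928406),
    ({"maxIter": 10, "regParam": 0.3}, 0.7224361941836431),
    ({"maxIter": 50, "regParam": 0.01}, 0.738711211455581),
    ({"maxIter": 50, "regParam": 0.05}, 0.733370894256066),
    ({"maxIter": 50, "regParam": 0.3}, 0.7195023591773146),
]

# Pick the pair with the largest metric.
best_params, best_metric = max(results, key=lambda item: item[1])
print(best_params, best_metric)
```

In this run the gap between maxIter=10 and maxIter=50 at regParam=0.01 is under 0.001, so the cheaper setting performs almost as well.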
