Getting Started with Data Analysis the Easy Way Using Python Libraries
Humanities majors who know nothing about it are welcome too

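I made these slides hoping that anyone who is on the fence about trying data analysis will stop hesitating and give it a go. This is a data analysis that a humanities student with no prior knowledge attempted after taking two courses and reading one book. I used Python's scikit-learn library. It is a really well-made library, so it is easy for beginners to pick up.
Course 1: Intro to CS and Programming Using Python - link
Book 1: 깐깐하게 배우는 파이썬 - link
Course 2: Machine Learning (highly recommended!) - link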

Bank Credit Scoring Algorithm
윤희경
Building a bank credit scoring algorithm

I used data from Kaggle, the home (?) of data science!

Specifically, the data from a 2011 competition whose name amounts to "Lend me some money!" (Give Me Some Credit).

1 Data
• Content: credit history of customers
• Size: 150,000 records * (1 id + 10 features + 1 target)
Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage
age | Age of borrower in years | integer
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer
DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards) | integer
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer
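This is past credit-history data for 150,000 bank customers.
The dataset is meant for predicting whether a customer becomes delinquent, using 10 features that cover personal details and past credit history.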

Features
[Charts: Age Distribution; Debt Ratio]
Let's take a closer look at a few of the features, one at a time.

Features
[Charts: Monthly Income; Number of Loans]
Let's take a closer look at a few of the features, one at a time.

Target Value: Default or Not?
This is the delinquency flag. It is an imbalanced dataset in which customers who did not default are the overwhelming majority.

1 Goal
Business goal: Maximize profit by filtering out customers with a high possibility of default
Analysis goal: Build a credit scoring model with maximum F-score, not accuracy
The goal is to maximize profit by building a scoring model that maximizes the F-score rather than accuracy. So why the F-score and not accuracy?

Why F-Score?
• The data is skewed, with a tiny percentage of default customers (SeriousDlqin2yrs = 1).
• Over 93% accuracy can be achieved without filtering out a single default customer.
Predicting that nobody defaulted already gives an accuracy above 93%, so accuracy is not a suitable metric. What matters is picking out the roughly 7% of customers who default!
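To make the argument concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (1,000 hypothetical customers with exactly 7% defaulters), not taken from the competition data, and the helper names `accuracy` and `f_score` are my own.

```python
# Illustrative labels only: 1,000 hypothetical customers, 7% of whom default.
y_true = [1] * 70 + [0] * 930

# A "model" that predicts "no default" for every customer.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_score(y_true, y_pred):
    """F1 score for the positive class (1 = default)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no defaulter was caught, so the score is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(accuracy(y_true, y_pred))  # 0.93
print(f_score(y_true, y_pred))   # 0.0
```

The do-nothing model looks excellent by accuracy and useless by F-score, which is why the F-score is the better target here.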

Process
Data → First Model → Evaluate → Modified Model (Threshold, Polynomial Degree, C (Regularization Term))
First, I build an initial model with the scikit-learn package's default settings, then improve it by adjusting several parameters a little at a time.

1 Process Data
First, process the data!

2.1 Process Data
Original Data: 150,000 records (100%)
Random Sampling by Shuffling Data
• Training (60%)
• Cross Validation (20%)
• Test (20%)
The data is split into a 60% training set, a 20% cross-validation set (for parameter tuning), and a 20% test set.
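The 60/20/20 split by shuffling could look roughly like this. It is a sketch using NumPy on row indices only. The seed and variable names are assumptions, and the slides do not show which function was actually used.

```python
import numpy as np

n_records = 150_000
rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Shuffle the row indices, then cut them at the 60% and 80% marks.
shuffled = rng.permutation(n_records)
n_train = n_records * 60 // 100
n_cv = n_records * 20 // 100

train_idx = shuffled[:n_train]
cv_idx = shuffled[n_train:n_train + n_cv]
test_idx = shuffled[n_train + n_cv:]

print(len(train_idx), len(cv_idx), len(test_idx))  # 90000 30000 30000
```

The index arrays can then be used to select rows from the feature matrix and the target vector, so that the three sets never overlap.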

2 Build First Model
Let's build the first model.

First Model: Logistic Regression
Model | Threshold | Polynomial | C | F-score | Accuracy
First Model | 0.5 | 1 | 1 | 0.0786 | 0.9336
Modified Model | | | | |
Change (%) | | | | |
(Tested on Test Data)
With the threshold and C both left at their default values as shown above, the F-score is 0.0786.
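A default-settings baseline of this kind can be sketched as follows. The data is a synthetic stand-in generated with `make_classification` (10 features, roughly 7% positives), so the printed numbers will not match the slide. `max_iter` is raised only to avoid convergence warnings. C and the 0.5 cutoff are left at scikit-learn's defaults.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 10 features, roughly 7% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.93, 0.07], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

# C keeps its default of 1.0; predict() labels a row 1 when P(y=1) exceeds 0.5.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f_score = f1_score(y_test, y_pred, zero_division=0)
print(f"accuracy={accuracy:.4f}  F-score={f_score:.4f}")
```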

3 Evaluate Model
Shall we evaluate this first model?

2.3 Evaluate Model: Learning Curve
• High bias
• Increasing the # of training data doesn't improve cv accuracy.
• There is no big gap between train accuracy and cv accuracy.
• Increasing the # of training data won't be much help.
→ We need to develop a more complex model.
As the training set grows, accuracy barely improves, so the model seems too simple. Let's build a more complex model!
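One way to produce the numbers behind such a learning curve is scikit-learn's `learning_curve` helper. The sketch below runs on synthetic data. The sample size, the five training-set sizes, and the 5-fold setting are assumptions, not the author's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the credit data.
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.93, 0.07], random_state=0,
)

# Refit the model on growing subsets and score each fit on held-out folds.
train_sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="accuracy",
)

mean_train = train_scores.mean(axis=1)
mean_cv = cv_scores.mean(axis=1)
for n, tr, cv in zip(train_sizes, mean_train, mean_cv):
    print(f"{int(n):5d} training rows: train accuracy={tr:.3f}, cv accuracy={cv:.3f}")
```

If the two accuracy columns sit close together and stop improving as rows are added, that matches the high-bias pattern described on the slide.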

4 Modify Model
1. Threshold
2. Polynomial Degree
3. C (Regularization Term)
Let's improve the model by tuning these three parameters.

1. Threshold
Let's find the optimal threshold, that is, the predicted-probability cutoff that separates a "default" prediction from a "no default" prediction.

1. Threshold
• Threshold that maximizes F-score: 0.125
• Accuracy is sacrificed for better F-score.
The F-score is measured while varying the threshold between 0 and 0.5. It peaks when the cutoff is set to 0.125.
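A threshold sweep like the one described can be sketched with `predict_proba`. This version assumes the positive class (1) means default and that a customer is flagged when the estimated probability is at or above the cutoff. The data is synthetic and the grid of 20 cutoffs is my choice, so the best value found here will differ from 0.125.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.93, 0.07], random_state=0,
)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_default = model.predict_proba(X_cv)[:, 1]  # estimated P(y = 1) per customer

# Score every cutoff on the cross-validation split and keep the best one.
thresholds = np.linspace(0.025, 0.5, 20)  # 0.025, 0.05, ..., 0.5
scores = [
    f1_score(y_cv, (p_default >= t).astype(int), zero_division=0)
    for t in thresholds
]
best = int(np.argmax(scores))
print(f"best threshold={thresholds[best]:.3f}  F-score={scores[best]:.4f}")
```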

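2. Add Polynomial Features
A logistic regression model can also capture nonlinear relationships, by raising the degree of the features. The figure shows the case where the 10 features are raised to degree 2. This yields 1 (degree 0) + 10 (degree 1) + 10*9/2 (products of two different features) + 10 (squares) = 66 features.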
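The count of 66 can be checked with scikit-learn's `PolynomialFeatures`. This is a small sketch on a dummy row of zeros. Only the shape matters here, not the values.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 10))  # one dummy row with 10 original features

# The default include_bias=True adds the constant (degree 0) column.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# 1 constant + 10 linear + 45 pairwise products + 10 squares
print(X_poly.shape[1])  # 66
```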

2. Add Polynomial Features
• Polynomial degree that maximizes F-score: 2
• 10 features (original) → 66 features (poly 2)
Because training takes a long time, only degrees 1, 2, and 3 were tested. The F-score peaks at degree 2.

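3. C (regularization term)
• C that maximizes F-score: 3
This is the regularization parameter. In scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so a smaller C means stronger regularization and a simpler model with a smoother decision curve, while a larger C lets the model become more complex. The F-score peaks at C = 3.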
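Tuning C can be sketched as a simple loop over candidate values, scoring each fit on the cross-validation split. The candidate grid and the synthetic data are assumptions, so the winning value here need not be 3.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.93, 0.07], random_state=0,
)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

# Smaller C = stronger regularization; larger C = weaker regularization.
c_values = [0.01, 0.1, 1, 3, 10, 100]  # illustrative grid
scores = {}
for c in c_values:
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    scores[c] = f1_score(y_cv, model.predict(X_cv), zero_division=0)

best_c = max(scores, key=scores.get)
print(f"best C={best_c}  F-score={scores[best_c]:.4f}")
```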

5. Conclusion
Improvements made from the first model to the modified model
How much did the model improve?

Modified Model: Logistic Regression
Model | Threshold | Polynomial | C | F-score | Accuracy
First Model | 0.5 | 1 | 1 | 0.0786 | 0.9336
Modified Model | 0.125 | 2 | 3 | 0.4027 | 0.9108
Change (%) | | | | +412.34% | -2.44%
(Tested on Test Data)
The F-score improved by a whopping 412.34%, at the cost of about 2.44% in accuracy!
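The "Change (%)" row is a relative change against the first model. The arithmetic can be reproduced from the four scores in the table. The helper name `percent_change` is my own.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


f_score_change = percent_change(0.0786, 0.4027)
accuracy_change = percent_change(0.9336, 0.9108)

print(f"{f_score_change:+.2f}%")   # +412.34%
print(f"{accuracy_change:+.2f}%")  # -2.44%
```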

How many default customers were filtered out?
First Model: 4.23%
Modified Model: 40.28%
What this means is that the first model caught only 4.23% of the customers who defaulted, whereas the improved model catches as many as 40.28%.
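The share of actual defaulters that a model flags is the quantity usually called recall. A toy calculation, using made-up labels rather than the competition data, shows how such a percentage is obtained.

```python
# Toy labels, not the competition data: 1 = defaulted, 0 = did not default.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Defaulters the model flagged, divided by all defaulters.
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
defaulters = sum(y_true)
recall = caught / defaulters

print(recall)  # 0.4
```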

Modified Model - Coefficients
• 10 most positively correlated features (out of 66 polynomial features)
• How to interpret the table
  • When 1, multiply once.
  • When 2, multiply twice.
• For example,
  • Feature #5: DebtRatio*MonthlyIncome
  • Feature #4: NumberOfTime60-89DaysPastDueNotWorse^2
The larger these features were, the lower the probability of default.
