1. Doing Data Science
Ch 6. Timestamps and
Financial Modeling
아꿈사 study group, 2015-06-27
Minchul Jung (정민철)
(ccc612@gmail.com)
2. TV Tagging
• Goal: provide personalized TV show recommendations and schedules
• Format of the collected data: {user, action, item} + time
• A user reacting to a particular show in a particular way => classified as a "like!"
• Visualizing the "like!" data: a bipartite user-item graph
• Improving the graph: add edges between users (follows/friends) and between TV shows (similarity)
• Time: helps identify specific time windows, the spread of influence, and changes over time
3. One way to visually display this stored data is by drawing a bipartite graph as shown in Figure 6-1.
Figure 6-1. Bipartite graph with users and items (shows) as nodes
We'll go into graphs in later chapters, but for now you should know that a bipartite graph has two distinct sets of nodes (here, users and items), with edges running only between the two sets.
4. Timestamps
• Working with timestamped event data
- A common form of data in the big data era, and one of the drivers of the big data phenomenon
- Makes it possible to measure human behavior, with precise times, throughout the day
- Large volumes of data can be stored and processed quickly
• Extracting the data
…and which stories were clicked on. This generates event logs. Each record is an event that took place between a user and the app or website. Here's an example of a raw data point from GetGlue:
{"userId": "rachelschutt", "numCheckins": "1",
"modelName": "movies", "title": "Collaborator",
"source": "http://getglue.com/stickers/tribeca_film/
collaborator_coming_soon", "numReplies": "0",
"app": "GetGlue", "lastCheckin": "true",
"timestamp": "2012-05-18T14:15:40Z",
"director": "martin donovan", "verb": "watching",
"key": "rachelschutt/2012-05-18T14:15:40Z",
"others": "97", "displayName": "Rachel Schutt",
"lastModified": "2012-05-18T14:15:43Z",
"objectKey": "movies/collaborator/martin_donovan",
"action": "watching"}
From this we extract four fields:
{"userid": "rachelschutt",
 "action": "watching",
 "title": "Collaborator",
 "timestamp": "2012-05-18T14:15:40Z"}
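The extraction above can be sketched in a few lines of Python. This is a minimal illustration, not GetGlue's actual pipeline; the shortened raw record here keeps only some of the fields shown in the sample above.

```python
import json

# A shortened version of the raw GetGlue event shown above
# (one JSON object per log line).
raw = '''{"userId": "rachelschutt", "numCheckins": "1",
"modelName": "movies", "title": "Collaborator",
"app": "GetGlue", "lastCheckin": "true",
"timestamp": "2012-05-18T14:15:40Z",
"action": "watching"}'''

def extract_fields(line):
    """Keep only the four fields we care about: user, action, title, timestamp."""
    event = json.loads(line)
    return {
        "userid": event["userId"],
        "action": event["action"],
        "title": event["title"],
        "timestamp": event["timestamp"],
    }

print(extract_fields(raw))
```

Running this over every line of the event log yields exactly the four-field records shown above.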
5. Timestamps - Exploratory Data Analysis
• Use EDA to build intuition about the data
• Understand the data's characteristics through the process of asking many questions and finding answers
• Choose analysis methods based on the characteristics you find
- Choices grounded in the situation at hand
- With sound justification given for each choice
• Things to choose: metrics, encodings, the time range, data categories, how to handle mixed actions, which behavior patterns to focus on, and so on
6. Figure 6-2. An example of a way to visually display user-level data
over time
Now try to construct a narrative from that plot. For example, we could
say that user 1 comes the same time each day, whereas user 2 started
out active in this time period but then came less and less frequently.
User 3 needs a longer time horizon for us to understand his or her
behavior, whereas user 4 looks “normal,” whatever that means.
Let’s pose questions from that narrative:
Say we have some raw data where each data point is an
event, but we want to have data stored in rows where each row consists
of a user followed by a bunch of timestamps corresponding to actions
that user performed. How would we get the data to that point? Note
that different users will have a different number of timestamps.
Make this reasoning explicit: how would we write the code to create a
plot like the one just shown? How would we go about tackling the data
munging exercise?
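One way to tackle the munging exercise: group the event stream by user, so each row becomes a user followed by a variable-length list of timestamps. A minimal sketch, with made-up users and timestamps:

```python
from collections import defaultdict

# Toy event log: (user, timestamp) pairs in arrival order.
events = [
    ("user1", "2012-05-18T14:15:40Z"),
    ("user2", "2012-05-18T15:02:11Z"),
    ("user1", "2012-05-19T14:16:02Z"),
    ("user3", "2012-05-20T09:30:00Z"),
    ("user1", "2012-05-20T14:14:55Z"),
]

# One row per user: the user followed by all of their timestamps.
# Different users naturally end up with different numbers of timestamps.
rows = defaultdict(list)
for user, ts in events:
    rows[user].append(ts)

for user in sorted(rows):
    print(user, sorted(rows[user]))
```

From rows like these, plotting each user's timestamps on a horizontal line gives exactly the kind of user-over-time display in Figure 6-2.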
Suppose a user can take multiple actions: "thumbs_up," "thumbs_down," "like," and "comment." How can we plot those events? How can we modify our metrics? How can we encode the user data with these different actions? Figure 6-3 provides an example for the first question where we color code actions thumbs up and thumbs down, denoted thumbs_up and thumbs_down.
7. Now we can think about how we might want to aggregate users. We might make the x-axis refer to time, and the y-axis refer to counts, as shown in Figure 6-4.
Figure 6-4. Aggregating user actions into counts
We're no longer working with 100 individual users, but we're still making choices, and those choices will impact our perception of the data.
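The aggregation step can be sketched by counting events per day (and per day-action pair), with toy data standing in for the real log:

```python
from collections import Counter

# Toy timestamped actions; we keep only the date part for daily counts.
events = [
    ("2012-05-18", "thumbs_up"),
    ("2012-05-18", "thumbs_down"),
    ("2012-05-18", "thumbs_up"),
    ("2012-05-19", "thumbs_up"),
]

daily_counts = Counter(date for date, _ in events)  # all actions per day
daily_by_action = Counter(events)                   # counts per (day, action) pair

print(daily_counts["2012-05-18"])  # 3
print(daily_by_action[("2012-05-18", "thumbs_up")])  # 2
```

The choice of what to count (all actions, or each action separately, or only certain users) is exactly one of the modeling choices discussed above.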
8. Timestamps - Wrapping Up
• Metrics and new variables or features
- The intuition gained from EDA helps in constructing metrics
- Feed the intuition from EDA into models and algorithms
• What comes next?
- Time series modeling, including autoregression
- Clustering via a definition of proximity
- Detecting behavior patterns
- Change-point detection: spotting when a big event occurred
- Training a recommendation system
• Time series modeling: predicting events that are extremely time-sensitive, or events that are predictable from what has already happened
9. Thought Experiment - What do we lose if we ignore timestamps when working with a large training dataset?
• Without a sense of time, we cannot establish cause and effect
- Absolute timestamps vs. relative time differences
- Seasonality and trend analysis
Figure 6-6. Without keeping track of timestamps, we can't see time-based patterns; here, we see a seasonal pattern in a time series
10. Financial Modeling
• Financial analysts: the original big data practitioners
- Obsessed with timestamps, but not much concerned with causes
• In-sample and out-of-sample data
- In-sample data: training data + validation data
- Out-of-sample data: data used only after the model is finalized
11. Causal Modeling
• Never use information from the future to predict something in the present (use only information from the past up to the present)
• Timestamp of reference vs. timestamp of availability (the point at which the data became usable)
• Sets of coefficients
- We don't know the best-fit coefficients until we reach the last timestamp
- Time series data doesn't yield a single set of best-fit coefficients
- The coefficients change as events unfold
- Update the model whenever new data arrives
- A model's coefficients are a living, constantly evolving organism
• Decisions about the future must be based on what we know now
12. Preparing Financial Data
• Data preparation: transforming the data so it reflects reality better
- Normalizing the data
- Taking logs of the data
- Creating categorical variables
- Converting data into binary variables around a threshold
• Running submodels of the model
- Accounting for new components
- Training submodels such as univariate regressions
• When we can't normalize by a mean computed over all the data => normalize by a moving average
- Causal interference can make a bad model look good
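Normalizing by a moving average computed only from past values keeps the model causal (no information from the future leaks into the present). A minimal sketch; the window length of 3 is an arbitrary choice for illustration:

```python
# Normalize each value by the mean of the previous `window` values only,
# never by a mean that includes the current or future values.
def causal_normalize(series, window=3):
    out = []
    for t, x in enumerate(series):
        past = series[max(0, t - window):t]   # values strictly before time t
        if past:
            out.append(x / (sum(past) / len(past)))
        else:
            out.append(1.0)                   # no history yet: leave neutral
    return out

print(causal_normalize([10, 12, 11, 13, 12]))
```

Using the full-sample mean instead would be exactly the kind of causal interference that makes a noise model look good.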
13. Log Returns
• In finance, returns are computed on a daily basis
• Percent returns:
- Not additive
- Biased in favor of gains
• Log returns:
- Additive
- Symmetric with respect to gains and losses
• For very small returns the two are nearly identical
Causal interference can make a bad model look good (or, what is more likely, make a model that is pure noise look good).
Log Returns
In finance, we consider returns on a daily basis. In other words, we care about how much the stock (or future, or index) changes from day to day. This might mean we measure movement from opening on Monday to opening on Tuesday, but the standard approach is to think about closing prices on subsequent trading days.
We typically don't consider percent returns, but rather log returns. If Ft denotes a close on day t, then the log return that day is defined as log(Ft/Ft−1), whereas the percent return would be computed as 100·(Ft/Ft−1 − 1). To simplify the discussion, we'll compare log returns to scaled percent returns, which is the same as percent returns except without the factor of 100. The reasoning is not changed by a difference in scalar.
There are a few different reasons we use log returns instead of percentage returns. For example, log returns are additive but scaled percent returns aren't. In other words, the five-day log return is the sum of the five one-day log returns. This is often computationally handy.
By the same token, log returns are symmetric with respect to gains and losses, whereas percent returns are biased in favor of gains. So, for example, if our stock goes down by 50%, or has a −0.5 scaled percent gain, and then goes up by 100%, so has a 1.0 scaled percent gain, we are back where we started. But working in the same scenarios with log returns, the two moves are log(1/2) = −log 2 and log(2) = +log 2, which sum to zero, exactly reflecting that the price is unchanged.
For very small returns the two measures nearly agree, because of the Taylor expansion of the logarithm around 1:
log x = Σn≥1 (−1)^(n+1) (x − 1)^n / n = (x − 1) − (x − 1)²/2 + ⋯
In other words, the first term of the Taylor expansion agrees with the scaled percent return. So as long as the second term is small compared to the first, which is usually true for daily returns, we get a pretty good approximation of percent returns using log returns.
Here's a picture of how closely these two functions behave, keeping in mind that when x = 1, there's no change in price whatsoever, as shown in Figure 6-8.
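A quick numerical check of both properties, additivity and the small-move agreement with scaled percent returns, on made-up closing prices:

```python
import math

# Made-up daily closing prices with moves of roughly 1%.
closes = [100.0, 101.0, 99.5, 100.2]

log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
pct_returns = [b / a - 1 for a, b in zip(closes, closes[1:])]  # scaled percent

# Additivity: the multi-day log return is the sum of the daily log returns.
total_log = math.log(closes[-1] / closes[0])
assert abs(total_log - sum(log_returns)) < 1e-12

# For small daily moves, log returns closely approximate scaled percent returns.
for lr, pr in zip(log_returns, pct_returns):
    print(round(lr, 5), round(pr, 5))
```

The printed pairs differ only in the fourth or fifth decimal place, exactly the (x−1)²/2 correction from the Taylor expansion.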
14. Example: The S&P Index
Let's work out a toy example. If you start with S&P closing levels as shown in Figure 6-9, then you get the log returns illustrated in Figure 6-10.
Figure 6-9. S&P closing levels shown over time
Figure 6-10. The log of the S&P returns shown over time
What's that mess? It's crazy volatility caused by the financial crisis. We sometimes (not always) want to account for that volatility by normalizing with respect to it (described earlier). Once we do that we get something like Figure 6-11, which is clearly better behaved.
Figure 6-11. The volatility-normalized log of the S&P closing returns shown over time
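The volatility normalization can be sketched with made-up closing levels, dividing each log return by the standard deviation of a short trailing window of past log returns (the window length here is an arbitrary choice):

```python
import math
import statistics

# Made-up closing levels standing in for the S&P series.
closes = [100, 102, 101, 105, 103, 104, 110, 108, 109, 111]
rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]

window = 3
normalized = []
for t in range(window, len(rets)):
    vol = statistics.stdev(rets[t - window:t])  # past returns only (causal)
    normalized.append(rets[t] / vol)

print([round(r, 2) for r in normalized])
```

Because the volatility estimate uses only past returns, this normalization does not peek into the future.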
15. Measuring Volatility
• Choosing a lookback window
- The length of past time from which we take information
- Longer => more information goes into the estimate
- Shorter => the estimate reacts faster to new information
- When a big event happens, how long does it take to be forgotten?
• How do we use the past data?
- Apply a rolling window: give equal weight to each of the previous n days
- Use a continuous lookback window: apply a half-life so older data decays
• Choose the downweighting exponent that minimizes the risk involved
17. Exponential Downweighting
• Decay
- Give more weight to recent data
- Call the downweighting of older data s and treat it like a parameter
We've already seen an example of exponential downweighting in the case of keeping a running estimate of the volatility of the log returns of the S&P.
The general formula for downweighting some additive estimate E is simple enough. We weight recent data more than older data, and we assign the downweighting of older data a name and treat it like a parameter. It is called the decay. In its simplest form:
Et = s·Et−1 + (1 − s)·et
where et is the new term.
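The running estimate Et = s·Et−1 + (1 − s)·et is a one-liner in code; the decay s and the starting value below are arbitrary choices for illustration:

```python
# Exponentially downweighted running estimate:
#   E_t = s * E_{t-1} + (1 - s) * e_t
# where s is the decay and e_t is the new term
# (e.g., today's squared log return, for a volatility estimate).
def ewma(terms, s, e0=0.0):
    est = e0
    history = []
    for e in terms:
        est = s * est + (1 - s) * e
        history.append(est)
    return history

print(ewma([1.0, 1.0, 1.0], s=0.5))  # → [0.5, 0.75, 0.875]
```

With a constant input the estimate converges toward that constant, and the decay s controls how fast old observations are forgotten.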
18. The Financial Modeling Feedback Loop
• Markets learn over time.
- Trading moves the market => it dampens the very signal you predicted
• A combination of many algorithms that predict things, and in doing so make those predictions disappear
- Existing signals have become very weak
- Because market participants have understood those signals and anticipated them in advance
• Standard evaluation metrics don't work => use the PnL (Profit & Loss) graph instead
19. For the PnL, we care about the daily change (difference, not ratio), or today's value minus yesterday's value.
Figure 6-13. A graph of the cumulative PnLs of two theoretical models
20. Adding Priors
• A prior: an opinion, mathematically formalized and incorporated into the model
• Example: giving downweighted importance to older data
- New data matters more than old data
• Priors reduce degrees of freedom
21. A Baby Model
• Compute the autocorrelation of the time series plot
• Penalty function: add a term to the function we are trying to minimize
- Measures how good the fit is
• Choose the exponential downweighting term r
y = Ft = α0 + α1 Ft−1 + α2 Ft−2,
which is just the example where we take the last two values of the time series F to predict the next one. We could use more than two values, of course. If we used lots of lagged values, then we could strengthen our prior in order to make up for the fact that we've introduced so many degrees of freedom. In effect, priors reduce degrees of freedom.
The way we'd place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to our covariance matrix when we perform linear regression.
A Baby Model
Say we drew a plot in a time series and found that we have strong but
fading autocorrelation up to the first 40 lags or so as shown in
Figure 6-14.
Figure 6-14. Looking at auto-correlation out to 100 lags
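The sample autocorrelation out to a given number of lags can be computed directly; here a slowly varying toy series stands in for the real one:

```python
import math

# Sample autocorrelation of a series out to max_lag lags.
def autocorr(x, max_lag):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    acf = []
    for lag in range(1, max_lag + 1):
        cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
        acf.append(cov / var)
    return acf

# A slowly varying toy series, so nearby values are strongly correlated.
series = [math.sin(t / 5) for t in range(100)]
acf = autocorr(series, 10)
print([round(a, 2) for a in acf])
```

A plot of these values against the lag is exactly the kind of fading-autocorrelation picture in Figure 6-14.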
22. • With no prior at all: the sum of squared errors
• Adding a standard prior: add a scalar multiple of the identity matrix
• Adding another prior: the coefficients vary smoothly
A good way to think about priors is by adding a term to the function we are seeking to minimize, which measures the extent to which we have a good fit. This is called the "penalty function," and when we have no prior at all, it's simply the sum of the squares of the error:
F(β) = Σi (yi − xiβ)² = (y − xβ)τ (y − xβ)
If we want to minimize F, which we do, then we take its derivative with respect to the vector of coefficients β, set it equal to zero, and solve for β — there's a unique solution, namely:
β = (xτ x)−1 xτ y
If we now add a standard prior in the form of a penalty term for large coefficients, then we have:
F1(β) = (1/N) Σi (yi − xiβ)² + Σj λ² βj² = (1/N) (y − xβ)τ (y − xβ) + (λIβ)τ (λIβ)
This can also be solved using calculus, and we solve for beta to get:
β1 = (xτ x + N·λ² I)−1 xτ y
In other words, adding the penalty term for large coefficients translates into adding a scalar multiple of the identity matrix to the covariance matrix in the closed form solution to β.
If we now want to add another penalty term that represents a "coefficients vary smoothly" prior, we can think of this as requiring that adjacent coefficients should be not too different from each other, which can be expressed in the following penalty function with a new parameter µ as follows:
F2(β) = (1/N) Σi (yi − xiβ)² + Σj λ² βj² + Σj µ² (βj − βj+1)²
      = (1/N) (y − xβ)τ (y − xβ) + λ² βτ β + µ² (Iβ − Mβ)τ (Iβ − Mβ)
where M is the matrix that contains zeros everywhere except on the lower off-diagonals, where it contains 1's. Then Mβ is the vector that results from shifting the coefficients of β by one and replacing the last coefficient by 0. The matrix M is called a shift operator, and the difference I − M can be thought of as a discrete derivative (see references on discrete calculus for more information).
Because this is the most complicated version, let's look at it in detail. Remembering our vector calculus, the derivative of F2(β) with respect to the vector β is a vector, and it shares the properties that happen at the scalar level, including that it's both additive and linear and that:
∂(uτ u)/∂β = 2 (∂uτ/∂β) u
Putting the preceding rules to use, we have:
∂F2(β)/∂β = (1/N) ∂[(y − xβ)τ (y − xβ)]/∂β + λ² ∂(βτ β)/∂β + µ² ∂[((I − M)β)τ (I − M)β]/∂β
          = −(2/N) xτ (y − xβ) + 2λ² β + 2µ² (I − M)τ (I − M) β
Setting this to 0 and solving for β gives us:
β2 = (xτ x + N·λ² I + N·µ² (I − M)τ (I − M))−1 xτ y
In other words, we have yet another matrix added to the covariance matrix in the closed form solution.
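The three closed-form solutions can be checked numerically. The sketch below uses toy data; the prior strengths lam (λ) and mu (µ) are arbitrary values chosen for illustration, and M is the shift operator with 1's on the lower off-diagonal:

```python
import numpy as np

# Toy regression data: 50 observations, 5 smoothly varying true coefficients.
rng = np.random.default_rng(0)
N, p = 50, 5
x = rng.normal(size=(N, p))
y = x @ np.array([1.0, 0.9, 0.8, 0.7, 0.6]) + 0.1 * rng.normal(size=N)

lam, mu = 0.5, 0.5          # hypothetical prior strengths
I = np.eye(p)
M = np.diag(np.ones(p - 1), k=-1)  # shift operator: 1's on the lower off-diagonal
D = I - M                          # discrete derivative

# No prior: beta = (x^T x)^{-1} x^T y
beta0 = np.linalg.solve(x.T @ x, x.T @ y)

# Standard (ridge) prior: beta1 = (x^T x + N lam^2 I)^{-1} x^T y
beta1 = np.linalg.solve(x.T @ x + N * lam**2 * I, x.T @ y)

# Smoothness prior added: beta2 = (x^T x + N lam^2 I + N mu^2 D^T D)^{-1} x^T y
beta2 = np.linalg.solve(x.T @ x + N * lam**2 * I + N * mu**2 * D.T @ D, x.T @ y)

print(np.round(beta0, 2))
print(np.round(beta1, 2))
print(np.round(beta2, 2))
```

Each prior shows up as one more matrix added to xτx before inverting, exactly as in the closed-form solutions.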