Introduction to Data Science
Ch 6. Timestamps and Financial Modeling
아꿈사 study group, 2015.06.27
Minchul Jung (ccc612@gmail.com)
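TV Tag
• Purpose: personalized TV program recommendations and a personalized program guide
• Format of collected data: {user, action, item} + time
• Reacting to a particular show in a particular way => classified as “Like!”
• Visualizing “Like!” data: a user-item bipartite graph
• Improving the graph: edges between users (follow/friend) and between TV shows (similarity)
• Time: helps identify specific time slots, the spread of influence, and changes over time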
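As a minimal sketch of the user-item bipartite graph described above, the snippet below turns {user, action, item} + time records into two adjacency maps. The event records, the field names, and the rule that only "like" actions create an edge are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical {user, action, item} + time events; all values are made up.
events = [
    {"user": "u1", "action": "like", "item": "show_a", "time": "2012-05-18T14:15:40Z"},
    {"user": "u1", "action": "like", "item": "show_b", "time": "2012-05-18T20:00:00Z"},
    {"user": "u2", "action": "like", "item": "show_a", "time": "2012-05-19T09:30:00Z"},
    {"user": "u2", "action": "comment", "item": "show_c", "time": "2012-05-19T09:31:00Z"},
]

def build_bipartite(events, liked_actions=("like",)):
    """Return (user -> items, item -> users) adjacency maps for "like" edges."""
    user_to_items = defaultdict(set)
    item_to_users = defaultdict(set)
    for e in events:
        if e["action"] in liked_actions:
            user_to_items[e["user"]].add(e["item"])
            item_to_users[e["item"]].add(e["user"])
    return dict(user_to_items), dict(item_to_users)

user_to_items, item_to_users = build_bipartite(events)
print(user_to_items)
print(item_to_users)
```

Edges only ever connect a user to an item, which is what makes the graph bipartite; user-user or show-show edges would need separate maps.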
[…] this stored data is by drawing a bipartite graph as shown in Figure 6-1.
Figure 6-1. Bipartite graph with users and items (shows) as nodes
We’ll go into graphs in later chapters, but for now you should know […]
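Timestamps
• Working with timestamped event data
- A common form of data in the big data era, and one of the causes of the big data phenomenon
- Human behavior can be measured at precise times, all day long
- Large volumes of data can be stored and processed quickly
• Extracting the data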
[…] and which stories were clicked on. This generates event logs. Each
record is an event that took place between a user and the app or website.
Here’s an example of a raw data point from GetGlue:
{"userId": "rachelschutt", "numCheckins": "1",
"modelName": "movies", "title": "Collaborator",
"source": "http://getglue.com/stickers/tribeca_film/collaborator_coming_soon",
"numReplies": "0",
"app": "GetGlue", "lastCheckin": "true",
"timestamp": "2012-05-18T14:15:40Z",
"director": "martin donovan", "verb": "watching",
"key": "rachelschutt/2012-05-18T14:15:40Z",
"others": "97", "displayName": "Rachel Schutt",
"lastModified": "2012-05-18T14:15:43Z",
"objectKey": "movies/collaborator/martin_donovan",
"action": "watching"}
[…] we extract four fields:
{"userid": "rachelschutt",
"action": "watching",
"title": "Collaborator",
"timestamp": "2012-05-18T14:15:40Z"}
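The extraction step above can be sketched as follows. The raw string is a shortened copy of the GetGlue record shown earlier, and the function name `extract_fields` and the choice to parse the timestamp into a timezone-aware `datetime` are assumptions, not something the source prescribes.

```python
import json
from datetime import datetime, timezone

# A shortened copy of the raw record above (only some keys kept for brevity).
raw = """{"userId": "rachelschutt", "title": "Collaborator",
          "timestamp": "2012-05-18T14:15:40Z", "verb": "watching",
          "lastModified": "2012-05-18T14:15:43Z", "action": "watching"}"""

def extract_fields(raw_json):
    """Keep only user id, action, title, and a parsed UTC timestamp."""
    record = json.loads(raw_json)
    return {
        "userid": record["userId"],
        "action": record["action"],
        "title": record["title"],
        # The trailing "Z" marks UTC, so attach the UTC timezone explicitly.
        "timestamp": datetime.strptime(
            record["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc),
    }

event = extract_fields(raw)
print(event)
```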
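Timestamps: Exploratory Data Analysis
• Build intuition about the data through EDA
• Learn the characteristics of the data by asking many questions and answering them
• Choose analysis methods based on those characteristics
- The choices depend on the situation
- Give a sound justification for each choice
• Things to choose: metrics, encodings, time ranges, data categories, how to handle mixed actions, behavior patterns worth noting, and so on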
Figure 6-2. An example of a way to visually display user-level data
over time
Now try to construct a narrative from that plot. For example, we could
say that user 1 comes the same time each day, whereas user 2 started
out active in this time period but then came less and less frequently.
User 3 needs a longer time horizon for us to understand his or her
behavior, whereas user 4 looks “normal,” whatever that means.
Let’s pose questions from that narrative:
[…] discipline. Say we have some raw data where each data point is an
event, but we want to have data stored in rows where each row consists
of a user followed by a bunch of timestamps corresponding to actions
that user performed. How would we get the data to that point? Note
that different users will have a different number of timestamps.
Make this reasoning explicit: how would we write the code to create a
plot like the one just shown? How would we go about tackling the data
munging exercise?
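One possible answer to the data munging question, as a minimal sketch: group the raw events by user and sort each user's timestamps. The sample events and the function name `events_to_rows` are hypothetical, and a dictionary of lists is only one way to hold rows of unequal length.

```python
from collections import defaultdict

# Hypothetical (userid, timestamp) events, deliberately out of order.
events = [
    ("user2", "2012-05-18T09:00:00Z"),
    ("user1", "2012-05-19T14:15:40Z"),
    ("user1", "2012-05-18T14:15:40Z"),
    ("user3", "2012-05-20T23:59:59Z"),
    ("user1", "2012-05-20T14:16:02Z"),
]

def events_to_rows(events):
    """Group events into {user: [sorted timestamps]}; row lengths may differ."""
    rows = defaultdict(list)
    for user, ts in events:
        rows[user].append(ts)
    # ISO 8601 UTC strings in the same format sort chronologically as text.
    return {user: sorted(stamps) for user, stamps in rows.items()}

rows = events_to_rows(events)
for user in sorted(rows):
    print(user, *rows[user])
```

Each printed line is one row: a user followed by that user's timestamps, which is the shape a per-user timeline plot needs.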
Suppose a user can take multiple actions: “thumbs_up” or
“thumbs_down,” “like,” and “comment.” How can we plot those events?
How can we modify our metrics? How can we encode the user data
with these different actions? Figure 6-3 provides an example for the
first question where we color code actions thumbs up and thumbs
down, denoted thumbs_up and thumbs_down.
[…] can think about how we might want to aggregate users. We might make
the x-axis refer to time, and the y-axis refer to counts, as shown in
Figure 6-4.
Figure 6-4. Aggregating user actions into counts
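A minimal sketch of the aggregation behind a counts-over-time plot: pool the timestamps of all users and count events per time bucket. The sample timestamps are made up, and the bucket widths (hourly and daily) are assumptions; picking the bucket is one of the choices that shapes what the plot shows.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps pooled over all users.
timestamps = [
    "2012-05-18T14:15:40Z",
    "2012-05-18T14:50:00Z",
    "2012-05-18T21:05:10Z",
    "2012-05-19T08:00:00Z",
]

def counts_per_bucket(timestamps, bucket_format="%Y-%m-%d %H:00"):
    """Count events per time bucket; the bucket label doubles as the x-axis value."""
    counts = Counter()
    for ts in timestamps:
        moment = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        counts[moment.strftime(bucket_format)] += 1
    return dict(sorted(counts.items()))

hourly = counts_per_bucket(timestamps)
daily = counts_per_bucket(timestamps, bucket_format="%Y-%m-%d")
print(hourly)
print(daily)
```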
We’re no longer working with 100 individual users, but we’re still
making choices, and those choices will impact our perception and […]
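Timestamps: Wrapping Up
• Metrics and new variables or features
- Intuition gained from EDA helps in constructing metrics
- Build the intuition gained from EDA into models and algorithms
• What should we do next?
- Time series modeling, including autoregression
- Clustering, by defining a notion of closeness
- Detecting behavior patterns
- Change-point detection: capturing the moment a big event occurred
- Training a recommendation system
• Time series modeling: predicting events that are extremely time-sensitive, or events that are predictable from what has already happened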
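Thought experiment: what do we lose if we work with a large training dataset but ignore the timestamps?
• Without a sense of time, we cannot work out cause and effect
- Absolute timestamps vs. relative time differences
- Seasonality and trend analysis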
Figure 6-6. Without keeping track of timestamps, we can’t see time-based
patterns; here, we see a seasonal pattern in a time series
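Financial Modeling
• Financial analysts: the original big data analysts
- Obsessed with timestamps, but not much concerned with causes
• In-sample and out-of-sample data
- In-sample data: training data + validation data
- Out-of-sample data: data used only after the model is finished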
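For timestamped data, one common way to separate in-sample from out-of-sample data is a chronological cutoff, so that nothing in the in-sample set comes from later than the out-of-sample set. The sketch below assumes this cutoff approach; the observations and the function name `split_by_time` are hypothetical.

```python
# Hypothetical (date, value) observations; ISO dates sort correctly as text.
observations = [
    ("2015-01-05", 100.0),
    ("2015-01-02", 99.0),
    ("2015-01-06", 101.5),
    ("2015-01-07", 100.7),
    ("2015-01-08", 102.2),
]

def split_by_time(observations, cutoff):
    """Rows strictly before `cutoff` are in-sample; the rest are out-of-sample."""
    ordered = sorted(observations)
    in_sample = [row for row in ordered if row[0] < cutoff]
    out_of_sample = [row for row in ordered if row[0] >= cutoff]
    return in_sample, out_of_sample

in_sample, out_of_sample = split_by_time(observations, cutoff="2015-01-07")
print(in_sample)
print(out_of_sample)
```

A random shuffle would mix future rows into the training data, which is exactly what a time-based split avoids.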
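Causal Modeling
• Never use information from the future to predict something now (use only information from the past up to the present).
• Timestamp of reference vs. timestamp of availability (the moment the data could actually be used)
• A set of coefficients
- The best-fit coefficients are not known until the last timestamp is reached
- Time series data does not yield a single set of best-fit coefficients
- The coefficients change as events occur
- Update the model every time new data arrives
- The model's coefficients are a living organism that keeps evolving
• Decisions about the future must be based on what is known now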
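The distinction between the timestamp of reference and the timestamp of availability can be sketched as a filter: at decision time, only records that had already been published may be used, whatever period they describe. The records, field names, and dates below are invented for illustration.

```python
# Hypothetical records: the date a figure refers to vs. the date it was published.
records = [
    {"name": "q1_earnings", "reference": "2012-03-31", "available": "2012-04-25", "value": 1.10},
    {"name": "q2_earnings", "reference": "2012-06-30", "available": "2012-07-26", "value": 1.25},
    {"name": "june_close", "reference": "2012-06-29", "available": "2012-06-29", "value": 98.4},
]

def usable_at(records, decision_date):
    """Keep only records already published on `decision_date` (ISO date strings)."""
    return [r for r in records if r["available"] <= decision_date]

known = usable_at(records, "2012-07-01")
print([r["name"] for r in known])
```

Filtering on `reference` instead would let `q2_earnings` through on 2012-07-01, even though in this made-up example it was not published until weeks later.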
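Preparing Financial Data
• Data preparation: transform the data so that it reflects reality better
- Normalize the data
- Take the log of the data
- Create categorical variables
- Convert the data into binary variables using a threshold
• Running submodels of the model
- Consider the new pieces
- Train submodels such as univariate regressions
• When you cannot normalize by a mean computed over the whole sample => normalize by a running (moving) average
- Causal interference (letting future data leak into the normalization) can make a bad model look good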
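A minimal sketch of the difference between causal and non-causal normalization: the causal version divides each value by the running mean of earlier values only, while the lookahead version uses a mean that already contains the future. The series is made up, and dividing by the mean (rather than, say, subtracting it) is an assumption chosen to keep the example short.

```python
def causal_normalize(values):
    """Divide each value by the mean of the values strictly before it.

    The first value has no history, so it is reported as None.
    """
    normalized = []
    running_sum = 0.0
    for i, v in enumerate(values):
        normalized.append(v / (running_sum / i) if i > 0 else None)
        running_sum += v
    return normalized

def lookahead_normalize(values):
    """The non-causal version: uses the full-sample mean, future included."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]

series = [2.0, 4.0, 6.0, 12.0]  # hypothetical, strictly positive values
print(causal_normalize(series))
print(lookahead_normalize(series))
```

The two outputs differ on every row, which is why a backtest built on the lookahead version says little about how the model would have behaved in real time.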
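Log Returns
• In finance, returns are computed on a daily basis
• Percent returns:
- Not additive
- Biased in favor of gains
• Log returns:
- Additive
- Symmetric with respect to gains and losses
• The two are similar when returns are very small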
[…] make a bad model look good (or, what is more likely, make a model that is pure noise look good).
Log Returns
In finance, we consider returns on a daily basis. In other words, we care about how much the stock (or future, or index) changes from day to day. This might mean we measure movement from opening on Monday to opening on Tuesday, but the standard approach is to care about closing prices on subsequent trading days.
We typically don’t consider percent returns, but rather log returns: if $F_t$ denotes a close on day $t$, then the log return that day is defined as $\log(F_t / F_{t-1})$, whereas the percent return would be computed as $100\,(F_t / F_{t-1} - 1)$. To simplify the discussion, we’ll compare log returns to scaled percent returns, which is the same as percent returns except without the factor of 100. The reasoning is not changed by this difference in scalar.
There are a few different reasons we use log returns instead of percentage returns. For example, log returns are additive but scaled percent returns aren’t. In other words, the five-day log return is the sum of the five one-day log returns. This is often computationally handy.
By the same token, log returns are symmetric with respect to gains and losses, whereas percent returns are biased in favor of gains. So, for example, if our stock goes down by 50%, or has a –0.5 scaled percent gain, and then goes up by 100%, so has a 1.0 scaled percent gain, we are where we started. But working in the same scenarios with […]
$\log(x) = \sum_{n \ge 1} (-1)^{n+1} \frac{(x-1)^n}{n} = (x-1) - \frac{(x-1)^2}{2} + \cdots$
In other words, the first term of the Taylor expansion agrees with the
percent return. So as long as the second term is small compared to the
first, which is usually true for daily returns, we get a pretty good
approximation of percent returns using log returns.
Here’s a picture of how closely these two functions behave, keeping in
mind that when x=1, there’s no change in price whatsoever, as shown
in Figure 6-8.
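The three properties discussed above (additivity, symmetry, and agreement for small moves) can be checked numerically. This is a minimal sketch on made-up closing prices; the function names are illustrative only.

```python
import math

def log_returns(closes):
    """Daily log returns log(F_t / F_{t-1})."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]

def scaled_percent_returns(closes):
    """Daily scaled percent returns F_t / F_{t-1} - 1 (no factor of 100)."""
    return [b / a - 1 for a, b in zip(closes, closes[1:])]

closes = [100.0, 50.0, 100.0, 101.0]  # hypothetical closing prices

logs = log_returns(closes)
pcts = scaled_percent_returns(closes)

# Additivity: the multi-day log return equals the sum of the daily ones.
print(sum(logs), math.log(closes[-1] / closes[0]))
# Symmetry: halving then doubling gives log returns that cancel out...
print(logs[0] + logs[1])
# ...while the scaled percent returns for the same two days do not.
print(pcts[0], pcts[1])
# For a small move (100 -> 101) the two kinds of return nearly agree.
print(logs[2], pcts[2])
```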
Example: The S&P Index
Figure 6-10. The log of the S&P returns shown over time
Let’s work out a toy example. If you start with S&P closing levels as
shown in Figure 6-9, then you get the log returns illustrated in
Figure 6-10.
Figure 6-9. S&P closing levels shown over time
What’s that mess? It’s crazy volatility caused by the financial crisis. We
sometimes (not always) want to account for that volatility by normalizing
with respect to it (described earlier). Once we do that we get
something like Figure 6-11, which is clearly better behaved.
Figure 6-11. The volatility normalized log of the S&P closing returns
shown over time
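Measuring Volatility
• Choosing a lookback window
- The length of past time from which we take information
- Longer => more information goes into the estimate
- Shorter => reacts faster to new information
- When a big event happens, how long does it take to be forgotten?
• How do we use the past data?
- Rolling window: give each of the previous n days the same weight
- Continuous lookback window: apply a half-life to older data
• Choose the decay factor for the downweighting that minimizes risk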
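The rolling-window option above can be sketched as follows: each of the n most recent returns gets the same weight, and anything older is dropped entirely. Treating the mean daily return as zero is a simplifying assumption of this sketch, and the return values are invented.

```python
import math

def rolling_volatility(returns, n):
    """Root-mean-square of the n most recent returns, each weighted equally.

    Assumes the mean return is approximately zero; entries with fewer than
    n observations available are reported as None.
    """
    out = []
    for t in range(len(returns)):
        if t + 1 < n:
            out.append(None)
        else:
            window = returns[t + 1 - n : t + 1]
            out.append(math.sqrt(sum(r * r for r in window) / n))
    return out

returns = [0.01, -0.02, 0.015, -0.05, 0.04, 0.002]  # hypothetical log returns
print(rolling_volatility(returns, n=3))
```

With a rolling window, a big return leaves the estimate abruptly after exactly n days; a half-life makes it fade gradually instead.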
Figure 6-12. Volatility in the S&P with different decay factors
Exponential Downweighting
• Decay
- Give more weight to current data
- Call the downweighting of older data s and treat it like a parameter
We’ve already seen an example of exponential downweighting in the case of keeping a running estimate of the volatility of […] the S&P.
The general formula for downweighting some additive […] estimate E is simple enough. We weight recent data more than older data, and we assign the downweighting of older data a name, s, and treat it like a parameter. It is called the decay. In its simplest form:
$E_t = s \cdot E_{t-1} + (1 - s) \cdot e_t$
where $e_t$ is the new term.
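The update rule above translates directly into a loop. In this sketch the conversion from a half-life to the decay s, the seeding of the estimate with the first squared return, and the sample returns are all assumptions added for illustration.

```python
import math

def decay_from_half_life(half_life):
    """Decay s such that a term's weight halves after `half_life` steps."""
    return 0.5 ** (1.0 / half_life)

def downweighted_estimates(new_terms, s, initial=0.0):
    """Apply E_t = s * E_{t-1} + (1 - s) * e_t and return every E_t."""
    estimate = initial
    history = []
    for e in new_terms:
        estimate = s * estimate + (1 - s) * e
        history.append(estimate)
    return history

# Hypothetical daily returns; their squares are the new terms e_t.
squared_returns = [r * r for r in (0.01, -0.02, 0.015, -0.05, 0.04)]
s = decay_from_half_life(10)
variance = downweighted_estimates(squared_returns, s, initial=squared_returns[0])
volatility = [math.sqrt(v) for v in variance]
print(round(s, 4), volatility)
```

A decay close to 1 forgets slowly and gives a smooth estimate; a smaller decay reacts faster to new data but is noisier.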
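The Financial Modeling Feedback Loop
• The market learns over time.
- Buying affects the market => it weakens the signal you anticipated
• The market is a combination of countless algorithms that predict things and thereby make what they predicted disappear
- Existing signals become very weak
- This is because market participants have all understood those signals and anticipated them
• Ordinary evaluation metrics do not work => use a PnL (profit and loss) graph instead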
[…] change (difference, not ratio), or today’s value minus yesterday’s value.
Figure 6-13. A graph of the cumulative PnLs of two theoretical models
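A cumulative PnL curve like the one in Figure 6-13 can be sketched as a running sum of daily profits. The rule used here, a signed position multiplied by the next day's change in value, is an assumption of this sketch rather than the book's exact definition, and the values and positions are made up.

```python
from itertools import accumulate

def cumulative_pnl(values, positions):
    """Cumulative profit and loss for a series of daily positions.

    positions[t] is a hypothetical signed bet placed at the close of day t;
    it earns positions[t] * (values[t + 1] - values[t]), a difference, not a ratio.
    """
    changes = [b - a for a, b in zip(values, values[1:])]
    daily = [p * c for p, c in zip(positions, changes)]
    return list(accumulate(daily))

values = [100.0, 102.0, 101.0, 104.0]  # hypothetical closing values
model_a = [1, -1, 1]                   # bets that happen to match every move
model_b = [1, 1, -1]                   # bets that are wrong after day one
print(cumulative_pnl(values, model_a))
print(cumulative_pnl(values, model_b))
```

Plotting the two lists against time gives one line per model, so a model that keeps earning can be told apart from one whose early gains are later given back.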
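Adding Priors
• A prior: an opinion that has been formulated mathematically and incorporated into the model
• Example: giving older data reduced weight
- Newer data matters more than older data
• Priors reduce the degrees of freedom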
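A Baby Model
• Compute the autocorrelation of the time series
• Penalty function: add a term to the function we are trying to minimize
- That function measures how well the model fits
• Choose the exponential downweighting term r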
$y = F_t = \alpha_0 + \alpha_1 F_{t-1} + \alpha_2 F_{t-2} + \epsilon$
which is just the example where we take the last two values of the time
series F to predict the next one. We could use more than two values,
of course. If we used lots of lagged values, then we could strengthen
our prior in order to make up for the fact that we’ve introduced so
many degrees of freedom. In effect, priors reduce degrees of freedom.
The way we’d place the prior about the relationship between coefficients
(in this case consecutive lagged data points) is by adding a matrix
to our covariance matrix when we perform linear regression. See more
about this here.
A Baby Model
Say we drew a plot in a time series and found that we have strong but
fading autocorrelation up to the first 40 lags or so as shown in
Figure 6-14.
Figure 6-14. Looking at auto-correlation out to 100 lags
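An autocorrelation plot like Figure 6-14 is built from one number per lag. The sketch below uses the usual sample autocorrelation (lagged covariance divided by the variance, both around the overall mean); the periodic series is a made-up stand-in for real data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    deviations = [v - mean for v in series]
    variance = sum(d * d for d in deviations)
    covariance = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    return covariance / variance

# Hypothetical series with a period of 4, so lag 4 correlates strongly.
series = [1.0, 2.0, 1.0, 0.0] * 6
for lag in (1, 2, 4):
    print(lag, round(autocorrelation(series, lag), 3))
```

Computing this for lags 1 through 100 and plotting the values against the lag gives the kind of picture the figure describes.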
• With no prior: the sum of squared errors
• Adding a standard prior: adds a scalar multiple of the identity matrix
• Adding a further prior: the coefficients vary smoothly
A good way to think about priors is by adding a term to the function we are seeking to minimize, which measures the extent to which we have a good fit. This is called the “penalty function,” and when we have no prior at all, it’s simply the sum of the squares of the error:
$F(\beta) = \sum_i (y_i - x_i \beta)^2 = (y - x\beta)^{\tau} (y - x\beta)$
If we want to minimize F, which we do, then we take its derivative with respect to the vector of coefficients β, set it equal to zero, and solve for β. There’s a unique solution, namely:
$\beta = (x^{\tau} x)^{-1} x^{\tau} y$
If we now add a standard prior in the form of a penalty term for large coefficients, then we have:
$F_1(\beta) = \frac{1}{N} \sum_i (y_i - x_i \beta)^2 + \sum_j \lambda^2 \beta_j^2 = \frac{1}{N} (y - x\beta)^{\tau} (y - x\beta) + (\lambda I \beta)^{\tau} (\lambda I \beta)$
This can also be solved using calculus, and we solve for β to get:
$\beta_1 = (x^{\tau} x + N \cdot \lambda^2 I)^{-1} x^{\tau} y$
In other words, adding the penalty term for large coefficients translates into adding a scalar multiple of the identity matrix to the covariance matrix in the closed form solution to β.
If we now want to add another penalty term that represents a “coefficients vary smoothly” prior, we can think of this as requiring that adjacent coefficients should be not too different from each other, which can be expressed in the following penalty function with a new parameter µ:
$F_2(\beta) = \frac{1}{N} \sum_i (y_i - x_i \beta)^2 + \sum_j \lambda^2 \beta_j^2 + \sum_j \mu^2 (\beta_j - \beta_{j+1})^2 = \frac{1}{N} (y - x\beta)^{\tau} (y - x\beta) + \lambda^2 \beta^{\tau} \beta + \mu^2 (I\beta - M\beta)^{\tau} (I\beta - M\beta)$
where M is the matrix that contains zeros everywhere except on the first upper off-diagonal, where it contains 1’s. Then Mβ is the vector that results from shifting the coefficients of β by one and replacing the last coefficient by 0. The matrix M is called a shift operator, and the difference I − M can be thought of as a discrete derivative ([…] for more information on discrete calculus).
Because this is the most complicated version, let’s look […]
Remembering our vector calculus, the derivative of […] $F_2(\beta)$ with respect to the vector β is a vector, and s[…] the properties that happen at the scalar level, including […] it’s both additive and linear and that:
$\frac{\partial (u^{\tau} \cdot u)}{\partial \beta} = 2 \, \frac{\partial u^{\tau}}{\partial \beta} \, u$
Putting the preceding rules to use, we have:
$\frac{\partial F_2(\beta)}{\partial \beta} = \frac{1}{N} \frac{\partial \left[ (y - x\beta)^{\tau} (y - x\beta) \right]}{\partial \beta} + \lambda^2 \cdot \frac{\partial (\beta^{\tau} \beta)}{\partial \beta} + \mu^2 \cdot \frac{\partial \left[ ((I - M)\beta)^{\tau} (I - M)\beta \right]}{\partial \beta}$
$= -\frac{2}{N} x^{\tau} (y - x\beta) + 2\lambda^2 \cdot \beta + 2\mu^2 (I - M)^{\tau} (I - M) \beta$
Setting this to 0 and solving for β gives us:
$\beta_2 = \left( x^{\tau} x + N \cdot \lambda^2 I + N \cdot \mu^2 \cdot (I - M)^{\tau} (I - M) \right)^{-1} x^{\tau} y$
In other words, we have yet another matrix added […]
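The closed-form solutions for the three penalty functions (no prior, a penalty on large coefficients, and an added "coefficients vary smoothly" penalty) can be sketched in plain Python. This is an illustrative sketch, not the book's code: the helper functions, the tiny design matrix, and the parameter values are all hypothetical, and M is taken to hold 1's just above the diagonal so that ((I − M)β)_j = β_j − β_{j+1}, with the last entry equal to the last coefficient itself.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b for x by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit(x, y, lam=0.0, mu=0.0):
    """beta = (x'x + N lam^2 I + N mu^2 (I-M)'(I-M))^{-1} x'y."""
    n_rows, k = len(x), len(x[0])
    # d = I - M, with M the shift matrix holding 1's just above the diagonal.
    d = [[(1.0 if i == j else 0.0) - (1.0 if j == i + 1 else 0.0)
          for j in range(k)] for i in range(k)]
    xt = transpose(x)
    xtx = matmul(xt, x)
    dtd = matmul(transpose(d), d)
    lhs = [[xtx[i][j]
            + n_rows * lam ** 2 * (1.0 if i == j else 0.0)
            + n_rows * mu ** 2 * dtd[i][j]
            for j in range(k)] for i in range(k)]
    xty = [sum(xt[i][r] * y[r] for r in range(n_rows)) for i in range(k)]
    return solve(lhs, xty)

# Hypothetical design matrix and noiseless targets with true beta = (3, -1).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [3.0, -1.0, 2.0, 5.0]
plain = fit(x, y)              # no prior: ordinary least squares
ridge = fit(x, y, lam=1.0)     # penalty on large coefficients
smooth = fit(x, y, mu=3.0)     # penalty on differences of adjacent coefficients
print(plain, ridge, smooth)
```

With λ > 0 the coefficients shrink toward zero, and with µ > 0 adjacent coefficients are pulled toward each other, which is how each prior reduces the effective degrees of freedom.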

More Related Content

More from Minchul Jung

13.앙상블학습
13.앙상블학습13.앙상블학습
13.앙상블학습Minchul Jung
 
10장 진화학습
10장 진화학습10장 진화학습
10장 진화학습Minchul Jung
 
Ch9 프로세스의 메모리 구조
Ch9 프로세스의 메모리 구조Ch9 프로세스의 메모리 구조
Ch9 프로세스의 메모리 구조Minchul Jung
 
7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)
7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)
7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)Minchul Jung
 
실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기
실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기
실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기Minchul Jung
 
Ch1 일래스틱서치 클러스터 시작
Ch1 일래스틱서치 클러스터 시작Ch1 일래스틱서치 클러스터 시작
Ch1 일래스틱서치 클러스터 시작Minchul Jung
 
Ch6 대용량서비스레퍼런스아키텍처 part.1
Ch6 대용량서비스레퍼런스아키텍처 part.1Ch6 대용량서비스레퍼런스아키텍처 part.1
Ch6 대용량서비스레퍼런스아키텍처 part.1Minchul Jung
 

More from Minchul Jung (7)

13.앙상블학습
13.앙상블학습13.앙상블학습
13.앙상블학습
 
10장 진화학습
10장 진화학습10장 진화학습
10장 진화학습
 
Ch9 프로세스의 메모리 구조
Ch9 프로세스의 메모리 구조Ch9 프로세스의 메모리 구조
Ch9 프로세스의 메모리 구조
 
7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)
7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)
7부. 애플리케이션 입장에서의 성능 튜닝 (1~8장)
 
실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기
실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기
실무로 배우는 시스템 성능 최적화 - 4부. 프로세스 이해하기
 
Ch1 일래스틱서치 클러스터 시작
Ch1 일래스틱서치 클러스터 시작Ch1 일래스틱서치 클러스터 시작
Ch1 일래스틱서치 클러스터 시작
 
Ch6 대용량서비스레퍼런스아키텍처 part.1
Ch6 대용량서비스레퍼런스아키텍처 part.1Ch6 대용량서비스레퍼런스아키텍처 part.1
Ch6 대용량서비스레퍼런스아키텍처 part.1
 

Recently uploaded

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

데이터과학 입문 - ch6.시간 기록과 금융 모형화

  • 1. 데이터 과학 입문 Ch 6. 시간기록과
 금융 모형화 아꿈사 스터디 2015.06.27 정민철 (ccc612@gmail.com)
  • 2. 티비 태그 • 목적: 개인화된 TV 프로그램 추천과 편성표 제공 • 수집 정보 형식: {사용자, 행동, 항목} + 시간 • 특정 쇼에 특정 방식으로 반응 => “좋아요!” 로 분류 • “좋아요!” 데이터 시각화 방법: 사용자-항목 이분 그래프 • 그래프 개선: 사용자 간(팔로우/친구), TV쇼 간(유사도) 연결선 • 시간: 특정 시간대, 영향 확산, 시간에 따른 변화 파악에 도움
  • 3. this stored data is by drawing a bipartite graph as shown in Figure 6-1. Figure 6-1. Bipartite graph with users and items (shows) as nodes We’ll go into graphs in later chapters, but for now you should know
  • 4. 시간기록 • 시간기록 사건 데이터 다루기
 - 빅데이터 시대의 흔한 형태로 빅데이터 현상을 만든 요인
 - 하루종일 정확한 시간의 인간 행동 측정 가능
 - 대용량 데이터를 저장하고 신속하게 처리 가능 • 데이터 추출 nd which stories were clicked on. This generates event logs. Each cordisaneventthattookplacebetweenauserandtheapporwebsite. ere’s an example of raw data point from GetGlue: {"userId": "rachelschutt", "numCheckins": "1", "modelName": "movies", "title": "Collaborator", "source": "http://getglue.com/stickers/tribeca_film/ collaborator_coming_soon", "numReplies": "0", "app": "GetGlue", "lastCheckin": "true", "timestamp": "2012-05-18T14:15:40Z", "director": "martin donovan", "verb": "watching", "key": "rachelschutt/2012-05-18T14:15:40Z", "others": "97", "displayName": "Rachel Schutt", "lastModified": "2012-05-18T14:15:43Z", "objectKey": "movies/collaborator/martin_donovan", "action": "watching"} we extract four fields: {"userid":"rachelschutt", "action": {"userid":"rachelschutt", 
 "action": "watching", 
 "title":"Collaborator", 
 timestamp:"2012-05- 18T14:15:40Z" }
  • 5. 시간기록 - 탐색적 데이터 분석 • EDA를 통해 데이터에 관한 직관 얻기 • 여러가지 질문과 그에 대한 답을 얻는 과정에서 데이터의 특성을 파악 • 파악된 특성을 기반으로 분석 방법을 선택
 - 상황에 기반한 선택
 - 선택에 대한 타당한 이유 제시 • 선택할 것들: 척도, 부호화 방법, 시간범위, 데이터 범주, 혼합된 행동 다루기, 주목할 행동 패턴 등등등....
  • 6. Figure 6-2. An example of a way to visually display user-level data over time Now try to construct a narrative from that plot. For example, we could say that user 1 comes the same time each day, whereas user 2 started out active in this time period but then came less and less frequently. User 3 needs a longer time horizon for us to understand his or her behavior, whereas user 4 looks “normal,” whatever that means. Let’s pose questions from that narrative: discipline. Say we have some raw data where each data point is an event, but we want to have data stored in rows where each row consists of a user followed by a bunch of timestamps corresponding to actions that user performed. How would we get the data to that point? Note that different users will have a different number of timestamps. Make this reasoning explicit: how would we write the code to create a plot like the one just shown? How would we go about tackling the data munging exercise? Suppose a user can take multiple actions: “thumbs_up,” or “thumbs_down,”“like,”and“comment.”Howcanweplotthoseevents? How can we modify our metrics? How can we encode the user data with these different actions? Figure 6-3 provides an example for the first question where we color code actions thumbs up and thumbs down, denoted thumbs_up and thumbs_down.
  • 7. canthinkabouthowwemightwanttoaggregateusers.Wemightmake the x-axis refer to time, and the y-axis refer to counts, as shown in Figure 6-4. Figure 6-4. Aggregating user actions into counts We’re no longer working with 100 individual users, but we’re still making choices, and those choices will impact our perception and
  • 8. 시간기록 - 마무리 • 척도와 새로운 변수 또는 특징
 - EDA로 부터 얻은 직관은 척도 구성에 도움
 - EDA로 얻은 직관을 모형과 알고리즘에 반영 • 다음은 무엇을 해야하나?
 - 자기회귀를 포함한 시계화 모형화
 - 근접성 정의를 통한 군집화
 - 행동패턴 탐지
 - 전환지점 탐지: 큰 사건이 일어난 시점 포착
 - 추천 시스템 훈련 • 시계열 모형화 (time series modeling): 시간에 극도로 민감한 사건을 예측 or 이미 발생한 것으로부터 예측 가능한 사건 예측
  • 9. 사고실험 - 커다란 훈련 데이터세트를 다루면서 시간 기록을 무시했을 때 무엇을 잃게 될까? • 시간 감각이 없다면 원인과 결과를 알아낼 수 없음
 - 절대적 시간기록 v.s. 상대적 시간차이
 - 계절성, 추세 분석 Figure 6-6. Without keeping track of timestamps, we can’t see time- based patterns; here, we see a seasonal pattern in a time series
  • 10. 금융 모형화 • 금융 분석가: 빅데이터 분석의 원조
 - 시간기록에 집착하지만 원인은 크게 신경쓰지 않음 • 표본내, 표본외 데이터
 - 표본내 데이터: 훈련데이터 + 검증데이터
 - 표본외 데이터: 모형이 완성된 후 사용하는 데이터
  • 11. 인과 모형화 (causal modeling) • 현재의 무언가를 예측하기 위해 미래의 정보를 결코 사용해서는 안 된다. (과거 ~ 현재 까지의 정보만 사용) • 준거의 시간기록 v.s. 가용성의 시간기록(이용 가능한 시점) • 계수의 집합
 - 마지막 시간 기록에 도달할 때 까지는 최적적합계수를 알지 못함
 - 시계열 데이터는 하나의 최적 적합 계수를 얻을 수 없음
 - 사건이 발생함에 따라 계수가 변경 됨
 - 새로운 데이터를 얻을 때마다 모형 갱신
 - 모형의 계수는 끊임없이 진화하는 살아있는 유기체 • 현재 알고있는 것에 기반해 미래에 관한 의사결정을 해야함
  • 12. 금융 데이터 준비하기 • 데이터 준비: 현실을 더 잘 반영하는 데이터로 변환
 - 데이터 정규화
 - 데이터의 로그를 취함
 - 범주형 변수 생성
 - 경계값을 기준으로 이진 변수로 데이터 변환 • 모형의 하위모형 운용
 - 새로운 부분 고려
 - 일변량 회기 등 하위모형 훈련 • 계산된 평균으로 정규화 할 수 없는 경우 => 이동평균으로 정규화
 - 인과적 간섭은 나쁜 모형을 좋아보이게 할 수 있음
  • 13. 로그 수익률 • 금융에서는 하루 단위로 수익을 계산 • 백분 수익률:
 - 비가산적
 - 이득에 대해 편의적 • 로그 수익률: 
 - 가산적
 - 득실에 대칭적 • 아주 작은 수익률에서는 비슷 make a bad model look good (or, what is more likely, make a that is pure noise look good). Log Returns In finance, we consider returns on a daily basis. In other wo care about how much the stock (or future, or index) changes fr to day. This might mean we measure movement from open Monday to opening on Tuesday, but the standard approach is about closing prices on subsequent trading days. We typically don’t consider percent returns, but rather log ret Ft denotes a close on day t, then the log return that day is def log Ft /Ft−1 , whereas the percent return would be compu 100 Ft /Ft−1 −1 . To simplify the discussion, we’ll compare turns to scaled percent returns, which is the same as percent except without the factor of 100. The reasoning is not changed difference in scalar. There are a few different reasons we use log returns instead centage returns. For example, log returns are additive but scal cent returns aren’t. In other words, the five-day log return is t care about how much the stock (or future, or index) changes fro to day. This might mean we measure movement from openin Monday to opening on Tuesday, but the standard approach is t about closing prices on subsequent trading days. We typically don’t consider percent returns, but rather log retu Ft denotes a close on day t, then the log return that day is defin log Ft /Ft−1 , whereas the percent return would be comput 100 Ft /Ft−1 −1 . To simplify the discussion, we’ll compare l turns to scaled percent returns, which is the same as percent re except without the factor of 100. The reasoning is not changed b difference in scalar. There are a few different reasons we use log returns instead o centage returns. For example, log returns are additive but scale cent returns aren’t. In other words, the five-day log return is th of the five one-day log returns. 
This is often computationally ha By the same token, log returns are symmetric with respect to gain losses, whereas percent returns are biased in favor of gains. S example, if our stock goes down by 50%, or has a –0.5 scaled pe gain, and then goes up by 200%, so has a 2.0 scaled percent ga are where we started. But working in the same scenarios wit log x = ∑ n x−1 n n = x−1 + x−1 2 /2+⋯ In other words, the first term of the Taylor expansion agrees with percent return. So as long as the second term is small compared to first, which is usually true for daily returns, we get a pretty good proximation of percent returns using log returns. Here’s a picture of how closely these two functions behave, keepin mind that when x=1, there’s no change in price whatsoever, as sh in Figure 6-8.
  • 14. 예시: S&P 지수 Figure 6-10. The log of the S&P returns shown over time Financial Modeling www.it-ebooks.info Example: The S&P Index Let’s work out a toy example. If you start with S&P closing levels as shown in Figure 6-9, then you get the log returns illustrated in Figure 6-10. Figure 6-9. S&P closing levels shown over time What’s that mess? It’s crazy volatility caused by the financial crisis. W sometimes (not always) want to account for that volatility by normal izing with respect to it (described earlier). Once we do that we ge something like Figure 6-11, which is clearly better behaved. Figure 6-11. The volatility normalized log of the S&P closing return shown over time
  • 15. 변동성 측정하기 • 회고창 선택 (lookback window)
 - 정보를 취하는 과거 시간길이
 - 길어질 수록 => 추정을 위해 더 많은 정보 필요
 - 짧아질 수록 => 새로운 정보에 더 빨리 반응
 - 큰 사건이 일어나면 잊혀지는 데 시간이 얼마나 소요되나? • 과거 데이터를 어떻게 사용하나?
 - 롤링 창 적용: 이전 n일 각각에 동일한 가중치 부여
 - 연속적인 회고 창 사용: 오래된 데이터에 반감기 적용 • 위험 부담을 최소화 하는 가중치 감소화 지수 선택
  • 16. Figure 6-12. Volatility in the S&P with different decay factors
  • 17. Exponential downweighting • Decay
 - Give more weight to current data
 - Call the downweighting of older data s and treat it like a parameter
Exponential Downweighting
We've already seen an example of exponential downweighting in the case of keeping a running estimate of the volatility of the returns of the S&P.
The general formula for downweighting some additive running estimate $E$ is simple enough. We weight recent data more than older data, and we assign the downweighting of older data a name $s$ and treat it like a parameter. It is called the decay. In its simplest form we get:
$$E_t = s \cdot E_{t-1} + (1 - s) \cdot e_t$$
where $e_t$ is the new term.
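The recurrence $E_t = s \cdot E_{t-1} + (1 - s) \cdot e_t$ can be sketched in a few lines. The function name `exp_downweight`, the starting value, and the return series are assumptions made for this illustration; feeding in squared returns to obtain a volatility estimate also assumes a zero mean return.

```python
def exp_downweight(new_terms, s, initial=0.0):
    """Running estimate E_t = s * E_{t-1} + (1 - s) * e_t for each new term e_t."""
    if not 0.0 <= s < 1.0:
        raise ValueError("decay s must satisfy 0 <= s < 1")
    estimate = initial
    history = []
    for e in new_terms:
        estimate = s * estimate + (1.0 - s) * e
        history.append(estimate)
    return history

# Hypothetical daily returns with one shock; the new terms e_t are squared
# returns, so the square root of the running estimate is a volatility estimate.
squared_returns = [r * r for r in [0.01, -0.01, 0.01, -0.10, 0.01, -0.01, 0.01]]

slow = exp_downweight(squared_returns, s=0.97, initial=1e-4)  # long memory
fast = exp_downweight(squared_returns, s=0.75, initial=1e-4)  # short memory

slow_vol = [v ** 0.5 for v in slow]
fast_vol = [v ** 0.5 for v in fast]
```

A decay closer to 1 keeps more of the old estimate, so `slow` moves less when the shock arrives and also takes longer to forget it; a smaller decay does the opposite.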
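  • 18. The financial modeling feedback loop • The market learns over time.
 - Buying affects the market => it diminishes the signal we anticipated • The market is a combination of many algorithms that predict something and, by acting on it, make what they predicted disappear
 - Existing signals become very weak
 - Because market participants have all understood those signals and anticipated them • The usual evaluation metrics do not work => use a PnL (Profit & Loss) graph instead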
  • 19. […] change (difference, not ratio), or today's value minus yesterday's value.
Figure 6-13. A graph of the cumulative PnLs of two theoretical models
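  • 20. Adding priors • Prior: an opinion that is mathematically formulated and incorporated into the model • Example: giving older data reduced weight
 - New data is more important than old data • Priors reduce degrees of freedom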
  • 21. A baby model • Compute the autocorrelation of the time series • Penalty function: add a term to the function we are trying to minimize
 - Measures how well the model fits • Choose the exponential downweighting term r
$y = F_t = \alpha_0 + \alpha_1 F_{t-1} + \alpha_2 F_{t-2} + {}$ […]
which is just the example where we take the last two values of the time series $F$ to predict the next one. We could use more than two values, of course. If we used lots of lagged values, then we could strengthen our prior in order to make up for the fact that we've introduced so many degrees of freedom. In effect, priors reduce degrees of freedom.
The way we'd place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to our covariance matrix when we perform linear regression.
A Baby Model
Say we drew a plot in a time series and found that we have strong but fading autocorrelation up to the first 40 lags or so as shown in Figure 6-14.
Figure 6-14. Looking at auto-correlation out to 100 lags
  • 22. • With no prior: sum of squared errors • Adding a standard prior: add a scalar multiple of the identity matrix • Adding another prior: the coefficients vary smoothly.
A good way to think about priors is by adding a term to the function we are seeking to minimize, which measures the extent to which we have a good fit. This is called the "penalty function," and when we have no prior at all, it's simply the sum of the squares of the error:
$$F(\beta) = \sum_i (y_i - x_i \beta)^2 = (y - x\beta)^\tau (y - x\beta)$$
If we want to minimize $F$, which we do, then we take its derivative with respect to the vector of coefficients $\beta$, set it equal to zero, and solve for $\beta$. There's a unique solution, namely:
$$\beta = (x^\tau x)^{-1} x^\tau y$$
If we now add a standard prior in the form of a penalty term for large coefficients, then we have:
$$F_1(\beta) = \frac{1}{N} \sum_i (y_i - x_i \beta)^2 + \sum_j \lambda^2 \beta_j^2 = \frac{1}{N} (y - x\beta)^\tau (y - x\beta) + (\lambda I \beta)^\tau (\lambda I \beta)$$
This can also be solved using calculus, and we solve for $\beta$ to get:
$$\beta_1 = (x^\tau x + N \lambda^2 I)^{-1} x^\tau y$$
In other words, adding the penalty term for large coefficients translates into adding a scalar multiple of the identity matrix to the covariance matrix in the closed form solution to $\beta$.
If we now want to add another penalty term that represents a "coefficients vary smoothly" prior, we can think of this as requiring that adjacent coefficients should be not too different from each other, which can be expressed in the following penalty function with a new parameter $\mu$:
$$F_2(\beta) = \frac{1}{N} \sum_i (y_i - x_i \beta)^2 + \sum_j \lambda^2 \beta_j^2 + \sum_j \mu^2 (\beta_j - \beta_{j+1})^2 = \frac{1}{N} (y - x\beta)^\tau (y - x\beta) + \lambda^2 \beta^\tau \beta + \mu^2 (I\beta - M\beta)^\tau (I\beta - M\beta)$$
where $M$ is the matrix that contains zeros everywhere except on the upper off-diagonal, where it contains 1's. Then $M\beta$ is the vector that results from shifting the coefficients of $\beta$ by one and replacing the last coefficient by 0. The matrix $M$ is called a shift operator and the difference $I - M$ can be thought of as a discrete derivative (see […] for more information on discrete calculus).
Because this is the most complicated version, let's look at this in detail. Remembering our vector calculus, the derivative of $F_2(\beta)$ with respect to the vector $\beta$ is a vector, and satisfies the properties that happen at the scalar level, including the fact that it's both additive and linear and that:
$$\frac{\partial (u^\tau u)}{\partial \beta} = 2 \, \frac{\partial u^\tau}{\partial \beta} \, u$$
Putting the preceding rules to use, we have:
$$\frac{\partial F_2(\beta)}{\partial \beta} = \frac{1}{N} \frac{\partial \left[ (y - x\beta)^\tau (y - x\beta) \right]}{\partial \beta} + \lambda^2 \frac{\partial (\beta^\tau \beta)}{\partial \beta} + \mu^2 \frac{\partial \left[ ((I - M)\beta)^\tau (I - M)\beta \right]}{\partial \beta} = -\frac{2}{N} x^\tau (y - x\beta) + 2\lambda^2 \beta + 2\mu^2 (I - M)^\tau (I - M) \beta$$
Setting this to 0 and solving for $\beta$ gives us:
$$\beta_2 = \left( x^\tau x + N \lambda^2 I + N \mu^2 (I - M)^\tau (I - M) \right)^{-1} x^\tau y$$
In other words, we have yet another matrix added to the covariance matrix in the closed form solution.
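The closed-form solutions for the three penalty functions can be compared on a tiny example. The sketch below is an illustration only, not the book's implementation: it is written in dependency-free Python (a linear algebra library would be the usual choice), the helper names `transpose`, `matmul`, `solve`, and `fit_with_priors` are hypothetical, and the design matrix is deliberately trivial (the identity) so the results can be checked by hand. It takes the shift matrix $M$ to have ones directly above the diagonal, so that $(I - M)\beta$ has entries $\beta_j - \beta_{j+1}$ and ends in the last coefficient itself.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def fit_with_priors(x, y, lam=0.0, mu=0.0):
    """beta = (x^T x + N lam^2 I + N mu^2 (I - M)^T (I - M))^{-1} x^T y."""
    n_obs, p = len(x), len(x[0])
    xt = transpose(x)
    a = matmul(xt, x)
    # d = I - M, with M having ones on the upper off-diagonal (assumption above).
    d = [[(1.0 if i == j else 0.0) - (1.0 if j == i + 1 else 0.0)
          for j in range(p)] for i in range(p)]
    dtd = matmul(transpose(d), d)
    for i in range(p):
        a[i][i] += n_obs * lam ** 2
        for j in range(p):
            a[i][j] += n_obs * mu ** 2 * dtd[i][j]
    xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in xt]
    return solve(a, xty)

def roughness(beta):
    """(I - M) beta squared norm: adjacent differences plus the last coefficient."""
    diffs = sum((beta[j] - beta[j + 1]) ** 2 for j in range(len(beta) - 1))
    return diffs + beta[-1] ** 2

# Hypothetical data: identity design, so the no-prior solution is just y.
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [1.0, 5.0, 1.0]

beta_plain = fit_with_priors(x, y)            # no prior
beta_ridge = fit_with_priors(x, y, lam=1.0)   # penalty on large coefficients
beta_smooth = fit_with_priors(x, y, mu=1.0)   # "coefficients vary smoothly"
```

With the identity design and $N = 3$, the large-coefficient penalty with $\lambda = 1$ simply divides every coefficient by $1 + N\lambda^2 = 4$, while the smoothness penalty pulls the spike in the middle coefficient toward its neighbours, which `roughness` makes measurable.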