Martin Loetzsch 
How to Build a Data Warehouse? 
Project A Ventures, Berlin 
http://project-a.com 
http://twitter.com/martin-loetzsch
The “typical startup” 
‣ Has data in 
• application database 
• Excel & csv files 
• external tools 
‣ Excel based reporting chains 
• manual sql queries, CSVs 
• copy & paste from external data sources 
• difficult to debug and test 
• sometimes cranky 
‣ Everybody pulls their own numbers. # Orders?
-- count rows
SELECT count(*) FROM orders;

-- count everything except test orders
SELECT count(*) FROM orders
WHERE is_test IS NULL;

-- count everything that was once paid
SELECT count(*) FROM orders
JOIN order_history ON order_fk = order_id
WHERE status_id = 17;
‣ Does not have “big data”
‣ Will not have “big data” in the relevant future
If Excel works for your company, stick to it
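The three "# Orders" queries on this slide can return three different numbers for the same data. The sketch below runs them against an in-memory SQLite database. The table layout and the sample rows are assumptions made for illustration; only the queries themselves come from the slide.

```python
import sqlite3


def count_orders_three_ways(conn):
    """Run the three '# Orders' queries from the slide and return the results."""
    queries = {
        "all rows": "SELECT count(*) FROM orders",
        "without test orders": "SELECT count(*) FROM orders WHERE is_test IS NULL",
        "once paid": (
            "SELECT count(*) FROM orders "
            "JOIN order_history ON order_fk = order_id "
            "WHERE status_id = 17"
        ),
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal schema; the real application tables are not shown in the talk.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, is_test INTEGER);
    CREATE TABLE order_history (order_fk INTEGER, status_id INTEGER);

    -- Hypothetical sample data: order 3 is a test order.
    INSERT INTO orders VALUES (1, NULL), (2, NULL), (3, 1), (4, NULL);
    -- Status 17 is the "paid" status used on the slide; 5 is an arbitrary other status.
    INSERT INTO order_history VALUES (1, 5), (1, 17), (2, 5), (3, 17), (4, 5);
""")

order_counts = count_orders_three_ways(conn)
print(order_counts)
```

With this sample data the three definitions give 4, 3 and 2 orders, and the "once paid" figure still includes the test order. That is the kind of disagreement a single, shared definition is meant to remove.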
Data driven growth requires integrated data 
‣ Integrated data = Data Warehouse 
Integrate data!
‣ Data in the Data Warehouse is
• the single point of truth
• cleaned up & validated
• easy to access
• embedded in the organisation
‣ Connect data from different domains
[Diagram: application databases, csv files, json files and apis feed the DWH (orders, users, products, stocks, prices, emails, clicks, …), which serves reporting, marketing, crm, search, pricing, …]
How to build a Data Warehouse?
‣ 1. Use a BI solution from one of the big vendors
• classic agency business
• takes forever in startup time
• usually too expensive
‣ 2. Use a cloud based DWH solution
• covers only 80% of your business questions
• usually not possible to extend
‣ 3. Build your own, it’s easy!
• with technology that existed in the 1990s
• simple ETL scripts running inside PostgreSQL
• open source Pentaho Mondrian as query processor
• own lightweight reporting frontend
• integrated in own shop system
‣ Keep it simple & pragmatic
‣ Don’t use big data technologies if you don’t have big data
Invest in own BI infrastructure
Basis of any Data Warehouse: fact tables 
‣ Works with Excel, SQL frontends, Elasticsearch, Mondrian & other BI front ends

item_id  order_id  has_voucher  price  day    product
1        1                      20     09-30  Cat
2        1                      10     09-30  Dog
3        2         2            20     09-30  Cat
4        3                      30     09-30  Cow
5        4         4            10     10-01  Dog
6        4         4            30     10-01  Cow

‣ KPIs: aggregations on single columns
# Sold items: count(item_id)
# Orders: distinct-count(order_id)
# Orders with vouchers: distinct-count(has_voucher)
Revenue: sum(price)
Avg product price: avg(price)
‣ All time orders?
SELECT count(distinct order_id) FROM order_item;
‣ Revenue October 1st?
SELECT sum(price) FROM order_item WHERE day = '10-01';
‣ Sales by product?
SELECT product, count(item_id) FROM order_item GROUP BY product;
‣ Allowed query operations
• Aggregations (count, distinct-count, sum, avg)
• Filtering
• Grouping
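To make the KPI definitions concrete, here is a minimal sketch that loads the six sample rows into an in-memory SQLite table and evaluates each aggregation. It assumes that the empty has_voucher cells on the slide are NULLs and that a filled cell repeats the order id, which is what makes distinct-count(has_voucher) count orders with vouchers.

```python
import sqlite3

# The six rows of the sample fact table; None marks an empty has_voucher cell
# (an assumption about how the slide's table is meant to be read).
ROWS = [
    # item_id, order_id, has_voucher, price, day, product
    (1, 1, None, 20, "09-30", "Cat"),
    (2, 1, None, 10, "09-30", "Dog"),
    (3, 2, 2, 20, "09-30", "Cat"),
    (4, 3, None, 30, "09-30", "Cow"),
    (5, 4, 4, 10, "10-01", "Dog"),
    (6, 4, 4, 30, "10-01", "Cow"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_item ("
    "item_id INTEGER, order_id INTEGER, has_voucher INTEGER, "
    "price INTEGER, day TEXT, product TEXT)"
)
conn.executemany("INSERT INTO order_item VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# One aggregation per KPI, each on a single column.
kpis = conn.execute(
    "SELECT count(item_id), "              # Sold items
    "       count(DISTINCT order_id), "    # Orders
    "       count(DISTINCT has_voucher), " # Orders with vouchers (NULLs are ignored)
    "       sum(price), "                  # Revenue
    "       avg(price) "                   # Avg product price
    "FROM order_item"
).fetchone()

# Filtering and grouping, as in the example questions.
revenue_oct_1 = conn.execute(
    "SELECT sum(price) FROM order_item WHERE day = '10-01'"
).fetchone()[0]
sales_by_product = dict(
    conn.execute("SELECT product, count(item_id) FROM order_item GROUP BY product")
)

print(kpis, revenue_oct_1, sales_by_product)
```

Every question is answered by aggregating, filtering and grouping on one table, which is why the same fact table can sit behind Excel, a SQL frontend or Mondrian.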
Dimensional modelling 
‣ Move redundant categorical data to “dimension” tables

order_item
item_id  order_id  has_voucher  price  day_fk  product_fk
1        1                      20     930     1
2        1                      10     930     2
3        2         2            20     930     1
4        3                      30     930     3
5        4         4            10     1001    2
6        4         4            30     1001    3

day
day_id  day_name  month_id  month_name
930     09-30     9         Sep
1001    10-01     10        Oct

product
product_id  product_name
1           Cat
2           Dog
3           Cow

Key challenge: finding good keys
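As a minimal sketch of how the fact table and its dimension tables are queried together, the following loads the three sample tables into in-memory SQLite and groups revenue by attributes that live only in the dimensions. The column types and key constraints are assumptions; the slide shows only names and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE day (
        day_id INTEGER PRIMARY KEY, day_name TEXT,
        month_id INTEGER, month_name TEXT);
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_item (
        item_id INTEGER PRIMARY KEY, order_id INTEGER, has_voucher INTEGER,
        price INTEGER,
        day_fk INTEGER REFERENCES day (day_id),
        product_fk INTEGER REFERENCES product (product_id));

    INSERT INTO day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
    INSERT INTO product VALUES (1, 'Cat'), (2, 'Dog'), (3, 'Cow');
    INSERT INTO order_item VALUES
        (1, 1, NULL, 20, 930, 1),
        (2, 1, NULL, 10, 930, 2),
        (3, 2, 2,    20, 930, 1),
        (4, 3, NULL, 30, 930, 3),
        (5, 4, 4,    10, 1001, 2),
        (6, 4, 4,    30, 1001, 3);
""")

# Revenue per month: the month is an attribute of the day dimension only.
revenue_by_month = conn.execute("""
    SELECT day.month_name, sum(order_item.price)
    FROM order_item
    JOIN day ON order_item.day_fk = day.day_id
    GROUP BY day.month_id, day.month_name
    ORDER BY day.month_id
""").fetchall()

# Revenue per product: the name is looked up in the product dimension.
revenue_by_product = conn.execute("""
    SELECT product.product_name, sum(order_item.price)
    FROM order_item
    JOIN product ON order_item.product_fk = product.product_id
    GROUP BY product.product_id, product.product_name
    ORDER BY product.product_id
""").fetchall()

print(revenue_by_month, revenue_by_product)
```

The fact table keeps only keys and numbers. Names and groupings such as month_name are stored once per dimension row, so new grouping levels can be added without touching the facts.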
Real life schemas I 
‣ https://www.contorion.de/ 
early stage project 
order_item_status 
order_item_status_id 
order_item_status_sort_id 
order_item_status_name 
order_item_status_mapping 
order_item_status_fk 
order_item_status_partition_fk 
sales_event 
sales_event_id 
order_item_fk 
order_item_current_status_fk 
order_item_status_partition_fk 
order_timestamp 
event_timestamp 
hours_since_order 
hours_since_last_event 
hours_to_next_event 
estimated_net_revenue 
order_item_status_partition 
order_item_status_partition_id 
order_item_status_perspective_id 
order_item_status_perspective_name 
order_item_status_process_id 
order_item_status_process_name 
order_item_status_group_id 
order_item_status_group_name 
order_item 
order_item_id 
order_fk 
merchant_fk 
product_fk 
category_fk 
category_tree_fk 
order_process_fk 
order_item_status_fk 
processed_order_item_id 
net_shipping_revenue 
tax_amount_shipping 
gross_voucher_value 
net_voucher_value 
gross_revenue_before_voucher 
net_item_value 
gross_item_value 
tax_amount_before_voucher 
tax_amount_voucher 
gross_shipping_revenue 
gross_shipping_revenue_before_voucher 
net_purchase_cost 
gross_purchase_cost 
net_revenue_returned 
net_revenue_canceled 
net_payment_cost 
net_return_cost_and_loss_and_fraud 
net_shipping_and_fulfillment_cost 
net_marketing_expenses 
address 
zip_code 
first_name 
last_name 
city 
country_fk 
gender 
account_disabled 
company 
phone 
cell_phone 
country 
country_id 
country_name 
order 
order_id 
increment_id 
order_type_fk 
is_first_order_id 
is_follow_up_order_id 
is_second_order_id 
is_second_or_subsequent_order_id 
customer_fk 
returning_customer_fk 
order_rank_fk 
items_per_order_fk 
payment_method_fk 
payment_provider_fk 
zip_code_fk 
order_rank_1st_fk 
order_rank 
order_rank_id 
order_rank_name 
order_rank_group_id 
order_rank_group_name 
customer 
customer_id 
increment_id 
customer_name 
email 
number_of_orders 
first_order_date 
last_order_date 
avg_days_between_orders 
number_of_orders_with_vouchers 
phone 
company 
gender_fk 
customer_type_fk 
customer_group_fk 
customer_industry_fk 
order_type 
order_type_id 
order_type_name 
order_date 
day_fk 
hour_of_day_fk 
day_of_week_fk 
order_fk 
order_date_perspective_fk 
customer_date 
day_fk 
customer_fk 
customer_date_perspective_fk 
order_date_perspective 
order_date_perspective_id 
order_date_perspective_name 
hour_of_day 
hour_of_day_id 
hour_of_day_name 
day_of_week 
day_of_week_id 
day_of_week_name
day
day_id 
day_reversed_id 
day_name 
year_id 
year_reversed_id 
iso_year_id 
iso_year_reversed_id 
quarter_id 
quarter_name 
month_id 
month_reversed_id 
month_name 
week_id 
week_reversed_id 
week_name 
customer_date_perspective 
customer_date_perspective_id 
customer_date_perspective_name 
gender 
gender_id 
gender_name 
customer_type 
customer_type_id 
customer_type_name 
customer_group 
customer_group_id 
customer_group_name 
sales_event_duration_perspective 
sales_event_duration_perspective_id 
sales_event_duration_perspective_name 
sales_event_duration 
sales_event_fk 
duration_fk 
sales_event_duration_perspective_fk 
duration 
duration_id 
days 
days_name 
weeks 
weeks_name 
months 
months_name 
quarters 
quarters_name 
years 
years_name 
sales_event_date 
sales_event_fk 
day_fk 
sales_event_date_perspective_fk 
sales_event_date_perspective 
sales_event_date_perspective_id 
sales_event_date_perspective_name 
product 
product_id 
sku 
ean 
sales_sku 
merchant_sku 
product_name 
category 
category_id 
category_parent_fk 
category_name 
newsletter_event 
newsletter_event_id 
day_fk 
campaign_fk 
customer_increment_fk 
sent 
bounce 
bounce_block 
bounce_soft 
bounce_hard 
bounce_reason_fk 
open 
first_open 
click 
first_click 
url_fk 
complaint 
subsequent_order_fk 
first_order 
gross_revenue_before_voucher 
gross_voucher_value 
net_voucher_value 
tax_amount_before_voucher 
net_purchase_cost 
url 
cost_per_campaign_and_day 
day_fk 
campaign_fk 
number_of_clicks 
imported_cost_mci 
imported_cost_api 
cost_of_clicks_directly_assigned 
cost_of_clicks_campaigns_without_clicks 
cost_of_clicks_unknown_campaign 
campaign 
campaign_id 
campaign_name 
subject 
sent_hour 
sent_date 
campaign 
campaign_id 
campaign_name 
level_3_id 
level_3_name 
level_2_id 
level_2_name 
channel_fk 
channel 
channel_id 
channel_name 
Star schema, galaxy schema, nth normal form? Doesn’t matter, do what’s fastest. 
campaign_click_performance 
campaign_click_fk 
performance_attribution_model_fk 
attribution_path_segment_fk 
number_of_signups 
number_of_activations 
number_of_transactions 
gross_revenue 
campaign_click 
campaign_click_id 
visitor_id 
day_fk 
campaign_fk 
user_fk 
path_segment_fk 
path_position_fk 
reverse_path_position_fk 
step_fk 
next_step_fk 
step_reverse_fk 
next_step_reverse_fk 
number_of_clicks 
number_of_new_visitors 
number_of_daily_visitors 
number_of_monthly_visitors 
duration_fk 
time_to_end 
marketing_cost 
path_segment 
path_segment_id 
path_segment_name 
performance_attribution_model 
performance_attribution_model_id 
performance_attribution_model_name 
user_date 
day_fk 
user_fk 
user_time_perspective_fk 
cost_per_campaign_and_day 
day_fk 
campaign_fk 
number_of_clicks 
imported_cost_mci 
imported_cost_api 
cost_of_clicks_directly_assigned 
cost_of_clicks_campaigns_without_clicks 
cost_of_clicks_unknown_campaign 
campaign 
campaign_id 
campaign_name 
level_3_id 
level_3_name 
level_2_id 
level_2_name 
channel_fk 
corridor_fk 
conversion_path_transition 
conversion_path_transition_id 
conversion_path_transition_name 
channel_with_position_id 
channel_with_position_name 
reverse_path_position 
reverse_path_position_id 
reverse_path_position_name 
path_position 
path_position_id 
path_position_name 
user 
user_id 
number_of_users 
customer_id 
number_of_customers 
repeat_customer_id 
gender_fk 
age_fk 
user_city_fk 
user_state_fk 
user_country_fk 
most_freq_corridor_fk 
total_transaction_range_fk 
referral_source_fk 
transaction_frequency_fk 
has_sent_cash_id 
has_sent_airtime_id 
has_sent_cash_and_airtime_id 
sent_amount_money_transfer 
number_of_transactions 
number_of_transactions_with_voucher 
fees 
fx_gain 
sent_amount_airtime 
voucher_cost_money_transfer 
voucher_cost_airtime 
days_between_signup_and_first_transaction 
days_between_signup_and_second_transaction 
days_between_first_and_second_transaction 
days_between_second_and_third_transaction 
average_days_between_transactions 
days_since_last_transaction 
days_since_last_login 
day 
day_id 
day_name 
year_id 
iso_year_id 
quarter_id 
quarter_name 
month_id 
month_name 
week_id 
week_name 
day_of_week_id 
day_of_week_name 
day_of_month_id 
day_of_month_reversed_id 
number_of_days_in_month 
duration 
duration_id 
days 
days_name 
weeks 
weeks_name 
months 
months_name 
quarters 
quarters_name 
years 
years_name 
conversions 
conversions_name 
channel 
channel_id 
channel_name 
corridor 
corridor_id 
corridor_name 
sender_country_fk 
sender_country_name 
receiver_country_fk 
receiver_country_name 
campaign_cohort 
performance_attribution_model_fk 
user_fk 
campaign_fk 
channel_fk 
day_fk 
duration_fk 
number_of_transactions 
gross_revenue 
age 
age_id 
age_name 
age_group_id 
age_group_name 
user_city 
user_city_id 
user_city_name 
user_state_id 
user_state_name 
user_country_fk 
user_country_name 
gender 
gender_id 
gender_name 
referral_source 
referral_source_id 
referral_source_name 
total_transaction_range 
total_transaction_range_id 
total_transaction_range_name 
country 
country_id 
country_code 
country_name 
transaction_frequency 
transaction_frequency_id 
transaction_frequency_name 
transaction 
transaction_id 
number_of_transactions 
number_of_first_transactions 
number_of_second_transactions 
number_of_third_transactions 
number_of_subsequent_transactions 
number_of_transactions_with_voucher 
number_of_first_transactions_with_voucher 
number_of_money_transfer_transactions 
number_of_airtime_transactions 
number_of_on_hold_transactions 
number_of_pending_transactions 
number_of_paid_transactions 
is_repeat_customer_id 
transaction_status_fk 
cancellation_status_fk 
customer_fk 
user_fk 
sender_city_fk 
receiver_city_fk 
sender_currency_fk 
receiver_currency_fk 
correspondent_fk 
voucher_fk 
corridor_fk 
payment_method_fk 
receive_method_fk 
transaction_rank_fk 
bank_fk 
sent_amount_range_fk 
sent_amount_money_transfer 
sent_amount_airtime 
receive_amount_creation 
receive_amount_payout 
total_to_pay 
fx_gain 
fees 
voucher_cost_money_transfer 
voucher_cost_airtime 
fx_rate_gbp_to_sent_amount 
fx_rate_gbp_to_receive_amount_creation_date 
fx_rate_gbp_to_receive_amount_payout_date 
fx_rate_sent_to_receive 
bank 
bank_id 
bank_name 
transaction_rank 
transaction_rank_id 
transaction_rank_name 
transaction_rank_group_id 
transaction_rank_group_name 
cancellation_status 
cancellation_status_id 
cancellation_status_name
correspondent
correspondent_id 
correspondent_name 
transaction_city 
transaction_city_id 
transaction_city_name 
transaction_state_id 
transaction_state_name 
transaction_country_fk 
transaction_country_name 
transaction_country_code 
transaction_capital_latitude 
transaction_capital_longitude 
currency 
currency_id 
currency_code 
currency_name 
voucher 
voucher_id 
voucher_name 
voucher_type_id 
voucher_type_name 
voucher_percentage_id 
voucher_percentage_name 
voucher_receive_method_group_id 
voucher_receive_method_group_name 
voucher_start_date_fk 
voucher_end_date_fk 
voucher_duration_days_id 
voucher_duration_days_name 
voucher_duration_range_id 
voucher_duration_range_name 
payment_method 
payment_method_id 
payment_method_name 
payment_method_group_id 
payment_method_group_name 
Real life schemas II 
sent_amount_range 
sent_amount_range_id 
origin_currency_fk 
sent_amount_range_name 
range_lower_limit 
range_upper_limit 
receive_method 
receive_method_id 
receive_method_name 
receive_method_group_id 
receive_method_group_name 
receive_service_id 
transaction_status 
transaction_status_id 
transaction_status_name 
foreign_exchange_rate 
foreign_exchange_rate_id 
day_fk 
sender_currency_fk 
receiver_currency_fk 
foreign_exchange_rate 
foreign_exchange_rate_without_markup 
voucher_usage_fact 
day_fk 
voucher_fk 
voucher_duration_days_id 
voucher_duration_days_name 
voucher_duration_range_id 
voucher_duration_range_name 
voucher_start_date_fk 
voucher_end_date_fk 
voucher_is_money_transfer_id 
voucher_is_airtime_id 
voucher_is_valid_id 
voucher_is_used_id 
voucher_receive_method_id 
voucher_receive_method_name 
number_of_customers 
number_of_transactions 
number_of_first_transactions 
fees 
voucher_cost_money_transfer 
voucher_cost_airtime 
fx_gain 
sent_amount_money_transfer 
sent_amount_airtime 
transaction_event_date 
transaction_event_fk 
day_fk 
transaction_event_time_perspective_fk 
transaction_event 
transaction_event_id 
number_of_transaction_events 
transaction_fk 
previous_status_fk 
current_status_fk 
hours_since_transaction 
hours_since_last_event 
hours_to_next_event 
sent_amount_money_transfer 
sent_amount_airtime 
voucher_cost_money_transfer 
voucher_cost_airtime 
fx_gain 
fees 
transaction_event_time_perspective 
transaction_event_time_perspective_id 
transaction_event_time_perspective_name 
transaction_date 
day_fk 
transaction_fk 
transaction_time_perspective_fk 
transaction_duration 
duration_fk 
transaction_fk 
transaction_time_perspective_fk 
transaction_time_perspective 
transaction_time_perspective_id 
transaction_time_perspective_name 
transaction_event_duration 
transaction_event_fk 
duration_fk 
transaction_event_duration_perspective_fk 
‣ https://www.worldremit.com/ 
finished soon* project 
* A Data Warehouse is never finished
order_item 
order_item_id 
processed_order_item_id 
is_original_id 
is_print_id 
processed_product_id 
order_fk 
product_fk 
order_item_status_fk 
price_range_fk 
order_process_fk 
option_fk 
fulfillment_provider_fk 
refund_reason_fk 
gross_revenue_item 
net_item_price 
net_item_price_first_order 
vat_amount 
net_shipping_revenue 
net_shipping_revenue_first_order 
duties_amount 
gross_revenue_item_option 
net_option_price 
net_option_price_first_order 
net_payment_cost 
net_option_cost 
net_printing_cost 
net_voucher_amount_saatchi_share 
net_voucher_amount_artist_share 
net_voucher_amount_saatchi_share_first_order 
net_voucher_amount_artist_share_first_order 
artist_commission 
artist_royalties 
estimated_net_revenue_after_vouchers 
origin_country_iso2 
origin_latitude 
origin_longitude 
destination_latitude 
destination_longitude 
artwork 
artwork_id 
artist_fk 
showdown_fk 
artwork_category_fk 
artwork_subject_fk 
artwork_is_curated_fk 
artwork_is_user_collection_fk 
artwork_is_admin_collection_fk 
artwork_related_fk 
artwork_sale_category_fk 
artwork_for_sale_as_print_fk 
artwork_for_sale_as_original_fk 
date_uploaded_fk 
artwork_in_showdown_fk 
artwork_in_weekly_roundup_fk 
artwork_is_visible 
artwork_is_in_curated 
artwork_is_in_user_collection 
artwork_is_in_admin_collection 
user_collections_per_artwork 
admin_collections_per_artwork 
url 
title 
styles 
artist_name 
artist_first_name 
artist_last_name 
option 
option_id 
option_name 
artwork_for_sale_as_original 
artwork_for_sale_as_original_id 
artwork_for_sale_as_original_name 
Real life schemas III 
artwork_category 
artwork_category_id 
artwork_category_name 
artwork_in_showdown 
artwork_in_showdown_id 
artwork_in_showdown_name 
artwork_for_sale_as_print 
artwork_for_sale_as_print_id 
artwork_for_sale_as_print_name 
artwork_in_weekly_roundup 
artwork_in_weekly_roundup_id 
artwork_in_weekly_roundup_name 
artwork_is_admin_collection 
artwork_is_admin_collection_id 
artwork_is_admin_collection_name 
artwork_is_curated 
artwork_is_curated_id 
artwork_is_curated_name 
artwork_is_user_collection 
artwork_is_user_collection_id 
artwork_is_user_collection_name
artwork_related
artwork_related_id 
artwork_related_name 
artwork_sale_category 
artwork_sale_category_id 
artwork_sale_category_name 
artwork_subject 
artwork_subject_id 
artwork_subject_name 
round 
round_id 
showdown_id 
showdown_round 
showdown_title_sort_id 
showdown_title 
user 
user_id 
user_type_fk 
user_status_fk 
user_city_fk 
artist_with_artwork_for_sale_id 
artist_with_artwork_uploaded_id 
user_name 
user_first_name 
user_last_name 
email 
number_of_weekly_roundup 
number_of_showdown 
number_of_artwork_comments 
number_of_collection_comments 
number_of_artworks_in_user_collections 
number_of_user_likes 
number_of_collection_favourites 
number_of_user_logins 
number_of_messages_sent 
number_of_uploads 
hours_to_first_upload 
number_of_bought_items 
number_of_originals_bought 
number_of_prints_bought 
number_of_orders_made 
net_item_price_bought 
net_item_revenue_bought 
gross_revenue_after_vouchers_bought 
net_revenue_after_vouchers_bought 
net_voucher_cost_bought 
number_of_sold_items 
number_of_originals_sold 
number_of_prints_sold 
number_of_orders_sold 
net_item_price_sold 
net_item_revenue_sold 
net_voucher_cost_sold 
product 
product_id 
sku 
artwork_fk 
product_category_fk 
substrate_fk 
product_category 
product_category_id 
product_category_name 
edition_type 
substrate 
substrate_id 
substrate_name 
collection_artwork_order_item 
collection_artwork_order_item_id 
collection_fk 
artwork_fk 
order_item_fk 
collection 
collection_id 
collection_name 
user_fk 
collection_type_fk 
collection_detailed_type_fk 
date_created_fk 
date_initiated_fk 
artwork_style_mapping 
artwork_fk 
artwork_style_fk
artwork_style
artwork_style_id 
artwork_style_name 
artwork_in_collection 
artwork_fk 
collection_fk 
sales_event_duration_perspective 
sales_event_duration_perspective_id 
sales_event_duration_perspective_name 
sales_time_perspective 
sales_time_perspective_id 
sales_time_perspective_name 
collection_artwork_order_item_date 
collection_artwork_order_item_fk 
day_fk 
collection_artwork_order_item_time_perspective_fk 
day 
day_id 
day_name 
year_id 
iso_year_id 
quarter_id 
quarter_name 
month_id 
month_name 
week_id 
week_name 
day_of_the_month 
number_of_days_in_month 
iso_date 
collection_artwork_order_item_time_perspective 
collection_artwork_order_item_time_perspective_id 
collection_artwork_order_item_time_perspective_name 
collection_detailed_type 
collection_detailed_type_id 
collection_detailed_type_name 
collection_type 
collection_type_id 
collection_type_name 
campaign_click_date 
campaign_click_fk 
day_fk 
online_marketing_time_perspective_fk 
campaign_click 
campaign_click_id 
campaign_fk 
search_phrase_fk 
referrer_fk 
user_fk 
number_of_clicks 
number_of_daily_visits 
number_of_monthly_visits 
number_of_new_visits 
number_of_daily_visitors 
number_of_monthly_visitors 
subsequent_registration_fk 
subsequent_confirmation_fk 
subsequent_first_order_fk 
subsequent_order_fk 
direct_cost 
cost_of_campaigns_without_clicks 
unmatched_cost 
visit_duration 
online_marketing_time_perspective 
online_marketing_time_perspective_id 
online_marketing_time_perspective_name 
email_event_date 
email_event_fk 
day_fk 
email_time_perspective_fk 
email_event 
email_event_id 
email_list_fk 
email_campaign_fk 
email_recipient_fk 
subscribe 
unsubscribe 
email_unsubscribe_reason_fk 
sent 
bounce_soft 
bounce_hard 
open 
first_open 
click 
first_click 
subsequent_order 
subsequent_first_order 
items 
net_item_price 
net_option_price 
net_shipping_revenue 
net_voucher_amount_saatchi_share 
net_voucher_amount_artist_share 
email_time_perspective 
email_time_perspective_id 
email_time_perspective_name 
transactional_mail 
number_of_mails_sent 
transactional_mail_type_fk 
day_fk 
transactional_mail_type 
transactional_mail_type_id 
transactional_mail_type_name 
sales_event_date 
sales_event_fk 
day_fk 
sales_event_date_perspective_fk 
sales_event 
sales_event_id 
order_item_fk 
order_item_current_status_fk 
order_item_status_partition_fk 
order_timestamp 
event_timestamp 
hours_since_order 
hours_since_last_event 
hours_to_next_event 
effected_net_revenue_after_vouchers 
estimated_net_revenue_after_vouchers 
sales_event_date_perspective 
sales_event_date_perspective_id 
sales_event_date_perspective_name 
order_date 
day_fk 
order_fk 
order_date_perspective_fk 
order 
order_id 
order_increment_id 
processed_order_id 
is_first_order_id 
is_second_order_id 
is_second_or_subsequent_order_id 
order_with_voucher_id 
user_fk 
returning_buyer_fk 
hour_of_day_fk 
voucher_fk 
payment_method_fk 
payment_provider_fk 
shipping_city_fk 
order_source_fk 
order_date_perspective 
order_date_perspective_id 
order_date_perspective_name 
sales_event_duration 
sales_event_fk 
duration_fk 
sales_event_duration_perspective_fk 
duration 
duration_id 
days 
days_name 
weeks 
weeks_name 
months 
months_name 
quarters 
quarters_name 
five_day_period 
five_day_period_name 
order_duration 
order_fk 
duration_fk 
sales_time_perspective_fk 
fulfillment_provider 
fulfillment_provider_id 
fulfillment_provider_name 
order_item_status 
order_item_status_id 
order_item_status_sort_id 
order_item_status_name 
order_process 
order_process_id 
order_process_name 
checkout_type_id 
checkout_type 
fulfillment_type_id 
fulfillment_type 
price_range 
price_range_id 
price_range_name 
refund_reason 
refund_reason_id 
refund_reason_name 
refund_code_id 
hour_of_day 
hour_of_day_id 
hour_of_day_name 
order_source 
order_source_id 
order_source_name 
payment_method 
payment_method_id 
payment_method_name 
payment_provider 
payment_provider_id 
payment_provider_name 
shipping_city 
shipping_city_id 
shipping_city_name 
shipping_country_id 
shipping_country_name 
voucher 
voucher_id 
voucher_name 
order_item_status_partition 
order_item_status_partition_id 
order_item_status_perspective_id 
order_item_status_perspective_name 
order_item_status_group_id 
order_item_status_group_name 
order_item_refunds 
order_item_refunds_id 
order_item_fk 
refund_code_id 
refund_code 
refund_desc 
refund_amount 
refund_date 
refund_comment 
order_item_status_mapping 
order_item_status_fk 
order_item_status_partition_fk 
email_campaign 
email_campaign_id 
email_campaign_name 
email_list_fk 
email_unsubscribe_reason 
email_unsubscribe_reason_id 
email_unsubscribe_reason_name 
email_recipient 
email_recipient_id 
email 
email_recipient_location_fk 
email_list 
email_list_id 
email_list_name 
email_recipient_location 
email_recipient_location_id 
country_id 
country_name 
region_id 
region_name 
latitude 
longitude 
user_city 
user_city_id 
user_city 
user_country_id 
user_country 
user_status 
user_status_id 
user_status_name 
user_type 
user_type_id 
user_type_name 
user_event_date_registration 
user_event_fk 
day_fk 
user_event_time_perspective_fk 
user_event 
user_event_id 
user_fk 
user_type_fk 
user_event_date 
registration_date 
weekly_roundup 
showdown 
artwork_comment 
collection_comment 
artwork 
user_likes 
collection_favourite 
user_login 
message_sent 
artwork_upload 
artwork_for_sale_as_print 
artwork_for_sale_as_original 
artwork_for_sale_as_both_print_and_original 
artwork_for_sale_as_either_print_or_original 
signup 
verified_signup 
user_order 
time_since_signup 
time_since_last_order 
user_event_date_event 
user_event_fk 
day_fk 
user_event_time_perspective_fk 
referrer 
referrer_id 
referrer_name 
referrer_type_name 
campaign 
campaign_id 
campaign_name 
campaign_code 
channel_id 
channel_name 
is_brand_id 
is_brand_name 
partner_or_adwords_account_id 
partner_or_adwords_account_name 
publication_or_adwords_campaign_id 
publication_or_adwords_campaign_name 
wmc_or_adwords_adgroup_id 
wmc_or_adwords_adgroup_name 
search_phrase 
search_phrase_id 
search_phrase_name 
search_phrase_type_name 
user_date 
user_fk 
day_fk 
sales_time_perspective_fk 
campaign_click_position 
campaign_click_fk 
conversion_type_fk 
campaign_click_performance 
campaign_click_fk 
performance_attribution_model_fk 
conversion_type_fk 
number_of_registrations 
number_of_leads 
number_of_orders 
number_of_received_orders 
number_of_first_orders 
number_of_orders_with_voucher 
net_order_revenue
performance_attribution_model
performance_attribution_model_id 
performance_attribution_model_name 
‣ http://www.saatchiart.com/ 
exit August 2014
Data integration 
‣ Visual ETL tools
• many data source connectors
• hard to debug
• slow to change
Optimize for change speed!
‣ Start with simple sql queries & batch scripts
cat create-tables.sql | psql dwh

cat load-order.sql \
  | mysql --skip-column-names source_db \
  | psql dwh --command="COPY tmp.order FROM STDIN WITH NULL AS 'NULL'"

cat /data/payment.csv \
  | python payment_filter.py \
  | psql dwh --command="COPY tmp.payment FROM STDIN"

cat transform-order.sql | psql dwh
‣ Later build something more robust
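The talk does not show payment_filter.py. The following is only a guess at the kind of script that could sit in that pipeline: it reads CSV and emits tab separated lines with \N for missing values, which is the default text format that PostgreSQL's COPY ... FROM STDIN expects. The column names (payment_id, order_id, amount) and the filtering rule are hypothetical.

```python
import csv
import io


def filter_payments(csv_lines):
    """Yield COPY-ready lines (tab separated, \\N for NULL) for usable payments.

    Hypothetical rule: payments without an order id are dropped.
    Values containing tabs or backslashes would need escaping; not handled here.
    """
    for row in csv.DictReader(csv_lines):
        order_id = row["order_id"].strip()
        if not order_id:
            continue  # cannot be linked to an order, so skip it
        amount = row["amount"].strip()
        yield "\t".join([row["payment_id"].strip(), order_id, amount or r"\N"])


def run(in_stream, out_stream):
    """Filter in_stream to out_stream, one COPY line per kept payment."""
    for line in filter_payments(in_stream):
        out_stream.write(line + "\n")


# Demo with an in-memory file instead of /data/payment.csv.
sample = io.StringIO("payment_id,order_id,amount\n1,10,19.99\n2,,5.00\n3,11,\n")
output = io.StringIO()
run(sample, output)
print(output.getvalue(), end="")
```

Calling run(sys.stdin, sys.stdout) at the end of the file would make the script usable between cat and psql as in the shell pipeline. Keeping each step a small filter on standard input and output is what makes the batch script approach quick to change.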
Data integration in Yves & Zed 
‣ Jobs = processing steps with dependencies 
• parallel execution with cost based scheduler 
• robust, transparent, no black boxes 
‣ Parallel jobs & incremental processing 
‣ Extensive visualisations & monitoring tools
Plain text files 
‣ Very git-friendly 
<?xml version="1.0" encoding="UTF-8"?>
<process xmlns="http://project-a.com/dwh-process"
         id="operational-data" ..>

  <initial-job id="initialize-schemas">
    <description>Recreates schemas and writes configs</description>
    <commands>
      ..
    </commands>
  </initial-job>

  <!-- Orders -->
  <job id="load-order">
    <description>Loads orders into tmp.order</description>
    <commands>
      <execute-sql-file file-name="orders/create-order-tmp-table.sql" echo-queries="true"/>
      <load-from-mysql file-name="orders/load-order.sql"
                       target-table="tmp.order" database="app"
                       timezone="UTC"/>
      <execute-sql>SELECT tmp.index_tmp_order();</execute-sql>
    </commands>
  </job>

  <job id="cleanse-order">
    <description>Deletes test orders and other invalid orders</description>
    <dependencies>
      <dependency job="cleanse-member"/>
      <dependency job="load-order-item"/>
      <dependency job="load-product"/>
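The scheduler behind these job definitions is not shown in the talk. As a rough sketch of the idea of jobs with dependencies that run in parallel where possible, the following uses Python's graphlib to group jobs into waves and starts the most expensive ready job first. The job names come from the XML above, restricted to the three dependencies visible on the slide; the cost figures and the "most expensive first" rule are assumptions, not the actual cost based scheduler.

```python
from graphlib import TopologicalSorter

# job -> jobs it depends on (only the dependencies visible on the slide)
DEPENDENCIES = {
    "load-order": set(),
    "cleanse-order": {"cleanse-member", "load-order-item", "load-product"},
}

# Hypothetical runtimes in seconds, e.g. measured during the previous run.
COST = {
    "load-order": 20,
    "load-order-item": 30,
    "load-product": 5,
    "cleanse-member": 10,
    "cleanse-order": 15,
}


def execution_waves(dependencies, cost):
    """Group jobs into waves; all jobs of one wave could run in parallel.

    Within a wave, jobs are ordered by descending cost so that long running
    jobs are started first.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda job: (-cost.get(job, 0), job))
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave has finished
    return waves


waves = execution_waves(DEPENDENCIES, COST)
for number, jobs in enumerate(waves, start=1):
    print(number, jobs)
```

A real runner would mark each job as done when it actually finishes rather than waiting for the whole wave, so that dependent jobs can start as early as possible.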
MDX = query language for multidimensional data 
‣ Developed by Microsoft as part of Analysis Services 
• http://en.wikipedia.org/wiki/MultiDimensional_eXpressions 
Each KPI is always computed in the same way

SELECT
  TopCount([Product].[Product].Members, 2,
           [Measures].[Revenue])
  ON COLUMNS,
  [Measures].[Revenue]
  ON ROWS
FROM [Pet sales]
WHERE [Date].[Month].[Oct]

SELECT [Date].[Month].Members
  ON COLUMNS,
  CrossJoin({[Measures].[Sold items],
             [Measures].[# Orders],
             [Measures].[Revenue]},
            Descendants([Product].[All products]))
  ON ROWS
FROM [Pet sales]

[Schema diagram: order_item (item_id, order_id, has_voucher, price, day_fk, product_fk), day (day_id, day_name, month_id, month_name), product (product_id, product_name)]
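To give a feel for what the TopCount query asks for, here is a hedged sketch that answers the same question (the two products with the highest revenue in October) with plain SQL over the sample pet sales tables in SQLite. This is not the SQL that an MDX engine would generate, only an equivalent formulation for this small example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE day (day_id INTEGER PRIMARY KEY, day_name TEXT,
                      month_id INTEGER, month_name TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_item (item_id INTEGER PRIMARY KEY, order_id INTEGER,
                             has_voucher INTEGER, price INTEGER,
                             day_fk INTEGER, product_fk INTEGER);

    INSERT INTO day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
    INSERT INTO product VALUES (1, 'Cat'), (2, 'Dog'), (3, 'Cow');
    INSERT INTO order_item VALUES
        (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
        (3, 2, 2, 20, 930, 1),    (4, 3, NULL, 30, 930, 3),
        (5, 4, 4, 10, 1001, 2),   (6, 4, 4, 30, 1001, 3);
""")

# WHERE [Date].[Month].[Oct]          -> filter on the day dimension
# [Measures].[Revenue]                -> sum(price)
# TopCount(product members, 2, ...)   -> ORDER BY revenue DESC LIMIT 2
top_products_october = conn.execute("""
    SELECT product.product_name, sum(order_item.price) AS revenue
    FROM order_item
    JOIN day ON order_item.day_fk = day.day_id
    JOIN product ON order_item.product_fk = product.product_id
    WHERE day.month_name = 'Oct'
    GROUP BY product.product_id, product.product_name
    ORDER BY revenue DESC
    LIMIT 2
""").fetchall()

print(top_products_october)
```

The MDX version names dimensions, levels and measures instead of tables and joins, so whoever writes the query does not need to know how the star schema is laid out.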
Mondrian = engine for executing MDX 
‣ Open source analytics processor 
• http://mondrian.pentaho.com 
• http://en.wikipedia.org/wiki/Mondrian_OLAP_server 
• In Java 
• Eclipse Public License 
• Active community 
• https://github.com/pentaho/mondrian/ 
‣ Part of Pentaho BI platform
[Book cover: "Open source business analytics", William D. Back, Nicholas Goodman, Julian Hyde, Manning]
Mondrian schema I 
‣ The relation between fact tables and dimension tables is defined in an XML file
<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>

  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
      <Level name="Month" column="month_id" nameColumn="month_name"
             type="Integer" levelType="TimeMonths" uniqueMembers="true"/>
      <Level name="Day" column="day_id" nameColumn="day_name"
             type="Integer" levelType="TimeDays" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
      <Table schema="dim" name="product"/>
      <Level name="Product" column="product_id" nameColumn="product_name"
             type="Integer" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  ..
</Cube>
[Schema diagram: order_item (item_id, order_id, has_voucher, price, day_fk, product_fk), day (day_id, day_name, month_id, month_name), product (product_id, product_name)]
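The schema file carries enough information to derive the joins between the fact table and each dimension. The sketch below parses a copy of the cube definition (with the elided part left out so that it is valid XML) using Python's standard library and prints one join condition per dimension. It only illustrates how the mapping can be read; it is not how Mondrian itself processes the file.

```python
import xml.etree.ElementTree as ET

# Copy of the cube definition from the slide, without the ".." elision.
CUBE_XML = """
<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>
  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
      <Level name="Month" column="month_id" nameColumn="month_name"
             type="Integer" levelType="TimeMonths" uniqueMembers="true"/>
      <Level name="Day" column="day_id" nameColumn="day_name"
             type="Integer" levelType="TimeDays" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>
  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
      <Table schema="dim" name="product"/>
      <Level name="Product" column="product_id" nameColumn="product_name"
             type="Integer" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>
</Cube>
"""


def join_conditions(cube_xml):
    """Map each dimension name to the SQL join condition implied by the schema."""
    cube = ET.fromstring(cube_xml.strip())
    fact_table = cube.find("Table").get("name")
    joins = {}
    for dimension in cube.findall("Dimension"):
        hierarchy = dimension.find("Hierarchy")
        dimension_table = hierarchy.find("Table").get("name")
        joins[dimension.get("name")] = (
            f"{fact_table}.{dimension.get('foreignKey')} = "
            f"{dimension_table}.{hierarchy.get('primaryKey')}"
        )
    return joins


joins = join_conditions(CUBE_XML)
for name, condition in joins.items():
    print(name, "->", condition)
```

The foreignKey attribute belongs to the fact table side and primaryKey to the dimension table side, which is why a query can refer to [Date] or [Product] without spelling out any join.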
Mondrian schema II 
‣ Measures are defined as aggregates on columns
<Cube name="Pet sales" defaultMeasure="# Orders">
  ..
  <Measure name="# Orders" column="order_id" datatype="Integer" aggregator="distinct-count" formatString="Standard"/>

  <Measure name="Revenue" column="price" datatype="Integer" aggregator="sum" formatString="Currency"/>

  <Measure name="Sold items" column="item_id" datatype="Integer" aggregator="count" formatString="Standard"/>

  <CalculatedMember name="Avg cart value" dimension="Measures">
    <Formula>[Measures].[Revenue] / [Measures].[# Orders]</Formula>
  </CalculatedMember>
</Cube>
Each KPI is always computed in the same way
‣ Mondrian = SQL query generator
SELECT [Date].[Month].Members
  ON COLUMNS,
  [Measures].[Avg cart value]
  ON ROWS
FROM [Pet sales]
➞
SELECT
  "day"."month_id" AS "c0",
  count(DISTINCT "order_item"."order_id") AS "m0",
  sum("order_item"."price") AS "m1"
FROM
  "dim"."day" AS "day",
  "dim"."order_item" AS "order_item"
WHERE
  "order_item"."day_fk" = "day"."day_id"
GROUP BY
  "day"."month_id"
[Schema diagram: order_item (item_id, order_id, has_voucher, price, day_fk, product_fk), day (day_id, day_name, month_id, month_name), product (product_id, product_name)]
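The generated SQL only fetches the two base measures per month; the calculated member is then a division of the two. As a small sketch, the following runs that SQL against the sample pet sales data in SQLite and computes "Avg cart value" from the result. Two things differ from the slide: the "dim" schema prefix is dropped because the in-memory database has no schemas, and an ORDER BY is added to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE day (day_id INTEGER PRIMARY KEY, day_name TEXT,
                      month_id INTEGER, month_name TEXT);
    CREATE TABLE order_item (item_id INTEGER PRIMARY KEY, order_id INTEGER,
                             has_voucher INTEGER, price INTEGER,
                             day_fk INTEGER, product_fk INTEGER);

    INSERT INTO day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
    INSERT INTO order_item VALUES
        (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
        (3, 2, 2, 20, 930, 1),    (4, 3, NULL, 30, 930, 3),
        (5, 4, 4, 10, 1001, 2),   (6, 4, 4, 30, 1001, 3);
""")

# The SQL from the slide: c0 = month, m0 = "# Orders", m1 = "Revenue".
rows = conn.execute("""
    SELECT
      "day"."month_id" AS "c0",
      count(DISTINCT "order_item"."order_id") AS "m0",
      sum("order_item"."price") AS "m1"
    FROM
      "day" AS "day",
      "order_item" AS "order_item"
    WHERE
      "order_item"."day_fk" = "day"."day_id"
    GROUP BY
      "day"."month_id"
    ORDER BY "c0"
""").fetchall()

# [Measures].[Avg cart value] = [Measures].[Revenue] / [Measures].[# Orders]
avg_cart_value = {month: revenue / orders for month, orders, revenue in rows}

print(rows)
print(avg_cart_value)
```

Because the formula is defined once in the schema, every report that asks for "Avg cart value" divides the same revenue by the same distinct order count.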
Mondrian schema III 
‣ Everything about KPIs & dimensions (business) and tables & columns (IT) in one file
• consistent & explicit semantics
• transparency is easy
Always draw your Mondrian schema!
Ad-hoc queries with Saiku Analytics 
‣ Drag & drop reporting tool on top of Mondrian 
• Open source (Apache 2.0) 
• Talks to Mondrian via MDX 
• http://meteorite.bi/saiku 
Try it out immediately, it’s amazing: http://demo.analytical-labs.com/ 
Reports in Yves & Zed I 
‣ Own lightweight reporting frontend 
• Bootstrap / Google Charts
• lacks many features
• features are easy to implement
Numbers are random!
Reports in Yves & Zed II 
‣ Dashboard-like interactive reports 
• maintained by developers 
• each table / chart is an MDX query 
Numbers are random! 
XMLA = XML for Analysis = MDX via SOAP 
‣ Industry standard originally proposed by Microsoft 
• http://en.wikipedia.org/wiki/XML_for_Analysis 
• Soap protocol to discover and query OLAP cubes 
• Mondrian has an XMLA server 
‣ Request 
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
<SOAP-ENV:Header/>
<SOAP-ENV:Body>
<Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
<Command>
<Statement>
<![CDATA[
SELECT [Date].[Month].Members
ON COLUMNS,
[Measures].[Avg cart value]
ON ROWS
FROM [Pet sales]
]]>
</Statement>
</Command>
<Properties>
<PropertyList>
<Catalog>dwh</Catalog>
<DataSourceInfo>Monsai</DataSourceInfo>
<Format>Multidimensional</Format>
..
‣ Response 
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
<SOAP-ENV:Header ../>
<SOAP-ENV:Body>
<cxmla:ExecuteResponse xmlns:cxmla="urn:schemas-microsoft-..">
<cxmla:return>
<root>
<OlapInfo ../>
<Axes>
<Axis name="Axis0" ../>
<Axis name="Axis1">
<Tuples>
<Tuple>
<Member Hierarchy="Measures" ..>
</Tuple>
</Tuples>
</Axis>
<Axis name="SlicerAxis" ../>
</Axes>
<CellData>
<Cell CellOrdinal="0">
<Value xsi:type="xsd:double">26.666666666666668</Value>
<FmtValue>26,67 €</FmtValue>
<FormatString>Standard</FormatString>
</Cell>
<Cell CellOrdinal="1">
<Value xsi:type="xsd:double">40</Value>
<FmtValue>40,00 €</FmtValue>
<FormatString>Standard</FormatString>
</Cell>
..
21 / 25 
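The request and response shown on this slide can be produced and consumed with a few lines of standard-library code. This is a hedged sketch under assumptions: the helper names `build_execute_request` and `parse_cells` are made up, the SOAP 1.1 envelope namespace is filled in for the `..` on the slide, and cells are matched by local tag name so the sketch does not depend on the exact response namespaces.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"     # as in the request on the slide


def build_execute_request(mdx, catalog, data_source):
    """Build an XMLA Execute envelope shaped like the request on the slide."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<SOAP-ENV:Envelope xmlns:SOAP-ENV="%s">'
        "<SOAP-ENV:Header/>"
        "<SOAP-ENV:Body>"
        '<Execute xmlns="%s">'
        "<Command><Statement>%s</Statement></Command>"
        "<Properties><PropertyList>"
        "<Catalog>%s</Catalog>"
        "<DataSourceInfo>%s</DataSourceInfo>"
        "<Format>Multidimensional</Format>"
        "</PropertyList></Properties>"
        "</Execute>"
        "</SOAP-ENV:Body>"
        "</SOAP-ENV:Envelope>"
    ) % (SOAP_NS, XMLA_NS, escape(mdx), escape(catalog), escape(data_source))


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_cells(response_xml):
    """Return [(value, formatted_value)] for every <Cell>, ignoring namespaces."""
    cells = []
    for element in ET.fromstring(response_xml).iter():
        if local_name(element.tag) != "Cell":
            continue
        fields = {local_name(child.tag): child.text for child in element}
        cells.append((float(fields["Value"]), fields.get("FmtValue")))
    return cells
```

The envelope would be sent as an HTTP POST to the XMLA endpoint of the Mondrian server; the URL depends on the deployment and is deliberately left out here.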
Data Warehouse in Yves & Zed 
[Diagram: sources (application databases, csv files, json files, apis), data integration, database, monsai, reporting. Arrow labels: SQL, DB results, XMLA / MDX, XMLA response, MDX, results. Mondrian is connected to the database through the Mondrian schema mapping.]
‣ monsai = Mondrian XMLA Server + Saiku in a single war file, https://github.com/project-a/monsai 
22 / 25 
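The data integration step of this architecture can be illustrated on the sample data from the fact-table slide. In the talk this work is done by SQL scripts running inside PostgreSQL; the Python below is only a stand-in sketch, and the CSV layout, the key derivation and the function name `integrate` are assumptions made for illustration.

```python
import csv
import io

# Flat order items as they might arrive from a source system (layout is assumed).
FLAT_CSV = """item_id,order_id,has_voucher,price,day,product
1,1,,20,09-30,Cat
2,1,,10,09-30,Dog
3,2,2,20,09-30,Cat
4,3,,30,09-30,Cow
5,4,4,10,10-01,Dog
6,4,4,30,10-01,Cow
"""


def integrate(flat_csv):
    """Split flat rows into a fact table plus `day` and `product` dimensions."""
    days, products, facts = {}, {}, []
    for row in csv.DictReader(io.StringIO(flat_csv)):
        # Natural key for the day dimension: "09-30" -> 930, "10-01" -> 1001.
        day_id = int(row["day"].replace("-", ""))
        days[day_id] = row["day"]
        # Surrogate key for the product dimension, assigned in order of appearance.
        product_id = products.setdefault(row["product"], len(products) + 1)
        facts.append((
            int(row["item_id"]),
            int(row["order_id"]),
            int(row["has_voucher"]) if row["has_voucher"] else None,
            int(row["price"]),
            day_id,
            product_id,
        ))
    return facts, days, products
```

The resulting fact rows carry only keys and numbers, while the descriptive text lives once in each dimension, which is the shape the Mondrian schema maps onto.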
What kind of people do you need to hire for this? 
‣ The “typical BI expert”: 
• studied something related to business and learnt VBA programming through Excel 
• relies on others to set up databases and tools 
‣ Your ideal candidate 
• has studied computer science 
• masters the basic tools of software development and computer science 
• likes to learn new technologies 
• understands how databases work 
‣ Good profile example: 
http://www.project-a.com/en/careers/jobs/?yid=332 
Job opportunity Data Engineer / Data Scientist (m/f) at Projec... https://karriere.project-a.com/eng?yid=
For our "A-Team" we are looking to fill the following position as soon as possible 
Data Engineer / Data Scientist (m/f) 
Your tasks: 
• You will help our business intelligence team to build data driven applications for our ventures: data warehouses, recommendation engines and CRM systems (developed in-house, based on open-source technologies) 
• You will integrate, transform and index data from various data sources, develop meaningful data representations and visualisations, and provide aggregated data for third-party systems 
• You will advance our software architecture and tool set to growing challenges and data amounts (performance, scaling, data quality) 
• You will work in an agile software development process in close collaboration with a product management team 
Your profile: 
• You have a Master's degree in computer science or a comparable degree 
• You have a genuine interest in data and algorithms and you are excited about solving difficult problems and strive for efficient and robust solutions 
• You master at least these basic tools of computer science: object oriented programming in multiple languages, HTTP and current web technologies, the unix command line and basic server administration, version control systems, a basic understanding of the interplay between software and memory, hard discs and the CPU 
• You have profound knowledge about the inner workings of database systems 
• You are eager to delve into new technologies and programming languages (our current stack: Mac or Linux, PostgreSQL, Mondrian & MDX, PHP, Java, Python, Solr, ElasticSearch, R) 
• You have a basic understanding of mathematics and machine learning 
Your chance: 
• You will join a highly professional and motivated team 
• You will have the unique opportunity to witness the launch of a newly established company and you can contribute your own ideas to its development 
Search for computer scientists, not business intelligence experts 
23 / 25
Use a standard software engineering process! 
‣ Product managers: what? 
• Collection of business requirements 
• KPI & report definitions 
• QA & analysis 
‣ Developers: how? 
• Implementation, performance & stability 
• Schema & process design 
• Consistency checks 
Any kind of Scrum / Kanban works, do it 
24 / 25 
[Diagram of KPIs: Avg net revenue per buying member; % Contribution margin 1; Net revenue; Net voucher cost; Avg net voucher cost per order; Contribution margin 3a; Tax shipping amount; Tax amount; Gross revenue; Avg gross item price; Gross price to gross retail price ratio; Price to retail price ratio; Avg gross order value; % Gross voucher cost; Gross invoiced amount; Net invoiced amount; Gross retail price; Net price to net purchase price ratio; Net price to net retail price ratio; % Net discount; Avg gross invoiced amount; HGB net revenue margin; Avg gross voucher cost per buying member; Net item revenue; Tax item amount; Net purchase cost; Net retail price; Retail tax amount; Gross voucher cost; Net shipping revenue; Gross shipping revenue]
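One of the "consistency checks" listed under the developer responsibilities could be a test that every foreign key in a fact table resolves to a dimension row. The sketch below is an assumption about what such a check might look like; the function name and the tuple layout are hypothetical, and in a PostgreSQL-based warehouse the same idea would more likely be written as a SQL query.

```python
def missing_foreign_keys(fact_rows, fk_index, dimension_ids):
    """Return the sorted foreign-key values in `fact_rows` that have no dimension row.

    fact_rows      -- iterable of tuples, one per fact row
    fk_index       -- position of the foreign-key column inside each tuple
    dimension_ids  -- iterable of primary keys present in the dimension table
    """
    return sorted({row[fk_index] for row in fact_rows} - set(dimension_ids))
```

An empty result means the check passes; anything else lists the keys that would silently drop rows from a joined report.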
Thank you 
Data integration is easy if you keep things simple! 
http://www.project-a.com/ 
25 / 25

How to build a data warehouse - code.talks 2014

  • 1. Martin Loetzsch How to Build a Data Warehouse? Project A Ventures, Berlin ! http://project-a.com http://twitter.com/martin-loetzsch
  • 2. The “typical startup” ‣ Has data in • application database • Excel & csv files • external tools ‣ Excel based reporting chains • manual sql queries, CSVs • copy & paste from external data sources • difficult to debug and test • sometimes cranky ! ‣ Everybody pulls their own numbers. # Orders? ! ! ! ! ! ! ! ‣ Does not have “big data” ‣ Will not have “big data” in the relevant future 2 / 25 -- count rows! SELECT count(*) FROM orders;! ! -- count everything except test orders! SELECT count(*) FROM orders! WHERE is_test IS NULL;! ! -- count everything that was once paid! SELECT count(*) FROM orders! JOIN order_history ON order_fk = order_id! WHERE status_id = 17; If Excel works for your company, stick to it
  • 3. Data driven growth requires integrated data ‣ Integrated data = Data Warehouse csv files Integrate data! ! ! ! ! ‣ Data in the Data Warehouse is • the single point of truth • cleaned up & validated • easy to access • embedded in the organisation ‣ Connect data from different domains 3 / 25 application databases json files apis reporting marketing crm search pricing … DWH orders users products stocks prices emails clicks …
  • 4. ‣ 1. Use a BI Solutions by one of the big vendors ! ! • classic agency business • takes forever in startup time • usually too expensive ! ‣ 2. Use a cloud based DWH solution ! ! • covers only 80% of your business questions • usually not possible to extend ‣ 3. Build your own, it’s easy! ! ! • with technology that existed in the 1990s • simple ETL scripts running inside Postgresql • open source Pentaho Mondrian as query processor • own lightweight reporting frontend • integrated in own shop system ‣ Keep it simple & pragmatic ‣ Don’t use big data technologies if you don’t have big data How to build a Data Warehouse? Invest in own BI infrastructure 4 / 25
  • 5. Basis of any Data Warehouse: fact tables ‣ Works with Excel, SQL frontends, Elasticsearch, Mondrian & other BI front ends ! ! ! ‣ KPIs: aggregations on single columns ‣ All time orders? ! ‣ Revenue October 1st? ! ‣ Sales by product? ! ! ‣ Allowed query operations • Aggregations (count, distinct-count, sum, avg) • Filtering • Grouping 5 / 25 item id order id has voucher price day product 1 1 20 09-30 Cat 2 1 10 09-30 Dog 3 2 2 20 09-30 Cat 4 3 30 09-30 Cow 5 4 4 10 10-01 Dog 6 4 4 30 10-01 Cow # Sold items: count(item_id) # Orders: distinct-count(order_id) # Orders with vouchers: distinct-count(has_voucher) Revenue: sum(price) Avg product price: avg(price) SELECT count(distinct order_id) FROM order_item; SELECT sum(price) FROM order_item WHERE day = ’10-01'; SELECT count(item_id) FROM order_item GROUP BY product;
  • 6. Dimensional modelling ‣ Move redundant categorial data to “dimension” tables order_item item_id order_id has_voucher price day_fk product_fk day day_id day_name month_id month_name product product_id product_name Key challenge: finding good keys 6 / 25 item id order id has voucher price day fk product fk 1 1 20 930 1 2 1 10 930 2 3 2 2 20 930 1 4 3 30 930 3 5 4 4 10 1001 2 6 4 4 30 1001 3 day id day name month id month name 930 09-30 9 Sep 1001 10-01 10 Oct product id product name 1 Cat 2 Dog 3 Cow
  • 7. Real life schemas I ‣ https://www.contorion.de/ early stage project order_item_status order_item_status_id order_item_status_sort_id order_item_status_name order_item_status_mapping order_item_status_fk order_item_status_partition_fk sales_event sales_event_id order_item_fk order_item_current_status_fk order_item_status_partition_fk order_timestamp event_timestamp hours_since_order hours_since_last_event hours_to_next_event estimated_net_revenue order_item_status_partition order_item_status_partition_id order_item_status_perspective_id order_item_status_perspective_name order_item_status_process_id order_item_status_process_name order_item_status_group_id order_item_status_group_name order_item order_item_id order_fk merchant_fk product_fk category_fk category_tree_fk order_process_fk order_item_status_fk processed_order_item_id net_shipping_revenue tax_amount_shipping gross_voucher_value net_voucher_value gross_revenue_before_voucher net_item_value gross_item_value tax_amount_before_voucher tax_amount_voucher gross_shipping_revenue gross_shipping_revenue_before_voucher net_purchase_cost gross_purchase_cost net_revenue_returned net_revenue_canceled net_payment_cost net_return_cost_and_loss_and_fraud net_shipping_and_fulfillment_cost net_marketing_expenses address zip_code first_name last_name city country_fk gender account_disabled company phone cell_phone country country_id country_name order order_id increment_id order_type_fk is_first_order_id is_follow_up_order_id is_second_order_id is_second_or_subsequent_order_id customer_fk returning_customer_fk order_rank_fk items_per_order_fk payment_method_fk payment_provider_fk zip_code_fk order_rank_1st_fk order_rank order_rank_id order_rank_name order_rank_group_id order_rank_group_name customer customer_id increment_id customer_name email number_of_orders first_order_date last_order_date avg_days_between_orders number_of_orders_with_vouchers phone company gender_fk customer_type_fk customer_group_fk 
customer_industry_fk order_type order_type_id order_type_name order_date day_fk hour_of_day_fk day_of_week_fk order_fk order_date_perspective_fk customer_date day_fk customer_fk customer_date_perspective_fk order_date_perspective order_date_perspective_id order_date_perspective_name hour_of_day hour_of_day_id hour_of_day_name day_of_week day_of_week_id day day_of_week_name day_id day_reversed_id day_name year_id year_reversed_id iso_year_id iso_year_reversed_id quarter_id quarter_name month_id month_reversed_id month_name week_id week_reversed_id week_name customer_date_perspective customer_date_perspective_id customer_date_perspective_name gender gender_id gender_name customer_type customer_type_id customer_type_name customer_group customer_group_id customer_group_name sales_event_duration_perspective sales_event_duration_perspective_id sales_event_duration_perspective_name sales_event_duration sales_event_fk duration_fk sales_event_duration_perspective_fk duration duration_id days days_name weeks weeks_name months months_name quarters quarters_name years years_name sales_event_date sales_event_fk day_fk sales_event_date_perspective_fk sales_event_date_perspective sales_event_date_perspective_id sales_event_date_perspective_name product product_id sku ean sales_sku merchant_sku product_name category category_id category_parent_fk category_name newsletter_event newsletter_event_id day_fk campaign_fk customer_increment_fk sent bounce bounce_block bounce_soft bounce_hard bounce_reason_fk open first_open click first_click url_fk complaint subsequent_order_fk first_order gross_revenue_before_voucher gross_voucher_value net_voucher_value tax_amount_before_voucher net_purchase_cost url cost_per_campaign_and_day day_fk campaign_fk number_of_clicks imported_cost_mci imported_cost_api cost_of_clicks_directly_assigned cost_of_clicks_campaigns_without_clicks cost_of_clicks_unknown_campaign campaign campaign_id campaign_name subject sent_hour sent_date campaign campaign_id 
campaign_name level_3_id level_3_name level_2_id level_2_name channel_fk channel channel_id channel_name
    Star schema, galaxy schema, nth normal form? Doesn’t matter, do what’s fastest.
  • 8. campaign_click_performance campaign_click_fk performance_attribution_model_fk attribution_path_segment_fk number_of_signups number_of_activations number_of_transactions gross_revenue campaign_click campaign_click_id visitor_id day_fk campaign_fk user_fk path_segment_fk path_position_fk reverse_path_position_fk step_fk next_step_fk step_reverse_fk next_step_reverse_fk number_of_clicks number_of_new_visitors number_of_daily_visitors number_of_monthly_visitors duration_fk time_to_end marketing_cost path_segment path_segment_id path_segment_name performance_attribution_model performance_attribution_model_id performance_attribution_model_name user_date day_fk user_fk user_time_perspective_fk cost_per_campaign_and_day day_fk campaign_fk number_of_clicks imported_cost_mci imported_cost_api cost_of_clicks_directly_assigned cost_of_clicks_campaigns_without_clicks cost_of_clicks_unknown_campaign campaign campaign_id campaign_name level_3_id level_3_name level_2_id level_2_name channel_fk corridor_fk conversion_path_transition conversion_path_transition_id conversion_path_transition_name channel_with_position_id channel_with_position_name reverse_path_position reverse_path_position_id reverse_path_position_name path_position path_position_id path_position_name user user_id number_of_users customer_id number_of_customers repeat_customer_id gender_fk age_fk user_city_fk user_state_fk user_country_fk most_freq_corridor_fk total_transaction_range_fk referral_source_fk transaction_frequency_fk has_sent_cash_id has_sent_airtime_id has_sent_cash_and_airtime_id sent_amount_money_transfer number_of_transactions number_of_transactions_with_voucher fees fx_gain sent_amount_airtime voucher_cost_money_transfer voucher_cost_airtime days_between_signup_and_first_transaction days_between_signup_and_second_transaction days_between_first_and_second_transaction days_between_second_and_third_transaction average_days_between_transactions days_since_last_transaction days_since_last_login 
day day_id day_name year_id iso_year_id quarter_id quarter_name month_id month_name week_id week_name day_of_week_id day_of_week_name day_of_month_id day_of_month_reversed_id number_of_days_in_month duration duration_id days days_name weeks weeks_name months months_name quarters quarters_name years years_name conversions conversions_name channel channel_id channel_name corridor corridor_id corridor_name sender_country_fk sender_country_name receiver_country_fk receiver_country_name campaign_cohort performance_attribution_model_fk user_fk campaign_fk channel_fk day_fk duration_fk number_of_transactions gross_revenue age age_id age_name age_group_id age_group_name user_city user_city_id user_city_name user_state_id user_state_name user_country_fk user_country_name gender gender_id gender_name referral_source referral_source_id referral_source_name total_transaction_range total_transaction_range_id total_transaction_range_name country country_id country_code country_name transaction_frequency transaction_frequency_id transaction_frequency_name transaction transaction_id number_of_transactions number_of_first_transactions number_of_second_transactions number_of_third_transactions number_of_subsequent_transactions number_of_transactions_with_voucher number_of_first_transactions_with_voucher number_of_money_transfer_transactions number_of_airtime_transactions number_of_on_hold_transactions number_of_pending_transactions number_of_paid_transactions is_repeat_customer_id transaction_status_fk cancellation_status_fk customer_fk user_fk sender_city_fk receiver_city_fk sender_currency_fk receiver_currency_fk correspondent_fk voucher_fk corridor_fk payment_method_fk receive_method_fk transaction_rank_fk bank_fk sent_amount_range_fk sent_amount_money_transfer sent_amount_airtime receive_amount_creation receive_amount_payout total_to_pay fx_gain fees voucher_cost_money_transfer voucher_cost_airtime fx_rate_gbp_to_sent_amount fx_rate_gbp_to_receive_amount_creation_date 
fx_rate_gbp_to_receive_amount_payout_date fx_rate_sent_to_receive bank bank_id bank_name transaction_rank transaction_rank_id transaction_rank_name transaction_rank_group_id transaction_rank_group_name cancellation_status cancellation_status_id cancellation_status_name correspondent correspondent_id correspondent_name transaction_city transaction_city_id transaction_city_name transaction_state_id transaction_state_name transaction_country_fk transaction_country_name transaction_country_code transaction_capital_latitude transaction_capital_longitude currency currency_id currency_code currency_name voucher voucher_id voucher_name voucher_type_id voucher_type_name voucher_percentage_id voucher_percentage_name voucher_receive_method_group_id voucher_receive_method_group_name voucher_start_date_fk voucher_end_date_fk voucher_duration_days_id voucher_duration_days_name voucher_duration_range_id voucher_duration_range_name payment_method payment_method_id payment_method_name payment_method_group_id payment_method_group_name Real life schemas II sent_amount_range sent_amount_range_id origin_currency_fk sent_amount_range_name range_lower_limit range_upper_limit receive_method receive_method_id receive_method_name receive_method_group_id receive_method_group_name receive_service_id transaction_status transaction_status_id transaction_status_name foreign_exchange_rate foreign_exchange_rate_id day_fk sender_currency_fk receiver_currency_fk foreign_exchange_rate foreign_exchange_rate_without_markup voucher_usage_fact day_fk voucher_fk voucher_duration_days_id voucher_duration_days_name voucher_duration_range_id voucher_duration_range_name voucher_start_date_fk voucher_end_date_fk voucher_is_money_transfer_id voucher_is_airtime_id voucher_is_valid_id voucher_is_used_id voucher_receive_method_id voucher_receive_method_name number_of_customers number_of_transactions number_of_first_transactions fees voucher_cost_money_transfer voucher_cost_airtime fx_gain 
sent_amount_money_transfer sent_amount_airtime transaction_event_date transaction_event_fk day_fk transaction_event_time_perspective_fk transaction_event transaction_event_id number_of_transaction_events transaction_fk previous_status_fk current_status_fk hours_since_transaction hours_since_last_event hours_to_next_event sent_amount_money_transfer sent_amount_airtime voucher_cost_money_transfer voucher_cost_airtime fx_gain fees transaction_event_time_perspective transaction_event_time_perspective_id transaction_event_time_perspective_name transaction_date day_fk transaction_fk transaction_time_perspective_fk transaction_duration duration_fk transaction_fk transaction_time_perspective_fk transaction_time_perspective transaction_time_perspective_id transaction_time_perspective_name transaction_event_duration transaction_event_fk duration_fk transaction_event_duration_perspective_fk 8 / 25 ‣ https://www.worldremit.com/ finished soon* project * A Data Warehouse is never finished
  • 9. order_item order_item_id processed_order_item_id is_original_id is_print_id processed_product_id order_fk product_fk order_item_status_fk price_range_fk order_process_fk option_fk fulfillment_provider_fk refund_reason_fk gross_revenue_item net_item_price net_item_price_first_order vat_amount net_shipping_revenue net_shipping_revenue_first_order duties_amount gross_revenue_item_option net_option_price net_option_price_first_order net_payment_cost net_option_cost net_printing_cost net_voucher_amount_saatchi_share net_voucher_amount_artist_share net_voucher_amount_saatchi_share_first_order net_voucher_amount_artist_share_first_order artist_commission artist_royalties estimated_net_revenue_after_vouchers origin_country_iso2 origin_latitude origin_longitude destination_latitude destination_longitude artwork artwork_id artist_fk showdown_fk artwork_category_fk artwork_subject_fk artwork_is_curated_fk artwork_is_user_collection_fk artwork_is_admin_collection_fk artwork_related_fk artwork_sale_category_fk artwork_for_sale_as_print_fk artwork_for_sale_as_original_fk date_uploaded_fk artwork_in_showdown_fk artwork_in_weekly_roundup_fk artwork_is_visible artwork_is_in_curated artwork_is_in_user_collection artwork_is_in_admin_collection user_collections_per_artwork admin_collections_per_artwork url title styles artist_name artist_first_name artist_last_name option option_id option_name artwork_for_sale_as_original artwork_for_sale_as_original_id artwork_for_sale_as_original_name Real life schemas III artwork_category artwork_category_id artwork_category_name artwork_in_showdown artwork_in_showdown_id artwork_in_showdown_name artwork_for_sale_as_print artwork_for_sale_as_print_id artwork_for_sale_as_print_name artwork_in_weekly_roundup artwork_in_weekly_roundup_id artwork_in_weekly_roundup_name artwork_is_admin_collection artwork_is_admin_collection_id artwork_is_admin_collection_name artwork_is_curated artwork_is_curated_id artwork_is_curated_name 
artwork_is_user_collection artwork_is_user_collection_id artwork_is_artwork_related user_collection_name artwork_related_id artwork_related_name artwork_sale_category artwork_sale_category_id artwork_sale_category_name artwork_subject artwork_subject_id artwork_subject_name round round_id showdown_id showdown_round showdown_title_sort_id showdown_title user user_id user_type_fk user_status_fk user_city_fk artist_with_artwork_for_sale_id artist_with_artwork_uploaded_id user_name user_first_name user_last_name email number_of_weekly_roundup number_of_showdown number_of_artwork_comments number_of_collection_comments number_of_artworks_in_user_collections number_of_user_likes number_of_collection_favourites number_of_user_logins number_of_messages_sent number_of_uploads hours_to_first_upload number_of_bought_items number_of_originals_bought number_of_prints_bought number_of_orders_made net_item_price_bought net_item_revenue_bought gross_revenue_after_vouchers_bought net_revenue_after_vouchers_bought net_voucher_cost_bought number_of_sold_items number_of_originals_sold number_of_prints_sold number_of_orders_sold net_item_price_sold net_item_revenue_sold net_voucher_cost_sold product product_id sku artwork_fk product_category_fk substrate_fk product_category product_category_id product_category_name edition_type substrate substrate_id substrate_name collection_artwork_order_item collection_artwork_order_item_id collection_fk artwork_fk order_item_fk collection collection_id collection_name user_fk collection_type_fk collection_detailed_type_fk date_created_fk date_initiated_fk artwork_style_mapping artwork_fk artwork_style artwork_style_fk artwork_style_id artwork_style_name artwork_in_collection artwork_fk collection_fk sales_event_duration_perspective sales_event_duration_perspective_id sales_event_duration_perspective_name sales_time_perspective sales_time_perspective_id sales_time_perspective_name collection_artwork_order_item_date collection_artwork_order_item_fk 
day_fk collection_artwork_order_item_time_perspective_fk day day_id day_name year_id iso_year_id quarter_id quarter_name month_id month_name week_id week_name day_of_the_month number_of_days_in_month iso_date collection_artwork_order_item_time_perspective collection_artwork_order_item_time_perspective_id collection_artwork_order_item_time_perspective_name collection_detailed_type collection_detailed_type_id collection_detailed_type_name collection_type collection_type_id collection_type_name campaign_click_date campaign_click_fk day_fk online_marketing_time_perspective_fk campaign_click campaign_click_id campaign_fk search_phrase_fk referrer_fk user_fk number_of_clicks number_of_daily_visits number_of_monthly_visits number_of_new_visits number_of_daily_visitors number_of_monthly_visitors subsequent_registration_fk subsequent_confirmation_fk subsequent_first_order_fk subsequent_order_fk direct_cost cost_of_campaigns_without_clicks unmatched_cost visit_duration online_marketing_time_perspective online_marketing_time_perspective_id online_marketing_time_perspective_name email_event_date email_event_fk day_fk email_time_perspective_fk email_event email_event_id email_list_fk email_campaign_fk email_recipient_fk subscribe unsubscribe email_unsubscribe_reason_fk sent bounce_soft bounce_hard open first_open click first_click subsequent_order subsequent_first_order items net_item_price net_option_price net_shipping_revenue net_voucher_amount_saatchi_share net_voucher_amount_artist_share email_time_perspective email_time_perspective_id email_time_perspective_name transactional_mail number_of_mails_sent transactional_mail_type_fk day_fk transactional_mail_type transactional_mail_type_id transactional_mail_type_name sales_event_date sales_event_fk day_fk sales_event_date_perspective_fk sales_event sales_event_id order_item_fk order_item_current_status_fk order_item_status_partition_fk order_timestamp event_timestamp hours_since_order hours_since_last_event hours_to_next_event 
effected_net_revenue_after_vouchers estimated_net_revenue_after_vouchers sales_event_date_perspective sales_event_date_perspective_id sales_event_date_perspective_name order_date day_fk order_fk order_date_perspective_fk order order_id order_increment_id processed_order_id is_first_order_id is_second_order_id is_second_or_subsequent_order_id order_with_voucher_id user_fk returning_buyer_fk hour_of_day_fk voucher_fk payment_method_fk payment_provider_fk shipping_city_fk order_source_fk order_date_perspective order_date_perspective_id order_date_perspective_name sales_event_duration sales_event_fk duration_fk sales_event_duration_perspective_fk duration duration_id days days_name weeks weeks_name months months_name quarters quarters_name five_day_period five_day_period_name order_duration order_fk duration_fk sales_time_perspective_fk fulfillment_provider fulfillment_provider_id fulfillment_provider_name order_item_status order_item_status_id order_item_status_sort_id order_item_status_name order_process order_process_id order_process_name checkout_type_id checkout_type fulfillment_type_id fulfillment_type price_range price_range_id price_range_name refund_reason refund_reason_id refund_reason_name refund_code_id hour_of_day hour_of_day_id hour_of_day_name order_source order_source_id order_source_name payment_method payment_method_id payment_method_name payment_provider payment_provider_id payment_provider_name shipping_city shipping_city_id shipping_city_name shipping_country_id shipping_country_name voucher voucher_id voucher_name order_item_status_partition order_item_status_partition_id order_item_status_perspective_id order_item_status_perspective_name order_item_status_group_id order_item_status_group_name order_item_refunds order_item_refunds_id order_item_fk refund_code_id refund_code refund_desc refund_amount refund_date refund_comment order_item_status_mapping order_item_status_fk order_item_status_partition_fk email_campaign email_campaign_id 
email_campaign_name email_list_fk email_unsubscribe_reason email_unsubscribe_reason_id email_unsubscribe_reason_name email_recipient email_recipient_id email email_recipient_location_fk email_list email_list_id email_list_name email_recipient_location email_recipient_location_id country_id country_name region_id region_name latitude longitude user_city user_city_id user_city user_country_id user_country user_status user_status_id user_status_name user_type user_type_id user_type_name user_event_date_registration user_event_fk day_fk user_event_time_perspective_fk user_event user_event_id user_fk user_type_fk user_event_date registration_date weekly_roundup showdown artwork_comment collection_comment artwork user_likes collection_favourite user_login message_sent artwork_upload artwork_for_sale_as_print artwork_for_sale_as_original artwork_for_sale_as_both_print_and_original artwork_for_sale_as_either_print_or_original signup verified_signup user_order time_since_signup time_since_last_order user_event_date_event user_event_fk day_fk user_event_time_perspective_fk referrer referrer_id referrer_name referrer_type_name campaign campaign_id campaign_name campaign_code channel_id channel_name is_brand_id is_brand_name partner_or_adwords_account_id partner_or_adwords_account_name publication_or_adwords_campaign_id publication_or_adwords_campaign_name wmc_or_adwords_adgroup_id wmc_or_adwords_adgroup_name search_phrase search_phrase_id search_phrase_name search_phrase_type_name user_date user_fk day_fk sales_time_perspective_fk campaign_click_position campaign_click_fk conversion_type_fk campaign_click_performance campaign_click_fk performance_attribution_model_fk conversion_type_fk number_of_registrations number_of_leads number_of_orders number_of_received_orders number_of_first_orders number_of_orders_with_voucher net_order_revenue performance_attribution_model performance_attribution_model_id performance_attribution_model_name 9 / 25 ‣ http://www.saatchiart.com/ exit 
August 2014
  • 10. Data integration
    ‣ Visual ETL tools
      • many data source connectors
      • hard to debug
      • slow to change
    Optimize for change speed!
    ‣ Start with simple sql queries & batch scripts

        cat create-tables.sql | psql dwh

        cat load-order.sql \
          | mysql --skip-column-names source_db \
          | psql dwh --command="COPY tmp.order FROM STDIN WITH NULL AS 'NULL'"

        cat /data/payment.csv \
          | python payment_filter.py \
          | psql dwh --command="COPY tmp.payment FROM STDIN"

        cat transform-order.sql | psql dwh

    ‣ Later build something more robust
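The slide names payment_filter.py but does not show it. The following is a purely hypothetical sketch of what such a filter could look like: it reads CSV lines, drops rows that cannot be loaded, and emits the tab-separated text that PostgreSQL's COPY reads by default. The three-column layout (payment_id, order_id, amount) and the filtering rule are assumptions for illustration.

```python
import csv


def filter_payments(lines):
    """Yield tab-separated rows, skipping malformed or non-numeric ones.

    Hypothetical column layout: payment_id, order_id, amount.
    """
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # wrong number of columns
        payment_id, order_id, amount = (field.strip() for field in row)
        try:
            float(amount)
        except ValueError:
            continue  # header line or non-numeric amount
        yield "\t".join((payment_id, order_id, amount))


# In a real script the input would be sys.stdin and the output sys.stdout;
# a small in-memory sample keeps the sketch easy to try out.
sample = [
    "payment_id,order_id,amount",
    "1,17,20.00",
    "2,18,n/a",
    "3,19,10.50",
]
filtered = list(filter_payments(sample))
print("\n".join(filtered))
```

Keeping the filter a plain generator over lines means the same function works in a shell pipeline and in a unit test.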
  • 11. Data integration in Yves & Zed
    ‣ Jobs = processing steps with dependencies
      • parallel execution with cost-based scheduler
      • robust, transparent, no black boxes
    ‣ Parallel jobs & incremental processing
    ‣ Extensive visualisations & monitoring tools
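The idea of jobs with dependencies that run in parallel where possible can be sketched with Python's standard graphlib module. This is not the Yves & Zed scheduler: it ignores job cost entirely and only groups jobs into batches whose members could run concurrently. Most job names come from the process file on the next slide; "load-member" and the exact dependency sets are assumptions.

```python
from graphlib import TopologicalSorter

# job -> set of jobs that must finish first (partly hypothetical)
dependencies = {
    "load-member": set(),
    "load-order": set(),
    "load-order-item": set(),
    "load-product": set(),
    "cleanse-member": {"load-member"},
    "cleanse-order": {"load-order", "cleanse-member",
                      "load-order-item", "load-product"},
}


def execution_batches(dependencies):
    """Group jobs into batches; all jobs in one batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

A cost-based scheduler would additionally prioritise the ready jobs, for example by starting the historically slowest ones first.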
  • 12. Plain text files
    ‣ Very git-friendly

        <?xml version="1.0" encoding="UTF-8"?>
        <process xmlns="http://project-a.com/dwh-process"
                 id="operational-data" ..>

          <initial-job id="initialize-schemas">
            <description>Recreates schemas and writes configs</description>
            <commands>
              ..
            </commands>
          </initial-job>

          <!-- Orders -->
          <job id="load-order">
            <description>Loads orders into tmp.order</description>
            <commands>
              <execute-sql-file file-name="orders/create-order-tmp-table.sql" echo-queries="true"/>
              <load-from-mysql file-name="orders/load-order.sql"
                               target-table="tmp.order" database="app"
                               timezone="UTC"/>
              <execute-sql>SELECT tmp.index_tmp_order();</execute-sql>
            </commands>
          </job>

          <job id="cleanse-order">
            <description>Deletes test orders and other invalid orders</description>
            <dependencies>
              <dependency job="cleanse-member"/>
              <dependency job="load-order-item"/>
              <dependency job="load-product"/>
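Because the process definition is plain XML, the job graph can be read with a few lines of standard-library code. The sketch below parses a shortened, hypothetical version of the file on this slide (the real file is cut off on the slide, so the closing tags here are assumptions) and extracts each job's dependencies.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://project-a.com/dwh-process"}

# Shortened, hypothetical process file modelled on the slide
PROCESS_XML = """
<process xmlns="http://project-a.com/dwh-process" id="operational-data">
  <job id="load-order">
    <description>Loads orders into tmp.order</description>
  </job>
  <job id="cleanse-order">
    <description>Deletes test orders and other invalid orders</description>
    <dependencies>
      <dependency job="cleanse-member"/>
      <dependency job="load-order-item"/>
      <dependency job="load-product"/>
    </dependencies>
  </job>
</process>
"""


def read_dependencies(xml_text):
    """Return {job id: [ids of jobs it depends on]} for all <job> elements."""
    root = ET.fromstring(xml_text)
    return {
        job.get("id"): [
            dep.get("job")
            for dep in job.findall("p:dependencies/p:dependency", NS)
        ]
        for job in root.findall("p:job", NS)
    }


job_dependencies = read_dependencies(PROCESS_XML)
print(job_dependencies)
```

The resulting dictionary is the kind of structure a scheduler or a dependency visualisation can be built on.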
  • 13. MDX = query language for multidimensional data
    ‣ Developed by Microsoft as part of Analysis Services
      • http://en.wikipedia.org/wiki/MultiDimensional_eXpressions
    ‣ Each KPI is always computed in the same way

        SELECT
          TopCount([Product].[Product].Members, 2,
                   [Measures].[Revenue])
            ON COLUMNS,
          [Measures].[Revenue]
            ON ROWS
        FROM [Pet sales]
        WHERE [Date].[Month].[Oct]

        SELECT [Date].[Month].Members
            ON COLUMNS,
          CrossJoin({[Measures].[Sold items],
                     [Measures].[# Orders],
                     [Measures].[Revenue]},
                    Descendants([Product].[All products]))
            ON ROWS
        FROM [Pet sales]

    Tables: order_item (item_id, order_id, has_voucher, price, day_fk, product_fk), day (day_id, day_name, month_id, month_name), product (product_id, product_name)
  • 14. Mondrian = engine for executing MDX
    ‣ Open source analytics processor
      • http://mondrian.pentaho.com
      • http://en.wikipedia.org/wiki/Mondrian_OLAP_server
      • In Java
      • Eclipse Public License
      • Active community
      • https://github.com/pentaho/mondrian/
    ‣ Part of Pentaho BI platform
    [Book cover: “Open source business analytics”, William D. Back, Nicholas Goodman, Julian Hyde, Manning]
  • 15. Mondrian schema I
    ‣ The relation between fact tables and dimension tables is defined in an XML file

        <Cube name="Pet sales" defaultMeasure="# Orders">
          <Table schema="dim" name="order_item"/>

          <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
            <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
              <Table schema="dim" name="day"/>
              <Level name="Month" column="month_id" nameColumn="month_name"
                     type="Integer" levelType="TimeMonths" uniqueMembers="true"/>
              <Level name="Day" column="day_id" nameColumn="day_name"
                     type="Integer" levelType="TimeDays" uniqueMembers="true"/>
            </Hierarchy>
          </Dimension>

          <Dimension name="Product" foreignKey="product_fk">
            <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
              <Table schema="dim" name="product"/>
              <Level name="Product" column="product_id" nameColumn="product_name"
                     type="Integer" uniqueMembers="true"/>
            </Hierarchy>
          </Dimension>

          ..
        </Cube>

    Tables: order_item (item_id, order_id, has_voucher, price, day_fk, product_fk), day (day_id, day_name, month_id, month_name), product (product_id, product_name)
  • 16. Mondrian schema II
    ‣ Measures are defined as aggregates on columns

        <Cube name="Pet sales" defaultMeasure="# Orders">
          ..
          <Measure name="# Orders" column="order_id" datatype="Integer" aggregator="distinct-count" formatString="Standard"/>

          <Measure name="Revenue" column="price" datatype="Integer" aggregator="sum" formatString="Currency"/>

          <Measure name="Sold items" column="item_id" datatype="Integer" aggregator="count" formatString="Standard"/>

          <CalculatedMember name="Avg cart value" dimension="Measures">
            <Formula>[Measures].[Revenue] / [Measures].[# Orders]</Formula>
          </CalculatedMember>
        </Cube>

    Each KPI is always computed in the same way
    ‣ Mondrian = SQL query generator

      MDX query:

        SELECT [Date].[Month].Members
            ON COLUMNS,
          [Measures].[Avg cart value]
            ON ROWS
        FROM [Pet sales]

      ➞ SQL generated by Mondrian:

        SELECT
          "day"."month_id" AS "c0",
          count(DISTINCT "order_item"."order_id") AS "m0",
          sum("order_item"."price") AS "m1"
        FROM
          "dim"."day" AS "day",
          "dim"."order_item" AS "order_item"
        WHERE
          "order_item"."day_fk" = "day"."day_id"
        GROUP BY
          "day"."month_id"

    Tables: order_item (item_id, order_id, has_voucher, price, day_fk, product_fk), day (day_id, day_name, month_id, month_name), product (product_id, product_name)
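To see what the generated SQL on this slide returns, the sketch below runs it unchanged against the sample rows from slide 6. SQLite stands in for PostgreSQL here, and an attached in-memory database named dim stands in for the dim schema; both are assumptions made for a runnable example. The final division mirrors the "Avg cart value" formula (Revenue / # Orders), which Mondrian evaluates on top of the two aggregated columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
ATTACH DATABASE ':memory:' AS dim;  -- stand-in for the "dim" schema
CREATE TABLE dim.day (
    day_id INTEGER PRIMARY KEY, day_name TEXT, month_id INTEGER, month_name TEXT);
CREATE TABLE dim.order_item (
    item_id INTEGER PRIMARY KEY, order_id INTEGER, has_voucher INTEGER,
    price INTEGER, day_fk INTEGER, product_fk INTEGER);
INSERT INTO dim.day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
INSERT INTO dim.order_item VALUES
    (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
    (3, 2, 2, 20, 930, 1),    (4, 3, NULL, 30, 930, 3),
    (5, 4, 4, 10, 1001, 2),   (6, 4, 4, 30, 1001, 3);
""")

# The SQL from the slide, verbatim
GENERATED_SQL = """
SELECT
  "day"."month_id" AS "c0",
  count(DISTINCT "order_item"."order_id") AS "m0",
  sum("order_item"."price") AS "m1"
FROM
  "dim"."day" AS "day",
  "dim"."order_item" AS "order_item"
WHERE
  "order_item"."day_fk" = "day"."day_id"
GROUP BY
  "day"."month_id"
"""

rows = sorted(conn.execute(GENERATED_SQL).fetchall())

# Calculated member: [Measures].[Revenue] / [Measures].[# Orders]
avg_cart_value = {month: revenue / orders for month, orders, revenue in rows}
print(rows)
print(avg_cart_value)
```

September has three distinct orders with a revenue of 80 and October one order with 40, which matches the 26,67 and 40,00 cell values in the XMLA response on slide 21.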
  • 17. Mondrian schema III
    ‣ Everything about KPIs & dimensions (business) and tables & columns (IT) in one file
      • consistent & explicit semantics
      • transparency is easy
    Always draw your Mondrian schema!
  • 18. Ad-hoc queries with Saiku Analytics
    ‣ Drag & drop reporting tool on top of Mondrian
      • Open source (Apache 2.0)
      • Talks to Mondrian via MDX
      • http://meteorite.bi/saiku
    Try it out immediately, it’s amazing: http://demo.analytical-labs.com/
  • 19. Reports in Yves & Zed I
    ‣ Own lightweight reporting frontend
      • Bootstrap / Google Charts
      • lacks many features
      • features are easy to implement
    Numbers are random!
  • 20. Reports in Yves & Zed II
    ‣ Dashboard-like interactive reports
      • maintained by developers
      • each table / chart is an MDX query
    Numbers are random!
  • 21. XMLA = XML for Analysis = MDX via SOAP
    ‣ Industry standard originally proposed by Microsoft
      • http://en.wikipedia.org/wiki/XML_for_Analysis
      • SOAP protocol to discover and query OLAP cubes
      • Mondrian has an XMLA server
    ‣ Request

        <?xml version="1.0" encoding="UTF-8"?>
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
          <SOAP-ENV:Header/>
          <SOAP-ENV:Body>
            <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
              <Command>
                <Statement>
                  <![CDATA[
                    SELECT [Date].[Month].Members
                        ON COLUMNS,
                      [Measures].[Avg cart value]
                        ON ROWS
                    FROM [Pet sales]
                  ]]>
                </Statement>
              </Command>
              <Properties>
                <PropertyList>
                  <Catalog>dwh</Catalog>
                  <DataSourceInfo>Monsai</DataSourceInfo>
                  <Format>Multidimensional</Format>

    ‣ Response (some lines are cut off on the slide)

        <?xml version="1.0" encoding="UTF-8"?>
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
          <SOAP-ENV:Header ../>
          <SOAP-ENV:Body>
            <cxmla:ExecuteResponse xmlns:cxmla="urn:schemas-microsoft-
              <cxmla:return>
                <root>
                  <OlapInfo ../>
                  <Axes>
                    <Axis name="Axis0" ../>
                    <Axis name="Axis1">
                      <Tuples>
                        <Tuple>
                          <Member Hierarchy="Measures" ..>
                        </Tuple>
                      </Tuples>
                    </Axis>
                    <Axis name="SlicerAxis" ../>
                  </Axes>
                  <CellData>
                    <Cell CellOrdinal="0">
                      <Value xsi:type="xsd:double">26.666666666666668</
                      <FmtValue>26,67 €</FmtValue>
                      <FormatString>Standard</FormatString>
                    </Cell>
                    <Cell CellOrdinal="1">
                      <Value xsi:type="xsd:double">40</Value>
                      <FmtValue>40,00 €</FmtValue>
                      <FormatString>Standard</FormatString>
                    </Cell>
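A hedged sketch of how a client might build and send the Execute request from this slide using only Python's standard library. The SOAP 1.1 envelope namespace and the SOAPAction value are the commonly used ones for XMLA and are filled in here as assumptions (the slide elides them); the endpoint URL is entirely hypothetical and post_xmla is defined but not called.

```python
import urllib.request
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"

ENVELOPE = """<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header/>
  <SOAP-ENV:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command><Statement>{statement}</Statement></Command>
      <Properties>
        <PropertyList>
          <Catalog>{catalog}</Catalog>
          <DataSourceInfo>{data_source}</DataSourceInfo>
          <Format>Multidimensional</Format>
        </PropertyList>
      </Properties>
    </Execute>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def build_execute_request(mdx, catalog, data_source):
    """Return the SOAP body for an XMLA Execute call as UTF-8 bytes."""
    return ENVELOPE.format(
        statement=escape(mdx),  # escaping replaces the CDATA section
        catalog=escape(catalog),
        data_source=escape(data_source),
    ).encode("utf-8")


def post_xmla(url, body):
    """Send the request; `url` would be your XMLA endpoint (hypothetical)."""
    request = urllib.request.Request(url, data=body, headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": '"urn:schemas-microsoft-com:xml-analysis:Execute"',
    })
    with urllib.request.urlopen(request) as response:
        return response.read()


mdx = ("SELECT [Date].[Month].Members ON COLUMNS, "
       "[Measures].[Avg cart value] ON ROWS FROM [Pet sales]")
body = build_execute_request(mdx, catalog="dwh", data_source="Monsai")

# Sanity check: the envelope is well-formed and carries the statement
statement = ET.fromstring(body).find(f".//{{{XMLA_NS}}}Statement").text
print(statement)
```

Escaping the MDX text instead of wrapping it in CDATA keeps the template simple; both are valid XML ways of embedding the statement.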
  • 22. Data Warehouse in Yves & Zed
    [Architecture diagram. Sources: application databases, csv files, json files, apis. Components: data integration, database, Mondrian, monsai, reporting. Labels: SQL, DB results, database mapping, Mondrian schema, MDX, XMLA / MDX, XMLA response, results.]
    ‣ monsai = Mondrian XMLA Server + Saiku in a single war file, https://github.com/project-a/monsai
  • 23. What kind of people do you need to hire for this?
    ‣ The “typical BI expert”:
      • studied something related to business and learnt VBA programming through Excel
      • relies on others to set up databases and tools
    ‣ Your ideal candidate
      • has studied computer science
      • masters the basic tools of software development and computer science
      • likes to learn new technologies
      • understands how databases work
    ‣ Good profile example: http://www.project-a.com/en/careers/jobs/?yid=332

    Job opportunity Data Engineer / Data Scientist (m/f) at Projec... https://karriere.project-a.com/eng?yid=
    For our "A-Team" we are looking to fill the following position as soon as possible: Data Engineer / Data Scientist (m/f)
    Your tasks:
      • You will help our business intelligence team to build data driven applications for our ventures: data warehouses, recommendation engines and CRM systems (developed in-house, based on open-source technologies)
      • You will integrate, transform and index data from various data sources, develop meaningful data representations and visualisations, and provide aggregated data for third-party systems
      • You will advance our software architecture and tool set to growing challenges and data amounts (performance, scaling, data quality)
      • You will work in an agile software development process in close collaboration with a product management team
    Your profile:
      • You have a Master's degree in computer science or a comparable degree
      • You have a genuine interest in data and algorithms and you are excited about solving difficult problems and strive for efficient and robust solutions
      • You master at least these basic tools of computer science: object oriented programming in multiple languages, HTTP and current web technologies, the unix command line and basic server administration, version control systems, a basic understanding of the interplay between software and memory, hard discs and the CPU
      • You have profound knowledge about the inner workings of database systems
      • You are eager to delve into new technologies and programming languages (our current stack: Mac or Linux, PostgreSQL, Mondrian & MDX, PHP, Java, Python, Solr, ElasticSearch, R)
      • You have a basic understanding of mathematics and machine learning
    Your chance:
      • You will join a highly professional and motivated team
      • You will have the unique opportunity to witness the launch of a newly established company and you can contribute your own ideas to its development

    Search for computer scientists, not business intelligence experts
  • 24. Use a standard software engineering process!
    ‣ Product managers: what?
      • Collection of business requirements
      • KPI & report definitions
      • QA & analysis
    ‣ Developers: how?
      • Implementation, performance & stability
      • Schema & process design
      • Consistency checks
    Any kind of Scrum / Kanban works, do it
    [Diagram of KPI names: Avg net revenue per buying member; % Contribution margin 1; Net revenue; Net voucher cost; Avg net voucher cost per order; Contribution margin 3a; Tax shipping amount; Tax amount; Gross revenue; Avg gross item price; Gross price to gross retail price ratio; Price to retail price ratio; Avg gross order value; % Gross voucher cost; Gross invoiced amount; Net invoiced amount; Gross retail price; Net price to net purchase price ratio; Net price to net retail price ratio; % Net discount; Avg gross invoiced amount; HGB net revenue margin; Avg gross voucher cost per buying member; Net item revenue; Tax item amount; Net purchase cost; Net retail price; Retail tax amount; Gross voucher cost; Net shipping revenue; Gross shipping revenue]
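The slide lists consistency checks as a developer task without giving an example. One common check, sketched here as an assumption rather than as the talk's actual implementation, is to look for fact rows whose foreign key has no matching dimension row. SQLite is used so the example runs anywhere; the helper name and the sample rows are hypothetical.

```python
import sqlite3


def orphaned_keys(conn, fact_table, fk_column, dim_table, pk_column):
    """Return foreign key values in the fact table that have no dimension row.

    Table and column names are interpolated into the SQL, so they must come
    from trusted configuration, never from user input.
    """
    query = f"""
        SELECT DISTINCT f.{fk_column}
        FROM {fact_table} AS f
        LEFT JOIN {dim_table} AS d ON d.{pk_column} = f.{fk_column}
        WHERE d.{pk_column} IS NULL AND f.{fk_column} IS NOT NULL
        ORDER BY 1
    """
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE order_item (item_id INTEGER PRIMARY KEY, product_fk INTEGER);
INSERT INTO product VALUES (1, 'Cat'), (2, 'Dog'), (3, 'Cow');
-- item 3 points at a product that does not exist
INSERT INTO order_item VALUES (1, 1), (2, 3), (3, 4);
""")

missing = orphaned_keys(conn, "order_item", "product_fk", "product", "product_id")
print(missing)
```

Running such checks as the last job of every load turns silent data problems into visible failures.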
  • 25. Thank you
    Data integration is easy if you keep things simple!
    http://www.project-a.com/