Walmart sales forecast
(Praxis Business School)
Data Mining Assignment
A report on
Sales forecasting for Walmart
Submitted to
Prof. Suman K Mazumdar
In partial fulfillment of the requirements of the subject
(iSAS)
On (26th September, 2015)
By
Anurag Mukherjee
Table of Contents
Sl No Topic
1 Cover Page
2 Title Page
3 Executive Summary
4 Background
5 Business Problem
6 Data Overview
7 Exploratory Analysis
8 Examining the final features dataset
9 Merging of train and features for the final data set creation
10 Model Building
Executive Summary :
Walmart is the world's largest company by revenue, according to the Fortune Global 500 list in
2014, as well as the biggest private employer in the world, with 2.2 million employees.
Walmart is a family-owned business, as the company is controlled by the Walton family. Sam
Walton's heirs own over 50 percent of Walmart through their holding company, Walton
Enterprises, and through their individual holdings. It is also one of the world's most valuable
companies by market value, and is the largest grocery retailer in the U.S. In 2009, it
generated 51 percent of its US$258 billion (equivalent to $284 billion in 2015) sales in the
U.S. from its grocery business.
We are provided with datasets containing weekly sales per store and per department.
We aim to forecast sales for Walmart to help the company make better
data-driven decisions for inventory planning and channel optimization.
Background:
Wal-Mart Stores, Inc. is an American multinational retail corporation that operates a chain
of discount department stores and warehouse stores. Headquartered in Bentonville,
Arkansas, United States, the company was founded by Sam Walton in 1962 and incorporated on
October 31, 1969. It has over 11,000 stores in 28 countries, under a total of 65 banners. The
company operates under the Walmart name in the United States and Canada. It operates as Walmart
de México y Centroamérica in Mexico, as Asda in the United Kingdom, as Seiyu in Japan, and as Best
Price in India. It has wholly owned operations in Argentina, Brazil, and Canada. It also owns and
operates the Sam's Club retail warehouses.
Business Problem:
We are given historical sales data for 45 Walmart stores located in different regions. Each store contains
many departments, and the aim is to project the sales for each department in each store. To add to
the challenge, selected holiday markdown events are included in the dataset. These markdowns are
known to affect sales.
Data Overview :
train.csv
This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will
find the following fields:
Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store
IsHoliday - whether the week is a special holiday week
features.csv
This file contains additional data related to the store, department, and regional activity for the given
dates. It contains the following fields:
Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running.
MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any
missing value is marked with an NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week
Exploratory Analysis :
1.train.csv
1.1 Importing the raw dataset :
proc import out=walmart_train datafile='/folders/myshortcuts/myfolder/train_walmart.csv'
dbms=csv replace;
getnames=yes;
run;
1.2 Checking the contents of train.csv :
proc contents data=walmart_train;
run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 Date Num 8 DDMMYY10. DDMMYY10.
2 Dept Num 8 BEST12. BEST32.
6 IsHoliday Char 5 $5. $5.
4 Month_Year Num 8 DATETIME. ANYDTDTM40.
1 Store Num 8 BEST12. BEST32.
5 Weekly_Sales Num 8 BEST12. BEST32.
1.3 Checking the basic statistical measures
1.6 Outlier Treatment for train.csv :
Because the data is a time series, it shows seasonality: there is a sales spike during
December. The spike can be partly explained by the markdowns.
Markdowns 1, 2, 4 and 5 do not seem to be as effective as Markdown 3.
Because the spike in sales would distort the entire model, the spike values were capped and
the total excess sales were distributed evenly across all the records.
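/* Rows of walmart_train_data above the threshold, with their excess over the cap */
data wal;
set walmart_train_data;
where Sales > 50000000;
sales_diff=Sales-46243899.58;
run;
/* Total excess sales over those rows */
proc sql;
create table mapper as
select sum(sales_diff) as total_excess from
wal;
quit;
*total excess sales from weeks having > 50000000 = 181638262.18;
/* Cap values above the threshold and add an equal share of the excess to each of the 421570 records */
data walmart_final;
set walmart_train;
if Weekly_Sales > 50000000 then Weekly_Sales=46243899.58;
Weekly_Sales_new=Weekly_Sales+(181638262.18/421570);
run;
proc univariate data=walmart_final;
var Weekly_Sales;
run;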
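The redistribution above adds 181638262.18 / 421570, roughly 430.86, to every record. A minimal sketch of the same step without hardcoded totals is given below. It assumes that walmart_train_data holds the weekly totals in Sales (its construction is not shown in this section); the table name excess and the variables total_excess and n_records are illustrative only. Since the threshold is found on Sales in walmart_train_data but the cap is applied to record-level Weekly_Sales, it may be worth checking that the record-level condition selects the intended rows.

```sas
/* Sketch only: compute the total excess from the data instead of typing it in. */
proc sql;
    create table excess as
    select sum(Sales - 46243899.58) as total_excess
    from walmart_train_data            /* assumed: one row per week, Sales = weekly total */
    where Sales > 50000000;
quit;

data walmart_final;
    if _n_ = 1 then set excess;        /* total_excess is retained on every row */
    set walmart_train nobs=n_records;  /* n_records = number of rows in walmart_train */
    if Weekly_Sales > 50000000 then Weekly_Sales = 46243899.58;
    Weekly_Sales_new = Weekly_Sales + total_excess / n_records;
    drop total_excess;
run;
```

Reading the row count through nobs= keeps the divisor in step with the data if the training file changes.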
2.features.csv
2.1 Importing raw data set :
proc import out=walmart_features datafile='/folders/myshortcuts/myfolder/features.csv'
dbms=csv replace;
getnames=yes;
guessingrows=200;
run;
2.2 Checking the contents of features.csv :
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
4 CPI Char 11 $11. $11.
2 Date Num 8 YYMMDD10. YYMMDD10.
6 Fuel_Price Num 8 BEST12. BEST32.
13 IsHoliday Char 5 $5. $5.
7 MarkDown1 Char 8 $8. $8.
8 MarkDown2 Char 8 $8. $8.
9 MarkDown3 Char 8 $8. $8.
10 MarkDown4 Char 8 $8. $8.
11 MarkDown5 Char 8 $8. $8.
1 Store Num 8 BEST12. BEST32.
5 Temperature Num 8 BEST12. BEST32.
12 Unemployment Char 5 $5. $5.
14 VAR14 Char 1 $1. $1.
3 Weekly_Sales Char 8 $8. $8.
2.3 Checking the basic statistical measures of features.csv :
proc means data=walmart_features;
run;
2.4 Outlier Treatment :
/* Replace missing-value markers with "0" (these columns were imported as character) */
data walmart_f;
set walmart_features;
format Date DDMMYY10.;
if MarkDown1="NA" or MarkDown1="#N/A" then MarkDown1="0";
if MarkDown2="NA" or MarkDown2="#N/A" then MarkDown2="0";
if MarkDown3="NA" or MarkDown3="#N/A" then MarkDown3="0";
if MarkDown4="NA" or MarkDown4="#N/A" then MarkDown4="0";
if MarkDown5="NA" or MarkDown5="#N/A" then MarkDown5="0";
/* Numeric holiday flag: 1 for holiday weeks, 0 otherwise */
if IsHoliday="TRUE" then IsHoliday_Yes=1;
else IsHoliday_Yes=0;
if Weekly_Sales="#N/A" then Weekly_Sales="0";
run;
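The step that turns these character columns into the numeric MarkDown1_n to MarkDown5_n and Weekly_Sales_n columns of walmart_features_1 is not shown in this section. One possible way to do it is sketched below; the arrays, the loop variable i and the choice to map unreadable values to 0 are assumptions, not the original code.

```sas
/* Sketch only: derive numeric copies of the character columns in walmart_f. */
data walmart_features_1;
    set walmart_f;
    array md_char{5} $ MarkDown1-MarkDown5;
    array md_num{5} MarkDown1_n MarkDown2_n MarkDown3_n MarkDown4_n MarkDown5_n;
    do i = 1 to 5;
        /* ?? suppresses invalid-data messages; non-numeric text becomes missing */
        md_num{i} = input(md_char{i}, ?? best32.);
        if missing(md_num{i}) then md_num{i} = 0;
    end;
    Weekly_Sales_n = input(Weekly_Sales, ?? best32.);
    if missing(Weekly_Sales_n) then Weekly_Sales_n = 0;
    drop i MarkDown1-MarkDown5 Weekly_Sales IsHoliday VAR14;
run;
```

Using input() makes the character-to-numeric conversion explicit, so it does not depend on SAS's automatic type conversion.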
Examining the final features dataset :
proc contents data=walmart_features_1;
run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 CPI Char 11 $11. $11.
2 Date Num 8 DDMMYY10. YYMMDD10.
5 Fuel_Price Num 8 BEST12. BEST32.
7 IsHoliday_Yes Num 8
8 MarkDown1_n Num 8
9 MarkDown2_n Num 8
10 MarkDown3_n Num 8
11 MarkDown4_n Num 8
12 MarkDown5_n Num 8
1 Store Num 8 BEST12. BEST32.
4 Temperature Num 8 BEST12. BEST32.
6 Unemployment Char 5 $5. $5.
13 Weekly_Sales_n Num 8
Merging of train and features for the final data set creation:
proc sql;
create table walmart_final_1 as
select
a.*,b.CPI,b.Temperature,b.Fuel_Price,b.MarkDown1_n,b.MarkDown2_n,b.MarkDown3_n,
b.MarkDown4_n,b.MarkDown5_n,b.Unemployment,b.IsHoliday_Yes
from walmart_final as a left join walmart_features_1 as b on
a.Date=b.Date and a.Store=b.Store;
quit;
data walmart_final_2 (drop=IsHoliday Month_Year Unemployment Weekly_Sales);
set walmart_final_1;
run;
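Because this is a left join, any training row without a matching Store and Date in the features table keeps missing feature values, and duplicate keys in the features table would multiply training rows. A short check for both cases could look like the sketch below; the column aliases unmatched_rows and n_rows are illustrative.

```sas
/* Sketch only: sanity checks for the Store/Date join. */
proc sql;
    /* training rows that found no partner in the features table */
    select count(*) as unmatched_rows
    from walmart_final as a
    left join walmart_features_1 as b
      on a.Date = b.Date and a.Store = b.Store
    where b.Store is missing;

    /* Store/Date keys that occur more than once in the features table */
    select Store, Date, count(*) as n_rows
    from walmart_features_1
    group by Store, Date
    having count(*) > 1;
quit;
```

The two Date columns carry different display formats, but the join compares the underlying numeric date values, so the formats alone should not prevent a match.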
proc contents data=walmart_final_2;
run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
5 CPI Char 11 $11. $11.
3 Date Num 8 DDMMYY10. DDMMYY10.
2 Dept Num 8 BEST12. BEST32.
7 Fuel_Price Num 8 BEST12. BEST32.
13 IsHoliday_Yes Num 8
8 MarkDown1_n Num 8
9 MarkDown2_n Num 8
10 MarkDown3_n Num 8
11 MarkDown4_n Num 8
12 MarkDown5_n Num 8
1 Store Num 8 BEST12. BEST32.
6 Temperature Num 8 BEST12. BEST32.
4 Weekly_Sales_new Num 8
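In this listing CPI is still a character column, which most modeling procedures cannot use as a continuous predictor. If a numeric version is needed, a conversion along the following lines is one option; the output name walmart_model_input and the variable CPI_n are hypothetical.

```sas
/* Sketch only: numeric copy of CPI for modeling. */
data walmart_model_input;
    set walmart_final_2;
    CPI_n = input(CPI, ?? best32.);   /* "NA" and other non-numeric text become missing */
    drop CPI;
run;
```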