BAN430 Forecasting – ws

Walmart’s Weekly Sales

You’ve just been hired as consultants by Walmart, the global retail giant, for a mission that is at the core of their strategic operations. Your challenge is to create a forecasting model capable of predicting weekly sales across 45 of Walmart’s diverse stores. This task is crucial; accurate forecasts drive decisions on inventory, staffing, marketing, and financial planning—essentially guiding the company’s future in a fiercely competitive market.

Your role goes beyond the numbers; you’re set to navigate through vast datasets, uncovering the impact of holidays, temperature and various proxies for the state of the economy, on Walmart’s sales. Success here means providing Walmart with the insights needed to optimize operations, meet customer demands, and achieve sales targets.

One methodological point of this workshop is the distinction between a buttoms-up approach, where you build a well-performing forecast model for the individual stores and aggregate the individual forecast to a total sales forecast, and the approach of aggregating the data to total sales (one time series) and build a forecast model for that. There are pros and cons with both approaches, and in the end of the workshop we will compare the two approaches.

Objective

Create a forecasting model for both the individual store’s weekly sales and total weekly sales.

Relevant learning outcomes

Explain the central ideas of time series analysis and forecasting.
Graphically present time series.
Model a real-world time series using an appropriate time series model and use it for forecasting.
Evaluate forecast performance and to identify the components of forecast errors.
Use R and appropriate packages.

Topics from the curriculum

Time series Graphics
Forecasters toolbox
Exponential Smoothing
Time series regression
ARIMA
Dynamical regression

Data description

The dataset is publicly available under a CC0 licence and can be downloaded from the Kaggle website. You may also load it from this websites’ github repository:

walmart <- readr::read_csv("https://raw.githubusercontent.com/holleland/BAN430/master/data/Walmart.csv")

The data consist of weekly sales from 45 Walmart Stores. We also have the following columns (with descriptions):

Store - the store number
Date - the week of sales
Weekly_Sales - sales for the given store
Holiday_Flag - whether the week is a special holiday week 1 - Holiday week 0 - Non-holiday week
Temperature - Temperature on the day of sale
Fuel_Price - Cost of fuel in the region
CPI - Prevailing consumer price index
Unemployment - Prevailing unemployment rate

The data set starts in week 5, 2010 and ends in week 43, 2012. The Holiday Events are:

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Exercises

Load the data and convert it to a tsibble. It should have well-formatted index column of yearweek format and the key should be Store.
Do a preliminary analysis of the data. You may focus on stores 1-5 for simplifying figures, as you see fit. The analysis should discuss whether

Trend or seasonality is present.
Any visible effect of holidays, temperature, fuel price or unemployment is present.
You would prefer to inflation adjust the time series.
It is relevant to apply any mathematical transformation of the data.

Use relevant graphics and justify the decisions you make.

Create a train-test split of the data. We will use the last 20 weeks as test set.

In the following exercises 4-7, we will focus on Walmart Store number 3.

Consider seasonal- or first lag difference of the logarithmic weekly sales. Do a KPSS test for this particular setup (you may also use the unitroot_ndiffs/unitroot_nsdiff functions). Discuss potential downside of applying a seasonal differcing in this partciualar situation? Plotting the seasonally differenced time series may help.
Plot the ACF and PACF of the (potensially differenced) logartimic weekly sales. Suggest p and q for an ARIMA(p,d,q) model based on these figures.
Fit the following models:

Benchmark methods of choice (Naive, Seasonal Naive, mean or Drift method)
ETS
ARIMA (from exercise 5 and automatic)
Dynamic regression model with holiday effects

Select the best model in terms of RMSE on the test set. Assess whether the model assumptions are reasonable (address both the required- and “nice-to-have” assumptions). You may also add other candidate models as you see fit.

The dynamic regression model uses a predictor variable. Discuss briefly whether this is an ex-ante or ex-post forecast model? Would your conclusion change if also the price of fuel was included as a predictor?
Fit the same candidate models (as in ex 6) for all stores using only automatic selection procedures. For each store, choose the model that has the lowest RMSE on the test set. Plot the forecasts for the test data, focusing on a selection of the Stores (e.g. Stores 15-20).
Based on the best models in exercise 8, generate a forecast for the total weekly sales in the test set, for all stores combined. You may focus on the point forecasts. Extra challenge: Get the prediction intervals right (optional; not possible with standard functions).
This is what we would call a bottoms-up approach, since we forecast the individual stores weekly sales, and then aggregate to the total sales of the stores. Another approach would be to build a model for the total sales of all the stores. Aggregate the training and test data to total sales, and build a (non-benchmark) model of your choosing. Compare this model to the bottom-up forecast of exercise 9. Which approach do you prefer?
Plot the final forecast for total sales with the test data with a short comment. Note: If you did not manage to do exercise 9 or 10, plot another forecast.