| date | holiday_name | holiday | shops_closed | winter_school_holidays | school_holidays | warehouse | |
|---|---|---|---|---|---|---|---|
| 0 | 2022-03-16 | NaN | 0 | 0 | 0 | 0 | Frankfurt_1 |
| 1 | 2020-03-22 | NaN | 0 | 0 | 0 | 0 | Frankfurt_1 |
| 2 | 2018-02-07 | NaN | 0 | 0 | 0 | 0 | Frankfurt_1 |
| 3 | 2018-08-10 | NaN | 0 | 0 | 0 | 0 | Frankfurt_1 |
| 4 | 2017-10-26 | NaN | 0 | 0 | 0 | 0 | Prague_2 |
Rohlik Sales Forecasting
Photo: Matheus Cenali, pexels.com
About
Predicting sales values is essential for planning, supplying chains, delivery logistics and inventory management. With accurate forecast, businesses can minimaze waste of resources and volume of unnecessary work-related tasks.
The data comes from kaggle competition Rohlik Sales Forecasting Challenge.
The project analyzes historical sales data across multiple European warehouses of grocery chain Rohlik. The goal is to understand sales patterns over time and to engineer meaningful features that can later be used for demand forecasting.
Data:
sales_train.csv - training set containing the historical sales data for given date and inventory with selected features described below
sales_test.csv - testing set
inventory.csv - additional information about inventory like its product (same products across all warehouses share same product unique id and name, but have different unique id)
solution.csv - full submission file in the correct format
calendar.csv - calendar containing data about holidays or warehouse specific events, some columns are already in the train data but there are additional rows in this file for dates where some warehouses could be closed due to public holiday or Sunday (and therefore they are not in the train set)
Information about Data
calendar_df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23016 entries, 0 to 23015
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 23016 non-null object
1 holiday_name 930 non-null object
2 holiday 23016 non-null int64
3 shops_closed 23016 non-null int64
4 winter_school_holidays 23016 non-null int64
5 school_holidays 23016 non-null int64
6 warehouse 23016 non-null object
dtypes: int64(4), object(3)
memory usage: 1.2+ MB
'2016-01-01'
'2024-12-31'
The earliest and the latest date in the dataset.
inventory_df:
| unique_id | product_unique_id | name | L1_category_name_en | L2_category_name_en | L3_category_name_en | L4_category_name_en | warehouse | |
|---|---|---|---|---|---|---|---|---|
| 0 | 5255 | 2583 | Pastry_196 | Bakery | Bakery_L2_14 | Bakery_L3_26 | Bakery_L4_1 | Prague_3 |
| 1 | 4948 | 2426 | Herb_19 | Fruit and vegetable | Fruit and vegetable_L2_30 | Fruit and vegetable_L3_86 | Fruit and vegetable_L4_1 | Prague_3 |
| 2 | 2146 | 1079 | Beet_2 | Fruit and vegetable | Fruit and vegetable_L2_3 | Fruit and vegetable_L3_65 | Fruit and vegetable_L4_34 | Prague_1 |
| 3 | 501 | 260 | Chicken_13 | Meat and fish | Meat and fish_L2_13 | Meat and fish_L3_27 | Meat and fish_L4_5 | Prague_1 |
| 4 | 4461 | 2197 | Chicory_1 | Fruit and vegetable | Fruit and vegetable_L2_17 | Fruit and vegetable_L3_33 | Fruit and vegetable_L4_1 | Frankfurt_1 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5432 entries, 0 to 5431
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 unique_id 5432 non-null int64
1 product_unique_id 5432 non-null int64
2 name 5432 non-null object
3 L1_category_name_en 5432 non-null object
4 L2_category_name_en 5432 non-null object
5 L3_category_name_en 5432 non-null object
6 L4_category_name_en 5432 non-null object
7 warehouse 5432 non-null object
dtypes: int64(2), object(6)
memory usage: 339.6+ KB
sales_train_df:
| unique_id | date | warehouse | total_orders | sales | sell_price_main | availability | type_0_discount | type_1_discount | type_2_discount | type_3_discount | type_4_discount | type_5_discount | type_6_discount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4845 | 2024-03-10 | Budapest_1 | 6436.0 | 16.34 | 646.26 | 1.00 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.15312 | 0.0 | 0.0 |
| 1 | 4845 | 2021-05-25 | Budapest_1 | 4663.0 | 12.63 | 455.96 | 1.00 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.15025 | 0.0 | 0.0 |
| 2 | 4845 | 2021-12-20 | Budapest_1 | 6507.0 | 34.55 | 455.96 | 1.00 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.15025 | 0.0 | 0.0 |
| 3 | 4845 | 2023-04-29 | Budapest_1 | 5463.0 | 34.52 | 646.26 | 0.96 | 0.20024 | 0.0 | 0.0 | 0.0 | 0.15312 | 0.0 | 0.0 |
| 4 | 4845 | 2022-04-01 | Budapest_1 | 5997.0 | 35.92 | 486.41 | 1.00 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.15649 | 0.0 | 0.0 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 unique_id 1048575 non-null int64
1 date 1048575 non-null object
2 warehouse 1048575 non-null object
3 total_orders 1048556 non-null float64
4 sales 1048556 non-null float64
5 sell_price_main 1048575 non-null float64
6 availability 1048575 non-null float64
7 type_0_discount 1048575 non-null float64
8 type_1_discount 1048575 non-null float64
9 type_2_discount 1048575 non-null float64
10 type_3_discount 1048575 non-null float64
11 type_4_discount 1048575 non-null float64
12 type_5_discount 1048575 non-null float64
13 type_6_discount 1048575 non-null float64
dtypes: float64(11), int64(1), object(2)
memory usage: 112.0+ MB
'2020-08-01'
'2024-06-02'
The earliest and the latest date in the dataset.
sales_test_df:
| unique_id | date | warehouse | total_orders | sell_price_main | type_0_discount | type_1_discount | type_2_discount | type_3_discount | type_4_discount | type_5_discount | type_6_discount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1226 | 2024-06-03 | Brno_1 | 8679.0 | 13.13 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1226 | 2024-06-11 | Brno_1 | 8795.0 | 13.13 | 0.15873 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1226 | 2024-06-13 | Brno_1 | 10009.0 | 13.13 | 0.15873 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1226 | 2024-06-15 | Brno_1 | 8482.0 | 13.13 | 0.15873 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1226 | 2024-06-09 | Brno_1 | 8195.0 | 13.13 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47021 entries, 0 to 47020
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 unique_id 47021 non-null int64
1 date 47021 non-null object
2 warehouse 47021 non-null object
3 total_orders 47021 non-null float64
4 sell_price_main 47021 non-null float64
5 type_0_discount 47021 non-null float64
6 type_1_discount 47021 non-null float64
7 type_2_discount 47021 non-null float64
8 type_3_discount 47021 non-null float64
9 type_4_discount 47021 non-null float64
10 type_5_discount 47021 non-null float64
11 type_6_discount 47021 non-null float64
dtypes: float64(9), int64(1), object(2)
memory usage: 4.3+ MB
'2024-06-03'
'2024-06-16'
The earliest and the latest date in the dataset.
solution_df:
| id | sales_hat | |
|---|---|---|
| 0 | Brno_1_1226_2024-06-03 | 10.486372 |
| 1 | Brno_1_1226_2024-06-11 | 8.936784 |
| 2 | Brno_1_1226_2024-06-13 | 9.241525 |
| 3 | Brno_1_1226_2024-06-15 | 10.004795 |
| 4 | Brno_1_1226_2024-06-09 | 12.988256 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47021 entries, 0 to 47020
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 47021 non-null object
1 sales_hat 47021 non-null float64
dtypes: float64(1), object(1)
memory usage: 734.8+ KB
test_weight_df:
| unique_id | weight | |
|---|---|---|
| 0 | 0 | 2.535369 |
| 1 | 1 | 3.888933 |
| 2 | 2 | 5.885237 |
| 3 | 3 | 2.139552 |
| 4 | 5 | 3.021715 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5390 entries, 0 to 5389
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 unique_id 5390 non-null int64
1 weight 5390 non-null float64
dtypes: float64(1), int64(1)
memory usage: 84.3 KB
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) reveals basic characteristics of the dataset, such as distributions, ranges, and typical values of variables. This step helps identify what the data represents and how individual features behave. In order to perform proper exploratory analysis and enable further data modelling, additional features will be created from the date variable, including year, month, day, day of the week, and holiday indicators. These features will help capture seasonal patterns and calendar effects that may influence sales behaviour.
sales_train_df["date"] = pd.to_datetime(sales_train_df["date"])
sales_test_df["date"] = pd.to_datetime(sales_test_df["date"])
calendar_df["date"] = pd.to_datetime(calendar_df["date"])
sales_train_df["is_test"] = 0
sales_test_df["is_test"] = 1
data = pd.concat([sales_train_df, sales_test_df], axis=0)
data = data.sort_values(["unique_id", "date"])
# Merging
data = data.merge(calendar_df, on=["date", "warehouse"], how="left")
data = data.merge(inventory_df, on=['unique_id', 'warehouse'], how='left')
data = data.merge(test_weights_df, on=['unique_id'], how='left')
# Data Cleaning and Feature Engineering
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day
data["day_of_week"] = data["date"].dt.dayofweek
data["weekend"] = (data["day_of_week"] >= 5).astype("int8")
# keep holiday numeric
data["holiday"] = data["holiday"].astype("int8")
# categorical ids
data["warehouse"] = data["warehouse"].astype("category")
data["day_of_week"] = data["day_of_week"].astype("category")Daily Sales







Prague and Brno are both in Czech Republic. Summary sales show simmilar repeated pattern of sales during every year with tendency to grow every year. Budapest is the capital of Hungary. Summary sales plot shows again repeated pattern of sales over years, but a bit different than plots from Czech Republic. There seems to be no tendency of growth over the years. City of Munich is in Germany. Plot of summary daily sales shows again repeated pattern of sales over every year with a tendency to grow each year. City of Frankfurt is in Germany, too. Plot of summary daily sales does not seems to copy the pattern of Munich but there is a tendency to grow each year.
Average sales during weekdays







Plots of average sales during weekdays shows tendency of custumers in all countries to do shopping on Thurdays and Fridays.
Average monthly sales







Plots of average monthly sales seems to reflect regionally differences in shopping habits of customers.
Average sales during holidays and common days

Prediction
Data preparation
To improve model performance, several time-series features will be created, including lagged values, rolling means, and momentum indicators, which will help capture temporal dependencies and short-term trends in the data. Lag features use past observations as predictors, rolling means summarize recent behaviour through moving averages, and momentum measures quantify the recent change in a variable.
# feature engineering
lags = [1, 2, 3, 7, 14, 21, 28, 35, 42, 56]
for lag in lags:
data[f"lag_{lag}"] = (
data.groupby("unique_id")["sales"]
.shift(lag)
)
lags = [1, 7, 14, 28]
for lag in lags:
data[f"orders_lag_{lag}"] = (
data.groupby("warehouse")["total_orders"]
.shift(lag)
)
# rolling
windows = [7, 14, 28]
for w in windows:
data[f"roll_mean_{w}"] = (
data.groupby("unique_id")["sales"]
.shift(1)
.rolling(w)
.mean()
)
# simple momentum signals
data["trend_ratio"] = data["roll_mean_7"] / (data["roll_mean_28"] + 1e-6)
data["momentum_7"] = data["lag_1"] - data["lag_7"]
# warehouse daily total trend
data["warehouse_roll7"] = (
data.groupby(["warehouse", "date"])["sales"]
.transform("sum")
.groupby(data["warehouse"])
.transform(lambda x: x.shift(1).rolling(7).mean())
)
data["product_roll7"] = (
data.groupby(["product_unique_id", "date"])["sales"]
.transform("sum")
.groupby(data["product_unique_id"])
.transform(lambda x: x.shift(1).rolling(7).mean())
)
data["orders_roll7"] = (
data.groupby("warehouse")["total_orders"]
.shift(1)
.rolling(7)
.mean()
)
# dropping redundant columns
data = data.drop(columns=['holiday_name', 'name', 'day' ])
cat_cols = [
"warehouse",
"day_of_week",
"L1_category_name_en",
"L2_category_name_en",
"L3_category_name_en",
"L4_category_name_en",
"product_unique_id"
]
for col in cat_cols:
data[col] = data[col].astype("category")
Model
LightGBM is a machine learning algorithm that predicts outcomes by combining many simple decision trees. Each tree learns from the errors of the previous ones and gradually improves the prediction, allowing the model to capture complex patterns in the data and achieve high predictive accuracy.
# train
train_final = data[data["is_test"] == 0].copy()
# drop rows with missing sales / missing lag history
train_final = train_final.dropna().copy()
y = train_final["sales"].values.ravel()
X = train_final.drop(columns=["sales"]).copy()
# Time split on train
cutoff = train_final["date"].max() - pd.Timedelta(days=14)
train_mask = train_final["date"] < cutoff
valid_mask = train_final["date"] >= cutoff
X_train = X.loc[train_mask].copy()
X_valid = X.loc[valid_mask].copy()
y_train = y[train_mask.values]
y_valid = y[valid_mask.values]
# Weights
train_weights = X_train["weight"].values
valid_weights = X_valid["weight"].values
# Drop non-features
drop_cols = ["unique_id", "date", "is_test", "weight"]
X_train_model = X_train.drop(columns=drop_cols, errors="ignore")
X_valid_model = X_valid.drop(columns=drop_cols, errors="ignore")
model = LGBMRegressor(
n_estimators=4000,
learning_rate=0.03,
num_leaves=63,
min_data_in_leaf=150,
subsample=0.8,
colsample_bytree=0.8,
reg_lambda=2.0,
random_state=42,
force_row_wise=True
)
model.fit(
X_train_model, y_train,
sample_weight=train_weights,
eval_set=[(X_valid_model, y_valid)],
eval_sample_weight=[valid_weights],
eval_metric="l1",
callbacks=[lgb.early_stopping(400)]
)
pred = model.predict(X_valid_model)
wmae = np.sum(valid_weights * np.abs(y_valid - pred)) / np.sum(valid_weights)
mae = mean_absolute_error(y_valid, pred)
print("Validation WMAE:", wmae)
print("MAE:", mae)[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
[LightGBM] [Info] Total Bins 8125
[LightGBM] [Info] Number of data points in the train set: 981146, number of used features: 46
[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
[LightGBM] [Info] Start training from score 80.861976
Training until validation scores don't improve for 400 rounds
Early stopping, best iteration is:
[3343] valid_0's l1: 14.4715 valid_0's l2: 1304.26
[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
Validation WMAE: 14.4714968291173
MAE: 19.225200316446223
The model is trained using all historical data except for the final 14 days. These last 14 days are held out as a validation set to evaluate how well the model predicts future sales. After training, the model generates predictions for this period, and the predicted values are compared with the actual observed sales to assess model performance.
[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
Validation WMAE: 14.4714968291173
The final training of the LightGBM model using the entire training dataset after the model parameters have already been validated.
X_full = train_final.drop(columns=["sales", "unique_id", "date", "is_test", "weight"], errors="ignore")
y_full = train_final["sales"].values
weights_full = train_final["weight"].values
final_model = LGBMRegressor(
n_estimators=model.best_iteration_,
learning_rate=0.03,
num_leaves=63,
min_data_in_leaf=150,
subsample=0.8,
colsample_bytree=0.8,
reg_lambda=2.0,
random_state=42,
force_row_wise=True
)
final_model.fit(X_full, y_full, sample_weight=weights_full)[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
[LightGBM] [Info] Total Bins 8115
[LightGBM] [Info] Number of data points in the train set: 993369, number of used features: 46
[LightGBM] [Info] Start training from score 80.844171
LGBMRegressor(colsample_bytree=0.8, force_row_wise=True, learning_rate=0.03,
min_data_in_leaf=150, n_estimators=3343, num_leaves=63,
random_state=42, reg_lambda=2.0, subsample=0.8)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LGBMRegressor(colsample_bytree=0.8, force_row_wise=True, learning_rate=0.03,
min_data_in_leaf=150, n_estimators=3343, num_leaves=63,
random_state=42, reg_lambda=2.0, subsample=0.8)[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
Train MAE: 16.073860052291398
Train WMAE: 12.07818805959296
avg_sales = data["sales"].mean()
print("Average sales:", avg_sales)Average sales: 135.09114999103528
Relative error:
12.08/135.09≈0.0894
Accuracy:
1 − 0.0894 ≈ 0.9106
Result
≈ 91.1% training accuracy
Final Solution:
test_final = data[data["is_test"] == 1].copy()
X_test = test_final.drop(columns=["sales", "unique_id", "date", "is_test", "weight"], errors="ignore")
pred_test = final_model.predict(X_test)
pred_test = np.clip(pred_test, 0, None)
solution_df["sales_hat"] = pred_test
solution_df.to_csv("../rohlik_project/csv/solution.csv", index=False)
solution_df.head()[LightGBM] [Warning] min_data_in_leaf is set=150, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=150
| id | sales_hat | |
|---|---|---|
| 0 | Brno_1_1226_2024-06-03 | 10.486372 |
| 1 | Brno_1_1226_2024-06-11 | 8.936784 |
| 2 | Brno_1_1226_2024-06-13 | 9.241525 |
| 3 | Brno_1_1226_2024-06-15 | 10.004795 |
| 4 | Brno_1_1226_2024-06-09 | 12.988256 |
Conclusion
Overall, the project demonstrates that machine learning methods such as LightGBM can effectively model retail demand when combined with appropriate feature engineering and time-based validation. Future improvements could include additional external features, more advanced time-series modelling techniques, or ensemble approaches to further enhance forecasting accuracy.