| Sales Person | Country | Product | Date | Amount | Boxes Shipped | Year | Month | Month Label | Quarter | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 2022-01-04 | 5320.0 | 180 | 2022 | 1 | 2022-01 | Q1 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 2022-08-01 | 7896.0 | 94 | 2022 | 8 | 2022-08 | Q3 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 2022-07-07 | 4501.0 | 91 | 2022 | 7 | 2022-07 | Q3 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 2022-04-27 | 12726.0 | 342 | 2022 | 4 | 2022-04 | Q2 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 2022-02-24 | 13685.0 | 184 | 2022 | 2 | 2022-02 | Q1 |
Chocolate Sales
Author: Alleksana, pexels.com
About the Dataset
This dataset contains transactional sales records covering multiple countries, products, and salespeople across a time period from roughly 2022 to 2024.
Information about Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3282 entries, 0 to 3281
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sales Person 3282 non-null object
1 Country 3282 non-null object
2 Product 3282 non-null object
3 Date 3282 non-null datetime64[ns]
4 Amount 3282 non-null float64
5 Boxes Shipped 3282 non-null int64
6 Year 3282 non-null int64
7 Month 3282 non-null int64
8 Month Label 3282 non-null object
9 Quarter 3282 non-null object
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 256.5+ KB
Overview for the Whole Dataset
Sales Persons: ['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']
Countries: ['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']
Products: ['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
'70% Dark Bites']
Range: 2022-01-03 00:00:00 - 2024-08-31 00:00:00
Total Amount: 19791571.86
Total Boxes Shipped: 540437
Yearly Sales and Growth:
Amount Growth %
Year
2022 6183625.00 NaN
2023 6643377.96 7.435007
2024 6964568.90 4.834753
Overview for Years
Earliest dates:
2022-01-03 00:00:00
2023-01-03 00:00:00
2024-01-03 00:00:00
Latest dates:
2022-08-31 00:00:00
2023-08-31 00:00:00
2024-08-31 00:00:00
Salesmen list for year 2022:
['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']
Salesmen list for year 2023:
['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']
Salesmen list for year 2024:
['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']
Product list for 2022:
['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
'70% Dark Bites']
Product list for 2023:
['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
'70% Dark Bites']
Product list for 2024:
['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
'70% Dark Bites']
Countries list for year 2022:
['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']
Countries list for year 2023:
['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']
Countries list for year 2024:
['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']
Total Amount in 2022: 6183625.0
Total Boxes Shipped in 2022: 177007
Total Amount in 2023: 6643377.96
Total Boxes Shipped 2023: 181411
Total Amount in 2024: 6964568.9
Total Boxes Shipped in 2024: 182019
Exploratory Data Analysis

Monthly sales show a seasonal pattern with a mid-period dip (April), strong recovery, and peak around June followed by a gradual decline. What is atypical is the period of sales from January to August only.

Sales vary strongly by country, with a common mid-year peak (Month 6–7), high volatility in UK/New Zealand, and steady growth in India.




The business shows strong, consistent, and diversified growth across years, countries, products, and salesmen.

The charts represent product popularity based on boxes shipped, showing that demand is well distributed across products. Common winners across countries are Eclairs, Milk Bars, and Dark chocolate variants.

Sales are evenly distributed across countries, with Australia leading slightly, while all other markets contribute at similar levels—indicating a well-balanced and globally diversified business.

Top Performers are: Smooth Silky Salty (~1.12M), followed by 50% Dark Bites, White Choc, and Peanut Butter Cubes.

Leading salesmen are Ches Bonnell (~1.02M), Oby Sorrel, Madelene Upcott, Kelci Walkden, and Brien Boise.



Data are sparse with the unusual seasonal pattern.
Random Forest Regressor Model
# trend feature
df["year_trend"] = df["Year"] - df["Year"].min()
kf = KFold(n_splits=5, shuffle=False)
amount_mae_list = []
boxes_mae_list = []
amount_pct_list = []
boxes_pct_list = []
# features
for train_idx, test_idx in kf.split(df):
train_df = df.iloc[train_idx].copy()
test_df = df.iloc[test_idx].copy()
# --- feature engineering ---
global_avg = train_df["Amount"].mean()
product_avg = train_df.groupby("Product")["Amount"].mean()
person_avg = train_df.groupby("Sales Person")["Amount"].mean()
country_avg = train_df.groupby("Country")["Amount"].mean()
year_avg = train_df.groupby("Year")["Amount"].mean()
quarter_avg = train_df.groupby("Quarter")["Amount"].mean()
product_country_avg = train_df.groupby(["Product", "Country"])["Amount"].mean()
product_person_avg = train_df.groupby(["Product", "Sales Person"])["Amount"].mean()
product_year_avg = train_df.groupby(["Product", "Year"])["Amount"].mean()
product_month_avg = train_df.groupby(["Product", "Month"])["Amount"].mean()
person_country_avg = train_df.groupby(["Sales Person", "Country"])["Amount"].mean()
def add_features(data):
out = data.copy()
out["product_avg"] = out["Product"].map(product_avg).fillna(global_avg)
out["person_avg"] = out["Sales Person"].map(person_avg).fillna(global_avg)
out["country_avg"] = out["Country"].map(country_avg).fillna(global_avg)
out["year_avg"] = out["Year"].map(year_avg).fillna(global_avg)
out["quarter_avg"] = out["Quarter"].map(quarter_avg).fillna(global_avg)
out["product_country_avg"] = out.apply(
lambda row: product_country_avg.get((row["Product"], row["Country"]), global_avg),
axis=1
)
out["product_person_avg"] = out.apply(
lambda row: product_person_avg.get((row["Product"], row["Sales Person"]), global_avg),
axis=1
)
out["product_year_avg"] = out.apply(
lambda row: product_year_avg.get((row["Product"], row["Year"]), global_avg),
axis=1
)
out["product_month_avg"] = out.apply(
lambda row: product_month_avg.get((row["Product"], row["Month"]), global_avg),
axis=1
)
out["person_country_avg"] = out.apply(
lambda row: person_country_avg.get((row["Sales Person"], row["Country"]), global_avg),
axis=1
)
out["year_trend"] = out["Year"] - df["Year"].min()
return out
train_feat = add_features(train_df)
test_feat = add_features(test_df)
feature_cols = [
"quarter_avg",
"product_avg",
"person_avg",
"country_avg",
"product_country_avg",
"product_person_avg",
"product_month_avg",
"person_country_avg",
"year_trend"
]
X_train = train_feat[feature_cols]
X_test = test_feat[feature_cols]
y_train = train_feat[["Amount", "Boxes Shipped"]]
y_test = test_feat[["Amount", "Boxes Shipped"]]
model = RandomForestRegressor(
n_estimators=200,
random_state=42,
n_jobs=-1
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
# --- fold MAE ---
amount_mae = mean_absolute_error(y_test["Amount"], preds[:, 0])
boxes_mae = mean_absolute_error(y_test["Boxes Shipped"], preds[:, 1])
# --- fold MAE % ---
amount_pct = amount_mae / y_test["Amount"].mean()
boxes_pct = boxes_mae / y_test["Boxes Shipped"].mean()
amount_mae_list.append(amount_mae)
boxes_mae_list.append(boxes_mae)
amount_pct_list.append(amount_pct)
boxes_pct_list.append(boxes_pct)
# --- fold-average results ---
print("FOLD-AVERAGED RESULTS")
print(f"Amount MAE: {np.mean(amount_mae_list):.2f}")
print(f"Amount MAE %: {np.mean(amount_pct_list):.2%}")
print(f"Boxes Shipped MAE: {np.mean(boxes_mae_list):.2f}")
print(f"Boxes Shipped MAE %: {np.mean(boxes_pct_list):.2%}")FOLD-AVERAGED RESULTS
Amount MAE: 552.16
Amount MAE %: 9.12%
Boxes Shipped MAE: 25.50
Boxes Shipped MAE %: 15.49%
Final Evaluation
# --- separate actual values ---
y_test_amount = y_test["Amount"].values
y_test_boxes = y_test["Boxes Shipped"].values
# --- separate predictions ---
pred_amount = preds[:, 0]
pred_boxes = preds[:, 1]
# --- MAE ---
mae_amount = mean_absolute_error(y_test_amount, pred_amount)
mae_boxes = mean_absolute_error(y_test_boxes, pred_boxes)
# --- MAE % ---
mae_amount_pct = mae_amount / y_test_amount.mean()
mae_boxes_pct = mae_boxes / y_test_boxes.mean()
# --- print ---
print(f"Amount MAE: {mae_amount:.2f}")
print(f"Amount MAE %: {mae_amount_pct:.2%}")
print(f"Boxes Shipped MAE: {mae_boxes:.2f}")
print(f"Boxes Shipped MAE %: {mae_boxes_pct:.2%}")Amount MAE: 664.61
Amount MAE %: 10.37%
Boxes Shipped MAE: 25.53
Boxes Shipped MAE %: 14.85%
Prediciton for Year 2025
# training of model on all history
train_full = add_features(df)
X_full = train_full[feature_cols]
y_full = train_full[["Amount", "Boxes Shipped"]]
final_model = RandomForestRegressor(
n_estimators=200,
random_state=42,
n_jobs=-1
)
final_model.fit(X_full, y_full)
# building future 2025 dataset
# copy last real year structure
future_2025 = filtered_2024.copy()
future_2025["Year"] = 2025
future_2025["Date"] = pd.to_datetime(future_2025["Date"]) + pd.DateOffset(years=1)
future_2025["Month"] = future_2025["Date"].dt.month
future_2025["Quarter"] = future_2025["Date"].dt.quarter
future_2025 = future_2025.drop(columns=["Amount", "Boxes Shipped"], errors="ignore")
# predict 2025
future_2025_feat = add_features(future_2025)
future_preds = final_model.predict(future_2025_feat[feature_cols])
future_2025["Predicted Amount"] = future_preds[:, 0]
future_2025["Predicted Boxes Shipped"] = future_preds[:, 1]
# average growth rate around 6.13 %
growth_rate = 0.0613
future_2025["Predicted Amount"] = future_2025["Predicted Amount"] * (1+growth_rate)
future_2025["Predicted Boxes Shipped"] = future_2025["Predicted Boxes Shipped"] * (1+ growth_rate)
future_2025=future_2025.reset_index()
future_2025.head()| index | Sales Person | Country | Product | Date | Year | Month | Month Label | Quarter | Predicted Amount | Predicted Boxes Shipped | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2188 | Jehu Rudeforth | UK | Mint Chip Choco | 2025-01-04 | 2025 | 1 | 2024-01 | 1 | 6579.131362 | 217.847744 |
| 1 | 2189 | Van Tuxwell | India | 85% Dark Bars | 2025-08-01 | 2025 | 8 | 2024-08 | 3 | 9259.760037 | 103.041617 |
| 2 | 2190 | Gigi Bohling | India | Peanut Butter Cubes | 2025-07-07 | 2025 | 7 | 2024-07 | 3 | 5373.299283 | 106.374099 |
| 3 | 2191 | Jan Morforth | Australia | Peanut Butter Cubes | 2025-04-27 | 2025 | 4 | 2024-04 | 2 | 14435.567628 | 329.623860 |
| 4 | 2192 | Jehu Rudeforth | UK | Peanut Butter Cubes | 2025-02-24 | 2025 | 2 | 2024-02 | 1 | 14809.633161 | 199.964839 |
Sales Persons: ['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']
Countries: ['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']
Products: ['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
'70% Dark Bites']
Range: 2025-01-03 00:00:00 - 2024-08-31 00:00:00
Predicted Amount: 7257209.6
Predicted Boxes Shipped: 192811.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 1094 non-null int64
1 Sales Person 1094 non-null object
2 Country 1094 non-null object
3 Product 1094 non-null object
4 Date 1094 non-null datetime64[ns]
5 Year 1094 non-null int64
6 Month 1094 non-null int64
7 Month Label 1094 non-null object
8 Quarter 1094 non-null int64
9 Predicted Amount 1094 non-null float64
10 Predicted Boxes Shipped 1094 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(4), object(4)
memory usage: 94.1+ KB
None
