Lucia Macakova
  • About
  • Portfolio

On this page

  • About the Dataset
  • Exploratory Data Analysis
  • Random Forest Regressor Model
  • Prediciton for Year 2025
    • End

Chocolate Sales

various chocolate bars Author: Alleksana, pexels.com

About the Dataset

This dataset contains transactional sales records covering multiple countries, products, and salespeople across a time period from roughly 2022 to 2024.

Sales Person Country Product Date Amount Boxes Shipped Year Month Month Label Quarter
0 Jehu Rudeforth UK Mint Chip Choco 2022-01-04 5320.0 180 2022 1 2022-01 Q1
1 Van Tuxwell India 85% Dark Bars 2022-08-01 7896.0 94 2022 8 2022-08 Q3
2 Gigi Bohling India Peanut Butter Cubes 2022-07-07 4501.0 91 2022 7 2022-07 Q3
3 Jan Morforth Australia Peanut Butter Cubes 2022-04-27 12726.0 342 2022 4 2022-04 Q2
4 Jehu Rudeforth UK Peanut Butter Cubes 2022-02-24 13685.0 184 2022 2 2022-02 Q1

Information about Dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3282 entries, 0 to 3281
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Sales Person   3282 non-null   object        
 1   Country        3282 non-null   object        
 2   Product        3282 non-null   object        
 3   Date           3282 non-null   datetime64[ns]
 4   Amount         3282 non-null   float64       
 5   Boxes Shipped  3282 non-null   int64         
 6   Year           3282 non-null   int64         
 7   Month          3282 non-null   int64         
 8   Month Label    3282 non-null   object        
 9   Quarter        3282 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 256.5+ KB

Overview for the Whole Dataset

Sales Persons: ['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
 'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
 'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
 'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
 'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
 'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']

Countries: ['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']

Products: ['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
 'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
 'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
 'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
 'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
 'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
 '70% Dark Bites']

Range: 2022-01-03 00:00:00 - 2024-08-31 00:00:00
Total Amount: 19791571.86
Total Boxes Shipped: 540437

Yearly Sales and Growth:
          Amount  Growth %
Year                      
2022  6183625.00       NaN
2023  6643377.96  7.435007
2024  6964568.90  4.834753

Overview for Years

Earliest dates: 
2022-01-03 00:00:00
2023-01-03 00:00:00
2024-01-03 00:00:00

Latest dates: 
2022-08-31 00:00:00
2023-08-31 00:00:00
2024-08-31 00:00:00

Salesmen list for year 2022:
['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
 'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
 'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
 'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
 'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
 'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']

Salesmen list for year 2023:
['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
 'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
 'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
 'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
 'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
 'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']

Salesmen list for year 2024:
['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
 'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
 'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
 'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
 'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
 'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']

Product list for 2022: 
['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
 'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
 'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
 'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
 'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
 'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
 '70% Dark Bites']

Product list for 2023: 
['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
 'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
 'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
 'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
 'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
 'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
 '70% Dark Bites']

Product list for 2024: 
['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
 'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
 'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
 'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
 'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
 'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
 '70% Dark Bites']

Countries list for year 2022: 
['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']

Countries list for year 2023: 
['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']

Countries list for year 2024: 
['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']

Total Amount in 2022: 6183625.0
Total Boxes Shipped in 2022: 177007

Total Amount in 2023: 6643377.96
Total Boxes Shipped 2023: 181411

Total Amount in 2024: 6964568.9
Total Boxes Shipped in 2024: 182019

Exploratory Data Analysis

Monthly sales show a seasonal pattern with a mid-period dip (April), strong recovery, and peak around June followed by a gradual decline. What is atypical is the period of sales from January to August only.

Sales vary strongly by country, with a common mid-year peak (Month 6–7), high volatility in UK/New Zealand, and steady growth in India.

The business shows strong, consistent, and diversified growth across years, countries, products, and salesmen.

The charts represent product popularity based on boxes shipped, showing that demand is well distributed across products. Common winners across countries are Eclairs, Milk Bars, and Dark chocolate variants.

Sales are evenly distributed across countries, with Australia leading slightly, while all other markets contribute at similar levels—indicating a well-balanced and globally diversified business.

Top Performers are: Smooth Silky Salty (~1.12M), followed by 50% Dark Bites, White Choc, and Peanut Butter Cubes.

Leading salesmen are Ches Bonnell (~1.02M), Oby Sorrel, Madelene Upcott, Kelci Walkden, and Brien Boise.

Data are sparse with the unusual seasonal pattern.

Random Forest Regressor Model

#  trend feature 
df["year_trend"] = df["Year"] - df["Year"].min()

kf = KFold(n_splits=5, shuffle=False)

amount_mae_list = []
boxes_mae_list = []

amount_pct_list = []
boxes_pct_list = []

# features
for train_idx, test_idx in kf.split(df):

    train_df = df.iloc[train_idx].copy()
    test_df = df.iloc[test_idx].copy()

    # --- feature engineering ---
    global_avg = train_df["Amount"].mean()

    product_avg = train_df.groupby("Product")["Amount"].mean()
    person_avg = train_df.groupby("Sales Person")["Amount"].mean()
    country_avg = train_df.groupby("Country")["Amount"].mean()
    year_avg = train_df.groupby("Year")["Amount"].mean()
    quarter_avg = train_df.groupby("Quarter")["Amount"].mean()

    product_country_avg = train_df.groupby(["Product", "Country"])["Amount"].mean()
    product_person_avg = train_df.groupby(["Product", "Sales Person"])["Amount"].mean()
    product_year_avg = train_df.groupby(["Product", "Year"])["Amount"].mean()
    product_month_avg = train_df.groupby(["Product", "Month"])["Amount"].mean()

    person_country_avg = train_df.groupby(["Sales Person", "Country"])["Amount"].mean()

    def add_features(data):
        out = data.copy()

        out["product_avg"] = out["Product"].map(product_avg).fillna(global_avg)
        out["person_avg"] = out["Sales Person"].map(person_avg).fillna(global_avg)
        out["country_avg"] = out["Country"].map(country_avg).fillna(global_avg)
        out["year_avg"] = out["Year"].map(year_avg).fillna(global_avg)
        out["quarter_avg"] = out["Quarter"].map(quarter_avg).fillna(global_avg)

        out["product_country_avg"] = out.apply(
            lambda row: product_country_avg.get((row["Product"], row["Country"]), global_avg),
            axis=1
        )

        out["product_person_avg"] = out.apply(
            lambda row: product_person_avg.get((row["Product"], row["Sales Person"]), global_avg),
            axis=1
        )

        out["product_year_avg"] = out.apply(
            lambda row: product_year_avg.get((row["Product"], row["Year"]), global_avg),
            axis=1
        )

        out["product_month_avg"] = out.apply(
            lambda row: product_month_avg.get((row["Product"], row["Month"]), global_avg),
            axis=1
        )

        out["person_country_avg"] = out.apply(
            lambda row: person_country_avg.get((row["Sales Person"], row["Country"]), global_avg),
            axis=1
        )

        
        out["year_trend"] = out["Year"] - df["Year"].min()

        return out

    train_feat = add_features(train_df)
    test_feat = add_features(test_df)

    feature_cols = [
        "quarter_avg",
        "product_avg",
        "person_avg",
        "country_avg",
        "product_country_avg",
        "product_person_avg",
        "product_month_avg",
        "person_country_avg",
        "year_trend"
    ]

    X_train = train_feat[feature_cols]
    X_test = test_feat[feature_cols]

    y_train = train_feat[["Amount", "Boxes Shipped"]]
    y_test = test_feat[["Amount", "Boxes Shipped"]]

    model = RandomForestRegressor(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    )

    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # --- fold MAE ---
    amount_mae = mean_absolute_error(y_test["Amount"], preds[:, 0])
    boxes_mae = mean_absolute_error(y_test["Boxes Shipped"], preds[:, 1])

    # --- fold MAE % ---
    amount_pct = amount_mae / y_test["Amount"].mean()
    boxes_pct = boxes_mae / y_test["Boxes Shipped"].mean()

    amount_mae_list.append(amount_mae)
    boxes_mae_list.append(boxes_mae)

    amount_pct_list.append(amount_pct)
    boxes_pct_list.append(boxes_pct)

# --- fold-average results ---
print("FOLD-AVERAGED RESULTS")
print(f"Amount MAE: {np.mean(amount_mae_list):.2f}")
print(f"Amount MAE %: {np.mean(amount_pct_list):.2%}")

print(f"Boxes Shipped MAE: {np.mean(boxes_mae_list):.2f}")
print(f"Boxes Shipped MAE %: {np.mean(boxes_pct_list):.2%}")
FOLD-AVERAGED RESULTS
Amount MAE: 552.16
Amount MAE %: 9.12%
Boxes Shipped MAE: 25.50
Boxes Shipped MAE %: 15.49%

Final Evaluation

# --- separate actual values ---
y_test_amount = y_test["Amount"].values
y_test_boxes = y_test["Boxes Shipped"].values

# --- separate predictions ---
pred_amount = preds[:, 0]
pred_boxes = preds[:, 1]

# --- MAE ---
mae_amount = mean_absolute_error(y_test_amount, pred_amount)
mae_boxes = mean_absolute_error(y_test_boxes, pred_boxes)

# --- MAE % ---
mae_amount_pct = mae_amount / y_test_amount.mean()
mae_boxes_pct = mae_boxes / y_test_boxes.mean()

# --- print ---
print(f"Amount MAE: {mae_amount:.2f}")
print(f"Amount MAE %: {mae_amount_pct:.2%}")

print(f"Boxes Shipped MAE: {mae_boxes:.2f}")
print(f"Boxes Shipped MAE %: {mae_boxes_pct:.2%}")
Amount MAE: 664.61
Amount MAE %: 10.37%
Boxes Shipped MAE: 25.53
Boxes Shipped MAE %: 14.85%

Prediciton for Year 2025

# training of model on all history

train_full = add_features(df)

X_full = train_full[feature_cols]
y_full = train_full[["Amount", "Boxes Shipped"]]

final_model = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

final_model.fit(X_full, y_full)


# building future 2025 dataset
# copy last real year structure


future_2025 = filtered_2024.copy()
future_2025["Year"] = 2025
future_2025["Date"] = pd.to_datetime(future_2025["Date"]) + pd.DateOffset(years=1)
future_2025["Month"] = future_2025["Date"].dt.month
future_2025["Quarter"] = future_2025["Date"].dt.quarter

future_2025 = future_2025.drop(columns=["Amount", "Boxes Shipped"], errors="ignore")

# predict 2025
future_2025_feat = add_features(future_2025)
future_preds = final_model.predict(future_2025_feat[feature_cols])

future_2025["Predicted Amount"] = future_preds[:, 0]
future_2025["Predicted Boxes Shipped"] = future_preds[:, 1]
# average growth rate around 6.13 %
growth_rate = 0.0613
future_2025["Predicted Amount"] = future_2025["Predicted Amount"] * (1+growth_rate)
future_2025["Predicted Boxes Shipped"] = future_2025["Predicted Boxes Shipped"] * (1+ growth_rate)
future_2025=future_2025.reset_index()
future_2025.head()
index Sales Person Country Product Date Year Month Month Label Quarter Predicted Amount Predicted Boxes Shipped
0 2188 Jehu Rudeforth UK Mint Chip Choco 2025-01-04 2025 1 2024-01 1 6579.131362 217.847744
1 2189 Van Tuxwell India 85% Dark Bars 2025-08-01 2025 8 2024-08 3 9259.760037 103.041617
2 2190 Gigi Bohling India Peanut Butter Cubes 2025-07-07 2025 7 2024-07 3 5373.299283 106.374099
3 2191 Jan Morforth Australia Peanut Butter Cubes 2025-04-27 2025 4 2024-04 2 14435.567628 329.623860
4 2192 Jehu Rudeforth UK Peanut Butter Cubes 2025-02-24 2025 2 2024-02 1 14809.633161 199.964839

Sales Persons: ['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel'
 'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny'
 'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet'
 'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly'
 'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden'
 'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell']


Countries: ['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada']


Products: ['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes'
 'Smooth Silky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites'
 'Orange Choco' 'Eclairs' 'Drinking Choco' 'Organic Choco Syrup'
 'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc'
 'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco'
 'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars'
 '70% Dark Bites']


Range: 2025-01-03 00:00:00 - 2024-08-31 00:00:00

Predicted Amount: 7257209.6

Predicted Boxes Shipped: 192811.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   index                    1094 non-null   int64         
 1   Sales Person             1094 non-null   object        
 2   Country                  1094 non-null   object        
 3   Product                  1094 non-null   object        
 4   Date                     1094 non-null   datetime64[ns]
 5   Year                     1094 non-null   int64         
 6   Month                    1094 non-null   int64         
 7   Month Label              1094 non-null   object        
 8   Quarter                  1094 non-null   int64         
 9   Predicted Amount         1094 non-null   float64       
 10  Predicted Boxes Shipped  1094 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(4), object(4)
memory usage: 94.1+ KB
None


End

© 2025 Lucia Macakova