Starbucks Marketing Offer Analysis and Personalisation

Lusiana Djie
11 min read · Sep 18, 2021

Project Overview

Who hasn’t heard of Starbucks? The modest American coffee roaster and retailer, known for its “green twin-tailed mermaid” logo has transformed into an international coffee powerhouse. Like many other big stores, Starbucks spends an exorbitant amount of money on marketing campaigns. Every now and then, customers will receive offers, such as BOGO (buy one, get one), informational and discount promotions.

Marketers have to constantly weigh the benefits of their promotions against the costs, and understand the driving factors behind their marketing optimization efforts. Poorly targeted campaigns can undoubtedly drain the marketing budget; worse, they can also cost time and human resources, and even cause the company to lose customers!

Here are some possible scenarios that happen when promotions are poorly targeted:

  • Sending offers to “disinterested” customers who would not purchase regardless of how many times they receive the offers
  • Sending offers to customers who are inactive (e.g. deleted the app, turned off push notifications)
  • Sending offers to Starbucks die-hard fans who would have bought the products anyway, regardless of the availability of promotion

Therefore, it is essential for marketers to be fully aware of the driving factors behind their campaign effectiveness. It is only then that they are able to harness these factors using data-based decisions to steer the company to generate fruitful marketing returns.

Problem Statement

This project aims to explore a simulated dataset that mimics the responses of Starbucks customers on the mobile app when offers are received. Without a clear promotional strategy, Starbucks is not able to optimize the return from its offers.

The goal is to understand the success rate of an offer depending on the customer demographics. Additionally, a few classification predictive models will be explored to see if a customer is likely to purchase, given a specific type of offer.

Here are some of the steps that will be taken to achieve the goal:

  • Initial exploration to give a better understanding of the data
  • Data clean-up to deal with anomalies or missing records
  • Exploratory data analysis to understand the different factors affecting purchase behaviour
  • Further pre-processing before the data are ingested into the model, e.g. splitting into train and test datasets and normalizing with StandardScaler()
  • Exploration of two commonly-used classification models that can be used to predict if a customer is likely to purchase
  • Using GridSearchCV() to find the best parameters for the model
  • Selection of the model based on the comparison of F1-scores

Metrics

Below are a number of metrics that can be used to evaluate a classification model:

Accuracy is the proportion of data points that are correctly predicted out of all data points.

Precision is the proportion of predicted positives that are actually positive. It is a good measure when the cost of a false positive is high, e.g. a spam detection problem.

Recall is the proportion of actual positives that the model captures by labelling them as positive. It is an especially useful metric when the cost of a false negative is high, e.g. a cancer detection problem.

F1-score is the harmonic mean of precision and recall. It is used when the costs of both False Negatives and False Positives are high.

These are the costs of having False Positives and False Negatives respectively:

  • Losing money to customers who are unlikely to purchase when offers are sent to them
  • Missing returns from not sending offers to customers that are likely to purchase

As the implications of having these False Positives and False Negatives are costly, it is important to take both into consideration. Therefore, F1-score will be used as the evaluation metric for the models.
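These metrics are available in scikit-learn; a minimal sketch on toy labels (the values here are illustrative only, not from the project data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 1 = offer successful, 0 = not successful
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```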

Dataset

There are three datasets that will be used for our analysis. Here are the details:

  1. Portfolio
  • id (string) — offer id
  • offer_type (string) — type of offer, i.e. BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

2. Profile

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

3. Transcript

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since the start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record
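To make the three schemas concrete, here are toy one-row DataFrames mirroring the fields above. All values are illustrative only; the real data would be loaded from the project's JSON files (e.g. with pd.read_json).

```python
import pandas as pd

# Illustrative rows only -- ids, amounts and channels are made up
portfolio = pd.DataFrame([
    {"id": "offer_a", "offer_type": "bogo", "difficulty": 5,
     "reward": 5, "duration": 7, "channels": ["email", "mobile"]},
])
profile = pd.DataFrame([
    {"id": "cust_1", "age": 58, "became_member_on": 20170715,
     "gender": "F", "income": 64000.0},
])
transcript = pd.DataFrame([
    {"person": "cust_1", "event": "offer received", "time": 0,
     "value": {"offer id": "offer_a"}},
])
```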

This is what the DataFrame looks like after some clean-up:

Data Exploration and Visualization

Assumptions and Caveats

  • Product prices and customer behaviour vary across geographical areas — the dataset is assumed to be taken from a sample of Starbucks customers in the United States
  • The simulated data assumes only one product. In the real world, Starbucks has numerous product offerings
  • Customers with gender ‘O’ (other) are excluded from the analysis, as it cannot be determined whether ‘O’ represents customers who are not comfortable revealing their gender, or a non-binary gender

Customers Age Distribution

The above histogram shows the age distribution of Starbucks customers. It seems that Starbucks is well-received across different age groups. However, Starbucks is most popular among those between 55 and 65 years old (median of 58). It can also be seen that the age distribution across genders is almost identical.

One interesting point to take note of is that there is a high number of customers aged 118 years old. It is highly unlikely that this is the real age of the customer — therefore, these data points can be ruled out as outliers and removed to prevent the distortion of statistical analyses performed in the next steps.

Income Distribution

According to the U.S. Census Bureau, the annual real median personal income was 35,977 USD in 2019. Based on the sample data that we have, 50% of Starbucks customers earn above the US median income, ranging from 50,000 to 80,000 USD, with a median of 64,000 USD. This means the median Starbucks customer earns ~78% more than the median American!

Income distribution seems to differ by gender. The distribution among male customers is right-skewed, whereas the distribution among female customers looks more even. From the KDE plot above, we can also see that female customers in the dataset earn more than their male counterparts: females earn slightly above $70,000 and males slightly below $60,000. Even though the difference is not that significant, there is a possibility that the groups may respond differently to certain types of promotion.

Length of Stay Distribution

A new column called length_of_stay_in_days was created to show the number of days a person has been a customer of Starbucks according to the data collected. The median length of stay of Starbucks customers is 1,501 days, or about 4 years. It is interesting to see Starbucks is still able to command a loyal customer base for so many years despite being in an increasingly competitive industry. In this case, the customers' length of stay does not seem to differ across gender groups.
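The column can be derived from became_member_on; a minimal sketch, assuming the yyyymmdd integer format shown in the Profile schema and an arbitrary snapshot date (the notebook may instead use the latest date present in the data):

```python
import pandas as pd

# Toy membership dates in yyyymmdd integer form, as in the profile data
profile = pd.DataFrame({"became_member_on": [20170715, 20200101]})

# Parse the integer date, then count days up to an assumed snapshot date
member_since = pd.to_datetime(profile["became_member_on"], format="%Y%m%d")
snapshot = pd.Timestamp("2021-09-18")
profile["length_of_stay_in_days"] = (snapshot - member_since).dt.days
```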

Offer Success Rate by Gender

To see if an offer is ‘successful’, a new column called offer_successful was created. It was computed by checking whether an offer had been both viewed and completed by the customer, and was subsequently used to compute the “success rate”. By plotting the “success rate” against gender, we can see that the split between successful and unsuccessful offers among female customers is almost equal. On the other hand, there are slightly fewer successful offers among male customers.
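The viewed-and-completed rule can be sketched on a toy event log (person and offer ids here are made up; the real flags would come from the transcript data):

```python
import pandas as pd

# Toy event log standing in for the transcript data
events = pd.DataFrame({
    "person":   ["c1", "c1", "c1", "c2"],
    "offer_id": ["o1", "o1", "o1", "o1"],
    "event":    ["offer received", "offer viewed", "offer completed",
                 "offer received"],
})

# An offer counts as successful only if the same customer both viewed
# and completed it, mirroring the definition above
flags = (events.assign(viewed=events["event"] == "offer viewed",
                       completed=events["event"] == "offer completed")
               .groupby(["person", "offer_id"])[["viewed", "completed"]]
               .max())
flags["offer_successful"] = (flags["viewed"] & flags["completed"]).astype(int)
```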

Data Pre-processing

Merging DataFrames

The transcript, profile and portfolio DataFrames are merged into one, with one-hot encoding applied so the predictive model can ingest the categorical features. Here is the resulting DataFrame:
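A minimal sketch of the merge and encoding, using toy frames with made-up column values (the real join keys follow the schemas listed earlier):

```python
import pandas as pd

# Toy stand-ins for the three DataFrames
transcript = pd.DataFrame({"person": ["c1"], "offer_id": ["o1"]})
profile = pd.DataFrame({"id": ["c1"], "gender": ["F"], "income": [64000.0]})
portfolio = pd.DataFrame({"id": ["o1"], "offer_type": ["bogo"]})

# Join customer attributes, then offer attributes, onto each event
merged = (transcript
          .merge(profile, left_on="person", right_on="id")
          .merge(portfolio, left_on="offer_id", right_on="id",
                 suffixes=("_cust", "_offer")))

# One-hot encode the categorical columns for the classifiers
encoded = pd.get_dummies(merged, columns=["gender", "offer_type"])
```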

Dealing with Missing Values and Anomalies

There are 8,066 records (12%) in the dataset with NULL income values. All 8,066 of these records also belong to customers who put 118 as their age, and none of them specifies a gender. As these records are most likely bad data points, keeping them would throw off the accuracy of our predictions.

Usually, other methods such as value imputation can be considered to fill in NULL values. However, these records have several problems at once, and they make up only a small proportion of the entire dataset, so the wiser decision is to remove them.
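The drop can be sketched with a toy frame reproducing the pattern described above, where the age-118 rows are exactly the rows with missing income and gender:

```python
import pandas as pd

# Toy frame: the age-118 rows carry the missing income and gender values
df = pd.DataFrame({"age": [58, 118, 35, 118],
                   "gender": ["F", None, "M", None],
                   "income": [64000.0, None, 52000.0, None]})

# Dropping rows with NULL income removes the whole group of bad records
df_clean = df.dropna(subset=["income"])
```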

Implementation

Splitting into Train and Test Dataset

Since our objective is to create a model that predicts an offer's success based on customer attributes, the offer_successful column will be the target variable. The third column onwards will be assigned as the predictor variables — these include income, length_of_stay_in_days, gender and many more.

The predictor and target variables are then split into training and testing datasets. The training set is used to fit the model, whereas the testing set is needed for an unbiased evaluation of the model.
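The split can be sketched as follows; the toy frame and the 80/20 split ratio are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the merged, cleaned DataFrame
df = pd.DataFrame({
    "offer_successful": [1, 0, 1, 0] * 25,
    "income": range(100),
    "length_of_stay_in_days": range(100, 200),
})

X = df.drop(columns="offer_successful")  # predictor variables
y = df["offer_successful"]               # target variable

# Hold out 20% of the rows for an unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```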

Feature Scaling

Before creating the model, the features need to be scaled as they are measured in different units. This step is essential to prevent features with wider ranges from dominating the contribution to the model. Here, StandardScaler() is used to normalize the features and transform each column to have μ = 0 and σ = 1.

As the test set is reserved to evaluate the model on unseen data, the normalization process is done after the ‘train test split’. This is to prevent information about the distribution of the test set from leaking into the model.

Modelling — Random Forest Classifier

The first algorithm that will be explored is Random Forest. Instead of using just one decision tree, this algorithm utilizes ensemble learning. It takes randomly selected subsets of the dataset and creates a set of decision trees — this increases the diversity of the trees and generally results in a better model.

To start, RandomForestClassifier() is instantiated and GridSearchCV() is used to find the best parameters for the model.

Inside rfc_params, ‘n_estimators’ represents the number of trees in the forest. The optimal number of trees depends on several factors, one of which is the number of observations in the dataset: the more observations the dataset has, the more trees are needed. ‘max_depth’ represents the maximum depth of each tree. A deeper tree allows more splits, and the more splits a tree has, the more information it captures.
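A minimal sketch of this step, using synthetic stand-in data and a small illustrative grid (the article's actual grid values are not shown, so the ones below are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Illustrative grid over the two parameters discussed above
rfc_params = {"n_estimators": [10, 30], "max_depth": [3, 10]}

# Cross-validated search scored on F1, matching the chosen metric
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    rfc_params, scoring="f1", cv=3)
grid.fit(X, y)
best = grid.best_params_
```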

Refinement

GridSearchCV is a function from the scikit-learn library that performs hyperparameter tuning to identify the optimal parameter combination. The best parameters for the RandomForestClassifier identified by GridSearchCV are as follows:

  • ‘max_depth’: 10
  • ‘n_estimators’: 70

Evaluation

The model is fitted to the training dataset with the parameters defined in rfc_params, and then used to predict on the test dataset.

In the figure above, a classification report shows the metrics used to evaluate model performance.

An F1-score of 0.76 suggests that the model does relatively well at predicting the success of the offers on the test dataset.

The above illustration shows a confusion matrix that gives a more visual representation of the RandomForestClassifier model prediction classification report.
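Both the report and the matrix come from scikit-learn; a sketch on toy labels (not the project's actual predictions):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy test labels and predictions for illustration only
y_test = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# Per-class precision, recall and F1 in one table
report = classification_report(y_test, y_pred)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
```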

Modelling — Logistic Regression

Another algorithm is explored to see if the prediction score can be improved further. Logistic regression is a commonly used regression method and one of the simplest algorithms for binary classification.

LogisticRegression() is instantiated and, again, GridSearchCV() is used to search for the best parameter combination. The first parameter, ‘penalty’, specifies the regularization term; regularization is used to penalize high coefficients to prevent overfitting. ‘C’ represents the inverse regularization strength: stronger regularization is applied when the value of ‘C’ is lower.

Next, the model is fitted to the training set using the parameters defined in lgr_params. Then, using the predict() method, the fitted model predicts the ‘success outcome’ of the promotion based on the features in the test dataset (X_test).
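A sketch of this step on synthetic stand-in data; the grid values are assumptions, and for brevity the model predicts on the same data rather than a held-out test set as in the article:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 'liblinear' is used here because it supports both l1 and l2 penalties
lgr_params = {"penalty": ["l1", "l2"], "C": [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), lgr_params,
                    scoring="f1", cv=3)
grid.fit(X, y)
y_pred = grid.predict(X)
```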

Refinement

The best parameters for the Logistic Regression model identified by GridSearchCV are as follows:

  • ‘C’: 10
  • ‘penalty’: l2

Evaluation

The model performs relatively well, with an F1-score of 0.74. However, the score is slightly worse than that of the Random Forest model we had earlier.

The picture above illustrates the confusion matrix, giving a visual understanding of how the classification model performs.

Summary Points

  • Starbucks customers mostly comprise middle-aged people, with an increasing young-adult audience. The average age does not seem to differ much by gender
  • These customers earn considerably more than the median American ($64,000 vs. $35,977)
  • There is a larger proportion of high-income female customers compared to their male counterparts
  • Starbucks commands a loyal customer base, with a median length of stay of 4 years
  • Based on the exploration, RandomForestClassifier turned out to be the best algorithm to predict ‘offer success’ with an F1-score of 0.76

Reflection & Improvements

Increasing the number of purchases through targeted offers was one step we could take to optimize offers. However, it would be better if the revenue uplift from the new promotional strategy could also be estimated. Ultimately, the increase in purchases must translate into an increase in revenue. Some metrics that can be explored are Incremental Response Rate (IRR) and Net Incremental Revenue (NIR). More details can be seen here.

As this model is going to be served to marketers, it can also be packaged in a user-friendly web app that allows the user to explore how certain features impact purchasing behaviour and revenue.

Note:

  • The notebook used for the analysis can be found here.
  • The writer is also still relatively new to Machine Learning topics so any correction or feedback would be more than welcome!‍🙇🏻‍♀️
