What is the offer in Starbucks?
This project is the Capstone Project of Udacity’s Data Scientist Nanodegree program.
Project Overview:
Keeping customers satisfied is one of the most important parts of a successful business, and Starbucks undoubtedly works to increase customer loyalty. Analyzing data is one way to follow customers’ behaviour and work toward their satisfaction.
Problem Statement:
The goal of this project is to understand how customers behave and interact with the different types of offers. First, we answer questions about customer demographics, transactions, and offers:
- Types of customers and their age groups.
- Customers’ income levels and their transcript records.
- Types of offers and how customers interact with them.
Metrics:
In this project, we use two metrics to evaluate the models. The mean squared error (MSE) is among the most widely used metrics for optimization in regression problems, and R2 is another common metric for evaluating regression values.
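To make the two metrics concrete, here is a minimal sketch of how they can be computed by hand. The helper functions are handwritten for illustration (named after the scikit-learn functions of the same name), and the label values are made up, assuming offer types are encoded as numbers.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical encoded offer types (1 = BOGO, 2 = discount, 3 = informational).
y_true = [1, 2, 3, 2, 1]
y_pred = [1, 2, 2, 2, 1]  # one prediction is off by one class code

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.3f}, R2 = {r2:.3f}")
```

A lower MSE means the predictions sit closer to the targets, while an R2 closer to 1 means the model explains more of the variance in the targets.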
Data Exploration and Preprocessing:
This project uses three data sets:
- portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
- profile.json — demographic data for each customer
- transcript.json — records for transactions, offers received, offers viewed, and offers completed.
After exploring the files, we did some data wrangling:
- In the Portfolio table: we split the channels column into a separate attribute for each channel type.
- In the Transcript table: we split the value column into offer_id, amount, and reward attributes.
- In the Profile table: we filled NaN values in the gender attribute with N/A, and NaN values in the income attribute with the mode. In the age attribute we noticed outliers such as 101 and 118, which are implausible as ages, so we dropped rows with an age greater than 95.
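The wrangling steps above can be sketched with pandas roughly as follows. The tiny data frames are made-up stand-ins for the three JSON files, and the column names, the list of channels, and the two spellings of the offer id key ("offer id" and "offer_id") are assumptions rather than details confirmed by this write-up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for portfolio.json, transcript.json and profile.json.
portfolio = pd.DataFrame({
    "id": ["o1", "o2"],
    "offer_type": ["bogo", "discount"],
    "channels": [["email", "web"], ["email", "mobile", "social"]],
})
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [{"offer id": "o1"}, {"amount": 4.5}, {"offer_id": "o2", "reward": 2}],
})
profile = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "gender": ["M", None, None, "F"],
    "age": [34, 118, 52, 70],
    "income": [60000.0, np.nan, np.nan, 60000.0],
})

# Portfolio: one 0/1 indicator column per channel (channel list is assumed).
for channel in ["email", "mobile", "social", "web"]:
    portfolio[channel] = portfolio["channels"].apply(
        lambda chs, c=channel: int(c in chs)
    )
portfolio = portfolio.drop(columns="channels")

# Transcript: pull offer_id, amount and reward out of the value dictionaries.
transcript["offer_id"] = transcript["value"].apply(
    lambda v: v.get("offer_id", v.get("offer id"))
)
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))
transcript["reward"] = transcript["value"].apply(lambda v: v.get("reward"))
transcript = transcript.drop(columns="value")

# Profile: fill missing gender and income, then drop implausible ages.
profile["gender"] = profile["gender"].fillna("N/A")
profile["income"] = profile["income"].fillna(profile["income"].mode()[0])
profile = profile[profile["age"] <= 95].reset_index(drop=True)
```

Filling before dropping follows the order described above; dropping the age outliers first would change which rows contribute to the income mode.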
Before analyzing and visualizing the data, we merged all data sets into one table.
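One possible way to do this merge with pandas is sketched below. The join keys (person matching the profile id, offer_id matching the portfolio id) and the choice of left joins are assumptions for illustration; the sample rows are made up.

```python
import pandas as pd

# Hypothetical, already-cleaned tables.
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "event": ["offer received", "transaction", "offer completed"],
    "offer_id": ["o1", None, "o2"],
})
profile = pd.DataFrame({"id": ["p1", "p2"], "gender": ["M", "F"], "age": [34, 52]})
portfolio = pd.DataFrame({"id": ["o1", "o2"], "offer_type": ["bogo", "discount"]})

# Left joins keep every transcript row, including plain transactions
# that have no offer attached.
merged = (
    transcript
    .merge(profile.rename(columns={"id": "person"}), on="person", how="left")
    .merge(portfolio.rename(columns={"id": "offer_id"}), on="offer_id", how="left")
)
print(merged)
```

With left joins, rows without an offer simply carry missing values in the offer columns instead of being dropped.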
In this part, we answer business questions by conducting univariate and multivariate analysis:
What are the rates of profiles per age at Starbucks?
Which age groups per gender are included in Starbucks profiles?
What are the income rates per age in Starbucks profiles?
Are there any increases in the number of profiles every month that depend on members’ income rates?
What are the rates of Starbucks members’ rewards every year?
What are the rates of events in the transcripts?
Which offer type is chosen most often by each gender?
What are the rates of completed promotions for each offer type?
What is the rate of completed offers for BOGO promotions?
What is the rate of completed offers for discount promotions?
Data Modeling:
In this part, we use several models, namely GaussianNB, DecisionTreeClassifier, LinearRegression, and KNeighborsClassifier, to find which one most accurately determines the best type of offer.
Before modelling, we took some preprocessing steps:
- One-hot encoding for the event and gender columns.
- Replacing offer types with numeric codes: 1 for BOGO, 2 for discount, 3 for informational.
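As a rough illustration of these two preprocessing steps, the sketch below uses pandas. The column names and the lowercase category labels are assumptions, and the three sample rows are made up.

```python
import pandas as pd

# Hypothetical slice of the merged table.
df = pd.DataFrame({
    "event": ["offer received", "offer viewed", "offer completed"],
    "gender": ["M", "F", "N/A"],
    "offer_type": ["bogo", "discount", "informational"],
})

# One-hot encode event and gender: each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["event", "gender"], dtype=int)

# Replace the offer type labels with the numeric codes used as the target.
offer_codes = {"bogo": 1, "discount": 2, "informational": 3}
encoded["offer_type"] = encoded["offer_type"].map(offer_codes)

print(encoded)
```

Note that the numeric codes impose an order on the offer types (1 < 2 < 3), which matters for metrics such as MSE that treat the codes as numbers.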
We found that GaussianNB was the best model, with an R2 of 61% and a mean squared error of 19%.
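A comparison of this kind could be set up with scikit-learn along the lines below. This is only a sketch: the features and labels are synthetic, the train/test split and default hyperparameters are assumptions, and the scores it prints are unrelated to the results reported here. MSE and R2 are computed on the numeric offer type codes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 numeric features and a target coded 1, 2 or 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
noisy_signal = X[:, 0] + rng.normal(scale=0.3, size=300)
y = np.digitize(noisy_signal, bins=[-0.5, 0.5]) + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "LinearRegression": LinearRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Fit each model and score its predictions with both metrics.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

# Pick the model with the lowest MSE on the held-out data.
best = min(scores, key=lambda n: scores[n]["mse"])
for name, s in scores.items():
    print(f"{name}: MSE={s['mse']:.3f}, R2={s['r2']:.3f}")
print("Lowest MSE:", best)
```

Scoring on a held-out test set rather than the training data gives a fairer picture of how each model would handle unseen customers.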
Conclusion:
In this project, we analyzed Starbucks customers. We started by cleaning and assessing the data, then visualized it to draw results from our analysis. We found that males make up the highest share of Starbucks customers. The adult age group, which has the highest income rates, is also the most likely to have a profile as a Starbucks member. Males also recorded the highest rate of promotion use, especially for BOGO and discount promotions. Finally, we found that the best model for choosing the best offer for customers is the GaussianNB classifier, with an R2 of 61% and a mean squared error of 19%. We evaluated the models using MSE and R2; the best model is the one with the lowest MSE and the highest R2.
To learn more about this analysis, take a look at this GitHub link.