Using K-means to segment customers based on RFM Variables

Jasneek Chugh
Web Mining [IS688, Spring 2021]
Apr 4, 2021 · 11 min read


Dividing customers into segments based on their RFM (Recency-Frequency-Monetary) scores, in Python

Customer Segmentation- RFM

Every person is different and so is their behavior as customers.

Imagine you are the owner of a shop. It doesn’t matter if you run an e-commerce site or a supermarket, and it doesn’t matter if it is a small shop or a big company such as Google or Microsoft: it’s always better to know your customers.

Coming from a business family background, I have always seen my father facing problems in targeting the right customers and building good relationships with them to optimize sales. During my summer vacations, I helped my father with our business, and I observed that every customer has different needs, interests, and spending habits. Some were regular, frequent buyers; some weren’t. I realised that it’s very difficult for businesses to personalize marketing campaigns for each customer.

My father has always believed in taking feedback, personalizing communication, providing responsive support, and educating customers with resources, in order to know them better and serve their respective needs. This has helped him retain customers, increase their lifetime value, and optimize sales.

Nowadays, newer techniques for segmenting customers have come up in the market. Customer segmentation is very important for a business when deciding what actions are needed to increase revenue, build good customer relationships, and optimize sales. I’ll show you how to do RFM analysis for analyzing customer value.

The main focus of this blog is to show how we can segment customers into different clusters using the K-means algorithm. But first, I will do an RFM analysis to derive the desired values; those features will then be used as input to K-means to determine similarity. K-means uses Euclidean distance as its distance metric to calculate the distance between data points.

I will show only the main calculations of the RFM analysis so that this blog doesn’t get too long.

You can also check out the full code on my GitHub repository here.

Let’s Begin!

To create a good customer segmentation, I will use the well-known Online Retail dataset from the UCI Machine Learning Repository. The dataset contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. It has 8 columns and 541,909 rows.

The dataset has the attributes InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. We will mainly make use of features like CustomerID, UnitPrice, Quantity, and InvoiceDate, and we will also need to engineer new features to calculate the RFM score. Those features will then be used to measure similarity and assign each customer to a cluster using K-means.

Importing Python Libraries and dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import plotly.offline as pyoff
import plotly.graph_objs as go
import datetime as dt
import feature_engine
from feature_engine.outliers import Winsorizer
import warnings
warnings.filterwarnings("ignore")

Data Preparation/Data Cleaning

First, we need to clean the data to make it into a more usable format.

Reading the data into a pandas DataFrame.

There are 8 columns and 541909 rows in our data.
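As a minimal sketch of the loading step (the file name is an assumption; the UCI dataset ships as an Excel file, so adjust the path to your own copy):

#Load the raw transactions into a pandas DataFrame
df = pd.read_excel('Online Retail.xlsx')
print(df.shape)  #expect (541909, 8)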

Description of the dataset

I created a custom function that returns the description, shape, and data types of our data.
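A sketch of such a helper (the function name and exact fields printed are my own choices):

def describe_data(df):
    #Shape, data types, missing-value counts and summary statistics in one place
    print('Shape:', df.shape)
    print(df.dtypes)
    print(df.isnull().sum())
    return df.describe()

describe_data(df)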

We can see that there are many null values in CustomerID, which is one of our main features. We need to remove these records, as there is no way to recover the missing customer IDs.

There are also a few records with UnitPrice < 0 or Quantity < 0. We need to remove them from the analysis.

Records with UnitPrice < 0
Records with Quantity < 0 (InvoiceNo starting with ‘C’ indicates a cancellation)
  • The minimum and maximum values of Quantity are -80,995 and 80,995; the negative values represent cancelled or returned orders.
  • UnitPrice also has a few negative values, which is unusual; these transactions represent cancelled orders or bad debt incurred by the business.
  • Bad debt adjustments will be dropped from the dataset, as they do not represent actual sales.

About 25% of the records have a null CustomerID.

We will be using the K-means algorithm later in this project, and K-means cannot deal with missing values. Therefore, we will drop the records with a missing CustomerID.

The proportion of customers by country

As customer clusters may vary by geography, we will restrict the data to United Kingdom customers only, who account for most of our historical data (91% of customers).

df = df[df.Country == 'United Kingdom']

Removing records with negative or zero values in UnitPrice and Quantity

df = df[df.Quantity > 0]
df = df[df.UnitPrice > 0]
#Removing the Null values from the data.
df = df[pd.notnull(df['CustomerID'])]

Cleaning the Date Column

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['InvoiceYearMonth'] = df['InvoiceDate'].map(lambda date: 100*date.year + date.month)
df['Date'] = df['InvoiceDate'].dt.strftime('%Y-%m')

Now, all our data is clean.

We have all the crucial information we need:
- Customer ID
- Unit Price
- Quantity
- Invoice Date

Exploratory Data Analysis

#Function to plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Orders', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.plot(x, y, color='tab:blue', marker='o')
    plt.show()

#Plotting (assumption: df_agg aggregates order counts by month; the original aggregation step is not shown)
df_agg = df.groupby('Date')['Quantity'].count().reset_index()
plot_df(df_agg, x=df_agg.Date, y=df_agg.Quantity, title='Orders in 2011')
Number of Orders throughout 2011

The number of orders started increasing from the third quarter, and the maximum number of orders occurred in November.

Calculating Revenue

For each invoice line, Revenue = Quantity × Unit Price.

df['Revenue'] = df['Quantity']*df['UnitPrice']

Monthly Revenue

To plot the monthly revenue:

#Assumption: aggregate revenue by month (the original aggregation step is not shown)
df_revenue = df.groupby(['InvoiceYearMonth'])['Revenue'].sum().reset_index()

plot_data = [
    go.Scatter(
        x=df_revenue['InvoiceYearMonth'],
        y=df_revenue['Revenue'],
        mode='lines+markers'
    )
]
plot_layout = go.Layout(
    xaxis={"type": "category"},
    title='Monthly Revenue'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

RFM ANALYSIS

“RFM” is a method used for analyzing customer value.

The beauty of RFM analysis is that it requires only basic features, which are present in most transactional datasets.

RFM stands for Recency (R), Frequency (F), and Monetary value (M). It groups customers based on their transaction history: recency is how recently a customer made a purchase, frequency is how often they purchase, and monetary value is how much they spend. [2]

To learn more about RFM analysis in detail, please refer to this blog.

Recency

The last invoice date is 2011-12-09; we will use this date to calculate Recency.

NOW = dt.date(2011,12,9) 
df['Date'] = pd.DatetimeIndex(df.InvoiceDate).date
Calculating Recency
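A minimal sketch of that calculation (recency_df is my own name for the intermediate result):

#Days since each customer's most recent purchase
recency_df = df.groupby('CustomerID')['Date'].max().reset_index()
recency_df.columns = ['CustomerID', 'LastPurchaseDate']
recency_df['Recency'] = recency_df['LastPurchaseDate'].apply(lambda x: (NOW - x).days)
recency_df = recency_df.drop(columns='LastPurchaseDate')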

Frequency — Monetary

Calculating Frequency and Monetary value for each customer
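A sketch of these aggregations, merged with recency into a single RFM table (fm_df and the named aggregations are my assumptions):

#Frequency: number of distinct invoices; Monetary: total revenue per customer
fm_df = df.groupby('CustomerID').agg(
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('Revenue', 'sum')
).reset_index()

#One row per customer with Recency, Frequency and Monetary columns
RFM_Table = recency_df.merge(fm_df, on='CustomerID').set_index('CustomerID')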

RFM Table

Now we split the metrics into segments using quantiles. We will assign a score from 1 to 4 to each of Recency, Frequency, and Monetary, where 1 is the best value and 4 is the worst. A final RFM score (Overall Value) is calculated simply by combining the individual RFM score numbers.

We will calculate quartile values for each of the R, F, and M values:

quantiles = RFM_Table.quantile(q=[0.25,0.50,0.75])
quantiles = quantiles.to_dict()

Here, I have created two functions, to calculate the Quantile scores for Recency, Frequency and Monetary.[4]
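Here is a sketch of those two functions, following the usual quartile-scoring pattern (the names RScore and FMScore are mine). They rank in opposite directions because a low Recency is good, while high Frequency and Monetary are good:

#Recency: lower is better, so the lowest quartile scores 1
def RScore(x, col, q):
    if x <= q[col][0.25]:
        return 1
    elif x <= q[col][0.50]:
        return 2
    elif x <= q[col][0.75]:
        return 3
    return 4

#Frequency/Monetary: higher is better, so the highest quartile scores 1
def FMScore(x, col, q):
    if x <= q[col][0.25]:
        return 4
    elif x <= q[col][0.50]:
        return 3
    elif x <= q[col][0.75]:
        return 2
    return 1

segmented_rfm = RFM_Table.copy()
segmented_rfm['R_quartile'] = segmented_rfm['Recency'].apply(RScore, args=('Recency', quantiles))
segmented_rfm['F_quartile'] = segmented_rfm['Frequency'].apply(FMScore, args=('Frequency', quantiles))
segmented_rfm['M_quartile'] = segmented_rfm['Monetary'].apply(FMScore, args=('Monetary', quantiles))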

The resulting table has our Recency, Frequency, and Monetary columns, along with the quartile score for each value. [1]

To finish this off, we just need to concatenate the three score columns.

segmented_rfm['RFM_Segment'] = segmented_rfm.R_quartile.map(str) + segmented_rfm.F_quartile.map(str) + segmented_rfm.M_quartile.map(str)
segmented_rfm['RFM_Score'] = segmented_rfm[['R_quartile','F_quartile','M_quartile']].sum(axis=1)

Here is our finished table.

RFM segmentation readily surfaces customer groups for any business, such as Best Customers, Loyal Customers, Customers on the Verge of Churning, and the Highest Revenue-Generating Customers.

We could further visualise the above values and go into more granularity to support customer-retention decisions, but that would be out of scope for this blog.

Time for Clustering!

The K-means clustering algorithm is an unsupervised machine learning algorithm that iteratively segments unlabeled data points into k distinct clusters, such that each data point belongs to only a single group with similar properties. Points within a cluster are more similar to each other than to points belonging to other clusters. Distance-based clustering groups the points into some number of clusters such that distances within a cluster are small while distances between clusters are large.

Steps in K-Means

  1. Choose the number of clusters, k, and initialise k centroids.
  2. Assign each data point to its nearest centroid.
  3. Recompute each centroid as the mean of the points assigned to it.
  4. Repeat steps 2 and 3 until the assignments stop changing (or a maximum number of iterations is reached).

K-means uses Euclidean distance as a distance metric to calculate the distance between each point and the centroid.
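For intuition, here is a toy NumPy sketch (the values are made up) of how one point is assigned to its nearest centroid:

#Euclidean distance from one point to each of two centroids
point = np.array([0.5, 1.2, -0.3])
centroids = np.array([[0.0, 1.0, 0.0],
                      [2.0, -1.0, 1.0]])
distances = np.linalg.norm(centroids - point, axis=1)
nearest = np.argmin(distances)  #index of the closest centroid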

K-means gives the best result under the following conditions:

  • The data’s distribution is not skewed.
  • The data is standardised (i.e. each variable has a mean of 0 and a standard deviation of 1).

Let’s find the skewness in our data

## Function to check skewness
def check_skew(df_skew, column):
    skew = stats.skew(df_skew[column])
    skewtest = stats.skewtest(df_skew[column])
    plt.title('Distribution of ' + column)
    sns.distplot(df_skew[column])
    print("{}: skew = {}, skewtest = {}".format(column, skew, skewtest))
    return

The data is highly skewed, so we will apply a log transformation to reduce the skewness of each variable. I added a small constant (1) because the log transformation requires all values to be positive.

#Removing skewness with a log transformation
#Assumption: df_rfm_log starts as a copy of the raw RFM table (this step is not shown above)
df_rfm_log = RFM_Table[['Recency', 'Frequency', 'Monetary']].copy()
df_rfm_log = np.log(df_rfm_log + 1)
plt.figure(figsize=(9, 9))
plt.subplot(3, 1, 1)
check_skew(df_rfm_log,'Recency')
plt.subplot(3, 1, 2)
check_skew(df_rfm_log,'Frequency')
plt.subplot(3, 1, 3)
check_skew(df_rfm_log,'Monetary')
plt.tight_layout()

Once the skewness is reduced, I standardised the data by centring and scaling. Note that all the variables now have a mean of 0 and a standard deviation of 1.

scaler = StandardScaler()
scaler.fit(df_rfm_log)
RFM_Table_scaled = scaler.transform(df_rfm_log)

Why is scaling important? The location of each data point is determined by all the information associated with that customer. If the features are not on the same scale, K-means may not form meaningful clusters.

Finding the optimal Number of Clusters!

A different number of clusters can lead to completely different results, so it’s important to find the optimal number of clusters for our analysis. To do this, we will apply the Elbow Method: plot inertia (the within-cluster sum of squared distances) against the number of clusters and look for the “elbow” where adding more clusters stops yielding large improvements. The code snippet and inertia graph are as follows:

from scipy.spatial.distance import cdist

distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    #Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(RFM_Table_scaled)

    #Distortion: mean distance of each point to its nearest centroid
    distortions.append(sum(np.min(cdist(RFM_Table_scaled, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / RFM_Table_scaled.shape[0])
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortions[-1]
    mapping2[k] = kmeanModel.inertia_

#Plotting inertia against k to locate the elbow
plt.plot(K, inertias, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('The Elbow Method')
plt.show()
Elbow Method

Here it looks like 3 is the optimal number of clusters. Based on business requirements, we can go ahead with fewer or more clusters. We will try our analysis with 3, 4, and 5 clusters.

def kmeans(normalised_df_rfm, clusters_number, original_df_rfm):
    kmeans = KMeans(n_clusters=clusters_number, random_state=1)
    kmeans.fit(normalised_df_rfm)

    # Extract cluster labels
    cluster_labels = kmeans.labels_

    # Create a cluster label column in the original dataset
    df_new = original_df_rfm.assign(Cluster=cluster_labels)

    # Initialise t-SNE and flatten the normalised features to two dimensions
    model = TSNE(random_state=1)
    transformed = model.fit_transform(normalised_df_rfm)

    # Plot t-SNE
    plt.title('Flattened Graph of {} Clusters'.format(clusters_number))
    sns.scatterplot(x=transformed[:,0], y=transformed[:,1], hue=cluster_labels, style=cluster_labels, palette="Set1")

    return df_new

plt.figure(figsize=(10, 10))
plt.subplot(3, 1, 1)
df_rfm_k3 = kmeans(RFM_Table_scaled, 3, RFM_Table)
plt.subplot(3, 1, 2)
df_rfm_k4 = kmeans(RFM_Table_scaled, 4, RFM_Table)
plt.subplot(3, 1, 3)
df_rfm_k5 = kmeans(RFM_Table_scaled, 5, RFM_Table)
plt.tight_layout()

The image above is obtained by flattening the three-dimensional data (Recency, Frequency, and Monetary) into two dimensions for ease of visualisation.

The technique for flattening high-dimensional data and visualising it in two dimensions is known as t-Distributed Stochastic Neighbor Embedding (t-SNE). You can read more about it if you are interested.

Now we will draw a snake plot to build personas for each cluster of the segmentation. It’s commonly used in the marketing industry for customer segmentation.

def snake_plot(normalised_df_rfm, df_rfm_kmeans, df_rfm_original):
    normalised_df_rfm = pd.DataFrame(normalised_df_rfm,
                                     index=RFM_Table.index,
                                     columns=RFM_Table.columns)
    normalised_df_rfm['Cluster'] = df_rfm_kmeans['Cluster']

    # Melt the data into long format
    df_melt = pd.melt(normalised_df_rfm.reset_index(),
                      id_vars=['CustomerID', 'Cluster'],
                      value_vars=['Recency', 'Frequency', 'Monetary'],
                      var_name='Metric',
                      value_name='Value')

    plt.xlabel('Metric')
    plt.ylabel('Value')
    sns.pointplot(data=df_melt, x='Metric', y='Value', hue='Cluster')
    return
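A usage sketch for the four-cluster solution (my own call, assuming the variables defined above):

snake_plot(RFM_Table_scaled, df_rfm_k4, RFM_Table)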

From the flattened graphs and the snake plots, it is evident that k = 4 segments our customers well. We could also go for a higher number of clusters; it depends entirely on how the company wants to segment its customers.

Summarizing my findings (clusters)

Interpretation of the clusters formed using k-means.

def rfm_values(df):
    df_new = df.groupby(['Cluster']).agg({
        'Recency': 'mean',
        'Frequency': 'mean',
        'Monetary': ['mean', 'count']
    }).round(0)
    return df_new
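For example, for the four-cluster solution (my own call):

rfm_values(df_rfm_k4)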
Summary of each cluster

What does each cluster represent?

  • The first cluster is the “Best Customers” segment we saw earlier: they purchased recently (R=1), buy frequently (F=1), and spend the most (M=1).
  • Customers in the second cluster can be interpreted as passers-by: their last purchase was long ago (R=4), they purchased very little (F=4), and they spent little (M=4). The company has to come up with new strategies to turn them into regulars.
  • The third cluster corresponds to the “Almost Lost” segment: they haven’t purchased for some time (R=3) but used to purchase frequently and spent a lot.
  • The last cluster consists of very loyal customers who also spend a lot.

LIMITATIONS and BUGS

  1. One of the major limitations of the above analysis is that the dataset covers only one year; it would have been better to have several years of data. For our analysis it was enough, but to apply this in a real-world scenario, more historical data is recommended.
  2. The data was highly skewed, and K-means is sensitive to such distributions. When I first applied K-means without removing the skewness, all the clusters were mixed together and didn’t separate properly. Therefore, I needed to apply log transformations to reduce the skewness of each variable.

CONCLUSIONS

To conclude, we saw how we can segment our customers depending on our business requirements. You can perform RFM for your entire customer base or just a subset. For example, you might first segment customers by geographical area or other demographics, and then by RFM for historical, transaction-based behaviour segments.

RFM analysis can help answer many questions about a company’s customers, and this can help companies build marketing strategies, retain slipping customers, and provide recommendations based on customer interests.

We used the K-means algorithm to segment our customers into clusters of similar behaviour. I think K-means did a pretty good job here.
