Ch 4 – Clustering Examples

In this example we explain the methods for profit-driven clustering, following the book. We show how to plot a histogram, design value-based cuts, run a K-Means-style clustering, and train a self-organizing map (SOM) using profit-related variables.

Preparations

We start by loading the packages and the example data. The ‘kohonen’ package provides a simple interface to SOMs, and we use ggplot2 together with its (sadly deprecated) ggfortify companion to plot the clusters.

library(ggplot2)
library(reshape2)
library(plyr)
library(stargazer)
library(kohonen)
library(ggfortify)
library(cluster)

# Set seed. Change to rotate images.
seed <- 19853857.3333333

# Load the example data
load('CLV_book.Rdata')

We can now plot the histogram of the distribution of the CLV.

# Multigroup histogram. Plotted on the density scale, so the
# y = 0.00017 annotations added below are visible.
hist.groups <- ggplot(CLV.example, aes(x=CLV, y=after_stat(density), fill=groups)) + geom_histogram(alpha=0.2, position="identity", bins = 70)
hist.groups

Single variable (CLV-based) cuts.

Now we make some cuts. As the book argues, the five-group strategy works best, but we might create a three-cut strategy if it suits our purposes better.

# Cut option 1: In clear segments.
hist.cut3 <- hist.groups +  geom_vline(xintercept = 6500, linetype = "longdash") +
                     geom_vline(xintercept = -6500, linetype = "longdash")
hist.cut3 <- hist.cut3 + annotate("text", x = 12500, y = 0.00017, label = "Gold")
hist.cut3 <- hist.cut3 + annotate("text", x = 500, y = 0.00017, label = "Silver")
hist.cut3 <- hist.cut3 + annotate("text", x = -10000, y = 0.00017, label = "Bronze")
hist.cut3
# ggsave("Hist_Cuts3Groups.png")  # Uncomment to save image.

# Cut option 2: In 5 segments.
hist.cut5 <- hist.groups +  geom_vline(xintercept = 6500, linetype = "longdash") +
                     geom_vline(xintercept = -6500, linetype = "longdash") + 
                     geom_vline(xintercept = -1000, linetype = "longdash") + 
                     geom_vline(xintercept = 13100, linetype = "longdash")

hist.cut5 <- hist.cut5 + annotate("text", x = 17000, y = 0.00017, label = "Platinum")
hist.cut5 <- hist.cut5 + annotate("text", x = 10000, y = 0.00017, label = "Gold")
hist.cut5 <- hist.cut5 + annotate("text", x = 2500, y = 0.00017, label = "Silver")
hist.cut5 <- hist.cut5 + annotate("text", x = -3500, y = 0.00017, label = "Bronze")
hist.cut5 <- hist.cut5 + annotate("text", x = -10000, y = 0.00017, label = "Lead")
hist.cut5
# ggsave("Hist_Cuts5Groups.png") # Uncomment to save image.

# Create Variables of the groups
CLV.example$ThreeCuts <- cut(CLV.example$CLV, c(-Inf,-6500,6500, Inf), labels=c("Bronze", "Silver", "Gold"))
CLV.example$FiveCuts <- cut(CLV.example$CLV, c(-Inf,-6500,-1000,6500,13100, Inf), labels=c("Lead", "Bronze", "Silver", "Gold", "Platinum"))
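The segment variables rely on how `cut()` assigns values to intervals: with `(-Inf, …, Inf)` breaks the whole real line is covered, and intervals are right-closed by default, so a value exactly on a breakpoint falls in the lower segment. A quick check on toy CLV values:

```r
# cut() intervals are right-closed by default: a CLV of exactly
# 6500 falls in "Silver", while 6501 falls in "Gold".
clv <- c(-9000, -6500, 0, 6500, 6501, 14000)
seg <- cut(clv, c(-Inf, -6500, 6500, Inf),
           labels = c("Bronze", "Silver", "Gold"))
as.character(seg)
# "Bronze" "Bronze" "Silver" "Silver" "Gold" "Gold"
```

If the boundary values should instead belong to the upper segment, pass `right = FALSE`.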

# Create Table with the Groups
ddply(CLV.example, .(ThreeCuts), summarise, Mean=mean(CLV), count=length(CLV))
##   ThreeCuts      Mean count
## 1    Bronze -9554.939    38
## 2    Silver  1260.077   486
## 3      Gold 11548.066   226
ddply(CLV.example, .(FiveCuts), summarise, Mean=mean(CLV), count=length(CLV))
##   FiveCuts      Mean count
## 1     Lead -9554.939    38
## 2   Bronze -2828.781   103
## 3   Silver  2359.692   383
## 4     Gold  9989.067   155
## 5 Platinum 14951.516    71
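The `ddply` summaries above depend on plyr, which is superseded. The same per-segment table can be produced with base R’s `aggregate()`; a sketch on toy data (the data frame here is illustrative, not the book’s):

```r
# Base-R equivalent of the ddply summary, on toy data
df <- data.frame(CLV = c(-9000, -7000, 0, 200, 7000, 9000))
df$ThreeCuts <- cut(df$CLV, c(-Inf, -6500, 6500, Inf),
                    labels = c("Bronze", "Silver", "Gold"))

# One row per segment, with mean CLV and count
seg.table <- aggregate(CLV ~ ThreeCuts, data = df,
                       FUN = function(x) c(Mean = mean(x), count = length(x)))
seg.table
```

Here `aggregate()` returns the `Mean`/`count` pair as a matrix column named `CLV`; dplyr’s `group_by()` + `summarise()` is the modern tidyverse alternative.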

Cluster Analysis: K-Means

A second step uses K-Means-style clustering to create the clusters. Note that the code below calls clara() from the ‘cluster’ package, a k-medoids method (a robust relative of K-Means suited to larger samples). We create 3, 4, and 5 groups and plot them with the ‘ggfortify’ package, overlaying the segmentation from the previous step.

The plots are generated using a PCA projection, with only the first two components shown. This example shows that a five-cluster segmentation is the better choice, as explained in the book.
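What autoplot does under the hood can be reproduced by hand: run PCA on the inputs and plot the first two principal components coloured by cluster assignment. A minimal base-R sketch on synthetic data (the two-group matrix below is an assumption for illustration, not the book’s data):

```r
set.seed(1)
# Synthetic stand-in for the four profit-related inputs
X <- matrix(rnorm(200 * 4), ncol = 4)
X[1:100, ] <- X[1:100, ] + 3          # shift half the rows: two loose groups

km <- kmeans(X, centers = 2, nstart = 25)   # cluster in the full 4-D space
pc <- prcomp(X, scale. = TRUE)              # PCA for the 2-D projection

# Plot the first two components, coloured by cluster
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

The clustering happens in the full four-dimensional space; the PCA projection is only used for display, so overlapping clusters in the plot need not overlap in the original space.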

set.seed(seed)

source('scale01.R')
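The file `scale01.R` is not shown here; from its use below it is presumably a min-max scaler mapping each column to [0, 1]. A hypothetical sketch of such a function (an assumption, not the book’s actual file):

```r
# Hypothetical min-max scaler; the book's scale01.R may differ.
# Maps each column of a data frame to the [0, 1] range.
scale01 <- function(df) {
  as.data.frame(lapply(df, function(x) {
    (x - min(x)) / (max(x) - min(x))
  }))
}

scale01(data.frame(a = c(0, 5, 10)))$a
# column 'a' becomes 0.0, 0.5, 1.0
```

Scaling matters here because clara() and som() are distance-based: without it, variables with large ranges would dominate the clustering.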

# 4 Clusters
CLV.clara <- clara(scale01(CLV.example[,2:5]), 4)
clara.plot <- autoplot(CLV.clara, frame = TRUE, frame.type = 'norm')
clara.plot <- clara.plot + geom_point(aes(shape=CLV.example$ThreeCuts)) + scale_shape_discrete(name="Customer Segment", solid = FALSE) 
clara.plot <- clara.plot + scale_fill_brewer(guide='none', type = 'seq') + theme_bw()
clara.plot
# ggsave("Clustering_4G.png")  # Uncomment to save image.

# 3 Clusters
CLV.clara <- clara(scale01(CLV.example[,2:5]), 3)
clara.plot <- autoplot(CLV.clara, frame = TRUE, frame.type = 'norm')
clara.plot <- clara.plot + geom_point(aes(shape=CLV.example$ThreeCuts)) + scale_shape_discrete(name="Customer Segment", solid = FALSE) 
clara.plot <- clara.plot + scale_fill_brewer(guide='none') + theme_bw()
clara.plot
# ggsave("Clustering_3G.png")  # Uncomment to save image.

# 5 Clusters
CLV.clara <- clara(scale01(CLV.example[,2:5]), 5)
clara.plot <- autoplot(CLV.clara, frame = TRUE, frame.type = 'norm')
clara.plot <- clara.plot + geom_point(aes(shape=CLV.example$FiveCuts)) + scale_shape_discrete(name="Customer Segment", solid = FALSE) 
clara.plot <- clara.plot + scale_fill_brewer(guide='none') + theme_bw()
clara.plot
# ggsave("Clustering_5G.png")  # Uncomment to save image.

# Table with medoids
as.data.frame(CLV.clara$medoids)
##     Recency Frequency        MV       CoS
## 1 0.1128205 0.8751194 0.1570665 0.1588122
## 2 0.3114221 0.1832106 0.8717910 0.7940841
## 3 0.4999409 0.1744580 0.8384154 0.7576167
## 4 0.8491001 0.1415940 0.5917979 0.8490926
## 5 0.8783622 0.4724076 0.1777525 0.2639661

Cluster Analysis: SOM

We now use a SOM to create the clusters. This is an excellent methodology for clustering very large datasets: we first reduce the data to a map of prototypes, and then cluster the prototypes themselves. This is much cheaper computationally.
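The two-stage idea can be sketched in base R, using K-Means centroids as a stand-in for the SOM codebook (a simplification of the SOM used below, for illustration only): compress many observations into a few prototypes, cluster the prototypes, and let each observation inherit the cluster of its nearest prototype.

```r
set.seed(1)
X <- matrix(rnorm(2000 * 4), ncol = 4)   # synthetic stand-in data

# Stage 1: compress 2000 points into 25 prototypes
# (K-Means centroids standing in for SOM codebook vectors).
proto <- kmeans(X, centers = 25, nstart = 10)

# Stage 2: hierarchical clustering on the 25 prototypes only.
hc <- hclust(dist(proto$centers), method = "complete")
proto.cluster <- cutree(hc, k = 5)

# Each observation inherits the cluster of its prototype.
obs.cluster <- proto.cluster[proto$cluster]
table(obs.cluster)
```

Stage 2 operates on 25 rows instead of 2000, which is where the computational savings come from; the SOM adds a topology-preserving grid on top of this idea.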

With few nodes a SOM is nearly indistinguishable from K-Means, so here we train a 5×5 hexagonal grid of 25 prototype nodes.

set.seed(seed)

# SOM parameters
som_grid <- somgrid(xdim = 5, ydim=5, topo="hexagonal")

# Fix SOM Size
sizeSOM <- 5

# Train
CLV.som <- som(as.matrix(scale01(CLV.example[,2:5])), grid=som_grid, rlen = 200, 
               alpha = c(0.05, 0.01) )

palette.som <- heat.colors


# Count per cluster. Uncomment first and third line to save image.
# png(file = paste0("CountSOM",sizeSOM,"x",sizeSOM,".png"), height = 20, width = 20, unit = "cm", res = 300)
plot(CLV.som, type="count", palette.name = palette.som, main = "")
# dev.off()

Some diagnostic plots allow us to see the topology of the map.

# Distance between neighbours
plot(CLV.som, type="dist.neighbours", palette.name = palette.som, main = "Distance Heatmap")

# Variable heatmaps (one per codebook column)
plot(CLV.som, type = "property", property = CLV.som$codes[[1]][,1],
     palette.name=palette.som, main="Recency Heatmap")

plot(CLV.som, type = "property", property = CLV.som$codes[[1]][,2],
     palette.name=palette.som, main="Frequency Heatmap")

plot(CLV.som, type = "property", property = CLV.som$codes[[1]][,3],
     palette.name=palette.som, main="Monetary Value Heatmap")

plot(CLV.som, type = "property", property = CLV.som$codes[[1]][,4],
     palette.name=palette.som, main="Cost of Service Heatmap")

The distance plot shows how the prototypes sit in the input space. Where neighbouring neurons are tightly packed the map is locally homogeneous; where they are much farther apart, the map hints at better-defined cluster boundaries. The second step is to plot the heatmap of each variable as it appears in the neuron prototypes. These help us determine what the clusters are composed of. Please see the book for a detailed explanation of each heatmap.

The final step is to build clusters from the prototype neurons. Since the sample is now small (25 prototypes), we can use hierarchical clustering to identify which neurons form a cluster. We have already decided on five groups, so we keep that choice, but an analysis of the dendrogram below, or of the previous distance plot, should lead to the same conclusion.

set.seed(seed)

# Codebook
plot(CLV.som, type="codes", palette.name = palette.som, main = "")

## use hierarchical clustering to cluster the codebook vectors
raw.hc <- hclust(dist(CLV.som$codes[[1]]), method="complete")
som.hc <- cutree(raw.hc, 5)
dend.hc <- as.dendrogram(raw.hc)

# Dendrogram
plot(dend.hc, type = "triangle")

# Codebook with clusters overlaid.
plot(CLV.som, type="codes", palette.name = palette.som, main = "")
add.cluster.boundaries(CLV.som, som.hc)

With this we can replicate the analysis done in the book. We have studied the main profit source for this particular problem (CLV), but the process carries over to any other profit-related data.

Note: To recover the exact results from the book, change the starting seed. The results are equivalent up to rotation.