DATA SCIENCE-ZING

5 Clustering methods in R

2/15/2018

Clustering groups a set of objects so that the objects within each cluster have similar characteristics. R offers many clustering methods, and each one groups objects according to different criteria. The five methods discussed in this article are:
  1. K-means
  2. K-medoids/PAM Clustering
  3. Hierarchical Clustering
  4. Hierarchical K-means Clustering
  5. Model Based Clustering

Clustering Methods in R

Data for clustering

library(rattle)

#Input wine data from rattle package
data(wine,package = 'rattle')

head(wine)

##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
## 4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
## 5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
## 6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185
## 4          0.24            2.18  7.80 0.86     3.45    1480
## 5          0.39            1.82  4.32 1.04     2.93     735
## 6          0.34            1.97  6.75 1.05     2.85    1450

#Standardize the variables to mean 0 and standard deviation 1, dropping column 1 (Type)
wine_scaled<-scale(wine[-1])
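As a quick check of what scale() does, the sketch below standardizes a small numeric data set and confirms that every column ends up with mean 0 and standard deviation 1. The built-in iris measurements are a stand-in (an assumption, chosen only so the snippet runs without the rattle package); the same check should apply to wine_scaled.

```r
# Stand-in data (assumption): four numeric columns from the built-in iris data
x <- iris[, 1:4]

# scale() subtracts each column mean and divides by the column standard deviation
x_scaled <- scale(x)

# After scaling, column means are ~0 and column standard deviations are 1
col_means <- colMeans(x_scaled)
col_sds   <- apply(x_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Standardizing matters here because k-means and the other distance-based methods would otherwise be dominated by variables measured on large scales, such as Proline.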

Determine the optimal number of clusters

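# Plot the within-groups sum of squares for 1 to nc clusters (elbow method)
wssplot <- function(data, nc=15, seed=1234){
  # With a single cluster, the within-groups sum of squares is the total sum of squares
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}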

wssplot(wine_scaled, nc=6) 

[Plot: within-groups sum of squares against number of clusters, 1 to 6]
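Because k-means starts from random centers, a single run per k can make the elbow curve a little noisy. A hedged variant of the idea above is sketched below: it uses the nstart argument of kmeans() to try several random starts per k and keeps the best one. The helper name wss_by_k and the iris stand-in data are assumptions for illustration, not part of the original code.

```r
# Stand-in data (assumption): scaled iris measurements instead of wine_scaled
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..max_k.
# nstart > 1 reruns k-means from several random starts and keeps the best fit.
wss_by_k <- function(data, max_k = 6, nstart = 25, seed = 1234) {
  set.seed(seed)
  sapply(seq_len(max_k), function(k) {
    kmeans(data, centers = k, nstart = nstart)$tot.withinss
  })
}

wss <- wss_by_k(x, max_k = 6)

# For k = 1 the value equals the total sum of squares used in wssplot()
total_ss <- (nrow(x) - 1) * sum(apply(x, 2, var))
print(all.equal(wss[1], total_ss))
print(round(wss, 2))
```

The resulting vector can be passed to plot(seq_along(wss), wss, type = "b") to draw the same kind of elbow plot; with the article's data the call would be wss_by_k(wine_scaled).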

Different methods of Cluster analysis in R

1) K-Means

kmeans_fit<-kmeans(wine_scaled,3) # k = 3, the number of clusters chosen from the plot above

#Components of the fitted kmeans object
attributes(kmeans_fit)

## $names
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## $class
## [1] "kmeans"
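Two properties of a kmeans fit are worth checking: the sums of squares listed among the components decompose as totss = tot.withinss + betweenss, and the result depends on the random starting centers unless a seed is set. The sketch below illustrates both on stand-in data (scaled iris, an assumption so the snippet runs on its own); the seed value and nstart = 25 are arbitrary choices.

```r
# Stand-in data (assumption): scaled iris measurements instead of wine_scaled
x <- scale(iris[, 1:4])

# Fix the seed and use several random starts so the fit is reproducible and stable
set.seed(1234)
fit <- kmeans(x, centers = 3, nstart = 25)

# Total sum of squares splits into within-cluster and between-cluster parts
print(all.equal(fit$totss, fit$tot.withinss + fit$betweenss))

# Share of the total variation explained by the clustering
explained <- fit$betweenss / fit$totss
print(round(explained, 3))

# Re-running with the same seed reproduces the same partition
set.seed(1234)
fit_again <- kmeans(x, centers = 3, nstart = 25)
print(identical(fit$cluster, fit_again$cluster))
```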

#Centroids of kmeans_fit
kmeans_fit$centers

##      Alcohol      Malic        Ash Alcalinity   Magnesium     Phenols
## 1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
## 2  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
## 3 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
##    Flavanoids Nonflavanoids Proanthocyanins      Color        Hue
## 1  0.97506900   -0.56050853      0.57865427  0.1705823  0.4726504
## 2 -1.21182921    0.72402116     -0.77751312  0.9388902 -1.1615122
## 3  0.02075402   -0.03343924      0.05810161 -0.8993770  0.4605046
##     Dilution    Proline
## 1  0.7770551  1.1220202
## 2 -1.2887761 -0.4059428
## 3  0.2700025 -0.7517257

#Mean of each scaled variable within each cluster (matches the centroids above)
aggregate(wine_scaled,by=list(kmeans_fit$cluster),FUN=mean)

##   Group.1    Alcohol      Malic        Ash Alcalinity   Magnesium
## 1       1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208
## 2       2  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047
## 3       3 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869
##       Phenols  Flavanoids Nonflavanoids Proanthocyanins      Color
## 1  0.88274724  0.97506900   -0.56050853      0.57865427  0.1705823
## 2 -0.97657548 -1.21182921    0.72402116     -0.77751312  0.9388902
## 3 -0.07576891  0.02075402   -0.03343924      0.05810161 -0.8993770
##          Hue   Dilution    Proline
## 1  0.4726504  0.7770551  1.1220202
## 2 -1.1615122 -1.2887761 -0.4059428
## 3  0.4605046  0.2700025 -0.7517257

#Append Cluster to data

wine<-data.frame(wine,Cluster=kmeans_fit$cluster)
head(wine)

##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
## 4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
## 5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
## 6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline Cluster
## 1          0.28            2.29  5.64 1.04     3.92    1065       1
## 2          0.26            1.28  4.38 1.05     3.40    1050       1
## 3          0.30            2.81  5.68 1.03     3.17    1185       1
## 4          0.24            2.18  7.80 0.86     3.45    1480       1
## 5          0.39            1.82  4.32 1.04     2.93     735       1
## 6          0.34            1.97  6.75 1.05     2.85    1450       1

#Size of each cluster
kmeans_fit$size

## [1] 62 51 65

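#Compare the clusters with the known wine Type using a confusion matrix.
#Cluster numbers are arbitrary labels: in the output below, cluster 2 lines up
#with Type 3 and cluster 3 with Type 2, so the reported accuracy (0.3483)
#understates the agreement. 172 of 178 wines (about 97%) fall in the cluster
#that corresponds to their Type.
confusion_matrix_table<-table(wine[,1],kmeans_fit$cluster)
library(caret)
confusionMatrix(confusion_matrix_table)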

## Confusion Matrix and Statistics
## 
##    
##      1  2  3
##   1 59  0  0
##   2  3  3 65
##   3  0 48  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3483          
##                  95% CI : (0.2786, 0.4232)
##     No Information Rate : 0.3652          
##     P-Value [Acc > NIR] : 0.7053          
##                                           
##                   Kappa : 0.0299          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9516  0.05882   0.0000
## Specificity            1.0000  0.46457   0.5752
## Pos Pred Value         1.0000  0.04225   0.0000
## Neg Pred Value         0.9748  0.55140   0.5000
## Prevalence             0.3483  0.28652   0.3652
## Detection Rate         0.3315  0.01685   0.0000
## Detection Prevalence   0.3315  0.39888   0.2697
## Balanced Accuracy      0.9758  0.26170   0.2876
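The accuracy that caret reports assumes cluster 1 means Type 1, cluster 2 means Type 2, and so on, but k-means numbers its clusters arbitrarily. One simple way around this is to map each cluster to the Type that contributes most of its members before counting matches. The sketch below does that with the counts copied from the confusion matrix above; the variable names are illustrative, and majority-vote matching is only one possible approach (it can map two clusters to the same Type when clusters are less clean than here).

```r
# Counts copied from the confusion matrix above: rows = Type, columns = cluster
tab <- matrix(c(59,  0,  0,
                 3,  3, 65,
                 0, 48,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(Type = 1:3, Cluster = 1:3))

# For each cluster (column), find the Type (row) with the most members
best_type <- apply(tab, 2, which.max)
print(best_type)   # cluster 1 -> Type 1, cluster 2 -> Type 3, cluster 3 -> Type 2

# Count the wines that sit in the cluster matched to their Type
matched  <- sum(tab[cbind(best_type, seq_len(ncol(tab)))])
accuracy <- matched / sum(tab)
print(matched)             # 172
print(round(accuracy, 4))  # 0.9663
```

With the fitted objects from the article, the same steps would start from tab <- table(wine$Type, kmeans_fit$cluster).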

#Cluster visualization

#1) Cluster Plot
library(cluster)
clusplot(wine_scaled, kmeans_fit$cluster, main='2D representation of the Cluster solution',color=TRUE, shade=TRUE,labels=2, lines=0)

[Plot: 2D representation of the cluster solution from clusplot()]

#2)Using package factoextra
library(factoextra)

fviz_cluster(kmeans_fit,data = wine_scaled,ellipse.type = "convex",palette="jco",ggtheme = theme_minimal())

[Plot: k-means clusters from fviz_cluster()]
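Both plots above draw 13-dimensional data in two dimensions by projecting the observations onto the first two principal components. A minimal base R sketch of that projection is shown below, using prcomp() and scaled iris data as a stand-in (an assumption so the snippet is self-contained); it is meant to show the idea, not to reproduce either package's plotting code.

```r
# Stand-in data (assumption): scaled iris measurements instead of wine_scaled
x <- scale(iris[, 1:4])

set.seed(1234)
fit <- kmeans(x, centers = 3, nstart = 25)

# Principal component analysis; the data are already centered and scaled
pca <- prcomp(x)

# Coordinates of every observation on the first two components
scores <- pca$x[, 1:2]

# Proportion of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 3))

# Scatter plot of the projection, colored by k-means cluster
plot(scores, col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Clusters on the first two principal components")
```

The smaller the share of variance captured by the first two components, the more cautious one should be about reading cluster overlap from the 2D picture.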

2) K-medoids/PAM Clustering

pam_fit<-pam(wine_scaled,3) # pam() comes from the cluster package loaded above

#Visualization using factoextra
fviz_cluster(pam_fit,data = wine_scaled,ellipse.type = "convex",palette="jco",ggtheme = theme_minimal())

[Plot: PAM clusters from fviz_cluster()]
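What distinguishes PAM from k-means is that each cluster center (medoid) is an actual observation rather than an average. The sketch below inspects a pam() fit to show this, again on scaled iris data as a stand-in (an assumption); id.med, medoids, clustering, and silinfo are components of the object returned by cluster::pam().

```r
library(cluster)  # provides pam()

# Stand-in data (assumption): scaled iris measurements instead of wine_scaled
x <- scale(iris[, 1:4])

pam_fit <- pam(x, k = 3)

# Row indices of the observations chosen as medoids
print(pam_fit$id.med)

# The medoid coordinates are exactly those rows of the data
medoid_rows <- x[pam_fit$id.med, , drop = FALSE]
print(all.equal(pam_fit$medoids, medoid_rows, check.attributes = FALSE))

# Cluster sizes and the average silhouette width (closer to 1 is better)
cluster_sizes <- table(pam_fit$clustering)
avg_sil <- pam_fit$silinfo$avg.width
print(cluster_sizes)
print(round(avg_sil, 3))
```

Because medoids are real data points, PAM tends to be less sensitive to outliers than k-means, at the cost of more computation on large data sets.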

3) Hierarchical Clustering

#Euclidean Distance Matrix
distance<-dist(wine_scaled,method = "euclidean")

#Model for hierarchical clustering
hierach_fit<-hclust(distance,method = "ward.D2")

# Simple Plot
plot(hierach_fit)

# Rectangular plot with groups 
groups<-cutree(hierach_fit,k=3) # Cutting the dendrogram into 3 groups

rect.hclust(hierach_fit,k=3,border = "red")

[Plot: dendrogram with red rectangles around the 3 groups]

#Visualize using factoextra

# Cut in 4 groups and color by groups
fviz_dend(hierach_fit, k = 4, # Cut in four groups
          cex = 0.5, # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
          )

[Plot: dendrogram from fviz_dend() cut into 4 colored groups]
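The dendrogram is cut into 3 groups with cutree() and into 4 groups for the colored plot. Since both cuts come from the same tree, the 4-group solution is just the 3-group solution with one group split in two. The sketch below shows how cutree() can return several cuts at once and how to cross-tabulate them; it uses scaled iris data as a stand-in (an assumption) with the same distance and linkage settings as above.

```r
# Stand-in data (assumption): scaled iris measurements instead of wine_scaled
x <- scale(iris[, 1:4])

hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

# Passing a vector of k values returns a matrix with one column per cut
memberships <- cutree(hc, k = 3:4)

sizes_k3 <- table(memberships[, "3"])
sizes_k4 <- table(memberships[, "4"])
print(sizes_k3)
print(sizes_k4)

# Rows = groups at k = 3, columns = groups at k = 4.
# Each column has a single non-zero entry because the cuts are nested.
cross <- table(k3 = memberships[, "3"], k4 = memberships[, "4"])
print(cross)
```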

#Evaluate the accuracy of the model with a confusion matrix (3 groups from cutree)

table(wine[,1],groups)

##    groups
##      1  2  3
##   1 59  0  0
##   2  5 58  8
##   3  0  0 48
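In this table the group numbers happen to line up with the wine Type, so the diagonal holds the correctly grouped wines. A small sketch of the resulting agreement rate, using the counts copied from the table above (variable names are illustrative):

```r
# Counts copied from the table above: rows = Type, columns = hierarchical group
tab <- matrix(c(59,  0,  0,
                 5, 58,  8,
                 0,  0, 48),
              nrow = 3, byrow = TRUE,
              dimnames = list(Type = 1:3, groups = 1:3))

# Overall agreement: wines on the diagonal divided by all wines
agreement <- sum(diag(tab)) / sum(tab)
print(round(agreement, 4))   # 0.927 (165 of 178)

# Share of each Type that lands in its matching group
per_type <- diag(tab) / rowSums(tab)
print(round(per_type, 3))
```

All of the disagreement comes from Type 2, where 13 of the 71 wines are placed in the other two groups.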

4) Hierarchical K-means Clustering

hierach_kmeans_fit<-hkmeans(wine_scaled,4) # hkmeans() comes from factoextra; 4 clusters requested

hierach_kmeans_fit

## Hierarchical K-means clustering with 4 clusters of sizes 59, 36, 51, 32
## 
## Cluster means:
##      Alcohol      Malic        Ash Alcalinity   Magnesium    Phenols
## 1  0.9028300 -0.2969818  0.2925626 -0.6927815  0.55567318  0.8917585
## 2 -0.8903875 -0.4346029 -1.1132218 -0.3153936 -0.50507989  0.2266227
## 3  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.9765755
## 4 -0.9249888 -0.3486324  0.4159309  0.7987738 -0.33636118 -0.3427131
##    Flavanoids Nonflavanoids Proanthocyanins      Color        Hue
## 1  0.95147729    -0.6073179       0.5966082  0.1931051  0.4694211
## 2  0.22876921    -0.6019082       0.3279625 -0.8411535  0.3927557
## 3 -1.21182921     0.7240212      -0.7775131  0.9388902 -1.1615122
## 4 -0.08029881     0.6429805      -0.2297927 -0.9060961  0.5438146
##      Dilution    Proline
## 1  0.77491047  1.1750720
## 2  0.47415842 -0.7144213
## 3 -1.28877614 -0.4059428
## 4  0.09181759 -0.7158436
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 4 3 2 2 4 4 2 2 4 2
##  [71] 4 4 4 1 2 2 2 4 2 4 2 2 4 3 2 2 4 4 4 4 4 4 4 2 2 1 4 2 2 2 2 2 4 2 2
## [106] 4 2 4 2 2 2 2 4 4 4 4 2 4 3 2 2 4 4 2 2 2 2 4 4 4 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [176] 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 322.0013 273.4497 326.3537 252.2581
##  (between_SS / total_SS =  49.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"

# Visualize the dendrogram tree

fviz_dend(hierach_kmeans_fit, cex = 0.6, palette = "jco", 
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)

[Plot: hierarchical k-means dendrogram from fviz_dend()]

# Visualize the clusters

fviz_cluster(hierach_kmeans_fit, palette = "jco", repel = TRUE,
             ggtheme = theme_classic())

[Plot: hierarchical k-means clusters from fviz_cluster()]
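The idea behind hierarchical k-means is to remove the randomness of k-means by taking the starting centers from a hierarchical clustering: build the tree, cut it into k groups, use the group means as initial centers, then run k-means. The base R sketch below follows those steps on stand-in data (scaled iris, an assumption); it approximates the idea and is not the hkmeans() source code, and the Ward linkage on Euclidean distances is chosen here to match the hierarchical section above.

```r
# Stand-in data (assumption): scaled iris measurements instead of wine_scaled
x <- scale(iris[, 1:4])
k <- 3

# Step 1: hierarchical clustering, cut into k groups
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")
init_groups <- cutree(hc, k = k)

# Step 2: mean of every variable within each hierarchical group (k x p matrix)
init_centers <- apply(x, 2, function(column) tapply(column, init_groups, mean))

# Step 3: k-means started from those centers, so no random initialization is used
hk_fit <- kmeans(x, centers = init_centers)

# How far k-means moved observations away from the hierarchical groups
print(table(hierarchical = init_groups, kmeans = hk_fit$cluster))
print(hk_fit$size)
```

Since the starting centers are fixed by the tree, repeated runs give the same clusters without needing set.seed().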

5) Model Based Clustering

library(mclust)

mclust_fit<- Mclust(wine_scaled)

#Summary of the model

summary(mclust_fit)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVE (ellipsoidal, equal orientation) model with 3 components:
## 
##  log.likelihood   n  df       BIC     ICL
##       -2285.363 178 158 -5389.448 -5390.9
## 
## Clustering table:
##  1  2  3 
## 56 73 49
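mclust reports BIC as 2 × log-likelihood minus df × log(n), so under its convention a larger (less negative) BIC indicates a better model. The sketch below recomputes the BIC from the numbers printed in the summary above as a sanity check; the values are copied from that output and the variable names are illustrative.

```r
# Values copied from the summary output above
log_likelihood <- -2285.363
n  <- 178   # number of observations
df <- 158   # number of estimated parameters

# BIC in the mclust convention: larger is better
bic <- 2 * log_likelihood - df * log(n)
print(round(bic, 3))   # about -5389.448, matching the summary
```

Mclust() computes this quantity for every combination of covariance model and number of components it tries and keeps the combination with the highest value, which is how the 3-component VVE model was selected.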

# Plot using factoextra

# BIC values used for choosing the number of clusters
fviz_mclust(mclust_fit, "BIC", palette = "jco")

[Plot: BIC values by number of components from fviz_mclust()]

# Classification: plot showing the clustering
fviz_mclust(mclust_fit, "classification", geom = "point", 
            pointsize = 1.5, palette = "jco")

[Plot: model-based classification from fviz_mclust()]

# Classification uncertainty
fviz_mclust(mclust_fit, "uncertainty", palette = "jco")

[Plot: classification uncertainty from fviz_mclust()]
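Unlike k-means, a mixture model gives every observation a probability of belonging to each cluster. The classification is the cluster with the highest probability, and the uncertainty is one minus that probability. The sketch below illustrates the calculation on a made-up probability matrix (the numbers are invented purely for illustration).

```r
# Made-up membership probabilities: one row per observation, one column per cluster
z <- matrix(c(0.98, 0.01, 0.01,
              0.55, 0.40, 0.05,
              0.10, 0.10, 0.80),
            nrow = 3, byrow = TRUE)

# Assigned cluster = column with the highest probability
classification <- apply(z, 1, which.max)
print(classification)   # 1 1 3

# Uncertainty = 1 - highest probability; near 0 means a confident assignment
uncertainty <- 1 - apply(z, 1, max)
print(uncertainty)      # 0.02 0.45 0.20
```

For the fitted model in this article, the corresponding components are mclust_fit$z, mclust_fit$classification, and mclust_fit$uncertainty.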
