- Basic objective: convert a heterogeneous set of multidimensional objects into homogeneous groups.
Clustering is a technique generally used for initial profiling of a portfolio. Once the portfolio is well understood, an objective modelling technique is used to build a specific strategy.

Regardless of the predictive variables, a single model may not perform optimally across the target population, because the population may contain distinct segments with different inherent characteristics.

Clustering and profiling of the customer base can answer questions such as:

- Who are my customers?
- How profitable are my customers?
- Who are my least profitable customers?
- Why are my customers leaving?
- What do my best customers look like?

Typical applications include:

- Taxonomies in biology for grouping living organisms.
- Psychological classifications based on personality and other traits.
- Analyses of similarities and differences among new products.
- Performance evaluation of firms to identify groupings based on firm strategy or strategic orientation.
- Recommendation engines
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection

- Cluster analysis is descriptive, atheoretical, and non-inferential.
- Cluster analysis will always create clusters, regardless of whether any real structure exists in the data. Only with strong conceptual support, followed by validation, are the clusters meaningful and relevant.
- The cluster solution is not generalizable, because it is totally dependent on the variables used as the basis for the similarity measure.
- Data reduction: reducing many observations to a few meaningful groups. For example, if you can understand the attitudes of a population by identifying the major groups within it, you have reduced the data.
- Hypothesis generation: cluster analysis is also useful when the researcher wishes to develop hypotheses concerning the nature of the data, or to examine previously stated hypotheses.

Types of clustering:

- Hierarchical Cluster Analysis (HCA)
- Non-Hierarchical Cluster Analysis.
In HCA the observation vectors are grouped together on the basis of their mutual distances. The result is normally presented as a hierarchical tree, called a dendrogram: a nested set of partitions represented by a tree diagram. For example, two successive levels of the hierarchy might be:

- Level 1: (1,2), 3, 4, (5,6), 7, 8, 9, (10,11), 12
- Level 2: (1,2,3), (4,5,6), 7, (8,9), (10,11), 12
- Agglomerative HCA
- Divisive HCA
- Successively merges groups at every step.
- Starts with n clusters, each containing a single case.
- At every stage, the two most similar groups are merged to form a new cluster, reducing the number of clusters by 1.
- Continues until all sub-groups fuse into one cluster.
- Successively splits groups at every step.
- Begins with a single group, i.e. one cluster containing all objects.
- The group is divided into two sub-groups such that the objects in one sub-group are as far as possible from the objects in the other.
- The steps continue until there are n groups, each with a single object.
- Computationally not as feasible as the agglomerative approach.
- Euclidean distance: sqrt(Sum (xi - yi)^2)
- Squared Euclidean distance: Sum (xi - yi)^2. Recommended for the centroid and Ward's methods.
- City-block (Manhattan) distance: Sum |xi - yi|. Because differences are not squared, the effect of single large differences (outliers) is dampened.
- Mahalanobis distance (D^2): (x - y)' S^-1 (x - y), where S is the sample covariance matrix; it adjusts for correlations among the variables.

Agglomerative algorithm:

- Step 1: Start with n clusters, each containing a single object, and an n x n symmetric matrix of distances (or similarities).
- Step 2: Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters, say r and s, be d(r,s).
- Step 3: Merge r and s, and label the new cluster (r,s). Update the distance matrix by:
- deleting the rows and columns corresponding to clusters r and s;
- adding a row and column with the distances between the new cluster (r,s) and the remaining clusters.
- Step 4: Repeat steps 2 and 3 a total of n-1 times.
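The steps above can be sketched in plain Python. This is an illustrative, hypothetical implementation (not the R code used later in these notes), assuming Euclidean distance and single linkage, and stopping once a desired number of clusters remains:

```python
# Minimal agglomerative clustering sketch (steps 1-4 above).
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(points, n_clusters):
    # Step 1: start with n clusters, each containing a single object.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Step 2: search for the nearest (most similar) pair of clusters.
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                # Single linkage: shortest distance between any two members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[r] for j in clusters[s])
                if best is None or d < best[0]:
                    best = (d, r, s)
        _, r, s = best
        # Step 3: merge r and s into the new cluster (r,s) and drop the old rows.
        merged = clusters[r] + clusters[s]
        clusters = [c for k, c in enumerate(clusters) if k not in (r, s)]
        clusters.append(merged)
        # Step 4: repeat until the desired number of clusters remains.
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(sorted(sorted(c) for c in agglomerative(pts, 3)))  # → [[0, 1], [2, 3], [4]]
```

Recomputing all pairwise cluster distances each pass is O(n^3) overall; real implementations update the stored distance matrix instead, exactly as step 3 describes.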
- 1) Single linkage: shortest distance between members of the two clusters.
- 2) Complete linkage: maximum distance between members of the two clusters.
- 3) Average linkage: average distance between all pairs of members.
- 4) Centroid method: distance between the centroids of the two clusters.
- 5) Ward's method: merges the pair of clusters that minimizes the increase in the total within-cluster sum of squares.
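The first three linkage rules differ only in how they aggregate the pairwise distances between two clusters. As a rough Python illustration (the helper names are made up for this sketch, not a library API):

```python
# How single, complete, and average linkage measure cluster-to-cluster distance.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(A, B):
    # All distances between a member of cluster A and a member of cluster B.
    return [euclidean(a, b) for a in A for b in B]

def single_linkage(A, B):    # shortest pairwise distance
    return min(pairwise(A, B))

def complete_linkage(A, B):  # maximum pairwise distance
    return max(pairwise(A, B))

def average_linkage(A, B):   # mean of all pairwise distances
    d = pairwise(A, B)
    return sum(d) / len(d)

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
print(single_linkage(A, B))    # → 3.0
print(complete_linkage(A, B))
print(average_linkage(A, B))
```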
- K-means.
- The number of clusters k is defined in advance, say k = 3.
- Select 3 seeds (initial centroids) at random.
- Measure the distance of each observation from the centroids.
- Each observation is assigned to the group of the centroid it is closest to.
- Once all observations are assigned among the three centroids, the mean of each group is calculated; these means become the new centroids of the respective clusters.
- The observations are then re-assigned to cluster groups based on the new centroids.
- This process of re-assignment continues until the cluster memberships stabilize.
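A minimal Python sketch of this loop (illustrative only; to keep the example deterministic, the first k points are used as seeds instead of random ones):

```python
# Bare-bones k-means: assign to nearest centroid, recompute means, repeat.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, max_iter=100):
    centroids = [points[i] for i in range(k)]  # deterministic seeds for the demo
    assign = None
    for _ in range(max_iter):
        # Assign each observation to its closest centroid.
        new_assign = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break  # no observation changed cluster: memberships have stabilized
        assign = new_assign
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assign, centroids

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 0.5)]
labels, cents = kmeans(pts, 2)
print(labels)  # → [0, 0, 1, 1, 0]
```

The later R lab uses `kmeans(..., nstart = 20)` instead, which reruns this loop from 20 random seedings and keeps the best solution, since a single random start can converge to a poor local optimum.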
- It is an attempt to ensure that the cluster solution is representative of the general population, and thus generalizable to other objects and stable over time.
- Cross-validation: split the sample into two groups, cluster each group separately, and compare the results. Cross-tabulation is used because the members of a specific cluster in one solution should stay together in the other solution.
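The cross-tabulation step can be sketched as follows. The two label vectors below are made-up illustrations of cluster memberships for the same objects under two solutions; a stable solution concentrates each row of the table in a single column:

```python
# Cross-tabulate two cluster solutions to check membership stability.
from collections import Counter

solution_a = [0, 0, 0, 1, 1, 1, 2, 2]  # hypothetical labels from split A's solution
solution_b = [1, 1, 1, 0, 0, 0, 2, 2]  # hypothetical labels from split B's solution

# Count how often each (label_a, label_b) pair co-occurs. Here every cluster in
# A maps onto exactly one cluster in B, so the members stayed together.
crosstab = Counter(zip(solution_a, solution_b))
for (a, b), n in sorted(crosstab.items()):
    print(f"A={a} B={b}: {n}")
```

Note that cluster IDs are arbitrary (A's cluster 0 matching B's cluster 1 is fine); what matters is whether each row's mass sits in one column.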
- Sample representativeness: cluster analysis results cannot be generalized beyond the sample unless the sample's representativeness is established.
- Multicollinearity: correlated variables are implicitly weighted more heavily and will therefore influence the cluster solution.


**Quiz**

- What is the objective of cluster analysis?
- Why do we do clustering?
- Give a few applications.
- What is the use of cluster analysis in statistics and modelling?
- What are the types of clustering?
- Name a few distance measurement techniques.
- Which distance measure dampens the effect of outliers?
- What are single, complete, and average linkage?
- What are the centroid and Ward's methods?
- How do you validate clusters?
- What are the assumptions of cluster analysis?

```r
# For HCA
data.clust <- iris
boxplot(data.clust[,-5])
boxplot(data.clust[,-5], horizontal = TRUE)
plot(Sepal.Length ~ Sepal.Width, data.clust)
with(data.clust, text(Sepal.Width, Sepal.Length, labels = Species, pos = 4, cex = .6))

# Normalization
m <- apply(data.clust[,-5], 2, mean)
sd <- apply(data.clust[,-5], 2, sd)
z <- scale(data.clust[,-5], m, sd)

# Boxplot
boxplot(z, horizontal = TRUE)

# Calculating distance
distance <- dist(z)
```

```r
# Cluster dendrogram with complete linkage
hc.c <- hclust(distance, method = "complete")
# Plot with labels
plot(hc.c, labels = data.clust$Species)
# Plot with numbers
plot(hc.c, hang = -1)
# Cluster membership
member.c <- cutree(hc.c, 3)
# Highlight the clusters in the dendrogram
rect.hclust(hc.c, k = 3, border = "red")
```

```r
# Cluster dendrogram with single linkage
hc.s <- hclust(distance, method = "single")
# Plot with labels
plot(hc.s, labels = data.clust$Species)
# Plot with numbers
plot(hc.s, hang = -1)
# Highlight the clusters in the dendrogram
rect.hclust(hc.s, k = 3, border = "red")
# Cluster membership
member.s <- cutree(hc.s, 3)
```

```r
# Cluster dendrogram with average linkage
hc.a <- hclust(distance, method = "average")
# Plot with labels
plot(hc.a, labels = data.clust$Species)
# Plot with numbers
plot(hc.a, hang = -1)
# Cluster membership
member.a <- cutree(hc.a, 3)
# Highlight the clusters in the dendrogram
rect.hclust(hc.a, k = 3, border = "red")
```

```r
# Cluster dendrogram with Ward linkage
hc.w <- hclust(distance, method = "ward.D2")
# Plot with labels
plot(hc.w, labels = data.clust$Species)
# Plot with numbers
plot(hc.w, hang = -1)
# Cluster membership
member.w <- cutree(hc.w, 3)
# Highlight the clusters in the dendrogram
rect.hclust(hc.w, k = 3, border = "red")
```

```r
# Cluster dendrogram with centroid linkage
hc.cen <- hclust(distance, method = "centroid")
# Plot with labels
plot(hc.cen, labels = data.clust$Species)
# Plot with numbers
plot(hc.cen, hang = -1)
# Cluster membership
member.cen <- cutree(hc.cen, 3)
# Highlight the clusters in the dendrogram
rect.hclust(hc.cen, k = 3, border = "red")
```

```r
install.packages("fpc")
install.packages("cluster")
library("fpc")
library("cluster")
plotcluster(data.clust[,-5], member.s)  # fpc
clusplot(data.clust[,-5], member.s, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)  # cluster
```

```r
# K-means clustering
library(fpc)
nhc.kmeans <- kmeans(data.clust[,-5], centers = 3, nstart = 20)
plotcluster(data.clust[,-5], nhc.kmeans$cluster)
print(nhc.kmeans)
```