# 7.2: Classification without learning

We see that plotting of multivariate data always has two problems: either there are too many elements (e.g., in parallel coordinates) which are hard to understand, or there is a need of some grouping operation (e.g., median or range) which will result in the lost of information. What will be really helpful is to safely process the data first, for example, to reduce dimensions—from many to 2 or 3. These techniques are described in this section.

Apart from (a) reduction of dimensionality (projection pursuit), the following methods help to (b) find groups (clusters) in data, (c) discover hidden factors (latent variables) and understand variable importance (feature selection$$^{[1]}$$), (d) recognize objects (e.g., complicated shapes) within data, typically using densities and hiatus (gaps) in multidimensional space, and (e) unmix signals.

Figure $$\PageIndex{1}$$ Static 3D cloud plot of iris data with several rotations.

## Classification with primary data

Primary is what come directly from observation, and did not yet processes in any way (to make secondary data).

### Shadows of hyper clouds: PCA

RGL (see above) allows to find the best projection manually, with a mouse. However, it is possible to do programmatically, with principal component analysis, PCA. It belongs to the family of non-supervised methods, methods of classification without learning, or ordination.

PCA treats the data as points in the virtual multidimensional space where every dimension is the one character. These points make together the multidimensional cloud. The goal of the analysis is to find a line which crosses this cloud along its most elongated part, like pear on the stick (Figure $$\PageIndex{2}$$). This is the first principal component. The second line is perpendicular to the first and again span the second most elongated part of the cloud. These two lines make the plane on which every point is projected.

Figure $$\PageIndex{2}$$ Principal component analysis is like the pear on the stick.

Let us prove this practically. We will load the two-dimensional (hence only two principal components) black and white pear image and see what PCA does with it:

Code $$\PageIndex{1}$$ (Python):

library(png) # package to read PNG images
bb <- which(aa == 0, arr.ind=TRUE) # pixels to coordinates
## plot together original (green) and PCA-rotated:
bbs <- scale(bb)
pps <- scale(prcomp(bb)$x[, 1:2]) # only two PCs anyway xx <- range(c(bbs[, 1], pps[, 1])) yy <- range(c(bbs[, 2], pps[, 2])) plot(pps, pch=".", col=adjustcolor("black", alpha=0.5), xlim=xx, ylim=yy) points(bbs, pch=".", col=adjustcolor("green", alpha=0.5)) legend("bottomright", fill=adjustcolor(c("green", "black"), alpha=0.5), legend=c("Original", "PCA-rotated"), bty="n", border=0)  PCA is related with a task of finding the “most average person”. The simple combination of averages will not work, which is well explained in Todd Rose’s “The End of Average” book. However, it is usually possible to find in the hyperspace the configuration of parameters which will suit most of people, and this is what PCA is for. Figure $$\PageIndex{3}$$ Shadow of the pear: how PCA projects the image. After the PCA procedure, all columns (characters) are transformed into components, and the most informative component is the first, next is the second, then third etc. The number of components is the same as the number of initial characters but first two or three usually include all necessary information. This is why it is possible to use them for 2D visualization of multidimensional data. There are many similarities between PCA and factor analysis (which is out of the scope of this book). At first, we will use an example from the open repository presenting measurements of four different populations of sedges: Code $$\PageIndex{2}$$ (Python): ca <- read.table("http://ashipunov.info/shipunov/open/carex.txt", h=TRUE) str(ca) ca.pca <- princomp(scale(ca[,-1]))  (Function scale() standardizes all variables.) The following (Figure $$\PageIndex{4}$$) plot is technical screeplot which shows the relative importance of each component: Code $$\PageIndex{3}$$ (Python): ca <- read.table("http://ashipunov.info/shipunov/open/carex.txt", h=TRUE) ca.pca <- princomp(scale(ca[,-1])) plot(ca.pca, main="")  Figure $$\PageIndex{4}$$ Plot showing the importance of each component. Here it is easy to see that among four components (same number as initial characters), two first have the highest importances. There is a way to have the same without plotting: Code $$\PageIndex{4}$$ (Python): ca <- read.table("http://ashipunov.info/shipunov/open/carex.txt", h=TRUE) ca.pca <- princomp(scale(ca[,-1])) summary(ca.pca)  First two components together explain about 84% percents of the total variance. Visualization of PCA is usually made using scores from PCA model (Figure $$\PageIndex{5}$$): Code $$\PageIndex{5}$$ (Python): ca <- read.table("http://ashipunov.info/shipunov/open/carex.txt", h=TRUE) ca.pca <- princomp(scale(ca[,-1])) ca.p <- ca.pca$scores[, 1:2]
plot(ca.p, type="n", xlab="PC1", ylab="PC2")
text(ca.p, labels=ca[, 1], col=ca[, 1])
Hulls(ca.p, ca[, 1]) # asmisc.r


Figure $$\PageIndex{5}$$ Diversity of sedges on the plot of two first principal components.

(Last command draws hulls which help to conclude that first sedges from the third population are intermediate between first and second, they might be even hybrids. If there are three, not two, components which are most important, then any of 3D plots like scatterplot3d() explained above, will help to visualize them.)

It is tempting to measure the intersection between hulls. This is possible with Overlap() function, which in turn loads PBSmapping package:

Code $$\PageIndex{6}$$ (Python):

ca <- read.table("http://ashipunov.info/shipunov/open/carex.txt", h=TRUE)
ca.pca <- princomp(scale(ca[,-1]))
ca.p <- ca.pca$scores[, 1:2] ca.h <- Hulls(ca.p, ca[, 1], plot=FALSE) # asmisc.r ca.o <- Overlap(ca.h) # asmisc.r summary(ca.o)  Sometimes, PCA results are useful to present as a biplot (Figure $$\PageIndex{6}$$): Code $$\PageIndex{7}$$ (Python): ca <- read.table("http://ashipunov.info/shipunov/open/carex.txt", h=TRUE) ca.pca <- princomp(scale(ca[,-1])) biplot(ca.pca, xlabs=ca[, 1])  Figure $$\PageIndex{6}$$ Biplot shows the load of each character into two first components. Biplot helps to understand visually how large is the load of each initial character into first two components. For example, characters of height and spike length (but spike width) have a biggest loads into the first component which distinguishes populations most. Function loadings() allows to see this information in the numerical form: Code $$\PageIndex{8}$$ (Python): ca <- read.table("http://ashipunov.info/shipunov/open/carex.txt", h=TRUE) ca.pca <- princomp(scale(ca[,-1])) loadings(ca.pca)  R has two variants of PCA calculation, first (already discussed) with princomp(), and second with prcomp(). The difference lays in the way how exactly components are calculated. First way is traditional, but second is recommended: Code $$\PageIndex{9}$$ (Python): iris.pca <- prcomp(iris[, 1:4], scale=TRUE) iris.pca$rotation
iris.p <- iris.pca$x[, 1:2] plot(iris.p, type="n", xlab="PC1", ylab="PC2") text(iris.p, labels=abbreviate(iris[, 5], 1, method="both.sides"), col=as.numeric(iris[, 5])) Ellipses(iris.p[, 1:2], as.numeric(iris[, 5])) # asmisc.r  Example above shows some differences between two PCA methods. First, prcomp() conveniently accepts scale option. Second, loadings are taken from the rotation element. Third, scores are in the the element with x name. Please run the code yourself to see how to add 95% confidence ellipses to the 2D ordination plot. One might see that Iris setosa (letter “s” on the plot) is seriously divergent from two other species, Iris versicolor (“v”) and Iris virginica (“a”). Packages ade4 and vegan offer many variants of PCA (Figure $$\PageIndex{7}$$): Code $$\PageIndex{10}$$ (Python): library(ade4) iris.dudi <- dudi.pca(iris[, 1:4], scannf=FALSE) s.class(iris.dudi$li, iris[, 5])


Figure $$\PageIndex{7}$$ Diversity of irises on the plot of two first principal components (ade4 package)

(The plot is similar to the shown on Figure $$\PageIndex{5}$$; however, the differences between groups are here more clear.)

In addition, this is possible to use the inferential approach for the PCA:

Code $$\PageIndex{11}$$ (Python):

iris.between <- bca(iris.dudi, iris[, 5], scannf=FALSE)
randtest(iris.between)


Monte-Carlo randomization allows to understand numerically how well are Iris species separated with this PCA. The high Observation value (72.2% which is larger than 50%) is the sign of reliable differences.

There are other variants of permutation tests for PCA, for example, with anosim() from the vegan package.

Please note that principal component analysis is in general a linear technique similar to the analysis of correlations, and it can fail in some complicated cases.

### Data solitaire: SOM

There are several other techniques which allow unsupervised classification of primary data. Self-organizing maps (SOM) is a technique somewhat similar to breaking the deck of cards into several piles:

Code $$\PageIndex{12}$$ (Python):

library(kohonen)
iris.som <- som(scale(iris[, 1:4]), grid = somgrid(3, 3, "hexagonal")) # 9 "piles"
predict(iris.som)$unit.classif # content of each "pile" iris.som$distances # distance to the "center of pile"
set.seed(1)
plot(iris.som, main="")
oldpar <- par(new=TRUE, oma=c(2, rep(0, 3)))
plot(iris.som, type="mapping", col=iris$Species, main="", border=0) par(oldpar) legend("top", col=1:3, pch=1, legend=levels(iris$Species))


The resulted plot (Figure $$\PageIndex{8}$$) contains graphical representation of character values, together with the placement of actual data points.

Figure $$\PageIndex{8}$$ Self-organizing map for iris data. Both character values (codes) and data placement is shown.

In fact, SOM is the non-learning neural network. More advanced Growing Neural Gas (GNG) algorithm uses ideas similar to SOM.

### Data density: t-SNE

With the really big number of samples, t-SNE algorithm (name stands for “t-Distributed Stochastic Neighbor Embedding”) performs better than classical PCA. t-SNE is frequently used for the shape recognition. It is easy enough to employ it in R(Figure $$\PageIndex{9}$$):

Code $$\PageIndex{13}$$ (Python):

library(Rtsne)
iris.unique <- unique(iris)
set.seed(42)
tsne.out <- Rtsne(as.matrix(iris.unique[, 1:4]))
SP <- iris.unique$Species plot(tsne.out$Y, col=SP, pch=14+as.numeric(SP), xlab="", ylab="")
legend("topleft", pch=14+1:nlevels(SP), col=1:nlevels(SP), legend=levels(SP))


Figure $$\PageIndex{9}$$ t-SNE algorithm splits the iris data.

## Classification with correspondence

Correspondence analysis is the family of techniques similar to PCA, but applicable to categorical data (primary or in contingency tables). Simple variant of the correspondence analysis is implemented in corresp() from MASS package (Figure $$\PageIndex{10}$$) which works with contingency tables:

Code $$\PageIndex{14}$$ (Python):

HE <- margin.table(HairEyeColor, 1:2)
HE.df <- Table2df(HE) # asmisc.r
biplot(corresp(HE.df, nf = 2), xpd=TRUE)
legend("topleft", fill=1:2, legend=c("hair","eyes"))


(We converted here “table” object HE into the data frame. xpd=TRUE was used to allow text to go out of the plotting box.)

Figure $$\PageIndex{10}$$ Correspondence plot of contingency table.

This example uses HairEyeColor data from previous chapter. Plot visualizes both parameters so if the particular combination of colors is more frequent, then positions of corresponding words is closer. For example, black hairs and brown eyes frequently occur together. The position of these words is more distant from the center (designated with cross) because numerical values of these characters are remote.

This possibility to visualize several character sets simultaneously on the one plot is the impressive feature of correspondence analysis (Figure $$\PageIndex{11}$$):

Code $$\PageIndex{15}$$ (Python):

library(vegan)
alla <- read.table("data/lakesea_abio.txt", sep="	", h=TRUE)
allc <- read.table("data/lakesea_bio.txt", sep="	", h=TRUE)
all.cca <- cca(allc, alla[,-14])
plot.all.cca <- plot(all.cca, display=c("sp","cn"), xlab="", ylab="")
points(all.cca, display="sites", pch=ifelse(alla[, 14], 15, 0))
legend("topleft", pch=c(15, 0), legend=c("lake","sea"))
text(-1.6, -4.2, "Carex.lasiocarpa", pos=4)


This is much more advanced than biplot. Data used here contained both abiotic (ecotopes) and biotic factors (plant species), plus the geography of some Arctic islands: were these lake islands or sea islands. The plot was able to arrange all of these data: for abiotic factors, it used arrows, for biotic—pluses, and for sites (islands themselves as characterized by the sum of all available factors, biotic and abiotic)—squares of different color, depending on geographic origin. All pluses could be identified with the interactive identify(plot.all.cca, "species") command. We did it just for one most outstanding species, Carex lasiocarpa (woolly-fruit sedge) which is clearly associated with lake islands, and also with swamps.

## Classification with distances

Important way of non-supervised classification is to work with distances instead of original data. Distance-based methods need the dissimilarities between each pair of objects to be calculated first. Advantage of these methods is that dissimilarities could be calculated from data of any type: measurement, ranked or nominal.

### Distances

There are myriads of ways to calculate dissimilarity (or similarity which is essentially the reverse dissimilarity)$$^{[2]}$$. One of these ways already explained above is a (reverse absolute) correlation. Other popular ways are Euclidean (square) distance and Manhattan (block) distance. Both of them (Figure $$\PageIndex{12}$$) are useful for measurement variables.

Figure $$\PageIndex{11}$$ Canonical correlation analysis plot showing Arctic islands (squares), species (crosses) and habitat factors (arrows)

Manhattan distances are similar to driving distances, especially when there are not many roads available. The example below are driving distances between biggest North Dakota towns:

Code $$\PageIndex{16}$$ (Python):

nd <- read.table("data/nd.txt", h=TRUE, sep="	", row.names=1)
nd.d <- as.dist(nd)
str(nd.d)


Figure $$\PageIndex{12}$$ Euclidean (1) and Manhattan (2) distances between A and B

In most cases, we need to convert raw variables into distance matrix. The basic way is to use dist(). Note that ranked and binary variables usually require different approaches which are implemented in the vegan (function vegdist()) and cluster packages (function daisy()). The last function recognizes the type of variable and applies the most appropriate metric (including the universal Gower distance); it also accepts the metric specified by user:

Code $$\PageIndex{17}$$ (Python):

library(cluster)
iris.dist <- daisy(iris[, 1:4], metric="manhattan")
str(iris.dist)


In biology, one can use Smirnov taxonomic distances, available from smirnov package. In the following example, we use plant species distribution data on small islands.

The next plot intends to help the reader to understand them better. It is just a kind of map which shows geographical locations and sizes of islands:

Code $$\PageIndex{18}$$ (Python):

mp <- read.table( "http://ashipunov.info/shipunov/open/moldino_l.txt", h=TRUE, sep="	", row.names=1)
library(MASS)
eqscplot(mp$LON, mp$LAT, cex=round(log(mp$SQUARE))-5.5, axes=FALSE, xlab="", ylab="", xpd=T) text(mp$LON, mp$LAT, labels=row.names(mp), pos=4, offset=1, cex=.9)  (Please plot it yourself.) Now we will calculate and visualize Smirnov’s distances: Code $$\PageIndex{19}$$ (Python): mo <- read.table("http://ashipunov.info/shipunov/open/moldino.txt", h=TRUE, sep=" ", row.names=1) m1 <- t((mo > 0) * 1) # convert to occurrence 0/1 data and transpose library(smirnov) m1.Txy <- smirnov(m1) m1.s <- (1 - m1.Txy) # similarity to dissimilarity dimnames(m1.s) <- list(row.names(m1)) symnum(m1.s)  Smirnov’s distances have an interesting feature: instead of 0 or 1, diagonal of the similarity matrix is filled with the coefficient of uniqueness values (Txx): Code $$\PageIndex{20}$$ (Python): mo <- read.table("http://ashipunov.info/shipunov/open/moldino.txt", h=TRUE, sep=" ", row.names=1) m1 <- t((mo > 0) * 1) # convert to occurrence 0/1 data and transpose library(smirnov) m1.Txy <- smirnov(m1) m1.Txx <- diag(m1.Txy) names(m1.Txx) <- row.names(m1) rev(sort(round(m1.Txx, 3)))  This means that Verik island is a most unique in regards to plant species occurrence. ### Making maps: multidimensional scaling There are many things to do with the distance matrix. One of most straightforward is the multidimensional scaling, MDS (the other name is “principal coordinate analysis”, PCoA): Code $$\PageIndex{21}$$ (Python): nd <- read.table("data/nd.txt", h=TRUE, sep=" ", row.names=1) nd.d <- as.dist(nd) nd.c <- cmdscale(nd.d) new.names <- sub("y C", "y C", row.names(nd)) library(MASS) eqscplot(nd.c, type="n", axes=FALSE, xlab="", ylab="") points(nd.c, pch=19) text(nd.c, labels=new.names, xpd=TRUE, pos=3, cex=0.8)  Figure $$\PageIndex{13}$$ It is not a map of North Dakota towns but the plot of cmdscale() output from the driving distance data. Compare the plot (Figure $$\PageIndex{13}$$) it with any geographical map. If you do not have a map of North Dakota but have these driving distances, cmdscale() allows to re-create the map! So in essence, MDS is a task reverse to navigation (finding driving directions from map): it uses “driving directions” and makes a map from them. Another, less impressive but more useful example (Figure $$\PageIndex{14}$$) is from raw data of Fisher’s irises: Code $$\PageIndex{22}$$ (Python): library(KernSmooth) library(cluster) iris.dist <- daisy(iris[, 1:4], metric="manhattan") iris.c <- cmdscale(iris.dist) est <- bkde2D(iris.c, bandwidth=c(.7, 1.5)) plot(iris.c, type="n", xlab="Dim. 1", ylab="Dim. 2") text(iris.c, labels=abbreviate(iris[, 5], 1, method="both.sides")) contour(est$x1, est$x2, est$fhat, add=TRUE, drawlabels=FALSE, lty=3)


Figure $$\PageIndex{14}$$ The result of the multidimensional scaling of the iris. data. Visualization uses the estimation of density.

(There is no real difference from PCA because metric multidimensional scaling is related to principal component analysis; also, the internal structure of data is the same.)

To make the plot “prettier”, we added here density lines of point closeness estimated with bkde2D() function from the KernSmooth package. Another way to show density is to plot 3D surface like (Figure $$\PageIndex{15}$$):

Code $$\PageIndex{23}$$ (Python):

library(cluster)
iris.dist <- daisy(iris[, 1:4], metric="manhattan")
iris.c <- cmdscale(iris.dist)
est <- bkde2D(iris.c, bandwidth=c(.7, 1.5))
persp(est$x1, est$x2, est$fhat, theta=135, phi=45, col="purple3", shade=0.75, border=NA, xlab="Dim. 1", ylab="Dim. 2", zlab="Density")  Figure $$\PageIndex{15}$$ 3D density surface of multidimensionally scaled iris data. In addition to cmdscale(), MASS package (functions isoMDS() and sammon()) implements the non-metric multidimensional scaling, and package vegan has the advanced non-metric metaMDS(). Non-metric multidimensional scaling does not have analogs to PCA loadings (importances of variables) and proportion of variance explained by component, but it is possible to calculate surrogate metrics: Code $$\PageIndex{24}$$ (Python): iris.dist2 <- dist(iris[, 1:4], method="manhattan") ## to remove zero distances: iris.dist2[iris.dist2 == 0] <- abs(jitter(0)) library(MASS) iris.m <- isoMDS(iris.dist2) cor(iris[, 1:4], iris.m$points) # MDS loadings surrogate
plot(m1.ba$cons.tree) # majority-rule consensus  (This method requires to make an anonymous function which uses methods you want. It also plots both consensus tree (without support values) and original tree with support values. Please make these trees. Note that by default, only support values greater then 50% are shown.) ### Making groups: k-means and friends Apart from hierarchical, there are many other ways of clustering. Typically, they do not return any ordination (“map”) and provide only cluster membership. For example, k-means clustering tries to obtain the a priori specified number of clusters from the raw data (it does not need the distance matrix to be supplied): Code $$\PageIndex{41}$$ (Python): eq <- read.table("data/eq.txt", h=TRUE) eq.k <- kmeans(eq[,-1], 2)  K-means clustering does not plot trees; instead, for every object it returns the number of its cluster: Code $$\PageIndex{42}$$ (Python): eq <- read.table("data/eq.txt", h=TRUE) eq.k <- kmeans(eq[,-1], 2) str(eq.k$cluster)
Misclass(eq.k$cluster, eq$SPECIES) # asmisc.r


(As you see, misclassification errors are low.)

Instead of a priori cluster number, function kmeans() also accepts row numbers of cluster centers.

Spectral clustering from kernlab package is superficially similar method capable to separate really tangled elements:

Code $$\PageIndex{43}$$ (Python):

library(kernlab)
data(spirals)
set.seed(1)
sc <- specc(spirals, centers=2)
plot(spirals, col=sc, xlab="", ylab="")


Figure $$\PageIndex{23}$$ Kernel-based spectral clustering is capable to separate two spirals.

Kernel methods (like spectral clustering) recalculate the primary data to make it more suitable for the analysis. Support vector machines (SVM, see below) is another example. There is also kernel PCA (function kpca() in kernlab package).

Next group of clustering methods is based on fuzzy logic and takes into account the fuzziness of relations. There is always the possibility that particular object classified in the cluster A belongs to the different cluster B, and fuzzy clustering tries to measure this possibility:

Code $$\PageIndex{44}$$ (Python):

library(cluster)
iris.f <- fanny(iris[, 1:4], 3)
head(data.frame(sp=iris[, 5], iris.f$membership))  Textual part of the fanny() output is most interesting. Every row contains multiple membership values which represent the probability of this object to be in the particular cluster. For example, sixth plant most likely belongs to the first cluster but there is also visible attraction to the third cluster. In addition, fanny() can round memberships and produce hard clustering like other cluster methods: Code $$\PageIndex{45}$$ (Python): Misclass(iris.f$clustering, factor(iris[, 5], levels=c("setosa", "virginica", "versicolor"))) # asmisc.r


(We had to re-level the Species variable because fanny() gives number 2 to the Iris virginica cluster.)

### How to know cluster numbers

All “k-means and friends” methods want to know the number of clusters before they start. So how to know a priori how many clusters present in data? This question is one of the most important in clustering, both practically and theoretically.

The visual analysis of banner plot (invented by Kaufman & Rousseeuw, 1990) could predict this number (Figure $$\PageIndex{24}$$):

Code $$\PageIndex{46}$$ (Python):

eq <- read.table("data/eq.txt", h=TRUE)
library(cluster)
eq.a <- agnes(eq[,-1])
plot(eq.a, which=1, col=c(0, "maroon"))


Figure $$\PageIndex{25}$$ Banner plot. White bars suggest possible cluster partitions.

White bars on the left represent unclustered data, maroon lines on the right show height of possible clusters. Therefore, two clusters is the most natural solution, four clusters should be the next possible option.

Model-based clustering allows to determine how many clusters present in data and also cluster membership. The method assumes that clusters have the particular nature and multidimensional shapes:

Code $$\PageIndex{47}$$ (Python):

library(mclust)
iris.mclust <- Mclust(iris[, -5])
summary(iris.mclust) # 2 clusters
iris.mclust$classification  (As you see, it reveals two clusters only. This is explanable because in iris data two species are much more similar then the third one.) DBSCAN is the powerful algorithm for the big data (like raster images which consist of billions of pixels) and there is the R package with the same name (in lowercase). DBSCAN reveals how many clusters are in data at particular resolution: Code $$\PageIndex{48}$$ (Python): library(dbscan) kNNdistplot(iris[, -5], k = 5) abline(h=.5, col = "red", lty=2) (iris.dbscan <- dbscan(iris[, -5], eps = .5, minPts = 5)) plot(iris.p, type="n", xlab="", ylab="") text(iris.p, labels=abbreviate(iris[, 5], 1, method="both.sides"), col=iris.dbscan$cluster+1)


(Plots are not shown, please make then yourself. First plot helps to find the size of neighborhood (look on the knee). The second illustrates results. Similar to model-based clustering, DBSCAN by default reveals only two clusters in iris data.)

Note that while DBSCAN was not able to recover all three species, it recovered clouds, and also places marginal points in the “noise” group. DBSCAN, as you see, is useful for smoothing, important part of image recognition. Parameter eps allows to change “resolution” of clustering and to find more, or less, clusters. DBSCAN relates with t-SNE (see above) and with supervised methods based on proximity (like kNN, see below). It can also be supervised itself and predict clusters for new points. Note that k-means and DBSCAN are based on specifically calculated proximities, not directly on distances.

Data stars contains information about 50 brightest stars in the night sky, their location and constellations. Please use DBSCAN to make artificial constellations on the base of star proximity. How are they related to real constellations?

Note that location (right ascension and declination) is given in degrees or hours (sexagesimal system), they must be converted into decimals.

“Mean-shift” method searches for modes within data, which in essence, is similar to finding proximities. The core mean-shift algotithm is slow so approximate “blurring” version is typically preferable:

Code $$\PageIndex{49}$$ (Python):

library(MeanShift)
bandwidth <- quantile(dist(iris[, -5]), 0.25)
(bmsClustering(t(iris[, -5]), h=bandwidth))


Another approach to find cluster number is similar to the PCA screeplot:

Code $$\PageIndex{50}$$ (Python):

eq <- read.table("data/eq.txt", h=TRUE)
library(cluster)
eq.a <- agnes(eq[,-1])
wss <- (nrow(eq[,-1]) - 1) * sum(apply(eq[,-1], 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(eq[,-1], centers=i)$withinss) barplot(wss, names.arg=1:15, xlab="Number of clusters", main="Sums of squares within groups", yaxt="n", ylab="")  >  (Please check this plot yourself. As on the banner plot, it visible that highest relative “cliffs” are after 1 and 4 cluster numbers.) Collection asmisc.r contains function Peaks() which helps to find local maxima in simple data sequence. Number of these peaks on the histogram (with the sensible number of breaks) should point on the number of clusters: Code $$\PageIndex{51}$$ (Python): histdata <- hist(apply(scale(iris[, -5]), 1, function(.x) sum(abs(.x))), breaks=10, plot=FALSE) sum(Peaks(histdata$counts))

>
[1] 3

(“Three” is the first number of peaks after “one” and does not change when 8 < breaks < 22.)

Finally, the integrative package NbClust allows to use diverse methods to assess the putative number of clusters:

Code $$\PageIndex{52}$$ (Python):

library(NbClust)
iris.nb <- NbClust(iris[, -5], method="ward.D") # wait!


## How to compare different ordinations

Most of classification methods result in some ordination, 2D plot which includes all data points. This allow to compare them with Procrustes analysis (see Appendix for more details) which rotates ans scales one data matrix to make in maximally similar with the second (target) one. Let us compare results of classic PCA and t-SNE:

Code $$\PageIndex{53}$$ (Python):

irisu.pca <- prcomp(iris.unique[, 1:4], scale=TRUE)
irisu.p <- irisu.pca$x[, 1:2] library(vegan) irisu.pr <- procrustes(irisu.p, tsne.out$Y)
plot(irisu.pr, ar.col=iris.unique\$Species, xlab="", ylab="", main="") # arrows point to target (PCA)
with(iris.unique, legend("topright", pch="↑", col=1:nlevels(Species), legend=levels(Species), bty="n"))
legend("bottomright", lty=2:1, legend=c("PCA", "t-SNE"), bty="n")


Resulted plot (Figure 7.3.1) shows how dense are points in t-SNE and how PCA spreads them. Which of methods makes better grouping? Find it yourself.

## References

1. Package Boruta is especially god for all relevant feature selection.