Short R Tutorial: Validating Sequence Analysis Typologies To be Used in Subsequent Regression

Matthias Studer

Introduction

This document provides a very quick introduction to the R code needed to estimate the quality of a typology in a subsequent regression or when the relationship between the typology and the covariate is of key interest. Readers interested in the methods and the exact interpretation of the results are referred to:

You are kindly asked to cite the above reference if you use the methods presented in this document.

Let’s start by setting the seed for reproducible results.

set.seed(1)

Time for this code chunk to run: 0.06 seconds

Creating the State Sequence Object

For this example, we use the mvad dataset. Let’s start with the creation of the state sequence object.

## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)

## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")

## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6)

Time for this code chunk to run: 0.26 seconds

Creating the typology

We will now create a typology using cluster analysis. Readers interested in more detail are referred to the WeightedCluster library manual (also available as a vignette), which goes into the details of the creation and computation of cluster quality measures.

We start by computing dissimilarities with the seqdist function using the Hamming distance. We then use Ward clustering to create a typology of the trajectories. For this step, we recommend the use of the fastcluster library (Müllner 2024), which considerably speed up the computations.

## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")

Time for this code chunk to run: 4.78 seconds

We can now compute several cluster quality indices using as.clustrange function from two to ten groups.

# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
##            PBC   HG HGSD  ASW ASWw     CH   R2   CHsq R2sq   HC
## cluster2  0.59 0.75 0.74 0.43 0.43 216.72 0.23 431.41 0.38 0.13
## cluster3  0.43 0.51 0.50 0.25 0.25 175.61 0.33 335.58 0.49 0.25
## cluster4  0.52 0.67 0.67 0.29 0.30 164.63 0.41 352.31 0.60 0.17
## cluster5  0.52 0.69 0.68 0.31 0.31 153.46 0.46 326.41 0.65 0.16
## cluster6  0.57 0.79 0.78 0.34 0.35 151.17 0.52 372.95 0.73 0.11
## cluster7  0.58 0.83 0.83 0.37 0.37 144.38 0.55 383.10 0.77 0.09
## cluster8  0.51 0.78 0.78 0.32 0.33 140.00 0.58 358.85 0.78 0.12
## cluster9  0.52 0.85 0.85 0.34 0.35 137.85 0.61 386.02 0.81 0.09
## cluster10 0.51 0.87 0.87 0.35 0.36 131.67 0.63 379.81 0.83 0.08

Time for this code chunk to run: 0.44 seconds

The clustassoc function

In this example, we will focus on the association between father unemployment status (funemp variable) and our school-to-work trajectories. The clustassoc function provides several indicators of the quality of typology to study this association.

The function takes a clustrange object as the first argument. The diss argument specifies the distance matrix used for clustering, covar the covariate of association of interest, and weights an optional case weights vector.

cla <- clustassoc(clustqual, diss=diss, covar=mvad$funemp)
cla
##               Unaccounted   Remaining      BIC numcluster
## No Clustering   1.0000000 0.006135499 642.7743          1
## cluster2        0.6804065 0.004174633 639.6969          2
## cluster3        0.4544939 0.002788547 642.3764          3
## cluster4        0.3432326 0.002105903 646.4923          4
## cluster5        0.3445621 0.002114060 652.8668          5
## cluster6        0.2225184 0.001365262 657.7167          6
## cluster7        0.2269752 0.001392606 662.9861          7
## cluster8        0.2192153 0.001344995 667.0654          8
## cluster9        0.1944971 0.001193337 671.3423          9
## cluster10       0.1970281 0.001208866 675.5324         10

Time for this code chunk to run: 19.77 seconds The resulting object presents three indicators. The Unaccounted column shows the share of the direct association between the trajectories and the covariates that is by the typology. This computation are based on the discrepancy analysis framework (Studer et al. 2011). A low value means that the typology carries most of the information that is relevant to study the association between our covariate and the trajectories.

The Remaining column presents the share of the overall variability of the trajectories that is by the typology. A low value indicates that there is no variation left not explained by the typology. Warning, this is usually a very low value. The value presented in the “No clustering” row (the first) is equivalent to the pseudo-\(R^2\) of a discrepancy analysis between the trajectories and the covariates.

The BIC column presents the Bayesian Information Criterion for the association between the typology and the covariate (again the lower the better). While the first column provides the most reliable information, the BIC might be useful when parcimony is of key interest.

The general idea is to select a cluster solution with low values on Unaccounted and BIC (only if relevant).

The results can be plotted to make it easier to find the minimum.

plot(cla, main="Unaccounted")

Time for this code chunk to run: 0.09 seconds

According to the plot, at least 6 groups are required. However, around one fifth of the association is left un-reproduced by the clustering. It might be interesting to compare the 5 and 6 clusters solutions to better understand the association.

seqdplot(mvad.seq, group=clustqual$clustering$cluster5, border=NA)

Time for this code chunk to run: 0.68 seconds

seqdplot(mvad.seq, group=clustqual$clustering$cluster6, border=NA)

Time for this code chunk to run: 0.5 seconds

We can notice that the 6 cluster solutions contains a new joblessness cluster, which is found to be important to study the association between father unemployment and son school-to-work trajectories.

The presented might lead to different recommendations than the usual cluster quality indices, because it focuses on a relationship with a covariate. The method often suggests a higher number of groups.

References

Müllner, Daniel. 2024. Fastcluster: Fast Hierarchical Clustering Routines for r and Python. https://danifold.net/fastcluster.html.
Studer, Matthias, Gilbert Ritschard, Alexis Gabadinho, and Nicolas S. Müller. 2011. “Discrepancy Analysis of State Sequences.” Sociological Methods and Research 40 (3): 471–510. https://doi.org/10.1177/0049124111415372.