Short R Tutorial: Fuzzy and Property-Based Clustering for Sequence Analysis

Matthias Studer

Introduction

This document provides an introduction to the R code to create typologies of trajectories using fuzzy and property-based clustering in sequence analysis using the cluster and WeightedCluster R library (Studer 2013). Readers interested by the methods and looking for more information on the interpretation of the results are referred to:

You are kindly asked to cite the above reference if you use the methods presented in this document.

We start by loading the required library and setting the seed.

set.seed(1)
library(WeightedCluster)

Time for this code chunk to run: 0.08 seconds

Data Preparation

We rely on the biofam dataset to illustrate the use of the WeightedCluster package. This public dataset is distributed with the TraMineR R package and code family trajectories of a subsample of the Swiss Household Panel (SHP Group 2024).

First, we need to prepare the data by creating a state sequence object using the seqdef command (Gabadinho et al. 2011). This object stores all the information about our trajectories, including the data and associated characteristics, such as state labels. During this step, we further define proper state labels as the original data are coded using numerical values. We then plot the sequences using, for instance, a chronogram.

data(biofam) #load illustrative data
## Defining the new state labels 
statelab <- c("Parent", "Left", "Married", "Left/Married",  "Child", 
            "Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object,
biofam.seq <- seqdef(biofam[,10:25], alphabet=0:7, states=statelab)
seqdplot(biofam.seq, legend.prop=0.2)

Time for this code chunk to run: 0.69 seconds

The clustering algorithms discussed here relies on a distance matrix. We compute one using the "LCS" distance.

diss <- seqdist(biofam.seq, method="LCS")

Time for this code chunk to run: 1.37 seconds

Fuzzy Clustering

In fuzzy clustering, the sequences belong to each cluster with an estimated membership strength. This approach is particularly useful when some sequences are thought to lie in between two (or more) types of sequences (i.e., hybrid-type sequences) or when only weak structure is found in the data. Here, we use the fanny fuzzy clustering algorithm Maechler et al. (2024), which is available in the cluster R library. It is used as follows:

library(cluster) ## Loading the library
fclust <- fanny(diss, k=5, diss=TRUE, memb.exp=1.5)

Time for this code chunk to run: 34.46 seconds

The fanny function requires us to specify the distance matrix (our diss object), the number of groups (argument k=5), diss=TRUE specifies that we directly provided a distance matrix, and memb.exp=1.5 the fuzziness parameters (values between 1.5 and 2 are often recommended).

Descriptive statistics of membership strength to each cluster provide useful information. The mean can be interpreted as a relative frequency of each cluster if sequences are weighted according to their membership strength.

summary(fclust$membership)
##        V1                 V2                 V3                 V4          
##  Min.   :0.008774   Min.   :0.001817   Min.   :0.007352   Min.   :0.005009  
##  1st Qu.:0.077276   1st Qu.:0.019741   1st Qu.:0.039166   1st Qu.:0.044493  
##  Median :0.140503   Median :0.053033   Median :0.055403   Median :0.091802  
##  Mean   :0.205773   Mean   :0.211530   Mean   :0.210021   Mean   :0.178295  
##  3rd Qu.:0.286928   3rd Qu.:0.298238   3rd Qu.:0.210623   3rd Qu.:0.234374  
##  Max.   :0.778559   Max.   :0.956435   Max.   :0.981889   Max.   :0.820392  
##        V5          
##  Min.   :0.002511  
##  1st Qu.:0.034931  
##  Median :0.079874  
##  Mean   :0.194380  
##  3rd Qu.:0.212156  
##  Max.   :0.895185

Time for this code chunk to run: 0.03 seconds

Plotting the typology

A graphical representation of the typology can be obtained by weighting each sequences according to its membership strength (Studer 2018). Using this strategy, one can represent distribution plots (see the seqplot function of the TraMineR package). The fuzzyseqplot function from the WeightedCluster can be used to do so.

## Displaying the resulting clustering with membership threshold of 0.4
fuzzyseqplot(biofam.seq, group=fclust$membership, type="d")

Time for this code chunk to run: 0.29 seconds

Following a similar strategy, one can further order the sequences according to the membership strength (argument sortv="membership") in an index plot (argument type="I") and remove sequences with low membership strengths (argument membership.threashold =0.4). In this plot, each sequence is represented using a thin line. The sequences at the top are the one with the highest membership strength. They therefore describe the best the full cluster membership.

## Displaying the resulting clustering with membership threshold of 0.4
fuzzyseqplot(biofam.seq, group=fclust$membership, type="I", membership.threashold =0.4, sortv="membership")

Time for this code chunk to run: 1.76 seconds ## Regression models

Fuzzy clustering membership matrix can be directly included in subsequent regression as an independent variable. To use it as a dependent variable, you mus use Dirichlet or Beta regression (see the full article for more explanation).

Dirichlet regressions are available in the DirichletReg package (Maier 2021). Here is a sample regression with sex and birth year as independent variables.

library(DirichletReg)
## Loading required package: Formula
##Estimation of Dirichlet Regression
##Dependent variable formatting
fmember <- DR_data(fclust$membership)
## Estimation
bdirig <- DirichReg(fmember~sex+birthyr|1, data=biofam, model="alternative")
## Displaying results of Dirichlet regression.
summary(bdirig)
## Call:
## DirichReg(formula = fmember ~ sex + birthyr | 1, data = biofam, model =
## "alternative")
## 
## Standardized Residuals:
##         Min       1Q   Median      3Q     Max
## v1  -1.1019  -0.7502  -0.4404  0.2699  2.7339
## v2  -1.0543  -0.8090  -0.6264  0.5368  5.4646
## v3  -1.4116  -0.8212  -0.6344  0.0549  4.9510
## v4  -1.0227  -0.7999  -0.5514  0.1737  3.3375
## v5  -1.0592  -0.7934  -0.5392  0.0987  3.9312
## 
## MEAN MODELS:
## ------------------------------------------------------------------
## Coefficients for variable no. 1: v1
## - variable omitted (reference category) -
## ------------------------------------------------------------------
## Coefficients for variable no. 2: v2
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -35.706127   5.156757  -6.924 4.39e-12 ***
## sexwoman      0.034515   0.054809   0.630    0.529    
## birthyr       0.018218   0.002654   6.864 6.72e-12 ***
## ------------------------------------------------------------------
## Coefficients for variable no. 3: v3
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 31.472422   4.975613   6.325 2.53e-10 ***
## sexwoman    -0.144001   0.054029  -2.665  0.00769 ** 
## birthyr     -0.016240   0.002562  -6.338 2.32e-10 ***
## ------------------------------------------------------------------
## Coefficients for variable no. 4: v4
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -10.819906   4.962915  -2.180   0.0292 *
## sexwoman      0.061080   0.054263   1.126   0.2603  
## birthyr       0.005476   0.002555   2.143   0.0321 *
## ------------------------------------------------------------------
## Coefficients for variable no. 5: v5
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -15.239307   4.929080  -3.092 0.001990 ** 
## sexwoman      0.190014   0.054565   3.482 0.000497 ***
## birthyr       0.007686   0.002537   3.029 0.002452 ** 
## ------------------------------------------------------------------
## 
## PRECISION MODEL:
## ------------------------------------------------------------------
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.14238    0.01312    87.1   <2e-16 ***
## ------------------------------------------------------------------
## Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-likelihood: 7438 on 13 df (247 BFGS + 2 NR Iterations)
## AIC: -14850, BIC: -14777
## Number of Observations: 2000
## Links: Logit (Means) and Log (Precision)
## Parametrization: alternative

Time for this code chunk to run: 7.63 seconds

Beta regression is available in the betareg package (Zeileis et al. 2024). To estimate a beta regression for the third fuzzy type, one can use.

library(betareg)
## Estimation of beta regression
breg1 <- betareg(fclust$membership[, 3]~sex+birthyr, data=biofam)
## Displaying results
summary(breg1)
## 
## Call:
## betareg(formula = fclust$membership[, 3] ~ sex + birthyr, data = biofam)
## 
## Quantile residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8110 -0.6738 -0.3801  0.0635  2.7888 
## 
## Coefficients (mean model with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 44.527815   4.881684   9.121  < 2e-16 ***
## sexwoman    -0.184664   0.052269  -3.533 0.000411 ***
## birthyr     -0.023337   0.002514  -9.284  < 2e-16 ***
## 
## Phi coefficients (precision model with identity link):
##       Estimate Std. Error z value Pr(>|z|)    
## (phi)  1.47522    0.04207   35.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Type of estimator: ML (maximum likelihood)
## Log-likelihood:  1163 on 4 Df
## Pseudo R-squared: 0.07331
## Number of iterations: 19 (BFGS) + 4 (Fisher scoring)

Time for this code chunk to run: 1.89 seconds

Property-based clustering

The aim of property-based clustering is to build a sequence typology by identifying well-defined clustering rules that are based on the most relevant properties of the analyzed object. Here, we present the algorithm discussed in (Studer 2018), an extension of the algorithms of Chavent, Lechevallier, and Briant (2008) and Piccarreta and Billari (2007).

Property-based clustering works in two steps. First, sequence properties are extracted. Second, the clustering is created based on the extracted properties. The following table presents the states sequences properties that can be extracted.

Name Description
"state" The state in which an individual is found, at each time position \(t\).
"spell.age" The age at the beginning of each spell of a given type.
"spell.dur" The duration of each of the spells presented above.
"duration" The total time spent in each state.
"pattern" Count of the frequent subsequences of states in the DSS.
"AFpattern" Age at the first occurrence of the above frequent subsequence.
"transition" Count of the frequent subsequence of events in each sequence, where each transition is considered another event.
"AFtransition" Age at the first occurrence of the above frequent subsequence.
"Complexity" Complexity index, number of transitions, turbulence.

Second, the clustering algorithm is run. In R, the two steps are regrouped in the same function seqpropclust. The first argument specifies the state sequence object. Then, the diss argument specifies the distance matrix, maxcluster the maximum number of clusters to consider, and properties the list of properties to consider using the names in the above table.

pclust <- seqpropclust(biofam.seq, diss=diss, maxcluster=5, properties=c("state", "duration"))
##  [>] Extracting 'state' properties...OK (16 properties extracted)
##  [>] Extracting 'duration' properties...OK (9 properties extracted)
##  [>] 25 properties extracted.
pclust
## Dissimilarity tree:
##  Parameters: min.size=20, max.depth=5, R=1, pval=1 
##  Formula: seqdata ~ state.a15 + state.a16 + state.a17 + state.a18 + state.a19 +      state.a20 + state.a21 + state.a22 + state.a23 + state.a24 +      state.a25 + state.a26 + state.a27 + state.a28 + state.a29 +      state.a30 + duration.Parent + duration.Left + duration.Married +      `duration.Left/Married` + duration.Child + `duration.Left/Child` +      `duration.Left/Married/Child` + duration.Divorced + `duration.*` 
##  Global R2: 0.48761 
## 
##  Fitted tree: 
## 
##  |-- Root  (n: 2000 disc: 8.0991) 
##    |-> duration.Left 0.21436 
##              |-- <= 3  (n: 1335 disc: 6.8008) 
##                |-> duration.Left/Married/Child 0.22773 
##                    |-- <= 3  (n: 860 disc: 6.0596) 
##                      |-> state.a26 0.2375 
##                          |-- [ Parent ]    (n: 386 disc: 2.2151)[(Parent,16)] * 
##                          |-- [ Left,Married,Left/Married,Child,Left/Child,Left/Married/Child,Divorced ]  (n: 474 disc: 6.5793) 
##                            |-> duration.Married 0.35945 
##                                |-- <= 2    (n: 309 disc: 4.9034)[(Parent,9)-(Left/Married,7)] * 
##                                |-- > 2    (n: 165 disc: 2.9239)[(Parent,8)-(Married,8)] * 
##                    |-- > 3    (n: 475 disc: 3.7901)[(Parent,8)-(Left/Married,1)-(Left/Married/Child,7)] * 
##              |-- > 3    (n: 665 disc: 5.4841)[(Parent,6)-(Left,9)-(Left/Married,1)] *

Time for this code chunk to run: 1.59 seconds

If GraphViz is installed on the computer, one can use seqtreedisplay for a graphical representation of the results (not presented here).

seqtreedisplay(pclust, type="d", border=NA, showdepth=TRUE)

Time for this code chunk to run: 0 seconds

The cluster quality indices can be computed using the as.clustrange() function as for hierarchical clustering algorithms.

pclustqual <- as.clustrange(pclust, diss=diss, ncluster=5)
pclustqual
##           PBC   HG HGSD  ASW ASWw     CH   R2    CHsq R2sq   HC
## cluster2 0.50 0.61 0.59 0.36 0.36 545.15 0.21 1108.58 0.36 0.19
## cluster3 0.54 0.69 0.67 0.35 0.35 518.98 0.34 1100.68 0.52 0.16
## cluster4 0.54 0.73 0.71 0.36 0.36 478.65 0.42 1008.63 0.60 0.15
## cluster5 0.58 0.80 0.78 0.41 0.41 474.63 0.49 1122.81 0.69 0.12

Time for this code chunk to run: 0.69 seconds

The clustering solution can then be accessed and plotted using the same procedure as for any cluster algorithms, e.g.:

seqdplot(biofam.seq, pclustqual$clustering$cluster4)

Time for this code chunk to run: 0.21 seconds

References

Chavent, Marie, Yves Lechevallier, and Olivier Briant. 2008. “DIVCLUS-t: A Monothetic Divisive Hierarchical Clustering Method.” Computational Statistics & Data Analysis 52 (2): 687–701. https://doi.org/10.1016/j.csda.2007.03.013.
Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller, and Matthias Studer. 2011. “Analyzing and Visualizing State Sequences in R with TraMineR.” Journal of Statistical Software 40 (4): 1–37. https://doi.org/https://doi.org/10.18637/jss.v040.i04.
Kaufman, L., and P. J. Rousseeuw. 1990. Finding Groups in Data. An Introduction to Cluster Analysis. New York: Wiley.
Maechler, Martin, Peter Rousseeuw, Anja Struyf, and Mia Hubert. 2024. Cluster: "Finding Groups in Data": Cluster Analysis Extended Rousseeuw Et Al. https://svn.r-project.org/R-packages/trunk/cluster/.
Maier, Marco Johannes. 2021. DirichletReg: Dirichlet Regression. https://github.com/maiermarco/DirichletReg.
Piccarreta, Raffaella, and Francesco C. Billari. 2007. “Clustering Work and Family Trajectories by Using a Divisive Algorithm.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 170 (4): 1061–78. https://doi.org/10.1111/j.1467-985X.2007.00495.x.
SHP Group. 2024. “Living in Switzerland Waves 1-24 (Including a Long File) + Covid 19 Data.” FORS data service. https://doi.org/10.48573/58NW-6A50.
Studer, Matthias. 2013. “WeightedCluster Library Manual: A Practical Guide to Creating Typologies of Trajectories in the Social Sciences with R.” {LIVES} Working Papers 24. Switzerland: NCCR LIVES. https://doi.org/10.12682/lives.2296-1658.2013.24.
———. 2018. “Divisive Property-Based and Fuzzy Clustering for Sequence Analysis.” In Sequence Analysis and Related Approaches: Innovative Methods and Applications, edited by Gilbert Ritschard and Matthias Studer, 10:223–39. Life Course Research and Social Policies. Springer. https://doi.org/10.1007/978-3-319-95420-2_13.
Zeileis, Achim, Francisco Cribari-Neto, Bettina Grün, and Ioannis Kosmidis. 2024. Betareg: Beta Regression. https://topmodels.R-Forge.R-project.org/betareg/.