Title: | Simplex-Structured Matrix Factorisation for Metabolomics Analysis |
---|---|
Description: | Provides a framework to perform soft clustering using simplex-structured matrix factorisation (SSMF). The package contains a set of functions for determining the optimal number of prototypes, the optimal algorithmic parameters, the estimation confidence intervals and the diversity of clusters. Abdolali, Maryam & Gillis, Nicolas (2020) <doi:10.1137/20M1354982>. |
Authors: | Wenxuan Liu [aut, cre], Thomas Brendan Murphy [aut], Lorraine Brennan [aut] |
Maintainer: | Wenxuan Liu <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-03-06 05:30:38 UTC |
Source: | https://github.com/wenxuanliu1996/metabolssmf |
Bootstrap resampling approach to estimate the confidence intervals for the cluster prototypes.
bootstrap(data, k, H, mtimes = 50, lr = 0.01, ncore = 2)
data | Data matrix or data frame. |
k | The number of prototypes/clusters. |
H | Matrix, input $H$ (soft membership) matrix, for example the H matrix returned by ssmf(). |
mtimes | Integer, number of bootstrap samples. The default is 50. |
lr | Optimisation learning rate in ssmf(). |
ncore | The number of cores to use for parallel execution. |
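Create bootstrap samples of size $n$ by sampling from the data set with replacement, and repeat this step $M$ times (the argument mtimes).
The $m$-th bootstrap sample is denoted as
$$X^{*m} = (x_1^{*m}, \ldots, x_n^{*m}),$$
where each $x_i^{*m}$ is a random sample (with replacement) from the data set.
Then apply the SSMF algorithm to each bootstrap sample and calculate the $m$-th bootstrap replicate of the prototype matrix, denoted as $W^{*m}$.
The estimated standard deviation of the $M$ bootstrap replicates can be calculated element-wise by
$$\widehat{\mathrm{sd}}(W^{*}) = \sqrt{\frac{1}{M-1} \sum_{m=1}^{M} \left(W^{*m} - \overline{W}^{*}\right)^2},$$
where $\overline{W}^{*} = \frac{1}{M} \sum_{m=1}^{M} W^{*m}$. Therefore, the 95% CIs for the prototypes can be calculated by
$$\left(\overline{W}^{*} - t_{(0.025,\,M-1)}\,\widehat{\mathrm{sd}}(W^{*}),\ \ \overline{W}^{*} + t_{(0.025,\,M-1)}\,\widehat{\mathrm{sd}}(W^{*})\right),$$
where $t_{(0.025,\,M-1)}$ is the upper 2.5% quantile of the Student $t$ distribution with $M-1$ degrees of freedom.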
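The interval construction described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the replicates here are simulated by adding noise to a made-up matrix (W_true, W_reps and the noise level are hypothetical), whereas in practice each replicate would come from refitting SSMF on a resampled data set.

```r
# Sketch: t-based bootstrap confidence intervals for a k x p prototype matrix.
set.seed(1)
M <- 50                      # number of bootstrap replicates (role of `mtimes`)
k <- 2
p <- 3
W_true <- matrix(1:6, nrow = k, ncol = p)   # hypothetical prototype matrix

# Stand-in for the M bootstrap replicates; result is a k x p x M array.
W_reps <- replicate(M, W_true + matrix(rnorm(k * p, sd = 0.1), k, p))

W_bar  <- apply(W_reps, c(1, 2), mean)  # element-wise mean over replicates
W_sd   <- apply(W_reps, c(1, 2), sd)    # element-wise sd, denominator M - 1
t_crit <- qt(0.975, df = M - 1)         # upper 2.5% point of Student t

lower <- W_bar - t_crit * W_sd
upper <- W_bar + t_crit * W_sd
```

The three matrices W_bar, lower and upper play the same roles as the W.est, lower and upper components described in the return value.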
W.est
The prototype matrix ($W$) estimated by bootstrap.
lower
Lower bound of confidence intervals.
upper
Upper bound of confidence intervals.
Wenxuan Liu
Stine, R. (1989). An Introduction to Bootstrap Methods: Examples and Ideas. Sociological Methods & Research, 18(2-3), 243-291. <doi:10.1177/0049124189018002003>
# example code
data <- SimulatedDataset
k <- 4
fit <- ssmf(data = data, k = k)
bootstrap(data = data, k = k, H = fit$H)
Calculate the Shannon diversity index of the memberships of an observation. The base of the logarithm is 2.
diversity(x, two.power = FALSE)
x | A membership vector. |
two.power | Logical, whether to return 2 raised to the power of the diversity index instead of the index itself. |
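Given the membership vector $h_i = (h_{i1}, \ldots, h_{ik})$ of the $i$-th observation, the Shannon diversity index is defined as
$$E(h_i) = -\sum_{r=1}^{k} h_{ir} \log_2(h_{ir}).$$
Specifically, in the case of $h_{ir} = 0$, the value of $h_{ir} \log_2(h_{ir})$ is taken to be 0.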
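As a rough illustration of the definition above, the index can be computed in a few lines of base R. The function name shannon_diversity is hypothetical and this sketch is not the package's diversity() source; it only mirrors the formula and the zero-membership convention.

```r
# Sketch: Shannon diversity (base 2) of a single membership vector.
shannon_diversity <- function(h, two.power = FALSE) {
  # Terms with zero membership contribute 0 by convention.
  terms <- ifelse(h > 0, h * log2(h), 0)
  E <- -sum(terms)
  if (two.power) 2^E else E
}

shannon_diversity(c(0.25, 0.25, 0.25, 0.25))                    # 2
shannon_diversity(c(0.25, 0.25, 0.25, 0.25), two.power = TRUE)  # 4
shannon_diversity(c(1, 0, 0, 0))                                # 0
```

An observation spread evenly over four clusters reaches the maximum index $\log_2 4 = 2$, while an observation that belongs entirely to one cluster has index 0.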
A numeric value: the Shannon diversity index, or 2 raised to the power of the index when two.power = TRUE.
Wenxuan Liu
# Memberships vector
membership1 <- c(0.1, 0.2, 0.3, 0.4)
diversity(membership1)
diversity(membership1, two.power = TRUE)

# Memberships matrix
membership2 <- matrix(c(0.1, 0.2, 0.3, 0.4,
                        0.3, 0.2, 0.4, 0.1,
                        0.2, 0.3, 0.1, 0.4),
                      nrow = 3, ncol = 4, byrow = TRUE)
E <- rep(NA, nrow(membership2))
for (i in 1:nrow(membership2)) {
  E[i] <- diversity(membership2[i, ])
}
E
A list of the results for the bootstrap example.
fit_boot
A list of bootstrap results, including the estimated prototype matrix ($W$),
the lower bounds of the confidence intervals and the upper bounds of the confidence intervals.
A list of the results for the gap statistic example over a range of values of $k$.
fit_gap
A list of gap statistic results, including the gap value vector, the optimal number of prototypes/clusters and the standard error vector.
A list of the results for the SSMF example over a range of values of $k$.
fit_SSMF
A list with 10 items. Each item is a result of SSMF,
containing the estimated prototype matrix ($W$),
the estimated membership matrix ($H$) and
the residual sum of squares (SSE).
Estimating the number of prototypes/clusters in a data set using the gap statistic.
gap( data, rss, meth = c("kmeans", "uniform", "dirichlet", "nmf"), itr = 50, lr = 0.01, ncore = 2 )
data | Data matrix or data frame. |
rss | Numeric vector, residual sums of squares from the ssmf model using the numbers of clusters $1, 2, \ldots, k$. |
meth | Character, specification of the method used to initialise the $W$ and $H$ matrices; see the method argument of init(). |
itr | Integer, number of Monte Carlo samples. |
lr | Optimisation learning rate in ssmf(). |
ncore | The number of cores to use for parallel execution. |
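This gap statistic selects the biggest difference between the original residual sum of squares (RSS) and the RSS under an appropriate null reference distribution of the data, which is defined to be
$$\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(\mathrm{RSS}^{*}_{kb}) - \log(\mathrm{RSS}_{k}),$$
where $B$ is the number of samples from the reference distribution;
$\mathrm{RSS}^{*}_{kb}$ is the residual sum of squares of the $b$-th sample from the reference distribution fitted in the SSMF model using $k$ clusters;
$\mathrm{RSS}_{k}$ is the residual sum of squares for the original data $X$ fitted with the model using the same $k$.
The estimated gap suggests the number of prototypes/clusters ($\hat{k}$) using
$$\hat{k} = \text{smallest } k \text{ such that } \mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1},$$
where $s_{k+1}$ is the standard error, defined as
$$s_{k+1} = \mathrm{sd}_{k+1} \sqrt{1 + \frac{1}{B}},$$
and $\mathrm{sd}_{k}$ is the standard deviation:
$$\mathrm{sd}_{k} = \sqrt{\frac{1}{B} \sum_{b=1}^{B} \left[\log(\mathrm{RSS}^{*}_{kb}) - \frac{1}{B} \sum_{b'=1}^{B} \log(\mathrm{RSS}^{*}_{kb'})\right]^2}.$$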
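The selection rule described above can be sketched in base R once the RSS values are available. This is a minimal sketch under stated assumptions: the vector rss and the matrix rss_ref are made-up numbers standing in for the RSS of the original data and of the reference samples, and the variable names are hypothetical rather than taken from the package.

```r
# Sketch: gap values, standard errors and the selected number of clusters.
set.seed(1)
K <- 6                                   # largest number of clusters tried
B <- 50                                  # reference samples (role of `itr`)
rss <- c(100, 60, 35, 20, 18, 17)        # made-up RSS_k for the original data

# Made-up reference RSS values: B x K matrix, column k holds RSS*_{kb}.
rss_ref <- sapply(1:K, function(k) (120 / sqrt(k)) * exp(rnorm(B, sd = 0.05)))

log_ref <- log(rss_ref)
gap  <- colMeans(log_ref) - log(rss)     # Gap(k) for k = 1, ..., K

# Standard deviation with denominator B, then the standard error s_k.
sd_k <- sqrt(colMeans(sweep(log_ref, 2, colMeans(log_ref))^2))
s_k  <- sd_k * sqrt(1 + 1 / B)

# Smallest k with Gap(k) >= Gap(k + 1) - s_{k + 1}; NA if no k qualifies.
ok <- gap[-K] >= gap[-1] - s_k[-1]
optimal_k <- which(ok)[1]
```

Because the rule compares each $k$ with $k + 1$, only $k = 1, \ldots, K - 1$ can be selected in this sketch.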
gap
Gap value vector.
optimal.k
The optimal number of prototypes/clusters.
standard.error
Standard error vector.
Wenxuan Liu
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the Number of Clusters in a Data Set via the Gap Statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(2), 411–423. <doi:10.1111/1467-9868.00293>
# example code
data <- SimulatedDataset
k <- 6
rss <- rep(NA, k)
for (i in 1:k) {
  rss[i] <- ssmf(data = data, k = i)$SSE
}
gap(data = data, rss = rss)
Initialise the membership matrix $H$ or prototype matrix $W$.
This function initialises the $H$ matrix or the $W$ matrix to start the SSMF model.
It is often used in conjunction with the function ssmf(), but it can also be run separately from ssmf().
It returns the simplex-structured soft membership matrix $H$ and the prototype matrix $W$.
init( data = NULL, k = NULL, method = c("kmeans", "uniform", "dirichlet", "nmf") )
data | Data matrix or data frame. |
k | The number of prototypes/clusters. |
method | Character: 'kmeans', 'uniform', 'dirichlet' or 'nmf'. If more than one method is given, the first method in the vector is used. |
'kmeans': create the $W$ matrix using the centres of the kmeans output; create the $H$ matrix by converting the classification into a binary matrix.
'uniform': create the $H$ matrix by sampling the values from a uniform distribution and making the rows of the matrix lie in the unit simplex; group the observations by their maximum memberships and create the $W$ matrix by combining the mean vectors of the groups.
'dirichlet': create the $H$ matrix by sampling the values from a Dirichlet distribution; group the observations by their maximum memberships and create the $W$ matrix by combining the mean vectors of the groups.
'nmf': create the $W$ matrix using the matrix of basis components from the NMF model; the coefficient matrix is acquired from the NMF model, then $H$ is created by making the rows of the coefficient matrix lie in the unit simplex.
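To make the 'uniform' option more concrete, the steps it describes can be sketched in base R. This is a minimal sketch on made-up toy data, not the source of init(): the objects X, H, W and group are hypothetical names, and the sketch does not handle the case where a group receives no observations (that group's row of W would be NaN).

```r
# Sketch: 'uniform'-style initialisation of H and W.
set.seed(1)
n <- 20
p <- 5
k <- 3
X <- matrix(runif(n * p), n, p)        # toy data, n observations x p variables

# Sample memberships uniformly, then rescale each row to sum to 1
# so that every row lies in the unit simplex.
H <- matrix(runif(n * k), n, k)
H <- H / rowSums(H)

# Group each observation by its largest membership.
group <- max.col(H, ties.method = "first")

# Row r of W is the mean vector of the observations assigned to group r.
W <- t(sapply(1:k, function(r) colMeans(X[group == r, , drop = FALSE])))
```

The 'dirichlet' option follows the same pattern, with the rows of the membership matrix drawn from a Dirichlet distribution instead of being rescaled uniform draws.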
Initialised $H$ and $W$ matrices.
Wenxuan Liu
# example code
init(data = SimulatedDataset, k = 4, method = 'kmeans')
Soft adjusted Rand index, a soft agreement measure for class partitions incorporating assignment probabilities.
sARI(partition1, partition2)
partition1 | Numeric matrix/data frame of the probabilities of assignment of observations in partition 1 (membership matrix). |
partition2 | Numeric matrix/data frame of the probabilities of assignment of observations in partition 2 (membership matrix). |
Soft adjusted Rand index.
Wenxuan Liu
Flynt, A., Dean, N., & Nugent, R. (2019). sARI: a soft agreement measure for class partitions incorporating assignment probabilities. Advances in Data Analysis and Classification, 13, 303–323. <doi:10.1007/s11634-018-0346-x>
A simulated metabolomic data set containing 138 variables for 177 individuals.
data(SimulatedDataset)
A data frame with 177 rows and 138 columns.
A simulated membership matrix containing 4 cluster memberships for 177 individuals.
data(SimulatedMemberships)
A data frame with 177 rows and 4 columns.
A simulated prototype matrix containing 4 cluster prototypes.
data(SimulatedPrototypes)
A data frame with 4 rows and 138 columns.
This function implements SSMF on a data matrix or data frame.
ssmf( data, k, H = NULL, W = NULL, meth = c("kmeans", "uniform", "dirichlet", "nmf"), lr = 0.01, nruns = 50 )
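data | Data matrix or data frame. |
k | The number of prototypes/clusters. |
H | Matrix, user input $H$ (soft membership) matrix from which to start the algorithm. |
W | Matrix, user input $W$ (prototype) matrix from which to start the algorithm. |
meth | Specification of the method used to initialise the $W$ and $H$ matrices; see the method argument of init(). |
lr | Optimisation learning rate. |
nruns | The maximum number of times the algorithm is run. |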
Let $X \in \mathbb{R}^{n \times p}$ be the data set with $n$ observations and $p$ variables.
Given an integer $k \ll \min(n, p)$,
the data set is clustered by simplex-structured matrix factorisation (SSMF), which performs soft clustering
and partitions the observations into $k$ fuzzy clusters such that the sum of squares from observations to the
assigned cluster prototypes is minimised.
SSMF finds $H_{n \times k}$ and $W_{k \times p}$ such that
$$X \approx HW.$$
A cluster prototype refers to a vector that represents the characteristics of a particular cluster,
denoted by $w_r \in \mathbb{R}^{p}$, where $r$ indexes the $r$-th cluster.
A cluster membership vector $h_i \in \mathbb{R}^{k}$ describes the proportions of the cluster prototypes
in the $i$-th observation.
$W$ is the prototype matrix, where each row is a cluster prototype, and
$H$ is the soft membership matrix, where each row gives the soft cluster memberships of one observation.
The problem of finding the approximate matrix factorisation is solved by minimising the residual sum of squares (RSS), that is
$$\mathrm{RSS} = \| X - HW \|^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} \left\{ X_{ij} - (HW)_{ij} \right\}^2,$$
such that $\sum_{r=1}^{k} h_{ir} = 1$ and $h_{ir} \geq 0$.
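The objective and constraints described above can be checked numerically in base R. The sketch below only evaluates the RSS for a given pair of factors and verifies the simplex constraints on the memberships; it does not perform the optimisation carried out by ssmf(). The toy matrices X, H and W are made up for illustration.

```r
# Sketch: evaluate the SSMF objective for given factors H (n x k) and W (k x p).
set.seed(1)
n <- 10
p <- 4
k <- 2

# Toy factors: rows of H are rescaled to lie in the unit simplex.
H <- matrix(runif(n * k), n, k)
H <- H / rowSums(H)
W <- matrix(rnorm(k * p), k, p)

# Toy data generated as HW plus a small amount of noise.
X <- H %*% W + matrix(rnorm(n * p, sd = 0.01), n, p)

# Residual sum of squares between the data and its factorisation.
rss <- sum((X - H %*% W)^2)

# Simplex constraints: non-negative memberships that sum to 1 in each row.
on_simplex <- all(H >= 0) && isTRUE(all.equal(rowSums(H), rep(1, n)))
```

Each row of H %*% W is a convex combination of the rows of W, which is what ties every observation to a mixture of the cluster prototypes.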
W
The optimised matrix, containing the values of prototypes.
H
The optimised matrix, containing the values of soft memberships.
SSE
The residual sum of squares.
Wenxuan Liu
Abdolali, Maryam & Gillis, Nicolas. (2020). Simplex-Structured Matrix Factorization: Sparsity-based Identifiability and Provably Correct Algorithms. <doi:10.1137/20M1354982>
library(MetabolSSMF)

# Initialisation by user
data <- SimulatedDataset
k <- 4

## Initialised by kmeans
fit.km <- kmeans(data, centers = k)
H <- mclust::unmap(fit.km$cluster)
W <- fit.km$centers
fit1 <- ssmf(data, k = k, H = H) # start the algorithm from H
fit2 <- ssmf(data, k = k, W = W) # start the algorithm from W

# Initialisation inside the function
fit3 <- ssmf(data, k = 4, meth = 'dirichlet')
fit4 <- ssmf(data, k = 4)