Welcome
SISPCA (Supervised Independent Subspace Principal Component Analysis) is a Python package designed to learn linear representations capturing variations associated with factors of interest in high-dimensional data. It extends Principal Component Analysis (PCA) to multiple subspaces and uses the Hilbert-Schmidt Independence Criterion (HSIC) both to incorporate supervision and to encourage subspace disentanglement. The model is implemented in PyTorch and uses the Lightning framework for training.
For more theoretical connections and applications, please refer to our paper Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis.
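For intuition about HSIC, the following is a minimal NumPy sketch of the standard biased empirical estimator with linear kernels. It is a generic textbook form, not the package's internal implementation; the function name `hsic_linear` and the toy data are assumptions made for illustration.

```python
import numpy as np

def hsic_linear(x, y):
    """Biased empirical HSIC with linear kernels: tr(K_x H K_y H) / (n - 1)^2."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k_x = x @ x.T                        # linear kernel on x
    k_y = y @ y.T                        # linear kernel on y
    return np.trace(k_x @ h @ k_y @ h) / (n - 1) ** 2

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 3))

# A fixed linear map of z, so the result is strongly dependent on z
w = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
dependent = z @ w
# Fresh noise drawn independently of z
independent = rng.normal(size=(200, 2))

hsic_dep = hsic_linear(z, dependent)
hsic_ind = hsic_linear(z, independent)
print(f"HSIC(z, dependent)   = {hsic_dep:.4f}")
print(f"HSIC(z, independent) = {hsic_ind:.4f}")
```

A large value indicates dependence between the two sets of variables, and a value near zero indicates that no (linear) dependence was detected.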
Installation
The package can be installed via pip:
# from PyPI (stable version)
$ pip install sispca
# or from github (latest version)
$ pip install git+https://github.com/JiayuSuPKU/sispca.git#egg=sispca
The following dependencies will be installed automatically:
torch # may need to install with specific python version
lightning
scipy
scikit-learn
In addition to the linear PCA models, we also re-implemented non-linear VAE-based counterparts in sispca.hcv_vi following the HCV paper (Lopez et al. 2018) under the latest scvi-tools framework (version 1.2.0). To run those models, you also need to install the following dependencies:
$ pip install scanpy
$ pip install scvi-tools
Please refer to the scvi-tools documentation for installation instructions.
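To check which of the packages listed above are present in the current environment, a small helper along these lines may be useful. It relies only on the standard-library `importlib.metadata` module; the helper name `installed_versions` is hypothetical and not part of sispca.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

required = ["sispca", "torch", "lightning", "scipy", "scikit-learn"]
optional = ["scanpy", "scvi-tools"]  # only needed for sispca.hcv_vi

report = installed_versions(required + optional)
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```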
Basic usage
import numpy as np
import torch
from sispca import Supervision, SISPCADataset, SISPCA
# simulate random inputs
x = torch.randn(100, 20)
y_cont = torch.randn(100, 5) # continuous target
y_group = np.random.choice(['A', 'B', 'C'], 100) # categorical target
# simulate custom kernel K_y
# in general, K_y should be either sparse, i.e. a graph Laplacian kernel, or low-rank, i.e. K_y = L @ L.T
L = torch.randn(100, 20)
K_y = L @ L.T # (n_sample, n_sample)
# create a dataset with supervision
sdata = SISPCADataset(
    data = x.float(), # (n_sample, n_feature)
    target_supervision_list = [
        Supervision(target_data=y_cont, target_type='continuous'),
        Supervision(target_data=y_group, target_type='categorical'),
        # Supervision(target_data=None, target_type='custom', target_kernel_K = K_y)
        Supervision(target_data=None, target_type='custom', target_kernel_Q = L) # equivalent to the above
    ]
)
# fit the sisPCA model
sispca = SISPCA(
    sdata,
    n_latent_sub=[3, 3, 3, 3], # the last subspace will be unsupervised
    lambda_contrast=10,
    kernel_subspace='linear',
    solver='eig'
)
sispca.fit(batch_size = -1, max_epochs = 100, early_stopping_patience = 5)
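The comment in the example above notes that a custom kernel should be sparse or low-rank, i.e. K_y = L @ L.T. As one illustration of a low-rank kernel, a delta kernel over category labels can be built from a one-hot factor, as sketched below in NumPy. This is a generic construction; whether `target_type='categorical'` builds its kernel in exactly this way inside the package is an assumption.

```python
import numpy as np

# Toy labels; in the example above these would be the entries of y_group
y_group = np.array(['A', 'B', 'C', 'A', 'B'])

# Encode labels as integers, then as one-hot rows
categories, codes = np.unique(y_group, return_inverse=True)
one_hot = np.eye(len(categories))[codes]  # (n_sample, n_category)

# Delta kernel: 1 where two samples share a label, 0 otherwise
k_delta = one_hot @ one_hot.T             # (n_sample, n_sample)

print(k_delta)
print("rank:", np.linalg.matrix_rank(k_delta))
```

The factor `one_hot` has only n_category columns, so the kernel never needs to be stored as a full (n_sample, n_sample) matrix.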
Note
The computational bottleneck of sispca-linear is the multiple rounds of eigen-decomposition (numpy.linalg.eigh) of the matrix of size (n_feature, n_feature), which scales as O(n_feature^3). For large feature sets, consider reducing the number of features.
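One simple way to reduce the number of features before fitting is to keep only the highest-variance columns. The sketch below uses NumPy; variance filtering is a common generic heuristic rather than a recommendation from the package, and the helper name `select_top_variance_features` is hypothetical.

```python
import numpy as np

def select_top_variance_features(x, n_keep):
    """Keep the n_keep columns of x with the largest variance."""
    variances = x.var(axis=0)
    keep = np.sort(np.argsort(variances)[-n_keep:])  # preserve original column order
    return x[:, keep], keep

rng = np.random.default_rng(0)
n_sample, n_feature, n_keep = 100, 2000, 500

# Toy data whose columns have different scales
scales = rng.uniform(0.1, 2.0, size=n_feature)
x = rng.normal(size=(n_sample, n_feature)) * scales

x_small, kept_idx = select_top_variance_features(x, n_keep)

# Under cubic scaling, 4x fewer features means roughly 64x less work per eigendecomposition
speedup_estimate = (n_feature / n_keep) ** 3
print(x_small.shape, speedup_estimate)
```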
Note
The memory bottleneck is the storage of the kernel matrix of size (n_sample, n_sample). In most cases, we instead store a low-rank factor Q of shape (n_sample, rank) from the decomposition K = Q @ Q.T, so memory scales as O(n_sample) for a fixed rank. To further reduce memory usage, consider mini-batch training by setting batch_size in the fit method. Note that this is an experimental feature and may not converge in some cases.
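To make the memory argument in the note above concrete, the following NumPy sketch compares a full kernel with its low-rank factor and shows that a linear-kernel HSIC-style trace can be computed from the factor alone. The variable names and sizes are illustrative assumptions, and this is not the package's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sample, rank, n_latent = 500, 20, 3

q = rng.normal(size=(n_sample, rank))      # low-rank factor, K = Q @ Q.T
z = rng.normal(size=(n_sample, n_latent))  # stand-in for a latent subspace

# Centering the factors is equivalent to applying H @ K @ H to the full kernels
q_c = q - q.mean(axis=0)
z_c = z - z.mean(axis=0)

# Full-kernel route: materializes two (n_sample, n_sample) matrices
k_full = q_c @ q_c.T
trace_full = np.trace((z_c @ z_c.T) @ k_full)

# Low-rank route: only an (n_latent, rank) matrix is ever formed
trace_low_rank = np.sum((z_c.T @ q_c) ** 2)

bytes_full = n_sample * n_sample * q.itemsize  # storage for the full kernel
bytes_low_rank = q.nbytes                      # storage for the factor only
print(trace_full, trace_low_rank)
print(bytes_full, bytes_low_rank)
```

Both routes give the same trace up to floating-point error, while the factor needs n_sample * rank entries instead of n_sample ** 2.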
Tutorials
See the Tutorial Gallery for examples on how to use the package.
Citation
If you find this work useful, please consider citing our paper:
@misc{su2024disentangling,
title={Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis},
author={Jiayu Su and David A. Knowles and Raul Rabadan},
year={2024},
eprint={2410.23595},
archivePrefix={arXiv},
primaryClass={stat.ML},
url={https://arxiv.org/abs/2410.23595},
}