Welcome

SISPCA (Supervised Independent Subspace Principal Component Analysis) is a Python package designed to learn linear representations capturing variations associated with factors of interest in high-dimensional data. It extends Principal Component Analysis (PCA) to multiple subspaces and encourages subspace disentanglement by maximizing the Hilbert-Schmidt Independence Criterion (HSIC). The model is implemented in PyTorch and uses the Lightning framework for training.
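As a rough illustration of the quantity involved, the sketch below computes the standard biased empirical HSIC estimate with linear kernels, tr(K_x H K_y H) / (n - 1)^2, in plain NumPy. It is not taken from the package: the helper name `linear_hsic` and the simulated data are assumptions made for this example only.

```python
import numpy as np

def linear_hsic(x, y):
    """Biased empirical HSIC with linear kernels (hypothetical helper)."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix H
    k_x = x @ x.T                         # linear kernel on x, (n, n)
    k_y = y @ y.T                         # linear kernel on y, (n, n)
    return np.trace(k_x @ h @ k_y @ h) / (n - 1) ** 2

rng = np.random.default_rng(0)
z = rng.standard_normal((200, 3))
y_dep = z[:, :2] + 0.1 * rng.standard_normal((200, 2))  # depends on z
y_ind = rng.standard_normal((200, 2))                   # independent of z

hsic_dep = linear_hsic(z, y_dep)
hsic_ind = linear_hsic(z, y_ind)
print(hsic_dep, hsic_ind)
```

With linear kernels the estimate equals the squared Frobenius norm of the sample cross-covariance, so it is large when the two variables are linearly related and close to zero when they are not.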

Overview

For more theoretical connections and applications, please refer to our paper Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis.

Installation

The package can be installed via pip:

# from PyPI (stable version)
$ pip install sispca

# or from GitHub (latest version)
$ pip install git+https://github.com/JiayuSuPKU/sispca.git#egg=sispca

The following dependencies will be installed automatically:

torch # may need to install with specific python version
lightning
scipy
scikit-learn

In addition to the linear PCA models, we also re-implemented non-linear VAE-based counterparts in sispca.hcv_vi following the HCV paper (Lopez et al. 2018) under the latest scvi-tools framework (version 1.2.0). To run those models, you also need to install the following dependencies:

$ pip install scanpy
$ pip install scvi-tools

Please refer to the scvi-tools documentation for installation instructions.

Basic usage

import numpy as np
import torch
from sispca import Supervision, SISPCADataset, SISPCA

# simulate random inputs
x = torch.randn(100, 20)
y_cont = torch.randn(100, 5) # continuous target
y_group = np.random.choice(['A', 'B', 'C'], 100) # categorical target

# simulate custom kernel K_y
# in general, K_y should be either sparse, i.e. a graph Laplacian kernel, or low-rank, i.e. K_y = L @ L.T
L = torch.randn(100, 20)
K_y = L @ L.T # (n_sample, n_sample)

# create a dataset with supervision
sdata = SISPCADataset(
    data = x.float(), # (n_sample, n_feature)
    target_supervision_list = [
        Supervision(target_data=y_cont, target_type='continuous'),
        Supervision(target_data=y_group, target_type='categorical'),
        # Supervision(target_data=None, target_type='custom', target_kernel_K = K_y)
        Supervision(target_data=None, target_type='custom', target_kernel_Q = L) # equivalent to the above
    ]
)

# fit the sisPCA model
sispca = SISPCA(
    sdata,
    n_latent_sub=[3, 3, 3, 3], # the last subspace will be unsupervised
    lambda_contrast=10,
    kernel_subspace='linear',
    solver='eig'
)
sispca.fit(batch_size = -1, max_epochs = 100, early_stopping_patience = 5)

Note

The computational bottleneck of sispca-linear is the multiple rounds of eigen-decomposition (numpy.linalg.eigh) of the matrix of size (n_feature, n_feature), which scales as O(n_feature^3). For large feature sets, consider reducing the number of features.
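To make the cost concrete, here is a minimal sketch of the eigen-decomposition step on an (n_feature, n_feature) matrix. It uses the sample covariance as a stand-in, which gives plain PCA; the matrix that sispca actually decomposes is not shown here, and all variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sample, n_feature, n_latent = 100, 20, 3

x = rng.standard_normal((n_sample, n_feature))
x = x - x.mean(axis=0)                 # center each feature

# Symmetric (n_feature, n_feature) matrix; here the sample covariance
c = x.T @ x / (n_sample - 1)

# eigh returns eigenvalues in ascending order; cost is O(n_feature^3)
evals, evecs = np.linalg.eigh(c)
top_evals = evals[::-1][:n_latent]     # largest eigenvalues first
u = evecs[:, ::-1][:, :n_latent]       # matching eigenvectors, (n_feature, n_latent)

z = x @ u                              # projected data, (n_sample, n_latent)
```

Because the cost depends on n_feature rather than n_sample, halving the number of features cuts the decomposition time by roughly a factor of eight.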

Note

The memory bottleneck is the storage of the kernel matrix of size (n_sample, n_sample), which scales as O(n_sample^2). In most cases, we instead store a low-rank factor Q of shape (n_sample, rank) from the decomposition K = Q @ Q.T, which scales as O(n_sample) for a fixed rank. To further reduce memory usage, consider mini-batch training by setting batch_size in the fit method. Note that this is an experimental feature and may not converge in some cases.
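The following sketch shows why a low-rank factor is enough for trace-type quantities: with Q of shape (n_sample, rank), as with L in the usage example, tr(Z.T @ K @ Z) can be computed from Q.T @ Z without ever forming the full kernel. This illustrates the general identity only, not the package's internal code, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sample, rank, n_latent = 500, 20, 3

q = rng.standard_normal((n_sample, rank))      # low-rank factor of the kernel
z = rng.standard_normal((n_sample, n_latent))  # some latent representation

# Full-kernel route: materializes n_sample * n_sample entries
k = q @ q.T
full = np.trace(z.T @ k @ z)

# Low-rank route: the largest intermediate is (rank, n_latent)
low_rank = np.sum((q.T @ z) ** 2)

print(q.size, k.size)  # number of stored entries in each representation
```

Both routes give the same value up to floating-point error, while the low-rank route stores n_sample * rank entries instead of n_sample^2.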

Tutorials

See the Tutorial Gallery for examples of how to use the package.

Citation

If you find this work useful, please consider citing our paper:

@misc{su2024disentangling,
  title={Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis},
  author={Jiayu Su and David A. Knowles and Raul Rabadan},
  year={2024},
  eprint={2410.23595},
  archivePrefix={arXiv},
  primaryClass={stat.ML},
  url={https://arxiv.org/abs/2410.23595},
}