Feature selection using sisPCA on the Breast Cancer Wisconsin dataset

In this tutorial, we will showcase how sisPCA can be used as a plug-in replacement of PCA to get better disentangled data representation.

[1]:

import torch
from lightning.pytorch import seed_everything
torch.set_default_dtype(torch.float32)

import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score
from scipy.linalg import subspace_angles
from plotnine import *

from sispca.utils import normalize_col, delta_kernel, hsic_gaussian, hsic_linear
from sispca import Supervision, SISPCADataset, SISPCA

Load and preprocess the breast cancer dataset

The Kaggle Breast Cancer Wisconsin (Diagnostic) Data Set contains 569 samples with 30 real-valued summary features calculated from imaging data of breast mass. Here our goals include:

Learn a representation for predicting diagnosis status (‘Malignant’ or ‘Benign’, not used during training).
Understand how original features contribute to data variability, the learned representation, and diagnosis potential.

[2]:

# kaggle datasets download -d uciml/breast-cancer-wisconsin-data
data_raw = pd.read_csv('./data/breast_cancer.csv')
data_raw.head()

[2]:

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	...	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst	Unnamed: 32
0	842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	NaN
1	842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	NaN
2	84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	NaN
3	84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300	NaN
4	84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678	NaN

5 rows × 33 columns

[3]:

# remove the id and diagnosis columns
data = data_raw.drop(['Unnamed: 32','id','diagnosis'], axis = 1)

Each one of the 30 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image, such as radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter and area. Note that the feature set is somewhat redundant. For example, ‘perimeter’ features are highly correlated with ‘radius’ features.

[4]:

torch.corrcoef(
    normalize_col(torch.from_numpy(
        data_raw[['radius_mean', 'perimeter_mean']].to_numpy()
        ), center=True, scale=True).T
)[0][1].item()

[4]:

0.997855281493811

[5]:

torch.corrcoef(
    normalize_col(torch.from_numpy(
        data_raw[['radius_mean', 'perimeter_worst']].to_numpy()
        ), center=True, scale=True).T
)[0][1].item()

[5]:

0.9651365139559879

Principal component analysis (PCA)

Let us first try the unsupervised PCA to select features that best capture the data. This can be done by running torch.linalg.svd, or equivalently, running SISPCA with a single unsupervised subspace.

[6]:

# first normalize the data
x = torch.from_numpy(data.to_numpy()) # (569, 30)
x = normalize_col(x, center=True, scale = True)

# then solve PCA by SVD
u, s, vh = torch.linalg.svd(x, full_matrices=False)
pca_rep_1 = (u @ torch.diag(s)).numpy() # the svd-based pca scores, (569, 30)

[7]:

# or alternatively, solve PCA as a special case of SISPCA
model = SISPCA(
    SISPCADataset(x, target_supervision_list=[]),
    n_latent_sub = [10],
    solver = 'eig',
)
model.fit(batch_size = -1, max_epochs = 20)
pca_rep_2 = model.get_latent_representation()

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/setup.py:177: PossibleUserWarning: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py:182: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 300    | n/a
---------------------------------------------
0         Trainable params
300       Non-trainable params
300       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

0 supervision variables provided for 1 subspaces. The last subspace will be unsupervised.

/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: PossibleUserWarning: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

We can see that the two solutions are essentially the same.

[8]:

np.linalg.norm(subspace_angles(pca_rep_1[:, :10], pca_rep_2[:, :10]))

[8]:

np.float64(2.679780878067073e-06)

Examine variances explained by each PC

To determine the number of PCs to include, we can check the elbow of the variance explained curve.

[9]:

df_var = pd.DataFrame({
    'pc': np.arange(len(s)) + 1,
    'var_explained_ratio': (s / s.sum()).numpy()
})

(
    ggplot(df_var, aes(x = 'pc', y = 'var_explained_ratio')) +
        geom_point() +
        geom_line() +
        labs(x = 'PC', y = 'Variance proportion\nexplained by each PC') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_14_0.png

[10]:

n_dim = 6
print(f"The top {n_dim} PCs collectively explain {s[:n_dim].sum() / s.sum() * 100 :.1f}% of total variance.")

The top 6 PCs collectively explain 61.6% of total variance.

Visualize and quantify the quality of PCA representations

From the plot, it appears that the malignant and benign samples are well separated according to PC1 and PC2, while there is also strong within-group variation.

[11]:

df_sample = data_raw.copy()
df_sample['PC_1'] = pca_rep_1[:, 0]
df_sample['PC_2'] = pca_rep_1[:, 1]

(
    ggplot(df_sample, aes(x = 'PC_1', y = 'PC_2')) +
        geom_point(aes(color = 'diagnosis')) +
        labs(title = 'PCA') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_17_0.png

To quantitatively evaluate the quality of our representation, we can measure the prediction power regarding malignancy status using the Sihouette score. A higher score means that malignant and benign samples are more separated in the corresponding latent space.

[12]:

print(f"Silhouette score (PCA - PC1-3) {silhouette_score(u[:, :3], data_raw['diagnosis']):.3f}")
print(f"Silhouette score (PCA - PC4-6) {silhouette_score(u[:, 3:6], data_raw['diagnosis']):.3f}")
print(f"Silhouette score (PCA - PC1-6) {silhouette_score(u[:, :6], data_raw['diagnosis']):.3f}")

Silhouette score (PCA - PC1-3) 0.294
Silhouette score (PCA - PC4-6) 0.013
Silhouette score (PCA - PC1-6) 0.160

Extract top features that contribute to PC1

Using the PCA loading matrix, we can rank features based on their contributions to each PC. Here, we focus on the first PC where malignant and benign samples are best separated. From the above plot, it is clear that malignancy-specific features should contribute negatively to PC1.

[13]:

df_project = pd.DataFrame({
    'name': data.columns,
    'loading_pc1': vh[:, 0].numpy(),
    'loading_pc2': vh[:, 1].numpy(),
    'loading_pc3': vh[:, 2].numpy(),
})
df_project.sort_values('loading_pc1', ascending=True).head()

[13]:

	name	loading_pc1	loading_pc2	loading_pc3
8	symmetry_mean	-0.223110	0.112699	-0.223739
0	radius_mean	-0.218902	-0.103725	-0.227537
23	area_worst	-0.182579	0.098787	-0.116649
15	compactness_se	-0.150584	-0.157842	-0.114454
26	concavity_worst	-0.131527	-0.017357	-0.115415

Let’s visualize the top 2 features, namely ‘symmetry_mean’ and ‘radius_mean’.

[14]:

(
    ggplot(df_sample, aes(x = 'PC_1', y = 'PC_2')) +
        geom_point(aes(color = 'radius_mean')) +
        labs(title = 'PCA') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_23_0.png

[15]:

(
    ggplot(df_sample, aes(x = 'PC_1', y = 'PC_2')) +
        geom_point(aes(color = 'symmetry_mean')) +
        labs(title = 'PCA') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_24_0.png

Supervised PCA (sPCA)

In practice, we often want to build representations that reflect specific aspects of the data. For example, the two top contributors to PC1, ‘symmetry_mean’ and ‘radius_mean’, each represents different quantities (“shape” vs “size” of the cell nucleus, respectively). sPCA (Barshan, Elnaz, et al.) extends PCA by allowing us to provide supervision information to guide the learning of the desired latent space.

Model fitting

We can again fit the sPCA model as a special case of SISPCA with a single supervised subspace. To do that, we will first need to create a SISPCADataset object that contains the data and the supervision information.

[16]:

# select the target variables for each subspace
y_radius = normalize_col(torch.from_numpy(
    data[['radius_mean', 'radius_se']].to_numpy()
), center=True, scale=True)
y_sym = normalize_col(torch.from_numpy(
    data[['symmetry_mean', 'symmetry_se']].to_numpy()
), center=True, scale=True)

# remove the target variables from the data
data_sub = data.drop(
    ['radius_mean', 'radius_se',
    'symmetry_mean', 'symmetry_se',],
    axis = 1
)
x_sub = normalize_col(torch.from_numpy(data_sub.to_numpy()), center=True, scale=True)

# prepare the model inputs
sdata = SISPCADataset(
    data = x_sub.float(),
    target_supervision_list = [
        Supervision(target_data=y_radius, target_type='continuous', target_name='radius'),
        Supervision(target_data=y_sym, target_type='continuous', target_name='symmetry'),
    ]
)

The first subspace is the radius subspace, and the second is the symmetry subspace. Here we use linear kernel for both subspaces since the supervisions are continuous. Each subspace has an effective dimension of 2, since the rank of target kernel (Y.T @ Y) is 2.

[17]:

print(f"The radius subspace has {sdata.target_kernel_list[0].rank()} effective dimensions.")
print(f"The symmetry subspace has {sdata.target_kernel_list[1].rank()} effective dimensions.")

The radius subspace has 2 effective dimensions.
The symmetry subspace has 2 effective dimensions.

Now let’s run the model fitting based on pytorch lightning.

[18]:

# each subspace will have dimension 3
n_latent_sub = [3, 3]
lambda_contrast = 0
kernel_subspace = 'linear'
solver = 'eig'

seed_everything(42, workers=True)

spca = SISPCA(
    sdata,
    n_latent_sub=n_latent_sub,
    lambda_contrast=lambda_contrast,
    kernel_subspace=kernel_subspace,
    solver=solver
)
spca.fit(batch_size = -1, max_epochs = 5)

Seed set to 42
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/setup.py:177: PossibleUserWarning: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py:182: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: PossibleUserWarning: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

`Trainer.fit` stopped: `max_epochs=5` reached.

Visualize and quantify the quality of sPCA representations

The ‘radius’ and ‘symmetry’ subspaces are concatenated together after training.

[19]:

spca_rep = spca.get_latent_representation()

df_sample = data_raw.copy()
df_sample['PC_radius_1'] = spca_rep[:, 0]
df_sample['PC_radius_2'] = spca_rep[:, 1]
df_sample['PC_sym_1'] = spca_rep[:, 3]
df_sample['PC_sym_2'] = spca_rep[:, 4]

[20]:

(
    ggplot(df_sample, aes(x = 'PC_radius_1', y = 'PC_radius_2')) +
        geom_point(aes(color = 'diagnosis')) +
        labs(title = 'sPCA (radius)') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_35_0.png

[21]:

(
    ggplot(df_sample, aes(x = 'PC_sym_1', y = 'PC_sym_2')) +
        geom_point(aes(color = 'diagnosis')) +
        labs(title = 'sPCA (symmetry)') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_36_0.png

From the above sPCA plots, both subspaces seem to contain sample variations related to disease status. We can further quantify the observation by calculating the silhouette scores.

[22]:

print(f"Silhouette score (sPCA - radius) {silhouette_score(spca_rep[:, :3], data_raw['diagnosis']):.3f}")
print(f"Silhouette score (sPCA - symmetry) {silhouette_score(spca_rep[:, 3:6], data_raw['diagnosis']):.3f}")
print(f"Silhouette score (sPCA - overall) {silhouette_score(spca_rep[:, :6], data_raw['diagnosis']):.3f}")

Silhouette score (sPCA - radius) 0.471
Silhouette score (sPCA - symmetry) 0.407
Silhouette score (sPCA - overall) 0.437

Extract top features that contribute to each subspace

Now we can look at the top features that contribute to each subspace. Because we did not explicitly ask the two subspaces to be disentangled, the two spaces still capture overlapping information even under distinct supervision. For example, the PC2 loadings of the two subspaces are highly correlated.

[23]:

vh = spca.U.detach()

df_project = pd.DataFrame({
    'name': data_sub.columns,
    'loading_pc_radius_1': vh[:, 0].numpy(),
    'loading_pc_radius_2': vh[:, 1].numpy(),
    'loading_pc_sym_1': vh[:, 3].numpy(),
    'loading_pc_sym_2': vh[:, 4].numpy(),
})

(
    ggplot(df_project, aes(x = 'loading_pc_radius_2', y = 'loading_pc_sym_2')) +
        geom_point() +
        # geom_text(aes(label = 'name')) +
        labs(title = f"Pearson R = {np.corrcoef(vh[:, 1].numpy(), vh[:, 4].numpy())[0][1]:.3f}") +
        theme_classic() +
        theme(figure_size=(3,3))
)

../_images/tutorials_tutorial_brca_40_0.png

In the PC2 loadings, we see that ‘smoothness_se’ strongly contributes to both subspaces, which may not be ideal.

[24]:

df_project.sort_values('loading_pc_radius_2', ascending=False, key = abs).head()

[24]:

	name	loading_pc_radius_1	loading_pc_radius_2	loading_pc_sym_1	loading_pc_sym_2
9	perimeter_se	-0.294413	0.406375	-0.187506	0.126918
11	smoothness_se	0.014764	0.390662	-0.164616	0.363588
10	area_se	-0.302685	0.324720	-0.122182	0.027174
8	texture_se	-0.017534	0.323500	-0.139245	0.394407
7	fractal_dimension_mean	0.059744	0.298165	-0.273750	0.123427

[25]:

df_project.sort_values('loading_pc_sym_2', ascending=False, key = abs).head()

[25]:

	name	loading_pc_radius_1	loading_pc_radius_2	loading_pc_sym_1	loading_pc_sym_2
8	texture_se	-0.017534	0.323500	-0.139245	0.394407
11	smoothness_se	0.014764	0.390662	-0.164616	0.363588
23	concave points_worst	-0.232865	-0.137929	-0.175061	-0.273636
20	smoothness_worst	-0.047029	0.038949	-0.177212	-0.251543
16	radius_worst	-0.307310	-0.154605	-0.051738	-0.248183

We can also directly quantify subspace entanglement using the Grassmann distance or the HSIC statistics. The higher/lower the distance/HSIC, the more disentangled the subspaces are.

[26]:

# subspace grassmann distance
_dist = np.linalg.norm(subspace_angles(
    spca_rep[:, :3],
    spca_rep[:, 3:6]
))

_hsic = hsic_linear(
    torch.from_numpy(spca_rep[:, :3]),
    torch.from_numpy(spca_rep[:, 3:6])
)
print(f"Subspace distance {_dist:.3f}, HSIC-linear {_hsic:.3f}")

Subspace distance 1.321, HSIC-linear 105.829

Supervised independent subspace PCA (sisPCA)

Now we introduce sisPCA, a multi-subspace extension of PCA and sPCA that incorporates an HSIC-based disentanglement penalty to further encourage subspace separation.

Model fitting

We again define two subspaces to focus on two image properties, the ‘radius’ subspace for nuclei size, and the ‘symmetry’ subspace for nuclei shape.

[27]:

# select the target variables for each subspace
y_radius = normalize_col(torch.from_numpy(
    data[['radius_mean', 'radius_se']].to_numpy()
), center=True, scale=True)
y_sym = normalize_col(torch.from_numpy(
    data[['symmetry_mean', 'symmetry_se']].to_numpy()
), center=True, scale=True)

# remove the target variables from the data
data_sub = data.drop(
    ['radius_mean', 'radius_se',
    'symmetry_mean', 'symmetry_se',],
    axis = 1
)
x_sub = normalize_col(torch.from_numpy(data_sub.to_numpy()), center=True, scale=True)

# prepare the model inputs
sdata = SISPCADataset(
    data = x_sub.float(),
    target_supervision_list = [
        Supervision(target_data=y_radius, target_type='continuous', target_name='radius'),
        Supervision(target_data=y_sym, target_type='continuous', target_name='symmetry'),
    ]
)

When running sisPCA, the hyperparameter lambda_contrast controls the strength of the disentanglement penalty. We will discuss the tuning of this parameter in the final section.

[28]:

n_latent_sub = [3, 3]
lambda_contrast = 10
kernel_subspace = 'linear'
solver = 'eig'

seed_everything(42, workers=True)

sispca = SISPCA(
    sdata,
    n_latent_sub=n_latent_sub,
    lambda_contrast=lambda_contrast,
    kernel_subspace=kernel_subspace,
    solver=solver
)
sispca.fit(batch_size = -1, max_epochs = 100, early_stopping_patience = 5)

Seed set to 42
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/setup.py:177: PossibleUserWarning: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py:182: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: PossibleUserWarning: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

`Trainer.fit` stopped: `max_epochs=100` reached.

Visualize and quantify the quality of sisPCA representations

[29]:

sispca_rep = sispca.get_latent_representation()

df_sample = data_raw.copy()
df_sample['PC_radius_1'] = sispca_rep[:, 0]
df_sample['PC_radius_2'] = sispca_rep[:, 1]
df_sample['PC_sym_1'] = sispca_rep[:, 3]
df_sample['PC_sym_2'] = sispca_rep[:, 4]

[30]:

(
    ggplot(df_sample, aes(x = 'PC_radius_1', y = 'PC_radius_2')) +
        geom_point(aes(color = 'diagnosis')) +
        labs(title = 'sisPCA (radius)') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_54_0.png

[31]:

(
    ggplot(df_sample, aes(x = 'PC_sym_1', y = 'PC_sym_2')) +
        geom_point(aes(color = 'diagnosis')) +
        labs(title = 'sisPCA (symmetry)') +
        theme_classic() +
        theme(figure_size=(4,3))
)

../_images/tutorials_tutorial_brca_55_0.png

[32]:

print(f"Silhouette score (sisPCA - radius) {silhouette_score(sispca_rep[:, :3], data_raw['diagnosis']):.3f}")
print(f"Silhouette score (sisPCA - symmetry) {silhouette_score(sispca_rep[:, 3:6], data_raw['diagnosis']):.3f}")
print(f"Silhouette score (sisPCA - overall) {silhouette_score(sispca_rep[:, :6], data_raw['diagnosis']):.3f}")

Silhouette score (sisPCA - radius) 0.521
Silhouette score (sisPCA - symmetry) 0.033
Silhouette score (sisPCA - overall) 0.371

We see that in the above sisPCA results, the ‘radius’ subspace remains predictive, while the ‘symmetry’ subspace no longer captures difference in disease status. This indeed reflects the true predictive power of the target variables. Note also that by allowing additional information to contribute to the ‘radius’ subspace, it is now containing more malignancy information than the raw supervision target variable [‘radius_mean’, ‘radius_se’].

[33]:

_ft = ['radius_mean', 'radius_se']
print(f"Silhouette score ({_ft}) {silhouette_score(data[_ft], data_raw['diagnosis']):.3f}")

Silhouette score (['radius_mean', 'radius_se']) 0.457

[34]:

_ft = ['symmetry_mean', 'symmetry_se']
print(f"Silhouette score ({_ft}) {silhouette_score(data[_ft], data_raw['diagnosis']):.3f}")

Silhouette score (['symmetry_mean', 'symmetry_se']) 0.092

Extract top features that contributes to each subspace

[35]:

vh = sispca.U.detach()

df_project = pd.DataFrame({
    'name': data_sub.columns,
    'loading_pc_radius_1': vh[:, 0].numpy(),
    'loading_pc_radius_2': vh[:, 1].numpy(),
    'loading_pc_sym_1': vh[:, 3].numpy(),
    'loading_pc_sym_2': vh[:, 4].numpy(),
})

[36]:

(
    ggplot(df_project, aes(x = 'loading_pc_radius_2', y = 'loading_pc_sym_2')) +
        geom_point() +
        # geom_text(aes(label = 'name')) +
        labs(title = f"Pearson R = {np.corrcoef(vh[:, 1].numpy(), vh[:, 4].numpy())[0][1]:.3f}") +
        theme_classic() +
        theme(figure_size=(3,3))
)

../_images/tutorials_tutorial_brca_62_0.png

Now the PC2 loadings are no longer correlated, and each subspace is constructed from distinct features that are more relevant to the target variables, i.e. ‘area_’ for the ‘radius’ subspace and ‘smoothness_’ for the ‘symmetry’ subspace.

[37]:

df_project.sort_values('loading_pc_radius_1', ascending=False, key = abs).head()

[37]:

	name	loading_pc_radius_1	loading_pc_radius_2	loading_pc_sym_1	loading_pc_sym_2
2	area_mean	-0.323873	0.079059	-0.196451	-0.055753
1	perimeter_mean	-0.317973	0.162154	-0.186226	-0.054114
19	area_worst	-0.317266	-0.027713	-0.201404	0.041665
16	radius_worst	-0.316681	0.059447	-0.205232	0.026357
18	perimeter_worst	-0.315147	0.068681	-0.183593	0.017961

[38]:

df_project.sort_values('loading_pc_sym_1', ascending=False, key = abs).head()

[38]:

	name	loading_pc_radius_1	loading_pc_radius_2	loading_pc_sym_1	loading_pc_sym_2
24	symmetry_worst	-0.035552	0.105500	0.423216	0.012136
7	fractal_dimension_mean	0.085793	-0.145951	0.372799	0.159669
11	smoothness_se	0.030047	-0.308551	0.267046	-0.044819
15	fractal_dimension_se	-0.009894	0.003989	0.261434	-0.204098
12	compactness_se	-0.081300	0.122348	0.249812	-0.336611

[39]:

df_project.sort_values('loading_pc_radius_2', ascending=False, key = abs).head()

[39]:

	name	loading_pc_radius_1	loading_pc_radius_2	loading_pc_sym_1	loading_pc_sym_2
9	perimeter_se	-0.294218	-0.515662	0.044653	-0.061888
10	area_se	-0.306857	-0.489507	-0.055998	0.019116
11	smoothness_se	0.030047	-0.308551	0.267046	-0.044819
22	concavity_worst	-0.155004	0.239243	0.040021	-0.027116
3	smoothness_mean	-0.066986	-0.233024	0.246799	0.467927

[40]:

df_project.sort_values('loading_pc_sym_2', ascending=False, key = abs).head()

[40]:

	name	loading_pc_radius_1	loading_pc_radius_2	loading_pc_sym_1	loading_pc_sym_2
20	smoothness_worst	-0.031913	-0.117651	0.110923	0.507345
3	smoothness_mean	-0.066986	-0.233024	0.246799	0.467927
12	compactness_se	-0.081300	0.122348	0.249812	-0.336611
8	texture_se	-0.010335	-0.162091	0.222898	-0.303128
13	concavity_se	-0.077843	0.126739	0.185260	-0.300306

[41]:

# subspace grassmann distance
_dist = np.linalg.norm(subspace_angles(
    sispca_rep[:, :3],
    sispca_rep[:, 3:6]
))

_hsic = hsic_linear(
    torch.from_numpy(sispca_rep[:, :3]),
    torch.from_numpy(sispca_rep[:, 3:6])
)
print(f"Subspace distance {_dist:.3f}, HSIC-linear {_hsic:.3f}")

Subspace distance 2.710, HSIC-linear 0.002

Automatic hyperparameter tuning for sisPCA (experimental)

In this section, we will demonstrate how to automatically tune the hyperparameter lambda_contrast using spectral clustering.

[42]:

from sispca.model import SISPCAAuto

Subspace ordering

When the target variables are not “balanced”, i.e., they may be on different scales or have different kernel configurations, the sisPCA solution may not be globally optimal. Fortunately, this can be resolved by re-ordering the subspaces during the iterative eigendecomposition process. The path towards the global optimum is the one that has the lowest loss after the first few iterations.

[43]:

seed_everything(42, workers=True)

n_latent_sub = [3, 3]
lambda_contrast = 10
kernel_subspace = 'linear'
solver = 'eig'

sispca_auto = SISPCAAuto(
    sdata,
    n_latent_sub=n_latent_sub,
    max_log_lambda_c = 2, # maxmium log10(lambda_contrast)
    n_lambda = 20, # number of lambda_contrast values to search
    n_lambda_clu = 3, # number of lambda_contrast clusters to return
    kernel_subspace=kernel_subspace,
    solver=solver
)

sispca_auto.find_best_update_order()

Seed set to 42
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/setup.py:177: PossibleUserWarning: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py:182: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode
/Users/jysumac/miniforge3/envs/sispca/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: PossibleUserWarning: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

`Trainer.fit` stopped: `max_epochs=5` reached.
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

`Trainer.fit` stopped: `max_epochs=5` reached.

Best subspace order: (1, 0)

[44]:

print(sispca_auto.dataset.target_name_list)

['symmetry', 'radius']

Selecting the best `lambda_contrast`

[45]:

seed_everything(42, workers=True)

n_latent_sub = [3, 3]
lambda_contrast = 10
kernel_subspace = 'linear'
solver = 'eig'

sispca_auto = SISPCAAuto(
    sdata,
    n_latent_sub=n_latent_sub,
    max_log_lambda_c = 2, # maxmium log10(lambda_contrast)
    n_lambda = 20, # number of lambda_contrast values to search
    n_lambda_clu = 3, # number of lambda_contrast clusters to return
    kernel_subspace=kernel_subspace,
    solver=solver
)

sispca_auto.fit(batch_size = -1, max_epochs = 100, early_stopping_patience = 10)

Seed set to 42
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 156    | n/a
---------------------------------------------
0         Trainable params
156       Non-trainable params
156       Total params
0.001     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode

Let’s visualize the affinity matrix of model fittings with different lambdas and the loss curve.

[46]:

sispca_auto.create_affinity_matrix(affinity_metric='determinant')
sispca_auto.find_spectral_lambdas()
sispca_auto.lambda_contrast_clusters

[46]:

array([0, 3, 3, 3, 3, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      dtype=int32)

[47]:

_lambda_list = [str(round(float(x), 3)) for x in sispca_auto.lambda_contrast_list]
_df = pd.DataFrame(
    sispca_auto.affinity_matrix,
    columns=_lambda_list,
    index=_lambda_list
)

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(_df, annot=False, fmt=".2f", square = True)
plt.show()

../_images/tutorials_tutorial_brca_78_0.png

[48]:

_df = pd.DataFrame(sispca_auto.final_loss[:]).melt(
    id_vars = 'lambda_contrast', var_name = 'loss_type', value_name = 'loss'
)

(
    ggplot(_df, aes(x = 'lambda_contrast', y = 'loss', group = 'loss_type')) +
        facet_wrap('~loss_type', scales='free_y') +
        geom_line() +
        geom_point() +
        theme_classic() +
        theme(figure_size=(6,2))
)

../_images/tutorials_tutorial_brca_79_0.png

From the above results, it appears that the optimal lambda_contrast is around 1, where the subspace representations become more stable and the loss curve starts to plateau. We can again visualize the changes of the individual subspaces.

[49]:

_df = pd.concat([
    pd.DataFrame(
        {'lambda_contrast': l,
         'PC_1': m.get_latent_representation()[:, 0],
         'PC_2': m.get_latent_representation()[:, 1],
         'diagnosis': data_raw['diagnosis']
        }
    )
    for l, m in zip(sispca_auto.lambda_contrast_list, sispca_auto.models)
])

(
    ggplot(_df, aes(x = 'PC_1', y = 'PC_2', color = 'diagnosis')) +
        facet_wrap('~lambda_contrast', labeller=labeller(cols = lambda x: str(round(float(x), 3)))) +
        geom_point() +
        labs(title = 'The radius subspace') +
        theme_classic() +
        theme(
            figure_size=(6,6),
            plot_background = element_rect(fill = 'white'),
        )
)

../_images/tutorials_tutorial_brca_81_0.png

[50]:

_df = pd.concat([
    pd.DataFrame(
        {'lambda_contrast': l,
         'PC_1': m.get_latent_representation()[:, 3],
         'PC_2': m.get_latent_representation()[:, 4],
         'diagnosis': data_raw['diagnosis']
        }
    )
    for l, m in zip(sispca_auto.lambda_contrast_list, sispca_auto.models)
])

(
    ggplot(_df, aes(x = 'PC_1', y = 'PC_2', color = 'diagnosis')) +
        facet_wrap('~lambda_contrast', labeller=labeller(cols = lambda x: str(round(float(x), 3)))) +
        geom_point() +
        labs(title = 'The symmetry subspace') +
        theme_classic() +
        theme(
            figure_size=(6,6),
            plot_background = element_rect(fill = 'white'),
        )
)

../_images/tutorials_tutorial_brca_82_0.png

Feature selection using sisPCA on the Breast Cancer Wisconsin dataset

Load and preprocess the breast cancer dataset

Principal component analysis (PCA)

Examine variances explained by each PC

Visualize and quantify the quality of PCA representations

Extract top features that contribute to PC1

Supervised PCA (sPCA)

Model fitting

Visualize and quantify the quality of sPCA representations

Extract top features that contribute to each subspace

Supervised independent subspace PCA (sisPCA)

Model fitting

Visualize and quantify the quality of sisPCA representations

Extract top features that contributes to each subspace

Automatic hyperparameter tuning for sisPCA (experimental)

Subspace ordering

Selecting the best lambda_contrast

Selecting the best `lambda_contrast`