Iteratively train an ML model on a dataset#
In the previous tutorial, we loaded an entire dataset into memory to perform a simple analysis.
Here, we'll iterate over the files within the dataset to train an ML model.
import lamindb as ln
import anndata as ad
import numpy as np
loaded instance: testuser1/test-scrna (lamindb 0.54.4)
ln.track()
notebook imports: anndata==0.9.2 lamindb==0.54.4 numpy==1.25.2 scgen==2.1.1
Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model model on a dataset', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-02 10:20:11, created_by_id='DzTjkKse')
Run(id='yRdlmr6K2xK3m6ZYbIKB', run_at=2023-10-02 10:20:11, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
Setup#
dataset_v2 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()
dataset_v2
Dataset(id='p563WE4VtMIf6FQfXJE4', name='My versioned scRNA-seq dataset', version='2', hash='0Uq1qU7xX7R6pyWN3oOT', updated_at=2023-10-02 10:19:42, transform_id='ManDYgmftZ8Cz8', run_id='BJevtAUNru8jYr9cO093', initial_version_id='p563WE4VtMIf6FQfXJmN', created_by_id='DzTjkKse')
We import scGen, which is built on scvi-tools.
import scgen
2023-10-02 10:20:13,775:INFO - Created a temporary directory at /tmp/tmpfevj7lid
2023-10-02 10:20:13,779:INFO - Writing /tmp/tmpfevj7lid/_remote_module_non_scriptable.py
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
self.seed = seed
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
self.dl_pin_memory_gpu_training = (
As in the previous tutorial, we could load the entire dataset into memory and train a model in four lines of code.
What would this look like?
data_train = dataset_v2.load(join="inner")
scgen.SCGEN.setup_anndata(data_train)
vae = scgen.SCGEN(data_train)
vae.train(max_epochs=1) # we use max_epochs=1 to be able to run it on CI
Let us instead load all file records:
file1, file2 = dataset_v2.files.list()
We'd like some context on what the first file contains and where it's from:
file1.describe()
file1.view_flow()
File(id='be3FZ3wa9dpwAGDafjLI', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=660792, hash='a2V0IgOjMRHsCeZH169UOQ', hash_type='md5', updated_at=2023-10-02 10:19:35)
Provenance:
storage: Storage(id='7ZVN6khD', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-02 10:18:00, created_by_id='DzTjkKse')
transform: Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna1', version='0', type='notebook', updated_at=2023-10-02 10:19:37, created_by_id='DzTjkKse')
run: Run(id='BJevtAUNru8jYr9cO093', run_at=2023-10-02 10:19:02, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')
created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-02 10:18:00)
input_of (core.Run): ['2023-10-02 10:19:49']
Features:
var: FeatureSet(id='JH4lHm0bKOjJwld8gvuS', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-10-02 10:19:35, modality_id='Z4a7WYsI', created_by_id='DzTjkKse')
'PTPRC', 'COX17', 'HIGD2A', 'LINC01857', 'MRPS6', 'GZMH', 'TPD52', 'TMEM256', 'SNX2', 'CCT8', 'TBC1D10C', 'FCN1', 'MATK', 'PRDX6', 'APEX1', 'TINF2', 'HSD17B8', 'DEK', 'NCL', 'GZMK', ...
obs: FeatureSet(id='iJO8vTcbKF2WmUE5abJn', n=1, registry='core.Feature', hash='N2S5inpQSO0LupVR9d63', updated_at=2023-10-02 10:19:35, modality_id='9FuQHsY3', created_by_id='DzTjkKse')
cell_type (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'monocyte', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell', 'CD24-positive, CD4 single-positive thymocyte'
external: FeatureSet(id='Tb1kUtpJdIBsnA90OVZ7', n=2, registry='core.Feature', hash='K0Hptsob4fESZu-GTvkj', updated_at=2023-10-02 10:19:35, modality_id='9FuQHsY3', created_by_id='DzTjkKse')
assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
species (1, bionty.Species): 'human'
Labels:
species (1, bionty.Species): 'human'
cell_types (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'monocyte', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell', 'CD24-positive, CD4 single-positive thymocyte'
experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
We'll need to decide which features to use for training the model.
Because each file is validated, they're all indexed by ensembl_gene_id in the var slot of AnnData.
shared_genes = file1.features["var"] & file2.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
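The intersection above works because both files index their genes by the same kind of identifier. As a minimal illustration outside LaminDB, the sketch below applies the same idea to two plain Python lists while preserving the order of the first one. The identifiers are placeholders and the helper name `shared_in_order` is hypothetical; neither comes from the tutorial's data or from any library.

```python
def shared_in_order(first_ids, second_ids):
    """Return the IDs present in both lists, in the order of `first_ids`."""
    second = set(second_ids)  # set membership keeps the scan linear
    seen = set()              # guards against duplicates in `first_ids`
    shared = []
    for gene_id in first_ids:
        if gene_id in second and gene_id not in seen:
            shared.append(gene_id)
            seen.add(gene_id)
    return shared


# Placeholder identifiers, for illustration only
genes_file1 = ["ENSG00000081237", "ENSG00000138495", "ENSG00000146066"]
genes_file2 = ["ENSG00000146066", "ENSG00000081237", "ENSG00000243927"]

shared = shared_in_order(genes_file1, genes_file2)
print(shared)
```

Subsetting every file with one and the same ordered list keeps the columns aligned across files, which matters when a model trained on one file is later resumed on another.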
Train the model#
Let us load the first file into memory:
data_train1 = file1.load()[:, shared_genes_ensembl].copy()
data_train1
AnnData object with n_obs × n_vars = 70 × 749
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'gene_symbol', 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Train the model on this first file:
scgen.SCGEN.setup_anndata(data_train1)
vae = scgen.SCGEN(data_train1)
vae.train(max_epochs=1) # we use max_epochs=1 to run it on CI
vae.save("saved_models/scgen1")
INFO: GPU available: False, used: False
2023-10-02 10:20:16,239:INFO - GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
2023-10-02 10:20:16,242:INFO - TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
2023-10-02 10:20:16,243:INFO - IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
2023-10-02 10:20:16,244:INFO - HPU available: False, using: 0 HPUs
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
Training: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 16.79it/s, v_num=1, train_loss_step=459, train_loss_epoch=459]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.
2023-10-02 10:20:16,550:INFO - `Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 13.41it/s, v_num=1, train_loss_step=459, train_loss_epoch=459]
Load the second file and resume training the model:
data_train2 = file2.load()[:, shared_genes_ensembl].copy()
vae = scgen.SCGEN.load("saved_models/scgen1", data_train2)
vae.train(max_epochs=1)
vae.save("saved_models/scgen1", overwrite=True)
INFO
File saved_models/scgen1/model.pt already downloaded
INFO: GPU available: False, used: False
2023-10-02 10:20:16,759:INFO - GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
2023-10-02 10:20:16,764:INFO - TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
2023-10-02 10:20:16,766:INFO - IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
2023-10-02 10:20:16,769:INFO - HPU available: False, using: 0 HPUs
Training: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 2.71it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 2.71it/s, v_num=1, train_loss_step=184, train_loss_epoch=254]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.
2023-10-02 10:20:17,162:INFO - `Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 2.62it/s, v_num=1, train_loss_step=184, train_loss_epoch=254]
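The two steps above (train on the first file, then load the checkpoint and continue on the second) extend naturally to any number of files. The following is a minimal sketch of that loop, not the tutorial's own code: `train_over_files` and its four callables are hypothetical names, and the stand-in `CountingModel` exists only so the example runs without scGen. With the objects from this tutorial, `load_data` would correspond to `file.load()[:, shared_genes_ensembl].copy()`, `init_model` to `scgen.SCGEN.setup_anndata` followed by `scgen.SCGEN(data)`, and `resume_model` to `scgen.SCGEN.load(path, data)`.

```python
def train_over_files(files, load_data, init_model, resume_model, model_dir):
    """Train on the first file, then resume from the checkpoint for each later file.

    The callables are placeholders supplied by the caller; they are not
    part of lamindb or scgen.
    """
    model = None
    for index, file in enumerate(files):
        data = load_data(file)  # only one file is held in memory at a time
        if index == 0:
            model = init_model(data)
        else:
            model = resume_model(model_dir, data)
        model.train(max_epochs=1)
        model.save(model_dir, overwrite=True)
    return model


class CountingModel:
    """Stand-in model that only records how many epochs it has seen."""

    def __init__(self, steps=0):
        self.steps = steps

    def train(self, max_epochs=1):
        self.steps += max_epochs

    def save(self, path, overwrite=False):
        saved[path] = self.steps


saved = {}  # plays the role of the checkpoint directory
model = train_over_files(
    files=["file1", "file2", "file3"],
    load_data=lambda file: file,
    init_model=lambda data: CountingModel(),
    resume_model=lambda path, data: CountingModel(saved[path]),
    model_dir="saved_models/demo",
)
print(model.steps)
```

Because each iteration reloads the model from the saved checkpoint, peak memory is bounded by the largest single file rather than by the whole dataset.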
Save the model#
weights = ln.File("saved_models/scgen1/model.pt", description="My trained model")
weights.save()
Save latent representation as a new dataset#
latent1 = vae.get_latent_representation(data_train1)
latent2 = vae.get_latent_representation(data_train2)
adata_latent1 = ad.AnnData(X=latent1, obs=data_train1.obs)
adata_latent2 = ad.AnnData(X=latent2, obs=data_train2.obs)
INFO
Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
Because the latent representation is low-dimensional, we can typically fit a very large number of observations into memory.
Hence, let's store it as a single concatenated AnnData object.
adata_latent = ad.concat([adata_latent1, adata_latent2])
dataset_v2_latent = ln.Dataset(
adata_latent,
name="Latent representation of scRNA-seq dataset v2",
description="For the original data, see dataset T5x0SkRJNviE0jYGbJKt",
)
dataset_v2_latent.save()
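To put a rough number on why the low-dimensional representation fits into memory, here is a back-of-the-envelope estimate. It assumes dense float32 storage, a hypothetical one million cells, and a latent size of 100; the latter two are assumptions for illustration, so check your own model's setting. The 749 genes are the shared genes shown in the AnnData object earlier.

```python
def dense_size_bytes(n_obs, n_vars, bytes_per_value=4):
    """Size of a dense n_obs x n_vars matrix (float32 by default)."""
    return n_obs * n_vars * bytes_per_value


n_cells = 1_000_000  # hypothetical number of observations
n_genes = 749        # shared genes in this tutorial
n_latent = 100       # assumed latent dimensionality

expression_gb = dense_size_bytes(n_cells, n_genes) / 1e9
latent_gb = dense_size_bytes(n_cells, n_latent) / 1e9
print(f"expression: {expression_gb:.2f} GB, latent: {latent_gb:.2f} GB")
```

Expression matrices are often stored in sparse form, so the real gap depends on the data. The estimate only shows that the latent matrix grows with the latent size rather than with the number of genes.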
Let us look at the data flow:
dataset_v2_latent.view_flow()
Compare this with the model:
weights.view_flow()
Annotate with labels:
dataset_v2_latent.labels.add_from(dataset_v2)
dataset_v2_latent.describe()
Dataset(id='AgyNjaMS0OIULKARPJL3', name='Latent representation of scRNA-seq dataset v2', description='For the original data, see dataset T5x0SkRJNviE0jYGbJKt', hash='UD18fqUz1eIFj64cNMbV0g', updated_at=2023-10-02 10:20:17)
Provenance:
transform: Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model model on a dataset', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-02 10:20:17, created_by_id='DzTjkKse')
run: Run(id='yRdlmr6K2xK3m6ZYbIKB', run_at=2023-10-02 10:20:11, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
file: File(id='AgyNjaMS0OIULKARPJL3', suffix='.h5ad', accessor='AnnData', description='See dataset AgyNjaMS0OIULKARPJL3', size=838706, hash='UD18fqUz1eIFj64cNMbV0g', hash_type='md5', updated_at=2023-10-02 10:20:17, storage_id='7ZVN6khD', transform_id='Qr1kIHvK506rz8', run_id='yRdlmr6K2xK3m6ZYbIKB', created_by_id='DzTjkKse')
created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-02 10:18:00)
Features:
external: FeatureSet(id='iOsJ5gsfHpxohnmgiYk5', n=5, registry='core.Feature', hash='Xx0XNmoj1n5Atm52X0Zw', updated_at=2023-10-02 10:20:17, modality_id='9FuQHsY3', created_by_id='DzTjkKse')
donor (12, core.ULabel): 'A37', 'A31', 'A36', 'A29', '637C', 'D503', '640C', 'D496', '621B', '582C', ...
species (1, bionty.Species): 'human'
cell_type (39, bionty.CellType): 'animal cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'monocyte', 'cytotoxic T cell', 'progenitor cell', 'macrophage', 'gamma-delta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'classical monocyte', 'regulatory T cell', ...
tissue (17, bionty.Tissue): 'thymus', 'ileum', 'blood', 'jejunal epithelium', 'lung', 'duodenum', 'thoracic lymph node', 'skeletal muscle tissue', 'spleen', 'mesenteric lymph node', ...
assay (4, bionty.ExperimentalFactor): '10x 5' v1', 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2'
Labels:
species (1, bionty.Species): 'human'
tissues (17, bionty.Tissue): 'thymus', 'ileum', 'blood', 'jejunal epithelium', 'lung', 'duodenum', 'thoracic lymph node', 'skeletal muscle tissue', 'spleen', 'mesenteric lymph node', ...
cell_types (39, bionty.CellType): 'animal cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'monocyte', 'cytotoxic T cell', 'progenitor cell', 'macrophage', 'gamma-delta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'classical monocyte', 'regulatory T cell', ...
experimental_factors (4, bionty.ExperimentalFactor): '10x 5' v1', 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2'
ulabels (12, core.ULabel): 'A37', 'A31', 'A36', 'A29', '637C', 'D503', '640C', 'D496', '621B', '582C', ...
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
deleting instance testuser1/test-scrna
deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
instance cache deleted
deleted '.lndb' sqlite file
consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna