Iteratively train an ML model on a dataset#

In the previous tutorial, we loaded an entire dataset into memory to perform a simple analysis.

Here, we’ll iterate over the files within the dataset to train an ML model one file at a time.

import lamindb as ln
import anndata as ad
import numpy as np

💡 loaded instance: testuser1/test-scrna (lamindb 0.54.4)

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.54.4 numpy==1.25.2 scgen==2.1.1

💡 Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model model on a dataset', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-02 10:20:11, created_by_id='DzTjkKse')

💡 Run(id='yRdlmr6K2xK3m6ZYbIKB', run_at=2023-10-02 10:20:11, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')

Setup#

dataset_v2 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()

dataset_v2

Dataset(id='p563WE4VtMIf6FQfXJE4', name='My versioned scRNA-seq dataset', version='2', hash='0Uq1qU7xX7R6pyWN3oOT', updated_at=2023-10-02 10:19:42, transform_id='ManDYgmftZ8Cz8', run_id='BJevtAUNru8jYr9cO093', initial_version_id='p563WE4VtMIf6FQfXJmN', created_by_id='DzTjkKse')

We import scGen, which is built on scvi-tools.

import scgen

Similar to what we did in the previous tutorial, we could load the entire dataset into memory and train a model in 4 lines of code.

Let us instead load all file records:

file1, file2 = dataset_v2.files.list()

We’d like some context on what the first file contains and where it’s from:

file1.describe()
file1.view_flow()


File(id='be3FZ3wa9dpwAGDafjLI', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=660792, hash='a2V0IgOjMRHsCeZH169UOQ', hash_type='md5', updated_at=2023-10-02 10:19:35)

Provenance:
  🗃️ storage: Storage(id='7ZVN6khD', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-02 10:18:00, created_by_id='DzTjkKse')
  📔 transform: Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna1', version='0', type='notebook', updated_at=2023-10-02 10:19:37, created_by_id='DzTjkKse')
  👣 run: Run(id='BJevtAUNru8jYr9cO093', run_at=2023-10-02 10:19:02, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-02 10:18:00)
  ⬇️ input_of (core.Run): ['2023-10-02 10:19:49']
Features:
  var: FeatureSet(id='JH4lHm0bKOjJwld8gvuS', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-10-02 10:19:35, modality_id='Z4a7WYsI', created_by_id='DzTjkKse')
    'PTPRC', 'COX17', 'HIGD2A', 'LINC01857', 'MRPS6', 'GZMH', 'TPD52', 'TMEM256', 'SNX2', 'CCT8', 'TBC1D10C', 'FCN1', 'MATK', 'PRDX6', 'APEX1', 'TINF2', 'HSD17B8', 'DEK', 'NCL', 'GZMK', ...
  obs: FeatureSet(id='iJO8vTcbKF2WmUE5abJn', n=1, registry='core.Feature', hash='N2S5inpQSO0LupVR9d63', updated_at=2023-10-02 10:19:35, modality_id='9FuQHsY3', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'monocyte', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell', 'CD24-positive, CD4 single-positive thymocyte'
  external: FeatureSet(id='Tb1kUtpJdIBsnA90OVZ7', n=2, registry='core.Feature', hash='K0Hptsob4fESZu-GTvkj', updated_at=2023-10-02 10:19:35, modality_id='9FuQHsY3', created_by_id='DzTjkKse')
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 species (1, bionty.Species): 'human'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ cell_types (9, bionty.CellType): 'dendritic cell', 'B cell, CD19-positive', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'monocyte', 'CD4-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell', 'CD24-positive, CD4 single-positive thymocyte'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'

https://d33wubrfki0l68.cloudfront.net/e46e681b29b7587f1ff2e7a7c4ece98e819b3211/f1454/_images/9f0d6087d7219bc16af527cdf661243c12e7eaa8f4c4507c352bd2c0ebc8e005.svg

We need to decide which features to use for training the model.

Because every file is validated, all of them are indexed by ensembl_gene_id in the var slot of AnnData, so we can restrict training to the genes that both files share.

shared_genes = file1.features["var"] & file2.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
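The intersection of the two feature sets can be pictured with plain Python collections. This is a minimal sketch, not lamindb's implementation: the gene IDs are placeholder strings, and `shared_ids` is a hypothetical helper that keeps the order of the first file so that subsetting columns stays reproducible.

```python
def shared_ids(first: list[str], second: list[str]) -> list[str]:
    """Return the IDs present in both lists, in the order of `first`."""
    second_set = set(second)  # set gives fast membership tests
    return [gene_id for gene_id in first if gene_id in second_set]


# Placeholder IDs for illustration only (not taken from the dataset)
var_file1 = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]
var_file2 = ["ENSG_C", "ENSG_A", "ENSG_E"]

shared = shared_ids(var_file1, var_file2)
print(shared)
```

Only the IDs found in both lists survive, which is why both files can later be subset to identical columns.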

Train the model#

Let us load the first file into memory:

data_train1 = file1.load()[:, shared_genes_ensembl].copy()
data_train1

AnnData object with n_obs × n_vars = 70 × 749
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'gene_symbol', 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Train the model on this first file:

scgen.SCGEN.setup_anndata(data_train1)
vae = scgen.SCGEN(data_train1)
vae.train(max_epochs=1)  # we use max_epochs=1 to run it on CI
vae.save("saved_models/scgen1")

Load the second file and resume training the model:

data_train2 = file2.load()[:, shared_genes_ensembl].copy()
vae = scgen.SCGEN.load("saved_models/scgen1", data_train2)
vae.train(max_epochs=1)
vae.save("saved_models/scgen1", overwrite=True)
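The train, save, load, resume pattern used for the two files generalizes to any number of files. The sketch below illustrates only the control flow under stated assumptions: `RunningMean` is a hypothetical toy stand-in for a model (it is not scGen), each inner list plays the role of one file, and the checkpoint is a JSON file in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


class RunningMean:
    """Toy stand-in for a model: tracks the mean of all values seen so far."""

    def __init__(self, total: float = 0.0, count: int = 0):
        self.total = total
        self.count = count

    def train(self, batch: list[float]) -> None:
        # Update the state incrementally; earlier batches are not needed.
        self.total += sum(batch)
        self.count += len(batch)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

    def save(self, path: Path) -> None:
        path.write_text(json.dumps({"total": self.total, "count": self.count}))

    @classmethod
    def load(cls, path: Path) -> "RunningMean":
        state = json.loads(path.read_text())
        return cls(state["total"], state["count"])


def train_iteratively(batches: list[list[float]], checkpoint: Path) -> RunningMean:
    """Train on one batch at a time, resuming from the checkpoint each round."""
    model = RunningMean()
    for batch in batches:
        if checkpoint.exists():
            model = RunningMean.load(checkpoint)  # resume from the saved state
        model.train(batch)
        model.save(checkpoint)  # overwrite the previous checkpoint
    return model


with tempfile.TemporaryDirectory() as tmp:
    # Only one batch is held in memory at any time.
    final_model = train_iteratively(
        [[1.0, 2.0], [3.0], [4.0, 5.0]], Path(tmp) / "model.json"
    )
    print(final_model.mean)
```

With a real model, the loop body would load one file, subset it to the shared genes, and call the model's own load, train, and save methods, as done for the two files here.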

Save the model#

weights = ln.File("saved_models/scgen1/model.pt", description="My trained model")
weights.save()

Save latent representation as a new dataset#

latent1 = vae.get_latent_representation(data_train1)
latent2 = vae.get_latent_representation(data_train2)

adata_latent1 = ad.AnnData(X=latent1, obs=data_train1.obs)
adata_latent2 = ad.AnnData(X=latent2, obs=data_train2.obs)

INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

Because the latent representation is low-dimensional, we can typically fit a very large number of observations into memory.

Hence, let’s store it as a single concatenated AnnData object.

adata_latent = ad.concat([adata_latent1, adata_latent2])
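A rough back-of-the-envelope calculation shows why the latent representation is cheap to hold in memory. The numbers are assumptions for illustration: one million cells is hypothetical, the latent dimension of 100 is an assumed setting that you should check against your model, and the estimate treats both matrices as dense float32 (expression matrices are often stored sparse, so the first figure is an upper bound).

```python
def dense_size_mb(n_obs: int, n_features: int, bytes_per_value: int = 4) -> float:
    """Size in megabytes of a dense n_obs x n_features matrix."""
    return n_obs * n_features * bytes_per_value / 1e6


n_obs = 1_000_000  # hypothetical number of cells
n_genes = 749      # number of shared genes in the AnnData shown above
n_latent = 100     # assumed latent dimension; check your model's setting

expression_mb = dense_size_mb(n_obs, n_genes)
latent_mb = dense_size_mb(n_obs, n_latent)
print(f"expression: {expression_mb:.0f} MB, latent: {latent_mb:.0f} MB")
```

Under these assumptions the latent matrix is several times smaller than the expression matrix, and the gap widens as the number of genes grows.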

dataset_v2_latent = ln.Dataset(
    adata_latent,
    name="Latent representation of scRNA-seq dataset v2",
    description="For the original data, see dataset T5x0SkRJNviE0jYGbJKt",
)
dataset_v2_latent.save()

Let us look at the data flow:

dataset_v2_latent.view_flow()

https://d33wubrfki0l68.cloudfront.net/971109f77d5eeb0351fef8427e62a2fcd256d343/2c1c4/_images/9855abd2864357e10fc9d4eeadc108b503d2af534fcff5831ffd3919e40d1c09.svg

Compare this with the data flow of the saved model weights:

weights.view_flow()

https://d33wubrfki0l68.cloudfront.net/70211002c1c8112b149cddf76db90901c62730fd/87b93/_images/4cbede4668e1df9da620c3a5a7f0e36704742da9ba32cb22fbbac651b724ef2d.svg

Annotate with labels:

dataset_v2_latent.labels.add_from(dataset_v2)

dataset_v2_latent.describe()

Dataset(id='AgyNjaMS0OIULKARPJL3', name='Latent representation of scRNA-seq dataset v2', description='For the original data, see dataset T5x0SkRJNviE0jYGbJKt', hash='UD18fqUz1eIFj64cNMbV0g', updated_at=2023-10-02 10:20:17)

Provenance:
  💫 transform: Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model model on a dataset', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-02 10:20:17, created_by_id='DzTjkKse')
  👣 run: Run(id='yRdlmr6K2xK3m6ZYbIKB', run_at=2023-10-02 10:20:11, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
  📄 file: File(id='AgyNjaMS0OIULKARPJL3', suffix='.h5ad', accessor='AnnData', description='See dataset AgyNjaMS0OIULKARPJL3', size=838706, hash='UD18fqUz1eIFj64cNMbV0g', hash_type='md5', updated_at=2023-10-02 10:20:17, storage_id='7ZVN6khD', transform_id='Qr1kIHvK506rz8', run_id='yRdlmr6K2xK3m6ZYbIKB', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-02 10:18:00)
Features:
  external: FeatureSet(id='iOsJ5gsfHpxohnmgiYk5', n=5, registry='core.Feature', hash='Xx0XNmoj1n5Atm52X0Zw', updated_at=2023-10-02 10:20:17, modality_id='9FuQHsY3', created_by_id='DzTjkKse')
    🔗 donor (12, core.ULabel): 'A37', 'A31', 'A36', 'A29', '637C', 'D503', '640C', 'D496', '621B', '582C', ...
    🔗 species (1, bionty.Species): 'human'
    🔗 cell_type (39, bionty.CellType): 'animal cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'monocyte', 'cytotoxic T cell', 'progenitor cell', 'macrophage', 'gamma-delta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'classical monocyte', 'regulatory T cell', ...
    🔗 tissue (17, bionty.Tissue): 'thymus', 'ileum', 'blood', 'jejunal epithelium', 'lung', 'duodenum', 'thoracic lymph node', 'skeletal muscle tissue', 'spleen', 'mesenteric lymph node', ...
    🔗 assay (4, bionty.ExperimentalFactor): '10x 5' v1', 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'thymus', 'ileum', 'blood', 'jejunal epithelium', 'lung', 'duodenum', 'thoracic lymph node', 'skeletal muscle tissue', 'spleen', 'mesenteric lymph node', ...
  🏷️ cell_types (39, bionty.CellType): 'animal cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'monocyte', 'cytotoxic T cell', 'progenitor cell', 'macrophage', 'gamma-delta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'classical monocyte', 'regulatory T cell', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 5' v1', 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2'
  🏷️ ulabels (12, core.ULabel): 'A37', 'A31', 'A36', 'A29', '637C', 'D503', '640C', 'D496', '621B', '582C', ...

# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna

💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env

✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna