Append a new batch of data

We have one file in storage and are about to receive a new batch of data.

This notebook shows how to validate and register the new file, then append it to the existing dataset as a new version.

import lamindb as ln
import lnschema_bionty as lb
import readfcs

lb.settings.species = "human"

💡 loaded instance: testuser1/test-facs (lamindb 0.54.4)

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.54.4 lnschema_bionty==0.31.2 pytometry==0.1.4 readfcs==1.1.6 scanpy==1.9.5

💡 Transform(id='SmQmhrhigFPLz8', name='Append a new batch of data', short_name='facs1', version='0', type=notebook, updated_at=2023-10-02 10:21:18, created_by_id='DzTjkKse')

💡 Run(id='MAsFiWynt6ZzShK6Xhp5', run_at=2023-10-02 10:21:18, transform_id='SmQmhrhigFPLz8', created_by_id='DzTjkKse')

Ingest a new file

Access

Let us validate and register another .fcs file:

filepath = ln.dev.datasets.file_fcs()

adata = readfcs.read(filepath)

adata

AnnData object with n_obs × n_vars = 65016 × 16
    var: 'n', 'channel', 'marker', '$PnB', '$PnR', '$PnG'
    uns: 'meta'

Transform: normalize

import anndata as ad
import pytometry as pm

pm.pp.split_signal(adata, var_key="channel")

pm.tl.normalize_biExp(adata)

adata = adata[  # subset to rows that do not have nan values
    adata.to_df().isna().sum(axis=1) == 0
]

adata.to_df().describe()

	KI67	CD3	CD28	CD45RO	CD8	CD4	CD57	CD14	CCR5	CD19	CD27	CCR7	CD127
count	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000	64593.000000
mean	995.527609	991.177964	987.078176	992.669879	997.885007	991.256722	992.786610	991.668695	1005.040547	990.848286	992.297503	1000.381267	1004.603404
std	1250.888451	1247.088234	1243.992707	1247.520841	1251.943558	1247.162966	1249.840684	1244.215164	1258.626730	1247.542888	1246.295504	1251.073884	1254.980319
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312	457.194312
50%	462.811405	462.810928	462.810928	462.810928	462.811394	462.810928	462.811362	462.810928	462.811424	462.810928	462.810928	462.811400	462.811491
75%	1087.252514	1064.944577	939.866561	1067.880110	1096.877718	1033.079126	1037.662928	1063.171051	1119.130893	1002.628162	1092.763905	1121.682920	1174.015294
max	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000	4096.000000
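The row filter above keeps only events in which every channel has a value. As a minimal sketch of the same idea in plain Python (the toy `events` list and the helper `drop_rows_with_nan` are hypothetical and not part of pytometry or anndata), the last lines also work out how many events the filter removed in this run, using the 65016 rows read from the file and the 64593 rows reported in the `count` row:

```python
import math


def drop_rows_with_nan(rows):
    """Keep only rows in which no value is NaN."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


# Hypothetical toy events: each row is one event, each column one channel.
events = [
    [457.2, 462.8, 1087.3],
    [float("nan"), 462.8, 939.9],
    [0.0, 4096.0, 1064.9],
]
clean = drop_rows_with_nan(events)
print(len(events) - len(clean), "row(s) dropped from the toy data")

# Events removed in the notebook run: rows read minus rows that survived.
n_before, n_after = 65016, 64593
dropped = n_before - n_after
print(f"{dropped} of {n_before} events dropped ({dropped / n_before:.2%})")
```

Fewer than 1% of events are lost here, so dropping them is unlikely to distort the summary statistics.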

Validate cell markers

Let’s see how many markers validate:

validated = lb.CellMarker.validate(adata.var.index)

❗ 7 terms (53.80%) are not validated for name: KI67, CD45RO, CD4, CD14, CCR5, CD19, CCR7
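Conceptually, validation is a membership test of each name against a registry, returning one boolean per name. The sketch below illustrates that idea only; it is not how `lb.CellMarker.validate` is implemented, and the `registry` set is an assumption containing just the six names that passed above:

```python
def validate(names, registry):
    """Return one boolean per name: True if the name is in the registry."""
    return [name in registry for name in names]


marker_names = [
    "KI67", "CD3", "CD28", "CD45RO", "CD8", "CD4", "CD57",
    "CD14", "CCR5", "CD19", "CD27", "CCR7", "CD127",
]
# Assumption: a toy registry holding only the names that validated above.
registry = {"CD3", "CD28", "CD8", "CD57", "CD27", "CD127"}

mask = validate(marker_names, registry)
not_validated = [name for name, ok in zip(marker_names, mask) if not ok]
share = len(not_validated) / len(marker_names)
print(f"{len(not_validated)} of {len(marker_names)} terms ({share:.1%}) are not validated")
print(", ".join(not_validated))
```

Seven of thirteen names fail, which is the roughly 54% reported in the log line.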

Let’s standardize and re-validate:

adata.var.index = lb.CellMarker.standardize(adata.var.index)
validated = lb.CellMarker.validate(adata.var.index)

❗ found 1 synonym in Bionty: ['KI67']
   please add corresponding CellMarker records via `.from_values(['Ki67'])`

❗ 3 terms (23.10%) are not validated for name: Ki67, CD45RO, CCR5
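Standardization can be pictured as mapping each synonym to its canonical name before validating again. The following is a rough sketch under stated assumptions: the `synonyms` table is inferred from the standardized names that appear in this notebook's outputs (for example `Ki67`, `Cd4`, `Ccr7`), the `registry` set is a toy stand-in, and neither reflects how Bionty stores synonyms internally:

```python
def standardize(names, synonyms):
    """Replace each known synonym with its canonical name; keep other names as-is."""
    return [synonyms.get(name, name) for name in names]


# Assumption: synonym table inferred from the names seen in this notebook.
synonyms = {"KI67": "Ki67", "CD4": "Cd4", "CD14": "Cd14", "CD19": "Cd19", "CCR7": "Ccr7"}
# Assumption: toy registry; Ki67 is a known synonym target but not yet registered.
registry = {"CD3", "CD28", "CD8", "CD57", "CD27", "CD127", "Cd4", "Cd14", "Cd19", "Ccr7"}

marker_names = [
    "KI67", "CD3", "CD28", "CD45RO", "CD8", "CD4", "CD57",
    "CD14", "CCR5", "CD19", "CD27", "CCR7", "CD127",
]
standardized = standardize(marker_names, synonyms)
still_missing = [name for name in standardized if name not in registry]
print(still_missing)
print(f"{len(still_missing) / len(standardized):.1%} not validated")
```

Mapping synonyms first shrinks the unvalidated set from seven names to three, which is why standardizing before registering new records avoids creating duplicates under different spellings.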

Next, register non-validated markers from Bionty:

records = lb.CellMarker.from_values(adata.var.index[~validated])
ln.save(records)

Now they pass validation:

validated = lb.CellMarker.validate(adata.var.index)
assert all(validated)

Register

modalities = ln.Modality.lookup()
features = ln.Feature.lookup()
efs = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
markers = lb.CellMarker.lookup()

file = ln.File.from_anndata(
    adata,
    description="Flow cytometry file 2",
    field=lb.CellMarker.name,
    modality=modalities.protein,
)

/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1230: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c

... storing '$PnR' as categorical

/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1230: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c

... storing '$PnG' as categorical

❗    3 terms (100.00%) are not validated for name: FSC-A, FSC-H, SSC-A

❗    no validated features, skip creating feature set

file.save()

file.labels.add(efs.fluorescence_activated_cell_sorting, features.assay)
file.labels.add(species.human, features.species)

file.features

Features:
  var: FeatureSet(id='7J3Su8wy2U2PnMiLgeGq', n=13, type='number', registry='bionty.CellMarker', hash='cInZdHy3fspNNLGysq01', updated_at=2023-10-02 10:21:21, modality_id='EFBdjpXy', created_by_id='DzTjkKse')
    'CD3', 'Ccr7', 'CD28', 'CD45RO', 'Cd14', 'Cd4', 'Cd19', 'CD57', 'CD127', 'Ki67', 'CCR5', 'CD27', 'CD8'
  external: FeatureSet(id='W8W3sdej2JMu852vJAcA', n=2, registry='core.Feature', hash='ImQPRYuWxRjMVJdunJJg', updated_at=2023-10-02 10:21:21, modality_id='shAk3Do6', created_by_id='DzTjkKse')
    🔗 assay (1, bionty.ExperimentalFactor): 'fluorescence-activated cell sorting'
    🔗 species (1, bionty.Species): 'human'

View data flow:

file.view_flow()

https://d33wubrfki0l68.cloudfront.net/af4b307143f076345579d82ebf63412e98e62021/a7b6f/_images/eb914b975754f79b163f7eef8d2aab979668f1a5ee519684f79316a5e7665f51.svg

Inspect a PCA for QC; this dataset looks much like noise:

import scanpy as sc

sc.pp.pca(adata)
sc.pl.pca(adata, color=markers.cd14.name)
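One way to put a number on "looks like noise" is the share of variance captured by the first principal component: for structureless data the components explain roughly equal shares. The sketch below shows this for two variables only, using the closed-form eigenvalues of a 2×2 covariance matrix; the helper name and toy data are hypothetical. With scanpy, the per-component ratios should be available in `adata.uns["pca"]["variance_ratio"]` after `sc.pp.pca`, though that is worth checking against the installed version.

```python
import math


def explained_variance_ratio_2d(xs, ys):
    """Share of total variance captured by the first principal component of 2-D data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix entries.
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] in closed form.
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy**2)
    return largest / (sxx + syy)


# Perfectly correlated toy data: one component explains everything.
structured = explained_variance_ratio_2d([1, 2, 3, 4], [2, 4, 6, 8])
# Uncorrelated toy data with equal spread: both components explain half.
unstructured = explained_variance_ratio_2d([1, -1, -1, 1], [1, 1, -1, -1])
print(round(structured, 3), round(unstructured, 3))
```

A ratio close to one over the number of components, as in the second case, is what a noise-like dataset would show.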

https://d33wubrfki0l68.cloudfront.net/d2452c86170ea75df31c4a01a36db379fa9354f9/34870/_images/fcfcbe97539c40ce37329c02a2c7cea58eaf727cae4228a293ee5d1d7a825d14.png

Create a new version of the dataset by appending a file

Query the old version:

dataset_v1 = ln.Dataset.filter(name="My versioned cytometry dataset").one()

dataset_v2 = ln.Dataset(
    [file, dataset_v1.file], is_new_version_of=dataset_v1, version="2"
)
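The call above builds a second version that contains both files and points back to the first version. As a conceptual model only (the `DatasetVersion` class and `new_version` helper are hypothetical and do not mirror lamindb's internals; the version string `"1"` and the placeholder description of the first file are also assumptions), the relationship can be sketched like this:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical stand-in for a versioned dataset record."""

    name: str
    version: str
    files: Tuple[str, ...]
    initial_version: Optional["DatasetVersion"] = None


def new_version(previous, files, version):
    """Create a new version that keeps the name and links to the first version."""
    root = previous.initial_version or previous
    return DatasetVersion(previous.name, version, tuple(files), root)


# Assumption: the first version holds a single file with a placeholder description.
v1 = DatasetVersion("My versioned cytometry dataset", "1", ("file 1 (placeholder)",))
# The new file comes first, followed by the files of the previous version.
v2 = new_version(v1, ["Flow cytometry file 2", *v1.files], "2")
print(v2.name, v2.version, v2.files)
```

Every later version links to the same initial version, so the whole version family can be found from any member.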

dataset_v2

Dataset(id='UYxBk5c2glqsNytkYhPu', name='My versioned cytometry dataset', version='2', hash='cSKkfcii0eGS8TGGTW53', transform_id='SmQmhrhigFPLz8', run_id='MAsFiWynt6ZzShK6Xhp5', initial_version_id='UYxBk5c2glqsNytkYhWE', created_by_id='DzTjkKse')

dataset_v2.features

Features:
  var: FeatureSet(id='vfekByORnfOKR2m3c12C', n=48, type='number', registry='bionty.CellMarker', hash='lta50RjC3dMs1x5JqZxy', created_by_id='DzTjkKse')
    'CD3', 'Ccr7', 'CD28', 'CD45RO', 'Cd14', 'Cd4', 'Cd19', 'CD57', 'CD127', 'Ki67', 'CCR5', 'CD27', 'CD8', 'Ccr7', 'CD27', 'CD33', 'CD3', 'CD16', 'CXCR3', 'CD38', ...
  obs: FeatureSet(id='D9IelF210m2KvoLE4tz2', n=5, registry='core.Feature', hash='kN_l0cF14_oL_mMi1lHi', updated_at=2023-10-02 10:21:10, modality_id='shAk3Do6', created_by_id='DzTjkKse')
    Time (number)
    Dead (number)
    Bead (number)
    Cell_length (number)
    (Ba138)Dd (number)
  external: FeatureSet(id='W8W3sdej2JMu852vJAcA', n=2, registry='core.Feature', hash='ImQPRYuWxRjMVJdunJJg', updated_at=2023-10-02 10:21:21, modality_id='shAk3Do6', created_by_id='DzTjkKse')
    🔗 assay (0, bionty.ExperimentalFactor): 
    🔗 species (0, bionty.Species): 
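The `var` listing repeats some names, such as `Ccr7`, `CD27`, and `CD3`, which suggests that the two files share part of their marker panels. The sketch below compares the 13 markers of the new file with the names visible after them in the output. It assumes those later names belong to the first file, and it uses only the seven names shown before the `...`, so the result is partial:

```python
# Markers of the file registered in this notebook (from `file.features`).
file2_markers = [
    "CD3", "Ccr7", "CD28", "CD45RO", "Cd14", "Cd4", "Cd19",
    "CD57", "CD127", "Ki67", "CCR5", "CD27", "CD8",
]
# Assumption: these belong to the first file. Only the names visible
# before the "..." in the output are listed; the rest is truncated.
file1_markers_visible = ["Ccr7", "CD27", "CD33", "CD3", "CD16", "CXCR3", "CD38"]

file1_set = set(file1_markers_visible)
file2_set = set(file2_markers)

shared = [m for m in file2_markers if m in file1_set]
only_in_file1 = [m for m in file1_markers_visible if m not in file2_set]
print("shared:", shared)
print("only in first file (visible part):", only_in_file1)
```

Knowing which markers overlap helps decide whether measurements from the two batches can be compared directly.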

dataset_v2.save()

dataset_v2.labels.add(efs.fluorescence_activated_cell_sorting, features.assay)
dataset_v2.labels.add(species.human, features.species)

dataset_v2.view_flow()

https://d33wubrfki0l68.cloudfront.net/c8fa437ef568e653df90123df7e5e4fb7f5a7991/e766a/_images/c73cfdcb611e40d3a2381b872c78de38c37b391d30f25e111291b2b872da67dc.svg