
Standardize and append a batch of data

Here, we’ll learn

  • how to standardize a less well curated collection

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.context.uid = "ManDYgmftZ8C0000"
ln.context.track()
💡 connected lamindb: testuser1/test-scrna
💡 notebook imports: bionty==0.48.1 lamindb==0.76.0
💡 created Transform('ManDYgmftZ8C0000') & created Run('2024-08-16 09:30:21.294892+00:00')

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We are still working with human data, and can globally set an organism:

bt.settings.organism = "human"
curate = ln.Curate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.cell_type.name: bt.CellType.name},
)
💡 3 non-validated categories are not saved in Feature.name: ['n_genes', 'louvain', 'percent_mito']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns

Standardize & validate genes

Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is not unique: a symbol can match more than one Ensembl ID, and the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
💡 standardized 749/765 terms
❗ found 5 symbols in Bionty: ['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
   please add corresponding Gene records via `.from_values(['ENSG00000233276', 'ENSG00000262074', 'ENSG00000291237', 'ENSG00000276168', 'ENSG00000254709'])`
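
To make the "first match is kept" behavior concrete, here is a minimal pure-Python sketch. It does not call bionty: the lookup table, the gene symbols and IDs, and the helpers `standardize_first` and `validate` are all hypothetical and only mimic the steps above. It maps symbols to IDs keeping the first match, leaves unmapped symbols unchanged (an assumption made for this illustration), and then builds a boolean mask that keeps only validated IDs.

```python
# Hypothetical (symbol, ensembl_gene_id) pairs; symbols and IDs are made up.
LOOKUP = [
    ("GENE_A", "ENSG_FAKE_001"),
    ("GENE_A", "ENSG_FAKE_002"),  # second match for GENE_A, ignored below
    ("GENE_B", "ENSG_FAKE_003"),
]


def standardize_first(symbols, lookup):
    """Map each symbol to the first ID listed for it; keep unmapped symbols as-is."""
    first_match = {}
    for symbol, gene_id in lookup:
        first_match.setdefault(symbol, gene_id)  # later duplicates are ignored
    return [first_match.get(symbol, symbol) for symbol in symbols]


def validate(values, lookup):
    """Return one boolean per value: True if the value is a known ID."""
    known_ids = {gene_id for _, gene_id in lookup}
    return [value in known_ids for value in values]


symbols = ["GENE_A", "GENE_B", "GENE_C"]
ids = standardize_first(symbols, LOOKUP)
mask = validate(ids, LOOKUP)
kept = [gene_id for gene_id, ok in zip(ids, mask) if ok]

print(ids)   # ['ENSG_FAKE_001', 'ENSG_FAKE_003', 'GENE_C']
print(kept)  # ['ENSG_FAKE_001', 'ENSG_FAKE_003']
```

The unmapped symbol `GENE_C` survives standardization but fails validation, which is why the notebook subsets the data with the `validated` mask afterwards.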

Here, we’ll also keep .raw, subset to the validated genes and indexed by the same Ensembl IDs:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curate = ln.Curate.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name},
)
💡 3 non-validated categories are not saved in Feature.name: ['n_genes', 'louvain', 'percent_mito']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
curate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
💡 mapping cell_type on CellType.name
9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
      → save terms via .add_new_from('cell_type')
False
curate.add_validated_from_var_index()
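
The validation result above (a list of non-validated terms and an overall `False`) can be pictured with a small stdlib-only sketch. It is not lamindb's implementation: the helper `validate_categories` and the registry content are hypothetical, and the check is a plain exact-match lookup.

```python
def validate_categories(values, registry_names):
    """Return (all_valid, non_validated) for the unique values of a column."""
    unique_values = list(dict.fromkeys(values))  # unique values, order preserved
    non_validated = [v for v in unique_values if v not in registry_names]
    return len(non_validated) == 0, non_validated


# Hypothetical registry content; the real registry lives in the database.
registry_names = {"monocyte", "dendritic cell"}
cell_types = ["CD14+ Monocytes", "Dendritic cells", "CD14+ Monocytes"]

ok, non_validated = validate_categories(cell_types, registry_names)
print(ok)             # False
print(non_validated)  # ['CD14+ Monocytes', 'Dendritic cells']
```

Because the comparison is exact, labels such as 'CD14+ Monocytes' fail even though a closely related registry name exists, which motivates the search-based mapping in the next section.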

Standardize & validate cell types

Since none of the cell types are validated, let us search for each cell type name in the public ontology and add the name found in the AnnData object as a synonym of the top match.

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000911'
❗ CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000795'
❗ CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✅ loaded 1 CellType record matching ontology_id: 'CL:0000860'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001054'
❗ CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002051'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000952'
❗ CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
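
The loop above can be sketched without a database as follows. This is an illustration under stated assumptions, not bionty's search: the miniature ontology and its IDs are made up, and the token-overlap scorer `search_top` is a hypothetical stand-in for the real ranking.

```python
# Hypothetical miniature ontology: ontology ID -> standardized name (IDs are made up).
ONTOLOGY = {
    "CL:FAKE1": "dendritic cell",
    "CL:FAKE2": "monocyte",
    "CL:FAKE3": "natural killer cell",
}


def tokens(text):
    """Lowercase, split on whitespace, and strip trailing 's' characters."""
    return {word.rstrip("s") for word in text.lower().split()}


def search_top(name, ontology):
    """Return the ontology ID whose name shares the most tokens with `name`."""
    query = tokens(name)
    return max(ontology, key=lambda oid: len(query & tokens(ontology[oid])))


name_mapper = {}
synonyms = {}  # standardized name -> original names recorded as synonyms
for name in ["Dendritic cells", "CD14+ Monocytes"]:
    ontology_id = search_top(name, ONTOLOGY)
    standardized = ONTOLOGY[ontology_id]
    name_mapper[name] = standardized
    synonyms.setdefault(standardized, []).append(name)

print(name_mapper)
# {'Dendritic cells': 'dendritic cell', 'CD14+ Monocytes': 'monocyte'}
```

A search always returns some top match, even for a name that has no good counterpart, so printing `name_mapper` before applying it is a cheap sanity check.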

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
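
When a dict is passed to pandas `Series.map`, values that are missing from the dict become NaN. In this notebook every name was added to `name_mapper` by the loop, so nothing is lost. For mappers built by hand, a coverage check like the following sketch can help; `apply_mapper` is a hypothetical helper and works on plain lists rather than on the AnnData object.

```python
def apply_mapper(values, mapper):
    """Map every value through `mapper`, failing loudly on unmapped values."""
    missing = sorted(set(values) - set(mapper))
    if missing:
        raise KeyError(f"no mapping for: {missing}")
    return [mapper[value] for value in values]


# Hypothetical mapper from original labels to standardized names.
mapper = {"Dendritic cells": "dendritic cell", "CD14+ Monocytes": "monocyte"}

mapped = apply_mapper(["CD14+ Monocytes", "Dendritic cells"], mapper)
print(mapped)  # ['monocyte', 'dendritic cell']
```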

Now, all cell types are validated:

curate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name
True

Register

artifact = curate.save_artifact(description="10x reference adata")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/ZklPeK6odPwGoyt9ZvMP.h5ad')
✅ storing artifact 'ZklPeK6odPwGoyt9ZvMP' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/ZklPeK6odPwGoyt9ZvMP.h5ad'
💡 parsing feature names of X stored in slot 'var'
749 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='q9dlSlOLP349B3y7RZsq', n=749, dtype='float', registry='bionty.Gene', hash='o70Gw1y_TnH190ggJ4FwgA', created_by_id=1, run_id=2)
💡 parsing feature names of slot 'obs'
1 term (25.00%) is validated for name
3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅    linked: FeatureSet(uid='sNp9qhUicsATutSqTzuH', n=1, registry='Feature', hash='WX38VhDsIVP0wIHn-JxUww', created_by_id=1, run_id=2)
✅ saved 2 feature sets for slots: 'var','obs'
artifact.view_lineage()
[Figure: lineage graph rendered by artifact.view_lineage()]

Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.ordered_artifacts.first()],
    is_new_version_of=collection_v1,
)
collection_v2.save()
💡 adding collection ids [1] as inputs for run 2, adding parent transform 1
💡 adding artifact ids [1] as inputs for run 2, adding parent transform 1
Collection(uid='zIay5QcJ7xPWQi7DozgB', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='dBJLoG6NFZ8WwlWqnfyFdQ', visibility=1, created_by_id=1, transform_id=2, run_id=2, updated_at='2024-08-16 09:30:45 UTC')
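
As a mental model only, the relationship between the two versions can be sketched with a small dataclass. This is not how lamindb stores collections: `ToyCollection`, `new_version`, and the artifact labels are hypothetical, and the version bump is a simple integer increment chosen for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyCollection:
    name: str
    artifacts: List[str]
    version: str = "1"
    previous: Optional["ToyCollection"] = None  # link to the version this one supersedes


def new_version(previous, artifacts):
    """Create the next version, inheriting the name and bumping the version."""
    return ToyCollection(
        name=previous.name,
        artifacts=list(artifacts),
        version=str(int(previous.version) + 1),
        previous=previous,
    )


v1 = ToyCollection("My versioned scRNA-seq collection", ["artifact_1"])
# The new version holds the new artifact plus the artifact underlying version 1.
v2 = new_version(v1, ["artifact_2", *v1.artifacts])

print(v2.version, v2.artifacts)  # 2 ['artifact_2', 'artifact_1']
```

Version 1 stays untouched in this model; version 2 only references it, which mirrors the role of `is_new_version_of` in the call above.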

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()
Collection(uid='zIay5QcJ7xPWQi7DozgB', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='dBJLoG6NFZ8WwlWqnfyFdQ', visibility=1, updated_at='2024-08-16 09:30:45 UTC')
  Provenance
    .created_by = 'testuser1'
    .transform = 'Standardize and append a batch of data'
    .run = '2024-08-16 09:30:21 UTC'
  Feature sets
    'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
    'obs' = 'donor', 'tissue', 'cell_type', 'assay'

View data lineage:

collection_v2.view_lineage()
[Figure: lineage graph rendered by collection_v2.view_lineage()]