Crunch 2 – Nov 18 to Mar 21 – Predicting Unseen Genes
In Crunch 2, your task is to predict the expression levels of genes that were not measured in a spatial transcriptomics dataset. You will use both spatial data and single-cell RNA sequencing (scRNA-Seq) data from similar colon tissue samples to make these predictions.
Spatial Data: The .zarr data provided in .
scRNA-Seq Data: The Crunch2_scRNAseq.h5ad file contains gene expression data for 18,615 protein-coding genes, including the 460 genes in the Spatial Data object.
Gene Expression Predictions: The expression levels of 2,000 genes.
In Crunch 2, you will have the opportunity to evaluate your model’s predictive performance on a validation dataset, before submission of your test dataset predictions.
There will be checkpoints every:
Friday — to get your scores before the weekend
Monday — to see how your weekend work stacks up
The final submission is due by March 21st at 17:59 Eastern Time.
Colon tissue samples similar to those profiled by Xenium spatial transcriptomics.
These datasets contain single-cell gene expression data for 18,615 protein-coding genes, including the 460 genes in the Spatial Data.
We provide datasets (atlases) from multiple studies to represent all the cell types that are found in the colon tissue.
scRNA-Seq.obs
Cell Type: scRNA-Seq.obs["annotation"]
Study: scRNA-Seq.obs["study"]
Individual: scRNA-Seq.obs["individual"]
Disease Status: scRNA-Seq.obs["status"]
scRNA-Seq.X: log1p-normalized counts
Original raw counts per cell are divided by the sum of counts per cell, multiplied by 10,000, and then log1p-transformed.
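The normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the organizers' pipeline: the function name log1p_normalize, the target_sum parameter, and the toy count matrix are assumptions made for the example.

```python
import math

def log1p_normalize(counts, target_sum=10_000):
    """Normalize each cell to `target_sum` total counts, then apply log1p.

    `counts` is a list of cells, each a list of raw gene counts.
    (Hypothetical helper; names are not part of the competition code.)
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # A cell with no counts stays all zeros (avoids division by zero).
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(c / total * target_sum) for c in cell])
    return normalized

# Toy example: 2 cells x 3 genes of raw counts.
raw = [[1, 3, 0], [0, 0, 5]]
norm = log1p_normalize(raw)
```

Applying math.expm1 to a normalized cell and summing the result recovers the target sum of 10,000, which is a quick way to check that a matrix is in this representation.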
This representation displays the scRNAseq.X matrix in DataFrame format to clarify the structure of the CSR matrix.
The columns in the DataFrame are as follows:
Row: the row index corresponding to the observation index, accessible via scRNAseq.obs
Column: the column index corresponding to the gene index, accessible via scRNAseq.var
Value: normalized, log-transformed gene expression counts
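To make the Row / Column / Value view concrete, here is a small sketch of how a CSR matrix stores its non-zero entries and how they map to (row, column, value) triples. The helper csr_to_triples and the toy arrays are assumptions for illustration; with real data the three arrays correspond to the data, indices, and indptr attributes of a SciPy CSR matrix.

```python
def csr_to_triples(data, indices, indptr):
    """Expand CSR arrays into (row, column, value) triples.

    data    : non-zero values, stored row by row
    indices : column index of each stored value
    indptr  : indptr[r]..indptr[r + 1] delimits the entries of row r
    """
    triples = []
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            triples.append((row, indices[k], data[k]))
    return triples

# Toy 3 cells x 4 genes matrix (made-up values):
# [[0.0, 1.2, 0.0, 0.0],
#  [0.7, 0.0, 0.0, 2.1],
#  [0.0, 0.0, 0.0, 0.0]]
data = [1.2, 0.7, 2.1]
indices = [1, 0, 3]
indptr = [0, 1, 3, 3]

triples = csr_to_triples(data, indices, indptr)
```

Only non-zero entries appear in the triples, so a cell with no detected genes (the third row here) contributes no rows to the DataFrame view.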
scRNA-Seq.layers["counts"]
The output must be provided as a DataFrame with the following structure:
Index
Contains the cell_id values corresponding to the validation and test groups expected in the SpatialData (.zarr file provided by the infer function).
Columns
Contains 2,000 genes randomly selected from the 18,615 protein-coding genes in the scRNA-Seq data, including the 20 genes already measured by Xenium spatial transcriptomics but excluded from the Spatial Data object.
You can retrieve this list from the Crunch2_gene_list.csv file included in the competition dataset.
Values
Gene expression predictions for each cell and gene.
Predictions must be log1p-normalized and rounded to 2 decimal places.
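As a rough illustration of the expected shape, the sketch below builds a small prediction DataFrame with pandas. The cell IDs, gene names, and random values are placeholders (assumptions, not real competition data), and it assumes the model's raw output is on the normalized-count scale, so that log1p is applied before rounding.

```python
import numpy as np
import pandas as pd

# Placeholder identifiers; real cell_id values come from the SpatialData
# object, and the real gene list comes from Crunch2_gene_list.csv.
cell_ids = ["cell_1", "cell_2", "cell_3"]
genes = ["GENE_A", "GENE_B"]

# Stand-in for model output: non-negative values on the normalized-count scale.
rng = np.random.default_rng(0)
raw_pred = rng.random((len(cell_ids), len(genes))) * 50

# Index = cell IDs, columns = genes, values = log1p-transformed, 2 decimals.
prediction = pd.DataFrame(
    np.log1p(raw_pred),
    index=cell_ids,
    columns=genes,
).round(2)
```

A real submission would have one row per expected cell and one column for each of the 2,000 listed genes.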
Your predictions are evaluated on the 20 held-out genes using Spearman’s rank correlation for cells with non-zero expression. For cells with zero expression, a separate metric applies. Scores combine predictions across global and local regions for a balanced final score.
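The exact scoring code is not reproduced here, so the following is only a sketch of one ingredient: Spearman's rank correlation restricted to cells whose true expression is non-zero. It assumes the correlation is computed for a single gene across cells, and it does not model the separate zero-expression metric or the global/local combination. All function names and values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5

def spearman_nonzero(truth, pred):
    """Spearman correlation over cells where the true value is non-zero."""
    pairs = [(t, p) for t, p in zip(truth, pred) if t != 0]
    if len(pairs) < 2:
        return float("nan")
    t, p = zip(*pairs)
    return pearson(average_ranks(t), average_ranks(p))

# Made-up values for one gene across six cells.
truth = [0.0, 1.2, 0.5, 2.3, 0.0, 0.9]
pred = [0.4, 1.0, 0.3, 2.0, 0.1, 0.8]
rho = spearman_nonzero(truth, pred)
```

Because Spearman's correlation depends only on ranks, a prediction that orders the non-zero cells correctly scores well here even if its absolute values are off.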
: An atlas of ulcerative colitis patients, including inflamed, non-inflamed, and healthy colon tissue.
: An atlas of the enteric nervous system, including glial cells and neurons innervating the colon.
: An atlas of the colon muscle layer.
Provided as an AnnData object stored in an h5ad file: Crunch2_scRNAseq.h5ad.
Refer to the notebook for an example of how to format your submission.