Crunch 1 – Oct 28 to Feb 9 – Predict gene expression
Predict gene expression in spatial transcriptomics data from matched pathology images
Evaluation Phases
In Crunch 1, you will have the opportunity to evaluate your model’s predictive performance on a validation dataset, before submission of your test dataset predictions.
There will be multiple validation checkpoints:
Checkpoint 1 - November 30th (Eastern Time 17:59)
Checkpoint 2 - December 16th (Eastern Time 17:59)
Checkpoint 3 - December 30th (Eastern Time 17:59)
Checkpoint 4 - January 13th (Eastern Time 17:59)
Checkpoint 5 - January 27th (Eastern Time 17:59)
Continuous Public Leaderboard - January 20th
Last submission - February 9th (Eastern Time 17:59)
Overview
In Crunch 1, you will train an algorithm to predict spatial transcriptomics data (gene expression in each cell) from matched H&E images. In other words, you will predict the gene expression (Y) of cells in specific tissue patches from the H&E images (X) and the surrounding spatial transcriptomics data.
X (Input):
HE_original: The original H&E image in its native pixel coordinates. Alignment from the H&E native coordinate system to the Xenium coordinate system has already been handled on our end. If you prefer to handle alignment yourself, you can use HE_original and DAPI (provided in crunch1_max), but this may require additional processing.
HE_nuc_original: The nucleus segmentation mask of the H&E image, in the H&E native coordinate system. The cell_id values in this segmentation mask match those of the nucleus-by-gene matrix stored in anucleus.
Y:
anucleus: This file contains the aggregated gene expression data for each nucleus. It is log1p-normalized and stores the gene expression profiles for 460 genes per nucleus. This is the primary target (Y) for your model.
Linking the H&E image to spatial transcriptomics
Steps to align X and Y:
Step 1: Identify nuclei in the H&E image
Use the nucleus segmentation masks:
H&E nucleus segmentation (HE_nuc_original): This mask identifies the location of nuclei in the original H&E image (i.e. HE_original).
Step 2: Link gene expression to H&E images
For each nucleus in the H&E image, use the anucleus file to get the corresponding gene expression profile (Y) for that nucleus.
The anucleus file provides the gene expression data, where each row corresponds to a nucleus (cell) and each column corresponds to a gene.
The nuclei IDs from the segmentation masks (e.g., from HE_nuc_original) will match the IDs used in the anucleus file.
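The ID matching described in Step 2 can be sketched on small stand-in arrays. This is only an illustration: the toy `mask`, the `expr_cell_ids` vector, and the convention that label 0 is background are assumptions, not something the dataset description confirms. With the real data you would read the IDs and the matrix from HE_nuc_original and anucleus instead.

```python
import numpy as np

# Stand-in for the HE_nuc_original mask. Assumption: 0 is background and
# positive integers are cell_id values.
mask = np.array([
    [0, 0, 7, 7],
    [0, 3, 7, 7],
    [3, 3, 0, 0],
])

# Stand-in for anucleus: one row per nucleus, one column per gene.
expr_cell_ids = np.array([3, 7, 9])  # hypothetical row IDs
expr = np.array([
    [0.0, 1.2],
    [0.7, 0.0],
    [2.1, 0.3],
])

# Map each cell_id to its row index in the expression matrix.
row_of = {int(cid): i for i, cid in enumerate(expr_cell_ids)}

# IDs present in the mask, excluding the assumed background label 0.
mask_ids = [int(c) for c in np.unique(mask) if c != 0]

# Keep only nuclei that appear in both the mask and the expression matrix.
linked = {cid: expr[row_of[cid]] for cid in mask_ids if cid in row_of}

for cid, profile in linked.items():
    print(cid, profile)
```

Nucleus 9 has an expression row but no pixels in this toy mask, so it is dropped from `linked`; whether such cases occur in the real data is not stated here.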
If you open the image HE_nuc_original, e.g. through mask=sdata['HE_nuc_original'][0].to_numpy(), you can directly find the location of a cell with a given cell_id through mask==cell_id.
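As a minimal sketch of how the `mask==cell_id` lookup might be used, the helper below finds a nucleus centroid and crops a square image patch around it. The function name `nucleus_patch`, the `half_size` parameter, and the (rows, columns, channels) image layout are assumptions made for illustration; the real HE_original array may be stored channel-first and would then need transposing.

```python
import numpy as np

def nucleus_patch(image, mask, cell_id, half_size=2):
    """Crop a square patch centred on the nucleus labelled cell_id.

    image: array of shape (rows, cols, channels) -- assumed layout.
    mask:  2-D integer array of nucleus labels with the same rows/cols.
    """
    rows, cols = np.nonzero(mask == cell_id)
    if rows.size == 0:
        raise ValueError(f"cell_id {cell_id} not found in mask")

    # Centroid of the nucleus pixels, rounded to the nearest pixel.
    cy, cx = int(round(rows.mean())), int(round(cols.mean()))

    # Clip the crop window to the image bounds.
    y0, y1 = max(cy - half_size, 0), min(cy + half_size + 1, mask.shape[0])
    x0, x1 = max(cx - half_size, 0), min(cx + half_size + 1, mask.shape[1])
    return image[y0:y1, x0:x1], (cy, cx)

# Toy data: a 10x10 mask with one 3x3 nucleus labelled 42.
mask = np.zeros((10, 10), dtype=int)
mask[4:7, 5:8] = 42
image = np.arange(10 * 10 * 3).reshape(10, 10, 3)

patch, centre = nucleus_patch(image, mask, 42)
print(centre, patch.shape)
```

Patches near the image border come out smaller than `2 * half_size + 1` because the window is clipped rather than padded; padding is an equally reasonable choice.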
The datasets are stored in a SpatialData object. Learn more about this format here.
In the minimum version of the data provided for crunch1 (in crunch1_min.tar), only HE_original, HE_nuc_original, anucleus and cell_id-group are provided.
Expected Output
The output consists of four columns:
cell_id: contains the held-out nuclei (both validation and test tissue regions).
gene: the gene among the 460 genes to be predicted.
prediction: the gene expression value, rounded to two decimal places.
sample: the tissue sample among the 8 samples to process.
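The four columns above describe a long-format table with one row per (cell, gene) pair. The sketch below shows one way to build such a table from a cells-by-genes prediction matrix. The helper name `to_long_table`, the gene names, and the sample label are hypothetical, and the exact container the submission interface expects is not specified in this section.

```python
import numpy as np
import pandas as pd

def to_long_table(pred, cell_ids, genes, sample):
    """Flatten a (n_cells, n_genes) matrix into cell_id/gene/prediction/sample rows."""
    pred = np.asarray(pred, dtype=float)
    n_cells, n_genes = pred.shape
    return pd.DataFrame({
        # Each cell_id is repeated once per gene ...
        "cell_id": np.repeat(cell_ids, n_genes),
        # ... and the gene list is cycled once per cell (row-major order).
        "gene": np.tile(genes, n_cells),
        "prediction": np.round(pred.ravel(), 2),
        "sample": sample,
    })

table = to_long_table(
    pred=[[0.123, 1.5], [2.0, 0.456]],
    cell_ids=[11, 12],
    genes=["GENE_A", "GENE_B"],   # placeholder gene names
    sample="sample_1",            # placeholder sample label
)
print(table)
```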
Make sure your predictions are log1p-normalized with a scale factor of 100, as in anucleus.X.
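The normalization is not spelled out further here. A common reading, assumed in the sketch below, is that each nucleus's counts are rescaled so they sum to the scale factor (100) and then passed through log(1 + x). If anucleus.X was produced differently, this function would not reproduce it exactly; `log1p_normalize` is a hypothetical helper name.

```python
import numpy as np

def log1p_normalize(counts, scale_factor=100.0):
    """Per-nucleus total-count scaling followed by log1p (assumed procedure)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Nuclei with zero total counts stay all-zero instead of dividing by zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.log1p(counts / safe_totals * scale_factor)

# Rows are nuclei, columns are genes (toy raw counts).
raw = np.array([[1, 3], [0, 0], [5, 5]])
normalized = log1p_normalize(raw)
print(normalized)
```

If your model is trained directly on anucleus.X, its outputs are already on this scale and no extra transformation should be needed.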
Scoring
The scoring metric is a cell-wise Spearman correlation.
A Mean Squared Error metric is also computed, and its value must be below 0.2. Since the baseline is 0.1, a model with an MSE at or above this threshold is not considered viable and will not be eligible for rewards.
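To check a model locally against both criteria, something like the following can be used. It is a sketch, not the official scorer: it assumes "cell-wise" means one Spearman correlation per cell across genes, averaged over cells, and that MSE is taken over all matrix entries. How the real scorer handles ties or constant rows is not documented here.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks of a 1-D array, with ties given their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    mean_rank = last_rank - (counts - 1) / 2.0
    return mean_rank[inverse]

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    denom = np.sqrt((ra ** 2).sum() * (rb ** 2).sum())
    # A constant vector has no defined rank correlation.
    return float((ra * rb).sum() / denom) if denom > 0 else float("nan")

def local_score(y_true, y_pred):
    """Return (mean per-cell Spearman, overall MSE) for cells-by-genes matrices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_cell = [spearman(t, p) for t, p in zip(y_true, y_pred)]
    return float(np.nanmean(per_cell)), float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]])
y_pred = np.array([[0.1, 0.9, 2.5], [0.1, 0.2, 0.3]])
mean_rho, mse = local_score(y_true, y_pred)
print(mean_rho, mse)
```

In this toy case the first cell is ranked perfectly and the second is ranked in reverse, so the two correlations cancel out.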
Submit
To build a valid submission, your model needs to be coded within the infer function, respecting the crunch code submission interface.
Data Variants
Due to the large size of the datasets, Crunch provides both a small (a.k.a. default) and a large version.
Depending on your local setup and goals within Crunch, you can choose either one.
By default, the small dataset is downloaded.
To access the larger dataset, specify it explicitly with a different CLI command:
The larger version contains the Xenium transcriptomic data. It allows you to know both the gene expression and the coordinates (x, y, z) of each gene's position in the cells.
More details about the transcriptomic data are available in the full documentation.
The large variant is for local use only.
The Cloud Environment will always use the default dataset.