Crunch 1 – Predicting the effect of held-out single-gene perturbations

Predict the single-cell transcriptomic response to unseen single-gene perturbations.

Evaluation Phases

In Crunch 1, you will have the opportunity to evaluate the predictive performance of your model on a validation dataset.

There will be multiple validation checkpoints, with one occurring every Monday at 6:00 p.m. UTC:

  • Checkpoint 1 - December 22nd

  • Checkpoint n - every Monday

  • Last checkpoint - February 23rd

  • Last submission - February 28th

    • Start of the selection period

  • End of the selection period - March 4th

You can still submit and run your code multiple times on the platform.

At each checkpoint, all of your not-yet-scored predictions will be scored. Predictions will also be scored at the beginning of the selection period.

Overview

In Crunch 1, we will explore how well we can predict the single-cell transcriptomic response to single-gene perturbations that are not included in the training dataset.

Dataset

The dataset includes perturbations targeting 157 genes, of which 150 are transcription factors (TFs). For each perturbation, we provide single-cell gene expression (RNA-seq) profiles measured at day 14 of adipocyte differentiation, annotated with gene perturbation identity, quality control (QC) metrics, and cell metadata. The training dataset contains a subset of these perturbations, while a distinct set of single-gene perturbations is held out for validation and test.

The layout is as follows:

  • The dataset is provided in AnnData format (.h5ad) as obesity_challenge_1.h5ad.

  • Normalized gene expression values are stored in adata.X. Raw counts were normalized to a target sum of 100,000 per cell, followed by a log2(1 + x) transformation (standard single-cell RNA-seq normalization; see lecture 2 of the crash course).

  • Raw gene expression counts prior to normalization are stored in adata.layers['counts'] for reproducibility and alternative preprocessing.

  • The perturbation target gene information is provided in adata.obs['gene'], with values corresponding to either “NC” for control cells or to the target gene name if the cell is perturbed. Control cells receive a perturbation that has no effect on the cell’s RNA-Seq profile.

  • Cell state/program enrichment information is provided in .obs, with columns pre_adipo, adipo, lipo, and other. The first three indicate whether each cell was enriched for pre-adipocyte, adipocyte, or lipogenic programs, and the other column marks cells that were not enriched for either pre-adipocyte or adipocyte programs. Program enrichment assignments were based on expert-curated canonical signature genes, and the list of signature genes is provided in signature_genes.csv.

    • The full analysis workflow used to determine program enrichment is provided in the accompanying notebook (in R), which can be consulted for additional methodological details.

    • We provide the cell state proportion for each of the perturbations in a separate file program_proportion.csv.

  • During preprocessing, standard single-cell quality control (QC) was applied to remove low-quality cells and cell doublets based on sequencing library complexity, gene detection rate, and mitochondrial gene content. The dataset was then restricted to cells with a single confident guide assignment to a perturbation, and guides represented by fewer than 10 cells were excluded. Genes detected in fewer than 10 cells were removed, and known signature genes from signature_genes.csv were subsequently re-introduced.
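The normalization described for adata.X can be illustrated with a small, dependency-free sketch. It assumes each cell is a plain list of raw counts; the function name normalize_cell and the toy counts are hypothetical and are not part of the challenge code. In practice the same steps would be applied to the matrix in adata.layers['counts'].

```python
import math


def normalize_cell(counts, target_sum=100_000):
    """Scale one cell's raw counts to `target_sum`, then apply log2(1 + x)."""
    total = sum(counts)
    if total == 0:
        # Assumption: an empty cell stays all-zero instead of dividing by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log2(1.0 + c * scale) for c in counts]


# Toy raw counts for two cells and four genes (illustrative values only).
raw_counts = [
    [1, 3, 0, 6],
    [0, 0, 5, 5],
]
normalized = [normalize_cell(cell) for cell in raw_counts]

# Undoing the log transform recovers the target sum for each cell.
recovered_totals = [sum(2.0 ** v - 1.0 for v in cell) for cell in normalized]
```

Because the scaling is per cell, cells with different sequencing depths end up on a comparable scale before the log transform.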

The .obs columns are defined as:

  • orig.ident: The original sample ID.

  • nCount_RNA: The number of UMIs detected per cell.

  • nFeature_RNA: The number of genes detected per cell.

  • nCount_guide: The number of sgRNA UMIs detected per cell.

  • nFeature_guide: The number of sgRNAs detected per cell.

  • percent.mt: The fraction of UMIs per cell that map to mitochondrial transcripts.

  • SampleID: The sample ID.

  • Day: The day of sample collection.

  • num_features: The number of guides per cell (for low-MOI data, after QC, only cells with 1 guide are kept).

  • feature_call: The guide assignment of each cell.

  • num_umis: The number of guide UMIs per cell.

  • gene: The perturbation target gene (or perturbation identity).

  • positive_control: Whether the perturbation is one of the positive controls.

Expected Output

Participants must submit three outputs:

File: prediction.h5ad

An AnnData file containing predicted post-perturbation gene expression profiles (normalized and log-transformed) for the 2,863 gene perturbations listed in predict_perturbations.txt.

Predictions should be stored in the adata.X matrix, with the corresponding perturbation identity recorded in adata.obs['gene'].

The set of genes (columns) included in the prediction is defined explicitly by genes_to_predict provided at inference time and the columns of adata.X must follow this order.

Note that the genes_to_predict list may change between validation (N=10,237) and test phases, and your model must generate predictions for whichever set of genes is supplied. The maximum number of genes that could be included in genes_to_predict is 21,592 corresponding to the total number of genes in the dataset.

For each gene perturbation, we ask you to predict the gene expression profiles for 100 cells to quantify the distribution of each perturbation prediction. With N = len(genes_to_predict), the final prediction file is therefore required to have dimensions: [286,300 × N] (cells × genes_to_predict).
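As a quick sanity check of the required dimensions, the sketch below builds the per-row perturbation labels that would go into adata.obs['gene'] and computes the expected matrix shape. The helper names and the GENE_A / GENE_B labels are hypothetical, and grouping the 100 cells of each perturbation into consecutive rows is an illustrative choice, not a documented requirement.

```python
CELLS_PER_PERTURBATION = 100


def build_row_labels(perturbations, cells_per_perturbation=CELLS_PER_PERTURBATION):
    """Repeat each perturbation name once per predicted cell."""
    labels = []
    for gene in perturbations:
        labels.extend([gene] * cells_per_perturbation)
    return labels


def expected_shape(n_perturbations, genes_to_predict,
                   cells_per_perturbation=CELLS_PER_PERTURBATION):
    """Return the (cells, genes) shape the prediction matrix should have."""
    return (n_perturbations * cells_per_perturbation, len(genes_to_predict))


# Hypothetical perturbation names, only to show the row layout.
row_labels = build_row_labels(["GENE_A", "GENE_B"])

# Validation phase: 2,863 perturbations and 10,237 genes to predict.
validation_shape = expected_shape(2863, range(10237))
print(validation_shape)  # (286300, 10237)
```

Comparing the shape of the generated matrix against expected_shape before writing prediction.h5ad can catch a missing perturbation or a wrong gene list early.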

File: predict_program_proportion.csv

A CSV file reporting the predicted proportion of cells with enriched programs for each gene perturbation listed in predict_perturbations.txt.

The file should contain one row per perturbation with the following columns:

  • gene: should contain the perturbation name,

  • pre_adipo, adipo, lipo, and other: should specify the predicted proportion of cells in each corresponding state for that perturbation,

  • lipo_adipo: should be the ratio of lipo to adipo (representing the proportion of adipocytes with enriched lipogenic programs).

This file should thus have 2,863 rows and 6 columns. An example is available in the data/ directory.
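A minimal sketch of how such a file could be assembled with Python's standard csv module is shown here. The column order, the GENE_A / GENE_B names, the proportion values, and the handling of adipo = 0 (reported as 0.0) are all assumptions made for illustration; the example file in the data/ directory remains the reference for the exact format.

```python
import csv
import io

COLUMNS = ["gene", "pre_adipo", "adipo", "lipo", "other", "lipo_adipo"]


def proportion_row(gene, pre_adipo, adipo, lipo, other):
    """Build one CSV row, deriving lipo_adipo as the ratio lipo / adipo."""
    # Assumption: when adipo is 0 the ratio is undefined, so 0.0 is reported.
    lipo_adipo = lipo / adipo if adipo > 0 else 0.0
    return {
        "gene": gene,
        "pre_adipo": pre_adipo,
        "adipo": adipo,
        "lipo": lipo,
        "other": other,
        "lipo_adipo": lipo_adipo,
    }


def write_proportions(rows, handle):
    """Write a header plus one row per perturbation to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical predictions for two perturbations (illustrative values only).
rows = [
    proportion_row("GENE_A", pre_adipo=0.30, adipo=0.50, lipo=0.20, other=0.20),
    proportion_row("GENE_B", pre_adipo=0.60, adipo=0.00, lipo=0.00, other=0.40),
]

buffer = io.StringIO()
write_proportions(rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, open the target with open("predict_program_proportion.csv", "w", newline="") and pass that handle to write_proportions.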

File: Method description.md

Please write a short document outlining the approaches used to generate both the predictions and the estimated proportions of cells enriched for each program.

This should include sufficient details of the computational models employed and the procedures used to derive cell proportions.

The document should be organized into three sections, represented as titles in a Markdown file:

  • Method Description: Explain how your method works. (5-10 sentences)

  • Rationale: Describe the reasoning behind your model. (5-10 sentences)

  • Data and Resources Used: Specify the datasets and any other resources utilized. (5-10 sentences)

Notes:

  • A human will validate the content at the end of the competition. Work deemed unsatisfactory may be disqualified.

  • This file must be provided during submission. If content needs to be changed, you must re-submit with the new version.

  • The name must be Method Description.md; case does not matter.

  • Only non-empty and non-comment lines are considered.

Below is an example of how to format the file:

If there's an obvious issue regarding the format, you'll receive an immediate notification.

Notebook users are required to use embedded files.

Scoring

Each metric will be displayed in a different leaderboard. Each will have a different ranking and opportunity for a prize.

The metrics are classed into 2 categories:

  • Transcriptome-wide metrics that will be computed using a subset of genes (i.e., the columns of the predicted matrix) for each perturbation.

    • Metrics include:

      • Pearson Delta between predicted and observed perturbation effects relative to perturbed mean.

      • Maximum mean discrepancy (MMD) between predicted and observed distributions of single-cell profiles.

    • Public leaderboard / validation (updated weekly): Evaluation uses 1,000 hidden genes.

    • Private leaderboard / test phase: Both the number and identity of scoring genes will remain unknown.

  • A program-level metric that evaluates whether models capture meaningful biological outcomes:

    • L1 distance between the predicted and observed proportions of the four cell states for each perturbation (i.e., pre-adipogenic, adipogenic, lipogenic, and other).
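To make two of these metrics concrete, here is a rough, dependency-free sketch of a Pearson Delta and an L1 distance for a single perturbation. It is not the official evaluation code: the function names are hypothetical, and how the reference mean (the "perturbed mean") is computed is an assumption, so the Full Specifications and the evaluation code remain authoritative.

```python
import math

STATES = ("pre_adipo", "adipo", "lipo", "other")


def column_means(cells):
    """Mean expression per gene over a list of cells (rows)."""
    n_cells = len(cells)
    return [sum(column) / n_cells for column in zip(*cells)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def pearson_delta(pred_cells, obs_cells, reference_mean):
    """Correlate predicted and observed mean shifts from a reference mean."""
    # Assumption: `reference_mean` is a per-gene mean supplied by the caller.
    pred_delta = [p - r for p, r in zip(column_means(pred_cells), reference_mean)]
    obs_delta = [o - r for o, r in zip(column_means(obs_cells), reference_mean)]
    return pearson(pred_delta, obs_delta)


def l1_distance(pred_props, obs_props, states=STATES):
    """Sum of absolute differences over the four cell state proportions."""
    return sum(abs(pred_props[s] - obs_props[s]) for s in states)


# Toy example with three genes (illustrative values only).
reference_mean = [1.0, 2.0, 3.0]
obs_cells = [[1.5, 2.0, 2.0], [2.5, 2.0, 2.0]]
pred_cells = [[3.0, 2.0, 1.0]]
delta_score = pearson_delta(pred_cells, obs_cells, reference_mean)

pred_props = {"pre_adipo": 0.3, "adipo": 0.5, "lipo": 0.2, "other": 0.2}
obs_props = {"pre_adipo": 0.2, "adipo": 0.6, "lipo": 0.3, "other": 0.2}
l1_score = l1_distance(pred_props, obs_props)
```

In the toy example the predicted shift has the same direction as the observed one but twice the magnitude, which still yields a perfect correlation: Pearson Delta rewards the pattern of change across genes rather than its scale.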

The evaluation code is available on GitHub.

Code for local scoring will be available in the quickstarter.

For more details about the metric formulas, please consult the Full Specifications.

Submit

To build a valid submission, your model needs to be coded within the infer function, effectively respecting the crunch code submission interface.

An example is available in the quickstarter.

FAQ

I missed a checkpoint, can I participate in the next one?

Yes.

There are checkpoints every Monday, and missing one will not affect your final ranking. Once your model is ready, submit it!

Why must external resources be published or in the public domain?

While releasing the full model is encouraged, it is not strictly required if the weights are sufficient for reproducibility.

When constraints limit what can be released, we ask that the methods and training procedures be clearly documented. These cases can be reviewed individually to ensure transparency and fairness.
