Data

In all Crunches, crunchers can access our datasets as follows:

Split datasets

x_train, y_train, x_test

  1. x_train, y_train: Labeled training dataset;

  2. x_test, y_test: Small portion of the test set that can be used to run local tests.

In Crunches, crunchers submit code or models in the form of notebooks or Python files. These submissions are then run on testing or live data by the system. As a result, crunchers never have direct access to the testing data.

The example prediction

A sample prediction is provided to show the required format for the output of the submission. Any deviation from this format will result in an invalid submission.

The crunch-cli also uses this file to perform a local check.

The values are usually either random or a constant.
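The kind of format check described above can be sketched as follows. This is only an illustration, not the crunch-cli's actual check: check_format is a hypothetical helper, and the column names (moon, id, target) are assumed for the example.

```python
import pandas as pd


def check_format(prediction: pd.DataFrame, example: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the format matches.

    Hypothetical helper: the real crunch-cli check may differ.
    """
    problems = []

    # Same column names, in the same order.
    if list(prediction.columns) != list(example.columns):
        problems.append(
            f"columns differ: expected {list(example.columns)}, "
            f"got {list(prediction.columns)}"
        )

    # Same number of rows.
    if len(prediction) != len(example):
        problems.append(
            f"row count differs: expected {len(example)}, got {len(prediction)}"
        )

    return problems


# Assumed layout for illustration: a constant-valued example prediction.
example_prediction = pd.DataFrame(
    {"moon": [0, 0], "id": ["a", "b"], "target": [0.0, 0.0]}
)

good = pd.DataFrame({"moon": [0, 0], "id": ["a", "b"], "target": [0.12, -0.40]})
bad = good.rename(columns={"target": "prediction"})

print(check_format(good, example_prediction))
print(check_format(bad, example_prediction))
```

Comparing against the example file rather than a hard-coded schema keeps such a check valid across competitions whose column layouts differ.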

Data Formats

Data can come in various formats.

Cross-sectional DataFrame

Contains only the moon, id, and feature columns.

Moons are a proxy for timestamps; in other words, moons are dates.

Participants can access the names of the columns (the features) through parameters of the code interface.
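To make the layout concrete, here is a small sketch built on a synthetic DataFrame. The feature names (feature_1, feature_2) are made up, and deriving the feature list by excluding moon and id is an assumption for illustration; in a real submission the names come from the code interface parameters.

```python
import pandas as pd

# Synthetic stand-in for x_train; real column names come from the competition.
x_train = pd.DataFrame(
    {
        "moon": [0, 0, 1, 1],
        "id": ["a", "b", "a", "b"],
        "feature_1": [0.1, 0.2, 0.3, 0.4],
        "feature_2": [1.0, 0.9, 0.8, 0.7],
    }
)

# Assumption: every column other than moon and id is a feature.
feature_column_names = [c for c in x_train.columns if c not in ("moon", "id")]

# Each moon is one cross-section: a snapshot of all ids at one point in time.
rows_per_moon = x_train.groupby("moon").size().to_dict()

print(feature_column_names)
print(rows_per_moon)
```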

How to load the data?

Loading the data from a notebook is easy: the load_data function makes sure that the latest version of the data is available locally, downloading it if necessary, and returns three DataFrames.

# x_train: pandas.DataFrame
# y_train: pandas.DataFrame
#  x_test: pandas.DataFrame

x_train, y_train, x_test = crunch.load_data()

Examples from the DataCrunch competition

Prediction's format

There should be one column per target.

DAG

DAG (Directed Acyclic Graph) data is distributed as a pickled dict of pandas.DataFrame objects.
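As a rough picture of what a pickled dict of DataFrames looks like, the sketch below round-trips a small synthetic one through pickle. The keys ("00000", "00001") and the node columns (X, Y, A) are hypothetical, and an in-memory buffer stands in for the file on disk.

```python
import io
import pickle

import pandas as pd

# Hypothetical dataset: keys are dataset ids, values are observation tables
# with one column per node of the graph.
datasets = {
    "00000": pd.DataFrame({"X": [0.1, 0.2], "Y": [0.3, 0.4], "A": [0.5, 0.6]}),
    "00001": pd.DataFrame({"X": [1.0, 2.0], "Y": [3.0, 4.0]}),
}

# Round-trip through pickle, using an in-memory buffer instead of a file.
buffer = io.BytesIO()
pickle.dump(datasets, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

for dataset_id, frame in restored.items():
    print(dataset_id, list(frame.columns), frame.shape)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from a trusted source.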

How to load the data?

# x_train: typing.Dict[str, pandas.DataFrame]
# y_train: typing.Dict[str, pandas.DataFrame]
#  x_test: typing.Dict[str, pandas.DataFrame]

x_train, y_train, x_test = crunch.load_data()

Examples from the ADIA Lab Causality Discovery competition

Prediction's format

The main column is example_id, formatted as follows:

<dataset_id>_<from_node>_<to_node>
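Building identifiers in this format could look like the sketch below. The helper make_example_id, the dataset id "00000", and the node names are hypothetical. Emitting one id per ordered pair of distinct nodes is also an assumption made for illustration.

```python
def make_example_id(dataset_id: str, from_node: str, to_node: str) -> str:
    """Join the three parts with underscores: <dataset_id>_<from_node>_<to_node>."""
    return f"{dataset_id}_{from_node}_{to_node}"


# Hypothetical node names for one dataset.
nodes = ["X", "Y", "A"]

# Assumption: one id per ordered pair of distinct nodes.
example_ids = [
    make_example_id("00000", from_node, to_node)
    for from_node in nodes
    for to_node in nodes
    if from_node != to_node
]

print(example_ids)
```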

Stream

Crunch's Streams are iterator objects that let you traverse all the elements of a time series one at a time.

How to load the data?

# x_train: typing.List[typing.Iterator[dict]]
#  x_test: typing.List[typing.Iterator[dict]]

x_train, x_test = crunch.load_streams()
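The sketch below shows how a list of iterators of dicts can be consumed. It uses synthetic streams rather than crunch.load_streams(). The make_stream helper and the "x" key are hypothetical, since the real element keys depend on the competition.

```python
from typing import Dict, Iterator, List


def make_stream(values: List[float]) -> Iterator[Dict[str, float]]:
    """Yield one dict per time step (hypothetical "x" key)."""
    for value in values:
        yield {"x": value}


# Synthetic stand-in for x_train: one iterator per time series.
x_train: List[Iterator[Dict[str, float]]] = [
    make_stream([1.0, 2.0, 3.0]),
    make_stream([10.0, 20.0]),
]

totals = []
for stream in x_train:
    running_total = 0.0
    # Elements arrive one at a time, in order.
    for element in stream:
        running_total += element["x"]
    totals.append(running_total)

print(totals)
```

An iterator can only be traversed once, so any element needed later has to be stored while iterating.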

Examples from the Mid+One competition

Prediction's format

Stream competitions have a different way of submitting results. Please follow the Stream Code Interface instead.
