Datasets Overview

A quick summary of Gigantum Datasets

What is a Dataset?

A Gigantum Dataset is a repository, similar to a Project, whose sole purpose is managing data. While Datasets share many features and UI elements with Projects, they differ fundamentally: a Dataset embeds only metadata, whereas a Project embeds all of its actual contents. This provides several benefits, including fast sync times, partial downloads, and the ability to de-duplicate files across versions and Projects.
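To illustrate why storing only metadata makes de-duplication possible, here is a minimal sketch of a content-addressed manifest. It is not Gigantum's actual implementation: the function names, the use of SHA-256, and the in-memory "stored" set are all assumptions made for the example.

```python
import hashlib


def build_manifest(files):
    """Map each relative path to the SHA-256 digest of its contents.

    `files` is a dict of {relative_path: bytes}. The manifest is the only
    thing a metadata-only repository would need to version.
    """
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def objects_to_upload(manifest, stored_digests):
    """Return the digests in the manifest that the backend does not hold yet."""
    return set(manifest.values()) - set(stored_digests)


# Two hypothetical versions of a Dataset. Version 2 changes b.csv and adds
# a copy of a.csv under a new name.
v1 = build_manifest({"a.csv": b"1,2,3\n", "b.csv": b"4,5,6\n"})
v2 = build_manifest({
    "a.csv": b"1,2,3\n",
    "b.csv": b"7,8,9\n",
    "copy_of_a.csv": b"1,2,3\n",
})

# Pretend the backend already holds every object from version 1.
stored = set(v1.values())
new_objects = objects_to_upload(v2, stored)
```

In this sketch, only the changed contents of `b.csv` count as a new object. The unchanged `a.csv` and its copy resolve to a digest the backend already holds, so nothing is transferred for them.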

With Datasets, the management and storage of data during syncing is delegated to the Dataset's storage backend, which can be modified to integrate with existing services. This design lets Datasets act as a flexible integrator for the many ways and places data is stored, all through a single user interface that continues to "level the playing field" in the skills required to perform data analysis.

An example Dataset

Currently, the only Dataset backend is provided by Gigantum Cloud; additional Dataset types will be available in the future. The Gigantum Cloud Dataset type supports individual files up to 15 GB in size and can handle many files in a single Dataset.

How Do You Use a Dataset?

Datasets enable the independent management of data, which may be useful for publishing data alone, but at the end of the day you want to work with a Dataset. To do this, you must "link" a Dataset to a Project. Linking creates a reference in the Project to a specific version of the Dataset. Currently, you can only link to the latest version of the Dataset.

Once a Dataset is linked, any files in that Dataset that exist locally will be mounted into the Project container at runtime. The files will appear in the input directory as a folder with the same name as the Dataset. Note that the files are read-only, because the same files can be mounted into any Project to which the Dataset is currently linked.
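As a rough sketch of what this layout means for code running inside a Project, the helper below lists the files of a linked Dataset. The helper name `dataset_files`, the Dataset name `my-dataset`, and the file names are hypothetical. The location of the input directory depends on your environment, so it is passed in as a parameter, and the demo uses a temporary directory as a stand-in.

```python
import tempfile
from pathlib import Path


def dataset_files(dataset_name, input_dir):
    """List files of a linked Dataset found under the Project's input directory.

    Returns paths relative to the Dataset folder. An empty list means the
    folder is absent, for example because the Dataset is not linked.
    """
    root = Path(input_dir) / dataset_name
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Demo: a temporary directory stands in for the Project's input directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my-dataset" / "raw").mkdir(parents=True)
    (Path(tmp) / "my-dataset" / "raw" / "samples.csv").write_text("x,y\n1,2\n")

    found = dataset_files("my-dataset", tmp)
    missing = dataset_files("not-linked", tmp)
```

Because the mounted files are read-only, code that derives new files from a Dataset should write them somewhere else in the Project, not back into the Dataset folder.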

To update files in a Dataset, you must update the Dataset itself and then update the reference to the Dataset in the desired Project.
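The two-step update can be pictured with a toy model of a Project's reference to a Dataset. This illustrates the concept only and is not Gigantum's API: the `Dataset` and `DatasetLink` classes and the revision identifiers are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical Dataset with an ordered history of revision identifiers."""
    name: str
    revisions: list = field(default_factory=list)

    def commit(self, revision_id):
        """Record a new revision, as happens when the Dataset's files change."""
        self.revisions.append(revision_id)

    @property
    def latest(self):
        return self.revisions[-1]


@dataclass
class DatasetLink:
    """Hypothetical reference held by a Project, pinned to one revision."""
    dataset: Dataset
    revision: str

    def is_stale(self):
        """True when the Dataset has moved past the pinned revision."""
        return self.revision != self.dataset.latest

    def update(self):
        """Re-pin the reference to the Dataset's latest revision."""
        self.revision = self.dataset.latest


ds = Dataset("my-dataset", ["rev-1"])
link = DatasetLink(ds, ds.latest)   # linking pins the latest revision

ds.commit("rev-2")                  # step 1: update the Dataset itself
stale_before = link.is_stale()      # the Project still points at rev-1

link.update()                       # step 2: update the reference in the Project
stale_after = link.is_stale()
```

The point of the model is that changing the Dataset does not change what a Project sees until the Project's reference is updated as well.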

You can learn more about working with Datasets here.