A Gigantum Dataset is a specially formatted Git repository, similar to a Project, that exists solely to manage data. While Datasets share many features and UI elements with Projects, they differ fundamentally: only metadata is stored in the Dataset's repository, not the actual file content. This provides many benefits, including fast sync times, incremental downloads, and the ability to de-duplicate files across versions and Projects.
An example Dataset
With Datasets, the actual management and storage of data is delegated to the Dataset's storage backend, which can easily be modified to integrate with various existing services. This design keeps Datasets flexible: they can bring together the many different ways and places data is stored behind a single user interface, which continues to "level the playing field" when it comes to the skills required to perform data analysis.
Currently, only the Gigantum Cloud Dataset type is available, but more will be added in the near future.
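To make the metadata-only idea more concrete, here is a minimal sketch of how a manifest of content hashes can enable incremental downloads and de-duplication. It is an illustration only and is not Gigantum's actual manifest format or API: the function names `build_manifest` and `objects_to_download`, the use of SHA-256, and the in-memory `local_store` are all assumptions made for this example.

```python
import hashlib


def build_manifest(files):
    """Map each relative path to the SHA-256 digest of its content.

    `files` is a dict of {relative_path: bytes}. In this sketch, only the
    resulting mapping (the metadata) would live in the repository; the file
    bytes would live in a separate, hash-keyed store.
    """
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def objects_to_download(manifest, local_store):
    """Return the digests referenced by `manifest` that are not stored locally."""
    return sorted(set(manifest.values()) - set(local_store))


# Two versions of a hypothetical dataset: v2 changes b.csv and adds a copy of a.csv.
v1 = build_manifest({"a.csv": b"1,2,3\n", "b.csv": b"4,5,6\n"})
v2 = build_manifest(
    {"a.csv": b"1,2,3\n", "b.csv": b"7,8,9\n", "copy_of_a.csv": b"1,2,3\n"}
)

# Pretend v1 has already been downloaded into a hash-keyed local store.
local_store = set(v1.values())

# Only the new content of b.csv is missing; a.csv and its copy reuse one object.
missing = objects_to_download(v2, local_store)
print(len(missing))  # 1
```

Because identical content always hashes to the same digest, a file that is unchanged between versions, or duplicated under another name, needs to be stored and transferred only once.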
File Upload Limit
Currently, the maximum supported individual file size is 15GB. You can upload multiple files for a total Dataset size greater than 15GB, but any single file must be 15GB or less.
If you have a use case that requires larger files, we'd love to hear about it!
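Before uploading, it can help to check a local directory for files that exceed the limit. The following is a small sketch using only the Python standard library; it does not call any Gigantum API. The helper name `find_oversized_files` is hypothetical, and the interpretation of 15GB as 15 × 10^9 bytes is an assumption, since it is not stated whether decimal or binary units are meant.

```python
import os

# Assumption: "15GB" is read as 15 * 10**9 bytes (decimal units).
MAX_FILE_BYTES = 15 * 10**9


def find_oversized_files(root, limit=MAX_FILE_BYTES):
    """Return sorted (relative_path, size_in_bytes) pairs for files over `limit`."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            size = os.path.getsize(full_path)
            if size > limit:
                oversized.append((os.path.relpath(full_path, root), size))
    return sorted(oversized)
```

Calling `find_oversized_files("path/to/data")` returns an empty list when every file is within the limit; any entries it does return would need to be split or compressed before upload.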
While you can always place data directly into a Project's input directory, there are many cases where using a Dataset makes more sense. Some common use cases are:
- Individual data files are large (e.g. over 500MB). Embedding very large files in Projects is not recommended due to the performance limitations of Git LFS.
- Total size of your data is large (e.g. over 1GB). If data is embedded in the Project, all of the data is downloaded with the Project. With Datasets, you can choose when and how much data you download, making syncs much faster.
- Reusing data. With Datasets, many Projects can link to a Dataset and use its data while only one copy of the data exists on your computer.
- Using multiple versions of data at once. Datasets are linked to a Project at a specific version. This means you can have Projects refer to different versions of the data without having to duplicate the whole Dataset.
- Using data across branches in a Project. Using Datasets with Projects makes managing data across branches within the Project easier.
- Publishing data alone. Sometimes you just want to make data available to others.
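The size-based guidelines in the list above can be sketched as a quick check. This is only a rough heuristic: the 500MB and 1GB values are the examples given above rather than hard limits, the helper name `reasons_to_use_dataset` is hypothetical, and decimal units are assumed.

```python
MB = 10**6  # Assumption: decimal units
GB = 10**9


def reasons_to_use_dataset(file_sizes, shared_across_projects=False):
    """Return the guidelines (as short strings) that a set of files triggers.

    `file_sizes` is a list of file sizes in bytes.
    """
    reasons = []
    if any(size > 500 * MB for size in file_sizes):
        reasons.append("individual file over 500MB")
    if sum(file_sizes) > 1 * GB:
        reasons.append("total size over 1GB")
    if shared_across_projects:
        reasons.append("data reused by multiple Projects")
    return reasons
```

An empty result suggests that placing the data directly in a Project's input directory is probably fine; the remaining use cases (multiple versions, branches, publishing data alone) depend on workflow rather than size and are not captured here.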