A Gigantum Dataset is similar to a Project in that it is a specially formatted Git repository, in this case one used to manage data. While Datasets share features and UI elements with Projects, they are fundamentally different because they are meant for data files, not code.
Datasets use custom versioning to improve performance for collections of large files, and they combine a metadata repository with a file management and storage backend for the actual data. A Dataset repository stores and versions only file metadata, not the actual file content, which enables fast sync times, incremental downloads, and file de-duplication across versions and Projects.
The management and storage of the actual data is done by the Dataset's storage backend, which can easily be modified to integrate with various existing services.
This design allows Datasets to be flexible and integrate many different methods for data storage and management, all through a single user interface.
Currently, only the Gigantum Cloud Dataset type is available, but more are being added.
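The split between versioned metadata and a separate storage backend can be illustrated with a minimal sketch. This is not Gigantum's actual implementation: the `InMemoryBackend` class, the `build_manifest` function, and the use of SHA-256 as the content key are all assumptions made for illustration. It shows how versioning only a small manifest, while storing content by hash, leads naturally to de-duplication and incremental downloads.

```python
import hashlib
from typing import Dict


class InMemoryBackend:
    """Hypothetical storage backend: keeps file content keyed by its hash."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.objects.setdefault(digest, content)
        return digest

    def get(self, digest: str) -> bytes:
        return self.objects[digest]


def build_manifest(files: Dict[str, bytes], backend: InMemoryBackend) -> Dict[str, dict]:
    """Return the metadata a repository could version: path -> hash and size."""
    manifest = {}
    for path, content in files.items():
        manifest[path] = {"sha256": backend.put(content), "size": len(content)}
    return manifest


backend = InMemoryBackend()
v1 = build_manifest({"a.csv": b"1,2,3\n", "b.csv": b"4,5,6\n"}, backend)
# Version 2 changes one file and adds a copy of an existing one.
v2 = build_manifest(
    {"a.csv": b"1,2,3\n", "b.csv": b"7,8,9\n", "copy.csv": b"1,2,3\n"}, backend
)

# Only content hashes that are new in v2 would need to be transferred.
v1_hashes = {meta["sha256"] for meta in v1.values()}
new_hashes = {meta["sha256"] for meta in v2.values()} - v1_hashes
```

In this sketch, five file entries across two versions resolve to three stored objects, and moving from `v1` to `v2` requires fetching a single new object. Swapping `InMemoryBackend` for a class with the same `put` and `get` methods is one way a different storage service could be plugged in.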
File Upload Limit
Currently, the maximum supported individual file size is 15GB. You can upload multiple files for a total Dataset size greater than 15GB, but any single file must be 15GB or less.
If you have a use case that requires larger files, we'd love to hear about it!
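Before uploading, it can help to check a local directory for files that exceed the per-file limit. The following is a minimal sketch using only the Python standard library. The function name `find_oversized_files` is hypothetical, and reading "15GB" as 15 × 10^9 bytes is an assumption; adjust the constant if the service measures in binary units.

```python
import os
from typing import List, Tuple

# Assumption: "15GB" is read as 15 * 10**9 bytes.
MAX_FILE_BYTES = 15 * 10**9


def find_oversized_files(root: str, limit: int = MAX_FILE_BYTES) -> List[Tuple[str, int]]:
    """Return (path, size) for every file under root that is larger than limit."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized)
```

Calling `find_oversized_files("path/to/data")` returns an empty list when every file is within the limit. The total size of the directory is not checked, since only individual files are limited.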
You can always put data directly into a Project, but there are many cases where using a Dataset makes more sense. Some common use cases are:
- Large individual data files (e.g. over 500MB): Embedding very large files in Projects is not recommended due to performance limitations of Git-LFS, which is used to version input data.
- Total data size is large (e.g. over 1GB): Data embedded in a Project will be uploaded or downloaded with the Project. Datasets allow selective downloads, which makes syncs much faster.
- Reusing data: Datasets can be linked to multiple Projects without replicating files for each Project.
- Using multiple versions of a dataset at once: A Dataset and Project can be linked at specific versions, allowing for a lot of flexibility while reducing the need to duplicate data.
- Using data across branches in a Project: Linking Datasets to Projects makes managing data across branches within the Project easier.
- Publishing data: Data is a first-class citizen, and sometimes you want to make versioned data available without requiring the context of computational work.
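The two size-based guidelines in the list above can be expressed as a small check. This is only a rough sketch: the function name `suggest_dataset` is hypothetical, the thresholds are the example figures from the list (read here as decimal megabytes and gigabytes, which is an assumption), and the other use cases, such as reuse or publishing, are not captured by file sizes at all.

```python
from typing import Iterable

MB = 10**6  # assumption: decimal units
GB = 10**9


def suggest_dataset(
    file_sizes: Iterable[int],
    large_file_bytes: int = 500 * MB,
    large_total_bytes: int = 1 * GB,
) -> bool:
    """Return True if any file, or the total, exceeds the example thresholds."""
    sizes = list(file_sizes)
    has_large_file = any(size > large_file_bytes for size in sizes)
    has_large_total = sum(sizes) > large_total_bytes
    return has_large_file or has_large_total
```

For example, four files of 300MB each stay under the per-file figure but exceed the total, so the check would suggest a Dataset.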