A Gigantum Dataset is a repository, similar to a Project, whose sole purpose is managing data. While Datasets share many features and UI elements with Projects, they are fundamentally different: only metadata is embedded in a Dataset, whereas a Project embeds all of its actual contents. This provides many benefits, including fast sync times, partial downloads, and the ability to de-duplicate files across versions and Projects.
With Datasets, the actual management and storage of data is delegated to the Dataset's storage backend, which can easily be modified to integrate with various existing services. This design keeps Datasets flexible, letting them integrate the many different ways and places that data is stored behind a single user interface that continues to "level the playing field" when it comes to the skills required to perform data analysis.
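To make the "metadata only" idea concrete, here is a minimal sketch of a manifest that records a content hash and size for each file instead of the file itself. This is an illustration, not the Gigantum Client's actual format: the function name `build_manifest` and the `sha256` and `bytes` fields are assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict


def build_manifest(root: Path) -> Dict[str, dict]:
    """Describe every file under `root` by hash and size (hypothetical format).

    Only this small dictionary would need to be versioned and synced;
    the file contents themselves would live in a storage backend.
    """
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            manifest[path.relative_to(root).as_posix()] = {
                "sha256": hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            }
    return manifest
```

Because two files with identical contents produce the same hash, a manifest like this is also enough to tell when data can be stored once and shared.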
Datasets come in two flavors, Managed and Unmanaged:
- Managed Dataset Types provide full versioning functionality, meaning you can add, delete, and update data from directly within the Client.
- Unmanaged Dataset Types provide a consistent view into a collection of data that is essentially read-only. The Gigantum Client will notify you of changes to any files since you last downloaded them, but you can't change the data from within the Client because it is managed elsewhere.
A concrete example of an Unmanaged Dataset Type is a public S3 bucket. You can download and use the data, but you can't edit it. Someone else who owns the bucket could edit the data, and the Client would notify you that file contents have been modified. You can then "accept" the changes into the dataset to remove the warning in the future.
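The notify-then-accept flow described above can be sketched as a comparison between a locally recorded fingerprint for each file and the fingerprint currently reported by the remote source. This is only an illustrative sketch: the functions `detect_changes` and `accept_changes` are hypothetical, and the fingerprints are assumed to be opaque strings such as content hashes.

```python
from typing import Dict, List


def detect_changes(local: Dict[str, str], remote: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare recorded fingerprints (path -> fingerprint) with the remote's."""
    shared = local.keys() & remote.keys()
    return {
        "modified": sorted(p for p in shared if local[p] != remote[p]),
        "added": sorted(remote.keys() - local.keys()),
        "deleted": sorted(local.keys() - remote.keys()),
    }


def accept_changes(local: Dict[str, str], remote: Dict[str, str]) -> Dict[str, str]:
    """Adopt the remote state as the new baseline so the warning goes away."""
    return dict(remote)


if __name__ == "__main__":
    recorded = {"a.csv": "h1", "b.csv": "h2"}
    current = {"a.csv": "h1", "b.csv": "h9"}
    print(detect_changes(recorded, current))
```

After `accept_changes`, a second call to `detect_changes` against the same remote state reports nothing, which mirrors accepting the changes into the dataset.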
To use a Dataset, you link it to a Project. This linking operation ties the Dataset at a specific version to the Project, aiding reproducibility. This approach also enables de-duplication of data across Projects: only one copy of the files for each version in use is kept on disk, and it is simply linked through to the Project container.
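One way to picture keeping a single copy on disk is a shared cache keyed by content hash, with each Project receiving a link to the cached file rather than its own copy. The sketch below uses hard links purely as an assumption for illustration; the text does not say how the Gigantum Client implements linking, and `store_object` and `link_into_project` are hypothetical helpers.

```python
import hashlib
import os
from pathlib import Path


def store_object(cache: Path, data: bytes) -> Path:
    """Write `data` into the shared cache once, named by its SHA-256 hash."""
    cache.mkdir(parents=True, exist_ok=True)
    obj = cache / hashlib.sha256(data).hexdigest()
    if not obj.exists():
        obj.write_bytes(data)
    return obj


def link_into_project(obj: Path, project_dir: Path, name: str) -> Path:
    """Expose a cached object inside a project directory via a hard link."""
    target = project_dir / name
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        target.unlink()
    os.link(obj, target)  # assumption: cache and project share a filesystem
    return target
```

Two Projects that link the same file end up pointing at the same bytes on disk, so the data is stored only once no matter how many Projects use it.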
A final benefit of most Dataset Types under development is the ability to download files within a Dataset incrementally. This lets users download only the files they need when a Dataset is large.
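As a rough illustration of incremental download, the sketch below picks out only the files that match a pattern and are not already present locally, and reports how many bytes would be transferred. The function `select_files` and the path-to-size manifest are assumptions for this example and are not part of the Gigantum Client.

```python
from fnmatch import fnmatch
from typing import Dict, List, Set, Tuple


def select_files(
    manifest: Dict[str, int], pattern: str, already_local: Set[str]
) -> Tuple[List[str], int]:
    """Return the matching paths still to download and their total size.

    `manifest` maps each relative path to its size in bytes (hypothetical).
    """
    wanted = sorted(
        path
        for path in manifest
        if fnmatch(path, pattern) and path not in already_local
    )
    return wanted, sum(manifest[path] for path in wanted)


if __name__ == "__main__":
    sizes = {"images/a.png": 10, "images/b.png": 20, "notes.txt": 5}
    print(select_files(sizes, "images/*.png", {"images/a.png"}))
```

Only the selected paths would then be requested from the storage backend, leaving the rest of a large Dataset undownloaded.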
Currently, only the Managed Gigantum Cloud Dataset Type is available. Soon, additional Unmanaged Dataset Types will be released, including public S3 buckets and local-only files.
Have an idea for an integration that would make a good Dataset Type? Suggest it here