A Gigantum Project is just a bunch of files on disk in a specially formatted git repository. No complex databases or formats, just files. This was an intentional design decision that makes it easy to move Projects, inspect them, manipulate them, and even use them without the Client.
The .gigantum directory contains metadata that gives Gigantum Projects their power.
The activity directory contains files that are written by a simple git-compatible key-value store. This store is used in conjunction with the git log to store detailed records of everything a user does to a Project.
The datasets directory exists only if you link a Dataset to a Project. When you do, the Client automatically creates a git submodule reference to the Dataset and stores it here. Conceptually, this is a link from the Project to a Dataset at a specific version. Because the link points to a version, the Dataset can change while your view of it stays the same until you explicitly update it.
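Because a linked Dataset is an ordinary git submodule, you can see which Datasets a Project links to with plain git tooling. The snippet below is a minimal sketch, not part of the Client: it parses text in the standard .gitmodules format and returns the settings recorded for each submodule. The function name, the sample submodule name, its path, and its URL are all made up for illustration.

```python
def parse_gitmodules(text):
    """Return {submodule_name: {key: value}} parsed from .gitmodules-style text."""
    modules = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[submodule") and line.endswith("]"):
            # Section headers look like: [submodule "name"]
            name = line[len("[submodule"):-1].strip().strip('"')
            current = modules.setdefault(name, {})
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return modules


# Hypothetical sample content; a real Project's names and paths will differ.
sample = """
[submodule "my-dataset"]
    path = datasets/my-dataset
    url = https://example.com/my-dataset.git
"""

for name, settings in parse_gitmodules(sample).items():
    print(name, "->", settings["path"])
```

The version the Project is pinned to is recorded by git itself as the submodule's commit, so running `git submodule status` in the repository shows it.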
The env directory contains all the metadata required to configure the environment (e.g. pip packages, conda packages, Docker snippets). This information is stored as a collection of files and rendered into a Dockerfile when a container is built for you. By storing the configuration this way, the Client can support easy editing across collaborators with a much lower chance of conflicts.
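To make the "collection of files rendered into a Dockerfile" idea concrete, here is a minimal sketch. It is not the Client's actual renderer, and it does not reproduce the real on-disk format of the env directory: the function name, the base image, and the package specs are assumptions chosen for illustration. The point is that each package is an independent entry, so two collaborators adding different packages touch different entries instead of editing the same lines of one Dockerfile.

```python
def render_dockerfile(base_image, conda_packages, pip_packages, docker_snippets):
    """Render separate environment entries into Dockerfile text (illustrative only)."""
    lines = [f"FROM {base_image}"]
    if conda_packages:
        # Sort so the output is the same regardless of the order entries were added.
        lines.append("RUN conda install -y " + " ".join(sorted(conda_packages)))
    if pip_packages:
        lines.append("RUN pip install " + " ".join(sorted(pip_packages)))
    # Custom Docker snippets are appended verbatim, in the order given.
    lines.extend(docker_snippets)
    return "\n".join(lines) + "\n"


# Hypothetical inputs; a real Project stores this metadata as files under env.
dockerfile = render_dockerfile(
    base_image="example/python3-base:latest",
    conda_packages=["numpy=1.16"],
    pip_packages=["requests==2.22.0", "flask==1.1.1"],
    docker_snippets=["RUN echo 'custom step'"],
)
print(dockerfile)
```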
Finally, Projects attempt to enforce a structure of code, input, and output directories for your code, input data, and output data. While you are free to do what you'd like (it's just a git repo!), this structure is desirable to help build expectations around where things will be when sharing Projects.
Sometimes you don't want Gigantum to automatically version your files. For this case, we include untracked directories in each of the code, input, and output sections. You can read more about this in the Untracked Folders section.
File Upload Limits
The Client UI limits the size of files you can upload to help avoid the git performance problems that large files cause, but you can still run into these problems if you write large files manually from a development tool!
Limits when uploading through the Client UI:
- Code: 100MB (warning at 10MB)
- Input: 500MB (warning at 100MB)
- Output: 500MB (warning at 100MB)
If you need to store large files, Datasets support individual files up to 15GB in size and scale well.
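The limits above can be expressed as a small lookup, which is handy if you want to check a file before writing it into a Project from a development tool. This is a sketch, not the Client's own validation code: the function name is hypothetical, and it assumes that 1MB means 1,000,000 bytes and that a file exactly at a threshold is still allowed.

```python
MB = 1000 * 1000  # assumption: decimal megabytes

# Section name -> (warning threshold, hard limit), taken from the list above.
UPLOAD_LIMITS = {
    "code": (10 * MB, 100 * MB),
    "input": (100 * MB, 500 * MB),
    "output": (100 * MB, 500 * MB),
}


def check_upload(section, size_bytes):
    """Classify a file size as 'ok', 'warning', or 'rejected' for a section."""
    warn_at, limit = UPLOAD_LIMITS[section]
    if size_bytes > limit:
        return "rejected"
    if size_bytes > warn_at:
        return "warning"
    return "ok"


print(check_upload("code", 50 * MB))    # above the code warning threshold
print(check_upload("input", 50 * MB))   # below the input warning threshold
print(check_upload("output", 600 * MB)) # above the output hard limit
```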
While git is great for tracking changes to text files, it can slow down dramatically with large files.
To help deal with this, the Client automatically uses Git Large File Storage (LFS) for any file added to the output directories. This handles large files, but will still become slow with many very large (over ~1GB) files.
Datasets are available when data gets large and will keep your Project snappy. They are designed to scale well, deduplicate data across files and projects, and upload and download fast. If you have large files over 1GB, it is recommended to use a Dataset instead of embedding data directly in a Project.
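Since a Project is just files on disk, you can look for candidates to move into a Dataset with a short script. The following is a minimal sketch, assuming 1GB means 1,000,000,000 bytes; the function name is hypothetical and nothing here is part of the Client. It walks a Project directory, skips git's internal .git directory, and reports files larger than a threshold.

```python
import os

GB = 1000 ** 3  # assumption: decimal gigabytes


def find_large_files(project_root, threshold_bytes=GB):
    """Return sorted (relative_path, size_in_bytes) pairs for files over the threshold."""
    large = []
    for dirpath, dirnames, filenames in os.walk(project_root):
        # Prune git internals so they are not walked or reported.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks rather than following them
            size = os.path.getsize(path)
            if size > threshold_bytes:
                large.append((os.path.relpath(path, project_root), size))
    return sorted(large)
```

Calling `find_large_files("/path/to/project")` with the default threshold lists files over 1GB, which are the ones this page recommends storing in a Dataset.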