A Gigantum Project is just a collection of files on disk in a specially formatted git repository. No complex databases or formats, just files. This was an intentional design decision that makes it easy to move Projects, inspect them, manipulate them, and even use them without the Client.
An example Project directory structure, as seen from the host file system:
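The listing below is an illustrative reconstruction of the layout described in this section; the exact contents of .gigantum may differ slightly between Client versions.

```
my-project/
├── .gigantum/
│   ├── activity/      # key-value records backing the activity feed
│   ├── datasets/      # git submodule references to linked Datasets
│   └── env/           # environment config (pip, conda, Docker snippets)
├── code/
├── input/
└── output/
    └── untracked/     # files here are never versioned or synced
```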
The .gigantum directory contains the metadata that gives Gigantum Projects their power.
The activity directory contains files written by a simple, git-compatible key-value store. This store is used together with the git log to keep a detailed record of everything a user does in a Project.
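Because the record keeping is built on plain git, you can inspect a Project's history without the Client. A minimal sketch, using a throwaway repository in place of a real Project (in practice you would cd into an actual Project directory):

```shell
# Simulate a Project with a throwaway repo; the layout is illustrative.
tmp=$(mktemp -d)
git init -q "$tmp"
mkdir -p "$tmp/.gigantum/activity" "$tmp/code"
echo "print('hello')" > "$tmp/code/analysis.py"
git -C "$tmp" add -A
git -C "$tmp" -c user.email=you@example.com -c user.name=you \
    commit -qm "Add analysis script"
# Recorded actions show up as ordinary commits:
git -C "$tmp" log --oneline
```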
The datasets directory exists if you link a Dataset to the Project. When you do, the Client automatically creates a git submodule reference to the Dataset, which is stored here. Conceptually, this is a link from the Project to a Dataset at a specific version; because the link is pinned to a version, the Dataset can change while your view of it does not, until you explicitly update it.
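A minimal sketch of how a version-pinned submodule link behaves, using two throwaway repositories (the paths and names are illustrative, not what the Client writes):

```shell
tmp=$(mktemp -d)
# Stand-in "Dataset" repository with one version:
git init -q "$tmp/dataset"
git -C "$tmp/dataset" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "Dataset v1"
# The "Project" links the Dataset as a submodule pinned to that commit:
git init -q "$tmp/project"
git -C "$tmp/project" -c protocol.file.allow=always \
    submodule add -q "$tmp/dataset" .gigantum/datasets/example
# Shows the exact commit SHA the Project is pinned to:
git -C "$tmp/project" submodule status
```

New commits in the Dataset repository do not change what the Project sees until the submodule reference itself is updated.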
The env directory contains all the metadata required to configure the environment (e.g. pip packages, conda packages, Docker snippets). This information is stored as a collection of files and rendered into a Dockerfile when a container is built for you. Storing the configuration this way lets collaborators edit the environment with a much lower chance of merge conflicts.
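The idea can be sketched as one small file per dependency, concatenated into Dockerfile instructions at build time. The file layout and names below are illustrative assumptions, not Gigantum's exact schema:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/.gigantum/env/pip"
# One file per package: collaborators editing different packages touch
# different files, so git merges rarely conflict.
echo "numpy"  > "$tmp/.gigantum/env/pip/numpy"
echo "pandas" > "$tmp/.gigantum/env/pip/pandas"
# Render the collection into a Dockerfile instruction at build time:
pkgs=$(ls "$tmp/.gigantum/env/pip" | tr '\n' ' ')
echo "RUN pip install $pkgs" > "$tmp/Dockerfile"
cat "$tmp/Dockerfile"
```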
Finally, Projects attempt to enforce a structure of code, input, and output directories for your code, input data, and output data. While you are free to do what you'd like (it's just a git repo!), this structure helps build expectations about where things will be when sharing Projects.
Inside the output directory is a directory called untracked. Here you can write files that you don't want to be versioned. Anything written there is ignored when versioning, detecting changes, and syncing.
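Under the hood this is ordinary git ignore behavior, which you can verify yourself. Whether the Client writes a .gitignore entry exactly like this is an assumption, but the observable behavior is the same:

```shell
tmp=$(mktemp -d)
git init -q "$tmp"
mkdir -p "$tmp/output/untracked"
echo "output/untracked/" > "$tmp/.gitignore"
echo "scratch data" > "$tmp/output/untracked/scratch.txt"
# Exit status 0 means git ignores the file entirely:
git -C "$tmp" check-ignore -q output/untracked/scratch.txt && echo "ignored"
```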
File Upload Limits
The Client UI limits the size of files you can upload to help avoid git performance problems, but you can still cause them if you write large files manually from a development tool!
Limits when uploading through the Client UI:
- Code: 100MB (warning at 10MB)
- Input: 500MB (warning at 100MB)
- Output: 500MB (warning at 100MB)
If you need to store large files, Datasets support individual files up to 15GB in size and scale well.
While git is great for tracking changes to text files, it can slow down dramatically with large files.
To help deal with this, the Client automatically uses Git Large File Storage (LFS) for any file added to the input and output directories. LFS handles large files well, but git operations still slow down with many very large (over ~1GB) files.
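Automatic LFS tracking boils down to .gitattributes patterns like the following; the exact patterns the Client writes are an assumption here:

```shell
tmp=$(mktemp -d)
cat > "$tmp/.gitattributes" <<'EOF'
input/** filter=lfs diff=lfs merge=lfs -text
output/** filter=lfs diff=lfs merge=lfs -text
EOF
# With git-lfs installed, files matching these patterns are stored in git
# as small pointer files, with the real content managed by LFS.
grep lfs "$tmp/.gitattributes"
```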
When data gets large, Datasets will keep your Project snappy. They are designed to scale well, deduplicate data across files and Projects, and upload and download quickly. If you have files over 1GB, use a Dataset rather than embedding the data directly in a Project.