A key feature of the Gigantum Client is automated versioning. The Client integrates with development tools (i.e. Jupyter, JupyterLab, and RStudio) to automatically version and capture everything that you do. Because every activity is monitored and each version links together your code, data, and environment configuration, it is easy to see exactly what was done, when, and by whom, and even to roll back to a previous point in time with a single click directly from the Activity Feed.
An example Activity Feed tab
This high-resolution version data is a new source of information, and we expect many useful features to be built on top of it. It is made possible by a novel git-compatible key-value store that holds rich metadata about every version (e.g. figures, code snippets) beyond what is possible with the git commit log alone. The data is managed in a way that merge conflicts cannot occur, and it is linked directly to the git commits containing changes through specially crafted git commit messages that are automatically created by the Client.
A snippet from the git log of a Project showing how Gigantum embeds additional metadata with your changes
Gigantum Client deeply integrates with Jupyter Notebook, JupyterLab, and RStudio. This integration monitors for activity and then automatically creates a commit, capturing your changes as you work. This separates the process of checkpointing your work from writing notes to yourself and others - such notes become a separate commit that can be added at any time to the top of the Activity Feed. There are some important things to understand that will make using the versioning system easier and more powerful.
Versioning currently occurs after the underlying compute system (e.g. the Jupyter kernel) is idle for 1 second. This means if you execute many cells at once or quickly, those results will be grouped into a single Activity Record (commit). In the future, this timeout will be configurable.
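The grouping behavior described above can be illustrated with a minimal sketch. This is not the Client's implementation: the function name `group_by_idle_gap` and the `idle_timeout` parameter are hypothetical, and cell executions are simplified to single completion timestamps. It only shows how completions that arrive less than one idle period apart would land in the same Activity Record.

```python
from typing import List


def group_by_idle_gap(finish_times: List[float], idle_timeout: float = 1.0) -> List[List[float]]:
    """Group completion timestamps (in seconds) into batches.

    A new batch starts whenever the gap since the previous completion
    is at least `idle_timeout`, mimicking "version after the kernel
    has been idle for 1 second". Hypothetical helper for illustration only.
    """
    groups: List[List[float]] = []
    for t in sorted(finish_times):
        if groups and t - groups[-1][-1] < idle_timeout:
            # Still within the idle window: same batch (same Activity Record).
            groups[-1].append(t)
        else:
            # Kernel was idle long enough: start a new batch.
            groups.append([t])
    return groups


batches = group_by_idle_gap([0.0, 0.3, 0.8, 2.5, 2.9, 10.0])
print(len(batches))  # number of Activity Records this sequence would produce
```

In this sketch, three cells run in quick succession produce one record, while a cell run after a pause of a second or more starts a new one.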
Additionally, if the Project is "busy", the Client will avoid creating versions, as doing so can bloat the size of the git repository. A concrete example is a Project with two notebooks open in Jupyter. The first is run and takes a long time. In the second, you execute a cell that completes. This will not create a new version because the long-running notebook could be in the middle of writing data to disk, which we don't want to prematurely version. When the long-running notebook completes, all changes will be captured in a single Activity Record. (This is not an issue with RStudio, as it uses a single kernel shared across documents.)
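As a rough sketch of the "busy" rule above, assume the decision reduces to checking that every kernel in the Project is idle. The function `can_create_version` and the dictionary of notebook names are hypothetical and only illustrate the two-notebook example; the state strings follow the "busy" and "idle" execution states that Jupyter kernels report.

```python
from typing import Dict


def can_create_version(kernel_states: Dict[str, str]) -> bool:
    """Return True only when no kernel in the Project is still executing.

    Hypothetical helper: `kernel_states` maps a notebook name to its
    kernel's execution state ("busy" or "idle").
    """
    return all(state == "idle" for state in kernel_states.values())


# The second notebook finished a cell, but the first is still running:
during_long_run = can_create_version({"long_running.ipynb": "busy", "second.ipynb": "idle"})

# Once the long-running notebook completes, a version can be created:
after_long_run = can_create_version({"long_running.ipynb": "idle", "second.ipynb": "idle"})

print(during_long_run, after_long_run)
```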
There are certain things that can create excessive and unnecessary Activity Detail data. The progress bar output while downloading models via tensorflow is one such example. In these cases you may want to suppress output (e.g. add a semicolon (;) to the end of a line in Jupyter) or Control Activity Feed Behavior to limit what is captured.
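One way to keep noisy output out of a cell is sketched below. In Jupyter, a trailing semicolon hides the displayed value of the last expression in a cell (for example `plt.plot(x, y);`), but it does not hide text that a call prints while running. For printed output, one option is to redirect standard output with Python's `contextlib.redirect_stdout`. The function `noisy_download` is a hypothetical stand-in for a call that prints progress, and output written to standard error would need `contextlib.redirect_stderr` instead.

```python
import contextlib
import io


def noisy_download() -> str:
    """Hypothetical stand-in for a call that prints progress while it runs."""
    for pct in (25, 50, 75, 100):
        print(f"downloading... {pct}%")
    return "model.h5"


# Capture everything printed to stdout instead of showing it in the cell output.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    path = noisy_download()

# Only this line appears in the cell output; the progress text stays in `buffer`.
print(path)
```

The captured text remains available through `buffer.getvalue()` if it is needed later for debugging.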
The system will limit a record to 255 "Details" so that the UI remains snappy. All your changes on disk will always be captured, but not everything may show up in the Activity Feed if it is excessive.