Versioning is a feature that records changes to metadata and/or data. Think of it like “git for data”.
Versioning means that we can go back to previous revisions, track history and more. Versioning can also include features such as the ability to “tag” a given revision with a label, e.g. “v1.0”.
Versioning gives you, for data, all the benefits that revisioning gives you for code:
- Rollback: you can roll back (aka revert) to previous states of the data.
- => Greater freedom to make changes: This, in turn, brings more freedom in making changes and the ability to recover from errors
- Pinning: the ability for dependent applications (e.g. an analytic workflow, or a data-driven web app) to “pin” their use of this data to a particular revision. This is like declaring explicit version dependencies in a software application.
- => Reduced coupling, improving collaboration and independence: data curators can make changes (without worrying about breaking downstream users) and client users have confidence that their applications won’t suddenly break
- Pull requests: the ability to receive contributions from other parties in a structured way (a middle way between everyone needing access to contribute and no-one having access to contribute).
- => Easier, faster, distributed collaboration: a structured contribution model allows much faster, more open and more distributed collaboration
- Complex Merge: distributed contribution models, feature branches etc
- Changelogs: … and therefore auditability (NB: this can be achieved other ways)
Also worth mentioning is the potential integration with code: now that your data has revisioning too, you can keep code and data in sync, for example a machine learning model in code and its training data in the data management system.
Versioning as a term can be confusing because it is ambiguous. For example, when some people say “version” they mean a revision, e.g. “does this tool support data versioning” (i.e. does it support recording each change to the data). Whilst, when other people say “version” they mean a (revision) tag, e.g. “what version of this software are you using” (answer: “version 1.3”).
We avoid this ambiguity by using specific terms – revisioning and (revision) tagging – for these different features and reserving versioning for the overall system incorporating these.
When you update a dataset (metadata or data) a new revision is created and the current state is “snapshotted” and preserved.
More generally, revisioning is functionality whereby changes to a dataset (and its child resources) are logged and prior state is accessible. For example, if a dataset with value “Foo” is changed to have value “Bar”, one can still access the previous revision where it had value “Foo”.
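The “Foo” to “Bar” example can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the simplest possible storage (a full in-memory snapshot per revision); the `RevisionedDataset` class and its methods are hypothetical names and not part of CKAN or any library mentioned here.

```python
import copy


class RevisionedDataset:
    """Keeps every state a dataset has had (hypothetical, in-memory only)."""

    def __init__(self, initial):
        # Revision 0 is the initial state; each update appends a snapshot.
        self._snapshots = [copy.deepcopy(initial)]

    def update(self, **changes):
        """Apply changes as a new revision and return its number."""
        new_state = copy.deepcopy(self._snapshots[-1])
        new_state.update(changes)
        self._snapshots.append(new_state)
        return len(self._snapshots) - 1

    def get(self, revision=None):
        """Return the state at a revision (the latest when omitted)."""
        index = len(self._snapshots) - 1 if revision is None else revision
        return copy.deepcopy(self._snapshots[index])


dataset = RevisionedDataset({"title": "Foo"})
rev = dataset.update(title="Bar")

latest = dataset.get()      # current state, title is "Bar"
previous = dataset.get(0)   # prior state is still accessible, title is "Foo"
```

A real system would typically store changes more compactly than full snapshots, but the contract is the same: an update never destroys the prior state.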
- Metadata or metadata and data revisioning: revisioning can be metadata only (it is rarely data only). For example, CKAN (as of v2) only revisions metadata.
- DAG or linear: revisioning can be simple “linear” revisioning or it can be full “DAG” (directed acyclic graph).
  - Linear: each revision has a single parent and a single successor.
  - DAG: a directed acyclic graph allows branching and merging, so a revision can have several successors (a branch) or several parents (a merge).
- Branch labelling and management: with a DAG one can have multiple “branches” rather than just the single “trunk” of the linear case. With branches it can be useful to label these branches and to designate a “master” or primary branch to which new revisions are appended by default.
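To make the difference between linear and DAG revisioning concrete, here is a minimal sketch in which each revision records its parents and each branch label points at its latest revision. The `RevisionGraph` class, the revision ids (`r1`, `r2`, ...) and the branch name `fix-typos` are all assumptions made up for illustration; only revision structure is tracked, not the data itself.

```python
class RevisionGraph:
    """Revisions linked by parent pointers, with labelled branch heads."""

    def __init__(self, primary="master"):
        self.parents = {}                 # revision id -> tuple of parent ids
        self.branches = {primary: None}   # branch label -> head revision id
        self.primary = primary
        self._next_id = 1

    def _add(self, parents):
        rev_id = f"r{self._next_id}"
        self._next_id += 1
        self.parents[rev_id] = tuple(parents)
        return rev_id

    def commit(self, branch=None):
        """Append a revision to a branch (the primary branch by default)."""
        branch = branch or self.primary
        head = self.branches[branch]
        rev_id = self._add([head] if head else [])
        self.branches[branch] = rev_id
        return rev_id

    def branch(self, name, from_branch=None):
        """Create a new branch label at the head of an existing branch."""
        self.branches[name] = self.branches[from_branch or self.primary]

    def merge(self, source, target=None):
        """Create a revision with two parents: target head and source head."""
        target = target or self.primary
        rev_id = self._add([self.branches[target], self.branches[source]])
        self.branches[target] = rev_id
        return rev_id

    def ancestors(self, rev_id):
        """Return every revision reachable through parent pointers."""
        seen, stack = set(), list(self.parents[rev_id])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen


graph = RevisionGraph()
r1 = graph.commit()               # first revision on the primary branch
graph.branch("fix-typos")         # branch off at r1
r2 = graph.commit("fix-typos")    # r1 now has two successors: a branch
r3 = graph.commit()
r4 = graph.merge("fix-typos")     # r4 has two parents: a merge
```

In the linear case every entry in `parents` would hold at most one id and there would be no need for branch labels beyond the single trunk.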
Tagging is the ability to “tag” a revision, i.e. create a labelled pointer to that revision.
It is often referred to as revision tagging to disambiguate it from normal tagging with keywords.
In addition to a convenient name, e.g. `v1.2`, a tag may also incorporate other metadata, for example a description, e.g. “Introduced new column xyz and reformatted column abc.”
Whilst tagging itself is relatively trivial functionality, there may be significant business and technical processes associated with it. For example, a tag may be the basis for a “release”.
# Domain Model
- Revision: an object recording metadata of a revision e.g. when it happened, who created it etc.
- (Revision) Tag: a pointer to a specific revision with additional metadata e.g. name, description.
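The two objects above could be modelled roughly as follows. This is a sketch under assumptions, not the schema of any particular implementation: the field names, the `resolve` helper, the revision id `abc123`, the author and the date are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    """Metadata about one revision: when it happened, who created it."""
    id: str
    created_at: datetime
    author: str
    message: str = ""


@dataclass(frozen=True)
class Tag:
    """A named pointer to a specific revision, with extra metadata."""
    name: str
    revision_id: str
    description: str = ""


def resolve(tag_name, tags, revisions):
    """Follow a tag to the revision it points at."""
    return revisions[tags[tag_name].revision_id]


# Illustrative data only.
revisions = {
    "abc123": Revision(
        id="abc123",
        created_at=datetime(2020, 1, 15, tzinfo=timezone.utc),
        author="alice",
    ),
}
tags = {
    "v1.2": Tag(
        name="v1.2",
        revision_id="abc123",
        description="Introduced new column xyz and reformatted column abc.",
    ),
}

pinned = resolve("v1.2", tags, revisions)
```

Because a tag is only a pointer, a dependent application that pins to `v1.2` keeps resolving to the same revision however many new revisions are added afterwards.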
# CKAN v2
CKAN v2 (up to v2.8) used `vdm` to provide metadata revisioning. However, there was no data revisioning. In v2.9 `vdm` was removed and metadata revisioning is provided by the activity stream system.
There is an extension called ckanext-datasetversions with a basic implementation of dataset versioning. It implements versions as a parent-child relationship between datasets. There is a detailed analysis of the package in this document.
The package internally uses the child_of relationship to model versions: “The plugin models dataset versions internally by creating a parent dataset, with minimal metadata and no resources. A child dataset is created for each version.” So new versions are new datasets, and CKAN restrictions apply: these datasets cannot share a url or name.
The package was created four years ago and does not seem to be actively maintained.
- Data revisioning is not supported.
- Revision tags are not supported.
- Only linear revision trees i.e. no branching
# CKAN v3
We offer an approach to data versioning that is integrated with CKAN but, rather than implementing large amounts of custom versioning logic, leverages git, the world’s most popular software for versioning.
It is backwards compatible with CKAN v2.
- ckanext-versioning (CKAN v2 extension): adds a full data versioning capability to CKAN, including:
  - Metadata and data are revisioned, so that every update creates a new revision and old versions of the metadata and data remain accessible
  - Creating and managing “revision tags” (named labels plus a description for a specific revision of a dataset, e.g. “v1.0”)
  - Diffs, reverting, etc.
- metastore-lib: Library for storing dataset metadata, with versioning support and pluggable backends including GitHub. metastore-lib is used by ckanext-versioning and requires environment variables for the GitHub repository where the data will be stored and for the access token.
- frictionless-ckan-mapper (python): A library for mapping CKAN metadata <=> Frictionless metadata.
Important: ckanext-versioning depends on Blob Storage v3
# Open Questions
- How does revisioning work when a revisioned object, e.g. a Dataset, has a reference to an unrevisioned object, e.g. a Tag? For example, imagine an old dataset revision has a reference to a tag that has since been deleted from the system. In this case displaying a link to that tag will fail.
# Appendix: Mapping against Git
Git terminology on the left, our terminology on the right.
- Commit <=> Revision
- Tag <=> Release