# Versioning

# The Advantages of a Git-Based Approach

  • Excellent command line support out of the box (git)
  • Full revisioning and tagging and more (e.g. branches) in an extremely robust system
  • Support for non-dataset files in the same place … (e.g. code, visualization, data processing, data analytics)

# What shall we use to create the Hub part of the DataHub

  • CKAN Classic MetaStore
  • Gitea or GitLab or GitHub …

For now, definitely CKAN Classic MetaStore.

# What shall we use to create / manage git repos for us?

# Metadata flow

# Research

# References

# Context: (Dataset) Versioned Blob Trees

  • We could also add things like a dataflows.yml to a repo to make a data pipeline or a model.pkl file to store your machine learning analysis …

# Context: Project => Project Hub

# Approaches for storing large files and versioning them

For now I’ll assume we use Git for versioning and we want large files outside of git.

My sense is that Git LFS with custom backend storage works fine for most CKAN use cases, in which the customer has their own storage.

In more ML-oriented use cases, the ability to have multiple data sources from different systems could be valuable.

It seems to me some hybrid could be achieved using extensions to Data Resource (to use remote URLs that have a local cache) and a special client that is aware of those extensions.

See also this matrix comparison: https://docs.google.com/spreadsheets/d/1gc7vxcFt9OSVL7JoXXo9KSBVG4oIASaL08vdvoEst4o/edit#gid=1833965075

# Git LFS option

Use git-lfs and build a custom LFS server to store to arbitrary cloud storage (a rough sketch follows below).

  • Pros: compatible with the standard git lfs client
  • Cons: you are limited to one cloud storage backend and can’t pull data from different places (and there is no central caching)

For example, suppose I have a project using an AWS public dataset. In this approach I have to first copy that large dataset down, add it to git (LFS), and push it into my own cloud storage via Git LFS.
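To make this concrete, here’s a very rough sketch of such a custom LFS server, assuming Express and a hypothetical getSignedUrl helper for whatever cloud storage sits behind it (not a real API):

```js
// Very rough sketch of a custom LFS server: answer the Batch API with
// upload/download URLs pointing at your own cloud storage.
const express = require('express')

// Placeholder: return a pre-signed URL for the object with this oid.
function getSignedUrl (operation, oid) {
  return `https://my-bucket.example.com/${oid}?sig=...`
}

const app = express()
app.use(express.json({ type: 'application/vnd.git-lfs+json' }))

app.post('/objects/batch', (req, res) => {
  const { operation, objects } = req.body // 'upload' or 'download'
  res.set('Content-Type', 'application/vnd.git-lfs+json')
  res.json({
    transfer: 'basic',
    objects: objects.map(({ oid, size }) => ({
      oid,
      size,
      actions: {
        [operation]: { href: getSignedUrl(operation, oid), expires_in: 3600 }
      }
    }))
  })
})

app.listen(3000)
```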

# Manifest option (dvc approach and datahub.io)

We store a manifest in git of the “local” paths that are not in git but in cloud storage.

One approach would be to modify Data Resource to have a storage / cloud URL option:

```json
{
  "path": "mydata.csv",
  "storageUrl": "https://cloudstorage.com/content-addressed-path"
}
```

As long as the storage URL changes each time you change the file (e.g. by using content addressing) you get proper data versioning.
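A minimal sketch of what that could look like, assuming sha256 for the content address and a placeholder STORAGE_BASE for wherever the blob actually gets uploaded:

```js
// Sketch: build a content-addressed storageUrl for a local file.
const crypto = require('crypto')
const fs = require('fs')

const STORAGE_BASE = 'https://cloudstorage.com' // placeholder

function describeResource (localPath) {
  const data = fs.readFileSync(localPath)
  const sha256 = crypto.createHash('sha256').update(data).digest('hex')
  return {
    path: localPath,
    // A new hash (and hence a new URL) every time the content changes is
    // what gives us versioning of the data itself.
    storageUrl: `${STORAGE_BASE}/${sha256}`
  }
}

console.log(describeResource('mydata.csv'))
```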

Another option is to store soft links in the main repo pointing into a local cache directory that is gitignored but has a manifest listing what to download into it. These would have to get updated each time the data changed (as we would point to a new blob file in the cache).

Or you could store a special file, as described in https://dvc.org/doc/user-guide/dvc-files-and-directories (the DVC approach, I think).

Notes

  • Authentication / authorization is sort of out of scope (we need to assume that the user has access to the storage URL and permission to upload)
  • Could achieve some degree of similar functionality by inverting this: have a cachePath or similar in datapackage.json and a tool that pulls all remote resources and stores them to their cachePath (see the sketch at the end of this section)

Pros:

  • I could use multiple cloud storage sources in a given dataset (including pulling from public sources)

Cons:

  • Need a separate tool other than git (lfs)
  • Some weird behaviour if I pull and modify a data file and then push - where does it now go? (not that weird though: my command line tool can take care of this)
    • Guess you would set a default storage “server/service”
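A minimal sketch of that inverted, cache-oriented tool (the cachePath field on resources is an assumption here, not part of the Data Resource spec):

```js
// Sketch: download every remote resource in datapackage.json to its
// (hypothetical) cachePath. Assumes Node 18+ for the global fetch.
const fs = require('fs/promises')

async function pullResources (descriptorPath = 'datapackage.json') {
  const pkg = JSON.parse(await fs.readFile(descriptorPath, 'utf8'))
  for (const resource of pkg.resources || []) {
    if (!resource.storageUrl || !resource.cachePath) continue
    const res = await fetch(resource.storageUrl)
    if (!res.ok) throw new Error(`fetch failed for ${resource.storageUrl}`)
    await fs.writeFile(resource.cachePath, Buffer.from(await res.arrayBuffer()))
  }
}

pullResources().catch(console.error)
```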

# Research

# Git

See https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

# Git-Hubs and how they work …

# Git-Hub APIs for creating files etc

# Git LFS

Git LFS works as follows:

  • When committing, LFS-tracked files are replaced with a pointer file (see the example below)
  • The actual file is stored in some backend storage
  • When pulling, those large files are cached locally
  • On checkout into the working directory, the pointer file is replaced with the actual file
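For reference, the pointer file is just a small text file recording the object’s hash and size. A minimal sketch of generating one (the helper name is mine, not part of git-lfs):

```js
// Sketch: build the text of a Git LFS pointer file for a given blob.
const crypto = require('crypto')
const fs = require('fs')

function lfsPointer (filePath) {
  const data = fs.readFileSync(filePath)
  const oid = crypto.createHash('sha256').update(data).digest('hex')
  return [
    'version https://git-lfs.github.com/spec/v1',
    `oid sha256:${oid}`,
    `size ${data.length}`
  ].join('\n') + '\n'
}
```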

Key specs

Implementation has 3 components:

  • Git client “plugin”
  • Server API
  • Storage: use your storage of choice

API: https://github.com/git-lfs/git-lfs/blob/master/docs/api/README.md

# File Storage flow

https://github.com/datopian/datahub-client/blob/master/lib/utils/datahub.js#L22

When storing a file there are the following steps (sketched in code after the list):

  • Discover the LFS server to use
  • Authenticate
  • Call its batch API with the upload operation
    • Tell it what protocols the client supports
  • Get back URLs to store to
  • Store to them
    • Note there are only certain protocols supported
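A rough sketch of those steps from the client’s side (lfsServer and token stand in for the discovery and authentication steps; the endpoint and content type follow the Batch API spec linked below):

```js
// Sketch: ask the LFS server's Batch API where to upload an object,
// then PUT the bytes to the URL it returns.
async function uploadObject (lfsServer, token, oid, size, body) {
  const res = await fetch(`${lfsServer}/objects/batch`, {
    method: 'POST',
    headers: {
      'Accept': 'application/vnd.git-lfs+json',
      'Content-Type': 'application/vnd.git-lfs+json',
      'Authorization': `Bearer ${token}` // actual auth scheme is server-specific
    },
    body: JSON.stringify({
      operation: 'upload',
      transfers: ['basic'], // the transfer protocols this client supports
      objects: [{ oid, size }]
    })
  })
  const { objects } = await res.json()
  const action = objects[0].actions && objects[0].actions.upload
  if (!action) return // server already has this object
  await fetch(action.href, { method: 'PUT', headers: action.header || {}, body })
}
```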

# Servers

# Batch API

https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md

Basic Transfers

https://github.com/git-lfs/git-lfs/blob/master/docs/api/basic-transfers.md

The Basic transfer API is a simple, generic API for directly uploading and downloading LFS objects. Git LFS servers can offload object storage to cloud services like S3, or implement this API natively.

This is the original transfer adapter. All Git LFS clients and servers SHOULD support it, and default to it if the Batch API request or response do not specify a transfer property.
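For orientation, a single entry in a batch response using the basic adapter looks roughly like this (all values illustrative):

```js
// Shape of one object entry in a batch response (basic transfer adapter).
const exampleEntry = {
  oid: '1111111',
  size: 123,
  authenticated: true,
  actions: {
    upload: {
      href: 'https://some-storage.example.com/1111111',
      header: { Authorization: 'RemoteAuth some-token' },
      expires_in: 86400
    }
  }
}
```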

They say that tus.io may be supported … (and tus.io in theory supports S3, though with issues around multipart uploads: https://tus.io/blog/2016/03/07/tus-s3-backend.html)

# Batch Upload to Cloud Storage

Looks like this is definitely possible. Here’s someone doing it with GCS:

https://github.com/git-lfs/git-lfs/issues/3567

# FAQs

# Git Annex

# Content Addressed Storage

https://en.wikipedia.org/wiki/Content-addressable_storage

API ideas: https://github.com/jakearchibald/byte-storage/issues/11

https://gist.github.com/mikeal/70daaf34ab39db6f979b8cf36fa5ac56
https://github.com/mikeal/lucass (lucass: Lightweight Universal Content Addressable Storage Spec)

```js
let key = await byteStorage(value) // value is a File, Blob, Stream, whatever
let value = await byteStorage(key) // could return a promise, or a stream, whatever you wanna go for
```
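A minimal sketch of that kind of store, using a plain directory on disk as the backend and sha256 as the addressing scheme (the class and method names are mine, not from lucass):

```js
// Sketch of a lucass-style content-addressed store backed by a directory;
// keys are sha256 hashes of the stored bytes.
const crypto = require('crypto')
const fs = require('fs/promises')
const path = require('path')

class ByteStorage {
  constructor (dir) { this.dir = dir }

  async put (buffer) {
    const key = crypto.createHash('sha256').update(buffer).digest('hex')
    await fs.mkdir(this.dir, { recursive: true })
    await fs.writeFile(path.join(this.dir, key), buffer)
    return key
  }

  async get (key) {
    return fs.readFile(path.join(this.dir, key))
  }
}
```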

Garbage collection: how do you do it …

# DVC

https://dvc.org “Data Version Control”

It does large files but also much more related to the machine learning workflow, e.g. it has a whole dependency tree in each of its special files (https://dvc.org/doc/user-guide/dvc-file-format), so it is doing some kind of optimization there …

https://dvc.org/doc/understanding-dvc/related-technologies

Basically, it combines parts of all of these:

  • Git Large file management
  • Workflows: creating and running them, especially machine learning workflows. Includes DAGs for workflows (especially ML flows)
  • Experiment management

DVC does not require special Git servers like Git-LFS demands. Any cloud storage like S3, GCS, or an on-premises SSH server can be used as a backend for datasets and models. No additional databases, servers, or infrastructure are required.

NB: this is actually untrue about Git LFS: a Git LFS server can be backed by any cloud storage.

# Misc