# Publish Data

# Introduction

Publish functionality covers the whole area of creating and editing datasets and resources, including data upload. The core job story is something like:

When a Data Curator has a data file or dataset, they want to add it manually (e.g. via drag and drop) to their data portal/platform quickly and easily so that it is available there.

Publication as a process can be divided into the following cases:

  • Manual: publication is done by people via a user interface or other tool
  • Programmatic: publication is done programmatically using APIs and is usually part of automated processes
  • Hybrid: which combines manual and programmatic, for example, harvesting where setup and configuration may be done in a UI by a person and then the process runs automatically and programmatically. In addition, some new harvesting flows require programmatic setup (e.g. writing a harvester in Python for a new source data format).
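As a rough illustration of the programmatic case, the sketch below builds (but does not send) a dataset-creation request. It assumes a CKAN-style action API with a `package_create` endpoint; the portal URL, API key and dataset fields are hypothetical placeholders, and other platforms will expose different APIs.

```python
import json
import urllib.request


def build_create_request(site_url: str, api_key: str, dataset: dict) -> urllib.request.Request:
    """Build (but do not send) a request that would create a dataset.

    Assumption: the portal exposes a CKAN-style action API.
    """
    return urllib.request.Request(
        url=f"{site_url.rstrip('/')}/api/3/action/package_create",
        data=json.dumps(dataset).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


request = build_create_request(
    "https://portal.example.org",  # hypothetical portal URL
    "my-api-key",                  # hypothetical credential
    {"name": "finance-vix", "title": "VIX volatility index"},  # illustrative metadata
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, since error handling and authentication details depend on the platform.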

Focus on manual: we will focus on the manual case in this section. Programmatic publication is by nature largely up to the client programmer (assuming the APIs are there), whilst harvesting has a section of its own. That said, many concepts here are relevant for the other cases, e.g. the material on profiles and schemas.

Data uploading: publishing includes the process of uploading data into the DMS, specifically into storage and especially blob storage.

# Examples

At its simplest, a publishing process can just involve providing a few metadata fields in a form – with the data itself stored elsewhere.

At the other end of the spectrum, we could have a multi-stage and complex process like this:

  • multiple (simultaneous) resource upload with shared metadata e.g. I’m creating a timeseries dataset with the last 12 months of data and I want each file to share the same column information but to have different titles
  • a variety of metadata profiles
  • data validation (prior to ingest) including checking for PII (personally identifiable information)
  • complex workflow related to approval e.g. only publish if at least two people have approved
  • embargoing (only make public at time X)
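The approval and embargo items in the list above can be pictured as a simple gate that runs before a dataset goes public. This is a minimal sketch under stated assumptions: the `Submission` type, its fields and the `can_publish` function are hypothetical names invented for illustration, and the two-approver threshold comes from the example above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Set


@dataclass
class Submission:
    """Hypothetical record of a dataset waiting to be published."""
    approvals: Set[str] = field(default_factory=set)  # user names who approved
    embargo_until: Optional[datetime] = None          # None means no embargo


def can_publish(submission: Submission, now: datetime, min_approvals: int = 2) -> bool:
    """Return True only if enough people approved and any embargo has passed."""
    if len(submission.approvals) < min_approvals:
        return False
    if submission.embargo_until is not None and now < submission.embargo_until:
        return False
    return True


now = datetime(2019, 2, 1, tzinfo=timezone.utc)
embargoed = Submission(
    approvals={"ana", "ben"},
    embargo_until=datetime(2019, 3, 1, tzinfo=timezone.utc),
)
print(can_publish(embargoed, now))  # approved twice, but still under embargo
```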

# Features

# Job Stories

When a Data Curator has a data file or dataset, they want to add it manually (e.g. via drag and drop) to their data portal quickly and easily so that it is available there.

More specifically: As a Data Curator I want to drop a file in, edit the metadata and have it saved in a really awesome interactive way so that the data is “imported” and of good quality (and I get feedback).

# Resources

TIP

A resource is any data item in a dataset e.g. a file.

When adding a resource to a dataset I want metadata pre-entered for me (e.g. resource name from file name, encoding, …) to save time and reduce errors
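As one way to picture this pre-filling, the sketch below derives a resource name, title and format from the file name alone. The function name and the slug rules are assumptions made for illustration; encoding detection is deliberately left out of this sketch.

```python
import re
from pathlib import Path


def prefill_resource(path: str) -> dict:
    """Guess initial resource metadata from a file name (hypothetical helper)."""
    file_path = Path(path)
    stem = file_path.stem
    return {
        # URL-friendly name: lower case, runs of other characters become "-"
        "name": re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-"),
        # Human-readable title: underscores and dashes become spaces
        "title": re.sub(r"[_\-]+", " ", stem).strip(),
        "path": file_path.name,
        "format": file_path.suffix.lstrip(".").lower(),
    }


metadata = prefill_resource("uploads/Monthly_Sales 2019.csv")
print(metadata)
```

The user would still review and edit these values; the point is only to start from a sensible default rather than an empty form.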

When adding a resource to a dataset I want to be able to edit the metadata whilst uploading so that I save time

When uploading a resource’s data as part of adding a resource to a dataset I want to see upload progress so that I have a sense of how long this will take
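A minimal sketch of how upload progress could be tracked, assuming the upload is done by reading the file in chunks: a callback is invoked after each chunk with the bytes read so far. The function and parameter names are hypothetical, and an in-memory stream stands in for a real file and network call.

```python
import io
from typing import BinaryIO, Callable, Iterator


def read_with_progress(
    stream: BinaryIO,
    total_bytes: int,
    on_progress: Callable[[int, int], None],
    chunk_size: int = 64 * 1024,
) -> Iterator[bytes]:
    """Yield chunks from the stream, reporting (bytes_read, total_bytes) after each."""
    bytes_read = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        bytes_read += len(chunk)
        on_progress(bytes_read, total_bytes)
        yield chunk


payload = b"x" * 250  # stand-in for file contents
updates = []          # percentages a UI would display
chunks = list(
    read_with_progress(
        io.BytesIO(payload),
        len(payload),
        lambda done, total: updates.append(round(100 * done / total)),
        chunk_size=100,
    )
)
print(updates)
```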

When adding resources to a dataset I want to be able to add and upload multiple files at once so that I save time and make one big change

When adding a resource which is tabular (e.g. csv, excel) I want to enter the (table) schema (i.e. the names, description and types of columns) so that my data is more useable, presentable, importable (e.g. to DataStore) and validatable
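To make the idea of a table schema concrete, here is a small sketch of a schema descriptor (column names, types and descriptions) in the style of the Frictionless Table Schema, together with a check of a CSV header against it. The column names, sample data and the `header_problems` helper are illustrative assumptions.

```python
import csv
import io

# Illustrative schema: one entry per column, with name, type and description.
schema = {
    "fields": [
        {"name": "date", "type": "date", "description": "Observation date"},
        {"name": "close", "type": "number", "description": "Closing value"},
    ]
}


def header_problems(csv_text: str, schema: dict) -> dict:
    """Compare the CSV header row with the field names declared in the schema."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    expected = [field["name"] for field in schema["fields"]]
    return {
        "missing": [name for name in expected if name not in header],
        "unexpected": [name for name in header if name not in expected],
    }


problems = header_problems("date,open,close\n2019-02-01,15.1,16.1\n", schema)
print(problems)
```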

When adding a resource which is currently stored in dropbox/gdrive/onedrive I want to pull the bytes directly from there so as to speed up the upload process

# Remarks

Most ordinary data users don’t distinguish resources and datasets in their everyday use. They also prefer a single (denormalized) view onto their data.

Normalization is not normal for users (it is a convenience, economisation and consistency device)

And in any case most of us start from files not datasets (even if datasets evolve later).

# Flows

  • Publish flows are highly custom: different platforms have different needs
  • At the same time there are core workflows that most people will use (and customize)
  • The flows shown here are therefore illustrative and inspirational rather than definitive

The 30,000 foot view:

Let’s start with the simplest case of adding a single file:

Notes

  • Alternative to “Drop a file” would be to just “Link” to a file that is already online and available

TIP

We think a “file driven” approach, where the flow starts with a user adding a file (and doing the upload), is preferable to an approach where you start with a dataset and metadata (as is the default today in CKAN) and then add files.

Why? First, a file is what the user has immediately to hand and it is concrete whilst “metadata” is abstract. Second, common tools for storing files e.g. Dropbox or Google Drive start with providing a file - only later, and optionally, do you rename it, move it etc.

That said, with tools like GitHub or GitLab one needs to create a “project”, albeit a minimal one, before being able to push any content. However, GitHub and GitLab are developer-oriented tools that can assume a willingness to tolerate a slightly more cumbersome UX. Furthermore, on those platforms there is no use case of providing a single file: a user must create a git repo first.

# Overview Deck

Deck: This deck (Feb 2019) provides an overview of the core flow for publishing a single tabular file (e.g. CSV) and includes a basic UI mockup illustrating the flow described below.

# Overview

For v1: we could assume small data (e.g. < 5 MB) so that we can load it into memory …?

Tabular data only

  1. Load

    1. File select
    2. Detect type
    3. Preview <= start preview here and continue throughout (?)
    4. Choose the data
  2. Structural check and correction

    1. Structural validation
    2. Error presentation
    3. Mini-ETL to correct errors
  3. Table metadata

    1. [Review the headers]
    2. Infer data-types and review
    3. [Add constraints]
    4. Data validation (and correction?)
  4. General metadata (if necessary)

    1. Title, description
    2. License
  5. Publish (atm: just download metadata and cleaned data)

# 1. Load

  1. User drops a file or uploads a file

    • What about a url? Secondary for now
    • What about import from other sources e.g. google sheets, dropbox etc? KISS => leave for now
    • Size restrictions? Let’s assume we’re ok
    • Error reporting: any errors loading the data file should be reported …
    • [Future]: in the background we’d be uploading this file to a file store while we do the rest of this process
    • Tooling options: https://uppy.io/ (note does lots more!), roll our own, filepicker.io (proprietary => no), …
      • How do we find something that just does file selection and provides us with the object
    • [Final output] => a raw file object, raw file info (? or we already pass to data.js?)
  2. Detect type / format (from file name …)

    • Prompt user to confirm the guess (or proceed automatically if guessed)?
    • Tooling: data.js already does this …
  3. Choose the data (e.g. sheets from excel)

    • Skip if CSV or if one sheet
    • Multiple sheets:
      • Present preview of the sheets ?? (limit to first 10 if a lot of sheets)
      • Option of choosing all sheets
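The format detection and sheet selection steps above could be sketched roughly as follows. This is not how data.js does it; the extension table and both function names are assumptions for illustration, and the limit of 10 previewed sheets comes from the note above.

```python
from pathlib import Path
from typing import List, Optional

# Hypothetical mapping from file extension to format label.
KNOWN_FORMATS = {".csv": "csv", ".tsv": "tsv", ".xls": "xls", ".xlsx": "xlsx"}


def detect_format(filename: str) -> Optional[str]:
    """Guess the format from the file extension; None means 'ask the user'."""
    return KNOWN_FORMATS.get(Path(filename).suffix.lower())


def sheets_to_offer(fmt: Optional[str], sheet_names: List[str], preview_limit: int = 10) -> List[str]:
    """Return the sheets to preview for selection; an empty list means skip the step."""
    if fmt in ("csv", "tsv") or len(sheet_names) <= 1:
        return []
    return sheet_names[:preview_limit]


fmt = detect_format("Budget 2019.XLSX")
offered = sheets_to_offer(fmt, ["Summary", "Q1", "Q2"])
print(fmt, offered)
```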

# 2. Structural check and correction

  1. Run a goodtables structure check on the data

    • => ability to load a sample of the data (not all of it if big)
    • => goodtables js version
  2. Preview the data and show structural errors

  3. [Optional / v2] Simple ETL in browser to correct this
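To give a feel for what a structural check reports, here is a small standalone sketch that works on a sample of rows. It is not goodtables and does not use its API; the error labels and the `(row, column, label)` tuple shape are this sketch's own assumptions.

```python
import csv
import io
from typing import List, Optional, Tuple

StructuralError = Tuple[int, Optional[int], str]  # (row number, column number, label)


def structural_errors(csv_text: str, sample_rows: int = 1000) -> List[StructuralError]:
    """Check headers and row shapes on at most `sample_rows` data rows."""
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows, None)
    if header is None:
        return [(1, None, "empty-file")]

    errors: List[StructuralError] = []
    seen = set()
    for column, name in enumerate(header, start=1):
        if not name.strip():
            errors.append((1, column, "blank-header"))
        elif name in seen:
            errors.append((1, column, "duplicate-header"))
        seen.add(name)

    for row_number, row in enumerate(rows, start=2):
        if row_number - 1 > sample_rows:  # only inspect a sample of big files
            break
        if not any(cell.strip() for cell in row):
            errors.append((row_number, None, "blank-row"))
        elif len(row) != len(header):
            errors.append((row_number, None, "ragged-row"))
    return errors


found = structural_errors("id,name,name\n1,a,b\n\n2,c\n")
print(found)
```

Reporting row and column positions makes it possible to highlight the offending cells directly in the preview.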

# 3. Table metadata

All done in a tabular like view if possible.

Infer the types and present this in a view that allows review:

  1. [Review the headers]
  2. Infer data-types and review
  3. [Add constraints] - optional and could leave out for now.

Then we do data validation against types (could do this live whilst they are editing …)

  1. Data validation (and correction?)
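Type inference followed by validation can be sketched as below: try progressively looser types on a column's values, then list the cells that fail the chosen type. The type names follow common Table Schema usage, but the candidate order, the treatment of empty cells and the helper names are assumptions of this sketch.

```python
from datetime import date
from typing import Callable, List, Tuple

# Candidate types, strictest first; "string" is the fallback.
CASTERS: List[Tuple[str, Callable[[str], object]]] = [
    ("integer", int),
    ("number", float),
    ("date", date.fromisoformat),
]


def infer_type(values: List[str]) -> str:
    """Return the first type that every non-empty value can be cast to."""
    non_empty = [value for value in values if value != ""]
    if not non_empty:
        return "string"
    for type_name, cast in CASTERS:
        try:
            for value in non_empty:
                cast(value)
        except ValueError:
            continue
        return type_name
    return "string"


def invalid_cells(values: List[str], type_name: str) -> List[int]:
    """Return the indexes of non-empty values that do not match the given type."""
    cast = dict(CASTERS).get(type_name)
    if cast is None:  # strings always validate
        return []
    bad = []
    for index, value in enumerate(values):
        if value == "":
            continue
        try:
            cast(value)
        except ValueError:
            bad.append(index)
    return bad


print(infer_type(["1", "2", ""]), invalid_cells(["1", "x", "3"], "integer"))
```

Because `invalid_cells` takes the type as an argument, it can be re-run live whenever the user overrides an inferred type during review.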

# 4. General metadata (if necessary)

Add the general metadata.

  1. Title, description
  2. License

# 5. Publish (atm: just download metadata and cleaned data)

Show the dataresource.json and datapackage.json for now …
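As a sketch of what the generated datapackage.json might contain, the code below assembles the general metadata and one resource into a descriptor shaped like a Frictionless Data Package and serializes it. The dataset values, the license identifier and the `build_datapackage` helper are illustrative assumptions.

```python
import json


def build_datapackage(name: str, title: str, description: str,
                      license_name: str, resource: dict) -> dict:
    """Combine general metadata and one resource into a package descriptor."""
    return {
        "name": name,
        "title": title,
        "description": description,
        "licenses": [{"name": license_name}],
        "resources": [resource],
    }


# Illustrative resource, as it might come out of the earlier steps.
resource = {
    "name": "finance-vix",
    "path": "finance-vix.csv",
    "format": "csv",
    "schema": {
        "fields": [
            {"name": "date", "type": "date"},
            {"name": "close", "type": "number"},
        ]
    },
}

package = build_datapackage(
    name="finance-vix",
    title="VIX volatility index",      # illustrative title
    description="Example dataset.",    # illustrative description
    license_name="ODC-PDDL-1.0",       # example license identifier
    resource=resource,
)
datapackage_json = json.dumps(package, indent=2)
print(datapackage_json)
```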

# Existing work

# Original Flow for DataHub data cli in 2016

Context:

  • you are pushing the raw file
  • and the extraction to get one or more data tables …
  • in the background we are creating a data package + pipeline

Command: `data push {file}`

Algorithm:

  1. Detect type / format
  2. Choose the data (e.g. sheet from excel)
  3. Review the headers
  4. Infer data-types and review
  5. [Add constraints]
  6. Data validation
  7. Upload
  8. Get back a link - view page (or the raw url) e.g. http://datapackaged.com/core/finance-vix
    • You can view, share, publish, [fork]

Details

  1. Detect file type

    1. file extension
    2. Offer guess
    3. Probable guess (options?)
    4. Unknown - tell us
    5. Detect encoding (for CSV)
  2. Choose the data

    1. Good data case
    2. 1 sheet => ok
    3. Multiple sheets guess and offer
    4. Multiple sheets - ask them (which to include)
    5. bad data case - e.g. selecting within table
  3. Review the headers

    • Here is what we found
    • More than one option for headers - try to reconcile
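For the "Detect encoding (for CSV)" step in the details above, one simple approach is to check for a byte order mark and then try a short list of candidate encodings in order. This is a minimal sketch, not the method used by the original CLI; the candidate list and the function name are assumptions.

```python
import codecs
from typing import Optional, Sequence


def detect_encoding(raw: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")) -> Optional[str]:
    """Return the first candidate encoding that decodes the bytes, or None."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with a byte order mark
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None  # unknown: ask the user


print(detect_encoding("café,1\n".encode("utf-8")))
print(detect_encoding(b"caf\xe9,1\n"))
```

Order matters here: permissive single-byte encodings accept almost any input, so stricter encodings such as UTF-8 should be tried first.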

# Design

See Design page »