# Data Load
Data load covers functionality for automatedly loading structured data such as tables into a data management system. Data load is usually part of a larger Data API (DataStore) component.
Load is distinct from uploading raw files (“blobs”) and from a “write” data API: from blobs because the data is structured (e.g. rows and columns) and that structure is expected to be preserved; from a write data API because the data is imported in bulk (e.g. a whole CSV file) rather than writing one row at a time.
The load terminology comes from ETL (extract, transform, load) though in reality this functionality will often include some extraction and transformation – extracting the structured data from the source formats and potentially transformation if data needs some cleaning.
As a Publisher i want to load my dataset (resource) into the DataStore quickly and reliably so that my data is available over the data API.
- Be “tolerant” where possible of bad data so that it still loads
- Get feedback on load progress, especially if something went wrong (with info on how I can fix it), so that I know my data is loaded (or if not what I can do about it)
- I want to update the schema for the data so that the data has right types (before and/or after load)
- I want to be able to update with a new resource file and only have it load the most recent
- Track Performance: As a Datopian Cloud Sysadmin I want to know if there are issues so that I can promptly address any problems for clients
- One Data Load Service per Cloud: As a Datopian Cloud Manager I may want to have one “DataLoad” service I maintain rather than one per instance for efficiency …
# Automatic load
- Users uploads a file to portal using the Dataset editor
- This is stored into the blob storage (i.e. local or cloud storage)
- A “PUSH” notification is triggered to loader service
- Loader service load file to data API backend (a structured database with web API)
# Sequence diagram for manual load
The load to the data API system can also be triggered manually:
# CKAN v2
The actual loading is provided by either DataPusher or XLoader. There is a common UI. There is no explicit API to trigger a load – instead it is implicitly triggered, for example when a Resource is updated.
The UI shows runs of the data loading system with information on success or failure. It also allows eidtors to manually retrigger a load.
TODO: add more screenshots
Service (API) For pushing tabular data to datastore. Do not confuse it with
ckanext/datapusher in ckan core codbase which is simply an extension communicating with the DataPusher API. DataPusher itself is a standalone service, running separately from CKAN app.
XLoader runs as async jobs within CKAN and bulk loads data via Postgres COPY command. This is fast but it does mean it only loads data as strings and explicit type-casting must be done after the load (the user must edit the data dictionary). XLoader was built to address 2 major issues with DataPusher:
- Speed: DataPusher converts data row by row and writes over the DataStore write API and hence is quite slow.
- Dirty data: DataPusher attempts to guess data types and then cast and this regularly led to failures which though logical were frustrating to users. XLoader gets the data in (as strings) and let’s the user sort out types later.
load_csv: https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/loader.py#L40 (opens new window)
- Loader: https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/jobs.py#L100 (opens new window)
How does the queue system work: job queue is done by RQ, which is simpler and is backed by Redis and allows access to the CKAN model. Job results are currently still stored in its own database, but the intention is to move this relatively small amount of data into CKAN’s database, to reduce the complication of install.
# Flow of Data in Data Load
Sequence diagram showing the journey of a tabular file into the DataStore:
Q: What happens with non-tabular data?
A: CKAN has a list of types of data it can process into the DataStore (TODO:link) and will only process those.
# What Issues are there?
Generally: the Data Load system is an hand-crafted, bespoke mini-ETL process. It would seem better to use high-quality third-party ETL tooling here rather than hand-roll be that for pipeline creation, monitoring, orchestration etc.
- No connection between DataStore system and CKAN validation extension powered by GoodTables https://github.com/frictionlessdata/ckanext-validation (opens new window) Thus, for example, users may edit the DataStore Data Dictionary and be confused that this has no impact on validation. More generally, data validation and data loading might naturally be part of one overall ETL process but Data Load system is not architected in a way that makes this easy to add.
- No support for Frictionless Data spec sand their ability to specific incoming data structure (CSV format, encoding, column types etc).
- Dependent on error-prone guessing of types or manual type conversion
- Makes it painful to integrate with broader data processing pipeline (e.g. clean separation would allow type guessing to be optimized elsewhere in another part of the ETL pipeline)
- Excel loading won’t work or won’t load all sheets
- https://github.com/ckan/ckanext-xloader#key-differences-from-datapusher (opens new window)
- Works terribly with loading a bit big data. It may for no reason crash after hour of loading. And after reload it goes along
- Is slow esp for large datasets and even smallish datasets e.g. 25Mb
- often fails due to e.g. data validation/casting errors but this not clear (and unsatisfying to the user)
- Doesn’t work with XLS(X)
- has problems fetching resources from Blob Storage (it fails and need to wait until the Resource is uploaded.)
- raising Exception NotFound when CKAN has a delay creating resources
- re-submits Resources when creating a new Resource
- XLoader sets
datastore_activebefore data is uploaded
# CKAN v3
The v3 implementation is named 💨🥫 AirCan: https://github.com/datopian/aircan (opens new window)
Its a lightweight, standalone service using AirFlow.
Status: Beta (June 2020)
- Runs as a separate microservice with zero coupling with CKAN core (=> gives cleaner separation and testing)
- Uses Frictionless Data patterns and specs where possible e.g. Table Schema for describing or inferring the data schema
- Uses AirFlow as the runner
- Uses common ETL / Data Flows patterns and frameworks
See Design page »