# DataFrame

Notes on designing a dataframe.js, and on understanding data libraries and data models in general.

TODO: integrate https://github.com/datopian/dataframe.js - my initial review from ~2015 onwards.

# Introduction

Conceptually a data library consists of:

  • A data model, i.e. a set of classes for holding / describing data, e.g. Series (vector / 1-d array), DataFrame (table / 2-d array), and possibly higher-dimensional arrays
  • Tooling
    • Operations, e.g. group by, query, pivot, etc.
    • Import / Export: load from CSV, SQL, Stata, etc.
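The two-part anatomy above (data model plus tooling) can be sketched in a few lines. This is an illustrative sketch in Python, not an actual dataframe.js API; all class and method names here are invented for the example.

```python
# A minimal sketch of the anatomy above: a tiny data model (Series,
# DataFrame) plus one "tooling" operation (group by + sum). All names
# are illustrative, not a real library API.
from collections import defaultdict


class Series:
    """A 1-d labelled array: values plus an index of labels."""

    def __init__(self, values, index=None):
        self.values = list(values)
        self.index = list(index) if index is not None else list(range(len(self.values)))


class DataFrame:
    """A 2-d table: a dict of column name -> Series sharing one index."""

    def __init__(self, columns):
        self.columns = {name: Series(vals) for name, vals in columns.items()}

    def groupby_sum(self, key, value):
        """Toy operation: sum `value` column grouped by `key` column."""
        totals = defaultdict(float)
        for k, v in zip(self.columns[key].values, self.columns[value].values):
            totals[k] += v
        return dict(totals)


df = DataFrame({"city": ["Berlin", "London", "Berlin"], "sales": [10, 20, 5]})
print(df.groupby_sum("city", "sales"))  # {'Berlin': 15.0, 'London': 20.0}
```

Import/export tooling would then sit alongside the model, e.g. a `from_csv` constructor that parses rows into columns.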

# Our need

We need to build tools for wrangling and presenting data … that are …

  • focused on smallish data
  • run in the browser and/or are lightweight/easy to install

Why? Because …

  • We want to build easy-to-use, easy-to-install applications for non-developers (so they aren’t going to use pandas or a Jupyter notebook; they want a UI; and they are probably not working with big data, or if they are, we can work with a sample)
  • We’re often using these tools in web applications (or in e.g. a desktop app built on Electron)

Discussion

  • Could we not have the browser act as a thin client and push the code to some backend …? Yes, we could, but that means a whole other service …

What we want: something like OpenRefine but running in the browser …

# Why not just use R / Pandas

Context: R and pandas are already awesome. In fact, super-awesome. And they have huge existing communities and ecosystems.

Furthermore, not only do they do data analysis (which is why all the data science folks use them), but they are also pretty good for data wrangling (especially pandas).

So, we’d heavily recommend these (especially pandas) if you are a developer (and doing work on your local machine).

However, …

  • If you’re not a developer they can be daunting (even wrapped up in a Jupyter notebook).
  • If you are a developer and actually doing data engineering there are some issues:
    • pandas is a “kitchen-sink” library and depends on NumPy. This makes it a heavy-weight dependency and harder to put into data pipelines and flows.
    • Their monolithic nature makes them hard to componentize …

# Pandas

## Series

https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

    >>> s = pd.Series(data, index=index)

  • Series is a 1-d array with the convenience of labelling each cell in the array with the index (which defaults to 0…n if not specified).
  • This allows you to treat Series as an array and a dictionary
  • You can give it a name (“Series can also have a name attribute”):

    s = pd.Series(np.random.randn(5), name='something')
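The points above (default integer index, custom labels, and array-vs-dict access) can be seen in a few lines of pandas:

```python
# Series behaves both like an array (positional access) and like a
# dict (label-based access), and can carry a name.
import pandas as pd

s = pd.Series([10, 20, 30])  # index defaults to 0..n-1
t = pd.Series([10, 20, 30], index=["a", "b", "c"], name="something")

print(s.index.tolist())  # [0, 1, 2]
print(t.iloc[0])         # 10 -- positional, array-like access
print(t["b"])            # 20 -- label-based, dict-like access
print(t.name)            # 'something'
```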

## DataFrame

https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
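The “dict of Series objects” view is worth seeing concretely: columns can have different types, and the Series are aligned on their indexes (with missing cells filled with NaN):

```python
# A DataFrame built from a dict of Series: columns have different
# dtypes, and the Series are aligned on the union of their indexes.
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series(["x", "y", "z", "w"], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

print(df.shape)             # (4, 2) -- union of the two indexes
print(df.loc["d", "one"])   # nan -- "one" had no value for label "d"
print(df.dtypes["one"])     # float64
```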

### Higher dimensional arrays

Not supported. See xarray.

# XArray

Comment: mature and well thought out. Exists to generalize pandas to higher dimensions.

http://xarray.pydata.org/en/stable/ => multidimensional arrays in pandas

xarray has two core data structures, which build upon and extend the core strengths of NumPy and pandas. Both data structures are fundamentally N-dimensional:

DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.

Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.

(Personally I’m not sure about the analogy: a Dataset is more like a collection of Series or DataFrames.)
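A quick sketch of the two structures described above, assuming xarray and NumPy are installed; the variable names and toy data are mine:

```python
# DataArray: a labelled N-d array. Dataset: a dict-like container of
# DataArrays aligned along shared dimensions.
import numpy as np
import xarray as xr

temp = xr.DataArray(
    np.zeros((2, 3)),
    dims=("time", "city"),
    coords={"time": [2019, 2020], "city": ["Berlin", "London", "Paris"]},
    name="temperature",
)

ds = xr.Dataset({"temperature": temp})

print(temp.dims)                      # ('time', 'city')
print(ds["temperature"].shape)        # (2, 3)
print(temp.sel(city="London").shape)  # (2,) -- label-based selection
```

The label-based `sel` call is the N-dimensional analogue of indexing a pandas Series by label.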

# NTS

# Inbox

# Blaze

The Blaze ecosystem is a set of libraries that help users store, describe, query and process data. It is composed of the following core projects:

  • Blaze: An interface to query data on different storage systems
  • Dask: Parallel computing through task scheduling and blocked algorithms
  • Datashape: A data description language
  • DyND: A C++ library for dynamic, multidimensional arrays
  • Odo: Data migration between different storage systems

# Appendix: JS “DataFrame” Libraries

A list of existing libraries.

Note: when we started research on this in 2015 there were none that we could find, so it is a good sign that they are now appearing.

Other ones (not very active or without much info):

# References