Frictionless Data: Lightweight standards and tooling for data sharing (frictionlessdata.io)
84 points by rkda on Sept 27, 2017 | 20 comments


Happy to see this on the frontpage. I work for Open Knowledge International, which develops the Frictionless Data standard. Feel free to ask me anything, and I'll make sure that I or someone else from the team answers it.


The frictionlessdata landing page is written in very general terms, so here's my technical summary of it...

The main idea of the "container" or "package" hinges on a file called "datapackage.json"[1].

An analogy would be the "sfv" files like "checksums.sfv" for verifying the integrity of files. Since so many people use "sfv" as a de facto standard, many programs exist to read it and verify the associated files. Another analogy would be DTD for XML files.

Similarly, if everybody could converge on the file "datapackage.json" as a metadata & schema description standard, a useful ecosystem of utilities and libraries for processing data would take advantage of it.

One example library would be: https://github.com/frictionlessdata/datapackage-py

(In the Python source code for "package.py"[2], Ctrl+F search for "datapackage.json" to see how it looks for that particular file.)

With a data wrangling API like that, one could then do joins on csv files directly[3] and write the results to another csv file with the associated "datapackage.json".

Instead of passing "dumb" csv or raw json files around, add a little "intelligence" to the dataset by way of "datapackage.json" so tools can parse the schema and process csv/json at a higher abstraction level. That leads to more "effortless" and "frictionless" data interoperability.
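
To make that concrete, here's a rough sketch (the resource, file, and field names are made up for illustration) of a minimal "datapackage.json" and how the datapackage-py 1.x API can read the CSV it describes:

    # Rough sketch; "sales.csv", "sales", "region" and "amount" are hypothetical.
    # A minimal datapackage.json sitting next to the CSV could look like:
    #
    #   {
    #     "name": "sales-example",
    #     "resources": [{
    #       "name": "sales",
    #       "path": "sales.csv",
    #       "schema": {"fields": [
    #         {"name": "region", "type": "string"},
    #         {"name": "amount", "type": "number"}
    #       ]}
    #     }]
    #   }
    #
    # datapackage-py (1.x API) can then read the CSV with the declared types:
    from datapackage import Package

    package = Package('datapackage.json')        # loads the descriptor
    resource = package.get_resource('sales')
    for row in resource.read(keyed=True):        # rows come back as dicts
        print(row['region'], row['amount'])      # values cast per the schema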

What I can't tell so far is whether "datapackage.json" already has adoption momentum across many communities such as Julia, TensorFlow, Hadoop, etc. and we need to get on the bandwagon -- or whether adoption is still in its infancy and there are other competing data "container/package" specifications to look at.

[1] http://frictionlessdata.io/guides/data-package/

[2] https://github.com/frictionlessdata/datapackage-py/blob/mast...

[3] http://frictionlessdata.io/guides/joining-tabular-data-in-py...


Hi,

(I work on the Frictionless Data specifications and tooling at Open Knowledge International.)

Thanks. We are working on the website at present [1], and we are trying to strike a balance between targeting technical and non-technical users, which is hard to get right.

About momentum - I can address that. We have seen significant momentum in the last 2 years: around open data / government transparency / civic tech (our natural environment - see https://okfn.org for details), around scientific / academic research via our work enabled by a grant from Sloan [2] (see http://frictionlessdata.io/case-studies/ for a small selection, more reports coming), and in general around data wrangling and data science efforts (including integration of Table Schema [3] with Pandas [4]).
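
As a rough illustration of the Pandas side of that integration (this assumes pandas >= 0.20, which added the 'table' orient; it is a sketch, not code from our libraries):

    # Sketch of the Pandas side of the Table Schema integration
    # (assumes pandas >= 0.20, where orient='table' was added).
    import pandas as pd

    df = pd.DataFrame({"region": ["north", "south"], "amount": [1.5, 2.0]})

    # Serialises the frame together with a Table Schema description of its
    # columns: the output embeds a "schema" block alongside the "data" rows.
    print(df.to_json(orient="table"))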

In terms of big data / machine learning - we have not actively worked in that space to date.

In terms of Julia, and other languages, we have a Julia library in development via our Tool Fund [5], and this will add to implementations [6] in PHP, Java, R, and Clojure which are already underway via the Tool Fund, and accompany the Python, JavaScript and Ruby implementations that we maintain directly at Open Knowledge International.

[1]: https://github.com/frictionlessdata/frictionlessdata.io/issu...

[2]: https://sloan.org

[3]: http://specs.frictionlessdata.io/table-schema/

[4]: https://pandas-docs.github.io/pandas-docs-travis/generated/p...

[5]: http://toolfund.frictionlessdata.io

[6]: https://github.com/frictionlessdata


Do you support YAML for people who dislike JSON?


Hi. One could read the YAML first with the library of choice, and then load it into the Data Package or Table Schema libraries.
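
For example, a small sketch (assuming PyYAML and a hypothetical "datapackage.yml" that mirrors the datapackage.json structure):

    # Sketch: parse a YAML descriptor, then hand the resulting dict to
    # datapackage-py. Assumes PyYAML is installed and that a hypothetical
    # "datapackage.yml" mirrors the datapackage.json structure.
    import yaml
    from datapackage import Package

    with open('datapackage.yml') as f:
        descriptor = yaml.safe_load(f)

    package = Package(descriptor, base_path='.')  # dict descriptors are accepted
    print(package.resource_names)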


This is an overly complicated data container format for not much advantage. To be honest, everything you can do with this can be done at the same level or better with SQLite, an actual database system. Having to implement 4 different parsers and validation functions spanning a mix of CSV, XML and JSON just to access what is essentially a CSV file is not feasible.


I agree that SQLite is amazing, and the problem that I had with some of the datapackage implementations (CSVLint) is that they stored validation errors in-memory (this is a deal breaker for data sets larger than a few hundred MB) and didn't work well when cross-validating data between multiple files. That's why I created ETLyte (https://github.com/sorrell/etlyte) which reads data into a SQLite DB, writes errors to the DB, and streams output to file/stdout.

I disagree that there is "not much advantage" in the format though. I use much of the "resources" area of the data container format and find it tremendously helpful for validating the expected datatypes (remember, SQLite has no true datatypes for columns), defining expected values, and defining some of the "ETL" functionality in ETLyte, like derived columns.

Also on the horizon is a fuzzing tool I'm creating to help exercise the boundaries and variations of data that an ETL process can expect, and this wouldn't be possible without a data container format. So again, I think there are very good use cases for it that we haven't even tapped into yet.


CSV as a serialization format? Ouch. Could we do better? My experience with CSV has been nothing but pain in the past: ambiguous formats, quoting issues, incompatible libraries between languages and popular GUI tools like Excel or data vis apps.

I wonder if there's anything better.


The point of the format is that you get JSON metadata to avoid the ambiguity issues (and it also places requirements on the CSV format to use, although whether people will follow them is a different matter).

On the other hand, I used their Python API a bit, and for loading tables it's way too complicated, with the documentation going into great detail on faffing with metadata but not on actual loading (and nothing like a simple `read_datatable` function).

That said, because it's just a folder with CSVs you can just read them individually, although then there's nothing to take advantage of the metadata automatically.


(I work on the Frictionless Data specifications and tooling at Open Knowledge International.)

Point taken about the API. The Data Package [1] and Table Schema [2] libraries are generally designed as low-level libraries for building higher-level applications on top of the specifications. goodtables-py [3] is an example of a higher-level application built on top. Still, we will look at it, and we'd welcome your feedback on the issue tracker [4].

[1]: https://github.com/frictionlessdata/datapackage-py/issues

[2]: https://github.com/frictionlessdata/tableschema-py/issues

[3]: https://github.com/frictionlessdata/goodtables-py/issues

[4]: https://github.com/frictionlessdata/datapackage-py/issues
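
For illustration, a rough sketch of the higher-level goodtables-py usage (the exact report keys and preset handling may vary by version):

    # Rough sketch of goodtables-py usage; report keys and preset handling
    # may differ between versions, so treat this as illustrative only.
    from goodtables import validate

    report = validate('datapackage.json', preset='datapackage')
    print(report['valid'])        # True/False for the package as a whole
    print(report['error-count'])  # total number of validation errors found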


Hi,

(I work on the Frictionless Data specifications and tooling at Open Knowledge International.)

CSV has many, many warts. However, it is the best thing we have right now for serialising data in a way that is easily read by humans (and consumer-grade software) and machines. Libraries like our Tabulator [1], which is used under the hood, help provide an API to deal with many of the gotchas of the format.

[1]: https://github.com/frictionlessdata/tabulator-py
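
A minimal tabulator sketch (the file name and options here are just an example):

    # Minimal tabulator-py sketch; the file name and options are examples.
    from tabulator import Stream

    with Stream('data.csv', headers=1, encoding='utf-8') as stream:
        print(stream.headers)                 # parsed header row
        for row in stream.iter(keyed=True):   # dicts keyed by header name
            print(row)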


Thanks, will have a look at tabulator. I appreciate the list of validators you published on the FD site; it can help at least a bit when working with non-techies to vet their data before submission.


Not on the list is ETLyte (https://sorrell.github.io/etlyte/), which I built so that an insurance company's corporate clients could vet their flatfiles before submitting them - it has worked very well across multiple files, with custom validations, and is very speedy (uses SQLite). As far as "non-techies" go, it's pretty straightforward, but confined to the command line, so I guess I need to get working on a web frontend for this :)


Although there are multiple issues with CSV, it's the de facto standard for tabular data. We try to avoid its pitfalls by constraining what the CSV must look like (UTF-8, header in the first line, etc.)

There's a brief explanation of why CSV was selected at https://specs.frictionlessdata.io/tabular-data-package/#why-...


I hate CSV so much. Simply opening a CSV in Excel and saving it breaks the format so much that most parsers cannot understand it anymore. Plus there is no way to know which encoding the CSV came in - Notepad++ might say one thing and PowerShell will say another. Have fun figuring out why Cyrillic characters or umlauts cause issues in your scripts...


The problem is not CSV, it's Excel. It's hard to blame the format when the application is insane enough to use locale settings to determine how to parse the file.


I think the Excel issue is a problem with Excel, not CSV. Why can't Excel preserve the format of the file?


>why Cyrillic characters or umlauts cause issues in your scripts...

Any text format can be broken by a broken character encoding. I've seen plenty of XML being used without any charset declarations. And JSON is in the same position as CSV.


Avro?


It is relevant to point out yesterday's article by Wes McKinney on Apache Arrow and the future of high-performance data formats - https://news.ycombinator.com/item?id=15335462



