Nanocubes: Fast visualization of large spatiotemporal datasets (nanocubes.net)
31 points by orbifold on Nov 10, 2015 | hide | past | favorite | 10 comments


I can't figure out what is supposed to be new here.

Spatio-temporal only means 4 dimensions at most, I would think.

They give a time of 3.5 minutes for 4.5 million objects (not sure what that means). I've personally been able to rasterize Gaussian-filtered kd-trees at a rate of 10 million points in a few seconds.

Are people really paying for this? If so where can I find them?


(Author here, sorry I didn't catch this earlier) The 3.5 minutes is the preprocessing time (think of it as index-building time).

After the index is built, the goal is to generate outputs in time bounded (up to a poly-log factor) by the size of your screen, not by the size of your data.

After the index is built you get sub-millisecond answers for 2-D histograms over reasonably sized screens, and that turns out to be pretty useful for the use cases we're going for.


I was including the indexing time too, I think 10 million points in 4 seconds on one core was what I saw.

These were single high dimensional points, I'm guessing that the data being dealt with here is different?

The link in your post doesn't work but I did look at the paper and couldn't really see the uniqueness. I wasn't clear on what 'data cubes' was supposed to mean.

When you say 'screen' you are talking about ranges or bounds of area right?


We use "data cubes", as per Jim Gray's paper, http://web.stanford.edu/class/cs345d-01/rl/olap.pdf.
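For readers unfamiliar with the term: a data cube precomputes aggregates over every roll-up of the dimensions (Gray's CUBE operator). A minimal sketch, with toy records and dimensions of my own invention, not from the paper:

```python
from itertools import combinations
from collections import Counter

# Toy records: (device, day) for each tweet.
records = [("iphone", "mon"), ("iphone", "tue"), ("android", "mon")]

DIMS = ("device", "day")
ALL = "*"  # wildcard meaning "this dimension rolled up"

def data_cube(records):
    """Count records under every subset of dimensions (the CUBE)."""
    cube = Counter()
    n = len(DIMS)
    for rec in records:
        # Each record contributes to all 2^n roll-ups of itself.
        for k in range(n + 1):
            for keep in combinations(range(n), k):
                key = tuple(rec[i] if i in keep else ALL for i in range(n))
                cube[key] += 1
    return cube

cube = data_cube(records)
# cube[("iphone", "*")] == 2   -- iPhone tweets, any day
# cube[("*", "mon")] == 2      -- Monday tweets, any device
# cube[("*", "*")] == 3        -- grand total
```

The point of the precomputation is that any group-by query afterwards is a lookup, at the cost of storing the roll-ups.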

When I say "screen", I mean (e.g.) your laptop's screen resolution. If you're going to display a heatmap on a screen with p pixels, we (roughly) touch only O(p) memory cells at query time, independently of the dataset size.
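A toy illustration of that screen-bounded query idea (the flat dict index here is a made-up simplification, not the actual nanocube structure):

```python
# Hypothetical prebuilt index: (zoom, x, y) -> aggregate count.
# Building it scans the raw data once; querying never touches the raw data.
index = {
    (1, 0, 0): 5, (1, 0, 1): 2, (1, 1, 0): 0, (1, 1, 1): 9,
}

def heatmap(index, zoom, width, height):
    """Render a width x height heatmap by reading exactly p = width*height
    index cells, regardless of how many raw points were indexed."""
    return [[index.get((zoom, x, y), 0) for x in range(width)]
            for y in range(height)]

# A 2x2 "screen" at zoom 1 reads exactly 4 cells:
# heatmap(index, 1, 2, 2) -> [[5, 0], [2, 9]]
```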

When you say you "don't see the uniqueness", I'm not sure what you're comparing it against, so I can't say anything more. You mentioned Gaussian kd-trees earlier: is that the comparison you mean? In that case, these are two completely different data structures. For example (at least as described in the 2009 SIGGRAPH paper), you can't subset on a categorical dimension of the points. With the data structure we created, we can report (for example) a heatmap of all geolocated tweets generated by an iPhone, or all geolocated tweets generated by a Windows phone, or all geolocated tweets irrespective of device, without having to scan the 200M-tweet database.
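A toy sketch of that categorical subsetting (again a simplification of my own, not the actual nanocube layout): each spatial bin stores a per-device breakdown plus a roll-up, so a device filter is a lookup rather than a scan of the raw points:

```python
from collections import defaultdict

def build(points):
    """One pass over the raw points at build time; queries never scan them."""
    index = defaultdict(lambda: defaultdict(int))
    for x, y, device in points:
        index[(x, y)][device] += 1
        index[(x, y)]["*"] += 1   # roll-up over all devices
    return index

points = [(0, 0, "iphone"), (0, 0, "winphone"), (1, 1, "iphone")]
idx = build(points)

def count(idx, cell, device="*"):
    """Count of points in a spatial cell, optionally filtered by device."""
    return idx[cell][device]

# count(idx, (0, 0), "iphone") == 1
# count(idx, (0, 0)) == 2          -- any device
```

The nanocube's trick (per the paper) is keeping this kind of per-bin breakdown at every spatial resolution while sharing substructure to keep memory manageable; this sketch ignores the sharing entirely.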


> Are people really paying for this?

This is open source, and as far as I can see, there is no way to pay for it.


I meant in general, but that's fair.


In terms of "in general", there's MapD, a startup working in a very similar space: http://mapd.com/

It's definitely the case that our work is not ready to be shrink-wrapped and sold. But the need for it is there. I'm no longer at AT&T Research and can't answer for their current needs, and I unfortunately cannot talk about the usage scenarios when I was there. There's the academic paper you can read, http://nanocubes.net/paper.pdf, which has a few examples.

In practical terms, the difference between a system with a latency of a few seconds and one in which queries come back at 60 fps is significant, and it enables a bunch of interesting stuff that might not be obvious at first sight.


This post pairs well with the post on data degeneracy 'jandrewrogers wrote a few years back: http://realityminer.blogspot.com/2009/04/indexing-and-data-s...


So, how is this different from an in-memory OLAP-queryable column store like VertiPaq?


(author here, sorry I didn't see this earlier)

One of the principles we tried hard to follow is that after building the indices, the query times should be proportional to the resolution of the plots you're generating.

So the difference here is building a data structure that's fundamentally well-suited to interactive visualization. We push hard to get low latency, and we take trade-offs that let us generate query results well-suited to creating histograms, heatmaps, etc. These are typically multiresolution (sometimes you want mile-wide pixels, sometimes you want yard-wide pixels, or day-wide pixels vs. second-wide pixels) and require some index-building tricks that we hadn't seen combined in the way we did it.
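A minimal sketch of the multiresolution idea (a toy pyramid of my own, not the actual index): counts are pre-aggregated at every zoom level, so mile-wide and yard-wide pixels are both direct lookups:

```python
def build_pyramid(base, levels):
    """base: dict (x, y) -> count at the finest resolution.
    Returns a list of dicts; each level halves the resolution of the
    previous one by summing 2x2 blocks of cells."""
    pyramid = [dict(base)]
    for _ in range(levels):
        coarse = {}
        for (x, y), c in pyramid[-1].items():
            key = (x // 2, y // 2)
            coarse[key] = coarse.get(key, 0) + c
        pyramid.append(coarse)
    return pyramid

fine = {(0, 0): 3, (1, 0): 1, (2, 2): 4}
pyr = build_pyramid(fine, levels=1)
# pyr[1] == {(0, 0): 4, (1, 1): 4}: the two fine cells under (0, 0) merged.
```

The same halving works on the time axis (seconds rolling up into days), which is why both kinds of zoom can be answered at query time without revisiting the raw data.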



