Data mining for the taboo: searching for what isn't there

danso · on Jan 1, 2016

A similar but slightly simpler computational concept is the `LEFT OUTER JOIN` in SQL, in which for every row on the "left"-side table is listed, whether or not it has a corresponding match on the "right"-side table...doing a `WHERE` filter for `NULL` values then shows you all of the left-side records that have no match.

INNER JOINs are plenty useful in themselves, of course...it's already difficult for humans to manually compare two lists to see the matches, and it's not done as routinely as it could be (think of the number of surprising news stories and investigative journalism that involves discovering that a person in one list is also in another list, e.g. campaign contributors, convicted sex offenders, etc). Given the difficulty of matching -- even though its insights are tangible to most people -- my experience has been that people don't even consider looking for what isn't there that should be there. Maybe it's a corollary of Donald Rumsfeld's apt observation of the "unknown unknowns".

Here's a toy example I use in one of my classes: Using the U.S. Social Security baby names data, look for the most popular baby names in 2010 that did not exist in the 1990 dataset:

http://2015.padjo.org/assignments/midterm-babysteps-sql/#the...

(actually, it's pretty surprising how much churn there is even within a decade...When I initially used LEFT JOINs just to see which popular names in, say, 1970, were still among the most popular names today, the result set was so small (sometimes, just empty), that I thought I had made a mistake in my query)