Regarding the space consumption, it seems like this is a tradeoff of storing two integers (plus the row overhead) per rating versus storing 335 bytes per movie/user.
For the join table, that's 2 integers * 575,281 ratings * 4 bytes = 4,602,248 bytes used in the join table.
With the filter, in each movie row, you need to store 1632 bits for the person_filter and 1048 bits for the hash, so 3,883 movies * (1632 bits + 1048 bits) = 1,300,805 bytes.
In each user row you need to store the same number of bits for the filter and hash, so 6,040 users * (1632 bits + 1048 bits) = 2,023,400 bytes.
Is my math here wrong? With this approach you save about 1.22MB, or about 27% over the join table approach (ignoring how much overhead there is for each row of the table and each page to store the table in).
Depending on the dataset it doesn't seem like the space savings would be worth the sacrifice in accuracy.
or you could add an index to the join table and see a similar speedup without the accuracy tradeoff. I'm not saying that the approach doesn't have applications, but it would be much more useful if the post discussed a case that couldn't be addressed easily by standard means.
For the join table, that's 2 integers * 575,281 ratings * 4 bytes = 4,602,248 bytes used in the join table.
With the filter, in each movie row, you need to store 1632 bits for the person_filter and 1048 bits for the hash, so 3,883 movies * (1632 bits + 1048 bits) = 1,300,805 bytes.
In each user row you need to store the same number of bits for the filter and hash, so 6,040 users * (1632 bits + 1048 bits) = 2,023,400 bytes.
Is my math here wrong? With this approach you save about 1.22MB, or about 27% over the join table approach (ignoring how much overhead there is for each row of the table and each page to store the table in).
Depending on the dataset it doesn't seem like the space savings would be worth the sacrifice in accuracy.