Using bits to store tags, or perhaps not
15 Sep 2011

In a previous post I looked at using PostgreSQL arrays with GiN indexes to store tags for quick subset and overlap matching. Soon after implementing this on filmlust (for the film genres) I noticed that some queries were taking longer than they should have, namely queries combining a genre filter with a highly selective filter, such as a high minimum vote count.
So, I had a look at a query plan (note: I map the genres to integers for a smaller footprint, so this is actually looking for all of “action”, “adventure”, and “classic”):
```
filmlust=> explain analyze select id from title WHERE imdb_count >= 5000 AND genre @> '{6,7,19}';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title  (cost=101.26..120.59 rows=5 width=4) (actual time=4.533..4.593 rows=55 loops=1)
   Recheck Cond: ((genre @> '{6,7,19}'::smallint[]) AND (imdb_count >= 5000))
   ->  BitmapAnd  (cost=101.26..101.26 rows=5 width=0) (actual time=4.522..4.522 rows=0 loops=1)
         ->  Bitmap Index Scan on ix_title_genre  (cost=0.00..8.82 rows=175 width=0) (actual time=3.922..3.922 rows=372 loops=1)
               Index Cond: (genre @> '{6,7,19}'::smallint[])
         ->  Bitmap Index Scan on ix_title_imdb_count  (cost=0.00..92.19 rows=5305 width=0) (actual time=0.563..0.563 rows=5296 loops=1)
               Index Cond: (imdb_count >= 5000)
 Total runtime: 4.620 ms
```
And there’s the crux of the issue. In a perfect world it would perform the one scan on the imdb_count index, which is faster than the genre GiN index, and then scan the few remaining rows (5296 here) for genre matches. I could make this happen with a multicolumn index, but because of the large variety of filters I have it’s not a viable option as there would be too many permutations. Thus, Postgres is forced to do two separate index scans on queries like this (see “combining multiple indexes”).
Drop the index? Surely you jest.
One way to get around this is to drop the genre index, so Postgres would first do the index scan on the fast, selective, imdb_count condition, and then filter those rows.
```
filmlust=> drop index ix_title_genre;
DROP INDEX
filmlust=> explain analyze select id from title WHERE imdb_count >= 5000 AND genre @> '{6,7,19}';
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title  (cost=92.19..2169.77 rows=5 width=4) (actual time=0.838..5.229 rows=55 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: (genre @> '{6,7,19}'::smallint[])
   ->  Bitmap Index Scan on ix_title_imdb_count  (cost=0.00..92.19 rows=5305 width=0) (actual time=0.569..0.569 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 5.253 ms
```
Unfortunately that wasn’t any faster, because the array-containment comparisons, few as there may be, are too expensive. Perhaps there’s another way of storing genres that would speed up the “Filter” step in the query above.
Enter the bits
If we pack the genres into a bit string we can compare them very quickly with bitwise operations while still maintaining the small memory footprint I crave: it adds a mere bit for each of the 24 genres to every row. To implement this technique we first need to map each genre to a position in the bit string. The mapping I use for filmlust is at the bottom of this file (coincidentally, this is also how I do the integer mapping noted above). So, for example, the bit at the 4th least significant position (23) would be on if the film contained the “foreign” genre.
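With that mapping (genre integer n sets the nth least significant bit), building the 24-bit literal for “action”, “adventure”, “classic” (integers 6, 7, 19) can be sketched in Python; the function names here are my own, not from filmlust:

```python
# Sketch of packing genre integers into a 24-bit mask, assuming genre
# integer n maps to the nth least significant bit of the bit string.
N_GENRES = 24

def genre_mask(genre_ids):
    """Combine a set of genre integers into a single bitmask."""
    mask = 0
    for g in genre_ids:
        mask |= 1 << g
    return mask

def to_bit_literal(mask):
    """Render the mask as a Postgres bit-string value, MSB first."""
    return format(mask, "0{}b".format(N_GENRES))

print(to_bit_literal(genre_mask([6, 7, 19])))  # 000010000000000011000000
```

The printed string is exactly the `B'000010000000000011000000'` literal used in the queries below.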
Querying for all genres
Here are the results for returning the same rows as the two examples above:
```
filmlust=> explain analyze select id from title_bit WHERE imdb_count >= 5000 AND genre & B'000010000000000011000000' = B'000010000000000011000000';
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title_bit  (cost=96.97..1794.30 rows=28 width=4) (actual time=0.771..4.572 rows=55 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: ((genre & B'000010000000000011000000'::"bit") = B'000010000000000011000000'::"bit")
   ->  Bitmap Index Scan on ix_title_bit_imdb_count  (cost=0.00..96.96 rows=5562 width=0) (actual time=0.554..0.554 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 4.602 ms
```
Huzzah! It cut the time by a whole 0.018 ms. Well, that wasn’t the huge gain I was looking for, but I figured it warranted further investigation.
Benchmarks
The following benchmarks were run with my production code, so it includes the limits for pages and a second query to obtain the row count. The first column describes the conditions: small cull = a not very selective condition (leaves about 1/3), large cull = a selective condition (the imdb_count condition above, leaves only 1/30 of the rows). The second and third columns are the time it took to execute using the array and bit string types, respectively. The fourth column is the difference (array – bits).
| Condition | Arrays | Bits | Diff |
|---|---|---|---|
| all 1 genre | 0.01839 | 0.06916 | -0.05077 |
| all 3 genres | 0.00949 | 0.06374 | -0.05425 |
| any 3 genres | 0.02175 | 0.06315 | -0.04140 |
| all 1 genre, small cull | 0.00969 | 0.04161 | -0.03192 |
| all 3 genres, small cull | 0.00982 | 0.04101 | -0.03119 |
| any 3 genres, small cull | 0.01587 | 0.02574 | -0.00987 |
| all 1 genre, large cull | 0.01072 | 0.00770 | 0.00303 |
| all 3 genres, large cull | 0.01053 | 0.00713 | 0.00339 |
| any 3 genres, large cull | 0.01759 | 0.00604 | 0.01155 |
There’s a definite performance advantage to using bit strings with the more selective conditions: roughly 1.5x for matching all genres and nearly 3x for matching any. But they’re a bit slower when flying solo or when combined with a condition that doesn’t cut the data down very much.
Matching any genre
For posterity, here’s how to match an overlap of genres (i.e., the film contains any of “action”, “adventure”, or “classic”):
```
filmlust=> explain analyze select id from title_bit WHERE imdb_count >= 5000 AND genre & B'000010000000000011000000' != B'000000000000000000000000';
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title_bit  (cost=98.34..1795.68 rows=5534 width=4) (actual time=0.758..4.767 rows=2228 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: ((genre & B'000010000000000011000000'::"bit") <> B'000000000000000000000000'::"bit")
   ->  Bitmap Index Scan on ix_title_bit_imdb_count  (cost=0.00..96.96 rows=5562 width=0) (actual time=0.548..0.548 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 4.903 ms
```
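The two filters reduce to simple bitwise tests. A Python sketch of the subset (“all”) versus overlap (“any”) semantics, with masks as plain integers and hypothetical helper names:

```python
# Subset vs. overlap tests on packed genre masks, mirroring the SQL
# filters: `genre & query = query` and `genre & query != 0`.
def has_all(row_mask, query_mask):
    # Every queried bit must be set in the row's mask.
    return row_mask & query_mask == query_mask

def has_any(row_mask, query_mask):
    # At least one queried bit must be set in the row's mask.
    return row_mask & query_mask != 0

query = (1 << 6) | (1 << 7) | (1 << 19)  # action, adventure, classic
film = (1 << 6) | (1 << 2)               # has genre 6 plus an unrelated one
print(has_all(film, query), has_any(film, query))  # False True
```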
Conclusion
You can eke out a little more performance using bit strings for tags/genres depending on the other conditions in your typical query. If they’re highly selective, use bit strings, otherwise you’ll probably be better off with the array data type. I’m sticking with the arrays for now because they are somewhat easier to work with and the queries are still a bit faster in many of my use cases. I’ve started logging user interaction to get a handle on which filters are most popular so I may switch over to bit strings in the future if necessary.
One caveat: a major potential shortcoming of the bit string method is its size. With my data there are only 24 possible genres, so the bit string costs 24 bits per row. There is an average of 4 genres per film, which yields an array size of ~64 bits (4 smallints) per row, so the bit strings are a bit more than a third of the size. However, if you implemented this with 1000 possible genres, the arrays would probably be the lighter option, because every row would need a massive 1000-bit column to hold the bit string.
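The arithmetic behind that caveat is easy to check; a rough sketch that ignores PostgreSQL's per-value header overhead (which only makes the array option somewhat larger still):

```python
# Back-of-the-envelope per-row storage, in bits, for the two approaches.
SMALLINT_BITS = 16  # a Postgres smallint is 2 bytes

def bitstring_bits(possible_genres):
    return possible_genres            # one bit per possible genre

def array_bits(avg_genres_per_row):
    return avg_genres_per_row * SMALLINT_BITS

print(bitstring_bits(24), array_bits(4))    # 24 vs 64: bit string wins
print(bitstring_bits(1000), array_bits(4))  # 1000 vs 64: array wins
```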
Thus, as usual, there’s no “one size fits all” solution, so pick your poison carefully.