nonplus.me

Using bits to store tags, or perhaps not

In a previous post I looked at using PostgreSQL arrays with GIN indexes to store tags for quick subset and overlap matching. Soon after implementing this on filmlust (for the film genres) I noticed that some queries were taking longer than they should have: namely, queries combining a genre filter with a highly selective filter, such as a high minimum vote count.

So, I had a look at a query plan (note: I map the genres to integers for a smaller footprint, so this is actually looking for all of “action”, “adventure”, “classic”):

filmlust=> explain analyze select id from title WHERE imdb_count >= 5000 AND genre @> '{6,7,19}';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title  (cost=101.26..120.59 rows=5 width=4) (actual time=4.533..4.593 rows=55 loops=1)
   Recheck Cond: ((genre @> '{6,7,19}'::smallint[]) AND (imdb_count >= 5000))
   ->  BitmapAnd  (cost=101.26..101.26 rows=5 width=0) (actual time=4.522..4.522 rows=0 loops=1)
         ->  Bitmap Index Scan on ix_title_genre  (cost=0.00..8.82 rows=175 width=0) (actual time=3.922..3.922 rows=372 loops=1)
               Index Cond: (genre @> '{6,7,19}'::smallint[])
         ->  Bitmap Index Scan on ix_title_imdb_count  (cost=0.00..92.19 rows=5305 width=0) (actual time=0.563..0.563 rows=5296 loops=1)
               Index Cond: (imdb_count >= 5000)
 Total runtime: 4.620 ms

And there’s the crux of the issue. In a perfect world Postgres would perform a single scan on the imdb_count index, which is faster than the genre GIN index, and then check the few remaining rows (5,296 here) for genre matches. I could make that happen with a multicolumn index, but given the wide variety of filters I offer it isn’t a viable option: there would be too many permutations. Thus Postgres is forced to do two separate index scans on queries like this (see “combining multiple indexes” in the Postgres docs).

Drop the index? Surely you jest.

One way around this is to drop the genre index, so that Postgres first does the index scan on the fast, selective imdb_count condition and then filters the resulting rows on genre.

filmlust=> drop index ix_title_genre;
DROP INDEX
filmlust=> explain analyze select id from title WHERE imdb_count >= 5000 AND genre @> '{6,7,19}';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title  (cost=92.19..2169.77 rows=5 width=4) (actual time=0.838..5.229 rows=55 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: (genre @> '{6,7,19}'::smallint[])
   ->  Bitmap Index Scan on ix_title_imdb_count  (cost=0.00..92.19 rows=5305 width=0) (actual time=0.569..0.569 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 5.253 ms

Unfortunately that wasn’t any faster, because the array comparisons, as few as there may be, are too expensive. Perhaps there’s another way of storing genres that would speed up the “Filter” step in the query above.

Enter the bits

If we pack the genres into a bit string we can compare them very quickly with bitwise operations while still maintaining the small footprint I crave: a mere bit for each of the 24 genres in every row. To implement this technique we first need to map each genre to a position in the bit string. The mapping I use for filmlust is at the bottom of this file (coincidentally, this is also how I do the integer mapping noted above). So, for example, since “foreign” maps to 23, the bit at position 23 (counting from the least significant end) would be on for any film tagged “foreign”.
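As a sketch of the scheme (in Python rather than SQL): pack each genre id into a bitmask, with bit n on when the film has genre n. The ids 6, 7, and 19 for “action”, “adventure”, and “classic” come from the earlier query; the helper names and literal formatting here are just illustrative.

```python
# Sketch of the bit-packing scheme: bit n (from the least significant
# end) is on when the film has the genre mapped to integer n.

N_GENRES = 24  # total number of possible genres

def pack_genres(genre_ids):
    """Pack a collection of genre integers into a single bitmask."""
    mask = 0
    for g in genre_ids:
        mask |= 1 << g
    return mask

def to_bit_literal(mask):
    """Render a mask as a PostgreSQL bit-string literal, e.g. B'0001...'."""
    return "B'{:0{}b}'".format(mask, N_GENRES)

# "action" (6), "adventure" (7), "classic" (19)
mask = pack_genres([6, 7, 19])
print(to_bit_literal(mask))  # B'000010000000000011000000'
```

The printed literal matches the one used in the queries below, which is a handy sanity check when generating query strings from user-selected genres.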

Querying for all genres

Here’s the results for returning the same rows as the two examples above:

filmlust=> explain analyze select id from title_bit WHERE imdb_count >= 5000 AND genre & B'000010000000000011000000' = B'000010000000000011000000';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title_bit  (cost=96.97..1794.30 rows=28 width=4) (actual time=0.771..4.572 rows=55 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: ((genre & B'000010000000000011000000'::"bit") = B'000010000000000011000000'::"bit")
   ->  Bitmap Index Scan on ix_title_bit_imdb_count  (cost=0.00..96.96 rows=5562 width=0) (actual time=0.554..0.554 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 4.602 ms

Huzzah! It cut the time by a whole 0.018 ms. Well, that wasn’t the huge gain I was looking for, but I figured it warranted further investigation.
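The “Filter” line in the plan above is just a masked equality. In Python terms (plain integers standing in for Postgres bit strings), the “all of” test looks like:

```python
def has_all(film_genres: int, query_mask: int) -> bool:
    # "all of": ANDing must preserve every queried bit,
    # mirroring SQL's  genre & B'...' = B'...'
    return film_genres & query_mask == query_mask

query = (1 << 6) | (1 << 7) | (1 << 19)     # action, adventure, classic
film = query | (1 << 2)                     # same genres plus one more
print(has_all(film, query))                 # True
print(has_all((1 << 6) | (1 << 7), query))  # False: "classic" missing
```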

Benchmarks

The following benchmarks were run with my production code, so they include the page limits and a second query to obtain the row count. The first column describes the conditions: “small cull” is a not very selective extra condition (leaves about 1/3 of the rows), “large cull” is a selective one (the imdb_count condition above, which leaves only about 1/30). The second and third columns are the execution times using the array and bit string types, respectively; the fourth is the difference (arrays − bits).

Condition                     Arrays      Bits        Diff
all 1 genre                   0.01839     0.06916     -0.05077
all 3 genres                  0.00949     0.06374     -0.05425
any 3 genres                  0.02175     0.06315     -0.04140
all 1 genre, small cull       0.00969     0.04161     -0.03192
all 3 genres, small cull      0.00982     0.04101     -0.03119
any 3 genres, small cull      0.01587     0.02574     -0.00987
all 1 genre, large cull       0.01072     0.00770     0.00303
all 3 genres, large cull      0.01053     0.00713     0.00339
any 3 genres, large cull      0.01759     0.00604     0.01155

There’s a definite performance advantage to using bit strings with the more selective conditions, about 3x for matching any genre and 1.5x for matching all, but they’re a bit slower when flying solo or combined with a condition that doesn’t cut the data down very much.

Matching any genre

For posterity here’s how to match an overlap of genres (i.e. the film contains any of “action”, “adventure”, “classic”):

filmlust=> explain analyze select id from title_bit WHERE imdb_count >= 5000 AND genre & B'000010000000000011000000' != B'000000000000000000000000';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title_bit  (cost=98.34..1795.68 rows=5534 width=4) (actual time=0.758..4.767 rows=2228 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: ((genre & B'000010000000000011000000'::"bit") <> B'000000000000000000000000'::"bit")
   ->  Bitmap Index Scan on ix_title_bit_imdb_count  (cost=0.00..96.96 rows=5562 width=0) (actual time=0.548..0.548 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 4.903 ms
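The “any of” (overlap) test is the same AND mask with a non-zero check instead of an equality. A Python sketch:

```python
def has_any(film_genres: int, query_mask: int) -> bool:
    # "any of": at least one queried bit survives the AND,
    # mirroring SQL's  genre & B'...' != B'000...0'
    return film_genres & query_mask != 0

query = (1 << 6) | (1 << 7) | (1 << 19)  # action, adventure, classic
print(has_any(1 << 19, query))           # True: "classic" alone matches
print(has_any(1 << 3, query))            # False: no overlap
```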

Conclusion

You can eke out a little more performance using bit strings for tags/genres depending on the other conditions in your typical query. If they’re highly selective, use bit strings, otherwise you’ll probably be better off with the array data type. I’m sticking with the arrays for now because they are somewhat easier to work with and the queries are still a bit faster in many of my use cases. I’ve started logging user interaction to get a handle on which filters are most popular so I may switch over to bit strings in the future if necessary.

One caveat: the cost of the bit string method depends entirely on the number of possible tags, since every row carries a bit for every possible genre. With my data there are only 24 possible genres, so each row needs just 24 bits. Films average 4 genres apiece, which yields an array size of ~64 bits (4 smallints) per row, so the bit strings are roughly a third of the size. However, with 1000 possible genres the arrays would probably be the lighter option, because every row would need a massive 1000-bit column whether it had one tag or a dozen.
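The back-of-envelope arithmetic, ignoring the per-value headers and padding that both types carry on disk, comes out like so:

```python
# Rough per-row payload comparison for tags stored as a smallint array
# vs. a fixed-width bit string. On-disk overhead is ignored for both.
n_possible = 24      # possible genres
avg_tags = 4         # average genres per film
smallint_bits = 16

array_bits = avg_tags * smallint_bits  # ~64 bits for the typical array
bit_string_bits = n_possible           # 24 bits, regardless of tag count

print(bit_string_bits / array_bits)    # 0.375: roughly a third the size

# With 1000 possible tags the fixed-width bit string dominates:
print(1000 / (avg_tags * smallint_bits))  # 15.625: ~16x the array payload
```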

Thus, as usual, there’s no “one size fits all” solution, so pick your poison carefully.