
Using PostgreSQL arrays with GIN indexes for fast tag filtering

The setup

A major feature on sites like yelp and filmlust is the ability to quickly drill down search results by filtering tags in or out. In a typical normalized relational db the tags would have a many-to-many relationship with the main objects: they’d be stored in a separate table, with an association table joining the two. This leads to painfully slow filtering queries. A better method is to add a “tags” column of PostgreSQL’s array type to the main table and store the tags as text right there. For example…

Create a new table

CREATE TABLE posts (post_id INTEGER, title TEXT, content TEXT, tags TEXT[]);
INSERT INTO posts (post_id, title, content, tags) VALUES (1, 'el posto numero uno', 'witticisms here', '{"bacon", "chunky"}');
INSERT INTO posts (post_id, title, content, tags) VALUES (2, 'el posto numero dos', 'mas witticisms here', '{"chunky", "fauna"}');
INSERT INTO posts (post_id, title, content, tags) VALUES (3, 'el posto numero tres', 'plus witticisms here', '{"bacon", "fauna"}');

Alter an existing table

ALTER TABLE posts ADD COLUMN tags TEXT[]; -- add the array of text values
UPDATE posts SET tags='{"bacon", "chunky"}' WHERE post_id=1;
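If you’re migrating from a normalized setup, the new column can presumably be backfilled in one pass rather than post by post. A sketch, assuming association and tag tables shaped like the ones the join queries in the next section expect (posts_tags_assoc with post_id/tag_id and tags with tag_id/name):

UPDATE posts AS p
SET tags = (
    SELECT array_agg(t.name)            -- collapse each post's tag names into one array
    FROM posts_tags_assoc AS pta
    JOIN tags AS t ON pta.tag_id = t.tag_id
    WHERE pta.post_id = p.post_id
);

Posts with no tags end up with a NULL array here; wrap the subquery in COALESCE(…, '{}') if an empty array is preferable.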

Querying

So the table is all set to go and the tags column is populated for all your items. Another advantage of using arrays over normalization is more intuitive queries, particularly the more complex ones. To search for all posts tagged with “bacon” under the traditional many-to-many setup…

SELECT p.* FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name = 'bacon';
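(For reference, that normalized layout never actually gets defined in this post; the join queries here assume tables along these lines, with names inferred from the queries themselves:)

CREATE TABLE tags (tag_id INTEGER, name TEXT);
CREATE TABLE posts_tags_assoc (post_id INTEGER, tag_id INTEGER);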

vs. the new array columns…

SELECT posts.* FROM posts WHERE 'bacon' = ANY (tags);
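The same search can also be written with the “contains” operator, which is worth knowing about because, as shown later on, the GIN index works with the array operators but not with ANY:

SELECT posts.* FROM posts WHERE tags @> ARRAY['bacon']; -- the tags array contains 'bacon'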

This is certainly more elegant, but the array type really shines when using the array operators to do overlaps and subsets. To find all posts tagged with “chunky” or “bacon” (an overlap), or the posts tagged with both “chunky” and “bacon” (a subset, or “is contained by” in postgres verbiage), in the normalized tables:

SELECT DISTINCT p.* FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name IN ('bacon', 'chunky'); -- overlap
SELECT p.post_id, p.title, p.content FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name IN ('bacon', 'chunky') GROUP BY p.post_id, p.title, p.content HAVING COUNT(pta.post_id) = 2; -- subset

vs. the new array columns…

SELECT posts.* FROM posts WHERE ARRAY['chunky', 'bacon'] && tags; -- overlap
SELECT posts.* FROM posts WHERE ARRAY['chunky', 'bacon'] <@ tags; -- subset

Not only is this easier on the developer, but it’s also going to be a bit faster since you’re not joining three tables together.

Optimizing with GIN indexes

At this point there’s a fully functioning table which you can query, and it’s probably a little bit quicker than the original many-to-many method. However, we can make it much faster by indexing the new tags array. The standard postgres index types (b-tree, hash, etc.) don’t work with arrays, but fortunately there’s the oft-overlooked GIN index which can make these queries really fly. Before adding the index, let’s take a look at what’s going on behind the scenes when we run one of the above queries. I’m using the filtering table from filmlust for this since postgres won’t use the indexes without a significant amount of data in there.
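(The films_test table itself isn’t shown anywhere; judging from the queries it’s essentially an id plus a tag array. If you want to follow along, something like the following generates enough rows for the planner to consider an index; the tag values here are made up purely for illustration:)

CREATE TABLE films_test (id INTEGER, tag TEXT[]);
INSERT INTO films_test (id, tag)
SELECT n,
       CASE WHEN n % 3 = 0 THEN ARRAY['action', 'adventure']
            WHEN n % 3 = 1 THEN ARRAY['action', 'drama']
            ELSE ARRAY['comedy'] END
FROM generate_series(1, 100000) AS n;
ANALYZE films_test;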

mydb=> explain analyze select id from films_test WHERE 'action' = ANY(tag);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on films_test  (cost=0.00..6666.27 rows=8507 width=4) (actual time=0.034..61.595 rows=14767 loops=1)
   Filter: ('action'::text = ANY (tag))
 Total runtime: 62.396 ms
(3 rows)

mydb=> explain analyze select id from films_test WHERE ARRAY['action','adventure'] <@ tag;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on films_test  (cost=0.00..4926.15 rows=174 width=4) (actual time=0.123..75.030 rows=3340 loops=1)
   Filter: ('{action,adventure}'::text[] <@ tag)
 Total runtime: 75.269 ms
(3 rows)

So it’s doing a normal sequential scan with row-by-row comparisons. Now, to add the GIN index:

CREATE INDEX ix_posts_tags ON "posts" USING GIN ("tags");
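(The EXPLAIN runs above and below are against films_test, so the index there was presumably created the same way, something along the lines of:)

CREATE INDEX ix_films_test ON films_test USING GIN (tag);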

And running the queries again we get:

mydb=> explain analyze select id from films_test WHERE 'action' = ANY(tag);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on films_test  (cost=0.00..6666.27 rows=8507 width=4) (actual time=0.035..61.593 rows=14767 loops=1)
   Filter: ('action'::text = ANY (tag))
 Total runtime: 62.343 ms
(3 rows)

So that first query remained the same because the index isn’t used with the ANY construct. The GIN index only kicks in with the array operators (&&, @>, <@). Trying it again with the overlap operator:

mydb=> explain analyze select id from films_test WHERE ARRAY['action'] && tag;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on films_test  (cost=4.83..557.34 rows=174 width=4) (actual time=2.926..8.755 rows=14767 loops=1)
   Recheck Cond: ('{action}'::text[] <@ tag)
   ->  Bitmap Index Scan on ix_films_test  (cost=0.00..4.78 rows=174 width=0) (actual time=2.533..2.533 rows=14767 loops=1)
         Index Cond: ('{action}'::text[] <@ tag)
 Total runtime: 9.461 ms
(5 rows)

mydb=> explain analyze select id from films_test WHERE ARRAY['action','adventure'] <@ tag;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on films_test  (cost=4.83..557.34 rows=174 width=4) (actual time=2.761..5.229 rows=3340 loops=1)
   Recheck Cond: ('{action,adventure}'::text[] <@ tag)
   ->  Bitmap Index Scan on ix_films_test  (cost=0.00..4.78 rows=174 width=0) (actual time=2.517..2.517 rows=3340 loops=1)
         Index Cond: ('{action,adventure}'::text[] <@ tag)
 Total runtime: 5.427 ms
(5 rows)

Now we’re cooking with grease! A solid order of magnitude performance increase when using the proper index and corresponding operators.

Notes

Disadvantages

One downside is the management of the tags themselves. It can become more tedious to do some operations, such as changing the name of a tag. Whereas in a normalized schema you just change one column in one row of the tags table, you now need to search through the entire posts table and update all the tags arrays. There also could be some issues maintaining consistency since you don’t really have one list of tags anymore. It’s not too difficult with data like mine on filmlust, where all the tags are known beforehand, but it becomes hairy once you have users inserting/updating all sorts of things on something like delicious. One way to ameliorate this would be to still carry that separate tags table but cut out the middleman and put the tag_ids directly into the tags array in the posts table.
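As a small consolation, a bulk rename doesn’t have to happen in application code. On PostgreSQL 9.3 and later, array_replace can do it in one statement; a sketch, with made-up tag names:

UPDATE posts
SET tags = array_replace(tags, 'chunky', 'smooth')  -- swap the old tag name for the new one
WHERE 'chunky' = ANY (tags);                        -- only touch rows that actually have it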

Size

Usually when denormalizing a schema you face the prospect of growing the physical size of your db considerably due to all the newly redundant data. For something as small as a tag this isn’t really the case. Before, there was an association table which had a post_id and a tag_id for every tag/post combo, so per tag we’re only adding the difference between the average tag length (~10 chars = 10 bytes) and the two id integers (2 * 4 = 8 bytes), ignoring the table overhead and the entire separate tags table. A paltry 2 bytes per tag, so I don’t think this is much of an issue here since the content itself is going to be much, much larger.