nonplus.me

Launching chromium from xmonad prompt shortcuts

For some unknown reason chromium seems to accept only UTF-8 encoded urls as cli arguments. This conflicts with xmonad, which pipes ISO-8859 text when it spawns processes via shortcuts (e.g. the XMonad.Actions.Search module). To get around this you can point the default browser (xmonad reads the $BROWSER env variable in bash) at a script which uses the unix iconv program to sanitize xmonad’s text for chromium and then launches chromium with the purified url. The script also logs all the xmonad-spawned searches to a file for my future amusement.

In .bashrc:

export BROWSER="${HOME}/.chromium_opener.sh"

And .chromium_opener.sh…
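A minimal sketch of what it does (the chromium binary name and log location vary by system; here I assume chromium is on $PATH and log to ~/.search_history):

#!/bin/bash
# Re-encode the ISO-8859 text xmonad hands over into UTF-8 for chromium.
url=$(echo "$1" | iconv -f ISO-8859-1 -t UTF-8)
# Log the search for future amusement, then launch chromium with the clean url.
echo "$(date): ${url}" >> "${HOME}/.search_history"
exec chromium "${url}"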

Voila!

Using PostgreSQL arrays with GiN indexes for fast tag filtering

The setup

A major feature on sites like yelp and filmlust is the ability to quickly drill down search results by filtering in/out unwanted tags. In a typical normalized relational db the tags would have a many-to-many relationship with the main objects and they would be stored in a separate table with an association table joining the two. This leads to painfully slow filtering queries. A better method is to add a “tags” column using PostgreSQL’s array type to the main table and store the tags as text in there. For example…

Create a new table

CREATE TABLE posts (post_id INTEGER, title TEXT, content TEXT, tags TEXT[]);
INSERT INTO posts (post_id, title, content, tags) VALUES (1, 'el posto numero uno', 'witticisms here', '{"bacon", "chunky"}');
INSERT INTO posts (post_id, title, content, tags) VALUES (2, 'el posto numero dos', 'mas witticisms here', '{"chunky", "fauna"}');
INSERT INTO posts (post_id, title, content, tags) VALUES (3, 'el posto numero tres', 'plus witticisms here', '{"bacon", "fauna"}');

Alter an existing table

ALTER TABLE posts ADD COLUMN tags TEXT[]; -- add the array of text values
UPDATE posts SET tags='{"bacon", "chunky"}' WHERE post_id=1;
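If you’re migrating from the normalized setup, you can backfill the new column in one pass (a sketch, assuming the posts_tags_assoc and tags tables described below, and postgres 8.4+ for array_agg):

-- Rebuild each post's tags array from the old association table.
UPDATE posts SET tags = (
  SELECT array_agg(t.name)
  FROM posts_tags_assoc AS pta
  JOIN tags AS t ON pta.tag_id = t.tag_id
  WHERE pta.post_id = posts.post_id
);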

Querying

So the table is all set to go and the tags column is populated for all your items. Another advantage of using arrays over normalization is more intuitive queries, particularly for some of the more complex queries. To search for all posts tagged with “bacon” with the traditional many-to-many setup…

SELECT p.* FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name = 'bacon';

vs. the new array columns…

SELECT posts.* FROM posts WHERE 'bacon' = ANY (tags);

This is certainly more elegant, but the array type really shines when using the array operators to do overlaps and subsets. To find all posts which are tagged by “chunky” or “bacon” (an overlap) or to find the posts tagged by both “chunky” and “bacon” (a subset, or “is contained by” in postgres verbiage) in the normalized tables:

SELECT DISTINCT p.* FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name IN ('bacon', 'chunky'); -- overlap
SELECT p.post_id, p.title, p.content FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name IN ('bacon', 'chunky') GROUP BY p.post_id, p.title, p.content HAVING COUNT(pta.post_id) = 2; -- subset

vs. the new array columns…

SELECT posts.* FROM posts WHERE ARRAY['chunky', 'bacon'] && tags; -- overlap
SELECT posts.* FROM posts WHERE ARRAY['chunky', 'bacon'] <@ tags; -- subset

Not only is this easier on the developer but it’s going to be a bit faster since you’re not joining three tables together.

Optimizing with GiN indexes

At this point there’s a fully functioning table which you can query, and it’s probably a little bit quicker than the original many-to-many method. However, we can make it much faster by indexing the new tags array. The standard postgres index types (b-tree, hash, etc.) won’t help with these array operators, but fortunately there’s the oft-overlooked GiN index which can make these queries really fly. Before adding the index let’s take a look at what’s going on behind the scenes when we run one of the above queries. I’m using the filtering table from filmlust for this since postgres won’t use the indexes without a significant amount of data in there.

mydb=> explain analyze select id from films_test WHERE 'action' = ANY(tag);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on films_test  (cost=0.00..6666.27 rows=8507 width=4) (actual time=0.034..61.595 rows=14767 loops=1)
   Filter: ('action'::text = ANY (tag))
 Total runtime: 62.396 ms
(3 rows)

mydb=> explain analyze select id from films_test WHERE ARRAY['action','adventure'] <@ tag;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on films_test  (cost=0.00..4926.15 rows=174 width=4) (actual time=0.123..75.030 rows=3340 loops=1)
   Filter: ('{action,adventure}'::text[] <@ tag)
 Total runtime: 75.269 ms
(3 rows)

So it’s doing a normal sequential scan with row-by-row comparisons. Now, to add the GiN index:

CREATE INDEX ix_posts_tags ON "posts" USING GIN ("tags");

And running the queries again we get:

mydb=> explain analyze select id from films_test WHERE 'action' = ANY(tag);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on films_test  (cost=0.00..6666.27 rows=8507 width=4) (actual time=0.035..61.593 rows=14767 loops=1)
   Filter: ('action'::text = ANY (tag))
 Total runtime: 62.343 ms
(3 rows)

So that first query remained the same because the index isn’t used with the ANY construct. The GiN index only kicks in for the array operators (&&, @>, <@). Trying it again with the overlap operator:

mydb=> explain analyze select id from films_test WHERE ARRAY['action'] && tag;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on films_test  (cost=4.83..557.34 rows=174 width=4) (actual time=2.926..8.755 rows=14767 loops=1)
   Recheck Cond: ('{action}'::text[] <@ tag)
   ->  Bitmap Index Scan on ix_films_test  (cost=0.00..4.78 rows=174 width=0) (actual time=2.533..2.533 rows=14767 loops=1)
         Index Cond: ('{action}'::text[] <@ tag)
 Total runtime: 9.461 ms
(5 rows)

mydb=> explain analyze select id from films_test WHERE ARRAY['action','adventure'] <@ tag;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on films_test  (cost=4.83..557.34 rows=174 width=4) (actual time=2.761..5.229 rows=3340 loops=1)
   Recheck Cond: ('{action,adventure}'::text[] <@ tag)
   ->  Bitmap Index Scan on ix_films_test  (cost=0.00..4.78 rows=174 width=0) (actual time=2.517..2.517 rows=3340 loops=1)
         Index Cond: ('{action,adventure}'::text[] <@ tag)
 Total runtime: 5.427 ms
(5 rows)

Now we’re cooking with grease! A solid order of magnitude performance increase when using the proper index and corresponding operators.

Notes

Disadvantages

One downside is the management of the tags themselves. Some operations, such as renaming a tag, become more tedious. Whereas in a normalized schema you’d just change one column in one row of the tags table, you now need to search through the entire posts table and update all the tags arrays. There also could be some issues maintaining consistency since you no longer have one canonical list of tags. It’s not too difficult with data like mine on filmlust, where all the tags are known beforehand, but it becomes hairy once you have users inserting/updating all sorts of things on something like delicious. One possibility for amelioration here would be to still carry that separate tags table but cut out the middleman and put the tag_ids directly into the tags array in the posts table.
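For example, renaming a tag everywhere looks something like this (a sketch; array_replace requires postgres 9.3+, and the new tag name is hypothetical):

-- Rename the "bacon" tag to "pancetta" in every post that carries it.
UPDATE posts SET tags = array_replace(tags, 'bacon', 'pancetta') WHERE 'bacon' = ANY (tags);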

Size

Usually when denormalizing a schema you face the prospect of ballooning the physical size of your db due to all the newly redundant data. For something as small as a tag this isn’t really the case. Before, there was an association table which had a post_id and a tag_id for every tag/post combo, so we’re only adding the difference between the average tag length (~10 chars = 10 bytes) and the two id integers (2 * 4 = 8 bytes), ignoring the table overhead and the entire separate tags table. A paltry 2 bytes per tag, so I don’t think this is much of an issue here since the content itself is going to be much, much larger.
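If you want to check the difference on your own data, postgres will report relation sizes directly (table names here match the examples above):

SELECT pg_size_pretty(pg_total_relation_size('posts'));             -- array version
SELECT pg_size_pretty(pg_total_relation_size('posts_tags_assoc'));  -- old association table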

Pragmatic language experimentation

I purchased the book Seven Languages in Seven Weeks as a Christmas present for myself in order to dabble in several languages and hopefully choose one or two of my favorites to pursue further. I’m three languages deep at the moment and so far the book has fulfilled this expectation.

It’s a good resource for quickly grokking a new language as it’s meant to be used, as opposed to merely translating the ubiquitous C/Java/etc. approaches to solving problems. I especially like that the book is targeted towards experienced programmers who have already picked up multiple languages. Not much space is devoted to the tedious particulars of each language’s syntax, since these can be found readily via google or the main documentation. Instead the focus is on the aspects of the language which are unique and the problem domains for which it is particularly apt. For example, there is a section for Prolog, which is an exceptional language for quickly finding solutions to constraint-based problems. A huge chunk of this chapter walks through writing a program for solving sudoku puzzles, a problem well suited to Prolog. This exercise isn’t used for any of the other languages, since, well, none of the other languages are quite as appropriate for the task.

You can definitely tell it’s a work by “The Pragmatic Programmers,” as one of the goals throughout seems to be exposure to as many tools (i.e. languages) as possible so that you can pick the right one for your next project, providing the programmer with a screwdriver to drive that screw instead of the more familiar sledgehammer.

I’ve put all my code up in a github repo (including the exercises) and I plan on sharing my thoughts on each of the languages in the days to come.

Thank you, Glaucon

In reading Plato’s The Republic I’ve been entirely befuddled by Socrates on more than one occasion. The beauty of the piece is that it’s a dialog and the various interlocutors are often as nonplussed as myself. I’ve found that usually during these times of bewilderment one only has to read on for Glaucon, Adeimantus, etc. to request and subsequently receive an explanation from Socrates himself. Sure beats trying to make sense of some of Socrates’ verbal soup myself.

“However,” I said, “of all things that are such as to be related to something, those that are of a certain kind are related to a thing of a certain kind, as it seems to me, while those that are severally themselves are related only to a thing that is itself.”

“I don’t understand,” [Glaucon] said.

“Don’t you understand,” I said, “that the greater is such as to be greater than something?”

Dynamically adding methods to a Rails model

I’ve been using CanCan for implementing role-based access in my rails app and it’s been great so far. I had created a separate convenience method in the User model for testing each possible user role, e.g.

def admin?
  self.role == 'admin'
end

This was great and allowed me to do a simple user.admin? to test for role membership. After changing my roles around a few times I wanted a way to avoid the tedium of creating/editing those methods. I began experimenting with Ruby’s powerful class manipulation abilities and came up with this:
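In spirit it’s something like the following sketch (the ROLES list is illustrative; yours would hold whatever roles you actually use):

class User < ActiveRecord::Base
  ROLES = %w[admin moderator editor]  # illustrative list of possible roles

  # Define a predicate method (admin?, moderator?, ...) for each role.
  ROLES.each do |r|
    define_method("#{r}?") { role == r }
  end
end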

It dynamically adds all those methods based on a simple list of possible roles. I’ve only been using Ruby for a short while and the ease with which I was able to accomplish such a feat was pretty impressive. Score one for Ruby.