17 Aug 2011
For some unknown reason chromium seems to accept only UTF-8 encoded urls for its cli arguments. This conflicts with xmonad, which pipes ISO-8859 text when it spawns processes via shortcuts (e.g. the XMonad.Actions.Search module). To get around this you can point the default browser (xmonad uses the $BROWSER env variable in bash) to a script which uses the unix iconv program to sanitize xmonad’s text for chromium input and then launches chromium with the purified url. It also logs all the xmonad-spawned searches to a file for my future amusement.
In .bashrc:
export BROWSER="${HOME}/.chromium_opener.sh"
And .chromium_opener.sh…
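A minimal sketch of such a script (the log file path is my own choice; the browser binary is assumed to be on the PATH as chromium):

```shell
#!/bin/sh
# ~/.chromium_opener.sh -- sanitize xmonad's ISO-8859 output for chromium.

if [ $# -gt 0 ]; then
    # Re-encode the URL from ISO-8859-1 to UTF-8.
    url=$(printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8)
    # Log the search for future amusement.
    printf '%s\t%s\n' "$(date '+%F %T')" "$url" >> "${HOME}/.xmonad_searches.log"
    # Launch chromium with the purified URL.
    exec chromium "$url"
fi
```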
Voila!
16 Aug 2011
The setup
A major feature on sites like yelp and filmlust is the ability to quickly drill down through search results by filtering in/out unwanted tags. In a typical normalized relational db the tags would have a many-to-many relationship with the main objects: they would be stored in a separate table, with an association table joining the two. This leads to painfully slow filtering queries. A better method is to add a “tags” column to the main table using PostgreSQL’s array type and store the tags as text in there. For example…
Create a new table
CREATE TABLE posts (post_id INTEGER, title TEXT, content TEXT, tags TEXT[]);
INSERT INTO posts (post_id, title, content, tags) VALUES (1, 'el posto numero uno', 'witticisms here', '{"bacon", "chunky"}');
INSERT INTO posts (post_id, title, content, tags) VALUES (2, 'el posto numero dos', 'mas witticisms here', '{"chunky", "fauna"}');
INSERT INTO posts (post_id, title, content, tags) VALUES (3, 'el posto numero tres', 'plus witticisms here', '{"bacon", "fauna"}');
Alter an existing table
ALTER TABLE posts ADD COLUMN tags TEXT[];
UPDATE posts SET tags='{"bacon", "chunky"}' WHERE post_id=1;
Querying
So the table is all set to go and the tags column is populated for all your items. Another advantage of using arrays over normalization is more intuitive queries, particularly the complex ones. To search for all posts tagged with “bacon” with the traditional many-to-many setup…
SELECT p.* FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name = 'bacon';
vs. the new array columns…
SELECT posts.* FROM posts WHERE 'bacon' = ANY (tags);
This is certainly more elegant, but the array type really shines when using the array operators to do overlaps and subsets. To find all posts tagged with “chunky” or “bacon” (an overlap), or to find the posts tagged with both “chunky” and “bacon” (a subset, or “is contained by” in postgres verbiage), in the normalized tables:
SELECT DISTINCT(p.*) FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name IN ('bacon', 'chunky');
SELECT p.post_id, p.title, p.content FROM posts AS p JOIN posts_tags_assoc AS pta ON p.post_id=pta.post_id JOIN tags t ON pta.tag_id=t.tag_id WHERE t.name IN ('bacon', 'chunky') GROUP BY p.post_id, p.title, p.content HAVING COUNT(pta.post_id) = 2;
vs. the new array columns…
SELECT posts.* FROM posts WHERE ARRAY['chunky', 'bacon'] && tags;
SELECT posts.* FROM posts WHERE ARRAY['chunky', 'bacon'] <@ tags;
Not only is this easier on the developer but it’s going to be a bit
faster since you’re not joining three tables together.
Optimizing with GiN indexes
At this point there’s a fully functioning table which you can query, and it’s probably a little quicker than the original many-to-many method. However, we can make it much faster by indexing the new tags array. The standard postgres index types (b-tree, hash, etc.) don’t work with arrays, but fortunately there’s the oft overlooked GiN index which can make these queries really fly. Before adding the index let’s take a look at what’s going on behind the scenes when we run one of the above queries. I’m using the filtering table from filmlust for this since postgres won’t use the indexes without a significant amount of data in there.
mydb=> explain analyze select id from films_test WHERE 'action' = ANY(tag);
QUERY PLAN
Seq Scan on films_test (cost=0.00..6666.27 rows=8507 width=4) (actual time=0.034..61.595 rows=14767 loops=1)
Filter: ('action'::text = ANY (tag))
Total runtime: 62.396 ms
(3 rows)
mydb=> explain analyze select id from films_test WHERE ARRAY['action','adventure'] <@ tag;
QUERY PLAN
Seq Scan on films_test (cost=0.00..4926.15 rows=174 width=4) (actual time=0.123..75.030 rows=3340 loops=1)
Filter: ('{action,adventure}'::text[] <@ tag)
Total runtime: 75.269 ms
(3 rows)
So it’s doing a normal sequential scan with row-by-row comparisons. Now, to add the GiN index:
CREATE INDEX ix_posts_tags ON "posts" USING GIN ("tags");
And running the queries again we get:
mydb=> explain analyze select id from films_test WHERE 'action' = ANY(tag);
QUERY PLAN
Seq Scan on films_test (cost=0.00..6666.27 rows=8507 width=4) (actual time=0.035..61.593 rows=14767 loops=1)
Filter: ('action'::text = ANY (tag))
Total runtime: 62.343 ms
(3 rows)
So that first query remained the same because the index isn’t used with the ANY construct. The GiN index only kicks in with the array operators (&&, @>, <@, =).
Trying it again with the overlap operator:
mydb=> explain analyze select id from films_test WHERE ARRAY['action'] && tag;
QUERY PLAN
Bitmap Heap Scan on films_test (cost=4.83..557.34 rows=174 width=4) (actual time=2.926..8.755 rows=14767 loops=1)
Recheck Cond: ('{action}'::text[] <@ tag)
-> Bitmap Index Scan on ix_films_test (cost=0.00..4.78 rows=174 width=0) (actual time=2.533..2.533 rows=14767 loops=1)
Index Cond: ('{action}'::text[] <@ tag)
Total runtime: 9.461 ms
(5 rows)
mydb=> explain analyze select id from films_test WHERE ARRAY['action','adventure'] <@ tag;
QUERY PLAN
Bitmap Heap Scan on films_test (cost=4.83..557.34 rows=174 width=4) (actual time=2.761..5.229 rows=3340 loops=1)
Recheck Cond: ('{action,adventure}'::text[] <@ tag)
-> Bitmap Index Scan on ix_films_test (cost=0.00..4.78 rows=174 width=0) (actual time=2.517..2.517 rows=3340 loops=1)
Index Cond: ('{action,adventure}'::text[] <@ tag)
Total runtime: 5.427 ms
(5 rows)
Now we’re cooking with grease! A solid order of magnitude performance
increase when using the proper index and corresponding operators.
Notes
Disadvantages
One downside is the management of the tags themselves. Some operations, such as renaming a tag, become more tedious. Whereas in a normalized schema you just change one column in one row of the tags table, you now need to search through the entire posts table and update all the tags arrays. There could also be some issues maintaining consistency since you don’t really have one canonical list of tags anymore. It’s not too difficult with data like mine on filmlust, where all the tags are known beforehand, but it becomes hairy once you have users inserting/updating all sorts of things on something like delicious. One possibility for amelioration here would be to still carry that separate tags table but cut out the middleman and put the tag_ids directly into the tags array in the posts table.
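A sketch of that hybrid approach (table and column names here are assumptions): renames stay a single-row update on the tags table, while filtering keeps the fast array operators.

```sql
-- Canonical tag list survives, so a rename touches one row.
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, name TEXT);

-- Posts carry integer ids instead of tag text.
ALTER TABLE posts ADD COLUMN tag_ids INTEGER[];

-- Renaming a tag:
UPDATE tags SET name = 'pork belly' WHERE name = 'bacon';

-- Filtering still uses the array operators, after looking up the ids:
SELECT p.*
FROM posts p
WHERE p.tag_ids && (SELECT array_agg(tag_id) FROM tags
                    WHERE name IN ('bacon', 'chunky'));
```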
Size
Usually when denormalizing a schema you face the prospect of substantially growing the physical size of your db due to all the newly redundant data. For something as small as a tag this isn’t really the case. Before, there was an association table holding a post_id and a tag_id for every tag/post combo, so we’re only adding the difference between the average tag length (~10 chars = 10 bytes) and the two id integers (2 * 4 = 8 bytes), ignoring the table overhead and the entire separate tags table. A paltry 2 bytes per tag, so I don’t think this is much of an issue here since the content itself is going to be much, much larger.
11 Apr 2011
I purchased the book Seven Languages in Seven Weeks as a Christmas present for myself in order to dabble in several languages and hopefully choose one or two of my favorites to pursue further. I’m three languages deep at the moment and so far the book has fulfilled this expectation.
It’s a good resource for quickly grokking a new language as it’s meant to be used, as opposed to merely translating the ubiquitous C/Java/etc. approaches to solving problems. I especially like that the book is targeted towards experienced programmers who have already picked up multiple languages. Not much space is devoted to the tedious particulars of each language’s syntax, since these can be found readily via google or the main documentation. Instead the focus is on the aspects of each language which are unique and the problem domains for which it is particularly apt. For example, there is a section for prolog, an exceptional language for quickly finding solutions to constraint-based problems. A huge chunk of this chapter walks through writing a program for solving sudoku puzzles, a problem well suited to prolog. This exercise isn’t used for any of the other languages, since, well, none of the other languages are quite as appropriate for the task.
You can definitely tell it’s a work by “The Pragmatic Programmers”, as one of the goals throughout seems to be exposure to as many tools (i.e. languages) as possible so that you can pick the right one for your next project, thus providing the programmer with a screwdriver to drive that screw instead of the more familiar sledgehammer.
I’ve put all my code up in a github repo (including the exercises) and I plan on sharing my thoughts on each of the languages in the days to come.
01 Feb 2011
In reading Plato’s The Republic I’ve been entirely befuddled by Socrates on more than one occasion. The beauty of the piece is that it’s a dialog, and the various interlocutors are often as nonplussed as I am. I’ve found that during these times of bewilderment one usually only has to read on for Glaucon, Adeimantus, etc. to request and subsequently receive an explanation from Socrates himself. Sure beats trying to make sense of some of Socrates’ verbal soup on my own.
“However,” I said, “of all things that are such as to be related to something, those that are of a certain kind are related to a thing of a certain kind, as it seems to me, while those that are severally themselves are related only to a thing that is itself.”
“I don’t understand,” [Glaucon] said.
“Don’t you understand,” I said, “that the greater is such as to be greater than something?”
…
24 Jan 2011
I’ve been using CanCan for implementing role-based access in my rails app and it’s been great so far. I had created a separate convenience method in the User model for testing each possible user role, e.g.
def admin?
  self.role == 'admin'
end
This was great and allowed me to do a simple user.admin?
to test for role membership. After changing my roles around a few times I wanted a way to avoid the tedium of creating/editing those methods. I began experimenting with Ruby’s powerful class manipulation abilities and came up with this:
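A sketch of the idea using Ruby’s define_method (the class shape and the role list here are assumptions):

```ruby
class User
  # One place to declare every possible role.
  ROLES = %w[admin moderator user].freeze

  attr_accessor :role

  # Generate admin?, moderator?, user?, etc. from the list,
  # replacing all the hand-written predicate methods.
  ROLES.each do |r|
    define_method("#{r}?") { role == r }
  end
end
```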
It dynamically adds all those methods based on a simple list of possible roles. I’ve only been using Ruby for a short while and the ease with which I was able to accomplish such a feat was pretty impressive. Score one for Ruby.