nonplus .me

Using Google's Closure Compiler in Phing

Phing is a build tool for automating and managing php deployments. One great feature is the ability to minify your .js files during the build process. Unfortunately Phing uses jsmin to accomplish this instead of the newer, more efficient compilers like Google’s Closure Compiler.

To add minifying via Closure I created this Phing task. The API is very similar to that of jsmin, e.g.

<taskdef name="jsCompile" classname="phing.tasks.my.JsCompileTask" />
<jsCompile compilerPath="docroot/bin/compiler.jar" targetPath="docroot/script/minified" failOnError="false">
    <fileset dir="docroot/script">
        <include name="**/*.js"/>
    </fileset>
</jsCompile>

Happy Phinging!

Using bits to store tags, or perhaps not

In a previous post I looked at using PostgreSQL arrays w/ GiN indexes to store tags for quick subset and overlap matching. I noticed soon after implementing this on filmlust (for the film genres) that some queries were taking longer than they should have. Namely, queries with a genre filter in conjunction with a highly selective filter, such as a high minimum vote count.

So, I had a look at a query plan (note: I map the genres to integers for a smaller footprint, so this is actually looking for all of “action”, “adventure”, “classic”):

filmlust=> explain analyze select id from title WHERE imdb_count >= 5000 AND genre @> '{6,7,19}';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title  (cost=101.26..120.59 rows=5 width=4) (actual time=4.533..4.593 rows=55 loops=1)
   Recheck Cond: ((genre @> '{6,7,19}'::smallint[]) AND (imdb_count >= 5000))
   ->  BitmapAnd  (cost=101.26..101.26 rows=5 width=0) (actual time=4.522..4.522 rows=0 loops=1)
         ->  Bitmap Index Scan on ix_title_genre  (cost=0.00..8.82 rows=175 width=0) (actual time=3.922..3.922 rows=372 loops=1)
               Index Cond: (genre @> '{6,7,19}'::smallint[])
         ->  Bitmap Index Scan on ix_title_imdb_count  (cost=0.00..92.19 rows=5305 width=0) (actual time=0.563..0.563 rows=5296 loops=1)
               Index Cond: (imdb_count >= 5000)
 Total runtime: 4.620 ms

And there’s the crux of the issue. In a perfect world it would perform the one scan on the imdb_count index, which is faster than the genre GiN index, and then scan the few remaining rows (5296 here) for genre matches. I could make this happen with a multicolumn index, but because of the large variety of filters I have it’s not a viable option as there would be too many permutations. Thus, Postgres is forced to do two separate index scans on queries like this (see “combining multiple indexes”).

Drop the index? Surely you jest.

One way to get around this is to drop the genre index, so Postgres would first do the index scan on the fast, selective, imdb_count condition, and then filter those rows.

filmlust=> drop index ix_title_genre;
DROP INDEX
filmlust=> explain analyze select id from title WHERE imdb_count >= 5000 AND genre @> '{6,7,19}';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title  (cost=92.19..2169.77 rows=5 width=4) (actual time=0.838..5.229 rows=55 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: (genre @> '{6,7,19}'::smallint[])
   ->  Bitmap Index Scan on ix_title_imdb_count  (cost=0.00..92.19 rows=5305 width=0) (actual time=0.569..0.569 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 5.253 ms

Unfortunately that wasn’t any faster because the comparisons, as few as there may be, are too expensive. Perhaps there’s another way of storing genres to speed up the “Filter” step in the above query process.

Enter the bits

If we pack the genres into a bit string we can compare them very quickly with bitwise operations while still maintaining the small memory size I crave. Adding a mere bit for each of the 24 genres to every row. To implement this technique we’ll first need to map all the genres to a position in the bit string. The one I use for filmlust is at the bottom of this file (coincidentally this is how I do the integer mapping I noted above). So the bit at the 4th least significant position (23) would be on if the film contained the “foreign” genre.

Querying for all genres

Here’s the results for returning the same rows as the two examples above:

filmlust=> explain analyze select id from title_bit WHERE imdb_count >= 5000 AND genre & B'000010000000000011000000' = B'000010000000000011000000';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title_bit  (cost=96.97..1794.30 rows=28 width=4) (actual time=0.771..4.572 rows=55 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: ((genre & B'000010000000000011000000'::"bit") = B'000010000000000011000000'::"bit")
   ->  Bitmap Index Scan on ix_title_bit_imdb_count  (cost=0.00..96.96 rows=5562 width=0) (actual time=0.554..0.554 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 4.602 ms

Huzzah! It cut the time by a whole .018ms. Well, that wasn’t the huge gain I was looking for, but I figured it warranted further investigation.

Benchmarks

The following benchmarks were run with my production code, so it includes the limits for pages and a second query to obtain the row count. The first column describes the conditions: small cull = a not very selective condition (leaves about 1/3), large cull = a selective condition (the imdb_count condition above, leaves only 1/30 of the rows). The second and third columns are the time it took to execute using the array and bit string types, respectively. The fourth column is the difference (array – bits).

                              Arrays      Bits        Diff
all 1 genre                   0.01839     0.06916     -0.05077
all 3 genres                  0.00949     0.06374     -0.05425
any 3 genres                  0.02175     0.06315     -0.04140
all 1 genre, small cull       0.00969     0.04161     -0.03192
all 3 genres, small cull      0.00982     0.04101     -0.03119
any 3 genres, small cull      0.01587     0.02574     -0.00987
all 1 genre, large cull       0.01072     0.00770     0.00303
all 3 genres, large cull      0.01053     0.00713     0.00339
any 3 genres, large cull      0.01759     0.00604     0.01155

There’s a definite performance advantage to using the bit strings with the more selective conditions, about 1.5x for matching any genre and 3x for matching all, but they’re a bit slower when flying solo or combined with a condition that doesn’t cut the data down very much.

Matching any genre

For posterity here’s how to match an overlap of genres (i.e. the film contains any of “action”, “adventure”, “classic”):

filmlust=> explain analyze select id from title_bit WHERE imdb_count >= 5000 AND genre & B'000010000000000011000000' != B'000000000000000000000000';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on title_bit  (cost=98.34..1795.68 rows=5534 width=4) (actual time=0.758..4.767 rows=2228 loops=1)
   Recheck Cond: (imdb_count >= 5000)
   Filter: ((genre & B'000010000000000011000000'::"bit") <> B'000000000000000000000000'::"bit")
   ->  Bitmap Index Scan on ix_title_bit_imdb_count  (cost=0.00..96.96 rows=5562 width=0) (actual time=0.548..0.548 rows=5296 loops=1)
         Index Cond: (imdb_count >= 5000)
 Total runtime: 4.903 ms

Conclusion

You can eke out a little more performance using bit strings for tags/genres depending on the other conditions in your typical query. If they’re highly selective, use bit strings, otherwise you’ll probably be better off with the array data type. I’m sticking with the arrays for now because they are somewhat easier to work with and the queries are still a bit faster in many of my use cases. I’ve started logging user interaction to get a handle on which filters are most popular so I may switch over to bit strings in the future if necessary.

One caveat: a major potential shortcoming of the bit string method is the size. With my data I only have 24 possible genres yielding 24 bits per row. There is an average of 4 genres per film which would yield an array size of ~64 bits (4 smallints) per row. So the bit strings are nearly a third of the size. However, if you implemented this with 1000 possible genres the arrays would probably be the lighter option because you would need that massive 1000 bit column to hold the bit string for each row.

Thus, as usual, there’s no “one size fits all” solution, so pick your poison carefully.

Aggressively caching CSS media with Sass and Tornado

Tornado comes with a nice static url feature for including static files in your templates. It appends hashes to those urls and turns on really aggressive file caching, since the hash bearing url of the requested static file will change whenever the file itself changes, thus busting any browser caching. Unfortunately there’s no inherent means of dealing with the urls in the css (or scss as the case may be) files (e.g. png icons).

To solve this there are a few (probably more) options that I thought of: regex replacements before deployment, making the css files into a tornado template and processing them, adding a hook to sass, or adding the same static_url method to Sass’s arsenal using custom functions. I opted for the last method and added the following to config.rb in order to extend Sass and add a new static_url function for .scss files.

The perfect French press cold brew

Ah, the cold brew. Nothing quite hits the spot on a hot summer morning. I (over?) indulged in it frequently over the past few months and at this point I’d say unequivocally that I prefer it to hot coffee, regardless of the season. Most of the people online were using regular drippers to home brew the stuff so I figured I’d write this guide for using a French press, a most magical apparatus for both hot and cold alike. Though, it’s so simple and intuitive as to render a tutorial nearly superfluous.

In any event, after a bit of experimentation I’ve settled on the process and measurements for extracting the optimal cold brew cup from a French press. I haven’t found the ideal beans yet, as I prefer to try something new every time, but I generally opt for something dark that’s heavy on the chocolate notes. I’ve noticed the chocolate flavor, for which I have a particular affinity, really gets amplified with the cold brew process. Recently I rather enjoyed the Stinky Bkln blend (ultimately from D'Amico) and a Guatemalan roast from Gorilla.

Without further ado…

Ingredients

  • 0.5 cups fresh coarsely ground coffee (the usual for a French press)
  • 3.5 cups water
  • A jug of sorts. I use a glass (no BPA!) quart bottle. Previous tenant: milk.
  • Cold storage area (e.g. a refrigerator)

Process

  1. Pour the coffee into the French press.
  2. Add the water and stir vigorously with a spoon.
  3. Put the uncovered press into cold storage for 12-24 hours.
  4. Put the jug in next to the press (pouring the cold brew into a warm container sucks).
  5. After the time has elapsed, take the brewed coffee out
  6. Break the coffee cake on the top and lightly stir.
  7. Scrape off the “fines” that remain hovering (demonstration).
  8. Cover and press!
  9. Let it sit for a minute or so.
  10. Unload the cold brew into the cool container (or straight into a mug) and partake.

The length of the brewing process is up for debate; some swear by 24 hours and others 12. I always do the latter mostly because I don’t want to go sans French press for a whole day and it works out nicely to do the preparation after dinner, let it sit overnight, and then have it waiting in the morning.

One of the great things about cold brewing is it doesn’t go bad within the hour like hot coffee. You can brew a large quantity and store it in the fridge for up to a week without tarnishing the flavor.

Note: I’m sorely lacking in the precision measurement department at the moment, but I’ll hopefully acquire a scale soon and write an update with more accurate numbers.

Managing virtualenv apps with Supervisor

I’ve been using supervisor a lot to manage my server processes and I’ve grown quite fond of it but it can make things a bit tricky when running apps contained in a virtualenv sandbox. There are a couple workarounds: setting all the environment variables that virtualenv exports in its bin/activate script or using a bash script to launch the app from within the virtual environment. I chose the latter route because it’s pretty simple and it can be convenient to have the script around for other tasks. The script, which I place in a my_app/script folder, follows.

And in the supervisor conf file:

[program:my_app_0]
user=my_user
autostart=true
autorestart=true
command=/home/my_user/my_env/my_app/script/env_run.sh python main.py -p 8880

[program:my_app_1]
user=my_user
autostart=true
autorestart=true
command=/home/my_user/my_env/my_app/script/env_run.sh python main.py -p 8881

In the last line of the script using exec to start the program is somewhat important because it replaces the bash script’s process with the python process, thus giving supervisor control of the app itself instead of the bash script. You can check the process ids to verify. Here’s a quick test I ran on my local box.

# Running without exec (don't actually daemonize (the "&") with supervisor)
scott@c ~ $ /home/scott/webdev/my_env/my_app/script/env_run.sh python main.py -p 8888 &
[1] 5152
scott@c ~ $ ps -ef | grep main.py
scott     5152  4642  0 12:30 pts/19   00:00:00 /bin/bash /home/scott/webdev/my_env/my_app/script/env_run.sh python main.py -p 8888
scott     5161  5152 15 12:30 pts/19   00:00:00 python main.py -p 8888
#
#
# Running with exec...
scott@c ~ $ /home/scott/webdev/my_env/my_app/script/env_run.sh python main.py -p 8888 &
[1] 3740
scott@c ~ $ ps -ef | grep main.py
scott     3740 27015  7 12:24 pts/9    00:00:00 python main.py -p 8888

No more bash middleman!

Note: the above examples were done for a tornado deployment.