Saturday, October 20, 2012

Using Python to Compare Document IDs in Two CouchDB Databases

I'm doing a bit of research into what may or may not be an issue with a specific database in our BigCouch cluster, but regardless of the outcome of that side of things I thought I'd share how I used Python and couchdb-python to dig into the problem.

In our six-server BigCouch cluster we noticed that on the database for one of our most heavily trafficked applications the document counts displayed in Futon for each of the cluster members don't match. As I said above this may or may not be a problem (I'm waiting on further information on that particular point), but I was curious which documents were missing from the cluster member that has the lowest document count. (The interesting thing is the missing documents aren't truly inaccessible from the server with the lower document count, but we'll get to that in a moment.)

BigCouch is based on Apache CouchDB and adds true clustering along with some other very cool features. For those of you not familiar with CouchDB, you communicate with it through a RESTful HTTP interface, and all the data coming and going is JSON. The point here is that it's very simple to interact with CouchDB using any tool that talks HTTP.
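
To give a feel for the JSON involved, here's a minimal sketch that pulls the document IDs out of the kind of response CouchDB's _all_docs endpoint returns. The sample body, the IDs, and the helper name are made up for illustration, and nothing here talks to a real server:

```python
import json

# Hypothetical sample of what GET /dbname/_all_docs returns; the IDs and
# revisions here are made up for illustration.
SAMPLE_RESPONSE = '''
{"total_rows": 3, "offset": 0, "rows": [
  {"id": "doc-a", "key": "doc-a", "value": {"rev": "1-abc"}},
  {"id": "doc-b", "key": "doc-b", "value": {"rev": "2-def"}},
  {"id": "doc-c", "key": "doc-c", "value": {"rev": "1-ghi"}}
]}
'''

def ids_from_all_docs(body):
    """Return the document IDs listed in an _all_docs JSON response body."""
    payload = json.loads(body)
    return [row["id"] for row in payload["rows"]]

doc_ids = ids_from_all_docs(SAMPLE_RESPONSE)
print(doc_ids)
```

In practice the body would come from whatever HTTP client you like; the parsing step stays the same.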

Dealing with raw HTTP and JSON may not be difficult but isn't terribly Pythonic either, which is where couchdb-python comes in. couchdb-python lets you interact with CouchDB via simple Python objects and handles the marshaling of data between JSON and native Python datatypes for you. It's very slick, very fast, and makes using CouchDB from Python a joy.

In order to get to the bottom of my problem, I wanted to connect to two different BigCouch cluster members, get a list of all the document IDs in a specific database on each server, and then generate a list of the document IDs that don't exist on the server with the lower total document count.

Here's what I came up with:

>>> import couchdb
>>> couch1 = couchdb.Server('http://couch1:5984/')
>>> couch2 = couchdb.Server('http://couch2:5984/')
>>> db1 = couch1['dbname']
>>> db2 = couch2['dbname']
>>> ids1 = []
>>> ids2 = []
>>> for doc_id in db1:  # iterating a Database yields its document IDs
...     ids1.append(doc_id)
... 
>>> for doc_id in db2:
...     ids2.append(doc_id)
... 
>>> missing_ids = list(set(ids1) - set(ids2))

What that gives me, thanks to the awesomeness of Python and its ability to subtract one set from another (note that you can also use the difference() method on the set object to achieve the same result), is a list of the document IDs that are in the first list that aren't in the second list.
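
Here's a small standalone sketch of that set subtraction with made-up IDs standing in for the real ones, showing that the operator and the difference() method agree:

```python
# Made-up document IDs standing in for what each cluster member returned.
ids1 = ["doc-a", "doc-b", "doc-c", "doc-d"]
ids2 = ["doc-a", "doc-c"]

# The subtraction operator and difference() produce the same set.
missing_by_operator = set(ids1) - set(ids2)
missing_by_method = set(ids1).difference(ids2)

# Sets are unordered, so sort when a stable, readable list is wanted.
missing_ids = sorted(missing_by_operator)
print(missing_ids)
```

Note that difference() accepts any iterable, so the second list doesn't need to be converted to a set first.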

The interesting part came when I took one of the supposedly missing IDs and tried to pull up that document from the database in which it supposedly doesn't exist:

>>> doc = db2['supposedly_missing_id_here']

I was surprised to see that it returned the document just fine, meaning it must be getting it from another member of the cluster, but I'm still digging into what the expected behavior is on all of this. (It's entirely possible I'm obsessing over consistent document counts when I don't need to be.)
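
To run that same check across every supposedly missing ID rather than one at a time, something like the following sketch could work. The helper name is my own, and a plain dict stands in for the database here; the only assumption about the real object is that it supports the in operator, which couchdb-python's Database does:

```python
def split_by_reachability(doc_ids, db):
    """Split doc_ids into those db can find and those it cannot.

    db only needs to support the `in` operator, which couchdb-python's
    Database objects do; a plain dict works as a stand-in for testing.
    """
    reachable = [doc_id for doc_id in doc_ids if doc_id in db]
    unreachable = [doc_id for doc_id in doc_ids if doc_id not in db]
    return reachable, unreachable

# Stand-in data: a dict plays the role of db2, and the IDs are made up.
fake_db2 = {"doc-b": {"_id": "doc-b"}}
reachable, unreachable = split_by_reachability(["doc-b", "doc-d"], fake_db2)
print(reachable)
print(unreachable)
```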

So what did I learn through all of this?

  • The more I use Python the more I love it. Between little tasks like this and the fantastic experience I'm having working on our first full-blown Django project, I'm in geek heaven.
  • couchdb-python is awesome, and I'm looking forward to using it on a real project soon.
  • Even though we've been using CouchDB and BigCouch with great success for a couple of years now, I'm still learning what's going on under the hood, which for me is a big part of the fun.

CouchDB Tip: When You Can't Stop the Admin Party

I was setting up a new CouchDB 1.2 server today on Ubuntu Server, specifically following this excellent guide since sudo apt-get install couchdb still gets you CouchDB 0.10. Serious WTF on the fact that the apt installation method is years out of date -- maybe I should figure out who to talk to about it and volunteer to maintain the packages if it's just a matter of people not having time.

The installation went fine until I attempted to turn off the admin party. After I submitted the form containing the initial admin user's name and password, things just spun indefinitely. And apparently adding the admin user info manually to the [admins] section of the local.ini file no longer works either, since the password you type into the file doesn't get automatically hashed on a server restart like it used to.

Long and short of it is if you see this happening, chances are there's a permission problem with your config files, which are stored (if you compile from source) in /usr/local/etc/couchdb. In my case that directory and the files therein were owned by root and I'm not running CouchDB as root, so when I tried to fix the admin party the user that's running CouchDB didn't have permission to write to the files.

A quick chown on that directory structure and you're back to being an admin party pooper.
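
If you want to confirm the diagnosis before (or after) the chown, here's a minimal sketch that lists anything under a directory the current user can't write to. The helper name is mine, the demo runs against a throwaway temp directory, and for a real check you'd run it as the user CouchDB runs under and point it at your config directory:

```python
import os
import tempfile

def unwritable_paths(root):
    """Return paths under root that the current user cannot write to."""
    blocked = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        blocked.extend(p for p in candidates if not os.access(p, os.W_OK))
    return blocked

# Demo on a throwaway directory so the example runs anywhere; on a real
# server you would pass the config directory, e.g. /usr/local/etc/couchdb.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "local.ini"), "w") as handle:
    handle.write("[admins]\n")

demo_blocked = unwritable_paths(demo_dir)
print(demo_blocked)
```

An empty list means the current user can write to everything under that directory.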

Wednesday, October 10, 2012

Three Approaches to Handling Static Files in Django

I had a really great (and lengthy) pair programming session today with a coworker during which we spent a bit of time going over a couple of different approaches for dealing with static files in Django, so I thought I'd document and share this information while it's fresh in my mind.

First, a little background. If you're not familiar with Django it was originally created for a newspaper web site, specifically the Lawrence Journal-World, so the approach to handling what in the Django world are called "static files" -- meaning things like images, JavaScript, CSS, etc. -- is based on the notion that you might be using a CDN so you should have maximum flexibility as to where these files are located.

While the flexibility is indeed nice, if you're used to a more self-contained approach it takes a little getting used to, and there are a few different ways to configure your Django app to handle static files. I'm going to outline three approaches, but there are certainly more ways to handle static files than what I'll cover here, whether by combining these techniques differently or by using other solutions I may not be aware of. (And as I'm still relatively new to Django, if I'm misunderstanding any of this I'd love to hear where I could improve any of what I'm sharing here!)

One other caveat -- I'm focusing here on STATIC_URL and ignoring MEDIA_URL but the approach would be the same.

Commonalities Across All Approaches

First, even though it may not strictly be required depending on which approach you take for handling static files, since you wind up needing to use this for other reasons, we'll use django.template.RequestContext in render_to_response as opposed to the raw request object. This is required if you want access to settings like MEDIA_URL and STATIC_URL in your Django templates. For more details about RequestContext, the TEMPLATE_CONTEXT_PROCESSORS that are involved behind the scenes, and the variables this approach puts into your context, check the Django docs.

I'm also operating under the assumption that the static files will live in a directory called static that's under the main application directory inside your project directory (i.e. the directory that has your main settings.py file in it). Depending on the approach you use you may be able to put the static directory elsewhere, but unless stated otherwise, that's where the directory is assumed to be. (Note that if you store static files on another server entirely, such as using a CDN, STATIC_URL can be a full URL as opposed to a root-relative URL like /static/)

Also, in all examples it's assumed that the STATIC_URL setting in the main settings.py file is set to '/static/'.

Approach One (Basic): Use STATIC_URL Directly in Django Templates

This is the simplest approach and may be all you need. With STATIC_URL set to '/static/' in the main settings.py file, all you really have to worry about is using RequestContext in your view functions and then referencing {{ STATIC_URL }} in your Django templates.

Here's a sample views.py file:

from django.shortcuts import render_to_response
from django.template import RequestContext

def index(request, template_name='index.html'):
    return render_to_response(template_name, context_instance=RequestContext(request))

By using RequestContext the STATIC_URL variable will then be available to use in your Django templates like so:

<html>
<head>
    <script src="{{ STATIC_URL }}scripts/jquery/jquery-1.8.1.min.js"></script>
</head>
<body>
    <img src="{{ STATIC_URL }}images/header.jpg" />
</body>
</html>

That's all there is to it. Again, since /static/ will be relative to the root of the main application directory in your project, this example assumes the static directory is underneath your main application directory. For the template above, that means the static directory contains a scripts directory and an images directory.
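
Since the template simply drops the asset path right after {{ STATIC_URL }}, the result is plain string concatenation. Here's a tiny sketch of that in ordinary Python; the helper name and the CDN host are hypothetical and only there to illustrate why the trailing slash on STATIC_URL matters:

```python
def asset_url(static_url, relative_path):
    """Mimic "{{ STATIC_URL }}path" in a template: plain concatenation."""
    return static_url + relative_path

local = asset_url('/static/', 'images/header.jpg')
# Hypothetical CDN host, used only to show that a full URL works the same way.
cdn = asset_url('http://cdn.example.com/static/', 'images/header.jpg')
# Without the trailing slash the prefix and path run together.
broken = asset_url('/static', 'images/header.jpg')
print(local)
print(cdn)
print(broken)
```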

Approach Two: Use a URL Pattern, django.views.static.serve, and STATICFILES_DIRS

In this approach you leverage Django's excellent and hugely flexible URL routing to set a URL pattern that will be matched for your static files, have that URL pattern call the django.views.static.serve view function, and set the document_root that will be passed to the view function to a STATICFILES_DIRS setting from settings.py. This is a little bit more involved than the first approach but gives you a bit more flexibility since you can place your static directory anywhere you want.

The approach I took with this method was to set a CURRENT_PATH variable in settings.py (created by using os.path.abspath since we need a physical path for the document root) and leverage that to create the STATICFILES_DIRS setting. Here are the relevant chunks from settings.py:

import os
CURRENT_PATH = os.path.abspath(os.path.dirname(__file__).decode('utf-8')).replace('\\', '/')
...
STATICFILES_DIRS = (
    os.path.join(CURRENT_PATH, 'static'),
)

Note that the replace('\\', '/') bit at the end of the CURRENT_PATH setting is to make sure things work on Windows as well as Linux.
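
One wrinkle worth knowing about: on Windows, os.path is the ntpath module, and its join adds a backslash between the parts even when the first part already uses forward slashes. Here's a sketch using ntpath directly (so it behaves the same on any platform) with a hypothetical project path, showing that normalizing after the join is what leaves forward slashes throughout:

```python
import ntpath  # Windows path rules, importable on any platform

# Hypothetical Windows project directory, already normalized the way
# CURRENT_PATH is in the settings above.
current_path = 'C:\\projects\\mysite'.replace('\\', '/')

# ntpath.join still inserts a backslash between the two parts.
joined = ntpath.join(current_path, 'static')

# Normalizing after the join leaves forward slashes throughout.
normalized = joined.replace('\\', '/')
print(joined)
print(normalized)
```

Whether the mixed separators cause trouble depends on what consumes the path, so treat this as something to check rather than a required change.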

Next, set a URL pattern in your main urls.py file:

from django.conf import settings
...
urlpatterns = patterns('',
    # document_root must be a single directory path, so use the first entry
    url(r'^static/(?P<path>.*)$', 'django.views.static.serve', {'document_root': settings.STATICFILES_DIRS[0]}),
)

And then in your Django templates you simply prefix all your static assets with /static/ as opposed to using {{ STATIC_URL }} as a template variable. Even though you're specifying /static/ explicitly in your templates, you still have flexibility to put these files wherever you want since the URL pattern acts as an alias to the actual location of the static files.
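
To see what that URL pattern captures and how the captured path combines with the document root, here's a sketch in plain Python with no Django involved. The function name and the document root are hypothetical, and the real serve view does more than this (including guarding against paths that try to escape the root):

```python
import posixpath
import re

STATIC_PATTERN = re.compile(r'^static/(?P<path>.*)$')

def resolve_static(request_path, document_root):
    """Map a request path to a file under document_root, or None."""
    # Django matches URL patterns against the path without its leading slash.
    match = STATIC_PATTERN.match(request_path.lstrip('/'))
    if match is None:
        return None
    return posixpath.join(document_root, match.group('path'))

# Hypothetical document root, for illustration only.
resolved = resolve_static('/static/images/header.jpg', '/srv/mysite/static')
unmatched = resolve_static('/about/', '/srv/mysite/static')
print(resolved)
print(unmatched)
```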

Approach Three: Use staticfiles_urlpatterns and {% get_static_prefix %} Template Tag

django.contrib.staticfiles was first introduced in Django 1.3 and was designed to clean up, simplify, and create a bit more power for static file management. This approach gives you the most flexibility and employs a template tag instead of a simple template variable when rendering templates.

First, in settings.py we'll do the same thing we did in the previous approach, namely setting a CURRENT_PATH variable and then using that to set the STATICFILES_DIRS variable:


import os
CURRENT_PATH = os.path.abspath(os.path.dirname(__file__).decode('utf-8')).replace('\\', '/')
...
STATICFILES_DIRS = (
    os.path.join(CURRENT_PATH, 'static'),
)


Next, in urls.py we'll import staticfiles_urlpatterns from django.contrib.staticfiles.urls and call that function to add the static file URL patterns to the application's URL patterns:

from django.contrib.staticfiles.urls import staticfiles_urlpatterns
...
urlpatterns = patterns('',
    # your app's url patterns here
)

urlpatterns += staticfiles_urlpatterns()

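The final line there is what adds the static file URL patterns into the mix. Note that staticfiles_urlpatterns() only returns patterns when DEBUG is True, so this is a development convenience. If you output staticfiles_urlpatterns() you'll see something like this: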

[<RegexURLPattern None ^static\/(?P<path>.*)$>]

And finally, at the very top of your templates you load the static template tags and then simply use the {% get_static_prefix %} tag to render the static URL:

{% load static %}
<html>
<head>
    <script src="{% get_static_prefix %}scripts/jquery/jquery-1.8.1.min.js"></script>
</head>
<body>
    <img src="{% get_static_prefix %}images/header.jpg" />
</body>
</html>

Conclusion

So there you have it, three approaches that more or less accomplish the same thing, but depending on the specific needs of your application or environment one approach may work better for you than another.

For our purposes on our current application we're using the first approach outlined above since it's simple and meets our needs, but it's great to know there's so much flexibility around static file handling in Django when you need it. As always read the docs for more information and yet more options for managing static files in your Django apps.

Tuesday, October 9, 2012

Installing python-ldap in virtualenv on Ubuntu

We're authenticating against Active Directory in our current Python/Django project and though we've had excellent luck with python-ldap in general, I ran into issues when trying to install python-ldap in a virtualenv this afternoon. As always a lot of DuckDuckGoing and a minimal amount of headbanging led to a solution.

The error I was getting after activating the virtualenv and running pip install python-ldap pointed at gcc, which was misleading since gcc itself wasn't actually the issue:

error: Setup script exited with error: command 'gcc' failed with exit status 1

To add to the weirdness, when I installed python-ldap outside the context of a virtualenv, everything worked fine.

I'll save you the blow-by-blow and just tell you that on my machine at least, other than the required OpenLDAP installation and some other libraries, I also had to install libsasl2-dev:

sudo apt-get install libsasl2-dev

Once I had that installed, I could activate my virtualenv, run pip install python-ldap, and install without errors.
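
One way to confirm the build actually succeeded is to try the import from inside the activated virtualenv. Here's a minimal sketch; the helper name is mine, and the only library-specific detail is that python-ldap installs its module under the name ldap:

```python
import importlib

def can_import(module_name):
    """Return True if module_name imports cleanly in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True

# python-ldap is imported as "ldap"; the result depends on your environment.
print(can_import('ldap'))
```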

If you still run into issues, make sure (in addition to OpenLDAP) you have these packages installed:

  • python-dev
  • libldap2-dev
  • libssl-dev

Hope that saves someone else some time!