Sunday, May 22, 2011

String Matching in CouchDB Views

We're in the process of porting an application that has been running on SQL Server over to the fabulous and amazing CouchDB. We were originally under the impression that everyone accessing data from this application in their own code was doing so through our web service, which would have made our job pretty simple since we could swap the guts of the web service methods out and return the same data types to the caller, but upon further investigation we discovered that people had written their own custom queries directly against the database.

This alone isn't a big deal but in some cases people were running queries that included LIKE clauses, and since we opted not to install CouchDB-Lucene given both time constraints as well as the fact that the LIKE queries against SQL Server were pretty limited in scope and number, I thought I'd share what we came up with to do string matching in views in CouchDB.

This is by no means to suggest you should not use CouchDB-Lucene if you want true full-text searching against data in CouchDB, but in our case this was an acceptable compromise.

Matching Fields That Start With a String in Couch

SQL Equivalent: "WHERE field LIKE 'foo%'"

Let's assume I have a database called test and in that database I have documents that have fields of firstName and lastName. I want to write a view that will let me do wildcard matches against first names that begin with a string.

This turns out to be pretty simple given how keys work in CouchDB map functions. Since a view emits a key and a value and we can use start and end keys in our calls to CouchDB, we simply provide the string against which we want to match as our start key and some end key that will ensure we don't get back more than what we're wanting.

For example, let's say I want to match all documents in my database that start with 'Mat' so I can retrieve all people with a first name of Matt, or Matthew, or Mathew, or Mat, or Mathias ... you get the idea.

First I write a view that in its map function emits firstName as the key:

function (doc) {
  if (doc.firstName && doc.lastName) {
    emit(doc.firstName, doc);

Assume that my design document is 'people' and that's the map function for a view called 'byFirstName.' To call that view and get back only people with a first name staring with 'Mat' I use the following URL:


In case that wraps poorly in the blog post display, here's just the start and end keys:


That tells CouchDB to start its output for that view with anything that starts with Mat and end once it hits anything that starts with MatZ.

Matching Specific Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%KnownString%'"

We had some use cases where users had canned queries (i.e. users can't enter random search terms) that were looking for a specific term contained anywhere within a specific field. I say specific term here and in the example I use "KnownString" because if you know the string ahead of time this is a simple problem to solve, whereas ad hoc terms are more problematic, but I'll address that below.

Remember that within CouchDB views you have full access to JavaScript, so solving this use case is simply a matter of using a regex to match against the known term.

Let's say I want to pull all documents that have a bio field containing the term 'CouchDB':

function(doc) {
  if ( &&\bCOUCHDB\b/)) {
    emit(doc._id, doc);

Again, since I know the term ahead of time I can do a regex match against it quite easily in my view.

Matching Ad Hoc Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%adHocSearchTerm%'"

Where things get tricky in CouchDB without using something like CouchDB-Lucene is matching ad hoc strings. "Tricky" is actually putting it mildly, because the real story is you can't do this in CouchDB. So in use cases where people had code that had a search box into which users could type anything, we had to come up with another solution.

What I've found as I've been using CouchDB more and more is that it can shift things that you used to do in the database layer up into the application layer, and vice-versa. So in this case it was simply a matter of coming up with a view that pulled back a subset of documents into the application code, and then doing the matching there.

One caveat here is that since our database contains thousands of documents, it wasn't really feasible to pull back all the documents in the database and then perform matching in the application layer. Since these documents all have a date associated with them, what we wound up doing is using date range as start and end keys as a way of reducing the number of documents we have to match against in the application. This wasn't a huge burden on users and certainly will improve performance.

We wound up limiting documents returned by year (i.e. the users have to choose a year in which to search), which is enough of a range to not make things too annoying for users, but is also a small enough set of documents not to kill performance on the application side.

To call the view that uses date as its key, the URL params look like this to pull back all documents for 2011 in descending date order:


Remember that when you order descending you essentially flip the start and end keys around, hence why 2012/01/01 is used as the start key.

Once I have the documents back, I then deserialize the JSON into something usable by CFML and then loop over the documents to do my further refinement by search term.

Leaving out the subset controlled by date I described above, assuming I wanted to find all people with a bio field that contained the search term entered by a user on a form, the code winds up looking something like this:

<cfhttp url="http://server/test/_design/people/_view/hasBio"
        result="peopleJSON" />

<cfset peopleReturned =
        DeserializeJSON(peopleJSON.FileContent).rows />

<cfset matchingPeople = ArrayNew(1) />

<cfloop array="#peopleReturned#" index="person">
  <cfif FindNoCase(form.searchTerm, neq 0>
    <cfset ArrayAppend(matchingPeople, person) />

What we wind up with there is the matchingPeople array will contain only the people who had the search term included in their bio field.

The big caveat here again is that if you have a huge number of documents you can get into trouble on the application side, so make sure and limit what you get back from CouchDB since you'll wind up looping over all of those documents to do your search term matching.

Hope that helps others do some quick and dirty LIKE type queries in CouchDB. If there's a better way to do any of these I'm all ears!

No comments: