Skip to main content

String Matching in CouchDB Views

We're in the process of porting an application that has been running on SQL Server over to the fabulous and amazing CouchDB. We were originally under the impression that everyone accessing data from this application in their own code was doing so through our web service, which would have made our job pretty simple since we could swap the guts of the web service methods out and return the same data types to the caller, but upon further investigation we discovered that people had written their own custom queries directly against the database.

This alone isn't a big deal but in some cases people were running queries that included LIKE clauses, and since we opted not to install CouchDB-Lucene given both time constraints as well as the fact that the LIKE queries against SQL Server were pretty limited in scope and number, I thought I'd share what we came up with to do string matching in views in CouchDB.

This is by no means to suggest you should not use CouchDB-Lucene if you want true full-text searching against data in CouchDB, but in our case this was an acceptable compromise.

Matching Fields That Start With a String in Couch

SQL Equivalent: "WHERE field LIKE 'foo%'"

Let's assume I have a database called test and in that database I have documents that have fields of firstName and lastName. I want to write a view that will let me do wildcard matches against first names that begin with a string.

This turns out to be pretty simple given how keys work in CouchDB map functions. Since a view emits a key and a value and we can use start and end keys in our calls to CouchDB, we simply provide the string against which we want to match as our start key and some end key that will ensure we don't get back more than what we're wanting.

For example, let's say I want to match all documents in my database that start with 'Mat' so I can retrieve all people with a first name of Matt, or Matthew, or Mathew, or Mat, or Mathias ... you get the idea.

First I write a view that in its map function emits firstName as the key:

function (doc) {
  if (doc.firstName && doc.lastName) {
    emit(doc.firstName, doc);
  }
}

Assume that my design document is 'people' and that's the map function for a view called 'byFirstName.' To call that view and get back only people with a first name staring with 'Mat' I use the following URL:

http://couch/test/_design/people/_view/byFirstName?startkey="Mat"&endkey="MatZ"

In case that wraps poorly in the blog post display, here's just the start and end keys:

startkey="Mat"
endkey="MatZ"

That tells CouchDB to start its output for that view with anything that starts with Mat and end once it hits anything that starts with MatZ.

Matching Specific Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%KnownString%'"

We had some use cases where users had canned queries (i.e. users can't enter random search terms) that were looking for a specific term contained anywhere within a specific field. I say specific term here and in the example I use "KnownString" because if you know the string ahead of time this is a simple problem to solve, whereas ad hoc terms are more problematic, but I'll address that below.

Remember that within CouchDB views you have full access to JavaScript, so solving this use case is simply a matter of using a regex to match against the known term.

Let's say I want to pull all documents that have a bio field containing the term 'CouchDB':

function(doc) {
  if (doc.bio && doc.bio.toUpperCase().match(/\bCOUCHDB\b/)) {
    emit(doc._id, doc);
  }
}

Again, since I know the term ahead of time I can do a regex match against it quite easily in my view.

Matching Ad Hoc Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%adHocSearchTerm%'"

Where things get tricky in CouchDB without using something like CouchDB-Lucene is matching ad hoc strings. "Tricky" is actually putting it mildly, because the real story is you can't do this in CouchDB. So in use cases where people had code that had a search box into which users could type anything, we had to come up with another solution.

What I've found as I've been using CouchDB more and more is that it can shift things that you used to do in the database layer up into the application layer, and vice-versa. So in this case it was simply a matter of coming up with a view that pulled back a subset of documents into the application code, and then doing the matching there.

One caveat here is that since our database contains thousands of documents, it wasn't really feasible to pull back all the documents in the database and then perform matching in the application layer. Since these documents all have a date associated with them, what we wound up doing is using date range as start and end keys as a way of reducing the number of documents we have to match against in the application. This wasn't a huge burden on users and certainly will improve performance.

We wound up limiting documents returned by year (i.e. the users have to choose a year in which to search), which is enough of a range to not make things too annoying for users, but is also a small enough set of documents not to kill performance on the application side.

To call the view that uses date as its key, the URL params look like this to pull back all documents for 2011 in descending date order:

?startkey="2012/01/01"&endkey="2011/01/01"&descending=true

Remember that when you order descending you essentially flip the start and end keys around, hence why 2012/01/01 is used as the start key.

Once I have the documents back, I then deserialize the JSON into something usable by CFML and then loop over the documents to do my further refinement by search term.

Leaving out the subset controlled by date I described above, assuming I wanted to find all people with a bio field that contained the search term entered by a user on a form, the code winds up looking something like this:

<cfhttp url="http://server/test/_design/people/_view/hasBio"
        method="get"
        result="peopleJSON" />

<cfset peopleReturned =
        DeserializeJSON(peopleJSON.FileContent).rows />

<cfset matchingPeople = ArrayNew(1) />

<cfloop array="#peopleReturned#" index="person">
  <cfif FindNoCase(form.searchTerm, person.value.bio) neq 0>
    <cfset ArrayAppend(matchingPeople, person) />
  </cfif>
</cfloop>

What we wind up with there is the matchingPeople array will contain only the people who had the search term included in their bio field.

The big caveat here again is that if you have a huge number of documents you can get into trouble on the application side, so make sure and limit what you get back from CouchDB since you'll wind up looping over all of those documents to do your search term matching.

Hope that helps others do some quick and dirty LIKE type queries in CouchDB. If there's a better way to do any of these I'm all ears!

Comments

Popular posts from this blog

Installing and Configuring NextPVR as a Replacement for Windows Media Center

If you follow me on Google+ you'll know I had a recent rant about Windows Media Center, which after running fine for about a year suddenly decided as of January 29 it was done downloading the program guide and by extension was therefore done recording any TV shows.

I'll spare you more ranting and simply say that none of the suggestions I got (which I appreciate!) worked, and rather than spending more time figuring out why, I decided to try something different.

NextPVR is an awesome free (as in beer, not as in freedom unfortunately ...) PVR application for Windows that with a little bit of tweaking handily replaced Windows Media Center. It can even download guide data, which is apparently something WMC no longer feels like doing.

Background I wound up going down this road in a rather circuitous way. My initial goal for the weekend project was to get Raspbmc running on one of my Raspberry Pis. The latest version of XBMC has PVR functionality so I was anxious to try that out as a …

Running a Django Application on Windows Server 2012 with IIS

This is a first for me since under normal circumstances we run all our Django applications on Linux with Nginx, but we're in the process of developing an application for another department and due to the requirements around this project, we'll be handing the code off to them to deploy. They don't have any experience with Linux or web servers other than IIS, so I recently took up the challenge of figuring out how to run Django applications on Windows Server 2012 with IIS.

Based on the dated or complete lack of information around this I'm assuming it's not something that's very common in the wild, so I thought I'd share what I came up with in case others need to do this.


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Assumptions and CaveatsThe operating system is Windows Server 2012 R2, 64-bit. If another variant of the operating system is being used, these instructions may not work properly.All of the soft…

Setting Up Django On a Raspberry Pi

This past weekend I finally got a chance to set up one of my two Raspberry Pis to use as a Django server so I thought I'd share the steps I went through both to save someone else attempting to do this some time as well as get any feedback in case there are different/better ways to do any of this.

I'm running this from my house (URL forthcoming once I get the real Django app finalized and put on the Raspberry Pi) using dyndns.org. I don't cover that aspect of things in this post but I'm happy to write that up as well if people are interested.

General Comments and Assumptions

Using latest Raspbian “wheezy” distro as of 1/19/2013 (http://www.raspberrypi.org/downloads)We’lll be using Nginx (http://nginx.org) as the web server/proxy and Gunicorn (http://gunicorn.org) as the WSGI serverI used http://www.apreche.net/complete-single-server-django-stack-tutorial/ heavily as I was creating this, so many thanks to the author of that tutorial. If you’re looking for more details on …