Matt Woodward's posterous

Matthew Woodward  //  * CFML, Grails, and Java Developer
* Principal IT Specialist, US Senate
* Open BlueDragon Steering Committee Member
* All-Around Geek

Sep 17 / 9:10am

Revisiting Retrieving Documents Between Two Dates From CouchDB

In a previous post I outlined how I was retrieving documents from CouchDB with a start date property less than the current date, and an end date property greater than the current date. To summarize, in my CouchDB view I created some date/time strings in JavaScript and only emitted documents in the view that met the date criteria.

My previous post got referenced in the CouchBase newsletter, and I'm really glad it did because while I came up with what I thought was a clever solution it was also wrong. (D'OH!)

The issue I didn't consider, which some kind commenters on the previous post pointed out, is that my approach creates side effects because I'm emitting documents in the view based on information that isn't in the document itself. Specifically, since I'm using the current system date/time when the view is created, the documents included in the view will be the ones for which the criteria are valid when the view is created.

What this means is that although views get updated with current data as data within documents changes, since the entire view isn't generated each time the criteria used to determine whether or not documents are included in the view is a fixed point in time. To put it another way, my current system date/time that was current when the view was first created essentially becomes hard-coded once the view is created, which isn't at all what I needed. This causes issues if the start and end date properties in the documents change after they've been added to the view because the view only checked to see if the date criteria was met at the time the document was added to the view.

There are some great suggestions in the comments on my previous post for including data in the document itself that would allow only valid documents to be pulled right from Couch, and you'll certainly want to check those out if you're dealing with a ton of data. The solution I'm using will not be ideal for massive datasets but since that isn't the situation I'm in with this data, I wanted to share the solution I came up with in case this works for other people.

To describe my documents again, I have documents that need to be displayed on a web page if their start date/time property is less than the current date/time and if their end date/time property is greater than the current date/time.

Since the valid ranges go in opposite directions for those fields, I didn't see a way to do something like have an array key that included both the start and end dates that would allow me to get only the documents I want back from Couch. But what I can do is use a single document property as a key in Couch and get close to what I want, and then I can pare the documents down further in the application code.

In my case the end date is the stronger limiting criterion, since over time there will be a large number of documents with both start and end dates in the past, but documents with end dates >= the current date will be much fewer in number (only a handful in the case of this specific data).

The first step to fix my issue was to rewrite my view to eliminate the date/time check in JavaScript since that's the cause of the unwanted side effect, and emit documents using the end date/time property as the key. I have some other criteria as well (checking type and a couple of other fields to pull valid documents for this particular display), but the basic view is now very simple:

function(doc) {
  emit(doc.dtEnd, doc);
}

With the end date/time as the key, on the application side I can simply use the current date/time as my start key when I call this view, and that gives me all documents with a valid end date/time (>= current date/time).

At this point I may still have documents that shouldn't be displayed based on the start date/time, however, since when people enter data into this application they can schedule things for future display (i.e. both start and end date/time are in the future). But, again since I'm not dealing with a huge amount of data once I limit by the end date/time, it's simply a matter of looping over the documents I get back from Couch and checking for a valid start date/time (<= current date/time) and only displaying those documents.
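
As a rough illustration of that application-side check, here's a minimal JavaScript sketch (my application code is CFML, so treat this as the idea rather than the real thing). The helper names formatCouchDate and filterActive and the sample rows are made up for the example; the rows array mimics the shape CouchDB returns from a view, and the date format is the "YYYY/MM/DD HH:MM:SS" string convention used in these documents.

```javascript
// Format a Date as "YYYY/MM/DD HH:MM:SS" (local time), matching the
// string format used for dtStart and dtEnd in the documents.
function formatCouchDate(d) {
  function pad(n) {
    return (n < 10 ? '0' : '') + n;
  }
  return d.getFullYear() + '/' + pad(d.getMonth() + 1) + '/' + pad(d.getDate()) +
    ' ' + pad(d.getHours()) + ':' + pad(d.getMinutes()) + ':' + pad(d.getSeconds());
}

// rows: the "rows" array from the view response, assumed to be already
// limited by startkey to documents whose dtEnd is >= now.
// Keeps only the rows whose document has already started.
function filterActive(rows, now) {
  var nowString = formatCouchDate(now);
  return rows.filter(function (row) {
    return row.value.dtStart <= nowString;
  });
}

// Hypothetical sample data
var rows = [
  { key: '2011/09/30 17:00:00',
    value: { _id: 'a', dtStart: '2011/09/01 08:00:00', dtEnd: '2011/09/30 17:00:00' } },
  { key: '2011/10/31 17:00:00',
    value: { _id: 'b', dtStart: '2011/10/01 08:00:00', dtEnd: '2011/10/31 17:00:00' } }
];

// "Now" is Sep 17, 2011 at 9:10am, so only document 'a' has started.
var active = filterActive(rows, new Date(2011, 8, 17, 9, 10, 0));
console.log(active.map(function (row) { return row.value._id; }));
```

Because the strings sort the same way the dates do, a plain string comparison is enough here and no date parsing is needed.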

The issue my original view code created makes total sense now, so thanks to the commenters on my previous post who pointed out the fatal flaw in my approach. Nothing like doing something wrong as a means of learning.

Filed under  //  CouchDB  
Aug 27 / 9:17am

Retrieving Documents Between Two Dates From CouchDB

I'm working on converting yet another application from using SQL Server to using CouchDB, and this morning I'm working with some announcement documents that are displayed based on their start and end date. There are numerous ways to approach this problem but I thought I'd share what I came up with in case this solution helps others, and also to see if there's maybe another approach I didn't consider.

First, since there is no date datatype in JSON, we've standardized (for better or worse) on storing dates as a string with the format "YYYY/MM/DD HH:MM:SS", e.g. "2011/08/27 09:22:36", so date and time separated by a space, always with leading zeros for single digits, and always using a 24-hour clock. This allows date/time strings to sort properly when they're used as keys, it's easy to split the string using the space if you need either just the date or just the time, and since this application is for my day job the time will always be in Eastern US time so we decided not to care about the timezone offset.

In the data I imported from SQL Server there is a dtStart and a dtEnd field so I just converted the SQL Server dates to our preferred CouchDB date format as I imported the data into CouchDB. So far so good.

The next step was to pull these documents from CouchDB based on their dtStart and dtEnd fields, and this is probably obvious but just so it's clear, I need to pull all documents of this type where dtStart <= now, and dtEnd >= now.

As I started creating my view in CouchDB for this, my first thought was to pull all the documents using an array including dtStart and dtEnd as the key. That way when I call the view I could, in theory, use a start and end key to get me the documents in the range of dates that I want.

That approach seems reasonable at first, but when you start trying to put it into practice things get weird rather quickly. This is because what you wind up needing is documents in which the first element of the key array is less than the current date, while the second element of the key array is greater than the current date. Maybe this is just "Saturday morning brain" on my part, but I didn't see a way to include both the start and end date in the key and get where I needed to go.

My next thought was to use only the end date as the key. This gets me a bit closer to what I need since I can at least use a start key to only get documents with an end date >= now, but I'm still faced with having to check the start date at the application level to see if the document is supposed to be displayed.

I'm sure there's some clever way to handle this situation with keys, and part of my reason for posting this is to see how others would approach this, but I messed around with keys for a while and didn't seem to be getting anywhere so I decided to take a different approach.

One of the great things about CouchDB is the fact that you have the full power of JavaScript available in your views. Although JSON doesn't know what a date is, JavaScript certainly does, so I decided that since I needed to pull things based on a specific date range across two fields in my documents the best place to handle that was in the view code itself.

Here's what I came up with for my map function:

function(doc) {
  // Build the current date/time as a "YYYY/MM/DD HH:MM:SS" string
  var d = new Date();
  var curYear = d.getFullYear();
  var curMonth = (d.getMonth() + 1).toString();
  var curDate = d.getDate().toString();
  var curHours = d.getHours().toString();
  var curMinutes = d.getMinutes().toString();
  var curSeconds = d.getSeconds().toString();

  // Add leading zeros so the string compares correctly
  if (curMonth.length == 1) {
    curMonth = '0' + curMonth;
  }

  if (curDate.length == 1) {
    curDate = '0' + curDate;
  }

  if (curHours.length == 1) {
    curHours = '0' + curHours;
  }

  if (curMinutes.length == 1) {
    curMinutes = '0' + curMinutes;
  }

  if (curSeconds.length == 1) {
    curSeconds = '0' + curSeconds;
  }

  var dateString = curYear + '/' + curMonth + '/' + curDate + ' ' +
      curHours + ':' + curMinutes + ':' + curSeconds;

  // Only emit announcements whose date range includes the current date/time
  if (doc.type == 'announcement' &&
      doc.dtStart <= dateString &&
      doc.dtEnd >= dateString) {
    emit(doc.dtEnd, doc);
  }
}

Now of course you could argue this would all be simpler if I stored the dtStart and dtEnd fields in my documents as milliseconds, because then I could just get the millisecond value of the current date and do a quick numeric comparison instead of all the string formatting and concatenation, and from that perspective you'd be absolutely right. One of the many things I love about CouchDB, however, is the ability to jump into Futon and more directly and easily interact with my data, so keeping the dates human readable is kind of nice. Now I could store both a string and the millisecond value I suppose, but since this did the trick I decided to leave well enough alone.

I'm very curious to hear how others might solve this problem. "You're doing it wrong" information would be quite welcome. ;-)

Filed under  //  CouchDB  
May 22 / 11:39am

String Matching in CouchDB Views

We're in the process of porting an application that has been running on SQL Server over to the fabulous and amazing CouchDB. We were originally under the impression that everyone accessing data from this application in their own code was doing so through our web service, which would have made our job pretty simple since we could swap the guts of the web service methods out and return the same data types to the caller, but upon further investigation we discovered that people had written their own custom queries directly against the database.

This alone isn't a big deal but in some cases people were running queries that included LIKE clauses, and since we opted not to install CouchDB-Lucene given both time constraints as well as the fact that the LIKE queries against SQL Server were pretty limited in scope and number, I thought I'd share what we came up with to do string matching in views in CouchDB.

This is by no means to suggest you should not use CouchDB-Lucene if you want true full-text searching against data in CouchDB, but in our case this was an acceptable compromise.

Matching Fields That Start With a String in Couch

SQL Equivalent: "WHERE field LIKE 'foo%'"

Let's assume I have a database called test and in that database I have documents that have fields of firstName and lastName. I want to write a view that will let me do wildcard matches against first names that begin with a string.

This turns out to be pretty simple given how keys work in CouchDB map functions. Since a view emits a key and a value and we can use start and end keys in our calls to CouchDB, we simply provide the string against which we want to match as our start key and some end key that will ensure we don't get back more than what we're wanting.

For example, let's say I want to match all documents in my database that start with 'Mat' so I can retrieve all people with a first name of Matt, or Matthew, or Mathew, or Mat, or Mathias ... you get the idea.

First I write a view that in its map function emits firstName as the key:

function (doc) {
  if (doc.firstName && doc.lastName) {
    emit(doc.firstName, doc);
  }
}

Assume that my design document is 'people' and that's the map function for a view called 'byFirstName.' To call that view and get back only people with a first name starting with 'Mat' I use the following URL:

http://couch/test/_design/people/_view/byFirstName?startkey="Mat"&endkey="MatZ"

In case that wraps poorly in the blog post display, here's just the start and end keys:

startkey="Mat"
endkey="MatZ"

That tells CouchDB to start its output for that view with anything that starts with Mat and end once it hits anything that starts with MatZ.
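
If you're building that query string in code, keep in mind that the keys are JSON strings and need to be URL encoded. Here's a rough JavaScript sketch; prefixRangeQuery is a made-up helper name, not anything CouchDB provides. It takes an optional end marker because the 'Z' suffix only works as long as nothing you care about sorts after it. The CouchDB documentation on view collation suggests appending a high Unicode character such as \ufff0 to the prefix instead, so the second call shows that variation.

```javascript
// Build the startkey/endkey query string for a "starts with" match.
// prefixRangeQuery is a hypothetical helper, not part of CouchDB.
function prefixRangeQuery(prefix, endMarker) {
  // Default to the 'Z' suffix used in the example above.
  var marker = endMarker === undefined ? 'Z' : endMarker;
  // Keys must be valid JSON, so strings keep their double quotes.
  var startKey = JSON.stringify(prefix);
  var endKey = JSON.stringify(prefix + marker);
  return '?startkey=' + encodeURIComponent(startKey) +
    '&endkey=' + encodeURIComponent(endKey);
}

console.log(prefixRangeQuery('Mat'));
// ?startkey=%22Mat%22&endkey=%22MatZ%22

// Variation using a high Unicode character as the end marker
console.log(prefixRangeQuery('Mat', '\ufff0'));
```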

Matching Specific Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%KnownString%'"

We had some use cases where users had canned queries (i.e. users can't enter random search terms) that were looking for a specific term contained anywhere within a specific field. I say specific term here and in the example I use "KnownString" because if you know the string ahead of time this is a simple problem to solve, whereas ad hoc terms are more problematic, but I'll address that below.

Remember that within CouchDB views you have full access to JavaScript, so solving this use case is simply a matter of using a regex to match against the known term.

Let's say I want to pull all documents that have a bio field containing the term 'CouchDB':

function(doc) {
  if (doc.bio && doc.bio.toUpperCase().match(/\bCOUCHDB\b/)) {
    emit(doc._id, doc);
  }
}

Again, since I know the term ahead of time I can do a regex match against it quite easily in my view.
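
One thing worth seeing in action is how the \b word boundaries behave, since they make the match a bit stricter than a SQL LIKE '%CouchDB%'. The sketch below runs the map function (written with single-backslash \b word boundaries) outside of CouchDB; the emit stand-in and the sample documents are assumptions made purely for the demonstration.

```javascript
// Stand-in for CouchDB's emit(), which only exists inside the view
// server. Collects emitted keys so the map function can be run locally.
var emitted = [];
function emit(key, value) {
  emitted.push(key);
}

// The map function, using \b word boundaries around the known term
function map(doc) {
  if (doc.bio && doc.bio.toUpperCase().match(/\bCOUCHDB\b/)) {
    emit(doc._id, doc);
  }
}

// Hypothetical sample documents
[
  { _id: 'p1', bio: 'I love CouchDB and CFML.' },      // whole word: matches
  { _id: 'p2', bio: 'Mostly a couchdb-lucene user.' }, // hyphen is a boundary: matches
  { _id: 'p3', bio: 'Author of MyCouchDBTool.' },      // embedded in a word: no match
  { _id: 'p4' }                                        // no bio field: skipped
].forEach(map);

console.log(emitted);
```

If you want behavior closer to LIKE '%KnownString%', where the term can appear inside a longer word, drop the \b markers from the regex.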

Matching Ad Hoc Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%adHocSearchTerm%'"

Where things get tricky in CouchDB without using something like CouchDB-Lucene is matching ad hoc strings. "Tricky" is actually putting it mildly, because the real story is you can't do this in CouchDB. So in use cases where people had code that had a search box into which users could type anything, we had to come up with another solution.

What I've found as I've been using CouchDB more and more is that it can shift things that you used to do in the database layer up into the application layer, and vice-versa. So in this case it was simply a matter of coming up with a view that pulled back a subset of documents into the application code, and then doing the matching there.

One caveat here is that since our database contains thousands of documents, it wasn't really feasible to pull back all the documents in the database and then perform matching in the application layer. Since these documents all have a date associated with them, what we wound up doing is using a date range as the start and end keys to reduce the number of documents we have to match against in the application. This wasn't a huge burden on users and it certainly improves performance.

We wound up limiting documents returned by year (i.e. the users have to choose a year in which to search), which is enough of a range to not make things too annoying for users, but is also a small enough set of documents not to kill performance on the application side.

To call the view that uses date as its key, the URL params look like this to pull back all documents for 2011 in descending date order:

?startkey="2012/01/01"&endkey="2011/01/01"&descending=true
 

Remember that when you order descending you essentially flip the start and end keys around, hence why 2012/01/01 is used as the start key.
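
Here's a small JavaScript sketch of building those params for whatever year the user picks. The yearRangeQuery name is made up, and it assumes the view keys are date strings that begin with "YYYY/MM/DD".

```javascript
// Build query params for "all documents in a given year, newest first".
// Assumes keys are date strings in "YYYY/MM/DD ..." format.
function yearRangeQuery(year) {
  // With descending=true the start key is the HIGH end of the range.
  var startKey = JSON.stringify((year + 1) + '/01/01');
  var endKey = JSON.stringify(year + '/01/01');
  return '?startkey=' + encodeURIComponent(startKey) +
    '&endkey=' + encodeURIComponent(endKey) +
    '&descending=true';
}

// Decoded here only to make the output readable
console.log(decodeURIComponent(yearRangeQuery(2011)));
// ?startkey="2012/01/01"&endkey="2011/01/01"&descending=true
```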

Once I have the documents back, I then deserialize the JSON into something usable by CFML and then loop over the documents to do my further refinement by search term.

Leaving out the subset controlled by date I described above, assuming I wanted to find all people with a bio field that contained the search term entered by a user on a form, the code winds up looking something like this:

<cfhttp url="http://server/test/_design/people/_view/hasBio" 
        method="get" 
        result="peopleJSON" />
 
<cfset peopleReturned = 
        DeserializeJSON(peopleJSON.FileContent).rows />

<cfset matchingPeople = ArrayNew(1) />

<cfloop array="#peopleReturned#" index="person">
  <cfif FindNoCase(form.searchTerm, person.value.bio) neq 0>
    <cfset ArrayAppend(matchingPeople, person) />
  </cfif>
</cfloop>

What we wind up with is a matchingPeople array that contains only the people who had the search term included in their bio field.

The big caveat here again is that if you have a huge number of documents you can get into trouble on the application side, so make sure and limit what you get back from CouchDB since you'll wind up looping over all of those documents to do your search term matching.

Hope that helps others do some quick and dirty LIKE type queries in CouchDB. If there's a better way to do any of these I'm all ears!

Filed under  //  CFML   CouchDB   NoSQL  
May 9 / 3:20pm

cf.Objective() NoSQL BOF

Heads up that on Friday night of cf.Objective() I'll be facilitating a BOF on using NoSQL databases with CFML, so if you're interested in things like CouchDB (my favorite thing on the planet as of late), MongoDB, or any of the numerous others please come to the BOF!

All skill levels are welcome so come to learn, come to share what you've done, or come to mock crazy people like myself who think the relational model is the biggest hoax ever perpetrated on the technology world and that we should have been using document-based datastores all along. Yes, that statement is meant to incite you to come to the BOF if you think I'm wrong, but I do believe it to a certain extent. ;-)

When I say I'll be facilitating a BOF I mean just that--BOFs are meant to be highly participatory, free-form discussion forums, so while I'm happy to show off what I know about CouchDB, I'd personally love to learn more about some of the other NoSQL databases from people using those, and would love to have some heated discussions about NoSQL in general.

See you Friday night at 8 pm!

Filed under  //  Conferences   CouchDB   NoSQL   Presentations   cfobjective  
Apr 17 / 12:43pm

Accessing and Restarting Desktop CouchDB on Ubuntu/Mint

Recent versions of Ubuntu (and Ubuntu-based distros like LinuxMint) ship with Desktop CouchDB to interact with Ubuntu One and store things like replicated bookmarks in Firefox, contacts in Evolution, and some other data.

If you want to access Futon (CouchDB's web-based admin tool) for this instance of CouchDB you need to do a bit of hunting, but I found this page on freedesktop.org that was very helpful, and I thought I'd document here as well in case I forget this information in the future (which I'm sure I will!).

Accessing Futon

Open a terminal and navigate to ~/.local/share/desktop-couch and open couchdb.html in a browser (e.g. firefox couchdb.html), or navigate to file:///home/YOUR_USERNAME/.local/share/desktop-couch/couchdb.html in your browser (a file URL needs the full path to your home directory). This takes you to a page that will redirect you to Futon after a few seconds, at which point you can see which port CouchDB is running on and what the admin user name is.

If CouchDB Desktop Isn't Running

In my case CouchDB Desktop wasn't running for some reason so I had to follow these steps to get it going again:

  1. Open a terminal and do killall beam.smp and then killall beam (do this as your user, not as root or using sudo). I got 'no process found' errors in both cases but this will make sure all CouchDB Desktop processes have been killed.
  2. Again in a terminal, do rm ~/.config/desktop-couch/desktop-couch.ini
  3. Still in your trusty terminal, do dbus-send --session --dest=org.desktopcouch.CouchDB --print-reply --type=method_call / org.desktopcouch.CouchDB.getPort
    This will restart CouchDB and tell you what port it's running on.
  4. Open the couchdb.html file referenced above and you should be redirected to Futon

Filed under  //  CouchDB   GNU/Linux   LinuxMint   Ubuntu  
Oct 22 / 3:13pm

Grails + CouchDB #s2gx

Scott Davis - thirstyhead.com

NoSQL Databases in General

  • given the number of big companies using them, clearly they're ready to use today
  • time to re-examine our unnatural obsession for relational databases
  • rdbms has been around for 50 years now--well understood, great tooling, lots of information
  • rdbmses are silos
    • still good at what they do, but aren't necessarily well-suited to all data
  • as developers we're being forced to use sql to express something that's crucial to the success of your application
    • not our native language, kind of foreign when it comes down to it
  • we use orm to insulate ourselves from sql
    • express yourself in the native language of your choice instead of in sql

Is ORM State of the Art?

  • really just a bridge
  • why aren't there pure java or groovy datastores?
  • persistence is pretty uninteresting to developers
  • orm is a reasonable bridge, but a rather leaky abstraction as well
  • ted neward: orm is the vietnam of computer science
    • "[ORM] represents a quagmire which starts well, gets more complicated as time passes, and before long entraps its users in a commitment that has no clear demarcation point, no clear win conditions, and no clear exit strategy."

What Drew Me to CouchDB

  • what if i didn't have to bridge technologies anymore?
  • what if i could save my objects in their native format?
    • couchdb is actually a json datastore, but grails makes it trivial to transfer pogo <-> json
  • just need a thin translation layer

NoSQL Solutions

  • Google BigTable
  • mongoDB
  • CouchDB
  • Cassandra
    • "this is the future, but no one believes us"
  • each one of these is a bit different and each has its strengths and weaknesses
  • NoSQL = "not only SQL"
  • don't think of nosql solutions as just another database; truly different way to think about persistence
  • if you think of it as just another database, it'll be the worst database you've ever used
  • need to get out of the mindset of "spreadsheet" type format for data
  • start thinking more about the right tool for the job

CouchDB History

  • starting point was Lotus Notes
    • largely ahead of its time
    • document database
    • not brand-new stuff--ideas and foundation have been around for a very long time
  • Apache project

RDBMS vs. CouchDB

  • rdbms
    • row/column oriented
    • language: sql
    • insert, select, update, delete
  • CouchDB
    • if your data has a more vertical orientation as opposed to horizontal, starts to look more like attachments
    • email is a good example: to, from, body, attachment
    • language: javascript (map/reduce functions)
    • put, get, post, delete (REST)
    • "Django may be built for the Web, but CouchDB is built of the Web." -- Jacob Kaplan-Moss, Django Developer
    • can build entire apps in CouchDB
  • Couch = acronym for "cluster of unreliable commodity hardware"
  • clustering is much more difficult to do in an rdbms--couch was built from the ground up to be massively distributed, clusters out of the box
  • O'Reilly book available -- free online

Using CouchDB With Grails

  • grails has native json support out of the box

import grails.converters.*

class AlbumController {
  def scaffold = true

  def listAsJson = {
    render Album.list() as JSON
  }

  def listAsXml = {
    render Album.list() as XML
  }
}

CouchDB 101

  • json up and down
  • restful interface
  • no drivers since it's just http
  • written in erlang
    • incredibly fast
    • designed for scalability and parallel processing

Installing CouchDB

  • sudo apt-get install couchdb
  • windows installer available

Kicking the Tires

  • ping
    • curl http://localhost:5984
      {"couchdb":"Welcome","version":"1.0.1"}
    • can also hit this in a browser, but of course can't do a POST from a URL in a browser
  • get databases
  • create a database
  • delete a database
  • uses standard HTTP response codes, e.g. a 201 response code for a database create
  • web UI available - "Futon"
  • create a document
  • create a document from a file
  • URIs for documents are essentially your primary key--unique way of representing the document
  • don't have to create schemas -- just start throwing documents at the database
  • documents get etags so they're very cache friendly
  • documents also get revisions--keeps track of multiple versions of the document
    • have to provide version number when updating
    • versioning numbers are revision number (integer), then -, then md5 hash of the document itself
    • can explicitly compact the database to get rid of old versions to reduce size of database
  • couch prefers uuids for the ids, but you can use anything you want
  • get UUID(s) from couch
  • to update a document, you'll get the latest version of the document, then do the update, then pass your changes back to couchdb which includes the revision number
  • one of the major things couchdb gives you since it's document based is that the data is accurate at that point in time
    • if the data changes in the future, in an rdbms the old document would get the new data

CouchDB With Grails

  • domain class--id and _rev as properties
  • can add couchdb stuff to Config.groovy to do stuff like create-drop for couchdb databases
  • add stuff to BootStrap.groovy
  • showing CouchDBService that has convenience methods around a lot of the URL calls to couch

Map/Reduce

  • in sql you say select firstname, lastname from foo (this is map) where state = 'NE' (this is reduce)
  • map and reduce are stored in 2 separate javascript functions
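
To give a feel for what those two JavaScript functions look like, here's a minimal sketch of a hypothetical view that counts people per state. This isn't from the talk: the document fields, the emit stand-in, and the little in-memory grouping loop are all assumptions, there only so the functions can be run outside of CouchDB (which normally supplies emit and does the grouping itself).

```javascript
// emit() is provided by CouchDB's view server; this stand-in just
// collects the emitted key/value pairs.
var pairs = [];
function emit(key, value) {
  pairs.push({ key: key, value: value });
}

// Map: called once per document, emits zero or more key/value pairs.
function map(doc) {
  if (doc.state) {
    emit(doc.state, 1);
  }
}

// Reduce: combines the values emitted for a key into one result.
// Summing also works when CouchDB re-reduces partial sums.
function reduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Hypothetical sample documents
[
  { firstname: 'Ann', state: 'NE' },
  { firstname: 'Bob', state: 'CO' },
  { firstname: 'Cy', state: 'NE' }
].forEach(map);

// Tiny in-memory stand-in for querying the view grouped by key
var grouped = {};
pairs.forEach(function (p) {
  (grouped[p.key] = grouped[p.key] || []).push(p.value);
});
var counts = {};
Object.keys(grouped).forEach(function (k) {
  counts[k] = reduce(null, grouped[k], false);
});
console.log(counts);
```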

Filed under  //  CouchDB   Grails   NoSQL   ORM   SpringOne  
Sep 14 / 10:38am

Resolving "Connection Refused" Error With CouchDB 1.0.1 on Windows Server

If you run into a "Connection Refused" error when trying to access CouchDB from somewhere other than localhost, luckily it's an easy fix.

By default CouchDB is set to bind only to 127.0.0.1, which I suppose is nice for security reasons since when you first install CouchDB it's wide open.

To fix this, open {couchdb_install_dir}\etc\couchdb\default.ini and in the [httpd] section, change the bind_address value from 127.0.0.1 to 0.0.0.0 so it will be accessible from any IP.

Save the file, restart CouchDB, and you should be golden.

Filed under  //  CouchDB   Windows  
Jul 14 / 9:09am

CouchDB NoSQL Database Ready for Production Use - NYTimes.com

Two major enhancements to CouchDB make it 1.0-worthy, said Chris Anderson, the chief financial officer and a founder of Couchio. One is the fact that performance of the software has been greatly improved. The other is its ability to work on Microsoft Windows machines. A lot of work was also put into stabilization of the software.

Performance-wise, the new version has demonstrated a 300 percent increase in speed in reads and writes, as judged by internal benchmarking tests done by Couchio. The performance improvements were gained by optimizing the code, Anderson said.

This is also the first release of CouchDB that can fully run on Windows computers, either the servers or desktops, Anderson said. Previous versions could run on Linux, and there is a version being developed for the Google Android smartphone operating system.

I'm really stoked that I may finally get to use CouchDB on a "real" application soon, so this is great timing. Congrats to the Couchio guys and the fantastic community around CouchDB.

Filed under  //  CouchDB   NoSQL  
Mar 23 / 3:50pm

CouchDB basics for PHP developers

However, every once in a while, you work on a project where you probably think to yourself, "Why am I doing all this work?" The project you're working on contains very simple bits of data or data that's difficult to predict--you might get different data fields on different days or even from transaction to transaction. If you were to create a schema to predict what's coming down the pike at you, you'd end up with tables that have lots of empty fields or lots of mapping tables.

This is an excellent intro to CouchDB even if you aren't a PHP developer.

Filed under  //  CouchDB   NoSQL  
Nov 6 / 8:04am

The "NoSQL" Discussion has Nothing to Do With SQL | Communications of the ACM

Recently, there has been a lot of buzz about "No SQL" databases. In fact there are at least two conferences on the topic in 2009, one on each coast. Seemingly this buzz comes from people who are proponents of:

• document-style stores in which a database record consists of a collection of (key, value) pairs plus a payload. Examples of this class of system include CouchDB and MongoDB, and we call such systems document stores for simplicity

• key-value stores whose records consist of (key, payload) pairs. Usually, these are implemented by distributed hash tables (DHTs), and we call these key-value stores for simplicity. Examples include Memcachedb and Dynamo.

In either case, one usually gets a low-level record-at-a-time DBMS interface, instead of SQL. Hence, this group identifies itself as advocating "No SQL."

Great first part of a two-part series about data storage and how "NoSQL" doesn't at all get at what things like CouchDB, MongoDB, etc. are all about.

Filed under  //  CouchDB   Databases