Saturday, August 27, 2011

Retrieving Documents Between Two Dates From CouchDB

I'm working on converting yet another application from using SQL Server to using CouchDB, and this morning I'm working with some announcement documents that are displayed based on their start and end date. There are numerous ways to approach this problem but I thought I'd share what I came up with in case this solution helps others, and also to see if there's maybe another approach I didn't consider.


First, since there is no date datatype in JSON, we've standardized (for better or worse) on storing dates as a string with the format "YYYY/MM/DD HH:MM:SS", e.g. "2011/08/27 09:22:36", so date and time separated by a space, always with leading zeros for single digits, and always using a 24-hour clock. This allows date/time strings to sort properly when they're used as keys, it's easy to split the string using the space if you need either just the date or just the time, and since this application is for my day job the time will always be in Eastern US time so we decided not to care about the timezone offset.


In the data I imported from SQL Server there is a dtStart and a dtEnd field so I just converted the SQL Server dates to our preferred CouchDB date format as I imported the data into CouchDB. So far so good.


The next step was to pull these documents from CouchDB based on their dtStart and dtEnd fields, and this is probably obvious but just so it's clear, I need to pull all documents of this type where dtStart <= now, and dtEnd >= now.


As I started creating my view in CouchDB for this, my first thought was to pull all the documents using an array including dtStart and dtEnd as the key. That way when I call the view I could, in theory, use a start and end key to get me the documents in the range of dates that I want.


That approach seems reasonable at first, but when you start trying to put it into practice things get weird rather quickly. This is because what you wind up needing is documents in which the first element of the key array is less than the current date, while the second element of the key array is greater than the current date. Maybe this is just "Saturday morning brain" on my part, but I didn't see a way to include both the start and end date in the key and get where I needed to go.


My next thought was to use only the end date as the key. This gets me a bit closer to what I need since I can at least use a start key to only get documents with an end date >= now, but I'm still faced with having to check the start date at the application level to see if the document is supposed to be displayed.


I'm sure there's some clever way to handle this situation with keys, and part of my reason for posting this is to see how others would approach this, but I messed around with keys for a while and didn't seem to be getting anywhere so I decided to take a different approach.


One of the great things about CouchDB is the fact that you have the full power of JavaScript available in your views. Although JSON doesn't know what a date is, JavaScript certainly does, so I decided that since I needed to pull things based on a specific date range across two fields in my documents the best place to handle that was in the view code itself.


Here's what I came up with for my map function:



var d = new Date();
var curYear = d.getFullYear();
var curMonth = (d.getMonth() + 1).toString();
var curDate = d.getDate().toString();
var curHours = d.getHours().toString();
var curMinutes = d.getMinutes().toString();
var curSeconds = d.getSeconds().toString();

if (curMonth.length == 1) {
  curMonth = '0' + curMonth;
}

if (curDate.length == 1) {
  curDate = '0' + curDate;
}

if (curHours.length == 1) {
  curHours = '0' + curHours;
}

if (curMinutes.length == 1) {
  curMinutes = '0' + curMinutes;
}

if (curSeconds.length == 1) {
  curSeconds = '0' + curSeconds;
}

var dateString = curYear + '/' + curMonth + '/' + curDate + ' ' +
    curHours + ':' + curMinutes + ':' + curSeconds;

if (doc.type == 'announcement' &&
    doc.dtStart <= dateString &&
    doc.dtEnd >= dateString) {
      emit(doc.dtEnd, doc);
}


Now of course you could argue this would all be simpler if I stored the dtStart and dtEnd fields in my documents as milliseconds, because then I could just get the millisecond value of the current date and do a quick numeric comparison instead of all the string formatting and concatenation, and from that perspective you'd be absolutely right. One of the many things I love about CouchDB, however, is the ability to jump into Futon and more directly and easily interact with my data, so keeping the dates human readable is kind of nice. Now I could store both a string and the millisecond value I suppose, but since this did the trick I decided to leave well enough alone.


I'm very curious to hear how others might solve this problem. "You're doing it wrong" information would be quite welcome. ;-)

11 comments:

Nuri Cevik said...

Is it possible to keep dates as dateValue while carriying over from MSSQL to CouchDB and retrieve it with these values so you can compare them? If you need to display them you can convert them to date and time.

radiospiel said...

Wouldn't this make the view non-sideeffect-free? It seems to me you are including a document in the view if its start time is in the past and its end time is in the future *from the moment the view was built*? Wouldn't that make the view contain different entries depending on when it was built?

radiospiel said...

If you want to have a view to find out which documents are valid at a certain point in time, you could probably emit(hourly_timestamp, doc) for each hour between dtStart and dtEnd for the document, and, if needed, let your app remove the first and the last entry, because they might then be false positives.

samdengler said...

@radiospiel is right. Each time that the view is requested, it only executes against the document delta from the last time the view was executed. If the view changes, the map functions are executed against all documents, but that's not the case here.Suppose that the database contains 100 documents, and the view is requested at Wed Sep 14 10:00:00 EDT 2011. The view will contain a subset of the 100 documents in which "Wed Sep 14 10:00:00 EDT 2011" falls within the start and end dates.Now suppose that 10 documents are added in the next 24 hours, and the view is requested tomorrow at Wed Sep 15 10:00:00 EDT 2011. The view will still contain the original subset of 100 documents plus an additional subset of the 10 documents in which "Wed Sep 15 10:00:00 EDT 2011" falls within the start and end dates.The easiest way to prove this is to create a blank database and a view with the following as the map function:function(doc) { emit(doc._id, new Date());}Now create a document and run the view. Create another document and run the view. The view should contain two documents with different timestamps, because the second time that the view was executed it only ran against the delta (the second document in this case).

Matthew Woodward said...

Thanks--this is great feedback! I indeed did not consider the fact that the view won't get updated as I thought it would when the documents change, so I'll have to rethink this. Really appreciate the info as I continue to get deeper into CouchDB.

Matthew Woodward said...

Interesting idea @radiospiel -- I wonder if it'll be simpler to emit the start or end date only that's within a range and then pare it down more in the application. This isn't a massive amount of data so that'd probably work in this case, though of course if it were a ton of data I'd have to approach it differently.Definitely an interesting problem to ponder (for me at least!).

jchris said...

@radiospiel is correct. To do date range queries you need to pick a granularity for queries (day? hour? 15 mins?) and then for long running events emit once per granular period.The alternative is complex query logic and potentially unbounded query time if you had, say, a year long event, in the same database as millions of shorter events.

Matthew Woodward said...

Thanks @jchris -- really appreciate the additional info! My Couch Fu will only get stronger by sharing mistakes such as this one. ;-)

samdengler said...

Another option is to handle complex queries with lucene, which can be configured to listen to couch for changes. There are options as to how you index full documents, selected fields, etc: https://github.com/rnewson/couchdb-lucene

Matthew Woodward said...

Yep, looking into Lucene with Couch on another project so may wind up using that on this project as well.

Matt Woodward said...

Realized I never answered Nuri's original question--yes, it's entirely possible to store the dates however you want. That doesn't address my particular issue though (which I reworked this week so I'll have a follow-up blog post about that), but you can store dates in whatever format you want.They're still strings, of course, and another thing to keep in mind is that while JSON doesn't have a date datatype, there are ISO standards around date formatting that might be wise to follow. We started there and settled on the format we're using because we will never, ever be using anything except US Eastern time and don't care about the timezone offset difference between daylight savings and not, so the slightly simpler format we're using vs. the ISO format works well for us.Big thing is is you want to use the date field as a key in Couch, meaning be able to sort documents based on date, you need a format that will sort correctly. Even something like not using leading zeros will throw your sorting off.Also, date strings are comparable as is too, meaning if you use a format like "YYYY/MM/DD HH:MM:SS" you can do things like if (date1 < date2) and get correct results based on string comparisons which is nice.