Matt Woodward’s posterous

Matt Woodward’s posterous

Matthew Woodward  //  * CFML, Grails, and Java Developer
* Principal IT Specialist, US Senate
* Open BlueDragon Steering Committee Member
* All-Around Geek

Jul 14 / 9:09am

CouchDB NoSQL Database Ready for Production Use - NYTimes.com

Two major enhancements to CouchDB make it 1.0-worthy, said Chris Anderson, the chief financial officer and a founder of Couchio. One is the fact that performance of the software has been greatly improved. The other is its ability to work on Microsoft Windows machines. A lot of work was also put into stabilization of the software.

Performance-wise, the new version has demonstrated a 300 percent increase in speed in reads and writes, as judged by internal benchmarking tests done by Couchio. The performance improvements were gained by optimizing the code, Anderson said.

This is also the first release of CouchDB that can fully run on Windows computers, either the servers or desktops, Anderson said. Previous versions could run on Linux, and there is a version being developed for the Google Android smartphone operating system.

I'm really stoked that I may finally get to use CouchDB on a "real" application soon, so this is great timing. Congrats to the Couchio guys and the fantastic community around CouchDB.

Filed under // CouchDB NoSQL

Comments (2)

Mar 23 / 3:50pm

CouchDB basics for PHP developers

However, every once in a while, you work on a project where you probably think to yourself, "Why am I doing all this work?" The project you're working on contains very simple bits of data or data that's difficult to predict — you might get different data fields on different days or even from transaction to transaction. If you were to create a schema to predict what's coming down the pike at you, you'd end up with tables that have lots of empty fields or lots of mapping tables.

This is an excellent intro to CouchDB even if you aren't a PHP developer.

Filed under // CouchDB NoSQL

Comments (0)

Nov 6 / 8:04am

The "NoSQL" Discussion has Nothing to Do With SQL | Communications of the ACM

Recently, there has been a lot of buzz about “No SQL” databases. In fact there are at least two conferences on the topic in 2009, one on each coast. Seemingly this buzz comes from people who are proponents of:

• document-style stores in which a database record consists of a collection of (key, value) pairs plus a payload. Examples of this class of system include CouchDB and MongoDB, and we call such systems document stores for simplicity

• key-value stores whose records consist of (key, payload) pairs. Usually, these are implemented by distributed hash tables (DHTs), and we call these key-value stores for simplicity. Examples include Memcachedb and Dynamo.

In either case, one usually gets a low-level record-at-a-time DBMS interface, instead of SQL. Hence, this group identifies itself as advocating “No SQL.”

Great first part of a two-part series about data storage and how "NoSQL" doesn't at all get at what things like CouchDB, MongoDB, etc. are all about.

Filed under // CouchDB Databases

Comments (0)

Nov 2 / 1:48pm

The Raindrop Backend Extension Model

One more Raindrop/CouchDB video--this app is really showing the power and benefits of using a schemaless database. Perfect fit for something like Raindrop.

Filed under // CouchDB Raindrop

Comments (0)

Nov 2 / 1:43pm

The Raindrop Schema Model

Follow-up to the previous Raindrop/CouchDB video--this shows how Raindrop uses CouchDB. Nice to see an increasing number of real-world uses of CouchDB out in the wild.

Filed under // CouchDB Raindrop

Comments (0)

Nov 2 / 1:41pm

Raindrop's Introduction to CouchDB

A friend of mine pointed out that Mozilla Raindrop is using CouchDB for persistence, and a member of the Raindrop team produced a really nice CouchDB intro video. There's another video about how Raindrop uses CouchDB--I'll post that one as well.

Filed under // CouchDB Raindrop

Comments (0)

Oct 29 / 2:37pm

Damien Katz: Koala on the loose!

Ubuntu 9.10 Karmic Koala has just been released. This is big news as this version includes Apache CouchDB, used as a replicable database by desktop apps. This means CouchDB will be on over 10 million desktops. Nice :)

WOW! Didn't realize CouchDB was standard in Ubuntu now. Very cool. I'll have to look more into how exactly it's being used. Congrats to the CouchDB team!

Filed under // CouchDB Linux Ubuntu

Comments (0)

Oct 26 / 4:41pm

Massive CouchDB Brain Dump

The following is a semi-unorganized brain dump of everything of interest I've come across while learning the incredibly cool CouchDB document-oriented database system. In this brain dump I pull things from many different resources including my own head, so there may be literal quotes from some of these resources without inline attributions. For that I apologize, but rest assured I'm not trying to plaigarize anyone or take credit where it's not due; I was just merely taking notes as I perused a lot of different resources and organized them in a way that made sense to me. I do have a complete list of all of the resources I used at the end to let you explore on your own. Again, my apologies to the creators of the resources from which I pull for not attributing inline.

I'll be presenting CouchDB to the ColdFusion Meetup on December 17 (Charlie did a great job of booking a full schedule through the end of the year) so don't miss it!

CouchDB: General Concepts

  • document-oriented
  • schemaless
  • NOT RELATIONAL
  • JSON-based
  • REST-based
  • MapReduce
  • basically throw out everything you know about databases and you'll pick this up a lot faster
  • Calls are made to the database via HTTP
    • Yes, that means via a browser, curl, cfhttp ... anything that talks HTTP
  • Responses come back as JSON
  • Lock-free design--reads don't have to wait for writes or other reads
  • Why are these good things?
    • More like how data works in the real world
      • e.g. business cards--if one has a fax and another doesn't, in an RDBMS you have to have a fax field that's going to be null for anyone who doesn't have a fax
      • with CouchDB, you can have one record with a fax field, and another with no fax field, but they're both considered business cards since every document in CouchDB is 100% self-contained
    • Simple
      • some of this stuff is so simple you'll be amazed there isn't more to it
    • Fast
      • Push/Pull of JSON data over HTTP
      • No messy, time-consuming joins between tables--a document contains all its data
    • Scalable
    • Takes the object-relational mismatch out of the picture to a certain extent
    • It works like the web does
  • History of CouchDB
    • development started in 2004
    • originally written in C++, now in Erlang
    • Damien Katz quit his job and self-funded development full-time for 2 years
      • was formerly with IBM working on Lotus Notes, also a brief stint at MySQL
    • Damien Katz is now a full-time employee at IBM and gets paid to work on CouchDB full time
    • CouchDB is a top-level Apache project and is released under the Apache 2.0 license
  • CouchDB's motto: RELAX.
    • we shouldn't have to worry about data so much


Why Use CouchDB?

  • throws out the relational model and looks at what matters with data in the majority of applications
  • vastly simplifies data modeling and interaction with data
  • extremely flexible since there are no preset schemas
  • in relational databases, data does get to a point where it's unwieldy and slow to access
  • relational model is hard to scale, and doesn't do so very naturally or quickly
  • CouchDB offers ...
    • robust, dead simple replication to any number of servers
    • bi-directional conflict detection and resolution
    • fantastic performance on huge databases
      • number of records in a database has very little impact on performance
    • fantastic scalability
      • Erlang was designed for real-time telcom apps in the 1980s, so it's ideal for high scalability and highly concurrent apps like database servers
      • early testing with CouchDB shows it can handle 20,000 concurrent connections with no problems, and they haven't even done and performance profiling yet
        • lead developer said in an interview that using conventional threading in C++ you'd be lucky to handle 500 concurrent connections
      • Erlang also can help with multi-machine scalability, failover, etc. but CouchDB is not taking advantage of any of this yet
  • CouchDB speaks the language of the web
    • REST, HTTP, and JSON are how CouchDB works natively
  • already gaining a huge amount of traction and becoming very popular


Is the Relational Model Dead?

  • there is an increasing indication that the relational model will begin being seen as a solution, not the solution
  • Map/Reduce is simply a better model for dealing with large datasets and taking advantage of parallel processing
  • "Map/Reduce will kill every traditional data warehousing vendor in the market. Those who adapt to it as a design/deployment pattern will survive, the rest won't."
    • Might think this came from a non-relational database vendor, but it's actually from Brian Aker, one of the original authors of MySQL and currently working on the Drizzle (http://drizzle.org/wiki/Main_Page) fork of MySQL
  • document-based databases like CouchDB scale far better and easier than relational databases do
    • both Amazon and Google came up with their own database solutions for their cloud computing platforms as opposed to using a traditional RDBMS--this should tell you something
  • Better and more natural fit for applications


More on "Better Fit for Applications"

  • self-contained documents
    • no more taking a real world construct and deconstructing it into a relatonal model
  • flexible schemas
    • two documents can be of the same type and not contain the same fields--don't have to have a bunch of nulls involved, worry about foreign keys, etc. etc. since every document is self-contained
    • if you need a change in your schema, it's dead simple to do--just start using the new schema
      • if you don't care that the old documents don't have the new field, you don't have to worry about them
  • speaks our language as application developers
    • REST and JSON--doesn't get much simpler than that
  • since it's all web based you take advantage of the following at the database level
    • can handle more traffic since connections aren't left open
    • clustering, proxying, caching, security, etc. behaves just as it would with an HTML document
  • Creator of CouchDB said one of the goals was to have users feel like "you could touch your data ... like it was right there in your hands"
    • eliminating all the layers between your application and your data


Relational Model vs. Document-Based aka "Key/Value Store" Databases

  • relational diagram
  • key/value diagram
  • CouchDB pros
    • ideally suited for cloud computing
    • more natural fit with the code we write -- no ORM mismatch nonsense to worry about
  • CouchDB cons
    • relational databases enforce integrity at the database level
    • schemaless nature of CouchDB means your data integrity is that the APPLICATION level
      • bugs in application code using RDBMS don't lead to data integrity issues
      • bugs in application code using CouchDB CAN lead to data integrity issues
      • really this just puts this concern in a different place, but it's something to be aware of
    • no shared standards between key/value database vendors
      • much easier to move from SQL Server to MySQL than it would be to move from CouchDB to Amazon or Google
  • other concerns with cloud databases ...
    • limtations on analytics -- e.g. Amazon queries cannot take longer than 5 seconds to run
    • limitations on data returned -- e.g. Google queries cannot return more than 1000 rows
  • from an application development standpoint, in my experience thus far this does bring your data repository a bit more into the realm of your application
    • again, this isn't SQL, so your code isn't running queries and dealing with query objects; instead it's making HTTP calls and dealing with JSON
    • less friction between your app and your data, but be aware that it's a bit of a whole new world when you're working with CouchDB


Should You Consider CouchDB?

  • yes if you ...
    • have tables with lots of columns of which you typically only use/display a few
    • have lots of joins in your queries
    • are serializing JSON or XML data into single columns in your relational database
    • have data that is more heirarchical or flat than it is relational
    • have systems that require frequent schema changes
    • are reaching the performance capacity of a single database server and need to scale out
    • have an amount of data that is difficult for a single server to hold
    • have background processes running on your database that impact performance of the database as a whole
  • the nice thing about CouchDB is that it's highly and easily scalable by its very nature
    • but if you don't need scalability now, you don't have to worry about it; you just get it when you need it practically for free
  • Why not just dump JSON data into a relational database?
    • because RDBMSes don't know anything about JSON, so you don't get any of the huge efficiency and functionality advantages you get with CouchDB


Other Document-Based Databases


Building/Installing CouchDB

  • basic requirements: Erlang, SpiderMonkey (JavaScript engine), other miscellany
  • on Linux you'll need to install some prerequisites/dependencies if you don't have them; here's the list for Ubuntu ...
    • sudo apt-get install subversion
    • sudo apt-get install libtool
    • sudo apt-get install automake
    • sudo apt-get install libmozjs-dev
    • sudo apt-get install libicu-dev
    • sudo apt-get install curl
    • sudo apt-get install libcurl4-gnutls-dev
    • sudo apt-get install erlang-dev
    • sudo apt-get install erlang-nox
    • sudo apt-get install openssl
    • sudo apt-get install libssl-dev
      • double-check you have openssl and libssl installed; otherwise you may not get an error until you first try to run CouchDB
  • alternatively you can try ...
  • then grab the code and build it (do this from your home directory or wherever you like)
    • svn co http://svn.apache.org/repos/asf/couchdb/trunk couchdb
    • cd couchdb
    • sudo ./bootstrap
      • You should see "You have bootstrapped Apache CouchDB, time to relax." If not, fix any dependency issues it lists (the error messages are very explicit).
    • sudo ./configure
      • You should see "You have configured Apache CouchDB, time to relax."
    • sudo make
    • sudo make install
      • if you don't get any errors with make or make install you should be able to launch CouchDB!
  • Check http://wiki.apache.org/couchdb/Installing_on_Windows for information about installing on Windows. Haven't tried this myself, likely won't, so best of luck.


Running CouchDB

  • on Linux the install process puts the couchdb script in your path, so you can open a terminal and type sudo couchdb
    • you should see:
      Apache CouchDB has started. Time to relax.
      [info] [<version_number>] Apache CouchDB has started on http://127.0.0.1:5984/
    • If you see an error along the lines of {"init terminating in do_boot",{undef,[{crypto,start,[]} ... that means you don't have erlang-nox and/or libssl-dev installed, so you'll have to go back through the steps above once you have those dependencies resolved.
    • If you have other errors when trying to start CouchDB check http://wiki.apache.org/couchdb/Error_messages


Interacting with CouchDB with CURL

  • some basic examples
    • curl -X GET http://localhost:5984/
      • returns basic server info
    • curl -X GET http://localhost:5984/_all_dbs
      • returns list of all databases on the server
    • curl -X PUT http://localhost:5984/contacts
      • creates a new database called contacts
    • curl -X PUT http://localhost:5984/contacts/6e1295ed6c29495e54cc05947f18c8af -d '{"firstName":"Matt", "lastName":"Woodward", "email":"matt@mattwoodward.com"}'
      • creates a new document in the contacts database; the string after the database name is a UUID
    • curl -vX PUT http://localhost:5984/contacts/6e1295ed6c29495e54cc05947f18c8af/headshot.jpg?rev=2-2739352689 -d@headshot.jpg -H "Content-Type: image/jpg"
      • attaches a headshot jpeg to the document with the ID provided
    • curl -X GET http://localhost:5984/_uuids
      • CouchDB returns a new UUID; can add ?count=N to get back N UUIDs if you need more than one
    • curl -X GET http://localhost:5984/contacts/6e1295ed6c29495e54cc05947f18c8af
      • returns the document with the UUID provided
    • curl -X DELETE http://localhost:5984/contacts/6e1295ed6c29495e54cc05947f18c8af?rev=2-212344
      • deletes the document with the ID provided; note that you must provide the latest revision number for the document in order for the delete to succeed
    • curl -X DELETE http://localhost:5984/contacts
      • deletes the contacts database
    • curl -X POST http://localhost:5984/_replicate -d '{"source":"contacts","target":"contacts-replica"}'
      • replicates the contacts database to the contacts-replica database
  • of course since this is just HTTP, you can use CURL's -v flag to get a verbose listing of everything CouchDB is doing on each request
  • performing updates is a bit different
    • if you do a PUT of a document with the same ID but don't include a revision number, the update will fail
    • you have to include the latest revision number in CouchDB in updates for them to work
    • what this means in practice is that you'll pull the document you want to update back, update the JSON (or update the data in your application code), and then do a put of the updated document with the new data since this will contain the most recent revision in CouchDB
    • the updated document gets a new revision number, and the original document is retained in CouchDB as a previous revision


Versioning of Documents

  • CouchDB uses a multi-version concurrency control (MVCC) system
  • each document in CouchDB gets a revision number
  • previous versions of documents are saved in CouchDB
    • BUT ... unlike a version control system, there is no guarantee how long the previous versions will be retained
    • you can tell CouchDB you want to retain the previous versions of a document if you need to
  • remember that all communication with CouchDB is done over HTTP
    • HTTP is stateless--you open a connection to CouchDB, make a request, then the connection is terminated
      • this is good because it means CouchDB can handle a lot of traffic since connections are short-lived
  • If you're familiar with Etags in the HTTP world, CouchDB uses its revision numbers as Etags in HTTP responses
    • Etags are very useful for caching
    • since all documents in CouchDB are really just resources in the HTTP/REST sense, your data behaves like any other HTML resource


Futon: CouchDB's Web-Based Interface

  • browse to http://localhost:5984/_utils for the web-based interface to CouchDB
  • handy way of perusing your documents, managing datbases, etc.
  • definitely handy for creating design documents and views
    • mini editor for creating temporary and permanent views--can execute temporary views from within the editor
  • can kick off replication and a ton of other stuff from Futon


Creating a Document

  • new documents have _id and _rev fields added automatically
  • documents are versioned much like code is in SVN, so every version of every document in the db is stored
  • click "add field" in Futon to add a new field to a document
  • double-click value (default is null) to edit
  • values must be JSON valid data
    • strings have to have quotes around them, e.g. "hello" not just hello
    • valid datatypes are string, number, boolean, list, and key/value dictionaries
  • you can do a "view source" on a document from Futon to see the JSON version of the document
  • as you update a document, the version number will change with each revision
    • if another process changes the document before you save your changes, a conflict will arise
  • CouchDB has no concept of "types"
    • e.g. in a blog application we would think of "posts" and "comments" as types
    • remember that CouchDB is schemaless, so there is no inherent structure to documents contained within the database itself
    • common to use a type field on a document containing a string that defines the type
      • makes it easy to write a view that pulls back specific document types
      • CouchDB does NOT CARE what field you use to define type--you can call this anything because again, CouchDB has no concept of document types
    • remember also that even if you define a document as a type, it does NOT have to literally match the structure of other documents with that same type
      • e.g. music library--could define "album" as a type, and if one album has a year field and another doesn't, they're both still "album" types since we defined the type explicitly
      • BUT, if you do want to require a specific structure for a document type, you can do that with validation functions and, e.g., reject an addition or update of an album to a music library if it didn't contain a year field
    • also handy to infer type based on fields for more flexibility
      • e.g. in a blog app we could use if (doc.title && doc.body) and assume if those fields are present that this is a blog post as opposed to a comment


How Documents Are Not Like Database Records

  • self-contained--no joins across table to put together a single record
    • documents in CouchDB map directly to an object instance in your application
  • typically documents will automatically have authors and publish dates associated with them
    • very easy to publish documents of any type in the future
    • if you create user accounts for CouchDB it automatically keeps track of who created and modified records
  • don't break documents into smaller units than you need to!
    • a blog post will have an author--don't have the author be a separate document
  • JSON document format
    • CouchDB documents all have _id and _rev fields
      • _id can be anything, so long as its unique--UUID, plain old string, whatever
      • _rev is the revision number--this changes with each update to a document
        • to update a document, you have to provide the most recent value of _rev so CouchDB knows you're working with the latest revision
    • if users have been configured, documents will have an author field
    • Do I really look like a guy with a plan? You know what I am? I’m a dog chasing cars. I wouldn’t know what to do with one if I caught it. You know, I just… do things. The mob has plans, the cops have plans, Gordon’s got plans. You know, they’re schemers. Schemers trying to control their little worlds. I’m not a schemer. I try to show the schemers how pathetic their attempts to control things really are.

      The Joker, The Dark Knight (this quote is used in the forthcoming O'Reilly CouchDB book)
    • NEED INFO ABOUT THINGS LIKE ARRAYS, ETC. HERE


Running Queries

  • again, forget everything you know
  • there is no SQL here
  • instead of running queries in the traditional sense, data is filtered using map and reduce functions, which are written in javascript
    • DEFINE MAPREDUCE HERE
  • the map and reduce functions combined create a CouchBD view
  • views are stored as rows sorted by key
    • extremely efficient even for millions of records
  • can create temporary views for testing, but these are rather inefficient, so views that are going to be used regularly are stored in the database as documents
    • once a view is stored in the database as a document, CouchDB indexes behind the scenes for efficiency
  • main points:
    • map functions allow you to sort your data using any key you choose
    • CouchDB is designed to provide extremely fast access to data by key and key range
    • you don't really run queries against CouchDB, you query a view
    • when you query a view, CouchDB runs the map function against every document in the database in which the map function is defined
  • map functions have a single "doc" parameter which is each individual document in your database
  • the emit() function is used to spit out matching documents, and you can specify the fields you want to output
  • if you're querying every document in the database every time, isn't that inefficient?
    • you'd think so, but no
    • CouchDB only runs through all the documents the first time the view is queried
    • as documents change, CouchDB only has to update what's changed
    • everything is stored in a B-Tree, which is very efficient
  • creating multiple views specific to how you want to access the data helps with efficiency
  • to execute a view, you just--surprise--hit a URL over HTTP and get JSON back
    • e.g. http://localhost:5984/database/_design/designdocname/_view/viewname
    • to add a key to this, it's just an argument in the query string of the url, e.g.
      • http://localhost:5984/database/_design/designdocname/_view/viewname?key=value
    • can also retrieve documents by key range
      • http://localhost:5984/database/_design/designdocname/_view/viewname?startkey=startvalue&endkey=endvalue
  • default query engine or "view server" in CouchDB is JavaScript, but you can write your own in any language
    • Remember it's all just HTTP and JSON!
  • MAP FUNCTIONS take a document as an argument and emit key/value pairs
    • Btrees are very efficient--even with lots of documents the tree is "shallow" and it's pre-indexed so searches are very fast
  • REDUCE FUNCTIONS operate on the rows returned by map functions and act as filters on the documents


Replication

  • can replicate local -> local, local -> remote, or remote -> remote from Futon
  • as with everything in CouchDB, this is all HTTP/REST based
    • initial replication may be time consuming
    • subsequent replications are diffs only
    • if you trigger replication from Futon, you have to leave the browser window open!
    • but since all this is HTTP based, easy to set up cron jobs that use curl to do replication
  • a POST to CouchDB containing the source and target of replication is all that's needed to kick off replication
    • CouchDB maintains a session history of replication sessions, again in JSON
  • can replicate among local databases or between databases not on the same physical box
  • CouchDB has automatic conflict detection and resolution
    • remember that documents in the database are versioned, so conflicts are handled quite gracefully
    • documents that are in conflict when a replication occurs get a new _conflict:true attribute added to them
      • one of the two competing documents is given the latest revision number, the other is given a previous revision number
      • these conflicts are also replicated, so all databases will have the same information
  • CouchDB takes the approach of "eventual consistency"
    • traditional RDBMS systems enforce consistency--put consistency above all in replication situations
    • “Each node in a system should be able to make decisions purely based on local state. If you need to do something under high load with failures occurring and you need to reach agreement, you’re lost… If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given.”

      Werner Vogels, Amazon CTO and Vice President
    • consistency between nodes is not guaranteed on writes, but the nodes will eventually be consistent on reads


Design Documents

  • documents that contain application code
  • IDs must start with _design as the ID, e.g. "_design/myapp"
  • possible to write entire apps in HTML/JavaScript, store this code as a design document in CouchDB, and run the entire app from the CouchDB database
    • dynamic code (views and validation) written as JSON and stored as a document in CouchDB
      • MapReduce queries stored in the views field
      • data output CAN be things other than JSON using the show field, e.g. CouchDB can output RSS without any middleware
    • static HTML pages stored as attachments to the design document


Validation

  • validation functions are used to do things like prevent users who aren't logged into an app from performing document updates
  • validation functions are stored in design documents under the validate_doc_update field
    • can only have one validation function per design document
    • but remember you can have multiple design documents per database
  • documents must pass all the validation rules on all design documents in the database in order to be saved
    • the order in which validation functions are executed is arbitrary
  • most common example, since CouchDB is schemaless, is to require that certain fields be included if a document is declared to be of a particular type
    • e.g. require "title" and "body" for a document type of post
      function(newDoc, oldDoc, userCtx) {
         function require(field, message) {
           message = message || "Document must have a " + field;
           if (!newDoc[field]) throw({forbidden : message});
         };

         if (newDoc.type == "post") {
           require("title");
           require("body");
        }
      }


Show Functions

  • since everything in CouchDB is JSON, HTTP, and JavaScript, it works well in any programming environment
  • CouchDB doesn't, however, address things like outputting HTML
  • easy enough to have CFML call CouchDB over HTTP and output the results in HTML
  • CouchDB can, however, generate HTML natively using show functions
  • basic show function
    function(doc, req) {
       return '<h1>' + doc.title + '</h1>';
    }
    • the "return" bit here is sent back to the browser as a HTTP response
  • you can even write full HTML templates that embed CouchDB-specific scripting so you don't have to embed HTML in javascript functions


Attachments

  • Documents in CouchDB, which are JSON, can have file attachments
  • doing an HTTP PUT with a -d@filename.ext flag tells CouchDB to attach the file to the document ID provided in the PUT request
    • as with other updates to documents, you DO need to provide the current revision number of the document to attach a file to the document
    • unlike other updates you do NOT need to provide the data for the document itself in order to add an attachment to it
  • documents may have mutliple attachments
  • attachments are made available as, surprise, HTTP resources
    • http://localhost:5984/contacts/6e1295ed6c29495e54cc05947f18c8af/headshot.jpg would display the headshot.jpg file attached to the document with the ID provided
  • if you pull a document back from CouchDB that has an attachment, the attachment file name and meta information such as type, size, etc. are contained in the JSON with a key of "_attachments"
    • adding ?attachments=true returns the attachments in base64 format as part of the JSON


Applications

  • you can build applications entirely in CouchDB
  • if you replicate your databases to another server, you replicate your app as well
    • means if you replicate to a local instance of CouchDB, you get offline data mode "for free"
  • applications are stored in CouchDB as design documents
  • can use couchapp for developing native CouchDB apps


Security

  • can lock down databases by editing a simple config file
    • by default it's in /usr/local/etc/couchdb/local.ini
    • there's also a default.ini file, but any changes made to default.ini are overwritten when CouchDB is upgraded
  • e.g. adding admin accounts to CouchDB
    • uncomment [admins] section and add authorized users as user = pass for each line
    • when CouchDB is restarted the passwords are hashed so they aren't stored in the config file in plain text
  • remember--this is all just HTTP so you can apply the same HTTP-based security, proxies, reverse proxies, etc. as you would to any web resource
    • e.g. putting a web server in front of CouchDB and using HTTP authentication would be trivial


General Tips/Tricks

  • In your applications, you'll want to create your own UUIDs for document IDs instead of letting CouchDB auto-create them
    • WHY? stated this in book but didn't elaborate
  • Since replication and resynching is so dead simple, easy to replicate to a local DB for offline use, then resynch when back online
  • For bulk conversions of existing databases to CouchDB, couple of performance tips
    • http://www.atypical.net/archive/2009/05/12/couchdb-090-bulk-document-post-performance
    • use the bulk document API instead of looping and doing individual document additions
      http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
    • don't use CouchDB's auto-assigned IDs--increases db size and has a big performance hit during conversion


There Are Some Cons ...

  • it's new--they call it the bleeding edge for a reason
    • stuff WILL change between versions that will break your apps!
  • it's a completely new skillset--there is no sql here
  • views take a long time to build the first time they're saved, but after that they're incredibly fast regardless of the number of documents involved
  • large databases can take up a lot of disk space
    • raw data is one consideration, but the views often take up much more disk space than the data itself
    • this is trading disk space for performance, which is a good tradeoff, so you just need to plan for disk capacity
  • CouchDB does not deal well with relational data
    • that being said, we all likely spend a lot of time dealing with the shortcomings of relational data, specifically how horrendously bad the relational model is at dealing with heirarchical data, so I'm not sure this is a straight con as compare with the relational model
    • recurring theme in the CouchDB literature is DON'T GO AGAINST THE GRAIN! Don't try to force CouchDB to behave like an RDBMS, because it's not.
  • CouchDB does not support transactions well
    • e.g. check to see if a user name is unique, then assign it--no way to isolate this from another simultaneous request in the same way you can with relational database transactions
  • Reads on single documents, and writes in general, are slower than you might be used to with an RDBMS
    • but, the CouchDB model scales much better
  • Have to write all of your "queries" (views) in advance--no on the fly SQL allowed
  • map-reduce not as flexible as sql--sometimes you'll have to return more data than you want or need and process on the application side


Common Use Case That Is a Bit Odd in CouchDB: Unique Constraints

  • e.g. guarantee a unique user name or email address
  • thread here http://markmail.org/thread/qwgqql74b2gg5hc5
  • other than the _id field, CouchDB has no way of guaranteeing uniqueness in any other field
  • you CAN do a check to see if the record already exists, but remember ...
    • This is HTTP. There are no transactions.
    • There is no locking in CouchDB
  • On the other hand, guaranteed uniqueness doesn't scale
  • Some solutions
    • you can use anything as your _id, so use one piece of data that has to be unique AS your _id
    • use a relational table in an RDBMS to store any unique values and put a unique constraint on that field in the RDBMS
      • you could still have problems, but this would reduce the likelihood from slim (without doing this) to extremely slim


Resources

Filed under // CouchDB Databases

Comments (6)

Oct 7 / 7:06am

CouchDB emerging as a top choice for offline Web apps | InfoWorld

We're not trying to build the Ferrari of databases; we're trying to build the Honda Accord of databases and that's a little different sweet spot," Anderson said.

Really nice article about CouchDB that focuses on CouchDB as an application development platform, which makes running applications offline dead simple.

I'm not sure I'll ever get into building entire apps in CouchDB but I absolutely love it as a data storage engine.

Filed under // CouchDB Databases

Comments (0)

Sep 1 / 9:30am

Caveats of Evaluating Databases (Jan Lehnardt)

This is part two in a small series about measuring software performance. There’s a lot of common sense covered, but I feel it necessary to shed some light.

If you haven’t, check out part one.

Say you want to find out what’s behind the buzz of all these new #nosql databases. There’s a large number to choose from today. All options come in varying degrees of maturity and characteristics so it’d be nice to know what solves your problem best. A non-exhaustive list of these databases or storage systems include Memcache[DB], Tokyo Cabinet / Tyrant, Project Voldemort, Scalaris, Dynamite, Redis, Persevere, MongoDB, Solr or my favourite CouchDB. And these are just some of the open source ones.

This article is not a comprehensive comparison of any of the mentioned systems. Instead it tries to give you an idea about what to look for when evaluating a storage system or how to take into perspective evaluations and benchmarks others have done.

We’ll look at some of the technical aspects of data storage systems: Applying common sense when reading benchmarks; b-trees and hashing; speed vs. concurrency; networked systems and their problems; low level data storage (disks’n stuff); and data reliability on single-nodes and multi-node systems.

There are a lot of other reasons to decide for or against a project based on a lot of non-technical criteria, but things like commercial support or a healthy open source community are not part of this article.

Astounding Numbers

From time to time you see some crazy numbers posted to the reddits of the internets that claim fantastic performance.

The (imaginary) SuperfastDB can store 450,000 items per second!.

Wow.

No word on where the items are stored (in memory? on a harddrive? Spindles? Solid State?), what an item is exactly and how big it is, the rest of the hardware this was run on and how to reproduce it.

But boy, 450,000 a second!

My shoes can do 650,000 a second, but you’ve got to figure out what.

Context is as important as reproducibility. The last article here established that finding out that my system and your system come up with different numbers is not much of a help. Any sort of serious test must come with a set of scripts or programs and comprehensive instructions on how the tests were run.

Everything “cool” in computer science has been around for 25+ years. Actual innovation is rare. Advancements in hardware and new combinations of existing solutions make for new stuff coming out each day (that’s a good thing), but the fundamental rules are the same for all. We’re all running von Neumann machines, quicksort is still pretty quick and hashes and b-trees rule the storage world.

Let’s recap.

Hashes & Trees

Hashing revolves around the idea of O(1) lookups. Allocate a number of buckets, create a function that gives you a number of a bucket for any data item you might want to store, make sure no two data items hit the same bucket (or work around that). Runtime characteristics include that you only need to ask your function where to look for or store your data and the allocation of your set of buckets: If you need to store more items than you have buckets, some more work is required which gives you O(N) operations that you can’t ignore in practice.

D5563B63-7B48-4280-A31F-EDB37DB78416.jpg

The other elephant in the room is b-trees. The fundamental idea here is to get to your data in a minimal number of steps traversing a tree because making a step is expensive, but reading your data is very fast comparatively. Steps are expensive because they translate to a head seek (that is the time your spinning hard drive needs to position the reading arm to find the spot to read your data from), but reading from a harddrive once the reading head is in place is fast.

6720EE64-4DFC-4298-B3BA-0145746C6523.jpg

There are a bunch of more interesting lookup structure like R-Trees for spatial queries, but they are mostly used for secondary indexes on top a regular data set that lives in a hash or b-tree.

Concurrency vs. Speed

Concurrency is hard. The devil lies in the details and when briefly looking at things, the details are often overlooked. Suits the devil.

Creating storage systems that assume only one access occurs at a time is relatively easy. If resources are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).

Locking means that the database has to maintain information for everybody who wants to write to a part of the database, and what part it is.

No locking, or optimistic locking or MVCC moves that burden to the person who is trying to write to the database. She must prove that she won’t be overwriting any existing data.

The trade-offs here are a leaner request handing on the server that works well with remote & concurrent clients at the expense of more complexity on the client (the person who wants to store something in our database).

Hybrid approaches are possible too: While MVCC is used internally, the database’s clients can rely on database-side locking (e.g. PostgreSQL or InnoDB).

Networks

Just a quick note: We already talk about client and server here. There is a strong case for embedded databases like SQLite that don’t expose a concurrent user model to the outside. The program that needs an embedded database just includes it.

Another approach to using databases is having a dedicated computer running a database system and sharing it over the network with any number of clients using this database server. They can often be “a bunch of servers” or a cluster. More on that later.

A separate database server (networked or not) will need to spend some time to deal with connections, network failures, unspecified client behaviour and so on. The upside is a piece of infrastructure that can be maintained separately. An embedded database will thus be faster but probably won’t solve all of your problems and it will always be tied to your application.

fsync(): Reliability vs. Speed

When people tell me “SuperfastDB does 450,000 a second!” I ask “How many fsync()s is that?”. Let me explain:

A database system uses operating system services to use any hardware. The operating systems exposes a harddrive through a filesystem. The database systems talks to the filesystem and asks it to store or retrieve data in its behalf. The filesystem then goes ahead and tries to satisfy the database’s requests.

(I’ll not talk about databases that can use raw block devices to store data. They exist but they are not as common as those who use the filsystem.)

The filesystem also tries to be clever – for good reasons. When the database requests a piece of data, the filesystem will not only find that piece and return it, it will also store it in a cache to avoid having to actually talk to the harddrive the next time this piece of data gets requested. When the data changes, the filesystem either removes it from the cache or updates it with the harddrive. It might even go further and only store the new data that comes in with a write request into the cache and rely on a periodic task to write all of the cache back to the drive. Writing a bunch of of pieces at once is more efficient than storing each one on its own.

More efficient equals to faster and faster is good, right? Well, it depends: If all goes well, this approach is a nice one. But you know computers, things will not go well 100% of the time. The failure scenarios are endless, but they boil down to the question: “What happens when your machine dies and you have data that has only been written to memory?” — The answer isn’t too hard: That data is lost. If there is a delay between a write request finishing and data being written (or “flushed”) to disk any data that has been “written” during the delay period is subject to loss.

There are cases where this is not a problem; in other cases it is. A developer should have the chance to decide. (Note that even your hardware could be lying to you about having stored data, but I’ll punt on this one, get proper hardware).

So, flushing to disk needs to happen before you can rest assured your data has been stored. Your operating system has an API call that forces the filesystem to write its cache to disk. It is called fsync() (on UNIX systems) and it is an expensive operation. You can only do so many fsync()s in a second and it is not a great many.

The 450,000 items were most likely just written to memory and not to disk.

Space & In-Place

When writing files to disk (at the end of the day, your data ends up in one file or another on the filesystem) that represents what lives in a database, there are multiple options to handle updates.

An update is a change to your data item, for example, a new phone number. The intuitive way to handle this is to go and find the old phone number in the file, and overwrite it with the new number. Easy.

There are several problems with this approach: What to do if the new phone number is longer than the old one (say you added an international calling prefix)? The new number needs to be written to a different place and the change in location must be recorded. Not too big of an issue.

Back to failure scenarios: Again, the reasons can be manifold, but what happens when we’ve (over-)written the first 4 digits of the old with the new number and then the server dies, power goes away or the database server crashes? The next time you want to read the phone number you get a mix of the old and the new one (if you are lucky) and you don’t exactly know that this is the case and which parts are missing. Your database file is inconsistent and you need to run an integrity check to find missing bits and correct half-written bytes. In the worst case that means scanning your entire database file a few times before you resolved all inconstancies. If you have a lot of data, that can take days.

To solve this, you always write the new phone number to a new place in the database file and only when it has been fsync()ed to disk, you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file can end up an inconsistent state and after a crash you are back online without an integrity check.

The trade-off for consistency is write-speed (remember fsync()s are expensive) for consistency-check-speed after a failure.

A nice bonus is that if the “new place in the database” is the end of the file, you keep your disk-drive head busy with writing data to disk instead of seeking all over the place (remember: seeks are expensive).

Distribution, Sharding & Resharding

So far, we’ve been looking at scenarios that involve a single database. We learned a great deal (I hope), but in reality we often deal with more than one database. The simplest reason to have two databases is for redundancy. Failures can bring down your database temporarily or even permanently. If it is a temporary issue, waiting a bit (or a bit longer) to get up and running again might be an option, but often, an application or service should be available at all times. A fatal failure where a database server is lost beyond repair, your data is gone if you haven’t stored it in a second place.

“I’ll just make two copies, easy!”. Yup easy, until you look at the details (that damn devil again!).

It’s all about failures again. Consider a single read request. A client connects to a server and asks for a data item. The server looks it up and returns the data to the client. All is well. At any point things can go wrong. The network connection can drop (or slow down so much that client or server assume it dropped), the client can disappear (because of a network failure or crash) as can the server. Clients, servers and the protocols they speak need to be built around the assumption that any of these things (and many more) can go wrong. If any parts is not designed to handle error cases, your system will do funny things, but it won’t reliably store and manage your data.

Add complexity: With each write target (store in two places) the possibility of error and the need for proper error handling grows quadratically. When evaluating a distributed storage system, looking at how errors are handled is vital.

Another reason to distribute data among multiple servers is capacity. The three metrics of interest here are read requests, write requests and data. If you have more requests or data than a single machine can handle, you need to move to multiple machines. Each metric calls for different strategies, but they often go along with each other. The need for fault tolerance that I discussed above needs to be considered alongside.

Growing read capacity is relatively easy once you covered the base case where the source for reading data might not be the same as the the target for writing data and that there can be a mismatch (cf. eventual consistency).

Distributing writes and data works by designating two machines with 50% of the operations. A clever intermediate, a proxy server for example, decides which request goes where and all is well, we can store twice as much and we can store at twice the speed. When we need to grow bigger yet, we add another server and tell the proxy server to distribute the load equally among them. Adding a proxy for distribution introduces a single point of failure and you don’t want these; there’s added complexity with this approach.

resharding.png

The diagram shows that there is another step needed that wasn’t included in the above description. The new “node” needs to have a copy of all data items that are assigned to him and are currently living on the two existing nodes. The process of moving data items to new nodes is called resharding and needs to happen every time a new node is added.

Resharding can be an expensive operation if you have a lot of data. Techniques like consistent hashing help with minimising the amount of items that need to move. If you are looking at a sharding database, you want to understand how the sharding is performed and if you like the trade-offs.

CAP Theorem

The CAP Theorem states that out of consistency, availability and partition tolerance, a system can choose to support two at any given moment, but never three.

cap.png

Consistency guarantees that all clients that talk to cluster of nodes will always get to read the same data. Write operations are atomic on all nodes.

Availability guarantees that in any (reasonable) failure scenario, clients are still able to access their data.

Partition tolerance guarantees that when nodes in the cluster lose their network connection and two or more completely separated sub-clusters emerge, the system will still be able to store and retrieve data.

Please Talk! (To Developers)

If you are aiming for a comparative benchmark of two or more systems, you should run your procedure by they authors. I found developers are happy to help out with benchmarks by clearing up misconceptions or sharing tricks to speed things up (which you can choose to ignore, if you are looking for out-of-the box comparison, but this is rarely useful).

Filed under // CouchDB Databases

Comments (0)