Thursday, June 3, 2010

Open Source Bridge - Relational vs. Non-Relational

Josh Berkus, PostgreSQL Experts Inc.
  • overview focused on choosing what type of database you need vs. investigation of any specific database
  • up until a few years ago there were only a handful of options for open source databases
    • most were sql/relational
    • a few written in java
    • only really exciting thing going on in the relational world is postgres vs mysql
  • today there are many more open source databases
    • as many as 5 dozen now?
  • databases for lots of different purposes, but lots of people want to lump a lot of the new ones under the "nosql movement" label
    • not so fond of this term
    • has implication that every database that doesn't have a sql interface is more or less identical
    • all non-relational databases aren't the same
    • have graph, document, key-value, distributed, hierarchical ... quite different from one another
    • some of the non-relational databases have sql interfaces
  • all relational databases aren't the same either
    • embedded, oltp, mpp, streaming, c-store ...
  • mythbusting
    • "revolutionary" is bandied about a lot but database technology goes back a long way
    • not really any new database designs in terms of fundamental architecture
    • are new implementations and combinations of design
    • last "new" thing was map-reduce in 2002
    • even couchdb is largely similar to Pick, which was created in 1965
    • when looking at new databases, don't look for revolutionary concepts, look for good implementations
    • what's going on right now is actually a renaissance of non-relational databases
  • myth: "non-relational databases are toys"
    • google - bigtable
    • amazon - dynamo
    • facebook - memcached
    • us veterans administration - pick, cache
  • myth: "relational databases will become obsolete"
    • xml databases were supposed to replace rdbms ca. 2001 -- didn't happen
    • rdbmses evolved to include xml functionality
    • one of the things we'll see out of the current non-relational innovation is that some of the implementations will hybridize with one another
  • myth: "relational databases are for when you need ACID transactions"
    • transactions != relational -- orthagonal features
    • robust transactions without relationality: berkelydb, amazon dynamo
    • sql without transactions: mysql isam, ms access
  • myth: "users are adopting nosql for web-scale performance"
    • sometimes it is, sometimes it isn't
    • performance test done by myyearbook.com
      • benchmark of key/value storage and retrieval
      • only real difference in performance is between databases that guarantee durability and those that don't
    • horizontal scalability
      • some non-relational databases are built for horizontal scalability and some aren't
      • complexity of implementation rises with the ability to scale out to a massive number of nodes
  • myth: "one ring theory of database selection"
    • "what's the best database to use?" - wrong question
    • don't need to use only one database
    • choose the db that meets your applicaton's goals, or use more than one together
    • use a hybrid
      • mysql ndb
      • postgresql hstore
      • hadoopdb
  • but what about choosing between relational and non-relational?
  • relational oltp databases
    • transactions: more mature support
    • constraints: enforce data rules absolutely
    • consistency: enforce structure 100%
    • complex reporting: keep management happy!
    • vertical scaling (but not horizontal)
  • sql vs. no sql--sql promotes ...
    • portability
    • managed changes over time (ddl)
    • multi-application access
    • many mature tools
    • but, sql is a full programming language and you have to learn how to use it
  • no sql promotes ...
    • programmers as dbas
    • no impedance mismatch
    • fast interfaces
    • fast development and deployment
    • but, man involve learning complex proprietary APIs
      • in some cases not easier than sql
  • main reason to use sql-relational databases
    • "immortal data"
    • your data has a life independent of this specific application implementation
    • important to be able to access data accurately and consistently forever
  • how DO i choose?
    • define the problem you're trying to solve
      • what is it that my application wants to do with this data and how does it want to do it?
      • i need a database for my blog, i need to add thousands of objects per second on a low-end device, etc.
    • from the definition you create you can define a database shopping list
      • define the features you ACTUALLY need
  • fit the database to the task
  • "I need a database for my blog"
    • use anything!
    • no open source databases that wouldn't support someone's individual blog, flat files would even work
  • "I need my database to unify several applications and keep them consistent"
    • high-end sql-relational database best choice for this
    • "PostgreSQL: It's not a database, it's a development platform"
  • "I need my application to be location-aware" -- geo applications
    • PostGIS - geographic relational database
    • queries across "contains" "near" "closest"
    • complex geometric map objects
    • couchdb spatial and spatialite are now available as well
  • "I need to store 1000s of event objects per second on a piece of embedded hardware"
    • db4object -- embedded key-value store
    • others: berkeleydb, redis, tokyocabinet, mongodb
    • db4object
      • project was german train system -- records data every few milliseconds
      • low-end embedded console computer
      • simple access in native programming language (java, .net)
  • "I need to access 100K objects per second over thousands of connections from the web"
    • memcached - distributed in-memory key-value non-persistent database
    • use: public website
    • typically supplements another database
    • alternatives: redis, kyototyrant, etc.
  • "i need to produce complex summary reports over 2tb of data"
    • luciddb - relational column-store database
    • for reporting and analysts
    • large quantities of data
    • complex olap and analytics
    • used along-side oltp running production apps
  • "I have 100s of govt documents I need to serve on the web and need to mine the data as cheaply as possible"
    • CouchDB
    • storing lots and lots of government documents that didn't have a consistent format (don't know content or structure)
    • used in combination with postgres to keep structured metadata
    • couchdb is also great for mobile apps
  • "I have a social application and I need to know who-knows-who-knows-who-knows-who"
    • surprisingly hard question to answer with a normal db
    • use a graph database -- neo4j is most popular open source one
    • social network website
    • 6 degrees of separation
    • "you may also like"
    • type and degrees of relationship
  • "I get 1000s of 30K bug reports per minute and I need to mine them for trends"
    • used on mozilla firefox crash reports
    • hadoop -- massively parallel datamine
    • hadoop + hbase
      • reports are then put into postgres for viewing
  • conclusion
    • database systems do better at different tasks
      • every database feature is a tradeoff
      • no database can do all things well
      • need to make tradeoff decisions when picking databases
    • relational vs non-relational doesn't matter
      • pick the database(s) for the project or task
Questions
  • how difficult to migrate from something like couchdb to postgres?
    • depends on how much data and in what form
    • since couchdb works with json that's pretty easy
    • if you want to take the document structure out of something like couchdb and put it into a relational model, the decomposition process will be complicated

No comments: