Open Source Bridge - Relational vs. Non-Relational
Josh Berkus, PostgreSQL Experts Inc.
- overview focused on choosing what type of database you need vs. investigation of any specific database
- up until a few years ago there were only a handful of options for open source databases
- most were sql/relational
- a few written in java
- only really exciting thing going on in the relational world is postgres vs mysql
- today there are many more open source databases
- as many as 5 dozen now?
- databases for lots of different purposes, but lots of people want to lump a lot of the new ones under the "nosql movement" label
- not so fond of this term
- has implication that every database that doesn't have a sql interface is more or less identical
- all non-relational databases aren't the same
- have graph, document, key-value, distributed, hierarchical ... quite different from one another
- some of the non-relational databases have sql interfaces
- all relational databases aren't the same either
- embedded, oltp, mpp, streaming, c-store ...
- mythbusting
- "revolutionary" is bandied about a lot but database technology goes back a long way
- not really any new database designs in terms of fundamental architecture
- are new implementations and combinations of design
- last "new" thing was map-reduce in 2002
- even couchdb is largely similar to Pick, which was created in 1965
- when looking at new databases, don't look for revolutionary concepts, look for good implementations
- what's going on right now is actually a renaissance of non-relational databases
- myth: "non-relational databases are toys"
- google - bigtable
- amazon - dynamo
- facebook - memcached
- us veterans administration - pick, cache
- myth: "relational databases will become obsolete"
- xml databases were supposed to replace rdbms ca. 2001 -- didn't happen
- rdbmses evolved to include xml functionality
- one of the things we'll see out of the current non-relational innovation is that some of the implementations will hybridize with one another
- myth: "relational databases are for when you need ACID transactions"
- transactions != relational -- orthagonal features
- robust transactions without relationality: berkelydb, amazon dynamo
- sql without transactions: mysql isam, ms access
- myth: "users are adopting nosql for web-scale performance"
- sometimes it is, sometimes it isn't
- performance test done by myyearbook.com
- benchmark of key/value storage and retrieval
- only real difference in performance is between databases that guarantee durability and those that don't
- horizontal scalability
- some non-relational databases are built for horizontal scalability and some aren't
- complexity of implementation rises with the ability to scale out to a massive number of nodes
- myth: "one ring theory of database selection"
- "what's the best database to use?" - wrong question
- don't need to use only one database
- choose the db that meets your applicaton's goals, or use more than one together
- use a hybrid
- mysql ndb
- postgresql hstore
- hadoopdb
- but what about choosing between relational and non-relational?
- relational oltp databases
- transactions: more mature support
- constraints: enforce data rules absolutely
- consistency: enforce structure 100%
- complex reporting: keep management happy!
- vertical scaling (but not horizontal)
- sql vs. no sql--sql promotes ...
- portability
- managed changes over time (ddl)
- multi-application access
- many mature tools
- but, sql is a full programming language and you have to learn how to use it
- no sql promotes ...
- programmers as dbas
- no impedance mismatch
- fast interfaces
- fast development and deployment
- but, man involve learning complex proprietary APIs
- in some cases not easier than sql
- main reason to use sql-relational databases
- "immortal data"
- your data has a life independent of this specific application implementation
- important to be able to access data accurately and consistently forever
- how DO i choose?
- define the problem you're trying to solve
- what is it that my application wants to do with this data and how does it want to do it?
- i need a database for my blog, i need to add thousands of objects per second on a low-end device, etc.
- from the definition you create you can define a database shopping list
- define the features you ACTUALLY need
- define the problem you're trying to solve
- fit the database to the task
- "I need a database for my blog"
- use anything!
- no open source databases that wouldn't support someone's individual blog, flat files would even work
- "I need my database to unify several applications and keep them consistent"
- high-end sql-relational database best choice for this
- "PostgreSQL: It's not a database, it's a development platform"
- "I need my application to be location-aware" -- geo applications
- PostGIS - geographic relational database
- queries across "contains" "near" "closest"
- complex geometric map objects
- couchdb spatial and spatialite are now available as well
- "I need to store 1000s of event objects per second on a piece of embedded hardware"
- db4object -- embedded key-value store
- others: berkeleydb, redis, tokyocabinet, mongodb
- db4object
- project was german train system -- records data every few milliseconds
- low-end embedded console computer
- simple access in native programming language (java, .net)
- "I need to access 100K objects per second over thousands of connections from the web"
- memcached - distributed in-memory key-value non-persistent database
- use: public website
- typically supplements another database
- alternatives: redis, kyototyrant, etc.
- "i need to produce complex summary reports over 2tb of data"
- luciddb - relational column-store database
- for reporting and analysts
- large quantities of data
- complex olap and analytics
- used along-side oltp running production apps
- "I have 100s of govt documents I need to serve on the web and need to mine the data as cheaply as possible"
- CouchDB
- storing lots and lots of government documents that didn't have a consistent format (don't know content or structure)
- used in combination with postgres to keep structured metadata
- couchdb is also great for mobile apps
- "I have a social application and I need to know who-knows-who-knows-who-knows-who"
- surprisingly hard question to answer with a normal db
- use a graph database -- neo4j is most popular open source one
- social network website
- 6 degrees of separation
- "you may also like"
- type and degrees of relationship
- "I get 1000s of 30K bug reports per minute and I need to mine them for trends"
- used on mozilla firefox crash reports
- hadoop -- massively parallel datamine
- hadoop + hbase
- reports are then put into postgres for viewing
- conclusion
- database systems do better at different tasks
- every database feature is a tradeoff
- no database can do all things well
- need to make tradeoff decisions when picking databases
- relational vs non-relational doesn't matter
- pick the database(s) for the project or task
- database systems do better at different tasks
- how difficult to migrate from something like couchdb to postgres?
- depends on how much data and in what form
- since couchdb works with json that's pretty easy
- if you want to take the document structure out of something like couchdb and put it into a relational model, the decomposition process will be complicated