3. August, 2006
27. June, 2007
in by Michael Neumann

How are databases used today by peoples writing web-applicatinos? Or what features do you really need?

  • Transactions: YES
  • FK Constraints: Nice to have, but not neccessary
  • Check constraints: Well, you already do that in the model, do you?
  • Triggers: BEWARE!
  • Aggregations: It depends! Mostly they are too limited, so you do them in any way in your application.

Imagine a very simple database, which supports only little SQL, and you’re using Rails. Would you miss something? I say: NO!

Of course in such a case, where all validation is done in the model, you should not issue SQL commands on your own without using the model. That’s the only danger. But one could live with it. The advantage is, that the database and the model do not duplicate validations, and that the database implementation can be kept at a minumum (you could even get completely rid of SQL!).

What do we really need?

  • CREATE TABLE (col type [pk], ...)
  • DROP TABLE
  • SELECT cols FROM table WHERE cond
  • DELETE FROM table WHERE cond
  • UPDATE table SET col=val … WHERE cond

And, do we really need JOINS? Essential if you’re going over a wire! So YES! How are JOINs used? I think mostly to connect a tables FK to another tables PK. So how about automatic joins on FK and PKs? And how about FKs and PKs always being integers?

Do we need this?

SELECT t1.a, t2.a FROM table1 t1, table2 t2
WHERE t1.fk_id = t2.id

For ActiveRecords in Rails that’s mostly useless, because you usually get back a model object, so that values from multiple tables cannot be used anyway.

What types do we need? Integers, blobs and user-defined types.

ORDER BY and LIMIT/OFFSET are very useful too, to limit traffic on the wire!

How to implement a database

For most users databases are some kind of magic, that guarantee ACID properties (not all databases do that). But how the database achives this, is mostly unknown. Well, it isn’t that hard! Usually a database uses some kind of B-Trees, that is a balanced binary tree with nodes that contain not just a single value as binary trees do, but contain multiple ordered values. The property of binary trees, that left subnodes have smaller values and right subnodes have larger values apply to B-Tree in the same way. But regular binary trees degenerate very quickly. Of course you can use balanced binary trees, like AVL trees, but they only really work well in memory and not on disk. So, B-Trees are quite simple to implement and are efficient, as they keep multiple values per disk-page.

A page is usually a 4 KB block of memory and should ideally be the same size as the underlying disk-block size. A page stores a B-Tree node or the actual data of the record. Now we have to maintain information about which pages are free and which are not. A very simple approach is to keep a pointer to the first free page, and each free page in turn keeps a pointer to the next free page.

For performance reasons we also need a page cache, which caches often used pages in memory. Uh! What happens if you accidentally plug the power jack? Of course everything in memory gets lost. And how do we guarantee atomicity. For example it might happen, that we insert a new key into the B-Tree, and the B-Tree has to rebalance. Usually this affects multiple pages. Now what happens if we write those affected pages after we’ve modified them back to disk, but in between there occurs an error? We’ve left the database in an inconsistent state from which it’s hard to recover. The key to this problem is a redo-log. This is a file to which we write those changes before touching the database file. We also write a “commit” marker to the log. At least the commit write to the log must be atomic. And important is that we are not allowed to touch the database file before this commit has been written out permanently. In case of a failure, either the log is incomplete (commit has not been written). In this case, we can simply ignore the log, nothing has been written to the database anyways. Or the commit has successfully been written to the log, but during updating the database file, we “plugged the power jack”, so that the database is inconsistent. But we can use the log to redo those changes and bring the DB into a consistent state.

Replicated databases

That’s an advanced topic. There are two major flavours of replication, synchronous and asynchronous. Synchronous replication is slow! You have to wait until each participating database gets the change, before the transaction is finally commited and the client can proceed. So it’s only effective if the access pattern is mostly read-only. Asynchronous replication is (with my limited understanding) mostly used to keep hot-backups, so in case a database goes down, you can proceed with the hot-backup. There’s usually a transaction propagation delay, meaning you might loose already committed transactions, and this is no longer ACID (as transactions are in this case no longer durable). So you can decide between slow and dangerous. Are there alternatives?

Quorum replicated temporal databases

A temporal database is a database, that keeps past records. So records are appended, never overwritten (unless you VACUUM). So the database grows and grows. One advantage to this is, that it’s mostly lock-less. You do not lock a record before you update it. You just write a new record. The old is never touched. Of course you need some kind of locking in the B-tree and page cache and the redo-log. But those locks are only of very short periods of time and are very local. This gives you a very high degree of parallelism. Many threads can access the database at the same time.

But from time to time, you’d have to “optimize” your database. Remove all those old records. That takes time and is best done offline, because in this append-only scenario, you don’t have to keep free-lists, and why would you do something that do can get around? But you cannot go offline with your database, unless it’s replicated. That’s where replication comes to light. That’s a bit more complex then the rest. It needs to be:

  • fast
  • fault-tolerant
  • deterministic

I’m currently trying to understand that topic. There are some issues I haven’t grasped yet. But the priciple of a quorum is indeed very simple. Say you have 5 nodes in total. So the quorum is 3 nodes (n/2 + 1). If you commit a transaction and 3 nodes (the quorum) say “okay”, then the transaction is accepted.