9. August, 2006
27. June, 2007
in by Michael Neumann

In the last couple of days, I wrote a lot about databases and their implementation. Mostly about replicated databases and their techniques (MVTO, quorum).

Replication is only one technique to scale-up a database, increase read-performance and availability, but mostly not so much write-performance. Whether write-performance is increased depends heavily on the granularity of database objects considered for conflicts in transactions. If you use tables, then this quickly becomes a bottleneck. So better choose rows. This increases concurrency, but also the management overhead. One problem with rows as “objects”, is their primary key. If you use a sequence, this then becomes the bottleneck (only for inserts). A solution would be to use a sequence generator per database node, each generating unique sequence numbers (just prepend the node-number for example). This way you could achieve large concurrency for both inserts and updates. Updating different rows could happen on different nodes without conflicts.

The problem with replication is that it’s not easy to add on existing databases. And trying to build your own (like me :) is doomed to failure. So what’s the alternative?

Partitioning

Partitioning the data into independent pieces is always A good idea™, if not the best. Each node owns a part of the data, so no or little synchronization or conflict resolution has to happen. And it’s quite easy to implement on top of existing databases (just write an interceptor which understands SQL and distribute the queries to the corresponding node). But if your data has lots of (inter-table) dependencies and inter-node data dependencies, then you might be badly off, because then you cannot execute the query on a single node and you have to go and query other nodes. Another problem of partitioning is that it’s availability drops down. One node is less error-prone than five. The chance that one machine out of five is defunct is much higher than that of a single node. Compare it with striping in a RAID system! So ideally you employ both replication and partitioning. Have n separate clusters, each serving a part of the whole data, and within the cluster replicate the nodes.

Analogy with processors

You might ask, what has partitioning (or replication) to do with a processor? Well, in a replicated system, you have to maintain coherency. Most shared memory multi processor systems employ cache coherency (from simple dual-core processors to bigger SMP, DSM or ccNUMA systems). And what do caches do? Caches replicate data! So they have to communicate to keep them synchronized. They do so by using a cache coherency protocol (e.g. MESI – Modified, Exclusive, Shared, Invalid). In a presentation about multi-chip/multi-core processors at my university, I heard (if I recall correctly), that a huge amount of a chips area (I don’t know how much exactly, but I think it was from 10 to 20 %) is spent on implementing cache-coherency. And I think, this was a dual-core system. But I might be totally wrong. While caches increase the performance of the local processor, in a multi-processor system it is a bottle-neck, and it really does not scale. Those systems are easy to program with OpenMP for example, but they just don’t scale. I think the largest cache coherent shared memory system has 256 processors (it’s actually a ccNUMA DSM system). And because it’s a NUMA (non-uniform memory access), meaning that local memory is faster than remote memory (but both are in the same address space), you as a programmer still have to know about that fact. So it’s not a big improvement over message-passing at all.

The cell architecture is a quite different approach. It’s internally message-passing, even if it’s on the same processor. It does not use caches for it’s processing elements. Data has to be transfered into the processing elements local memory. No coherency management needed. Lots of chip area saved, so that the local memory is the fastest memory available. It’s as fast as a register access! And you can even connect multiple chips together. This approach clearly scales. Due to partitioning!