22. January, 2009
in by Michael Neumann

As you can read here, PostgreSQL will contain so called window functions which is part of the SQL 2008 specification. Now that I know how this feature is named, I know what I was missing for the past 10 years. Window functions allow you to produce aggregated values from each row over a specific set of rows (a partition).

Assuming a table staff defined as

create table staff (
    name    varchar(30),
    dept    varchar(30),
    salaray int
);

we can now show each staff person together with the average salary of the department he/she belongs to using a very simple SQL statement like

select name, dept, salary,
       avg(salary) over (partition by dept)
from staff;

Without using window functions the necessary SQL statement is considerably more complex and requires ugly subselects and joins:

select s.name, s.dept, s.salary,
       a.avg_salary
from staff s,
     (select dept, avg(salary) as avg_salary
      from staff group by dept) as a
where s.dept = a.dept

Window functions can do a lot more, so for example can produce cummulative sums easily.

9. October, 2007
9. October, 2007
in by Michael Neumann

PostgreSQL 8.3 (which is now in beta) will contain a lot of useful performance improvements:

  • Implement "top N" sorting in ORDER BY ... LIMIT queries

    Very useful! Should make pagination with many pages faster!

  • Lazy XID allocation

    Means that read-only transactions are cheaper and faster.

  • Implement an optional asynchronous commit mode

    While this doesn’t guarantee the “D” in ACID, this should lead to more write-throughput. And how likely is it anyways that a Postgres database breaks down (on FreeBSD :)

And some more features:

  • Autovacuum is enabled by default. Great!
  • Full text search is now built-in
  • ENUM data type
23. August, 2007
23. August, 2007
in by Michael Neumann

This article shows some interesting results on how good PostgreSQL scales on FreeBSD 7.0 on a 8-core machine (compared to FreeBSD 6.2 and 2-core).

8. February, 2007
27. June, 2007
in by Michael Neumann

This article is about the down-sides of Software Transactional Memory (STM). STM uses concepts from databases and applies them to main memory. Databases allow concurrent reads and writes to a single resource (a table for example) and implement techniques to fullfill the ACID properties. Now imagine multiple threads accessing a single resource (a variable in memory). STM does the same as a database, but for main memory.

But the question is, does it scale? Is it good to have this kind of shared-state concurrency? Of course, shared-state concurrency doesn’t scale per-se. You always have a single resource, and multiple accessors that concur for the resource. Whereas in a shared-nothing architecture, you have one resource per accessor, so there is true parallelism and no conflicts. It’s much easier to program in a shared-nothing paradigm than it is for shared-state. It’s less error-prone and it scales.

Whenever you have multiple threads accessing the same shared state (the same memory locations), you have to coordinate them when accessing the state. You have to gain locks, release locks, be careful to not dead-lock or live-lock the threads. It’s very hard to predict what is going on.

I’ve been thinking about a simple database implementation lately, that is single-threaded and event-driven like modern web-servers (Lighttpd or Nginx). The whole index is in main-memory, whereas the data lives on the disk. Queries can only contain elements that live in the index in memory. Because whenever you go to disk, you have to wait a very long time. That’s why it is event-driven. To avoid locking, the index is traversed in memory and those rows that match or are of interest are marked and at the end a request to read those rows from the disk is issued in the background (asynchronous). Maybe it is handed-over to a second thread, that handles disk-access, so that the other thread is free to continue with the next request. This method is very simple. But of course it has the requirement that every column that occurs in a “WHERE” clause has to be available in the in-memory index (more complex queries could be build at the application layer). There is no concurrency, everything is sequential. But usually the access to the index is very fast (as it is in memory). And this doesn’t matter, as you’d partition your database into small pieces, each of which is a database as described above. Of course, there are no referential constraints across partitions (inside it is possible).

Would be interesting how fast such a database would be. The key point is that it avoids complexity by avoiding shared-state and concurrency at all.

9. August, 2006
27. June, 2007
in by Michael Neumann

In the last couple of days, I wrote a lot about databases and their implementation. Mostly about replicated databases and their techniques (MVTO, quorum).

Replication is only one technique to scale-up a database, increase read-performance and availability, but mostly not so much write-performance. Whether write-performance is increased depends heavily on the granularity of database objects considered for conflicts in transactions. If you use tables, then this quickly becomes a bottleneck. So better choose rows. This increases concurrency, but also the management overhead. One problem with rows as “objects”, is their primary key. If you use a sequence, this then becomes the bottleneck (only for inserts). A solution would be to use a sequence generator per database node, each generating unique sequence numbers (just prepend the node-number for example). This way you could achieve large concurrency for both inserts and updates. Updating different rows could happen on different nodes without conflicts.

The problem with replication is that it’s not easy to add on existing databases. And trying to build your own (like me :) is doomed to failure. So what’s the alternative?

Partitioning

Partitioning the data into independent pieces is always A good idea™, if not the best. Each node owns a part of the data, so no or little synchronization or conflict resolution has to happen. And it’s quite easy to implement on top of existing databases (just write an interceptor which understands SQL and distribute the queries to the corresponding node). But if your data has lots of (inter-table) dependencies and inter-node data dependencies, then you might be badly off, because then you cannot execute the query on a single node and you have to go and query other nodes. Another problem of partitioning is that it’s availability drops down. One node is less error-prone than five. The chance that one machine out of five is defunct is much higher than that of a single node. Compare it with striping in a RAID system! So ideally you employ both replication and partitioning. Have n separate clusters, each serving a part of the whole data, and within the cluster replicate the nodes.

Analogy with processors

You might ask, what has partitioning (or replication) to do with a processor? Well, in a replicated system, you have to maintain coherency. Most shared memory multi processor systems employ cache coherency (from simple dual-core processors to bigger SMP, DSM or ccNUMA systems). And what do caches do? Caches replicate data! So they have to communicate to keep them synchronized. They do so by using a cache coherency protocol (e.g. MESI – Modified, Exclusive, Shared, Invalid). In a presentation about multi-chip/multi-core processors at my university, I heard (if I recall correctly), that a huge amount of a chips area (I don’t know how much exactly, but I think it was from 10 to 20 %) is spent on implementing cache-coherency. And I think, this was a dual-core system. But I might be totally wrong. While caches increase the performance of the local processor, in a multi-processor system it is a bottle-neck, and it really does not scale. Those systems are easy to program with OpenMP for example, but they just don’t scale. I think the largest cache coherent shared memory system has 256 processors (it’s actually a ccNUMA DSM system). And because it’s a NUMA (non-uniform memory access), meaning that local memory is faster than remote memory (but both are in the same address space), you as a programmer still have to know about that fact. So it’s not a big improvement over message-passing at all.

The cell architecture is a quite different approach. It’s internally message-passing, even if it’s on the same processor. It does not use caches for it’s processing elements. Data has to be transfered into the processing elements local memory. No coherency management needed. Lots of chip area saved, so that the local memory is the fastest memory available. It’s as fast as a register access! And you can even connect multiple chips together. This approach clearly scales. Due to partitioning!

7. August, 2006
27. June, 2007
in by Michael Neumann

That’s the approach I’m mostly interested in, and which was used in the distributed Backplane database. For fast lookup it’s important that you order the tuples in reverse timestamp order. But you also need to index on the tuple-id. The first tuple you find will be the youngest tuple. If you used a timestamp in the past, you have to search through until you find the tuple with a smaller timestamp than the transaction. It’s important that you first order on the primary key, then on the transaction timestamp. Or at least that’s my understanding.

7. August, 2006
27. June, 2007
in by Michael Neumann

A Cache-Sensitive B+Tree (CSB+) uses one cache-line per node. A cache line is usually only 64 bytes, so only around 14 32-bit keys fit in there. In comparison to a regular B+Tree, which has node sizes of 4k and above. The trick is to avoid the child-array. Only one pointer to the first child is stored. This points to the node group. The corresponding node is found by simply using the index of the key-array and add it to the pointer. A node group is simply the size of a single node (64 bytes) multiplied by the number of keys per node (e.g. 14).

Another unrelated technique to improve key lookup in a B+Tree node is to use interpolation search instead of binary search. Binary search has a worst case runtime of O(log n). Interpolation search has O(n), but usually finds the key within 3 or 4 lookups. You store the smallest and largest key, calculate the difference and divide by the number of elements. So you know the “step” per key. Then you add the smallest key to the lookup key and divide by “step” and use the result as index. You can also use binary search as a fall-back method, if you’ve not found the key after 4 tries, so you keep the logarithmic worst case.

Another technique that can be used to increase insertion performance is to use gaps within the key-array. By doing this you avoid copying the whole array before you can insert a new key.

While Skip-Lists are quite popular and well known, nobody knows about Skip-Trees. It’s an interesting paper. The only ugly thing is that nodes may have an arbitrary number of keys. And there’s not much known about it’s performance characteristics.

4. August, 2006
27. June, 2007
in by Michael Neumann

Transactions per se do not have anything to do with databases. There are four operations:

  • read(object)
  • write(object)
  • commit
  • abort

A partial order is defined on those operations. For example:

t1: r(X) w(X) c
t2: r(Y) w(Y) c

Transaction t1 reads object X, then writes it and finally commits. It’s obvious that those two transaction do not disturb each other, meaning that you can execute them (or parts of them) in any order without giving a different result.

t1: r(X) w(X) ....
t2: r(X)

In the example above, the two transactions are no longer independent of each other (because they access the same object X). You might get a dirty read, depending on the execution order.

r_1(X) w_1(X) r_2(X) ...

If t1 aborts, then t2 has not only read a dirty value, but has also been an inconsistent read. Dirty reads and inconsistent reads have to be avoided. This brings me to the topic of the article.

Out-of-Order Instruction Execution

Modern processors execute the instructions of a program not sequentially but out of order. Instructions have dependencies. There are three types of them:

  • True dependency: Read-after-Write (RAW)
  • Anti-dependency: Write-after-Read (WAR)
  • Output dependency: Write-after-Write (WAW)

True dependecies

MOV EAX, 1
MOV EBX, EAX

You cannot exchange the execution order of those two operations, because you don’t know the value of EAX before execution of the first instruction.

Anti-dependencies

MOV EBX, EAX
MOV EAX, 1

Again, you cannot exchange the execution order of both instructions. But what you can do is, to rename the registers:

MOV EBX, EAX
MOV renamed(EAX), 1

Now the second instruction do not writes to EAX, but to an internal register. Those two instructions can be executed in parallel. But you have to take care, that you later assign EAX the value of the renamed register in the correct order.

This is called register renaming and is used in any modern processor.

Output dependency

MOV EBX, 1
MOV EBX, 2

Again, register renaming is used to be able to execute the two instructions in parallel.

The link to transactions and databases

To increase concurrency, modern databases use Multi-Versioning. That is, they keep multiple versions of the rows. The advantage is that read-only transactions can be executed without being disturbed by writing transactions. There are no conflicts. And even writing transactions can be executed in any order, but you might have to undo them on commit if you detect a conflict with an other writing transaction. That’s exactly what modern processors do with register renaming (multiple version of a register) and out-of-order execution.

3. August, 2006
27. June, 2007
in by Michael Neumann

What to know more about (database) transactions? These slides are very good!

3. August, 2006
27. June, 2007
in by Michael Neumann

How are databases used today by peoples writing web-applicatinos? Or what features do you really need?

  • Transactions: YES
  • FK Constraints: Nice to have, but not neccessary
  • Check constraints: Well, you already do that in the model, do you?
  • Triggers: BEWARE!
  • Aggregations: It depends! Mostly they are too limited, so you do them in any way in your application.

Imagine a very simple database, which supports only little SQL, and you’re using Rails. Would you miss something? I say: NO!

Of course in such a case, where all validation is done in the model, you should not issue SQL commands on your own without using the model. That’s the only danger. But one could live with it. The advantage is, that the database and the model do not duplicate validations, and that the database implementation can be kept at a minumum (you could even get completely rid of SQL!).

What do we really need?

  • CREATE TABLE (col type [pk], ...)
  • DROP TABLE
  • SELECT cols FROM table WHERE cond
  • DELETE FROM table WHERE cond
  • UPDATE table SET col=val … WHERE cond

And, do we really need JOINS? Essential if you’re going over a wire! So YES! How are JOINs used? I think mostly to connect a tables FK to another tables PK. So how about automatic joins on FK and PKs? And how about FKs and PKs always being integers?

Do we need this?

SELECT t1.a, t2.a FROM table1 t1, table2 t2
WHERE t1.fk_id = t2.id

For ActiveRecords in Rails that’s mostly useless, because you usually get back a model object, so that values from multiple tables cannot be used anyway.

What types do we need? Integers, blobs and user-defined types.

ORDER BY and LIMIT/OFFSET are very useful too, to limit traffic on the wire!

How to implement a database

For most users databases are some kind of magic, that guarantee ACID properties (not all databases do that). But how the database achives this, is mostly unknown. Well, it isn’t that hard! Usually a database uses some kind of B-Trees, that is a balanced binary tree with nodes that contain not just a single value as binary trees do, but contain multiple ordered values. The property of binary trees, that left subnodes have smaller values and right subnodes have larger values apply to B-Tree in the same way. But regular binary trees degenerate very quickly. Of course you can use balanced binary trees, like AVL trees, but they only really work well in memory and not on disk. So, B-Trees are quite simple to implement and are efficient, as they keep multiple values per disk-page.

A page is usually a 4 KB block of memory and should ideally be the same size as the underlying disk-block size. A page stores a B-Tree node or the actual data of the record. Now we have to maintain information about which pages are free and which are not. A very simple approach is to keep a pointer to the first free page, and each free page in turn keeps a pointer to the next free page.

For performance reasons we also need a page cache, which caches often used pages in memory. Uh! What happens if you accidentally plug the power jack? Of course everything in memory gets lost. And how do we guarantee atomicity. For example it might happen, that we insert a new key into the B-Tree, and the B-Tree has to rebalance. Usually this affects multiple pages. Now what happens if we write those affected pages after we’ve modified them back to disk, but in between there occurs an error? We’ve left the database in an inconsistent state from which it’s hard to recover. The key to this problem is a redo-log. This is a file to which we write those changes before touching the database file. We also write a “commit” marker to the log. At least the commit write to the log must be atomic. And important is that we are not allowed to touch the database file before this commit has been written out permanently. In case of a failure, either the log is incomplete (commit has not been written). In this case, we can simply ignore the log, nothing has been written to the database anyways. Or the commit has successfully been written to the log, but during updating the database file, we “plugged the power jack”, so that the database is inconsistent. But we can use the log to redo those changes and bring the DB into a consistent state.

Replicated databases

That’s an advanced topic. There are two major flavours of replication, synchronous and asynchronous. Synchronous replication is slow! You have to wait until each participating database gets the change, before the transaction is finally commited and the client can proceed. So it’s only effective if the access pattern is mostly read-only. Asynchronous replication is (with my limited understanding) mostly used to keep hot-backups, so in case a database goes down, you can proceed with the hot-backup. There’s usually a transaction propagation delay, meaning you might loose already committed transactions, and this is no longer ACID (as transactions are in this case no longer durable). So you can decide between slow and dangerous. Are there alternatives?

Quorum replicated temporal databases

A temporal database is a database, that keeps past records. So records are appended, never overwritten (unless you VACUUM). So the database grows and grows. One advantage to this is, that it’s mostly lock-less. You do not lock a record before you update it. You just write a new record. The old is never touched. Of course you need some kind of locking in the B-tree and page cache and the redo-log. But those locks are only of very short periods of time and are very local. This gives you a very high degree of parallelism. Many threads can access the database at the same time.

But from time to time, you’d have to “optimize” your database. Remove all those old records. That takes time and is best done offline, because in this append-only scenario, you don’t have to keep free-lists, and why would you do something that do can get around? But you cannot go offline with your database, unless it’s replicated. That’s where replication comes to light. That’s a bit more complex then the rest. It needs to be:

  • fast
  • fault-tolerant
  • deterministic

I’m currently trying to understand that topic. There are some issues I haven’t grasped yet. But the priciple of a quorum is indeed very simple. Say you have 5 nodes in total. So the quorum is 3 nodes (n/2 + 1). If you commit a transaction and 3 nodes (the quorum) say “okay”, then the transaction is accepted.

23. November, 2004
23. November, 2004
in by Michael Neumann
I’m proud to announce the first public "release" of postgres-pr. It is a library to access PostgreSQL from Ruby without the need of any C library.
  • You can use it only with newer 7.x databases that use wire-protocol 3.
  • Lot’s of stuff is missing, only the wire-protocol is quite complete.

Quick Example:

> gem install postgres-pr
> irb -r rubygems

Then in the interactive Ruby interpreter type (replace DBNAME and DBUSER accordingly):

require 'postgres-pr/connection'
c = PostgresPR::Connection.new('DBNAME', 'DBUSER')
c.query('SELECT 1+2')                  # => [["3"]]

Have fun!