Honestly, did you ever actually make use of the swap partition on your operating system? I don’t remember my server ever using it, and that was with as little as 256 MB of main memory. Nowadays memory is cheap, and if your system starts to swap, something is usually going wrong. Most of the time it simply means that too many processes are running, that some processes use too much memory, or that they are configured to use too much memory because they are tuned for more recent hardware.
Matthew Dillon, the leader of DragonFly and a well-respected “guru” in the community, revived swap as a system-wide fast filesystem cache for use with SSDs. He extended DragonFly so that not only anonymous memory (i.e. memory not backed by a file) is written to swap in a low-memory situation, but also other types of memory, especially memory used by the buffer cache. The buffer cache caches data read from a device (e.g. from a hard disk). By default, all unused physical memory is used for the buffer cache, which greatly speeds up reads from that device on a cache hit. But memory is limited. So the idea is to use an SSD as a second-level buffer cache: when data is evicted from the in-memory buffer cache because new data is read in and the cache is full, it is written to the second-level SSD cache, which is usually much larger (e.g. 40-120 GB). SSDs are far faster at random access (a hard disk rarely manages more than about 10 MB/sec there), and even for linear reads a cheap SSD (around 200 MB/sec) is almost twice as fast as a regular SATA disk.
As SSDs have a limited number of per-cell write cycles (1,000 to 10,000, depending on the technology), it is also important to limit writes to the SSD, depending on how long you want it to last.
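On DragonFly this behaviour is controlled through a handful of sysctls. The following is only a sketch based on my reading of the swapcache(8) manual page; the device name and the concrete values are made-up examples, not a tested configuration:

```shell
# Assumption: /dev/da1s1b is a swap partition living on the SSD.
swapon /dev/da1s1b

# Turn the SSD-backed swap into a second-level cache for both
# filesystem metadata and file data.
sysctl vm.swapcache.read_enable=1
sysctl vm.swapcache.meta_enable=1
sysctl vm.swapcache.data_enable=1

# Cap the long-term average write rate (bytes/sec) to the SSD,
# trading cache freshness for flash lifetime.
sysctl vm.swapcache.accrate=100000
```

The write-rate cap is the knob for the write-cycle problem above: the lower you set it, the longer the SSD lasts, at the cost of a cache that fills more slowly.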
Yesterday I gave a presentation about DragonFlyBSD’s HAMMER filesystem at the end of the system architecture lecture here at the Karlsruhe Institute of Technology (also well known as the University of Karlsruhe). It was quite interesting and exciting to stand in our biggest lecture hall and talk to maybe 150 students about a highly “innovative” topic, at least that’s how our professor announced my presentation.
Automatic PFS creation for DragonFly’s HAMMER file system has now been committed (read here).
A few seconds ago I tried out DragonFly’s HAMMER filesystem myself. All you need is a kernel compiled with “options HAMMER”, the userland programs hammer, newfs_hammer and mount_hammer, plus a free partition (/dev/ad0s1d in my case). Here we go:
# newfs_hammer -L Home /dev/ad0s1d
Volume 0 DEVICE /dev/ad0s1d size 4.00GB
---------------------------------------------
1 volume total size 4.00GB
cluster-size:        16.00MB
max-volume-size:      0.50TB
max-filesystem-size: 16384.00TB
boot-area-size:      16.00MB
memory-log-size:     16.00MB
allocate cluster id=10a9b3f3f83e5fd0 0@02040000
cluster 0 has 1024 buffers
# mount_hammer /dev/ad0s1d /mnt
# echo "hallo" > /mnt/abc
# hammer now
0x47910e84
# echo "hallo leute" > /mnt/abc
# cat /mnt/abc@@0x47910e73
hallo
# cat /mnt/abc
hallo leute
As you can see, first I create a new HAMMER filesystem on partition ad0s1d using newfs_hammer. Next I mount that filesystem onto /mnt using mount_hammer. Then I create a file on it containing “hallo”. After acquiring the current timestamp (hammer now), I overwrite the file with the content “hallo leute”. But the old content is not gone: I can still access it using the @@timestamp notation. Nice ;-)
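The recorded history of a single file can also be listed directly with the hammer utility. This is a sketch from memory, so the exact output format may differ between versions:

```shell
# Print the transaction ids at which /mnt/abc changed; each id can
# then be used with the @@ notation, as in the session above.
hammer history /mnt/abc
```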
ZFS (the Zettabyte File System), according to Sun the "last word in filesystems", is definitely a great piece of software. It frees us sysadmins from the burden of planning upfront how to partition a disk, makes backups much easier thanks to snapshots, allows space-efficient storage thanks to optional compression support, is very safe with respect to data corruption (at least it detects corruption), and is quickly online again after a crash. But is it really the "last word in filesystems"?
Well, it will be successful, no question. FreeBSD has it, Solaris of course, Apple too, and it’s planned for NetBSD. Don’t know about Linux (licensing issues) and Windows.
DragonFly’s HammerFS is progressing well and will be available in an alpha version with the next release of DragonFlyBSD, which is planned for around mid-February. There has already been interest in porting it to NetBSD (in userland, using puffs), which is great news because NetBSD is still more widely used than DragonFly and simply runs on more hardware (at least DragonFly still doesn’t run on my new laptop).
So what’s so great about HammerFS?
You don’t need to take snapshots manually. The system is able to keep a history of every change, according to a retention policy that you can specify.
You can mount a filesystem or access a file as of a specific point in time in the past. What was the content of this file yesterday? Last week? Even atime and mtime are tracked.
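To make this concrete, here is a sketch of how as-of access looks in practice. The path and transaction id are made-up examples, and the snapshot command follows the hammer(8) utility as I remember it, so treat the exact syntax as an assumption:

```shell
# Read a file as it was at a given transaction id
# (the id comes from an earlier "hammer now").
cat /home/notes.txt@@0x47910e84

# Create a persistent, named snapshot: a softlink pointing
# into the filesystem's history at the current moment.
hammer snapshot /home /home/snap-yesterday
```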
The major goal of HammerFS is clustering. A HammerFS filesystem can be shared by a bunch of machines, with no need for a single master, based on a quorum protocol. (This is currently not implemented.)
Backups made easy
Three things make backups easy. First, the as-of feature (it eases taking tarballs). Second, journaling. And third, clustering. Do you need to back up at all if three or more machines have an identical copy of your filesystem, including its history? Think about the hurdles when restoring a traditional backup. With HammerFS you simply replace the hard disk and reconnect the machine to the network. Done! Or think about a malicious or stupid (like me, sometimes) user accidentally deleting files right before the next backup is taken. You will lose your work, unless you use something like HammerFS. ZFS doesn’t help here because it doesn’t have the infinite-snapshot feature. While I trust hard disks and RAIDs, I don’t trust users!
HammerFS will allow you to open a file in DB mode. I don’t know exactly how this will work out, but you will be able to use a file as a simple key/value database. No need for special libraries, and probably quite fast.
The last couple of days I have been playing with DragonFlyBSD. Except for some minor issues (like fdisk completely hanging my BIOS :)) it works pretty well. I was mostly interested in its journaling, provided by jscan. You can read more here (or follow the direct link to my mail).
I like DragonFly for its simplicity. Currently I am running FreeBSD 7.0-CURRENT. It has tons of features, but many of them are not well thought out. For example, disk and file-system labeling through glabel confuses me a lot and doesn’t work as expected. Normally it should expose the labeled disks/file systems under /dev/ufs/name or /dev/label/name, so that you can easily mount them without knowing the exact device number or slice. The concept is nice, but I think the implementation is way too complex (or too generic). Or I am just too stupid to understand it. But I think the whole geom stuff (this framework is called geom) is not yet 100% stable.
Matthew Dillon (DragonFly) is now working on getting userland file systems working. Well, as I know him, this will be a comparatively easy task, as all the former work he did on DragonFly will help him a lot! For example, the whole kernel can already run in userland. Or the work he did on syslink, which is a message-passing interface.
Most of the features FreeBSD provides I don’t need at all. What I really value is basic kernel infrastructure that is clean and understandable. That leads to a stable kernel and makes it easy to extend.
As far as I understand, this layer will allow local as well as remote journaling, and will even make it feasible to do real-time off-site backup (a secondary spool will be used in case of network outages).
I wrote a Ruby wrapper around the sys_checkpoint syscall. You can find the sources here: www.ntecs.de/viewcvs/viewcvs/DragonFly/checkpoint/
The next step is to put this into the Wee sources, so that you can shut down and reload Wee applications that use continuations. Of course, this only works on DragonFlyBSD. But I don’t think that’s much of a problem…