What I’m Going to Tell You About

In the previous article we constructed the environment needed to get familiar with the kernel. Then we took a look at loadable kernel modules and wrote a simple “Hello World!”. Finally, we wrote a simple and useless file system. Now it’s time to go on.

The main aim of this article is to teach the file system to read from disk. For now it will read the operating information only (the super block and index nodes).

Doesn’t seem like much, eh? The thing is that in this post we’ll have to define the structure of our file system – the way it will be stored on disk. Besides, we’ll also get familiar with SLAB and RCU. All of this will require some explanations – a lot of words and little code. Therefore, the post will be quite long.

A Boring Beginning

Our file should be simple, so it will be stored on disk the following way:

Teaching the File System to Read
Let’s start from the end:


The same disk format is used in other files systems as well. For instance, a block group in ext2/3 has a similar structure, but ext2/3 works with several such groups while we’ll work with one only. Such format is considered to be out-of-date and new file systems step away from it. For example, btrfs uses a much more interesting scheme that provides many advantages over ext family. But let’s get back to our muttons.

We’ve defined that the super block and bit maps of blocks/index nodes occupy the first three blocks of the file system. But how much does the table of index nodes occupy? Generally speaking, it’s not right to affix this dimension as it depends on the way we’re going to use the file system. For instance, you should make this table smaller if you’re going to store big files mostly. You’re unlikely to run out of index nodes earlier than of the disk space. On the other hand, if you’re going to store a lot of small files you’re likely to run out of index nodes earlier than of the disk space, provided that the table is too small.

There is a key -i in mke2fs utility we use for formatting the disk for ext2/3 file systems. The key indicates how much disk space we need to create an index node. So indicating -i 16384 we’ll of create one index node at every 16 kB of the disk space without an ability to change this value (at least for now).

The last thing I’d like to tell you about is the size of the block. A file system can operate with blocks of different size. I’ll support blocks of 512, 1024, 2048 and 4096 bite, as it’s better to work with one page blocks (we’ll get back to it later). But it’s not necessary to do so. Moreover, big size blocks can assist higher performance.

Choosing the appropriate block size in a classical file systems is quite an interesting subject. For instance, a popular book about Operating Systems says that a block of 4 kB 60-70% of files will fit into one block. If you fit more files into one block, the fragmentation will be less, the read rate higher. But you’ll also use a lot of space to no purpose. In our case, the block size is also the main limiter of the file system. When the block size is 4 KB the bit map of free blocks can cover 125 MB of the disk space only

Getting Back to Practice

Now it’s time to write some code. We’ll begin with structures we’re going to use.

struct aufs_disk_super_block
{
    __be32      dsb_magic;
    __be32      dsb_block_size;
    __be32      dsb_root_inode;
    __be32      dsb_inode_blocks;
};

We store the super block structure at the very beginning block 0 of the disk. It begins with the magic number. It helps us to make sure that aufs is stored on disk (I’ve mentioned it previously).

Considering the fact that we’ll support different block sizes (which won’t require any efforts from our side) we should know the size we’re using in this file system. The super block stores this information in the dsb_block_size field.

Besides, the dsb_inode_blocks also stores a great number of blocks in the table of index nodes. I’ve already mentioned the size of this table can vary.

dsb_root_inode field also stores the number of the root directory index node. Of course, it isn’t necessary to store it. You can just use a fixed number for the root, but then the structure will be empty in this case, so it seems stronger with it.

Big endian
You should note that we use types of fixed size for fields, as we’ll write this structure on a disk “as it is”. But it’s not enough to fix the size. You should also affix the bite order. We’ll use big endian. It’s the network bite order. We can understand it by the type name (__be32).

The order you’re going to use isn’t that important, but you should necessarily affix it. But some developers think that there are more platforms using little endian. Therefore, it’s better to use it. Anyway, let’s get back to the point.

The __be32 type is actually a synonym of uint32_t. But its name emphasizes that the variable stores data in big endian (such way of documenting). There’s a similar type in the kernel for little endian as well.

Now, let’s take a look at the most important structure of the file system. It’s the index node:

struct aufs_disk_inode
{
    __be32      di_first;
    __be32      di_blocks;
    __be32      di_size;
    __be32      di_gid;
    __be32      di_uid;
    __be32      di_mode;
    __be64      di_ctime;
};

First of all, the index node determines the place the file/catalog is stored on disk. There can be various ways of storing files. We’ll use a simple one: one extent for one file. Extent is the continuous sequence of disk blocks. We’ll store each file/catalog in the continuous sequence of blocks. di_first and di_blocks fields store the first block and the number of blocks in the extent accordingly.

You might wonder how we can write data to the end of the file and add the writing to the catalog. When storing in such a way the complete implementation of operations leading to the change of the file/catalog is indeed a pain in the neck (let alone the efficiency of such implementation). Therefore, we’re not going to perform the full writing implementation. Maybe I’ll tell you about it in the following article.

Such organization has positive aspects. Files are not fragmented, which has a good affect on the speed of successive reading. Therefore, we can use such structure in file systems meant for reading only. For example, in iso 9660 (though it already supports files fragmentation).

It’s clear that extents aren’t a complex structure that doesn’t provide a lot of things. But together with classic tree-like structure for file storage they are quite a good variant for file systems supporting fragmentation.

In addition to specifying the place on a disk, the index node also stores an actual size of the file in di_size field. Actually, the file size shouldn’t fit the size of one block.

di_gid and di_uid are identifiers of the group and the user. It’s not always reasonable to store such information in the file system. I put in them as an example.

di_mode field stores access rights for the group of the file owner and all other users.
Since we’ve stored the group and the owner, we should store the access rights as well. di_mode also stores the object type describing the index node. For example, it indicates, whether the object is a catalog or a file.

Finally, di_ctime field stores the date of the file creation. File systems usually store the dates of the latest update and access to the file as well, but we won’t care about them.

Formatting the Disk

Thus, we’ve determined the file system format on the disk and now it’s time to write an utility casting the disk to the necessary format. Disks in Linux are just a file (that’s where we should mention a popular Unix feature). So disk formatting is just writing necessary data to the file. In our case it’s the super block, bit maps and the root catalog (empty for now).

Not to make the article about Linux kernel become an article about C++ (especially considering Linus attitude towards the latter), I recommend you to get familiar with the source code at github. Anyway, we’re going to review the basic classes:


The utility allows changing the block size using -s or --block_size keys, as well as the number of blocks we’re going to use for the file system utilizing -b or –blocks keys.
On the whole, the utility is simple, but some of you may think that the code is too complex for such an easy task. The thing is that we’ll firstly teach the file system read form the disk and then get down to writing. In order to control the operation of our driver we should write something to the disk. We’ll later add to our utility an ability of importing the files/catalogs when formatting the disk. This will help us a lot.

Getting Back to the File System

Now let’s get back to our loadable module. We’ll start with reading the super block from the disk. But before that, we’re going to build another structure for the super block:

struct aufs_super_block
{
    uint32_t    asb_magic;
    uint32_t    asb_inode_blocks;
    uint32_t    asb_block_size;
    uint32_t    asb_root_inode;
    uint32_t    asb_inodes_in_block;
};

This structure will represent the super block in the memory. The idea is simple – we read from aufs_disk_super_block disk and convert it to aufs_super_block executing the conversion of the bite order and also calculating various useful data (asb_inodes_in_block in the given case). This structure is actually a great place for global variables of the file system.

Getting back the previous post, there are three structures to represent the super block:
Two structures are okay, but what is the third one for? The thing is that Linux doesn’t know anything about our file system. So it’s quite possible that the super block (as well as inode and any other structure of Linux kernel) doesn’t have all the necessary fields. Therefore, we should build additional structures and attach them to the kernel structures. But how should we do that?

There are two popular ways of implementing such connection in the kernel (let’s name them composition and generalization). We’ll utilize the composition for the super block. This approach requires support from the kernel. There’s an interesting filed inside the super_block structure:

struct super_block {
    ...
    void        *s_fs_info;
    ...
};

We can store a pointer to any data in this field. That’s exactly where we’re going to store the pointer for the aufs_super_block. Everywhere we have the access to the super block structure we can get the access to the aufs_super_block structure as well. Anyway, our task is to read the super block from the disk. We’ll write several functions for that purpose:

static struct aufs_super_block *aufs_super_block_read(struct super_block *sb)
{
    struct aufs_super_block *asb = (struct aufs_super_block *)kzalloc(sizeof(struct aufs_super_block), GFP_NOFS);
    struct aufs_disk_super_block *dsb = NULL;
    struct buffer_head *bh = NULL;

    if (!asb)
    {
        pr_err("aufs cannot allocate super block\n");
        return NULL;
    }

    bh = sb_bread(sb, 0);
    if (!bh)
    {
        pr_err("cannot read 0 block\n");
        goto free_memory;
    }

    dsb = (struct aufs_disk_super_block *)bh->b_data;
    aufs_super_block_fill(asb, dsb);
    brelse(bh);

    if (asb->asb_magic != AUFS_MAGIC)
    {
        pr_err("wrong magic number %u\n", (unsigned)asb->asb_magic);
        goto free_memory;
    }

    return asb;

free_memory:
    kfree(asb);
    return NULL;
}

The first thing this function does is allocating memory for the super block structure. There are quite many ways to allocate the memory in the kernel. kzalloc (and kmalloc as well) is the simplest one. It operates the same way as general malloc, but it requires passing the additional set of flags. In contrast to kmalloc, kzalloc fills the allocated memory with zeros (which just comes to passing an additional flag inside kmalloc).

We’ve mentioned flags. What are they for? The point is that different kernel parts should meet requirements of different guarantees. For instance, we shouldn’t lock up when processing the network packet. In order to involve DMA we should allocate the memory in the special memory area. Since we use the memory allocation everywhere, we need a mechanism of “adjustment”. We’ll utilize GFP_NOFS flag. It notifies that the memory allocator won’t refer to the file system facilities. It’s quite logical when implementing the file system, though it isn’t necessary in this very case.

We shouldn’t forget to control that we’ve allocated the memory.

The next principal moment is calling sb_bread function. Here’s reading from the disk itself! The function accepts the pointer to the super block and the block number we should read. That’s simple. The function returns the pointer to buffer_head structure, while the block data are available using b_data of the structure.

Of course, we shouldn't forget to check that the reading took place without fail.

Then we just convert the pointer into char to aufs_disk_super_block structure pointer. aufs_super_block_fill function fills aufs_super_block structure by using aufs_disk_super_block:

static inline void aufs_super_block_fill(struct aufs_super_block *asb,
            struct aufs_disk_super_block const *dsb)
{
    asb->asb_magic = be32_to_cpu(dsb->dsb_magic);
    asb->asb_inode_blocks = be32_to_cpu(dsb->dsb_inode_blocks);
    asb->asb_block_size = be32_to_cpu(dsb->dsb_block_size);
    asb->asb_root_inode = be32_to_cpu(dsb->dsb_root_inode);
    asb->asb_inodes_in_block =
        asb->asb_block_size / sizeof(struct aufs_disk_inode);
}

As you might have guessed, be32_to_cpu function converts the number from big endian into the bite order used by the platform.

We should allocate the block after finishing our work with it. There’s brelse function for this purpose. It reduces the counter of references to this block. The block won’t be deallocated right after the counter reaches 0. The garbage collector operates for blocks in the kernel. It won’t free the block unless necessary. The thing is that reading blocks from the disk is quite a long operation. Therefore, it’s reasonable to support the cache of the read blocks and return the one already read when the block is reread (of course, if it’s not in the cache yet).

The last thing we’ll do is check the magic number as we should make sure that aufs is stored on disk.

For those having paid their attention to goto, it’s used in the kernel quite often. Mostly to organize the error processing. There are no exceptions in C language. But the idea of dividing the main way of executing and processing errors is quite attractive. That’s when goto comes to help us. In our case goto use doesn’t give anything. I used it knowingly, as an example of what it can be used for. There’re not many goto haters among kernel developers. Therefore, there are some areas in the code misusing the ill-fated operator. You should be ready for that.

A cautious reader may have paid attention to a mismatch. As I’ve already mentioned, file systems can work with blocks of different size. This information is most likely to be stored in the super block. So the block of what size will sb_bread function read when reading the super block? It’s simple in our case. By default, the block size is determined as a size of the block device (so many blocks…). We’re hoping that its size is enough for the super block structure. It is in our case.

We’ve written the function to read the super block. We’re going to call it from aufs_fill_super (refer to the previous post). Now it looks like the following:

static int aufs_fill_sb(struct super_block *sb, void *data, int silent)
{
    struct inode *root = NULL;
    struct aufs_super_block *asb = aufs_super_block_read(sb);

    if (!asb)
        return -EINVAL;

    sb->s_magic = asb->asb_magic;
    sb->s_fs_info = asb;
    sb->s_op = &aufs_super_ops;

    if (sb_set_blocksize(sb, asb->asb_block_size) == 0)
    {
        pr_err("device does not support block size %u\n",
                    (unsigned)asb->asb_block_size);
        return -EINVAL;
    }

    root = aufs_inode_get(sb, asb->asb_root_inode);
    if (IS_ERR(root))
        return PTR_ERR(root);

    sb->s_root = d_make_root(root);
    if (!sb->s_root)
    {
        pr_err("aufs cannot create root\n");
        return -ENOMEM;
    }

    return 0;
}

As I’ve already mentioned, we store the pointer for the aufs_super_block in the s_fs_info field. We also set the proper block size by calling sb_set_blocksize. As says the comment inside the function the block size should be from 512 bytes till one page size. That’s how we chose the block size. We’re going to need additional efforts (well, not that big) if the file system has to work with a block of the big size.

Anyway, we’ve allocated aufs_super_block in the dynamic memory. Thus, we should take care of deallocating it. Let's apply some changes to another function from the previous post:

static void aufs_put_super(struct super_block *sb)
{
    struct aufs_super_block *asb = (struct aufs_super_block *)sb->s_fs_info;
    if (asb)
        kfree(asb);
    sb->s_fs_info = NULL;
    pr_debug("aufs super block destroyed\n");
}

As you might have guessed, kfree is the pair function to kmalloc (there are several kfree implementations in the kernel — here and here), but let’s not go into details.

Another significant change inside aufs_fill_sb function is calling aufs_inode_get. In the previous article we created a dummy inode. Now we’ll learn how to read them from the disk.

But before that I’m going to draw your attention to IS_ERR and PTR_ERR. These simple conversions from pointers into the number and backwards are based on the fact that the kernel possesses the full information about its memory. So it knows which bits we can use as intended. It’s the simplest example of using the knowledge about the pointer structure. There are also more interesting ones, not only in the kernel. For instance, take a look at Facebook Folly.

We’ll start our work with index nodes from extending the super_operations structure we got familiar with previously. Let’s fill it the following way:

static struct super_operations const aufs_super_ops = {
    .alloc_inode = aufs_inode_alloc,
    .destroy_inode = aufs_inode_free,
    .put_super = aufs_put_super,
};

We’ve added a pair of pointers for the aufs_inode_alloc and the aufs_inode_free functions. They are specific functions to allocate and deallocate the inode. That’s when we meet SLAB (we’ve actually met it in the form of kmalloc) and RCU (just a little bit).

Thus, we’ll start deallocating the memory for the index node from defining another structure – representing the index node in the memory (the way it was with the super block):

struct aufs_inode
{
    struct inode        ai_inode;
    uint32_t    ai_block;
};

This time we’re going to use inheritance instead of composition. Inheritance in C is not difficult (which is no surprise, since C doesn’t support it). The first field of aufs_inode structure is the basic structure (basic class) — inode structure. So we can use the pointer to aufs_inode as a pointer to inode, and vice versa (of course, if we know for sure that the given pointer refers exactly to aufs_inode).

In contrast to composition, «inheritance» doesn’t require support from the kernel side. Besides, it’s more advantageous from the point of view of the number of memory allocations. Instead of two, just one allocation is required for each index node (the way it was with the super block). As against the super block, inode contains all the necessary fields. Since our file system stores data on the disk easily, it’s rather an exception than a rule.

In order to allocate the memory for index nodes we’re going to utilize SLAB allocator. SLAB allocator is the caching allocator. It allows allocating memory blocks of the same size. You might have guessed that thanks to this restriction we can simplify memory control and speed up memory allocation. SLAB allocator acquires big memory chunks from the operational system. It allocates small areas from them on the request. Therefore, requests to the manager of OS memory take place less frequently. While users requests are satisfied faster.

Initially, the gain in the speed of memory allocation when using SLABs wasn’t just due to simpler memory control. It was also reached thanks to reducing the cost of the memory initialization. Really, we often use SLAB allocator not to simply allocate objects of the same size. We use it to allocate the objects of the same type. This allows us to skip initialization of some fields when allocating again the same memory area. For instance, mutex, spin locks and other similar objects are most likely to have the “proper” value when deallocating the block. Therefore, they don’t need the repeated initialization when reallocated. Please refer to the original article in order to learn the details and results.

For the moment there are three different types of allocators in Linux — SLAB, SLUB and SLOB. Let’s not get into details of their differences since they provide the same interface. Anyway, we’re going to use the following function in order to create a SLAB allocator:

int aufs_inode_cache_create(void)
{
    aufs_inode_cache = kmem_cache_create("aufs_inode",
                sizeof(struct aufs_inode),
                0, (SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD),
                aufs_inode_init_once);

    if (aufs_inode_cache == NULL)
        return -ENOMEM;

    return 0;
}

When creating a SLAB, we pass function the name, the object size, the function of initialization (the function which will be called once only, when the object is allocated for the first time) and few more parameters to kmem_cache_create. For those interested I can say that creating SLABs for index nodes looks the same in all file systems and their differences are inessential.

We’re going to call aufs_inode_cache_create function when loading the module, before we register the file system in the kernel. There’s also a pair function we’ll call when unloading the module:

void aufs_inode_cache_destroy(void)
{
    rcu_barrier();
    kmem_cache_destroy(aufs_inode_cache);
    aufs_inode_cache = NULL;
}

kmem_cache_destroy function destroys the SLAB allocator. You should deallocate all objects from this cache by the moment of destruction. Otherwise we’ll get an error message in the system log together with a lot of other troubles.

Now, let’s get down to the promised RCU contact. In short, RCU is a popular synchronization (and also safe memory reclamation for lock-free algorithms) mechanism in the kernel. RCU maintainer in Linux kernel wrote a book, in which he touched upon his brainchild as well.

Among all RCU functions, we should deal only with rcu_barrier (as well as with kfree, here is another implementation of this function). To cut a long story short, this function will wait till all time-delayed operations on the protected RCU data finish. Then it will return the control to the one having called it. So this function is the blocking one. We’ll see below, what it’s used for.

But let’s get back to memory allocation and consider the mentioned above function:

struct inode *aufs_inode_alloc(struct super_block *sb)
{
    struct aufs_inode *const i = (struct aufs_inode *)
                kmem_cache_alloc(aufs_inode_cache, GFP_KERNEL);

    if (!i)
        return NULL;

    return &i->ai_inode;
}

It uses the created earlier SLAB allocator (with the help of one of kmem_cache_alloc implementations) and returns the pointer to inode. Nothing irregular. But function of memory reclamation is a lot more interesting:

void aufs_inode_free(struct inode *inode)
{
    call_rcu(&inode->i_rcu, aufs_free_callback);
}

That’ when we face RCU again. We should say a few words about lock-free algorithms. The problem of such algorithms is that without locks there are no guarantees that the object isn’t being used in-parallel by some other thread of execution. Therefore, we can’t deallocate the memory occupied by this object as another thread can store a pointer to it. That’s why we should think of strategies of safe memory reclamation in lock-free algorithms. Meanwhile, RCU provides facilities to solve this problem. All implementations of call_rcu function delay the execution of some function (in our case aufs_free_callback reclamation function) till the moment it’s safe to do it. The mentioned above rcu_barrier waits for completion of all delayed functions.

Tired? It’s okay, we’re almost finished. Now we’ll read the index node from the disk. I wrote the mentioned aufs_inode_get function for that purpose:

struct inode *aufs_inode_get(struct super_block *sb, uint32_t no)
{
    struct aufs_super_block const *const asb = AUFS_SB(sb);
    struct buffer_head *bh = NULL;
    struct aufs_disk_inode *di = NULL;
    struct aufs_inode *ai = NULL;
    struct inode *inode = NULL;
    uint32_t block = 0, offset = 0;

    inode = iget_locked(sb, no);
    if (!inode)
        return ERR_PTR(-ENOMEM);

    if (!(inode->i_state & I_NEW))
        return inode;

    ai = AUFS_INODE(inode);
    block = aufs_inode_block(asb, no);
    offset = aufs_inode_offset(asb, no);

    pr_debug("aufs reads inode %u from %u block with offset %u\n",
                (unsigned)no, (unsigned)block,
                (unsigned)offset);

    bh = sb_bread(sb, block);
    if (!bh)
    {
        pr_err("cannot read block %u\n", (unsigned)block);
        goto read_error;
    }

    di = (struct aufs_disk_inode *)(bh->b_data + offset);
    aufs_inode_fill(ai, di);
    brelse(bh);

    unlock_new_inode(inode);

    return inode;

read_error:
    pr_err("aufs cannot read inode %u\n", (unsigned)no);
    iget_failed(inode);

    return ERR_PTR(-EIO);
}

I should begin my explanations from the simple things. The AUFS_SB and AUFS_INODE functions allow us to get a pointer for the aufs_super_block and the aufs_inode, via pointers to super_block and inode accordingly. Since I’ve already described the way they’re connected, I’m not going to provide their code (it’s quite simple).

aufs_inode_block and aufs_inode_offset functions allow us get the block number and the shift inside the block by the index node number. It’s no magic, just a simple arithmetic.

Now let’s take a look at iget_locked and unlock_new_inode functions. As well as in case with blocks the kernel supports the cache of inodes. We need it not only to avoid reading the index node from the disk once more. The point is that the same file/catalog can be used by several processes at the same time. In this case all of them should operate on one inode exemplar, so that we could synchronize them. We can apply it to blocks as well.

Thus, idet_locked function first of all looks for an inode in the cache and allocates memory for the new one if the inode isn’t found. If we’ve allocated the index node once more but haven’t found it in the cache, we’ll set I_NEW flag in i_state field. We’ll also hold the spin lock of this node (i_lock field). That’s why our function firstly checks the i_state field. If I_NEW flag is dropped then we simply return the cached inode. Otherwise we should fill inode, by reading the necessary block from the disk (by using sb_bread).

aufs_inode_fill function is responsible for filling:

static void aufs_inode_fill(struct aufs_inode *ai,
            struct aufs_disk_inode const *di)
{
    ai->ai_block = be32_to_cpu(di->di_first);
    ai->ai_inode.i_mode = be32_to_cpu(di->di_mode);
    ai->ai_inode.i_size = be32_to_cpu(di->di_size);
    ai->ai_inode.i_blocks = be32_to_cpu(di->di_blocks);
    ai->ai_inode.i_ctime.tv_sec = be64_to_cpu(di->di_ctime);
    ai->ai_inode.i_mtime.tv_sec = ai->ai_inode.i_atime.tv_sec =
                ai->ai_inode.i_ctime.tv_sec;
    ai->ai_inode.i_mtime.tv_nsec = ai->ai_inode.i_atime.tv_nsec =
                ai->ai_inode.i_ctime.tv_nsec = 0;
    i_uid_write(&ai->ai_inode, (uid_t)be32_to_cpu(di->di_uid));
    i_gid_write(&ai->ai_inode, (gid_t)be32_to_cpu(di->di_gid));
}

i_uid_write and i_gid_write functions assign values to the appropriate fields.

I’d also like to draw your attention to the time representation in the form of timespec structure. It consists of two numbers – the number of seconds and nanoseconds. So we can potentially store the time for a long time.

In the end, we should deallocate the spin lock and return the pointer. We’ll use unlock_new_inode function for that purpose.

Instead of the Summary

The post is really big and even so it doesn’t cover everything. I just tried to explain the key parts of implementation.

You can find all the source code here. There are two folders in the repository now — kern and user. One of them stores the code of our module. The second one stores the code of the utility used for formatting. The code is longer. Therefore, the probability of errors in it is higher. You’re most welcome to write your comments and useful criticism, as well as pull requests.

In order to get the disk image for arrangement, you can do the following:

dd bs=1M count=100 if=/dev/zero of=image
./mkfs.aufs ./image

Now you can use image file the way it’s used in the previous post.

I’ve omitted some debugger output from the code. But if you’re going to use the code from the repository, you’ll be able to make sure that the module operates utilizing dmesg command.

Subscribe to Kukuruku Hub

Or subscribe with RSS

0 comments

Read Next