sil2100//vx developer log


Kernel sysctl

More basics. The Linux kernel offers an interface for browsing and modifying system parameters, most of them kernel related. This interface is called sysctl. In Linux, sysctl variables are exposed to the user as ordinary, editable files in a virtual filesystem under the /proc/sys directory, and can also be accessed with the sysctl application. In today's post I would like to concentrate on how to create sysctl variables in kernel code.
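
To make the user-side view concrete, here is a minimal user-space sketch of how a dotted sysctl name maps onto a file under /proc/sys and how its value can be read. The helper names sysctl_path() and sysctl_read() are made up for this post, and the mapping ignores the corner case of dots inside a single path component.

```c
#include <stdio.h>
#include <string.h>

/* Map a dotted sysctl name ("kernel.ostype") to its /proc/sys path.
 * Returns 0 on success, -1 if the result does not fit in out. */
static int sysctl_path(const char *name, char *out, size_t size)
{
    int n = snprintf(out, size, "/proc/sys/%s", name);

    if (n < 0 || (size_t)n >= size)
        return -1;
    /* Only the part after the "/proc/sys/" prefix is rewritten. */
    for (char *p = out + strlen("/proc/sys/"); *p; p++)
        if (*p == '.')
            *p = '/';
    return 0;
}

/* Read the first line of a sysctl variable into buf, newline stripped.
 * Returns 0 on success, -1 on any failure. */
static int sysctl_read(const char *name, char *buf, size_t size)
{
    char path[256];
    FILE *f;

    if (size == 0 || sysctl_path(name, path, sizeof(path)) != 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)size, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

On a Linux box, calling sysctl_read("kernel.ostype", buf, sizeof(buf)) should open /proc/sys/kernel/ostype, the same file the sysctl application reads for that name.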



As with most Linux kernel related topics, the internal mechanisms and definitions vary from version to version. Most of my knowledge comes from the implementation in the 2.6.31 series kernels, but I also have some experience with the more recent 2.6.34 series, and some changes were made somewhere between those two releases.

In short - kernel code and modules can export internal variables and data through either sysctl or procfs. Usually the proc interface is used for read-only variables and internally structured data, whereas sysctl is typically used for read/write operations on short, unstructured data. Both have a hierarchical, tree-like structure in which directories can be used to organize variables.
sysctl variables are defined in code by ctl_table structures. Each sysctl entry (file or directory) has its own ctl_table object representing it. The ctl_table structures need to be grouped into arrays, each representing part of one level of the directory tree (e.g. the contents of the /proc/sys/dev directory). Each array must be terminated by an empty ctl_table entry, i.e. one whose name (procname) is NULL and whose ID (ctl_name, present only in 2.6.31 and similar) is 0.
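
The termination rule is easiest to see on a small array. The following is a user-space sketch only: struct ctl_table here is a cut-down stand-in for the kernel structure (real kernel code gets it from <linux/sysctl.h>), and the ala_* variables and ctl_table_count() are hypothetical names invented for illustration.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct ctl_table (2.6.31 layout), reduced to
 * the fields this example touches. */
struct ctl_table {
    int ctl_name;
    const char *procname;
    void *data;
    int maxlen;
    unsigned short mode;
};

/* Hypothetical driver variables exported by the table. */
static int ala_speed = 100;
static int ala_debug;

/* One directory level: two variables, then the empty terminator. */
static struct ctl_table ala_table[] = {
    { .procname = "speed", .data = &ala_speed,
      .maxlen = sizeof(int), .mode = 0644 },
    { .procname = "debug", .data = &ala_debug,
      .maxlen = sizeof(int), .mode = 0644 },
    { 0 }   /* ctl_name == 0 and procname == NULL end the array */
};

/* Count the entries in front of the terminator. */
static size_t ctl_table_count(const struct ctl_table *table)
{
    size_t n = 0;

    for (; table->ctl_name || table->procname; table++)
        n++;
    return n;
}
```

Any code that walks such an array relies on that last empty entry, so forgetting it means the walk runs past the end of the array.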

/* 2.6.31 */
struct ctl_table {
    int ctl_name;               /* Binary ID, not present in later versions */
    const char *procname;       /* Text ID for /proc/sys, or zero */
    void *data;                 /* Buffer used by the predefined handlers */
    int maxlen;                 /* Size of the data buffer */
    mode_t mode;                /* Access rights of the entry */
    struct ctl_table *child;    /* Non-NULL for directories */
    struct ctl_table *parent;   /* Automatically set */
    proc_handler *proc_handler; /* Callback for text formatting */
    ctl_handler *strategy;      /* Callback function for all r/w, not present in later versions */
    void *extra1;
    void *extra2;
};

The procname field defines the name of the variable, and ctl_name its ID within the current directory. The data and maxlen fields can be used by some handling functions, while mode defines the basic access rights to the variable. There are also the extra1 and extra2 fields, available for any extra data you may need (some predefined handlers use them too, e.g. proc_dointvec_minmax() reads its bounds from them). The important fields are child and proc_handler. When the child pointer is not NULL, the entry in question is a directory and proc_handler is ignored. The child pointer then points to another ctl_table array defining the contents of the subdirectory. A directory at a given path can be defined by more than one ctl_table array.
The proc_handler function pointer points to the function that is called whenever I/O is performed on the sysctl variable.

typedef int proc_handler (struct ctl_table *ctl, int write, struct file * filp,
              void __user *buffer, size_t *lenp, loff_t *ppos);

There is a set of predefined routines for basic operations on variables that can be used as a proc_handler. These include proc_dostring() for reading/writing a string and proc_dointvec() for reading/writing one or more integers, as well as a few other variants of the latter function. When these functions are used, the data and maxlen fields come into play: data points to the kernel buffer holding the variable, and maxlen gives the size of that buffer in bytes.
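
To show how data and maxlen bound what a handler may touch, here is a simplified user-space model of the write side of an integer-vector handler. It is not the kernel's proc_dointvec(): model_dointvec_write(), the reduced struct ctl_table and the ala_limits variable are all hypothetical, and the real function additionally deals with user-space buffers, offsets and error codes.

```c
#include <stdlib.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct ctl_table. */
struct ctl_table {
    const char *procname;
    void *data;
    int maxlen;
};

/* Parse whitespace-separated integers from text into table->data,
 * storing at most maxlen / sizeof(int) values.
 * Returns the number of values stored. */
static size_t model_dointvec_write(struct ctl_table *table, const char *text)
{
    int *vec = table->data;
    size_t max = (size_t)table->maxlen / sizeof(int);
    size_t n = 0;

    while (n < max) {
        char *end;
        long v = strtol(text, &end, 10);

        if (end == text)    /* no further number in the input */
            break;
        vec[n++] = (int)v;
        text = end;
    }
    return n;
}

/* Hypothetical three-element variable and the entry describing it. */
static int ala_limits[3];
static struct ctl_table ala_limits_entry = {
    .procname = "limits",
    .data = ala_limits,
    .maxlen = sizeof(ala_limits),
};
```

Feeding the model "10 20 30 40" stores only the first three values, because maxlen says the buffer has room for three integers.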
A kernel programmer can also define a custom proc_handler function. In this case, the write parameter indicates whether the operation is a read (write == 0) or a write (write is non-zero). The buffer pointer points to the user-space buffer with the data being read/written. The lenp parameter points to the size of the user buffer holding (or to be used for holding) the data, and ppos points to the current offset from the beginning of the variable's sysctl file. These two are pointers so that the handler can update them while handling the operation.
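
The roles of write, lenp and ppos can be sketched with a user-space model of a string handler. Everything here is an assumption made for illustration: model_string_handler() and the reduced struct ctl_table are hypothetical, long long stands in for loff_t, and memcpy() stands in for copy_to_user()/copy_from_user(), which a real handler must use instead.

```c
#include <string.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct ctl_table. */
struct ctl_table {
    const char *procname;
    void *data;     /* NUL-terminated string buffer */
    int maxlen;     /* size of that buffer in bytes */
};

/* Model of a string handler using the newer signature without filp. */
static int model_string_handler(struct ctl_table *table, int write,
                                void *buffer, size_t *lenp, long long *ppos)
{
    char *value = table->data;

    if (table->maxlen <= 0 || *ppos < 0)
        return -1;

    if (write) {
        size_t n = *lenp;

        if (n >= (size_t)table->maxlen)     /* keep room for the NUL */
            n = (size_t)table->maxlen - 1;
        memcpy(value, buffer, n);
        value[n] = '\0';
        *ppos += (long long)*lenp;          /* whole input is consumed */
        return 0;
    }

    size_t len = strlen(value);

    if ((size_t)*ppos >= len) {             /* nothing left: end of file */
        *lenp = 0;
        return 0;
    }
    size_t n = len - (size_t)*ppos;

    if (n > *lenp)
        n = *lenp;
    memcpy(buffer, value + *ppos, n);
    *lenp = n;                  /* bytes actually transferred */
    *ppos += (long long)n;      /* the next read continues from here */
    return 0;
}
```

Reading a four-character value through a three-byte buffer takes two calls: the first returns three bytes and moves the offset to 3, the second returns the last byte, and a third call reports a length of 0.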

So, what changed between 2.6.31 and 2.6.34? As noted in the comments, the ctl_name and strategy fields have been removed. I never used the strategy field before, but it seems it was used to optionally initialize and format data before display or storage. The proc_handler functions no longer take the filp parameter either. No big changes really. At least I didn't notice anything else of interest.
The ctl_name field had indeed been of little use for a long time. Most variables used CTL_UNNUMBERED as the ctl_name since they did not need a unique ID. It was occasionally useful though, for instance when one proc_handler served many sysctl variables that were then told apart by ctl_name, but the 'extra' fields can be used for that now. Or even a strcmp() on the procname field.
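
As a rough sketch of the 'extra' field approach, one shared routine can look at extra1 to find out which variable it is serving. This is a user-space model with hypothetical names (ala_store(), the ala_var enum, the id_* variables); in a real handler the same switch would sit inside the proc_handler, and the failures would be kernel error codes rather than -1.

```c
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct ctl_table. */
struct ctl_table {
    const char *procname;
    void *data;
    int maxlen;
    void *extra1;   /* used here as a pointer to the variable's ID */
};

enum ala_var { ALA_SPEED, ALA_DEBUG };

static int ala_speed = 100, ala_debug;
static int id_speed = ALA_SPEED, id_debug = ALA_DEBUG;

static struct ctl_table ala_vars[] = {
    { .procname = "speed", .data = &ala_speed,
      .maxlen = sizeof(int), .extra1 = &id_speed },
    { .procname = "debug", .data = &ala_debug,
      .maxlen = sizeof(int), .extra1 = &id_debug },
    { .procname = NULL }
};

/* Shared store routine: validation depends on which entry was hit. */
static int ala_store(struct ctl_table *table, int value)
{
    int *slot = table->data;

    switch (*(int *)table->extra1) {
    case ALA_SPEED:
        if (value < 0 || value > 1000)  /* reject out-of-range speeds */
            return -1;
        break;
    case ALA_DEBUG:
        value = !!value;                /* normalise to 0 or 1 */
        break;
    default:
        return -1;
    }
    *slot = value;
    return 0;
}
```

The same idea works with a pointer to a richer per-variable descriptor in extra1 if an integer ID is not enough.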

But how are these variables positioned in the sysctl tree? The function register_sysctl_table() needs to be called for the top-level ctl_table array. The root of all sysctl variables is /proc/sys. From there, you need to provide ctl_table arrays for all directories in the path: e.g. if we want to have a variable accessible at /proc/sys/dev/ala0/name, the ctl_table arrays for dev and ala0 need to be created and linked with each other using the child fields. If no other kernel code has already defined a given directory, it is created in the virtual filesystem.
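
The linking for /proc/sys/dev/ala0/name could look roughly like the three arrays below. This is again a user-space sketch: struct ctl_table is a reduced stand-in in the newer layout (no ctl_name), and ctl_lookup() is a hypothetical helper that only models how a path is resolved through the child pointers. In a real module, root_table would simply be passed to register_sysctl_table().

```c
#include <string.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct ctl_table. */
struct ctl_table {
    const char *procname;
    void *data;
    int maxlen;
    unsigned short mode;
    struct ctl_table *child;
};

static char ala0_name[16] = "ala0";

/* Contents of /proc/sys/dev/ala0/ */
static struct ctl_table ala0_table[] = {
    { .procname = "name", .data = ala0_name,
      .maxlen = sizeof(ala0_name), .mode = 0444 },
    { .procname = NULL }
};

/* Contents of /proc/sys/dev/ (only the part this code adds) */
static struct ctl_table dev_table[] = {
    { .procname = "ala0", .mode = 0555, .child = ala0_table },
    { .procname = NULL }
};

/* Top level, relative to /proc/sys */
static struct ctl_table root_table[] = {
    { .procname = "dev", .mode = 0555, .child = dev_table },
    { .procname = NULL }
};

/* Resolve a path relative to /proc/sys, e.g. "dev/ala0/name". */
static struct ctl_table *ctl_lookup(struct ctl_table *table, const char *path)
{
    while (table) {
        size_t len = strcspn(path, "/");
        struct ctl_table *e;

        for (e = table; e->procname; e++)
            if (strlen(e->procname) == len &&
                !strncmp(e->procname, path, len))
                break;
        if (!e->procname)
            return NULL;        /* no such entry at this level */
        if (path[len] == '\0')
            return e;           /* last component matched */
        path += len + 1;        /* step past the '/' */
        table = e->child;       /* NULL for a file, which ends the loop */
    }
    return NULL;
}
```

Note that dev_table only lists ala0: whatever else lives in /proc/sys/dev is defined by other arrays registered for the same directory.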

The sysctl interface is one of the recommended ways of exchanging data between the user and the kernel. Just remember to always use copy_from_user()/copy_to_user() when writing your own proc_handler functions! I must say that I like the way sysctl configurations are created, accessed and exported - it reminds me of Flatconf somehow... But more about this in the near future.