A modified kernel, a custom system - this can lead to the kernel not being able to boot properly. What to do in such a case? Usually we can start by gathering as much information as possible to locate the underlying problem. A few quite basic techniques are enough to achieve this.
When working with a system that is not prepared for more sophisticated debugging, it's best to just see what the kernel says and deduce which part causes the system to halt. In most cases, if a display device is present and configured, we should be able to see the kernel messages on it - provided, of course, that the respective kernel config options are set (in this case, CONFIG_VGA_CONSOLE or CONFIG_FRAMEBUFFER_CONSOLE).
The case is different when a display is not present. We can use either a serial console or netconsole here, whichever is available. The easiest approach is a serial console. We just need to be sure that our kernel configuration includes all the necessary entries, such as CONFIG_SERIAL_CORE, CONFIG_SERIAL_CORE_CONSOLE and the respective serial drivers (e.g. CONFIG_SERIAL_8250 and CONFIG_SERIAL_8250_CONSOLE in the case of an 8250 UART chip). We then just append the console=ttyS[console number],[baud rate] parameter to the CONFIG_CMDLINE configuration and we're ready to go.
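Before rebuilding, it can save time to check that the options above really ended up built in. The following is only a minimal host-side sketch in Python: the helper name missing_options is made up for illustration, and it assumes that only "=y" counts as enabled, since a console driver built as a module is of little use during early boot.

```python
import re

# Options named in the text for an 8250-based serial console.
SERIAL_CONSOLE_OPTIONS = [
    "CONFIG_SERIAL_CORE",
    "CONFIG_SERIAL_CORE_CONSOLE",
    "CONFIG_SERIAL_8250",
    "CONFIG_SERIAL_8250_CONSOLE",
]


def missing_options(config_text, required=SERIAL_CONSOLE_OPTIONS):
    """Return the required options that are not set to 'y' in a .config.

    Hypothetical helper: options set to 'm' or left unset are both
    reported as missing (an assumption made for this sketch).
    """
    enabled = set()
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=y$", line.strip())
        if match:
            enabled.add(match.group(1))
    return [opt for opt in required if opt not in enabled]


# Example with an inline .config fragment instead of a real file.
sample_config = """\
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_8250=m
# CONFIG_SERIAL_8250_CONSOLE is not set
"""
for option in missing_options(sample_config):
    print("not built in:", option)
```

For a real tree, the same function can be fed the contents of the .config file from the kernel build directory.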
In some cases, however, the kernel halts even before we can see any actual output, for instance before the console driver or the video device is set up. In this case, we might get lucky with the so-called earlyprintk mechanism. The Linux kernel has a feature that lets it output messages directly to the serial console or the VGA buffer even before the real console code is initialized. This feature can be enabled by setting the CONFIG_EARLY_PRINTK option in the kernel config and additionally providing an earlyprintk= parameter in the boot arguments. Its value can be either vga or ttyS0/ttyS1 (with the baud rate added as necessary). After the real console is initialized, the earlyprintk console is disabled by default - but if you want, you can keep it running by appending a ,keep argument to the earlyprintk parameter. Most of the time, this is not needed.
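To make the syntax of the two parameters concrete, here is a small Python sketch that just assembles the strings. The function names are made up for illustration, and the port number and the baud rate of 115200 are example values, not something your board necessarily uses.

```python
def console_arg(port=0, baud=115200):
    """Build the console= argument, e.g. console=ttyS0,115200."""
    return "console=ttyS%d,%d" % (port, baud)


def earlyprintk_arg(target="ttyS0", baud=None, keep=False):
    """Build the earlyprintk= argument described in the text.

    target is 'vga', 'ttyS0' or 'ttyS1'; a baud rate only makes
    sense for the serial targets, so it is ignored for 'vga'.
    """
    parts = [target]
    if baud is not None and target != "vga":
        parts.append(str(baud))
    if keep:
        # Keep the early console alive after the real console starts.
        parts.append("keep")
    return "earlyprintk=" + ",".join(parts)


# Example boot arguments for a first serial port (assumed values).
cmdline = " ".join([console_arg(0, 115200), earlyprintk_arg("ttyS0", 115200)])
print(cmdline)
```

The resulting string is what would go into CONFIG_CMDLINE or into the boot arguments passed by the bootloader.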
This can give us a good overview of where the problem lies. There are also some flags and kernel command-line parameters that can aid us in debugging certain features, e.g. initcall_debug, which makes initcall execution more verbose. This can help a bit when the kernel hangs and we have trouble locating the source of the problem.
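With initcall_debug enabled, the kernel prints a "calling" line before each initcall and an "initcall ... returned" line after it, so an initcall that hangs shows up as a "calling" line without a matching return. A captured console log can be scanned for that with a few lines of Python. This is only a sketch: the function name unfinished_initcalls and the sample log, including my_driver_init, are made up for illustration.

```python
import re

# Line formats printed by the kernel when initcall_debug is set.
CALLING = re.compile(r"calling\s+(\S+)\s+@\s+\d+")
RETURNED = re.compile(r"initcall\s+(\S+)\s+returned")


def unfinished_initcalls(log_text):
    """Return initcalls that were started but never reported a return."""
    pending = []
    for line in log_text.splitlines():
        done = RETURNED.search(line)
        if done:
            if done.group(1) in pending:
                pending.remove(done.group(1))
            continue
        started = CALLING.search(line)
        if started:
            pending.append(started.group(1))
    return pending


# Hypothetical log captured from the serial console before the hang.
sample_log = """\
[    0.512345] calling  pci_init+0x0/0x40 @ 1
[    0.512401] initcall pci_init+0x0/0x40 returned 0 after 12 usecs
[    0.512450] calling  my_driver_init+0x0/0x80 @ 1
"""
print(unfinished_initcalls(sample_log))
```

Whatever is left in the returned list is a good first suspect for the hang.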
More useful parameters can be found in Documentation/kernel-parameters.txt in the kernel source.
My usual way of localizing a problem quickly is plain "print it!" debugging: placing printk() calls around suspicious kernel areas. Early printk helps here as well.
If we know that the kernel itself is fine and the problems probably appear during or right after rootfs setup, we can also try preparing a small initramfs to include in our image instead. An initramfs is a file-system image that resides directly in the kernel image and is loaded into RAM at boot time. With the tools available there, we can then try hacking on the real rootfs manually. Busybox is a good choice for a fast, lightweight and working environment for the RAM file-system. To include an initramfs in our image, we need to set the CONFIG_BLK_DEV_INITRD config option and set CONFIG_INITRAMFS_SOURCE to point either to the directory to be included or to the .cpio archive with our prepared RAM rootfs. We also need to specify whether the initramfs should be compressed or not, setting the relevant options as needed. CONFIG_INITRAMFS_SOURCE also accepts files containing specifications of the files, directories and device nodes to be created in the initramfs while the kernel image is built. More about this can be found in Documentation/filesystems/ramfs-rootfs-initramfs.txt.
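The specification file accepted by CONFIG_INITRAMFS_SOURCE is a plain list of dir, nod, file and slink entries. The Python sketch below generates a minimal one around Busybox. It rests on several assumptions: the function name initramfs_spec and the path to the busybox binary are placeholders, the binary is assumed to be statically linked, and pointing /init at Busybox assumes its init applet is compiled in.

```python
def initramfs_spec(busybox_path, applets=("sh", "mount", "ls")):
    """Return an initramfs file list built around a single Busybox binary.

    busybox_path is the location of the binary on the build host
    (a placeholder in this sketch, assumed to be statically linked).
    """
    lines = [
        # dir <name> <mode> <uid> <gid>
        "dir /dev 0755 0 0",
        "dir /proc 0755 0 0",
        "dir /sys 0755 0 0",
        "dir /bin 0755 0 0",
        # nod <name> <mode> <uid> <gid> <type> <major> <minor>
        "nod /dev/console 0600 0 0 c 5 1",
        # file <name> <location on host> <mode> <uid> <gid>
        "file /bin/busybox %s 0755 0 0" % busybox_path,
    ]
    for name in applets:
        # slink <name> <target> <mode> <uid> <gid>
        lines.append("slink /bin/%s busybox 0777 0 0" % name)
    # The kernel runs /init from the initramfs; here it is Busybox itself.
    lines.append("slink /init bin/busybox 0777 0 0")
    return "\n".join(lines) + "\n"


print(initramfs_spec("/path/to/busybox"), end="")
```

The printed text can be saved to a file and that file given as the value of CONFIG_INITRAMFS_SOURCE.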
Another useful tool is the SysRq magic key. If we configure our kernel with the CONFIG_MAGIC_SYSRQ option, we can use a dedicated key combination to command the kernel regardless of what it is currently doing (most of the time). The key combination varies from architecture to architecture, but from experience I know that it is usually the same ALT + SysRq + [command key] set. On some keyboards the SysRq key is also known as the Print Screen button. If you're working on a serial console, you can try sending the combination through the terminal, raw; on PC-style serial ports this is typically a BREAK followed by the command key.
Most useful commands:
The SysRq mechanisms are very well documented in the Documentation/sysrq.txt file.
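When the system is still alive enough to run a shell as root, the same commands can also be issued without a keyboard by writing the command key to /proc/sysrq-trigger. A minimal Python sketch, with a made-up helper name; for example 't' dumps the task list, 'w' shows blocked tasks, 's' syncs the disks, 'u' remounts file-systems read-only and 'b' reboots immediately.

```python
def trigger_sysrq(command, path="/proc/sysrq-trigger"):
    """Send a single SysRq command key to the kernel.

    Needs root on a real system. The path argument exists only so
    the helper can be tried against an ordinary file first.
    """
    if len(command) != 1:
        raise ValueError("a SysRq command is a single character")
    with open(path, "w") as trigger:
        trigger.write(command)


# Usage on the target, as root (not executed here):
#   trigger_sysrq("t")   # dump the current task list to the kernel log
#   trigger_sysrq("s")   # sync all mounted file-systems
```

The output of commands such as 't' or 'w' lands in the kernel log, so it shows up on the console we set up earlier.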
If we encounter a kernel Oops or, even worse, a kernel panic - it is also nice to know how to dig as much information as possible out of such a crash. When working on a remote device it is also wise to add the panic=[timeout] kernel parameter to our arguments, with the timeout given in seconds. This way, when the kernel panics, the device will try to restart itself after the set timeout, giving us access to the bootloader without performing a hard reset. We can set it to a bigger value to still be able to analyze the crash log.
As for handling the actual crash, the kernel documentation again has a very nicely written guide to Oops handling. Check it out in Documentation/oops-tracing.txt in the source code.
There are cases in which all these methods are useless, and we need something more sophisticated and/or low-level. When such a need arises, we can try our luck with either kgdb-gdb debugging or Linux/gdb-aware JTAG hardware. But I will try to cover these some other time.
Most of the time printk's (and early printk's) will help in finding the problem. Sometimes some disassembly is necessary - for instance a closer look at some parts of the vmlinux image or, specifically, at particular object files composing the bootable image. Kernel debugging is usually like crime-solving. It takes much effort, clue-searching, time and thinking. And, as with crime-solving, sometimes you might simply fail. But one must try not to get discouraged. If all ideas have already been used up - take your time, switch context, and return with a fresh mindset after a while. This helps.