Kernel Traffic #298 For 6 Mar 2005

There were 623 different contributors. 238 posted more than once. The average length of each message was 98 lines.

The top posters of the week were:

74 posts in 556KB by Tejun Heo
66 posts in 470KB by Eric W. Biederman
62 posts in 260KB by Bartlomiej Zolnierkiewicz
53 posts in 495KB by Adrian Bunk
43 posts in 186KB by Greg KH

The top subjects of the week were:

63 posts in 303KB for "Patch 4/6 randomize the stack pointer"
44 posts in 213KB for "i8042 access timings"
39 posts in 228KB for "[PATCH] OpenBSD Networking-related randomization port"
38 posts in 245KB for "[PATCH] Dynamic tick, version 050127-1"
36 posts in 144KB for "[PROPOSAL/PATCH] Remove PT_GNU_STACK support before 2.6.11"

18 Jan 2005 - 4 Feb 2005 (99 posts) Archive Link: "[PATCH 0/29] overview"

This patchset is a major refresh of the kexec on panic functionality in the kernel, the primary aim of which was to take the requirements capture of the kernel crashdump patches and start integrating the functionality cleanly into the kexec patches.

Major accomplishments:

Compat syscall support has been added.
The crashdump capture code has been separated from the kexec on panic code.
The kernel to jump to on panic is now loaded in place.
A long standing bug that allowed 2 source pages to copy data to a single destination page has been caught and fixed.
Support for loading an x86_64 kernel in a reserved area of memory has been completed.

The crashdump code is currently slightly broken. I have attempted to minimize the breakage so things can quickly be made to work again.

With respect to a final design discussion there are two remaining open issues. The first is how little hardware shutdown we can get away with in the kernel that is panicking. I believe we can reduce this to a simple NMI to the other cpus telling them to stop. This has been addressed as a major concern in previous conversations.

The second issue is the most significant with respect to the design of a kernel-based crash dump capture implementation. How does the crashdump capture process discover relevant information about the kernel that just crashed? There are two options.

As represented by the current crashdump patches, the crashdump kernel and the kernel in which it loads are kept in sync so that it has up-to-date versions of all of the crashed kernel's data structures because it is built from the same source. So it only needs to find the address of the data structures it would like to look at.
The relevant information, if it is available when sys_kexec_load is called, is exported to user space, or the machine_crash_shutdown method marshals what little information must be captured when the machine dies in a well-known standard format (most likely ELF notes), allowing the crashdump capture process to simply pass on the information or utilize it as appropriate.

If the second method can successfully represent all of the interesting information then we can allow kernel version skew, between the two kernels, and potentially implement the entire crash dump capture process in user space.

As best as I have been able to discover, the interesting information includes: the cpu state (registers) at the time of the crash/panic; the list of memory regions the kernel that has crashed was using; and potentially the list of pages dedicated to kernel data as opposed to user space, so that people with insane amounts of memory (1TB+) don't require unmanageably large core files.

He quoted an earlier message by Andrew Morton, in which Andrew had said, "I don't want us to be in a position of merging all that code and then finding out that it cannot be made to work "sufficiently well", forcing us to revert it and find a new crashdump solution. You guys know far better than I when we will reach that threshold. If the kexec/dump developers can say "yup, this is going to work (because X)" then I'm happy." Eric now offered:

So here is my subjective view.

This code needs to sit in a development tree for a little while to shake out whatever bugs still linger from my massive refactoring.
Through the kexec patches the code and design appear to be sound. Given that machine_kexec is little more than a jump there are few possible implementations that will be able to use it. The only exception I can see is running special dump drivers from the kernel that crashed, and I believe no one thinks that will work well.
Once we finish sorting out the best way to get information out of the kernel that crashed I think we will have a complete architecture that is largely portable to any architecture.

In the interests of full disclosure my main interest is using the kernel as a bootloader for other kernels and that has been working fairly well for years now :)

We have started making changes to get crashdump up and running again. Following are a few identified items to be done.

1) Reserve the backup region (640k) during kernel bootup.
2) Copy the data to backup region during crash. (Moved to kexec user space code, patch posted in separate mail.)
3) Prepare elf headers while loading kexec panic kernel and store in reserved memory area.
4) Pass required information to crashdump kernel, which parses it and exports through /proc/vmcore. (May be user space utility, open to discussion.)

Following patch implements item 1) in the list. Soon we shall be rolling out the patches for rest.

In going over some of the implementation details, Eric found a number of problems with Vivek's patch; for awhile it seemed the discussion would descend into confusion, when Eric felt Vivek was only producing minimal changes in response to Eric's design suggestions. This had not been Vivek's intention, however, and they soon were 'back on the same page', as Eric put it. Vivek described the new design, saying, "The whole idea is that Crash image is represented in ELF Core format. These ELF Headers are prepared by kexec-tools user space and put in one segment. Address of start of image is passed to the capture kernel(or user space) using one command line (eg. crashimage=). Now either kernel space or user space can parse the elf headers and extract required information and export final kernel elf core image." He went on:

If I prepare one ELF header for each physically contiguous memory area (as obtained from /proc/iomem) instead of per zone, then the number of ELF headers will come down significantly. I don't have any idea of the number of actual physically contiguous regions present per machine, but roughly assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program headers. At 56 bytes per 64-bit program header this will amount to 70KB.

This is a worst-case estimate, and on lower-end machines this will require much less space. On machines as big as 1024 cpus, this should not be a concern, as big machines come with big RAMs.
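The arithmetic in this estimate is easy to check. The sketch below assumes the standard 64-bit ELF program header layout (two 32-bit fields followed by six 64-bit fields); reading 256 as the per-node memory regions and 1024 as the per-cpu note headers is an interpretation of the message, not something it states outright.

```python
import struct

# Elf64_Phdr layout: p_type and p_flags (32 bits each), then p_offset,
# p_vaddr, p_paddr, p_filesz, p_memsz and p_align (64 bits each).
ELF64_PHDR = struct.Struct("<IIQQQQQQ")


def phdr_table_bytes(n_headers):
    """Size in bytes of a table holding n_headers 64-bit program headers."""
    return n_headers * ELF64_PHDR.size


# Counts taken from the message: 256 + 1024 headers in the worst case.
n_headers = 256 + 1024
total = phdr_table_bytes(n_headers)
print(ELF64_PHDR.size, n_headers, total, total / 1024)
```

The header size comes out at 56 bytes, and 1280 headers at 71680 bytes, which is the 70KB quoted.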

Eric, do you still think that ELF headers are inappropriate to be passed across the interface boundary?

ELF headers can be prepared by kexec-tools in advance and put into one of the data segments. This requires following information to be available to user space.

Starting address of space reserved by kernel for notes section (crash_notes[]). Probably can be obtained from /proc/kallsyms?
NR_CPUS. Maybe sysconf(_SC_NPROCESSORS_CONF) should be sufficient.
Size of memory reserved per cpu. No clue how to get that. Any suggestions?
Maybe hard-coding something like a 1K area per cpu should be enough to address future needs?
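As a rough illustration of the user-space side of this, the sketch below queries the configured processor count through sysconf and sizes a notes area from it. The 1K-per-cpu constant is only the figure floated in the message, and the helper names are hypothetical. Note that sysconf reports the processors the running system has configured, which is not necessarily the kernel's compile-time NR_CPUS.

```python
import os

# Hypothetical: the 1K-per-cpu figure suggested in the message.
PER_CPU_NOTE_BYTES = 1024


def configured_cpus():
    """Processor count as reported by sysconf(_SC_NPROCESSORS_CONF)."""
    return os.sysconf("SC_NPROCESSORS_CONF")


def notes_area_bytes(n_cpus, per_cpu=PER_CPU_NOTE_BYTES):
    """Total space needed if every cpu gets a fixed-size note area."""
    return n_cpus * per_cpu


n_cpus = configured_cpus()
print(n_cpus, notes_area_bytes(n_cpus))
```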

Regarding Backup Region

Kexec user space does the reservation for backup region segment.
Purgatory copies the backup data to backup region. (Already implemented)
A separate elf header is prepared to represent backed up memory region. And "offset" field of this program header can contain the actual physical address where backup contents are stored.

Eric had some criticisms, but felt this was a "good place to start". Itsuro Oda asked why, in all this, the ELF format was considered necessary. Eric replied that the ELF format itself was not necessary, but the information contained within an ELF header was a match for the kind of information that needed to be used here. Therefore, Eric said, it made a good match. When Koichi Suzuki echoed Itsuro's concerns, saying, "Format conversion should be done in healthy system separately and we should restrict what to do while taking the dump as few as possible," Eric expanded:

The big part of the conversation that is happening right now is how do we uncouple dependencies between the various parts as much as possible. There is nothing here about format conversions except as to convert weird kernel formats into a stable interface.

There are 3 pieces of code interacting.

The primary kernel that will call panic.
The kernel+initrd that takes over.
The user space that sets it all up (/sbin/kexec) while the primary kernel is still in a sane state.

The goal is to make those 3 pieces as independent of each other as reasonably possible.

So the kernel+initrd that captures a crash dump will live and execute in a reserved area of memory. It needs to know which memory regions are valid, and it needs to know small things like the final register state of each cpu. For the set of valid memory regions it is the intention to encode that as an array of ELF program headers. The information of what the final register contents were will be encoded as ELF notes. There will be one PT_NOTE segment per cpu that holds the notes needed to encode a given cpu's final state. It really does not matter to the implementation that captures each cpu's final register state which format we record the data in, so using a format designed not to change is not a problem. So all that needs to be communicated to the kernel+initrd that captures a crash dump is the location of an ELF header and it can figure out all of the rest.
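To make the encoding concrete, here is a minimal sketch of such a program header table: one PT_NOTE header per cpu plus one PT_LOAD header per valid memory region. The ELF constants and the 64-bit header layout are standard; the region list, the notes base address, the per-cpu size and the helper names are all made up for illustration, and this is not the kexec-tools implementation.

```python
import struct

# Standard ELF constants and the 64-bit program header layout.
PT_LOAD, PT_NOTE = 1, 4
PF_X, PF_W, PF_R = 1, 2, 4
# Fields: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align
ELF64_PHDR = struct.Struct("<IIQQQQQQ")

PER_CPU_NOTE_BYTES = 1024  # hypothetical per-cpu note area size


def load_phdr(paddr, size):
    """PT_LOAD header describing one valid physical memory region."""
    # p_offset is set to the physical address here purely for illustration.
    return ELF64_PHDR.pack(PT_LOAD, PF_R | PF_W | PF_X, paddr, 0, paddr, size, size, 0)


def note_phdr(paddr, size=PER_CPU_NOTE_BYTES):
    """PT_NOTE header pointing at one cpu's saved register notes."""
    return ELF64_PHDR.pack(PT_NOTE, 0, paddr, 0, paddr, size, size, 0)


def build_phdr_table(regions, notes_base, n_cpus):
    """One PT_NOTE header per cpu, followed by one PT_LOAD header per region."""
    headers = [note_phdr(notes_base + cpu * PER_CPU_NOTE_BYTES) for cpu in range(n_cpus)]
    headers += [load_phdr(start, size) for start, size in regions]
    return b"".join(headers)


# Made-up example values: two memory regions, two cpus.
regions = [(0x0, 0xA0000), (0x100000, 0x3FF00000)]
table = build_phdr_table(regions, notes_base=0x2000000, n_cpus=2)
for fields in ELF64_PHDR.iter_unpack(table):
    print(fields)
```

A capture kernel or user-space tool that is handed only the location of this table can walk it with the same fixed layout, which is the decoupling argued for above.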

For the primary kernel, except for remembering its final cpu register state as it dies, it does nothing except jump to the crash recovery kernel. All of the interesting information will be exported to user space.

/sbin/kexec is the glue that fills in the cracks. While the primary kernel is in a sane state it sets everything up including finding out which memory areas need to be looked at. And it stashes it all in a reserved area of memory, that has never been the target of DMA transfers.

The goal is to reduce the dependencies as much as possible. So an old stable kernel can take a crash dump of a new buggy kernel. And so that you don't have to be running the latest and greatest user space simply to set everything up. Although it is still better to require a user-space upgrade to cope with new kernels than to require the crash capture kernel+initrd to be upgraded.

Christoph Lameter posted a patch that "Adds management of ZEROED and NOT_ZEROED pages and a background daemon called scrubd." He went on:

scrubd is disabled by default but can be enabled by writing an order number to /proc/sys/vm/scrub_start. If a page is coalesced of that order or higher then the scrub daemon will start zeroing until all pages of order /proc/sys/vm/scrub_stop and higher are zeroed and then go back to sleep.

In an SMP environment the scrub daemon is typically running on the most idle cpu. Thus a single threaded application running on one cpu may have the other cpu zeroing pages for it etc. The scrub daemon is hardly noticeable and usually finishes zeroing quickly since most processors are optimized for linear memory filling.

Note that this patch does not depend on any other patches but other patches would improve what scrubd does. The extension of clear_pages by an order parameter would increase the speed of zeroing and the patch introducing alloc_zeroed_user_highpage is necessary for user pages to be allocated from the pool of zeroed pages.

There was a good bit of wrangling over implementation, and later he posted an update, saying:

Changes from V4 to V6:

V5 posted as independent patches
copyright update in Altix BTE driver
Note early work on __GFP_ZERO by Andrea Arcangeli
Simplify Altix BTE zeroing driver and handle timeouts correctly (kscrubd hung once in a while).
Support /proc/buddyinfo
Make the higher order clear_page patch less invasive. Name it clear_pages.
patch against 2.6.11-rc3

More information and a combined patchset is available at http://oss.sgi.com/projects/page_fault_performance.

The most expensive operation in the page fault handler is (apart from SMP locking overhead) the touching of all cache lines of a page by zeroing the page. This zeroing means that all cachelines of the faulted page (on Altix that means all 128 cachelines of 128 bytes each) must be handled and later written back. This patch makes it possible to avoid having to use all cachelines if only a part of the cachelines of that page is needed immediately after the fault. Doing so will only be effective for sparsely accessed memory, which is typical for anonymous memory and pte maps. Prezeroed pages will only be used for those purposes. Unzeroed pages will be used as usual for file mapping, page caching etc.

The patch makes prezeroing very effective by:

Applying zeroing operations only to pages of higher order, which results in many pages that will later become zero order pages being zeroed in one step.
Hardware support for offloading zeroing from the cpu. This avoids the invalidation of the cpu caches by extensive zeroing operations.

The scrub daemon is invoked when an unzeroed page of a certain order has been generated so that it's worth running it. If no higher order pages are present then the logic will favor hot zeroing rather than simply shifting processing around. kscrubd typically runs only for a fraction of a second and sleeps for long periods of time even under memory benchmarking. kscrubd performs short bursts of zeroing when needed and tries to stay out of the processor as much as possible.

The benefits of prezeroing are reduced to minimal quantities if all cachelines of a page are touched. Prezeroing can only be effective if the whole page is not immediately used after the page fault.

The patch is composed of 3 parts:

[1/3] clear_pages(page, order) to zero higher order pages Adds a clear_pages function with the ability to zero higher order pages. This allows the zeroing of large areas of memory without repeatedly invoking clear_page() from the page allocator, scrubd and the huge page allocator.

[2/3] Page Zeroing Adds management of ZEROED and NOT_ZEROED pages and a background daemon called scrubd.

[3/3] SGI Altix Block Transfer Engine Support Implements a driver to shift the zeroing off the cpu into hardware. This avoids the potential impact of zeroing on cpu caches.
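A small worked example shows why zeroing by order pays off. It assumes only the figures quoted above (128 cache lines of 128 bytes per Altix page) and the usual convention that an order-n block spans 2^n pages; the helper names are illustrative and are not part of the patch.

```python
# Figures quoted above for Altix: 128 cache lines of 128 bytes per page.
CACHELINE_BYTES = 128
CACHELINES_PER_PAGE = 128
PAGE_BYTES = CACHELINE_BYTES * CACHELINES_PER_PAGE


def pages_in_block(order):
    """Number of base pages covered by one block of the given order."""
    return 1 << order


def bytes_in_block(order):
    """Bytes zeroed by a single order-sized clearing operation."""
    return PAGE_BYTES << order


for order in (0, 4, 8):
    print(order, pages_in_block(order), bytes_in_block(order))
```

One order-4 operation covers 16 pages (256KB at this page size), where a page-at-a-time loop would need 16 separate clear_page() calls.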

Andrew Morton seemed interested in accepting the patch, but he required some benchmarks showing a real improvement, and he needed the patch to adhere to existing APIs for starting, binding, and stopping kernel threads. Christoph started to comply, but the thread petered out.

This patch adds support for the ST M41T00 RTC chip.

You will likely notice that it implements a PPC-specific interface (/dev/rtc->drivers/char/genrtc.h->include/asm-ppc/rtc.h->this file). This was necessary to support a subset of ppc platforms that need to hook up the rtc support at runtime. If I implemented /dev/rtc directly or interfaced to genrtc.c directly, those platforms couldn't use this driver. Eventually, I hope to work on more uniform rtc support across all the processor architectures.

Also, on ppc at least, the hw clock can be set from a timer interrupt if STA_UNSYNC is not set (e.g., ntpd is running). To handle this, a tasklet is used to set the clock if in_interrupt() is true.

Jean Delvare, although not intimately familiar with the hardware involved, still offered some comments, mainly typos, naming conventions, and some memory management advice. Mark posted an updated patch, taking all of Jean's suggestions. Several days later, with no further replies, he asked if his patch could be accepted for inclusion at that point. Greg KH asked if Mark could send the patch with a proper Changelog blurb, and Mark did so. The blurb read:

This patch adds support for the ST M41T00 I2C RTC chip.

This rtc chip has no mechanism to freeze its registers while being read; however, it will delay updating the external values of the registers for 250ms after a register is read. To ensure that a sane time value is read, the driver verifies that the same register values were read twice before returning.

Also, when setting the rtc from an interrupt handler, a tasklet is used to provide the context required by the i2c core code.
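The read-twice check described in the blurb is a common pattern for hardware that cannot latch its registers. The sketch below shows the idea in isolation; read_regs stands in for whatever reads the chip, the retry bound is an assumption, and none of this is taken from the driver itself.

```python
def read_stable(read_regs, max_tries=10):
    """Re-read until two consecutive reads agree, then return that value.

    read_regs is a caller-supplied function returning a tuple of register
    values; max_tries is a hypothetical bound so a misbehaving device
    cannot stall the caller forever.
    """
    prev = read_regs()
    for _ in range(max_tries):
        cur = read_regs()
        if cur == prev:
            return cur
        prev = cur
    raise RuntimeError("register values did not settle")


def make_fake_chip(samples):
    """Test double: returns the given samples one read at a time."""
    it = iter(samples)
    return lambda: next(it)


# First read catches the registers mid-rollover; the next two agree.
chip = make_fake_chip([(59, 59), (0, 0), (0, 0)])
print(read_stable(chip))
```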

This has a number of architecture updates (mips, arm, ppc, x86-64, ia64), and updates ACPI, DRI, ALSA, SCSI, XFS and InfiniBand. And a lot of small one-liners all over.

I'd _really_ like to calm down for a final 2.6.11 now, so please note anything really important I missed, but keep the rest pending. And give this a good testing..

Oh, and the automated bitkeeper mirroring to bkbits.net seems slightly broken right now (hasn't updated in the last 48 hours), but the tar-balls are all there, and the BK updating mechanism will hopefully be fixed soon.

(I've got a few BK trees in private places, it's only the public bkbits.net one that hasn't gotten mirrored out yet - many other BK developers will know where to find my secondary trees and can pull from them instead).

FUSE version 2.2 is out there:

http://sourceforge.net/project/showfiles.php?group_id=121684&package_id=132802&release_id=301878

This can be used standalone or with recent -mm kernels (with the exception of -rc2-mm2).

Most notable changes since 2.1:

Added file handle parameter to open/read/write/release. This should make life easier for filesystems wanting to implement stateful I/O.
Added compatibility to the 2.1 and to some extent to the 1.X API
Re-added ability to interrupt operations. This time more carefully than in 1.X.

Regressions:

Removed shared-writable mmap support, which could deadlock the linux memory subsystem. This should not affect most people, but if some application breaks for you, I'd like to hear about it.
Made the readpages() operation synchronous, again for deadlock considerations. This can degrade performance, especially for high latency filesystems, since previously parallel read-ahead is now serialized.

In the long run I hope to solve both problems, but neither is trivial. Ideas are welcome, as well as bugreports of course.

Franco Broi reported excellent success, saying, "I've just ported my filesystem to 2.2-pre6 and was able to throw away about 300 lines of code, the filehandle stuff is great. I was hoping to give it a thorough test and report back before 2.2 was released but you beat me to it. It just keeps getting better and better, well done!"

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc3/2.6.11-rc3-mm1/

The bk-usb and bk-pci and bk-driver-core trees have been temporarily dropped from -mm, for they are not healthy at present.
After many months dormancy, the ieee1394 tree is back and is included in -mm. Anyone who has been having firewire problems please test it.

Ok, I've cleaned up the bk-usb tree a bunch. If anyone had a previous copy of it, please just delete it and clone it again. It's at:

bk://kernel.bkbits.net/gregkh/linux/usb-2.6

and is safe for consumption.

Andrew, can you put it back into the next -mm release?

Oh, and below is the diffstat and changelog of the patches in it. I've also placed a full patch of it, against the 2.6.11-rc3-bk1 tree for those who don't like to use bk, or are just curious about putting this on top of the latest -mm release:

kernel.org/pub/linux/kernel/people/gregkh/usb/2.6/2.6.11-rc3/bk-usb-2.6.11-rc3-mm1.patch

Also, if you have sent me a USB patch that is not already in the mainline tree, and is not included in this big patch-bundle, please resend it, as my USB patch queue is now empty.

Oops, no, I have a pending patch from Petko Manolov that didn't make it into here, sorry about that Petko, I'll get to that one next week.

Next up, the bk-pci and bk-driver-core mess...

loading dm-mod module fails with this message :

FATAL: Error inserting dm-mod (/lib/modules/2.6.11-rc3-mm1/kernel/drivers/md/dm-mod.ko): Device or resource busy

The following line appears in dmesg :

register_blkdev: failed to get major for device-mapper

It was OK with kernel 2.6.11-rc2-mm2. Same config, did "make oldconfig".

You've enabled CONFIG_BASE_SMALL and so the major_names[] hashtable has just one element. device-mapper uses dynamic major allocation, the range of which is limited to the size of the top-level major_names[] array. You ran out of slots and register_blkdev() failed.

So for now I guess we must drop base-small-shrink-major_names-hash.patch.

Al, that code looks rather crappy. Shouldn't we be using an idr tree or something?

Also, we can never generate a major number of zero if the caller passed in major=0. How come?
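The failure described here can be reproduced with a simplified model. The sketch assumes, following Andrew's description, that dynamic allocation scans the top-level major_names[] slots from the highest index down and never hands out slot zero; the function name and the flat list are illustrative and are not a copy of the kernel source.

```python
def register_blkdev_dynamic(major_names, name):
    """Model of dynamic major allocation over a fixed-size slot table.

    Scans from the last slot down to slot 1 and claims the first free one.
    Slot 0 is never considered, so a major of zero is never returned.
    Returns the allocated major, or None to model the -EBUSY failure.
    """
    for index in range(len(major_names) - 1, 0, -1):
        if major_names[index] is None:
            major_names[index] = name
            return index
    return None


# A one-element table, as with CONFIG_BASE_SMALL: no usable slot at all.
print(register_blkdev_dynamic([None], "device-mapper"))

# A full-sized table (255 slots is an assumed size) hands out the top slot.
full_table = [None] * 255
print(register_blkdev_dynamic(full_table, "device-mapper"))
```

In this model the one-element table fails immediately because the only slot is index zero, which matches the "failed to get major" message in the report.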

Laurent confirmed that selecting CONFIG_BASE_FULL=y solved his problem. Close by, Christoph Hellwig remarked, "It'd be nice to see major_names just gone completely. It's only used for /proc/devices output, and with the infrastructure for easily sharing majors that one is completely misleading.." Alexander Viro replied:

ACK. Moreover, dynamic registration of *majors* makes very little sense these days - about as much as setting lower limit on IP block registration to /12.

IMO we should put a large part of device number space for dynamic allocations (current static ones barely scratch the surface - we could easily leave upper half and nobody'd notice) and use e.g. buddy allocator within it. With allocation requests taking size of area as argument (rounded up to power of 2, which it normally would be anyway).

Any objections to that? Hell, we can even have register_blkdev() without a fixed major calling blkdev_allocate(name, 1<<20) and then eliminate the callers in favour of saner-sized requests. Then kill register_blkdev() completely...
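For readers unfamiliar with the idea, a buddy allocator over a range of numbers can be sketched in a few lines. This is a toy model of the proposal, with request sizes rounded up to a power of two as suggested; the class and method names are hypothetical, and blkdev_allocate() itself exists only as a suggestion in the message.

```python
class BuddyRange:
    """Toy buddy allocator over the numbers [base, base + 2**order)."""

    def __init__(self, base, order):
        self.base = base
        self.max_order = order
        # free[o] holds the start of every free block of size 2**o.
        self.free = {o: set() for o in range(order + 1)}
        self.free[order].add(base)

    @staticmethod
    def order_for(count):
        """Smallest order whose block size is at least count."""
        return max(count - 1, 0).bit_length()

    def alloc(self, count):
        """Return (start, size) for a block of at least count numbers, or None."""
        want = self.order_for(count)
        for o in range(want, self.max_order + 1):
            if self.free[o]:
                start = min(self.free[o])
                self.free[o].remove(start)
                # Split down to the wanted order, freeing each upper half.
                while o > want:
                    o -= 1
                    self.free[o].add(start + (1 << o))
                return start, 1 << want
        return None

    def release(self, start, count):
        """Free a block and merge it with its buddy while the buddy is free."""
        o = self.order_for(count)
        while o < self.max_order:
            buddy = self.base + ((start - self.base) ^ (1 << o))
            if buddy not in self.free[o]:
                break
            self.free[o].remove(buddy)
            start = min(start, buddy)
            o += 1
        self.free[o].add(start)


space = BuddyRange(base=0, order=4)   # 16 numbers, a deliberately tiny range
a = space.alloc(3)                    # rounded up to a block of 4
b = space.alloc(1)
print(a, b)
space.release(*a)
space.release(*b)
print(space.free[4])                  # everything merged back into one block
```

The appeal for device numbers is that a driver asks for an area of the size it needs instead of claiming a whole major.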

Here's the latest version of relayfs, against 2.6.10. It includes a bunch of cleanup and restructuring prompted by the previous round of comments, but the major change that people would care about would probably be the changes to the logging functions relay_write(), __relay_write(), and relay_reserve(). They've been rewritten to be more efficient, or so I hope - I'm sure I'll hear about how they should be improved for the next version in any case. ;-) Thanks to everyone who commented on the previous version.

This is what the API currently looks like:

rchan *relay_open(chanpath, subbuf_size, n_subbufs, flags, callbacks);
void relay_close(chan);
unsigned relay_write(chan, data, length);
unsigned __relay_write(chan, data, length);
void *relay_reserve(chan, length);
void relay_subbufs_consumed(chan, subbufs_consumed, cpu);
extern void relay_reset(chan);
void relay_commit(buf, subbuf_idx, count);

helper macros:

relay_get_buffer(chan, cpu)
relay_get_padding(buf, subbuf_idx)
relay_get_commit(buf, subbuf_idx)

callbacks:

int subbuf_start(buf, subbuf, prev_subbuf_idx);
int deliver(buffer, subbuf, subbuf_idx);
int fileop_notify(buf, filp, fileop);

As before, I've tested this code on a single proc machine using a hacked version of the kprobes network packet tracing module, which can be found here:

http://prdownloads.sourceforge.net/dprobes/plog.tar.gz?download

Once everyone's more or less happy with the API and implementation, I'll do some SMP testing and write some Documentation.

Christoph Hellwig and Andi Kleen both had nitty-gritty objections to various lines of the patch; but neither had any serious problems with it, and Tom said he'd incorporate all their corrections into a subsequent version.

Marty Ridgeway announced the February release of the Linux Test Project (LTP), saying:

LTP-20050207

runltp now exports $TMPDIR as a copy of $TMP, certain exceptions caused these to be different.
Extra functions for LTP libs make these tests fail with a more informative message when attempts are made to create swap on tmpfs.
IPV6 testcase updates from David Stevens
Applied patch from Jacky Malcles that fixes an inconsistency regarding synchronization.
Make proc01 skip kcore
Fix gives a hint at the probable solution if the capset01 test fails
Fix for race conditions in synchronization between children and parent on fcntl15.
Applied patch from Jacky Malcles to allow test to run on ia64.
The test llseek sets RLIMIT_FSIZE to a small number; this fix restores it to its original value.
Fix IPV6 Makefile install path problem

Marvell makes a line of host bridges for PPC and MIPS systems. On those bridges is an i2c controller. This patch adds the driver for that i2c controller.

Please apply.

Depends on patch submitted by Jean Delvare: http://archives.andrew.net.au/lm-sensors/msg29405.html

Bartlomiej Zolnierkiewicz offered some minor fixes and criticisms of the patch, and Mark went through several patch iterations with him.


1.	18 Jan 2005 - 4 Feb 2005	(99 posts)	kexec And crashdump
2.	21 Jan 2005 - 8 Feb 2005	(38 posts)	New scrubd Page Zeroing Daemon
3.	31 Jan 2005 - 4 Feb 2005	(8 posts)	ST M41T00 I2C RTC Chip Driver Released
4.	2 Feb 2005 - 4 Feb 2005	(9 posts)	Linux 2.6.11-rc3 Released
5.	3 Feb 2005	(2 posts)	FUSE Version 2.2 Released
6.	4 Feb 2005 - 9 Feb 2005	(59 posts)	Linux 2.6.11-rc3-mm1 Released
7.	4 Feb 2005 - 5 Feb 2005	(9 posts)	RelayFS Updated
8.	7 Feb 2005	(1 post)	Linux Test Project Updated
9.	8 Feb 2005 - 9 Feb 2005	(5 posts)	New Marvell MV64xxx I2C Driver


By Zack Brown
