Kernel Traffic #219 For 16Jun2003

There were 1019 different contributors. 574 posted more than once. 176 posted last week too.

the attached patch addresses a futex related SMP scalability problem of glibc. A number of regressions have been reported to the NPTL mailing list when going to many CPUs, for applications that use condition variables and the pthread_cond_broadcast() API call. Using this functionality, testcode shows a slowdown from 0.12 seconds runtime to over 237 seconds (!) runtime, on 4-CPU systems.

pthread condition variables use two futex-backed mutex-alike locks: an internal one for the glibc CV state itself, and a user-supplied mutex which the API guarantees to take in certain codepaths. (Unfortunately the user-supplied mutex cannot be used to protect the CV state, so we've got to deal with two locks.)

The cause of the slowdown is a 'swarm effect': if lots of threads are blocked on a condition variable, and pthread_cond_broadcast() is done, then glibc first does a FUTEX_WAKE on the cv-internal mutex, then does a mutex_down() on the user-supplied mutex. Ie. a swarm of threads is created which all race to serialize on the user-supplied mutex. The more threads are used, the more likely it becomes that the scheduler will balance them over to other CPUs - where they just schedule, try to lock the mutex, and go to sleep. This 'swarm effect' is purely technical, a side-effect of glibc's use of futexes, and the imperfect coupling of the two locks.

the solution to this problem is to not wake up the swarm of threads, but 'requeue' them from the CV-internal mutex to the user-supplied mutex. The attached patch adds the FUTEX_REQUEUE feature. FUTEX_REQUEUE requeues N threads from futex address A to futex address B:

sys_futex(uaddr, FUTEX_REQUEUE, nr_wake, NULL, uaddr2);

the 'val' parameter to sys_futex (nr_wake) is the # of woken up threads. This way glibc can wake up a single thread (which will take the user-mutex), and can requeue the rest, with a single system-call.

Ulrich Drepper has implemented FUTEX_REQUEUE support in glibc, and a number of people have tested it over the past couple of weeks.

the speedup with increasing number of threads is quite significant: in the 128 threads case, it's more than 8 times. In the cond-perf test, on 4 CPUs it's almost infinitely faster than the 'swarm of threads' catastrophe triggered by the old code.

there's a slowdown on UP, which is expected: on UP the O(1) scheduler implicitly serializes all active threads on the runqueue, and doesnt degrade under lots of threads. On SMP the 'point of breakdown' depends on the precise amount of time needed for the threads to become rated as 'cache-cold' by the load-balancer.

(the patch adds a new futex syscall parameter (uaddr2), which is a compatible extension of sys_futex. Old NPTL applications will continue to work without any impact, only the FUTEX_REQUEUE codepath uses the new parameter.)
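The difference between waking the whole swarm and requeueing it can be made concrete with a toy model. The sketch below is not kernel or glibc code: a "futex" is reduced to a count of sleeping waiters, and the names toy_futex, toy_wake_all and toy_requeue are invented for illustration. It assumes, as the description above suggests, that everything not woken is moved to the second queue.

```c
/* Toy model only: a "futex" here is just a count of sleeping waiters.
 * Real futexes keep per-address wait queues inside the kernel. */
struct toy_futex {
    int waiters;
};

/* Old behaviour: wake every waiter on the CV-internal lock. Each woken
 * thread then contends for the user mutex; only one can win, so the
 * others block again on the mutex.
 * Returns the number of threads that had to be run. */
static int toy_wake_all(struct toy_futex *cv, struct toy_futex *mutex)
{
    int woken = cv->waiters;

    cv->waiters = 0;
    if (woken > 1)
        mutex->waiters += woken - 1;    /* the losers go back to sleep */
    return woken;
}

/* Requeue behaviour: wake at most nr_wake waiters and move the rest
 * straight onto the mutex's queue without ever running them.
 * Returns the number of threads that had to be run. */
static int toy_requeue(struct toy_futex *cv, struct toy_futex *mutex,
                       int nr_wake)
{
    int woken;

    if (nr_wake < 0)
        nr_wake = 0;
    woken = cv->waiters < nr_wake ? cv->waiters : nr_wake;

    mutex->waiters += cv->waiters - woken;
    cv->waiters = 0;
    return woken;
}
```

With 128 waiters, both paths leave 127 threads queued on the user mutex, but the first one schedules all 128 threads to get there, while the requeue path with nr_wake set to 1 schedules only one.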

Christoph Hellwig rolled his eyes and said, "Urgg, yet another sys_futex extension. Could you please split all these totally different cases into separate syscalls instead?" Ingo said fine, but he wanted to do that later, after finishing the current work.

There was a bit of technical discussion, and then an interesting exchange involving changing some functionality used by existing user binaries. Rusty Russell had said, "Ingo's "new syscall" patch has backwards compat code for the old syscalls. That's fugly 8(". Ingo replied, "yes, but the damage has been done already, and now we've got to start the slow wait for the old syscall to flush out of our tree. It will take a few years to get rid of the compat code, but we better start now." Christoph replied, "Actually it should go away before 2.6.0. sys_futex never was part of a released stable kernel so having the old_ version around is silly. I think it's enough time until 2.6 hits the roads for people to have those vendor libc flushed out that use it." And Rusty replied, "Hmm, in that case I'd say "just break it", and I'd be all in favour of demuxing the syscall."

have you all gone nuts??? It's not an option to break perfectly working binaries out there. Hell, we didnt even reorder the new NPTL syscalls/extensions 1-2 kernel releases after the fact. Please grow up!

the interface should have been gotten right initially. We are all guilty of it - now lets face the consequences. It's only a couple of lines of code in a well isolated place of the file so i dont know what the fuss is about.

To Ingo's statement that it was not an option to break working binaries, Rusty said, "Of course it is. Linux has enough problems due to past mainline stupidities, now we don't need to codify vendor braindamages as well." At this point, Linus Torvalds came down hard, with:

NO.

Christoph, get a grip. Ingo is 100% right.

IT IS NEVER ACCEPTABLE TO BREAK USER LEVEL BINARIES! In particular, we do _not_ do it just because of some sense of "aesthetics".

If you want "aesthetics", go play with microkernels, or other academic projects. They don't care about their users, they care about their ideas. The end result is, in my opinion, CRAP.

Linux has never been that way. The _founding_ principle of Linux was "compatibility". Don't ever forget that. The user comes first. ALWAYS. Pretty code is a desirable feature, but if prettifying the code breaks user apps, it doesn't get prettified.

Repeat after me: the goodness of an operating system is not in how pretty it is, but in how well it supports the user.

Make it your mantra.

Guys, binary compatibility is important. It's important enough that if something got extensively used on development kernels, it's _still_ a hell of a lot more important than most other things around. The _only_ things that trump binary compatibility are

developer sanity (ie it has to be truly mindbogglingly hard to support the old interface)
stability (ie if the old interface was so badly designed that it can't be done right - mmap on /proc/<pid>/mem was one of these).
it's been deprecated over at least one full stable release.

Something like "it's only been in the development kernels" is simply not an issue. The only thing that matters is whether it is used by various binaries or not. And I think futexes are used a lot by glibc..

I've finalised all the documentation that I'm going to do for the 2.4 VM and no further updates will be posted on the web site to this version. At this stage it has been heavily read by a number of people and there hasn't been a complaint or correction in a few weeks now. I'm happy to say it is now complete (and more importantly correct) and acts as a detailed description of the 2.4 VM, the algorithms that it is based on and comprehensive coverage of the code. People who are only interested in the 2.5.x VMs will still find it much easier to follow when they clearly know how 2.4 is put together.

As always, it comes in two parts. The first part is the actual documentation and gives a description of the whole VM. The second is a code commentary which covers a significant percentage of the VM for guiding through the messier parts. They are available in PDF, HTML and plain text formats.

Main site: http://www.csn.ul.ie/~mel/projects/vm/

Understanding the Linux Virtual Memory Manager
PDF: http://www.csn.ul.ie/~mel/projects/vm/guide/pdf/understand.pdf
HTML: http://www.csn.ul.ie/~mel/projects/vm/guide/html/understand/
Text: http://www.csn.ul.ie/~mel/projects/vm/guide/text/understand.txt

Code Commentary on the Linux Virtual Memory Manager
PDF: http://www.csn.ul.ie/~mel/projects/vm/guide/pdf/code.pdf
HTML: http://www.csn.ul.ie/~mel/projects/vm/guide/html/code
Text: http://www.csn.ul.ie/~mel/projects/vm/guide/text/code.txt

Thanks to all the people who read through it, helped me out and sent encouragement. It's been fun.

Denis Vlasenko was completely wowed, and thanked Mel for the work. Paulo Andre also said to Mel, "thank you for the effort you've put into this, for how brilliantly put together it is and most of all for releasing it to the general public. I for one, really appreciate this as it is one GIANT leap towards documenting well an area of the kernel which has been historically less than well documented." He also asked obliquely if Mel planned on doing anything similar for the 2.6 kernel, and Mel replied, "heh, maybe much later, but not now. Even thinking about writing that much again is making me cringe. When I start writing again, it'll be in the form of notes rather than updating the whole document."

I've discussed adding Page Attribute Table (PAT) support to the kernel w/ a few developers offline. They were very supportive and suggested I bring the discussion to lkml so others could get involved.

PAT support allows setting cache attributes via the virtual page table entries that are traditionally set via the MTRRs. The specific cache attribute graphics companies such as ourselves (nvidia), ATI, Matrox, and others are becoming interested in is Write-Combining (WC), both for the AGP and framebuffer apertures. Traditionally, these apertures are marked WC by setting the physical memory ranges to WC in the MTRRs. This has traditionally worked very well, but is becoming a problem with workstation systems with 1+ Gigs of memory.

The problem here is that the system bios typically covers physical ram with Write-Back (WB) MTRRs. On systems with large amounts of physical ram, especially when physical memory ranges can intersperse with ram, the bioses are using multiple MTRRs with strange results. In some cases, enough MTRRs are used to cover physical ram, such that MTRRs are not left over for the AGP or framebuffer apertures. In other cases, 1 MTRR is used to mark non-physical ram as Uncached (which covers both apertures). When trying to mark the appropriate apertures as WC, the kernel refuses to overlap the MTRRs.

Windows works around this MTRR issue by using the PATs.

An example of such a report recently sent to lkml is here: http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0606.html

I discussed this some with Jeff Hartmann, who had some initial development code that was integrated into agpgart, for marking agp pages WC as they were allocated. I think it would be preferable to have pat support separate from agpgart. In that way, other drivers could make use of PAT support for other means (such as mapping the framebuffer). Jeff Hartmann sent us a pass at adding PAT support to agpgart. We've modified his code slightly to be more generic (standalone from agpgart) and usable via the traditional __pgprot() macros (and therefore with the change_page_attr() function).

Mikael Pettersson replied, "Not that I disagree with utilising the PAT, but I don't see anything in this code to deal with the widespread PAT indexing erratum in Intel's processors. I don't have the errata sheets here, but it definitely affected the PIIIs and I think also some P4s. (Large pages ignoring PAT index bit 2, or something like that.)" Terence replied that he hadn't been aware of any errata, and he'd go check it out.
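For readers unfamiliar with how a page table entry selects a cache attribute through the PAT, here is a minimal sketch. It is not the code from the patch under discussion (which is not reproduced here); the helper names pat_index, pat_lookup and pat_set_slot are invented for illustration. The bit positions and memory-type encodings follow the x86 architecture: three bits in the entry (PAT, PCD, PWT) form an index into eight slots of the IA32_PAT register, and the PAT bit sits at a different position in large-page entries.

```c
#include <stdint.h>

/* Entry bits that together select one of eight PAT slots. */
#define PTE_PWT        (1u << 3)    /* page-level write-through */
#define PTE_PCD        (1u << 4)    /* page-level cache disable */
#define PTE_PAT_4K     (1u << 7)    /* PAT bit in a 4 KB page entry */
#define PDE_PAT_LARGE  (1u << 12)   /* PAT bit in a large-page entry */

/* Memory type encodings stored in each PAT slot. */
enum pat_type {
    PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
    PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Slot index is the three-bit value PAT:PCD:PWT. */
static unsigned pat_index(uint64_t entry, int large_page)
{
    uint64_t pat_bit = large_page ? PDE_PAT_LARGE : PTE_PAT_4K;
    unsigned idx = 0;

    if (entry & PTE_PWT)
        idx |= 1;
    if (entry & PTE_PCD)
        idx |= 2;
    if (entry & pat_bit)
        idx |= 4;
    return idx;
}

/* Slot n occupies bits 8n..8n+2 of the 64-bit PAT register value. */
static unsigned pat_lookup(uint64_t pat_msr, unsigned idx)
{
    return (unsigned)((pat_msr >> (8 * idx)) & 0x7);
}

/* Hypothetical helper: return pat_msr with slot idx set to type. */
static uint64_t pat_set_slot(uint64_t pat_msr, unsigned idx, unsigned type)
{
    pat_msr &= ~(0x7ULL << (8 * idx));
    return pat_msr | ((uint64_t)type << (8 * idx));
}
```

As a usage example, starting from the power-on value 0x0007040600070406 and calling pat_set_slot(pat, 1, PAT_WC) makes any entry with only PWT set map as Write-Combining. Choosing a slot whose index has bit 2 clear is one conservative way (an assumption here, not something stated in the thread) to stay clear of an erratum of the kind Mikael describes, since a processor that ignored the PAT bit would still land on the same slot.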

Recently, I started to look into some odd performance behaviors of the O(1) scheduler. I decided to document what I found in a web page at:

http://www.hpl.hp.com/research/linux/kernel/o1.php

(it may take another couple of hours before the pages show up outside the HP firewall, so if you get "page not found" at the moment, don't be surprised).

I should say that I have no direct stake in the CPU scheduler (never worked on it, not sure I ever would want to), but I feel that it's worthwhile to at least document the O(1) scheduler a bit better. Also, I do feel that the O(1) scheduler currently has rather poor "long-term" properties. It would be nice if some of those properties could be improved without hurting the other excellent properties of the current O(1) scheduler.

I think the web pages should be most relevant to the HPTC (high performance technical computing) community, since this is the community that is most likely affected by some of the performance oddities of the O(1) scheduler. Certainly anyone using OpenMP on Intel platforms (x86 and ia64) may want to take a look.

As some of you may be aware, I've been creating a series of articles on porting device drivers (and other kernel code) to the 2.5 kernel. That series is now (more or less) complete. It's also freely available at this point - no subscription gates to worry about. So I wanted to be sure people knew that this resource is there:

http://lwn.net/Articles/driver-porting/

If there is interest, I could repackage this stuff under some sort of suitably free license for wider distribution. It's easier for me to keep it up to date on the LWN site, though. I'd also be curious to hear if there's any additional topics that people think I should have covered.

Patrick Mochel replied, "Thank you very much. These articles are great. You've done a wonderful job on them, which is definitely worth a beer (or three) in Ottawa."

In the course of trying to get something working, Andries Brouwer asked, "do we want to follow POSIX, also in the silly requirement that truncate only sets mtime when the size changes, while O_TRUNC and ftruncate always set mtime. If so, we have to uglify do_truncate()." Linus Torvalds replied, "Does POSIX really say that? What a crock. If so, we should probably add the ATTR_xxx mask as an argument to do_truncate() itself, and then make sure that may_open() sets the ATTR_MTIME bit." Alexander Viro added:

"POSIX says" has value only if there is at least some consensus among implementations. Otherwise it's worthless, simply because any program that cares about portability can't rely on specified behaviour and any program that doesn't couldn't care less anyway - it will rely on actual behaviour on system it's supposed to run on.

Andries had shown that there is _no_ consensus. Ergo, POSIX can take a hike and we should go with the behaviour convenient for us. It's that simple..

This type of attitude ensures there never will be a consensus among implementations. A lack of consensus today is not grounds for failing to comply with a standard specifically designed to eliminate that lack.

On the other hand, that has to be balanced by how objectively reasonable or unreasonable the standard is. However, there should be an extremely strong preference for concurring with the standard, even against the weight of other implementations.

Skipping the update on a truncate not changing size is a performance win although not a very useful one. I don't think we can ignore the standard however. For one it simply means all the vendors have to fix it so they can sell to Government etc.

Now we can certainly put the fix in -every- vendor tree on the planet and not base, I'm just not sure for something so trivial it isnt better just to fix it to the spec *or* beat the spec authors up to fix the spec.

Here and elsewhere, Andrew Morton advocated at least keeping the same behavior in 2.5 as existed in 2.4, but no agreement was reached on the list.
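The rule Andries described at the start of this thread, combined with Linus's suggestion of passing an attribute mask into do_truncate(), can be sketched as a small decision function. This is only a model of the rule as stated in the thread, not kernel code: the TOY_ATTR_* values, the trunc_path enum and trunc_attr_mask() are invented for illustration, and the kernel's own ATTR_* constants have their own values.

```c
#include <stdbool.h>

/* Illustrative flag values only; not the kernel's ATTR_* constants. */
#define TOY_ATTR_SIZE   0x1u
#define TOY_ATTR_MTIME  0x2u

/* The three ways a file can be truncated, per the discussion. */
enum trunc_path { VIA_TRUNCATE, VIA_FTRUNCATE, VIA_O_TRUNC };

/* Build the attribute mask a caller would hand to the truncate helper:
 * truncate() updates mtime only when the size actually changes, while
 * ftruncate() and open(..., O_TRUNC) always update it. */
static unsigned trunc_attr_mask(enum trunc_path path,
                                long long old_size, long long new_size)
{
    unsigned mask = TOY_ATTR_SIZE;
    bool size_changes = (old_size != new_size);

    if (path != VIA_TRUNCATE || size_changes)
        mask |= TOY_ATTR_MTIME;
    return mask;
}
```

Keeping the policy in the caller, as in this sketch, is what would let the truncate helper itself stay free of per-syscall special cases.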

My little project is now ready to face the masses. Put simply, this driver allows one to present ATA (and soon ATAPI) devices through the Linux SCSI layer.

Since we already have an ATA (a.k.a. IDE) driver, I suppose the main question on a lot of people's minds is "why?" So I present some quick notes, in bullet form, that can form the basis of discussion.

Review is requested by all interested: driver authors, SCSI mavens, ATA mavens, whatever.

I request that people try to avoid this posting devolving into an ata-versus-scsi flamewar. Maybe I can defuse that before the fact, by saying: "when in doubt, use drivers/ide."

So let's begin... comments and questions welcome!

Why SCSI?
---------

Many of the advantages are derived from the existence of the scsi mid-layer. It does a lot of work on our behalf, allowing me to focus on the ATA command protocols (PIO-in, PIO-out, DMA, etc.) almost exclusively.
The SCSI mid-layer has shrunk and benefitted a lot from Axboe's block layer work. It's much more lightweight in 2.5.x.
Serial ATA is looming quickly on the horizon. Both device and host controller SATA implementations really lend themselves to behaviors that have existed in SCSI for a while. SATA even defines use of SCSI Enclosure Services.
The Linux SCSI layer handles hotplugging, and is more modular. It already has refcounted devices and sysfs and such. Creating a new block device driver from scratch means handling all those little details.
SCSI has been doing basic error recovery and queue control for a while now. Upcoming SATA2 will benefit greatly from this, as well as ATA TCQ if I ever get around to implementing the latter.
ATAPI is SCSI-like.
Note that PATA in my driver is only an afterthought. The main area of focus, now and in the future, is SATA. It has strict PATA device and host controller restrictions: must do UDMA, LBA, be at least ATA-3, etc. See "when in doubt..." statement above :)

Build notes
-----------

Patch below requires you move drivers/scsi/{scsi,hosts,scsi_obsolete}.h to include/linux/scsi_{defs,hosts,obsolete}.h in order to build. These changes are present in the FTP site patches and BK repos, but are not included below to save space. They are obvious changes that can even be recreated by hand.
You should disable CONFIG_IDE. Both drivers should request_region properly, but if you're not careful the IDE driver will grab the hardware before my driver does.

Testing status
--------------

SATA stressed with iozone, bonnie, dbench, and some proprietary stress tools, on multiple boxes.
PATA stressed similarly, on one box.

Near-future directions
----------------------

ATAPI (see below)
libata.c DMA and taskfile handling is still host-controller specific. It's the most widely used host controller standard, sure. But that mainly applies to PATA devices.

Future host controllers, with a tiny additional bit of abstracting-out, will simply ignore these functions (provided as defaults for most host controllers) and use their own.
Better error handling (see below), and more command queueing work

ATA notes
---------

Supports max UDMA/33 for PATA right now. Temporary limitation because I'm too slack to worry about cable detection.
No packet command -- yet. And thus, no ATAPI cdroms/burners/etc. Coming soon!
Does polling PIO in a kthread. Watch katad chew up CPU.

ATA hardware notes
------------------

Intentionally concentrated on modern hardware: mainly SATA, with a little bit of PATA thrown in when convenient.
Only supports Intel PATA and SATA right now
Code is structured such that DMA engine and command queueing engine can be replaced by different hardware. Next hardware driver will demonstrate this.

SCSI protocol notes
-------------------

claims compliance with latest scsi3 standards (sam3, spc3, sbc2). bug reports in spec compliance are welcome.

Resources
---------

BitKeeper for 2.4.x and 2.5.x:

bk://kernel.bkbits.net/jgarzik/atascsi-2.4
bk://kernel.bkbits.net/jgarzik/atascsi-2.5
Patchkits for 2.4.x and 2.5.x:

ftp://ftp.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.4/2.4.21-rc3-atascsi1.patch.bz2
ftp://ftp.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.5/2.5.69-bk17-atascsi1.patch.bz2

Open issues (currently being addressed)
---------------------------------------

ATAPI. FWIW, yes, I do know that ATAPI is "SCSI-like" and not 100% conformant SCSI.
Error handling. VERY primitive right now. Handling of errors is often "we'll stop talking to the device forever". Hey, it's better than data corruption. :)

Kudos (in alpha order for fairness :))
--------------------------------------

Jens Axboe, for tons of block layer knowledge and general reality checks.
James Bottomley, for patiently answering a lot of my SCSI questions.
Alan Cox, for key inspirational ideas.
Andre Hedrick, for tons of ATA knowledge. I've soaked up a ton of time talking ATA with him, for which I'll be forever grateful.
Lots of people on IRC and linux-scsi for helpful answers and suggestions.
myself, for progressing quite a bit on one of my Big Projects To Do Before I Die.

This patch is an adaptation of the earlier work on the node affine NUMA scheduler to the NUMA features meanwhile integrated into 2.5. Compared to the patch posted for 2.5.39 this one is much simpler and easier to understand.

The main idea is (still) that tasks are assigned a homenode to which they are preferentially scheduled. They are not only sticking as much as possible to a node (as in the current 2.5 NUMA scheduler) but will also be attracted back to their homenode if they had to be scheduled away. Therefore the tasks can be called "affine" to the homenode.

The implementation is straight forward:

Tasks have an additional element in their task structure (node).
The scheduler keeps track of the homenodes of the tasks running in each node and on each runqueue.
At cross-node load balance time nodes/runqueues which run tasks originating from the stealer node are preferred. They get a weight bonus for each task with the homenode of the stealer.
When stealing from a remote node one tries to get the own tasks (if any) or tasks from other nodes (if any). This way tasks are kept on their homenode as long as possible.

The selection of the homenode is currently done at initial load balancing, i.e. at exec(). A smarter selection method might be needed for improving the situation for multithreaded processes. An option is the dynamic_homenode patch I posted for 2.5.39 or some other scheme based on an RSS/node measure. But that's another story...

A node affine NUMA scheduler based on these principles has been running very successfully in production on NEC TX7 machines for a year. The current patch was tested on a 32 CPU Itanium2 TX7 machine.
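The "weight bonus" used at cross-node load balance time can be illustrated with a short sketch. It is not taken from the patch: the structure layout, the names toy_node and toy_pick_victim, and the way the bonus is combined with the load (a simple per-task additive bonus) are all assumptions made for illustration.

```c
#define TOY_MAX_NODES 8

/* Per-node bookkeeping, loosely following the description above. */
struct toy_node {
    int nr_running;                   /* runnable tasks on this node */
    int nr_homenode[TOY_MAX_NODES];   /* of those, tasks whose homenode is i */
};

/* Choose the node to steal from. Each candidate's weight is its load
 * plus a bonus for every task it runs whose homenode is the stealer.
 * Returns the node index, or -1 if no other node has runnable tasks. */
static int toy_pick_victim(const struct toy_node *nodes, int nr_nodes,
                           int stealer, int bonus)
{
    int best = -1;
    int best_weight = 0;

    for (int n = 0; n < nr_nodes; n++) {
        int weight;

        if (n == stealer || nodes[n].nr_running == 0)
            continue;
        weight = nodes[n].nr_running + bonus * nodes[n].nr_homenode[stealer];
        if (weight > best_weight) {
            best = n;
            best_weight = weight;
        }
    }
    return best;
}
```

With a bonus of 2, a node running 4 tasks of which 2 belong to the stealer (weight 8) is preferred over a node running 5 unrelated tasks (weight 5). With a bonus of 0 the choice falls back to plain load.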

Andi Kleen replied, saying he'd done a similar thing for 2.4 based on the same principles himself, but Erich's was much simpler. But Andi added, "But the main problems I have is that the tuning for threads is very difficult. On AMD64 where Node equals CPU it is important to home node balance threads too. After some experiments I settled on homenode assignment on the first load balance (called "lazy homenode"). When a thread clones it initially executes on the CPU of the parent, but there is a window until the first load balance tick where it can allocate memory on the wrong node. I found a lot of code runs very badly until the cache decay parameter is set to 0 (no special cache affinity) to allow quick initial migration. Migration directly on fork/clone requires a lot of changes and also breaks down on some benchmarks."

He and Erich went back and forth for a while on it, and Martin J. Bligh also got in on the act with his own implementation ideas. Rick Lindsley also joined in, but the thread ended inconclusively.

there's been too much delay between 69 and 70, but I was hoping to make 70 the last "Linus only" release before getting together with Andrew and figuring out how to start the "pre-2.6" series and more of a code slush.

Whatever. The end result is a pretty big patch, although a lot of it is due to fairly minor patches. But it's a _lot_ of fairly minor patches, as can be seen from the changelog (also, the acorn drivers got moved around, which always makes for big patches).

You're kidding, right? 2.5 is nowhere near ready to be called anything like "2.6". With an actual code freeze -- as in fix the shit that's broken, non-functional, and/or incompletely implemented without adding any more stuff or rebuilding entire subsystems as opposed to the standard Linus "code freeze" that's much like a slushy in the 9th level of Hell (assuming it gets there, it doesn't last long and really does no good) -- 2.5 is about a year away from having the current code base fully functional and on its way to stable.

Count up the number of drivers that haven't been updated to the current PCI, hotplug, and modules interfaces.

Take a look at the number of arch's that haven't seen much testing (and in many respects are thus broken)... does anyone have a functional 2.5.70 sparc64 kernel? I've built several but they're all too big to be booted (i.e. over 3.5M, and yes, I've turned off everything possible.)

Regarding the number of drivers that hadn't been updated to the new interfaces, Linus replied:

Tough. If people don't use them, they don't get supported. It's that easy.

The thing is, these things won't change before 2.6 (or at least a pre-2.6). When 2.6.0 comes out, and somebody notices that they haven't bothered to try the 2.5.x series, _then_ maybe some of those odd-ball drivers get fixed.

Or not. Some of them may be literally due for retirement, with users just running an old kernel on old hardware.

Btw, this is nothing new. It has _always_ been the case that a lot of people didn't use the latest stable kernel until it was released, and then they complained because the drivers they used weren't up to spec.

It's also a lot easier to update them once the core stops changing! There are some things I worry about - the bio splitting has to be resolved, IDE raid can't happen until that occurs and I'm also waiting for the IDE taskfile stuff/bio splitting bits to resolve so I can merge a load of IDE updates to make things like SII IDE and newer HPT chips work in 2.5.x/2.6

Architectures are also normally just a sync up job and it's again easier to do once the core has stopped changing.

Indeed. I think it's more the rule than the exception that non-x86 architectures "get with the program" sometime during the stable release rather than before. There's just not a lot of incentive for the odd-ball architectures to care before the fact.

Would I prefer to have everything fixed by 2.6.0 (or even the pre-2.6 kernels)? Sure, everybody would. But it's just a fact of life that we won't see people who care about the issues before that happens.

In fact, judging by past performance, a lot of things won't get fixed before the actual vendors have made _releases_ that use 2.6.x (and the first ones inevitably will have 2.4.x as a fall-back: that's only prudent and sane).

This is not just a core kernel issue - we've seen this with subsystems like ext3 and ReiserFS: they were "finished" and "stable", but what made them _really_ stable was a release or two on vendor kernels, and thousands of users.

Elsewhere, Bill Davidsen said to Linus, "Just the other day you posted strong opposition to breaking existing binaries, how does that map with breaking existing hardware?" And Linus said:

One fundamental difference is that I cannot fix it without people who _have_ the hardware caring. So if they don't care, I don't care. It's that easy. If you want to have your hardware supported, you need to help support it.

Another difference is that it's better to not work at all, than to work incorrectly. So if your kernel doesn't boot or can't use your random piece of hardware, you just use an old kernel. But if everything looks normal, but some binary breaks in strange ways, that's _bad_.

The latter reason is, btw, why we don't paper over the build failures like some people suggested. If it hasn't been updated to the new interfaces, it should preferably not even build: which is a big reason why we try to rename interfaces when they change, exactly so that you don't get a subtly broken build.

I am ecstatic beyond words to announce the release of procps version 2.0.13. This release contains a number of NPTL-related enhancements, courtesy of Alexander Larsson of Red Hat. Some of the enhancements are generic in nature, and thus also benefit non-NPTL applications.

I encourage everyone to give this a try, especially 2.5 users.

Tarball, RPM packages, and CVS information is available at:

http://tech9.net/rml/procps/
http://sources.redhat.com/procps/

Change Log:

fix top(1) -p flag behavior (Lars Holmberg)
do not qsort the process list if we are not sorting (Alexander Larsson)
read tgid from /proc/pid/status if it exists (Alexander Larsson)
PROC_SKIPTHREADS flag for ps_readproc() to force only reading of (tgid != pid) to avoid lots of syscalls (Alexander Larsson)
Look at PM->flags in ps_readproc() to avoid reading /proc files that are not needed. (Alexander Larsson)
Support FILLMEM, FILLCMD, FILLENV, FILLWCHAN for above. (Alexander Larsson)
Fix wchan decoding bug (Alexander Larsson)
Fix for ticks going backward and cleanup (Denis Vlasenko)

And other misc. cleanups and changes.
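The change log entry about reading the tgid from /proc/pid/status can be illustrated with a small parser. This is not the procps implementation: parse_tgid() and is_thread() are invented names, and the sketch works on a string already read from the status file rather than opening /proc itself. It relies only on the fact that the status file carries a line beginning with "Tgid:".

```c
#include <stdlib.h>
#include <string.h>

/* Extract the thread group ID from the text of a /proc/<pid>/status
 * file. Returns the tgid, or -1 if no parsable "Tgid:" line is found
 * (as on kernels that do not export the field). */
static long parse_tgid(const char *status_text)
{
    const char *line = status_text;

    while (line && *line) {
        if (strncmp(line, "Tgid:", 5) == 0) {
            char *end;
            long tgid = strtol(line + 5, &end, 10);

            /* strtol leaves end at the start if no digits were read */
            return end != line + 5 ? tgid : -1;
        }
        line = strchr(line, '\n');   /* advance to the next line */
        if (line)
            line++;
    }
    return -1;
}

/* A task whose tgid differs from its pid is a non-leader thread. */
static int is_thread(long pid, long tgid)
{
    return tgid >= 0 && tgid != pid;
}
```

A caller that wanted to treat threads differently from group leaders could read each status file once, call parse_tgid() on it, and use the comparison in is_thread() to decide which of the more expensive /proc files to read.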

Phil Oester asked, "Any comment on the procps v3.1.8 located here: http://procps.sourceforge.net/? Seems it is actively maintained..." Robert replied, "It is a fork of the original procps tree. The maintainer of that tree will tell you his is better. I do not want to argue it." Miquel van Smoorenburg replied, "Well, you could also argue that the current 2.0 tree is a retroactive fork of an older version of procps. I don't understand why you don't work together." Xose Vazquez Perez quoted from the procps FAQ on why there were so many forks: "The original maintainer seems to have had little time for procps. Whatever his reasons, the project didn't get maintained. Starting in 1997, Albert Cahalan wrote a new ps program for the package. For the next few years, Albert quietly helped the Debian package maintainer fix bugs. In 2001, Rik van Riel decided to do something about what appeared to be the lack of a maintainer. He picked up the buggy old code in Red Hat's CVS and started adding patches. Meanwhile, other people have patched procps in a great many ways. In 2002, Albert moved procps to this site. This was done to ensure that years of testing and bug fixes would not be lost. The major version number was changed to 3, partly to avoid confusing users and partly because the top program has been redone."

Xose added, "I think too that is to waste the time&resources to have two, and to do a little different some LiNUX distributions in a basic and important package." Adrian Bunk gave a link to some of Albert D. Cahalan's comments on a Debian Bug, in which Albert had said it was OK for the procps programs to crash if /proc wasn't mounted. Adrian said this justified the folks maintaining other versions.

Vincent Hanquez gave a pointer to a similar comment by Albert about a different bug, and Robert replied that both bugs were fixed in his (Robert's) tree. Elsewhere, he added:

It sucks. I told Albert I would be happy to merge each and every (sane) change he sends me. He refuses. To be fair, I also refuse to work under his tree. His comments on this list are part of the reason. For what it's worth, he did not fork off and create his tree until Rik started work on the official tree.

In the end, all that matters to me really is that Red Hat and other big distributions use my tree (apparently whether I maintain it or not) and I use those distributions. If I used Debian, maybe my view would be different. Or maybe I would make them switch trees :)

Rik's activity was one of two reasons for me to increment the major number, and one of many reasons to put the project on SourceForge.

Prior to that, I had provided 2.x.x versions to Debian for years. Also note that I wrote the ps program itself and a large portion of the library you use. This all was long before you and Rik touched procps.

Go ahead and check the 2.x.x in Debian. It's mine.

DKMS (Dynamic Kernel Module Support) now contains module build kernel-preparation support for SLES/United Linux type kernels. This allows you to properly build modules for these kernels using DKMS. Many thanks to Fred Treasure - fwtreas () us ! ibm ! com for assisting in this.

Also, as SLES uses a different mkinitrd command than the one found in RH, if REMAKE_INITRD is set in your dkms.conf, it does not remake the initrd for you and instead tells you that you will have to do this manually. If anyone would like to send me feedback on best practices for SLES/UL initrd stuffs so that this process can be automated, your help would be greatly appreciated.

DKMS is at: http://www.lerhaupt.com/dkms/
Changeblog: http://www.lerhaupt.com/dkms/dkms.html

JP Sugarbroad complained that compiling IDE as a module was completely broken, and that his patches and complaints were being ignored. Tomas Szepe explained, "Alan Cox is working on the problem ATM, check 2.4.21-rc6-ac1." JP asked if Alan's work was also going into the 2.5 tree, and Alan Cox replied, "The same logic applies to 2.5, and it may be more useful there as it can be used to clean up a pile of other ordering stuff."

the attached patch, against 2.5.70, adds a new system-call called sys_tkill2():

asmlinkage long sys_tkill2(int tgid, int pid, int sig);

this new syscall solves the problem of PID-reuse. The pthread_kill() interface itself cannot guarantee via current kernel mechanisms that a thread won't go away (call pthread_exit()) before the signal is delivered - due to threads being detached. If some other application creates threads fast enough and makes the PID space wrap around, then it might happen that the wrong application gets the signal.

the tgid never changes during the lifetime of the application, so specifying tgid guarantees that pthread_kill() will only send signals to threads within this application.

(i also added the rule of tgid == 0 meaning 'no tgid check' - this made it possible to merge the previous sys_tkill() API into the new sys_tkill2() API. The sys_tkill API is of course preserved.)

Ulrich says that this interface is OK and desired for glibc. The patch was sanity-compiled & booted on x86 SMP.
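The PID-reuse race and the tgid check described above can be illustrated with a small user-space model. This is only a sketch of the rule as stated in the post, not the kernel patch: the `TaskTable` class, its method names, and the use of `LookupError` to stand in for a failed lookup are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTable:
    """Toy model of a task list: maps thread id (pid) -> thread group id (tgid)."""
    tasks: dict = field(default_factory=dict)
    delivered: list = field(default_factory=list)

    def spawn(self, pid: int, tgid: int) -> None:
        """Register a thread `pid` belonging to thread group `tgid`."""
        self.tasks[pid] = tgid

    def exit_thread(self, pid: int) -> None:
        """Remove a thread, making its pid available for reuse."""
        del self.tasks[pid]

    def tgkill(self, tgid: int, pid: int, sig: int) -> None:
        """Record delivery of `sig` to thread `pid` if it belongs to `tgid`.

        tgid == 0 means "no tgid check", i.e. the older tkill() behaviour.
        """
        if pid not in self.tasks:
            raise LookupError("no such thread")
        owner = self.tasks[pid]
        if tgid != 0 and owner != tgid:
            raise LookupError("thread is not in the given thread group")
        self.delivered.append((owner, pid, sig))


if __name__ == "__main__":
    SIG = 10  # arbitrary signal number for the demo

    table = TaskTable()
    table.spawn(pid=1001, tgid=1000)   # thread 1001 of application 1000
    table.exit_thread(1001)            # the thread exits before the signal is sent
    table.spawn(pid=1001, tgid=2000)   # pid 1001 is reused by application 2000

    table.tgkill(0, 1001, SIG)         # unchecked: lands in the wrong application
    try:
        table.tgkill(1000, 1001, SIG)  # checked: refused, nothing is delivered
    except LookupError as err:
        print("refused:", err)
    print(table.delivered)
```

In the model, the unchecked call reaches whichever application currently owns the pid, while the checked call fails instead of signalling a stranger.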

I'd suggest changing the name. It's not "tkill2", it's a totally new system call with different inputs.

How about calling it "tgkill()" for "thread" and "group", which are the new inputs?

Ingo agreed to the name change, and he and Linus discussed other interface details.

The Linux Test Project test suite has been released. The latest version of the testsuite contains 1800+ tests for the Linux OS. Our web site also contains other information such as: test results, a Linux test tools matrix, an area for keeping up with fixes for known blocking problems in the 2.5 kernel releases, technical papers and HowTos on Linux testing, and a code coverage analysis tool.

Highlights:

Relocation of all Open POSIX Test Suite tests to /testcases/open_posix_testsuite.
Inclusion of modified `top` tool to allow for system usage data gathering.
More support for 64bit architectures and large memory machines
Fixes and code cleanups. See our Bug Tracker for more details

We encourage the community to post results, patches or new tests on our mailing list and use the CVS bug tracking facility to report problems that you might encounter with the test suite.

Colin Paul Adams asked, "I am somewhat confused about how much swap space you can have with a 2.4 series kernel. If I read the mkswap man page, I get the impression that I could have up to 8x2GB of swap space for a total of 16 GB, but reading the RedHat reference guide, it says 2GB maximum. I presume 2.5 kernels have much higher limits?" Rik van Riel replied, "That piece of documentation is out of date. I'm using a 20 GB swap partition on one of my test systems, with a 2.4 kernel." Randy Dunlap asked what the new 2.4 and 2.5 limits were, adding that Andrew Morton had successfully tested 52 gigs of swap on one kernel. Andrew confirmed this elsewhere. Finally, William Lee Irwin III said:

I apologize for failing to do a proper wrap-up. AIUI, we have:

both 2.4.x and 2.5.x kernels support swapspaces of up to 64GB in size
2.4.x supports 64 swapspaces and 2.5.x supports 32 (not reparable)
mkswap(8) needs fixes for creating swapspaces larger than 2GB merged back to util-linux; aeb (util-linux maintainer) has publicly requested the code be sent back to him for merging, presumably with some evidence of its correctness. One of the several distro people who are maintaining such patches against mkswap(8) is going to send that in.

A while ago, I mentioned that the Linksys WRT54G wireless access point used several GPL projects in its firmware, but did not seem to have any of the source available, or acknowledge the use of the GPLed software. Four weeks ago, I spoke with an employee at Linksys who confirmed that the system did use Linux, and also mentioned that he would work with his management to ensure that the source was released. Unfortunately, my e-mails to this individual over the past three weeks have gone unanswered. Of course, I also tried contacting Linksys through their common public e-mail accounts ([email protected], [email protected]) to no avail.

However, it is hard for me to know if my contact in the company has just gone on a three-week vacation (and not set an auto-responder), or has been asked to not answer any more mail on this subject. Also, I should note that I don't own this product, so I can't determine if the source is shipped with it. However, I have gone through all the available information on the Linksys website, and can find no reference to the GPL, Linux (as it relates to this product), or the firmware source code. Also, the firmware binary (see below) is freely available from their website. There is no link from the download page to the source, or any mention of Linux or the GPL. Finally, it would be strange if the source was included in the physical package, as my contact at Linksys was initially unaware Linux was used in this product.

The following steps can be used to determine the exact nature of the possible GPL violation.

Go to the following URL: http://www.linksys.com/download/firmware.asp?fwid=178
Download the "firmware upgrade files": ftp://ftp.linksys.com/pub/network/WRT54G_1.02.1_US_code.bin (MD5SUM: b54475a81bc18462d3754f96c9c7cc0f)
While it is downloading, confirm that there is nothing on the webpage to indicate that this binary contains GPLed software.
Once the download is complete, copy the contents of the file from offset 0xC0020 onward into a new file.

dd if=WRT54G_1.02.1_US_code.bin of=test.dump skip=24577c bs=32c

Notice that this file is an image of a CramFS filesystem. Mount it.
Explore the filesystem. You will notice that the system appears to be based on Linux 2.4.5.
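The extraction step can also be sketched in Python, which makes the offset arithmetic explicit: `skip=24577` blocks of `bs=32` bytes is exactly offset 0xC0020. The helper names below are hypothetical, and the search assumes a little-endian CramFS image (magic number 0x28CD3D45 stored least-significant byte first); treat it as a sketch rather than a verified procedure for this particular firmware.

```python
import struct

CRAMFS_MAGIC = 0x28CD3D45  # cramfs superblock magic number


def dd_offset(skip_blocks: int, block_size: int) -> int:
    """Byte offset at which `dd skip=N bs=B` starts copying."""
    return skip_blocks * block_size


def find_cramfs(data: bytes) -> int:
    """Return the offset of the first little-endian cramfs magic, or -1 if absent."""
    return data.find(struct.pack("<I", CRAMFS_MAGIC))


def extract_image(firmware_path: str, out_path: str, offset: int) -> int:
    """Copy everything from `offset` to the end of the firmware into `out_path`.

    Returns the number of bytes written.
    """
    with open(firmware_path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(offset)
        payload = src.read()
        dst.write(payload)
    return len(payload)


if __name__ == "__main__":
    # Same arithmetic as the dd command: 24577 blocks of 32 bytes.
    print(hex(dd_offset(24577, 32)))
```

Searching for the magic number with `find_cramfs` is a way to locate the filesystem when the offset is not known in advance, for example in a different firmware revision.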

Incidentally, there is at least one other GPLed project in the firmware: the BusyBox userland component: (http://www.busybox.net/)
The Linux kernel (I think) is mixed up with a bunch of other stuff in: bin/boot.bin

You might want to know why I am interested in getting the code for the kernel used in this device.

There's been some discussion here about Linux's lack of wireless support for a few of the newer 802.11b and (nearly?) all 802.11g chips. Incidentally, Linux has excellent support for at least one manufacturer's wireless family. The following Broadcom chips all appear to be supported under Linux -- if you happen to be running Linux on a MIPS processor in a Linksys router:

Broadcom BCM4301 Wireless 802.11b Controller
Broadcom BCM4307 Wireless 802.11b Controller
Broadcom BCM4309 Wireless 802.11a Controller
Broadcom BCM4309 Wireless 802.11b Controller
Broadcom BCM4309 Wireless 802.11 Multiband Controller
Broadcom BCM4310 Wireless 802.11b Controller
Broadcom BCM4306 Wireless 802.11b/g Controller
Broadcom BCM4306 Wireless 802.11a Controller
Broadcom BCM4306 Wireless 802.11 Multiband Controller

This list was produced by running strings on: lib/modules/2.4.5/kernel/drivers/net/wl/wl.o

I am trying to determine exactly how tightly coupled these drivers are to the kernel.

As an aside, I know that some wireless companies have been hesitant to release open source drivers because they are worried their radios might be pushed out of spec. However, if the drivers are already written, would there be any technical reason why they could not simply be recompiled for Intel hardware, and released as binary-only modules?

Finally, I know that traditionally, Linux has allowed binary-only modules. However, I was always under the impression that this required that the final customer be allowed to remove them at will. That is to say, you couldn't choose to implement a portion of the kernel critical to the system's operation in a module, and then not release that module under the GPL. In this particular case, I would argue that the wireless drivers are critical to this device's operation (after all, it is a wireless access point). In addition, the final user in this case really can't just "rmmod" the wireless driver.

The Broadcom driver, kernel, and really everything else in the firmware, are (IMHO anyways) being used to form a discrete package -- the WRT54G's firmware. Does/should this have any implication on whether the Broadcom wireless module must be covered by the GPL?

I would be very interested in knowing if I am mistaken in any of my claims or conclusions, and if not, how I should proceed in getting this issue resolved.

Brad Chapman said he thought Andrew had identified a true violation of the GPL, and asked, "Why is Linksys and/or Broadcom doing this?" Davide Libenzi remarked bitterly, "These guys, not only they make money using GNU/Linux inside their products but they do even refuse to support the GNU/Linux community when it comes to drivers. In many cases even a mere binary driver is not available. This is as lame as it gets."

Dave Jones also said, in response to the original post, "Curiously, the Belkin products (http://networking.belkin.com) also seem to be based upon the same source. Looks like they could just be rebranded firmware images with some features disabled." And Alan Cox said, "If Belkin are shipping the same code and apparently got it from Linksys someone might also care to write to Belkin and inform them about this. They may not even know about their rather large potential liability."

I have a Buffalo (Melco) WBR-G54.

Looking through the latest firmware update available : http://www.buffalo-technology.com/support/firmware.htm

It does appear to be similar to the Linksys firmware and contain Linux and possibly BusyBox.

No mention here or anywhere on their site of the GPL or the source code to what they are distributing!

Erik Andersen replied, "I just visited the Buffalo site http://www.buffalo-technology.com/ and I could not find any source code. And not only are they distributing the linux kernel and BusyBox, their rom is _remarkably_ similar to the Linksys one in many respects. Perhaps they share an upstream vendor that did not make them aware of their responsibilities?" John Shifflett remarked, "Fry's appears to be selling a lot of Linksys WAP-54g units, which are low cost wireless access points. For firmware version 1.06.03, the cramfs starts at skip=24576c, bs=32c." And Alan said, "Has anyone had a lawyer write and advise Fry's yet ? Hey theory seems to work for SCO 8)"

I have another product from Melco/Buffalo and found out, that it had Linux and other GPLed (and BSD with adv. clause and Apache with end-user doc clause and and and) materials on it, without the manual or their web page mentioning it. Their telephone support in Japan (where the stuff comes from) didn't give sh*t about it and repeatedly told me about their policy of not telling anything about the inner workings of products and even refused to either elevate me to talk to somebody else or give me a contact point for either their legal or PR departments.

I directly called the PR guy responsible for that line of products at their headquarters (good to be on their list of press) and explained the situation to him. He promised to clear it up with legal and the developers. After about a week (and me ranting about an anonymous company in my blog) he got back with a very detailed and polite answer about the situation. They DID know about the issues involved and their obligations. Just that their other hand (like support and web page creation) isn't up to the "new" style and that they were fixing it. I also found out, that the "quick setup" leaflet in the product actually contained a small note that the product included GPLed software.

It has been 3 months since. If they haven't gotten their act together still, they need to be reminded of it.

Michael Neuffer said, "Over a month ago I sent them a request and this is their response:"

Joern Engel noticed that Charles White, the official maintainer of the Compaq Smart2 RAID driver and the Compaq Smart CISS RAID driver, had a bad email address in the MAINTAINERS file. He posted a patch to remove the dead address, but then noticed that the mailing list given in the same file for these drivers was also dead. He posted a new patch removing that address as well, and Adam Kropelin replied that Stephen Cameron had posted the proper patch back in February, making himself (Steve) the official maintainer and migrating the mailing list from Compaq over to Hewlett Packard. That patch had never been applied, so Adam included it again in his post.

Here's some more PCI changes against the latest 2.5.70 bk tree. They contain the following:

remove almost all usages of pci_present(). There are only 2 users of this function left, and I'll continue to work to remove them.
add sysfs support for pci domains. This is from Matthew Wilcox, and is a bit different from the last patch he sent to lkml. This one supports sparc64 and ppc64 and has been blessed by David Miller.
updated pci pool CONFIG_DEBUG_SLAB logic
removed pci_for_each_bus() macro, and added a pci_find_next_bus() function to prevent people from directly walking the PCI bus lists.

Kernel-Janitor patchsets
2003-06-10
Draft version 1
========================================

KJ patchsets

The Kernel Janitors (KJ) project reviews patches, comments on them, screens and/or merges them, and creates patchsets that contain them. These patchsets are available for anyone to download and use.

Patches should be sent to "[email protected]" for review. Randy Dunlap ([email protected]) will compile-test them and merge them into the -kj patchset if they pass review.
KJ patchset review and approvals

A patch must be approved by at least two (2) KJ project developers and receive no vetoes (rejects) before it is added to the -kj patchset.
KJ patch forwarding

After a patch has been in the -kj patchset for 1 week (without problems) or on request from another Linux tree maintainer, it will be forwarded to the current Linux tree maintainer (Linus, Andrew, Marcelo, etc.) for merging.
Where hosted

KJ patchsets are hosted at OSDL.org. They will begin at http://www.osdl.org/archive/rddunlap/kj-patches/ ... but this is only a temporary location until a more permanent solution is available (soon).
Patchset type
- merged and separated patches available as a diff (optionally bzip2- compressed)
- maybe a BK tree in the future, but that won't be the only format
- separated (split) patches are forwarded to tree maintainers so that credits for the patches are logged

I noticed both the 2.4 and 2.5 BK->CVS trees don't have version tags any more (v2_5_70, for example, as in the old tree).

Is this intentional? Did CVS take too long to tag all files or something?

It was quite a nice feature to have them, very useful for finding out the differences between certain kernel versions. I can live without it, though. It's still a nice service without the tags. (Thanks!)

Larry McVoy replied, "I'll go look. CVS takes way too long to tag, it forces a rewrite of every file. I did attempt to filter out tags that weren't of the form v2.* but it looks like I screwed up." Elsewhere, Ben Collins said:

Looks like the tags are on the ChangeSet file only. Which is why I didn't notice. You could get a timestamp from the tag on ChangeSet and use that for a -D argument.

A quick script could do this for you. I think it's wise of Larry to keep it this way.
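A minimal sketch of such a script follows. It assumes the output of something like `cvs log -rv2_5_70 ChangeSet` has already been captured as text, pulls the timestamp out of the first `date:` field, and builds a `cvs checkout -D` command line from it. The function names and the module name passed in are hypothetical, and the regular expression covers only the two `date:` layouts I am aware of (`2003/05/27 03:12:45` and `2003-05-27 03:12:45 +0000`).

```python
import re
from typing import List, Optional

# Matches the "date:" field of a `cvs log` revision entry, with either
# "/" or "-" separating the year, month, and day.
_DATE_RE = re.compile(
    r"^date:\s*(\d{4})[/-](\d{2})[/-](\d{2}) (\d{2}:\d{2}:\d{2})",
    re.MULTILINE,
)


def tag_timestamp(cvs_log_output: str) -> Optional[str]:
    """Return the first revision timestamp in the log text, or None if absent."""
    match = _DATE_RE.search(cvs_log_output)
    if match is None:
        return None
    year, month, day, clock = match.groups()
    return f"{year}-{month}-{day} {clock}"


def checkout_command(module: str, timestamp: str) -> List[str]:
    """Build (but do not run) a checkout of `module` as of `timestamp`."""
    return ["cvs", "checkout", "-D", timestamp, module]
```

The command is returned as a list so it can be handed to `subprocess.run` without shell quoting. Time zone handling is deliberately left out: if the log reports UTC and the client interprets `-D` in local time, the timestamp would need an explicit zone appended.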

Larry replied, ""Wise" is a stretch, more like a fortunate scripting screwup. But I agree with Ben, given that you can get the info another way it is a lot easier (i.e., thrashes the disk less) if we leave it as is." And Pascal replied:



1.	19May2003-22May2003	(53 posts)	Futex Updates; Backward Compatibility Policy
2.	19May2003-21May2003	(4 posts)	Status Of Virtual Memory Documentation
3.	20May2003-22May2003	(10 posts)	Page Attribute Table (PAT) Support
4.	20May2003-4Jun2003	(45 posts)	Web Page For The O(1) Scheduler
5.	22May2003	(2 posts)	Linux Weekly News Articles On Porting Code To 2.5
6.	22May2003-26May2003	(13 posts)	Kernel Developers' Relationship To POSIX And Other Standards
7.	24May2003-28May2003	(18 posts)	SCSI Driver To Access IDE Devices
8.	27May2003-28May2003	(14 posts)	NUMA Scheduler Enhancements
9.	26May2003-29May2003	(60 posts)	Linux 2.5.70 Released; Moving Toward "pre-2.6"
10.	28May2003-31May2003	(19 posts)	procps Maintainership Still Contentious
11.	30May2003	(1 post)	Status Of Dynamic Kernel Module Support (DKMS)
12.	31May2003-2Jun2003	(6 posts)	Status Of Modular IDE
13.	3Jun2003	(5 posts)	New tgkill() System Call
14.	6Jun2003	(1 post)	Linux Test Project June Release
15.	7Jun2003-10Jun2003	(16 posts)	Limits On Maximum Swap Space For 2.4 And 2.5
16.	7Jun2003-11Jun2003	(64 posts)	Possible GPL Violations By Many Wireless Vendors
17.	8Jun2003	(3 posts)	RAID Driver Maintainership
18.	10Jun2003	(78 posts)	PCI Fixes For 2.5.70
19.	10Jun2003	(1 post)	Kernel Janitor Patchset Availability
20.	11Jun2003	(5 posts)	Version Tags Missing From BK->CVS Gateway Files


By Zack Brown