Kernel Traffic #216 For 20May2003

There were 440 different contributors. 234 posted more than once. 226 posted last week too.

Is there any interest in a single system call that will perform both a fork() and exec()? Could this save some extra work of doing a copy_mm(), copy_signals(), etc?

I would think on large, multi-user systems that are spawning processes all day, this might improve performance if the shells on such a system were patched.

Perhaps a system call like:

   pid_t spawn(const char *p_path,
               const char *argv[],
               const char *envp[],
               const int   filp[]);

The filp array would allow file descriptors to be redirected. It could be terminated by a -1 and reference the file descriptors of the current process (this could also potentially save some dup() syscalls).

If any of these parameters (excluding p_path) are NULL, then the appropriate values are taken from the current process.

I originally was thinking of a name of fexec() for such a syscall, but since there are already "f" variant syscalls (fchmod, fstat, ...), an fexec() would make more sense for executing an already open file, so the name spawn() came to mind.

I know almost all of my fork()-exec() code does almost the same thing. I guess vfork() was a potential solution, but this somehow seems cleaner (and still may be more efficient than having to issue two syscalls)... the downside is, of course, another syscall.

There was not much enthusiasm. Only Rafael Costa dos Santos showed any interest, offering to help code the thing up. Elsewhere, Larry McVoy, while not outright against the idea, felt there were significant compatibility issues. In particular, he suggested ensuring compatibility with Windows NT. Matthias Andree took a somewhat dimmer view of the compatibility issue. He said adding a new system call was "a major showstopper, because it'd only be useful to non-portable, Unix-specific applications (thus it wouldn't be put to much use)."

Other folks pointed out that the small amount of time saved by avoiding an extra system call during process creation would be completely overwhelmed by the time it took to actually execute the program that would run in that process. On the other hand, Davide Libenzi suggested doing this as part of the C library, in user space. A new system call would be overkill for such small gains, but it might be worth adding a library call.
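Davide's user-space suggestion can be sketched in a few lines of C. The following is only an illustration and was not posted in the thread: the name spawn_sketch(), the choice to default a NULL argv to just the program path, and the exit status 127 on failure are all assumptions made here. Each filp[i] is read as "the parent's descriptor that should become descriptor i in the child", which is one possible interpretation of the proposal.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical library-level stand-in for the proposed spawn() call. */
pid_t spawn_sketch(const char *p_path, const char *argv[],
                   const char *envp[], const int filp[])
{
    /* Assumption: a NULL argv means "run the program with no arguments". */
    const char *default_argv[] = { p_path, NULL };
    pid_t pid = fork();

    if (pid != 0)
        return pid;    /* parent: the child's pid, or -1 if fork() failed */

    /* Child: make descriptor i a copy of filp[i], up to the -1 terminator.
     * Entries are applied in order, so this simple loop does not handle
     * maps where an earlier dup2() overwrites a later entry's source. */
    if (filp != NULL) {
        for (int i = 0; filp[i] != -1; i++) {
            if (filp[i] != i && dup2(filp[i], i) < 0)
                _exit(127);
        }
    }

    /* A NULL envp inherits the current environment. */
    execve(p_path,
           (char *const *)(argv ? argv : default_argv),
           envp ? (char *const *)envp : environ);
    _exit(127);        /* only reached if execve() failed */
}
```

This still costs two system calls plus the dup2() calls, so it buys convenience rather than speed, which matches the objection raised in the thread. POSIX already defines posix_spawn() for much the same purpose.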

Bas Mevissen asked if Linux had any support for Broadcom's BCM4306 or BCM2050 WLAN chips. He saw that the BCM4401 ethernet chip had a Linux driver, and was hopeful that maybe the WLAN chips did as well. Martin List-Petersen replied, "It seems, that the specs haven't been released yet. There are quite a few Wlan cards out there based on the Broadcom chips (nearly all cards, that support 802.11g), so it's quite a shame. (Actually this fits the TrueMobile 1180, 1300 and 1400, speaking of Dell wireless lan cards)." He added, "The same problem is with the Intel Prowireless 2100 (Centrino) WLan card. No Linux support available yet, which is another choice for the Dell notebooks at the moment." But he also said there was a Petition folks could sign, regarding this very issue. Martin concluded, "I've tried to contact Broadcom directly, but they are just ignoring mails containing the word "Linux", so it seems." David S. Miller also said:

Don't expect specs or opensource drivers for any of these pieces of hardware until these vendors figure out a way to hide the frequency programming interface.

Ie. these cards can be programmed to transmit at any frequency, and various government agencies don't like it when f.e. users can transmit on military frequencies and stuff like that.

The only halfway plausible idea I've seen is to not document the frequency programming registers, and users get a "region" key file that has opaque register values to program into the appropriate registers. The file is per-region (one for US, Germany, etc.) and the wireless kernel driver reads in this file to do the frequency programming.

So don't blame the vendors on this one, several of them would love to publish drivers publicly for their cards, but simply cannot without upsetting federal regulators.

Alan Cox remarked that folks were already cracking the Windows interface on those cards, and that non-US governments cared about this issue as well. He said, "The fact people are already abusing the technology suggests that they will be forced to go the crypted settings route for next generation hardware anyway." And added, "I talked to one vendor about this stuff and fingers crossed we will see open drivers except for the radio module. In the longer term I suspect vendors will move to signed register sets, so you can load "US 802.11g" but you can't load "police frequency, full power"."

At some point Bas suggested that if these vendors were really willing to release their specs, but were only holding back to satisfy government agencies, then maybe they could release some binary drivers in the interim. Martin replied to this, "I totally agree on this. A binary driver could be better than nothing at this point. Another thing that wonders me, is why companies like Broadcom, if they are so open to releasing the drivers at some point, where they can make the regulation agencies somewhat happy, are so ignorant then. I've heard of several people, that tried to get a statement on the possibility for Linux drivers from them and the return is nothing. I've actually tried myself. No response at all."

Elsewhere, Carl-Daniel Hailfinger's eyes lit up at the prospect of transmitting on military frequencies. He said he "wants binary only driver for these cards to build opensource driver with ability to set "interesting" frequency range." Martin said, "It's there for Windows." And at some point, Richard B. Johnson said:

Contrary to popular opinion, there is no FCC regulation prohibiting one from receiving some particular frequency. There is, however, a federal law prohibiting the disclosure of a radio message by a third party. This means that the media, or even law enforcement can't listen to a private radio (cell phone) conversation and then disclose its content. At one time, cell phones used FM at 960 MHz. This could be readily received by receivers designed for Amateur Radio use. For a time, the FCC refused to Type Approve receivers that cover these frequencies. However, most Hams know how to fix their receivers so they can receive whatever they want and Type Approval was only required for receivers that were designed to be sold. You could build anything you want for yourself. This refusal to Type Approve receivers was a trick to make the usual receiver owner think that there was some dumb regulation when, in fact, under the Communications Act of 1934 (as amended), there can't be such a regulation without creating a new public law, which hasn't happened and probably will not.

Recently, some broadcast satellite companies have tried to get the FCC to declare that their transmissions are private and unauthorized reception should be unlawful. The FCC has continually postponed any such declaration because, if once broadcast, a radio signal doesn't become public, then anybody could sue every radio transmitter operator to prevent the trespass of "their" signals onto private property. You can't have it both ways, either radio signals are public and, therefore cannot commit a trespass, or they are private and can.

But, unlike some other countries' regulators, the FCC has steadfastly refused to allow broadcasters, even satellite broadcasters, to pursue such extortion. Basically, once a signal leaves an antenna, it becomes public property.

The same is not true for cable and "guided waves". Satellite broadcasters have not been able to convince the FCC that their transmissions are "guided waves". However, some private RF link companies' signals, including some that use satellites, are considered "guided waves" and cannot be used without permission.

Various commercial interests have convinced governments of many other countries that they "own" their radio signals and therefore different regulations exist in many other countries. In the UK, for instance, one has to purchase a license to use a receiver (you know, some Sony Walkman). This is, in my opinion, extremely repressive. It would be nice for somebody to start suing the BBC (and others) to recover damages for the criminal trespass of "their" radio signals onto private property. After a few such lawsuits, the ownership of such broadcast signals would revert to the public, just like in the US.

Carl-Daniel replied, "Here in Germany, receiving some particular frequencies (e.g. those used by the police) was prohibited a few years ago (I don't know exactly if they changed the law). The argument was that some receiver types emitted a weak signal on the frequency they were listening to (and could be tuned to become a private radio station) which could interfere with the low-power police devices. However, it was simply not sensible to prohibit all radios, so they were constrained to a specific frequency range."

Close by, Alan took exception to Richard's statement that people needed a license for things like a Sony Walkman in England. Alan said, "You need a license to receive terrestrial TV but that is rather different and relates to both cultural and historical tax differences in philosophy between the US and UK. The big problem with 'soft' radios is transmit. You can hotwire your centrino. People in the UK are already trying to use US drivers in Windows XP because "they go further". If you listen to police transmissions then it's ultimately poor police security, if you transmit on their frequency then it's a lot more serious because you might interfere with emergency services."

Below is a first cut at tracking the major work items which should be completed for a 2.6 release.

When considering these items it would be useful to have a clear idea of what a 2.6.0 release is actually _for_. Obviously, 2.6.0 doesn't mean "it's finished, ship it".

I'd propose that 2.6.0 means that users can migrate from 2.4.x with a good expectation that everything which they were using in 2.4 will continue to work, and that the kernel doesn't crash, doesn't munch their data and doesn't run like a dog. Other definitions are welcome.

I shall be maintaining this list so we can understand where we are with respect to 2.6 readiness. And so we can look at features and say "no". And so we can look at bugs and say "not gating 2.6.0".

Things we should not track here are:

Regular old bugs. Please use bugzilla.
Wishlist items. This list is not a route for getting commitment for inclusion of $FAVEFEATURE. In fact it's probably a good way of getting the feature shot down ;)
Driver problems. Most important drivers mostly work OK now. Please use bugzilla.

Things which we should track here are significantly-sized outstanding development activities which resolve big bugs or which address missing features & speedups.

I've organised it into three main sections:

must-fix bugs which require significant amounts of work/restructuring to fix.
late features and speedups.
Important driver bugs. This wasn't supposed to be here, but various contributors sent me a lot of details, and it would be sad to lose them.

The list is already very long, and very incomplete. Additions (and removals!!!) are sought. Thanks.

And thanks to the various contributors who helped pull this together.

Must-fix bugs

drivers/char/

TTY locking is broken (see FIXME in do_tty_hangup())

"One bug that was found is that the dropping of lock_kernel from do_exit caused races in the exit tty cleanup. There was a patch for that, but I'm not sure it was merged."

drivers/block/

RAID0 dies on strangely aligned BIOs

- Need to hoist BIO-split code out of device mapper, use that.

(neilb)

1/ RAID5 should work fine. It accepts any sort of bio and always submits a 1-page bio to the underlying device, and if my understanding is correct, every device must be able to handle a single page bio, no matter what the alignment (which is why raid0 has a problem - it doesn't).

2/ RAID1 works pretty well. The only improvement needed is to define a merge_bvec_fn function which passes the question down to lower layers. This should be easy except for the small fact that it is impossible :-) There is no enforced pairing between calls to merge_bvec_fn and submit_bh, so it is possible that a hot spare with different restrictions could get swapped in between the one and the other and could confuse things. I suspect that can be worked around somehow though...

Someone sent me a patch that is sorely needed - it allows you to simply call blk_queue_stack() (or something like that), and it will get your stacked limits set appropriately.

3/ I just realised that raid0 is easier than I had previously thought. We don't need the completely functional bio splitting that dm has. We only need to be able to split a bio that has just one page as the use of merge_bvec_fn will ensure that we never get a larger bio that we cannot handle. And splitting a bio with only one page is a lot easier. I now have code in my tree that implements this quite cleanly and will probably post a patch during the week.
ideraid hasn't been ported to 2.5 at all yet.
CD burning. There are still a few quirks to solve wrt SG_IO and ide-cd.

Jens: The basic hang has been solved (double fault in ide-cd), there still seem to be some cases that don't work too well. Don't really have a handle on those :/
IDE tcq. Either kill it or fix it. Not a "big todo", as such.
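The raid0 note above (item 3/, splitting a bio that holds just one page) comes down to a small piece of arithmetic. The sketch below is not the code from neilb's tree: struct one_seg_io and split_at_chunk() are hypothetical stand-ins, and it assumes a power-of-two chunk size that is at least as large as the request, so a request can cross at most one chunk boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a bio that carries exactly one segment. */
struct one_seg_io {
    uint64_t sector;    /* first sector of the request */
    uint32_t sectors;   /* length in sectors */
};

/*
 * Split io at the next chunk boundary if it crosses one.
 * Writes the resulting pieces to out[] and returns how many there are
 * (1 if no split was needed, 2 otherwise).
 */
int split_at_chunk(const struct one_seg_io *io, uint32_t chunk_sectors,
                   struct one_seg_io out[2])
{
    /* Assumptions of this sketch, checked rather than silently relied on. */
    assert(chunk_sectors != 0 && (chunk_sectors & (chunk_sectors - 1)) == 0);
    assert(io->sectors != 0 && io->sectors <= chunk_sectors);

    uint32_t offset = (uint32_t)(io->sector & (chunk_sectors - 1));
    uint32_t room = chunk_sectors - offset;   /* sectors left in this chunk */

    if (io->sectors <= room) {
        out[0] = *io;                          /* fits entirely in one chunk */
        return 1;
    }

    out[0].sector = io->sector;                /* up to the boundary */
    out[0].sectors = room;
    out[1].sector = io->sector + room;         /* from the boundary onward */
    out[1].sectors = io->sectors - room;
    return 2;
}
```

With 128-sector chunks, an 8-sector request starting at sector 126 becomes a 2-sector piece at 126 and a 6-sector piece at 128, each of which maps to a single member device.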

drivers/video/

Lots of drivers don't compile, others do but don't work.

fs/

NFS client gets an OOM deadlock.

- Some fixes exist in -mm. Seem to mostly work.
NFS client runs very slowly consuming 100% CPU under heavy writeout.

- Unsubtle fix exists in -mm. (Looks like it's fixed anyway).
ext3 data=journal mode is bust.
ext3/htree doesn't play right with NFS server. 90% fixed in -mm.
AIO/direct-IO writes can race with truncate and wreck filesystems.

- Easy fix is to only allow the feature for S_ISBLK files.
davej: NFS seems to have a really bad time for some people. (Including myself on one testbox). The common factor seems to be a high spec client torturing an underpowered NFS server with lots of IO. (fsx/fsstress etc show this up). Lots of "NFS server cheating" messages get dumped, and a whole lot of bogus packets start appearing. They look severely corrupted, (they even crashed ethereal once 8-)

kernel/

O(1) scheduler starvation, poor behaviour seems unresolved.

Jens: "I've been running 2.5.67-mm3 on my workstation for two days, and it still doesn't feel as good as 2.4. It's not a disaster like some revisions ago, but it still has occasional CPU "stalls" where it feels like a process waits for half a second or so for CPU time. That is very noticeable."

Also see Mike Galbraith's work.
Alan: 32bit uid support is *still* broken for process accounting.

(Test case?)

mm/

Overcommit accounting gets wrong answers

- underestimates reclaimable slab, gives bogus failures when dcache&icache are large.

- gets confused by reclaimable-but-not-freed truncated ext3 pages. Lame fix exists in -mm.
Proper user level no overcommit also requires adding a root margin

modules

(Rusty)

The .modinfo patch needs to go in. It's trivial, but it's the major missing functionality vs. 2.4. Keeps bouncing off Linus.
__module_get(): "I know I have a refcount already and I don't care if they're doing rmmod --wait, gimme.". Keeps bouncing off Linus.
Per-cpu support inside modules (have patch, in testing).
driver class code is getting redone. I have this now working, and will send it out in a few days.

net/

(davem)

UDP apps can in theory deadlock, because the ip_append_data path can end up sleeping while the socket lock is held.

It is OK to sleep with the socket lock held, normally. But in this case the sleep happens while waiting for socket memory/space to become available, if another context needs to take the socket lock to free up the space we could hang.

I sent a rough patch on how to fix this to Alexey, and he is analyzing the situation. I expect a final fix from him next week or so.
Semantics for IPSEC during operations such as TCP connect suck currently.

When we first try to connect to a destination, we may need to ask the IPSEC key management daemon to resolve the IPSEC routes for us. For the purposes of what the kernel needs to do, you can think of it like ARP. We can't send the packet out properly until we resolve the path.

What happens now for IPSEC is basically this:

O_NONBLOCK: returns -EAGAIN over and over until route is resolved

!O_NONBLOCK: Sleeps until route is resolved

These semantics are total crap. The solution, which Alexey is working on, is to allow incomplete routes to exist. These "incomplete" routes merely put the packet onto a "resolution queue", and once the key manager does its thing we finish the output of the packet. This is precisely how ARP works.

I don't know when Alexey will be done with this.
There are those mysterious TCP hangs of established state sockets. Someone has to get a good log in order for us to effectively debug this.
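The "incomplete route" idea from the IPSEC item above can be shown with a toy model. This is only a sketch of the queue-then-flush pattern davem describes, not Alexey's design: struct route, struct packet, RESQ_MAX and both function names are invented for illustration, and a counter stands in for real transmission.

```c
#include <stddef.h>

#define RESQ_MAX 8   /* arbitrary queue bound for this sketch */

struct packet { int id; };

/* Hypothetical route that may not be resolved yet. */
struct route {
    int resolved;                       /* 0 until the key manager answers */
    struct packet *pending[RESQ_MAX];   /* packets parked in the meantime */
    size_t npending;
    int sent;                           /* stand-in for real transmission */
};

static void transmit(struct route *rt, struct packet *p)
{
    (void)p;
    rt->sent++;
}

/*
 * Output path: never sleeps and never asks the caller to retry.
 * Returns 0 if the packet was sent or queued, -1 if the queue was full.
 */
int route_output(struct route *rt, struct packet *p)
{
    if (rt->resolved) {
        transmit(rt, p);
        return 0;
    }
    if (rt->npending == RESQ_MAX)
        return -1;
    rt->pending[rt->npending++] = p;
    return 0;
}

/* Called once resolution completes: finish the output of queued packets. */
void route_resolved(struct route *rt)
{
    rt->resolved = 1;
    for (size_t i = 0; i < rt->npending; i++)
        transmit(rt, rt->pending[i]);
    rt->npending = 0;
}
```

The point of the pattern is that connect() and friends see neither repeated -EAGAIN nor an open-ended sleep; the packet simply waits on the route until the path is known.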

net/*/netfilter/

(Rusty)

Handle non-linear skbs everywhere. This is going in via Dave now.
Rework conntrack hashing.
Module relationship bogosity fix (trivial, have patch).

global

Lots of 2.4 fixes including some security are not in 2.5
There are about 60 or 70 security related checks that need doing (copy_user etc) from Stanford tools
A couple of hundred real looking bugzilla bugs

Not-ready features and speedups

drivers/block/

Framework for selecting IO schedulers. This is the main one really. Once this is in place we can drop in new schedulers any old time, no risk.
Dynamic disk request allocation. Patch exists.
Runtime-selectable disk scheduler framework.
Anticipatory scheduler. Working OK now, still has problems with seeky OLTP-style loads.
CFQ scheduler. Seems to work but Jens planning significant rework.
The feral.com qlogic driver: needs work.

fs/

reiserfs_file_write() speedup. There are concerns that some applications do the wrong thing with large stat.st_blksize.
ext3 lock_kernel() removal: that part works OK and is mergeable. But we'll also need to make lock_journal() a spinlock, and that's deep surgery.
32bit quota needs a lot more testing but may work now
Integrate Chris Mason's 2.4 reiserfs ordered data and data journaling patches. They make reiserfs a lot safer.
(Trond:) Yes: I'm still working on an atomic "open()", i.e. one where we short-circuit the usual VFS path_walk() + lookup() + permission() + create() + .... bullsh*t...

I have several reasons for wanting to do this (all of them related to NFS of course, but much of the reasoning applies to *all* networked file systems).

1) The above sequence is simply not atomic on *any* networked filesystem.

2) It introduces a sh*tload of completely unnecessary RPC calls (why do a 'permission' RPC call when the server is in *any* case going to tell you whether or not this operation is allowed. Why do a 'lookup()' when the 'create()' call can be made to tell you whether or not a file already exists).

3) It is incompatible with some operations: the current create() doesn't pass an 'EXCLUSIVE' flag down to the filesystems.

4) (NFS specific?) open() has very different cache consistency requirements when compared to most other VFS operations.

I'd very much like for something like Peter Braam's 'lookup with intent' or (better yet) for a proper dentry->open() to be integrated with path_walk()/open_namei(). I'm still working on the latter (Peter has already completed the lookup with intent stuff).

kernel/

(Rusty)

Zippel's Reference count simplification. Tricky code, but cuts about 120 lines from module.c. Patch exists, needs stressing.
/proc/kallsyms. What most people really wanted from /proc/ksyms. Patch exists.
Fix module-failed-init races by starting module "disabled". Patch exists, requires some subsystems (ie. add_partition) to explicitly say "make module live now". Without patch we are no worse off than 2.4 etc.
Integrate userspace irq balancing daemon.

mm/

objrmap: concerns over page reclaim performance at high sharing levels, and interoperation with nonlinear mappings is hairy.
Readd and make /proc/sys/vm/freepages writable again so that boxes can be tuned for heavy interrupt load.

net/

(davem)

Real serious use of IPSEC is hampered by lack of MPLS support. MPLS is a switching technology that works by switching based upon fixed length labels prepended to packets. Many people use this and IPSEC to implement VPNs over public networks, it is also used for things like traffic engineering.

A good reference site is:

http://www.mplsrc.com/

Anyways, an existing (crappy) implementation exists. I've almost completed a rewrite, I should have something in the tree next week.
Sometimes we generate IP fragments when it truly isn't necessary.

The way IP fragmentation is specified, each fragment must be modulo 8 bytes in length. So suppose the device has an MTU that is not 0 modulo 8, ethernet even classifies in this way. 1500 == (8 * 187) + 4

Our IP fragmenting engine can fragment on packets that are sized within the last modulo 8 bytes of the MTU. This happens in obscure cases, but it does happen.

I've proposed a fix to Alexey, whereby very late in the output path we check the packet, if we fragmented but the data length would fit into the MTU we unfragment the packet.

This is low priority, because technically it creates suboptimal behavior rather than mis-operation.
IPV4 output engine changes for IPSEC need to be moved over to IPV6.

IPV6 ipsec works but gravely suboptimally in some cases. It is also for this reason that the zerocopy UDP stuff isn't functional on the ipv6 side.

The USAGI project (www.linux-ipv6.org) is working with Alexey on this work.
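The arithmetic behind the unnecessary-fragmentation item above can be made concrete. IP fragment offsets are counted in 8-byte units, so every fragment except the last carries a multiple of 8 data bytes. The sketch below shows one way an over-eager check could arise; it is an illustration under assumptions, not the kernel's actual code, and overeager_check() in particular is a hypothetical reconstruction.

```c
#include <stdint.h>

/* Largest data payload a non-final fragment can carry: rounded down to 8. */
uint32_t frag_payload(uint32_t mtu, uint32_t hdr_len)
{
    return (mtu - hdr_len) & ~7u;
}

/* The real requirement: fragment only if the whole datagram exceeds the MTU. */
int needs_fragmenting(uint32_t total_len, uint32_t mtu)
{
    return total_len > mtu;
}

/*
 * Hypothetical over-eager test: compares against the rounded-down fragment
 * size instead of the MTU, so datagrams that land in the last few bytes
 * below the MTU get fragmented even though they would fit.
 */
int overeager_check(uint32_t total_len, uint32_t mtu, uint32_t hdr_len)
{
    return total_len > hdr_len + frag_payload(mtu, hdr_len);
}
```

For example, with an MTU of 1500 and a 24-byte header (IP options present), the payload room is 1476 bytes, which rounds down to 1472. A 1498-byte datagram fits the MTU, yet it exceeds 24 + 1472 = 1496, so the over-eager test would split it.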

net/*/netfilter/

Lots of misc. cleanups, which are happening slowly.
davem: Netfilter needs to stop linearizing packets as much as possible.

Zerocopy output packets are basically undone by netfilter because all of it assumed it was working with linear socket buffers.

Rusty is fixing this piece by piece. He is nearly done with this work.

power management

(Pat) There is some preliminary work at bk://ldm.bkbits.net/linux-2.5-power, though I'm currently in the process of reworking it.

It includes:

New device power management core code, both for individual devices, and for global state transitions.
A generic user interface for triggering system power state transitions.
Arch-independent code for performing state transitions, that calls platform-specific methods along the way.
A better suspend-to-disk mechanism than swsusp.

There are various other details to be worked out, which are the real fun part. And of course, driver support, but that is something that can happen at any time.

(Alan)
PCI locking
Frame buffer restore codepaths (that requires some deep PCI magic)
XFree86 hooks
AGP restoration
DRI restoration
IDE suspend/resume without races (Ben is looking at this a little)
How to deal with devices that babble (some stuff we have to global IRQ off to save, and global IRQ on -after- we recover with APM)
Pat's swsusp rework?

arch/i386/

Andi: i386 sub architectures for common boxes (in particular bigsmp and summit) need to be runtime probed options, not compile time. Vendors cannot ship a separate kernel rpm for each of these cases. (patch is in -mm, works OK).
Also PC9800 merge needs finishing to the point we want for 2.6 (not all).
ES7000 wants merging (now we are all happy with it). That shouldn't be a big problem.

global

64-bit dev_t. Seems almost ready, but it's not really known how much work is still to do. Patches exist in -mm but with the recent rise of the neo-viro I'm not sure where things are at.
We need a kernel side API for reporting error events to userspace (could be async to 2.6 itself)

(Prototype core based on netlink exists)
Kai: Introduce a sane, easy and standard way to build external modules
Kai: Allow separate src/objdir

drivers

Alan: PCI random reordering from 2.4 to 2.5 isn't understood yet (might be fixed now?)
Alan: We have multiple drivers walking the pci device lists and also using things like pci_find_device in unsafe ways with no refcounting. I think we have to make pci_find_device etc refcount somewhere and add pci_device_put as was done with networking.
Lots of network drivers don't even build
Alan: PCI hotplug is unsafe (locking is totally screwed)
Ditto cardbus
Alan: Cardbus/PCMCIA requires all Russell's stuff is merged to do multiheader right and so on

drivers/acpi/

davej: ACPI has a number of failures right now. There are a number of entries in bugzilla which could all be the same bug. It manifests as "network card doesn't receive packets"; booting with 'acpi=off noapic' fixes it.
davej: There's also another nasty 'doesn't boot' bug which quite a few people (myself included) are seeing on some boxes (especially laptops).

drivers/block/

Alan: Partition handling is hosed for DM users. (I have some partly debugged patches in the -ac tree, but Andries objects to them and I think his user knows magic options hack is unacceptable too. Mostly this is figuring out the right answer)
Floppy is almost unusably buggy still

drivers/char/

Alan: Multiple serious bugs in the DRI drivers (most now with patches thankfully). "The badness I know about is almost entirely IRQ mishandling. DRI failing to mask PCI irqs on exit paths."
Various suspect things in AGP.

drivers/ide/

(Alan)

IDE requires bio walking
IDE PIO has occasional unexplained PIO disk eating reports
IDE has multiple zillions of races/hangs in 2.5 still
IDE eats disks with HPT372N on 2.5.x
IDE scsi needs rewriting
IDE needs significant reworking to handle Simplex right
IDE hotplug handling for 2.5 is completely broken still

drivers/isdn/

(Kai, rmk)

isdn_tty locking is completely broken (cli() and friends)
fix lots of remaining bugs in the isdn link layer / hisax protocol layer / hisax subdrivers, so that at least 99% of the users have a usable ISDN subsystem
fix other drivers
lots more cleanups, adaption to recent APIs etc
fixup tty-based ISDN drivers which provide TIOCM* ioctls (see my recent 3-set patch for serial stuff)

Alternatively, we could re-introduce the fallback to driver ioctl parsing for these if not enough drivers get updated.
fixup the usb-serial core and drivers to provide support for this patch.

drivers/net/

davej: Either Wireless network drivers or PCMCIA broke somewhen. A configuration that worked fine under 2.4 doesn't receive any packets. Need to look into this more to make sure I don't have any misconfiguration that just 'happened to work' under 2.4

drivers/scsi/

Half of SCSI doesn't compile

arch/i386/

2.5.x won't boot on some 440GX
2.5.x doesn't handle VIA APIC right yet - don't know why
ACPI needs the relax patches merging to work on lots of laptops
ECC driver questions are not yet sorted (DaveJ is working on this)

arch/x86_64/

(Andi)

time handling is broken. Need to move up 2.4 time.c code.
memory corruption with IOMMU pci_free_consistent - often causes crashes at shutdown. This is rather mysterious, the code is basically identical to 2.4 which works fine. Can only be seen on systems with >4GB of memory or with iommu=force
Another report of a crash at shutdown on Simics with no iommu when all memory was used. Could be related to the one above.
change_page_attr corrupts memory/crashes. Breaks some AGP users.
NMI watchdog seems to tick too fast
some fixes from 2.4 still need to be merged
not very well tested. probably more bugs lurking.

I found a new bad class of bugs (slowly working on fixing them, also present in 2.4)

Machine Check handlers use printk in an NMI like (ignoring cli) situation. This can deadlock on the console or low level character driver (serial, vga) locks. Not all MCEs are fatal (e.g. corrected ECC errors) and the kernel should be safely able to continue.

Need to buffer the printk in an atomic fashion (e.g. in a ring buffer managed with cmpxchg) and cause a self IPI that triggers an interrupt after the next sti. This is easy with x86/APIC mode, but difficult with PIC (the 8259 supports it in theory, but it's not clear that all clones in various chipsets do; also changing the programming may be risky). Fallback: pick it up with the next timer interrupt by adding a check there.
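A ring buffer managed with compare-and-exchange, as Andi proposes, might look roughly like the following user-space sketch. It uses C11 atomics in place of the kernel's cmpxchg, and every name in it (struct mce_ring, mce_log(), mce_drain_one(), the sizes) is an assumption made for illustration. Writers claim a slot without taking any lock; a single reader, standing in for the self IPI or timer handler, drains completed slots later from a safe context.

```c
#include <stdatomic.h>
#include <string.h>

#define MCE_SLOTS   16   /* arbitrary sizes for this sketch */
#define MCE_MSG_LEN 64

struct mce_ring {
    atomic_uint head;               /* total slots ever claimed by writers */
    atomic_uint tail;               /* total slots ever drained */
    atomic_uint ready[MCE_SLOTS];   /* 1 once a slot's text is complete */
    char msg[MCE_SLOTS][MCE_MSG_LEN];
};

/* Writer side: lock-free. Returns 0 on success, -1 if the ring is full. */
int mce_log(struct mce_ring *r, const char *text)
{
    unsigned int head = atomic_load(&r->head);
    unsigned int slot;

    do {
        if (head - atomic_load(&r->tail) >= MCE_SLOTS)
            return -1;              /* full: drop rather than block */
        /* A failed exchange reloads head, and the check above reruns. */
    } while (!atomic_compare_exchange_weak(&r->head, &head, head + 1));

    slot = head % MCE_SLOTS;
    strncpy(r->msg[slot], text, MCE_MSG_LEN - 1);
    r->msg[slot][MCE_MSG_LEN - 1] = '\0';
    atomic_store(&r->ready[slot], 1);   /* publish only after the text is in */
    return 0;
}

/* Single reader: copies out one finished message. Returns 1 if it did. */
int mce_drain_one(struct mce_ring *r, char out[MCE_MSG_LEN])
{
    unsigned int tail = atomic_load(&r->tail);
    unsigned int slot = tail % MCE_SLOTS;

    if (tail == atomic_load(&r->head) || !atomic_load(&r->ready[slot]))
        return 0;                   /* empty, or the writer is not done yet */

    memcpy(out, r->msg[slot], MCE_MSG_LEN);
    atomic_store(&r->ready[slot], 0);
    atomic_store(&r->tail, tail + 1);
    return 1;
}
```

Dropping messages when the ring is full is a design choice of this sketch; the alternative of overwriting old entries would need extra care so the reader never sees a half-written slot.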

New entries for the x86-64 list (actually I'm not sure they are all x86-64 specific, just that the bug has been seen there)

32bit core dumps do not dump 32bit SSE data currently. they should
AT_GID/AT_UID ELF environment vector contains crap currently. This breaks debugging of the shared linker for suid programs because ld.so always thinks it is suid/not called by root and ignores environment variables.
NIS/ypbind breaks with an abort() in glibc. Only happens on 2.5, 2.4 is fine.
need /proc/kcore access for kernel mappings that are outside vmalloc (in particular the kernel and the modules are special mappings on x86-64; other architectures have the same problem)

Best would be to put them in the vmalloc mappings list, but that requires some more fixes in other code that uses it. Also /proc/kcore seems to have some 64bit signedness bugs (patch for 2.4 exists)

Generic item:

need to share the ioctl 32bit emulation handlers between ports. Pavel has a patch, but he's running into difficulties with merging it.

To the generic item, Pavel Machek replied that his patch had been accepted. Andi replied that things were still quite broken; and Pavel said a new patch was on its way to Linus.

Elsewhere, on Andrew's item about IDE suspend/resume without races, Benjamin Herrenschmidt said, "I have something that works not too badly for PPC already but that needs some cleanup, to be tested/adapted to Pat's new work (especially tested against his swsusp, and we shall still verify if it fits x86 needs)."

drivers/scsi/

large parts of the locking are hosed or nonexistent
- shost->my_devices isn't locked down at all
- the host list is locked but not refcounted, mess can happen when the spinlock is dropped
- there are lots of members of struct Scsi_Host/scsi_device/scsi_cmnd with very unclear locking, many of them probably want to become atomic_t's or bitmaps (for the 1bit bitfields).
- there's lots of volatile abuse in the scsi code that needs to be thought about.
- there's some global variables incremented without any locks

fs/devfs/

there's a fundamental lookup vs devfsd race that's only fixable by introducing a lookup vs devfs deadlock. I can't see how this is fixable without getting rid of the current devfsd design. Mandrake seems to have a workaround for this so this is at least not triggered so easily, but that's not what I'd consider a fix.

Martin Schlemmer got sound working on his ICH5, by simply adding the ICH5 IDs to the list. That worked for his system, but Jeff Garzik replied, "Unfortunately this doesn't work on all ICH5s out there. At the very minimum, for now, it would be nice to match up ich5 and codec pairs, as codec differentiation seems to be what stops this patch from working on all ICH5." And Martin replied:

Hmm, right.

Anybody working on getting support for the 875 Chipset into 2.5? Can I send a 'lspci -vv' to help? I have an Asus P4C800 here (Intel 875p), so I can do some testing if need be.

I've just uploaded version 1.3.8 of the aic79xx driver and version 6.2.33 of the aic7xxx driver. Both are available for 2.4.X and 2.5.X kernels in either bk send format or as a tarball from here:

http://people.FreeBSD.org/~gibbs/linux/SRC/

RPMs and DUDs for various distributions are also available:

http://people.FreeBSD.org/~gibbs/linux/DUD/aic7xxx/
http://people.FreeBSD.org/~gibbs/linux/DUD/aic79xx/
http://people.FreeBSD.org/~gibbs/linux/RPM/aic7xxx/
http://people.FreeBSD.org/~gibbs/linux/RPM/aic79xx/

Someone gave a link to a CNet article which said that the SCO group claimed to have found instances of copyrighted UnixWare code in the Linux kernel sources. Chris Friesen replied:

According to an article here:

http://slashdot.org/articles/03/05/01/2332226.shtml?tid=167&tid=99

SCO-Caldera Senior Vice President Chris Sontag explicitly says that the kernel.org kernel is *not* tainted, but that that other stuff that Red Hat and SuSE are including *is*.

Quote from the interview:

"Chris Sontag: We're not talking about the Linux kernel that Linus and others have helped develop. We're talking about what's on the periphery of the Linux kernel."

He doesn't specify exactly what he's talking about, but he makes an interesting claim:

"Chris Sontag: We are using objective third parties to do comparisons of our UNIX System V [SCO-owned Unix] source code and Red Hat as an example. We are coming across many instances where our proprietary software has simply been copied and pasted or changed in order to hide the origin of our System V code in Red Hat. This is the kind of thing that we will need to address with many Linux distribution companies at some point."

Hmm. SCO Group Chief Executive Darl McBride says _exactly_ the opposite according to http://msnbc-cnet.com.com/2100-1016_3-999371.html :

"We're finding ... cases where there is line-by-line code in the Linux kernel that is matching up to our UnixWare code.

We're finding code that looks likes it's been obfuscated to make it look like it wasn't UnixWare code -- but it was."

Chris Sontag should get his story straight with his boss before he opens his mouth to the press.

As someone who worked for SCO (or rather Caldera, as it was called at that time) I can tell you this is utter crap. There were very few people actually doing Linux kernel work then (and when the German office was closed down all those left the company) and we really had better things to do than trying to retrofit UnixWare code into the Linux kernel. Especially given that the kernel internals are so different that you'd need a big glue layer to actually make it work and you can guess how that would be ripped apart in a usual lkml review :)

It might be more interesting to look for stolen Linux code in UnixWare, I'd suggest starting with the support for a very well known Linux filesystem in the Linux compat addon product for UnixWare.

Jim Nance said, "Wouldn't it be hilarious if whatever code SCO is talking about when they say there is Unix code in Linux turns out to be code some SCO employee ripped out of some GPL program and stuck into UnixWare. That is actually far more likely than what they allege."

I've just packaged up the latest Linux hotplug scripts into a release, which can be found at:

http://sourceforge.net/project/showfiles.php?group_id=17679

Or from your favorite kernel.org mirror at:

kernel.org/pub/linux/utils/kernel/hotplug/hotplug-2003_05_01.tar.gz

or for those who like bz2 packages:

kernel.org/pub/linux/utils/kernel/hotplug/hotplug-2003_05_01.tar.bz2

I've also packaged up some pre-built (and signed) Red Hat 7.3 based rpms:

kernel.org/pub/linux/utils/kernel/hotplug/hotplug-2003_05_01-1.noarch.rpm

kernel.org/pub/linux/utils/kernel/hotplug/hotplug-base-2003_05_01-1.noarch.rpm

The source rpm is available if you want to rebuild it for other distros or versions of Red Hat at:

kernel.org/pub/linux/utils/kernel/hotplug/hotplug-2003_05_01-1.src.rpm

The main web site for the linux-hotplug project can be found at:

http://linux-hotplug.sf.net/

which contains lots of documentation on the whole linux-hotplug process.

There are lots of changes in this release from the last one (which was almost 8 months ago), most of them make things work better for systems running 2.5, but some of them fix problems that 2.4 users will see.

Some of the major changes in this release are:

fix for the lack of a drivers file in usbfs in 2.5.
initial scsi.agent for 2.5, modprobes sd_mod or sr_mod
call devlabel if it's present
made /sbin/hotplug a tiny multiplexer program, moving the original /sbin/hotplug program to /etc/hotplug.d/default/default.hotplug
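
The multiplexer change in the last item can be pictured with a small sketch. The release notes only say that /sbin/hotplug became a tiny multiplexer and that the original program now lives at /etc/hotplug.d/default/default.hotplug. The per-subsystem directory, the *.hotplug suffix filter, and the function name below are assumptions made for illustration, not a description of the shipped program.

```python
import glob
import os


def hotplug_handlers(subsystem, root="/etc/hotplug.d"):
    """Return the handler scripts to run for one hotplug event, in order.

    Assumption: handlers live in <root>/<subsystem>/ and <root>/default/
    and are named *.hotplug. Only the default/default.hotplug path comes
    from the release notes; the rest is hypothetical.
    """
    # Avoid listing the default directory twice.
    directories = [subsystem] if subsystem == "default" else [subsystem, "default"]
    handlers = []
    for directory in directories:
        pattern = os.path.join(root, directory, "*.hotplug")
        # Sort so the order within one directory is predictable.
        handlers.extend(sorted(glob.glob(pattern)))
    return handlers
```

A multiplexer built this way would then run each returned handler with the event's arguments and environment, so the old monolithic script becomes just one handler among several.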

The full ChangeLog extract since the last release is included below for those who want to know everything that's been changed, and who to blame for them :)

Keith Owens announced kdb kernel debugger 4.2 for the 2.4.20 kernel on i386 and ia64 systems. He said:

Changelog extracts since v4.1.

2003-05-02 Keith Owens <[email protected]>

Some architectures have problems with the initial empty kallsyms section so revert to three kallsyms passes.
Flush buffered input at startup and at 'more' prompt.
Only print 'more' prompt when longjmp data is available.
Print more data for buffers and inodes.
Disable kill command when O(1) scheduler is installed, the code needs to be redone for O(1).
The kernel has an undocumented assumption that enable_bh() is always called with interrupts enabled, make it so.
Print trailing punctuation even for symbols that are not in kernel.
Add read/write access to user pages. Vamsi Krishna S., IBM
Rename cpu_is_online to cpu_online, as in 2.5.
O(1) scheduler removes init_task so kdb maintains its own list of active tasks.
Delete btp 0 <cpuid> option, it needed init_tasks.
Clean up USB keyboard support. Steven Dake.
Sync with XFS 2.4.20 tree.
kdb v4.2-2.4.20-common-1.

2.4.20-i386-1

2003-05-02 Keith Owens <[email protected]>

Add kdba_fp_value().
Limit backtrace size to catch loops.
Add read/write access to user pages. Vamsi Krishna S., IBM
Clean up USB keyboard support. Steven Dake.
kdb v4.2-2.4.20-i386-1.

2.4.20-ia64-020821-1

2003-05-02 Keith Owens <[email protected]>

Add kdba_fp_value().
Limit backtrace size to catch loops.
Print spinlock name in ia64_spinlock_contention.
Tweak INIT slave stack lock and handler.
Add read/write access to user pages. Vamsi Krishna S., IBM
Rename cpu_is_online to cpu_online, as in 2.5.
Clean up USB keyboard support.
Clean up serial console support.
kdb v4.2-2.4.20-ia64-020821-1.

Attached is a (link to a) forward port of the sparc64 kdb patch to v4.2 of kdb. It is still rough around the edges, but it at least builds and boots and is somewhat usable.

You must apply the kdb common patch and the patch to kdb common I sent earlier to get it to build properly.

Enjoy!

http://www.dslextreme.com/users/tomduffy/kdb-v4.2-2.4.20-sparc64-1.bz2

2May2003-5May2003 (49 posts) Subject: "[Announcement] "Exec Shield", new Linux security feature"

We are pleased to announce the first publicly available source code release of a new kernel-based security feature called the "Exec Shield", for Linux/x86. The kernel patch (against 2.4.21-rc1, released under the GPL/OSL) can be downloaded from:

http://redhat.com/~mingo/exec-shield/

The exec-shield feature provides protection against stack, buffer or function pointer overflows, and against other types of exploits that rely on overwriting data structures and/or putting code into those structures. The patch also makes it harder to pass in and execute the so-called 'shell-code' of exploits. The patch works transparently, ie. no application recompilation is necessary.

Background:
-----------

It is commonly known that x86 pagetables do not support the so-called executable bit in the pagetable entries - PROT_EXEC and PROT_READ are merged into a single 'read or execute' flag. This means that even if an application marks a certain memory area non-executable (by not providing the PROT_EXEC flag upon mapping it) under x86, that area is still executable, if the area is PROT_READ.

Furthermore, the x86 ELF ABI marks the process stack executable, which requires that the stack is marked executable even on CPUs that support an executable bit in the pagetables.

This problem has been addressed in the past by various kernel patches, such as Solar Designer's excellent "non-exec stack patch". These patches mostly operate by using the x86 segmentation feature to set the code segment 'limit' value to a certain fixed value that points right below the stack frame. The exec-shield tries to cover as much virtual memory via the code segment limit as possible - not just the stack.

Implementation:
---------------

The exec-shield feature works via the kernel transparently tracking executable mappings an application specifies, and maintains a 'maximum executable address' value. This is called the 'exec-limit'. The scheduler uses the exec-limit to update the code segment descriptor upon each context-switch. Since each process (or thread) in the system can have a different exec-limit, the scheduler sets the user code segment dynamically so that always the correct code-segment limit is used.

the kernel caches the user segment descriptor value, so the overhead in the context-switch path is a very cheap, unconditional 6-byte write to the GDT, costing 2-3 cycles at most.
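
To make the 'limit' value concrete: an x86 segment descriptor stores its limit in a 20-bit field, and when the granularity flag is set that field counts 4 KB pages rather than bytes. The sketch below assumes the user code segment is page-granular, which the announcement does not state; the function names are hypothetical and the example addresses are only illustrative.

```python
PAGE_SHIFT = 12          # x86 pages are 4 KB
LIMIT_FIELD_MAX = 0xFFFFF  # the descriptor limit field is 20 bits wide


def code_segment_limit_field(exec_limit):
    """Return the page-granular limit field that covers exec_limit.

    Assumption: the descriptor has its granularity flag set, so the
    field is expressed in 4 KB pages.
    """
    field = exec_limit >> PAGE_SHIFT
    if field > LIMIT_FIELD_MAX:
        raise ValueError("address does not fit in a 20-bit limit field")
    return field


def effective_limit(field):
    """Highest byte offset reachable through a page-granular limit field."""
    return (field << PAGE_SHIFT) | 0xFFF


if __name__ == "__main__":
    field = code_segment_limit_field(0x01003FFF)
    print(hex(field), hex(effective_limit(field)))
```

Because the field works in whole pages, any exec-limit is effectively rounded up to the end of its page, which is harmless here since mappings are page-aligned anyway.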

Furthermore, the kernel also remaps all PROT_EXEC mappings to the so-called ASCII-armor area, which on x86 is the addresses 0-16MB. These addresses are special because they cannot be jumped to via ASCII-based overflows. E.g. if a buggy application can be overflown via a long URL:

http://somehost/buggy.app?realyloooooooooooooooooooong.123489719875

then only ASCII (ie. value 1-255) characters can be used by attackers. If all executable addresses are in the ASCII-armor, then no attack URL can be used to jump into the executable code - ie. the attack cannot be successful. (because no URL string can contain the \0 character.) E.g. the recent sendmail remote root attack was an ASCII-based overflow as well.
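
The reasoning can be checked with a few lines of code. Every 32-bit address below 16MB (0x01000000) has a zero top byte, so its in-memory little-endian form always contains a \0 and cannot sit in the middle of a NUL-terminated string. This is a minimal sketch; the function name is made up for illustration.

```python
import struct


def embeddable_in_c_string(address):
    """True if the little-endian 32-bit form of address has no zero byte."""
    return b"\x00" not in struct.pack("<I", address)


if __name__ == "__main__":
    for address in (0x00117000, 0x00FFFFFF, 0x40018123):
        print(hex(address), embeddable_in_c_string(address))
```

Note that a zero byte anywhere in the address is enough to block it, so some addresses above 16MB are also unreachable this way; the armor area is simply the range where that holds for every address.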

With the exec-shield activated, and the 'cat' binary relinked into the ASCII-armor, the following layout is created:

  $ ./cat-lowaddr /proc/self/maps
  00101000-00116000 r-xp 00000000 03:01 319365     /lib/ld-2.3.2.so
  00116000-00117000 rw-p 00014000 03:01 319365     /lib/ld-2.3.2.so
  00117000-0024a000 r-xp 00000000 03:01 319439     /lib/libc-2.3.2.so
  0024a000-0024e000 rw-p 00132000 03:01 319439     /lib/libc-2.3.2.so
  0024e000-00250000 rw-p 00000000 00:00 0
  01000000-01004000 r-xp 00000000 16:01 2036120    /home/mingo/cat-lowaddr
  01004000-01005000 rw-p 00003000 16:01 2036120    /home/mingo/cat-lowaddr
  01005000-01006000 rw-p 00000000 00:00 0
  40000000-40001000 rw-p 00000000 00:00 0
  40001000-40201000 r--p 00000000 03:01 464809     locale-archive
  40201000-40207000 r--p 00915000 03:01 464809     locale-archive
  40207000-40234000 r--p 0091f000 03:01 464809     locale-archive
  40234000-40235000 r--p 00955000 03:01 464809     locale-archive
  bfffe000-c0000000 rw-p fffff000 00:00 0

In the above layout, the highest executable address is 0x01003fff, ie. every executable address is in the ASCII-armor.

this means that not only the stack is non-executable, but lots of mmap()-ed data areas and the malloc() heap is non-executable as well. (some data areas are still executable, but most of them are not.)

the first 1MB of the ASCII-armor is left unused to provide NULL pointer dereference protection and leave space for 16-bit emulation mappings used by XFree86 and others.

Compare this with the memory layout without exec-shield:

  08048000-0804b000 r-xp 00000000 16:01 3367       /bin/cat
  0804b000-0804c000 rw-p 00003000 16:01 3367       /bin/cat
  0804c000-0804e000 rwxp 00000000 00:00 0
  40000000-40012000 r-xp 00000000 16:01 3759       /lib/ld-2.2.5.so
  40012000-40013000 rw-p 00011000 16:01 3759       /lib/ld-2.2.5.so
  40013000-40014000 rw-p 00000000 00:00 0
  40018000-40129000 r-xp 00000000 16:01 4058       /lib/libc-2.2.5.so
  40129000-4012f000 rw-p 00111000 16:01 4058       /lib/libc-2.2.5.so
  4012f000-40133000 rw-p 00000000 00:00 0
  bffff000-c0000000 rwxp 00000000 00:00 0

In this layout none of the executable areas are in the ASCII-armor, plus the exec-limit is 0xbfffffff (3GB) - ie. including all userspace mappings.
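
The two exec-limit figures can be reproduced from userspace by scanning a maps listing for the highest address in any mapping that carries the 'x' permission. This is only a sketch of the idea: the kernel patch tracks the value as mappings are created rather than parsing text, the function name is hypothetical, and the sample data is an abbreviated copy of the two listings above.

```python
def exec_limit(maps_text):
    """Return the highest executable address in a /proc/<pid>/maps listing."""
    limit = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        address_range, perms = fields[0], fields[1]
        if "x" not in perms:
            continue  # only executable mappings count
        end = int(address_range.split("-")[1], 16)
        # The end address is exclusive, so the last usable byte is end - 1.
        limit = max(limit, end - 1)
    return limit


WITH_SHIELD = """\
00101000-00116000 r-xp 00000000 03:01 319365     /lib/ld-2.3.2.so
00117000-0024a000 r-xp 00000000 03:01 319439     /lib/libc-2.3.2.so
01000000-01004000 r-xp 00000000 16:01 2036120    /home/mingo/cat-lowaddr
40001000-40201000 r--p 00000000 03:01 464809     locale-archive
bfffe000-c0000000 rw-p fffff000 00:00 0
"""

WITHOUT_SHIELD = """\
08048000-0804b000 r-xp 00000000 16:01 3367       /bin/cat
0804c000-0804e000 rwxp 00000000 00:00 0
40000000-40012000 r-xp 00000000 16:01 3759       /lib/ld-2.2.5.so
bffff000-c0000000 rwxp 00000000 00:00 0
"""

if __name__ == "__main__":
    print(format(exec_limit(WITH_SHIELD), "#010x"))
    print(format(exec_limit(WITHOUT_SHIELD), "#010x"))
```

In the second listing it is the executable stack mapping at the top of userspace that pushes the limit up to the 3GB boundary.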

Note that the kernel will relocate every shared-library to the ASCII-armor, but the binary address is determined at link-time. To ease the relinking of applications to the ASCII-armor, Arjan Van de Ven has written a binutils patch (binutils-2.13.90.0.18-elf-small.patch), which adds a new 'ld' flag "ld -melf_i386_small" (or "gcc -Wl,-melf_i386_small") to relink applications into the ASCII-armor. (The patch can be found at the exec-shield URL as well.)

Overhead:
---------

the patch was designed to be as efficient as possible. There's a very minimal (couple of cycles) tracking overhead for every mmap() system-call that uses PROT_EXEC, plus there's the 2-3 cycles cost per context-switch.

Limitations:
------------

This feature will not protect against every type of attack.

E.g. it will not help if an overflow can be used to overwrite a local variable which changes the flow of control in a way that compromises the system. But we do believe that this feature will stop every attack that is purely operating by overflowing the return address on the stack, or overflowing a function pointer in the heap. Furthermore, exec-shield makes it quite hard to mount a successful attack even in the other cases, because it inhibits the execution of exploit shell-code, in most cases.

also, if the overflow is within the exec-shield itself (e.g. within the data section of one of the shared library objects in the ASCII-armor) then the overflow might be possible to exploit.

All in all, exec-shield is one barrier against attacks, not blanket 100% protection in any way. The most efficient security can be provided by installing as many layers as possible.

To provide as good protection as possible, there's no trampoline workaround in the exec-shield code - ie. exec-limit violations in the trampoline case are never let through. Applications that need to rely on gcc trampolines will have to use the per-binary ELF flag to make the stack executable again. (The ELF flag is the same as used by Solar Designer's non-exec stack patch, to provide as much compatibility with existing non-exec-stack installations as possible.)

The exec-shield feature will uncover applications that incorrectly assumed that PROT_READ allows execution on x86. One such example is the XFree86 module loader. The latest XFree86 on rawhide.redhat.com fixes this problem. For those who cannot install the XFree86 bugfix at the moment there's a workaround added by the patch, which can be activated via:

echo 1 > /proc/sys/kernel/X-workaround

This will make every iopl() using application (such as X) have the exec-shield disabled. Other applications (sendmail, etc.) will still have the exec-shield enabled. This workaround is default-off. We strongly encourage solving this problem by upgrading X, or by using the 'chstk' utility to make X's stack forced-executable.

Using it:
---------

Apply the exec-shield-2.4.21-rc1-B6 kernel patch to the 2.4.21-rc1 kernel, recompile & install the kernel and reboot into it, that's all.

There is a new boot-time kernel command line option called exec-shield=, which has 4 values. Each value represents a different level of security:

 exec-shield=0    - always-disabled
 exec-shield=1    - default disabled, except binaries that enable it
 exec-shield=2    - default enabled, except binaries that disable it
 exec-shield=3    - always-enabled

the current patch defaults to 'exec-shield=2'. The security level can also be changed runtime, by writing the level into /proc:

echo 0 > /proc/sys/kernel/exec-shield

IMPORTANT: security-relevant applications that were started while the exec-shield was disabled, will have an executable stack and will thus have to be restarted if the exec-shield is enabled again.

I've also uploaded a modified version of Solar Designer's chstk.c code, which adds the options necessary to change the 'enable non-exec stack' ELF flag:

  $ ./chstk
  Usage: ./chstk OPTION FILE...
  Manage stack area executability flag for binaries

    -e    enable execution permission
    -E    enable non-execution permission
    -d    disable execution permission
    -D    disable non-execution permission
    -v    view current flag state

ie. there are two distinct flags, one for forcing an executable stack, one for forcing a non-executable stack. If both flags are zero then the binary will follow the system default.

ie. it's possible to use an exec-shield level of 1, and enable the non-exec stack on a per binary basis, by using the 'exec-shield=1' boot option and changing binaries one at a time:

./chstk -E /usr/sbin/sendmail

(People migrating production environments to an exec-shield kernel might prefer this variant.)
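
The interaction between the four levels and the two per-binary flags can be summarized as a small decision function. This is a reading of the description above, not code from the patch: it assumes that "binaries that enable it" are those marked with chstk -E and "binaries that disable it" are those marked with chstk -e, and since the announcement does not say what happens when both flags are set, the sketch simply rejects that combination. All names are hypothetical.

```python
def shield_active(level, force_exec_stack=False, force_nonexec_stack=False):
    """Decide whether exec-shield applies to one binary.

    level               -- the exec-shield= value, 0 through 3
    force_exec_stack    -- binary asks for an executable stack (assumed: chstk -e)
    force_nonexec_stack -- binary asks for a non-executable stack (assumed: chstk -E)
    """
    if level not in (0, 1, 2, 3):
        raise ValueError("exec-shield level must be 0, 1, 2 or 3")
    if force_exec_stack and force_nonexec_stack:
        # Assumption: the source does not define this case.
        raise ValueError("conflicting per-binary flags")
    if level == 0:
        return False                 # always disabled
    if level == 3:
        return True                  # always enabled
    if level == 1:
        return force_nonexec_stack   # off unless the binary opts in
    return not force_exec_stack      # level 2: on unless the binary opts out


if __name__ == "__main__":
    # The migration variant described above: level 1 plus chstk -E on sendmail.
    print(shield_active(1, force_nonexec_stack=True))
```

With both flags clear, the function falls back to the system default for the level, matching the statement that such binaries follow the default.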

anyway, comments, suggestions and test feedback are welcome.

a new (-C5) release of the exec-shield patch can be found at:

http://redhat.com/~mingo/exec-shield/exec-shield-2.4.21-rc1-C5

Changes since -B6:

removed the X_workaround - chstk can be used for equivalent functionality. (issue raised by Yoav Weiss)
increase SHLIB_BASE from 1MB to 1MB + 64KB, suggested by Alexandre Julliard, to fix DOS loaders.
fix Pentium/i386 compilation failure in fault.c. (reported by Johannes Walch)
fix signal return bug, found by [email protected].
shared library address randomization, both within and outside the ASCII-shield. This should make remote attacks a little bit more difficult.
process stack randomization. A number of other patches did this as well, it generally helps. (There's no memory wasted because the stack area left out will simply not be paged in.)
turn off shlib relocation if the stack is executable. This is needed for Wine, qemu and other apps that need the low memory range.
do not show the wchan field of non-owned processes, and do not show the maps file either. This should make it a little bit harder to guess library locations for local attackers.

most of the new stuff in this patch (randomization, information filtering) has been done in other patches as well (such as PaX, grsecurity, non-exec stack patch, etc.) - i tried to filter out and add the ones that matter most, do not introduce constraints and are thus uncontroversial.

Various folks were very happy to see this work, and a bunch of people started discussing the implementation and various security issues.

Daniele Pala asked, "Trying to run 'make xconfig' i got into the message 'you don't have installed qt!'...so the xconfig is now dependant from qt? why? what about us poor guy who only use twm and not kde? isn't qt pretty big and fat?" Diego Calleja Garcia replied that 'make gconfig' would use the gtk library; Balram Adlakha said, "I think xconfig should be the "X" based one, qconfig should be the qt based one and gconfig should be the gtk one." Sam Ravnborg invited him to contribute a generic X-based config program.

Ok, I finally found the reason for why some of my machines had trouble with restarting the X server, and it turns out that it's been around since very early February. I bet others must have seen it too, with random crashes on X server restart when the server used AGP (which means that it mainly hit either hw-accelerated 3D setups or the intel integrated graphics which use a UMA model with AGP as the backing store).

That's a big relief for me, as it was the major thing I personally worried about for 2.6.x.

Anyway, that's fixed here, along with a lot of other updates. Much of 2.5.69 is small one-liners to drivers to handle the new IRQ semantics, but there's a lot of other cleanups in there too (Christoph Hellwig continued on his devfs rampage, for example).

NOTE! As of this release I think I'll want to have patches either be _really_ obvious, or they should go through one or more people for approval. In particular, I'm hoping that the paperwork stuff with Andrew should be getting closer to finalized, and that we could start moving over towards a 2.6.x release schedule.

this patch updates the dvb subsystem core.

Fixed problems:

partly reintroduced the DVB_DEVFS_ONLY switch, which was previously wiped out by Alan Cox: if enabled, some really obscure code is not compiled into the kernel that is necessary to xxx
switched from user-land types like __u8 to u8 and uint16_t to u16; this makes the patch rather large.
updated the dvr (digital videorecording) facility
renamed some structures, like "struct dmxdev_s" to "struct dmxdev"
introduced dvb_functions.[ch], where some linux-kernel specific functions are encapsulated. by this, the dvb subsystem stays quite independent from deeper linux kernel functions.
moved dvb_usercopy() to dvb_functions.c -- this is essentially video_usercopy() which should be generic_usercopy() instead... ;-)
Made the dvb-core in dvbdev.c work with devfs again. I had to introduce some #if KERNELVERSION magic again here, sorry. I'll fix it up with the next patchset.

Christoph Hellwig had some criticism of the code, and gave Michael some suggestions. He also said, "your devfs stuff is a mess. I already told one of the DVB folks (it wasn't you IIRC) that I'll publish a 2.5 devfs API on 2.4 header. But first I have to fix the devfs API on 2.5 and randomly bringing back old crap and lots of ifdefs in those changing areas won't help. What's the problem with 2.5, dvb and devfs?" Michael replied:

The main problem is that our development "dvb-kernel" CVS tree *should* compile under 2.4 as well, because most of the dvb-users don't want to participate in kernel development in general, but only in the development of the dvb subsystem. So work is done on the "dvb-kernel" tree, which should be synced with the 2.5 kernel frequently.

So, regarding devfs, I introduced #ifdefs around the functions that have changed recently. That's not nice, I know. But in my eyes it's important to keep the CVS and the kernel version more in sync.

IIRC Gerd Knorr has the same problems with his driver packages (regarding the i2c subsystem mainly), but he has written some perl scripts to remove the #ifdef stuff before submitting his patches...

Christoph felt that it would be best to delay the dvb updates, because "you don't just add ifdefs (which give me lots of rejects and you much uglier code than just using the compat header I'll send to lkml once I'm done with the API changes) but you also change the code that's ifdefed for 2.5 to reverse changes I did. There is a reason why I removed every occurrence of devfs_handle_t from all drivers and the particular reason is that it will go away in the next series of patches." Michael asked how best to proceed, and after a little wrangling, it was agreed that Michael should continue to send updates to either Christoph or Alan Cox, bearing in mind that 2.5 features shouldn't be broken by Michael's updates.



1.	27Apr2003-5May2003	(48 posts)	Proposed System Call To Speed Process Creation
2.	28Apr2003-1May2003	(21 posts)	Some WLAN Chip Specs Secret To Protect Military Communications
3.	29Apr2003-2May2003	(47 posts)	'Must-Fix' Bug List For 2.6 (Or 3.0)
4.	1May2003-2May2003	(3 posts)	OSS Support For ICH5 Sound
5.	1May2003-7May2003	(12 posts)	Aic7xxx And Aic79xx Driver Updates
6.	1May2003-2May2003	(19 posts)	Possible License Violations Within The Kernel Source
7.	1May2003	(1 post)	New Release Of Hotplugging Scripts
8.	1May2003-5May2003	(3 posts)	kdb 4.2 Released
9.	2May2003-5May2003	(49 posts)	New 'Exec Shield' Security Feature
10.	3May2003	(5 posts)	Status Of 'make xconfig'
11.	4May2003-7May2003	(32 posts)	Linux 2.5.69 Released; Approaching 2.6
12.	6May2003-7May2003	(14 posts)	Status Of DVB In 2.5
13.	7May2003	(29 posts)	TTY Updates


By Zack Brown
