Kernel Traffic #62 For 10 Apr 2000
By Zack Brown
Table Of Contents
* Introduction
* Mailing List Stats For This Week
* Threads Covered
1. 21 Mar 2000 - 27 Mar 2000 (4 posts) Driver Return Values
2. 21 Mar 2000 - 27 Mar 2000 (4 posts) Real Data Corruption Under ext2 In The Stable And Unstable Kernels, And A Fix
3. 24 Mar 2000 - 28 Mar 2000 (9 posts) GCC 2.95.2 Bug; Workaround; And Fix
4. 24 Mar 2000 - 30 Mar 2000 (97 posts) POSIX Threads; Philosophy Of Kernel Development
5. 24 Mar 2000 - 1 Apr 2000 (17 posts) Keyboard Repeat Rate
6. 26 Mar 2000 - 27 Mar 2000 (11 posts) kswapd Speedups
7. 26 Mar 2000 - 29 Mar 2000 (4 posts) Mount Code Cleanup
8. 28 Mar 2000 - 30 Mar 2000 (21 posts) Things To Do Before 2.4: Saga Continues
9. 28 Mar 2000 - 29 Mar 2000 (6 posts) Problems With kernel.org Mirrors
10. 28 Mar 2000 - 29 Mar 2000 (5 posts) New Scheduler Code; Locking Issues
11. 28 Mar 2000 - 30 Mar 2000 (12 posts) Network Load Balancing
12. 29 Mar 2000 - 3 Apr 2000 (37 posts) AFFS Support And Discussion
13. 30 Mar 2000 - 31 Mar 2000 (14 posts) Intel eepro100 Driver To Be GPL-Compatible?
14. 1 Apr 2000 - 2 Apr 2000 (21 posts) This Year's April Fool's Joke
15. 2 Apr 2000 (1 post) New Networking HOWTOs And LVM HOWTO
Introduction
Seth M LaForge felt that Issue #61, Section #11 (21 Mar 2000: Video CD Under Linux) was misleading. As he put it, "in the discussion on VCD you include a list of links from Quang Nguy. However, the links all point to DVD info - DVD is a very different beast from VCD. VCDs are regular CD-ROMs with MPEG-1 encoded movies on them. They're very popular in Asia (usually illegal copies of movies, available for about $2 each on the street), but never caught on (I'm sure pressure from the movie industry has a lot to do with this) in the West. You might want to either point out that DVD and VCD are different or remove the list of links, to avoid confusing readers." That's a good point, Seth. Thanks!
Many thanks go to Antony West, for catching an old bug in the XML of Issue #32
(kt19990830_32.html), which caused the mailing list stats to render
incorrectly; and for reminding me about some runaway tags in various other
issues. Thanks, Antony! Those back-issue bugs can languish for years.
Mailing List Stats For This Week
We looked at 1680 posts in 7272K.
There were 482 different contributors. 236 posted more than once. 184 posted
last week too.
The top posters of the week were:
* 57 posts in 181K by Alan Cox
* 43 posts in 182K by Andrea Arcangeli
* 40 posts in 165K by Jeff V. Merkey
* 33 posts in 101K by Stephen C. Tweedie
* 30 posts in 109K by Andre Hedrick
1. Driver Return Values
21 Mar 2000 - 27 Mar 2000 (4 posts) Archive Link: "__setup return value"
People: Tim Waugh, Alan Cox, Russell King
Tim Waugh asked, "When is a driver supposed to return 0 from a __setup
function? When it can't parse the options? Or when there's a possibility that
the option is intended for another driver?" When no one replied for almost a
week, he asked, "Is it worth me making a patch to change the behaviour of those
drivers that return 0 on error (rather that when the option could be used by
another driver) to return 1 instead?" Alan Cox replied, "I've done some of them
but not all. So yes." And seven hours later, Russell King said that Alan had
sent him the ARM-specific parts of a patch Tim had presumably sent to Alan.
Russell affirmed that it would be going into the Linus tree, and thanked Tim
heartily.
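The convention Tim and Alan settle on can be modelled outside the kernel. What follows is a minimal Python sketch, not kernel code: the handler table, the dispatch() helper and the three example handlers are hypothetical names invented for illustration. It assumes only what the thread states, namely that a handler returns 0 when the option could be meant for another driver, and returns 1 otherwise, even when the value turns out to be unparsable.

```python
# Toy model of boot-option dispatch. Every name here is illustrative only;
# none of this is a kernel API.

def foo_setup(value):
    """Owns "foo=" outright: returns 1 even when the value is unparsable."""
    try:
        int(value, 0)
    except ValueError:
        print(f"foo: bad option {value!r}, ignoring")
    return 1  # the option was ours, so nobody else should see it


def first_dev_setup(value):
    """Returns 0 when the value looks like it belongs to another handler."""
    if not value.startswith("a:"):
        return 0  # not ours, let the next handler try
    return 1


def second_dev_setup(value):
    """A second handler registered under the same prefix."""
    return 1 if value.startswith("b:") else 0


# (prefix, handler) pairs, scanned in order
SETUP_TABLE = [
    ("foo=", foo_setup),
    ("dev=", first_dev_setup),
    ("dev=", second_dev_setup),
]


def dispatch(option):
    """Return the name of the handler that claimed the option, or None."""
    for prefix, handler in SETUP_TABLE:
        if option.startswith(prefix) and handler(option[len(prefix):]):
            return handler.__name__
    return None
```

In this model, dispatch("foo=junk") is claimed by foo_setup despite the bad value, while dispatch("dev=b:1") falls through first_dev_setup and lands on second_dev_setup. A handler that returned 0 on a parse error would instead leave its own option looking unclaimed.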
2. Real Data Corruption Under ext2 In The Stable And Unstable Kernels, And A
Fix
21 Mar 2000 - 27 Mar 2000 (4 posts) Archive Link: "ext2fs bug : files are
disapeared, unable to delete, two files' contents are switched etc."
Topics: FS: ext2
People: Theodore Y. Ts'o
Soohoon Lee reported serious data corruption with ext2, in cases of low memory
and frequent file creation and deletion operations. He explained that under
those conditions, files would spontaneously disappear or be undeletable, the
contents of two files would spontaneously switch, and fsck would find
inconsistencies in properly unmounted filesystems. He posted a one-line patch
to fs/ext2/namei.c to fix it; and Theodore Y. Ts'o confirmed that this was
indeed a problem, and not just some local glitch. In response to Soohoon's
patch, Ted replied, "This apparently solves the problem for Linux 2.2, but I'd
prefer a cleaner patch for Linux 2.3. Enclosed find the patch, which I will be
sending on to Linus. I haven't had a chance to backport this patch to Linux 2.2
yet, but it shoudl be relatively simple." He posted his patch, which was much
longer than Soohoon's, and there was some talk about other changes to make to
fs/ext2/namei.c, if they were going to be changing it at all.
3. GCC 2.95.2 Bug; Workaround; And Fix
24 Mar 2000 - 28 Mar 2000 (9 posts) Archive Link: "[2.3.99-pre3] via-rhine.o
died again!"
Topics: Assembly
People: Urban Widmark, Jean-Luc Pedneault
After a bit of private discussion between Jean-Luc Pedneault, Justin Guyett,
Urban Widmark, and other unknown assailants, Urban posted a patch for a problem
Jean-Luc had been having with via-rhine under 2.3.99-pre3. But he acknowledged
being mystified over "why a value that was just written is no longer there,
except if you add an if or printk then the value written "stays" written."
Jean-Luc replied, "Like you said, it seems that by doing the "if" instruction,
the old value doesn't get wiped. The inside loop doesn't ever get run!! It's
weird, because it may suggest that the system runs too fast or something.. like
if the value wasn't written in memory." He added, "It could be a bug in GCC
2.95.2, a bug that does an optimization wrongly. I'm using this compiler. I
haven't tested with egcs-1.1.2 yet (and I don't have GCC 2.7.2.3 compiled at
all)."
A couple of hours later, he went on to say, "I'd like to point out that
egcs-1.1.2 executed the code fine even with #if 0 (ie. without compiling the
additional code). GCC 2.95.2's optimizations breaks the code, at least that's
what I think." Urban agreed, but rather than attempt to understand the
inscrutable assembly code generated, he posted another patch which tried to do
the same thing in a different way. He also suggested generating a smaller,
non-kernel test-case to share with the GCC developers, and offered to do it
himself if no one else stepped forward. Stephane Casset replied with
confirmation that Urban's patch seemed to work.
In the end, Urban concluded the thread by pointing out that the latest GCC
snapshot appeared not to have the problem anymore, and he also gave a link to
Code Sourcery (http://www.codesourcery.com/gcc-compile.html), a really cool
page that would compile any code on the latest snapshot.
4. POSIX Threads; Philosophy Of Kernel Development
24 Mar 2000 - 30 Mar 2000 (97 posts) Archive Link: "Slow pthread_create() under
high load"
Topics: POSIX, SMP
People: Ulrich Drepper, Alvin Starr, Richard Gooch, Linus Torvalds, Alan Cox
An interesting threadlet came out of this discussion of the Linux
implementation of POSIX threading. Along the way, Ulrich Drepper
led into it by saying, "some additional functionality required to implement the
correct POSIX threads behaviour is missing." Alvin Starr let himself in for it
when he replied, "I am sure that if the next version of the thread library
required a set of kernel patches to run effectivly then those patches would end
up in the kernel source tree within a version or so." Richard Gooch replied,
"Mate, where have you been? The day Linus lets user-space dictate what goes
into the kernel is the day hell freezes over. If you want a patch to go into
the kernel, you need to convince him it's a good idea. Adding a dependency in
user-space, expecting it to "force his hand", will not help. It will probably
just piss him off. Or make him laugh."
A bit later, Linus Torvalds made a case about PID sharing and gave some example
code. He went on to say, about threading:
Note that the reason the kernel is not POSIX-compliant is:
* the POSIX standard is technically stupid. It's much better to use a cleaner
fundamental threading model and build on top of that.
* things like the above are just so much better and more easily done in user
space anyway.
The reason LinuxThreads has a hard time becoming POSIX-compliant is that I
refuse to apply stupid patches, and a lot of the patches sent to me have been
frankly stupid. They've often implemented pthreads functionality without any
actual thought of how it _could_ be done more cleanly with a user/kernel split.
A post or so down the line, Linus added:
Note that when I started doing clone(), I basically said: "this is how I think
threads should be done". I added a few example flags to show the concept,
without really having a firm plan on what the final situation would be. Some of
those flags got expanded upon (CLONE_PARENT is only the latest addition), while
some ended up not being very useful at all (CLONE_PID is basically useless -
the only use for it is to start up the original idle threads under SMP, and
that code is so specialized anyway that it could basically do the CLONE_PID
logic by hand).
There are bound to be more issues. I've seen patches floating around that
expand it, and especially in signal handling SOMETHING has to be done. I don't
think the "share all signal queues" is the right answer: I suspect the right
answer to the signal handling issue is to have a "private" queue (the regular
one) along with a separate method of handling "shared" queues and a way to
attach to a shared signal queue.
Shared signals are potentially useful outside pure threading models too, and
I'm looking for something more generic. I suspect that what I'm looking for is
more like a message list, along with some thin compatibility code to make it
easy for pthreads emulation that looks like signals..
That's kind of my gripe in general - I think there is a bigger picture than
just plain pthreads. Like clone(), let's do this right.
Some earlier discussion of clone() took place in Issue #1, Section #3
(8 Jan 1999: Porting The vfork() Syscall); a LinuxThreads announcement appeared
in Issue #30, Section #10 (30 Jul 1999: CLONE_PPID Support In LinuxThreads);
a little bit of clone() history appeared in Issue #32, Section #14
(17 Aug 1999: Threads In Linux); some general discussion of threading took
place in Issue #34, Section #22 (1 Sep 1999: Some Explanation Of Threading);
then in Issue #35, Section #13 (7 Sep 1999: CLONE_PID Problems), Alan Cox
said CLONE_PID would be going away in 2.3.x, but according to Linus' statements
above, this has not happened yet. A vfork() flamewar (with some good technical
goodies) took place in Issue #45, Section #1 (1 Nov 1999: vfork() Discussion
And Flame Fest).
Discussions of development philosophy occur throughout KT, but some specific
articles are Issue #4, Section #1 (27 Jan 1999: Philosophy Of The Stable
Series), Issue #5, Section #10 (3 Feb 1999: Philosophy Of Binary-Only
Modules), Issue #9, Section #12 (4 Mar 1999: Philosophy Of Kernel
Development), Issue #18, Section #5 (2 May 1999: Philosophy Of Open Source;
Maintainer Conflict), and Issue #60, Section #8 (15 Mar 2000: Philosophy Of
Having Debugging Code In The Kernel).
5. Keyboard Repeat Rate
24 Mar 2000 - 1 Apr 2000 (17 posts) Archive Link: "Keyboard rate question.."
People: Andrew Morton, Russell King, Mike A. Harris
Mike A. Harris had a problem with his keyboard repeat rate slowing down when
switching a shared keyboard between two machines. In the course of discussion,
it came out that if there was any loss of power to the keyboard, this would be
the inevitable result. However, at some point Andrew Morton mentioned:
Russell King says he has a patch which does autorepeat in s/w. This is most
definitely the best way. I did this in an OS many years ago - took one look at
the XT keyboard specs and said "nope".
It's very easy to do. Just a little state machine which squirts out the most
recent 'make' code and stops doing that when it sees a 'break'. It also gives
you infinite control over the autorepeat speed, although topping out at HZ
seems reasonable.
Russell replied, "I'll be sorting it out later today when I do my next set of
patches for Linus et al."
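Andrew's "little state machine" is small enough to sketch. The Python below is an illustration under stated assumptions, not Russell's patch: the class name, the method names and the delay and period values are all invented here. It also assumes that a 'break' stops the repeat only when it matches the key currently repeating. One call to tick() stands for one timer tick, so a period of 1 repeats once per tick, which corresponds to topping out at HZ.

```python
class SoftAutorepeat:
    """Toy software autorepeat: re-emit the latest 'make' until its 'break'."""

    def __init__(self, delay_ticks=25, period_ticks=3):
        self.delay_ticks = delay_ticks    # ticks before the first repeat (assumed)
        self.period_ticks = period_ticks  # ticks between repeats (assumed)
        self.held = None                  # most recent make code still held down
        self.countdown = 0

    def make(self, code):
        """Key pressed: pass it through once and arm the repeat timer."""
        self.held = code
        self.countdown = self.delay_ticks
        return code

    def brk(self, code):
        """Key released: stop repeating if it is the key being repeated."""
        if code == self.held:
            self.held = None

    def tick(self):
        """Called once per timer tick; returns a code to emit, or None."""
        if self.held is None:
            return None
        self.countdown -= 1
        if self.countdown > 0:
            return None
        self.countdown = self.period_ticks
        return self.held
```

Because the repeat is driven by the host's timer and not by the keyboard controller, the rate is just two integers that can be changed at any time, which is the "infinite control" Andrew mentions.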
6. kswapd Speedups
26 Mar 2000 - 27 Mar 2000 (11 posts) Archive Link: "[PATCH] Re: kswapd"
Topics: Virtual Memory
People: Rik van Riel, Kanoj Sarcar, Linus Torvalds, Christoph Rohland, Mark
Hahn
Rik van Riel posted a patch against 2.3.99pre3, to take some code out from the
interior of a while loop in kswapd; and added, "I wonder who sent the
brown-paper-bag patch with the superfluous while loop to Linus ..." Kanoj
Sarcar replied, red-handed, "That would be me ..." He asked what Rik's patch
actually fixed, aside from the while loop, which appeared cosmetic. Rik replied
that ditching the loop was actually a serious fix by itself. Without his patch,
he reported that kswapd used between 50% and 70% of the CPU in a particular
workload. With his patch, it used between 3% and 5% of the CPU. Kanoj pointed
out that the while loop had been in the kernel since way back in 2.3.43, and
Linus Torvalds, replying elsewhere, said, "you're definitely right that this is
not a new bug introduced by you, Kanoj - this seems to be just a thinko that
has been there for a long long time. And I suspect I may have been the original
perpetrator of the crime." Kanoj let out a facetious sigh of relief that he
hadn't introduced the bug, and they discussed some of the ins and outs of Rik's
patch. Linus described the situation prior to Rik's patch:
What happens is that kswapd is only woken up when needed, so most of the time
it is sleeping. It's only when it is woken up and when it has done its work
when the loop turns into a CPU-burner, but it can easily mean that kswapd will
just spend CPU time for no good reason until its time-slice is exhausted.
So think of the bug as "kswapd will waste the final part of its timeslice doing
nothing useful".
Elsewhere, under the Subject: [RFT] balancing patch
(http://kernelnotes.org/lnxlists/linux-kernel/lk_0003_04/msg01025.html),
Kanoj posted a separate patch
to ease kswapd's CPU usage. Mark Hahn didn't notice any improvement, and
Christoph Rohland noticed various Out Of Memory breakages when the system
started to use swap. However, Rik reported:
I'm now testing Kanoj' balancing patch together with my kswapd
infinite-loop-removal patch. The system seems to work quite well, I haven't
seen any big strangeness in the VM load (the variance in the amount of free
memory is a bit bigger, naturally, but that's to be expected) and interactive
performance from the console seems unaffected.
It would be nice if a few more people tested the combination of 2.3.99-pre3
with Kanoj' balancing patch and my infinite-loop-removal patch ... (because
YMMV)
7. Mount Code Cleanup
26 Mar 2000 - 29 Mar 2000 (4 posts) Archive Link: "[Announce][CFT] loopback
mounts and stuff"
Topics: FS: ext2, FS: procfs
People: Alexander Viro
Alexander Viro called for testers, and announced:
Folks, there is a cleanup of mount-related stuff underway. Right now the patch
seems to be usable for testing.
It doesn't include pieces in nfsd and auotfs*, so these simply wont compile,
but it should work with everything else.
It allows
1. to mount the same filesystem several times. Yup, ext2 included. No cache
coherency problems - all instances share the dentry tree.
2. to work with shmfs without mounting it.
3. to do explicit loopback mount. As in
# mount -t bind /usr/X11R6 /mnt
# ls /mnt
bin doc include lib man share
# cd /mnt
# ls ..
bin dev home lost+found mnt proc sbin usr
boot etc lib misc opt root tmp var
(IOW, unlike the usage of symlinks it gives correct behaviour on ..)
4. to mark filesystem type as 'single'. One superblock will be created when
you initialize the driver, all later mounts of that type will be aliases to
that one. IOW, we can start storing procfs immediately in dcache - just a
normal tree. Moreover, kernel can access that tree even if it's not mounted
by user. That's how the shmfs stuff is done. Oh, and all instances will
share the device number, so you can create a thousand chroot jails, mount
devpts on each and spend 1 (one) anonymous device. Ditto for procfs, etc.
Patch (against 2.3.99-pre3) lives on
ftp://ftp.math.psu.edu/pub/viro/mount-patch-4.
Folks, give it a try. There may be bugs. I think that it's cleaner than the old
code, but don't let it play with critical data. Bug reports are more than
welcome, indeed. Another thing that is _very_ welcome is a discussion of the
export rules in situation when filesystems on server may be mounted several
times (as well as the current implementation - I'ld really like to hear
comments on it, in particular on the exp_parent(). It's either buggy and
doesn't what it is supposed to do or contains seriously superfluous code).
General comments on the patch are also welcome, indeed.
8. Things To Do Before 2.4: Saga Continues
28 Mar 2000 - 30 Mar 2000 (21 posts) Archive Link: "The 2.3.x Job List
(Updated)"
Topics: Compression, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: Coda, FS:
FAT, FS: NFS, FS: NTFS, FS: UMSDOS, I2O, Networking, PCI, Power Management:
ACPI, SMP, Security, USB, Virtual Memory, VisWS
People: Alan Cox, Wakko Warner
Alan Cox posted his latest list of things to do before 2.4 could come out:
1. Fixed
    1. Tulip hang on rmmod (fixed in .51 ?)
    2. Incredibly slow loopback tcp bug (believed fixed about 2.3.48)
    3. COMX series WAN now merged
    4. VM needs rebalancing or we have a bad leak
    5. SHM works chroot
    6. SHM back compatibility
    7. Intel i960 problems with I2O
2. In Progress
    1. Merge the network fixes (DaveM)
    2. Merge 2.2.15 changes (Alan)
    3. Get RAID 0.90 in (Ingo)
3. Fix Exists But Isnt Merged
    1. Signals leak kernel memory (security)
    2. msync fails on NFS
    3. Semaphore races
    4. Sempahore memory leak
    5. Exploitable leak in file locking
    6. Symbol clashes and other mess from _three_ copies of zlib!
    7. Shared memory changes change the API breaking applications (eg gimp)
    8. Merge the RIO driver (probably do post 2.4.0 as it is large)
    9. S/390 Merge (merged in AC tree)
    10. via rhine oopses under load ?
    11. 1.07 AMI MegaRAID
    12. PCI buffer overruns
    13. SCSI generic driver crashes controllers (need to pass
        PCI_DIR_UNKNOWN..)
    14. Finish softnet driver port over and cleanups
4. To Do
    1. Restore O_SYNC functionality
    2. Fix eth= command line
    3. Trace numerous random crashes in the inode cache
    4. Fix Space.c duplicate string/write to constants
    5. VM kswapd has some serious problems
    6. vmalloc(GFP_DMA) is needed for DMA drivers
    7. put_user appears to be broken for i386 machines
    8. Fix module remove race bug (mostly done - Al Viro)
    9. Test other file systems on write
    10. Directory race fix for UFS
    11. Audit all char and block drivers to ensure they are safe with the 2.3
        locking - a lot of them are not especially on the open() path.
    12. Stick lock_kernel() calls around driver with issues to hard to fix
        nicely for 2.4 itself
    13. PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be
        fixed ?)
    14. Use PCI DMA by default in IDE is unsafe (must not do so on via VPx x<3)
    15. Use PCI DMA 'lost interrupt' problem with some hw [which ?]
    16. Crashes on boot on some Compaqs ?
    17. pci_set_master forces a 64 latency on low latency setting devices.Some
        boards require all cards have latency <= 32
    18. usbfs hangs on mount sometimes
    19. Loopback fs hangs
    20. Problems with ip autoconfig according to Zaitcev
    21. Still some SHM bug reports
    22. Any user can crash FAT fs code with ftruncate
5. To Do But Non Showstopper
    1. Make syncppp use new ppp code
    2. Finish 64bit vfs merges (lockf64 and friends missing)
    3. NCR5380 isnt smp safe
    4. DMFE is not SMP safe
    5. ACPI hangs on boot for some systems
    6. Get the Emu10K merged
    7. Finish I2O merge
    8. Go through as 2.4pre kicks in and figure what we should mark obsolete
       for the final 2.4
    9. Per Process rtsigio limit
    10. Fix SPX socket code
    11. Boot hangs on a range of Dell docking stations (Latitude)
    12. Port SGI VisWS to 2.3.x or mark obsolete
    13. HFS is still broken
    14. iget abuse in knfsd
    15. Mark NTFS as obsolete
    16. Paride seems to need fixes for the block changes yet
    17. PIII FXSAVE/FXRESTORE support
    18. Some people report 2.3.x serial problems
    19. AIC7xxx doesnt work non PCI ?
    20. USB hangs on APM suspend on some machines
    21. PCMCIA crashes on unloading pci_socket
    22. DEFXX driver appears broken
    23. ISAPnP IRQ handling failing on SB1000 + resource handling bug
6. Compatibility Errors
7. Probably Post 2.4
    1. per super block write_super needs an async flag
    2. addres_space needs a VM pressure/flush callback
    3. per file_op rw_kiovec
    4. enhanced disk statistics
    5. AFFS fixups
    6. UMSDOS fixups resync
8. Drivers In 2.2 not 2.4
    1. Lan Media WAN
9. To Check
    1. Truncate races (Debian apt shows it nicely) [done ? - all but Coda]
    2. Elevator and block handling queue change errors are all sorted
    3. Check O_APPEND atomicity bug fixing is complete
    4. Make sure all drivers return 1 from their __setup functions
    5. Protection on isize (sct) [Al Viro mostly done]
    6. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for
       2.3.x as well
    7. Network block device seems broken by block device changes
    8. Fbcon races
    9. Fix all remaining PCI code to use new resources and enable_Device
    10. VFS?VM - mmap/write deadlock
    11. rw sempahores on page faults (mmap_sem)
    12. kiobuf seperate lock functions/bounce/page_address fixes
    13. Fix routing by fwmark
    14. Some FB drivers check the A000 area and find it busy then bomb out
    15. rw semaphores on inodes to fix read/truncate races ? [Probably fixed]
    16. Not all device drivers are safe now the write inode lock isnt taken on
        write
    17. File locking needs checking for races
    18. Multiwrite IDE breaks on a disk error
    19. AFFS doesn't work on current page cache
    20. ACPI/APM suspend issue
Wakko Warner replied to item 4.13 (PCMCIA/Cardbus hangs, IRQ problems,
Keyboard/mouse problem (may be fixed ?)), with, "Fixed for me. Since yenta doesn't
probe irq12, it doesn't cause me any lockups."
There were some other scattered comments as well, but nothing conclusive.
9. Problems With kernel.org Mirrors
28 Mar 2000 - 29 Mar 2000 (6 posts) Archive Link: "Linux 2.3.99pre3-ac1"
Topics: Kernel Release Announcement
People: Alan Cox, H. Peter Anvin, James H. Cloos Jr., Arjan van de Ven
Alan Cox announced a patch against 2.3.99pre3, so people could keep up with him
in their debugging expeditions, but Arjan van de Ven noticed that it wasn't on
any of the kernel.org mirrors. Alan replied, "It does appear not to be
mirroring right. I've put a copy on ftp.linux.org.uk:/pub/linux/alan/
(ftp://ftp.linux.org.uk/pub/linux/alan/)". H. Peter Anvin offered, "Please
report broken kernel.org mirrors **including IP address** to
ftpadmin@kernel.org (mailto:ftpadmin@kernel.org) as soon as you can tell,
please." James H. Cloos Jr. also explained:
The notify from ftpadmin to lka-change didn't go out until Tue, 28 Mar 2000
15:43:24 -0800 or about 70 minutes after Alan's note (quoted above), or about
six hours after Alan's initial announcement....
I'm sure most of us rsync only once or twice a day from cron(8), plus whenever
lka-change mail arrives, hense the delay.
10. New Scheduler Code; Locking Issues
28 Mar 2000 - 29 Mar 2000 (5 posts) Archive Link: "locking problems"
Topics: SMP
People: Rik van Riel, Jun Sun, Andrew Morton
Rik van Riel posted a patch against 2.3.99, to implement a low overhead fair
process scheduler. As he explained in the code comments, "It works by handing
out CPU time like we do at the normal recalculation. The catch is that we move
the list head (where the for_each_task() loop starts) to _after_ the first task
where we ran out of quota. This means that if a user has too many runnable
processes, his tasks will get extra CPU time here in turns." But he reported to
the list:
Unfortunately it hangs on taking locks in the recalculation code :(
I'm somewhat amazed by why it hangs and interested in any explanations...
Jun Sun gave his tentative opinion, "Interrupt handlers sometimes call kernel
functions that would require a lock on tasklist_lock. If that interrupt happens
during the time you hold write lock on tasklist_lock, a deadlock would happen."
He suggested using write_lock_irq() to fix it, and added, "BTW, I really think
interrupt handlers acquiring the same locks which can be acquired by processes
is a *BIG* problem in Linux." Andrew Morton asked why Jun thought this, and Jun
replied, "Linus told me so. I believe him. :-)" On a more technical note, he
added, "I did sniff around the source code and spotted a couple of places where
locks COULD be acquired by ISRs, but I never did a RUN-TIME check to catch this
situation," and went on to say:
I believe the problem here is that Linux does not have a CLEAR notion and
separation of task-context code and interrupt-context code.
Imagine if a kernel function needs to read task list, then it must acquire a
read lock on tasklist_lock. However, the function might be called from both
process and ISR, then we will have the ISR acquiring lock problem.
I don't know if this has been a problem to Linux in the past. I am relatively
new to Linux kernel.
There was no reply to this, but Rik, replying to Jun's suggested
write_lock_irq(), posted a new patch, and said:
Indeed, that was the problem. I was lucky to get a few good traces by the NMI
oopser that identified this problem. Now things are fixed.
The new patch is attached, for adventurous users. I'm testing it on my SMP
system now.
That was it.
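The hang Jun diagnosed can be shown with a toy model. The Python below is a sketch under loud assumptions and does not reproduce Rik's scheduler code or the kernel's lock implementation: ToyCpu, Deadlock and isr_reads_tasklist are names made up for this illustration, and the "lock" is a plain flag on a single simulated CPU. It only models the ordering problem: an interrupt handler that wants the lock can never get it if the code it interrupted is the writer, and a write_lock_irq() style call avoids this by masking interrupts for as long as the lock is held.

```python
class Deadlock(Exception):
    """Raised where real hardware would simply spin forever."""


class ToyCpu:
    """One simulated CPU with an interrupt-enable flag and one lock flag."""

    def __init__(self):
        self.irqs_enabled = True
        self.write_held = False
        self.pending = []   # interrupts that arrived while masked
        self.log = []

    def interrupt(self, handler):
        """Deliver an interrupt now, or queue it if interrupts are masked."""
        if self.irqs_enabled:
            handler(self)
        else:
            self.pending.append(handler)

    def write_lock(self):
        self.write_held = True      # interrupts stay enabled

    def write_unlock(self):
        self.write_held = False

    def write_lock_irq(self):
        self.irqs_enabled = False   # mask interrupts first, then take the lock
        self.write_held = True

    def write_unlock_irq(self):
        self.write_held = False
        self.irqs_enabled = True
        pending, self.pending = self.pending, []
        for handler in pending:     # deferred interrupts run only now
            handler(self)


def isr_reads_tasklist(cpu):
    """Interrupt handler that needs to read the lock-protected task list."""
    if cpu.write_held:
        # The writer is the very code we interrupted: it cannot release the
        # lock until we return, and we cannot return until it releases it.
        raise Deadlock("handler spins on a lock held by the interrupted code")
    cpu.log.append("isr walked task list")
```

With write_lock() an interrupt arriving inside the critical section raises Deadlock in this model; with write_lock_irq() the same interrupt is queued and runs cleanly after write_unlock_irq().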
11. Network Load Balancing
28 Mar 2000 - 30 Mar 2000 (12 posts) Archive Link: "iproute and 2.3 question"
Topics: Networking
People: Guus Sliepen, Andi Kleen, Alexey Kuznetsov, George Bonser
George Bonser noticed in the ip-route cref document, a description of the
'equalize' modifier: "allow packet by packet randomization on multipath routes.
Without this modifier route will be frozen to one selected nexthop, so that
load splitting will occur only on per-flow base." The same document said that
the kernel had to be patched to make use of this feature. George asked if this
was still the case, or if 'equalize' had been integrated into the main kernel
sources. Andi Kleen and Guus Sliepen (author of the patch) replied that it had
not been integrated. Guus gave a pointer to the patch
(ftp://sliepen.warande.net/pub/eql/), and explained:
It works indeed by throwing out cache entries every time they have been used.
This works for up to at least 20 Mbit/s of traffic, but not for 400 Mbit/s
(both cases have been tried, the former works fine, the latter does not). You
can try it if you like.
Route based load balancing does have certain advantages over bonding devices.
But it's really hard to implement in a clean way in current kernels. I'd rather
see the complete networking code being a module, and those who want to use
different routing/firewalling/scheduling schemes can load (or even create their
own) different modules. The current code is (to my eye) just a messy bunch of
hooks and checks.
He added that he'd propose this for 2.5 when the time came. There was no reply
to this, but Andi, replying to the original post, said that equalization on the
routing cache layer was too slow or not fine-grained enough for inclusion in
the main sources. Someone asked for more explanation, and Andi replied:
Linux has a routing cache that caches routing table lookups. This is called the
destination cache. A destination cache entry is tied to a specific destination,
which means only a single neighbour on a multipath route. To use multipath
routing for load balancing requires dropping the destination entry after every
use, so that another neighbour in the multipath could be looked up (the
destination cache knows nothing about multipaths, that is all encapsulated in
the FIB or routing table)
Dropping them all the time does not work well and is slow. It is also not
finegrained enough (because the decision occurs to early) to get an even load
balancing
Multipath routing is only useful for failover when a device is down in Linux.
For load balancing you can use the existing eql, teql and bonding devices,
which work at a lower layer and avoid these problems.
Alexey Kuznetsov was critical of this explanation, and said that multipath
routing worked "perfectly when you need to split load on servers talking to
enough large number of clients. Any http server is good example." He added that
Andi's suggestion of the existing eql, teql and bonding devices would
introduce "even worse problem of strong tcp reordering. Actually, experiments
show that load balancing works only in the situations, when congestion window
is bounded by 3 packets. If it is not made artificially, it occurs
automatically on each connection after some amount of excessive
retransmissions. Total single TCP connection throughput is never better in this
case. Actually, it hints to the thought that "true load blalancing" has to
involve tracking connections and avoiding reordering TCP packets." There was no
reply to this, but there was a bit of implementation discussion elsewhere,
along the lines of Andi's explanations.
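The trade-off argued over in this thread can be seen in a small model. The Python below is a sketch only and is not the kernel's routing code or Guus's patch: ToyRouter and its fields are hypothetical, and a deterministic round-robin choice stands in for the randomized nexthop selection the ip-route document describes. It models a destination cache in front of a multipath table, with an equalize switch that throws the cache entry away after every use, as Guus and Andi describe.

```python
from itertools import cycle


class ToyRouter:
    """Destination cache in front of a multipath route (illustrative only)."""

    def __init__(self, nexthops, equalize=False):
        self._pick = cycle(nexthops)  # round-robin stand-in for the real choice
        self.equalize = equalize
        self.cache = {}               # destination -> cached nexthop
        self.table_lookups = 0        # how often the slow path ran

    def route(self, dst):
        """Return the nexthop used for one packet addressed to dst."""
        if dst not in self.cache:
            self.table_lookups += 1   # full routing-table lookup
            self.cache[dst] = next(self._pick)
        hop = self.cache[dst]
        if self.equalize:
            del self.cache[dst]       # drop the entry after a single use
        return hop
```

Without equalize, four packets to one destination all leave by the same nexthop after a single table lookup, yet packets to four different destinations still alternate between nexthops, which is Alexey's point about servers with many clients. With equalize, four packets to one destination alternate nexthops but cost four lookups, and consecutive packets of one connection take different paths, which is where the reordering concern comes from.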
12. AFFS Support And Discussion
29 Mar 2000 - 3 Apr 2000 (37 posts) Archive Link: "AFFS progress."
Topics: FS: FAT, FS: ext2
People: Dave Jones, Alexander Viro, Matthias Andree, Nicholai Benalal, Rask
Ingemann Lambertsen
Dave Jones posted several patches to update AFFS, although he proclaimed loudly
that this code was possibly dangerous and should not be used without extensive
backups. At one point, he added, "I'm coding without tools to test this, so I
need help from the people who are going to be using this.. I'd also appreciate
feedback from other filesys/vfs guys about anything in this patch that just
'doesn't look right' I've gone from knowing nothing about fs/VFS to this diff
in four days, and now my head hurts. I wouldn't be at all surprised if I've
done _something_ wrong somewhere." To Dave's discussion of ongoing work and
problems, Alexander Viro replied:
real problems with AFFS are different. I'll bring the pre-patch I've done back
in September from backups tomorrow and then you'll get more detailed
description, but right now I can recall the following:
1. AFFS handles links horribly. It has pseudo-inodes for all links and they
point to the real one. Unfortunately, that "real" inode _must_ belong to
some directory. Which means that if you create a link to file and remove
the original link you are in for pain. You can't just remove the original
entry from its hash chain. So the bloody thing finds some other link, moves
the name into original one, inserts the original into hash chain of the
other and kills other. It means that unlink() in one directory may
reshuffle another. And you've got _no_ protection by i_sem on another
directory - any attempt to get it will lead to easy deadlocks. Consequence:
_easy_ races.
2. You have to account for situations when link() and unlink() race with each
other. Again, not done in the current code.
3. Links on directories easily kill VFS. Don't.
4. Since some operations (e.g. rename) involve a _lot_ of hash chains walking
and pointers switching - beware of the failure modes when you abort in the
middle of modification. It may easily leave you with fucked up filesystem.
It's a lot of crap to fix and I gave up on that when I got more pressing things
to do. I can pass you the patch along with notes. I can remove the swearwords -
you will reinsert them as soon as you'll play with this beast yourself.
The bottom line: AFFS design is a festering pile of dung and attempts to make
it look like UNIX filesystem only made it uglier. Judging by dejanews search,
AmigaOS itself doesn't handle it well. Hell knows what had stopped them from
replacing it with decent filesystem - with the thing outside of kernel it
wasn't that hard to do... Damnit, FAT is not so braindead compared to that
abortion.
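Alexander's first point is easier to see in a toy model. The following Python sketch is an illustration only, not the kernel's AFFS code: the classes `Header`, `Link` and `Directory` and the function `unlink_original` are hypothetical names, and a plain dictionary stands in for the on-disk hash chains. It assumes the behaviour Alexander describes, where removing the original name of a linked file promotes a surviving link in some other directory.

```python
class Header:
    """Toy stand-in for the 'real' file header; it must live in some directory."""
    def __init__(self, name, parent):
        self.name = name
        self.parent = parent
        self.links = []  # link entries that point at this header


class Link:
    """Toy stand-in for a link pseudo-entry pointing at the real header."""
    def __init__(self, name, parent, target):
        self.name = name
        self.parent = parent
        self.target = target


class Directory:
    def __init__(self, label):
        self.label = label
        self.entries = {}        # name -> Header or Link (stands in for hash chains)
        self.modifications = 0   # counts how often this directory was touched

    def insert(self, entry):
        self.entries[entry.name] = entry
        self.modifications += 1

    def remove(self, name):
        del self.entries[name]
        self.modifications += 1


def create_file(directory, name):
    header = Header(name, directory)
    directory.insert(header)
    return header


def create_link(directory, name, target):
    link = Link(name, directory, target)
    target.links.append(link)
    directory.insert(link)
    return link


def unlink_original(header):
    """Remove the original name of a file that may still have links."""
    header.parent.remove(header.name)
    if not header.links:
        return  # no other names left; the file itself can go away
    # Promote a surviving link: the header takes over its name and its slot,
    # which means a *different* directory gets modified.
    survivor = header.links.pop(0)
    other_dir = survivor.parent
    other_dir.remove(survivor.name)
    header.name = survivor.name
    header.parent = other_dir
    other_dir.insert(header)
```

In this model, calling `unlink_original` on a file in one directory changes the contents of whichever directory holds the promoted link, which is the cross-directory reshuffling that a lock on the first directory alone would not cover.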
Matthias Andree added his sentiments, "These problems you mention have been
persisting in AmigaOS for years ever since links had been introduced with
AmigaOS 2, PLUS, AmigaOS shell commands have had bugs so that only the GNU
ported fileutils along with the FileSystem-based (as opposed to dos.library
based) ixemul.library could get you rid of directory symlinks; this is partly
because links had originally not been present in AmigaOS 1.x, partly because
they fucked things up really bad." He went on:
There are some commercial (AFS/PFS - some versions of these are also fucked
beyond repair) and at least to freeware (SFS, Berkeley FFS Amiga port)
filesystems, I never checked them for completeness or stability, though. Amiga
users seem to be quite satisfied with SFS which is claimed to be journalling.
The Dircache (DCFS) option of OFS/FFS (DOS\4 and DOS\5) are also claimed to be
fucked somewhat and slow things down for disks with low random access times
such as hard disks.
If you consider seriously revamping the AFFS support, I strongly urge to get
hold of Ralph Babel's "The Amiga Guru Book" which contains carefully collected
information on DOS internals, while you will still need the Amiga Developer's
CD for information on Dircache.
I've given up almost all my AmigaOS activities for the sake of Unix, AmigaOS
has given me too much grief with all its troubles, it had just shoot one
filesystem beyond recovery (into crash-on-access state) once again.
And since that stuff is so severely fucked, I suggest marking write support
EXPERIMENTAL and add a kernel compile-time option for that.
Dave also replied to Alexander's harsh critique, adding, "Which is probably why
someone came along and wrote PFS (Professional File System) for it, and made a
whole load of money from it." He asked, "Anyone know if any tech-documentation
on this exists? That may be an interesting filesystem to hack on when I'm done
with AFFS.." Nicholai Benalal didn't know of anything for PFS, but he added, "Smart
File System (SFS) on the other hand is a free (as in beer) filesystem with good
technical documentation. It has gained ground in the amiga community lately."
He gave a pointer to an SFS page (http://www.xs4all.nl/~hjohn/SFS).
Nicholai also came somewhat to AmigaOS' defense in response to
Alexander's stark depiction, saying, "AFFS works allright under AmigaOS. It has
limitations but generally it's ok. Still, a lot of people use other filesystems
for AmigaOS but there are no Linux drivers for these. So the best way to
transfer files between the Amiga side and Linux is still the buggy Linux affs
driver :-)" A lot of people pointed out that the 'tar' command might be a
decent alternative. Matthias added that lack of Linux drivers for the other
filesystems might have something to do with the lack of available sources for
those primarily commercial filesystems. He also disagreed with Nicholai's
assertion that AFFS worked passably under AmigaOS. He said, "It blows
itself off its very feet when crashing at the wrong time and leaving a system
behind that's corrupted beyond repair. This is happening all over the place
every now and then."
Rask Ingemann Lambertsen replied, "AFFS is as good (or bad) as ext2 in this
regard." But Matthias and Alexander both objected to this. Alexander said, "It
is not. You need much more atomic operations to get from one valid state of
filesystem to another. And I mean _much_ more - as minimum twice. Data
structure is designed by complete loonie - just look at it and write down the
worst-case set of modifications to be done upon rename(). Pay attention to the
size of critical part - _some_ steps can be performed without corrupting the
structure if you fail after them, but there is a nice lump that should be taken
together or not at all. Now, do the same for FFS (_real_ one, designed by sane
people). Or its descendants - ext2, ufs... Compare the results." And Matthias
also said, "AFFS is much worse. It needs reordering hash chains, touching
several and so on even on a single rename. If your machine crashes before all
blocks have been written, you're in trouble, since tools that handle this do
not come with AmigaOS. I've NEVER lost so much data with ext2 because of
corruption as I recently lost with affs (on AmigaOS). AFFS is fine only as long
as you don't touch anything."
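The crash-safety argument Alexander and Matthias make can be sketched as a small experiment: model an update as a sequence of separate writes, "crash" after each prefix of that sequence, and check whether the result is still a valid structure. The Python below is a toy under stated assumptions. The four-step rename, the state layout and the names `is_valid`, `rename_steps` and `invalid_crash_points` are hypothetical and do not reflect the real AFFS on-disk format; the invariant checked is simply that the file's name appears exactly once in its parent's chain and nowhere else.

```python
import copy


def is_valid(state):
    """The header's name must be in its parent's chain once, and in no other chain."""
    name = state["header_name"]
    parent = state["header_parent"]
    for dirname, chain in state["chains"].items():
        expected = 1 if dirname == parent else 0
        if chain.count(name) != expected:
            return False
    return True


def rename_steps(old_dir, new_dir, old_name, new_name):
    """Hypothetical rename, broken into writes that each hit the disk separately."""
    return [
        lambda s: s["chains"][old_dir].remove(old_name),  # unhook from old chain
        lambda s: s.update(header_name=new_name),         # rewrite name in header
        lambda s: s.update(header_parent=new_dir),        # repoint parent
        lambda s: s["chains"][new_dir].append(new_name),  # hook into new chain
    ]


def invalid_crash_points(initial, steps, check):
    """Return every k such that crashing after the first k steps breaks the invariant."""
    bad = []
    for k in range(len(steps) + 1):
        state = copy.deepcopy(initial)
        for step in steps[:k]:
            step(state)
        if not check(state):
            bad.append(k)
    return bad


initial = {
    "chains": {"old": ["a"], "new": []},
    "header_name": "a",
    "header_parent": "old",
}
bad_points = invalid_crash_points(
    initial, rename_steps("old", "new", "a", "b"), is_valid
)
```

The more separate writes a single operation needs, the more intermediate states there are in which a crash leaves the structure inconsistent, which is the comparison Alexander suggests making between AFFS and the FFS family.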
13. Intel eepro100 Driver To Be GPL-Compatible?
30 Mar 2000 - 31 Mar 2000 (14 posts) Archive Link: "[PATCH] eepro100.c"
People: Alan Cox, Dragan Stancevic
Dragan Stancevic posted a patch to the non-Intel eepro100 driver, to provide
more detailed information about installed devices, and in the course of
discussion, Alan Cox mentioned, "I've had some discussion with intel about
fixing the licensing for the eepro100 driver they released so that we can merge
the two (they have support for more boards, ucode for interrupt mitigation,
errata workarounds and portability and locking flaws." He added, "I have a
positive answer I dont quite understand 8) from the Intel lawyers." Dragan
replied, "I was not aware that intel is reconsidering their license... Did they
give you any time frames to when the intel driver might be release under a more
compatible license?" And Alan concluded, "Basically once I finish talking to
the lawyer. I've just not had time and I'll be busy next week too."
14. This Year's April Fool's Joke
1 Apr 2000 - 2 Apr 2000 (21 posts) Archive Link: "Linux 2000(tm)(r)"
Topics: Microsoft
People: Michael Talbot-Wilson, Linus Torvalds
Someone purporting to be Linus Torvalds announced that he'd partnered with
Microsoft and would be selling Linux from now on. Everyone rolled their eyes,
but the interesting part is how quickly the identity of the poster was
established. Within a day his name and various personal details were known and
published. It looked as though a serious manhunt would soon be under way, until
Michael Talbot-Wilson said, "Hey, guys. Enough, huh? He intentionally made it
clear enough that it was a joke. Fun for a couple of minutes, right?"
15. New Networking HOWTOs And LVM HOWTO
2 Apr 2000 (1 post) Archive Link: "[DOCUMENTATION] 3 2.4 HOWTOs. Traffic
Shaping, iproute2 and LVM"
Topics: Disk Arrays: LVM, Version Control
People: Bert Hubert
Continuing from last week in Issue #61, Section #15 (22 Mar 2000: iproute2 And
netfilter HOWTO), Bert Hubert announced:
3 new HOWTOs:
* Linux 2.4 Advanced Routing & Traffic Shaping HOWTO
http://www.ds9a.nl/2.4Routing
Do interesting stuff with netfilter, ip, tc and other tools. Already quite
long and considered useful by a lot of people. Cooperative project with 4
authors working together via CVS
* Linux 2.4 Networking http://www.ds9a.nl/2.4Networking
iproute2 HOWTO which also tries to impart understanding of basic Linux
networking. Still in its very early stages and desperately needs more
authors.
* Linux Logical Volume Manager HOWTO http://www.ds9a.nl/lvm-howto
A very hands on HOWTO about LVM. In its early stages as well but already
quite useful - progressing rapidly.
Kernel Traffic is grateful to be developed on a computer donated by Professor
Greg Benson and Professor Allan Cruse in the Department of Computer Science at
the University of San Francisco. This is the same department that invented
FlashMob Computing. Kernel Traffic is hosted by the generous folks at
kernel.org. All pages on this site are copyright their original authors, and
distributed under the terms of the GNU General Public License version 2.0.