Kernel Traffic

Fri Sep 16 18:56:36 2005

Mon Aug 29 20:42:16 2005

Thu Sep 15 16:49:32 2005

Disk Arrays: LVM FS: NFS FS: NTFS FS: ReiserFS FS: XFS FS: ext2 FS: ext3 FS: sysfs Ioctls POSIX Arjan van de Ven Pekka Enberg

David Teigland said:

this is the latest set of gfs patches, it includes some minor munging since the previous set. Andrew, could this be added to -mm? there's not much in the way of pending changes.

http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050901/broken-out/

I'd like to get a list of specific things remaining for merging. I believe we've responded to everything from earlier reviews, they were very helpful and more would be excellent. The list begins with one item from before that's still pending:

Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists. [cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.]

Arjan van de Ven offered concrete criticisms on the patches themselves, pointing out races and other problems; and David and others discussed these. Elsewhere, Pekka Enberg pointed out that the requirement to walk VMA lists wasn't just a case of "some not liking it", but would actually prevent GFS from working properly with other clustered filesystems. And Daniel Phillips also brought some perspective to the whole prospect of a GFS merge into mainline, saying:

Where are the benchmarks and stability analysis? How many hours does it survive cerberos running on all nodes simultaneously? Where are the testimonials from users? How long has there been a gfs2 filesystem? Note that Reiser4 is still not in mainline a year after it was first offered, why do you think gfs2 should be in mainline after one month?

So far, all catches are surface things like bogus spinlocks. Substantive issues have not even begun to be addressed. Patience please, this is going to take a while.

Andrew Morton also asked for answers to a few basic questions:

I don't recall seeing much discussion or exposition of

Why the kernel needs two clustered filesystems
Why GFS is better than OCFS2, or has functionality which OCFS2 cannot possibly gain (or vice versa)
Relative merits of the two offerings

Alan Cox pointed out what he felt was a simple answer to all these questions: "people actively use it and have been for some years. Same reason we have NTFS, HPFS, and all the others. On that alone it makes sense to include." But Christoph Hellwig remarked, "That's GFS. The submission is about a GFS2 that's on-disk incompatible to GFS." Alan replied:

Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3 then. I think the main point still stands - we have always taken multiple file systems on board and we have benefitted enormously from having the competition between them instead of a dictat from the kernel kremlin that 'foofs is the one true way'

Competition will decide if OCFS or GFS is better, or indeed if someone comes along with another contender that is better still. And competition will probably get the answer right.

The only thing that is important is we don't end up with each cluster fs wanting different core VFS interfaces added.

Lars Marowsky-Bree was not as trusting in the virtues of competition, pointing out that "Competition will come up with the same situation like reiserfs and ext3 and XFS, namely that they'll all be maintained going forward because of, uhm, political constraints ;-)" But he also affirmed, "as long as they _are_ maintained and play along nicely with each other (which, btw, is needed already so that at least data can be migrated...), I don't really see a problem of having two or three." He also agreed that requiring different core VFS interfaces would be unacceptable.

Andrew reiterated his question, saying he was looking for technical reasons in favor of inclusion. David offered:

GFS is an established fs, it's not going away, you'd be hard pressed to find a more widely used cluster fs on Linux. GFS is about 10 years old and has been in use by customers in production environments for about 5 years. It is a mature, stable file system with many features that have been technically refined over years of experience and customer/user feedback. The latest development cycle (GFS2) has focussed on improving performance, it's not a new file system -- the "2" indicates that it's not ondisk compatible with earlier versions.

OCFS2 is a new file system. I expect they'll want to optimize for their own unique goals. When OCFS appeared everyone I know accepted it would coexist with GFS, each in their niche like every other fs. That's good, OCFS and GFS help each other technically even though they may eventually compete in some areas (which can also be good.)

Here's a random summary of technical features:

cluster infrastructure: a lot of work, perhaps as much as gfs itself, has gone into the infrastructure surrounding and supporting gfs
cluster infrastructure allows for easy cooperation with CLVM
interchangeable lock/cluster modules: gfs interacts with the external infrastructure, including lock manager, through an interchangeable module allowing the fs to be adapted to different environments.
a "nolock" module can be plugged in to use gfs as a local fs (can be selected at mount time, so any fs can be mounted locally)
quotas, acls, cluster flocks, direct io, data journaling, ordered/writeback journaling modes -- all supported
gfs transparently switches to a different locking scheme for direct io allowing parallel non-allocating writes with no lock contention
posix locks -- supported, although it's being reworked for better performance right now
asynchronous locking, lock prefetching + read-ahead
coherent shared-writeable memory mappings across the cluster
nfs3 support (multiple nfs servers exporting one gfs is very common)
extend fs online, add journals online
full fs quiesce to allow for block level snapshot below gfs
read-only mount
"spectator" mount (like ro but no journal allocated for the mount, no fencing needed for failed node that was mounted as spectator)
infrastructure in place for live ondisk inode migration, fs shrink
stuffed dinodes, small files are stored in the disk inode block
tunable (fuzzy) atime updates
fast, nondisruptive stat on files during non-allocating direct-io
fast, nondisruptive statfs (df) even during heavy fs usage
friendly handling of io errors: shut down fs and withdraw from cluster
largest GFS cluster deployed was around 200 nodes, most are much smaller
use many GFS file systems at once on a node and in a cluster
customers use GFS for: scientific apps, HA, NFS serving, database, others I'm sure
graphical management tools for gfs, clvm, and the cluster infrastructure exist and are improving quickly

Arjan short-circuited any discussion of these particular features, pointing out that David's description referred to GFS, not to GFS2 which, as others had already pointed out, was not compatible. David replied:

Just a new version, not a big difference. The ondisk format changed a little making it incompatible with the previous versions. We'd been holding out on the format change for a long time and thought now would be a sensible time to finally do it.

This is also about timing things conveniently. Each GFS version coincides with a development cycle and we decided to wait for this version/cycle to move code upstream. So, we have new version, format change, and code upstream all together, but it's still the same GFS to us.

As with _any_ new version (involving ondisk formats or not) we need to thoroughly test everything to fix the inevitable bugs and regressions that are introduced, there's nothing new or surprising about that.

About the name -- we need to support customers running both versions for a long time. The "2" was added to make that process a little easier and clearer for people, that's all. If the 2 is really distressing we could rip it off, but there seems to be as many file systems ending in digits than not these days...

Daniel asked what the on-disk format change was all about, but there was no reply to that post. Elsewhere, various folks made serious efforts to answer Andrew's request for technical reasons for or against inclusion. Andi Kleen kicked off that branch of discussion, saying to Andrew:

There seems to be clearly a need for a shared-storage fs of some sort for HA clusters and virtualized usage (multiple guests sharing a partition). Shared storage can be more efficient than network file systems like NFS because the storage access is often more efficient than network access and it is more reliable because it doesn't have a single point of failure in form of the NFS server.

It's also a logical extension of the "failover on failure" clusters many people run now - instead of only failing over the shared fs at failure and keeping one machine idle the load can be balanced between multiple machines at any time.

One argument to merge both might be that nobody really knows yet which shared-storage file system (GFS or OCFS2) is better. The only way to find out would be to let the user base try out both, and that's most practical when they're merged.

Personally I think ocfs2 has nicer & cleaner code than GFS. It seems to be more or less a 64bit ext3 with cluster support, while GFS seems to reinvent a lot more things and has somewhat uglier code. On the other hand GFS' cluster support seems to be more aimed at being a universal cluster service open for other usages too, which might be a good thing. OCFS2s cluster seems to be more aimed at only serving the file system.

But which one works better in practice is really an open question.

The only thing that should be probably resolved is a common API for at least the clustered lock manager. Having multiple incompatible user space APIs for that would be sad.

What Andi called a "clustered lock manager" is more commonly known as a "distributed lock manager", or DLM. That was the term used for the rest of the discussion, and the DLM became its primary focus. In this light, Daniel Phillips replied to Andi:

The only current users of dlms are cluster filesystems. There are zero users of the userspace dlm api. Therefore, the (g)dlm userspace interface actually has nothing to do with the needs of gfs. It should be taken out the gfs patch and merged later, when or if user space applications emerge that need it. Maybe in the meantime it will be possible to come up with a userspace dlm api that isn't completely repulsive.

Also, note that the only reason the two current dlms are in-kernel is because it supposedly cuts down on userspace-kernel communication with the cluster filesystems. Then why should a userspace application bother with an awkward interface to an in-kernel dlm? This is obviously suboptimal. Why not have a userspace dlm for userspace apps, if indeed there are any userspace apps that would need to use dlm-style synchronization instead of more typical socket-based synchronization, or Posix locking, which is already exposed via a standard api?

There is actually nothing wrong with having multiple, completely different dlms active at the same time. There is no urgent need to merge them into the one true dlm. It would be a lot better to let them evolve separately and pick the winner a year or two from now. Just think of the dlm as part of the cfs until then.

What does have to be resolved is a common API for node management. It is not just cluster filesystems and their lock managers that have to interface to node management. Below the filesystem layer, cluster block devices and cluster volume management need to be coordinated by the same system, and above the filesystem layer, applications also need to be hooked into it. This work is, in a word, incomplete.

Close by, Mark Fasheh also said to Andi, "As far as userspace dlm apis go, dlmfs already abstracts away a large part of the dlm interaction, so writing a module against another dlm looks like it wouldn't be too bad (startup of a lockspace is probably the most difficult part there)." Daniel asked why sysfs would not work just as well for this, and Wim Coekaerts replied cryptically that the two were totally different. Daniel replied:

You create a dlm domain when a directory is created. You create a lock resource when a file of that name is opened. You lock the resource when the file is opened. You access the lvb by read/writing the file. Why doesn't that fit the configfs-nee-sysfs model? If it does, the payoff will be about 500 lines saved.

This little dlm fs is very slick, but grossly inefficient. Maybe efficiency doesn't matter here since it is just your slow-path userspace tools taking these locks. Please do not even think of proposing this as a way to export a kernel-based dlm for general purpose use!

Your userdlm.c file has some hidden gold in it. You have factored the dlm calls far more attractively than the bad old bazillion-parameter Vaxcluster legacy. You are almost in system call zone there. (But note my earlier comment on dlms in general: until there are dlm-based applications, merging a general-purpose dlm API is pointless and has nothing to do with getting your filesystem merged.)

Andrew agreed that Daniel was asking a legitimate question. He went on, "If there's duplicated code in there then we should seek to either make the code multi-purpose or place the common or reusable parts into a library somewhere. If neither approach is applicable or practical for *every single function* then fine, please explain why. AFAIR that has not been done." Joel Becker replied:

Regarding sysfs and configfs, that's a whole 'nother conversation. I've not yet come up with a function involved that is identical, but that's a response here for another email.

Understanding that Daniel is talking about dlmfs, dlmfs is far more similar to devptsfs, tmpfs, and even sockfs and pipefs than it is to sysfs. I don't see him proposing that sockfs and devptsfs be folded into sysfs.

dlmfs is *tiny*. The VFS interface is less than his claimed 500 lines of savings. The few VFS callbacks do nothing but call DLM functions. You'd have to replace this VFS glue with sysfs glue, and probably save very few lines of code.

In addition, sysfs cannot support the dlmfs model. In dlmfs, mkdir(2) creates a directory representing a DLM domain and mknod(2) creates the user representation of a lock. sysfs doesn't support mkdir(2) or mknod(2) at all.

More than mkdir() and mknod(), however, dlmfs uses open(2) to acquire locks from userspace. O_RDONLY acquires a shared read lock (PR in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock is released via close(2). If a process dies, close(2) happens. In other words, ->release() handles all the cleanup for normal and abnormal termination.

sysfs does not allow hooking into ->open() or ->release(). So this model, and the inherent lifetiming that comes with it, cannot be used. If dlmfs was changed to use a less intuitive model that fits sysfs, all the handling of lifetimes and cleanup would have to be added. This would make it more complex, not less complex. It would give it a larger code size, not a smaller one. In the end, it would be harder to maintain, less intuitive to use, and larger.
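
Joel's description of the dlmfs open(2) convention amounts to a small decision table. The sketch below only illustrates that mapping in Python and is not dlmfs code: the function name `dlmfs_request` and its return shape are hypothetical, and rejecting O_WRONLY is an assumption, since the message does not say how that case is handled.

```python
import os

# Access-mode bits of the open(2) flags (O_RDONLY is 0 on Linux).
ACCESS_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def dlmfs_request(flags):
    """Return (lock_mode, is_trylock) for a set of open(2) flags.

    Hypothetical helper mirroring the mapping described above:
    O_RDONLY -> shared read lock ("PR"), O_RDWR -> exclusive ("EX"),
    O_NONBLOCK -> do not wait for the lock (trylock).
    """
    access = flags & ACCESS_MASK
    if access == os.O_RDONLY:
        mode = "PR"
    elif access == os.O_RDWR:
        mode = "EX"
    else:
        # Assumption: the description does not cover O_WRONLY.
        raise ValueError("access mode not covered by the description")
    return mode, bool(flags & os.O_NONBLOCK)
```

In this model the unlock path needs no entry of its own: closing the descriptor, whether explicitly or because the process died, is what releases the lock.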

The DLM debate and its relationship to GFS acceptance became very technical, with many tendrils of discussion that did not lead to any clear conclusion, even though Andrew was an active participant throughout. The closest thing to a decision came when David, who had opened the thread, said that GFS depended on the full DLM API and would find it impractical to rely on anything else. He said, "We export our full dlm API through read/write/poll on a misc device. All user space apps use the dlm through a library as you'd expect. The library communicates with the dlm_device kernel module through read/write/poll and the dlm_device module talks with the actual dlm: linux/drivers/dlm/device.c If there's a better way to do this, via a pseudo fs or not, we'd be pleased to try it." Andrew replied, "inotify did that for a while, but we ended up going with a straight syscall interface. How fat is the dlm interface? ie: how many syscalls would it take?" David replied that only 4 functions would be needed: create_lockspace(), release_lockspace(), lock(), and unlock(). Kurt C. Hackel from Oracle replied:

FWIW, it looks like we can agree on the core interface. ocfs2_dlm exports essentially the same functions:

dlm_register_domain() dlm_unregister_domain() dlmlock() dlmunlock()

I also implemented dlm_migrate_lockres() to explicitly remaster a lock on another node, but this isn't used by any callers today (except for debugging purposes). There is also some wiring between the fs and the dlm (eviction callbacks) to deal with some ordering issues between the two layers, but these could go if we get stronger membership.

There are quite a few other functions in the "full" spec(1) that we didn't even attempt, either because we didn't require direct user<->kernel access or we just didn't need the function. As for the rather thick set of parameters expected in dlm calls, we managed to get dlmlock down to *ahem* eight, and the rest are fairly slim.

Looking at the misc device that gfs uses, it seems like there is pretty much complete interface to the same calls you have in kernel, validated on the write() calls to the misc device. With dlmfs, we were seeking to lock down and simplify user access by using standard ast/bast/unlockast calls, using a file descriptor as an opaque token for a single lock, letting the vfs lifetime on this fd help with abnormal termination, etc. I think both the misc device and dlmfs are helpful and not necessarily mutually exclusive, and probably both are better approaches than exporting everything via loads of syscalls (which seems to be the VMS/opendlm model).
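
To make the shape of the four-call interface concrete, here is a toy, single-process model in Python. Only the four operation names come from David's message; the class name, the argument lists, the two lock modes ("PR" shared, "EX" exclusive) and the return values are assumptions made for illustration, and nothing here talks to a kernel or a cluster.

```python
import itertools


class ToyLockManager:
    """In-memory stand-in for a lock manager; illustration only."""

    def __init__(self):
        self._spaces = {}              # lockspace -> resource -> {id: mode}
        self._ids = itertools.count(1)

    def create_lockspace(self, name):
        self._spaces.setdefault(name, {})
        return name

    def release_lockspace(self, name):
        if self._spaces.get(name):
            raise RuntimeError("lockspace still holds locks")
        self._spaces.pop(name, None)

    def lock(self, space, resource, mode):
        """Grant and return a lock id, or None if the request conflicts."""
        holders = self._spaces[space].setdefault(resource, {})
        # Shared locks coexist; an exclusive lock conflicts with anything.
        if holders and (mode == "EX" or "EX" in holders.values()):
            return None
        lock_id = next(self._ids)
        holders[lock_id] = mode
        return lock_id

    def unlock(self, space, resource, lock_id):
        holders = self._spaces[space][resource]
        del holders[lock_id]
        if not holders:
            del self._spaces[space][resource]
```

A real DLM adds more lock modes, asynchronous completion and blocking callbacks, which is where most of the parameters Kurt mentions come from.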

Andrew liked the 4 syscall requirement, saying, "Neat. I'd be inclined to make them syscalls then. I don't suppose anyone is likely to object if we reserve those slots." Daniel cautioned that the function parameters might be a bit ugly, but David said it was likely there would be no more than 2 or 3 for any of them. But Alan Cox spoke out vehemently against this whole course of action. He said:

If the locks are not file descriptors then answer the following:

How are they ref counted
What are the cleanup semantics
How do I pass a lock between processes (AF_UNIX sockets wont work now)
How do I poll on a lock coming free.
What are the semantics of lock ownership
What rules apply for inheritance
How do I access a lock across threads.
What is the permission model.
How do I attach audit to it
How do I write SELinux rules for it
How do I use mount to make namespaces appear in multiple vservers

and thats for starters...

Every so often someone decides that a deeply un-unix interface with new syscalls is a good idea. Every time history proves them totally bonkers. There are cases for new system calls but this doesn't seem one of them.

Look at system 5 shared memory, look at system 5 ipc, and so on. You can't use common interfaces on them, you can't select on them, you can't sanely pass them by fd passing.

All our existing locking uses the following behaviour

        fd = open(namespace, options)
        fcntl(.. lock ...)
        blah
        flush
        fcntl(.. unlock ...)
        close

Unfortunately some people here seem to have forgotten WHY we do things this way.

The semantics of file descriptors are well understood by users and by programs. That makes programming easier and keeps code size down
Everyone knows how close() works including across fork
FD passing is an obscure art but understood and just works
Poll() is a standard understood interface
Ownership of files is a standard model
FD passing across fork/exec is controlled in a standard way
The semantics for threaded applications are defined
Permissions are a standard model
Audit just works with the same tools
SELinux just works with the same tools
I don't need specialist applications to see the system state (the whole point of sysfs yet someone wants to break it all again)
fcntl fd locking is a posix standard interface with precisely defined semantics. Our extensions including leases are very powerful
And yes - fcntl fd locking supports mandatory locking too. That also is standards based with precise semantics.

Everyone understands how to use the existing locking operations. So if you use the existing interfaces with some small extensions if necessary everyone understands how to use cluster locks. Isn't that neat....
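
The open / fcntl lock / work / flush / fcntl unlock / close sequence that Alan outlines can be written out as a short runnable sketch. This one uses Python's standard `fcntl.lockf` wrapper around POSIX fcntl() locks on an ordinary local file; the helper name `update_under_lock` and the file used in the demo are hypothetical, and nothing here involves a cluster.

```python
import fcntl
import os
import tempfile


def update_under_lock(path, data):
    """Write data to path while holding an exclusive fcntl lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)   # fd = open(...)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)                   # fcntl(.. lock ...)
        try:
            written = os.write(fd, data)                 # the actual work
            os.fsync(fd)                                 # flush
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)               # fcntl(.. unlock ...)
    finally:
        os.close(fd)                                     # close
    return written


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "demo.lock")
    print(update_under_lock(demo, b"hello"))
```

Because the lock hangs off a file descriptor, closing the descriptor (or the process exiting) drops it, which is the cleanup property Alan's list keeps returning to.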

Andrew disagreed that the new syscalls would be such grave violations. He pointed out, "David said that 'We export our full dlm API through read/write/poll on a misc device.' That miscdevice will simply give us an fd. Hence my suggestion that the miscdevice be done away with in favour of a dedicated syscall which returns an fd." Alan didn't reply.

At right around this point, Patrick Caulfield got home from vacation, and threw out his take on things:

let me tell you what we do now and why and lets see what's wrong with it.

Currently the library create_lockspace() call returns an FD upon which all lock operations happen. The FD is onto a misc device, one per lockspace, so if you want lockspace protection it can happen at that level. There is no protection applied to locks within a lockspace nor do I think it's helpful to do so to be honest. Using a misc device limits you to <255 lockspaces depending on the other uses of misc but this is just for userland-visible lockspace - it does not affect GFS filesystems for instance.

Lock/convert/unlock operations are done using write calls on that lockspace FD. Callbacks are implemented using poll and read on the FD, read will return data blocks (one per callback) as long as there are active callbacks to process. The current read functionality behaves more like a SOCK_PACKET than a data stream which some may not like but then you're going to need to know what you're reading from the device anyway.

ioctl/fcntl isn't really useful for DLM locks because you can't do asynchronous operations on them - the lock has to succeed or fail in the one operation - if you want a callback for completion (or blocking notification) you have to poll the lockspace FD anyway and then you might as well go back to using read and write because at least they are something of a matched pair. Something similar applies, I think, to a syscall interface.

Another reason the existing fcntl interface isn't appropriate is that it's not locking the same kind of thing. Current Unix fcntl calls lock byte ranges. DLM locks arbitrary names and has a much richer list of lock modes. Adding another fcntl just runs into the problems mentioned above.

The other reason we use read for callbacks is that there is information to be passed back: lock status, value block and (possibly) query information.

While having an FD per lock sounds like a nice unixy idea I don't think it would work very well in practice. Applications with hundreds or thousands of locks (such as databases) would end up with huge pollfd structs to manage, and while it helps the refcounting (currently the nastiest bit of the current dlm_device code) it removes the possibility of having persistent locks that exist after the process exits - a handy feature that some people do use, though I don't think it's in the currently submitted DLM code. One FD per lock also gives each lock two handles, the lock ID used internally by the DLM and the FD used externally by the application which I think is a little confusing.

I don't think a dlmfs is useful, personally. The features you can export from it are either minimal compared to the full DLM functionality (so you have to export the rest by some other means anyway) or are going to be so un-filesystemlike as to be very awkward to use. Doing lock operations in shell scripts is all very cool but how often do you /really/ need to do that?

I'm not saying that what we have is perfect - far from it - but we have thought about how this works and what we came up with seems like a good compromise between providing full DLM functionality to userspace using unix features. But we're very happy to listen to other ideas - and have been doing I hope.
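
Patrick's "one FD per lockspace, poll it, then read one record per callback" design can be imitated in user space. The Python sketch below is an analogy, not the dlm_device interface: a datagram socket pair stands in for the misc device so that each read returns exactly one record, and the record layout (a 32-bit lock id followed by a 32-bit status) and the function name `drain_callbacks` are invented for the example.

```python
import select
import socket
import struct

# Hypothetical callback record: lock id (unsigned 32-bit), status (signed 32-bit).
CALLBACK = struct.Struct("=Ii")


def drain_callbacks(lockspace_fd, timeout_ms=50):
    """Poll one descriptor and read callback records until none remain."""
    poller = select.poll()
    poller.register(lockspace_fd, select.POLLIN)
    results = []
    while poller.poll(timeout_ms):
        # Datagram semantics: one recv() yields exactly one record.
        record = lockspace_fd.recv(CALLBACK.size)
        results.append(CALLBACK.unpack(record))
    return results


if __name__ == "__main__":
    app_end, fake_dlm_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    fake_dlm_end.send(CALLBACK.pack(1, 0))
    fake_dlm_end.send(CALLBACK.pack(2, -11))
    print(drain_callbacks(app_end))
```

The point of the design is visible in the loop: an application holding thousands of locks still polls a single descriptor, and the lock id travels inside the record rather than being implied by which descriptor became readable.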

The discussion ended here, with no certain conclusion, though Andrew's syscall preference may hold sway.

Assembly Digital Video Broadcasting Networking PCI Power Management: ACPI Stephen Hemminger James Bottomley David S. Miller Benjamin Herrenschmidt Alexander Viro Theodore Ts'o Linus Torvalds Randy Dunlap Alan Cox Mark Haverkamp David Woodhouse Patrick McHardy Andrew Morton Zwane Mwaikambo

Chris Wright said:

This is the start of the stable review cycle for the 2.6.13.1 release. There are 9 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let us know. If anyone is a maintainer of the proper subsystem, and wants to add a signed-off-by: line to the patch, please respond with it.

These patches are sent out with a number of different people on the Cc: line. If you wish to be a reviewer, please email stable@kernel.org to add your name to the list. If you want to be off the reviewer list, also email us.

The Cc list contained Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap, Chuck Wolber, Linus Torvalds, Andrew Morton, and Alan Cox, in addition to the linux-kernel mailing list itself.

Each of Chris's replies had a single patch, with these changelog entries:

I wish I had seen this before 2.6.13 was released... I guess this only goes to show that there haven't been any testers using saa7134-hybrid dvb/v4l boards that depend on the tda1004x module, during the 2.6.13-rc series :-(

Please apply this to 2.6.14, and also to 2.6.13.1 -stable. Without this patch, users will have to EXPLICITLY select tda1004x in Kconfig. This SHOULD be done automatically when saa7134-dvb is selected. This patch corrects this problem.

saa7134-dvb must select tda1004x

Signed-off-by: Michael Krufky
Signed-off-by: Chris Wright

This was noticed by Doug Bazamic and the fix found by Mark Salyzyn at Adaptec.

There was an error in the BUG_ON() statement that validated the calculated fib size which can cause the driver to panic.

Signed-off-by: Mark Haverkamp
Acked-by: James Bottomley
Signed-off-by: Chris Wright

This fixes a problem with pci_map_rom() which doesn't properly update the ROM BAR value with the address that was allocated for it by the PCI code. This problem, among others, breaks boot on Mac laptops.

It's a new version based on Linus' latest one with better error checking.

Signed-off-by: Benjamin Herrenschmidt
Signed-off-by: Linus Torvalds
Signed-off-by: Chris Wright

I had some time to think about PCI assign issues in 2.6.13-rc series.

The major problem here is that we call pci_assign_unassigned_resources() way too early - at subsys_initcall level. Therefore we give no chances to ACPI and PnP routines (called at fs_initcall level) to reserve their respective resources properly, as the comments in drivers/pnp/system.c and drivers/acpi/motherboard.c suggest:
```
 /**
  * Reserve motherboard resources after PCI claim BARs,
  * but before PCI assign resources for uninitialized PCI devices
  */
```
So I moved the pci_assign_unassigned_resources() call to pcibios_assign_resources() (fs_initcall), which should hopefully fix a lot of problems and make PCIBIOS_MIN_IO tweaks unnecessary.

Other changes:
- remove resource assignment code from pcibios_assign_resources(), since it duplicates pci_assign_unassigned_resources() functionality and actually does nothing in 2.6.13;
- modify ROM assignment code as per Ben's suggestion: try to use firmware settings by default (if PCI_ASSIGN_ROMS is not set);
- set CARDBUS_IO_SIZE back to 4K as it's a wonderful stress test for various setups.
Confirmed by Tero Roponen (who had problems with the 4kB CardBus IO size previously).

Signed-off-by: Linus Torvalds
Signed-off-by: Chris Wright

[NET]: 2.6.13 breaks libpcap (and tcpdump)

Patrick McHardy says:

Never mind, I got it, we never fall through to the second switch statement anymore. I think we could simply break when load_pointer returns NULL. The switch statement will fall through to the default case and return 0 for all cases but 0 > k >= SKF_AD_OFF.

Here's a patch to do just that.

I left BPF_MSH alone because it's really a hack to calculate the IP header length, which makes no sense when applied to the special data.

Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
Signed-off-by: Chris Wright

[CRYPTO] Fix boundary check in standard multi-block cipher processors

Fixes Bug 5194 (IPSec related Oops in 2.6.13).

The boundary check in the standard multi-block cipher processors are broken when nbytes is not a multiple of bsize. In those cases it will always process an extra block.

This patch corrects the check so that it processes at most nbytes of data.
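
The off-by-one-block behaviour described in this changelog can be reproduced with a toy loop. The Python sketch below is not the kernel's cipher code; both function names are hypothetical, and each simply returns how many bytes a block loop would touch for a given `nbytes` and block size `bsize`.

```python
def bytes_touched_loose(nbytes, bsize):
    """Loop that keeps going while any data remains (over-runs)."""
    done = 0
    while done < nbytes:
        done += bsize          # processes a whole block even if only part is left
    return done


def bytes_touched_strict(nbytes, bsize):
    """Loop that only processes a block if a full block remains."""
    done = 0
    while done + bsize <= nbytes:
        done += bsize
    return done


if __name__ == "__main__":
    # 20 bytes with 8-byte blocks: the loose check touches 24 bytes,
    # the strict check stops at 16 and leaves the 4-byte tail alone.
    print(bytes_touched_loose(20, 8), bytes_touched_strict(20, 8))
```

When `nbytes` is an exact multiple of `bsize` the two loops agree, which is consistent with the changelog's remark that the problem only appears for lengths that are not a multiple of the block size.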

Signed-off-by: Herbert Xu
Signed-off-by: Chris Wright

[IPV4]: Reassembly trim not clearing CHECKSUM_HW

This was found by inspection while looking for checksum problems with the skge driver that sets CHECKSUM_HW. It did not fix the problem, but it looks like it is needed.

If IP reassembly is trimming an overlapping fragment, it should reset (or adjust) the hardware checksum flag on the skb.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller
Signed-off-by: Chris Wright

When we copy 32bit ->msg_control contents to kernel, we walk the same userland data twice without sanity checks on the second pass.

Second version of this patch: the original broke with 64-bit arches running 32-bit-compat-mode executables doing sendmsg() syscalls with unaligned CMSG data areas.

Another thing is that we use kmalloc() to allocate and sock_kfree_s() to free afterwards; less serious, but also needs fixing.

Patch by Al Viro, David Miller, David Woodhouse

Signed-off-by: Chris Wright
Fix unchecked __get_user that could be tricked into generating a memory read on an arbitrary address. The result of the read is not returned directly but you may be able to divine some information about it, or use the read to cause a crash on some architectures by reading hardware state. CAN-2005-2492.

Fix from Alexander Viro, ack from Dave S. Miller.

Signed-off-by: Chris Wright

For his job, Weber Ress had to lead a team of engineers in upgrading the kernel from 2.4 to 2.6 on many servers. He asked for advice. Michael Thonke suggested, "google is your best friend and first source for it," and gave a link to William von Hagen's article on the subject. Jesper Juhl also said:

I do upgrade a lot of kernels, so I'll tell you a little about what I do and what I'd recommend. Then you can do with that info what you like :)

The very first thing you want to do is to ensure that all core utilities/tools are up-to-date to versions that will work with your new kernel.

If you download a copy of the 2.6.13 kernel source, extract it, and look in the file Documentation/Changes you'll see a list of tools and utils along with the minimum required version for them to work properly with that kernel. Ensure those tools are OK.
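
The comparison against the minimum versions could be scripted along these lines. This is a minimal sketch: the helper names are hypothetical, the tool names and version numbers used with it would be supplied by the reader from their own copy of Documentation/Changes, and it assumes purely numeric dotted versions (a string such as "1.5.9i" would need extra handling).

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '2.12' into (2, 12)."""
    return tuple(int(part) for part in text.split("."))


def is_older(have: str, required: str) -> bool:
    """True if `have` is a lower version than `required`."""
    a, b = parse_version(have), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '3.1' and '3.1.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def outdated_tools(minimum: dict, installed: dict) -> list:
    """Return the tools that are missing or older than the required minimum."""
    stale = []
    for tool, required in minimum.items():
        have = installed.get(tool)
        if have is None or is_older(have, required):
            stale.append(tool)
    return sorted(stale)
```

Called with one dictionary of required versions and one of installed versions, outdated_tools returns the names that still need upgrading before the new kernel is booted.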

Once you are sure the core utils are up-to-date you need to go check whatever other important programs you have on the machine(s) and check that those are also able to run OK with the new kernel.

Once you are satisfied that everything is up to a level that'll work with the new kernel you can go build the new 2.6.13 kernel and drop it in place. You don't need to remove your existing kernel first, you can just install the 2.6.13 kernel side by side with the old one and test boot it, then if it doesn't work right you can always reboot back to the old one.

Most likely you can find documentation for your distribution stating what version of it is "2.6 ready" - I use Slackware for example, and Slackware 10.1 is completely 2.6 kernel ready, so on a Slackware 10.1 box there's no hassle at all, I just drop in a 2.6 kernel in place of the 2.4 one it installs by default and everything is good - all tools are already ready to cope.

Topics: Disks: SCSI, FS: sysfs, Hot-Plugging, Ioctls, Ottawa Linux Symposium, SMP, Serial ATA; People: Douglas Gilbert

Luben Tuikov from Adaptec said:

The following announcements and patches introduce Serial Attached SCSI (SAS) support for the Linux kernel. Everything is supported.

The infrastructure is broken into

- SAS LLDD,
- SAS Layer.

The SAS LLDD does phy/OOB management, and generates SAS events to the SAS Layer. Those events are *the only way* a SAS LLDD communicates with the SAS Layer. If you can generate 2 types of event, then you can use this infrastructure. The first two are, loosely, "link was severed", "bytes were dmaed". The third kind is "received a primitive", used for domain revalidation.

A SAS LLDD should implement the Execute Command SCSI RPC and at least one SCSI TMF (Task Management Function), in order for the SAS Layer to communicate with the SAS LLDD.
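
The event-only interface between the two parts can be modelled roughly as follows. Everything here is a hypothetical sketch: the class names, method names, and log messages are invented for illustration and are not taken from the patches. The only point it mirrors is that the LLDD side talks to the layer solely by posting events.

```python
from collections import deque
from enum import Enum, auto


class Event(Enum):
    LINK_SEVERED = auto()        # "link was severed"
    BYTES_DMAED = auto()         # "bytes were dmaed"
    PRIMITIVE_RECEIVED = auto()  # "received a primitive"


class SasLayer:
    """Receives events and decides what to do with them."""

    def __init__(self):
        self.queue = deque()
        self.log = []

    def notify(self, phy_id: int, event: Event) -> None:
        # The single entry point a low-level driver uses.
        self.queue.append((phy_id, event))

    def process(self) -> None:
        while self.queue:
            phy_id, event = self.queue.popleft()
            if event is Event.LINK_SEVERED:
                self.log.append(f"phy{phy_id}: link down")
            elif event is Event.BYTES_DMAED:
                self.log.append(f"phy{phy_id}: data arrived")
            else:
                # Per the text, primitives drive domain revalidation.
                self.log.append(f"phy{phy_id}: revalidate domain")


class Lldd:
    """Low-level driver stand-in: it only generates events."""

    def __init__(self, layer: SasLayer):
        self.layer = layer

    def link_lost(self, phy_id: int) -> None:
        self.layer.notify(phy_id, Event.LINK_SEVERED)

    def bytes_arrived(self, phy_id: int) -> None:
        self.layer.notify(phy_id, Event.BYTES_DMAED)

    def primitive_seen(self, phy_id: int) -> None:
        self.layer.notify(phy_id, Event.PRIMITIVE_RECEIVED)
```

Keeping notify as the only call from driver to layer is what would let a different vendor's driver plug into the same layer, provided it can produce the same kinds of event.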

The SAS Layer is concerned with

- SAS Phy/Port/HA event management (LLDD generates, SAS Layer processes),
- SAS Port management (creation/destruction),
- SAS Domain discovery and revalidation,
- SAS Domain device management,
- SCSI Host registration/deregistration,
- Device registration with SCSI Core (SAS) or libata (SATA/PI), and
- Expander management and exporting expander control to user space.

The SAS Layer uses the Execute Command SCSI RPC, and the TMFs implemented by the SAS LLDD in order to manage the domain and the domain devices.

For details please see drivers/scsi/sas-class/README.

The SAS Layer represents the SAS domain in sysfs. For each object represented, its parent is the physical entity it attaches to in the physical world. So in effect, kobject_get, gets the whole chain up on which that object depends on.

In effect, the sysfs representation of the SAS domain(s) is what you'd see in the physical world.
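
The remark about kobject_get pinning the whole chain can be illustrated with a small reference-counting model. This is a sketch in Python rather than kernel code: the Node class and the object names are hypothetical, and the model only imitates the rule that a child holds a reference on its parent.

```python
class Node:
    """Toy stand-in for a refcounted object with a parent pointer."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.refs = 1            # reference held by whoever created the node
        self.released = False
        if parent is not None:
            parent.get()         # a child pins its parent

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.released = True
            if self.parent is not None:
                self.parent.put()    # last reference gone: unpin the parent

    def path(self):
        """Path from the topmost ancestor down to this node."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))
```

In this model, taking one extra reference on a leaf device keeps every ancestor alive even after the creators drop their own references, and the path from the root mirrors how the device is physically attached.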

Hot plugging and hot unplugging of devices, domains and subdomains is supported. Repeated hot plugging and hot unplugging is also supported, naturally.

SAS introduces a new physical entity, an expander. Expanders are _not_ SAS devices, and thus are _not_ SCSI devices. Expanders are part of the Service Delivery Subsystem, in this case SAS.

Expanders are controlled using the Serial Management Protocol (SMP). Complete control is given to user space of all expanders found in the domain, using an "smp_portal". More of this in the second and third email in this series.

A user space program, "expander_conf.c" is also presented to show how one controls expanders in the domain. It is located here: drivers/scsi/sas-class/expanders_conf.c

The second email in this series shows an example of SAS domains and their representation in sysfs.

The third email in this series shows an example of using the "expander_conf.c" program to query all expanders in the domain, showing their attributes, their phys, and their routing tables.

If you have the hardware, please give it a try. If you have expander(s) it would be even more interesting.

Patches of the SAS Layer and of the AIC94XX SAS LLDD follow.

You can also download the patches from http://www.geocities.com/ltuikov/

Christoph Hellwig said, "At the core it's some really nice code dealing with host-based SAS implementations. What's not nice is that it's not integrating with the SAS transport class I posted, it's duplicating things like LUN discovery from the SCSI core code, and adding its own sysfs representation that's very different from the way the SCSI core and transport classes do it. Are you willing to work with us to integrate it with the infrastructure we have?"

Luben replied, "HP and LSI were aware of my efforts since the beginning of the year. As well, you had a copy of my code July 14 this year, long before starting your work on your SAS class for LSI and HP (so its acceptance is guaranteed), after OLS. We did meet at OLS and we did have the SAS BOF. I'm not sure why you didn't want to work together?" He invited Christoph to base future work on Luben's implementation.

Andrew Patterson from Hewlett Packard replied, "This effort started in April. Eric Moore, Mike Miller and I started work on a SAS transport class and then later pulled Luben in at the suggestion of Douglas Gilbert (if I remember correctly). We later mutually agreed that Luben would take over the transport class work as he seemed to have much more experience with this sort of thing. The original idea was to implement a SAS transport class that would allow the LSI and Adaptec driver to get into kernel.org (or others at the time) and to find a way to get SDI/CSMI API's into the kernel without the use of IOCTL's. Luben then went off on his own and came up with his effectively Adaptec only solution."

He also added, regarding the OLS BOF, "If my memory serves correctly, there were 10-12 people at that BOF, representing the SCSI kernel maintainers and all of the vendors currently providing SAS hardware. Virtually everyone disagreed with your implementation (which you indeed emailed shortly before the conference) that would only work with one vendor's card. The suggestion was made that you convert your code to various library layers so that it would work with all vendors. A suggestion which it seems that you continue to reject."

Ian E. Morgan said:

I would like to ask that the SBC8360 watchdog driver be pushed upstream from -mm in time for the 2.6.14-rc series.

I recognise that this driver, like a lot of the watchdog drivers, is for a piece of hardware that is present in only a very small percentage of hardware running Linux. I doubt that being in -mm for a long time will make any significant difference to it being more widely tested. The driver is working perfectly as expected on each of the machines we've tested it on.

As a recap, the driver was submitted to akpm, was included in -mm1 (watchdog-new-sbc8360-driver.patch), offloaded to Wim's linux-2.6-watchdog-mm.git tree (commit 88b1f50923d14195ac1a50840fc4aa4066f067a9), and subsequently included in -mm2 by way of the combined git-watchdog.patch.

Please consider merging this driver into 2.6.14-rc1. Thanks.

Andrew Morton replied, "That's in Wim's tree now. Wim, could you please prepare a pull for Linus within the next couple of days?" Wim Van Sebroeck said, "I'm preparing the tree for linus to pull from. Should be there by the end of the weekend. (Will probably contain 6 drivers + some updates of some other drivers)."

Topics: FS: devfs, FS: sysfs, Sound: ALSA; People: Valdis Kletnieks

Greg KH, having been stymied in his effort to remove DevFS in time for the 2.6.12 release, now submitted the identical patches against 2.6.13; he hoped this time they would make it in. He added, "Also, if people _really_ are in love with the idea of an in-kernel devfs, I have posted a patch that does this in about 300 lines of code, called ndevfs. It is available in the archives if anyone wants to use that instead (it is quite easy to maintain that patch outside of the kernel tree, due to it only needing 3 hooks into the main kernel tree.)"

Mike Bell replied that NDevFS was "broken by design. It creates yet another incompatible naming scheme for devices, and what's worse the devices it breaks are the ones like ALSA and the input subsystem, whose locations are hard-coded into libraries. Unless sysfs is going to get attributes from which the proper names could be derived, it won't ever work."

Greg replied that he knew NDevFS wasn't a nice solution, it was just an alternative. He added, "Anyway, I'm not offering it up for inclusion in the kernel tree at all, but for a proof-of-concept for those who were insisting that it was impossible to keep a devfs-like patchset out of the main kernel tree easily."

Elsewhere, David Lang said it was important to be cautious in removing DevFS, because of the dangers of breaking various systems. Greg replied, "Ok, how long should I wait then?" And David said:

if 2.6.13 removed the devfs config option, then I would say the code itself should stay until 2.6.15 or 2.6.16 (if the release schedule does drop down to ~2 months then it would need to be at least .16). especially with so many people afraid of the 2.6 series you need to wait at least one full release cycle, probably two (and possibly more if they end up being short ones) then rip out the rest of the code for the following release.

remember that the distros don't package every kernel, they skip several between releases and it's not going to be until they go to try them that all the kinks will get worked out.

add to this the fact that many people have gotten confused about kernel releases and think that since 13 is odd 2.6.13 is a testing kernel and 2.6.14 will be a stable one and so won't look at .13

note that all this assumes that the issues that people have about sysfs not yet being able to replace the capabilities that they are using in devfs have been addressed.

Greg said he wasn't aware of any major distribution shipping kernels with DevFS enabled. He and Valdis Kletnieks asked if anyone knew of any that did. Bastian Blank said Debian Unstable did, though as someone else pointed out, no one could confuse Debian Unstable with a shippable distribution. Beyond that, no one was able to come up with even a single distribution shipping DevFS.

The thread ended with no hard conclusions about how long DevFS can expect to live in the kernel.

Topics: Assembly, Digital Video Broadcasting, PCI, Security; People: Stephen Hemminger, David S. Miller, Benjamin Herrenschmidt, Chris Wright, Ivan Kokshaysky, Mark Haverkamp, David Woodhouse

Chris Wright announced Linux 2.6.13.1, saying:

We (the -stable team) are announcing the release of the 2.6.13.1 kernel.

The diffstat and short summary of the fixes are below.

I'll also be replying to this message with a copy of the patch between 2.6.13 and 2.6.13.1, as it is small enough to do so.

The updated 2.6.13.y git tree can be found at:
rsync://rsync.kernel.org/pub/scm/linux/kernel/git/chrisw/linux-2.6.13.y.git
and can be browsed at the normal kernel.org git web browser:
www.kernel.org/git/

He listed the changes from 2.6.13 to 2.6.13.1:

Al Viro:
raw_sendmsg DoS (CAN-2005-2492)

Benjamin Herrenschmidt:
Fix PCI ROM mapping

Chris Wright:
Linux 2.6.13.1

David S. Miller:
Use SA_SHIRQ in sparc specific code.

David Woodhouse:
32bit sendmsg() flaw (CAN-2005-2490)

Herbert Xu:
2.6.13 breaks libpcap (and tcpdump)
Fix boundary check in standard multi-block cipher processors

Ivan Kokshaysky:
x86: pci_assign_unassigned_resources() update

Mark Haverkamp:
aacraid: 2.6.13 aacraid bad BUG_ON fix

Michael Krufky:
Kconfig: saa7134-dvb must select tda1004x

Stephen Hemminger:
Reassembly trim not clearing CHECKSUM_HW

Andi Kleen said:

Just noticed the ugly SGI /proc/*/numa_maps code got merged. I argued several times against it and I very deliberately didn't include a similar facility when I wrote the NUMA policy code because it's a bad idea.

- it's a lot of ugly code.
- it's basically only a debugging hack right now
- it presents lots of kernel internal information and mempolicy internals (like how many people have a page mapped) etc. to userland that shouldn't be exposed to this.
- the format is very complicated and the chance of bug free userland parsers of this is near zero.
- there is no demonstrated application that needs it (there was a theoretical usecase where it might be needed, but there were better solutions proposed for this)

Can the patch please be removed?

Andrew Morton said he queued up a patch reversion that should take care of it. Christoph Lameter felt that the patch was quite salvageable, and didn't see why it should be reverted. Andrew replied, "If it's useful to application developers then fine. If it's only useful to kernel developers then the argument is weakened. However there's still quite a lot of development going on in this area, so there's still some argument for having the monitoring ability in the mainline tree." Christoph replied:

I still have a hard time to see how people can accept the line of reasoning that says:

Users are not allowed to know on which nodes the operating system allocated resources for a process and are also not allowed to see the memory policies in effect for the memory areas

Then the application developers have to guess the effect that the memory policies have on memory allocation. For memory alloc debugging the poor app guys must today simply imagine what the operating system is doing. They can see the amount of total memory allocated on a node via other proc entries and then guess based on that which application has taken it. Then they modify their apps and do another run.

My thinking today is that I'd rather leave /proc/<pid>/numa_stats instead of using smaps because the smaps format is a bit verbose and will make it difficult to see the allocation distribution. If we use smaps then we probably need some tool to parse and present information. numa_stats is directly usable.

I have a new series of patches here that does a gradual thing with the policy layer:

- Clean up policy layer to properly use node macros instead of bitmaps. Some comments to explain certain limitations of the policy layer.
- Clean up policy layer by doing do_xx and sys_xx separation [optional but this separates the dynamic bitmaps in user space from the static node maps in kernel space which I find very helpful]
- Add mpol_to_str to policy layer and make numa_stats use mpol_to_str.
- Solve the potential access issue when set_mempolicy is updating task->mempolicy while numa_stats are being displayed by taking a writelock on mmap_sem in set_mempolicy. This is in harmony with vma mempolicy updates that also take a lock on mmap_sem and that are already safe to access since numa_stats always takes an mmap_sem readlock. The patch is essentially inserting two lines.

Then I still have these evil intentions of making it possible to dynamically change memory policies from the outside. The minimum that we all need is to at least be able to see what's going on.

Of course we would be happier if we would also be allowed to change policies to control memory allocation. The argument that the layer is not able to handle these is of course true since attempts to fix the issues have been blocked.

Andrew began to be swayed by these arguments. He started to favor keeping the patch in, but the debate did not reach any firm conclusion during the thread.

Topics: MAINTAINERS File

Stefan Richter said, "The MAINTAINERS list of Linus' tree is still listing eth1394 and sbp2 as orphaned. This is certainly not correct for sbp2. Is it for eth1394?" He said to Ben Collins, "Ben, I remember you wanted to have your contact added back in, at least for sbp2. In case this should not be true anymore, I'd volunteer for sbp2 maintenance." Ben replied, "I sent a patch to Linus, but I guess it never got added. Stefan, feel free to send a patch adding you as the maintainer." Regarding eth1394, Jody McIntyre also said, "I emailed Steve Kinneberg, the last person to do any serious work on the driver, before I made this change, and he's OK with that. If someone else wants to take it, I suggest they submit a patch."

People: Michal Piotrowski

Jean Delvare asked:

Is there a place where pending -stable patches can be seen?

Are mails sent to stable@kernel archived somewhere?

There seems to be a need for this. For example, there's a patch I would like to see in 2.6.13.2, but I wouldn't want to report an already known problem.

Michal Piotrowski gave Jean a link to the stable queue shortlog, and Jean replied, "Exactly what I needed. It's bookmarked now. Thanks!"

Topics: FS: devfs, FS: sysfs, Hot-Plugging

Greg KH said:

I've released the 069 version of udev. It can be found at:
kernel.org/pub/linux/utils/kernel/hotplug/udev-069.tar.gz

udev allows users to have a dynamic /dev and provides the ability to have persistent device names. It uses sysfs and /sbin/hotplug and runs entirely in userspace. It requires a 2.6 kernel with CONFIG_HOTPLUG enabled to run. Please see the udev FAQ for any questions about it:
kernel.org/pub/linux/utils/kernel/hotplug/udev-FAQ

For any udev vs devfs questions anyone might have, please see:
kernel.org/pub/linux/utils/kernel/hotplug/udev_vs_devfs

And there is a general udev web page at:
http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html

Note, I _really_ recommend anyone running 2.6.13 or newer to upgrade to at least the 068 version of udev due to some very nice speed improvements (not to mention the fact that the 2.6.12 kernel requires at least the 058 version of udev.)

There have been lots of good bugfixes and new features added since the last time I announced a udev release, so see the RELEASE-NOTES file for details, and the changelog below.

udev uses git for its source code control system. The main udev git repo can be found at:
rsync://rsync.kernel.org/pub/scm/linux/hotplug/udev.git
and can be browsed online at:
http://www.kernel.org/git/?p=linux/hotplug/udev.git

Topics: FS: devfs, FS: sysfs, Hot-Plugging, Small Systems; People: Greg KH

Mike Bell said:

devfs vs udev: From the other side

Presuppositions (True of both udev and devfs):

Dynamic /dev is the way of the future, and a Good Thing
A single major/minor combination should have only a single device node, its other names should be symlinks. If you don't do this, you break locking on certain classes of applications, among other things.

The above are uncontentious as far as I know. I believe Greg KH has stated both. If you feel otherwise, please explain why.
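
The second presupposition (one real node, every other name a symlink) can be tried out in a scratch directory. This is a sketch with hypothetical helper names: an ordinary file stands in for the device node, since creating real device nodes needs root, and the directory and alias names are made up.

```python
import os
import tempfile


def make_aliases(dev_dir: str, node_name: str, aliases: list) -> None:
    """Create each alias as a relative symlink to the single real node."""
    for alias in aliases:
        os.symlink(node_name, os.path.join(dev_dir, alias))


def resolves_to_same_node(dev_dir: str, names: list) -> bool:
    """True if every name ends up at one and the same underlying file."""
    targets = {os.path.realpath(os.path.join(dev_dir, n)) for n in names}
    return len(targets) == 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as dev_dir:
        # Ordinary file used as a stand-in for the real device node.
        open(os.path.join(dev_dir, "sda"), "w").close()
        make_aliases(dev_dir, "sda", ["disk0", "backup-disk"])
        print(resolves_to_same_node(dev_dir, ["sda", "disk0", "backup-disk"]))
```

Because every alias resolves to the same file, a lock taken through one name is visible through the others, which is the property the presupposition is meant to protect.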

Differences:

devfs creates device nodes from kernel space, and creates symlinks for alternative names using a userspace helper. udev handles both tasks from user space, by exporting the information through a different kernel-generated filesystem.

devfs advantages over udev:

devfs is smaller
Hey, I ran the benchmarks, I have numbers, something Greg never gave. Took an actual devfs system of mine and disabled devfs from the kernel, then enabled hotplug and sysfs for udev to run. make clean and surprise surprise, kernel is much bigger. Enable netlink stuff and it's bigger still. udev is only smaller if like Greg you don't count its kernel components against it, even if they wouldn't otherwise need to be enabled. Difference is to the tune of 604164 on udev and 588466 on devfs. Maybe not a lot in some people's books, but a huge difference from the claims of other people that devfs is actually bigger.

And that's just the kernel. Then because your root is read-only you need an early userspace, and in regular userspace the udev binary, and its data files, all more wasted flash (you can shave it down by removing stuff you don't need, but that's just more work for the busy coder, and udev STILL loses on size).

On the system in question (a real-world embedded system) the devfs solution requires no userspace helper except for two symlinks which were simply created manually in init, and could have been done away with if necessary.

devfs is faster
Despite all the many tricks that can be used to speed up udev (static linking, netlink, etc) devfs is still dramatically faster. On a big, bloated, slow-booting distribution system you may not notice so much, but when even your slowest booting systems are interactive in under 5 seconds using devfs, this is quite a significant time loss.

devfs uses less memory
Check free. sysfs alone does udev in and that's just the kernel stuff that's always there.

Also, the user space stuff may not have to run at all times in all configurations, but on a system without swap and with long-running apps, all that matters is its PEAK memory usage. If my app takes x MB and my kernel takes y MB it doesn't MATTER that udev is only running for one second, I still need more than x+yMB of memory.
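
To put the two kernel-size figures quoted under "devfs is smaller" in perspective, the difference can be worked out directly. The numbers are the ones given in the text; the unit is not stated there (bytes would be a natural guess, but that is an assumption), and the helper name is hypothetical.

```python
UDEV_KERNEL_SIZE = 604164    # figure quoted for the udev configuration
DEVFS_KERNEL_SIZE = 588466   # figure quoted for the devfs configuration


def size_delta(bigger: int, smaller: int) -> tuple:
    """Return (absolute difference, difference as a percentage of `smaller`)."""
    diff = bigger - smaller
    percent = round(100.0 * diff / smaller, 2)
    return diff, percent


if __name__ == "__main__":
    diff, percent = size_delta(UDEV_KERNEL_SIZE, DEVFS_KERNEL_SIZE)
    print(f"udev configuration is larger by {diff} ({percent}%)")
```

The gap works out to 15698, or roughly 2.67% of the devfs figure.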

udev advantages over devfs:

udev has all sorts of spiffy features
Sure, but having device nodes exported directly from the kernel in no way stops you from having those spiffy features. The problem is that udev is doing two separate tasks, and it's easy to confuse the one it should be doing with the one it shouldn't.

udev doesn't have policy in kernel space
Well, that's a bit of a lie. sysfs has even stricter policy in kernel space. What he MEANS is that udev exchanges hard-coded but symlinkable /dev paths for hardcoded sysfs paths, moving the hard-coded kernel policy from one filesystem to another.

This argument is really the only architectural reason to go with udev. At all. If you really believe that the ability to name your hard drive /dev/foobarbaz is vital, and absolutely can't live with merely having /dev/foobarbaz be a symlink to the real device node, then you need udev. The devfs way of handling this situation was a stupid, racy misfeature and rightly deserves to die horribly.

That said, read my comments on why flexible /dev naming is actually a bad thing and think very, very carefully about whether you actually want this "feature" at all. Symlinks are your friend.

devfs is ugly
Part of this is true, and part of this is just the perspective of certain people (Greg has this fascinating world view where code required for devfs is garbage, and code required for udev is core kernel code and doesn't count against udev, which allows him to say udev is smaller.)

The legitimate comments about devfs being ugly... well, how many subsystems which have been largely untouched for similar periods of time aren't even uglier? TTY stuff? And it's very hard to find a maintainer for a subsystem when it's "obsolete", patches that change its behaviour aren't accepted, and certain people are so vocally opposed to its very existence. Who wants to throw away their time writing code that won't even be considered, only to be hated for it?

devfs is unsupported, udev isn't
True that. And even people who've tried to maintain devfs get turned away. So unless this document causes a few people to reexamine the need to remove devfs, you can reasonably assume that udev will be the only way to run a linux system very shortly (static /dev is already on its last legs). Me, I'll be disappointed if this happens, because as the above document indicates, I still think kernel-exported /dev is better (and not because I'm a lazy user-space-hater, Greg. :) ).

There was no real discussion in response to this. It looked as though a huge flamewar would erupt after the first few replies, but the thread petered out immediately and vanished.