<?xml version="1.0" ?>

<kc>

<title>Kernel Traffic</title>

<author contact="mailto:zbrown@tumblerings.org">Zack Brown</author>

<issue num="328" date="19 Sep 2005 00:00:00 -0800" />

<mailbox-stats>
	<global-stats>
		<generated-at>Fri Sep 16 18:56:36 2005</generated-at>
		<first-message>Mon Aug 29 20:42:16 2005</first-message>
		<last-message>Thu Sep 15 16:49:32 2005</last-message>
		<totals>
			<n-messages>2152</n-messages>
			<n-is-reply>1649</n-is-reply>
			<avg-resp-time>84:23:57</avg-resp-time>
			<n-pgp-signed>51</n-pgp-signed>
			<total-size>13MB</total-size>
			<avg-size>6KB</avg-size>
			<n-attachments>69</n-attachments>
			<att-size>644KB</att-size>
			<bussiest-day-in-n day="2005-09-09"><n-msgs>371</n-msgs><n-bytes>3MB</n-bytes></bussiest-day-in-n>
			<bussiest-day-in-bytes day="2005-09-09"><n-msgs>371</n-msgs><n-bytes>3MB</n-bytes></bussiest-day-in-bytes>
			<n-writers>665</n-writers>
			<wrote-more-then-1-message>260</wrote-more-then-1-message>
			<n-lines>229979</n-lines>
			<header-size>117922</header-size>
			<n-user-agents>66</n-user-agents>
			<n-organisations>41</n-organisations>
			<n-toplevel-domains>37</n-toplevel-domains>
			<avg-spam-score>-13.501859</avg-spam-score>
				<spammiest-writer><score>4.599000</score><name>igor&#32;filippenko</name></spammiest-writer>
		</totals>
		<averages>
			<lines-per-message>106</lines-per-message>
			<lines-per-header>54</lines-per-header>
			<header-percent-of-message>51.28%</header-percent-of-message>
			<header-percent-of-total>42.94%</header-percent-of-total>
			<line-length>31</line-length>
		</averages>
		<importance>
			<low>0.00%</low>
			<normal>0.84%</normal>
			<high>0.00%</high>
		</importance>

	</global-stats>
	<top-writers>
		<top-writer rank="1">
			<e-mail-addr>luben&#32;tuikov</e-mail-addr>
			<n-messages>85</n-messages>
			<avg-size>13KB</avg-size>
			<total-size>2MB</total-size>
			<mostly-written-at>14:50</mostly-written-at>
		</top-writer>
		<top-writer rank="2">
			<e-mail-addr>andi&#32;kleen</e-mail-addr>
			<n-messages>71</n-messages>
			<avg-size>4KB</avg-size>
			<total-size>278KB</total-size>
			<mostly-written-at>11:21</mostly-written-at>
		</top-writer>
		<top-writer rank="3">
			<e-mail-addr>akpm@osdl.org</e-mail-addr>
			<n-messages>61</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>248KB</total-size>
			<mostly-written-at>10:49</mostly-written-at>
		</top-writer>
		<top-writer rank="4">
			<e-mail-addr>sam&#32;ravnborg</e-mail-addr>
			<n-messages>58</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>329KB</total-size>
			<mostly-written-at>14:55</mostly-written-at>
		</top-writer>
		<top-writer rank="5">
			<e-mail-addr>john&#32;w.&#32;linville</e-mail-addr>
			<n-messages>47</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>229KB</total-size>
			<mostly-written-at>12:01</mostly-written-at>
		</top-writer>
	</top-writers>
	<top-subjects>
		<top-subject rank="1">
			<subject>rfc:&#32;i386:&#32;kill&#32;!4kstacks</subject>
			<n-messages>100</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>414KB</total-size>
			<mostly-written-at>14:13</mostly-written-at>
			<first-msg>1125642836</first-msg>
			<last-msg>1126634752</last-msg>
		</top-subject>
		<top-subject rank="2">
			<subject>[patch&#32;1/3]&#32;dynticks&#32;-&#32;implement&#32;no&#32;idle&#32;hz&#32;for&#32;x86</subject>
			<n-messages>61</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>373KB</total-size>
			<mostly-written-at>13:42</mostly-written-at>
			<first-msg>1125712583</first-msg>
			<last-msg>1126308647</last-msg>
		</top-subject>
		<top-subject rank="3">
			<subject>gfs,&#32;what's&#32;remaining</subject>
			<n-messages>56</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>227KB</total-size>
			<mostly-written-at>12:47</mostly-written-at>
			<first-msg>1125575979</first-msg>
			<last-msg>1126383089</last-msg>
		</top-subject>
		<top-subject rank="4">
			<subject>[linux-cluster]&#32;re:&#32;gfs,&#32;what's&#32;remaining</subject>
			<n-messages>44</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>192KB</total-size>
			<mostly-written-at>11:15</mostly-written-at>
			<first-msg>1125733594</first-msg>
			<last-msg>1126727194</last-msg>
		</top-subject>
		<top-subject rank="5">
			<subject>[patch&#32;2.6.13&#32;5/14]&#32;sas-class:&#32;sas_discover.c&#32;discover&#32;process</subject>
			<n-messages>39</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>245KB</total-size>
			<mostly-written-at>14:17</mostly-written-at>
			<first-msg>1126311083</first-msg>
			<last-msg>1126764279</last-msg>
		</top-subject>
	</top-subjects>
	<top-receivers>
		<top-receiver rank="1">
			<e-mail-addr>linux-kernel@vger.kernel.org</e-mail-addr>
			<n-messages>424</n-messages>
			<avg-size>8KB</avg-size>
			<total-size>4MB</total-size>
			<mostly-written-at>13:32</mostly-written-at>
		</top-receiver>
		<top-receiver rank="2">
			<e-mail-addr>akpm@osdl.org</e-mail-addr>
			<n-messages>187</n-messages>
			<avg-size>9KB</avg-size>
			<total-size>2MB</total-size>
			<mostly-written-at>12:41</mostly-written-at>
		</top-receiver>
		<top-receiver rank="3">
			<e-mail-addr>linus&#32;torvalds</e-mail-addr>
			<n-messages>163</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>1024KB</total-size>
			<mostly-written-at>12:59</mostly-written-at>
		</top-receiver>
		<top-receiver rank="4">
			<e-mail-addr>andi&#32;kleen</e-mail-addr>
			<n-messages>69</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>352KB</total-size>
			<mostly-written-at>12:29</mostly-written-at>
		</top-receiver>
		<top-receiver rank="5">
			<e-mail-addr>dmitry&#32;torokhov</e-mail-addr>
			<n-messages>38</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>225KB</total-size>
			<mostly-written-at>03:32</mostly-written-at>
		</top-receiver>
	</top-receivers>
	<top-ccers>
		<top-ccers rank="1">
			<e-mail-addr>linux-kernel@vger.kernel.org</e-mail-addr>
			<n-messages>942</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>6MB</total-size>
			<mostly-written-at>13:37</mostly-written-at>
		</top-ccers>
		<top-ccers rank="2">
			<e-mail-addr>akpm@osdl.org</e-mail-addr>
			<n-messages>219</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>2MB</total-size>
			<mostly-written-at>13:51</mostly-written-at>
		</top-ccers>
		<top-ccers rank="3">
			<e-mail-addr>linus&#32;torvalds</e-mail-addr>
			<n-messages>136</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>686KB</total-size>
			<mostly-written-at>13:33</mostly-written-at>
		</top-ccers>
		<top-ccers rank="4">
			<e-mail-addr>jeff&#32;garzik</e-mail-addr>
			<n-messages>60</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>276KB</total-size>
			<mostly-written-at>11:32</mostly-written-at>
		</top-ccers>
		<top-ccers rank="5">
			<e-mail-addr>andi&#32;kleen</e-mail-addr>
			<n-messages>59</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>284KB</total-size>
			<mostly-written-at>12:51</mostly-written-at>
		</top-ccers>
	</top-ccers>
	<top-level-domains>
		<tld rank="1">
			<name>com</name>
			<freq>959</freq>
			<avg-size>7KB</avg-size>
			<total-size>7MB</total-size>
		</tld>
		<tld rank="2">
			<name>org</name>
			<freq>377</freq>
			<avg-size>6KB</avg-size>
			<total-size>2MB</total-size>
		</tld>
		<tld rank="3">
			<name>de</name>
			<freq>182</freq>
			<avg-size>5KB</avg-size>
			<total-size>846KB</total-size>
		</tld>
		<tld rank="4">
			<name>uk</name>
			<freq>151</freq>
			<avg-size>5KB</avg-size>
			<total-size>711KB</total-size>
		</tld>
		<tld rank="5">
			<name>net</name>
			<freq>142</freq>
			<avg-size>7KB</avg-size>
			<total-size>877KB</total-size>
		</tld>
	</top-level-domains>
	<top-timezones>
		<tz rank="1">
			<name>+0200</name>
			<freq>595</freq>
			<avg-size>7KB</avg-size>
			<total-size>4MB</total-size>
		</tz>
		<tz rank="2">
			<name>-0700</name>
			<freq>482</freq>
			<avg-size>6KB</avg-size>
			<total-size>3MB</total-size>
		</tz>
		<tz rank="3">
			<name>-0400</name>
			<freq>403</freq>
			<avg-size>7KB</avg-size>
			<total-size>3MB</total-size>
		</tz>
		<tz rank="4">
			<name>+0100</name>
			<freq>231</freq>
			<avg-size>5KB</avg-size>
			<total-size>2MB</total-size>
		</tz>
		<tz rank="5">
			<name>-0500</name>
			<freq>76</freq>
			<avg-size>8KB</avg-size>
			<total-size>537KB</total-size>
		</tz>
	</top-timezones>
	<top-organisations>
		<org rank="1">
			<name>sgi</name>
			<freq>23</freq>
			<bytes>90KB</bytes>
		</org>
		<org rank="2">
			<name>computing&#32;service,&#32;university&#32;of&#32;cambridge,&#32;uk</name>
			<freq>8</freq>
			<bytes>40KB</bytes>
		</org>
		<org rank="3">
			<name>ypo4</name>
			<freq>7</freq>
			<bytes>43KB</bytes>
		</org>
		<org rank="4">
			<name>http://bugsplatter.mine.nu/</name>
			<freq>6</freq>
			<bytes>23KB</bytes>
		</org>
		<org rank="5">
			<name>home</name>
			<freq>4</freq>
			<bytes>20KB</bytes>
		</org>
	</top-organisations>
	<top-user-agents>
		<useragent rank="1">
			<name>mozilla</name>
			<freq>49</freq>
			<bytes>3MB</bytes>
		</useragent>
		<useragent rank="2">
			<name>evolution</name>
			<freq>38</freq>
			<bytes>1016KB</bytes>
		</useragent>
		<useragent rank="3">
			<name>mutt/1.5.9i</name>
			<freq>36</freq>
			<bytes>569KB</bytes>
		</useragent>
		<useragent rank="4">
			<name>mozilla/5.0</name>
			<freq>26</freq>
			<bytes>480KB</bytes>
		</useragent>
		<useragent rank="5">
			<name>mutt/1.5.10i</name>
			<freq>23</freq>
			<bytes>626KB</bytes>
		</useragent>
	</top-user-agents>
	<messages-per-day>
		<Sunday><msgs>268</msgs><bytes>2MB</bytes></Sunday>
		<Monday><msgs>428</msgs><bytes>3MB</bytes></Monday>
		<Tuesday><msgs>352</msgs><bytes>3MB</bytes></Tuesday>
		<Wednesday><msgs>287</msgs><bytes>2MB</bytes></Wednesday>
		<Thursday><msgs>126</msgs><bytes>774KB</bytes></Thursday>
		<Friday><msgs>395</msgs><bytes>3MB</bytes></Friday>
		<Saturday><msgs>270</msgs><bytes>2MB</bytes></Saturday>
	</messages-per-day>
	<messages-per-month>
		<Jan><msgs>0</msgs><bytes>0</bytes></Jan>
		<Feb><msgs>0</msgs><bytes>0</bytes></Feb>
		<Mar><msgs>0</msgs><bytes>0</bytes></Mar>
		<Apr><msgs>0</msgs><bytes>0</bytes></Apr>
		<May><msgs>0</msgs><bytes>0</bytes></May>
		<Jun><msgs>0</msgs><bytes>0</bytes></Jun>
		<Jul><msgs>0</msgs><bytes>0</bytes></Jul>
		<Aug><msgs>9</msgs><bytes>91KB</bytes></Aug>
		<Sep><msgs>2117</msgs><bytes>13MB</bytes></Sep>
		<Oct><msgs>0</msgs><bytes>0</bytes></Oct>
		<Nov><msgs>0</msgs><bytes>0</bytes></Nov>
		<Dec><msgs>0</msgs><bytes>0</bytes></Dec>
	</messages-per-month>
	<messages-per-day-of-month>
		<day-1><msgs>23</msgs><bytes>89KB</bytes></day-1>
		<day-2><msgs>24</msgs><bytes>119KB</bytes></day-2>
		<day-3><msgs>48</msgs><bytes>248KB</bytes></day-3>
		<day-4><msgs>69</msgs><bytes>283KB</bytes></day-4>
		<day-5><msgs>78</msgs><bytes>345KB</bytes></day-5>
		<day-6><msgs>52</msgs><bytes>261KB</bytes></day-6>
		<day-7><msgs>56</msgs><bytes>281KB</bytes></day-7>
		<day-8><msgs>82</msgs><bytes>565KB</bytes></day-8>
		<day-9><msgs>371</msgs><bytes>3MB</bytes></day-9>
		<day-10><msgs>222</msgs><bytes>2MB</bytes></day-10>
		<day-11><msgs>199</msgs><bytes>2MB</bytes></day-11>
		<day-12><msgs>349</msgs><bytes>2MB</bytes></day-12>
		<day-13><msgs>300</msgs><bytes>2MB</bytes></day-13>
		<day-14><msgs>223</msgs><bytes>2MB</bytes></day-14>
		<day-15><msgs>21</msgs><bytes>121KB</bytes></day-15>
		<day-16><msgs>0</msgs><bytes>0</bytes></day-16>
		<day-17><msgs>0</msgs><bytes>0</bytes></day-17>
		<day-18><msgs>0</msgs><bytes>0</bytes></day-18>
		<day-19><msgs>0</msgs><bytes>0</bytes></day-19>
		<day-20><msgs>0</msgs><bytes>0</bytes></day-20>
		<day-21><msgs>0</msgs><bytes>0</bytes></day-21>
		<day-22><msgs>0</msgs><bytes>0</bytes></day-22>
		<day-23><msgs>0</msgs><bytes>0</bytes></day-23>
		<day-24><msgs>0</msgs><bytes>0</bytes></day-24>
		<day-25><msgs>0</msgs><bytes>0</bytes></day-25>
		<day-26><msgs>0</msgs><bytes>0</bytes></day-26>
		<day-27><msgs>0</msgs><bytes>0</bytes></day-27>
		<day-28><msgs>0</msgs><bytes>0</bytes></day-28>
		<day-29><msgs>1</msgs><bytes>28KB</bytes></day-29>
		<day-30><msgs>0</msgs><bytes>0</bytes></day-30>
		<day-31><msgs>8</msgs><bytes>63KB</bytes></day-31>
	</messages-per-day-of-month>
	<messages-per-hour>
		<hour-1><msgs>64</msgs><bytes>415KB</bytes></hour-1>
		<hour-2><msgs>53</msgs><bytes>380KB</bytes></hour-2>
		<hour-3><msgs>24</msgs><bytes>148KB</bytes></hour-3>
		<hour-4><msgs>15</msgs><bytes>59KB</bytes></hour-4>
		<hour-5><msgs>13</msgs><bytes>50KB</bytes></hour-5>
		<hour-6><msgs>18</msgs><bytes>83KB</bytes></hour-6>
		<hour-7><msgs>28</msgs><bytes>109KB</bytes></hour-7>
		<hour-8><msgs>62</msgs><bytes>321KB</bytes></hour-8>
		<hour-9><msgs>120</msgs><bytes>708KB</bytes></hour-9>
		<hour-10><msgs>185</msgs><bytes>2MB</bytes></hour-10>
		<hour-11><msgs>122</msgs><bytes>675KB</bytes></hour-11>
		<hour-12><msgs>96</msgs><bytes>446KB</bytes></hour-12>
		<hour-13><msgs>118</msgs><bytes>605KB</bytes></hour-13>
		<hour-14><msgs>136</msgs><bytes>704KB</bytes></hour-14>
		<hour-15><msgs>172</msgs><bytes>2MB</bytes></hour-15>
		<hour-16><msgs>139</msgs><bytes>808KB</bytes></hour-16>
		<hour-17><msgs>122</msgs><bytes>905KB</bytes></hour-17>
		<hour-18><msgs>124</msgs><bytes>844KB</bytes></hour-18>
		<hour-19><msgs>84</msgs><bytes>435KB</bytes></hour-19>
		<hour-20><msgs>80</msgs><bytes>530KB</bytes></hour-20>
		<hour-21><msgs>56</msgs><bytes>279KB</bytes></hour-21>
		<hour-22><msgs>107</msgs><bytes>656KB</bytes></hour-22>
		<hour-23><msgs>72</msgs><bytes>368KB</bytes></hour-23>
	</messages-per-hour>
	<urls>
		<url-1><freq>1756</freq><url>http://vger.kernel.org/majordomo-info.html</url></url-1>
		<url-2><freq>1754</freq><url>http://www.tux.org/lkml/</url></url-2>
		<url-3><freq>16</freq><url>http://yahoo.shaadi.com</url></url-3>
		<url-4><freq>8</freq><url>http://mail.yahoo.com</url></url-4>
		<url-5><freq>8</freq><url>http://lkdp.blogspot.com/</url></url-5>
	</urls>
	<top-avg-resp>
		<resp-pers rank="1">
			<name>lukasz&#32;kosewski</name>
			<avg-resp-time>00:00:47</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="2">
			<name>coywolf&#32;qi&#32;hunt</name>
			<avg-resp-time>00:01:14</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="3">
			<name>erik&#32;slagter</name>
			<avg-resp-time>00:04:26</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="4">
			<name>andreas&#32;koch</name>
			<avg-resp-time>00:07:08</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="5">
			<name>ben&#32;greear</name>
			<avg-resp-time>00:08:15</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="6">
			<name>john&#32;stultz</name>
			<avg-resp-time>00:11:18</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
	</top-avg-resp>
	<created-with><name>mboxstats</name><version>2.8</version><developer>folkert@vanheusden.com</developer><url>http://www.vanheusden.com/mboxstats/</url></created-with>
</mailbox-stats>

<section
  title="Status Of Merging GFS2 Into Mainline"
  subject="GFS, what's remaining"
  archive="http://lkml.org/lkml/2005/9/1/67"
  posts="110"
  startdate="01 Sep 2005 02:42:49 -0800"
  enddate="14 Sep 2005 01:46:34 -0800">
<topic>Disk Arrays: LVM</topic>
<topic>FS: NFS</topic>
<topic>FS: NTFS</topic>
<topic>FS: ReiserFS</topic>
<topic>FS: XFS</topic>
<topic>FS: ext2</topic>
<topic>FS: ext3</topic>
<topic>FS: sysfs</topic>
<topic>Ioctls</topic>
<topic>POSIX</topic>

<mention>Arjan van de Ven</mention>
<mention>Pekka Enberg</mention>

<p>David Teigland said:</p>

<quote who="David Teigland">

<p>this is the latest set of gfs patches, it includes some minor munging
since the previous set.  Andrew, could this be added to -mm? there's not
much in the way of pending changes.</p>

<p><a href="http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch">http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch</a><br />
<a href="http://redhat.com/~teigland/gfs2/20050901/broken-out/">http://redhat.com/~teigland/gfs2/20050901/broken-out/</a></p>

<p>I'd like to get a list of specific things remaining for merging.  I
believe we've responded to everything from earlier reviews, they were very
helpful and more would be excellent.  The list begins with one item from
before that's still pending:</p>

<ul>

<li>Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists.
  [cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.]</li>

</ul>

</quote>

<p>Arjan van de Ven offered concrete criticisms of the patches themselves,
pointing out races and other problems, which David and others then discussed.
Elsewhere, Pekka Enberg pointed out that the need to walk VMA lists wasn't
just a case of "some not liking it", but would actually prevent GFS from
working properly with other clustered filesystems. Daniel Phillips also put
the whole prospect of merging GFS into mainline in perspective, saying:</p>

<quote who="Daniel Phillips">

<p>Where are the benchmarks and stability analysis?  How many hours does it
survive cerberos running on all nodes simultaneously?  Where are the
testimonials from users?  How long has there been a gfs2 filesystem?  Note
that Reiser4 is still not in mainline a year after it was first offered, why
do you think gfs2 should be in mainline after one month?</p>

<p>So far, all catches are surface things like bogus spinlocks.  Substantive
issues have not even begun to be addressed.  Patience please, this is going
to take a while.</p>

</quote>

<p>Andrew Morton also asked for answers to a few basic questions:</p>

<quote who="Andrew Morton">

<p>I don't recall seeing much discussion or exposition of</p>

<ul>

<li>Why the kernel needs two clustered filesystems</li>

<li>Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
  possibly gain (or vice versa)</li>

<li>Relative merits of the two offerings</li>

</ul>

</quote>

<p>Alan Cox pointed out what he felt was a simple answer to all these
questions: <quote who="Alan Cox">people actively use it and have been for
some years. Same reason we have NTFS, HPFS, and all the others. On that
alone it makes sense to include.</quote> But Christoph Hellwig remarked,
<quote who="Christoph Hellwig">That's GFS. The submission is about a GFS2
that's on-disk incompatible to GFS.</quote> Alan replied:</p>

<quote who="Alan Cox">

<p>Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3
then. I think the main point still stands - we have always taken
multiple file systems on board and we have benefitted enormously from
having the competition between them instead of a dictat from the kernel
kremlin that 'foofs is the one true way'</p>

<p>Competition will decide if OCFS or GFS is better, or indeed if someone
comes along with another contender that is better still. And competition
will probably get the answer right.</p>

<p>The only thing that is important is we don't end up with each cluster fs
wanting different core VFS interfaces added.</p>

</quote>

<p>Lars Marowsky-Bree was not as trusting in the virtues of competition,
pointing out that <quote who="Lars Marowsky-Bree">Competition will come up
with the same situation like reiserfs and ext3 and XFS, namely that they'll
all be maintained going forward because of, uhm, political constraints
;-)</quote> But he also affirmed, <quote who="Lars Marowsky-Bree">as long
as they _are_ maintained and play along nicely with each other (which, btw,
is needed already so that at least data can be migrated...), I don't really
see a problem of having two or three.</quote> He also agreed that requiring
different core VFS interfaces would be unacceptable.</p>

<p>Andrew reiterated his question, saying he was looking for technical
reasons in favor of inclusion. David offered:</p>

<quote who="David Teigland">

<p>GFS is an established fs, it's not going away, you'd be hard pressed to
find a more widely used cluster fs on Linux. GFS is about 10 years old and
has been in use by customers in production environments for about 5 years.
It is a mature, stable file system with many features that have been
technically refined over years of experience and customer/user feedback.
The latest development cycle (GFS2) has focussed on improving performance,
it's not a new file system -- the "2" indicates that it's not ondisk compatible
with earlier versions.</p>

<p>OCFS2 is a new file system. I expect they'll want to optimize for their
own unique goals. When OCFS appeared everyone I know accepted it would
coexist with GFS, each in their niche like every other fs. That's good,
OCFS and GFS help each other technically even though they may eventually
compete in some areas (which can also be good.)</p>

<p>Here's a random summary of technical features:</p>

<ul>

<li>cluster infrastructure: a lot of work, perhaps as much as gfs itself,
has gone into the infrastructure surrounding and supporting gfs</li>

<li>cluster infrastructure allows for easy cooperation with CLVM</li>

<li>interchangeable lock/cluster modules: gfs interacts with the external
infrastructure, including lock manager, through an interchangeable module
allowing the fs to be adapted to different environments.</li>

<li>a "nolock" module can be plugged in to use gfs as a local fs (can be
selected at mount time, so any fs can be mounted locally)</li>

<li>quotas, acls, cluster flocks, direct io, data journaling, ordered/writeback
journaling modes -- all supported</li>

<li>gfs transparently switches to a different locking scheme for direct io
allowing parallel non-allocating writes with no lock contention</li>

<li>posix locks -- supported, although it's being reworked for better
performance right now</li>

<li>asynchronous locking, lock prefetching + read-ahead</li>

<li>coherent shared-writeable memory mappings across the cluster</li>

<li>nfs3 support (multiple nfs servers exporting one gfs is very common)</li>

<li>extend fs online, add journals online</li>

<li>full fs quiesce to allow for block level snapshot below gfs</li>

<li>read-only mount</li>

<li>"spectator" mount (like ro but no journal allocated for the mount,
no fencing needed for failed node that was mounted as spectator)</li>

<li>infrastructure in place for live ondisk inode migration, fs shrink</li>

<li>stuffed dinodes, small files are stored in the disk inode block</li>

<li>tunable (fuzzy) atime updates</li>

<li>fast, nondisruptive stat on files during non-allocating direct-io</li>

<li>fast, nondisruptive statfs (df) even during heavy fs usage</li>

<li>friendly handling of io errors: shut down fs and withdraw from cluster</li>

<li>largest GFS cluster deployed was around 200 nodes, most are much
smaller</li>

<li>use many GFS file systems at once on a node and in a cluster</li>

<li>customers use GFS for: scientific apps, HA, NFS serving, database,
others I'm sure</li>

<li>graphical management tools for gfs, clvm, and the cluster infrastructure
exist and are improving quickly</li>

</ul>

</quote>

<p>Arjan short-circuited any discussion of these particular features, pointing
out that David's description referred to GFS, not to GFS2, which, as others
had already pointed out, was not on-disk compatible with it. David replied:</p>

<quote who="David Teigland">

<p>Just a new version, not a big difference. The ondisk format changed a
little making it incompatible with the previous versions. We'd been holding
out on the format change for a long time and thought now would be a sensible
time to finally do it.</p>

<p>This is also about timing things conveniently. Each GFS version coincides
with a development cycle and we decided to wait for this version/cycle
to move code upstream. So, we have new version, format change, and code
upstream all together, but it's still the same GFS to us.</p>

<p>As with _any_ new version (involving ondisk formats or not) we need to
thoroughly test everything to fix the inevitable bugs and regressions that
are introduced, there's nothing new or surprising about that.</p>

<p>About the name -- we need to support customers running both versions for
a long time. The "2" was added to make that process a little easier and
clearer for people, that's all. If the 2 is really distressing we could
rip it off, but there seems to be as many file systems ending in digits than
not these days...</p>

</quote>

<p>Daniel asked what the on-disk format change involved, but got no reply.
Elsewhere, several people made serious attempts to answer Andrew's request
for technical reasons for or against inclusion. Andi Kleen started that
branch of the discussion, saying to Andrew:</p>

<quote who="Andi Kleen">

<p>There seems to be clearly a need for a shared-storage fs of some sort for
HA clusters and virtualized usage (multiple guests sharing a partition).
Shared storage can be more efficient than network file systems like NFS
because the storage access is often more efficient than network access and
it is more reliable because it doesn't have a single point of failure in
form of the NFS server.</p>

<p>It's also a logical extension of the "failover on failure" clusters many
people run now - instead of only failing over the shared fs at failure and
keeping one machine idle the load can be balanced between multiple machines
at any time.</p>

<p>One argument to merge both might be that nobody really knows yet which
shared-storage file system (GFS or OCFS2) is better. The only way to find
out would be to let the user base try out both, and that's most practical
when they're merged.</p>

<p>Personally I think ocfs2 has nicer &amp; cleaner code than GFS. It seems
to be more or less a 64bit ext3 with cluster support, while GFS seems to
reinvent a lot more things and has somewhat uglier code. On the other hand
GFS' cluster support seems to be more aimed at being a universal cluster
service open for other usages too, which might be a good thing. OCFS2s
cluster seems to be more aimed at only serving the file system.</p>

<p>But which one works better in practice is really an open question.</p>

<p>The only thing that should be probably resolved is a common API for at
least the clustered lock manager. Having multiple incompatible user space
APIs for that would be sad.</p>

</quote>

<p>What Andi called a "clustered lock manager" is more commonly known as a
"distributed lock manager", or DLM. That term was used for the rest of the
discussion, and the DLM became its primary focus. In this light, Daniel
Phillips replied to Andi:</p>

<quote who="Daniel Phillips">

<p>The only current users of dlms are cluster filesystems. There are zero
users of the userspace dlm api. Therefore, the (g)dlm userspace interface
actually has nothing to do with the needs of gfs. It should be taken out
the gfs patch and merged later, when or if user space applications emerge
that need it. Maybe in the meantime it will be possible to come up with a
userspace dlm api that isn't completely repulsive.</p>

<p>Also, note that the only reason the two current dlms are in-kernel is
because it supposedly cuts down on userspace-kernel communication with the
cluster filesystems. Then why should a userspace application bother with
an awkward interface to an in-kernel dlm? This is obviously suboptimal.
Why not have a userspace dlm for userspace apps, if indeed there are any
userspace apps that would need to use dlm-style synchronization instead
of more typical socket-based synchronization, or Posix locking, which is
already exposed via a standard api?</p>

<p>There is actually nothing wrong with having multiple, completely different
dlms active at the same time. There is no urgent need to merge them into
the one true dlm. It would be a lot better to let them evolve separately
and pick the winner a year or two from now. Just think of the dlm as part
of the cfs until then.</p>

<p>What does have to be resolved is a common API for node management. It is
not just cluster filesystems and their lock managers that have to interface
to node management. Below the filesystem layer, cluster block devices and
cluster volume management need to be coordinated by the same system, and
above the filesystem layer, applications also need to be hooked into it.
This work is, in a word, incomplete.</p>

</quote>

<p>Nearby, Mark Fasheh also said to Andi, <quote who="Mark Fasheh">As
far as userspace dlm apis go, dlmfs already abstracts away a large part of
the dlm interaction, so writing a module against another dlm looks like it
wouldn't be too bad (startup of a lockspace is probably the most difficult part
there).</quote> Daniel asked why sysfs would not work just as well for this,
and Wim Coekaerts replied cryptically that the two were totally different.
Daniel replied:</p>

<quote who="Daniel Phillips">

<p>You create a dlm domain when a directory is created. You create a lock
resource when a file of that name is opened. You lock the resource when the
file is opened. You access the lvb by read/writing the file. Why doesn't
that fit the configfs-nee-sysfs model? If it does, the payoff will be about
500 lines saved.</p>

<p>This little dlm fs is very slick, but grossly inefficient. Maybe efficiency
doesn't matter here since it is just your slow-path userspace tools taking
these locks. Please do not even think of proposing this as a way to export
a kernel-based dlm for general purpose use!</p>

<p>Your userdlm.c file has some hidden gold in it. You have factored
the dlm calls far more attractively than the bad old bazillion-parameter
Vaxcluster legacy. You are almost in system call zone there. (But note my
earlier comment on dlms in general: until there are dlm-based applications,
merging a general-purpose dlm API is pointless and has nothing to do with
getting your filesystem merged.)</p>

</quote>

<p>Andrew agreed that <quote who="Andrew Morton">Daniel is asking a
legitimate question.</quote> He went on, <quote who="Andrew Morton">If
there's duplicated code in there then we should seek to either make the
code multi-purpose or place the common or reusable parts into a library
somewhere. If neither approach is applicable or practical for *every single
function* then fine, please explain why. AFAIR that has not been done.</quote>
Joel Becker replied:</p>

<quote who="Joel Becker">

<p>Regarding sysfs and configfs, that's a whole 'nother conversation.
I've not yet come up with a function involved that is identical, but that's
a response here for another email.</p>

<p>Understanding that Daniel is talking about dlmfs, dlmfs is far more
similar to devptsfs, tmpfs, and even sockfs and pipefs than it is to sysfs.
I don't see him proposing that sockfs and devptsfs be folded into sysfs.</p>

<p>dlmfs is *tiny*. The VFS interface is less than his claimed 500 lines
of savings. The few VFS callbacks do nothing but call DLM functions.
You'd have to replace this VFS glue with sysfs glue, and probably save very
few lines of code.</p>

<p>In addition, sysfs cannot support the dlmfs model. In dlmfs, mkdir(2)
creates a directory representing a DLM domain and mknod(2) creates the
user representation of a lock. sysfs doesn't support mkdir(2) or mknod(2)
at all.</p>

<p>More than mkdir() and mknod(), however, dlmfs uses open(2) to acquire locks
from userspace. O_RDONLY acquires a shared read lock (PR in VMS parlance).
O_RDWR gets an exclusive lock (X). O_NONBLOCK is a trylock. Here, dlmfs
is using the VFS for complete lifetiming. A lock is released via close(2).
If a process dies, close(2) happens. In other words, -&gt;release() handles
all the cleanup for normal and abnormal termination.</p>

<p>sysfs does not allow hooking into -&gt;open() or -&gt;release(). So this
model, and the inherent lifetiming that comes with it, cannot be used.
If dlmfs was changed to use a less intuitive model that fits sysfs, all the
handling of lifetimes and cleanup would have to be added. This would make
it more complex, not less complex. It would give it a larger code size, not
a smaller one. In the end, it would be harder to maintain, less intuitive
to use, and larger.</p>

</quote>
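<p>The dlmfs model Joel describes maps each lock onto ordinary file calls. Below is a minimal userspace sketch, not code from ocfs2. It assumes dlmfs is mounted and that the caller passes the path of an existing lock file inside a domain directory. The path and the helper names <code>dlmfs_lock()</code> and <code>dlmfs_unlock()</code> are hypothetical. The sketch shows only the open-flag mapping and the release-on-close behaviour Joel lists.</p>

<pre><![CDATA[
```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch of the dlmfs lock model described above (hypothetical helpers):
 *   O_RDONLY   -> shared read lock (PR)
 *   O_RDWR     -> exclusive lock (EX)
 *   O_NONBLOCK -> trylock instead of waiting
 * The returned descriptor is the lock handle; close(2) releases it, and the
 * kernel's ->release() also runs if the process dies while holding it.
 */
int dlmfs_lock(const char *lock_path, int exclusive, int trylock)
{
    int flags = exclusive ? O_RDWR : O_RDONLY;

    if (trylock)
        flags |= O_NONBLOCK;

    int fd = open(lock_path, flags);
    if (fd < 0)
        perror(lock_path);   /* missing lock file, failed trylock, etc. */
    return fd;               /* -1 on failure, otherwise the lock handle */
}

/* Releasing the lock is just closing the descriptor. */
int dlmfs_unlock(int fd)
{
    return close(fd);
}
```
]]></pre>

<p>A caller would take an exclusive trylock with something like <code>dlmfs_lock("/dlm/mydomain/mylock", 1, 1)</code> (a hypothetical path) and drop it with <code>dlmfs_unlock()</code>. If the process is killed first, the kernel closes the descriptor and the lock goes with it. This is the lifetiming that Joel says sysfs cannot provide.</p>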

<p>The DLM debate and its relationship to GFS acceptance became very technical
and branched into many tendrils of discussion that did not lead to any clear
conclusion, even though Andrew was a very active participant in steering the
discussion. The closest thing to a decision that came out of the discussion
came when David, who'd opened the whole discussion, said that GFS depended
on the full DLM API, and would find it impractical to rely on anything
else. He said, <quote who="David Teigland">We export our full dlm API through
read/write/poll on a misc device. All user space apps use the dlm through a
library as you'd expect. The library communicates with the dlm_device kernel
module through read/write/poll and the dlm_device module talks with the
actual dlm: linux/drivers/dlm/device.c If there's a better way to do this,
via a pseudo fs or not, we'd be pleased to try it.</quote> Andrew replied,
<quote who="Andrew Morton">inotify did that for a while, but we ended up going
with a straight syscall interface. How fat is the dlm interface? ie: how many
syscalls would it take?</quote> David replied that only 4 functions would be
needed: <code>create_lockspace()</code>, <code>release_lockspace()</code>,
<code>lock()</code>, and <code>unlock()</code>. Kurt C. Hackel from Oracle
replied:</p>

<quote who="Kurt C. Hackel">

<p>FWIW, it looks like we can agree on the core interface.  ocfs2_dlm
exports essentially the same functions:</p>

<p>    <code>dlm_register_domain()<br />
    dlm_unregister_domain()<br />
    dlmlock()<br />
    dlmunlock()</code></p>

<p>I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes).  There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.</p>

<p>There are quite a few other functions in the "full" spec(1) that we
didn't even attempt, either because we didn't require direct
user&lt;-&gt;kernel access or we just didn't need the function.  As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.</p>

<p>Looking at the misc device that gfs uses, it seems like there is pretty
much a complete interface to the same calls you have in kernel, validated
on the write() calls to the misc device.  With dlmfs, we were seeking to
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the
VMS/opendlm model).</p>

</quote>

<p>Andrew liked the idea of needing only four syscalls, saying, <quote who="Andrew
Morton">Neat. I'd be inclined to make them syscalls then. I don't suppose
anyone is likely to object if we reserve those slots.</quote> Daniel
cautioned that the function parameters might be a bit ugly, but David said
it was likely there would be no more than 2 or 3 for any of them. But Alan
Cox spoke out vehemently against this whole course of action. He said:</p>

<quote who="Alan Cox">

<p>If the locks are not file descriptors then answer the following:</p>

<ul>

<li>How are they ref counted</li>

<li>What are the cleanup semantics</li>

<li>How do I pass a lock between processes (AF_UNIX sockets won't work now)</li>

<li>How do I poll on a lock coming free.</li>

<li>What are the semantics of lock ownership</li>

<li>What rules apply for inheritance</li>

<li>How do I access a lock across threads.</li>

<li>What is the permission model.</li>

<li>How do I attach audit to it</li>

<li>How do I write SELinux rules for it</li>

<li>How do I use mount to make namespaces appear in multiple vservers</li>

</ul>

<p>and that's for starters...</p>

<p>Every so often someone decides that a deeply un-unix interface with new
syscalls is a good idea. Every time history proves them totally bonkers.
There are cases for new system calls but this doesn't seem one of them.</p>

<p>Look at system 5 shared memory, look at system 5 ipc, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.</p>

<p>All our existing locking uses the following behaviour</p>

<pre>        fd = open(namespace, options)
        fcntl(.. lock ...)
        blah
        flush
        fcntl(.. unlock ...)
        close</pre>

<p>Unfortunately some people here seem to have forgotten WHY we do things
this way.</p>

<ol>

<li>The semantics of file descriptors are well understood by users and by programs. That makes programming easier and keeps code size down</li>

<li>Everyone knows how close() works including across fork</li>

<li>FD passing is an obscure art but understood and just works</li>

<li>Poll() is a standard understood interface</li>

<li>Ownership of files is a standard model</li>

<li>FD passing across fork/exec is controlled in a standard way</li>

<li>The semantics for threaded applications are defined</li>

<li>Permissions are a standard model</li>

<li>Audit just works with the same tools</li>

<li>SELinux just works with the same tools</li>

<li>I don't need specialist applications to see the system state (the whole point of sysfs yet someone wants to break it all again)</li>

<li>fcntl fd locking is a posix standard interface with precisely defined semantics. Our extensions including leases are very powerful</li>

<li>And yes - fcntl fd locking supports mandatory locking too. That also is standards based with precise semantics.</li>

</ol>

<p>Everyone understands how to use the existing locking operations. So if
you use the existing interfaces with some small extensions if necessary
everyone understands how to use cluster locks. Isn't that neat....</p>

</quote>
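<p>Alan's open/fcntl/close outline is ordinary POSIX record locking. Below is a runnable local-file version of that sequence. It uses only the standard <code>fcntl(2)</code> commands <code>F_SETLKW</code> and <code>F_UNLCK</code>. The helper name <code>locked_append()</code> is hypothetical. The sketch shows the single-node interface Alan wants cluster locks to resemble, not any cluster lock manager.</p>

<pre><![CDATA[
```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Lock or unlock the whole file with a POSIX record lock. */
static int set_lock(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;         /* F_RDLCK, F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;             /* 0 means "to end of file": the whole file */
    return fcntl(fd, F_SETLKW, &fl);   /* F_SETLKW waits for the lock */
}

/*
 * fd = open(namespace, options); fcntl(lock); work; flush; fcntl(unlock); close
 * Returns 0 on success, -1 on any failure.
 */
int locked_append(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (set_lock(fd, F_WRLCK) == 0) {
        size_t len = strlen(data);
        /* "blah" is the write, "flush" is the fsync */
        if (write(fd, data, len) == (ssize_t)len && fsync(fd) == 0)
            rc = 0;
        set_lock(fd, F_UNLCK);
    }
    close(fd);                /* closing also drops this process's locks on the file */
    return rc;
}
```
]]></pre>

<p>The lock is tied to the open file, so closing the descriptor or exiting the process releases it without any extra cleanup code. Alan's list of properties depends on this.</p>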

<p>Andrew disagreed that the new syscalls would be such grave violations. He
pointed out that <quote who="Andrew Morton">David said that "We export our
full dlm API through read/write/poll on a misc device.". That miscdevice
will simply give us an fd. Hence my suggestion that the miscdevice be done
away with in favour of a dedicated syscall which returns an fd.</quote>
Alan didn't reply.</p>

<p>At right around this point, Patrick Caulfield got home from vacation, and
threw out his take on things:</p>

<quote who="Patrick Caulfield">

<p>let me tell you what we do now and why and let's see what's wrong with
it.</p>

<p>Currently the library create_lockspace() call returns an FD upon which
all lock operations happen. The FD is onto a misc device, one per lockspace,
so if you want lockspace protection it can happen at that level. There is no
protection applied to locks within a lockspace nor do I think it's helpful
to do so to be honest. Using a misc device limits you to &lt;255 lockspaces
depending on the other uses of misc but this is just for userland-visible
lockspace - it does not affect GFS filesystems for instance.</p>

<p>Lock/convert/unlock operations are done using write calls on that
lockspace FD. Callbacks are implemented using poll and read on the FD,
read will return data blocks (one per callback) as long as there are active
callbacks to process. The current read functionality behaves more like a
SOCK_PACKET than a data stream which some may not like but then you're going
to need to know what you're reading from the device anyway.</p>

<p>ioctl/fcntl isn't really useful for DLM locks because you can't do
asynchronous operations on them - the lock has to succeed or fail in the one
operation - if you want a callback for completion (or blocking notification)
you have to poll the lockspace FD anyway and then you might as well go back
to using read and write because at least they are something of a matched
pair. Something similar applies, I think, to a syscall interface.</p>

<p>Another reason the existing fcntl interface isn't appropriate is that
it's not locking the same kind of thing. Current Unix fcntl calls lock
byte ranges. DLM locks arbitrary names and has a much richer list of lock
modes. Adding another fcntl just runs into the problems mentioned above.</p>

<p>The other reason we use read for callbacks is that there is information to
be passed back: lock status, value block and (possibly) query information.</p>

<p>While having an FD per lock sounds like a nice unixy idea I don't think it
would work very well in practice. Applications with hundreds or thousands of
locks (such as databases) would end up with huge pollfd structs to manage, and
while it helps the refcounting (currently the nastiest bit of the current
dlm_device code) it removes the possibility of having persistent locks that exist
after the process exits - a handy feature that some people do use, though I
don't think it's in the currently submitted DLM code. One FD per lock also
gives each lock two handles, the lock ID used internally by the DLM and the
FD used externally by the application which I think is a little confusing.</p>

<p>I don't think a dlmfs is useful, personally. The features you can export
from it are either minimal compared to the full DLM functionality (so you
have to export the rest by some other means anyway) or are going to be so
un-filesystemlike as to be very awkward to use. Doing lock operations in shell
scripts is all very cool but how often do you /really/ need to do that?</p>

<p>I'm not saying that what we have is perfect - far from it - but we have
thought about how this works and what we came up with seems like a good
compromise between providing full DLM functionality to userspace using unix
features. But we're very happy to listen to other ideas - and have been
doing I hope.</p>

</quote>
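<p>Patrick describes the current interface as one descriptor per lockspace, one fixed-size record returned per <code>read()</code>, and <code>poll()</code> signalling pending callbacks. A userspace loop can illustrate this. The record layout and the names <code>struct dlm_cb_record</code> and <code>drain_callbacks()</code> are hypothetical and are not the dlm_device ABI. In the test, a pipe stands in for the lockspace misc device.</p>

<pre><![CDATA[
```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical callback record: one record per read(2), packet-style. */
struct dlm_cb_record {
    uint32_t lock_id;     /* DLM-internal lock identifier */
    int32_t  status;      /* completion status of the lock operation */
    uint8_t  is_blocking; /* 1 = blocking notification, 0 = completion */
    uint8_t  lvb[32];     /* lock value block (size assumed) */
};

/*
 * Wait up to timeout_ms for the lockspace descriptor to become readable,
 * then drain up to max records into out[]. Returns records read, or -1.
 */
int drain_callbacks(int fd, struct dlm_cb_record *out, int max, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = 0;

    while (n < max) {
        /* Block only for the first record; afterwards just check. */
        int ready = poll(&pfd, 1, n == 0 ? timeout_ms : 0);
        if (ready < 0)
            return -1;
        if (ready == 0 || !(pfd.revents & POLLIN))
            break;                          /* nothing (more) pending */
        ssize_t got = read(fd, &out[n], sizeof out[n]);
        if (got != (ssize_t)sizeof out[n])
            break;                          /* short read or EOF: stop */
        n++;
    }
    return n;
}
```
]]></pre>

<p>Every lock in the lockspace reports through the same descriptor, so a single <code>struct pollfd</code> covers all of them. Patrick uses this to argue against one descriptor per lock for applications holding thousands of locks.</p>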

<p>The discussion ended here, with no certain conclusion, though Andrew's
syscall preference may hold sway.</p>

</section>

<section
  title="Review Period In Preparation For 2.6.13.1"
  subject="[PATCH 0/9] -stable review"
  archive="http://groups.google.com/group/linux.kernel/msg/fbadc2209ba564d4"
  posts="14"
  startdate="07 Sep 2005 17:28:42 -0800"
  enddate="09 Sep 2005 08:05:00 -0800">
<topic>Assembly</topic>
<topic>Digital Video Broadcasting</topic>
<topic>Networking</topic>
<topic>PCI</topic>
<topic>Power Management: ACPI</topic>

<mention>Stephen Hemminger</mention>
<mention>James Bottomley</mention>
<mention>David S. Miller</mention>
<mention>Benjamin Herrenschmidt</mention>
<mention>Alexander Viro</mention>
<mention>Theodore Ts'o</mention>
<mention>Linus Torvalds</mention>
<mention>Randy Dunlap</mention>
<mention>Alan Cox</mention>
<mention>Mark Haverkamp</mention>
<mention>David Woodhouse</mention>
<mention>Patrick McHardy</mention>
<mention>Andrew Morton</mention>
<mention>Zwane Mwaikambo</mention>

<p>Chris Wright said:</p>

<quote who="Chris Wright">

<p>This is the start of the stable review cycle for the 2.6.13.1 release.
There are 9 patches in this series, all will be posted as a response to
this one.  If anyone has any issues with these being applied, please let
us know.  If anyone is a maintainer of the proper subsystem, and wants
to add a signed-off-by: line to the patch, please respond with it.</p>

<p>These patches are sent out with a number of different people on the
Cc: line.  If you wish to be a reviewer, please email stable@kernel.org
to add your name to the list.  If you want to be off the reviewer list,
also email us.</p>

</quote>

<p>The Cc list contained Justin Forbes, Zwane Mwaikambo, Theodore Ts'o,
Randy Dunlap, Chuck Wolber, Linus Torvalds, Andrew Morton, and Alan Cox,
in addition to the linux-kernel mailing list itself.</p>

<p>Each of Chris's replies had a single patch, with these changelog entries:</p>

<ul>

<li>

<p>I wish I had seen this before 2.6.13 was released... I guess this only
goes to show that there haven't been any testers using saa7134-hybrid
dvb/v4l boards that depend on the tda1004x module, during the 2.6.13-rc
series :-(</p>

<p>Please apply this to 2.6.14, and also to 2.6.13.1 -stable.  Without this
patch, users will have to EXPLICITLY select tda1004x in Kconfig.  This
SHOULD be done automatically when saa7134-dvb is selected.  This patch
corrects this problem.</p>

<p>saa7134-dvb must select tda1004x</p>

<p>Signed-off-by: Michael Krufky<br />
Signed-off-by: Chris Wright</p>

</li>

<li>

<p>This was noticed by Doug Bazamic and the fix found by Mark Salyzyn at
Adaptec.</p>

<p>There was an error in the BUG_ON() statement that validated the
calculated fib size which can cause the driver to panic.</p>

<p>Signed-off-by: Mark Haverkamp<br />
Acked-by: James Bottomley<br />
Signed-off-by: Chris Wright</p>

</li>

<li>

<p>This fixes a problem with pci_map_rom() which doesn't properly
update the ROM BAR value with the address that was allocated for it by the
PCI code. This problem, among others, breaks boot on Mac laptops.</p>

<p>It's a new version based on Linus's latest one with better error
checking.</p>

<p>Signed-off-by: Benjamin Herrenschmidt<br />
Signed-off-by: Linus Torvalds<br />
Signed-off-by: Chris Wright</p>

</li>

<li>

<p>I had some time to think about PCI assign issues in 2.6.13-rc series.</p>

<p>The major problem here is that we call pci_assign_unassigned_resources()
way too early - at subsys_initcall level. Therefore we give no chances
to ACPI and PnP routines (called at fs_initcall level) to reserve their
respective resources properly, as the comments in drivers/pnp/system.c
and drivers/acpi/motherboard.c suggest:</p>

<pre> /**
  * Reserve motherboard resources after PCI claim BARs,
  * but before PCI assign resources for uninitialized PCI devices
  */</pre>

<p>So I moved the pci_assign_unassigned_resources() call to
pcibios_assign_resources() (fs_initcall), which should hopefully fix a
lot of problems and make PCIBIOS_MIN_IO tweaks unnecessary.</p>

<p>Other changes:</p>

<ul>

<li>remove resource assignment code from pcibios_assign_resources(), since
  it duplicates pci_assign_unassigned_resources() functionality and
  actually does nothing in 2.6.13;</li>
<li>modify ROM assignment code as per Ben's suggestion: try to use firmware
  settings by default (if PCI_ASSIGN_ROMS is not set);</li>
<li>set CARDBUS_IO_SIZE back to 4K as it's a wonderful stress test for
  various setups.</li>

</ul>

<p>Confirmed by Tero Roponen (who had problems with
the 4kB CardBus IO size previously).</p>

<p>Signed-off-by: Linus Torvalds<br />
Signed-off-by: Chris Wright</p>

</li>

<li>

<p>[NET]: 2.6.13 breaks libpcap (and tcpdump)</p>

<p>Patrick McHardy says:</p>

<blockquote>

<p>  Never mind, I got it, we never fall through to the second switch
  statement anymore. I think we could simply break when load_pointer
  returns NULL. The switch statement will fall through to the default
  case and return 0 for all cases but 0 > k >= SKF_AD_OFF.</p>

</blockquote>

<p>Here's a patch to do just that.</p>

<p>I left BPF_MSH alone because it's really a hack to calculate the IP
header length, which makes no sense when applied to the special data.</p>

<p>Signed-off-by: Herbert Xu<br />
Signed-off-by: David S. Miller<br />
Signed-off-by: Chris Wright</p>

</li>

<li>

<p>[CRYPTO] Fix boundary check in standard multi-block cipher processors</p>

<p>Fixes Bug 5194 (IPSec related Oops in 2.6.13).</p>

<p>The boundary check in the standard multi-block cipher processors is
broken when nbytes is not a multiple of bsize.  In those cases it will
always process an extra block.</p>

<p>This patch corrects the check so that it processes at most nbytes of
data.</p>

<p>Signed-off-by: Herbert Xu<br />
Signed-off-by: Chris Wright</p>

</li>

<li>

<p>[IPV4]: Reassembly trim not clearing CHECKSUM_HW</p>

<p>This was found by inspection while looking for checksum problems
with the skge driver that sets CHECKSUM_HW. It did not fix the
problem, but it looks like it is needed.</p>

<p>If IP reassembly is trimming an overlapping fragment, it
should reset (or adjust) the hardware checksum flag on the skb.</p>

<p>Signed-off-by: Stephen Hemminger<br />
Signed-off-by: David S. Miller<br />
Signed-off-by: Chris Wright</p>

</li>

<li>

<p>When we copy 32bit ->msg_control contents to kernel, we walk the same
userland data twice without sanity checks on the second pass.</p>

<p>Second version of this patch: the original broke with 64-bit arches
running 32-bit-compat-mode executables doing sendmsg() syscalls with
unaligned CMSG data areas.</p>

<p>Another thing is that we use kmalloc() to allocate and sock_kfree_s()
to free afterwards; less serious, but also needs fixing.</p>

<p>Patch by Al Viro, David Miller, David Woodhouse</p>

<p>Signed-off-by: Chris Wright</p>

</li>

<li>

<p>Fix unchecked __get_user that could be tricked into generating a
memory read on an arbitrary address.  The result of the read is not
returned directly but you may be able to divine some information about
it, or use the read to cause a crash on some architectures by reading
hardware state.  CAN-2005-2492.</p>

<p>Fix from Alexander Viro, ack from Dave S. Miller.</p>

<p>Signed-off-by: Chris Wright</p>

</li>

</ul>
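<p>The multi-block cipher fix in the list above is about a loop bound. The sketch below is not the kernel's crypto code. Its two hypothetical helpers only count how many <code>bsize</code>-byte blocks a walker would process for <code>nbytes</code> of input. The first treats a trailing partial block as if it were whole. The second stops once fewer than <code>bsize</code> bytes remain, so it never goes past <code>nbytes</code>.</p>

<pre><![CDATA[
```c
/* Hypothetical model of the bug class: while any bytes remain, a whole
 * block is processed, so a trailing partial block is treated as complete. */
unsigned int blocks_overrun(unsigned int nbytes, unsigned int bsize)
{
    unsigned int blocks = 0;

    if (bsize == 0)
        return 0;
    while (nbytes > 0) {
        blocks++;                                  /* consumes bsize bytes */
        nbytes = nbytes > bsize ? nbytes - bsize : 0;
    }
    return blocks;
}

/* Corrected bound: only whole blocks that fit inside nbytes are processed,
 * so at most nbytes bytes are ever touched. */
unsigned int blocks_bounded(unsigned int nbytes, unsigned int bsize)
{
    unsigned int blocks = 0;

    if (bsize == 0)
        return 0;
    while (nbytes >= bsize) {
        blocks++;
        nbytes -= bsize;
    }
    return blocks;
}
```
]]></pre>

<p>Take 35 bytes of input and a 16-byte block. The first version processes 3 blocks, which is 48 bytes and runs past the end of the data. The second processes 2. When <code>nbytes</code> is an exact multiple of <code>bsize</code>, both agree, which is why the bug only shows up with odd-sized input.</p>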

</section>

<section
  title="Some Advice For Upgrading From 2.4 To 2.6"
  subject="How to plan a kernel update ?"
  archive="http://groups.google.com/group/linux.kernel/msg/86f63ed45b1c6f15"
  posts="5"
  startdate="08 Sep 2005 09:12:01 -0800"
  enddate="09 Sep 2005 09:54:33 -0800">

<p>For his job, Weber Ress had to lead a team of engineers to upgrade the
kernel from 2.4 to 2.6 on many servers, and asked for advice. Michael
Thonke suggested that <quote who="Michael Thonke">google is your
best friend and first source for it.</quote> And gave a link to <a
href="http://linuxdevices.com/articles/AT3855888078.html">William von Hagen's
article</a> on the subject. Jesper Juhl also said:</p>

<quote who="Jesper Juhl">

<p>I do upgrade a lot of kernels, so I'll tell you a little about what I do
and what I'd recommend. Then you can do with that info what you like :)</p>

<p>The very first thing you want to do is to ensure that all core
utilities/tools are up-to-date to versions that will work with your new
kernel.</p>

<p>If you download a copy of the 2.6.13 kernel source, extract it, and
look in the file Documentation/Changes you'll see a list of tools and utils
along with the minimum required version for them to work properly with that
kernel. Ensure those tools are OK.</p>

<p>Once you are sure the core utils are up-to-date you need to go check
whatever other important programs you have on the machine(s) and check that
those are also able to run OK with the new kernel.</p>

<p>Once you are satisfied that everything is up to a level that'll work
with the new kernel you can go build the new 2.6.13 kernel and drop it in
place. You don't need to remove your existing kernel first, you can just
install the 2.6.13 kernel side by side with the old one and test boot it,
then if it doesn't work right you can always reboot back to the old one.</p>

<p>Most likely you can find documentation for your distribution stating what
version of it is "2.6 ready" - I use Slackware for example, and Slackware 10.1
is completely 2.6 kernel ready, so on a Slackware 10.1 box there's no hassle
at all, I just drop in a 2.6 kernel in place of the 2.4 one it installs by
default and everything is good - all tools are already ready to cope.</p>

</quote>
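<p>Jesper's first step is to check installed tools against the minimum versions listed in Documentation/Changes. That check comes down to comparing dotted version strings. Below is a small helper sketch for that comparison. The names <code>version_cmp()</code> and <code>meets_minimum()</code> are hypothetical. The helper only compares numeric components and ignores trailing non-numeric suffixes. The version numbers in the test are illustrative, not copied from Documentation/Changes.</p>

<pre><![CDATA[
```c
#include <stdlib.h>

/*
 * Compare dotted numeric versions such as "2.6.13" or "2.95.3".
 * Returns <0 if a < b, 0 if equal, >0 if a > b. Missing components count
 * as 0, so "2.6" equals "2.6.0". Comparison stops at the first character
 * that is neither a digit nor a dot, so suffixes like "-pre1" are ignored.
 */
int version_cmp(const char *a, const char *b)
{
    for (;;) {
        char *end_a, *end_b;
        long va = strtol(a, &end_a, 10);   /* 0 if no digits are left */
        long vb = strtol(b, &end_b, 10);

        if (va != vb)
            return va < vb ? -1 : 1;
        if (*end_a != '.' && *end_b != '.')
            return 0;                      /* both strings exhausted */
        a = (*end_a == '.') ? end_a + 1 : end_a;
        b = (*end_b == '.') ? end_b + 1 : end_b;
    }
}

/* Nonzero if the installed version is at least the required minimum. */
int meets_minimum(const char *installed, const char *minimum)
{
    return version_cmp(installed, minimum) >= 0;
}
```
]]></pre>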

</section>

<section
  title="Status Of Serial SCSI; Some Dispute Over Direction"
  subject="[ANNOUNCE 0/2] Serial Attached SCSI (SAS) support for the Linux kernel"
  archive="http://groups.google.com/group/fa.linux.kernel/msg/5b068cd03f92e716"
  posts="6"
  startdate="09 Sep 2005 11:30:09 -0800"
  enddate="13 Sep 2005 06:24:06 -0800">
<topic>Disks: SCSI</topic>
<topic>FS: sysfs</topic>
<topic>Hot-Plugging</topic>
<topic>Ioctls</topic>
<topic>Ottawa Linux Symposium</topic>
<topic>SMP</topic>
<topic>Serial ATA</topic>

<mention>Douglas Gilbert</mention>

<p>Luben Tuikov from Adaptec said:</p>

<quote who="Luben Tuikov">

<p>The following announcements and patches introduce Serial Attached SCSI
(SAS) support for the Linux kernel. Everything is supported.</p>

<p>The infrastructure is broken into</p>

<ul>

<li>SAS LLDD,</li>

<li>SAS Layer.</li>

</ul>

<p>The SAS LLDD does phy/OOB management, and generates SAS events to the
SAS Layer. Those events are *the only way* a SAS LLDD communicates with
the SAS Layer. If you can generate 3 types of event, then you can use
this infrastructure. The first two are, loosely, "link was severed",
"bytes were dmaed". The third kind is "received a primitive", used for
domain revalidation.</p>

<p>A SAS LLDD should implement the Execute Command SCSI RPC and at least
one SCSI TMF (Task Management Function), in order for the SAS Layer to
communicate with the SAS LLDD.</p>

<p>The SAS Layer is concerned with</p>

<ul>

<li>SAS Phy/Port/HA event management (LLDD generates, SAS Layer
processes),</li>

<li>SAS Port management (creation/destruction),</li>

<li>SAS Domain discovery and revalidation,</li>

<li>SAS Domain device management,</li>

<li>SCSI Host registration/deregistration,</li>

<li>Device registration with SCSI Core (SAS) or libata (SATA/PI), and</li>

<li>Expander management and exporting expander control to user space.</li>

</ul>

<p>The SAS Layer uses the Execute Command SCSI RPC, and the TMFs implemented
by the SAS LLDD in order to manage the domain and the domain devices.</p>

<p>For details please see drivers/scsi/sas-class/README.</p>

<p>The SAS Layer represents the SAS domain in sysfs. For each object
represented, its parent is the physical entity it attaches to in the physical
world. So in effect, kobject_get gets the whole chain on which that
object depends.</p>

<p>In effect, the sysfs representation of the SAS domain(s) is what you'd
see in the physical world.</p>

<p>Hot plugging and hot unplugging of devices, domains and subdomains is
supported. Repeated hot plugging and hot unplugging is also supported,
naturally.</p>

<p>SAS introduces a new physical entity, an expander.
Expanders are _not_ SAS devices, and thus are _not_ SCSI devices.
Expanders are part of the Service Delivery Subsystem, in this case
SAS.</p>

<p>Expanders are controlled using the Serial Management Protocol (SMP).
Complete control is given to user space of all expanders found
in the domain, using an "smp_portal".  More of this in the second
and third email in this series.</p>

<p>A user space program, "expander_conf.c", is also presented to show
how one controls expanders in the domain.  It is located here:
drivers/scsi/sas-class/expanders_conf.c</p>

<p>The second email in this series shows an example of SAS domains
and their representation in sysfs.</p>

<p>The third email in this series shows an example of using the
"expander_conf.c" program to query all expanders in the domain,
showing their attributes, their phys, and their routing tables.</p>

<p>If you have the hardware, please give it a try.  If you have
expander(s) it would be even more interesting.</p>

<p>Patches of the SAS Layer and of the AIC94XX SAS LLDD follow.</p>

<p>You can also download the patches from
<a
href="http://www.geocities.com/ltuikov/">http://www.geocities.com/ltuikov/</a></p>

</quote>
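<p>Luben says that taking a reference on an object pins "the whole chain". In the kobject model, adding a child takes a reference on its parent, and that reference is dropped only when the child is released. The toy model below is not the kobject API, and the names <code>struct node</code>, <code>node_create()</code>, <code>node_get()</code> and <code>node_put()</code> are hypothetical. It only shows why holding the leaf of a host/port/device chain keeps every ancestor alive.</p>

<pre><![CDATA[
```c
#include <stdlib.h>

/* Toy refcounted object: each child owns one reference on its parent. */
struct node {
    int refcount;
    struct node *parent;
};

static int nodes_freed;   /* counts releases so the behaviour can be observed */

struct node *node_get(struct node *n)
{
    if (n)
        n->refcount++;
    return n;
}

void node_put(struct node *n)
{
    /* Freeing a child drops its reference on the parent, and so on upward. */
    while (n && --n->refcount == 0) {
        struct node *parent = n->parent;
        free(n);
        nodes_freed++;
        n = parent;
    }
}

/* Create a node with refcount 1; the new child pins its parent. */
struct node *node_create(struct node *parent)
{
    struct node *n = malloc(sizeof *n);
    if (!n)
        return NULL;
    n->refcount = 1;
    n->parent = node_get(parent);
    return n;
}
```
]]></pre>

<p>Suppose a host, a port and a device are created in that order. The host and port survive their creator dropping its references for as long as the device is held. Releasing the device then frees all three.</p>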

<p>Christoph Hellwig said, <quote who="Christoph Hellwig">At the core
it's some really nice code dealing with host-based SAS implementations.
What's not nice is that it's not integrating with the SAS transport class I
posted, it's duplicating things like LUN discovery from the SCSI core code,
and adding its own sysfs representation that's very different from the way
the SCSI core and transport classes do it. Are you willing to work with us to
integrate it with the infrastructure we have?</quote> Luben replied, <quote
who="Luben Tuikov">HP and LSI were aware of my efforts since the beginning of
the year. As well, you had a copy of my code July 14 this year, long before
starting your work on your SAS class for LSI and HP (so its acceptance is
guaranteed), after OLS. We did meet at OLS and we did have the SAS BOF. I'm
not sure why you didn't want to work together?</quote> He invited Christoph
to base future work on Luben's implementation. Andrew Patterson from Hewlett
Packard replied, <quote who="Andrew Patterson">This effort started in April.
Eric Moore, Mike Miller and I started work on a SAS transport class and then
later pulled Luben in at the suggestion of Douglas Gilbert (if I remember
correctly). We later mutually agreed that Luben would take over the transport
class work as he seemed to have much more experience with this sort of thing.
The original idea was to implement a SAS transport class that would allow
the LSI and Adaptec driver to get into kernel.org (or others at the time)
and to find a way to get SDI/CSMI API's into the kernel without the use of
IOCTL's. Luben then went off on his own and came up with his effectively
Adaptec only solution.</quote> He also added, regarding the OLS BOF, <quote
who="Andrew Patterson">If my memory serves correctly, there were 10-12
people at that BOF, representing the SCSI kernel maintainers and all of the
vendors currently providing SAS hardware. Virtually everyone disagreed with
your implementation (which you indeed emailed shortly before the conference)
that would only work with one vendor's card. The suggestion was made that you
convert your code to various library layers so that it would work with all
vendors. A suggestion which it seems that you continue to reject.</quote></p>

</section>

<section
  title="SBC8360 Watchdog Driver Heading Into Mainline"
  subject="[WATCHDOG] Push SBC8360 driver upstream"
  archive="http://groups.google.com/group/linux.kernel/msg/82b363bf21aa75d0"
  posts="3"
  startdate="09 Sep 2005 12:16:07 -0800"
  enddate="10 Sep 2005 01:07:02 -0800">

<p>Ian E. Morgan said:</p>

<quote who="Ian E. Morgan">

<p>I would like to ask that the SBC8360 watchdog driver be pushed upstream
from -mm in time for the 2.6.14-rc series.</p>

<p>I recognise that this driver, like a lot of the watchdog drivers, is
for a piece of hardware that is present in only a very small percentage of
hardware running Linux. I doubt that being in -mm for a long time will make
any significant difference to it being more widely tested. The driver is
working perfectly as expected on each of the machines we've tested it on.</p>

<p>As a recap, the driver was submitted to akpm, was
included in -mm1 (watchdog-new-sbc8360-driver.patch),
offloaded to Wim's linux-2.6-watchdog-mm.git tree (commit
88b1f50923d14195ac1a50840fc4aa4066f067a9), and subsequently included in -mm2
by way of the combined git-watchdog.patch.</p>

<p>Please consider merging this driver into 2.6.14-rc1. Thanks.</p>

</quote>

<p>Andrew Morton replied, <quote who="Andrew Morton">That's in Wim's tree now.
Wim, could you please prepare a pull for Linus within the next couple of
days?</quote> Wim Van Sebroeck said, <quote who="Wim Van Sebroeck">I'm
preparing the tree for linus to pull from. Should be there by the end of
the weekend. (Will probably contain 6 drivers + some updates of some other
drivers).</quote></p>

</section>

<section
  title="DevFS Still On The Chopping Block; Users Still Resistant"
  subject="[GIT PATCH] Remove devfs from 2.6.13"
  archive="http://groups.google.com/group/linux.kernel/msg/908fd4ff398e4a8e"
  posts="32"
  startdate="09 Sep 2005 13:45:42 -0800"
  enddate="14 Sep 2005 18:10:41 -0800">
<topic>FS: devfs</topic>
<topic>FS: sysfs</topic>
<topic>Sound: ALSA</topic>

<mention>Valdis Kletnieks</mention>

<p>Greg KH, having been stymied in his effort to remove DevFS in time for
the 2.6.12 release, now submitted the identical patches against 2.6.13; he
hoped this time they would make it in. He added, <quote who="Greg KH">Also,
if people _really_ are in love with the idea of an in-kernel devfs, I have
posted a patch that does this in about 300 lines of code, called ndevfs.
It is available in the archives if anyone wants to use that instead (it is
quite easy to maintain that patch outside of the kernel tree, due to it only
needing 3 hooks into the main kernel tree.)</quote> Mike Bell replied that
NDevFS was <quote who="Mike Bell">broken by design. It creates yet another
incompatible naming scheme for devices, and what's worse the devices it
breaks are the ones like ALSA and the input subsystem, whose locations are
hard-coded into libraries. Unless sysfs is going to get attributes from which
the proper names could be derived, it won't ever work.</quote> Greg replied
that he knew NDevFS wasn't a nice solution; it was just an alternative.
He added, <quote who="Greg KH">Anyway, I'm not offering it up for inclusion
in the kernel tree at all, but for a proof-of-concept for those who were
insisting that it was impossible to keep a devfs-like patchset out of the
main kernel tree easily.</quote></p>

<p>Elsewhere, David Lang said it was important to be cautious in removing
DevFS, because of the dangers of breaking various systems. Greg replied, <quote
who="Greg KH">Ok, how long should I wait then?</quote> And David said:</p>

<quote who="David Lang">

<p>if 2.6.13 removed the devfs config option, then I would say the code itself
should stay until 2.6.15 or 2.6.16 (if the release schedule does drop down
to ~2 months then it would need to be at least .16). especially with so many
people afraid of the 2.6 series you need to wait at least one full release
cycle, probably two (and possibly more if they end up being short ones)
then rip out the rest of the code for the following release.</p>

<p>remember that the distros don't package every kernel, they skip several
between releases and it's not going to be until they go to try them that
all the kinks will get worked out.</p>

<p>add to this the fact that many people have gotten confused about kernel
releases and think that since 13 is odd 2.6.13 is a testing kernel and 2.6.14
will be a stable one and so won't look at .13</p>

<p>note that all this assumes that the issues that people have about sysfs
not yet being able to replace the capabilities that they are using in
devfs have been addressed.</p>

</quote>

<p>Greg said he wasn't aware of any major distribution shipping kernels with
DevFS enabled. He and Valdis Kletnieks asked if anyone knew of any that did.
Bastian Blank said Debian Unstable did, though as someone else pointed out, no
one could confuse Debian Unstable with a shippable distribution. Beyond that,
no one was able to come up with even a single distribution shipping DevFS.</p>

<p>The thread ended with no hard conclusions about how long DevFS can expect
to live in the kernel.</p>

</section>

<section
  title="Linux 2.6.13.1 Released"
  subject="Linux 2.6.13.1"
  archive="http://groups.google.com/group/fa.linux.kernel/msg/2dabf00664aaebd4"
  posts="2"
  startdate="09 Sep 2005 19:22:02 -0800"
  enddate="09 Sep 2005 19:23:56 -0800">
<topic>Assembly</topic>
<topic>Digital Video Broadcasting</topic>
<topic>PCI</topic>
<topic>Security</topic>

<mention>Stephen Hemminger</mention>
<mention>David S. Miller</mention>
<mention>Benjamin Herrenschmidt</mention>
<mention>Chris Wright</mention>
<mention>Ivan Kokshaysky</mention>
<mention>Mark Haverkamp</mention>
<mention>David Woodhouse</mention>

<p>Chris Wright announced Linux 2.6.13.1, saying:</p>

<quote who="Chris Wright">

<p>We (the -stable team) are announcing the release of the 2.6.13.1 kernel.</p>

<p>The diffstat and short summary of the fixes are below.</p>

<p>I'll also be replying to this message with a copy of the patch between
2.6.13 and 2.6.13.1, as it is small enough to do so.</p>

<p>The updated 2.6.13.y git tree can be found at:<br />
        rsync://rsync.kernel.org/pub/scm/linux/kernel/git/chrisw/linux-2.6.13.y.git<br />
and can be browsed at the normal kernel.org git web browser:<br />
        <a href="http://www.kernel.org/git/">www.kernel.org/git/</a></p>

</quote>

<p>He listed the changes from 2.6.13 to 2.6.13.1:</p>

<quote who="Chris Wright">

<p>Al Viro:<br />
  raw_sendmsg DoS (CAN-2005-2492)</p>

<p>Benjamin Herrenschmidt:<br />
  Fix PCI ROM mapping</p>

<p>Chris Wright:<br />
  Linux 2.6.13.1</p>

<p>David S. Miller:<br />
  Use SA_SHIRQ in sparc specific code.</p>

<p>David Woodhouse:<br />
  32bit sendmsg() flaw (CAN-2005-2490)</p>

<p>Herbert Xu:<br />
  2.6.13 breaks libpcap (and tcpdump)<br />
  Fix boundary check in standard multi-block cipher processors</p>

<p>Ivan Kokshaysky:<br />
  x86: pci_assign_unassigned_resources() update</p>

<p>Mark Haverkamp:<br />
  aacraid: 2.6.13 aacraid bad BUG_ON fix</p>

<p>Michael Krufky:<br />
  Kconfig: saa7134-dvb must select tda1004x</p>

<p>Stephen Hemminger:<br />
  Reassembly trim not clearing CHECKSUM_HW</p>

</quote>

</section>

<section
  title="Status Of Exposing Certain NUMA Data To Userspace"
  subject="NUMA mempolicy /proc code in mainline shouldn't have been merged"
  archive="http://groups.google.com/group/linux.kernel/msg/ffa5a064d2d8c338"
  posts="7"
  startdate="10 Sep 2005 01:20:18 -0800"
  enddate="10 Sep 2005 23:11:20 -0800">

<p>Andi Kleen said:</p>

<quote who="Andi Kleen">

<p>Just noticed the ugly SGI /proc/*/numa_maps code got merged. I argued
several times against it and I very deliberately didn't include a similar
facility when I wrote the NUMA policy code because it's a bad idea.</p>

<ul>

<li>it's a lot of ugly code.</li>

<li>it's basically only a debugging hack right now</li>

<li>it presents lots of kernel internal information and mempolicy internals
(like how many people have a page mapped) etc. to userland that shouldn't
be exposed to this.</li>

<li>the format is very complicated and the chance of bug free userland
parsers of this is near zero.</li>

<li>there is no demonstrated application that needs it (there was a theoretical
usecase where it might be needed, but there were better solutions proposed
for this)</li>

</ul>

<p>Can the patch please be removed?</p>

</quote>

<p>Andrew Morton said he had queued up a patch to revert it. Christoph Lameter
felt the code was quite salvageable and didn't see why it should be reverted.
Andrew replied, <quote who="Andrew Morton">If it's
useful to application developers then fine. If it's only useful to kernel
developers then the argument is weakened. However there's still quite a lot
of development going on in this area, so there's still some argument for having
the monitoring ability in the mainline tree.</quote> Christoph replied:</p>

<quote who="Christoph Lameter">

<p>I still have a hard time to see how people can accept the line of reasoning
that says:</p>

<blockquote>

<p>Users are not allowed to know on which nodes the operating system allocated
resources for a process and are also not allowed to see the memory policies
in effect for the memory areas</p>

</blockquote>

<p>Then the application developers have to guess the effect that the memory
policies have on memory allocation. For memory alloc debugging the poor
app guys must today simply imagine what the operating system is doing.
They can see the amount of total memory allocated on a node via other proc
entries and then guess based on that which application has taken it. Then
they modify their apps and do another run.</p>

<p>My thinking today is that I'd rather leave /proc/&lt;pid&gt;/numa_stats
instead of using smaps because the smaps format is a bit verbose and will
make it difficult to see the allocation distribution. If we use smaps then
we probably need some tool to parse and present information. numa_stats is
directly usable.</p>

<p>I have a new series of patches here that does a gradual thing with the
policy layer:</p>

<ol>

<li>Clean up policy layer to properly use node macros instead of bitmaps.
Some comments to explain certain limitations of the policy layer.</li>

<li>Clean up policy layer by doing do_xx and sys_xx separation [optional
but this separates the dynamic bitmaps in user space from the static node
maps in kernel space which I find very helpful]</li>

<li>Add mpol_to_str to policy layer and make numa_stats use mpol_to_str.</li>

<li>Solve the potential access issue when set_mempolicy is updating
task-&gt;mempolicy while numa_stats are being displayed by taking a writelock
on mmap_sem in set_mempolicy. This is in harmony with vma mempolicy updates
that also take a lock on mmap_sem and that are already safe to access since
numa_stats always takes an mmap_sem readlock. The patch is essentially
inserting two lines.</li>

</ol>

<p>Then I still have these evil intentions of making it possible to dynamically
change memory policies from the outside. The minimum that we all need is to
at least be able to see what's going on.</p>

<p>Of course we would be happier if we would also be allowed to change policies
to control memory allocation. The argument that the layer is not able to handle
these is of course true since attempts to fix the issues have been blocked.</p>

</quote>
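<p>Christoph's fourth patch relies on a standard readers-writer locking
pattern. In the kernel, mmap_sem is a read-write semaphore. Under his proposal,
set_mempolicy would take it for writing while updating task-&gt;mempolicy,
while numa_stats already takes it for reading while displaying. The Python
sketch below models that pattern in userspace only. The RWLock and Task
classes and the show_numa_stats() helper are hypothetical stand-ins, not
kernel code. The policy is deliberately updated in two steps so that the
lock has something to protect.</p>

```python
import threading


class RWLock:
    """Minimal readers-writer lock, a hypothetical stand-in for mmap_sem.

    Many readers may hold it at once; a writer holds it alone.
    Writer starvation is not addressed in this sketch.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:          # wait out any active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:       # last reader out wakes waiting writers
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class Task:
    """Hypothetical task whose memory policy is guarded by mmap_sem."""

    def __init__(self):
        self.mmap_sem = RWLock()
        self.mempolicy = {"mode": "default", "nodes": []}


def set_mempolicy(task, mode, nodes):
    # Writer side: the policy changes in two steps, so without the write
    # lock a reader could observe the new mode paired with the old nodes.
    task.mmap_sem.acquire_write()
    try:
        task.mempolicy["mode"] = mode
        task.mempolicy["nodes"] = list(nodes)
    finally:
        task.mmap_sem.release_write()


def show_numa_stats(task):
    # Reader side: format the policy while holding the read lock, the way
    # numa_stats already holds mmap_sem for reading.
    task.mmap_sem.acquire_read()
    try:
        pol = task.mempolicy
        return pol["mode"] + " " + ",".join(str(n) for n in pol["nodes"])
    finally:
        task.mmap_sem.release_read()


if __name__ == "__main__":
    t = Task()
    set_mempolicy(t, "interleave", [0, 1])
    print(show_numa_stats(t))  # interleave 0,1
```

<p>Readers never block one another, so displaying statistics stays cheap. A
writer, by contrast, waits for active readers to drain before it changes the
policy. A production lock would also need a fairness policy so that a steady
stream of readers cannot starve a writer.</p>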

<p>Andrew seemed swayed by these arguments and began to favor keeping the
patch, but the thread ended without any firm conclusion.</p>

</section>

<section
  title="Status Of eth1394 And SBP2 Maintainership"
  subject="eth1394 and sbp2 maintainers"
  archive="http://groups.google.com/group/linux.kernel/msg/3bfe679c9caca0f5"
  posts="7"
  startdate="10 Sep 2005 10:39:49 -0800"
  enddate="12 Sep 2005 17:50:51 -0800">
<topic>MAINTAINERS File</topic>

<p>Stefan Richter said, <quote who="Stefan Richter">the MAINTAINERS list of
Linus' tree is still listing eth1394 and sbp2 as orphaned. This is certainly
not correct for sbp2. Is it for eth1394?</quote> He said to Ben Collins,
<quote who="Stefan Richter">Ben, I remember you wanted to have your contact
added back in, at least for sbp2. In case this should not be true anymore,
I'd volunteer for sbp2 maintenance.</quote> Ben replied, <quote who="Ben
Collins">I sent a patch to Linus, but I guess it never got added. Stefan,
feel free to send a patch adding you as the maintainer</quote>. Regarding
eth1394, Jody McIntyre also said, <quote who="Jody McIntyre">I emailed Steve
Kinneberg, the last person to do any serious work on the driver, before I
made this change, and he's OK with that. If someone else wants to take it,
I suggest they submit a patch.</quote></p>

</section>

<section
  title="Location Of Stable Kernel Pending Patches"
  subject="Pending -stable patches"
  archive="http://groups.google.com/group/linux.kernel/msg/1e7a68561d3e2bd4"
  posts="3"
  startdate="13 Sep 2005 08:27:36 -0800"
  enddate="14 Sep 2005 12:27:54 -0800">

<mention>Michal Piotrowski</mention>

<p>Jean Delvare asked:</p>

<quote who="Jean Delvare">

<p>Is there a place where pending -stable patches can be seen?</p>

<p>Are mails sent to stable@kernel archived somewhere?</p>

<p>There seems to be a need for this. For example, there's a patch I would
like to see in 2.6.13.2, but I wouldn't want to report an already known
problem.</p>

</quote>

<p>Michal Piotrowski gave Jean a link to <a
href="http://www.kernel.org/git/?p=linux/kernel/git/chrisw/stable-queue.git;a=shortlog">the
stable queue shortlog</a>, and Jean replied, <quote who="Jean Delvare">Exactly
what I needed. It's bookmarked now. Thanks!</quote></p>

</section>

<section
  title="udev 069 Released"
  subject="[ANNOUNCE] udev 069 release"
  archive="http://groups.google.com/group/linux.kernel/msg/7a9bc69af71a6ff7"
  posts="4"
  startdate="13 Sep 2005 09:48:49 -0800"
  enddate="14 Sep 2005 07:32:11 -0800">
<topic>FS: devfs</topic>
<topic>FS: sysfs</topic>
<topic>Hot-Plugging</topic>

<p>Greg KH said:</p>

<quote who="Greg KH">

<p>I've released the 069 version of udev.  It can be found at:<br />
        <a href="http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev-069.tar.gz">kernel.org/pub/linux/utils/kernel/hotplug/udev-069.tar.gz</a></p>

<p>udev allows users to have a dynamic /dev and provides the ability to
have persistent device names.  It uses sysfs and /sbin/hotplug and runs
entirely in userspace.  It requires a 2.6 kernel with CONFIG_HOTPLUG
enabled to run.  Please see the udev FAQ for any questions about it:<br />
        <a href="http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev-FAQ">kernel.org/pub/linux/utils/kernel/hotplug/udev-FAQ</a></p>

<p>For any udev vs devfs questions anyone might have, please see:<br />
        <a href="http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev_vs_devfs">kernel.org/pub/linux/utils/kernel/hotplug/udev_vs_devfs</a></p>

<p>And there is a general udev web page at:<br />
        <a href="http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html">http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html</a></p>

<p>Note, I _really_ recommend anyone running 2.6.13 or newer to upgrade to
at least the 068 version of udev due to some very nice speed improvements
(not to mention the fact that the 2.6.12 kernel requires at least the
058 version of udev.)</p>

<p>There have been lots of good bugfixes and new features added since the
last time I announced a udev release, so see the RELEASE-NOTES file for
details, and the changelog below.</p>

<p>udev uses git for its source code control system.  The main udev git
repo can be found at:<br />
        rsync://rsync.kernel.org/pub/scm/linux/hotplug/udev.git<br />
and can be browsed online at:<br />
        <a href="http://www.kernel.org/git/?p=linux/hotplug/udev.git">http://www.kernel.org/git/?p=linux/hotplug/udev.git</a></p>

</quote>

</section>

<section
  title="In Defense Of DevFS"
  subject="devfs vs udev FAQ from the other side"
  archive="http://groups.google.com/group/fa.linux.kernel/msg/78939b21143543dc"
  posts="7"
  startdate="14 Sep 2005 16:51:06 -0800"
  enddate="14 Sep 2005 20:13:34 -0800">
<topic>FS: devfs</topic>
<topic>FS: sysfs</topic>
<topic>Hot-Plugging</topic>
<topic>Small Systems</topic>

<mention>Greg KH</mention>

<p>Mike Bell said:</p>

<quote who="Mike Bell">

<p align="center">devfs vs udev: From the other side</p>

<p>Presuppositions (True of both udev and devfs):</p>

<ol>

<li>Dynamic /dev is the way of the future, and a Good Thing</li>

<li>A single major/minor combination should have only a single device
node, its other names should be symlinks. If you don't do this, you
break locking on certain classes of applications, among other things.</li>

</ol>

<p>The above are uncontentious as far as I know. I believe Greg KH has
stated both. If you feel otherwise, please explain why.</p>

<p>Differences:</p>

<ol>

<li>devfs creates device nodes from kernel space, and creates symlinks
  for alternative names using a userspace helper. udev handles both tasks
  from user space, by exporting the information through a different
  kernel-generated filesystem.</li>

</ol>

<p>devfs advantages over udev:</p>

<ol>

<li>

<p>devfs is smaller<br />
  Hey, I ran the benchmarks, I have numbers, something Greg never gave.
  Took an actual devfs system of mine and disabled devfs from the
  kernel, then enabled hotplug and sysfs for udev to run.  make clean
  and surprise surprise, kernel is much bigger. Enable netlink stuff and
  it's bigger still. udev is only smaller if like Greg you don't count
  its kernel components against it, even if they wouldn't otherwise need
  to be enabled. Difference is to the tune of 604164 on udev and 588466
  on devfs. Maybe not a lot in some people's books, but a huge
  difference from the claims of other people that devfs is actually
  bigger.</p>

<p>  And that's just the kernel. Then because your root is read-only you
  need an early userspace, and in regular userspace the udev binary, and
  its data files, all more wasted flash (you can shave it down by
  removing stuff you don't need, but that's just more work for the busy
  coder, and udev STILL loses on size).</p>

<p>  On the system in question (a real-world embedded system) the devfs
  solution requires no userspace helper except for two symlinks which
  were simply created manually in init, and could have been done away
  with if necessary.</p>

</li>

<li>devfs is faster<br />
  Despite all the many tricks that can be used to speed up udev (static
  linking, netlink, etc) devfs is still dramatically faster. On a big,
  bloated, slow-booting distribution system you may not notice so much,
  but when even your slowest booting systems are interactive in under 5
  seconds using devfs, this is quite significant time loss.</li>

<li>

<p>devfs uses less memory<br />
  Check free. sysfs alone does udev in and that's just the kernel stuff
  that's always there.</p>

<p>  Also, the user space stuff may not have to run at all times in all
  configurations, but on a system without swap and with long-running
  apps, all that matters is its PEAK memory usage. If my app takes x MB
  and my kernel takes y MB it doesn't MATTER that udev is only running
  for one second, I still need more than x+yMB of memory.</p>

</li>

</ol>

<p>udev advantages over devfs:</p>

<ol>

<li>udev has all sorts of spiffy features<br />
  Sure, but having device nodes exported directly from the kernel in no
  way stops you from having those spiffy features. The problem is that
  udev is doing two separate tasks, and it's easy to confuse the one it
  should be doing with the one it shouldn't.</li>

<li>

<p>udev doesn't have policy in kernel space<br />
  Well, that's a bit of a lie. sysfs has even stricter policy in kernel
  space. What he MEANS is that udev exchanges hard-coded but symlinkable
  /dev paths for hardcoded sysfs paths, moving the hard-coded kernel
  policy from one filesystem to another.</p>

<p>  This argument is really the only architectural reason to go with udev.
  At all. If you really believe that the ability to name your hard drive
  /dev/foobarbaz is vital, and absolutely can't live with merely having
  /dev/foobarbaz be a symlink to the real device node, then you need udev.
  The devfs way of handling this situation was a stupid, racey misfeature
  and rightly deserves to die horribly.</p>

<p>  That said, read my comments on why flexible /dev naming is actually a
  bad thing and think very, very carefully about whether you actually want
  this "feature" at all. Symlinks are your friend.</p>

</li>

<li>

<p>devfs is ugly<br />
  Part of this is true, and part of this is just the perspective of
  certain people (Greg has this fascinating world view where code
  required for devfs is garbage, and code required for udev is core
  kernel code and doesn't count against udev, which allows him to say
  udev is smaller.)</p>

<p>  The legitimate comments about devfs being ugly... well, how many
  subsystems which have been largely untouched for similar periods of
  time aren't even uglier? TTY stuff? And it's very hard to find a
  maintainer for a subsystem when it's "obsolete", patches that change
  its behaviour aren't accepted, and certain people are so vocally
  opposed to its very existence. Who wants to throw away their time
  writing code that won't even be considered, only to be hated for it?</p>

</li>

<li>devfs is unsupported, udev isn't<br />
  True that. And even people who've tried to maintain devfs get turned
  away. So unless this document causes a few people to reexamine the
  need to remove devfs, you can reasonably assume that udev will be the
  only way to run a linux system very shortly (static /dev is already on
  its last legs). Me, I'll be disappointed if this happens, because as
  the above document indicates, I still think kernel-exported /dev is
  better (and not because I'm a lazy user-space-hater, Greg. :) ).</li>

</ol>

</quote>
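<p>Mike's second presupposition is that each major/minor pair should have
one real node, with symlinks for its other names. This matters for programs
that serialize access through lock files named after the device. The
hypothetical sketch below assumes the traditional UUCP convention of naming
a lock file "LCK.." followed by the device's base name, as the Filesystem
Hierarchy Standard describes for /var/lock. The lock_file_name() and demo()
helpers are illustrative only and are not part of udev or devfs. Ordinary
files stand in for device nodes, so no root privileges are needed.</p>

```python
import os
import tempfile


def lock_file_name(dev_path):
    """Return a UUCP-style lock name: "LCK.." plus the device base name.

    Symlinks are resolved first, so every alias of a device yields the
    same lock name. That only holds if the device has exactly one real
    node and every other name is a symlink to it.
    """
    return "LCK.." + os.path.basename(os.path.realpath(dev_path))


def demo():
    # Regular files stand in for device nodes in this sketch.
    with tempfile.TemporaryDirectory() as d:
        node = os.path.join(d, "ttyS0")   # the one real node
        open(node, "w").close()
        alias = os.path.join(d, "modem")  # alternative name as a symlink
        os.symlink(node, alias)
        dup = os.path.join(d, "modem2")   # alternative name as a second node
        open(dup, "w").close()
        return lock_file_name(alias), lock_file_name(dup)


if __name__ == "__main__":
    print(demo())  # ('LCK..ttyS0', 'LCK..modem2')
```

<p>The symlinked alias resolves to ttyS0 and contends for the same lock as the
real node. The duplicate node gets a lock of its own, so two programs opening
the same device under different names would never see each other's lock.</p>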

<p>There was no real discussion in response to this. After the first few
replies it looked as though a huge flamewar would erupt, but the thread
quickly petered out.</p>

</section>

</kc>

