|
Hello,

it looks like I discovered a probable bug in the NFS code, very
easy to reproduce in my setup:

 Machine-1 : today's 9-stable, exporting /files (ufs) and /z2 (zfs)

 Machine-2 : 8-stable as of April the 10th, exporting /raid1

On Machine-1 I mount /raid1 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
and start a script on this mount looping something like:

  dd if=/dev/random of=BIG bs=1048576 count=${SIZE}
  cp -fp BIG BIG2
  cmp -x BIG BIG2

I let this run for 24 hours (from time to time stressing Machine-1 with
other scripts, including provoking heavy swapping), with no problem at all.

However, I then mount /z2 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
on Machine-2, and *immediately* the above loop on Machine-1 fails:

  Copying file ...cp: BIG: Permission denied

No console messages this time; last time I got

  kernel: nfs_getpages: error 13
  kernel: vm_fault: pager read error, pid 87803 (cmp)

on Machine-1.

I repeated this scenario, replacing Machine-2 with a good old
6-4-stable one: same outcome.

Please tell me what I could do to nail this down a bit more.

Thanks in advance,

Best, Arno
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[hidden email]"
|
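The dd/cp/cmp loop from the report can be sketched as a small script that stops at the first failing step, which makes it easier to note exactly when the "Permission denied" appears. This is only a sketch under assumptions: the function names, the RANDOM_SRC variable and the pass counter are hypothetical and not part of the original script, and plain cmp is used where the report used cmp -x.

```shell
#!/bin/sh
# Sketch of the dd/cp/cmp stress loop from the report.
# RANDOM_SRC is a hypothetical knob; the report read from /dev/random.
RANDOM_SRC="${RANDOM_SRC:-/dev/random}"

# One pass: write SIZE MiB of random data, copy it, compare the copies.
# Returns 0 on success, or 1/2/3 for a failed dd/cp/cmp step.
nfs_stress_once() {
    dir=$1
    size=$2
    dd if="$RANDOM_SRC" of="$dir/BIG" bs=1048576 count="$size" 2>/dev/null || return 1
    cp -fp "$dir/BIG" "$dir/BIG2" || return 2
    # The report used "cmp -x" (hex output of differing bytes on FreeBSD);
    # plain cmp is used here so the sketch also runs elsewhere.
    cmp "$dir/BIG" "$dir/BIG2" || return 3
}

# Repeat passes until one fails; a count of 0 means loop forever.
nfs_stress_loop() {
    dir=$1
    size=$2
    count=${3:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        nfs_stress_once "$dir" "$size"
        rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "pass $i failed in step $rc at $(date)" >&2
            return "$rc"
        fi
    done
    echo "completed $i passes"
}
```

Run from the client, something like `nfs_stress_loop /raid1 1024` (mount point and size are placeholders) would loop until the first error and print the pass number and time of the failure.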
|
On 06/07/2012 14:19, Arno J. Klaassen wrote:
> However, I then mount /z2 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
> on Machine-2, and *immediately* the above loop on Machine-1 fails:
>
>   Copying file ...cp: BIG: Permission denied
>
> [...]
>
> Please tell me what I could do to nail this down a bit more.

It's possible (although not definite) that you have hit a mountd bug,
as documented in PRs

  kern/131342
  kern/136865

I recently asked on -CURRENT about this and got a patch to try from
Rick. I'm testing it now, but it doesn't seem to fix the problem for me,
only improve it, although I'm still trying to get enough runs for a valid
sample.
(see
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
)

What I did for my production NAS was edit mount.c so that it doesn't send
a SIGHUP to mountd, as suggested by Rick, since that was easy to do and
non-intrusive.

Vince
|
|
Vincent Hoffman <[hidden email]> writes:

> On 06/07/2012 14:19, Arno J. Klaassen wrote:
>> [...]
>> Please tell me what I could do to nail this down a bit more.
>
> It's possible (although not definite) that you have hit a mountd bug,
> as documented in PRs
>
>   kern/131342
>   kern/136865

kern/131342 especially looks similar, and it is quite old. Funny that I
never hit this before; I have been running basically the same tests for
ages on each new box. It could be that a faster network/CPU reveals some
race condition. I notice as well that this server is the first (IIRC)
that uses 3 different IRQs for network interrupts (em(4), Intel(R)
PRO/1000).

> I recently asked on -CURRENT about this and got a patch to try from
> Rick [...]
>
> What I did for my production NAS was edit mount.c so that it doesn't send
> a SIGHUP to mountd, as suggested by Rick, since that was easy to do and
> non-intrusive.

Hmm, this means I should patch each FreeBSD client, no? It may be easier
to patch mountd to ignore SIGHUP and use some non-standard signal to force
a re-init?

Arno
|
|
On 06/07/2012 18:51, Arno J. Klaassen wrote:
> kern/131342 especially looks similar, and it is quite old. Funny that I
> never hit this before; I have been running basically the same tests for
> ages on each new box. It could be that a faster network/CPU reveals some
> race condition. I notice as well that this server is the first (IIRC)
> that uses 3 different IRQs for network interrupts (em(4), Intel(R)
> PRO/1000).

Certainly possible and seems reasonable enough.

> Hmm, this means I should patch each FreeBSD client, no? It may be easier
> to patch mountd to ignore SIGHUP and use some non-standard signal to
> force a re-init?

No, just patch /sbin/mount on the NFS server so it doesn't send the SIGHUP
to mountd. You can manually HUP mountd if needed.
|
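Sending the HUP by hand, as suggested above, could look like the sketch below. The helper name hup_mountd is hypothetical; the default pid file path is the /var/run/mountd.pid mentioned elsewhere in this thread, and it can be overridden with an argument.

```shell
#!/bin/sh
# Send SIGHUP to the daemon whose pid is stored in the given pid file.
# Defaults to /var/run/mountd.pid; pass another path as $1 to override.
hup_mountd() {
    pidfile=${1:-/var/run/mountd.pid}
    [ -r "$pidfile" ] || { echo "cannot read $pidfile" >&2; return 1; }
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null || { echo "no process $pid" >&2; return 1; }
    kill -HUP "$pid"
}
```

On a server with the patched /sbin/mount, running `hup_mountd` as root after editing /etc/exports would then be the only thing that makes mountd re-read its exports.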
|
Vincent Hoffman <[hidden email]> writes:
> On 06/07/2012 18:51, Arno J. Klaassen wrote:
>> It could be that a faster network/CPU reveals some race condition. I
>> notice as well that this server is the first (IIRC) that uses 3
>> different IRQs for network interrupts (em(4), Intel(R) PRO/1000).
>
> Certainly possible and seems reasonable enough.

Just my $0.02: I glanced at kern/131342, and it looks like the culprit
should be something like a non-atomic operation in between invalidating
the old /etc/exports and validating the new /etc/exports.

I wonder whether just verifying that /var/run/mountd.pid is newer than
/etc/exports, and if so skipping that operation, would be an acceptable
band-aid (if I understood correctly, a rewrite of mountd correcting this,
amongst other things, is close to hitting -current (?)).

>> Hmm, this means I should patch each FreeBSD client, no? It may be easier
>> to patch mountd to ignore SIGHUP and use some non-standard signal to
>> force a re-init?
>
> No, just patch /sbin/mount on the NFS server so it doesn't send the
> SIGHUP to mountd.

[In my case] it's the mount on a client which causes the server to fail;
I don't see how patching /sbin/mount on the NFS server would fix this.

I don't remember whether it's possible to distinguish a -1 signal
(SIGHUP) sent by a process from one sent from a terminal. If it is,
another band-aid would be to ignore the ones sent by a process altogether?

Merci

Arno

> You can manually HUP mountd if needed.

--

  Arno J. Klaassen

  SCITO S.A.
  8 rue des Haies
  F-75020 Paris, France
  http://scito.com
|
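The band-aid proposed above (skip the reload when /var/run/mountd.pid is newer than /etc/exports) boils down to a single mtime comparison. The sketch below shows only that comparison in shell; the function name exports_reload_needed is hypothetical, and an actual fix would have to live inside mountd itself.

```shell
#!/bin/sh
# Return 0 (reload needed) unless the pid file is strictly newer than the
# exports file; return 1 (safe to skip the reload) in that case.
# Both paths are parameters; the defaults are the ones named in the thread.
exports_reload_needed() {
    pidfile=${1:-/var/run/mountd.pid}
    exports=${2:-/etc/exports}
    # If either file is missing, be conservative and ask for a reload.
    [ -e "$pidfile" ] && [ -e "$exports" ] || return 0
    # find -newer prints the pid file only if its mtime is later than
    # that of the exports file.
    if [ -n "$(find "$pidfile" -newer "$exports" 2>/dev/null)" ]; then
        return 1
    fi
    return 0
}
```

One caveat, assuming the pid file is only written when mountd starts: once /etc/exports has been edited after startup, the pid file stays older and every later SIGHUP would trigger a full reload again, so the check would only help until the first edit.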
|
On 09/07/2012 23:00, Arno J. Klaassen wrote:
> I wonder whether just verifying that /var/run/mountd.pid is newer than
> /etc/exports, and if so skipping that operation, would be an acceptable
> band-aid (if I understood correctly, a rewrite of mountd correcting this,
> amongst other things, is close to hitting -current (?)).

[...] looked good from my quick look over, but there are a few minor
incompatibilities in the exports syntax, even in compatibility mode, that
seem to be stopping acceptance (I'm hoping the problem is a little more
complex, but that's all I understand it to be). In the meantime I'm
testing a second patch from Rick to see if that helps.

> [In my case] it's the mount on a client which causes the server to fail;
> I don't see how patching /sbin/mount on the NFS server would fix this.

[...] machine-1 on an export from machine-2, you then mounted an export
from machine-1 on machine-2 (ran the mount command on machine-2, the
original NFS server), which caused the test machine-1 was running to fail,
as machine-2 sent a "permission denied".
If I understood this incorrectly, my guess at your problem could be
completely off track.

Vince
|
