nfs-bug when server for 9-Stable becomes client as well ?

Arno J. Klaassen-2

Hello,

It looks like I have discovered a probable bug in the NFS code, very
easy to reproduce in my setup:


   Machine-1 : Today's 9-stable, exporting /files (ufs) and /z2 (zfs)

   Machine-2 : 8-stable as of April the 10th exporting /raid1

On Machine-1 I mount /raid1 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
and start a script on this mount looping something like :

  dd if=/dev/random of=BIG bs=1048576 count=${SIZE}
  cp -fp BIG BIG2
  cmp -x BIG BIG2

I let this run for 24 hours (from time to time stressing Machine-1 with
other scripts, including provoking heavy swapping), no problem at all.
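
The loop described above might be scripted roughly as follows. This is a minimal sketch, not the actual script used here: the function names run_pass and stress_loop, the pass count, and the optional random-source argument are assumptions added for illustration, and plain cmp stands in for cmp -x so that it also runs where -x is not available.

```sh
#!/bin/sh
# Sketch of the write/copy/compare loop. All function and variable
# names are illustrative assumptions, not taken from the original script.

# One pass: write SIZE MiB of random data, copy it, compare both files.
# Returns 1, 2 or 3 depending on which step (dd, cp, cmp) failed.
run_pass() {
    workdir=$1
    size=$2
    src=${3:-/dev/random}
    dd if="$src" of="$workdir/BIG" bs=1048576 count="$size" 2>/dev/null || return 1
    cp -fp "$workdir/BIG" "$workdir/BIG2" || return 2
    cmp "$workdir/BIG" "$workdir/BIG2" || return 3
}

# Repeat run_pass up to PASSES times; stop at the first failure and
# report which pass and which step failed, with a timestamp.
stress_loop() {
    workdir=$1
    size=$2
    passes=$3
    src=${4:-/dev/random}
    i=1
    while [ "$i" -le "$passes" ]; do
        run_pass "$workdir" "$size" "$src"
        rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "pass $i failed at step $rc on $(date)" >&2
            return "$rc"
        fi
        i=$((i + 1))
    done
    return 0
}
```

For example, `stress_loop /mnt/raid1/test 1024 1000` would run 1000 passes of 1 GiB each; the mount point shown is hypothetical. Recording the time of the first failure makes it easier to correlate with a mount issued on the other machine.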

However, then I mount /z2 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
on Machine-2, and *immediately* the above loop on Machine-1 fails :

  Copying file ...cp: BIG: Permission denied

No console messages this time; last time I got

  kernel: nfs_getpages: error 13
  kernel: vm_fault: pager read error, pid 87803 (cmp)

on Machine-1.

I repeated this scenario by replacing Machine-2 with a good old
6-4-stable one, same outcome.

Please tell me what I could do to nail this down a bit more.

Thanx in advance,

Best, Arno

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[hidden email]"

Re: nfs-bug when server for 9-Stable becomes client as well ?

Vincent Hoffman-Kazlauskas
On 06/07/2012 14:19, Arno J. Klaassen wrote:

> Please tell me what I could do to nail this down a bit more.
It's possible (although not certain) that you have hit a mountd bug
as documented in PRs

kern/131342
kern/136865

I've recently asked on -CURRENT about this and had a patch to try from
Rick. I'm testing it now, but it doesn't seem to fix it for me, just
improve it, although I'm still trying to get enough runs for a valid sample
(see
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
).

What I did for my production NAS was edit mount.c so it didn't send a
SIGHUP to mountd, as suggested by Rick, as it was easy to do and
non-intrusive.

Vince


Re: nfs-bug when server for 9-Stable becomes client as well ?

Arno J. Klaassen-2

Vincent Hoffman <[hidden email]> writes:

> On 06/07/2012 14:19, Arno J. Klaassen wrote:
>> Please tell me what I could do to nail this down a bit more.
> It's possible (although not certain) that you have hit a mountd bug
> as documented in PRs
>
> kern/131342
> kern/136865

kern/131342 especially looks similar, and it is quite old; funny I never hit
this before, since I have basically run the same tests on each new box for ages.
It could be that a faster network/CPU exposes some race condition; I notice
as well that this server is the first (IIRC) that uses 3 different IRQs
for network interrupts (em(4) Intel(R) PRO/1000).

> I've recently asked on -CURRENT about this and had a patch to try from
> Rick. I'm testing it now, but it doesn't seem to fix it for me, just
> improve it, although I'm still trying to get enough runs for a valid sample
> (see
> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
> ).
>
> What I did for my production NAS was edit mount.c so it didn't send a
> SIGHUP to mountd, as suggested by Rick, as it was easy to do and
> non-intrusive.

Hmm, this means I should patch each FreeBSD client, no? It may be easier to
patch mountd to ignore SIGHUP and use some non-standard signal to force
a re-init?

Arno



Re: nfs-bug when server for 9-Stable becomes client as well ?

Vincent Hoffman-Kazlauskas
On 06/07/2012 18:51, Arno J. Klaassen wrote:

> Vincent Hoffman <[hidden email]> writes:
>
>> On 06/07/2012 14:19, Arno J. Klaassen wrote:
>>> Please tell me what I could do to nail this down a bit more.
>> It's possible (although not certain) that you have hit a mountd bug
>> as documented in PRs
>>
>> kern/131342
>> kern/136865
> kern/131342 especially looks similar, and it is quite old; funny I never hit
> this before, since I have basically run the same tests on each new box for ages.
> It could be that a faster network/CPU exposes some race condition; I notice
> as well that this server is the first (IIRC) that uses 3 different IRQs
> for network interrupts (em(4) Intel(R) PRO/1000).
Certainly possible and seems reasonable enough.

>
>> I've recently asked on -CURRENT about this and had a patch to try from
>> Rick. I'm testing it now, but it doesn't seem to fix it for me, just
>> improve it, although I'm still trying to get enough runs for a valid sample
>> (see
>> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
>> ).
>>
>> What I did for my production NAS was edit mount.c so it didn't send a
>> SIGHUP to mountd, as suggested by Rick, as it was easy to do and
>> non-intrusive.
> Hmm, this means I should patch each FreeBSD client, no? It may be easier to
> patch mountd to ignore SIGHUP and use some non-standard signal to force
> a re-init?
No, just patch /sbin/mount on the NFS server so it doesn't send the SIGHUP
to mountd.
You can manually HUP mountd if needed.
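
Sending the signal by hand could look like the sketch below. It assumes mountd records its pid in /var/run/mountd.pid; the function name hup_mountd is made up for illustration.

```sh
#!/bin/sh
# Ask a running mountd to re-read its exports by sending it SIGHUP.
# The pid-file path is an assumption and can be overridden as $1.
hup_mountd() {
    pidfile=${1:-/var/run/mountd.pid}
    if [ ! -r "$pidfile" ]; then
        echo "no readable pid file: $pidfile" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    kill -HUP "$pid"
}
```

Run as root on the NFS server after editing the exports or mounting a new filesystem that should be exported.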

Re: nfs-bug when server for 9-Stable becomes client as well ?

Arno J. Klaassen-2
Vincent Hoffman <[hidden email]> writes:

> On 06/07/2012 18:51, Arno J. Klaassen wrote:
>> Vincent Hoffman <[hidden email]> writes:
>>
>>> On 06/07/2012 14:19, Arno J. Klaassen wrote:
>>>> Please tell me what I could do to nail this down a bit more.
>>> It's possible (although not certain) that you have hit a mountd bug
>>> as documented in PRs
>>>
>>> kern/131342
>>> kern/136865
>> kern/131342 especially looks similar, and it is quite old; funny I never hit
>> this before, since I have basically run the same tests on each new box for ages.
>> It could be that a faster network/CPU exposes some race condition; I notice
>> as well that this server is the first (IIRC) that uses 3 different IRQs
>> for network interrupts (em(4) Intel(R) PRO/1000).
> Certainly possible and seems reasonable enough.

Just my $0.02: I glanced at kern/131342, and it looks like the culprit should be
something like a non-atomic operation between invalidating the old
/etc/exports and validating the new /etc/exports.
I wonder whether checking that /var/run/mountd.pid is newer than /etc/exports,
and skipping the reload if so, would be an acceptable band-aid (if
I understood correctly, a rewrite of mountd correcting this, amongst
other things, is close to hitting -current?).
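
The timestamp check proposed above can be sketched in shell as follows. It only illustrates the idea and is not how mountd or mount(8) behave today; the function name exports_changed is hypothetical, and it relies on the -nt file test, which FreeBSD's sh and most other shells provide.

```sh
#!/bin/sh
# Succeed (exit status 0) when a reload looks necessary: either the pid
# file is missing, or the exports file is newer than the pid file.
# Both paths are defaults that can be overridden as $1 and $2.
exports_changed() {
    exports=${1:-/etc/exports}
    pidfile=${2:-/var/run/mountd.pid}
    [ -e "$pidfile" ] || return 0
    [ "$exports" -nt "$pidfile" ]
}

# Possible use: only signal mountd when the exports really changed.
# if exports_changed; then kill -HUP "$(cat /var/run/mountd.pid)"; fi
```

Note that such a check would also skip the reload when a newly mounted filesystem is already listed in an unchanged /etc/exports, so it is a band-aid at best.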

>>> I've recently asked on -CURRENT about this and had a patch to try from
>>> Rick. I'm testing it now, but it doesn't seem to fix it for me, just
>>> improve it, although I'm still trying to get enough runs for a valid sample
>>> (see
>>> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
>>> ).
>>>
>>> What I did for my production NAS was edit mount.c so it didn't send a
>>> SIGHUP to mountd, as suggested by Rick, as it was easy to do and
>>> non-intrusive.
>> Hmm, this means I should patch each FreeBSD client, no? It may be easier to
>> patch mountd to ignore SIGHUP and use some non-standard signal to force
>> a re-init?
> No, just patch /sbin/mount on the NFS server so it doesn't send the SIGHUP
> to mountd.

[In my case] it's the mount on a client which causes the server to fail;
I don't see how patching /sbin/mount on the NFS server would fix this.
I don't remember whether it's possible to distinguish a `kill -1` (SIGHUP) sent
by a process from one sent from a terminal; if it is, then as another band-aid,
could one sent by a process simply be ignored?

Merci

Arno


--

  Arno J. Klaassen

  SCITO S.A.
  8 rue des Haies
  F-75020 Paris, France
  http://scito.com

Re: nfs-bug when server for 9-Stable becomes client as well ?

Vincent Hoffman-Kazlauskas
On 09/07/2012 23:00, Arno J. Klaassen wrote:

> Vincent Hoffman <[hidden email]> writes:
>
>> On 06/07/2012 18:51, Arno J. Klaassen wrote:
>>> Vincent Hoffman <[hidden email]> writes:
>>>
>>>> On 06/07/2012 14:19, Arno J. Klaassen wrote:
>>>>> Please tell me what I could do to nail this down a bit more.
>>>> It's possible (although not certain) that you have hit a mountd bug
>>>> as documented in PRs
>>>>
>>>> kern/131342
>>>> kern/136865
>>> kern/131342 especially looks similar, and it is quite old; funny I never hit
>>> this before, since I have basically run the same tests on each new box for ages.
>>> It could be that a faster network/CPU exposes some race condition; I notice
>>> as well that this server is the first (IIRC) that uses 3 different IRQs
>>> for network interrupts (em(4) Intel(R) PRO/1000).
>> Certainly possible and seems reasonable enough.
> Just my $0.02: I glanced at kern/131342, and it looks like the culprit should be
> something like a non-atomic operation between invalidating the old
> /etc/exports and validating the new /etc/exports.
> I wonder whether checking that /var/run/mountd.pid is newer than /etc/exports,
> and skipping the reload if so, would be an acceptable band-aid (if
> I understood correctly, a rewrite of mountd correcting this, amongst
> other things, is close to hitting -current?).
I don't know how close it (nfse) is to hitting -current. It certainly
looked good from my quick look over it, but there are a few minor
incompatibilities in the exports syntax, even in compatibility mode, that
seem to be stopping acceptance (I'm hoping the problem is a little more
complex, but that's all I understand it to be).
In the meantime I'm testing a second patch from Rick to see if that helps.

>
>>>> I've recently asked on -CURRENT about this and had a patch to try from
>>>> Rick. I'm testing it now, but it doesn't seem to fix it for me, just
>>>> improve it, although I'm still trying to get enough runs for a valid sample
>>>> (see
>>>> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
>>>> ).
>>>>
>>>> What I did for my production NAS was edit mount.c so it didn't send a
>>>> SIGHUP to mountd, as suggested by Rick, as it was easy to do and
>>>> non-intrusive.
>>> Hmm, this means I should patch each FreeBSD client, no? It may be easier to
>>> patch mountd to ignore SIGHUP and use some non-standard signal to force
>>> a re-init?
>> No, just patch /sbin/mount on the NFS server so it doesn't send the SIGHUP
>> to mountd.
> [In my case] it's the mount on a client which causes the server to fail;
> I don't see how patching /sbin/mount on the NFS server would fix this.
> I don't remember whether it's possible to distinguish a `kill -1` (SIGHUP) sent
> by a process from one sent from a terminal; if it is, then as another band-aid,
> could one sent by a process simply be ignored?
Your message above seemed to say that you were running the test on
Machine-1 on an export from Machine-2; you then mounted an export from
Machine-1 on Machine-2 (ran the mount command on Machine-2, the original
NFS server), which caused the test that Machine-1 was running to fail, as
Machine-2 sent a "permission denied".
If I understood this incorrectly, my guess at your problem could be
completely off track.

Vince
