
mmap() incoherency on hi I/O load (FS is zfs)


Pavlo-2



There is a case where some parts of files that are mapped and then
modified get corrupted. By corrupted I mean that some data is OK (the part
written using write()/pwrite()), but some looks as if it never existed.
It seems the data sat in buffers for a while: several processes used the
shared pages simultaneously (access was synchronised, of course) and
reported that the data existed. But after some time passed, the processes
complained that it was lost. Only the part of the data written with
pwrite() was there. Everything that was written via mmap() was zero.

As I said, it occurs under high I/O load, when 4+ background processes
are indexing a huge amount of data. I also want to note that it never
occurred in the life of our project while we used mmap() under the same
I/O stress conditions, when the mapping covered a whole file or just a
part (the header) starting from the beginning of the file. This is the
first time we have mapped individual pages, just to save RAM, and this
popped up.

The workaround for this problem is to call msync() before any munmap().
But the man page says:

The msync() system call is usually not needed since BSD implements a
coherent file system buffer cache.  However, it may be used to associate
dirty VM pages with file system buffers and thus cause them to be flushed
to physical media sooner rather than later.

Any thoughts? Thanks.
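
For readers following along, here is a minimal sketch of the workaround described above: map only the page that contains the target offset, modify it, and call msync() before munmap(). The helper name update_page() and its one-page limit are assumptions made for illustration; this is not the poster's code, and it assumes the file already extends over the page being mapped.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes from `src` to `offset` in the file
 * at `path` through a shared mapping of the single page containing `offset`.
 * The file must already cover that page. Returns 0 on success, -1 on error.
 */
int update_page(const char *path, off_t offset, const void *src, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (offset < 0)
        return -1;

    off_t base = offset - (offset % page);      /* page-aligned file offset */
    size_t delta = (size_t)(offset - base);     /* position inside the page */
    if (delta + len > (size_t)page)
        return -1;                              /* sketch handles one page only */

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, base);
    close(fd);                                  /* the mapping outlives the fd */
    if (p == MAP_FAILED)
        return -1;

    memcpy(p + delta, src, len);                /* modify through the mapping */

    /* The workaround from the post: flush explicitly before unmapping. */
    int rc = msync(p, (size_t)page, MS_SYNC);
    if (munmap(p, (size_t)page) != 0)
        rc = -1;
    return rc;
}
```

MS_SYNC makes msync() wait until the write-back has completed; whether the explicit flush should be necessary at all is exactly the question raised in this thread.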

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "[hidden email]"
Re: mmap() incoherency on hi I/O load (FS is zfs)

Rick Macklem
With a recent kernel from head, I am seeing dirty mmap'd pages being written
quite late for the NFSv4 client. Even after the NFS client VOP_RECLAIM() has
been called, it seems. I didn't observe this behaviour in a kernel from
head in March. (I don't know enough about the vm/mmap area to know if this
is correct behaviour or not?)

I thought I'd mention this, since you didn't say how recent a kernel you
were running and thought it might be caused by the same change?

Sorry I can't help more, rick
Re: mmap() incoherency on hi I/O load (FS is zfs)

Pavlo-2




Thanks for the reply, Rick!

Yes, we have a pretty old kernel:

# uname -a
FreeBSD mpop-zebra-k1.ukr.net 8.2-STABLE FreeBSD 8.2-STABLE #9: Wed Jan
25 11:28:55 EET 2012

I just posted my observation here to point out a possible problem that
could still exist.
Re: mmap() incoherency on hi I/O load (FS is zfs)

Konstantin Belousov
In reply to this post by Rick Macklem
On Thu, Jun 14, 2012 at 07:32:36AM -0400, Rick Macklem wrote:

> With a recent kernel from head, I am seeing dirty mmap'd pages being written
> quite late for the NFSv4 client. Even after the NFS client VOP_RECLAIM() has
> been called, it seems. I didn't observe this behaviour in a kernel from
> head in March. (I don't know enough about the vm/mmap area to know if this
> is correct behaviour or not?)
>
> I thought I'd mention this, since you didn't say how recent a kernel you
> were running and thought it might be caused by the same change?
Can you please comment more on this?
How is this possible at all?

Could you please show at least a backtrace for the moment when a write
request is made for a page that belongs to an already reclaimed vnode?

Re: mmap() incoherency on hi I/O load (FS is zfs)

Rick Macklem
Kostik wrote:

> Can you please comment more on this?
> How is this possible at all?
>
> Could you please show at least a backtrace for the moment when a write
> request is made for a page that belongs to an already reclaimed vnode?
After some off list discussion, it was determined that my problem was
doing nfsrpc_close() before vnode_destroy_object() in the NFSv4 client's
VOP_RECLAIM(). This is an NFSv4 specific bug and wouldn't be related to
the above issue.

Sorry about the noise, rick
Re: mmap() incoherency on hi I/O load (FS is zfs)

Pavlo-2
In reply to this post by Pavlo-2



So I tracked the issue down to the place where it occurs. When I commit
data to a file using mmap() and pwrite() side by side, the newest data is
sometimes overwritten by older data. The older data can be something
written either with mmap() or with pwrite(). It never happens when I use
mmap() or pwrite() exclusively. The issue also reproduces on UFS. I think
there is a problem keeping mmapped pages and the FS cache in sync.

I will try to make a test that reliably reproduces the issue.


Re: mmap() incoherency on hi I/O load (FS is zfs)

Konstantin Belousov
On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote:

> So I tracked the issue down to the place where it occurs. When I commit
> data to a file using mmap() and pwrite() side by side, the newest data is
> sometimes overwritten by older data. The older data can be something
> written either with mmap() or with pwrite(). It never happens when I use
> mmap() or pwrite() exclusively. The issue also reproduces on UFS. I think
> there is a problem keeping mmapped pages and the FS cache in sync.
I am curious how you label data as newer or older.

I do admit the possibility of a race in the ZFS double-copy implementation
of mmap/cache coherency, but I am somewhat skeptical about the same
possibility for UFS. What you are saying might indicate that we lose the
modified/dirty bits for the page, but that would cause far more fireworks
than just an occasional race with write.

What version of the system? Does the machine swap?

>
> I will try to make a test that reliably reproduces the issue.
Yes, an isolated test case is the best route forward. It would either show
a bug or demonstrate a misunderstanding on your part.


Re: mmap() incoherency on hi I/O load (FS is zfs)

Pavlo-2

 

> I am curious how you label data as newer or older.

I have a list header like:

struct XXX
{
    uint32_t alloc_size;
    uint32_t list_size;
    node_t   node[1];
};

First I initialise it with pwrite(), setting for example alloc_size to 10 and everything else to 0.

Then I add elements through mmap():

1. Workers log that the elements exist...
2. Workers log that the elements exist...
... same thing for a few seconds.
X. One of the workers complains that the list is empty.

Then I inspect the core file and see that the list looks as if it had just been initialised with pwrite(), i.e. alloc_size equals 10 and everything else is 0.
This is hard to reproduce because it happens only under really high I/O load. And out of tens of thousands of such files, only a couple get corrupted.
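
The access pattern just described (initialise with pwrite(), append through a shared mapping, then check what is in the file) could be sketched roughly as follows. The node_t layout and the function names list_create(), list_append() and list_read_header() are assumptions for illustration only; the real structures are not shown in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct { uint32_t value; } node_t;      /* assumed node layout */

struct list_hdr {                               /* "struct XXX" from the post */
    uint32_t alloc_size;
    uint32_t list_size;
    node_t   node[1];                           /* really alloc_size entries */
};

static size_t list_bytes(uint32_t alloc)
{
    return offsetof(struct list_hdr, node) + (size_t)alloc * sizeof(node_t);
}

/* Step 1: create the file and initialise the list with pwrite().
 * Returns an open descriptor, or -1 on error. */
int list_create(const char *path, uint32_t alloc)
{
    assert(alloc > 0);
    size_t n = list_bytes(alloc);
    struct list_hdr *h = calloc(1, n);          /* everything else stays 0 */
    if (h == NULL)
        return -1;
    h->alloc_size = alloc;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd >= 0 && pwrite(fd, h, n, 0) != (ssize_t)n) {
        close(fd);
        fd = -1;
    }
    free(h);
    return fd;
}

/* Step 2: append one element through a shared mapping of the file. */
int list_append(int fd, uint32_t value)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || (size_t)st.st_size < list_bytes(1))
        return -1;
    size_t len = (size_t)st.st_size;

    struct list_hdr *h = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (h == MAP_FAILED)
        return -1;

    int rc = -1;
    if (h->list_size < h->alloc_size && list_bytes(h->alloc_size) <= len) {
        h->node[h->list_size].value = value;
        h->list_size++;
        rc = 0;
    }
    if (munmap(h, len) != 0)
        rc = -1;
    return rc;
}

/* Step 3: read the header back with pread(), as a checking worker might. */
int list_read_header(int fd, uint32_t *alloc, uint32_t *size)
{
    uint32_t hdr[2];
    if (pread(fd, hdr, sizeof hdr, 0) != (ssize_t)sizeof hdr)
        return -1;
    *alloc = hdr[0];
    *size = hdr[1];
    return 0;
}
```

On a coherent system, list_read_header() should report the list_size stored through the mapping. The corruption described above corresponds to it reporting 0 while alloc_size is still 10.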

>
> I do admit the possibility of a race in the ZFS double-copy implementation
> of mmap/cache coherency, but I am somewhat skeptical about the same
> possibility for UFS. What you are saying might indicate that we lose the
> modified/dirty bits for the page, but that would cause far more fireworks
> than just an occasional race with write.
>
> What version of the system? Does the machine swap?

OK, after msync() helped but did not fix the issue (it only reduced how often it occurs), I did the following:
I tracked modification of the mmapped pages using mprotect(). At the end of a session, before munmap(), I saved the modified pages, then called munmap(), then wrote those pages back to disk.

Later a worker accessed those pages again with mmap(), modified them, and for some parts of those pages did a read() instead of accessing them via mmap(). What read() returned was the data committed in the previous session with write(), not the data that had just been modified by the same process via mmap(). We reproduce this again and again on UFS on FreeBSD, and only under high I/O load. We could never reproduce it on Linux (ext4), though.

>
> >
> > I will try to make a test that reliably reproduces the issue.
> Yes, an isolated test case is the best route forward. It would either show
> a bug or demonstrate a misunderstanding on your part.

I am trying, but it is really hard to make an example that reproduces this issue.

Thanks for the reply.
Re: mmap() incoherency on hi I/O load (FS is zfs)

Pavlo-2


> > What version of the system ? Does the machine swap ?

I forgot to give the system details:

uname -a
FreeBSD zfs1.dev.ukr.net 8.2-STABLE FreeBSD 8.2-STABLE #7: Wed Aug  3 11:41:58 EEST 2011     [hidden email]:/usr/obj/usr/src/sys/DEV  i386

Swap is turned off, for known reasons.

Re: mmap() incoherency on hi I/O load (FS is zfs)

Konstantin Belousov
In reply to this post by Pavlo-2
On Wed, Jul 04, 2012 at 12:25:55PM +0300, Pavlo wrote:

> > What version of the system ? Does the machine swap ?
You just ignored these ^^^^^^^^^^^^ questions.

>
> OK, after msync() helped but did not fix the issue (it only reduced how often it occurs), I did the following:
> I tracked modification of the mmapped pages using mprotect(). At the end of a session, before munmap(), I saved the modified pages, then called munmap(), then wrote those pages back to disk.
>
> Later a worker accessed those pages again with mmap(), modified them, and for some parts of those pages did a read() instead of accessing them via mmap(). What read() returned was the data committed in the previous session with write(), not the data that had just been modified by the same process via mmap(). We reproduce this again and again on UFS on FreeBSD, and only under high I/O load. We could never reproduce it on Linux (ext4), though.
>
So you are saying that the following sequence:
        1. write at offset X
        2. write into the shared mapping of the same file at offset X
        3. read at offset X
performed by a single thread can return the data from point (1) instead of
the data from point (2)?

Knowing how write is implemented for UFS, I find this quite impossible.

If the actions are executed in different processes/threads, say
process 1 executes (1, 2) and process 2 executes (3), or process 1
executes (1) and process 2 executes (2, 3), then my first guess would
be a lack of proper synchronization between the actions. This would indeed
make possible exactly the outcome I described.
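
For reference, the single-threaded sequence above can be written down as a small check. This is only a sketch under stated assumptions: the function name, the scratch-file path and the payload strings are made up for illustration, and pread()/pwrite() stand in for read()/write() so that no seek is needed. It is not the poster's code and does not by itself reproduce the reported problem.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Runs steps 1-3 in a single thread on a scratch file.
 * `path` and `page_index` are hypothetical parameters for this sketch.
 * Returns 1 if pread() sees the store made through the mapping,
 * 0 if it sees the older pwrite() data, -1 on a setup error.
 */
int check_write_mmap_read(const char *path, off_t page_index)
{
    static const char old_data[] = "written-by-pwrite";
    static const char new_data[] = "stored-via-mmap!!";
    char buf[sizeof(new_data)];
    char *p = MAP_FAILED;
    int result = -1;
    long pagesz = sysconf(_SC_PAGESIZE);
    off_t off;
    int fd;

    assert(pagesz > 0 && page_index >= 0);
    off = page_index * (off_t)pagesz;   /* offset X, page aligned */

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* Make the file cover the whole page that will be mapped. */
    if (ftruncate(fd, off + pagesz) != 0)
        goto out;

    /* Step 1: write at offset X. */
    if (pwrite(fd, old_data, sizeof(old_data), off) != (ssize_t)sizeof(old_data))
        goto out;

    /* Step 2: write into a shared mapping of the same file at offset X. */
    p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    if (p == MAP_FAILED)
        goto out;
    memcpy(p, new_data, sizeof(new_data));

    /* Step 3: read at offset X while the mapping is still in place. */
    if (pread(fd, buf, sizeof(buf), off) == (ssize_t)sizeof(buf))
        result = (memcmp(buf, new_data, sizeof(new_data)) == 0);

out:
    if (p != MAP_FAILED)
        munmap(p, (size_t)pagesz);
    close(fd);
    unlink(path);
    return result;
}
```

Calling it in a loop over many files and page indexes while the machine is under I/O load would be one way to turn it into a stress test; a return value of 0 is the outcome being questioned here.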
> >
> > >
> > > I will try to make a test that reliably reproduces the issue.
> > Yes, an isolated test case is the best route forward. It would either show
> > a bug or demonstrate a misunderstanding on your part.
>
> I am trying, but it's really hard to make an example that reproduces this issue.
This seems to be the only way forward, at least for you.
And do answer the version/swap question.

>
> Thanks for reply.


Re: mmap() incoherency on hi I/O load (FS is zfs)

Pavlo-2

> On Wed, Jul 04, 2012 at 12:25:55PM +0300, Pavlo wrote:
> >
> >  
> > > On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote:
> > > >
> > > >
> > > > > There's a case when some parts of files that are mapped and then
> > > > modified get corrupted. By corrupted I mean some data is ok (the data that
> > > > was written using write()/pwrite()) but some looks like it never existed.
> > > > It was in buffers for some time, while several processes simultaneously
> > > > (of course access was synchronised) used the shared pages and reported its
> > > > existence. But after some time passed they (the processes) screamed that it was now
> > > > lost. Only the part of the data written with pwrite() was there. Everything that
> > > > was written via mmap() is zero.
> > > > >
> > > > > > So as I said, it occurs under high I/O load, when 4+ background
> > > > processes are indexing a huge amount of data. Also I want to note that it
> > > > never occurred in the life of our project while we used mmap() under the
> > > > same I/O stress conditions when mapping was done for a whole file or just
> > > > a part (header) starting from the beginning of a file. The first time we used
> > > > mapping of individual pages, just to save RAM, this popped up.
> > > > >
> > > > > Solution for this problem is msync() before any munmap(). But man says:
> > > > >
> > > > >
> > > >
> > > > The msync() system call is usually not needed since BSD implements a
> > > > coherent file system buffer cache.  However, it may be used to associate
> > > > dirty VM pages with file system buffers and thus cause them to be flushed
> > > > to physical media sooner rather than later.
> > > > >
> > > > > Any thoughts? Thanks.
> > > > >
> > > > >
> > > >
> > > > So I tracked the issue to the place where it occurs. When I commit data to
> > > > a file using mmap() and pwrite() side by side, sometimes the 'newest data' is
> > > > overwritten by 'older data'. The 'older data' can be
> > > > something written either with mmap() or with pwrite(). It never happens when
> > > > I use exclusively mmap() or exclusively pwrite(). Also this issue reproduces on
> > > > UFS as well. I think there is a problem keeping mmaped pages and the FS cache
> > > > synced.
> > > I am curious how you label data as newer or older.
> >
> > I have a list header like:
> >
> > struct XXX
> > {
> >     uint32_t alloc_size;
> >     uint32_t list_size;
> >     node_t   node[1];
> > };
> >
> > First I initialise it with pwrite(), setting for example alloc_size to 10 and everything else to 0.
> >
> > Then I add elements through the mmap()ed page.
> >
> > 1. Workers log that the elements exist...
> > 2. Workers log that the elements exist...
> > ... same thing for a few seconds.
> > X. One of the workers cries that the list is empty.
> >
> > Then I inspect the core file and see that the list looks as if it had just been initialised with pwrite(), i.e. alloc_size equals 10 and everything else is 0.
> > It is hard to reproduce because it happens only under really high I/O load, and out of tens of thousands of such files only a couple get corrupted.
> >
> > >
> > > I do admit the possibility of a race in the ZFS double-copy implementation of
> > > mmap/cache coherency, but I am somewhat skeptical about the same possibility
> > > for UFS. What you are saying might indicate that we lose modified/dirty bits
> > > for the page, but that would cause much more fireworks than just an occasional
> > > race with write.
> > >
> > > What version of the system ? Does the machine swap ?
> You just ignored these ^^^^^^^^^^^^ questions.

Sorry, I forgot to answer. I did in the next reply, but I'll repeat it anyway:


uname -a
FreeBSD zfs1.dev.ukr.net 8.2-STABLE FreeBSD 8.2-STABLE #7: Wed Aug  3 11:41:58 EEST 2011     [hidden email]:/usr/obj/usr/src/sys/DEV  i386

Swap is turned off, for known reasons.

Also, maybe I confused you by mixing different cases. The list header problem _does_not_reproduce_on_UFS_, only on ZFS.
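
To make the list-header case concrete, here is a minimal sketch of it: initialise the header with pwrite(), append through a shared mapping, msync() before munmap() (the workaround mentioned at the start of the thread), then read the header back. The real node_t is not shown in this thread, so the uint32_t typedef, the struct name list_hdr, the function name and the file path are all assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

typedef uint32_t node_t;        /* assumption: the real node_t is not shown */

struct list_hdr {               /* same layout as struct XXX in the thread */
    uint32_t alloc_size;
    uint32_t list_size;
    node_t   node[1];
};

/*
 * Hypothetical helper: returns the list_size read back with pread()
 * after `count` elements were appended through mmap(), or -1 on error.
 */
int list_roundtrip(const char *path, uint32_t alloc, uint32_t count)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t pagesz, bytes, maplen;
    struct list_hdr *h, on_disk;
    node_t *nodes;
    char *init = NULL;
    int fd, result = -1;

    assert(ps > 0);
    if (alloc == 0 || count > alloc)
        return -1;
    pagesz = (size_t)ps;
    bytes = offsetof(struct list_hdr, node) + (size_t)alloc * sizeof(node_t);
    maplen = (bytes + pagesz - 1) / pagesz * pagesz;   /* round up to pages */

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Initialise with pwrite(): alloc_size set, everything else zero. */
    init = calloc(1, bytes);
    if (init == NULL)
        goto out;
    ((struct list_hdr *)init)->alloc_size = alloc;
    if (pwrite(fd, init, bytes, 0) != (ssize_t)bytes)
        goto out;

    /* Add elements through a shared mapping of the same pages. */
    h = mmap(NULL, maplen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (h == MAP_FAILED)
        goto out;
    nodes = (node_t *)((char *)h + offsetof(struct list_hdr, node));
    for (uint32_t i = 0; i < count; i++) {
        nodes[i] = i + 1;
        h->list_size++;
    }

    /* Workaround discussed in the thread: msync() before munmap(). */
    if (msync(h, maplen, MS_SYNC) != 0) {
        munmap(h, maplen);
        goto out;
    }
    munmap(h, maplen);

    /* Read the header back through the file descriptor. */
    if (pread(fd, &on_disk, sizeof(on_disk), 0) == (ssize_t)sizeof(on_disk) &&
        on_disk.alloc_size == alloc)
        result = (int)on_disk.list_size;

out:
    free(init);
    close(fd);
    unlink(path);
    return result;
}
```

The failure described above would show up as this returning 0 (the state right after the pwrite() initialisation) although count was greater than 0.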

>
> >
> > Okay, after msync() helped but didn't fix the issue (it just reduced the occurrence), I did the next thing:
> > I tracked modification of mmaped pages using mprotect(). At the end of the session, before munmap(), I saved the modified pages, then called munmap(), then wrote those pages back to disk.
> >
> > Later a worker accessed those pages again with mmap(), modified them, and for some parts of those pages did read() instead of accessing them via mmap(). What read() returned was the data committed in the previous session with write(), not the data that had just been modified by the same process via mmap(). We reproduce this again and again on UFS on FreeBSD, and only under high I/O load. We could never reproduce it on Linux (ext4).
> >
> So you are saying that the following sequence:
> 1. write at offset X
> 2. write into the shared mapping of the same file at offset X
> 3. read at offset X
> performed by a single thread can return the data from point (1) instead of
> the data from point (2)?
>
> Knowing how write is implemented for UFS, I find this quite impossible.
>
> If the actions are executed in different processes/threads, say
> process 1 executes (1, 2) and process 2 executes (3), or process 1
> executes (1) and process 2 executes (2, 3), then my first guess would
> be a lack of proper synchronization between the actions. This would indeed
> make possible exactly the outcome I described.

This was tested _ONLY_ on UFS.

Process 1:

1. Write at offset X with mmap();
2. Commit that page again after munmap() with write().

Later, process 2:

1. Read at offset X with mmap();
2. Write at offset X with mmap();
3. Read at offset X with read() and see data written by process 1 in (2).

All operations are guarded by a lock. It never reproduces on Linux.
When I remove step (2) for process 1, it never reproduces on UFS but does on ZFS (as I wrote before).
Of course, the mistake may be mine. But the same things done exclusively via mmap() or exclusively via read/write never break the file.
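
A rough sketch of the two-process sequence above, as a starting point for an isolated test. Assumptions, all mine: fork() plus waitpid() replaces the lock (process 1 finishes completely before process 2 starts), the parent plays process 2, pread()/pwrite() are used instead of read()/write(), and the function names, path and payload strings are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MSG_LEN 16
static const char MSG1[MSG_LEN] = "from-process-1";
static const char MSG2[MSG_LEN] = "from-process-2";

/* Process 1: store via mmap(), munmap(), then write() the saved page again. */
static int run_process1(int fd, off_t off, size_t pagesz)
{
    char *copy = malloc(pagesz);
    char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);

    if (copy == NULL || p == MAP_FAILED)
        return -1;
    memcpy(p, MSG1, MSG_LEN);       /* step 1: write at X with mmap() */
    memcpy(copy, p, pagesz);        /* save the modified page */
    munmap(p, pagesz);
    /* step 2: commit that page again with write() after munmap() */
    if (pwrite(fd, copy, pagesz, off) != (ssize_t)pagesz)
        return -1;
    free(copy);
    return 0;
}

/*
 * Returns 1 if process 2 sees process 1's data through the mapping and
 * pread() then returns process 2's own mmap() store, 0 if either view
 * is stale, -1 on a setup error.
 */
int check_two_process_sequence(const char *path)
{
    long ps = sysconf(_SC_PAGESIZE);
    char seen[MSG_LEN], buf[MSG_LEN];
    size_t pagesz;
    off_t off;
    char *p;
    pid_t pid;
    int fd, status, result = -1;

    assert(ps > 0);
    pagesz = (size_t)ps;
    off = (off_t)pagesz;            /* X = start of the second page */

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, off + (off_t)pagesz) != 0)
        goto out;

    pid = fork();
    if (pid == 0)
        _exit(run_process1(fd, off, pagesz) == 0 ? 0 : 1);
    if (pid < 0 || waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        goto out;

    /* Process 2 (the parent here). */
    p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    if (p == MAP_FAILED)
        goto out;
    memcpy(seen, p, MSG_LEN);       /* step 1: read at X with mmap() */
    memcpy(p, MSG2, MSG_LEN);       /* step 2: write at X with mmap() */
    if (pread(fd, buf, MSG_LEN, off) == MSG_LEN)    /* step 3: read() at X */
        result = memcmp(seen, MSG1, MSG_LEN) == 0 &&
                 memcmp(buf, MSG2, MSG_LEN) == 0;
    munmap(p, pagesz);

out:
    close(fd);
    unlink(path);
    return result;
}
```

The behaviour I described corresponds to step 3 returning MSG1 instead of MSG2, i.e. a return value of 0. A single run like this will most likely not trigger it, so it would have to be looped over many files under heavy I/O.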

> > >
> > > >
> > > > I will try to make a test that reliably reproduces the issue.
> > > Yes, an isolated test case is the best route forward. It would either show
> > > a bug or demonstrate a misunderstanding on your part.
> >
> > I am trying, but it's really hard to make an example that reproduces this issue.
> This seems to be the only way forward, at least for you.
> And do answer the version/swap question.
>

Roger that. Thanks for the reply.