|
There is a case where parts of files that are mapped and then modified get corrupted. By corrupted I mean that some data is fine (the data written using write()/pwrite()), but some looks as if it never existed. It seems to have lived in buffers for a while: several processes simultaneously used the shared pages (access was synchronised, of course) and reported that the data existed. After some time had passed, the processes complained that it was lost. Only the part of the data written with pwrite() was there. Everything that was written via mmap() was zero.

It occurs under high I/O load, when 4+ background processes are indexing a huge amount of data. I also want to note that it never occurred in the life of our project while we used mmap() under the same I/O stress conditions, when the mapping covered a whole file or just a part (the header) starting from the beginning of a file. This was the first time we mapped individual pages, just to save RAM, and this popped up.

The solution for this problem is msync() before any munmap(). But the man page says:

    The msync() system call is usually not needed since BSD implements a
    coherent file system buffer cache. However, it may be used to associate
    dirty VM pages with file system buffers and thus cause them to be
    flushed to physical media sooner rather than later.

Any thoughts? Thanks.

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "[hidden email]"
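For readers who want to see the workaround concretely, here is a minimal sketch of "msync() before any munmap()" for a single mapped page. The poster did not share code, so the helper names `update_page_synced` and `msync_demo`, the temporary file path, and the choice of MS_SYNC are assumptions made for illustration only.

```c
#define _DEFAULT_SOURCE         /* glibc: expose mkstemp/pread/ftruncate */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the single page at page_off, copy len bytes
 * into it, then msync() the page before munmap(). Returns 0 on success.
 */
static int
update_page_synced(int fd, off_t page_off, size_t off_in_page,
    const void *src, size_t len)
{
    long psz = sysconf(_SC_PAGESIZE);
    char *p;
    int rc;

    if (psz <= 0 || page_off % psz != 0 || off_in_page + len > (size_t)psz)
        return -1;
    p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE, MAP_SHARED,
        fd, page_off);
    if (p == MAP_FAILED)
        return -1;
    memcpy(p + off_in_page, src, len);
    rc = msync(p, (size_t)psz, MS_SYNC);    /* flush before unmapping */
    if (munmap(p, (size_t)psz) != 0)
        rc = -1;
    return rc;
}

/* Returns 0 if pread() sees what was stored through the mapping. */
static int
msync_demo(void)
{
    char path[] = "/tmp/msync_demo.XXXXXX";
    const char msg[] = "list node";
    char buf[sizeof(msg)];
    long psz = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    int rc = -1;

    if (fd < 0)
        return -1;
    unlink(path);
    if (psz > 0 && ftruncate(fd, 2 * psz) == 0 &&
        update_page_synced(fd, psz, 64, msg, sizeof(msg)) == 0 &&
        pread(fd, buf, sizeof(buf), psz + 64) == (ssize_t)sizeof(buf))
        rc = memcmp(buf, msg, sizeof(msg)) != 0;
    close(fd);
    return rc;
}
```

Calling `msync_demo()` maps only the second page of a two-page file, which mirrors the "mapping of individual pages" setup described in the post. Whether MS_SYNC or MS_ASYNC matches what the poster used is not stated in the thread.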
|
Pavlo wrote:
> Solution for this problem is msync() before any munmap().
> [...]
> Any thoughts? Thanks.

With a recent kernel from head, I am seeing dirty mmap'd pages being written quite late for the NFSv4 client. Even after the NFS client VOP_RECLAIM() has been called, it seems. I didn't observe this behaviour in a kernel from head in March. (I don't know enough about the vm/mmap area to know if this is correct behaviour or not.)

I thought I'd mention this, since you didn't say how recent a kernel you were running, and I thought it might be caused by the same change.

Sorry I can't help more, rick
|
Rick Macklem wrote:
> With a recent kernel from head, I am seeing dirty mmap'd pages being
> written quite late for the NFSv4 client. [...]
>
> I thought I'd mention this, since you didn't say how recent a kernel you
> were running and thought it might be caused by the same change?

Thanks for the reply, Rick! Yes, we have a pretty old kernel:

# uname -a
FreeBSD mpop-zebra-k1.ukr.net 8.2-STABLE FreeBSD 8.2-STABLE #9: Wed Jan 25 11:28:55 EET 2012

I just posted my observation here to point out a possible problem that could still exist.
|
In reply to this post by Rick Macklem
On Thu, Jun 14, 2012 at 07:32:36AM -0400, Rick Macklem wrote:
> With a recent kernel from head, I am seeing dirty mmap'd pages being written
> quite late for the NFSv4 client. Even after the NFS client VOP_RECLAIM() has
> been called, it seems. I didn't observe this behaviour in a kernel from
> head in March. (I don't know enough about the vm/mmap area to know if this
> is correct behaviour or not?)
>
> I thought I'd mention this, since you didn't say how recent a kernel you
> were running and thought it might be caused by the same change?

Can you please comment more on this? How is this possible at all?

Could you please show at least a backtrace for the moment when a write request is made for a page that belongs to an already reclaimed vnode?
|
Kostik wrote:
> Can you, please, comment more on this ?
> How is this possible at all ?
>
> Could you please show at least a backtrace for the moment when a write
> request is made for the page which belong to already reclaimed vnode ?

[...] doing nfsrpc_close() before vnode_destroy_object() in the NFSv4 client's VOP_RECLAIM(). This is an NFSv4-specific bug and wouldn't be related to the above issue.

Sorry about the noise, rick
|
In reply to this post by Pavlo-2
--- Original message ---
From: "Pavlo" <[hidden email]>
To: [hidden email]
Date: 14 June 2012, 13:30:20
Subject: mmap() incoherency on hi I/O load (FS is zfs)

> Solution for this problem is msync() before any munmap().
> [...]
> Any thoughts? Thanks.

So I tracked the issue to the place where it occurs. When I commit data to a file using mmap() and pwrite() side by side, sometimes the 'newest data' is overwritten by 'older data'. The 'older data' can be something written either with mmap() or with pwrite(). It never happens when I use exclusively mmap() or exclusively pwrite(). This issue also reproduces on UFS. I think there is a problem keeping mmapped pages and the FS cache in sync.

I will try to make a test that reliably reproduces the issue.
|
On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote:
> So I tracked issue to the place where it occurs. When I commit data to
> file using mmap() and pwrite() side by side, sometimes 'newest data' is
> being overwritten by 'elder data'. From time to time 'elder data' can be
> something written with mmap() either with pwrite(). It never happens when
> I use exclusively mmap() either pwrite(). Also this issue reproduces on
> UFS as well. I think there is a problem keeping mmapped pages and FS cache
> synced.

I am curious how you label data with newer and elder labels.

I do admit the possibility of a race in the ZFS double-copy implementation of mmap/cache coherency, but I am somewhat skeptical about the same possibility for UFS. What you are saying might indicate that we lose modified/dirty bits for the page, but that would cause much more fireworks than just an occasional race with write.

What version of the system? Does the machine swap?

> I will try to make test to reliably reproduces issue.

Yes, an isolated test case is the best route forward. It would either show a bug or demonstrate a misunderstanding on your part.
|
--- Original message ---
From: "Konstantin Belousov" <[hidden email]>
To: "Pavlo" <[hidden email]>
Date: 4 July 2012, 12:06:44
Subject: Re: mmap() incoherency on hi I/O load (FS is zfs)

> I am curious how do you label data with newer and elder labels.

I have a list header like:

    struct XXX
    {
        uint32_t alloc_size;
        uint32_t list_size;
        node_t node[1];
    };

First I init it with pwrite(), setting for example alloc_size to 10 and everything else to 0. Then I add elements through the mmap()ed pages:

1. Workers log that the elements exist...
2. Workers log that the elements exist...
... same thing for a few seconds.
X. One of the workers cries that the list is empty.

Then I inspect the core file and see that the list looks as if it had just been initialised with pwrite(), i.e. alloc_size equals 10 and everything else is 0. It is hard to reproduce because it happens only under really high I/O load, and out of tens of thousands of such files only a couple get corrupted.

> I do admit a possibility of a race in ZFS double-copy implementation of
> the mmap/cache coherency, but somewhat skeptical about the same possibility
> for UFS. What you saying might indicate that we loose modified/dirty bits
> for the page, but that would have much more firework then just eventual
> race with write.
>
> What version of the system ? Does the machine swap ?

Okay, msync() helped but didn't fix the issue (it just reduced the occurrence), so I did the next thing: I tracked modification of the mmapped pages using mprotect(). At the end of a session, before munmap(), I saved the modified pages, then called munmap(), then wrote those pages back to disk.

Later a worker accessed those pages again with mmap(), modified them, and for some parts of those pages did read() instead of accessing them via mmap(). What read() returned was the data committed in the previous session with write(), not the data that had just been modified by the same process via mmap(). We reproduce this again and again on UFS on FreeBSD, and only under high I/O load. We could never reproduce this on Linux (ext4), though.

> > I will try to make test to reliably reproduces issue.
> Yes, isolated test case is the best route forward. It would either show
> a bug or demonstrate a misunderstanding on your part.

I am trying, but it's really hard to make an example that reproduces this issue.

Thanks for the reply.
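The mprotect()-based tracking of modified pages mentioned in this message was not posted, so the following is only a sketch of one common way to do it: map the pages read-only, catch the write fault, record the page as dirty, and re-enable writes. The names `on_fault` and `dirty_tracking_demo`, the 4-page region, and the use of an anonymous mapping (instead of the file-backed mapping in the thread) are all assumptions. Calling mprotect() from a signal handler is widely done but is not on the POSIX list of async-signal-safe functions.

```c
#define _DEFAULT_SOURCE         /* glibc: expose MAP_ANON and sigaction */
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4

static char *region;                          /* start of tracked mapping */
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];   /* 1 = page was written to */

/* Write-fault handler: mark the page dirty, then allow the write. */
static void
on_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;

    (void)ctx;
    if (region == NULL || addr < base || addr >= base + NPAGES * page_size) {
        signal(sig, SIG_DFL);   /* not our fault: let it crash normally */
        return;
    }
    size_t idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(region + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Writes to pages 1 and 3; returns a bitmask of pages seen as dirty. */
static int
dirty_tracking_demo(void)
{
    struct sigaction sa, old_segv, old_bus;
    int mask = 0;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    for (int i = 0; i < NPAGES; i++)
        dirty[i] = 0;
    region = mmap(NULL, NPAGES * page_size, PROT_READ,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    if (region == MAP_FAILED) {
        region = NULL;
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);   /* Linux reports SIGSEGV */
    sigaction(SIGBUS, &sa, &old_bus);     /* some BSDs report SIGBUS */

    volatile char *p = region;            /* volatile: keep the stores */
    p[1 * page_size] = 'a';
    p[3 * page_size + 17] = 'b';

    for (int i = 0; i < NPAGES; i++)
        if (dirty[i])
            mask |= 1 << i;

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(region, NPAGES * page_size);
    region = NULL;
    return mask;
}
```

In this sketch only pages 1 and 3 are written, so the returned bitmask has bits 1 and 3 set. A session-end step like the one described in the message would then save the pages flagged in `dirty[]` before munmap().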
|
--- Original message ---
From: "Pavlo" <[hidden email]>
To: "Konstantin Belousov" <[hidden email]>
Date: 4 July 2012, 12:25:55
Subject: Re: mmap() incoherency on hi I/O load (FS is zfs)

> > What version of the system ? Does the machine swap ?

I forgot to give the system details:

uname -a
FreeBSD zfs1.dev.ukr.net 8.2-STABLE FreeBSD 8.2-STABLE #7: Wed Aug 3 11:41:58 EEST 2011 [hidden email]:/usr/obj/usr/src/sys/DEV i386

Swap is turned off. For known reasons.
|
In reply to this post by Pavlo-2
On Wed, Jul 04, 2012 at 12:25:55PM +0300, Pavlo wrote:
> > What version of the system ? Does the machine swap ?

You just ignored these ^^^^^^^^^^^^ questions.

> Later worker accessed those pages again with mmap(), modified them and
> for some parts of those pages did read() instead of accessing via mmap().
> What read() returned was data committed in previous session with write()
> but not the data, that was just modified by same process via mmap(). We
> reproduces this again and again on UFS on FreeBSD and only on high IO
> load. Though we could never reproduce this on Linux (ext4).

So you are saying that the following sequence:
1. write at offset X
2. write into the shared mapping of the same file at offset X
3. read at offset X
performed by a single thread can return the data from point (1) instead of the data from point (2)?

Knowing how write is implemented for UFS, I find this quite impossible.

If the actions are executed in different processes/threads, say process 1 executes (1, 2) and process 2 executes (3), or process 1 executes (1) and process 2 executes (2, 3), then my first guess would be a lack of proper synchronization between the actions. This would indeed make possible exactly the outcome I described.

> I am trying, but it's really hard to make example to reproduce this issue.

This seems to be the only way forward, at least for you. And do answer the version/swap question.
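The three-step single-thread sequence asked about above can be written down as a small check. This is only a sketch under assumptions: the function name `single_thread_sequence`, the offset `X` of 512, the payload strings, and the temporary file are invented here, and running it once on an idle machine says nothing about behaviour under the heavy I/O load described in the thread.

```c
#define _DEFAULT_SOURCE         /* glibc: expose mkstemp/pread/pwrite */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define X 512                   /* hypothetical offset inside the first page */

/*
 * Returns 0 if step 3 sees the data from step 2 (coherent),
 * 1 if it sees something else (stale), -1 on setup failure.
 */
static int
single_thread_sequence(void)
{
    static const char first[] = "data from step 1";
    static const char second[] = "data from step 2";
    char path[] = "/tmp/mmap_seq.XXXXXX";
    char buf[sizeof(second)];
    long psz = sysconf(_SC_PAGESIZE);
    char *p = MAP_FAILED;
    int rc = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (psz <= X + (long)sizeof(second) || ftruncate(fd, psz) != 0)
        goto out;

    /* 1. write at offset X */
    if (pwrite(fd, first, sizeof(first), X) != (ssize_t)sizeof(first))
        goto out;

    /* 2. write into the shared mapping of the same file at offset X */
    p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    memcpy(p + X, second, sizeof(second));

    /* 3. read at offset X */
    if (pread(fd, buf, sizeof(buf), X) == (ssize_t)sizeof(buf))
        rc = memcmp(buf, second, sizeof(second)) != 0;
out:
    if (p != MAP_FAILED)
        munmap(p, (size_t)psz);
    close(fd);
    return rc;
}
```

On a system with a coherent page cache the function is expected to return 0. To approximate the reported conditions it would have to be run in a loop alongside heavy background I/O, which this sketch does not attempt.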
|
> On Wed, Jul 04, 2012 at 12:25:55PM +0300, Pavlo wrote: > > > > > > > On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote: > > > > > > > > > > > > > There's a case when some parts of files that are mapped and then > > > > modified getting corrupted. By corrupted I mean some data is ok (one that > > > > was written using write()/pwrite()) but some looks like it never existed. > > > > Like it was some time in buffers, when several processes simultaneously > > > > (of course access was synchronised) used shared pages and reported it's > > > > existence. But after time pass they (processes) screamed that it is now > > > > lost. Only part of data written with pwrite() was there. Everything that > > > > was written via mmap() is zero. > > > > > > > > > > So as I said it occurs on hi I/O busyness. When in background 4+ > > > > processes do indexing of huge ammount of data. Also I want to note, it > > > > never occurred in the life of our project while we used mmap() under > > > > same I/O stress conditions when mapping was done for a whole file of just > > > > a part(header) starting from a beginning of a file. First time we used > > > > mapping of individual pages, just to save RAM, and this popped up. > > > > > > > > > > Solution for this problem is msync() before any munmap(). But man says: > > > > > > > > > > > > > > > > > > The msync() system call is usually not needed since BSD implements a > > > > coherent file system buffer cache. However, it may be used to associate > > > > dirty VM pages with file system buffers and thus cause them to be flushed > > > > to physical media sooner rather than later. > > > > > > > > > > Any thoughts? Thanks. > > > > > > > > > > > > > > > > > > So I tracked issue to the place where it occurs. When I commit data to > > > > file using mmap() and pwrite() side by side, sometimes 'newest data' is > > > > being overwritten by 'elder data'. From time to time 'elder data' can be > > > > something written with mmap() either with pwrite(). 
> > > > It never happens when I use exclusively mmap() or exclusively pwrite().
> > > > Also, this issue reproduces on UFS as well. I think there is a problem
> > > > keeping mmapped pages and the FS cache synced.
> > >
> > > I am curious how you label data as newer and elder.
> >
> > I have a list header like:
> >
> > struct XXX
> > {
> >     uint32_t alloc_size;
> >     uint32_t list_size;
> >     node_t node[1];
> > };
> >
> > First I init it with pwrite(), setting for example alloc_size to 10 and
> > everything else to 0.
> >
> > Then I add elements through mmap():
> >
> > 1. Workers log the elements' existence...
> > 2. Workers log the elements' existence...
> > ... same thing for a few seconds.
> > X. One of the workers cries that the list is empty.
> >
> > Then I inspect the core file and see that the list looks as if it was just
> > initialised with pwrite(), i.e. alloc_size equals 10 and everything else is 0.
> > It is hard to reproduce because it happens only under really high I/O load.
> > And out of tens of thousands of such files only a couple get corrupted.
> >
> > > I do admit a possibility of a race in the ZFS double-copy implementation of
> > > the mmap/cache coherency, but I am somewhat skeptical about the same
> > > possibility for UFS. What you are saying might indicate that we lose
> > > modified/dirty bits for the page, but that would produce much more
> > > fireworks than just an eventual race with write.
> > >
> > > What version of the system? Does the machine swap?
>
> You just ignored these ^^^^^^^^^^^^ questions.

Sorry, forgot to answer. I did in the next reply, but I'll repeat anyway:

uname -a
FreeBSD zfs1.dev.ukr.net 8.2-STABLE FreeBSD 8.2-STABLE #7: Wed Aug 3 11:41:58 EEST 2011 [hidden email]:/usr/obj/usr/src/sys/DEV i386

Swap is turned off, for known reasons.

Also, maybe I confused you with different cases. The thing about the list
header _does_not_reproduce_on_UFS_. Only on ZFS.
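For readers following the list-header case, here is a minimal sketch of that access pattern: the header is initialised with pwrite(), elements are added through a shared mapping of the first page only, and msync() is called before every munmap() as in the workaround mentioned earlier. The names `list_header`, `list_init`, `list_append`, `list_size_on_disk` and `list_demo`, and the layout of `node_t`, are assumptions made for illustration; the thread never shows the real definitions. This is not the poster's code, and a single-process run like this is not expected to trigger the reported corruption.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in: the thread never shows the real node_t. */
typedef struct { uint32_t value; } node_t;

struct list_header {            /* "struct XXX" in the thread */
    uint32_t alloc_size;
    uint32_t list_size;
    node_t   node[1];
};

/* Initialise the header with pwrite(): alloc_size set, everything else 0. */
static int list_init(int fd, uint32_t alloc_size, size_t page)
{
    struct list_header h = { .alloc_size = alloc_size };

    assert(page >= sizeof h);
    if (ftruncate(fd, (off_t)page) != 0)    /* file must cover the mapped page */
        return -1;
    return pwrite(fd, &h, sizeof h, 0) == (ssize_t)sizeof h ? 0 : -1;
}

/* Append one element through a shared mapping of the first page only. */
static int list_append(int fd, uint32_t value, size_t page)
{
    int rc = -1;
    char *base = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (base == MAP_FAILED)
        return -1;

    struct list_header *h = (struct list_header *)base;
    /* Address the node array by offset instead of indexing past node[1]. */
    node_t *nodes = (node_t *)(base + offsetof(struct list_header, node));
    size_t room = (page - offsetof(struct list_header, node)) / sizeof(node_t);

    if (h->list_size < h->alloc_size && h->list_size < room) {
        nodes[h->list_size].value = value;
        h->list_size++;
        rc = 0;
    }
    /* Workaround discussed in the thread: msync() before munmap(). */
    if (msync(base, page, MS_SYNC) != 0)
        rc = -1;
    if (munmap(base, page) != 0)
        rc = -1;
    return rc;
}

/* Read list_size back through the file descriptor, not the mapping. */
static int list_size_on_disk(int fd, uint32_t *out)
{
    struct list_header h;

    if (pread(fd, &h, sizeof h, 0) != (ssize_t)sizeof h)
        return -1;
    *out = h.list_size;
    return 0;
}

/* Returns the list size seen by pread() after three appends, or -1 on error. */
int list_demo(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    uint32_t size = 0;
    int ok;

    if (page <= 0)
        return -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    ok = list_init(fd, 10, (size_t)page) == 0;
    for (uint32_t v = 1; ok && v <= 3; v++)
        ok = list_append(fd, v, (size_t)page) == 0;
    ok = ok && list_size_on_disk(fd, &size) == 0;

    close(fd);
    unlink(path);
    return ok ? (int)size : -1;
}
```

Calling `list_demo()` with a scratch file path on the file system under test gives the list size as seen through pread(). A result smaller than the number of appends would correspond to the "list looks freshly initialised" symptom described above.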
> > Okay, after msync() helped but didn't fix the issue (it just reduced the
> > occurrence), I did the next thing: I tracked modification of mmapped pages
> > using mprotect(). At the end of a session, before munmap(), I saved the
> > modified pages, then called munmap(), then wrote those pages back to disk.
> >
> > Later a worker accessed those pages again with mmap(), modified them, and
> > for some parts of those pages did read() instead of accessing them via
> > mmap(). What read() returned was the data committed in the previous session
> > with write(), not the data that was just modified by the same process via
> > mmap(). We reproduce this again and again on UFS on FreeBSD, and only under
> > high I/O load. We could never reproduce this on Linux (ext4), though.
>
> So you are saying that the following sequence:
> 1. write at offset X
> 2. write into the shared mapping of the same file at offset X
> 3. read at offset X
> performed by a single thread can return the data from point (1) instead of
> the data from point (2)?
>
> Knowing how write is implemented for UFS, I find this quite impossible.
>
> If the actions are executed in different processes/threads, say
> process 1 executes (1, 2) and process 2 executes (3), or process 1
> executes (1), and process 2 executes (2, 3), then my first guess would
> be a lack of proper synchronization between the actions. This would indeed
> make possible exactly the outcome I described.

This was tested _ONLY_ on UFS.

Process 1:
1. Write at offset X with mmap();
2. Commit that page again after munmap() with write().

Later, process 2:
1. Read at offset X with mmap();
2. Write at offset X with mmap();
3. Read at offset X with read() and see the data written by process 1 in (2).

All operations are guarded by a lock. Never reproduces on Linux.

When I remove step (2) for process 1, it never reproduces on UFS but does on
ZFS (as I wrote before).

Of course these may be my mistakes. But the same things done exclusively via
mmap() or exclusively via read/write never break the file.
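As a starting point for an isolated test case, the single-thread sequence from the quoted reply (write at X, store into a shared mapping at X, read at X) can be sketched as below. The function name `check_mmap_read_coherency`, the two marker strings and the use of pread() in place of lseek() plus read() are choices made for this illustration only. The sketch encodes the sequence and nothing more; it does not generate the heavy background I/O or the multi-process locking that the report says are needed to see the problem.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the read at offset X sees the store made through the
 * shared mapping, 0 if it still sees the older pwrite() data, and -1
 * on any system-call error. X is the start of page number page_index.
 */
int check_mmap_read_coherency(const char *path, size_t page_index)
{
    const char old_data[] = "written-with-pwrite";
    const char new_data[] = "written-with-mmap!!";
    char buf[sizeof new_data];
    char *p = MAP_FAILED;
    int result = -1;

    assert(sizeof old_data == sizeof new_data);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    off_t x = (off_t)page_index * (off_t)page;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The file must be long enough to back the page we map. */
    if (ftruncate(fd, x + (off_t)page) == 0 &&
        /* Step 1: write at offset X through the file descriptor. */
        pwrite(fd, old_data, sizeof old_data, x) == (ssize_t)sizeof old_data) {

        /* Step 2: write into the shared mapping of the same file at X. */
        p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, x);
        if (p != MAP_FAILED) {
            memcpy(p, new_data, sizeof new_data);

            /* Step 3: read at offset X and compare with the mapped store. */
            if (pread(fd, buf, sizeof buf, x) == (ssize_t)sizeof buf)
                result = memcmp(buf, new_data, sizeof buf) == 0;

            munmap(p, (size_t)page);
        }
    }

    close(fd);
    unlink(path);
    return result;
}
```

With a coherent buffer cache the function should return 1. A return value of 0 would match the stale read described in this message. Running it in a loop from several processes under heavy I/O, with the same locking the real application uses, would be the next step toward a reproducer.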
> > > > I will try to make a test that reliably reproduces the issue.
> > >
> > > Yes, an isolated test case is the best route forward. It would either show
> > > a bug or demonstrate a misunderstanding on your part.
> >
> > I am trying, but it's really hard to make an example that reproduces this issue.
>
> This seems to be the only way forward, at least for you.
> And do answer about the version/swap question.

Roger that. Thanks for the reply.
