Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
Alright guys, I'm at the end of my rope here. For those that haven't seen  
my previous emails here's the (not so) quick breakdown:

Overview:

FreeBSD ?? - 7.4 never crashes
FreeBSD 8.0 - 8.2 crashes
FreeBSD 8-STABLE, 8.3, and 9.0 are untested (Sorry, not possible in our  
production at this time, and we were hoping we could base some stuff on  
8.3 for long term stability...)
ESXi: Confirmed ESXi 4.0 - 5.0 has this problem. Haven't tested on others.


History:

Over the course of the last 2 years we've been banging our heads on the  
wall. VMWare is done debugging this. They claim it's not a VMWare issue.  
They can't identify what the heck happens. We had a glimmer of hope with  
ESXi 5.0 fixing it because we never saw any crashes in the handful of  
deployments, but our dreams were crushed today -- two days before an  
outage to begin migration to ESXi 5.0 -- when a customer's ESXi 5.0 server  
and FreeBSD 8.2 guest crashed.


Crash Details:

The keyboard/mouse usually stops responding for input on the console;  
normally we can't type in a username or password. However, we can switch  
VTs.

If there's a shell on the console and we can type, we can only run things  
in memory. Any time we try to access the disk it will hang indefinitely.

The server still has network access. We can ping it without issue. SSH of  
course kicks you out because it can't do any I/O.

If we were to serve a lightweight http server off a memory backed  
filesystem I'm confident it would run just fine as long as it wasn't  
logging or anything.

On ESXi you see that there is a CPU spike of 100% that goes on  
indefinitely. No idea what the FreeBSD OS itself thinks it is doing  
because we can't run top during the crash.

This crash can affect a server and happen multiple times a week. It can  
also not show up for 180 days or more. But it does happen. The server can  
be 100% idle and crash. We have servers that do more I/O than the ones  
that crash could ever attempt to do and these don't crash at all.  
Completely inexplicable.
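
One way to pin down when disk I/O actually stops (as opposed to when someone notices) is to keep a small probe resident in memory that periodically attempts a synced write. What follows is only a sketch under assumptions not stated in this thread: it assumes Python is installed on the guest, and the probe file name `.io_probe` and the 5-second timeout are made up for illustration. The write runs in a background thread because a write that blocks in the kernel would otherwise take the reporter down with it.

```python
import os
import tempfile
import threading


def disk_io_alive(directory, timeout_s=5.0):
    """Return True if a small write+fsync in `directory` completes in time.

    The write runs in a daemon thread so the caller can still report a
    timeout even if the write itself blocks indefinitely in the kernel.
    """
    done = threading.Event()

    def attempt():
        # Hypothetical probe file name; any writable path on the disk works.
        path = os.path.join(directory, ".io_probe")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"probe\n")
            os.fsync(fd)  # force the data out to the (virtual) disk
        finally:
            os.close(fd)
        os.unlink(path)
        done.set()

    worker = threading.Thread(target=attempt, daemon=True)
    worker.start()
    # Event.wait returns False if the timeout expires before done.set().
    return done.wait(timeout_s)


if __name__ == "__main__":
    print("disk I/O alive:", disk_io_alive(tempfile.gettempdir()))
```

For this to be useful during a hang, the loop calling it has to be running before the hang starts, and the result has to be reported somewhere that does not touch the local disk (the console or the network), since anything that logs to disk will block as well.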


Things we've looked into:

Nothing about the installed software matters. We've tried cross  
referencing the crashed servers by the programs they run but the base OS  
is the only common denominator due to the wide variety of servers it has  
affected.

Storage doesn't matter. We've tried different iSCSI SANs, we've tried  
different switches, we've tried local datastores on the ESXi servers  
themselves.

HP servers, Dell servers -- doesn't seem to matter either. (All with  
latest firmwares, BIOSes, etc)

VMWare gave us a ton of debugging tasks, and we've given them gigabytes of  
debugging info and data; they can't find anything.

VMWare tools -- with them, without them, or with open-vm-tools instead --
makes no difference. I think we've done a fair job ruling out VMWare.


I think we've finally found enough data to say that this is definitely something  
in the FreeBSD world. I'm going to begin prepping some of the known crashy  
servers with more debugging. Any suggestions on what I should build the  
kernel with? They never do a proper panic, but I definitely want to at  
least *try* to get into the debugger the next time it crashes. And when it  
crashes, what the heck should I be running? I've never played with the KDB  
before...


Thank you for any suggestions and help you can give me....
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Adrian Chadd-2
Hi,

* have you filed a PR?
* is the crash easily reproducible?
* are you able to boot some ramdisk-only FreeBSD-8.2 images (eg create
a ramdisk image using nanobsd?) and do some stress testing inside
that?

It sounds like you've established it's a storage issue, or at least
interrupt handling for storage issue. So I'd definitely try the
ramdisk-only boot and thrash it using lighttpd/httperf or something.
If that survives fine, I'd look at trying to establish whether there's
something wrong in the disk driver(s) freebsd is using. I'm not that
cluey on ESXi, but there may be some PIC/APIC/ACPI change between 7.x
and 8.0 which has caused this to surface.

2c,


Adrian

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

dougb
In reply to this post by Mark Felder-4
On 3/28/2012 1:59 PM, Mark Felder wrote:
> FreeBSD 8-STABLE, 8.3, and 9.0 are untested

As much as I'm sensitive to your production requirements, realistically
it's not likely that you'll get a helpful result without testing a newer
version. 8.2 came out over a year ago, many many things have changed
since then.

Doug

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Joe Greco
In reply to this post by Adrian Chadd-2
> Hi,
>
> * have you filed a PR?
> * is the crash easily reproducible?
> * are you able to boot some ramdisk-only FreeBSD-8.2 images (eg create
> a ramdisk image using nanobsd?) and do some stress testing inside
> that?
>
> It sounds like you've established it's a storage issue, or at least
> interrupt handling for storage issue. So I'd definitely try the
> ramdisk-only boot and thrash it using lighttpd/httperf or something.
> If that survives fine, I'd look at trying to establish whether there's
> something wrong in the disk driver(s) freebsd is using. I'm not that
> cluey on ESXi, but there may be some PIC/APIC/ACPI change between 7.x
> and 8.0 which has caused this to surface.

We've seen this.  Or something that seems really like it.

We run dozens of FreeBSD VM's, many of which are 8.mumble.  We have a
scripted build environment dating back many years, so generally servers
come out in a fairly reproducible form.

After several months of smooth running, we had need to shuffle some
things around, and migrated some servers to a different datastore.
Suddenly, one particular VM, our corp Jabber server, started randomly
disconnecting people every morning.  Some inspection showed that the
machine was running, but disk I/O in the VM was freezing up.  
Subsequent inspection suggested that it was happening during the
periodic daily, though we never managed to get it to happen by manually
forcing periodic daily, so that's only a theory.  Given that several
times it appeared that one of the find commands was running, I was
guessing that something in the thin provisioned disk image for the
system had gone bad, but reading the entire disk with dd didn't cause
a hang, running the periodic daily by hand didn't cause a hang, etc.

Migrating the VM to a different host and datastore did not fix the
issue.  Migrating the VM from an Opteron to a Xeon host with all the
latest ESXi 4 patches also didn't make any difference.  Migrating the
disk image from thin to full seemed to fix it, but I only gave it a
day or two, then decided there were other good reasons to reload the
VM, so I nuked the VM, which, of course, fixed it.

In the meantime, a dozen other similar VM's alongside it run just
fine.  My conclusion was that it was something specific that had gone
awry in the virtual machine, probably in the disk image, but I could
not identify it without significant digging that I had no particular
reason or inclination to do; since it appeared to be a VMware problem,
the "reload it and be done with it" seemed the quickest path to
resolution.

That having been said, if anyone has any brilliant ideas about what
would constitute useful further steps to isolate this, I can look at
recovering the faulty VM from backup and seeing if it still exhibits
the problem.

... JG
--
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I
won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN)
With 24 million small businesses in the US alone, that's way too many apples.

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
In reply to this post by Adrian Chadd-2
On Wed, 28 Mar 2012 18:31:38 -0500, Adrian Chadd <[hidden email]>  
wrote:

> * have you filed a PR?

No

> * is the crash easily reproducible?

Unfortunately not. It's totally random. Some servers will "get the bug"  
and crash daily, some will crash weekly, some might seem to be fine but 3  
months later hit this crash.

> * are you able to boot some ramdisk-only FreeBSD-8.2 images (eg create
> a ramdisk image using nanobsd?) and do some stress testing inside
> that?

That's a plan I'd like to execute but my free time for building that  
environment is rather short at the moment :(

> I'm not that
> cluey on ESXi, but there may be some PIC/APIC/ACPI change between 7.x
> and 8.0 which has caused this to surface.

Was there a setting to revert ACPI behavior from 8.x to 7.x? I thought I  
read about that at one point.... or perhaps this was something available  
back in the dev cycle when 8 was -CURRENT. *shrug* I know 9.0 and onward  
has even more ACPI changes so assuming it truly is an ACPI bug I guess we  
could cross our fingers and hope that the bug has mysteriously vanished?

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
In reply to this post by dougb
On Thu, 29 Mar 2012 02:36:49 -0500, Doug Barton <[hidden email]> wrote:

> As much as I'm sensitive to your production requirements, realistically
> it's not likely that you'll get a helpful result without testing a newer
> version. 8.2 came out over a year ago, many many things have changed
> since then.

The sad part is that VMWare's "supported FreeBSD versions" are a joke, and  
we've been trying to keep VMWare happy by only running "supported  
versions". I honestly don't think they even test. It's so stupid.

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Joe Greco
In reply to this post by Adrian Chadd-2
> On 3/28/2012 1:59 PM, Mark Felder wrote:
> > FreeBSD 8-STABLE, 8.3, and 9.0 are untested
>
> As much as I'm sensitive to your production requirements, realistically
> it's not likely that you'll get a helpful result without testing a newer
> version. 8.2 came out over a year ago, many many things have changed
> since then.
>
> Doug

So you're saying that he should have been using 8.3-RELEASE, then.

If you'll kindly go over to http://www.freebsd.org and look under
"Latest Releases", please note that "8.2" is a production release.
If you don't want it to be a production release, then find a way
to make it so, but please don't snipe at people who are using the
code that the FreeBSD project has indicated is a current production
offering.

There are many good reasons not to run arbitrary snapshots on your
production gear.  It's unrealistic to expect people to run non-RELEASE,
non-production code on their production gear.  We can have that
discussion if you don't understand that; drop me a note off-list
and I'll be happy to explain it.

Otherwise, you've told him to run a "newer version," of which NONE
IS AVAILABLE, unless you're thinking 9.0, but FreeBSD has a rather
catastrophic history of "point zero" releases, and most clueful
admins won't run those in production without carefully measuring
the risks and benefits.  So you've basically told him to run a
newer version without any such version being realistically
available.

WTF?

You want people not to use releases that "came out over a year
ago"?  The generally sensible solution to that is to release
RELEASEs more than once every fourteen or fifteen months.

... JG

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
In reply to this post by Mark Felder-4
Alright, new data. It happened to crash about 10 minutes after I came in  
this morning and I ran some stuff in the DDB. I have no idea what  
information is useful, but perhaps someone will see something out of the  
ordinary?


http://feld.me/freebsd/esx_crash/


Thanks...

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Hans Petter Selasky
In reply to this post by Joe Greco
On Thursday 29 March 2012 15:42:42 Joe Greco wrote:
> > Hi,

Do both 32- and 64-bit versions of FreeBSD crash?

--HPS

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
On Thu, 29 Mar 2012 09:58:16 -0500, Hans Petter Selasky <[hidden email]>  
wrote:

> Do both 32- and 64-bit versions of FreeBSD crash?

Correct, we see both i386 and amd64 flavors crash in the same way.

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Eduardo Morras
In reply to this post by Mark Felder-4
At 16:03 29/03/2012, you wrote:
>Alright, new data. It happened to crash about 10 minutes after I came in
>this morning and I ran some stuff in the DDB. I have no idea what
>information is useful, but perhaps someone will see something out of the
>ordinary?
>
>
>http://feld.me/freebsd/esx_crash/

Don't know about ESXi, but on other VM managers I can change the
chipset emulation from ICH10 to ICH4. Can you change it to an older
chipset too?


>Thanks...



Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
In reply to this post by Mark Felder-4
On Thu, 29 Mar 2012 10:31:24 -0500, Eduardo Morras <[hidden email]>  
wrote:

>
> Don't know about ESXi, but on other VM managers I can change the chipset  
> emulation from ICH10 to ICH4. Can you change it to an older chipset too?

Unfortunately there's no setting in the GUI for that but I'll keep looking  
to see if there's a hidden option -- perhaps in the VM's config file.

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Joe Greco
In reply to this post by Hans Petter Selasky
> On Thursday 29 March 2012 15:42:42 Joe Greco wrote:
> > > Hi,
>
> Do both 32- and 64-bit versions of FreeBSD crash?

We've only seen it happen on one virtual machine.  That was a 32-bit
version.  And it's not so much a crash as it is a "disk I/O hang".

The fact that it was happening regularly to that one VM, while a
bunch of other similar VM's were running alongside it without any
incident, along with the problem moving with the VM as it is moved
from host to host and from Opteron to Xeon, strongly points at
something being wrong with the VM itself.  Our systems are built
mostly by script; I rebuilt the VM a few months ago and the
problem vanished.  The rebuilt system "should" have been virtually
identical to the original.  I never actually compared them though.

My working theory was that something bad had happened to the VM
during a migration from one datastore to another.  We have a really
slow-writing iSCSI server that it had been migrated onto for a little
bit, which was where the problem first appeared, I believe.  At
first I thought it was the nightly cron jobs just exceeding the iSCSI
server's capacity to cope, so we migrated the VM onto a host with
local datastores, and it remained broken thereafter.

So my conclusion was that it seemed likely that somehow VMware's
thin provisioned disk image had gotten fouled up, and under some
unknown use case, it could be teased into locking up further I/O
on the VM.  I wasn't able to prove it.  I tried a read-dd of the
entire disk - passed, flying.  I tried several things to duplicate
the nightly periodic tasks where it seemed so prone to locking up.
They all ran fine.  But if I left the machine running, it'd do it
again eventually.

I explained it at the time to one of my VMware friends:

> But here's where it gets weird.  Three times, now, one VM - our Jabber
> server - has gone wonky in the wee early AM hours.  Disk I/O on the VM
> just locks up.  You can type at the console until it does I/O, so you
> can put in "root" at the login: prompt but never get a pw prompt.  My
> systems all run "top" from /etc/ttys and I can see that a whole bunch
> of processes are stopped in "getblk".  It's like the iSCSI disk has gone
> away, except it hasn't, since the other VM's are all happily churning
> away, on the same datastore, on the same VMware host.

http://www.sol.net/tmp/freebsd/freebsd-esxi-lockup.gif

> Now it's *possible* that the problem actually happens after the 3AM cron
> run (note slight CPU/memory drop) but the Jabber implosion actually
> happens around 0530, see drop in memory%.  But the root problem at the
> VM level seems to be that disk I/O has frozen.  I can't tell for sure when
> that happens.  All three instances are similar to this.
>
> I can't explain this or figure out how to debug it.  Since it's locked up
> right now, thought I'd ping you for ideas before resetting it.

Now that was actually before we migrated it back to local datastore,
but when we did, the problem remained, suggesting that whatever has
happened to the VM, it is contained within the VM's vmdk or other
files.
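
The "stopped in getblk" observation can be turned into a quick check when comparing hung VMs. The sketch below filters process listings by wait channel. It assumes the text comes from `ps -axo pid,state,wchan,command`, and the sample listing is invented for illustration, not captured from any of the machines discussed here.

```python
def blocked_in(ps_output, wchan="getblk"):
    """Return (pid, command) pairs whose wait channel equals `wchan`.

    `ps_output` is expected to be the text produced by
    `ps -axo pid,state,wchan,command`, header line included.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, _state, chan, command = fields
        if chan == wchan:
            stuck.append((int(pid), command))
    return stuck


# Invented sample listing, for illustration only.
SAMPLE = """\
  PID STAT WCHAN  COMMAND
    1 ILs  wait   /sbin/init --
  612 D    getblk /usr/sbin/syslogd -s
  845 D    getblk find / -xdev -type f
  901 S    select /usr/sbin/sshd
"""

if __name__ == "__main__":
    for pid, command in blocked_in(SAMPLE):
        print(pid, command)
```

As with `top` started from /etc/ttys, the listing has to come from something that is already in memory when the hang begins; starting a fresh `ps` during the hang will most likely block on disk I/O itself.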

... JG

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Hans Petter Selasky
On Thursday 29 March 2012 17:49:30 Joe Greco wrote:
> > On Thursday 29 March 2012 15:42:42 Joe Greco wrote:
> > > > Hi,
> >
> > Do both 32- and 64-bit versions of FreeBSD crash?
>
> We've only seen it happen on one virtual machine.  That was a 32-bit
> version.  And it's not so much a crash as it is a "disk I/O hang".

It almost sounds like the lost interrupt issue I've seen with USB EHCI
devices, though disk I/O should have a retry timeout?

What does "vmstat -i" output?

--HPS

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
On Thu, 29 Mar 2012 10:55:36 -0500, Hans Petter Selasky <[hidden email]>  
wrote:
>
> It almost sounds like the lost interrupt issue I've seen with USB EHCI
> devices, though disk I/O should have a retry timeout?
>
> What does "vmstat -i" output?
>
> --HPS


Here's a server that has a week uptime and is due for a crash any hour now:

root@server:/# vmstat -i
interrupt                          total       rate
irq1: atkbd0                          34          0
irq6: fdc0                             9          0
irq15: ata1                           34          0
irq16: em1                        778061          1
irq17: mpt0                     19217711         31
irq18: em0                     283674769        460
cpu0: timer                    246571507        400
Total                          550242125        892
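
To test the lost-interrupt theory, two `vmstat -i` snapshots taken some time apart can be compared: if the mpt0 counter stops advancing while em0 and the timer keep counting, that points at storage interrupts specifically. The following is a rough sketch, not a tested tool. The snapshot is the output above; the function names are made up, and the second snapshot in the example is hypothetical.

```python
def parse_vmstat_i(text):
    """Map each interrupt source in `vmstat -i` output to its total count."""
    totals = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Data rows look like "irq17: mpt0   19217711   31"; the header and
        # the "Total" row have no colon and are skipped.
        if sep and len(fields) >= 3 and fields[-2].isdigit():
            key = name.strip() + ": " + " ".join(fields[:-2])
            totals[key] = int(fields[-2])
    return totals


def stalled_sources(before, after):
    """Return sources with a nonzero count in `before` that did not advance."""
    return sorted(
        name for name, count in before.items()
        if count > 0 and after.get(name, count) == count
    )


SNAPSHOT = """\
interrupt                          total       rate
irq1: atkbd0                          34          0
irq6: fdc0                             9          0
irq15: ata1                           34          0
irq16: em1                        778061          1
irq17: mpt0                     19217711         31
irq18: em0                     283674769        460
cpu0: timer                    246571507        400
Total                          550242125        892
"""

if __name__ == "__main__":
    before = parse_vmstat_i(SNAPSHOT)
    # Hypothetical later snapshot: network and timer advance, mpt0 does not.
    after = dict(before)
    after["irq18: em0"] += 1000
    after["cpu0: timer"] += 4000
    print(stalled_sources(before, after))
```

Idle devices such as atkbd0 and fdc0 will also show up as not advancing, so only a stalled counter on a device that is normally busy is interesting. As a sanity check on the numbers above, dividing the timer total by its rate (246571507 / 400) gives about 616,000 seconds, or roughly 7.1 days, which matches the stated week of uptime.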

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
In reply to this post by Joe Greco
On Thu, 29 Mar 2012 10:49:30 -0500, Joe Greco <[hidden email]> wrote:

> I explained it at the time to one of my VMware friends:


This is 100% identical to what we see, Joe! And we're so unlucky that we  
have this happen on probably a dozen servers, but a handful are the really  
bad ones. We've rebuilt them from scratch many times with no improvement.

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Jim Bryant-5
In reply to this post by Mark Felder-4
This sounds just like a race condition that happens under Windows 7 on
this laptop.  The race condition, as far as I can tell involves heavy
disk access and heavy network access, and usually leaves the drive light
on, while all activity monitors (alldisk, allcpu, allnetwork) are still
active, although on this laptop disk takes priority, and network slows
to a crawl.  Occasionally, the mouse will stop working, along with
everything else, but usually not.  The keyboard is lower priority, and
doesn't do anything.

You might want to check with mickeysoft, this might just be their
problem.  This sounds so freaking similar to the issue I get, and I
think it's a race condition (shared interrupts??).

This laptop is a Compaq Presario C300 series, with the 945GM chipset and
a T7600 Core2 Duo CPU, with 3G of RAM.


Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Alan Cox-9
In reply to this post by Mark Felder-4
On Thu, Mar 29, 2012 at 11:27 AM, Mark Felder <[hidden email]> wrote:

> On Thu, 29 Mar 2012 10:55:36 -0500, Hans Petter Selasky <[hidden email]>
> wrote:
>
>>
>> It almost sounds like the lost interrupt issue I've seen with USB EHCI
>> devices, though disk I/O should have a retry timeout?
>>
>> What does "vmstat -i" output?
>>
>> --HPS
>>
>
>
> Here's a server that has a week uptime and is due for a crash any hour now:
>
> root@server:/# vmstat -i
> interrupt                          total       rate
> irq1: atkbd0                          34          0
> irq6: fdc0                             9          0
> irq15: ata1                           34          0
> irq16: em1                        778061          1
> irq17: mpt0                     19217711         31
> irq18: em0                     283674769        460
> cpu0: timer                    246571507        400
> Total                          550242125        892
>
>

Not so long ago, VMware implemented a clever scheme for reducing the
overhead of virtualized interrupts that must be delivered by at least some
(if not all) of their emulated storage controllers:

http://static.usenix.org/events/atc11/tech/techAbstracts.html#Ahmad

Perhaps, there is a bad interaction between this scheme and FreeBSD's mpt
driver.

Alan

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Atkinson-4
In reply to this post by Mark Felder-4
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03/29/2012 07:03, Mark Felder wrote:
> Alright, new data. It happened to crash about 10 minutes after I
> came in this morning and I ran some stuff in the DDB. I have no
> idea what information is useful, but perhaps someone will see
> something out of the ordinary?
>
>
> http://feld.me/freebsd/esx_crash/

If this is an interrupt problem with disk i/o, then you might want to
look into (DDB(4))

show intr
show intrcount

maybe

show allrman
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk90lloACgkQrDN5kXnx8yaCZACbBamQksNyWC26PUsOn5N9LJLV
ql0AoJwYCFDfXhCpZIN735V9qg0VepFf
=fCLN
-----END PGP SIGNATURE-----


Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mark Felder-4
On Thu, 29 Mar 2012 12:05:30 -0500, Mark Atkinson <[hidden email]>  
wrote:

>
> If this is an interrupt problem with disk i/o, then you might want to
> look into (DDB(4))
> show intr
> show intrcount
> maybe
> show allrman


Thank you! I really don't know what things we should be running in DDB to  
diagnose this and we will try this upon the next crash.