Linux is a funny thing. It’s kinda like buying a kit-car. The plans and the parts are cheap, but if something gets an odd problem, it can drive you mad ferreting it out.

This is the first time I’ve worked with a truly co-located box. Access to the console isn’t available so it is purely remote-admin. To be careful I went with the default install that the place uses (RedHat 9) and have done my best to stay within redhat-isms in design. As such, I used the sendmail and usual package installs. I hadn’t used RedHat for some time, and I figured I should give it a go again. I’ve been surprised by how things have improved, but at the same time the usual clunkiness of RedHat is still there and in the future I don’t think I’ll be sticking with it. I think I’ve reached the same decision about sendmail versus postfix or qmail.

Anyway, I’ve happened upon one of those quirky and maddening problems. The box up and decides to reboot. It does so with frightening regularity (on average about every 4 days in the last month).

There is no surge in usage. There is no surge in load. There is no surge in network traffic. I’ve updated every package to its current version using the Fedora Legacy project.

Nothing appears in the logs. Not even a half-written entry. Just BOOM and a reboot.

I poked the provider about it and they have not been rebooting the machine. I’m starting to guess hardware, but even then, most failing hardware will dump something to disk before exiting.

dmesg shows nothing. chkrootkit shows nothing. The box is clean and normal.

It does a lot of web serving and has a busy life sending and receiving email and spam. The CPU isn’t too busy. Mostly SpamAssassin and php for apache.

The only weird question mark is that named occasionally craps out. I’ve been doing a lot of reading up in case that was the issue, but I am running the latest patch version of it. I’m wondering if maybe I should tighten down block transfers and see if that’s part of it. I think named gets put under a little load because sendmail is checking 7 blackholes, so each mail service connect means 8 DNS queries. But still… it should be able to handle that.
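For anyone wondering why blackhole checks lean on named: each one is just a DNS lookup of the connecting IP with its octets reversed, glued onto the list's zone. Here is a minimal Python sketch of how those query names get built. The zone names are made up for illustration (they are not the lists I actually run), and the function names are my own.

```python
import ipaddress

def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name a blackhole-list check looks up for an IPv4 client."""
    # Validate the address first; bad input raises ValueError.
    addr = ipaddress.IPv4Address(client_ip)
    # 192.0.2.1 becomes 1.2.0.192, then the list's zone is appended.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def queries_for_connection(client_ip: str, zones: list[str]) -> list[str]:
    """One query name per configured blackhole list."""
    return [dnsbl_query_name(client_ip, zone) for zone in zones]

# Hypothetical zones: seven placeholders standing in for the real lists.
ZONES = [f"bl{i}.example.org" for i in range(1, 8)]

if __name__ == "__main__":
    for name in queries_for_connection("192.0.2.1", ZONES):
        print(name)
```

Seven lists means seven of these names per incoming connection, on top of whatever other lookups the mail server does for that client.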

I’m running named in debug mode to see if I can catch an error when it craps out. I’m not entirely sure that these are related. named will crap out at times unrelated to the reboots. Curiously, most all of the reboots seem to happen in the late evening and early morning hours of EST.
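To check whether the reboots really do bunch up at certain hours, something like the following sketch would do. It assumes I have already copied the reboot times out by hand into "YYYY-MM-DD HH:MM" strings; that format, the function name, and the sample values are all placeholders of mine, not real data from the box.

```python
from collections import Counter
from datetime import datetime

def reboots_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count reboots per hour of day (0-23) from a list of timestamp strings."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    # Return a plain dict ordered by hour so clusters are easy to spot.
    return dict(sorted(hours.items()))

if __name__ == "__main__":
    # Made-up example timestamps, purely to show the shape of the input.
    sample = ["2004-01-01 23:10", "2004-01-05 02:30", "2004-01-09 23:45"]
    for hour, count in reboots_by_hour(sample).items():
        print(f"{hour:02d}:00  {'#' * count}")
```

If most of the counts land in the same few hours, that points toward something scheduled (cron jobs, backups, a neighbor's batch run) rather than random hardware flakiness.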

Curious. Anyfolks got ideas?

4 Responses to “Annoyances”

  1. rustitobuck says:

    Last Linux server problem Cheetah and I had was related to a misinstalled heatsink. Everything was fine until the machine had a significant load. Then the CPU would overheat and lock up.

    This would happen when one of our users would run a program that had a defect that caused it to use all available CPU.

    And it would happen early in the morning when the nightly cron jobs would go and index the disk and stuff like that.

  2. dragonrift says:

    Wish I could. :/ The Linux language to me and my friends is like trying to learn Japanese in Germany. o_O;;

  3. Phil says:

    After a network vulnerability, system load was my second inclination. That or a network attack. The trouble is it just doesn’t seem to pan out. Here’s tracking from last week:

    While the CPU has spiked busy in places (at one point above the average of 4 for the better part of an hour in that graph), the reboots don’t line up. (Reboots are where the green in the bottom graph goes to zero. A lot of the spikes were where I was rebuilding, copying, and stressing the system afterwards.)

    I’m wondering if someone else generating a ton of heat in the rack might cause my machine to overheat?

  4. jadedfox says:

    From working at a colo place in the past: YES, a neighboring machine might be overheating you.

    I know that we ran into a case like that at Shore.Net. The client was livid that the machine kept crashing, we moved it to a new location, and it was fine. We checked everything at that location, and then started monitoring the machines around it. One machine next to it was getting REALLY hot, and it would crash the machines next to it.

    It’s something to look into. Hell, asking to have your machine moved to a different rack may not be a bad idea either.
