
SOLVED! Did I burn my CPUs or main board? Help! BIOS dead!



Snorkfroken
03-11-2006, 03:28 PM
I have a very serious computer problem and will try to describe the events here. Please help me if you have any ideas on what it might be. I recently bought hardware for a new computational x86 server at our university department. To save cash we bought all the parts separately and built and configured it ourselves, having had very good experience with this procedure before. Hardware setup:

MB: K8T Master2-FAR7
CPU: 2 Dual Core Opteron 265
RAM: 4GB
SATA: 2 Seagate Barracuda 7200.9 200GB (contains / and /home)
IDE: 1 Western Digital 80GB (contains /boot, /tmp and some other directories with scientific packages), 1 Sony DVD player
AGP: XFX GeForce MX4000
PCI: 1 3Com Gigabit LAN
PSU: Silverstone ST65ZF 650W

I installed Quantian on this system (a scientific OpenMosix mod of Knoppix) and ran the 2.4.27-om OpenMosix kernel, which seemed stable and fast and ran just fine for a couple of days. I did, however, notice some oddities with the DVD player, which at one point complained about lost interrupts. Something like this (I don't remember all of it; the DVD is hda):

hda: lost interrupt
hda: dma_timer_expiry: dma status == 0x24

At that point the DVD did not work, so I rebooted the system and it worked fine. I didn't have to use the DVD much anyway, so I didn't really pay much attention to it, though I now know it has to do with the APIC on dual-CPU systems and that noapic could have helped. I then had the system up and running for a day or so, compiled some scientific packages and ran them to test the system, which was all good and stable.

Yesterday I started one long-running MrBayes analysis, and when I logged in over ssh today to have a look at it I noticed that it had simply stopped. It was completely frozen, no error messages, no nothing: it was just hanging there. I also noticed some odd things with the ssh terminal. For instance, I could not use TAB to autocomplete commands; the terminal would hang and I needed to close the terminal window and log in again in a new window. If I just logged in and tried it, TAB was dead. But if I su'ed to another user, it worked fine. I therefore had a look at the /etc/profile file, which is executed by bash upon login but not when you do su (I believe). /etc/profile reminded me that I had paths pointing to my other IDE device, the hard drive. I ran dmesg, and its output was filled with lost interrupts for the hdc device. I also realised I could not execute any programs or commands that were not already in memory: df, ls, cd and less worked fine, but mc, for instance, made my terminal freeze, and poweroff didn't do anything.
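When dmesg is flooded like this, it can help to count the messages per device. Below is a minimal sketch in Python, assuming the log lines look like the `hda: lost interrupt` example quoted earlier; the function name `count_lost_interrupts` and the sample text are made up for illustration, and the optional bracketed timestamp prefix is an assumption about newer kernels, not something from this machine.

```python
import re
from collections import Counter

# Matches lines such as "hda: lost interrupt". The optional "[...]" prefix
# allows for kernel timestamps (an assumption; the 2.4 log above has none).
LOST_IRQ = re.compile(r"^(?:\[[^\]]*\]\s*)?(hd[a-z]): lost interrupt\b")


def count_lost_interrupts(log_text):
    """Return a Counter mapping IDE device name to its lost-interrupt count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOST_IRQ.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample only, modelled on the messages quoted in this post.
sample = """hda: lost interrupt
hda: dma_timer_expiry: dma status == 0x24
hdc: lost interrupt
hdc: lost interrupt
"""

if __name__ == "__main__":
    for device, count in sorted(count_lost_interrupts(sample).items()):
        print(device, count)
```

To use it on a real system, save the dmesg output to a text file and pass its contents to `count_lost_interrupts` instead of `sample`.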

So I decided to go to my office and have a look at the machine. Fans all right, keyboard working. Local terminals behaved exactly like the ssh ones described above. Poweroff didn't work, so I powered off the computer using the power button. And now it does not boot anymore. It beeps, displays the nvidia card info screen, loads the BIOS and tells me that the CPU is a Dual Core Opteron 265; it tests the memory and then it freezes. Nothing more after that. So I never get to see the list of IDE devices, and although I can hit Escape on the keyboard to stop the memory test scroller, hitting Delete does not take me into the BIOS.

One thing I noticed is that before this happened the CPU info line in the boot screen said something like:
Dual Core Opteron 265 , 4 cpus

Now it just says:
Dual Core Opteron 265

I have removed all disk drives and reset the CMOS RAM with the onboard jumper. No difference.

So, finally: can it be that the CPUs are dead (then how could I log in before...?)? The programs we use for computation run the CPUs at 100% for up to several weeks. I did notice that the 2.4.27 Linux boot-up complained about some missing ACPI functions. I also use the automatic fan control on the MB to increase fan speed with higher temperature. Could that be non-functional if the OS is missing ACPI extensions, so that the CPUs became too hot and burnt? Seems strange to me. Or is it the MB that is dead?

Someone please shed some light on this problem! I would be grateful!

Snorkfroken
03-11-2006, 06:19 PM
Well, cry wolf and all of that. I should have learnt by now that these kinds of super cryptic errors are most often due to faulty memory chips. I removed the RAM modules systematically and found that one 1GB module was bad. Now it all starts nicely (this time with the noapic cheatcode). Thanks anyway!

jjmac
03-14-2006, 10:43 AM
>>
I did notice that the 2.4.27 Linux boot-up complained about some missing ACPI functions.
>>

2 Dual Core Opterons ... I would have thought they would also benefit from the better support offered in the 2.6.x kernel.

With the ck patch of course, but I am biased in that regard :)


jm

