Arima NM46X Bad DIMM slots 1 & 2 beside CPU2?


26 replies to this topic

#1 wbelk7777

wbelk7777

  • Members
  • 12 posts
  • OFFLINE
  •  
  • Local time:05:03 AM

Posted 20 June 2014 - 06:49 PM

I have an Arima NM46X motherboard with dual quad-core AMD Opteron 2356 processors and 32GB of RAM. The motherboard has 16 DIMM slots with 2GB of memory in each slot. The RAM is PC2-5300P ECC memory. The operating system is Windows 2008 Server. ftp://ftp.sgi.com/public/Technical%2...NM46X_V097.pdf

There appears to be a problem with DIMM slots 1 & 2 beside CPU2 on the motherboard. When I install memory in these slots and run Memtest86+, it causes the server to shut down.

With all of the DIMM slots filled (2GB in each slot) the server will boot normally and will see all 32GB of RAM in the BIOS, but while running Memtest86+ the server shuts down. I've tried installing different RAM modules and using another known good power supply; in both cases I get the same result. If I remove the RAM from these two DIMM slots, the server runs Memtest86+ fine.

I’ve tried several different memory tests (memtest HCI and memtest86) with the same results. When you put the CPU under load with memory in slots 1 & 2 beside CPU2 the server shuts-down. If I remove the memory from slots 1 & 2 beside CPU2 the server tests fine. I’ve replaced the memory in slots 1 & 2 beside CPU2 with known good memory, but the server still crashes.

I’ve been monitoring the on-chip CPU temps with CPUID Hardware Monitor and they never get above 35 degrees Celsius.

Does anyone have any ideas on what could be causing this?

Thank you for your help.



#2 Platypus

Platypus

  • Moderator
  • 12,894 posts
  • OFFLINE
  •  
  • Gender:Male
  • Location:Australia
  • Local time:08:03 PM

Posted 20 June 2014 - 07:37 PM

My first approach would be to examine the contacts of those two DIMM slots under magnification (e.g. jeweller's loupe) to see if any contacts are obstructed, out of position or appear sprung (opened out wider to give loose contact) or show tarnish/corrosion or pitting. It could also be useful to examine the contact pads of the original DIMMs from that location to see if they show any hint of having had a bad contact when they were being used. Also it's unclear whether the SMT (Surface Mount) chips under the CPU securing arm are related to the DIMM bank, however they are the same in number as the slots, so maybe check to ensure there's no sign the arm or a tool has struck any of the chips when the CPU was being fitted, and possibly caused a latent fault there.

 

Other than that, I suspect such a problem is technically more complex. If you or a helpful tech felt like removing the board and examining the underside, the solder on all the DIMM pins could be checked for cracks. With a multi-layer board which has plating through, this is not common though. My other thought is that since it is the DIMMs closest to the CPU, the board may have been affected internally by thermal or physical stress (from the CPU mounting), and developed a crack or interlayer short on interconnecting tracks, which would not be feasible to correct.


Top 5 things that never get done:

1.


#3 wbelk7777

wbelk7777
  • Topic Starter


Posted 20 June 2014 - 10:31 PM

Platypus,

 

Thank you for your recommendations.  I used a magnifying glass to examine the two DIMM sockets.  I didn't see anything glaringly obvious jump out at me.  The chips beside the CPU look fine.  I'm debating whether to remove the motherboard or not to look underneath.  It is a long process.  I'm thinking about just accepting the fact that these two DIMM slots are defective and using the server as is with 28GB of RAM.  What do you think are the chances that the issue will spread to the remaining DIMM slots?



#4 Platypus

Platypus


Posted 20 June 2014 - 11:11 PM

You may be best to just go with 28G, unless you can source some 4GB DIMMs to try. It's not clear to me from the manual whether sizes can be mixed.

 

Without knowing the actual cause, we can't really know if it's likely to progress. I know a guy who's had an Apple G5 with 2 non-working slots for 7 years.




#5 wbelk7777

wbelk7777
  • Topic Starter


Posted 24 June 2014 - 01:16 AM

Update on DIMM slots.

 

I went back to the drawing board and started focusing on the eight DIMM slots beside CPU2.  I tested each slot individually with a known good stick of RAM.  To my surprise, all of the DIMM slots tested fine.  What I found was that I could use up to 7 of the 8 slots and the server would run through Memtest86 test 5 using parallel processing without errors.  I ran these tests for at least ten passes, in most cases more.  It didn’t matter which slot I left open; as long as I left one open it would run Memtest86 without an issue.  As stated previously, my memory is PC2-5300 667 MHz DDR2 with timings of 5-5-5-12.  I have tried all Micron brand memory in the slots and all Hynix brand memory in the slots, with the same results.  The memory sticks were all identical in specs.

 

There are limited options in the BIOS as to what I can change in regard to the RAM.  I can change the speed from 667 MHz to 533 MHz to 400 MHz, and I can change the HT-LDT speed from 1000 MHz to 200 MHz in 200 MHz increments.  I cannot change memory timings or voltages.

 

If I lower the clock speed from 667 MHz to 533 MHz, I can run the server with all the slots filled and see all 32GB of memory.  I’m running Memtest as I type this, so far 2 passes with no errors.  Usually with all the slots filled at 667 MHz the server will shut down before the first pass has finished.
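
To put the speed change in perspective, here is a minimal Python sketch of the theoretical peak bandwidth of a single 64-bit DDR2 channel at the three speeds the BIOS offers. It assumes the BIOS values are transfer rates in MT/s, and it ignores ECC bits, timings, and controller overhead. The function name is made up for illustration.

```python
def peak_bandwidth_mb_s(transfer_rate_mt_s: float, bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth of one memory channel in MB/s.

    Assumption: the rate is in millions of transfers per second and every
    transfer moves bus_width_bits of data (ECC bits not counted).
    """
    return transfer_rate_mt_s * bus_width_bits / 8


full_speed = peak_bandwidth_mb_s(667)
for rate in (667, 533, 400):  # the three speeds selectable in the BIOS
    bw = peak_bandwidth_mb_s(rate)
    print(f"DDR2-{rate}: {bw:.0f} MB/s ({bw / full_speed:.0%} of full speed)")
```

On these assumptions, dropping from 667 to 533 gives up roughly a fifth of the peak figure, which may be an acceptable trade if it keeps the server stable with all slots filled.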

 

It also looks like I’m going to have to purchase a fan for the RAM at the rear of the case, since it’s in a hot spot and the heat causes the tests to throw errors.  With a fan blowing on the RAM it tests out fine.

 

Any thoughts on what can be causing these issues?  Thank you for your help.



#6 Platypus

Platypus


Posted 24 June 2014 - 06:23 AM

Some good sleuthing there, and to me it looks to be pointing at a problem with the drive capability of the memory controller. I think the suggestion on TSF of swapping the CPUs is a good diagnostic step - if the behaviour follows the CPU it would indicate the controller on the CPU is having a problem.

 

If it stays with the DIMM bank, it would seem the problem is on the HyperTransport bus or CPU socket. Since there is also misbehaviour with temperature, if the problem doesn't follow the CPU, it's possible the root cause could be something like an electrolytic capacitor or capacitors around the CPU and DIMM slots. The surface mount capacitors (bare metal ones) have had reliability problems at times. The PSU could also be a possible candidate.


Edited by Platypus, 24 June 2014 - 06:24 AM.
Typo



#7 wbelk7777

wbelk7777
  • Topic Starter


Posted 25 June 2014 - 12:15 AM

Update on motherboard

 

I replaced CPU2 with a spare AMD Opteron 2356 that I had, with the same results.  When running the RAM at 667 MHz through Memtest86 test 5 parallel processing with all 8 CPU2 DIMM slots filled (CPU1 DIMM bank empty), the server would shut down as before.  Remove one stick of RAM and it would run fine.  I then switched the processors.  I took the original processor in the CPU2 socket and moved it to the CPU1 socket, and moved the processor in the CPU1 socket to the CPU2 socket.  Again, with no memory in the CPU1 DIMM bank and eight sticks in the CPU2 DIMM bank running at 667 MHz, the server shut down just the same.  Remove a stick to reduce the number to seven and it passes Memtest86 with flying colors.  For fun I moved the RAM from the CPU2 bank to the CPU1 bank, where the original CPU2 socket processor was now.  I was able to run all 8 sticks of RAM at full speed (667 MHz) through Memtest86 test 5 parallel processing and it never missed a beat.
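
The reasoning behind the swap test can be sketched in a few lines of Python. This is only an illustration: the labels "X" (the processor that started in the CPU2 socket) and "Y" (the one that started in the CPU1 socket), the `Run` record, and the `fault_follows` helper are all made up here, and "bank" stands for everything on that side of the board (socket, slots, traces, power delivery), not just the DIMM slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    cpu: str      # hypothetical label for the processor serving the tested bank
    bank: str     # "CPU1" or "CPU2" side of the board
    dimms: int    # number of DIMMs fitted in that bank
    passed: bool  # True if the memory test completed without a shutdown


# Observations from the swap test above, all at 667 MHz.
RUNS = [
    Run("X", "CPU2", 8, False),  # before the swap: shutdown
    Run("Y", "CPU2", 8, False),  # after the swap: still shuts down
    Run("Y", "CPU2", 7, True),   # one stick removed: passes
    Run("X", "CPU1", 8, True),   # same processor, other bank: passes
]


def fault_follows(runs, full=8):
    """Return 'bank', 'cpu', or 'inconclusive' for fully populated runs."""
    full_runs = [r for r in runs if r.dimms == full]
    failing_cpus = {r.cpu for r in full_runs if not r.passed}
    passing_cpus = {r.cpu for r in full_runs if r.passed}
    failing_banks = {r.bank for r in full_runs if not r.passed}
    passing_banks = {r.bank for r in full_runs if r.passed}
    # A component "explains" the fault if it never appears on both sides.
    cpu_consistent = failing_cpus.isdisjoint(passing_cpus)
    bank_consistent = failing_banks.isdisjoint(passing_banks)
    if bank_consistent and not cpu_consistent:
        return "bank"
    if cpu_consistent and not bank_consistent:
        return "cpu"
    return "inconclusive"


print(fault_follows(RUNS))
```

Processor X both fails (next to the CPU2 bank) and passes (next to the CPU1 bank), so the processor alone cannot explain the shutdowns, while the CPU2 side of the board fails with either processor.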

 

Even though I have a brand new high quality power supply, I still tried a second power supply, but I got the same results previously mentioned.

 

Is there anything else that I can test before we declare the motherboard absolutely defective?

 

Thanks again for your insight and patience in working through this issue.



#8 zingo156

zingo156

  • BC Advisor
  • 3,333 posts
  • OFFLINE
  •  
  • Gender:Male
  • Local time:03:03 AM

Posted 25 June 2014 - 08:10 AM

After your last test, I highly suspect a mainboard problem... Just out of curiosity, have you tried a Prime95 test from any version of Windows?


If I am helping you with a problem and I have not responded within 48 hours please send me a PM.

#9 Platypus

Platypus


Posted 25 June 2014 - 08:39 AM

Like zingo, I can't point to anything that would take us beyond the conclusion that it really seems to be a mainboard fault. I think you've done thorough faultfinding, and probably the remaining courses of action would be speculative like replacing all capacitors to see if that sorts it, or analysis in a capable workshop with some flash logic probes and the like to try to actually see where it's failing. Don't know if you'd want to follow either of these courses?




#10 wbelk7777

wbelk7777
  • Topic Starter


Posted 25 June 2014 - 09:10 AM

Zingo,

 

Thanks for the reply.  I haven't tried Prime95 in Windows, but I did run Memtest HCI in Windows.  I ran the test with 8 sessions open testing 2048 each.  With all the CPU2 DIMM slots filled, it caused the server to shut down.  With one DIMM removed it caused the system to freeze up, but it didn't shut down.  Maybe 8 sessions was one too many :)  I should have tried 7 sessions.

 

Platypus,

 

Thanks for the reply.  I've emailed the online retailer where I purchased the motherboard and they're sending me a replacement.  I just wanted to make sure that a replacement motherboard would likely fix my problem.

 

Thanks again.



#11 zingo156

zingo156


Posted 25 June 2014 - 09:43 AM

You have done all the things I would do... Since you have tried a different CPU, different RAM and a different PSU, and also confirmed the RAM was good in other slots, it seems only logical that there is an issue with the board. The only other thing I could think of would be a potential compatibility issue with those tests and your setup, though it seems unlikely...

 

Good hardware should pass testing 24/7, though you may wear things out faster using it at 100% non-stop. I generally run memtest for 72 hours on new server builds I do, sometimes longer with high volumes of RAM, to verify stability and no errors. I also zero all hard drives 2 times and run MHDD before and after, checking for new errors or slow sectors. My new builds are running diagnostics for weeks in some cases, but I tend to go with the "better safe than sorry" motto.


Edited by zingo156, 01 July 2014 - 10:56 AM.


#12 wbelk7777

wbelk7777
  • Topic Starter


Posted 01 July 2014 - 12:53 AM

Update on motherboard memory issues.

 

The replacement motherboard came today.  I spent several hours installing the motherboard, only to find the same exact problem.  When the CPU2 DIMM bank is filled with eight sticks of RAM, the server shuts down during Memtest86+ test #5 parallel CPU cores. 

 

In Memtest it shows the RAM as having the correct CAS timings and the speed is a little under-clocked.

 

Some possibilities I can think of are:

 

1. All the memory in the computer needs to be exactly the same and from the same manufacturer for all 32GB to run at full speed.  The specs for my memory are all the same, but I have 16GB installed from Hynix (all in CPU1 bank) and 16GB installed from Micron (all in CPU2 bank).  In total I have 16GB of Hynix and 26GB of Micron.  Do you think it would be worth buying more Micron RAM so that all of my memory is Micron brand?  I could then fill both CPU1 and CPU2 banks with Micron RAM and remove the Hynix memory completely.

 

2. The RAM voltage is not set correctly (I cannot change these settings).

 

3. There is a major flaw in Memtest86+ in dealing with this chipset/motherboard (unlikely, but a possibility).

 

Thanks for your help.



#13 zingo156

zingo156


Posted 01 July 2014 - 07:17 AM

I would probably try something like Prime95 from Windows just for extra testing. It generally is best to fill every slot with the same exact RAM. Differences in required voltages can cause instability. After the testing you have done, the different RAM seems to be the last likely cause.


Edited by zingo156, 01 July 2014 - 10:43 AM.


#14 Platypus

Platypus


Posted 01 July 2014 - 09:05 AM

Well, I must admit I've been under the impression this was a problem you'd had with the real world operation of the server, that you'd then used memtest to track down the cause of. In the first post when you said:

 

"I’ve tried several different memory tests (memtest HCI and memtest86) with the same results. When you put the CPU under load with memory in slots 1 & 2 beside CPU2 the server shuts-down."

 

does normal server duty with 32G fitted result in shutdowns, or is it only under the load specifically of the memtest that it happens?

 

And for the prospect of the problem being mixed memory types, I would have thought that was eliminated by your results of having the shutdown occur with a single bank filled and coming good with a DIMM removed, where you noted "I have tried all Micron brand memory in the slots and all Hynix brand memory in the slots, with the same results". It's all rather puzzling...




#15 wbelk7777

wbelk7777
  • Topic Starter


Posted 01 July 2014 - 10:45 AM

Zingo,

 

On the first motherboard I did run Memtest HCI in Windows.  I ran the test with 8 sessions open testing 2048 each.  With all the CPU2 DIMM slots filled, it caused the server to shut down.  With one DIMM removed it caused the system to freeze up, but it didn't shut down.  I haven't tried the test on this motherboard yet, but I expect the same results from the way it's behaving.

 

Platypus,

 

Sorry for the confusion.  This is a server that I'm building based on the NM46X motherboard.  I was testing the server for stability before putting it into production.  I was using Memtest86, Memtest86+, and Memtest HCI to test for stability.

 

You make an excellent point about the testing that I did with only the CPU2 DIMM bank being filled with the same exact memory from two well-known brands.  The server shut down during these tests.  This would seem to rule out the possibility of it being a RAM issue.  The only other caveat that I can think of in regard to the RAM is that having only the CPU2 DIMM bank filled is not a supported configuration according to the motherboard manual.

 

You asked if the server shuts down during normal operation.  I haven't run it for more than a couple of hours, but it seems to run Windows 2008 Server fine.  It passed the Windows 2008 Server memory test, but it shut down while running Memtest HCI under Windows 2008 Server with the CPU2 DIMM bank filled.

 

Thank you both for your help.





