smp cpu2 slow





From: Bela Lubkin <belal@caldera.com>
Subject: Re: Any Known Issue with SMP Ver 1.1.1Ga for OpenServer 5.0.6?
Date: Sat, 2 Nov 2002 06:30:36 GMT
References: <ciHw9.166205$%h2.53807@news02.bloor.is.net.cable.rogers.com>





JP wrote:



> I have posted some messages regarding some unexplainable %sys activities in
> SCO even when system was not running any tasks.  It's been a few months but
> the problem was still not resolved.

> The system has two Xeon processors running SMP 1.1.1Ga.  The problem will
> disappear if I keep rebooting and run sar to verify.  There is another way
> to bring the %idle back to 100.  The trick is to deactivate the second CPU.
> That way I can see %sys shown as zero and %idle go back to 100.

> I have physically swapped the 2 CPU's but it did not make any difference.
> No matter which CPU take turn as the "second" CPU, every time I disable the
> second CPU, I can see a better performance from the sar.  I cannot help
> wondering if there is any bug or incompatibility between SMP and the
> hardware.  I am using HP NetServer LH3000 U3 (HP part no. P2482b).














I'm having a lot of trouble following the discussion since you keep
posting from different accounts, with different subject lines, and
including either way too much or no context at all...  Some of what I'm
going to say here might not be accurate due to lack of that context.



The "smoking gun" symptom that you're seeing -- a couple of percent of
sys time on the 2nd CPU -- is rather subtle.  Most people would never
notice it, and you wouldn't be concerned about it either if there wasn't
a secondary symptom.  You say that your users complain about performance
when the system is in the "bad" state.  You need to characterize _that_
performance problem, because it is _not_ caused by the 2% sys time.  I
am not going to accept that your users are so performance-sensitive that
they notice and complain about a 1-2% difference.



I'm not saying that the two aren't tied together.  But you must write a
description of the actual performance issues your users see, which must
be something other than "their jobs only run at 98-99% of normal speed".






No, there is no known issue that fits the description you've been giving,
as far as anyone can tell.



You've also made inconsistent statements which are making it very hard
to follow the problem.  Previously you've shown how you reboot and
sometimes see 2% sys time, sometimes see 0%.  You said that the users do
not complain on a "good" boot where %sys is 0.  In fact you seem to say
so in this message, but then you also say "every time I disable the
second CPU, I can see a better performance from the sar".  That's
confusing the matter.  If you mean "with two CPUs I see some bad boots
and some good; with one, all boots are good" -- say that.



From the symptoms you describe, I am _guessing_ at a possible cause.  It
_sounds_ like the 2nd CPU is running at a greatly reduced speed.  The
"smoking gun" symptom would happen because the CPU is never entirely
idle.  Both CPUs take 100 timer ticks per second, for instance.  With a
normally functioning full speed CPU, handling those ticks probably takes
less than .1 % of a CPU.  Now suppose the CPU was running 20x slower,
for some reason.  It would still take as many clock cycles to handle
each timer tick, but now there are 1/20 as many total clock cycles per
realtime second, so now the CPU is 2% busy.
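
The arithmetic above can be sketched in a few lines of shell.  This is
only an illustration of the guess, not a measurement: the 0.1% baseline
is the upper bound quoted above, the 20x slowdown is the hypothetical
factor from the example, and tick_busy_pct is a name made up for this
sketch.

```shell
#!/bin/sh
# tick_busy_pct BASE_PCT SLOWDOWN
# The work per timer tick (in clock cycles) and the number of ticks per
# second both stay the same, so the share of the CPU spent on ticks
# grows in direct proportion to the slowdown factor.
# BASE_PCT and SLOWDOWN are illustrative assumptions, not measured values.
tick_busy_pct() {
    awk -v base="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", base * slow }'
}

echo "full speed: `tick_busy_pct 0.1 1`% busy"
echo "20x slower: `tick_busy_pct 0.1 20`% busy"
```

With those assumed inputs the second line reports 2.0% busy, which is
in the range of the %sys figure being discussed.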









To users, this would show up as some sort of erratic slow execution.
Each process starts out on one CPU or the other and tends to stick to
the same CPU, but can also migrate depending on system activity.
Running the same job repeatedly would result in varying runtimes.



And of course this is only happening during some boots.



To summarize, this is my _guess_ based on the symptoms:



  - some boots are fine, both CPUs work at full speed
  - some boots are bad, CPU 2 runs at a significantly reduced speed
    (for an as-yet unknown reason)
  - when CPU 2 is running slowly, your users complain about performance,
    and you notice it in `sar` because just handling the timer ticks
    takes enough CPU to be noticeable



Here's a shell script which may help diagnose this.  It starts up one
process per CPU, then in each process, times a simple spin loop several
times.  If the system is otherwise idle, each process will end up
running on a separate CPU.  It should be quite obvious in the output if
one process is running significantly faster than the other.  In that
case, _which_ process runs faster may change over the course of a run,
but you'll still see that something weird is happening.



On an idle system, what you _should_ see is that each loop takes about
the same time (with maybe up to 10% variation), and the entire set of
loops run by each process should end at about the same time.  If one CPU
is running significantly faster, then you'll see some loops that take a
lot less time, and one process may finish long before the other.



VERY IMPORTANT: run this at least once on a "good" 2-CPU boot and once
on a "bad" 2-CPU boot.  The point is to compare behavior between the two
states.
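
To make the good-boot and bad-boot runs easier to compare, a small
filter along these lines might help.  It is a sketch only: it assumes
each line of saved output looks like
"Process 1 loop 3: real 1.0 user 0.9 sys 0.0", which may not match the
format your /bin/time prints, and summarize_spin and the file names in
the usage note are made up for this example.

```shell
#!/bin/sh
# summarize_spin: read saved spin-test output on stdin and print, for
# each process, the number of loops and the min / mean / max "real" time.
# Assumed input line format (adjust if your time command prints differently):
#   Process 1 loop 3: real 1.0 user 0.9 sys 0.0
summarize_spin() {
    awk '
    $1 == "Process" {
        p = $2
        # Find the word "real" and take the number that follows it.
        for (i = 3; i < NF; i++)
            if ($i == "real") {
                t = $(i + 1) + 0
                if (!(p in n) || t < min[p]) min[p] = t
                if (!(p in n) || t > max[p]) max[p] = t
                n[p]++
                sum[p] += t
            }
    }
    END {
        for (p in n)
            printf "process %s: loops=%d min=%.2f mean=%.2f max=%.2f\n",
                p, n[p], min[p], sum[p] / n[p], max[p]
    }' | sort
}
```

For example, save one run from each kind of boot (say good-boot.txt and
bad-boot.txt) and run "summarize_spin < good-boot.txt" and
"summarize_spin < bad-boot.txt".  If the guess is right, the mean for
one process should be far higher on the bad boot, or the min and max
should be far apart if processes migrated between CPUs during the run.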



>Bela<



=============================================================================



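#!/bin/sh
# Time a simple spin loop on every CPU at once, to show whether one CPU
# is running slower than the other.

LOOPS=10        # how many times to run the outer loop
SPINS=2000000   # adjust this manually so each loop takes about 1 second

# "--" marks the end of kill's options; the PIDs are appended below
procs=--
# on hangup, interrupt, quit or terminate: signal the background loops, then exit
trap 'kill -1 $procs >/dev/null 2>&1; exit' 1 2 3 15
# number of CPUs, as reported by SCO's uname -X
ncpu=`LANG=C uname -X | awk '/NumCPU/ { print $3 }'`
proc=0
# start one background timing loop per CPU
while [ $proc != $ncpu ]; do
  proc=`expr $proc + 1`
  loop=1
  while [ $loop -le $LOOPS ]; do
    # /bin/time reports on stderr; 2>&1 puts its report on the echo line
    echo Process $proc loop $loop: `/bin/time awk 'BEGIN { for(i=0; i<'$SPINS'; i++) ; }' 2>&1`
    loop=`expr $loop + 1`
  done &
  procs="$procs $!"   # remember the background loop's PID for the trap
done
wait                  # block until every background loop has finished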











/Bofcusm/1695.html copyright 1997-2004 Bela Lubkin All Rights Reserved
