SGI Power Challenge Array (4x16 R8000 CPUs)
GRID: 32 x 32 x 32 per processor (tile)
COMMENT: On just a few processors, ZEUS-3D is faster for this small problem size because
a substantial fraction of the data (less than 7 MB) fits in the 4 MB cache. The
ZEUS-MP optimizations that improve the reuse of cached data are therefore largely
wasted.
COMMENT: When communicating across POWERnodes (more than 16 threads), SGI native MPI uses
HIPPI for messages larger than 8 KB and sockets for smaller messages. Although
HIPPI has the higher bandwidth, it also has a longer latency than sockets, so short
messages take longer over HIPPI. Even with this small tile size, most messages exceed
8 KB (see the discussion above under COMMUNICATION), so the long latency of HIPPI
is probably responsible for the poor scaling across more than 16 threads.
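The latency/bandwidth trade-off described in the comment above can be sketched with a simple cost model, time = latency + size / bandwidth. The latency and bandwidth values below are made-up placeholders, not measurements of HIPPI or sockets on this machine, and the function names are illustrative only.

```python
# Illustrative cost model for one message: time = latency + size / bandwidth.
# All latency and bandwidth numbers here are hypothetical placeholders.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to move one message over a link."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

def crossover_size(lat_fast, bw_fast, lat_slow, bw_slow):
    """Message size above which the high-bandwidth, high-latency link wins.

    Solves lat_fast + s / bw_fast == lat_slow + s / bw_slow for s.
    """
    return (lat_fast - lat_slow) / (1.0 / bw_slow - 1.0 / bw_fast)

# Hypothetical links: one with high bandwidth but long latency,
# one with low bandwidth but short latency.
HIGH_BW = (1.0e-3, 50.0e6)   # (latency in s, bandwidth in bytes/s), assumed
LOW_BW = (0.2e-3, 5.0e6)     # (latency in s, bandwidth in bytes/s), assumed

s = crossover_size(*HIGH_BW, *LOW_BW)
print(f"crossover at about {s:.0f} bytes")
for size in (1024, 8 * 1024, 1024 * 1024):
    print(size, transfer_time(size, *HIGH_BW), transfer_time(size, *LOW_BW))
```

In a model like this, messages only slightly above the crossover size gain little from the faster link, while the latency cost is paid on every message.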
ZEUS-MP (10 steps)
Processors Layout Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 1x1x1 4.68 4.64 70334 61.47 0.32 1.00 1.00
2 2x1x1 5.03 4.97 130881 114.39 0.60 1.86 0.93
4 2x2x1 5.00 4.95 263300 230.13 1.21 3.74 0.94
8 2x2x2 5.33 5.25 495453 433.04 2.28 7.04 0.88
16 4x2x2 6.26 6.03 864381 755.50 3.98 12.29 0.77
32 4x4x2 16.50 16.07 648345 566.68 2.98 9.22 0.29
64 4x4x4 45.61 42.76 467040 408.21 2.15 6.64 0.10
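The two speedup columns appear to be derived from the Zone-Cycles/sec column: Speedup is the ratio to the 1-processor rate, and Speedup/Processor divides that ratio by the processor count. The sketch below checks this reading against the ZEUS-MP table above; the function and variable names are illustrative and not part of ZEUS-MP.

```python
# Zone-Cycles/sec from the ZEUS-MP table above (32 x 32 x 32 tiles),
# keyed by processor count.
RATES = {1: 70334, 2: 130881, 4: 263300, 8: 495453,
         16: 864381, 32: 648345, 64: 467040}

def scaling(rates):
    """Return {processors: (speedup, speedup_per_processor)}.

    Speedup is measured relative to the 1-processor rate.
    """
    base = rates[1]
    return {n: (r / base, r / base / n) for n, r in sorted(rates.items())}

for n, (speedup, per_proc) in scaling(RATES).items():
    print(f"{n:3d}  {speedup:6.2f}  {per_proc:5.2f}")
```

Printed to two decimals, these values match the Speedup and Speedup/Processor columns of the table above.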
ZEUS-3D (20 steps) (same layout)
Processors Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 8.00 8.10 80932 70.74 0.37 1.00 1.00
2 9.00 8.18 160142 136.60 0.50 1.98 0.99
4 12.00 10.05 260945 218.15 0.77 3.22 0.81
8 17.00 15.12 346802 283.81 0.97 4.29 0.54
16 18.00 15.85 661638 535.75 1.49 8.18 0.51
GRID: 64 x 64 x 64 per processor (tile)
COMMENT: ZEUS-MP outperforms ZEUS-3D even on 1 processor.
ZEUS-MP (10 steps)
Processors Layout Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 1x1x1 39.80 39.38 66115 54.11 0.18 1.00 1.00
2 2x1x1 40.80 40.34 129099 105.65 0.36 1.95 0.98
4 2x2x1 41.69 41.24 252549 206.67 0.71 3.82 0.95
8 2x2x2 45.65 45.07 461475 377.65 1.29 6.98 0.87
16 4x2x2 68.46 66.65 621333 508.47 1.74 9.40 0.59
32 4x4x2 75.70 73.13 1126290 921.70 3.15 17.04 0.53
64 4x4x4 113.49 109.70 1503090 1230.06 4.20 22.73 0.36
ZEUS-3D (10 steps) (same layout)
Processors Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 53.00 52.40 50026 40.94 0.14 1.00 1.00
2 53.00 51.30 102201 82.75 0.23 2.04 1.02
4 77.00 56.88 184355 149.28 0.42 3.69 0.92
8 121.00 79.90 262467 212.53 0.59 5.25 0.66
16 203.00 189.84 220941 178.90 0.50 4.42 0.28
GRID: 128 x 64 x 64 per processor (tile)
COMMENT: I tried to run a 128-cubed problem with ZEUS-MP, but the system kept crashing
(with no error messages) on 16 or more processors.
Each 128-cubed tile requires about 256 MB, so running with this tile size on
16 processors would use about 4 GB. In fact, I had no success with
any MPI run whose total memory requirement exceeded 2 GB. Moreover, the
32 processor run below does not get the correct answer -- the timestep is 0!
These problems have been fixed in IRIX 6.2.
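The memory estimate above is plain multiplication, and the same figure can be scaled to other tile sizes. The sketch below assumes that memory grows linearly with the number of zones and ignores ghost zones and any fixed per-process overhead, so it is only a rough guide; the helper names are illustrative.

```python
MB_PER_128_CUBED_TILE = 256  # "about 256 MB", from the comment above

def tile_mb(nx, ny, nz):
    """Rough tile memory in MB, scaled by zone count from the 128-cubed figure.

    Assumption: linear in the number of zones; ghost zones and fixed
    overhead are ignored.
    """
    return MB_PER_128_CUBED_TILE * (nx * ny * nz) / 128**3

def total_gb(nx, ny, nz, processors):
    """Rough total memory in GB for one tile per processor."""
    return tile_mb(nx, ny, nz) * processors / 1024

print(total_gb(128, 128, 128, 16))  # 16 tiles of 128-cubed
print(total_gb(128, 64, 64, 32))    # 32 tiles of 128 x 64 x 64
```

Under this linear assumption the 16-processor 128-cubed case comes to 4 GB, as stated, and the 32-processor 128 x 64 x 64 case comes to about 2 GB, which sits right at the limit mentioned above.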
COMMENT: For ZEUS-MP, the speedup is nearly the same as it is for the 64-cubed tiles,
so the long latency of HIPPI apparently has no impact on scaling here.
HIPPI's bandwidth is probably just not high enough to keep up with the processors for 64-cubed
or larger tiles.
ZEUS-MP (10 steps)
Processors Layout Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 1x1x1 75.50 74.68 69705 54.16 0.33 1.00 1.00
2 1x1x2 75.50 74.57 139601 108.46 0.67 2.00 1.00
4 1x2x2 76.69 75.79 274691 213.42 1.31 3.94 0.99
8 2x2x2 83.56 82.53 504319 391.83 2.40 7.24 0.90
16 2x4x2 148.44 129.34 640063 497.32 3.03 9.18 0.57
32 2x4x4 149.24 140.89 1169500 908.69 5.54 16.77 0.52
ZEUS-3D (7 or 10 steps) (same layout)
Processors Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 106.00 104.23 50302 40.73 0.11 1.00 1.00
2 111.00 109.03 96169 77.87 0.22 1.91 0.96
4 124.00 120.21 174458 141.26 0.39 3.47 0.87
8 194.00 183.53 228535 185.05 0.52 4.54 0.57
16 5371.00 1123.20 52281 42.33 0.12 1.04 0.06
WORK IS CONSTANT
GRID: Tile size adjusted to make the full mesh 128 x 128 x 128
COMMENT: The ZEUS-MP tile size is 32-cubed for 64 processors.
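With the total work held constant, the tile size follows from dividing the full mesh by the processor layout along each axis. The sketch below assumes that the three layout factors map onto the three mesh axes in the order listed, which the tables do not state explicitly; the function name is illustrative.

```python
def tile_shape(mesh, layout):
    """Divide a global mesh evenly among a processor layout, axis by axis.

    Assumption: layout factors correspond to mesh axes in the same order.
    """
    if any(m % p for m, p in zip(mesh, layout)):
        raise ValueError("mesh is not evenly divisible by layout")
    return tuple(m // p for m, p in zip(mesh, layout))

print(tile_shape((128, 128, 128), (4, 4, 4)))  # 64 processors
print(tile_shape((128, 128, 128), (2, 4, 4)))  # 32 processors
```

For the 4x4x4 layout this gives the 32-cubed tile quoted above; for smaller layouts the tiles are larger and not necessarily cubic.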
ZEUS-MP (10 steps)
Processors Layout Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 1x1x1 350.65 346.84 60077 47.66 0.31 1.00 1.00
2 1x1x2 146.91 145.21 143305 113.70 0.75 2.39 1.19
4 1x2x2 76.78 75.92 274217 217.56 1.43 4.56 1.14
8 2x2x2 45.65 45.07 461475 366.13 2.41 7.68 0.96
16 2x2x4 32.23 31.57 657477 521.64 3.43 10.94 0.68
32 2x4x4 26.26 25.18 817647 648.72 4.27 13.61 0.42
64 4x4x4 45.61 42.76 467040 370.55 2.44 7.77 0.12
ZEUS-3D (10 steps) (same layout)
Processors Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 410.00 408.70 51313 41.55 0.12 1.00 1.00
2 214.00 211.09 99351 80.45 0.22 1.94 0.97
3 152.00 148.65 141080 114.24 0.32 2.75 0.92
4 119.00 114.88 182557 147.82 0.41 3.56 0.89
5 104.00 99.72 210300 170.29 0.47 4.10 0.82
6 92.00 88.70 236437 191.45 0.53 4.61 0.77
7 88.00 84.71 247569 200.46 0.56 4.82 0.69
8 85.00 79.98 262212 212.32 0.59 5.11 0.64
9 83.00 78.19 268202 217.17 0.60 5.23 0.58
10 80.00 75.97 276053 223.53 0.62 5.38 0.54
11 79.00 74.33 282132 228.45 0.64 5.50 0.50
12 79.00 73.29 286162 231.71 0.65 5.58 0.46
13 76.00 71.33 294009 238.07 0.66 5.73 0.44
14 75.00 70.19 298785 241.93 0.67 5.82 0.42
15 70.00 64.69 324182 262.50 0.73 6.32 0.42
16 67.00 60.77 345097 279.43 0.78 6.73 0.42
GRID: Tile size adjusted to make the full mesh 256 x 256 x 256
COMMENT: The ZEUS-MP tile size is 64-cubed for 64 processors.
COMMENT: The ZEUS-3D data was obtained from an ordinary batch job (not dedicated).
ZEUS-MP (10 steps)
Processors Layout Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 1x1x1 666.38 659.53 63180 49.09 0.30 1.00 1.00
2 1x1x2 671.53 664.60 125393 97.42 0.60 1.98 0.99
4 1x2x2 679.97 672.85 247688 192.44 1.18 3.92 0.98
8 2x2x2 394.17 389.89 427417 332.10 2.11 6.77 0.85
16 2x2x4 241.92 237.11 701097 544.74 3.32 11.10 0.69
32 2x4x4 149.24 140.89 1169500 908.69 5.55 18.51 0.59
64 4x4x4 113.49 109.70 1503090 1167.88 7.14 23.79 0.37
ZEUS-3D (3 to 10 steps) (same layout)
Processors Wall Clock(s) tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Speedup/Processor
1 1005.00 1000.50 50308 40.74 0.11 1.00 1.00
2 1014.00 1005.10 100154 81.10 0.23 1.99 1.00
4 912.00 890.21 188464 152.60 0.43 3.75 0.94
8 672.00 647.26 259202 209.88 0.58 5.15 0.64
16 643.00 602.86 278295 225.34 0.63 5.53 0.35