ZEUS-MP Benchmarks for Radiation Diffusion
Cray C90
Cray UNICOS machines have a hardware performance monitor (hpm), which reports the number of floating-point operations per CPU second (FLOPS) performed by a given process. The FLOPS for the other machines are derived from the C90 FLOPS and the ratios of the Zone-Cycles/sec.
When compiled with cft77 -ez, the program ran at 1/3 of the rate reported below unless HDF dumps were enabled (which should only have slowed it down, because of the extra I/O). The Cray fpp optimizer is able to fix whatever the problem was with the optimization. The compiler listings indicated that all inner loops in the ZEUS-MP radiation module vectorize regardless of whether cft77, cf77 -Zv, or cf77 -Zp was used, even though the performance differed by a factor of 3.
All routines were compiled with: cf77 -Zp -Wf"-ez"
GRID: 32 x 32 x 32 per processor (tile) (10 steps)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 4.06 4.05 80763 96.40 1.00 1.00 1.00
GRID: 64 x 64 x 64 per processor (tile) (10 steps)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 24.37 20.29 128725 137.71 1.00 1.00 1.00
GRID:128 x 64 x 64 per processor (tile) (10 steps)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 41.42 36.64 142520 149.51 1.00 1.00 1.00
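The derived columns can be approximately reproduced from the raw timings. The sketch below assumes that Zone-Cycles/sec is the total number of zones (tile zones times processor count) times the number of steps, divided by the elapsed time, and that a non-Cray machine's MFLOPS is the C90 hpm rate scaled by the ratio of Zone-Cycles/sec for the same tile size. The function names are illustrative and are not part of ZEUS-MP. The single-processor R10000 rate is taken from the SGI Power Challenge table on this page. Because the tabulated times are rounded, the results agree with the tables only to within about one percent.

```python
def zone_cycles_per_sec(nx, ny, nz, steps, seconds, nprocs=1):
    """Zones updated per second, summed over all processors (assumed definition)."""
    return nprocs * nx * ny * nz * steps / seconds


def mflops_from_c90(zc_rate, c90_zc_rate, c90_mflops):
    """Scale the C90 hpm MFLOPS by the ratio of Zone-Cycles/sec (assumed method)."""
    return c90_mflops * zc_rate / c90_zc_rate


# C90, 32 x 32 x 32 tile, 10 steps, tused = 4.05 s (the table lists 80763)
c90_rate = zone_cycles_per_sec(32, 32, 32, 10, 4.05)

# R10000 on one processor: 136340 Zone-Cycles/sec (the table lists 162.06 MFLOPS)
r10k_mflops = mflops_from_c90(136340, 80763, 96.40)

print(round(c90_rate), round(r10k_mflops, 2))
```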
Cray J90
Cray UNICOS machines have a hardware performance monitor (hpm), which gives the number of floating point operations per CPU second (FLOPS) performed by a given process.
All routines were compiled with: cf77 -Zp -ez
GRID: 32 x 32 x 32 per processor (tile) (10 steps)
GRID: 64 x 64 x 64 per processor (tile) (10 steps)
GRID: 128 x 64 x 64 per processor (tile) (10 steps)
Cray T90
Cray UNICOS machines have a hardware performance monitor (hpm), which gives the number of floating point operations per CPU second (FLOPS) performed by a given process.
All routines were compiled with: cf77 -Zp -ez
GRID: 32 x 32 x 32 per processor (tile) (10 steps)
GRID: 64 x 64 x 64 per processor (tile) (10 steps)
GRID: 128 x 64 x 64 per processor (tile) (10 steps)
Cray T3D
All routines were compiled with: cf77 -c -C cray-t3d -I/usr/include/mpp
All timings are Wall Clock seconds, because the T3D lacks the CPU timer "second". These tests were run under the NQS batch system.
GRID: 32 x 32 x 32 per processor (tile) (10 steps)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
SGI Power Challenge (16 R10000 CPUs)
Most of these tests were performed in a dedicated batch queue.
All routines but one were compiled with: f77 -c -O3 -g3 -w -r10000 -64 -mips4 -OPT:IEEE_arithmetic=3 -OPT:roundoff=3
The main program was compiled at -O2 to get it to work across POWERnodes.
This code is not tuned to reduce TLB misses on this machine. TLB misses account for about 10 percent of the run time for the larger tile sizes, comparable to the time for L2 cache misses.
GRID: 32 x 32 x 32 per processor(tile)
COMMENT: The 24 and 32 processor runs were done in interactive mode
because dedicated time was not available without special
permission. The 24 processor run had 16 threads on one
POWERnode and 8 on the other, while the 32 processor run
actually used 3 POWERnodes, with 16, 10, and 6 processors
respectively, to avoid contention with other jobs. Corresponding
runs in dedicated mode should scale even better.
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 2.43 2.40 136340 162.06 1.69 1.00 1.00
2 2x1x1 2.48 2.48 264495 314.40 3.27 1.94 0.97
2 1x2x1 2.46 2.45 266989 317.36 3.31 1.96 0.98
2 1x1x2 2.46 2.45 267109 317.50 3.31 1.96 0.98
4 2x2x1 2.59 2.58 506763 602.37 6.27 3.72 0.93
4 2x1x2 2.57 2.56 510769 607.13 6.32 3.75 0.94
4 1x2x2 2.57 2.56 510771 607.13 6.32 3.75 0.94
8 2x2x2 2.80 2.80 937048 1113.83 11.60 6.87 0.86
12 3x2x2 2.88 2.86 1373130 1632.19 17.00 10.07 0.84
12 2x3x2 2.91 2.89 1361080 1617.87 16.85 9.98 0.83
12 2x2x3 2.91 2.89 1359110 1615.52 16.83 9.97 0.83
16 4x2x2 2.98 2.87 1811840 2153.67 22.43 13.29 0.83
16 2x4x2 2.91 2.90 1806720 2147.58 22.37 13.25 0.83
16 2x2x4 3.05 3.02 1736330 2063.91 21.50 12.74 0.80
24 4x3x2 ? ? 2500000 2971.62 30.99 18.34 0.76
32 4x4x2 ? ? 3024070 3594.55 37.48 22.18 0.69
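Since the tile size per processor is fixed, the problem grows with the processor count, so the Speedup column appears to be the ratio of aggregate Zone-Cycles/sec to the one-processor rate rather than a ratio of run times. The following is a small sketch under that assumption, with an illustrative function name, using rows from the table above and the C90 rate of 80763 Zone-Cycles/sec for the same tile size:

```python
def scaling_columns(zc_rate, zc_rate_1proc, c90_zc_rate, nprocs):
    """Return (C90s, Speedup, Speedup/Processor) for one table row."""
    c90s = zc_rate / c90_zc_rate          # throughput in units of one C90 CPU
    speedup = zc_rate / zc_rate_1proc     # scaled (fixed tile size) speedup
    return c90s, speedup, speedup / nprocs


# 8 processors, 2x2x2 layout, 32 x 32 x 32 tile
c90s, speedup, per_proc = scaling_columns(937048, 136340, 80763, 8)
print(f"{c90s:.2f} {speedup:.2f} {per_proc:.2f}")
```

For this row the three values round to the tabulated 11.60, 6.87, and 0.86.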
GRID: 64 x 64 x 64 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 21.16 21.08 124146 133.09 0.96 1.00 1.00
2 2x1x1 21.66 21.59 242408 259.87 1.88 1.95 0.98
2 1x2x1 21.58 21.49 243618 261.17 1.89 1.96 0.98
2 1x1x2 21.78 21.61 242246 259.70 1.88 1.95 0.98
4 2x2x1 22.48 22.40 467486 501.17 3.63 3.77 0.94
4 2x1x2 22.46 22.36 468402 502.15 3.64 3.77 0.94
4 1x2x2 22.34 22.24 470783 504.70 3.66 3.79 0.95
8 2x2x2 24.23 24.00 872498 935.36 6.78 7.03 0.88
12 3x2x2 37.52 25.47 1229650 1318.25 9.55 9.90 0.83
12 2x3x2 37.61 25.41 1236330 1325.41 9.60 9.96 0.83
12 2x2x3 37.57 25.59 1227670 1316.13 9.54 9.89 0.82
16 4x2x2 49.01 26.10 1568560 1681.58 12.19 12.63 0.79
16 2x4x2 49.04 26.01 1606890 1722.67 12.48 12.94 0.81
16 2x2x4 47.94 25.96 1575340 1688.85 12.24 12.69 0.79
GRID: 128 x 64 x 64 per processor(tile) (4 or 10 steps)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 40.99 40.56 129070 135.84 0.91 1.00 1.00
2 2x1x1 41.41 41.28 253614 266.92 1.78 1.96 0.98
2 1x2x1 41.38 41.21 254056 267.39 1.78 1.97 0.98
2 1x1x2 41.34 41.16 254390 267.74 1.78 1.97 0.99
4 2x2x1 43.23 42.79 489402 515.09 3.43 3.79 0.95
4 2x1x2 43.25 42.82 489079 514.75 3.43 3.79 0.95
4 1x2x2 43.16 42.70 490386 516.12 3.44 3.80 0.95
8 2x2x2 46.39 45.71 916311 964.40 6.43 7.10 0.89
12 3x2x2 68.21 47.46 1319990 1389.27 9.26 10.23 0.85
12 2x3x2 69.17 48.45 1296930 1365.00 9.10 10.05 0.84
12 2x2x3 69.19 48.40 1298100 1366.23 9.11 10.06 0.84
16 4x2x2 90.94 49.72 1684970 1773.40 11.82 13.05 0.82
16 2x4x2 89.38 48.41 1727170 1817.82 12.12 13.38 0.84
16 2x2x4 92.10 48.49 1727700 1818.38 12.12 13.39 0.84
SGI Power Challenge Array (2x16 R8000 CPUs)
These machines are connected via HIPPI with a full crossbar switch. These tests were performed in a normal batch queue; dedicated performance (on up to 32 processors) may be marginally better. My first attempt at using more than 16 processors failed (hung?).
I do not yet know why the R8000's performance is so much worse than the R10000's. Perhaps the software pipelining is better on the R10000; however, a casual glance at the software pipelining reports (compiled with -S) indicates no differences. Neither machine can pipeline loops in which numbers are raised to real powers (like density**1.9 in the diffusion coefficient); the listing says that software pipelining failed due to a function call in the "do body".
All routines were compiled with: f77 -c -O3 -g3 -w -r8000 -64 -mips4 -OPT:IEEE_arithmetic=3 -OPT:roundoff=3
With the -v6 flag added to those above, which selects the version 6.0 compiler, the code ran at about 37 MFLOPS on 1 processor. This is slower than the rates reported below.
GRID: 32 x 32 x 32 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 8.38 7.00 46784 55.61 0.58 1.00 1.00
2 2x1x1 9.62 7.85 82462 98.02 1.02 1.76 0.88
2 1x2x1 10.05 7.82 83689 99.48 1.04 1.79 0.89
2 1x1x2 7.41 7.26 90217 107.24 1.12 1.93 0.96
4 2x2x1 7.77 7.47 175376 208.46 2.17 3.75 0.94
4 2x1x2 8.00 7.74 167603 199.22 2.08 3.58 0.90
4 1x2x2 9.78 8.42 155578 184.93 1.93 3.33 0.83
8 2x2x2 10.53 8.67 299632 356.16 3.71 6.40 0.80
12 3x2x2 12.58 10.49 362219 430.56 4.48 7.74 0.65
12 2x3x2 11.98 9.65 407386 484.24 5.04 8.71 0.73
12 2x2x3 11.55 9.45 415190 493.52 5.14 8.87 0.74
16 4x2x2 9.12 8.32 629695 748.50 7.80 13.46 0.84
16 2x4x2 11.36 9.56 534299 635.10 6.62 11.42 0.71
16 2x2x4 12.52 10.70 483769 575.04 5.99 10.34 0.65
GRID: 64 x 64 x 64 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 52.67 47.60 55014 58.98 0.43 1.00 1.00
2 2x1x1 54.29 49.23 106411 114.08 0.83 1.93 0.97
2 1x2x1 50.60 48.85 107224 114.95 0.83 1.95 0.97
2 1x1x2 51.17 48.82 107265 114.99 0.83 1.95 0.97
4 2x2x1 57.32 52.43 199827 214.23 1.55 3.63 0.91
4 2x1x2 59.88 51.92 201795 216.33 1.57 3.67 0.92
4 1x2x2 56.13 51.17 204715 219.47 1.59 3.72 0.93
8 2x2x2 61.87 55.41 378193 405.44 2.94 6.87 0.86
12 3x2x2 59.73 57.32 548403 587.92 4.26 9.97 0.83
12 2x3x2 63.20 59.48 528482 566.56 4.11 9.61 0.80
12 2x2x3 62.84 60.59 518836 556.22 4.03 9.43 0.79
16 4x2x2 93.18 68.58 605992 649.66 4.71 11.02 0.69
16 2x4x2 92.73 72.45 567997 608.92 4.41 10.32 0.65
16 2x2x4 98.47 74.50 549765 589.38 4.27 9.99 0.62
GRID: 128 x 64 x 64 per processor(tile) (4 or 10 steps)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 103.26 92.09 56873 59.86 0.40 1.00 1.00
2 2x1x1 102.09 94.49 110874 116.69 0.78 1.95 0.97
2 1x2x1 108.69 95.56 109624 115.38 0.77 1.93 0.96
2 1x1x2 110.49 95.51 109696 115.45 0.77 1.93 0.96
4 2x2x1 116.69 99.52 210544 221.59 1.48 3.70 0.93
4 2x1x2 125.78 100.90 207669 218.57 1.46 3.65 0.91
4 1x2x2 130.66 101.40 206638 217.48 1.45 3.63 0.91
8 2x2x2 147.89 109.13 384043 404.20 2.69 6.75 0.84
12 3x2x2 140.49 110.81 567326 597.10 3.98 9.98 0.83
12 2x3x2 139.57 113.78 552493 581.49 3.88 9.71 0.81
12 2x2x3 145.39 114.04 551244 580.18 3.87 9.69 0.81
16 4x2x2 193.18 116.52 703460 740.38 4.94 12.37 0.77
16 2x4x2 136.48 116.95 716747 754.36 5.03 12.60 0.79
16 2x2x4 130.76 116.04 722379 760.29 5.07 12.70 0.79
Convex Exemplar SPP-1200
Isom Crawford's Fortran-callable interface to the thread timing routines is available here.
This 4-HYPERnode system was configured with several HYPERnodes devoted to processing one batch job at a time (dedicated batch queue).
All routines were compiled with: f77 -c +O3
GRID: 32 x 32 x 32 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 10.61 10.37 31569 37.52 .39 1.00 1.00
2 2x1x1 10.57 10.39 62983 74.87 .78 2.00 1.00
2 1x2x1 10.54 10.38 63072 74.97 .78 2.00 1.00
2 1x1x2 10.48 10.36 63198 75.12 .78 2.00 1.00
4 2x2x1 10.73 10.53 124042 147.44 1.54 3.93 .98
4 2x1x2 10.57 10.40 125882 149.63 1.56 3.99 1.00
4 1x2x2 10.57 10.39 126031 149.81 1.56 3.99 1.00
8 2x2x2 10.87 10.55 247425 294.10 3.06 7.84 .98
12 3x2x2 12.41 11.94 328268 390.20 4.06 10.40 .87
12 2x3x2 12.34 11.82 331183 393.67 4.10 10.49 .87
12 2x2x3 12.45 11.98 326750 388.40 4.05 10.35 .86
16 4x2x2 11.17 21.62 241857 287.49 2.99 7.66 .48
16 2x4x2 11.25 10.66 489335 581.65 6.06 15.50 .97
16 2x2x4 11.25 10.67 488829 581.05 6.05 15.48 .97
24 4x3x2 12.13 11.65 668293 794.38 8.27 21.17 .88
24 4x2x3 13.85 13.20 581109 690.74 7.20 18.41 .77
24 3x4x2 12.00 11.31 685797 815.18 8.49 21.72 .91
24 3x2x4 12.84 12.06 641195 762.16 7.94 20.31 .85
24 2x4x3 13.85 12.91 592454 704.23 7.34 18.77 .78
24 2x3x4 11.73 11.08 704726 837.68 8.73 22.32 .93
GRID: 64 x 64 x 64 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 75.56 74.84 34985 37.51 .27 1.00 1.00
2 2x1x1 75.10 74.37 70400 75.47 .55 2.01 1.01
2 1x2x1 75.56 74.80 69951 74.99 .54 2.00 1.00
2 1x1x2 75.01 74.39 70391 75.46 .55 2.01 1.01
4 2x2x1 76.89 75.88 137770 147.70 1.07 3.94 .98
4 2x1x2 76.28 74.42 140729 150.87 1.09 4.02 1.01
4 1x2x2 75.80 75.04 139533 149.59 1.08 3.99 1.00
8 2x2x2 77.54 76.11 274868 294.67 2.14 7.86 .98
12 3x2x2 84.71 83.05 377989 405.22 2.94 10.80 .90
12 2x3x2 85.72 83.92 373183 400.07 2.90 10.67 .89
12 2x2x3 85.21 83.53 375332 402.38 2.92 10.73 .89
16 4x2x2 77.91 76.40 547713 587.18 4.25 15.66 .98
16 2x4x2 80.66 78.73 531948 570.28 4.13 15.20 .95
16 2x2x4 78.13 91.24 458678 491.73 3.56 13.11 .82
24 4x3x2 84.13 82.02 758958 813.64 5.90 21.69 .90
24 4x2x3 83.63 81.46 764210 819.27 5.94 21.84 .91
24 3x4x2 84.02 81.50 763615 818.64 5.93 21.83 .91
24 3x2x4 84.52 82.28 760598 815.40 5.91 21.74 .91
24 2x4x3 84.07 81.88 765121 820.25 5.94 21.87 .91
24 2x3x4 83.00 80.34 775112 830.96 6.02 22.16 .92
GRID: 128 x 64 x 64 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
1 1x1x1 147.07 145.72 35934 37.82 .25 1.00 1.00
2 2x1x1 147.43 146.02 71680 75.44 .50 1.99 1.00
2 1x2x1 147.14 145.45 71996 75.77 .51 2.00 1.00
2 1x1x2 147.03 145.78 71832 75.60 .50 2.00 1.00
4 2x2x1 147.99 146.29 143124 150.64 1.00 3.98 1.00
4 2x1x2 147.15 145.66 143770 151.32 1.01 4.00 1.00
4 1x2x2 147.89 153.45 136488 143.65 .96 3.80 .95
8 2x2x2 154.51 151.66 275716 290.19 1.93 7.67 .96
12 3x2x2 166.13 163.11 384083 404.24 2.69 10.69 .89
12 2x3x2 165.00 162.13 386504 406.79 2.71 10.76 .90
12 2x2x3 164.71 162.22 387332 407.66 2.72 10.78 .90
16 4x2x2 155.58 152.82 546730 575.42 3.84 15.22 .95
16 2x4x2 156.09 152.89 547129 575.84 3.84 15.23 .95
16 2x2x4 157.77 154.36 541371 569.78 3.80 15.07 .94
24 4x3x2 157.19 153.98 812248 854.88 5.70 22.60 .94
24 4x2x3 153.48 150.73 830812 874.42 5.83 23.12 .96
24 3x4x2 159.07 155.18 804810 847.05 5.65 22.40 .93
24 3x2x4 154.04 150.93 831055 874.67 5.83 23.13 .96
24 2x4x3 156.82 153.12 817906 860.83 5.74 22.76 .95
24 2x3x4 155.40 152.16 824640 867.92 5.79 22.95 .96
Intel Paragon
This data set was obtained on the smaller Paragon at Sandia National Laboratories as an interactive job. This machine has a total of 64 nodes, with the compute nodes running the SUNMOS/PUMA operating system. All nodes have at least 16MB of DRAM, so this problem should fit (not much paging).
Wall-clock time was used in place of "tused" to compute the number of zone-cycles per second because the UNIX etime routine, which returns CPU time, requires some system library modules that are missing on this machine.
All routines were compiled with: sif77 -c -O4
GRID: 32 x 32 x 32 per processor(tile) (4 Steps)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
DEC AlphaServer 8400
This data set was obtained on the DEC AlphaServer 8400 at LLNL as an interactive job. This system is a cluster of two 4-processor machines.
All routines were compiled with: f77 -c -O5 -g3 -nowarn -check noformat
The MPI_WTIME routine does not seem to work on this system.
GRID: 32 x 32 x 32 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor
GRID: 64 x 64 x 64 per processor(tile)
Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor