
ZEUS-MP Benchmarks for Radiation Diffusion

by streeter last modified 2007-03-30 04:23
  • CODE: ZEUS-MP version 1.3
  • MACHINES:
    Cray C90
    Cray J90
    Cray T90
    Cray T3D
    DEC AlphaServer 8400
    HP/Convex Exemplar SPP-1200
    Intel Paragon
    SGI Power Challenge (R10000)
    SGI Power Challenge Array (R8000)
  • PROBLEM: Block -- Radiation diffusion around a dense block in 3-D. The diffusion coefficient is proportional to density^-1.9 * temperature^5.6.
  • GEOMETRY: Cartesian XYZ
  • GRID: The physical grid is uniform and partitioned into cubic "tiles".
  • PRECISION: Single precision on Crays (64 bits), DOUBLE PRECISION on others.
  • ALGORITHM: Product Formula unconditionally stable explicit time evolution. The radiation diffusion equation is evolved by multiplying the radiation energy density vector by a product of exponentiated matrices representing the formal solution of the transfer equation. See Frank Graziani, Journal of Computational Physics, v. 118, pp. 9-23 (1995). The evolution operator is split into 1-dimensional sweeps. Communication is required to exchange radiation energy density values on the surfaces at both ends of the sweep.
  • DATA: In the tables below, "tused" is the number of CPU seconds used by the master thread in computing the evolution (some system overhead and ZEUS-MP initialization are excluded). Zone-Cycles/sec is the total number of mesh zones times the number of time steps, divided by tused. This measure allows direct performance comparisons between problems of different sizes. MFLOPS are based on the C90's hardware performance monitor. The MIPS R10000 typically reports significantly fewer MFLOPS because, among other differences, it counts a "multiply-add" as 1 operation.
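
The product-formula idea in the ALGORITHM item can be illustrated in one dimension. The sketch below is not taken from ZEUS-MP or from the Graziani paper. It assumes a constant diffusion coefficient, zero-flux boundaries, and a splitting of the 1-D diffusion matrix into even and odd neighbour pairs, so that each piece is a set of independent 2x2 blocks whose exponential is known in closed form. All function and variable names are hypothetical.

```python
import math


def pair_sweep(e, a, start):
    """Apply exp(a*M) exactly to the disjoint pairs (start, start+1), (start+2, start+3), ...

    M = [[-1, 1], [1, -1]] is the two-zone diffusion block, and
    exp(a*M) = I + w*M with w = (1 - exp(-2a)) / 2, so 0 <= w < 1/2 for a > 0.
    """
    w = 0.5 * (1.0 - math.exp(-2.0 * a))
    out = list(e)
    for i in range(start, len(e) - 1, 2):
        diff = e[i + 1] - e[i]
        out[i] = e[i] + w * diff          # zone i moves toward its neighbour
        out[i + 1] = e[i + 1] - w * diff  # the neighbour loses what zone i gains
    return out


def product_formula_step(e, d, dx, dt):
    """One time step: even-pair exponential followed by odd-pair exponential.

    Assumes a constant diffusion coefficient d (the benchmark's coefficient
    depends on density and temperature, which this sketch ignores).
    """
    a = d * dt / dx**2
    return pair_sweep(pair_sweep(e, a, 0), a, 1)


# A spike of radiation energy density in the middle of a 16-zone line.
e0 = [0.0] * 16
e0[8] = 1.0
# dt is about 2000 times the usual explicit limit dx**2 / (2*d).
e1 = product_formula_step(e0, d=1.0, dx=1.0, dt=1000.0)
```

Because each pair update is a weighted average of two zones, the total energy is conserved and no value leaves the range of the initial data, even for a time step far above the usual explicit limit. That is the sense in which a scheme of this kind is unconditionally stable; accuracy still degrades as the step grows.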


Cray C90

Cray UNICOS machines have a hardware performance monitor (hpm), which gives the number of floating point operations per CPU second (FLOPS) performed by a given process. The FLOPS for other machines are obtained by scaling the C90 FLOPS by the ratio of the Zone-Cycles/sec figures.
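
As a rough illustration of the Zone-Cycles/sec measure defined under DATA and of the MFLOPS scaling just described, here is a minimal sketch. The function names are hypothetical. It assumes that the total zone count is the tile size times the number of processors, and that the SGI R10000 32 x 32 x 32 runs used 10 steps, which that table does not state.

```python
def zone_cycles_per_sec(tile, processors, steps, tused):
    """Total mesh zones times time steps, divided by tused (CPU seconds)."""
    nx, ny, nz = tile
    total_zones = nx * ny * nz * processors  # assumption: one tile per processor
    return total_zones * steps / tused


def scaled_mflops(zcs, c90_zcs, c90_mflops):
    """Scale the C90's hpm MFLOPS by the ratio of Zone-Cycles/sec."""
    return c90_mflops * zcs / c90_zcs


# C90 reference row for the 32 x 32 x 32 tile, copied from the C90 table.
C90_ZCS = 80763.0
C90_MFLOPS = 96.40

# SGI R10000, 2 processors (2x1x1), tused = 2.48 s, assumed 10 steps.
rate = zone_cycles_per_sec((32, 32, 32), processors=2, steps=10, tused=2.48)

# MFLOPS estimated from the tabulated Zone-Cycles/sec for the same row.
mflops = scaled_mflops(264495.0, C90_ZCS, C90_MFLOPS)
```

For this row the sketch lands within about one percent of the tabulated 264495 Zone-Cycles/sec and 314.40 MFLOPS. The remaining difference is presumably due to rounding of the printed times, though the source does not say.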

When compiled with cft77 -ez, the program ran at 1/3 the rate reported below, unless HDF dumps were enabled (which should only have slowed it down, since they add I/O). The Cray fpp optimizer fixes whatever the problem with the optimization was. The listings indicated that all inner loops in the ZEUS-MP radiation module vectorize, regardless of whether cft77, cf77 -Zv, or cf77 -Zp was used, even though the performance differed by a factor of 3.

All routines were compiled with: cf77 -Zp -Wf"-ez"
 
GRID: 32 x 32 x 32 per processor (tile) (10 steps) 
                                                                              Speedup/
Processors  Layout Wall Clock  tused(s) Zone-Cycles/sec  MFLOPS  C90s Speedup Processor 
    1       1x1x1     4.06       4.05        80763       96.40   1.00  1.00     1.00 
 
GRID: 64 x 64 x 64 per processor (tile) (10 steps) 
                                                                              Speedup/
Processors  Layout Wall Clock  tused(s) Zone-Cycles/sec  MFLOPS  C90s Speedup Processor 
    1       1x1x1     24.37      20.29       128725      137.71  1.00   1.00    1.00 
 
GRID: 128 x 64 x 64 per processor (tile) (10 steps) 
                                                                              Speedup/
Processors  Layout Wall Clock  tused(s) Zone-Cycles/sec  MFLOPS  C90s Speedup Processor 
    1       1x1x1     41.42      36.64       142520      149.51  1.00   1.00    1.00 


Cray J90

Cray UNICOS machines have a hardware performance monitor (hpm), which gives the number of floating point operations per CPU second (FLOPS) performed by a given process.

All routines were compiled with: cf77 -Zp -ez
 
GRID: 32 x 32 x 32 per processor (tile) (10 steps) 
 
GRID: 64 x 64 x 64 per processor (tile) (10 steps) 
 
GRID: 128 x 64 x 64 per processor (tile) (10 steps) 


Cray T90

Cray UNICOS machines have a hardware performance monitor (hpm), which gives the number of floating point operations per CPU second (FLOPS) performed by a given process.

All routines were compiled with: cf77 -Zp -ez
 
GRID: 32 x 32 x 32 per processor (tile) (10 steps) 
 
GRID: 64 x 64 x 64 per processor (tile) (10 steps) 
 
GRID: 128 x 64 x 64 per processor (tile) (10 steps) 


Cray T3D

All routines were compiled with: cf77 -c -C cray-t3d -I/usr/include/mpp

All timings are Wall Clock seconds -- the T3D lacks the CPU timer "second".

These tests were run under the NQS batch system.
 
GRID: 32 x 32 x 32 per processor (tile) (10 steps) 
                                                                           Speedup/
Processors  Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor 


SGI Power Challenge (16 R10000 CPUs)

Most of these tests were performed in a dedicated batch queue.

All routines but one were compiled with: f77 -c -O3 -g3 -w -r10000 -64 -mips4 -OPT:IEEE_arithmetic=3 -OPT:roundoff=3

The main program was compiled at -O2 to get it to work across POWERnodes.

This code is not tuned to reduce TLB misses on this machine. TLB misses account for about 10 percent of the run time for the larger tile sizes, comparable to the time for L2 cache misses.


 
GRID: 32 x 32 x 32 per processor(tile) 
 
COMMENT: The 24 and 32 processor runs were done in interactive mode 
         because dedicated time was not available without special 
         permission.  The 24 processor run had 16 threads on one 
         POWERnode and 8 on the other, while the 32 processor run 
         actually used 3 POWERnodes, with 16, 10, and 6 processors 
         respectively, to avoid contention with other jobs. Corresponding 
         runs in dedicated mode should scale even better. 
 
                                                                              Speedup/
 Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS   C90s Speedup Processor 
      1      1x1x1    2.43       2.40       136340      162.06   1.69   1.00    1.00 
      2      2x1x1    2.48       2.48       264495      314.40   3.27   1.94    0.97 
      2      1x2x1    2.46       2.45       266989      317.36   3.31   1.96    0.98 
      2      1x1x2    2.46       2.45       267109      317.50   3.31   1.96    0.98 
      4      2x2x1    2.59       2.58       506763      602.37   6.27   3.72    0.93 
      4      2x1x2    2.57       2.56       510769      607.13   6.32   3.75    0.94 
      4      1x2x2    2.57       2.56       510771      607.13   6.32   3.75    0.94 
      8      2x2x2    2.80       2.80       937048     1113.83  11.60   6.87    0.86 
     12      3x2x2    2.88       2.86      1373130     1632.19  17.00  10.07    0.84 
     12      2x3x2    2.91       2.89      1361080     1617.87  16.85   9.98    0.83 
     12      2x2x3    2.91       2.89      1359110     1615.52  16.83   9.97    0.83 
     16      4x2x2    2.98       2.87      1811840     2153.67  22.43  13.29    0.83 
     16      2x4x2    2.91       2.90      1806720     2147.58  22.37  13.25    0.83 
     16      2x2x4    3.05       3.02      1736330     2063.91  21.50  12.74    0.80 
     24      4x3x2     ?          ?        2500000     2971.62  30.99  18.34    0.76 
     32      4x4x2     ?          ?        3024070     3594.55  37.48  22.18    0.69 
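
The text does not define the last three columns. Judging from the tabulated values, "C90s" appears to be the Zone-Cycles/sec divided by the single-processor C90 figure for the same tile size, "Speedup" the Zone-Cycles/sec divided by the same machine's 1-processor figure, and "Speedup/Processor" that speedup divided by the processor count. This reading is an inference from the numbers, not a statement from the source, and the function name below is hypothetical.

```python
def scaling_columns(zcs, zcs_one_proc, zcs_c90, processors):
    """Return (C90s, Speedup, Speedup/Processor) under the inferred definitions."""
    c90s = zcs / zcs_c90            # throughput in units of one C90 processor
    speedup = zcs / zcs_one_proc    # throughput relative to this machine's 1-processor run
    return c90s, speedup, speedup / processors


# SGI R10000, 32 x 32 x 32 tile, 2 processors (2x1x1) = 264495 Zone-Cycles/sec;
# 1-processor R10000 = 136340; 1-processor C90 = 80763 (all from the tables).
c90s, speedup, per_proc = scaling_columns(264495.0, 136340.0, 80763.0, 2)
```

Rounded to two decimals, these values agree with the 3.27, 1.94, and 0.97 printed for that row.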
 



 
GRID: 64 x 64 x 64 per processor(tile) 
                                                                              Speedup/
 Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS   C90s Speedup Processor 
      1      1x1x1    21.16      21.08      124146      133.09   0.96   1.00    1.00 
      2      2x1x1    21.66      21.59      242408      259.87   1.88   1.95    0.98 
      2      1x2x1    21.58      21.49      243618      261.17   1.89   1.96    0.98 
      2      1x1x2    21.78      21.61      242246      259.70   1.88   1.95    0.98 
      4      2x2x1    22.48      22.40      467486      501.17   3.63   3.77    0.94 
      4      2x1x2    22.46      22.36      468402      502.15   3.64   3.77    0.94 
      4      1x2x2    22.34      22.24      470783      504.70   3.66   3.79    0.95 
      8      2x2x2    24.23      24.00      872498      935.36   6.78   7.03    0.88 
     12      3x2x2    37.52      25.47     1229650     1318.25   9.55   9.90    0.83 
     12      2x3x2    37.61      25.41     1236330     1325.41   9.60   9.96    0.83 
     12      2x2x3    37.57      25.59     1227670     1316.13   9.54   9.89    0.82 
     16      4x2x2    49.01      26.10     1568560     1681.58  12.19  12.63    0.79 
     16      2x4x2    49.04      26.01     1606890     1722.67  12.48  12.94    0.81 
     16      2x2x4    47.94      25.96     1575340     1688.85  12.24  12.69    0.79 
 



 
GRID: 128 x 64 x 64 per processor(tile) (4 or 10 steps) 
                                                                              Speedup/
 Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS   C90s Speedup Processor 
      1      1x1x1    40.99      40.56      129070      135.84   0.91   1.00    1.00 
      2      2x1x1    41.41      41.28      253614      266.92   1.78   1.96    0.98 
      2      1x2x1    41.38      41.21      254056      267.39   1.78   1.97    0.98 
      2      1x1x2    41.34      41.16      254390      267.74   1.78   1.97    0.99 
      4      2x2x1    43.23      42.79      489402      515.09   3.43   3.79    0.95 
      4      2x1x2    43.25      42.82      489079      514.75   3.43   3.79    0.95 
      4      1x2x2    43.16      42.70      490386      516.12   3.44   3.80    0.95 
      8      2x2x2    46.39      45.71      916311      964.40   6.43   7.10    0.89 
     12      3x2x2    68.21      47.46     1319990     1389.27   9.26  10.23    0.85 
     12      2x3x2    69.17      48.45     1296930     1365.00   9.10  10.05    0.84 
     12      2x2x3    69.19      48.40     1298100     1366.23   9.11  10.06    0.84 
     16      4x2x2    90.94      49.72     1684970     1773.40  11.82  13.05    0.82 
     16      2x4x2    89.38      48.41     1727170     1817.82  12.12  13.38    0.84 
     16      2x2x4    92.10      48.49     1727700     1818.38  12.12  13.39    0.84 
 


SGI Power Challenge Array (2x16 R8000 CPUs)

These machines are connected via HIPPI with a full crossbar switch.

These tests were performed in a normal batch queue -- dedicated performance (on up to 32 processors) may be marginally better. My first attempt at using more than 16 processors failed (hung?).

I do not yet know why the R8000's performance is so much worse than the R10000's. Perhaps on the R10000 the software pipelining is better; however, a casual glance at the software pipelining reports (compiled with -S) indicates no differences -- neither machine can pipeline loops in which numbers are raised to real powers (like density**1.9 in the diffusion coefficient). The listing says software pipelining failed due to a function call in the "do body".

All routines were compiled with: f77 -c -O3 -g3 -w -r8000 -64 -mips4 -OPT:IEEE_arithmetic=3 -OPT:roundoff=3

With the -v6 flag (which selects the version 6.0 compiler) added to those above, the code ran at about 37 MFLOPS on 1 processor. This is slower than the rates reported below.


 
GRID: 32 x 32 x 32 per processor(tile) 
                                                                             Speedup/
 Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS  C90s Speedup Processor 
      1      1x1x1    8.38       7.00       46784       55.61   0.58   1.00    1.00 
      2      2x1x1    9.62       7.85       82462       98.02   1.02   1.76    0.88 
      2      1x2x1   10.05       7.82       83689       99.48   1.04   1.79    0.89 
      2      1x1x2    7.41       7.26       90217      107.24   1.12   1.93    0.96 
      4      2x2x1    7.77       7.47      175376      208.46   2.17   3.75    0.94 
      4      2x1x2    8.00       7.74      167603      199.22   2.08   3.58    0.90 
      4      1x2x2    9.78       8.42      155578      184.93   1.93   3.33    0.83 
      8      2x2x2   10.53       8.67      299632      356.16   3.71   6.40    0.80 
     12      3x2x2   12.58      10.49      362219      430.56   4.48   7.74    0.65 
     12      2x3x2   11.98       9.65      407386      484.24   5.04   8.71    0.73 
     12      2x2x3   11.55       9.45      415190      493.52   5.14   8.87    0.74 
     16      4x2x2    9.12       8.32      629695      748.50   7.80  13.46    0.84 
     16      2x4x2   11.36       9.56      534299      635.10   6.62  11.42    0.71 
     16      2x2x4   12.52      10.70      483769      575.04   5.99  10.34    0.65 
 



 
GRID: 64 x 64 x 64 per processor(tile) 
                                                                              Speedup/
 Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS   C90s Speedup Processor 
      1      1x1x1    52.67      47.60       55014       58.98   0.43   1.00    1.00 
      2      2x1x1    54.29      49.23      106411      114.08   0.83   1.93    0.97 
      2      1x2x1    50.60      48.85      107224      114.95   0.83   1.95    0.97 
      2      1x1x2    51.17      48.82      107265      114.99   0.83   1.95    0.97 
      4      2x2x1    57.32      52.43      199827      214.23   1.55   3.63    0.91 
      4      2x1x2    59.88      51.92      201795      216.33   1.57   3.67    0.92 
      4      1x2x2    56.13      51.17      204715      219.47   1.59   3.72    0.93 
      8      2x2x2    61.87      55.41      378193      405.44   2.94   6.87    0.86 
     12      3x2x2    59.73      57.32      548403      587.92   4.26   9.97    0.83 
     12      2x3x2    63.20      59.48      528482      566.56   4.11   9.61    0.80 
     12      2x2x3    62.84      60.59      518836      556.22   4.03   9.43    0.79 
     16      4x2x2    93.18      68.58      605992      649.66   4.71  11.02    0.69 
     16      2x4x2    92.73      72.45      567997      608.92   4.41  10.32    0.65 
     16      2x2x4    98.47      74.50      549765      589.38   4.27   9.99    0.62 
 



 
GRID: 128 x 64 x 64 per processor(tile) (4 or 10 steps) 
                                                                              Speedup/
 Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS   C90s Speedup Processor 
      1      1x1x1    103.26     92.09       56873       59.86   0.40   1.00    1.00 
      2      2x1x1    102.09     94.49      110874      116.69   0.78   1.95    0.97 
      2      1x2x1    108.69     95.56      109624      115.38   0.77   1.93    0.96 
      2      1x1x2    110.49     95.51      109696      115.45   0.77   1.93    0.96 
      4      2x2x1    116.69     99.52      210544      221.59   1.48   3.70    0.93 
      4      2x1x2    125.78    100.90      207669      218.57   1.46   3.65    0.91 
      4      1x2x2    130.66    101.40      206638      217.48   1.45   3.63    0.91 
      8      2x2x2    147.89    109.13      384043      404.20   2.69   6.75    0.84 
     12      3x2x2    140.49    110.81      567326      597.10   3.98   9.98    0.83 
     12      2x3x2    139.57    113.78      552493      581.49   3.88   9.71    0.81 
     12      2x2x3    145.39    114.04      551244      580.18   3.87   9.69    0.81 
     16      4x2x2    193.18    116.52      703460      740.38   4.94  12.37    0.77 
     16      2x4x2    136.48    116.95      716747      754.36   5.03  12.60    0.79 
     16      2x2x4    130.76    116.04      722379      760.29   5.07  12.70    0.79 
 


Convex Exemplar SPP-1200

Isom Crawford's Fortran-callable interface to the thread timing routines is available here.

This 4-HYPERnode system was configured with several HYPERnodes devoted to processing one batch job at a time (dedicated batch queue).

All routines were compiled with: f77 -c +O3


 
GRID: 32 x 32 x 32 per processor(tile) 
                                                                              Speedup/
 Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS   C90s Speedup Processor 
      1      1x1x1    10.61      10.37       31569       37.52    .39   1.00    1.00 
      2      2x1x1    10.57      10.39       62983       74.87    .78   2.00    1.00 
      2      1x2x1    10.54      10.38       63072       74.97    .78   2.00    1.00 
      2      1x1x2    10.48      10.36       63198       75.12    .78   2.00    1.00 
      4      2x2x1    10.73      10.53      124042      147.44   1.54   3.93     .98 
      4      2x1x2    10.57      10.40      125882      149.63   1.56   3.99    1.00 
      4      1x2x2    10.57      10.39      126031      149.81   1.56   3.99    1.00 
      8      2x2x2    10.87      10.55      247425      294.10   3.06   7.84     .98 
     12      3x2x2    12.41      11.94      328268      390.20   4.06  10.40     .87 
     12      2x3x2    12.34      11.82      331183      393.67   4.10  10.49     .87 
     12      2x2x3    12.45      11.98      326750      388.40   4.05  10.35     .86 
     16      4x2x2    11.17      21.62      241857      287.49   2.99   7.66     .48 
     16      2x4x2    11.25      10.66      489335      581.65   6.06  15.50     .97 
     16      2x2x4    11.25      10.67      488829      581.05   6.05  15.48     .97 
     24      4x3x2    12.13      11.65      668293      794.38   8.27  21.17     .88 
     24      4x2x3    13.85      13.20      581109      690.74   7.20  18.41     .77 
     24      3x4x2    12.00      11.31      685797      815.18   8.49  21.72     .91 
     24      3x2x4    12.84      12.06      641195      762.16   7.94  20.31     .85 
     24      2x4x3    13.85      12.91      592454      704.23   7.34  18.77     .78 
     24      2x3x4    11.73      11.08      704726      837.68   8.73  22.32     .93 



 
GRID: 64 x 64 x 64 per processor(tile) 
                                                                               Speedup/
 Processors Layout  Wall Clock  tused(s) Zone-Cycles/sec MFLOPS   C90s Speedup Processor 
      1      1x1x1     75.56      74.84       34985       37.51    .27   1.00    1.00 
      2      2x1x1     75.10      74.37       70400       75.47    .55   2.01    1.01 
      2      1x2x1     75.56      74.80       69951       74.99    .54   2.00    1.00 
      2      1x1x2     75.01      74.39       70391       75.46    .55   2.01    1.01 
      4      2x2x1     76.89      75.88      137770      147.70   1.07   3.94     .98 
      4      2x1x2     76.28      74.42      140729      150.87   1.09   4.02    1.01 
      4      1x2x2     75.80      75.04      139533      149.59   1.08   3.99    1.00 
      8      2x2x2     77.54      76.11      274868      294.67   2.14   7.86     .98 
     12      3x2x2     84.71      83.05      377989      405.22   2.94  10.80     .90 
     12      2x3x2     85.72      83.92      373183      400.07   2.90  10.67     .89 
     12      2x2x3     85.21      83.53      375332      402.38   2.92  10.73     .89 
     16      4x2x2     77.91      76.40      547713      587.18   4.25  15.66     .98 
     16      2x4x2     80.66      78.73      531948      570.28   4.13  15.20     .95 
     16      2x2x4     78.13      91.24      458678      491.73   3.56  13.11     .82 
     24      4x3x2     84.13      82.02      758958      813.64   5.90  21.69     .90 
     24      4x2x3     83.63      81.46      764210      819.27   5.94  21.84     .91 
     24      3x4x2     84.02      81.50      763615      818.64   5.93  21.83     .91 
     24      3x2x4     84.52      82.28      760598      815.40   5.91  21.74     .91 
     24      2x4x3     84.07      81.88      765121      820.25   5.94  21.87     .91 
     24      2x3x4     83.00      80.34      775112      830.96   6.02  22.16     .92 



 
GRID: 128 x 64 x 64 per processor(tile) 
                                                                                Speedup/
 Processors Layout  Wall Clock  tused(s) Zone-Cycles/sec  MFLOPS   C90s Speedup Processor 
      1      1x1x1     147.07     145.72       35934       37.82    .25   1.00    1.00 
      2      2x1x1     147.43     146.02       71680       75.44    .50   1.99    1.00 
      2      1x2x1     147.14     145.45       71996       75.77    .51   2.00    1.00 
      2      1x1x2     147.03     145.78       71832       75.60    .50   2.00    1.00 
      4      2x2x1     147.99     146.29      143124      150.64   1.00   3.98    1.00 
      4      2x1x2     147.15     145.66      143770      151.32   1.01   4.00    1.00 
      4      1x2x2     147.89     153.45      136488      143.65    .96   3.80     .95 
      8      2x2x2     154.51     151.66      275716      290.19   1.93   7.67     .96 
     12      3x2x2     166.13     163.11      384083      404.24   2.69  10.69     .89 
     12      2x3x2     165.00     162.13      386504      406.79   2.71  10.76     .90 
     12      2x2x3     164.71     162.22      387332      407.66   2.72  10.78     .90 
     16      4x2x2     155.58     152.82      546730      575.42   3.84  15.22     .95 
     16      2x4x2     156.09     152.89      547129      575.84   3.84  15.23     .95 
     16      2x2x4     157.77     154.36      541371      569.78   3.80  15.07     .94 
     24      4x3x2     157.19     153.98      812248      854.88   5.70  22.60     .94 
     24      4x2x3     153.48     150.73      830812      874.42   5.83  23.12     .96 
     24      3x4x2     159.07     155.18      804810      847.05   5.65  22.40     .93 
     24      3x2x4     154.04     150.93      831055      874.67   5.83  23.13     .96 
     24      2x4x3     156.82     153.12      817906      860.83   5.74  22.76     .95 
     24      2x3x4     155.40     152.16      824640      867.92   5.79  22.95     .96 


Intel Paragon

This data set was obtained on the smaller Paragon at Sandia National Laboratories as an interactive job. This machine has a total of 64 nodes, with the compute nodes running the SUNMOS/PUMA operating system. All nodes have at least 16MB of DRAM, so this problem should fit in memory without much paging.

Wall-clock time was used in place of "tused" to compute the number of zone-cycles per second because the UNIX etime routine, which returns CPU time, requires some missing system library modules.

All routines were compiled with: sif77 -c -O4

 
GRID: 32 x 32 x 32 per processor(tile) (4 Steps) 
                                                                          Speedup/
Processors Layout Wall Clock tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor 


DEC AlphaServer 8400

This data set was obtained on the DEC AlphaServer 8400 at LLNL as an interactive job. This system is a cluster of two 4-processor machines.

All routines were compiled with: f77 -c -O5 -g3 -nowarn -check noformat

The MPI_WTIME routine does not seem to work on this system.

 
GRID: 32 x 32 x 32 per processor(tile) 
                                                                           Speedup/
Processors Layout Wall Clock  tused(s) Zone-Cycles/sec MFLOPS C90s Speedup Processor 
 
GRID: 64 x 64 x 64 per processor(tile) 
                                                                             Speedup/
Processors Layout Wall Clock  tused(s) Zone-Cycles/sec  MFLOPS C90s Speedup Processor 

