
sfgen Benchmark - Performance

Major Performance Factors

Reading Grid Data

Ideally, the -L flag to sfgen would always be used, and LoadGridDataAtStart would be set to TRUE. However, not all datasets will fit into memory on the available number of processors. Because of this, sfgen can instead read in the grid data as needed during each pass. This causes a large performance hit, but enables measurements that might not otherwise be possible. In addition, reading in the grid data initially reduces the number of point pairs that can be held in memory each pass.

Fortunately, the penalty for opening files goes down a little as the number of MPI tasks goes up, depending on the capability of the file system.

Compare the Loading Grid Data section with the Not Loading Grid Data section.

Points per Pass

The amount of work (computation) and memory consumption per pass is determined by the total number of points generated each pass. During each pass, for each bin (distance), an equal number of point pairs is generated. The number of bins (or distances) is set by NumberOfBins, while the number of point pairs is controlled by NumberOfPairs. (Creative algorithms are cool, creative variable naming is not.) This means the total number of points (and hence cell indexes and cell values) each pass is:

Points per pass = 2 X NumberOfPairs X NumberOfBins 

Number of Passes

Looking at the section above, it's not hard to figure out that the total number of points generated is the number of points per pass times the number of passes. Each pass brings with it the need to communicate, and possibly read in data, both of which add to the total time.

Total number of points = 2 X NumberOfPairs X NumberOfBins X NumberOfPasses
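
As a quick worked example of the two formulas above, the sketch below plugs in the settings reported in the perf files on this page (NumberOfPairs = 65536, NumberOfBins = 10, NumberOfPasses = 4). The function names are made up for illustration and are not part of sfgen.

```python
def points_per_pass(number_of_pairs: int, number_of_bins: int) -> int:
    # Every pair contributes two points, and each bin gets the same number of pairs.
    return 2 * number_of_pairs * number_of_bins


def total_points(number_of_pairs: int, number_of_bins: int, number_of_passes: int) -> int:
    # The per-pass count is simply repeated for every pass.
    return points_per_pass(number_of_pairs, number_of_bins) * number_of_passes


per_pass = points_per_pass(65536, 10)
total = total_points(65536, 10, 4)
print(per_pass)  # 1310720
print(total)     # 5242880
```

So each pass in the benchmark runs on this page handles about 1.3 million points, and a full 4-pass run about 5.2 million.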

Number of Tasks

TODO: Discuss communication cost.

Other Factors

TODO: Number of grids.

Performance Output

In addition to tools such as hpmcount and IPM, sfgen can be compiled to do some simple timing measurements. The timings are taken by calling MPI_Wtime around major portions of the code. The results appear in a file named *.structure.perf, e.g., bench_t16.structure.perf.

ds100 $ cat bench_t16.structure.perf 
# NumberOfPasses      = 4
# NumberOfPDFBins     = 100
# NumberOfBins        = 10
# NumberOfPairs       = 65536
# NumberOfProcessors  = 16
# LoadGridDataAtStart = FALSE
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, \
CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0,    1.30489,  0.0679054,    7.19513,   0.168064,   0.935935,    100.951,    61.8646,   0.446828,    3.17832,    1.84424,    5.02262,    177.958
1,    1.30359,  0.0673356,    27.9205,   0.177544,    1.13707,    95.6161,    60.7849,   0.601919,    3.46606,    1.84364,    5.30976,    192.919
2,    1.26115,  0.0658407,    27.7975,   0.156125,     1.2052,    99.2339,    59.9792,    0.51448,    3.29586,    1.84641,    5.14232,    195.356
3,    1.26102,  0.0669351,    28.3995,   0.156415,    1.60676,    96.5413,    62.4807,   0.590838,    3.32079,    1.84423,    5.16507,    196.269
====================================================
Passes Total Time =    762.526
Reduce Time =    35.8403
Writing Time =   0.101057
Total Time =    798.468
ds100 $ 

Perf Columns and Totals

Quantity                Description
PassNumber              Number of the current pass, zero based.
GetPointPairs           Generating random points and directions, finding containing grids and cell indexes.
CountGridCells          Looping over the points and counting the total number of cells per grid.
CommunicateCellCounts   Communicating cell counts between processors using MPI_Alltoallv.
GetCellIndexes          Collecting cell indexes from point pairs into an array for communicating.
CommunicateCellIndexes  Communicating cell indexes for each grid to the owning task using MPI_Alltoallv.
ReadInGrids             Getting cell values from grids. Tasks will read in data from disk if needed at this point.
CommunicateCellValues   Sending cell values to requesting tasks using MPI_Alltoallv.
ReadCellValues          Getting the cell values out of the communication arrays and into the point pairs.
CalculateValues         Calculating the interesting numbers for each point pair.
CalculatePDF            Binning probability distribution functions for each distance and value.
CalculateValuesTotal    Sum of the previous two columns.
PassTotalTime           Total time for this pass.
Passes Total Time       Should be the sum of PassTotalTime.
Reduce Time             Time to use MPI_Reduce to get results to the root task.
Writing Time            Time to write out results.
Total Time              Total time, not counting initial IO.
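
For anyone who wants to post-process these files, here is a minimal sketch of a parser for the layout shown above. It is not part of sfgen; the function name parse_perf and the returned structure are assumptions made for illustration. It treats `#` lines as settings, `Name = value` lines as totals, a trailing backslash as a line continuation, and everything else as the comma-separated column header followed by one row per pass.

```python
SAMPLE = r"""# NumberOfPasses      = 4
# NumberOfPDFBins     = 100
# NumberOfBins        = 10
# NumberOfPairs       = 65536
# NumberOfProcessors  = 16
# LoadGridDataAtStart = FALSE
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, \
CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0,    1.30489,  0.0679054,    7.19513,   0.168064,   0.935935,    100.951,    61.8646,   0.446828,    3.17832,    1.84424,    5.02262,    177.958
1,    1.30359,  0.0673356,    27.9205,   0.177544,    1.13707,    95.6161,    60.7849,   0.601919,    3.46606,    1.84364,    5.30976,    192.919
2,    1.26115,  0.0658407,    27.7975,   0.156125,     1.2052,    99.2339,    59.9792,    0.51448,    3.29586,    1.84641,    5.14232,    195.356
3,    1.26102,  0.0669351,    28.3995,   0.156415,    1.60676,    96.5413,    62.4807,   0.590838,    3.32079,    1.84423,    5.16507,    196.269
====================================================
Passes Total Time =    762.526
Reduce Time =    35.8403
Writing Time =   0.101057
Total Time =    798.468
"""


def parse_perf(text):
    """Return (settings, passes, totals) parsed from the text of a perf file."""
    settings, totals, passes = {}, {}, []
    columns = None
    pending = ""  # text carried over from a line that ended with a backslash
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if not line or line.startswith("="):
            continue  # blank line or the ===== separator
        if line.endswith("\\"):
            pending = line[:-1]  # join with the next line
            continue
        if line.startswith("#"):
            key, _, value = line[1:].partition("=")
            settings[key.strip()] = value.strip()
        elif "=" in line:
            key, _, value = line.partition("=")
            totals[key.strip()] = float(value)
        elif columns is None:
            columns = [name.strip() for name in line.split(",")]
        else:
            values = [float(v) for v in line.split(",")]
            passes.append(dict(zip(columns, values)))
    return settings, passes, totals


settings, passes, totals = parse_perf(SAMPLE)
summed = sum(p["PassTotalTime"] for p in passes)
print(f"sum of PassTotalTime = {summed:.3f}")
print(f"Passes Total Time    = {totals['Passes Total Time']:.3f}")
```

On the bench_t16 sample, the summed PassTotalTime column and the reported Passes Total Time agree to within a few hundredths of a second, which is a handy sanity check when reading a new perf file.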

Loading Grid Data

LoadLeveler Script

# @ executable = /users/ucsd/ux455215/gdata/sfgen-benchmark/DD0100/sfgen-1024.script
# @ node = 4
# @ tasks_per_node = 8
# @ resources = ConsumableCpus(1) ConsumableMemory(2gb)
# @ node_usage = not_shared
# @ network.mpi = sn_all,shared,US
# @ wall_clock_limit = 1:00:00
# @ class = high
# @ queue

Batch Script

poe hpmcount -o bench_t32_load_hpm /users/ucsd/ux455215/tmp/sfgen-benchmark/bin/sfgen \
    -L -P 4 -o bench_t32_load ts_1024ppmL0M6H5_0100

This run loads the grid data at start (-L), makes 4 passes (-P 4), and writes its output to bench_t32_load (-o).

Perf File

ds001 $ cat bench_t32_load.structure.perf 
# NumberOfPasses      = 4
# NumberOfPDFBins     = 100
# NumberOfBins        = 10
# NumberOfPairs       = 65536
# NumberOfProcessors  = 32
# LoadGridDataAtStart = TRUE
Initialization Time =     50.872
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0,    1.11785,   0.062119,     1.3364,   0.149467,  0.0355806,   0.863223,   0.258517,   0.548922,    2.90667,    1.62741,    4.53413,    8.90649
1,    1.16458,  0.0620608,   0.028697,   0.134882,  0.0439982,   0.790914,   0.184859,   0.534144,    2.89807,    1.62062,    4.51874,    7.46315
2,    1.11474,   0.062026,   0.104585,   0.136715,  0.0226259,   0.790626,   0.196604,   0.550444,    2.94474,    1.66149,    4.60628,    7.58494
3,    1.11717,  0.0618196, 0.000574589,   0.133951,  0.0252619,   0.789778,   0.188171,   0.547121,    2.89896,    1.62136,    4.52037,     7.3845
====================================================
Passes Total Time =    31.3402
Reduce Time =  0.0470595
Writing Time =  0.0728083
Total Time =     82.332
ds001 $ 

HPM Results

===================================================
Computation performance measured for all   32 cpus:

Execution wall clock time    =      87.879 seconds
Total FPU arithmetic results =       1.353e+11
(29.2% of these were FMAs)
Aggregate flop rate          =       1.990 Gflop/s
Average flop rate per cpu    =      62.184 Mflop/s
                             =       1.0% of `peak'

Ratio of floating point divisions
to all FPU arithmetic results: 0.019

Memory usage:

Memory high water            =    1955.980 MB
Memory low water             =    1945.876 MB
Total memory                 =      62.558 GB


Communication wall clock time for   32 cpus:

        max =     10.977 seconds
        min =      2.771 seconds

Communication took 12.49% of total wall clock time.
===================================================
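
The derived figures in the hpmcount report can be cross-checked from the raw ones. The sketch below assumes that each FMA is counted as two floating point operations, so that flops = results x (1 + FMA fraction); that assumption is not stated in the report, but it appears consistent with the numbers for this run. The function name is made up for illustration.

```python
def aggregate_gflops(fpu_results: float, fma_fraction: float, wall_seconds: float) -> float:
    # Assumption: an FMA produces one result but counts as two operations.
    flops = fpu_results * (1.0 + fma_fraction)
    return flops / wall_seconds / 1e9


# Values copied from the HPM results above (32 cpus, grid data loaded at start).
rate = aggregate_gflops(1.353e11, 0.292, 87.879)
per_cpu_mflops = rate * 1000 / 32
comm_percent = 100 * 10.977 / 87.879

print(f"aggregate: {rate:.2f} Gflop/s")
print(f"per cpu:   {per_cpu_mflops:.1f} Mflop/s")
print(f"comm:      {comm_percent:.2f}% of wall clock")
```

Under that assumption the results land close to the reported 1.990 Gflop/s, 62.184 Mflop/s per cpu, and 12.49% communication time, with the small differences attributable to rounding in the printed inputs.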

Not Loading Grid Data

LoadLeveler Script

# @ executable = /users/ucsd/ux455215/gdata/sfgen-benchmark/DD0100/sfgen-1024-noload.script
# @ node = 4
# @ tasks_per_node = 8
# @ resources = ConsumableCpus(1) ConsumableMemory(2gb)
# @ node_usage = not_shared
# @ network.mpi = sn_all,shared,US
# @ wall_clock_limit = 1:00:00
# @ class = high
# @ queue

Batch Script

poe hpmcount -o bench_t32_noload_hpm /users/ucsd/ux455215/tmp/sfgen-benchmark/bin/sfgen \
    -P 4 -o bench_t32_noload ts_1024ppmL0M6H5_0100

This run does not load the grid data at start (no -L), makes 4 passes (-P 4), and writes its output to bench_t32_noload (-o).

Perf File

ds001 $ cat bench_t32_noload.structure.perf 
# NumberOfPasses      = 4
# NumberOfPDFBins     = 100
# NumberOfBins        = 10
# NumberOfPairs       = 65536
# NumberOfProcessors  = 32
# LoadGridDataAtStart = FALSE
Initialization Time =   0.542891
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0,     1.1782,  0.0617247, 0.000570297,   0.149917,  0.0305123,    51.7688,    3.75361,   0.541288,    2.87594,    1.62803,    4.50402,    61.9889
1,    1.18559,  0.0621586, 0.000967503,   0.136963,  0.0217381,    32.4274,    1.77462,   0.543929,    2.88617,    1.62936,     4.5156,    40.6692
2,     1.1837,  0.0606303,   0.130129,   0.140462,  0.0189729,    31.9734,    2.81744,   0.544559,    2.88427,    1.62981,    4.51414,    41.3838
3,    1.18848,  0.0620074, 0.000603199,    0.13602,    0.02321,    32.6865,    1.87053,   0.541438,    2.89543,    1.62935,    4.52483,    41.0339
====================================================
Passes Total Time =    185.078
Reduce Time =  0.0305595
Writing Time =  0.0504446
Total Time =    185.702
ds001 $ 

HPM Results

===================================================
Computation performance measured for all   32 cpus:

Execution wall clock time    =     188.063 seconds
Total FPU arithmetic results =       1.310e+11
(29.3% of these were FMAs)
Aggregate flop rate          =       0.900 Gflop/s
Average flop rate per cpu    =      28.137 Mflop/s
                             =       0.5% of `peak'

Ratio of floating point divisions
to all FPU arithmetic results: 0.020

Memory usage:

Memory high water            =     265.064 MB
Memory low water             =     264.328 MB
Total memory                 =       8.467 GB


Communication wall clock time for   32 cpus:

        max =     10.977 seconds
        min =      2.771 seconds

Communication took 5.84% of total wall clock time.
===================================================
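
To put the two 32-task runs side by side, the sketch below computes the speedup from loading grid data at start and a rough break-even point in number of passes. The numbers are copied from the two perf files above; the helper names are hypothetical, and the break-even estimate assumes the average per-pass cost stays constant, which is only approximately true (the first pass of the no-load run was noticeably slower than the rest).

```python
# Values copied from bench_t32_load.structure.perf and bench_t32_noload.structure.perf.
load = {"init": 50.872, "passes_total": 31.3402, "total": 82.332, "passes": 4}
noload = {"init": 0.542891, "passes_total": 185.078, "total": 185.702, "passes": 4}


def mean_pass_time(run):
    # Average wall clock seconds per pass.
    return run["passes_total"] / run["passes"]


def break_even_passes(load_run, noload_run):
    # Passes needed before the extra start-up cost of loading is repaid,
    # assuming every pass costs the run's average pass time.
    extra_init = load_run["init"] - noload_run["init"]
    saved_per_pass = mean_pass_time(noload_run) - mean_pass_time(load_run)
    return extra_init / saved_per_pass


total_speedup = noload["total"] / load["total"]
pass_speedup = noload["passes_total"] / load["passes_total"]
break_even = break_even_passes(load, noload)

print(f"total time speedup:  {total_speedup:.2f}x")
print(f"passes-only speedup: {pass_speedup:.2f}x")
print(f"break-even after about {break_even:.1f} passes")
```

For this dataset the passes themselves run several times faster with the grid data preloaded, and the roughly 50 second initialization cost is recovered within the first two passes, so preloading wins whenever the data fits in memory.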