sfgen Benchmark - Performance
Major Performance Factors
Reading Grid Data
Ideally, the -L flag to sfgen would always be used, and LoadGridDataAtStart would be set to TRUE. However, not all datasets will fit into memory on the available number of processors. Because of this, sfgen can read in the grid data as needed during each pass. This causes a large performance hit, but it enables measurements that might otherwise be impossible. Also, loading the grid data up front reduces the number of point pairs that can be held in memory each pass.
Fortunately, the penalty for opening files goes down a little as the number of MPI tasks goes up, depending on the capability of the file system.
Compare the Loading Grid Data section with the Not Loading Grid Data one.
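As a rough way to make that comparison, the sketch below takes the Initialization Time, Passes Total Time, and Total Time values from the two 32-task perf files further down this page and estimates how many passes it takes for loading up front to pay off. The variable and function names are made up for illustration, and averaging over passes is a simplification, since the first pass without -L is slower than the later ones.

```python
# Timings (seconds) copied from the two 32-task perf files on this page.
load = {"init": 50.872, "passes": 31.3402, "total": 82.332}       # with -L
noload = {"init": 0.542891, "passes": 185.078, "total": 185.702}  # without -L
NUM_PASSES = 4  # both runs used -P 4


def per_pass(run):
    """Average time per pass for one run (hypothetical helper)."""
    return run["passes"] / NUM_PASSES


# Extra start-up cost paid for loading all grid data at the start.
extra_init = load["init"] - noload["init"]
# Average time saved on every pass once the data is in memory.
saved_per_pass = per_pass(noload) - per_pass(load)
# Number of passes after which loading up front becomes the faster option.
break_even_passes = extra_init / saved_per_pass
total_speedup = noload["total"] / load["total"]

print(f"per pass: load {per_pass(load):.2f} s, noload {per_pass(noload):.2f} s")
print(f"total speedup with -L: {total_speedup:.2f}x")
print(f"break-even after about {break_even_passes:.2f} passes")
```

For these two runs the up-front load is already ahead by the second pass, which is why -L is preferred whenever the data fits in memory.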
Points per Pass
The amount of work (computation) and memory consumption per pass is determined by the total number of points generated each pass. During each pass, for each bin (distance), an equal number of point pairs is generated. The number of bins (or distances) is set by NumberOfBins, while the number of point pairs is controlled by NumberOfPairs. (Creative algorithms are cool, creative variable naming is not.) This means the total number of points (and hence cell indexes and cell values) each pass is:
Points per pass = 2 X NumberOfPairs X NumberOfBins
Number of Passes
Looking at the section above, it's not hard to figure out that the total number of points generated is the number of points per pass times the number of passes. Each pass brings with it the need to communicate, and possibly read in data, both of which add to the total time.
Total number of points = 2 X NumberOfPairs X NumberOfBins X NumberOfPasses
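To make the two formulas above concrete, here is a small sketch that evaluates them for the settings used in the benchmark runs on this page (NumberOfPairs = 65536, NumberOfBins = 10, NumberOfPasses = 4). The function names are illustrative and are not part of sfgen.

```python
def points_per_pass(number_of_pairs, number_of_bins):
    """Each pair contributes two points, and every bin gets the same number of pairs."""
    return 2 * number_of_pairs * number_of_bins


def total_points(number_of_pairs, number_of_bins, number_of_passes):
    """Points generated over the whole run."""
    return points_per_pass(number_of_pairs, number_of_bins) * number_of_passes


# Settings taken from the perf file headers on this page.
print(points_per_pass(65536, 10))   # 1310720
print(total_points(65536, 10, 4))   # 5242880
```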
Number of Tasks
TODO: Discuss communication cost.
Other Factors
TODO: Number of grids.
Performance Output
In addition to tools such as hpmcount and IPM, sfgen can be compiled to do some simple timing measurements. These are created using MPI_Wtime around major portions of the code. The results will appear in a file named *.structure.perf, e.g., bench_t16.structure.perf.
ds100 $ cat bench_t16.structure.perf
# NumberOfPasses = 4
# NumberOfPDFBins = 100
# NumberOfBins = 10
# NumberOfPairs = 65536
# NumberOfProcessors = 16
# LoadGridDataAtStart = FALSE
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, \
CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0, 1.30489, 0.0679054, 7.19513, 0.168064, 0.935935, 100.951, 61.8646, 0.446828, 3.17832, 1.84424, 5.02262, 177.958
1, 1.30359, 0.0673356, 27.9205, 0.177544, 1.13707, 95.6161, 60.7849, 0.601919, 3.46606, 1.84364, 5.30976, 192.919
2, 1.26115, 0.0658407, 27.7975, 0.156125, 1.2052, 99.2339, 59.9792, 0.51448, 3.29586, 1.84641, 5.14232, 195.356
3, 1.26102, 0.0669351, 28.3995, 0.156415, 1.60676, 96.5413, 62.4807, 0.590838, 3.32079, 1.84423, 5.16507, 196.269
====================================================
Passes Total Time = 762.526
Reduce Time = 35.8403
Writing Time = 0.101057
Total Time = 798.468
ds100 $
Perf Columns and Totals
| Quantity | Description |
| PassNumber | Number of the current pass, zero based. |
| GetPointPairs | Generating random points and directions, finding containing grids and cell indexes. |
| CountGridCells | Looping over the points and counting the total number of cells per grid. |
| CommunicateCellCounts | Communicating cell counts between processors using MPI_Alltoallv. |
| GetCellIndexes | Collecting cell indexes from point pairs into an array for communicating. |
| CommunicateCellIndexes | Communicating cell indexes for each grid to the owning task using MPI_Alltoallv. |
| ReadInGrids | Getting cell values from grids. Tasks will read in data from disk if needed at this point. |
| CommunicateCellValues | Sending cell values to requesting tasks using MPI_Alltoallv. |
| ReadCellValues | Getting the cell values out of the communication arrays and into the point pairs. |
| CalculateValues | Calculating the interesting numbers for each point pair. |
| CalculatePDF | Binning probability distribution functions for each distance and value. |
| CalculateValuesTotal | Sum of the previous two columns. |
| PassTotalTime | Total time for this pass. |
| Passes Total Time | Should be sum of PassTotalTime. |
| Reduce Time | Time to use MPI_Reduce to get results to the root task. |
| Writing Time | Time to write out results. |
| Total Time | Total time, not counting initial IO. |
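The columns and totals described above can be checked mechanically. The following is a minimal sketch of a reader for a *.structure.perf file, assuming the file on disk has one record per line: `#` lines for settings, one comma-separated header (possibly continued with a trailing backslash), one row per pass, a separator line, and `Name = value` totals. The names `parse_perf` and `SAMPLE` are hypothetical, and the sample text is the bench_t16 output shown above.

```python
# Contents of bench_t16.structure.perf as shown on this page (raw string keeps the backslash).
SAMPLE = r"""# NumberOfPasses = 4
# NumberOfPDFBins = 100
# NumberOfBins = 10
# NumberOfPairs = 65536
# NumberOfProcessors = 16
# LoadGridDataAtStart = FALSE
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, \
CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0, 1.30489, 0.0679054, 7.19513, 0.168064, 0.935935, 100.951, 61.8646, 0.446828, 3.17832, 1.84424, 5.02262, 177.958
1, 1.30359, 0.0673356, 27.9205, 0.177544, 1.13707, 95.6161, 60.7849, 0.601919, 3.46606, 1.84364, 5.30976, 192.919
2, 1.26115, 0.0658407, 27.7975, 0.156125, 1.2052, 99.2339, 59.9792, 0.51448, 3.29586, 1.84641, 5.14232, 195.356
3, 1.26102, 0.0669351, 28.3995, 0.156415, 1.60676, 96.5413, 62.4807, 0.590838, 3.32079, 1.84423, 5.16507, 196.269
====================================================
Passes Total Time = 762.526
Reduce Time = 35.8403
Writing Time = 0.101057
Total Time = 798.468
"""


def parse_perf(text):
    """Split perf-file text into settings, per-pass rows, and totals."""
    settings, rows, totals = {}, [], {}
    columns = None
    # Join a header line that was continued with a trailing backslash.
    text = text.replace("\\\n", "")
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"="}:
            continue  # blank line or "=====" separator
        if line.startswith("#"):
            name, value = line[1:].split("=", 1)
            settings[name.strip()] = value.strip()
        elif line.startswith("PassNumber"):
            columns = [c.strip() for c in line.split(",")]
        elif "=" in line:
            name, value = line.split("=", 1)
            totals[name.strip()] = float(value)
        else:
            values = [float(v) for v in line.split(",")]
            rows.append(dict(zip(columns, values)))
    return {"settings": settings, "rows": rows, "totals": totals}


perf = parse_perf(SAMPLE)
pass_sum = sum(row["PassTotalTime"] for row in perf["rows"])
print(f"sum of PassTotalTime = {pass_sum:.3f}")
print(f"Passes Total Time    = {perf['totals']['Passes Total Time']:.3f}")
```

The two printed values agree only approximately, so any automated check should allow a small tolerance rather than expect an exact match.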
Loading Grid Data
LoadLeveler Script
# @ executable = /users/ucsd/ux455215/gdata/sfgen-benchmark/DD0100/sfgen-1024.script
# @ node = 4
# @ tasks_per_node = 8
# @ resources = ConsumableCpus(1) ConsumableMemory(2gb)
# @ node_usage = not_shared
# @ network.mpi = sn_all,shared,US
# @ wall_clock_limit = 1:00:00
# @ class = high
# @ queue
Batch Script
poe hpmcount -o bench_t32_load_hpm /users/ucsd/ux455215/tmp/sfgen-benchmark/bin/sfgen \
-L -P 4 -o bench_t32_load ts_1024ppmL0M6H5_0100
Load grid data, 4 passes, output to bench_t32_load.
Perf File
ds001 $ cat bench_t32_load.structure.perf
# NumberOfPasses = 4
# NumberOfPDFBins = 100
# NumberOfBins = 10
# NumberOfPairs = 65536
# NumberOfProcessors = 32
# LoadGridDataAtStart = TRUE
Initialization Time = 50.872
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0, 1.11785, 0.062119, 1.3364, 0.149467, 0.0355806, 0.863223, 0.258517, 0.548922, 2.90667, 1.62741, 4.53413, 8.90649
1, 1.16458, 0.0620608, 0.028697, 0.134882, 0.0439982, 0.790914, 0.184859, 0.534144, 2.89807, 1.62062, 4.51874, 7.46315
2, 1.11474, 0.062026, 0.104585, 0.136715, 0.0226259, 0.790626, 0.196604, 0.550444, 2.94474, 1.66149, 4.60628, 7.58494
3, 1.11717, 0.0618196, 0.000574589, 0.133951, 0.0252619, 0.789778, 0.188171, 0.547121, 2.89896, 1.62136, 4.52037, 7.3845
====================================================
Passes Total Time = 31.3402
Reduce Time = 0.0470595
Writing Time = 0.0728083
Total Time = 82.332
ds001 $
HPM Results
===================================================
Computation performance measured for all 32 cpus:
Execution wall clock time = 87.879 seconds
Total FPU arithmetic results = 1.353e+11
(29.2% of these were FMAs)
Aggregate flop rate = 1.990 Gflop/s
Average flop rate per cpu = 62.184 Mflop/s
= 1.0% of `peak'
Ratio of floating point divisions
to all FPU arithmetic results: 0.019
Memory usage:
Memory high water = 1955.980 MB
Memory low water = 1945.876 MB
Total memory = 62.558 GB
Communication wall clock time for 32 cpus:
max = 10.977 seconds
min = 2.771 seconds
Communication took 12.49% of total wall clock time.
===================================================
Not Loading Grid Data
LoadLeveler Script
# @ executable = /users/ucsd/ux455215/gdata/sfgen-benchmark/DD0100/sfgen-1024-noload.script
# @ node = 4
# @ tasks_per_node = 8
# @ resources = ConsumableCpus(1) ConsumableMemory(2gb)
# @ node_usage = not_shared
# @ network.mpi = sn_all,shared,US
# @ wall_clock_limit = 1:00:00
# @ class = high
# @ queue
Batch Script
poe hpmcount -o bench_t32_noload_hpm /users/ucsd/ux455215/tmp/sfgen-benchmark/bin/sfgen \
-P 4 -o bench_t32_noload ts_1024ppmL0M6H5_0100
Don't load grid data, 4 passes, output to bench_t32_noload.
Perf File
ds001 $ cat bench_t32_noload.structure.perf
# NumberOfPasses = 4
# NumberOfPDFBins = 100
# NumberOfBins = 10
# NumberOfPairs = 65536
# NumberOfProcessors = 32
# LoadGridDataAtStart = FALSE
Initialization Time = 0.542891
PassNumber, GetPointPairs, CountGridCells, CommunicateCellCounts, GetCellIndexes, CommunicateCellIndexes, ReadInGrids, CommunicateCellValues, ReadCellValues, CalculateValues, CalculatePDF, CalculateValuesTotal, PassTotalTime
0, 1.1782, 0.0617247, 0.000570297, 0.149917, 0.0305123, 51.7688, 3.75361, 0.541288, 2.87594, 1.62803, 4.50402, 61.9889
1, 1.18559, 0.0621586, 0.000967503, 0.136963, 0.0217381, 32.4274, 1.77462, 0.543929, 2.88617, 1.62936, 4.5156, 40.6692
2, 1.1837, 0.0606303, 0.130129, 0.140462, 0.0189729, 31.9734, 2.81744, 0.544559, 2.88427, 1.62981, 4.51414, 41.3838
3, 1.18848, 0.0620074, 0.000603199, 0.13602, 0.02321, 32.6865, 1.87053, 0.541438, 2.89543, 1.62935, 4.52483, 41.0339
====================================================
Passes Total Time = 185.078
Reduce Time = 0.0305595
Writing Time = 0.0504446
Total Time = 185.702
ds001 $
HPM Results
===================================================
Computation performance measured for all 32 cpus:
Execution wall clock time = 188.063 seconds
Total FPU arithmetic results = 1.310e+11
(29.3% of these were FMAs)
Aggregate flop rate = 0.900 Gflop/s
Average flop rate per cpu = 28.137 Mflop/s
= 0.5% of `peak'
Ratio of floating point divisions
to all FPU arithmetic results: 0.020
Memory usage:
Memory high water = 265.064 MB
Memory low water = 264.328 MB
Total memory = 8.467 GB
Communication wall clock time for 32 cpus:
max = 10.977 seconds
min = 2.771 seconds
Communication took 5.84% of total wall clock time.
===================================================
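The derived figures in the two HPM reports can be reproduced from the raw counts. The sketch below assumes that each FMA is counted as two floating point operations and that the communication percentage is the maximum communication time divided by the execution wall clock time. Both assumptions are inferred from the numbers on this page rather than from the hpmcount documentation, and the function names are illustrative.

```python
def aggregate_gflops(fpu_results, fma_fraction, wall_seconds):
    """Aggregate flop rate in Gflop/s, counting each FMA as two flops (assumption)."""
    flops = fpu_results * (1.0 + fma_fraction)
    return flops / wall_seconds / 1e9


def comm_percent(max_comm_seconds, wall_seconds):
    """Communication share of wall clock time, using the max over CPUs (assumption)."""
    return 100.0 * max_comm_seconds / wall_seconds


# (FPU results, FMA fraction, wall clock s, max communication s) from the HPM reports.
runs = {
    "load": (1.353e11, 0.292, 87.879, 10.977),
    "noload": (1.310e11, 0.293, 188.063, 10.977),
}

for name, (fpu, fma, wall, comm) in runs.items():
    rate = aggregate_gflops(fpu, fma, wall)
    print(f"{name}: {rate:.3f} Gflop/s aggregate, "
          f"{rate * 1000 / 32:.1f} Mflop/s per cpu, "
          f"{comm_percent(comm, wall):.2f}% communication")
```

Under these assumptions the results land close to the reported rates and percentages. The floating point work is nearly the same in both runs, so the lower flop rate without -L comes from the longer wall clock time spent reading grids.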
