In this article I will describe a few simple strategies you can use to improve your simulation performance. The basic strategies require a little testing and some forethought, but they are straightforward and should pay off for massive jobs, multi-parameter sweeps, and various optimization schemes.
For more information on high performance computing, how hardware impacts simulation performance, and how to optimize AWS instances, see these posts:
- High Performance Computing
- FDTD Simulation Benchmark
- Information on Hardware Specification
- Resource configuration elements and controls
More Efficient Simulations
1. Improve Simulation Set-up
This means reducing the simulation requirements: adjust the mesh size (the largest Δx that still gives reasonable results), exploit any available symmetry, and reduce the amount of data that monitors collect by using fewer monitors and down-sampling. This way we ensure unnecessary operations are eliminated, or at least minimized. The most effective option is usually to reduce the spatial and temporal resolution of your frequency and time monitors in the advanced tab.
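To see why down-sampling matters, a back-of-envelope estimate of monitor data volume helps. The sketch below is illustrative only: the grid sizes, frequency count, and six field components are hypothetical values, not taken from any specific solver.

```python
# Rough estimate of the data volume stored by a frequency monitor.
# All sizes below are hypothetical, for illustration only.
def monitor_size_gb(nx, ny, nz, n_freqs, n_components=6, bytes_per_value=8):
    """Estimate stored data for a frequency monitor (complex doubles)."""
    values = nx * ny * nz * n_freqs * n_components
    return values * bytes_per_value * 2 / 1e9  # x2 for real + imaginary parts

full = monitor_size_gb(400, 400, 400, 50)
down = monitor_size_gb(200, 200, 200, 50)  # spatial down-sampling by 2
print(f"full: {full:.1f} GB, down-sampled: {down:.1f} GB")
```

Down-sampling by a factor of 2 in each spatial dimension cuts the stored data by a factor of 8, which is often the difference between a job that fits in memory and one that does not.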
2. Use CPU Resources Effectively
Distributed computing allows us to split large FDTD simulation jobs across many separate processors or cores using a message passing interface (MPI).
- Slice a simulation into multiple spatial units that can be run in parallel, with fields passed at each time step.
- The solver supports two different concurrency mechanisms:
i) Launching multiple executables.
ii) Executables that spawn multiple threads.
If you click on the resource button in the top menu bar of FDTD Solutions, the resource configuration window will open, and there you will find the concurrency settings for a given machine. In the example shown here, each machine is configured to run the FDTD solver by launching two executables that spawn eight threads each. The important thing to remember when you set the number of processes and threads is that the number of threads times the number of processes must equal the total number of CPU cores available on the machine. This ensures that all the CPU cores are kept busy.
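The "processes × threads = cores" rule can be sketched as a small helper that enumerates the valid combinations for a machine; the 16-core count below is just an example.

```python
# Enumerate (processes, threads) pairs that keep every core busy:
# the product must equal the machine's total core count.
def valid_configs(total_cores):
    return [(p, total_cores // p) for p in range(1, total_cores + 1)
            if total_cores % p == 0]

print(valid_configs(16))
# → [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
```

Any of these pairs keeps all 16 cores busy; which one runs fastest still has to be measured, as discussed below.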
3. Running Independent Simulations in Parallel
Concurrency means that multiple units of a computer program run in parallel, using either separate processors or separate CPU cores packaged inside a single processor. At a certain point, distributing your simulation across more cores will not produce any further speed-up; exactly where depends on your particular simulation, but once you have reached the point of diminishing returns you can safely use the excess cores for other tasks. Having multiple processors and processor cores available allows the operating system to run multiple tasks without task switching.
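Running independent simulations side by side can be sketched with a small job launcher. Note that `"echo"` below is a stand-in so the example runs anywhere: the solver name and `-t` flag it echoes are hypothetical placeholders, and you would substitute your actual solver command line.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_job(input_file, threads=4):
    # "echo" is a placeholder; replace it with your real solver command.
    # The "fdtd-solver -t" invocation shown is hypothetical.
    cmd = ["echo", "fdtd-solver", "-t", str(threads), input_file]
    return subprocess.run(cmd, capture_output=True).returncode

jobs = ["sweep_a.fsp", "sweep_b.fsp", "sweep_c.fsp", "sweep_d.fsp"]
# 4 concurrent jobs x 4 threads each = 16 cores kept busy
with ThreadPoolExecutor(max_workers=4) as pool:
    codes = list(pool.map(run_job, jobs))
print(codes)  # [0, 0, 0, 0] once every job has exited cleanly
```

Here each job gets a fixed share of the cores, so a 16-core machine past its point of diminishing returns at 4 cores per job can push four sweep points through at once instead of one.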
People often ask what combination of threads and processes will work best for them. Unfortunately, it depends critically on your simulation set-up and the hardware it runs on; there is no general rule, or we would tell you. Although we cannot say in advance exactly what configuration to use, we do have a reliable way of maximizing your simulation efficiency.
Remember, the most important thing is to reduce the simulation objects and the data saved to memory to only what is necessary. The next most important thing is to keep the CPUs busy (if you are going to be using the computer at the same time, you may want to leave yourself a spare core for your own work).
Increase the number of cores you use, let the simulation run for a few seconds, and then check the log file for the solving rate.
It should tell you exactly how the FDTD volume was partitioned, and the solving rate in mega-nodes per second, which is essentially how many millions of mesh cells are updated per second. Although increasing the number of processes will not generally slow things down, at a certain point you will hit a plateau, where the simulation is no longer fully utilizing the added cores. This can be seen by running on a very large single node.
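Extracting the solving rate from the log can be automated with a short parser. The exact wording of the log line is an assumption here; check your own solver's log file and adjust the pattern to match.

```python
import re

# Pull the solving rate out of log text. The "Mnodes/s" line format
# below is an assumed example, not a documented log format.
def solving_rate(log_text):
    m = re.search(r"([\d.]+)\s*Mnodes/s", log_text)
    return float(m.group(1)) if m else None

sample = "partitioned 2 x 2 x 1; running at 512.3 Mnodes/s"
print(solving_rate(sample))  # 512.3
```

Scripting this lets you log the rate for each core count in a scaling test instead of reading the file by hand after every run.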
Note that the scale here is likely very different from what you would obtain on your own workstation, but the general trend will be the same. At a certain point we no longer see an increase in calculation efficiency from adding processes, so we can safely dedicate those cores to other jobs and run them in parallel.
Next, fine-tune the configuration you found in the previous step by increasing threads and reducing processes simultaneously while timing the simulation.
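This fine-tuning step amounts to timing the same job under different process/thread splits and keeping the fastest. In the sketch below, `"echo"` again stands in for a real `mpiexec`/solver command line so the example is runnable; the flags shown are hypothetical.

```python
import subprocess
import time

# Time one job under a given (processes, threads) split.
# "echo" is a placeholder for the real launcher + solver command;
# the "mpiexec -n ... fdtd-solver -t ..." form is hypothetical.
def time_config(processes, threads, input_file="job.fsp"):
    cmd = ["echo", "mpiexec", "-n", str(processes),
           "fdtd-solver", "-t", str(threads), input_file]
    start = time.perf_counter()
    subprocess.run(cmd, capture_output=True)
    return time.perf_counter() - start

total_cores = 16
timings = {(p, total_cores // p): time_config(p, total_cores // p)
           for p in (1, 2, 4, 8, 16)}
best = min(timings, key=timings.get)
print(f"fastest split: {best[0]} processes x {best[1]} threads")
```

Because every candidate keeps processes × threads equal to the core count, the scan only varies how the work is divided, which is exactly the knob this step tunes.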
Once again, the log file is a valuable source of information. In the next release we will introduce new command-line arguments for benchmarking.
Experiment with various hardware configurations to find the optimal set-up. Locally, you could optimize each resource separately. In the cloud the possible permutations are immense, and it would be easy to experiment with different instances. This might also provide insight into what hardware you would consider buying.
Steps 2-3 are only worthwhile if you really need to squeeze all the efficiency out of a machine or cloud instance. Step 1 can be done very easily and may allow you to double or triple your throughput on each local machine or cloud instance. Ultimately, no tricks will beat brute force for computing at scale. That being said, you can improve your workflow efficiency with these few simple tricks, and should the need arise it is easy to access massive server clusters on the cloud. Here we show the solving rate for a few such cases.