What affects Cluster efficiency


Check the characteristics of the machine on which the model runs against the factors described below.


> The first and most important factor for Cluster is the number of physical cores: any number of nodes can be run in parallel, but the total processing power is ultimately limited by the number of physical cores in the machine (see "Note on Hyperthreading" below).


It is important to avoid "overloading" the processors. For example, with:

  • 40 processors,
  • each scenario using 8 cores in parallel,
  • it is safe to run 4 scenarios simultaneously (fewer than 40/8 = 5) without overloading the processors
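
The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration, not part of Voyager: the function name `max_safe_runs` is hypothetical, and the choice to stay strictly below cores/cores-per-run (rather than equal to it) is an assumption taken from the "4 (< 5)" example.

```python
import math


def max_safe_runs(physical_cores: int, cores_per_run: int) -> int:
    """Largest number of simultaneous runs strictly below cores / cores_per_run.

    Assumption: the strict "<" mirrors the 40-core example, leaving some
    capacity free for the operating system and other work.
    At least one run is always allowed.
    """
    if physical_cores <= 0 or cores_per_run <= 0:
        raise ValueError("core counts must be positive")
    ceiling = physical_cores / cores_per_run
    return max(math.ceil(ceiling) - 1, 1)


print(max_safe_runs(40, 8))  # 4
```

If every available core should be usable, the strict margin can be dropped, but the example above suggests keeping it.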


> The second factor is RAM: RAM plays a more important role for Analyst and other programs that use a lot of memory, e.g. Matrix with ARRAY, AUTOMDARRAY, etc. (together with processor usage). There is no exact rule for memory usage, but the following guidelines apply:

– For 32-bit Voyager, do not use any function/feature (e.g. AUTOMDARRAY, ARRAY, etc.) that takes more than 1.2 GB of memory in a single step

– For 64-bit Voyager, do not use any function/feature (e.g. AUTOMDARRAY, ARRAY, etc.) that takes more than 80% of physical memory in a single step; staying below 50% is recommended if possible

– For all versions, the total memory requirement of all Cluster nodes should not exceed 80% of physical memory; staying below 50% is recommended if possible

[Note that with intrastep distribution each node requires the same amount of memory as the non-Cluster step; with multistep distribution, add up the memory requirements of each node]
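
As a rough illustration of the 80% / 50% guidelines for intrastep distribution, the following Python sketch multiplies the memory of one step by the number of nodes and compares the total with physical memory. The function name, the status labels, and the assumption that `nodes` counts every process (including the main one) are hypothetical; actual memory use should be measured on the machine.

```python
def cluster_memory_check(step_gb: float, nodes: int, physical_gb: float,
                         hard_limit: float = 0.80, target: float = 0.50):
    """Return (total_gb, status) for an intrastep run.

    Assumption: each node needs the same memory as the non-Cluster step,
    so the total is step_gb * nodes.
    """
    total_gb = step_gb * nodes
    share = total_gb / physical_gb
    if share > hard_limit:
        status = "exceeds 80% limit"
    elif share > target:
        status = "above recommended 50%"
    else:
        status = "ok"
    return total_gb, status


# A 4 GB step on a 64 GB machine
for n in (8, 10, 14):
    print(n, cluster_memory_check(4, n, 64))
```

For multistep distribution the total would instead be the sum of the individual step requirements running at the same time.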



> The third factor when running Cluster is disk usage (this can be checked by running the model and monitoring disk usage in Task Manager, looking for values close to 100%). This matters especially for the PT program and other programs with high I/O requirements:

– e.g. route enumeration, where the route files (RTE) are written on the fly by all the parallel processes at the same time, and

– e.g. when writing Intercept (ICP) files.

If disk usage is very high, there may be no benefit in using more Cluster nodes. This is related to IOPS (I/O operations per second). Modern SSDs can sustain far more IOPS than hard disks and so can withstand a greater workload.


> The fourth factor is that Cluster itself adds some overhead, namely the time spent on communication between the master and the slave nodes. However, this is only significant when using a large number of slave nodes (e.g. more than 30).


Note on Hyperthreading


Hyperthreaded (logical) cores are only part of an execution unit: they help with the processor's pipeline processing, but only the physical cores can truly execute in parallel.


Turning hyperthreading off does not improve parallel performance, and hyperthreading itself causes no conflict or issue for Cluster; however, it is recommended not to use more nodes than the number of physical cores. The pitfall with hyperthreading is that it can make you think you have twice as many cores as you really do (e.g. with 16 physical cores and hyperthreading enabled, the machine reports 32 logical cores, but only 16 can execute in parallel). The problem arises only when you over-provision the CPU by running more than 16 (in this example) fully CPU-bound processes or threads. So, if you tried to run 32 Cluster nodes, you would most likely degrade performance.


A computer program is made up of single instructions, organized in "threads of execution" by the Operating System.


A "thread of execution" is the context of a particular set of instructions: its associated "instruction pointer" (which allows the OS/processor to keep track of which assembly instruction is being executed), its associated local memory, etc. Any sequence of instructions running through the OS is part of a "thread of execution".


You have lots of "processes" running on your computer - every process must have at least one "thread of execution" - but the processor does not actually have the concept of a "process".


The OS has to manage all of these running processes, so it has a "scheduler" that decides which thread from which process runs on which processor.


There are two main reasons why the OS/scheduler may put a thread in a "waiting state" and use its processor for another thread:


1) I/O request. Typically, there is a point in the program where the process (and therefore its thread) performs some I/O (e.g. a Voyager program reading a file). When the thread requests I/O, it is effectively "paused" (or "blocked") while it waits for the computer to go to the disk, extract the requested data, and make it available in its process memory. When this happens, the OS/scheduler puts that thread in a waiting state and uses the processor for another thread. When the I/O request is complete, the hardware notifies the OS by raising an "Interrupt Request" (or "IRQ"). The OS works out that the data is available and which thread requested it, finds the information about that thread in its bookkeeping, and then resumes execution of that thread on a processor from where it left off (instruction pointer).


2) Competition for the CPU. There are more processes or threads actively requiring work than there are available processors, and the OS wants to be a "fair scheduler", giving equal time to everyone.
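
The effect of the first reason (threads in a waiting state do not occupy a processor) can be illustrated with a small Python sketch. It is not Voyager code: `time.sleep` merely stands in for a blocking disk read, and the function names are hypothetical.

```python
import threading
import time


def blocking_io(seconds: float) -> None:
    """Stand-in for a blocking I/O request: the thread waits and uses no CPU."""
    time.sleep(seconds)


def run_waiting_threads(n_threads: int, seconds: float) -> float:
    """Start n_threads that each wait `seconds`; return the total elapsed time."""
    threads = [threading.Thread(target=blocking_io, args=(seconds,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = run_waiting_threads(8, 0.1)
# The waits overlap, so the total is close to 0.1 s rather than 8 x 0.1 s.
print(f"8 threads, each waiting 0.1 s, finished in {elapsed:.2f} s")
```

Because the waits overlap, many waiting threads can share few processors; fully CPU-bound threads, in contrast, compete for the processors as described in the second reason.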


This raises a question: a thread was running on CPU #X when it was put into a "waiting state"; when it is put back to work, which CPU will it execute on?

By default, Windows processes do not have processor affinity set (affinity means that a process/thread will only run on a particular processor, e.g. CPU #X). So, if a single-threaded process runs for 10 hours, the OS may schedule it on every processor in the machine for various periods during that time. To simplify (the scheduler may apply optimizations such as preferring the last processor used), it does not matter which processor a thread uses when it "wakes up". Hyperthreading is partly a way to speed up this "wake up"; it does not provide true additional execution units. Which processor a thread runs on is therefore largely a function of the I/O characteristics of the process and its threads, and of the OS scheduler (plus any hints you provide, such as affinity or priority).


Note: occasionally you can run slightly more processes/threads than you have cores if the computation involves a lot of I/O with long waiting times. For instance, you might benefit from running 19-20 Cluster nodes on 16 cores, but this is an optimization that depends heavily on the use case (Voyager/Cluster). On the other hand, the more threads perform I/O, the more work the I/O subsystem has to do, and performance may suffer. In particular, when disk usage is very high, using more cores can increase runtimes. This is the third factor described above.
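
The node-count guidance above can be sketched as follows. The function name and the `io_margin` of 20% are hypothetical; the margin is only chosen so that 16 cores give 19 nodes, in line with the 19-20 example, and any real value should be found by testing the specific model.

```python
def suggested_nodes(physical_cores: int, io_bound: bool = False,
                    io_margin: float = 0.2) -> int:
    """Suggest a Cluster node count from the number of physical cores.

    CPU-bound work: no more nodes than physical cores.
    I/O-heavy work: optionally a few extra nodes (assumed margin, to be tested).
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    if not io_bound:
        return physical_cores
    return physical_cores + int(physical_cores * io_margin)


print(suggested_nodes(16))                 # 16
print(suggested_nodes(16, io_bound=True))  # 19
```

If disk usage is already close to 100%, the extra nodes are unlikely to help and may increase runtimes.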


How to detect Number of Cores


The command-line variables and commands below can be used within Pilot in Cube to obtain the number of processors.


1. The %NUMBER_OF_PROCESSORS% variable gives the number of logical cores when hyperthreading is ON. If you turn hyperthreading OFF, it is *more likely* that this variable reports the number of physical cores, but it may still not be reliable.


2. The number of physical cores can be obtained with the WMIC command below (the resulting myCores.dat file then needs to be converted to ASCII, which the second command does):

*WMIC CPU Get NumberOfCores > "{CATALOG_DIR}\Model\myCores.dat"

*cmd /a /c type "{CATALOG_DIR}\Model\myCores.dat" > "{CATALOG_DIR}\Model\myCores1.dat"


3. Other programs can also determine the number of cores accurately. One example is CPU-Z, available from the links below:

https://www.cpuid.com/ 

https://www.cpuid.com/softwares/cpu-z.html  
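
For option 2, the converted WMIC output still has to be turned into a single number. The Python sketch below shows one way to do that outside Voyager; the function name is hypothetical, and it assumes the file contains a "NumberOfCores" header followed by one value per CPU socket, so machines with several sockets need the values summed.

```python
def physical_cores_from_wmic(text: str) -> int:
    """Sum the numeric rows of 'WMIC CPU Get NumberOfCores' output.

    Assumption: the text has already been converted to ASCII/ANSI
    (as myCores1.dat is) and contains one row per CPU socket.
    """
    total = 0
    for line in text.splitlines():
        value = line.strip()
        if value.isdigit():  # skips the header and blank lines
            total += int(value)
    if total == 0:
        raise ValueError("no core count found in WMIC output")
    return total


# Hypothetical content of myCores1.dat on a two-socket machine
sample = "NumberOfCores  \n8              \n8              \n\n"
print(physical_cores_from_wmic(sample))  # 16
```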


CPU-Z can be run from the command line. An example of using CPU-Z in a Voyager script is reported below:

;****************************************************************************************************************************************************
; PILOT Script
; Run CPU-Z from the command line and write a text report to the scenario folder
*"C:\Program Files\CPUID\CPU-Z\cpuz.exe" -txt={Scenario_dir}\CPU-Z_report
; End of PILOT Script

; Script for program MATRIX
; Read the CPU-Z report record by record and write the core count to a text file
RUN PGM=MATRIX PRNFILE="{Scenario_dir}\EAMAT0A0.PRN" MSG='Reading the TXT file and creating a PRN file with the variable n_CPUs'
FILEO PRINTO[1] = "{Scenario_dir}\CPU-Z_nCPUs.txt"
FILEI RECI = "{Scenario_dir}\CPU-Z_report.TXT"

  ; Position of the label in the current record (0 if the label is absent)
  _nCores=strpos('Number of cores',reci)
  if (_nCores>0)
    ; Position just after the label
    _length1=_nCores+strlen('Number of cores')
    ; Position of the opening bracket that follows the core count
    _postbr =strpos('(',reci)
    _length2=_postbr-_length1
    ; Extract the text between the label and the bracket
    n_CPUs=substr(reci,_length1,_length2)
    PRINT PRINTO=1 LIST="n_CPUs=", val(n_CPUs)(L10.0)
  endif
ENDRUN
;****************************************************************************************************************************************************
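
The parsing logic of the MATRIX step above (find the label, then take the text up to the opening bracket) can be checked outside Voyager with a short Python sketch. The function name and the sample line are hypothetical; the sample only assumes, as the script does, that the report contains a line with the label "Number of cores" followed by the count and an opening bracket.

```python
from typing import Optional


def cores_from_cpuz_line(line: str) -> Optional[int]:
    """Return the core count from one report line, or None if not present."""
    label = "Number of cores"
    start = line.find(label)
    if start < 0:
        return None  # label not on this line
    start += len(label)
    end = line.find("(", start)
    if end < 0:
        end = len(line)  # no bracket: take the rest of the line
    text = line[start:end].strip()
    return int(text) if text.isdigit() else None


# Hypothetical report lines
print(cores_from_cpuz_line("Number of cores\t\t8 (max 8)"))     # 8
print(cores_from_cpuz_line("Number of threads\t16 (max 16)"))   # None
```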