Solved

How to bypass processor groups to support more than 64 logical cores



Hi AIMMSians,

 

I am using the AMD 3995WX processor, which has 128 logical processors. I realized that a single instance of AIMMS uses only 64 of them, because Windows places logical processors in processor groups of at most 64. To use all 128, I can run two instances of AIMMS at the same time: the first instance gets the affinity of group 0, and I then have to change the affinity of the second instance so that it uses the second NUMA node (group 1). Still, I would like a single instance of AIMMS to use all logical processors. I assume the same problem exists on servers, where the number of logical cores often exceeds 64.

Is there any way to bypass the Windows processor group limit (i.e. make AIMMS group aware) and distribute the job across all 128 available threads?

 

Note 1: I tried CPLEX and GUROBI solvers. I am not sure if this limitation is related to AIMMS or the solvers.

Note 2: I am aware of disabling Hyperthreading/SMT, but that is not desirable. 
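
For context, the processor group layout that Windows reports can be inspected with a few lines of Python. The sketch below is only an illustration and is not part of AIMMS: it calls the Win32 functions GetActiveProcessorGroupCount and GetActiveProcessorCount through ctypes, and on other platforms it falls back to os.cpu_count(), since processor groups are a Windows concept. The helper name logical_processors_per_group is made up for this example.

```python
import ctypes
import os
import sys


def logical_processors_per_group():
    """Return the number of active logical processors in each processor group.

    On non-Windows platforms there are no processor groups, so a single
    entry based on os.cpu_count() is returned instead.
    """
    if sys.platform != "win32":
        return [os.cpu_count() or 1]

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # WORD GetActiveProcessorGroupCount(void)
    kernel32.GetActiveProcessorGroupCount.argtypes = []
    kernel32.GetActiveProcessorGroupCount.restype = ctypes.c_ushort
    # DWORD GetActiveProcessorCount(WORD GroupNumber)
    kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_ushort]
    kernel32.GetActiveProcessorCount.restype = ctypes.c_ulong

    group_count = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(group_count)]


if __name__ == "__main__":
    groups = logical_processors_per_group()
    for index, count in enumerate(groups):
        print(f"group {index}: {count} logical processors")
    print(f"total: {sum(groups)}")
```

If this lists more than one group, an application that is not group aware typically stays within one of them, which matches the behavior described in the question.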


Best answer by Mstislav Simonenkov 7 June 2021, 14:38


16 replies


Hi @afattahi 

This article https://www.extremetech.com/computing/310299-how-does-windows-use-multiple-cpu-cores seems to suggest that this is a Windows limitation. 

 


Hi @afattahi

Gurobi has a parameter allowing use of multiple NUMA nodes.

I think it is called GURO_PAR_PROCGROUPS; for example, GURO_PAR_PROCGROUPS 2, where 2 is the number of NUMA nodes.

 



It is indeed a Windows limitation. However, it appears that an application can be modified to schedule its work across processor groups, which is called making the code "group aware".

This article https://bitsum.com/general/the-64-core-threshold-processor-groups-and-windows/ suggests contacting the app developer to make the app group aware.

As @Mstislav Simonenkov suggested, apparently Gurobi has this feature.

 


@Mstislav Simonenkov This is great! 

I searched for the GURO_PAR_PROCGROUPS parameter in Gurobi's documentation, but I could not find any information. Could you please provide more information about this parameter, or let us know how to set it through AIMMS?


Hi @afattahi 

We actively discussed the issue of NUMA nodes with Marcel Hunting. I want to thank him again.
CPLEX supports the parameter CPXPARAM_CPUmask, which lets the user bind threads to specific cores and so control which NUMA nodes are used.

https://www.ibm.com/support/pages/node/6435091
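
As an illustration of what a CPU mask value looks like, here is a minimal sketch that builds a hexadecimal mask string from a list of core indices. It assumes the convention described in the IBM documentation, where the value is a hexadecimal string without a 0x prefix and the lowest-order bit stands for the first logical core; please verify this against the IBM page before relying on it. The helper name cpu_mask is hypothetical, and the sketch says nothing about how such a mask interacts with Windows processor groups.

```python
def cpu_mask(cores):
    """Build a hexadecimal mask string with one bit set per core index.

    Assumption: bit 0 (lowest-order bit) corresponds to the first logical core.
    """
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << core
    return format(mask, "x")


if __name__ == "__main__":
    print(cpu_mask(range(4)))     # first four cores  -> f
    print(cpu_mask(range(4, 8)))  # next four cores   -> f0
```
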

However, as far as I'm aware, AIMMS currently does not support the CPXPARAM_CPUmask parameter for CPLEX.

As for the Gurobi parameter, unfortunately I do not know how to set it through AIMMS. I use it with other software via a Gurobi.par file.

Another option is to switch the OS from Windows to Linux.


@Mstislav Simonenkov Thank you for the accurate input. 

I have not used the Linux version of AIMMS. I would prefer to stick with Windows as much as possible.

 

@MarcelRoelofs Would it be possible to add support for the CPLEX CPU mask parameter (https://www.ibm.com/docs/en/icos/20.1.0?topic=parameters-cpu-mask-bind-threads-cores) and the similar Gurobi parameter to AIMMS?


@afattahi Support for NUMA machines is on our backlog, but currently it does not have the highest priority because it seems that the performance gain of using more than 32 threads is limited. See for example this discussion:

https://or.stackexchange.com/questions/5416/gurobi-and-cplex-cannot-exploit-more-than-32-cores-of-machine

Moreover, using multiple NUMA nodes to solve one MIP problem might harm the performance because then data has to be transferred through the cross-chip interconnect which adds overhead (see: https://www.cse.wustl.edu/~angelee/cse539/papers/numa.pdf).

For NUMA awareness of CPLEX, see: https://www.ibm.com/support/pages/node/6435091

@Mstislav Simonenkov GURO_PAR_PROCGROUPS seems to be an undocumented parameter. You can use a Gurobi parameter file in AIMMS by enabling the Gurobi option ‘Read Parameter File’.


 @Marcel Hunting 
@afattahi 


I would like to share the results of some of my tests when switching from 48 to 96 threads:
MIP models: the gap improved by about 0.25% with a 24-hour time limit.

LP models: significantly faster to optimality, by about 40% (barrier method).



@Mstislav Simonenkov Very interesting results for the LP model. I assume you measured the improvement in solving time, not total elapsed time. May I ask which CPU configuration you are using?

Also, if you are using Windows, could you please let us know how you distribute threads across NUMA nodes?


@afattahi 

According to the AIMMS help, my suggestion is to try the following:


1) Set the Gurobi option ‘Read Parameter File’ to Yes in the project options

2) Create a txt file with the desired Gurobi settings. An example is shown below:

Heuristics 0.1

GURO_PAR_PROCGROUPS 2

3) Save this txt file as gurobi.prm and place it in the AIMMS project directory

 

Please let me know if it helped.
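
The steps above can also be scripted. The following is a minimal sketch, not an AIMMS feature: it writes a gurobi.prm file in the plain "name value" format used in the example, with "#" starting a comment line. The function name write_gurobi_prm and the use of a temporary folder in the demonstration are assumptions made for illustration, and GURO_PAR_PROCGROUPS is the undocumented parameter name mentioned in this thread, so its effect is not guaranteed.

```python
import tempfile
from pathlib import Path


def write_gurobi_prm(project_dir, settings):
    """Write 'name value' lines to gurobi.prm in project_dir and return its path."""
    path = Path(project_dir) / "gurobi.prm"
    lines = ["# Parameter settings"]
    lines += [f"{name} {value}" for name, value in settings.items()]
    path.write_text("\n".join(lines) + "\n", encoding="ascii")
    return path


if __name__ == "__main__":
    # Demonstration in a temporary folder; in practice project_dir would be
    # the folder that holds the .aimms file.
    with tempfile.TemporaryDirectory() as tmp:
        prm = write_gurobi_prm(tmp, {"Heuristics": 0.1, "GURO_PAR_PROCGROUPS": 2})
        print(prm.name)
        print(prm.read_text())
```

Writing the file from a script has the side benefit that the file name is exactly gurobi.prm, with no extra extension added by an editor.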


@Mstislav Simonenkov 

I followed the AIMMS help and your instructions. Unfortunately, AIMMS does not apply the gurobi.prm parameters to my model.

My gurobi.prm file:

# Parameter settings

THREADS 128

GURU_PAR_PROCGROUPS 2

 

I placed the file in the AIMMS root project folder.

According to the AIMMS help, the gurobi.prm file overrides the AIMMS settings. I set the Thread_limit parameter to 80 in AIMMS. However, when the run finished, the Gurobi log file showed that only 80 threads had been used instead of the 128 requested in the file, meaning that AIMMS did not read the .prm file. Also, while the model was running, I saw that only one NUMA node was being used.

 

Please let me know if I am making a mistake here. 


@afattahi 

Could you please try changing GURU_PAR_PROCGROUPS to GURO_PAR_PROCGROUPS?

Please let me know if it helped


@Mstislav Simonenkov Ah, I made a typo in my post. In the file I actually wrote GURO_PAR_PROCGROUPS 2. I tried both a space and a tab character between the name and the number. Also, I use an academic license.

Maybe @Marcel Hunting can point out what I am missing?



@Marcel Hunting Great to hear that NUMA support is on your development list. I deal with large LP problems for energy system modeling, and I almost always use the barrier method because of its relatively superior solve speed. My guess is that raising the thread limit above 64 would speed up solving for LP problems. Of course the speed-up would not be linear, but it would still be a good improvement. My knowledge of this issue is very limited and mainly comes from online forums.

It would be very helpful if support for NUMA nodes became available, at least for LP problems.


@afattahi I have no idea what is wrong with your parameter file. If one of the parameter names is incorrect then Gurobi will still read and set the other options, so Threads should have been set to 128. Did you place ‘gurobi.prm’ in the same folder as the .aimms file?


@Marcel Hunting @Mstislav Simonenkov Thank you for all your input. I have now learned how to get greater control over solver options. After some trial and error, I found my rookie mistake: I had saved the file as gurobi.prm.txt without seeing the extension.
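
Because Windows Explorer hides known file extensions by default, this kind of mistake is easy to miss. A small check like the sketch below can confirm that the project folder holds a file named exactly gurobi.prm and flag lookalikes such as gurobi.prm.txt. The function name check_parameter_file is hypothetical, and the demonstration reproduces the mistake in a temporary folder rather than a real AIMMS project.

```python
import tempfile
from pathlib import Path


def check_parameter_file(project_dir, name="gurobi.prm"):
    """Return (found, lookalikes) for the parameter file in project_dir.

    found      -- True if a file with exactly the expected name exists
    lookalikes -- names such as 'gurobi.prm.txt' that carry an extra extension
    """
    folder = Path(project_dir)
    found = (folder / name).is_file()
    lookalikes = sorted(p.name for p in folder.glob(name + ".*") if p.is_file())
    return found, lookalikes


if __name__ == "__main__":
    # Reproduce the mistake: the file was saved with a hidden .txt extension.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "gurobi.prm.txt").write_text("Threads 128\n")
        found, lookalikes = check_parameter_file(tmp)
        print(found, lookalikes)
```

In the demonstration the check reports that gurobi.prm is missing and lists gurobi.prm.txt as a lookalike.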

I was able to set the PROCGROUPS parameter successfully, but the result was disappointing.

In my LP model (barrier method), solving time increased by 35% when moving from the default (32) threads to 128 threads. I am not sure why performance decreased.


@afattahi 
Glad we were able to help you!

Very interesting results! I achieved the best (net) solving time by using half of the cores (96 of 192).
 
