
Batch Systems on MacOS X

Torque

Torque is an attractive batch system because it would capitalize on LAL's existing experience of using Torque to manage grid resources. It would also allow direct integration into the existing Computing Elements (CEs) at LAL.

The latest version of Torque (2.4.6) compiles on MacOS X with only minor problems. The code uses the deprecated stat64 structure, so the --disable-gcc-warnings flag must be passed to configure. The configure command used to build Torque was:

./configure --disable-gcc-warnings --disable-gui --prefix=/usr/local/torque --with-server-home=/var/spool/pbs --disable-drmaa

Unfortunately, when actually running the server (the same happens with pbs_mom), the log file fills with the following error message:

03/15/2010 08:23:24;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::PBS_Server, wait_request failed

and the server is not functional. The error comes from wait_request, a low-level routine within Torque (defined in net_server.c) that waits for data on a socket and then processes that data with a callback function. Unfortunately, the actual error code is not returned, so without debugging it is not possible to know exactly what has gone wrong.
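To illustrate the kind of change that would make such a failure diagnosable, here is a minimal sketch of a wait-then-callback routine that preserves errno. It is not Torque's wait_request; the names wait_for_data, ready_cb and count_byte are hypothetical, and the sketch assumes a select()-style wait on a single descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical callback type: invoked with the descriptor that has data. */
typedef void (*ready_cb)(int fd);

/*
 * Wait up to timeout_sec seconds for data on fd, then call cb(fd).
 * Returns 1 if the callback ran, 0 on timeout, and -errno on failure,
 * so the caller can log the underlying cause instead of a bare "failed".
 */
int wait_for_data(int fd, long timeout_sec, ready_cb cb)
{
    fd_set readset;
    struct timeval tv;
    int n;

    assert(cb != NULL);

    /* FD_SET is only defined for descriptors in [0, FD_SETSIZE). */
    if (fd < 0 || fd >= FD_SETSIZE)
        return -EBADF;

    FD_ZERO(&readset);
    FD_SET(fd, &readset);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    n = select(fd + 1, &readset, NULL, NULL, &tv);
    if (n < 0) {
        int saved = errno; /* save before another call can overwrite it */
        fprintf(stderr, "select failed: %s (errno=%d)\n",
                strerror(saved), saved);
        return -saved;
    }
    if (n == 0)
        return 0; /* timeout, nothing to process */

    cb(fd);
    return 1;
}

/* Example callback (hypothetical): consume one byte and count it. */
int bytes_seen = 0;

void count_byte(int fd)
{
    char c;
    if (read(fd, &c, 1) == 1)
        bytes_seen++;
}
```

With the error code propagated this way, a log line like the one above could include the strerror() text, which would show directly whether the failure is, for example, a bad descriptor or an invalid argument.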

Sun Grid Engine

Sun Grid Engine is used by several sites within the EGEE grid infrastructure and is also a good candidate batch system for the ARTS@LAL system. It is also interesting because it already has the concept of a "cloud connector" that allows scaling a system out into a commercial cloud such as Amazon EC2, and it allows the integration and control of Hadoop resources.

MacOS X is listed as a supported system, but the binary downloads are labeled as being for MacOS X 10.4 (PPC and Intel). There are apparently problems running these binaries on more recent versions of the operating system, as seen in these posts (SGE on SnowLeopard and SGE launchd scripts) from BioTeam Inc.

Following the information in those posts gets closer to having the SGE daemons running, but I have not yet been able to get both the master and execution node daemons running, which is needed for job submission.

Slurm

Slurm is a batch system used at many supercomputing centers and is noted for its simplicity and scalability. In the grid context, Torque wrappers exist for the Slurm command-line utilities, which would make integration into the grid straightforward.

Although MacOS X is listed on the site as one of the supported platforms, the code does not compile on MacOS X. The file partition_mgr.c calls getgrent_r(), a non-POSIX call (a GNU extension) in the C library that does not exist on MacOS X. This call is intended to be a reentrant (thread-safe) version of getgrent(), which allows looping over the defined groups.

On MacOS X, the manpages for similar C functions explicitly mark them as thread-safe. The getgrent() manpage carries no such marking, so testing would be needed to determine whether it is in fact thread-safe. If it is, rewriting the code to use getgrent() would be fairly easy.

This problem occurs with both the production release of slurm (2.1.4) and the development release (2.2.0-0.pre2).

Last modified on Mar 15, 2010, 8:52:07 AM