Troubleshooting Torque and Maui
From Debian Clusters
If you're visiting this page, you should have Torque, a resource manager, and Maui, a scheduler, installed on your systems. This page will cover a few places to check for information about the problem and a few "what can go wrong" scenarios.
If you're looking for how to test if a setup is working properly, or how to submit a job to the queue, see the Torque and Maui Sanity Check: Submitting a Job page.
Problems with Torque/Maui
Most problems can be caught along the way while installing Torque and Maui. Make sure you can run qstat -a and see the queues (visit the Torque page if you can't), and make sure you can run showq to see the jobs that Maui has been working with (visit the Maui page if you can't).
Troubleshooting Tips
Check for Running Programs
The first step in making sure everything flows correctly is to make sure the right components are running on the right servers.
On the head node, you'll want running
- the Torque pbs_server - run
ps aux | grep pbs | grep -v grep
to verify that it is running
- Maui - run
ps aux | grep maui | grep -v grep
to verify
On the worker nodes, you'll want running
- a Torque pbs_mom - run
ps aux | grep pbs | grep -v grep
to verify
If one of these is missing, it needs to be started by running its binary. If you followed my setup,
- pbs_server is at /usr/local/sbin/pbs_server
- maui is at /usr/local/maui/sbin/maui
- pbs_mom is at /usr/local/sbin/pbs_mom
Otherwise, if you can't find it, you can install locate (apt-get install locate), run updatedb, and then enter
locate x
where x is the binary you're trying to find. It will also potentially come up with quite a few more file names that have x in their path.
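As a rough sketch of the process checks above, the shell function below reads a list of process names and prints whichever of the expected daemons are absent. The name missing_daemons is made up for illustration and is not part of Torque or Maui. It assumes that ps -e -o comm= prints one command name per line, which is the case on typical Linux systems.

```shell
# Hypothetical helper: print each expected daemon that is NOT in the
# process list supplied on standard input (one command name per line).
missing_daemons() {
    procs=$(cat)                      # slurp the process list once
    for name in "$@"; do
        # -x: whole-line match, -F: treat the name as a plain string
        printf '%s\n' "$procs" | grep -qxF -e "$name" || printf '%s\n' "$name"
    done
}

# Usage on the head node (prints nothing if both are running):
#   ps -e -o comm= | missing_daemons pbs_server maui
# Usage on a worker node:
#   ps -e -o comm= | missing_daemons pbs_mom
```

Matching whole lines avoids the false positives that a plain grep pbs can give, such as matching an editor that happens to have a pbs file open.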
Check the Logs
The logs are also a great source of information. On the server, you'll want to check the Torque server logs at <your pbs root>/server_logs/ (/var/spool/pbs/server_logs if you used my Torque setup). Maui logs on the server are at <your maui root>/log/maui.log (/var/spool/maui/log/maui.log if you used my Maui setup).
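Torque normally writes one log file per day, named after the date in YYYYMMDD form. Assuming that naming holds on your install, a small helper like the one below (todays_log is a made-up name) builds the path to today's file so you can tail it directly. The same idea should work for the mom_logs directory on a worker node.

```shell
# Hypothetical helper: print the path of today's log file inside a
# Torque log directory, assuming files are named YYYYMMDD.
todays_log() {
    # ${1%/} strips one trailing slash so the path has no double slash
    printf '%s/%s\n' "${1%/}" "$(date +%Y%m%d)"
}

# Usage (path is from my setup; adjust to your pbs root):
#   tail -n 50 "$(todays_log /var/spool/pbs/server_logs)"
```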
On the worker nodes, you can check the pbs_mom logs. These are at <your pbs root>/mom_logs/ (/var/spool/pbs/mom_logs if you used my setup). Additionally, you can check for undelivered files on the worker nodes - these are located at
<your pbs root>/undelivered/ (/var/spool/pbs/undelivered if you used the same setup as me).
Check What Nodes the Head Node Can See
When running into troubles, checking the status of the worker nodes - according to the pbs_server (the Torque server) - can sometimes be helpful. Running
pbsnodes
will show a list of what worker nodes the head node "sees", and also their status. For instance, a typical entry for a worker node in this list might look like:
owl
state = free
np = 4
ntype = cluster
status = opsys=linux,uname=Linux owl 2.6.21-2-686 #1 SMP Wed Jul 11 03:53:0
2 UTC 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=1266542,totmem
=3004480kb,availmem=2954424kb,physmem=1028496kb,ncpus=8,loadave=0.00,netloa
d=201080783,state=free,jobs=,varattr=,rectime=1201201179
A node whose pbs_mom is unreachable to the head node will appear like this:
harrier
state = down
np = 4
ntype = cluster
Any nodes that don't show up in the list or show up as down should be further examined. It may be helpful to check my documentation on installing Torque on worker nodes.
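On a larger cluster it can be tedious to scan the whole list by eye. The sketch below filters pbsnodes output down to the nodes whose state contains down or offline. The function name unavailable_nodes is made up, and the parsing assumes the layout shown above: a bare node name on its own line, followed by "key = value" lines.

```shell
# Hypothetical helper: read pbsnodes output on standard input and print
# "<node> <state>" for every node whose state mentions down or offline.
unavailable_nodes() {
    awk '
        # A single word with no "=" starts a new node entry.
        NF == 1 && $0 !~ /=/ { node = $1; next }
        # The state line belongs to the most recent node name seen.
        $1 == "state" && $2 == "=" && $3 ~ /down|offline/ { print node, $3 }
    '
}

# Usage on the head node:
#   pbsnodes | unavailable_nodes
```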
Determining Which Node a Job is Running On
Particularly if you want to check a pbs_mom log or look for undelivered files, it helps to know which worker node a job is running on. This takes a little digging, because the node is normally transparent to users: they only need to know that their job has been submitted and is running, not where. It's not too difficult to find out, however.
As shown above, pbsnodes will show the worker nodes and their status. To narrow it down a bit and just see the name and state lines, run
pbsnodes | grep -v status | grep -v ntype | grep -v np
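A different route, not the one used on this page, is to ask Torque about the job itself: qstat -f prints a job's full attributes, and for a running job these normally include an exec_host line naming the node and CPU slot. The sketch below pulls out that value. The name job_exec_host is made up, and it only reads the first line of the value, so a long multi-node list that qstat wraps across lines would come out truncated.

```shell
# Hypothetical helper: read "qstat -f <jobid>" output on standard input
# and print the value of the exec_host attribute, if present.
job_exec_host() {
    awk '$1 == "exec_host" && $2 == "=" { print $3 }'
}

# Usage, where 42 stands in for a real job id:
#   qstat -f 42 | job_exec_host
```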
No Files?
There are a number of reasons why your users might not receive their .o# and .e# files after their jobs finish.
Setting up Scp
If you're using a mounted file system, each of your worker nodes must have a pbs_mom config file that explains how to copy files back to the head node. This file should have the following contents:
$usecp <full hostname of head node>:<home directory path on head node> <home directory path on worker node>
The path is the same for me on my head node or worker node, and my file looks like this:
$usecp gyrfalcon.raptor.loc:/shared/home /shared/home
This file needs to be stored at <your pbs root>/mom_priv/config. (If you used my Torque setup, your path will be /var/spool/pbs/mom_priv/config.) You can create this file on each one of the worker nodes individually, or check out the Cluster Time-saving Tricks page to see how to do this more quickly.
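If you script this, it helps to make the step safe to repeat. The sketch below appends the $usecp line to a config file only if that exact line isn't already there. The function name ensure_usecp is made up. The single quotes around the printf format keep the shell from treating $usecp as a variable.

```shell
# Hypothetical helper: make sure a pbs_mom config file contains the
# $usecp line for the given head node and home directory paths.
#   $1 = config file            $2 = full hostname of head node
#   $3 = home path on head node $4 = home path on worker node
ensure_usecp() {
    config=$1
    # Single quotes: "$usecp" is literal text here, not a shell variable.
    line=$(printf '$usecp %s:%s %s' "$2" "$3" "$4")
    touch "$config" || return 1
    # Append only when no identical line exists already.
    grep -qxF -e "$line" "$config" || printf '%s\n' "$line" >> "$config"
}

# Usage on a worker node, with the values from my setup:
#   ensure_usecp /var/spool/pbs/mom_priv/config gyrfalcon.raptor.loc /shared/home /shared/home
```

I'd expect pbs_mom to need a restart before it picks up a changed config file, so plan on restarting it after running this.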
Typically, the lack of this file will result in the error and output files being lost. Your users will receive e-mails from the system saying something to this effect.
Troubles with SSH
If your users aren't getting e-mails about lost files, and you've set up scp as shown above, but your users still aren't seeing their output and standard error files after jobs finish, the problem may be with the SSH configuration. Run a job as one of your users, and pay attention to which node the job runs on by running
pbsnodes | grep -v status | grep -v ntype | grep -v np
before the job finishes. Then SSH into that node when the job completes. Check <your pbs root>/undelivered/ (if you used my Torque setup, that's /var/spool/pbs/undelivered/). If you have files ending with .OU and .ER, it's probably a delivery problem due to one of two SSH problems.
No SSH Key?
Each of your users needs to have a ~/.ssh/authorized_keys2 file whose contents match the contents of their ~/.ssh/id_rsa.pub file. If you have NFS-mounted home directories, you only need this once. If the home directories are different on each of the nodes, rather than mounted, you'll need the same key and the same authorized_keys2 file in the home directory on each one of the worker nodes.
To create this key and file for the first time, as your user, run
ssh-keygen
Keep the default location for the file, and press Enter twice to leave the passphrase empty. Then, from the ~/.ssh directory, run
cat id_rsa.pub >> authorized_keys2
This file must be readable only by its owner. Do this with
chmod 600 authorized_keys2
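The last two steps can be wrapped up so they are safe to run more than once. This is only a sketch: authorize_key is a made-up name, it takes the .ssh directory as its argument, and it assumes the key pair already exists there as id_rsa and id_rsa.pub. It adds the public key to authorized_keys2 only if it isn't already present, then tightens the permissions.

```shell
# Hypothetical helper: add <dir>/id_rsa.pub to <dir>/authorized_keys2
# (once only) and make the file readable by its owner alone.
authorize_key() {
    ssh_dir=$1
    pub="$ssh_dir/id_rsa.pub"
    auth="$ssh_dir/authorized_keys2"
    [ -r "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    key=$(cat "$pub")
    touch "$auth" || return 1
    # Skip the append if the exact key line is already authorized.
    grep -qxF -e "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
    chmod 600 "$auth"
}

# Usage, as the user in question, after running ssh-keygen:
#   authorize_key ~/.ssh
```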
Strict StrictHostKeyChecking
If you followed my Cloning Worker Nodes tutorial, you probably disabled StrictHostKeyChecking for SSH before cloning all of the worker nodes. If not, this might be the problem. If you haven't changed it, the default for this setting is ask, meaning that when a user (or a program acting on behalf of a user) tries to SSH from one node to another node that it hasn't encountered before, the user will be prompted as to whether they would like to accept the identification the node gives and continue SSHing in. Unfortunately, this can be a show stopper if there's no user to enter yes when prompted.
If you'd like to test if this is the problem, SSH into one of your worker nodes, become a normal user, and try to SSH into your head node. This is the kind of output you're likely to encounter:
kwanous@osprey:~$ ssh gyrfalcon
The authenticity of host 'gyrfalcon (192.168.1.200)' can't be established.
RSA key fingerprint is 22:98:61:31:fd:20:e8:c6:ec:47:e9:e9:ef:99:22:0d.
Are you sure you want to continue connecting (yes/no)?
One solution would be to SSH from each node to your head node as each one of your users. However, even with a script, this can take a lot of time, and it doesn't scale well. Plus, you'd need to do this every time you added a new user. Instead, you can disable StrictHostKeyChecking for SSH.
For each of your worker nodes, you'll need to open /etc/ssh/ssh_config and find the line that looks like
# StrictHostKeyChecking ask
Take out the hash (#) to uncomment this line, and change the value from ask to no.
If you'd rather not change it manually for each of your worker nodes, check out the Cluster Time-saving Tricks page to learn how to automate copying files out.
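The edit itself can also be done with sed rather than by hand. The sketch below rewrites any StrictHostKeyChecking line in the given file, commented out or not, so that it reads StrictHostKeyChecking no. The function name is made up, and if the file has no such line at all, nothing changes and you would need to add the line yourself. Try it on a copy of the file first.

```shell
# Hypothetical helper: set StrictHostKeyChecking to "no" in an
# ssh_config-style file, uncommenting the line if necessary.
disable_strict_host_key_checking() {
    conf=$1
    tmp=$(mktemp) || return 1
    # Match an optional leading "#" and any existing value.
    sed 's/^[[:space:]]*#\{0,1\}[[:space:]]*StrictHostKeyChecking[[:space:]].*$/    StrictHostKeyChecking no/' \
        "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"          # overwrite in place, keeping ownership and mode
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage as root on each worker node:
#   disable_strict_host_key_checking /etc/ssh/ssh_config
```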
If All Else Fails...
Sometimes my server just seems to need to have Torque restarted. I haven't yet diagnosed why this happens, but it may be related to accidental power cycling (long story). When it starts to have strange errors, restarting might be a viable solution. From the head node, run
killall -KILL pbs_server
pbs_server

