Thursday, March 5, 2009

Another day, another Globus error

After almost 5 years at this lark, I thought I had a handle on most of the cryptic Globus errors. However, today ATLAS production jobs started failing with errors like this:

018 (9163559.001.000) 03/05 11:25:27 Globus job submission failed!
  Reason: 22 the job manager failed to create an internal script argument file

Google didn't provide any help, but after I asked on LCG rollout, it looked like the problem was the number of files in the relevant user's account. The cause turned out to be that /opt/lcg/sbin/cleanup-grid-accounts.sh, the script that cleans up the grid accounts, hadn't run for some days, and almost 32000 files had piled up under that account's directory.

So there's yet another vital cog in the grid wheel that can fail fairly silently and cause inexplicable errors! Time to add a Nagios sensor to check that this cron job runs successfully every night ...
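
For what it's worth, here is a rough sketch of what such a sensor might look like, written in Python. It does not check the cron job itself; it watches for the symptom instead, counting the entries directly under an account's directory and complaining when the number gets large. The function name `check_entry_count` and the thresholds (20000 warning, 30000 critical) are assumptions picked for illustration, not anything taken from the LCG tools. The only convention relied on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_entry_count(path, warn=20000, crit=30000):
    """Return (status, message) for the number of entries directly under path.

    The thresholds are illustrative guesses, not values from any grid tool.
    """
    try:
        with os.scandir(path) as entries:
            count = sum(1 for _ in entries)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot read {path}: {exc}"

    if count >= crit:
        return CRITICAL, f"CRITICAL: {count} entries in {path}"
    if count >= warn:
        return WARNING, f"WARNING: {count} entries in {path}"
    return OK, f"OK: {count} entries in {path}"


# Behave as a plugin only when given exactly one directory argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    status, message = check_entry_count(sys.argv[1])
    print(message)
    sys.exit(status)
```

Saved under a hypothetical name such as check_grid_account.py, it would be run with the account's directory as its only argument. A stricter sensor would also check the age of whatever log or timestamp the cleanup script leaves behind, if it leaves one.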