Thursday, March 5, 2009

Another day, another Globus error

After almost 5 years at this lark, I thought I'd got a handle on most of the cryptic Globus errors. However, today ATLAS production jobs started failing with errors like this:

018 (9163559.001.000) 03/05 11:25:27 Globus job submission failed!
  Reason: 22 the job manager failed to create an internal script argument file

Google didn't provide any help, but after I asked on LCG rollout, it looked like the problem was the number of files in the relevant user's account. The cause turned out to be that /opt/lcg/sbin/cleanup-grid-accounts.sh, the script that cleans up the grid accounts, hadn't run for some days, and there were almost 32000 files in that account's directory.

So there's yet another vital cog in the grid wheel that can fail fairly silently and cause inexplicable errors! Time to add a Nagios sensor to check that this cron job runs successfully every night ...
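
A minimal sketch of what such a sensor might look like, in Python. It is only an illustration under some assumptions: that the cron job is wrapped so it touches a stamp file when it finishes successfully, and that counting the entries directly under one account's directory is a good enough proxy for the problem above. The function names, the stamp file idea and the thresholds are all hypothetical, not part of the LCG script. The only fixed convention used is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_entries(path):
    """Count the entries directly under path (not recursive)."""
    return sum(1 for _ in os.scandir(path))


def check_cleanup(stamp_file, account_dir, max_age_hours=26,
                  warn_files=20000, crit_files=30000, now=None):
    """Return (exit_code, message) in the style of a Nagios plugin.

    stamp_file  -- hypothetical file touched by the cron job on success
    account_dir -- directory whose entry count we want to keep an eye on
    Thresholds are illustrative defaults, not recommended values.
    """
    now = time.time() if now is None else now
    try:
        age_hours = (now - os.path.getmtime(stamp_file)) / 3600.0
        n_files = count_entries(account_dir)
    except OSError as exc:
        # Missing stamp file or unreadable directory: we cannot tell.
        return UNKNOWN, "UNKNOWN: %s" % exc

    if age_hours > max_age_hours:
        return CRITICAL, "CRITICAL: cleanup last succeeded %.1f h ago" % age_hours
    if n_files >= crit_files:
        return CRITICAL, "CRITICAL: %d entries in %s" % (n_files, account_dir)
    if n_files >= warn_files:
        return WARNING, "WARNING: %d entries in %s" % (n_files, account_dir)
    return OK, "OK: cleanup ran %.1f h ago, %d entries" % (age_hours, n_files)
```

To use it as a real plugin, a small wrapper would call check_cleanup() with the site's own paths, print the message and pass the code to sys.exit(). The 26 hour default simply leaves some slack around a nightly run.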

2 comments:

Mike Jones said...

You mean "So there's yet another vital cog in the LCG grid wheel..." :-)

SteveT said...

For the job priorities, if the queued time was overriding the fairshare component, then consider a QUEUETIMEWEIGHT of 0.

In particular, if two jobs happen to have the exact same priority, which is of course unlikely, then job ID, and so order, will win.