018 (9163559.001.000) 03/05 11:25:27 Globus job submission failed!
Reason: 22 the job manager failed to create an internal script argument file
Google didn't provide any help, but after asking on LCG rollout, it looked like the problem was the number of files in the relevant user's account. It turned out that the script /opt/lcg/sbin/cleanup-grid-accounts.sh, which cleans up the grid accounts, hadn't run for some days, and there were almost 32000 files under that directory.
So there's yet another vital cog in the grid wheel that can fail fairly silently and cause inexplicable errors! Time to add a Nagios sensor to check that this cron job runs successfully every night ...
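For what it's worth, here is a rough sketch of what such a sensor could look like. It doesn't check the cron job itself; it counts the entries under a directory and complains when the count gets too high, which would have caught this failure. The script name check_grid_files.py, the directory argument and both thresholds are assumptions for illustration, not anything LCG ships; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
#!/usr/bin/env python
# Sketch of a Nagios-style check: count entries under a directory and
# compare against thresholds. Path and thresholds are caller-supplied
# assumptions, not values taken from any LCG configuration.
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_entries(path):
    """Count files and directories below path, recursively."""
    total = 0
    for _root, dirs, files in os.walk(path):
        total += len(dirs) + len(files)
    return total


def classify(count, warn, crit):
    """Map an entry count to a Nagios status code and message."""
    if count >= crit:
        return CRITICAL, "CRITICAL: %d entries (>= %d)" % (count, crit)
    if count >= warn:
        return WARNING, "WARNING: %d entries (>= %d)" % (count, warn)
    return OK, "OK: %d entries" % count


def main(argv):
    if len(argv) != 4:
        print("UNKNOWN: usage: check_grid_files.py DIR WARN CRIT")
        return UNKNOWN
    path, warn, crit = argv[1], int(argv[2]), int(argv[3])
    if not os.path.isdir(path):
        print("UNKNOWN: %s is not a directory" % path)
        return UNKNOWN
    code, message = classify(count_entries(path), warn, crit)
    print(message)
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Run it as, say, `check_grid_files.py DIR 10000 25000`, with DIR being wherever the pool accounts keep their files, and with thresholds picked to sit well below the point where submissions started failing here.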
2 comments:
You mean "So there's yet another vital cog in the LCG grid wheel..." :-)
For the job priorities, where the queued time was overriding the fairshare component, consider setting
QUEUETIMEWEIGHT to 0.
In particular, if two jobs happen
to have exactly the same priority, which is of course unlikely, then job ID, and so submission order, will win.