Thursday, June 18, 2009

DPM 1.7.0 upgrade

I took advantage of a scheduled downtime to upgrade our DPM server. We needed the upgrade because we want to move files around with dpm-drain without losing space token associations. As we don't use YAIM, I had to run the upgrade script manually, but it wasn't too difficult. Something like this should work (after putting the dpmmgr database password in a suitable file):


./dpm_db_310_to_320 --db-vendor MySQL --db $DPM_HOST --user dpmmgr --pwd-file /tmp/dpm-password --dpm-db dpm_db


I discovered a few things to watch out for along the way, though. Here's my checklist:

  1. Make sure you have enough space on your system disk: I got bitten by this on a test server. The upgrade script needs a good chunk of space (comparable to that already used by the MySQL database?) to perform the upgrade.
  2. There's a MySQL setting you probably need to tweak first: add set-variable=innodb_buffer_pool_size=256M to the [mysqld] section of /etc/mysql.conf and restart MySQL (see the snippet after this checklist). Otherwise you get this cryptic error:

    Thu Jun 18 09:02:30 2009 : Starting to update the DPNS/DPM database.
    Please wait...
    failed to query and/or update the DPM database : DBD::mysql::db do failed: The total number of locks exceeds the lock table size at UpdateDpmDatabase.pm line 19.
    Issuing rollback() for database handle being DESTROY'd without explicit disconnect().


    Also worth noting: if this happens to you, re-running the script (or YAIM) will fail with this error:

    failed to query and/or update the DPM database : DBD::mysql::db do failed: Duplicate column name 'r_uid' at UpdateDpmDatabase.pm line 18.
    Issuing rollback() for database handle being DESTROY'd without explicit disconnect().


    This is because the script has already done this step. You need to edit /opt/lcg/share/DPM/dpm-db-310-to-320/UpdateDpmDatabase.pm and comment out this line:

    $dbh_dpm->do ("ALTER TABLE dpm_get_filereq ADD r_uid INTEGER");


    You should then be able to run the script to completion.
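
For reference, the MySQL tweak from item 2 amounts to the fragment below. The file path is simply where the setting lives on our server, and 256M was enough for our database; a bigger DB may need more. Restart the mysqld service before re-running the upgrade script.

    # [mysqld] section of the MySQL config file (/etc/mysql.conf here)
    [mysqld]
    set-variable=innodb_buffer_pool_size=256M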

Monday, June 8, 2009

STEP '09 discoveries

ATLAS have been giving our site a good thrashing over the past week, which has helped us shake out a number of issues with our setup. Here's some of what we've learned.

Intel 10G cards don't work well with SL4 kernels

We're currently upgrading our networking to 10G and had it mostly in place by the time STEP '09 started. However, we discovered that the stock SL4 kernel (2.6.9) doesn't support the ixgbe 10G driver very well. The problem was hard to spot because transmit performance was reasonable, yet receive was limited to 30 Mbit/s! It's basically an interrupt issue (MSI-X and multi-queue weren't enabled). I compiled a 2.6.18 SL5 kernel for SL4 and that works like a charm (once you've installed the package with rpm --nodeps).
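
A quick way to tell whether you're hitting this is to look at how the card's interrupts are set up; something like the following should do it (eth2 is just an example interface name, not necessarily yours):

    # With MSI-X and multi-queue working you should see several ixgbe
    # vectors in /proc/interrupts, one per Tx/Rx queue; on the stock
    # SL4 kernel we saw a single legacy interrupt line instead.
    grep eth2 /proc/interrupts
    ethtool -i eth2                    # which driver/version is actually loaded
    dmesg | grep -i -e msi -e ixgbe    # whether MSI/MSI-X came up at probe time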

It's worth tuning RFIO

We had loads of ATLAS analysis jobs pulling data from the SE, and they managed to saturate the read performance of our disk array. See this NorthGrid post for solutions.

Fair-shares don't work too well if someone stuffs your queues

We'd set up shares for the various ATLAS sub-groups, but the generic analysis jobs submitted via Ganga were getting far more than their share of the time. Digging deeper with Maui's diagnose -p, I could see that the length of time those jobs had been queued was overriding the fairshare priority. I was able to fix this by increasing the value of FSWEIGHT in Maui's config file.
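
For reference, these are the sort of maui.cfg knobs involved; the values below are illustrative rather than our exact settings, but the idea is to make the fairshare component outweigh the queue-time component of the priority:

    # Illustrative maui.cfg fragment -- not our exact values
    FSPOLICY        DEDICATEDPS   # account fairshare usage in dedicated processor-seconds
    FSDEPTH         7             # number of fairshare windows to consider
    FSINTERVAL      24:00:00      # length of each window
    FSWEIGHT        100           # weight of the fairshare priority component
    QUEUETIMEWEIGHT 1             # weight of the queue-time priority component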

You need to spread VOs over disk servers

We had a nice tidy setup where all the ATLAS filesystems were on one DPM disk server. Of course that server then got hammered ... we're now trying to spread the data across multiple servers.

Thursday, March 5, 2009

Another day, another globus error

After almost 5 years at this lark, I thought I'd got a handle on most of the cryptic Globus errors. However, today ATLAS production jobs started failing with errors like this:

018 (9163559.001.000) 03/05 11:25:27 Globus job submission failed!
  Reason: 22 the job manager failed to create an internal script argument file
Google didn't provide any help, but after asking on the LCG-ROLLOUT list it looked like the problem was the number of files in the relevant user's account. This turned out to be because /opt/lcg/sbin/cleanup-grid-accounts.sh, the script that cleans up the grid pool accounts, hadn't run for some days and there were almost 32000 files under that user's home directory.

So there's yet another vital cog in the grid wheel that can fail fairly silently and cause inexplicable errors! Time to add a Nagios sensor to check that this cron job runs successfully every night ...
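
Something along these lines would do as a first cut for that check; the log path, the thresholds and the location of the pool account homes are assumptions, so adjust to taste:

    #!/bin/sh
    # Rough Nagios-style check (sketch): complain if cleanup-grid-accounts.sh
    # hasn't written to its log in the last 24 hours, or if any pool account
    # home has accumulated an unreasonable number of files.
    LOG=/var/log/cleanup-grid-accounts.log   # assumed log location
    MAXFILES=10000                           # arbitrary threshold

    if [ ! -f "$LOG" ] || [ -n "$(find "$LOG" -mmin +1440)" ]; then
        echo "WARNING: cleanup-grid-accounts.sh hasn't run in over 24 hours"
        exit 1
    fi

    for home in /home/*; do
        count=$(find "$home" 2>/dev/null | wc -l)
        if [ "$count" -gt "$MAXFILES" ]; then
            echo "CRITICAL: $home contains $count files"
            exit 2
        fi
    done

    echo "OK: cleanup ran recently and pool account homes look sane"
    exit 0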

Thursday, January 8, 2009

Home-made Torque monitoring

I've always been frustrated by the tools for finding out what's going on with Torque/Maui. In particular, it's hard to get an overview of the cluster state. So I compiled up pbs_python and wrote a little web CGI application to provide the information I was interested in. It shows information on jobs running on each cluster node: owner, efficiency, memory usage. It colour-codes the details: grey for under-utilisation and red for over-utilisation. Not perfect but useful for me.

It's available at http://grid.ie/distribution/clustermon
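
If you just want a quick look from the command line rather than the CGI page, a crude approximation of the same idea can be scraped straight out of qstat with awk (this bypasses pbs_python entirely, so treat it as a rough sketch):

    # List running jobs with owner, CPU time, walltime and memory so that
    # efficiency (cput/walltime) can be eyeballed per job.
    qstat -f | awk '
        /^Job Id:/                 { id = $3; cput = wall = mem = "-" }
        /Job_Owner/                { owner = $3 }
        /resources_used.cput/      { cput = $3 }
        /resources_used.walltime/  { wall = $3 }
        /resources_used.mem/       { mem = $3 }
        /job_state = R/            { running = 1 }
        /^$/ { if (running) printf "%-25s %-20s cput=%s wall=%s mem=%s\n", id, owner, cput, wall, mem
               running = 0 }'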

P.S. if something better exists out there, I'd be very interested in hearing about it. I've never found anything that does quite what I want.

Thursday, December 4, 2008

Cron Security

After the recent Security Challenge we became aware that any pool user could create at and cron jobs on our cluster: obviously not good for security or scheduling.

Initially we wondered if we'd need to write SELinux policies to restrict this, but it's much simpler than that. Cron and at support simple allow and deny files to control which users can use the commands: /etc/cron.allow lists the users who are allowed, and /etc/cron.deny lists those who are denied. If cron.allow exists, only the users listed in it can use cron and cron.deny is ignored. (For full details, see man crontab.)

In /etc/cron.deny we put:

    ALL

and in /etc/cron.allow we put:

    root
    admina
    adminb
    ...

where admina, adminb and so on are the admin users who should have cron access. /etc/at.deny and /etc/at.allow are configured the same way.
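
For a one-off machine you can set this up by hand with something like the following (admina and adminb are placeholders for your own admin accounts):

    # Manual sketch; on our cluster the files are pushed out by Quattor instead.
    printf 'ALL\n' > /etc/cron.deny
    printf 'root\nadmina\nadminb\n' > /etc/cron.allow    # placeholder admin names
    cp /etc/cron.deny  /etc/at.deny
    cp /etc/cron.allow /etc/at.allow
    chmod 644 /etc/cron.allow /etc/cron.deny /etc/at.allow /etc/at.deny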

On our cluster this is configured through Quattor. For now we're using the filecopy component to install the config files, but allow/deny handling might be a useful extension to the cron component.

Thursday, September 11, 2008

LHC switch-on in Ireland

We had a great day yesterday at Trinity's Science Gallery, where we had a live feed from CERN running all day. There was a lot of press interest, and the grid featured heavily because the grid group here at TCD makes up half of Ireland's LHC involvement (the other half being the particle physics group at UCD, who are in LHCb). We had the GridPP real-time monitor running all day, which provoked a lot of interest and made it onto national TV. One interesting side-effect of all the publicity is that the man on the street now knows that Ireland is one of the few European countries that isn't a member of CERN -- maybe it will cause the politicians to reconsider.

Friday, July 11, 2008

geclipse: a nice grid UI at last?

I've just been playing around with geclipse and I like what I see. It wraps up the fiddly business of VOMS proxies, information system queries and so on, so you don't have to worry about them. Once I'd downloaded the latest milestone release via Eclipse's update manager and set up a VO, I was able to submit a job. The WMS was discovered from the information system. Jobs are described in JSDL, but you fill in the description using dialog boxes, and it can also translate to JDL. There are lots of cool things I haven't even looked at yet, like interfaces to Amazon EC2 and to local batch systems (to view queues and so on), as well as visualisation plugins that allow things like interactive jobs.

This looks like a great interface for grid beginners, especially those who are already familiar with Eclipse. I knew that sooner or later someone would get round to writing some good software for submitting grid jobs!