Thursday, January 8, 2009

Home-made Torque monitoring

I've always been frustrated by the tools for finding out what's going on with Torque/Maui. In particular, it's hard to get an overview of the cluster state. So I compiled pbs_python and wrote a small CGI web application to provide the information I was interested in. For the jobs running on each cluster node it shows the owner, efficiency and memory usage. It colour-codes the details: grey for under-utilisation and red for over-utilisation. It's not perfect, but it's useful for me.
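
To give a rough idea of the colour-coding, here is a minimal sketch in Python. It is not the clustermon code itself: the function names, the thresholds and the definition of efficiency (CPU time divided by wall time times allocated cores) are assumptions made for illustration. It also assumes the times arrive as HH:MM:SS strings, which is how Torque reports resources_used.cput and resources_used.walltime.

```python
# Hypothetical thresholds; the real tool may use different cut-offs.
LOW_EFFICIENCY = 0.5   # below this a job counts as under-utilising its cores
HIGH_EFFICIENCY = 1.1  # above this a job is using more CPU than it was given


def hms_to_seconds(value):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def efficiency(cput, walltime, ncpus=1):
    """CPU time used divided by the CPU time available to the job."""
    available = hms_to_seconds(walltime) * ncpus
    if available == 0:
        return 0.0  # job has only just started; avoid dividing by zero
    return hms_to_seconds(cput) / available


def colour_for(eff):
    """Pick a display colour for a job from its efficiency."""
    if eff < LOW_EFFICIENCY:
        return "grey"
    if eff > HIGH_EFFICIENCY:
        return "red"
    return "default"


if __name__ == "__main__":
    # A job that used 12 minutes of CPU in an hour on one core.
    print(colour_for(efficiency("00:12:00", "01:00:00")))
```

The same pattern would apply to memory usage, comparing what a job has used against what it requested.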

It's available at http://grid.ie/distribution/clustermon

P.S. If something better exists out there, I'd be very interested in hearing about it. I've never found anything that does quite what I want.

4 comments:

Stephen Childs said...

Version 0.3 is now available. It adds more information via tooltips, colour-coding of nodes based on their state, and summaries of node and queue information.

Ewan said...

Looks nice; I see the download directory is up to version 0.4 now, too.

I haven't actually got this up and running yet, but a couple of things have struck me so far looking at it:
- Firstly, I'd rather like to get this packaged up to make it as easily deployable as possible. Would it be possible to get suitable licence statements tacked onto the top of each of the source files?

- Secondly, the README suggests that this needs to run on the PBS head node, but the pbs_python code seems to indicate that you can pass the PBSQuery call a remote server name to connect to. If we can get a config option for that, it would allow running clustermon on another machine with only the torque-client package installed.

Stephen Childs said...

Thanks for your comment -- I'm glad someone is interested! Fair comment about packaging, licenses etc., I will get on to it next week when I'm back in Dublin.

The idea of running on a separate server is definitely a good one. I will add this as an option.

I will also look at getting a sourceforge project set up so others can contribute to the code.

SP said...

I wanted to try this on my Solaris machine. I could not use it because the pbs_python packages are built for x86_64 architecture. It would be helpful if you also allow source downloads.