Tuesday, February 12, 2008

Can Quattor save the world?

Due to the wonders of planet, I've just seen this post by Andrew from Glasgow with the intriguing comment: "Are there any better tools? (is Quattor the saviour for this type of problem)". His post was prompted by the frustrations of cobbling together fabric management from a collection of very good, but separate, tools. So I thought I'd briefly describe some of the advantages of Quattor. I know many were burned in the early days of Quattor by its complexity and obscurity, but times really have changed and I suggest you revisit it. So here are just a few of the reasons I like it:


  • It's got a real programming language: this gives you data structures (e.g. hashes) and types, which allow validation of the values you type in (i.e. it will recognise that "123.34.32.O7" (spot the deliberate mistake) isn't an IP address).
  • The gLite configuration is up to date with YAIM (and often ahead of it): Michel Jouvin has led the way on a number of deployment issues in LCG (e.g. 64-bit WNs, space tokens, etc.) and all this stuff gets into Quattor before YAIM. (Also DNS-style VO names, Xen configuration, etc.) We have found that whenever we have to do something non-standard (e.g. publishing multiple different jobmanagers from one CE in GIP) it's a doddle in Quattor due to the availability of proper data structures (see above).
  • The Quattor Working Group templates are effectively a complete Grid distribution in a way that gLite itself isn't. What I mean is that they provide all you need for going from bare metal to installation of a complete SL-based Grid site. This is ideal for new/small sites.
  • It's a true community effort: having been involved in YAIM development for MPI, I have first-hand knowledge of the protracted process involved in getting anything fixed in gLite. In contrast, Quattor functions as a true OSS project: if there's a problem, you fix it and check it in. If it passes muster after a lightweight review, it's included in the core release. Problem solved.
  • It provides integration with installation and monitoring: the contents of the configuration profile for a machine are used directly to generate Kickstart templates, and monitoring (using Lemon) is also tightly integrated, with a raft of sensors and alerts available.
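
To make the first point concrete, here is a rough sketch of the kind of check that typed configuration buys you. It is written in Python rather than Quattor's own language, and the host table and function names are made up for illustration; the idea is simply that a mistyped address is rejected before it ever reaches a machine.

```python
import ipaddress


def is_ipv4(value):
    """Return True if value parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def invalid_entries(hosts):
    """Return the sorted hostnames whose address fails validation."""
    return sorted(name for name, addr in hosts.items() if not is_ipv4(addr))


# Hypothetical host table: a hash of hostname -> address.
hosts = {
    "ce01": "123.34.32.7",
    "wn01": "123.34.32.O7",  # letter O instead of a zero
}

print(invalid_entries(hosts))  # ['wn01']
```

In Quattor the equivalent check happens when the configuration is compiled, so the typo never makes it into a machine profile.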

Monday, February 4, 2008

Play it again, SAM

After much pain, we have finally got a SAM server up and running for Grid-Ireland (see here). We used to run an SFT server, but it was ancient, and when the client software eventually became incompatible with the UI distribution, we decided to move to SAM. It looked like there were quite good installation docs available, so we assigned it to someone as a Friday afternoon project. That was two months ago! It turned out that the documentation, while good, had a few critical errors and omissions in it, and the support was non-existent. We've finally got it sorted now (the last problem was solved when I divined, by reading the source code, that you had to define an ACL of approved DNs in the config file) and it looks like it should be useful in keeping track of our non-EGEE sites. We'll try and feed our experience back upstream, or (probably more useful) stick it on a public page so it makes it into Google.

Friday, February 1, 2008

Stepping through the P-GRADE portal

As a Grid veteran, I normally submit jobs using edg-job-*, and at this stage I've almost given up hope that there could be a less painful way of getting jobs onto the Grid. I've tried Ganga in the past, and it was promising, but it didn't work well with the broken MPI on the EGEE grid, so I kind of gave up on it. The latest thing we've installed is the P-GRADE portal, which has been around for a good while and is allegedly getting "mature" now. The first problem after creating an account was getting my cert set up for use in the portal. I had the cert and key on my local machine, and tried to upload them to a MyProxy server to get something the portal could use. At this point, I was asked for the hostname and port number of the MyProxy service. Now, I actually administer the MyProxy server, and still had to ask a colleague which port it ran on. There is no way in the world a user should have to know this, but apparently you can't set defaults in the portal. We're still running version 2.5, so maybe it's fixed in 2.6.

Once I got my cert up and running, I went to submit a job. My first job, the challenging "/bin/hostname" test, failed. I didn't expect that. Apparently P-GRADE uploads a binary from your local machine by default rather than executing something hosted on the remote machine. As my local machine is FC6 and the execute node is SL3, the uploaded hostname binary wouldn't run. So if you want to run the hostname program on the remote host, you have to upload a script which runs /bin/hostname.
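
For what it's worth, a wrapper of that sort can be tiny. The sketch below is in Python and assumes a suitable interpreter is available on the execute node (a two-line shell script would do the same job); the function name is made up. The point is that the uploaded file is just text, and the binary it calls is the one already installed on the remote machine.

```python
#!/usr/bin/env python3
import subprocess
import sys


def run_remote_binary(command):
    """Run a command that is already installed on the execute node and
    return its standard output with surrounding whitespace removed."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


def main():
    # /bin/hostname here is the execute node's copy, not one uploaded
    # from the submitting machine.
    try:
        print(run_remote_binary(["/bin/hostname"]))
    except (OSError, subprocess.CalledProcessError) as err:
        print("could not run /bin/hostname: %s" % err, file=sys.stderr)


if __name__ == "__main__":
    main()
```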

The next challenge was how to add input files to the job. It turns out that this is done by adding "ports" to the job node you define. Everything in P-GRADE is a workflow, so files are ports that allow data to flow between nodes (or from the local machine). It takes a little while to get used to this approach.
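
One way to picture the ports idea is as a small data model. The Python below is purely illustrative and is not how the portal is implemented; every class, field and file name in it is invented. Each job node has input ports, and a port is fed either by a file on the local machine or by the output of another node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InputPort:
    filename: str                     # name the job sees on the execute node
    local_path: Optional[str] = None  # set when the file comes from the local machine
    from_node: Optional[str] = None   # set when another node produces the file


@dataclass
class JobNode:
    name: str
    inputs: List[InputPort] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def files_to_upload(nodes):
    """List the local files that must be shipped before the workflow starts."""
    return [
        port.local_path
        for node in nodes
        for port in node.inputs
        if port.local_path is not None
    ]


# Hypothetical two-node workflow: "prepare" reads a local file and
# "analyse" consumes what "prepare" writes.
prepare = JobNode(
    "prepare",
    inputs=[InputPort("params.txt", local_path="/home/me/params.txt")],
    outputs=["data.bin"],
)
analyse = JobNode(
    "analyse",
    inputs=[InputPort("data.bin", from_node="prepare")],
    outputs=["result.txt"],
)

print(files_to_upload([prepare, analyse]))  # ['/home/me/params.txt']
```

Seen this way, even a single job with one input file is a one-node workflow with one port, which is why there is no separate "attach a file" option.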