Monday, June 8, 2009

STEP '09 discoveries

ATLAS have been giving our site a good thrashing over the past week, which has helped us shake out a number of issues with our setup. Here's some of what we've learned.

Intel 10G cards don't work well with SL4 kernels

We're currently upgrading our networking to 10G and had it mostly in place by the time STEP '09 started. However, we discovered that the stock SL4 kernel (2.6.9) has poor support for the ixgbe 10G driver. It was hard to spot because transmit performance was reasonable, but receive was limited to 30 Mbit/s! The root cause is interrupt handling: under 2.6.9, MSI-X and multi-queue weren't enabled for the card. I rebuilt the SL5 2.6.18 kernel for SL4 and that works like a charm (once you've installed it with rpm --nodeps).
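If you want to verify that the card is actually getting MSI-X multi-queue interrupts, the quickest check is /proc/interrupts. A sketch of the checks we found useful; the interface name and kernel package name are just examples, substitute your own:

    # Which kernel and driver are in play? (eth2 is an example name)
    uname -r
    ethtool -i eth2

    # With MSI-X and multi-queue working you should see several per-queue
    # vectors for the interface (e.g. eth2-TxRx-0, eth2-TxRx-1, ...);
    # a single shared IRQ line means you're stuck in legacy interrupt mode.
    grep eth2 /proc/interrupts

    # The rebuilt SL5 kernel won't install cleanly on SL4 without --nodeps
    rpm -ivh --nodeps kernel-2.6.18-<build>.x86_64.rpm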

It's worth tuning RFIO

We had loads of ATLAS analysis jobs pulling data from the SE over RFIO, and they managed to saturate the read performance of our disk array. See this NorthGrid post for solutions.
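For reference, the knobs in question are the rfiod read-ahead buffers on the DPM disk servers and the client-side read mode. A rough sketch, assuming the usual shift.conf keys and the RFIO_READOPT client variable; the values here are purely illustrative, not recommendations:

    # /etc/shift.conf on each DPM disk server
    RFIO DAEMONV3_RDSIZE 1048576
    RFIO DAEMONV3_WRSIZE 1048576

    # In the client/job environment: switch on buffered (read-ahead) mode
    export RFIO_READOPT=1

(See Sam's comment below before picking a buffer size.)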

Fair-shares don't work too well if someone stuffs your queues

We'd set up shares for the various ATLAS sub-groups, but the generic analysis jobs submitted via Ganga were getting far more than their share of time. Digging deeper with Maui's diagnose -p, I could see that the queue-time component of the job priority was swamping the fairshare component. I fixed this by increasing the value of FSWEIGHT in Maui's config file.
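For anyone hitting the same thing, the relevant knobs live in maui.cfg. A sketch of the sort of setup involved; the numbers are illustrative rather than our production values:

    # maui.cfg -- fairshare setup (illustrative values)
    FSPOLICY        DEDICATEDPS
    FSDEPTH         7
    FSINTERVAL      24:00:00
    FSDECAY         0.80

    # Raise FSWEIGHT relative to QUEUETIMEWEIGHT until 'diagnose -p'
    # shows the fairshare component dominating for a stuffed queue
    FSWEIGHT        100
    QUEUETIMEWEIGHT 1

After changing the weights, diagnose -p shows the per-job priority breakdown, so you can confirm that the fairshare column is now the one that matters.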

You need to spread VOs over disk servers

We had a nice tidy setup where all the ATLAS filesystems were on one DPM disk server. Of course that server then got hammered ... we're now spreading the data across multiple servers.
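The DPM admin tools make the reshuffle reasonably painless. A sketch, with the pool, server and filesystem names as placeholders:

    # Show the current pool / filesystem layout
    dpm-qryconf

    # Add a filesystem on a second disk server to the pool
    dpm-addfs --poolname atlaspool --server disk02.example.ac.uk --fs /storage01

    # Drain an overloaded filesystem so its files are replicated elsewhere
    dpm-drain --server disk01.example.ac.uk --fs /storage01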

1 comment:

Sam Skipsey said...

A note on the RFIO tuning suggestion, mainly for those reading this post for ideas.
The "large value" method of tuning RFIO read buffers that worked so well for small AOD files now seems to be extremely counterproductive for the larger merged-AODs (for small AODs, it basically resulted in the entire AOD being buffered in RAM, for large AODs, this doesn't happen, and you end up reading masses more data than you should be for random io).
It is possible that buffer sizes around 16 to 100 kb would work well - dCap is optimised at about 32kb, so...