Good points! We do still use 6 + 3 = 9 columns, but I oversimplified our Rescue implementation a bit in the writeup. Some more specifics:
- We use separate gates for the two steps of Rescue: `RescueStepAGate` and `RescueStepBGate`.
- Our overall model uses 6 constant (aka selector) polynomials. Since the Rescue gates have a depth of 2 in the tree, 2 of those 6 are used for filtering.
 - That leaves 4 constants available to each Rescue gate, so we use a sponge width of 4.
 - We use \alpha = 5, so those constraints are degree 7 after incorporating the gate type filter. This is fine for us; we just want to keep constraints within degree 8 so that our largest FFTs are degree 8n.
 - `RescueStepAGate` uses 4 routed input wires, and 4 advice wires for the purported x^{1/\alpha} values. It doesn’t have explicit output wires; instead it constrains the input wires of the `RescueStepBGate` that follows it.
 - `RescueStepBGate` also uses 4 input wires, and similarly treats the wires of the following gate as its “outputs”. We add a no-op gate after the last `RescueStepBGate` to receive the final outputs and make them routable. (Now that I write this, I think we can do without that final gate by giving `RescueStepBGate` non-routed inputs and routed outputs.)
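
To make the advice-wire idea concrete, here is a minimal Python sketch of the inverse S-box check only, not the real gate code. The prime `P` is a toy field chosen so that x -> x^5 is a bijection, the function names are hypothetical, and the MDS layer and round constants are left out. The `FILTER_DEGREE = 2` accounting (one degree per filtering constant) is my reading of "degree 7 after incorporating the gate type filter", so treat it as an assumption.

```python
# Toy parameters: P is an illustrative prime, not the field of the actual system.
P = 2**31 - 1                      # Mersenne prime; gcd(5, P - 1) == 1
ALPHA = 5
ALPHA_INV = pow(ALPHA, -1, P - 1)  # e with ALPHA * e == 1 mod (P - 1); Python 3.8+
WIDTH = 4
FILTER_DEGREE = 2                  # assumption: each filtering constant adds one degree


def step_a_advice(inputs):
    """Prover side: compute the purported x^(1/alpha) values outside the circuit."""
    return [pow(x, ALPHA_INV, P) for x in inputs]


def step_a_residuals(inputs, advice):
    """Constraint side: check y^alpha - x == 0 instead of computing a root."""
    return [(pow(y, ALPHA, P) - x) % P for x, y in zip(inputs, advice)]


def constraint_degree():
    """Degree of y^alpha - x once multiplied by the gate-type filter."""
    return ALPHA + FILTER_DEGREE


inputs = [3, 1234567, 42, P - 1]   # WIDTH = 4 input wires
advice = step_a_advice(inputs)
print(step_a_residuals(inputs, advice))  # [0, 0, 0, 0] for honest advice
print(constraint_degree())               # 7, within the degree-8 budget
```

The point of the advice wires is that the constraint only ever raises to the power \alpha, so the high-degree inverse exponent never shows up in the constraint system.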
With \lambda=128 and our permutation width of 4, we have 16 recommended rounds per permutation, or 33 gates. So for a k-to-1 hash we end up using 33 \lceil k/3 \rceil gates.
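
The gate count can be sketched as below. The breakdown of 33 as 16 rounds times 2 step gates plus the 1 trailing no-op gate, and the rate of 3 (width 4 with one capacity element), are inferred from the numbers above rather than stated explicitly, and the function names are hypothetical.

```python
ROUNDS = 16           # recommended rounds for lambda = 128 at width 4
STEPS_PER_ROUND = 2   # one RescueStepAGate and one RescueStepBGate
FINAL_NOOP_GATES = 1  # receives the final outputs and makes them routable
RATE = 3              # assumption: width 4 minus one capacity element


def gates_per_permutation():
    """Gates used by a single Rescue permutation."""
    return ROUNDS * STEPS_PER_ROUND + FINAL_NOOP_GATES


def gates_for_hash(k):
    """Gates for a k-to-1 hash: one permutation per RATE absorbed inputs."""
    permutations = -(-k // RATE)  # integer ceil(k / RATE)
    return gates_per_permutation() * permutations


print(gates_per_permutation())  # 33
print(gates_for_hash(2))        # 33
print(gates_for_hash(4))        # 66
```

If the final no-op gate can be dropped as suggested above, `FINAL_NOOP_GATES` would become 0 in this sketch and the per-permutation count would fall to 32.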
Hope that makes sense? There are a ton of possible variants depending on how many constants are available etc., but this seems to fit pretty well with our particular circuit model.