A potentially more elegant way to achieve moderate multiplicative depth is to use the existing z\omega evaluation point to combine two rows into one constraint, and to permit a much larger arity. This makes it possible to evaluate a multiplexing function of up to 128 ways.
If one does not want to increase the proof size and verifier cost through splitting t(X), this comes at the cost of a larger SRS and more prover exponentiations. So there is a tradeoff between proof size and prover cost.
My favourite solution is to combine wires with evaluation points, which gives a pleasing square-root argument.
With 4 wire-commitment elements a, b, c, d and 4 evaluation points z, z\omega, z\omega^2, z\omega^3, which is just 2 more \mathbb{G}_1 elements than 4-wire PLONK currently uses, one can make use of 16 variables in one gate.
With each further 2 elements one gets 25, 36, 49, 64 and so on: gate arity scales as n^2 for 2n elements. Note that increasing arity in this way does not increase prover exponentiations at all.
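The counting above can be tabulated with a short sketch. It assumes one element per wire commitment and one per evaluation point, which is one reading of the n^2 for 2n accounting; the function names `gate_arity` and `element_count` are illustrative and not from any PLONK implementation.

```python
def gate_arity(n: int) -> int:
    """Variables usable in one gate: n wire polynomials, each read at n evaluation points."""
    return n * n


def element_count(n: int) -> int:
    """Assumed accounting: n wire commitments plus n evaluation points."""
    return 2 * n


# Rows are (n, element_count, gate_arity) for n = 4..8.
table = [(n, element_count(n), gate_arity(n)) for n in range(4, 9)]

for n, elements, arity in table:
    print(f"n={n}: {elements} elements, {arity} variables per gate")
```

Each step from n to n+1 adds 2 elements and raises the arity from n^2 to (n+1)^2.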
Taking the size of t(X) into account as well, one gets a factor \sqrt{n} scaling. For instance, against the base case of a fast prover, for an SRS of size 4x and prover exponentiations of 4a vs. a, and the same cost of 4 \mathbb{G}_1 elements, one can get a maximum gate degree of 16 vs. 4. This is relatively little.
TODO: Take \mathbb{F} elements into account in the proof size as well.
A first application to test the high-arity idea is the bilinear form over a field (\mathbb{F} or \mathbb{F}_2). To my knowledge, R1CS is k times less efficient for a k-dimensional bilinear form.
\sum_{i,j=1}^k a_{i,j}x_iy_j, where x, y are witness values.
R1CS: \sum_{i=1}^k x_i \left( \sum_{j=1}^k a_{i,j} y_j \right), which takes k+1 constraints.
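As a sanity check on the R1CS decomposition, here is a minimal sketch that evaluates the bilinear form both directly and row by row. It assumes the k+1 count means k product constraints x_i * (\sum_j a_{i,j} y_j) plus one constraint summing the products, which is my reading of the count above. The prime `P = 101` is a toy stand-in for \mathbb{F}, and all function names are hypothetical.

```python
import random

P = 101  # toy prime standing in for the field F (assumption, illustration only)


def bilinear_direct(a, x, y):
    """Evaluate sum_{i,j} a[i][j] * x[i] * y[j] mod P as a double sum."""
    k = len(x)
    return sum(a[i][j] * x[i] * y[j] for i in range(k) for j in range(k)) % P


def bilinear_r1cs(a, x, y):
    """Evaluate the same form row by row and count constraints.

    Each row is one product constraint: x[i] * (linear combination of y) = p_i.
    One further constraint (assumed) sums the p_i into the output.
    """
    k = len(x)
    constraints = 0
    products = []
    for i in range(k):
        linear = sum(a[i][j] * y[j] for j in range(k)) % P  # free in R1CS
        products.append(x[i] * linear % P)
        constraints += 1
    result = sum(products) % P
    constraints += 1  # final summation constraint
    return result, constraints


k = 4
rng = random.Random(0)
a = [[rng.randrange(P) for _ in range(k)] for _ in range(k)]
x = [rng.randrange(P) for _ in range(k)]
y = [rng.randrange(P) for _ in range(k)]

value, count = bilinear_r1cs(a, x, y)
print(value == bilinear_direct(a, x, y), count)
```

The linear combination of the y_j is free inside a single R1CS constraint, so only the k multiplications by x_i and the final sum are counted.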