>>8
I'd go as far as to say that doing that would be quicker than having an array of pointers to double. The extra pointer load sits on the critical path and costs several cycles even on an L1 cache hit (roughly 4 to 5 on current x86 chips), whereas an integer multiply is typically around 3 cycles.
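To make the comparison concrete, here's a rough sketch of the two layouts. The names `Flat`, `at_flat` and `at_ptrs` are made up for illustration, and actual timings depend on the chip and on what the compiler hoists out of your loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flat matrix: one contiguous block,
// element (r, c) lives at index r * cols + c.
struct Flat {
    std::size_t rows, cols;
    std::vector<double> data;
    Flat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
};

// Flat layout: a multiply and an add compute the index, then one load.
// cols and the data pointer are loop-invariant, so inside a loop they
// normally stay in registers.
inline double at_flat(const Flat& m, std::size_t r, std::size_t c) {
    return m.data[r * m.cols + c];
}

// Array of row pointers: load the row pointer, then load the element.
// The second load cannot issue until the first one has returned.
inline double at_ptrs(const double* const* rows, std::size_t r, std::size_t c) {
    return rows[r][c];
}
```

The point is the dependency chain: in `at_ptrs` the address of the element isn't known until the row pointer arrives, while in `at_flat` the index arithmetic needs no memory access at all.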
Of course it's quicker still if you can use constant dimensions, e.g. by sticking them in the template parameters. When the multiplier is a compile-time constant, there's a bunch of ways to get small static muls with lower execution latency than a multiply instruction. Powers of two are the canonical example, since the multiply turns into a shift.
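A minimal sketch of the template-parameter version, assuming a hypothetical `StaticMatrix` type (not from any library). `index_pow2` just spells out by hand what a compiler would typically do for a column count of 8.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size matrix: Rows and Cols are compile-time constants,
// so the multiplier in r * Cols is known to the compiler.
template <std::size_t Rows, std::size_t Cols>
struct StaticMatrix {
    std::array<double, Rows * Cols> data{};

    double& at(std::size_t r, std::size_t c) {
        return data[r * Cols + c];
    }
    const double& at(std::size_t r, std::size_t c) const {
        return data[r * Cols + c];
    }
};

// Hand-written equivalent of r * 8 + c: the multiply becomes a shift by 3.
constexpr std::size_t index_pow2(std::size_t r, std::size_t c) {
    return (r << 3) + c;
}
static_assert(index_pow2(5, 2) == 5 * 8 + 2, "shift form matches the multiply");
```

With something like `StaticMatrix<4, 8>` you'd expect an optimizing compiler to emit the shift on its own. Small non-power-of-two constants such as 3 or 5 usually get a single `lea` on x86 instead of an `imul`, but check the generated assembly for your target rather than taking that on faith.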