Friday, February 5, 2016

Life at the End of Moore's Road

The earliest reference to the end of Moore's Law I can remember was in a 1995 book from Stanford University Press. I can't remember the title, but the technologies we're hearing about now to keep it going, e.g. GaAs, photonics, etc., were mentioned.

It was assumed, if memory serves, that these technologies would be needed well before 14nm. And yet, here we are. 14nm on plain ol' Silicon. The hard limits are near and there are only a handful of tick-tock-tock cycles left before we reach them.

TSMC says they will have 7nm in 2017. Given the slowdown in fab shrinks and their difficulty even with 16nm, they are probably a few years off. But if the rumors are true that they've spent $16 billion on a new fab, it'll probably happen.

$16 billion is a ton, and Intel's 5nm fab is likely to cost quite a bit more. That's a huge investment, even for Intel. And for what? Most of Intel's efforts lately have focused on improving efficiency, but, like general computing power, the battery life of portables is already sufficient for most users.

Going beyond 5nm? Well, that's like Augustine's Law XVI: Intel would basically have to go all in on a single fab. Enormous risk, meager reward.

At some point, the industry is going to essentially standardize, maybe around 5nm. Who knows. What I do know is that predictions of node shrinkage end times are usually wrong. But if I'm wrong, it'll be by a handful of nanometers at most.

So that will be interesting, since Intel won't enjoy the huge process size/fab advantage it's had and will have to concentrate more on chip design. But, as it stands, there aren't a whole lot of IPC or clock speed improvements left to wring out. Or are there?

First, let's look at clock speed. The delta between stock clocks and overclocks has shrunk; e.g., Intel's 4790K hits 4.4GHz on a single core and 4GHz on all four cores stock. A decent overclock on air is about 4.7GHz on all cores, which is roughly an 18% increase. For whatever reason, around 4.6GHz seems to have been the limit for overclocking since Sandy Bridge. Even the Core 2 Quad reached around 4.3GHz on air.

A good air cooler is actually pretty sophisticated these days: finely machined fins paired with specialized fans, and copper heatpipes with a sponge-like sintered metal wick and working fluid inside. They are large and fairly heavy, a far cry from the simple heatsink and fan that the original Pentium required. And what an outcry there was when that was introduced!

Oddly, watercooling these days with all-in-one setups is far more elegant, though noisier, than the giant air coolers used by overclockers. Unfortunately, the frequency gains from watercooling are marginal. For large gains, cascade cooling is required.*

Cascade cooling is basically a series of refrigeration units chained together. It's ungainly for sure, but the 5.5GHz clocks that Hwbot typically reports are a substantial 38% increase. The heatpipe tower configuration has essentially reached maturity, and even if Intel can scale clock speed at typical temperatures, the TDP will be beyond the reach of even the largest air coolers; that's triple and quad+ radiator territory. And if clocks can't scale at typical temperatures, then it's cascade, and a chill box is probably the only acceptable form factor for that.
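
Both of those percentages are just the overclock divided by the 4790K's 4GHz all-core stock clock: 4.7/4.0 works out to 17.5%, and 5.5/4.0 to 37.5%. A trivial C check, if you want the arithmetic spelled out:

```c
/* Back-of-the-envelope check of the overclocking headroom figures above,
 * using the 4790K's 4.0GHz all-core stock clock (from the post) as baseline. */
#include <stdio.h>

int main(void) {
    const double stock = 4.0;                  /* all-core stock, GHz */
    const double air_oc = 4.7, cascade = 5.5;  /* decent air OC, typical cascade result */
    printf("air:     +%.1f%%\n", (air_oc / stock - 1.0) * 100.0);   /* 17.5, i.e. ~18% */
    printf("cascade: +%.1f%%\n", (cascade / stock - 1.0) * 100.0);  /* 37.5, i.e. ~38% */
    return 0;
}
```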

The other factor in CPU hardware performance is IPC (instructions per clock). It's hard to measure because programs use CPU resources differently: transcoding benefits from the AVX/AVX2 units, secure storage benefits from cryptographic accelerators, and 3D games work best with graphics hardware. But standard programs don't use any of that; they are mainly integer workloads.

So how to improve standard integer IPC?

One tried and true method is adding caches and making them faster and larger. The only really interesting innovation in desktop CPUs since Nehalem is the eDRAM-equipped Skylake parts, which offer up to 128MB of L4 cache. In many applications, this provides a very large speedup.
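
If you want to see how much cache size and latency matter on your own machine, here's a rough pointer-chasing sketch in C (assuming Linux and gcc; an illustration, not a rigorous benchmark). The time per access jumps each time the working set spills out of a cache level, and an eDRAM L4 adds one more plateau before you fall off into DRAM.

```c
/* Rough cache-latency illustration: chase pointers through working sets of
 * increasing size and report average time per access.
 * Build with: gcc -O2 chase.c -o chase */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(size_t n_elems, size_t iters) {
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) { perror("malloc"); exit(1); }

    /* Build a single random cycle (Sattolo's algorithm) so the hardware
     * prefetcher can't predict the next address. */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = next[p];                  /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = p;         /* keep the loop from being optimized out */
    (void)sink;
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void) {
    /* Working sets from 32KB (fits in L1) up to 256MB (bigger than any L4). */
    for (size_t kb = 32; kb <= 256 * 1024; kb *= 2)
        printf("%8zu KB: %6.2f ns/access\n",
               kb, ns_per_access(kb * 1024 / sizeof(size_t), 20 * 1000 * 1000));
    return 0;
}
```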

Geometrically, chip stacking theoretically makes lower latency and larger caches possible, along with more registers, more execution units, more prefetch and branch prediction hardware, etc., allowing for greater-than-linear speed increases. Near-geometric increases, in theory.

I'm not sure why node shrinks (Sandy to Ivy, Haswell to Broadwell) haven't reduced cache latencies; from what I understand, thinner connections are slower connections despite the reduced distances.

Of course, there are non-standard IPC gains to be had with fixed-function hardware. And with Intel's acquisition of Altera, perhaps specialized workloads can be programmed into an FPGA co-processor, sort of the way GPUs are used in specialized compute applications, except FPGAs are far more versatile. A large downside is that VHDL is a pretty intimidating language (at least it was for me), but given that the other path to greater performance is parallelizing, it's probably a wash.
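
For a concrete flavor of the kind of workload that maps well onto FPGA fabric or fixed-function hardware, think of bit-level streaming logic like a CRC. Below is a plain-C reference model of CRC-32, the sort of "golden model" you'd typically write before wrestling the same logic into VHDL; the FPGA framing here is my own illustration, not anything Intel or Altera has announced.

```c
/* Plain-C "golden model" of CRC-32 (IEEE polynomial, reflected form). In an
 * FPGA flow you'd usually write something like this first, then re-express
 * it in VHDL/Verilog and verify the hardware against it. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32_ref(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* One XOR-and-shift per bit: a handful of gates in hardware,
             * but a slow serial loop in software. */
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return ~crc;
}

int main(void) {
    const char *msg = "123456789";
    /* The standard CRC-32 check value for "123456789" is 0xCBF43926. */
    printf("crc32(\"%s\") = 0x%08X\n", msg,
           (unsigned)crc32_ref((const uint8_t *)msg, strlen(msg)));
    return 0;
}
```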

Looking at the software side, there's a lot of room for optimization in both applications and compilers. For instance, Dhrystone scores per cycle from unoptimized builds actually went down after Nehalem. With an optimizing compiler, however, post-Nehalem architectures show better IPC.
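
A crude way to see the compiler side of this for yourself (not Dhrystone itself, just a plain integer loop in the same spirit): build the snippet below with gcc -O0 and again with gcc -O2, then run both on the same machine. Same hardware, very different throughput.

```c
/* Crude illustration of how much the compiler matters for integer throughput.
 * Build twice and compare wall times on the same machine:
 *   gcc -O0 -o intloop intloop.c   (unoptimized)
 *   gcc -O2 -o intloop intloop.c   (optimized)
 * The loop count can come from argv so the compiler can't fold the whole
 * computation away at build time. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    long iters = (argc > 1) ? atol(argv[1]) : 200 * 1000 * 1000;
    long checksum = 0;
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        checksum += (i * 7 + 3) % 10;    /* plain integer work, no SIMD, no FP */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("checksum=%ld in %.2f s\n", checksum, secs);
    return 0;
}
```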

And I think that the more stable the hardware, the greater the incentive to optimize the software side. When upgrade cycles are long, it's fast, efficient code that wins out.

Then there are the exotics:

Graphene, high-temperature superconductors, quantum computing, photonics, etc. I don't know what the lead time is between a university press release and actual production, but it's very long. I'm not sure if there are mechanisms that systemically slow industry adoption of academic "breakthroughs," but eliminating those could help get us back on Moore's road.

* If power were nearly free, generating liquid nitrogen at home could even be an option.
