Thursday, July 28, 2016

IPC redux

Anandtech brings us an article today celebrating the 10th anniversary of Intel's Core 2 Duo, a processor line that brought a return to Intel's technological dominance over AMD. I've already looked at processor growth over the years so it's not really new to me. It's a good article though and Ian's articles are some of the best (although Anandtech without Anand Lal Shimpi is like Apple without Steve Jobs).

The dataset behind his graphs makes it look like there has been a real and steady progression in performance when there really hasn't. He's very much aware of the IPC stagnation of the past decade, so it's mainly just a confusing choice of samples.

"Here's this new processor and here's this 10 year old one and look at all these others that fall in between"

But I figured I'd revisit the topic to clarify my own thinking.

Here's a table that I believe expresses the state of CPU advancement more accurately. The data reflect how processors perform on an unoptimized version of one of the oldest benchmarks around, Dhrystone. The Dhrystone program is tiny, about 16KB, which largely eliminates the confounding effects of large caches and other subsystems; it's meant to measure pure CPU integer performance, which underpins most programs.
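(If you want to turn a raw Dhrystone result into a per-clock figure yourself, the usual convention is to divide Dhrystones per second by 1757 - the VAX 11/780's score - to get DMIPS, then divide by the clock in MHz. A quick sketch, with a made-up input number rather than Roy's actual raw data:)

```python
# Rough per-clock figure from a raw Dhrystone result.
# Convention: DMIPS = Dhrystones/sec divided by 1757 (the VAX 11/780's score),
# and DMIPS per MHz serves as a relative instructions-per-clock measure.

VAX_11_780_DHRYSTONES_PER_SEC = 1757

def relative_ipc(dhrystones_per_sec, clock_mhz):
    """DMIPS per MHz: a rough, relative instructions-per-clock figure."""
    dmips = dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC
    return dmips / clock_mhz

# Hypothetical example: a chip scoring 2.2 million Dhrystones/s at 2400 MHz
print(round(relative_ipc(2200000, 2400), 2))  # ~0.52, Core 2 Duo territory
```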

Maybe I've posted this table before. I can't be bothered to check and halt this stream of consciousness. Dhrystone data were gathered from Roy Longbottom's site.

Year | CPU         | IPC D2.1 NoOpt | IPC change | Yearly | MHz  | OC clock change | HWBOT OC (MHz) | IPC * clock change (OC) | Yearly
1989 | AMD 386     | 0.11           | -          | -      | 40   | -               | 40             | -                       | -
1991 | i486        | 0.19           | 66%        | 33%    | 66   | 180%            | 112            | 365%                    | 182%
1994 | Pentium     | 0.25           | 34%        | 11%    | 75   | 108%            | 233            | 179%                    | 60%
1997 | Pentium II  | 0.45           | 80%        | 27%    | 300  | 124%            | 523            | 304%                    | 101%
2000 | Pentium III | 0.47           | 3%         | 1%     | 1000 | 206%            | 1600           | 214%                    | 71%
2002 | Pentium 4   | 0.14           | -70%       | -35%   | 3066 | 166%            | 4250           | -19%                    | -10%
2006 | Core 2 Duo  | 0.52           | 268%       | 67%    | 2400 | -9%             | 3852           | 234%                    | 58%
2009 | Core i7 930 | 0.54           | 4%         | 1%     | 3066 | 9%              | 4213           | 14%                     | 5%
2012 | 3930K       | 0.51           | -5%        | -2%    | 4730 | 11%             | 4680           | 5%                      | 2%
2013 | 4820K       | 0.52           | 0%         | 0%     | 3900 | -1%             | 4615           | -1%                     | -1%

"IPC D2.1 NoOpt" is a relative measure of how many instructions per cycle each processor is able to complete. Although there are optimized versions of the Dhrystone benchmark for specific processors, it's important to avoid those in a comparison looking purely at CPU performance. Optimizing for benchmarks used to be common practice and definitely gives a cheating kind of vibe.

Looking at the IPC, there were enormous gains in the eighties and nineties where each new generation could do significantly more work per cycle. A 286 to a 386 to a 486 were all clear upgrades even if the clock speed was the same (all had 25MHz versions). This pattern was broken with the Pentium III, which was basically Intel just taking advantage of the public's expectation that it would outclass the previous generation. I know I was less than thrilled that, with the legendary Celeron 300A, many people enjoyed most of the performance of my pricey hot-rod Pentium III at a fraction of the cost.

Pentium 4 went even further, such that someone "upgrading" from a Pentium III 1GHz to a Pentium 4 1.3GHz (a 1GHz Pentium 4 did not exist) would actually end up with a slower computer. And interestingly, the much-vaunted Core i-series did not bring that much to the table over Core 2 in terms of IPC.
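To put rough numbers on that, multiply the table's per-clock figure by the clock - a back-of-the-envelope estimate that ignores everything outside the core:

```python
# Back-of-the-envelope relative integer throughput = per-clock figure * clock (MHz),
# using the Dhrystone-derived IPC values from the table above.
# Ignores cache, memory, and everything else.
p3_1000 = 0.47 * 1000   # Pentium III at 1 GHz   -> ~470
p4_1300 = 0.14 * 1300   # Pentium 4 at 1.3 GHz   -> ~182
print(round(p3_1000), round(p4_1300), round(p4_1300 / p3_1000, 2))  # 470 182 0.39
```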

The Pentium III and Pentium 4 acquitted themselves with huge increases in clock speed, the marquee feature for CPUs, and it appears that the Core i-series followed the same pattern. But when you look at the clock speeds that the large pool of overclockers average over at HWBOT, we see that the Core 2 Duo was capable of some very good speeds. The HWBOT (air/water) scores are useful for finding the upper limits to clock speed for a given architecture rather than the artificial limits Intel creates for market segmentation.

There are many people still using Core 2 Duos and for good reason. While it is certainly slower than the Core i-series, it's not too far off
... particularly if human perception of computing speed is logarithmic as it seems to be for many phenomena.
There are many reasons to go for newer architectures, e.g., lower power use, higher clock speeds for locked CPUs, larger caches, additional instruction sets, Intel's lock-in philosophy, etc. In retrospect, we can see the characteristic S-curve (or a plateau if you are looking at the log graph) of technology development, so it all makes sense - the Pentium 4 being a stray data point. An aberration. Something to dismiss in a post-study discussion or hide behind a prominent line of best fit. There, so simple. Companies pay millions to frauds like Gartner Group for this kind of analysis! But you, singular, the reader, are getting it for free.

At the time, however, it seemed like the performance gains would go on forever, with exotic technologies neatly taking the baton, whereas the only technologies I see making a difference now - at least for CMOS - are caching related. Someday CPUs will fetch instructions like Kafka's Imperial messenger, forever getting lost in another layer of cache. But if Zen turns to Zeno, it won't matter; Dhrystone will expose it all.

Kurzweil talks about S-curves being subsets of other S-curves like a fractal of snakes and maybe that's where we're at. Or is his mind the product of the 1970-2000 period of Moore's Law, forever wired to think in exponential terms?




Wednesday, July 27, 2016

Improving Game Loading Times

Faster computing is a game of eliminating bottlenecks. Every component in a system is waiting for something, whether it be the results from CPU calculations, a piece of information from memory, storage, or the network, or even just input from the user.

Ideally, the computer would always be waiting on the user rather than the other way around. For the most part, today's computing experience approaches that ideal. This is why it's all the more jarring when you do have to wait, which is often the case with large games.

A common piece of advice for improving game load times is to get an SSD if you are using a regular hard drive for storage. And it definitely helps.

Regular hard drives are so much more sluggish that replacing them with SSDs improves the general responsiveness of computers more than just about any other upgrade. And for game loading times, it makes sense that faster storage devices lead to faster loading times. But at some point, storage devices will become so fast that they will no longer be the bottleneck.

It turns out that if you have an SSD, you are already there, because even if you increase the speed of your storage device by an order of magnitude, as is the case with RAM drives versus SSDs, game loading times are basically unchanged.**

Why?

For many programs, the bottleneck moves back to the CPU and the rest of the system. Rather than the CPU waiting on storage, it's the user waiting on the CPU to process the instructions that set up the game. To demonstrate this, I clocked my CPU at 1.2, 2.4, 3.6, and 4.8GHz, and then measured initial and subsequent loading times for Killing Floor 2.*


Although it is clear that a faster CPU helps loading times, the benefits become smaller as CPU frequency increases - even from a percentage-change perspective: the load time of the 2.4GHz run was 62% of the 1.2GHz run despite a 100% clock speed advantage, and the load time of the 4.8GHz run was 71% of the 2.4GHz run despite another 100% clock speed advantage. In addition, pushing CPU frequencies higher gets exponentially more difficult. Thankfully, overclocks in the low-to-mid 4GHz range happen to be the sweet spot for the processors Intel has released over the past few years, so most of the load time benefit can be realized by those with an overclockable system or the latest processors (e.g., the 4790K turbos to 4.4GHz and the 6700K boosts to 4.2GHz).
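The diminishing returns are easier to see as scaling efficiency, i.e., how much of each clock doubling actually shows up as a shorter load. Here's a quick sketch using placeholder load times chosen to match the percentages above (the absolute seconds are illustrative, not my stopwatch readings):

```python
# Scaling efficiency of load time vs CPU clock across successive clock doublings.
# Placeholder load times (seconds) consistent with the ratios in the text:
# the 2.4 GHz run at 62% of the 1.2 GHz time, 4.8 GHz at 71% of the 2.4 GHz time.
load_s = {1.2: 100.0, 2.4: 62.0, 4.8: 44.0}

clocks = sorted(load_s)
for lo, hi in zip(clocks, clocks[1:]):
    speedup = load_s[lo] / load_s[hi]   # observed speedup from the clock bump
    ideal = hi / lo                     # perfect scaling would be 2x here
    print(f"{lo} -> {hi} GHz: {speedup:.2f}x of an ideal {ideal:.0f}x "
          f"({speedup / ideal:.0%} efficiency)")
```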

The second load runs were performed to see the effect of the Windows cache. These runs were about 15 seconds faster across all CPU frequencies. If having parts of the program preloaded into memory can save that much time, it makes me wonder why RAM drives don't perform better. I have a feeling that overhead from the file system might be to blame. Or maybe game resources sit already unpacked in the Windows cache, whereas they still need to be decompressed even when read from a RAM drive. I'd have to unpack the files, if they even are compressed, to test that theory. Unpacking would also have the benefit of shifting some of the burden off the CPU and onto storage. Right now my Killing Floor 2 directory takes up 30GB, so even mild compression means the unpacked version could easily balloon to a size where the reduced loading time isn't worth the precious SSD space. It's worth trying someday.

In any case, the Windows cache after running Killing Floor was 3GB. If all of that represents game assets loaded straight from disk, then it accounts for at least six of those 15 seconds - probably more, given the maximum read rates of around 200MB/s I saw from the SSD during loading and the important 4K QD1 performance of typical SSD drives, like mine, which is about 29MB/s.
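The six-second figure is just that 3GB divided by a read rate. Here's the same arithmetic at the drive's rated sequential speed (I'm assuming roughly 500MB/s for a SATA SSD of this class), the ~200MB/s I observed during loading, and the ~29MB/s 4K QD1 number:

```python
# Lower bound on time to read ~3GB of game assets at various rates (MB/s).
# 500 MB/s is an assumed rated sequential figure for a SATA SSD of this class;
# 200 MB/s and 29 MB/s are the observed-while-loading and 4K QD1 numbers above.
cache_mb = 3 * 1024
for label, rate in [("rated sequential", 500),
                    ("observed during loading", 200),
                    ("4K QD1", 29)]:
    print(f"{label:>24}: {cache_mb / rate:.0f} s")  # ~6 s, ~15 s, ~106 s
```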

Benching my SSD (Seagate 600 Pro 240GB SSD)

Then again, almost all online RAM drive vs SSD game loading time comparisons suggest this is not the case.

4K QD1 read performance is important because games, and most typical programs, mostly do low-queue-depth accesses. Resource Monitor showed a queue depth of less than 3 during loading. This type of workload is tough to optimize, and even the fastest multi-thousand-dollar NVMe PCIe datacenter SSDs are no better than a decent consumer drive using old-fashioned SATA. Then again, 4K QD1 may be a red herring here given the lack of a RAM drive advantage even on a 4.5GHz 2500k.

As an aside which I'm not really going to separate, the best SSDs, like the Samsung SM961, can do 60MB/s at 4K QD1. That's very good performance and is basically what the ACARD ANS-9010, a SATA-based drive that uses much faster DDR2 DRAM, could do (63MB/s or 70MB/s or 55MB/s depending on who you ask). On the other hand, it shows just how much overhead can hamper performance. This user was able to get double or triple the performance (130 to 210MB/s) from DDR2 and a Core 2 Duo with a software RAM drive. I don't know if it's SATA overhead or what, but that's a very significant hit.
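Another way to read those 4K QD1 numbers is as latency: at queue depth 1 the drive services one 4KB request at a time, so throughput divided by 4KB gives IOPS, and the inverse of that is the effective time per read. A quick conversion using the figures above:

```python
# Convert 4K QD1 throughput into IOPS and effective per-request latency.
# At queue depth 1 only one request is in flight, so latency ~= 1 / IOPS.
for label, mb_per_s in [("Seagate 600 Pro (mine)", 29),
                        ("Samsung SM961", 60),
                        ("software RAM drive", 210)]:
    iops = mb_per_s * 1024 / 4                 # 4KB requests per second
    print(f"{label}: ~{iops:,.0f} IOPS, ~{1e6 / iops:.0f} us per read")
```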

Now if only someone would make a PCIe drive using DDR3...

RAM drive bench on my system (3930k 4.6GHz DDR3-2133)

Imagine what a newer computer with DDR4 4000+ could do! (It'd probably fill up the bars) But I'd still rather have a hardware solution over software. That's how I feel about all tasks where there's a choice between hardware and software. REAL TIME. DEDICATED. GUARANTEED PERFORMANCE. - not - "your task will be completed when the Windows scheduler feels like it and as long as the hundred other programs running play nice with each other"

Anyway, performance monitoring software revealed some other interesting facts during the test. Even on the 1.2GHz run, the maximum use of any CPU thread was 80%, even though the CPU speed was clearly constraining the load time. (It was 60% for the first run at 4.8GHz, 45% for subsequent runs.) The unused CPU capacity might be the result of a race condition, but I think it's safe to say there is room for optimization on either the software or hardware side.**

But it's completely understandable if game loading time is very low on the list of developer priorities. 

* From starting the program to firing first shot with loading screens disabled using the
"-nostartupmovies" launch option. This actually saves a good amount of time.

** I tried changing core affinities and counts, priority, hyperthreading, and RAM speed (1066-2133), and saw no changes. I was timing with a stopwatch, so there may have been small differences, but nothing like the effect CPU frequency had.

Wednesday, July 13, 2016

Black Friday in July??

The Friday after Thanksgiving, Black Friday, is supposedly a shopping festival with Bacchanalian levels of chaos. Stampedes, people in tents camping outside stores, fights - all of which probably get retransmitted around the world to help shape others' image of Americana. But efficient markets have largely neutered whatever actual savings people might find on Black Friday with stores only stocking enough doorbuster items to stay out of legal trouble and online retailers finding their headline sales out of stock and on eBay in seconds.

Black Friday was never great and will only fade into greater obscurity as preference for online shopping grows. So why, in the name of the Senate and American Republic, am I getting ads for Black Friday in July? Its excess is comedic, like the Thursday Afternoon Monday Morning podcast or the Spishak Mach 20. Naturally I wonder if we'll be seeing a Black Friday in July Weekend Extravaganza! As an inveterate capitalist, I've made my bed. Ear-splitting advertising noise is the will of the Market, the Market from which all First World problems flow.

Speaking of First World problems, my quick HTC Vive review:

Cons

  • very visible screendoor / low resolution
  • grainy display with poor black levels for an OLED panel
  • visible Fresnel lines and chromatic aberration
  • low FOV
  • heavy and stuffy headset
  • sometimes tricky setup
  • huge hardware requirements if you want to use supersampling
  • glitches with lighthouse tracking
  • quality at periphery is not good
  • cables
  • desktop mode is difficult to use and has a lot of latency
  • finicky adjustment for each person's eyes
  • the shipping box was an order of magnitude larger than it needed to be
  • Room-scale requires clearing a large chunk of space and applications that use it always leave you wanting more
  • Lighthouse boxes emit a motor whine. Not a fan of moving mechanical parts.
Pros
  • tracking precision is very good when it works, which is most of the time
  • responsiveness should stave off motion sickness for most
  • colors and dynamic range are good
  • controllers are excellent
  • SteamVR integration is well done
  • overall impression of quality materials
  • Lighthouse "room-scale" makes this clearly superior to the Rift in applications that can use it although the Rift is lighter and has better optics
All in all, contemporary VR is an amazing achievement. Lighter headsets and better lenses are probably on the way, along with more wireless parts, but improving the big visual cons, i.e., low resolution (and the supersampling needed to fight it) and low FOV, will require an enormous increase in graphics power. Right now I'm using an overclocked 980 Ti, but supersampling will likely require next year's 1080 Ti. Possibly two, if NVIDIA decides on the typical 20% improvement instead of the 50% improvement it delivered as a kind of one-off with the 980 Ti to deflate AMD's Fury launch. But it seems that by locking down many overclocking voltage settings on the current 1070/1080, NVIDIA is keeping plenty of performance bottled up should it need to counter a big AMD launch again.