The Cloud and the Sunset of the GHz-based CPU Metric

We have known this for years but it’s only when you get a slap in the face that you understand what’s going on for real: the GHz metric is useless these days. I was experimenting with vCloud Director the other day and I was checking out my Turnkey Linux Core virtual machine from the catalog (I use that one because it’s small and I can check it in and out of the catalog very quickly – it’s also a very nice distro!). This instance was launched in a cloud PoC I have recently started working on for a big SP and I noted it took quite some time to boot, at least more than the 40-60 seconds it usually takes. Similarly, the user experience once it was booted was not optimal compared to what I am used to. While I haven’t done any serious analysis of the problem, I am going to take a stab at what I believe was happening behind the scenes.

A little background first. This service provider opted to use some quite old IBM x86 servers to run this PoC. Since the PoC, for the moment, is focusing on functionality – rather than performance and scaling – we thought it was ok to use these servers. For the record, they are IBM System x 3850 (8863-Z1S) machines: 4-socket, single-core, 3.66GHz servers. Admittedly, pretty old kit. This is how they show up in vCenter:

This is technology from 2004/2005 if memory serves me well. Consider that, while they should be 64-bit servers (I’d need to double check – can’t be bothered), they certainly do not have the CPU virtualization extensions – required in the latest vSphere releases – to support 64-bit guest OSes. We found this out at the beginning while trying to instantiate a VM of that class. They have been working fine anyway and are serving our testing needs pretty well.

Back to the performance issue I was describing. You should know that when vCloud Director assigns a vDC to an Organization using the PAYG model, it sets a certain “value” for the vCPU. You can think – roughly and conceptually – about this value as something similar to the AWS ECU (Elastic Compute Unit). This is a good thing to do because it provides a mechanism for the cloud administrator to normalize the capacity of a vCPU. It also gives the provider a mechanism to cap the workload (as you probably don’t want a consumer to hog an entire core). For the record, vCD can also reserve part of that “speed” for the VM so that it can guarantee that these reserved resources are always available. The picture below shows the screen where you set this value when creating an Organization vDC (these are all the default values).

Note that the default “speed” value for a vCPU in the PAYG model is 0.26GHz (or 260MHz if you will). This means that, when you deploy a VM in this vDC, vCloud Director configures a limit on the vCPU with that value. I am not sure how Amazon enforces the ECU on their infrastructure (or if they enforce it at all) but this is how vCD and vSphere cooperatively do it:
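If it helps to see the arithmetic, here is a minimal sketch of what that setting amounts to, assuming (as described above) that the “vCPU speed” is applied as a per-vCPU limit and the guaranteed percentage becomes the reservation – illustrative only, not the actual vCD logic:

```python
# Rough sketch of the arithmetic only (not actual vCD code): the PAYG
# "vCPU speed" is assumed to be applied as a limit to each vCPU, and the
# "% of vCPU resources guaranteed" to become the reservation.

def payg_cpu_settings(n_vcpus, vcpu_speed_mhz=260.0, guaranteed_fraction=0.0):
    """Return the (limit, reservation) in MHz that would end up on the VM."""
    limit_mhz = n_vcpus * vcpu_speed_mhz
    reservation_mhz = limit_mhz * guaranteed_fraction
    return limit_mhz, reservation_mhz

# A 2-vCPU VM deployed with the default 260MHz "speed" and no guarantee:
print(payg_cpu_settings(2))            # (520.0, 0.0)
# The same VM with 50% of the vCPU resources guaranteed:
print(payg_cpu_settings(2, 260, 0.5))  # (520.0, 260.0)
```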

To the point now. Everybody knows that x86 boxes have scaled CPU capacity exponentially in the last few years. Today, a latest-generation 4-socket server can have a ridiculous number of cores (up to 80). That’s one dimension of the scalability Intel and AMD have achieved. Another dimension is that the core itself has gone through some very profound technology enhancements and has gotten better and better. Let’s try to do some math and find out how much better.

To do this I am not going to do a scientific comparison (I wish I had the time). I am going to quickly leverage a couple of benchmarks to find out the efficiency difference between the old and the new cores. I am going to use the TPC-C benchmark – a simulated OLTP workload – which may not always be relevant but is known to be CPU bound, although it does require a couple of hundred thousand disk spindles to avoid being bottlenecked on the disk subsystem (which means: don’t bother trying it at home). Long story short, I took a TPC-C benchmark of an IBM server equipped with the same CPUs that we are using in this cloud PoC and I compared it to a benchmark of one of the IBM servers that support the latest generation of Intel Xeon processors:

Old Benchmark:     150,000 tpmC (4 sockets, 4 cores, 3.66GHz)

New Benchmark:   2,300,000 tpmC (4 sockets, 32 cores, 2.26GHz)

We are not interested in the metric (tpmC = transactions per minute C-workload) in absolute terms because we are using this metric just to compare the CPUs. So for the two systems the math would (more or less) look like this:

  • Old server: 150K transactions on 4 cores makes roughly 38K transactions per 3.66GHz core, which means roughly 10 transactions per MHz
  • New server: 2.3M transactions on 32 cores makes roughly 72K transactions per 2.26GHz core, which means roughly 32 transactions per MHz

I didn’t have time to triangulate with more benchmarks so I will stick with this one, and we will claim that a single MHz of a new core is worth about three MHz of the old cores we are using in the PoC.
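For completeness, here is the same back-of-the-envelope arithmetic as a small Python snippet (the tpmC figures are simply the two published results quoted above; nothing here is measured by me):

```python
def tpmc_per_mhz(tpmc, cores, core_ghz):
    """Transactions per minute (tpmC) per MHz of a single core."""
    return tpmc / cores / (core_ghz * 1000)

old = tpmc_per_mhz(150_000, cores=4, core_ghz=3.66)     # ~10 tpmC per MHz
new = tpmc_per_mhz(2_300_000, cores=32, core_ghz=2.26)  # ~32 tpmC per MHz

print(f"old core: {old:.1f} tpmC/MHz, new core: {new:.1f} tpmC/MHz")
print(f"one MHz of the new core is worth ~{new / old:.1f} MHz of the old one")
```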

Now I guess you have an idea of why talking about MHz is meaningless at this point. I guess you also see why assigning “260MHz” to the vCPU tells only half of the story (the other half being… ok, but 260MHz of which core?). Yet there are still a lot of people out there who think that a 3GHz processor is faster than a 2.26GHz processor. I believe you also have an idea now of why Amazon and VMware introduced these different metrics: it’s basically a way for the provider of resources to normalize the actual capacity of the CPUs underneath and overcome the variance we have seen above. My initial performance problem was in fact solved by raising the value of the “vCPU speed” in vCloud Director: I assigned more GHz to the vCPU to offset the poor quality of the core.
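Purely as an illustration of that fix (the ~3x ratio comes from the rough TPC-C comparison above, so treat the exact number as an assumption rather than a recommendation):

```python
DEFAULT_VCPU_SPEED_MHZ = 260   # the vCD PAYG default shown earlier
OLD_TO_NEW_MHZ_RATIO = 3.1     # rough factor from the TPC-C comparison (assumed)

equivalent_speed_mhz = DEFAULT_VCPU_SPEED_MHZ * OLD_TO_NEW_MHZ_RATIO
print(f"~{equivalent_speed_mhz:.0f} MHz on the old cores buys roughly what "
      f"{DEFAULT_VCPU_SPEED_MHZ} MHz buys on a current core")   # ~806 MHz
```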

Let me change gears now. What we have discussed so far is fine when you are dealing with VMs, since you can easily use a technique to buffer this variance (the “vCPU speed” or the Amazon “ECU”). However, this becomes a little bit trickier when you start dealing with virtual data center capacity. How do you normalize that? The easiest (and most user-friendly) way to do this is to expose the capacity directly in terms of GHz, which is what vCloud Director does today when configuring Organization vDCs in reservation or allocation mode.

So what do we do? We all agree that 10GHz is no longer meaningful, but what is the other option? You may argue that in a cloud environment you shouldn’t worry about low-level hardware implementation details because the whole purpose of cloud is to hide them, right? On the other hand, we are talking about an IaaS type of cloud here, so a much higher-level metric such as “application response time” wouldn’t be applicable: vCloud Director doesn’t really manage the middleware and application parts of the stack, so that would be out of its control.

GHz may sound like the right thing to expose when you are providing virtual hardware capacity in an IaaS cloud, and yet the metric would need to be consistent across different providers (and we have seen this may not be the case if different providers are using different hardware technologies). An option would be to try to normalize this value similarly to how the CPU in the VM gets normalized. Sure, but how? With which metric? In the VM-based model you can expose a very well known metric / object: the vCPU. In that case you can pass on to the consumer the key to decrypt the amount of compute capacity of that object, similarly to how Amazon does it with the ECU: “One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor“. The keyword here is “equivalent”. This means you can use any CPU technology you want; someone will tweak a parameter so that your performance experience will always be the same. Which is what I have done above to fix my performance problem in the PoC, for example.
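To make the idea concrete, here is a minimal sketch of that kind of normalization applied to vDC capacity, assuming each host gets a benchmark-derived factor relative to a declared reference core (all host names and factors below are hypothetical):

```python
# Hypothetical normalization of vDC capacity into "equivalent GHz" of a
# declared reference core, the way the ECU pegs capacity to a 2007-era CPU.

REFERENCE_CORE = "1.0-1.2 GHz 2007 Opteron/Xeon equivalent"

hosts = [
    # raw_ghz = sockets * cores * clock; factor = work per MHz vs the reference core
    {"name": "old-x3850",    "raw_ghz": 4 * 1 * 3.66, "factor": 0.9},
    {"name": "new-4-socket", "raw_ghz": 4 * 8 * 2.26, "factor": 2.8},
]

normalized_ghz = sum(h["raw_ghz"] * h["factor"] for h in hosts)
print(f"vDC capacity: {normalized_ghz:.0f} normalized GHz ({REFERENCE_CORE})")
```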

The vCloud Director challenge is slightly different from the Amazon challenge though, in the sense that vCD is a technology that enables service providers to stand up cloud backbones quickly and efficiently… whereas Amazon is itself a Service Provider. So coordination and standardization aren’t an issue for them (as a matter of fact the ECU doesn’t really have any industry-wide meaning outside the context of the AWS platform).

Similarly to how the industry is looking for standard APIs to consume cloud resources across different providers, I believe there is a need to standardize the metrics that describe the capacity those resources can deliver. Could this metric be an industry-standard benchmark like TPC-C (for example)? Or should it be more like a brand new synthetic value that combines a number of benchmarks covering a wider range of workload patterns? Or should it just be a normalized GHz number? Ironically enough, this problem can only be worse in a PaaS context, because PaaS is supposed to play at a level of the stack where the hardware infrastructure is completely hidden (and exposing GHz wouldn’t make any sense at all). However, PaaS doesn’t extend its reach to a level where higher-level metrics (such as application response time) can be used, because the application code falls under the consumer’s responsibility and not the PaaS provider’s. Which means you could have a PaaS layer that “screams” but an application layered on top of it that is a piece of junk (performance wise).

I’d also like to point out that you may consider having end-to-end governance of your entire stack, where you monitor high-level metrics for the services / applications and let the “governance system” deal with the monitoring and capacity planning of the virtual hardware and all the layers above it. While I admit this would be a desirable state, we are not quite there at the moment.

However, if you think about the separation of duties and roles that this multi-layer cloud stack brings in – a stack made of many different service interfaces, each of which has a provider and a consumer – we also need to make sure each of these interfaces has a way to be measured consistently. In other words, it may not be a human being having to deal with the measurement of these IaaS metrics; it may be a “governance system” that automatically does that for the human being. But we still need to instrument these interfaces so that the “governance system” can deal with them.

Imagine for example a situation where a SaaS provider is the consumer of external IaaS resources. This end-to-end monitoring becomes difficult to achieve, hence the need to create more detailed SLA metrics between the various layers and their interfaces in the stack. In this specific example, how is the SaaS provider subscribing to IaaS resources? How are these virtual hardware resources going to be measured, monitored and enforced from a performance perspective when the two parties are separate entities with separate duties? How do you define those boundaries from an SLA perspective? That’s what we are debating here.

To GHz or not to GHz? That is the problem. All in all, GHz-based capacity planning and monitoring is dead. However, it seems we are still flooded with IT tools that lean on it pretty heavily.

I’d like to hear what you think and if you have any opinion on how to address this problem.

Massimo.

17 comments to The Cloud and the Sunset of the GHz-based CPU Metric

  • For the last 20+ years I have debated against the use of MHz as a measure of CPU throughput.
    Back then there were a lot of different CPU families and architectures (CISC vs RISC) and real-world benchmarking would not necessarily turn out better on the CPU that had more MHz.
    Since there were a lot fewer computer-educated people back then, the best way I found to explain it was that if you want to qualify your car’s performance you are not going to say “my car goes to 6500 RPM” – you will mention the max power, the 0-60 mph (or 0-100 km/h) time and the maximum speed.
    This debate happened again when AMD had lower clock rates and marketed what they decided was the MHz equivalence to Intel’s CPUs.

    MHz / GHz are a basis of comparison within the same family / model of CPUs, but in recent years the GHz race was replaced by the number of cores and their effectiveness. Since the cloud end user is consuming a service that can be based on different hardware, there should be a better CPU metric.
    GHz was certainly chosen because it is something people can relate to, even if meaningless. There have been a lot of efforts by different parties to provide new metrics, but this requires effective testing, which is not a trivial task because it highly depends on what you use these CPUs for and on how the compilers, operating systems and hypervisor are optimized for a given architecture.
    This is certainly another reason why making a call to get the CPU MHz is chosen over setting up a new metric.

    We may see competitive YouTube videos showing that, for a given amount of MHz, a given cloud provider runs a given type of workload better, which may involve better CPUs but also better I/O or a better memory architecture – something that is not transparent to the end user either.

    Maybe we should drop all the technical metrics, have the customer order very small, small, medium, large or extra large, and let them decide which cloud provider’s medium offering they like best.

  • Looking at your benchmark from a different perspective, you had 4 cores @ 3.66GHz vs 32 cores @ 2.26GHz. This is an increase of 4.9 times the available GHz. Yet your transactions per MHz rose only from 10 to 32, which is roughly a 3x improvement.

    So, a factor-of-five increase in available GHz yielded only a factor-of-three increase in per-MHz efficiency.
    Perhaps those big cores are useful. Not all applications are multi-threaded yet, and smaller cores will slow those apps down.

    Capacity performance varies greatly between application types. Full application performance metrics are not available, nor understood, for all types of applications. The only option is to work with metrics that can be understood, like GHz, and then understand that performance scaling is not linear, nor has it ever been, for higher numbers of processor cores.

    Linear (1:1) processor core scaling has never been achieved. This dates all the way back to the 1960s mainframes. The first 2-way box actually decreased performance overall. 4-way SMP only appeared in 1984 and then only delivered 2.8x the power of a single processor. It took until the ’90s to get past these issues. [Have a look at LSPR for some background on complex capacity planning: http://www.ibm.com/systems/z/advantages/management/lspr/index.html]

    Improvements in processor logic and caching, as filtered down from the mainframe to x86/x64, have helped deliver more resource availability and flexibility, but you still need to hedge your bets when considering capacity planning. Proofs of concept are still required when moving resource-heavy applications to the cloud or to virtual.

    What real alternative to GHz is there?

    • Massimo

      Hi Jason. The scalability problem (i.e. multi-threaded apps) is very well understood, I think. However, I believe there are very few applications nowadays that are designed for and require that (namely big databases). At the beginning everyone was freaking out because most were wondering how you could make a single-threaded app scale on 2, 4, 6, 8 and now 10 cores… per socket… but the reality turned out to be that 99% of applications just require a fraction of a core instead of spanning multiple cores. Needless to say, virtualization was an easy sell here.

      This is the first dimension of scalability I mentioned, but I didn’t think it was important to talk about it here. At the end of the day you don’t care whether you are running 100 workloads on 5 very powerful cores (a 20 to 1 ratio) or 100 workloads on 20 slow cores (a 5 to 1 ratio).

      What really matters is that if you require a fraction of a core… that fraction needs to be “consistent” if you move your workload left and right (or from one SP to another). And GHz isn’t going to cut it (as I have tried to describe). That’s why I concentrate the discussion on the “micro” performance problem (i.e. 1 workload on a fraction of a core) rather than on the whole system holistically (i.e. many workloads on many cores).

      It’s funny that you mention mainframes and then ask what the alternative to GHz is. Isn’t MIPS (philosophically) a better way to measure the capacity of a compute resource, for example?

      Massimo.

      • MIPS is pretty useless (and this is not new).

        For example, read here: http://www.futuretech.blinkenlights.nl/perf.html

        • Massimo

          Christophe,

          I said “Isn’t MIPS (philosophically) a better way to measure the capacity of a compute resource for example” … philosophically being the most important word in that sentence.

          BTW that guy is looking for a perfect benchmark (which doesn’t exist). This is in fact not at all a geek discussion but an effort to find a higher-level index/metric that could be as meaningful as GHz for the 5B people out there but, at the same time, not as broken as the GHz metric. I am fine if it’s not perfect… but it can easily be better than GHz.
          I am not even interested in comparing different “CPU architectures” (which is what the guy at the link provided seems to be obsessed with).

  • Duncan Yellow Bricks (@DuncanYB)

    Besides the fact that the default value is just ridiculous, you have a point in terms of the metric used. It would make more sense to use a metric that is comparable across platforms. 2GHz doesn’t say much indeed.

  • MIPS was long ago considered dead. LSPR showed how processor performance varies with the workload, the processor model and the number of processors. In the mainframe world that is very well understood.

    In the x86/x64 world understanding the result of a certain combination of app/speed/cores is less certain.

    In the sub-core world it is even more difficult to get to grips with. Making sense of sub-core allocation needs a more thorough understanding of processor scheduling and its net effects on the applications being scheduled.

    We’re not there yet and so basic GHz calculations remain prevalent.

    • Massimo

      Jason, agreed that normalizing the CPU capacity (or a fraction of it) won’t give you predictability for the final workloads. Sticking to the CPU subsystem, there are other parameters that may greatly influence the performance of an application. There are, for example, cache-friendly applications that will benefit a lot from bigger caches (and I am not even taking that into account in my ranting – too difficult, especially if you throw a hypervisor into the mix). If we take into account other subsystems (memory, but more importantly disk and network), their architecture/implementation will again greatly influence the behavior of the application.

      For the record, I wasn’t trying to solve all these variables with that “new metric”. That would be way too difficult. I was just taking a small part of the problem (the CPU speed) and trying to see if we can make it more consistently measurable across different infrastructures. Essentially this is what the AWS ECU was introduced for, in the end, for VM-based deployments. How do we start addressing the same thing for vCD-based deployments? And I am sure Amazon doesn’t sell the ECU as an overall performance metric that lets you determine application performance upfront (as you rightly said, testing is the only way to find that out). Amazon in fact uses it as a simple CPU speed normalization factor.

      Thanks. Good discussion.

      Massimo.

  • Dmitri Kalintsev

    Massimo,

    What about some sort of metric derived from SPECint+SPECfp of a single core? The hypervisor could potentially benchmark the host it is running on at boot time and come up with a multiplier against a “base performance”, to be used to allocate/track resources in the cluster.

    (It may well sound naive, but this is what seems to “make sense” to me.)

    On a related topic, it would be nice if storage resources had a similar knob (perhaps IOPS), which would on one hand help set the end customer’s expectations re: disk performance and on the other hand help the provider with storage system dimensioning and resource management…

    Anyway, just a thought.

    – Dmitri

    • Massimo

      Hi Dmitri. Yes, SPECint may do the trick. It would be a perfect example of what Jason/Christophe were pitching as “useless” (and I 100% agree) but a great example of the normalization I was talking about.
      Thanks. Massimo.

  • Really nice post,

    We just recently deployed our GSA Cloud platform and this is really something I needed to see. I am a super noob with vCD but this brings into question the cataloging and planning that went into it. I will have to do some testing. Good work!

    • Massimo

      Thanks. I’d be interested in your feedback on how you would do things differently (and/or what you would like us to do) when you are done with your testing.

      Massimo.

  • Jacques Talbot

    +1 on SPECint or SPECint_rate (but they are mostly equivalent) as suggested by Dmitri.

    This is an age-old problem, nothing new with virtualization. It is the so-called speed demons versus brainiacs chip design controversy of the previous century. The brainiacs won, apparently.

    The key issue is to be sure to have a benchmark popular enough that all Intel/AMD chips have a number when they appear on the market.
    SPECint, contrary to TPC-C, characterizes basically only the chip and not the system (it runs in L2 or L3 cache). This is OK for our purpose since different cloud providers have different systems (blades or pizza boxes).

    It is IMPOSSIBLE to solve the core scalability issue at this level, so do not even try – same story as SMP scaling, I worked on that for years :-)

    SPECint is probably the best solution for vCloud, far better than MHz or proprietary style ECUs.

    Note: SPECfp is mostly irrelevant except for scientific workloads.

  • Geoffrey Coulter

    Perhaps the answer is for VMware to create a mechanism whereby the “Work Units” supported by individual servers are normalized. As some sort of enrollment step into vCD, on each (idle) hypervisor, start a pre-packaged VM at several limit levels and calculate SPECint (and other tests?) to establish a good normalization factor. On the old server above, maybe 25 WU = 750 MHz, while on the new server, 25 WU = 250 MHz. Limit configuration would be based on WU, not a raw speed value.

    If VMware implemented something like this, it would at least be useful for comparing a sizable subset of cloud offerings.

  • I was working on the same problem here: http://pleasedonttouchthescreen.blogspot.com/2011/08/comparing-mhz-from-2005-to-2011.html
    I was thinking about building a virtual appliance with SPECint to perform some in-house evaluation, but I’ve found that the SPEC benchmarks are not available for download for free, even for non-profits.
    Anyway, the available processors are a finite number, so VMware itself could benchmark them during the HCL process and include a “power index” table (MIPS, SPECint, Dhrystone, whatever) in vCenter itself.

  • [...] its a very low amount, you will need to adjust for you environment. There are two great blog posts here and here about this [...]
