Welcome to IT 2.0: Next Generation IT infrastructures

Thoughts on VMworld 7.0 and Virtualization 2.0

This article was originally posted on the VMware vCloud corporate blog. I am re-posting here for the convenience of the readers of my personal blog.

Yes, that's right, it’s the 7th year of VMworld. The event started years back as a small gathering of a few hundred geeks. At least this is what VMware was expecting; in fact almost 1,500 individuals showed up in San Diego in 2004 for the inaugural show. I am proud to be one of those first attendees. At that time you could breathe the "geeky" spirit of the event and sense how such a little and simple concept would change the way we do computing down the road. Yeah, you take an industry standard server, you install this little piece of software and you can install two or more standard Operating Systems. Boom! You’ve changed the world forever. There are times you listen to someone and have a “WOW” moment - I still remember the moment when, in a meeting with a big bank a few years ago, I was telling a Veritas architect that we couldn't install their HA clustering software because it was incompatible with vMotion. When I explained to him what vMotion was he laughed at me, saying that I had probably "misunderstood" what that technology could do because it was simply impossible to move a Windows instance on-the-fly from one server to another. Yeah, sure. Welcome to virtualization (1.0).

Then came a period without many "WOWs!". Sure there have been a few (VMware Fault Tolerance comes to mind) but the titanic effect that the first wave of virtualization technologies delivered in the first place...we just haven't seen it again (yet). And there are good reasons for that. As virtualization became more widely adopted, customers were obviously looking for more enterprise management surrounding the first wave of these high-potential technologies. This was the time when VMware concentrated more on the management side of things. I have never seen anyone attending an ITIL class stepping out screaming "Gee, this is the coolest thing in the world, I want to go home and read more about it RIGHT NOW!"; similarly I anticipated a certain level of "ah ok, interesting" when VMware announced a string of new technologies in that management area (I can think of CapacityIQ, SRM, not to mention all the Ionix portfolio VMware recently acquired from EMC). Don't get me wrong: as much as none of the hypervisor geeks are going to "WOW" for the Ionix portfolio....we all understand very well that without such tools and an overall solid management strategy VMware wouldn't be considered for what it would like to be considered, which is clearly not (just) the cool technology provider that keeps a geek up during the night. That (alone) is not what can bring VMware to the heart of the data center. Funnily enough I’ve had a blog post in my drafts for more than a year whose title was "Virtualization is no longer sexy, it's just useful". More or less this is what I'd have discussed here so I'll go ahead and delete the draft now.

This year it’s different though. VMworld 7.0 (i.e. VMworld 2010) is going to go back to some core geeky type of discussions around cloud technologies. While some of the cloud-related discussions are still around management (which they need to be because we want cloud to resonate with the enterprise as well) there are other "cloudy" topics really for the geek at heart. I am thinking about the concept of "location independence" that cloud will bring to the table. That's something I am going to touch on during my session at VMworld: Cloud 101: What's Real, What's Relevant for Enterprise IT, and What Role Does VMware Play. This session is really geared towards introducing the cloud concepts and certainly one of the most interesting concepts about cloud is that you could run your workloads...well…in the cloud! Where else?! If you are still wondering what this whole cloud thing is, come to this session and you won't be disappointed. And if you are, that's fine. Just don't fill out the feedback form. :-)

If everything goes well I may even be able to show you something during that breakout. Consider I can't promise this will be as "WOW!" as vMotion, but rest assured it's going to be more fun than an ITIL class! This is, in a way, Virtualization 2.0 being presented at VMworld 7.0!

Joking aside, the only problem I see is that some of these concepts (and technologies) are a bit hard to digest the first time you face them. At least this is what happened to me when I joined the vCloud team roughly 6 months ago. That's the part I am struggling with at the moment: we are having some internal discussions on how to lay out a few sessions and we are debating how best to present the concepts and the products. You don't want to be too kindergarten but at the same time you don't want to go too deep and lose the audience in the first 30 seconds. Challenging.

That's all I wanted to say for today; if you are a geek you may find VMworld 2010 a fun show. And if you are around stop by and say "ciao".

Massimo.

posted by Massimo | 1 Comments

Open standards, open source, OpenStack and the TCPIP of Cloud APIs

A few weeks ago Rackspace and NASA announced an initiative called OpenStack aimed at providing an(other) open source alternative for building public and private clouds. This generated some reactions in the open source community like this one from the OpenNebula team. By the way, do not confuse OpenNebula, one of the other open source cloud implementations mentioned, with NASA's Nebula, an open source compute-related cloud implementation that is part of the OpenStack announcement along with Rackspace's CloudFiles, an open source storage-related cloud platform. If you find it hard to follow, let's try to remove the redundant "cloud" and "open source" words and express the concept in a mathematical format: it's OpenNebula Vs Nebula + CloudFiles = OpenStack. Easy, isn't it?

My colleague Winston Bumpus already addressed in this post the difference between open standards and open source so I am not going to talk about that. Not to mention that Winston is at least an order of magnitude more authoritative than I am when talking about standards. Winston is director of standards architecture at VMware and president of the Distributed Management Task Force Inc. (DMTF).

Instead, I'd like to talk about what turned out to be the two major talking points around OpenStack. The first one goes like "OpenStack is open source hence it's free". The second one goes like "OpenStack is open source so I can customize it to my needs and use this customization as a differentiator in my industry".

I am not going deep into the "free" argument as this has been a never ending discussion for years. If "free" was always better, I say, in your corporate IT you'd all be using Ubuntu, MySQL, OpenOffice and the like, whereas I'll go ahead and speculate you are most likely using all flavors of Windows, Oracle and Office. Note I don't have a political agenda in saying this because if I had to say something against closed source software I'd have all my vSphere colleagues hunting me down. Same treatment from my SpringSource and Zimbra folks if I had something bad to say about the open source software model. In the final analysis I think that the choice boils down to the rounded value of any given product. And I am using the word value loosely here, including things like feature set, support, roadmap integrity, commitment, openness, compatibility etc. For many organizations these are more important than getting something "for free". I apologize to the NASA and OpenNebula folks for having been a bit picky in my opening but I wanted to make the point here that simplicity is a value too.

The second talking point ("it's open source so I can customize it") is way more interesting. First this is an argument that pertains more to the service providers building public clouds than enterprise organizations building their private clouds. Having worked for about six months now with service providers across Europe I have seen and heard many interesting things. The first thing I've heard is "we don't want to develop the cloud stack ourselves. It's not our business. We'd rather use an out-of-the-box product and build on top of it". In addition to this, I have seen at least a couple of high-level cloud stack prototypes built on top of vSphere from two of these service providers and, believe me, they all look the same from a core functionality perspective: multi-tenant support, web portals, catalog management and external programming interfaces. Sure the branding of the portals looked very different (obviously) but if you abstracted yourself for a minute from this, you'd note that the logic, the flow, the processes involved from on-boarding a user all the way to having the same user being able to deploy a workload were all very similar. Long story short, I have two data points at the moment: the first one says these service providers don't want to develop these core functionalities in-house and the second one says that their internal mockups and prototypes look very similar from a functionality perspective.

This doesn't add up, I thought. Assuming this is representative of what service providers think in general, what's the point of having an open source stack that you can customize? You don't want to hack it because you don't want to build the stack in-house. And you don't need to hack it in the first place because you are essentially building the same thing anyway (even starting from scratch). You may argue that this is not in-house development but more like customizing existing software to your needs. On this point I really think the burden is not so much in doing the one-shot customization but rather in maintaining it over time. Doing so, you are essentially forking from the vanilla open source software and it may be difficult to incorporate new evolutions of the vanilla platform in the long run.

The whole point here is that, obviously, service providers need to differentiate from each other. However an IaaS cloud offering is so articulated that one should wonder whether it is possible to differentiate at a level which is (well) above the core multi-tenancy cloud functionalities. In fact there are tons of things that a service provider can do on top of the cloud stack without having to do something inside the cloud stack providing the technology backbone. After all many service providers are using many out of the box products to build their offerings and I am not sure anyone has ever thought about writing (or customizing) their own version of Oracle, Windows, vSphere, Tivoli. And I bet they are even using open source software (just because it has value) without necessarily having recompiled a tweaked kernel, although possible.

And while I am at this I'd like to touch on the cloud APIs and on openness as well. I am not a programmer so this is not my territory; however, I'll take a stab at it and say that there are really a couple of approaches I have seen so far in the cloud space. I call the first one the "standard" approach whereas the second one is the "dominant" approach. In the standard approach, no matter what the actual product implementation is, the interfaces are common and all technology vendors have agreed on a common "language". This is the standardization effort that the DMTF is coordinating. Right now VMware has submitted the vCloud APIs to the certification body in an attempt to define a common standard everybody could agree upon to make various cloud stacks interoperable.

With what I call the dominant approach, some vendors are trying to achieve the same result without going through a standardization process but rather using an abstraction/wrapping model where they essentially tell ISVs that their cloud API 3 is able to abstract cloud API 1 and cloud API 2. In this case ISVs (and users) have to write only once to "reach many". The funny thing is that this is typically perceived as something that avoids lock-in. The reality is that you are locked into neither API 1 nor API 2, but you end up being locked into API 3 anyway. There is no way out of this other than using the standard approach I mentioned before, since it has no "master" and "slave" concept.

Also, the other problem with the dominant approach is that we are at such an early stage of cloud computing that cloud stack vendors implementing API 2 may try a similar approach (and win the master position) by abstracting API 3 and API 1. This would leave users and ISVs with the dilemma of which APIs they should develop against. The picture below summarizes and visualizes this recursive abstraction concept:

Interestingly enough this is already happening. Last time I checked, for example, Red Hat's DeltaCloud project was able to abstract a variety of cloud interfaces including Amazon EC2, GoGrid, OpenNebula, Rackspace and others. At the same time OpenNebula has recently announced an adaptor to connect and abstract Amazon EC2 and DeltaCloud resources. This sounds like an example where both the DeltaCloud and OpenNebula interfaces are masters and slaves at the same time. So now what are you going to do? What are you "standardizing" on?
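For readers who think in code, the lock-in point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CloudApi1`, `CloudApi2`, `CloudApi3` and every method name are hypothetical stand-ins for "API 1", "API 2" and the wrapping "API 3", not the interfaces of any real product mentioned above.

```python
# Hypothetical sketch: these classes are invented for illustration and do not
# correspond to any real vendor SDK.

class CloudApi1:
    """Stand-in for one vendor's native API."""

    def run_instance(self, image: str) -> str:
        return f"api1 started {image}"


class CloudApi2:
    """Stand-in for a second vendor's native API, with different call names."""

    def create_server(self, template: str) -> str:
        return f"api2 started {template}"


class CloudApi3:
    """Wrapper ("dominant") API that abstracts API 1 and API 2."""

    def __init__(self, backends: dict):
        self.backends = backends  # backend name -> native client object

    def deploy(self, backend: str, workload: str) -> str:
        client = self.backends[backend]
        # The wrapper translates its own call into each native call.
        if isinstance(client, CloudApi1):
            return client.run_instance(workload)
        if isinstance(client, CloudApi2):
            return client.create_server(workload)
        raise TypeError(f"unsupported backend: {backend}")


def isv_application(api: CloudApi3) -> list:
    # Written once, reaches both clouds, but every call here is an API 3 call:
    # replacing the wrapper means rewriting this function.
    return [api.deploy("cloud-one", "web"), api.deploy("cloud-two", "web")]


results = isv_application(
    CloudApi3({"cloud-one": CloudApi1(), "cloud-two": CloudApi2()})
)
```

The ISV code never mentions API 1 or API 2, which is the "write once, reach many" benefit. It does, however, mention `CloudApi3` on every line that matters, which is the lock-in. If a second vendor ships its own wrapper around the same back ends, the ISV is back to choosing which wrapper to code against.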

The way I see this evolving is that these cloud APIs will have to be like TCP/IP. No matter what the product is, it implements the very same language so that heterogeneous systems (in the case of TCP/IP) or different service providers (in the case of clouds) will be able to interoperate transparently. And this requires a certain level of flat standardization. After all, TCP/IP didn't become so widely adopted because it was able to abstract IBM's SNA, Novell's IPX/SPX, Microsoft NetBEUI and others.
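The "flat" standard approach can be sketched the same way. Again this is a minimal, hypothetical Python illustration: `StandardCloudApi`, `ProviderA` and `ProviderB` are invented names and are not the vCloud API or any DMTF specification. The only point is that every provider implements one agreed interface directly, with no wrapper in the middle.

```python
from abc import ABC, abstractmethod


class StandardCloudApi(ABC):
    """Hypothetical agreed-upon interface that every provider implements."""

    @abstractmethod
    def deploy(self, workload: str) -> str:
        """Start a workload and return a provider-specific status string."""


class ProviderA(StandardCloudApi):
    def deploy(self, workload: str) -> str:
        return f"provider-a runs {workload}"


class ProviderB(StandardCloudApi):
    def deploy(self, workload: str) -> str:
        return f"provider-b runs {workload}"


def customer_script(cloud: StandardCloudApi) -> str:
    # The customer codes against the shared interface only, so switching
    # provider means passing a different object, not rewriting this function.
    return cloud.deploy("web-frontend")


outputs = [customer_script(ProviderA()), customer_script(ProviderB())]
```

Here there is no master and no slave: `customer_script` depends only on the shared interface, much as an application using TCP/IP does not care whose network stack is on the other end.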

The last thought for service providers is this: while this level of standardization will certainly make it easier for customers to switch from one vendor to another (which is something you understandably don't like), I'll look at this upside down: think about how many public cloud customers you will be able to get on-board simply because they will be confident that they can leave you whenever they want. Think about how many customers are not buying into public clouds today, simply because they know it's easy to get in but it's difficult to get out. Why do you think x86 is so popular and proprietary platforms are in decline? The market rules in the end.

This is going to be a win-win for everybody.

Massimo.

posted by Massimo | 5 Comments

Cloud and the New IT Pillars

This article was originally posted on the VMware vCloud corporate blog. I am re-posting here for the convenience of the readers of my personal blog.

I have used one of my recent posts in some off-line discussions about the use and penetration of virtualization in some accounts. In this post I’d like to expand a bit on that. I will just start with a nice picture that is supposed to summarize with a different graphic (but with the same core concepts) what I was trying to argue in the post I was referring to above. In this case, I think a picture is worth 1,000 words:

Specifically I want to position where cloud infrastructures are going to fit into an organization. While this picture is really focused on internal (enterprise) deployments it also maps how service and hosting providers are going to shape their offerings for their end-users (more specifically, on the right hand-side the traditional hosting business and on the left hand side the new virtual servers /cloud business).

To make a long story short, most enterprises will have to accommodate – like it or not – these four platform pillars (right to left):

*  Proprietary platform: essentially all non-x86 platforms. Think of mainframes and the AS/400 as prime examples. While many may not refer to Unix as a proprietary platform, I believe it is.

*  Physical x86 platform: traditional Windows and Linux deployments on physical servers. This is the typical old way where a single OS image maps to a dedicated physical server. Many customers still have physical server deployments as part of their regular practice. Sometimes this is required; sometimes they do this simply for “irrational fear of virtualization technologies”.

*  Virtualized x86 platform: this is the first deployment policy for many organizations. Think of VMware VI3 or VMware vSphere deployments. This has proved to work well for the last five to six years and it’s an established good practice. As mentioned in the post I referred to at the beginning, the level of penetration may vary depending on many factors.

*  Cloud platform (IaaS): this is the new potential player in your infrastructure and it’s probably going to support the less critical and more dynamic environments you have to deal with on a daily basis (test and development is one common example; there are many others).

One could spend hours commenting on this slide but I’ll try to be dry on some key points that you need to digest (in my opinion).

First and foremost there is clearly a trend where the left pillar(s) is taking over some of the workloads of the right pillar(s). And this trend is consistent across the board: x86 physical deployments are cannibalizing proprietary platforms. It’s the same pattern for virtualized deployments eating up typical x86 physical workloads (more and more we hear about the “virtual first policy”). Last but not least expect the new player, cloud infrastructure, to cannibalize most of the traditional virtualized x86 deployments. Stopping the trends I am describing here will be as difficult as trying to stop a moving train with your fingertips: good luck.

At this point you may wonder why, given that cloud infrastructures build on top of and leverage virtual infrastructures, I am calling out two specific and separate pillars. That’s a good point, especially because it’s true that clouds (specifically IaaS clouds) build on top of hardware virtualization. In fact I argued just this point in the other post I referenced at the beginning of this blog: since the first cloud instantiation will tend to trade-off the complexity of many tuning options for a better and easier end-user experience, we expect some workloads that require a bit of tuning and visible infrastructure layout options to remain on more traditional vSphere types of deployment.

If you think about this, cloud is all about agility and with agility comes less control (i.e. tuning). As time goes by these two pillars will converge. The Cloud pillar will take over the traditional virtualization pillar. Indeed, I expect that most of these tunings and controls will no longer be needed because of the additional automation and auto-tuning concepts that cloud-related technologies offer. Last but not least, let’s not forget that Cloud technologies will also mature over time and will fill holes we see in the first wave of cloud technologies.

Finally, I’d like to touch briefly on management. I think we all need to be very pragmatic here. I know many customers are looking for the nirvana “one tool to manage them all”. The fact of the matter is that the more you try to normalize these pillars under the same management umbrella, the more benefits for each of the pillars you must sacrifice. I have had an interesting discussion lately with a colleague at VMware and I think he did hit the nail on the head when he said, “They want to have one tool because they think it’s more efficient. It’s not. It’s more efficient, and more effective, to run two tools that manage two systems well than to run one tool that manages ten systems poorly”.

In my previous IT life I was in the business of trying to homogenize heterogeneous virtualization platforms under a single management umbrella so I have to (strongly) agree with my colleague’s statement. In fact, these pillars are very different in the way you manage them. This is true not only from a technology perspective but also, and even more so, from a process perspective. For example, the process to request a partition on a legacy Unix system may be totally different than the process required to instantiate a new physical server, which in turn is totally different than the process to request a new vSphere virtual machine. To complicate things more, the Cloud pillar, by very definition, doesn’t require any process whatsoever to instantiate a new workload from the self-service portal.

Try to homogenize this with common processes, a single management umbrella, and a single pane of glass. The moment you think you have done it, you wake up all sweaty.

I am not making the case that your application or service will not span different pillars. You may very well have your scale-out web front-end on a dynamic cloud pillar and your scale-up back-end database on a more tunable virtualized pillar or any other combination. After all, the concept of application layer tiering isn’t that new in this industry. If you think about that we have just added another interesting pillar (Cloud) into a picture that we have been using in the last ten years. This is not going to shake our world, but it is going to make it much better.

Massimo.

posted by Massimo | 0 Comments

Are Hypervisors Cloud Commodities?

This article was originally posted on the VMware vCloud corporate blog. I am re-posting here for the convenience of the readers of my personal blog.

There have been a number of discussions in the industry in the last few years about whether hypervisors are (becoming) a commodity and whether the value is (or will be) largely driven by the management and automation tools on top of them. To be honest, I have conflicting sentiments about this. On one hand I tend to agree. If you look at how the industry is shaping pricing schemas around these products, that's the general impression - all major hypervisors are free, and by definition one could argue that they are a commodity.

On the other hand, this doesn't really match my definition of commodity. I'd define a commodity as something that has reached a "plateau of innovation" where there is very little to differentiate it from comparable competitor technologies. This pattern typically drives prices down and adoption up (in a virtuous cycle) because users focus more on costs than on technology differentiation. The PC industry is a good example of this pattern.

Is this what is happening with hypervisor technologies? Hell no. I think there is no one on this planet who thinks that deploying OS images on dedicated physical servers is faster, more flexible and in general better than deploying them on a virtualized host. Yet virtualization usage, in the industry, is broad but not deep and it's usually around 30 percent (on average) within most organizations. And these technologies are widely available for free (ESXi, Hyper-V, XenServer and KVM)! So, if everybody agrees there is a problem with the current physical server deployment model, and that there are technologies available to download from the Internet that can address the problem, why are organizations only confident enough to put 30 percent of their workloads on these hypervisors? Can someone explain this? My take is that there may be a number of concerns around support and licensing. But the industry has matured and made huge progress on this front in the last few years (Oracle being one of the few exceptions unfortunately). I bet that a large chunk of that 70 percent of server deployments is not virtualized simply because of technology concerns such as stability, performance, scalability, security and so forth. Where there are technology concerns or technology limitations, there is space for innovation (or education to raise awareness).

The fact that the industry is moving to a model where the hypervisor is free and the management tools are the source of revenue tells a partial story to me. The technology story behind the scenes is quite different. The reality is that there are multiple ways to look at hypervisors and their use cases. If you view the hypervisor as the thin software layer that allows you to consolidate five servers on a single box... well I am with you. At 10 Km/hour there is little difference between a Ferrari and a Fiat (even though the Ferrari is still damn cool). If you, instead, view the hypervisor as the foundation for private and public clouds where multi-tenancy, security, flexibility, performance consistency and predictability, integrity and scalability are not optional characteristics... well then there is a difference indeed.

You may argue that you can achieve most of these characteristics using the proper management and automation tools that sit on top of bare metal hypervisors. But the fact is that the policies at the management layer are only as good and reliable as the hypervisor used to implement and enforce them. Yes, you could put a Ferrari engine on a Fiat and have the best pilot (Fernando Alonso) pushing it at 330 Km/hour! And everything may be great up until the moment you hit the brakes and find out that it will take you 1,500 meters to stop it (if you don't hit a wall before). Similarly, you could say that the real "value" of an airplane is its cockpit with all the automation that goes into it. Again, you can put autopilot on and all is good but, at the end of the day, the autopilot (and all the other automation technologies in the cockpit) only instructs the "basic" airplane technologies (thrust reversal, flaps, etc) to do the real job. And I can assure you that you will want these technologies to be as good, reliable and secure as possible! Always remember that it's not the autopilot and all the slick automation that happens in the cockpit that keep you flying at 33,000 feet - it's the wings.

I am mixing metaphors here and perhaps digressing. Going back to our lovely "commodity" hypervisors discussion, one of the things that has always shocked me is how powerful the networking subsystem inside ESX is. It's just amazing. Out-of-the-box and easy-to-use support for distributed virtual switches, redundancy (both at the physical and logical level), multiple failover and balancing algorithms on a PortGroup basis, traffic shaping, security built-in via the VMsafe APIs, and tons of other parameters and features that you can leverage and tune based on your specific requirements. And what you have seen so far is really just the foundation of what's happening in terms of injecting more cloud oriented and multi-tenancy support. We are working on some cool stuff that will be out in the future that is just amazing. I personally spent the last three months digging into those things and the potential there is phenomenal. I can't talk about this in detail today but it's pretty clear that here we are not talking about just setting up 10 Windows VMs on a physical server allowing them to connect to a flat L2 segment sharing a single Ethernet cable. I can't wait to talk more about what we have in the works and to prove to you that, just like you can't build a castle on the sand, you can't build an Enterprise Cloud on a limited hypervisor.

Massimo.

posted by Massimo | 1 Comments

VMware, Virtualization... and More

Before joining VMware roughly 4 months ago I was wondering, along with many of you, what sort of company VMware was turning into and what they were doing and what they wanted to become in the long run. The more I was tracking VMware buying (supposedly) disconnected companies the more I was thinking "what does this have to do with virtualization? What (the hell) are they doing?". Some of them are a bit less disconnected than others when it comes to virtualization but still the full picture was not clear to me. I think that, in order to get the full picture, you need to abstract a bit from the day-to-day tactical discussions around point features such as VMotion, memory over-commitment and geek-terms like these. The way I see it, there seems to be a bigger plan here which is as simple as this: making IT and the associated user-experience better than it is right now. Period. Virtualization is really the backbone for this but, instead of being the end-goal, it should really be considered a must-have piece of technology to achieve the above plan. And, if you look at the Gartner magic quadrant regarding who is the leader in the virtualization space, well there is no question at all that VMware is the best positioned to achieve that plan. But it's not limited to that. There are at least another couple of angles. The end-goal here is not taking the IT stack as-is and making it run, more flexibly, on software partitions. As we said that's really the must-have backbone but that alone is not enough. VMware is really trying to make the whole stack better, focusing heavily on management, application frameworks and so forth. What I tried to speculate years ago on other posts such as this one or this other one is now becoming clearer as things materialize. Another angle relates to how end-users consume IT resources. Historically the industry thought that the only method for an organization to use IT was to buy a piece of hardware and a piece of software that they would then set up and maintain for the internal users to use these IT resources. VMware is leading the industry to change that pattern too via the vCloud initiative and via our partnerships with other visionaries and innovators such as Google and Salesforce.

But the thing that fascinates me most about VMware is the attitude and the people. In fact we are probably at one of those disruption points in the industry that only happen once in a while. These disruption points happen during stagnant periods in the industry where the leaders of that period impose a technology and a business model and do everything possible to maintain the status-quo. This happened for example with IBM (circa 1940-1980), which led the market with the mainframe and was then challenged by new technologies and new business models such as those of Sun Microsystems specifically (circa 1980-1995). Sun was itself challenged by a new-comer into the datacenter segment and that was Microsoft, which sort of took the lead in the last few years. Are we at the next disruption point where VMware is the new agent of change? Well, while I don't have a crystal ball to look into the future, I can only say that the stars are aligned for this to be a very strong possibility. By the way, while Unix is on the decline the legacy of these deployments is very strong and the overall Unix market today is still around $20B a year (million more, million less). Part of this is because many Unix customers have yet to find a good Unix alternative and Microsoft is not an option for them. VMware is in a unique position with a value proposition that takes the best of both worlds and so it can address a potentially immense chunk of the market from the low-end Microsoft market all the way to the high-end Unix market: Unix-like (or even better) characteristics at low x86 prices!

And this is where the people come in. VMware is no longer a "virtualization company" hiring "virtualization people". VMware is (to me) becoming the agent of change in a stagnant IT industry and, as such, it is becoming the catalyst for visionaries and smart people that find in VMware the proper environment (with no business and technical legacy) to exploit their visions for a better IT without compromises. What people are doing at VMware (or at least this is my perception) is very simple: they are taking the traditional IT stack, taking it apart, recomposing it leaving out the things that are not needed and injecting the things that are most needed (virtualization being an example). This is a very important and key point to understand and why the VMware potential is so huge. This is called the innovator's dilemma. If you are working for a company and an organization that is leading the market and is making tons of money out of a specific business model which leverages a traditional/legacy IT stack, you'll do very little to change the status-quo. This doesn't mean you won't be adding "new features" to your stack but certainly you won't do much to reinvent everything and, in so doing, question your future leadership in a changed landscape. I really like Henry Ford's quote "If I had asked people what they wanted, they would have said faster horses" (via vinternals.com, thanks Stu). To rephrase it, in the context of this discussion, if you are in the horses business you won't do much to engineer and promote cars. While I don't consider myself a visionary, this is one of the reasons I wanted to join VMware: I wanted to join a company which didn't have technology or business legacies so that we could just think about and create new things that are useful to organizations and end-users, without compromises.

So what does virtualization have to do with this? Virtualization is not enough to accomplish our plan, however it's the must-have foundation for it. And if I look at the Gartner magic quadrant, VMware is the only company that seems to be positioned for the next disruption in the industry. We will only know in 20 years though if I was right or wrong!

Massimo.

posted by Massimo | 4 Comments

Public Cloud Adoption Curve - is History Repeating?

This article was originally posted on the VMware vCloud corporate blog. I am re-posting here for the convenience of the readers of my personal blog.

As I mentioned in my previous post I started working on virtualization technologies years ago. It was around 2003 when I started talking, at public events, about what one could achieve using VMware ESX (which at that time was the only VMware offering for the enterprise market). I still remember the very first two questions I got asked in one of those events that year. The first one was "wow, does it really work?" Answer: "Yes, it does indeed". The second question I got asked was, "Can I virtualize SAP?" The answer in 2003 was a no-brainer and it was something like, "We don't want you to virtualize the SAP instance. We want you to virtualize the 20 plus infrastructure servers you have sitting around it that support that SAP instance because they are what cause you so much trouble".

For the next event, I decided that I should anticipate the "what is virtualization good for?" and "where do I start with it?" type of questions so I built the following slides to give the audience a rough idea of where (and why!) these technologies would fit.

For years I have pitched a typical datacenter deployment as a pyramid on the side where, on the left, we have many instances of dynamic, non-critical, non-resource intensive types of workloads. Test and development environments are a good example. As we move to the right, workloads start becoming less dynamic, more critical, and more resource intensive. The SAP instance above would be a good example of what sits on the other side of this spectrum. In the middle we have a broad mix of infrastructure, tier 2 and tier 3 types of workloads, each of which comes with various infrastructure requirements.

As you can tell from my graphics above, the virtualization adoption model I was suggesting was pretty straightforward: "start from the left, move to the right and stop where you like". This slide was built in 2004 and could still be used in 2010. I think this adoption model made tons of sense at that time for many specific reasons:

1) Organizations were losing control of the left part because of the many little workloads that were popping up every other day without any sort of governance (virtualization helped a lot with consolidation and containment);

2) Organizations were not dynamic enough on the left part because the deployment lead time for physical servers was too long (virtualization helped a lot with the concept of "your new server is 3 clicks of mouse away");

3) Organizations were happy to introduce new innovative technologies on the left part because it was less critical compared to the part on the right side of the pyramid.

In a way, this was a win-win. The advantages of this solution were an excellent fit for the characteristics of the dynamic workloads on the left side and the limitations of this solution (limited enterprise maturity with associated risks) weren't really an issue for those types of non-critical workloads. Well, you know what happened next. End-users started this "journey" and there are now many organizations that are running SAP virtualized.

That was the picture in 2003. How about now in 2010? As I started working more closely on public IaaS cloud aspects, I have heard many concerns and doubts that reminded me of those questions I was getting back in the early years of this century. Can I move my core business application out there in the cloud? How can I ensure that my own customers' data are protected? Well I am sorry to rain on the party but, honestly, I don't believe these will be the first workloads to move into the public IaaS cloud.

First, there is a technology argument. We are still talking, by and large, about early offerings in the public cloud space. Similar to what happened with ESX and with the overall virtualization ramp-up, we will see technical improvements in public cloud offerings that will make it easier to migrate critical workloads onto future stages of the IT infrastructure. This doesn't mean ESX wasn't initially an enterprise-grade product. In fact, I worked with a number of customers that were moving relatively important workloads on to that platform, but arguably vSphere is a better and more mature technology.

Other than that, we can't ignore another, probably more important, fact. Organizations will want to take the time to learn what the public cloud is and will gradually move workloads there. Most of them recognize the value of doing so in the same way that they recognized the value of VMware ESX 1.0 when they first saw it. This doesn't mean they jumped onto it overnight to migrate their core apps.

No matter how good the technology is (and while there is space for improvement, it is good indeed) it will take time. You may want to call it "fear of the unknown" or "risk management," but we need to accept it for what it is. You will probably see me using these slides again in 2010. I will just need to change the title to "The Public Cloud (likely) Adoption Curve".

Massimo.

posted by Massimo | 391 Comments

Don't Fear the Cloud (the Cloud is Good)

This article was originally posted on the VMware vCloud corporate blog. I am re-posting here for the convenience of the readers of my personal blog.

I have been working in IT for about 15 years now, nine of which I have spent working with customers to get the maximum out of VMware enterprise technologies in the x86 space. I have always said that virtualization has been a cornerstone in this “PC space”. Yes, some still call this platform a "PC," go figure. As part of this journey, I have heard many professionals ask, "is this the latest buzz-word or is there something substantial to the Cloud?” Yes, there is something substantial to it.

I really think that the word Cloud has gained a bad reputation among some IT people simply because it's been used (or I should say abused?) a lot. That's why, whenever I enter into such debates, I tend to move the discussion towards what I believe Cloud really means. Cloud may mean many different things to many of you but fundamentally the word Cloud resembles a number of very tangible aspects you deal with (or you would like to deal with) on a daily basis within your data centers. To name a few, the most relevant are:

  • Self provisioning of resources;
  • Pay per use;
  • Automation;
  • Independence of IT resources location.

There are obviously more but these are among the most important. Talking to people, I have the impression that most of them associate the word Cloud with the concept of being able to consume resources from outside the organization’s boundaries. Not necessarily wrong, but Cloud is much more than that. As a matter of fact, there are really good discussions within enterprises today about creating Private Clouds within the data center boundaries – the exact opposite of the typical Cloud "perception" (i.e. provisioning resources from the outside). There are many other things that define a Cloud (see the list above) and they go well beyond the "Independence of IT resources location".

One of the things VMware has been very active with is the definition of standards in the Cloud space through the vCloud APIs. These describe a standard way for end-users to consume compute resources that are provided by an external organization (a.k.a. service provider). Leveraging this concept, one of the things we are very obsessed with at VMware is the possibility to provide federation (through the standard vCloud APIs) between Private Clouds and Public Clouds, effectively empowering organizations with a single homogeneous view of distributed resources. Those resources can be in their own facility or in service providers’ facilities. Do you think this is just recent Cloud marketing hype? Have a look at the following picture:

At first it doesn’t really look shocking as it summarizes many of the concepts we have already digested in the last few years. I am referring in particular to the powerful concept of decoupling applications and workloads from the physical infrastructure (servers, network and storage). What is interesting though about this picture is the fact that it’s a slide from a deck I presented back in 2004 at an IT congress. Not only that, especially interesting is the comment in red at the bottom of it: “On-demand ready: you can buy it, rent it, share it (or a mix of this)”.

Isn’t that one of the many attributes (“Independence of IT resources location”) we are pitching today for Cloud computing? The point I am trying to make is that this is not hype. This is what virtualization enables you to achieve! It’s for real. I wasn’t trying to create hype back then. I was just working with customers to redesign their datacenters using virtualization technologies. We could easily see, six years ago, where this foundation would bring us from an architectural perspective.

If you are skeptical about the word Cloud please try to take a step forward and dive a little deeper into what the Cloud really is. You may very well find out that what we call Cloud is the collage of functionalities you have been dreaming about for the last 10 years. At that point you may find it a bit less of a hype… and a bit more of an end goal for any organization.

Don’t fear the Cloud. The Cloud is good.

Massimo.

posted by Massimo | 650 Comments

I am back (on the blogosphere)!

Yes I am back. After a few months of "electronic silence" (on this blog at least) here I am again. For those (few) of you that may have been wondering "where (on earth) is he?"..... well lots of things happened and I have been pretty busy. For one, I joined VMware after having spent more than 15 years at IBM and I feel like I have literally been hit by a train running at 200 km/h. I have hardly had the time to breathe in these last few months (let alone post on my blog). To give you a sense of what I have been doing... I can say I have seen more check-in totems than traffic lights in the last few months and my almost two-year-old (beautiful) daughter's favorite line is "dad is on a plane, dad is on a plane". That's perhaps why, when she sees me, she's like "mummy, who's he?".

So what do I do for VMware? I work in a global team of very talented people (me being an exception) as a vCloud Architect for Europe. I work with Service Providers, Hosters, and Outsourcers to help them build their Public Cloud service offerings. The funny thing is that I thought that having an EMEA scope meant doing 2 or 3 conference calls (in English) per week and that was about it. It turned out that, at VMware, having an EMEA scope really means the plane is your second first home. Ah, of course this is on top of 2 or 3 (overlapping) conference calls per day. Does it sound interesting? It is indeed, I wouldn't change it for anything (and yes I was of course exaggerating when I talked about my daughter, otherwise I wouldn't be doing all this).

By now you may start to see why I haven't been blogging too much lately. However it's not like I disappeared completely from the blogosphere as I tweet from time to time and I also started blogging on the VMware vCloud corporate blog as part of my new role. This is something I will continue to do and this is an example of how I am contributing there if you are interested.

So how is this blog (or site as a whole) going to evolve over time? Good question. First and foremost this will continue to be my voice and not the VMware voice, as I stated clearly in the About page. I will continue to express my opinions and if they happen to be pro-VMware I'd like you to think that I joined VMware because I say those pro-VMware things (in which I believe) rather than thinking that I say those things because I joined VMware (as an obligation to my employer). Things that I will discuss here will not necessarily reflect the opinions and the strategies of my employer even though some of the stuff I say may very well align. My contribution to the VMware vCloud corporate blog will be, on the other hand, more institutional and I will be wearing my "VMware hat" when talking there. This doesn't mean I will just be repeating a marketing story, but rather that I will be discussing things with a VMware perspective in mind.

Last but not least, one of the first consequences of this switch is that I am going to discontinue one of the most popular pages of my site, which is the Virtualization Software Comparison table (if I am counting well it has had around 50,000 hits since it went live). I am doing this for a couple of reasons:

  1. Being now a VMware employee, no matter how "independent" I will try to be, you will always have the feeling (or at least the doubt) that this is VMware biased. The only value of that table was the trust that you could put in it and that it was done and maintained by an independent entity.

  2. More importantly this market is maturing and shifting very quickly. So is the value of these solutions. We have got to the point where a yes/no table like this is not capturing the true essence of what's going on in the industry at the moment, which goes well beyond a set of features like those. I usually like to draw a parallel with the car industry where if I say that my car has 6 gears, a steering wheel, 4 wheels, 5 seats and a powerful engine... you can't really say whether I am talking about a Skoda or a BMW, can you? Similarly if I am saying that I have a phone that can take and make calls, can play music and can download applications... you can't really say whether I am talking about a 300€ Nokia or a 600€ iPhone, right?

This is pretty much it for now. The TV in front of me in the hotel room says it's 1:06AM and I have a wake up call scheduled later at 6:45AM. I am sure you don't mind if I take a short nap now. All in all I just wanted to say I am still alive and if you are still on this station, stay tuned.

Thanks for reading my blog. Massimo.

posted by Massimo | 7 Comments

From Scale Up vs Scale Out... to Scale Down

Those of you that have been following me on twitter and on my blog know that I have been very focused on studying and monitoring the latest trends regarding which hardware platforms virtualization users are using for their infrastructures. This includes multiple points of view such as simple sizing rules of thumb, potential reference architectures and scale up vs. scale out strategies. I'd like to spend the next few minutes talking about what's going on lately in this respect, specifically in light of the latest (and future) hardware improvements we have seen or that we will see in the next few months. I am doing this because I have a very weird feeling about what's going on. Bear with me.

When I started working with VMware software back in 2001, the only value proposition that we could imagine out of the thing was the so-called server consolidation: in essence the process of consolidating many virtual instances - aka partitions or guests - onto a smaller number of physical servers. To make a long story short, down the road we realized that the value proposition was way more than just server consolidation as a means to reduce the costs of operation. It soon became pretty evident that there were many more advantages, including things like easier high-availability for applications, easier Disaster Recovery scenarios, faster time-to-market for business applications, and many more. Server consolidation was, at that point, just one of the many value items we know today.

Right now my feeling is that the advantage of stuffing more and more OS instances on as few physical systems as possible is not even considered an advantage any more these days. To put it another way, it is still considered an advantage, but only to a certain extent. In fact, if consolidating more instances on fewer hardware pieces was still one of the strategic objectives of a virtualization process, what you would have seen was a progression in terms of the ratio # of OS instances / physical system. Something like this:

  • 4-Socket single-core x86-based server with n GB of memory could support 10 VMs

  • 4-Socket dual-core x86-based server with n*2 GB of memory could support 20 VMs

  • 4-Socket quad-core x86-based server with n*4 GB of memory could support 40 VMs

The numbers above are just examples, and are only used to outline the mathematical progression I was mentioning. The high level idea behind it is that, the more powerful the systems become, the more OS instances you could consolidate onto them. Once you have strategically chosen a given hardware platform (whose main characteristic is expressed in the # of CPUs it is capable of supporting) you will see higher consolidation ratios as the CPUs become more powerful (typically via doubling the number of cores from one generation to the other). Put into more mathematical language, the constant here should be the number of CPU sockets (4 in the examples above). The speed of the CPU is a function of Moore's law, so to speak. As a result, the number of VMs that can be supported is a function of the CPU speed. The memory size is also a function of the CPU speed and it needs to be configured accordingly to keep a balanced system with the proper CPU-to-Memory ratio.
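
To make that progression concrete, here is a minimal Python sketch. The function name is hypothetical and the `vms_per_core` figure (2.5) is simply what the example numbers above imply (10 VMs on 4 single-core sockets); it is an illustrative assumption, not a sizing rule.

```python
def vms_supported(sockets, cores_per_socket, vms_per_core=2.5):
    """VMs a host could support when the socket count is the constant.

    vms_per_core is an illustrative assumption derived from the example
    numbers in the text (10 VMs on 4 single-core sockets).
    """
    return int(sockets * cores_per_socket * vms_per_core)


# The socket count stays fixed at 4; cores per socket double each generation.
for cores in (1, 2, 4):
    print(f"4-Socket, {cores} core(s) per socket: {vms_supported(4, cores)} VMs")
```

In this model the number of sockets is the input you fix once, and the number of VMs is the output that grows with every CPU generation.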

That's what would happen (naturally) if server consolidation was a priority. However I have noticed that it doesn't seem to be what's actually happening in the industry. I can think of many such situations, but the most emblematic to me is a customer I have been working with very closely since 2001. We started deploying 16-Socket single-core servers, then they moved to 8-Socket dual-core servers, then to 4-Socket quad-core servers and are now in the process of migrating to 2-Socket Nehalem-based servers. In a way, what is happening is that customers are inverting the mathematical constants and variables compared to what would be natural (see above). This is the approach and mindset most customers are using these days to size their "brick":

  • To support 20 VMs I would need an 8-Socket single-core system with n GB of memory

  • To support 20 VMs I would need a 4-Socket dual-core system with n GB of memory

  • To support 20 VMs I would need a 2-Socket quad-core system with n GB of memory

Wow. This is neither Scale Up nor Scale Out. This is indeed Scale Down!

Again, while the numbers are not tremendously unrealistic, they are only used to demonstrate, at a very high level, the mathematical progression which maps the mindset. As you can see there is a trend in the industry right now that doesn't consider the number of VMs you can get on a system as a function of how fast and powerful the system is. It's quite the opposite. The speed of a system is determined as a function of the requirement to run a fixed number of VMs. Since the size of the memory is typically a function of the number of VMs, its configuration doesn't tend to vary drastically because the number of VMs tends to remain the same. By the way, 20 / 25 VMs seems to be the average number most customers are defaulting to on each physical host, based on what I have seen.
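
The inverted mindset can be sketched the same way. This is a minimal illustration under the same assumption as the example numbers in this post (roughly 2.5 VMs per core, which is what 20 VMs on 8 single-core sockets implies); the function name is hypothetical.

```python
import math


def sockets_needed(target_vms, cores_per_socket, vms_per_core=2.5):
    """Sockets required when the VM count is the constant.

    vms_per_core is an illustrative assumption taken from the example
    numbers in the text, not a measured value.
    """
    return math.ceil(target_vms / (cores_per_socket * vms_per_core))


# The VM count stays fixed at 20; the socket count shrinks as cores double.
for cores in (1, 2, 4):
    print(f"20 VMs, {cores} core(s) per socket: {sockets_needed(20, cores)} sockets")
```

Here the number of VMs is the fixed input and the size of the server is the output, which is exactly the Scale Down behavior described above.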

There are a few reasons why this is happening. One of the reasons is that most customers are not comfortable putting too many eggs into a single basket. They may be guessing that 20 / 25 partitions per host is a good trade-off between the disadvantage of the potential downtime of multiple partitions and the advantage of having fewer physical servers (compared to a non-virtualized environment). For example, having 5 partitions would diminish too much the value of the latter, and having 100 partitions would increase too much the potential risk of the former. The consensus today does seem to be 20 / 25 partitions.

Another reason why this is happening is that there is a common perception that the smaller the virtualization brick is, the cheaper it is (due to the commoditization process we are seeing in the low-end x86 market). I don't have a definitive position on this - as I think that it always depends. But there are a number of people in this industry that would claim that, while this may be a good approach for a small business that only has a few dozen partitions to deal with, it wouldn't work for an enterprise customer with thousands of partitions. The method would result in an improperly designed virtualized infrastructure due to the high number of physical low-end servers required.

The third - and last - reason I am mentioning here is a bit more tricky and opportunistic in my opinion. The x86 virtualization industry is largely driven by software vendors rather than hardware vendors. Software vendors in this space tend to prefer the usage of low-end commodity servers because, this way, they can provide the value at the software layer. There is no magic: the better the hardware is (in terms of scalability / resiliency / efficiency / etc.), the fewer infrastructure software features you need to make it an enterprise platform. On the other hand, if you use many low-end commodity x86 servers you can tie them together into a single gigantic (virtual) enterprise platform through the value of the software running on them. The latter is what software vendors really love to hear these days and that's what they are after.

If you are still following me and agree with the analysis to some extent, you'll realize that there are a number of implications caused by this trend.

One of the implications is that servers are now memory-bound. If you ask 10 virtualization architects in the x86 space they will all tell you that the limiting factor today in servers is the memory subsystem. Put another way, you are reaching the physical memory usage limit long before you manage to saturate the processors in a virtualized server. Have you ever wondered why that is the case? As users move backwards from 8-Socket servers to 4-Socket servers to 2-Socket servers the number of memory slots available per server gets reduced. That's how x86-based servers have been designed over the years: the more sockets the server has, the more memory slots are available. What is happening now is that customers tend to use much smaller servers because they can support the same number of partitions per physical host, but the memory requirements haven't changed. That's because the amount of memory needed is a function of the number of partitions running, and if that number of partitions is kept constant you will always need the same amount of memory.

That's the problem: you now have a lot fewer slots available to support the same amount of memory. While memory vendors have been able to squeeze more and more Gigabytes worth of circuitry in the same DIMMs, the fact is that this is not enough to create a balanced system given the speed of CPUs has improved at a faster pace than memory vendors have been able to shrink their parts to put more memory space into a single DIMM. The outcome? You either configure very dense - and expensive! - memory modules into those fewer slots in the low-end servers, or you configure reasonably cheap DIMMs into those slots. The first approach would send the price of that virtualization brick to the roof; the second approach would cause the system to be bottlenecked very soon by the memory subsystem, with the CPUs being used at a fraction of their potential. This is in fact what's happening, as it is not uncommon these days to see virtualized systems being used - from a CPU perspective - at about 30-40%, and memory being already under heavy pressure approaching the physical limit.
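
A rough way to see which subsystem runs out first is to compare the VM limit set by the CPUs with the VM limit set by the DIMM slots. The sketch below is purely illustrative: the function name, the slot count, the DIMM sizes, the 4 GB per VM and the 2.5 VMs per core are all hypothetical assumptions, not figures from any vendor.

```python
def binding_resource(sockets, cores_per_socket, dimm_slots, gb_per_dimm,
                     gb_per_vm=4, vms_per_core=2.5):
    """Compare the CPU-side and memory-side VM limits of one host.

    All default values are hypothetical and only serve the illustration.
    """
    cpu_limit = int(sockets * cores_per_socket * vms_per_core)
    mem_limit = (dimm_slots * gb_per_dimm) // gb_per_vm
    return {
        "cpu_limit": cpu_limit,
        "mem_limit": mem_limit,
        "bound_by": "memory" if mem_limit < cpu_limit else "cpu",
    }


# Hypothetical 2-Socket quad-core host with 8 DIMM slots.
print(binding_resource(2, 4, dimm_slots=8, gb_per_dimm=4))   # cheap DIMMs
print(binding_resource(2, 4, dimm_slots=8, gb_per_dimm=16))  # dense DIMMs
```

With the cheap DIMMs the memory limit sits well below the CPU limit, so the processors stay partly idle; only the dense (and expensive) DIMMs move the bottleneck back to the CPUs.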

There is another aspect to consider which is even more "interesting." The high density memory cost seems, frankly, to be the excuse for being stuck in such a situation. After all, it may even be convenient, in some cases, to configure more expensive memory parts to double the number of partitions and put those wasted CPU cycles to good use. However, the real problem seems to be that most customers are mentally partitions-bound: "No matter the technology and its associated costs, I don't want to get beyond the 20 / 25 partitions per physical host." If that is really the case - it's just my feeling so far - in the near future we won't need cheaper high density memory DIMMs or more memory slots in low-end servers. Most likely what will happen in the near future is that these customers will either start using 1-Socket servers - assuming these have the same memory support characteristics as the 2-Socket servers - or more simply they will start populating a single CPU package in 2-Socket-capable servers. At this pace we will be running single socket Atom servers in about 24 to 36 months: Intel and AMD are warned!

This also will have further (and funny) implications. For example, the structure of all the industry benchmarks out there may become irrelevant in the future (assuming you consider it relevant today). All these benchmarks are designed to load the CPUs at 100% (configuring all other subsystems to cope with that) and come out with a scalability number. In the server virtualization context, this number is typically expressed as the number of VMs a given n-Socket server can support. In the scenario I am picturing, this is completely useless. First of all, because of what we have said, memory is becoming the bottleneck in most situations, so these benchmarks should - at least - assume the 100% memory load as the limiting factor of a given server configuration. What's the point of benchmarking a server running at 100% of CPU utilization for which you had to configure 1TB of memory and 3,000+ disk spindles to achieve that CPU load, when customers are using 128GB of memory and a few dozen spindles at best?

To make things worse, the number of VMs is not even a function of the speed of the server any more - as we argued - but rather it's becoming a constant in the equation. In the currently available benchmarks, in fact, the constant is the number of Sockets and their 100% load. To build a benchmark that could map exactly what's happening in the industry and could be of use for the community, one would need to design a performance test that would give the number and type of CPUs and memory DIMMs needed to support a constant number of partitions (20 or 25). The lower the resources (and their price), the better the result.
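
As a thought experiment, the scoring rule of such a benchmark could be sketched as follows. Everything here is hypothetical: the function name, the configuration names, their VM capacities and their prices are invented purely to illustrate the "fixed partitions, lowest cost wins" idea and do not describe any real benchmark or product.

```python
def best_brick(candidates, target_vms=20):
    """Return the cheapest configuration able to run target_vms partitions.

    candidates is a list of dicts with hypothetical 'name', 'max_vms'
    and 'price' keys; returns None when nothing qualifies.
    """
    eligible = [c for c in candidates if c["max_vms"] >= target_vms]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["price"])


# Invented configurations, for illustration only.
configs = [
    {"name": "4-Socket, 128GB", "max_vms": 40, "price": 30000},
    {"name": "2-Socket, 64GB", "max_vms": 20, "price": 12000},
    {"name": "1-Socket, 32GB", "max_vms": 10, "price": 6000},
]
print(best_brick(configs)["name"])
```

The point of the design is that the partition count is an input to the test, and the ranking is driven entirely by how little hardware (and money) it takes to satisfy it.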

While there is nothing wrong with all this, at the same time we need to acknowledge it is the complete negation of the initial Server Consolidation value item we started with back in 2001. The problem is that users may be leaving lots of money on the table because of inefficiencies due to underutilized resources and/or the management of many small Intel-based servers (think about the costs associated with power consumption or I/O cabling). This is far from being an attempt to convince you that Scale Up is a better approach. I am ok with a Scale Out approach, too, as I can see the value of it. However, I see this Scale Down approach as a trend that won't allow users to exploit the full potential of what they could achieve using the technologies properly. Perhaps I have the wrong perception of what's going on; or perhaps I have the right perception and I am wrong in questioning it. Either way, I'd be curious to hear what you think, if you have a spare minute.

Massimo.

posted by Massimo | 160 Comments

XenServer: Why? (Updated)

There have been lots of discussions lately about what's happening around Citrix XenServer. Perhaps too many. For what it is worth, I was one of the people discussing this on the net (Twitter, Blogs etc) with some other folks. I originally drafted a blog post when Citrix bought XenSource but it never made it (officially because I was busy, unofficially because I couldn't figure out "why").

I think that what is happening is pretty clear at this point. The market landscape is being consolidated with Oracle acquiring VirtualIron as well as the "Sun Xen thing" within the overall grand plan of the acquisition of (what remains of) Sun. All these solutions have hardly, in the past few years, managed to make a difference in the industry and their names were floating around more with the hope that VMware could feel more pressure and competition, and hence lower its prices. In the meantime, VMware increased its prices, which speaks for itself.

This is leaving (apparently) the x86 virtualization market with 3 relevant viable alternatives: VMware, Microsoft and Citrix. I have always said this is going to be a two-horse race and I still stand behind this statement. The first horse is VMware and the second horse is what I call Microtrix (tm). There was a nice Twitter discussion a few days ago on why Citrix bought XenSource, the future of it etc. This was my tweet in the discussion which, in a way, summarizes my thinking:

My XenServer in 140 chars: a non conventional weapon ordered by Microsoft for Citrix to use in the "meanwhile" (meanwhile Hyper-V matures)

While I have always said I am a geek, you can't afford to not look at all this from a business perspective. So the discussion is not so much "features related" but it is rather more like "how a vendor is going to capitalize on something". Because, at the end of the day, all vendors are vendors for a single reason: $$$.

And this is what never worked out for Citrix in my opinion. This is what I miss from a business perspective. Don't get me wrong, I am not saying "XenServer is not a good product!". I am rather asking ... "why XenServer?".

So Citrix bought XenSource more than a couple of years ago (off the top of my head - I am on a train and not connected) and the idea was that they would take on VMware to win a chunk of the promising business VMware was leading. 500M$, at that time, was a big investment but something you could afford to spend if your grand plan was to win a slice of that lucrative market. Immediately the whole thing sounded a bit weird for at least a couple of reasons:

  • That was not Citrix core business: they essentially deal (very well) with end-user application virtualization at multiple levels. They are not so much into the data center if not for centralizing something that is otherwise distributed on the end-user desktops (oversimplification!).

  • Microsoft was to come out shortly with their very first implementation of Hyper-V and it was clear that XenServer was going to compete with it.

I was struggling to fit this Citrix strategy into the bigger picture, especially because of the strong Microsoft and Citrix relationship - some refer to Citrix as a fully independent Microsoft subsidiary, go figure. So while they were "in bed" at the Corporate level they would have forced their respective sales fields and channels to compete at the local level. And we are not talking about a mere add-on tool where there is slight competition. This would have been a fierce battle for a key layer (and a tremendous point of control) in the data room. Not peanuts folks!

Well that was it anyway. So we lived in this limbo for quite a while without bringing up this concern again until Citrix broke the news just before VMworld Europe 2009. Just prior to the event they made the announcement that XenServer Enterprise (I mean the high-end version with all the fireworks) was going to be given away for free. Yeah you got it right: the technology they bought from XenSource for 500M$ was to be given away for free. And you may rightly wonder "why?", especially if you consider that the Citrix business track record, as far as I can say, is not that of a charity, nor can you say - more seriously - that Citrix is the kind of company that gives away licenses for free because they make money on professional services and support. Not at all: they have always been in the business of selling you a great piece of software (Metaframe / XenApp being an example) for a great amount of money and profits. Not only that, they were now putting lots of R&D efforts into a product that was going to generate 0 revenue and hence 0 profits. This can't be Citrix, I wondered! My assumption of "lots of R&D efforts" comes from what they used to tell customers asking "what is the value of Citrix XenServer as opposed to the freely available open source Xen package?". Their position was, in fact, that they were adding to the base open source code some additional functionalities and enterprise-grade testing of all components. That's what customers were paying for.

Immediately afterwards, they made a new announcement where they stated they would be developing add-on management products for XenServer (called Citrix Essentials) to extend the basic capability of the XenServer technology. This would have put them on a track that made more sense, if it were not for another part of the same announcement: they stated that these add-ons would be available to extend the functionalities of both XenServer and Microsoft Hyper-V / Virtual Machine Manager. And this, again, made me wonder: they now have the possibility of making money either on the free product they develop and maintain or on the free product that Microsoft develops and maintains. So why bother with developing and maintaining your own free stuff if you can off-load the burden to your pals?

Citrix didn't take too long to answer that question (with facts). The latest news is that Citrix announced, a few days ago, that they are going to donate to the open source community not only the Xen hypervisor itself (which is already open source) but the whole proprietary stack that XenSource and then Citrix have been developing around it (and for which Citrix paid 500M$ I would add...). At least this makes more sense for them as, if we go back to the previous discussion, XenServer is now no longer on their R&D budget. However, it doesn't answer why they spent 500M$, in the first place, to get to this point after just a couple of years.

Another weird thing I heard lately is that, in the latest discussions on the web, Citrix has also provided an interesting success metric for XenServer, which is the amount of profit loss that XenServer caused to VMware. Now, every single vendor is allowed to spend their own money as they wish (as long as the investors are happy) but end-users are allowed to wonder why they have invested 500M$ in a company just to hurt the (current) leader in that space. I would say that you don't enter a market, as a newcomer, spending a lot of money to buy something and turn it into freely available open source software in a couple of years... with the only intent to make the leader lose money. However, you may want to do so if you are in a dominant position and you feel the pressure from the leader of a segment where you are still late-to-market. Are you guessing?

To recap, this is what I have observed in the last few years:

  • VMware has grown in relevance in this industry.

  • Microsoft feels they may be losing an important point of control in the data center (to VMware) but are not ready to counter with Hyper-V (R1).

  • Citrix buys XenSource (one of VMware's most important potential competitors) for 500M$.

  • Citrix engages VMware (and apparently Microsoft) in a fight to win the hypervisor battle.

  • Citrix gives away XenServer for free in an attempt to hurt VMware even more.

  • Citrix announces the Citrix Essentials package that would extend hypervisor functionalities for both Citrix XenServer as well as Microsoft Hyper-V.

  • Microsoft announces the availability of Hyper-V R2 (which fills many gaps they had with the VMware offering).

  • Citrix is to donate the XenServer code to the open source community.

I am not sure about you, but I see something here between the lines.

The latest Citrix take on this is that they didn't waste their money as XenServer is a key component of their XenDesktop strategy where they use XenServer as the hypervisor to serve the back-end infrastructure and they are using the Xen kernel to build the client hypervisor platform for off-line VDI scenarios and the like. I don't want to dispute this. There is nothing wrong with this strategy and I think that Citrix also has a technology lead vs. VMware when it comes to application virtualization and VDI (just like VMware has a technology lead for the back-end infrastructure). My mere argument is that, at this very point, they could have done exactly the same thing without spending the 500M$ in the first place back in 2007. For example:

  • They could have added support for their XenDesktop to a XenSource backend (similarly to how they provide support for VMware and Microsoft hypervisors today).

  • They could have developed Citrix Essentials for both XenSource and Hyper-V if they really thought it made sense for them to do so.

  • They could have taken the already open sourced Xen hypervisor to create their own client hypervisor for off-line VDI.

I can't think of a single thing that they couldn't have done leveraging the Xen open source project or leveraging a partnership with XenSource and yet keeping 500M$ in their wallet... I have too much respect for Mark Templeton not to insinuate that there was a larger plan in this XenSource acquisition.

Just to make sure we are all on the same page, this doesn't mean Xen(Server) is dead by any means. It will continue to live and grow in the open source community and it will evolve over time. For example it will be a very compelling building block for those (big) service providers trying to implement cloud services. If these players can afford to build everything in house (as Amazon did) and if they don't want to deal with the commercial tricks and license limitations of a more "commercial" package, such as VMware vSphere, then Xen(Server) is a great fit. These customers, in fact, may not see vSphere as a good fit since, while the ESXi hypervisor is free, it does require Virtual Center to fully exploit its basic functionalities. Nothing wrong with that, but these service providers may want to leverage something more flexible and build their in-house developed stuff on it without the stringent licensing requirements posed by the vendors.

Similarly, typical commercial customers may appreciate a more off-the-shelf / vendor owned product such as VMware ESX/vCenter/View or Microsoft Hyper-V/VMM/Citrix Essentials/XenDesktop. That's the two-horse race I was talking about. The VMware vs Microtrix (tm) positioning in the industry is beyond the scope of this post.

As an example, I am finding it hard to understand why an SMB customer, with some 10 or 20 Windows servers to virtualize, should use XenServer as opposed to Microsoft Hyper-V with Virtual Machine Manager. While the Microsoft solution is not entirely free it would cost "negligible peanuts" and with the new R2 release it will pretty much match what the free XenServer offering can provide (High Availability*, Live Migration on top of all), especially in a pure Windows context as is often the case in SMB accounts. By the way if some Linux support is required Microsoft is doing a great job at that too with Hyper-V and if you want even more functionalities the Citrix Essentials package will do!

Back to my tweet above, the warning I want to give you is this: watch out because weapons are used and then decommissioned when they become obsolete (from a business perspective). Perhaps I am wrong. Only time will tell. In the meantime, mark my words (I can't do worse than what Gartner/IDC did years ago when they speculated Itanium would rule the world by 2008 anyway).

I have tried to interpret what I have seen in the past without any bias (I hope). At least I tried to stick to the facts. Perhaps my name will show up on some black-lists after this post; at least I hope it will give end-users an additional point of view to think about before committing to a strategic hypervisor decision.

Massimo.

P.S. What's in this post only reflects my personal opinions and not those of my employer.

* Roger Klorese from Citrix pointed out to me that High Availability is not included in the free XenServer offering being open sourced; it is instead included in the fee-based Citrix Essentials package. Thanks Roger for the heads up.

posted by Massimo | 60 Comments

Ad Hoc Designed Infrastructures: do they still make sense?

The topic in this article is something that I have been thinking about for a while. It's about the methodology, the patterns, the habits - if you will - associated with how new IT infrastructures are being assessed, designed, sold and - in the final analysis - acquired by end-users for their datacenters. While it might not make a lot of sense to you initially, please bear with me as I go through my "internal mental brainstorming." It seems long but, as usual, it's full of pictures. 

The Italian market is pretty interesting: the vast majority of the customers are (very) small organizations distributed across the entire territory. We also have a few medium-sized businesses (although not the core economy of the country), and then we have big organizations (a mix of public customers and privately held corporations). To turn this into IT terms, the vast majority of Italian customers' datacenters are very small - in the range of 5 to 15 x86-based servers. We then have customers - such as medium-sized businesses, big banks and big public organizations - that have hundreds to a few thousand x86-based servers. Having spent most of my IT career focusing on the optimization of the x86 infrastructures, I had to deal with all these scenarios above so I think I have a pretty complete view of the spectrum. This article is going to discuss specifically a couple of points that I had to deal with during the process:

  1. The assessment of the legacy infrastructures from a capacity and characteristics perspective.

  2. The design of the target architecture of the virtualized infrastructures.

These are two different aspects, and they could deserve a dedicated discussion, but I am trying to cover both in this article anyway. 

 

Assessing and Designing Optimized x86 Infrastructures for the Small IT Shops

At the very beginning of the virtualization era (around 2002-2003), I was using a pretty standard methodology that would require the analysis of the current datacenter in terms of number of physical x86 servers deployed, their hardware configuration and their usage (average at least, historical at best). You would then take the data and work through it to get to a specific hardware sizing that was capable of consolidating those physical servers onto a lower number of physical boxes. This worked pretty well until a few months ago when I sat down with my good fellow Maurizio Benassi and we drafted a brand new methodology for sizing. It all started with a joke:

"The majority of customers could be consolidated on either one single mainframe (which never breaks), two Unix boxes (which very rarely break) or three x86 servers (which happen to break from time to time)."

A further analysis of the patterns resulted in an updated joke (err: statement) regarding the new pragmatic methodology:

"One x86 server could sustain the whole workload, the second x86 server is configured for high availability, the third server is used to sleep well at night."

Fun aside I guess you are starting to see a pattern here. Think about that for a moment: the fact is that the smallest x86 architecture you can configure today is capable of supporting the workload that the vast majority of customers have in place. And I am using the notion x86 architecture here on purpose since you never - ever - configure a single x86 box for any given datacenter - no matter what the workload is. What happened in the last few months is that the majority of the virtualization requests I have seen coming in could be served efficiently with a standard configuration which comprises just a couple of Nehalem-based servers tied together with some sort of shared storage. Why would you bother assessing a common pattern and reinventing the wheel (er: the architecture) every time? More on this later.
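The updated joke can be read as a tiny sizing rule: enough hosts to carry the whole workload, plus one or more spares. The sketch below is only an illustration of that reading. The function name, the "VMs per host" figure and the example numbers are assumptions made for this example, not part of the methodology described above.

```python
import math


def hosts_needed(total_vms: int, vms_per_host: int, spare_hosts: int = 1) -> int:
    """Return the host count for a pooled workload plus spare capacity.

    vms_per_host is an assumed rule-of-thumb figure, not a measured value.
    spare_hosts is how many hosts may fail while the rest still carry the load.
    """
    if total_vms < 0 or vms_per_host <= 0 or spare_hosts < 0:
        raise ValueError("invalid sizing input")
    # Hosts required just to carry the workload (never fewer than one).
    base = max(1, math.ceil(total_vms / vms_per_host))
    return base + spare_hosts


# A hypothetical small shop: 12 workloads, an assumed 20 VMs per host.
small_shop = hosts_needed(total_vms=12, vms_per_host=20)
# The "sleep well at night" variant adds a second spare host.
sleep_well = hosts_needed(total_vms=12, vms_per_host=20, spare_hosts=2)
```

With these made-up inputs one host carries the load, so the two calls land on the two-server and three-server configurations from the joke.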

 

Designing Optimized x86 Infrastructures for the Medium and Big IT Shops

This is a completely different realm. However, assessing, designing, selling and acquiring such infrastructures have their own peculiarities, which might contrast with the standard historical methodology I have mentioned above (deep level analysis of the installed base to produce a to-be new infrastructure). I have already discussed in the past a more pragmatic approach to sizing (virtual) infrastructures that I ended up using in the last few months. I still stand behind the controversial comments in that article regarding the opportunity to go through a detailed analysis of the entire environment vs. taking a shortcut like the one I have described in the post. It's interesting also to notice that, similarly to what happens for the small shops, the layout of the to-be virtualized infrastructure doesn't dramatically change across the different situations. Sure, the size might change dramatically: where most if not all small shops could be doing fine with two servers, these enterprise customers might require a different number of physical servers (along with a different amount of storage and network connections). However, the high-level architecture isn't so drastically different among all the configurations I have been working on. I am referring to common patterns we can learn from such as shared storage configurations, cluster(s) of virtualized servers and common network configurations.

By the way, this isn't supposed to be shocking and the pattern could be easily explained. In the old days - when physical deployments were the norm - you had to take into account each application silo, and determine the best infrastructure configuration for each. That's how you ended up with complex and heterogeneous scenarios where some applications could be deployed on physical standalone servers with no redundancy, other applications had to be deployed on physical standalone servers with some degree of redundancy, others yet had to be deployed on dedicated physical clusters - forget active / active heterogeneous application clusters - for the most demanding high availability requirements. Virtualization, at least in the context of the 100% virtualized datacenter if I can steal Chad Sakac's mantra, is changing all this complexity. First, applications are no longer bound to specific physical servers so you can start thinking in "MIPS" terms for the whole infrastructure rather than sizing each vertical silo on its own. This is when my rule of thumb comes in handy as you will always - most likely - end up in the average (the more servers you have the better it works).
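To make the "whole infrastructure rather than each vertical silo" point concrete, here is a minimal sketch comparing the two ways of counting hosts. The load figures are hypothetical, expressed as a percentage of one host's capacity, and redundancy is deliberately left out to keep the comparison simple.

```python
import math


def hosts_per_silo(silo_loads: list, host_capacity: int) -> int:
    """Size every application silo on its own and add the hosts up."""
    return sum(math.ceil(load / host_capacity) for load in silo_loads if load > 0)


def hosts_pooled(silo_loads: list, host_capacity: int) -> int:
    """Size the aggregated workload as one pool of capacity."""
    return math.ceil(sum(silo_loads) / host_capacity)


# Hypothetical silos, each using a fraction (in %) of one host.
loads = [20, 30, 10, 60, 40]
silo_total = hosts_per_silo(loads, host_capacity=100)   # one box per silo
pooled_total = hosts_pooled(loads, host_capacity=100)   # 160% of one host
```

In this made-up case the silo approach needs five boxes while the pooled approach needs two, which is the rounding waste that disappears once applications are no longer bound to specific physical servers.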

Another side effect of virtualization is that it has raised the bar of SLAs and you can tune your service levels on the fly without having to re-work your entire hardware infrastructure underneath. A good example is the possibility of moving your workload from SATA storage to Fibre Channel storage on-line (or nearly on-line) if you need it, or creating your application high availability policies at run-time: in a VMware infrastructure, for example, this might be No-HighAvailability, HighAvailability or even FaultTolerance. At the end of the day, designing an enterprise infrastructure boils down to sizing the aggregated workload (where aggregated is the key word here) and providing the right set of infrastructure characteristics and attributes that an organization might require (with the flexibility to apply them to selected workloads only at workload deployment time).

 

Do the Functional Requirements Matter During the Design Phase?

Simply put, IT is comprised of two major building blocks: Functional Requirements and Non-Functional Requirements. This is how Wikipedia defines them:

Functional Requirements: "A functional requirement defines a function of a software system or its component. A function is described as a set of inputs, the behavior, and outputs (see also software)"

Non Functional Requirement: "A non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. This should be contrasted with functional requirements that define specific behavior or functions".

So the question I have been thinking about for the last few years is simple: in a virtualization context, do I really need - during a customer engagement - to go through an in-depth analysis of the applications currently being deployed or soon to be deployed? In addition, when defining the new virtualized infrastructure to support those applications, do I need to analyze all of them one-by-one (from a Non Functional Requirement perspective) or can I treat them as a whole? You can infer the answer from the following two slides which are included in a set of charts I created back in 2007.

The yellow line "No Fly Zone" Buffer pretty much captures the concept I am trying to articulate here: the application realm and the infrastructure realm don't need to be strictly correlated. The infrastructure underneath needs to be designed and architected to match the current and projected total workload of the functional requirements. In addition to that it needs to be designed to match the customer's policies around the required Non-Functional Requirements. Neither of these two items requires an in-depth analysis and assessment of the various application silos currently deployed in a non-virtualized datacenter.

 

Does the Public Cloud Concept Bother With Functional Requirements After All?

You have heard the buzz lately about internal and external Cloud, haven't you? And I am sure you heard the concept of Private (aka Internal) and Public (aka External) Clouds. The idea is that you can have a given workload that you can choose to execute either internally on your infrastructure or externally on a third-party infrastructure (typically that of a service provider). This should happen transparently.

It is obvious at this point that the Public Clouds out there have not been designed upfront with your own applications in mind, nor can they be. That is obviously impossible. First, they are shared infrastructures so they would have to be ad hoc designed against more than a single customer (impossible). Plus they are ready to use so they need to be in place before the provider could even think about assessing your internal infrastructure - assuming it makes sense, but clearly it doesn't as I said above - to be able to support it in its Public Cloud. All security concerns about running applications in a Public Cloud aside for a moment, let's agree that you can effectively run your application either internally or externally. And if that is possible, why would you need to purpose design an ad hoc internal infrastructure based on an assessment and in-depth analysis of the legacy, if the public infrastructure allows you to do that without going through that pain? That's simply because the Public Cloud infrastructures are designed against standard well-known successful patterns that have been used to design internal virtualized infrastructures for years.

This doesn't mean all Public Clouds are equal - they might vary greatly in terms of the characteristics they offer (Non-Functional Requirements). You might find Public Clouds that are optimized for costs, some others might be optimized for high availability, and others still might be optimized for Disaster Recovery scenarios. This is very similar, in concept, to how you would want your own private datacenter to behave: are HA and DR important to you (for all or just a selection of applications)? Is scalability important to you? Is data protection important to you? And so on. Again, this is somewhat unrelated to the fact you use IIS or Apache, Lotus Domino or Microsoft Exchange (you name your favorite application here).

The problem we have today is that, while we define Public and Private Clouds as being very similar from a "plumbing" perspective, the way they are sold/bought by vendors/customers is too different. We tend to rent a service with some characteristic on the Public Cloud, whereas most customers still buy dispersed technology parts to build a Private Cloud.

Sure there are big differences in the sense that while you "buy" a Private Cloud, you actually "rent" a Public Cloud (well, a part of it). Similarly a Private Cloud is dedicated whereas a Public Cloud is shared. Last but not least the management of a Private Cloud is on you whereas the management burden of the Public Cloud is on the service provider. However, if you look at the plumbing (the way servers, networks and storage are assembled and tied together with a hypervisor) the differences are not so drastic. What if the industry started hiding all the plumbing details of Private Clouds and started selling them like Public Clouds are sold? In a scenario like this customers wouldn't buy various pieces of technology to assemble together; rather they'd buy and then manage a certain capacity with a certain level of Non-Functional Requirements (as opposed to renting a part of a Public Cloud and letting the provider manage it). What we have seen so far is hardware vendors (aka Private Cloud vendors) adding Public Cloud service offerings. I wouldn't be surprised to see service providers of Public Clouds turning into Private Cloud vendors as well, leveraging their know-how.

 

It's All About the Metadata!

As I said, I have been thinking about this concept of simplifying the way virtualized x86 infrastructures are proposed by IT vendors and, in turn, acquired by the end-users. I knew there was a single word to define all this but I was struggling to find it until I read this very interesting post from vinternals. Metadata: that's the word I was looking for. Thanks, Stu! In fact, this fits pretty nicely with the VMware mantra of vApps if you think about this for a moment. Those of you that have been working on the matter have probably seen this chart many times.

The idea is that, through the OVF standard, a vApp (basically a collection of a number of virtual machines that can provide a service to the end user) publishes its Non-Functional Requirements to be satisfied. As Stu points out, while the vApp can publish its requirements, there is no structured way - as of today - for the infrastructure underneath to publish what it is capable of providing. However, if you have noticed, I am trying to push this concept a little bit further: not only is infrastructure metadata for Non-Functional Requirements a must to create the binary match between what the applications require and what the infrastructure is capable of providing, but it could also be used to revolutionize, as I said, how the new infrastructures (comprised of hardware, storage and networking) are designed, architected, built and sold/acquired. This in turn means a shorter and easier sales cycle for vendors and proven, reliable, fully supported all-in-one infrastructures for customers.
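The "binary match" between what an application requires and what an infrastructure provides could look something like the sketch below. Everything in it is hypothetical: the field names and the matching rules are invented for illustration, they are not taken from the OVF standard or from any VMware product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Requirements:
    """Non-Functional Requirements a vApp might publish (hypothetical fields)."""
    high_availability: bool = False
    fault_tolerance: bool = False
    max_rpo_minutes: Optional[float] = None  # None means no DR requirement


@dataclass(frozen=True)
class Capabilities:
    """What an infrastructure might publish about itself (hypothetical fields)."""
    high_availability: bool = False
    fault_tolerance: bool = False
    rpo_minutes: Optional[float] = None  # None means no replication offered


def unmet(req: Requirements, cap: Capabilities) -> list:
    """Return the names of the requirements the infrastructure cannot satisfy."""
    gaps = []
    if req.high_availability and not cap.high_availability:
        gaps.append("high_availability")
    if req.fault_tolerance and not cap.fault_tolerance:
        gaps.append("fault_tolerance")
    if req.max_rpo_minutes is not None:
        # The offered RPO must exist and be at least as tight as requested.
        if cap.rpo_minutes is None or cap.rpo_minutes > req.max_rpo_minutes:
            gaps.append("max_rpo_minutes")
    return gaps


# A made-up vApp that wants HA and at most 15 minutes of data loss,
# checked against a made-up site that offers HA and a 60-minute RPO.
vapp = Requirements(high_availability=True, max_rpo_minutes=15)
site = Capabilities(high_availability=True, rpo_minutes=60)
gaps = unmet(vapp, site)
```

An empty list means the infrastructure is a match; here the only gap is the RPO, which is exactly the kind of answer that cannot be produced today because the infrastructure side has nothing structured to publish.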

 

 

Reference Architectures: Examples

In retrospect, this is exactly what I was trying to achieve (without using the terminology and the notion I am using in this article) when I started to talk about virtualized reference architectures during customers' and partners' events in the last few months. I have used a fairly simple approach which might be the basis for a more sophisticated speculative sizing algorithm. First of all, I made a few assumptions in terms of sizing based on the rules of thumb I have published in the past (and adjusted to map onto the new technology).

The above step covers the "sizing" part but it doesn't really cover the characteristic of the configuration (i.e. what we now call Metadata in the context of this article). I then started to draft a few common scenarios (or reference architectures if you will) that I have seen being commonly and successfully used by many customers. Actual numbers and other assumptions we have used are not important in this context. I am just showing you the framework I have used and I am sure those numbers and overall assumptions might need more work to capture better patterns.

The following is the first example that I presented at a joint IBM-LSI-Intel-VMware event last spring:

This is obviously a very simplistic approach. In addition, it would be laughable (I agree) to call these two brief comments a List of Non-Functional Requirements. Although the next few examples are a bit better, by no means is this a comprehensive implementation of the potential shift in the industry that I am discussing.

The following chart illustrates another example which is a superset of the above configuration where we have added the backup solution.

 

The following is another example which uses the BladeCenter S as a foundation. Note: don't pay too much attention to the number of VMs a configuration like this can support compared to the others. We have used HS12 blades which are single socket blades that don't use the brand new Intel Xeon 5500 (Nehalem) CPUs so the #VMs/Core is a bit lower. Again these are just examples.

 

The chart below is an example of an infrastructure capable of supporting about 72 VMs and with a "DR counterpart" to be installed at a remote site. In this example, we didn't use the native storage mirroring capabilities and we opted for a cheaper software replication alternative. Notice the RPO (Recovery Point Objective) is greater than 0 since software based replications like this do not guarantee that the two storage systems are completely in sync at any point in time. This is a typical Non-Functional Requirement discussion and a design point. This should be one of the first things that Metadata should publish as a characteristic of the underlying infrastructure. If you want you can use more sophisticated and native replication technologies as I discussed in this post.

One interesting thing to notice is that the first configurations are comprised of the smallest hardware configurations you can buy today in the market. That's true for servers as well as for the storage components. Yet the workload they can sustain with this minimal configuration (expressed in estimated number of VMs) exceeds the total amount of workload with which most of the SMB customers need to deal. This underlines again that an in-depth analysis to determine the size of the target environment is, in most cases, not even required.

Conclusions

In this article, I questioned the value of two specific practices. First, assessing legacy infrastructures is becoming less and less useful: we have so much power available these days that for most customers in the SMB space the smallest thing vendors could design might be a bazooka to shoot a fly. For Enterprise customers, most of the time a rule-of-thumb approach (perhaps complemented with deferred purchases based on actual needs) seems to be a good compromise between quality of the output and the effort required to get to the output.

Second, I have questioned the value of designing ad hoc infrastructures: this is true for both SMB and Enterprise shops as we have enough experience in the industry at this point to start pushing reference architectures applying best practices we have learned in the last 10 years without having to reinvent the wheel (or the architecture if you will) every time.

I know this is a bit of a stretch and, in fact, it's a sort of provocative article. However, while we are clearly not there today, my guess is that as we move toward the 100% virtualized datacenters, we might start to talk about selling and buying not just in terms of discrete components and technologies that can be used to create ad-hoc infrastructures, but rather in terms of black boxes that have a total aggregated throughput associated with them, which could be expressed in "number of average VMs" or in any other metric that you can think of.

Additionally the black box would carry a label with a list of capabilities, or metadata, that describe the characteristics of the Non-Functional Requirements associated with that specific unit. A vendor might have more units in the catalog with different capacity and different levels of Non-Functional Requirements. The whole idea is to try to simplify the way these solutions are designed, architected and sold by people in the field on one side, and the way they are purchased by the end-users on the other. And with virtualization, which decouples functional and Non-Functional Requirements, we might see the light at the end of the tunnel this time.
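As a thought experiment, buying from such a catalog could be reduced to a lookup like the one sketched here. The unit names, capacities and capability labels are all invented for illustration; the only idea borrowed from the text is that each unit carries a capacity in "average VMs" plus a label of capabilities.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    """A catalog entry: a black box with a capacity and a capability label."""
    name: str
    average_vms: int          # aggregated throughput, in "average VMs"
    capabilities: frozenset   # hypothetical labels such as "HA" or "DR"


def pick_unit(catalog: list, needed_vms: int, needed_caps: set) -> Optional[Unit]:
    """Return the smallest unit covering both capacity and capabilities."""
    fitting = [
        unit for unit in catalog
        if unit.average_vms >= needed_vms and needed_caps <= unit.capabilities
    ]
    # None is returned when nothing in the catalog fits.
    return min(fitting, key=lambda unit: unit.average_vms, default=None)


# An invented three-entry catalog.
catalog = [
    Unit("small", 30, frozenset({"HA"})),
    Unit("medium", 72, frozenset({"HA", "DR"})),
    Unit("large", 150, frozenset({"HA", "DR", "backup"})),
]
choice = pick_unit(catalog, needed_vms=25, needed_caps={"HA", "DR"})
no_fit = pick_unit(catalog, needed_vms=200, needed_caps={"HA"})
```

Note that the customer asking for 25 VMs ends up on the "medium" unit because of the DR label, not because of capacity: the Non-Functional Requirements drive the choice as much as the size does.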

Massimo.

posted by Massimo | 64 Comments

The (Potential) Value of Blogging for Your Career

Last night I posted a new article about the SpringSource/VMware story and the potential implications for the industry that this will have. After slightly more than 24 hours I am looking at the statistics and they say I am just south of 1000 views, which I think is amazing - for a casual blogger like myself at least. These days I have also come across a few comments on Twitter about a presentation that Jason Boche did at VMworld 2009 about the value of blogging for his visibility. Coincidentally last night I read an excellent post from Duncan Epping on the same topic, namely how much he has been able to capitalize on his "blogging hobby".

I wanted to take a moment here to underline how true what Duncan was saying is. I can't agree more with his points as I have been through (some of) them myself. I have documented my blog experience in my revamped CV online here (have a look at the third section which is dedicated to my blog experience). Blogging, and the Web 2.0 in general, have literally changed my professional life, specifically my exposure and visibility. Not only that, on a much larger scale, it is changing the way the "Power of Knowledge" in the IT world actually works. I built a small presentation on this concept a couple of years ago for an internal review and I have posted it here for you to download. It goes through some of the concepts and a point in time "life-cycle" of a blog post that evolved into a great deal of internal visibility.

As Duncan pointed out the potential visibility you can get is very rewarding. By the way you don't need to "post like hell" in order to have good feedback and a good following. Take me as an example: even though I tend to post long articles that go through a specific concept and try to get into the details of the matter, I usually post no more than once a month on average. See? I don't post twice a day and you don't have to do so to end up in the top 5 list of virtualization.info for the best blogs of 2008. The only piece of advice I have (on top of what Duncan suggested already) is that I wouldn't be too worried about having a blog... I would, on the other hand, be worried about having something interesting to say in a blog. I have been in countless IBM meetings where people were suggesting, for a particular topic we were working on, to create a Domino team-room or a wiki to collaborate and have people aggregating around it: most people don't understand that a wiki or a blog per se is nothing. It's just a frame... if you don't put a picture in it - i.e. the real content - people won't stop and won't look at it.

I have to admit I have had so much exposure to end-users and business partners in the last few years that it is easy for me to write about stuff they are interested in. Even if you don't often visit customers these days, other technologies, such as forums, allow you to have a real life grip on what's going on in the field. When I spend a few hours on the VMware and Microsoft forums I feel like I have visited some 20 customers given the amount of information you can take out of those posts (pain points, requirements, constraints, even internal politics that have little to do with IT but do influence the IT choices). Sure if you haven't been able to meet a customer in 10 years and you think Twitter is a bad word... well you can always post stuff on your blog but it probably needs to be related to how you would suggest cooking pasta "al dente" or a good steak on the grill.

100% agreed also with Duncan's point that you post not only to share something you know with the community but also to have a chance to dig into something you need to understand in greater detail. This is for example what happened to me prior to this post about how DR works in a VMware scenario. I used it as a challenge: there were many things that were not clear to me about the various steps and I was too lazy to just sit down and read the manuals. I had to have a challenge and posting an article about how to do that was a good one.

The only regret I have is that it seems Duncan has been able to capitalize more on his visibility than I have been able to, but that's ok. In the meantime I will enjoy the (almost) 1000 visits in 24 hours... talking to 1000 people about what I think of a given topic would have meant 10 years on the road in the pre Web 2.0 era.

I guess that having posted yesterday as well as today, I will have to wait another couple of months for the next post to be on my "average posting rate".

Massimo.

posted by Massimo | 6 Comments

VMware, SpringSource and What's Not Appropriate to Say

The acquisition of SpringSource that VMware has announced is going to change the way the industry as a whole perceives and segments the key players in the x86 virtualization market. I think most people (myself included) need to change gear and look at the whole thing from a new perspective. In this article I am going to talk more about a concept that I have been thinking about lately: virtualization is becoming more and more broad and deep.

This is clearly becoming a two-horse race between Microsoft and VMware whereas Citrix is going to be forced to gravitate around Microsoft in the "broad and deep" context I am going to discuss hereafter.

When I heard about VMware and SpringSource, all of a sudden I realized the world is changing for all of us virtualization geeks. First and foremost those that have only been bothering about low level infrastructure virtualization details - such as VMotion compatibilities, cluster configurations, storage integrations and so forth - will have a hard time keeping up with what's going on in the industry. Virtualization vendors are "moving up the stack" very quickly so you'd better start familiarizing yourself with concepts and technologies around Development Frameworks, Integrated Development Environments (IDEs) and stuff like that. Not the sort of things Systems Engineers (aka infrastructure people) paid too much attention to - until now.

Those that have grown up with VMware in the virtualization arena have always focused their efforts on hypervisor capabilities first (I still remember my very first customer implementation where we were piloting a beta version of ESX 1.1) and subsequently on the infrastructure capabilities that VMware made available throughout the years (things like Virtual Center with all its associated functionalities as well as add-on products such as SRM and the like). This is the "standard dimension" we all are very familiar with and I would define this dimension as broad. Basically VMware broadened its value prop moving from the hypervisor (which is a commodity from a business perspective but a tremendous asset from a sell-up perspective) all the way to making the infrastructure richer and more enterprise-ready with additional functionalities, specifically in the automation space.

This move about SpringSource opens up a whole different dimension which is what I refer to as the deep dimension. In fact if VMware continues to only broaden their hypervisor richness they will always be at the mercy of two things:

  1. Their competitors might be able to catch up to the same level of ecosystem and thus functionalities.

  2. Their own potential customers that might not need that vast ecosystem of functionalities and might be satisfied with VMware competitors' offerings (even if not so broad).

Now, everybody knows that the stuff you can find in your data center is not a function of a technology per se, but rather a function of the business applications they are able to support. Basically your platform (be it a processor, an operating system or a middleware - you define it) is as good as the number of ISVs it has been able to attract over the years (I should trademark this). Back to VMware. One of the challenges they had was to not only grow broad but also find a way to grow deep. They had to try to differentiate that black box that they provide (i.e. the virtual hardware which describes their virtual machine), essentially moving up the stack trying to foster the development of business code on top of their virtual hardware (and virtual infrastructure) that wouldn't run as well on someone else's virtual hardware (and virtual infrastructure). They basically can't afford anymore (or they will not be able to afford in the long run) to win deals based on infrastructure functionalities alone. They need to create a compelling reason for the ISVs to suggest using VMware rather than leveraging Systems Engineers that suggest using VMware because it makes things "easier and cleaner." Let me tell you what I think: if it was about making things easier and cleaner we would all be running mainframes in our data centers. And we wouldn't be here discussing how to optimize the Intel server sprawl as there wouldn't be any Intel server sprawl in the first place.

If VMware doesn't do this they are exposed in the long run to the risks in points #1 and #2 above. In trying to create a better and more integrated application + infrastructure duo - which is their current mantra when discussing the SpringSource acquisition - they also need to find a way to make sure applications that are being developed will run better on certain virtual infrastructures (namely VMware) vs. competitors' virtual infrastructures (namely Microsoft). Did I say lock-in? Nah, what a bad term.

Let me draw this concept in a simple chart:

How do I read this chart? Interestingly enough the hypervisor is central in this vision; however, it's perceived as a commodity, which it truly is from a revenue perspective. Having said this, it's an incredible point of control for the vendors because hypervisor XYZ will drag fee-based management features (typically from the same vendor). The management features are on the broad dimension (left and right). In the VMware camp here you can find the enterprise features included in vSphere as well as all VMware data center oriented add-on products. In the Microsoft camp you would find System Center Virtual Machine Manager along with the whole System Center product suite.

The other dimension (deep) is what the new SpringSource acquisition is all about. VMware is willing to create a more integrated application layer, through virtualization hooks in the SpringSource framework, that will make new Java-based applications VMware-aware. Microsoft has a similar if not bigger potential (although they haven't exploited it so far) in the fact that they own the software stack/framework (Windows / .Net) that is being used in about 80% of the x86 deployments worldwide (be they virtual or physical). In light of this and with virtualization in mind, one might speculate that VMware has a very mature broad dimension and they are starting to build a deep dimension. On the other side Microsoft has a very mature deep dimension (although, as I said, they haven't really leveraged it other than for some Windows enlightenment integrations) while they are adapting their highly potential broad dimension with more virtualization in mind - the System Center suite is very mature and complete but it's not virtualization-centric, so to speak. I guess you are starting to see now why I think this is going to be a two-horse race. How could Citrix keep up with all this?

All this looks interesting, but I have mixed feelings about what's going on. In a sense, having virtualization aware applications is going to provide a new level of features and functionalities that do not exist today, which is very positive. On the other hand I have always evangelized (and hoped for!) a very clean separation between the infrastructure services and the application layer, as I outlined in an old presentation I did at VMworld 2007 (download it here). I strongly believe there is tremendous value for end-users in using a standard infrastructure where they could switch virtualization technologies back and forth without having to compromise on the way business applications are written. Understandably this is not a value proposition the virtualization vendors like to hear as - for their own good business reasons - they want to be able to have the customers strategically standardized on their own platform. Did I say lock them in? Nah. Joking aside I believe many others will have conflicting points of view in trying to determine whether it would be better to have a more generic application that runs well on all virtualization software platforms, or to have an application that rocks only on a single virtualization platform and runs so-so on all others. All this assumes industry standards will either be non existent, only used by a single vendor or simply ignored. At the end of the day they all lead to the same result from a user perspective, which is proprietary implementations.

Assuming this is the right interpretation of where the industry is moving (well, at least it's my interpretation for now), I think VMware is making a big bet with these messages. They are somehow giving the idea (to me at least) that in the long run there will be two optimized stacks in the industry one will need to choose from strategically: the first one is the "VMware stack" with SpringSource-based VMware-aware applications, and the other one is the "Microsoft stack" with Windows/.Net optimized applications where the former would run on top of ESX / vSphere and the latter would run on top of Hyper-V / System Center. Sure VMware is going to support Windows as well, but this discussion is not about running legacy physical servers in virtual machines, this discussion is about how to properly and strategically integrate newly developed applications on top of a brand new virtual infrastructure. On the other hand Microsoft does support and will continue to support Linux variants on top of Hyper-V, but you probably wouldn't say Linux is (going to be) optimized to run on top of the Microsoft hypervisor. You might argue that VMware does a better job at running Windows than Microsoft does at running Linux, but I don't think I need to explain why this is the case (just look at the OS marketshare data and you'll find the answer). The key point I am trying to make here is that as long as you treat the VM (and its application) as a black-box, you can always argue that your virtual infrastructure does a better job at running it, regardless of what runs inside of it. On the other hand as soon as you start having first and second class citizens in terms of application support (not to be confused with base OS support), you are opening up a new dimension that basically didn't exist before... and that might be an assist to your competitor.
Perhaps this is a risk that VMware has to take to move to the next level: if they want to compete head-to-head with Microsoft they need to turn hard at some point and not fall into bed with the enemy all the time.

The following is a very unofficial view of what's in my mind with regard to things like focus, commitment and interest each of the two vendors will map into their own technology efforts:

There is another thing about the SpringSource acquisition. Other than the application integration I have referred to, there was another thing VMware was interested in: a Platform as a Service offering. There are a number of segmentations and definitions around the various cloud models but there are two that are dominant among the others (so far): IaaS and PaaS.

The first one is Infrastructure-as-a-Service, and its characteristics can be summarized as follows: a software black box that can run whatever the customer requires, starting from the OS all the way to the software stack (middleware and applications) of choice. For the virtualization geeks of the old school this basically is an empty virtual machine... I am sure you are familiar with that black screen that says "OS not found" and that prompts you for a diskette.

The characteristics of Platform-as-a-Service are a bit different and a little higher in the stack. In a PaaS cloud model, the end-user wouldn't be presented with a bare (virtual) metal VM (horrible definition, but I think you get the idea) but rather with a software platform that includes functionalities that could be generally associated with operating systems, development frameworks as well as data management services. Microsoft Azure anyone? That's where VMware was coming up short compared to Microsoft in the PaaS space. Microsoft has a very strong potential here to attract a huge community of developers. VMware had to do something to address that very important layer of the cloud space with its own offering as they had to provide an end-to-end stack for both IaaS (which was easy, and which they have had for a number of years) and PaaS to be credible players. This reason was perhaps even more compelling than the first reason discussed in this article (i.e. being able to create VMware-aware autonomic applications). After all, if it was only about application integration, they could have partnered with key middleware vendors to integrate these functionalities into a variety of leading frameworks including WebSphere, WebLogic and JBoss, to name a few. The fact that they wanted/needed to buy SpringSource to do this is partially due to the fact that they couldn't afford non-exclusive partnerships, as well as to the fact that they needed to move up the stack very quickly. This is not something that a standard technology partnership, perhaps not even an exclusive one, could provide.

Last but not least, while we are in speculation mode, if I look at the two PaaS stacks from Microsoft and VMware, the latter seems to be missing a good data management layer to counter the SQL Services in Azure. With Sun Microsystems falling apart and speculation about selected spin-offs of various divisions, I am wondering if VMware isn't evaluating an additional move up in their brand new stack targeting MySQL (or similar technologies if Oracle isn't willing to help VMware to become a new Microsoft)...

In conclusion, this is a very crude take on what's on the horizon. Certainly the marketing machines of the vendors will try to smooth the edges of my very simplistic view. As an example, VMware is preaching to the industry that they are going to open up the APIs so that all applications built on top of all sorts of development frameworks could be integrated into their own infrastructure. While this is technically true, most .NET developers might end up doing this on the Microsoft platform for their own convenience - which might or might not have anything to do with technical reasons. Similarly I don't see many Java developers integrating their applications into Hyper-V and the Microsoft virtualization tools as a whole. It's interesting that the world seems to be aligning around these two vendors although they are coming from two very different perspectives: VMware is coming from the virtual infrastructure expanding into the platform space, whereas Microsoft is coming from the platform space moving into the virtual infrastructure space. The giants are moving and the customers are going to see the benefits. May you (we!) live in interesting times - which seems to be the case.

Massimo.

posted by Massimo | 110 Comments

Disaster Recovery Inside-Out for Dummies (with LSI)

In this article, I'd like to document a setup I have been working on for a few days at the LSI office in Milano (great guys and free beverages there! Thanks!). LSI is the company from which IBM OEMs the DS3000, DS4000 and DS5000 lines of storage servers. Since I am trying to get a little bit more into the storage and network subsystems I wanted to spend a few days playing with those kits. I have concentrated on today's hot topic of Disaster Recovery and particularly the integration of LSI RVM (Remote Volume Mirroring) into VMware SRM (Site Recovery Manager). I have to admit that I am not a storage guru, nor have I looked too much into SRM, so most of the stuff you will find here might be pretty basic. This is clearly not an advanced read for the likes of Duncan Epping, nor for those that go to bed with the VMware vmkfstools CLI or "talk UUID." (I guess Duncan will get what I mean.) Yet it's intended to provide a bit of background about what happens behind the scenes (the "scenes" would be the GUIs of the various products involved in this case). The SRM part is really focused on the storage integration, which was the thing I was most interested in for this 2-day storage marathon. I like to treat these articles as a sort of personal log / documentation of what I have done (for future reference) so it will certainly serve me in the long run. Hopefully it will be of use for some of you, too.

Last but not least, while the bar on the right of your browser might suggest this is a long post... consider that it's full of screenshots! So without further ado, let's get started.

Basic Remote Mirror Setup

This part doesn't involve any specific SRM concept in action. It's just meant to describe the basic infrastructure setup (both logical and physical) as well as the way the storage replicates and how the VMware hosts deal with replicated LUNs. It is important to understand what happens at a lower level in order to move on and plug SRM on top of this. The picture below outlines how the logical layout of the infrastructure looks (including SRM):

For completeness, the following picture describes how the physical infrastructure looks instead:

As the picture outlines, the Virtual Center VMs in both sites also host the SRM service. Depending on the scale of your project you might want to have dedicated virtual machines to host the SRM instances or even dedicated physical servers. Milano, in our lab scenario, is the primary site while Roma is the DR site. As you can imagine, LUNs need to be replicated from the DS4700 in Milano onto the DS4800 in Roma. LSI calls this storage feature RVM (Remote Volume Mirroring) and it's essentially an advanced function that allows you to keep a copy of your LUNs on a remote storage server.

Notice that the DS4700 is a storage server that includes in a single 3U package both the controllers (A and B) as well as the first string of disks (more can be attached through FC ports on the rear). On the other hand, the DS4800 has a 4U "head" unit that hosts the controllers but doesn't include any disks in the base chassis. They can be added with external expansions (as in the picture above). You might guess that the DS4800 is a more powerful machine than the DS4700 and that, in a real life scenario, you might want to have that situation inverted. Your guess is correct but for the sake of the tests this wasn't interesting since we weren't looking for ultimate performance. Also consider that any DS4xxx type of storage is "replication compatible" both ways with any other DS4xxx type of storage. And even DS5xxx!

Note: Other than the standard zoning so that each of the servers with two HBAs can see each of the two controllers on the storage array, please consider that for the RVM feature to work all controllers need to be connected in a certain way. Specifically for this scenario the last FC port of ControllerA on the DS4700 needs to be connected to the last FC port of ControllerA on the DS4800. Same zoning process for ControllerB. Without this extra SAN configuration RVM would not work. And no, having a single switch per site is not a best practice - you would need two in a real life environment.

The storage configuration (a summary of it) is described in the pictures below. Basically the DS4700 in Milano has a couple of LUNs that are dedicated to the local cluster and that do not replicate (these are VC-MILANO and SERVICE-MILANO). These LUNs host the Virtual Center instance as well as a Windows template. There are other LUNs (SRM-1-MILANO, SRM-2-MILANO, SRM-3-MILANO and SRM-4-MILANO) that are replicated onto the DS4800 in Roma. A simple synchronous mirroring configuration has been established.

 

The way you set this up is that you first create companion LUNs on the target: they need to be at least as big as the source LUNs, or bigger if you want.

Through the LSI Storage Manager (SANtricity) you then select the source LUN and you mirror it onto the remote storage: a list of DSxxx storage devices with the mirroring feature enabled is shown, as well as a list of compatible companion LUNs for each device. The DS4800 does not mask the replicated LUNs to the cluster in Roma. This means that the hosts in the cluster have no idea whatsoever that there are LUNs on that array that are in sync with the cluster in Milano. In our lab we have manually created SRM-1-ROMA, SRM-2-ROMA, SRM-3-ROMA, SRM-4-ROMA on the DS4800 (as you can see in the picture above) and then we went through the steps described to create the mirror. 
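The "compatible companion LUN" rule is simple enough to sketch in a few lines of Python. This is only my own illustration of the size check described above: the `Lun` class, the sizes and the SCRATCH-ROMA LUN are made up, and the real Storage Manager may well apply additional criteria that I am not aware of.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Lun:
    name: str
    size_gb: int  # hypothetical sizes, not the ones used in the lab


def compatible_companions(source: Lun, candidates: List[Lun]) -> List[Lun]:
    """Keep only the target LUNs that are at least as big as the source LUN."""
    return [lun for lun in candidates if lun.size_gb >= source.size_gb]


source = Lun("SRM-1-MILANO", 100)
targets = [
    Lun("SRM-1-ROMA", 100),    # same size: acceptable
    Lun("SRM-2-ROMA", 150),    # bigger: acceptable
    Lun("SCRATCH-ROMA", 50),   # invented LUN, too small to hold the mirror
]

for lun in compatible_companions(source, targets):
    print(lun.name, lun.size_gb)
```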

Now that the replication is in place, the first test we did at the storage infrastructure level was to create a snapshot of a replicated LUN. From the Storage Manager we created a snapshot of SRM-1-ROMA leaving the mirror link between SRM-1-MILANO and SRM-1-ROMA in place as the picture below suggests:

This is how you would read the above picture: SRM-1-ROMA is a replica of a LUN coming from another storage server. As such it's in a read-only state (in fact you don't want to write onto it since it's continuously being updated by its master LUN on a remote storage). However, we took a snapshot of that R/O LUN at a certain point in time and we called it Snap-SRM-1-ROMA-1. This LUN is now enabled for R/W so it could be fully used as a point in time copy of an R/O LUN under replication.

The next step was then to manually map this snapshot to the cluster in Roma so the servers would be able to recognize it:

And this is when the "fun" begins.

*************     Background information that you need to understand and be familiar with before you move on     *********************************

There are two key parameters that rule how an ESX host deals with the LUNs:

  • EnableResignature (default = 0 = False)

  • DisallowSnapshotLUN (default = 1 = True)

It took me a while to digest them (and right now I think I am halfway there), but essentially DisallowSnapshotLUN (when active, which is the default) instructs the ESX host NOT to import the VMware Datastore if it recognizes it's a snapshot of an existing LUN. When the parameter is set to 0 (False) the ESX host is allowed to import the snapshot as a VMware Datastore without modifying its original name or its UUID.

The first parameter (when active, which is NOT the default) instructs the ESX host to resign the LUN and import it into the ESX host as a new VMware Datastore (which gets labeled snap-xxxxxx-<Original Datastore Name>) with a new UUID. When this parameter is turned on, the DisallowSnapshotLUN value is irrelevant as the LUN gets resigned right away and imported as a new Datastore.
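For those who (like me) digest this kind of thing better as code, here is how I would summarize the interaction between the two parameters. It's a sketch of my understanding as described above, not anything ESX exposes: the function name and the outcome strings are mine.

```python
def snapshot_lun_outcome(enable_resignature: bool = False,
                         disallow_snapshot_lun: bool = True) -> str:
    """Describe what a host does with a LUN it detects as a snapshot.

    Defaults mirror the ESX defaults quoted above
    (EnableResignature = 0, DisallowSnapshotLUN = 1).
    """
    if enable_resignature:
        # Resignaturing wins: DisallowSnapshotLUN is irrelevant in this case.
        return "imported as a new Datastore (new UUID, snap-xxxxxx- label)"
    if disallow_snapshot_lun:
        return "not imported"
    return "imported as-is (original name and UUID)"


# Print the full truth table for the two settings.
for resignature in (False, True):
    for disallow in (True, False):
        print(
            f"EnableResignature={int(resignature)} "
            f"DisallowSnapshotLUN={int(disallow)} -> "
            f"{snapshot_lun_outcome(resignature, disallow)}"
        )
```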

These parameters get very important (and very critical) when you are dealing with snapshots and clones on the same storage server and you try to give the original ESX hosts visibility of these new spaces. For example, if you try to expose to a given host/cluster the original LUN as well as its snapshot without resigning it, you might incur potential data loss and inconsistency as the host/cluster will only make one of these two entities available (they are in fact essentially the same thing: same Datastore name, same UUID). When you are dealing with a remote copy of the LUN(s), this becomes a less important issue because you are basically importing a snapshot (or a mirror) into a different set of ESX hosts.

This should be enough for a dummy (like myself), but if you want to get into deeper details about these two parameters and the UUID thing I suggest you read one of Duncan's best articles as well as this post from Chad.

************************************************************************************************************************************************************

If you are now familiar with the background above you should guess what happens. Mapping the snapshot Snap-SRM-1-ROMA-1 to the cluster in Roma forced the ESX hosts to recognize the LUN after the rescan:

Since we left the parameters above at their defaults (EnableResignature=0, DisallowSnapshotLUN=1), the LUN doesn't show up as a VMware Datastore on any of the hosts in Roma:

This is the desired behavior since the hosts recognize this is a LUN that is coming from a different storage subsystem (so with a sort of "incompatible" UUID). As a matter of fact, you can manually add a brand new Datastore and the LUN above is shown as available space for a new VMFS file system (which we didn't create as we didn't want to destroy the content):

At this point we changed the DisallowSnapshotLUN parameter to 0 (that setting should read "Allow Snapshot to be imported"):

After this change (which doesn't require a reboot of the host), the hypervisor imports the VMware Datastore simply after a rescan of the HBAs:

Similarly, by changing the EnableResignature parameter to 1 and rescanning the HBAs, the Datastore gets imported with a new UUID and a new name as you can see from the picture below:

What I have described above (at a very high level) are basically the steps you would need to implement in order to manually deal with a DR procedure. SRM does that under the covers along with a number of other things, such as reconfiguring the VMs on the DR site (alternatively you would have to manually add them to the DR cluster after importing the Datastores). It's a common misconception that VMware SRM is a layer of additional technologies on top of what VI3 already provides (SRM today is not compatible with vSphere, but it should be soon). I think a better way to describe what SRM does is that it's a method to code all the actions you would have to manually implement in order to either test or run a DR Recovery Plan. Many refer to SRM as a "binary coded DR runbook." There is nothing SRM does that you couldn't do by hand without it. But having SRM might save you time... and some risks (manual DR procedures might be error prone).

Site Recovery Manager Setup (Test the Recovery Plan)

In this section, we are going to essentially automate the manual process above by means of a DR orchestrator (in this case, it is called VMware Site Recovery Manager). This article is not intended to be a detailed description of the capabilities of SRM nor a step-by-step guide to its configuration. We will assume from now on the reader has a basic understanding of the product. Before we get into the details it is important to describe the virtual environments (guest OSes) we created in the production site. Notice that there are additional VMs that we have used to host a number of infrastructure services (such as the Virtual Center servers themselves). These VMs generally would be either hosted on external physical hardware or would not be subject to any SRM DR plan anyway. We will focus on what we pretend to be "production VMs" in our lab test. From this perspective we have essentially created three VMs (Web1, Web2, Web3) that we mapped onto the 4 LUNs described above (SRM-1-MILANO, SRM-2-MILANO, SRM-3-MILANO and SRM-4-MILANO). The following picture outlines the mappings.

  • Web1 has two VMDK files associated to it. One is on the srm-1 VMware Datastore (which in turn is on the SRM-1-MILANO LUN) and another one is on the srm-2.

  • Web2 has one single VMDK file associated to it which is on the srm-2 Datastore.

  • Web3 is a bit more tricky. It has a VMDK on srm-3 and it also has an RDM (Raw Device Mapping) onto the SRM-4-MILANO LUN. Notice this LUN doesn't have an srm-4 Datastore associated because it's raw. Since the RDM mapping is set to virtual, Web3 has a VMDK pointer (on srm-3) to the SRM-4-MILANO raw LUN.

It is of paramount importance to understand how all the VMs interact with the Datastores / LUNs because there might be some consistency dependencies that SRM will have to deal with. In fact, once we have installed SRM as well as the LSI SRA (Storage Replication Adapter), this is what the "Configure Array Managers" window displays:

Have you noticed how the various LUNs get grouped together? The first group includes the srm-3 Datastore as well as the SRM-4-MILANO because there is a virtual RDM mapping from a VMDK file on srm-3 onto the fourth LUN. So they are somewhat dependent.

Similarly, there is another group that includes both srm-1 and srm-2. And that's because there are interdependencies, as you can see from the picture with the layout of the VM disk configuration: Web1 is dependent on the first and on the second LUN so they need to be treated as a single Protection Group (you can't split them, as this would split the VM configuration and this wouldn't maintain data consistency!). However, now that you have to treat srm-1 and srm-2 as a single Datastore Group, SRM realizes what the other dependencies are. In fact, Web1 is not the only VM that is hosted (partially) on srm-2: Web2 is hosted on srm-2 and it must be included in the very same Protection Group. This is what you would see from a GUI perspective when selecting this Datastore Group:

When you select the Datastore or the Datastore Group, SRM automatically displays the VMs that are dependent on that Datastore or those Datastores. That's a read only field. Notice you can't select either srm-1 or srm-2: they are a single entity for SRM.
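I have no idea how SRM computes these groups internally, but the behavior described above looks a lot like finding connected components: any two Datastores / LUNs that are used by the same VM end up in the same group, and the dependency propagates. Here is a small Python sketch of that idea using the lab layout (Web1, Web2, Web3). The function name and the data structure are mine, and it assumes every VM uses at least one Datastore or LUN.

```python
from collections import defaultdict


def datastore_groups(vm_storage):
    """Group Datastores/LUNs that are tied together by the VMs using them.

    vm_storage maps a VM name to the list of Datastores/LUNs it touches.
    Returns a sorted list of (datastores, vms) pairs.
    """
    parent = {}

    def find(item):
        # Union-find lookup with path halving.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    # A VM spanning several Datastores merges them into one group.
    for stores in vm_storage.values():
        first = find(stores[0])
        for other in stores[1:]:
            parent[find(other)] = first
            first = find(first)

    groups = defaultdict(lambda: {"datastores": set(), "vms": set()})
    for vm, stores in vm_storage.items():
        group = groups[find(stores[0])]
        group["vms"].add(vm)
        group["datastores"].update(stores)

    return sorted(
        (sorted(g["datastores"]), sorted(g["vms"])) for g in groups.values()
    )


lab = {
    "Web1": ["srm-1", "srm-2"],
    "Web2": ["srm-2"],
    "Web3": ["srm-3", "SRM-4-MILANO"],  # VMDK on srm-3 plus the virtual RDM
}

for datastores, vms in datastore_groups(lab):
    print(datastores, "->", vms)
```

With the lab layout this yields two groups: srm-1 and srm-2 together (with Web1 and Web2), and srm-3 together with the SRM-4-MILANO raw LUN (with Web3), which matches what the "Configure Array Managers" window showed us.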

What we did from here is simple. We created two Protection Groups on the SRM instance hosted on the production site (Milano). These PGs build on top of the srm-1 / srm-2 Datastore Group and the srm-3 Datastore (which includes the RDM on the fourth LUN). Subsequently, we created a Recovery Plan on the DR site (Roma) which contains the failover instructions for these two Protection Groups. That's it.

Our production site is now protected. What we need to do is "Test" our Recovery Plan. One of the advantages of SRM is that it has built-in intelligence to simulate a DR. Obviously this process is not (and should not be) disruptive: you want to keep the replica of the LUNs in place and you don't want to shut down the VMs in production to run this test. How do I do so? It's easy. Let's push the Test button on the SRM GUI and go through the plan.

The trick here is that you want to create a dedicated environment (from a storage and network perspective) that doesn't interfere with the production environment. As soon as the test starts, a snapshot of the replicated LUNs is created (at least those that are in the Protection Group associated to the Recovery Plan that is being tested). It's conceptually identical to what we have already done with a manual snapshot (see above), but this time it is SRM that instructs the LSI SRA (Storage Replication Adapter) to create the snapshots and the SRA in turn talks natively to the LSI devices to do so. The SRA is basically the driver that SRM uses to communicate with the actual storage subsystem. You can see the snapshots being created in the next picture:

*************     Background information that you need to understand and be familiar with before you move on     *********************************

VMware SRM is configured by default to set the EnableResignature parameter to 1 (that means TRUE) on each of the hosts in the receiving cluster. This means that, independent of the behavior you configured on the hosts, SRM will always resign the LUNs when imported into the remote cluster in the DR site. This will cause the LUNs to be renamed with the (in)famous naming convention snap-xxxx-<Original Datastore Name>.

If you want to keep things clear and "human readable," you can change the SRM configuration to rename the Datastores back to their original names. This is achieved through the SRM configuration file vmware-dr.xml, which is located in the C:\Program Files\Site Recovery Manager\Config directory of the SRM server in the DR site. You have to identify the line

<fixRecoveredDatastoreNames>false</fixRecoveredDatastoreNames>

and modify it to:

<fixRecoveredDatastoreNames>true</fixRecoveredDatastoreNames>
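If you prefer not to edit the file by hand, the change above can be sketched in a few lines of Python. This is only an illustration that works on the text of the file: the function name and the surrounding <Config> element in the sample are my own assumptions, and only the fixRecoveredDatastoreNames element comes from the actual file. Take a backup of vmware-dr.xml before trying anything like this.

```python
import re

# Matches the element whatever its current value is.
# Assumption: the value is the plain text "true" or "false".
_FLAG = re.compile(
    r"(<fixRecoveredDatastoreNames>)\s*(?:true|false)\s*(</fixRecoveredDatastoreNames>)",
    re.IGNORECASE,
)


def enable_datastore_rename(xml_text: str) -> str:
    """Return xml_text with fixRecoveredDatastoreNames set to true."""
    new_text, count = _FLAG.subn(r"\g<1>true\g<2>", xml_text)
    if count != 1:
        # Refuse to guess if the element is missing or duplicated.
        raise ValueError(
            f"expected exactly one fixRecoveredDatastoreNames element, found {count}"
        )
    return new_text


# Hypothetical fragment: the <Config> wrapper is made up for this example.
sample = (
    "<Config>\n"
    "  <fixRecoveredDatastoreNames>false</fixRecoveredDatastoreNames>\n"
    "</Config>"
)
patched = enable_datastore_rename(sample)
print(patched)
```

Working on the text with a narrow pattern, rather than parsing and re-serializing the whole XML, leaves every other line of the file byte-for-byte untouched.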

Thanks to Duncan E. and Mike L. for their research.

It's important to understand that this will not change the value of the EnableResignature parameter back to 0. The LUN will be resignatured anyway, but SRM will take an extra step to rename the Datastore back to its original name (effectively just deleting the snap-xxxx portion of the new Datastore name).

Not being an expert on this, I can only think that doing so is important when you want to maintain a decent naming convention, especially when you consider that a failback onto the production site would cause SRM to rename the Datastore into something like snap-xxxxx-snap-yyyyyy<Original Datastore Name> (which is indecent in my opinion). Apparently it would have been easier for SRM to configure the host to allow snapshot LUNs (DisallowSnapshotLUN = 0) and not bother in the first place with the resignature and the rename. But if VMware decided to do so, there must be other (hopefully good) reasons.
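To make the naming issue concrete, here is a small Python sketch of what happens to a Datastore name across a failover and a failback, and of what "deleting the snap-xxxx portion" amounts to. It is not SRM's code: the helper names are hypothetical, and I am assuming that the xxxx part is a string of hexadecimal digits followed by a hyphen.

```python
import re

# Assumption: each resignature adds a prefix of the form "snap-<hex digits>-".
_SNAP_PREFIX = re.compile(r"^(?:snap-[0-9a-fA-F]+-)+")


def resignatured_name(name: str, tag: str) -> str:
    """Name a Datastore would get after one resignature (illustrative only)."""
    return f"snap-{tag}-{name}"


def restore_original_name(name: str) -> str:
    """Strip every leading snap-xxxx- prefix, leaving the original name."""
    return _SNAP_PREFIX.sub("", name)


after_failover = resignatured_name("Datastore1", "0a1b2c3d")
after_failback = resignatured_name(after_failover, "4e5f6a7b")

print(after_failover)                         # one prefix
print(after_failback)                         # prefixes pile up
print(restore_original_name(after_failback))  # back to the original name
```

Without the rename step the prefixes simply accumulate at every move, which is why I consider the default behavior unfriendly from a naming convention perspective.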

************************************************************************************************************************************************************

Having said that, we now have the background to understand the next picture, which outlines the storage configuration on the cluster at the DR site in Roma:

The Datastores have been imported with their original names thanks to the change in the vmware-dr.xml file. The UUIDs of the Datastores, however, have changed because the Datastores have been resignatured. This is not a problem for SRM because the "place-holder vmx files" that are kept at the DR site do not contain any reference to the disk configuration of the VM. The Datastores are parsed during the execution of the Recovery Plan and the correct disks (with the actual UUIDs) get included in the final vmx prior to the startup of the VM.

Notice that the production VMs are being started off the snapshots that the LSI SRA has created and that they are now connected to a so-called "Bubble Network." The Bubble Network is a standard VMware Virtual Switch with no Physical NICs connected to it, and it gets created for the duration of the test. This allows the system administrator to test the restart of a copy of the VMs (currently running in production) without worrying about potential network conflicts. Of course at this time the replication between the primary and DR sites is still in place and we are still fully protected from a potential disaster.

The test is being executed, and apparently everything has been running smoothly. At this point, SRM pauses for the system administrator to make an evaluation of the test (notice in the SANtricity Storage Manager how the snapshots also have been automatically mapped to the cluster):

Once the administrator is done with the checks he/she can push the "Continue" button, which essentially rolls back the Test. This, in a nutshell, includes shutting down the VMs in the DR site and deleting the snapshots taken from the replicated LUNs. Everything is now back to normal for the next Test to run (or a disaster to recover from).

Site Recovery Manager Setup (Run the Recovery Plan)

Running the Recovery Plan is different than testing the Recovery Plan. The most important difference is that SRM doesn't create snapshots of the replicated LUNs; rather it uses the replicated LUNs directly. The other difference is that the VMs on the recovery site are connected to the actual physical network and no longer to the "Bubble Network" that is used in the Test. Everything else is pretty similar to what we have seen already.

As you can see, SRM instructed the LSI SRA to reverse the roles of the mirror: now the LUNs on the DS4800 (the storage server at the DR site in Roma) are "Active" and get replicated onto the "Passive" LUNs on the DS4700 in Milano. Most likely this is not what would happen in a real-life disaster. In that case the DS4700 would probably not be available (due to the disaster), so SRM would only activate the replicas on the DS4800 in the DR site.

At this point the VMs would be restarted on the cluster in Roma similarly to what happened in the Test scenario (with the exception that they would connect to the actual physical network since they are restarting there to really take over). Remember this is no longer a Test, it's a real Run of a real Recovery Plan. Doing this on a production environment will have devastating results!

At the end of the process, all production VMs (Web1, Web2 and Web3) would be running on the VI3 cluster in Roma which now effectively can be considered the new production site.

Failback

Failback is a nightmare, at least in my opinion. Unfortunately there is no "Failback Button" on the SRM console. However, you could work on the VMware consoles to create a Recovery Plan that moves all the VMs currently running on the DR site (Roma, for us) onto the original production site (Milano, in our case). Rather than a real failback, I think it's more appropriate to define this as a new failover plan that happens to bring the workloads back to their original positions. VMware has published a useful document that, in chapter 6, describes the steps to fail back from an SRM failover. It's a good read. There is only one caveat in that paper that would need further investigation: at some point in the failback process it suggests setting the DisallowSnapshotLUN parameter to 0 on the hosts in the original site (the hosts in Milano, in our case). This means that when the storage is brought back to its original place, the ESX hosts on the original production site would be able to import the Datastores without resignaturing them. Since this is done via SRM, it is inconsistent with the behavior we have noticed during the failover. SRM seems to automatically set (on the fly) EnableResignature to 1 on the hosts where the LUNs are being re-activated, effectively forcing the hosts to resignature the volumes - and thus making DisallowSnapshotLUN irrelevant. Further investigation would be required to nail down this inconsistency between the documentation and the behavior we have noticed.

Massimo.

P.S. FOR DUMMIES® is a registered trademark of Wiley Publishing, Inc.

posted by Massimo | 94 Comments

Xeon 5500 (aka Nehalem) Marks the Death of Itanium (and More)

On the last day of March 2009, Intel officially unveiled its brand new Nehalem core architecture under the Xeon 5500 product name umbrella. There is not much to say about it other than that it's impressive from a performance perspective. Just to give you a sense of what we are talking about: the new product - only available for 2-socket servers today and with up to 4 cores per socket - has published many benchmark numbers that are either on par with or slightly better than those of 4-socket Intel based servers with as many as 24 cores. One might wonder why a successful (and clever) company like Intel would cannibalize its highly profitable multi-socket market with a less profitable product such as the 5xxx Xeon series. I think the answer to this question is in one of the slides they used to present Nehalem at the launch event:

These numbers are impressive, but I am pretty sure that if the SUN and IBM marketing people were ever able to read the small text at the bottom (which seems to be technically impossible) they would come up with something to counter them, as they are obviously presented in a way that favors Intel; I can't be certain, however, since I can't read the text myself and so I don't know the assumptions behind those numbers. What is important in this chart, however, is not the numbers (we know Nehalem has impressive performance per core) but the fact that Intel is now using Xeon to go after a 20+ billion $ UNIX market. Up until now - and for the last 10 years - they have been using Itanic (ehm... I mean Itanium... sorry for the typo) to go after the IBM Power or the SUN Sparc processors to get a slice of the UNIX pie. This doesn't seem to be the case any longer. One might wonder where Itanium falls into all this: good question.

A bit of history on Itanium might help. Originally the Intel vision for the 64-bit Itanium was that it would be the follow-on product to 32-bit x86: basically the replacement for the Xeon brand. And they might have had a chance to succeed if AMD hadn't come out with a much smarter evolution for 32-bit x86 processors: in case you are wondering, that would be a 64-bit x86 architecture (namely AMD Opteron). When Intel understood they couldn't fight the Opteron with Itanium - since Opteron was 100% backward compatible with the software available for Xeon whereas Itanium basically was not and would have required massive and painful application porting - they decided to introduce the same "enhancements" in their Xeon processors. This was initially referred to by Intel as x86-32e: obviously they couldn't say Xeon was 64-bit as it would have overlapped too much with Itanium, so they preferred to stay with the ridiculous definition of "32-bit Extended". This was the time when they tried to pitch Itanium as the only "native" 64-bit processor whereas the Xeon (as well as the Opteron, obviously) were "just extensions to current 32-bit architectures". And this is when they shot themselves in the foot: they tried to play with words (i.e. native sounds better than extended) but they forgot that, as far as IT is concerned, native means you have to port the application whereas extended means it's compatible. So, for most customers, extended eventually sounded much (much!) better than native. And this is when the perception of Itanium started to decline. I did a presentation at an IBM System x Symposium in France back in 2004 where I shared these thoughts. Interestingly enough, at that time we had an Itanium based System x box in our portfolio - the x455 - and I basically implied that Itanium (hence the x455) was at a dead end and a useless product given the historical context we were facing.
This, for example, is a chart that I used in 2004 to predict that Windows on Itanium had no real place and didn't make any sense at all; it took a while, but I think Microsoft now thinks along the same lines:

Funny enough, there was an Intel representative in the room who apparently didn't like these messages, and he decided to escalate and complain about my pitch up my management line, all the way to the General Manager of the IBM Systems and Technology Group (who reported directly to Sam Palmisano - CEO of IBM at that time). I was never officially involved in this complaint, but the fact is that, later in the year, we dropped the x455. I like to think I gave the product marketing team a hint on what to do, but more likely what I said in the session was a blessing from the field for what product management was going to do anyway (and for very good business reasons). For your information, I have posted the entire PowerPoint deck in the Files section of my site if you want to have a look. You can download it here.

To make a long story short, Intel had nothing left to do but re-position Itanium as a high-end RISC replacement with the help of HP which, confident in its value and roadmap, decided to completely drop its own RISC offering - the HP PA-RISC processor - and jump onto the Intel Itanium processor as a strategic replacement. Intel tried to position Itanium as an open platform, mentioning that they had dozens of OEMs offering servers based on that processor, but they usually forgot to mention that the vast majority of the sales numbers they were seeing came from HP, which is the only tier 1 server vendor offering such a processor today (IBM and Dell used to but they withdrew it, and SUN never even attempted to).

As Xeon (and the AMD Opteron) became more and more enterprise-ready, the Itanium potential started to shrink even further, up until now, when Nehalem looks like the last nail in the Itanium coffin. Consider also that the first Nehalem incarnation is a CPU model for 2-socket servers (Xeon 5xxx). This might leave the impression that Itanium can address a much larger window as it shines on highly scalable boxes. The truth is that this is only the first product iteration based on the Nehalem core. Later in the year Intel will announce a multi-socket Nehalem based CPU - aka Nehalem EX - capable of scaling up to 8 sockets (Xeon 7xxx series). This CPU will feature 8 cores and Hyper-Threading, thus providing execution support for 128 simultaneous threads (8 sockets x 8 cores x 2 threads) in a single system image. Last but not least, this new CPU will also feature additional enterprise functionality such as MCA (Machine Check Architecture), which was one of the few things Intel used to position Itanium as "more enterprise" than Xeon. On paper a system like this could address the needs of 99.9% of customers. This statement refers to performance, and we all know that performance is just one aspect of platform selection. It will nevertheless cause some adjustments in the server market shares, and this goes back to the fact that Intel is apparently cannibalizing its current high-end market. Most likely what they have in mind, instead, is that they want to push the bar further and enter the UNIX market even more aggressively with a more appealing and serious offering (than Itanium) like Xeon. The idea is: I will cannibalize a profitable high-end x86 market that is worth a few B$ today with a lower-end and less profitable product, because I want to use its big brother (Nehalem EX) to go after a 20B$ UNIX market. Since a picture is worth 1000 words, this is what I am trying to say:

Note that I am not implying this is what I think will happen. As I said, performance is just one metric in platform selection. I am only speculating on the view that Intel has going forward. I am not completely ruling out (either) that this view has a point given what's going on, and if this happens it will impact not only Itanium in the RISC space but other UNIX platforms as well.

Back to the Itanium discussion: last but not least, it's worth mentioning that there is going to be a convergence in the Itanium Tukwila time frame (unsurprisingly delayed again) where you can drop this new CPU into a standard Nehalem socket (see the Update below). Intel has always pictured this flexibility as a means to lower Itanium development costs and make it easier and cheaper for customers and OEMs to move from Xeon to Itanium. The reality is that at the end of the day you end up having a common system, with the same components and the same CPU socket. At that point you'll have the choice of installing either a cheap, super fast Nehalem processor with an unmatched choice of OS flavours and ISV applications... or a more expensive, somewhat slow Itanium Tukwila processor with an embarrassingly limited choice of OSes and ISV applications (at least compared to the Xeon family). I am pretty sure there are some HP execs who regret having ported HP-UX onto Itanium rather than onto the x86 architecture - if only they had known 10 years ago what the x86 architecture would look like 10 years later.

It's well known that not only did Itanium fail to bring any profit, but its development costs have been impressive and its slow sales never made up for them. In a word, Intel has lost tons of money on Itanium. That said, there are obviously a number of issues that prevent Intel from immediately dropping the dead processor: for example the contracts they have signed with "these dozens of OEMs" - and one in particular, which I won't mention (again) - that dropped their in-house developed CPU architecture to jump on Itanium. They cannot just say "hey we are dropping Itanium" and leave these vendors in the mud (especially one). So I guess it's fair to say that, officially, Itanium is alive and healthy; you can imagine what the reality is.

Massimo.

Update (10th June 2009): while Tukwila and Nehalem EX will share the same QPI bus, the sockets of the two processors will remain incompatible for the moment.

posted by Massimo | 79 Comments