Burst Up to the Cloud with a Preconfigured IP Address and VPN

From an IT-Ops, Dev-Ops or Security Architect standpoint, knowing and securing your infrastructure’s logical layout is a vital goal. As the technology world evolves, this basic requirement has constantly demanded that we learn new techniques to keep our business secure. The current push to the “cloud” has added a new challenge to the equation – random infrastructure! In this post, we’ll discuss how we can use Windows Azure to supplement our OnPrem infrastructure on an as-needed basis without exposing ourselves to additional 1) risk, 2) cost or 3) complication. But first… a little history to help us understand why this need exists.

A Brief History of Infrastructure Vulnerability

If you think of the progression of computer systems in phases, you can see how this issue has grown fiercer … much like a mutating virus!

Phase 1 – Line-of-Business Application on a Single Computer

For a long time, businesses had LOB apps copied on each of their employees’ computers. No external connectivity was involved, no central database to hack, etc. An example of this is a newspaper company that I used to work for. They had a slew of sales personnel who managed their sales attempts, clients and billing directly on their local computers. Then they would print this information to paper forms that needed to be handed in to another department to actually fulfill the requests.

Basically, the fact that there was a lock on the front door was all that the IT department needed to do to “secure” their LOB applications :)

Phase 2 – LOB App on a Central Server

Technology advances and things get a little trickier! Clearly we have advanced to a better world – but the point of this article is not to discuss the benefits of each phase, but rather the complications of security.

So, the solution here is pretty simple… the central server would be on an internal network. Therefore, the locked front door and Mr. Security Guard still get the job done.

Phase 3 – Multiple Locations Over the Public Internet

Now things get scary for the IT folks! At this point, someone has to actually do some thinking. Some would simply expose the application to the internet – perhaps as a Web App or a SaaS. However, since our application is still really a private utility and specific to our company… and since the only complication that was added since “Phase 2″ is physical distance and the need to span a public network… the solution that IT Ops would likely choose is a site-to-site VPN. This is especially true if there are many resources that are used, and not just a single application. For example, if there are multiple web apps, SQL databases, shared folders, etc.

So far, things have been pretty straightforward for the IT department. In all cases above, we know exactly what machines need to access what resources. Even in the more complicated scenario of needing to link two locations together via a VPN, both sites would probably have one or more dedicated IP addresses, so there are no dynamic complications to iron out.

What about in the “cloud”? Well, this is where things can get a bit tricky. A lot of people throw around the word “cloud”, and they simply mean that they have a managed data center that is off-site. And it’s true that you can purchase and maintain a 24/7 “cloud” presence, keeping your virtual machines running all the time… but at that point, you’ve just stepped backwards. That’s like putting wheels on a hover-car.

So, how do we utilize the cloud as a dynamically allocated resource that is still as secure as a remote site with a VPN?

Phase 4 – Dynamically Allocated Cloud Resources

To fully appreciate the complications and potential expense of this option, let’s consider the following scenario. Let’s suppose that we have a LOB application that consists of a Web Application (aka “Web Role” in Windows Azure) as the UI layer, a Windows Service (aka “Worker Role”) that processes long-running tasks, and a database server. Each layer of the application works together privately, however the database server needs to be addressable to your on-prem reporting server. The complete logical layout could look like this:

Now, let’s say your on-premises data center can handle the work load for the majority of the year. But a few weeks out of the year, you want to “burst up” to the cloud because you receive a tremendous amount of traffic. This is a pattern typical of companies who have cycles of business, such as those who revolve around annual taxes, “Black Friday”, holidays, seasonal usage, etc.

As the sample graphic above indicates, you can deploy your entire setup into Azure, and then hook it up to your on-prem site via a VPN. And while the VPN covers the “risk” factor, it can leave you stuck with excessive cost, or extra complication. How so?

Many Temporary VPNs – Cost and Complication

Let’s address the remaining issues on our list.

  1. Why is this complicated?
  2. Why is there a concern about cost?

Firstly, setting up a site-to-site VPN is a bit of a process. It takes time to configure your local devices, decide on IP ranges and subnets, etc. There is a lot you can do with PowerShell to dynamically create VPN configurations in Azure… but downloading the configuration settings, discovering the random IP addresses handed to you by Azure and then programmatically configuring your on-prem network devices would take a monster of a program to build. No, this is not a task that should be scripted. It requires people to think and set up.

So the complication is not that you need a Network Engineer to do the work… but rather that if you want to burst up at random times of the day in response to increased traffic, it would be ridiculous to demand that of personnel.

An alternative approach that allows you to burst up would be to deploy a number of these LOB applications into Azure, but keep the VM size small and the number of instances low. This would allow you to crank up the size of the VMs and the number of instances in order to respond to random demand on your servers… but it has the cost of “keeping the car running” the whole time (including keeping the VPN connected).

So what is the solution? How do we burst up to the cloud, keep VPN connectivity and have a zero-cost solution when you don’t need the extra compute horsepower?

Preconfigure, Then Scale Up in Azure!

It’s quite likely that while reading this post, you’ve already figured out the solution. And if you haven’t up until this point, then the subheading just above should be a huge clue :)

With Windows Azure, you pay by the hour for usage of the services, and not for configuration of the services. Meaning, you can create and configure your Virtual Networks/VPNs without paying a dime. In fact, you can create all of the PaaS and IaaS cloud services that you would need… and as long as you don’t have any instances deployed, you’re not charged.

This of course still requires all the same manual work from that Network Engineer – but at least it’s only one time. From there, you can use a PowerShell script (or the Azure REST API, etc) to dynamically “connect” and then deploy instances of your LOB application to the cloud. This way, the meter only starts when you are actually using Azure.
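To make that concrete, here is a rough sketch of what such a “connect, deploy, tear down” script could look like using the classic (Service Management) Azure PowerShell cmdlets. The gateway cmdlet parameters, service name, package paths and network names are all illustrative assumptions – verify them against your installed module before relying on any of this.

```powershell
# Sketch only -- classic Service Management cmdlets; names and
# parameters here are assumptions for illustration.

# 1. Bring the preconfigured VPN gateway connection online.
Set-AzureVNetGateway -Connect -VNetName "BurstVNet" -LocalNetworkSiteName "OnPremSite"

# 2. Deploy the pre-packaged LOB application into the already-configured
#    (and, while empty, free) cloud service.
New-AzureDeployment -ServiceName "MyLobService" `
    -Package ".\MyLobApp.cspkg" `
    -Configuration ".\ServiceConfiguration.Cloud.cscfg" `
    -Slot Production

# 3. When the burst is over, remove the deployment and disconnect --
#    the configuration (and VIPs) stay, and the meter stops.
Remove-AzureDeployment -ServiceName "MyLobService" -Slot Production -Force
Set-AzureVNetGateway -Disconnect -VNetName "BurstVNet" -LocalNetworkSiteName "OnPremSite"
```

The point is not the exact cmdlets, but that the entire burst-up/burst-down cycle reduces to a handful of scripted calls once the one-time configuration is done.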

Once you no longer need the extra machines in the cloud, then spin down the instances to “0”, disconnect the VPN gateways and you’re done! You’ll keep all of your configuration settings. You’ll keep the VIPs that were randomly generated when you created the Cloud Services, and you won’t be racking up a huge bill.

By the way, this solution also works if you don’t use a VPN, but rather simply punch a hole in your firewall to allow traffic from the VIP given to you by Azure.

Greatly Increase the Performance of Azure Storage CloudBlobClient

Windows Azure Storage boasts some very impressive transactions-per-second and general throughput numbers. But in your own applications you may find that blob storage, tables and queues all perform much slower than you’d like. This post will teach you one simple trick that literally increased the throughput of my application over 50 times.

The fix is very simple, and only a few lines of code – but I’m not just going to give it away so easily. You need to understand why this is a “fix”. You need to understand what is happening under the hood when you are using anything to do with the Windows Azure API calls. And finally, you need to suffer a little pain like I did – so that’s the primary reason why I’m making you wait. :)

The Problem – Windows Azure uses a REST based API

At first glance, this may not seem like a throughput problem. In fact, if you’re a purist, you likely have already judged me a fool for the above statement. But hear me out on this one. If someone made a REST-based API, then it is very likely that a web-browser would be a client application that would consume this service. Now, what is one issue that web-browsers have by default when it comes to consuming web services from a single domain?

“Ah!” If you are a strong web developer – or you architect many web-based solutions – you probably have just figured out the issue and are no longer reading this blog post. However, for the sake of completeness, I will continue.

Seeing that uploading a stream to a blob is a semi-lengthy and IO-bound procedure, I thought to just bump up the number of threads. The performance increased only a little, and that led me to my next question.

Why is the CloudBlobClient slow even if I increase threads?

At first I assumed that I had simply hit the limit of throughput on an Azure Blob Container. I was getting about 10 blobs per second, and thought that I probably just needed to create more containers – “perhaps it’s a partitioning issue.”

This didn’t feel right because Azure Blobs are supposed to partition based on “container + blob-name”, and not just on container alone… but I was desperate. So, I created 10 containers and ran the test again. This time more threads, more containers… the result? Zero improvement. The throughput was the exact same.

Then it hit me. I decided to do a test that “shouldn’t” make a difference – but it’s one that I’ve done before in the past to prove that I’m not crazy (or in some cases, to prove that I am). I ran several copies of my console app program at the same time. The results were strange. One application was getting about 10 inserts per second – but 3 applications running simultaneously were getting 10 each (30 total). This means that my computer, my network and the Azure Storage Service were able to process far more than my one console application was doing!

This proved my hunch that “something” was throttling my application. But what could it be? My code was super simple:

while (true)
{
    // Create a random blob name.
    string blobName = string.Format("test-{0}.txt", Guid.NewGuid());
    // Get a reference to the blob storage system.
    var blobReference = blobContainer.GetBlockBlobReference(blobName);
    // Upload the word "hello" from a Memory Stream.
    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes("hello")))
    {
        blobReference.UploadFromStream(stream);
    }
    // Increment my stat-counter.
    Interlocked.Increment(ref count);
}

That’s when it hit me! My code is simple because I’m relying on other people who wrote code, in this case the Windows Azure Storage team! They, in turn, are relying on other people who wrote code… in their case the .Net Framework team!

So you might ask, “What functionality are they using that is so significant to the performance of their API?” That question leads us to our final segment.

Putting it All Together – Getting More Throughput in Azure Storage

As was mentioned before, the Azure Storage system uses a REST (HTTP-based) API. As was also mentioned, the developers on the storage team used functionality that already existed in the .Net Framework to create web requests to call their API. That class – the WebRequest (or HttpWebRequest) class in particular – was where our performance throttling was happening.

By default, a web browser – or in this case any .Net application that uses the System.Net.WebRequest class – will only allow up to 2 simultaneous connections at a time per host domain.

So no matter how many threads I added in my application code, ultimately I was being funneled back into a 2-connection-maximum bottleneck. Once I proved that out, all I had to do was add this simple configuration bit to my App.config file:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.net>
    <connectionManagement>
      <add address="*" maxconnection="1000" />
    </connectionManagement>
  </system.net>
</configuration>

Now my application inserts 50 times more blobs per second than it used to.
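If you’d rather not touch App.config, the same limit can be raised in code via the ServicePointManager class – this is the documented .NET setting behind that config entry. The value of 1000 is just the same illustrative number used above; tune it to your workload.

```csharp
using System.Net;

// Raise the per-host connection limit before any requests are created.
// The default of 2 is exactly the bottleneck described above.
ServicePointManager.DefaultConnectionLimit = 1000;
```

Set this once at application startup – changing it after connections have already been opened to a host will not affect those existing service points.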

Windows Azure, Licensing and Saving Money

Chances are, if I asked you why you wanted to “go to the cloud”… one of your answers would be “to save money”. True, at first glance this would seem counterintuitive, because the price of renting cloud hardware is far higher than buying it yourself. This argument is presented to me by IT Ops folks quite often. But the beauty of the cloud – and in particular I’m talking about Windows Azure here – is that you pay “for what you use”. Herein lies the ability to save tons of money. I’m not going to go in depth into the whole rent-vs-buy argument in general… but instead what I’m going to tell you is a key to saving even more money once you decide to go Azure.

Quick Overview of How Azure Saves Money in General

I mentioned that I don’t want to get into this too much, but I feel that if you are relatively new to the idea of going to a hosted (or “public”) cloud environment – you may have done some bad math and decided that it would cost you too much money. So I feel compelled to give this brief explanation on a couple of scenarios that actually cost you far less.

Scenario 1: Daily Report Generator

Suppose you need to run a process every morning that takes 2 hours and chews up all of your massive business data to provide some excellent reports to your customers. As your company grows, you need to now buy new hardware just to run this very important task.

This scenario is very fitting for the Azure “burst-up” model. Because your process only runs for 2 hours a day – you can “rent” a VM (or 1,000 VMs) for just those 2 hours, and you only pay 1/12th of what an always-on VM would cost. Don’t forget you are also not paying for hardware upgrades, electricity, etc.

Scenario 2: Seasonal Business

Suppose your company does the vast majority of its business in just a few months. This makes sense if you are, let’s say, the Honey Baked Ham company, or a tax preparation company such as Jackson Hewitt (where I work). In this case, the cost of purchasing hardware that sits idle for many months out of the year is clearly a waste.

Now, How To Save Even More Money (with Licensing)

As is often the case, technology advances faster than legal entities can keep up. In this example, public clouds (Amazon EC2, Windows Azure, etc.) allow you to quickly burst up 1,000 virtual machines for just a day if you need to. But then, what if you need to run licensed software such as Microsoft SQL Server? Are you going to buy 1,000 licenses just to use SQL for that short time? No way! Well, in this case the legal team at Microsoft has thought of that. So… you will be able to “rent” SQL Server licenses by the hour, just like you can rent the actual virtual machine.

By the way – I’ve not posted this until now because I’ve been under NDA… also, there is a whole lot more I’d like to say, but can’t yet. Here is the publicly available Microsoft site where I’m quoting from: https://www.windowsazure.com/en-us/pricing/details/#header-3. And here is a screenshot in case the link above changes in the future :)

OK, so now that my paranoid “I’m not spilling any beans” has taken place, I’ll continue.

You might be asking, “What is the point here?” No, I’m not simply restating the fact that you can rent SQL licenses by hour in Azure. The point I’m about to make can save you thousands of dollars a month. And here it is…

Test your payload before choosing a VM size! The reason why I’m stressing this is as follows. If you notice in the chart above, the SQL licensing for an XL VM is twice that of a Large VM. This may seem normal to you, because an XL VM gives you twice as much horse-power in terms of CPUs and RAM. And typically, you pay for CPUs when it comes to SQL licensing.

The “gotcha” here is that an XL VM may not necessarily give you 2x the performance of a Large. The reason for this (in the case of Windows Azure) is because while you get 2x the CPU, RAM and bandwidth… you do not get any more performance in your most likely bottleneck – disk speed. I’ve brought this out several times in the last few blog posts, particularly this one: “Windows Azure IaaS Performance – SQL and IOps”.

In my most recent testing, I’ve proven out that a write-heavy workload on an Extra Large VM has the same throughput as a Large VM in Azure. For my test, the CPU usage was less than 5%, the RAM usage was less than 5% and the network usage was nearly 0%. So cutting back to a Large VM cut the price in half, but didn’t hinder the throughput at all.

Take these savings and multiply them by the number of machines in your workload and you can save a lot of money. In the case of the real-world reason why I’m running this test, we’re saving tens of thousands of dollars every month.

Configuring Disks in Azure for More Performance

In an ongoing effort to get the most performance out of Azure, I’ve run several tests that lead me to this next tip, which is both very helpful and extremely easy to implement. Remember, you’re paying for the time you rent the VM. So, the more you get out of each VM, the more you save.

How do I get more disk performance in Azure IaaS?

Basically, my recommendation comes down to this: Think of disks in Azure as network resources, and not as disks. That line of reasoning brought me to test out different NTFS cluster sizes (or “allocation units”).

As a side point – Exercise #4 of the “Windows Azure Platform Training Course” says

leave the default Allocation unit size

It doesn’t give a reason… and the reason is probably because it’s “the default”. However, I recommend that you max out the size to 64K (instead of the default 4K). In my (very expensive) testing, I see that this yields a 20% increase in disk performance. And that increase carried directly over into SQL Server disk performance as well!
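For reference, setting the 64K allocation unit is a one-liner when formatting the volume. The drive letter and label below are placeholders for illustration – and note that formatting destroys any existing data on the volume.

```cmd
REM Format a data volume with a 64K NTFS allocation unit.
REM "X:" and the label are assumptions -- substitute your own volume.
format X: /FS:NTFS /A:64K /V:DataStripe /Q
```

You can confirm the cluster size afterwards with `fsutil fsinfo ntfsinfo X:` (look at “Bytes Per Cluster”).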

As they say, a picture is worth a bunch of words – so here is what my testing revealed.

Explanation of the Performance Test

To explain the above – I mounted 16 disks on an Extra Large Azure IaaS VM. I then created a script that would stripe the disks together and format them NTFS with a 4K cluster size. After that, the script would write a 1GB file to the disk. The test was run 30 times, and only the time to write the data was measured – not the time to create or delete the file.

Next, the script would reformat the striped volume as NTFS with a 64K cluster size. The 1GB-file-test would then commence.

This process was repeated many times – yielding 300 results for 4K and 300 results for 64K. Also, the method of going back and forth ruled out any thought of a particular time of day being the issue.

The average time it took to write the 1GB file with a 4K cluster size was 2.5 seconds. The average time using the 64K cluster size was 2.0 seconds – a 20% reduction in write time. It was also interesting to note that the 4K cluster size suffered from “spikes” more often than the 64K size did. You can download the spreadsheet of data here: AzureDiskHammer_log_20120827.csv

Hacking Azure for More Disk Performance

So, this is going to be a quick post. I’ve been doing a lot of work in trying to hit the limits of Azure – really getting a feel for the beast. As I’ve already posted about, one of the biggest limitations that I’ve run into has been the disk performance.

We’ve been in close communication with Microsoft on a lot of things pertaining to Azure, IaaS, SQL and the like – none of which I can talk about here. But, what I can talk about is a side project that I did to … guesstimate … what the performance of IaaS can be in the future.

Why the “Hack”?

Here are the “truths” that you likely already know about IaaS / disk performance in Azure.

  1. Azure IaaS uses a single storage account to host your VHD.
  2. Azure storage accounts are limited to 5,000 transactions per second.

Like I’ve mentioned before, that comes out to about two or three 15k spindle hard drives striped together. Here are a couple of things you likely don’t know (or haven’t thought of in the context of this conversation).

  1. Azure has had the ability to persist hard drives for over 2 years now.
  2. Unlike IaaS (which is limited to 1 storage account), PaaS can mount 16 different storage accounts on one box!

The above statements refer to using the CloudDrive class, which you have to mount in code. Because it requires “RoleEnvironment.IsAvailable” to be true, this feature only works in PaaS. I consider that bad code, but I understand why this happened.

Therefore, the “hack” is to make a single worker role in PaaS that mounts multiple storage accounts into a single striped volume. Then I remote into that machine, install SQL Server and run my previous test to see what the performance of IaaS could be one day (if Microsoft decided that multiple storage accounts was the way to get more IOPs in IaaS).
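For those who haven’t used CloudDrive before, the mounting step looks roughly like this (a simplified sketch from the SDK 1.x era – the configuration setting name and VHD path are placeholders, and error handling plus local cache initialization are omitted):

```csharp
// Sketch: mount one storage account's VHD as a local drive inside a
// PaaS worker role. Repeat per storage account, then stripe the
// resulting drives together at the OS level.
var account = CloudStorageAccount.Parse(
    RoleEnvironment.GetConfigurationSettingValue("Storage1")); // placeholder setting name

// A CloudDrive is backed by a page blob in that account.
CloudDrive drive = account.CreateCloudDrive("drives/data1.vhd"); // placeholder path
drive.Create(512000); // size in MB; throws if the drive already exists

// Mount returns the local path (a drive letter) for this account's disk.
string path = drive.Mount(0, DriveMountOptions.None);
```

Since each mounted drive is backed by a different storage account, each one gets its own slice of the per-account transaction limit – which is the whole point of the hack.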

Disclaimer and Code

I hope that I don’t have to tell you that this is obviously not recommended for a production solution. The whole point of the exercise was to see how far Azure would stretch, and to give an idea of the (potential) future. My goals were happily met. I achieved more than double the performance of my previous test. And I’m sure with the right setup, I could have gotten even more.

Here is the code (that will automatically create a multi-account striped drive) in case you want to do your own performance tests with your app in the cloud: AzureDriveMounterUtility.zip

You’ll need to take the following steps for it to work:

  1. Allow Remote Desktop connections (so you can log in and test your app).
  2. Edit the App.Config with real storage account credentials.

How To Setup Peer-to-Peer Replication in Azure IaaS (SQL Server 2012)

Recently at my company we have evaluated several options for replicating data between Azure data-centers, as well as on-premises. One of the most compelling choices (due to simplicity, and a proven track record) was just using SQL Server’s built-in replication features.

This post will answer the following questions:

  • How do I setup Peer-to-Peer replication in SQL Server 2012?
  • What extra steps do I need to take in a Windows Azure IaaS environment?
  • What is the latency of replication between two different Azure data-centers?

How do I setup Peer-to-Peer replication in SQL Server 2012?

While I’m not a guru on SQL Server, I have been using it pretty heavily for years in distributed, “n-tier” environments. I’ve used synchronous and asynchronous replication, database mirroring, and fail-over partnering with a witness server. Of all of these different options, I have to say that Peer-to-Peer replication is my favorite (and I think the easiest) thing to setup. I also believe it gives tremendous value as you get multi-master, high-availability and scaled-up read performance.

The basic steps for setting up p2p replication in SQL Server are as follows. As a side note, this can all be done in 2008 as well, but 2012 makes it a little easier by removing one step (of right-clicking, going to properties and “enabling” p2p). So here are the steps:

  1. Enable / configure your server to act as a “distributor”.
    1. This means that the server will be able to “push” updates to other members of the replication group.
  2. Create a “publication”.
    1. This basically means that you pick the database and tables to replicate.
  3. Configure the “topology”.
    1. AKA – add other servers to the group.

While it may seem like an over-simplification, those really are the steps. Here they are in more detail with screen-shots. For the sake of this article, we will assume that we already created a database called “AzureReplicationDB”, and that there is a table named “Customers”.

1. Enable and Configure Your Distributor

The first step is to enable distribution on your SQL Server. You should do this step on all of the servers that will participate in the replication topology. Simply right-click on the “Replication” node in “Object Explorer” and choose the “Configure Distribution…” option.

SQL Server 2012 - Configure Distribution

You can just keep clicking “next” for all of the default options – especially “‘XyzServer’ will act as its own Distributor”.

2. Create a Publication

Next, expand the “Replication” node and right-click on “Local Publications” and choose the “New Publication…” option.

SQL Server 2012 - New Publication

You can mostly click “next” through this dialog as well, except that you need to make sure to select “Peer-to-Peer publication”, choose your tables that you want to replicate and configure the security (login/password). For simplicity’s sake, I created a single user on all of the servers that had sysadmin rights :) – never do that in a production environment!

SQL Server 2012 - Select Peer-to-Peer publication

SQL Server 2012 - Choose what tables to replicate

Finally, name your publication. For this example, I named it “Demo_Publication”. The name should be the same across all of the servers in the replication topology.

3. Configure the Replication Topology

The final step here is to make all of the servers aware of each other. The nice thing is that you can do this from one location, and all of the servers will be set up. A very important note here is that all of the servers need to already be created, configured for distribution and (should) have the blank database (with the same schema) across the board. This will make your world a lot easier.

So, right-click on the publication and choose the “Configure Peer-to-Peer Topology…” option. Then follow the wizard to 1) choose the publication and 2) add all of the peer nodes. Again, you will have to configure the security account that they all run under.

SQL Server 2012 - Configure Peer-to-Peer Topology

And that’s it! After about 30 seconds, you should be able to insert a row into one of the servers, and then go SELECT * FROM dbo.Customers in the other server to see the results. You can now insert, update, delete (and oddly enough, modify the schema of the Customers table) in any of the database servers that are connected… and all of the transactions will be replicated to each of the other nodes.
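A quick smoke test for the finished topology might look like this – note that the Customers schema was never spelled out in this example, so the column here is a placeholder assumption:

```sql
-- On node A (e.g. the East data-center):
INSERT INTO dbo.Customers (Name)  -- "Name" column is a placeholder
VALUES (N'Test from East');

-- On node B, a few seconds later:
SELECT * FROM dbo.Customers;
-- The inserted row should appear once replication catches up.
```

If the row doesn’t show up, Replication Monitor (right-click the “Replication” node) is the first place to look for agent errors.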

What extra steps do I need to take in a Windows Azure IaaS environment?

Because Windows Azure IaaS is just what its name suggests – an “Infrastructure as a Service” – there isn’t a lot of special work that you need to do to get SQL Server replication working between Azure data-centers, or on-prem. Actually, the only thing you need to do is punch a hole in the firewall – port 1433 by default. This shouldn’t come as much of a surprise to you, because it’s what you would have to do anyway in your own data-center.
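On the VM itself, that firewall hole is a one-liner (the rule name is arbitrary). You’ll also need to open the corresponding endpoint for the VM on the Azure side.

```cmd
REM Allow inbound SQL Server traffic on the default port.
netsh advfirewall firewall add rule name="SQL Server 1433" dir=in action=allow protocol=TCP localport=1433
```

In production you’d want to restrict the rule to the remote sites’ addresses rather than allowing all inbound traffic on 1433.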

Something else that I had to do, which may have nothing to do with Windows Azure, but may simply be a SQL Replication thing, is to edit the “hosts” file in Windows so that I can refer to the other SQL Server by machine name, instead of by its full DNS name. What I mean is, SQL Server seems to want to connect to “Other-SQL-Box” … and not “Other-SQL-Box.cloudapp.net”.
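The hosts entry itself is trivial – something like the following, where the IP and machine name are placeholders for the other node’s public VIP and its Windows computer name:

```
# C:\Windows\System32\drivers\etc\hosts
# Placeholder values -- use the other node's actual VIP and machine name.
168.62.0.10    Other-SQL-Box
```

Keep in mind that if you tear down and redeploy the cloud service, the VIP can change, and these entries would need updating.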

What is the latency of replication between two different Azure data-centers?

This question is a tricky one, largely because I am not a SQL DBA, and therefore I don’t know all of the tricks for gaining performance out of SQL Server Peer-to-Peer replication. I did, however, make a few modifications to the Distributor Properties by changing the transaction “CommitBatchSize” to 1,000 (instead of 100), “MaxBcpThreads” to 16 (instead of 1) and the “PollingInterval” to 1 second (instead of 5).

I don’t know if increasing the threads helped at all, because I believe transactional replication can’t use multiple threads. But I do know that pumping up the batch size did increase throughput significantly.

All this being said – I was able to insert 1,000 records in a small table on the Azure East data-center, and they were replicated to the Azure West data-center in 1 second. If I inserted 10,000 records, it took about 10 seconds. This is pretty promising when you consider all that you’re getting here.

As a side point, I may have been able to get more performance, and you might too, if you follow some of the guidelines that I mentioned in this previous article about SQL IOps in Azure IaaS: http://tk.azurewebsites.net/2012/06/18/windows-azure-iaas-performance-sql-and-iops/

Windows Azure IaaS Performance – SQL and IOps

It seems that recently there is a lot of buzz in the technology space around the word “Cloud”, though most people seem to think that applying old ways of developing software and just hosting them on someone else’s hardware equals achieving cloud greatness. In truth, the “cloud” is a new beast – a completely different tool. One of the greatest examples of confusion is with the newly announced Windows Azure IaaS (Infrastructure as a Service).

This post will answer the following questions:

  • Am I just renting rack space from a Microsoft data center?
  • What is “horizontal” vs “vertical” scaling?
  • What speed is the hard-drive on an IaaS VM?
  • Does mounting more drives give more performance?
  • How does the Azure Storage SLA (service level agreement) get involved?
  • How does SQL Server perform in the IaaS environment?
  • What needs to be improved before I should use IaaS in Azure?

Am I just renting rack space from a Microsoft data center?

This first question is highly important to help you understand the “cloud” better. Renting your own “dedicated server”, whether virtual or physical, has been around for many years. So you may be tempted to think of IaaS as the same thing – just provisioning a new instance of Windows Server or Linux in a data center that is armed with guards on Microsoft’s salary. That is the worst way of looking at IaaS.

Microsoft itself has been marketing IaaS as an ‘on-ramp to PaaS (platform as a service)’. Meaning, you should be using PaaS for your new designs, but they realize that some systems (such as Active Directory, SharePoint, SQL Server Enterprise edition) might be mission critical – and you may not want to punch a hole in your firewall… in fact, you may want your entire data center to be hosted by someone else. To this end, IaaS is an on-ramp.

I personally think that viewing IaaS as simply an on-ramp to PaaS is the second worst way of looking at it. Why? Because the beauty of IaaS is not in the “hosting of a VM somewhere”… but in the “turn a dial, and now you’ve multiplied your horse power by 1,000” – meaning, using it as an [Infrastructure as a] SERVICE. This is known as “horizontal” scaling.

Quick definition: What is horizontal versus vertical scaling? Imagine having a SQL Server box with 16GB of RAM and 4 CPUs that handles all of your business needs. Suddenly your business increases 10 fold! So, you “vertically” scale your SQL box to have 160GB of RAM and 40 CPUs! However, with the cloud, you would instead “horizontally” scale by having 10 separate SQL boxes with 16GB of RAM and 4 CPUs each. Makes sense right?

What speed is the hard-drive on an Azure IaaS VM?

This question may seem random, but it’s not. In fact, it’s very important when you are considering IaaS. Consider, in the example above about vertical vs horizontal scaling – I mentioned RAM and CPU power… but if you know anything about administering SQL Servers, your #1 hang-up is actually going to be IOps (Input/Output Operations per Second). If you only have one 15K RPM spindle hard drive, then not even 4TB of RAM and 400 CPUs will increase your performance one bit.

So, because we could argue all day about spindles vs SSDs (Solid-state drives), virtual vs physical, etc. – we will just talk in terms of IOps when it comes to disk performance. So, at the time of writing this post, one 15k spindle is roughly 200 IOps. Currently, a single VHD (Virtual Hard Disk – aka, the thing that Windows Azure uses as a hard drive) yields up to 500 IOps. So, think of a VHD as a pair of 15k spindles.

With Windows Azure Virtual Machines, you can have multiple VHDs for a single VM. You will have to pay more (by bumping up the VM “size” from small to large, extra large, etc.). So with an extra large VM, you can have up to 17 disks (1 for the main drive, and 16 mountable VHDs).

Does mounting more drives give more performance?

I’m so glad you asked! The short answer is “no for right now, yes in the near future”. Actual practice and stress testing shows that you don’t (currently) get (much) more IOps from 1 disk to 16 disks. The reason is because of a current limitation in IaaS. Actually, to me it’s the biggest limitation and it’s the number 1 issue that must be improved before I would personally recommend IaaS to larger organizations.

The issue comes down to the current “Azure Storage SLA”. According to the service level agreement for Azure storage, a single account can only expect up to 5,000 IOps. Now, keep in mind the following points:

  1. You may likely be able to get more IOps than 5k… but Microsoft is not contractually obligated to give you any more than that.
  2. IaaS is currently in “preview” – and since the storage SLA was made long before IaaS, it is clearly incompatible.

I have spoken with a few people at Microsoft about it, and Brad Calder, the General Manager of Windows Azure Storage, said

We plan to increase these by GA, but for where we are right now for the preview, … plan on the above, and use multiple disks and storage accounts as appropriate.

I threatened to hug him – and a few other people – for that statement. This leads us to our last question:

How does SQL Server perform in the IaaS environment?

Well basically, SQL performance can be easily predicted in any environment where you know the following 3 pieces of information:

  1. What is the CPU power?
  2. How much RAM does SQL Server get?
  3. What does the storage IOps look like?

Since we know the answer to all three of those questions, we know that SQL Server in an IaaS environment will perform pretty well – like a “good” box for a small company. I recommend that you mount 16 drives, stripe 12 of them together and use it as the MDF file location, then stripe the last 4 together and use it as the LDF file location.
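The 12-disk/4-disk split can be scripted with diskpart. The disk numbers below are placeholders – run `list disk` first and substitute your actual data disks, and note the 64K allocation unit from the earlier cluster-size testing.

```cmd
REM diskpart script sketch: stripe 12 disks for the MDF volume and
REM 4 for the LDF volume. Disk numbers are assumptions -- verify
REM with "list disk" before running.
create volume stripe disk=1,2,3,4,5,6,7,8,9,10,11,12
format fs=ntfs unit=64k quick
assign letter=M

create volume stripe disk=13,14,15,16
format fs=ntfs unit=64k quick
assign letter=L
```

Then point SQL Server’s data files at M: and its log files at L: when creating (or moving) the databases.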

Here are a couple of screen shots to help prove the point. The first is a single “Small” VM running SQL Server. This is a console app that opens 100 connections and just hammers as many insert statements as it can. The box was not able to handle much more than 100 connections (due to CPU/RAM power).

Now let’s see that same test (on that same VM), but this time bump the size up to “Extra Large”, split the MDF file out onto a 12-striped VHD volume and the LDF file onto a 4-striped VHD volume. Being an XL box, we get 8 times more CPU and RAM.

In conclusion – IaaS is already a good tool for typical use… it is a great and brand new tool when it comes to being able to quickly ramp up or down… and (if Microsoft follows through – as they likely will), it will be even better for the general availability release!