Three Years of Successful Delivery of Public Cloud Services with Apache CloudStack

In mid-2014 we decided to move our VPS services away from the OpenQRM platform we were using at that time, for several reasons. First and foremost, OpenQRM had initially been chosen without a deep analysis of business needs and therefore satisfied neither our manageability nor our operational requirements. Secondly, the development approach of the OpenQRM creators was quite odd: the product was based on a bunch of bash scripts, PHP code and plain workarounds. In short, our customers were unhappy, the service wasn't much cop and incurred losses rather than earning profit. Given the small size of our regional service provider company, we had not planned to build a large VPS service at that point. Our main task was the transition to a stable and reliable solution that would satisfy the following requirements:

  • easy to deploy and configure;
  • works out of the box and has a broad user base;
  • offers straightforward error diagnostics;
  • has a user-friendly interface;
  • has an API to manage resources.

The planned size of the infrastructure was to serve 200+ VMs, with an overall 512–1024 GB of RAM, 128–256 Xeon E5-2670 cores and 10–20 TB of storage. The service also had to provide public IPv4 addresses to VMs; no IPv6 was planned. As for the other essential components, we decided to use the KVM hypervisor and classic NFSv3 storage.

Our next step was to perform a comparative analysis (in practice, to deploy a cluster following the manuals and verify that all features work as expected) of several existing products: Apache CloudStack, OpenStack and Eucalyptus. We did not consider any platform without an API. The results of this analysis led us to Apache CloudStack (ACS) as the platform for our service. Although it is quite hard to recall the decision process in detail retrospectively, it is safe to say that we had a completely functional ACS-based infrastructure within 1–2 days. It was ACS 4.3, and we are still using that version (we see no need to update to more recent ACS releases because the infrastructure is stable, reacts predictably to hardware management routines and meets user needs).
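
The API requirement mentioned above deserves a concrete illustration. Every action in ACS is available through a signed HTTP API; the minimal sketch below shows how a request could be signed and sent from Python. The endpoint, keys and the choice of listVirtualMachines are placeholders rather than our production values.

    import base64
    import hashlib
    import hmac
    import json
    import urllib.parse
    import urllib.request

    # Placeholder endpoint and credentials; in ACS they come from the user's profile.
    ENDPOINT = "https://acs.example.com/client/api"
    API_KEY = "YOUR_API_KEY"
    SECRET = "YOUR_SECRET_KEY"

    def acs_request(command, **params):
        """Sign and execute a single CloudStack API command (HMAC-SHA1 signature)."""
        params.update(command=command, apiKey=API_KEY, response="json")
        # The signature is computed over the lowercased, alphabetically sorted query string.
        query = "&".join(
            f"{k}={urllib.parse.quote(str(v), safe='')}"
            for k, v in sorted(params.items(), key=lambda kv: kv[0].lower())
        )
        digest = hmac.new(SECRET.encode(), query.lower().encode(), hashlib.sha1).digest()
        signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="")
        with urllib.request.urlopen(f"{ENDPOINT}?{query}&signature={signature}") as resp:
            return json.loads(resp.read().decode())

    if __name__ == "__main__":
        # List the caller's virtual machines as a smoke test of the API access.
        print(acs_request("listVirtualMachines"))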

At the moment ACS 4.10 is planned for release soon; however, it does not contain many functional changes. Here we need to digress a little and point out that ACS provides a number of different services, and by choosing between them you can build a particular type of cloud: with or without load balancing, with NAT or direct IPs, with or without external security gateways, and so on. In short, one deployment configuration may show almost zero difference between ACS 4.3 and 4.10, while another may show a significant amount.

We use the simplest deployment option for our service: a public cloud with a shared address space and without additional network services, which can be described as ACS with Basic Zones without Security Groups. Because of the simplicity of this deployment model there is logically not much room for ACS functionality changes, so an update to ACS 4.10 would bring us only IPv6 support. The fact is that ACS is often used to provide complex virtual services and is consequently developed faster in that direction (so-called Advanced Zones), so IPv6 support for Advanced Zones has existed for quite a long time already; for Basic Zones it only becomes available with the new ACS 4.10 release. The most important aspect to consider, whether the cloud is provided as a B2B service or used as an enterprise private cloud, is which particular functionality and service features are required; from there it becomes clear whether any of the differences between ACS 4.3 and 4.10 are relevant.

So, here is the experience we have gained during 3 years of operating the infrastructure. We believe that the aspects described below, if considered and followed, may help to achieve smooth infrastructure operations.

Availability

Let's start with uptime. We have servers with more than a year of uptime and have not identified any cases in which ACS stops functioning properly. Most system failures happen due to power outages. Over the whole operational period we have paid compensation for an SLA violation only once.

Virtual Router

From our point of view, the virtual router is the most complex, unclear and ugly component in ACS. It provides DHCP, forward and reverse DNS zones, routing, load balancing, static NAT, and access to user data, passwords and SSH keys for VM templates (cloud-init). This component can also be made fault tolerant, but that is irrelevant for our deployment, as ACS automatically recreates it after a failure without any change to its functionality. If we were using Advanced Zones with sophisticated network functions, the virtual router would have a critical system role.
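
To illustrate the cloud-init-style part of this role: inside a guest, user data and metadata are fetched over HTTP from the virtual router (which also acts as the DHCP server). The sketch below is only an illustration; the router address is taken from an environment variable rather than parsed from the DHCP lease, and the exact set of metadata keys may differ between ACS versions.

    import os
    import urllib.request

    # Address of the CloudStack virtual router (normally the DHCP server handed
    # out to the guest); here we assume it is provided explicitly.
    ROUTER = os.environ.get("VR_ADDRESS", "10.1.1.1")

    def fetch(path):
        """Fetch one metadata item served by the virtual router over HTTP."""
        with urllib.request.urlopen(f"http://{ROUTER}/latest/{path}", timeout=5) as r:
            return r.read().decode().strip()

    if __name__ == "__main__":
        # Typical items exposed by the ACS metadata service (names may vary).
        for item in ("meta-data/instance-id", "meta-data/public-ipv4", "user-data"):
            try:
                print(item, "=>", fetch(item))
            except Exception as exc:
                print(item, "=> unavailable:", exc)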

To expand on the topic, there are several issues related to the virtual router in ACS 4.3; some of them were still present up to ACS 4.9, and only 4.10 should finally bring some resolution. The first problem we identified is a DHCP problem in Debian: the router does not return DHCP information because of a well-known bug (read more here).

Additionally, we had problems with log rotation, which caused the virtual router file system to run out of space; as a result, the virtual router simply stopped working. In the end we implemented quite a lot of changes in the VM and modified its scripts (very possibly breaking compatibility with other features in the process), but achieved proper functioning of the virtual router. Currently we reboot this component every month or two, because we are in the last phase of this cloud's lifecycle and implementing further changes does not make much practical sense. To conclude, there are several other virtual router problems that may be relevant to large infrastructures with several thousand VMs (one such problem is well described here, for example). It is unclear whether that problem still exists in ACS 4.10, but the committers were quite enthusiastic about solving it (it is definitely solved in the Cosmic fork).
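
For reference, the kind of change we made can be approximated by a stricter log rotation policy inside the router VM. The snippet below is only a sketch with assumed log paths and sizes; the actual files on a given system VM version may differ.

    # /etc/logrotate.d/cloud-vr  (illustrative sketch, assumed paths)
    /var/log/cloud.log /var/log/dnsmasq.log {
        size 10M          # rotate as soon as a file grows beyond 10 MB
        rotate 4          # keep at most four rotated copies
        compress
        missingok
        notifempty
        copytruncate      # truncate in place instead of restarting daemons
    }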

Just to mention, it is possible to use a Juniper SRX or Citrix NetScaler appliance instead of the Debian Linux based virtual router. There is also an initiative to implement a virtual router based on VyOS, but it will hardly come to anything as there is no serious player to back it.

Virtualization Host: iptables and ebtables Rule Scripts

ACS requires a KVM agent, which is deployed to every compute node; the agent configures iptables and ebtables rules that limit the network capabilities of a given VM, such as preventing MAC address changes, requesting of wrong IP addresses or launching rogue DHCP servers. For some unknown reason those scripts did not work properly for us: rules were lost and traffic stopped reaching the VM (just to note, on our current demonstration stand running ACS 4.9.2 the problem is absent). So we basically rewrote the Python script and achieved correct behaviour, although it may be that the problem itself was caused by our trial ACS installation.
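
For readers unfamiliar with these rules, the sketch below shows roughly what they enforce. It is a simplified, hypothetical illustration in the spirit of the agent's Python security-group script, not the actual ACS code; the interface, MAC and IP values are placeholders.

    import subprocess

    def run(cmd):
        """Apply one rule, raising if the command fails."""
        subprocess.run(cmd, check=True)

    def protect_vm_nic(vif, mac, ip):
        """Lock a VM's NIC to its assigned MAC/IP and block rogue DHCP servers."""
        # Drop any frame leaving the VM with a spoofed source MAC.
        run(["ebtables", "-A", "FORWARD", "-i", vif, "-s", "!", mac, "-j", "DROP"])
        # Drop ARP packets claiming an IP other than the assigned one.
        run(["ebtables", "-A", "FORWARD", "-i", vif, "-p", "ARP",
             "--arp-ip-src", "!", ip, "-j", "DROP"])
        # Drop DHCP server responses originating from the VM (rogue DHCP).
        run(["ebtables", "-A", "FORWARD", "-i", vif, "-p", "IPv4",
             "--ip-protocol", "udp", "--ip-source-port", "67", "-j", "DROP"])
        # Allow IP traffic only from the VM's own address.
        run(["iptables", "-A", "FORWARD", "-m", "physdev",
             "--physdev-in", vif, "!", "-s", ip, "-j", "DROP"])

    if __name__ == "__main__":
        # Placeholder values; in reality they come from the ACS agent.
        protect_vm_nic("vnet0", "06:aa:bb:cc:dd:ee", "203.0.113.10")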

Several Primary NFS Storages in a Cluster

Simply put, this is a heuristic rule we ended up following: do not use several primary storages in one cluster (a cluster is an ACS hierarchical unit that includes several virtualization hosts and storages and is used to limit failure domains or to group hosts of the same hypervisor type). We have experienced significantly lower cloud stability with several separate storages in one cluster than with all storage merged into a single one. Currently we use a big server with RAID6 on Samsung 850 Pro SSDs and do regular backups of the whole cloud.
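
As an aside, registering the single NFS primary storage for a cluster is itself just one API call. Assuming the hypothetical acs_request helper from the API sketch earlier in this article, it might look like the call below; all IDs, hosts and paths are placeholders.

    from acs_api import acs_request  # hypothetical module holding the helper from the earlier sketch

    # Register one NFS primary storage for the whole cluster; the zone, pod and
    # cluster IDs would normally be discovered via listZones/listPods/listClusters.
    pool = acs_request(
        "createStoragePool",
        zoneid="ZONE_ID",
        podid="POD_ID",
        clusterid="CLUSTER_ID",
        name="primary-nfs-1",
        url="nfs://10.0.0.5/export/primary",
    )
    print(pool)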

ACS Self-service Portal

The ACS interface is quite conservative and oriented towards professional system administrators. Hence, an average user who is not accustomed to complex VM management tools will experience significant difficulties with it, and a service provider should consider creating additional instructions and manuals. From this point of view, big market players such as AWS and DO provide users with a much better UX. As a result, from time to time the help desk has to handle long phone calls explaining to users how to perform a certain operation, e.g. how to create a template from a running VM.
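
The template-from-a-running-VM question is a good example of why users call: in ACS it is a two-step workflow (snapshot the ROOT volume, then build a template from that snapshot) rather than a single button. Below is a rough sketch using the same hypothetical acs_request helper; IDs are placeholders and the exact response keys may vary between ACS versions.

    from acs_api import acs_request  # hypothetical module holding the helper from the earlier sketch

    VM_ID = "VM_UUID"  # placeholder; taken from listVirtualMachines in practice

    # 1. Find the VM's ROOT volume.
    root = acs_request("listVolumes", virtualmachineid=VM_ID,
                       type="ROOT")["listvolumesresponse"]["volume"][0]

    # 2. Snapshot the volume (async; in practice poll queryAsyncJobResult until it finishes).
    snap = acs_request("createSnapshot", volumeid=root["id"])["createsnapshotresponse"]

    # 3. Build a template from the finished snapshot.
    tmpl = acs_request(
        "createTemplate",
        snapshotid=snap["id"],
        name="my-vm-template",
        displaytext="Template taken from a running VM",
        ostypeid="OS_TYPE_UUID",  # placeholder; discover via listOsTypes
    )
    print(tmpl)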

Conclusion

Looking back over 3 years, we believe the list of described problems is sufficient to understand the most critical factors influencing service quality. However, we admit that it is not a complete list of the complications and incidents that required administrator assistance during this period.

Meanwhile, we are planning the deployment of a new public cloud with 288 Xeon E5-2670 cores, 1536 GB of RAM and 40 TB of SSD storage, based on ACS 4.10 (Basic Zones with Security Groups).

To provide our customers with a higher quality of service, we have also created an open-source product called CloudStack-UI, which reflects our up-to-date service experience and provides a top-notch alternative user interface for ACS deployed in this particular configuration.