Orchestrating and scaling servers can be a challenge, especially in a cloud-focused world where the ability to scale is key to being cost-efficient and providing a great user experience. Icebird Studios was facing exactly that problem and was looking for a way to bring its stateful game server, a C++ and Unreal Engine 4 executable, to a cloud infrastructure with minimal administrative overhead and minimal need to change the server code.

The hackfest and value stream mapping that we conducted with Icebird helped us identify a solution: encapsulating the game server in Docker and building a reliable, automated deployment process for it using Azure Resource Manager templates, Azure Functions, and DevOps practices.

The hack team:

  • Barry Krein – Backend Developer, Icebird Studios
  • Andreas Pohl – Technical Evangelist, Microsoft

Customer profile

Icebird Studios is an independent game development studio based in Munich, Germany. The team is experienced in cross-platform development and has released games on all major platforms, including PC, iOS, Android, and SmartTV. They decided to use Azure as a back-end provider for their new strategy multiplayer game, Subsiege. The global scale and the potential to automate the infrastructure in Azure helped to launch the game worldwide.

Problem statement

In the gaming and mobile world, it can be hard to predict the number of customers you will have once you release your game. For this kind of use case, it is important to have a scalable back-end architecture and an infrastructure that is able to run it cost-efficiently.

Icebird Studios already had its game server developed with Unreal Engine 4 for Linux and was looking for a way to orchestrate and scale the game server globally. The problem was that all the scaling and orchestration was happening by hand, which meant that at least one team member needed to manage the infrastructure rather than work on new features for customers. This would have a negative impact on the small team managing the infrastructure, especially with the unpredictable load and global rollout of the game server.

For these types of real-time games, it is important to be as close as possible to the user to reduce the latency to the game server and ensure the best gaming experience for the players. This means the game server needs to be deployed globally in different Azure regions.

The goal of the hackfest was to enable Icebird Studios to automatically scale its game servers up and down depending on current usage, without the need to rewrite the game server at all.

Identifying opportunities for improvement

To identify and understand the problem better, we started the hackfest by creating a value stream map.

Value Stream Mapping

By generating the map, we realized that most of the steps in the build and deployment pipeline were manual steps that a developer had to perform with every iteration of the server. This meant the developers needed to provision hundreds of servers in many different regions, which could take days to set up. It showed us that most of the complex and time-consuming work was spent on orchestration/distribution and scaling of the infrastructure. By improving this part of the process, we would be able to make releases faster and more consistent and free up resources to work on features for the product.

The load balancing of the back end, which decides which players start on which server, was already implemented but wasn’t able to provision new servers or deprovision existing ones. The value stream map showed that we needed to build something that could scale the infrastructure and communicate with the existing load-balancing code.

After finishing the value stream map, we focused on the following challenges:

  • Improving the deployment process
    • Reducing deployment time by utilizing Infrastructure as Code
    • Making it easier to deploy new regions by utilizing continuous deployment
  • Automating the deployment process
    • Scale up
    • Scale down
    • Set up new regions automatically by utilizing continuous deployment

Improving the deployment process

The key to automating the deployment of infrastructure in any environment is a reproducible and reliable way of deploying. By leveraging DevOps practices, we were able to reduce the deployment time significantly, from days to minutes, and build the foundation for automating the deployments. The first thing we did to improve the overall deployment process was to define the Infrastructure as Code. This gave us an incremental deployment system controlled by version control, which enabled us to leverage the advantages of continuous integration and deployment.

The architecture

After the value stream mapping, we started defining how an environment would look. We talked about what a region needs in order to function and came up with an architecture that would be rolled out in many different regions (see the following diagram). This architecture was built using Azure Resource Manager templates to leverage the advantages of versioning and incremental, continuous updates that Infrastructure as Code provides.

Azure Architecture

Taking a closer look, you will see some interesting choices we made regarding the infrastructure. We started by provisioning single virtual machines (you may have noticed we decided against virtual machine scale sets). This is because a scale set isn’t able to handle all of our functional requirements. Scale sets always have a load balancer in front, and a virtual machine inside a scale set is reachable only through network address translation (NAT) on that load balancer, not through its own public IP address. Even though the game client can handle more than one port, we wanted to make sure not to break the experience for players whose firewalls block all non-well-known ports.

Another reason for deciding against scale sets is the stateful game server. If you scale down a scale set, it deprovisions the most recently provisioned virtual machine first. This is fine if you have an orchestrator that handles the state of your application and makes sure the state is persisted even when you scale down. In our case, it would have meant handling the load information ourselves and reworking the load-balancing script that was already running. We needed to make sure we never deprovision a virtual machine with active games running on it.

The provisioning

There are a couple of ways to manage the configuration of a virtual machine. In our case, we wanted a simple solution that doesn’t introduce unnecessary overhead. Because of this, we decided to do the provisioning inside the virtual machine with a custom script extension that runs a simple Bash script.

#!/bin/bash
## Automatically pulls the image with the "latest" tag

server_image="server.azurecr.io/server:latest"

sudo docker login --username user --password password server.azurecr.io
sudo docker pull $server_image

## Run the game server, mapping the public ports to the container's hard-coded internal ports
sudo docker run -d --name=server --add-host teamspeak:11.0.0.7 -p 27016:27016/udp -p 27015:52050/tcp $server_image

This Bash script pulls the latest version of the Docker container and starts a container running the game server on the virtual machine. The next step was building the Resource Manager template that executes this script as part of the deployment process.

{
  "type": "Microsoft.Compute/virtualMachines/extensions",
  "name": "[concat(variables('vmName'), '/', variables('extensionName'))]",
  "apiVersion": "2015-05-01-preview",
  "location": "[resourceGroup().location]",
  "dependsOn": [
    "[concat('Microsoft.Compute/virtualMachines/', variables('vmName'))]"
  ],
  "properties": {
    "publisher": "Microsoft.Azure.Extensions",
    "type": "DockerExtension",
    "typeHandlerVersion": "1.0",
    "autoUpgradeMinorVersion": true,
    "settings": {}
  }
}

This extension makes sure Docker is installed on the virtual machine before we try to execute the custom script for the provisioning. The following extension then runs the Bash script itself.

{
  "type": "Microsoft.Compute/virtualMachines/extensions",
  "name": "[concat(variables('vmName'), '/', variables('customExtensionName'))]",
  "apiVersion": "2015-06-15",
  "location": "[resourceGroup().location]",
  "dependsOn": [
    "[resourceId('Microsoft.Compute/virtualMachines/extensions', variables('vmName'), variables('extensionName'))]"
  ],
  "properties": {
    "publisher": "Microsoft.Azure.Extensions",
    "type": "CustomScript",
    "typeHandlerVersion": "2.0",
    "autoUpgradeMinorVersion": true,
    "settings": {
      "fileUris": "[split(parameters('fileUris'), ' ')]",
      "commandToExecute": "[variables('command')]"
    },
    "protectedSettings": {
      "storageAccountName": "[parameters('customScriptStorageAccountName')]",
      "storageAccountKey": "[parameters('customScriptStorageAccountKey')]"
    }
  }
}

Why Docker?

We chose Docker because we wanted to be able to run multiple game servers on one virtual machine simultaneously, but the listening port of the game server was hard-coded in the Unreal server. We didn’t want to make the server aware of different ports because the engine wasn’t designed for that. A Docker container solved the problem: we could expose a different public-facing port for each container and redirect it to the same internal port, for example by starting a second container on the same virtual machine with a mapping like -p 27017:52050/tcp next to the first.

So, why didn’t we use Kubernetes for orchestrating the containers? The reason is simple: orchestrators in Azure use scale sets and Azure Load Balancer so that you don’t need to worry about the infrastructure anymore. That is normally a good thing, and in most cases you should use an orchestrator. In this case, however, we needed a specific infrastructure configuration, which would have meant working around some characteristics of the orchestrators.

Automating the deployment process

After we finished the deployment part, we started to think about how to write continuously executed logic that scales the infrastructure described earlier. We had a task that needed to run automatically at fixed intervals. This kind of problem sounded like a great opportunity to use a timer-triggered Azure Function written in C#. We chose C# over JavaScript or another language because of the powerful Azure SDK integration, which we needed in order to manage our infrastructure programmatically.
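To illustrate the shape of such a function, here is a minimal sketch in the C# script (run.csx) model; the logging message and the commented steps are placeholders rather than the actual hackfest code:

using System;

public static void Run(TimerInfo myTimer, TraceWriter log)
{
    log.Info($"Scaling check executed at: {DateTime.Now}");

    // 1. Read the current load from the existing load-balancing state
    // 2. Decide whether new game servers are needed or idle ones can be deallocated
    // 3. Provision or deallocate through the Resource Manager SDK (shown in the following sections)
}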

Generating a service principal

Before you can get started, you need to create a service principal in Azure Active Directory and grant it the necessary rights on your subscription so that it is allowed to provision and deprovision resources. Detailed documentation on how to create a service principal can be found on this Azure documentation page.
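If you use the Azure CLI, one way to create such a service principal is az ad sp create-for-rbac --role Contributor, which prints the application Id, secret, and tenant Id that the following code expects. (This command is illustrative; scope the role assignment as narrowly as your scenario allows.)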

We decided to write a credential helper that wraps all the information the SDK needs to authenticate the service principal against the Resource Manager REST API. We did this to provide one easy interface for the different credential formats that the various SDK clients expect.

using Microsoft.Azure;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;
using Microsoft.Rest.Azure.Authentication;

public class CredentialHelper
{
    /// <summary>
    ///     The service principal client Id; in the Azure portal it is called Application Id
    /// </summary>
    private readonly string _clientId;

    private readonly string _clientSecret;

    /// <summary>
    ///     The tenant Id can be retrieved through the Azure CLI with az account show --subscription=ID
    /// </summary>
    private readonly string _tenantId;

    /// <summary>
    ///     Creates a new credential helper to use Resource Manager APIs and Azure SDKs
    /// </summary>
    /// <param name="servicePrincipalId">In the Azure portal it is called Application Id</param>
    /// <param name="servicePrincipalSecret">The secret for your service principal</param>
    /// <param name="subscriptionId">The subscription you want to access with the service principal</param>
    /// <param name="tenantId">The tenant Id can be retrieved through the Azure CLI with az account show --subscription=ID</param>
    public CredentialHelper(string servicePrincipalId, string servicePrincipalSecret, string subscriptionId,
        string tenantId)
    {
        this._clientId = servicePrincipalId;
        this._clientSecret = servicePrincipalSecret;
        this.SubscriptionId = subscriptionId;
        this._tenantId = tenantId;
    }

    public string SubscriptionId { get; }

    public ServiceClientCredentials GetServiceClientCredentials()
    {
        return ApplicationTokenProvider.LoginSilentAsync(this._tenantId, this._clientId, this._clientSecret).Result;
    }

    public SubscriptionCloudCredentials GetSubscriptionCloudCredentials()
    {
        // Acquire a bearer token for the service principal via ADAL and wrap it
        // for the older management SDKs that still expect SubscriptionCloudCredentials
        var context = new AuthenticationContext("https://login.microsoftonline.com/" + this._tenantId);
        var token = context.AcquireTokenAsync("https://management.core.windows.net/",
            new ClientCredential(this._clientId, this._clientSecret)).Result;
        return new TokenCloudCredentials(this.SubscriptionId, token.AccessToken);
    }
}
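Using the helper is then a one-liner wherever a client needs credentials. The values below are placeholders; in the real function they would come from the function app settings:

// Placeholder values; in the real function these came from app settings
var credentials = new CredentialHelper(
    servicePrincipalId: "<application-id>",
    servicePrincipalSecret: "<secret>",
    subscriptionId: "<subscription-id>",
    tenantId: "<tenant-id>");

// serviceCreds is what the ResourceManagementClient shown later expects
var serviceCreds = credentials.GetServiceClientCredentials();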

Deploy with a function

The next step was to deploy the defined Resource Manager templates and modify the parameters of the deployment from within the function. You can access files uploaded with your function through the file system path of the function app:


//Path to file in Project: FuncyAutoScale\AutoScaleFunc\AutoVmScale\Data\infrastructureTemplate.json
var pathToTemplateFile = "D:\\home\\site\\wwwroot\\FUNCTIONNAME\\Data\\infrastructureTemplate.json";

//Path to file in Project: FuncyAutoScale\AutoScaleFunc\AutoVmScale\Data\infrastructureParameters.json
var pathToParameterFile = "D:\\home\\site\\wwwroot\\FUNCTIONNAME\\Data\\infrastructureParameters.json";

var template = JObject.Parse(File.ReadAllText(pathToTemplateFile));
var parameters = JObject.Parse(File.ReadAllText(pathToParameterFile));

//Modify Deployment Parameters
var rnd = new Random();
var numberStorageAccounts = Convert.ToInt32(parameters["parameters"]["numberStorageAccounts"]["value"].ToString());
parameters["parameters"]["storageId"]["value"] = rnd.Next(numberStorageAccounts);

this.vmName = this.GetRandomString();
parameters["parameters"]["virtualMachineName"]["value"] = this.vmName;

To enable automatic scaling of the infrastructure, the function is called by a timer that is configured in the function.json of the scaling function. The schedule uses the six-field NCRONTAB format (second, minute, hour, day, month, day of week), so "0 10 * * * *" fires once per hour at minute 10.

{
  "disabled": false,
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 10 * * * *"
    }
  ]
}

Now you can deploy the template with a ResourceManagementClient from your Azure Function.

this._resourceManagementClient = new ResourceManagementClient(serviceCreds)
{
    SubscriptionId = this._credentials.SubscriptionId
};

var deployment = new Deployment
{
    Properties = new DeploymentProperties
    {
        Mode = DeploymentMode.Incremental,
        Template = template,
        Parameters = parameters["parameters"].ToObject<JObject>()
    }
};

// Begin* starts the deployment without waiting for it to finish
this._resourceManagementClient.Deployments.BeginCreateOrUpdate(this._resourceGroupName, this._deploymentName, deployment);

It is important to use the asynchronous calls of the SDK that are prefixed with “Begin” because the deployment can take longer than the execution timeout of the function, and we want to keep the execution time as short as possible.
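Because Begin* returns immediately, a later invocation of the function has to check whether the deployment actually finished. A minimal sketch, assuming the same client and names as above:

// On a later timer tick, query the state of the previously started deployment
var state = this._resourceManagementClient.Deployments
    .Get(this._resourceGroupName, this._deploymentName)
    .Properties.ProvisioningState;

if (state == "Succeeded")
{
    // The VM exists now; look up its public IP (see below) and register it for matchmaking
}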

To deallocate and deprovision already allocated resources, we implemented a simple wrapper around the Resource Manager API that can be used to manage the infrastructure programmatically.

public void Deallocate(string resourceGroupName, string vmName)
{
    var computeManagementClient = new ComputeManagementClient(this._credentials.GetServiceClientCredentials())
    {
        SubscriptionId = this._credentials.SubscriptionId
    };
    // Fire and forget: Begin* returns before the deallocation finishes, keeping the function execution short
    computeManagementClient.VirtualMachines.BeginDeallocateWithHttpMessagesAsync(resourceGroupName, vmName);
}
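
Bringing a deallocated server back works the same way; a sketch of the symmetric wrapper, assuming the same class:

public void Start(string resourceGroupName, string vmName)
{
    var computeManagementClient = new ComputeManagementClient(this._credentials.GetServiceClientCredentials())
    {
        SubscriptionId = this._credentials.SubscriptionId
    };
    // Begin* again returns immediately; the virtual machine boots in the background
    computeManagementClient.VirtualMachines.BeginStartWithHttpMessagesAsync(resourceGroupName, vmName);
}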

Getting runtime information about your virtual machines

Now we were able to continuously deploy new game servers from within an Azure Function, but how do we know which public IP was attached to a newly provisioned server? You can find that out with the Azure SDK, too!

To do this, you need to understand what a public IP in Azure actually is: a resource managed by Resource Manager that you can query through an SDK or the REST API. You can ask Resource Manager for its value, but you need to know the name of the resource. To find the name of the public IP, you have to know which network interface is attached to your virtual machine and which IP configurations are associated with that network interface. The reason for doing this programmatically was that we wanted to be able to deallocate and reallocate servers in order to scale faster. Because of this, we couldn’t use outputs from the Resource Manager deployments; the public IPs change their value once the virtual machines associated with them are deallocated.

In C#, this process would look like this:

public string GetPublicIpFromVm(string resourceGroupName, string vmName)
{
    var client = new ComputeManagementClient(this._credentials.GetServiceClientCredentials())
    {
        SubscriptionId = this._credentials.SubscriptionId
    };

    // The VM only references its network interface by resource Id
    var vm = client.VirtualMachines.Get(resourceGroupName, vmName);
    var networkName = vm.NetworkProfile.NetworkInterfaces[0].Id.Split('/').Last();

    var networkClient = new NetworkManagementClient(this._credentials.GetServiceClientCredentials())
    {
        SubscriptionId = this._credentials.SubscriptionId
    };

    // The network interface in turn references the public IP resource by Id
    var ipResourceName =
        networkClient.NetworkInterfaces.Get(resourceGroupName, networkName).IpConfigurations[0].PublicIPAddress
            .Id.Split('/').Last();

    return networkClient.PublicIPAddresses.Get(resourceGroupName, ipResourceName).IpAddress;
}

The IP address was then written into Azure Table storage, where the player matchmaking service could get the IPs of all available game servers ready for players.
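For illustration, writing such a record with the Microsoft.WindowsAzure.Storage SDK could look like the following; the entity shape, table name, and region value are assumptions, not the actual matchmaking schema:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity: one row per game server, partitioned by region
public class GameServerEntity : TableEntity
{
    public GameServerEntity() { }

    public GameServerEntity(string region, string vmName)
    {
        this.PartitionKey = region;
        this.RowKey = vmName;
    }

    public string PublicIp { get; set; }
}

// Register a freshly provisioned server so the matchmaking service can find it
var account = CloudStorageAccount.Parse(storageConnectionString);
var table = account.CreateCloudTableClient().GetTableReference("gameservers");
table.CreateIfNotExists();
table.Execute(TableOperation.InsertOrReplace(
    new GameServerEntity("westeurope", vmName) { PublicIp = publicIp }));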

Conclusion

The result of the hackfest was a reliable way to deploy new infrastructure to Azure with the help of modern DevOps practices. A developer didn’t need to deploy the resources by hand anymore, which helped save hours of provisioning time.

We learned at this hackfest that good old KISS (Keep It Simple, Stupid) is still as important as ever. If you start thinking about Docker, you will default to an orchestrator that helps you work with your environment. But sometimes it is better to keep things as simple as possible rather than work around the constraints of the standard tooling. Docker is useful for more than one reason, and as long as it helps you solve a problem, you should use it.

The Resource Manager API of Azure is really powerful: basically everything you can do by hand in the portal you can also do over the API, so even complicated workflows can be automated this way.

Azure Functions enabled us to have a low-overhead cron job in our preferred language, C#, and it just works without the need to build or plan any infrastructure.

Before we started the hackfest, most of the back-end orchestration was done by hand. Utilizing DevOps practices enabled us to reduce time spent on repetitive tasks and made deployments more reliable. Deploying hundreds of servers in many different regions would take days to manage by hand. With the use of DevOps practices, we managed to reduce this to minutes.

Unfortunately, we didn’t manage to write the business logic that decides when to apply the scaling rules. Adding it would improve the automation even further and is something we will work on in the future to expand the DevOps story.


Source code

The code referenced in this report can be found on GitHub:

GitHub sample