Reliable and scalable infrastructure: Principles

This is a series of posts:

  1. Introduction
  2. Principles (this post)
  3. Layers
  4. Traffic
  5. Secrets

First and foremost, you have to treat your service’s infrastructure the way you treat your service’s code, in other words, as infrastructure-as-code. This may include techniques that are now common in general engineering processes, such as:

  • Gated build. Each change is built and verified. If this is an ARM template, you can run Test-AzResourceGroupDeployment.
  • Gated deployment. Each change is not only validated for syntactic correctness but also actually deployed to a test cluster, alongside the basic infrastructure services if possible. Combined, these checks help ensure the changes are valid and functional.
  • Continuous Integration (CI). Each change is immediately merged into the main branch and a ready-for-production build is produced.
  • Continuous Delivery (CD). Each build is immediately deployed to an early test environment and the appropriate tests are performed. Then to another environment, then another.
  • Safe Deployment Practices (SDP). Each build is not deployed to all available environments simultaneously but is instead rolled out slowly across environments and regions. They are grouped by kind (prod, pre-prod), geography (North America, Europe, Asia), type of customers (internal, partners, public), and so on.
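
To make the staged rollout idea more concrete, here is a minimal sketch in Python. The stage names, the `roll_out` function, and the `deploy` and `healthy` callbacks are all hypothetical and only illustrate the shape of the process: deploy one stage at a time and stop as soon as an environment fails validation.

```python
# Hypothetical rollout order: each inner list is one stage.
STAGES = [
    ["test-internal"],
    ["preprod-northamerica"],
    ["prod-northamerica"],
    ["prod-europe", "prod-asia"],
]


def roll_out(build, stages, deploy, healthy):
    """Deploy `build` stage by stage and stop at the first unhealthy environment.

    `deploy(build, env)` and `healthy(env)` are caller-supplied callbacks.
    Returns the environments that were deployed and passed validation.
    """
    done = []
    for stage in stages:
        for env in stage:
            deploy(build, env)
            if not healthy(env):
                return done  # halt: later stages never receive the bad build
            done.append(env)
    return done
```

With this shape, a build that breaks an early environment never reaches the later, wider stages.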

You may refer to the Build and Deployment section of the Twelve-Factor App for more ideas on what the CI/CD process for both your services and infrastructure should look like.

Employing these and other techniques will help you to achieve multiple goals:

  • Increase the confidence in the changes
  • Increase the overall quality of the infrastructure by decreasing the number of errors slipping into production
  • Catch issues early in the rollout
  • Decrease the overall time-to-production, the total time it takes for a new feature or a fix to reach the target environment

To be continued…

Posted in Infrastructure

3 ways to assign access policy for user-assigned managed identity on key vault using ARM template

This post is a summary of my experience dealing with user-assigned managed identities and key vaults in Azure. It explores multiple ways to achieve the same result: assigning access policies using an ARM template. Each of the ways has its own pros and cons.

First, the simplest: create a key vault with preassigned access policies:

{
  "resources": [
    {
      "type": "Microsoft.KeyVault/vaults",
      "apiVersion": "[variables('kvApiVersion')]",
      "name": "[parameters('kvName')]",
      "location": "[parameters('location')]",
      "properties": {
        "tenantId": "[variables('tenantId')]",
        "accessPolicies": "[parameters('accessPolicies')]",
        "sku": {
          "name": "Standard",
          "family": "A"
        }
      }
    }
  ]
}

The pro of this approach is also its con: you have to know all access policies ahead of time. That works, but only in the simplest scenarios, such as security groups, since they are created outside of ARM and have static, well-known object IDs (OIDs).

Second: create a key vault, then a user-assigned managed identity, and then add an access policy:

{
  "variables": {
    "uaidRef": "[concat('Microsoft.ManagedIdentity/userAssignedIdentities/', parameters('uaidName'))]"
  },
  "resources": [
    {
      "type": "Microsoft.KeyVault/vaults",
      "apiVersion": "[variables('kvApiVersion')]",
      "name": "[parameters('kvName')]",
      "location": "[parameters('location')]",
      "properties": {
        "tenantId": "[variables('tenantId')]",
        "accessPolicies": [],
        "sku": {
          "name": "Standard",
          "family": "A"
        }
      }
    },
    {
      "type": "Microsoft.ManagedIdentity/userAssignedIdentities",
      "apiVersion": "[variables('idApiVersion')]",
      "name": "[parameters('uaidName')]",
      "location": "[parameters('location')]"
    },
    {
      "type": "Microsoft.KeyVault/vaults/accessPolicies",
      "name": "[concat(parameters('kvName'), '/add')]",
      "apiVersion": "[variables('kvApiVersion')]",
      "properties": {
        "accessPolicies": [
          {
            "tenantId": "[variables('tenantId')]",
            "objectId": "[reference(variables('uaidRef'), variables('idApiVersion')).principalId]",
            "permissions": "[variables('uaidPermissions')]"
          }
        ]
      },
      "dependsOn": [
        "[concat('Microsoft.KeyVault/vaults/', parameters('kvName'))]",
        "[concat('Microsoft.ManagedIdentity/userAssignedIdentities/', parameters('uaidName'))]"
      ]
    }
  ]
}

The main drawback of this one is the eviction effect. Since a deployment of an ARM template is effectively a PUT on the respective resource, the key vault has no access policies immediately after it is deployed. That means all requests to access it will fail with 403 until the respective policies are added back. The time window might be relatively short, but it still exists, and it can and will cause outages and incidents.

Moreover, Key Vault doesn’t support adding access policies in parallel. That means that if there are multiple policies to add, they must be added sequentially. Each takes several seconds, which widens the window to a minute or more. In a production environment this is guaranteed to have customer impact: it makes it impossible to deploy transparently, without interrupting running services, and so violates one of the core principles of cloud and enterprise-grade infrastructure.

Finally, third: create a user-assigned managed identity, then create a key vault with a preassigned access policy:

{
  "variables": {
    "uaidRef": "[concat('Microsoft.ManagedIdentity/userAssignedIdentities/', parameters('uaidName'))]"
  },
  "resources": [
    {
      "type": "Microsoft.ManagedIdentity/userAssignedIdentities",
      "apiVersion": "[variables('idApiVersion')]",
      "name": "[parameters('uaidName')]",
      "location": "[parameters('location')]"
    },
    {
      "type": "Microsoft.KeyVault/vaults",
      "apiVersion": "[variables('kvApiVersion')]",
      "name": "[parameters('kvName')]",
      "location": "[parameters('location')]",
      "properties": {
        "tenantId": "[variables('tenantId')]",
        "accessPolicies": [
          {
            "tenantId": "[variables('tenantId')]",
            "objectId": "[reference(variables('uaidRef'), variables('idApiVersion')).principalId]",
            "permissions": "[variables('uaidPermissions')]"
          }
        ],
        "sku": {
          "name": "Standard",
          "family": "A"
        }
      },
      "dependsOn": [
        "[concat('Microsoft.ManagedIdentity/userAssignedIdentities/', parameters('uaidName'))]"
      ]
    }
  ]
}

This one combines the pros of the previous two and, in my mind, has no cons. It eliminates the window altogether: the key vault is never left without access policies.

Posted in Infrastructure

Reading books vs writing one

I have an issue with reading books. I read blogs and articles on the Internet often, but physical books almost never. Back in the day when I was living in Moscow, I used to commute to college and work an hour each way every day and had plenty of time for reading. Then, after moving to the US, driving to work instead of taking public transport, having kids, and now permanently working from home, I have neither the time nor much of the desire.

A friend of mine once gave me sound advice: find more time to read books to advance my career. And he’s probably right, I should. On the other hand, I can write a book of my own. I would title it:

Reliable and scalable infrastructure in Azure

Also compliant and using Service Fabric.

So here starts a series of blog posts which hopefully one day will be compiled into a book.

So far I have come up with the following sections:

  1. Introduction (this post)
  2. Principles
  3. Layers
  4. Traffic
  5. Secrets
Posted in Thoughts

Carnation Anapa Winery, vol 3, day 4: yeast

Due to the pandemic and workaholism, everything takes longer this year.

I’m adding 5g of RC212 by Cellar Science (batch #52495, whatever that means) to the 5-gallon bucket of Petite Sirah. But first, to avoid shock, I’m diluting the yeast in a small amount of boiled water cooled down to 106°F.

Posted in Winemaking

Carnation Anapa Winery, vol 3, day 3: Potassium Metabisulfite

Last time, the outcome was much better when I added Potassium Metabisulfite than when I did not. So this time I’m adding it to both buckets of must, ~1.5 x ¼ tsp per 5 gallons.

Posted in Winemaking

Carnation Anapa Winery, vol 3, day 2: weighing

Some precalculations:

  • My weight: 74.15 kg
  • Empty bucket: 1.15 kg
  • Total: 75.45 kg

Bucket #1 (CS):

  • Total: 86.80 kg
  • Grapes: 12.65 kg

Bucket #2 (CS):

  • Total: 85.80 kg
  • Grapes: 11.65 kg

Bucket #3 (PS):

  • Total: 89.15 kg
  • Grapes: 15.00 kg

Bucket #4 (PS):

  • Total: 90.05 kg
  • Grapes: 15.90 kg

Which adds up to:

  • Cabernet Sauvignon: 24.3 kg
  • Petite Sirah: 30.9 kg
  • Total: 55.2 kg (121.695 lbs)
Posted in Winemaking

Carnation Anapa Winery, vol 3, day 1: The journey continues

It’s that time of year when I drive to my friends at Carthage Vineyard in Zillah, WA and pick what’s left after the harvest season.

This year it was 2 buckets of Cabernet Sauvignon and 2 buckets of Petite Sirah.

Posted in Winemaking

How to configure Service Fabric to use AAD for client authentication

This blog post is intended to complement the official doc, which I personally don’t find helpful or comprehensive enough.

The configuration that works for me consists of 3 parts:

  1. Cluster ARM template change
  2. AAD app for the cluster identity (let’s call it client)
  3. AAD app for the users to access Service Fabric Explorer (SFE) (let’s call it cluster)

First you make the changes in your ARM template for the cluster and deploy:

"variables": {
  "clientAadAppId": "{client app id}",
  "clusterAadAppId": "{cluster app id}"
},
"resources": [
  {
    "type": "Microsoft.ServiceFabric/clusters",
    "apiVersion": "[variables('sfApiVersion')]",
    "name": "[parameters('clusterName')]",
    "location": "[parameters('location')]",
    "properties": {
      "addonFeatures": [],
      "azureActiveDirectory": {
        "tenantId": "[subscription().tenantId]",
        "clientApplication": "[variables('clientAadAppId')]",
        "clusterApplication": "[variables('clusterAadAppId')]"
      },
      "certificateCommonNames": {},
      "clientCertificateCommonNames": [],
      "clientCertificateThumbprints": [],
      "diagnosticsStorageAccountConfig": {},
      "fabricSettings": [],
      "reliabilityLevel": "[variables('reliabilityLevel')]",
      "upgradeMode": "Automatic",
      "vmImage": "Windows"
    }
  }
]

Then you create 2 third-party AAD applications and edit their manifests.

For the client app, you specify the Microsoft Graph and cluster app IDs:

"requiredResourceAccess": [
  {
    "resourceAppId": "00000003-0000-0000-c000-000000000000",
    "resourceAccess": [
      {
        "id": "{random guid}",
        "type": "Scope"
      }
    ]
  },
  {
    "resourceAppId": "{cluster app id}",
    "resourceAccess": [
      {
        "id": "{your guid}",
        "type": "Scope"
      }
    ]
  }
],
"oauth2Permissions": [
  {
    "adminConsentDescription": "Allow the application to access SF Cluster Management application on behalf of the signed-in user.",
    "adminConsentDisplayName": "Access SF Cluster",
    "id": "{your guid}",
    "isEnabled": true,
    "lang": null,
    "origin": "Application",
    "type": "User",
    "userConsentDescription": "Allow the application to access SF Cluster Management application on your behalf.",
    "userConsentDisplayName": "Access SF Cluster",
    "value": "user_impersonation"
  }
]

And for the cluster app, you specify which roles have which permissions:

"appRoles": [
  {
    "allowedMemberTypes": [
      "User"
    ],
    "description": "ReadOnly roles have limited access",
    "displayName": "ReadOnly",
    "id": "{random guid}",
    "isEnabled": true,
    "lang": null,
    "origin": "Application",
    "value": "User"
  },
  {
    "allowedMemberTypes": [
      "User"
    ],
    "description": "Admins roles can perform all tasks",
    "displayName": "Admin",
    "id": "{random guid}",
    "isEnabled": true,
    "lang": null,
    "origin": "Application",
    "value": "Admin"
  }
]

Then add your cluster’s SFE endpoint to the Authentication section:

https://{clusterName}.{clusterLocation}.cloudapp.azure.com:19080/Explorer/index.html

And finally, go to the cluster app Overview and click Managed application in local directory, select Users and groups, and assign permissions to the AAD groups you want to be Users or Admins.

That’s it, folks!

Posted in Infrastructure

How to hook up child DNS zone into parent by updating its NS records using ARM template

Imagine a scenario: you have one global DNS zone in the Prod subscription and several child DNS zones, one per environment, each in its own subscription, e.g.:

  • infra.example.com
    • Subscription: Prod
  • dev.infra.example.com
    • Subscription: Dev
  • test.infra.example.com
    • Subscription: Test
  • prod.infra.example.com
    • Subscription: Prod

Each zone is created using its own ARM template. But for a child zone to start working, you need to hook it up to the parent zone by adding its NS records there, e.g.:

  • dev.infra.example.com
    • NS
      • ns1-01.azure-dns.com.
      • ns1-01.azure-dns.net.
      • ns1-01.azure-dns.org.
      • ns1-09.azure-dns.info.
  • infra.example.com
    • dev
      • NS
        • the records must be inserted here

Here’s how to achieve that using an ARM template:

{
  "$schema": "http://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environment": {
      "type": "string"
    },
    "globalSecretsSubscriptionId": {
      "type": "string"
    },
    "globalSecretsResourceGroupName": {
      "type": "string"
    },
    "globalDnsZoneName": {
      "type": "string"
    },
    "envDnsZoneName": {
      "type": "string"
    }
  },
  "variables": {
    "deploymentApiVersion": "2019-09-01",
    "dnsApiVersion": "2018-05-01"
  },
  "resources": [
    {
      "name": "[parameters('envDnsZoneName')]",
      "type": "Microsoft.Network/dnsZones",
      "apiVersion": "[variables('dnsApiVersion')]",
      "location": "global"
    },
    {
      "name": "[format('DNS-Global-{0}', parameters('environment'))]",
      "type": "Microsoft.Resources/deployments",
      "apiVersion": "[variables('deploymentApiVersion')]",
      "subscriptionId": "[parameters('globalSecretsSubscriptionId')]",
      "resourceGroup": "[parameters('globalSecretsResourceGroupName')]",
      "properties": {
        "mode": "Incremental",
        "template": {
          "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
          "contentVersion": "1.0.0.0",
          "resources": [
            {
              "name": "[format('{0}/{1}', parameters('globalDnsZoneName'), parameters('environment'))]",
              "type": "Microsoft.Network/dnsZones/NS",
              "apiVersion": "[variables('dnsApiVersion')]",
              "properties": {
                "TTL": 3600,
                "NSRecords": "[reference(resourceId('Microsoft.Network/dnszones/NS', parameters('envDnsZoneName'), '@'), variables('dnsApiVersion')).NSRecords]"
              }
            }
          ]
        }
      },
      "dependsOn": [
        "[concat('Microsoft.Network/dnsZones/', parameters('envDnsZoneName'))]"
      ]
    }
  ]
}

Here’s what it does:

  1. Creates a child zone in the current subscription and resource group
  2. Updates the parent zone in its own subscription and resource group by creating an NS record set with the values of the child zone’s NS records
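
To illustrate what the nested deployment ends up writing, here is a small Python sketch that builds the same record set as plain data. The function name and the sample name server entries are hypothetical, and the `nsdname` field is my assumption about how each NS record is shaped. The resource name mirrors the `format('{0}/{1}', ...)` expression in the template.

```python
def delegation_record(global_zone, environment, child_ns_records):
    """Build the NS record set that delegates `environment` from the parent zone."""
    return {
        # Same shape as format('{0}/{1}', globalDnsZoneName, environment)
        "name": f"{global_zone}/{environment}",
        "type": "Microsoft.Network/dnsZones/NS",
        "properties": {
            "TTL": 3600,
            # Copied as-is from the child zone's own '@' NS record set
            "NSRecords": list(child_ns_records),
        },
    }


# Hypothetical sample values for illustration only
child_ns = [
    {"nsdname": "ns1-01.azure-dns.com."},
    {"nsdname": "ns1-01.azure-dns.net."},
]
record = delegation_record("infra.example.com", "dev", child_ns)
print(record["name"])
```

The key point is that the parent zone does not invent any values: it only repeats the name servers that were assigned to the child zone when it was created.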

Happy deployment!

Posted in Infrastructure

How to enable automatic clean up of provisioned application types on a Service Fabric cluster

As time goes by and you keep deploying applications, every new build provisions a new application type version. Application packages pile up, and after some time old versions become just clutter that eats up disk space without providing any value. So you may want to periodically clean them up.

Previously, you had to periodically run a PowerShell script manually, like this one:

param
(
  [Parameter(Mandatory=$true)]
  [string]$ApplicationTypeName,

  [Parameter(Mandatory=$true)]
  [int]$NumberOfTypesToKeep
)

Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope CurrentUser -Force

$moduleName = "ServiceFabric"
$module = Get-Module $moduleName
if (!$module)
{
  Write-Output "Module $moduleName was not found, importing"
  Import-Module "C:\Program Files\Microsoft Service Fabric\bin\ServiceFabric\ServiceFabric.psd1"
}

Connect-ServiceFabricCluster -Verbose

if ($ApplicationTypeName -eq "*")
{
  $applications = Get-ServiceFabricApplicationType -Verbose
  $applicationTypeNames = $applications.ApplicationTypeName | Select-Object -Unique
}
else
{
  $applicationTypeNames=$ApplicationTypeName
}

foreach ($appTypeName in $applicationTypeNames)
{
  $currentAppType = Get-ServiceFabricApplication -ApplicationTypeName $appTypeName -Verbose

  # if type not found
  $registeredAppTypes = Get-ServiceFabricApplicationType -ApplicationTypeName $appTypeName -Verbose
  if (!$registeredAppTypes)
  {
    Write-Error "Application Type '$appTypeName' was not found, skipped cleanup"
    continue
  }

  $registeredVersions = $registeredAppTypes.ApplicationTypeVersion

  # if the number to keep >= total registered
  if ($NumberOfTypesToKeep -ge $registeredVersions.Count)
  {
    Write-Error "Parameter NumberOfTypesToKeep=$NumberOfTypesToKeep is greater than or equals the number of registered types=$($registeredVersions.Count)"
    continue
  }

  $versionsToDelete = $registeredVersions | `
                      Where-Object { $currentAppType.ApplicationTypeVersion -notcontains $_ } | `
                      Select-Object -First ($registeredVersions.Count - $NumberOfTypesToKeep)

  Write-Output "Application type '$appTypeName' does exist, started deletion"
  foreach ($versionToDelete in $versionsToDelete)
  {
    Unregister-ServiceFabricApplicationType -ApplicationTypeName $appTypeName `
                                            -ApplicationTypeVersion $versionToDelete `
                                            -Force `
                                            -Verbose
  }

  Write-Output "Successfully deleted application type '$appTypeName' versions: $versionsToDelete"
}
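
The selection logic at the heart of the script above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes the registered versions are listed oldest first, which the script itself does not guarantee.

```python
def versions_to_delete(registered, in_use, keep):
    """Pick application type versions to unregister.

    Mirrors the script: skip versions still used by a running application,
    then take the first (total - keep) of what remains, in listing order.
    """
    if keep >= len(registered):
        return []  # nothing to clean up; the script reports an error and skips
    candidates = [v for v in registered if v not in in_use]
    return candidates[: len(registered) - keep]


# Four registered versions, the newest one is running, keep two
print(versions_to_delete(["1.0", "1.1", "1.2", "1.3"], {"1.3"}, 2))
```

Note that versions in use are never deleted, so the number of versions left behind can be larger than the number requested.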

But since version 6.5 you don’t need to do anything manually anymore. Here’s a snippet from the cluster ARM template:

{
  "name": "Management",
  "parameters": [
    {
      "name": "CleanupUnusedApplicationTypes",
      "value": true
    },
    {
      "name": "PeriodicCleanupUnusedApplicationTypes",
      "value": true
    },
    {
      "name": "TriggerAppTypeCleanupOnProvisionSuccess",
      "value": true
    },
    {
      "name": "MaxUnusedAppTypeVersionsToKeep",
      "value": "10"
    }
  ]
}

That’s it folks, happy deployment!

Posted in Infrastructure