An Official Cloud Migration Cookbook

Whether you are responsible for a single application, a large environment, or somewhere in between, it’s likely you’ve wanted to know how to maximize your cloud service providers’ capabilities. This usually results in questions like:

  • “How can I think about moving applications from my data center to the cloud?” or
  • “How can I avoid common pitfalls and build confidence in my cloud practice?”

I find the usual responses to these questions either too abstract (e.g., form a team, establish Key Performance Indicators (KPIs)) or too specific (e.g., set up DynamoDB, deploy a multi-container group using docker compose). This cookbook provides the missing middle ground between those two extremes. It comes in three sections: Recipes, Nutritional Requirements and Table Dressings, and Techniques and Methods.

Recipes

In some organizations — and for some applications — you may find that the situation warrants one or another of these migration strategies:

Rehost 

Move your application into cloud virtual machines (VMs). Commonly known as the “lift and shift,” this migration strategy can be quite satisfying with a churlish application, for a quick win under time pressure, or as an initial step in a campaign of cloud adoption.

Ingredients

(💰 and 🕒 represent the relative cost and time of the migration work, on a scale of 1 to 5. Ongoing costs are more complex, but, generally speaking, greater upfront effort results in lower ongoing costs.)

Cost: 💰 (1 of 5)

Time: 🕒 (1 of 5)

  • Existing hardware or VM
  • Configuration Management (CM) and/or data transfer technology
  • Cloud VM 

Procedure

  1. Deploy the application, its prerequisites and data to the cloud VM using your CM/data transfer technology.
  2. Testing and Validation (see Techniques and Methods).
  3. Repeat previous steps as needed.
    1. If your methods here are mostly manual, consider the additional value the project might realize by standardizing this process (see Sanitization under Techniques and Methods).
  4. Cutover (see Techniques and Methods).
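As a minimal sketch of step 1, assuming an rsync-based transfer to a hypothetical cloud VM (hostnames, paths and the service name are illustrative):

```bash
# Sync the application and its data to the cloud VM (archive mode, compressed,
# removing files that no longer exist on the source).
rsync -az --delete /srv/myapp/ deploy@cloud-vm.example.com:/srv/myapp/

# Re-run the sync closer to cutover to pick up recent changes, then restart the service.
ssh deploy@cloud-vm.example.com 'sudo systemctl restart myapp'
```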

Replatform

Move your application to the cloud and embrace the catalog of services your provider makes available. By pulling supporting services out of your VM configuration, this strategy can simplify your configs, reduce maintenance and make good use of the constant improvements cloud providers are incentivized to roll out. 

Ingredients

Cost: 💰💰 (2 of 5)

Time: 🕒🕒 (2 of 5)

  • Existing hardware/VM with ancillary service (e.g., varnish, redis, postgres, haproxy, filesystem)
  • CM and/or data transfer technology
  • Cloud VM with a compatible ancillary service

Procedure

  1. Follow the Rehost procedure up to (but not including) the cutover step.
  2. Provision and populate the Database and Storage required by the application.
  3. Configure the application to use the cloud service(s) (a sketch follows this procedure).
  4. Testing and Validation.
  5. Repeat previous steps as needed
    1. For recommendations, see Sanitization under Techniques and Methods.
  6. Cutover.
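As a hedged sketch of steps 2 and 3, assuming the ancillary service being replatformed is PostgreSQL, a hypothetical managed database endpoint, and an application that reads its connection settings from a DATABASE_URL-style variable:

```bash
# Dump the existing database and load it into the managed database service.
pg_dump -Fc -h old-db.internal -U app myapp_db > myapp_db.dump
pg_restore -h myapp.abc123.clouddb.example.com -U app -d myapp_db myapp_db.dump

# Point the application at the managed service via its connection settings.
export DATABASE_URL="postgres://app:secret@myapp.abc123.clouddb.example.com:5432/myapp_db"
```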

Replatform, container variation 

To illustrate the flexibility of the Replatform strategy, here's a specific spin that targets provisioning compute resources into a container service. Here, a Dockerfile takes the place of the configuration management solution, leaving only the application endpoints and access to the cloud provider APIs to manage.

Ingredients

  • Existing hardware/VM
  • Existing application source
  • Cloud container platform with compatible ancillary service(s)
  • Data transfer technology (optional)

Procedure

  1. Create a Dockerfile that encapsulates and runs the application.
  2. Provision and populate Database and Storage required by the application.
  3. Deploy the container to the cloud container platform.
  4. Testing and Validation.
  5. Cutover.
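One possible command-line flow for steps 1 and 3, with a hypothetical registry and image name (the actual deploy step depends on your container platform):

```bash
# Build the image described by the Dockerfile and push it to a registry.
docker build -t registry.example.com/myapp:1.0.0 .
docker push registry.example.com/myapp:1.0.0

# Optional local smoke test against the managed database before handing the image
# to the cloud container platform.
docker run --rm -p 8080:8080 \
  -e DATABASE_URL="postgres://app:secret@myapp.abc123.clouddb.example.com:5432/myapp_db" \
  registry.example.com/myapp:1.0.0
```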

Re-architect 

The re-architect (a.k.a., refactor) strategy is usually undertaken to meet specific business needs (e.g., enable massive scalability, facilitate ongoing development efforts, remove vendor dependencies) and will change where and how your application runs. One popular embodiment of this strategy is the Strangler Pattern which uses a proxy to direct requests to the preferred implementation of the functionality — either for the original application or a superseding service/microservice.  

Ingredients

Cost: 💰💰💰💰 (4 of 5)

Time: 🕒🕒🕒🕒🕒 (5 of 5)

  • Existing application source
  • Desired cloud application architecture
  • Suitable cloud services
  • Data transfer technology (optional)

Procedure

  1. Provision and populate Database and Storage required by the application.
  2. Implement application changes and cloud resource provisioning configuration.
  3. Testing and Validation.
  4. Cutover.
  5. Repeat as desired.

Replace (repurchase) 

Instead of fixing up your old application, meet the need with a new solution. Implementing a suitable cloud offering, open source project, SaaS or commercial off-the-shelf option can deliver aspects of the cloud dream: simplified setup, reduced maintenance and continual updates.

Ingredients

Cost: 💰💰 (2 of 5)

Time: 🕒🕒🕒 (3 of 5)

  • Nuanced understanding of user/business needs
  • Critical test plan
  • List of replacement candidates

Procedure

  1. Assess each of the candidates against the known needs.
  2. Testing and Validation.
  3. Database and Storage (see Techniques and Methods).
  4. Cutover. Ensure the effort put into communication planning and documentation is proportionate to the changes being made.

Retire

Stop running the things you don't actually use. The system that is most secure, cheapest to run and has the fewest bugs is the one that does not exist.

Ingredients

Cost: 💰 (1 of 5)

Time: 🕒🕒 (2 of 5)

  • Obsolete application

Procedure

  1. Ensure that enough people agree with you. (This may feel like moving a mountain, but getting buy in from stakeholders is critical to success.) 
  2. See the short section on decommissioning at the end of Cutover under Techniques and Methods.

Retain

On second thought, leave this application where it is. The top reasons to employ this strategy include: upcoming application retirement, cost amortization and/or deferring due to other priorities.

Ingredients

(💰 and 🕒 represent the relative cost and time of the migration work. This doesn’t account for risk, missed opportunity or ongoing cost.)

Time: 🕒 (1 of 5)

Procedure

  1. Document the rationale to save yourself time on future re-evaluations.

Nutritional Requirements and Table Dressings

Every organization has different tastes and each application can drive specific requirements. For each of the following items, consider what capabilities are needed for the application to adequately serve its function and how much time/effort/funding you're willing to invest. These will be useful inputs for selecting a migration strategy and determining which cloud services to target.

Backup 

Have duplicate copies of your data spread out through time and geography.

Important Questions:

  • Is your data separate from your application and configuration?
  • Does the application produce data?
  • What amount of data loss is acceptable in the event of an incident?
  • For how long is a backup useful? 

Common Solutions:

  1. Many cloud storage services provide a built-in snapshot function which creates a consistent point-in-time copy of the data and can support automation via policies that control the frequency, location and retention of snapshots.
  2. Other services provide a built-in replication feature that can replay changes to the data on another storage resource (locally or in another region).
  3. Avoid the need to back up application servers by not customizing them (see immutable infrastructure under Sanitization, featured within the Techniques and Methods section below).
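As one provider-specific flavor of solution 1 (AWS shown here purely as an example; other providers have equivalents, and the volume ID is hypothetical):

```bash
# Create a point-in-time snapshot of a data volume.
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "nightly snapshot of myapp data volume"

# Confirm that snapshots are actually accumulating for that volume.
aws ec2 describe-snapshots \
  --filters Name=volume-id,Values=vol-0123456789abcdef0 \
  --query 'Snapshots[].{Id:SnapshotId,Start:StartTime,State:State}'
```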

Disaster Recovery 

Have a plan — or resources standing by — to recover from adverse incidents. 

 Important Questions:

  • What degree of application degradation is acceptable? For how long?
  • Do you need a human in the loop for decision making?
  • Are you willing to fund your availability wishes?

Common Solutions:

  1. A low traffic application with forgiving (or captive) users might be satisfied with a weekly replicated VM snapshot that is manually spun up, with DNS updated, within a day.
  2. Another application might be better off with an auto-scaling, warm standby that continually receives replicated data live from production.
  3. Yet another might use health-check based conditional DNS policy to route requests to a static, useful-but-not-featureful placeholder site.

Deployment 

Have a process for making updates and changes.

Important Questions:

  • What is the process for security updates, feature roll-outs, maintenance tasks and other changes, and how often and by whom are they performed?
  • Which aspects of the current process are determined by the behavior of the application or the ingrained habits of the development team?
  • What degree of flexibility do you have in updating this process? 

Common Solutions:

  1. A CI/CD system can run automated testing against new commits (e.g., application, infrastructure, docs), then inform responsible parties of the outcome. Or, proceed to deploy changes to an environment and run additional testing, then inform responsible parties of the outcome. Or, proceed to deploy changes to environments, etc.
  2. Manual actions can invoke stages of an automated process when out-of-band validation is required or while trust in the automation is being developed. The interface for manual interaction can be a CI system (like #1), or other tooling that supports specific access or operational requirements.
  3. Rolling releases can be done by building new container or VM images and replacing members of an application pool.
  4. Multiple application versions can be installed on a VM with configuration to determine which receives traffic.
  5. An application version can have multiple possible code paths with feature flags determining application behavior.
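A sketch of solution 3, shown here with Docker Swarm's rolling update as one concrete option (Kubernetes, instance groups, etc. have equivalents; image and service names are hypothetical):

```bash
# Build and publish the new application image.
docker build -t registry.example.com/myapp:1.1.0 .
docker push registry.example.com/myapp:1.1.0

# Replace members of the application pool one at a time, pausing between batches.
docker service update \
  --image registry.example.com/myapp:1.1.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  myapp
```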

Performance

Have defined expectations so you can evaluate the migration.

Important Questions:

  • Where is performance critical for this application?
  • Can you quantify these requirements? 

Common Solutions:

  1. Take benchmark measurements at service endpoints to compare the impact of changes in development and testing. This might be sufficient for lower-stakes applications.
  2. Continuously assess performance using a metrics and logs collection service.
  3. Use the cloud service suited to your needs and allocate the right size/type, scaling settings, throughput/capacity levels.
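A minimal sketch of solution 1, capturing simple latency numbers at a hypothetical service endpoint so before/after runs can be compared:

```bash
# Record HTTP status and total request time for 20 requests.
for i in $(seq 1 20); do
  curl -o /dev/null -s -w '%{http_code} %{time_total}s\n' https://staging.myapp.example.com/api/health
done
```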

Documentation

Have the documentation ready and up-to-date.

Important Questions:

  • Who are your stakeholders, what level of documentation is relevant to them, and what level of active communication is appropriate?
  • What information is needed to maintain, develop and run your application?
  • Do you need mechanisms in place to ensure documentation does not become stale?

Common Solutions:

  1. Focus on writing documentation in the format in which the information is needed (e.g., learning-oriented tutorials, goal-oriented how-to guides, understanding-oriented discussions, information-oriented reference material).
  2. Use static code analysis tools to generate detailed documentation for your functions and classes.
  3. Prepare announcements, notifications and a timeline.

Application Dependencies

Have a plan for the more commonly neglected dependencies.

Important Questions:

  • What supporting services does your application utilize (e.g., email, authentication, monitoring, antivirus, management agent, private certificate authorities, package mirrors, private repositories)?
  • Which of these can remain unchanged, which need to be reconfigured or replaced, and which are obsolete?

Common Solutions:

  1. In the case of email, all the major cloud providers block (or punitively throttle) outgoing traffic on port 25, so folks tend to either use an API integration with a cloud provider or 3rd party service to originate mail, or send authenticated, TLS-encrypted mail (port 465 or 587) to a cloud provider, 3rd party service or managed relay.
  2. Adopting immutable infrastructure limits the need to access custom mirrors/repos to the image build environment, allowing other environments to be blissfully unaware of those resources and simply redeployed when changes are needed.
  3. Network access to pre-existing systems can be achieved by means of specially-crafted access rules, site-to-site VPN, mesh VPN, dedicated connections, proxies, shim APIs or not at all.
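For solution 1, a quick hedged check that a relay (hostname hypothetical) accepts TLS on the submission ports before you rely on it:

```bash
# STARTTLS on the submission port.
openssl s_client -starttls smtp -connect smtp.relay.example.com:587 -crlf < /dev/null

# Implicit TLS variant.
openssl s_client -connect smtp.relay.example.com:465 < /dev/null
```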

Economy of Scale 

Have a repeatable process for executing migrations (if you need one).

Important Questions:

  • How many applications do you have to migrate?
  • How similar are the applications to one another in architecture, availability requirements and data sensitivity?

Common Solutions:

  1. Start out supporting a limited set of architecture patterns.
  2. Seek feedback to continuously improve the process.
  3. Defer special cases or spin them off into standalone projects.
  4. Standardize aspects of how applications are developed and operated, so migration plans, management and deployment are turnkey.

Techniques and Methods

Sanitization

Before digging into the migration specific details, let's spend a moment discussing a few methods that will keep your work clean and safe.

The most widely accepted and effective practice for building and maintaining clean cloud environments is Infrastructure as Code (IaC). To manage an environment or resources with IaC means that instead of allowing arbitrary, untracked changes to be made in your cloud provider, resources — and their configurations — are controlled via a process that is rooted in a version control system (e.g., git). 

A typical IaC scenario would consist of declarative configuration written for a suitable IaC tool (e.g., Terraform, CloudFormation, Azure Resource Manager), which is updated and applied when changes are needed. This could also periodically run as a change detection mechanism. This practice eliminates error prone manual provisioning and enables reproducibility, repeatable testing, rollback capabilities, reusability, history tracking and grep-ability. 
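A minimal sketch of that workflow with Terraform (one IaC tool among several), including the periodic drift check:

```bash
# Review and apply changes from version-controlled configuration.
terraform init
terraform plan -out=tfplan
terraform apply tfplan

# Scheduled drift detection (e.g., from a CI job): exit code 2 means the real
# environment no longer matches the committed configuration.
terraform plan -detailed-exitcode -lock=false || echo "drift or error detected"
```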

Running a non-IaC environment is like having an entire data center without any cable labels. With IaC, you are able to spawn new data centers at will.

Immutable Infrastructure is another key approach for setting boundaries on when and where the complexity of our systems is allowed to grow. The concept of Immutable Infrastructure is that once compute resources are deployed, they will only be replaced, removed or stopped/restarted. Any action that would otherwise involve changing the application or configuration is accomplished by replacing the nodes that are currently running the application. 

Container based platforms and Functions as a Service (FaaS) cloud offerings have this concept baked in, but when you are using a bare VM cloud service, the path to implementing this is to configure an automated, version controlled process that will build VM images and trigger the upgrade deployment, rollback and maintenance routines the application requires. Such an image build process can make use of cloud provider tools, Hashicorp Packer or Configuration Management tools (e.g., Ansible, SaltStack, MS DSC). 
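A sketch of such an image build using Packer (the template and variable names are hypothetical):

```bash
# Validate and build the VM image from a version-controlled template.
packer init .
packer validate myapp.pkr.hcl
packer build -var "app_version=1.1.0" myapp.pkr.hcl
# The resulting image ID is then handed to the IaC configuration that replaces the running VMs.
```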

If this capability feels out of reach, the first step is to automate your server setup with a CM tool. The second is to use the CM tool for maintenance tasks like patching and deploys. Eventually, the path to enhanced sanity and control in your compute environment will open to you.

What is the most fundamental principle of security? In my opinion, it’s Least Privilege. In all things, grant only the permissions that are required for your systems, people and processes to function. This applies to the capabilities of your application, access into and out of your networks, and access to your cloud provider. The potential liability of leaked cloud provider credentials is equal to the cost of all your data being deleted, plus the hourly rate of the most expensive resources times the number of hours the leak is unnoticed. Of course, there is a balancing point between granularity and manageability, but Infrastructure as Code shifts that balance in our favor.

Also, it is best practice to set up billing alerts for your environments. All cloud providers have them, and, even when they aren't free, they're worth it. If you don't have an estimate, make up a threshold that seems reasonable, and check the threshold against your actual costs until you know what to expect. I prefer a threshold that has room for monthly variability but is low enough to catch when something unexpected is happening.


Testing and Validation

Since cloud migration consists of two categories of changes — provisioning (instantiating and initializing resources in a cloud provider) and cutover (transitioning application activity from one environment to another) — good change practices directly apply to cloud migration activities.

There are two components needed to responsibly make changes to an application or its environment. The first is actually making the change; the second is validating the change. 

Validating a change requires satisfying only two conditions: prove that the change is in effect and prove that the application is behaving as desired. The degree of rigor required for these proofs will vary according to your circumstances (e.g., frequency of the change type, SDLC phase, application criticality/visibility, blast radius). 

Let's dig into some of the options:

  • Test that the invariants of your application/environment are as they should be.
    • Health checks ought to be passing.
    • Specific processes ought to be running.
    • Certain service endpoints ought to be reachable from specific locations.
    • The output of terraform plan should not indicate that changes are needed.
    • Load balancers ought to have a specific number of targets.
    • The git tag/app SemVer should match a.b.c, etc.

A good set of these "sanity checks" can quickly give you evidence of a misconfiguration and kick off the process of carving down the problem space if there is an issue. These tests are usually straightforward to automate using the check/test modes of CM tools, specific verification tools (e.g., Serverspec), cloud provider platform features, etc. These tests are useful for provisioning and —  after any adjustments for architecture changes — pre/post-cutover scenarios.
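A few of the invariants above expressed as quick shell checks (endpoints and process names are hypothetical):

```bash
set -e

# The health check endpoint returns 200.
test "$(curl -s -o /dev/null -w '%{http_code}' https://myapp.example.com/healthz)" = "200"

# The expected process is running.
pgrep -f myapp-server > /dev/null

# Terraform reports no pending changes (exit code 0 means no drift).
terraform plan -detailed-exitcode > /dev/null

echo "sanity checks passed"
```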

  • Test specific application functionality and verify the output.
    • The 5 most popular requests should return x, y, z and complete within # ms.
    • The end-to-end test suite ought to pass.
    • Certain features require a specific role.
    • A set of requests that exercise the database, storage and external API calls should all succeed within # ms.
    • So-and-so from Department X should do their magic and send over a formal approval.

These kind of in-the-weeds application level tests can provide a major confidence boost that things are running as expected and can potentially serve as a basis for comparison between environments. Typically, these tests would be written with a protocol specific framework (e.g., for http applications, tools like cypress, puppeteer, or selenium). Since they're specific to application behavior, these tests are likely to run unchanged for provisioning and pre/post-cutover.

  • Compare point-in-time performance metrics/logs with current and historical values from other environments.
    • From a centralized metrics/logging service, require a favorable comparison of performance and resource utilization metrics between varied infrastructure components and previous revisions at different levels of user activity.
    • Simulate user activity on non-public/unreleased environments using application-specific scenarios implemented with a load testing tool (e.g., locust, gatling, jmeter).

When executed under appropriate conditions with sufficient care, this kind of monitoring regime can weed out otherwise overlooked issues and aid in continuously ratcheting up application quality. Infrastructure components required by this kind of testing are available in many cloud providers. Support for custom metrics would need to be integrated into the application, and procedures for generating summary metrics and comparing results would need to be evaluated and adopted. Early versions of provisioning configuration may not be ready to integrate with this, but it can shine with pre/post-cutover.
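A hedged sketch of the simulated-user-activity idea using locust (the scenario file and host are hypothetical):

```bash
# Run a headless load test against a non-public environment and write CSV results
# that can be compared against a previous run.
locust -f scenarios/checkout_flow.py \
  --headless -u 50 -r 5 --run-time 10m \
  --host https://staging.myapp.example.com \
  --csv results/pre_cutover
```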

  • Test the relevant administrative and operational processes.
    • Ensure backups are being created and are restorable.
    • Ensure administrative users have appropriate access.
    • Verify that disaster recovery processes function as expected.
    • Ensure the change processes for this service are safe and effective.

These capabilities — though not impacted by most changes and infrequently thought of — are an application's last line of defense against a high-severity week of extreme sadness. Automation for these processes is typically written with an IaC or CM tool. Be sure to test these capabilities whenever they may be impacted by a change and on a regular cadence.
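As a minimal sketch of the backup-restorability check, assuming a PostgreSQL database and a hypothetical dump location and table name:

```bash
# Restore the latest dump into a scratch database and verify a key table is populated.
createdb restore_check
pg_restore -d restore_check /backups/myapp_db_latest.dump
rows=$(psql -t -A -d restore_check -c "SELECT count(*) FROM orders;")
test "$rows" -gt 0 && echo "backup is restorable"
dropdb restore_check
```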

While an application is being migrated to the cloud, it often gets more focused attention than it normally would. Any automated tests that can be used to validate the new environment, build confidence for the cutover or verify the application's health are likely to come in handy post-migration when the focus turns to feature deployment, maintenance and routine operations. So, at the very least, pick the low-hanging fruit.

Database and Storage

Depending on where the data for your application lives, how much there is and the nature of the application's interactions with it, getting data synced in a timely and consistent way to your target cloud service could take you down one of a few different paths. 

A direct approach can be effective in simple cases that have:

  • a small amount of data, 
  • low rate of change (relative to network bandwidth), or 
  • the ability to leverage a maintenance window. 

Upload a directory of assets/data to a cloud service with standard cloud provider tools, or to a cloud VM with rsync or zfs send. Alternatively, use a cloud VM as a temporary endpoint for a cloud service by mounting cloud storage to the filesystem, tunneling with ssh, or proxying via nginx/haproxy to resources in the cloud network. This is an excellent method for loading a fresh database dump, second only to scp-ing the dump to a cloud VM and then running the DB load, which needs to push fewer bits over the internet if the DB load has to be restarted for some reason.
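Hedged sketches of those direct approaches (hosts, pools and filenames are hypothetical):

```bash
# rsync a directory of assets to a cloud VM.
rsync -az --progress /data/assets/ deploy@cloud-vm.example.com:/data/assets/

# Stream a ZFS snapshot to a cloud VM that also runs ZFS.
zfs snapshot tank/assets@migrate1
zfs send tank/assets@migrate1 | ssh deploy@cloud-vm.example.com zfs receive tank/assets

# Copy a database dump first, then run the load from the cloud VM.
scp myapp_db.dump deploy@cloud-vm.example.com:/tmp/
ssh deploy@cloud-vm.example.com 'pg_restore -d myapp_db /tmp/myapp_db.dump'
```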

If you are deferring a transition to immutable infrastructure until a later opportunity, most on-premises virtualization/cloud provider pairings offer a path for VM export/import. Plus, some cloud providers offer a background-transfer, live replication service. Be aware of your CPU architecture, device names, drive mappings, customized hosts files, etc. when configuring a whole-VM move like these.

Things get more interesting when the data that needs to be synced would take a week or more to transfer over existing network connections. For these cases, I would look at the "Sneaker Net" services that the target cloud provider offers. 

A typical workflow with the "Sneaker Net" services is: 

  • submit a requisition specifying the amount of data moving and the target service, 
  • receive a computer in the mail, 
  • connect this machine to your network and transfer data to it, 
  • ship the computer back, and
  • receive a notification that the data has been loaded. 

This is especially useful if the change rate or other factors don't invalidate the data. 

Another solution is to use a cloud-integrated storage appliance (one that can throttle, encrypt and optimize transfers) to manage moving the data. These can be a "computer in the mail" flavor or a VM that you provide underlying storage for. Essentially, these devices provide network storage services (e.g., NFS, SMB) and copy the files they're given out to a cloud service. So, you set up the device, do a local storage migration to it and let it trickle your files to the cloud service according to the transfer parameters you provided.

When using a cloud integrated storage appliance, make sure you: 

  • don't saturate your network links, 
  • don't starve the transfers so that they can never complete, and/or
  • respect the appliance requirements as to whether or not other systems are permitted to write to the same cloud storage resources.

Either because of the time elapsed in completing an initial migration or an unrelenting rate of changes, some scenarios might require an initial data transfer be continually updated with new changes from the live environment. For innocent file data, iteratively running rsync or zfs send can be sufficient. Otherwise, you may be forced to use some specialized file replication product. 

The path for keeping database data in sync feels somewhat clearer:

  • initially load a backup/dump, 
  • configure DB replication following the guidance published for your cloud database service, or
  • make use of the cloud provider's database migration service offering. 

To work well, any of those options will require adequate network access, stability and bandwidth.

In some circumstances — such as if an application's complexity is better suited to being decomposed into multiple, phased migrations (following the Strangler Pattern, for instance) — it can be worthwhile to add features to the application to directly support the changes being made:

  • conditional request segmentation, 
  • prioritized list of storage locations, 
  • multiple database connections, etc. 

Such an approach increases the surface area where problems might occur, but adding features to the application may beat the alternatives.

Finally, some painful cases need local adjustments to be prepared to support migration activities:

  • insufficient network bandwidth, 
  • indecomposable data stores, 
  • un-replayable transactions, 
  • no maintenance flexibility, 
  • deprecated services/protocols, etc. 

Years ago, I had the pleasure of dealing with an application that would somehow re-write the alternate data stream of a file after it had closed the file handle. The details were never clear to me, but the share full of corrupt documents spoke for itself.

Take the time to understand your application's relationship to its data and your organization's priorities when planning this phase of your migration.

Cutover

It is tempting to think cutover is solely about making the changes that update which environment is live, but I recommend thinking about cutover in three distinct phases: the Plan, the Change and the Rest.

The first step in formulating the Plan is to ensure you've met all the application's "Nutritional Requirements" and any bonus Organizational requirements — especially those that are difficult to bolster later on. These include making sure:

  • the day-1 documentation is ready
  • all of the stakeholders are in the loop
  • anyone that needs early access to the cloud environment has it (these can surface when considering Testing and Validation options) 
  • there is plenty of time to find and flag issues

The second step in formulating the Plan is to check your priors. Verify that you know how to gauge the traffic hitting the live environment and how that flow is controlled. Verify that your Testing and Validation steps cover the most important and informative areas of the environment.

The third and final step of formulating the Plan is to literally write a plan. You can decide the level of formality, whether someone else should review it and where to add it to the project documentation. 

The plan should consist of at least three parts: 

  1. Instructions for the steps to update which environment is live. 
  2. Instructions for the steps to validate that the change is in effect and that the application is working as expected.
  3. Instructions for what actions to take if the validation does not succeed (e.g., a total rollback, correct an anticipated issue & re-validate).
  4. Extra credit: Include specific notifications to send to stakeholders at the start and end of the Change.

 At a minimum, the plan should be written with enough detail that your nearest coworker can step in for you. Writing change plans in this pattern is useful for:

  • uncovering mistaken assumptions, 
  • explicitly considering contingencies, 
  • more reliable execution, and
  • future re-use as training material and for discovering/developing standard operating procedures.

The Change consists of executing the plan and noting any deviations found to be necessary. For cloud migrations, I would expect most cutover changes to consist of a DNS update. When planning a change that updates DNS, remember that the TTL value determines the duration of cache validity for the record. You probably want to drop it to a low number ahead of time so that the change is propagated to the world without delay (then raise it again in the Rest).  
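To see the record's current TTL before scheduling the change (the record name is hypothetical; the second field of each answer line is the remaining TTL in seconds):

```bash
dig +noall +answer myapp.example.com A
```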

For validating a DNS update, use nslookup locally, then mxtoolbox and, for good measure, whatever site shows up first after searching "global dns check." After that, complete the required application specific checks and look to see how much traffic is still hitting the old environment. A different cutover change I might expect would be updating a proxy server's configuration. Validation here is also straightforward. Simply check the proxy's backend metrics to verify requests are being sent to the cloud environment, followed by the application and old environment investigation. By now, you've got the idea.
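A hedged sketch of those DNS spot checks against a few public resolvers (the record name is hypothetical):

```bash
# Confirm the new record is being served by several public resolvers.
for resolver in 1.1.1.1 8.8.8.8 9.9.9.9; do
  echo -n "$resolver: "
  dig +short @"$resolver" myapp.example.com A
done
# Then check the old environment's access logs to see how much traffic is still arriving.
```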

The double meaning of the Rest is intentional. First, congratulations! The application is migrated; go ahead and take the weekend off. Second, determine what the next priority is:

  1. Is everyone satisfied with how the Cutover went? 
  2. Are any updates needed to components of the environment, documentation, or other “Nutritional Requirements”?
  3. If things are running well, is it time to proceed with decommissioning the old environment, or do some stakeholders still require access to it for a period of time?
    • Be explicit about timelines for disabling access and removing applications from your environment.
    • Be flexible and accommodate reasonable requests when previously undiscovered users come out of the woodwork.
    • Keep monitoring the logs to verify whether or not there is activity.
  4. Is there another application to tackle?

Wrapping Up

This is my take on the options, significant considerations and best recommendations for approaching a cloud migration, presented in the most palatable format I could think of.

Hopefully, what you have read here has whetted your appetite for a migration project or two. Now, you can:

  • Replace an archival backup server with cloud storage
  • Re-architect a marketing website into a static site behind a CDN
  • Replatform a bursty line-of-business app onto an immutable, dynamically scaling and disaster-ready service

Or, something better!

Share your migration projects with us, and schedule a free consult today!

