How to Overcome Infrastructure as Code (IaC) Challenges

Infrastructure as code (IaC) is worth considering for any organization because it makes it easier for your engineers to create the machines and other resources required to run your applications. As you scale, IaC also keeps your engineers from spending time and effort on launching and configuring resources by hand, so they can work on more complex problems instead.

So what does IaC actually mean? It means defining your infrastructure in code, so that whenever you need to change or add resources, you change that code and the IaC tool takes care of configuring the resources for you.

In this article, we’ll talk about the challenges you can face with infrastructure as code and discuss some steps to mitigate them.

The Principle Behind IaC

The basic principle behind IaC is that you define your infrastructure in code. With declarative syntax, you describe the final state of your infrastructure, and the IaC tool takes care of the underlying dependency resolution, resource-launching steps, and so on.
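To make the declarative idea concrete, here is a minimal Python sketch of what a tool might do with a desired state: compare it with what already exists and order the remaining work so that dependencies come first. The resource names and the plan function are hypothetical and only illustrate the principle; real IaC tools are far more involved.

```python
from graphlib import TopologicalSorter

# Hypothetical desired state: each resource maps to the resources it depends on.
desired = {
    "network": [],
    "subnet": ["network"],
    "vm": ["subnet"],
}

# Hypothetical current state: the resources that already exist.
existing = {"network"}


def plan(desired, existing):
    """Return the resources still to be created, dependencies first."""
    order = TopologicalSorter(desired).static_order()
    return [name for name in order if name not in existing]


print(plan(desired, existing))
```

You only describe the end result; the ordering and the decision about what still needs to be created are left to the tool.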

You can then put this code in a version control system (VCS), which keeps track of every change made to it. This helps you audit your infrastructure by showing who made which changes.

With IaC, your engineers don’t have to waste time defining the steps or scripts to launch and configure resources. Pulumi, Terraform, and CloudFormation are among the widely used tools for these tasks.

Role of a VCS in IaC

Since you already keep most of your code in a VCS platform like GitHub or GitLab, you should keep your infrastructure code there as well so that your engineers can collaborate on it easily.

A VCS also provides a simple way to audit which changes were made and who made them. You can even use the pipeline features in GitHub or GitLab to make sure changes comply with your policies and no unwanted changes reach production.

What Are Some Common Use Cases of IaC?

The most common use case for IaC is launching infrastructure across different cloud vendors. You can also use it to keep the code that configures machines once they are launched. Chef, Ansible, and Puppet are examples of tools for this kind of machine configuration, while common tools for infrastructure provisioning include Terraform and Pulumi.

You can further use IaC to roll out applications, for example to Kubernetes or other environments, using Jenkins or Ansible. We’ll talk more about Kubernetes and IaC in a later section.

Challenges and Best Practices with IaC

Alongside its ease of operation and maintainability, IaC comes with some challenges. These are issues you need to handle properly to protect your infrastructure from vulnerabilities and abuse.

In this section, we’ll look at these challenges, as well as how they can be handled.

Adoption Within the Team

Implementing IaC in your organization can involve a learning curve and a change of process. Your team may have to learn the language in which the IaC code is written and build pipelines to execute that code. If the team is operations-centric and used to making changes from cloud consoles, moving to IaC can be a big step.

Configuration Drift

This is a major problem, especially at the start of your IaC journey. An engineer may not know which code changes are needed for a piece of infrastructure, in which case they may opt to go to the console and make the change manually. This, in turn, leads to configuration drift, meaning that what you have defined in the code is no longer what is actually deployed. The next time you run the code, it may try to revert the manual change and cause an outage.

You have to inform the team about these consequences and ask them to avoid manual changes via the console.
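Asking people to avoid the console works better when drift is also detected automatically. The following is a minimal sketch of the idea in Python, assuming you can obtain both the declared settings and the deployed settings as plain dictionaries. The find_drift function and the sample values are hypothetical.

```python
def find_drift(declared, deployed):
    """Return {setting: (declared_value, deployed_value)} for every mismatch."""
    keys = declared.keys() | deployed.keys()
    return {
        key: (declared.get(key), deployed.get(key))
        for key in sorted(keys)
        if declared.get(key) != deployed.get(key)
    }


# Hypothetical example: someone resized the machine and enabled debug by hand.
declared = {"instance_type": "t3.small", "port": 443}
deployed = {"instance_type": "t3.large", "port": 443, "debug": True}

for key, (want, have) in find_drift(declared, deployed).items():
    print(f"{key}: declared={want!r} deployed={have!r}")
```

Running a check like this on a schedule lets you notice manual changes before the next code run tries to revert them.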

Security

You’ll often pick up a module from the open-source community to use in your IaC pipeline, and there’s a chance that this code contains vulnerabilities. So before using any open-source module, verify that it is safe to use, for example by reviewing its source and checking how actively it is maintained.

Human Factors

There is always a chance that an engineer introduces a misconfiguration when making a change. You need a way to catch these errors and make sure they do not reach production. With Terraform, you can implement this validation step using the terraform plan command, which shows the changes that would be made before anything is applied.
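One way to turn a plan into an automated check is to export it as JSON and fail the pipeline when it contains destructive changes. The Python sketch below assumes the plan has been exported with terraform show -json and that each entry under resource_changes carries an address and a change.actions list, which matches Terraform's JSON plan representation as far as I know; verify the field names against the Terraform version you use. The sample data is handwritten for illustration, not real tool output.

```python
def destructive_changes(plan):
    """Return the addresses of resources the plan would delete or replace."""
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # A replacement shows up as a delete combined with a create.
        if "delete" in actions:
            flagged.append(resource["address"])
    return flagged


# Handwritten sample in the assumed shape of the plan JSON.
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
}

print(destructive_changes(sample_plan))
```

A CI job could exit with a non-zero status when this list is not empty, so that a reviewer has to approve the destructive change explicitly.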

Side Effect of Automation

Since you’re automating a lot of infrastructure creation, a lot of code will be reused. There’s a good chance that any small misconfiguration can propagate across a large set of resources very easily. This again has to be caught in the pipeline during verification.

Keeping Up to Date with Cloud Vendors

Cloud vendors can change their APIs and policies. When they do, the tools/code that you’re using must also be updated to make sure your existing infrastructure works exactly the same as before.

This can be especially tricky if you’re using open-source tools, since there can be a delay before support for the change is released. It can also lead to incorrect permissions if the API for role-based access control (RBAC) or the API for provisioning access to machines changes.

Maintainability and Traceability

You have to define a clear procedure for how anyone takes infrastructure changes to the production environment, and assign responsibilities so that everything going to production is verified. Maintainability on the VCS side is not an issue once this process is defined; without it, things can get chaotic.

As for traceability, everything can be traced easily in a VCS because all changes are recorded in its history. With Git, for example, you can see every change using the git log command or by browsing the commit history.

RBAC

Many of these tools have no native RBAC implementation and rely on the platform where you keep your code to provide it. They simply assume that whoever executes the code has the proper permissions. So the responsibility for RBAC falls on the underlying VCS platform.

Now that we’ve discussed the challenges with IaC, let’s look at some of the best practices you can use to handle them.

Version Control System and Proper Approval Flow

You must use version control to keep your code under control. This helps with auditing and keeping track of changes. You should also create a flow in which no change can be merged to production without proper approval and validation. Consider writing these validations as CI jobs in GitHub or GitLab. If you treat your IaC code just like normal application code, that should cover most use cases.

Handling Secrets Properly

There are two kinds of secrets you have to manage. The first kind is the credentials used to create resources in the cloud; no one except the admins of the repositories should have access to these. You can use secret variables in GitHub or GitLab for this purpose.

The second kind is the secrets created when the code gets executed, like the password for an IAM user in AWS. You have to make sure these are not logged anywhere and find a way to deliver them safely to users.
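If your pipeline scripts are written in Python, one safety net against leaking generated secrets is a logging filter that masks known secret values before a record is emitted. This is a minimal sketch using the standard logging module; the RedactSecrets class name and the example password are hypothetical, and a filter like this complements careful handling of secrets rather than replacing it.

```python
import logging


class RedactSecrets(logging.Filter):
    """Mask known secret values in log messages before they are emitted."""

    def __init__(self, secrets):
        super().__init__()
        # Ignore empty strings so they do not match everywhere.
        self._secrets = [secret for secret in secrets if secret]

    def filter(self, record):
        message = record.getMessage()
        for secret in self._secrets:
            message = message.replace(secret, "[REDACTED]")
        # Store the already-formatted, masked message on the record.
        record.msg = message
        record.args = ()
        return True


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("iac")
logger.addFilter(RedactSecrets(["hunter2-example"]))
logger.info("created IAM user with password=%s", "hunter2-example")
```

The filter only masks values it has been told about, so a generated secret has to be registered with it as soon as it is created.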

Immutable Infrastructure

If you have to make changes to your infrastructure, consider the principle of immutable infrastructure. This lets you make changes to a machine by creating a new machine with the changes and replacing the old one with the new one.

This way, you can make sure your servers stay in sync with the code and none of them turns into a snowflake server, meaning one whose configuration has been changed by hand and can no longer be reproduced. The basic idea is that you drive the machine from code and never make changes manually.
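The replace-instead-of-modify idea can be sketched in a few lines of Python. The Server type, the launch and roll_out functions, and the image names are all hypothetical stand-ins for whatever your cloud provider and IaC tool offer; the frozen dataclass simply mirrors the rule that a running server is never edited in place.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)


@dataclass(frozen=True)  # frozen: a server cannot be modified after creation
class Server:
    server_id: int
    image: str


def launch(image):
    """Stand-in for creating a new machine from an image."""
    return Server(next(_next_id), image)


def roll_out(fleet, new_image):
    """Return a new fleet in which every old server is replaced by a fresh one."""
    return [launch(new_image) for _ in fleet]


old_fleet = [launch("app:v1"), launch("app:v1")]
new_fleet = roll_out(old_fleet, "app:v2")
print(new_fleet)
```

Because every server in the new fleet is built from the same image, there is no room for a one-off manual tweak to survive a rollout.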

Validations and Checks

Implementing checks and validations in the CI pipeline makes sure that security issues and misconfigurations are caught early, on the left side of the pipeline, before they reach production. This speeds up the development cycle while maintaining the security of each release.

Infrastructure as Code and Kubernetes

You can run your applications on Kubernetes using the same principles as IaC. Kubernetes objects are described in declarative files that you can keep in a code repository. A controller can then apply these files to a Kubernetes cluster to deploy your application.

You can use tools like Argo CD and Flux to sync your Git repository to the actual deployment. Flux is also recommended by the CNCF community for CD. This approach is known as GitOps, as your Git repository is now the source of truth for your application state. Any change you merge into the repository is picked up by the tool and applied to the Kubernetes cluster.
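Conceptually, a GitOps tool runs a reconciliation loop: it compares what the repository declares with what the cluster runs and acts on the difference. The Python sketch below shows that loop on plain dictionaries; the reconcile and apply functions and the object names are hypothetical and are not the API of Argo CD, Flux, or Kubernetes.

```python
def reconcile(repo_state, cluster_state):
    """Return the actions needed to make the cluster match the repository."""
    actions = []
    for name in sorted(repo_state):
        if name not in cluster_state:
            actions.append(("create", name))
        elif cluster_state[name] != repo_state[name]:
            actions.append(("update", name))
    for name in sorted(cluster_state.keys() - repo_state.keys()):
        actions.append(("delete", name))
    return actions


def apply(actions, repo_state, cluster_state):
    """Return the cluster state that results from carrying out the actions."""
    new_state = dict(cluster_state)
    for verb, name in actions:
        if verb == "delete":
            del new_state[name]
        else:
            new_state[name] = repo_state[name]
    return new_state


# Hypothetical states: the repository is the source of truth.
repo = {"web": {"replicas": 3}, "worker": {"replicas": 1}}
cluster = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}

actions = reconcile(repo, cluster)
print(actions)
cluster = apply(actions, repo, cluster)
print(reconcile(repo, cluster))
```

Once the actions have been applied, a second reconciliation finds nothing left to do, which is the steady state a GitOps tool keeps returning to.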

Conclusion

The advantages of infrastructure as code come with challenges you have to face. Making sure you have proper validations, checks, and a well-established process will save you from a lot of issues in production. Since you’re maintaining infrastructure with code, any lapse in security can cause serious problems: costs can get out of hand, and your environment can be compromised.

GitOps and IaC allow you to roll out changes faster and more safely, enabling a quicker deployment cycle as well as out-of-the-box auditing at scale. IaC is the present and future of managing your infrastructure, applications, and tooling, and adopting it is recommended to decrease your operational costs.

With IaC tools, you can manage the same infrastructure with fewer people.
