DevOps Security at Scale

This is the fifth blog post in a series discussing how high-performing DevOps teams build secure systems at scale.

If you want to see a proponent of DevOps wince in pain, describe to them any process that could be automated but is instead still run by humans. QA, packaging, deployment, upgrading, rollback, and disaster recovery are some of the tasks that the DevOps movement has made it unacceptable to leave to error-prone, inefficient human beings. Modern software teams now automate all of those chores to move fast and not break things.

It’s an obvious statement, but automation is key to velocity: an assembly line of robots will produce new automobiles far faster than a group of humans. High-performing DevOps teams also incorporate security in their automation plans to allow them to release features fast without adding undue risk.

Humans Should Only Do What Humans Do Best

Sophisticated DevOps teams not only maximize their opportunities for automation; they are always on the lookout for repetitive tasks to take from a human and automate. Some tasks, however, are difficult, risky, or not beneficial to automate. For example, validating an application’s aesthetics, inspecting UI design fidelity, and validating the quality of a user’s experience are currently far more appropriate jobs for a human than for a computer.

Security falls into the “you should definitely automate a whole lot of it” category. There are many aspects of secure development to automate, and a few that are still only appropriate for humans. For example, security teams are the right choice for reviewing and granting application privileges prior to their rollout, and for managing those privileges while the application exists. And even though there are some wonderful automated tools available for penetration testing, humans are still better at finding new and creative ways to attack software systems.

Maximum Automation, Minimum Humans

In most high-quality DevOps teams, once a Continuous Integration (CI) pipeline is set up to push an application automatically to staging and/or production, the operations staff shouldn’t be involved in any of the app’s future upgrades.

In a similar vein, as long as an application’s security policy hasn’t changed and no new privileges are required, there’s little reason for human security personnel to become an approval bottleneck that prevents the application from rolling forward and going live. They have already approved the enrollment of that application, so as long as it doesn’t need runtime access to more privileged resources, upgrades should be fully automatable.
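The rule above can be expressed as a simple set comparison. The following is a minimal sketch, not a real product API: the `Release` class, the `needs_security_review` function, and the privilege strings are all hypothetical names chosen for illustration. It assumes the privileges granted at enrollment are recorded somewhere the pipeline can read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """A candidate release and the runtime privileges it requests (hypothetical model)."""
    app: str
    version: str
    requested_privileges: frozenset


def needs_security_review(release: Release, approved_privileges: frozenset) -> bool:
    """Return True only when the release asks for a privilege that was never approved."""
    new_privileges = release.requested_privileges - approved_privileges
    return bool(new_privileges)


# Privileges the security team granted when the application was enrolled (made-up values).
approved = frozenset({"db/orders:read", "queue/billing:publish"})

same_scope = Release("orders-api", "2.4.0", frozenset({"db/orders:read"}))
wider_scope = Release(
    "orders-api", "3.0.0", frozenset({"db/orders:read", "db/customers:read"})
)

print(needs_security_review(same_scope, approved))   # False: upgrade can roll forward
print(needs_security_review(wider_scope, approved))  # True: a human should look at this
```

With a check like this in the pipeline, humans are only pulled in when the requested privileges actually grow.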

Making upgrades automatable without compromising security does require more tasks in the CI pipeline: scanning third-party libraries and runtimes for CVEs, checking application code for security flaws, and running penetration tests.

Although these checks can take time to run, the overall cost of running them automatically with each deployment is far smaller than the cost, risk, and time spent having humans do the same tasks over and over.
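One way to picture those extra pipeline tasks is as a gate that runs every security check and blocks the deployment if any of them reports a finding. The sketch below is illustrative only: `run_security_gate` and the three stand-in checks are hypothetical, and the dependency and advisory data are invented. Real checks would wrap whatever scanners a team already uses.

```python
from typing import Callable, List, Tuple

# Each check returns a list of findings; an empty list means the check passed.
Check = Callable[[], List[str]]


def run_security_gate(checks: List[Tuple[str, Check]]) -> Tuple[bool, List[str]]:
    """Run every check and collect findings; the gate passes only if there are none."""
    findings = []
    for name, check in checks:
        for finding in check():
            findings.append(f"{name}: {finding}")
    return (not findings, findings)


def scan_dependencies() -> List[str]:
    pinned = {"examplelib": "1.2.0"}              # hypothetical dependency pins
    known_vulnerable = {("examplelib", "1.2.0")}  # hypothetical advisory data
    return [
        f"{name}=={version} has a known vulnerability"
        for name, version in pinned.items()
        if (name, version) in known_vulnerable
    ]


def scan_source() -> List[str]:
    return []  # stand-in: pretend static analysis found nothing


def run_penetration_tests() -> List[str]:
    return []  # stand-in: pretend the automated attack suite found nothing


passed, findings = run_security_gate([
    ("dependency scan", scan_dependencies),
    ("static analysis", scan_source),
    ("penetration tests", run_penetration_tests),
])

print(passed)  # False: the vulnerable dependency blocks the release
for finding in findings:
    print(finding)
```

Collecting all findings rather than stopping at the first one is a design choice here: a single pipeline run then tells the team everything that needs fixing.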

Automated Security

DevOps proponents already know to anticipate severe operational failures. Recently, failure mitigation approaches have become progressively more sophisticated. For example, data center resilience and disaster recovery (DR) methods have evolved significantly. In order, from least sophisticated to most:

Maintain a standby site that can be manually enabled
Regularly rehearse a manual site switchover procedure
Regularly perform a manual site switchover procedure
Completely automate a site switchover procedure
Run concurrently against multiple sites or regions and automate all compensating changes in the system in case of failures
Deliberately cause infrastructure failures by applying chaos engineering principles

What’s most interesting in the above is the level of automation required with each increase in sophistication. It turns out that the same is true when looking at security: high-velocity security teams focus on using automation to prevent breaches and to help respond when they do occur.

Preventing Breaches with Automation

Attackers love having time to do their work. If they can invade a system, time is what they’ll need to expand their foothold and gain access beyond where they originally penetrated.

For example, if a password to a critical database account is only changed manually once every 90 days, attackers will have months to spend attempting to crack it before they have to start over. But if that database password is automatically changed (“rotated”) every day, or even more frequently, then their ability to crack it will be drastically reduced.
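The rotation idea can be sketched as a small age check that a scheduler runs on a regular basis. This is a simplified illustration under stated assumptions: the `Credential` class and `rotate_if_due` function are hypothetical, and the credential lives only in memory. A real rotation would also change the password on the database account itself and deliver the new value to the applications that need it.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Credential:
    """An in-memory stand-in for a stored secret (hypothetical model)."""
    account: str
    password: str
    rotated_at: datetime


def rotate_if_due(cred: Credential, max_age: timedelta, now: datetime) -> bool:
    """Replace the password when it is at least max_age old. Returns True if rotated."""
    if now - cred.rotated_at < max_age:
        return False
    cred.password = secrets.token_urlsafe(32)  # new random secret from the stdlib
    cred.rotated_at = now
    return True


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
cred = Credential("orders-db", "old-password", now - timedelta(days=2))

print(rotate_if_due(cred, timedelta(days=1), now))  # True: the password was two days old
print(rotate_if_due(cred, timedelta(days=1), now))  # False: it was just rotated
```

Shrinking `max_age` from 90 days to one day, or to hours, is what shrinks the window an attacker has to work with.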

Another critical element to breach prevention is the automatic repaving of infrastructure on a regular basis. This ensures that servers all return to a known good state very frequently – something that also limits the amount of time an attacker will have before being forced to start over.
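Repaving can be driven by the same kind of age check. The sketch below only selects which servers to rebuild next. It assumes a hypothetical `Server` record with a creation time, and it rebuilds the oldest servers in small batches so that capacity stays online. The actual rebuild from a known good image is left out because it depends entirely on the platform in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class Server:
    """A running machine and when it was last built (hypothetical model)."""
    name: str
    created_at: datetime


def due_for_repave(servers: List[Server], max_age: timedelta, now: datetime) -> List[Server]:
    """Return the servers that have been running for at least max_age."""
    return [s for s in servers if now - s.created_at >= max_age]


def next_batch(due: List[Server], batch_size: int) -> List[Server]:
    """Pick the oldest servers first, limited to batch_size at a time."""
    return sorted(due, key=lambda s: s.created_at)[:batch_size]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
fleet = [
    Server("web-1", now - timedelta(days=9)),
    Server("web-2", now - timedelta(days=1)),
    Server("web-3", now - timedelta(days=8)),
]

due = due_for_repave(fleet, timedelta(days=7), now)
print([s.name for s in due])                 # ['web-1', 'web-3']
print([s.name for s in next_batch(due, 1)])  # ['web-1']
```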

Responding to Breaches with Automation

Once your infrastructure is capable of being repaved automatically without causing downtime, and your sensitive passwords and secrets are being rotated on a frequent schedule, much of the breach response process consists of initiating those existing automated changes the moment an invader is detected. All passwords are immediately rotated, and all relevant “invadable” infrastructure (such as VMs and containers) is recreated from a known good state.
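In code, a breach response built this way is mostly a trigger for procedures that already exist. The following sketch assumes the rotation and repave routines are passed in as plain callables; `respond_to_breach` and the account and host names are hypothetical. Rotating secrets before rebuilding hosts is an assumption made here for illustration, not something the approach requires.

```python
from typing import Callable, Iterable, List


def respond_to_breach(
    accounts: Iterable[str],
    hosts: Iterable[str],
    rotate: Callable[[str], None],
    repave: Callable[[str], None],
) -> List[str]:
    """Rotate every secret, then rebuild every host; return an ordered action log."""
    log = []
    for account in accounts:
        rotate(account)  # reuse the routine that already runs on a schedule
        log.append(f"rotated {account}")
    for host in hosts:
        repave(host)     # reuse the routine that already rebuilds servers
        log.append(f"repaved {host}")
    return log


# Stand-ins that simply record what they were asked to do.
rotated, repaved = [], []
log = respond_to_breach(["orders-db"], ["web-1", "web-2"], rotated.append, repaved.append)

print(log)  # ['rotated orders-db', 'repaved web-1', 'repaved web-2']
```

Because the same routines run routinely, the response path is exercised long before it is needed in an emergency.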

As the evolving approaches to disaster recovery have shown, breach response becomes far less risky when the automatic procedures to handle it are already in place and run often.

Automation is perhaps the security officer’s most valuable tool. It allows an organization to move fast with feature development without increasing its attack surface, and to deal confidently with breaches if they occur.

In the next and final post in this blog series, we’ll discuss how the adoption of bleeding-edge technology becomes an accelerant for secure DevOps and not something that makes it harder to control.

Brian Kelly

Brian Kelly is Head of Conjur Engineering at CyberArk, where he focuses on creating products that add much-needed security and access management to the landscape of DevOps tools and cloud systems. Brian is passionate about building teams, cybersecurity, and DevOps. Find him on Twitter at @brikelly.