Security Challenges Around Chaos Engineering

Chaos engineering, which aims to make software-based systems as resilient as possible in the face of unexpected error conditions, is a relatively new discipline.  These conditions can range from the crash of a container to the disappearance of a full database from the network, necessitating that you reroute traffic.  With chaos engineering, errors such as these are introduced under the watchful eye of the application development and site reliability engineering teams using tools like Chaos Monkey from Netflix or full chaos platforms from companies like Gremlin.

 

Application Security Considerations for Chaos Testing

Application security is paramount to running any service in a modern infrastructure successfully, and it is often geared toward providing a centralized service that is used across the entire enterprise.

 

Chaos engineering practices typically start small and incorporate more infrastructure components with greater potential impact on an organization’s application production landscape over time. Thus, by the time that core functions like application security come into play, chaos engineers have already built a solid reputation and demonstrated their ability to benefit the organization.

 

There are a few things that you should consider before you add security components into the chaos testing mix:

  • Security information and event management (SIEM) teams need a heads-up whenever secrets management is involved in chaos testing. If they don’t know about the tests, they will treat a failed secrets service as a real incident. When applications like Conjur go offline and cannot make secrets available to applications as required, those applications will attempt to authenticate without the correct password, trip password policies, and potentially lock out important service accounts.
  • When authentication and authorization services go offline or become seriously degraded, requests will slow down and then fail across all interconnected systems. This will increase service requests from potentially all customer-facing and employee-facing services, including the service desks themselves. Therefore, the SIEM team should be informed before the tests begin and again when they end, and it should be kept in the loop at least until the first few tests confirm that the services have some resilience.
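One way to make the start and end notices hard to forget is to wrap each experiment in a helper that sends both. The following is only a minimal Python sketch: the function name siem_notice_window and the notify callable are hypothetical and stand in for whatever channel your SIEM team actually uses (chat webhook, ticket comment, pager note). They are not part of Chaos Monkey, Gremlin, or Conjur.

```python
from contextlib import contextmanager
from datetime import datetime, timezone
from typing import Callable, Iterator


@contextmanager
def siem_notice_window(experiment: str, notify: Callable[[str], None]) -> Iterator[None]:
    """Tell the SIEM team when a chaos experiment starts and when it ends.

    `notify` is a hypothetical callable supplied by the caller; it is an
    assumption of this sketch, not an API of any chaos or security tool.
    """
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    notify(f"START chaos experiment '{experiment}' at {started}")
    try:
        yield
    finally:
        # The end notice is sent even if the experiment raises, so the SIEM
        # team is never left waiting for an all-clear.
        ended = datetime.now(timezone.utc).isoformat(timespec="seconds")
        notify(f"END chaos experiment '{experiment}' at {ended}")
```

The fault injection itself would run inside a "with siem_notice_window(...)" block. Because the end notice sits in a finally clause, an experiment that aborts halfway still produces the all-clear message.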

 

Risks and Benefits of Including Application Security in Chaos Testing

The risk of including security components in chaos testing is that the tests will do exactly what they are best at doing: finding a single point of failure that will have a drastic impact on the entire enterprise’s ability to serve its customers. While this will improve the stability of the environment in the long term, in the meantime you will be in for more meetings about enforcing better change management policies than you can imagine. Hold your ground and reassure your teams that you are following best practices: show them the evidence of how internal chaos testing has reduced both the number and length of outages since your organization adopted chaos engineering.

 

The risks increase as the scope of a chaos engineering program expands to include core application security systems, but they are more than offset by the peace of mind that comes from knowing that your application security infrastructure has become more resilient through the process.

 

Summary

Chaos engineering rarely targets any one area of a system specifically, security components included, but those very components are the most likely to cause system-wide outages if they encounter problems that they have not been engineered to recover from or work around.
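As an illustration of what being engineered to work around a problem can look like on the client side, here is a minimal Python sketch of a secret lookup that retries a bounded number of times and then falls back to a last known good value instead of attempting a login with a wrong password. Everything in it is an assumption made for illustration: fetch_secret, SecretUnavailable, and the fetch callable are hypothetical names, and the sketch does not use the Conjur API.

```python
import time
from typing import Callable, Optional


class SecretUnavailable(Exception):
    """Raised when the secrets service is unreachable and nothing is cached."""


def fetch_secret(
    fetch: Callable[[], str],
    cached: Optional[str] = None,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the secrets service a bounded number of times, then fall back.

    `fetch` is a hypothetical callable wrapping whatever client the
    application uses; it is expected to raise ConnectionError on an outage.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt < attempts - 1:
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                sleep(base_delay * 2 ** attempt)
    if cached is not None:
        # Last known good value; it may be stale if the secret was rotated.
        return cached
    # Fail loudly rather than logging in with an empty or guessed password,
    # which is the behavior that trips lockout policies.
    raise SecretUnavailable("secrets service unreachable and no cached value")
```

A chaos test that takes the secrets service offline would then show whether callers degrade this way or hammer downstream systems with bad credentials. Whether a cached value is acceptable at all depends on your rotation policy.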

 

Security components are often overlooked because they are invisible when they work as they should, but they might be hiding serious problem areas, and they will be thrust into the limelight when they fail. Such failures range from being unable to handle all incoming authorization requests during peak hours to being unable to give new containers in the batch programming environment access to the secrets they require. This means that no matter which tools or practices your organization uses for chaos engineering, you should make sure that every type of security component is included in the testing, from authentication (authN) and authorization (authZ) to software-defined networking (SDN) and secrets management products.
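A simple way to keep that coverage honest is to compare the targets of your planned experiments against a required list. The Python sketch below assumes you can export experiment targets as plain strings; the label values and the function name missing_security_coverage are illustrative choices, not a feature of any chaos platform.

```python
from typing import Iterable, Set

# Component types named in the text above; the label strings themselves are
# an illustrative assumption.
REQUIRED_SECURITY_COMPONENTS = frozenset(
    {"authn", "authz", "sdn", "secrets-management"}
)


def missing_security_coverage(experiment_targets: Iterable[str]) -> Set[str]:
    """Return the required security component types that no experiment targets."""
    covered = {target.strip().lower() for target in experiment_targets}
    return set(REQUIRED_SECURITY_COMPONENTS - covered)
```

For example, a plan that targets only "authn", "authz", and a database would be reported as missing "sdn" and "secrets-management".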

 

Join the Conversation on the CyberArk Commons

If you’re interested in this and other open source content, join the conversation on the CyberArk Commons Community. Secretless Broker, Conjur, and other open source projects are a part of the CyberArk Commons Community, an open community dedicated to developers, engineers, cybersecurity researchers and other technically minded people. To discuss Kubernetes, Secretless Broker, Conjur, CyberArk Threat Research, and related topics, join me on the CyberArk Commons discussion forum.