Managing configuration across systems and environments

dev, January 1, 2025

We have about 500k source lines of code that implement all of our research engineering. We execute code in a variety of contexts, whether ephemeral lambdas, beefy EC2 instances with GPUs, or our laptops when developing code. Although we have somewhat traditional change control management for our code, we also have a central system that handles our configuration, no matter the execution context. Our RFC 5 describes this process and we thought we'd share some of the insights we've learned here.

What is configuration?

Configuration is somewhat of an "I'll know it when I see it" concept, but it generally refers to a set of important key–value pairs. For example, we use configuration to store:

  • Database secrets
  • The host of a service that all code needs to consistently access
  • High-level settings like the current log level (e.g. DEBUG or INFO)
  • The current git commit that the code is executing on

Configuration tends to consist of important values that cut across the codebase and change between execution contexts (e.g. a laptop vs. AWS), but never within an execution context: once code starts running, we do not allow configuration values to change.

In this way, we can ensure that—even across long-lived processes—all parts of our codebase have a consistent set of configurable values to read from. Notably, configuration is linked to the concept of environment variables. Many services and processes read variables from the current process environment, which gives rise to consistent patterns like seeing ENV=PROD for a production environment. While our configuration system interoperates with environment variables, we have designed a more consistent interface to these settings.

To maintain our system, we have a few rules for configuration. Namely, we require configuration to take on basic values with well-defined types, we forbid real-time changes to configuration values, and we enforce that loading configuration is the first thing any program does. That way, configuration is the root of all our code, and we can rely on these values existing before doing anything else.
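As an illustration of these rules, here is a minimal sketch using only the standard library. The `Config` class, its fields, and `get_config` are hypothetical names chosen for this example, not the actual implementation: values are typed, the object is frozen, and it is built exactly once.

```python
from dataclasses import dataclass, FrozenInstanceError
from functools import lru_cache


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after construction
class Config:
    # Hypothetical fields with basic, well-defined types.
    env: str = "dev"
    log_level: str = "INFO"
    service_host: str = "localhost"


@lru_cache(maxsize=1)
def get_config() -> Config:
    """Build the configuration once; every later call returns the same object."""
    return Config()


# Loading configuration is the first thing the program does.
config = get_config()

try:
    config.log_level = "DEBUG"  # an attempted change while the code is running
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```

Because the object is frozen and cached, every part of a long-lived process reads the same values for its whole lifetime.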

Configuration Sources

We currently have five sources of configuration, in priority order:

  1. Environment variables: we give the highest priority to variables declared in the environment, and we recognize variables commonly set by other systems.
  2. .lvc.env file: we respect a special environment file that we read in the current working directory.
  3. System-wide file: we also have a special system-wide file located in a canonical location. These types of settings are very useful for our local machines, where we want to set the same values across several projects.
  4. AWS Secrets Manager: we automatically pull from and reconcile secrets stored in Secrets Manager.
  5. Default: we define a default value for every configuration option. We have tried to set sensible defaults where possible, so many of them are reasonable to use as-is when testing code.
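The priority order above can be sketched as a first-match lookup over an ordered list of sources. This is a simplified illustration: the `resolve` helper, the source labels, and the example values are all assumptions made for the sketch, and each source is modeled as a plain dictionary rather than a real file or AWS call.

```python
from typing import Mapping, Sequence, Tuple

# A source is a (label, values) pair; the label doubles as provenance.
Source = Tuple[str, Mapping[str, str]]


def resolve(key: str, sources: Sequence[Source]) -> Tuple[str, str]:
    """Return (value, source_label) from the first source that defines key."""
    for label, values in sources:
        if key in values:
            return values[key], label
    raise KeyError(key)


# Hypothetical contents, listed from highest to lowest priority.
sources = [
    ("environment", {"LOG_LEVEL": "DEBUG"}),
    (".lvc.env file", {"LOG_LEVEL": "WARNING", "SERVICE_HOST": "10.0.0.5"}),
    ("system-wide file", {"SERVICE_HOST": "shared.internal"}),
    ("aws secrets manager", {"AIRTABLE_API_KEY": "secret-from-aws"}),
    ("default", {"LOG_LEVEL": "INFO", "SERVICE_HOST": "localhost", "AIRTABLE_API_KEY": ""}),
]

log_level = resolve("LOG_LEVEL", sources)
service_host = resolve("SERVICE_HOST", sources)
airtable_key = resolve("AIRTABLE_API_KEY", sources)
```

Returning the source label alongside the value is one simple way to keep track of where each setting came from.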

These configuration sources have a very consistent naming scheme that is enforced by our code. For example, if we want to declare an Airtable API token, we can easily declare this with proper typing. By giving the field an alias of airtable_api_key, we are asking our system to synchronize this value in code with the environment variable AIRTABLE_API_KEY (case-insensitive), which is an environment variable understood by other systems. Furthermore, we declare documentation, which gets automatically parsed by our toolchain. Finally, note that we use a double underscore __ in names as a special convention to synchronize with AWS Secrets Manager.

airtable__token: SecretStr = Field(default=SecretStr(""), alias="airtable_api_key")
"""
The Airtable token to use for Airtable operations.
"""

One additional feature of our system is that we first interpret the value of ENV, which is a specially handled configuration value that decides whether we are in development, staging, or production. Based on this value, we load different configuration values from AWS. We currently use the convention of prod/some/value and dev/some/value when naming secrets in AWS Secrets Manager. For example, the aforementioned Airtable token would be matched with prod/airtable/token or dev/airtable/token depending on the value of ENV. In this way, we can ensure that resources are not mixed across environments, while keeping the interface in code very simple: code paths just need to use the value of airtable__token without worrying about which environment they are in.
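The naming scheme can be sketched as two small string transformations. Both helper names are hypothetical, and the sketch assumes that the ENV value, lower-cased, is used directly as the secret prefix (e.g. PROD becomes prod); the real mapping from ENV to prefix may differ.

```python
def secret_name(field_name: str, env: str) -> str:
    """Map a field such as 'airtable__token' to 'prod/airtable/token'.

    Assumption: the double underscore separates path segments, and the
    lower-cased ENV value is the environment prefix.
    """
    return "/".join([env.lower(), *field_name.split("__")])


def env_var_name(alias: str) -> str:
    """Map a field alias to its conventional upper-case environment variable."""
    return alias.upper()


prod_secret = secret_name("airtable__token", "PROD")
dev_secret = secret_name("airtable__token", "dev")
env_var = env_var_name("airtable_api_key")
```

Calling code never builds these names itself; it only reads airtable__token, and the lookup picks the right secret for the current environment.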

Bells & Whistles

We currently use a combination of pydantic and our own custom code to implement our configuration system. Because our configuration system is the literal core of our code (its import location is core.), our entire toolchain understands configuration. As a result, we have been able to implement several features with little effort:

  1. Injection: when configuration loads, it injects all configuration back into the environment. That way, any other systems that read environment variables can access the relevant configuration values.
  2. Automatic rotation: we have specialized handlers that can automatically rotate secrets in AWS Secrets Manager. Using a combination of terraform and CDK, we can declare rotation handlers through the type system (e.g. Annotated[SecretStr, RandomRotation()]), and our CI/CD system automatically deploys rotations.
  3. Tooling: we have a CLI that interacts with our configuration system. See figures below.
  4. Provenance: we can tightly trace where configuration values come from and report relevant information.
  5. Reporting: whenever we spin up jobs, the first thing we do is report the current configuration and its provenance, so we know what configuration was set on any job. See figures below.
  6. Central library: every project has access to this configuration system, which provides a consistent and streamlined interface.

[Figure: Our CLI giving an overview of our configuration values.]
[Figure: Our CLI giving detailed information on a particular configuration value. For example, our CLI can read the associated description string in the source code and display that description in this view.]
[Figure: Our job reporting out configuration values when it starts.]
[Figure: Our job reporting configuration sources when it spins up.]
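Injection and start-up reporting can be sketched together in a few lines. The `inject` and `report` helpers, the shape of the `resolved` mapping, and the example values are assumptions made for illustration; the real report format is the one produced by our tooling, not this one.

```python
import os
from typing import Dict, List, Set, Tuple

# Hypothetical resolved configuration: key -> (value, source it came from).
Resolved = Dict[str, Tuple[str, str]]


def inject(resolved: Resolved) -> None:
    """Write every resolved value back into the process environment."""
    for key, (value, _source) in resolved.items():
        os.environ[key.upper()] = value


def report(resolved: Resolved, secret_keys: Set[str]) -> List[str]:
    """Build one line per setting with its provenance, masking secret values."""
    lines = []
    for key in sorted(resolved):
        value, source = resolved[key]
        shown = "**********" if key in secret_keys else value
        lines.append(f"{key.upper()}={shown} (from {source})")
    return lines


resolved = {
    "log_level": ("DEBUG", "environment"),
    "airtable_api_key": ("secret-from-aws", "aws secrets manager"),
}
inject(resolved)
startup_report = report(resolved, secret_keys={"airtable_api_key"})
```

After `inject` runs, any library or subprocess that reads environment variables sees the same values the configuration object holds.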

Security

One of the biggest benefits of our central configuration system is increased security. We can avoid hard-coded credentials, insecure credential sharing (e.g. sending secrets over plaintext in Slack), and other common pitfalls. If we update a secret value, all of our code gets access to the new value the next time it loads configuration, without our having to manually update our many systems. Even our code has security provisions: we use the specialized SecretStr construct, which prints as a run of asterisks instead of the real value. That way, sensitive values do not leak into our logs when a configuration value is printed.
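To show the masking idea without depending on any library, here is a minimal standard-library stand-in. `MaskedStr` is a hypothetical class written for this example and is not pydantic's implementation; it only mirrors the behavior described above, including an explicit `get_secret_value` accessor like the one pydantic's SecretStr provides.

```python
class MaskedStr:
    """Stand-in for a secret-string type: hides its value when printed."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        """Return the real value; callers must ask for it explicitly."""
        return self._value

    def __str__(self) -> str:
        # Non-empty secrets render as asterisks; empty ones render as empty.
        return "**********" if self._value else ""

    def __repr__(self) -> str:
        return f"MaskedStr('{self}')"


token = MaskedStr("key123")
log_line = f"Connecting to Airtable with token={token}"
```

The real value only appears where code explicitly calls `get_secret_value()`, which makes accidental leaks through string formatting much less likely.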