gptme was passing API keys on the command line where any user could read them
gptme's evaluation runner passed API keys as Docker CLI arguments, exposing them to every user on the system via ps or /proc. The fix took one file and five tests.
gptme is an open-source terminal-based AI agent with over 6,000 GitHub stars. Its evaluation runner, docker_reexec(), re-executes itself inside a Docker container to benchmark LLM providers. To do that, it needs to pass API keys for services like OpenAI, Anthropic and DeepSeek into the container. Until PR #1789 was merged on 23 March 2026, it did this by appending -e OPENAI_API_KEY=sk-proj-abc123... directly to the docker run command line, where every user on the system could read the values by running ps auxww or reading /proc/<pid>/cmdline.
This is CWE-214: Invocation of Process Using Visible Sensitive Information. It is one of those vulnerability classes that experienced developers know about in principle but still ship in practice, especially in tooling that starts as a personal utility and grows into shared infrastructure.
What was exposed and how
The vulnerable function sits in gptme/eval/main.py. It collects environment variables containing API keys, then builds a Docker command:
```python
env_args = []
for var in env_vars_to_pass:
    value = config.get_env(var)
    if value:
        env_args.extend(["-e", f"{var}={value}"])
```

Those env_args are then spliced directly into the subprocess.run() call that launches Docker. The resulting process looks something like this in a process listing:

```
docker run --rm -e OPENAI_API_KEY=sk-proj-... -e ANTHROPIC_API_KEY=sk-ant-... image:tag gptme-eval
```
On a stock Linux system, the full command line of every running process is visible to every local user. The /proc/<pid>/cmdline pseudo-file is world-readable by default (unless procfs is mounted with hidepid), and tools like ps, top and htop all expose it. No privilege escalation is required: if you can run ps auxww, you can read every API key that gptme passes to Docker.
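The exposure is easy to demonstrate. Here is a minimal sketch (Linux-only, and ours rather than anything from the gptme codebase) that reads a process's full argument list with no special privileges:

```python
# Minimal demonstration that /proc/<pid>/cmdline is world-readable on Linux.
# Arguments are stored NUL-separated; any local user can read any process's
# argv this way. (Illustrative sketch, not gptme code.)
import os

def read_cmdline(pid: int) -> list[str]:
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        raw = f.read()
    return [arg.decode() for arg in raw.split(b"\0") if arg]

# Reading our own argv as a stand-in for any other PID on the system:
print(read_cmdline(os.getpid()))
```

Iterating this over every numeric directory in /proc is all an attacker needs to harvest secrets from in-flight command lines.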
The exposure window is the duration of the Docker subprocess: seconds to minutes depending on the evaluation workload. On a shared CI runner or multi-user development machine, that is more than enough time for a monitoring script, a compromised process or a curious colleague to capture the values.
The project already knew this was a problem, partially
What makes this finding interesting is that the codebase contained a redact_env_args() function that walked the command list, identified arguments containing KEY or TOKEN in the variable name and replaced their values with ***REDACTED*** before logging:
```python
def redact_env_args(cmd: list[str]) -> list[str]:
    redacted = []
    i = 0
    while i < len(cmd):
        if cmd[i] == "-e" and i + 1 < len(cmd):
            env_assignment = cmd[i + 1]
            if "=" in env_assignment:
                var_name = env_assignment.split("=", 1)[0]
                if "KEY" in var_name or "TOKEN" in var_name:
                    redacted.append(cmd[i])
                    redacted.append(f"{var_name}=***REDACTED***")
                    i += 2
                    continue
        redacted.append(cmd[i])
        i += 1
    return redacted
```

This is a reasonable log-sanitisation function. It demonstrates that someone on the team understood the risk of secrets appearing in output. But the threat model was incomplete. Redacting values from application logs does nothing about the operating system's own process table, which is the primary exposure vector for CWE-214. The 18-line function was protecting the log file while the kernel happily served the plaintext secrets to anyone who asked.
There is also a telling contrast elsewhere in the codebase. A separate file, execenv.py, passes environment variables to Docker using -e VAR (without the value), which tells Docker to inherit the variable from the host's environment. That form never exposes the secret in the command line. Only docker_reexec() used the dangerous -e VAR=VALUE form.
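The safe inheritance form used by execenv.py can be sketched like this (the helper name is ours, not gptme's):

```python
# Sketch of the safe form: pass only the variable NAME with -e. Docker then
# resolves the value from the host environment at container start, so the
# secret never appears in the command line. (Helper name is ours.)
import os

def inherit_env_args(var_names: list[str]) -> list[str]:
    args: list[str] = []
    for var in var_names:
        if os.environ.get(var):       # only forward variables that are set
            args.extend(["-e", var])  # name only, never "NAME=value"
    return args

os.environ["DEMO_API_KEY"] = "sk-demo"  # hypothetical variable for the demo
print(inherit_env_args(["DEMO_API_KEY", "UNSET_VAR_12345"]))
# → ['-e', 'DEMO_API_KEY']
```

A process listing then shows only the variable name, which is harmless.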
What the fix does
The merged fix replaces inline CLI arguments with Docker's --env-file mechanism. The approach is straightforward:
- Collect all environment variable assignments into a list of KEY=VALUE strings
- Write them to a temporary file created atomically with tempfile.mkstemp(), which produces a file descriptor with 0600 permissions from the moment it exists on the filesystem
- Pass --env-file /tmp/gptme-docker-env-XXXX.env to Docker instead of individual -e flags
- Clean up the temporary file in a finally block after subprocess.run() completes
The core change is compact:
```python
# Before: secrets in CLI args
env_args.extend(["-e", f"{var}={value}"])

# After: secrets in a temp file with 0600 permissions
env_entries.append(f"{var}={value}")
# ... written via mkstemp, passed as --env-file
```

The use of mkstemp() rather than NamedTemporaryFile is a deliberate choice that emerged during code review. An earlier version of the fix used NamedTemporaryFile with a deferred os.chmod(), which introduced a TOCTOU (time-of-check-to-time-of-use) race: the file would briefly exist with wider permissions before the chmod completed. mkstemp() creates the file descriptor atomically with the requested permission bits, closing that window entirely. The except BaseException guard ensures the file is cleaned up if anything fails between creation and the Docker invocation.
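Put together, the fixed flow looks roughly like the following. This is a simplified reconstruction under stated assumptions, not gptme's exact code; the function name and temp-file prefix are ours:

```python
# Simplified reconstruction of the --env-file approach (names are ours, not
# gptme's exact code). mkstemp() creates the file with mode 0600 from the
# start, so there is no window where other users can read the secrets.
import os
import subprocess
import tempfile

def run_docker_with_env(image: str, env: dict[str, str]) -> None:
    fd, path = tempfile.mkstemp(prefix="gptme-docker-env-", suffix=".env")
    try:
        with os.fdopen(fd, "w") as f:  # fdopen takes ownership of the fd
            for var, value in env.items():
                f.write(f"{var}={value}\n")
        cmd = ["docker", "run", "--rm", "--env-file", path, image]
        subprocess.run(cmd, check=True)
    finally:
        os.unlink(path)  # delete the secrets file even if docker fails
```

Note that the secret values appear only in the file contents, never in cmd, so the process table sees nothing sensitive.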
The redact_env_args() function is removed entirely. It is no longer needed because secrets never appear in the command list. The replacement logging is simpler: it outputs the command with a <env-file> placeholder where the temp path would be, and separately lists the variable names (without values) being passed:
```python
logged_cmd = [
    "<env-file>" if i > 0 and docker_cmd[i - 1] == "--env-file" else tok
    for i, tok in enumerate(docker_cmd)
]
```

Even the temp file path itself is masked in logs to avoid leaking its location to log sinks.
The tests
The PR includes five targeted tests in test_cwe214_docker_env_leak.py:
- test_api_keys_not_in_cli_args: mocks subprocess.run, captures the Docker command and asserts that no secret value appears anywhere in the argument list
- test_env_file_used_instead: confirms that --env-file is present in the Docker command
- test_env_file_has_restrictive_permissions: checks the file mode inside the subprocess mock (while the file still exists on disk, before cleanup) and asserts 0o600
- test_multiple_keys_not_leaked: runs with three different API keys and confirms none appear in the command line
- test_env_file_contains_expected_content: reads the temp file during the mock and validates that it contains the expected KEY=VALUE entries
The important detail is that the permission and content tests run inside the subprocess.run mock, meaning they inspect the file while it still exists. After the function returns, the finally block deletes it. This is a clean testing pattern for temporary-file-based security fixes and worth noting for anyone writing similar tests.
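The pattern can be sketched with a toy stand-in for the function under test. Everything below is illustrative (our names, not the PR's actual test code):

```python
# Sketch of the "inspect inside the mock" pattern. The env file exists only
# for the duration of the mocked subprocess.run call, so the permission
# check must happen there. (Toy code of ours, not the PR's tests.)
import os
import stat
import subprocess
import tempfile
from unittest import mock

def launch_with_env_file(env: dict[str, str]) -> None:
    """Toy stand-in for the function under test (illustrative only)."""
    fd, path = tempfile.mkstemp(suffix=".env")
    try:
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{k}={v}\n" for k, v in env.items())
        subprocess.run(["docker", "run", "--env-file", path, "img"])
    finally:
        os.unlink(path)

def test_permissions_checked_inside_the_mock():
    seen = {}

    def fake_run(cmd, *args, **kwargs):
        # Inspect the env file *while it still exists*: the finally block
        # above deletes it as soon as this call returns.
        path = cmd[cmd.index("--env-file") + 1]
        seen["mode"] = stat.S_IMODE(os.stat(path).st_mode)

    with mock.patch("subprocess.run", side_effect=fake_run):
        launch_with_env_file({"OPENAI_API_KEY": "sk-test"})

    assert seen["mode"] == 0o600
```

The mock doubles as an observation point: it is the only moment in the function's lifetime when the file is both populated and present on disk.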
All five tests fail on the master branch, confirming the vulnerability exists, and pass on the fix branch.
A pattern in AI agent tooling
This is the third case study I have written where an AI agent framework ships a security gap that would have been caught by standard practice in any other context. The Hugging Face skills audit found SQL injection vectors that parameterised queries would have prevented. The Hermes Agent review found path traversal through unsanitised .worktreeinclude entries. Now gptme passes secrets on the command line where a two-decade-old best practice (use environment files or stdin, not CLI arguments) would have prevented it.
The commonality is not that the developers are careless. gptme's maintainers wrote a log-redaction function. They knew secrets in logs were a problem. The gap is between knowing about one exposure vector and thinking through the full threat model. Log redaction handles the application layer. CWE-214 is an operating system layer concern. Docker's own documentation recommends --env-file over -e VAR=VALUE for secrets. The information was available, it just was not applied at the time the code was written.
AI agent frameworks have a particular tendency toward this gap because they typically start as single-user developer tools. Nobody worries about process table visibility when they are the only user on the machine. But the evaluation runner in gptme is designed for benchmarking, which is exactly the kind of workload that ends up on shared CI runners and multi-tenant build systems. The threat model changed and the code did not change with it.
An edge case worth naming
The disprove analysis in the PR discussion raises one genuine limitation. If the gptme process receives a SIGKILL (signal 9) between writing the env file and the finally cleanup, the temp file persists on disk containing the secrets. This is a fundamental limitation of any file-based approach: finally blocks do not execute after SIGKILL. The file has 0600 permissions, so only the owning user can read it, but it remains on the filesystem until someone removes it.
This is not a reason to reject the fix. The previous approach exposed secrets to every user on the system for the entire subprocess duration. The new approach limits exposure to the file owner and only fails to clean up under a signal that cannot be caught. That is a meaningful reduction in attack surface, even if it is not perfect.
What was merged
The fix landed in PR #1789 on 23 March 2026. Two files changed: gptme/eval/main.py and the new test file. The PR went through multiple review rounds addressing TOCTOU concerns, double-close fd bugs and log leakage of the temp file path. The iteration itself is instructive: the first version of the fix had its own security issues that reviewers caught.
Docker's --env-file format does not support values containing newlines, which is unlikely for API keys but worth noting if the pattern is adopted for other secret types. The format is simple line-delimited KEY=VALUE; Docker passes values through verbatim, so quote characters are not stripped and become part of the value.
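Since the format cannot represent such values, a writer can reject them up front. A defensive sketch of ours (not code from the PR):

```python
# Defensive sketch (ours, not from the PR): Docker's env-file format is
# line-delimited KEY=VALUE, so embedded newlines cannot be represented.
# Fail loudly rather than silently truncating a secret.
def format_env_file(env: dict[str, str]) -> str:
    lines = []
    for var, value in env.items():
        if "\n" in var or "\n" in value:
            raise ValueError(f"cannot pass {var!r} via --env-file: embedded newline")
        lines.append(f"{var}={value}")
    return "\n".join(lines) + "\n"

print(format_env_file({"OPENAI_API_KEY": "sk-proj-abc"}))
# → OPENAI_API_KEY=sk-proj-abc
```

Failing at write time is preferable to Docker later interpreting the second line of a multi-line value as a separate variable.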
The fix is correct, the tests are solid and the review process was thorough. What lingers is the broader question of how many other AI agent frameworks are passing secrets through command-line arguments right now, on evaluation runners and CI systems that assume they are the only tenant. ps auxww | grep -i key would be a productive afternoon for anyone with a shared build server.