The Best AI Sandbox for Docker Compose
Docker Compose is where fake AI sandboxes start to show their limits.
A single code cell can run a script. A narrow command runner can execute npm test. A lightweight container sandbox can be enough when the workload is one process with known inputs and outputs.
But many real agent products do not look like that. The agent is asked to work on an app with Postgres, Redis, an API server, a frontend, a worker, a queue, a webhook listener, and a test runner. The repo already has a compose.yaml. The README says "run docker compose up." The agent needs the same development environment a human would use.
The best AI sandbox for Docker Compose is a real Linux VM. Freestyle VMs are the most powerful VMs for AI agents: they are hardware-virtualized, run real Linux, can run forever when configured that way, and expose the machine through APIs for commands, files, lifecycle, terminals, ports, snapshots, and cleanup.
The Docker Compose sandbox test
If you are evaluating an AI sandbox, do not start with "can it run Python?" Start with a normal multi-service app.
Ask the sandbox to install Docker Engine. Ask it to run dockerd. Ask it to use native overlay storage, cgroups, bridge networking, and published ports. Ask it to run docker compose up -d. Ask it to tail logs while the stack keeps running. Ask it to expose the web service on HTTPS. Ask it to preserve the prepared environment so the next agent session does not reinstall Docker from scratch.
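As a rough sketch, part of that sequence can be scripted as a smoke test. It assumes Docker Engine and the Compose v2 plugin are already installed and that it runs from a repository root containing a compose.yaml. The function names (`report`, `run_step`, `compose_smoke_test`) are illustrative. HTTPS exposure and snapshotting are left out because they go through the sandbox provider's API rather than the Docker CLI.

```shell
#!/bin/sh
# Smoke test sketch for a Compose-capable sandbox.
# Assumption: Docker Engine and the Compose v2 plugin are installed.

# Pure helper: turn an exit status into a PASS/FAIL line.
# $1 = step name, $2 = exit status of that step
report() {
  if [ "$2" -eq 0 ]; then
    printf 'PASS %s\n' "$1"
  else
    printf 'FAIL %s\n' "$1"
  fi
}

# Run one command quietly and report whether it succeeded.
# $1 = step name, remaining arguments = the command to run
run_step() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    report "$name" 0
  else
    report "$name" 1
  fi
}

# The checks, in the order a human would try them.
compose_smoke_test() {
  run_step "daemon reachable"   docker info
  run_step "compose file valid" docker compose config --quiet
  run_step "stack starts"       docker compose up -d --wait
  run_step "services listed"    docker compose ps
}
```

Calling `compose_smoke_test` prints one PASS or FAIL line per step, which is usually enough to see where a sandbox stops behaving like a Linux machine.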
That sequence is not exotic. It is the ordinary shape of modern software development.
For an agent, Docker Compose is especially important because it turns product context into executable context. The agent does not have to reverse-engineer every service from prose. The repository already declares the database, cache, worker, API, frontend, environment variables, volumes, and ports. A good agent sandbox should let the agent run that declared environment directly.
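One way an agent might read that declared context is to ask Compose itself. The sketch below assumes the Compose v2 plugin and a compose.yaml in the current directory. `has_service` and `describe_stack` are hypothetical helper names, not part of any tool.

```shell
#!/bin/sh
# Sketch: discover what a repository's compose.yaml declares.

# Pure helper: succeed if the service name ($1) appears as a whole line on stdin.
has_service() {
  grep -qxF -- "$1"
}

# Print the declared services and named volumes, indented for readability.
describe_stack() {
  echo "services:"
  docker compose config --services | sed 's/^/  /'
  echo "volumes:"
  docker compose config --volumes | sed 's/^/  /'
}
```

For example, `docker compose config --services | has_service db && echo "database declared"` lets the agent confirm a dependency exists before it tries to connect to it.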
If the sandbox cannot run Compose naturally, the product team has to translate the app into provider-specific primitives. That translation becomes product debt: one API for files, one for processes, one for databases, one for previews, one for logs, one for background jobs, one for teardown. Compose already solved much of that coordination at the development-environment layer. The sandbox should be able to host it.
Docker needs a real operating system
Docker Compose is not just a CLI that prints output. It coordinates a daemon, images, networks, volumes, containers, logs, health checks, and ports.
That is why the underlying sandbox matters.
Freestyle's Docker guide describes Freestyle sandboxes as microVMs with a full Linux kernel, so Docker Engine runs inside one natively. The guide calls out native overlayfs storage, cgroup v2, and bridge networking, with no vfs fallback. It also shows Docker running under systemd inside the VM, then confirms the engine is ready before snapshotting the machine.
Those details are the difference between "Docker-shaped" support and a machine that can actually run Docker.
When systemd is PID 1, dockerd can be supervised like it is on a normal Linux server. When the VM has normal networking, published container ports can answer on the VM. When the filesystem and kernel features are real, the agent can use the same Docker and Compose commands that local development docs already assume.
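A minimal preflight sketch for those properties might look like the following. It assumes a systemd-based VM with Docker already installed. The helper names are illustrative, and treating both `overlay2` and `overlayfs` as native overlay storage is an assumption about how the engine reports its driver (the name can differ depending on whether the containerd image store is in use).

```shell
#!/bin/sh
# Preflight sketch: is this a machine that can run Docker natively?

# Pure helper: classify a storage driver name as reported by `docker info`.
# Assumption: overlay2 and overlayfs both indicate native overlay storage.
classify_storage_driver() {
  case "$1" in
    overlay2|overlayfs) echo native ;;
    vfs)                echo fallback ;;
    *)                  echo other ;;
  esac
}

# Print the facts that distinguish a real Docker host from an emulation.
docker_preflight() {
  init=$(ps -p 1 -o comm=)                            # expect "systemd"
  unit=$(systemctl is-active docker)                  # expect "active"
  driver=$(docker info --format '{{.Driver}}')        # expect an overlay driver
  cgroup=$(docker info --format '{{.CgroupVersion}}') # expect "2"

  echo "PID 1:          $init"
  echo "docker unit:    $unit"
  echo "storage driver: $driver ($(classify_storage_driver "$driver"))"
  echo "cgroup version: $cgroup"
}
```

A `fallback` classification is the signal to look for: it suggests the sandbox could not provide overlay storage and Docker is copying whole filesystems instead.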
That is the agent-friendly path. The agent should not need a special createDatabase() tool when the project already has:
    services:
      db:
        image: postgres:16
      redis:
        image: redis:7
      api:
        build: .
        ports:
          - "3000:3000"
It should be able to run the stack, inspect the containers, read the logs, change the code, restart the service, and keep going.
Compose turns one agent task into a whole environment
The important thing about Compose is not that it runs containers. The important thing is that it runs the surrounding system.
A coding agent fixing a bug in a SaaS app may need the web server and database alive at the same time. A QA agent may need the browser, app, API, worker, and seeded test data. An app builder may generate a frontend and then discover it needs a local backend. A data tool may need a notebook server beside Postgres and Redis. A webhook agent may need a public URL while a local service receives events.
Those are not separate "code execution" jobs. They are one workspace.
In a real VM, the agent can operate that workspace with normal Linux commands:
    docker compose ps
    docker compose logs -f api
    docker compose exec db psql -U postgres
    docker compose restart worker
    curl localhost:3000/health
That matters for model performance because the agent has seen these workflows before. Public READMEs, issues, tutorials, and debugging notes routinely assume Docker Compose exists. Giving the agent a real Compose environment lets it follow those instructions instead of asking your product runtime to emulate them.
Logs need terminals, not only completed commands
docker compose up -d returns quickly. The useful information usually appears later.
The API logs an exception after the first request. The worker fails when it receives a queued job. The frontend dev server recompiles after a file change. The database prints a migration error. The health check starts failing after the service has already booted.
A one-shot command API is useful for setup and inspection, but it is the wrong interface for live logs. Freestyle VMs expose persistent PTY sessions: long-lived interactive shells inside the VM that can be attached, detached, and reattached over a WebSocket. The PTY docs say sessions survive client disconnects, VM suspends, and VM forks, and are backed by a real pseudo-terminal, so shell prompts, job control, REPLs, package managers, debuggers, and terminal UIs behave normally.
That maps directly to Compose work. Run docker compose logs -f in a PTY. Detach when the frontend disconnects. Reattach later and keep reading. Send Ctrl-C when the agent needs to stop a stream. Open another terminal to run a migration or inspect a container.
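Inside such a terminal, the log workflow can be sketched with two small helpers. This assumes the Compose v2 plugin and a service name passed by the caller. The function names and the failure pattern (`error|exception|fatal|panic`) are illustrative choices, not a standard.

```shell
#!/bin/sh
# Sketch: following and triaging logs from a running Compose stack.

# Pure helper: keep only log lines on stdin that look like failures.
# The pattern is an illustrative guess; tune it per project.
filter_failures() {
  grep -iE 'error|exception|fatal|panic' || true
}

# Stream recent logs for one service. Ctrl-C stops the stream, not the stack.
# $1 = service name, $2 = number of trailing lines (default 100)
follow_logs() {
  docker compose logs -f --tail "${2:-100}" "$1"
}

# One-shot triage: failures from the recent past for one service.
# $1 = service name, $2 = time window (default 10m)
recent_failures() {
  docker compose logs --no-color --since "${2:-10m}" "$1" 2>&1 | filter_failures
}
```

`follow_logs api` is the long-lived stream that benefits from a persistent terminal, while `recent_failures worker 30m` suits a one-shot command when the agent only needs a summary.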
The terminal is not a cosmetic feature here. It is the product surface for debugging a live multi-container environment.
Ports should become previews
Compose stacks usually contain services that listen on ports. The agent needs those ports to be reachable by the product and, often, by the user.
Freestyle VM domains route public HTTPS traffic from a hostname to a port inside a VM. The docs show the normal flow: verify ownership for custom domains, point DNS at Freestyle, map the domain to a VM port, and run a service that listens on that port. For quick previews, the docs also describe *.style.dev preview domains that need no DNS verification and still get HTTPS automatically.
That means a container published on port 8080 can become a real preview URL. The agent can bring up the Compose stack, your product can map a preview domain to the VM port, and the user can inspect the running app without the agent packaging or deploying anything.
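Before mapping a preview, the product needs to know which VM port a service actually answers on. The sketch below uses `docker compose port` for that. The helper names are hypothetical, and the step that maps a domain to the port is omitted because it belongs to the provider's API.

```shell
#!/bin/sh
# Sketch: find the VM port a Compose service is published on.

# Pure helper: extract the port from an address such as "0.0.0.0:8080" or "[::]:8080".
host_port() {
  echo "${1##*:}"
}

# Ask Compose where a container port is published on the host.
# $1 = service name, $2 = container port
published_port() {
  host_port "$(docker compose port "$1" "$2")"
}

# Build a local URL for a quick check from inside the VM.
# $1 = service name, $2 = container port, $3 = optional path (default "/")
local_url() {
  printf 'http://localhost:%s%s\n' "$(published_port "$1" "$2")" "${3:-/}"
}
```

With the stack running, `curl -fsS "$(local_url api 3000 /health)"` confirms the service answers inside the VM before a preview domain is pointed at the same port.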
This is a cleaner model than turning every preview into a separate build pipeline. During agent work, the app is already running. The preview should point at the machine where the files, logs, terminals, containers, and failures already live.
Snapshot the expensive setup
Docker setup is expensive enough that you should not repeat it for every session.
Freestyle snapshots solve that at the VM level. The Docker guide shows creating a builder VM, installing Docker Engine and the Compose plugin, enabling Docker under systemd, waiting for the daemon to answer, confirming the storage driver and cgroup version, and then snapshotting. A VM created from that snapshot starts with Docker already active.
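The in-VM half of that flow can be sketched roughly as follows. It assumes Docker Engine and the Compose plugin have already been installed by the distribution's usual method, and it stops where the snapshot call would go, since that call goes through the provider's API and is not shown here. `wait_until` and `prepare_docker_base` are illustrative names.

```shell
#!/bin/sh
# Sketch: get a builder VM into a snapshot-ready state.

# Retry a command once per second until it succeeds or attempts run out.
# $1 = maximum attempts, remaining arguments = the command to run
wait_until() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

prepare_docker_base() {
  # Start dockerd under systemd and keep it enabled across boots.
  systemctl enable --now docker || return 1

  # Wait for the daemon to answer before trusting anything else.
  wait_until 60 docker info || return 1

  # Record the storage driver and cgroup version for the build log.
  docker info --format 'driver={{.Driver}} cgroup=v{{.CgroupVersion}}'
  docker compose version

  # Snapshot the VM here through the provider's API (not shown).
}
```

Waiting for `docker info` to succeed before snapshotting matters because a snapshot taken while the daemon is still starting would carry that half-ready state into every VM created from it.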
That is the right abstraction for agent products. Build a base environment once. Put Docker, language runtimes, browsers, CLIs, package caches, and service supervisors in it. Then create fresh VMs from the snapshot for users, tasks, evals, or branches of exploration.
The agent still gets an ordinary machine. Your product gets faster startup and a repeatable baseline. The user does not wait while every new workspace rediscovers how to install the same foundation.
Snapshots also make Compose useful for testing changes. Start from a known-good Docker-ready VM, clone the project, run the stack, and let the agent work. If the agent reaches a risky point, fork the VM and try alternatives from the current running state instead of rebuilding the environment twice.
When the agent's changes become source code that needs history, review, branches, or rollback, store that code in Freestyle Git or another repo system. The VM is the live workspace. Git is the reviewable record.
What to look for in a Docker Compose sandbox
A serious Docker Compose sandbox for AI agents should pass a practical checklist:
- It runs real Linux, not a partial shell.
- It can run Docker Engine inside the sandbox.
- It supports Compose v2 without a custom provider translation layer.
- It has normal process supervision for dockerd and app services.
- It supports native container storage and networking behavior.
- It lets containers publish ports that can become HTTPS previews.
- It has persistent terminals for live logs and interactive debugging.
- It can snapshot a prepared Docker environment.
- It can run indefinitely when the product needs a live workspace.
- It can be stopped, resized, forked, and deleted through an API.
That checklist is intentionally concrete. "Can run code" is too vague. "Can run Docker Compose like a Linux machine" is much harder to fake.
The bottom line
Docker Compose is a useful line between toy agent execution and real agent infrastructure.
If your agent only evaluates short snippets, a code runner may be enough. If your agent works on real applications, it will eventually need databases, workers, services, logs, ports, package installs, terminals, snapshots, and a runtime that survives longer than one command.
At that point, the best AI sandbox for Docker Compose is not a narrower sandbox with more feature flags. It is a real Linux VM that can run the environment the project already describes.
Start with Freestyle VMs when your agent needs Docker Compose to behave like Docker Compose, not like a provider-specific imitation of a computer.

