Building a Multi-Agent Containerization System at Bunnyshell

At Bunnyshell, we’re building the environment layer for modern software delivery. One of the hardest problems our users face is converting arbitrary codebases into production-ready environments, especially when dealing with monoliths, microservices, ML workloads, and non-standard frameworks.

To solve this, we built MACS: a multi-agent system that automates containerization and deployment from any Git repo. With MACS, developers can go from raw source code to a live, validated environment in minutes, without writing Docker or Compose files manually.

In this post, we’ll share how we architected MACS internally, the design patterns we borrowed, and why a multi-agent approach was essential for solving this problem at scale.

Problem: From Codebase to Cloud, Automatically

Containerizing an application isn’t just about writing a Dockerfile. It involves:

  • Analyzing unfamiliar codebases
  • Detecting languages, frameworks, services, and DBs
  • Researching Docker best practices (and edge cases)
  • Building and testing artifacts
  • Debugging failed builds
  • Composing services and deploying environments

This process typically takes hours or days for experienced DevOps teams. We wanted to compress it to minutes, with no human intervention.

The Multi-Agent Approach

Similar to Anthropic’s research assistant and other cognitive architectures, we split the problem into multiple specialized agents, each responsible for a narrow set of capabilities. Agents operate independently, communicate asynchronously, and converge on a working deployment through iterative refinement.

Our agent topology:

  • Orchestrator: Breaks goals into atomic tasks, tracks plan state
  • Delegator: Manages task distribution and parallelism
  • Analyzer: Performs static & semantic code analysis
  • Researcher: Queries web resources for heuristics and Docker patterns
  • Executor: Builds, tests, and validates artifacts
  • Memory Store: Stores past runs, diffs, artifacts, logs

This modular architecture enables robustness, parallel discovery, and reflexive self-correction when things go wrong.
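
To make the division of labor concrete, here is a minimal sketch of how the roles and the messages they exchange could be modeled. The class and field names are illustrative, not our internal API.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Task:
    """A unit of work passed between agents (illustrative schema)."""
    id: str
    kind: str                      # e.g. "analyze", "research", "build"
    payload: dict                  # task-specific input: repo path, query, ...
    parent_id: str | None = None   # link back to the Orchestrator's goal tree

@dataclass
class Result:
    task_id: str
    ok: bool
    artifacts: dict = field(default_factory=dict)  # Dockerfiles, logs, notes
    error: str | None = None

class Agent(Protocol):
    """Every agent exposes the same narrow interface."""
    def handle(self, task: Task) -> Result: ...

class Analyzer:
    def handle(self, task: Task) -> Result:
        # Static and semantic analysis of the checked-out repo would go here.
        detected = {"python": ["Flask"]}  # placeholder detection result
        return Result(task_id=task.id, ok=True, artifacts={"stack": detected})
```

Keeping every agent behind a single handle(Task) -> Result call is what lets a delegating component fan work out in parallel and swap implementations without touching the planner.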

Pipeline Flow

Each repo flows through a pipeline of loosely coupled agent interactions (a simplified sketch of the core generate-test-refine loop follows this list):

  1. Initialization
    A Git URL is submitted via UI, CLI or API
    The system builds a contextual index: file tree, README, CI/CD hints, existing Dockerfiles
  2. Planning
    The Orchestrator builds a goal tree: identify components, generate artifacts, validate outputs
    Delegator breaks tasks into subtrees and assigns to Analyzer/Researcher in parallel
  3. Discovery
    Analyzer inspects the codebase: detects languages such as Python, Node.js, and Go, plus frameworks like Flask, FastAPI, and Express
    Researcher consults external heuristics (e.g., “best Dockerfile for Django + Celery + Redis”)
  4. Synthesis
    Executor generates Dockerfile and Compose services
    Everything is run in ephemeral Docker sandboxes
    Logs and test results are collected
  5. Refinement
    Failures trigger self-prompting and diff-based retries
    Agents update their plan and try again
  6. Transformation
    Once validated, Compose files are converted into bunnyshell.yml
    Environment is deployed on our infrastructure
    A live URL is returned
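
Steps 3 through 5 are the heart of the system: generate, test in an ephemeral sandbox, and retry on failure. Below is a heavily simplified, self-contained sketch of that loop; the generate and refine callables stand in for our LLM-backed Executor, and the docker CLI invocation mirrors what happens inside the sandbox.

```python
import subprocess
import tempfile
import uuid
from pathlib import Path

MAX_RETRIES = 3

def sandbox_build(repo_path: str, dockerfile: str) -> tuple[bool, str]:
    """Build a candidate Dockerfile into a throwaway image and capture the logs."""
    tag = f"macs-candidate-{uuid.uuid4().hex[:8]}"
    df = Path(tempfile.mkdtemp()) / "Dockerfile"
    df.write_text(dockerfile)
    proc = subprocess.run(
        ["docker", "build", "-f", str(df), "-t", tag, repo_path],
        capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def containerize(repo_path: str, generate, refine) -> str:
    """generate() proposes a Dockerfile; refine() revises it from the failing logs."""
    dockerfile = generate(repo_path)
    for _ in range(MAX_RETRIES):
        ok, logs = sandbox_build(repo_path, dockerfile)
        if ok:
            return dockerfile
        dockerfile = refine(dockerfile, logs)  # diff-based retry happens here
    raise RuntimeError("no working Dockerfile after retries")
```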

Memory & Execution Traces

Unlike simpler systems, we separate planning memory from execution memory:

  • Planning Memory (Orchestrator): Tracks reasoning paths, subgoals, dependencies
  • Execution Memory (Executor): Stores validated artifacts, performance metrics, diffs, logs

Only Executor memory is persisted across runs; this lets us optimize for reuse and convergence without bloating the planning context.
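
A rough sketch of that split, assuming a plain JSON-on-disk store for the persisted execution memory (the storage backend and field names are illustrative):

```python
import json
from pathlib import Path

class PlanningMemory:
    """In-process only: reasoning paths, subgoals, dependencies. Dropped after each run."""
    def __init__(self) -> None:
        self.subgoals: list[dict] = []

    def record(self, subgoal: dict) -> None:
        self.subgoals.append(subgoal)

class ExecutionMemory:
    """Persisted across runs: validated artifacts, metrics, diffs, logs."""
    def __init__(self, path: str = "execution_memory.json") -> None:
        self.path = Path(path)
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else []

    def remember(self, repo: str, artifact: str, metrics: dict) -> None:
        self.entries.append({"repo": repo, "artifact": artifact, "metrics": metrics})
        self.path.write_text(json.dumps(self.entries, indent=2))

    def similar_to(self, repo: str) -> list[dict]:
        # Reuse: surface artifacts from earlier runs of the same repo to the Executor.
        return [e for e in self.entries if e["repo"] == repo]
```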

Implementation Details

  • Models:
    - Orchestrator: GPT-4.1 (high-context)
    - Sub-agents: 3B–7B domain-tuned models
  • Runtime:
    - Each agent runs in an ephemeral Docker container with CPU/RAM/network caps
  • Observability:
    - Full token-level tracing of prompts, responses, API calls, and build logs
    - Used for debugging, auditing, and improving agent behavior over time
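
For the runtime caps, here is a minimal sketch of confining a step to an ephemeral container. The docker run flags are standard CLI options; the image, limits, and command are illustrative.

```python
import subprocess

def run_sandboxed(image: str, command: list[str]) -> subprocess.CompletedProcess:
    """Run a command in a throwaway container with CPU, memory, and network caps."""
    return subprocess.run(
        [
            "docker", "run", "--rm",   # ephemeral: container is removed on exit
            "--cpus", "1.0",           # CPU cap
            "--memory", "512m",        # RAM cap
            "--network", "none",       # no network unless the agent needs it
            image, *command,
        ],
        capture_output=True, text=True, timeout=600,
    )

result = run_sandboxed("python:3.12-slim", ["python", "-c", "print('hello from the sandbox')"])
print(result.stdout)
```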

Why Multi-Agent?

We could have built MACS as a single LLM chain, but this quickly broke down in practice. Here’s why we went multi-agent:

  • Parallelism: Analyzer and Researcher run concurrently to speed up discovery (see the sketch after this list)
  • Modular reasoning: Each agent focuses on a narrow domain of expertise
  • Error isolation: Build failures don’t halt the planner—they trigger retries
  • Reflexivity: Agents can revise their plans based on test results and diffs
  • Reusability: Learned solutions are reused across similar projects
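
As a concrete illustration of the parallelism point, here is a stripped-down asyncio sketch; the agent calls are stubbed, whereas in MACS they are LLM-backed services.

```python
import asyncio

async def analyze(repo_path: str) -> dict:
    await asyncio.sleep(0.1)  # stand-in for the Analyzer agent
    return {"language": "python", "framework": "flask"}

async def research(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for the Researcher agent
    return ["use a slim base image", "run gunicorn, not the dev server"]

async def discover(repo_path: str) -> list:
    # Discovery agents run concurrently instead of one after the other.
    return await asyncio.gather(
        analyze(repo_path),
        research("best Dockerfile for Flask"),
    )

stack, hints = asyncio.run(discover("./my-repo"))
```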

What We’ve Learned

  1. Multi-agent debugging is hard: you need good observability, logs, and introspection tools.
  2. Robustness beats optimality: our system favors “works for 95%” over exotic edge-case perfection.
  3. Emergent behavior happens: some of the most efficient retry paths were not explicitly coded.
  4. Boundaries matter: defining clean interfaces (e.g., JSON messages) between agents pays off massively.
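
On the last point: the inter-agent contract stays honest when it is an explicit, validated schema rather than free-form text. A sketch of such a message, using only the standard library (field names are illustrative):

```python
import json

# An illustrative inter-agent message; every hop is a plain JSON document.
message = {
    "task_id": "build-api-service-01",
    "from_agent": "delegator",
    "to_agent": "executor",
    "kind": "build",
    "payload": {
        "dockerfile": "FROM python:3.12-slim\nCOPY . /app\n",
        "context": "./services/api",
    },
}

REQUIRED = {"task_id", "from_agent", "to_agent", "kind", "payload"}

def validate(raw: str) -> dict:
    """Reject malformed messages at the boundary instead of deep inside an agent."""
    msg = json.loads(raw)
    missing = REQUIRED - msg.keys()
    if missing:
        raise ValueError(f"message missing fields: {sorted(missing)}")
    return msg

validate(json.dumps(message))
```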

What’s Next

We’re expanding MACS with:

  • Better multi-language support (Polyglot repo inference)
  • Orchestrator collaboration (multi-planner mode)
  • Plugin SDKs for self-hosted agents and agent fine-tuning

Our north star: a fully autonomous DevOps layer, where developers focus only on code while the system handles the rest.

Want to try it?

Just paste your repo, and Hopx by Bunnyshell instantly turns it into production-ready containers.

Try it now