Environments for LM agents: from agentic RL to context optimization
January 26, 2026

In 2025, AI agents moved from single-step to multi-step workflows, but they have yet to reliably complete most real-world tasks end-to-end. Outside of coding, most agents still fail at long-horizon or cross-tool/app tasks, exposing a core limitation: even SOTA models still require explicit optimization to align with an agent’s specific goals.

This has triggered a new optimization wave. Reinforcement learning for LM agents is evolving beyond RLHF toward verifiable, environment-grounded feedback (RLVR). In parallel, prompt and context learning have re-entered the spotlight - as Andrej Karpathy tweeted, many agent improvements can be framed as system prompt learning. GEPA-style methods that update prompts and context rather than weights can enable faster, earlier optimization than full-blown RL training loops.

At the same time, a new type of startup has emerged around high-fidelity environments. These environments are no longer “fake websites just for RL”; they’re increasingly treated as paradigm-agnostic data foundries for eval, testing, and multiple optimization loops. On the infrastructure side, a growing set of RL libraries and platforms is pushing to productize the RL pipeline for enterprises, abstracting rollouts, training, and policy updates into cloud-native systems. CoreWeave’s acquisition of OpenPipe is just one example of this trend.

These environments sit at the center of modern agent improvement and a number of new startups are already enabling both RL and non-weight-based learning.

High-fidelity environments are the new data foundry

Environments are not just mock software in a Docker container. The startups in this space build high-fidelity replicas of real software systems - from web apps, terminals, internal tools, and tool-call interfaces to structured assets like financial models or CAD files - all running inside sandboxed infrastructure. What makes them powerful is not only that they enable realistic agent behavior, but that they capture granular task data: step-by-step tasks, explicit rewards, and full session trajectories, turning interaction into reusable data for testing and optimization.
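
To make the “data foundry” framing concrete, here is a minimal, hypothetical sketch of what an environment-generated trajectory record might contain. The field names and the example task ID are illustrative assumptions, not any vendor’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One agent action inside the sandboxed environment."""
    observation: str                  # what the agent saw (page state, tool output, ...)
    action: str                       # tool call or UI action the agent took
    reward: float = 0.0               # verifiable step-level signal, if the task defines one
    metadata: dict = field(default_factory=dict)

@dataclass
class Trajectory:
    """A full session: the reusable unit of data an environment produces."""
    task_id: str                      # e.g. "update-crm-after-renewal-call" (hypothetical)
    instructions: str                 # natural-language task definition
    steps: list = field(default_factory=list)
    task_reward: float = 0.0          # end-to-end verification outcome

    def total_reward(self) -> float:
        """Combine step-level and task-level signals into one scalar."""
        return self.task_reward + sum(step.reward for step in self.steps)
```

The key contrast with raw software logs is that every step carries the state the agent saw, the action it took, and a verifiable signal, which is exactly what downstream eval and optimization loops consume.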

Today, the primary buyers of high-fidelity environments are AI labs, but more recently enterprises have realized they need these environments to fine-tune their own models. But why is this the case? Why can’t agents simply learn from production interactions? Why do enterprises need to build this in-house instead of using traditional staging environments? In practice, none of the existing approaches are designed for iterative learning:

  • No safe trial-and-error in production: Real enterprise workflows are cross-tool. For example, a workflow might start with reading a customer email, then cross-checking the internal revenue dashboard, reviewing past calls in Gong, and finally updating a CRM record. Testing agents directly in production risks failures across systems.

  • Logs ≠ trajectories: Human workflows produce software logs, but those logs are disconnected, timestamped events: an email received at 8:01am, an internal dashboard viewed at 8:20am, a Gong call reviewed, a CRM record updated at 11:03am. Real workflows, meanwhile, are long-running and fragmented by meetings and breaks, so the logs say little about workflow state, intent, or causal ordering - it’s almost impossible to tell how the 8:01am email led to the 11:03am CRM change. On top of that, log fidelity varies across tools, with the most granular signals often owned by third-party systems and never reaching the customer. This makes software logs difficult to reconstruct or replay into agent-usable trajectories for training.

  • Reward design is hard and ongoing: Both RL and prompt optimization depend on verifiable, step- and task-level signals generated by environments, not ad hoc rules. Defining what’s verifiable, validating it end-to-end, and updating it as tasks evolve are continuous costs.
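
As an illustration of that last point, below is a minimal sketch of a verifiable, task-level reward check that grades the environment’s end state against an expert rubric rather than applying ad hoc rules. The CRM-style field names and rubric structure are hypothetical.

```python
def verify_crm_update(final_state: dict, rubric: dict) -> float:
    """Task-level reward: grade the environment's end state against an expert rubric.

    `final_state` is a snapshot of the sandboxed environment after the agent
    finishes; `rubric` encodes what a correct outcome looks like. Both
    structures (and the CRM field names) are hypothetical.
    """
    checks = [
        final_state.get("crm.opportunity.stage") == rubric["stage"],
        final_state.get("crm.opportunity.amount") == rubric["amount"],
        bool(final_state.get("crm.notes", "").strip()),   # notes were actually written
    ]
    # Partial credit per satisfied check; 1.0 means the task fully verified.
    return sum(checks) / len(checks)
```

Keeping these checks in sync with the underlying workflow is the “ongoing” part: every new field, stage, or exception path means another rubric update.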

Environment startups with early traction are predominantly focused on coding (ex: Preference Model, Proximal, Mechanize, Vmax, Bespoke Labs, etc.) and computer-use (ex: Plato, Matrices, Deeptune, Fleet, Halluminate, etc.). These use cases usually have clearer goals and more objective verification signals (ex: passed the unit test, updated a field), making them suitable for environment-based optimization. Other use cases like financial modeling, business charting, and CAD require collecting step-by-step workflow trajectories rather than just domain artifacts (ex: design files, financial models, etc.) of the kind vendors like Isidor and Mercor typically provide, and such trajectories are also increasingly in demand.

In domains like accounting, finance, and hardware design where correctness is contextual, producing high-quality task data becomes increasingly difficult and expert-driven. In the near term, the opportunity is to automate environment authoring and reward definition by using session recordings to enrich state and translate expert decisions into reward rubrics. Over time, a broad set of enterprises will want trajectory-reward data to automate their own unique workflows.

From environments to optimization-as-a-service

Downstream systems are consuming more environment-generated data than ever. Today, environments are primarily wired into RL post-training loops, feeding trajectories and rewards into policy weight updates (further details in the next section). But nothing in the definition of environment-generated data implies that updating weights is the only downstream optimization path.

From an agent builder’s perspective, the goal is simple: improve production agent performance. In principle, builders have two options: 1) update system prompts and context, or 2) update policy weights. The environment sits upstream of both, as the same trajectories can drive either prompt-level optimization or RL. In theory, builders can pursue both options to maximize performance, but in practice, performance goals, cost, and internal capability usually favor one. We’re already seeing this play out across the environment startups and optimization-centric providers:

  • Environment-based RL: As mentioned, model and agent labs are the biggest buyers of environments (with billions-plus budgets) because they run policy-level RL training loops at scale, have strong demand for vertical optimization, and have long outsourced data generation to specialized vendors - historically supervised-learning data startups such as Scale, Labelbox, Surge, and Mercor, and now the RL post-training environment startups discussed above. Importantly, environment companies sell only the data layer; the RL trainers and infrastructure are still operated by the labs. Environment startups typically start with this business model.

  • Environment-based Context Optimization: Most enterprises build agents at the prompt and context layer. Instead of manual prompt tuning, environment startups like Bespoke Labs use coding environments with prompt optimizers (e.g. GEPA / DSPy) to automatically improve system prompts, and the “GFRL” (Gradient-Free RL) experiments run by computer-use environment startup Plato validated the performance gains from this method (a simplified sketch of the loop follows after this list). This lets enterprises benefit from environments without running RL, significantly expanding the environment buyer base from labs to enterprises.

  • RL-as-a-Service: In performance-critical domains (ex: finance, healthcare, regulated workflows), prompt optimization alone often isn’t enough. Enterprises want policy-level gains for their agents, but usually lack the RL talent and infrastructure to operate complex pipelines in-house (more on this in the next section). As a result, a wave of RL-as-a-Service vendors (ex: Applied Compute, Trajectory, CGFT, Osmosis, etc.) has emerged to deliver end-to-end agent optimization. These companies collect agent data (ex: code diffs, output schemas, tool call specs) from agents already running in production, or build environments, and then run RL training loops to deliver custom models or agents as the final output. Adoption has been slower, not because demand is low, but because enterprise data is messy and rarely RL-ready.

  • Others: Eval-First Optimization: A parallel class of agent optimization startups focuses on learning the evaluation function itself. These startups (ex: Judgment Labs, Arize AI, and Humanloop, acquired by Anthropic) post-train domain-specific LLM judges on production artifacts and feedback to reliably score agent outputs. Their value-add is not building environments or running agent RL pipelines, but powering prompt iteration and feedback loops, reflecting the belief that real user interaction data is still the most coveted. This category is large but outside the main focus of this piece.
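
To make the context-optimization path concrete (the second bullet above), here is a heavily simplified, GEPA-style loop: evaluate candidate system prompts against environment tasks, then ask an LLM to reflect on the failing transcripts and propose a revision. `run_in_environment` (returning a score and a transcript per task) and `llm_reflect` are stand-ins for an environment harness and an LLM call, not any particular library’s API.

```python
def optimize_system_prompt(seed_prompt, tasks, run_in_environment, llm_reflect, iterations=10):
    """Gradient-free, GEPA-style prompt search: environment rollouts plus LLM
    reflection on failing transcripts, with no model weight updates."""
    best_prompt, best_score = seed_prompt, -1.0
    candidate = seed_prompt
    for _ in range(iterations):
        # 1) Roll out the candidate prompt on every environment task.
        results = [run_in_environment(candidate, task) for task in tasks]
        score = sum(r for r, _ in results) / len(results)

        # 2) Keep the candidate only if it beats the incumbent.
        if score > best_score:
            best_prompt, best_score = candidate, score

        # 3) Reflect on the failing transcripts and propose the next candidate.
        failures = [transcript for r, transcript in results if r < 1.0]
        if not failures:
            break  # every task verified; nothing left to fix
        candidate = llm_reflect(best_prompt, failures)
    return best_prompt
```

Real optimizers such as GEPA add more sophisticated candidate selection (e.g., maintaining a Pareto set of prompts across tasks), but the core loop - rollouts, verification, reflection - is the one that environments make possible without any weight updates.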

Looking ahead, the line between environment startups and optimization startups is likely to blur. As environment startups gain control over both high-fidelity environment trajectories and production traces, some will integrate closely with downstream prompt optimizers or RL infrastructure vendors to support both prompt optimization and RL training. Some of these startups may evolve into full optimization-as-a-service platforms as both paths are natural long-term product evolutions.

Recent NeurIPS 2025 work on RLVR shows that RL mainly reweights existing reasoning paths, and that unlocking new reasoning paths requires more deliberate, large-scale data curation. This underscores that strategic leverage still sits upstream: whoever controls high-quality trajectories (simulated or production) ultimately shapes how agents perform in the future.

Environment scale reshapes the RL infra stack

Even with high-quality trajectory data in place, RL adoption is likely constrained by system complexity. At scale, the challenge is less about training methods (PPO vs. GRPO vs. DPO) and more about orchestrating large-scale rollouts (agent inference within environments), training, and versioning. This has led to a growing set of RL libraries that coordinate these workloads, as well as end-to-end RL platforms that abstract away the underlying infrastructure and research stack. The common system complexities include:

  • Rollout vs. Training: Rollout requires high-TFLOPs GPUs for frequent, latency-sensitive policy inference while producing trajectories. For training, the workload is memory-bound, limited by how quickly weights, activations, and KV caches can be moved between memory and compute. They stress different parts of the stack and should run asynchronously to avoid poor utilization.

  • Policy Proliferation: Frequent updates create many coexisting policies (e.g., LoRA adapters, checkpoints), making version routing, replay, and fallback increasingly difficult, especially when policies are trained in parallel.

These complexities explain why RL libraries (e.g., OpenPipe ART, Slime, Miles, Verl) tightly integrate inference engines (e.g., SGLang, vLLM), training backends (e.g., Megatron), and schedulers (e.g., Ray). Many libraries (e.g., Slime, Prime Intellect’s prime-rl) are designed around decoupling inference from training via async, server-based rollouts, so trajectory generation and policy updates can run independently and avoid GPU idling. Beyond decoupling, some systems reduce cost via LoRA-based policy updates (e.g., OpenPipe ART) with versioning and routing support, while others loosely couple environments and rollout so that environments can run as remote services (e.g., AReaL, ROLL, Slime).
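
A minimal sketch of that decoupling pattern: rollout workers stream trajectories into a bounded queue and read whatever policy version is currently published, while a separate training loop consumes batches and publishes new versions. The queue, `agent`, and `trainer` objects here are illustrative stand-ins, not the API of any of the libraries named above.

```python
import queue

# Bounded buffer between rollout workers and the trainer.
trajectory_queue = queue.Queue(maxsize=1024)

# Latest published policy; rollout workers read it, the trainer overwrites it.
policy_version = {"id": 0, "weights_ref": "policy-v0"}

def rollout_worker(env, agent):
    """Runs (in its own thread/process) against inference-oriented GPUs; never blocks on training."""
    while True:
        version = dict(policy_version)                     # snapshot; may lag the trainer slightly
        trajectory = agent.run_episode(env, policy=version["weights_ref"])
        trajectory_queue.put((version["id"], trajectory))  # tag data with the policy that produced it

def training_loop(trainer, batch_size=64):
    """Runs on training-oriented hardware; consumes whatever rollouts exist."""
    batch = []
    while True:
        batch.append(trajectory_queue.get())               # blocks only when no rollouts are ready
        if len(batch) >= batch_size:
            new_ref = trainer.update(batch)                # e.g., publish a new LoRA adapter
            policy_version["id"] += 1
            policy_version["weights_ref"] = new_ref
            batch.clear()
```

Because the two loops share only a queue and a policy pointer, each side can run on the hardware that suits it and neither stalls waiting for the other.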

On top of this, a growing set of end-to-end RL platforms packages these primitives into unified systems, abstracting execution and orchestration so teams can scale agent RL without custom in-house infrastructure. Prime Intellect focuses on an open ecosystem across environments, rewards, and managed RL infrastructure. Tinker focuses on developer experience via an API-first interface for running RL and other optimization workflows. CoreWeave’s acquisitions of OpenPipe and Weights & Biases point to vertically integrated, cloud-native RL stacks, while highly performant training toolkits like Unsloth may evolve from kernel-level optimization into more opinionated RL platforms over time.

As environment-generated data scales, opportunities for RL systems lie in:

  • Cloud-native, workload-aware: Async rollouts and training run on workload-specific hardware (high-TFLOPs vs. memory-optimized) across clusters and regions, with disk-backed replay for fault tolerance.

  • High-frequency updates with lower cost: RL libraries that natively support FP8/FP4 training without loss of stability can unlock higher update frequency at lower cost, creating durable efficiency advantages.

  • Environment orchestration: The long-term bottleneck will shift to the environment layer: stateful, CPU- and IO-heavy systems that must handle synchronization and retries at scale. Defensibility in RL infra will come from implementations that keep thousands of environments busy, consistent, and debuggable.
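
For a sense of what that orchestration layer involves, below is a minimal asyncio sketch that keeps a bounded number of environments busy and retries transient failures with jittered backoff. `env_pool` and its `acquire`/`release`/`run` methods are hypothetical stand-ins for an environment service client, and tasks are assumed to be dicts with an "id" field.

```python
import asyncio
import random

class TransientEnvError(Exception):
    """Recoverable environment failure (timeout, lost sandbox, flaky dependency)."""

async def run_task_with_retries(env_pool, task, max_retries=3):
    """Run one task in a leased environment, retrying transient failures with backoff."""
    for attempt in range(max_retries):
        env = await env_pool.acquire()
        try:
            return await env.run(task)                            # returns a trajectory plus its reward
        except TransientEnvError:
            await asyncio.sleep(2 ** attempt + random.random())   # jittered exponential backoff
        finally:
            await env_pool.release(env)
    raise RuntimeError(f"task {task['id']} failed after {max_retries} attempts")

async def orchestrate(env_pool, tasks, concurrency=256):
    """Keep up to `concurrency` environments busy at once and collect their trajectories."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(task):
        async with sem:
            return await run_task_with_retries(env_pool, task)

    return await asyncio.gather(*(bounded(t) for t in tasks))
```

The hard parts at production scale are exactly the pieces this sketch waves away: consistent state snapshots, replayable failures, and observability across thousands of concurrent sandboxes.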

We’re still in the early days of LM agent optimization. At Gradient, we’re excited to partner with the next generation of founders building tools and platforms that turn agents into reliable systems for automating enterprise workflows.