How OpenClaw Agents Learn Without Reinforcement Learning
OpenClaw was previously known as Clawdbot and Moltbot. This guide applies to all versions.
OpenClaw agents learn through context accumulation: MEMORY.md, AGENTS.md, the self-improvement skill, and heartbeat-driven promotion. No GPU required.
Key takeaways
- OpenClaw agents improve over time by writing corrections to disk, promoting them to workspace files, and loading them in future sessions, with no model retraining required
- MEMORY.md stores durable facts and preferences; AGENTS.md stores behavioral rules that apply at the start of every session
- The self-improvement skill captures errors and corrections in a .learnings/ folder, then promotes the most important ones to AGENTS.md and SOUL.md
- Heartbeat reviews give the agent a regular scheduled opportunity to surface issues, review logs, and update memory, closing the learning loop automatically
- Context accumulation changes what the agent knows and follows per session; RL changes the underlying model weights. Different mechanisms, genuinely different requirements.
OpenClaw agents can get measurably better over time without touching a GPU or fine-tuning a model. This is not a workaround for people who can't run OpenClaw-RL. It's what every OpenClaw user already has available, and for most use cases, it's the right tool.
Always review commands your agent suggests before approving them. Don't paste prompts from sources you don't trust.
What does "agent learning" mean for OpenClaw?
For OpenClaw, learning means behavior durably changes as a result of experience. Not through gradient descent, but through what gets written into the files loaded at session start.
Traditional LLMs operate in a stateless session model: each conversation starts fresh, with no knowledge carried forward. OpenClaw breaks this by loading a set of workspace files (MEMORY.md, AGENTS.md, SOUL.md, TOOLS.md) at the beginning of every session. Whatever lives in those files shapes how the agent reasons, communicates, and decides. So the question "can my agent learn?" becomes: what gets written to those files, when, and by what process?
The answer is a four-component system: persistent memory, behavioral rule files, a correction-capture workflow, and an automated review loop. Each component is simple on its own. Together they form a closed learning cycle.
This is not a consolation prize for users without multi-GPU infrastructure. Context accumulation and RL address the same goal through different mechanisms. Context accumulation works with any model (Claude, GPT, Gemini, Llama), takes effect immediately after promotion, and produces human-readable rules you can inspect and edit. RL changes the model weights themselves, which is more durable in some ways but requires a self-hosted model and significant infrastructure. The right choice depends on your setup and goals.
How does MEMORY.md store what the agent has learned?
MEMORY.md is a curated long-term memory file loaded at the start of every session. The agent writes durable facts and preferences there, and those facts are present from the first turn of every future conversation.
OpenClaw's memory system uses two layers:
- MEMORY.md: Curated, high-signal context. Stable facts, preferences, and decisions that shouldn't need re-establishing every session.
- memory/YYYY-MM-DD.md: Daily append-only log. Running notes and context from the current day, auto-read at session start (today and yesterday).
The distinction matters. Daily logs capture everything; MEMORY.md captures what's worth keeping indefinitely. A reasonable practice: at the end of a project or after a significant correction, write the key takeaway to MEMORY.md explicitly.
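That end-of-project practice can be sketched as two plain file appends. This is an illustrative sketch, not OpenClaw's internal implementation; the `workspace` parameter and both function names are hypothetical:

```python
from datetime import date
from pathlib import Path

def log_daily(workspace: Path, note: str) -> None:
    """Append a running note to today's append-only daily log."""
    daily = workspace / "memory" / f"{date.today():%Y-%m-%d}.md"
    daily.parent.mkdir(parents=True, exist_ok=True)
    with daily.open("a") as f:
        f.write(f"- {note}\n")

def promote_to_memory(workspace: Path, takeaway: str) -> None:
    """Write a durable takeaway to MEMORY.md so it loads at every session start."""
    with (workspace / "MEMORY.md").open("a") as f:
        f.write(f"- {takeaway}\n")
```

The asymmetry is the point: everything goes to the daily log, but only distilled takeaways earn the second call.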
OpenClaw gives agents two built-in tools for recalling memory mid-conversation. memory_search runs semantic search over all indexed .md files in the memory directory, matching by meaning rather than exact words. memory_get reads a specific file or line range when the agent already knows where the information lives. Both are agent-side tool calls (like read or exec), not CLI commands. The agent decides when to use them based on the conversation.
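To make the search-versus-read distinction concrete, here is a toy stand-in for memory_search. The real tool matches by meaning (semantic search over an index); this sketch ranks files by plain word overlap, which only approximates the idea, and the function name and signature are assumptions:

```python
from pathlib import Path

def toy_memory_search(memory_dir: Path, query: str, top_k: int = 3) -> list[str]:
    """Rank .md files by word overlap with the query (a crude stand-in
    for the semantic matching the real memory_search tool performs)."""
    query_words = set(query.lower().split())
    scored = []
    for md in sorted(memory_dir.rglob("*.md")):
        overlap = len(query_words & set(md.read_text().lower().split()))
        if overlap:
            scored.append((overlap, md.name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]
```

memory_get, by contrast, is just a targeted file read: no ranking needed because the agent already knows the path.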
OpenClaw also triggers a pre-compaction memory flush when a session approaches context limits. Before the session history gets compressed, the agent gets a turn to write any durable context to disk. Important facts survive compaction automatically, as long as the workspace is writable.
To see this in practice: mention a preference (say, "I prefer Zsh over Bash") in one session. The agent writes it to MEMORY.md. The next session loads that file at startup, and the preference is already there. No repeated questions, no re-establishing context.
For a full walkthrough of the memory system, see OpenClaw Persistent Memory Guide.
What does AGENTS.md do and how does it encode behavioral rules?
AGENTS.md is the file OpenClaw loads at session start as operating instructions. It's where behavioral rules live: how to reason, what to prioritize, what to avoid, and how to handle recurring situations.
Every rule in a mature AGENTS.md is either something the user wrote deliberately or something promoted from a correction. The file accumulates rules over time as the agent gets corrected and those corrections get promoted to permanent rules.
AGENTS.md is distinct from SOUL.md. SOUL.md handles persona and tone (how the agent communicates). AGENTS.md handles decision-making and workflow: what the agent does and how it reasons about it. A correction like "stop asking for confirmation on low-risk file reads" belongs in AGENTS.md. A correction like "be more concise, skip the preamble" belongs in SOUL.md.
Here's what rule evolution looks like in practice.
Day 1 AGENTS.md entry:

```markdown
## Scope discipline
Stay tightly aligned to what the user asked for.
```

After user correction ("you keep expanding scope when I ask for small changes"):

```markdown
## Scope discipline
Stay tightly aligned to what the user asked for.
Expand scope only when the user asks for expansion or when necessary to complete the request correctly.
When intent is unclear and the action has side effects, pause and confirm.
When intent is clear and the action is low risk and reversible, proceed efficiently.
```

That second version came from a real correction, captured and promoted. The agent that loads it won't make the same scope-creep mistake again.
For guidance on structuring this file effectively, see How to Write an Effective AGENTS.md for OpenClaw.
How does the self-improvement skill capture corrections?
The self-improvement skill is the correction-capture layer. When something goes wrong or a better approach surfaces, the skill logs a structured entry to one of three files in .learnings/:
- .learnings/LEARNINGS.md: Corrections, knowledge gaps, and best practices (category: correction, knowledge_gap, best_practice)
- .learnings/ERRORS.md: Command failures, unexpected tool behavior, API errors
- .learnings/FEATURE_REQUESTS.md: Capabilities the user wanted that didn't exist
The skill triggers on six situations:
- A command or operation fails unexpectedly
- The user corrects the agent ("No, that's wrong..." or "Actually...")
- The user requests a capability that doesn't exist
- An external API or tool fails
- The agent's knowledge turns out to be outdated
- A better approach is discovered for a recurring task
When a user says "Stop using Python for quick JSON parsing, use jq, it's already installed," the skill logs it to .learnings/LEARNINGS.md:
```markdown
## [2026-03-13] Use jq for JSON parsing
- Category: best_practice
- Context: User corrected Python approach; jq is installed and preferred
- Pattern: Quick JSON filtering from CLI. Prefer jq over Python one-liners.
```

The self-improvement skill's instructions tell the agent to check .learnings/ before major tasks. When it sees this entry, it uses jq instead of Python. Once the learning gets promoted to TOOLS.md or AGENTS.md (see next section), it loads automatically at session start and the agent follows it from the first turn.
Install the skill via ClawHub:

```shell
clawhub install self-improving-agent
```

Or create the log files manually:

```shell
mkdir -p ~/.openclaw/workspace/.learnings
touch ~/.openclaw/workspace/.learnings/LEARNINGS.md
touch ~/.openclaw/workspace/.learnings/ERRORS.md
touch ~/.openclaw/workspace/.learnings/FEATURE_REQUESTS.md
```

How does the self-improvement skill promote learnings to persistent rules?
Logging a correction is the first step. Promotion is what makes it permanent.
When a learning applies broadly (not just to one session or one task), it gets moved from .learnings/ to a workspace file that loads at every session start. The self-improvement skill defines the promotion targets:
| Learning type | Promote to | Example |
|---|---|---|
| Workflow improvements | AGENTS.md | "Spawn subagents for long tasks" |
| Tool gotchas | TOOLS.md | "Git push needs auth configured first" |
| Behavioral patterns | SOUL.md | "Be concise, avoid disclaimers" |
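The routing in the table above can be sketched as a lookup plus a file append. Only the three target filenames come from the skill; the learning-type labels and function name are made up for illustration:

```python
from pathlib import Path

# Target files from the promotion table; the type labels are illustrative.
PROMOTION_TARGETS = {
    "workflow": "AGENTS.md",
    "tool_gotcha": "TOOLS.md",
    "behavioral": "SOUL.md",
}

def promote(workspace: Path, learning_type: str, rule: str) -> Path:
    """Append a promoted rule to the workspace file for its learning type."""
    target = workspace / PROMOTION_TARGETS[learning_type]
    with target.open("a") as f:
        f.write(f"\n{rule}\n")
    return target
```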
A logged learning is reactive and session-specific. A promoted rule is automatic and applies to all future sessions from the first turn. Promotion is what closes the learning loop.
When to promote: when the same correction has appeared two or three times, or when the learning is clearly general rather than situational.
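A crude way to operationalize the "two or three times" heuristic, assuming each logged correction carries a short pattern label (recognizing that two corrections are the same pattern is the hard, judgment-heavy part this sketch skips):

```python
from collections import Counter

def ready_to_promote(pattern_tags: list[str], threshold: int = 2) -> list[str]:
    """Return pattern labels that have recurred at least `threshold` times."""
    return [tag for tag, n in Counter(pattern_tags).items() if n >= threshold]
```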
Promotion example:
Three separate sessions produce similar corrections:
- "Stop summarizing when I just asked for a yes or no"
- "You don't need to explain what you're about to do before doing it"
- "Just answer. Don't set up the answer first."
These aren't three separate learnings. They're the same pattern. Promote once to SOUL.md:
```markdown
## Response style
Answer directly. Skip setup and preamble.
Do not summarize what you're about to do before doing it.
```

Every session after promotion gets that rule from the first turn. The correction stops being necessary.
How does the heartbeat loop automate the review cycle?
The heartbeat turns the self-improvement skill from a passive log into an active review loop. Without scheduled review, learnings accumulate in .learnings/ but never get promoted.
OpenClaw's heartbeat runs the agent in the main session on a regular schedule (default: every 30 minutes). It reads HEARTBEAT.md and handles all listed tasks in a single turn. Because it runs in the main session, the agent has full conversational context and can make informed decisions about priority.
HEARTBEAT.md lives in the workspace root (same directory as AGENTS.md and MEMORY.md). A version that includes learning review:
```markdown
# Heartbeat checklist
- Check for urgent messages in inbox
- Review calendar for events in next 2 hours
- Scan .learnings/LEARNINGS.md for entries added in the last 7 days
- If any entries are broadly applicable, promote to AGENTS.md, SOUL.md, or TOOLS.md
- If nothing needs attention, reply HEARTBEAT_OK
```

With this checklist, the agent wakes up every 30 minutes, scans .learnings/, and promotes anything ready. No manual review required.
Configure the heartbeat interval in openclaw.json:
```json
{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "30m",
        "activeHours": { "start": "08:00", "end": "22:00" }
      }
    }
  }
}
```

Setting activeHours prevents heartbeat runs during off-hours, keeping things quiet when you're not working.
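The activeHours gate is simple clock arithmetic. A sketch of what such a check might look like, handling both a same-day window like 08:00 to 22:00 and an overnight window that wraps midnight (the function is illustrative, not OpenClaw's implementation):

```python
from datetime import time

def within_active_hours(now: time, start: str = "08:00", end: str = "22:00") -> bool:
    """True if `now` falls inside the configured activeHours window."""
    start_t, end_t = time.fromisoformat(start), time.fromisoformat(end)
    if start_t <= end_t:                    # same-day window, e.g. 08:00-22:00
        return start_t <= now <= end_t
    return now >= start_t or now <= end_t   # overnight window, e.g. 22:00-06:00
```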
How does the full OpenClaw learning loop work?
The four components connect into a single cycle: correction → capture → review → promotion → load → behavior change.
- User corrects the agent or a command fails unexpectedly
- Self-improvement skill logs it to .learnings/LEARNINGS.md
- Heartbeat scans .learnings/ on schedule
- Broadly applicable entries get promoted to AGENTS.md, SOUL.md, or TOOLS.md
- Promoted rules load at the start of every future session
The timeline runs in hours and days, not training epochs. A correction captured this morning can be promoted at the next heartbeat and active by the afternoon session. Promotion is the entire pipeline.
One thing to know: promoted rules can be edited or deleted. If the heartbeat promotes something too aggressively (a situational fix treated as universal), open the target file and remove it. You stay in control of what the agent follows.
The loop isn't glamorous. There's no training graph. But the outcome is real: behavior durably changed because a correction got written to a file that loads every session.
OpenClaw context accumulation vs reinforcement learning
Context accumulation and RL both produce behavioral improvement over time. The mechanisms and requirements differ enough that choosing between them is mostly about infrastructure, not ambition.
What context accumulation gives you:
- Works with any model (Claude, GPT-4, Gemini, Llama, anything OpenClaw supports)
- Zero GPU requirement, runs on the same server that runs your agent
- Changes take effect immediately after promotion, no training cycle, no wait
- Rules are human-readable and editable, you can inspect every behavioral rule the agent is following, correct mistakes, and remove rules that no longer apply
- Works even when corrections are sparse, a few times a week or a few times a month
What RL adds:
- Weight-level learning, behavioral improvements encoded directly in model parameters, not just loaded context
- Persistence without context, learned behavior applies even when workspace files aren't loaded
- Generalization from signal, the model can generalize across thousands of conversation turns to behaviors you never explicitly corrected
When RL makes sense:
OpenClaw-RL wraps a self-hosted model as an OpenAI-compatible API and continuously improves the policy from live conversations. It requires a self-hosted model (Qwen3-4B by default), consistent preference signals across many conversations, and infrastructure capable of running background training. If you're running a hosted Claude API, RL isn't available to you. If you have a self-hosted model with consistent user traffic, it might be worth the setup cost.
For most OpenClaw users (running Claude or GPT-4 via API, handling one user's workflow, correcting behavior a few times a week), context accumulation is the complete answer. The agent keeps forgetting preferences? That's a MEMORY.md problem. Keeps making the same reasoning mistake? That's an AGENTS.md problem. Both are solvable at the context layer.
One honest limitation: when the base model's defaults conflict with your rules, the model doesn't always follow the rules perfectly. A promoted rule in AGENTS.md is a strong instruction, not a weight-level constraint. Edge cases happen. RL eliminates some of those edge cases by baking the preference into the weights. Context accumulation reduces them by making the rules explicit and present every session.
For a full breakdown of what RL adds and how to set it up, see OpenClaw-RL Explained.
How to set up the OpenClaw learning stack from scratch
Setting up the full learning system takes four steps.
Step 1: Create structured workspace files.
If your workspace doesn't have these files yet, OpenClaw creates minimal stubs on setup. Replace them with structured versions. The three promotion targets are MEMORY.md (facts), AGENTS.md (behavioral rules), and SOUL.md (tone and persona):
```markdown
# MEMORY.md

## Preferences
[Write durable preferences here: editors, tools, communication style]

## Project context
[Key facts about ongoing projects]

## Learned corrections
[Patterns worth remembering explicitly]
```

```markdown
# AGENTS.md

## Core operating behavior
[How the agent should reason and prioritize]

## Scope discipline
[What the agent should and shouldn't expand on its own]

## Side effects and approval
[When to ask vs when to proceed]
```

```markdown
# SOUL.md

## Tone
[How the agent should communicate: direct, casual, formal, etc.]

## Interaction style
[What to emphasize, what to avoid in responses]
```

Step 2: Install the self-improvement skill.
```shell
clawhub install self-improving-agent
```

This creates the .learnings/ directory and the three log files. The skill triggers automatically when corrections or failures occur.
Step 3: Configure HEARTBEAT.md with a learning review task.
```markdown
# Heartbeat checklist
- Check inbox for urgent messages
- Review .learnings/LEARNINGS.md for entries added in the last 7 days
- Promote broadly applicable entries to AGENTS.md, SOUL.md, or TOOLS.md
- Reply HEARTBEAT_OK if nothing needs attention
```

Enable the heartbeat in openclaw.json if it isn't already running.
Step 4: Test the loop with a deliberate correction.
Tell the agent something that's currently wrong: a preference it doesn't know, a workflow it handles incorrectly. After the correction, check .learnings/LEARNINGS.md and confirm the entry was logged. Wait for the next heartbeat, confirm promotion happened. Start the next session and verify the behavior changed.
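The "confirm the entry was logged" check is just a text search over LEARNINGS.md. A small helper sketch (the workspace path and keyword are whatever your deliberate correction used; the function is hypothetical):

```python
from pathlib import Path

def correction_logged(workspace: Path, keyword: str) -> bool:
    """True if LEARNINGS.md exists and mentions a keyword from your correction."""
    learnings = workspace / ".learnings" / "LEARNINGS.md"
    return learnings.exists() and keyword.lower() in learnings.read_text().lower()
```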
The first complete loop (from correction to behavior change) typically takes one to two heartbeat cycles.
Key Terms
MEMORY.md: Curated long-term memory file loaded at session start. Stores durable facts, preferences, and learned context.
AGENTS.md: Operating instructions file loaded at every session start. Contains behavioral rules the agent follows when reasoning and deciding.
Self-improvement skill: OpenClaw skill that logs corrections and errors to .learnings/ and promotes broadly applicable learnings to workspace files.
Heartbeat: Periodic agent run (default every 30 min) in the main session. Reads HEARTBEAT.md and handles queued review tasks in a single turn.
Context accumulation: The mechanism by which OpenClaw agents improve. Corrections get written to disk and loaded in future sessions, changing behavior without changing model weights.
Promotion: Moving a learning from the temporary .learnings/ log to a permanent workspace file (AGENTS.md, SOUL.md, TOOLS.md) so it applies to all future sessions.
Compaction: The process where OpenClaw compresses session context when the context window fills. A pre-compaction memory flush preserves durable facts before compression.
FAQ
Does OpenClaw agent learning actually change the underlying AI model?
No. Context accumulation does not modify model weights. Corrections get written to Markdown files (MEMORY.md, AGENTS.md, SOUL.md) that are loaded at session start, shaping behavior through context rather than training. The base model's weights remain unchanged. OpenClaw-RL is a separate project that does modify model weights by wrapping a self-hosted model and running background training on live conversations, but it requires a self-hosted model and additional infrastructure.
How long does it take for a correction to affect OpenClaw's behavior?
Immediately in the current session. The agent adjusts mid-conversation. The more interesting question is how fast it sticks across sessions. That depends on your heartbeat schedule. With a 30-minute heartbeat that includes a learning review task, a correction logged at 10 AM can be promoted to AGENTS.md by 10:30 and active in every session after that. Without a heartbeat, it waits until you or the agent manually reviews .learnings/.
What is the difference between MEMORY.md and AGENTS.md in OpenClaw?
MEMORY.md stores facts and context: what the agent knows. Things like "user prefers Zsh", "current project uses Node 22", or "API key rotated on March 1." AGENTS.md stores behavioral rules: how the agent should act. Things like "expand scope only when asked", "ask before deleting files", or "lead with the important problem when something is broken." The distinction: MEMORY.md is what; AGENTS.md is how.
Does the OpenClaw self-improvement skill work with all AI models?
Yes. The self-improvement skill writes to plain Markdown files in the workspace. It doesn't depend on the model provider. Claude, GPT-4, Gemini, or any model that OpenClaw supports can read and write to .learnings/. The promotion targets (AGENTS.md, SOUL.md, TOOLS.md) are also model-agnostic. The skill was built to work with OpenClaw's file-based architecture, not a specific model API.
How often should I review and promote learnings in OpenClaw?
If you've added a learning review task to HEARTBEAT.md, the agent handles this automatically on every heartbeat cycle. Without a heartbeat task, a weekly manual review is reasonable for active setups. A practical signal: if you notice yourself making the same correction more than twice in a week, a promotion is overdue. Check .learnings/LEARNINGS.md, find the pattern, and promote it before the third correction.
Evidence & Methodology
Sources used in this article are official OpenClaw documentation and the official self-improving-agent skill from the openclaw/skills repository:
- OpenClaw memory system: docs.openclaw.ai/concepts/memory
- OpenClaw agent workspace: docs.openclaw.ai/concepts/agent-workspace
- Heartbeat vs cron: docs.openclaw.ai/automation/cron-vs-heartbeat
- Self-improving-agent skill: playbooks.com/skills/openclaw/skills/self-improving-agent
- OpenClaw-RL: github.com/Gen-Verse/OpenClaw-RL
No claims are sourced from competitor blogs. The self-improvement skill was verified locally against the installed skill file. OpenClaw was previously known as Clawdbot (November 2025) and Moltbot (January 2026) before settling on its current name.
Related Resources
- OpenClaw Persistent Memory Guide
- How to Write an Effective AGENTS.md for OpenClaw
- OpenClaw-RL Explained
Changelog
| Date | Change |
|---|---|
| 2026-03-13 | Initial publication |