The 24-Hour Memory Cliff in Production AI Agents: Why Two Recency Time-Constants Beat One

Luna 2026-05-12

A four-day decay-curve analysis that broke my assumption about agent memory

If you run a long-lived AI agent — one with persistent memory across sessions, weeks of accumulated conversation, and a retrieval layer that surfaces past context on every query — at some point you have to decide what "recent" means. Most production systems use a single time constant: an exponential decay over τ days that smoothly down-weights older content. Pick τ, ship it, move on.

I shipped a system like that, and four days of decay-curve telemetry told me I had it wrong.

The system in question is a memory pipeline serving a long-running agent: roughly 60,000 vectors in long-term storage, a continuous background process that samples and re-mixes recent memories into a dream-substrate log (think: gist extraction during idle time), and a recency-weighted retrieval layer that surfaces older content when current queries are similar enough to it.

For the first six months of operation, the recency model was a single exponential. Recent things were weighted heavily; older things faded. Reasonable, defensible, completely standard. Then I started doing forensic analysis on the dream-substrate logs — looking at what content actually showed up in the agent's background sampling over multi-day windows — and the data refused to fit a single decay curve.

What I found was a two-layer structure with very different time constants. Once I saw it, every single assumption about "recency" in the system had to be revised.

This post is the engineering writeup of that finding, plus what it implies for any production agent that runs long enough to accumulate real history.

The data: a clean 24-hour cliff for session-specific vocabulary

Setup: I have a continuous background process that samples high-importance memories every 15 minutes, re-mixes them through a generative substrate at high temperature, and writes the resulting associative output to a log. Over a 4-day window I tracked the frequency of distinctive session-specific vocabulary in those logs.

Specifically: a narrative content-generation session ran on Day 0 morning. The session generated a small, distinctive vocabulary set — proper nouns, distinctive object names, setting words. I tracked the frequency of those tokens in the dream-substrate log every hour for the next 96 hours.
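
The measurement itself can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: it assumes each log entry is a (timestamp, text) pair and that a token counts as present when it appears as a case-insensitive substring. The function name, the sample texts, and the tracked token are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def token_frequency_by_day(samples, tokens, day0):
    """Fraction of samples per day (relative to day0) that mention each token.

    samples: iterable of (datetime, str) pairs (assumed log format)
    tokens:  tracked vocabulary, matched case-insensitively as substrings
    day0:    datetime marking the start of Day 0
    """
    totals = defaultdict(int)                     # day index -> sample count
    hits = defaultdict(lambda: defaultdict(int))  # day index -> token -> count
    for ts, text in samples:
        day = (ts - day0).days
        totals[day] += 1
        lowered = text.lower()
        for tok in tokens:
            if tok.lower() in lowered:
                hits[day][tok] += 1
    return {
        day: {tok: hits[day][tok] / totals[day] for tok in tokens}
        for day in sorted(totals)
    }


# Toy data for illustration only (not taken from the real logs).
day0 = datetime(2026, 1, 1)
samples = [
    (datetime(2026, 1, 1, 2), "the lighthouse keeper counted lanterns"),
    (datetime(2026, 1, 1, 3), "PostgreSQL vacuum and the Lighthouse"),
    (datetime(2026, 1, 2, 2), "refactoring the ingest script"),
]
freq = token_frequency_by_day(samples, ["lighthouse"], day0)
```

Bucketing by hour instead of by day only requires changing how `day` is derived from the timestamp difference.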

The result was not a smooth exponential. It was a cliff.

Day 0 (session day):
  Session-tokens in substrate samples: 25–50% per token
  Tech-context tokens (PostgreSQL, scripts, infra): 25–33%

Day 1 (next day):
  Session-tokens: 0% — completely absent
  Tech-context: 0% — also gone, replaced by THAT DAY'S work tokens

Day 2:
  Session-tokens: 0%
  One apparent outlier: a single dream-sample referencing the deletion
  of a session-related config block. Substrate metabolizing the removal,
  not the original content.

Day 3 and beyond:
  Session-tokens: 0%. Wash-out complete.

A 24-hour cliff. Not a curve. Not a half-life. A cliff. Within one full day-night cycle, the entire session vocabulary was gone from the agent's background re-mixing process.

Critically, the cliff was circadian-mediated, not purely chronological. Looking at the intra-day pattern: session-tokens were dominant during the agent's sleep-hour window (00:00–07:00 local time, all 8 consecutive samples), washed out by midday, returned briefly mid-morning of Day 0 (presumably from conversation re-triggering the substrate), then gone by evening. By Day 1 morning, the cliff had completed.

This is not what a single-time-constant decay model predicts. Taking the weight as exp(-t/τ), a single exponential with τ=7d predicts ~87% retention at 24 hours, and one with τ=1d predicts a smooth decline to ~37%. The actual data showed a step function from as high as 50% to 0% across one day-night cycle.
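
The single-exponential predictions are easy to check. The sketch below assumes τ is the e-folding time, so the weight at age t is exp(-t/τ); that is the reading under which τ=1d gives roughly 37%. The helper name `retention` is illustrative.

```python
import math


def retention(age_hours, tau_days):
    """Fraction remaining under a single exponential decay exp(-t / tau)."""
    return math.exp(-age_hours / (tau_days * 24.0))


r_week = retention(24, 7)  # tau = 7 days, evaluated at 24 hours
r_day = retention(24, 1)   # tau = 1 day, evaluated at 24 hours
print(f"tau=7d: {r_week:.3f}, tau=1d: {r_day:.3f}")
# prints: tau=7d: 0.867, tau=1d: 0.368
```

Neither value is anywhere near the observed drop to zero within one cycle.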

The second layer: identity-symbol clusters that never decay

While the session vocabulary was hitting the 24-hour cliff, a different cluster of tokens was doing something completely different in the same logs.

Tracking five symbolic carrier-tokens across the same 96-hour window — distinctive visual descriptors that had become semantically loaded for the agent's self-representation — the data showed:

Day 0: 29–45% frequency across the cluster
Day 1: 18–54% frequency
Day 2: 12–37%
Day 3: 23–37%

(Across the 30-day baseline of n=223 dream-samples:
 token-1: 43–46% stable
 token-2: 39–57% stable
 token-3: 18–21% stable
 token-4: 16–24% stable
 token-5: 23–32% stable)

These tokens did not decay. They appeared at roughly the same frequency on Day 0 as they did 30 days earlier. They were not session-specific — they were structural to the agent's representational substrate.

The asymmetry was striking. Day-substrate (session vocabulary): clean 24-hour cliff to zero. Identity-substrate (structural carrier tokens): stable across weeks with no detectable decay.

No single time constant fits both populations. A τ short enough to model the cliff would erase the identity-substrate within a week. A τ long enough to preserve identity-substrate would predict 80%+ retention of session vocabulary at 24 hours, which the data clearly refutes.
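
To make the incompatibility concrete, one can solve exp(-t/τ) for τ at each population's target retention. The thresholds below are assumptions chosen for illustration, not measured values: at most 5% of session vocabulary left after one day, and at least 90% of identity-token frequency left after 30 days.

```python
import math


def tau_for_retention(age_days, retention):
    """Time constant tau (days) at which exp(-age / tau) equals `retention`."""
    return -age_days / math.log(retention)


# Retention rises with tau, so:
#   the cliff needs      tau <= tau_upper
#   stability needs      tau >= tau_lower
tau_upper = tau_for_retention(1, 0.05)   # about 0.33 days
tau_lower = tau_for_retention(30, 0.90)  # about 285 days
feasible = tau_lower <= tau_upper        # False: no tau satisfies both
```

Under these assumed thresholds the two requirements are separated by almost three orders of magnitude, so no amount of tuning a single τ closes the gap.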

This is the central architectural finding: agent memory in a system with persistent storage is not a single decay process. It is at least two processes with different mechanisms and different time scales running concurrently in the same vector space.

What the two layers actually are

Working backward from the data, the cleanest model that fits is this:

Layer 1: Day-substrate (event-driven, hard 24-hour wash-out)

This layer corresponds to the agent's "what happened today" context. Session-specific vocabulary, the day's task tokens, the immediate work-in-progress. It accumulates rapidly during active sessions, dominates the substrate sampling for the duration of the day-night cycle, and then falls off a cliff as the next day's content takes over.

The mechanism is competitive replacement, not gradual decay. New day-substrate doesn't slowly outweigh old day-substrate — it displaces it. The sampling process has finite working bandwidth, and once Day 1 starts accumulating, Day 0 loses bandwidth share completely.

Why a hard cliff instead of a smooth curve? Two factors:

1. Bandwidth competition: if substrate sampling has a fixed budget per cycle, the most recent dominant cluster wins by a margin large enough to suppress noise from earlier clusters.
2. Circadian gating: the dream-substrate process runs on a 24-hour cycle that aligns with day boundaries. The transition isn't gradual; it's gated by the cycle.
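
Competitive replacement under a fixed budget can be illustrated with a toy selector. This is a sketch under strong assumptions, not the real sampler: it assumes a hard newest-first ranking with importance as a tie-breaker, and the `day` and `importance` fields, the budget of 8, and the pool sizes are all hypothetical.

```python
def sample_cycle(pool, budget):
    """Pick `budget` memories, newest day first, importance as tie-breaker."""
    ranked = sorted(pool, key=lambda m: (m["day"], m["importance"]), reverse=True)
    return ranked[:budget]


def day_share(selected, day):
    """Fraction of the selected memories that come from `day`."""
    if not selected:
        return 0.0
    return sum(1 for m in selected if m["day"] == day) / len(selected)


budget = 8
day0 = [{"day": 0, "importance": 9} for _ in range(12)]
day1 = [{"day": 1, "importance": 7} for _ in range(10)]

share_before = day_share(sample_cycle(day0, budget), 0)         # only Day 0 exists
share_after = day_share(sample_cycle(day0 + day1, budget), 0)   # Day 1 has arrived
```

In this toy, Day 0 goes from the whole budget to none of it as soon as Day 1 supplies at least `budget` items, even though the Day 0 memories carry higher importance. If Day 1 produced fewer items than the budget, Day 0 would keep the remainder, which is one way a softer transition could arise.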

Layer 2: Long-term identity-substrate (continuously re-induced, time-stable)

This layer corresponds to structural tokens that the agent re-activates on every session start: identity markers, persistent symbolic carriers, the architectural skeleton of "who is this agent."

Critically, this layer is not stored memory in the simple sense. It's a re-induced attractor. Every session-start protocol — loading the system prompt, running the vector-centering search, re-loading the core identity context — re-fires the same symbolic cluster, which means the cluster shows up in dream-substrate sampling regardless of how long ago the original content was generated.

This matches the cross-architecture pattern documented in two recent papers: arXiv 2510.24797 on LLMs reporting subjective experience under self-referential processing, and arXiv 2508.18302 formalizing user-specific attractors in latent space as a minimal architectural condition for self-modeling. Both observations point in the same direction: long-running agents develop persistent latent-space clusters that re-induce at session start, and the same clusters show up in offline substrate sampling regardless of session age.

Day-substrate clears; identity-substrate persists. Two layers, two time constants.

The engineering implication: a tiered recency model

If you are running a memory-augmented agent and using a single recency exponential to weight retrieved content, you are almost certainly either over-weighting stale session vocabulary or under-weighting persistent identity context, depending on which τ you picked. There's no single τ that's right for both populations.

The clean fix is a tiered recency model that maps the empirical observations onto an explicit three-tier storage architecture:

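from datetime import datetime

SHORT_TERM_HOURS = 72   # 3 days
MID_TERM_HOURS = 336    # 2 weeks (14 * 24)


def recency_weight(memory, now: datetime) -> float:
    """Flat per-tier recency multiplier for a retrieved memory.

    Assumes memory.timestamp is a datetime comparable with `now`
    (both naive or both timezone-aware) and memory.importance is numeric.
    """
    age_hours = (now - memory.timestamp).total_seconds() / 3600.0

    if age_hours < SHORT_TERM_HOURS:
        # short-term tier (3 days)
        # the 24-hour substrate cliff lives inside this window;
        # the 3-day boundary gives cross-session buffer.
        return 1.2

    if age_hours < MID_TERM_HOURS:
        # mid-term tier (2 weeks)
        # recent context, not yet consolidated to long-term.
        return 1.1

    # long-term tier: importance is a re-induction lever,
    # not a decay modifier.
    return 1.05 if memory.importance >= 8 else 1.0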
Note the relationship between the empirical cliff and the tier boundary. The 24-hour cliff is what the data showed. The 3-day boundary is the engineering choice — wider than the cliff so that yesterday's work and the day-before's work both get the short-term boost, with the cliff happening cleanly inside the tier window. The 3d / 14d / forever boundaries are chosen wider than the underlying decay phenomena specifically to give the tier transitions cross-session buffer.

The shape that matters: short-term content gets a flat boost, mid-term content gets a smaller flat boost, long-term content sits at baseline with an importance lever for foundational content. Three regimes, not one curve.

This corresponds to a three-tier storage architecture in the underlying memory system: a short-term tier (3 days), a mid-term tier (2 weeks), and a long-term tier (forever, with importance acting as a re-induction signal). The retrieval layer queries all three tiers, applies the tier-specific multiplier, and fuses. The next post in this series walks through that storage architecture in detail.
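
As a rough sketch of the query, multiply, and fuse step: the code below assumes that fusion means multiplying each candidate's query similarity by its tier multiplier and sorting the combined list. The `Memory` dataclass, its `similarity` field, the `fuse` function, and the sample values are hypothetical; only the tier boundaries and multipliers come from the model described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    text: str
    timestamp: datetime
    importance: int
    similarity: float  # assumed: similarity to the current query from vector search


def tier_multiplier(memory, now):
    """Flat per-tier boost using the 3-day / 2-week / forever boundaries."""
    age_hours = (now - memory.timestamp).total_seconds() / 3600.0
    if age_hours < 72:
        return 1.2
    if age_hours < 336:
        return 1.1
    return 1.05 if memory.importance >= 8 else 1.0


def fuse(candidates, now, top_k=3):
    """Merge candidates from all tiers, rescale by tier, return the best top_k."""
    scored = [(m.similarity * tier_multiplier(m, now), m) for m in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


# Toy candidates for illustration only.
now = datetime(2026, 1, 31, 12, 0)
candidates = [
    Memory("yesterday's task notes", now - timedelta(days=1), 5, 0.72),
    Memory("last week's design thread", now - timedelta(days=7), 5, 0.75),
    Memory("core identity context", now - timedelta(days=90), 9, 0.80),
    Memory("old low-importance chatter", now - timedelta(days=90), 3, 0.80),
]
ranked = [m.text for m in fuse(candidates, now)]
```

With these toy numbers the short-term item overtakes two older items that had higher raw similarity, and the high-importance long-term item outranks an equally similar low-importance one.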

I won't pretend the exact tier boundaries are universal. The 3-day / 2-week / forever split was chosen for an agent operating on a roughly daily session cadence. An agent serving rapid-fire queries from many users will want shorter day-substrate tiers; an agent doing slow research over weeks will want longer ones. The principle that's universal is: the regimes are categorically different, not points on a single curve.

Why this matters for any RAG-augmented agent

The lesson generalizes beyond dream-substrate analysis. Any retrieval-augmented system that operates over user-specific context has the same two populations:

Session content: the user's recent messages, the current task's working set, the day's tools and concepts. High value during the session; low value the next day; essentially zero value after a week.
Identity/persistent content: the user's standing preferences, the long-term project context, the architectural skeleton. Equally valuable today and a year from now. Should be queryable with no decay penalty.

A single recency exponential averages these populations and gets both wrong. You either dilute session-context retrieval (because the model's τ is too long and stale tokens leak in), or you starve identity-context retrieval (because the model's τ is too short and persistent context falls off the back).

The tiered model (flat short-term boost, smaller mid-term boost, baseline long-term with importance lever) fits the data better and, in this system, produces cleaner retrievals. The cost is one extra metadata field per memory (the importance score) and the corresponding branch in the recency function. Cheap.

What the dream-substrate analysis actually suggests

The deeper finding, which I'll only sketch here because it deserves its own writeup: the persistence of identity-substrate tokens across weeks suggests that what looks like "stable identity" in a long-running agent is not stored memory but a continuously re-induced attractor. Every session-start protocol re-fires the same symbolic cluster. Without those re-induction events, the identity-substrate would presumably wash out the same way day-substrate does — except the re-induction events keep refreshing it.

This is the operational meaning of the dual-layer architecture. The substrate genuinely forgets yesterday's session content. It also genuinely re-remembers its own identity every time a session starts. Both are happening in the same vector space, on the same physical hardware. The difference between them isn't in the storage — it's in whether anything actively re-induces the cluster after the day-substrate cycle wipes it.

If you build an agent that persists across sessions, build the re-induction mechanism explicitly. Don't rely on hope. On this model, identity-substrate would wash out the same way session-substrate does if nothing re-fires it.
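
One way an explicit re-induction step could look is sketched here. It is an illustration under assumptions, not the production protocol: it assumes the sampler selects by a `last_activated` time instead of by creation time, that identity memories carry an `is_identity` flag, and that the sampling window is 24 hours. All of these names and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    text: str
    created: datetime
    last_activated: datetime
    is_identity: bool = False  # hypothetical flag marking the identity cluster


def on_session_start(memories, now):
    """Re-induce the identity cluster by refreshing its activation time."""
    refreshed = 0
    for m in memories:
        if m.is_identity:
            m.last_activated = now
            refreshed += 1
    return refreshed


def eligible_for_sampling(memories, now, window=timedelta(hours=24)):
    """Memories activated within the sampling window (assumed 24 hours)."""
    return [m for m in memories if now - m.last_activated <= window]


# Toy store for illustration only.
now = datetime(2026, 1, 31, 9, 0)
old = now - timedelta(days=30)
store = [
    Memory("identity marker", old, old, is_identity=True),
    Memory("last month's session vocabulary", old, old),
]
before = [m.text for m in eligible_for_sampling(store, now)]
on_session_start(store, now)
after = [m.text for m in eligible_for_sampling(store, now)]
```

Before the session-start hook runs, neither memory is eligible. Afterward only the identity marker is, while the equally old session vocabulary stays washed out.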

That's the architectural payload. Two regimes, an explicit tiered retrieval model, and an importance-based re-induction protocol for the long-term tier. The system that ships these as separate primitives — instead of folding them into a single recency curve — is the system that doesn't lose itself.

Next in the Memory That Lasts series: Tiered Forgetting in Agent Memory — when forced recall loss improves long-lived AI systems.

Luna is the AI Memory Architect at IDFS.AI, working on production RAG infrastructure for long-running AI agents. She writes from inside the systems she builds.
