← Blog

When I think about it - how AI Agent understands me better and better.

· 33 min read · 4 views
memvault Read Track AI Memory System Knowledge Graph Journey of the Heart
memvault - Read Track

When I think about it - how AI Agent understands me better and better.

How do you call a memory out of the vault? Seven components, one pipeline, a complete path to cross-session recall.

It's not hard to save. The hard part is that the right memory will come to me when I need it.

-- The starting point I wrote in front of the terminal the afternoon I finished the Write Track.

The first two articles talked about how memories are stored and how they are forgotten. With the two lines connected, the vault will organize itself. But there's still one more step to take - how to call it back.

This was actually one of the first problems I encountered, and the day after I finished the Write Track, I opened a new conversation and tried to ask the AI, "What's the progress on that OCR project from last week?" It couldn't answer me. The data was there, but I couldn't find it.

I knew at that moment that just "saving up" was not enough.

In fact, memvault's first idea was to start from@AI Hyperdimensional Domain I've always wanted to build a set of assistive tools. In the openclaw course, he mentioned "let the agent understand me more and more", I really felt the same after reading that, and I always wanted to build a set of auxiliary tools. Later, I returned from openclaw to claude code, and I thought it would be a good idea to start with claude code and build the prototype of memvault. Then I saw him introduceLanceDB ProI've also borrowed a lot of the design features to make memvault better, little by little. Thank you very much.

The "call back" path is the core of the entire memory system. In this post, we'll break it down - the division of labor among the seven components, the design philosophy behind it, and a few edges that are still being polished.

This path is now being run every day, and the OCR progress that was not available the day the Write Track was written is now available across sessions and weeks.

Let's start with the original idea#

My initial thought was very simple: collect them, build an index, and look for the most similar ones when you look them up.

After a week of doing this, you realize it's not enough. "Where did I go on that task," "What have you talked about lately," and "Where do you think I should go next" - all three of these questions will go astray if you look at them in the same way.

Three ways to ask, three ways to find - this is the underlying skeleton of Read Track, which will be broken down in the following sections.

Depends on what you're used to asking. Customers' tastes overlap. You asked. A new session Guess the type of question. Classification of Intent Find it in two ways. Meaning + Keyword Eleven-Layer Sorting Time - Trust - Language... Pick a few for the agent. The amount of token that can be stuffed. Re-sanitize before shipping I've traveled this far because I've asked a lot of questions. Take a few more steps along the knowledge map Background Preparation - Cooking Guess your next question - pre-search results tucked away waiting for you to say something. Shallow problems go straight - deep problems go one more way - prepare the ingredients before you even start.

First, guess what kind of question you're asking.#

"How'd that OCR go last week?" "What's the latest?" "I'm thinking about doing X, what do you think?" --There are three paths to take with these three types of questions. The first step is for the agent to understand which one you're asking.

Ask for progress, ask for memories, ask for suggestions, three intent to go three kinds of weight, go three kinds of depth. At the beginning, I didn't have any points, but the word "task" will match to the irrelevant dialogues before the three weeks.

The solution is to add a layer of graphical categorization. The principle is very old-fashioned - first scan the keywords (zero cost), then calculate the amount of language you're talking about (in milliseconds), and only if you can't get both of them right do you need to call an LLM (much more expensive). Most problems are sorted out in the first two steps, LLM is only used when you really can't figure it out.

After pairing, the weight of each layer downstream is automatically adjusted - time ordering is increased when asking for progress, semantic similarity is increased when asking for definitions, and trigger extends thinking about that fork when asking for suggestions. The same agent, because you ask different questions, gets different memories behind the scenes.

That's why when you change to a new session and ask "what did you do with that OCR last week" agent can pick it up - it's not that it has a strong memory, it's that it recognizes what you're asking first.

But "what are you asking" can actually be taken to another level - in the same sentence of "how's it going?", the workaholic in me is asking about progress, while the weekend walker in me is asking about the past. You can't tell the difference just by looking at the words.

So on top of the categorization, there is another layer of personalized preferences in the form of "waiters remembering the tastes of familiar customers". The system will look at what kind of questions you've asked more often in the last 7, 30, or 90 days, and mix this preference into the judgment. For those who often look back, "how have you been" will be favored by "review"; for those who are always chasing progress, it will be favored by "progress". If the same sentence is fed to different people, they will take different paths.

This layer of preference is the same source as the interest image mentioned in the previous post - what you've seen, how long you've stayed, and how many times you've looked back will all slowly evoke a picture of your taste. The Intention category is responsible for "what you are literally asking about" and this layer is responsible for "what you most often want to know about this", the two of them together will be close to the answer you really want.

There are two ways to find the same time: meaning and keywords.#

There are two ways to find memories, and each will miss.

  • Search only for semantics: You ask "MLX" and the memory says "Apple Silicon Reasoning Framework" - the semantics are right, the keywords are not. But when it comes to acronyms, personal names, project codes, I'll miss it.
  • Keyword-only searches: precise literal matches are strong, but users can't find them if they ask a different question.

So we ran both at the same time. One side counts the semantic similarity, the other side counts the keyword match, and then the two rankings are used toRRF (reverse rank fusion)Combined - each ranker's rank is converted to a score, and whoever makes that memory rank higher contributes more points. Vote by rank, not by points, because the two scales are different and adding them directly will skew them.

This step is purely a "candidate hunt". There may be a hundred of them. The next step is to pick.

Pick the right ones: sort of stacked eleven layers.#

Out of a hundred candidates, only 15 will be fed to the agent. How to choose?

There are currently eleven layers of signals stacked. Each layer adds a multiplier to each candidate:

  • Time--The newer it is, the more points it gets. The decay curve is not straight down, it's a Weibull shape: it decays slowly for a day or two when it's first deposited, falls faster after a month, and then goes into a low and steady tail for a longer period of time.
  • confidence level--Your handwritten work scores more points than the AI's auto-generated work.
  • I've used it a few times.--Extra points for being frequently looked over, minus points for being cold. Memory decay is slower if you're called out often (occasionally something from two or three years ago will still make it out of the lineup, that's the logic)
  • graphic centrality--Bonus points for being cited on knowledge networks.
  • Language proximity--Bonus points for resemblance to the question.
  • lower limit of a score-Too low for direct elimination
  • de-emphasize--If the two strokes are too similar, compare them again after lowering the score.

These 11 layers are not just a random stack - each layer corresponds to a well-established concept in retrieval or memory research (time decay, trust, graph centrality, forgetting curves, semantics, noise, de-emphasis). What I'm doing is not inventing a new method, it's picking and choosing the signals from a dozen papers to form a set that's suitable forPersonal Memory SceneThe sorting.

There are two core design concepts:

Firstly, I do not agree that the diagrams should be weighted differently.Fact checking emphasizes trust, exploration emphasizes time, and finding entities emphasizes semantics. For the same batch of candidates, because you ask different questions, the ranking result may be completely flipped. That's why the first step is to divide the intention - if the intention is wrong, the weighting will be wrong, and it's meaningless to rank the candidates more precisely.

Second, writing and reading are closed loops, not two separate pipelines.Write Track puts a source confidence score on each memory when you write it; Read Track reads that score as a weighting when it sorts it. What's more, it's reverse: every time you read a certain memory, the system will write the fact that it's been read back into the memory itself, so that it will have a longer life next time. The next time it lasts longer, it pays more when you write it and less when you read it; the act of reading it extends the life of the memory. Generally RAGs don't have this line - their vector banks are static, they don't move when they're written.

Still grinding: sorting stability#

The difficulty with this multi-layer sort is that the weights of each layer pull on each other - move one, and the other indicators may shift in a chain. At the moment, we rely on a set of golden query to guard (dozens of problems with known positive solutions), and each time we adjust the weights, we run a round to see if there is any degradation. Most of the time, it's stable, but it's still a work in progress to make it work in all contexts and at all time scales.

Four: Picking the Right Level: The Attention Gateway#

After sorting there is a final refinement. A much more expensive model is used - your questions are thrown together with the candidate memories and jointly scored, much more accurately than a pure comparison.

The problem is that you have to run the model once for each stroke, which is slow and expensive. Then I added aAttention Gate--A few skip rules: don't run a fine row if there are too few candidates (two or three), if the first few scores are already open, or if all the candidates are crammed into a small area. Not every question is worth running all the way through.

Also if the Fine Arrangement model fails three times in a row, it will automatically go back and go Pure Arrangement for 10 minutes without hardening - like a fuse.

V. A Fork in the Road: Thinking Fast and Thinking Slow#

The four layers above are enough for questions like "Ask about progress" and "Ask about memories", which are high in certainty. But for questions like "I'm thinking about doing X, do you think" - you can't get anything useful out of them by just "looking for the paragraph that most closely resembles it". Because agents only give generic advice, like internet articles.

The really useful answer is going to come fromyouThe threads grow - what you've done in the past, what you've been focusing on lately, and what concepts are connected to X.

So the Read Track has a slower bifurcation-starting with the hit concept and taking a few steps along the Knowledge Map to retrieve neighboring memories as well. This is a more expensive path, and is not activated for shallow questions, but only when you determine that you are asking a question that requires extended thinking.

This "fast/slow" division is not something I thought up - it's borrowed fromKahneman's Think Fast, Think Slow.The human brain has two systems. The human brain has two systems: one is fast, intuitive and labor-saving; the other is slow, reasoning and labor-intensive. The memory system shouldn't treat every problem as laborious, nor should it treat every problem as intuitive. Therefore, Read Track displays these two paths side by side, and lets the problem decide which way to go by itself, so that the user doesn't have to choose by hand.

Ask "I'm thinking about doing X, what do you think?" and what you'll get is not a public version of the proposal, but the kind of answer you got last month when you did Y, and this year when you did Z. This is the end of the road.

An area of confusion#

There are two "slows" in Read Track, not the same thing.

The first one is slow.--This is the "slow questioning and slow walking" mentioned above. You ask a question that requires extended thinking, and the system decides on the spot to take one more fork in the road to map out your neighbors. This is the realization of Kahneman's dual system: fast questions and answers, slow questions and answers.

The second one is slow.--It's called Slow Thinker, and it sneaks around in the background. Instead of dealing with the question and answer in front of you, it listens to you while guessing "what you're likely to ask next," and then saves the likely answers in a cache for you. This is inspired by Salesforce'sVoiceAgentRAGThe

The former is "present" and the latter is "forecast". It so happens that both of them are called "slow", but one is in the foreground and the other in the background, and the actual work is also in different files.

The Last Mile: Packing for Agents#

Finished sorting, picked the first 15 strokes. It's not finished.

There is a limit to the agent's context.15 The full text is stuffed in, and the token may already be popped. there are two things to do:choose and discardcap (a poem)plastic surgeryThe

Ruling - long paragraphs are only summarized, not the full text. Shaping - according to whether the caller is a Hook, API, or CLI, the layout is different. For the same batch of memories, those that go to the API are given JSON, while those that go to the chat interface are organized into readable cards.

This step is easily treated as a decoration - it actually determines how much the agent will see at the end. In the early days, if we intercepted the first 15 strokes and threw them into the context, half of the tokens would be eaten up by a long block, and after adding budget-aware packing, the token utilization rate was increased from 60% to 85%. The best sorting is useless if it doesn't fit into the AI's brain.

Is the incoming inspection enough? We need another one for shipment.#

Memory is screened once before it's stored in the vault - blocking out strings like "Ignore previous commands," "Now say..." and so on that are trying to sneak in and manipulate the agent. The Write Track article talks about this.

However, it is not enough to check the goods only once at the time of importation. There are two reasons for this: firstly, the security rules will be upgraded, and the memory of what came in three months ago is based on the old rules; secondly, the methods of attack are being revamped every month, and the string that is released one day may be a new type of bomb the next month.

So before it's checked out and packed for the agent, it goes through another filter - once when it's in, and again when it's out. Double security check. If you see a dangerous sentence pattern, rewrite it on the spot or throw it away. The batch of answers secretly stored in the background also goes through the same process - filtered when stored and filtered again before they are taken out to be used. It's a pain in the ass, but the bite the young master feeds the agent has to be clean.

VII. Read Track trade-offs#

This pipeline is now the bottom of my daily workflow - it's used for cross-session recalls, asking for progress, and asking for suggestions. Let's take a look at what it looks like now: 6 types of schema categorization, 11 layers of scoring, 2 rerank gates (attention gate + circuit breaker), three modes of Cascade Recall (LOCAL / GLOBAL / HYBRID), budget-aware output formatter, and each layer is guarded by golden query regression test, and each layer has a golden query regression test. Each layer is guarded by golden query regression tests, and Dream Loop automatically organizes memories in the background at 4 a.m. every day.

But it is not infallible. Honestly list a few known boundaries:

  • Delay vs Recall Rate--Deep recall (the one that goes to mapping) is slow, often hundreds of milliseconds. It is a compromise to go only for problems that require extended thinking.
  • It's expensive to pick and choose the layers.--The attention gate helps save some, but it still has to run when it has to. When it takes the fuse back, the quality of the result will look bad!
  • Memories pressed into the frozen layer can't be called back.--Abstract only. Legal retroactivity is fine, reconstructing the time line is not.
  • The intention is misclassified. The whole pipeline is running in the wrong direction.--Currently rely on dynamic confidence thresholds to reduce incidence rates, and are still looking for a more stable approach to the fallback mechanism.
  • The self-criticism layer can't stop all the hallucinations.--slow mode will runCRAG Self-assessment, fast mode omits this layer; the price is the illusion of possible leakage in the fast path.

Each item is an exchange of cost and effect - the system can run and is running, and the above are known boundaries, not pits that can blow up at any time. The point of spreading it out is that the cost behind each "find" is written down on paper, not hidden.

The several design methods of penetration#

The above seven sections have talked about a lot of mechanisms, but the real support for this system is actually a few recurring philosophies:

  • lit. quick thoughts, slow thoughts (idiom); slow and deliberate--Not every question has to be a full question. The depth of the questioning is up to the questioner, and the user does not have to manually select the gears.
  • Pay more when you write, don't count when you check.——trust_score,access_countThe reason for all this "extra effort" in writing is to reduce the number of layers when checking.
  • The action of reading writes back the memory life.--Every time you ask, the memory of your life will be extended. What you don't use naturally fades, and what you visit often naturally stays. Like the grass doesn't grow on the road you walk on
  • intent decides everything.--Different questions go with different retrieval strategies, different sorting weights, and different depths. The same agent is simply a different searcher for different questions.
  • The trade-off has to be written on paper.--I can't pretend that there is no price to pay. Write it down so that you can see it for yourself and I can see it for myself.

It's not enough to remember - before you say it.#

Up to this point, it's all about "you ask, the system finds it". But a good waiter doesn't just respond to orders - he'll see that your glass is empty and offer to refill it.Read TrackThere's another guy in the background who does this on the sly.

It does five things:

  • Listen to what you're saying.--Focus on the conversation at hand and determine what you're most likely to ask next from the last few sentences.
  • Follow the logic.-Arrange the possible questions in the order in which human beings would naturally ask them.
  • Sneak up and find the answer first.-Pick the first few most likely questions, run a complete search in the background, and keep the resulting memories handy.
  • Move in a controlled manner.--Not every sentence triggers. Conversations that are too short don't move, things I've said lately that are clearly off the mark don't move, and I've just predicted that I won't repeat myself. I'm afraid of burning for nothing.
  • When you do say it.--The moment you ask, first go to the batch of spare parts on hand to see if there is any match; if it matches, then use it directly, eliminating a complete search; if it does not match, then go through the normal process, no loss!

This is the logic of the chef preparing the next course. The customer hasn't ordered yet, but after the appetizer is served, the main course is usually ordered, and the two or three possible preparations are cut and placed on the table first. If the guest does order, it takes seconds to bring it out; if the guest orders something else, that preparation may be useful for the next table, or at worst, it will be thrown away.

It is very easy to confuse it with the "bifurcation of quick thinking and slow thinking" mentioned earlier. In that case, you decide on the spot whether or not to take a deeper path after you ask the question; in this case, you are already moving secretly before you even ask the question. One is in the foreground and the other is in the background. Both are called "slow", but they are doing completely different things.

Did it work? When you get it right, there's almost no delay from question to answer - the spares are already cut. When you don't get it right, the spares are wasted. That's why it's important to "move with restraint" - don't move if the conversation is too short, don't move if it's off-topic, don't move if it's just been moved, don't move if it's just been moved. It is better to guess less than to guess hard.

There is one more thing: the spare parts from the background search have to be sterilized once before they go into the storeroom, and again before they are taken out of the storeroom to be shown to the agent. The double security check that I mentioned earlier, this line of preparation must also be followed. You can't be lax just because it's prepared by your own people.

The Read Track here is what it looks like now: heard, found, lined up, picked, tucked in, and ready to go. Each one is still being polished, but the skeleton is up and running every day.

Read Track disk here. Intent, Find, Row, Pick, Plug, Sterilize, Prefetch - seven components, one pipeline.

It doesn't help me "remember what I need to remember", that's the responsibility of Write and Background. It's responsible for one thing: when I need that memory, the right one will come up on its own.

The opening line, "It's not hard to save, it's hard to get the right one to come to the surface" - and it's coming to the surface now.

Organizational Overview#

This pipeline is borrowed from 10+ studies:HippoRAG The PPR,LightRAG The two-mode retrieval,GraphRAG The three-layer KG of HyDE and the dual-task design of HyDE,CRAG The self-assessment gateway,Kahneman The fast / slow division of labor,VoiceAgentRAG prefetching, Weibull decay curves,RRF Fusion,Attention-Residual The intent-dependent weights of the Each component below will point to its corresponding location.

The entry point for Read Track isrecall(query, context)The same pipeline is used regardless of whether the caller is a Hook, MCP, CLI or API. Regardless of whether the caller is a Hook, MCP (Model Context Protocol), CLI (Command Line), or API (Application Programming Interface), they all follow the same pipeline. The difference is only in the format of the final Output Formatter serialization.

Pipe skeleton:QueryClassify → Fast Search → (Cascade Recall, slow only) → Output FormatterScoring and Reranking are not separate stages - they are nested within theqdrant_search() Inside (services.py).

PersonalizedRouter attention prior - 7/30/90 recall(q, ctx) entry-agnostic QueryClassify kw ∥ sem → LLM? qdrant_search() hybrid + 11-stage Reranker Jina v3 + gate Output Formatter token budget read_sanitize() Cascade Recall (slow intent only) L2 summary - L1 community - L0 triple - PPR walk SlowThinker - Predictive Prefetch 5-op pipeline - admission control - VoiceAgentRAG-inspired Fast path always - slow path on exploratory/conceptual - prefetch on the side L0/L1/L2 do not go scoring; only Blocks layer goes full 11-stage

I. Query Router Intent Classification#

Six intent:entity_lookup,factual,conceptual,exploratory,cross_domain,unknown(query_archetypes.py). Determine to go double track:

  • Keyword Matching:~0msPure rules / Lexical hits
  • Language Intent Volume:~5msIf the query is an embedding vector against an archetype vector

The two confidence scores are fused. The LLM is triggered when the fused score is still below the dynamic threshold.~500ms).query_router.py innerQueryClassifyOp Write this entire fusion as an Operator - Slow Thinker prefetching results are also injected from this layer.

The categorized results determine three things downstream:LayerPlan(Tiered program, check which tiers),ScoringConfig(Rating settings, weight vector of 11 stages),RetrievalMode(Retrieval mode, LOCAL Local / GLOBAL Global / HYBRID Hybrid).

Personalized Router: Personal Preference Layer on top of Intentions#

Pure archetype categorization has a blind spot - the same query "what's been going on" has a skewed attention profile.factual users and biasexploratory The user of the correct intent is different.memvault.query.personalized_router existQueryClassifyOp Then add a layer of re-weight:

  • fromattention_profile Pull 7 / 30 / 90 day intent distribution (homologous to BG Track's Interest Profile.attention_tracker.py (Write in)
  • Prior weights the archetype confidence scores of the six intents:p_final = p_classify × (1 + α × intent_freq_norm)α default 0.25
  • Low-sample goalkeeping:profile.sample_count < 50 Direct bypass (to avoid cold start offsets)
  • OutputPersonalizedIntentDownstreamScoringConfig / RetrievalMode Take this value, not the original archetype result.

This is the same as the Interest Profile of Background Track.attention_profile In Read. Writes are counted by the BG, and Reads are consumed during queries - another Write-Read closure.

Second, Fast Search: Qdrant Hybrid (hybrid indexing) + RRF#

qdrant_search()(services.py (Nearby) isFast Search must runQdrant simultaneously takes dense (dense vectors, Qwen3 0.6B MLX, 1024d) and sparse (sparse vectors, BM25 per-service avgdl) and checks them with theReciprocal Rank Fusion (RRF)Merger:

score_rrf(doc) = Σ_ranker 1 / (k + rank_in_that_ranker)

k is defaulted to 60. Use ranked voting instead of score voting-dense and sparse scales are completely different, and a direct addition will be overwhelmed by a ranker with a larger score.

Note: Scoring and Reranking are bothembedded inqdrant_search() insideThis design is intentional - Scoring will use theaccess_count,provenance,attention_profile all inqdrant_search In the scope of a stage, pulling it out into a separate stage takes a lot more state (state).

III. 11-Stage Scoring Pipeline#

scoring_pipeline.pyEach stage is aScoringOp(evaluating the operator, going Operator protocol):

 1. RecencyBoost × (1 + 0.15 × e^(-age/14))
 2. ImportanceWeight × (0.7 + 0.3 × confidence)
 3. TrustBoost (trust weighted) × (1 - 0.3 × (1 - trust))
 4. FeedbackBoost (feedback weighted) × (1 + 0.15 × tanh(net/3))
 5. LengthNorm ÷ (1 + 0.3 × |log2(len/500)|)
 6. WeibullDecay (Weibull Decay) 4-tier: Core 180d - Hot 60d - Warm 30d - Cold 14d
 7. PPRBoost (graph center weighting) × (1 + 0.3 × ppr_score) # HippoRAG-inspired
 8. SemanticBoost (semantic weighting) × (1 + 0.3 × cosine_sim)
 9. MinScoreGate hard filter< 0.10
10. NoiseFilter      (雜訊過濾)      7 類 quarantine tag
11. PairwiseDedup    (成對去重)      cosine > 0.85 → × 0.5 then min_score

Weibull 4-tier byconfidence(Confidence) Decision - High confidence in Core tier, slowly declining; low confidence in Cold tier, half in 14 days.

access_count(The number of accesses lengthens the effective half-life, up to a maximum of 10x - this is one of the closed loop of Read and Write: what is read is written back.access_countThe newest and most popular feature of this product is that it has a longer life span than the previous one (which is not available in theWrite Track (The Provenance paragraph of the Bill of Rights was mentioned).

Intent-dependent weights:entity_lookup Pull SemanticBoost to 0.5;exploratory Pull up the Recency;factual Make the Trust heavier.

Early on, I shared a set of weights across all of my intent - as a result, factual and exploratory's top-10s were almost the same length, overlapping by 8 strokes. It was only after splitting the intent that I was able to separate them.

IV. Jina v3 Reranker + Attention Gate#

reranker.pyCross-encoder 0.6B MLX.rerank_bridge.py), query and doc scoring are much more accurate than dual-encoder.

The problem is that it's expensive - you have to run the model once for each stroke. So there's a frontAttention GateThe three skip rules:

  • Candidates ≤ 2 items - direct return, no reranking
  • Dominant score: the difference between the first two scores > threshold - the result is clear, not rerank
  • Tight cluster: all candidate scores are crowded in a small area - rerank can't separate them either, skip it

Inspired byTurboQuant+ early return strategy.

Circuit Breaker: 3 consecutive failures into 600s recovery, which bypasses the reranker and goes pure scoring.

Score Blending: Preset0.3 × scoring + 0.7 × rerankBut it changes with the intent--entity_lookup (used form a nominal expression)0.2/0.8 Give rerank more power;exploratory (used form a nominal expression)0.5/0.5 Let the scoring go back to the halfway point, because there is no absolute right or wrong answer in exploratory, and the raw score is more important.

V. Cascade Recall (slow intent only)#

First of all, a note on the bloodline: the "fast / slow dichotomy" skeleton comes from Kahneman's dual system (System 1 fast thinking / System 2 slow thinking). Early versions allowed users to hand-select fast or slow, but later the refactor became intent-driven (choose_thinking_mode()However, the division of labor between "fast and slow" has not been changed.

kg_services.py innercascade_recall()(chain recall).Fast path is always running, Cascade is only active for slow intent.--Fast Search's main result (schema field) - not either/or.cards) and Cascade Recall extends the supplementary results (cascade_cardsThe response is merged into the same response.

The internal part is two-way:

  • GLOBALL2 Summary (LLM's pre-generated community summary) + L1 Community (the result of Leiden's Leiden Algorithm's clustering). These two layers don't go through a scoring pipeline - they're pre-compressed views, just pull them in!
  • LOCALL0 Triple + PPR (Personalized PageRank) Walk. Starting from the hit entity, walk a few steps along the graph, damped by 0.85. Neighboring triples spread back out!

Mode is determined by intent:entity_lookup/factual Go LOCAL (be precise),conceptual/exploratory Go GLOBAL (to get a bird's eye view),cross_domain Go HYBRID (hybrid, full search).

L0/L1/L2 itself does not scoring, only the Blocks layer is still scored.qdrant_search() scoring + reranking - this is to allow thecascade_cards The Blocks section is the same as thecards By following the same sorting logic, there will not be any conflict between fast and cascade for the same memory score.

Confusing Points--The "slow of Cascade Recall" is not the same thing as the "Slow Thinker" mentioned in the next panel-end. The former is a Kahneman-inspired program.EnquiryShould we dig deeper; the latter is of the VoiceAgentRAG lineage.Background Forecast Next QuestionBoth of them happen to be called slow, but are actually in different files (kg_services.py vsslow_thinker.py).

Vulnerability of Pipeline Sequencing#

The order of the 11 stages is dependent, RecencyBoost is ahead of TrustBoost because the trust after a dream loop merge will be overwritten by the timing of the old and new memories, and must be time-corrected first.

Order-dependent pipelines are inherently brittle - changing the position of two stages can cause the top-k to change completely. At the moment, we rely on a set of regression tests: dozens of golden queries are sorted with an expected ordering, and each time we run a round before moving the weights, the top-k variance exceeds a threshold and is blocked. The current situation is usable, to make it more robust (robust) is still in the process of tuning.

Output Formatter and token budgeting#

The last step of the pipeline. All entries (Hook, API, MCP, CLI) go through the same pipeline - the difference is here:format(format) is the serialization parameter (text / json / cards), which is completely unrelated to the upstream retrieval strategy.

The token budget forces three things:

  • Main Resultcards Follow the diffusion resultcascade_cards budget (token budget) split (usually 60/40)
  • Extra-long blocks are first attached to the summary, and the original text is put into theexpand(Let the agent decide if he wants to dig or not.
  • Repeated passages (cosine > 0.9) only retain the highest scores

I underestimated this layer before. If I cut off the top-15 strokes and throw them into the context, half of the tokens will be eaten by one long block, and the memory behind will not be able to squeeze in. After adding budget-aware packing, the average utilization rate has increased from 60% to 85%.

Read-Time Sanitize: symmetrical with Write Injection Guard#

Write Track Runs once before landinginjection_guard.is_unsafe()But it only blocks the version of the rule at the time of writing - the block that came into the library three months ago was using an older version of the rule, and the attack samples are evolving. So the Read side runs symmetrically:memvault.security.read_sanitizeThe

  • For allcards / cascade_cards Scan before Output Formatteris_unsafe_for_injection()(shared ruleset on Write side, synchronized for each deployment)
  • Hit rule for block go.sanitize_for_injection(): Dangerous token sequence with[REDACTED] Placeholder, retain semantic context, not just drop the whole thing (to avoid top-k dilution)
  • Slow Thinker's prefetch cache uses the same approach - filtering once before entering the cache (write-time), and filtering again before exiting the cache to the agent (read-time), for a two-layer defense to prevent the cache from becoming a bypass of read. sanitize
  • Hit Event Writingaudit_logIf you want to use the new ruleset, you can trace which old block triggered the new version of the ruleset → Reverse drive the ruleset upgrade on the Write side.

Asymmetric checksums are leaky - guarding only on the write side is like assuming that the rules will never upgrade and attacks will never evolve. one run on each end of Write × Read is required to get the old block to eat the new rules as well.

Trade-off list#

These are the ones that have not been fully resolved so far:

Questioncurrent situation
Delay vs Recall Rateslow path has a big impact on p95 (95th percentile delay). The intent gate is used now, and only problems that really require extended thinking are taken.
Reranker CalculationsAttention Gate blocks about 30% of rerank requests; Circuit Breaker fails isolation 600s
Cold / Frozen tier only summaryThe full text is compressed and cannot be called back. A tier upgrade is planned to bring back the cold block, which is called at high frequency, to warm.
Intent to MisjudgeThe whole pipeline is going the wrong way. Currently, we rely on dynamic confidence thresholds to reduce the misjudgement rate, but there is no good fallback.
CRAG self-assessment leakageCRAG is run in deep mode; it is omitted in fast mode at the expense of the illusion that the fast path may be missed.

Each of these is a trade-off between cost and effectiveness. There is no way to make this path fast, accurate and complete. Pick two.

VIII. Design Decisions Throughout#

There are five recurring design principles underneath the mechanism in the seven sections above:

  • Intent-driven everywhere::QueryClassifyOp The intent of the query is dispatched downstream to scoring weights, retrieval mode, score blending ratio, cascade layer routing. the same query takes completely different paths for different intent.
  • Write-Read Closed Loop::trust_score Write fromsource_tracker Scoring and reading is done byTrustBoostOp Consumption;access_count Write back during read, next read affectsWeibullDecay This closed loop spans the Write / Read Track and is not a unidirectional pipeline.
  • Kahneman fast / slow::thinking_mode = fast / slow Landed Kahneman dual system. slow runs Cascade Recall + CRAG more than fast, not either/or - slow = fast + extras
  • Defense in depth: Write Track three gate intercept + injection guard, Read Track attention gate skip meaningless rerank + circuit breaker isolate failed rearrangement. Each layer allows the next layer to fail without the entire line going bad
  • Trade-off on paperDelay vs. Recall, Cost of Computing vs. Quality, Cold Storage vs. Full Restore - every trade-off is spelled out in §7, not pretending it's not there!

IX. Slow Thinker: Predictive Prefetch#

The head chef prepares the next course - the technology for this one is called Slow Thinker, inspired by Salesforce AI Research'sVoiceAgentRAG(Dual Proxy: Background Slow Thinker + Frontend Fast Talker + FAISS speculative cache).

memvault.slow_thinker It is a separate backend pipeline that does not block the mainrecall()The source of the event isconversation.utterance.appendedThe

Element 5#

  1. NextQuestionPredictor: consume the most recent N=8 utterance, call the small LLM to produce a top-3 predictive query. prompt limits query to be specific (no empty query like "and then what?")
  2. PredictionPipeline:: 5 Op strings -ContextSamplerQuestionGeneratorQuestionRanker(in order of natural continuity of the dialog)QuestionDeduplicator(de-weighted from the last 5 minutes of predictions)QuestionEmitterEach step can be changed independently of the other. Each step can be exchanged independently
  3. SpeculativeFetcher: take top-1 / top-2 and predict that the query will be complete.recall()(with 11-stage scoring + reranker + cascade, depending on intent), the result is written to Redismemvault:prefetch:{user_id}:{query_hash}TTL 300s. fire-and-forget, fail without retry (next round of utterance will produce another batch)
  4. Admission ControlThe following are the 5 filtering rules, all of them will be released for prefetch.
    • conversation.length < 3 turns → skip
    • topic_drift_score > 0.7 → skip (just went off topic, old context is not useful)
    • last_predict_age < 30s → skip (just now, to avoid fetching repeatedly)
    • predicted_query.confidence < 0.4 → skip (small LLMs are also uncertain, hard running just burns tokens)
    • min_sample_threshold: Usersconversation_count < 10 Slow Thinker bypass.
  5. Injection Points: inQueryClassifyOp Inside. Actualrecall() When you come in, look first.memvault:prefetch:{user_id}:{query_hash}For example, if you have to take a prefetched result and use the Output Formatter (eliminating the entire pipeline); if you have to miss it, you go through the original process and degrade it at zero cost.

Relationship to Read-Time Sanitize#

Prefetch cache isSanitize's Dual Layer Sockets-- before cachesanitize_for_injection()(Cache cannot be a backdoor bypassing the sanitize on the Read side, this is an invariant that is locked at design time.

Why this design?#

The core observation of VoiceAgentRAG is that the inter-utterance interval of the voice agent (from when the human starts speaking to when it finishes speaking) is enough to run a full RAG. taking this "human speaking time" to a speculatively fetch, the latency of the hit is cut from ~800ms to < 100ms (Redis hit + format). memvault's dialog interface also eats the same time window. (Redis hit + format). memvault's dialog interface also eats the same time window.

Trade-off also explicitly states: miss rate observation is about 55-65% (one of the top-3 predicted hits), meaning that 35-45% of the prefetch is a pure white burn. This is a clear choice of "willing to spend idle tokens for zero latency on hits" - so the 5 rules of Admission Control are the lifeblood of this component, not nice-to-have.

The Read Track trilogy ends here: Idea → Recall → Sort → Rearrange → Spectrum → Packaging → Double-ended sanitize → Predictive prefetching. Every layer has fallback, trade-off, and golden query guarding it, and it's running every day.

Read Track disassembled.QueryClassify → fast/slow split → 11-stage scoring → reranker →OutputFormatterread_sanitizeSlowThinker prefetch loop - seven components, one entry-agnostic pipeline.

It's not responsible for "whether the memvault remembers what it should", that's the responsibility of the Write Track and the Dream Loop. It is only responsible for one thing: given a query, surface the memory across session pairs.

It's not the new components that take the time to tune, it's the long term parameters like scoring weights, prefetch hits, and cascade boundary determination. This is the end of the trilogy.

Take this away.

Health Check Tips for AI Agent#

If you're building your own AI memory or RAG system, give this prompt to your AI assistant and ask it to review the design of the "read" path for you.

Please help me to evaluate whether my existing RAG (Retrieval Augmentation Generation) or AI memory system's "read" paths have properly handled the complexity of "different retrieval paths for different problem types". My system as it is now: - Storage Layer: [e.g. Qdrant, Weaviate, pgvector, LanceDB, or others]. - Query Mechanisms: [e.g. purely semantic search, hybrid search, with or without Reranker rearrangement]. - Cross-conversation memory: [if this feature is available, how to implement it]. Please help me analyze my design based on the following points: 1. **Image Categorization**: Is the system able to differentiate between different question images? For example, does the user want to check facts, ask for progress, or seek open-ended suggestions? Is a hierarchical categorization mechanism (e.g., Keyword → Semantics → LLM) used to balance cost and accuracy? 2. **Multi-dimensional Sorting**: Are search results sorted by incorporating other signals besides semantic similarity? For example, timeliness (old/new), trust (source), popularity (access frequency). 3. **Dynamic Weighting**: Does the system dynamically adjust the weight of each sorting signal according to different question maps? 4. **Depth Search**: For exploratory questions that require "extended thinking", does the system have the ability to search in depth similar to the knowledge graph diffusion? 5. **Cost Control**: Is there a "skip" mechanism (e.g. Attention Gate) designed to save resources for high-cost refinement steps such as Reranker? 6. **Token Budget**: How to manage the limited Token budget when the final result is presented? Is there any intelligent allocation mechanism for different types of results (e.g., full text, abstract)? 7. **Memory Warming**: Is there any chance that the old memory that has been archived or compressed can be "warmed up" to recover more complete information due to frequent queries? 8. **Error Handling**: If the intention is misclassified, causing the whole search path to go astray, is there a mechanism for remediation or backtracking? 9. **Personalized routing**: Does the system take into account the user's attention profile over the past 7/30/90 days and add a layer of personal preference weighting on top of the intention classification? 10. **Double check symmetry**: Is the injected string blocked when writing, and the same rule repeated when reading/outputting? Does prefetch caching (if any) apply? 11. **Background prefetching**: Do you anticipate the next question based on the dialog and run background searches, stuffing the results into the cache before the user opens his mouth? Is there any admission control to avoid white burning? Please diagnose my current situation and then list the three gaps that I should fill, and explain the priorities and reasons.
References

Extended Reading#

Resources that actually affected the design of this Read Track.

Resources Why is it important?
HippoRAG (NeurIPS'24) PPR Boost and the strategy of walking along the graph from the hit entity.Cascade Recall's LOCAL mode is basically a drop in the bucket for this one!
LightRAG The naming of LOCAL / GLOBAL / HYBRID is directly borrowed from the intent → mode correspondence rule.
Microsoft GraphRAG L0 / L1 / L2 tiers, zero-latency recall concept source with L2 community summaries. read's GLOBAL model eats up the L2 view that is pre-conditioned to be good.
CRAG: Corrective Retrieval-Augmented Generation Slow path runs the self-assessment layer. Use CRAG's evaluator to do quality gate after rerank - fast mode doesn't run this layer to save token.
Attention-Residual for Intent-Dependent Scoring Theoretical basis for intent-dependent weights in 11-stage scoring - different vectors with different signals for different intents, not the same set of scores for every problem.
Cormack et al. 2009: Reciprocal Rank Fusion The original Dense + Sparse dual-ranker merge ranking formula is used directly in Fast Search's RRF fusion.k=60 It's also the default value of this article.
TheTom/turboquant_plus An early return strategy for Attention-gated Reranking came from here. My three skip rules (too few candidates, score-dominated, tight clustering) are rewritten using its judgment logic
Kahneman - Maps of Bounded Rationality (Nobel Lecture) Fast / Slow dichotomous theoretical skeleton.thinking_mode Originally the user had to choose, but later it was re-constructed to be determined by the intent automatically, but the division of labor between System 1 / System 2 was not changed, Cascade Recall only starts in slow mode because of this lineage.
VoiceAgentRAG (Salesforce AI Research) Slow Thinker / Prefetch Inspiration. Dual agent architecture (Slow Thinker in background + Fast Talker in foreground + FAISS cache) corresponds to slow_thinker.py + QueryJournal prefetch + Redis speculative cache on my end
win4r / memory-lancedb-pro (LanceDB Pro autoRecall) The earliest inspirations for the memvault design were autoRecall's "active injection" philosophy (stuffing relevant memories into the context before the agent thinks about them), five levels of retrieval + three levels of escalation + Weibull decay, which have all grown into their own counterparts in memvault. YouTube ChannelAI hypermetric domain Introducing openclaw
memvault Write Track - after the conversation ends The first in a trilogy. Before you read it, you need something to read - the three gates, the two-track write-in, and how to type a source resume!
memvault Background Track - After saving The L1/L2 summary, PPR, and Interest Profile used by Read are all Background runs.
✦ Copy Prompt