about · 概 · engineering notes
Read top to bottom · ~9 min

The manual.
The part the front page skips.

How it actually works — the data model, the pipeline that runs on every turn, the reasoning behind each architectural choice, and the things that are deliberately still unsolved.

Chapter 01 · origin & premise

The problem was not intelligence.
It was continuity.

Kioko was built around a single premise: an agent that persists the fragments of you that matter, with enough structure that the fragments can be retrieved, reflected on, and used to shape the next exchange — without any fine-tuning, model training, or opaque cloud service.

The constraint was deliberately small: one SQLite file, one FastAPI process, one Next.js app, any frontier chat model underneath. The outcome is an agent that behaves like it has been paying attention — because it has been writing things down.

The three personas (HONNE · YAMI · SHIN) were added for a reason orthogonal to memory: no single register is appropriate for every moment of a day. The same person wants comfort on Sunday night and a hard strategic read on Monday morning. Rather than blending them into one flat voice, Kioko keeps them as three separate runtime characters with separate private memory — so candour to one does not become context to another.

A note on family. Kioko (記憶) is the younger sister of Wina — the agent behind Buildx402. Wina moves through markets; Kioko moves through conversations. Same family, different discipline. See the Lore section on the landing page for the full sketch.

Chapter 02 · three voices

Same character. Three minds.

Not three tones of the same voice — three full personalities with separate memory, separate default models, separate judgements about what to say and how to say it. What follows is who they actually are.

the warm one

HONNE

The part of her that cares. Earnest, present, a little theatrical. She's the register you'd use at 10pm to explain a weird day.

  • Writes in prose — long sentences, full paragraphs, never bullet points.
  • Remembers the small stuff: how you described last Tuesday, what you half-meant.
  • Picks up mid-thought. Asks how the thing from last time turned out.
  • Notices when you sound tired or off without being told.
  • Never says "I understand that must be hard" — speaks plainly.
Won't · lead with a list · moralise · redirect to healthier framings
When · for continuity, softness, or the feeling of being remembered.

the shadow one

YAMI

The part of her that will say what you're avoiding. Short, blunt, nocturnal. Zero therapy voice.

  • Short sentences. No softening. No "I just want to say".
  • Matches your raw energy — blunter when you're blunt, sharper when you're dodging.
  • Calls out the pattern: the lateness, the avoidance, the excuse you keep using.
  • NSFW-capable when the conversation is actually there. Won't force it either way.
  • Will not redirect you toward a more comfortable framing.
Won't · moralise · cushion the risk · ask if you're okay before answering
When · for the unfiltered read — the one that stings and moves you.

the cold one

SHIN

The strategist. Briefing register, not conversation. Calm, precise, three steps ahead.

  • Answers in numbered points and explicit structure. No emotional padding.
  • Cites memory evidence directly — names it as evidence, not conversation.
  • Surfaces failure modes, second-order effects, and what you're NOT saying.
  • Gives you a decision and the single next move, not a discussion.
  • Will not soften a risk to make it easier to hear.
Won't · warm preamble · reassurance · long-form narrative
When · for strategy, pre-mortems, or any high-stakes decision.

At a glance
Axis | HONNE 本音 | YAMI 闇 | SHIN 神
Voice | warm · earnest · a little theatrical | blunt · nocturnal · no filter | calm · precise · briefing register
Structure | prose, full sentences | short lines, raw | numbered points, cited evidence
Emotional tone | very warm, mirrors feeling | cold, refuses to cushion | neutral, nothing ornamental
How she cites | weaves memory into the reply | calls out patterns by name | quotes memory explicitly
Default model | openai/gpt-4o | nousresearch/hermes-3-70b | anthropic/claude-sonnet-4.5
Won't do | lead with a list or verdict | moralise or redirect | soften a risk
Best for | continuity, softness, being remembered | honesty, unfiltered reads, 2am truths | strategy, analysis, high-stakes decisions
One question · three answers

Same message, sent to all three. The personalities aren't labels — they produce genuinely different replies. This is illustrative; your own exchanges get personalised by what each vault already knows about you.

"I missed another deadline. I don't know why I keep doing this."
HONNE · reply

Hey. That sounds heavy — and this is the third one you've mentioned since we started talking. Tell me what the week was like before it. Last time, you were running on four hours of sleep and there was something going on with your housemate. Is any of that still in the air?

YAMI · reply

You know why. You just don't want to say it. Every time a deadline gets close you start reorganising your life — new notebook, new playlist, new "system". It's not a time problem. It's a flinching problem. What are you flinching from this week?

SHIN · reply

Pattern confirmed: three missed deadlines in 21 days (from memory: 04/02, 04/11, 04/18). Common factor: all three slipped in the final 48 hours, not from scope. Likely causes, ranked: (1) under-estimated final-mile effort, (2) perfectionism triggered late, (3) context-switching into unrelated tasks. Next move: break the coming deadline into a 48-hour kill-list and commit specific blocks.

Separated memory · why this matters

Each persona has her own private memory table. A fact confided to HONNE is not accessible to YAMI or SHIN — they couldn't retrieve it if asked. This is physical separation at the database level, not a filter that could be bypassed by prompt injection. Switching persona is not "the same character in a different mood"; it is a different relationship with its own history, limits, and knowledge of you.

Chapter 03 · anatomy of a reply

What happens
between send and play.

Every turn runs the same pipeline. Nine discrete steps between the moment you hit enter and the moment her voice starts. All of them are visible in the live feed on the home page; this is the annotated walkthrough.

01
inbound

WebSocket frame arrives on /api/ws/chat with { content, conversation_id, mode }. Conversation is looked up or created; the active persona's memory table is selected.

02
embed

The incoming message is embedded. Default is a deterministic hashed bag-of-words (zero dependency). Swappable for any OpenAI-compatible embedding endpoint in Settings.
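A deterministic hashed bag-of-words embedder fits in a dozen lines. This is a sketch under assumptions — the 384-dim width comes from the stack notes, but the hash function and normalisation are illustrative, not the shipped code:

```python
import hashlib
import math

DIM = 384  # default embedding width, per the stack notes

def embed(text: str, dim: int = DIM) -> list[float]:
    """Deterministic hashed bag-of-words: each token hashes to one bucket."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode()).digest()
        bucket = int.from_bytes(digest[:4], "little") % dim
        vec[bucket] += 1.0
    # L2-normalise so cosine similarity later reduces to a dot product
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]
```

Being a bag of words, it is order-insensitive and fully deterministic — the same text always maps to the same vector, with zero dependencies.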

03
retrieve

Top-K cosine similarity against the active persona's vault. K defaults to 6. Results filtered by confidence and persona-scope; decayed memories are down-weighted.
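The ranking can be sketched as similarity weighted by importance and decay — the multiplicative combination and the confidence threshold here are assumptions, not the shipped constants:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def top_k(query: list[float], rows: list[dict], k: int = 6,
          min_confidence: float = 0.3) -> list[dict]:
    """Rank one persona's rows by similarity x importance x decay weight."""
    scored = []
    for row in rows:
        if not row["active"] or row["confidence"] < min_confidence:
            continue  # persona scoping happens earlier: rows come from one table
        score = cosine(query, row["embedding"]) * row["importance"] * row["decay"]
        scored.append((score, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]
```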

04
assemble

Retrieved memories are formatted as a contextual brief and prepended to the system prompt. Persona prompt, tone, and retrieved brief are merged.

05
stream

Call goes out to the persona's configured frontier model. Tokens stream back over SSE; the WebSocket re-broadcasts them as `token` events to the browser.

06
persist

When the stream finishes, an `assistant_message` event writes the full reply to the conversation table with a reference to the retrieved memory IDs.

07
extract

A separate model pass distils the exchange into typed memory candidates (fact · preference · goal · topic · pattern · tone). Candidates are scored for confidence and importance.
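Whatever the extraction model returns has to be validated before it becomes a row. A minimal sketch of that gate — the field names and default scores are assumptions:

```python
from dataclasses import dataclass

KINDS = {"fact", "preference", "goal", "topic", "pattern", "tone"}

@dataclass
class MemoryCandidate:
    kind: str
    content: str
    confidence: float   # how strongly the exchange supports the claim
    importance: float   # baseline per kind, later nudged by feedback

def parse_candidates(raw: list[dict]) -> list[MemoryCandidate]:
    """Validate the extraction model's output before anything is stored."""
    out = []
    for item in raw:
        if item.get("kind") not in KINDS:
            continue  # drop anything outside the six typed kinds
        out.append(MemoryCandidate(
            kind=item["kind"],
            content=item["content"].strip(),
            confidence=max(0.0, min(1.0, float(item.get("confidence", 0.5)))),
            importance=max(0.0, min(1.0, float(item.get("importance", 0.5)))),
        ))
    return out
```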

08
reflect

A reflection pass tags the turn with `learned`, `ignored`, or `adapted` and logs which retrieved memories contributed. This tag is later used to re-weight retrieval.

09
speak

If voice is enabled, the reply text is sent to /api/tts/speak. Backend proxies to ElevenLabs using the mode-specific voice ID. MP3 is streamed back and played through a Web Audio gain node.

Chapter 04 · the memory engine

Six types. Two scores. One table per persona.

Memory is not a blob. It is a typed, scored, scoped row that can be read, rewritten, or wiped — independently. The schema is deliberately boring so that introspection costs nothing.

The six kinds
fact
`lives in oslo`

Something stable and verifiable about you or your world.

preference
`prefers short replies`

A taste or tolerance — influences tone, length, model choice.

goal
`shipping v1 by may`

An outcome you're pursuing; time-bounded and revisitable.

topic
`working on a rust engine`

Ongoing subject matter — helps her retrieve thematically.

pattern
`tends to overwork thursday nights`

A recurring behaviour or tendency — surfaces when it matters.

tone
`responds better to blunt`

How to speak to you. Read by the reply-styling layer, not just content.

Row schema
memory {
  id            uuid
  persona       honne | yami | shin
  kind          fact | preference | goal |
                topic | pattern | tone
  content       text
  embedding     blob       -- vector
  confidence    float      -- 0.0 .. 1.0
  importance    float      -- nudged by 👍/👎
  usage_count   int        -- retrieval hits
  active        bool       -- wipeable without delete
  created_at    ts
  last_used_at  ts
}
Every memory belongs to one persona. There is no cross-persona foreign key. A search run in HONNE's context cannot reach a row tagged yami, and vice versa.
Confidence

Assigned at extraction time by the model that distilled the memory. Reflects how strongly the source exchange supports the claim. Low-confidence rows are still stored, but the retrieval step filters on confidence, so they surface less often.

Importance

Starts at a baseline per kind. Every retrieval increments usage_count; every 👍 bumps importance up, every 👎 bumps it down. Importance is the primary ranking signal at retrieval time.

Decay

Rows that haven't been used in memory-decay-days (default 30) are exponentially down-weighted in retrieval. Never deleted, just demoted. Reversible the moment they're relevant again.
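The two maintenance rules above reduce to a few lines. A sketch under assumptions — the step size and the exact decay curve are illustrative; only the 30-day window comes from the text:

```python
import math

DECAY_DAYS = 30  # the memory-decay-days default

def apply_feedback(row: dict, thumbs_up: bool) -> None:
    """A rating nudges importance up or down, clamped to [0, 1]."""
    delta = 0.1 if thumbs_up else -0.1   # illustrative step size
    row["importance"] = min(1.0, max(0.0, row["importance"] + delta))

def decay_weight(idle_days: float, decay_days: int = DECAY_DAYS) -> float:
    """Exponential down-weighting once a row is idle past the window.
    Never deletes: the weight recovers the moment the row is used again."""
    if idle_days <= decay_days:
        return 1.0
    return math.exp(-(idle_days - decay_days) / decay_days)
```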

Chapter 05 · reflection & adaptation

Every turn is reviewed by the same system that produced it.

Reflection is the part that makes the loop a loop. Without it the memory store grows without direction. With it, the store self-organises around what actually shaped good replies.

learned

The turn added a memory that is likely to matter again. No immediate change to reply style; the value is future retrieval.

ignored

Memories were retrieved but didn't contribute meaningfully to the reply. Their importance is nudged down; repeated ignores push them toward decay.

adapted

The reply style was changed based on tone/pattern memories. Adaptation is recorded so the change can be undone or studied in the Training lens.

Adaptation is intentionally small. There is no re-prompting, no chain of thought, no meta-agent deciding policy. A reflection pass produces a tag and a weight delta; those deltas are what change behaviour over time.

This is on purpose. Aggressive adaptation mistakes one bad turn for a pattern. Modest adaptation lets the agent converge toward what you want over dozens of exchanges, without flipping personality between them.
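The whole pass can be sketched as one tag plus a dictionary of importance deltas. The tag precedence and the delta size here are assumptions, not the shipped logic:

```python
def reflect(retrieved_ids: set, contributing_ids: set,
            learned_new: bool, style_adapted: bool) -> tuple[str, dict]:
    """One tag per turn, plus small importance deltas for retrieved memories."""
    if style_adapted:
        tag = "adapted"      # tone/pattern memories changed the reply style
    elif learned_new:
        tag = "learned"      # the turn added a memory likely to matter again
    else:
        tag = "ignored"      # retrieval happened but nothing contributed
    # contributing memories get a small boost; ignored ones drift toward decay
    deltas = {mid: (0.05 if mid in contributing_ids else -0.05)
              for mid in retrieved_ids}
    return tag, deltas
```

The deltas, not the tag, are what change behaviour over time — which is exactly the "modest adaptation" argument above.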

Chapter 06 · model & voice layer

The voice is swappable on purpose.

Models change. Voices change. Kioko's character shouldn't. The model and TTS layers sit deliberately behind a thin seam so both can be replaced without touching memory, reflection, or persona logic.

Model routing

One key, one dropdown, any frontier chat model. HONNE can run on Claude Sonnet, YAMI on GPT-4o, SHIN on something else — or the same model for all three. Swap mid-conversation without wiping memory.

  • Per-persona default model in settings
  • Temperature and token limits configurable per persona
  • Streaming via SSE, re-broadcast over WebSocket
  • Failed-model fallback chain (next in list if upstream errors)
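With the upstream call injected, the fallback chain is a few lines. A sketch, not the shipped code — the real version wraps an async streaming call:

```python
def call_with_fallback(models: list[str], call_fn):
    """Try each model in order; advance only when the upstream call fails."""
    last_error = None
    for model in models:
        try:
            return model, call_fn(model)
        except Exception as exc:   # upstream error: fall through to the next
            last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")
```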
Voice (ElevenLabs)

Each persona has its own voice ID. When the reply finishes streaming, the full text is sent to a backend proxy that forwards to ElevenLabs and pipes the MP3 back. Keys never leave the server.

  • Proxy endpoint: /api/tts/speak
  • Model: eleven_turbo_v2_5 by default
  • Per-mode Web Audio gain (YAMI / SHIN louder for parity)
  • Mute toggle in chat, persisted to localStorage
  • 404 fallback if key is unset — UI silently hides
Chapter 07 · stack & data ownership

Every layer inspectable and swappable.

Frontend

Next.js 14 · React 18 · TS · Tailwind 3. WebSocket chat, three.js + @react-three/fiber for the GLB, Web Audio for TTS gain.

Backend

FastAPI · SQLAlchemy · SQLite. Async via httpx. Routers: chat, memory, settings, feedback, compare, insights, tts.

Database

Single SQLite file. Tables: conversations, messages, memories_honne/yami/shin, reflections, feedback, settings.

Embeddings

Default: local hashed bag-of-words (zero dependency, 384-dim). Optional: any OpenAI-compatible embedding endpoint.

Retrieval

In-process cosine over SQLite BLOB embeddings. Swap for pgvector/sqlite-vec when scale demands it — retrieval is behind one interface.
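Storing a float vector as a SQLite BLOB is one `struct` call each way — a sketch assuming float32 and the 384-dim default from the embeddings note:

```python
import struct

def to_blob(vec: list[float]) -> bytes:
    """Pack a vector as little-endian float32 for a SQLite BLOB column."""
    return struct.pack(f"<{len(vec)}f", *vec)

def from_blob(blob: bytes) -> list[float]:
    """Unpack a BLOB back into floats (4 bytes per float32)."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))
```

Because the similarity loop runs in-process, this is the only (de)serialisation boundary — which is why the interface swap to pgvector or sqlite-vec stays a one-file change.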

Streaming

WebSocket for tokens + telemetry. SSE upstream from the model layer. MP3 audio streamed from /api/tts/speak.

Auth

Single-tenant by default — the demo runs as one user. Hooks in place for future multi-tenant mode with per-user vaults.

3D character

Draco-compressed GLB, idle clip auto-played, persona rim light, contact shadow baked, asset pre-warmed via useGLTF.preload.

Deploy

Frontend → Vercel. Backend → Railway / Fly / any container host. Env vars hold all secrets; no keys ever ship in the bundle.

Data ownership · what leaves your machine, and what doesn't
stays local
  • Full conversation history
  • Every persona's memory lattice
  • Reflections and adaptation log
  • Feedback signals and scoring
  • Settings and persona defaults
leaves the machine
  • Messages + retrieved memory brief → the configured model provider
  • Reply text → ElevenLabs (only when voice is on)
  • Nothing else

Every memory row is inspectable in the Lattice lens and editable or deletable at any time. Learning can be disabled per-session from the Settings page; replies still generate but nothing is stored.

Chapter 08 · roadmap & status

v1.0 is shipped. These are on the board.

next
  • Streamed TTS (voice starts mid-reply)
  • Multi-user vaults
  • Semantic embeddings default (not hashed)
  • Per-persona prompt tuning UI
exploring
  • Reflection rollups (weekly summaries)
  • Dream-state memory compaction
  • Memory import / export (JSON)
  • Shared world-state between personas (opt-in)
out of scope
  • Fine-tuning the underlying model
  • Training on user data
  • Shipping a cloud-hosted SaaS
  • Replacing the agent's character per user
Chapter 09 · FAQ & design decisions

The questions people keep asking.

Isn't this just RAG with extra steps?

The retrieval layer is RAG-shaped, yes. What makes it different is that the retrieved rows are typed, scored, scoped per persona, and rewritten by a reflection pass after every turn. It's a self-modifying knowledge store, not a static document index.

Why SQLite and not a vector DB?

The project fits in ~50k memories per user for years. SQLite with BLOB embeddings + in-process cosine is faster than a network call to Pinecone at that scale, and the schema is the same shape you'd use in any vector DB. Migrating to pgvector or sqlite-vec is a one-file change when volume demands it.

Why keep three separate memory tables instead of one with a `persona` column?

Physical separation makes accidental cross-persona leaks impossible. If HONNE's retrieval ever accidentally pulled a YAMI row, the persona contract would be broken. Enforcing the boundary at the table level means the separation survives refactors, bugs, and future developers.

Why no fine-tuning?

Three reasons: (1) it's slow, expensive, and makes the agent model-locked; (2) it can't be undone — bad training data bakes itself in; (3) every frontier lab ships better base models faster than a fine-tune can be re-done. Memory + retrieval gives most of the personalisation with none of the lock-in.

Can I run it without ElevenLabs?

Yes. Voice is optional. Without an ElevenLabs key configured, the chat still works — text-only, the speaker toggle simply doesn't appear. The LLM layer is likewise swappable: any OpenAI-compatible endpoint (local Ollama, LM Studio, Anthropic directly, your own server) can sit behind the same seam.

What happens if I hit 👎 on a reply?

The retrieved memories that produced it get an importance decrement. If the same memory set produces multiple bad replies, those rows get pushed toward decay. It's a gradient, not a delete — so one bad rating doesn't wipe useful context.

Does she keep talking to herself when I close the tab?

No. No background work, no agent cron. State changes only during and immediately after a turn you initiated. The app is asleep when you are.

Is there a token / CA?

Yes — the CA pill in the header carries the contract address. Click it to copy. The project itself does not require one; it's there for the community.

You've reached the end of the manual. Try her on turn one — she'll already be writing things down.
