Project RIDI — Building a Local AI That Listens, Remembers, and Talks Back
A deep-dive into architecture, memory design, and a day of refactoring a real-time Korean voice AI.
What Is RIDI?
RIDI is a local, real-time AI companion that runs entirely on your machine. You talk to it through your microphone. It listens, transcribes your speech, thinks with a locally-running LLM, and answers in a cloned voice — all within a few seconds, with no cloud API required for the core loop.
The name fits the personality: RIDI is sarcastic, opinionated, a little jealous, fond of tiramisu, and built specifically for Korean-language conversation. It was built as a personal project — not as a product demo, but as a working system that a developer actually uses.
The hardware requirement is real: you need a CUDA-capable GPU. The stack uses llama-cpp-python for GGUF-quantized local inference, faster-whisper for speech-to-text, silero-vad for voice activity detection, and a full GPT-SoVITS pipeline for voice cloning. Redis holds the short-term conversation window; SQLite and FAISS back the long-term memory. It is, by any definition, a heavy stack — and it runs on a single Windows machine.
The Two Modes
Full Mode (python infer_v2.py)
The production path. Every component runs locally:
Microphone
↓ sounddevice (16 kHz, mono)
AudioProcessor — Silero VAD + faster-whisper (Whisper large-v3, Korean)
↓ user_input_queue (multiprocessing.Queue)
ConversationManager._loop_step() (main thread, ~10 Hz poll)
↓ llm_task_queue
LLM Worker Thread
├─ MemoryContext.context_for(utterance)
│ ├─ FAISS semantic search + SQLite FTS5 keyword search (hybrid RAG)
│ └─ Redis summary window
├─ PromptBuilder → system prompt with persona + memory + summary
├─ ModelChat.generate_stream() → llama-cpp streaming (Llama-3 chat format)
├─ STM.add_turn() → Redis → manage_memory_flow()
└─ process_tts_for_buffer() → GPT-SoVITS → audio_queue → AudioPlayback thread
Lite Mode (python conversation.py)
A lighter variant using the Google Gemini API instead of the local LLM. The audio pipeline (VAD, Whisper, TTS) is identical. STM is a raw Redis list instead of the full ShortTermMemoryManager. LTM is disabled. This mode exists for devices where you want the voice interface but can’t run a 3B GGUF model at useful speed.
The Memory System
RIDI’s most interesting engineering is the two-layer memory system. Most chatbot demos run with a flat context window. RIDI runs with three tiers.
Tier 1 — Active Context (the LLM’s token window)
The last N conversation turns, formatted as a messages list and passed directly to the LLM. N is small (4 turns by default) to keep latency low.
Tier 2 — Short-Term Memory (Redis)
Three Redis keys per user:
| Key | Type | Contents |
|---|---|---|
stm:{user_id} | list | JSON-encoded {role, content} turns |
summary_window:{user_id} | list | LLM-generated summaries of evicted turns |
interrupted_utterance:{user_id} | string | Partial utterance saved during mid-response interrupt |
When the STM list exceeds max_stm_conversations * 2 messages, the oldest turns are evicted — but not deleted. The LLM summarizes them first. That summary goes into the summary_window, which feeds into every future prompt as the “previous conversation” context.
When the summary window itself fills up, the LLM re-summarizes the whole window into a single compact memory and archives it to LTM.
Three modes control this flow: original (rolling eviction), mini (summarize every turn pair immediately), and semi (summarize and reset when STM is full).
Tier 3 — Long-Term Memory (SQLite + FAISS)
Two parallel stores:
- SQLite with FTS5 — full-text keyword search. Each
MemoryItemstores: content, tags, metadata (JSON), timestamp, source. - FAISS IndexFlatIP — 768-dimensional semantic vectors from
jhgan/ko-sroberta-multitask, a Korean sentence embedding model.
At query time, MemoryOrchestrator.search_memories_for_rag() runs both stores in parallel, merges candidates, and computes a hybrid score:
final_score = (semantic_weight × cosine_similarity) + (keyword_weight × normalized_rank)
Memories tagged memory_type = "core" receive a configurable score boost (default 1.5×), ensuring that important long-term facts remain relevant even when semantic similarity is moderate. The top-k results above ltm_rag_similarity_threshold (default 0.7) are injected into the system prompt.
The Concurrency Model
RIDI is a multi-threaded, single-process application with one subprocess for the Discord bot.
| Component | Mechanism | Communication |
|---|---|---|
| Main loop | Main thread | reads user_input_queue, writes llm_task_queue |
| LLM worker | threading.Thread (daemon) | reads llm_task_queue |
| Audio playback | threading.Thread (daemon) | reads audio_queue |
| Microphone / STT | threading.Thread (daemon) | writes user_input_queue |
| Discord bot | multiprocessing.Process (daemon) | reads/writes user_input_queue, discord_queue |
Three events coordinate the system:
exit_event— amultiprocessing.Event, set on shutdown, checked by all threads and the Discord processinterrupt_event— athreading.Event, set when the user speaks mid-response; causes the LLM stream and TTS to abortllm_tts_idle_event— athreading.Event, gates the auto-continue silence detection
The interrupt_event wires directly into ModelChat.generate_stream(), which checks it after each token chunk. When set, streaming stops immediately, the partial response is discarded, and the next input takes over.
A Day of Refactoring — Breaking the Monolith
Until today, the entire per-turn logic lived in one method: ConversationManager._process_single_turn(). That 40-line block did five unrelated things without a seam:
- LTM retrieval (FAISS + SQLite)
- STM summary loading (Redis)
- Prompt assembly
- LLM streaming
- STM write-back and CSV logging
To run a test against any part of this, you needed a live GPU, a running Redis server, and an initialized FAISS index. That’s not a unit test — that’s a full integration run.
Today, three injectable seams were carved out in order.
Seam 1 — MemoryContext
# core/memory/MemoryContext.py
class MemoryContext:
def __init__(self, ltm_manager, stm_manager, config): ...
def context_for(self, query: str) -> dict:
return {
"memory": self._load_ltm(query), # FAISS + SQLite, threshold-filtered
"summary": self._load_stm_summary() # Redis summary window
}
All knowledge about RAG thresholds, score filtering, and summary formatting lives here. ConversationManager no longer imports MemoryItem, List, or Tuple. The _gather_prompt_context method collapsed from 12 lines to one:
def _gather_prompt_context(self, user_utter: str) -> dict:
return self.memory_context.context_for(user_utter)
Seam 2 — Turn
# core/engine/Turn.py
class Turn:
def __init__(self, backend, stm_manager, prompt_builder,
memory_context, user_name: str, log_path: str): ...
def run(self, user_utter: str) -> Optional[str]: ...
The entire per-turn pipeline moved here. Turn depends only on its constructor arguments — no AppConfig, no SystemConfig. A unit test can construct one as:
Turn(
backend=FakeBackend(),
stm_manager=None,
prompt_builder=real_builder,
memory_context=FakeMemory(),
user_name="test",
log_path="/dev/null"
)
No GPU. No Redis. No FAISS. llm_worker_target in ConversationManager now calls self.turn.run(user_utter) — one line instead of 40.
Note that config was removed from Turn entirely in a follow-up step. The full SystemConfig was being passed in, but Turn.run() only read two fields: user_name and conversation_log_path. Those are now plain str arguments — zero config object dependency.
Seam 3 — LLMBackend
# core/engine/LLMBackend.py
class LLMBackend(ABC):
@abstractmethod
def generate_stream(self, messages: List[Dict]) -> Iterator[str]: ...
class LocalLlamaBackend(LLMBackend):
def __init__(self, model_chat): ...
def generate_stream(self, messages): return self.model_chat.generate_stream(messages)
class GeminiBackend(LLMBackend):
def __init__(self, api_key: str, model: str = "gemini-2.5-flash"): ...
def generate_stream(self, messages): ... # converts message format, yields response.text
Before this, swapping Full mode (local llama-cpp) for Lite mode (Gemini) meant overriding AppConfig.init_components() in a subclass. Now it is one line in AppConfig:
# Full mode
self.llm_backend = LocalLlamaBackend(self.model_chat)
# Lite mode (SemiAppConfig)
self.llm_backend = GeminiBackend(api_key=get_apikey())
Turn calls self.backend.generate_stream(final_messages) and knows nothing about llama-cpp or Gemini.
Bugs Found in Code Review
A code review pass surfaced 10 findings across the codebase. Four are high severity — silent data corruption that runs undetected.
1. FAISS -1 padding injects wrong memories (High)
SemanticSearchFAISS (semantic_search_faiss.py:213) checks if idx < len(metadata_list) to validate FAISS results. When the index has fewer entries than top_k, FAISS pads results with idx = -1. Python’s negative indexing makes -1 < len(list) true, so metadata_list[-1] — the last stored memory — silently appears in every under-full query.
# Fix: one extra check
if idx != -1 and idx < len(self.metadata_list):
2. STM eviction deletes the newest turns before reading them (High)
STM_Manager._evict_turns_to_summary_window() (STM_Manager.py:398) calls ltrim key 0 -N to trim the list before reading the items to summarize. ltrim 0 -2 on a list of 10 items deletes the last 1 item. Those turns are gone before the summarizer ever sees them. Every eviction cycle silently drops recent conversation.
# Fix: remove the pre-read trim entirely
# messages_to_summarize = self.context_manager.get_turns(0, num_to_evict - 1)
# self.context_manager.trim(num_to_evict, -1) ← keep only this, after summarizing
3. LTM metadata extraction always crashes silently (High)
MemoryOrchestrator._extract_ltm_metadata() (LTM_Manager.py:266) calls ModelChat.generate_stream(messages) as an unbound class method. Python passes messages (a list) as self. The first line of generate_stream runs self.em.clear_interrupt(), which raises AttributeError: 'list' object has no attribute 'em'. The outer except Exception catches it silently. Every memory is stored with empty metadata. The whole extraction pipeline has never run.
4. add_memory_from_summary reads wrong keys from metadata dict (High)
The same function reads extracted_data.get("metadata", {}) and extracted_data.get("tags", []) from the return value of _extract_ltm_metadata, which returns a flat dict — no nested "metadata" or "tags" keys. Even if bug #3 were fixed, every memory would still be stored with empty metadata and no tags.
Other findings (Medium/Low)
SemanticSearchFAISS.delete_itemis not implemented — any memory deletion leaves stale vectors in FAISS indefinitelyQueueManager.clear_audio_queue()does not exist — the interrupt path crashes immediately if ever calledhandle_interruptioncollects merged inputs into a list that is never saved back toself.pending_input— interrupted speech is lost- Singleton
_lockinitialization is not thread-safe — twoEventManager/QueueManagerinstances can be created under a race _extract_ltm_metadatacallsself.prompt_builderwithout aNoneguard —setup_data()creates an orchestrator with no prompt builder, so any call toadd_memory_from_summaryfrom that path crashes silentlydiscord_bot.process_responsesblocks forever onQueue.get()with no timeout — if the main process crashes before the sentinel is sent, the Discord process hangs indefinitely
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ infer_v2.py │
│ │
│ Ridi.run() │
│ └─ AppConfig.init_components() │
│ ├─ LocalLlamaBackend(ModelChat(Llama)) │
│ ├─ MemoryOrchestrator (FAISS + SQLite) │
│ └─ ShortTermMemoryManager (Redis) │
│ │
│ ConversationManager.__init__(app_context) │
│ ├─ MemoryContext(ltm, stm, config) │
│ └─ Turn(backend, stm, prompt_builder, │
│ memory_context, user_name, log_path) │
│ │
│ ConversationManager.llm_worker_target() [thread] │
│ └─ Turn.run(user_utter) │
│ ├─ MemoryContext.context_for(query) │
│ │ ├─ LTM hybrid RAG search │
│ │ └─ STM summary window │
│ ├─ PromptBuilder.build_final_response_prompt() │
│ ├─ STM.add_turn("user") │
│ ├─ LLMBackend.generate_stream(messages) │
│ └─ STM.add_turn("assistant") │
│ └─ manage_memory_flow() │
│ ├─ evict → summarize → LTM │
│ └─ archive summary → LTM │
└─────────────────────────────────────────────────────────────┘
┌──────────────────────────────┐ ┌──────────────────────────┐
│ AudioProcessor [thread] │ │ AudioPlayback [thread] │
│ Silero VAD + Whisper STT │ │ sounddevice OutputStream│
│ → user_input_queue │ │ ← audio_queue │
└──────────────────────────────┘ └──────────────────────────┘
┌──────────────────────────────┐
│ DiscordBot [process] │
│ discord.py client │
│ reads user_input_queue │
│ writes discord_queue │
└──────────────────────────────┘
What’s Left
Immediate (tracked in docs/todo.md):
- Fix the 4 high-severity bugs — the FAISS padding bug and the STM eviction bug are silent data corruption that runs on every session
- Fix
handle_interruptionlosing the merged input — interrupted speech should not be discarded MemoryContextstill holds a fullSystemConfigfor justltm_rag_similarity_threshold— reduce it to a plainfloatargument (same pattern as what was done forTurn)
Deferred:
TTSContext.speak(text, audio_queue)— collapse the threetts_enabledguard sites into a single no-op interfaceSemanticSearchFAISS.delete_item— implement index rebuild on deletion- Thread-safe singleton initialization for
EventManagerandQueueManager SemiConvManagerstill overrides_loop_stepentirely and never callsTurn.run()— migrating Lite mode to use the same turn pipeline requires reconciling its different STM strategy (raw Redis list vsShortTermMemoryManager)
Technical Stack
| Layer | Technology |
|---|---|
| Speech-to-text | faster-whisper (large-v3, CUDA, float16) |
| Voice activity detection | silero-vad (via torch.hub) |
| Local LLM inference | llama-cpp-python + GGUF (3B, q4_k_m) |
| Cloud LLM (Lite mode) | Google Gemini 2.5 Flash |
| Text-to-speech | GPT-SoVITS (AR token predictor + VITS decoder) |
| Short-term memory | Redis 6.x (localhost:6849) |
| Long-term memory | SQLite FTS5 + FAISS IndexFlatIP |
| Semantic embedding | jhgan/ko-sroberta-multitask (768-dim) |
| Audio I/O | sounddevice |
| Messaging | discord.py |
| Platform | Windows 11, Python 3.10, conda |
Closing Thoughts
RIDI is one of those projects where the ambition of the idea forces you to build systems you didn’t originally plan to. A voice-reactive AI that remembers across sessions requires you to solve speech detection, real-time transcription, LLM serving, voice synthesis, interrupt handling, and memory architecture — all at once, on consumer hardware, in a single Python process.
The refactoring session documented here took an untestable 40-line monolith and carved it into three injectable seams in a few hours. The result is Turn, MemoryContext, and LLMBackend — each testable in isolation, each with a clear responsibility. The code review pass that followed identified 4 high-severity silent bugs that had been running undetected.
The most interesting engineering challenge ahead is the memory lifecycle: when and what to evict, how to avoid surfacing irrelevant old memories, and how to keep the hybrid RAG score calibrated as the index grows. That’s where the real work is.