
LLM Inference

How LucidPal streams tokens from llama.cpp to the SwiftUI UI.

Flow

1. User sends message
2. `ChatViewModel.sendMessage()`
3. Build system prompt (calendar context injected)
4. `llmService.generate()` → `AsyncThrowingStream<String>`
5. `LlamaActor.generate()` (serial actor, C FFI)
6. Token-by-token via `AsyncThrowingStream`
7. `applyStreamToken()` (think/response split)
8. `executeCalendarActions()` (extract + execute JSON blocks)
9. Update `messages` array → SwiftUI re-renders
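
The steps above can be sketched end to end. This is an illustrative composition only: `buildSystemPrompt`, the placeholder assistant message, and the stubbed `generate` stream are assumptions for the sketch, not LucidPal's actual code.

```swift
// Illustrative sketch of the send-message pipeline. Only the flow itself
// comes from the docs; every helper here is a stand-in.
final class ChatViewModelSketch {
    var messages: [String] = []

    func sendMessage(_ userText: String, calendarContext: String) async {
        messages.append(userText)
        let systemPrompt = buildSystemPrompt(calendarContext: calendarContext)
        messages.append("")                     // placeholder assistant message
        let idx = messages.count - 1
        do {
            for try await token in generate(systemPrompt: systemPrompt) {
                messages[idx] += token          // SwiftUI re-renders on each mutation
            }
        } catch is CancellationError {
            // user tapped stop: keep whatever partial content streamed in
        } catch {
            messages[idx] = "Error: \(error)"
        }
    }

    // Hypothetical prompt builder injecting calendar context.
    func buildSystemPrompt(calendarContext: String) -> String {
        "You are LucidPal.\n\nCalendar:\n\(calendarContext)"
    }

    // Stub stream standing in for llmService.generate().
    private func generate(systemPrompt: String) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            for fragment in ["Hello", " ", "world"] { continuation.yield(fragment) }
            continuation.finish()
        }
    }
}
```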

Token Streaming

LLMService.generate() returns an AsyncThrowingStream<String, Error> — each element is one token (~1 word fragment). ChatViewModel consumes this stream and applies each token live:

```swift
for try await token in llmService.generate(
    systemPrompt: systemPrompt,
    messages: historyMessages,
    thinkingEnabled: showThinking
) {
    guard let idx = messages.firstIndex(where: { $0.id == assistantID }) else { break }
    applyStreamToken(token, rawBuffer: &raw, thinkDone: &thinkDone, ...)
}
```

Thinking Mode (Qwen3.5 <think> Tags)

Qwen3.5 models emit a <think>...</think> block before answering. LucidPal handles this live:

<think>
The user wants to create an event...
</think>
Here's your event — tap confirm.

applyStreamToken() buffers the prefix and splits at </think>:

| Buffer state | Action |
|---|---|
| Starts with `<think>` (no close yet) | Set `isThinking = true`, show in ThinkingDisclosure |
| `</think>` detected | Extract thinking text, reset to response mode |
| No `<think>` prefix | Treat entire output as response |
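
The three buffer states above can be sketched as a pure function over the raw buffer. The function name and tuple return type are hypothetical, not LucidPal's actual `applyStreamToken` signature:

```swift
import Foundation

// Illustrative split of a raw buffer into (thinking, response) following the
// three cases in the table above. Names here are assumptions, not LucidPal's API.
func splitThinkBuffer(_ raw: String) -> (thinking: String?, response: String, isThinking: Bool) {
    guard raw.hasPrefix("<think>") else {
        // No <think> prefix: treat entire output as response.
        return (nil, raw, false)
    }
    let body = String(raw.dropFirst("<think>".count))
    if let closeRange = body.range(of: "</think>") {
        // </think> detected: extract thinking text, rest is response.
        let thinking = String(body[..<closeRange.lowerBound])
        let response = String(body[closeRange.upperBound...])
        return (thinking, response, false)
    }
    // Open tag with no close yet: still in thinking mode.
    return (body, "", true)
}
```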

Context Window

The context window and chat-history limit are chosen at launch based on device RAM:

```swift
let ramGB = Int(ProcessInfo.processInfo.physicalMemory / 1_073_741_824)
let historyLimit = ramGB >= 6
    ? ChatConstants.largeHistoryLimit  // 50 messages ≈ 5,000 tokens
    : ChatConstants.smallHistoryLimit  // 20 messages ≈ 2,000 tokens
```

| RAM | Context | History limit |
|---|---|---|
| < 4 GB | 4K tokens | 20 messages |
| ≥ 4 GB | 8K tokens | 50 messages |

Sampler Configuration

| Parameter | Value | Reason |
|---|---|---|
| Temperature | 0.35 | Low; reduces hallucinated JSON fields |
| Top-P | 0.9 | Nucleus sampling |
| Max new tokens | 768 | Prevents runaway generation |
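
These settings could be carried in a small value type; the struct and its token-budget check are a sketch, not LucidPal's actual configuration type:

```swift
// Hypothetical container for the sampler settings above.
struct SamplerConfig {
    var temperature: Double = 0.35  // low, to reduce hallucinated JSON fields
    var topP: Double = 0.9          // nucleus sampling
    var maxNewTokens: Int = 768     // hard stop against runaway generation

    /// Returns true while generation may continue emitting tokens.
    func shouldContinue(tokensEmitted: Int) -> Bool {
        tokensEmitted < maxNewTokens
    }
}
```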

LlamaActor

LlamaActor is a Swift actor wrapping llama.cpp's C API. All calls are serialized — no concurrent inference, no data races on C pointers.

```swift
actor LlamaActor {
    private var model: OpaquePointer?
    private var context: OpaquePointer?

    func load(path: String) throws { ... }
    func unload() { ... }
    func generate(tokens: [Int32], maxNew: Int) async throws -> AsyncThrowingStream<String, Error>
}
```
⚠️ Warning: LlamaActor requires a physical device; the llama.cpp Metal backend does not run in the Simulator.
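
Consuming the actor's stream might look like the sketch below. Only the `generate(tokens:maxNew:)` shape is taken from the outline above; the stub actor and its fake fragments are assumptions:

```swift
// Sketch: a stub actor mirroring LlamaActor's generate signature, plus a
// consumer that concatenates streamed fragments. Everything except the
// generate(tokens:maxNew:) shape is a stand-in.
actor StubLlamaActor {
    func generate(tokens: [Int32], maxNew: Int) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            for piece in ["Hel", "lo"] { continuation.yield(piece) }  // fake fragments
            continuation.finish()
        }
    }
}

// Actor isolation means the call needs await; the stream itself is then
// iterated with for try await, exactly as in ChatViewModel.
func collect(_ llama: StubLlamaActor) async throws -> String {
    var text = ""
    for try await piece in await llama.generate(tokens: [1, 2], maxNew: 768) {
        text += piece
    }
    return text
}
```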

Error Handling

| Error | Source | Handling |
|---|---|---|
| `LLMError.modelNotLoaded` | `LLMService.generate` when model not ready | ChatViewModel shows error banner |
| `CancellationError` | User taps stop button | Partial content left visible, no error shown |
| Any other `Error` | llama.cpp runtime | Displayed in `messages[idx].content` |
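
The routing in the table can be sketched as a switch over the thrown error. `LLMError.modelNotLoaded` is named in the docs; the outcome enum and helper function here are illustrative:

```swift
// Illustrative error routing per the table above. Only LLMError.modelNotLoaded
// comes from the docs; the rest is a stand-in for ChatViewModel's handling.
enum LLMError: Error { case modelNotLoaded }

enum StreamOutcome: Equatable {
    case showErrorBanner        // model not ready yet
    case keepPartialContent     // user tapped stop; no error shown
    case showInMessage(String)  // runtime error surfaced inline
}

func handleStreamFailure(_ error: Error) -> StreamOutcome {
    switch error {
    case is LLMError:
        return .showErrorBanner
    case is CancellationError:
        return .keepPartialContent
    default:
        return .showInMessage("\(error)")
    }
}
```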