
LLM Inference

How LucidPal streams tokens from llama.cpp to the SwiftUI UI.

Flow

1. User sends message
2. `ChatViewModel.sendMessage()`
3. Build system prompt (calendar context injected)
4. `llmService.generate()` → `AsyncThrowingStream<String>`
5. `LlamaActor.generate()` (serial actor, C FFI)
6. Token-by-token via `AsyncThrowingStream`
7. `applyStreamToken()` (think/response split)
8. `executeCalendarActions()` (extract + execute JSON blocks)
9. Update `messages` array → SwiftUI re-renders
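
The steps above can be sketched end to end. This is an illustrative composition only: `buildSystemPrompt`, the placeholder assistant message, and the stubbed `generate` stream are assumptions for the sketch, not LucidPal's actual code.

```swift
// Illustrative sketch of the send-message pipeline. Only the flow itself
// comes from the docs; every helper here is a stand-in.
final class ChatViewModelSketch {
    var messages: [String] = []

    func sendMessage(_ userText: String, calendarContext: String) async {
        messages.append(userText)
        let systemPrompt = buildSystemPrompt(calendarContext: calendarContext)
        messages.append("")                     // placeholder assistant message
        let idx = messages.count - 1
        do {
            for try await token in generate(systemPrompt: systemPrompt) {
                messages[idx] += token          // SwiftUI re-renders on each mutation
            }
        } catch is CancellationError {
            // user tapped stop: keep whatever partial content streamed in
        } catch {
            messages[idx] = "Error: \(error)"
        }
    }

    // Hypothetical prompt builder injecting calendar context.
    func buildSystemPrompt(calendarContext: String) -> String {
        "You are LucidPal.\n\nCalendar:\n\(calendarContext)"
    }

    // Stub stream standing in for llmService.generate().
    private func generate(systemPrompt: String) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            for fragment in ["Hello", " ", "world"] { continuation.yield(fragment) }
            continuation.finish()
        }
    }
}
```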

Token Streaming

LLMService.generate() returns an AsyncThrowingStream<String, Error> — each element is one token (~1 word fragment). ChatViewModel consumes this stream and applies each token live:

```swift
for try await token in llmService.generate(
    systemPrompt: systemPrompt,
    messages: historyMessages,
    thinkingEnabled: showThinking
) {
    guard let idx = messages.firstIndex(where: { $0.id == assistantID }) else { break }
    applyStreamToken(token, rawBuffer: &raw, thinkDone: &thinkDone, ...)
}
```

Thinking Mode (Qwen3.5 <think> Tags)

Qwen3.5 models emit a <think>...</think> block before answering. LucidPal handles this live:

<think>
The user wants to create an event...
</think>
Here's your event — tap confirm.

applyStreamToken() buffers the prefix and splits at </think>:

| Buffer state | Action |
|---|---|
| Starts with `<think>` (no close yet) | Set `isThinking = true`, show in ThinkingDisclosure |
| `</think>` detected | Extract thinking text, reset to response mode |
| No `<think>` prefix | Treat entire output as response |
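
The three buffer states above can be sketched as a pure function over the raw buffer. The function name and tuple return type are hypothetical, not LucidPal's actual `applyStreamToken` signature:

```swift
import Foundation

// Illustrative split of a raw buffer into (thinking, response) following the
// three cases in the table above. Names here are assumptions, not LucidPal's API.
func splitThinkBuffer(_ raw: String) -> (thinking: String?, response: String, isThinking: Bool) {
    guard raw.hasPrefix("<think>") else {
        // No <think> prefix: treat entire output as response.
        return (nil, raw, false)
    }
    let body = String(raw.dropFirst("<think>".count))
    if let closeRange = body.range(of: "</think>") {
        // </think> detected: extract thinking text, rest is response.
        let thinking = String(body[..<closeRange.lowerBound])
        let response = String(body[closeRange.upperBound...])
        return (thinking, response, false)
    }
    // Open tag with no close yet: still in thinking mode.
    return (body, "", true)
}
```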

Context Window

The context window and chat-history limit are chosen at launch based on device RAM:

```swift
let ramGB = Int(ProcessInfo.processInfo.physicalMemory / 1_073_741_824)
let historyLimit = ramGB >= 6
    ? ChatConstants.largeHistoryLimit  // 50 messages ≈ 5,000 tokens
    : ChatConstants.smallHistoryLimit  // 20 messages ≈ 2,000 tokens
```

| RAM | Context | History limit |
|---|---|---|
| < 4 GB | 4K tokens | 20 messages |
| ≥ 4 GB | 8K tokens | 50 messages |

Sampler Configuration

| Parameter | Value | Reason |
|---|---|---|
| Temperature | 0.35 | Low; reduces hallucinated JSON fields |
| Top-P | 0.9 | Nucleus sampling |
| Max new tokens | 768 | Prevents runaway generation |
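
These settings could be carried in a small value type; the struct and its token-budget check are a sketch, not LucidPal's actual configuration type:

```swift
// Hypothetical container for the sampler settings above.
struct SamplerConfig {
    var temperature: Double = 0.35  // low, to reduce hallucinated JSON fields
    var topP: Double = 0.9          // nucleus sampling
    var maxNewTokens: Int = 768     // hard stop against runaway generation

    /// Returns true while generation may continue emitting tokens.
    func shouldContinue(tokensEmitted: Int) -> Bool {
        tokensEmitted < maxNewTokens
    }
}
```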

LlamaActor

LlamaActor is a Swift actor wrapping llama.cpp's C API. All calls are serialized — no concurrent inference, no data races on C pointers.

```swift
actor LlamaActor {
    private var model: OpaquePointer?
    private var context: OpaquePointer?

    func load(path: String) throws { ... }
    func unload() { ... }
    func generate(tokens: [Int32], maxNew: Int) async throws -> AsyncThrowingStream<String, Error>
}
```
⚠️ Warning: LlamaActor requires a physical device; the llama.cpp Metal backend does not run in the Simulator.
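
Consuming the actor's stream might look like the sketch below. Only the `generate(tokens:maxNew:)` shape is taken from the outline above; the stub actor and its fake fragments are assumptions:

```swift
// Sketch: a stub actor mirroring LlamaActor's generate signature, plus a
// consumer that concatenates streamed fragments. Everything except the
// generate(tokens:maxNew:) shape is a stand-in.
actor StubLlamaActor {
    func generate(tokens: [Int32], maxNew: Int) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            for piece in ["Hel", "lo"] { continuation.yield(piece) }  // fake fragments
            continuation.finish()
        }
    }
}

// Actor isolation means the call needs await; the stream itself is then
// iterated with for try await, exactly as in ChatViewModel.
func collect(_ llama: StubLlamaActor) async throws -> String {
    var text = ""
    for try await piece in await llama.generate(tokens: [1, 2], maxNew: 768) {
        text += piece
    }
    return text
}
```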

Error Handling

| Error | Source | Handling |
|---|---|---|
| `LLMError.modelNotLoaded` | `LLMService.generate` when model not ready | ChatViewModel shows error banner |
| `CancellationError` | User taps stop button | Partial content left visible, no error shown |
| Any other `Error` | llama.cpp runtime | Displayed in `messages[idx].content` |
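
The routing in the table can be sketched as a switch over the thrown error. `LLMError.modelNotLoaded` is named in the docs; the outcome enum and helper function here are illustrative:

```swift
// Illustrative error routing per the table above. Only LLMError.modelNotLoaded
// comes from the docs; the rest is a stand-in for ChatViewModel's handling.
enum LLMError: Error { case modelNotLoaded }

enum StreamOutcome: Equatable {
    case showErrorBanner        // model not ready yet
    case keepPartialContent     // user tapped stop; no error shown
    case showInMessage(String)  // runtime error surfaced inline
}

func handleStreamFailure(_ error: Error) -> StreamOutcome {
    switch error {
    case is LLMError:
        return .showErrorBanner
    case is CancellationError:
        return .keepPartialContent
    default:
        return .showInMessage("\(error)")
    }
}
```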