Real streaming: token by token

There's something psychologically important about watching text appear as the model "thinks." It's not just cosmetic: it completely changes the perceived speed. A model that takes 8 seconds to generate 200 tokens "all at once" feels slow. The same model showing those tokens one by one feels snappy.

Cloud services learned this years ago. We implemented it from day one.

How it works under the hood

llama-server exposes an OpenAI-compatible POST /v1/chat/completions endpoint that supports stream: true. When streaming is enabled, instead of responding with one complete JSON object at the end, it returns a sequence of Server-Sent Events (SSE). Each event carries a delta with the newly generated token:

data: {"choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":","},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":" how"},"finish_reason":null}]}
...
data: [DONE]
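To make the wire format concrete, here is a minimal std-only sketch of how a reader might classify each line of that response. It is an illustration, not Aleph's actual code: the SseLine enum and classify_sse_line helper are hypothetical names, and the JSON payload is left undecoded (in practice a JSON library would extract choices[0].delta.content).

```rust
/// What a single line of the SSE response body means for the reader loop.
#[derive(Debug, PartialEq)]
enum SseLine<'a> {
    /// A `data:` line carrying a JSON chunk (still undecoded).
    Data(&'a str),
    /// The `data: [DONE]` sentinel that ends the stream.
    Done,
    /// Blank separators, comments, or fields this sketch does not use.
    Ignore,
}

/// Classify one line of the response body (hypothetical helper).
fn classify_sse_line(line: &str) -> SseLine<'_> {
    // Lines may arrive with a trailing "\n" or "\r\n".
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    match line.strip_prefix("data:") {
        Some(rest) => {
            // The SSE format allows one optional space after the colon.
            let payload = rest.strip_prefix(' ').unwrap_or(rest);
            if payload == "[DONE]" {
                SseLine::Done
            } else {
                SseLine::Data(payload)
            }
        }
        None => SseLine::Ignore,
    }
}
```

Keeping the sentinel as its own variant means the reader loop can stop cleanly on Done instead of trying to parse [DONE] as JSON.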

Aleph's backend (Rust) opens that HTTP connection and processes the stream line by line. Every time a token arrives, it emits a Tauri event to the frontend:

// Rust — each token fires an emit
app_handle.emit("chat://token", TokenPayload { session_id, token });

When the stream ends (or if there's an error), it emits the final event:

app_handle.emit("chat://done", DonePayload { session_id, full_text, error });

The full flow

1. ChatView calls api.sendChat(). The frontend invokes the Tauri command with the session messages.
2. Rust opens a POST to llama-server with stream: true. The connection stays open until the model finishes.
3. For each SSE line, Rust emits chat://token. The event reaches the frontend nearly in real time, and Svelte appends the token to the in-progress message.
4. At the end, Rust emits chat://done with the full text and/or an error if something failed. The frontend closes the "generating" state.
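The backend half of this flow can be sketched without Tauri at all. In the sketch below, which is an assumption-laden illustration rather than Aleph's real code, the tokens iterator stands in for the decoded SSE stream, the emit closure stands in for app_handle.emit, and the ChatEvent enum, the forward_stream name, and the u64 session id are all hypothetical.

```rust
/// Events sent to the frontend (mirrors chat://token and chat://done).
#[derive(Debug, PartialEq)]
enum ChatEvent {
    Token { session_id: u64, token: String },
    Done { session_id: u64, full_text: String, error: Option<String> },
}

/// Forward decoded tokens to `emit` one by one, then report the final state.
/// `tokens` stands in for the SSE stream; `emit` stands in for the Tauri emit call.
fn forward_stream<I, F>(session_id: u64, tokens: I, mut emit: F)
where
    I: IntoIterator<Item = Result<String, String>>,
    F: FnMut(ChatEvent),
{
    let mut full_text = String::new();
    let mut error = None;
    for item in tokens {
        match item {
            Ok(token) => {
                full_text.push_str(&token);
                emit(ChatEvent::Token { session_id, token });
            }
            Err(e) => {
                // Stop at the first failure but keep what was already generated.
                error = Some(e);
                break;
            }
        }
    }
    // Exactly one final event, whether the stream succeeded or failed.
    emit(ChatEvent::Done { session_id, full_text, error });
}
```

The useful property is that the frontend always receives exactly one done event, so it never gets stuck in the "generating" state even when the stream fails halfway.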

Cancellation

One thing we didn't want to ignore: if the model is generating and the user clicks "Stop," the response should cut off immediately, not in 30 seconds when it finishes. For this we use tokio_util::sync::CancellationToken.

Each chat session has its own cancellation token in the AppState. When the user sends the stop signal, we cancel that session's token. The Rust task reading the llama-server stream races each read against the token with tokio::select!, so as soon as the token is cancelled it stops reading and closes the connection.

Cancelling mid-generation shows a "generation cancelled" message in the UI. The partial text is kept. Nothing is lost.
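The shape of this mechanism can be shown with the standard library alone. The sketch below deliberately swaps in a shared AtomicBool for tokio_util's CancellationToken so that it runs without any crates; CancelRegistry, read_until_cancelled, and the u64 session id are hypothetical names, not Aleph's. One real difference to keep in mind: this version only checks the flag between tokens, whereas tokio::select! can also interrupt a read that is still waiting.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical per-session registry, standing in for the tokens kept in AppState.
#[derive(Default)]
struct CancelRegistry {
    flags: Mutex<HashMap<u64, Arc<AtomicBool>>>,
}

impl CancelRegistry {
    /// Create a fresh flag for a session that is about to start generating.
    fn register(&self, session_id: u64) -> Arc<AtomicBool> {
        let flag = Arc::new(AtomicBool::new(false));
        self.flags.lock().unwrap().insert(session_id, Arc::clone(&flag));
        flag
    }

    /// Called when the user clicks "Stop"; returns false for unknown sessions.
    fn cancel(&self, session_id: u64) -> bool {
        let flags = self.flags.lock().unwrap();
        if let Some(flag) = flags.get(&session_id) {
            flag.store(true, Ordering::SeqCst);
            true
        } else {
            false
        }
    }
}

/// Read tokens until the stream ends or the flag is set.
/// Returns the text collected so far and whether it was cut short.
fn read_until_cancelled<I>(tokens: I, cancelled: &AtomicBool) -> (String, bool)
where
    I: IntoIterator<Item = String>,
{
    let mut text = String::new();
    for token in tokens {
        if cancelled.load(Ordering::SeqCst) {
            // Keep the partial text instead of discarding it.
            return (text, true);
        }
        text.push_str(&token);
    }
    (text, false)
}
```

Returning the partial text together with a "cut short" marker is what lets the UI show the cancellation notice while keeping everything generated up to that point.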

Why not WebSockets?

We could have used WebSockets or polling between the backend and the frontend. Tauri events are the most natural fit in this context: the backend already runs in Rust inside the Tauri process, and events are efficient and bidirectional without the complexity of maintaining a socket. SSE from llama-server plus Tauri events turned out to be the cleanest combination.

The result: text appears as it's generated, with no artificial delay, and if you want to stop, it actually stops.