Real streaming: token by token
Waiting for the model to finish the full response before showing anything makes an app feel slow even when it isn't. Here's how we implemented true streaming in Aleph.
There's something psychologically important about watching text appear as the model "thinks." It's not just cosmetic: it completely changes the perceived speed. A model that takes 8 seconds to generate 200 tokens "all at once" feels slow. The same model showing those tokens one by one feels snappy.
Cloud services learned this years ago. We implemented it from day one.
How it works under the hood
llama-server exposes a POST /v1/chat/completions endpoint that supports stream: true. When streaming is enabled, instead of responding with one complete JSON body at the end, it returns a sequence of Server-Sent Events (SSE). Each event carries a delta with the newly generated token:
data: {"choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":","},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":" how"},"finish_reason":null}]}
...
data: [DONE]
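To make the format concrete, here is a minimal, std-only sketch of classifying one line of that stream. It is not Aleph's actual code: `SseLine`, `parse_sse_line`, and `extract_content` are hypothetical names, and the hand-rolled string extraction only handles compact JSON like the sample above. A real implementation would deserialize each payload with a proper JSON parser.

```rust
/// What one line of the SSE stream means to the reader loop (hypothetical type).
#[derive(Debug, PartialEq)]
enum SseLine {
    /// A `data:` line whose `delta.content` held a token.
    Token(String),
    /// The `data: [DONE]` terminator.
    Done,
    /// Blank lines, non-data lines, or chunks without string content.
    Ignored,
}

/// Classify a single line read from the stream.
fn parse_sse_line(line: &str) -> SseLine {
    let Some(payload) = line.strip_prefix("data:") else {
        return SseLine::Ignored;
    };
    let payload = payload.trim();
    if payload == "[DONE]" {
        return SseLine::Done;
    }
    match extract_content(payload) {
        Some(token) => SseLine::Token(token),
        None => SseLine::Ignored,
    }
}

/// Naive extraction of the first `"content":"..."` string value.
/// Assumption: compact JSON with no space after the colon.
/// Only the escapes \" \\ \/ \n \r \t are decoded; anything else
/// (such as \uXXXX) makes the function give up and return None.
fn extract_content(json: &str) -> Option<String> {
    let key = "\"content\":\"";
    let start = json.find(key)? + key.len();
    let mut out = String::new();
    let mut chars = json[start..].chars();
    while let Some(c) = chars.next() {
        match c {
            '"' => return Some(out), // closing quote: value complete
            '\\' => match chars.next()? {
                '"' => out.push('"'),
                '\\' => out.push('\\'),
                '/' => out.push('/'),
                'n' => out.push('\n'),
                'r' => out.push('\r'),
                't' => out.push('\t'),
                _ => return None, // unsupported escape
            },
            _ => out.push(c),
        }
    }
    None // string was never terminated
}
```

Fed the sample lines above, `parse_sse_line` yields `Token("Hello")`, `Token(",")`, `Token(" how")`, and finally `Done`.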
Aleph's backend (Rust) opens that HTTP connection and processes the stream line by line. Every time a token arrives, it emits a Tauri event to the frontend:
// Rust — each token fires an emit
app_handle.emit("chat://token", TokenPayload { session_id, token });
When the stream ends (or if there's an error), it emits the final event:
app_handle.emit("chat://done", DonePayload { session_id, full_text, error });
The full flow
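Putting the pieces together: the backend reads decoded items from the llama-server stream, emits one chat://token event per token while accumulating the text, and always finishes with a single chat://done event carrying the full text and any error. The sketch below models that loop with the standard library only. The payload field names come from the snippets above, but their types, the `ChatEvent` and `StreamItem` enums, and the `emit` closure (a stand-in for `app_handle.emit`) are assumptions made for illustration.

```rust
/// Payload of the per-token event. Field types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct TokenPayload {
    session_id: String,
    token: String,
}

/// Payload of the final event. Field types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct DonePayload {
    session_id: String,
    full_text: String,
    error: Option<String>,
}

/// Hypothetical stand-in for the two Tauri events.
#[derive(Debug, Clone, PartialEq)]
enum ChatEvent {
    Token(TokenPayload), // corresponds to "chat://token"
    Done(DonePayload),   // corresponds to "chat://done"
}

/// One decoded item from the llama-server stream (hypothetical type).
enum StreamItem {
    Token(String),
    Done,
    Error(String),
}

/// Forward every token as it arrives, then emit exactly one Done event,
/// whether the stream finished normally, failed, or simply ran out.
fn forward_stream<I, F>(session_id: &str, items: I, mut emit: F)
where
    I: IntoIterator<Item = StreamItem>,
    F: FnMut(ChatEvent),
{
    let mut full_text = String::new();
    let mut error = None;
    for item in items {
        match item {
            StreamItem::Token(token) => {
                full_text.push_str(&token);
                emit(ChatEvent::Token(TokenPayload {
                    session_id: session_id.to_string(),
                    token,
                }));
            }
            StreamItem::Done => break,
            StreamItem::Error(message) => {
                error = Some(message);
                break;
            }
        }
    }
    emit(ChatEvent::Done(DonePayload {
        session_id: session_id.to_string(),
        full_text,
        error,
    }));
}
```

Because the Done event is emitted outside the loop, the frontend gets a terminal event on every path, and whatever text arrived before an error is still delivered in `full_text`.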
Cancellation
One thing we didn't want to ignore: if the model is generating and the user clicks "Stop," the response should cut off immediately, not 30 seconds later when generation finishes. For this we use tokio_util::sync::CancellationToken.
Each chat session has its own cancellation token in the AppState. When the user sends the stop signal, we cancel the token. The Rust task reading the llama-server stream detects this (via tokio::select!) and closes the connection.
Cancelling mid-generation shows a "generation cancelled" message in the UI. The partial text is kept. Nothing is lost.
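The shape of this can be sketched without any async runtime. The example below deliberately swaps in a plain `AtomicBool` per session for `CancellationToken`, and a check between tokens for `tokio::select!`, so it only illustrates the bookkeeping: one flag per session in the app state, a stop call that flips it, and a reader that returns the partial text. Unlike the real `select!`-based version, it cannot interrupt a read that is already waiting. All names here (`AppState` fields, `start_session`, `stop`, `read_until_cancelled`, `Outcome`) are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical app state: one cancellation flag per chat session.
#[derive(Default)]
struct AppState {
    cancel_flags: Mutex<HashMap<String, Arc<AtomicBool>>>,
}

impl AppState {
    /// Register a fresh, un-cancelled flag when a generation starts.
    fn start_session(&self, session_id: &str) -> Arc<AtomicBool> {
        let flag = Arc::new(AtomicBool::new(false));
        self.cancel_flags
            .lock()
            .unwrap()
            .insert(session_id.to_string(), Arc::clone(&flag));
        flag
    }

    /// Called when the user clicks "Stop". Returns false for unknown sessions.
    fn stop(&self, session_id: &str) -> bool {
        match self.cancel_flags.lock().unwrap().get(session_id) {
            Some(flag) => {
                flag.store(true, Ordering::SeqCst);
                true
            }
            None => false,
        }
    }
}

/// Result of reading a stream: the text received so far and whether
/// the read ended because of cancellation.
#[derive(Debug, PartialEq)]
struct Outcome {
    text: String,
    cancelled: bool,
}

/// Consume tokens until the stream ends or the flag is set.
/// The flag is checked before each token, so text already received is kept.
fn read_until_cancelled<'a, I, F>(tokens: I, cancel: &AtomicBool, mut on_token: F) -> Outcome
where
    I: IntoIterator<Item = &'a str>,
    F: FnMut(&str),
{
    let mut text = String::new();
    for token in tokens {
        if cancel.load(Ordering::SeqCst) {
            return Outcome { text, cancelled: true };
        }
        text.push_str(token);
        on_token(token);
    }
    Outcome { text, cancelled: false }
}
```

If the flag for a session is set after two tokens of "Hello", ",", " how" have been delivered, the reader returns `Outcome { text: "Hello,", cancelled: true }`: the partial text survives and the remaining tokens are never read.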
Why not WebSockets?
We could have done it with WebSockets or polling, but Tauri events are the most natural option in this context. The backend already lives in Rust inside the Tauri process, and events are efficient and bidirectional without the complexity of maintaining a socket. SSE from llama-server plus Tauri events turned out to be the cleanest combination.
The result: text appears as it's generated, with no artificial delay, and if you want to stop, it actually stops.