10,000 models — which one do I download?

Hugging Face has over 800,000 models. If you filter by GGUF (the format llama.cpp uses, which is what we run) you still have thousands. And if someone just getting started asks "which one should I use for coding?", the honest answer is "it depends", followed by ten paragraphs of context that don't help them decide anything.

We didn't want Aleph to be like that. The goal was for someone who's never heard of quantization to open the app, see which model fits their hardware, and download it. No tutorials.

Topic chips: the first filter layer

The first thing we implemented was a topic chip system. Instead of showing an infinite list of models, the user sees categories: Code, Reasoning, Agent, Legal, Medical, Finance, and a few more. Click "Code" and the models best suited to coding appear.

Under the hood, each topic has two modes:

Rich: the chip runs a live search against the Hugging Face API using topic-specific queries, so the results are always fresh.
Niche: for narrower topics where HF has few high-quality models, our recommended models appear first, followed by search results.

All of this is defined in a catalog.json file that lives at ~/.config/agent-aleph/. The reason it's there instead of hardcoded in the binary: if you want to add a topic or change recommendations, you can edit the JSON without recompiling anything.
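To make the two modes concrete, here is a minimal sketch of how a catalog entry could be parsed and how a chip's results could be ordered. The field names (`mode`, `queries`, `recommended`), the helper names, and the repo ids are all illustrative assumptions, not Aleph's actual schema.

```typescript
// Hypothetical catalog shape; field names are assumptions for illustration.
type TopicMode = "rich" | "niche";

interface Topic {
  id: string;
  label: string;
  mode: TopicMode;
  queries: string[];     // topic-specific search queries
  recommended: string[]; // curated repo ids, shown first for niche topics
}

interface Catalog {
  topics: Topic[];
}

// Parse catalog.json text and reject topics with an unknown mode.
function parseCatalog(json: string): Catalog {
  const raw = JSON.parse(json) as { topics?: unknown };
  if (!Array.isArray(raw.topics)) {
    throw new Error("catalog.json: 'topics' must be an array");
  }
  const topics = raw.topics.map((t: any): Topic => {
    if (t.mode !== "rich" && t.mode !== "niche") {
      throw new Error("catalog.json: topic '" + t.id + "' has an unknown mode");
    }
    return {
      id: String(t.id),
      label: String(t.label),
      mode: t.mode,
      queries: Array.isArray(t.queries) ? t.queries.map(String) : [],
      recommended: Array.isArray(t.recommended) ? t.recommended.map(String) : [],
    };
  });
  return { topics };
}

// Rich topics show search results as-is. Niche topics show curated picks
// first, then any search results that are not already in the curated list.
function modelsForTopic(topic: Topic, searchResults: string[]): string[] {
  if (topic.mode === "rich") return searchResults;
  const extra = searchResults.filter((id) => topic.recommended.indexOf(id) === -1);
  return topic.recommended.concat(extra);
}

// Placeholder catalog; the repo ids below are made up.
const exampleCatalog = `{
  "topics": [
    { "id": "code", "label": "Code", "mode": "rich",
      "queries": ["coder gguf"], "recommended": [] },
    { "id": "legal", "label": "Legal", "mode": "niche",
      "queries": ["legal gguf"],
      "recommended": ["example-org/legal-a-GGUF"] }
  ]
}`;
```

With a shape like this, adding a topic means appending one object to `topics`, which is what makes editing the file by hand practical.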

The hardware badge: 🟢 🟡 🔴

The second key piece was the hardware badge. Before downloading, the user sees an indicator of whether the model will fit in VRAM, whether it'll need to spill to RAM, or simply whether it won't fit at all.

Nobody wanted to download 8GB only to find out the model runs at 0.4 tokens per second because it doesn't fit on the GPU.

The math is straightforward: the backend reads the available VRAM of each GPU (we sum them all, ignoring iGPUs with less than 1GB free because they cause more trouble than they're worth) and the system RAM. The model size comes from HF metadata. With those three numbers (total VRAM, system RAM, model size) we pick a badge:

🟢 Fits in VRAM: will run fast, ideal.
🟡 Spills to RAM: part on GPU, part on CPU. Slower but works.
🔴 Doesn't fit: the model is too large for the hardware. Better to pick a smaller quantization.
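The three-way decision can be sketched as a small pure function. This is an illustration under assumptions: the type and function names are hypothetical, the 1GB cutoff is treated as 1 GiB and applied to every GPU rather than only to iGPUs, and the yellow threshold is taken to be "fits in VRAM plus RAM". A real check would probably also leave headroom for the context window.

```typescript
type Badge = "green" | "yellow" | "red";

interface Gpu {
  name: string;
  freeVramBytes: number;
}

const GIB = 1024 ** 3;
// Assumption: GPUs with less than 1 GiB free are skipped entirely.
const MIN_GPU_FREE_BYTES = 1 * GIB;

// Sum free VRAM across GPUs that pass the minimum-free cutoff.
function usableVram(gpus: Gpu[]): number {
  return gpus
    .filter((g) => g.freeVramBytes >= MIN_GPU_FREE_BYTES)
    .reduce((sum, g) => sum + g.freeVramBytes, 0);
}

// green: fits in VRAM; yellow: fits only with RAM added; red: fits in neither.
function badgeFor(modelBytes: number, gpus: Gpu[], ramBytes: number): Badge {
  const vram = usableVram(gpus);
  if (modelBytes <= vram) return "green";
  if (modelBytes <= vram + ramBytes) return "yellow";
  return "red";
}

// Example machine: one discrete GPU with 8 GiB free, one iGPU with 0.5 GiB free.
const exampleGpus: Gpu[] = [
  { name: "discrete", freeVramBytes: 8 * GIB },
  { name: "integrated", freeVramBytes: 0.5 * GIB },
];
const exampleBadge = badgeFor(5 * GIB, exampleGpus, 16 * GIB); // "green"
```

Keeping the function free of I/O means the same logic can run for every model in a list without touching the hardware again.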

Downloads

Once the user picks a model, the download streams directly from Hugging Face. Progress is emitted as Tauri events (download://progress) that the UI consumes in real time. Downloads are cancellable, and if the user closes the app mid-download, the backend tracks state so no zombie partial files are left behind.
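On the UI side, consuming a progress event mostly means turning a payload into something a progress bar can show. The sketch below assumes a payload shape (`modelId`, `downloadedBytes`, `totalBytes`) that is not documented here; only the event name comes from the text above.

```typescript
// Hypothetical payload of a download://progress event.
interface DownloadProgress {
  modelId: string;
  downloadedBytes: number;
  totalBytes: number | null; // null when the total size is unknown
}

interface ProgressView {
  percent: number | null; // null means "show an indeterminate bar"
  label: string;
}

function formatMiB(bytes: number): string {
  return (bytes / (1024 * 1024)).toFixed(1) + " MiB";
}

// Convert a raw progress payload into display-ready values.
function toProgressView(p: DownloadProgress): ProgressView {
  if (p.totalBytes === null || p.totalBytes <= 0) {
    return { percent: null, label: formatMiB(p.downloadedBytes) };
  }
  // Clamp so a slightly-off total never renders as more than 100%.
  const ratio = Math.min(p.downloadedBytes / p.totalBytes, 1);
  const percent = Math.round(ratio * 100);
  return {
    percent,
    label: formatMiB(p.downloadedBytes) + " / " + formatMiB(p.totalBytes) + " (" + percent + "%)",
  };
}
```

In a Tauri frontend this would typically be wired up with `listen` from `@tauri-apps/api/event`, along the lines of `listen<DownloadProgress>("download://progress", (e) => render(toProgressView(e.payload)))`, where `render` is whatever updates the progress bar.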

Models are stored at ~/.local/share/agent-aleph/models/. Standard XDG, nothing unusual.
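For reference, this is how the two paths mentioned in this post resolve under the XDG Base Directory rules. It is a sketch: whether Aleph honors the `XDG_DATA_HOME` and `XDG_CONFIG_HOME` overrides is an assumption, and the function names are made up for the example.

```typescript
interface Env {
  [key: string]: string | undefined;
}

// Resolve an XDG base directory and append the app folder.
// Per the XDG spec, relative values in these variables are invalid and ignored.
function xdgAppDir(env: Env, home: string, variable: string, fallback: string): string {
  const value = env[variable];
  const base = value && value.charAt(0) === "/" ? value : home + "/" + fallback;
  return base + "/agent-aleph";
}

// Where downloaded models live: $XDG_DATA_HOME or ~/.local/share by default.
function modelsDir(env: Env, home: string): string {
  return xdgAppDir(env, home, "XDG_DATA_HOME", ".local/share") + "/models";
}

// Where the catalog lives: $XDG_CONFIG_HOME or ~/.config by default.
function catalogPath(env: Env, home: string): string {
  return xdgAppDir(env, home, "XDG_CONFIG_HOME", ".config") + "/catalog.json";
}
```

With no overrides set and a home of `/home/ana`, `modelsDir` gives `/home/ana/.local/share/agent-aleph/models` and `catalogPath` gives `/home/ana/.config/agent-aleph/catalog.json`, matching the defaults described in this post.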

What we learned

90% of the work wasn't technical: it was curation. Deciding which models go in each category, naming topics in a way that makes sense to someone who doesn't live on AI Twitter, and working out how aggressive to be with recommendations.

We're still tweaking it. But the structure works, and most importantly, updating the catalog doesn't require shipping a new app version. Just edit the JSON.