Aleph learns to google (without Google)

One of the most obvious problems with a local model is that its knowledge has a cutoff date. Ask it about a library that came out three months ago and it has no idea. Ask for the changelog of a new Rust version and same thing.

The obvious solution: give it internet access. But here's the dilemma: search APIs with keys cost money (Google Search API, Bing, Serper...), and we want Aleph to be 100% free to use. So we started looking for alternatives.

web_search: DuckDuckGo HTML to the rescue

DuckDuckGo has a little-known endpoint: html.duckduckgo.com/html/. Unlike the main page, which is a JavaScript SPA, this endpoint returns static HTML with the search results. No JavaScript, no aggressive tracking, and no API key.

Why not the official DDG API?

DDG has an "Instant Answers" API but it's designed for direct answers like Wikipedia snippets, not for general search results. For what we need (lists of relevant URLs), scraping the HTML gives richer results.

The process is straightforward: make a GET with the query as a parameter, parse the HTML with the scraper crate, extract titles, snippets and URLs. URLs come wrapped in DDG redirects with a uddg parameter, so we decode them to get the real destination URL.

// Results exclude ads with :not(.result--ad)
let sel_result = Selector::parse(".result:not(.result--ad)").unwrap();

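// href is in format //duckduckgo.com/l/?uddg=<encoded-url>
// Extract and decode:
fn extract_uddg(href: &str) -> Option<String> {
    // Protocol-relative hrefs need a scheme before Url::parse accepts them
    let normalized = if href.starts_with("//") {
        format!("https:{}", href)
    } else {
        href.to_string()
    };
    let parsed = url::Url::parse(&normalized).ok()?;
    // query_pairs() already percent-decodes values, so no second decode is needed
    parsed.query_pairs()
        .find(|(k, _)| k == "uddg")
        .map(|(_, v)| v.into_owned())
}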

The result is a list of up to 20 results (configurable) with title, clean URL and snippet. The model receives them and decides which URL it wants to read in detail.
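
The request side is the simplest part. Here is a minimal, std-only sketch of how the search URL could be built. The q parameter name, the build_search_url and percent_encode helpers and the SEARCH_ENDPOINT constant are assumptions for illustration, not Aleph's actual code:

```rust
// Endpoint from the post; the `q` parameter name is an assumption.
const SEARCH_ENDPOINT: &str = "https://html.duckduckgo.com/html/";

/// Percent-encode every byte that is not an RFC 3986 "unreserved" character.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Hypothetical helper: build the GET URL for a search query.
fn build_search_url(query: &str) -> String {
    format!("{}?q={}", SEARCH_ENDPOINT, percent_encode(query.trim()))
}

fn main() {
    println!("{}", build_search_url("llama-server /v1/chat/completions parameters"));
}
```

Encoding the query by hand keeps the sketch dependency-free; in a real project the url crate can do the same job.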

web_fetch: reading real pages

Searching is only the first step. The model also needs to read page content. That's what web_fetch does: makes a GET to the URL, detects if it's HTML, and if so strips it: removes <head>, <script>, <style>, <nav> and <footer>, and converts the remaining HTML to plain text.

The model receives that plain text. No noise, no tags, just content.
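
To make the stripping step concrete, here is a hand-rolled, std-only sketch of the idea. It is not Aleph's implementation (a real one would more likely lean on an HTML parser), and the names html_to_text, remove_blocks and strip_tags are hypothetical. It also does not decode entities like &amp;:

```rust
// Tags whose whole content is dropped (list taken from the post).
const NOISY_TAGS: &[&str] = &["head", "script", "style", "nav", "footer"];

/// Remove every `<tag ...> ... </tag>` block, matching case-insensitively.
fn remove_blocks(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(rel) = lower[pos..].find(&open) {
        let start = pos + rel;
        let name_end = start + open.len();
        // Make sure "<head" is not just the prefix of "<header".
        let next = lower.as_bytes().get(name_end).copied();
        let is_tag = matches!(
            next,
            Some(b'>') | Some(b' ') | Some(b'\t') | Some(b'\n') | Some(b'\r') | Some(b'/')
        );
        if !is_tag {
            out.push_str(&html[pos..name_end]);
            pos = name_end;
            continue;
        }
        out.push_str(&html[pos..start]);
        match lower[start..].find(&close) {
            Some(end_rel) => pos = start + end_rel + close.len(),
            None => {
                // Unclosed block: drop everything after the opening tag.
                pos = html.len();
                break;
            }
        }
    }
    out.push_str(&html[pos..]);
    out
}

/// Drop all remaining tags, leaving a space where each tag ended.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Strip noisy blocks, then tags, then collapse runs of whitespace.
fn html_to_text(html: &str) -> String {
    let mut cleaned = html.to_string();
    for tag in NOISY_TAGS {
        cleaned = remove_blocks(&cleaned, tag);
    }
    strip_tags(&cleaned)
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let page = "<html><head><title>T</title></head><body><nav>Menu</nav>\
                <p>Hello <b>world</b></p><script>var x = 1;</script></body></html>";
    println!("{}", html_to_text(page)); // prints "Hello world"
}
```

String scanning like this is fragile on malformed markup, which is one reason a parser-based approach is usually the safer choice.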

The GitHub problem

The first real test was asking Aleph to search for llama-server documentation. It found the README on GitHub, tried to read it... and got "Skip to content" followed by almost nothing. The problem: GitHub renders its content with JavaScript. A URL like github.com/ggerganov/llama.cpp/blob/master/README.md returns an empty SPA shell if you don't execute JS.

The fix was simple: automatically rewrite those URLs before fetching:

// github.com/{owner}/{repo}/blob/{ref}/{path}
// → raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}

fn rewrite_github_url(url: &str) -> (String, bool) {
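    let rest = match url.strip_prefix("https://github.com/") {
        Some(rest) => rest,
        None => return (url.to_string(), false),
    };
    // rest = {owner}/{repo}/blob/{ref}/{path}; the 4th part keeps "{ref}/{path}" whole
    let parts: Vec<&str> = rest.splitn(4, '/').collect();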
    if parts.len() == 4 && parts[2] == "blob" {
        let raw = format!("https://raw.githubusercontent.com/{}/{}/{}",
            parts[0], parts[1], parts[3]);
        return (raw, true);
    }
    (url.to_string(), false)
}

Now when the model asks to read a GitHub file, it gets raw Markdown. Perfect.

Detecting JS-rendered pages

GitHub isn't the only case. Many modern pages are pure SPAs: the HTML they return without JS is basically a skeleton with <div id="root"></div>. If the model receives "3 chars of useful content," it'll loop trying to fetch again or start making things up.

The fix was adding a minimum threshold: if extracted text has fewer than 200 characters, instead of returning that empty text, we return an explanatory message with a suggestion of what to do:

If it's a GitHub URL (but not a file/blob): suggest using the bash tool with the gh CLI.
For any other SPA: suggest looking for a cached version or an alternative API endpoint.
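
The threshold check above can be sketched roughly like this. The 200-character limit and the two suggestions come from the post; the check_extracted name, the MIN_USEFUL_CHARS constant, the Result-based return type and the exact message wording are assumptions for illustration:

```rust
// Threshold from the post; the constant name is hypothetical.
const MIN_USEFUL_CHARS: usize = 200;

/// Return the text if it looks useful, otherwise an explanatory message
/// that tells the model what to try next.
fn check_extracted(url: &str, text: &str) -> Result<String, String> {
    let trimmed = text.trim();
    let count = trimmed.chars().count();
    if count >= MIN_USEFUL_CHARS {
        return Ok(trimmed.to_string());
    }
    // GitHub pages that are not file/blob views get a different suggestion.
    let hint = if url.starts_with("https://github.com/") && !url.contains("/blob/") {
        "This GitHub page needs JavaScript. Try the bash tool with the gh CLI instead."
    } else {
        "This page is probably rendered with JavaScript. \
         Look for a cached version or an alternative API endpoint."
    };
    Err(format!(
        "Only {} characters of text could be extracted from {}. {}",
        count, url, hint
    ))
}

fn main() {
    match check_extracted("https://github.com/ggerganov/llama.cpp", "Skip to content") {
        Ok(text) => println!("{}", text),
        Err(message) => println!("{}", message),
    }
}
```

Whether the message travels back as an error or as ordinary tool output is a design choice; what matters is that the model gets an actionable explanation instead of near-empty text.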

A good agent doesn't just execute tools — it also knows when a tool won't work and says why.

Testing in practice

With the URL rewrite in place, the end-to-end test was asking Aleph: "Search for documentation on llama-server's /v1/chat/completions endpoint and summarize the most important parameters." What it did:

1. Called web_search with "llama-server /v1/chat/completions parameters."
2. Chose the GitHub result pointing to the repository README.
3. Called web_fetch with the GitHub URL. The auto-rewrite converted it to raw.githubusercontent.com.
4. Received 8,400 characters of Markdown with full documentation.
5. Summarized the most important parameters in clear text.

It worked without any human intervention. That's exactly what we want.

Honest limitations

DuckDuckGo HTML sometimes returns no results if it detects too many requests in a row. It's not aggressive rate limiting, but it does happen. We're evaluating adding a fallback, but for now rephrasing the query works.

For complex JS-rendered pages (modern documentation apps, portals with login), fetch won't work and you need alternatives. We say this explicitly in the error message so the model knows what to do.

It's a pretty good 80/20: most useful technical documentation lives on static pages, GitHub, or Stack Overflow. With these two tools, Aleph can read all of them.