I revamped this blog’s RSS feed. The downside is that RSS readers will probably show the last few posts again. Sorry about that. On the upside, the new feed contains full post contents and is standards compliant. This post lists the changes I made, mostly to the axum webserver setup.
I normally write in a “let’s build something together” voice, but this post is more of a “let’s all laugh at Alex for failing to set the Content-Type” kind of situation, so I’ll use the first person singular.
For the most part, I followed the recommendations from Kevin Cox’s excellent “RSS Feed Best Practises”.
Content-Type header
Let’s start small: what was the content type of the atom feed?
$ curl -v https://scvalex.net/atom.xml 2>&1 > /dev/null | grep -i content-type
< content-type: text/xml
That’s not good. It’s supposed to be application/atom+xml (or application/rss+xml before the switch to Atom). The value was being inferred by my server’s ServeDir middleware from the xml extension of the file. Since there was no way to configure ServeDir, I wrote a new axum handler just for the /atom.xml route:
...
let app = Router::new()
    .route("/atom.xml", get(atom_handler))
...

async fn atom_handler(
    State(state): State<SharedState>,
) -> Result<Response<BoxBody>, (StatusCode, String)> {
    use http::header::{HeaderValue, CONTENT_TYPE};
    get_static_file(state.dist_dir.join("atom.xml"))
        .await
        .map(|mut res| {
            // Override the text/xml that ServeFile infers from the extension.
            res.headers_mut().insert(
                CONTENT_TYPE,
                HeaderValue::from_static("application/atom+xml"),
            );
            res
        })
}
async fn get_static_file<P: AsRef<std::path::Path>>(
    file: P,
) -> Result<Response<BoxBody>, (StatusCode, String)> {
    use axum::body::boxed;
    use tower::util::ServiceExt;
    ServeFile::new(file.as_ref())
        .oneshot(Request::new(()))
        .await
        .map(|res| res.map(boxed))
        .map_err(|err| internal_server_error(err, "static file not found"))
}
This is annoyingly verbose, and it might have been easier to read the file directly instead of offloading to ServeFile, but I didn’t know if the latter did anything special and I had bigger issues to tackle.
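For reference, the direct-read version might have looked something like this (an untested sketch, reusing the internal_server_error helper from above):

use axum::body::{boxed, Full};

async fn atom_handler_direct(
    State(state): State<SharedState>,
) -> Result<Response<BoxBody>, (StatusCode, String)> {
    // Read the whole feed into memory; it's a small file.
    let bytes = tokio::fs::read(state.dist_dir.join("atom.xml"))
        .await
        .map_err(|err| internal_server_error(err, "static file not found"))?;
    Response::builder()
        .header(http::header::CONTENT_TYPE, "application/atom+xml")
        .body(boxed(Full::from(bytes)))
        .map_err(|err| internal_server_error(err, "failed to build response"))
}

ServeFile may well handle details like conditional or range requests that this version would drop, which is exactly the kind of “something special” I didn’t want to lose.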
Let’s see that it worked:
$ curl -v http://localhost:3000/atom.xml 2>&1 > /dev/null | grep -i content-type:
< content-type: application/atom+xml
Good. The even better news is that the code only gets less verbose from here.
CORS
To quote MDN:
Cross-Origin Resource Sharing (CORS) is an HTTP-header based mechanism that allows a server to indicate any origins (domain, scheme, or port) other than its own from which a browser should permit loading resources.
I think CORS is essentially a security scheme to prevent websites from making requests from Javascript to other websites using users’ credentials. For instance, when you’re visiting example.com, it would be bad if the site could get your browser to run Javascript that makes requests to the AWS API with your own logged-in cookie. This is prevented by the AWS API serving its pages without the CORS header authorizing example.com to make requests against it, so users’ browsers will throw an exception if this is attempted. Honestly, it seems like it would have been better to not normalize allowing random sites to run random code on users’ machines, but I guess that ship has sailed.
The CORS policy defaults to “not allowed”, so if no Access-Control-* headers are sent, XMLHttpRequest and fetch() calls will fail. What headers did my site deliver for the atom feed?
$ curl -v https://scvalex.net/atom.xml 2>&1 > /dev/null | grep -i 'access-control'
$
It delivered no headers. This meant that any browser-based RSS reader was going to fail to fetch the feed. I fixed this by annotating the /atom.xml route with the CorsLayer middleware:
let app = Router::new()
    .route(
        "/atom.xml",
        get(atom_handler).layer({
            use tower_http::cors::{Any, CorsLayer};
            CorsLayer::new()
                .allow_methods([http::Method::GET])
                .allow_origin(Any)
        }),
    )
And now the headers are present:
$ curl -v http://localhost:3000/atom.xml 2>&1 > /dev/null | grep -i 'access-control'
< access-control-allow-origin: *
< vary: access-control-request-method
< vary: access-control-request-headers
Switch from RSS2 to Atom
The choice to switch from RSS2 to Atom was mostly an aesthetic one on my part. The RSS spec has always felt a bit off to me. To give one example, dates in RSS must be formatted like “Sat, 07 Sep 2002 00:00:01 GMT” which just seems weird for a machine-readable file. For comparison, dates in Atom are formatted like “2022-12-18T00:00:00Z” which is exactly what I would expect.
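For what it’s worth, both formats are a single method call away in Rust, e.g. with the chrono crate (not necessarily what my generator uses):

use chrono::{TimeZone, Utc};

fn main() {
    let t = Utc.with_ymd_and_hms(2022, 12, 18, 0, 0, 0).unwrap();
    // RSS2 wants RFC 822-style dates (chrono prints a numeric zone
    // where the RSS examples show "GMT"):
    println!("{}", t.to_rfc2822()); // Sun, 18 Dec 2022 00:00:00 +0000
    // Atom wants RFC 3339 dates:
    println!("{}", t.to_rfc3339()); // 2022-12-18T00:00:00+00:00
}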
More importantly, RSS has only one element for post contents, whereas Atom has separate “summary” and “content” elements. This came in handy since I was planning to include full post contents in the feed.
Other than that, the two formats are the same for the purposes of my blog and the conversion was little more than renaming some elements.
Full contents in feed
Before now, I’d include only the first paragraph of every post in the RSS feed. Conceptually, it had the same information as the All Posts page.
My original reasoning from ten years ago was that I wanted people to click through to the website so that I could see them in analytics—I valued seeing a number go up more than people actually reading my posts. These days, I’d much rather engage with readers regardless of how they see my content. Also, I suspect a fair number of them wouldn’t show up in analytics anyway because they block third party Javascript—I certainly do.
The more recent reason for not including the full post text in the RSS feed was a limitation of my site generator. This is a Rust program which reads a bunch of input markdown files and Liquid templates, combines them somehow, and generates a directory of static files. The generator can run in batch mode, where it builds everything from scratch, or in polling mode, where it watches input files for updates and only rebuilds what has changed.
The generator has to handle three kinds of input text files:
- Liquid templates that are loaded into liquid-rust. Some of these are small “macros” of HTML to be included in pages, and some are templates of full HTML meant to wrap content. For example, I have an img_float macro to generate an <img> with a title and an alt text, and I have a blog_post template which has the <html> and <head> elements, and whose <body> is just the contents of a variable.
- Non-markdown files that are run through the templating engine once. These may also contain content from other pages (e.g. the All Posts page includes the first paragraph of every post).
- Markdown files. These are first passed through the templating engine to resolve small macros, then passed through the Markdown renderer to become HTML, then passed through the templating engine again to be included in a layout template.
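In code, the three kinds might be modelled like this (a hypothetical sketch, not my generator’s actual types):

use std::path::PathBuf;

/// Hypothetical model of the generator's inputs; all names are invented.
enum InputFile {
    /// Loaded into the liquid-rust template store (macros and layouts).
    Template { name: String, source: String },
    /// Run through the templating engine once (e.g. the All Posts page).
    NonMarkdown { path: PathBuf, source: String },
    /// Liquid pass, then Markdown rendering, then a second Liquid pass.
    Markdown { path: PathBuf, source: String },
}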
The generator needs to know the dependencies between files so that it can decide in what order to process them in batch mode. And in polling mode, it needs to determine which processing steps need to be re-run when a file is updated. Graphically, the dependencies look like this:
The graph looks clean, but the dependencies are hard to pin down. For instance, the only way to tell that “post 69” depends on the img_float template is to parse the post’s text. That’s annoying in itself, but it might not be enough because img_float may also include other templates. The only way to fully resolve this is to pass “post 69” through the Liquid templating engine, but if we do that, then there’s no point in figuring out the dependencies between posts and templates because the work has already been done. In practice, we just have to assume a dependency between “any page” and “all the templates”. So, the dependencies really look more like this:
Now that we know that everything depends on the opaque store of templates, the next step is rendering markdown. This is easy dependency-wise because each markdown file generates one item of HTML page content.
The next problem is determining which page contents are included by which final pages. For instance, posts/69/index.html includes only “HTML of post 69”, but posts/index.html includes the first HTML paragraph of all posts, and /atom.xml includes the full HTML contents of the five most recent posts (before the revamp, only their first paragraphs). We have to parse a page’s contents to determine what it includes, but the inclusion is done through Liquid templates, and Liquid templates may include other Liquid templates, and it’s again the situation where the only way to resolve the dependencies is to do all the templating work. So really, the graph looks like this:
Empirically, I’ve used a lot of static site generators in the last ten years, and every single one failed to handle dependencies correctly. Some didn’t even try, and some tried but failed to update pages in some cases. Given this, I wasn’t enthusiastic about trying to write my own dependency inference and tracking.
That said, if you squint hard enough at the above diagram, it begins to look like a staged process: first you load all the template files into the template store, then you read all Markdown files and pass them through Liquid, then you convert all the Markdown to HTML, and finally do the second Liquid pass to generate all the static files. The stages are sequential, but the work inside each stage can be parallelized simply in batch mode. And in polling mode, the common case is for one post Markdown to change, so the generator can skip re-processing the Liquid templates and doing the expensive Markdown→HTML conversion for any other Markdown files. Then, the final stage of creating the static files is run fully. This is wasteful, but it’s pretty fast when written in Rust—updates complete in less than one second for my blog. This is what the generator does now and it works.
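Sketched in code, the staged pipeline looks something like this (shape only; the names are invented and the stage functions stubbed out):

/// Invented types standing in for the generator's real ones.
struct TemplateStore;
struct Html(String);

fn load_templates(_sources: &[String]) -> TemplateStore { TemplateStore }
fn liquid_pass(_store: &TemplateStore, src: &str) -> String { src.to_string() }
fn render_markdown(src: &str) -> Html { Html(src.to_string()) }
fn emit_page(_store: &TemplateStore, _content: &[Html], _page: &str) {}

fn build_site(templates: &[String], markdown: &[String], pages: &[String]) {
    // Stage 1: load every Liquid template; everything depends on this store.
    let store = load_templates(templates);
    // Stage 2: first Liquid pass over each Markdown file to resolve macros.
    // Files are independent here, so batch mode can parallelize this, and
    // polling mode can redo just the one file that changed.
    let expanded: Vec<String> =
        markdown.iter().map(|m| liquid_pass(&store, m)).collect();
    // Stage 3: Markdown -> HTML, one item of page content per file.
    let content: Vec<Html> =
        expanded.iter().map(|m| render_markdown(m)).collect();
    // Stage 4: second Liquid pass producing the final static files; any page
    // may include any post's HTML, so this stage always re-runs in full.
    for page in pages {
        emit_page(&store, &content, page);
    }
}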
Originally though, I did try to do dependency inference, failed, then realized that I could cheat if I only supported including the first paragraph of posts in other pages. The trick was that I never used templates in the first paragraph, so I could pass the source through just the Markdown renderer to get the final output. These first paragraphs were then stored separately from the posts and could be included in other pages. In practice, this hack only needed to work for All Posts and the RSS feed, so it was good enough.
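The hack is simple enough to sketch. Assuming a pulldown-cmark-style renderer (not necessarily what my generator uses) and the convention that the first paragraph is the text before the first blank line:

use pulldown_cmark::{html, Parser};

/// Render only a post's first paragraph, skipping Liquid entirely; this
/// relies on the convention that first paragraphs never use macros.
fn first_paragraph_html(markdown: &str) -> String {
    let first_paragraph = markdown.split("\n\n").next().unwrap_or("");
    let mut out = String::new();
    html::push_html(&mut out, Parser::new(first_paragraph));
    out
}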
Absolute URLs
With full posts in the Atom feed, the only remaining problem was relative URLs. Pages on this site use internal links like the following:
<a href="/posts/68/">...</a>
<a href="#conclusion">...</a>
<img src="/r/68-dive-container-screenshot-small.png">
I like these relative links because they look clean, but they’re unlikely to work in RSS readers. Since I didn’t want to change the standalone pages, I instead wrote a small Liquid Filter to make the URLs absolute:
fn absolutize_urls(
    input: &dyn ValueView,
    site_base_url: &str,
    page_rel_url: &str,
) -> OrError<Value> {
    use lol_html::{element, html_content::Element, rewrite_str, Settings};
    if input.is_nil() {
        return Ok(Value::Nil);
    }
    let html = input.to_kstr();
    let absolutize_f = |attr| {
        move |el: &mut Element| {
            if let Some(href) = el.get_attribute(attr) {
                if href.starts_with('/') {
                    el.set_attribute(attr, &format!("{site_base_url}{href}"))
                        .unwrap();
                } else if href.starts_with('#') {
                    el.set_attribute(
                        attr,
                        &format!("{site_base_url}{page_rel_url}{href}"),
                    )
                    .unwrap();
                }
            }
            Ok(())
        }
    };
    let html = rewrite_str(
        &html,
        Settings {
            element_content_handlers: vec![
                element!("a[href]", absolutize_f("href")),
                element!("img[src]", absolutize_f("src")),
            ],
            ..Settings::default()
        },
    )?;
    Ok(Value::scalar(html))
}
This uses the lol_html crate to find <a> and <img> elements: for URLs that begin with a /, it prepends the site’s base URL; for URLs that start with a #, it prepends both the site’s base URL and the page’s relative URL.
I then called the filter in the atom.xml template like so:
<summary type="html">
<![CDATA[{{post.description | absolutize_urls: "https://scvalex.net", rel_url | unclassify}}]]>
</summary>
<content type="html">
<![CDATA[{{post.content_html | absolutize_urls: "https://scvalex.net", rel_url | unclassify}}]]>
</content>
There’s also a second filter in there, unclassify, to remove the class attributes from elements, since CSS in separate files doesn’t get loaded by RSS readers.
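unclassify isn’t shown here, but it amounts to more of the same; a rough sketch using the same lol_html approach, as a plain function with the Liquid filter plumbing omitted:

use lol_html::{element, rewrite_str, Settings};

/// Strip the class attribute from every element in the given HTML.
fn unclassify(html: &str) -> Result<String, lol_html::errors::RewritingError> {
    rewrite_str(
        html,
        Settings {
            element_content_handlers: vec![element!("*[class]", |el| {
                el.remove_attribute("class");
                Ok(())
            })],
            ..Settings::default()
        },
    )
}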
Looking back
Writing an app to generate my site has been a mixed bag, but mostly filled with goodness. The bad part is that I occasionally have to write something like an RSS feed generator from scratch. In theory, if I used an off-the-shelf static site generator, this would be done for me. In practice, I’ve run into issues I couldn’t stomach in just about every static site generator that I’ve tried; sometimes it was output instability across upgrades, sometimes it was questionable output, and sometimes it was lack of features and of configurability—there was always something. Ever since I wrote my own generator, whenever I need a new feature, I can just write it. This brings a lot of peace of mind.