Web Syndication with JSON Feeds

Sometime in the last ten years, while you were mourning the loss of Google Reader, we entered the golden age of content syndication. Our social media overlords hit the syndication Comstock Lode. For all their dystopic visioneering, I doubt the feed-accelerationists at Microsoft and Netscape in the mid-'90s foresaw these particular macroeconomics.

Arguably, though, we’re also in a golden age of nondystopic author-managed syndication. Free and nearly-free tools for hosting static sites are only outnumbered by static site generators; new ones are released every week. These tools, like blogosphere-era blogging platforms, can generate feeds as side effects of the routine publishing activity of their users; many do so by default. Even if it’s only a feed of content previews (to draw users onto the publisher’s site), each feed is a contribution to the digital commons.1

Syndicated feeds — for which RSS, Atom, and JSON Feed are specifications — are essentially different from the feeds turning social media users into blue-app-anxiety foie gras. Rather than an algorithmically ranked and collated series of texts from a variety of sources, syndicated feeds just list items from a single source; the categorizing, collating, and display of those items are left up to feeds’ consumers. This has accessibility upsides, makes feeds easy to process programmatically, and provides a neat interface for users waiting on sparse updates (e.g. a blog that only updates once in a blue moon).

Providing a feed might mean content loses ad impressions to feed readers, but feeds generally align the interests of author-publishers who want their work read with the folks doing the reading.

The social challenge: maintaining and checking a feed reader will reward users only if their favorite sources of content provide feeds to be followed. For those sources of content, maintaining a feed (trickiest during site migrations!) is only worthwhile if readers would not otherwise follow them.

An introductory note on feed formats — RSS has the longest history and is the most widely-known, but its XML specification is pretty deeply janky. I would not recommend writing code for working with RSS feeds. Atom, RSS’s successor in the XML feed tradition, is a strict improvement. Most feed readers support both.

JSON Feed is a relative newcomer, introduced by the authors of NetNewsWire and Micro.blog. It has less client support than Atom/RSS, but it’s a sweet format to tinker with. I find JSON easier to read than XML, and my languages of choice these days (Go, Python, TypeScript) have much nicer support for parsing and writing JSON objects than for XML (even with Python’s feedparser).

JSON feeds make syndication so simple that I’ve written a cluster of interrelated tools for working with them. Here’s a narrative breakdown of how they came to be and how I use them together.

Habitually collecting feeds makes one very aware of how many sites don’t (but should!) have them; how many have feeds but don’t prominently list links to them; and how many publications offer central aggregate feeds but not feeds broken down by category or author. I’ve built myself a few tools to help with this.

feedscan is a bash utility for discovering feeds by checking the routes that commonly host them: /feed, /atom.xml, and so on. If I find a sweet blog at lukasschwab.me/blog, I try feedscan https://lukasschwab.me/blog before digging for a link on the site itself. It’s totally disconnected from the other projects discussed here, but it has saved me a lot of frantic searches.
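The idea is simple enough to sketch in a few lines. Here’s a minimal Python rendition of the same probe-common-routes approach; the path list is illustrative, an assumption rather than feedscan’s actual list:

```python
import urllib.request

# Routes that commonly serve feeds. These are illustrative;
# feedscan's actual list may differ.
COMMON_FEED_PATHS = [
    "/feed",
    "/feed.json",
    "/atom.xml",
    "/rss.xml",
    "/index.xml",
]

def candidate_feed_urls(base):
    """Build the candidate feed URLs to probe for a site."""
    return [base.rstrip("/") + path for path in COMMON_FEED_PATHS]

def scan(base, timeout=5):
    """Return the candidates that answer with HTTP 200."""
    found = []
    for url in candidate_feed_urls(base):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    found.append(url)
        except OSError:
            pass  # unreachable, 4xx, or 5xx: probably not a feed
    return found
```

`scan("https://lukasschwab.me/blog")` would then report whichever of those routes respond successfully.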

jsonfeed is a JSON feed parser and constructor package written in Python, the backbone to most of my other JSON feed tools. I wrote a Go equivalent, go-jsonfeed, but haven’t used it much. This very blog is generated with a fork of pandoc-blog, which generates a JSON feed using the jsonfeed package I authored.
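Part of JSON Feed’s appeal is that a valid feed is just a dict away. This sketch hand-rolls a minimal JSON Feed 1.1 document with the stdlib — it deliberately does not use the jsonfeed package’s API, just the shape of the spec:

```python
import json

def build_feed(title, home_page_url, items):
    """Assemble a minimal JSON Feed 1.1 document from item dicts
    with "url" and "title" keys (plus optional "content")."""
    return {
        "version": "https://jsonfeed.org/version/1.1",
        "title": title,
        "home_page_url": home_page_url,
        "items": [
            # Each item needs a stable, unique id; the URL works well.
            {"id": i["url"], "url": i["url"], "title": i["title"],
             "content_text": i.get("content", "")}
            for i in items
        ],
    }

feed = build_feed(
    "Example Blog",
    "https://example.com/blog",
    [{"url": "https://example.com/blog/post-1", "title": "Post 1"}],
)
serialized = json.dumps(feed, indent=2)
```

A parser works the same trick in reverse: `json.loads`, then validate the handful of required fields.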

I discovered a little running project for myself in building and hosting public feeds for sites that don’t offer them. I got my start with arxiv-feeds, which converts Atom to JSON using jsonfeed, but it’s a relatively boring wrapper. Blogs and news sites are more fun because they involve scraping feed items from the sites on demand. I wrote separate Python scraper/generator apps for a couple of sites, then realized those generators shared a certain procedural structure:

  1. Fetch some target page.
  2. Parse the target page HTML (with Beautiful Soup).
  3. Extract feed items from the parsed HTML, an operation unique to each site.
  4. Return the constructed feed.
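The shared structure above amounts to a higher-order function: pass in the site-specific transform, get back a feed generator. This is a sketch of that shape with illustrative names, not jsonfeed-wrapper’s actual API:

```python
import urllib.request

def make_feed_generator(transform):
    """Wrap a site-specific HTML-to-items transform (steps 2-3)
    with the shared fetch and feed-construction steps (1 and 4)."""
    def generate(url, fetch=None):
        # Step 1: fetch the target page. Injectable for testing.
        fetch = fetch or (lambda u: urllib.request.urlopen(u).read().decode())
        html = fetch(url)
        # Steps 2-3: delegated to the site-specific transform
        # (the real generators parse with Beautiful Soup here).
        items = transform(html)
        # Step 4: construct and return the feed.
        return {
            "version": "https://jsonfeed.org/version/1.1",
            "title": url,
            "items": items,
        }
    return generate
```

Each new site then only costs you a `transform` function.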

Steps 1, 2, and 4 were essentially shared, so I factored them out into jsonfeed-wrapper, which takes the site-specific HTML-to-feed transform and wraps it with the standard fetching and feed-serving logic. I originally designed it for use with Google App Engine, but last weekend I rewrote it to expose a Google Cloud Function target.2 Cloud Functions save me a couple bucks a month.
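Supporting both platforms mostly means exposing two thin entry points over the same logic. A hedged sketch of that split — the names are hypothetical, and the shared function is a stub standing in for the real feed generation:

```python
import json

def build_feed_body():
    # Shared logic. In jsonfeed-wrapper this would run the
    # site-specific transform; here it's a stub.
    return json.dumps({"version": "https://jsonfeed.org/version/1.1",
                       "title": "Example", "items": []})

# Cloud Function style: a plain function handling a request.
def handle_request(request):
    return build_feed_body()

# App Engine style: a WSGI callable wrapping the same logic.
def wsgi_app(environ, start_response):
    body = build_feed_body().encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/feed+json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Since neither callable does any work until a request arrives, keeping both definitions around costs essentially nothing.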

I generate and host feeds for It’s Nice That, Bandcamp artists, The Baffler, and Atlas of Places. Generating feeds from scraped HTML is somewhat brittle, but these have been reliable enough for the last few months. Adding a new site to the list takes about an hour of filling out a jsonfeed-wrapper template; shortening that time is the jsonfeed-wrapper project’s north star. Everything deserves a feed.

The next frontier: feed filters with CEL. cel-go works neatly with raw JSON, but the JSON Feed schema is well defined — why not create a CEL environment with types and macros for filtering feeds?

I have a Cloud Function that does nothing but parse Bruce Schneier’s RSS feed, filter out the feed items involving squid (Bruce’s hobby outside of security), and re-host the feed. There’s no reason this filtered feed should be re-hosted on its own when it could instead compile a CEL expression it receives from a client:

!item.tags.exists_one(t, t == "squid")

…and just return items where that expression returns true. User-defined CEL expressions are non-Turing-complete and safe to execute, so I can use them in lieu of parsing and documenting some feed-specific filter API. Different requests, passing different CEL expressions, can fetch differently filtered feeds from the same endpoint.
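For illustration, here’s the same filter written directly in Python. CEL’s `exists_one` macro is true when exactly one element satisfies the predicate, so the expression keeps any item without exactly one "squid" tag:

```python
def keep_item(item):
    """Python analogue of !item.tags.exists_one(t, t == "squid"):
    true unless exactly one of the item's tags is "squid"."""
    squid_tags = sum(1 for t in item.get("tags", []) if t == "squid")
    return squid_tags != 1

def filter_feed(feed, predicate=keep_item):
    """Return a copy of the feed containing only the kept items."""
    return {**feed, "items": [i for i in feed["items"] if predicate(i)]}
```

The CEL version wins because clients can send arbitrary predicates, not just this hard-coded one.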

I will probably never convince anyone to host a feed that behaves this way, but that’s the neat thing about syndication: I can mirror or aggregate other feeds in a feed of my own that provides the interface I want. No need to ask anyone else to implement anything.

  1. Podcasts are a good example of syndication working as expected: a system of RSS/Atom feeds means your podcast is available in whatever listening environment you like, including as a raw audio file. Platform-exclusive podcasts are trendy but somewhat toxic.↩︎

  2. Rewriting jsonfeed-wrapper from something App Engine-specific to something that could equivalently provide a Cloud Function was super simple because of how each of those services registers the code’s entry points.

    Google App Engine’s Python runtime requires a WSGI-compatible object (in this case, a bottle application). Google’s Cloud Function runtime, on the other hand, asks you to define a function for handling requests.

    The platform-ambivalent rewrite of jsonfeed-wrapper exports interfaces for constructing both, and everything but the request/response handling code is shared. Moreover, since instantiating a bottle app doesn’t do the hard work of actually running it, leaving the app definition in the Cloud Function code has a minimal impact on performance.

    The result: each of my jsonfeed-wrapper applications can be deployed either as a Cloud Function or as an App Engine app without changing a single line of code.↩︎