Fetching ActivityPub Feeds

Mastodon is a federated social network that uses the ActivityPub protocol to connect separate communities into one large network. Both Mastodon and the ActivityPub protocol are steadily gaining adoption. Unlike formats such as RSS, which are pull-based, ActivityPub is push-based. This means that rather than your followers downloading your feed regularly to check whether you have shared anything, you send the content you shared to each follower (or, as an optimization, to each server).

While this reduces the delay before your followers receive your updates, it complicates the implementation of readers. Fortunately, it is still possible to pull the feed of an ActivityPub user, just like in the good old days.

In this article, we’re going to start from a handle like leo@niu.moe and end up with a feed of my latest posts.

WebFinger

First of all, let’s look at how the fediverse knows how to find the ActivityPub endpoint for a given handle. The way this is done is quite similar to email.

To find the domain name, let’s split the handle into the username and domain parts.

handle           = 'leo@niu.moe'
username, domain = handle.split('@')
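
Handles are often written with a leading @ (as in @leo@niu.moe), which would break the simple split above. Here is a minimal sketch of a more forgiving parser. The function name split_handle is my own for illustration and is not part of any library.

```python
def split_handle(handle):
    """Split 'user@domain' or '@user@domain' into (username, domain)."""
    cleaned = handle.strip()
    # Accept an optional 'acct:' prefix and an optional leading '@'.
    if cleaned.startswith('acct:'):
        cleaned = cleaned[len('acct:'):]
    cleaned = cleaned.lstrip('@')

    # Split on the last '@' so the domain is always the final part.
    username, sep, domain = cleaned.rpartition('@')
    if not sep or not username or not domain:
        raise ValueError('not a user@domain handle: {!r}'.format(handle))
    return username, domain


username, domain = split_handle('@leo@niu.moe')
```

Raising ValueError early keeps a malformed handle from turning into a confusing request error later on.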

Next, we need to make a request to the domain’s WebFinger endpoint in order to find more data about the account. This is done by performing a GET request to /.well-known/webfinger, with the account passed in the resource query parameter.

import requests

wf_url = 'https://{}/.well-known/webfinger'.format(domain)
wf_par = {'resource': 'acct:{}'.format(handle)}
wf_hdr = {'Accept': 'application/jrd+json'}

# Perform the request
wf_resp = requests.get(wf_url, headers=wf_hdr, params=wf_par).json()
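
To see what the request above actually asks for, it can help to build the full URL by hand. This is only a sketch using the standard library; webfinger_url is a name I made up for this example, and the exact encoding an HTTP client produces may differ slightly.

```python
from urllib.parse import urlencode


def webfinger_url(handle):
    """Return the WebFinger lookup URL for a 'user@domain' handle."""
    domain = handle.split('@')[1]
    # urlencode percent-encodes the ':' and '@' in the resource value.
    query = urlencode({'resource': 'acct:{}'.format(handle)})
    return 'https://{}/.well-known/webfinger?{}'.format(domain, query)


print(webfinger_url('leo@niu.moe'))
# https://niu.moe/.well-known/webfinger?resource=acct%3Aleo%40niu.moe
```

Pasting a URL like this into a browser or curl is a quick way to inspect what a server returns before writing any parsing code.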

Now we have our WebFinger response. We can filter this data in order to find the correct ActivityPub endpoint. We need to do this because WebFinger can return a variety of URLs, not just the ActivityPub one.

Filtering the endpoints

The response we get from WebFinger looks like this.

{
  "subject": "acct:leo@niu.moe",
  "aliases": [
    "https://niu.moe/@leo",
    "https://niu.moe/users/leo"
  ],
  "links": [
    {
      "rel": "http://webfinger.net/rel/profile-page",
      "type": "text/html",
      "href": "https://niu.moe/@leo"
    },
    {
      "rel": "http://schemas.google.com/g/2010#updates-from",
      "type": "application/atom+xml",
      "href": "https://niu.moe/users/leo.atom"
    },
    {
      "rel": "self",
      "type": "application/activity+json",
      "href": "https://niu.moe/users/leo"
    }
  ]
}

Depending on the server, there might be more or fewer entries under the links key. What we are interested in is the URL with the type application/activity+json. Let’s go through the array and find the link URL we’re looking for.

# Use .get() because some links may not have a 'type' key at all
matching = (link['href'] for link in wf_resp['links'] if link.get('type') == 'application/activity+json')
user_url = next(matching, None)
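
The filtering step can also be wrapped in a small function that tolerates odd entries. The sketch below assumes that some links may lack a type or href key, and it also accepts the longer ld+json media type that the ActivityPub specification treats as equivalent. The names find_actor_url and AP_TYPES are mine, and the sample data is a trimmed, partly made-up version of the response shown earlier.

```python
AP_TYPES = (
    'application/activity+json',
    'application/ld+json; profile="https://www.w3.org/ns/activitystreams"',
)


def find_actor_url(jrd):
    """Return the first ActivityPub link in a WebFinger response, or None."""
    for link in jrd.get('links', []):
        # .get() avoids a KeyError on links that have no 'type' or 'href'.
        if link.get('type') in AP_TYPES and link.get('href'):
            return link['href']
    return None


sample = {
    'subject': 'acct:leo@niu.moe',
    'links': [
        # Hypothetical entry without a 'type', to exercise the defensive path.
        {'rel': 'http://ostatus.org/schema/1.0/subscribe',
         'template': 'https://niu.moe/authorize_interaction?uri={uri}'},
        {'rel': 'self', 'type': 'application/activity+json',
         'href': 'https://niu.moe/users/leo'},
    ],
}
print(find_actor_url(sample))
```

Returning None instead of raising lets the caller decide how to report an account that has no ActivityPub endpoint.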

Fetching the feed link

We can fetch the user URL using requests like before. One detail to note here is the Accept header, which we need to set in order to get the data in the format we want.

as_header = {'Accept': 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"'}
user_json = requests.get(user_url, headers=as_header).json()

user_json is a dictionary that contains information about the user. This information includes the username, profile summary, profile picture and other URLs related to the user. One such URL is the “Outbox”, which is basically a feed of whatever that user shares publicly.

This is the final URL we need to follow, and we will have the user feed.

feed_url = user_json['outbox']
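
The other fields mentioned above can be read from the same dictionary. Here is a rough sketch that assumes the usual ActivityPub actor properties (preferredUsername, name, summary, icon and outbox) and treats every one of them as optional. The function name summarize_actor and the example.test actor are hypothetical.

```python
def summarize_actor(actor):
    """Pick a few commonly used fields out of an ActivityPub actor document."""
    icon = actor.get('icon')
    return {
        'username': actor.get('preferredUsername'),
        'display_name': actor.get('name'),
        'summary': actor.get('summary'),
        # 'icon' is usually an Image object with a 'url', but may be a plain URL.
        'avatar': icon.get('url') if isinstance(icon, dict) else icon,
        'outbox': actor.get('outbox'),
    }


# Made-up actor document, for illustration only.
example_actor = {
    'type': 'Person',
    'preferredUsername': 'leo',
    'name': 'Leo',
    'icon': {'type': 'Image', 'url': 'https://example.test/avatar.png'},
    'outbox': 'https://example.test/users/leo/outbox',
}
print(summarize_actor(example_actor))
```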

In ActivityPub, the feed is an OrderedCollection, and those can be paginated. The first page can be empty, it can hold all the content, or each page can hold a single event. This is entirely up to the implementation. In order to handle this transparently, let’s write a generator that fetches the next page only when it is requested.

def parse_feed(url):
    feed = requests.get(url, headers=as_header).json()

    # Items embedded directly in this collection or page
    if 'orderedItems' in feed:
        yield from feed['orderedItems']

    # A collection links to its first page, a page links to the next one
    next_url = None
    if 'first' in feed:
        next_url = feed['first']
    elif 'next' in feed:
        next_url = feed['next']

    if next_url:
        yield from parse_feed(next_url)

Now, for the purposes of a blog post and for writing simple feed parsers, this code works with most servers. But it is not a fully spec-compliant function for grabbing all the pages of content. Technically, first and next can be embedded page objects instead of links, but I haven’t come across that in the wild. It is probably a good idea to write your code to cover more edge cases when dealing with servers on the internet.
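
As one way to cover a few of those edge cases, here is a sketch of an iterative version that accepts first and next either as URLs or as embedded objects, and that stops after a fixed number of pages. It is not a complete implementation of the specification. The names iter_collection and fetch, and the max_pages limit, are my own assumptions; fetch is any function that takes a URL and returns the decoded JSON, which also makes the code easy to try without a network.

```python
def _items(page):
    # Ordered collections use 'orderedItems', plain ones use 'items'.
    return page.get('orderedItems') or page.get('items') or []


def _resolve(pointer, fetch):
    # A pointer is a URL string, a Link object, or an embedded page.
    if isinstance(pointer, str):
        return fetch(pointer)
    if pointer.get('type') == 'Link':
        return fetch(pointer['href'])
    return pointer


def iter_collection(url, fetch, max_pages=50):
    """Yield every item of a collection, following pages lazily."""
    collection = fetch(url)
    yield from _items(collection)

    # A collection points at 'first'; if we were given a page, use 'next'.
    pointer = collection.get('first') or collection.get('next')
    for _ in range(max_pages):
        if not pointer:
            return
        page = _resolve(pointer, fetch)
        yield from _items(page)
        pointer = page.get('next')


# In-memory stand-in for a server, using made-up example.test URLs.
fake_server = {
    'https://example.test/outbox': {
        'type': 'OrderedCollection',
        'first': 'https://example.test/outbox?page=1',
    },
    'https://example.test/outbox?page=1': {
        'type': 'OrderedCollectionPage',
        'orderedItems': ['a', 'b'],
        # Embedded next page instead of a URL.
        'next': {'type': 'OrderedCollectionPage', 'orderedItems': ['c']},
    },
}
print(list(iter_collection('https://example.test/outbox', fake_server.__getitem__)))
```

With requests, the fetch argument could be something like lambda u: requests.get(u, headers=as_header).json(). The page limit is there so that a misbehaving server cannot keep the loop running forever.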

Printing the first 10 posts

The posts in ActivityPub contain HTML and while this is okay for web browsers, we should strip the HTML tags before printing them to the terminal.

Here’s how we can do that with the BeautifulSoup and html modules.

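import html

from bs4 import BeautifulSoup

def clean_html(s):
    # Drop the tags and keep only the text
    text = BeautifulSoup(s, 'html.parser').get_text()
    # Decode any HTML entities that remain
    return html.unescape(text)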

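count = 0
for item in parse_feed(feed_url):
    # Only newly created posts; skip boosts and other activity types
    if not isinstance(item, dict) or item.get('type') != 'Create':
        continue

    # Skip activities whose object is not an embedded post with content
    post = item.get('object')
    if not isinstance(post, dict) or not post.get('content'):
        continue

    print(clean_html(post['content']))
    count += 1
    if count == 10:
        break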
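
If you would rather avoid the extra dependency, the same printing step can be sketched with only the standard library. This version uses html.parser to strip tags, keeps paragraph and line breaks as newlines, and uses itertools.islice to take the first few posts from any iterable of activities. The names TextExtractor, html_to_text, iter_post_texts and first_posts are all mine, and the sample activities are made up.

```python
from html.parser import HTMLParser
from itertools import islice


class TextExtractor(HTMLParser):
    """Collect text content, turning <br> and </p> into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'br':
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.parts.append('\n\n')

    def handle_data(self, data):
        # Entities are already decoded because convert_charrefs is True.
        self.parts.append(data)


def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.parts).strip()


def iter_post_texts(activities):
    """Yield plain text for each Create activity with embedded content."""
    for activity in activities:
        if not isinstance(activity, dict) or activity.get('type') != 'Create':
            continue
        post = activity.get('object')
        # Boosts and similar activities may carry a URL instead of an object.
        if isinstance(post, dict) and post.get('content'):
            yield html_to_text(post['content'])


def first_posts(activities, n=10):
    return list(islice(iter_post_texts(activities), n))


# Made-up activities, for illustration only.
sample_activities = [
    {'type': 'Announce', 'object': 'https://example.test/notes/1'},
    {'type': 'Create', 'object': {'content': '<p>Hello &amp; welcome</p><p>Second<br>line</p>'}},
    {'type': 'Create', 'object': {'content': '<p>Another</p>'}},
]
for text in first_posts(sample_activities, n=2):
    print(text)
```

Because islice stops pulling from the generator once it has enough items, passing it a lazy feed generator means no more pages are fetched than needed.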

Future Work

Mastodon is not the only implementation of ActivityPub, and each implementation can do things in slightly different ways. While writing code to interact with ActivityPub servers, you should always consult the specification document.

Useful Links