Rust Web Scraping

Rust is best known as a performant systems-programming language, but it also works well for data mining and web scraping. Its type system and memory safety make it a great tool for writing reliable data extractors.

On this page, I will write about a few Rust libraries that are useful for this purpose and show some example code.

Getting the page contents (HTTP requests)

Perhaps the most important part of scraping a web page is getting the page HTML. Rust has a few choices for this, but the most popular one (and my favorite) is Hyper.

Hyper is a fast and modern HTTP client (and server) library that leverages Rust’s type system to make zero-cost, safe abstractions over the protocol.

A newer library that surfaced after this page was originally written is Reqwest. Reqwest tries to cover the common use cases with sensible defaults and relatively good performance. It could be seen as the Rust version of Python’s requests. It really simplifies the whole process of making HTTP requests, so unless you need absolute control over every part of your requests, you should give Reqwest a try.
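To give a rough idea of what these libraries take care of, here is a std-only sketch of a bare HTTP/1.1 GET request. The helper names (build_get_request, parse_status_code, fetch_plain_http) are mine and purely illustrative. The sketch assumes plain HTTP on port 80 and ignores TLS, redirects, chunked encoding and compression, which is exactly the work you want Hyper or Reqwest to do for you.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Builds a minimal HTTP/1.1 GET request for `path` on `host`.
fn build_get_request(host: &str, path: &str) -> String {
    format!(
        "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        path, host
    )
}

/// Extracts the numeric status code from a raw response whose first line
/// looks like "HTTP/1.1 200 OK". Returns None if that line is missing or malformed.
fn parse_status_code(response: &str) -> Option<u16> {
    let status_line = response.lines().next()?;
    status_line.split_whitespace().nth(1)?.parse().ok()
}

/// Sends the request over plain TCP (port 80, no TLS) and returns the raw
/// response, headers included. Assumption: the server closes the connection
/// after responding, which the "Connection: close" header asks for.
fn fetch_plain_http(host: &str, path: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect((host, 80))?;
    stream.write_all(build_get_request(host, path).as_bytes())?;
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;
    Ok(String::from_utf8_lossy(&raw).into_owned())
}
```

Calling fetch_plain_http with a host and a path returns the status line, headers and body as one string, and parse_status_code can then pull the status out of it. For anything served over HTTPS, this sketch is not enough.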

Extracting the data (with Regular Expressions)

We all know that using regexes to parse HTML is a Bad Idea™, and that it isn’t even possible in the general case. We also know that at some point everybody will use them for this purpose for one reason or another. Rust has us covered for this use case with its excellent Regex library.

It’s useful when the page layout is known not to change, or when you’re dealing with malformed HTML. Still, it’s a good idea to give actual HTML parsers a go, since they can be much more durable when layouts change.

Extracting the data (with HTML parsers)

Select uses html5ever, a fast HTML parser written in Rust, to make it easy to navigate the page’s tags and extract the data you need. It serves a similar purpose to Java’s jsoup and Python’s BeautifulSoup.
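To illustrate why looking at tags and attributes holds up better than matching one fixed string, here is a small std-only sketch. It is not how Select or html5ever work internally, and the helpers attr_value and links_with_class are hypothetical names of mine. It only handles double-quoted attributes and assumes no '>' appears inside an attribute value. The point is that it finds the link no matter which order the attributes come in.

```rust
/// Returns the value of a double-quoted attribute inside one start tag,
/// for example the "class" value of `<a href="x" class="y">`.
/// Hypothetical helper: no entity decoding, no single quotes, no unquoted values.
fn attr_value<'a>(tag: &'a str, name: &str) -> Option<&'a str> {
    // The leading space keeps "href" from matching inside "data-href".
    let needle = format!(" {}=\"", name);
    let start = tag.find(&needle)? + needle.len();
    let len = tag[start..].find('"')?;
    Some(&tag[start..start + len])
}

/// Collects the href of every `<a ...>` start tag whose class attribute equals `class`.
fn links_with_class<'a>(html: &'a str, class: &str) -> Vec<&'a str> {
    let mut links = Vec::new();
    let mut rest = html;
    while let Some(open) = rest.find("<a ") {
        let after_open = &rest[open..];
        // Stop if the tag is never closed.
        let close = match after_open.find('>') {
            Some(i) => i,
            None => break,
        };
        let tag = &after_open[..=close];
        if attr_value(tag, "class") == Some(class) {
            if let Some(href) = attr_value(tag, "href") {
                links.push(href);
            }
        }
        // Continue scanning after this tag.
        rest = &after_open[close + 1..];
    }
    links
}
```

A regex that expects href to come before class would miss a tag written the other way around, while this sketch picks up both. A real parser goes much further and also copes with nesting, entities and broken markup.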

Storing the Data

Every day Rust is getting more and more options for interfacing with databases. Depending on your preference, you can either write the raw SQL queries yourself or use an ORM library that maps Rust structures to SQL datatypes for you.

One of the simplest options is SQLite, a well-known embeddable database. It is written in C and has bindings for lots of different languages, including Java, Python, Ruby and Rust. You can use the rusqlite crate to interact with SQLite databases. It lets you update and query the database while taking advantage of Rust’s type system.

Example - Getting the Hacker News Frontpage

As an example, let’s grab the HN Frontpage with reqwest and regex. First of all, let’s get the HTML of the page using reqwest.

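// Written against reqwest's older synchronous API; on reqwest 0.10 and later,
// the equivalent call is reqwest::blocking::get (behind the "blocking" feature).
let url = "https://news.ycombinator.com/";
let html = reqwest::get(url)?.text()?;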

After this, we need to construct our regex matcher. If you look at the Hacker News HTML, you will see that the posts are shown like this.

<td class="title"><a href="https://blog.mozilla.org/addons/2018/01/26/extensions-firefox-59/" class="storylink">Extensions in Firefox 59</a>

Here’s how you can turn this into a regular expression in Rust.

let re = Regex::new("<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>").unwrap();

Let’s iterate over the matches in the HTML and print them to the console.

for cap in re.captures_iter(&html) {
    let link = &cap[1];
    let title = &cap[2];
    println!("{}: {}", title, link);
}
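One thing to keep in mind is that text captured straight out of HTML is still entity-encoded, so an ampersand in a title shows up as `&amp;`. Here is a minimal sketch of a cleanup step using only the standard library. The function name decode_basic_entities is my own, and it covers only a handful of common entities rather than the full table.

```rust
/// Decodes a handful of common HTML entities in a captured string.
/// Hypothetical helper; a full decoder would cover the whole entity table
/// as well as arbitrary numeric references.
fn decode_basic_entities(text: &str) -> String {
    // `&amp;` is replaced last so that "&amp;lt;" becomes "&lt;" rather than "<".
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#x27;", "'")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}
```

In the loop above, you could pass the captured title through decode_basic_entities before printing it.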

Pipelines with Iterators

Rust has excellent support for iterators, and with a little functional-programming magic, you can make your scrapers really modular and maintainable.

fn get_links(html: &str) -> Vec<String> {
    let re = Regex::new("<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">.*?</a>").unwrap();
    re.captures_iter(html)
        .map(|story| {
            story[1].to_string()
        }).collect()
}

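// Counts the bytes in the response body by reading it to the end.
// With the older reqwest API used on this page, `bytes()` comes from
// std::io::Read, so `use std::io::Read;` must be in scope.
fn get_page_size(r: Response) -> usize {
    r.bytes().count()
}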

let mut resp = reqwest::get("https://news.ycombinator.com/")?;
let html = resp.text()?;

let a = get_links(&html).iter()
    .map(|link| reqwest::get(&*link))
    .filter_map(|res| res.ok())
    .map(get_page_size)
    .collect::<Vec<usize>>();

println!("{:?}", a);
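If you want to play with the shape of this pipeline without hitting the network, here is a std-only sketch of the same map, filter_map, map, collect chain. The fake_fetch function is a hypothetical stand-in for reqwest::get that I made up for illustration: it "succeeds" for links starting with https:// and fails for everything else, such as relative links.

```rust
/// Stand-in for a network fetch (hypothetical): returns a fake body for
/// "https://" links and an error for anything else.
fn fake_fetch(link: &str) -> Result<String, String> {
    if link.starts_with("https://") {
        Ok(format!("<html>{}</html>", link))
    } else {
        Err(format!("unsupported link: {}", link))
    }
}

/// One pipeline stage: measures a fetched body in bytes.
fn body_size(body: String) -> usize {
    body.len()
}

/// Chains the stages: fetch each link, drop the failures, measure what is left.
fn page_sizes(links: &[&str]) -> Vec<usize> {
    links
        .iter()
        .map(|link| fake_fetch(link))
        .filter_map(Result::ok)
        .map(body_size)
        .collect()
}
```

Because each stage is a plain function, you can swap fake_fetch for a real request or test body_size on its own without touching the rest of the chain. Note that filter_map(Result::ok) silently discards errors, which is convenient for a quick scraper but hides failures you might want to log.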

Example Code (GET Request)

Reqwest

let mut resp = reqwest::get("https://gkbrk.com/feed.xml")?;
let content = resp.text()?;

println!("{}", content);
