Rust web scraping


Rust is very useful as a performant systems-programming language, but it can also be used for data mining and web scraping. Its type system and memory safety make it a great tool for writing reliable data extractors.

On this page, I will write about a few Rust libraries that are useful for this purpose and show some example code.

HTTP requests / Getting the page contents

Perhaps the most important part of scraping a web page is getting the page HTML. Rust has a few choices for this; the right one depends on your priorities.

When this page was originally written, most clients had blocking APIs. Now Rust has clients for both asynchronous and synchronous I/O.

For most “client” applications, it is much easier to go with the synchronous options, as the benefits of async I/O mostly show up on servers.
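
A synchronous scraper can still download a handful of pages in parallel with plain threads. The sketch below uses only the standard library and assumes a recent toolchain with scoped threads. The name `fetch_all` and the closure passed to it are made up for illustration; the closure is a stand-in for a real blocking HTTP call.

```rust
use std::thread;

/// Run `fetch` for every URL on its own thread and return the results
/// in the same order as the input. `fetch` is a stand-in for a blocking
/// HTTP call (hypothetical, not part of any library).
fn fetch_all<T, F>(urls: &[&str], fetch: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    thread::scope(|scope| {
        // Spawn all threads first so the calls overlap...
        let handles: Vec<_> = urls
            .iter()
            .map(|&url| {
                let fetch = &fetch;
                scope.spawn(move || fetch(url))
            })
            .collect();
        // ...then wait for each one, keeping the input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("fetch thread panicked"))
            .collect()
    })
}

fn main() {
    // Pretend the "page size" is simply the length of the URL.
    let urls = ["https://example.com/a", "https://example.com/bb"];
    let sizes = fetch_all(&urls, |url| url.len());
    println!("{:?}", sizes);
}
```

For a few dozen pages, one thread per request is usually enough, and the code stays free of async runtimes.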

ureq

ureq is a simple, synchronous HTTP client. Because it does not need an async runtime, it avoids the extra dependencies that come with one.

curl

While it is not written in Rust, and there are excellent alternatives that are, you always have the option of using the libcurl bindings. You can find the curl crate at https://crates.io/crates/curl.

Reqwest

A new library that surfaced after this page was originally written is Reqwest. Reqwest tries to cover the common use cases with sensible defaults and relatively good performance. It could be seen as the Rust version of Python’s requests. It really simplifies the whole process of making HTTP requests, so unless you need absolute control over every part of your requests, you should give reqwest a try.

Hyper

Hyper is a fast and modern HTTP client (and server) library that leverages Rust’s type system to make zero-cost, safe abstractions over the protocol.

In general, Hyper is pretty low-level and not the best choice for scrapers. There are libraries, such as Reqwest, that use Hyper internally but expose a simpler interface.

Extracting the data

Regular Expressions

While we know that using regexes to parse HTML is a Bad Idea™, and that it’s not even possible in the general case, we also know that at some point everybody will use them for this purpose for one reason or another. Rust has us covered for this use case with its excellent regex library.

It’s useful when the page layout is known not to change, or when you’re dealing with incorrect HTML. Still, it’s a good idea to give actual HTML parsers a go, as they can be much more durable when layouts change.
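
To make the durability point concrete, here is a minimal sketch that uses only the standard library. The helper `extract_href` is hypothetical and stands in for any fixed pattern, such as a regex: it looks for the exact text `<a href="`, so it misses the link as soon as another attribute comes first. The sample markup and the "redesign" are made up.

```rust
/// Hypothetical helper: returns the first link target that follows the exact
/// text `<a href="`. This mimics a rigid pattern, not a real HTML parser.
fn extract_href(html: &str) -> Option<&str> {
    let marker = "<a href=\"";
    let start = html.find(marker)? + marker.len();
    let len = html[start..].find('"')?;
    Some(&html[start..start + len])
}

fn main() {
    let original =
        r#"<td class="title"><a href="https://example.com/post" class="storylink">Post</a>"#;
    // Same link, but the attributes were reordered by a made-up redesign.
    let redesigned =
        r#"<td class="title"><a class="storylink" href="https://example.com/post">Post</a>"#;

    println!("{:?}", extract_href(original)); // Some("https://example.com/post")
    println!("{:?}", extract_href(redesigned)); // None
}
```

An HTML parser works on the parsed attributes rather than on the raw text, so attribute order does not matter to it.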

HTML parsers

Depending on the page you’re scraping, HTML parsers will probably be more reliable than regular expressions.

Select uses html5ever, a fast HTML parser written in Rust, to make it easy to navigate the page tags and extract the data you need. It serves a similar purpose to Java’s jsoup and Python’s BeautifulSoup.

Another library that can be used to parse HTML is scraper, which is also built on html5ever and lets you query the document with CSS selectors.

JSON parsers

If you are lucky, you can find an endpoint that produces JSON data. This is usually the case when the website you are trying to scrape is a JavaScript app that fetches its data dynamically. In these cases, you can use the Network tab of the Developer Tools to determine how you need to make a request to get the JSON data.

After you find the endpoint and fetch the content using one of the HTTP clients above, you need a JSON parser to extract the data you need. The most popular JSON parser in the Rust ecosystem is serde_json.

Browser automation

Using Selenium / WebDriver with Rust

In Python, it is common to spawn a real browser and interact with it through the WebDriver protocol using Selenium. If you come across a website that is inconvenient to scrape with the usual methods, you can use a web browser to help you.

Some Rust crates that implement this protocol are fantoccini and thirtyfour.

Data storage

Every day Rust is getting more and more options for interfacing with databases. Depending on your preference, you can either write the raw SQL queries yourself, or you can use an ORM library that will map Rust structures into SQL data types for you.

SQLite

One of the simplest options is to use SQLite. SQLite is a well known embeddable database. It is written in C, and it has bindings for lots of different languages including Java, Python, Ruby and Rust. You can use the rusqlite crate for interacting with SQLite databases. It lets you update and query the database while taking advantage of the type system.

PostgreSQL

If you prefer to use PostgreSQL, you can use the postgres crate to interact with it. The documentation has useful examples to get started quickly.

MongoDB

MongoDB publishes an official database driver for Rust, aptly called mongodb.

Code examples

Getting the Hacker News Front page

As an example, let’s grab the HN Front page with reqwest and regex. First of all, let’s get the HTML of the page using reqwest.

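// Requires reqwest's "blocking" feature; `?` assumes an enclosing function that returns a Result.
let url = "https://news.ycombinator.com/";
let html = reqwest::blocking::get(url)?.text()?;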

After this, we need to construct our regex matcher. If you look at the Hacker News HTML, you will see that the posts are shown like this.

<td class="title"><a href="https://blog.mozilla.org/addons/2018/01/26/extensions-firefox-59/" class="storylink">Extensions in Firefox 59</a>

Here’s how you can turn this into a regular expression in Rust.

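// Needs `use regex::Regex;`. The raw string avoids escaping quotes; each (.*?) is a lazy capture group.
let re = Regex::new(r#"<td class="title"><a href="(.*?)" class="storylink">(.*?)</a>"#).unwrap();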

Let’s iterate over the matches in the HTML and print them to the console.

// Group 1 is the link and group 2 is the title.
for cap in re.captures_iter(&html) {
    let link = &cap[1];
    let title = &cap[2];
    println!("{}: {}", title, link);
}

Pipelines with Iterators

Rust has excellent support for iterators, and with a little functional-programming magic, you can make your scrapers really modular and maintainable.

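use regex::Regex;
use reqwest::blocking::Response;

// Extract the story URLs from the front-page HTML.
fn get_links(html: &str) -> Vec<String> {
    let re = Regex::new(r#"<td class="title"><a href="(.*?)" class="storylink">.*?</a>"#).unwrap();
    re.captures_iter(html)
        .map(|story| story[1].to_string())
        .collect()
}

// Download the response body and return its size in bytes (0 if the download fails).
fn get_page_size(r: Response) -> usize {
    r.bytes().map(|body| body.len()).unwrap_or(0)
}

// Requires reqwest's "blocking" feature.
let html = reqwest::blocking::get("https://news.ycombinator.com/")?.text()?;

// Fetch every linked page, skip the requests that fail, and collect the page sizes.
let a = get_links(&html).iter()
    .map(|link| reqwest::blocking::get(link.as_str()))
    .filter_map(|res| res.ok())
    .map(get_page_size)
    .collect::<Vec<usize>>();

println!("{:?}", a);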
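
The same pipeline shape can be tried without any network access. The sketch below uses only the standard library; the stage functions `is_external`, `host_of` and `unique_hosts`, as well as the sample links, are made up for illustration. Each stage is a small function, so it can be tested or replaced on its own.

```rust
use std::collections::BTreeSet;

/// Hypothetical stage: keep only absolute http(s) links.
fn is_external(link: &&str) -> bool {
    link.starts_with("http://") || link.starts_with("https://")
}

/// Hypothetical stage: take the host part of an absolute URL.
fn host_of(link: &str) -> Option<&str> {
    let rest = link.split_once("://")?.1;
    let host = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    if host.is_empty() { None } else { Some(host) }
}

/// Compose the stages: filter, transform, then collect the unique hosts in sorted order.
fn unique_hosts(links: &[&str]) -> Vec<String> {
    links
        .iter()
        .copied()
        .filter(is_external)
        .filter_map(host_of)
        .map(str::to_string)
        .collect::<BTreeSet<String>>()
        .into_iter()
        .collect()
}

fn main() {
    let links = [
        "https://example.com/a",
        "item?id=1", // relative link, dropped by is_external
        "http://example.org/b?x=1",
        "https://example.com/c",
    ];
    println!("{:?}", unique_hosts(&links)); // ["example.com", "example.org"]
}
```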

GET request with Reqwest

// Requires reqwest's "blocking" feature.
let content = reqwest::blocking::get("https://gkbrk.com/feed.xml")?.text()?;

println!("{}", content);

Citation

If you find this work useful, please cite it as:
@article{yaltirakli,
  title   = "Rust web scraping",
  author  = "Yaltirakli, Gokberk",
  journal = "gkbrk.com",
  year    = "2024",
  url     = "https://www.gkbrk.com/rust-web-scraping"
}
IEEE Citation
Gokberk Yaltirakli, "Rust web scraping", November, 2024. [Online]. Available: https://www.gkbrk.com/rust-web-scraping. [Accessed Nov. 12, 2024].
APA Style
Yaltirakli, G. (2024, November 12). Rust web scraping. https://www.gkbrk.com/rust-web-scraping
Bluebook Style
Gokberk Yaltirakli, Rust web scraping, GKBRK.COM (Nov. 12, 2024), https://www.gkbrk.com/rust-web-scraping

Comments

Comment by admin
2019-03-20 at 14:17

Hey @IT, Curl might be more lightweight. You can check it out on https://crates.io/crates/curl.

Comment by IT
2019-03-06 at 09:18

reqwest pulls in 100MB of crates and builds 138 of them, taking minutes, just to do one lousy GET. There really should be a more simple and sane crate that just does this one thing properly.

Comment by admin
2019-01-30 at 23:41

Hi, I don't use VBA at all; but from what I can gather from that StackOverflow link, you're trying to use a proxy. That should be pretty easy to do with the libraries above.

Comment by TahorSuiJuris
2019-01-30 at 16:13

Any experiences to share in the conversion of an Excel VBA data extraction script conversion to Rust? https://stackoverflow.com/questions/54444427/modify-from-xmlhttp-to-serverxmlhttp-for-enabling-proxy-use#54444427

Comment by TahorSuiJuris
2019-01-28 at 23:08

Are there any new MySQL integrations with Rust that you may be aware?

© 2024 Gokberk Yaltirakli