Rust is best known as a performant systems-programming language, but it also works well for data mining and web scraping. Its type system and memory safety make it a great tool for writing reliable data extractors.
On this page, I will write about a few Rust libraries that are useful for this purpose and show some example code.
HTTP requests / Getting the page contents
Perhaps the most important part of scraping a web page is getting the page HTML. Rust has a few choices for this, and the right one depends on your priorities.
When this page was originally written, most clients had blocking APIs. Now Rust supports both asynchronous and synchronous I/O.
For most “client” applications, it is much easier to go with the synchronous options, since the benefits of async I/O mostly show up on servers.
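To make the synchronous model concrete, here is a minimal sketch of a blocking GET that uses nothing but the standard library. It is an illustration under heavy assumptions, not a replacement for the clients listed below: it speaks plain HTTP/1.0 on port 80 only (no TLS, redirects or compression), and the names `build_get_request`, `split_response` and `fetch_plain_http` are made up for this example.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Builds a minimal HTTP/1.0 GET request. With HTTP/1.0 the server closes
/// the connection after responding, so the client can simply read to EOF.
fn build_get_request(host: &str, path: &str) -> String {
    format!("GET {} HTTP/1.0\r\nHost: {}\r\n\r\n", path, host)
}

/// Splits a raw response into (status code, body).
/// Returns None if the input does not look like an HTTP response.
fn split_response(raw: &str) -> Option<(u16, &str)> {
    let (head, body) = raw.split_once("\r\n\r\n")?;
    let status_line = head.lines().next()?;
    // A status line looks like "HTTP/1.0 200 OK"; the code is the second field.
    let code = status_line.split_whitespace().nth(1)?.parse().ok()?;
    Some((code, body))
}

/// Blocking fetch over plain HTTP (assumption: port 80, no TLS).
/// Every call here blocks the current thread until it completes.
fn fetch_plain_http(host: &str, path: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect((host, 80))?;
    stream.write_all(build_get_request(host, path).as_bytes())?;
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?; // returns once the server closes the socket
    Ok(String::from_utf8_lossy(&raw).into_owned())
}
```

Calling `fetch_plain_http` with a host and a path simply blocks until the whole response has arrived, which is all that "synchronous" means here. `split_response` then gives you the status code and the body.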
ureq
ureq is a simple, synchronous HTTP client. It doesn’t require an async runtime, so you avoid the bloat that comes with one.
curl
While it is not written in Rust, and there are excellent alternatives that are, you always have the option of using the libcurl bindings. You can find them in the curl crate.
Reqwest
A new library that surfaced after this page was originally written is Reqwest. Reqwest tries to cover the common use cases with sensible defaults and relatively good performance. It could be seen as the Rust version of Python’s requests. It really simplifies the whole process of making HTTP requests, so unless you need absolute control over every part of your requests, you should give reqwest a try.
Hyper
Hyper is a fast and modern HTTP client (and server) library that leverages Rust’s type system to make zero-cost, safe abstractions over the protocol.
In general, Hyper is pretty low-level and not the best choice for scrapers. There are libraries, such as Reqwest, that use Hyper internally but expose a simpler interface.
Extracting the data
Regular Expressions
While we know that using regexes to parse HTML is a Bad Idea™, and that it’s not even possible in the general case, we also know that at some point everybody will use them for this purpose for one reason or another. Rust has us covered for this use case with its excellent regex library.
It’s useful when the page layout is known not to change, or when you’re dealing with malformed HTML. Still, it’s a good idea to give actual HTML parsers a go; they can be much more robust when layouts change.
HTML parsers
Depending on the page you’re scraping, HTML parsers will probably be more reliable than regular expressions.
Select uses html5ever, a fast HTML parser written in Rust, to make it easy to navigate the page’s tags and extract the data you need. It serves a similar purpose to Java’s jsoup and Python’s BeautifulSoup.
Another library that can be used to parse HTML is scraper.
JSON parsers
If you are lucky, you can find an endpoint that produces JSON data. This is usually the case when the website you are trying to scrape is a JavaScript app that fetches its data dynamically. In these cases, you can use the Network tab of the Developer Tools to determine how you need to make a request to get the JSON data.
After you find the endpoint and fetch the content using one of the HTTP clients above, you need a JSON parser to extract the data you need. The most popular JSON parser in the Rust ecosystem is serde_json.
Browser automation
Using Selenium / WebDriver with Rust
In Python, it is common to spawn a real browser and interact with it through the WebDriver protocol using Selenium. If you come across a website that is inconvenient to scrape with the usual methods, you can use a web browser to help you.
Some Rust crates that implement this protocol are fantoccini and thirtyfour.
Data storage
Every day Rust is getting more and more options for interfacing with databases. Depending on your preference, you can either write the raw SQL queries yourself or use an ORM library that will map Rust structures into SQL data types for you.
SQLite
One of the simplest options is to use SQLite. SQLite is a well-known embeddable database. It is written in C, and it has bindings for lots of different languages, including Java, Python, Ruby and Rust. You can use the rusqlite crate for interacting with SQLite databases. It lets you update and query the database while taking advantage of the type system.
PostgreSQL
If you prefer to use PostgreSQL, you can use the postgres crate to interact with it. The documentation has useful examples to get started quickly.
MongoDB
MongoDB publishes an official database driver for Rust, aptly called mongodb.
Code examples
Getting the Hacker News Front page
As an example, let’s grab the HN Front page with reqwest and regex. First of all, let’s get the HTML of the page using reqwest.
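// Current reqwest versions expose the synchronous API under `reqwest::blocking`
// (enable the `blocking` feature). `?` assumes a function that returns a Result.
let url = "https://news.ycombinator.com/";
let html = reqwest::blocking::get(url)?.text()?;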
After this, we need to construct our regex matcher. If you look at the Hacker News HTML, you will see that the posts are shown like this.
<td class="title"><a href="https://blog.mozilla.org/addons/2018/01/26/extensions-firefox-59/" class="storylink">Extensions in Firefox 59</a>
Here’s how you can turn this into a regular expression in Rust.
use regex::Regex;
// A raw string literal avoids having to escape every double quote.
let re = Regex::new(r#"<td class="title"><a href="(.*?)" class="storylink">(.*?)</a>"#).unwrap();
Let’s iterate over the matches in the HTML and print them to the console.
for cap in re.captures_iter(&html) {
    let link = &cap[1];  // first capture group: the href
    let title = &cap[2]; // second capture group: the link text
    println!("{}: {}", title, link);
}
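If you would rather not pull in a regex dependency for a layout this rigid, the same extraction can be approximated with plain string searching. The sketch below is a stand-in for the regex approach, not the same technique. It assumes the markup looks exactly like the snippet shown earlier, and `extract_stories` is a hypothetical helper name.

```rust
/// Extracts (link, title) pairs from Hacker News style markup by searching
/// for the fixed strings that surround each story link.
fn extract_stories(html: &str) -> Vec<(String, String)> {
    // Assumption: these markers match the page layout exactly.
    const OPEN: &str = "<td class=\"title\"><a href=\"";
    const MID: &str = "\" class=\"storylink\">";
    const CLOSE: &str = "</a>";

    let mut stories = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(OPEN) {
        // Skip past the opening marker; the link starts here.
        rest = &rest[start + OPEN.len()..];
        // The link runs up to the attribute that follows it.
        let link_end = match rest.find(MID) {
            Some(i) => i,
            None => break,
        };
        let link = &rest[..link_end];
        rest = &rest[link_end + MID.len()..];
        // The title runs up to the closing tag.
        let title_end = match rest.find(CLOSE) {
            Some(i) => i,
            None => break,
        };
        stories.push((link.to_string(), rest[..title_end].to_string()));
        rest = &rest[title_end + CLOSE.len()..];
    }
    stories
}
```

Like the regular expression, this breaks as soon as the site changes its class names or attribute order, so it only makes sense for pages whose layout you control or know to be stable.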
Pipelines with Iterators
Rust has excellent support for iterators, and with a little functional-programming magic, you can make your scrapers really modular and maintainable.
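use regex::Regex;
use reqwest::blocking::Response;

// Extracts the story links from the front page HTML.
fn get_links(html: &str) -> Vec<String> {
    let re = Regex::new(r#"<td class="title"><a href="(.*?)" class="storylink">.*?</a>"#).unwrap();
    re.captures_iter(html)
        .map(|story| story[1].to_string())
        .collect()
}

// Returns the size of the response body in bytes, or 0 if the body cannot be read.
fn get_page_size(r: Response) -> usize {
    r.bytes().map(|body| body.len()).unwrap_or(0)
}

// Requires reqwest's `blocking` feature; `?` assumes a function that returns a Result.
let html = reqwest::blocking::get("https://news.ycombinator.com/")?.text()?;
let sizes = get_links(&html)
    .iter()
    .map(|link| reqwest::blocking::get(link.as_str()))
    .filter_map(|res| res.ok()) // silently skip pages that failed to load
    .map(get_page_size)
    .collect::<Vec<usize>>();
println!("{:?}", sizes);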
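One way to keep such a pipeline testable is to pass the fetching step in as a parameter, so the stages can be exercised without touching the network. This is a sketch using only the standard library. `page_sizes` and its `fetch` parameter are names invented for this example, and in a real scraper `fetch` would wrap a call to your HTTP client.

```rust
/// Runs the pipeline: links -> fetch -> drop failures -> body sizes.
///
/// `fetch` is a hypothetical stand-in for an HTTP client call. It takes a
/// URL and returns either the page body or an error message.
fn page_sizes<F>(links: &[String], fetch: F) -> Vec<usize>
where
    F: Fn(&str) -> Result<String, String>,
{
    links
        .iter()
        .map(|link| fetch(link.as_str())) // stage 1: download each page
        .filter_map(Result::ok)           // stage 2: skip pages that failed
        .map(|body| body.len())           // stage 3: measure the body in bytes
        .collect()
}
```

In a test you can hand `page_sizes` a closure that returns canned HTML for some URLs and an error for others, then check that only the successful pages are counted. Swapping in a different stage, such as counting links instead of bytes, only touches one line of the chain.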
GET request with Reqwest
// Requires reqwest's `blocking` feature; `?` assumes a function that returns a Result.
let resp = reqwest::blocking::get("https://gkbrk.com/feed.xml")?;
let content = resp.text()?;
println!("{}", content);