Writing a Simple IPFS Crawler

IPFS is a peer-to-peer protocol that allows you to access and publish content in a decentralized fashion. It uses hashes to refer to files. Short of someone posting hashes on a website, discoverability of content is pretty low. In this article, we’re going to write a very simple crawler for IPFS.

It’s challenging to have a traditional search engine in IPFS because content rarely links to each other. But there is another way than just blindly following links like a traditional crawler.

Enter DHT

In IPFS, the content for a given hash is found using a Distributed Hash Table. Which means our IPFS daemon receives requests about the location of IPFS objects. When all the peers do this, a key-value store is distributed among them; hence the name Distributed Hash Table. Even though we won’t get all the queries, we will still get a fraction of them. We can use these to discover when people put files on IPFS and announce it on the DHT.

Fortunately, IPFS lets us see those DHT queries from the log API. For our crawler, we will use the Rust programming language and the ipfsapi crate for communicating with IPFS. You can add ipfsapi = "0.2" to your Cargo.toml file to get the dependency.

Using IPFS from Rust

Let’s test if our IPFS daemon and the IPFS crate are working by trying to fetch and print a file.

let api = IpfsApi::new("127.0.0.1", 5001);

let bytes = api.cat("QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u")?;
let data = String::from_utf8(bytes.collect())?;

println!("{}", data);

This code should grab the contents of the hash, and if everything is working print “Hello World”.

Getting the logs

Now that we can download files from IPFS, it’s time to get all the logged events from the daemon. To do this, we can use the log_tail method to get an iterator of all the events. Let’s get everything we get from the logs and print it to the console.

for line in api.log_tail()? {
    println!("{}", line);
}

This gets us all the loops, but we are only interested in DHT events, so let’s filter a little. A DHT announcement looks like this in the JSON logs.

{
  "duration": 235926,
  "event": "handleAddProvider",
  "key": "QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u",
  "peer": "QmeqzaUKvym9p8nGXYipk6JpafqqQAnw1ZQ4xBoXWcCrLb",
  "session": "ffffffff-ffff-ffff-ffff-ffffffffffff",
  "system": "dht",
  "time": "2018-03-12T00:32:51.007121297Z"
}

We are interested in all the log entries with the event handleAddProvider. And the hash of the IPFS object is key. We can filter the iterator like this.

let logs = api.log_tail()
        .unwrap()
        .filter(|x| x["event"].as_str() == Some("handleAddProvider"))
        .filter(|x| x["key"].is_string());

for log in logs {
    let hash = log["key"].as_str().unwrap().to_string();
    println!("{}", hash);
}

Grabbing the valid images

As a final step, we’re going to save all the valid image files that we come across. We can use the image crate. Basically; for each object we find, we’re going to try parsing it as an image file. If that succeeds, we likely have a valid image that we can save.

Let’s write a function that loads an image from IPFS, parses it with the image crate and saves it to the images/ folder.

fn check_image(hash: &str) -> Result<(), Error> {
    let api = IpfsApi::new("127.0.0.1", 5001);

    let data: Vec<u8> = api.cat(hash)?.collect();
    let img = image::load_from_memory(data.as_slice())?;

    println!("[!!!] Found image on hash {}", hash);

    let path = format!("images/{}.jpg", hash);
    let mut file = File::create(path)?;
    img.save(&mut file, image::JPEG)?;

    Ok(())
}

And then connecting to our main loop. We’re checking each image in a seperate thread because IPFS can take a long time to resolve a hash or timeout.

for log in logs {
    let hash = log["key"].as_str().unwrap().to_string();
    println!("{}", hash);

    thread::spawn(move|| check_image(&hash));
}

Possible improvements / future work

  • File size limits: Checking the size of objects before downloading them
  • More file types: Saving more file types. Determining the types using a utility like file.
  • Parsing HTML: When the object is valid HTML, parse it and index the text in order to provide search


Comments BETA