Craigslist automation

Update: this project is now available on npm:

A long time ago, Craigslist allowed accessing its posts via RSS. It was possible to append &format=rss to the Craigslist URL query string to get programmatic access to posts. Unfortunately, Craigslist stopped supporting RSS a few years ago, and it does not seem like it (or a replacement) is going to be available anytime soon, if ever. With RSS gone, the community stepped up and created python-craigslist, a Python package that allows accessing Craigslist posts from a Python program.

I remember experimenting with it some time ago, and it worked pretty well. I tried it again last night and, to my surprise, I couldn’t get any results for my queries. I checked the project’s repo and quickly found an issue that looked exactly like mine. The issue points out that the HTML Craigslist returns no longer contains posts, only a message saying that a browser with JavaScript support is required to see the page. This breaks the python-craigslist library because it just sends HTTP requests and parses the returned HTML.

It seems Craigslist no longer serves results as plain old HTML but uses JavaScript to build the post gallery dynamically. Not being a web developer, I was surprised to see the same behavior in a browser: out of curiosity, I loaded the “cars+trucks” for-sale post gallery, checked the page source, and saw the same message as mentioned in the GitHub issue. However, after inspecting the DOM with the built-in developer tools, I could see individual posts.
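To make the difference between the page source and the rendered DOM concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the two HTML samples are invented (they are not Craigslist's actual markup), and extractLinks is a hypothetical helper that stands in for what a parse-the-response library does. It only shows why a client that never runs JavaScript finds nothing to parse.

```typescript
// Hypothetical sample: what a plain HTTP client might receive (no posts, only a notice).
const staticHtml = `
<html><body>
  <noscript>A browser with JavaScript support is required to see this page.</noscript>
  <script src="/app.js"></script>
</body></html>`;

// Hypothetical sample: what the DOM could look like after the page's scripts have run.
const renderedHtml = `
<html><body>
  <ul>
    <li><a href="https://example.org/post/1.html">Post one</a></li>
    <li><a href="https://example.org/post/2.html">Post two</a></li>
  </ul>
</body></html>`;

// Collect the href attribute of every <a> tag found in an HTML string.
// A regular expression is enough for this illustration; real code should use an HTML parser.
function extractLinks(html: string): string[] {
  const anchor = /<a\s[^>]*href="([^"]*)"/gi;
  const links: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = anchor.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

console.log(extractLinks(staticHtml).length); // 0: the raw response has no links to parse
console.log(extractLinks(renderedHtml));      // the two post links from the rendered sample
```

The same extraction logic succeeds or fails depending only on whether the page's scripts have run, which is why a tool that executes JavaScript is needed.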

For my experiment, python-craigslist was not an option anymore, and I needed a different solution. I spent a few minutes looking at the network requests Craigslist was sending, and it was clear that making sense of them would require a lot of effort. What I wanted was something that acts the same way as a browser but can be driven programmatically.

Enter the headless browser 

When I described what I wanted, I realized this was an exact definition of a headless browser: a browser that can run without a graphical user interface. I knew Chrome could run in headless mode and could be controlled from a Node.js project, as I had played with it a few years earlier. Because it had been a while, I wanted to check how people do this these days. Sure enough, I quickly found puppeteer, a Node.js library that allows interacting with headless Chrome. I quickly created a new Node.js project, configured it to use TypeScript, and voilà, with a few lines of code:

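import * as puppeteer from "puppeteer";

// Assumption: the search URL and the CSS selector for listing links are
// passed as command-line arguments.
const [url, selector] = process.argv.slice(2);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait for network activity to stop so the JavaScript-built gallery is in the DOM.
  await page.goto(url, { waitUntil: "networkidle0" });
  const elements = await page.$$(selector);
  await Promise.all(
    elements.map(async (e) => {
      const href = await e.getProperty("href");
      console.log(await href.jsonValue());
    })
  );
  await browser.close();
})();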

I was able to get links to listings from my query:

Obviously, this is only a simple prototype, but it could be useful for conducting simple experiments.
