Web Scraping Using a Proxy
Jan 2, 2025 | Article

How This Began
When I was in the market for a car, I wanted to analyze listing data so I could make an informed decision. I had several questions in mind:
- How much extra do dealerships charge compared to private sellers?
- Are certain car models more expensive than others with similar mileage (like Toyota Corollas versus Honda Civics)?
- Do some dealerships consistently offer better deals?
- And how does a specific car's price compare to similar vehicles with matching mileage, make, and model?
Different Approaches
While gathering this data, I considered a few approaches. The most straightforward would be manually entering everything into Excel—but life's too short for that.
Traditional web scraping tools like BeautifulSoup, Scrapy, or Puppeteer could work, but the website I was using had robust anti-scraping measures in place. Plus, automated scraping often violates terms of service (always check these and the robots.txt file before attempting any scraping).
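If you do go the traditional scraping route, Python's standard library can at least check robots.txt for you before you fetch anything. Here is a quick sketch; example.com stands in for the real site:

```python
# Check whether robots.txt permits fetching a path (example.com is a placeholder).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses robots.txt
print(rp.can_fetch("*", "https://example.com/car-listings"))  # True if allowed for any user agent
```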
I needed a less invasive approach. I didn't mind running the searches myself, but I wanted to preserve the data they returned. That's when I noticed something interesting: the site was fetching data client-side rather than rendering it on the server. This sparked an idea: instead of scraping HTML, why not intercept the requests with a proxy and parse the JSON directly?
This approach offers an unexpected bonus: access to more detailed data. For instance, I could see unique IDs for each listing, making it easy to track specific vehicles over time. Plus, all the data came pre-structured, eliminating the need to parse messy HTML.
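To make that concrete, a single intercepted listing might look something like this. Every field name below is invented for illustration; the real payload will differ:

```python
# A hypothetical shape for one intercepted listing (all fields invented for illustration).
listing = {
    "id": "abc123",          # stable unique ID, useful for tracking one vehicle over time
    "price": 18500,
    "mileage": 42000,
    "make": "Toyota",
    "model": "Corolla",
    "sellerType": "dealer",  # or "private"
}
```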
There is one catch, though: the data must be fetched client-side for this method to work. However, this is increasingly common, especially on sites with infinite scrolling.
The Process
I will demonstrate how to do this using the browser's own developer tools.
While this method works well for casual use, a more robust setup would involve configuring your browser to use a proxy. This proxy could automatically intercept specified requests and save the responses, streamlining the whole process.
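As a rough sketch, a small mitmproxy addon can do exactly that. The endpoint path below is a placeholder; substitute whatever request you identify in the steps that follow:

```python
# save_listings.py -- run with: mitmproxy -s save_listings.py
# Appends each matching JSON response to a file as you browse through the proxy.
import json

from mitmproxy import http

TARGET = "/api/listings"  # placeholder: the request path that carries the data

def response(flow: http.HTTPFlow) -> None:
    if TARGET in flow.request.pretty_url:
        payload = json.loads(flow.response.get_text())
        with open("listings.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(payload) + "\n")
```

You would then point the browser at the proxy (mitmproxy listens on port 8080 by default) and install its CA certificate so it can decrypt HTTPS traffic.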
So to use the browser dev tools:
- Browse the site naturally.
- Open the Network tab and identify the specific request that fetches your desired data.
- Filter the network requests to show only the ones containing your target data.
- Download these requests as a .HAR file (HTTP Archive format), which is essentially JSON containing request/response information.
- Then you can read in the .HAR file using your programming language of choice and save the data into your own database for later analysis, as sketched below.
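For instance, here is a minimal sketch in Python that walks a HAR export and stores listings in SQLite. The endpoint path, response shape, and field names are all assumptions; adapt them to what the real responses contain:

```python
# parse_har.py -- extract JSON listing responses from a HAR export into SQLite.
import json
import sqlite3

HAR_PATH = "searches.har"  # the file exported from the Network tab
TARGET = "/api/listings"   # placeholder for the endpoint you filtered on

with open(HAR_PATH, encoding="utf-8") as f:
    har = json.load(f)

conn = sqlite3.connect("listings.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS listings "
    "(id TEXT PRIMARY KEY, price INTEGER, mileage INTEGER, "
    "make TEXT, model TEXT, seen_at TEXT)"
)

# A HAR file is JSON: each request/response pair lives under log.entries.
for entry in har["log"]["entries"]:
    if TARGET not in entry["request"]["url"]:
        continue
    body = entry["response"]["content"].get("text")
    if not body:
        continue  # (ignoring possible base64-encoded bodies for brevity)
    data = json.loads(body)
    # Hypothetical response shape: a list of listing objects under "listings".
    for item in data.get("listings", []):
        conn.execute(
            "INSERT OR REPLACE INTO listings VALUES (?, ?, ?, ?, ?, ?)",
            (item["id"], item["price"], item["mileage"],
             item["make"], item["model"], entry["startedDateTime"]),
        )

conn.commit()
conn.close()
```

With the unique IDs stored, rerunning your searches later lets you watch individual listings change price over time.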