A lot of people have asked what it takes for the site to do what it does. Even though I touch on a few points in the FAQ, I figured a quick write-up wouldn’t hurt.
Not surprisingly, the site runs on a dedicated VM with a database at my hosting company, with all the pages being dynamically created when you request them… much like many sites you visit. Where we diverge from most of those other sites is the additional VM dedicated to handling site scraping/feeds and submitting those results to the previously mentioned database. The whole thing was custom written from scratch based on a quick pseudo-code block I wrote in Emacs (think Notepad, but for Unix) the first week. That same write-up turned into a longer to-do list, which is still very long with all the future stuff I want to implement.
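For the curious, the two-VM split described above can be sketched in a few lines. This is only an illustration, not the site's actual code: the database engine is never named here, so an in-memory SQLite database stands in for it, and the table layout and the function names `submit_scan` and `render_in_stock_page` are made up for the example.

```python
import sqlite3


def open_db():
    # Assumption: SQLite stands in for whatever database the site really uses.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE products ("
        " retailer TEXT, name TEXT, in_stock INTEGER,"
        " PRIMARY KEY (retailer, name))"
    )
    return db


def submit_scan(db, retailer, results):
    """Scanner side: store the latest stock status for each product."""
    db.executemany(
        "INSERT OR REPLACE INTO products (retailer, name, in_stock)"
        " VALUES (?, ?, ?)",
        [(retailer, name, int(in_stock)) for name, in_stock in results],
    )
    db.commit()


def render_in_stock_page(db):
    """Web side: build the page text at request time from the current rows."""
    rows = db.execute(
        "SELECT retailer, name FROM products WHERE in_stock = 1"
        " ORDER BY retailer, name"
    ).fetchall()
    return "\n".join(f"{retailer}: {name}" for retailer, name in rows)


db = open_db()
# First scan sees one item in stock and one sold out.
submit_scan(db, "example-retailer", [("Widget A", True), ("Widget B", False)])
# A later scan finds the second item back in stock and overwrites its row.
submit_scan(db, "example-retailer", [("Widget B", True)])
page = render_in_stock_page(db)
print(page)
```

The point of the split is that the scanner only ever writes rows and the web side only ever reads them, so the pages always reflect whatever the most recent scan submitted.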
By the way, I typically use the word “scan” when I mention adding a new retailer on the front page news list. I think the word makes the most sense, as the system scans pages on the retailer’s site much like you scan a page in a magazine, looking for what is important. Really, though, I grab the HTML (the stuff you see when you right click and View Source) like a web browser does and parse the text for the information. This means I am not requesting images, which would waste bandwidth for both the retailer and myself. Some people call it data mining and others call it site scraping. Tohmato – tomahto.
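To give a rough idea of what "parse the text for the information" can look like, here is a minimal sketch using Python's standard `html.parser` module. It is an assumption-laden example rather than the real scanner: the sample markup, the `product-name` and `stock` class names, and the `parse_products` helper are all hypothetical, and every real retailer page needs its own rules.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects (name, in_stock) pairs from a hypothetical product listing."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished (name, in_stock) tuples
        self._field = None   # which field the next text node belongs to
        self._current = {}   # fields gathered for the product being read

    def handle_starttag(self, tag, attrs):
        # Only the markup is inspected; <img> URLs are never fetched.
        classes = (dict(attrs).get("class") or "").split()
        if "product-name" in classes:
            self._field = "name"
        elif "stock" in classes:
            self._field = "stock"

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        self._current[self._field] = text
        self._field = None
        if "name" in self._current and "stock" in self._current:
            in_stock = self._current["stock"].lower() == "in stock"
            self.products.append((self._current["name"], in_stock))
            self._current = {}


def parse_products(html):
    parser = ProductParser()
    parser.feed(html)
    return parser.products


# Hypothetical page source, standing in for what View Source would show.
SAMPLE_HTML = """
<div class="item"><img src="a.jpg"><span class="product-name">Widget A</span>
  <span class="stock">In Stock</span></div>
<div class="item"><img src="b.jpg"><span class="product-name">Widget B</span>
  <span class="stock">Sold Out</span></div>
"""

print(parse_products(SAMPLE_HTML))
# → [('Widget A', True), ('Widget B', False)]
```

Because the rules hang off specific tags and class names, a retailer renaming a class or reshuffling their layout is enough to break a parser like this one.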
Interesting notes about the site/scanner:
- Retailer scans and db updates take about 700 gigabytes per month in bandwidth currently.
- The web site uses only about 50 gigs a month, due to external thumbnails and minimal images.
- There are typically between 15 and 20 retailer scans running every second.
- I typically have to fix at least 1 retailer’s site scan every 2-3 days as they tweak their site layout/html.
- Pauses between scans range from 1 minute to 15 minutes (typically longer delays for retailers that don’t change stock often).
- Some retailers are providing custom feeds, which ensure we catch all their products even if they change their site.
- Most new scan additions are done near midnight, to stop the New Items page from being spammed during busy times.
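The per-retailer pauses in the notes above could be handled with a simple priority queue. The sketch below is just one way to do it and not necessarily how the scanner works: the retailer names and the `simulate` function are invented for the example, and scans are treated as instantaneous to keep the arithmetic easy to follow. Only the 1 to 15 minute range comes from the notes.

```python
import heapq

MIN_PAUSE_S = 60        # 1 minute, from the notes above
MAX_PAUSE_S = 15 * 60   # 15 minutes, from the notes above


def clamp_pause(seconds):
    """Keep a requested pause inside the 1 to 15 minute range."""
    return max(MIN_PAUSE_S, min(MAX_PAUSE_S, seconds))


def simulate(retailers, until_s):
    """Return the (time, retailer) order of scans up to until_s seconds.

    retailers maps a (hypothetical) retailer name to its pause in seconds.
    Scan duration is ignored, which is a simplification.
    """
    queue = [(0, name) for name in sorted(retailers)]
    heapq.heapify(queue)
    order = []
    while queue and queue[0][0] <= until_s:
        due, name = heapq.heappop(queue)
        order.append((due, name))
        # Re-queue the retailer for its next scan after its own pause.
        heapq.heappush(queue, (due + clamp_pause(retailers[name]), name))
    return order


RETAILERS = {"fast-shop": 60, "slow-shop": 900, "too-eager": 10}
schedule = simulate(RETAILERS, until_s=120)
for due, name in schedule:
    print(f"t={due:>4}s  scan {name}")
```

In this run the retailer that asked for a 10 second pause gets bumped up to the 1 minute floor, while the slow-changing one is scanned once and then left alone for 15 minutes.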
Don’t know if anyone found this stuff interesting. Either way, I typed it all out… so there you go.