Web Scraping – how to do it and add stability

If there is no API available to get info off of a web system – what do you do? You buy a scraping tool, or … when that does not work because it can’t get past security or because it is a living website – you MAKE YOUR OWN. A young programmer asked me how to do this, so here is some advice on what to do and what not to do. The system that gave me most of my experience successfully scrapes gigabytes of info each night: it is initiated from a Windows machine and uses a second host on Linux to do the actual scraping. The system babysits itself with two points of reference, so when one machine is down you are informed of it via email.
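
The babysitting part is worth a sketch. This is not the exact script I run – just a hedged illustration of the idea: each of the two hosts checks its partner on a schedule (cron on the Linux side, Task Scheduler on Windows) and emails you when the partner stops answering. The URL and email address are placeholders.

    <?php
    // Hypothetical heartbeat check – run this on each host, pointed at the other one.
    $partner = 'http://other-host.example/heartbeat.txt';   // placeholder URL

    $ch = curl_init($partner);
    curl_setopt_array($ch, [CURLOPT_RETURNTRANSFER => true, CURLOPT_TIMEOUT => 30]);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($body === false || $code !== 200) {
        // the other point of reference is down – tell a human
        mail('you@example.com', 'Scraper host down',
             "No heartbeat from $partner at " . date('c'));
    }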

Idea 1 – cURL (discussion follows)
Idea 2 – learn how to automate Internet Explorer (for Windows) or AppleScript (for Mac)

cURL is your best friend. It can access web pages behind all sorts of security, it is free to use, and it seems to be integrated everywhere. However, don’t irritate Facebook or Kijiji or they will shut your account down. I am not myself on Facebook because they decided I am not really who I say I am, so I made another account under a fake name – I can’t even have a relationship on Facebook with my spouse. Do they have a ‘divorce app’ so I can re-marry her from one account to the other? They seem to be OK with that. DO NOT SCRAPE CERTAIN SITES – read the fine print.

Back to cURL … it is the root of the engine. Here are some tips for when you go to build one (a minimal fetch sketch follows the list):

  • there are a million options and they all interact – read, re-read and re-re-read the instruction pages
  • there are lots of programming environments with cURL bindings or APIs. I have used cURL from PHP, Perl, the Unix command line and Windows environments.
  • Google is your friend
  • don’t give up – it can be done
  • save the pages you fetch to local files (I use a local web directory)
    • protect these directories if sensitive info is stored there
    • delete temporary files when you are done
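
To make the tips concrete, here is the minimal fetch sketch promised above, using cURL from PHP (one of the environments I mentioned). The URL, cookie-jar path and output path are placeholders – swap in your own, and expect to add more options (logins, POST fields, referers) for a real site.

    <?php
    // Minimal single-page fetch with cURL from PHP. URL and paths are placeholders.
    $url = 'https://example.com/members/report?page=1';

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,                      // hand the page back instead of printing it
        CURLOPT_FOLLOWLOCATION => true,                      // follow redirects (login bounces, etc.)
        CURLOPT_COOKIEJAR      => '/tmp/scrape_cookies.txt', // keep the session between calls
        CURLOPT_COOKIEFILE     => '/tmp/scrape_cookies.txt',
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; my-scraper/1.0)',
        CURLOPT_TIMEOUT        => 60,
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        // trap the failure instead of silently saving an empty file
        fwrite(STDERR, 'cURL error: ' . curl_error($ch) . PHP_EOL);
        exit(1);
    }
    curl_close($ch);

    // save to a local (protected!) web directory for the parsing stage
    file_put_contents('/var/www/scrape/raw/report_page1.html', $html);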

Careful – expect things to change on the website you are scraping, so …

  • build the cost into your quote and set the expectation that your customers will pay for a maintenance agreement. This money should not be considered profit – it pays for the rework you will be doing next year when the website is upgraded.
  • use generic names in your code so that site changes do not ripple through everything
  • program in blocks – and DOCUMENT YOUR CODE, especially for your own future self (more profit!)

Processing scraped web pages.

  • HTML5 – AHHHHH! HTML5 is not XML, so if you use an XML parser and someone changes the site to HTML5, your parser breaks – except … not completely, because the new pages still carry lots of old XHTML / HTML 4 markup . . . did I mention giving your customers the expectation of a yearly fee for this tool?
    • this was late 2012–2013, when I had to wrestle with it – programming languages had not adopted standard HTML5 parsing libraries yet, and I could only find one pre-alpha library for parsing HTML5.
  • What happens when there are errors right at the borders of the info you just read?
    • build in retry loops with a slightly bigger (randomly bigger) size and try again (see the retry sketch after this list)
  • how about when the pages are not proper XML and weird things break the parser?
    • simply use a search-and-replace pass – keep it general, with arrays of things to search and replace before the page gets parsed (see the clean-and-parse sketch after this list)
    • finding out why XML breaks is a major pain. Even so, I like XML because it is exact – play by the rules and everything works.
      • XML validators are your friend – but don’t trust just one. A less stringent validator can fool you into thinking all is well, while another will at least give you a HINT (not the answer) about the problem area.
      • in your code, trap the error and spit out the data that is ‘offensive’ – and look at what comes before and after it
      • do not try to debug invalid XML by staring at the whole file – cut the offensive bit out into a small test file (you may need to recreate an XML header for it)
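
First, the retry idea from the list above, as a hedged sketch. fetch_chunk() and parse_block() are hypothetical stand-ins for whatever grabs a slice of the page and parses it – the point is only the loop that grows the read size by a random amount each attempt so the bad border lands somewhere new.

    <?php
    // Retry loop sketch: grow the read size a random amount on each attempt.
    // fetch_chunk() and parse_block() are hypothetical placeholders.
    function scrape_with_retry(string $url, int $baseSize, int $maxTries = 5)
    {
        $size = $baseSize;
        for ($try = 1; $try <= $maxTries; $try++) {
            $raw    = fetch_chunk($url, $size);   // e.g. a cURL call that reads $size bytes
            $parsed = parse_block($raw);          // returns false when the borders cut off the info
            if ($parsed !== false) {
                return $parsed;
            }
            $size += rand(512, 4096);             // slightly (randomly) bigger next time
            sleep(1);                             // be polite to the site between retries
        }
        return false;                             // give up – let the babysitter email you
    }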
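
And here is the clean-and-parse sketch: generic search/replace arrays run over the raw page before it reaches the parser, then libxml’s trapped errors are used to print the ‘offensive’ region with a little context before and after it. The file path and the replacement list are only examples – grow the arrays as the site invents new ways to break your XML.

    <?php
    // Clean the raw page with generic search/replace arrays, then parse it and,
    // on failure, dump the offending lines with some context around them.
    $raw = file_get_contents('/var/www/scrape/raw/report_page1.html');  // placeholder path

    // keep these arrays general and add to them over time
    $search  = ['&nbsp;', '<br>'];
    $replace = ['&#160;', '<br/>'];
    $clean   = str_replace($search, $replace, $raw);

    libxml_use_internal_errors(true);      // trap parse errors instead of spewing warnings
    $doc = new DOMDocument();
    if (!$doc->loadXML($clean)) {
        $lines = explode("\n", $clean);
        foreach (libxml_get_errors() as $err) {
            echo "XML error at line {$err->line}: " . trim($err->message) . "\n";
            // look before and after the offensive line, not just at it
            for ($i = max(0, $err->line - 3); $i <= min(count($lines) - 1, $err->line + 1); $i++) {
                echo ($i + 1) . ': ' . $lines[$i] . "\n";
            }
        }
        libxml_clear_errors();
    }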

That is the only advice I have for now. It has been a long while since I built that system – it was supposed to be 30 hours of work and it ended up being more like 100+. Price carefully, with lots of margin.