Web Scraping – how to do it and add stability

If there is no API available to get info out of a web system – what do you do? You buy a scraping tool, or … when that does not work because it can’t get past security, or because it is a living website – you MAKE YOUR OWN. A young programmer asked me how to do this, so here is some advice on how to (and what not to) do it. The system that gave me most of my experience successfully scrapes gigabytes of info each night: it is initiated from a Windows machine and uses another host on Linux to do the scraping. The system babysits itself with two points of reference, so when one machine is down you are informed of it via email.
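
To make that babysitting idea concrete, here is a minimal sketch of the kind of heartbeat check each host can run against the other from cron. The file path, threshold and email address are hypothetical assumptions for illustration, not the original setup.

```php
<?php
// watchdog.php - run from cron on each host to check on the other one.
// The heartbeat path, threshold and address below are assumptions for
// illustration; adjust them to your own setup.

$heartbeat     = '/var/scrape/heartbeat.txt'; // the partner host updates this file
$maxAgeSeconds = 2 * 3600;                    // alert if silent for 2 hours
$alertTo       = 'ops@example.com';

clearstatcache();
$age = file_exists($heartbeat) ? time() - filemtime($heartbeat) : PHP_INT_MAX;

if ($age > $maxAgeSeconds) {
    // mail() needs a working MTA on the box; swap in your own notifier if not
    mail($alertTo,
         'Scrape watchdog: partner host is silent',
         "No heartbeat for {$age} seconds - check the other machine.");
}
```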

Idea 1 – cURL (discussion follows)
Idea 2 – learn how to automate Internet Explorer (for Windows) or AppleScript (for Mac)

cURL is your best friend. It can access web pages behind all sorts of security, it is free to use, and it seems to be integrated everywhere. However, don’t irritate Facebook or Kijiji or they will shut your account down. (I am not myself on Facebook because they decided I am not really who I say I am – so I made another account under a fake name. I can’t even have a relationship on Facebook with my spouse – do they have a ‘divorce app’ so I can re-marry her from one account to the other? Apparently they are OK with that.) DO NOT SCRAPE CERTAIN SITES – read the fine print.

Back to cURL … it is the root of the engine. Here are some tips for when you go to build one (a minimal fetch sketch follows the list):

  • there are a million options and they all interact – read, re-read and re-re-read the documentation pages
  • lots of languages and environments expose cURL APIs – I have used cURL from PHP, the Unix command line, Perl and Windows
  • Google is your friend
  • don’t give up – it can be done
  • save the pages you fetch to local files
    • protect these directories if sensitive info is stored there
    • delete temporary files when you are done
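
As a starting point, here is a minimal PHP/cURL fetch sketch in the spirit of the tips above. The URL, cookie-jar path and output file are placeholder assumptions, not from the original system.

```php
<?php
// fetch.php - minimal cURL fetch that saves a page to a local file.
// The URL and file paths are placeholders for illustration.

$url     = 'https://example.com/listing?page=1';
$outFile = '/var/scrape/pages/listing-1.html';

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,            // hand the body back instead of printing it
    CURLOPT_FOLLOWLOCATION => true,            // follow redirects
    CURLOPT_COOKIEJAR      => '/tmp/scrape-cookies.txt', // keep session cookies between calls
    CURLOPT_COOKIEFILE     => '/tmp/scrape-cookies.txt',
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; MyScraper/1.0)',
    CURLOPT_TIMEOUT        => 60,
]);

$body = curl_exec($ch);
if ($body === false) {
    fwrite(STDERR, 'cURL error: ' . curl_error($ch) . "\n");
} else {
    file_put_contents($outFile, $body);        // save locally, as advised above
}
curl_close($ch);
```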

Careful – expect things to change on the website you are scraping, so …

  • build in the cost and the expectation for your customers to pay a maintenance agreement. This money should not be considered profit – it is what will pay for your time next year when the website is upgraded.
  • use generic names, so the code does not depend on the site’s current markup (see the sketch after this list)
  • program in blocks – DOCUMENT YOUR CODE, for yourself especially (more profit!)
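
One way to keep names generic and the site-specific parts in one block is to isolate everything likely to change in a configuration array. A sketch of that idea – the URL and patterns are entirely hypothetical, not a real site’s markup:

```php
<?php
// extract.php - site-specific details live in one config block; the rest of
// the code only uses generic names. Patterns here are hypothetical examples.

$config = [
    'list_url' => 'https://example.com/listing?page=%d',
    'patterns' => [
        'title' => '#<h2 class="item-title">(.*?)</h2>#s',
        'price' => '#<span class="price">([\d.,]+)</span>#',
    ],
];

$page = file_get_contents(sprintf($config['list_url'], 1));

// generic loop: adding or renaming a field only touches the config above
foreach ($config['patterns'] as $field => $pattern) {
    if (preg_match($pattern, $page, $m)) {
        echo "$field: {$m[1]}\n";
    }
}
```

When the site is upgraded, next year’s maintenance work is then mostly confined to that one config block.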

Processing scraped web pages.

  • HTML5 – AHHHHH! HTML5 is not XML – so if you use an XML parser and someone changes the site to HTML5, you are in trouble … but not entirely, since in practice the pages still carry lots of HTML 4-era XHTML . . . did I mention to give your customers the expectation of a yearly fee for this tool? (a tolerant-parsing sketch follows this list)
    • this was late 2012–2013, when I had to wrestle with it – programming languages had not adopted standard HTML5 libraries yet, and I could find only one pre-alpha library for parsing HTML5
  • what happens if a page is read and there are errors right at the borders of the info you just read?
    • build in retry loops that re-read with a slightly bigger (randomly bigger) size and try again (see the retry sketch after this list)
  • how about if the pages are not proper XML and weird things break the parser?
    • simply use a search-and-replace pass – keep it general, with arrays of things to search for and replace before the page gets parsed (see the clean-and-parse sketch after this list)
    • finding out why XML breaks is a major pain. That is also why I like XML: it is exact – play by the rules and all will work.
      • XML validators are your friend – but don’t trust just one. A lenient one might fool you into thinking all is well, while another will give you a HINT (not the answer) about the problem area
      • in your code – trap the error and spit out the data that is ‘offensive’, and look before and after it
      • do not try to debug incorrect or invalid XML against the whole file – cut the offending bit out, re-create an XML header around it, and reproduce the error in isolation
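
On the HTML5 problem above: one workable fallback (my suggestion, not necessarily what the original system used) is libxml’s forgiving HTML parser behind PHP’s DOMDocument, which swallows tag soup that would kill a strict XML parser. A minimal sketch, with a hypothetical target element:

```php
<?php
// parse_html.php - tolerant parsing of not-quite-XML pages via DOMDocument.
// A sketch of one fallback approach, not the original system's code.

$html = file_get_contents('/var/scrape/pages/listing-1.html');

libxml_use_internal_errors(true);   // collect parse problems instead of spewing warnings
$doc = new DOMDocument();
$doc->loadHTML($html);              // the HTML parser tolerates unclosed tags, HTML5, etc.

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h2') as $node) {   // hypothetical target element
    echo trim($node->textContent), "\n";
}
libxml_clear_errors();
```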
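
For the border-error retry idea, here is one reading of it as a sketch: re-fetch with a slightly, randomly larger byte range until the chunk parses cleanly. This assumes the server honours HTTP Range requests – treat it as illustrative, not as the original system’s logic.

```php
<?php
// retry_fetch.php - re-read with a randomly bigger window until the chunk is usable.
// Assumes the server honours HTTP Range requests; purely illustrative.

function fetchRange($url, $length) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_RANGE          => '0-' . ($length - 1), // first $length bytes
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$url    = 'https://example.com/data.xml';
$length = 65536;
$xml    = false;

for ($try = 0; $try < 5 && $xml === false; $try++) {
    $chunk = fetchRange($url, $length);
    if ($chunk !== false) {
        libxml_use_internal_errors(true);
        $xml = simplexml_load_string($chunk);  // false if the chunk is cut mid-element
        libxml_clear_errors();
    }
    $length += random_int(1024, 8192);         // slightly, randomly bigger next time
}
```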
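
And for the search-and-replace cleanup plus the error trapping, a sketch combining both tips: scrub the page with arrays of replacements, then, if parsing still fails, print the text before and after the reported error location. The search/replace pairs are hypothetical examples of the kind of junk that breaks strict parsers.

```php
<?php
// clean_and_parse.php - scrub known breakage, then dump context around any XML error.
// The search/replace pairs are hypothetical examples.

$raw = file_get_contents('/var/scrape/pages/listing-1.html');

$search  = ['&nbsp;', ' & ',     '<br>'];   // things that break strict XML
$replace = [' ',      ' &amp; ', '<br/>'];  // their XML-safe stand-ins
$clean   = str_replace($search, $replace, $raw);

libxml_use_internal_errors(true);
$xml = simplexml_load_string($clean);

if ($xml === false) {
    $lines = explode("\n", $clean);
    foreach (libxml_get_errors() as $err) {
        echo 'XML error: ' . rtrim($err->message) . " at line {$err->line}\n";
        // spit out the 'offensive' data - look before and after it
        for ($i = max(0, $err->line - 2); $i < min(count($lines), $err->line + 1); $i++) {
            printf("%5d: %s\n", $i + 1, $lines[$i]);
        }
    }
}
libxml_clear_errors();
```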

That is all the advice I have for now. It was a long while ago that I wrote this system – the job was supposed to be 30 hours and it was more like 100+. Price carefully, with lots of margin.
