If there is no API available to get information out of a web system, what do you do? You buy a scraping tool, or, when that does not work because it cannot get past security or because it is a living website, you MAKE YOUR OWN. A young programmer asked me how to do this, so here is some advice on what to do and what not to do. The system that gave me most of my experience successfully scrapes gigabytes of information each night: it is initiated from a Windows machine and uses another host on Linux to do the scraping. The system babysits itself with two points of reference, so when one machine is down you are informed of it by email.
Idea 1- cURL (discussion follows)
Idea 2- learn how to automate Internet Explorer (for Windows) or AppleScript (for Mac)
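The "babysits itself" part of the opening paragraph does not need to be fancy. Here is a minimal sketch of the idea in PHP (the file paths and email address are made up, not the ones from my system): each host writes a heartbeat file on every successful run, checks the other host's heartbeat file, and emails you when that file looks stale.

    <?php
    // Watchdog sketch with invented paths and addresses.
    $myHeartbeat    = '/var/scraper/heartbeat_linux.txt';   // this host touches this file
    $otherHeartbeat = '/var/scraper/heartbeat_windows.txt'; // pushed over by the other host
    $alertAddress   = 'you@example.com';
    $maxAgeSeconds  = 6 * 3600;                             // call it stale after 6 hours

    // Record that this host is alive.
    file_put_contents($myHeartbeat, date('c'));

    // Complain if the other host has not checked in recently.
    if (!file_exists($otherHeartbeat)
        || (time() - filemtime($otherHeartbeat)) > $maxAgeSeconds) {
        $lastSeen = file_exists($otherHeartbeat) ? date('c', filemtime($otherHeartbeat)) : 'never';
        mail($alertAddress,
             'Scraper watchdog: the other host is down',
             'No heartbeat from the other host since ' . $lastSeen);
    }

Run the same check from the other machine with the file names swapped and you have your two points of reference.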
cURL is your best friend. It can access web pages behind all sorts of security, it is free to use, and it seems to be integrated everywhere. However, do not irritate Facebook or Kijiji or they will shut your account down. I am not myself on Facebook because they decided I am not really who I say I am, so I made another account under a fake name; now I cannot even have a relationship on Facebook with my spouse. Do they have a 'divorce app' so I can re-marry her from one account to another? They would be OK with that. DO NOT SCRAPE CERTAIN SITES: read the fine print.
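For the sites that do allow it, the first hurdle is usually a login form and a session cookie. Here is a minimal sketch of what that looks like with cURL in PHP; the URLs, form field names, and credentials are all invented, and every real site will differ.

    <?php
    // Log in once, keep the session cookie, then fetch a protected page.
    $cookieJar = '/tmp/scraper_cookies.txt';

    // Step 1: POST the login form and let cURL store the session cookie.
    $ch = curl_init('https://example.com/login');            // hypothetical URL
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'username' => 'me',                                   // hypothetical field names
        'password' => 'secret',
    )));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);          // write cookies here
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');       // many sites reject a blank agent
    curl_exec($ch);
    curl_close($ch);

    // Step 2: fetch a protected page re-using the same cookies.
    $ch = curl_init('https://example.com/members/report');    // hypothetical URL
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);          // read cookies from here
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    $page = curl_exec($ch);
    if ($page === false) {
        echo 'cURL error: ' . curl_error($ch) . "\n";
    }
    curl_close($ch);

The same cookie jar can then be reused by every other fetch in that night's run.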
cURL is the root of the engine. Here are some tips for when you go to build your own:
- there are a million options and they all interact – read, re-read and re-re-read the instruction pages
- there are lots of programming environments that expose a cURL API. I have used cURL from PHP, the Unix command line, Perl, and on Windows.
- Google is your friend
- don’t give up – it can be done
- save the pages you fetch to local files (I keep mine under a local web directory) so you can re-parse without re-fetching
- protect those directories if sensitive information ends up in them
- delete temporary files when you are done (a small fetch-and-save sketch follows this list)
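Here is the fetch-and-save sketch promised above: grab a page, keep a dated copy in a local directory so you can re-run the parser without hitting the site again, and delete the temporary file afterwards. The URL and directory names are placeholders.

    <?php
    $saveDir = '/var/www/scraper-data';                       // local web directory; protect it!
    $tmpFile = $saveDir . '/page.tmp';

    $ch = curl_init('https://example.com/listing?page=1');    // hypothetical URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html !== false) {
        // Work on a temporary copy, but keep a dated original
        // so you can re-parse later without re-fetching.
        file_put_contents($tmpFile, $html);
        // ... parse $tmpFile here ...
        copy($tmpFile, $saveDir . '/listing-' . date('Ymd') . '.html');
        unlink($tmpFile);                                      // delete temporary files
    }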
Careful: expect things to change on the website you are scraping, so …
- build the cost into your quote and set the expectation that your customers will pay for a maintenance agreement. That money should not be considered profit; it is what you will be spending next year when the website is upgraded.
- use generic names so the site-specific details live in one place (see the sketch after this list)
- program in blocks and DOCUMENT YOUR CODE, for your own sake especially (more profit!)
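To illustrate the "generic names" point, here is a tiny sketch of keeping all the site-specific strings in one configuration block, so next year's redesign means editing one array instead of the whole program. Everything in it is invented.

    <?php
    // One generic config block per site; the rest of the code
    // only ever talks about 'title' and 'price'.
    $site = array(
        'base_url'  => 'https://example.com',
        'list_path' => '/listing?page=%d',
        'row_tag'   => 'item',                 // the element that holds one record
        'field_map' => array(                  // generic name => site-specific name
            'title' => 'prodTitle',
            'price' => 'prodPrice',
        ),
    );

    // When the site renames prodTitle, only $site changes.
    function fieldName($site, $generic) {
        return $site['field_map'][$generic];
    }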
Processing scraped web pages.
- HTML5: AHHHHH! HTML5 is not XML, so if you use an XML parser and someone changes the site over to HTML5, your parser breaks. Except not completely, since the pages still have lots of XHTML-style markup in them . . . did I mention to give your customers the expectation of a yearly fee for this tool?
- this was in late 2012 and 2013, when I had to wrestle with it; programming languages had not adopted standard HTML5 parsing libraries yet, and I could find only one pre-alpha library for parsing HTML5
- what happens if you read a chunk of a page and the errors sit right at the borders of the piece you just read?
- build in retry loops that re-read with a slightly bigger (randomly bigger) size and try again
- what about pages that are not proper XML, where weird fragments break the parser?
- run a general search-and-replace pass first; keep it generic with arrays of things to search for and replace before the text gets parsed (a sketch of this follows the list)
- finding out why XML breaks is a major pain. Even so, I like XML because it is exact: play by the rules and everything works.
- XML validators are your friend, but do not trust just one: a less stringent validator can fool you into thinking all is well, while another will at least give you a HINT (not the answer) about where the problem area is
- in your code, trap the error and spit out the 'offensive' data, then look at what comes just before and after it
- do not try to debug invalid XML by searching the whole file; cut the offensive bit out into a small test file (you may need to recreate an XML header around it)
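Here is the sketch promised in the list above: a general search-and-replace pass before parsing, a strict XML parse that traps errors and prints the offending lines with a little context, and a forgiving loadHTML() fall-back for tag-soup and HTML5 pages. The replacement table and file name are only examples.

    <?php
    $raw = file_get_contents('/var/www/scraper-data/listing-20130115.html'); // example file

    // 1. General clean-up pass: parallel arrays of things known to break the parser.
    $search  = array('&nbsp;', '<br>',  "\xC2\xA0");
    $replace = array('&#160;', '<br/>', ' ');
    $xml = str_replace($search, $replace, $raw);

    // 2. Strict parse, but trap the errors instead of letting libxml spew warnings.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();

    if (!$doc->loadXML($xml)) {
        // 3. Spit out the 'offensive' data: a couple of lines of context
        //    around each error so you can see what comes before and after it.
        $lines = explode("\n", $xml);
        foreach (libxml_get_errors() as $err) {
            echo 'XML error: ' . trim($err->message) . ' at line ' . $err->line . "\n";
            for ($i = max(1, $err->line - 2); $i <= min(count($lines), $err->line + 2); $i++) {
                echo sprintf("%5d: %s\n", $i, $lines[$i - 1]);
            }
        }
        libxml_clear_errors();

        // 4. Last resort: loadHTML() is far more forgiving than loadXML()
        //    and will swallow most HTML5 / tag-soup pages.
        $doc = new DOMDocument();
        $doc->loadHTML($raw);
    }

Replacing &nbsp; with the numeric entity &#160; is the sort of general fix that survives a site redesign better than anything page-specific.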
That is the only advice I have for now. It was a long while ago that I wrote the system; it was supposed to take 30 hours and it was more like 100+. Price carefully, with lots of margin.