{"id":27,"date":"2011-12-13T22:29:13","date_gmt":"2011-12-13T22:29:13","guid":{"rendered":"http:\/\/elbsolutions.com\/projects\/?p=27"},"modified":"2022-02-03T11:25:07","modified_gmt":"2022-02-03T17:25:07","slug":"web-scraping-how-to-add-stability","status":"publish","type":"post","link":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/","title":{"rendered":"Web Scraping &#8211; how to do it and add stability"},"content":{"rendered":"<p>If there is no API available to get info off of a web system &#8211; what do you do? You buy a scraping tool or &#8230; when that does not work because it can&#8217;t get past security or because it is a living website &#8211; you MAKE YOUR OWN. A young programmer asked me how to do this. Here is some advice on how-to and what not-to (do). The system that gave me most of my experience successfully scrapes gigabytes of info each night: a Windows machine initiates the job and a separate Linux host does the scraping. The system babysits itself with 2 points of reference, so when one machine is down you are informed of it via email.<\/p>\n<p>Idea 1-\u00a0<a href=\"http:\/\/curl.haxx.se\/\">cURL<\/a> (discussion follows)<br \/>\nIdea 2- <a href=\"http:\/\/vbscriptautomation.net\/93\/automating-internet-explorer-part-1\/\" target=\"_blank\" rel=\"noopener noreferrer\">learn how to automate Internet Explorer<\/a> (for Windows) or <a title=\"Load webpage, print, load, print, etc. with Applescipt\" href=\"http:\/\/elbsolutions.com\/projects\/load-webpage-print-load-print-etc-with-applescipt\/\">AppleScript<\/a> (for Mac)<\/p>\n<p>cURL is your best friend. It can access web pages with all sorts of security &#8211; it is free to use and integrated everywhere it seems. However don&#8217;t irritate Facebook or Kijiji or they will shut your account down. 
Now I am not myself on Facebook because they think I am not really who I say I am &#8211; so I made another account with a fake name &#8211; I can&#8217;t even have a relationship on Facebook with my spouse &#8211; do they have a &#8216;divorce app&#8217; on Facebook so I can re-marry her from one account to another? They are ok with that. DO NOT SCRAPE CERTAIN SITES &#8211; read the fine print.<\/p>\n<p>Back to cURL &#8230; it is the root of the engine. Here are some tips for when you go to make one:<\/p>\n<ul>\n<li><span class=\"Apple-style-span\" style=\"line-height: 15px;\">there are a million options and they all interact &#8211; read, re-read and re-re-read the instruction pages<\/span><\/li>\n<li>there are lots of programming environments that provide APIs for it. I have used cURL in PHP, on the Unix command line, in Perl and in Windows environments.<\/li>\n<li>Google is your friend<\/li>\n<li>don&#8217;t give up &#8211; it can be done<\/li>\n<li>save the pages you fetch as local files\n<ul>\n<li>protect these directories if sensitive info is there.<\/li>\n<li>delete temporary files<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Careful &#8211; expect things to change on the website you are scraping so &#8230;<\/p>\n<ul>\n<li><span class=\"Apple-style-span\" style=\"line-height: 15px;\">build in cost and expectation for your customers to pay a maintenance agreement. This money should not be considered profit &#8211; this is what you will be doing next year when the website is upgraded.<\/span><\/li>\n<li>use generic names<\/li>\n<li>program in blocks &#8211; DOCUMENT YOUR CODE, especially for yourself (more profit!)<\/li>\n<\/ul>\n<p>Processing scraped web pages.<\/p>\n<ul>\n<li>HTML5 &#8211; AHHHHH! HTML5 is not XML &#8211; so if you use an XML parser and someone changes the site to HTML5, the parser can break &#8211; but &#8230; not really since it has lots of XHTML-style markup in it . . . 
did I mention to give your customers the expectation of a yearly fee for this tool?\n<ul>\n<li>I had to wrestle with this in late 2012 into 2013, when programming languages had not yet adopted standard HTML5 libraries. There was only one pre-alpha library for parsing HTML5.<\/li>\n<\/ul>\n<\/li>\n<li>What happens if a chunk is read and there are errors right at the boundaries of the info you just read?\n<ul>\n<li>build in retry loops that use a slightly bigger (randomly bigger) size and try again<\/li>\n<\/ul>\n<\/li>\n<li>how about if the pages are not proper XML and weird things break the parser?\n<ul>\n<li>simply use a search and replace tool &#8211; keep it general with arrays of things to search and replace before the page gets parsed<\/li>\n<li>finding why XML breaks is a major pain. That said, I like XML because it is exact &#8211; play by the rules and all will work.\n<ul>\n<li><a href=\"http:\/\/www.google.ca\/search?client=safari&amp;rls=en&amp;q=xml+validator&amp;ie=UTF-8&amp;oe=UTF-8&amp;redir_esc=&amp;ei=FxIcUaXCGaSRygG54IHoAg\">XML Validators<\/a> are your friend &#8211; but don&#8217;t trust just one &#8211; it might be less stringent and fool you into thinking all is well &#8211; when another will give you a HINT (not the answer) about the problem area<\/li>\n<li>in your code &#8211; trap the error and spit out the data that is &#8216;offensive&#8217; &#8211; look before and after it.<\/li>\n<li>Do not try to debug using the whole file to find incorrect or invalid XML &#8211; you might need to recreate an XML header and copy the offending bit out into a small file of its own.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>That is the only advice I have for now. It was a long while ago that I wrote the scraper &#8211; it was supposed to take 30 hours &#8211; it was more like 100+. Price carefully with lots of margin.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If there is no API available to get info off of a web system &#8211; what do you do? 
You buy a scraping tool or &#8230; when that does not work because it can&#8217;t get past security or because it is a living website &#8211; you MAKE YOUR OWN. A young programmer asked me how [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-27","post","type-post","status-publish","format-standard","hentry","category-general"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Web Scraping - how to do it and add stability - ELB Solutions.com Inc.<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Web Scraping - how to do it and add stability - ELB Solutions.com Inc.\" \/>\n<meta property=\"og:description\" content=\"If there is no API available to get info off of a web system &#8211; what do you do? You buy a scraping tool or &#8230; when that does not work because it can&#8217;t get past security or because it is a living website &#8211; you MAKE YOUR OWN. 
A young programmer asked me how [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/\" \/>\n<meta property=\"og:site_name\" content=\"ELB Solutions.com Inc.\" \/>\n<meta property=\"article:published_time\" content=\"2011-12-13T22:29:13+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-02-03T17:25:07+00:00\" \/>\n<meta name=\"author\" content=\"Etienne Bley\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Etienne Bley\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/\"},\"author\":{\"name\":\"Etienne Bley\",\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/#\\\/schema\\\/person\\\/51e717c68f4f5917c63baf88f0896c39\"},\"headline\":\"Web Scraping &#8211; how to do it and add stability\",\"datePublished\":\"2011-12-13T22:29:13+00:00\",\"dateModified\":\"2022-02-03T17:25:07+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/\"},\"wordCount\":734,\"articleSection\":[\"General\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/\",\"url\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/\",\"name\":\"Web Scraping - how to do it and add stability - ELB Solutions.com 
Inc.\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/#website\"},\"datePublished\":\"2011-12-13T22:29:13+00:00\",\"dateModified\":\"2022-02-03T17:25:07+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/#\\\/schema\\\/person\\\/51e717c68f4f5917c63baf88f0896c39\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/web-scraping-how-to-add-stability\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Web Scraping &#8211; how to do it and add stability\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/#website\",\"url\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/\",\"name\":\"ELB Solutions.com Inc.\",\"description\":\"Bringing all your IT Pieces together\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/#\\\/schema\\\/person\\\/51e717c68f4f5917c63baf88f0896c39\",\"name\":\"Etienne 
Bley\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f8971dfb65b25b768415568f83247df4057f15d037137e386928a804e2c997b9?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f8971dfb65b25b768415568f83247df4057f15d037137e386928a804e2c997b9?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f8971dfb65b25b768415568f83247df4057f15d037137e386928a804e2c997b9?s=96&d=mm&r=g\",\"caption\":\"Etienne Bley\"},\"url\":\"https:\\\/\\\/elbsolutions.com\\\/projects\\\/author\\\/etienne-bley\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Web Scraping - how to do it and add stability - ELB Solutions.com Inc.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/","og_locale":"en_US","og_type":"article","og_title":"Web Scraping - how to do it and add stability - ELB Solutions.com Inc.","og_description":"If there is no API available to get info off of a web system &#8211; what do you do? You buy a scraping tool or &#8230; when that does not work because it can&#8217;t get past security or because it is a living website &#8211; you MAKE YOUR OWN. A young programmer asked me how [&hellip;]","og_url":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/","og_site_name":"ELB Solutions.com Inc.","article_published_time":"2011-12-13T22:29:13+00:00","article_modified_time":"2022-02-03T17:25:07+00:00","author":"Etienne Bley","twitter_misc":{"Written by":"Etienne Bley","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/#article","isPartOf":{"@id":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/"},"author":{"name":"Etienne Bley","@id":"https:\/\/elbsolutions.com\/projects\/#\/schema\/person\/51e717c68f4f5917c63baf88f0896c39"},"headline":"Web Scraping &#8211; how to do it and add stability","datePublished":"2011-12-13T22:29:13+00:00","dateModified":"2022-02-03T17:25:07+00:00","mainEntityOfPage":{"@id":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/"},"wordCount":734,"articleSection":["General"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/","url":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/","name":"Web Scraping - how to do it and add stability - ELB Solutions.com Inc.","isPartOf":{"@id":"https:\/\/elbsolutions.com\/projects\/#website"},"datePublished":"2011-12-13T22:29:13+00:00","dateModified":"2022-02-03T17:25:07+00:00","author":{"@id":"https:\/\/elbsolutions.com\/projects\/#\/schema\/person\/51e717c68f4f5917c63baf88f0896c39"},"breadcrumb":{"@id":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/elbsolutions.com\/projects\/web-scraping-how-to-add-stability\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/elbsolutions.com\/projects\/"},{"@type":"ListItem","position":2,"name":"Web Scraping &#8211; how to do it and add stability"}]},{"@type":"WebSite","@id":"https:\/\/elbsolutions.com\/projects\/#website","url":"https:\/\/elbsolutions.com\/projects\/","name":"ELB 
Solutions.com Inc.","description":"Bringing all your IT Pieces together","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/elbsolutions.com\/projects\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/elbsolutions.com\/projects\/#\/schema\/person\/51e717c68f4f5917c63baf88f0896c39","name":"Etienne Bley","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/f8971dfb65b25b768415568f83247df4057f15d037137e386928a804e2c997b9?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/f8971dfb65b25b768415568f83247df4057f15d037137e386928a804e2c997b9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f8971dfb65b25b768415568f83247df4057f15d037137e386928a804e2c997b9?s=96&d=mm&r=g","caption":"Etienne Bley"},"url":"https:\/\/elbsolutions.com\/projects\/author\/etienne-bley\/"}]}},"_links":{"self":[{"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/posts\/27","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/comments?post=27"}],"version-history":[{"count":10,"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/posts\/27\/revisions"}],"predecessor-version":[{"id":2890,"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/posts\/27\/revisions\/2890"}],"wp:attachment":[{"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/media?parent=27"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/elbsolutions.com\/projects\/wp-json
\/wp\/v2\/categories?post=27"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/elbsolutions.com\/projects\/wp-json\/wp\/v2\/tags?post=27"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}