Sunday, September 1, 2013

Crawling and parsing web pages in javascript directly from your web browser

Introduction

The developer tools built into all modern browsers are powerful in skilled hands. In this post I will show you how to use them (essentially the javascript console) to parse web pages. If you are not familiar with any browser developer tools, please read an introduction first. You should also have basic knowledge of html, javascript and jquery.

I'll use Google Chrome as the web browser.

Idea

In the browser's javascript console we can execute javascript code in the context of the current web page. Using ajax (XMLHttpRequest) we can also fetch html from linked urls and parse it as well (like crawlers do). It isn't complicated or innovative, but two things are worth mentioning.
  • I'll use jquery to keep the code shorter and simpler, thanks to its selectors and built-in ajax method. When a page doesn't already use that library, we need to inject it. How to do that is shown later in "Live example".
  • On ajax-based pages it's easier to disable the browser's origin policy checking, because some ajax requests will otherwise fail with origin errors like "Origin http://www.example.com is not allowed by Access-Control-Allow-Origin".
    In Google Chrome we can do that by starting the browser with the --args --disable-web-security parameter. You can read more about origin policies here and here.
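
Before injecting anything it helps to know whether the page already ships jquery. Below is a minimal sketch of such a check. jqueryVersion is a hypothetical helper name of mine, and it relies only on the version string that jquery stores in jQuery.fn.jquery:

```javascript
// Hypothetical helper: returns the jquery version found on the given
// global object (normally `window`), or null when jquery is not loaded.
function jqueryVersion(win) {
  var jq = win && win.jQuery;
  if (typeof jq === 'function' && jq.fn && typeof jq.fn.jquery === 'string') {
    return jq.fn.jquery;
  }
  return null;
}
```

In the console, jqueryVersion(window) returning null would mean the library has to be injected first.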

Basic example

I prepared a really basic, static web page to demonstrate the idea. The url is http://cinu.pl/research/jsparsing/

The source code of this web page is:

index.html:
<html>

<head>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
</head>

<body>
<a href="a.html">link 1</a>
<a href="b.html">link 2</a>
<a href="c.html">link 3</a>
</body>

</html>

a.html, b.html and c.html each contain a div with the value we want to read:
<html>

<body>
<div class="container">
   <div class="data">VALUE WE WANT TO FETCH</div>
</div>
</body>

</html>

As you can see, index.html already includes the jquery library, so there is no need to inject it.

The parser code is:
var out = ''; // container for fetched values

function parse() {
 $('a').each( // go through each anchor on page and make ajax request to fetch html
  function(idx, item) { 
   var url = $(item).attr('href'); // get url
   console.log('Fetching: '+ url); // debug note
   
   // make ajax request (http://api.jquery.com/jQuery.ajax/)
   $.ajax({
    url: url, 
    async: false, // do it synchronously
   }).done(function(data) { // data variable contains fetched html
    var dataRetrieved = $('div',$(data)).html(); // get value we're looking for
    console.log( 'Retrieved ' +  dataRetrieved); // debug note
    
    out += dataRetrieved + "\n"; // save retrieved value (+ separator)
   });
  } 
 );
 console.log("-----------------\nParsing done, output:\n"+out); // print out parsed values
}

Go to http://cinu.pl/research/jsparsing/, paste the above code into the Developer Tools console and hit enter. To execute it, type "parse()" and hit enter.
Result: the console shows a "Fetching" and a "Retrieved" line for each link, followed by the collected output.

The code is commented, so there is no need to describe what it does. Let's try a more complicated example.
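
One detail the basic example relies on silently: the hrefs in index.html are relative, and the browser resolves them against the address of the current page before making the request. Here is a small sketch of that resolution, assuming an environment that provides the standard URL constructor (current browsers and Node.js do). resolveLinks is a hypothetical helper, not part of the parser above:

```javascript
// Hypothetical helper: turns relative hrefs into absolute urls,
// the same way the browser does for links on the current page.
function resolveLinks(hrefs, pageUrl) {
  return hrefs.map(function (href) {
    return new URL(href, pageUrl).href;
  });
}

var links = resolveLinks(['a.html', 'b.html', 'c.html'],
                         'http://cinu.pl/research/jsparsing/');
console.log(links[0]); // http://cinu.pl/research/jsparsing/a.html
```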

Live example - parsing aliexpress.com

The main goal is to fetch the first 5 items from a product category (I'll use wireless routers as an example) and check whether there is any "feedback" from Poland on the first page of feedback.

The task seems silly and the parsed data is rather useless, but it is only an example that lets me apply the things described above.

Step 1. Injecting JQuery

Since aliexpress doesn't use jquery we need to inject it.
Injection code:
var $jq; // jquery handler to avoid $ conflicts

function injectJquery() {
 var script = document.createElement('script');
 script.setAttribute('type', 'text/javascript');
 script.setAttribute('src', '//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js'); // fetch it from googles CDNs

 // Give $ back to whatever took it before; create new alias to jQuery. 
 script.setAttribute('onload','javascript:$jq = jQuery.noConflict();'); 

 document.body.insertBefore(script, document.body.firstChild); 
}

injectJquery(); // call it automatically when paste into console

Apart from the injection itself, we also call jQuery.noConflict() and assign jquery to $jq instead of $. Other scripts on the page may use $ as well (prototype.js for instance), so we have to give the $ variable back to them. Otherwise some of the javascript on the target page might break.
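
To make the hand-back of $ more concrete, here is a simplified model of the mechanism. It is not jquery's actual source code: loadLibrary, page and fakeJquery are stand-ins I made up for illustration, and the real library does more bookkeeping.

```javascript
// Simplified model of the mechanism, NOT jquery's actual source code.
// `page` stands in for the window object of the target page.
function loadLibrary(page, lib) {
  var previousDollar = page.$;   // remember who owned $ before
  page.$ = lib;                  // the library takes over $
  lib.noConflict = function () {
    if (page.$ === lib) {
      page.$ = previousDollar;   // give $ back to the previous owner
    }
    return lib;                  // the caller keeps its own reference
  };
  return lib;
}

var page = { $: function prototypeDollar() { return 'prototype.js'; } };
var originalDollar = page.$;
var fakeJquery = function () { return 'jquery'; };

loadLibrary(page, fakeJquery);
var $jq = fakeJquery.noConflict();
console.log(page.$ === originalDollar); // true
console.log($jq === fakeJquery);        // true
```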

Step 2. Get the urls of the products whose "feedback" we want to parse



Keep in mind that when we fetch pages through ajax, their javascript is not parsed or executed, so we have to do that work ourselves. Because the "Feedback" tab is loaded dynamically with javascript, the html we get when we fetch a product page won't contain the "Feedback" data. We will handle that in the next step. For now the parser code is:
var productsNum = 5; 

function parse() {
 var urls = $jq('a.product');
 
 for(var i=0;i<productsNum && i<urls.length; i++) {
  var url = $jq(urls[i]).attr('href'); // get url
  console.log('Fetching: '+ url); // debug note
  
  // make ajax request 
  $jq.ajax({
   url: url, 
   async: false, // do it synchronously
  }).done(function(data) { // data variable contains fetched html
   var parsedDom = $jq(data);
   
   // check if it works
   console.log( '[TEST] item price: ' + $jq('#sku-price', parsedDom).html() );
  });
 }
}

Step 3. Find a way to fetch feedback (because it's loaded dynamically through ajax).

First we need the url that the http requests for feedback data go to. To find it, open the Network tab of Developer Tools, check the "Documents" and "XHR" checkboxes (we don't need scripts, images, fonts etc.) and click the "Feedback" tab on the web page.

We can see a couple of interesting urls, like:
http://www.aliexpress.com/store/productGroupsAjax.htm?storeId=413596 [with JSON response]
http://www.aliexpress.com/findRelatedProducts.htm?productId=733919144&type=new [with JSON response]

But what we are looking for is:
http://feedback.aliexpress.com/display/productEvaluation.htm?productId=733919144&ownerMemberId=201779865&companyId=214347019&memberType=seller&startValidDate=&i18n=true
It returns a raw HTML response. When we look at the "response", we see that this is exactly what we are looking for.

Now we need to take a closer look at the parameters in the url, which are:
productId=733919144
ownerMemberId=201779865
companyId=214347019
memberType=seller
startValidDate=
i18n=true
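
The same breakdown can be produced in the console instead of by eye. This is a minimal sketch, assuming a current browser (or Node.js) with the standard URL and URLSearchParams APIs. queryParams is a hypothetical helper name:

```javascript
// Hypothetical helper: returns the query string parameters of a url
// as a plain object (later duplicates overwrite earlier ones).
function queryParams(url) {
  var params = {};
  new URL(url).searchParams.forEach(function (value, key) {
    params[key] = value;
  });
  return params;
}

var feedbackParams = queryParams(
  'http://feedback.aliexpress.com/display/productEvaluation.htm' +
  '?productId=733919144&ownerMemberId=201779865&companyId=214347019' +
  '&memberType=seller&startValidDate=&i18n=true');
console.log(feedbackParams.productId); // 733919144
```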

We can extract productId from the product url. For example, in http://www.aliexpress.com/item/Hot-Sale-Wireless-N-Networking-Device-Wifi-Wi-Fi-Repeater-Booster-Router-Range-Expander-300Mbps-2dBi/733919144.html the product id is 733919144.
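
A small sketch of that extraction, assuming every product url ends with a numeric id followed by .html like the one above (other url layouts are not covered). productIdFromUrl is a hypothetical helper:

```javascript
// Hypothetical helper: pulls the numeric id out of a product url of the
// form .../<digits>.html, ignoring an optional query string or fragment.
function productIdFromUrl(url) {
  var match = /\/(\d+)\.html(?:[?#].*)?$/.exec(url);
  return match ? match[1] : null; // null when the url has another shape
}

console.log(productIdFromUrl(
  'http://www.aliexpress.com/item/Hot-Sale-Wireless-N-Networking-Device-' +
  'Wifi-Wi-Fi-Repeater-Booster-Router-Range-Expander-300Mbps-2dBi/733919144.html'
)); // 733919144
```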

Only two of them are unknown: ownerMemberId and companyId. However, if we look at the product page source code, we will find both inside a script tag (ownerMemberId is stored there as adminSeq):
...
window.runParams.adminSeq="201779865";
window.runParams.companyId="214347019";
...


We need to get them directly from the html code. I'll use regular expressions:
...
 var rx = /window\.runParams\.adminSeq="(\d+)"/; // dots escaped so they match literally
 var arr = rx.exec(data); // data contains product page html
 var adminSeq = arr[1];
 
 rx = /window\.runParams\.companyId="(\d+)"/;
 arr = rx.exec(data); // data contains product page html
 var companyId = arr[1]; 
 
 console.log('Parsed runParams: ' + adminSeq + ' ' +companyId);
... 

If you look closer, you can see that productId is also present in window.runParams in the source code, so we will get it the same way as adminSeq and companyId.
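
Since all three values are read the same way, the lookup can be folded into one function. This is only a sketch: getRunParam is a hypothetical helper, it assumes name is a plain identifier, and it returns null instead of throwing when the page doesn't contain the value:

```javascript
// Hypothetical helper: reads window.runParams.<name>="<digits>" from raw html.
// Assumes `name` is a plain identifier (no regex special characters).
function getRunParam(html, name) {
  var rx = new RegExp('window\\.runParams\\.' + name + '="(\\d+)"');
  var match = rx.exec(html);
  return match ? match[1] : null; // null when the value is missing
}

var sampleHtml = 'window.runParams.adminSeq="201779865";\n' +
                 'window.runParams.companyId="214347019";';
console.log(getRunParam(sampleHtml, 'adminSeq'));  // 201779865
console.log(getRunParam(sampleHtml, 'companyId')); // 214347019
```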

parse() function now looks like this:
var productsNum = 5; 

function parse() {
 var urls = $jq('a.product');
 
 for(var i=0;i<productsNum && i<urls.length; i++) {
  var url = $jq(urls[i]).attr('href'); // get url
  console.log('Fetching: '+ url); // debug note
  
  // make ajax request 
  $jq.ajax({
   url: url, 
   async: false, // do it synchronously
  }).done(function(data) { // data variable contains fetched html
   //var parsedDom = $jq(data); // we dont need parsedDom since we will be executing regular expressions on raw html
   
   // construct feedbackUrl:
   var rx = /window\.runParams\.adminSeq="(\d+)"/; // dots escaped so they match literally
   var arr = rx.exec(data); // data contains product page html
   var adminSeq = arr[1];
   
   rx = /window\.runParams\.companyId="(\d+)"/;
   arr = rx.exec(data); // data contains product page html
   var companyId = arr[1]; 
   
   rx = /window\.runParams\.productId="(\d+)"/;
   arr = rx.exec(data); // data contains product page html
   var productId = arr[1];    

   var feedbackUrl = 'http://feedback.aliexpress.com/display/productEvaluation.htm?productId='+productId+'&ownerMemberId='+adminSeq+'&companyId='+companyId+'&memberType=seller&startValidDate=&i18n=true';
   
   console.log('Feedback url: '+feedbackUrl);
   
   // here we'll make another ajax call to fetch feedback data
  });
 }
}
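
The long string concatenation for feedbackUrl is easy to get wrong, so here is a sketch of the same url built from a list of parameters. buildFeedbackUrl is a hypothetical helper. The parameter names and fixed values are copied from the url observed in the Network tab, and encodeURIComponent is added as a precaution even though the ids are plain digits:

```javascript
// Hypothetical helper: builds the feedback url from the three parsed values.
function buildFeedbackUrl(productId, adminSeq, companyId) {
  var params = [
    ['productId', productId],
    ['ownerMemberId', adminSeq], // adminSeq from the page is sent as ownerMemberId
    ['companyId', companyId],
    ['memberType', 'seller'],
    ['startValidDate', ''],
    ['i18n', 'true']
  ];
  var query = params.map(function (pair) {
    return encodeURIComponent(pair[0]) + '=' + encodeURIComponent(pair[1]);
  }).join('&');
  return 'http://feedback.aliexpress.com/display/productEvaluation.htm?' + query;
}

console.log(buildFeedbackUrl('733919144', '201779865', '214347019'));
```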

Step 4. Final step: avoid origin policy checking, parse the feedback html and check for the searched country

If we try to make an ajax call to the prepared feedbackUrl in our parse() function, the console shows the browser error "Origin http://www.aliexpress.com is not allowed by Access-Control-Allow-Origin". In Google Chrome we can bypass it by adding --args --disable-web-security when we start the binary.

Looking at the feedback html, we can see that the flag indicating the user's country is marked up as follows:
<span class="state"><b class="css_flag css_br"></b></span>
A simple jquery selector will do the job:
$jq('b.css_'+countryCode);
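
For comparison, the same check can be done on the raw html string without building a DOM. This is a rough sketch with a regular expression, not a replacement for the selector: hasCountryFlag is a hypothetical helper, and it assumes the flag is always a b tag with a double-quoted class attribute as in the snippet above.

```javascript
// Hypothetical helper: true when the html contains a <b> tag whose class
// list includes css_<countryCode>. Assumes double-quoted class attributes.
function hasCountryFlag(html, countryCode) {
  var wanted = 'css_' + countryCode;
  var rx = /<b\s[^>]*class="([^"]*)"/g;
  var match;
  while ((match = rx.exec(html)) !== null) {
    // compare whole class names, so css_b does not match css_br
    if (match[1].split(/\s+/).indexOf(wanted) !== -1) {
      return true;
    }
  }
  return false;
}

var flagHtml = '<span class="state"><b class="css_flag css_br"></b></span>';
console.log(hasCountryFlag(flagHtml, 'br')); // true
console.log(hasCountryFlag(flagHtml, 'pl')); // false
```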

The final code is:
// jquery injection
var $jq; // jquery handler to avoid $ conflicts

function injectJquery() {
 var script = document.createElement('script');
 script.setAttribute('type', 'text/javascript');
 script.setAttribute('src', '//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js'); // fetch it from googles CDNs

 // Give $ back to whatever took it before; create new alias to jQuery. 
 script.setAttribute('onload','javascript:$jq = jQuery.noConflict();'); 

 document.body.insertBefore(script, document.body.firstChild); 
}

injectJquery();

// parsing
var productsNum = 5; 

function parse(country) {
 var urls = $jq('a.product');
 
 for(var i=0;i<productsNum && i<urls.length; i++) {
  var url = $jq(urls[i]).attr('href'); // get url
  console.log('Fetching: '+ url); // debug note
  
  // make ajax request 
  $jq.ajax({
   url: url, 
   async: false, // do it synchronously
  }).done(function(data) { // data variable contains fetched html
   //var parsedDom = $jq(data); // we dont need parsedDom since we will be executing regular expressions on raw html
   
   // construct feedbackUrl:
   var rx = /window\.runParams\.adminSeq="(\d+)"/; // dots escaped so they match literally
   var arr = rx.exec(data); // data contains product page html
   var adminSeq = arr[1];
   
   rx = /window\.runParams\.companyId="(\d+)"/;
   arr = rx.exec(data); // data contains product page html
   var companyId = arr[1]; 
   
   rx = /window\.runParams\.productId="(\d+)"/;
   arr = rx.exec(data); // data contains product page html
   var productId = arr[1];    

   var feedbackUrl = 'http://feedback.aliexpress.com/display/productEvaluation.htm?productId='+productId+'&ownerMemberId='+adminSeq+'&companyId='+companyId+'&memberType=seller&startValidDate=&i18n=true';
   
   // get feedback page and check if there is searched country
   $jq.ajax({ // to make that request we need to disable web security in google chrome
    url: feedbackUrl, 
    async: false,
   }).done(function(data) {
    console.log( $jq('b.css_'+country, $jq(data)).length ); // debug note: number of matching flags
    
    // check if element with css_country class exists:
    if ( $jq('b.css_'+country, $jq(data)).length ) {
     console.log('[FOUND] item: '+url);
    }
   });
  });
 }
}
We run it with parse('pl') when we want to check whether there is any feedback from Poland.

Some thoughts

In the above example we operated on raw html code. Working with json is a lot easier, because we don't need regular expressions, jquery selectors, etc. to extract the data.

Another thing is that we don't have to write the data to the console log. We can inject a div into the web page and store the results in it.
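
A minimal sketch of that idea, using only standard DOM calls. formatResults and showResults are hypothetical helpers, and I use a pre element instead of a div so that line breaks in the output are preserved:

```javascript
// Hypothetical helper: turns a list of found urls into numbered lines.
function formatResults(urls) {
  return urls.map(function (url, i) {
    return (i + 1) + '. ' + url;
  }).join('\n');
}

// Hypothetical helper: injects the formatted results at the top of the page.
// `doc` is the page's document object.
function showResults(urls, doc) {
  var box = doc.createElement('pre');
  box.textContent = formatResults(urls);
  doc.body.insertBefore(box, doc.body.firstChild);
  return box;
}
```

In the console this would be called as showResults(found, document), where found is an array collected by parse() in place of the console.log('[FOUND] ...') call.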
