Introduction
The developer tools built into all modern browsers are powerful in skilled hands. In this post I will show how you can use them (essentially the JavaScript console) to parse web pages. If you are not familiar with browser developer tools, please read an introduction first. You should also have basic knowledge of HTML, JavaScript and jQuery. I'll use Google Chrome as the web browser.
Idea
Basically, the browser's JavaScript console lets us execute JavaScript code in the context of the current web page. Using ajax (XMLHttpRequest) we can also fetch HTML from nested URLs and parse it as well (like crawlers do). It isn't complicated or innovative, but two things are worth mentioning.
- I'll use jQuery to keep the code shorter and simpler, thanks to its selectors and built-in ajax method. When a page doesn't already use that library, we need to inject it. The "Live example" section shows how to do that.
- On ajax-based pages it's better to disable the browser's origin policy checking, because ajax requests will sometimes trigger origin errors like "Origin http://www.example.com is not allowed by Access-Control-Allow-Origin".
In Google Chrome we can do this by launching it with the --args --disable-web-security parameters. You can read more about origin policies here and here.
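The fetch-and-parse idea can be sketched without jQuery or a browser. The following is only an illustration under stated assumptions: fakePages, fetchHtml, extractData and collectValues are hypothetical names, the "network" is a plain lookup table, and a regular expression stands in for a jQuery selector.

```javascript
// Hypothetical stand-in for the network: maps a URL to the HTML it would return.
var fakePages = {
  'page1.html': '<div class="data">first</div>',
  'page2.html': '<div class="data">second</div>'
};

// Fetch step: in the browser this would be an ajax call; here it is a lookup.
function fetchHtml(url) {
  return fakePages[url];
}

// Extract step: a simple regex standing in for a jQuery selector.
function extractData(html) {
  var match = /<div class="data">([^<]*)<\/div>/.exec(html);
  return match ? match[1] : null;
}

// Crawl step: visit every URL, fetch its HTML and collect the extracted values.
function collectValues(urls, fetchFn, extractFn) {
  return urls.map(function (url) {
    return extractFn(fetchFn(url));
  });
}

var values = collectValues(['page1.html', 'page2.html'], fetchHtml, extractData);
console.log(values.join('\n'));
```

Keeping the fetch and extract steps as separate functions makes it easy to swap the lookup for a real ajax call once the code runs in the console.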
Basic example
I prepared a really basic, static web page to demonstrate the idea. The URL is http://cinu.pl/research/jsparsing/
The source code of this web page is:
index.html:
    <html>
      <head>
        <script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
      </head>
      <body>
        <a href="a.html">link 1</a>
        <a href="b.html">link 2</a>
        <a href="c.html">link 3</a>
      </body>
    </html>
a.html, b.html and c.html each contain a div with the value we want to read:
    <html>
      <body>
        <div class="container">
          <div class="data">VALUE WE WANT TO FETCH</div>
        </div>
      </body>
    </html>
As you can see, index.html already includes the jQuery library, so there is no need to inject it.
The parser code is:
    var out = ''; // container for fetched values

    function parse() {
        // go through each anchor on the page and make an ajax request to fetch its html
        $('a').each(function(idx, item) {
            var url = $(item).attr('href'); // get url
            console.log('Fetching: ' + url); // debug note
            // make ajax request (http://api.jquery.com/jQuery.ajax/)
            $.ajax({
                url: url,
                async: false, // do it synchronously
            }).done(function(data) { // data variable contains fetched html
                var dataRetrieved = $('div', $(data)).html(); // get value we're looking for
                console.log('Retrieved ' + dataRetrieved); // debug note
                out += dataRetrieved + "\n"; // save retrieved value (+ separator)
            });
        });
        console.log("-----------------\nParsing done, output:\n" + out); // print out parsed values
    }
Go to http://cinu.pl/research/jsparsing/, paste the above code into the Developer Tools console and hit enter. To execute it, type "parse()" and hit enter.
The console then logs each fetched URL and retrieved value, followed by the combined output.
The code is commented, so there is no need to describe what it does. Let's try a more complicated example.
Live example - parsing aliexpress.com
The main goal is to fetch the first 5 items from a product category (I'll use wireless routers as an example) and check whether there is any "feedback" from Poland on the first page of feedback. The task may seem silly and the parsed data rather useless, but it is only an example that lets me put the techniques described above to use.
Step 1. Injecting jQuery
Since aliexpress doesn't use jQuery, we need to inject it. Injection code:
    var $jq; // jquery handler to avoid $ conflicts

    function injectJquery() {
        var script = document.createElement('script');
        script.setAttribute('type', 'text/javascript');
        script.setAttribute('src', '//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js'); // fetch it from Google's CDN
        // Give $ back to whatever took it before; create new alias to jQuery.
        script.setAttribute('onload', 'javascript:$jq = jQuery.noConflict();');
        document.body.insertBefore(script, document.body.firstChild);
    }

    injectJquery(); // call it automatically when pasted into the console
Apart from the simple injection, we also call jQuery.noConflict() and assign jQuery to $jq rather than $. We need to do that because other scripts on the page may also use $ (prototype.js, for instance). If we did not give the $ variable back, parts of the target page's JavaScript might break.
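To make the hand-back of $ more concrete, here is a simplified model of the mechanism. It is not jQuery's actual source: page stands in for the window object, and loadFakeJquery is a hypothetical function that imitates what loading the library does to the globals.

```javascript
// Simplified model of a page's global scope (stands in for `window`).
var page = { $: 'prototype.js helper' };

// Simplified model of what loading jQuery does: remember the old `$`,
// then take over both `$` and `jQuery`.
function loadFakeJquery(globals) {
  var previousDollar = globals.$;
  var lib = {
    name: 'fake jQuery',
    noConflict: function () {
      // Hand `$` back, but only if it still points at this library.
      if (globals.$ === lib) {
        globals.$ = previousDollar;
      }
      return lib;
    }
  };
  globals.$ = lib;
  globals.jQuery = lib;
}

loadFakeJquery(page);
var $jq = page.jQuery.noConflict();

console.log(page.$);   // the previous owner of `$` is back in place
console.log($jq.name); // the library stays reachable through the new alias
```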
Step 2. Get the URLs of the products whose "feedback" we want to parse
We need to remember that when we fetch pages through ajax, their JavaScript is not parsed or executed, so we have to reproduce its work manually. Because the "Feedback" tab is loaded dynamically with JavaScript, we won't get the "Feedback" data in the HTML when we fetch a product page. We will handle that in the next step; for now the parser code is:
    var productsNum = 5;

    function parse() {
        var urls = $jq('a.product');
        for (var i = 0; i < productsNum && i < urls.length; i++) {
            var url = $jq(urls[i]).attr('href'); // get url
            console.log('Fetching: ' + url); // debug note
            // make ajax request
            $jq.ajax({
                url: url,
                async: false, // do it synchronously
            }).done(function(data) { // data variable contains fetched html
                var parsedDom = $jq(data);
                // check if it works
                console.log('[TEST] item price: ' + $jq('#sku-price', parsedDom).html());
            });
        }
    }
Step 3. Find a way to fetch feedback (because it's fetched dynamically through ajax)
First of all, we need to find the URL that the HTTP requests for feedback data go to. To do that, we open the Network tab of Developer Tools, click the "Feedback" tab on the web page and check the "Documents" and "XHR" checkboxes (we don't need scripts, images, fonts etc.). We can see a couple of interesting URLs like:
    http://www.aliexpress.com/store/productGroupsAjax.htm?storeId=413596 [with JSON response]
    http://www.aliexpress.com/findRelatedProducts.htm?productId=733919144&type=new [with JSON response]
But what we are looking for is:
    http://feedback.aliexpress.com/display/productEvaluation.htm?productId=733919144&ownerMemberId=201779865&companyId=214347019&memberType=seller&startValidDate=&i18n=true
It contains a raw HTML response. When we look at the response, we see that this is exactly what we are looking for.
Now we need to take a closer look at the parameters in the URL, which are:
    productId=733919144
    ownerMemberId=201779865
    companyId=214347019
    memberType=seller
    startValidDate=
    i18n=true
We can extract productId from the product URL. For example, in http://www.aliexpress.com/item/Hot-Sale-Wireless-N-Networking-Device-Wifi-Wi-Fi-Repeater-Booster-Router-Range-Expander-300Mbps-2dBi/733919144.html the product id is 733919144.
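A minimal sketch of that extraction, assuming every product URL ends with "/<digits>.html" as the example above does. The function name productIdFromUrl is hypothetical.

```javascript
// Returns the numeric product id at the end of a product URL, or null
// when the URL does not follow the assumed ".../<digits>.html" pattern.
function productIdFromUrl(url) {
  // Allow an optional query string or fragment after ".html".
  var match = /\/(\d+)\.html(?:[?#].*)?$/.exec(url);
  return match ? match[1] : null;
}

var exampleUrl = 'http://www.aliexpress.com/item/Hot-Sale-Wireless-N-Networking-Device-Wifi-Wi-Fi-Repeater-Booster-Router-Range-Expander-300Mbps-2dBi/733919144.html';
console.log(productIdFromUrl(exampleUrl));
```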
Only two of the parameters are unknown: ownerMemberId and companyId. However, if we look at the product page source code, we will find them inside a script tag:
    ...
    window.runParams.adminSeq="201779865";
    window.runParams.companyId="214347019";
    ...
We need to get them directly from the HTML code. I'll use regular expressions:
    ...
    // dots are escaped so that they match a literal "."
    var rx = /window\.runParams\.adminSeq="(\d+)"/;
    var arr = rx.exec(data); // data contains product page html
    var adminSeq = arr[1];

    rx = /window\.runParams\.companyId="(\d+)"/;
    arr = rx.exec(data);
    var companyId = arr[1];

    console.log('Parsed runParams: ' + adminSeq + ' ' + companyId);
    ...
If you look closer, you can see that productId is also present in the source code in window.runParams, so we will read it the same way as adminSeq and companyId.
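Since all three values are read the same way, the lookup could be wrapped in one helper that also copes with a missing value. This is a sketch only: getRunParam is a hypothetical name, and it assumes the page keeps the window.runParams.NAME="digits" format shown above.

```javascript
// Returns the value assigned to window.runParams.<name> in the page source,
// or null when no such assignment is found.
// `name` is assumed to be a plain identifier such as "adminSeq".
function getRunParam(html, name) {
  // Dots are escaped so that they match literally.
  var rx = new RegExp('window\\.runParams\\.' + name + '="(\\d+)"');
  var match = rx.exec(html);
  return match ? match[1] : null;
}

// Sample input modelled on the snippet from the product page source.
var sampleHtml = 'window.runParams.adminSeq="201779865"; window.runParams.companyId="214347019";';

console.log(getRunParam(sampleHtml, 'adminSeq'));
console.log(getRunParam(sampleHtml, 'companyId'));
console.log(getRunParam(sampleHtml, 'productId')); // not present in the sample
```

Returning null instead of reading arr[1] directly avoids a TypeError when the page layout changes and the pattern no longer matches.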
parse() function now looks like this:
    var productsNum = 5;

    function parse() {
        var urls = $jq('a.product');
        for (var i = 0; i < productsNum && i < urls.length; i++) {
            var url = $jq(urls[i]).attr('href'); // get url
            console.log('Fetching: ' + url); // debug note
            // make ajax request
            $jq.ajax({
                url: url,
                async: false, // do it synchronously
            }).done(function(data) { // data variable contains fetched product page html
                // we don't need a parsed DOM here, since the regular expressions run on raw html

                // construct feedbackUrl (dots are escaped so that they match a literal ".")
                var rx = /window\.runParams\.adminSeq="(\d+)"/;
                var arr = rx.exec(data);
                var adminSeq = arr[1];

                rx = /window\.runParams\.companyId="(\d+)"/;
                arr = rx.exec(data);
                var companyId = arr[1];

                rx = /window\.runParams\.productId="(\d+)"/;
                arr = rx.exec(data);
                var productId = arr[1];

                var feedbackUrl = 'http://feedback.aliexpress.com/display/productEvaluation.htm?productId=' + productId + '&ownerMemberId=' + adminSeq + '&companyId=' + companyId + '&memberType=seller&startValidDate=&i18n=true';
                console.log('Feedback url: ' + feedbackUrl);
                // here we'll make another ajax call to fetch feedback data
            });
        }
    }
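The long string concatenation for feedbackUrl can also be written as a small builder that encodes each parameter. This is one possible sketch, not the only way to do it: buildFeedbackUrl is a hypothetical helper, and the parameter values are the ones from the example request above.

```javascript
// Builds the feedback URL from a plain object of query parameters.
// Keys and values are passed through encodeURIComponent.
function buildFeedbackUrl(params) {
  var base = 'http://feedback.aliexpress.com/display/productEvaluation.htm';
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
  return base + '?' + query;
}

var feedbackUrl = buildFeedbackUrl({
  productId: '733919144',
  ownerMemberId: '201779865',
  companyId: '214347019',
  memberType: 'seller',
  startValidDate: '',
  i18n: 'true'
});

console.log(feedbackUrl);
```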
Step 4. Final step: bypassing origin policy checking, parsing the feedback HTML and checking for the searched country
If we try to make an ajax call to the prepared feedbackUrl in our parse() function, the console will show the browser error "Origin http://www.aliexpress.com is not allowed by Access-Control-Allow-Origin". In Google Chrome we can bypass it by adding --args --disable-web-security when launching the binary.
Looking into the feedback HTML, we can see that the flag indicating the user's country is marked up as follows:
    <span class="state"><b class="css_flag css_br"></b></span>
A simple jQuery selector will do the job:
    $jq('b.css_'+countryCode);
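Because the country code is concatenated straight into the selector, it may be worth validating it first. The sketch below assumes that the flag classes always use two lowercase letters, as css_br does; that is a guess based on the single example above. flagSelector is a hypothetical helper.

```javascript
// Builds the selector for a country flag, e.g. "b.css_pl".
// Assumption: flag classes are "css_" plus a two-letter lowercase code.
function flagSelector(countryCode) {
  var code = String(countryCode).toLowerCase();
  if (!/^[a-z]{2}$/.test(code)) {
    throw new Error('Unexpected country code: ' + countryCode);
  }
  return 'b.css_' + code;
}

console.log(flagSelector('pl'));
console.log(flagSelector('BR'));
```

In the console the result would be passed to jQuery, for example $jq(flagSelector('pl'), $jq(data)).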
The final code is:
    // jquery injection
    var $jq; // jquery handler to avoid $ conflicts

    function injectJquery() {
        var script = document.createElement('script');
        script.setAttribute('type', 'text/javascript');
        script.setAttribute('src', '//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js'); // fetch it from Google's CDN
        // Give $ back to whatever took it before; create new alias to jQuery.
        script.setAttribute('onload', 'javascript:$jq = jQuery.noConflict();');
        document.body.insertBefore(script, document.body.firstChild);
    }

    injectJquery();

    // parsing
    var productsNum = 5;

    function parse(country) {
        var urls = $jq('a.product');
        for (var i = 0; i < productsNum && i < urls.length; i++) {
            var url = $jq(urls[i]).attr('href'); // get url
            console.log('Fetching: ' + url); // debug note
            // make ajax request
            $jq.ajax({
                url: url,
                async: false, // do it synchronously
            }).done(function(data) { // data variable contains fetched product page html
                // we don't need a parsed DOM here, since the regular expressions run on raw html

                // construct feedbackUrl (dots are escaped so that they match a literal ".")
                var rx = /window\.runParams\.adminSeq="(\d+)"/;
                var arr = rx.exec(data);
                var adminSeq = arr[1];

                rx = /window\.runParams\.companyId="(\d+)"/;
                arr = rx.exec(data);
                var companyId = arr[1];

                rx = /window\.runParams\.productId="(\d+)"/;
                arr = rx.exec(data);
                var productId = arr[1];

                var feedbackUrl = 'http://feedback.aliexpress.com/display/productEvaluation.htm?productId=' + productId + '&ownerMemberId=' + adminSeq + '&companyId=' + companyId + '&memberType=seller&startValidDate=&i18n=true';

                // get feedback page and check if the searched country is there
                $jq.ajax({ // to make this request we need to disable web security in Google Chrome
                    url: feedbackUrl,
                    async: false,
                }).done(function(feedbackHtml) {
                    var flags = $jq('b.css_' + country, $jq(feedbackHtml));
                    console.log(flags.length); // debug note: number of matching flags
                    // check if an element with the css_<country> class exists:
                    if (flags.length) {
                        console.log('[FOUND] item: ' + url);
                    }
                });
            });
        }
    }
To check whether there is any feedback from Poland, we execute it with parse('pl').
Some thoughts
In the above example we operated on raw HTML code. Working with JSON is a lot easier, because we don't need regular expressions, jQuery selectors, etc. to fetch the data.
Another thing is that we don't have to store the data in the console log. We can inject a div into the web page and store the results in it.
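Both thoughts can be illustrated with a short sketch. The JSON shape below is invented purely for illustration; the field names of the real endpoints are not known here. showResults is a hypothetical helper that appends the results to the page when a DOM is available and falls back to the console otherwise.

```javascript
// Hypothetical JSON response; the "products", "id" and "title" fields are made up.
var responseText = '{"products":[{"id":"1","title":"Router A"},{"id":"2","title":"Router B"}]}';

// With JSON there is no need for regular expressions or selectors.
var response = JSON.parse(responseText);
var results = response.products.map(function (product) {
  return product.id + ': ' + product.title;
});

// Writes the results into the page if there is one, otherwise logs them.
function showResults(lines) {
  var text = lines.join('\n');
  if (typeof document !== 'undefined' && document.body) {
    var box = document.createElement('pre');
    box.textContent = text;
    document.body.appendChild(box);
  } else {
    console.log(text);
  }
  return text;
}

showResults(results);
```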