Challenge To Scrape A Page With VBA

amirkhosravi

New Member
Joined
Jul 5, 2020
Messages
1
Office Version
  1. 2019
Platform
  1. Windows
Hi,
I am trying to scrape and extract data from the link below by writing code in vba in excel:

.:TSETMC:. :: دیده بان بازار پیشرفته

I used different techniques:

  • MSXML2.XMLHTTP60 it does not work
  • MSXML2.ServerXMLHTTP60 it does not work
  • SHDocVw.InternetExplorer beside it is too slow, it rarely works.
In Facts, when I open the link in Firefox or chrome, the page is ok and it is displayed correctly but when I request the page through "MSXML2.XMLHTTP60" or "MSXML2.ServerXMLHTTP60", the returned response is completely different from what it must be.

I should say that other links of this site have the similar behavior, for example: .:TSETMC:. :: اطلاعات نماد.

I guess the site is designed dynamically and uses JavaScript to load contents during the loading procedure. Also, when using excel vba, it seems that the server recognizes that the request is not sent from a browser.

Please help to find a solution and scrape the table in the mentioned URL.

Sub CreateMainList()

Dim MainURL As String
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim MainDiv As MSHTML.IHTMLElement
Dim MainDivChildren As MSHTML.IHTMLElementCollection
Dim Res As String
Dim price As Integer

'MainURL = ThisWorkbook.Worksheets("Home").Range("C2").Value
MainURL = ".:TSETMC:. :: دیده بان بازار پیشرفته"

XMLReq.Open "GET", MainURL, False
'XMLReq.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
'XMLReq.setRequestHeader "Content-Type", "text/html; charset=utf-8"
'XMLReq.setRequestHeader "Content-Type", "text/html; charset=utf-8"
XMLReq.setRequestHeader "Accept-Language", "en-US,en;q=0.5"
XMLReq.setRequestHeader "Connection", "keep-alive"
XMLReq.setRequestHeader "accept-Encoding", "gzip , deflate"
XMLReq.setRequestHeader "accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
XMLReq.setRequestHeader "DNT", "1"
'XMLReq.setRequestHeader "Upgrade-Insecure- Requests", "1"
XMLReq.setRequestHeader "Set-Cookie", "ASP.NET_SessionId=cd03mksrog04g2ocuaeqxweb; path=/; HttpOnly"
'XMLReq.setRequestHeader "Cache-Control", "Max-age = 0"


'XMLReq.setRequestHeader "Cookie", MyCookie
XMLReq.send

If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If

' Get the webpage response data into a variable.
'response = StrConv(request.responseBody, vbUnicode)

HTMLDoc.body.innerHTML = XMLReq.responseText
Debug.Print XMLReq.responseText

Set XMLReq = Nothing

Set MainDiv = HTMLDoc.getElementById("main")


End Sub

and the output is:

<!doctype html><html><head><title>.:TSETMC:. :: ÏíÏå ÈÇä ÈÇÒÇÑ íÔÑÝÊå</title><!-- Global site tag (gtag.js) - Google Analytics --><script async src="https://www.googletagmanager.com/gtag/js?id=UA-63076930-1"></script><script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-63076930-1');</script><script>var DBFlag='1' ;var LongRunnigPagesSite='http://cdn2.tsetmc.com';function ens(ty,sv){var lv=localStorage.getItem("v_"+ty);var t;if (1==1 || lv!=sv){var oX=new XMLHttpRequest();oX.open('GET', 'tsev2/res/loader.aspx?t='+ty+'&_'+sv,false);oX.send(null);t=oX.responseText;if(t[t.length-1]!=';') return;localStorage.setItem("v_"+ty,sv);localStorage.setItem("t_"+ty,t)}else{t=localStorage.getItem("t_"+ty)}if (ty=='s'){document.write("<style>"+t+"</style>")}else{var h=document.getElementsByTagName('HEAD').item(0);var s=document.createElement("script");s.defer=true;s.text=t;h.appendChild(s)}}var version='472';ens('j',version); ens('s',v
ersion); (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })</script></head><body><div style="display:none">PSB is being developed by Pooya Paridel.</div><div id="FastView" class="slideTop"></div><div id="TopBar"><div id="company"><div id="company_logo"></div><div id="company_text">ÔÑßÊ ãÏíÑíÊ ÝäÇæÑí ÈæÑÓ ÊåÑÇä<br/> Tehran Securities Exchange Technology Management Co</div></div><div id="menu_btns"></div><script>ShowMenuIcon(1)</script></div><div class="MainContainer"><form method="post" action="./Loader.aspx?ParTree=15131F" id="form1">
<div class="aspNetHidden">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="" />
</div>

<div id="display" style="direction:rtl;font-size:12px;text-align:right">
</div>
<script>
ens('g', version);
if (location.href.split('&')[1] == "E")
{ens('me', version); mw = MarketWatchPlus(); mw.StartMarketWatch() }
else {
ens('p', version); ens('tw', version); tw = TweetEngine();mw = MarketWatchPlus(); mw.StartMarketWatch()
}
</script>
</form>
</div></body></html>
 

Excel Facts

Using Function Arguments with nested formulas
If writing INDEX in Func. Arguments, type MATCH(. Use the mouse to click inside MATCH in the formula bar. Dialog switches to MATCH.
Hi there,
2 options:
Browser automation in Excel VBA using Selenium - Coding is Love -> try selenium, basically automating Chrome from VBA
or try diving a bit deeper into the page:
if you load the page in e.g. Chrome, right click "inspect" and go for the network tab. Reload the page. That should show you which files are loaded in sequence. In your example:
Loader.aspx?ParTree=15131F -> this gives the response you show, here stops the XML road
js?id=UA-63076930-1 -> some google javascript
MarketWatchInit.aspx?h=0&r=0 -> this actually provides the data for the page with a GET command
So you could try in your code getting that last file only, it seems to have most of the data of the page.

Hope that gets you started,
Koen
 
Upvote 0

Forum statistics

Threads
1,223,886
Messages
6,175,191
Members
452,616
Latest member
intern444

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top