VBA Scraping from dynamic web page

Bruzio

Board Regular
Joined
Aug 20, 2020
Messages
85
Office Version
  1. 2016
Platform
  1. Windows
  2. Mobile
Hi everybody. I'm trying to download soccer results from Football results 21 august 2020, I can't as I'd like.
My question was posted on a italian forum, but we were unable to resolve.
I trust in your knowledge.


What do i need?

6 columns:

LEAGUE - START - HOME - AWAY - SCORE- STATUS

What are the problems?

1) Macro doesn't download all scheduled events.
2) The content of the rows isn't correct
2) The web page has a more results button that is always visible and clickable, and i need to load all hidden pages.

I expect this, see screen:

This is my current code:

VBA Code:
Sub fromweb()
    Dim IE As Object
    Dim Doc As Object
    Dim x As Object
    Dim i As Long
    Dim myArray As Variant
    Set IE = CreateObject("InternetExplorer.Application")
    Application.ScreenUpdating = False
    Cells.Clear
    Const myURL As String = "fctables.com/livescore/21_08_2020/"
    With IE
        .navigate myURL
        .Visible = True
        Do While .Busy: DoEvents: Loop
        Do While .readyState <> 4: DoEvents: Loop
        Application.Wait Now + TimeValue("0:00:10")
    End With
    myArray = Array("LEAGUE", "START", "HOME", "AWAY", "SCORE", "STATUS")
    With Range("A1:F1")
        .Value = myArray
    End With
    Set Doc = IE.Document
    For Each x In Doc.getElementsByClassName("league")
        i = i + 1: j = 0
        Cells(i + 1, 1) = Replace(Replace(x.innerText, "Table", ""), Chr(10), "")
        For Each y In Doc.getElementsByClassName("col-xs-8 col-sm-9 col-lg-10 truncate")
            Cells(i + 1, 1) = y.innerText
            Cells(i + 1, 2) = Doc.getElementsByClassName("date")(j).innerText
            Cells(i + 1, 3) = Doc.getElementsByClassName("home")(j).innerText
            Cells(i + 1, 4) = Doc.getElementsByClassName("away")(j).innerText
            Cells(i + 1, 5) = Doc.getElementsByClassName("score")(j).innerText
            Cells(i + 1, 6) = Doc.getElementsByClassName("status_name")(j).innerText
            j = j + 1: i = i + 1
        Next
    Next
    IE.Quit
    Set IE = Nothing
    Set Doc = Nothing
    Application.ScreenUpdating = True
End Sub

How to proceed correctly?
Thanks for the attention
 

Attachments

  • test.jpg
    test.jpg
    109.2 KB · Views: 158
Hi, I've done new test.
Please, help me to download leagues names and all matches.
Here is new code:

VBA Code:
Sub newtest()
    Dim html As HTMLDocument, i As Long, ws As Worksheet
    Dim league As Object, start As Object, status As Object, setup As Object, home As Object, away As Object, score As Object

    Set html = New HTMLDocument
    Set ws = ThisWorkbook.Worksheets("newtest")
   
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.fctables.com/livescore/31_08_2020", False
        .send
        html.body.innerHTML = .responseText
    End With
    
    Set start = html.querySelectorAll(".date")
    Set home = html.querySelectorAll(".home")
    Set away = html.querySelectorAll(".away")
    Set status = html.querySelectorAll(".status_name")
    Set score = html.querySelectorAll(".score")
 
    For i = 0 To start.Length - 1
        With ws
            .Cells(i + 1, 1) = start.item(i).innerText
            .Cells(i + 1, 2) = home.item(i).innerText
            .Cells(i + 1, 3) = away.item(i).innerText
            .Cells(i + 1, 4) = status.item(i).innerText
            .Cells(i + 1, 5) = score.item(i).innerText
        End With
    Next
End Sub
 
Upvote 0

Excel Facts

Using Function Arguments with nested formulas
If writing INDEX in Func. Arguments, type MATCH(. Use the mouse to click inside MATCH in the formula bar. Dialog switches to MATCH.
I think you'll find the XMLhttp request approach more difficult than automating IE, because the web page calls a JavaScript every time you click the 'Show more matches' button or scroll down the page to load more leagues and matches. XMLhttp requests would have to emulate what the browser does.

The following code uses IE automation to click the 'Show more matches' button multiple times until no more matches are loaded. The main difficulty is that the web page gives no indication that all the matches for a specific date have been loaded, and the button is still active. An initial loop counts the number of leagues on the current page, clicks the button and exits the loop when the number of leagues hasn't changed for 10 seconds. After this it assumes that all leagues and matches have been loaded and the data is extracted into the active sheet.

This code requires the references noted at the top of the module - set these via Tools -> References in the VBA editor.

Note that the web page runs scripts in the background and I found that IE crashed several times, or the VBA error noted in the code occurs. I think this is caused by scripts updating the web page whilst the VBA code is running and extracting data.

VBA Code:
'References
'Microsoft Internet Controls
'Microsoft HTML Object Library

Option Explicit

#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
    Private Declare PtrSafe Function SetForegroundWindow Lib "user32.dll" (ByVal hwnd As LongPtr) As LongPtr
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal milliseconds As Long)
    Private Declare Function SetForegroundWindow Lib "user32.dll" (ByVal hwnd As Long) As Long
#End If

Public Sub IE_Extract_Results()

    Dim IE As InternetExplorer
    Dim URL As String
    Dim HTMLdoc As HTMLDocument
    Dim moreMatchesDiv As HTMLDivElement
    Dim leagueDivs As IHTMLElementCollection
    Dim leagueDiv As HTMLDivElement
    Dim gameDivs As IHTMLElementCollection
    Dim gameDiv As HTMLDivElement
    Dim league As String
    Dim parts As Variant
    Dim numLeagues As Long
    Dim timeout As Date
    Dim home As String, away As String
    Dim destCell As Range, r As Long, i As Long
    Dim dateInput As String
    
    dateInput = InputBox("Enter results date (dd-mm-yyyy)", "IE Extract Results", Format(Date - 1, "dd-mm-yyyy"))
    If StrPtr(dateInput) = 0 Then Exit Sub
    If dateInput = "" Then Exit Sub

    With ActiveSheet
        .Cells.ClearContents
        .Range("A1:F1").Value = Array("LEAGUE", "START", "HOME", "AWAY", "SCORE", "STATUS")
        Set destCell = .Range("A2")
        r = 0
    End With
    
    URL = "https://www.fctables.com/livescore/" & Format(CDate(dateInput), "dd_mm_yyyy") & "/"    
    
    Set IE = Get_IE_Window2("https://www.fctables.com/livescore/")
    If IE Is Nothing Then Set IE = New InternetExplorer
    
    With IE
        SetForegroundWindow .hwnd
        .navigate URL
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
        .Visible = True
        Set HTMLdoc = .document
    End With

    'Click 'Show more matches' button repeatedly until number of leagues on page has not changed for 10 seconds
    
    Set moreMatchesDiv = HTMLdoc.getElementById("download_more_livescore")
    
    Do
        
        'Wait until not loading
        
        Do
            DoEvents
            Sleep 10
        Loop Until InStr(moreMatchesDiv.outerHTML, "loading") = 0
        
        moreMatchesDiv.Focus
        moreMatchesDiv.FireEvent "onclick"
        While IE.Busy: DoEvents: Sleep 10: Wend

        Set leagueDivs = HTMLdoc.getElementsByClassName("league")

        numLeagues = leagueDivs.Length
        timeout = DateAdd("s", 10, Now)
        Application.StatusBar = Time & " - " & numLeagues & " leagues"
        Do
            DoEvents
            Sleep 100
        Loop Until Now > timeout Or leagueDivs.Length <> numLeagues

        Application.StatusBar = Time & " - " & numLeagues & " leagues"

    Loop Until numLeagues = leagueDivs.Length
    
    'Extract all leagues and matches
        
    'NOTE - Any of the following getElementsByClassName lines can cause the following error on some matches
    '
    'Run-time error '-2147319783 (80028019)':
    '   Automation Error
    '   Old format or invalid type library
        
    i = 0
    For Each leagueDiv In leagueDivs
        
        '<div class="panel-title row">
        i = i + 1
        league = Trim(Split(leagueDiv.getElementsByClassName("panel-title")(0).innerText, vbCrLf)(0))
        Application.StatusBar = i & "/" & numLeagues & " : " & league
        Debug.Print Time; i & "/" & numLeagues & " : " & league
        
        Set gameDivs = leagueDiv.getElementsByClassName("game")
        
        For Each gameDiv In gameDivs
            
            destCell.Offset(r, 0).Value = league
            
            '<div title="" class="date date_unix date_unix_edit tooltip_setup" data-original-title="23 aug 2020 20:00" data-toggle="tooltip" data-unixtime="1598209200" data-only-hour="1">20:00</div>
            
            destCell.Offset(r, 1).Value = CDate(gameDiv.getElementsByClassName("date")(0).getAttribute("data-original-title"))

            home = gameDiv.getElementsByClassName("home")(0).innerText
            If home <> "" Then
                parts = Split(home, vbCrLf)
                destCell.Offset(r, 2).Value = parts(UBound(parts))                              'get last part to omit red card number, if present
            Else
                destCell.Offset(r, 2).Value = "**UNKNOWN**"
            End If
            away = gameDiv.getElementsByClassName("away")(0).innerText
            If away <> "" Then
                parts = Split(away, vbCrLf)
                destCell.Offset(r, 3).Value = parts(0)                                          'get first part to omit red card number, if present
            Else
                destCell.Offset(r, 3).Value = "**UNKNOWN**"
            End If
            destCell.Offset(r, 4).Value = "'" & gameDiv.getElementsByClassName("score")(0).innerText
            destCell.Offset(r, 5).Value = gameDiv.getElementsByClassName("status_name")(0).innerText
            r = r + 1
            
        Next
        
        DoEvents
                
    Next
    
    MsgBox "Done"
    Application.StatusBar = ""
      
End Sub


Private Function Get_IE_Window2(URLorName As String) As SHDocVw.InternetExplorer

    'Look for an IE browser window or tab already open at the (partial) URL or location name and, if found, return
    'that browser as an InternetExplorer object.  Otherwise return Nothing

    Dim Shell As Object
    Dim IE As SHDocVw.InternetExplorer
    Dim i As Variant 'Must be a Variant to index Shell.Windows.Item() array
    
    Set Shell = CreateObject("Shell.Application")
    
    i = 0
    Set Get_IE_Window2 = Nothing
    While i < Shell.Windows.Count And Get_IE_Window2 Is Nothing
        Set IE = Shell.Windows.Item(i)
        If Not IE Is Nothing Then
            'Debug.Print IE.LocationURL, IE.LocationName, IE.Name
            'If TypeOf IE Is SHDocVw.InternetExplorer And InStr(IE.LocationURL, "file://") <> 1 Then
            If IE.Name = "Internet Explorer" And InStr(IE.LocationURL, "file://") <> 1 Then
                If InStr(1, IE.LocationURL, URLorName, vbTextCompare) > 0 Or InStr(1, IE.LocationName, URLorName, vbTextCompare) > 0 Then
                    Set Get_IE_Window2 = IE
                End If
            End If
        End If
        i = i + 1
    Wend
    
End Function
 
Upvote 0
Hi, thank you for big job.
How and where i have to insert new code?

I've this error if I run Sub IE_Extract_Results()

error.jpg
 
Upvote 0
Put the code in a new module by itself. The error indicates you've placed the code in an existing module below another Sub or Function.
 
Upvote 0
That line is trying to convert the date and time of "data-original-title" to an Excel date-time. You'll get that error if CDate doesn't recognise it as a valid date, therefore try deleting CDate( and the closing ) on the highlighted line.

Post the value of gameDiv.outerHTML.
 
Upvote 0
Maybe the date isn't visible, I see the time on the site.

VBA Code:
 ' destCell.Offset(r, 1).Value = CDate(gameDiv.getElementsByClassName("date")(0).getAttribute("data-original-title"))

It works without the time.

newtest.xlsm
ABCDEF
1LEAGUESTARTHOMEAWAYSCORESTATUS
2Nations League Division A[1]UkraineSwitzerland[2]-Not started
3Nations League Division A[4]GermanySpain[3]-Not started
4Nations League Division B[3]FinlandWales[1]-Not started
5Nations League Division B[4]BulgariaIreland[2]-Not started
6Nations League Division B[3]RussiaSerbia[2]-Not started
7Nations League Division B[1]TurkeyHungary[4]-Not started
8Nations League Division C[2]MoldovaKosovo[3]-Not started
9Nations League Division C[1]SloveniaGreece[4]-Not started
10Nations League Division D[2]LatviaAndorra[4]0-01st half
newtest
 
Upvote 0
Hi, I'm still here. I'm trying and trying to download today's matches, 143 records, but there is no Norway.
Can you try?

Thank you
 
Upvote 0
Hi @Bruzio,

I can see that you switched to IE method, but I would still use MSXML2 for this, and simply call the pages as the result is paginated and called by XHR in the real page in that way. Dealing with DOM actions is not the way I would prefer if the source allows the otherwise.

Please note I used early binding Microsoft XML v6.0, and Microsoft HTML Object Library references instead of late binding. I suggest doing the same, simply go to VBE, Tool->References, and select these two references. However you can adapt the code if you'd like to use late binding for some reason.

You can set strDate variable for a specific date, and perhaps use it as a function variable to call the function for a specific date.

Finally, of course this code (or any other code for the page structure) depends on the source structure since they don't provide XML or JSON, but HTML as return data. Even small changes in the structure, like changing a class name that we used to retrieve a node below, would require to adapt the code to access to the necessary nodes in the document.

VBA Code:
Sub crawlForMe()

Dim sht As Worksheet
Dim rng As Range
Dim i As Integer
Dim strDate As String

Dim http As MSXML2.XMLHTTP60
Dim html As HTMLDocument
Dim league As HTMLDivElement
Dim game As HTMLDivElement
Dim status As HTMLDivElement
Dim teams As HTMLDivElement
    
    strDate = "06-11-2020"
    
    Set sht = ActiveSheet
    Set rng = sht.Range("A1")
    rng.CurrentRegion.ClearContents
    rng.Resize(, 6) = Array("LEAGUE", "START", "HOME", "AWAY", "SCORE", "STATUS")
    
    Set http = New MSXML2.XMLHTTP60
    Set html = New HTMLDocument
    
    Do
        DoEvents
        i = i + 1
        http.Open "POST", "https://www.fctables.com/xml/livescore/", True
        http.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
        http.send "page=" & i & "&date=" & strDate
        
        Do Until http.readyState = 4
            DoEvents
        Loop
        html.body.innerHTML = http.responseText
        If Len(http.responseText) = 0 Then Exit Do
        
        For Each league In html.getElementsByClassName("league")
            For Each game In league.getElementsByClassName("games").Item(0).Children
                Set rng = rng.Offset(1)
                Set status = game.getElementsByClassName("status").Item(0)
                Set teams = game.getElementsByClassName("name game_hover_info").Item(0)
                rng.Cells(, 1).Value = league.getElementsByClassName("panel-title row").Item(0).getElementsByTagName("a").Item(0).innerText
                rng.Cells(, 2).Value = status.getElementsByClassName("date").Item(0).innerText
                rng.Cells(, 3).Value = teams.getElementsByClassName("home").Item(0).ChildNodes(teams.getElementsByClassName("home").Item(0).ChildNodes.Length - 1).NodeValue
                rng.Cells(, 4).Value = teams.getElementsByClassName("away").Item(0).ChildNodes(0).NodeValue
                rng.Cells(, 5).Value = "'" & teams.getElementsByClassName("score").Item(0).innerText
                rng.Cells(, 6).Value = status.getElementsByClassName("status_name").Item(0).innerText
            Next game
            
        Next league
    Loop
    
End Sub

Hope it helps.
 
Last edited:
Upvote 0
Hi, it works perfectly. thank you so much.
I would also like to download the odds, this code is ready to work?
 
Upvote 0

Forum statistics

Threads
1,225,754
Messages
6,186,825
Members
453,377
Latest member
JoyousOne

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top