Pulling Data from a Load-As-You-Scroll List

rjbinney · Nov 22, 2024

Dear Mr. Excel:

I am loving using "Get Data" and pulling tables off the web - Wikipedia tables work quite nicely.

I've been running across a few data sets where, in a normal browser, the page loads the first few items until you scroll to the end, then it loads the next few, and so on. Kinda like a Reddit feed.

When I use Get Data=>From Web, it pulls out a nice table, but ONLY for that first load. It doesn't grab the whole data set.

I don't know anything about web coding, Python, or anything like that, so my flicking through "Inspect Page" in my browser I felt like my dog watching nature shows. I asked a few AI search engines if they could pull the data, but the only thing that came close wanted to write me a script that I'd load in some emulator and blah blah blah.

Any thoughts?

Here's an example data set (the actual one I'm working with is proprietary, but this one seems to behave similarly in my browser).

Any thoughts would be cool.

Thanks!

Love,

smozgur · Nov 24, 2024

I don't think you can just use the basic Get Data interface to get what you need in this case. The data here isn’t directly available other than the first page —it’s paginated, meaning it loads in chunks as you scroll down the page. Behind the scenes, this happens through a JavaScript fetch call in the browser.

So, to get the data, we need to dig deeper. Instead of pulling from the HTML you see, we’ll fetch the data directly from the web service (Web API). If you’re curious and want to explore some M language methods for working with web services, keep reading.

When you scroll the page and track the network calls (browser Developer Tools - Network tab), you’ll notice that the endpoint https://www.digitalspy.com/api/listicle-slides gets called. This endpoint provides the paginated data, and it takes two query parameters:

id: the identifier for the image set group (for example: 93977ede-6f1e-4977-982a-a2a317f893b2 in this case. You need to find this and I didn't go that far. Perhaps, the page that you got the original link in your question might be giving some information about it.
page: the page number, starting at 1 and incrementing for each subsequent page

We just need to call this API repeatedly for each page to fetch the data, which is returned in JSON format.

{endpoint}?id=93977ede-6f1e-4977-982a-a2a317f893b2&page=1
{endpoint}?id=93977ede-6f1e-4977-982a-a2a317f893b2&page=2
{endpoint}?id=93977ede-6f1e-4977-982a-a2a317f893b2&page=3
{endpoint}?id=93977ede-6f1e-4977-982a-a2a317f893b2&page=...

See how the page parameter changes with each call?

Now, how do we know when to stop? Normally, APIs provide pagination metadata (like total pages or item counts), but this one doesn’t. Instead, it is just returning empty data when there’s nothing left for the queried page.

So, we can keep calling until we get an empty data set.
Alternatively, we could count the items returned in each page and stop when we see fewer items than the first page has. Personally, I’d just make one extra API call and handle the empty response—it’s simpler to code that way instead of dealing with the unknown first page item count (what if there is only one page data!).

Finally, here’s the code that does what we need. It includes a function (fxAPI) that’s called recursively and the beautiful List.Generate function, which helps us collect each data set into a list. From there, it’s just about creating a meaningful table at the end. In the final step of the query, you’ll notice some columns containing Records that you can expand to get the full details.

Create a blank query. Open the Advanced Editor, delete the default content, and then copy and paste the code below. Click Done to execute it. If you see any prompts asking for permission to access the web content, just confirm it and proceed as an Anonymous user.

Power Query:

let
    fxAPI = (id as text, page as number) =>
        let
            Result = Json.Document(Web.Contents("https://www.digitalspy.com/api/listicle-slides?id=" & id & "&page=" & Text.From(page)))
        in
            Result,
    id = "93977ede-6f1e-4977-982a-a2a317f893b2",
    Pages = List.Generate(
        () => [x = 1, data = fxAPI(id, 1)],
        each List.Count([data]) > 0,
        each [x = [x] + 1, data = fxAPI(id, [x] + 1)],
        each [data]
    ),
    Result = List.Combine(Pages),
    ConvertToTable = Table.FromList(Result, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    ExpandRecords = Table.ExpandRecordColumn(ConvertToTable, "Column1", Record.FieldNames(ConvertToTable{0}[Column1]))
in
    ExpandRecords

rjbinney · Dec 6, 2024

I really appreciate this detailed reply... I just haven't had time to play with it yet. (I punted and just did cut-and-paste, cut-and-paste. Probably would have been as fast as to just retype as new...)

Thanks! Bookmarked!

smozgur · Dec 8, 2024

rjbinney said:
Probably would have been as fast as to just retype as new...

Trust me, you don't have to retype. The code will work as it is.

Excel Ribbon -> Get Data -> From Other Sources -> Blank Query.
Power Query UI pops up with the Query1 empty query is selected.
Click Advanced Editor.
Select the existing default code in the code pane and delete it. Then paste the provided code.
Click Done.
You will see the following warning:
Click Edit Credentials and make sure Anonymous is selected with the selected domain in the displayed dialog.

Excel is making sure if you are aware where you are being connected and allows this connection.
Click Connect and wait until it connects and loads data.

Pulling Data from a Load-As-You-Scroll List

rjbinney

Active Member

smozgur

BatCoder

rjbinney

Active Member

smozgur

BatCoder

Similar threads

Share this page

Pulling Data from a Load-As-You-Scroll List

rjbinney

Active Member

smozgur

BatCoder

rjbinney

Active Member

smozgur

BatCoder

Similar threads

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock