Hello all,
I am in the process of creating a VBA macro in Excel 2010 to scrape text from a list of websites to examine the frequency of words. Ideally, I would like to plop in a list of thousands of websites and let the macro run overnight. I’m running Windows 7 64bit.
Ill post the entire Module below:
Most of my code is copied from various places on the internet and modified to fit this particular macro. I'm not the best with VBA, so please don't hesitate to dumb-down any explanations for me! Also, please feel free to criticize and/or recommend other improvements to my sloppily-written code! Now, to the freezing:
Problem:
When executed, the macro starts running perfectly. I wrote some debug.print lines in my macro to help me keep track of where the freezing occurs. Excel 2010 will freeze & the title bar will read (not responding) on the URL query after the sum of the "Grand Total" values on the Worksheet named "Main" is greater than about 7000. In other words:
If (Sum of all numbers in Column C in Worksheets("Main") > 7000) Then (the next URL Query will permanently freeze Excel)
Attempted Problem Solving:
I am thoroughly perplexed by the previous relationship, but my hypothesis has held for about 3 days worth of fighting/debugging. I added a bit of memory & connection cleaning code to the end of every iteration, wondering if there is some sort of 'buffer' I am filling..?
I determined which line causes the URL Query to fail, but I am confused as to why they are related at all, and how to fix the freezing while still getting output.
I have tried using variations such as .Text = x(j), .Value = x(j), Cells(WordCnt, 1) = x(j), but nothing solves the freezing
Stepping through the macro with F8 produces perfect results every time, no matter how many sites or words I iterate through.
Running the macro for one site, then starting it immediately on another site produces perfect results every time, no matter how many sites or words I iterate through.
Disabling the single line of code above produces perfect results every time (besides having the word-frequency output), no matter how many sites or words I iterate through.
Other Notes:
If a website Query fails, I am totally fine with just skipping to the next site. Plenty of fish in the sea, and the internet is a messy place. On the other hand, If there something I can do to decrease the number of failed sites, I'm all ears.
Here's a link to the full .xlsm file, for your viewing pleasure. If the link breaks, or is removed, here's a quick description so you can make your own:
https://dl.dropboxusercontent.com/u/42453705/Frequency Test/Keyword Frequency Test 1.xlsm
One worksheet named "Main" - Formatted output goes here
One worksheet named "JunkYard" - Workspace for messy data
One worksheet named "Word Frequency" - Aggregate messy data
One worksheet named "Site List" - Website #1 in A2, Website #2 in A3, etc. (remove "www."; "www.google.com" should be "google.com")
One worksheet named "Stop Words" - List containing filtered words like "and", "but", "also", etc. (Word #1 in A1, Word #2 in A2, etc.)
Thank you for any advice on this problem. I really appreciate the feedback to improve my coding skills. Please ask for clarification if anything in here if I've explained poorly!
- Vistalite
I am in the process of creating a VBA macro in Excel 2010 to scrape text from a list of websites to examine the frequency of words. Ideally, I would like to plop in a list of thousands of websites and let the macro run overnight. I’m running Windows 7 64bit.
Ill post the entire Module below:
Code:
Sub WebsiteScrape1()
Dim Query_URL, URL, txt, c, MyStr As String
Dim NumSites As Integer
Dim PuncChars, StopWords, x As Variant
Dim i, j, k, r, wordCnt As Long
Dim Match As Boolean
Dim Start, Finish
Dim AllWords As Range
Dim InputSheet As Worksheet
Dim pc As PivotCache
Dim PT As PivotTable
NumSites = 4 'Application.WorksheetFunction.CountA(Sheets("Site List").Range("A2", Sheets("Site List").Range("A2").End(xlDown)))
PuncChars = Array(".", ",", ";", ":", "'", "!", "#", _
"$", "%", "&", "(", ")", " - ", "_", "--", "+", _
"=", "~", "/", "\", "{", "}", "[", "]", """", "?", "*")
StopWords = RangeToArray(Sheets("Stop Words").Range("A1", Sheets("Stop Words").Range("A" & Rows.Count).End(xlUp)))
For i = 0 To (NumSites - 1)
Debug.Print ""
Debug.Print i, "First Line" ' First Line Success
j = k = r = 0
Query_URL = URL = txt = c = ""
Application.ScreenUpdating = False
Start = Timer
URL = Sheets("Site List").Range("A2").Offset(i, 0).Value
Query_URL = "URL;http://www." & URL
Sheets("JunkYard").UsedRange.Delete
Sheets("Word Frequency").UsedRange.Delete
With Sheets("JunkYard").QueryTables.Add(Connection:=Query_URL, Destination:=Sheets("JunkYard").Range("A1"))
.Name = "Website Text"
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = False
.RefreshStyle = xlOverwriteCells
.SavePassword = False
.SaveData = False
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlEntirePage
.WebFormatting = xlWebFormattingNone
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
On Error GoTo Error_Query:
Debug.Print i, "Query Start" ' Query Start
.Refresh BackgroundQuery:=False
Debug.Print i, "Query Done" ' Query Done
End With
On Error GoTo 0
Sheets("JunkYard").Activate
Columns("A:A").Select
ActiveWorkbook.Worksheets("JunkYard").Sort.SortFields.Clear
ActiveWorkbook.Worksheets("JunkYard").Sort.SortFields.Add Key:=Range("A1"), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortTextAsNumbers
If Range("A1") <> "" Then
With ActiveWorkbook.Worksheets("JunkYard").Sort
.SetRange Range("A1", Range("A" & Rows.Count).End(xlUp))
.Header = xlNo
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
End If
Set InputSheet = ActiveSheet
Sheets("Word Frequency").Range("A1") = "All Words"
InputSheet.Activate
wordCnt = 2
r = 1
Debug.Print i, "Cleaning Start" ' Cleaning Start
'Loop until blank cell is encountered
Do While Cells(r, 1) <> ""
'covert to UPPERCASE
txt = UCase(Cells(r, 1))
'Remove punctuation
For j = 0 To UBound(PuncChars)
txt = Replace(txt, PuncChars(j), "")
Next j
'Remove excess spaces
txt = WorksheetFunction.Trim(txt)
'Extract the words
x = Split(txt)
For j = 0 To UBound(x)
If x(j) Like "*[!A-Z]*" Then
For k = 1 To Len(x(j))
c = Mid$(x(j), k, 1)
If c Like "[!A-Z]" Then x(j) = Replace(MyStr, c, "", k)
Next k
End If
Match = False
For k = 1 To UBound(StopWords)
If x(j) = StopWords(k) Then
Match = True
Exit For
End If
Next k
If Match = False Then
' Removing this line will solve the freezing!! (but then output of the macro will be blank)
Sheets("Word Frequency").Range("A" & wordCnt).FormulaR1C1 = x(j)
wordCnt = wordCnt + 1
End If
Next j
r = r + 1
Loop
Debug.Print i, "Cleaning Done" ' Cleaning Done
'Create pivot table
Sheets("Word Frequency").Activate
Set AllWords = Range("A1", ActiveSheet.Range("A" & Rows.Count).End(xlUp))
On Error Resume Next
Set pc = ActiveWorkbook.PivotCaches.Add(xlDatabase, "'" & AllWords.Parent.Name & "'!" & AllWords.Address(ReferenceStyle:=xlR1C1))
'Set PC = ActiveWorkbook.PivotCaches.Add(xlDatabase, AllWords)
Set PT = pc.CreatePivotTable(ActiveSheet.Range("C1"))
With PT
.AddDataField .PivotFields("All Words")
.PivotFields("All Words").Orientation = xlRowField
End With
Debug.Print i, "Pivot Done" ' Pivot Done
On Error GoTo 0
Sheets("Word Frequency").Activate
Columns("C:D").Copy
Columns("C:D").PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False, Transpose:=False
Rows("1:1").Select
Application.CutCopyMode = False
Selection.Delete Shift:=xlUp
Columns("C:D").AutoFilter
ActiveWorkbook.ActiveSheet.AutoFilter.Sort.SortFields.Clear
ActiveWorkbook.ActiveSheet.AutoFilter.Sort.SortFields.Add Key:=Range _
("D1", ActiveSheet.Range("D1").End(xlDown)), SortOn:=xlSortOnValues, Order:=xlDescending, DataOption:= _
xlSortTextAsNumbers
On Error GoTo Error_Sort
With ActiveWorkbook.ActiveSheet.AutoFilter.Sort
.Header = xlYes
.MatchCase = False
.Orientation = xlTopToBottom
.Apply
End With
Debug.Print i, "Sort Done" ' Sort Done
On Error GoTo 0
Columns("A:B").Delete Shift:=xlToLeft
Range("A2:B51").Copy
Sheets("Main").Activate
ActiveSheet.Range("C2").Offset(i * 2, 0).PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=True
Continue:
Sheets("Site List").Range("A2").Offset(i, 0).Copy
Sheets("Main").Activate
Sheets("Main").Range("A2").Offset(i * 2, 0).Select
ActiveSheet.Paste
Do While Application.Ready = False
Debug.Print "Not Ready", i
Loop
With ActiveWorkbook
For Each cn In .Connections
cn.Delete
Next
For Each pc In .PivotCaches
pc.MissingItemsLimit = xlMissingItemsNone
Next pc
End With
With Sheets("JunkYard")
For j = .QueryTables.Count To 1 Step -1
.QueryTables(j).Delete
Next
End With
Debug.Print i, "Memory Cleanup Done" ' Memory Cleanup Done
Finish = Timer
Application.ScreenUpdating = True
Sheets("Main").Range("B2").Offset(i * 2, 0).Select
ActiveCell.Formula = Finish - Start
Debug.Print i, "Last Line" ' Last Line
Next i
Exit Sub
Error_Sort:
MsgBox ("Sort failed.")
Resume Continue
Error_Query:
Debug.Print i, "Query Error" ' Query Error
Resume Continue
Error_Pivot:
MsgBox ("Pivot Table could not be created.")
Resume Continue
End Sub
Function RangeToArray(ByVal my_range As Range) As String()
Dim vArray As Variant
Dim sArray() As String
Dim i As Long
vArray = my_range.Value
ReDim sArray(1 To UBound(vArray))
For i = 1 To UBound(vArray)
sArray(i) = vArray(i, 1)
Next
RangeToArray = sArray()
End Function
Most of my code is copied from various places on the internet and modified to fit this particular macro. I'm not the best with VBA, so please don't hesitate to dumb-down any explanations for me! Also, please feel free to criticize and/or recommend other improvements to my sloppily-written code! Now, to the freezing:
Problem:
When executed, the macro starts running perfectly. I wrote some debug.print lines in my macro to help me keep track of where the freezing occurs. Excel 2010 will freeze & the title bar will read (not responding) on the URL query after the sum of the "Grand Total" values on the Worksheet named "Main" is greater than about 7000. In other words:
If (Sum of all numbers in Column C in Worksheets("Main") > 7000) Then (the next URL Query will permanently freeze Excel)
Attempted Problem Solving:
I am thoroughly perplexed by the previous relationship, but my hypothesis has held for about 3 days worth of fighting/debugging. I added a bit of memory & connection cleaning code to the end of every iteration, wondering if there is some sort of 'buffer' I am filling..?
I determined which line causes the URL Query to fail, but I am confused as to why they are related at all, and how to fix the freezing while still getting output.
Code:
' Removing this line will solve the freezing!! (but then output of the macro will be blank)
Sheets("Word Frequency").Range("A" & wordCnt).FormulaR1C1 = x(j)
I have tried using variations such as .Text = x(j), .Value = x(j), Cells(WordCnt, 1) = x(j), but nothing solves the freezing
Stepping through the macro with F8 produces perfect results every time, no matter how many sites or words I iterate through.
Running the macro for one site, then starting it immediately on another site produces perfect results every time, no matter how many sites or words I iterate through.
Disabling the single line of code above produces perfect results every time (besides having the word-frequency output), no matter how many sites or words I iterate through.
Other Notes:
If a website Query fails, I am totally fine with just skipping to the next site. Plenty of fish in the sea, and the internet is a messy place. On the other hand, If there something I can do to decrease the number of failed sites, I'm all ears.
Here's a link to the full .xlsm file, for your viewing pleasure. If the link breaks, or is removed, here's a quick description so you can make your own:
https://dl.dropboxusercontent.com/u/42453705/Frequency Test/Keyword Frequency Test 1.xlsm
One worksheet named "Main" - Formatted output goes here
One worksheet named "JunkYard" - Workspace for messy data
One worksheet named "Word Frequency" - Aggregate messy data
One worksheet named "Site List" - Website #1 in A2, Website #2 in A3, etc. (remove "www."; "www.google.com" should be "google.com")
One worksheet named "Stop Words" - List containing filtered words like "and", "but", "also", etc. (Word #1 in A1, Word #2 in A2, etc.)
Thank you for any advice on this problem. I really appreciate the feedback to improve my coding skills. Please ask for clarification if anything in here if I've explained poorly!
- Vistalite