Read unicode from html file by VBA

morningstar1001

New Member
Joined
Jul 23, 2018
Messages
10
Hi everyone,

I need to extract the description text from a lot of html files. So I use this code to open and read the text from the html files.

Code:
myFile = myPath & Filename
    
    Open myFile For Input As [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=1]#1[/URL] 
    Do Until EOF(1)
        Line Input [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=1]#1[/URL] , textline
        posContent = InStr(1, textline, "title-detail")
        If posContent > 0 Then
            Sheets("data").Cells(j, 3) = textline
        End If
        
    Loop
    Close [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=1]#1[/URL]

It worked, but the description text which I need to collect is not UTF-8 and become "Nội thất văn phòng"

So is there any other way to extract correct data?

Thanks.
 

Excel Facts

Bring active cell back into view
Start at A1 and select to A9999 while writing a formula, you can't see A1 anymore. Press Ctrl+Backspace to bring active cell into view.
Code:
Function ExtractTitleDetail(strHtmlFilePath As String) As String
  Const TristateTrue = -1 ' Opens the file as Unicode
  Dim strLineOfText As String
  Dim objFileSys As Object
  Dim objStream As Object
  
  On Error GoTo ErrHandler
  Set objFileSys = CreateObject("Scripting.FileSystemObject")
  Set objStream = objFileSys.OpenTextFile(strHtmlFilePath, Format:=TristateTrue)
  
  Do Until objStream.AtEndOfStream
    strLineOfText = objStream.ReadLine()
    If InStr(1, strLineOfText, "title-detail", vbTextCompare) > 0 Then
      ExtractTitleDetail = strLineOfText
      GoTo ExitProc
    End If
  Loop
  
ExitProc:
  On Error Resume Next
  objStream.Close
  Set objStream = Nothing
  Set objFileSys = Nothing
  Exit Function
  
ErrHandler:
  ExtractTitleDetail = vbNullString
  Resume ExitProc
End Function
 
Upvote 0
Thanks for your help.
I tried with your code but the strLineOfText = "?????????".

However, I already found another solution using ADODB.Stream which work perfectly. Here the code just in case someone may need it.

Code:
 Dim adoStream As ADODB.Stream
    Dim var_String As Variant

    Set adoStream = New ADODB.Stream
    adoStream.Charset = "UTF-8"
    adoStream.Open
    adoStream.LoadFromFile myFile 'change this to point to your text file
  
    var_String = Split(adoStream.ReadText, vbCrLf) 'split entire file into array - lines delimited by CRLF
    
    For k = LBound(var_String) To UBound(var_String)
        textline = var_String(k)
        posContent = InStr(1, textline, "title-detail")
        If posContent > 0 Then
            Sheets("data2").Cells(j, 3) = textline
        End If
    Next k
 
Upvote 0
Thanks for your help.
I tried with your code but the strLineOfText = "?????????".

However, I already found another solution using ADODB.Stream which work perfectly. Here the code just in case someone may need it.

Code:
 Dim adoStream As ADODB.Stream
    Dim var_String As Variant

    Set adoStream = New ADODB.Stream
    adoStream.Charset = "UTF-8"
    adoStream.Open
    adoStream.LoadFromFile myFile 'change this to point to your text file
  
    var_String = Split(adoStream.ReadText, vbCrLf) 'split entire file into array - lines delimited by CRLF
    
    For k = LBound(var_String) To UBound(var_String)
        textline = var_String(k)
        posContent = InStr(1, textline, "title-detail")
        If posContent > 0 Then
            Sheets("data2").Cells(j, 3) = textline
        End If
    Next k

You said mine didn't work, but you stated in the title that the file is encoded as unicode rather than utf-8. If that was indeed the case, mine works fine.
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,178
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top