Extracting unit measures from text strings

emi_david

New Member
Joined
Dec 29, 2015
Messages
7
Hi,

So I'm scraping data from online stores into an excel sheet and often get cells with the following text:

"ACANA ADULT SMALL BREED 2.5KG"

I want 2.5 to be taken out of this cell (A1 for example) and moved into the next column (B1), and the unit kg to be moved in the next one yet (C1). The string "2.5kg" can be written any where in the text string. It can also be written in different ways such as "2.5 kg", "2,5 kg" or "2500g" depending on the store. However, after the string "2.5 kg" has been located and moved to B1 and C1 I can easily remove any and all blank spaces from these columns.

Obviously I cannot use =Right(x) since the number of characters differ and the hunted string can be anywhere in the text made available by the online store.

I need something like:
Unit = ml, l, g, kg, oz
For i = 0 to n
If it says "unit" in the text Then
Take all the characters before the unit until the next letter And
Move the chosen letters to C1 and the chosen numbers to B1
End If
Next i

If possible I'd like a macro for it. I so hope there is a general formula to use rather than adhoc solutions for each store output. Please tell me if you want to see more sample data.

Best regards,
emi_david
 
@PGC

That is the most beautiful thing I have seen since this morning's splendid sunrise. Bravo and thanks a million!

Finally a detail: Is it possible to move the extracted unit from the original cell when it has been moved to columns B and C?
 
Upvote 0

Excel Facts

Excel Wisdom
Using a mouse in Excel is the work equivalent of wearing a lanyard when you first get to college
Hi

This will delete the extracted data from the original cell.
Run the code on a copy of the data, because it will alter the source data.


Code:
...
        For lRow = 1 To r.Count
            Set regexMatches = .Execute(r(lRow, 1).Value)
            For lMatch = regexMatches.Count To 1 Step -1
                With regexMatches(lMatch - 1)
                    r(lRow, 2 * lMatch).Value = .submatches(0)
                    r(lRow, 2 * lMatch + 1).Value = .submatches(1)
                    r(lRow, 1).Value = Left(r(lRow, 1).Value, .firstindex) & Mid(r(lRow, 1).Value, .firstindex + .Length + 1)
                End With
            Next lMatch
        Next
...
 
Upvote 0
or, almost equivalent and simpler:

Code:
        For lRow = 1 To r.Count
            Set regexMatches = .Execute(r(lRow, 1).Value)
            For lMatch = regexMatches.Count To 1 Step -1
                With regexMatches(lMatch - 1)
                    r(lRow, 2 * lMatch).Value = .submatches(0)
                    r(lRow, 2 * lMatch + 1).Value = .submatches(1)
                    r(lRow, 1).Value = Replace(r(lRow, 1).Value, .Value, "")
                End With
            Next lMatch
        Next
 
Upvote 0
@PGC

@ pgc01,

For me, the code in post #10 does not extract anything:

Mexikansk 9 ml tacopaj Fryst 8 g Felix

I can also not extract the measurements from this string.

Another string that failed for me (the only anomaly out in 300 rows tested) was[TABLE="width: 354"]
<tbody>[TR]
[TD="width: 354"]"COOP, 1 kg."

I have tried to add (kg.) in sUnits[/TD]
[/TR]
</tbody>[/TABLE]
 
Upvote 0
I've made a check on a larger file and can confirm that the anomalies occur for single digits such as 1 l, 2 kg, 5 g, 6 ml etc.

There must be something in the Pattern row but I just don'r find it. Help anyone?

@PGC wrote
.Pattern = "\b(\d[0-9.,]+)\s*(" & sUnits & ")\b"

@Sektor first wrote it a little different but was then not having the list. sUnits is referring to the list of units in PGC's code:
.Pattern = "\s([0-9.]+)([a-z]+)$"

I now took away the \d before [0-9.,]+. I think it sort of became like a double requirement there. Now it works with the following pattern code.

.Pattern = "\b([0-9.,]+)\s*(" & sUnits & ")\b"

The complete code therefore

Code:
Sub GetData()
    
    Dim regexMatches As Object
    Dim r As Range, rC As Range
    Dim lRow As Long, lMatch As Long
    
    Const sUnits As String = "(kg)|g|(oz)|(fl oz)|(ml)|(cl)|(dl)|(hg)|(mg)"
    
    Set r = Range("A1:A8000")
    With CreateObject("VBScript.RegExp")
        .Pattern = "\b([0-9.,]+)\s*(" & sUnits & ")\b"
        .IgnoreCase = True
        .Global = True
        
        For lRow = 1 To r.Count
            Set regexMatches = .Execute(r(lRow, 1).Value)
            For lMatch = 1 To regexMatches.Count
                With regexMatches(lMatch - 1)
                    r(lRow, 2 * lMatch).Value = .submatches(0)
                    r(lRow, 2 * lMatch + 1).Value = .submatches(1)
                End With
            Next lMatch
        
            For lMatch = regexMatches.Count To 1 Step -1
                With regexMatches(lMatch - 1)
                    r(lRow, 2 * lMatch).Value = .submatches(0)
                    r(lRow, 2 * lMatch + 1).Value = .submatches(1)
                    r(lRow, 1).Value = Replace(r(lRow, 1).Value, .Value, "")
                End With
            Next lMatch
        Next
    End With
    


End Sub
 
Upvote 0
Hi

You are right, sorry, it should be a * instead of a +

Code:
.Pattern = "\b(\d[0-9.,][B][COLOR=#800000]*[/COLOR][/B])\s*(" & sUnits & ")\b"

This means that the number begins with a digit (the \d) and is followed by some other characters that can be either a digit or a comma or a dot.

I used a comma or a dot because in your examples you had 3,45 and 2.14 so I allow both.

The * means 0 or more
The + means 1 or more

That's why it was not working, it was not accepting a single digit number.
 
Upvote 0
Are they always one of the five: "ml, l, g, kg, oz"?

Are there ever any other numbers, that aren't referencing the measurement? (i.e. SKU numbers or product numbers)
Bro please reply i want your help
Basically i have data which contains serving size(48g) in one column and in another one it has unit (cup). In every 5 to 6 row these units for size are change like if it was 'g' with 48 now in some other row it is 48 ml so what i require is what i can do to extract ml or g from the text if it has ml then ml should be extracted only if it has g then only g has to be extracted in new column.
 
Upvote 0

Forum statistics

Threads
1,223,893
Messages
6,175,248
Members
452,623
Latest member
cliftonhandyman

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top