goblin
Active Member
- Joined
- Apr 16, 2003
- Messages
- 469
I've exposed the VBA regular expression mechanism as a function in Excel and I'm trying to match/break up some data. The data looks like this:
0301-123456 Some name 111111-1111 1,202,591.00 .5 .5 5 Counterparty name
0301-234567 Another name 222222-2222 1,230,000.00 30.04.2013 1.0 .5 8 Counterparty name2
0301-345678 Any name you want 333333-3333 1,230,000.00 29.04.2013 30.04.2013 1.0 .5 8 Counterparty nam3
0301-456789 No digits in names I think 444444-4444 750,000.00 01.10.2012 .0 .5 1 Counterparty name4
0301-567890 NameNameName 555555-5555 750,000.00 31.12.2013 1.0 .5 5 Counterparty name5
0301-678901 This is my name 666666-6666 14,176,426.00 02.01.2015 .0 .5 1 Counterparty name6
The data can be described as:
1. The lines will always start with ddd-dddd
2. Then comes a name
3. ID code for the name on the form dddddd-dddd
4. Amount is next, formatted as #,##0.00
5. Optional column that contains a date, formated as dd.mm.yyyy. If 5 is populated 6 is populated as well
6. Optional column that contains a date, formated as dd.mm.yyyy. Can be populated without having data in column 5
7. Fraction, formatted as #.0 (e.g. 1.0 or .5)
8. Fraction. Same as 7
9. Integer. Can be two digits
10. Another name and can contain essentially anything
I've almost matched this but I cannot turn off greedy matching on columns 5 and 6. I.e. if column 5 is empty but column 6 is not, my pattern matching always puts the date from column 6 into column 5.
My pattern string is:
^(\d*-\d*)+\s(.*)\s(\d*-\d*)+\s(\S*)\s(\d\d\S*\s|)(\d\d\S*\s|)\s?(\d?\.\d+)\s(\d?\.\d+)\s(\d+)\s(.*)$
Basically, I'm matching like this:
1. Match the digit group for 1 and 3. Anything in between goes as group 2.
2. Match the amount in 4 as anything non-space
3. Match EITHER anything starting with two digits or skip it, still reserving a group for this.
4. Again, match EITHER anything starting with two digits or nothing, reserving another group for this
5. Optional space, because if both Date columns are missing I only have one space between 4 and 7
6. Match 7-9 aggressively
7. Rest goes into last group
If anyone wants to take a crack at this I would be very thankful. The VBA code you need is below. To use the code in Excel, put the test data above in cells A2-A7, put the pattern string in Cell A1, select cells B2-K2, put in the formula "=RegExpGroup(A2;$A$1)", and CTRL-SHIFT-ENTER the formula. Copy down to break up all the strings:
0301-123456 Some name 111111-1111 1,202,591.00 .5 .5 5 Counterparty name
0301-234567 Another name 222222-2222 1,230,000.00 30.04.2013 1.0 .5 8 Counterparty name2
0301-345678 Any name you want 333333-3333 1,230,000.00 29.04.2013 30.04.2013 1.0 .5 8 Counterparty nam3
0301-456789 No digits in names I think 444444-4444 750,000.00 01.10.2012 .0 .5 1 Counterparty name4
0301-567890 NameNameName 555555-5555 750,000.00 31.12.2013 1.0 .5 5 Counterparty name5
0301-678901 This is my name 666666-6666 14,176,426.00 02.01.2015 .0 .5 1 Counterparty name6
The data can be described as:
1. The lines will always start with ddd-dddd
2. Then comes a name
3. ID code for the name on the form dddddd-dddd
4. Amount is next, formatted as #,##0.00
5. Optional column that contains a date, formated as dd.mm.yyyy. If 5 is populated 6 is populated as well
6. Optional column that contains a date, formated as dd.mm.yyyy. Can be populated without having data in column 5
7. Fraction, formatted as #.0 (e.g. 1.0 or .5)
8. Fraction. Same as 7
9. Integer. Can be two digits
10. Another name and can contain essentially anything
I've almost matched this but I cannot turn off greedy matching on columns 5 and 6. I.e. if column 5 is empty but column 6 is not, my pattern matching always puts the date from column 6 into column 5.
My pattern string is:
^(\d*-\d*)+\s(.*)\s(\d*-\d*)+\s(\S*)\s(\d\d\S*\s|)(\d\d\S*\s|)\s?(\d?\.\d+)\s(\d?\.\d+)\s(\d+)\s(.*)$
Basically, I'm matching like this:
1. Match the digit group for 1 and 3. Anything in between goes as group 2.
2. Match the amount in 4 as anything non-space
3. Match EITHER anything starting with two digits or skip it, still reserving a group for this.
4. Again, match EITHER anything starting with two digits or nothing, reserving another group for this
5. Optional space, because if both Date columns are missing I only have one space between 4 and 7
6. Match 7-9 aggressively
7. Rest goes into last group
If anyone wants to take a crack at this I would be very thankful. The VBA code you need is below. To use the code in Excel, put the test data above in cells A2-A7, put the pattern string in Cell A1, select cells B2-K2, put in the formula "=RegExpGroup(A2;$A$1)", and CTRL-SHIFT-ENTER the formula. Copy down to break up all the strings:
Code:
Function RegExpGroup(FindIn As String, FindWhat As String, Optional bGlobal As Boolean = True) As Variant
' Needs reference "Microsoft VBScript Regular Expressions 5.5
Dim re As RegExp, i As Integer, j As Integer, tmp
Application.Volatile False
RegExpGroup = Array(CVErr(xlErrNA))
Set re = New RegExp
re.Global = bGlobal
re.Pattern = FindWhat
If re.test(FindIn) Then
With re.Execute(FindIn)
ReDim rslt(0 To .count - 1)
For i = 0 To .count - 1
With .Item(i).SubMatches
tmp = Array()
ReDim tmp(0 To .count - 1)
For j = 0 To .count - 1
tmp(j) = .Item(j)
Next j
rslt(i) = tmp
End With
Next i
RegExpGroup = rslt
End With
End If
End Function