When you want to do a fuzzy lookup without VBA or addin or POWER QUERY, it might be usefull to use dynamic arrays and code a formula that test n-gram substrings to evaluate text similarities.
Hereafter is a detailed view of the computation of string similarities by n-gram substrings :
By wraping all string matching in a dynamic array formula, one can build a fuzzy lookup which return the most similar text value :
Obviously, the formula could be simplified when the LET() function will be widely available
Hope this helps
Excali
Hereafter is a detailed view of the computation of string similarities by n-gram substrings :
n-gram similarity.xlsx | ||||||||
---|---|---|---|---|---|---|---|---|
B | C | D | E | F | G | |||
1 | find_text | within_text | n-gram | |||||
2 | ELOQUENTLY | INELOQUENT | 3 | |||||
3 | N | |||||||
4 | 10 | n-gram similarity | 0,750 | =D6/(N-n_gram+1) | ||||
5 | single formula | 0,750 | =SOMME(1*ESTNUM(CHERCHE(STXT($B$2;SEQUENCE(NBCAR($B$2)-n_gram+1);n_gram);within_text)))/(N-n_gram+1) | |||||
6 | Total | 6 | ||||||
7 | 1 | ELO | 1 | |||||
8 | 2 | LOQ | 1 | |||||
9 | 3 | OQU | 1 | |||||
10 | 4 | QUE | 1 | |||||
11 | 5 | UEN | 1 | |||||
12 | 6 | ENT | 1 | |||||
13 | 7 | NTL | 0 | |||||
14 | 8 | TLY | 0 | |||||
Detailed single formula |
Cell Formulas | ||
---|---|---|
Range | Formula | |
B4 | B4 | =LEN(B2) |
F4 | F4 | =D6/(N-n_gram+1) |
G4:G5 | G4 | =FORMULATEXT(F4) |
F5 | F5 | =SUM(1*ISNUMBER(SEARCH(MID($B$2,SEQUENCE(LEN($B$2)-n_gram+1),n_gram),within_text)))/(N-n_gram+1) |
D6 | D6 | =SUM(D7#) |
B7:B14 | B7 | =SEQUENCE(LEN(B2)-n_gram+1) |
C7:C14 | C7 | =MID(B2,B7#,n_gram) |
D7:D14 | D7 | =1*ISNUMBER(SEARCH(C7#,within_text)) |
Dynamic array formulas. |
Named Ranges | ||
---|---|---|
Name | Refers To | Cells |
N | ='Detailed single formula'!$B$4 | F4:F5 |
n_gram | ='Detailed single formula'!$D$2 | B7:C7, F4:F5 |
within_text | ='Detailed single formula'!$C$2 | F5, D7 |
By wraping all string matching in a dynamic array formula, one can build a fuzzy lookup which return the most similar text value :
n-gram similarity.xlsx | |||||
---|---|---|---|---|---|
B | C | D | |||
1 | find_text | n-gram | Matching text | ||
2 | JACKSON, Michael | 2 | Michael JACKSON | ||
3 | |||||
4 | |||||
5 | |||||
6 | within_text | Simple formulae | Array formula | ||
7 | Michael LORRIE | 0,4 | 0,4 | ||
8 | Jackson FIVE | 0,4 | 0,4 | ||
9 | The Jacksons | 0,4 | 0,4 | ||
10 | Prince of Pop | 0 | 0 | ||
11 | Michael JACKSON | 0,8 | 0,8 | ||
12 | Jackson, Stewart | 0,533333333 | 0,533333333 | ||
13 | Latoya Jackson | 0,4 | 0,4 | ||
Fuzzy Lookup |
Cell Formulas | ||
---|---|---|
Range | Formula | |
D2 | D2 | =INDEX(within_text_list[within_text],MATCH(MAX( MMULT(1*ISNUMBER(SEARCH(MID(find_text,SEQUENCE(1,LEN(find_text)-n_gram+1),n_gram),within_text_list[within_text]))/(LEN(find_text)-n_gram+1), SEQUENCE(LEN(find_text)-n_gram+1,1,1,0)) ), MMULT(1*ISNUMBER(SEARCH(MID(find_text,SEQUENCE(1,LEN(find_text)-n_gram+1),n_gram),within_text_list[within_text]))/(LEN(find_text)-n_gram+1), SEQUENCE(LEN(find_text)-n_gram+1,1,1,0)),0)) |
D7:D13 | D7 | =MMULT(1*ISNUMBER(SEARCH(MID($B$2,SEQUENCE(1,LEN($B$2)-n_gram+1),n_gram),within_text_list[within_text]))/(LEN($B$2)-n_gram+1), SEQUENCE(LEN($B$2)-n_gram+1,1,1,0)) |
C7:C13 | C7 | =SUM(1*ISNUMBER(SEARCH(MID(find_text,SEQUENCE(1,LEN(find_text)-n_gram+1),n_gram),B7)))/(LEN(find_text)-n_gram+1) |
Dynamic array formulas. |
Named Ranges | ||
---|---|---|
Name | Refers To | Cells |
'Fuzzy Lookup'!find_text | ='Fuzzy Lookup'!$B$2 | D7, D2, C7:C13 |
'Fuzzy Lookup'!n_gram | ='Fuzzy Lookup'!$C$2 | D7, D2, C7:C13 |
Cells with Conditional Formatting | ||||
---|---|---|---|---|
Cell | Condition | Cell Format | Stop If True | |
C7:C13 | Cell Value | top 1 values | text | NO |
D7:D13 | Cell Value | top 1 values | text | NO |
Obviously, the formula could be simplified when the LET() function will be widely available
Hope this helps
Excali