Excel - RegEx Building Blocks - 2643

If you like this content, please consider visiting the video on YouTube and subscribe to the MrExcel Channel to show your support!
This video has been published on May 26, 2024.
Microsoft Excel Tutorial: Learning the Building Blocks for RegEx in Excel
Learn the basic building blocks of RegEx in Excel.

To download the RegEx Cheat Sheet and workbook from today: Excel - RegEx Building Blocks - 2643 Sample Files - MrExcel Publishing

Welcome back to MrExcel's YouTube channel! In this video, we'll dive into the building blocks for mastering Regular Expressions (RegEx) and their integration in Excel. If you find this video helpful, don’t forget to click the Like button below – it helps us reach more viewers. We've got over 2,500 videos on various Excel topics, so there's plenty to explore!

Our focus today is on RegEx, specifically the new functions in Excel: RegExTest, RegExExtract, and RegExReplace. I’ll break down these functions step-by-step, making it easy for you to grasp the concepts. And remember, you can always visit our site at mrx.cl/gpt to type in any questions you have. You'll get a summary of relevant videos where I've covered those topics extensively.

This video is designed to get you comfortable with RegEx, even if you're a complete beginner. I’ll walk you through examples like matching specific sequences of letters, using wildcards, and understanding special characters. By the end, you’ll have a solid foundation in using RegEx within Excel. Don’t forget to download the tip card below, summarizing everything we discuss today.

One of the key points we’ll cover is how different characters and sequences can be matched using RegEx patterns. For instance, how to specify that a cell starts with a certain string or contains a specific character set. These examples will demonstrate the versatility and power of RegEx in data manipulation and search tasks.

Lastly, I’ll share some advanced techniques, like using repetition operators and handling white spaces, including tricky non-breaking spaces. This will prepare you to tackle more complex text processing tasks in Excel. As always, if you have any questions, leave them in the comments, and I'll be sure to address them.

Thanks for watching! If you enjoyed this video, please Like, Subscribe, and Ring the Bell to stay updated with all my latest Excel tips and tutorials. See you next time for another exciting video from MrExcel!

Buy Bill Jelen's latest Excel book: MrExcel 2024 Igniting Excel

Table of Contents
(00:00) Intro to RegEx Building Blocks
(00:10) MrExcel GPT to answer your Excel questions
(00:30) Download RegEx Tip Card
(00:41) Building Blocks of RegEx
(01:13) RegEx Basics
(02:15) RegEx Starts With or Ends With
(02:35) RegEx Two Meanings of Tilde
(03:52) RegEx Wildcards
(05:48) More RegEx Wildcards
(08:05) Repetition in RegEx
(12:07) Whitespace in RegEx
(14:02) Converting Non-Breaking Space to Space Using RegEx
(16:24) Returning part of a match in REGEXEXTRACT
(18:50) Transposing multiple spilled results
(19:30) Download the Workbook
(19:46) Like, Subscribe & Ring the Bell

This video answers these common search terms:
Excel RegEx tutorial
How to use Regular Expressions in Excel
Excel RegEx functions explained
RegExTest Excel function guide
Excel RegExExtract examples
Advanced Excel RegEx techniques
RegExReplace in Excel tutorial
Learning RegEx for Excel beginners
Excel text manipulation with RegEx
MrExcel RegEx basics
maxresdefault.jpg


Transcript of the video:
Just down below the video, if you click Like,  that will make sure that YouTube shows this   video to more people.
Thanks.  The building blocks for Learning RegEx and Excel.
There's over 2,500 videos here at MrExcel.  Mrx.cl/gpt come here, type your question.
“Repeat a pivot table for each customer.”  It's going to give you a  short summary of the videos.  And then it's going to give you links to  the videos where I've talked about it.  Every word that I've spoken  on YouTube in over 15 years.  MrX.CL slash gpt, give it a try.
After watching this video, you will have all the   building blocks you need to put together RegEx.
Make sure to download this tip card, two pages,   with a summary of everything we  are going to talk about today.  Okay, in the previous video you talked about  these three new functions that came to Excel,   RegExTest, RegExExtract, RegExreplace.
And I had mentioned that I had a RegEx 101,   but that I'd do it in another video.
So this is that video.  This hopefully will just get you up to speed.
I have to tell you, in all the years up   until the point that it was in  Excel, I was horrible at RegEx.  I would always just go ask a friend how to do it.
But after studying this,   it's really not that hard.
It's pretty simple to understand.  Okay, so let's just start off,  put some letters in the pattern.  D, E, F.
That's saying that I want the   letters D, E, F in sequence.
Alright?  But there might be something where we  want either the letter D or E or F.  And so that's in square brackets.
This is a letter group and that   will match if there's a, D, E, or F.
But sometimes you don't want to have   to type the whole thing.
D to H will give us one   letter from the letters D through H.
So here's a whole bunch of different examples.  We're looking for D, E, F, and the  only one that has those three letters   exactly is A, B, C, D, E, F.
None of the others match.  But if we ask for square bracket, D, E, F, then  that's anything that has a D anywhere in it.  This one, this one, this all match or a  wider variety D through H picks up the G,   G, G and the H, H, H.
So square brackets means one   of the items in this group and you're  allowed to have a dash in the middle.  Alright, next up, what if you want to  start with a particular bit of text?  So the cell has to start  with the put a caret before.  Caret, t, h, e, we'll start with “the”.
Or if it has to end with something,   end dollar sign will end with end.
Right now, just right up front we   have to talk about this super annoying thing  that the caret, it has two different meanings.  This has happened before in Excel, like the F4  key puts dollar signs in if you're in edit mode.  Or repeats the last command If you're  not in edit mode, it has two jobs.  Well, here in RegEx, the caret inside  of a letter group means not X, Y, or Z.  So in this case, this caret is saying it has  to start with something that's not A through M.  A little confusing that the caret has two  different meanings, but you'll get used to it.  So if we're looking for something that  starts with the, there's only one.  If we're looking at something that ends  with end, it's this one or that one.  If we're looking for something that  contains a character that's not X, y, or Z.  Alright?
So this one does not contain a character   that's not X, Y, Z, but everything else just  about contains characters that aren't X, y, Z.  This one is actually far wider.
It means that it contains a character   that's A through Z capital, A through W,  lowercase, any of the digits, zero to nine,   any of the special characters.
Yeah.  And then this is one that starts with  something that's not A through M.  So a lot of these start with  something that's not A through M.  Even the I, that's a capital I.
And we're asking for lowercase A through M.  Alright, so that starts with and ends with, right?
This is all pretty easy so far, right?  Except for the wild cards.
Wildcard.  A wildcard: a period is a wildcard  for any single character, anything.  So what if you need to look for an actual period?
Then you put a backslash period.  So the backslash is kind of  a way to escape, to say, hey,   RegEx is using this period to mean something else.
A wild card in this case, put a backslash before   to say no, we actually want a real period.
There's a lot of special characters, okay?  So in this case, zero to nine is looking for  any single digit, but there's a beautiful   shortcut for that.
Backslash D is also   the shortcut for any single digit.
All right, now let's talk about how to repeat.  In this case, the period is  looking for any character.  And then the three is saying we  have three of that character.  Or we want at least two, but up to three.
So put a comma in between them.  So we're looking for a specific repetition.
Alright?  So here any character that  period means any character.  All of these are going to match except for  the blank cell, there's no character there.  But if we put a backslash period, well then only  the thing that has the period is going to match.  Looking for anything with a digit anywhere in it.
So we get those two and all of these have digits,   but it's shorter.
Just put backslash D,   it'll be the exact same answer here.
These are identical.  Identical column F and G.
Alright?  This is something that has  at least three characters.  So A has only one character  doesn't match this one.  The n has eight characters total,  including the space and the period,   but as long as it's more than three, we get it.
These two and these two only have two characters.  So they don't match here, but they do match here  where we're allowing two or three characters.  Alright?
So that's how you specify that we're   looking for a certain number of things.
Alright?  So remember this wild card is for any single  character or back slash D is for any single digit.  We're going to repeat those here under wild card.
So we start out with those two.  Period is a wild card for any single character.
Backslash D is any digit.  This is a great way if you want lowercase A  to Z or uppercase A to Z, it looks like this.  And then they have something for digits,  lowercase A to Z, uppercase A to Z, and that's W,   lowercase W is the code for any of those.
It seems like that'd be really useful.  We use that one a lot.
Back slash capital W means the reverse of that.  It's not A to Z or lowercase  A to Z or zero to nine.  This happens a lot where you take the lowercase  and you capitalize it and it makes it into a Not.  Okay?
Now we're going to talk about spaces here.  Actually two tabs from now backslash s  is a space or any kind of a white space.  There's all kinds of spaces that you  might encounter, like a character   160 space and that'll be handled.
And then not a space, not white space.  We'll look for something that's a non-space.
So here, if we're looking for a digit,   that's a match.
If we're looking for   something that's a letter while the one is not  a match, the exclamation point, not a match.  And then the only thing here that doesn't  fall into the WW is a great catchall,   A to Z, uppercase, lowercase, and digits.
Special characters like the exclamation   point or anything shifted along the  top row of your keyboard will be that.  And then not a W, right?
So W is A to Z, upper or lower zero to nine.  Well, what contains not a W?
Well the exclamation point, certainly it does.  But also here there's a space.
So a space Z or A space capital Z,   those fall into the not a W.
Okay, so now check out these two.  This one is saying that we have an A  followed by a space followed by a Z.  So right there, that one matches  because it's K sensitive, it's   not going to see the capital A, space Z.
And in this case it's with the capitals   slash capitals says something that's not a space.
In other words, we start with an A end with a Z.  And in the middle we have a letter  so that A and BZ there will work.  Next up, repetition.
Alright, so we know   back slash D is any digits from zero to  nine, a single digit from zero to nine.  What if we need to look for multiple  digits, like 1 23 main or 1234 Peyton place?  Alright, so the first thing they have,  and I hate this one back D asterisk   says there's going to be some number of digits.
We don't know how many digits.  There might be two digits,  there might be seven digits.  But it's also possible with  the asterisk that there might   be zero digits, which is really weird to me.
Why are we saying, hey go, we need a digit here,   but in fact there might be no digits at all.
So back slash D is getting everything here   that has a digit, that's great, but back slash d  asterisk is also returning things that have zero   digits at all.
Right?  Now in my RegEx journey, this  one seems pretty useless.  I'm sure there'll be use cases later, but what we  really need I think is D plus that says there's a   digit at least one and maybe more digits in a row.
So like a one, two, a, 1, 2, 3, 1, 2, 3,   4, 1, 2, 3, 4, 5, A, one Z.
But when there's no digits and then it's   not going to show up as true for that.
Alright?  The other one that's interesting is there can  be a digit or it's okay to not have a digit,   but we can't have two, three, or four digits.
So D question mark is specifically zero   or one digit.
They say that   the question mark means that it's optional.
There could or could not be a digit there.  So the way that I built this is we  need a letter from A to Z, lowercase,   and then backslash the question mark.
And then another letter from A to Z select.  Looking down here, this one fits the pattern.
A one Z is lowercase letter A, one and a Z.  And then even here, lowercase letter,  no digit and another lowercase letter.  Those both work.
But here where we   have a and then three digits in a row  and then the Z, that one does not work.  See, and it's also funny, all of these are  marked as correct because there's no digit.  So we have two lowercase letters and no digit.
Alright?  If we know specifically that we need three  digits slash D and then curly braces three,   or you're allowed to put a range in here,  like anywhere from two to four digits.  So curly braces two comma four is  between two and four digits like that.  So here slash D three, it's only the  ones that have three digits in a row,   but also four digits and five digits.
These have 1, 2, 3, 1, 2, 3.  It's definitely there.
So it finds the pattern.  Even a 9 9, 9 Z with three digits will  fire there between two and four digits.  That picks up a1 two.
That wasn't true before.  And again, all the others, even things with five  digits because it's saying, Hey, is there between   two and four digits?
Yes.  And then after that there happens to be a five.
All right, this one is interesting to me.  So if we have something like sounds like bur  it's cold, that could be spelled with three Rs,   four Rs, or five Rs, but not BR,  and not even BRR in my rule set.  So we're looking for B and then anywhere from  three to five Rs followed by an optional space.  So with that pattern, the BR and BRR are not true,  but the ber or ber burr is coming up as true.  Now the thing that's interesting to me is  that three to five refers to just the one   character before it.
What if we wanted to   have the word repeat the entire word?
So then we're going to put the whole thing in.  So now we need BRR space anywhere from two to  three times, and that will be true in that case.  So the parentheses creating a group  that the repetition direction will   refer to that entire group.
I was super interested in how   they're going to deal with white space.
Alright, so if you're just looking for   a space backs slash D is any  digit, the space back slash D.  So we're looking for digit space digit.
Now, one hassle with this is, so that matches   nine space one, which is what I was thinking of.
But it also matches nine, nine space,   nine nine because in the middle of that  whole thing is a digit, a space and a digit.  So if we want to say it has to start with the  digit, then a space and then end with the digit,   you need that caret at the beginning that  says it starts with dollar sign at the end   says it ends with, and that'll make sure that  that one matches, but this one does not match.  Okay?
And then other   kinds of whitespace back slash t is a tab slash  N is what you get when you press al to enter.  That's also known as a character.
10 VB is a VBF.  Some people call it a line feed  back slash R is a carriage return,   that's a character 13 or a V, BCR and VBA.
And then back slash s is either a space or   any of these three get counted as white space.
Basically think of it as some sort of white space.  And then backslash, not S is saying  that it would not be a space, right?  So in my codes across the top here,  this is backslash backs slash s,   and this has an axial space there.
And I also found when I was looking to   see how character one 60 would be handled  that backslash H in some implementations   of RegEx will give you more spaces.
So here I have word alt enter wrap,   and it gets caught by the backslash N,  but it also gets caught by the backslash   S because remember backslash S is looking for  tabs, line feed, carriage return or spaces.  Okay?
So non breaking spaces, this is one   that just never gets handled well by Excel.
And if we're pulling data off a website,   we're just scraping data off a website, there's  going to be character one 60 spaces there.  So right here I have evilly used the formula  to put in a number one and then three   character, one sixties, and then the number nine.
So does that get seen as a digit space digit?  No.
Does it get seen as starting with a digit,   a space and ending with a digit?
No.  Does it get seen as any of these?
No.  But it does thankfully get  seen as slash s and slash h.  Now out in Stack Overflow, they actually said  that you'd have to use back slash H for this.  And it seems like Excel is being better with this.
They're understanding that when we say   back slash s, they're treating the character  one 60 as a space, it sees it as a space.  It doesn't see it when we just type  a space, but if you use backslash S,   you're going to get all of those spaces.
Alright, so here's the problem.  So in this one, and then character 1 63 times  nine when I try and use trim, trim in Excel   will get rid of the interior repeated spaces.
So that should change it to one Space Nine,   but it doesn't see the character one 60  as a space, so it doesn't do anything.  So down here I'm going to use RegEx to  replace, let's put that formula next to it.  So I'm going to replace the  backslash S with a real space.  So that converts all those evil character  one 60 spaces to regular spaces right there.  That's worth the price of admission.
That formula.  We should make that formula red  because that's a formula I'm   going to come back to again and again and again.
And then once I've converted the evil character,   one 60 spaces, the regular spaces,  then Trim does the right thing.  It's funny, the character one 60 is the  one that we run into all the time because   the web is popular, but there's all kinds of  weird spaces out there and it very nicely,   a space character only matches a character 32.
But in Excel anyway, slash s slash h are both   programmed to find all of these weird spaces,  M spaces and spaces, all kinds of weird things.  We never run into any of these  except for character one 60.  But maybe if you're, I dunno, I don't  know why you would get any of these,   but in case you do great news, they're handled.
Alright, now look, there's one last thing here   and this is really, really powerful.
Alright, this is matching and then   selecting a group from the selection.
So here we're looking for any character   plus means any number of any character  followed by specifically a period.  And then P, D, F and ends there, right?
There's nothing after P, D, F.  So this will get all of these files that are  P, D, F files, but not the Excel files, right?  Good enough.
Let's modify that.  Now that I know that it's a pdf d,  I don't need to see the PDF anymore.  So I modify that pattern to put a parentheses  there and that says the only part that I want is   all of the characters up to the period.
I don't want the period PD,   F get me everything before that.
So check out that pattern with the group.  That's called a group.
I'm selecting a group.  It will return everything up to the PDF.
Now, we had to modify this formula,   the regex extract.
There is an argument   out here called return mode, and this is  called a capture group of the first match.  So we have to change that argument that  really was just defaulting to zero to be   a two to make that work.
It's pretty cool.  So then here I want to find  everything that's an Excel file.  So either X-L-S-X-X-L-S-M, XLSB, I'm ignoring  the templates for right now and notice that I'm   allowing either XMB and the question mark.
That question mark is saying, Hey,   there could be zero of those or one of those.
So let me capture XLS and it has to end with that.  So all of these get us the file name here.
Oh, except for this one.  What's up with that?
How?  I don't know how I got the file extension to  be capitals scatter, other graph template.  That must be ancient.
I have no idea how that's there,   but it's there so it's not being seen.
So the modification here is to come   out and do a case insensitive match.
So the case sensitivity allows me to handle that.  That's an important part.
And then finally, all right,   check out this awesome little bit of code.
I took this one and I said I want the file name.  So that's the parentheses.
And then, hey, since I don't   know what the file type is, go ahead.
I don't want the dot, but I want   everything else after that.
So give me all of the stuff   that's either XLS or x Ls X or x ls M or  XLS, B and return that as a second group.  All.
Now for this one to work or this   don't want to work, and I mentioned this at  the end of yesterday's video, I had an example.  We're saying that we want the two.
So we're getting all of the groups.  And when I do this, it's going to put the  first thing here, the file name in this cell,   and then it's going to put  the extension underneath it.  It's coming back as a vertical array.
So I just have to transpose that to make   it be a horizontal array.
All right?  But there I'm defining multiple  groups and I'm getting them all.  All right, so there you go.
There's the RegEx 1 0 1,   I think with this workbook.
And by the way, you can download this workbook.  There's enough examples here to get us through  a lot of the types of pattern matching that we   might need to do.
So there you go.  Hey, thanks for stopping by.
We'll see you next time for another   net cast from MrExcel.
If you like  These videos, please down below,  Like, Subscribe, and Ring the Bell.  Feel free to post any questions or  comments down in the comments below.
 

Forum statistics

Threads
1,224,837
Messages
6,181,255
Members
453,028
Latest member
letswriteafairytale

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top