Extract Substring from a Column

Narmerguy

New Member
Joined
Aug 16, 2011
Messages
3
Hi Everyone,

First of all, you guys are all awesome. I've learned a lot just from searching other stuff.

I do some computational biophysics research but am having trouble with some of the data processing because I need bits of information that are nestled inside large strings of text.

Here's an example of what my data might start out looking like:
startingdata.png

What I want are those "gi" numbers. The number begins after "gi|" and ends with "|".

However, part of the trouble I've found is that there are many | elsewhere in the string and the gi numbers can have variable length so that I can't just use mid and select out a certain number of characters after each "gi|".

Here's what I would hope to get as an output:
outputdata.png


I've attached a workbook that has three sheets (one with 10 rows of real data, and then the other two are simply the example dummy input and dummy output shown above). If I wasn't clear let me know.

As a note, my real data can have as many as several thousand rows of results so I don't know if that means I should be using a macro or if a worksheet function is still practical.

Thanks a ton if anyone is able to figure this out! The scientific world will benefit from your contribution :)
 

Excel Facts

What is =ROMAN(40) in Excel?
The Roman numeral for 40 is XL. Bill "MrExcel" Jelen's 40th book was called MrExcel XL.
Ok guys I just thought of something clever that might make this easier for us.

I went through and replaced all the "gi|" with "gi~". The | is common to the strings that I'm searching through but "~" never occurs so I know that any number between a ~ and the next | is going to be a gi number and not any other kind.

So now I guess one path is how to select out all the substrings that start with a ~ and end with a |.
 
Upvote 0
Hi

The following code uses
- regular expressions for efficient parsing of the numer portion of 'gi|xxx|'
- variant arrays for code speed

If you run the code on the sheet that contains your strings, it will create a new sheet with any results, catering for up to 10 gi strings in 10000 records. Which can be increased if needed

Cheers

Dave


Code:
'Press Alt + F11 to open the Visual Basic Editor (VBE)
'From the Menu, choose Insert-Module.
'Paste the code into the right-hand code window.
'Press Alt + F11 to close the VBE
'In Xl2003 Goto Tools … Macro … Macros and double-click GetGi

Sub GetGI()
    Dim rng1 As Range
    Dim rngArea As Range
    Dim lngRow As Long
    Dim lngCol As Long
    Dim lngCalc As Long
    Dim lngFoundR As Long
    Dim lngFoundC As Long
    Dim objReg As Object
    Dim objMC As Object
    Dim objM As Object
    Dim ws As Worksheet
    Dim X()
    Dim Y(1 To 10000, 1 To 10)

    'original code from http://www.experts-exchange.com/A_2684.html

    On Error Resume Next
    Set rng1 = Application.InputBox("Select range for the replacement of leading zeros", "User select", Selection.Address, , , , , 8)
    If rng1 Is Nothing Then Exit Sub
    On Error GoTo 0

    'See Patrick Matthews excellent article on using Regular Expressions with VBA
    Set objReg = CreateObject("vbscript.regexp")
    With objReg
        .Pattern = "gi\|(\d+)\|"
        .Global = True
    End With


    'Speed up the code by turning off screenupdating and setting calculation to manual
    'Disable any code events that may occur when writing to cells
    With Application
        lngCalc = .Calculation
        .ScreenUpdating = False
        .Calculation = xlCalculationManual
        .EnableEvents = False
    End With

    'Test each area in the user selected range

    'Non contiguous range areas are common when using SpecialCells to define specific cell types to work on
    For Each rngArea In rng1.Areas
        'The most common outcome is used for the True outcome to optimise code speed
        If rngArea.Cells.Count > 1 Then
            'If there is more than once cell then set the variant array to the dimensions of the range area
            'Using Value2 provides a useful speed improvement over Value. On my testing it was 2% on blank cells, up to 10% on non-blanks
            X = rngArea.Value2
            For lngRow = 1 To rngArea.Rows.Count
                For lngCol = 1 To rngArea.Columns.Count
                    If objReg.test(X(lngRow, lngCol)) Then
                        lngFoundC = 0
                        lngFoundR = lngFoundR + 1
                        Set objMC = objReg.Execute(X(lngRow, lngCol))
                        For Each objM In objMC
                            lngFoundC = lngFoundC + 1
                            Y(lngFoundR, lngFoundC) = objM.submatches(0)
                        Next
                    End If
                Next lngCol
            Next lngRow
        Else
            'caters for a single cell range area. No variant array required

            If objReg.test(rngArea.Value) Then
                lngFoundC = 0
                lngFoundR = lngFoundR + 1
                Set objMC = objReg.Execute(rngArea.Value)
                For Each objM In objMC
                    lngFoundC = lngFoundC + 1
                    Y(lngFoundR, lngFoundC) = objM.submatches(0)
                Next
            End If

        End If
    Next rngArea

    If lngFoundC > 0 Then
    Set ws = Sheets.Add
    ws.[a1].Resize(UBound(Y, 1), UBound(Y, 2)) = Y
    Else
    MsgBox "No matches"
    End If

    'cleanup the Application settings
    With Application
        .ScreenUpdating = True
        .Calculation = lngCalc
        .EnableEvents = True
    End With

    Set objReg = Nothing
End Sub
 
Upvote 0
Narmerguy,

What version of Excel are you using?

You are posting a picture. This means that if this was a problem where one needed to use your data, anyone trying to help you would have to enter the data manually.

To attach screenshots, see below in my Signature block: Post a screen shot with one of these:
 
Upvote 0
Hi

The following code uses
- regular expressions for efficient parsing of the numer portion of 'gi|xxx|'
- variant arrays for code speed

If you run the code on the sheet that contains your strings, it will create a new sheet with any results, catering for up to 10 gi strings in 10000 records. Which can be increased if needed

Cheers

Dave


Code:
'Press Alt + F11 to open the Visual Basic Editor (VBE)
'From the Menu, choose Insert-Module.
'Paste the code into the right-hand code window.
'Press Alt + F11 to close the VBE
'In Xl2003 Goto Tools … Macro … Macros and double-click GetGi

Sub GetGI()
    Dim rng1 As Range
    Dim rngArea As Range
    Dim lngRow As Long
    Dim lngCol As Long
    Dim lngCalc As Long
    Dim lngFoundR As Long
    Dim lngFoundC As Long
    Dim objReg As Object
    Dim objMC As Object
    Dim objM As Object
    Dim ws As Worksheet
    Dim X()
    Dim Y(1 To 10000, 1 To 10)

    'original code from http://www.experts-exchange.com/A_2684.html

    On Error Resume Next
    Set rng1 = Application.InputBox("Select range for the replacement of leading zeros", "User select", Selection.Address, , , , , 8)
    If rng1 Is Nothing Then Exit Sub
    On Error GoTo 0

    'See Patrick Matthews excellent article on using Regular Expressions with VBA
    Set objReg = CreateObject("vbscript.regexp")
    With objReg
        .Pattern = "gi\|(\d+)\|"
        .Global = True
    End With


    'Speed up the code by turning off screenupdating and setting calculation to manual
    'Disable any code events that may occur when writing to cells
    With Application
        lngCalc = .Calculation
        .ScreenUpdating = False
        .Calculation = xlCalculationManual
        .EnableEvents = False
    End With

    'Test each area in the user selected range

    'Non contiguous range areas are common when using SpecialCells to define specific cell types to work on
    For Each rngArea In rng1.Areas
        'The most common outcome is used for the True outcome to optimise code speed
        If rngArea.Cells.Count > 1 Then
            'If there is more than once cell then set the variant array to the dimensions of the range area
            'Using Value2 provides a useful speed improvement over Value. On my testing it was 2% on blank cells, up to 10% on non-blanks
            X = rngArea.Value2
            For lngRow = 1 To rngArea.Rows.Count
                For lngCol = 1 To rngArea.Columns.Count
                    If objReg.test(X(lngRow, lngCol)) Then
                        lngFoundC = 0
                        lngFoundR = lngFoundR + 1
                        Set objMC = objReg.Execute(X(lngRow, lngCol))
                        For Each objM In objMC
                            lngFoundC = lngFoundC + 1
                            Y(lngFoundR, lngFoundC) = objM.submatches(0)
                        Next
                    End If
                Next lngCol
            Next lngRow
        Else
            'caters for a single cell range area. No variant array required

            If objReg.test(rngArea.Value) Then
                lngFoundC = 0
                lngFoundR = lngFoundR + 1
                Set objMC = objReg.Execute(rngArea.Value)
                For Each objM In objMC
                    lngFoundC = lngFoundC + 1
                    Y(lngFoundR, lngFoundC) = objM.submatches(0)
                Next
            End If

        End If
    Next rngArea

    If lngFoundC > 0 Then
    Set ws = Sheets.Add
    ws.[a1].Resize(UBound(Y, 1), UBound(Y, 2)) = Y
    Else
    MsgBox "No matches"
    End If

    'cleanup the Application settings
    With Application
        .ScreenUpdating = True
        .Calculation = lngCalc
        .EnableEvents = True
    End With

    Set objReg = Nothing
End Sub

That was really well done! Thanks a lot for your help. :biggrin:

Narmerguy,

What version of Excel are you using?

You are posting a picture. This means that if this was a problem where one needed to use your data, anyone trying to help you would have to enter the data manually.

To attach screenshots, see below in my Signature block: Post a screen shot with one of these:

I'm using Excel 2010. I tried to use one of them but apparently it doesn't work anymore? Sorry about the pictures thing, I figured that it would be easier for people that way. I'll change that in the future.
 
Upvote 0

Forum statistics

Threads
1,223,236
Messages
6,170,917
Members
452,366
Latest member
TePunaBloke

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top