Extract Multiple Sections of Text in a String, Between Two Repeating Words

blafarm

Board Regular
Joined
Oct 14, 2018
Messages
59
Hi All,

I have a bit of a challenge and was hoping for some guidance.

I have a very long string of text in a single cell and I need to extract multiple sections of text that appear between two words that repeat in the string.

For example, here is a very simplified version of the text string in Cell A1:

{"date": 5/7/19 headline:"GE Posts Profit" source:"CNBC"}{"date": 5/8/19 headline:"GE Dividend Shrink" source:"MSN"}{"date": 5/9/19 headline:"GE Bankrupt" source:"WSJ"}

I have had limited success with the following formula, however it only extracts the text between the first appearance of the words: "headline" and "source":

Formula: =MID(A1,SEARCH("headline",A1)+2,SEARCH("source:",A1)-SEARCH("headline",A1)-4)

Result: GE Posts Profit


This formula does not capture ALL of the headlines in the full string -- only the first headline.

My initial goal is to learn a formula that will extract ALL of the headlines within the string, and that will place the results in a succession of horizontal or vertical cells that either look like this:

[TABLE="class: grid, width: 400"]
<tbody>[TR]
[TD]GE Posts Profit[/TD]
[TD]GE Dividends Shrink[/TD]
[TD]GE Bankrupt[/TD]
[/TR]
</tbody>[/TABLE]

... or that look like this:

[TABLE="class: grid, width: 150"]
<tbody>[TR]
[TD]GE Posts Profit[/TD]
[/TR]
[TR]
[TD]GE Dividends Shrink[/TD]
[/TR]
[TR]
[TD]GE Bankrupt[/TD]
[/TR]
</tbody>[/TABLE]

I have some additional questions that are directly related to this inquiry, but I'm trying to start by keeping it simple.

Thanks very much in advance.

Cheers
 
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

Thank you very much jtakw!

You are clearly very talented -- and very generous with your time.

Any thoughts on the "Text result of SUBSTITUTE is longer than the limit of 50000 characters" error described above?

Thanks again.

You're welcome, thanks for the feedback.

This updated formula should Not error out due to character limits...
For Excel, as long as your Original text string is no more than, say 10,000 characters, you should be fine.
I don't use Google sheets, but if the limit is 50,000, then as long as your Original text string is no more than around 15,000 characters, you should be fine.
 
Upvote 0

Excel Facts

Back into an answer in Excel
Use Data, What-If Analysis, Goal Seek to find the correct input cell value to reach a desired result
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

You're welcome, thanks for the feedback.
This updated formula should Not error out due to character limits...
For Excel, as long as your Original text string is no more than, say 10,000 characters, you should be fine.
I don't use Google sheets, but if the limit is 50,000, then as long as your Original text string is no more than around 15,000 characters, you should be fine.


Thank you very much for the prompt reply.

But, I have to ask you a tough question:

In the hours since your first post, I have spent the day building-out a very complicated page based on your first excellent solution.

It turns-out that creating columns of the individual data points is actually working very well for me (Necessity is the Mother of Invention!).

That's why I indicated that my only remaining problem was the error message.

And due to that error message, I am wondering if you would be kind enough to provide a version of your updated formula ("that should not error out due to character limits") but that functions the same way as your first formula -- which created a single column of identical data points.

Is that possible? If so, I would be very grateful.
 
Last edited:
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

Can you show a sample of your new layout, and explain how you want the text strings extracted (so there's no guess work)?

Thanks.
 
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

Btw:

I forgot to mention that your new formula more than doubles my results before hitting the error.
So, it is very effective and very useful.
Just hoping a version can be easily created that performs the same as your first generous suggestion.
Thank you.
 
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

Can you show a sample of your new layout, and explain how you want the text strings extracted (so there's no guess work)?

Thanks.
Absolutely jtakw. Sorry if my last post was confusing.
And thanks very much for asking -- and for being receptive to my request.
Your original formula for creating a vertical column of extracted data was perfect:

=TRIM(SUBSTITUTE(MID(SUBSTITUTE(SUBSTITUTE(A$1,"headline:",REPT(" ",LEN(A$1))),"source:",REPT(" ",LEN(A$1))),(ROWS(A$2:A2)*2-1)*LEN(A$1),LEN(A$1)),CHAR(34),""))

And it is that exact formula that represents the basis of the sheet I created today.

Your original formula actually works better than I could have imagined.
It allowed me to create a nice block of related data points that are horizontally aligned with each other.

Here is a screen grab of a very simplified version of the page:

Layout.png



In the screen grab:

A2 = Text String
A5:A11 = Your original formula, isolating the “DateTime”
B5:B11 = Your original formula, isolating the “Headline”
C5:C11 = Your original formula, isolating the “URL”
D5:D11 = Your original formula, isolating the “Source”

I just pasted your original formula into columns A, B, C, and D -- and customized the "start" and "end" boundaries for the data I wanted to extract.

Now, your new formula is also great, and very much appreciated. I'm quite sure I will have a use for it in the near future.
However, for this sheet, having all of the data point categories in one column actually makes it harder to arrive at the layout in the screen grab above.

But importantly, as you pointed out and as my tests confirm – your new formula structure significantly increases the number of extractions I can achieve before getting the error message.

So, I'm hoping that there might be a way to modify your original formula, so that it works the exact same way -- but to do so using whatever technique(s) you used in the new formula that allowed it to significantly increase the number of extractions before getting the error.

Does that make sense?
If not, I will happily take another stab at explaining it.

Thanks very much for your continued support.
 
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

Thank you for your prompt reply.

I tried your formula but am getting an error related to the requirement of the pipe argument "|"

That symbol does not appear in my text strings (please see my post above).

Thank you.

Hey, it should work as I have replaced instances of "headline:" that appear in your string, with the pipe, then searched for the pipe, the reason to use a SEARCH(SUBSTITUTE combination is so you can iterate through instances, hence the use of ROW() and dragging it down - what error do you get when trying it? It works on my screen so curious why it doesn't work for you
 
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

So, based on your sample layout in Post # 15, and using your sample data in Post # 5...

A5 formula copied Across to Column D and Down as far as needed:


Book1
A
15767
2[{"datetime":1557328805000 headline:"GE CEO says weaker quarters likely after 'good start' to year" source:"Reuters" url:"https://cloud.iexapis.com/v1/news/article/1e7714c0-93f6-4921-af24-f3122a2e809e" summary:"General Electric Co Chief Executive Larry Culp said on Wednesday the company will likely have weaker quarters in the rest of the year after a surprisingly "good start" in the first quarter." related:"GE" image:"https://cloud.iexapis.com/v1/news/image/1e7714c0-93f6-4921-af24-f3122a2e809e" lang:"en" hasPaywall:false} {"datetime":1557307800000 headline:"GE Investors Gather to Elect Revamped Board Air Gripes" source:"The Wall Street Journal" url:"https://cloud.iexapis.com/v1/news/article/dfb8a33f-9194-493e-9b83-9709dc0d18a9" summary:"General Electric Chairman and CEO Larry Culp who joined the company just a year ago is expected to answer questions from investors at the shareholder meeting in Tarrytown N.Y." related:"GE" image:"https://cloud.iexapis.com/v1/news/image/dfb8a33f-9194-493e-9b83-9709dc0d18a9" lang:"en" hasPaywall:true} {"datetime":1557205509000 headline:"IndiGo is making plans for another aggressive push" source:"The Economic Times India" url:"https://cloud.iexapis.com/v1/news/article/6ffe2450-5e9f-4166-b5f6-a26a79f10908" summary:"by Anurag KotokyIndiGo is in talks with Airbus SE for another large plane order in a sign Asias biggest budget carrier has no intention of letting up on a blistering pace of expansion.The Indian airline which adds an aircraft to its fleet every week is in discussions to buy a longer-range version of the European planemakers newest narrow-body jet according to Chief Executive Officer Ronojoy Dutta.Existing orders will see IndiGo through the next two years Dutta said in an interview with Bloomberg News Editor-in-Chief John Micklethwait. After that the carrier is considering adding new planes including Airbuss long-distance A321neo LR and the yet-to-be-released A321 XLR he said.When we order well order in bulk for sure Dutta said in New Delhi in his first interview since becoming CEO of the airline in January. I cant give you a tentative number but it will be large. We need longer range.Buying BigThough Dutta didnt specify the size of IndiGos next purchase its likely to be a multi-billion-dollar order based on the companys history." related:"GE" image:"https://cloud.iexapis.com/v1/news/image/6ffe2450-5e9f-4166-b5f6-a26a79f10908" lang:"en" hasPaywall:false} {"datetime":1557085967000 headline:"General Electric: Larry Culp Is The Man (For The Job)" source:"Seeking Alpha" url:"https://cloud.iexapis.com/v1/news/article/87bdd7f9-1aec-4943-835f-7a436483cede" summary:"General Electric reported strong Q1 2019 results that beat the consensus bottom-line estimate. The stock is up big since General Electric reported Q1 2019 resul" related:"GE" image:"https://cloud.iexapis.com/v1/news/image/87bdd7f9-1aec-4943-835f-7a436483cede" lang:"en" hasPaywall:false} {"datetime":1556912727000 headline:"GE to build 500 MW coal power plant in Kosovo" source:"Seeking Alpha" url:"https://cloud.iexapis.com/v1/news/article/cf2c1bf4-2740-4f80-981b-7a68a8ec327f" summary:"A General Electric-led ([[GE]] +2.1%) consortium is selected to build and equip a new 500 MW coal-fired power plant in Kosovo. The new plant designed to m" related:"GE" image:"https://cloud.iexapis.com/v1/news/image/cf2c1bf4-2740-4f80-981b-7a68a8ec327f" lang:"en" hasPaywall:false} {"datetime":1556901660000 headline:"Good News! General Electric Burned Less Cash in Q1" source:"The Motley Fool" url:"https://cloud.iexapis.com/v1/news/article/492e3f89-1206-4547-a8a8-db086e773abb" summary:"Is there light at the end of the tunnel for the troubled industrial conglomerate?" related:"GE" image:"https://cloud.iexapis.com/v1/news/image/492e3f89-1206-4547-a8a8-db086e773abb" lang:"en" hasPaywall:false} {"datetime":1556893560000 headline:"The Zacks Analyst Blog Highlights: Apple Microsoft General Electric Ecolab and Celgene" source:"Zacks Investment Research" url:"https://cloud.iexapis.com/v1/news/article/739fd73b-1e73-4d85-8e8b-7ced62a4e5ea" summary:"The Zacks Analyst Blog Highlights: Apple Microsoft General Electric Ecolab and Celgene" related:"GE" image:"https://cloud.iexapis.com/v1/news/image/739fd73b-1e73-4d85-8e8b-7ced62a4e5ea" lang:"en" hasPaywall:false} {"datetime":1556889257000 headline:"GE to build Kosovo's new 500 MW coal power plant" source:"Reuters" url:"https://cloud.iexapis.com/v1/news/article/f5e2699b-b92e-4aed-aa90-02b557ce5652" summary:"Kosovo and London-listed power firm ContourGlobal said on Friday they had chosen a consortium of General Electric subsidiaries to build and equip a new 500 megawatt (MW)coal-fired power plant in the Balkan country." related:"GE" image:"https://cloud.iexapis.com/v1/news/image/f5e2699b-b92e-4aed-aa90-02b557ce5652" lang:"en" hasPaywall:false} {"datetime":1556877840000 headline:"Key Takeaways From General Electric's Earnings" source:"The Motley Fool" url:"https://cloud.iexapis.com/v1/news/article/356ca854-67f0-4e1b-a782-4f1986da535f" summary:"The company delivered a solid but unspectacular quarter but that might be just what investors need right now." related:"GE" image:"https://cloud.iexapis.com/v1/news/image/356ca854-67f0-4e1b-a782-4f1986da535f" lang:"en" hasPaywall:false} {"datetime":1556833186000 headline:"General Electric: Still A Mystery" source:"Seeking Alpha" url:"https://cloud.iexapis.com/v1/news/article/2a0c3a5c-32fc-4d40-b8e7-458792672cda" summary:"General Electric released its first quarter reports with a little bit of positive news and a little bit of negative news. As one analyst put it Did this change" related:"GE" image:"https://cloud.iexapis.com/v1/news/image/2a0c3a5c-32fc-4d40-b8e7-458792672cda" lang:"en" hasPaywall:false}]
Sheet661 (3)
Cell Formulas
RangeFormula
A1=LEN(A2)



Book1
ABCDE
3
4
51557328805000GE CEO says weaker quarters likely after 'good start' to yearReutershttps//cloud.iexapis.com/v1/news/article/1e7714c0-93f6-4921-af24-f3122a2e809e
61557307800000GE Investors Gather to Elect Revamped Board Air GripesThe Wall Street Journalhttps//cloud.iexapis.com/v1/news/article/dfb8a33f-9194-493e-9b83-9709dc0d18a9
71557205509000IndiGo is making plans for another aggressive pushThe Economic Times Indiahttps//cloud.iexapis.com/v1/news/article/6ffe2450-5e9f-4166-b5f6-a26a79f10908
81557085967000General Electric Larry Culp Is The Man (For The Job)Seeking Alphahttps//cloud.iexapis.com/v1/news/article/87bdd7f9-1aec-4943-835f-7a436483cede
91556912727000GE to build 500 MW coal power plant in KosovoSeeking Alphahttps//cloud.iexapis.com/v1/news/article/cf2c1bf4-2740-4f80-981b-7a68a8ec327f
101556901660000Good News! General Electric Burned Less Cash in Q1The Motley Foolhttps//cloud.iexapis.com/v1/news/article/492e3f89-1206-4547-a8a8-db086e773abb
111556893560000The Zacks Analyst Blog Highlights Apple Microsoft General Electric Ecolab and CelgeneZacks Investment Researchhttps//cloud.iexapis.com/v1/news/article/739fd73b-1e73-4d85-8e8b-7ced62a4e5ea
121556889257000GE to build Kosovo's new 500 MW coal power plantReutershttps//cloud.iexapis.com/v1/news/article/f5e2699b-b92e-4aed-aa90-02b557ce5652
131556877840000Key Takeaways From General Electric's EarningsThe Motley Foolhttps//cloud.iexapis.com/v1/news/article/356ca854-67f0-4e1b-a782-4f1986da535f
141556833186000General Electric Still A MysterySeeking Alphahttps//cloud.iexapis.com/v1/news/article/2a0c3a5c-32fc-4d40-b8e7-458792672cda
15
Sheet661 (3)
Cell Formulas
RangeFormula
A5=IF(COLUMNS($A5:A5)>4,"",SUBSTITUTE(SUBSTITUTE(TRIM(MID(SUBSTITUTE(SUBSTITUTE($A$2,CHOOSE(COLUMNS($A5:A5),"datetime","headline:","source:","url:"),REPT(" ",LEN($A$2)),ROWS(A$5:A5)),CHOOSE(COLUMNS($A5:A5),"headline:","source:","url:","summary:"),REPT(" ",LEN($A$2)),ROWS(A$5:A5)),LEN($A$2),LEN($A$2))),CHAR(34),""),":",""))


NOTE: Formula is tested on your sample data from Post # 5, which is 5767 characters in length as indicated in A1 above, and does Not error out.
My formula from Post # 9 was Also tested on your sample from Post # 5 as shown in Post # 9, and does Not error out, if it errored for you, you must have changed/adjusted the formula incorrectly.
 
Last edited:
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

So, based on your sample layout in Post # 15, and using your sample data in Post # 5...

A5 formula copied Across to Column D and Down as far as needed:

[TABLE="width: 85%"]
<tbody>[TR]
[TD]Worksheet Formulas[TABLE="width: 100%"]
<thead>[TR="bgcolor: [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=DAE7F5]#DAE7F5[/URL] "]
[TH="width: 10px"]Cell[/TH]
[TH="align: left"]Formula[/TH]
[/TR]
</thead><tbody>[TR]
[TH="width: 10px, bgcolor: [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=DAE7F5]#DAE7F5[/URL] "]A5[/TH]
[TD="align: left"]=IF(COLUMNS($A5:A5)>4,"",SUBSTITUTE(SUBSTITUTE(TRIM(MID(SUBSTITUTE(SUBSTITUTE($A$2,CHOOSE(COLUMNS($A5:A5),"datetime","headline:","source:","url:"),REPT(" ",LEN($A$2)),ROWS(A$5:A5)),CHOOSE(COLUMNS($A5:A5),"headline:","source:","url:","summary:"),REPT(" ",LEN($A$2)),ROWS(A$5:A5)),LEN($A$2),LEN($A$2))),CHAR(34),""),":",""))[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[/TR]
</tbody>[/TABLE]

NOTE: Formula is tested on your sample data from Post # 5, which is 5767 characters in length as indicated in A1 above, and does Not error out.
My formula from Post # 9 was Also tested on your sample from Post # 5 as shown in Post # 9, and does Not error out, if it errored for you, you must have changed/adjusted the formula incorrectly.

jtakw,

That works perfectly in Google Sheets.
Thank you very, very much.

For some reason, it does not seem to be working in Excel 365 on a desktop.

Can you please suggest why that might be happening?

Thank you.
 
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

EDIT:

I think I can see the source of the problem.
In Excel, the WEBSERVICE call seems to be comma delimited, which is different than the Google Sheets IMPORTDATA call.

[{"datetime":1557439398000,"headline":"The Little-Known Truth About The Carbon Footprint For The Tesla Model 3 And Other BEVs","source":"Seeking Alpha","url":"https://cloud.iexapis.com/v1/news/article/4e3cfd53-00e3-402a-89a7-fabc5cde087b","summary":"No summary available.","related":"TSLA,TTM","image":"https://cloud.iexapis.com/v1/news/image/4e3cfd53-00e3-402a-89a7-fabc5cde087b","lang":"en","hasPaywall":false},{"datetime":1557431760000,"headline":"Why Sunrun Stock Dropped 6% Today","source":"The Motley Fool","url":"https://cloud.iexapis.com/v1/news/article/8c733a8b-bf9e-4fa2-ab17-a0bf75a59e4e","summary":"Weak guidance (and competition from Tesla?) eclipse good news on Sunrun sales.","related":"TSLA","image":"https://cloud.iexapis.com/v1/news/image/8c733a8b-bf9e-4fa2-ab17-a0bf75a59e4e","lang":"en","hasPaywall":false},{"datetime":1557431040000,"headline":"When GM Sold a Plant to an Electric-Vehicle Startup...Named Tesla","source":"Barron's","url":"https://cloud.iexapis.com/v1/news/article/bb5a1634-9477-400c-aa9c-3f26afb55490","summary":"Workhorse is buying an idled car plant from General Motors. This isn’t the first time GM has sold a plant cheap to an upstart maker of electric vehicles.","related":"TSLA","image":"https://cloud.iexapis.com/v1/news/image/bb5a1634-9477-400c-aa9c-3f26afb55490","lang":"en","hasPaywall":true},{"datetime":1557426766000,"headline":"Tesla Battery Swap “Enron trading desk all over again”","source":"ValueWalk","url":"https://cloud.iexapis.com/v1/news/article/b9767020-bdbc-453c-9d91-719cceb05b79","summary":"Whitney Tilson’s email to investors discussing Glenn’s wise words on Tesla Inc (NASDAQ:TSLA); comments; Motor Mouths article. 1) Glenn with some wise words: Q1 hedge fund letters, conference, scoops etc Frankly, most of the people, many bears included, have become […] The post Tesla Battery Swap “Enron trading desk all over again” appeared first on ValueWalk .","related":"TSLA","image":"https://cloud.iexapis.com/v1/news/image/b9767020-bdbc-453c-9d91-719cceb05b79","lang":"en","hasPaywall":false},{"datetime":1557416228000,"headline":"Hilary Duff and boyfriend Matthew Koma are engaged: 'He asked me to be his wife'","source":"Business Insider","url":"https://cloud.iexapis.com/v1/news/article/1ac8113a-3028-45f4-bba2-674d1905b62d","summary":"On Thursday, Hilary Duff and Matthew Koma revealed that they were engaged. " He asked me to be his wife," Duff captioned two images on Instagram , one of which showed her sparkling ring. The couple has been linked since 2017 and welcomed a baby girl named Banks Violet Bair in October 2018 . Visit INSIDER's homepage for more stories. Hilary Duff and boyfriend Matthew Koma are engaged. The 31-year-old "Younger" star revealed the milestone with an Instagram post that was shared on Thursday. " He asked me to be his wife," Duff captioned two images. One photo showed her, Koma (whose real name is Matthew Bair), and the engagement ring while the second one showed the couple kissing. He asked me to be his wife♥️ A post shared by Hilary Duff (@hilaryduff) on May 9, 2019 at 7:20am PDT on May 9, 2019 at 7:20am PDT Koma, a musician who's part of a band called Winnetka Bowling League, shared the same images on his Instagram and wrote: " I asked my best friend to marry me." Read more : Hilary Duff says she struggles to ignore mom-shaming on Instagram: 'Bullying seems to be at an all-time high right now' It's unclear when Duff and Koma started dating, but they have been linked since early 2017.","related":"TSLA","image":"https://cloud.iexapis.com/v1/news/image/1ac8113a-3028-45f4-bba2-674d1905b62d","lang":"en","hasPaywall":false},{"datetime":1557416038000,"headline":"Intel's self-driving leader on why Uber, Lyft and Tesla are in a robo-taxi race for their lives","source":"CNBC","url":"https://cloud.iexapis.com/v1/news/article/20f34de1-d57c-46ef-bd5b-18ec018aaa53","summary":"A leading autonomous-driving executive from Mobileye, a company Intel bought for $15 billion, explains why Lyft, Uber, Tesla and any other company seeking a robo-taxi future is in a race to bring down costs by as much as 50%. And then there's the safety issues of teaching AI to drive without a threat to life.","related":"TSLA","image":"https://cloud.iexapis.com/v1/news/image/20f34de1-d57c-46ef-bd5b-18ec018aaa53","lang":"en","hasPaywall":false}]


I'm not exactly sure how to modify your excellent formula.

Any assistance would be greatly appreciated.

Thank you.
 
Last edited:
Upvote 0
Re: How to Extract Multiple Sections of Text in a String, Between Two Repeating Words

Hi jtakw,

You've been extremely helpful, and I thank you for your time.

Unfortunately, I'm stuck on this last part of the puzzle.

I don't know how to modify your formula to work with the Excel string listed above (that has commas).

I have tried many versions -- but none work.

I would be very grateful if you could point me in the right direction.

Thanks
 
Upvote 0

Forum statistics

Threads
1,223,908
Messages
6,175,306
Members
452,633
Latest member
DougMo

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top