Rick Strahl's Weblog  

String Extraction



I thought I’d go back to basics today for a post. An operation I rely on a lot when I’m working with text is the ability to extract strings with delimiters from within another string. RegEx can be quite useful for many scenarios of finding matches and extracting text, but personally, for the 10 times a year when I need to use RegEx expressions, I tend to relearn most of the cryptic syntax each time. What should take me a couple of minutes usually turns into half an hour of experimenting with RegEx Buddy.

So a long time ago I created some helper functions that make the common task of string extraction easier. One of them is ExtractString(), which I use quite frequently to, well, extract a string from within another string. It works on small strings and larger ones alike. The function accepts a source string and a pair of delimiters between which the text gets extracted.

Usage looks something like this:

// MS json Date format is "\/Date(9221211)\/" 
jsonDate = StringUtils.ExtractString(jsonDate, @"\/Date(", @")\/");

which extracts only the milliseconds portion of the string inside of the date delimiters. Now here’s actually a good example of why this function is easier than using RegEx – try constructing that particular pattern with RegEx escape codes for some real fun and undecipherable gobbledygook. :-} Been there, done that (for the JavaScript portion of the parser, and it took a while to get right!).

ExtractString() also has a few additional parameters for case sensitivity and for how to behave if the end delimiter is not found, which is useful. For example, in the following snippet a query string is passed and I need to find the value of the r parameter:

string res = StringUtils.ExtractString(Url,"?r=","&",false,true);

And no, before you ask: I couldn’t use Request.QueryString() because the string in this case is a parsed URL that doesn’t come from a query string input. The function call searches for ?r= and & as delimiters, does so case insensitively (false) and specifies that the “&” delimiter can be missing, in which case the result runs to the end of the string. In fact, to create a generic query string parser you could now write:
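To make the behavior of those last two parameters concrete, here is a minimal sketch of the index arithmetic involved, written with plain IndexOf and Substring calls. The class name, method name and sample URL are made up for illustration and are not part of StringUtils.

```csharp
using System;

public static class ExtractDemo
{
    // Hypothetical helper: mimics a case-insensitive extraction of "?r=" ... "&"
    // where the closing "&" is allowed to be missing.
    public static string ExtractR(string url)
    {
        string beginDelim = "?r=";
        string endDelim = "&";

        // Case-insensitive search for the start delimiter.
        int at1 = url.IndexOf(beginDelim, StringComparison.OrdinalIgnoreCase);
        if (at1 == -1)
            return string.Empty;

        // Only look for the end delimiter after the start delimiter.
        int start = at1 + beginDelim.Length;
        int at2 = url.IndexOf(endDelim, start, StringComparison.OrdinalIgnoreCase);

        // No "&" found: return everything up to the end of the string.
        if (at2 == -1)
            return url.Substring(start);

        return url.Substring(start, at2 - start);
    }

    public static void Main()
    {
        // Upper-case "R" and no trailing "&" in this made-up URL.
        Console.WriteLine(ExtractR("http://example.com/login.aspx?R=/admin/default.aspx"));
    }
}
```

With the sample URL the start delimiter is found despite the upper-case R, and because no & follows, the remainder of the string (/admin/default.aspx) comes back.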

public static string GetUrlEncodedKey(string urlEncodedString, string key)
{
    string res = StringUtils.ExtractString(urlEncodedString, key + "=", "&", false, true);
    return HttpUtility.UrlDecode(res);
}

Again, RegEx can solve this fairly easily as well, but or’d expressions are a pain to get right, and since I use this functionality all the time I don’t want to deal with RegEx’s constant relearning curve (for me).

I certainly prefer:

string res = StringUtils.ExtractString(Url,"?r=","&",false,true);

to an equivalent RegEx expression with its escaped delimiters and or’d alternatives.
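For comparison, a RegEx version of the same ?r= lookup might look like the sketch below. The pattern, class name and method name are my own illustration and are not taken from the original code; treat it as one possible equivalent rather than the definitive one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexQueryDemo
{
    // Hypothetical RegEx counterpart of
    // StringUtils.ExtractString(url, "?r=", "&", false, true).
    public static string GetR(string url)
    {
        // \?r=     literal "?r=" (the "?" has to be escaped)
        // ([^&]*)  capture everything up to the next "&" or the end of the string
        Match match = Regex.Match(url, @"\?r=([^&]*)", RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

It is not much longer, but the reader has to know that ? needs escaping and what a negated character class does before the intent becomes clear.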

I’m not saying that RegEx hasn’t got a place. Obviously for more complex searches RegEx is much more flexible. But for simple repetitive scenarios RegEx is overkill and likely more resource intensive than a simple function that finds a couple of string indexes and extracts a substring. And to me at least, RegEx usage seriously affects code readability, so if I have a chance to abstract behavior in such a way that I don’t have to use RegEx I will do so vehemently – either by wrapping the RegEx code or creating an alternative.

Anyway, here’s the implementation of ExtractString with various overloads:

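/// <summary>
/// Extracts a string from between a pair of delimiters. Only the first
/// instance is found.
/// </summary>
/// <param name="source">Input String to work on</param>
/// <param name="beginDelim">Beginning delimiter</param>
/// <param name="endDelim">Ending delimiter</param>
/// <param name="caseSensitive">Determines whether the search for delimiters is case sensitive</param>
/// <param name="allowMissingEndDelimiter">If true and the end delimiter is not found, returns everything after the beginning delimiter</param>
/// <returns>Extracted string or ""</returns>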
public static string ExtractString(string source, string beginDelim, 
                                   string endDelim, bool caseSensitive, 
                                   bool allowMissingEndDelimiter)
{           
    int at1, at2;

    if (string.IsNullOrEmpty(source))
        return string.Empty;

    if (caseSensitive)
    {
        at1 = source.IndexOf(beginDelim);
        if (at1 == -1)
            return string.Empty;

        at2 = source.IndexOf(endDelim, at1 + beginDelim.Length);
    }
    else
    {
        //string Lower = source.ToLower();
        at1 = source.IndexOf(beginDelim, 0, source.Length, StringComparison.OrdinalIgnoreCase);
        if (at1 == -1)
            return string.Empty;

        at2 = source.IndexOf(endDelim, at1 + beginDelim.Length, StringComparison.OrdinalIgnoreCase);
    }

    if (allowMissingEndDelimiter && at2 == -1)
        return source.Substring(at1 + beginDelim.Length);

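    // at2 is a valid index whenever it is not -1 (the original check of > 1 rejected matches at index 0 or 1)
    if (at1 > -1 && at2 > -1)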
        return source.Substring(at1 + beginDelim.Length, at2 - at1 - beginDelim.Length);

    return string.Empty;
}

public static string ExtractString(string source, string beginDelim, string endDelim, bool caseSensitive)
{
    return ExtractString(source, beginDelim, endDelim, caseSensitive, false);
}

public static string ExtractString(string source, string beginDelim, string endDelim)
{
    return ExtractString(source, beginDelim, endDelim, false, false);
}

It’s pretty basic stuff, but it’s one of those handy utility functions that are useful so frequently it’s nice to have them in my Utility set of classes (StringUtils to be exact).

So I’m curious to hear what the RegEx aficionados will have to say. I wonder if RegEx could be used inside of the function – the big issue I always have with RegEx in reusable/generic scenarios is that patterns that contain the start and end delimiters have to be escaped properly. If I have a backslash in the starting delimiter, a generic expression will die the painful RegEx parsing death. So (he asks mockingly): could the generic internals of ExtractString be rewritten to use RegEx instead of IndexOf and Substring to retrieve the string match? Frankly I don’t know how (and it’s not really necessary, but it’s an interesting thought).
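One possible answer to that question, offered as a sketch rather than a drop-in replacement: .NET’s Regex.Escape() converts arbitrary delimiter text into a literal pattern, which takes care of backslashes and parentheses. The class name RegexExtract and the method name ExtractStringRegex below are hypothetical, and RegEx case-insensitive matching is not guaranteed to behave identically to StringComparison.OrdinalIgnoreCase for every character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexExtract
{
    // Hypothetical RegEx-based variant of ExtractString; name and shape are illustrative.
    public static string ExtractStringRegex(string source, string beginDelim,
                                            string endDelim, bool caseSensitive,
                                            bool allowMissingEndDelimiter)
    {
        if (string.IsNullOrEmpty(source))
            return string.Empty;

        // Regex.Escape makes delimiters such as @"\/Date(" safe to embed in a pattern.
        string begin = Regex.Escape(beginDelim);
        string end = Regex.Escape(endDelim);

        // Lazy match up to the first end delimiter; \z (end of input) is
        // accepted as an alternative when the end delimiter may be missing.
        string pattern = allowMissingEndDelimiter
            ? begin + "(.*?)(?:" + end + "|\\z)"
            : begin + "(.*?)" + end;

        // Singleline lets "." match newlines so multi-line sources work too.
        RegexOptions options = RegexOptions.Singleline | RegexOptions.CultureInvariant;
        if (!caseSensitive)
            options |= RegexOptions.IgnoreCase;

        Match match = Regex.Match(source, pattern, options);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Called as ExtractStringRegex(jsonDate, @"\/Date(", @")\/", true, false), this handles the JSON date example from earlier without any hand-written escaping. Whether it is worth building a pattern and running the RegEx engine for what two IndexOf calls already do is a separate question.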

BTW, for some of the FoxPro folks lurking, this function might look familiar. It closely matches West Wind Web Connection’s Extract method as well as VFP’s native StrExtract function, on which I had relied for many years. I think the above function was one of the first things I ever created in .NET because its use was sorely missed when starting out in .NET.

Hopefully this method will be useful to some of you.

Posted in CSharp  .NET  

The Voices of Reason


 

Andrei
December 19, 2008

# re: String Extraction

:) I'm also using a function that looks almost exactly the same

Michael
December 19, 2008

# re: String Extraction

I absolutely cringe anytime I have to create or modify a regex string. Yes, they are very powerful, but at a high cost to your time (well, for most people anyway). With very complex ones, it is about like trying to read binary. :)

Jon
December 19, 2008

# re: String Extraction

I thought it was just me that struggled with creating Regex patterns.

I've used all sorts of languages on different platforms over the years, but trying to build my own patterns always takes ages.

Happy to use regex for checking formats of email address etc. when I know I can use tried and tested patterns.

antonio
December 19, 2008

# re: String Extraction

Thanks, this is great and easily extensible. I made it an extension method for String by adding the this keyword to the signature.

public static string ExtractString(this string source, string beginDelimiter, string endDelimiter)

Rick Strahl
December 19, 2008

# re: String Extraction

@Antonio - yes I was thinking about extension methods recently for all my string Utils, but frankly I prefer to keep things separate rather than piling onto the string class. Apparently, I'm pretty reluctant to use extension methods in general - except when it really makes life easier. I find it useful to have all custom utilities in a separate class (StringUtils in this case).

andy
December 19, 2008

# re: String Extraction

I saw a good use of Regex in a MS SDL web cast to avoid Cross-Site Scripting (XSS):
http://www.microsoft.com/events/series/detail/webcastdetails.aspx?seriesid=15&webcastid=5103

As always, thanks for the code! I'm a recovering FoxPro Lurker and we've been "leveraging" your generous public posts for years now. You 'da man, Rick!

Happy Christmas.

Yoann. B
December 20, 2008

# re: String Extraction

Great Article Thanks.

Another Useful class, SubstringInfo

http://blog.sb2.fr/post/2008/12/01/SubstringInfo-The-Simple-Way-To-SubString.aspx

stuartd
December 20, 2008

# re: String Extraction

Another thing to consider is that RegEx creation is expensive in terms of object instantiation - http://www.acorns.com.au/blog/?p=136 - as well as in terms of time taken to create the regular expression itself..

Zubair Ahmed
December 20, 2008

# re: String Extraction

hey Rick,

Very useful method, I read your response to Antonio regarding the use of extension methods :) Anyway you might want to check out my extension method's library with few methods that I regularly use http://zubairdotnet.blogspot.com/2008/11/c-extension-methods-library.html and let me know what do you think of them.

Luke Breuer
December 22, 2008

# re: String Extraction

Does RegEx Buddy do realtime matching? In other words, does it update matches as you type either the regex, or the "test text"?

Dave Diehl
December 22, 2008

# re: String Extraction

Luke -- Yes, it's done in real-time

To all -- One quote about Regex I remember from way back -- "When you try to solve a problem using Regex, you actually have two problems." :)

J
December 22, 2008

# re: String Extraction

this looks way more kiddie than regex. as a everyday regex user (my whole job consists of parsing/extracting text and pattern matching), i'd unfortunately never use this. regex looks daunting but really you can learn it in about six hours, and it has extreme benefits when used correctly.

as a regex guy i'd never use this. not trying to knock your class, but to me this seems much less efficient with very little more "ease of use" than regex if one spends like five minutes with it.

for example, here's your string:
// json Date format is "\/Date(9221211)\/"


here's my regex:

@"\\/Date\((\d+)\)\\."

You could even have named capture like this:

"\\/Date\((?<yournumber>\d+)\)\\."

Then you can refer to it as match.Groups["yournumber"].Value

anyhoo, thats my thoughts.

Rick Strahl
December 23, 2008

# re: String Extraction

@Dave - well, I think you're making my point for me <s>... Respect your opinion on this, but sorry, that code hardly fits the easy to read and maintain mantra even if you are comfortable with RegEx. And it may be readable to you, but probably not to many others, at least not just by looking at it briefly.

Don't get me wrong - RegEx is great when used appropriately especially for more complex parsing where it's difficult to create parsing code. That's where RegEx shines. But for simpler tasks - a little bit of plain CLR code is not much more voluminous and much more readable and likely to perform better than RegEx.

Steve from Pleasant Hill
December 23, 2008

# re: String Extraction

I'm one of those people who loses interest or gets a headache whenever I try to learn regex. More than two \\ or \/ and forget about it. I will use it for situations that are "well known", i.e., email, SSN, all alpha numeric, etc., with a big fat comment as to what it is doing.

I don't think I'd put it in a loop.

No doubt it could be used for some string chopping, but I never feel inclined to stop writing code long enough to learn, debug, test the regex needed and end up just doing it the brute force way.

Michael Freidgeim
January 02, 2009

# re: String Extraction

I have a similar function (MidBetween) in my StringHelper class (http://geekswithblogs.net/mnf/articles/84942.aspx)

Matt Denman
January 28, 2009

# re: String Extraction

I love regex, so I'm a bit disappointed to see so many programmers struggling with it. I remember a good mentor of mine, Marshall Cline, teaching me the most basic approach of taking regex creation one character at a time, from left to right. There really are only a handful of escape sequences and special characters to think about. Regex performance is greatly dependent on how the regex was written, but overall its a fast solution.

I think skilled C# developers should master regex and use it as a core tool in their work. A great book for this is "Mastering Regular Expressions" from O'Reilly press. Its a great book for deeply understanding such a gem of a tool for software developers.

Probably making Rick's point more than I want to, here is the last regex I wrote just this week:
</?(?:art:|ecom:)(StandAloneArticleListDisplay|StandAloneProductListDisplay)(?:(?:\s+(\w+)(?:\s*=\s*(?:""(.*?)""|'(.*?)'|[^'"">\s]+))?)+\s*|\s*)/?>

using that with:
MatchCollection mc = s_regex.Matches(pageHtml);

allows me to find specific server side controls and all of their attributes that are very easy to iterate through with the MatchCollection.

Rick Strahl
January 28, 2009

# re: String Extraction

@Matt - don't get me wrong, I'm not opposed to using RegEx when the time is right. But there are two things working against RegEx for me, at least in simple scenarios: the time taken to figure out a RegEx vs. coding a few lines of logic, and the time it takes to understand what a RegEx expression does when I have to maintain the code later on.

As I mentioned before, the biggest problem for me with RegEx is that I just don't use them enough to remember the 'syntax' if you can call it that. Occasionally I do get into areas where I use them heavily, but I just don't have the wiring to remember arcane switches long term - every time I pick it back up I have to relearn the switches and dig out my Visibone cheat sheet. I do remember the concepts (grouping, match sets, forward matching etc.) so it's not as bad as it used to be, but it's still often more work than writing a small and testable piece of code.

There are definitely other situations where regEx is so much easier and more efficient than hand coding and in those cases it would really be silly to not use them. Right tool for the job I suppose.

Oh and the O'Reilly RegEx book you mention is great - I've been through it and use it for reference frequently. That and the Visibone RegEx guide and RegEx buddy are my crutches. :-}

Christoph Dreßler
November 06, 2012

# re: String Extraction

Great! I love VFP. I have converted some other powerful (string-)functions to C# over the years...

Stephen Wileman
October 15, 2021

# re: String Extraction

I was looking for the equivalent of the VFP STREXTRACT() function and stumbled across this blog so thank you very much for saving me some time!

A great enhancement would be to add a parameter to look for the n'th occurrence of a match.
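A rough sketch of what such an occurrence parameter could look like follows. The class name StringUtilsSketch and this overload are hypothetical and not part of the posted StringUtils code. The sketch assumes a 1-based occurrence count, case-insensitive matching, and that occurrences of the start delimiter are counted without overlap; it makes no claim to match STREXTRACT() in every detail.

```csharp
using System;

public static class StringUtilsSketch
{
    // Hypothetical overload: returns the text between the n-th (1-based)
    // occurrence of beginDelim and the next endDelim that follows it.
    public static string ExtractString(string source, string beginDelim,
                                       string endDelim, int occurrence)
    {
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(beginDelim) ||
            string.IsNullOrEmpty(endDelim) || occurrence < 1)
            return string.Empty;

        int searchFrom = 0;

        // Walk forward to the requested occurrence of the start delimiter.
        for (int i = 0; i < occurrence; i++)
        {
            int at1 = source.IndexOf(beginDelim, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (at1 == -1)
                return string.Empty;

            // Continue searching right after this start delimiter.
            searchFrom = at1 + beginDelim.Length;
        }

        // searchFrom now points at the first character after the n-th start delimiter.
        int at2 = source.IndexOf(endDelim, searchFrom, StringComparison.OrdinalIgnoreCase);
        if (at2 == -1)
            return string.Empty;

        return source.Substring(searchFrom, at2 - searchFrom);
    }
}
```

For a source of "<b>one</b> <b>two</b>" with delimiters "<b>" and "</b>", an occurrence of 1 yields one, 2 yields two, and 3 yields an empty string.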


West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2024