Rick Strahl's Weblog  


Html and Uri String Encoding without System.Web



I’ve been revisiting and refactoring some of my old utility libraries recently. One of my classes is WebUtils, which is - duh - Web specific and really didn’t need to be in this general utility library, so I moved it into my Web support library, which contains the bulk of the Web specific behavior like custom controls and general ASP.NET helpers. The Westwind.Web project naturally contains a reference to System.Web because it’s a Web project.

After removing the Web functionality from the Westwind.Utilities project I thought I could safely ditch the System.Web reference from the project, but alas I hit a snag: there was still a dependency on the HttpUtility class to provide UrlEncoding/Decoding and HtmlEncoding. The Utilities project includes communication classes that use Http access and interact with Web servers, so UrlEncoding at the very least is still a requirement, and HtmlEncoding and UrlDecoding come up in a few places as well.

So the problem is that the only comprehensive set of UrlEncoding/Decoding and HtmlEncoding/Decoding features available in the .NET framework lives in System.Web. That is a bad design choice: these are general features that probably should live in System.Net. It turns out System.Uri contains Url encoding/decoding functionality, but HtmlEncoding is not to be found outside of System.Web.

Now I could add a System.Web dependency to my Utility library, but that just doesn’t sit well with me. It forces System.Web into the loaded assembly list of any application consuming the library. Normally I’m not a stickler about including an assembly here or there, but System.Web is quite a honker, and just loading it into your app adds a good 2.5 megs to the memory footprint. For just the privilege of Url and Html Encoding/Decoding that’s quite a bit of overhead. It also slows down load time, since the assembly has to be read at startup.

Long story short – I’d like to avoid including System.Web into a non-Web application – it just doesn’t feel right.

Url Encoding and Decoding

It turns out that when .NET 2.0 rolled around, System.Uri did get some Url Encoding and Decoding functions. Unfortunately these functions don’t provide the full range of functionality that the HttpUtility functions provide. The System.Uri class contains a few static helper methods to escape data. The following code (in LINQPad) demonstrates the functionality:

string test = "This is a value & I don't care for it.\t\"quoted\" 'single quoted',<% alligator %>#";
string encoded = System.Web.HttpUtility.UrlEncode(test);
encoded.Dump();
System.Uri.EscapeDataString(test).Dump();
System.Uri.EscapeUriString(test).Dump();
Westwind.Utilities.StringUtils.UrlEncode(test).Dump();

System.Uri.UnescapeDataString(encoded).Dump();
System.Web.HttpUtility.UrlDecode(encoded).Dump();
Westwind.Utilities.StringUtils.UrlDecode(encoded).Dump();

The result of this diverse bunch is:

This+is+a+value+%26+I+don't+care+for+it.%09%22quoted%22+'single+quoted'%2c%3c%25+alligator+%25%3e%23
This%20is%20a%20value%20%26%20I%20don't%20care%20for%20it.%09%22quoted%22%20'single%20quoted'%2C%3C%25%20alligator%20%25%3E%23
This%20is%20a%20value%20&%20I%20don't%20care%20for%20it.%09%22quoted%22%20'single%20quoted',%3C%25%20alligator%20%25%3E#
This+is+a+value+%26+I+don%27t+care+for+it%2E%09%22quoted%22+%27single+quoted%27%2C%3C%25+alligator+%25%3E%23


This+is+a+value+&+I+don't+care+for+it.    "quoted"+'single+quoted',<%+alligator+%>#
This is a value & I don't care for it.    "quoted" 'single quoted',<% alligator %>#
This is a value & I don't care for it.    "quoted" 'single quoted',<% alligator %>#

Oh my what a fucking mess. Every single version (including my own) generates something different. All of them are actually valid, but the output generated from HttpUtility.UrlEncode is NOT parsed properly by the System.Uri methods. Ouch!

Notice that HttpUtility.UrlEncode and the System.Uri equivalents produce different output formats. Worse, they are not compatible with each other: System.Uri.UnescapeDataString() cannot properly decode output created with System.Web.HttpUtility.UrlEncode() if it contains the + sign for spaces. That’s a bummer since the + sign syntax is certainly legal and HttpUtility itself outputs spaces in that format. Only HttpUtility.UrlDecode() seems to work with both + and %20 and does the right thing in all cases; the System.Uri equivalents fail to restore the string.

If we start UrlEncoding with EscapeDataString():

string encoded = System.Uri.EscapeDataString(test);

then all of the decoders work in returning the same result at least.
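To make the difference concrete, here is a minimal sketch that uses only System.Uri, so it needs no System.Web reference. The class name EscapeRoundTripDemo and the sample strings are made up for illustration and are not part of the library discussed here.

```csharp
using System;

public static class EscapeRoundTripDemo
{
    // Encode with System.Uri and decode again; the input should come back unchanged.
    public static string RoundTrip(string value)
    {
        string encoded = Uri.EscapeDataString(value);
        return Uri.UnescapeDataString(encoded);
    }

    // Decode with System.Uri only; '+' is NOT treated as a space here.
    public static string DecodeWithUri(string encoded)
    {
        return Uri.UnescapeDataString(encoded);
    }

    public static void Main()
    {
        string test = "a value & more";

        Console.WriteLine(Uri.EscapeDataString(test));   // a%20value%20%26%20more
        Console.WriteLine(RoundTrip(test) == test);      // True

        // HttpUtility-style input: %26 is decoded, but the plus signs stay put
        Console.WriteLine(DecodeWithUri("a+value+%26+more"));   // a+value+&+more
    }
}
```

As long as both ends use the %20 style, the round trip is lossless; the trouble only starts when + encoded input arrives from somewhere else.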

Sooo… to avoid the System.Web reference and get around the confusion I needed to replace the calls to HttpUtility with code that has no System.Web dependencies. Luckily, a long time ago when I first used these functions I had the good sense to create wrapper utility functions for the HttpUtility calls, because it bugged me even in my early .NET days to have to include a reference to System.Web in a non-Web app. So the following function signatures were already part of my StringUtils class. These are the UrlEncoding and Decoding related static methods of that class:

/// <summary>
/// UrlEncodes a string without the requirement for System.Web
/// </summary>
/// <param name="text">String to encode.</param>
/// <returns>encoded string</returns>
// [Obsolete("Use System.Uri.EscapeDataString instead")]
public static string UrlEncode(string text)
{
    // System.Uri provides reliable parsing
    return System.Uri.EscapeDataString(text);
}

/// <summary>
/// UrlDecodes a string without requiring System.Web
/// </summary>
/// <param name="text">String to decode.</param>
/// <returns>decoded string</returns>
public static string UrlDecode(string text)
{
    // pre-process for + sign space formatting since System.Uri doesn't handle it
    // plus literals are encoded as %2b normally so this should be safe
    text = text.Replace("+", " ");
    return System.Uri.UnescapeDataString(text);
}

/// <summary>
/// Retrieves a value by key from a UrlEncoded string.
/// </summary>
/// <param name="urlEncoded">UrlEncoded String</param>
/// <param name="key">Key to retrieve value for</param>
/// <returns>returns the value or "" if the key is not found or the value is blank</returns>
public static string GetUrlEncodedKey(string urlEncoded, string key)
{
    urlEncoded = "&" + urlEncoded + "&";

    int Index = urlEncoded.IndexOf("&" + key + "=", StringComparison.OrdinalIgnoreCase);
    if (Index < 0)
        return "";

    int lnStart = Index + 2 + key.Length;

    int Index2 = urlEncoded.IndexOf("&", lnStart);
    if (Index2 < 0)
        return "";

    return UrlDecode(urlEncoded.Substring(lnStart, Index2 - lnStart));
}

The UrlEncode method is just a passthrough to System.Uri.EscapeDataString() because it actually does the right thing. Initially I was going to Obsolete this method, but I decided against it – for consistency with the other wrappers it makes sense to use a common API to make the calls.

Decoding pre-processes the input string by converting + signs into spaces, since UnescapeDataString() doesn’t handle them. Strictly speaking this produces an invalid UrlEncoded string, but UnescapeDataString() leaves the embedded spaces alone, so it works as expected and the plus signs end up as spaces.
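A quick sketch of that pre-processing step, assuming the same Replace-then-unescape approach as the UrlDecode wrapper above. The PlusDecodeDemo class name and the inputs are invented for the example.

```csharp
using System;

public static class PlusDecodeDemo
{
    // Turn '+' into a space first, then let System.Uri resolve the %XX escapes.
    public static string UrlDecode(string text)
    {
        return Uri.UnescapeDataString(text.Replace("+", " "));
    }

    public static void Main()
    {
        // '+' used for spaces, literal plus encoded as %2B
        Console.WriteLine(UrlDecode("1+%2B+1"));       // 1 + 1

        // %20 used for spaces, same result
        Console.WriteLine(UrlDecode("1%20%2B%201"));   // 1 + 1
    }
}
```

The literal plus survives because the replacement runs before %2B is unescaped. The flip side is that a raw, unencoded + in the input will also turn into a space, which is why this is only safe for input produced by a proper encoder.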

The final method is GetUrlEncodedKey which is basically a quick and dirty query string parser to return a single query string value. This is quite useful if a client app needs to look at intercepted URLs – for example in Web Browser Navigate events. Passing values as UrlEncoded strings can also be useful in interapplication communication for small chunks of message data in some situations.
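Here is roughly how that looks in use. The sketch repeats the two methods from above inside a throwaway QueryKeyDemo class so it compiles on its own; the class name, the local variable names and the sample query string are made up for illustration.

```csharp
using System;

public static class QueryKeyDemo
{
    // Same decoding approach as the UrlDecode wrapper shown earlier.
    public static string UrlDecode(string text)
    {
        return Uri.UnescapeDataString(text.Replace("+", " "));
    }

    // Same lookup logic as GetUrlEncodedKey shown earlier.
    public static string GetUrlEncodedKey(string urlEncoded, string key)
    {
        // Wrap in '&' so the first and last pairs are delimited like the others
        urlEncoded = "&" + urlEncoded + "&";

        int index = urlEncoded.IndexOf("&" + key + "=", StringComparison.OrdinalIgnoreCase);
        if (index < 0)
            return "";

        // Skip past the leading '&', the key and the '='
        int start = index + 2 + key.Length;

        int end = urlEncoded.IndexOf("&", start);
        if (end < 0)
            return "";

        return UrlDecode(urlEncoded.Substring(start, end - start));
    }

    public static void Main()
    {
        string query = "name=Rick%20Strahl&company=West+Wind&empty=";

        Console.WriteLine(GetUrlEncodedKey(query, "company"));  // West Wind
        Console.WriteLine(GetUrlEncodedKey(query, "NAME"));     // Rick Strahl (key match ignores case)
        Console.WriteLine(GetUrlEncodedKey(query, "missing") == "");  // True
    }
}
```

Note that the key is matched against the raw encoded text, so a key that itself needs encoding would have to be passed in encoded form.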

Html Encoding

Html Encoding and Decoding has no equivalent to the HttpUtility functions outside of System.Web, so this is left up to the developer. HtmlEncoding is something I need quite frequently in client applications that use HTML content (I do this a lot, using the Web Browser control to display certain content). HtmlDecoding, on the other hand, I have never really had a need for. Decoding is also quite a bit more complex than encoding, so I never bothered with it. Encoding is more common and also straightforward to implement:

/// <summary>
/// HTML-encodes a string and returns the encoded string.
/// </summary>
/// <param name="text">The text string to encode. </param>
/// <returns>The HTML-encoded text.</returns>
public static string HtmlEncode(string text)
{
    if (text == null)
        return null;

    StringBuilder sb = new StringBuilder(text.Length);

    int len = text.Length;
    for (int i = 0; i < len; i++)
    {
        switch (text[i])
        {

            case '<':
                sb.Append("&lt;");
                break;
            case '>':
                sb.Append("&gt;");
                break;
            case '"':
                sb.Append("&quot;");
                break;
            case '&':
                sb.Append("&amp;");
                break;
            default:
                if (text[i] > 159)
                {
                    // decimal numeric entity
                    sb.Append("&#");
                    sb.Append(((int)text[i]).ToString(CultureInfo.InvariantCulture));
                    sb.Append(";");
                }
                else
                    sb.Append(text[i]);
                break;
        }
    }
    return sb.ToString();
}

This code is fairly simplistic in that it encodes only the angle brackets, double quotes and ampersands as Html entities. All characters over 159 are encoded as numeric entities, which also happens to cover most of the characters that have Html entity names. This doesn’t produce the most readable HTML if you have lots of upper ASCII or upper Unicode characters, but it is valid Html to display and browsers do the right thing.
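For a feel of the output, here is a small self-contained sketch. It restates the routine above in compact form inside a hypothetical HtmlEncodeDemo class (the name and the sample string are just for illustration) and also shows the two namespaces the code relies on.

```csharp
using System;
using System.Globalization;   // CultureInfo
using System.Text;            // StringBuilder

public static class HtmlEncodeDemo
{
    // Compact restatement of the HtmlEncode routine above, for illustration only.
    public static string HtmlEncode(string text)
    {
        if (text == null)
            return null;

        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (c)
            {
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                case '&': sb.Append("&amp;"); break;
                default:
                    if (c > 159)
                    {
                        // decimal numeric entity
                        sb.Append("&#");
                        sb.Append(((int)c).ToString(CultureInfo.InvariantCulture));
                        sb.Append(';');
                    }
                    else
                        sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // \u00e9 is an accented e (code point 233), so it becomes a numeric entity
        Console.WriteLine(HtmlEncode("<b>Caf\u00e9 & \"bar\"</b>"));
        // &lt;b&gt;Caf&#233; &amp; &quot;bar&quot;&lt;/b&gt;
    }
}
```

Single quotes pass through untouched, so this is meant for element content and double-quoted attribute values rather than single-quoted attributes.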

For client applications that likely only rarely use HtmlEncoding this is sufficient. Incidentally, I took a quick look with Reflector at what the .NET runtime is doing in HttpUtility and it’s quite messy, using native code and labels. Yuk. But it seems to be encoding the same set of characters, except that it only uses numeric entities for the range 160 to 255 and then goes back to outputting plain text. Not sure why that is, but there are lots of entities in the range above char 255, so the code above seems a little safer. <shrug>

Anyway – this is what I use in my Utilities class now in order to avoid the System.Web dependency. It seems well worth these couple of short implementations to avoid the dependency and memory hit.

Posted in .NET  

The Voices of Reason


 

Jaime
February 05, 2009

# re: Html and Uri String Encoding without System.Web

Hi, as you correctly point out in the article, the "+" chars in URLs are very problematic, for example when you encrypt a piece of text and generate URLs from it, because it becomes a bit tricky to decrypt it if it has "+" chars in the querystring.

Javier
February 05, 2009

# re: Html and Uri String Encoding without System.Web

I had a similar problem trying to use the HttpUtility encoding and decoding methods from a SQL CLR assembly. The System.Web is unsafe and can't be used.

I ended up, like you, implementing my own methods.

Thanks for sharing your solution. I may go back to my library and do the same.

PiersH
February 05, 2009

# re: Html and Uri String Encoding without System.Web

Any URL encoding that changes ' ' to '+' is broken (specifically javascript's 'escape' function). you should switch to encodeURIComponent and Uri.EscapeDataString to ensure correct roundtripping behavior.

Rick Strahl
February 05, 2009

# re: Html and Uri String Encoding without System.Web

It really doesn't matter if it's broken or not - it's that real life applications using existing and well-known tools are creating this output.

Frankly I don't see how this creates a problem at least if an encoder is used to encode query strings. If you create query strings with an encoder that uses plus signs for spaces it would follow that that encoder also encodes + as %2b. It's only a problem if a truly manual approach is used and somebody hand encodes the query string which is likely to cause other problems anyway.


Kevin Pirkl
February 05, 2009

# re: Html and Uri String Encoding without System.Web

Dang, I can't find my original reference to this code, but here is something very similar that I use.

http://www.conceptdevelopment.net/Localization/HtmlEncode/code3.html

John Walker
February 06, 2009

# re: Html and Uri String Encoding without System.Web

Rick,

I agree that there's a "dirty" feeling to including a reference to System.Web in a project that isn't a web application, but I have to ask what is the downside to including it if it saves you the trouble from dealing with this the way you have?

I've had the same feeling with projects done in the past, but in the opposite direction. I've had web projects where I extract data from a database that's encoded in RichText (rtf) format. I've leveraged a Winforms RichTextBox in a thread-safe fashion to convert it to plain text. It's always felt wrong to me, but it "just works"(TM) and I'm loath to find another solution, although I've surely looked.

Anyway, this is a very interesting question to me. I remember the same feeling using System.Web.Mail in a Winforms project, but it was eventually moved into System.Net and somehow I feel better using it now anywhere.

I'm interested in your thoughts as to the downsides of using web parts of the framework in places it wasn't intended for. As always great post.

Rick Strahl
February 06, 2009

# re: Html and Uri String Encoding without System.Web

@John - this is a loaded question really. You just say - screw it - I'll just use what works and be done with it. But in a winform app System.Web adds a good 2.5 megs of overhead plus the load time it takes to load it up. It's not a trivial amount of memory and load time, so it's up to you to decide if that's worth it to you.

I personally try to keep my WinForms or WPF as lean as possible so removing a reference that requires 2.5 megs seems reasonable to me by writing a little bit of code. These types of apps have enough of a load time and memory requirements already without requiring System.Web to be added too.

If you're using other stuff in System.Web this might not be an issue, but if it's only for those few functions then the overhead seems a bit extreme to me and definitely NOT worth including System.Web in my application.

You have to judge how this fits with your environment, but for me the choice is pretty clear that it's better to reinvent the wheel for these few trivial functions.

Siderite
February 06, 2009

# re: Html and Uri String Encoding without System.Web

I don't really care about the footprint of System.Web, but I did find that the HtmlEncode method in System.Web.HttpUtility was missing some of the functionality that I needed. So I created my own class that used these flags: HtmlEncode (for the default behaviour), Quotes (because HtmlEncode does not encode all quotes by default, particularly the single quote), NonASCII (again, sometimes you need to encode the characters over 127, not only those over 159), then PrefixSpaces, SuffixSpaces and InnerSpaces (because you need to preserve the spaces as well in some situations where the HTML interpreter would either ignore or trim them).

Matt
February 06, 2009

# re: Html and Uri String Encoding without System.Web

Hey, Great post. I'm curious how you determined the 2.5 meg overhead number and how you figured out it loads it at startup. I'm trying to get all my projects to be as efficient as possible but I'm not sure what the best way to determine these things.

Thanks!

Rick Strahl
February 06, 2009

# re: Html and Uri String Encoding without System.Web

@Matt - I created a Console project with a Console.ReadLine to break and wait and removed all refs except System. Ran a few times and checked Process Manager for memory usage. Added reference to System.Web and some code that calls into it and repeated the process. Memory usage jumps by nearly 2.5 megs in that process.

Peter
February 07, 2009

# re: Html and Uri String Encoding without System.Web

Any idea why it would increase the memory consumption by 2.5MB? I thought putting something in the GAC was supposed to help with this.

Rick Strahl
February 07, 2009

# re: Html and Uri String Encoding without System.Web

@Peter - System.Web is very large, and its compiled machine code has to be loaded into memory once the assembly is referenced. I don't know exactly how the 2.5 megs works out since System.Web is 5.2 megs (maybe minus the resources?). The GAC isn't going to help with image size AFAIK.

configurator
February 09, 2009

# re: Html and Uri String Encoding without System.Web

Here's a clarification for you:

As soon as any method or type for System.Web is used, the assembly is loaded into the current AppDomain. Only the code in the assembly is loaded, hence the 2.5 MB, and it remains in memory until the AppDomain is unloaded.

That said, you said:
"Which is a bad design choice – these are general features that probably should live in System.Net"
I have to disagree. Html is strongly related to the web! The fact that you use it in your WinForms application (as do I) does not mean it is not web-related or fit for System.Web.

Also, in what case is the memory footprint that important?

Rick Strahl
February 09, 2009

# re: Html and Uri String Encoding without System.Web

@configurator - It's Ok to have this stuff in System.Web, but it should also live in System.Net or maybe in System.Convert(). Basic string encodings crop up in many places and they are not natively supported in a logical manner in .NET. Maybe a System.Encoding namespace would be a good idea.

si
August 24, 2009

# re: Html and Uri String Encoding without System.Web

Do you really want to return null in HtmlEncode? What about:

if (string.IsNullOrEmpty(text)) return string.Empty;

NC
March 07, 2012

# re: Html and Uri String Encoding without System.Web

Why not use simply System.Net.WebUtility ?

Rick Strahl
March 07, 2012

# re: Html and Uri String Encoding without System.Web

@NC - System.Net.WebUtility only does HtmlEncode/HtmlDecode, not UrlEncode/UrlDecode.

Tanveer Badar
June 18, 2013

# re: Html and Uri String Encoding without System.Web

This is certainly a very iffy design choice .net framework team made.

Like a commenter suggested above there should be a separate namespace for various encodings accessible without taking a dependency on System.Web., which is by the way the route I took as it offered least time investment.

James
April 17, 2018

# re: Html and Uri String Encoding without System.Web

This article needs an update to cover the System.Net.WebUtility class available in .NET Framework 4 and higher.


Aleksey Tikhonov
September 12, 2018

# re: Html and Uri String Encoding without System.Web

Thank you for solving this problem, in .NET 2.0 there is no class System.Net.WebUtility, and your method helped. We used your information in our note. Sincerely, the editor-in-chief of the site tolik-punkoff.com


West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2024