Rick Strahl's Weblog  


Fixing user input for Display in HTML pages



 

I had a little spare time tonight after coming home from my Home Recording class and I’ve been mucking around with my little Bug Report app that I use for tracking bugs in various products. Well, at the moment it just tracks Help Builder issues. One of the things that keeps coming up in every application is how to properly display user-entered text in HTML.

 

The idea here is that users enter what they think is plain text and the system needs to turn that into useful HTML to display. This seems like a simple proposition (and in some cases it is), but if you’re dealing with applications that need to support both text and markup this sort of thing gets real tricky. For example, Help Builder supports both text and HTML markup in text – how do you handle this sort of thing? Help Builder does it in three different ways: raw HTML, an HTML edit view (rich text) and finally a formatted text view that uses special markup tags and displays plain HTML as encoded text, not as HTML. As you might expect, this gets confusing at first blush.

 

I’m not going to go into the issues that Help Builder has to deal with because that’s really a special case, but even a simple application like the bug reporting app has this issue to a lesser degree. Basically you’re never sure what people are entering. And since we’re dealing with developer issues there is a fair chance people will end up posting either code or HTML/XML from time to time, and displaying that needs to be dealt with.

 

Users can input messages that get posted to the server and inserted into the bug database, and eventually that content gets displayed. A few things I need to do in just about every app are:

 

  • Make sure input is safe
  • Fix up line feeds in the document
  • Expand URLs inside of the document
  • Maintain support for a few tags that don’t get encoded

 

The first deals with HTMLEncoding the text to make sure there isn’t going to be any malicious script code embedded into the page. We also don’t want things like an old school grin (<g>) wiping out the topic content with strike through <g>.
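
As a minimal sketch of what the encoding step buys you, the snippet below runs a grin and a script tag through an encoder. It uses `System.Net.WebUtility.HtmlEncode` so it runs without a reference to System.Web; that is a substitution made here for illustration, since the routines further down use `HttpUtility.HtmlEncode`. The `EncodeDemo` class and its `Encode` method are hypothetical names.

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Hypothetical helper for illustration: encodes raw user input for display.
    public static string Encode(string userInput)
    {
        // Angle brackets and ampersands become entities, so the browser
        // renders them as text instead of interpreting them as markup.
        return WebUtility.HtmlEncode(userInput);
    }
}
```

With this sketch, `EncodeDemo.Encode("Works now <g>")` returns `Works now &lt;g&gt;`, and `<script>alert(1)</script>` comes back as `&lt;script&gt;alert(1)&lt;/script&gt;`, which displays as text and never executes.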

 

The next issue is line feeds. If you HtmlEncode a topic and simply render the content as HTML you will find that all line breaks get lost and the text just runs together with spaces where the breaks used to be. So we need to further format the text and replace line breaks with <br> and <p> tags.

 

In addition I also want my topics to at least let the user put in the <pre> tag for formatted text, so that if formatted text of some sort is posted there is at least one option to maintain the formatting, for example if some HTML or XML is pasted into the field.

 

Finally I like to have my link references automatically expanded so that any http, ftp or mailto references automatically become hyperlinks in the document.

 

The following are several short routines that accomplish these tasks. I’m sure this is nothing new, but you may find these useful – I’m also sure people will make some useful suggestions on how to optimize some of these. Part of the reason some of these don’t use RegEx is that they are simple enough to do in plain code, and I feel that this is probably faster than using RegEx parsing on the same text. However, the link parsing would be next to impossible without the RegEx stuff (and btw, I have to thank Markus and Simon for pointing me in the right direction on that one – I don’t think I’d have the patience to figure that one out on my own <g>).

 

Anyway here you go with some C# code from my wwUtils class:

 

/// <summary>
/// Fixes a plain text field for display as HTML by replacing carriage returns
/// with the appropriate br and p tags for breaks.
/// </summary>
/// <param name="HtmlText">Input string</param>
/// <returns>Fixed up string</returns>
public static string DisplayMemo(string HtmlText)
{
      // *** Normalize all line break styles to a single \r first
      HtmlText = HtmlText.Replace("\r\n","\r");
      HtmlText = HtmlText.Replace("\n","\r");

      // *** Double breaks become paragraphs, single breaks become <br>
      HtmlText = HtmlText.Replace("\r\r","<p>");
      HtmlText = HtmlText.Replace("\r","<br>");
      return HtmlText;
}

/// <summary>
/// Handles display of text by fixing up line breaks.
/// Unlike the non-encoded version it encodes any embedded HTML text,
/// with the exception of pre tags, which are passed through.
/// </summary>
/// <param name="Text">Plain text input that may contain pre tags</param>
/// <returns>HTML encoded string with br and p tags for breaks</returns>
public static string DisplayMemoEncoded(string Text)
{
      // *** hide <pre> tags behind placeholders so they survive encoding
      bool PreTag = false;
      if (Text.IndexOf("<pre>") > -1)
      {
            Text = Text.Replace("<pre>","__pre__");
            Text = Text.Replace("</pre>","__/pre__");
            PreTag = true;
      }

      // *** encode, then fix up line breaks into <br><p>
      Text = Westwind.Tools.wwUtils.DisplayMemo( HttpUtility.HtmlEncode(Text) );

      // *** put the <pre> tags back
      if (PreTag)
      {
            Text = Text.Replace("__pre__","<pre>");
            Text = Text.Replace("__/pre__","</pre>");
      }

      return Text;
}

 

/// <summary>
/// Expands links into HTML hyperlinks inside of text or HTML.
/// Requires: using System.Text.RegularExpressions;
/// </summary>
/// <param name="Text">Text or HTML that may contain URLs</param>
/// <returns>Text with URLs wrapped in anchor tags</returns>
public static string ExpandUrls(string Text)
{
      // *** Expand embedded hyperlinks
      string regex = @"\b(((ftp|https?)://)?[-\w]+(\.\w[-\w]*)+|\w+\@|mailto:|[a-z0-9](?:[-a-z0-9]*[a-z0-9])?\.)+(com\b|edu\b|biz\b|gov\b|in(?:t|fo)\b|mil\b|net\b|org\b|[a-z][a-z]\b)(:\d+)?(/[-a-z0-9_:\@&?=+,.!/~*'%\$]*)*(?<![.,?!])(?!((?!(?:<a )).)*?(?:</a>))(?!((?!(?:<!--)).)*?(?:-->))";

      // *** IgnorePatternWhitespace is left out on purpose: it would strip
      // *** the significant space in the "<a " lookahead of the pattern
      RegexOptions options = RegexOptions.Multiline | RegexOptions.IgnoreCase;
      Regex reg = new Regex(regex, options);

      // *** Call Replace on the instance so the options actually apply;
      // *** the static Regex.Replace(Text,regex,MatchEval) ignores them
      MatchEvaluator MatchEval = new MatchEvaluator( ExpandUrlsRegExEvaluator);
      return reg.Replace(Text,MatchEval);
}

 

/// <summary>
/// Internal RegExEvaluator callback
/// </summary>
/// <param name="M">Regex match for a single URL</param>
/// <returns>Anchor tag that wraps the matched URL</returns>
private static string ExpandUrlsRegExEvaluator(System.Text.RegularExpressions.Match M)
{
      string Href = M.Groups[0].Value;
      string Text = Href;

      // *** Add a protocol only when the match doesn't already carry one
      if ( Href.IndexOf("://") < 0 && !Href.StartsWith("mailto:") )
      {
            if ( Href.StartsWith("www.") )
                  Href="http://" + Href;
            else if (Href.StartsWith("ftp") )
                  Href="ftp://" + Href;
            else if (Href.IndexOf("@") > -1 )
                  Href="mailto:" + Href;   // mailto: URLs take no slashes
            else
                  Href="http://" + Href;   // bare domain without www.
      }
      return "<a href='" + Href + "'>" + Text + "</a>";
}

 

So in the bug form I can now use:

 

wwUtils.ExpandUrls( wwUtils.DisplayMemoEncoded( Bug.Description) );

 

to display text, encoded and with links expanded.
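
To make the result concrete, here is a minimal, self-contained sketch of what the encoding and line break steps do to a piece of input. It is a condensed stand-in, not the wwUtils code: the `MemoDemo` class and `Render` method are hypothetical names, `System.Net.WebUtility.HtmlEncode` is swapped in for `HttpUtility.HtmlEncode` so it runs without System.Web, and link expansion is left out.

```csharp
using System;
using System.Net;

public static class MemoDemo
{
    // Condensed stand-in for the steps above: protect <pre>, encode,
    // convert line breaks, then restore <pre>.
    public static string Render(string text)
    {
        // Hide the <pre> tags behind placeholders so encoding leaves them alone.
        text = text.Replace("<pre>", "__pre__").Replace("</pre>", "__/pre__");

        // Everything else the user typed becomes harmless text.
        text = WebUtility.HtmlEncode(text);

        // Normalize line endings to \r, then map double breaks to <p>
        // and single breaks to <br>.
        text = text.Replace("\r\n", "\r").Replace("\n", "\r")
                   .Replace("\r\r", "<p>").Replace("\r", "<br>");

        // Bring the <pre> tags back as real markup.
        return text.Replace("__pre__", "<pre>").Replace("__/pre__", "</pre>");
    }
}
```

With this sketch, `MemoDemo.Render("Line one\r\nLine two\r\n\r\nUse <b> here")` returns `Line one<br>Line two<p>Use &lt;b&gt; here`, while `MemoDemo.Render("<pre><x/></pre>")` returns `<pre>&lt;x/&gt;</pre>`: the pre tags survive as markup and their content is still encoded.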

 

Note the use of the cool RegEx MatchEvaluator callback, which lets you perform complex replacement operations on each match.
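
If you haven't used a MatchEvaluator before, here is a small sketch of the mechanism on its own, separate from the URL code. The scenario is made up for illustration: it finds bug references such as #12 and rewrites them zero-padded and in bold, which takes a computation per match that a plain replacement string can't do. `EvaluatorDemo` and `BoldBugNumbers` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorDemo
{
    // Hypothetical example: turn "#12" into "<b>#00012</b>".
    public static string BoldBugNumbers(string text)
    {
        // The callback runs once per match and returns the text
        // that replaces that particular match.
        MatchEvaluator evaluator = m =>
        {
            int number = int.Parse(m.Groups[1].Value);
            return "<b>#" + number.ToString("D5") + "</b>";
        };

        // One to five ASCII digits keeps int.Parse safe from overflow.
        return Regex.Replace(text, @"#([0-9]{1,5})\b", evaluator);
    }
}
```

With this sketch, `EvaluatorDemo.BoldBugNumbers("See #12 and #345")` returns `See <b>#00012</b> and <b>#00345</b>`.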

 

The DisplayMemoEncoded() method is maybe a little simplistic, but I find it handles the most common scenario. You might want to expand the replacement it performs to more than the <pre> tag, in which case it quickly becomes beneficial to start using regular expressions.
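
One way that expansion could look, as a rough sketch rather than anything that's in wwUtils: encode everything first, then use a single regular expression to turn a short list of allowed tags back into markup. The tag list (pre, b, i, code), the `AllowedTagsDemo` class and the `EncodeWithAllowedTags` method are all assumptions for illustration, and `System.Net.WebUtility.HtmlEncode` stands in for `HttpUtility.HtmlEncode` so the sample runs without System.Web. Only bare tags without attributes are restored, and line break handling is left out.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class AllowedTagsDemo
{
    // Assumption for illustration: these attribute-free tags are considered safe.
    // The pattern matches the *encoded* form, e.g. "&lt;/pre&gt;".
    private static readonly Regex EncodedTag =
        new Regex(@"&lt;(/?)(pre|b|i|code)&gt;", RegexOptions.IgnoreCase);

    public static string EncodeWithAllowedTags(string text)
    {
        // Encode everything first so no user markup survives by accident.
        string encoded = WebUtility.HtmlEncode(text);

        // Then restore only the allowed tags, normalized to lower case.
        return EncodedTag.Replace(encoded, m =>
            "<" + m.Groups[1].Value + m.Groups[2].Value.ToLowerInvariant() + ">");
    }
}
```

With this sketch, `AllowedTagsDemo.EncodeWithAllowedTags("<b>bold</b> <script>x</script>")` returns `<b>bold</b> &lt;script&gt;x&lt;/script&gt;`. Because the match runs against the encoded text, a user who literally types `&lt;b&gt;` ends up with `&amp;lt;b&amp;gt;`, which the pattern doesn't match, so there is no placeholder string for input to collide with.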

 

Nothing new here, but it's nice to have these related items all in one place.


The Voices of Reason


 

Daniel Fisher(lennybacon)
February 15, 2005

# re: Fixing user input for Display in HTML pages

Your ExpandUrls() Regex doesn't support things like:

http://ix.de

http://www.someserver.com/default.htm#pg3

http://www.someserver.com?param=1

but a lot of URLs look like that

:-(

Kevin Wright
February 15, 2005

# re: Fixing user input for Display in HTML pages

Thanks for sharing the bits of code. You mentioned in your piece that you also handle potential malicious code in the user's entry. How do you do this - have you found a way of writing this once and using it many times?

BTW, if you get too busy to develop your bug tracking app further, take a look at Gemini at www.countersoft.com

Kevin

Rick Strahl
February 16, 2005

# re: Fixing user input for Display in HTML pages

Thanks Daniel. I'll look into fixing that scenario.

Kevin, HTMLEncoding the text pretty much takes care of malicious script getting into the text of the entered text.

Randy Pearson
February 25, 2005

# re: Fixing user input for Display in HTML pages

How about cleaning up stuff like smart quotes and special characters? I find people pasting blocks from, say, Microsoft Word. For plain text fields, I like to replace smart quotes with straight " character. Also various bullet symbols can be issues--we replace with asterisks and then re-interpret those at render time.

West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2024