UTF-8 Encoding with West Wind Web Connection

August 13, 2012

For Web applications, UTF-8 encoding has become fairly universal. According to Wikipedia:

"UTF-8 (UCS Transformation Format—8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32."

As FoxPro developers we're stuck with single-byte character sets (at most 256 characters), but UTF-8 lets us represent Unicode output fairly easily. This is especially true since version 8, when STRCONV() introduced conversions between UTF-8, Unicode and ANSI character sets, which makes it easy to create UTF-8 and to parse UTF-8 into the currently active character set.
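As a minimal sketch of that round trip, assuming Windows-1252 is the active code page (the variable names are made up for illustration):

```foxpro
*** Assumption: the active code page is Windows-1252 (Western European)
*** CHR(223) is the sharp s character in that code page
lcAnsi = "TamStra" + CHR(223) + "e"

*** ANSI -> UTF-8 (conversion setting 9)
lcUtf8 = STRCONV(lcAnsi, 9)

? LEN(lcAnsi)     && 9: every character is a single byte
? LEN(lcUtf8)     && 10: the sharp s takes two bytes in UTF-8

*** UTF-8 -> ANSI (conversion setting 11)
lcBack = STRCONV(lcUtf8, 11)

? lcBack == lcAnsi     && .T. - the round trip restores the original string
```

Building the test string with CHR() rather than typing the character keeps the example independent of how the source file itself was saved.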

UTF-8 solves a lot of problems with character set display issues, especially on the Web, and I'd highly recommend that you use UTF-8 as your output format for Web content, from Web Connection or otherwise. Although early versions of Web Connection didn't do anything special with extended character sets and UTF-8, more recent versions make it fairly easy to create content in UTF-8 format and to parse UTF-8 Request data (form vars and query strings) just about automatically.

Setting up UTF-8 Encoding and Decoding in Web Connection

Starting with Web Connection 5.0, there are properties on both the Request and Response objects that enable UTF-8 transformations automatically. They are disabled by default so that existing code doesn't break, although enabling them is unlikely to cause a problem unless you already set the character set encoding explicitly via meta tags and encode your content yourself.

It's almost trivial to enable UTF-8 processing in Web Connection for all requests routed through a given Process class with this code:

************************************************************************
* wwDemo :: OnProcessInit
***************************
FUNCTION OnProcessInit
 
*** Explicitly specify UTF-8 encoding and decoding
Response.Encoding = "UTF8" 
Request.lUtf8Encoding = .t.
 
ENDFUNC
* wwDemo :: OnProcessInit

These two innocuous looking property assignments tell Web Connection to:

Encode all output going through the Response object to UTF-8
Decode all input from Form Variables and QueryStrings to use UTF-8 Decoding

Note that both of these require Web Connection 5.0 or later, and Response.Encoding is available only on the wwPageResponse class, which is the default in Web Connection 5. It is not available for the older wwResponse/wwResponseFile/wwResponseString classes.

The above code to enable UTF-8 encoding is hooked to the wwProcess::OnProcessInit() method, which you can implement in your custom wwProcess subclasses. This method is a per-request hook for tasks that need to fire on every request. Typically you set up things like Session initialization (this.InitSession()) or authentication checks there. Setting the request encoding is just another simple task. Once the properties are set on the Response and Request objects, all Response output is automatically UTF-8 encoded and all Request input is automatically UTF-8 decoded.

Response Encoding

The Response encoding works on the wwPageResponse class only, which is a string-based output generation mechanism. When UTF-8 encoding is enabled your code basically builds up the Response into a string throughout your request code. Whether you explicitly call Response.Write() or another low-level Response method, or whether a higher-level handler like the Web Control Framework or the script and template engines creates the output doesn't really matter - it all ends up as a string on the Response.cOutput property.

Once the process method is complete, Web Connection assembles the final Response output by combining the Response.cOutput string with the response headers into a complete response, which is then UTF-8 encoded and returned to IIS via the Web Connection .NET Handler or the ISAPI module.

The result is a fully UTF-8 encoded response that properly displays any extended (upper ASCII) characters.
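To make the idea concrete, here is a rough sketch of what such an encoding step amounts to. This is not Web Connection's actual internal code: the variable names and the hand-built headers are assumptions for illustration only, and Windows-1252 is assumed as the active code page.

```foxpro
*** Hypothetical illustration only - not Web Connection's internal implementation
*** Assumption: the active code page is Windows-1252
lcCrLf = CHR(13) + CHR(10)

*** Output as built up by request code, in ANSI (CHR(223) is the sharp s)
lcBody = "<html><body>TamStra" + CHR(223) + "e</body></html>"

*** Encode the body to UTF-8 (conversion setting 9)
lcUtf8Body = STRCONV(lcBody, 9)

*** The headers are plain ASCII, so UTF-8 encoding leaves them unchanged.
*** The charset on the Content-Type tells the client how to decode the body,
*** and Content-Length has to count the encoded bytes.
lcHeaders = "HTTP/1.1 200 OK" + lcCrLf + ;
            "Content-Type: text/html; charset=utf-8" + lcCrLf + ;
            "Content-Length: " + TRANSFORM(LEN(lcUtf8Body)) + lcCrLf + ;
            lcCrLf

lcResponse = lcHeaders + lcUtf8Body

*** The encoded body is one byte longer than the ANSI body
? LEN(lcUtf8Body) - LEN(lcBody)
```

The point of the sketch is only that the encoding happens once, at the very end, on the finished string. That is why it doesn't matter which mechanism produced the output.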

While it's possible to send extended characters back without UTF-8, it's much more complex for clients - especially non-Web Browser clients - to deal with custom character sets. The server would have to specify which character set was used (such as Windows-1252) and clients would have to parse that charset and decode the content accordingly. UTF-8 simplifies this because it is fairly easy to map automatically to the active character set in the client's OS. If you receive a raw UTF-8 response in FoxPro (say by calling a UTF-8 URL with wwHTTP), it's as easy as calling STRCONV(lcOutput,11) to turn it into FoxPro usable ANSI characters, which is much easier than trying to match a specific encoding type and character set.

UTF-8 makes it much easier to share text data spanning potentially many different character sets using a single encoding mechanism. This is why it's a good idea to always create Web output using UTF-8, rather than any other encoding.

Request Encoding

If you use UTF-8 Response encoding you will actually need to match it with UTF-8 Request Decoding. Why? Because if you embed a URL like this in a document:

http://localhost/wconnect/testpage.wwd?address=TamStraße

you'll find that this URL is turned into a UTF-8 encoded URL that looks like this by the browser when clicked:

http://localhost/wconnect/testpage.wwd?address=TamStra%C3%9Fe

Notice that there are TWO escaped values next to each other for the ß character: %C3%9F which is the UTF-8 encoded character. Why? Well, your document is UTF-8 encoded and so the URL sent to the server also is. The same goes for form data you enter into a form. In other words, the Request data is UTF-8 encoded.

In order to properly decode those UTF-8 values you need to use:

Request.lUtf8Encoding = .t.

or else Request.Form() or Request.QueryString() will return weird looking characters for any extended characters in strings. For example if I type:

TamStraße

into a textbox and retrieve the value when lUtf8Encoding = .F. I'll get:

TamStraÃŸe

which is the raw UTF-8 byte sequence displayed as ANSI characters, and clearly not what you want. You can fix this manually easily enough:

? STRCONV("TamStraÃŸe",11)

which properly produces TamStraße, but the easier solution is to just set Request.lUtf8Encoding = .T. and have this happen automatically.
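For illustration, the manual route can be traced all the way from the escaped query string value. The following is a minimal sketch, not how Web Connection does it internally: it assumes Windows-1252 as the active code page, handles only well-formed %XX escapes, and ignores other URL encoding rules such as + for spaces.

```foxpro
*** Hypothetical sketch - assumes code page 1252 and well-formed %XX escapes
lcEscaped = "TamStra%C3%9Fe"
lcBytes = ""
lnPos = 1

DO WHILE lnPos <= LEN(lcEscaped)
   lcChar = SUBSTR(lcEscaped, lnPos, 1)
   IF lcChar == "%" AND lnPos + 2 <= LEN(lcEscaped)
      *** Turn the two hex digits after the % into a single byte
      lcBytes = lcBytes + CHR(EVALUATE("0x" + SUBSTR(lcEscaped, lnPos + 1, 2)))
      lnPos = lnPos + 3
   ELSE
      lcBytes = lcBytes + lcChar
      lnPos = lnPos + 1
   ENDIF
ENDDO

*** lcBytes now holds raw UTF-8: the sharp s is the two bytes 0xC3 0x9F
? LEN(lcBytes)     && 10

*** UTF-8 -> ANSI (conversion setting 11) collapses them into one character
lcValue = STRCONV(lcBytes, 11)
? lcValue == "TamStra" + CHR(223) + "e"     && .T.
```

Skipping the last step is exactly what produces the garbled value shown earlier: the two UTF-8 bytes are displayed as two separate ANSI characters.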

Note that UTF-8 encoding is common in the browser, and for most Web pages it's considered the default if no other encoding is specified. One big issue with character encoding is that the server doesn't always receive information on what encoding is used. In fact, most Form posts don't specify the encoding, so you'd have to guess. But since in most applications you control the page generation (i.e. you generate the page that posts back), you know the encoding of the parent page, which in turn determines the POST encoding and the querystring encoding for embedded links.

The Moral of the Story is: Use UTF-8

If you run into any problems with character encoding in your Web Connection applications, the most likely culprit is that you forgot to properly encode your content. If this happens to you, the easiest fix is almost always to output everything in UTF-8. If you're dealing with any extended character formats, or even multi-cultural applications, UTF-8 will always work. Whether FoxPro can map all characters received from the Web to the current character set is another issue altogether: that's a tricky limitation of FoxPro that has no easy solutions short of switching character sets at runtime.
