Reading UTF-8 Encoded Files and File2Var()

February 12, 2007

I've just made a change to the File2Var() function in Web Connection, the universal function used throughout the framework for reading files from disk. You might wonder why I continue to use this function rather than the more efficient FILETOSTR(), but File2Var() does a number of things that FILETOSTR() does not.

For one thing, File2Var() doesn't fail with an error when a file cannot be read or isn't found; it simply returns an empty string. This is less of an issue in VFP 8/9 with exception handling, but even so it makes for much cleaner code.

But the main reason has always been to provide additional error checking and the ability to wrap the file access functionality in one place. And once again there's a scenario where this is proving very useful.

UTF-8 Encoded Files
Specifically, I ran into a scenario where script templates or Web Connection Page Templates are stored on disk as UTF-8 documents. For example, Microsoft Web Expression Designer only allows saving documents as UTF-8; there's no option to do otherwise. For Web Connection (or any VFP tool for that matter) that can be problematic if you are trying to parse a template based on this data. If you load the UTF-8 encoded file into VFP, you now have a string that potentially contains multi-byte UTF-8 sequences rather than valid ANSI text. This is especially problematic for templates that you embed values into. Most of us happily write code like this:

<%= Customer.LastName %>

<ww:wwWebTextBox runat="server" id="txtCompany" ControlSource="Customer.LastName" />

But if you do this in a template that was loaded as a UTF-8 document, you now have a problem: you're embedding standard code page ANSI data into a document that potentially holds UTF-8 encoded content. We could do:

<%= STRCONV(Customer.LastName,9) %>

to force all of our expressions to UTF-8, but that's both very inefficient and unlikely to happen, especially if you already have existing code. Instead, the better solution is to UTF-8 encode the entire document when we're done generating it, which is now easy in Web Connection 5.20 with Response.Encoding="UTF8". But that only works on the assumption that the content in the template and any embedded expressions are NOT UTF-8 encoded to begin with. If you apply Response.Encoding to content that is already UTF-8 encoded, you'll double encode the extended characters and irreversibly mangle the document.
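
To make the double encoding problem concrete, here's a minimal sketch you can paste into a PRG or the Command Window. It assumes the current code page is 1252, and the variable names are made up for illustration:

```foxpro
*** Sketch only: assumes the current code page is 1252
lcAnsi  = "W" + CHR(228) + "hlen"   && ANSI text, CHR(228) is the a-umlaut
lcUtf8  = STRCONV(lcAnsi, 9)        && ANSI -> UTF-8: the umlaut becomes two bytes
lcTwice = STRCONV(lcUtf8, 9)        && the UTF-8 bytes are treated as ANSI and encoded again

*** Each encoding pass makes the string longer
? LEN(lcAnsi), LEN(lcUtf8), LEN(lcTwice)

*** Decoding the double encoded string once only gets back to the UTF-8 bytes
? STRCONV(lcTwice, 11) == lcAnsi

*** A single encode/decode round trip should return the original text
? STRCONV(lcUtf8, 11) == lcAnsi
```

The point is that STRCONV() has no way of knowing whether its input is already UTF-8, so the code that loads the template has to settle on one representation up front.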

So - what's the solution to this problem?

The answer is to always decode UTF-8 content when it's loaded from disk. Here's an example. Let's say I type this into Expression:

<html>
<head>
</head>
<body>
Man sich ändern aber man kann seinen Weg nicht wählen.
</body>
</html>

save and then load the file with FILETOSTR:

lcText = FILETOSTR("c:\test.htm")

I'd get:

ï»¿<html>
<head>
</head>
<body>
Man sich Ã¤ndern aber man kann seinen Weg nicht wÃ¤hlen.
</body>
</html>

You'll notice the two-byte UTF-8 sequences for the umlauts, and also the BOM (the bytes EF BB BF) at the beginning of the document. A BOM is optional for UTF-8 files and not all programs write it, so you can't count on it being there. Most Microsoft tools do write it when saving as UTF-8, though: Expression, Notepad and Visual Studio all do.

Luckily Visual Studio to date defaults to writing files in ANSI format (code page 1252), so we haven't run into this problem more widely. However, that may change in the future as more and more content is UTF-8 encoded by default (and given that VS.Next will use the Expression designer, chances are it will enforce the same rules).

Anyway, the problem with a UTF-8 encoded document is easy to fix generically given that you have a hook point where you can check the result. And so here's how File2Var handles this now:

************************************************************************
FUNCTION File2Var
******************
***  Function: Takes a file and returns the contents as a string or
***            Takes a string and stores it in a file if a second
***            string parameter is specified.
***      Pass: tcFilename  -  Name of the file
***            tcString    -  If specified the string is stored
***                           in the file specified in tcFileName
***    Return: file contents as a string
************************************************************************
LPARAMETERS tcFileName, tcString, llSharedWrite
LOCAL lcRetVal, lnHandle, lnSize
 
tcFileName=IIF(EMPTY(tcFileName),"",tcFileName)
 
IF VARTYPE(tcString) # "C"
   *** File to string - if possible use FILETOSTR
   *** since it's native and faster, but we need
   *** to wrap it into an error handler
   lcRetVal=""
   
   TRY
      lcRetVal = FILETOSTR(tcFilename)
   CATCH
   ENDTRY
ELSE
   tcString=IIF(EMPTY(tcString),"",tcString)
   
   IF !llSharedWrite
      *** Text to File
      lnHandle=FCREATE(tcFileName)
      IF lnHandle=-1
         RETURN .F.
      ENDIF
      =FWRITE(lnHandle,tcString)
      =FCLOSE(lnHandle)
      RETURN .T.
   ELSE
      LOCAL llFailed
      llFailed = .F.
      TRY
         STRTOFILE(tcString,tcFileName)         
      CATCH
         llFailed=.T.
      ENDTRY
      
      IF llFailed
         RETURN .F.
      ENDIF
      RETURN .T.
   ENDIF
ENDIF
 
*** UTF-8 BOM (bytes EF BB BF) found: strip it and decode UTF-8 to ANSI
IF lcRetVal = "ï»¿"
   lcRetVal = STRCONV(SUBSTR(lcRetVal,4),11)
ENDIF
 
RETURN lcRetVal
*EOP File2Var

The code at the end checks for the BOM and, if it finds one, strips it and decodes the UTF-8 content to ANSI (STRCONV() with flag 11).

For Web Connection this solves the potential pitfalls of dealing with UTF-8 data in template data, since Web Connection uses File2Var throughout to read Templates, Script Files and Web Control Pages for parsing. So even if you create your templates in UTF-8 encoding you'll be good to go.

If you truly need to get the raw file data rather than text data you can use FILETOSTR() to get the UTF-8 encoded content into a string.
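
Here's a quick sketch of the difference between the two calls. It assumes File2Var() is in scope (for example via SET PROCEDURE) and that c:\test.htm is the UTF-8 file with a BOM from the example above:

```foxpro
*** Sketch only: assumes File2Var() is loaded and c:\test.htm exists
lcRaw  = FILETOSTR("c:\test.htm")   && raw bytes: BOM plus UTF-8 encoded text
lcText = File2Var("c:\test.htm")    && BOM stripped, text decoded to ANSI

*** Check for the BOM by byte value rather than by a pasted literal
llHasBom = LEFT(lcRaw, 3) == CHR(239) + CHR(187) + CHR(191)
? llHasBom

*** The decoded string is shorter: no BOM, and one byte per umlaut
? LEN(lcRaw), LEN(lcText)
```

Testing against CHR(239) + CHR(187) + CHR(191) avoids depending on how the editor saved the source file that contains the literal.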

Additional Issues: Writing UTF-8 data

The above solves the immediate issue in Web Connection, but it does raise a few more questions: if you need to work on a file, make changes and write the data back, what then?

VFP also supports writing output with a UTF-8 BOM directly:

STRTOFILE(STRCONV(lcContent,9),"c:\test4.htm",4)

Note that you have to encode the data yourself though, which makes sense since the data you're dealing with may already be encoded (say you load a raw UTF-8 string, modify it in memory and write it back). So at least we have an easy way to write the data back in its original format without any fuss. Actually you have two ways:

You can read the data as raw UTF-8 with FILETOSTR() and then simply write back as raw binary data. Since the BOM was there when you read the data you simply write it back to disk. Or alternately you read the data and strip the BOM and decode (using File2Var() or similar) and then re-encode the data to UTF-8 and use STRTOFILE() with flag 4 to write the BOM.
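
Both approaches can be sketched roughly as follows. The file name and the STRTRAN() edits are placeholders for illustration, and File2Var() is assumed to be in scope:

```foxpro
*** Option 1: raw round trip. The BOM and the UTF-8 bytes pass through untouched.
*** Only safe if the text you add is plain ASCII or already UTF-8 encoded.
lcRaw = FILETOSTR("c:\test.htm")
lcRaw = STRTRAN(lcRaw, "<body>", '<body class="main">')
STRTOFILE(lcRaw, "c:\test.htm")

*** Option 2: decode, edit as ANSI, then re-encode and let VFP write the BOM
lcText = File2Var("c:\test.htm")                  && BOM stripped, UTF-8 decoded
lcText = STRTRAN(lcText, "</body>", "<p>Gr" + CHR(252) + CHR(223) + "e</p></body>")
STRTOFILE(STRCONV(lcText, 9), "c:\test.htm", 4)   && flag 4 writes the UTF-8 BOM
```

Option 2 is the one to use when the changes include extended characters, since everything in memory is in one encoding until the final STRCONV() call.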

It's easy to forget about encoding until it bites you in the face, as it did me a couple of days ago when I tried to load some pages that I had edited in Expression and kept wondering why it kept rendering the extra BOM characters in a script page. <s> The above fix resolved that issue and will do so in the future for Web Connection.
