As you know, Delphi has been Unicode-enabled since Delphi 2009. This has caused a lot of headaches for many developers out there. Unicode is not easy; you need to understand how it works and know how to use it. It may be simpler for .NET developers because they had it from the beginning (but when they dive in deep, they are on the same level or even worse off). For Delphi developers it means they have to migrate old applications and code, or stay with the old ANSI versions of the compiler. These migrations can be very hard in some cases (I would say most of the time it is hard, some of the time it is easy). And even if you write new code from scratch, you may need to provide ANSI compatibility for old compiler versions.

In the previous post I wrote about implementing an XTEA cryptographic algorithm in Delphi. While doing it I found that it was a perfect example for studying how Unicode should be handled in Delphi when dealing with cryptographic data. Let me clarify. By cryptography I do not mean just encryption, but all operations that transform input data into output that has no meaning to humans (binary data with some algorithm-specific pattern). Encryption, hashing and even ID generators all fall under this category. They all take input data and most often produce some sort of binary output that has no meaning on its own.

I have seen a lot of Unicode implementations of cryptography in Delphi, but most of them were not done correctly. The main problem is that the older ANSI versions of the Delphi compiler worked with ANSI strings, and ANSI strings seemed ideal for storing binary data (bytes, actually). But that was just a plain wrong approach. It worked while strings were one byte per character, but with Unicode in the picture that world fell apart. This misuse of strings is one of the most common reasons why transitions from ANSI to Unicode are sometimes so hard (there is also PChar arithmetic etc.). It is also an example of why we, developers, must be very careful with things that seem obvious but may have a deeper meaning, and why understanding what we do is so important. Remember: never “code by coincidence”; always understand what your code does. And I mean every line of it.
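To make the problem concrete, here is a minimal sketch of the misuse described above (hypothetical code, not from the original unit): raw cipher bytes stashed inside a string.

  var
    S: string;   // AnsiString before Delphi 2009, UnicodeString from 2009 on
  begin
    SetLength(S, 4);
    // pretend these are cipher bytes stored "in" the string
    S[1] := Chr($DE); S[2] := Chr($AD); S[3] := Chr($BE); S[4] := Chr($EF);
    // On ANSI compilers each Char is 1 byte, so the buffer really holds
    // those 4 bytes. On Unicode compilers each Char is 2 bytes, so the
    // in-memory representation is silently widened, and any code that
    // writes the string to a stream or file now produces different bytes.
    WriteLn(Length(S) * SizeOf(Char)); // 4 on ANSI compilers, 8 on Unicode ones
  end;

The same source compiles on both compiler generations, which is exactly why the breakage is so sneaky: nothing fails until the bytes hit a file, a socket, or a decryption routine.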

While implementing this XTEA algorithm I had three goals to achieve:

  • The algorithm must be able to safely encrypt and decrypt data
  • It must offer an easy way to encrypt strings and streams
  • It must be backward compatible with no changes

The last one is important. I wrote the code from scratch, but if I had already had the code, it should work in Delphi 2010 and Delphi 2009 with just a recompile and no changes made to it. And it should handle data from Delphi 2006 with no problems. So how did I do it? First of all, I set probably the most important rule:

Cryptography works with binary data, not strings. So all data should be treated as binary.

Isn’t that obvious? Am I not saying something that everybody knows? Well no, at least not judging by the amount of incorrect code out there. Hm, you may ask, doesn’t this contradict the second goal? At first glance yes, but as I will show, it poses no problem at all. If we work with binary data, Unicode and ANSI do not matter at all. Bytes are bytes. We then provide, at a higher level, support functions to convert strings to binary data and back. And this must be done as transparently as possible. Let me show what the interface section of the XTEA algorithm looks like. It has changed somewhat since the last article.

  //***************************************************
  // tea stream encryption / decryption routines
  //***************************************************
 
  type
    TTeaUnicodeString = {$IFDEF UNICODE} UnicodeString {$ELSE} WideString {$ENDIF};
    TTeaAnsiString = {$IFDEF UNICODE} RawByteString {$ELSE} string {$ENDIF};
    {$IFDEF CLR} TStream = Stream; {$ENDIF}
    TLong2 = array[0..1] of Longword;   // 64-bit
    TTeaKey = array[0..3] of Longword;  // 128-bit
    TByte16 = array[0..15] of Byte;     // 128-bit
    TByte4 = array[0..3] of Byte;       // 32-bit
    TTeaData = array of Longword;       // n*32-bit
    TBytes = array of Byte;
 
  // XTEA encryption and decryption function
  function XTeaEncryptBytes(const Data, Key: TBytes): TBytes;
  function XTeaDecryptBytes(const Data, Key: TBytes): TBytes;
 
  procedure XTeaEncryptStream(const InStream, OutStream: TStream; const Key: TBytes);
  procedure XTeaDecryptStream(const InStream, OutStream: TStream; const Key: TBytes);
 
  // support functions for string <-> bytes conversions
  function GetBytesFromUnicodeString(const Value: TTeaUnicodeString): TBytes;
  function GetBytesFromAnsiString(const Value: TTeaAnsiString): TBytes;
 
  function GetUnicodeString(const Value: TBytes): TTeaUnicodeString;
  function GetAnsiString(const Value: TBytes): TTeaAnsiString;

This is it. We have the core encryption and decryption routines (XTeaEncryptBytes, XTeaDecryptBytes), and they work with bytes. Then we have the stream encryption and decryption routines and the support functions. You can see that the core routines take both data and key as an “array of Byte” and return an “array of Byte” as well. This is the only correct approach, in my opinion.
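To illustrate how the pieces fit together, here is a hypothetical usage sketch (the variable names and the key literal are mine, not from the original code). Note that XTEA expects a 128-bit key, which the 16 ANSI characters below happen to provide:

  var
    Key, Plain, Cipher, Restored: TBytes;
    Text: TTeaUnicodeString;
  begin
    Key := GetBytesFromAnsiString('0123456789ABCDEF'); // 16 bytes = 128 bits
    Text := 'Secret message';
    Plain := GetBytesFromUnicodeString(Text);  // explicit string -> bytes
    Cipher := XTeaEncryptBytes(Plain, Key);    // core routines see only bytes
    Restored := XTeaDecryptBytes(Cipher, Key);
    Assert(GetUnicodeString(Restored) = Text);
  end;

The conversion step is deliberately explicit: the caller chooses GetBytesFromUnicodeString or GetBytesFromAnsiString, and from that point on the algorithm never knows it ever saw a string.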

It is up to the user to encrypt and decrypt strings as he or she sees fit. This way we make no upfront assumptions about the string content and format; the user is the one who must know, and take responsibility for, why a string is treated as ANSI or Unicode. Streams are easy here, because they are binary data, so I will not spend time talking about them. Now you might ask: what if we must “unicodify” a previous ANSI solution and do not have the liberty of constructing the solution from the ground up? Well, in that case you have to ensure that your code behaves in the same manner as before on all compiler versions (even Unicode ones). This means you have to treat all strings as ANSI unless specifically ordered otherwise. Let me write another rule:

When dealing with legacy code, all strings are ANSI strings by default, unless specified otherwise.

Does this make sense? Yes, it does. It ensures that code written in older (non-Unicode) versions of Delphi will still work without changes in newer versions. Because we work with bytes underneath, it also ensures that no matter what code page we use, the byte sequence will stay the same (you must be careful with the key, however). A string encoded under a Chinese code page and viewed under a Russian code page may be unreadable, but it is still exactly the same data when looked at as binary. And this is important, as it ensures that the data is not affected by the code page (at least not in the core algorithm routines). The code I presented is very flexible and easily adapted to an ANSI or a Unicode compiler. And that flexibility and transparency is the key here.
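As a sketch of that rule in code (a hypothetical wrapper, not part of the original unit), a legacy-friendly entry point would route every string through the ANSI conversion, so the ciphertext stays byte-identical across compiler versions:

  // Treats both data and key as ANSI, matching pre-Unicode behaviour.
  function XTeaEncryptLegacyString(const Data, Key: TTeaAnsiString): TBytes;
  begin
    // TTeaAnsiString maps to RawByteString on Unicode compilers and to
    // plain string on ANSI ones, so the byte sequence going into the
    // core routine is the same either way.
    Result := XTeaEncryptBytes(GetBytesFromAnsiString(Data),
                               GetBytesFromAnsiString(Key));
  end;

Old callers keep passing plain string literals and get the same ciphertext they always did; new Unicode-aware code can call the explicit Unicode conversion path instead.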

I think this is the only correct approach to problems of this kind. If you think I am wrong, or you have some other ideas, please drop a comment and I will gladly answer any questions or ideas related to the topic. If you want to see the code and how it works, it is available from the download section.