ASCII, Unicode, and Windows

As mentioned previously, my company sold the rights to one of our previous Windows software products to a much larger company a few years ago. The much larger company keeps my company on retainer to advise them on the product’s continued development.

The product (let’s call it Project Badger) was originally written more than ten years ago now, with me as the sole original developer. At the time, I treated C++ as simply an improved version of C. Microsoft Visual C++ 6.0, my compiler of (reluctant) choice at the time, barely supported the newly-standardized C++ Standard Template Library (STL); the Boost library didn’t exist yet; and I hated the non-standard and non-portable Microsoft Foundation Classes with a passion. That didn’t leave much choice: I had to hand-code all the algorithms and data structures myself, and use the painfully primitive C string functions and the raw Win32 API.

(Yes, and walk barefoot in the snow to get to school every day. Uphill. Both ways. 😉 But that’s another story.)

Project Badger has grown a great deal since then, but it was still using the hand-coded data structures and algorithms that I originally came up with, and the C string functions and raw Win32 API. Although they still worked, it was a very creaky and increasingly ugly infrastructure… grafting support for Unicode onto it (as our customers demanded) was a nightmare that required a lot of hand-crafted code, and it was only supported in certain places. And we couldn’t move to Unicode completely, because we still had to support Win9x (Windows 95, 98, and Millennium), for reasons I won’t go into, and they didn’t support Unicode. (Using the UNICOWS DLL, which is Microsoft’s way of retrofitting Unicode support into Win9x systems, was deemed unacceptable in this case, again for reasons I won’t go into.)

This all came to a head a few weeks ago, when a customer reported a problem that should have been easy to fix, but that our bastardized infrastructure made nearly impossible. For a few of its abilities, Project Badger has to open a copy of its EXE file and read some data from a table tacked onto the end; this customer reported that these features wouldn’t work if the EXE file were placed in a path with Unicode characters. The reason was easy to track down: we were using the GetModuleFileNameA function to find the filename of the EXE, and CreateFileA to open it. (The ‘A’ on the end means that they’re the ANSI (narrow-character) versions of those functions, rather than the Unicode versions, which would be denoted by a ‘W’ there.) But fixing it, in a way that wouldn’t break Win9x compatibility… that was more interesting.

After one of our developers spent a frustrating couple of days trying to work around the problem using “short filenames” (a legacy of DOS and the FAT12/FAT16 file systems), to no avail, I proposed overhauling the entire program, replacing many hand-coded parts of it with components from the STL and Boost libraries. Since I was the only one on the team who had much experience with both, I volunteered to do the initial conversion.

It was a large undertaking, even larger than I’d anticipated. It took me ten straight days, working twelve to sixteen hours a day, to finish it (I’d estimated seven days, for a broader scope of changes). But I think the result was worth the effort.

A key component of the overhaul was a way to support Unicode strings and functions similar to the way that the UNICOWS library does: on platforms where it’s available, dynamically load the Unicode version of the API function and pass the Unicode parameters directly to it. On Win9x platforms (where the Unicode functions aren’t available), fall back on the ASCII versions of the functions, translate all strings to ASCII before passing them in, and translate any results to Unicode when passing them back. To support this, I put together a specialized String class, using the Boost.Variant library. Here’s a simplified version of its declaration (the full version has conversion support for a few program-specific types as well):


#include <string>
#include <boost/variant.hpp>

namespace os {
    class String {
        public:
        String(const std::string& init): mData(init) { };
        String(const std::wstring& init): mData(init) { };
        String(const char* init);
        String(const wchar_t* init);

        std::string to_string() const;
        std::wstring to_wstring() const;
        const char* to_ptr() const;
        const wchar_t* to_wptr() const;

        bool isNull() const;
        bool isNativelyUnicode() const;
        bool isValidString() const;

        static std::string _toAscii(const std::wstring& str);
        static std::wstring _toUnicode(const std::string& str);

        private:
        typedef boost::variant<const void*, std::string, std::wstring> Data;

        Data mData;
    };
}
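The _toAscii and _toUnicode helpers handle the narrow/wide conversion. As a rough, portable sketch (the bodies here are my illustration, not the original code, which would call the Win32 WideCharToMultiByte and MultiByteToWideChar functions), they might look like this:

```cpp
#include <string>

// Naive sketch of the conversion helpers: widen and narrow one character
// at a time. This only round-trips cleanly for 7-bit ASCII text; the real
// implementation would use WideCharToMultiByte/MultiByteToWideChar so the
// active Windows code page handles the rest.
std::string toAscii(const std::wstring& str) {
    std::string out;
    out.reserve(str.size());
    for (std::wstring::size_type i = 0; i < str.size(); ++i) {
        wchar_t c = str[i];
        out += (c < 128) ? static_cast<char>(c) : '?';  // lossy fallback
    }
    return out;
}

std::wstring toUnicode(const std::string& str) {
    std::wstring out;
    out.reserve(str.size());
    for (std::string::size_type i = 0; i < str.size(); ++i)
        out += static_cast<wchar_t>(static_cast<unsigned char>(str[i]));
    return out;
}
```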

Note that the boost::variant typedef Data can accept a std::string (a standard ASCII string), a std::wstring (a Unicode string), or a raw ASCII or Unicode string pointer. The pointer option was necessary because some Win32 API functions allow you to pass in an invalid string pointer, encoding (usually via the MAKEINTRESOURCE macro) some specialized information instead of a standard string.
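For illustration, here’s a sketch of the test behind something like isValidString(). The Win32 IS_INTRESOURCE macro treats a pointer as a packed resource ID when everything above its low 16 bits is zero, since MAKEINTRESOURCE just stores a 16-bit integer ID in a pointer-sized value (the function name here is hypothetical, not from the original code):

```cpp
#include <cstdint>

// Hypothetical helper mirroring the Win32 IS_INTRESOURCE check: a
// MAKEINTRESOURCE value is a 16-bit ID packed into a pointer, so any bits
// set above the low 16 mean we're looking at a real string pointer.
bool looksLikeResourceId(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) >> 16) == 0;
}
```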

With that in place, and using a set of simple template classes I wrote to handle the dynamic loading, I could write functions that would take either ASCII or Unicode strings and pointers (or even a combination of different types) and do any necessary conversion on the fly:


namespace os {
    HANDLE CreateFile(const String& filename, DWORD acc,
        DWORD share, LPSECURITY_ATTRIBUTES sec,
        DWORD create, DWORD flags, HANDLE tpl)
    {
        static StdFn7<HANDLE, LPCWSTR, DWORD, DWORD,
            LPSECURITY_ATTRIBUTES, DWORD, DWORD, HANDLE>
            fn("CreateFileW", cKernel32);

        if (fn && filename.isNativelyUnicode()) {

            return fn(filename.to_wptr(), acc, share, sec, create,
                flags, tpl);

        } else {

            return ::CreateFileA(filename.to_ptr(), acc, share, sec,
                create, flags, tpl);

        }
    }
}
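The StdFn7 template above is part of the program’s own dynamic-loading machinery, so I can only sketch the idea. Assuming a resolver in place of GetProcAddress (injected here so the sketch stays portable; the names StdFn1 and demoResolver are mine, not the original code), a one-argument version might look like this:

```cpp
#include <cstring>

// Sketch of a StdFn-style wrapper: resolve the named function once, cache
// the pointer, and report via operator bool whether it exists on this
// platform. On Windows the resolver would wrap
// GetProcAddress(GetModuleHandleA("kernel32"), name).
typedef void* (*Resolver)(const char* name);

template <typename R, typename A1>
class StdFn1 {
public:
    StdFn1(const char* name, Resolver resolve)
        : mFn(reinterpret_cast<R (*)(A1)>(resolve(name))) { }

    operator bool() const { return mFn != 0; }   // did the lookup succeed?
    R operator()(A1 a1) const { return mFn(a1); }

private:
    R (*mFn)(A1);
};

// Demo stand-ins so the sketch is self-contained.
static int addOne(int x) { return x + 1; }
static void* demoResolver(const char* name) {
    // Casting a function pointer through void* is exactly what
    // GetProcAddress's return type forces on Windows, too.
    return std::strcmp(name, "addOne") == 0
        ? reinterpret_cast<void*>(&addOne)
        : static_cast<void*>(0);
}
```

Declaring the wrapper `static` inside each API function, as the CreateFile wrapper above does, means the lookup happens only on the first call.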

It’s a good solution, though not a perfect one. For one thing, it’s not as fast as the raw calls. That doesn’t matter in our case: most such calls are in user-interface code (where even the slowest machine could absorb far less efficient code than this without anyone noticing), and the rest are in one-time operations where speed isn’t critical.

Another limitation is that you can’t pass in a zero or NULL for one of the os::String parameters, even when it might be supported (as in the “title” parameter to the MessageBox function, which defaults to the localized “error” string if passed a NULL). I got around that by defining

const os::String NULLSTRING(static_cast<const char*>(NULL));

in the source file, and then putting

extern const os::String NULLSTRING;

in the header. Then, whenever I needed to pass in a NULL, I just changed it to NULLSTRING, and everything worked as normal. I could probably have also defined an os::String constructor that accepted an int, but that seemed like overkill, and would have worked against type-safety.
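For illustration, here’s how isNull() can be implemented against the Data variant. This is my sketch using C++17’s std::variant as a stand-in for boost::variant (the interfaces are similar; Boost would use boost::get on a pointer instead of std::get_if):

```cpp
#include <string>
#include <variant>

// Sketch: the string counts as "null" only when the variant holds the
// raw-pointer alternative and that pointer itself is null. A std::string
// or std::wstring alternative is never null, even when empty.
typedef std::variant<const void*, std::string, std::wstring> Data;

bool isNull(const Data& data) {
    const void* const* ptr = std::get_if<const void*>(&data);
    return ptr != 0 && *ptr == 0;
}
```

This is what lets NULLSTRING behave like a real NULL at the API boundary while still being an ordinary os::String everywhere else.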

The end result: minimal changes to the existing code, while adding maximum flexibility. I’m very happy with it. 🙂

9 Comments

  1. CreateFile and many other functions should already be defined to CreateFileA and CreateFileW by the SDK headers (winbase.h) depending on the UNICODE macro, so why load the wide functions dynamically?

  2. Because not all of the wide-character API functions are defined on Windows 95/98/Millennium, and as I said above, we still want to support those platforms. If our program imported the functions statically (i.e. by just using the wide-character functions without loading them dynamically), then it would fail to run on machines that don’t have one or more of those functions defined. It wouldn’t even be able to start on such machines, due to unresolvable imports.

    (As I understand it, most of the older wide-character functions are defined in Win9x, but practically all of them are simply stubs that return a not-implemented error. I don’t have a Win9x machine handy at the moment to confirm that though.)

    By loading them dynamically, and only under Windows NT/2000/XP or later, we ensure that the program will still run on those older platforms, while taking advantage of Unicode and the newer API functions on the platforms that support them.

    It’s not a problem that most developers will ever have to deal with. The UNICOWS DLL provides pretty complete Unicode support for Win9x systems, so you can just tell Visual C++ to compile for Unicode and make it work with UNICOWS. But, also as mentioned above, we didn’t want to force people to install UNICOWS on their systems — for most programs, you can make it a requirement without any problem, but due to its nature, Project Badger has some unique needs.

  3. One of the few things that MS did that was a Good Thing is that they pushed for Unicode in a big way. At least, that was how I interpreted Petzold (MS Press’s tome on Win32 with the unique name “Programming Windows”) emphasizing it to the point of devoting one of the first chapters of the last edition entirely to Unicode.

  4. They didn’t push it in Win9x though. There were good reasons for it at the time (Unicode strings take twice the storage of ASCII strings under Windows), but it makes supporting the full range of old Windows systems… interesting.

  5. Nobody really pushed Unicode back then; many popular Linux programs were late in adopting it too. (OS X didn’t exist before 2000, though its filesystem, HFS+, has Unicode capabilities, and I think it predates OS X; HFS+ continues to get incremental improvements…)

    I can’t wait till ZFS comes though. It’d make Time Machine supercharged and filesystem resizing on a live partition a lot better, along with many new benefits to OS X. It’ll be on the server OS soon, though of course big-iron server OSs already have ZFS or similar file-systems like the one from HP that is coming to the Linux kernel, or DragonflyBSD’s new filesystem. (Assuming Sun doesn’t get enlightened and relicense ZFS… Of course, it’s usable just with FUSE.)

  6. I believe Linux uses UTF-8, which in a lot of respects is a superior system. There are things that are much easier to do with Unicode strings though.

    ZFS may be “usable” with FUSE, but as someone who has been running it on Linux continually for quite a while, I can authoritatively say that it still has problems that way. I still use it, because it’s superior despite the problems, but I wouldn’t recommend it for anyone but a major geek (with good backups). If it were in the kernel, it would likely be a lot better.

  7. Pingback: Websites tagged "ascii" on Postsaver

  8. Pingback: winbase.h

  9. Wow… I just deleted a single spam post with the spammer’s entire treasure-trove of innocuous and meaningless phrases in it. I’d leave it up, but it was so big that I was afraid it would block people from finding the useful information.

Comments are closed.