Element 61

Sunday, August 07, 2005

Fixing endian issues -- the quick and easy way

Ok, so the real reason for this blog -- technical discussion. It's Sunday morning, and I'm bored, so I thought we'd start with something mundane and ordinary. Suppose you want your game to work on both Windows and OSX (as it stands now, not the crazy Intel-OSX boxes that will be public in a few years). One of the biggest issues with doing this, apart from the usual problems of OS-specific code, non-standard code, etc., is the endianness of the processor architecture. I wrote on this topic once before, looking at how these issues were resolved in the Q2 source... but as many people pointed out, the Q2 source is frickin insane. Plus, that was pure C -- we're going to be working in proper C++ here.

The actual origin of the term "endian" is a funny story. In Gulliver's Travels by Jonathan Swift, one of the places Gulliver visits has two groups of people who are constantly fighting. One group believes that a hard-boiled egg should be eaten from the big, round end (the "big endians") and the other group believes that it should be eaten from the small, pointy end (the "little endians"). Endian no longer has anything to do with hard-boiled eggs, but in many ways, the essence of the story (two groups fighting over a completely pointless subject) still remains.

Suppose we start with an unsigned 2-byte (16-bit) number; we'll use 43707. If we look at the hexadecimal version of 43707, it's 0xAABB. Now, hexadecimal notation is convenient because it neatly splits up the number into its component bytes. One byte is 'AA' and the other byte is 'BB'. But how would this number look in the computer's memory (not hard drive space, just regular memory)? Well, the most obvious way to keep this number in memory would be like this:

| AA | BB |
The first byte is the "high" byte of the number, and the second one is the "low" byte of the number. High and low refer to the value of the byte within the number; here, AA is high because it represents the higher digits of the number. This order of keeping things is called MSB-first (most significant byte first), or big endian. The most popular processors that use big endian are the PPC family, used by Macs. This family includes the G3, G4, and G5 that you see in most Macs nowadays.

So what is little endian, then? Well, a little endian version of 0xAABB looks like this in memory:
| BB | AA |
Notice that it's backwards from the other one. This is called LSB-first (least significant byte first), or little endian. There are a lot of processors that use little endian, but the most well known are the x86 family, which includes the entire Pentium and Athlon lines of chips, as well as most other Intel and AMD chips. The actual reason for using little endian is a question of CPU architecture and outside the scope of this article; suffice it to say that little endian made compatibility with earlier 8-bit processors easier when 16-bit processors came out, and 32-bit processors kept the trend.
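If you want to see which of the two layouts your own machine uses, you can just look at the bytes. Here's a minimal sketch; MemoryLayout is a name I made up for illustration, and it assumes a compiler with <cstdint> and std::snprintf (C++11 or later):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Returns the two bytes of Value in the order they sit in memory,
// formatted as text like "AA BB".
inline std::string MemoryLayout( std::uint16_t Value )
{
    // Copy the raw bytes out so we can look at them one at a time.
    unsigned char Bytes[sizeof(Value)];
    std::memcpy( Bytes, &Value, sizeof(Value) );

    char Text[8];
    std::snprintf( Text, sizeof(Text), "%02X %02X",
                   static_cast<unsigned>( Bytes[0] ),
                   static_cast<unsigned>( Bytes[1] ) );
    return Text;
}
```

Calling MemoryLayout( 0xAABB ) should give "AA BB" on a big endian machine like a G4, and "BB AA" on an x86 box.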

It gets a little more complicated than that. If you have a 4-byte (32-bit) number, the whole thing is reversed, not just swapped within each pair of bytes: 0xAABBCCDD is stored as DD CC BB AA on a little endian machine. Floating point numbers are also this way, and much to the chagrin of some people, you can't use the shift operators on them to put the bytes in the right order. This also means that you can't blindly change the endian of a whole file; you have to know what data is in it and what order it's in. When you write an int to a file straight from memory, it keeps the byte order of the processor that wrote it. The only good news is that if you're reading data one byte (8 bits) at a time, you don't have to worry about endians.
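To make the 4-byte case concrete, here's a rough sketch of the shift-and-mask approach for integers, plus the detour you need for floats. Swap32 and SwappedFloatBits are hypothetical names, and the float half assumes a 32-bit float (which is what the usual x86 and PPC compilers give you):

```cpp
#include <cstdint>
#include <cstring>

// Reverses all four bytes: 0xAABBCCDD becomes 0xDDCCBBAA.
inline std::uint32_t Swap32( std::uint32_t Value )
{
    return ( Value >> 24 )
         | ( ( Value >> 8 ) & 0x0000FF00u )
         | ( ( Value << 8 ) & 0x00FF0000u )
         | ( Value << 24 );
}

// Shift operators don't work on floats, so copy the raw bytes into an
// integer first and swap that. The result stays an integer because a
// byte-reversed float isn't a meaningful float value anymore.
inline std::uint32_t SwappedFloatBits( float Value )
{
    static_assert( sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float" );
    std::uint32_t Bits;
    std::memcpy( &Bits, &Value, sizeof(Bits) );
    return Swap32( Bits );
}
```

This works, but it's exactly the one-function-per-type style that gets tedious fast.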

Now, we need to deal with this problem. The first thing we need is a way to reverse the bytes of a given variable. We could write functions SwapInt, SwapShort, SwapFloat, etc, but that's hardly good C++. So I present to you the magic of templates and the standard library:
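#include <algorithm>    // std::reverse_copy

namespace Cpu
{
    // Returns a copy of Obj with its bytes in reverse order.
    template<typename Type>
    Type ByteSwap( const Type& Obj )
    {
        Type NewVal;
        const char* Src = reinterpret_cast<const char*>( &Obj );
        std::reverse_copy( Src, Src + sizeof(Obj), reinterpret_cast<char*>( &NewVal ) );
        return NewVal;
    }
}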
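The beauty is entirely in the simplicity. We pretend the value is a byte array, read it backwards with the help of a standard library function, and return the new value. Even better, it's well suited to optimization: the function is tiny and sizeof(Obj) is known at compile time, so the compiler has everything it needs to inline the call and unroll the copy. Keep in mind that it only makes sense for plain built-in types like ints, shorts, and floats. Now, the next thing to do is automate calling this function, so that we never really need to call it explicitly. We'll fall back on old style macro magic for this one: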
#ifdef CPU_BIG_ENDIAN
#    define LittleEndian(x)    Cpu::ByteSwap(x)
#    define BigEndian(x)        (x)
#else
#    define LittleEndian(x)    (x)
#    define BigEndian(x)        Cpu::ByteSwap(x)
#endif
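These macros assume that CPU_BIG_ENDIAN is already defined when building for a big endian processor. One possible way to set that up, as a sketch: GCC and Clang predefine __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__, so a header can check those, and on other compilers you'd define CPU_BIG_ENDIAN in the project settings. The kBuiltForBigEndian constant and the RuntimeIsBigEndian function are names I've invented here, just to sanity check that the build setting matches the machine:

```cpp
#include <cstring>

// GCC and Clang predefine these macros. On other compilers, define
// CPU_BIG_ENDIAN from the project settings for big endian targets.
#if !defined(CPU_BIG_ENDIAN) && defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__)
#    if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#        define CPU_BIG_ENDIAN
#    endif
#endif

// What the build believes about the target processor.
#ifdef CPU_BIG_ENDIAN
const bool kBuiltForBigEndian = true;
#else
const bool kBuiltForBigEndian = false;
#endif

// What the machine actually does: is the first byte of 0xAABB in
// memory the high byte?
inline bool RuntimeIsBigEndian()
{
    const unsigned short Value = 0xAABB;
    unsigned char First;
    std::memcpy( &First, &Value, 1 );
    return First == 0xAA;
}
```

If RuntimeIsBigEndian() and kBuiltForBigEndian ever disagree at startup, the build settings are wrong for that platform.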
Usage is really quite simple. Whenever we are reading some value from a file, we know which endian the file was written in. We simply wrap the read in the macro named for the file's endian. So if we're reading a little endian file, it goes something like this:
MyFloat = LittleEndian( FileReader->Read<float>() );
MyShort = LittleEndian( FileReader->Read<short>() );
MyInt = LittleEndian( FileReader->Read<int>() );
And it's as simple as that.
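For anyone who wants to try it end to end, here's a self-contained sketch that puts the pieces together. FileReader isn't shown in this post, so BufferReader below is a hypothetical stand-in that reads from a byte array instead of a real file; the two example functions are also made up for illustration, and they assume unsigned short is 2 bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

namespace Cpu
{
    // Same byte-reversing helper as in the post.
    template<typename Type>
    Type ByteSwap( const Type& Obj )
    {
        Type NewVal;
        const char* Src = reinterpret_cast<const char*>( &Obj );
        std::reverse_copy( Src, Src + sizeof(Obj), reinterpret_cast<char*>( &NewVal ) );
        return NewVal;
    }
}

#ifdef CPU_BIG_ENDIAN
#    define LittleEndian(x)    Cpu::ByteSwap(x)
#    define BigEndian(x)        (x)
#else
#    define LittleEndian(x)    (x)
#    define BigEndian(x)        Cpu::ByteSwap(x)
#endif

// Hypothetical stand-in for FileReader: hands out values from a byte
// buffer exactly as they are stored, with no endian conversion.
class BufferReader
{
public:
    BufferReader( const unsigned char* Data, std::size_t Size )
        : m_Data( Data ), m_Size( Size ), m_Pos( 0 ) {}

    template<typename Type>
    Type Read()
    {
        Type Value = Type();
        // Only read if enough bytes are left; otherwise return zero.
        if( m_Pos + sizeof(Type) <= m_Size )
        {
            std::memcpy( &Value, m_Data + m_Pos, sizeof(Type) );
            m_Pos += sizeof(Type);
        }
        return Value;
    }

private:
    const unsigned char* m_Data;
    std::size_t m_Size;
    std::size_t m_Pos;
};

// Reads 0xAABB out of a little endian "file" (low byte stored first).
inline unsigned short ReadLittleEndianExample()
{
    const unsigned char File[] = { 0xBB, 0xAA };
    BufferReader Reader( File, sizeof(File) );
    return LittleEndian( Reader.Read<unsigned short>() );
}

// Reads 0xAABB out of a big endian "file" (high byte stored first).
inline unsigned short ReadBigEndianExample()
{
    const unsigned char File[] = { 0xAA, 0xBB };
    BufferReader Reader( File, sizeof(File) );
    return BigEndian( Reader.Read<unsigned short>() );
}
```

Both functions should hand back 0xAABB regardless of the processor, as long as CPU_BIG_ENDIAN is defined on big endian builds and left undefined everywhere else.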
