samhuri.net


By Sami Samhuri

January 2010

A preview of Mach-O file generation

This month I got back into an x86 compiler I started last May. It lives on github.

The code is a bit of a mess but it mostly works. It generates Mach object files that are linked with gcc to produce executable binaries.

The Big Refactoring of January 2010 has come to an end and the tests pass again. Printing is broken, but it still prints something, and more importantly the compiler turns test/test_huge.code into something that works.

After print is fixed I can clean up the code before implementing anything new. I wasn't sure if I'd get back into this or not and am pretty excited about it. I'm learning a lot from this project.

If you are following the Mach-O posts you might want to look at asm/machofile.rb, a library for creating Mach-O files. Using it is quite straightforward; there is an example in the #output method in asm/binary.rb.

Definitely time for bed now!

Basics of the Mach-O file format

This post is part of a series on generating basic x86 Mach-O files with Ruby. The first post introduced CStruct, a Ruby class used to serialize simple struct-like objects.

Please note that the best way to learn about Mach-O properly is to read Apple's documentation on Mach-O, which is pretty good when combined with the comments in /usr/include/mach-o/*.h. These posts only cover the basics necessary to generate a simple object file for linking with ld or gcc, and are not meant to be comprehensive.

Mach-O File Format Overview

A Mach-O file consists of 2 main pieces: the header and the data. The header is basically a map of the file describing what it contains and the position of everything contained in it. The data comes directly after the header and consists of a number of binary blobs of data, one after the other.

The header contains 3 types of records: the Mach header, segments, and sections. Each binary blob is described by a named section in the header. Sections are grouped into one or more named segments. The Mach header is just one part of the header and should not be confused with the entire header. It contains information about the file as a whole, and specifies the number of segments as well.

Take a quick look at Figure 1 in Apple's Mach-O overview, which illustrates this quite nicely.

A very basic Mach object file consists of a header followed by a single blob of machine code. That blob could be described by a single section named \_\_text, inside a single nameless segment. Here's a diagram showing the layout of such a file:


            ,---------------------------,
  Header    |  Mach header              |
            |    Segment 1              |
            |      Section 1 (__text)   | --,
            |---------------------------|   |
  Data      |           blob            | <-'
            '---------------------------'

The Mach Header

The Mach header contains the architecture (cpu type), the type of file (object in our case), and the number of segments. There is more to it but that's about all we care about. To see exactly what's in a Mach header fire up a shell and type otool -h /bin/zsh (on a Mac).

Using CStruct we define the Mach header like so:
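As a rough stand-in, here is a minimal sketch of the same record built with plain Array#pack instead of CStruct. The field order follows struct mach_header in /usr/include/mach-o/loader.h; the mach_header helper and its keyword arguments are made up for illustration and are not part of the compiler's code.

```ruby
# Constants as found in <mach-o/loader.h> and <mach/machine.h>.
MH_MAGIC             = 0xfeedface # magic number for 32-bit Mach-O files
CPU_TYPE_X86         = 7
CPU_SUBTYPE_I386_ALL = 3
MH_OBJECT            = 0x1        # relocatable object file

# Hypothetical helper: packs the seven 32-bit fields of a Mach header.
# "V" means 32-bit unsigned, little-endian, which is what x86 expects.
def mach_header(ncmds:, sizeofcmds:, flags: 0)
  [MH_MAGIC, CPU_TYPE_X86, CPU_SUBTYPE_I386_ALL, MH_OBJECT,
   ncmds, sizeofcmds, flags].pack("V7")
end

# One load command (a segment) occupying 124 bytes: 56 + 68.
header = mach_header(ncmds: 1, sizeofcmds: 124)

puts header.bytesize                 # 7 fields * 4 bytes
puts header.unpack("V7")[0].to_s(16) # the magic number, read back
```

The first four bytes on disk come out as ce fa ed fe because the magic number is stored little-endian.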

Segments

Segments, or segment commands, specify where in memory the segment should be loaded by the OS, and the number of bytes to allocate for that segment. They also specify which bytes inside the file are part of that segment, and how many sections it contains.

One benefit to generating an object file rather than an executable is that we let the linker worry about some details. One of those details is where in memory segments will ultimately end up.

Names are optional and can be arbitrary, but the convention is to name segments with uppercase letters preceded by two underscores, e.g. \_\_DATA or \_\_TEXT.

The code exposes some more details about segment commands, but should be easy enough to follow.
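For reference, a segment command could be sketched with Array#pack as below. The layout follows struct segment_command in loader.h. The segment_command helper is hypothetical, and setting vmsize equal to filesize and both protections to 7 (read, write and execute) are assumptions that suit a simple object file rather than anything the format requires.

```ruby
LC_SEGMENT           = 0x1 # load command type for a 32-bit segment
SEGMENT_COMMAND_SIZE = 56  # bytes in a segment command, excluding sections
SECTION_SIZE         = 68  # bytes in each section header that follows it

# Hypothetical helper: packs one segment command.
# cmdsize covers the command itself plus all of its section headers.
def segment_command(name:, fileoff:, filesize:, nsects:,
                    vmaddr: 0, maxprot: 7, initprot: 7, flags: 0)
  cmdsize = SEGMENT_COMMAND_SIZE + nsects * SECTION_SIZE
  [LC_SEGMENT, cmdsize,
   name,                # segname, null-padded to 16 bytes by "a16"
   vmaddr, filesize,    # vmaddr, vmsize (assumed equal to filesize)
   fileoff, filesize,   # which bytes of the file belong to the segment
   maxprot, initprot,
   nsects, flags].pack("V2a16V8")
end

# An unnamed segment holding one section, with 6 bytes of data at offset 152.
segment = segment_command(name: "", fileoff: 152, filesize: 6, nsects: 1)
puts segment.bytesize
```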

Sections

All sections within a segment are described one after the other directly after each segment command. Sections define their name, address in memory, size, offset of section data within the file, and segment name. The segment name might seem redundant but in the next post we'll see why this is useful information to have in the section header.

Sections can optionally specify a map to addresses within their binary blob, called a relocation table. This is used by the linker. Since we're letting the linker work out where to place everything in memory the addresses inside our machine code will need to be updated.

By convention sections are named with lowercase letters preceded by two underscores, e.g. \_\_bss or \_\_text.

Finally, the Ruby code describing section structs:
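A minimal sketch of a section header, again using Array#pack rather than CStruct. The field order follows struct section in loader.h; the section helper and its defaults are illustrative assumptions only.

```ruby
# Hypothetical helper: packs one 32-bit section header (68 bytes).
# reloff and nreloc locate the relocation table; zero means there is none.
def section(sectname:, segname:, size:, offset:,
            addr: 0, align: 0, reloff: 0, nreloc: 0, flags: 0)
  [sectname, segname,   # two 16-byte, null-padded names
   addr, size, offset,  # address in memory, size, offset of data in file
   align, reloff, nreloc, flags,
   0, 0].pack("a16a16V9") # the last two fields are reserved
end

text = section(sectname: "__text", segname: "__TEXT", size: 6, offset: 152)

puts text.bytesize
puts text.unpack("Z16Z16").inspect # both names with null padding stripped
```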

macho.rb

As much of the Mach-O format as we need is defined in asm/macho.rb. The Mach header, Segment commands, sections, relocation tables, and symbol table structs are all there, with a few constants as well.

I'll cover symbol tables and relocation tables in my next post.

Looking at real Mach-O files

To see the segments and sections of an object file, run otool -l /usr/lib/crt1.o. -l is for load commands. If you want to see why we stick to generating object files instead of executables run otool -l /bin/zsh. They are complicated beasts.

If you want to see the actual data for a section, otool provides a couple of ways to do this. The first is to use otool -s <segment> <section> <file> for an arbitrary section. To see the contents of a well-known section, such as \_\_text in the \_\_TEXT segment, use otool -t /usr/bin/true. You can also disassemble the \_\_text section with otool -tv /usr/bin/true.

You'll get to know otool quite well if you work with Mach-O.

Take a break

That was probably a lot to digest, and to make real sense of it you might need to read some of the official documentation.

We're close to being able to describe a minimal Mach object file that can be linked, and the resulting binary executed. By the end of the next post we'll be there.

(You can almost do that with what we know now. If you create a Mach file with a Mach header (ncmds=1), a single unnamed segment (nsects=1), and then a section named \_\_text with a segment name of \_\_TEXT, and some x86 machine code as the section data, you would almost have a useful Mach object file.)
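To make that parenthetical concrete, here is a sketch that lays out such a file in memory. All variable names are made up for illustration, the section flags are left at zero for simplicity, and the machine code is a six-byte stand-in (mov eax, 42 followed by ret). As noted above, the result is not yet linkable because it has no symbol table.

```ruby
# x86 machine code: mov eax, 42 ; ret
code = [0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3].pack("C*")

HEADER_SIZE  = 28 # struct mach_header
SEGMENT_SIZE = 56 # struct segment_command
SECTION_SIZE = 68 # struct section

sizeofcmds  = SEGMENT_SIZE + SECTION_SIZE # one segment with one section
data_offset = HEADER_SIZE + sizeofcmds    # the blob starts right after the header

# Mach header: magic, cputype (x86), cpusubtype, filetype (object),
# ncmds=1, sizeofcmds, flags
header = [0xfeedface, 7, 3, 0x1, 1, sizeofcmds, 0].pack("V7")

# Unnamed segment (LC_SEGMENT = 1) with nsects=1
segment = [0x1, sizeofcmds, "", 0, code.bytesize,
           data_offset, code.bytesize, 7, 7, 1, 0].pack("V2a16V8")

# Section __text in segment __TEXT, pointing at the blob
section = ["__text", "__TEXT", 0, code.bytesize, data_offset,
           0, 0, 0, 0, 0, 0].pack("a16a16V9")

object_file = header + segment + section + code

puts data_offset
puts object_file.bytesize
```

Writing object_file to disk with File.binwrite and running otool -h and otool -l on it is a quick way to check the layout on a Mac.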

Until next time, happy hacking!

Working with C-style structs in Ruby

This is the beginning of a series on generating Mach-O object files in Ruby. We start small by introducing some Ruby tools that are useful when working with binary data. Subsequent articles will cover a subset of the Mach-O file format, then generating Mach object files suitable for linking with ld or gcc to produce working executables. A basic knowledge of Ruby and C is assumed. You can likely wing it on the Ruby side of things if you know any similar languages.

First we need to read and write structured binary files with Ruby. Array#pack and String#unpack get the job done at a low level, but every time I use them I have to look up the documentation. It would also be nice to encapsulate serializing and deserializing into classes describing the various binary data structures. The built-in Struct class sounds promising but did not meet my needs, nor was it easily extended to meet them.
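As a quick refresher on those two methods, here is a small example. The directives are standard Ruby; the record layout is invented purely for illustration.

```ruby
# Pack one 32-bit integer in both byte orders and look at the raw bytes.
little = [0xfeedface].pack("V") # "V": 32-bit unsigned, little-endian
big    = [0xfeedface].pack("N") # "N": 32-bit unsigned, big-endian

p little.bytes # => [206, 250, 237, 254]
p big.bytes    # => [254, 237, 250, 206]

# A made-up record: a 16-bit count followed by an 8-byte, null-padded name.
record = [3, "abc"].pack("va8")
count, name = record.unpack("vZ8") # "Z8" strips the null padding

p record.bytesize # => 10
p [count, name]   # => [3, "abc"]
```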

Meet CStruct, a class that you can use to describe a binary structure, somewhat similar to how you would do it in C. Subclassing CStruct results in a class whose instances can be serialized, and unserialized, with little effort. You can subclass descendants of CStruct to extend them with additional members. CStruct does not implement much more than is necessary for the compiler. For example there is no support for floating point. If you want to use this for more general purpose tasks be warned that it may require some work. Anything supported by Array#pack is fairly easy to add though.

First a quick example and then we'll get into the CStruct class itself. In C you may write the following to have one struct "inherit" from another:

With CStruct in Ruby that translates to:

CStructs act like Ruby's built-in Struct to a certain extent. They are instantiated the same way, by passing values to #new in the same order they are defined in the class. You can find out the size (in bytes) of a CStruct instance using the #bytesize method, or of any member using #sizeof(name).
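To illustrate the idea without the real class at hand, here is a toy sketch. TinyStruct, its uint8/uint16/uint32 class methods, and the Point classes are all hypothetical and are not CStruct's actual API; they only mimic the behaviour described above (positional #new, #bytesize, #sizeof, and subclasses that add members).

```ruby
class TinyStruct
  SIZES = { uint8: 1, uint16: 2, uint32: 4 }.freeze

  class << self
    # Ordered list of [name, type] pairs for this class.
    def members
      @members ||= []
    end

    # A subclass starts with a copy of its parent's members.
    def inherited(subclass)
      super
      subclass.members.concat(members)
    end

    # Define one class-level declaration method per supported type.
    SIZES.each_key do |type|
      define_method(type) do |name|
        members << [name, type]
        attr_accessor name
      end
    end
  end

  # Values are assigned in the order the members were declared.
  def initialize(*values)
    self.class.members.zip(values) do |(name, _type), value|
      send("#{name}=", value)
    end
  end

  def sizeof(name)
    _name, type = self.class.members.assoc(name)
    SIZES.fetch(type)
  end

  def bytesize
    self.class.members.sum { |_name, type| SIZES.fetch(type) }
  end
end

# Roughly: struct point { uint32_t x, y; } and a 3D variant that extends it.
class Point < TinyStruct
  uint32 :x
  uint32 :y
end

class Point3D < Point
  uint32 :z
end

p3 = Point3D.new(1, 2, 3)
puts p3.bytesize   # size of all three members
puts p3.sizeof(:z) # size of a single member
```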

The most important method (for us) is #serialize, which returns a binary string representing the contents of the CStruct.

(I know that CStruct.newfrombin should be called CStruct.unserialize; you can see where my focus was when I wrote it.)

CStruct#serialize automatically creates a "pack pattern", which is an array of strings used to pack each member in turn. The pack pattern is mapped to the result of calling Array#pack on each corresponding member, and then the resulting strings are joined together. Serializing strings complicates matters so we cannot build up a pack pattern string and then serialize it in one go, but conceptually it's quite similar.

Unserializing is the same process in reverse, and was mainly added for completeness and testing purposes.
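The member-by-member approach can be sketched in a few lines. This is only an illustration of the idea, not CStruct's implementation: the serialize and unserialize helpers and the sample members are assumptions, and unlike the real class this toy version can get away with a single joined pattern when unpacking.

```ruby
# Each member is a [value, pack directive] pair.
members = [
  [0xfeedface, "V"],   # 32-bit unsigned, little-endian
  [7,          "V"],
  ["__text",   "Z16"]  # string, null-padded to 16 bytes
]

# Pack every member on its own, then join the pieces.
def serialize(members)
  members.map { |value, directive| [value].pack(directive) }.join
end

# The reverse: read the values back using the same directives.
def unserialize(data, directives)
  data.unpack(directives.join)
end

data   = serialize(members)
values = unserialize(data, members.map { |_value, directive| directive })

puts data.bytesize # 4 + 4 + 16
p values
```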

That's about all you need to know to use CStruct. The code needs some work but I decided to just go with what I have already so I can get on with the more interesting and fun tasks.

Next in this series: Basics of the Mach-O file format