Object Files

Fundamental Features and Structure of Object Files

Natively compiled languages such as C and C++ generate object files and depend on a linker to produce an executable. While object files may be relatively portable, executables are native to the target operating system. Programmers who use compiled languages should have some understanding of object files, even if a somewhat simplified one.

Overview

Compilers are typically designed to produce object files rather than operating-system-specific executables, which keeps them simpler. This is the case for most C and C++ compilers.

This makes compilers easier to port to other operating systems, as long as those systems have linkers that understand the object file format the compilers produce.

It is the role of the linker to take one or more of the object files generated by the compilers and combine them into a single executable file (or another object file).

For this to work, the object file must contain not only the compiled machine code and data, but also information about that content, stored as tables in the header of the file. The layout of these various parts determines the object file format.

Formats

Typical open formats are COFF (Common Object File Format) and ELF (Executable and Linkable Format). DWARF (Debugging With Attributed Record Formats) was initially designed to facilitate the inclusion of debugging information in ELF files.

Microsoft's implementation of COFF is called the PE (Portable Executable) format, referred to as PE/COFF in some literature.

Regardless of the file format used, and the terminology for the various parts of the file, the basic idea (model) remains the same: the object file must contain everything necessary for the linker to create an executable.

Libraries

Although it is not often realised, libraries (often explicitly called object libraries) are simple containers for object files. Linkers can therefore work with these libraries as well as with individual object files. Libraries are created with simple tools, such as GNU's ar and Microsoft's LIB. These tools effectively do only two things: create libraries, and insert, delete, replace and extract object files in them.

Compiler drivers automatically pass certain library names to the linker. These names will be different depending on the operating system and compiler vendor, but conceptually, these will be the names of the relevant C and C++ standard libraries.

Tools

The GNU linker is called ld, LLVM's linker (used with Clang) is lld, and Microsoft's linker is called link. C/C++ programmers seldom call, or have to call, the linker directly. Compiler drivers such as gcc, g++ and cl call the linker automatically unless instructed otherwise (which is often useful); passing only object files to a driver amounts to calling the linker alone. The switch that stops a driver from calling the linker is -c for gcc and g++, and /c for cl.

GNU's object file inspection tools include nm and objdump. Microsoft provides dumpbin for the same purpose, and third-party tools are also available.

Basic Structure

The formal description of the various formats can be daunting, especially since it uses terminology that is not common in the source languages. As a basic model, however, the principle is quite simple: an object file must contain the binary code and data generated by the compiler, as well as the information relating to these binary items.

Object File Structure

Information

The header of the object file is logically divided into two parts. The first provides an index of the variables and functions (from a C/C++ perspective) defined in the source file and thus compiled into the object file. The second contains a table of items that are referenced in the code but not necessarily defined in it.

For a link operation to successfully produce an executable (which could also mean a shared object or dynamically linked library), every external reference must match up with its corresponding public symbol, which may be in the same file or, more commonly, in another object file. Also, there cannot be duplicate public symbols in the set of object files provided.

Public Symbol Table

From a C/C++ language perspective, the public symbol table contains the names, addresses and other relevant information for every function and variable with external linkage. This is C/C++ terminology, not object file terminology: object files simply call these items public symbols, meaning “items that can be linked to”.
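As a minimal sketch of what ends up in the public symbol table, consider the source file below. The file name and all identifiers are hypothetical, chosen only for illustration. Both definitions have external linkage, so both would be expected to appear as public symbols in the resulting object file.

```cpp
// shapes.cpp: a hypothetical source file; every name here is illustrative.

// A variable with external linkage: other object files can link to it.
int shape_count = 0;

// A function with external linkage: also offered as a public symbol.
int register_shape() {
    return ++shape_count;
}
```

Compiling this with g++ -c and inspecting the result with nm should typically list both names as defined symbols; with a C++ compiler the function name usually appears in a decorated (mangled) form.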

It is possible in C/C++ to define functions and variables with a global lifetime but without external linkage, a condition usually referred to as internal linkage (not a very rigorous definition, but sufficient for most purposes).
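The following sketch contrasts internal and external linkage in one source file. All names are hypothetical. Only count_call would be expected to be offered to other object files; the remaining items still occupy space in the object file, but cannot be linked to from outside it.

```cpp
// counter.cpp: a hypothetical source file; every name here is illustrative.

// Global lifetime, internal linkage: space is reserved for it,
// but the name is not offered to other object files.
static int call_count = 0;

namespace {
// An unnamed namespace is the C++ alternative to 'static' here;
// it also gives internal linkage.
int step = 1;
}

// Internal linkage function: callable only from within this source file.
static int bump() {
    call_count += step;
    return call_count;
}

// External linkage: the only name here that another object file can link to.
int count_call() {
    return bump();
}
```

With GNU's nm, symbols that are local to the object file are usually shown with a lowercase type letter, which makes the difference easy to spot.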

External References Table

When a C/C++ programmer declares functions or variables, the compiler assumes that they exist, and creates code as if they have been defined; it cannot, and does not, verify this, since it can only compile one file at a time.

This means that, at certain points in the machine code it generates, the compiler requires the addresses of these items. Since it cannot determine the addresses, it must depend on the linker to fix the “holes” left in the machine code. To inform the linker of this need, the compiler places the names of these items, and the locations where the addresses must be fixed, into the external references table of the object file.
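A small sketch may make the "holes" more concrete. The names below are hypothetical. The function fetch_with_retries is compiled against declarations only, so the compiler has to leave the addresses of retry_limit and fetch_once for the linker to fill in. The stand-in definitions at the bottom are an assumption made purely so that the sketch links on its own; in a real project they would normally live in a different source file.

```cpp
// Declarations only: the compiler takes it on trust that these exist.
extern int retry_limit;        // variable defined elsewhere
int fetch_once(int attempt);   // function defined elsewhere

// While translating this function the compiler does not know the final
// addresses of retry_limit or fetch_once, so it records both names, and
// the places where they are used, for the linker to fix.
int fetch_with_retries() {
    for (int attempt = 1; attempt <= retry_limit; ++attempt) {
        if (fetch_once(attempt) == 0) {
            return attempt;    // succeeded on this attempt
        }
    }
    return -1;                 // every attempt failed
}

// Stand-in definitions, kept here only so the sketch is self-contained.
int retry_limit = 3;

int fetch_once(int attempt) {
    return attempt < 2 ? 1 : 0;   // pretend the first attempt fails
}
```

If the two stand-in definitions were removed and no other object file supplied them, the compiler would still accept the file; the failure would only surface at link time.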

Binary Parts

Unlike the header part of the object file, the binary part closely resembles the image that will eventually be loaded into memory and executed. It is roughly divided into two parts: the machine code, and data with a global lifetime.

Machine Code

From a C/C++ perspective, the machine code is the translation of C/C++ function bodies, in other words the manifestation of your functions. This is independent of their linkage, and of whether they are normal or member functions (methods). Although this is binary machine code, some addresses will be incorrect or missing, and will have to be fixed by the linker.

Static Data

The term static, in this context, is not directly related to the C/C++ static keyword. It describes data for which space must be reserved in the executable. From a C/C++ programmer's perspective, this means variables with a global lifetime, independent of the variables' linkage and scope.

In C/C++, this means the space for variables defined at file (namespace) scope, static local variables, and static data members of classes.

Since these variables have space in the executable file image, they are effectively initialised the moment the program is loaded into memory, before main() is executed.
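The sketch below gathers the kinds of C++ variables that have a global lifetime, whatever their linkage or scope. All names are hypothetical. Each uses a constant initialiser, which is the case where the value can simply be part of the loaded image.

```cpp
// File-scope variable with external linkage.
int total_requests = 0;

// File-scope variable with internal linkage: same lifetime, different linkage.
static int error_budget = 5;

int remaining_budget() {
    return error_budget;
}

// Static local variable: function scope, yet global lifetime.
// Its value survives from one call to the next.
int next_ticket() {
    static int ticket = 100;
    return ++ticket;
}

struct Logger {
    static int lines_written;   // declaration only: no space reserved yet
};

// Definition of the static data member: this is where space is reserved.
int Logger::lines_written = 0;
```

Note that ticket is visible only inside next_ticket, and error_budget only inside this source file, yet space for both is reserved in just the same way as for total_requests.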

Summary

In theory, a Pascal compiler, for example, can produce an object file that can be linked with object files produced by a Fortran compiler, a C compiler and a C++ compiler. From a format perspective, this is entirely possible. In practice, because of different calling conventions and register usage, it is seldom a reality.

C/C++ programmers must, at minimum, understand the above principles for certain error messages to be meaningful. Messages like “unresolved external” or “duplicate public symbol” originate from the linker, not the compiler. Such knowledge further enhances a programmer's understanding of the language's mechanisms, structures and syntax choices.

Note that compilers can place some function names in a special section of the public symbol table, which tells the linker to keep only one of the multiple machine code definitions of the function. This is used for inline functions and C++ function templates, so that code is not duplicated.
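As a rough illustration of that last point, the sketch below shows the kind of code that usually lives in a header and is therefore compiled into several object files. The header name and all identifiers are hypothetical.

```cpp
// geometry.hpp: a hypothetical header included by several source files.

// An inline function may be compiled into every object file whose source
// includes this header; the linker is expected to keep only one copy.
inline int square(int x) {
    return x * x;
}

// Each instantiation of a function template (for example largest<int>)
// is treated the same way: possibly emitted many times, kept once.
template <typename T>
T largest(T a, T b) {
    return (a < b) ? b : a;
}

// An ordinary function with external linkage that uses both. Unlike the
// two above, it must be defined in exactly one object file.
int largest_square(int a, int b) {
    return largest(square(a), square(b));
}
```

On GNU toolchains, nm usually reports such inline and template symbols as weak, in contrast to ordinary public symbols like largest_square.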


2020-06-30: Object file layout diagram. [brx]
2017-11-17: Updated. [brx]
2017-09-24: Edited. [jjc]
2016-10-17: Created. [brx]