Fundamental Features and Structure of Object Files
Compilers are typically designed to produce object files rather than operating-system-specific executables, which keeps them simpler. This is the case for most C and C++ compilers.
It also makes compilers easier to port to other operating systems, as long as those operating systems have linkers that understand the object file format the compilers produce.
It is the role of the linker to take one or more of the object files generated by the compilers, and combine them into a single executable file (or another object file).
For this to work, the object file must contain not only the compiled machine code and data, but also information describing that content, stored as tables in the header of the file. The layout of these various parts determines the object file format.
Typical open formats are COFF (Common Object File Format) and ELF (Executable and Linkable Format). DWARF (sometimes expanded as Debugging With Attributed Record Formats) is a debugging information format that was initially designed to accompany ELF files.
Microsoft’s implementation of COFF is called the PE (Portable Executable) format, referred to as PE/COFF in some literature.
Regardless of the file format used, and the terminology for the various parts of the file, the basic idea (model) remains the same: the object file must contain everything necessary for the linker to create an executable.
Although it is not often realised, libraries (often explicitly called object libraries) are simple containers for object files. Linkers can therefore work with these libraries as well as with individual object files. Libraries are created with simple tools: GNU’s ar and Microsoft’s LIB. These tools can effectively only create libraries, and insert, delete, replace and extract object files in these library files.
Compiler drivers automatically pass certain library names to the linker. These names will be different depending on the operating system and compiler vendor, but conceptually, these will be the names of the relevant C and C++ standard libraries.
The GNU linker is called ld, LLVM’s linker (used with Clang) is lld, and Microsoft’s linker is called link. C/C++ programmers seldom call, or have to call, the linker directly. The linker can be called implicitly, for example by passing only object files to a compiler driver such as cl. Linkers are automatically called by these compiler drivers unless otherwise instructed with the driver’s compile-only switch, which is often useful.
The formal description of the various formats can be daunting, especially since it uses terminology that is not common in the source languages. As a basic model, however, the principle is quite simple: an object file must contain the binary code and data generated by the compiler, as well as the information relating to these binary items.
The header of the object file is logically divided into two parts. The first provides an index of the variables and functions (from a C/C++ perspective) defined in the source file and thus compiled into the object file. The second contains a table of items referenced in the code, but not necessarily defined there.
For a link operation to successfully produce an executable (which could also mean shared object or dynamically linked library), every external reference must match up with its corresponding public symbol, which may be in the same file, or more commonly, in another object file. Also, there cannot be duplicate public symbols in the set of object files provided.
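The two conditions above can be sketched as a small model. This is only an illustration, not how any real linker is implemented: the ObjectFile and LinkResult types and the check_link function are hypothetical names, and only symbol names are tracked, whereas real formats also record addresses, sizes and types.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of an object file header.
struct ObjectFile {
    std::string name;
    std::set<std::string> public_symbols;       // items defined in this file
    std::set<std::string> external_references;  // items used by this file
};

struct LinkResult {
    std::vector<std::string> unresolved;  // referenced but never defined
    std::vector<std::string> duplicates;  // defined more than once
    bool ok() const { return unresolved.empty() && duplicates.empty(); }
};

// Checks that every external reference matches a public symbol and that
// no public symbol is defined more than once across the object files.
LinkResult check_link(const std::vector<ObjectFile>& objects) {
    std::map<std::string, int> definition_count;
    for (const ObjectFile& obj : objects)
        for (const std::string& sym : obj.public_symbols)
            ++definition_count[sym];

    LinkResult result;
    for (const auto& entry : definition_count)
        if (entry.second > 1)
            result.duplicates.push_back(entry.first);

    std::set<std::string> reported;  // report each missing name only once
    for (const ObjectFile& obj : objects)
        for (const std::string& ref : obj.external_references)
            if (definition_count.count(ref) == 0 && reported.insert(ref).second)
                result.unresolved.push_back(ref);
    return result;
}
```

In this model, a reference satisfied by a symbol in the same object file resolves just like one satisfied by another file, matching the description above.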
Public Symbol Table
From a C/C++ language perspective, the public symbol table contains the names, addresses and other relevant information for each and every function and variable with external linkage. This is C/C++ terminology, not object file terminology: object files simply call these items public symbols, meaning “items that can be linked to”.
It is possible in C/C++ to define functions and variables with a global lifetime, but without external linkage, a condition we often refer to as internal linkage (not a very rigorous definition, but sufficient for most purposes).
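As a minimal sketch of the difference, the following hypothetical source file mixes definitions with external and internal linkage. All names are invented for illustration. On a typical toolchain, only request_count and handle_request would be expected to appear as public symbols that other object files can link to.

```cpp
// Definitions at the external level of one hypothetical source file.

int request_count = 0;      // external linkage: a public symbol
static int cache_hits = 0;  // internal linkage: global lifetime, not public

namespace {
// Members of an unnamed namespace also have internal linkage (C++11).
int scale(int value) { return value * 2; }
}

// Internal linkage: usable only within this source file.
static void note_hit() { ++cache_hits; }

// External linkage: other object files can refer to this function.
int handle_request(int value) {
    ++request_count;
    note_hit();
    return scale(value) + cache_hits;
}
```

All four items exist for the whole run of the program; linkage only decides whether the linker can match references from other object files against them.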
External References Table
When a C/C++ programmer declares functions or variables, the compiler assumes that they exist, and creates code as if they have been defined; it cannot, and does not, verify this, since it can only compile one file at a time.
It means that, at certain points in the machine code it generates, the compiler requires the addresses of these items. Since it cannot determine the addresses, it must depend on the linker to fix the “holes” left in the machine code. To inform the linker of this need, the compiler places the names of these items, and the locations where the addresses must be fixed, into the external references table of the object file.
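A small sketch of how such a reference arises, assuming a typical toolchain: the header below only declares std::atoi, so the compiler sees no definition while compiling this file. The parse_port function is a hypothetical name used for illustration.

```cpp
#include <cstdlib>  // declares std::atoi; the definition lives in the C library

// Hypothetical helper that converts a port number given as text.
// The call below is compiled against the declaration alone. The address
// of atoi is unknown at this point, so it is left for the linker to fix.
int parse_port(const char* text) {
    return std::atoi(text);
}
```

The reference is normally resolved without any extra effort, because the compiler driver passes the C standard library to the linker automatically.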
Unlike the header part of the object file, the binary part closely resembles the image that will eventually be loaded in memory and be executed. This is roughly divided into two parts: the machine code, and data with a global lifetime (static data).
From a C/C++ perspective, the machine code is the translation of C/C++ function bodies to machine code — in other words, the manifestation of your functions. This is independent of their linkage, or whether they are normal or member functions (methods). Although this is binary machine code, some addresses will be incorrect or missing, and will have to be fixed.
The term static, as in static data, is not directly related to the C/C++ static keyword. It describes data for which space must be reserved in the executable. From a C/C++ programmer’s perspective, this means variables with a global lifetime, independent of the variables’ linkage and scope.
In C/C++, this will mean the space for variables:
- defined on the external level (outside of a function body or class);
- defined with a static storage class specifier, regardless of location.
Since these variables have space in the executable file image, they are effectively initialised the moment the program is loaded in memory, and before main() is executed.
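The two cases in the list above might look as follows in a hypothetical source file. The names are invented for illustration. Each variable has a global lifetime, so space is reserved for it in the program image, whatever its linkage or scope.

```cpp
// Defined on the external level: global lifetime, external linkage.
// No initialiser is given, so the variable is zero-initialised.
int boot_count;

// Defined on the external level with static: global lifetime,
// but internal linkage.
static int retry_limit = 3;

int next_ticket() {
    // Defined with static inside a function body: the scope is local,
    // yet the variable keeps its value between calls.
    static int ticket = 100;
    return ++ticket;
}

int retries_left(int used) {
    return retry_limit - used;
}
```

Calling next_ticket() twice yields two different values, which shows that ticket is not an ordinary local variable recreated on every call.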
In theory, a Pascal compiler, for example, can produce an object file, which in turn can be linked with object files produced by a Fortran compiler, a C compiler and a C++ compiler. From a format perspective, this is entirely technically possible. Practically, because of different calling conventions, and register usage, this is seldom a reality.
C/C++ programmers must, at minimum, understand the above principles for certain error messages to be meaningful. Messages like “unresolved external”, or “duplicate public symbol” originate from the linker, and not the compiler. Such knowledge further enhances a programmer’s understanding of the mechanisms, structures and syntax choices.
Note that compilers can put some function names in a special section of the public symbol table that tells the linker to use only one of the multiple machine code definitions of the function. This is done for inline functions and C++ function templates, so that code is not duplicated.
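As an illustration, definitions like the following usually live in a header, so every source file that includes it may compile its own copy of the machine code. The names are hypothetical, and the exact mechanism the linker uses to keep a single copy depends on the object file format.

```cpp
// Typically placed in a header shared by several source files.

// An inline function: each source file that uses it may emit its own copy.
inline int twice(int value) { return value + value; }

// A function template: each source file that instantiates square<int>
// may emit its own copy of that instantiation.
template <typename T>
T square(T value) { return value * value; }

// An ordinary function that uses both. If this were placed in a header
// too, linking several object files would report a duplicate symbol.
int use_shared_code(int value) {
    return twice(value) + square(value);
}
```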
2017-11-17: Updated. [brx]
2017-09-24: Edited. [jjc]
2016-10-17: Created. [brx]