Linking
Linking is the process of collecting and combining various pieces of code and data into a single file that can be loaded into memory and executed. Linking is performed by programs called linkers.
Compiler Drivers
The compiler driver invokes the language preprocessor, compiler, assembler, and linker to translate ASCII source files into executable object files.
Preprocessor translates the source file main.c into an ASCII intermediate file main.i.
Compiler translates main.i into an ASCII assembly-language file main.s.
Assembler translates main.s into a binary relocatable object file main.o.
Linker combines different object files to create the binary executable file prog.
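As a rough illustration of the four phases, the sketch below builds one gcc command line per phase using the standard -E, -S, -c, and -o options. The helper name driver_steps is hypothetical, and the real driver calls internal tools rather than re-invoking gcc, so treat this as an approximation of what the driver does on the user's behalf.

```python
from pathlib import Path

def driver_steps(source, output="prog"):
    """Return one gcc command line per phase (hypothetical helper)."""
    stem = Path(source).stem
    return [
        ["gcc", "-E", source, "-o", f"{stem}.i"],       # preprocess only
        ["gcc", "-S", f"{stem}.i", "-o", f"{stem}.s"],  # compile to assembly
        ["gcc", "-c", f"{stem}.s", "-o", f"{stem}.o"],  # assemble to a relocatable object
        ["gcc", f"{stem}.o", "-o", output],             # link into an executable
    ]

steps = driver_steps("main.c")
for argv in steps:
    print(" ".join(argv))
```

Each list could be passed to subprocess.run(argv, check=True) on a machine with gcc installed; the sketch only prints the commands so that it runs anywhere.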
Static Linking
The static linker (e.g. the ld program) takes a collection of relocatable object files and command-line arguments as input and generates a fully linked executable object file that can be loaded and run. To build the executable, the linker performs two main tasks:
Symbol resolution: Object files define and reference symbols, where each symbol corresponds to a function, a global variable, or a static variable. The linker associates each symbol reference with exactly one symbol definition.
Relocation: The linker relocates the code and data sections by associating a memory location with each symbol definition, and then modifying all references to those symbols so that they point to this memory location.
Object Files
A relocatable object file contains binary code and data in a form that can be combined with other relocatable object files at compile time to create an executable object file. (Generated by the compiler and assembler)
A shared object file is a special type of relocatable object file that can be loaded into memory and linked dynamically at load time or run time. (Generated by the compiler and assembler)
An executable object file contains binary code and data in a form that can be copied directly into memory and executed. (Generated by the linker)
Relocatable Object Files
Modern x86-64 Linux systems use Executable and Linkable Format (ELF) for object files.
ELF header: Begins with a 16-byte sequence that describes the word size and byte ordering of the system that generated the file. The rest of the header contains the size of the ELF header, the object file type (relocatable, executable, or shared), the machine type (x86-64, etc.), the file offset of the section header table, and the size and number of entries in the section header table.
.text: The machine code of the compiled program.
.rodata: Read-only data such as jump tables.
.data: Initialized global and static variables. Local variables are maintained on the stack at run time and appear in neither .data nor .bss.
.bss: Uninitialized global and static variables, along with any global or static variables that are initialized to zero. This section is a placeholder that occupies no actual space in the object file; the variables are allocated in memory at run time.
.symtab: The symbol table with information about functions and global variables that are defined and referenced in the program.
.rel.text: A list of locations in the .text section that will need to be modified when the linker combines this object file with others. In general, any instruction that calls an external function or references a global variable needs to be modified.
.rel.data: Relocation information for any global variables that are referenced or defined by the module. In general, any initialized global variable whose initial value is the address of a global variable or an externally defined function needs to be modified.
.debug: The debugging symbol table with entries for local variables, typedefs, global variables, and the original C source file.
.line: The mapping between line numbers in the original C source file and machine code instructions in the .text section.
.strtab: The string table (a sequence of null-terminated character strings) for the symbol tables in the .symtab and .debug sections and for the section names in the section headers.
Section header table: The locations and sizes of the various sections.
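To make the ELF header fields concrete, here is a minimal sketch that decodes the fixed 64-byte ELF64 header with Python's struct module. It assumes a 64-bit little-endian file, the function name parse_elf64_header is made up for illustration, and the sample header is synthetic (the section header offset 0x2a8 and the count of 14 sections are arbitrary values, not taken from a real object file).

```python
import struct

# ELF64 header: 16 identification bytes followed by fixed-width fields (64 bytes total).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
OBJECT_TYPES = {1: "relocatable", 2: "executable", 3: "shared"}
MACHINES = {62: "x86-64"}

def parse_elf64_header(data):
    """Decode the fields the notes mention (sketch: 64-bit little-endian only)."""
    (ident, e_type, e_machine, _version, e_entry, _phoff, e_shoff, _flags,
     e_ehsize, _phentsize, _phnum, e_shentsize, e_shnum,
     _shstrndx) = ELF64_HEADER.unpack_from(data)
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch handles only 64-bit little-endian files")
    return {
        "word_size": 64,                       # ident[4] == 2 means 64-bit
        "byte_order": "little",                # ident[5] == 1 means little-endian
        "type": OBJECT_TYPES.get(e_type, e_type),
        "machine": MACHINES.get(e_machine, e_machine),
        "entry": e_entry,
        "header_size": e_ehsize,
        "section_table_offset": e_shoff,
        "section_entry_size": e_shentsize,
        "section_count": e_shnum,
    }

# Synthetic header for a relocatable x86-64 object (illustrative values only).
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
sample = ELF64_HEADER.pack(ident, 1, 62, 1, 0, 0, 0x2A8, 0, 64, 0, 0, 64, 14, 13)
header = parse_elf64_header(sample)
print(header)
```

On a Linux machine the same function could be pointed at the first 64 bytes of a real object file, for example one produced from main.c.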
Symbols and Symbol Tables
Each relocatable object module has a symbol table that contains information about the symbols that are defined and referenced in the module.
Global symbols that are defined by the module and can be referenced by other modules (e.g. nonstatic functions and global variables).
Global symbols that are referenced by the module but defined by some other module (externals).
Local symbols that are defined and referenced exclusively by the module (e.g. static functions and global variables declared with the static attribute).
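Each symbol is assigned to a section of the object file. Three pseudosections have no entries in the section header table:
ABS: symbols that shouldn't be relocated
UNDEF: undefined symbols, that is, symbols referenced in this module but defined elsewhere
COMMON: uninitialized global variables that are not yet allocated. (Traditionally GCC assigns these symbols to COMMON instead of .bss; since GCC 10 the default is -fno-common, which places them in .bss.)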
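The symbol kinds and pseudosections above show up directly in the fields of a symbol table entry. The following sketch decodes a 24-byte ELF64 symbol entry; the helper name decode_symbol and the sample entries (name offsets, sizes) are invented for illustration, while the layout and the reserved section indices (0 for UNDEF, 0xFFF1 for ABS, 0xFFF2 for COMMON) follow the ELF format.

```python
import struct

# ELF64 symbol entry: name offset, info, other, section index, value, size (24 bytes).
ELF64_SYM = struct.Struct("<IBBHQQ")
BINDINGS = {0: "LOCAL", 1: "GLOBAL", 2: "WEAK"}
TYPES = {0: "NOTYPE", 1: "OBJECT", 2: "FUNC", 3: "SECTION", 4: "FILE"}
PSEUDOSECTIONS = {0: "UNDEF", 0xFFF1: "ABS", 0xFFF2: "COMMON"}

def decode_symbol(entry):
    """Decode one symbol table entry into readable fields (hypothetical helper)."""
    name_offset, info, _other, shndx, value, size = ELF64_SYM.unpack(entry)
    return {
        "name_offset": name_offset,                     # index into .strtab
        "binding": BINDINGS.get(info >> 4, "OTHER"),    # high 4 bits of info
        "type": TYPES.get(info & 0xF, "OTHER"),         # low 4 bits of info
        "section": PSEUDOSECTIONS.get(shndx, shndx),    # real index or pseudosection
        "value": value,
        "size": size,
    }

# A global function defined in section 1, and an external reference (illustrative).
defined = decode_symbol(ELF64_SYM.pack(1, (1 << 4) | 2, 0, 1, 0, 24))
external = decode_symbol(ELF64_SYM.pack(6, (1 << 4) | 0, 0, 0, 0, 0))
common = decode_symbol(ELF64_SYM.pack(10, (1 << 4) | 1, 0, 0xFFF2, 4, 4))
print(defined, external, common, sep="\n")
```

A global symbol defined in the module carries a real section index, an external reference lands in UNDEF, and a symbol the compiler left unallocated lands in COMMON.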
Symbol Resolution
The linker resolves symbol references by associating each reference with exactly one symbol definition from the symbol tables of its input relocatable object files.
Local symbol: The compiler allows only one definition of each local symbol per module.
Global symbol: When the compiler encounters a symbol that is not defined in the current module, it assumes that the symbol is defined in some other module and leaves it for the linker to handle. If the linker cannot find a definition in any of its input modules, it reports an error. Multiple object modules might also define global symbols with the same name; in that case the linker must either flag an error or choose one definition and discard the others.
Resolve Duplicate Symbol Names
The compiler exports each global symbol to the assembler as either strong or weak, and the assembler encodes this information implicitly in the symbol table.
Strong symbols: functions and initialized global variables
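Weak symbols: uninitialized global variables (this applies when the compiler emits them as COMMON symbols; with GCC's -fno-common, the default since GCC 10, they are treated as ordinary definitions)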
The linker uses the following rules for dealing with duplicate symbol names:
Multiple strong symbols with the same name are not allowed.
Given a strong symbol and multiple weak symbols with the same name, choose the strong symbol.
Given multiple weak symbols with the same name, choose any one of them. The choice is arbitrary, so programs should not depend on it.
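The three rules can be sketched as a small function. This is a toy model, not how ld is implemented: the function name resolve, the (module, strength) tuples, and the module names are all assumptions, and rule 3 is made deterministic by picking the first weak definition.

```python
def resolve(name, definitions):
    """Pick the module whose definition of `name` wins.

    definitions: list of (module, strength) pairs, strength is "strong" or "weak".
    """
    strong = [module for module, strength in definitions if strength == "strong"]
    weak = [module for module, strength in definitions if strength == "weak"]
    if len(strong) > 1:
        # Rule 1: multiple strong symbols with the same name are not allowed.
        raise ValueError(f"multiple definition of '{name}' in {', '.join(strong)}")
    if strong:
        # Rule 2: a strong symbol beats any number of weak ones.
        return strong[0]
    if weak:
        # Rule 3: any weak symbol may be chosen; this sketch takes the first.
        return weak[0]
    raise ValueError(f"undefined reference to '{name}'")

rule2 = resolve("x", [("bar.o", "weak"), ("foo.o", "strong")])
rule3 = resolve("x", [("foo.o", "weak"), ("bar.o", "weak")])
try:
    resolve("x", [("foo.o", "strong"), ("bar.o", "strong")])
    rule1_error = None
except ValueError as exc:
    rule1_error = str(exc)
print(rule2, rule3, rule1_error)
```

Rules 2 and 3 succeed silently, which is why a mismatch between two modules that both declare the same global can go unnoticed until run time.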
Static Libraries
In practice, all compilation systems provide a mechanism for packaging related object modules into a single file called a static library. For example, standard functions such as printf and scanf are defined in the static library libc.a. At link time, the linker copies only the object modules that are referenced by the program, which reduces the size of the executable on disk and in memory. The application programmer only needs to include the names of a few library files on the command line.
Static libraries on Linux are stored on disk in a particular file format known as an archive (*.a). An archive is a collection of concatenated relocatable object files, with a header that describes the size and location of each member object file.
libc.a: The C standard library is a 4.6 MB archive of 1496 object files. (I/O, memory allocation, date and time, etc.)
libm.a: The C math library is a 2 MB archive of 44 object files. (sin, cos, log, sqrt, etc.)
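The archive layout described above can be walked with a few lines of code. The sketch below assumes the common ar format: an 8-byte magic string followed by members, each with a 60-byte ASCII header whose size field occupies bytes 48 to 57. The helpers make_member and list_members are hypothetical, the archive is synthetic, and special members that GNU ar adds to real archives (the symbol index and long-name table) are not handled.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_SIZE = 60

def make_member(name, payload):
    """Build one archive member with a GNU-style 'name/' field (test helper)."""
    header = (f"{name + '/':<16}{0:<12}{0:<6}{0:<6}{'644':<8}{len(payload):<10}"
              .encode("ascii") + b"`\n")
    padding = b"\n" if len(payload) % 2 else b""   # members start on even offsets
    return header + payload + padding

def list_members(data):
    """Return (name, data_offset, size) for each member of an ar archive."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    pos = len(AR_MAGIC)
    while pos + HEADER_SIZE <= len(data):
        header = data[pos:pos + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("corrupt member header")
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        members.append((name, pos + HEADER_SIZE, size))
        pos += HEADER_SIZE + size + (size & 1)     # skip data plus alignment padding
    return members

# Synthetic archive with two placeholder "object files" (payloads are dummies).
archive = AR_MAGIC + make_member("addvec.o", b"abc") + make_member("multvec.o", b"defg")
members = list_members(archive)
print(members)
```

The per-member size field is what lets a linker jump from one member to the next and copy out only the members it needs.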
Use Static Libraries to Resolve References
During the symbol resolution phase, the linker scans the relocatable object files and archives left to right in the same sequential order that they appear on the compiler driver’s command line.
During this scan, the linker maintains a set E of relocatable object files that will be merged to form the executable, a set U of unresolved symbols, and a set D of symbols that have been defined in previous input files.
For each input file f on the command line, the linker determines whether f is an object file or an archive. If f is an object file, the linker adds f to E and updates U and D to reflect the symbol definitions and references in f.
If f is an archive, the linker matches the unresolved symbols in U against the symbols defined by the members of the archive. If some archive member m defines a symbol that resolves a reference in U, then m is added to E, and the linker updates U and D to reflect the symbol definitions and references in m. This process iterates over the member object files in the archive until U and D no longer change. Any member object files not added to E are discarded.
If U is nonempty when the linker finishes scanning the input files, it reports an error and terminates. Otherwise, the linker merges and relocates the object files in E to build the output executable file.
If the library that defines a symbol appears before the object file that references that symbol, the reference will not be resolved and linking will fail. Therefore, the general rule for libraries is to place them at the end of the command line. Libraries can also be repeated on the command line to satisfy the dependence requirements.
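The scan over E, U, and D, and the reason command-line order matters, can be modeled in a few lines. This is a simplified sketch: the dictionaries standing in for object files and archives, the function name scan, and the file names main.o and libvector.a are all illustrative, and duplicate-definition checks are left out.

```python
class LinkError(Exception):
    pass

def scan(inputs):
    """Scan inputs left to right and return E, the files to merge."""
    E, U, D = [], set(), set()

    def add_object(name, defines, references):
        E.append(name)
        D.update(defines)
        U.difference_update(defines)      # definitions resolve pending references
        U.update(set(references) - D)     # unseen references become unresolved

    for item in inputs:
        if item["kind"] == "object":
            add_object(item["name"], item["defines"], item["references"])
            continue
        changed = True
        while changed:                    # iterate until U and D stop changing
            changed = False
            for member in item["members"]:
                qualified = f"{item['name']}({member['name']})"
                if qualified not in E and set(member["defines"]) & U:
                    add_object(qualified, member["defines"], member["references"])
                    changed = True
    if U:
        raise LinkError("undefined reference to " + ", ".join(sorted(U)))
    return E

main_o = {"kind": "object", "name": "main.o",
          "defines": {"main"}, "references": {"addvec"}}
libvector = {"kind": "archive", "name": "libvector.a", "members": [
    {"name": "addvec.o", "defines": {"addvec"}, "references": set()},
    {"name": "multvec.o", "defines": {"multvec"}, "references": set()},
]}

linked = scan([main_o, libvector])        # library after the object: succeeds
try:
    scan([libvector, main_o])             # library first: U is empty when it is scanned
    wrong_order_error = None
except LinkError as exc:
    wrong_order_error = str(exc)
print(linked, wrong_order_error)
```

With the library first, no member is pulled in because nothing is unresolved yet, so the later reference from main.o is never satisfied. Note that multvec.o is never added to E in either order.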
Relocation
After symbol resolution, the linker has associated each symbol reference with exactly one symbol definition and knows the exact sizes of the code and data sections in its input object modules. The relocation step merges the input modules and assigns a run-time address to each symbol. It consists of two steps:
Relocating sections and symbol definitions: The linker merges all sections of the same type into a new aggregate section of the same type, and then assigns run-time memory addresses to the new aggregate sections, to each section defined by the input modules, and to each symbol defined by the input modules. Therefore, each instruction and global variable in the program has a unique run-time memory address.
Relocating symbol references within sections: The linker modifies every symbol reference in the bodies of the code and data sections so that they point to the correct run-time addresses.
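The first step can be pictured as simple address arithmetic. The sketch below is a toy model under several assumptions: the module names, section sizes, and base addresses are made up, the base of each aggregate section is taken as given, and alignment and segment layout are ignored.

```python
def assign_addresses(modules, section_bases):
    """Place each module's sections back to back inside the aggregate sections.

    modules: list of (module_name, {section_name: size_in_bytes})
    section_bases: {section_name: run-time address of the aggregate section}
    """
    placement = {}
    for section, base in section_bases.items():
        addr = base
        for name, sizes in modules:
            size = sizes.get(section, 0)
            if size:
                placement[(name, section)] = addr   # where this module's piece lands
                addr += size
    return placement

def symbol_address(placement, symbol):
    """A symbol's address is its section's placement plus its offset in that section."""
    module, section, offset = symbol
    return placement[(module, section)] + offset

# Illustrative inputs: two modules, hypothetical sizes and base addresses.
modules = [("main.o", {".text": 0x18, ".data": 0x8}), ("sum.o", {".text": 0x2D})]
section_bases = {".text": 0x4004D0, ".data": 0x601018}
placement = assign_addresses(modules, section_bases)

addr_main = symbol_address(placement, ("main.o", ".text", 0))
addr_sum = symbol_address(placement, ("sum.o", ".text", 0))
addr_array = symbol_address(placement, ("main.o", ".data", 0))
print(hex(addr_main), hex(addr_sum), hex(addr_array))
```

Once every symbol definition has an address like this, the second step can patch each reference to point at it.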
Relocation Entries
When the assembler encounters a reference to an object whose ultimate location is unknown, it generates a relocation entry that tells the linker how to modify the reference when it merges the object file into an executable.
Relocation entries for code are placed in .rel.text.
Relocation entries for data are placed in .rel.data.
ELF defines 32 relocation types. The two most basic relocation types are R_X86_64_PC32 and R_X86_64_32.
R_X86_64_PC32: Relocate a reference that uses a 32-bit PC-relative address. The PC-relative address is an offset from the current run-time value of the program counter.
R_X86_64_32: Relocate a reference that uses a 32-bit absolute address.
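A worked example may help with the two relocation types. The sketch uses the usual formulas (symbol address plus addend minus the address of the reference for PC32, symbol address plus addend for the absolute type); the function names, section contents, and addresses are illustrative assumptions rather than output from a real link.

```python
import struct

def relocate_pc32(section_addr, offset, symbol_addr, addend):
    """Value stored for an R_X86_64_PC32 reference."""
    refaddr = section_addr + offset                  # run-time address of the 4-byte field
    return (symbol_addr + addend - refaddr) & 0xFFFFFFFF

def relocate_abs32(symbol_addr, addend):
    """Value stored for an R_X86_64_32 reference."""
    value = symbol_addr + addend
    if not 0 <= value <= 0xFFFFFFFF:
        raise OverflowError("address does not fit in 32 bits")
    return value

def patch(section, offset, value):
    """Write the 32-bit little-endian value into the section bytes."""
    section[offset:offset + 4] = struct.pack("<I", value)

# Hypothetical .text with a call opcode (0xe8) at 0xe; its operand starts at 0xf.
text = bytearray(0x18)
text[0xE] = 0xE8
# The addend is -4 because the PC points past the 4-byte operand when the call executes.
pc_relative = relocate_pc32(0x4004D0, 0xF, 0x4004E8, -4)
patch(text, 0xF, pc_relative)

absolute = relocate_abs32(0x601018, 0)
print(hex(pc_relative), text[0xE:0x13].hex(), hex(absolute))
```

At run time the CPU adds the stored 0x5 to the address of the next instruction (0x4004e3) and lands on 0x4004e8, the assumed address of the callee.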
Executable Object Files
The format of an executable object file is similar to that of a relocatable object file.
The ELF header describes the overall format of the file and includes the entry point, or the address of the first instruction to execute.
The .init section defines a small function, called _init, that will be called by the program’s initialization code.
Dynamic Linking with Shared Libraries
Static libraries have some significant disadvantages.
The code for commonly used library functions (e.g. scanf, printf, etc.) is duplicated in the text segment of each running process, which wastes memory.
If the library is updated, the application has to be explicitly relinked against the new version.
The shared library is an object module that, at either run time or load time, can be loaded at an arbitrary memory address and linked with a program in memory. This dynamic linking process is performed by the dynamic linker. Shared libraries are referred to as shared objects on Linux and dynamic link libraries (DLLs) on Windows.
In any given file system, there is exactly one .so file for a particular library. The code and data in this .so file are shared by all of the executable object files that reference the library.
The single copy of the .text section of a shared library in memory can be shared by different running processes.