@c This summary of BFD is shared by the BFD and LD docs.
@c Copyright (C) 2012-2015 Free Software Foundation, Inc.

When an object file is opened, BFD subroutines automatically determine
the format of the input object file.  They then build a descriptor in
memory with pointers to routines that will be used to access elements of
the object file's data structures.

As different information from the object files is required,
BFD reads from different sections of the file and processes them.
For example, a very common operation for the linker is processing symbol
tables.  Each BFD back end provides a routine for converting
between the object file's representation of symbols and an internal
canonical format.  When the linker asks for the symbol table of an object
file, it calls through a pointer in memory to the routine from the
relevant BFD back end, which reads the table and converts it into canonical
form.  The linker then operates upon the canonical form.  When the link is
finished and the linker writes the output file's symbol table,
another BFD back end routine is called to take the newly
created symbol table and convert it into the chosen output format.
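
The dispatch described above can be pictured with a small, self-contained
C sketch.  It does not use the real BFD interfaces: the types
@code{canon_sym} and @code{backend}, the two toy formats, and the function
@code{convert_symbol} are all hypothetical names invented for this
illustration.  Each toy format stores a 32-bit value followed by an
8-byte name, and the formats differ only in byte order.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NAME_LEN = 8, EXT_SIZE = 4 + NAME_LEN };

/* Hypothetical canonical symbol: the only form the application sees.  */
struct canon_sym {
  char name[NAME_LEN + 1];
  unsigned long value;
};

/* Hypothetical back end: a table of routines for one file format.  */
struct backend {
  const char *format_name;
  void (*read_sym) (const unsigned char *ext, struct canon_sym *out);
  void (*write_sym) (const struct canon_sym *in, unsigned char *ext);
};

/* Both toy formats keep the name in bytes 4..11, NUL-padded.  */
static void get_name (const unsigned char *ext, struct canon_sym *out)
{
  memcpy (out->name, ext + 4, NAME_LEN);
  out->name[NAME_LEN] = '\0';
}

static void put_name (const struct canon_sym *in, unsigned char *ext)
{
  size_t n = strlen (in->name);
  if (n > NAME_LEN)
    n = NAME_LEN;
  memset (ext + 4, 0, NAME_LEN);
  memcpy (ext + 4, in->name, n);
}

/* Toy format 1: value stored big-endian in bytes 0..3.  */
static void big_read (const unsigned char *ext, struct canon_sym *out)
{
  out->value = ((unsigned long) ext[0] << 24) | ((unsigned long) ext[1] << 16)
               | ((unsigned long) ext[2] << 8) | (unsigned long) ext[3];
  get_name (ext, out);
}

static void big_write (const struct canon_sym *in, unsigned char *ext)
{
  ext[0] = (unsigned char) ((in->value >> 24) & 0xff);
  ext[1] = (unsigned char) ((in->value >> 16) & 0xff);
  ext[2] = (unsigned char) ((in->value >> 8) & 0xff);
  ext[3] = (unsigned char) (in->value & 0xff);
  put_name (in, ext);
}

/* Toy format 2: value stored little-endian in bytes 0..3.  */
static void little_read (const unsigned char *ext, struct canon_sym *out)
{
  out->value = ((unsigned long) ext[3] << 24) | ((unsigned long) ext[2] << 16)
               | ((unsigned long) ext[1] << 8) | (unsigned long) ext[0];
  get_name (ext, out);
}

static void little_write (const struct canon_sym *in, unsigned char *ext)
{
  ext[0] = (unsigned char) (in->value & 0xff);
  ext[1] = (unsigned char) ((in->value >> 8) & 0xff);
  ext[2] = (unsigned char) ((in->value >> 16) & 0xff);
  ext[3] = (unsigned char) ((in->value >> 24) & 0xff);
  put_name (in, ext);
}

const struct backend toy_big = { "toy-big", big_read, big_write };
const struct backend toy_little = { "toy-little", little_read, little_write };

/* The application never inspects external bytes itself; it only calls
   through the pointers held in the two descriptors.  */
void convert_symbol (const struct backend *in, const unsigned char *src,
                     const struct backend *out, unsigned char *dst)
{
  struct canon_sym sym;
  assert (in != NULL && out != NULL);
  in->read_sym (src, &sym);     /* external -> canonical */
  out->write_sym (&sym, dst);   /* canonical -> external */
}
```
@end verbatim

Calling @code{convert_symbol} with @code{toy_big} as the input descriptor
and @code{toy_little} as the output descriptor rewrites a record in the
other byte order without the caller knowing either layout.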

@menu
* BFD information loss::	Information Loss
* Canonical format::		The BFD canonical object-file format
@end menu

@node BFD information loss
@subsection Information Loss

@emph{Information can be lost during output.} The output formats
supported by BFD do not provide identical facilities, and
information which can be described in one form has nowhere to go in
another format. One example of this is alignment information in
@code{b.out}. There is nowhere in an @code{a.out} format file to store
alignment information on the contained data, so when a file is linked
from @code{b.out} and an @code{a.out} image is produced, alignment
information will not propagate to the output file. (The linker will
still use the alignment information internally, so the link is performed
correctly.)
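
The alignment example can be sketched in C as follows.  This is only an
illustration under stated assumptions: the structures below are toy
stand-ins, not the real @code{a.out} or @code{b.out} header layouts, and
every name (@code{canon_section}, @code{ext_aligned}, @code{ext_plain},
@code{place_section}, and the read and write helpers) is hypothetical.

@verbatim
```c
#include <assert.h>

/* Hypothetical canonical section descriptor.  */
struct canon_section {
  unsigned long size;
  unsigned alignment_power;   /* section starts on a 2**n boundary */
};

/* Toy external header for a format that records alignment.  */
struct ext_aligned {
  unsigned long size;
  unsigned alignment_power;
};

/* Toy external header for a format with no alignment field.  */
struct ext_plain {
  unsigned long size;
};

/* Used while linking: round ADDR up to the section's alignment.  */
unsigned long place_section (unsigned long addr,
                             const struct canon_section *sec)
{
  unsigned long mask;
  assert (sec->alignment_power < 32);
  mask = (1UL << sec->alignment_power) - 1;
  return (addr + mask) & ~mask;
}

/* Reading the richer format fills in every canonical field.  */
void read_aligned (const struct ext_aligned *ext, struct canon_section *sec)
{
  sec->size = ext->size;
  sec->alignment_power = ext->alignment_power;
}

/* Writing the poorer format: the alignment has nowhere to go.  */
void write_plain (const struct canon_section *sec, struct ext_plain *ext)
{
  ext->size = sec->size;
}

/* Reading the poorer format back can only assume a default.  */
void read_plain (const struct ext_plain *ext, struct canon_section *sec)
{
  sec->size = ext->size;
  sec->alignment_power = 0;
}
```
@end verbatim

In this sketch @code{place_section} still honors the alignment while the
link is in progress, but a section written with @code{write_plain} and
read again with @code{read_plain} comes back with an alignment power of
zero.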

Another example is COFF section names. COFF files may contain an
unlimited number of sections, each one with a textual section name. If
the target of the link is a format which supports only a few sections (e.g.,
@code{a.out}) or has sections without names (e.g., the Oasys format), the
link cannot be done simply.  You can circumvent this problem by
describing the desired input-to-output section mapping with the linker command
language.

@emph{Information can be lost during canonicalization.} The BFD
internal canonical form of the external formats is not exhaustive; there
are structures in input formats for which there is no direct
representation internally.  This means that the BFD back ends
cannot preserve every detail of the data through the transformation
from external to internal and back to external formats.

This limitation is only a problem when an application reads one
format and writes another.  Each BFD back end is responsible for
maintaining as much data as possible, and the internal BFD
canonical form has structures which are opaque to the BFD core,
and exported only to the back ends. When a file is read in one format,
the canonical form is generated for BFD and the application. At the
same time, the back end saves away any information which may otherwise
be lost. If the data is then written back in the same format, the back
end routine will be able to use the canonical form provided by the
BFD core as well as the information it prepared earlier.  Since
there is a great deal of commonality between back ends,
no information is lost when linking or copying big-endian COFF to
little-endian COFF, or @code{a.out} to @code{b.out}.  When a mixture of
formats is linked, information is lost only from the files whose format
differs from that of the destination.
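
One way to picture the saved private information is the following C
sketch.  It is an assumption-laden illustration rather than BFD's actual
data layout: @code{format}, @code{canon_sym}, @code{fmt_a_priv},
@code{fmt_a_record}, and @code{fmt_a_write} are hypothetical names, and a
single storage-class field stands in for whatever a real back end would
keep.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identity of a back end.  */
struct format {
  const char *name;
};

/* Hypothetical canonical symbol.  PRIV is opaque to the core and to
   the application; only the back end named by ORIGIN knows its layout.  */
struct canon_sym {
  const char *name;
  unsigned long value;
  const struct format *origin;
  const void *priv;
};

const struct format fmt_a = { "fmt-a" };
const struct format fmt_b = { "fmt-b" };

/* Private data kept by the fmt-a back end for each symbol it reads.  */
struct fmt_a_priv {
  unsigned storage_class;
};

/* Toy external record written by the fmt-a back end.  */
struct fmt_a_record {
  unsigned long value;
  unsigned storage_class;
};

/* Write SYM in fmt-a.  The extra detail survives only when the symbol
   was originally read by the same back end.  */
void fmt_a_write (const struct canon_sym *sym, struct fmt_a_record *out)
{
  assert (sym != NULL && out != NULL);
  out->value = sym->value;
  if (sym->origin == &fmt_a && sym->priv != NULL)
    {
      const struct fmt_a_priv *p = sym->priv;
      out->storage_class = p->storage_class;   /* round trip: no loss */
    }
  else
    out->storage_class = 0;                    /* foreign symbol: default */
}
```
@end verbatim

A symbol whose @code{origin} is @code{fmt_a} is written back with its
saved storage class, while a symbol that came from @code{fmt_b} receives
only the default value.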

@node Canonical format
@subsection The BFD canonical object-file format

The greatest potential for loss of information occurs when there is the least
overlap between the information provided by the source format, that
stored by the canonical format, and that needed by the
destination format. A brief description of the canonical form may help
you understand which kinds of data you can count on preserving across
conversions.
@cindex BFD canonical format
@cindex internal object-file format

@table @emph
@item files
Information stored on a per-file basis includes target machine
architecture, particular implementation format type, a demand-pageable
bit, and a write-protected bit.  Information like Unix magic numbers is
not stored here; only the magic numbers' meaning is kept, so a @code{ZMAGIC}
file would have both the demand-pageable bit and the write-protected
text bit set.  The byte order of the target is stored on a per-file
basis, so that big- and little-endian object files may be used with one
another.

@item sections
Each section in the input file is described by its name, the
section's original address in the object file, size and alignment
information, various flags, and pointers into other BFD data
structures.

@item symbols
Each symbol contains a pointer to the information for the object file
which originally defined it, its name, its value, and various flag
bits.  When a BFD back end reads in a symbol table, it relocates all
symbols to make them relative to the base of the section where they were
defined.  Doing this ensures that each symbol points to its containing
section.  Each symbol also has a varying amount of hidden private data
for the BFD back end.  Since the symbol points to the original file, the
private data format for that symbol is accessible.  @code{ld} can
operate on a collection of symbols of wildly different formats without
problems.

Normal global and simple local symbols are maintained on output, so an
output file (no matter its format) will retain symbols pointing to
functions and to global, static, and common variables.  Some symbol
information is not worth retaining; in @code{a.out}, type information is
stored in the symbol table as long symbol names.  This information would
be useless to most COFF debuggers; the linker has command line switches
to allow users to throw it away.

There is one word of type information within the symbol, so if the
format supports symbol type information within symbols (for example, COFF,
IEEE, Oasys) and the type is simple enough to fit within one word
(nearly everything but aggregates), the information will be preserved.

@item relocation level
Each canonical BFD relocation record contains a pointer to the symbol to
relocate to, the offset of the data to relocate, the section the data
is in, and a pointer to a relocation type descriptor. Relocation is
performed by passing messages through the relocation type
descriptor and the symbol pointer. Therefore, relocations can be performed
on output data using a relocation method that is only available in one of the
input formats. For instance, Oasys provides a byte relocation format.
A relocation record requesting this relocation type would point
indirectly to a routine to perform this, so the relocation may be
performed on a byte being written to a 68k COFF file, even though 68k COFF
has no such relocation type.

@item line numbers
Object formats can contain, for debugging purposes, some form of mapping
between symbols, source line numbers, and addresses in the output file.
These addresses have to be relocated along with the symbol information.
Each symbol with an associated list of line number records points to the
first record of the list.  The head of a line number list consists of a
pointer to the symbol, which allows finding out the address of the
function whose line number is being described. The rest of the list is
made up of pairs: offsets into the section and line numbers.  Any format
which can readily derive this information (COFF, IEEE, and Oasys) can
pass it successfully between formats.
@end table
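
The relocation records described in the table can be illustrated with a
short C sketch in which the relocation type descriptor carries a pointer
to the routine that performs the relocation.  The names
(@code{reloc_type}, @code{canon_reloc}, @code{perform_reloc}, and the two
descriptors) are hypothetical and are not the real BFD types.  The sketch
also assumes that a relocation simply adds the symbol's value to the data
already in place, and it passes the section contents as a separate buffer
instead of storing a section pointer in the record.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical canonical symbol.  */
struct canon_sym {
  const char *name;
  unsigned long value;
};

/* Hypothetical relocation type descriptor: it knows how wide the
   patched field is and which routine patches it.  */
struct reloc_type {
  const char *name;
  size_t size;
  void (*apply) (unsigned char *where, unsigned long value);
};

/* Hypothetical canonical relocation record.  */
struct canon_reloc {
  const struct canon_sym *sym;     /* symbol to relocate to */
  size_t offset;                   /* offset of the data to relocate */
  const struct reloc_type *type;   /* how to perform the relocation */
};

/* Add VALUE to a single byte, keeping the low eight bits.  */
static void apply_byte (unsigned char *where, unsigned long value)
{
  where[0] = (unsigned char) ((where[0] + value) & 0xff);
}

/* Add VALUE to a 32-bit big-endian word.  */
static void apply_word32_be (unsigned char *where, unsigned long value)
{
  unsigned long v = ((unsigned long) where[0] << 24)
                    | ((unsigned long) where[1] << 16)
                    | ((unsigned long) where[2] << 8)
                    | (unsigned long) where[3];
  v = (v + value) & 0xffffffffUL;
  where[0] = (unsigned char) ((v >> 24) & 0xff);
  where[1] = (unsigned char) ((v >> 16) & 0xff);
  where[2] = (unsigned char) ((v >> 8) & 0xff);
  where[3] = (unsigned char) (v & 0xff);
}

const struct reloc_type reloc_byte = { "byte", 1, apply_byte };
const struct reloc_type reloc_word32_be = { "word32-be", 4, apply_word32_be };

/* The caller needs no knowledge of the relocation kind: the work is
   done through the descriptor the record points to.  */
void perform_reloc (unsigned char *data, size_t data_size,
                    const struct canon_reloc *rel)
{
  assert (rel->offset + rel->type->size <= data_size);
  rel->type->apply (data + rel->offset, rel->sym->value);
}
```
@end verbatim

Because @code{perform_reloc} only follows the pointer in the record, a
byte relocation that originated in one input format can be applied to
output data in this sketch even if the output format defines no such
relocation type of its own.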