+Number-of-calls information for library routines is collected by using a
+special version of the C library. The programs in it are the same as in
+the usual C library, but they were compiled with @samp{-pg}. If you
+link your program with @samp{gcc @dots{} -pg}, it automatically uses the
+profiling version of the library.
+
+Profiling also involves watching your program as it runs, and keeping a
+histogram of where the program counter happens to be every now and then.
+Typically the program counter is looked at around 100 times per second of
+run time, but the exact frequency may vary from system to system.
+
+This is done is one of two ways. Most UNIX-like operating systems
+provide a @code{profil()} system call, which registers a memory
+array with the kernel, along with a scale
+factor that determines how the program's address space maps
+into the array.
+Typical scaling values cause every 2 to 8 bytes of address space
+to map into a single array slot.
+On every tick of the system clock
+(assuming the profiled program is running), the value of the
+program counter is examined and the corresponding slot in
+the memory array is incremented. Since this is done in the kernel,
+which had to interrupt the process anyway to handle the clock
+interrupt, very little additional system overhead is required.
+
+However, some operating systems, most notably Linux 2.0 (and earlier),
+do not provide a @code{profil()} system call. On such a system,
+arrangements are made for the kernel to periodically deliver
+a signal to the process (typically via @code{setitimer()}),
+which then performs the same operation of examining the
+program counter and incrementing a slot in the memory array.
+Since this method requires a signal to be delivered to
+user space every time a sample is taken, it uses considerably
+more overhead than kernel-based profiling. Also, due to the
+added delay required to deliver the signal, this method is
+less accurate as well.
+
+A special startup routine allocates memory for the histogram and
+either calls @code{profil()} or sets up
+a clock signal handler.
+This routine (@code{monstartup}) can be invoked in several ways.
+On Linux systems, a special profiling startup file @code{gcrt0.o},
+which invokes @code{monstartup} before @code{main},
+is used instead of the default @code{crt0.o}.
+Use of this special startup file is one of the effects
+of using @samp{gcc @dots{} -pg} to link.
+On SPARC systems, no special startup files are used.
+Rather, the @code{mcount} routine, when it is invoked for
+the first time (typically when @code{main} is called),
+calls @code{monstartup}.
+
+If the compiler's @samp{-a} option was used, basic-block counting
+is also enabled. Each object file is then compiled with a static array
+of counts, initially zero.
+In the executable code, every time a new basic-block begins
+(i.e., when an @code{if} statement appears), an extra instruction
+is inserted to increment the corresponding count in the array.
+At compile time, a paired array was constructed that recorded
+the starting address of each basic-block. Taken together,
+the two arrays record the starting address of every basic-block,
+along with the number of times it was executed.
+
+The profiling library also includes a function (@code{mcleanup}) which is
+typically registered using @code{atexit()} to be called as the
+program exits, and is responsible for writing the file @file{gmon.out}.
+Profiling is turned off, various headers are output, and the histogram
+is written, followed by the call-graph arcs and the basic-block counts.
+
+The output from @code{gprof} gives no indication of parts of your program that
+are limited by I/O or swapping bandwidth. This is because samples of the
+program counter are taken at fixed intervals of the program's run time.
+Therefore, the
+time measurements in @code{gprof} output say nothing about time that your
+program was not running. For example, a part of the program that creates
+so much data that it cannot all fit in physical memory at once may run very
+slowly due to thrashing, but @code{gprof} will say it uses little time. On
+the other hand, sampling by run time has the advantage that the amount of
+load due to other users won't directly affect the output you get.
+
+@node File Format
+@section Profiling Data File Format
+
+The old BSD-derived file format used for profile data does not contain a
+magic cookie that allows to check whether a data file really is a
+@code{gprof} file. Furthermore, it does not provide a version number, thus
+rendering changes to the file format almost impossible. @sc{gnu} @code{gprof}
+uses a new file format that provides these features. For backward
+compatibility, @sc{gnu} @code{gprof} continues to support the old BSD-derived
+format, but not all features are supported with it. For example,
+basic-block execution counts cannot be accommodated by the old file
+format.
+
+The new file format is defined in header file @file{gmon_out.h}. It
+consists of a header containing the magic cookie and a version number,
+as well as some spare bytes available for future extensions. All data
+in a profile data file is in the native format of the target for which
+the profile was collected. @sc{gnu} @code{gprof} adapts automatically
+to the byte-order in use.
+
+In the new file format, the header is followed by a sequence of
+records. Currently, there are three different record types: histogram
+records, call-graph arc records, and basic-block execution count
+records. Each file can contain any number of each record type. When
+reading a file, @sc{gnu} @code{gprof} will ensure records of the same type are
+compatible with each other and compute the union of all records. For
+example, for basic-block execution counts, the union is simply the sum
+of all execution counts for each basic-block.
+
+@subsection Histogram Records
+
+Histogram records consist of a header that is followed by an array of
+bins. The header contains the text-segment range that the histogram
+spans, the size of the histogram in bytes (unlike in the old BSD
+format, this does not include the size of the header), the rate of the
+profiling clock, and the physical dimension that the bin counts
+represent after being scaled by the profiling clock rate. The
+physical dimension is specified in two parts: a long name of up to 15
+characters and a single character abbreviation. For example, a
+histogram representing real-time would specify the long name as
+``seconds'' and the abbreviation as ``s''. This feature is useful for
+architectures that support performance monitor hardware (which,
+fortunately, is becoming increasingly common). For example, under DEC
+OSF/1, the ``uprofile'' command can be used to produce a histogram of,
+say, instruction cache misses. In this case, the dimension in the
+histogram header could be set to ``i-cache misses'' and the abbreviation
+could be set to ``1'' (because it is simply a count, not a physical
+dimension). Also, the profiling rate would have to be set to 1 in
+this case.
+
+Histogram bins are 16-bit numbers and each bin represent an equal
+amount of text-space. For example, if the text-segment is one
+thousand bytes long and if there are ten bins in the histogram, each
+bin represents one hundred bytes.
+
+
+@subsection Call-Graph Records
+
+Call-graph records have a format that is identical to the one used in
+the BSD-derived file format. It consists of an arc in the call graph
+and a count indicating the number of times the arc was traversed
+during program execution. Arcs are specified by a pair of addresses:
+the first must be within caller's function and the second must be
+within the callee's function. When performing profiling at the
+function level, these addresses can point anywhere within the
+respective function. However, when profiling at the line-level, it is
+better if the addresses are as close to the call-site/entry-point as
+possible. This will ensure that the line-level call-graph is able to
+identify exactly which line of source code performed calls to a
+function.
+
+@subsection Basic-Block Execution Count Records
+
+Basic-block execution count records consist of a header followed by a
+sequence of address/count pairs. The header simply specifies the
+length of the sequence. In an address/count pair, the address
+identifies a basic-block and the count specifies the number of times
+that basic-block was executed. Any address within the basic-address can
+be used.
+
+@node Internals
+@section @code{gprof}'s Internal Operation
+
+Like most programs, @code{gprof} begins by processing its options.
+During this stage, it may building its symspec list
+(@code{sym_ids.c:@-sym_id_add}), if
+options are specified which use symspecs.
+@code{gprof} maintains a single linked list of symspecs,
+which will eventually get turned into 12 symbol tables,
+organized into six include/exclude pairs---one
+pair each for the flat profile (INCL_FLAT/EXCL_FLAT),
+the call graph arcs (INCL_ARCS/EXCL_ARCS),
+printing in the call graph (INCL_GRAPH/EXCL_GRAPH),
+timing propagation in the call graph (INCL_TIME/EXCL_TIME),
+the annotated source listing (INCL_ANNO/EXCL_ANNO),
+and the execution count listing (INCL_EXEC/EXCL_EXEC).
+
+After option processing, @code{gprof} finishes
+building the symspec list by adding all the symspecs in
+@code{default_excluded_list} to the exclude lists
+EXCL_TIME and EXCL_GRAPH, and if line-by-line profiling is specified,
+EXCL_FLAT as well.
+These default excludes are not added to EXCL_ANNO, EXCL_ARCS, and EXCL_EXEC.
+
+Next, the BFD library is called to open the object file,
+verify that it is an object file,
+and read its symbol table (@code{core.c:@-core_init}),
+using @code{bfd_canonicalize_symtab} after mallocing
+an appropriately sized array of symbols. At this point,
+function mappings are read (if the @samp{--file-ordering} option
+has been specified), and the core text space is read into
+memory (if the @samp{-c} option was given).
+
+@code{gprof}'s own symbol table, an array of Sym structures,
+is now built.
+This is done in one of two ways, by one of two routines, depending
+on whether line-by-line profiling (@samp{-l} option) has been
+enabled.
+For normal profiling, the BFD canonical symbol table is scanned.
+For line-by-line profiling, every
+text space address is examined, and a new symbol table entry
+gets created every time the line number changes.
+In either case, two passes are made through the symbol
+table---one to count the size of the symbol table required,
+and the other to actually read the symbols. In between the
+two passes, a single array of type @code{Sym} is created of
+the appropriate length.
+Finally, @code{symtab.c:@-symtab_finalize}
+is called to sort the symbol table and remove duplicate entries
+(entries with the same memory address).
+
+The symbol table must be a contiguous array for two reasons.
+First, the @code{qsort} library function (which sorts an array)
+will be used to sort the symbol table.
+Also, the symbol lookup routine (@code{symtab.c:@-sym_lookup}),
+which finds symbols
+based on memory address, uses a binary search algorithm
+which requires the symbol table to be a sorted array.
+Function symbols are indicated with an @code{is_func} flag.
+Line number symbols have no special flags set.
+Additionally, a symbol can have an @code{is_static} flag
+to indicate that it is a local symbol.
+
+With the symbol table read, the symspecs can now be translated
+into Syms (@code{sym_ids.c:@-sym_id_parse}). Remember that a single
+symspec can match multiple symbols.
+An array of symbol tables
+(@code{syms}) is created, each entry of which is a symbol table
+of Syms to be included or excluded from a particular listing.
+The master symbol table and the symspecs are examined by nested
+loops, and every symbol that matches a symspec is inserted
+into the appropriate syms table. This is done twice, once to
+count the size of each required symbol table, and again to build
+the tables, which have been malloced between passes.
+From now on, to determine whether a symbol is on an include
+or exclude symspec list, @code{gprof} simply uses its
+standard symbol lookup routine on the appropriate table
+in the @code{syms} array.
+
+Now the profile data file(s) themselves are read
+(@code{gmon_io.c:@-gmon_out_read}),
+first by checking for a new-style @samp{gmon.out} header,
+then assuming this is an old-style BSD @samp{gmon.out}
+if the magic number test failed.
+
+New-style histogram records are read by @code{hist.c:@-hist_read_rec}.
+For the first histogram record, allocate a memory array to hold
+all the bins, and read them in.
+When multiple profile data files (or files with multiple histogram
+records) are read, the memory ranges of each pair of histogram records
+must be either equal, or non-overlapping. For each pair of histogram
+records, the resolution (memory region size divided by the number of
+bins) must be the same. The time unit must be the same for all
+histogram records. If the above containts are met, all histograms
+for the same memory range are merged.
+
+As each call graph record is read (@code{call_graph.c:@-cg_read_rec}),
+the parent and child addresses
+are matched to symbol table entries, and a call graph arc is
+created by @code{cg_arcs.c:@-arc_add}, unless the arc fails a symspec
+check against INCL_ARCS/EXCL_ARCS. As each arc is added,
+a linked list is maintained of the parent's child arcs, and of the child's
+parent arcs.
+Both the child's call count and the arc's call count are
+incremented by the record's call count.
+
+Basic-block records are read (@code{basic_blocks.c:@-bb_read_rec}),
+but only if line-by-line profiling has been selected.
+Each basic-block address is matched to a corresponding line
+symbol in the symbol table, and an entry made in the symbol's
+bb_addr and bb_calls arrays. Again, if multiple basic-block
+records are present for the same address, the call counts
+are cumulative.
+
+A gmon.sum file is dumped, if requested (@code{gmon_io.c:@-gmon_out_write}).
+
+If histograms were present in the data files, assign them to symbols
+(@code{hist.c:@-hist_assign_samples}) by iterating over all the sample
+bins and assigning them to symbols. Since the symbol table
+is sorted in order of ascending memory addresses, we can
+simple follow along in the symbol table as we make our pass
+over the sample bins.
+This step includes a symspec check against INCL_FLAT/EXCL_FLAT.
+Depending on the histogram
+scale factor, a sample bin may span multiple symbols,
+in which case a fraction of the sample count is allocated
+to each symbol, proportional to the degree of overlap.
+This effect is rare for normal profiling, but overlaps
+are more common during line-by-line profiling, and can
+cause each of two adjacent lines to be credited with half
+a hit, for example.
+
+If call graph data is present, @code{cg_arcs.c:@-cg_assemble} is called.
+First, if @samp{-c} was specified, a machine-dependent
+routine (@code{find_call}) scans through each symbol's machine code,
+looking for subroutine call instructions, and adding them
+to the call graph with a zero call count.
+A topological sort is performed by depth-first numbering
+all the symbols (@code{cg_dfn.c:@-cg_dfn}), so that
+children are always numbered less than their parents,
+then making a array of pointers into the symbol table and sorting it into
+numerical order, which is reverse topological
+order (children appear before parents).
+Cycles are also detected at this point, all members
+of which are assigned the same topological number.
+Two passes are now made through this sorted array of symbol pointers.
+The first pass, from end to beginning (parents to children),
+computes the fraction of child time to propagate to each parent
+and a print flag.
+The print flag reflects symspec handling of INCL_GRAPH/EXCL_GRAPH,
+with a parent's include or exclude (print or no print) property
+being propagated to its children, unless they themselves explicitly appear
+in INCL_GRAPH or EXCL_GRAPH.
+A second pass, from beginning to end (children to parents) actually
+propagates the timings along the call graph, subject
+to a check against INCL_TIME/EXCL_TIME.
+With the print flag, fractions, and timings now stored in the symbol
+structures, the topological sort array is now discarded, and a
+new array of pointers is assembled, this time sorted by propagated time.
+
+Finally, print the various outputs the user requested, which is now fairly
+straightforward. The call graph (@code{cg_print.c:@-cg_print}) and
+flat profile (@code{hist.c:@-hist_print}) are regurgitations of values
+already computed. The annotated source listing
+(@code{basic_blocks.c:@-print_annotated_source}) uses basic-block
+information, if present, to label each line of code with call counts,
+otherwise only the function call counts are presented.
+
+The function ordering code is marginally well documented
+in the source code itself (@code{cg_print.c}). Basically,
+the functions with the most use and the most parents are
+placed first, followed by other functions with the most use,
+followed by lower use functions, followed by unused functions
+at the end.
+
+@node Debugging
+@section Debugging @code{gprof}
+
+If @code{gprof} was compiled with debugging enabled,
+the @samp{-d} option triggers debugging output
+(to stdout) which can be helpful in understanding its operation.
+The debugging number specified is interpreted as a sum of the following
+options:
+
+@table @asis
+@item 2 - Topological sort
+Monitor depth-first numbering of symbols during call graph analysis
+@item 4 - Cycles
+Shows symbols as they are identified as cycle heads
+@item 16 - Tallying
+As the call graph arcs are read, show each arc and how
+the total calls to each function are tallied
+@item 32 - Call graph arc sorting
+Details sorting individual parents/children within each call graph entry
+@item 64 - Reading histogram and call graph records
+Shows address ranges of histograms as they are read, and each
+call graph arc
+@item 128 - Symbol table
+Reading, classifying, and sorting the symbol table from the object file.
+For line-by-line profiling (@samp{-l} option), also shows line numbers
+being assigned to memory addresses.
+@item 256 - Static call graph
+Trace operation of @samp{-c} option
+@item 512 - Symbol table and arc table lookups
+Detail operation of lookup routines
+@item 1024 - Call graph propagation
+Shows how function times are propagated along the call graph
+@item 2048 - Basic-blocks
+Shows basic-block records as they are read from profile data
+(only meaningful with @samp{-l} option)
+@item 4096 - Symspecs
+Shows symspec-to-symbol pattern matching operation
+@item 8192 - Annotate source
+Tracks operation of @samp{-A} option
+@end table
+
+@node GNU Free Documentation License
+@appendix GNU Free Documentation License
+@include fdl.texi
+
+@bye
+
+NEEDS AN INDEX