The design and implementation of pkgsrc
From NetBSD Wiki
General Design Guidelines
- Always do extensive error checking.
- When in doubt about what to do in a certain situation, print a helpful error message and fail as hard as possible.
The compiler wrappers
The compilation environment
Compiling a file is not as trivial as it might seem. There are many aspects affecting the compilation. Some of them are:
- The selected POSIX compilation environment (XPG4, XPG5, XPG6, POSIX 2001).
- Some platforms (e.g. Solaris) are very strict at checking the consistency of these options.
- To discuss: Should all pkgsrc files be forced to be compiled in the very same environment? Sure, that needs some work, but we will very likely find many bugs and unportable programming styles through it.
- The selected application binary interface (ABI).
- What if a package selects an ABI different from what the user wants? Should the compilation fail in that case?
- The directories where header files and libraries are searched.
- This is already handled quite well by buildlink3. The remaining part is to exclude the system directories from being searched automatically.
Internals
The internals of the compiler wrapper are one of the lesser-known parts of pkgsrc. While I'm trying to understand the code, I will note down anything that seems worth documenting.
User-visible part
The pkgsrc user can set PKGSRC_COMPILER to a list of compilers that should be used. This list can contain "chainable" compilers like ccache or distcc, and should be terminated with a "real" compiler like gcc, sunpro, mipspro. Each of the compilers has its own definition file in mk/compiler/${name}.mk.
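The shape of such a list can be sketched as a small shell check. This is only an illustration: the function name check_compiler_chain is hypothetical and not part of pkgsrc, and the sets of chainable and real compilers contain only the names mentioned above.

```sh
# Hypothetical helper, not part of pkgsrc: checks that a PKGSRC_COMPILER-style
# list consists of chainable compilers followed by exactly one real compiler.
check_compiler_chain() {
	last=
	for c in "$@"; do
		# Nothing may follow a real compiler.
		case $last in
		gcc|sunpro|mipspro)
			echo "ERROR: \"$last\" must be the last compiler in the list" >&2
			return 1 ;;
		esac
		case $c in
		ccache|distcc|gcc|sunpro|mipspro)
			last=$c ;;
		*)
			echo "ERROR: unknown compiler \"$c\"" >&2
			return 1 ;;
		esac
	done
	# The list must end with a real compiler.
	case $last in
	gcc|sunpro|mipspro)
		return 0 ;;
	*)
		echo "ERROR: the list must end with a real compiler" >&2
		return 1 ;;
	esac
}
```

With this sketch, "ccache distcc gcc" is accepted, while "ccache" alone or "gcc ccache" is rejected with an error message.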
Structure of the compiler definition file
The files in pkgsrc/mk/compiler are a mystery to many people. I'm trying to understand things by examining a typical run through the file sunpro.mk. I'm taking this one because I have the compiler available and because gcc.mk is too complicated for understanding the basic things.
The file structure does not match the general pkgsrc design patterns anymore, so I intend to rewrite these files. To do this, I first have to understand in depth what is going on there.
The first two lines deal with multiple inclusion. I don't think they are necessary, since packages or users are not supposed to include this file directly.
Since the compiler may be installed anywhere in the local system, the variable SUNWSPROBASE can be set by the user to point to the directory where the compiler is installed. Other compilers, but not all, have similar variables.
The variable LANGUAGES.sunpro is set to an empty list, and after the file has been loaded, its value will be stable. Typical values include c, c++, fortran. This variable should not be user-visible.
_SUNPRO_DIR is the directory where the second level wrappers for the compiler are installed. I don't have any idea what this is good for, since the only case where the compiler is supposed to be called is by the wrappers, which could easily just call the real compiler.
_SUNPRO_VARS is a list of (make,shell,outermake?) variables. Typical values are CC, CXX, FC. For each of these values:
- there exists a variable _SUNPRO_* that points to the wrapper binary in _SUNPRO_DIR.
- the variable _ALIASES.* is a list of filenames that are created as symlinks in _SUNPRO_DIR.
- *PATH is the pathname to the real compiler.
- PKG_* is an alias for _SUNPRO_*. (Why is this necessary?)
There are some other variables:
- _COMPILER_ABI_FLAG.${abi} is a list of flags (TODO: Change the name to _COMPILER_ABI_FLAGS) that is appended/prepended(?) to the compiler command line by the complex wrappers.
- CC_VERSION_STRING is the complete output of the command to determine the version of the compiler. (FIXME: This should be done for each compiler, not only for C.)
- CC_VERSION is similar to CC_VERSION_STRING. (TODO: Where are these two variables used?)
If any of these compilers is used, the directory where the simple wrappers are installed is prepended to the PATH variable.
Open questions
- Where is _COMPILER_STRIP_VARS used?
- Where is LANGUAGES.* used?
Compiler properties
There are several things that describe a specific compiler:
- the language that it can compile
- the path where the compiler can be found
- the list of aliases that should be generated by the wrapper framework
- special flags for some platforms/ABIs
- the option for rpath definition, for example -Wl,-R
- the pkgsrc dependencies (for example, devel/ccache)
- the dependencies on other compilers (for example, ccache needs a real C or C++ compiler)
- the environment variable that is used to find the compiler (for example CC or CXX)
- a shell command to get the version number of the compiler
What's in a compiler wrapper?
A wrapper is a mixture of many different files. Each of these files can be selected individually per wrapper (see bsd.wrapper.mk).
scan
Sets some variables, depending on the command line arguments:
- append_extra_args: Whether the extra arguments (_WRAP_EXTRA_ARGS.*) should be appended to the command or not.
Possible sources: mk/wrapper/scan or mk/buildlink3/scan-libtool.
arg_source
Iterates over all arguments, does some preprocessing and stores them in the argbuf queue.
Possible sources: mk/wrapper/arg-source.
Transformations (logic)
This is the main part. Here, the argbuf is analyzed bit by bit and transformed into cmdbuf, which will make up the real command that is executed at the end.
For each command line argument, the chain of arg-pp-main, arg-pp, cache, transform.sed and transform is executed.
- arg-pp-main does some transformations on argbuf and prepends its results to the same argbuf. That way, the replacement is transformed next.
- arg is the current argument.
- argmatch is set to "yes" when a rule from arg-pp-main matched.
- argok is set to "yes" when the argument should be passed to the ...
- arg-pp is usually empty.
- skipargs can be set to a positive number to skip the next n arguments from the transformations.
- do_transform can be set to "no" if the argument should pass without the usual transformations.
- cache caches transformations to avoid calling too many external processes. There are no user-serviceable parts.
- transform.sed is the ugliest part of the compiler wrappers. I guess nobody understands it completely. Well, maybe except for jlam, who also wrote it. Every argument is processed through the large jungle of sed expressions before it reaches the next stage. (XXX: can the order be swapped?)
- transform does some transformations that depend on the selected compiler. This is usually one of the mk/wrapper/transform-* files.
- arg is the argument to be transformed. It may either be transformed in-place or by calling the transform_* functions from mk/wrapper/wrapper-subr.sh.
- addtocache can be set to "yes" if the argument's transformation should not be cached.
- split_arg can be set to "yes" if the result of the transformation should be split into many words.
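How skipargs and do_transform steer the per-argument chain can be sketched in a few lines of shell. This is a strongly simplified model under stated assumptions: the real wrappers use queues (argbuf, cmdbuf) and several sourced files, whereas this sketch uses a plain string and two hypothetical functions, transform_arg and run_logic. The rule for -o is only an example rule, not taken from the wrapper sources.

```sh
# Hypothetical stand-in for the compiler-specific transform stage.
transform_arg() {
	case $1 in
	-MM)	echo "-xM1" ;;	# gcc spelling to sunpro spelling
	*)	echo "$1" ;;
	esac
}

# Simplified model of the logic stage: decides per argument whether the
# transformation is applied, then appends the result to cmdbuf.
run_logic() {
	skipargs=0
	cmdbuf=
	for arg in "$@"; do
		do_transform=yes
		if [ "$skipargs" -gt 0 ]; then
			# A previous rule asked to skip this argument.
			skipargs=$((skipargs - 1))
			do_transform=no
		elif [ "$arg" = "-o" ]; then
			# Example rule: leave -o and the following file name alone.
			skipargs=1
			do_transform=no
		fi
		if [ "$do_transform" = yes ]; then
			arg=$(transform_arg "$arg")
		fi
		cmdbuf="${cmdbuf:+$cmdbuf }$arg"
	done
	echo "$cmdbuf"
}

run_logic -MM -o -MM file.c	# prints "-xM1 -o -MM file.c"
```

The second -MM is the argument of -o in this contrived call, so it passes through unchanged, which is the effect that skipargs is meant to achieve.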
Building the command (cmd-sink, buildcmd)
The cmd-sink command iterates over the arguments from cmdbuf, calling buildcmd for each of them. The buildcmd command writes the arguments into two scalar variables (not queues):
- libs for all the arguments of the form -l*, and
- cmd for all other arguments.
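The split into the two variables can be sketched as follows. This is a minimal model that ignores quoting, so it assumes arguments without white-space. The function body is an illustration and not the code from mk/wrapper.

```sh
cmd=
libs=

# Sorts one argument into either libs (-l*) or cmd (everything else).
buildcmd() {
	case $1 in
	-l*)	libs="${libs:+$libs }$1" ;;
	*)	cmd="${cmd:+$cmd }$1" ;;
	esac
}

for arg in cc -lm main.c -o main -lz; do
	buildcmd "$arg"
done

# The libraries are appended after all other arguments.
echo "$cmd $libs"	# prints "cc main.c -o main -lm -lz"
```

Keeping the libraries in a separate variable is what makes it possible to reorder them before they are appended to the command.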
Reordering libraries
Finally, if there are any library reordering commands (WRAPPER_REORDER_CMDS, a package-settable variable), the libraries are reordered (see mk/wrapper/gen-reorder.sh).
Running the wrapped command
Now, the list of libraries (libs) is appended to the command (cmd), and the command is run.
Cleaning up
This step is called just before the wrapper exits.
Possible sources: mk/buildlink3/libtool-cleanup or nothing.
The compiler-specific transformations
In the mk/wrapper directory there are some files called transform-*. These are the compiler-specific wrappers that analyze a single command line argument and transform it into options that the specific compiler understands.
Each compiler wrapper shall handle each of the standard options in the next section. It should also handle as many as possible of the non-standard options.
There are some helper functions that make writing a compiler wrapper quite easy. Have a look at mk/wrapper/wrapper-subr.sh and the various existing files to get a feeling for how to use them.
Special transformations for single files
When compiling some files, internal compiler errors occur. In such a case, it often helps to switch off the optimization for that file. Unfortunately, there is currently no way to do that for a single file; it can only be done for a whole package. This is suboptimal.
My idea is to add a new wrapper transformation rule that is only applied when all of the following conditions apply:
- The wrapper type (like CC, LD or LIBTOOL) is correct.
- Any argument in the compiler command line matches an expression.
Then, a list of additional arguments is appended to the wrapper's command line.
Examples for these transformations are:
- CC:*parse_date.c:-OPT:Olimit=33353 -CG:longbranch_limit=100000
- CC:*internal_error.c:-O0
Notes:
- The list of arguments to be appended may contain any character, including white-space and colons.
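Parsing and applying such a rule could look like the following sketch. The function name extra_args_for is hypothetical, and the sketch assumes that only the argument list (the third field) may contain colons, so the pattern itself must not contain one.

```sh
# Hypothetical helper for the proposed rule format TYPE:PATTERN:ARGS.
# Usage: extra_args_for rule wrapper_type arg...
# Prints ARGS if the rule applies to the wrapper type and command line.
extra_args_for() {
	rule=$1 wrapper_type=$2
	shift 2
	rule_type=${rule%%:*}	# text before the first colon
	rest=${rule#*:}
	pattern=${rest%%:*}	# text between the first and the second colon
	extra=${rest#*:}	# everything else, colons included
	[ "$rule_type" = "$wrapper_type" ] || return 0
	for arg in "$@"; do
		# The unquoted $pattern is matched as a shell pattern.
		case $arg in
		$pattern)
			echo "$extra"
			return 0 ;;
		esac
	done
}

extra_args_for 'CC:*internal_error.c:-O0' CC -c src/internal_error.c	# prints "-O0"
```

For any other wrapper type, or when no argument matches the pattern, the function prints nothing, so the command line stays unchanged.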
TODO
- remove-if:*internal_error.c:-O*
Implementation
- These transformations must not be cached by the wrapper cache, as they only apply under special circumstances.
Options that every compiler wrapper should be able to handle
The following are the options required by The Open Group for every C99 compiler.
- -c: Tells the compiler not to link files together.
- -Dmacro: Defines a preprocessor macro that has the value 1.
- -Dmacro=value: Defines a preprocessor macro that has the given value.
- -E: Instead of compiling a file, it should only be preprocessed, and the resulting output be printed on the standard output.
- -g: Tells the compiler to add debugging information to the generated files.
- -Idirectory: Appends a directory to the search path for header files.
- -Ldirectory: Appends a directory to the search path for library files.
- -llibrary: Appends the library to the list of libraries the program should be linked with.
- -O, -O0, -O1: Enables or disables code optimization.
- -o file: Tells the compiler to use file as filename for the generated file.
- -s: Should generally be discarded, because stripping binaries from symbols is the user's decision in pkgsrc.
- -Umacro: Undefines a preprocessor macro.
There are other, non-standard options that many compilers implement under varying names. Since most open source packages use gcc for development, it seems to be best to require every compiler wrapper to accept the gcc variants of the following options.
- -L (IRIX ld), -nostdlib (GNU ld, IRIX ld), -Y P, (Solaris ld): Disables the default library search paths. This feature is currently not used in pkgsrc, but it seems worth evaluating in an experiment.
- -MM (gcc), -xM1 (sunpro): preprocesses the input file(s) and prints their dependencies on the standard output, in a format suitable to be included in a Makefile.
- -O2, -O3, -Os (gcc): Enables more optimization.
- -Wl,linker-option (many): In case that the compiler calls the linker, the given argument should be passed to it.
- -Wl,-Rdirectory (Solaris ld, GNU ld), -Wl,-rpath,directory (GNU ld): Appends a path to the run-time search path for library files.
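A transformation that accepts the gcc spellings and emits what a sunpro-like compiler understands might be sketched as below. The function transform_option is hypothetical and is not one of the mk/wrapper/transform-* files. It uses only mappings that are named in the two lists above.

```sh
# Hypothetical per-argument transformation for a sunpro-like compiler.
transform_option() {
	case $1 in
	-s)
		# Discarded: stripping binaries is the user's decision.
		;;
	-MM)
		echo "-xM1" ;;
	-Wl,-rpath,*)
		# GNU ld spelling to the -R spelling.
		echo "-Wl,-R${1#-Wl,-rpath,}" ;;
	*)
		echo "$1" ;;
	esac
}

transform_option -Wl,-rpath,/usr/pkg/lib	# prints "-Wl,-R/usr/pkg/lib"
```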
There are also command line options that some compilers cannot handle. These should not be used in packages.
- -Dmacro(parameters)=value (sunpro)
Things that no compiler wrapper should ever do
- Create log files in the current directory. Some configure scripts (at least those from autoconf 2.13) will treat these log files as the default output file of the compiler, which leads to unexpected behavior.
Open Questions
- In the configure and build phases, the package sees environment variables CC and CXX. Where do the values of these variables come from?
- Why are there extra wrappers in ${WRKDIR}/.gcc or ${WRKDIR}/.sunpro? Wouldn't it be sufficient to just call the "real" compilers?
- Why is there so much code duplication in the files in mk/compiler/*.mk?
- Why is _LINKER_RPATH_FLAG defined by the compiler, not by the platform?
- How can a transform-* file know which program (C compiler, C++ compiler, linker) is currently called?
Unsolved problems
- There are some packages that need a compiler at runtime. How should these packages know which compiler to take? The wrapped compilers usually accept far more command line options than the real ones, and they usually find their dependent libraries and include files more easily.
Environment variables
User-settable environment variables that might disturb pkgsrc
- LD_LIBRARY_PATH
- PATH
- LD_RUN_PATH
- LC_ALL, LC_MESSAGES, LC_CTYPE
- MAKECONF
- CFLAGS, CPPFLAGS, CXXFLAGS
- CC, CXX, FC, CPP
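A quick way to see which of these variables are set in the current shell is sketched below. The function name report_disturbing_env is hypothetical. PATH is left out because it is practically always set.

```sh
# Prints the names of the potentially disturbing variables that are set.
report_disturbing_env() {
	for name in LD_LIBRARY_PATH LD_RUN_PATH LC_ALL LC_MESSAGES LC_CTYPE \
	    MAKECONF CFLAGS CPPFLAGS CXXFLAGS CC CXX FC CPP; do
		# ${VAR+set} expands to "set" if VAR is set, even if it is empty.
		eval "state=\${$name+set}"
		if [ "$state" = set ]; then
			echo "$name"
		fi
	done
}
```

Note that this looks at shell variables, so it also reports variables that are set but not exported.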
Naming conventions
Make targets
- Make targets that start with an underscore are private to the pkgsrc infrastructure.
- Make targets that start with two underscores are private to the current file.
Variables
- Variable names starting with an underscore are reserved to the pkgsrc infrastructure.
- Variable names consisting of lower-case letters should only be used in make's .for loops.
- Variable names starting with an upper-case letter and consisting only of upper-case letters and underscores are mostly freely usable.
- This definition includes parameterized variables, such as PKG_OPTIONS.mplayer.
- Note: Some of the variables from this namespace are already used by pkgsrc.
Diagnostic messages (errors and warnings)
Diagnostics are an important piece of every complex software system. In pkgsrc, they are designed carefully to help the user find problems quickly. A good diagnostic message consists of:
- The word "ERROR" or "WARNING".
- The file that generated the message.
- More information about the context in which the message was generated (for example, the make target).
- The filename, possibly including the line number, where the problem was found.
- The message text, which may include more details.
The message text consists of a fixed part and variable parts (for example, filenames, variable values). The fixed part should be easily searchable by search engines like Google.
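As a sketch, the parts listed above could be assembled by a small shell function. The function name diagnostic, the field order and the punctuation are assumptions made for this illustration; pkgsrc does not prescribe this exact layout.

```sh
# Hypothetical formatter for diagnostic messages.
# Usage: diagnostic severity source context location message...
diagnostic() {
	severity=$1 source=$2 context=$3 location=$4
	shift 4
	# Diagnostics go to standard error.
	echo "$severity: [$source] $context: $location: $*" >&2
}

diagnostic WARNING example.mk do-build Makefile:12 "Unknown variable FOO."
```

Keeping the fixed part of the message text free of variable parts is what makes it searchable.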
Hooks
Hooks are special make targets that run ${DO_NADA}. Their purpose is that other targets can depend on them, as in the following example:
check-shlibs: privileged-install-hook
Files other than the one that defines the hook must not add any code to the hook itself. They may only make other targets depend on the hook.
By convention, all hook names are of the form foo-hook.
The following hooks are currently provided, in chronological order:
- pre-depends-hook
- pre-configure-checks-hook
- do-configure-pre-hook
- do-configure-post-hook
- pre-build-checks-hook
- privileged-install-hook
- unprivileged-install-hook
Build-time consistency checks
Pkgsrc contains a lot of well-maintained packages, where a large organization takes care of programming style, such as the KDE and GNOME projects. But it also contains many small packages written by people who simply do not know how to write software that runs everywhere, just because they didn't need it yet. These packages usually contain assumptions that may well hold on a GNU/Linux system on the i386 platform using gcc-3 as the compiler.
On other systems, the software may fail silently (which is very bad, as it is hard to track down the actual defect) or it may not build at all. The latter is annoying to the pkgsrc user, but good for the developers, as they can see that something's wrong.
Since one of the pkgsrc mottos is "It should only work if it is correct", these kinds of bugs should be found as early as possible. In many cases this means failing to compile or to link the package, so it does not get installed until it is done right.
There are already some basic checks that are run during the build of a package:
- All shell scripts that have /bin/sh as interpreter must not contain a test command that makes use of the == operator.
- After configuring a package, all *.h files must be free of macros that contain strings like "${prefix}/share", since they are usually caused by incorrect usage of autoconf macros.
- A package must not install files that are world-writable or have other weird permissions.
- A package must not install files that start with #!, but don't mention an existing interpreter.
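The first of these checks could be approximated as in the following sketch. It is not the pkgsrc implementation: the function name check_test_eqeq is hypothetical, and the regular expression is a heuristic that may produce false positives, for example in comments.

```sh
# Hypothetical check: reports /bin/sh scripts that appear to use the
# == operator inside test or [ ... ].
check_test_eqeq() {
	status=0
	for file in "$@"; do
		# Only look at files whose first line names /bin/sh.
		IFS= read -r first < "$file" || [ -n "$first" ] || continue
		case $first in
		'#!/bin/sh'*|'#! /bin/sh'*) ;;
		*) continue ;;
		esac
		# grep prints the offending lines together with their line numbers.
		if grep -n -E '(^|[^[:alnum:]_])(test|\[)[[:space:]].*[[:space:]]==[[:space:]]' "$file"; then
			echo "ERROR: $file: the == operator of test is not portable." >&2
			status=1
		fi
	done
	return $status
}
```

The function returns a non-zero status if any of the given files is affected, so it can be used to make a build fail.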
Proposed future checks
- Check all source code files for string literals that contain absolute pathnames, especially for device files. To put this check into good use, the section on "Device files" in The pkgsrc portability guide must first be written.
- Check all C and C++ source code files for the use of the inline keyword on functions without static. This is one thing that the SunPro compiler cannot handle.
- Add hooks to the pkgsrc compiler wrappers to allow static analysis tools to be hooked upon the calls to the actual compiler. This allows for many other checks. (See How to use static analysis tools within pkgsrc.)
- Make the compilation fail for all source files that use functions without including the appropriate header first (keyword: -Werror-implicit-function-declaration).
- Normalize all installed files, especially interpreter scripts, so that their first line always contains the valid interpreter. (XXX: Beware of side-effects with the alternatives framework.)
- Make sure that the file permissions of the installed files are minimal, that is, 0444 for regular files, 0555 for script files and programs.
- Specify the permissions of the installed files more strictly. All installed files should be 0444 by default, except for bin/*, sbin/*, lib/*.la, lib/*.so and lib/*.so.*. All files that have other permissions must be marked somehow, either in the PLIST or in the package Makefile.
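The default rule from the last item can be expressed as a shell pattern match. In this sketch the function name expected_mode is hypothetical, the paths are assumed to be relative to the installation prefix, and the word "exempt" merely marks files for which the item above does not state a mode.

```sh
# Hypothetical helper: prints the default mode for an installed file,
# or "exempt" for the locations that are excluded from the 0444 default.
expected_mode() {
	case $1 in
	bin/*|sbin/*|lib/*.la|lib/*.so|lib/*.so.*)
		echo "exempt" ;;
	*)
		echo "0444" ;;
	esac
}

expected_mode share/doc/README	# prints "0444"
```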
Regression tests
The pkgsrc infrastructure consists of a large codebase, and there are many corners where every little bit of a file is well thought out, making pkgsrc likely to fail as soon as anything is changed near those parts. To prevent most changes from breaking anything, a suite of regression tests should go along with every important part of the pkgsrc infrastructure. This chapter describes how regression tests work in pkgsrc and how you can add new tests.
The regression tests framework
Running the regression tests
You first need to install the pkgtools/pkg_regress package, which provides the pkg_regress command. Then you can simply run that command, which will run all tests in the regress category.
Adding a new regression test
Every directory in the regress category that contains a file called spec is considered a regression test. This file is a shell program that is included by the pkg_regress command. The following functions can be overridden to suit your needs.
Overridable functions
These functions do not take any parameters. They are all called in set -e mode, so you should be careful to check the exitcodes of any commands you run in the test.
- do_setup prepares the environment for the test. By default it does nothing.
- do_test runs the actual test. By default, it calls TEST_MAKE with the arguments MAKEARGS_TEST and writes its output including error messages into the file TEST_OUTFILE.
- check_result is run after the test and is typically used to compare the actual output with the expected output. It can make use of the various helper functions from the next section.
- do_cleanup cleans everything up after the test has been run. By default it does nothing.
Helper functions
- exit_status expected compares the exitcode of the do_test function with expected. If they differ, the test will fail.
- output_require [regex...] checks for each of its parameters if the output from do_test matches the extended regular expression. If it does not, the test will fail.
- output_prohibit [regex...] checks for each of its parameters if the output from do_test does not match the extended regular expression. If any of the regular expressions matches, the test will fail.
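To show how the overridable functions and the helpers fit together, here is a self-contained sketch. The upper part contains minimal stand-ins for what pkg_regress provides; the variable name TEST_EXITSTATUS and the bodies of exit_status and output_require are assumptions made for this illustration. A real spec file would contain only the lower part and would usually run make in do_test.

```sh
# --- Minimal stand-ins for the pkg_regress environment (assumptions) ---
TEST_OUTFILE=${TMPDIR:-/tmp}/spec-sketch.$$
TEST_EXITSTATUS=0

exit_status() {
	[ "$TEST_EXITSTATUS" -eq "$1" ] || { echo "FAIL: exit status"; return 1; }
}

output_require() {
	for re in "$@"; do
		grep -E -q -- "$re" "$TEST_OUTFILE" || { echo "FAIL: missing $re"; return 1; }
	done
}

# --- What a spec file might contain ---
do_test() {
	# Runs a fixed command instead of make and records its exit status.
	echo "hello from the test" > "$TEST_OUTFILE" 2>&1 || TEST_EXITSTATUS=$?
}

check_result() {
	exit_status 0 && output_require "^hello" "test$"
}

do_cleanup() {
	rm -f "$TEST_OUTFILE"
}

do_test && check_result && echo "PASS"
do_cleanup
```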
From the DIP book
The pkgsrc infrastructure consists of many small Makefile fragments. Each such fragment needs a properly specified interface. This chapter explains what such an interface looks like.
The meaning of variable definitions
<para>Whenever a variable is defined in the pkgsrc infrastructure, the location and the way of definition provide much information about the intended use of that variable. Additionally, more documentation may be found in a header comment or in this pkgsrc guide.</para>
<para>A special file is <filename>mk/defaults/mk.conf</filename>, which lists all variables that are intended to be user-defined. They are either defined using the <literal>?=</literal> operator or they are left undefined because defining them to anything would effectively mean <quote>yes</quote>. All these variables may be overridden by the pkgsrc user in the <varname>MAKECONF</varname> file.</para>
<para>Outside this file, the following conventions apply: Variables that are defined using the <literal>?=</literal> operator may be overridden by a package.</para>
<para>Variables that are defined using the <literal>=</literal> operator may be used read-only at run-time.</para>
<para>Variables whose name starts with an underscore must not be accessed outside the pkgsrc infrastructure at all. They may change without further notice.</para>
<note><para>These conventions are currently not applied consistently to the complete pkgsrc infrastructure.</para></note>
Avoiding problems before they arise
<para>All variables that contain lists of things should default to being empty. Two examples that do not follow this rule are <varname>USE_LANGUAGES</varname> and <varname>DISTFILES</varname>. These variables cannot simply be modified using the <literal>+=</literal> operator in package <filename>Makefile</filename>s (or other files included by them), since there is no guarantee whether the variable is already set or not, and what its value is. In the case of <varname>DISTFILES</varname>, the packages <quote>know</quote> the default value and just define it as in the following example.</para>
DISTFILES= ${DISTNAME}${EXTRACT_SUFX} additional-files.tar.gz
<para>Because of the selection of this default value, the same value appears in many package Makefiles. Similarly for <varname>USE_LANGUAGES</varname>, but in this case the default value (<quote><literal>c</literal></quote>) is so short that it doesn't stand out. Nevertheless it is mentioned in many files.</para>
Variable evaluation
At load time
<para>Variable evaluation takes place either at load time or at runtime, depending on the context in which they occur. The contexts where variables are evaluated at load time are:</para>
- The right hand side of the <literal>:=</literal> and <literal>!=</literal> operators,
- Make directives like <literal>.if</literal> or <literal>.for</literal>,
- Dependency lines.
<para>References to the iteration variables of <literal>.for</literal> loops are a special exception: they are expanded inline, no matter in which context they appear.</para>
<para>As the values of variables may change during load time, care must be taken not to evaluate them by accident. Typical examples for variables that should not be evaluated at load time are <varname>DEPENDS</varname> and <varname>CONFIGURE_ARGS</varname>. To make the effect more clear, here is an example:</para>
CONFIGURE_ARGS= # none
CFLAGS= -O
CONFIGURE_ARGS+= CFLAGS=${CFLAGS:Q}
CONFIGURE_ARGS:= ${CONFIGURE_ARGS}
CFLAGS+= -Wall
<para>This code shows how the use of the <literal>:=</literal> operator can quickly lead to unexpected results. The first paragraph is fairly common code. The second paragraph evaluates the <varname>CONFIGURE_ARGS</varname> variable, which results in <literal>CFLAGS=-O</literal>. In the third paragraph, the <literal>-Wall</literal> is appended to the <varname>CFLAGS</varname>, but this addition will not appear in <varname>CONFIGURE_ARGS</varname>. In actual code, the three paragraphs from above typically occur in completely unrelated files.</para>
At runtime
<para>After all the files have been loaded, the values of the variables cannot be changed anymore. Variables that are used in the shell commands are expanded at this point.</para>
How can variables be specified?
<para>There are many ways in which the definition and use of a variable can be restricted in order to detect bugs and violations of the (mostly unwritten) policies. See the <literal>pkglint</literal> developer's documentation for further details.</para>
Designing interfaces for Makefile fragments
<para>Most of the <filename>.mk</filename> files fall into one of the following classes. Cases where a file falls into more than one class should be avoided as it often leads to subtle bugs.</para>
Procedures with parameters
<para>In a traditional imperative programming language some of the <filename>.mk</filename> files could be described as procedures. They take some input parameters and, after inclusion, provide a result in output parameters. Since all variables in <filename>Makefile</filename>s have global scope, care must be taken not to use parameter names that already have another meaning. For example, <varname>PKGNAME</varname> is a bad choice for a parameter name.</para>
<para>Procedures are completely evaluated at preprocessing time. That is, when calling a procedure all input parameters must be completely resolvable. For example, <varname>CONFIGURE_ARGS</varname> should never be an input parameter since it is very likely that further text will be added after calling the procedure, which would effectively apply the procedure to only a part of the variable. Also, variables that it refers to may be modified after the procedure has been called.</para>
<para>A procedure can declare its output parameters either as suitable for use in preprocessing directives or as only available at runtime. The latter alternative is for variables that contain references to other runtime variables.</para>
<para>Procedures shall be written such that it is possible to call the procedure more than once. That is, the file must not contain multiple-inclusion guards.</para>
<para>Examples for procedures are <filename>mk/bsd.options.mk</filename> and <filename>mk/buildlink3/bsd.builtin.mk</filename>. To express that the parameters are evaluated at load time, they should be assigned using the <literal>:=</literal> operator, which should be used only for this purpose.</para>
Actions taken on behalf of parameters
<para>Action files take some input parameters and may define runtime variables. They shall not define loadtime variables. There are action files that are included implicitly by the pkgsrc infrastructure, while others must be included explicitly.</para>
<para>An example for action files is <filename>mk/subst.mk</filename>.</para>
The order in which files are loaded
<para>Package <filename>Makefile</filename>s usually consist of a set of variable definitions, and include the file <filename>../../mk/bsd.pkg.mk</filename> in the very last line. Before that, they may also include various other <filename>*.mk</filename> files if they need to query the availability of certain features like the type of compiler or the X11 implementation. Due to the heavy use of preprocessor directives like <literal>.if</literal> and <literal>.for</literal>, the order in which the files are loaded matters.</para>
<para>This section describes at which point the various files are loaded and gives reasons for that order.</para>
The order in <filename>bsd.prefs.mk</filename>
<para>The very first action in <filename>bsd.prefs.mk</filename> is to define some essential variables like <varname>OPSYS</varname>, <varname>OS_VERSION</varname> and <varname>MACHINE_ARCH</varname>.</para>
<para>Then, the user settings are loaded from the file specified in <varname>MAKECONF</varname>. If the bmake command from pkgsrc is used, <varname>MAKECONF</varname> defaults to <filename><replaceable>${prefix}</replaceable>/etc/mk.conf</filename>. With the native &man.make.1; command on NetBSD, it defaults to <filename>/etc/mk.conf</filename>. After that, those variables that have not been overridden by the user are loaded from <filename>mk/defaults/mk.conf</filename>.</para>
<para>After the user settings, the system settings and platform settings are loaded, which may override the user settings.</para>
<para>Then, the tool definitions are loaded. The tool wrappers are not yet in effect. This only happens when building a package, so the proper variables must be used instead of the direct tool names.</para>
<para>As the last steps, some essential variables from the wrapper and the package system flavor are loaded, as well as the variables that have been cached in earlier phases of a package build.</para>
The order in <filename>bsd.pkg.mk</filename>
<para>First, <filename>bsd.prefs.mk</filename> is loaded.</para>
<para>Then, the various <filename>*-vars.mk</filename> files are loaded, which fill in default values for those variables that have not been defined by the package. These variables may later be used even in unrelated files.</para>
<para>Then, the file <filename>bsd.pkg.error.mk</filename> provides the target <literal>error-check</literal> that is added as a special dependency to all other targets that use <varname>DELAYED_ERROR_MSG</varname> or <varname>DELAYED_WARNING_MSG</varname>.</para>
<para>Then, the package-specific hacks from <filename>hacks.mk</filename> are included.</para>
<para>Then, various other files follow. Most of them don't have any dependencies on what they need to have included before or after them, though some do.</para>
<para>The code to check <varname>PKG_FAIL_REASON</varname> and <varname>PKG_SKIP_REASON</varname> is then executed, which restricts the use of these variables to all the files that have been included before. Appearances in later files will be silently ignored.</para>
<para>Then, the files for the main targets are included, in the order of later execution, though the actual order should not matter.</para>
<para>At last, some more files are included that don't set any interesting variables but rather just define make targets to be executed.</para>
