tuning-5 : prefer -O3
---------------------

In this run, I moved to -march=native -O3 with cheap hardening. When
building the pythons I also enabled expensive optimization (not in the
LFS Python3 build, I rebuild it early on after completing LFS). This
adds PGO, so the builds take _much_ longer. For python2, I guess that
might be useful if you use python plugins in gimp, but for everything
else (browsers, etc) it seems unlikely that you will recoup the extra
build time. For python3, I suspect this shows some benefit when making
the kernel documentation, and will also be beneficial if you script in
Python, but the extra time taken by Python3 is enormous (about 20 SBU).

Despite the implication in Python2's configure help, this does NOT
enable LTO, that is a separate option. Similarly, I interpreted a
comment elsewhere to mean that libreoffice uses LTO by default: it
doesn't, it has a similar configure switch to enable it.
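For the record, a minimal sketch of the switches in question (the
prefix and any other options are assumptions, adjust to your own
scripts):

  # Python3: --enable-optimizations turns on the expensive PGO build
  # (the ~20 SBU cost); LTO is the separate --with-lto switch
  ./configure --prefix=/usr --enable-optimizations --with-lto

  # libreoffice: LTO is off by default, enable it explicitly
  ./autogen.sh --enable-lto [other switches]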
New test failures
-----------------

In the binutils tests I got

  Running /building/binutils-2.32/ld/testsuite/ld-gc/gc.exp ...
  FAIL: Check --gc-section
  FAIL: Check --gc-section/-q
  FAIL: Check --gc-section/-r/-e

and also

  FAIL: S-records with constructors

which results in

      === ld Summary ===

  # of expected passes      2337
  # of unexpected failures  4

When using it to see if it can build itself (only LFS!) I got the
same. Running it manually in chroot, ld/ld.log showed:

  spawn [open ...]^M
  0000000000404004 D __bss_start
  0000000000000000 A __stack_chk_fail
  0000000000404004 D _edata
  0000000000404008 D _end
  0000000000401000 T main
  0000000000401010 t used_func.constprop.0
  0000000000404000 D used_var
  used sections do not exist
  FAIL: Check --gc-section

and similarly for the other two Check --gc-section variants.

  ./ld-new: tmpdir/sr3.o: in function `Foo::operator=(Foo const&)':
  /building/binutils-2.32/ld/testsuite/ld-srec/sr3.cc:109: undefined reference to `memmove'
  ./ld-new: tmpdir/sr3.o: in function `Foo::Foo(Foo const&)':
  /building/binutils-2.32/ld/testsuite/ld-srec/sr3.cc:104: undefined reference to `memmove'
  ./ld-new: tmpdir/sr3.o: in function `Foo::operator=(Foo const&)':
  /building/binutils-2.32/ld/testsuite/ld-srec/sr3.cc:109: undefined reference to `memmove'
  ./ld-new: tmpdir/sr3.o: in function `Foo::Foo(Foo const&)':
  /building/binutils-2.32/ld/testsuite/ld-srec/sr3.cc:104: undefined reference to `memmove'
  FAIL: S-records with constructors

My initial approach was to detune both binutils and gcc to -O2 (I had
the impression that compiles were perhaps faster if gcc used -O2, but
that seems to be a false impression). These failures in binutils are
technically a testsuite problem: it looks for things which have been
optimized away, so now I am inclined to go with -O3 here.

I also had a new failure in the coreutils tests on both this run and
the subsequent build-itself run:

  FAIL: tests/misc/env-signal-handler.sh

Unfortunately, to save space I normally delete the build tree and
therefore any relevant log from the testsuite (tests/test-suite.log in
this case). Running this manually with the same CFLAGS, CXXFLAGS
passed, so I initially marked this as "cause unknown". On the rebuild
with both binutils and gcc using -O2 I did not get this failure. But
now that I've decided that I prefer -O3 for both those packages I
wanted to track down where the problem lay. In a fresh attempt at
'build itself', but stopping immediately after the coreutils tests,
all the tests passed.

But I then gave it another try and watched the CPU monitor on icewm:
the tests started out using the equivalent of just one core, but then
before this failure I saw a spike up towards all 8 cores, and this
time it did fail. The important part of the log appears to be:

  + diff -u exp-err5 err5
  --- exp-err5    2019-07-23 00:26:08.665447853 +0000
  +++ err5        2019-07-23 00:26:08.665447853 +0000
  @@ -1 +1,2 @@
   timeout: sending signal INT to command 'sleep'
  +timeout: sending signal KILL to command 'sleep'

I now think that there is a race, and when everything is more
optimized the SIGINT does not appear to get received quickly enough.
I'm going to treat this as "Seems to be racy on multicore machines,
may fail with -O3".

Changes after the initial build
-------------------------------

At the end I finally discovered how to build firefox (still 67 in
that test) using -O3 for almost everything, so I did that and updated
the notes.

I had noticed that the times for building and testing fftw3 varied
widely. In the end I concluded that the testsuite, which uses perl
and is single-threaded, was probably where most of the variation was
happening. I decided that trying to track down why there was so much
variation was probably a waste of my time. Instead, I decided that I
wanted to test the runtime performance. In the end I found some
examples of how to use (basic) fftw. Looking at my use-cases
(libsamplerate tests, and g'mic) I only need the basic build (and
with threads) and was able to hack one of the examples to get a large
enough elapsed time without using 'time' to report the results on
screen, which is comparatively slow. So, for the future I will only
build the default (double) size of the fftw3 libraries
(single-threaded and threaded).

Then I looked at testing sndfile-resample from libsamplerate, but
even on converting 48/16 to 44.1/16 it tends to detect clipping and
restart, which means it is slow. On investigation, I found a report
comparing resample tools for 48/16 to 44.1/16 - sox was better, and
can also reduce the bit depth as well as its many other capabilities,
so this will be the last time I build libsamplerate for my own use.
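As an illustration, a hypothetical sox invocation for that conversion
(the filenames are invented):

  # convert a 48kHz input to 44.1kHz, forcing 16-bit output;
  # 'rate -v' selects sox's very-high-quality resampler
  sox input-48k.wav -b 16 output-44k.wav rate -v 44100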
More tests
----------

At this point, having wanted to test fftw3 I added more tests and
revisited some things I had been only timing for one run (the kernel
build, and the kernel docs). For the kernel docs, I added
sphinx_rtd_theme which had fallen out of my build - with that I was
able to 'make pdfdocs' in the kernel tree. So, more "reinstate old
systems from backup, and retest" shenanigans.

Build with some detuning and other changes
------------------------------------------

I then made a build with binutils and gcc using -O2 instead of -O3,
with rustc and rust packages dropping -Ctarget-cpu=native, for fftw3
replacing -march=native with -mtune=native, and for sox doing the
same and also only using -O2. I also upgraded rustc and the firefox
deps so that I could build firefox-68.0.

For Python3 I recognise that enabling the expensive options seems to
be beneficial when using sphinx to build the kernel docs, but at a
high initial cost, so I tried using lto in Python3 and libreoffice.
For libreoffice that adds a little to the build time (probably more
noticeable if you don't enable as many languages/dictionaries as I
do), but for Python3 it makes the libraries bigger! This build also
picked up firefox-68.0.

I added some more runtime tests (latex variants) in the belief that
maybe gcc -O2 would turn out to be faster (OK, we all make wrong
assumptions from time to time).

When I came to building the kernel pdfdocs (mostly Sphinx, a few
seconds of XeLaTeX, dominated by Sphinx i.e. Python3) I found that
the build was slower than ever. From this I conclude that while the
expensive optimizations in Python3 are worthwhile if you do enough
Pythoning, its lto option is not worth the effort.

I eventually found some example rust programs (a version of the sieve
of Eratosthenes, and a version of coreutils) to play with. These led
me to suspect that what I was actually testing was the runtime of the
rust standard library, so I rebuilt rust using only -Copt-level=2 and
found no significant variation in the runtimes. From that, I've
concluded that the available rust optimizations, apart from using the
default unoptimized debug build which is horribly slow, do as close
to nothing as makes no difference.

Examining whether building gcc using -O2 is beneficial
------------------------------------------------------

My impression was that compiling gcc using -O3 seemed to make the
build slower. To test that, and since binutils had test failures with
-O3, I detuned both to -O2. For many things, the difference is just
noise, but for compiling the kernel I seem to see a slight benefit if
gcc is itself compiled with -O3. So, for me -O3 is preferred.

Packages where -O3 -march=native is NOT beneficial
--------------------------------------------------

Of the packages where I've been able to do runtime testing, fftw3
(for the default double size) benefits from -O3 -mtune=native rather
than -O3 -march=native. Its default optimization is -O3; for me it is
convenient to change -march= to -mtune= in the script for this
package, but otherwise you could just drop that and leave it at plain
-O3.

Similarly, sox (not in BLFS) runs faster when using -O2 -mtune=native
instead of -O3 and/or -march=native.

I went back to the -O3 system and detuned texlive and xindy to use
-O2. For texlive this definitely speeds up the build (including its
tests). I then tested the runtimes, and on balance I find this
beneficial. In particular, it seems beneficial for traditional latex.
I only detuned xindy because it is used in my tests; on reflection I
think -O3 is fine there.

Optimizing in rust programs
---------------------------

Having spent some time finding example rust programs which might be
useful for testing the runtime, I eventually decided that the lack of
variation in builds with the equivalents of -O2 and -O3 must mean
that I was actually measuring the speed of the rust standard library.
So I rebuilt rust, detuning it to the equivalent of -O2. The runtimes
were not affected.

If an existing program/package uses cargo, the default build will be
unoptimized debug, and that _will_ run slowly. By specifying a
release build it will run much faster. For standalone unpackaged
programs (e.g. examples found on the web), use the equivalents of -O2
or -O3. In my experience (for things which complete in less than 50
seconds) there is no practical difference between the two levels.
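To illustrate (sieve.rs is a stand-in name for one of those example
programs):

  # cargo: the default build is unoptimized debug,
  # ask for a release build instead
  cargo build --release

  # standalone programs: -O is shorthand for -Copt-level=2,
  # -Copt-level=3 is the rough equivalent of gcc's -O3
  rustc -O sieve.rs
  rustc -Copt-level=3 sieve.rs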
Conclusion
----------

The more that I do runtime tests, the less my confidence that they
are showing anything. Example: with gcc and binutils detuned to -O2
(so potentially slower compiles, but the same optimizations in the
programs that get built) I'm seeing faster runs of my 'rawpng' test
which basically uses ImageMagick7 and LibRaw: neither of those had
changed how I built them in this run. Similarly, earlier tests on
builds that did not use -march=native were sometimes faster if the
cheap hardening had been added, even though that cannot speed things
up.

Summary
-------

Runtime tests vary. This is the end of this particular series of
experiments.

2019-07-23.