The Oprofile Profiler
As a developer, you want a number of useful tools in your arsenal for testing and debugging. These include debuggers, memory checkers and performance profilers.
In this blog post I would like to present some information about a profiler tool that I feel deserves more recognition: oprofile.
What It Does
Oprofile (1) is an open source tool for statistical performance profiling. It runs on Linux systems and uses the CPU's hardware performance counters to profile code with low overhead. It can profile all code, including applications, interrupt handlers, kernel code and shared libraries. It produces profile data at run time, which is then post-processed into human-readable reports.
Older versions of the tool required a special kernel driver and daemon, but on Linux kernel 2.6.31 or later it uses the standard kernel performance events subsystem. Because of this, unlike most profiling tools, no special recompilation or linking is needed and it can profile any process running on the system.
Oprofile works on a wide range of CPUs, including Intel x86 (32- and 64-bit), DEC Alpha, MIPS, ARM, SPARC64, ppc64 and AVR32. Because it has low overhead, it is suitable for use on embedded Linux systems. On x86 and most ARM platforms, oprofile can also produce call graph profiling data.
The main limitation is that it runs only on Linux and requires a 2.6.31 or later kernel (unless you use an older version of oprofile).
Some Examples
Let's look at a simple example of profiling a C++ program.
On many Linux distributions oprofile is available as a standard package so you don't need to build it yourself. With recent versions of Ubuntu, for example, you can install it by running "sudo apt-get install oprofile".
Let's use the following C++ program as an example (2). It doesn't do anything particularly meaningful, but the calculations will take some time to run so that we can see where the time is spent. You can adjust the number of loop iterations if it runs too slowly or too quickly on your system.
#include <iostream>

int main()
{
    long l = 0;
    long long ll = 0;
    float f = 0;
    double d = 0;

    for (long i = 0; i < 1000000000; i++) {
        l += 2*i*i + 3*i + 4;
        ll += 3*i*i + 4*i + 5;
        f += 4*i*i + 5*i + 6;
        d += 5*i*i + 6*i + 7;
    }

    std::cout << "l = " << l << std::endl;
    std::cout << "ll = " << ll << std::endl;
    std::cout << "f = " << f << std::endl;
    std::cout << "d = " << d << std::endl;
}
You compile your code as you normally would and then run it under the operf program, which stores the profiling data it collects in an oprofile_data subdirectory by default.
On my system, I see the following output when I first compile and then run it, where it takes about seven seconds to execute (the commands I typed follow the % prompt):
% g++ -g -Wall -o example1 example1.cpp
% operf ./example1
operf: Profiler started
l = 389932800
ll = -33545635511552
f = 6.95627e+11
d = -3.77758e+12

Profiling done.
The opreport program can then be used to read the profile data that was collected and display it in a meaningful way. Here is the output on my system when run with the default options:
% opreport --symbols
Using /home/tranter/git/jtranter/blog/oprofile/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name           symbol name
183018   99.9727  example1             main
33        0.0180  no-vmlinux           /no-vmlinux
5         0.0027  ld-2.19.so           _dl_lookup_symbol_x
4         0.0022  ld-2.19.so           do_lookup_x
2         0.0011  ld-2.19.so           check_match.9376
1        5.5e-04  ld-2.19.so           _dl_map_object_from_fd
1        5.5e-04  ld-2.19.so           _dl_name_match_p
1        5.5e-04  ld-2.19.so           _dl_relocate_object
1        5.5e-04  ld-2.19.so           strcmp
1        5.5e-04  libgcc_s.so.1        /lib/i386-linux-gnu/libgcc_s.so.1
1        5.5e-04  libstdc++.so.6.0.19  /usr/lib/i386-linux-gnu/libstdc++.so.6.0.19
From this we can observe that over 99% of the time was spent in the application binary itself and negligible time in the shared libraries it uses, the Linux kernel (/no-vmlinux), or the dynamic linker/loader (ld.so). Within the application, the time was spent in the function main.
To drill down further into where in the code the time was spent, we can annotate the source code with timing information. For that to work we need symbolic debugging information. Earlier we compiled with the -g option, so the binary should have that information, and we can run the opannotate tool to annotate the code:
% opannotate --source --output-dir=$PWD
Using /home/tranter/git/jtranter/blog/oprofile/oprofile_data/samples/ for session-dir
warning: /no-vmlinux could not be found.
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/dl-lookup.c
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/string/../sysdeps/i386/i686/multiarch/../strcmp.S
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/get-dynamic-info.h
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/dl-misc.c
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/../sysdeps/i386/dl-machine.h
If source code for the C run-time library had been available, opannotate could have annotated that for us as well; since it was not, we get the warnings above.
We should now have an annotated version of the source file for the application itself. On my system it looked like the following:
               :#include <iostream>
               :
               :int main()
               :{ /* main total: 186787 99.9615 */
               :    long l = 0;
               :    long long ll = 0;
               :    float f = 0;
               :    double d = 0;
               :
  7720  4.1315 :    for (long i = 0; i < 1000000000; i++) {
 20853 11.1598 :        l += 2*i*i + 3*i + 4;
 31944 17.0952 :        ll += 3*i*i + 4*i + 5;
 78718 42.1270 :        f += 4*i*i + 5*i + 6;
 47552 25.4481 :        d += 5*i*i + 6*i + 7;
               :    }
               :
               :    std::cout << "l = " << l << std::endl;
               :    std::cout << "ll = " << ll << std::endl;
               :    std::cout << "f = " << f << std::endl;
               :    std::cout << "d = " << d << std::endl;
               :}
/*
 * Total samples for file : "/home/tranter/git/jtranter/blog/oprofile/example1.cpp"
 *
 * 186787 99.9615
 */

/*
 * Command line: opannotate --source --output-dir=/home/tranter/git/jtranter/blog/oprofile
 *
 * Interpretation of command line:
 * Output annotated source file with samples
 * Output all files
 *
 * CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
 * Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
 */
From the above we can see the number of "hits", that is, the number of times the statistical profiler caught the program executing the line of code in question, as well as the percentage of overall samples attributed to that line.
In this particular program, we see the majority of the time was spent, not surprisingly, in the for loop.
We can also make some other observations, such as the fact that the calculation using type float apparently takes longer than the one using type double. Further investigation would be warranted if we wanted to confirm whether that is actually the case.
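One way to investigate would be to time the two loops directly. Here is a minimal sketch of such a micro-benchmark (my own follow-up, not part of the original example) that mirrors the float and double loop bodies from example1 and times each with std::chrono:

// float_vs_double.cpp: hypothetical micro-benchmark to cross-check the
// profiler's suggestion that the float loop is slower than the double loop.
// Build with something like: g++ -std=c++11 -g -Wall -o float_vs_double float_vs_double.cpp
#include <chrono>
#include <iostream>

int main()
{
    const long n = 1000000000;

    // Time the float loop from example1.
    float f = 0;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < n; i++) {
        f += 4*i*i + 5*i + 6;
    }
    auto floatTime = std::chrono::steady_clock::now() - start;

    // Time the double loop from example1.
    double d = 0;
    start = std::chrono::steady_clock::now();
    for (long i = 0; i < n; i++) {
        d += 5*i*i + 6*i + 7;
    }
    auto doubleTime = std::chrono::steady_clock::now() - start;

    // Print the sums so the compiler cannot optimize the loops away.
    std::cout << "f = " << f << " ("
              << std::chrono::duration_cast<std::chrono::milliseconds>(floatTime).count()
              << " ms)" << std::endl;
    std::cout << "d = " << d << " ("
              << std::chrono::duration_cast<std::chrono::milliseconds>(doubleTime).count()
              << " ms)" << std::endl;
}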
As for the overhead of running under the profiler, we can compare execution times and see that profiling was not much slower than running the application normally:
% time operf ./example1
operf: Profiler started
l = 389932800
ll = -33545635511552
f = 6.95627e+11
d = -3.77758e+12

Profiling done.

real    0m7.467s
user    0m9.088s
sys     0m5.824s

% time ./example1
l = 389932800
ll = -33545635511552
f = 6.95627e+11
d = -3.77758e+12

real    0m6.749s
user    0m6.740s
sys     0m0.000s
Note, however, that the sys time was much higher. This is time spent in the Linux kernel, presumably due to the use of the kernel's performance counter support.
One tip: if you don't see an annotated source file after running opannotate, the most likely reason is that the program didn't run long enough for the profiler to record any sample hits. Only files that had samples get annotated.
Now let's look at a simple Qt program example. This example uses some of the Qt container classes and might give some insight into how they compare in terms of performance. Here is the source listing:
#include <QDebug>
#include <QLinkedList>
#include <QList>
#include <QVector>

int main()
{
    QList<int> list;
    QLinkedList<int> linkedList;
    QVector<int> vector;

    for (long i = 0; i < 10000000; i++) {
        list << i;
        linkedList << i;
        vector << i;
    }

    /*
    for (long i = 0; i < 500000; i++) {
        list.prepend(i);
        linkedList.prepend(i);
        vector.prepend(i);
    }
    */

    long long total = 0;
    foreach (auto item, list) {
        total += item;
    }
    qDebug() << "total = " << total;

    total = 0;
    foreach (auto item, linkedList) {
        total += item;
    }
    qDebug() << "total = " << total;

    total = 0;
    foreach (auto item, vector) {
        total += item;
    }
    qDebug() << "total = " << total;
}
These are the commands I used to build and run it and collect profiling information:
% qmake
% make
% operf ./example2
operf: Profiler started
total = 49999995000000
total = 49999995000000
total = 49999995000000

Profiling done.

% opreport --symbols
Using /home/tranter/git/jtranter/blog/oprofile/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name           symbol name
7356     12.5832  example2             main
6474     11.0744  libc-2.19.so         _int_malloc
3723      6.3686  example2             QtPrivate::RefCount::isShared() const
3561      6.0914  example2             QLinkedList<int>::append(int const&)
3142      5.3747  libc-2.19.so         malloc
3040      5.2002  example2             __x86.get_pc_thunk.bx
2875      4.9180  libc-2.19.so         _int_free
2777      4.7503  no-vmlinux           /no-vmlinux
2383      4.0764  example2             QVector<int>::append(int const&)
2309      3.9498  libstdc++.so.6.0.19  /usr/lib/i386-linux-gnu/libstdc++.so.6.0.19
2219      3.7958  libQt5Core.so.5.5.1  /usr/local/Qt-5.5.1/lib/libQt5Core.so.5.5.1
2217      3.7924  example2             QBasicAtomicInteger<int>::load() const
1857      3.1766  example2             QList<int>::append(int const&)
1794      3.0688  example2             QArrayData::data()
1174      2.0082  example2             QLinkedList<int>::detach()
1159      1.9826  example2             QTypedArrayData<int>::end(int*)
1013      1.7328  libc-2.19.so         free
857       1.4660  example2             QLinkedList<int>::freeData(QLinkedListData*)
793       1.3565  example2             QTypedArrayData<int>::data()
757       1.2949  example2             QVector<int>::isDetached() const
691       1.1820  example2             QList<int>::operator<<(int const&)
689       1.1786  example2             QLinkedListNode<int>::QLinkedListNode(int const&)
585       1.0007  example2             QVector<int>::operator<<(int const&)
575       0.9836  example2             QLinkedList<int>::operator<<(int const&)
554       0.9477  example2             QList<int>::const_iterator::operator*() const
538       0.9203  libc-2.19.so         __x86.get_pc_thunk.bx
448       0.7663  example2             QLinkedList<int>::const_iterator::operator!=(QLinkedList<int>::const_iterator const&) const
438       0.7492  example2             QList<int>::const_iterator::operator!=(QList<int>::const_iterator const&) const
427       0.7304  example2             QList<int>::node_construct(QList<int>::Node*, int const&)
419       0.7167  example2             QList<int>::const_iterator::operator++()
403       0.6894  example2             QLinkedList<int>::const_iterator::operator++()
321       0.5491  example2             qt_noop()
294       0.5029  example2             QLinkedList<int>::const_iterator::operator*() const
287       0.4909  libc-2.19.so         __memcpy_ssse3_rep
269       0.4602  example2             QList<int>::Node::t()
13        0.0222  ld-2.19.so           do_lookup_x
10        0.0171  ld-2.19.so           _dl_lookup_symbol_x
5         0.0086  ld-2.19.so           strcmp
2         0.0034  ld-2.19.so           _dl_relocate_object
2         0.0034  ld-2.19.so           strlen
2         0.0034  libicuuc.so.52.1     /usr/lib/i386-linux-gnu/libicuuc.so.52.1
1         0.0017  ld-2.19.so           _dl_catch_error
1         0.0017  ld-2.19.so           _dl_fixup
1         0.0017  ld-2.19.so           _dl_map_object_deps
1         0.0017  ld-2.19.so           dl_main
1         0.0017  libc-2.19.so         _dl_addr
1         0.0017  libc-2.19.so         pthread_mutex_init
1         0.0017  libc-2.19.so         realloc

% opannotate --source --output-dir=$PWD
% less home/tranter/git/jtranter/blog/oprofile/example2.cpp
               :#include <QDebug>
               :#include <QLinkedList>
               :#include <QList>
               :#include <QVector>
               :
               :int main()
               :{ /* main total: 7356 12.5832 */
               :
               :    QList<int> list;
               :    QLinkedList<int> linkedList;
               :    QVector<int> vector;
               :
   319  0.5457 :    for (long i = 0; i < 10000000; i++) {
   330  0.5645 :        list << i;
   477  0.8160 :        linkedList << i;
   235  0.4020 :        vector << i;
               :    }
               :
               :    /*
               :    for (long i = 0; i < 500000; i++) {
               :        list.prepend(i);
               :        linkedList.prepend(i);
               :        vector.prepend(i);
               :    }
               :    */
               :
               :    long long total = 0;
  1625  2.7797 :    foreach (auto item, list) {
   504  0.8621 :        total += item;
               :    }
               :    qDebug() << "total = " << total;
               :
               :    total = 0;
  1661  2.8413 :    foreach (auto item, linkedList) {
   576  0.9853 :        total += item;
               :    }
               :    qDebug() << "total = " << total;
               :
               :    total = 0;
  1124  1.9227 :    foreach (auto item, vector) {
   505  0.8639 :        total += item;
               :    }
               :    qDebug() << "total = " << total;
               :}
/*
 * Total samples for file : "/home/tranter/git/jtranter/blog/oprofile/example2.cpp"
 *
 * 7356 12.5832
 */

/*
 * Command line: opannotate --source --output-dir=/home/tranter/git/jtranter/blog/oprofile
 *
 * Interpretation of command line:
 * Output annotated source file with samples
 * Output all files
 *
 * CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
 * Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
 */
From the above, a few observations immediately jump out: the application spends a considerable amount of time in the standard C library in malloc and related functions, and within Qt it spends a lot of time in QtPrivate::RefCount::isShared() and QLinkedList<int>::append().
Looking at the annotated source code, not a lot of time is spent in the main function itself, and most of what is spent there goes to the loop iteration. The times spent appending to and iterating over the QList, QLinkedList and QVector data structures are not too different.
For comparison, if you uncomment the code that prepends the elements to the start of the containers rather than appending to the end, you will see a significant difference in performance between the different container types. For example, the majority of the time ends up being spent in the memmove() function as items are prepended to the QVector, because the existing elements all have to be moved. On my system oprofile even annotated the relevant Qt source code where this was happening.
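If you want wall-clock numbers to go with the profiler's view, here is a rough sketch (my own addition, not part of the original example) that times the prepend loops directly with QElapsedTimer:

// prepend_timing.cpp: hypothetical timing comparison of prepending to a
// QVector versus a QLinkedList; build with qmake/make as for example2.
#include <QDebug>
#include <QElapsedTimer>
#include <QLinkedList>
#include <QVector>

int main()
{
    QElapsedTimer timer;

    // Prepending to a QVector shifts every existing element each time.
    QVector<int> vector;
    timer.start();
    for (long i = 0; i < 500000; i++) {
        vector.prepend(i);
    }
    qDebug() << "QVector prepend:" << timer.elapsed() << "ms";

    // Prepending to a QLinkedList is a constant-time node insertion.
    QLinkedList<int> linkedList;
    timer.restart();
    for (long i = 0; i < 500000; i++) {
        linkedList.prepend(i);
    }
    qDebug() << "QLinkedList prepend:" << timer.elapsed() << "ms";
}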
Summary
Oprofile has a number of advanced features we didn't cover. It can produce call graph information that can be read and displayed by the graphical KCachegrind (3) application. Because oprofile is a system-wide profiler, it can generate performance information for all processes running on the system (subject to permissions). This is useful if your application communicates with other processes that are the performance bottleneck. This can be the case, for example, when using X11 or Wayland as your rendering back end.
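For instance, call graph data can be collected and displayed using the --callgraph options of operf and opreport (a sketch based on the man pages; output not shown here):

% operf --callgraph ./example1
% opreport --callgraph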
Unfortunately, oprofile doesn't work on the Raspberry Pi due to limitations of the ARM SoC hardware used on that platform. Other profiling tools such as gprof, gperftools and valgrind should work (4), but I have not personally tried them.
Oprofile is one of the tools you should become familiar with and have in your developer bag of tricks. If you have used other profiling tools in the past, I encourage you to take a look at it.
References
1. OProfile Project, SourceForge web site, last accessed 15 Jan 2016, http://oprofile.sourceforge.net
2. Source Code for Example Programs, last accessed 15 Jan 2016, ftp://ftp.ics.com/pub/pickup/oprofile.zip
3. The KCachegrind Handbook, KDE documentation web site, last accessed 15 Jan 2016, https://docs.kde.org/trunk4/en/kdesdk/kcachegrind/index.html
4. Software Profiling Tools for Raspberry Pi, blog post, last accessed 15 Jan 2016, http://www.softwarecoven.com/software-profiling-tools-for-raspberry-pi/