C++ Lessons
This is the first in a series of C++ lessons, based on the excellent C++ Crash Course. For some reason, I’ve been attracted to C and then C++ since getting into technology. They seem hard, and close to the machine. I want to get closer to the machine to better understand my day-to-day programming problems. So, I’m going to write several C++ lessons as blog posts to teach myself C++ by teaching others C++.
We’re going to start with writing and compiling a Hello, World
program.
Hello, World Source Code
Most folks are familiar with Hello, World
as the first program you write while learning a new language. We’re going to write Hello, World
, but we’re going to spend most of our time talking about the underlying compilation process to (hopefully) get a better grasp of what’s going on under the hood when you compile a C++ program.
#include <cstdio>
int main() {
    printf("hello, world!");
}
I’m going to assume some familiarity with C++ (or the ability to guess what int
and main
are in this context) so that we can jump right into understanding compilation.
We’ll use #include
as our entry point. #include
is how C++ pulls in external code, similar to how JavaScript has require
/import
.
Compiling Hello, World
We need to compile our first program! Compilation happens in three parts: the preprocessor, the compiler, and the linker.
Preprocessor
First, the preprocessor takes your source files and creates a translation unit for each one. A translation unit is the source file with the contents of every included header pasted in. In our program above, the #include
is a directive saying to include content from the cstdio
header file.
The relevant lines from the cstdio
header file are:
#include <stdio.h>
namespace std
{
using ::printf;
}
There’s a lot going on here. First, there’s another #include
directive to <stdio.h>
, the C standard input/output header. That tells us that the cstdio
header file is going to make use of C’s stdio
header file.
After including the stdio
header, a namespace, std
, is declared. Namespaces are a way of managing scope. You can place classes, objects, functions, and so on, under a name, e.g., std
, and then reference those entities under that name. The cstdio
header puts printf
under the std
namespace.
Namespaces help avoid name collisions when two entities share the same name in the same scope. E.g., if we had just printf
without putting it under a particular name, two definitions of it could collide. Instead, we can put one definition under std
and another be under some other name, like exampleNameSpace
.
After declaring a namespace called std
, printf
is placed in the std
namespace. This happens with the using
keyword and the scope resolution operator, ::
. The scope resolution operator, ::
, when nothing appears to its left, as is the case here, refers to the global scope. With a qualifier on its left, it would look something like this: someNameSpace::someEntity
. The using
keyword declares that the globally scoped printf
also belongs to the std
namespace.
So, printf
, whose declaration comes from the stdio
C header file, gets placed under the std
namespace and included into our Hello, World
program for use. It’s important to know that these header files contain only declarations (a function’s name and signature), not implementations. printf
in the C stdio
header file looks like this:
int printf(const char * __restrict, ...) __printflike(1, 2);
This merely declares the printf function. The trailing macro, __printflike
, is an annotation that lets the compiler check the format string against the arguments passed (it gets defined in cdefs.h
, for the curious).
So, after pre-processing is finished, we’ll have translation units that contain both source code and information from header files. Again, that information is only declarations, not implementations. Connecting those declarations to the actual library code happens in the third step of compilation: linkage.
Compiler
The second step, after pre-processing, is to compile the translation units into object code. Object code is machine code stored in binary form. Whether or not it’s executable depends on what’s inside of it. Each translation unit gets its own object code file. If headers were included, the translation unit knows the declarations of what was included, but not how to perform the associated behavior. Above, we included cstdio
for printf
. In the associated object code file, we only have a reference to what gets used, printf
, not its actual implementation. Linking together object code files to produce a program executable by the CPU happens in the final step, which we’ll turn to shortly.
Before we turn to that final step, though, let’s take a deeper look at object code files. There are certain tools that let you peek into those machine-readable files. otool
and nm
are examples.
Let’s say we’ve saved our source file as hello-world.cc
. We can use gcc
or some other compiler to spit out just the object code file, not the final executable:
gcc -c hello-world.cc
That gives us an object code file, hello-world.o
. Note the o
extension. We can use nm
to get some details on our object file:
> nm hello-world.o
0000000000000000 T _main
U _printf
nm
lists each symbol in the object file along with its value and a letter describing its type. The number you see, 0000000000000000
, is the symbol’s value: its address within the object file’s own layout, counted in bytes (_main
here, which sits at offset zero, the very start of the code).
Notice the T
for _main
and U
for _printf
. T means that the symbol is defined in the text section, which is where executable code lives. More interestingly, U stands for undefined: _printf
is undefined! That’s because while the header includes a declaration of printf, it doesn’t include its implementation. Object code files put placeholders in for symbols that haven’t been linked to their implementations. That happens in the final step of compilation, which we’ll turn to shortly.
otool
is a powerful tool for interacting with object code files. We can get a sense of what’s going on at the machine level. E.g., let’s look at just the text section (which, per nm
, is where _main
lives):
> otool -t hello-world.o
hello-world.o:
Contents of (__TEXT,__text) section
0000000000000000 55 48 89 e5 48 83 ec 10 48 8d 3d 14 00 00 00 b0
0000000000000010 00 e8 00 00 00 00 31 c9 89 45 fc 89 c8 48 83 c4
0000000000000020 10 5d c3
That’s definitely machine code. Let’s use otool
to disassemble it:
> otool -t -v hello-world.o
hello-world.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 subq $0x10, %rsp
0000000000000008 leaq 0x14(%rip), %rdi
000000000000000f movb $0x0, %al
0000000000000011 callq 0x16
0000000000000016 xorl %ecx, %ecx
0000000000000018 movl %eax, -0x4(%rbp)
000000000000001b movl %ecx, %eax
000000000000001d addq $0x10, %rsp
0000000000000021 popq %rbp
0000000000000022 retq
That doesn’t make any sense to me, but it’s fun to look at! Take a look at the different flags for otool
to get a sense of what you can do with it.
Linking
After the preprocessor turns each source file into a translation unit, and after each translation unit is compiled into an object code file, those object files need to be linked together to form a complete program.
For our example, the linker will find the C standard library, which implements what’s declared in cstdio
, and connect our program to everything it needs to run printf
. The translation unit and the object code only have information from the cstdio
header, not the library. The linking step produces executable code from object code files by resolving the placeholders left for functions like printf
. What gets spit out from the linker is an executable binary.