Programming basics: Pointers and references

This morning I had a very interesting discussion with Mihai related to this answer on Quora. We were basically saying the same thing but we were referring to different things, so I think it’s worth pointing out what’s going on with pointers and references. [NOTE] The discussion below refers to variables from the C++ point of view, where things are closer to the metal. This whole discussion requires understanding of C++ first and foremost.

First, definitions: a pointer is a variable that contains an address. A reference is a hidden pointer that makes a set of operations easier and automates dereferencing. This sounds complicated, but it’s not. Let’s look first at what exactly IS a pointer.

The computer has two important components. The first is the processor, the CPU, which takes one instruction at a time and executes it. The second component is the memory, which can be seen as a contiguous array of bytes. The first byte in memory has the address 0, the second byte in memory has the address 1, and so on. If you have 4 GiB of RAM, you will have 4,294,967,296 such bytes, with addresses from 0 to 4,294,967,295. For the time being we’ll ignore the virtual address space, as it is not helpful in our explanations and it won’t change anything, really.

A variable occupies space in memory and it represents the interpretation of that space. For example, if we need to represent an integer value, we might have a variable that is composed of 4 bytes, and starts at address 16 in memory. That variable will occupy the bytes with the addresses: 16, 17, 18 and 19. But since there is only one way to go, we only need to indicate the address of the first byte of that variable; the size of the variable will be deduced from its type.
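To make the "first address plus size from the type" idea concrete, here is a minimal C++ sketch. The `Extent` struct and the `extent_of` helper are hypothetical names made up for this illustration; the actual addresses depend on the machine and the run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: the range of byte addresses a variable occupies.
struct Extent {
    std::uintptr_t first;  // address of the first byte
    std::uintptr_t last;   // address of the last byte
    std::size_t size;      // number of bytes, deduced from the type
};

// Only the address of the first byte is taken from the variable;
// the size comes from its type through sizeof.
template <typename T>
Extent extent_of(const T& variable) {
    const auto first = reinterpret_cast<std::uintptr_t>(&variable);
    Extent e{first, first + sizeof(T) - 1, sizeof(T)};
    assert(e.last - e.first + 1 == e.size);  // the range covers exactly sizeof(T) bytes
    return e;
}
```

For a 4-byte integer that happens to start at address 16, `extent_of` would report the range 16 to 19, matching the example above.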

The CPU reads memory one location at a time (in practice, a word of 4, 8 or 16 bytes). This is well represented in assembly language, where on many architectures we cannot operate on values in RAM directly: we first have to load the address, and then the value, into registers. This, of course, introduces an extra step that makes programmers less productive. The C programming language strives to hide such complexity from the programmer, by introducing the pointer.

A pointer is a fancy name for a variable which holds an unsigned integer value that will be interpreted as an address. There are some obvious issues that come from that. For example, if I have a pointer to another variable, and I modify that pointer (for example I increment it), the pointer will no longer point towards that variable.

Accessing a variable placed at the address the pointer holds is called dereferencing. It is a complicated name for something relatively simple. We look at what that address is and interpret it. It can be an integer or it can be a character. We can interpret together the next 8 bytes or the next 4 bytes, or even the next one byte. It’s only a matter of interpretation.
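The "matter of interpretation" can be sketched in C++ as follows. The `interpret_as` helper is a hypothetical name; it copies the bytes with `std::memcpy` rather than casting the pointer, which is the portable way to reinterpret raw memory in C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: read the first sizeof(T) bytes found at `address`
// and interpret them as a value of type T.
template <typename T>
T interpret_as(const void* address) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain data types can be rebuilt from raw bytes");
    assert(address != nullptr);
    T value;
    std::memcpy(&value, address, sizeof(T));  // same bytes, different reading
    return value;
}
```

Given eight bytes that all hold 0x41, `interpret_as<char>` reads one byte (the letter 'A' on an ASCII system), `interpret_as<std::uint32_t>` reads four of them as 0x41414141, and `interpret_as<std::uint64_t>` reads all eight. The memory is the same in each case, only the reading changes.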

int a = 10;
int *p = &a;
printf("[%p] = %d\n", (void *)p, *p); /* %p expects a void pointer */

In the example above we have a variable of the integer type, and a pointer p to such a variable. This pointer takes the address of said variable. We can display the address of the variable, as well as its value by dereferencing it (the notation for dereferencing is *p).

If you’re not bored by now, you will understand soon why I took the longer route to explaining what references are. Please note that references are a matter of vocabulary, a matter that I’ll get into later in this post. However, for the time being let’s see what references are.

References are variables that directly refer to another variable. So basically, when you have a variable and a reference to it, you only have one variable, and two names for it, like in the example below:

int a = 10, b = 40;
int& ref = a;
ref = 20; // a becomes 20 as well.
ref = b; // copies the value of b (40) into a; ref still refers to a.
// A C++ reference cannot be rebound: future changes to 'ref' keep changing a, never b.

When you initialize a reference, no value is copied: the reference is simply bound to the existing variable, and every later assignment through the reference writes into that variable. But, by now, we know what memory is and what variables are, and we know that, for example, we can’t really have two variables in the same memory space, so there must be a trick to this ‘reference’ thing.

Well, there is a trick, and the trick is that ref actually contains an address. The compiler automatically dereferences the ‘ref’ pointer, hiding the complexity of using addresses from the ‘user’ (the developer). But, behind the curtains, ref is a pointer, really. For example, when translated to machine code, the following two functions will look exactly the same:

void f1(int *p) { (*p)++; }
void f2(int &r) { r++; }
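A small sketch of how the two functions behave from the caller's side. The names `increment_via_pointer`, `increment_via_reference` and `same_object` are hypothetical and chosen only for this illustration. Whether the generated machine code is really identical depends on the compiler and its flags, so that part is not checked here.

```cpp
#include <cassert>

// Same body as f1: the dereference is written out by hand.
void increment_via_pointer(int* p) {
    assert(p != nullptr);
    (*p)++;
}

// Same body as f2: the compiler performs the dereference for us.
void increment_via_reference(int& r) { r++; }

// Hypothetical helper: a reference and a pointer name the same object
// when taking the address of the reference yields the pointer.
bool same_object(const int& r, const int* p) { return &r == p; }
```

Starting from `int a = 10;`, calling `increment_via_pointer(&a)` and then `increment_via_reference(a)` leaves `a` at 12, and `same_object(a, &a)` is true: the reference has no observable address of its own.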

However, this poses a wonderful problem. Below are two functions that apparently do the same thing, only they don’t.

void f1(int *p)
{
     int x = 10;
     p = &x; // this changes a copy of the address made when calling f1
}
void f2(int &r)
{
     int x = 10;
     r = x; // this changes the caller in an evil manner
}
...
int a = 20;
f1(&a);
f2(a);

It is very important to understand the difference. When f1 is invoked, it is invoked with an address. The function creates a local copy of that address, in a variable which it names ‘p’. That variable is changed with a new address (that of the local x), but since we’re working on a copy, the caller of f1 will NOT be affected. When f2 is invoked, r is just another name for the caller’s a, so the assignment r = x writes 10 straight into a. After the two calls above, a is 10, not 20.
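The difference can be checked with a short C++ sketch. The function names below are hypothetical stand-ins for f1 and f2, and the two `caller_value_after_*` helpers play the role of the calling code.

```cpp
#include <cassert>

// Stand-in for f1: only the local copy of the address is changed.
void retarget_pointer(int* p) {
    int x = 10;
    p = &x;   // the caller's pointer and the caller's int are untouched
    (void)p;  // silence "set but unused" warnings
}

// Stand-in for f2: the assignment goes through the reference.
void assign_through_reference(int& r) {
    int x = 10;
    r = x;    // overwrites the caller's variable with 10
}

// What the caller sees after each call, starting from a = 20.
int caller_value_after_pointer_version() {
    int a = 20;
    retarget_pointer(&a);
    return a;  // still 20
}

int caller_value_after_reference_version() {
    int a = 20;
    assign_through_reference(a);
    return a;  // now 10
}
```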

Now, there is the fact that you can have pointer arithmetic. That means you can increment a pointer, and that pointer will skip a number of bytes. If, for example, you have a pointer to integer, you will want each increment to advance by the size of an integer (typically four bytes) instead of 1 byte, since the next integer in memory will be four bytes away. But this is a problem for another day.
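As a brief preview, here is a minimal sketch of what incrementing a pointer to integer does. `stride_in_bytes` and `second_element` are hypothetical helper names, and the pointer passed in is assumed to point into an array (or at a single int, for the stride check only).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many bytes lie between p and p + 1.
std::ptrdiff_t stride_in_bytes(const int* p) {
    assert(p != nullptr);
    const char* here = reinterpret_cast<const char*>(p);
    const char* next = reinterpret_cast<const char*>(p + 1);
    return next - here;  // equals sizeof(int), not 1
}

// Hypothetical helper: `first` must point at an array of at least two ints.
int second_element(const int* first) {
    assert(first != nullptr);
    return *(first + 1);  // same as first[1]
}
```

On a platform where `int` is 4 bytes, `stride_in_bytes` returns 4: adding 1 to the pointer moves it to the next integer, not to the next byte.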

Now, the original topic that brought this post to life was my statement that Java and C# are pointer-based languages, and they are NOT reference based, as many people claim. Indeed, the common knowledge is that these languages commonly use references, however, references are not used widely. Any interviewer of Java/C# will throw at you a lot of ‘pass by reference vs. pass by value’. However, not even the interviewers realize that they are usually wrong about passing by reference.

Here’s the problem. Both C# and Java hide the ugly pointer behind all the variables, and disallow pointer arithmetic :(. So, for example, in C# we will have the following code (I will use a dictionary to make sure we have a ‘pass by reference’):

void cs_f1(int x) { x = x+1; Console.WriteLine ("{0}", x); }
void cs_f2(Dictionary<int, int> x)
{
    x = new Dictionary<int, int>(); // this won't change the caller
    x[0] = 100;
    Console.WriteLine("{0}", x[0]);
}
void cs_f3(ref Dictionary<int, int> x)
{
    x = new Dictionary<int, int>();
    x[0] = 100;
    Console.WriteLine ("{0}", x[0]);
}

The common knowledge is that in the example above, cs_f1 is an example of ‘pass by value’, while cs_f2 is an example of ‘pass by reference’. However, when we assign a new dictionary to x inside cs_f2, we won’t affect the caller! Only in cs_f3 will we affect the caller, because there we explicitly add the ‘ref’ specifier to the function parameter.

In fact, cs_f1 and cs_f2 are both an example of ‘pass by value’, and only cs_f3 is a ‘pass by reference’. In the case of cs_f1, we have, of course, a primitive type, and since we don’t really have an address of that type, we will copy its value on the stack to invoke the function.

In the case of cs_f2, however, the x is, in reality, a pointer to an instance of the Dictionary<int, int> class. In syntax it might feel like a reference, but it’s not. Since the language doesn’t really have pointers as a user-facing feature, this syntax is allowed, but even if it looks like a reference, it is not, since further work is done on a copy of a pointer. The fact that you can affect a structure that was passed to you is just a (most of the time undesired) effect of passing by pointer; it’s not proof that you pass a reference.

There is a danger that we might not know whether we’re dealing with a pointer or a value. Wrapping a value in a heap object that is reached through a pointer is called ‘boxing’, and getting the value back out is ‘unboxing’. ‘Boxed’ values (which require the compiler to generate code that goes through the indirection of a pointer) will make your code slower. This is why people implementing number crunching functions in C# have to make sure that their code does not require boxing/unboxing.

cs_f3 is also interesting (note, this is valid only for C#). If you read attentively up to this point, you will realize that using a ‘ref’ qualifier is just a way of adding another indirection level. So you basically have there a pointer to pointer to an object of type Dictionary<int, int>. The double indirection will allow you to modify the pointer that was passed to the function as well, therefore creating this desired ‘reference’ effect.
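The same two situations can be written out in C++, where the indirection levels are visible. This is only an analogy, not what the C# runtime literally does: `Dict`, `cpp_f2` and `cpp_f3` are hypothetical names, and the replacement dictionary is passed in by the caller (instead of being created with new) so that the sketch does not leak memory.

```cpp
#include <cassert>
#include <map>

using Dict = std::map<int, int>;  // stand-in for Dictionary<int, int>

// Analogue of cs_f2: the pointer itself is passed by value.
void cpp_f2(Dict* x, Dict* replacement) {
    assert(x != nullptr && replacement != nullptr);
    x = replacement;  // changes only the local copy of the pointer
    (*x)[0] = 100;    // writes into the replacement, not the caller's dictionary
}

// Analogue of cs_f3: one more level of indirection, like 'ref' in C#.
void cpp_f3(Dict** x, Dict* replacement) {
    assert(x != nullptr && replacement != nullptr);
    *x = replacement;  // changes the caller's pointer
    (**x)[0] = 100;
}
```

After `cpp_f2(handle, &other)` the caller's `handle` still points at the original dictionary. After `cpp_f3(&handle, &other)` it points at `other`.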

So to set things straight: C# and Java make extensive use of pointers, and they are pointer-based languages. What people usually call ‘pass by reference’ is not, really, passing by reference; in fact, you need special language constructs to obtain that ‘pass by reference’ effect. I suggest that this mode of passing a value be renamed to something more realistic, like ‘pass by address’ or ‘pass by pointer’.

I know, computer science is all wrong, and textbooks need to be re-written. But until that happens we can be right by using the proper denomination, that is: C# and Java are pointer-based languages and references are slightly more complicated than that. C# supports references, but (and this is a good thing) you have to specify clearly that you’re passing a reference. Java stays consistent to the goals of the language and remains ‘reference free’. For the time being.

PS: By the way. Changing members of a structure you received as a parameter that is not ‘ref’ is a sin, and people doing that should be severely punished. Sometimes, reinstating corporal punishment is not such a bad idea. Ok, I’m slightly exaggerating. Only slightly.

The reason is that updating a parameter is most of the time a sign of ‘feature envy’. And unless you have extremely good reasons to do so, you should not do it, since people don’t expect, really, to have their parameters changed.

PPS: I remembered the fact that I wanted to say why it’s important that everything is a pointer. There is also a hidden cost in the fact that everything is pointers; you don’t get to use the CPU cache that well, because, well, everything is an indirection.
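The layout difference behind that cost can be sketched in C++. The function names are hypothetical, and the sketch only shows the two layouts: it makes no timing claim, since the actual slowdown depends on the allocator, the data size and the CPU.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Values stored inline: the ints sit next to each other in one block of memory.
long sum_inline(const std::vector<int>& values) {
    long total = 0;
    for (int v : values) total += v;
    return total;
}

// Every element is a pointer to a separately allocated int,
// so each read needs one extra hop through memory.
long sum_indirect(const std::vector<std::unique_ptr<int>>& values) {
    long total = 0;
    for (const auto& p : values) {
        assert(p != nullptr);
        total += *p;
    }
    return total;
}

// Hypothetical helper: build the "everything is a pointer" version of a vector.
std::vector<std::unique_ptr<int>> box_all(const std::vector<int>& values) {
    std::vector<std::unique_ptr<int>> boxed;
    boxed.reserve(values.size());
    for (int v : values) boxed.push_back(std::unique_ptr<int>(new int(v)));
    return boxed;
}
```

Both functions compute the same sum. The first walks one contiguous block, while the second follows a pointer for every element, and those pointers may lead anywhere on the heap.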