Programming basics: Pointers and references

This morning I had a very interesting discussion with Mihai about this answer on Quora. We were basically saying the same thing but referring to different things, so I think it’s worth pointing out what’s going on with pointers and references. [NOTE] The discussion below treats variables from the C++ point of view, where things are closer to the metal, so it requires an understanding of C++ first and foremost.

First, definitions: a pointer is a variable that contains an address. A reference is a hidden pointer that makes a set of operations easier and automates dereferencing. This sounds complicated, but it’s not. Let’s look first at what exactly IS a pointer.

The computer has two important components. The first is the processor, the CPU, which takes one instruction at a time and executes it. The second component is the memory, which can be seen as a contiguous array of bytes. The first byte in memory has the address 0, the second byte has the address 1, and so on. If you have 4 GB of RAM, you will have 4,294,967,296 such bytes, with addresses from 0 to 4,294,967,295. For the time being we’ll ignore the virtual address space, as it is not helpful in our explanations and it won’t change anything, really.

A variable occupies space in memory and it represents the interpretation of that space. For example, if we need to represent an integer value, we might have a variable that is composed of 4 bytes, and starts at address 16 in memory. That variable will occupy the bytes with the addresses: 16, 17, 18 and 19. But since there is only one way to go, we only need to indicate the address of the first byte of that variable; the size of the variable will be deduced from its type.

The CPU can read only one memory location at a time (or a sequence of 4/8/16 bytes). This is well represented in assembly language, where we usually cannot work on values in RAM directly and have to load them, or their addresses, into registers first. This, of course, introduces an extra step that makes programmers less productive. The C programming language strives to hide such complexity from the programmer by introducing the pointer.

A pointer is a fancy name for a variable that holds an unsigned integer value, which is interpreted as an address. Some obvious issues follow from that. For example, if I have a pointer to another variable and I modify that pointer (say, I increment it), the pointer will no longer point to that variable.

Accessing a variable placed at the address the pointer holds is called dereferencing. It is a complicated name for something relatively simple. We look at what that address is and interpret it. It can be an integer or it can be a character. We can interpret together the next 8 bytes or the next 4 bytes, or even the next one byte. It’s only a matter of interpretation.
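To make the "matter of interpretation" point concrete, here is a minimal sketch. The helper names byte_at and reassemble are made up for illustration; they look at the same four bytes first one at a time and then together as a single 32-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns the single byte stored `offset` bytes after the start of `value`.
// Which byte comes first depends on the machine's endianness.
unsigned char byte_at(const std::uint32_t& value, std::size_t offset) {
    assert(offset < sizeof value);
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
    return bytes[offset];
}

// Copies the raw bytes out and then interprets them together again
// as one 32-bit integer.
std::uint32_t reassemble(const std::uint32_t& value) {
    unsigned char raw[sizeof(std::uint32_t)];
    std::memcpy(raw, &value, sizeof raw);      // the bytes, one by one
    std::uint32_t result = 0;
    std::memcpy(&result, raw, sizeof result);  // the same bytes, read as one integer
    return result;
}
```

The bytes never change; only the type through which we read them does.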

Take a variable of the integer type and a pointer p to it. The pointer holds the address of that variable. We can display the address of the variable, as well as its value, by dereferencing the pointer (the notation for dereferencing is *p).
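A minimal sketch of that setup, assuming nothing beyond the standard library. The function name read_through_pointer is hypothetical, and this is an illustration rather than the original listing.

```cpp
#include <cassert>
#include <iostream>

// Declares an int and a pointer to it, prints the address and the value,
// then writes through the pointer and returns the variable's new value.
int read_through_pointer() {
    int x = 42;    // an integer variable
    int* p = &x;   // p holds the address of x
    assert(p == &x);

    std::cout << "address of x: " << static_cast<const void*>(p) << '\n';
    std::cout << "value of x:   " << *p << '\n';   // *p dereferences the pointer

    *p = 43;       // writing through the pointer changes x itself
    return x;
}
```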

If you’re not bored by now, you will understand soon why I took the longer route to explaining what references are. Please note that references are a matter of vocabulary, a matter that I’ll get into later in this post. However, for the time being let’s see what references are.

References are variables that directly refer to another variable. So basically, when you have a variable and a reference to it, you only have one variable and two names for it.
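A small sketch of the "one variable, two names" idea. The function name alias_demo is made up; the name ref matches the one used in the text.

```cpp
#include <cassert>

// Binds a reference to a local int and writes through it.
int alias_demo() {
    int x = 10;
    int& ref = x;          // ref is a second name for x, not a new integer
    ref = 20;              // this writes to x
    assert(&ref == &x);    // both names have the same address
    return x;
}
```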

When you initialize a reference, no value is copied; the reference is simply bound to the existing variable. But by now we know what memory is and what variables are, and we know that, for example, we can’t really have two variables in the same memory space, so there must be a trick to this ‘reference’ thing.

Well, there is a trick, and the trick is that ref actually contains an address. The compiler automatically dereferences the ‘ref’ pointer, hiding the complexity of using addresses from the ‘user’ (the developer). But, behind the curtains, ref is a pointer, really. For example, a function that writes through an int* parameter and a function that assigns to an int& parameter will typically look exactly the same when translated to machine code.
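A sketch of such a pair, with hypothetical names (the original listing is not reproduced here). On mainstream compilers both bodies usually compile to the same store through an address, though the standard does not promise it.

```cpp
#include <cassert>

// Writes 7 to the int whose address was passed in.
void set_via_pointer(int* p) {
    *p = 7;    // explicit dereference
}

// Writes 7 to the int the reference is bound to.
void set_via_reference(int& r) {
    r = 7;     // the compiler dereferences for us
}

// Calls both versions and checks that the caller sees the same effect.
void check_same_effect() {
    int a = 0;
    int b = 0;
    set_via_pointer(&a);
    set_via_reference(b);
    assert(a == 7 && b == 7);
}
```

The only visible difference is at the call site: the pointer version needs &a, the reference version takes b directly.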

However, this poses a wonderful problem. Below are two functions that apparently do the same thing, only they don’t.
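The original listing is not available, so what follows is a guess at what such a pair might look like. The bodies are assumptions; only the names f1, p and x come from the text. Both functions return the value they end up looking at, which makes the difference easy to check.

```cpp
#include <cassert>

// f1 receives a COPY of an address. Re-pointing p changes only that copy.
int f1(int* p) {
    assert(p != nullptr);
    int x = 5;
    p = &x;       // p now points at the local x; the caller's variable is untouched
    return *p;    // reads the local x
}

// f2 receives a reference. Assigning to it writes into the caller's variable.
int f2(int& r) {
    int x = 5;
    r = x;        // copies 5 into whatever the caller passed
    return r;
}
```

After f1(&a) the caller's a keeps its old value; after f2(b) the caller's b holds 5.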

It is very important to understand the difference. When f1 is invoked, it is invoked with an address. The function creates a local copy of that address in a variable it names “p”. That variable is then assigned a new address (that of the local x), but since we’re working on a copy, the caller of f1 will NOT be affected.

Now, there is the fact that you can have pointer arithmetic. That means you can increment a pointer, and that pointer will skip a number of bytes. If, for example, you have a pointer to an integer, incrementing it advances the address by the size of an integer (four bytes in our example) instead of 1 byte, since the next integer in memory is four bytes away. But this is a problem for another day.
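A short sketch of that stride, using hypothetical helper names. It measures the distance in bytes between p and p + 1 rather than assuming that an int is four bytes.

```cpp
#include <cassert>
#include <cstddef>

// Returns how many bytes lie between p and p + 1.
// p must point at a valid int.
std::ptrdiff_t stride_in_bytes(const int* p) {
    const int* next = p + 1;   // advances by one int, not by one byte
    return reinterpret_cast<const char*>(next) -
           reinterpret_cast<const char*>(p);
}

// Walks a pointer from the first element of an array to the second.
int second_element() {
    int values[3] = {10, 20, 30};
    int* p = values;           // points at values[0]
    ++p;                       // now points at values[1]
    assert(p == &values[1]);
    return *p;
}
```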

Now, the original topic that brought this post to life was my statement that Java and C# are pointer-based languages, and they are NOT reference-based, as many people claim. The common knowledge is that these languages use references everywhere; in fact, true references are not widely used in them. Any Java/C# interviewer will throw a lot of ‘pass by reference vs. pass by value’ questions at you. However, not even the interviewers realize that they are usually wrong about passing by reference.

Here’s the problem. Both C# and Java hide the ugly pointer behind all the variables, and disallow pointer arithmetic :(. So, for example, in C# we will have the following code (I will use a dictionary to make sure we have a ‘pass by reference’):

The common knowledge is that in the example above, cs_f1 is an example of ‘pass by value’, while cs_f2 is an example of ‘pass by reference’. However, when we assign a new object to x inside cs_f2, we won’t affect the caller! Only in cs_f3 do we affect the caller, because there we force the ‘ref’ specifier on the function parameter.

In fact, cs_f1 and cs_f2 are both examples of ‘pass by value’, and only cs_f3 is a ‘pass by reference’. In the case of cs_f1 we have, of course, a primitive type, and since we don’t really have an address of that type, we copy its value onto the stack to invoke the function.

In the case of cs_f2, however, x is, in reality, a pointer to an instance of the Int32 class. In syntax it might feel like a reference, but it’s not. Since the language doesn’t really have pointers as a user-facing feature, this syntax is allowed, but even if it looks like a reference, it is not, since further work is done on a copy of a pointer. The fact that you can affect a structure that was passed to you is just a (most of the time undesired) effect of passing by pointer; it’s not proof that you pass a reference.

There is a danger that we might not know when we’re dealing with a pointer or a value. The transformation between pointer and value is called ‘boxing’ – and “boxed” values (that will require the compiler to generate code using the indirection of a pointer) will make your code slower. This is why people implementing number crunching functions in C# have to make sure that their code does not require the process of boxing/unboxing.

cs_f3 is also interesting (note, this is valid only for C#). If you read attentively up to this point, you will realize that using a ‘ref’ qualifier is just a way of adding another indirection level. So you basically have there a pointer to pointer to an object of type Int32. The double indirection will allow you to modify the pointer that was passed to the function as well, therefore creating this desired ‘reference’ effect.
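The three cases can be mimicked in C++, where the indirection is spelled out. This is only an analogy under stated assumptions: the names cpp_f1, cpp_f2 and cpp_f3 are made up, a std::map stands in for the dictionary, and std::shared_ptr stands in for the hidden pointer that a managed language keeps to an object.

```cpp
#include <cassert>
#include <map>
#include <memory>

using Dict = std::map<int, int>;
using DictPtr = std::shared_ptr<Dict>;   // stand-in for a managed object handle

// Like cs_f1: the value itself is copied, so the caller never sees the change.
int cpp_f1(int x) {
    x = 99;
    return x;
}

// Like cs_f2: the pointer is copied. Changes made THROUGH it are visible
// to the caller; re-pointing the copy is not.
void cpp_f2(DictPtr d) {
    assert(d != nullptr);
    (*d)[1] = 100;                   // caller sees this entry
    d = std::make_shared<Dict>();    // only the local copy is re-pointed
    (*d)[2] = 200;                   // lands in the new, local dictionary
}

// Like cs_f3 with 'ref': a reference to the pointer, i.e. double indirection.
// Re-pointing it changes the caller's pointer as well.
void cpp_f3(DictPtr& d) {
    d = std::make_shared<Dict>();
    (*d)[3] = 300;
}
```

After cpp_f2 the caller still holds the same dictionary, now containing key 1 but not key 2. After cpp_f3 the caller holds a different dictionary that contains only key 3.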

So to set things straight: C# and Java make extensive use of pointers, and they are pointer-based languages. What people usually call “pass by reference” is not, really, passing by reference; in fact, you need special language constructs to obtain that ‘pass by reference’ effect. I suggest that this mode of passing a value be renamed to something more realistic, like ‘pass by address’ or ‘pass by pointer’.

I know, computer science is all wrong, and textbooks need to be rewritten. But until that happens we can be right by using the proper terms, that is: C# and Java are pointer-based languages, and references are slightly more complicated than that. C# supports references, but (and this is a good thing) you have to specify clearly that you’re passing a reference. Java stays consistent with the goals of the language and remains ‘reference free’. For the time being.

PS: By the way. Changing members of a structure you received as a parameter that is not ‘ref’ is a sin, and people doing that should be severely punished. Sometimes, reinstating corporal punishment is not such a bad idea. Ok, I’m slightly exaggerating. Only slightly.

The reason is that updating a parameter is most of the time a sign of ‘feature envy’. Unless you have extremely good reasons to do so, you should not do it, since people don’t really expect to have their parameters changed.

PPS: I remembered that I wanted to say why it matters that everything is a pointer. There is a hidden cost in the fact that everything is pointers: you don’t get to use the CPU cache that well, because, well, everything is an indirection. A presentation by Herb Sutter explains it.
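Separately from the presentation, here is a rough sketch of the two memory layouts being contrasted. The function names are hypothetical and no timings are claimed. The first function walks values stored side by side; the second has to follow one pointer per element, and each pointer may lead to a different place on the heap.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <vector>

// Values stored side by side in one contiguous buffer.
long sum_contiguous(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0L);
}

// Each value lives in its own heap allocation, so reading an element
// means following a pointer first.
long sum_indirect(const std::vector<std::unique_ptr<int>>& boxes) {
    long total = 0;
    for (const auto& box : boxes) {
        assert(box != nullptr);
        total += *box;   // one extra indirection per element
    }
    return total;
}
```

Both functions compute the same sum; the difference is only in how many separate memory locations the CPU has to touch.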

Comments

  1. Uh … “in fact, you need special language constructs to obtain that ‘pass by reference’ effect.” – like what’s that in Java?

    Not sure what you’re on about but Java is not C# – and you’re incorrect wherever you say C# *and Java*.

    I don’t know of any way to do cs_f3 in Java, and if we are to go into details that’s mostly because there is no random operator overloading like that and by design all wrapper classes are immutable.

    It’s much more appropriate to talk about objects in Java, in which case:

    void cs_f2(Map map) { map.put(key, value); }

    is the only form you have and does kind of what you describe cs_f3 of doing in C#. The only thing is: that is the only option in Java, no “special” keyword or language support to say “ref” or god forbid copy constructors or other madness like that 🙂

  2. I’m only talking about what I know, so C# is that. C# is obviously the superior language here, but that was never a question between it and Java. 😀 I was shy to talk about Java since I didn’t know about any sort of reference really passed there.

    So there you have it. In Java you can’t talk about “pass by reference”, only “pass by pointer”. That simplifies things, indeed.

    Uh, uh, does that mean that Java has no ‘out’ keyword either? 😀 Dammit, man, that is so last century!

  3. Why talk about “C# and Java” then? And no, you don’t need an “out” keyword or references as you describe them (which only really apply to primitive types). Yet java written systems can be as complex and developed as any software goes.

    I’d say the concepts you describe are so last century. Kids these days shouldn’t even need to understand them, like we don’t get the details of how punch card programming used to work 🙂

    • @Raul: No, you don’t need an “out” keyword, but it helps. No, you don’t need lambdas, but they help. Plus, if we take in account how much junk has been written with Java, how much bloatware. But, on the other hand, bloatware is not the privilege of Java 🙂

  4. Question from a beginner programmer: why do programmers really need pointers? What are they useful for? They seem overly complicated, to the point that you don’t really understand what’s going on in the program. Why not use only constants and variables?

  5. @massiveattack80: Pointers are not 100% necessary. There are languages that don’t need them although they use them behind closed doors; Functional languages, for example. However, it’s much more difficult to handle things without pointers.

    Let’s take an example: A text. If you want to have a text read from a file, you might allocate a variable of 80 characters. However, I give you a file that contains the full “Treasure Island” by Robert Louis Stevenson (from Project Gutenberg). You will have to resize that variable. How? You can’t really resize it in place, because maybe after those 80 characters you have other important variables, like the file name.

    Now, a bit of context: The compiler needs to know where in memory are your variables, so it can load them, that’s because the machine only knows to address a variable by its address, not by its name. But if you have something in memory like: “an integer, a variable array, a second integer” you can’t really tell where the other integer really lies, can you? The compiler will not be able to refer to the second integer. It’s a lot better to have, in this case, “an integer, a pointer to the first character from an array, a second integer”, which are all pre-determined sizes, and the compiler can handle at compile time.

    Technically, one uses addresses all the time, because the compiler refers to all variables by their address. The language covers this by allowing you to give the variables names. On the other hand, the data that pointers refer to is usually allocated on the heap, a special storage space meant for user allocations. That means space for variable allocations that are maintained not by the compiler (like the stack allocations are) but by the user via an allocation API (malloc, free).

    Pointers are, therefore, vital for variable allocations. However, pointers are used in other places as well; for example you can have a pointer to a function and execute that function via the pointer. This is very useful when you have inheritance hierarchies; this is implemented with the use of pointers, as well as when you have callbacks that you need to pass to functions. When you pass a lambda, for example, you pass to the function just a pointer.

    Also, they are useful when you want to have a large quantity of data passed to a function. The alternative would be to copy the entire book to the function, instead of passing just the address of the first byte. That would make your program very impractical; it’s easier to just point to things in memory instead of copying them over and over again.

  6. @massiveattack80 – that’s a great question and is exactly pointing to what I was commenting above. The reality is you don’t need to understand this anymore.

    Manual memory management is an outdated concept and only C-like languages (afaik) still think it’s a good idea to let the developer figure out how to allocate memory.

    If you were to use Java or Python or one of the functional languages, you wouldn’t care. And I suggest you start with these before you dive too deep in legacy constructs described here.

    PS: I’d still prefer you correct your blog Dorin, instead of clinging to “C# and Java” – you should either talk about Java/Python separately or leave it out, to be correct.

    PPS: Lambdas have nothing to do with this, don’t sidetrack just to point out Java deficiencies, Java is far from perfect as well.

  7. @Raul: I for one think it’s a trade-off, garbage collection is not superior to manual management.

    There’s manual management as in pointer arithmetic – that’s good for system stuff, but not for high level applications, I agree (the D programming language doesn’t let you do that for instance; if you write high level C++ code void casting and pointer arithmetic should be avoided anyway).

    With respect to deallocation, there are, IMO, three ways of doing it, and there are trade-offs:

    a) manual deallocation: Full control and best performance, but at the cost of memory errors (e.g., use-after-free).
    b) garbage collection: no memory errors but you pay a really high cost if you get a “stop the world” garbage collection event.
    c) language support for allocation regions: this way you get best of both worlds, but you add another requirement to the language (region annotation). Personally I’m all for regions but it’s just not popular. It does make programming a bit more difficult.

    b) is actually pretty awesome if you never end up garbage collecting anyway, or if the cost is low. However, if you write a memory-intensive, long-running app, it’s a very different story. Add in real-time requirements and you just made your life harder – as a dev now you have to take into consideration and compensate somehow for the possibility of a “stop-the-world” event.

    Just my opinion of course.

  8. Also with b) languages you end up with a lot more indirection and dereferencing in the machine code. Compared to well-tuned C++ code (yes, well-tuned not any C++ code), those extra indirections will cost performance.

  9. There is also a hidden cost in the fact that everything is pointers; you don’t get to use the CPU cache that well, because, well, everything is an indirection. Very instructive is the presentation of Herb Sutter regarding this point, which I wanted to embed in the post, but forgot about it altogether. I updated the post with this presentation as well