The pipe dream of O(1) memory allocation

There is a piece of well-established knowledge that says that memory allocation in managed languages is a constant-time operation. It is so, everyone knows that. But here is something you need to consider: even finding an item in an unsorted list is O(1), as long as the item happens to be first in the list every single time.

So let’s discuss how the O(1) memory allocation works. It’s so simple (obviously, since it’s O(1)) that anyone can understand it. The algorithm is described below; a small sketch in code follows the list:

  • Memorize the current heap pointer and increase the heap top pointer by the number of bytes required for the memory allocation.
  • Success! Return the memorized heap pointer and call the constructor!
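
For the visually inclined, here is a minimal sketch of that bump-pointer allocator in C++. The names and the struct are purely illustrative, not any particular runtime’s implementation:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator over a preallocated region; illustration only.
struct BumpHeap {
    std::uint8_t* base;      // start of the preallocated heap region
    std::size_t   capacity;  // total bytes available
    std::size_t   top;       // offset of the next free byte

    void* allocate(std::size_t bytes) {
        if (top + bytes > capacity)
            return nullptr;            // out of space: the hard part starts here
        void* result = base + top;     // memorize the current heap pointer...
        top += bytes;                  // ...and bump it by the requested size
        return result;                 // the caller runs the constructor on it
    }
};
```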

O(1)! Constant time! Success! Dorin, see, it was simple! Let’s rejoice and do a victory dance! Move to managed languages! All those people doing it the hard way in C/C++ were wrong! Etc!

Now let’s examine the first step. Memorize the current heap pointer and increase it. What if we are on two separate threads, and we try to allocate memory at the same time? We’ll memorize the same heap pointer but … Oh, oh, never mind. We will use synchronization… Yay?!?!

No yay. Synchronization is a costly operation; depending on the implementation it may or may not require a userspace-kernel switch. But let’s assume that the best engineers in the world implemented the best lock-free method; can it really be done? Maybe, by using something like the Interlocked family of functions on Windows, but most likely it isn’t. Why? Let us continue our investigation, keeping in mind that after allocation the memory is used by the constructor (or zeroed by the virtual machine environment).
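
As a hedged sketch, the lock-free version would look something like this, using an atomic fetch-and-add (the moral equivalent of the Interlocked functions); again, illustrative names, not any VM’s actual allocator:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Same preallocated region as before, now shared between threads.
struct AtomicBumpHeap {
    std::uint8_t*            base;
    std::size_t              capacity;
    std::atomic<std::size_t> top{0};

    void* allocate(std::size_t bytes) {
        // One atomic instruction reserves the range [old, old + bytes) for us.
        std::size_t old = top.fetch_add(bytes, std::memory_order_relaxed);
        if (old + bytes > capacity) {
            // Slow path: the region is exhausted; a real runtime would now
            // trigger a collection or ask the OS for more pages.
            return nullptr;
        }
        return base + old;
    }
};
```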

What happens if we reach the end of the memory space allocated for the heap? Oh, but that would never happen, because we have an address space of… Oh, wait.

Pipe dream broken. Because no program will ever be allocated 2GB of memory from the start. It will probably have a bit of memory, then slightly more and slightly more, until it actually grows to a respectable size. And now you have to deal with the OS memory allocator, which is a whole different beast altogether. It handles memory pages, it handles swap space, it handles the mapping between virtual and physical addresses. And this is definitely not an O(1) operation, since there are a lot of memory pages to be handled, and very little chance of growing in constant time. Add to this synchronization inside the kernel, which will probably block other processes or threads from performing similar operations (or other operations, if you have something like the infamous “Big Kernel Lock”), and you have a clearer picture of what’s going on.
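
To make the slow path concrete, here is a rough sketch of asking the kernel for more pages; I’m using POSIX mmap for illustration (on Windows the equivalent would be VirtualAlloc), and the helper name is made up:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper: grow the heap by `bytes` when the bump region runs out.
void* grow_heap(std::size_t bytes) {
    // Each call is a kernel round trip: page tables get updated, and the pages
    // are typically backed by physical memory only on first touch, which means
    // page faults later, during zeroing or the constructor.
    void* pages = mmap(nullptr, bytes,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return pages == MAP_FAILED ? nullptr : pages;
}
```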

And we’re only skimming the surface. The memory is returned to the user, but what happens in the optimistic case, in which no new pages of memory have to be allocated by the kernel? Nothing much; the worst thing that could happen is to hit a major page fault and bring unused memory back from swap. Why would that happen?

Because life is tough. Let’s say you have an algorithm that allocates a lot of memory and then frees it, and then another algorithm uses a similar amount of memory. When that is freed, let’s say the garbage collector kicks in, clears stuff, and compacts the heap.

All is fun and games, until you realize that the garbage collector can choose to dispose of the freed memory pages and release them back to the operating system. Or, hopefully, it will choose to keep those memory pages around. But, you know, unused memory may at any time be swapped to disk by an operating system under memory pressure, i.e. with not enough resources.

So what will happen is what could happen at any memory access (though it’s more likely during a ‘new’ operation): a hard page fault, requiring more complicated operations to happen in the background.

I wish the story ended here. It’s a clear exposition of how constant time can go wrong, and the problem is that it probably will. The discussion is not over, though.

When you allocate an object, you also allocate space for all the members of that object. But your objects will probably not contain only primitive (non-string) types. You will probably have more objects, which contain more objects. The more the merrier. In C++ you have something that is quite fun: you don’t need to hold a pointer to every object contained in your class; you can hold the object directly, inlined in memory. But you don’t really have this option in managed languages. So for each contained object you get yet another memory allocation (including the kernel round trip for using the critical section; I am not sure whether it can happen without the kernel round trip, but even so, it is an expensive operation). We are not talking about the code that runs in the constructor, since that would be unfair (that is a cost that really depends only on the implementer, and they should be aware of those costs).
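
A small illustration of the difference, with made-up types: in C++ a member can live inline inside its owner, so constructing the owner is a single allocation, while holding members by reference (the managed-object model, mimicked here with pointers) means one allocation per member:

```cpp
#include <memory>

struct Point { double x, y; };

struct InlineShape {
    Point origin;   // stored inline: no extra heap allocation
    Point extent;   // also inline
};

struct ReferenceShape {
    std::unique_ptr<Point> origin;   // each member is its own heap allocation,
    std::unique_ptr<Point> extent;   // roughly what a managed object graph
                                     // looks like under the hood
    ReferenceShape()
        : origin(std::make_unique<Point>()),
          extent(std::make_unique<Point>()) {}
};
```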

I am not saying all these things happen all the time. Synchronization can be made cheaper by using lighter synchronization methods, and definitely some are in place; I haven’t looked recently over a memory allocator for a managed language, but I can make some guesses. The possible costs are there; and the O(1) is really an amortized O(1).

I’m not saying that memory allocation will be cheaper in C/C++. Using the stock memory allocator in C/C++ will cost a lot, due to the various allocation strategies that malloc implements. However, you can implement different memory allocation strategies that are more suitable to your needs. For example, you can have a huge chunk of memory allocated, and filled on demand by your algorithm. When the algorithm is done, you simply discard that chunk of memory, avoiding the costs of both memory allocation and deallocation. I’m not saying this is customary, but it can be done, and should be done especially by memory-constrained developers (the forgetful developers are the worst!).
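
Here is a minimal sketch of that “one big chunk, discard at the end” strategy; the Arena type is hypothetical and deliberately simplistic (no growth, destructors never run):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

class Arena {
public:
    explicit Arena(std::size_t bytes)
        : base_(static_cast<char*>(std::malloc(bytes))), size_(bytes), used_(0) {}
    ~Arena() { std::free(base_); }   // one deallocation releases everything

    // Carve an object out of the chunk with placement new; best suited to
    // trivially destructible scratch data, since destructors are never called.
    template <typename T, typename... Args>
    T* create(Args&&... args) {
        std::size_t aligned = (used_ + alignof(T) - 1) & ~(alignof(T) - 1);
        if (base_ == nullptr || aligned + sizeof(T) > size_) return nullptr;
        used_ = aligned + sizeof(T);
        return new (base_ + aligned) T(std::forward<Args>(args)...);
    }

private:
    char*       base_;
    std::size_t size_;
    std::size_t used_;
};

// Usage: all the temporaries of one algorithm live in one arena and vanish together.
// Arena scratch(1 << 20);
// int* value = scratch.create<int>(42);
```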

There is, however, a real O(1) memory allocation: stack allocation. The stack is preallocated and the least likely memory to hit the disk, making the chance of page faults quite slim. C# unlocks this memory through the unsafe ‘stackalloc’ keyword, which is in turn a glorified name for the alloca C call.
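
The C/C++ analogue, for illustration, is alloca (non-standard but widely available); like stackalloc, it just bumps the stack pointer, and the buffer disappears when the function returns:

```cpp
#include <alloca.h>   // on Windows: <malloc.h> and _alloca
#include <cstddef>
#include <cstring>

void process(const char* input, std::size_t len) {
    // Roughly equivalent in spirit to C#'s `stackalloc byte[len]`.
    char* scratch = static_cast<char*>(alloca(len));
    std::memcpy(scratch, input, len);
    // ... work on scratch; no free/delete and no GC involvement ...
}
```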

So memory allocation in managed languages is an O(1) operation the same way O(log n) is O(1). It depends on how you choose to see things; it’s easy to forget the garbage-collection bill that comes due for that O(1), and, unfortunately, managed-space developers tend to forget that their code runs on real operating systems, on real hardware.

Comments


  1. You make some good points. That said, there are many options available to developers for managed applications. For example, developers can enable an option that causes .NET to use one heap per logical processor (gcServer = true). This prevents multi-threaded apps from having to synchronize access to the heap management structure.
    As I am sure you agree, paging can be a problem for managed apps and non-managed apps alike. Managed .NET apps can set Process.MinWorkingSet to some value that lets the OS know how much physical memory you would like your app to hog, thus reducing the chance that your app’s pages will be paged out.
    I ran a simple benchmark on my machine. I created 2 apps, one written in C++ and the other in C#.
    Each app has 4 threads that each loop 10 million times, allocating arrays of type int with a random size from 1 to 32 elements. The C++ app additionally deletes the array after it is allocated. The C# app lets the GC manage the deletion. Additionally, the C# app is configured to be in gcServer mode and GCSettings.LatencyMode is set to SustainedLowLatency. Both apps are run in release mode without a debugger attached.
    The C++ app completes 40 million allocations across 4 threads in 7879ms. The C# app completes the same in 4602ms. These numbers make sense to me because C++ is doing more work per allocation and also has to deallocate.

    • Ah, of course. In the matter of memory allocation, C# is definitely faster than C++, precisely because of garbage collection and the improvements that garbage collection brings to the table. However, this is not the complete story, which is what I was saying.

      If memory allocation is the issue in C++, you can simply replace the memory allocator with a custom-made one. In my book that says something like: you can bring the performance of memory allocation up to battle the performance of C# allocation for specific problems. But that was not actually the point of my rant here. The point was mostly that memory allocation is not an O(1) operation (it’s an amortized O(1)), and that simplifying the memory allocation requires a garbage collector that is more sophisticated and has to do a lot of work in the background, with your program frozen. These are some aspects that normal managed programmers don’t really think about, unfortunately.

    • I understand the nuanced argument you are making. I also agree with your assessment that “These are some aspects that the normal managed programmers don’t really think about…” Fortunately the argument is mostly an academic one. If you can allocate more than 5 million objects per second in either language (on my slow machine), then it is something very few developers will ever need to worry about. I agree that in the very rare case that one needs a custom allocation scheme, C++ affords you that luxury. Even so, I am impressed that with every version of .NET, performance improves significantly. For example, with .NET 4.5, generation 2 garbage collection happens in the background without suspending any of your threads.

    • Of course it’s an academic argument; so is algorithm complexity itself.

      As for .NET 4.5 and generation 2: the most ‘traffic’ is seen by generation 1 and generation 0. It’s good that generation 2 collection is done in the background, but ideally it should only re-confirm that the objects are alive, and generally generation 2 collection should be rare. 🙂

  2. Two comments:

    a) Amortized analysis.

    Using your argument, appending to a vector in C++ is not O(1) either, because when it grows over capacity it’ll take O(size) to copy everything to a new location. See amortized analysis, though. I’d rather not do a bad job explaining it – Google (or some other search engine) is your friend.

    b) Synchronization is expensive even when there’s no contention (~50 cycles if the lock is in the local cache), but that’s still constant time. However, C allocators aren’t that much better there. libc is equally crappy AFAIK. Though there are better threaded allocators – see tcmalloc. C/C++ have the advantage of pluggable allocators. People have been able to play around with custom allocators. Not to mention that you can also override new.
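
    A tiny sketch of that last point, for anyone who hasn’t seen it: C++ lets you replace the global operator new/delete and route every allocation through the allocator of your choice (this one just forwards to malloc, purely for illustration):

    ```cpp
    #include <cstdlib>
    #include <new>

    void* operator new(std::size_t size) {
        // Forward to whatever allocator you prefer (tcmalloc, an arena, ...).
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }

    void operator delete(void* p) noexcept {
        std::free(p);
    }
    ```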

    I guess the question that I’d be happy to know the answer for, is whether modern managed environments (.NET, JVM) have better allocators for heavily threaded apps, like tcmalloc.

    • I’m not fooling myself; I know that when doing an amortized analysis you will get an O(1), and I said that this happens the same way O(log n) is constant time for high values of n. All I’m saying is that it’s not a simple thing.

      In the end it is constant time, but with a high constant. And yes, as I said (and really, I really said that), malloc is not very good.

      I am quite sure that CLR and JVM both have better allocators. Otherwise, they would definitely have a lot of issues, as they are multithreaded by default (at least CLR starts a few threads for your thread pool automatically; perhaps you can disable that, but I doubt they really have a different allocator in that case).

    • Ok, sorry, I guess I’m starting to see where you’re going.

      Well, it seems that you’re aiming towards hard real time. It’s really just in the case of hard real time that you get “real” O(1). But … you lose a lot of features with hard real time (e.g., no more paging of virtual memory). Not to mention that it’s not exactly possible for modern desktop/server processors to do hard real time to begin with. I mean, exactly the same static instruction, a load (mov from memory into a register on x86), can take anywhere from 2-4 cycles if it hits in the L1 cache to >1000 cycles if it has to go to a different NUMA node and the bus has high contention (I’m not aware of hard latency upper bounds for such cases … I’m sure they exist, but they’re probably even higher than 1000 cycles).

      What you get with C/C++ on a “normal” x86 processor is at best soft real time. In server mode, you look at your average latency under normal load and plot the latency for the 90th, 95th, 99th, and even 99.9th percentile (usually you get a long tail, because of crap that happens every once in a while).

      It’s true, the GC of a managed language is the worst “crap” that can happen to your server app. Best if you have multiple server instances with some dynamic load balancing.

    • I probably would hate a game hung on GC as well. So what I’m saying is: if you want to do something important, pay the iron price (and I find myself talking in Game of Thrones jargon).

      Having options, as one does in C++, is important. I’m disregarding C on purpose, since C can be done in C++ with almost 0 overhead (and I say almost because there is the matter of linking libstdc++). C++ can even do it better at times (see the matter of using std::sort vs. qsort). And C++, the damned octopus with 11 legs sewn over the original 98 ones, is by far the more configurable beast, one that will allow you to write the least time-consuming constant-time memory allocator.

    • Garbage collection also offers memory safety. The three options for memory safety are: 1) no dynamic allocations, 2) garbage collection, 3) region-based memory allocation. For #3 you need pretty significant language support, so it doesn’t really count for this discussion.

      What I’m saying is that you’re paying a bigger price than just having to allocate/deallocate manually. You lose memory safety as well. Also, particularly if you have a long-running storage application (which is one of the things I did in my past), the C/C++ approach also suffers from memory fragmentation. Managed languages with garbage collection can also move objects around in memory.

      Manual memory allocation is not inherently superior to garbage collection. It’s just a different trade-off.

      BTW, Garbage collection has improved significantly in the last decade. It doesn’t freeze the app unless it really has to.

      http://javarevisited.blogspot.com/2011/04/garbage-collection-in-java.html

  3. Another thing that you should keep in mind is that x86 processors have a TSO memory consistency model (total store order) – which is not sequential consistency, but it’s a lot better than the “minimum subset of all architectures” consistency that C++ and the VMs promise you. A clever implementation can elide many locks.

    For most developers, “always lock” is probably good advice. But I hope that VMs are implemented by outlier developers.

    • That article on garbage collection looks pretty interesting, I’ll have a look at it once I’m back in .ro.

      I agree that many locks can be elided, and I bet the VM implementers are much better at this than the usual “Always lock” policies, but I am not sure that memory management can help it.

      Maybe I should write something about this ‘always lock’ strategy, which is quite time-consuming as well. Things that, as you said, eat a lot of time even when not needed.

    • Memory consistency and locking are not only interesting topics, but also highly relevant for today’s state of computer architecture.

      Apologies if I’m overly pedantic/lecturing, but here’s something that I always found cool:

      On a TSO architecture (x86 included), a basic spinlock implementation can be done as follows:

      1. acquiring the spinlock: exchange (or compare and exchange) in a loop (if there’s no contention, the loop will only execute once).
      2. releasing the spinlock: simple store. This may sound surprising, but it’s totally safe. (A sketch follows below.)
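
      A minimal sketch of those two steps, assuming x86/TSO (expressed with C++ atomics; the release compiles down to a plain mov):

      ```cpp
      #include <atomic>

      struct SpinLock {
          std::atomic<int> locked{0};

          void acquire() {
              // Uncontended case: the exchange succeeds on the first iteration.
              while (locked.exchange(1, std::memory_order_acquire) == 1) {
                  // spin until the current holder releases
              }
          }

          void release() {
              // A simple store is enough on TSO; on x86 this is just a mov.
              locked.store(0, std::memory_order_release);
          }
      };
      ```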

    • In a multi-processor system? I find myself doubting that. However, I would really like to have a look over the JVM or Dalvik. As soon as I get some computer time, I will. 🙂 But right now I’m on vacation (viva the holiday!)

    • Yes! Multi-processor, multi-core, multi-threaded and NUMA 😉 (sounds fancy but it’s not, 2 Ivy Xeons on an LGA 2011 board are like that, TTBOMK)

      Doubt is great – and the basis of science :). Understanding why it works should be fun.

      Hope you’re having a great time!