Lessons learnt while trying to modernize some C code

2022-07-11, 19:10

I really like goaccess. I use it to build reports about what’s being read on my blog, based on my logs, ever since I ditched all the spying tools that run on your browser when you’re accessing my page. Perhaps you noticed that this page doesn’t ask you to eat some cookies - that’s why this happens. goaccess is a wonderful tool - it can show you in real time what’s going on with your server, who reads what, it’s generally a neat tool. But there is one problem I have.

For example, I get a lot of requests from bots on the /feed/ path. Of course, using wordpress between 2006 and 2020 is what makes a lot of bots come to that path, but I don’t want to see that in the stats; I want to ignore those requests. Unfortunately, I cannot. It’s a frustrating situation, the issue was opened five years ago, and the resolution from the author is a clear „no go”. But hey, it’s open source, you can grab the source and fix this!

You can’t fix this

goaccess is a great piece of software from the outside. The users are excited by it, as I was as well. If the features goaccess offers match your needs, it’s perfect. Looking from the inside, though, it’s bad. How bad? Unmaintainably bad. The code reviewer in me died for weeks every evening when I was starting VSCode to look at more of how goaccess is built like. I even started my own branch to try to bring it into C++, to refactor the hell out of it and make it digestable for anyone who wants to hack on it.

However, please understand that the effort put into this project is not to be undermined. goaccess explores some of the limitations of the C programming language, and the authors of this project did a tremendous job bringing the tool in the shape it is right now. goaccess is pretty neat for a C project, and it’s a demonstration why you shouldn’t use C to write complex projects. I’m not writing this to put down the developers of goaccess, even if some of the assessments below are harsh. They did an admirable job, however, they made it hard for themselves. Here are some points.

The type system was completely abandoned

C has a type system, but the developers are ignoring it, preferring to move stuff around as void *. Obviously, it shouldn’t be necessary to do that, and this was one of the things I wanted to fix by moving the code into a C++ compiler.

This is a case of one function infecting the entire code base. In this case, a function that is often invoked, the xmalloc family, a wrapper over malloc-like functions that return pointers to void. A good practice would be to immediately cast the result of the x*alloc functions. Perhaps even make some smart wrappers around them with macros; I hate macros, but since you don’t have the template system of C++ you might use what you can. It’s not hard to see a macro like this: #define ALLOCATE(TYPE) (TYPE *)xmalloc(sizeof(TYPE))

The it would be easy to always have a proper type attached to all your variables. You’d definitely not end up with definitions such as: find_color_in_list (void *data, void *color). Not even C programmers should be subjected to this sort of pain.

There are many places that can be fixed with just adding types to things. And I don’t just mean replacing instances of void *. Those are just scratching the surface. I see a few more other points that would fix a lot of the possible issues. For example:

Add a string type and eradicate the zero terminated strings. Running strlen everytime you want to look at a string is pretty wasteful, and a lot of APIs would be greatly improved if they would just take a string type that is made of a char * and an uint32_t or uint64_t. I should write a separate article about this anyway.
Add container types in general. The problem with having X * all over the place is that you never know if that is a pointer towards a single X in memory or the start of a compact array of Xs in memory. The problem is amplified by the fact that char * is universally seen as a string, and not as a pointer to character in memory, which is what it is. X * is not always a container, and the APIs would be greately improved. In goaccess one never knows what X * is, it can be anything.
Avoid constructs like char ***. I thought it was a joke, but people do pass around char *** and it’s insane - almost impossible to comprehend why do you need a pointer to a pointer to a pointer.
Use and respect const. By default, I think everything passed as parameter should be considered const, however, C programmers don’t really feel like using the type system to their advantage. This is usually about passing char * around; and the tragic thing is that sometimes (not always) they actually modify the damned strings. And that’s bad, bad, bad, because as a newcomer to the code you are never sure what is going to happen with your strings, so you’ll just perpetuate the bad style.

On portability, features, and libraries

C’s portability is a joke between #ifdefs and #endif. I get it but hardly anyone else does; and perhaps one of the worst features of the C programming language, the preprocessor, is what stops people from doing portability badly. Not saying that there are better ways to do it, but the temptation to smack a #ifdef in the middle of the code is too much, and people are weak and do it often.

I am not commenting on goaccess now; instead, I’m commenting on the general practice I’ve seen applied in C and C++ code way too often, usually with bad results. The maintanability of a huge piece of code that is caught between #ifdefs is zero, mostly because people modifying a function will only touch the code they can run. It’s fine, you’ll run unit tests (yeah, zero tests), and also, again, you won’t be able to run that code anyway, since you don’t own the platform. #ifdef __APPLE__ costs a few thousand euros, and there are zero incentives for me to actually do it if I ever were to change such code.

But, say, if you no longer can maintain that version, what do you do with the code? You keep it? Until when? How about the compromises you made to support a strange platform because the one who initially wanted that supported couldn’t create a proper abstraction layer for it? Will you keep them forever with you?

And let’s discuss about the features that are enabled/disabled based on their availability on the platform. Who needs that? Why would I want a goaccess neutered without support for GeoIP? Or what about supporting either one library or the other? Why not both?

The answer for the last one is relatively easy: you can’t link against both libraries, maybe one is available on your system and one isn’t. You could link against either, but that means that on a system where there is library X2, but not library X1, choosing X1 is the wrong one to link against. There’s no dynamic availability check, there’s no forced inclusion of support for both X1 and X2. It’s DLL hell but we don’t call it that because it’s not Windows.

I think that developers should be brutal and inclusive. Libraries that support file formats are generally small - you shouldn’t link dynamically against them. Most things should be statically linked anyway, And there are just a handful of cases where linking dynamically adds any benefit. The libc should be dynamically linked most of the time, same for libcrypto, since you want to be able to update those independently of all their users. libgugumugu that needs to be delivered as a separate package, as a separate dev package and so on? Maybe not.

I think there’s a fetish for efficiency that makes people not see things straight. Like breaking things in many small libraries that you end up with thousands of libraries in hundreds of packages that really don’t make sense at all. You know what? I don’t need libgugumugu on my system. I want you to include libgugumugu in your executable whenever you need it. I don’t want you to complain ten years from now that the support for libgugumugu 2.3 was discontinued, and you can’t update to 3.2 because FreeBSD emulated on Win32 run with WINE in an Ubuntu that runs in a VM on Mac still needs to support 2.3!!!

I talked before about the build system. Your install script gives me hundreds of options and combinations of options, most of which you don’t even support, and I need a text processing language from back when computers occupied entire rooms to do what a microcontroller does today.

The positives

First of all, klb works and does its job well. I created a .depot.conf where I declared the libraries I will use and when, and ran it and it worked like a charm. Do you get what’s going on in this build file? You’re confused about the fact that you don’t tell it what should enter in your build? It’s fine, you already said it in the source code!

CXXFLAGS: "-std=c++20 -g3 -O2 -flto -Werror"
LDFLAGS: "-flto -Werror -ltcmalloc"
system:
  openssl:
    header: openssl/
    pkg-config: openssl
  bzlib:
    header: bzlib.h
    ldflags: -lbz2
  ncurses:
    header: ncurses.h
    pkg-config: ncurses
  libgeoip:
    header: GeoIP.h
    pkg-config: geoip
  fmtlib:
    header: fmt/
    pkg-config: fmt

The second positive is that I gave up and gave up relatively fast. Sure, I invested a ton of effort, and that went down the drain, but at least I get to rant about this on my blog, and for some reason that is enough.

Writing good code is hard - it takes a ton of effort and care. Writing bad code that does things correctly is incredibly hard. Of course, goaccess is a program that is stuck forever in the form you see it now, because adding new features is nigh impossible, but looking at the code, knowing that it was valgrind tested, the thing is a masterpiece.

And I wouldn’t touch that code with a ten feet pole.