blog/_posts/2021-12-18-fixing-c.md
2021-12-18 21:52:13 +01:00

---
layout: post
title: "Fixing C"
tags: tech c programming
---
The C programming language remains one of the most popular languages in the history of computing.
All major operating systems use it, and if you design your own systems language you will have to
make it talk to C if you want to accomplish anything.
However, during my own journey through the weeds of systems programming over the last several
years, i have experienced several annoyances with its "features" and lack thereof.
Let's fix that!
As a little foreword, if you are not into the whole reading stuff, i've made a fancy sample code
file summarizing everything in this article and some more stuff i thought was neat
[here](/static/type-c).
## Integer Promotion Is Dumb
One of the (in my opinion) most stupid and pointless bug-inducing flaws in C's design is integer
promotion and how it is implemented.
Let me demonstrate this by an example, which assumes the target architecture uses two's complement
for negative numbers (which is true for pretty much any platform):
```c
int main(void)
{
    char c = -1;
    while (c >>= 1);
}
```
You would assume that the `while` loop will execute exactly `CHAR_BIT` times and then exit, but this
is not the case due to the rules of integer promotion.
One of those rules states that any operation on integers narrower than `int` requires converting
them to an `int` first (including sign extension; you probably see where this is going) and only
then executing the operation.
Therefore, this is what's actually happening:
```c
int main(void)
{
    char c = -1;
    while (c)
        c = (char)((int)c >> 1);
}
```
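Assuming we are on x86, where plain `char` is signed, this would convert `c`, whose bit pattern is
`0xff`, to an `int` representing the same number, which is `0xffffffff` after sign extension.
Right-shifting a negative signed value is implementation-defined, and gcc performs an arithmetic
shift that copies the sign bit, so the result is still `0xffffffff` (a logical shift would give
`0x7fffffff`, which is no better).
Either way, the cast back to `char` keeps only the low byte, resulting in `c` having the same
value as before!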
And sure enough, if you feed gcc with the above code, it spits out an
[endless loop](https://godbolt.org/z/rMjW19dfT).
That is, unless you compile with `-O0`, which makes the whole thing even nastier to debug.
An immediate fix that would break basically nothing that wasn't already broken before is altering
the rules for integer promotion: _Bitwise operations always promote to unsigned types._
It's really as simple as that, at least as far as i can tell.
And while we're at it, we could also make 2's complement the mandatory representation for all signed
integers, because why wouldn't we.
Even the C23 people seem to
[agree](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2412.pdf).
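
To see what the proposed rule would change, here is a minimal sketch that emulates it by hand with
explicit `unsigned char` and `signed char` variables.
All helper names are made up for illustration, and the signed case assumes that `>>` on a negative
value is an arithmetic shift, which is what gcc and clang do but which the standard leaves
implementation-defined.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: shift an all-ones byte right until it is zero.
 * The shift happens on an unsigned value, as the proposed rule would do. */
static int shifts_until_zero(void)
{
    unsigned char c = UCHAR_MAX;
    int n = 0;

    while (c) {
        c >>= 1; /* promoted to int, but the value is never negative */
        n++;
    }
    return n;
}

/* Hypothetical helper: a single shift of a negative signed byte.
 * Assumes an arithmetic shift (true for gcc and clang). */
static signed char shift_minus_one(void)
{
    signed char c = -1;

    c >>= 1; /* same as c = (signed char)((int)c >> 1) */
    return c;
}

/* Ties both observations together; call this from main() to try it out. */
void promotion_demo(void)
{
    assert(shifts_until_zero() == CHAR_BIT); /* the loop terminates */
    assert(shift_minus_one() == -1);         /* the signed value is stuck */
}
```

The unsigned loop ends after `CHAR_BIT` iterations, while the signed byte never changes, which is
exactly why the original loop spins forever.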
## More Metaprogramming
Since C, except for the, uh, _interesting_ `_Generic` operator, completely lacks generics and
puts all of that work into the preprocessor instead, it should at least have some useful tools for
writing macros.
However, ISO C is still very much lacking in that regard.
The `typeof` operator is currently being
[discussed](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2724.htm)
as a candidate for C23, which would be a very welcome addition to the language in my opinion.
When combined with a certain GNU extension, `typeof` would allow for very powerful metaprogramming.
I'm talking of course about
[Statement Expressions](https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html)!
This duo lets you write macros that evaluate each argument exactly once, where a plain macro would
introduce double evaluation bugs:
```c
#define max(a, b) ({            \
    typeof(a) _a = (a);         \
    typeof(b) _b = (b);         \
    _a > _b ? _a : _b;          \
})
```
There isn't much else to say about this.
I already rely on these extensions in pretty much all of my projects with over 1000 source lines of
code, and a lot of others do too (including gigantic projects like Linux or FreeBSD).
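
The bug class this avoids is double evaluation of macro arguments.
Below is a minimal sketch of the difference, assuming a GNU C compatible compiler such as gcc or
clang.
`NAIVE_MAX`, `SAFE_MAX`, `next_value` and the two counting helpers are hypothetical names, and
`__typeof__` is the GNU spelling of `typeof` that is accepted regardless of the selected standard.

```c
#include <assert.h>

/* Naive version: each argument is pasted into the expansion twice. */
#define NAIVE_MAX(a, b) ((a) > (b) ? (a) : (b))

/* GNU C version: each argument is evaluated once into a local copy. */
#define SAFE_MAX(a, b) ({           \
    __typeof__(a) _a = (a);         \
    __typeof__(b) _b = (b);         \
    _a > _b ? _a : _b;              \
})

static int calls;

/* Hypothetical function with a side effect: it counts its own calls. */
static int next_value(void)
{
    return ++calls;
}

/* Returns how often next_value() ran inside the naive macro. */
static int naive_call_count(void)
{
    calls = 0;
    (void)NAIVE_MAX(next_value(), 0);
    return calls;
}

/* Returns how often next_value() ran inside the statement expression. */
static int safe_call_count(void)
{
    calls = 0;
    (void)SAFE_MAX(next_value(), 0);
    return calls;
}

/* Call this from main() to check both counts. */
void max_demo(void)
{
    assert(naive_call_count() == 2); /* once to compare, once for the result */
    assert(safe_call_count() == 1);
}
```

The naive macro runs the side effect twice because the winning operand is evaluated again to
produce the result, while the statement expression only ever sees the saved copies.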
## Integer Ranges
Now that's where my take is starting to get interesting and significantly higher in temperature,
but hear me out.
Suppose you have a function that takes a parameter which will be used as the count for a bitshift,
or generally anything that has to be constrained to a certain range of values.
Then, suppose you don't trust yourself because you regularly write code at 3 AM and might pass some
value that is greater than the width of the integer you're shifting.
So, because you're a good boy/girl/enby who avoids undefined behavior, you write a debug assertion
that checks whether the parameter is less than `LONG_BIT` or whatever:
```c
vm_page_t alloc_pages(unsigned int order)
{
    assert(order < NR_ORDERS);
    struct pool *pool = &pools[order];
    /* blah */
    /* shift a size_t rather than an int so large orders can't overflow */
    size_t size = ((size_t)1 << order) * PAGE_SIZE;
    /* blah */
}
```
Now, that can obviously become tedious if you have a lot of functions that require a constraint like
this.
But wait a minute, doesn't the compiler already do these sorts of sanity checks for pointer types
and such, so you don't accidentally assign an `int *` to a `char *`?
It sure does, so why wouldn't we extend this capability to the actual _values_ themselves, rather
than just the _types_?
Optimizing compilers already do static analysis to track what values a variable may have at any
given point in the program so they can do their fun little tricks that have absolutely never
resulted in any bugs in the output binary whatsoever.
To be perfectly clear about what i mean, here is another example:
```c
void do_stuff(int x)
{
    /* x could have any value here */
    if (x >= 8 && x < 16) {
        /* x must be >= 8 and < 16 if
         * we reach this scope (duh) */
    }
    /* x might be all sorts of things */
}
```
Now, my proposal is to write the range in angle brackets directly after the type, as in `int<0,10>`
for any integer from 0 (inclusive) to 10 (exclusive).
I don't care what the exact syntax would look like, though, and considering that C is pretty much
the queen of cursed syntax anyway it doesn't really matter in my opinion.
By the way, i'm consciously not writing that in a code block because it makes my syntax parser freak
out a little, but i'm sure you can use your imagination.
If not, just see the [fancy sample code](/static/type-c).
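
Until something like that exists, the closest approximation in today's C is probably a wrapper type
whose only constructor performs the range check, so the check happens once at the boundary instead
of in every function.
This is only a rough sketch of that workaround, not the proposed feature itself: `order_t`,
`order_from_uint`, `pages_in_order` and the value of `NR_ORDERS` are all assumptions made up for
this example, and the check still happens at run time rather than at compile time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ORDERS 10u /* assumed limit, pick whatever your allocator uses */

/* Hypothetical stand-in for an unsigned integer in the range [0, NR_ORDERS). */
typedef struct {
    unsigned int value;
} order_t;

/* The only sanctioned way to create an order_t; rejects out-of-range input. */
static bool order_from_uint(unsigned int raw, order_t *out)
{
    if (raw >= NR_ORDERS)
        return false;
    out->value = raw;
    return true;
}

/* Functions taking an order_t may rely on the range having been checked. */
static size_t pages_in_order(order_t order)
{
    assert(order.value < NR_ORDERS); /* cheap safety net for debug builds */
    return (size_t)1 << order.value;
}
```

A caller would convert once with `order_from_uint()` and then pass the `order_t` around, which is
roughly what a ranged integer type would give you for free.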
## Dynamic Size Annotated Arrays
This is already a proposal for C23 if i remember correctly, but it's so useful that i wanted to
include it anyway.
It's kind of related to the type range feature.
Have a look at the signature of `read(2)`:
```c
ssize_t read(int fd, void *buf, size_t nbytes);
```
This has worked well for several decades.
But it still has a fatal flaw: Literally nothing is stopping you from passing a `buf` that is
smaller than `nbytes`.
What if instead the signature looked like this:
```c
ssize_t read(int fd, char buf[nbytes], size_t nbytes);
```
The compiler could easily figure out whether the buffer is sufficiently sized, and emit a warning if
not (for example, because it knows how big a memory area returned from `malloc()` is).
A drawback of this is of course that it would require casting any buffer to a `char *` before
passing it to the function, which could be compensated for by making `void` arrays behave like
`char` ones _in this specific situation_:
```c
ssize_t read(int fd, void buf[nbytes], size_t nbytes);
```
The implementation of `read` would still need to perform some form of explicit or implicit type
cast in order to write to the destination buffer, of course.
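
For comparison, standard C since C99 already accepts a variably modified array parameter, as long
as the size parameter comes _before_ the array so that it is in scope.
The sketch below shows that form; `fill_bytes` is a hypothetical helper and not part of any real
API, and whether a compiler actually warns about a buffer that is too small depends on the
compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the declared bound of buf refers to the nbytes
 * parameter, which therefore has to be declared first. */
static size_t fill_bytes(size_t nbytes, char buf[nbytes], char value)
{
    memset(buf, value, nbytes);
    return nbytes;
}

/* Call this from main() to try it out. */
void fill_demo(void)
{
    char buf[8];

    /* passing sizeof buf keeps the size and the buffer in sync */
    assert(fill_bytes(sizeof buf, buf, 'x') == sizeof buf);
    assert(buf[0] == 'x' && buf[7] == 'x');
}
```

The array still decays to a plain pointer inside the function, so the bound is documentation and a
hint for diagnostics rather than an enforced contract.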
## Better Type Obfuscation
This one is inspired by physicists insisting on always using the correct unit along with numbers.
Let's say you are writing a security critical function that checks whether a process is a member of
a certain group.
It's probably not a good idea to accidentally mix up uid and gid, but since both are usually
`typedef`ed to `int` or something similar, it is pretty easy to do so:
```c
typedef int uid_t;
typedef int gid_t;
bool has_gid(const struct task *task, gid_t gid);
```
Nothing is stopping you from passing a uid to this function.
This is a bad thing.
So, why not make `uid_t` and `gid_t` completely obfuscated types that are mutually incompatible?
I propose the following syntax that makes values of type `uid_t` and `gid_t` incompatible with
values of any other integral type, unless there is an explicit cast.
```c
typedef int ~uid_t;
typedef int ~gid_t;
```
Of course, you also can't assign a `uid_t` to a `gid_t` and vice versa.
In effect, this would be the same as encapsulating the actual value into a struct, but without
having to access the member every time you need the raw integral value.
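
For reference, the struct workaround looks roughly like this in standard C.
The names `user_id`, `group_id` and the layout of `struct task` are assumptions for the sake of the
example, chosen so they don't clash with the real `uid_t` and `gid_t` from the system headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Two distinct struct types: the compiler refuses to convert between them. */
typedef struct { int raw; } user_id;
typedef struct { int raw; } group_id;

/* Hypothetical task layout, reduced to the two fields needed here. */
struct task {
    user_id uid;
    group_id gid;
};

static bool has_gid(const struct task *task, group_id gid)
{
    /* this .raw access is the boilerplate the proposed syntax would remove */
    return task->gid.raw == gid.raw;
}

/* Call this from main() to try it out. */
void id_demo(void)
{
    struct task t = { .uid = { 1000 }, .gid = { 100 } };

    assert(has_gid(&t, t.gid));
    /* has_gid(&t, t.uid) would not compile: user_id is not a group_id */
}
```

Mixing up the two ids becomes a compile error, at the price of writing `.raw` everywhere.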
## Putting It All Together
There is some minor stuff that i left out in this article but included in the
[fancy sample code](/static/type-c)
file, but it's mostly just a logical conclusion of the concepts declared herein.
I hope you find these ideas as interesting as i do, and maybe someone (hopefully not me, since i am
drowning in side projects already) will write a transpiler for this dialect of C.
The only thing that's left now is to give it a name.
How about ... Type-C?
Because it's primarily an extension of the type system, and i like the idea of making the whole
ambiguity around a certain serial bus standard even more confusing than it already is.
I'm envisioning this to be kind of what TypeScript is to JavaScript, which is also just a transpiled
language and a superset of the latter.
Let me know what you think in the comments, and be sure to like and subscribe as well as hit the
bell icon so you won't miss any future videos.