---
layout: post
title: "Fixing C"
tags: tech c programming
---

The C Programming Language remains one of the most popular languages in the history of computer
science.
All major operating systems use it, and if you design your own systems language you will have to
make it talk to C if you want to accomplish anything.
However, during my own journey through the weeds of systems programming over the last several
years, i have experienced several annoyances with its "features" and lack thereof.
Let's fix that!

As a little foreword, if you are not into the whole reading stuff, i've made a fancy sample code
file summarizing everything in this article and some more stuff i think is neat
[here](/static/type-c).

## Integer Promotion Is Dumb

One of the (in my opinion) most stupid and pointless bug inducing flaws in C's design is integer
promotion and how it is implemented.
Let me demonstrate this by an example, which assumes the target architecture uses two's complement
for negative numbers (which is true for pretty much any platform):

```c
int main(void)
{
    char c = -1;
    while (c >>= 1);
}
```

You would assume that the `while` loop will execute exactly `CHAR_BIT` times and then exit, but this
is not the case due to the rules of integer promotion.
One of those rules states that any operation on integer types narrower than `int` requires
converting them to `int` first (including sign extension; you probably see where this is going)
and only then executing the operation.
Therefore, this is what's actually happening:

```c
int main(void)
{
    char c = -1;
    do
        c = (char)((int)c >> 1);
    while (c);
}
```

Assuming we are on x86, where plain `char` is signed, this would convert `c`, whose value is
`0xff`, to an `int` representing the same number, which is `0xffffffff` after sign extension.
Right-shifting a negative number is implementation-defined; gcc performs an arithmetic shift, which
copies the sign bit back in, so the result is still `0xffffffff`.
Even with a logical shift, the number would only become `0x7fffffff`.
Either way, the cast back to `char` keeps just the low 8 bits, resulting in `c` having the same
value as before!
And sure enough, if you feed gcc with the above code, it spits out an
[endless loop](https://godbolt.org/z/rMjW19dfT).
With optimizations enabled, the compiler can reduce the loop to a bare jump; at `-O0` you get
ordinary-looking shift-and-test code that still never terminates, which makes the whole thing even
nastier to debug.

An immediate fix that would break basically nothing that wasn't already broken before is altering
the rules for integer promotion: _Bitwise operations always promote to unsigned types._
For the example above, that means zero extending `c` to `0x000000ff` instead of sign extending it,
so the loop ends after `CHAR_BIT` shifts.
It's really as simple as that, at least as far as i can tell.
And while we're at it, we could also make 2's complement the mandatory representation for all signed
integers, because why wouldn't we.
Even the C23 people seem to
[agree](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2412.pdf).
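
In today's C, the effect of the proposed rule can be approximated by converting to `unsigned char`
before shifting.
The following is a minimal sketch, not a definitive implementation: the helper names
`count_shifts_signed` and `count_shifts_unsigned` and the `SHIFT_LIMIT` cap are made up for this
example (the cap only exists so the "endless" variant returns), and it assumes an implementation
where right-shifting a negative `int` is arithmetic, as gcc documents.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical cap so the "endless" variant still returns. */
#define SHIFT_LIMIT 1000

/* Shifts the way `c >>= 1` does on a signed char: promote to int
 * (sign extension), shift, truncate back. Returns the number of
 * shifts performed, or SHIFT_LIMIT if c never reached zero. */
static int count_shifts_signed(signed char c)
{
    int n = 0;
    while (n < SHIFT_LIMIT) {
        c = (signed char)((int)c >> 1);
        n++;
        if (c == 0)
            break;
    }
    return n;
}

/* Same loop, but the value is converted to unsigned char first, so
 * promotion zero extends and the shift brings in zero bits. */
static int count_shifts_unsigned(signed char c)
{
    unsigned char u = (unsigned char)c;
    int n = 0;
    while (n < SHIFT_LIMIT) {
        u = (unsigned char)(u >> 1);
        n++;
        if (u == 0)
            break;
    }
    return n;
}
```

Under those assumptions, `count_shifts_unsigned(-1)` returns `CHAR_BIT`, while
`count_shifts_signed(-1)` runs into the cap.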

## More Metaprogramming

Since C, except for the, uh, _interesting_ `_Generic` operator, completely lacks generics and
puts all of that work into the preprocessor instead, it should at least have some useful tools for
writing macros.
However, ISO C is still very much lacking in that regard.
The `typeof` operator is currently being
[discussed](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2724.htm)
as a candidate for C23, which would be a very welcome addition to the language in my opinion.

When combined with a certain GNU extension, `typeof` would allow for very powerful metaprogramming.
I'm talking of course about
[Statement Expressions](https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html)!
This duo lets you write macros that evaluate each argument exactly once, where a plain `?:` macro
would evaluate one of them twice and introduce bugs:

```c
#define max(a, b) ({            \
    typeof(a) _a = (a);         \
    typeof(b) _b = (b);         \
    _a > _b ? _a : _b;          \
})
```

There isn't much else to say about this.
I already rely on these extensions in pretty much all of my projects with over 1000 source lines of
code, and a lot of others do too (including gigantic projects like Linux or FreeBSD).
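
To make the double evaluation problem concrete, here is a small sketch that counts how often an
argument with a side effect gets evaluated.
The names `NAIVE_MAX`, `SAFE_MAX`, `next_value` and the two `*_call_count` helpers are invented for
this example, and `__typeof__` is used instead of `typeof` only because gcc and clang accept that
spelling regardless of the selected language standard.

```c
#include <assert.h>

/* Naive version: whichever argument wins is evaluated twice. */
#define NAIVE_MAX(a, b) ((a) > (b) ? (a) : (b))

/* GNU C version: each argument is evaluated exactly once. */
#define SAFE_MAX(a, b) ({           \
    __typeof__(a) _a = (a);         \
    __typeof__(b) _b = (b);         \
    _a > _b ? _a : _b;              \
})

static int calls; /* how often next_value() has run */

/* Argument with a side effect: bumps and returns the counter. */
static int next_value(void)
{
    return ++calls;
}

/* Returns how many times next_value() ran inside NAIVE_MAX. */
static int naive_call_count(void)
{
    calls = 0;
    (void)NAIVE_MAX(next_value(), 0);
    return calls;
}

/* Returns how many times next_value() ran inside SAFE_MAX. */
static int safe_call_count(void)
{
    calls = 0;
    (void)SAFE_MAX(next_value(), 0);
    return calls;
}
```

`naive_call_count()` yields 2 because the winning operand is expanded twice, whereas
`safe_call_count()` yields 1.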

## Integer Ranges

Now that's where my take is starting to get interesting and significantly higher in temperature,
but hear me out.

Suppose you have a function that takes a parameter which will be used as the count for a bitshift,
or generally anything that has to be constrained to a certain range of values.
Then, suppose you don't trust yourself because you regularly write code at 3 AM and might pass some
value that is greater than or equal to the width of the integer you're shifting.
So, because you're a good boy/girl/enby who avoids undefined behavior, you write a debug assertion
that checks whether the parameter is less than `LONG_BIT` or whatever:

```c
vm_page_t alloc_pages(unsigned int order)
{
    assert(order < NR_ORDERS);
    struct pool *pool = &pools[order];
    /* blah */
    size_t size = (1 << order) * PAGE_SIZE;
    /* blah */
}
```

Now, that can obviously become tedious if you have a lot of functions that require a constraint like
this.
But wait a minute, doesn't the compiler already do these sorts of sanity checks for pointer types
and such, so you don't accidentally assign an `int *` to a `char *`?
It sure does, so why wouldn't we extend this capability to the actual _values_ themselves, rather
than just the _types_?

Optimizing compilers already do static analysis to track what values a variable may have at any
given point in the program so they can do their fun little tricks that have absolutely never
resulted in any bugs in the output binary whatsoever.
To be perfectly clear about what i mean, here is another example:

```c
void do_stuff(int x)
{
    /* x could have any value here */

    if (x >= 8 && x < 16) {
        /* x must be >= 8 and < 16 if
         * we reach this scope (duh) */
    }

    /* x might be all sorts of things */
}
```

Now, my proposal is to write the range in angle brackets directly after the type, as in `int<0,10>`
for any integer from 0 (inclusive) to 10 (exclusive).
I don't care what the exact syntax would look like, though, and considering that C is pretty much
the queen of cursed syntax anyway it doesn't really matter in my opinion.
By the way, i'm consciously not writing that in a code block because it makes my syntax parser freak
out a little, but i'm sure you can use your imagination.
If not, just see the [fancy sample code](/static/type-c).
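
For comparison, the closest approximation in today's C is probably a wrapper struct with a single
checked constructor.
This is only a sketch under made-up names (`struct order`, `order_is_valid`, `order_from`,
`pages_in_order`, and a hypothetical `NR_ORDERS` of 10), and unlike the proposal it checks at run
time rather than at compile time.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ORDERS 10u /* hypothetical upper bound, exclusive */

/* Stand-in for the proposed range type with bounds 0 and NR_ORDERS. */
struct order {
    unsigned int value;
};

/* Range check that callers can run before constructing an order. */
static bool order_is_valid(unsigned int raw)
{
    return raw < NR_ORDERS;
}

/* The only place that has to assert the range. */
static struct order order_from(unsigned int raw)
{
    assert(order_is_valid(raw));
    return (struct order){ .value = raw };
}

/* No assertion needed here: by convention, every struct order
 * was created through order_from(). */
static unsigned long pages_in_order(struct order order)
{
    return 1ul << order.value;
}
```

Nothing stops code from filling in a `struct order` by hand with a bogus value, which is exactly the
gap a range checked by the compiler would close.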

## Dynamic Size Annotated Arrays

This is already a proposal for C23 if i remember correctly, but it's so useful that i wanted to
include it anyway.
It's kind of related to the type range feature.
Have a look at the signature of `read(2)`:

```c
ssize_t read(int fd, void *buf, size_t nbytes);
```

This has worked well for several decades.
But it still has a fatal flaw: Literally nothing is stopping you from passing a `buf` that is
smaller than `nbytes`.
What if instead the signature looked like this:

```c
ssize_t read(int fd, char buf[nbytes], size_t nbytes);
```

Wherever the size of the buffer is known at the call site, the compiler could easily figure out
whether it is sufficiently sized, and emit a warning if not (for example, because it knows how big
a memory area returned from `malloc()` is).
A drawback of this is of course that it would require casting any buffer to a `char *` before
passing it to the function, which could be compensated by making `void` arrays behave like `char`
ones _in this specific situation_:

```c
ssize_t read(int fd, void buf[nbytes], size_t nbytes);
```

The implementation of `read` would still need to perform some form of explicit or implicit type
cast in order to write to the destination buffer, of course.
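
Part of this can already be spelled in C99 and later: a parameter may be declared as an array whose
length is an earlier parameter, which at least documents the relationship and gives compilers
something to check against.
The size has to come before the buffer, though, and `void` elements are not allowed.
A minimal sketch, where `fill_buffer` and `fill_demo` are made-up functions standing in for `read`
and one of its callers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C99 variably modified parameter: buf is declared as an array of
 * nbytes chars (it still decays to char *). nbytes must be declared
 * before buf. Returns the number of bytes written. */
static size_t fill_buffer(size_t nbytes, char buf[nbytes], char value)
{
    memset(buf, value, nbytes);
    return nbytes;
}

/* Fills a local 16 byte buffer and returns the last byte written. */
static char fill_demo(void)
{
    char buf[16];
    size_t n = fill_buffer(sizeof(buf), buf, 'x');
    return buf[n - 1];
}
```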

## Better Type Obfuscation

This one is inspired by physicists insisting on always using the correct unit along with numbers.
Let's say you are writing a security critical function that checks whether a process is a member of
a certain group.
It's probably not a good idea to accidentally mix up uid and gid, but since both are usually
`typedef`ed to `int` or something similar, it is pretty easy to do so:

```c
typedef int uid_t;
typedef int gid_t;

bool has_gid(const struct task *task, gid_t gid);
```

Nothing is stopping you from passing a uid to this function.
This is a bad thing.
So, why not make `uid_t` and `gid_t` completely obfuscated types that are mutually incompatible?
I propose the following syntax that makes values of type `uid_t` and `gid_t` incompatible with
values of any other integral type, unless there is an explicit cast:

```c
typedef int ~uid_t;
typedef int ~gid_t;
```

Of course, you also can't assign a `uid_t` to a `gid_t` and vice versa.
In effect, this would be the same as encapsulating the actual value into a struct, but without
having to access the member every time you need the raw integral value.
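
For reference, the struct workaround mentioned above looks roughly like this in today's C.
It is only a sketch: `task_uid`, `task_gid`, the stripped-down `struct task` and `demo_has_gid` are
hypothetical names chosen so they don't collide with the real `uid_t` and `gid_t`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical wrapper types; distinct structs never convert implicitly. */
typedef struct { int raw; } task_uid;
typedef struct { int raw; } task_gid;

/* Hypothetical, stripped-down task with a single group. */
struct task {
    task_uid uid;
    task_gid gid;
};

static bool has_gid(const struct task *task, task_gid gid)
{
    /* The member access the proposed syntax would make unnecessary. */
    return task->gid.raw == gid.raw;
}

/* Builds a task with uid 1000 and gid 100 and checks whether it
 * belongs to the group given as a raw integer. */
static bool demo_has_gid(int raw_gid)
{
    struct task task = { .uid = { 1000 }, .gid = { 100 } };
    /* has_gid(&task, task.uid) would not compile. */
    return has_gid(&task, (task_gid){ raw_gid });
}
```

The price is the `.raw` access and the compound literal at every boundary, which is the boilerplate
the `~` syntax is meant to remove.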

## Putting It All Together

There is some minor stuff that i left out in this article but included in the
[fancy sample code](/static/type-c)
file, but it's mostly just a logical conclusion of the concepts described herein.
I hope you find these ideas as interesting as i do, and maybe someone (hopefully not me, since i am
drowning in side projects already) will write a transpiler for this dialect of C.

The only thing that's left now is to give it a name.
How about ... Type-C?
Because it's primarily an extension of the type system, and i like the idea of making the whole
ambiguity around a certain serial bus standard even more confusing than it already is.
I'm envisioning this to be kind of what TypeScript is to JavaScript: just a transpiled language
that is a superset of the language it compiles to.
Let me know what you think in the comments, and be sure to like and subscribe as well as hit the
bell icon so you won't miss any future videos.