'C Pointer Arithmetic for Unusual Architectures
I'm trying to get a better understanding of the C standard. In particular I am interested in how pointer arithmetic might work in an implementation for an unusual machine architecture.
Suppose I have a processor with 64 bit wide registers that is connected to RAM where each address corresponds to a cell 8 bits wide. An implementation for C for this machine defines CHAR_BIT to be equal to 8. Suppose I compile and execute the following lines of code:
char *pointer = 0;
pointer = pointer + 1;
After execution, pointer is equal to 1. This gives one the impression that in general data of type char corresponds to the smallest addressable unit of memory on the machine.
Now suppose I have a processor with 12 bit wide registers that is connected to RAM where each address corresponds to a cell 4 bits wide. An implementation of C for this machine defines CHAR_BIT to be equal to 12. Suppose the same lines of code are compiled and executed for this machine. Would pointer be equal to 3?
More generally, when you increment a pointer to a char, is the address equal to CHAR_BIT divided by the width of a memory cell on the machine?
Solution 1:[1]
Would pointer be equal to 3?
Well, the standard doesn't say how pointers are implemented. The standard tells what is to happen when you use a pointer in a specific way but not what the value of a pointer shall be.
All we know is that adding 1 to a char pointer, will make the pointer point at the next char object - where ever that is. But nothing about pointers value.
So when you say that
pointer = pointer + 1;
will make the pointer equal 1, it's wrong. The standard doesn't say anything about that.
On most systems a char
is 8 bit and pointers are (virtual) memory addresses referencing a 8 bit addressable memory loacation. On such systems incrementing a char pointer will increase the pointer value (aka memory address) by 1. However, on - unusual architectures - there is no way to tell.
But if you have a system where each memory address references 4 bits and a char is 12 bits, it seems a good guess that ++pointer
will increase the pointer by three.
Solution 2:[2]
Pointers are incremented by the minimum of they width of the datatype they "point to", but are not guaranteed to increment to that size exactly.
For memory alignment purposes, there are many times where a pointer might increment to the next memory word alignment past the minimum width.
So, in general, you cannot assume this pointer to be equal to 3. It very well may be 3, 4, or some larger number.
Here is an example.
struct char_three {
char a;
char b;
char c;
};
struct char_three* my_pointer = 0;
my_pointer++;
/* I'd be shocked if my_pointer was now 3 */
Memory alignment is machine specific. One cannot generalize about it, except that most machines define a WORD as the first address that can be aligned to a memory fetch on the bus. Some machines can specify addresses that don't align with the bus fetches. In such a case, selecting two bytes that span the alignment may result in loading two WORDS.
Most systems don't accept WORD loads on non-aligned boundaries without complaining. This means that a bit of boiler plate assembly is applied to translate the fetch to the proceeding WORD boundary, if maximum density is desired.
Most compilers prefer speed to maximum density of data, so they align their structured data to take advantage of WORD boundaries, avoiding the extra calculations. This means that in many cases, data that is not carefully aligned might contain "holes" of bytes that are not used.
If you are interested in details of the above summary, you can read up on Data Structure Alignment which will discuss alignment (and as a consequence) padding.
Solution 3:[3]
char *pointer = 0;
After execution, pointer is equal to 1
Not necessarily.
This special case gives you a null pointer, since 0
is a null pointer constant. Strictly speaking, such a pointer is not supposed to point at a valid object. If you look at the actual address stored in the pointer, it could be anything.
Null pointers aside, the C language expects you to do pointer arithmetic by first pointing at an array. Or in case of char
, you can also point at a chunk of generic data such as a struct. Everything else, like your example, is undefined behavior.
An implementation of C for this machine defines CHAR_BIT to be equal to 12
The C standard defines char
to be equal to a byte, so your example is a bit weird and contradicting. Pointer arithmetic will always increase the pointer to point at the next object in the array. The standard doesn't really speak of representation of addresses at all, but your fictional example that would sensibly increase the address by 12 bits, because that's the size of a char
.
Fictional computers are quite meaningless to discuss even from a learning point-of-view. I'd advise to focus on real-world computers instead.
Solution 4:[4]
When you increment a pointer to a char, is the address equal to CHAR_BIT divided by the width of a memory cell on the machine?
On a "conventional" machine -- indeed on the vast majority of machines where C runs -- CHAR_BIT
simply is the width of a memory cell on the machine, so the answer to the question is vacuously "yes" (since CHAR_BIT / CHAR_BIT
is 1.).
A machine with memory cells smaller than CHAR_BIT
would be very, very strange -- arguably incompatible with C's definition.
C's definition says that:
sizeof(char)
is exactly 1.CHAR_BIT
, the number of bits in achar
, is at least 8. That is, as far as C is concerned, a byte may not be smaller than 8 bits. (It may be larger, and this is a surprise to many people, but it does not concern us here.)There is a strong suggestion (if not an explicit requirement) that
char
(or "byte") is the machine's "minimum addressable unit" or some such.
So for a machine that can address 4 bits at a time, we would have to pick unnatural values for sizeof(char)
and CHAR_BIT
(which would otherwise probably want to be 2
and 4
, respectively), and we would have to ignore the suggestion that type char
is the machine's minimum addressable unit.
C does not mandate the internal representation (the bit pattern) of a pointer. The closest a portable C program can get to doing anything with the internal representation of a pointer value is to print it out using %p
-- and that's explicitly defined to be implementation-defined.
So I think the only way to implement C on a "4 bit" machine would involve having the code
char a[10];
char *p = a;
p++;
generate instructions which actually incremented the address behind p
by 2.
It would then be an interesting question whether %p
should print the actual, raw pointer value, or the value divided by 2.
It would also be lots of fun to watch the ensuing fireworks as too-clever programmers on such a machine used type punning techniques to get their hands on the internal value of pointers so that they could increment them by actually 1 -- not the 2 that "proper" additions of 1 would always generate -- such that they could amaze their friends by accessing the odd nybble of a byte, or confound the regulars on SO by asking questions about it. "I just incremented a char pointer by 1. Why is %p
showing a value that's 2 greater?"
Solution 5:[5]
Seems like the confusion in this question comes from the fact that the word "byte" in the C standard doesn't have the typical definition (which is 8 bits). Specifically, the word "byte" in the C standard means a collection of bits, where the number of bits is specified by the implementation-defined constant CHAR_BITS
. Furthermore, a "byte" as defined by the C standard is the smallest addressable object that a C program can access.
This leaves open the question as to whether there is a one-to-one correspondence between the C definition of "addressable", and the hardware's definition of "addressable". In other words, is it possible that the hardware can address objects that are smaller than a "byte"? If (as in the OP) a "byte" occupies 3 addresses, then that implies that "byte" accesses have an alignment restriction. Which is to say that 3 and 6 are valid "byte" addresses, but 4 and 5 are not. This is prohibited by section 6.2.8 which discusses the alignment of objects.
Which means that the architecture proposed by the OP is not supported by the C specification. In particular, an implementation may not have pointers that point to 4-bit objects when CHAR_BIT
is equal to 12.
Here are the relevant sections from the C standard:
§3.6 The definition of "byte" as used in the standard
[A byte is an] addressable unit of data storage large enough to hold any member of the basic character set of the execution environment.
NOTE 1 It is possible to express the address of each individual byte of an object uniquely.
NOTE 2 A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit.
§5.2.4.2.1 describes CHAR_BIT as the
number of bits for smallest object that is not a bit-field (byte)
§6.2.6.1 restricts all objects that are larger than a char to be a multiple of CHAR_BIT bits:
[...] Except for bit-fields, objects are composed of contiguous sequences of one or more bytes, the number, order, and encoding of which are either explicitly specified or implementation-defined.
[...] Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes.
§6.2.8 restricts the alignment of objects
Complete object types have alignment requirements which place restrictions on the addresses at which objects of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated.
Valid alignments include only those values returned by an _Alignof expression for fundamental types, plus an additional implementation-defined set of values, which may be empty. Every valid alignment value shall be a nonnegative integral power of two.
§6.5.3.2 specifies the sizeof
a char
, and hence a "byte"
When sizeof is applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1.
Solution 6:[6]
The following code fragment demonstrates an invariant of C pointer arithmetic -- no matter what CHAR_BIT
is, no matter what the hardware least addressable unit is, and no matter what the actual bit representation of pointers is,
#include <assert.h>
int main(void)
{
T x[2]; // for any object type T whatsoever
assert(&x[1] - &x[0] == 1); // must be true
}
And since sizeof(char) == 1
by definition, this also means that
#include <assert.h>
int main(void)
{
T x[2]; // again for any object type T whatsoever
char *p = (char *)&x[0];
char *q = (char *)&x[1];
assert(q - p == sizeof(T)); // must be true
}
However, if you convert to integers before performing the subtraction, the invariant evaporates:
#include <assert.h>
#include <inttypes.h>
int main(void);
{
T x[2];
uintptr_t p = (uintptr_t)&x[0];
uintptr_t q = (uintptr_t)&x[1];
assert(q - p == sizeof(T)); // implementation-defined whether true
}
because the transformation performed by converting a pointer to an integer of the same size, or vice versa, is implementation-defined. I think it's required to be bijective, but I could be wrong about that, and it is definitely not required to preserve any of the above invariants.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | Lundin |
Solution 4 | |
Solution 5 | |
Solution 6 | zwol |