Computer science 101: Data types

Important

I'm assuming you have some basic programming knowledge and that you know how binary numbers work.

Programming is the art of manipulating and displaying data, but not all data is created and manipulated equally. That's why data types are important: they tell the compiler/interpreter how to make sense of the data you give it.

Despite being useful, data types are also a bit misleading. You see, computer science started out as a branch of mathematics, but ended up evolving into something different from just applied math.

Although CS (computer science) became something else, the data types we use still refer to mathematical concepts. And while those concepts help us understand the code, they don't give us the full picture.

The int

Take for example the integer type (int). In most programming languages it is used to represent just that: an integer value. But if you look through some code on the internet you will see some strange things.

First the distinction between signed and unsigned numbers, then comes the uintX_t family, and finally some cryptic symbols like >>=, |= and ^=. What is going on here? Is CS having a stroke?

Well, the reason why CS's integer (and its other data types as well) seems so unhinged is that in math an integer is an idea, while in CS an integer is a space in memory (RAM, ROM, hard drive, etc.).

Bit-wise operations

Being a space in memory means that a data type is made of bits and bytes, bits and bytes that can be directly manipulated. That's the reason why you can use bit-wise operations on data types like int or char.

Bit-wise operations are operations that run on every bit of a value. Most of them represent operations from boolean algebra.

The most common operators are 'or' (|, ∨), 'and' (&, ∧), 'not' (~, ¬), 'xor' (^, ⊕), left shift (<<) and right shift (>>).

When you apply a bit-wise operator to an integer you are considering the bits that make up that integer and not the actual number that integer represents.

Example

If you apply an 'or' (|, ∨) to the numbers 5 (00000101) and 6 (00000110) you get 7 (00000111).

The result may seem strange at first, but if you apply the 'or' to each bit it will make more sense...

\begin{aligned} \underline{5\lor6=7}\\ 0\lor0=0\\ 0\lor0=0\\ 0\lor0=0\\ 0\lor0=0\\ 0\lor0=0\\ 1\lor1=1\\ 0\lor1=1\\ 1\lor0=1 \end{aligned}

The example above uses mathematical notation to make things easier to visualize.

The sizeof the problem

The architecture of a computer defines a lot of things, but most importantly it defines the sizes data types can have. On a 32-bit machine an int is typically 32 bits (4 bytes); on a 64-bit machine it may be larger, though in practice most compilers still keep int at 32 bits and widen long and pointers instead. The C standard only guarantees minimum sizes, not exact ones.

That's why the sizeof operator exists: it lets you know what size of int you are actually working with. That's also why the uintX_t family exists; sometimes you don't need a 64-bit-long integer. By using uint32_t, for example, you ensure the integer you work with is 32 bits long regardless of the architecture of the machine that runs your code.

The long

Sometimes you may need large integers, even larger than the architecture your code runs on supports. To fix that you can use the long keyword.

The keyword long has very specific behavior, but in short it makes the size of a data type longer, even allowing you to go over the native word size of the machine running your code.

How can it go past the architecture?

To go over the limit you just need to divide the number into smaller chunks.

For example, to use the 32-bit integer 10011000101010101001100110011001 (2561317273) on an 8-bit machine, first divide it into 8-bit chunks like this: 10011000, 10101010, 10011001 and 10011001.

Then perform the desired operations on each chunk, respecting their order and passing any carries or borrows between them. For addition you must start from the least significant chunk, so the carry can flow upward; for other operations, like comparison, you start from the most significant one instead.

Signed vs unsigned

The sign of a number is important in math, but in CS it seems a lot more important. Why does CS care so much about the sign of a number? Because while in math the sign doesn't affect operations that much, in CS it changes everything.

Firstly, signed numbers have a "smaller" range than unsigned ones, because in signed numbers the first bit (from left to right) is dedicated to the sign.

Example

Unsigned 8-bit numbers can range between 0 (00000000) and 255 (11111111), while signed ones can range between -128 (10000000) and 127 (01111111).

Pay attention to the fact that for unsigned numbers 11111111 means 255, but for signed numbers it means -1. That's one of the reasons why the sign of a number is so important.

Lastly, in math -1 is straightforward, but in CS it requires some extra steps. You see, in order to simplify the math required by negative numbers, compilers use a system known as two's complement: to negate a number you invert all of its bits and add one.

Defining a data type as unsigned avoids the need for two's complement and removes the confusion around its upper and lower limits.

Non integers

Integers might be useful but they are only a fraction of all real numbers.

Fixed point

The first type used for representing real numbers is fixed point. This type puts a point between the bits of a number. The point is always at the same place, and that's why fixed-point numbers have a fixed precision.

Example

The number 5.0 can be represented as 01010000; you can imagine a point between the bits, like this: 0101.0000.

So to represent the number 4.0 you use 01000000 (0100.0000), and to represent 0.5 you use 00001000 (0000.1000). Note that not every number fits: 6.4, for example, has no exact representation in this format; the closest you can get is 01100110 (0110.0110, which is 6.375).

Although fixed point can be useful, its limited precision is, well, limiting, especially when you need a sign as well.

The float

The floating-point type (float) is the solution to the limited precision of fixed point. Floats use a system akin to scientific notation, where a number is broken into "smaller pieces".

Example

The number 6.4 can be expressed as 64/10 or 64 × 10⁻¹. The two numbers that are important in scientific notation are the mantissa m (64) and the exponent e (-1), where a number n can be written as n = m × 10^e.

That's why scientific notation is so useful, it turns a real number into two integers.

To store a float you need to divide the number into the mantissa and the exponent, and, as you may have guessed, how the bits are split depends on the architecture.

Example

In a 32-bit float (the IEEE 754 single-precision format), 1 bit is dedicated to the sign, 8 bits to the exponent and the remaining 23 bits to the mantissa.

So the number 0.15625, or 15625 × 10⁻⁵, can be stored as 00111110001000000000000000000000, where:

\overbrace{0}^{sign}\ \overbrace{01111100}^{exponent}\ \overbrace{01000000000000000000000}^{mantissa}

(In practice floats use base 2 rather than base 10: 0.15625 = 1.25 × 2⁻³. The exponent field stores -3 plus a bias of 127, giving 124 = 01111100, and the mantissa field stores the fractional bits of 1.25.)

The precise elephant in the room

If you play around with floats you may realize that some compilers don't let you use bit-wise operations on them. That seems very strange, doesn't it? Both are made of bits, so why does that restriction exist?

Bit-wise operations don't exist just for the sake of existing; they are very useful and hold some mathematical properties.

Example

Bit shifting a number to the left (<<) is the same as multiplying it by 2, and bit shifting it to the right (>>) is the same as dividing it by 2 and discarding the remainder.

The number 7 bit shifted left once (7 << 1) is 14, and 7 bit shifted right once (7 >> 1) is 3, because 7/2 = 3.5, and if you discard the remainder you get 3.

The problem is that those properties only hold true for integers, whose representation is straightforward; a float's bits encode a sign, an exponent and a mantissa, so shifting or masking them doesn't map to any simple arithmetic.

So if you apply a bit-wise operation to a float you get strange results that break those mathematical properties, and that's why some compilers block you from using them on floats.

The char

Another oddball is the character type (char), not in its regular form but in the cryptic unsigned char form. What does it mean to have an unsigned character? Is it like the a in algebra, where the letter represents a number? Almost like that, but in a different sense.

Most programming languages and computers use a system known as ASCII, where each character gets assigned an integer value. So to some extent, yes, characters do indeed represent numbers; it's just that the numbers they represent don't change.

But the reason why unsigned char is a thing isn't its ASCII index but its size in memory. A char gets an 8-bit space dedicated to it; in other words, a char can represent a byte, and by defining it as unsigned we remove the confusion mentioned previously.

In the end, unsigned char is another way of saying "a byte that ranges from 0 to 255". As you might have guessed, using unsigned char to represent a byte isn't the best practice; you should use uint8_t instead and leave char for when you actually need a character, like in arrays of characters, aka strings.

Final words

I hope this article helps remove some of the confusion that may arise from data types. Next time I want to talk about pointers.