What In The Hell | Basic Numeric Data Types (Char, Integer, Short, Signed)

Numbers are an abstract idea. A concept that can only be approximated in reality. To illustrate this concept a little better, take out some paper and write down the answer to 1/3.

Are you done? How could you be? Whatever you wrote down I could make a better by adding another 3 at the end, or another trillion 3’s at the end. How about the biggest number you can think of? Any number you can write down I can make better by adding one more digit.

So you can see that we have to put some limits somewhere. The limits that we put on numbers in a computer correspond to the amount of memory it takes to represent those numbers.

To understand this, we need to know a little bit about computer memory. Computers store information in bits. A bit is a binary number (1 or 0) that represents the on or off state of the memory. Since a single bit is often too little to use effectively, computer memory is organized into blocks of 8 bits called bytes.

The 8 bit byte can represent the numbers 00000000 through 11111111 in binary, which is 0 to 255. Your computer’s memory is organized into a big row of bytes.

Each address above refers to one byte. A char is one byte. However, what do we do when we need a number bigger than 255? We have to use more than one byte. A number that is 2 bytes can represent a number from 0 to 65,535 and is called a short int.

A 4 byte number (32 bit) could represent 0 to 4,294,967,295 and is called an int or integer.

An 8 byte number (64 bit) could represent 0 to 18,446,744,073,709,551,615 and is called a double.

Another way to look at the basic numeric types is as names for sizes of memory.

What happens when we want negative numbers?

Let’s go back to the char or the single byte. A char can represent 256 values and only 256 values, so to get negative numbers we need to give up some of the positive ones. In fact we’ll give up exactly half of the positive ones and represent the numbers -128 through 127.

When we want to represent negative numbers we use a signed type. Unsigned types represent only positive numbers and 0. Any of char, short int, int or double can be signed or unsigned and each follows the split in half rule.

What about the decimal point?

Numbers with a decimal point in them are called floating point numbers. An entire book can probably be written about floating point numbers as they can become quite complicated. A future WITH article will cover floating point numbers in more detail.

* The size of each char, int, etc can vary depending upon the hardware. This article used common sizes for a 32 bit architecture.
* Teaching Tip: Write down numbers on a whiteboard/paper and measure them with a ruler. Imagine each inch is a block of memory. How many different numbers can you store in one block? How about two or four? What would you name those blocks so you could talk about them with other people? Bonus: how many blocks does it take to write a character?

GOTO Table of Contents

What In The Hell Are Strings

GOTO All What In The Hell Articles

Strings are a data type that consist of a sequence of characters. The simplest way to think about strings is that they are text. Strings are usually formed by “putting quotes around some text” or by using a string function.

Strings come in two varieties. Terminated strings use a special character to signal the end of the string. This character is not allowed to appear in the string, as it would cause the string to end, leaving out the rest. Strings can also be implemented using a length field. These strings do not need a terminating character as the length of the string is stored with the string. In many languages you will not be able to tell which type of string you are using and it is usually irrelevant. However some languages, like C, require you to place the termination character manually.

Strings can use a variety of encodings for the characters contained within. The two most common are ASCII, which can represent 128 different characters, and Unicode, which has support for over 100,000 different characters and allows strings to hold non-English characters. Both forms of encoding are popular today, however most applications are moving toward Unicode characters as the need for international software grows.

One common problem with strings is representing characters that would otherwise be interpreted by the language to mean something else. For instance the string “Have you read the book “1984” by George Orwell?” would be interpreted as 3 expressions; (1) The string: “Have you read the book ” (2) The number: 1984 (3) The string: ” by George Orwell?”. This is obviously not what we want. In this case we would use an escape character to stop the quotes around “1984” from terminating the string. Our new string would look something like this (this may vary from language to language): “Have you read the book \”1984\” by George Orwell?”. In this string the ‘\’ character tells the language not to end the string when it sees the following “.

 

Terminology for Strings

Concatenation:
Joining two strings together to form one string is called concatenation.

Substring:
A substring is a part of a larger string. For example, in the string “Hello World” one possible substring would be “Hello”, and another could be “llo Wo”.

Literal:
A literal string is a string that appears directly in the source code. Strings formed by “putting quotes around them” are usually literal strings in a language. To build a regular string you must usually use a string function. In languages with mutable strings it is usually not possible to mutate a literal string, even where it is possible it is almost never a good idea.  More information on string literals can be found here.

 

Code

C
Strings in C are represented as a null-terminated array of characters. To create a string you create an array, fill it with characters and place a null character directly after the last character in the string. If you are using string literals to create a string you do not have to supply the size or the null character, the compiler will do that for you.

/* Create a string */
char s[6] = {'H', 'e', 'l', 'l', 'o', '\0'}
char sl[] = "Hello World"

/* Escape Character */
char e[50] = "Have you read the book \"1984\" by George Orwell?"

Python
Strings in Python are enclosed in quotes. You can use one of three different quoting styles to create a string. Single quotes are useful when you have double quote characters in the string. Triple quotes allow you to place newlines in the string.

# Create a string
a = "Hello World"
b = 'Have you read "1984" by George Orwell?'
c = """Materials:
Pen
Paper"""
# Escape Character
d = "Have you read the book \"1984\" by George Orwell?"

Scheme
Strings in Scheme are enclosed in quotes or created with the ‘string’ function. Newlines may be embedded in any string.

;; Create a string
(define a "Hello World")
(define b (string #\H #\e #\l #\l #\o))

;; Escape Character:
(define c "Have you read the book \"1984\" by George Orwell?")

 

Performance

The performance characteristics of a string will depend on the data type that is used to represent them. Most often arrays are used, however linked lists are somewhat common as well. Strings may also be either mutable or immutable which can have an impact on their algorithmic performance.

Because of these factors an accurate overview of the performance of strings cannot be given.  However in the vast majority of cases strings will share the same characteristics as arrays.

In the case of terminated strings implemented as an array the indexing time can still be O(1) however supplying an index greater than the length will access memory outside the string and, hopefully, cause an error. You may need to determine the length first to prevent this, which would be an O(n) operation.

 

What the heck is: a String
A great article on strings, much more in-depth than this one. Incidentally where I got the inspiration for this series.

 

GOTO Table of Contents

What In The Hell Are Arrays

GOTO All What In The Hell Articles

Arrays are one of the two fundamental data structures in computer science (the other is the tree).  An array is a contiguous (uninterrupted or sequential) block of memory that stores a set of elements.

Elements in an array can be accessed by their index.  An array’s index usually starts at 0 and stops at the number of elements in the array minus 1. The diagram below shows an array of the characters in the string “HELLO”.

In the array above the element at index 0 is the character “H”, the element at index 1 is the character “E” and so on.

Arrays have constant time access to the elements stored within. This means that it takes the same amount of time to access any element. For example, it would take the same amount of time to access the last element of an array as it would the first, or even an element in the middle.

A multi-dimensional array is simply an array of arrays. A spreadsheet is a form of multi-dimensional array. The diagram above shows a 3 element array where each element is also a 3 element array. Multi-dimensional arrays can be used for a variety of tasks from tracking pieces on a chess board to drawing graphics on the screen.

A “true” multi-dimensional array is implemented as a contiguous block of memory, just like the regular array. However it is also common practice to implement multi-dimensional arrays as trees.

 

Code

C

An array in C can contain elements of only one type/size and is not resizable.

/* Create an array */
int a[3] = {5, 10, 15};

/* Access an array element */
a[1] /*==> 10 */

Python

An array in Python is called a list. It can contain elements of any type/size and is resizable.


# Create an array
a = [5, 10, 15]

# Access an array element

a[1]  #==>  10

Scheme

An array in Scheme is called a vector. It can contain elements of any type/size and may be resizable.


;; Create an array

(define a (vector 5 10 15))
(define b #(5 10 15))

;; Access an array element

(vector-ref a 1)  ;;==>  10

 

What is the index?

An index is a way of labeling elements in an array. The first element is usually located at index 0, the second at index 1, the third at 2 and so on. The index is used to calculate the address of the element.

To understand this we have to remember that an array is just a reserved block of memory. Let’s look at a simple diagram of how the memory in your computer is organized.

Each block of memory can hold one value (usually equal to an integer in a language like C). To create an array we reserve a block of memory big enough to hold all our elements. If we had 4 elements it would look like the diagram below.

To access our array we only need to keep around the address of the beginning of our reserved space or 102 in this example. This address is equal to the index 0. But our index values still don’t point us at the right addresses yet. To do that we multiply the index by the size of an element then add that value to the start of our array. Let’s see how that maps out on our array.

Sometimes elements in an array are larger than just one block of memory. Here is what an array looks like when each element is two blocks of memory large.

An array can only contain elements of one size. This is because we need to multiply the index by the size of the elements in the array. How then do some languages let you put elements in an array that are different sizes? One simple way to do that is to use the size of the largest element in the array for indexing. This leads to some wasted space when storing smaller elements but allows the array to work just like before.

How do we know when we’re at the end of the array? One way is to store the length of the array with the address to the first element. In that case we just check if the index is <= the length. Another way to signal the end of an array is a sentinel node. A sentinel node is a value that can only occur at the end of an array and signals the end.

 

Performance

Indexing: O(1)
Insertion/Removal: O(n)
Search: O(n)

Arrays are either static or dynamic. A static array is set at a fixed size, elements can be inserted and removed but the array cannot grow or shrink. A dynamic array, on the other hand, can grow or shrink. Dynamic arrays are often initialized to be larger than the set of elements it contains. When enough elements are added to the array, such that the number of elements is greater than the array size, the array must be resized. Resizing an array could be a costly operation if the array is large or the act of copying is slow.

 

GOTO Table of Contents