What In The Hell | Basic Numeric Data Types (Char, Integer, Short, Signed)

Numbers are an abstract idea. A concept that can only be approximated in reality. To illustrate this concept a little better, take out some paper and write down the answer to 1/3.

Are you done? How could you be? Whatever you wrote down I could make a better by adding another 3 at the end, or another trillion 3’s at the end. How about the biggest number you can think of? Any number you can write down I can make better by adding one more digit.

So you can see that we have to put some limits somewhere. The limits that we put on numbers in a computer correspond to the amount of memory it takes to represent those numbers.

To understand this, we need to know a little bit about computer memory. Computers store information in bits. A bit is a binary number (1 or 0) that represents the on or off state of the memory. Since a single bit is often too little to use effectively, computer memory is organized into blocks of 8 bits called bytes.

The 8 bit byte can represent the numbers 00000000 through 11111111 in binary, which is 0 to 255. Your computer’s memory is organized into a big row of bytes.

Each address above refers to one byte. A char is one byte. However, what do we do when we need a number bigger than 255? We have to use more than one byte. A number that is 2 bytes can represent a number from 0 to 65,535 and is called a short int.

A 4 byte number (32 bit) could represent 0 to 4,294,967,295 and is called an int or integer.

An 8 byte number (64 bit) could represent 0 to 18,446,744,073,709,551,615 and is called a double.

Another way to look at the basic numeric types is as names for sizes of memory.

What happens when we want negative numbers?

Let’s go back to the char or the single byte. A char can represent 256 values and only 256 values, so to get negative numbers we need to give up some of the positive ones. In fact we’ll give up exactly half of the positive ones and represent the numbers -128 through 127.

When we want to represent negative numbers we use a signed type. Unsigned types represent only positive numbers and 0. Any of char, short int, int or double can be signed or unsigned and each follows the split in half rule.

What about the decimal point?

Numbers with a decimal point in them are called floating point numbers. An entire book can probably be written about floating point numbers as they can become quite complicated. A future WITH article will cover floating point numbers in more detail.

* The size of each char, int, etc can vary depending upon the hardware. This article used common sizes for a 32 bit architecture.
* Teaching Tip: Write down numbers on a whiteboard/paper and measure them with a ruler. Imagine each inch is a block of memory. How many different numbers can you store in one block? How about two or four? What would you name those blocks so you could talk about them with other people? Bonus: how many blocks does it take to write a character?

GOTO Table of Contents

Advertisements

What In The Hell Are Trees

GOTO All What In The Hell Articles

A tree is a hierarchical data structure that contains a set of nodes. The node is the the basic building block of a tree and to understand trees you have to understand nodes.

A node is a data structure that has a parent, can contain a value and can have 0 or more child nodes.  A node with no child nodes is called a leaf node, or more commonly just a leaf.

A node with no parent is referred to as a root node.

Each node can have a child node of its own.

A node can have more than two children too.

By now you should grasp the basic structure of a tree, but what are they good for and where are they used?  Well the good news is that trees are all around you.  Your operating system’s directory structure is a tree, HTML and XML are both trees.  Your filing cabinet could even be considered a tree.  A linked list is also a form of tree, where each node has exactly one child node.  Let’s take a look at some more trees.

You belong to a tree too!

and of course…

 

Implementation of Trees

Programming languages have varying levels of support for tree-like data structures.  The abstract idea of a tree and the implementation of one are often very different things.  Pretty much any nestable data structure can suffice for an ad-hoc tree implementation.  For example here’s one of the simplest trees you can implement in Python.

class Node:
  def __init__(self, value=None, *args):
    self.value = value
    self.children = args

# A tree can then be defined as
tree = Node('root', Node(1), Node(2))

Scheme makes it quite a bit easier, both code and data are represented using trees (actually lists).

(define tree (cons 1 2))
tree
;>>> (1 . 2)

(define nested-tree (cons 1 (cons 2 3)))
nested-tree
;>>> (1 2 . 3)

If you’re wondering why the nested-tree looks like a list, remember that lists are a subset of trees.

I didn’t get to heavily into how trees are implemented in any particular language.  Mostly this is because the implementations of tree data structures varies widely from one language to the next and sometimes there are even multiple implementations of trees in a single language.

Many decisions go into a tree implementation and the tools that work with them.   Because of this there are many different types of trees, each with its own strengths and weaknesses and tools for manipulating them.  Future WITH articles will will go deeper into trees, their implementations and characteristics.

GOTO Table of Contents

What In The Hell Are Hash Tables

GOTO All What In The Hell Articles

A hash table is a combination of two parts; an array, to store the data, and a hash function that computes the index of the array from a key. The best analogy to a hash table is a dictionary. When you use a dictionary you lookup a word (the key) and find the definition (the value).

Hash tables are an improvement over arrays and linked lists for large data sets. With a hash table you don’t have to search through every element in the collection (O(n) access time). Instead you pay a one time cost (the hash function) to retrieve data (O(1) access time).

 

The Hash Function

The purpose of the hash function is to compute an index from a key. That index is then used to store and retrieve data from the array. Many higher-level languages include some type of hash table and hash function.

As you can see from the diagram above the hash(“string”) function transforms a string into an integer which is then used as the index into the array. This index is used to store or retrieve data in the array.

The amount of time it takes to set or retrieve a value from a hash table is the time it takes to run the hash function plus the time it takes to access an element of the array. For small data sets this could take longer than simply iterating over each element of the array. However, since it is a fixed cost, for larger data sets using a hash table could be much faster.

Assuming we have a function set that takes a key to be hashed as its first argument and a value to be stored as its second argument, we can create the following hash table.

As you can see from the above diagram we store both the key and the value in the table. The reason for this is collisions, which will be explained below.

 

Collisions

A collision occurs when two different keys hash to the same value. Collisions will occur in even the best designed hash tables, however there are a few reasons why they can occur more frequently. The first reason is a poor hash function that does not distribute the indices evenly throughout the table, this will be explained in more detail below. The second reason is that the array is too small. If you have a 10 element array and you are trying to store 100 key/values in the hash table, there will obviously be a lot of collisions.

In the diagram above “Puff” hashed to the same value as “Peter” causing a collision. Both “Peter” and “Puff” are now stored in the same index of the array. A certain amount of collisions are unavoidable in a hash table, so we’ll need a way to deal with them.

The simplest way to deal with collisions is to make each element of the array a linked list. Then, when a collision occurs, we simply search through the linked list to find the data we want. This way the hash function gives us a ‘starting point’ to begin our search.

With a good hash function and an appropriately sized array collisions are a rare occurrence. However it is possible, though very unlikely, that all the key/values will be stored in one index and we will end up doing a search on a linked list.

 

What is a good hash function?

The sole job of the hash function is to return a unique array index for each key. This is not always possible and collisions can be expected to occur when the number of elements in the hash table approaches the square root of the size of the hash table’s array. A hash function that does not provide a uniform distribution of keys will cause a lot of collisions, possibly breaking our hash table’s performance down from O(1) to O(n). The only workaround to an ineffective hash function is to make the hash table’s array much larger than the number of elements that it holds, which obviously wastes space.

 

Hash Table Implementations

Python

Python’s equivalent of a hash table is called a dictionary or dict.

# Creating a dictionary
d = {'a':5, 'b':10}

# Adding key/value pairs
d['c'] = 15

# Get value by key
d['c']
#>>> 15

Chicken Scheme

;; Creating a hash-table from an alist
(define d (alist->hash-table '((a . 5) (b . 10))))

;; Adding key/value pairs
(hash-table-set d 'c 15)

;; Get value by key
(hash-table-ref d 'c)
;>>> 15

 

Performance

Indexing: O(1)
Insertion/Removal: O(1)
Search: O(1)

The performance listed above is what is normally found in specific hash table implementations. There are actually a number of different implementations, each with their own advantages, disadvantages and performance guarantees. Specific implementations will have to be presented in future WITH articles.

 

Links

GOTO Table of Contents