What In The Hell Are Trees

Posted on 2011/02/21 by Nick Zarczynski

A tree is a hierarchical data structure that contains a set of nodes. The node is the basic building block of a tree, and to understand trees you have to understand nodes.

A node is a data structure that has a parent, can contain a value and can have 0 or more child nodes. A node with no child nodes is called a leaf node, or more commonly just a leaf.

A node with no parent is referred to as a root node.

Each node can have a child node of its own.

A node can have more than two children too.
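The terms above can be sketched in a few lines of Python. This is only an illustration: the class name TreeNode and its add_child, is_root and is_leaf helpers are invented for this example and are not part of any library.

```python
class TreeNode:
    """A node with an optional value, a parent and zero or more children."""

    def __init__(self, value=None):
        self.value = value
        self.parent = None      # stays None for a root node
        self.children = []      # stays empty for a leaf node

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def is_root(self):
        return self.parent is None

    def is_leaf(self):
        return not self.children


root = TreeNode('root')
a = root.add_child(TreeNode('a'))
b = root.add_child(TreeNode('b'))
c = a.add_child(TreeNode('c'))    # a child node with a child of its own

print(root.is_root(), root.is_leaf())   # True False
print(c.is_root(), c.is_leaf())         # False True
```

Storing the parent on each node is a design choice. Many trees only keep the list of children and find parents by walking down from the root.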

By now you should grasp the basic structure of a tree, but what are they good for and where are they used? Well, the good news is that trees are all around you. Your operating system’s directory structure is a tree, and HTML and XML documents are both trees. Your filing cabinet could even be considered a tree. A linked list is also a form of tree, where each node has exactly one child node. Let’s take a look at some more trees.
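One of those trees, the directory structure, can be modelled with nothing more than nested dictionaries. The layout below is made up for the example, and the paths helper is a hypothetical name; treat it as a rough sketch rather than how an operating system actually stores directories.

```python
# Hypothetical directory layout: each dict is a directory, each None is a file.
filesystem = {
    'home': {
        'nick': {
            'notes.txt': None,
            'code': {'tree.py': None},
        },
    },
    'etc': {'hosts': None},
}


def paths(tree, prefix=''):
    """Yield the full path of every file (leaf) in the nested dicts."""
    for name, child in tree.items():
        path = prefix + '/' + name
        if child is None:
            yield path
        else:
            yield from paths(child, path)


print(sorted(paths(filesystem)))
# ['/etc/hosts', '/home/nick/code/tree.py', '/home/nick/notes.txt']
```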

You belong to a tree too!

and of course…

Implementation of Trees

Programming languages have varying levels of support for tree-like data structures. The abstract idea of a tree and the implementation of one are often very different things. Pretty much any nestable data structure can suffice for an ad-hoc tree implementation. For example, here’s one of the simplest trees you can implement in Python.

class Node:
  def __init__(self, value=None, *args):
    self.value = value
    self.children = args

# A tree can then be defined as
tree = Node('root', Node(1), Node(2))
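Once a tree is built, the most common thing to do with it is visit every node. Here is one way that might look, reusing the same Node class; the walk function is a name chosen for this sketch, and it visits each parent before its children (a depth-first, pre-order traversal).

```python
class Node:
    def __init__(self, value=None, *args):
        self.value = value
        self.children = args


def walk(node, depth=0):
    """Yield (depth, value) for a node and all of its descendants, parents first."""
    yield depth, node.value
    for child in node.children:
        yield from walk(child, depth + 1)


tree = Node('root', Node(1, Node(3)), Node(2))
for depth, value in walk(tree):
    print('  ' * depth + str(value))
# root
#   1
#     3
#   2
```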

Scheme makes it quite a bit easier: both code and data are represented using trees (actually lists).

(define tree (cons 1 2))
tree
;>>> (1 . 2)

(define nested-tree (cons 1 (cons 2 3)))
nested-tree
;>>> (1 2 . 3)

If you’re wondering why the nested-tree looks like a list, remember that lists are a subset of trees.

I didn’t get too heavily into how trees are implemented in any particular language. Mostly this is because the implementations of tree data structures vary widely from one language to the next, and sometimes there are even multiple implementations of trees in a single language.

Many decisions go into a tree implementation and the tools that work with it. Because of this there are many different types of trees, each with its own strengths and weaknesses and tools for manipulating them. Future WITH articles will go deeper into trees, their implementations and characteristics.


What In The Hell Are Hash Tables

Posted on 2011/02/20 by Nick Zarczynski


A hash table is a combination of two parts: an array, to store the data, and a hash function that computes the index of the array from a key. The best analogy to a hash table is a dictionary. When you use a dictionary you look up a word (the key) and find the definition (the value).

Hash tables are an improvement over arrays and linked lists for large data sets. With a hash table you don’t have to search through every element in the collection (O(n) access time). Instead you pay a one-time cost (the hash function) to retrieve data (O(1) access time on average).
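The difference between the two kinds of lookup can be seen side by side in Python. The linear_lookup function and the sample data are invented for this sketch; dict is Python's built-in hash table.

```python
pairs = [('apple', 1), ('banana', 2), ('cherry', 3)]


def linear_lookup(pairs, key):
    """Check each element in turn: O(n) in the number of pairs."""
    for k, v in pairs:
        if k == key:
            return v
    raise KeyError(key)


table = dict(pairs)   # Python's built-in hash table

print(linear_lookup(pairs, 'cherry'))  # 3, found after checking three pairs
print(table['cherry'])                 # 3, found by hashing the key
```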

The Hash Function

The purpose of the hash function is to compute an index from a key. That index is then used to store and retrieve data from the array. Many higher-level languages include some type of hash table and hash function.

As you can see from the diagram above the hash(“string”) function transforms a string into an integer which is then used as the index into the array. This index is used to store or retrieve data in the array.

The amount of time it takes to set or retrieve a value from a hash table is the time it takes to run the hash function plus the time it takes to access an element of the array. For small data sets this could take longer than simply iterating over each element of the array. However, since it is a fixed cost, for larger data sets using a hash table could be much faster.
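To make the idea concrete, here is a deliberately simple hash function for strings. It is a toy written for this article (toy_hash is not a real library function) and nothing like the functions production hash tables use, but it shows the key-to-index step.

```python
def toy_hash(key, size):
    """Map a string to an index in range(size) by summing its character codes."""
    return sum(ord(ch) for ch in key) % size


SIZE = 8   # length of the underlying array, chosen arbitrarily
for key in ('Peter', 'Paul', 'Mary'):
    print(key, '->', toy_hash(key, SIZE))
```

Because it only adds up character codes, this function gives the same index to any two strings made of the same letters, such as 'ab' and 'ba'.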

Assuming we have a function set that takes a key to be hashed as its first argument and a value to be stored as its second argument, we can create the following hash table.

As you can see from the above diagram we store both the key and the value in the table. The reason for this is collisions, which will be explained below.

Collisions

A collision occurs when two different keys hash to the same value. Collisions will occur in even the best designed hash tables, but there are a few reasons why they can occur more frequently. The first reason is a poor hash function that does not distribute the indices evenly throughout the table; this will be explained in more detail below. The second reason is that the array is too small. If you have a 10 element array and you are trying to store 100 key/values in the hash table, there will obviously be a lot of collisions.

In the diagram above “Puff” hashed to the same value as “Peter”, causing a collision. Both “Peter” and “Puff” are now stored in the same index of the array. A certain number of collisions is unavoidable in a hash table, so we’ll need a way to deal with them.

The simplest way to deal with collisions is to make each element of the array a linked list. Then, when a collision occurs, we simply search through the linked list to find the data we want. This way the hash function gives us a ‘starting point’ to begin our search.
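A minimal sketch of this approach might look like the following. A few assumptions: Python lists stand in for the linked lists, the built-in hash() stands in for the hash function, and the class name ChainedHashTable and the stored values are made up for the example.

```python
class ChainedHashTable:
    """Hash table whose slots each hold a list of (key, value) pairs."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Python's built-in hash() stands in for the hash function.
        return hash(key) % len(self.buckets)

    def set(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: replace its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # new key, or a collision: extend the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:              # the stored key tells colliding entries apart
                return v
        raise KeyError(key)


table = ChainedHashTable(size=1)      # one slot forces every key to collide
table.set('Peter', 'Pan')
table.set('Puff', 'Dragon')
print(table.get('Peter'), table.get('Puff'))   # Pan Dragon
```

This is also why the key is stored next to the value: without it, get would have no way to pick the right entry out of a chain.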

With a good hash function and an appropriately sized array, collisions are a rare occurrence. However, it is possible, though very unlikely, that all the key/values will be stored in one index and we will end up doing a search on a linked list.

What is a good hash function?

The sole job of the hash function is to return a unique array index for each key. This is not always possible and collisions can be expected to occur when the number of elements in the hash table approaches the square root of the size of the hash table’s array. A hash function that does not provide a uniform distribution of keys will cause a lot of collisions, possibly breaking our hash table’s performance down from O(1) to O(n). The only workaround to an ineffective hash function is to make the hash table’s array much larger than the number of elements that it holds, which obviously wastes space.
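The effect of a poor distribution is easy to measure. The two hash functions below are illustrative only (their names and the sample keys are invented here): one looks at just the first character, the other mixes in every character and its position.

```python
from collections import Counter


def first_char_hash(key, size):
    """A poor hash function: only looks at the first character."""
    return ord(key[0]) % size


def polynomial_hash(key, size):
    """A better one: every character and its position affect the result."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h % size


def longest_chain(hash_fn, keys, size):
    """Largest number of keys that land on the same index."""
    counts = Counter(hash_fn(key, size) for key in keys)
    return max(counts.values())


keys = ['key%d' % i for i in range(100)]
print(longest_chain(first_char_hash, keys, 128))   # 100: every key starts with 'k'
print(longest_chain(polynomial_hash, keys, 128))
```

With the first function every lookup degrades into a search through all 100 entries, which is the O(n) behaviour described above.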

Hash Table Implementations

Python

Python’s equivalent of a hash table is called a dictionary or dict.

# Creating a dictionary
d = {'a':5, 'b':10}

# Adding key/value pairs
d['c'] = 15

# Get value by key
d['c']
#>>> 15

Chicken Scheme

;; Creating a hash-table from an alist
(define d (alist->hash-table '((a . 5) (b . 10))))

;; Adding key/value pairs
(hash-table-set! d 'c 15)

;; Get value by key
(hash-table-ref d 'c)
;>>> 15

Performance

Indexing: O(1)
Insertion/Removal: O(1)
Search: O(1)

The performance listed above is what is normally found in specific hash table implementations. There are actually a number of different implementations, each with their own advantages, disadvantages and performance guarantees. Specific implementations will have to be presented in future WITH articles.


What In The Hell Are Strings

Posted on 2011/02/20 by Nick Zarczynski


A string is a data type that consists of a sequence of characters. The simplest way to think about strings is that they are text. Strings are usually formed by “putting quotes around some text” or by using a string function.

Strings come in two varieties. Terminated strings use a special character to signal the end of the string. This character is not allowed to appear in the string, as it would cause the string to end, leaving out the rest. Strings can also be implemented using a length field. These strings do not need a terminating character as the length of the string is stored with the string. In many languages you will not be able to tell which type of string you are using and it is usually irrelevant. However some languages, like C, require you to place the termination character manually.
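The two layouts can be imitated with Python bytes objects. This is a sketch of the idea, not how Python stores its own strings; both function names are invented, and the one-byte length field is an assumption made to keep the example short.

```python
def read_terminated(buf):
    """Read bytes up to (not including) the first NUL terminator."""
    end = buf.index(b'\x00')      # must scan the buffer to find the end
    return buf[:end]


def read_length_prefixed(buf):
    """Read a string whose first byte stores its length (so at most 255 bytes)."""
    length = buf[0]
    return buf[1:1 + length]


print(read_terminated(b'Hello\x00ignored'))        # b'Hello'
print(read_length_prefixed(b'\x05Helloignored'))   # b'Hello'
```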

Strings can use a variety of encodings for the characters contained within. The two most common are ASCII, which can represent 128 different characters, and Unicode, which has support for over 100,000 different characters and allows strings to hold non-English characters. Both forms of encoding are popular today, however most applications are moving toward Unicode characters as the need for international software grows.
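A short Python example shows the practical difference. It uses UTF-8, one common way of encoding Unicode text as bytes, and a sample phrase chosen for this illustration.

```python
text = 'na\u00efve caf\u00e9'   # "naïve café", written with escapes for clarity

utf8 = text.encode('utf-8')
print(len(text), len(utf8))    # 10 12: 'ï' and 'é' take two bytes each in UTF-8

try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    # ASCII has no code for the accented letters, so encoding fails.
    print('not representable in ASCII:', err.object[err.start])
```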

One common problem with strings is representing characters that would otherwise be interpreted by the language to mean something else. For instance, the string “Have you read the book “1984” by George Orwell?” would be interpreted as 3 expressions: (1) the string “Have you read the book ”, (2) the number 1984, and (3) the string “ by George Orwell?”. This is obviously not what we want. In this case we would use an escape character to stop the quotes around “1984” from terminating the string. Our new string would look something like this (this may vary from language to language): “Have you read the book \”1984\” by George Orwell?”. In this string the ‘\’ character tells the language not to end the string when it sees the following “.

Terminology for Strings

Concatenation:
Joining two strings together to form one string is called concatenation.

Substring:
A substring is a part of a larger string. For example, in the string “Hello World” one possible substring would be “Hello”, and another could be “llo Wo”.

Literal:
A literal string is a string that appears directly in the source code. Strings formed by “putting quotes around them” are usually literal strings in a language. To build a regular string you must usually use a string function. In languages with mutable strings it is usually not possible to mutate a literal string, and even where it is possible it is almost never a good idea.
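In Python the first two terms look like this, using the same “Hello World” example. The variable name greeting is just for illustration.

```python
greeting = 'Hello' + ' ' + 'World'    # concatenation of three literal strings
print(greeting)                       # Hello World

print(greeting[0:5])                  # Hello   (substring by slicing)
print(greeting[2:8])                  # llo Wo
print('llo Wo' in greeting)           # True    (substring test)
```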

Code

C
Strings in C are represented as a null-terminated array of characters. To create a string you create an array, fill it with characters and place a null character directly after the last character in the string. If you are using string literals to create a string you do not have to supply the size or the null character; the compiler will do that for you.

/* Create a string */
char s[6] = {'H', 'e', 'l', 'l', 'o', '\0'};
char sl[] = "Hello World";

/* Escape Character */
char e[50] = "Have you read the book \"1984\" by George Orwell?";

Python
Strings in Python are enclosed in quotes. You can use one of three different quoting styles to create a string. Single quotes are useful when you have double quote characters in the string. Triple quotes allow you to place newlines in the string.

# Create a string
a = "Hello World"
b = 'Have you read "1984" by George Orwell?'
c = """Materials:
Pen
Paper"""
# Escape Character
d = "Have you read the book \"1984\" by George Orwell?"

Scheme
Strings in Scheme are enclosed in quotes or created with the ‘string’ function. Newlines may be embedded in any string.

;; Create a string
(define a "Hello World")
(define b (string #\H #\e #\l #\l #\o))

;; Escape Character:
(define c "Have you read the book \"1984\" by George Orwell?")

Performance

The performance characteristics of a string will depend on the data type that is used to represent it. Most often arrays are used; however, linked lists are somewhat common as well. Strings may also be either mutable or immutable, which can have an impact on their algorithmic performance.

Because of these factors an accurate overview of the performance of strings cannot be given. However in the vast majority of cases strings will share the same characteristics as arrays.

In the case of terminated strings implemented as an array, the indexing time can still be O(1); however, supplying an index greater than the length will access memory outside the string and, hopefully, cause an error. You may need to determine the length first to prevent this, which would be an O(n) operation.
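Python strings are not terminated, but the cost can be imitated with a bytes buffer. In this sketch terminated_length does the same job as C's strlen, and char_at is a hypothetical helper that checks the index before reading.

```python
def terminated_length(buf):
    """Count bytes before the NUL terminator, like C's strlen: O(n)."""
    n = 0
    while buf[n] != 0:
        n += 1
    return n


def char_at(buf, i):
    """Index into a terminated string, refusing to read past its end."""
    if not 0 <= i < terminated_length(buf):
        raise IndexError(i)
    return chr(buf[i])


buf = b'Hello\x00leftover bytes'
print(terminated_length(buf))   # 5
print(char_at(buf, 1))          # e
```

The check makes every index operation O(n). Storing the length alongside the string avoids that cost, which is one reason length-field strings are popular.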

Links

What the heck is: a String
A great article on strings, much more in-depth than this one. Incidentally, it is where I got the inspiration for this series.


Pointless Programming

Just another programming blog about high level languages like Scheme and Python.

Tag Archives: datatype
