What In The Hell Are Strings

A string is a data type that consists of a sequence of characters. The simplest way to think about strings is that they are text. Strings are usually formed by “putting quotes around some text” or by calling a string function.

Strings come in two varieties. Terminated strings use a special character to signal the end of the string. This character cannot appear inside the string, because it would end the string early and cut off the rest. Strings can also be implemented with a length field. These strings do not need a terminating character, because the length is stored alongside the string. In many languages you cannot tell which kind of string you are using, and it is usually irrelevant. However, some languages, like C, require you to place the termination character yourself when you build a string by hand.
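To make the two varieties concrete, here is a minimal sketch in Python that stores the same text both ways as raw bytes. The function names are made up for this illustration, and the one-byte length field is an assumption chosen to keep the example short. Real implementations usually use a wider length field.

```python
def to_terminated(text: str) -> bytes:
    """Store the text as bytes followed by a zero terminator byte."""
    data = text.encode("ascii")
    if b"\x00" in data:
        raise ValueError("the terminator byte cannot appear inside the string")
    return data + b"\x00"


def from_terminated(buffer: bytes) -> str:
    """Read bytes up to, but not including, the first zero byte."""
    end = buffer.index(b"\x00")
    return buffer[:end].decode("ascii")


def to_length_prefixed(text: str) -> bytes:
    """Store a one-byte length field followed by the text bytes."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("a one-byte length field can only describe 255 bytes")
    return bytes([len(data)]) + data


def from_length_prefixed(buffer: bytes) -> str:
    """Read the length field, then exactly that many bytes."""
    length = buffer[0]
    return buffer[1:1 + length].decode("ascii")
```

Notice that the length-prefixed version can hold a zero byte in the middle of the text, while the terminated version has to reject it.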

Strings can use a variety of encodings for the characters they contain. The two most common are ASCII, which can represent 128 different characters, and Unicode, which supports over 100,000 different characters and allows strings to hold non-English characters. Unicode text is usually stored using an encoding such as UTF-8 or UTF-16. Both ASCII and Unicode are in use today, but most applications are moving toward Unicode as the need for international software grows.
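As a rough illustration of the difference, the Python sketch below encodes a short piece of text with UTF-8 (one common way of storing Unicode text) and then tries the same thing with ASCII. The sample text is just an example picked for this sketch.

```python
# Two of these characters (ï and é) are outside the 128 ASCII characters.
text = "na\u00efve caf\u00e9"

# Encoding turns the characters into bytes.
utf8_bytes = text.encode("utf-8")

char_count = len(text)        # counts characters
byte_count = len(utf8_bytes)  # counts bytes, which can be larger

# ASCII has no way to represent the accented characters.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(char_count, byte_count, fits_in_ascii)
```

The character count and the byte count differ because UTF-8 uses more than one byte for characters outside the ASCII range.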

One common problem with strings is representing characters that the language would otherwise interpret to mean something else. For instance, the string “Have you read the book “1984” by George Orwell?” would be interpreted as three expressions: (1) the string “Have you read the book ”, (2) the number 1984, and (3) the string “ by George Orwell?”. This is obviously not what we want. In this case we would use an escape character to stop the quotes around “1984” from terminating the string. Our new string would look something like this (it may vary from language to language): “Have you read the book \”1984\” by George Orwell?”. In this string the ‘\’ character tells the language not to end the string when it sees the quote that follows.

 

Terminology for Strings

Concatenation:
Joining two strings together to form one string is called concatenation.

Substring:
A substring is a part of a larger string. For example, in the string “Hello World” one possible substring would be “Hello”, and another could be “llo Wo”.

Literal:
A literal string is a string that appears directly in the source code. Strings formed by “putting quotes around them” are usually literal strings in a language. To build a string that is not a literal you usually have to use a string function. In languages with mutable strings it is usually not possible to mutate a literal string, and even where it is possible it is almost never a good idea.  More information on string literals can be found here.
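The terms above can be sketched in a few lines of Python. The variable names are only for illustration.

```python
# Literals: these strings appear directly in the source code.
greeting = "Hello"
target = "World"

# Concatenation: join strings together to form one string.
message = greeting + " " + target

# Substrings: a slice takes [start:end], and the end index is excluded.
first = message[0:5]
middle = message[2:8]

# Check whether one string is a substring of another.
has_middle = "llo Wo" in message

print(message, first, middle, has_middle)
```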

 

Code

C
Strings in C are represented as a null-terminated array of characters. To create a string you create an array, fill it with characters, and place a null character directly after the last character in the string. If you use a string literal to create a string, you do not have to supply the size or the null character; the compiler will do that for you.

/* Create a string */
char s[6] = {'H', 'e', 'l', 'l', 'o', '\0'};
char sl[] = "Hello World";

/* Escape Character */
char e[50] = "Have you read the book \"1984\" by George Orwell?";

Python
Strings in Python are enclosed in quotes. You can use one of three different quoting styles to create a string. Single quotes are useful when you have double quote characters in the string. Triple quotes allow you to place newlines in the string.

# Create a string
a = "Hello World"
b = 'Have you read "1984" by George Orwell?'
c = """Materials:
Pen
Paper"""
# Escape Character
d = "Have you read the book \"1984\" by George Orwell?"

Scheme
Strings in Scheme are enclosed in quotes or created with the ‘string’ function. Newlines may be embedded in any string.

;; Create a string
(define a "Hello World")
(define b (string #\H #\e #\l #\l #\o))

;; Escape Character:
(define c "Have you read the book \"1984\" by George Orwell?")

 

Performance

The performance characteristics of a string depend on the data structure used to represent it. Arrays are used most often, though linked lists are somewhat common as well. Strings may also be either mutable or immutable, which can affect their algorithmic performance.

Because of these factors, an accurate general overview of string performance cannot be given.  However, in the vast majority of cases strings share the same characteristics as arrays.
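As one example of how mutability can matter, Python strings are immutable, so every concatenation has to build a new string. The sketch below builds the same sentence in two ways. The word list is made up for the example, and the actual cost of the loop depends on the Python implementation, so treat the comments as a general tendency and not a measurement.

```python
words = ["strings", "are", "immutable", "in", "Python"]

# Repeated concatenation: each step may copy everything built so far.
pieces_joined_slowly = ""
for word in words:
    pieces_joined_slowly = pieces_joined_slowly + word + " "
pieces_joined_slowly = pieces_joined_slowly.rstrip()

# str.join builds the result in one go.
pieces_joined_once = " ".join(words)

# Trying to change a character in place fails, because str is immutable.
try:
    pieces_joined_once[0] = "S"
    mutated = True
except TypeError:
    mutated = False

print(pieces_joined_once, mutated)
```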

For terminated strings implemented as an array, indexing can still be O(1). However, supplying an index greater than the length will access memory outside the string and, hopefully, cause an error. You may need to determine the length first to prevent this, which is an O(n) operation.
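The idea can be sketched in Python by treating a bytes object as the raw memory that holds a terminated string. The helper names below are hypothetical, and unlike C, Python checks the bounds of the buffer itself, so this only imitates the situation described above.

```python
def terminated_length(buffer: bytes) -> int:
    """Count the bytes before the first zero terminator: an O(n) scan."""
    length = 0
    while buffer[length] != 0:
        length += 1
    return length


def char_at(buffer: bytes, index: int) -> str:
    """Index a terminated string, checking against its length first."""
    if not 0 <= index < terminated_length(buffer):
        raise IndexError("index is outside the string")
    return chr(buffer[index])


# The bytes after the terminator stand in for unrelated memory.
memory = b"Hello\x00garbage"
print(terminated_length(memory))  # scans until it finds the terminator
print(char_at(memory, 1))         # inside the string, so this is allowed
```

Without the length check, an index such as 7 would quietly read one of the bytes that follow the terminator and are not part of the string.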

 

What the heck is: a String
A great article on strings, much more in-depth than this one. It is also where I got the inspiration for this series.

 
