7 Think Euphoria

7.1 Sequence: strings



A sequence may be used to represent text. Sequences containing text are called strings. The same generic techniques apply to sequences and strings.

Euphoria provides some shortcuts when working with strings. You may use ( " ) as a delimiter instead of ( { } ). An empty string is "" instead of {}. Use gets() and puts() for input and output of strings.

Routines commonly associated with text are found in std/text.e (add include std/text.e to your program). They will work with any sequence, but are mostly used with strings.
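
The claim that strings are ordinary sequences can be checked directly. The short sketch below is an illustration rather than part of the original text; it assumes Euphoria 4 with the standard library installed, and it uses upper() from std/text.e.

    include std/text.e

    sequence word = "abc"
    ? equal( word, {97, 98, 99} )   -- 1: a string is a sequence of character codes
    ? length( word )                -- 3: a generic sequence routine works on a string
    puts(1, upper( word ) & '\n' )  -- ABC: a text routine from std/text.e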



7.2 A compound data type



Data-types that comprise smaller pieces are called compound data-types. Depending on what we are doing, we may want to treat a compound data type as a single thing, or we may want to access its parts. This ambiguity is useful.

The sequence is the ultimate compound data-type.

A line of text is commonly called a string since it is like a string of characters. While conventional languages have a special data type for strings, Euphoria can use the sequence to represent strings. This means that techniques that apply to strings can also be used with any sequence; what you learn about sequences in turn applies to strings.

The bracket operator selects a single character from a string.

    sequence fruit = "banana"
    atom  letter = fruit[1]
    puts(1, letter)
        -- b
The expression fruit[1] selects character number 1 from fruit. The variable letter refers to the result. When we output letter, we get: b.

The first letter of "banana" is 'b'.

[Figure: png/07_banana.png]

The expression in brackets is called an index. An index specifies a member of an ordered set, in this case the set of characters in the string. The index indicates which one you want, hence the name.

The index must be an integer value. If you use a fractional value, then the value is truncated and that value is used for the index:

    sequence fruit =  "banana"
    puts(1, fruit[1.5] )
        -- b
The value 1.5 truncates to the integer 1. Therefore the first letter is selected. If you try 0.5 as an index value then you get an error message--the truncated integer value of 0.5 is 0, which is an illegal index value.



7.3 Length



The length() function returns the number of characters in a string:

    sequence fruit = "banana"
    ? length(fruit)
        -- 6
To get the last letter of a string, you could use the length() function like this:

    sequence fruit = "banana"
    integer Length = length(fruit)
    atom last = fruit[Length] 
    puts(1, last )
        --a
Written more compactly:

    sequence fruit = "banana"
    puts(1, fruit[ length(fruit) ] )
        -- a
You may also use ( $ ) to index the last item in a sequence.

    sequence fruit = "banana"
    puts(1, fruit[$] )
        -- a
Index values must not be 0, a negative value, or a value greater than the length of the sequence. Such values all produce an error message--for strings and sequences.
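
One way to avoid this error is to test an index before using it. The helper below, charAt(), is a hypothetical name chosen for this sketch and is not a library routine; it returns 0 when the index is out of range.

    function charAt( sequence str, integer index )
        -- valid index values run from 1 to length( str )
        if index >= 1 and index <= length( str ) then
            return str[index]
        end if
        return 0
    end function

    ? charAt( "banana", 6 )   -- 97, the code for 'a'
    ? charAt( "banana", 0 )   -- 0
    ? charAt( "banana", 7 )   -- 0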



7.4 Traversal and the for loop



Many computations involve processing a string one character at a time. Often they start at the beginning, select each character in turn, do something to it, and continue until the end. This pattern of processing is called a traversal. One way to encode a traversal is with a while loop:

    sequence fruit = "banana"
    atom letter
    integer index = 1
    while index <= length(fruit) do
        letter = fruit[index]
        puts(1, letter )
        puts(1, '\n' )
        index = index + 1
    end while
This loop traverses the string and displays each letter on a line by itself. The loop condition is index <= length(fruit), so when index exceeds the length of the string, the condition is false, and the body of the loop is not executed. The last character accessed is the one with the index = length(fruit), which is the last character in the string.

Using an index to traverse a set of values is so common that Euphoria provides an alternative, simpler syntax--the for loop:

    for index = 1 to length( fruit ) do
        puts(1, fruit[index] )
    end for 
Each time through the loop, the variable index takes the next integer value, and fruit[index] selects the corresponding character. The loop continues until no characters are left.
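
A for loop can also count downward. As a small sketch (the variable names are only for illustration), adding by -1 traverses the string from the last character to the first:

    sequence fruit = "banana"
    for index = length( fruit ) to 1 by -1 do
        puts(1, fruit[index] )
    end for
    puts(1, '\n' )
        -- ananab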

The following example shows how to use concatenation and a for loop to generate an abecedarian series (that is in alphabetical order). For example, in Robert McCloskey's book Make Way for Ducklings, the names of the ducklings are Jack, Kack, Lack, Mack, Nack, Ouack, Pack, and Quack. This loop outputs these names in order:

    sequence prefixes = "JKLMNOPQ"
    sequence suffix = "ack"
    for letter = 1 to length( prefixes ) do
        puts(1, prefixes[letter] & suffix )
        puts(1, '\n'  )
    end for
The output of this program is:

    Jack
    Kack
    Lack
    Mack
    Nack
    Oack
    Pack
    Qack
Of course, that's not quite right because "Ouack" and "Quack" are misspelled.
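
One possible repair, offered only as a sketch, is to test each prefix and add a 'u' after 'O' and 'Q'. It uses nothing beyond the operators already shown in this chapter.

    sequence prefixes = "JKLMNOPQ"
    sequence suffix = "ack"
    for letter = 1 to length( prefixes ) do
        if prefixes[letter] = 'O' or prefixes[letter] = 'Q' then
            -- 'O' and 'Q' take an extra 'u'
            puts(1, prefixes[letter] & "u" & suffix )
        else
            puts(1, prefixes[letter] & suffix )
        end if
        puts(1, '\n' )
    end for

With this change the sixth and eighth lines of output read Ouack and Quack.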



7.5 String slices



A segment of a string is called a slice. Selecting a slice is similar to selecting a character:

    sequence s
    s = "Peter, Paul, and Mary"
    puts(1, s[1 .. 5] )
    puts(1, '\n' )
    puts(1, s[8 .. 11] )
    puts(1, '\n' )
    puts(1, s[18 .. 21] )
        -- Peter
        -- Paul
        -- Mary
The operator [n .. m] returns the part of the string from the "n-th" character to the "m-th" character inclusively. You always need a start and end index.

You can use the ( $ ) when taking a slice out of a sequence.

    sequence fruit = "banana"
    puts(1, fruit[ 3 .. $ ] )
        -- nana
In this example:

    sequence fruit = "banana"
    puts(1, fruit[3..2] )
        --
you have asked for a sequence of length zero, so "nothing" is output.
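
A few more slices, as a quick sketch of how ( $ ) and the index values combine. Inside the brackets, ( $ ) stands for the length of the sequence, so it can be used in arithmetic; a slice from n to n-1 is the empty sequence.

    sequence fruit = "banana"
    puts(1, fruit[1 .. 3] & '\n' )     -- ban
    puts(1, fruit[$-2 .. $] & '\n' )   -- ana
    ? length( fruit[3 .. 2] )          -- 0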



7.6 String comparison



To compare strings (sequences) you must use either the equal() or compare() functions. Thus, to see if two strings are equal:

    if equal( word, "banana" ) then
        puts(1,  "Yes, we have no bananas!" )
    end if
The compare() function is useful for putting words into alphabetical order:

    if compare( word,  "banana" ) < 0 then
        puts(1, "Your word, " & word & ", comes before banana." )
    elsif compare( word, "banana" ) > 0 then
        puts(1, "Your word, " & word & ", comes after banana." )
    else
        puts(1, "Yes, we have no bananas!" )
    end if
You should be aware, though, that computer alphabets are not ordered the way you would expect.

Footnote: The ASCII chart gives the standard order used by Euphoria and by most other programming languages.

All the uppercase letters come before all the lowercase letters. As a result:

Your word, Zebra, comes before banana.

A common way to address this problem is to convert strings to a standard format, such as all lowercase, before performing the comparison. Use the lower() function (from std/text.e) to do this. A more difficult problem is making the program realize that zebras are not fruit.
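
As a minimal sketch of that idea, assuming Euphoria 4 and the lower() routine from std/text.e, the same word can be compared before and after conversion to lowercase:

    include std/text.e

    sequence word = "Zebra"
    ? compare( word, "banana" )            -- -1: 'Z' (90) sorts before 'b' (98)
    ? compare( lower( word ), "banana" )   -- 1: 'z' (122) sorts after 'b' (98)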



7.7 About sequence (and string) comparisons



Comparisons are made on an element to element basis. That is why the ( = ) may not do the comparison you expect.

    sequence w1 = "zebras"
    sequence w2 = "banana"

    ? w1 = w2
        -- { 0, 0, 0, 0, 0, 0 }
Yes, a comparison was made, but it is not a simple false or true result.

If the lengths of the two sequences are not the same, you get an error message:

    sequence w1 = "zebra"   -- no 's'
    sequence w2 = "bananas" -- extra 's'
    ? w1 = w2
        -- error
        -- sequence lengths are not the same (5 != 7)
In general you should use only atom values in simple comparisons:

    sequence w1 = "zebras"
    sequence w2 = "banana"

    if w1 = w2 then
        puts(1, "they are the same" )
    end if
        -- error
        -- true/false condition must be an ATOM
This error message is a reminder that the ( = ) does not produce a simple false or true result when used to compare sequences. (This is a common mistake for beginning Euphoria users.)
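
For contrast, here is a short sketch of the same comparisons written with equal() and compare(). Both return a single atom, so they are safe in an if statement, and they accept sequences of different lengths.

    sequence w1 = "zebra"
    sequence w2 = "bananas"
    ? equal( w1, w2 )      -- 0: not the same
    ? compare( w1, w2 )    -- 1: w1 sorts after w2
    if not equal( w1, w2 ) then
        puts(1, "they are different\n" )
    end if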



7.8 Strings are mutable



Mutable means that you can change the value of any part of your string variable.

Euphoria lets you use the ( [ ] ) operator on the left side of an assignment, with the intention of changing a character in a string. For example:

    sequence greeting = "Hello, world!"
    greeting[1] = 'J'
    puts(1, greeting )
        -- Jello, world!
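
A slice can be assigned to as well. In the sketch below (the variable saved is only for illustration), five characters are replaced by five others; when a sequence is assigned to a slice, the lengths must match. The example also shows that a value copied earlier is not affected by the later change.

    sequence greeting = "Hello, world!"
    sequence saved = greeting       -- saved keeps the original value
    greeting[1 .. 5] = "Jelly"      -- replace five characters with five characters
    puts(1, greeting & '\n' )       -- Jelly, world!
    puts(1, saved & '\n' )          -- Hello, world!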



7.9 A find function



What does the following function do?

    function find(  integer ch, sequence str  )
        integer index = 1
        while index <= length( str ) do
            if str[index] = ch then
                return index
            end if
            index = index + 1
        end while
        return 0
    end function

    ? find( 'n', "banana" )
        -- 3
In a sense, find() is the opposite of the ( [ ] ) operator. Instead of taking an index and extracting the corresponding character, it takes a character and finds the index where that character appears. If the character is not found, the function returns 0.

This is the first example we have seen of a return statement inside a loop. If str[index] = ch, the function returns immediately, breaking out of the loop prematurely.

If the character doesn't appear in the string, then the program exits the loop normally and returns 0.


This pattern of computation is sometimes called a "eureka" traversal because as soon as we find what we are looking for, we can cry "Eureka!" and stop looking.

We commonly call this traversal a search.
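
The same eureka pattern can begin anywhere in the string. The routine below, findFrom(), is a hypothetical variation written for this sketch; it takes a third argument giving the index where the search begins, and it uses a for loop in place of the while loop.

    function findFrom( integer ch, sequence str, integer start )
        -- assumes start >= 1; a smaller value would be an illegal index
        for index = start to length( str ) do
            if str[index] = ch then
                return index
            end if
        end for
        return 0
    end function

    ? findFrom( 'n', "banana", 1 )   -- 3
    ? findFrom( 'n', "banana", 4 )   -- 5
    ? findFrom( 'n', "banana", 6 )   -- 0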



7.10 Looping and counting



The following program counts the number of times the letter 'a' appears in a string:

    sequence fruit
    fruit = "banana"
    integer count
    count = 0
    for char=1 to length( fruit ) do
        if fruit[char] = 'a' then
            count = count + 1
        end if
    end for
    ? count
        -- 3
This program demonstrates another pattern of computation called a counter. The variable count is initialized to 0 and then incremented each time an 'a' is found.
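
The counter pattern is easy to wrap in a function so that it works for any character and any string. The name countChar() is an assumption made for this sketch; it expects a flat string, where every element is a single character.

    function countChar( integer ch, sequence str )
        integer count = 0
        for i = 1 to length( str ) do
            if str[i] = ch then
                count += 1
            end if
        end for
        return count
    end function

    ? countChar( 'a', "banana" )   -- 3
    ? countChar( 'n', "banana" )   -- 2
    ? countChar( 'z', "banana" )   -- 0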



7.11 Euphoria: find() and match()



Euphoria has several related functions, including find(), as built-in routines. See the library reference under "searching" and "matching." If a routine seems like it should be universally useful, the odds are that someone has already written that routine for you. It may be in the Euphoria Library (look in the documentation), or it may be found in the Euphoria Archives (search the RDS website).

You have access to many useful routines for the manipulation of sequences. These are well suited to string manipulations. Some commonly used routines are built-in; they are part of the Euphoria interpreter itself. A few must be included with an include command before you may use them. An include command makes available the contents of a file--containing code and routines--in your main program-code. Check with the Euphoria documentation before using one of these routines.

Two related functions are built-in: find() and match(). Even though this chapter is on strings, the good news is that these routines work on any sequences--the string being just one case of the sequence-type.

It helps to remember that a string is in reality a sequence of individual characters:

    sequence string
    string = "banana"
    ? string
    -- { 98, 97, 110, 97, 110, 97 }

The find() function is designed to find an object in a sequence (needle in a haystack.)

    ? find(  'a', "banana" ) -- ? find(       97, {98,97,110,97,110,97} )
        -- 2
    ? find(  "a", "banana" ) -- ? find(     {97}, {98,97,110,97,110,97} )
        -- 0
    ? find( "na", "banana" ) -- ? find( {110,97}, {98,97,110,97,110,97} )
        -- 0

Since 'a' ( 97 ) is an element of "banana", the find() function gives you the expected result.

Since "a" ( {97} ) is not an element of "banana", the find() function says there is no such element.

Similarly, "na" ( {110,97} ) is not an element of "banana".



The match() function is different: it is used to find a sequence as a slice of another sequence.

    ? match(  'a', "banana" ) -- ? match(       97, {98,97,110,97,110,97} )
        -- error
    ? match(  "a", "banana" ) -- ? match(     {97}, {98,97,110,97,110,97} )
        -- 2
    ? match( "na", "banana" ) -- ? match( {110,97}, {98,97,110,97,110,97} )
        -- 3
Since 'a' ( 97 ) is not a sequence, an error message is issued.

Since "a" ( {97} ) is a sequence, it can be matched against "banana" to the location 2.

Since "na" ( {110,97} ) is a sequence, it can be matched against "banana" to the location 3.
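
Because match() reports only the first location, counting every occurrence takes a loop. The sketch below assumes Euphoria 4, where match() accepts an optional third argument giving the index at which the search starts; countMatches() is a hypothetical name, and the needle must not be empty.

    function countMatches( sequence needle, sequence haystack )
        integer count = 0
        integer pos = match( needle, haystack )
        while pos do
            count += 1
            -- stop once there is nothing left to search
            if pos >= length( haystack ) then
                exit
            end if
            pos = match( needle, haystack, pos + 1 )
        end while
        return count
    end function

    ? countMatches( "na", "banana" )    -- 2
    ? countMatches( "ana", "banana" )   -- 2, overlapping matches are counted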



The find() function (object in sequence) also works on nested sequences:

? find( "pear", { "banana", "apple", "pear" } )
               --                    "pear"
    -- 3
The match() function (slice of sequence) also works on nested sequences:

? match( { "apple","pear"}, { "banana", "apple", "pear" } )
                           --          {"apple", "pear" }
    -- 2
When learning Euphoria, concentrate on how things work the same, instead of looking for exceptions.



7.12 find vs find

7.12.1 problem with overriding routines / no warning



So you have written a program with your personal find() routine and then decide to use the built-in find() instead.

Euphoria has a built-in function named find() that does the same thing as the function we wrote. Note the order of the arguments: the item to look for comes first, and the sequence to search comes second.

    sequence fruit = "banana"
    atom index
    index = find( 'a', fruit )
    ? index
        -- 2
When your own routine and a built-in routine share a name, you need a way to say which one you mean. In Euphoria 4 the built-in routines belong to the eu namespace, so eu:find() refers to the built-in version. Namespaces help avoid collisions between the names of built-in functions and user-defined functions.

The built-in routines are more general than our version. First, match() can find substrings, not just characters:

    ? match( "na", "banana" )
        -- 3
Also, in Euphoria 4 both find() and match() take an optional additional argument that specifies the index the search should start at:

    ? match( "na", "banana", 4 )
        -- 5

To search only a range of indices, search a slice:

    sequence name = "bob"
    ? find( 'b', name[2 .. 2] )
        -- 0

In this example, the search fails because the letter b does not appear in the index range from 2 to 2.



7.13 Character classification



It is often helpful to examine a character and distinguish between upper and lowercase, or distinguish between characters and digits.

One way to do this is to first define some string constants:

    constant lowercase = "abcdefghijklmnopqrstuvwxyz"
    constant uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    constant digits = "1234567890"
We can use these constants and find() to classify characters. For example, if find(ch, lowercase) returns a value other than 0, then ch must be lowercase:

    function isLower( integer ch )
        return find( ch, lowercase ) 
    end function

    ? isLower( 'A' )
    ? isLower( 'r' )
        -- 0  (false)
        -- 18 (true: any nonzero value counts as true)
Not surprisingly, something as useful as isLower() is already available in the Euphoria library. Euphoria has predefined utility data-types for many commonly used sets of characters.

    include std/types.e
    ? t_lower( 'A' )
    ? t_lower( 'r' )
        -- 0 (false)
        -- 1 (true)

As yet another alternative, we can use the comparison operators:

    function isLower2( atom ch )
        -- true (nonzero) when ch lies between 'a' and 'z' inclusive
        if 'a' <= ch
        and ch <= 'z' then
            return ch
        else
            return 0
        end if
    end function

    ? isLower2( 'A' )
    ? isLower2( 'r' )
        -- 0
        -- 114
If ch is between 'a' and 'z', it must be a lowercase letter.
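
The same range test extends to the other character classes. The routine classify() below is a hypothetical example for this sketch; it assumes plain ASCII characters and returns a descriptive string.

    function classify( integer ch )
        if ch >= 'a' and ch <= 'z' then
            return "lowercase"
        elsif ch >= 'A' and ch <= 'Z' then
            return "uppercase"
        elsif ch >= '0' and ch <= '9' then
            return "digit"
        end if
        return "other"
    end function

    puts(1, classify( 'r' ) & '\n' )   -- lowercase
    puts(1, classify( 'A' ) & '\n' )   -- uppercase
    puts(1, classify( '7' ) & '\n' )   -- digit
    puts(1, classify( '?' ) & '\n' )   -- other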

Another constant may be useful:

    constant whitespace = {  ' ', '\t' , '\n' }

Whitespace characters move the cursor without printing anything. They create the "white space" between visible characters (at least on white paper). The sequence whitespace contains all the whitespace characters, including space, tab ( '\t' ), and newline ( '\n' ).
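
One use for such a constant is removing leading whitespace from a string. The routine trimLeft() below is a hypothetical name used for this sketch; the constant is repeated so that the example stands alone.

    constant whitespace = { ' ', '\t', '\n' }

    function trimLeft( sequence str )
        integer start = 1
        -- skip forward while the current character is whitespace
        while start <= length( str ) and find( str[start], whitespace ) do
            start += 1
        end while
        return str[start .. $]
    end function

    puts(1, trimLeft( "  \tbanana" ) & '\n' )   -- banana
    ? length( trimLeft( "   " ) )               -- 0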

Examine the library routines for a wealth of sequence routines that can be applied to strings. In addition, there are text utilities available in the Euphoria archives.



7.14 Debugging






7.15 Glossary