Euphoria Performance Tips


General Tips

  • If your program is fast enough, forget about speeding it up. Just make it simple and readable.

  • If your program is way too slow, the tips below will probably not solve your problem. You should find a better overall algorithm.

  • The easiest way to gain a bit of speed is to turn off run-time type-checking. Insert the line:
               without type_check
    
    at the top of your main .ex file, ahead of any include statements. You'll typically gain between 0 and 20 percent depending on the types you have defined, and the files that you are including. Most of the standard include files do some user-defined type-checking. A program that is completely without user-defined type-checking might still be speeded up slightly.

    Also, be sure to remove, or comment-out, any

               with trace
               with profile
               with profile_time
    
    statements. with trace (even without any calls to trace()), and with profile can easily slow you down by 10% or more. with profile_time might slow you down by 1%. Each of these options will consume extra memory as well.

  • Calculations using integer values are faster than calculations using floating-point numbers

  • Declare variables as integer rather than atom where possible, and as sequence rather than object where possible. This usually gains you a few percent in speed.

  • In an expression involving floating-point calculations it's usually faster to write constant numbers in floating point form, e.g. when x has a floating-point value, say, x = 9.9

    change:
               x = x * 5
    
    to:
               x = x * 5.0
    
    This saves the interpreter from having to convert integer 5 to floating-point 5.0 each time.

  • Euphoria does short-circuit evaluation of if, elsif, and while conditions involving and and or. Euphoria will stop evaluating any condition once it determines if the condition is true or not. For instance in the if-statement:
            if x > 20 and y = 0 then
                ...
            end if
    
    The "y = 0" test will only be made when "x > 20" is true.

    For maximum speed, you can order your tests. Do "x > 20" first if it is more likely to be false than "y = 0".

    In general, with a condition "A and B", Euphoria will not evaluate the expression B, when A is false (zero). Similarly, with a condition like "A or B", B will not be evaluated when A is true (non-zero).

    Simple if-statements are highly optimized. With the current version of the interpreter, nested simple if's that compare integers are usually a bit faster than a single short-circuit if-statement e.g.:

           if x > 20 then
               if y = 0 then
                   ...
               end if
           end if
    

  • The speed of access to private variables, local variables and global variables is the same.

  • There is no performance penalty for defining constants versus plugging in hard-coded literal numbers. The speed of:
               y = x * MAX
    
    is exactly the same as:
               y = x * 1000
    
    where you've previously defined:
               constant MAX = 1000
    
  • There is no performance penalty for having lots of comments in your program. Comments are completely ignored. They are not executed in any way. It might take a few milliseconds longer for the initial load of your program, but that's a very small price to pay for future maintainability, and when you bind your program, or translate your program to C, all comments are stripped out, so the cost becomes absolute zero.


Measuring Performance

In any programming language, and especially in Euphoria, you really have to make measurements before drawing conclusions about performance.

Euphoria provides both execution-count profiling, as well as time profiling (DOS32 only). See refman.doc. You will often be surprised by the results of these profiles. Concentrate your efforts on the places in your program that are using a high percentage of the total time (or at least are executed a large number of times.) There's no point to rewriting a section of code that uses 0.01% of the total time. Usually there will be one place, or just a few places where code tweaking will make a significant difference.

You can also measure the speed of code by using the time() function. e.g.

        atom t
        t = time()
        for i = 1 to 10000 do
            -- small chunk of code here
        end for
        ? time() - t

You might rewrite the small chunk of code in different ways to see which way is faster.


How to Speed-Up Loops

Profiling will show you the hot spots in your program. These are usually inside loops. Look at each calculation inside the loop and ask yourself if it really needs to happen every time through the loop, or could it be done just once, prior to the loop.


Converting Multiplies to Adds in a Loop

Addition is faster than multiplication. Sometimes you can replace a multiplication by the loop variable, with an addition. Something like:

        for i = 0 to 199 do
            poke(screen_memory+i*320, 0)
        end for
becomes:
        x = screen_memory
        for i = 0 to 199 do
            poke(x, 0)
            x = x + 320 
        end for

Saving Results in Variables
  • It's faster to save the result of a calculation in a variable, than it is to recalculate it later. Even something as simple as a subscript operation, or adding 1 to a variable is worth saving.

  • When you have a sequence with multiple levels of subscripting, it is faster to change code like:
               for i = 1 to 1000 do
                   y[a][i] = y[a][i]+1
               end for
    
    to:
               ya = y[a]
               for i = 1 to 1000 do
                   ya[i] = ya[i] + 1
               end for
               y[a] = ya
    
    So you are doing 2 subscript operations per iteration of the loop, rather than 4. The operations, ya = y[a] and y[a] = ya are very cheap. They just copy a pointer. They don't copy a whole sequence.

  • There is a slight cost when you create a new sequence using {a,b,c}. If possible, move this operation out of a critical loop by storing it in a variable before the loop, and referencing the variable inside the loop.


In-lining of Routine Calls

If you have a routine that is rather small and fast, but is called a huge number of times, you will save time by doing the operation in-line, rather than calling the routine. Your code may become less readable, so it might be better to in-line only at places that generate a lot of calls to the routine.


Operations on Sequences

Euphoria lets you operate on a large sequence of data using a single statement. This saves you from writing a loop where you process one element at-a-time. e.g.

        x = {1,3,5,7,9}
        y = {2,4,6,8,10}

        z = x + y
versus:
        z = repeat(0, 5)  -- if necessary
        for i = 1 to 5 do
            z[i] = x[i] + y[i]
        end for

In most interpreted languages, it is much faster to process a whole sequence (array) in one statement, than it is to perform scalar operations in a loop. This is because the interpreter has a large amount of overhead for each statement it executes. Euphoria is different. Euphoria is very lean, with little interpretive overhead, so operations on sequences don't always win. The only solution is to time it both ways. The per-element cost is usually lower when you process a sequence in one statement, but there are overheads associated with allocation and deallocation of sequences that may tip the scale the other way.


Some Special Case Optimizations

Euphoria automatically optimizes certain special cases. x and y below could be variables or arbitrary expressions.

        x + 1      -- faster than general x + y
        1 + x      -- faster than general y + x
        x * 2      -- faster than general x * y
        2 * x      -- faster than general y * x
        x / 2      -- faster than general x / y
        floor(x/y) -- where x and y are integers, is faster than x/y
        floor(x/2) -- faster than floor(x/y)

x below is a simple variable, y is any variable or expression:

        x = append(x, y)   -- faster than general z = append(x, y)
        x = prepend(x, y)  -- faster than general z = prepend(x, y)

        x = x & y          -- where x is much larger than y,
                           -- is faster than general z = x & y
When you write a loop that "grows" a sequence, by appending or concatenating data onto it, the time will, in general, grow in proportion to the square of the number (N) of elements you are adding. However, if you can use one of the special optimized forms of append(), prepend() or concatenation listed above, the time will grow in proportion to just N (roughly). This could save you a huge amount of time when creating an extremely long sequence. (You could also use repeat() to establish the maximum size of the sequence, and then fill in the elements in a loop, as discussed below.)


Assignment with Operators

For greater speed, convert:

        left-hand-side = left-hand-side op expression
to:
        left-hand-side op= expression
whenever left-hand-side contains at least 2 subscripts, or at least one subscript and a slice. In all simpler cases the two forms run at the same speed (or very close to the same).


Pixel-Graphics Tips

  • Mode 19 is the fastest mode for animated graphics and games.

  • The video memory (in mode 19) is not cached by the CPU. It usually takes longer to read or write data to the screen than to a general area of memory that you allocate. This adds to the efficiency of virtual screens, where you do all of your image updating in a block of memory that you get from allocate(), and then you periodically mem_copy() the resulting image to the real screen memory. In this way you never have to read the (slow) screen memory.

  • When plotting pixels, you may find that modes 257 and higher are fast near the top of the screen, but slow near the bottom.


Text-Mode Tips

Writing text to the screen using puts() or printf() is rather slow. If necessary, in DOS32, you can do it much faster by poking into the video memory, or by using display_text_image(). There is a very large overhead on each puts() to the screen, and a relatively small incremental cost per character. The overhead with exw (WIN32) is especially high (on Windows 95/98/ME at least). Linux and FreeBSD are somewhere between DOS32 and WIN32 in text output speed. It therefore makes sense to build up a long string before calling puts(), rather than calling it for each character. There is no advantage to building up a string longer than one line however.

The slowness of text output is mainly due to operating system overhead.


Library Routines

Some common routines are extremely fast. You probably couldn't do the job faster any other way, even if you used C or assembly language. Some of these are:

  • mem_copy()
  • mem_set()
  • repeat()

Other routines are reasonably fast, but you might be able to do the job faster in some cases if speed was crucial.

        x = repeat(0,100)
        for i = 1 to 100 do
            x[i] = i
        end for
is somewhat faster than:
        x = {}
        for i = 1 to 100 do
            x = append(x, i)
        end for
because append() has to allocate and reallocate space as x grows in size. With repeat(), the space for x is allocated once at the beginning. (append() is smart enough not to allocate space with every append to x. It will allocate somewhat more than it needs, to reduce the number of reallocations.)

You can replace:

        remainder(x, p)
with:
        and_bits(x, p-1)
for greater speed when p is a positive power of 2. x must be a non-negative integer that fits in 32-bits.

arctan() is faster than arccos() or arcsin().


Searching

Euphoria's find() is the fastest way to search for a value in a sequence up to about 50 elements. Beyond that, you might consider a hash table (demo\hash.ex) or a binary tree (demo\tree.ex).


Sorting

In most cases you can just use the shell sort routine in include\sort.e.

If you have a huge amount of data to sort, you might try one of the sorts in demo\allsorts.e (e.g. great sort). If your data is too big to fit in memory, don't rely on Euphoria's automatic memory swapping capability. Instead, sort a few thousand records at a time, and write them out to a series of temporary files. Then merge all the sorted temporary files into one big sorted file.

If your data consists of integers only, and they are all in a fairly narrow range, try the bucket sort in demo\allsorts.e.


Taking Advantage of Cache Memory

As CPU speeds increase, the gap between the speed of the on-chip cache memory and the speed of the main memory or DRAM (dynamic random access memory) becomes ever greater. You might have 256 Mb of DRAM on your computer, but the on-chip cache is likely to be only 8K (data) plus 8K (instructions) on a Pentium, or 16K (data) plus 16K (instructions) on a Pentium with MMX or a Pentium II/III. Most machines will also have a "level-2" cache of 256K or 512K.

An algorithm that steps through a long sequence of a couple of thousand elements or more, many times, from beginning to end, performing one small operation on each element, will not make good use of the on-chip data cache. It might be better to go through once, applying several operations to each element, before moving on to the next element. The same argument holds when your program starts swapping, and the least-recently-used data is moved out to disk.

These cache effects aren't as noticeable in Euphoria as they are in lower-level compiled languages, but they are measurable.


Using Machine Code and C

Euphoria lets you call routines written in 32-bit Intel machine code. On WIN32 and Linux and FreeBSD you can call C routines in .dll or .so files, and these C routines can call your Euphoria routines. You might need to call C or machine code because of something that can't be done directly in Euphoria, or you might do it for improved speed.

To boost speed, the machine code or C routine needs to do a significant amount of work on each call, otherwise the overhead of setting up the arguments and making the call will dominate the time, and it might not gain you much.

Many programs have some inner core operation that consumes most of the CPU time. If you can code this in C or machine code, while leaving the bulk of the program in Euphoria, you might achieve a speed comparable to C, without sacrificing Euphoria's safety and flexibility.

 
Using The Euphoria To C Translator

In version 3.0, the full Euphoria To C Translator is included in the installation package. It will translate any Euphoria program into a set of C source files that you can compile using a C compiler.

The executable file that you get using the Translator should run the same, but faster than when you use the interpreter. The speed-up can be anywhere from a few percent to a factor of 5 or more.