Possible guidelines for code review

Possible guidelines for reviewing Fortran code

Code reviews - whatever form they take - are generally seen as a very useful and powerful instrument to guarantee the quality of a program, in terms of adhering to the requirements and in terms of maintainability in all its aspects. But there seem very few specific guidelines published. And by "specific" I mean: what sort of things in a piece of code should one look for? I want to collect such specific guidelines in this document and as a start I have assembled a number of coding practices that caused problems or made me wonder ...

Note: the examples I give below are both from programs other people wrote and from programs I wrote myself - I am hardly immune to bad coding practices.

Another note: while I present this as guidelines for code reviews, you can also use these rules while coding in the first place. And while the code fragments are in Fortran, most of the rules apply equally to other languages.

I try to capture the underlying principle of each example as a positive rule and then give the example. But there is no grouping yet or any priority.

Provide useful and at least adequate error messages

A useful error message would include a way to solve the problem, but it should at least be made clear what the problem is.

For instance: the message

    Error: Invalid number of areas

in a program that dealt with geographical areas was rahter puzzling. It turned out, after inspecting the source code that we had exceeded an arbitrary limit in the number of areas it could handle. A much clearer and adequate message would have been:

    Error: Too many areas

or even better:

    Error: Too many areas - increase size of array X

Here are a few other examples:

    An internal error occurred

(in a message box. No further information: it is not even clear if the program stops after clicking the "OK" button)

    Invalid line in header: ....

(the culprit line is shown, but no reason why it is invalid)

Allow the caller to handle errors gracefully

The simplest way of dealing with a nasty error in a library routine is of course to stop the entire program. This, however, does not give the calling program a chance to handle the error in a way more suitable to the user.

For instance: if the library is used in a graphical user-interface, it would be much better to present a messagebox to indicate the fatal error (and what can be done about it) than simply terminate the entire program. But the form should be left to the program!

Avoid arbitrary limits

Whenever possible, do not assume that something will not exceed a particular size. If you do need to set a limit (in Fortran it is tough to avoid arbitrary string lengths), make sure it will not be a problem in the foreseeable future and at least document the limit.

For instance, the program that dealt with geograpical areas was a large GUI handling input scenarios via a relational database. Jut about everything in it was dynamically allocated - except the arrays holding the information on the geographical areas.

Another one I have come across: strings of exactly 12 characters, so that a DOS file name would fit perfectly ...

Know the pitfalls and possibilities of real numbers

We all know the rule: do not compare real numbers directly for equality or inequality. Of course, it is slightly subtler than that: the number -999.0 (I use it often to indicate a missing value) can be represented exactly, so there is no problem in comparing a value to that number.

However:

Be careful with comparisons in general. Always use a range, if you can not be certain that rounding errors may influence the result. This actually includes not only "equal" and "not equal", but also "greater than" and so on!
Be careful with the precision of literal values - Fortran does not interpret single precision reals as double precision because the left-hand side is double precision:
```
    real(kind=kind(1.0d0)) :: value1 = 0.1
    real(kind=kind(1.0d0)) :: value2 = 0.1d0
```
will give two slightly different values for these two variables.
Never use a real variable as a loop index. That is begging for trouble.

Avoid temporary files

Sometimes you see temporary files used as part of processing the input or for other reasons. When this is done to avoid storing the data in memory, that is - in my opinion - a sign of bad thinking: memory is cheap nowadays and usually there is plenty of it.

Using temporary files for storing intermediate results means that the data structure is spread out over the writing and reading part of these files. So changes in the structure require changes in various locations of the program code in addition to the changes required for the actual processing.

They can be of great use though: for instance to strip comments from an input file, so that reading the data becomes simpler.

Relaxed use of strings and substrings

In Fortran character strings are easy to use, even though they have some drawbacks. One of the attractive features is that comparing strings of different lengths is done by padding the shortest strings with spaces.

If you want to identify the contents of a string - for instance: see which keyword it contains, then avoid code like this:

    if ( string(1:5) .eq. 'start' ) then

Use instead:

    if ( string .eq. 'start' ) then

Because the keyword is "start" full stop, there is no need to limit the string to just the five characters of the keyword. It is even error-prone: miscounting is easy and what do you do with "starttime"?

Get rid of tab characters

Tab characters are a nuisance: they are not part of the Fortran language, and they expand to a varying number of spaces. So, what looks good in the editor of your favourite integrated development environment, may look bizarre on print.

Use lower case text in your code

Of old, Fortran was written in uppercase letters. This was the standard because long ago not all devices were capable of handling lower and upper case. Lower case is easier to read, so use lower case.

Above all: if you want to use uppercase characters, do so consistently. If you want to use capitalised keywords like "Program" and "End If", suit yourself, but use the casing consistently. This is even more important for variable names: VarName and VARNAME and varname all indicate the same variable (at least in Fortran), but using these casings arbitrarily makes your program look sloppy.

Document the input - especially command-line arguments

It is just a few minutes work in general to document the input of a small utility program or what command-line arguments it accepts (and in what form).

Lack of documentation made me puzzle over the strange results I got with one particular program (one I wrote!). It turned out that the date, one of the command-line arguments, had to be specified as 2008-01-01, not as 1-jan-2008, two forms it would actually accept, "cleverly" converting between them (this was actually done by one single routine that would automatically convert one form in the other).

Use separate routines for separate tasks

Sometimes it is tempting to merge two different but similar tasks into one single routine, automatically distinguishing them using a key value or another characteristic of the input.

You should probably not do this: the logic soon gets very intricate and whatever processing was common is flooded by it.

Another reason: the input with the wrong characteristic can lead to wrong results somewhere down the line (see the rule on documenting the input).

Be aware of the effects of negative values with MAX/MIN/MOD/MODULO

When you use the function MAX() and MIN() to limit the size of something:

    change = min( allowed, potential_change )

(for instance in the context of numerically solving partial differential equations with source terms - that is my context), be aware of the effect of negative numbers: if the potential_change is negative, there will be no limitation, despite what you think, and the solution becomes unstable anyway.

Negative numbers as arguments for MOD() and MODULO() are even worse: they may give very different results than you expect. (I would have to look up the definition myself!)

Use standard idiom and standard functions

Unless you have a very good reason not to, use the standard idiom of a programming language and the standard functions it provides.

For instance, if you need to find the position of a particular substring in a string, use INDEX(). Do not write a DO-loop:

    pos = 0
    do i = 1,len(string)
        if ( string(i:i+3) == 'word' ) then
            pos = i
            exit
        endif
    enddo

A fragment like:

    pos = index( string, 'word' )

expresses the intention at one glance - not to mention that you do not have to worry about the length of the substring you are searching.

The same holds for control constructs like DO-loops, IF-blocks, SELECT-blocks and so on. Do not write DO-loops like:

    do TheLoopIndex = StartOfIndex,StopOfIndex
        ...
    enddo

Loop indices are usually variables with short names.

Declare all your variables

To make sure that all your variables have a correct type, always use "IMPLICIT NONE". It is all too easy to make a mistake like:

    do l1 = 1,10
        a(ll) = 1
    enddo

(Do you see the error?)

When you force the compiler to check the declaration of all variables, the typo would be spotted rightaway.

Use the compiler to check for as many problems as it can

The compiler and specialised tools for static analysis are much more capable of finding bad code than human beings. Use them! Quite a few of the rules I list here can be checked automatically.

Use variables and parameters instead of literal values

Few people would recognise the number 86400 as being the number of seconds in a day. And the conversion of a unit like kg/day to g/s could better be programmed as:

    real, parameter :: seconds_per_day = 86400.0
    real, parameter :: grams_per_kilo  = 1000.0
    real, parameter :: kg_per_day      = grams_per_kilo / seconds_per_day

    waste_load = waste_load * kg_per_day

Instead of:

    waste_load = waste_load * 1000.0/86400.0

or even:

    waste_load = waste_load * 0.01574074074

(For this particular type of computations, you might even consider using a specialised library that deals with physical units)

TODO

Use KINDs the proper way - REAL(4), REAL*4
Put literal numbers in parameters - issues with precision (additional info)
Use standard language features - no unnecessary platform-dependencies
Keep your logic clear and robust - don't rely on short-circuiting
Make preconditions explicit - don't assume the input data will have certain characteristics - check them!
Restrict access to data - avoid too many global data
Check for errors - beware of no checking of errors in OPEN, ...
Check optional arguments with present()
Make sure your pointers are associated
Let the compiler do the work - warnings via g95, gfortran, ... -finit-real=nan -finit-integer=42 - non-initialised variables and arrays (additional info)
Organise the code in files
Use makefiles or IDEs
Beware of local variables that should keep their value - the SAVE attribute/statement
Use modern constructs - SELECT CASE instead of a series of IF/ELSEIF

-- Arjen Markus, dd. january 2009