flibs/strings(n) 1.1 "flibs"
flibs/strings - Tokenizing strings
TABLE OF CONTENTS
SYNOPSIS
DESCRIPTION
INTERFACE
COPYRIGHT

SYNOPSIS
use tokenize
call set_tokenizer( token, gaps, separators, delimiters )
part = first_token( token, string, length )
part = next_token( token, string, length )

DESCRIPTION
The tokenize module provides a method to split a string into parts
according to some simple rules:
- A string can be split into "words" by treating spaces, commas and
  similar characters as separators between the words. Two or more such
  characters in a row are treated as a single separator; in the
  terminology of the module they represent gaps of varying width. As a
  consequence there are no zero-length words.
- A string can be split into "words" by treating each individual comma
  as a separator. A string like "One,,two" is then split into three
  fields: "One", an empty field and "two".
- Just like Fortran's list-directed input, the module handles strings
  with delimiters: "Just say 'Hello, world!'" is split into "Just",
  "say" and "Hello, world!".
The module is meant to help analyse input data where list-directed
input cannot be used: for instance, because the data are not separated
by the standard characters or because you need finer control over the
handling of the data.

INTERFACE
The module contains three routines and it defines a single derived type
and a few convenient parameters.
The data type is type(tokenizer), a derived type that holds all the
information needed for parsing the string. It is initialised via the
set_tokenizer subroutine and it applies to the string passed to the
first_token() function. If you want to reuse it for a different string
with the same settings, simply call first_token() on the new string.
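A sketch of that reuse (the program name demo_reuse and the data are
made up for illustration):

    program demo_reuse
        use tokenize
        implicit none

        type(tokenizer)   :: token
        character(len=20) :: line1 = "red green blue"
        character(len=20) :: line2 = "cyan magenta"
        character(len=20) :: part
        integer           :: length

        ! One definition, reused for several strings: words between gaps
        call set_tokenizer( token, token_whitespace, token_empty, token_empty )

        ! Tokenise the first string ...
        part = first_token( token, line1, length )
        do while ( length /= -1 )
            part = next_token( token, line1, length )
        enddo

        ! ... then simply start afresh on the second string
        part = first_token( token, line2, length )
    end program demo_reuse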
- use tokenize
  To import the definitions, use this module.
- call set_tokenizer( token, gaps, separators, delimiters )
  Initialise the tokenizer "token" with the various sets of characters
  that control the splitting process.
  - type(tokenizer) token
    The tokenizer to be initialised.
  - character(len=*) gaps
    The string of characters that are to be treated as "gaps". They take
    precedence over the "separators". Use "token_empty" if there are
    none.
  - character(len=*) separators
    The string of characters that are to be treated as "separators".
    Use "token_empty" if there are none.
  - character(len=*) delimiters
    The string of characters that are to be treated as "delimiters".
    Use "token_empty" if there are none.
- part = first_token( token, string, length )
  Find the first token of the string (this also initialises the
  tokenisation for this string). Returns a string of the same length as
  the original one.
  - type(tokenizer) token
    The tokenizer to be used.
  - character(len=*) string
    The string to be split into tokens.
  - integer, intent(out) length
    The length of the token. If the length is -1, no token was found.
- part = next_token( token, string, length )
  Find the next token of the string. Returns a string of the same length
  as the original one.
  - type(tokenizer) token
    The tokenizer to be used.
  - character(len=*) string
    The string to be split into tokens.
  - integer, intent(out) length
    The length of the token. If the length is -1, no token was found.
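Together the two functions form the typical tokenisation loop. The
sketch below applies it to the "One,,two" example from the description
(demo_fields is an illustrative name; the empty middle field is assumed
to come back with length 0):

    program demo_fields
        use tokenize
        implicit none

        type(tokenizer)  :: token
        character(len=8) :: string = "One,,two"
        character(len=8) :: part
        integer          :: length

        ! Every individual comma separates the fields
        call set_tokenizer( token, token_empty, token_csv, token_empty )

        part = first_token( token, string, length )
        do while ( length /= -1 )
            write(*,'(3a)') '>', part(1:length), '<'
            part = next_token( token, string, length )
        enddo

        ! Expected output: >One<, an empty field >< and >two<
    end program demo_fields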
Convenient parameters:
- token_whitespace - whitespace (a single character)
- token_tsv - tab, useful for tab-separated values files
- token_csv - comma, useful for comma-separated values files
- token_quotes - single and double quotes, commonly used delimiters
- token_empty - empty string, useful to suppress any of the arguments
  in the set_tokenizer routine
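For instance, reading a line from a tab-separated values file only
requires the separator (a sketch; demo_tsv and the sample line are made
up, token_tsv is assumed to be usable as the tab character as listed
above, and the final loop is the same as shown earlier):

    program demo_tsv
        use tokenize
        implicit none

        type(tokenizer)   :: token
        character(len=13) :: line
        character(len=13) :: part
        integer           :: length

        line = 'one' // token_tsv // 'two' // token_tsv // 'three'

        ! No gaps, tab as the only separator, no delimiters
        call set_tokenizer( token, token_empty, token_tsv, token_empty )

        part = first_token( token, line, length )   ! then loop with next_token
    end program demo_tsv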

COPYRIGHT
Copyright © 2008 Arjen Markus <arjenmarkus@sourceforge.net>