Myrddin: Unicode

Summary: Unicode

pkg std =
        const Badchar   : char 
        const Maxcharlen : size 
        const Maxcharval : char 

    /* iterators */
    impl iterable chariter -> char

    const chariter  : (byte[:] -> chariter)

        /* utf8 information */
        const charlen   : (chr : char -> size)
        const encode    : (buf : byte[:], chr : char -> size)
        const decode    : (buf : byte[:] -> char)
        const charstep  : (str : byte[:] -> (char, byte[:]))
        const graphemestep  : (str : byte[:] -> (byte[:], byte[:]))
        const cellwidth : (c : char -> int)
        const strcellwidth  : (str : byte[:] -> size)

        /* character class predicates */
        const isalpha   : (c : char -> bool)
        const isdigit   : (c : char -> bool)
        const isxdigit  : (c : char -> bool)
        const isnum : (c : char -> bool)
        const isalnum   : (c : char -> bool)
        const isspace   : (c : char -> bool)
        const isblank   : (c : char -> bool)
        const islower   : (c : char -> bool)
        const isupper   : (c : char -> bool)
        const istitle   : (c : char -> bool)

        /* character class conversions */
        const tolower   : (c : char -> char)
        const toupper   : (c : char -> char)
        const totitle   : (c : char -> char)
;;

Summary

As a reminder, Myrddin characters hold a single Unicode codepoint, and all strings are assumed to be encoded in UTF-8 by default. These functions are designed to facilitate manipuating unicode strings and codepoints.

The APIs are generally designed that strings will be streamed through, and not encoded or decoded wholesale.

Constants

const Badchar   : char 

This is a character value that is not, and will never be, a valid unicode codepoint. This is generally returned when we encounter an error fr

const Maxcharlen : size 

This is a constant defining the maximum number of bytes that a character may be decoded into. It's guaranteed that a buffer that is at least Maxcharlen bytes long will be able to contain any character.

const Maxcharval : char 

This is the maximum value that any valid future unicode codepoint may decode into. Any character that is greater than this is an invalid character.

Functions: Iterating over strings

impl iterable chariter -> char

const chariter  : (byte[:] -> chariter)

Chariter returns an iterator which steps through a string character by character.

Functions: Encoding, Decoding, Information

const charlen   : (chr : char -> size)

Charlen returns the length in bytes that decoding the character provided into unicode would take. This can vary between 1 and Maxcharlen bytes. In the case of an invalid character, 1 is returned.

const encode    : (buf : byte[:], chr : char -> size)

Encode takes a single character, and encodes it to a utf8 string. The buffer must be at least long enough to hold the character.

Returns: The number of bytes written, or -1 if the character could not be encoded.

const decode    : (buf : byte[:] -> char)

Decode converts the head of the buffer buf to a single unicode codepoint, returning the codepoint itself, or Badchar if the codepoint is invalid.

The tail of the buffer is not considered, allowing this function to be used to peek at the contents of a string.

const charstep  : (str : byte[:] -> (char, byte[:]))

charstep is a function for stepping through unicode encoded strings. It returns the tuple (Badchar, str[1:]) if the value cannot be decoded, or (charval, str[std.charlen(charval):]) if it can.

const graphemestep  : (str : byte[:] -> (byte[:], byte[:]))

graphemestep is similar to charstep. If possible, it will return a tuple consisting of the first grapheme of str and the rest of str. A "grapheme" is taken to be either one of the characters '\n', '\r', '\t', or a character of nonzero width (according to std.cellwidth) followed by arbitrarily many characters of zero width.

If str does not start with a grapheme, or does not contain valid UTF-8, graphemestep returns the tuple ([][:], str[1:])

const cellwidth : (str : byte[:] -> size)

cellwidth returns a prediction of how many cells in a terminal c would take, if printed.

const strcellwidth  : (str : byte[:] -> size)

strcellwidth returns the sum of invoking std.cellwidth on each character in str, as returned by charstep.

Character Classes

const isalpha   : (c : char -> bool)
const isdigit   : (c : char -> bool)
const isxdigit  : (c : char -> bool)
const isnum : (c : char -> bool)
const isalnum   : (c : char -> bool)
const isspace   : (c : char -> bool)
const isblank   : (c : char -> bool)
const islower   : (c : char -> bool)
const isupper   : (c : char -> bool)
const istitle   : (c : char -> bool)