Summary: Unicode
pkg std =
const Badchar : char
const Maxcharlen : size
const Maxcharval : char
/* iterators */
impl iterable chariter -> char
const chariter : (byte[:] -> chariter)
/* utf8 information */
const charlen : (chr : char -> size)
const encode : (buf : byte[:], chr : char -> size)
const decode : (buf : byte[:] -> char)
const charstep : (str : byte[:] -> (char, byte[:]))
const graphemestep : (str : byte[:] -> (byte[:], byte[:]))
const cellwidth : (c : char -> int)
const strcellwidth : (str : byte[:] -> size)
/* character class predicates */
const isalpha : (c : char -> bool)
const isdigit : (c : char -> bool)
const isxdigit : (c : char -> bool)
const isnum : (c : char -> bool)
const isalnum : (c : char -> bool)
const isspace : (c : char -> bool)
const isblank : (c : char -> bool)
const islower : (c : char -> bool)
const isupper : (c : char -> bool)
const istitle : (c : char -> bool)
/* character class conversions */
const tolower : (c : char -> char)
const toupper : (c : char -> char)
const totitle : (c : char -> char)
;;
Summary
As a reminder, Myrddin characters hold a single Unicode codepoint, and all strings are assumed to be encoded in UTF-8 by default. These functions are designed to facilitate manipuating unicode strings and codepoints.
The APIs are generally designed that strings will be streamed through, and not encoded or decoded wholesale.
Constants
const Badchar : char
This is a character value that is not, and will never be, a valid unicode codepoint. This is generally returned when we encounter an error fr
const Maxcharlen : size
This is a constant defining the maximum number of bytes that a character may be decoded into. It's guaranteed that a buffer that is at least Maxcharlen bytes long will be able to contain any character.
const Maxcharval : char
This is the maximum value that any valid future unicode codepoint may decode into. Any character that is greater than this is an invalid character.
Functions: Iterating over strings
impl iterable chariter -> char
const chariter : (byte[:] -> chariter)
Chariter returns an iterator which steps through a string character by character.
Functions: Encoding, Decoding, Information
const charlen : (chr : char -> size)
Charlen returns the length in bytes that decoding the character provided into
unicode would take. This can vary between 1 and Maxcharlen bytes. In the case
of an invalid character, 1
is returned.
const encode : (buf : byte[:], chr : char -> size)
Encode takes a single character, and encodes it to a utf8 string. The buffer must be at least long enough to hold the character.
Returns: The number of bytes written, or -1 if the character could not be encoded.
const decode : (buf : byte[:] -> char)
Decode converts the head of the buffer buf
to a single unicode codepoint,
returning the codepoint itself, or Badchar
if the codepoint is invalid.
The tail of the buffer is not considered, allowing this function to be used to peek at the contents of a string.
const charstep : (str : byte[:] -> (char, byte[:]))
charstep
is a function for stepping through unicode encoded strings. It
returns the tuple (Badchar
, str[1:]) if the value cannot be decoded,
or (charval, str[std.charlen(charval):])
if it can.
const graphemestep : (str : byte[:] -> (byte[:], byte[:]))
graphemestep
is similar to charstep
. If possible, it will return
a tuple consisting of the first grapheme of str
and the rest of
str
. A "grapheme" is taken to be either one of the characters
'\n'
, '\r'
, '\t'
, or a character of nonzero width (according
to std.cellwidth
) followed by arbitrarily many characters of zero
width.
If str
does not start with a grapheme, or does not contain valid
UTF-8, graphemestep
returns the tuple ([][:], str[1:])
const cellwidth : (str : byte[:] -> size)
cellwidth
returns a prediction of how many cells in a terminal
c
would take, if printed.
const strcellwidth : (str : byte[:] -> size)
strcellwidth
returns the sum of invoking std.cellwidth
on each
character in str
, as returned by charstep
.
Character Classes
const isalpha : (c : char -> bool)
const isdigit : (c : char -> bool)
const isxdigit : (c : char -> bool)
const isnum : (c : char -> bool)
const isalnum : (c : char -> bool)
const isspace : (c : char -> bool)
const isblank : (c : char -> bool)
const islower : (c : char -> bool)
const isupper : (c : char -> bool)
const istitle : (c : char -> bool)