docs/Writerside/topics/string.h.md

Thu, 23 Oct 2025 17:50:28 +0200

author
Mike Becker <universe@uap-core.de>
changeset 1440
0d1430668271
parent 1426
3a89b31f0724

add documentation for cxMapClone() - resolves #743

# String

UCX strings store character arrays together with a length and come in two variants: immutable (`cxstring`) and mutable (`cxmutstr`).

In general, UCX strings are *not* necessarily zero-terminated.
If a function guarantees to return a zero-terminated string, it is explicitly mentioned in the documentation.
As a rule of thumb, you _should not_ pass a character array of a UCX string structure to another API without explicitly
ensuring that the string is zero-terminated.
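
The rule of thumb above can be sketched without the library. The following example is a standalone illustration, not UCX code: `demo_str` and `demo_to_cstr()` are hypothetical names, and the structure only mirrors the documented layout of `cxstring`. It shows one way to obtain a zero-terminated copy before handing the data to an API that expects a C string.

```C
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in that mirrors the layout of cxstring.
typedef struct { const char *ptr; size_t length; } demo_str;

// Returns a freshly allocated, zero-terminated copy of the string,
// or NULL when the allocation fails. The caller must free() the result.
char *demo_to_cstr(demo_str s) {
    char *copy = malloc(s.length + 1);
    if (copy == NULL) return NULL;
    if (s.length > 0) memcpy(copy, s.ptr, s.length);
    copy[s.length] = '\0';
    return copy;
}
```

In UCX itself, `cx_strdup()` serves this purpose, since it guarantees a zero-terminated result.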

## Basics

> To simplify documentation, we introduce the pseudo-type `AnyStr` with the meaning that
> both `cxstring` and `cxmutstr` are accepted for that argument.
> The implementation is actually hidden behind a macro which uses `cx_strcast()` to guarantee compatibility.
{style="note"}

```C
#include <cx/string.h>

struct cx_string_s {const char *ptr; size_t length;};

struct cx_mutstr_s {char *ptr; size_t length;};

typedef struct cx_string_s cxstring;

typedef struct cx_mutstr_s cxmutstr;

cxstring cx_str(const char *cstring);

cxstring cx_strn(const char *cstring, size_t length);

cxmutstr cx_mutstr(char *cstring);

cxmutstr cx_mutstrn(char *cstring, size_t length);

cxmutstr cx_strdup(AnyStr string);

cxmutstr cx_strdup_a(const CxAllocator *allocator, AnyStr string);

int cx_strcpy(cxmutstr *dest, cxstring source);

int cx_strcpy_a(const CxAllocator *allocator,
        cxmutstr *dest, cxstring source);

void cx_strfree(cxmutstr *str);

void cx_strfree_a(const CxAllocator *alloc, cxmutstr *str);


#define CX_SFMT(s)   (int) (s).length, (s).ptr
#define CX_PRIstr    ".*s"
#define cx_strcast(s)  // converts any string to cxstring
```

The functions `cx_str()` and `cx_mutstr()` create a UCX string from a `const char*` or a `char*`
and compute the length with a call to stdlib `strlen()` (except for `NULL`, in which case the length is set to zero).
In case you already know the length, or the string is not zero-terminated, you can use `cx_strn()` or `cx_mutstrn()`.

The function `cx_strdup_a()` allocates new memory with the given `allocator`, copies the given `string` into it,
and guarantees that the resulting string is zero-terminated.
The function `cx_strdup()` is equivalent to `cx_strdup_a()`, except that it uses the [default allocator](allocator.h.md#default-allocator).

The functions `cx_strcpy_a()` and `cx_strcpy()` copy the contents of the `source` string to the `dest` string,
and also guarantee zero-termination of the resulting string.
The memory in `dest` is either freshly allocated or re-allocated to fit the size of the string plus the terminator.

Allocated strings are always of type `cxmutstr` and can be deallocated by a call to `cx_strfree()` or `cx_strfree_a()`.
The caller must make sure to use the correct allocator for deallocating a string.
It is safe to call these functions multiple times on a given string, as the pointer will be nulled and the length set to zero.
It is also safe to call the functions with a `NULL`-pointer, just like any other `free()`-like function.

When you want to use a UCX string in a `printf`-like function, you can use the macro `CX_PRIstr` for the format specifier,
and the `CX_SFMT(s)` macro to expand the arguments.
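
The mechanism behind these two macros is the standard `%.*s` conversion, which reads the precision from an `int` argument. The sketch below reproduces it with hypothetical stand-ins (`demo_str`, `DEMO_SFMT`, `DEMO_PRIstr`, `demo_format()`) so that it compiles without UCX. It assumes that the specifier is written after a literal `%`, as the definition `".*s"` suggests.

```C
#include <stdio.h>
#include <stddef.h>

// Hypothetical stand-ins that mirror cxstring, CX_SFMT and CX_PRIstr.
typedef struct { const char *ptr; size_t length; } demo_str;
#define DEMO_SFMT(s)  (int) (s).length, (s).ptr
#define DEMO_PRIstr   ".*s"

// Formats "token=<s>" into buf; at most s.length bytes of s.ptr are read,
// so s does not need to be zero-terminated.
int demo_format(char *buf, size_t size, demo_str s) {
    return snprintf(buf, size, "token=%" DEMO_PRIstr, DEMO_SFMT(s));
}
```

Because the length is cast to `int`, this approach is only suitable for strings whose length fits into an `int`.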

> When you want to convert a string _literal_ into a UCX string, you can also use the `CX_STR(lit)` macro.
> This macro uses the fact that `sizeof(lit)` for a string literal `lit` is always the string length plus one,
> effectively saving an invocation of `strlen()`.
> However, this only works for literals - in all other cases you must use `cx_str()` or `cx_strn()`.
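
The `sizeof` trick can be illustrated with a hypothetical stand-in macro. `DEMO_STR` and `demo_str` below are not part of UCX, and the actual definition of `CX_STR` may differ. The sketch only shows why the length is known at compile time.

```C
#include <stddef.h>

// Hypothetical stand-in that mirrors the layout of cxstring.
typedef struct { const char *ptr; size_t length; } demo_str;

// sizeof("hello") is 6 because it counts the terminator, hence the -1.
// For a char pointer, sizeof would yield the pointer size instead,
// which is why this only works for literals.
#define DEMO_STR(lit)  ((demo_str){ (lit), sizeof(lit) - 1 })

// Returns the length of a string created from a literal.
size_t demo_literal_length(void) {
    demo_str s = DEMO_STR("hello");
    return s.length;
}
```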

## Comparison

```C
#include <cx/string.h>

int cx_strcmp(AnyStr s1, AnyStr s2);

int cx_strcmp_p(const void *s1, const void *s2);

int cx_strcasecmp_p(const void *s1, const void *s2);

bool cx_strprefix(AnyStr string, AnyStr prefix);

bool cx_strsuffix(AnyStr string, AnyStr suffix);

int cx_strcasecmp(AnyStr s1, AnyStr s2);

bool cx_strcaseprefix(AnyStr string, AnyStr prefix);

bool cx_strcasesuffix(AnyStr string, AnyStr suffix);
```

The `cx_strcmp()` function compares two strings lexicographically
and returns an integer greater than, equal to, or less than 0, if `s1` is greater than, equal to, or less than `s2`, respectively.

The `cx_strcmp_p()` function takes pointers to UCX strings (i.e., only to `cxstring` and `cxmutstr`) and the signature is compatible with `cx_compare_func`.
Use this as a compare function for lists or other data structures.
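
To illustrate what a pointer-based compare function looks like, here is a standalone sketch that uses the standard `qsort()` instead of a UCX data structure. The names `demo_str`, `demo_strcmp_p()`, and `demo_sort()` are hypothetical, and the comparison logic is a plain lexicographic comparison, not necessarily the one UCX implements.

```C
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in that mirrors the layout of cxstring.
typedef struct { const char *ptr; size_t length; } demo_str;

// Comparator with the same shape as cx_strcmp_p():
// both arguments point to string structures, not to char arrays.
int demo_strcmp_p(const void *a, const void *b) {
    const demo_str *s1 = a;
    const demo_str *s2 = b;
    size_t n = s1->length < s2->length ? s1->length : s2->length;
    int r = n > 0 ? memcmp(s1->ptr, s2->ptr, n) : 0;
    if (r != 0) return r;
    // equal common prefix: the shorter string sorts first
    return (s1->length > s2->length) - (s1->length < s2->length);
}

// Sorts an array of strings in ascending order.
void demo_sort(demo_str *items, size_t count) {
    qsort(items, count, sizeof(demo_str), demo_strcmp_p);
}
```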

The functions `cx_strprefix()` and `cx_strsuffix()` check if `string` starts with `prefix` or ends with `suffix`, respectively.

The functions `cx_strcasecmp()`, `cx_strcasecmp_p()`, `cx_strcaseprefix()`, and `cx_strcasesuffix()` are equivalent
to their counterparts above, except that they compare the strings case-insensitively.

> In the current version of UCX, case-insensitive comparisons are only guaranteed to work with ASCII characters.
{style="note"}

## Concatenation

```C
#include <cx/string.h>

cxmutstr cx_strcat(size_t count, ... );

cxmutstr cx_strcat_a(const CxAllocator *alloc, size_t count, ... );

cxmutstr cx_strcat_m(cxmutstr str, size_t count, ... );

cxmutstr cx_strcat_ma(const CxAllocator *alloc,
        cxmutstr str, size_t count, ... );

size_t cx_strlen(size_t count, ...);
```

The `cx_strcat_a()` function takes `count` UCX strings,
allocates memory for a concatenation of those strings _with a single allocation_,
and copies the contents of the strings to the new memory.
`cx_strcat()` is equivalent, except that it uses the [default allocator](allocator.h.md#default-allocator).
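
The single-allocation strategy can be sketched as a two-pass loop over the variadic arguments: the first pass sums the lengths, the second copies the contents. The code below is a standalone illustration with hypothetical names (`demo_str`, `demo_mutstr`, `demo_strcat()`) that uses `malloc()` directly. It is not the UCX implementation, and the terminator it appends is an assumption of the sketch.

```C
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-ins that mirror cxstring and cxmutstr.
typedef struct { const char *ptr; size_t length; } demo_str;
typedef struct { char *ptr; size_t length; } demo_mutstr;

// Concatenates `count` demo_str arguments with a single malloc().
// Returns {NULL, 0} when the allocation fails.
demo_mutstr demo_strcat(size_t count, ...) {
    demo_mutstr result = { NULL, 0 };
    va_list ap;

    // first pass: sum up the lengths
    size_t total = 0;
    va_start(ap, count);
    for (size_t i = 0; i < count; i++) total += va_arg(ap, demo_str).length;
    va_end(ap);

    result.ptr = malloc(total + 1);
    if (result.ptr == NULL) return result;

    // second pass: copy the contents
    va_start(ap, count);
    for (size_t i = 0; i < count; i++) {
        demo_str s = va_arg(ap, demo_str);
        if (s.length > 0) memcpy(result.ptr + result.length, s.ptr, s.length);
        result.length += s.length;
    }
    va_end(ap);
    result.ptr[result.length] = '\0';
    return result;
}
```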

The functions `cx_strcat_ma()` and `cx_strcat_m()` append the `count` strings to the specified string `str` and,
instead of allocating new memory, reallocate the existing memory in `str`.
If the pointer in `str` is `NULL`, they behave exactly like `cx_strcat_a()`.
Note that `count` always denotes the number of variadic arguments in _both_ variants.

The function `cx_strlen()` sums the length of the specified strings.

> There is no reason to use `cx_strlen()` for a single UCX string.
> You can access the `length` field of the structure directly. 

> You can mix `cxstring` and `cxmutstr` in the variadic arguments without the need of `cx_strcast()`.

## Find Characters and Substrings

```C
#include <cx/string.h>

cxstring cx_strchr(cxstring string, int chr);

cxstring cx_strrchr(cxstring string, int chr);

cxstring cx_strstr(cxstring string, cxstring search);

cxstring cx_strsubs(cxstring string, size_t start);

cxstring cx_strsubsl(cxstring string, size_t start, size_t length);

cxstring cx_strtrim(cxstring string);

cxmutstr cx_strchr_m(cxmutstr string, int chr);

cxmutstr cx_strrchr_m(cxmutstr string, int chr);

cxmutstr cx_strstr_m(cxmutstr string, cxstring search);

cxmutstr cx_strsubs_m(cxmutstr string, size_t start);

cxmutstr cx_strsubsl_m(cxmutstr string, size_t start, size_t length);

cxmutstr cx_strtrim_m(cxmutstr string);
```

The functions `cx_strchr()`, `cx_strrchr()`, and `cx_strstr()` behave like their stdlib counterparts.

The function `cx_strsubs()` returns the substring starting at the specified `start` index,
and `cx_strsubsl()` returns a substring with at most `length` bytes.

The function `cx_strtrim()` returns the substring that results when removing all leading and trailing
whitespace characters.

All functions with the `_m` suffix behave exactly the same as their counterparts without the `_m` suffix,
except that they operate on a `cxmutstr`.
In _both_ variants the functions return a view into the given `string`
and thus the returned strings must never be passed to `cx_strfree()`.
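
Why the results must not be freed becomes clearer when looking at what a view is: a pointer into the original character array plus a shorter length. The following standalone sketch uses hypothetical names (`demo_str`, `demo_subsl()`) and clamps out-of-range arguments, which is an assumption of the sketch and not a statement about the behavior of `cx_strsubsl()`.

```C
#include <stddef.h>

// Hypothetical stand-in that mirrors the layout of cxstring.
typedef struct { const char *ptr; size_t length; } demo_str;

// Returns a substring view: nothing is copied or allocated.
// Out-of-range arguments are clamped (an assumption of this sketch).
demo_str demo_subsl(demo_str s, size_t start, size_t length) {
    if (start > s.length) start = s.length;
    if (length > s.length - start) length = s.length - start;
    demo_str view = { s.ptr + start, length };
    return view;
}
```

Since the view shares memory with the original string, it is in general not zero-terminated and becomes invalid as soon as the original string is freed.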

## Replace Substrings

```C
#include <cx/string.h>

cxmutstr cx_strreplace(cxstring str,
        cxstring search, cxstring replacement);

cxmutstr cx_strreplace_a(const CxAllocator *allocator, cxstring str,
        cxstring search, cxstring replacement);

cxmutstr cx_strreplacen(cxstring str,
        cxstring search, cxstring replacement, size_t replmax);

cxmutstr cx_strreplacen_a(const CxAllocator *allocator, cxstring str,
        cxstring search, cxstring replacement, size_t replmax);
```

The function `cx_strreplace()` allocates a new string which will contain a copy of `str`
where every occurrence of `search` is replaced with `replacement`.
The new string is guaranteed to be zero-terminated even if `str` is not.

The function `cx_strreplace_a()` uses the specified `allocator` to allocate the new string.

The functions `cx_strreplacen()` and `cx_strreplacen_a()` are equivalent, except that they stop
after `replmax` number of replacements.

## Basic Splitting

```C
#include <cx/string.h>

size_t cx_strsplit(cxstring string, cxstring delim,
        size_t limit, cxstring *output);

size_t cx_strsplit_a(const CxAllocator *allocator,
        cxstring string, cxstring delim,
        size_t limit, cxstring **output);

size_t cx_strsplit_m(cxmutstr string, cxstring delim,
        size_t limit, cxmutstr *output);

size_t cx_strsplit_ma(const CxAllocator *allocator,
        cxmutstr string, cxstring delim,
        size_t limit, cxmutstr **output);
```

The `cx_strsplit()` function splits the input `string` using the specified delimiter `delim`
and writes the substrings into the pre-allocated `output` array.
The maximum number of resulting strings can be specified with `limit`.
That means, at most `limit-1` splits are performed.
The function returns the actual number of items written to `output`.

On the other hand, `cx_strsplit_a()` uses the specified `allocator` to allocate the output array,
and writes the pointer to the allocated memory to `output`.

The functions `cx_strsplit_m()` and `cx_strsplit_ma()` are equivalent to `cx_strsplit()` and `cx_strsplit_a()`,
except that they work on `cxmutstr` instead of `cxstring`.

> The `allocator` in `cx_strsplit_a()` and `cx_strsplit_ma()` is _only_ used to allocate the output array.
> The strings will always point into the original `string`
> and you need to use `cx_strdup()` or `cx_strdup_a()` if you want copies or zero-terminated strings after performing the split.  
{style="note"}
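
The effect of `limit` and the fact that the results point into the original string can be sketched as follows. This is a standalone illustration with hypothetical names (`demo_str`, `demo_split()`), not the UCX implementation. It assumes that the unsplit remainder ends up in the last item and that an empty delimiter yields no items.

```C
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in that mirrors the layout of cxstring.
typedef struct { const char *ptr; size_t length; } demo_str;

// Splits `s` at `delim` into at most `limit` views written to `output`.
// The views point into `s`; the last one receives the unsplit remainder.
size_t demo_split(demo_str s, demo_str delim, size_t limit, demo_str *output) {
    if (limit == 0 || delim.length == 0) return 0;
    size_t n = 0;      // number of items written so far
    size_t begin = 0;  // start of the current item
    size_t pos = 0;    // current search position
    while (n + 1 < limit && pos + delim.length <= s.length) {
        if (memcmp(s.ptr + pos, delim.ptr, delim.length) == 0) {
            output[n].ptr = s.ptr + begin;
            output[n].length = pos - begin;
            n++;
            pos += delim.length;
            begin = pos;
        } else {
            pos++;
        }
    }
    // the remainder (possibly empty) becomes the final item
    output[n].ptr = s.ptr + begin;
    output[n].length = s.length - begin;
    return n + 1;
}
```

With the input `a,b,c,d`, the delimiter `,`, and a limit of 3, this sketch performs two splits and yields the items `a`, `b`, and `c,d`.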

## Complex Tokenization

```C
#include <cx/string.h>

CxStrtokCtx cx_strtok(AnyStr str, AnyStr delim, size_t limit);

void cx_strtok_delim(CxStrtokCtx *ctx,
        const cxstring *delim, size_t count);

bool cx_strtok_next(CxStrtokCtx *ctx, cxstring *token);

bool cx_strtok_next_m(CxStrtokCtx *ctx, cxmutstr *token);
```

You can tokenize a string by creating a _tokenization_ context with `cx_strtok()`,
and calling `cx_strtok_next()` or `cx_strtok_next_m()` as long as they return `true`.

The tokenization context is initialized with the string `str` to tokenize,
one delimiter `delim`, and a `limit` for the maximum number of tokens.
When `limit` is reached, the remaining part of `str` is returned as one single token.

You can add additional delimiters to the context by calling `cx_strtok_delim()`, and
specifying an array of delimiters to use.

> Regardless of how the context was initialized, you can use either `cx_strtok_next()`
> or `cx_strtok_next_m()` to retrieve the tokens. However, keep in mind that modifying
> characters in a token returned by `cx_strtok_next_m()` has defined behavior only when the
> underlying `str` is a `cxmutstr`.

### Example

```C
#include <cx/string.h>

cxstring str = cx_str("an,arbitrarily;||separated;string");

// create the context
CxStrtokCtx ctx = cx_strtok(str, CX_STR(","), 10);

// add two more delimiters
cxstring delim_more[2] = {CX_STR("||"), CX_STR(";")};
cx_strtok_delim(&ctx, delim_more, 2);

// iterate over the tokens
cxstring tok;
while(cx_strtok_next(&ctx, &tok)) {
    // do something with the tokens
    // be aware that tok is NOT zero-terminated!
}
```

## Conversion to Numbers

For each integer type, as well as `float` and `double`, there are functions to convert a UCX string to a value of those types.

Integer conversion comes in two flavors:
```C
int cx_strtoi(AnyStr str, int *output, int base);

int cx_strtoi_lc(AnyStr str, int *output, int base,
        const char *groupsep);
```

The basic variant takes a string of any UCX string type, a pointer to the `output` integer, and the `base` (one of 2, 8, 10, or 16).
Conversion is attempted with respect to the specified `base` and respects possible special notations for that base.
Hexadecimal numbers may be prefixed with `0x`, `x`, or `#`, and binary numbers may be prefixed with `0b` or `b`.

The `_lc` versions of the integer conversion functions are equivalent, except that they allow the specification of an
array of group separator chars, each of which is simply ignored during conversion.
The default group separator for the basic version is a comma `,`.
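
How group separators are ignored during conversion can be sketched with a simplified, standalone parser. The names `demo_str` and `demo_strtoi()` are hypothetical. The sketch handles only non-negative numbers without prefixes, and its return convention (zero on success, non-zero on failure) is an assumption that is not taken from the text above.

```C
#include <limits.h>
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in that mirrors the layout of cxstring.
typedef struct { const char *ptr; size_t length; } demo_str;

// Parses a non-negative number in `base` (2..16) and skips every
// character listed in `groupsep`. Prefixes and signs are not handled.
// Returns zero on success and non-zero on failure.
int demo_strtoi(demo_str s, int *output, int base, const char *groupsep) {
    static const char digits[] = "0123456789abcdef";
    long long value = 0;
    size_t ndigits = 0;
    for (size_t i = 0; i < s.length; i++) {
        char c = s.ptr[i];
        if (c != '\0' && strchr(groupsep, c) != NULL) continue;
        if (c >= 'A' && c <= 'F') c = (char) (c - 'A' + 'a');
        const char *d = c != '\0' ? strchr(digits, c) : NULL;
        if (d == NULL || d - digits >= base) return -1;  // invalid digit
        value = value * base + (d - digits);
        if (value > INT_MAX) return -1;                  // out of range
        ndigits++;
    }
    if (ndigits == 0) return -1;                         // nothing to convert
    *output = (int) value;
    return 0;
}
```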

The signature for the floating-point conversions is quite similar:
```C
int cx_strtof(AnyStr str, float *output);

int cx_strtof_lc(AnyStr str, float *output,
        char decsep, const char *groupsep);
```

The two differences are that the floating-point versions do not support different bases,
and the `_lc` variant allows specifying not only an array of group separators,
but also the character used for the decimal separator.

In the basic variant, the group separator is again a comma `,`, and the decimal separator is a dot `.`.

> The floating-point conversions of UCX 3.1 do not achieve the same precision as standard library implementations
> which usually use more sophisticated algorithms.
> The precision might increase in future UCX releases,
> but until then be aware of slight inaccuracies, in particular when working with `double`.
{style="warning"}

> The UCX string to number conversions are intentionally not considering any locale settings
> and are therefore independent of any global state.
{style="note"}

<seealso>
<category ref="apidoc">
<a href="https://ucx.sourceforge.io/api/string_8h.html">string.h</a>
</category>
</seealso>
