On the expressiveness of word equations

We say that a language $L \subseteq \Sigma^*$ is expressible (by word equations) if there exists a system of word equations and a variable $x$ such that $L$ is the projection of the solution set onto $x$ . We naturally extend this notion to relations. For example, the equation $x = aybzb$ expresses the regular language $a\Sigma^*b\Sigma^*b$ since it equals

\{h(x) : h \text{ is a solution to } x = aybzb\}.

It is readily seen that word equations can express non-regular and non-context-free languages, e.g. $\{ww : w \in \Sigma^*\}$ with (the quadratic equation) $x = yy$ . Let us explore the expressiveness of word equations.
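
To experiment with the definition, here is a minimal brute-force sketch that computes the projection of a solution set onto one variable, restricted to solutions whose images have bounded length. The names `words`, `substitute` and `projection` are hypothetical helpers introduced for illustration, and the search is exponential, so it is only meant for tiny instances.

```python
from itertools import product


def words(alphabet, max_len):
    """Yield every word over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def substitute(side, h):
    """Apply the assignment `h` to one side of an equation.

    Symbols that are not keys of `h` are treated as constant letters.
    """
    return "".join(h.get(symbol, symbol) for symbol in side)


def projection(lhs, rhs, variables, target, alphabet, max_len):
    """Values of `target` among solutions whose images all have length <= max_len."""
    candidates = list(words(alphabet, max_len))
    found = set()
    for images in product(candidates, repeat=len(variables)):
        h = dict(zip(variables, images))
        if substitute(lhs, h) == substitute(rhs, h):
            found.add(h[target])
    return found


# The equation x = aybzb over {a, b}, with images of length at most 4.
print(sorted(projection("x", "aybzb", "xyz", "x", "ab", 4)))
```

With this bound, the search finds exactly the words of $a\Sigma^*b\Sigma^*b$ of length at most 4, namely $abb$, $aabb$, $abab$ and $abbb$.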

Powers of a word

Let us show that $w^*$ is expressible by word equations for any word $w$ . The key ingredient is the following characterization of commuting words.

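Lemma 1. Let $u, v \in \Sigma^*$. We have $uv = vu$ if and only if there exist a word $w \in \Sigma^*$ and $i, j \in \N$ such that $u = w^i$ and $v = w^j$.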

Proof.

$\Leftarrow$ ) Let $i, j \in \N$ be such that $u = w^i$ and $v = w^j$ . We have $uv = w^i w^j = w^{i+j} = w^j w^i = vu$ .

$\Rightarrow$ ) If $u$ or $v$ is empty, then the claim is trivial. If $|u| = |v|$ , then we must have $u = v$ , and hence we are done. Assume $|u| > |v| > 0$ (the other case is symmetric). Since $uv = vu$ , by our assumption, there exists a non-empty word $x$ such that $u = vx$ . Thus, we have

vxv = uv = vu = vvx.

By cancelling $v$ on the left, we obtain $xv = vx$. Note that $|x| < |u|$, and hence $|x| + |v| < |u| + |v|$. Thus, by induction on the total length of the two words, $x = w^i$ and $v = w^j$ for some word $w$ and some $i, j \in \N$. We are done since $u = vx = w^j w^i = w^{i + j}$ and $v = w^j$.
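
The inductive argument is effectively a Euclid-style procedure on words, which can be sketched as follows. This is only an illustration under the assumption that the inputs commute; the function names `common_root`, `is_power` and `check_left_to_right` are hypothetical.

```python
from itertools import product


def common_root(u, v):
    """Given uv == vu, return a word w such that u and v are both powers of w."""
    if not u or not v:
        return u or v              # one word is empty: take the other one as w
    if len(u) == len(v):
        return u                   # equal lengths force u == v
    if len(u) < len(v):
        u, v = v, u                # symmetric case
    x = u[len(v):]                 # u = vx, and cancelling v gives xv == vx
    return common_root(x, v)


def is_power(s, w):
    """True if s == w^i for some i >= 0."""
    if not w:
        return not s
    return len(s) % len(w) == 0 and w * (len(s) // len(w)) == s


def check_left_to_right(alphabet, max_len):
    """Check on all commuting pairs up to `max_len` that a common root exists."""
    ws = ["".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n)]
    for u, v in product(ws, repeat=2):
        if u + v == v + u:
            w = common_root(u, v)
            if not (is_power(u, w) and is_power(v, w)):
                return False
    return True
```

Each recursive call strictly decreases $|u| + |v|$, which mirrors the induction used in the proof.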

A non-empty word $w$ is said to be primitive if it cannot be written as $w = u^k$ for some $u \in \Sigma^+$ and $k \geq 2$ .

Proposition 2. Let $w \in \Sigma^+$ be primitive. The language $w^*$ is expressed by the equation $xw = wx$.

Proof.

Let $L$ be the language expressed by the equation. We have $w^* \subseteq L$ since $[x \mapsto w^i]$ is a solution for any $i \in \N$. It remains to show that $L \subseteq w^*$. Let $x \in L$. Since $xw = wx$, Lemma 1 yields $u \in \Sigma^+$ and $i, j \in \N$ such that $x = u^i$ and $w = u^j$. We have $j = 1$ since $w$ is primitive. Thus, $x = w^i \in w^*$.

Proposition 3. Let $w \in \Sigma^*$. The language $w^*$ is expressible by word equations.

Proof.

If $w = \varepsilon$, then we can use the equation $xa = a$. Otherwise, there exist $u \in \Sigma^+$ and $k \geq 1$ such that $w = u^k$ and $u$ is primitive. By Proposition 2, we can express $w^*$ with

x = y^k \land yu = uy.
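
As a sanity check, the construction can be replayed on small instances. The sketch below is illustrative only: `root_and_exponent` and `expressed_language` are hypothetical names, and the search over $y$ is bounded by an arbitrary length.

```python
from itertools import product


def root_and_exponent(w):
    """Return (u, k) with w == u * k and u primitive, for a non-empty word w."""
    n = len(w)
    for d in range(1, n + 1):
        # The shortest prefix whose repetition gives w is necessarily primitive.
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d], n // d


def expressed_language(w, alphabet, max_len):
    """Values of x in solutions of x = y^k and yu = uy, with |y| <= max_len."""
    u, k = root_and_exponent(w)
    ys = ("".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n))
    return {y * k for y in ys if y + u == u + y}
```

For $w = abab$, we get $u = ab$ and $k = 2$. With $|y| \leq 4$, the only words commuting with $u$ are $\varepsilon$, $ab$ and $abab$, so the values of $x$ are $\varepsilon$, $w$ and $w^2$.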

Boolean operations

Recall that we introduced “expressibility” in terms of systems of word equations. This is mostly for convenience. Indeed, any system of equations over $\Sigma = \{a, b, \ldots\}$ can be rewritten as a single equation, using this trick:

(u = v) \land (u' = v') \iff uau'ubu' = vav'vbv'.

Proof.

The left-to-right implication is trivial. Assume that $uau'ubu' = vav'vbv'$ . We have $2|u| + 2|u'| + 2 = 2|v| + 2|v'| + 2$ and hence $|u| + |u'| = |v| + |v'|$ . For the sake of contradiction, suppose that $u \neq v$ (the other case is symmetric). We must have $|u| \neq |v|$ . Without loss of generality, we may assume that $|u| > |v|$ . Since $uau'ubu' = vav'vbv'$ , there exists $w \in \Sigma^*$ such that $u = vaw$ . By $|u| + |u'| = |v| + |v'|$ , we have $uau' = vav'$ and $ubu' = vbv'$ . In particular, this means that $vawbu' = vbv'$ , and hence $awbu' = bv'$ , which is impossible.
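
The equivalence can also be checked exhaustively on short words. The following sketch treats $u, v, u', v'$ as concrete words, that is, as the two sides of each equation after applying some assignment; `merged_sides` and `trick_holds` are hypothetical names, with `u2` and `v2` standing for $u'$ and $v'$.

```python
from itertools import product


def merged_sides(u, v, u2, v2):
    """Both sides of the single equation u a u' u b u' = v a v' v b v'."""
    return u + "a" + u2 + u + "b" + u2, v + "a" + v2 + v + "b" + v2


def trick_holds(alphabet, max_len):
    """Compare the conjunction with the merged equation on all words up to `max_len`."""
    ws = ["".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n)]
    for u, v, u2, v2 in product(ws, repeat=4):
        lhs, rhs = merged_sides(u, v, u2, v2)
        if (u == v and u2 == v2) != (lhs == rhs):
            return False
    return True
```

Such a bounded check is of course no substitute for the proof; it merely guards against a misreading of the formula.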

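In fact, disjunction and disequality can also be achieved with a single equation¹:

Let $u = v$ and $u' = v'$ be word equations. The disjunction $(u = v) \lor (u' = v')$ and the disequality $u \neq v$ can each be expressed by a single word equation that uses extra variables.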

As a corollary, any Boolean combination of word equations can be written as a single word equation, at the cost of introducing extra variables. Note that for conjunction, no extra variable is needed.

Impossibility to express extended constraints

In the literature, word equations have been extended with constraints such as membership in regular languages; membership in context-free languages; linear constraints on the length of words or on letter counts. A natural question is whether these constraints can be expressed directly by word equations. As we will see, the answer is no.

Regular languages

A code is a non-empty subset $X \subseteq \Sigma^*$ such that every word of $X^*$ has a unique decomposition as a concatenation of words from $X$. A code is bifix if for every two distinct words $u, v \in X$, it is the case that $u$ is neither a prefix nor a suffix of $v$. Among other things, the following was shown by Day et al.²:

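Let $X \subseteq \Sigma^*$ be a bifix code. The language $X^*$ is expressible if and only if $X = \Sigma$ or $|X| \leq 1$.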

Note that the right-to-left direction is simple. Indeed, the case of $X = \Sigma$ is trivial, and we proved the case of $|X| \leq 1$ in Proposition 3.

The left-to-right direction shows that not all regular languages are expressible. Indeed, as $X = \Sigma^k$ is a bifix code, the regular language $X^* = \{w \in \Sigma^* : |w| \equiv 0~(\mathrm{mod}~k)\}$ is unexpressible for $|\Sigma| > 1$ and $k \geq 2$. It is however expressible over $\Sigma = \{a\}$ with $x = y^k \land y \in a^*$. In fact, any regular unary language is expressible, provided one is allowed to reason with an extra letter.

Proof.

Given two non-empty expressible unary languages $A$ and $B$ , we can express $x \in A \cup B$ as follows:

x \in a^* \land y \in A \land z \in B \land u {\bullet} x {\bullet} v = {\bullet} y {\bullet} z {\bullet}.

Indeed, since $x \in a^*$ , the last constraint yields $[u \mapsto \varepsilon, x \mapsto y, v \mapsto z {\bullet}]$ or $[u \mapsto {\bullet} y, x \mapsto z, v \mapsto \varepsilon]$ .
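
To see the two cases appear concretely, here is a small brute-force sketch of this union construction. It assumes that $A$ and $B$ are given as finite sets of words of length at most `max_len`; the name `union_via_equation` and the bound are hypothetical choices made for illustration.

```python
from itertools import product

BULLET = "•"


def union_via_equation(A, B, max_len):
    """Values of x in solutions of u•x•v = •y•z• with x in a^*, y in A and z in B.

    Assumes every word of A and B is of the form a^n with n <= max_len.
    """
    # u and v range over words over {a, •} of length at most max_len + 1.
    uv_words = ["".join(p) for n in range(max_len + 2)
                for p in product("a" + BULLET, repeat=n)]
    found = set()
    for y, z in product(A, B):
        rhs = BULLET + y + BULLET + z + BULLET
        for n in range(max_len + 1):
            x = "a" * n
            for u, v in product(uv_words, repeat=2):
                if u + BULLET + x + BULLET + v == rhs:
                    found.add(x)
    return found
```

For instance, with $A = \{a, aaa\}$ and $B = \{aa\}$, the values found for $x$ are exactly $a$, $aa$ and $aaa$.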

Let $L \subseteq \{a\}^*$ be regular. It is recognized by a deterministic finite automaton, whose shape is necessarily a straight line of $k \geq 0$ transitions, followed by a cycle of $p \geq 1$ transitions. In other words, $\{|w| : w \in L\}$ is ultimately periodic. So, there exist $I \subseteq \{0, 1, \ldots, k - 1\}$ and $J \subseteq \{0, 1, \ldots, p-1\}$ with

L = \{a^i : i \in I\} \cup \bigcup_{j \in J} a^{k + j} (a^p)^*.

We can trivially express $\{a^i\}$ with $x = a^i$ . Moreover, we can express $a^{k + j} (a^p)^*$ with $x = a^{k + j} y^p \land y \in a^*$ . Thus, we are done by the closure under union.
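
The decomposition used in this proof can be computed directly from a unary automaton. The sketch below assumes a hypothetical encoding of the DFA as a successor map `next_state` (one outgoing $a$-transition per state), and checks membership through $k$, $p$, $I$ and $J$.

```python
def unary_decomposition(next_state, start, accepting):
    """Return (k, p, I, J) for a DFA over {a} given by its successor map."""
    first_visit = {}
    state, step = start, 0
    while state not in first_visit:
        first_visit[state] = step
        state = next_state[state]
        step += 1
    k = first_visit[state]           # transitions before entering the cycle
    p = step - k                     # transitions on the cycle
    order = sorted(first_visit, key=first_visit.get)
    I = {i for i in range(k) if order[i] in accepting}
    J = {j for j in range(p) if order[k + j] in accepting}
    return k, p, I, J


def member(n, k, p, I, J):
    """True if a^n is some a^i with i in I, or lies in a^(k+j) (a^p)^* for some j in J."""
    return n in I if n < k else (n - k) % p in J


def accepts(n, next_state, start, accepting):
    """Run the automaton on a^n."""
    state = start
    for _ in range(n):
        state = next_state[state]
    return state in accepting
```

For the automaton $0 \to 1 \to 2 \to 3 \to 2$ with accepting states $\{1, 2\}$, this yields $k = 2$, $p = 2$, $I = \{1\}$ and $J = \{0\}$, that is, $L = \{a\} \cup a^2 (a^2)^*$.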

Context-free languages

Using pumping-like arguments, it has been shown that the following languages cannot be expressed by word equations³:

$A = \{w \in \{a, b, c\}^* : |w|_c = 0\}$ ,
$B = \{a^n b^n : n \in \N\}$ ,
$C = \{w : w \text{ is primitive}\}$ .

Language $A$ provides another example of a regular language that cannot be expressed. Moreover, $B$ provides an example of a context-free language unexpressible by word equations. To the best of my knowledge, it is still unknown whether $C$ is context-free or not. Yet, $C$ is not unambiguous context-free⁴.

Length and count constraints

The relation $\{(u, v) \in \Sigma^* \times \Sigma^* : |u| = |v|\}$ is unexpressible when $|\Sigma| \geq 2$ , as otherwise $B$ could be expressed as follows, with two extra variables:

(x = yz) \land (y \in a^*) \land (z \in b^*) \land (|y| = |z|).

Furthermore, the relation $\{(u, v) \in \Sigma^* \times \Sigma^* : |u|_\sigma = |v|_\sigma\}$ is unexpressible when $|\Sigma| \geq 2$ , as otherwise $B$ could be expressed as follows, with three extra variables:

(x = yz) \land (y \in a^*) \land (z \in b^*) \land (|y|_a = |p|_a) \land (|z|_b = |p|_b) \land (p \in (ab)^*).

EDT0L languages

L-systems are parallel rewriting systems that were developed by A. Lindenmayer to formalize biological processes such as the growth of plants. As we shall see, L-systems relate to word equations.

Extended deterministic table zero-context L-systems

An EDT0L-system is a tuple $\mathcal{S} = (\Gamma, \Sigma, T, u_0)$ where

$\Gamma$ is a finite alphabet,
$\Sigma \subseteq \Gamma$ is an alphabet whose letters are said to be terminal,
$T$ is a finite set of homomorphisms $t \colon \Gamma^* \to \Gamma^*$, each determined by the images $t(a) \in \Gamma^*$ of the letters $a \in \Gamma$, and
$u_0 \in \Gamma^*$ is a word called the axiom.

In a nutshell, $\mathcal{S}$ starts from the axiom and rewrites all of its letters simultaneously using some “table” $t \in T$ , and repeats this process using possibly different tables, until reaching a word from $\Sigma^*$ .

Formally, given $u, v \in \Gamma^*$ and $t \in T$, we write $u \to^t v$ if $v = t(u)$. We further write $u \to^{t_1 \cdots t_n} v$ if $u = w_0 \to^{t_1} w_1 \cdots \to^{t_n} w_n = v$ for some $w_0, \ldots, w_n \in \Gamma^*$. For example, if $t(a) = bb$ and $t(b) = a$, then

abb \to^t bbaa \to^t aabbbb.
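
A table can be sketched as a dictionary from letters to words, applied to every letter simultaneously. The helpers `apply_table` and `derive` are hypothetical names; as a convention assumed here, letters missing from the dictionary are left unchanged.

```python
def apply_table(table, word):
    """Rewrite all letters of `word` in parallel (missing letters map to themselves)."""
    return "".join(table.get(letter, letter) for letter in word)


def derive(tables, word):
    """Apply the given sequence of tables, from first to last."""
    for table in tables:
        word = apply_table(table, word)
    return word


t = {"a": "bb", "b": "a"}
print(apply_table(t, "abb"))   # bbaa
print(derive([t, t], "abb"))   # aabbbb
```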

The language described by $\mathcal{S}$ is $L(\mathcal{S}) = \{v \in \Sigma^* : u_0 \to^r v \text{ for some } r \in T^*\}$ . Languages described by EDT0L-systems are closed under union, concatenation, Kleene star, homomorphisms and intersection with regular languages. EDT0L languages form a strict subclass of indexed languages, which in turn are strictly included in context-sensitive languages.

Note that the expressiveness of EDT0L-systems remains the same if we allow a regular language $R \subseteq T^*$ to control the derivations. In that setting, the language of $\mathcal{S}$ is defined by

L(\mathcal{S}) = \{v \in \Sigma^* : u_0 \to^r v \text{ for some } r \in R\}.

A simple example

Consider the language $L = \{uu : u \in \Sigma^*\}$ which is expressed by the word equation $x = yy$ . We define an EDT0L-system that describes $L$ . Let $T = \{s, s'\} \cup \{t_a : a \in \Sigma\}$ be the homomorphisms defined by

\begin{aligned} s(x) &= yy &\text{and} && s(\sigma) &= \sigma \text{ for } \sigma \neq x, \\ t_a(y) &= ay &\text{and} && t_a(\sigma) &= \sigma \text{ for } \sigma \neq y, \\ s'(y) &= \varepsilon &\text{and} && s'(\sigma) &= \sigma \text{ for } \sigma \neq y. \end{aligned}

For example, we have

x \to^s yy \to^{t_a} ayay \to^{t_a} aayaay \to^{t_b} aabyaaby \to^{s'} aabaab.

The language $L$ is described by the EDT0L-system $(\Gamma, \Sigma, T, u_0)$ where $\Gamma = \Sigma \cup \{x, y\}$ and $u_0 = x$ .
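
To make this example concrete, the following sketch enumerates the terminal words reachable within a bounded number of rewriting steps, for $\Sigma = \{a, b\}$. It is a naive exploration rather than an efficient algorithm; `edt0l_language` and the dictionary encoding of tables are assumptions made for illustration.

```python
def apply_table(table, word):
    """Rewrite all letters of `word` in parallel (missing letters map to themselves)."""
    return "".join(table.get(letter, letter) for letter in word)


def edt0l_language(tables, axiom, terminals, max_steps):
    """Terminal words derivable from `axiom` with at most `max_steps` table applications."""
    reached = {axiom}
    frontier = {axiom}
    for _ in range(max_steps):
        frontier = {apply_table(t, w) for w in frontier for t in tables} - reached
        reached |= frontier
    return {w for w in reached if set(w) <= set(terminals)}


# The tables s, t_a, t_b and s' of the example, over Sigma = {a, b}.
s = {"x": "yy"}
t_a = {"y": "ay"}
t_b = {"y": "by"}
s_prime = {"y": ""}
tables = [s, t_a, t_b, s_prime]

print(sorted(edt0l_language(tables, "x", "ab", 4)))
```

With four steps, the system produces exactly the words $uu$ with $|u| \leq 2$: one step for $s$, up to two steps for the tables $t_a$ and $t_b$, and one step for $s'$.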

From word equations to EDT0L

The previous example generalizes naturally to any word equation of the form $x = \cdots$ . In fact, a translation into an EDT0L-system can be done for any system of word equations.

Given a system of word equations over variables $x_1, \ldots, x_n$ and a solution $h$ , we define $\mathrm{enc}(h) = h(x_1) \# \cdots \# h(x_n)$ where $\#$ is a fresh letter. The encoding of the relation expressed by the system is $\{\mathrm{enc}(h) : h \text{ is a solution}\}$ . The following is due to Ferté et al.⁵ for quadratic equations, and to Ciobanu et al.⁶ for the general case:

The encoding of a relation expressible by word equations can be described by an EDT0L-system.