We say that a language LΣL \subseteq \Sigma^* is expressible (by word equations) if there exists a system of word equations and a variable xx such that LL is the projection of the solution set onto xx. We naturally extend this notion to relations. For example, the equation x=aybzbx = aybzb expresses the regular language aΣbΣba\Sigma^*b\Sigma^*b since it equals

{h(x):h is a solution to x=aybzb}. \{h(x) : h \text{ is a solution to } x = aybzb\}.

It is readily seen that word equations can express non-regular and non-context-free languages, e.g. {ww:wΣ}\{ww : w \in \Sigma^*\} with (the quadratic equation) x=yyx = yy. Let us explore the expressiveness of word equations.

Powers of a word

Let us show that ww^* is expressible by word equations for any word ww. The key ingredient is the following characterization of commuting words.

Let u,vΣu, v \in \Sigma^*. We have uv=vuuv = vu iff u,vwu, v \in w^* for some wΣw \in \Sigma^*.
Proof.

\Leftarrow) Let i,jNi, j \in \N be such that u=wiu = w^i and v=wjv = w^j. We have uv=wiwj=wi+j=wjwi=vuuv = w^i w^j = w^{i+j} = w^j w^i = vu.

\Rightarrow) If uu or vv is empty, then the claim is trivial. If u=v|u| = |v|, then we must have u=vu = v, and hence we are done. Assume u>v>0|u| > |v| > 0 (the other case is symmetric). Since uv=vuuv = vu, by our assumption, there exists a non-empty word xx such that u=vxu = vx. Thus, we have

vxv=uv=vu=vvx. vxv = uv = vu = vvx.

By cancelling vv on the left, we obtain xv=vxxv = vx. Note that x<u|x| < |u|. Thus, by induction, x=wix = w^i and v=wjv = w^j for some word ww and some i,jNi, j \in \N. We are done since u=vx=wjwi=wi+ju = vx = w^j w^i = w^{i + j} and v=wjv = w^j.

A non-empty word ww is said to be primitive if it cannot be written as w=ukw = u^k for some uΣ+u \in \Sigma^+ and k2k \geq 2.

If wΣ+w \in \Sigma^+ is primitive, then the word equation xw=wxxw = wx expresses the language ww^*.
Proof.
Let LL be the language expressed by the equation. We have wLw^* \subseteq L since [xwi][x \mapsto w^i] is a solution for any iNi \in \N. It remains to show that LwL \subseteq w^*. Let xLx \in L. Since xw=wxxw = wx, Lemma 1 yields uΣ+u \in \Sigma^+ and i,jNi, j \in \N such that x=uix = u^i and w=ujw = u^j. We have j=1j = 1 since ww is primitive. Thus, x=wiwx = w^i \in w^*.
Let wΣw \in \Sigma^*. The language ww^* is expressible by word equations.
Proof.
If w=εw = \varepsilon, then we can use the equation xa=axa = a. Otherwise, there exists uΣ+u \in \Sigma^+ and k1k \geq 1 such that w=ukw = u^k and uu is primitive. By Proposition 2, we can express ww^* with x=ykyu=uyx = y^k \land yu = uy.

Boolean operations

Recall that we introduced “expressibility” in terms of systems of word equations. This is mostly for convenience. Indeed, any system of equations over Σ={a,b,}\Sigma = \{a, b, \ldots\} can be rewritten as a single equation, using this trick:

(u=v)(u=v)    uauubu=vavvbv. (u = v) \land (u' = v') \iff uau'ubu' = vav'vbv'.
Click for a proof.

The left-to-right implication is trivial. Assume that uauubu=vavvbvuau'ubu' = vav'vbv'. We have 2u+2u+2=2v+2v+22|u| + 2|u'| + 2 = 2|v| + 2|v'| + 2 and hence u+u=v+v|u| + |u'| = |v| + |v'|. For the sake of contradiction, suppose that uvu \neq v (the other case is symmetric). We must have uv|u| \neq |v|. Without loss of generality, we may assume that u>v|u| > |v|. Since uauubu=vavvbvuau'ubu' = vav'vbv', there exists wΣw \in \Sigma^* such that u=vawu = vaw. By u+u=v+v|u| + |u'| = |v| + |v'|, we have uau=vavuau' = vav' and ubu=vbvubu' = vbv'. In particular, this means that vawbu=vbvvawbu' = vbv', and hence awbu=bvawbu' = bv', which is impossible.

In fact, disjunction and disequality can also be achieved with a single equation1:

Let u=vu = v and u=vu' = v' be word equations over a non-unary alphabet. There is a word equation, with two extra variables, that is equivalent to (u=v)(u=v)(u = v) \lor (u' = v'). Furthermore, there is a word equation, using extra variables, that is equivalent to uvu \neq v.

As corollary, this means that any Boolean combination of word equations can be written as a single word equation, at the cost of introducing extra variables. Note that for conjunction, no extra variable is needed.

Impossibility to express extended constraints

In the literature, word equations have been extended with constraints such as membership in regular languages; membership in context-free languages; linear constraints on the length of words or on letter counts. A natural question is whether these constraints could be expresssed directly. As we will see, the answer is no.

Regular languages

A code is a non-empty subset XΣX \subseteq \Sigma^* such that every word of XX^* has a unique decomposition as the concatenation of words from XX. A code is bifix is for every u,vXu, v \in X it is the case that xx is neither a prefix nor a suffix of vv. Among other things, the following was shown by Day et al.2:

If XΣX \subseteq \Sigma^* is a bifix code, then XX^* is expressible by word equations iff X=ΣX = \Sigma or X1|X| \leq 1.

Note that the right-to-left direction is simple. Indeed, the case of X=ΣX = \Sigma is trivial, and we proved the case of X1|X| \leq 1 in Proposition 3.

The left-to-right direction allows to show that regular languages are not all expressible. Indeed, as X=ΣkX = \Sigma^k is a bifix code, the regular language X={wΣ:w0 (mod k)}X^* = \{w \in \Sigma^* : |w| \equiv 0~(\mathrm{mod}~k)\} is unexpressible for Σ>1|\Sigma| > 1. It is however expressible over Σ={a}\Sigma = \{a\} with x=ykyax = y^k \land y \in a^*. In fact, any regular unary language is expressible, provided one is allowed to reason with an extra letter.

Click for a proof.

Given two non-empty expressible unary languages AA and BB, we can express xABx \in A \cup B as follows:

xayAzBuxv=yz. x \in a^* \land y \in A \land z \in B \land u {\bullet} x {\bullet} v = {\bullet} y {\bullet} z {\bullet}.

Indeed, since xax \in a^*, the last constraint yields [uε,xy,vz][u \mapsto \varepsilon, x \mapsto y, v \mapsto z {\bullet}] or [uy,xz,vε][u \mapsto {\bullet} y, x \mapsto z, v \mapsto \varepsilon].

Let L{a}L \subseteq \{a\}^* be regular. It is recognized by a deterministic finite automaton, whose shape is necessarily a straight line of k0k \geq 0 transitions, followed by a cycle of p1p \geq 1 transitions. In other words, {w:wL}\{|w| : w \in L\} is ultimately periodic. So, there exist I{0,1,,k1}I \subseteq \{0, 1, \ldots, k - 1\} and J{0,1,,p1}J \subseteq \{0, 1, \ldots, p-1\} with

L={ai:iI}jJak+j(ap). L = \{a^i : i \in I\} \cup \bigcup_{j \in J} a^{k + j} (a^p)^*.

We can trivially express {ai}\{a^i\} with x=aix = a^i. Moreover, we can express ak+j(ap)a^{k + j} (a^p)^* with x=ak+jypyax = a^{k + j} y^p \land y \in a^*. Thus, we are done by the closure under union.

Context-free languages

Using pumping-like arguments, it has been shown that the following languages cannot be expressed by word equations3:

  • A={w{a,b,c}:wc=0}A = \{w \in \{a, b, c\}^* : |w|_c = 0\},

  • B={anbn:nN}B = \{a^n b^n : n \in \N\},

  • C={w:w is primitive}C = \{w : w \text{ is primitive}\}.

Language AA provides another example of a regular language that cannot be expressed. Moreover, BB provides an example of a context-free language unexpressible by word equations. To the best of my knowledge, it is still unknown whether CC is context-free or not. Yet, CC is not unambiguous context-free4.

Length and count constraints

The relation {(u,v)Σ×Σ:u=v}\{(u, v) \in \Sigma^* \times \Sigma^* : |u| = |v|\} is unexpressible when Σ2|\Sigma| \geq 2, as otherwise BB could be expressed as follows, with two extra variables:

(x=yz)(ya)(zb)(y=z). (x = yz) \land (y \in a^*) \land (z \in b^*) \land (|y| = |z|).

Furthermore, the relation {(u,v)Σ×Σ:uσ=vσ}\{(u, v) \in \Sigma^* \times \Sigma^* : |u|_\sigma = |v|_\sigma\} is unexpressible when Σ2|\Sigma| \geq 2, as otherwise BB could be expressed as follows, with three extra variables:

(x=yz)(ya)(zb)(ya=pa)(zb=pb)(p(ab)). (x = yz) \land (y \in a^*) \land (z \in b^*) \land (|y|_a = |p|_a) \land (|z|_b = |p|_b) \land (p \in (ab)^*).

EDT0L languages

L-systems are parallel rewriting systems that were developed by A. Lindenmayer to formalize biological processes such as the growth of plants. As we shall see, L-systems relate to word equations.

Extended deterministic table zero-context L-systems

An EDT0L-system is a tuple S=(Γ,Σ,T,u0)\mathcal{S} = (\Gamma, \Sigma, T, u_0) where

  • Γ\Gamma is a finite alphabet,
  • ΣΓ\Sigma \subseteq \Gamma is an alphabet whose letters are said to be terminal,
  • TT is a finite set of homomorphisms t ⁣:ΓΓt \colon \Gamma \to \Gamma^*, and
  • u0Γu_0 \in \Gamma^* is a word called the axiom.

In a nutshell, S\mathcal{S} starts from the axiom and rewrites all of its letters simultaneously using some “table” tTt \in T, and repeats this process using possibly different tables, until reaching a word from Σ\Sigma^*.

Formally, given u,vΓu, v \in \Gamma^* and tTt \in T, we write utvu \to^t v if v=t(u)v = t(u). We further write ut1tnvu \to^{t_1 \cdots t_n} v if u=u0t1u1tnun=vu = u_0 \to^{t_1} u_1 \cdots \to^{t_n} u_n = v for some u0,,unΓu_0, \ldots, u_n \in \Gamma^*. For example, if t(a)=bbt(a) = bb and t(b)=at(b) = a, then

abbtbbaataabbbb. abb \to^t bbaa \to^t aabbbb.

The language described by S\mathcal{S} is L(S)={vΣ:u0rv for some rT}L(\mathcal{S}) = \{v \in \Sigma^* : u_0 \to^r v \text{ for some } r \in T^*\}. Languages described by EDT0L-systems are closed under union, concatenation, Kleene star, homomorphisms and intersection with regular languages. EDT0L languages form a strict subclass of indexed languages, which in turn are strictly included in context-sensitive languages.

Note that the expressiveness of EDT0L-systems remains the same if we allow a regular language RTR \subseteq T^* to control the derivations. In that setting, the language of S\mathcal{S} is defined by

L(S)={vΣ:u0rv for some rR}. L(\mathcal{S}) = \{v \in \Sigma^* : u_0 \to^r v \text{ for some } r \in R\}.

A simple example

Consider the language L={uu:uΣ}L = \{uu : u \in \Sigma^*\} which is expressed by the word equation x=yyx = yy. We define an EDT0L-system that describes LL. Let T={s,s}{ta:aΣ}T = \{s, s'\} \cup \{t_a : a \in \Sigma\} be the homomorphisms defined by

s(x)=yyands(σ)=σ for σx,ta(y)=ayandta(σ)=σ for σy,s(y)=εands(σ)=σ for σy. \begin{aligned} s(x) &= yy &\text{and} && s(\sigma) &= \sigma \text{ for } \sigma \neq x, \\ t_a(y) &= ay &\text{and} && t_a(\sigma) &= \sigma \text{ for } \sigma \neq y, \\ s'(y) &= \varepsilon &\text{and} && s'(\sigma) &= \sigma \text{ for } \sigma \neq y. \end{aligned}

For example, we have

xsyytaayaytaaayaaytbaabyaabysaabaab. x \to^s yy \to^{t_a} ayay \to^{t_a} aayaay \to^{t_b} aabyaaby \to^{s'} aabaab.

The language LL is described by the EDT0L-system (Γ,Σ,T,u0)(\Gamma, \Sigma, T, u_0) where Γ=Σ{x,y}\Gamma = \Sigma \cup \{x, y\} and u0=xu_0 = x.

From word equations to EDT0L

The previous example generalizes naturally to any word equation of the form x=x = \cdots. In fact, a translation into an EDT0L-system can be done for any system of word equations.

Given a system of word equations over variables x1,,xnx_1, \ldots, x_n and a solution hh, we define enc(h)=h(x1)##h(xn)\mathrm{enc}(h) = h(x_1) \# \cdots \# h(x_n) where #\# is a fresh letter. The encoding of the relation expressed by the system is {enc(h):h is a solution}\{\mathrm{enc}(h) : h \text{ is a solution}\}. The following is due to Ferté et al.5 for quadratic equations, and to Ciobanu et al.6 for the general case:

The encoding of a relation expressible by word equations can be described by an EDT0L-system.