Strings
SMTLIB2 standard The theory of unicode strings
Introduction
This section describes Z3's handling of strings. Strings are represented as a sequence of Characters where the default character sort is Unicode. You can reason about regular expressions of sequences. Sequences over elements other than Unicode characters and regular expressions are described in separate sections. Z3 offers built-in support for using string constants and dedicated solvers for checking satisfiability over constraints using strings and sequences. Note that the (current) specialized solver is able to establish satisfiability for a non-trivial class of formulas, but is far from a decision procedure even for fragments of string equalities that can be decided: Z3 only solves string equalities using an incomplete heuristic solver and the full combination of lengths and sequences (and regular expressions) is not decidable anyway. In Z3, strings are a special case of sequences, and for the case of Unicode strings, and regular expressions over Unicode strings seeks to implement the SMTLIB2 standard.
You can configure z3 to use one of two backends for solving strings.
The default backend is called the seq
solver. It solves constraints over both sequences and strings.
The other backend is z3str3, which applies to strings and regular expressions over strings.
Extensions to sequences operations are not supported in z3str3.
String Constraints by Example
Strings a, b, c
can have a non-trivial overlap.
There is a solution to a
that is not a sequence of "a"'s.
Creating Strings
Built-in types and constants
The name String
is a built-in name for the String sort.
String literals can furthermore be entered directly as literals
delimited by quotes. The following example asks whether there are
strings a
and b
that concatenate to "abc"
followed by b
.
String literals
To represent non-ASCII characters the SMTLIB2 standard uses unicode escape sequences.
The escape sequences are of the form \u{d₀}
, \u{d₁d₀}
, \u{d₂d₁d₀}
, \u{d₃d₂d₁d₀}
, \u{d₄d₃d₂d₁d₀}
, \ud₃d₂d₁d₀
where d
is a hexadecimal digit. Other
characters are treated as part of the string. For example, a newline within a string
is treated as a new-line character.
The following example shows three ways to enter the newline character.
String Operations by example
(str.++ a b c)
- String concatenation
A string cannot overlap with two different characters.
(str.len s)
- the length of string s
There is a solution to a
of length at most 2.
(str.at s offset)
- character substring at offset
The substring is of length 1 if offset
is within the bounds of s
, otherwise the result is the empty string.
Note that str.at
does not result in a character but a string of length one.
You can use (seq.nth "abc" 1)
to access the character at offset 1.
The function seq.nth
is not part of the SMTLIB2 format for strings.
(str.indexof s sub [offset])
- first position of substring
The result is the first position of sub
in s
, -1 if there are no occurrence.
The offset argument is optional. No offset corresponds to the offset 0.
(str.substr s offset length)
- substring at a given offset
(str.contains a b)
- string containment
Contains is transitive.
But containment is not a linear order.
(str.prefixof a b)
(str.suffixof a b)
- prefix and suffix checks
Every string is equal to the prefix and suffix that add up to a its length.
(str.from_int i)
(str.to_int s)
- convert to and from non-negative integers
(str.< s t)
(str.<= s t)
- lexicographic string comparison
(str.is_digit s)
- test if string represents a digit
(str.to_code s)
(str.from_code i)
- character codes
(_ char n)
- string from a character code
Note that after (_ char 54)
is the same as (simplify (str.from_code 54))
.
Summary of Operations
Operation | Brief description |
---|---|
(str.++ a b c) | String concatenation of one or more strings |
(str.len s) | String length. Returns an integer |
(str.substr s offset length) | Retrieves substring of s at offset |
(str.indexof s sub) | Retrieves first position of sub in s , -1 if there are no occurrences |
(str.indexof s sub offset) | Retrieves first position of sub at or after offset in s , -1 if there are no occurrences |
(str.at s offset) | Substring of length 1 at offset in s |
(str.contains s sub) | Does s contain the substring sub ? |
(str.prefixof pre s) | Is pre a prefix of s ? |
(str.suffixof suf s) | Is suf a suffix of s ? |
(str.replace s src dst) | Replace the first occurrence of src by dst in s |
(str.to_int s) | Retrieve integer encoded by string s |
(str.from_int i) | Retrieve string encoding of integer i |
(str.< s1 s2) | Lexicographic string less than |
(str.<= s1 s2) | Lexicographic string less or equal to |
(_ char ch) | Unit string from unicode character code |
(str.is_digit s) | A predicate whether string is a one of the digits 0 to 9 |
(str.to_code s) | Convert string of length one to the character code (an integer). Produce -1 if the string is not of length 1 |
(str.from_code i) | Convert an integer in the range of valid Unicode to a string of length one |
(str.replace_all s src dst) | Currently not supported: replace all occurrences of s in src by dst |
Note that (str.indexof s offset)
is shorthand for (str.indexof s offset 0)
.
Also, note that (str.at s i)
is the empty string or sequence for indices that are either negative or beyond
(- (str.len s) 1)
. Furthermore (str.substr s offset length)
is empty
when the offset is outside the range of positions in s
or length
is negative or
offset+length
exceeds the length of s
.