Skip to main content

Strings

SMTLIB2 standard The theory of unicode strings

Introduction

This section describes Z3's handling of strings. Strings are represented as a sequence of Characters where the default character sort is Unicode. You can reason about regular expressions of sequences. Sequences over elements other than Unicode characters and regular expressions are described in separate sections. Z3 offers built-in support for using string constants and dedicated solvers for checking satisfiability over constraints using strings and sequences. Note that the (current) specialized solver is able to establish satisfiability for a non-trivial class of formulas, but is far from a decision procedure even for fragments of string equalities that can be decided: Z3 only solves string equalities using an incomplete heuristic solver and the full combination of lengths and sequences (and regular expressions) is not decidable anyway. In Z3, strings are a special case of sequences, and for the case of Unicode strings, and regular expressions over Unicode strings seeks to implement the SMTLIB2 standard.

You can configure z3 to use one of two backends for solving strings. The default backend is called the seq solver. It solves constraints over both sequences and strings. The other backend is z3str3, which applies to strings and regular expressions over strings. Extensions to sequences operations are not supported in z3str3.

String Constraints by Example

Strings a, b, c can have a non-trivial overlap.

There is a solution to a that is not a sequence of "a"'s.

Creating Strings

Built-in types and constants

The name String is a built-in name for the String sort. String literals can furthermore be entered directly as literals delimited by quotes. The following example asks whether there are strings a and b that concatenate to "abc" followed by b.

String literals

To represent non-ASCII characters the SMTLIB2 standard uses unicode escape sequences. The escape sequences are of the form \u{d₀}, \u{d₁d₀}, \u{d₂d₁d₀}, \u{d₃d₂d₁d₀}, \u{d₄d₃d₂d₁d₀}, \ud₃d₂d₁d₀ where d is a hexadecimal digit. Other characters are treated as part of the string. For example, a newline within a string is treated as a new-line character.

The following example shows three ways to enter the newline character.

String Operations by example

(str.++ a b c) - String concatenation

A string cannot overlap with two different characters.

(str.len s) - the length of string s

There is a solution to a of length at most 2.

(str.at s offset) - character substring at offset

The substring is of length 1 if offset is within the bounds of s, otherwise the result is the empty string.

Note that str.at does not result in a character but a string of length one. You can use (seq.nth "abc" 1) to access the character at offset 1. The function seq.nth is not part of the SMTLIB2 format for strings.

(str.indexof s sub [offset]) - first position of substring

The result is the first position of sub in s, -1 if there are no occurrence. The offset argument is optional. No offset corresponds to the offset 0.

(str.substr s offset length) - substring at a given offset

(str.contains a b) - string containment

Contains is transitive.

But containment is not a linear order.

(str.prefixof a b) (str.suffixof a b) - prefix and suffix checks

Every string is equal to the prefix and suffix that add up to a its length.

(str.from_int i) (str.to_int s) - convert to and from non-negative integers

(str.< s t) (str.<= s t) - lexicographic string comparison

(str.is_digit s) - test if string represents a digit

(str.to_code s) (str.from_code i) - character codes

(_ char n) - string from a character code

Note that after (_ char 54) is the same as (simplify (str.from_code 54)).

Summary of Operations

OperationBrief description
(str.++ a b c)String concatenation of one or more strings
(str.len s)String length. Returns an integer
(str.substr s offset length)Retrieves substring of s at offset
(str.indexof s sub)Retrieves first position of sub in s, -1 if there are no occurrences
(str.indexof s sub offset)Retrieves first position of sub at or after offset in s, -1 if there are no occurrences
(str.at s offset)Substring of length 1 at offset in s
(str.contains s sub)Does s contain the substring sub?
(str.prefixof pre s)Is pre a prefix of s?
(str.suffixof suf s)Is suf a suffix of s?
(str.replace s src dst)Replace the first occurrence of src by dst in s
(str.to_int s)Retrieve integer encoded by string s
(str.from_int i)Retrieve string encoding of integer i
(str.< s1 s2)Lexicographic string less than
(str.<= s1 s2)Lexicographic string less or equal to
(_ char ch)Unit string from unicode character code
(str.is_digit s)A predicate whether string is a one of the digits 0 to 9
(str.to_code s)Convert string of length one to the character code (an integer). Produce -1 if the string is not of length 1
(str.from_code i)Convert an integer in the range of valid Unicode to a string of length one
(str.replace_all s src dst)Currently not supported: replace all occurrences of s in src by dst

Note that (str.indexof s offset) is shorthand for (str.indexof s offset 0). Also, note that (str.at s i) is the empty string or sequence for indices that are either negative or beyond (- (str.len s) 1). Furthermore (str.substr s offset length) is empty when the offset is outside the range of positions in s or length is negative or offset+length exceeds the length of s.