Strings and Text

Introduction

Strings

A string is a sequence of characters such as "Hello, 🌍!" or "Simplify(👨‍🚀 + ⚡️) → 👨‍🎤".

In the Compute Engine, strings are composed of encoding-independent Unicode characters and provide access to those characters through a variety of Unicode representations.

Strings are not treated as collections. This is because the concept of a “character” is inherently ambiguous: a single user-perceived character (a grapheme cluster) may consist of multiple Unicode scalars (code points), and those scalars may in turn be represented differently in various encodings. To avoid confusion and ensure consistent behavior, strings must be explicitly converted to a sequence of grapheme clusters or Unicode scalars when individual elements need to be accessed.

Annotated Expressions

An annotated expression is an expression that carries additional visual or semantic metadata that is not material to its interpretation, such as text color, size or other typographic variations, a tooltip, or a hyperlink to a web page.

For example, an annotated expression can be used to highlight a specific part of a mathematical expression:

["Equal",
  "circumference",
  ["Multiply", 2, ["Annotated", "Pi", {"color": "blue"}], "r"]
]
// ➔ Pi (in blue)

which would correspond to the LaTeX expression:

\mathrm{circumference} = 2 \cdot \textcolor{blue}{\pi} \cdot r

Annotated expressions are similar to attributed strings in other systems.

Text Expressions

A ["Text"] expression is a sequence of strings, annotated expressions or other ["Text"] expressions. It is used to represent formatted text content in the Compute Engine, for example from a LaTeX expression like \text{Hello \mathbf{world}}.

What would happen if you used a string expression instead of a text expression?

The arguments of a ["String"] expression are converted to their string representation, then joined together with no spaces.

The arguments of a ["Text"] expression remain a sequence of elements. When serialized to LaTeX, the elements are serialized to appropriate LaTeX commands to preserve their formatting and structure.

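import { ComputeEngine } from "@cortex-js/compute-engine";
const ce = new ComputeEngine();

// A "String" expression flattens its arguments to plain strings
const stringExpr = ce.box([
  "String",
  "Hello",
  ["Annotated", "world", {"color": "blue"}]
]);
console.info(stringExpr.latex);
// ➔ "\text{Hello $\mathrm{Annotated}(\text{world}, {color: "blue"})$}"

// A "Text" expression keeps its arguments as structured elements
const textExpr = ce.box([
  "Text",
  "Hello",
  ["Annotated", "world", {"color": "blue"}]
]);
console.info(textExpr.latex);
// ➔ "\text{Hello \textcolor{blue}{world}}"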

Functions

String(any*) -> string

A string created by joining its arguments. The arguments are converted to their default string representation.

["String", "Hello", ", ", "🌍", "!"]
// ➔ "Hello, 🌍!"

["String", 42, " is the answer"]
// ➔ "42 is the answer"

StringFrom(any, format:string?) -> string

Convert the argument to a string, using the specified format.

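| format          | Description                                                |
| :-------------- | :--------------------------------------------------------- |
| utf-8           | The argument is a list of UTF-8 code units (bytes)         |
| utf-16          | The argument is a list of UTF-16 code units                |
| unicode-scalars | The argument is a list of Unicode scalars (same as UTF-32) |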

For example:

["StringFrom", [240, 159, 148, 159], "utf-8"]
// ➔ "🔟"

["StringFrom", [55357, 56607], "utf-16"]
// ➔ "🔟"

["StringFrom", [128287], "unicode-scalars"]
// ➔ "🔟"

["StringFrom", [127467, 127479], "unicode-scalars"]
// ➔ "🇫🇷"
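
The three formats can be mimicked outside the Compute Engine with standard JavaScript APIs. The sketch below is only an illustration of what each format means: `stringFrom` is a hypothetical helper, not the engine's implementation, and it assumes the input values are valid for the chosen format.

```javascript
// Hypothetical helper mirroring the three StringFrom formats.
// Not a Compute Engine API: it only uses standard JavaScript built-ins.
function stringFrom(values, format) {
  switch (format) {
    case "utf-8":
      // values are bytes (UTF-8 code units)
      return new TextDecoder("utf-8").decode(Uint8Array.from(values));
    case "utf-16":
      // values are 16-bit code units, possibly forming surrogate pairs
      return String.fromCharCode(...values);
    case "unicode-scalars":
      // values are Unicode scalars (code points)
      return String.fromCodePoint(...values);
    default:
      throw new Error(`Unknown format: ${format}`);
  }
}

const keycapTen = "\u{1F51F}"; // 🔟
console.log(stringFrom([240, 159, 148, 159], "utf-8") === keycapTen); // ➔ true
console.log(stringFrom([55357, 56607], "utf-16") === keycapTen); // ➔ true
console.log(stringFrom([128287], "unicode-scalars") === keycapTen); // ➔ true
```

The same character needs four values in UTF-8, two in UTF-16 and one as a Unicode scalar, which is why the format must be stated explicitly.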

Utf8(string) -> list<integer>

Return a list of UTF-8 code units (bytes) for the given string.

Note: The values returned are UTF-8 bytes, not Unicode scalar values.

["Utf8", "Hello"]
// ➔ [72, 101, 108, 108, 111]

["Utf8", "👩‍🎓"]
// ➔ [240, 159, 145, 169, 226, 128, 141, 240, 159, 142, 147]
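
For comparison, the same byte sequences can be obtained in plain JavaScript with the standard `TextEncoder`. This is a minimal sketch; the `utf8` helper name is hypothetical and is not part of the Compute Engine.

```javascript
// Sketch: list the UTF-8 code units (bytes) of a string.
// `utf8` is an illustrative helper built on the standard TextEncoder.
function utf8(s) {
  return Array.from(new TextEncoder().encode(s));
}

const graduate = "\u{1F469}\u200D\u{1F393}"; // woman + zero-width joiner + graduation cap

console.log(utf8("Hello").join(" ")); // ➔ 72 101 108 108 111
console.log(utf8(graduate).length); // ➔ 11
```

ASCII characters take one byte each, while each of the two emoji scalars takes four bytes and the zero-width joiner takes three.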

Utf16(string) -> list<integer>

Return a list of UTF-16 code units for the given string.

Note: The values returned are UTF-16 code units, not Unicode scalar values.

["Utf16", "Hello"]
// ➔ [72, 101, 108, 108, 111]

["Utf16", "👩‍🎓"]
// ➔ [55357, 56425, 8205, 55356, 57235]
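
A scalar above U+FFFF is stored in UTF-16 as a pair of surrogate code units. The arithmetic can be sketched as follows; `toSurrogatePair` and `utf16` are hypothetical helpers for illustration, not Compute Engine functions.

```javascript
// Sketch only: these helpers are illustrative, not Compute Engine APIs.

// Split a scalar above U+FFFF into its high and low surrogates.
function toSurrogatePair(scalar) {
  if (!Number.isInteger(scalar) || scalar < 0x10000 || scalar > 0x10ffff) {
    throw new RangeError("expected a scalar in U+10000..U+10FFFF");
  }
  const offset = scalar - 0x10000; // a 20-bit value
  // Top 10 bits go to the high surrogate, bottom 10 bits to the low one
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

// List the UTF-16 code units of a JavaScript string.
function utf16(s) {
  return Array.from({ length: s.length }, (_, i) => s.charCodeAt(i));
}

console.log(toSurrogatePair(128105)); // ➔ [ 55357, 56425 ]
console.log(toSurrogatePair(127891)); // ➔ [ 55356, 57235 ]
console.log(utf16("\u{1F469}\u200D\u{1F393}")); // ➔ [ 55357, 56425, 8205, 55356, 57235 ]
```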

UnicodeScalars(string) -> list<integer>

A Unicode scalar is any valid Unicode code point, represented as a number between U+0000 and U+10FFFF, excluding the surrogate range (U+D800 to U+DFFF). In other words, Unicode scalars correspond exactly to UTF-32 code units.

This function returns the sequence of Unicode scalars (code points) that make up the string. Note that some characters perceived as a single visual unit (grapheme clusters) may consist of multiple scalars. For example, the emoji 👩‍🚀 is a single grapheme but is composed of several scalars.

["UnicodeScalars", "Hello"]
// ➔ [72, 101, 108, 108, 111]

["UnicodeScalars", "👩‍🎓"]
// ➔ [128105, 8205, 127891]
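
In JavaScript, iterating a string yields one element per code point, which gives a quick way to reproduce these values. The `unicodeScalars` helper below is a hypothetical sketch and assumes the string is well formed (no lone surrogates).

```javascript
// Sketch: list the code points of a string.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function unicodeScalars(s) {
  return Array.from(s, (ch) => ch.codePointAt(0));
}

const graduate = "\u{1F469}\u200D\u{1F393}"; // woman + zero-width joiner + graduation cap

console.log(unicodeScalars("Hello")); // ➔ [ 72, 101, 108, 108, 111 ]
console.log(unicodeScalars(graduate)); // ➔ [ 128105, 8205, 127891 ]
console.log(graduate.length); // ➔ 5 (UTF-16 code units, not scalars)
```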

GraphemeClusters(string) -> list<string>

A grapheme cluster is the smallest unit of text that a reader perceives as a single character. It may consist of one or more Unicode scalars (code points).

For example, the character é can be a single scalar (U+00E9) or a sequence of scalars (e U+0065 + combining acute U+0301), but both form a single grapheme cluster.

NFC (Normalization Form C) is the precomposed form of a character, while NFD (Normalization Form D) is the decomposed form, in which combining marks follow the base character.

Similarly, complex emojis (👩‍🚀, 🇫🇷) are grapheme clusters composed of multiple scalars.

The exact definition of grapheme clusters is determined by the Unicode Standard (UAX #29) and may evolve over time as new characters, scripts, or emoji sequences are introduced. In contrast, Unicode scalars and their UTF-8, UTF-16, or UTF-32 encodings are fixed and stable across Unicode versions.

The table below illustrates the difference between grapheme clusters and Unicode scalars:

| String  | Grapheme Clusters | Unicode Scalars (Code Points) |
| :------ | :---------------- | :---------------------------- |
| é (NFC) | ["é"]             | [233]                         |
| é (NFD) | ["é"]             | [101, 769]                    |
| 👩‍🎓      | ["👩‍🎓"]            | [128105, 8205, 127891]        |

In contrast, a Unicode scalar is a single code point in the Unicode standard, corresponding to a UTF-32 value. Grapheme clusters are built from one or more scalars.

This function splits a string into grapheme clusters, not scalars.

["GraphemeClusters", "Hello"]
// ➔ ["H", "e", "l", "l", "o"]

["GraphemeClusters", "👩‍🎓"]
// ➔ ["👩‍🎓"]

["UnicodeScalars", "👩‍🎓"]
// ➔ [128105, 8205, 127891]

For more details on how grapheme cluster boundaries are determined, see Unicode® Standard Annex #29.
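
The distinction can be reproduced in plain JavaScript with the standard `Intl.Segmenter`, which applies the UAX #29 rules shipped with the runtime. The `graphemeClusters` helper is a hypothetical sketch, not the Compute Engine implementation, and its results may vary slightly with the Unicode version of the runtime.

```javascript
// Sketch: split a string into grapheme clusters with Intl.Segmenter.
function graphemeClusters(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(s), (part) => part.segment);
}

const nfc = "\u00e9"; // precomposed é
const nfd = nfc.normalize("NFD"); // e + combining acute accent

console.log(graphemeClusters(nfc).length); // ➔ 1
console.log(graphemeClusters(nfd).length); // ➔ 1
console.log(Array.from(nfd).length); // ➔ 2 (scalars)
console.log(graphemeClusters("\u{1F1EB}\u{1F1F7}").length); // ➔ 1 (flag made of two scalars)
```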

BaseForm(value:integer) -> string

BaseForm(value:integer, base:integer) -> string

Format an integer in a specific base, such as hexadecimal or binary.

If no base is specified, use base-10.

The sign of the integer is ignored.

  • value should be an integer.
  • base should be an integer from 2 to 36.

["Latex", ["BaseForm", 42, 16]]
// ➔ (\text{2a})_{16}

Latex(BaseForm(42, 16))
// ➔ (\text{2a})_{16}
String(BaseForm(42, 16))
// ➔ "'0x2a'"
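
The digit conversion itself can be sketched with JavaScript's built-in `Number.prototype.toString(radix)`. The `baseDigits` helper is hypothetical: it only produces the digits, without the prefix or subscript that the serializers above add, and it follows the rules stated above (integer value, base from 2 to 36, sign ignored).

```javascript
// Sketch: digits of an integer in a given base (sign ignored).
// `baseDigits` is illustrative and not a Compute Engine function.
function baseDigits(value, base = 10) {
  if (!Number.isInteger(value)) {
    throw new TypeError("value should be an integer");
  }
  if (!Number.isInteger(base) || base < 2 || base > 36) {
    throw new RangeError("base should be an integer from 2 to 36");
  }
  return Math.abs(value).toString(base);
}

console.log(baseDigits(42, 16)); // ➔ 2a
console.log(baseDigits(-42, 2)); // ➔ 101010
console.log(baseDigits(42)); // ➔ 42
```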

Delimiter(expr)

Delimiter(expr, delim)

Visually group expressions with an open delimiter, a close delimiter and separators between elements of the expression.

When serializing to LaTeX, render expr wrapped in delimiters.

The Delimiter function is inert and the value of a ["Delimiter", expr] expression is expr.

expr is a function expression, usually a ["Sequence"]. It should not be a symbol or a number.

delim is an optional string:

  • when it is a single character it is a separator
  • when it is two characters, the first is the opening delimiter and the second is the closing delimiter
  • when it is three characters, the first is the opening delimiter, the second is the separator, and the third is the closing delimiter

The delimiters are rendered to LaTeX.

The open and close delimiters are a single character, one of: ()[]{}<>|‖⌈⌉⌊⌋⌜⌝⌞⌟⎰⎱". The open and close delimiters do not have to match. For example, "')]'" is a valid delimiter.

If an open or close delimiter is ., it is ignored.

The separator delimiter is also a single character, one of ,;.&:|- or U+00B7 (middle dot), U+2022 (bullet) or U+2026 (ellipsis).

If no delim is provided, a default delimiter is used based on the type of expr:

  • ["Sequence"] -> (,)
  • ["Tuple"], ["Single"], ["Pair"], ["Triple"] -> (,)
  • ["List"] -> [,]
  • ["Set"] -> {,}
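
The rules for interpreting delim by its length can be sketched as a small parser. `parseDelimiter` is a hypothetical helper written for illustration: it does not validate the characters against the allowed sets, and it leaves unspecified parts undefined rather than guessing which defaults the Compute Engine applies.

```javascript
// Sketch: interpret a `delim` string according to its length.
// Not a Compute Engine API; parts not given by `delim` are left undefined.
function parseDelimiter(delim) {
  const chars = Array.from(delim); // split by code point, not UTF-16 unit
  const fence = (ch) => (ch === "." ? "" : ch); // "." means "no delimiter"
  switch (chars.length) {
    case 1:
      return { separator: chars[0] };
    case 2:
      return { open: fence(chars[0]), close: fence(chars[1]) };
    case 3:
      return {
        open: fence(chars[0]),
        separator: chars[1],
        close: fence(chars[2]),
      };
    default:
      throw new RangeError("delim should be 1, 2 or 3 characters long");
  }
}

console.log(parseDelimiter(";")); // ➔ { separator: ';' }
console.log(parseDelimiter(")]")); // ➔ { open: ')', close: ']' }
console.log(parseDelimiter(".,)")); // ➔ { open: '', separator: ',', close: ')' }
```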

Spacing(width)

When serializing to LaTeX, width is the dimension of the spacing, in 1/18 em.

The Spacing function is inert and the value of a ["Spacing", expr] expression is expr.
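
A unit of 1/18 em matches the LaTeX math unit mu, so a width of 18 is one em (a \quad) and a width of 3 is a thin space (\,). The sketch below only illustrates that correspondence: `spacingToEm` and `spacingToLatex` are hypothetical helpers, and the LaTeX that the Compute Engine actually emits for a Spacing expression may differ.

```javascript
// Sketch: relate a Spacing width (in 1/18 em) to LaTeX spacing commands.
// These helpers are illustrative assumptions, not Compute Engine APIs.
const NAMED_SPACES = new Map([
  [3, "\\,"], // thin space, 3mu
  [4, "\\:"], // medium space, 4mu
  [5, "\\;"], // thick space, 5mu
  [18, "\\quad"], // 1em
  [36, "\\qquad"], // 2em
]);

function spacingToEm(width) {
  return width / 18;
}

function spacingToLatex(width) {
  // Fall back to an explicit math skip when no named command matches
  return NAMED_SPACES.get(width) ?? `\\mskip ${width}mu`;
}

console.log(spacingToEm(9)); // ➔ 0.5
console.log(spacingToLatex(18)); // ➔ \quad
console.log(spacingToLatex(7)); // ➔ \mskip 7mu
```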

Annotated(expr:expression, dictionary) -> expression

Annotated(expr, attributes) is an expression that behaves exactly like expr, but carries visual or semantic metadata as an attribute dictionary.

The attributes have no effect on evaluation. This function is inert: it evaluates to its first argument.

The attributes dictionary may include:

  • Visual style hints (e.g. weight: "bold", color: "blue")
  • Semantic metadata (e.g. tooltip, language, link)

Use Annotated when you want to attach presentational or semantic information to an expression without affecting its evaluation or identity. This is useful for rendering, tooltips, highlighting, etc.

The following keys are applicable to math expressions:

  • mathStyle = "compact" or "normal". The "compact" style is used for inline math expressions, while the "normal" style is used for display math expressions.
  • scriptLevel = 0, 1, 2, or a relative change of -1 or +1. The script level determines the size of the expression relative to the surrounding text: 0 is normal size, 1 is smaller, and 2 is smaller still.

The following keys are applicable to text content:

  • weight a string, one of "normal", "bold", "bolder", "light"
  • style a string, one of "normal", "italic", "oblique"
  • language a string indicating the language of the expression, e.g. "en", "fr", "es" etc.

The following keys are applicable to both math expressions and text content:

  • color a color name or hex code
  • backgroundColor a color name or hex code for the background color
  • tooltip a string to be displayed as a tooltip when the expression is hovered over
  • link a URL to be followed when the expression is clicked
  • cssClass a string indicating the CSS class to be applied to the expression
  • cssId a string indicating the CSS id of the expression

Other keys that may appear in the dictionary include:

  • size a number from 1 to 10 where 5 is normal size
  • font a string indicating the font family
  • fontSize a number indicating the font size in pixels
  • fontWeight a string indicating the font weight, e.g. "normal", "bold", "bolder", "lighter"
  • fontStyle a string indicating the font style, e.g. "normal", "italic", "oblique"
  • textDecoration a string indicating the text decoration, e.g. "none", "underline", "line-through"
  • textAlign a string indicating the text alignment, e.g. "left", "center", "right"
  • textTransform a string indicating the text transformation, e.g. "none", "uppercase", "lowercase"
  • textIndent a number indicating the text indentation in pixels
  • lineHeight a number indicating the line height in pixels
  • letterSpacing a number indicating the letter spacing in pixels
  • wordSpacing a number indicating the word spacing in pixels
  • border a string indicating the border style, e.g. "none", "solid", "dashed", "dotted"
  • borderColor a color name or hex code for the border color
  • borderWidth a number indicating the border width in pixels
  • padding a number indicating the padding in pixels
  • margin a number indicating the margin in pixels
  • textShadow a string indicating the text shadow, e.g. "2px 2px 2px rgba(0,0,0,0.5)"
  • boxShadow a string indicating the box shadow, e.g. "2px 2px 5px rgba(0,0,0,0.5)"
  • opacity a number from 0 to 1 indicating the opacity of the expression
  • transform a string indicating the CSS transform, e.g. "rotate(45deg)", "scale(1.5)", "translateX(10px)"
  • transition a string indicating the CSS transition, e.g. "all 0.3s ease-in-out"
  • cursor a string indicating the cursor style, e.g. "pointer", "default", "text"
  • display a string indicating the CSS display property, e.g. "inline", "block", "flex", "grid"
  • visibility a string indicating the CSS visibility property, e.g. "visible", "hidden", "collapse"
  • zIndex a number indicating the z-index of the expression
  • position a string indicating the CSS position property, e.g. "static", "relative", "absolute", "fixed"
  • float a string indicating the CSS float property, e.g. "left", "right", "none"
  • clear a string indicating the CSS clear property, e.g. "left", "right", "both", "none"
  • overflow a string indicating the CSS overflow property, e.g. "visible", "hidden", "scroll", "auto"
  • overflowX a string indicating the CSS overflow-x property, e.g. "visible", "hidden", "scroll", "auto"
  • overflowY a string indicating the CSS overflow-y property, e.g. "visible", "hidden", "scroll", "auto"
  • whiteSpace a string indicating the CSS white-space property, e.g. "normal", "nowrap", "pre"
  • textOverflow a string indicating the CSS text-overflow property, e.g. "ellipsis", "clip"
  • direction a string indicating the text direction, e.g. "ltr" (left-to-right) or "rtl" (right-to-left)
  • lang a string indicating the language of the expression, e.g. "en" (English), "fr" (French), "es" (Spanish)
  • role a string indicating the ARIA role of the expression, e.g. "button", "link", "textbox"
  • aria-label a string providing an accessible label for the expression
  • aria-labelledby a string providing an accessible label by referencing another element's ID
  • aria-describedby a string providing an accessible description by referencing another element's ID
  • aria-hidden a boolean indicating whether the expression is hidden from assistive technologies
  • aria-live a string indicating the ARIA live region, e.g. "off", "polite", "assertive"
  • aria-atomic a boolean indicating whether assistive technologies should treat the expression as a whole
  • aria-relevant a string indicating what changes in the expression are relevant to assistive technologies, e.g. "additions"
  • aria-controls a string providing the ID of another element that the expression controls
  • aria-expanded a boolean indicating whether the expression is expanded or collapsed
  • aria-pressed a boolean indicating whether the expression is pressed (for toggle buttons)
  • aria-selected a boolean indicating whether the expression is selected
  • aria-checked a boolean indicating whether the expression is checked (for checkboxes or radio buttons)
  • aria-valuenow a number indicating the current value of the expression (for sliders or progress bars)
  • aria-valuetext a string providing a text representation of the current value of the expression
  • aria-valuemin a number indicating the minimum value of the expression (for sliders or progress bars)
  • aria-valuemax a number indicating the maximum value of the expression (for sliders or progress bars)
  • aria-keyshortcuts a

The Annotated function is inert and the value of an ["Annotated", expr] expression is expr.
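
Because Annotated is inert, removing every annotation from an expression leaves its meaning unchanged. The sketch below shows this on MathJSON written as nested JavaScript arrays; `stripAnnotations` is a hypothetical helper, not a Compute Engine function, and it assumes function expressions use the array form ["Head", ...args].

```javascript
// Sketch: replace each ["Annotated", expr, attributes] by expr, recursively.
// Illustrative only; assumes MathJSON function expressions are arrays.
function stripAnnotations(expr) {
  if (!Array.isArray(expr)) return expr; // symbols, numbers, strings
  const [head, ...args] = expr;
  if (head === "Annotated") return stripAnnotations(args[0]);
  return [head, ...args.map(stripAnnotations)];
}

const annotated = [
  "Equal",
  "circumference",
  ["Multiply", 2, ["Annotated", "Pi", { color: "blue" }], "r"],
];

console.log(JSON.stringify(stripAnnotations(annotated)));
// ➔ ["Equal","circumference",["Multiply",2,"Pi","r"]]
```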