Help for package sumer

Type: Package
Title: Sumerian Cuneiform Text Analysis
Version: 1.3.0
Description: Provides functions for converting transliterated Sumerian texts to sign names and cuneiform characters, creating and querying dictionaries, analyzing the structure of Sumerian words, and creating translations. Includes a built-in dictionary and supports both forward lookup (Sumerian to English) and reverse lookup (English to Sumerian).
License: GPL-3
Encoding: UTF-8
Depends: R (≥ 4.0.0)
Imports: stringr, officer, xml2, cli, rlang, ggplot2, ragg, shiny
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
NeedsCompilation: yes
Maintainer: Robin Wellmann <ro.wellmann@gmail.com>
Packaged: 2026-03-11 09:57:39 UTC; rowel
Author: Robin Wellmann [aut, cre]
Repository: CRAN
Date/Publication: 2026-03-11 12:00:02 UTC

Add Brackets to a Structure String

Description

Inserts curly braces into a structure string so that operators are grouped with their arguments and remaining elements are grouped by compositional rules.

The function processes {...} groups already present in the input recursively, then resolves operator binding (left-binding first, right-binding second) and finally applies compositional grouping rules (S+A, S+S, S,S, S+V).

Usage

add_brackets(s, type)

Arguments

s

Character string containing #N tags and optional {...} groups, as produced by the first step (structure string) of compose_skeleton_entry.

type

Character vector of type annotations for the children. type[k] is the type string for tag #k (e.g. "S", "V", "Sx->V", "xS->S").

Details

The algorithm works in four phases:

Existing groups: Recursively process any {...} groups already in the input.
Left-binding operators: Operators with a left argument (e.g. Sx->V) bind their left neighbour first.
Right-binding operators: Operators with a right argument (e.g. xS->V) bind their right neighbour.
Compositional grouping: Remaining elements are grouped by type priority: S+A, then S+S (no punctuation), then S,S (comma), then S+V.

If an operator cannot find matching arguments, an error is reported (see the "ERROR" type under Value).
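
The binding direction of an operator can be read off its type string. The following is a minimal sketch of that idea, not the package's internal parser: parse_type_sketch is a hypothetical helper, and it assumes that argument types are limited to S, V, A, and SEN and that the lowercase x marks the operator's own position.

```r
# Hypothetical helper (not part of sumer): split an operator type such as
# "Sx->V" into its left arguments, right arguments, and result type.
# Assumption: argument types are S, V, A, or SEN; "x" marks the operator.
parse_type_sketch <- function(type) {
  if (!grepl("->", type, fixed = TRUE)) {
    # Base type such as "S": no arguments, the result is the type itself.
    return(list(left = character(0), right = character(0), result = type))
  }
  parts <- strsplit(type, "->", fixed = TRUE)[[1]]
  lhs <- parts[1]
  pos <- regexpr("x", lhs, fixed = TRUE)
  # Tokenize a run of argument types, e.g. "SS" -> c("S", "S").
  tokens <- function(s) regmatches(s, gregexpr("SEN|S|V|A", s))[[1]]
  list(
    left   = tokens(substr(lhs, 1, pos - 1)),
    right  = tokens(substr(lhs, pos + 1, nchar(lhs))),
    result = parts[2]
  )
}

str(parse_type_sketch("Sx->V"))   # one left argument, binds in phase 2
str(parse_type_sketch("xS->A"))   # one right argument, binds in phase 3
str(parse_type_sketch("SSx->V"))  # two left arguments
```

Under these assumptions, a non-empty left element marks a left-binding operator and a non-empty right element marks a right-binding one.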

Value

A list with two elements:

string: The structure string with all brackets inserted.
type: The resulting type of the outermost group ("S", "V", "SEN", or "ERROR").

Examples

x <- "mec3-ki-aj2-ga-ce-er"
x <- as.cuneiform(x)
x

meaning <- rbind( c("S",      "a man who relies on his own strength"),
                  c("S",      "place {earth}"),
                  c("Sx->A",  ", whose allocated resource is S"),
                  c("xS->A",   ", whose sustenance is S"),
                  c("S",      "grain"),
                  c("Sx->S",  "lamented S"))

df <- data.frame(
    type  = meaning[,1],
    translation = meaning[,2],
    expr  = split_sumerian(x)$signs)

s <- x
for (i in seq_len(nrow(df))) {
  s <- sub(df$expr[i], paste0("#", i), s, fixed = TRUE)
}
s

(s_bracketed <- sumer:::add_brackets(s, df$type))
apply_translation_rules(s_bracketed$string, df$type, df$translation)

Apply Translation Rules to a Bracketed Structure String

Description

Translates a bracketed structure string into English by evaluating Sumerian operators (substituting arguments into translation templates) and composing adjacent elements according to grammatical rules. The input is a structure string as produced by add_brackets, together with vectors of types and translations for each tag.

Usage

apply_translation_rules(s, type, translation)

Arguments

s

Character string showing the order of evaluation, as produced by add_brackets.

type

Character vector of grammatical types. type[k] is the type string for tag #k (e.g. "S", "V", "Sx->V").

translation

Character vector of translations. translation[k] is the translation for tag #k. Operator translations contain placeholders (e.g. "to deliver S").

Details

Nested {...} groups are evaluated from the inside out. Within each group, an operator (if present) binds its arguments and produces a typed result. Groups without an operator are composed according to the rules described in eval_operator.

Value

A character vector of length 2: c(result_type, result_translation).

On error (e.g. incompatible types), a character vector of length 1 containing the error message.

Examples

x <- "mec3-ki-aj2-ga-ce-er"
x <- as.cuneiform(x)
x

meaning <- rbind( c("S",      "a man who relies on his own strength"),
                  c("S",      "place {earth}"),
                  c("Sx->A",  ", whose allocated resource is S"),
                  c("xS->A",   ", whose sustenance is S"),
                  c("S",      "grain"),
                  c("Sx->S",  "lamented S"))

df <- data.frame(
    type  = meaning[,1],
    translation = meaning[,2],
    expr  = split_sumerian(x)$signs)

s <- x
for (i in seq_len(nrow(df))) {
  s <- sub(df$expr[i], paste0("#", i), s, fixed = TRUE)
}
s

s_bracketed <- sumer:::add_brackets(s, df$type)
s_bracketed

apply_translation_rules(s_bracketed$string, df$type, df$translation)

Convert Transliterated Sumerian Text to Cuneiform

Description

Converts transliterated Sumerian text to Unicode cuneiform characters. This is a generic function with a method for character vectors.

Usage

as.cuneiform(x, ...)

## Default S3 method:
as.cuneiform(x, ...)

## S3 method for class 'character'
as.cuneiform(x, mapping = NULL, ...)

## S3 method for class 'cuneiform'
print(x, ...)

Arguments

x

For as.cuneiform: An object to be converted to cuneiform. Currently, only character vectors are supported.

For print.cuneiform: an object of class "cuneiform".

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded. Only used by the character method.

...

Additional arguments passed to methods.

Details

The function processes each element of the input character vector by:

Calling info to look up sign information for each transliterated sign.
Extracting the Unicode cuneiform symbols for each sign.
Reconstructing the cuneiform text using the original separators, but removing hyphens and periods which are only used in transliteration to indicate sign boundaries.

The default method throws an error for unsupported input types.

Value

as.cuneiform returns a character vector of class cuneiform with the cuneiform representation of each input element.

print.cuneiform displays a character vector of class cuneiform.

Note

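The cuneiform output requires a font that covers the Unicode cuneiform blocks to display correctly: Cuneiform (U+12000 to U+123FF), Cuneiform Numbers and Punctuation (U+12400 to U+1247F), and Early Dynastic Cuneiform (U+12480 to U+1254F).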

Examples


# Convert transliterated text to cuneiform
as.cuneiform(c("na-an-jic li-ic ma","en tarah-an-na-ke4"))

# Load transliterated text from a file
file <- system.file("extdata", "transliterated-text.txt", package = "sumer")
x <- readLines(file)
cat(x, sep="\n")

# Convert transliterated text to cuneiform
as.cuneiform(x)

# Using a custom mapping table
path <- system.file("extdata", "etcsl_mapping.txt", package = "sumer")
my_mapping <- read.csv2(path, sep=";", na.strings="")
as.cuneiform("lugal", mapping = my_mapping)

Convert Transliterated Sumerian Text to Sign Names

Description

Converts transliterated Sumerian text to canonical sign names in uppercase notation. This is a generic function with a method for character vectors.

Usage

as.sign_name(x, ...)

## Default S3 method:
as.sign_name(x, ...)

## S3 method for class 'character'
as.sign_name(x, mapping = NULL, ...)

## S3 method for class 'sign_name'
print(x, ...)

Arguments

x

For as.sign_name: An object to be converted to sign names. Currently, only character vectors are supported.

For print.sign_name: An object of class "sign_name".

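mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded. Only used by the character method.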

...

Additional arguments passed to methods.

Details

The function processes each element of the input character vector by:

Calling info to look up sign information for each transliterated sign.
Extracting the canonical sign names for each sign.
Reconstructing the text using the original separators, but replacing hyphens with periods to follow standard sign name notation.
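
The three steps can be illustrated with a small base R sketch. It is not the package's implementation: to_sign_names_sketch is a hypothetical helper, the four-entry lookup vector is a toy stand-in for the mapping table, and the token pattern (lowercase letters followed by optional index digits) is a simplifying assumption.

```r
# Toy stand-in for the mapping table (assumption: reading -> sign name).
toy_lookup <- c(lugal = "LUGAL", e = "E", an = "AN", ki = "KI")

# Hypothetical helper: look up each transliterated token, keep the original
# separators, then replace hyphens with periods.
to_sign_names_sketch <- function(x, lookup) {
  m <- gregexpr("[a-z]+[0-9]*", x)           # locate the tokens
  tokens <- regmatches(x, m)[[1]]
  regmatches(x, m) <- list(unname(lookup[tokens]))
  gsub("-", ".", x, fixed = TRUE)            # sign-name notation uses periods
}

to_sign_names_sketch("lugal-e an-ki", toy_lookup)
```

Spaces between words are left untouched; only the hyphens inside words are rewritten.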

The default method throws an error for unsupported input types.

Value

as.sign_name returns a character vector of class c("sign_name", "character") with the sign name representation of each input element.

print.sign_name displays a character vector of class "sign_name".

Examples

# Convert transliterated text to sign names
as.sign_name(c("lugal-e", "an-ki"))

# Load transliterated text from a file
file <- system.file("extdata", "transliterated-text.txt", package = "sumer")
x <- readLines(file)
cat(x, sep="\n")

# Convert transliterated text to sign names
as.sign_name(x)

# Using a custom mapping table
path <- system.file("extdata", "etcsl_mapping.txt", package = "sumer")
my_mapping <- read.csv2(path, sep=";", na.strings="")
as.sign_name("lugal", mapping = my_mapping)

Compose a Skeleton Entry from its Children

Description

Composes the type and translation fields of a parent skeleton entry from the corresponding fields of its direct children, following the rules of the Sumerian type system. This function is called by translate when the user clicks the brown compose button next to a skeleton entry.

Usage

compose_skeleton_entry(df)

Arguments

df

A data frame with n + 1 rows, where the first row represents the parent entry and the remaining n rows represent its direct children in the skeleton hierarchy.

The data frame has the following columns:

expr: Character. The expression of the entry.
n_tokens: Integer. Number of tokens covered by the entry.
start: Integer. Position (1-based) of the first token.
depth: Integer. Nesting depth in the skeleton hierarchy.
type: Character. The grammatical type annotation: "S" (substantive), "V" (verb), "A" (adjective/attribute), or an operator type like "xS->S".
translation: Character. The translation. Operator translations contain type-letter placeholders (e.g. "S" in "supplier of energy from S").

Details

The algorithm works in the following steps:

Structure string: Build a structure string by replacing each child's expression in the parent expression with a numbered tag (#1, #2, etc.). Token-group wrappers (<> and single-tag {}) are removed.
Bracket insertion: add_brackets groups operators with their arguments and applies compositional rules (S+A, S+S, S+V, SEN+SEN).
Translation: apply_translation_rules recursively evaluates the bracketed string to produce the final translation.

Value

Either the input data frame df with the type and translation fields of the first row filled in, or a character string containing an error message.

Examples

# Minimal example: S + Sx->V
df <- data.frame(
  expr = c("#1 #2", "#1", "#2"),
  n_tokens = c(2L, 1L, 1L),
  start = c(1L, 1L, 2L),
  depth = c(0L, 1L, 1L),
  type = c("", "S", "Sx->V"),
  translation = c("", "temple", "to utilize S"),
  stringsAsFactors = FALSE
)
result <- sumer:::compose_skeleton_entry(df)
stopifnot(result$type[1] == "V")
stopifnot(result$translation[1] == "to utilize the temple")
cat("compose_skeleton_entry check passed.\n")

Convert Translation Data to a Sumerian Dictionary

Description

Converts a data frame of Sumerian translations into a structured dictionary format, adding cuneiform representations and phonetic readings for each sign.

Usage

convert_to_dictionary(df, mapping = NULL)

Arguments

df

A data frame with columns sign_name, type, and meaning, typically produced by read_translated_text.

mapping

A data frame containing sign-to-reading mappings with columns name, cuneiform and syllables. If NULL (default), the package's built-in mapping file etcsl_mapping.txt is used.

Details

Processing Steps

Aggregates translations and counts occurrences of each unique combination in df
Looks up phonetic readings and cuneiform signs for each sign component
Combines cuneiform, reading, and translation rows into a single data frame
Sorts the result by sign name and row type

Reading Format

Phonetic readings are formatted as follows:

Multiple possible readings are enclosed in braces: {a, dur5, duru5}
For compound signs, readings of individual components are joined with hyphens
If a sign has more than three possible readings in a compound, only the first three are shown followed by ...
Unknown readings are marked with ?
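
The formatting rules above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: both helper names are hypothetical, the readings r1 to r5 are placeholders rather than real readings, and the sketch assumes that a single reading is shown without braces and that the ellipsis is placed inside the braces.

```r
# Hypothetical helper: format the readings of one sign.
format_reading_sketch <- function(readings, in_compound = FALSE) {
  if (length(readings) == 0) return("?")        # unknown reading
  if (length(readings) == 1) return(readings)   # assumption: no braces needed
  if (in_compound && length(readings) > 3) {
    readings <- c(readings[1:3], "...")         # truncate inside compounds
  }
  paste0("{", paste(readings, collapse = ", "), "}")
}

# Hypothetical helper: join the formatted components of a compound sign.
format_compound_sketch <- function(components) {
  parts <- vapply(components, format_reading_sketch, character(1),
                  in_compound = TRUE)
  paste(parts, collapse = "-")
}

format_reading_sketch(c("a", "dur5", "duru5"))
format_compound_sketch(list("r1", c("r2", "r3", "r4", "r5"), character(0)))
```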

Value

A data frame with the following columns:

sign_name: The normalized Sumerian text (e.g., "A", "AN", "A2.TAB")
row_type: Type of entry: "cunei." (cuneiform character), "reading" (phonetic readings), or "trans." (translation)
count: Number of occurrences for translations; NA for cuneiform and reading entries
type: Grammatical type (e.g., "S", "V", "A") for translations; empty string for other row types
meaning: The cuneiform character(s), phonetic reading(s), or translated meaning depending on row_type

The data frame is sorted by sign_name, row_type, and descending count.

Examples

# Read translations from a single text document
filename     <- system.file("extdata", "text_with_translations.txt", package = "sumer")
translations <- read_translated_text(filename)

# View the structure
head(translations)

# Make some custom unifications (here: removing the word "the")
translations$meaning <- gsub("\\bthe\\b", "", translations$meaning, ignore.case = TRUE)
translations$meaning <- trimws(gsub("\\s+", " ", translations$meaning))

# View the structure
head(translations)

# Convert the result into a dictionary
dictionary   <- convert_to_dictionary(translations)

# View the structure
head(dictionary)

# View entries for a specific sign
dictionary[dictionary$sign_name == "EN", ]

# With custom mapping
path  <- system.file("extdata", "etcsl_mapping.txt", package = "sumer")
mapping <- read.csv2(path, sep=";", na.strings="")
translations <- read_translated_text(filename, mapping = mapping)
dictionary <- convert_to_dictionary(translations, mapping = mapping)
head(dictionary)

Evaluate an Operator or Composition

Description

Evaluates a single operator by substituting resolved arguments into its translation string, or composes two elements without an operator using compositional rules.

This function contains the complete translation logic: extraction of specific meanings from {specific} annotations, article insertion ("the "), verb-prefix stripping ("to "), placeholder substitution, and composition rules.

Usage

eval_operator(operator, rule_translation, args = list(), seps = "", trailing_sep = "")

Arguments

operator

Type string of the operator (e.g. "Sx->V", "SSx->V", "xS->A"), or NULL for composition (no operator present). For base types without arguments (e.g. "S", "V"), pass the type string here.

rule_translation

Translation string with placeholders (e.g. "to utilize S", "S1 and therefore S2"), or NULL for composition.

args

List of c(name, translation) character vectors:

name: Placeholder name encoding the grammatical type (e.g. "S", "S1", "V", "SEN2"). The type is reconstructed internally by stripping the trailing index digits.
translation: Resolved translation of the argument.

Default: list() (no arguments, for base types). The helper function build_args() can be used to construct this list from a set of elements.

seps

Character vector of separators between elements, as extracted by apply_translation_rules. Hyphens in separators are converted to spaces for composition. Default: "".

trailing_sep

Text after the last element (e.g. trailing punctuation). Hyphens are converted to spaces for composition. Default: "".

Details

The function operates in two modes:

Operator mode (operator is not NULL):

Determine the result type from the operator string via parse_type().
For each argument, apply transformations:
- Strip "to " from verb-type ("V") arguments.
- Add "the " before substantive ("S") arguments when the result type is not "S" (i.e., the operator changes the type).
Replace each placeholder in rule_translation by whole-word match.
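
A minimal sketch of operator mode, assuming the result type is already known: apply_operator_sketch is a hypothetical helper that only mirrors the three transformations listed above and omits everything else (type parsing, {specific} extraction, separators). It substitutes placeholders one after another, so it would misbehave if an argument's translation itself contained a placeholder name as a word.

```r
# Hypothetical helper (not part of sumer): substitute arguments into an
# operator's translation template.
apply_operator_sketch <- function(result_type, template, args) {
  out <- template
  for (a in args) {
    name  <- a[1]
    value <- a[2]
    arg_type <- sub("[0-9]+$", "", name)          # "S1" -> "S"
    if (arg_type == "V") {
      value <- sub("^to ", "", value)             # strip the verb prefix
    }
    if (arg_type == "S" && result_type != "S") {
      value <- paste0("the ", value)              # article insertion
    }
    # Whole-word replacement of the placeholder.
    out <- sub(paste0("\\b", name, "\\b"), value, out, perl = TRUE)
  }
  c(result_type, out)
}

apply_operator_sketch("V", "to utilize S", list(c("S", "temple")))
apply_operator_sketch("V", "to equip S1 with S2",
                      list(c("S1", "leader"), c("S2", "temple")))
```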

Composition mode (operator is NULL):

Exactly two arguments are required. The composition rules are:

S + A -> S: Simple juxtaposition: "X Y".
S + S -> S: Without comma separator: "X of/with the Y".
S, S -> S: With comma separator: "X, the Y".
S + V -> SEN: Subject-verb: "X Y" (with "to " stripped from verb).
SEN + SEN -> SEN: Sentence concatenation: "X. Y" (period added if missing).

In both modes, translations of the form "general {specific}" are reduced to the specific part before processing.
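
The composition rules can be written down compactly. The sketch below is an illustration only: specific_part and compose_sketch are hypothetical helpers, the sep argument is an assumed way of signalling a comma separator, and the wording of the error message is invented.

```r
# Hypothetical helper: reduce "general {specific}" to the specific part.
specific_part <- function(x) {
  if (grepl("\\{[^}]*\\}", x)) sub(".*\\{([^}]*)\\}.*", "\\1", x) else x
}

# Hypothetical helper: compose two typed elements, each given as
# c(type, translation), without an operator.
compose_sketch <- function(left, right, sep = " ") {
  lt <- left[1]
  rt <- right[1]
  l <- specific_part(left[2])
  r <- specific_part(right[2])
  if (lt == "S" && rt == "A") return(c("S", paste(l, r)))
  if (lt == "S" && rt == "S") {
    if (grepl(",", sep, fixed = TRUE)) return(c("S", paste0(l, ", the ", r)))
    return(c("S", paste0(l, " of/with the ", r)))
  }
  if (lt == "S" && rt == "V") return(c("SEN", paste(l, sub("^to ", "", r))))
  if (lt == "SEN" && rt == "SEN") {
    if (!grepl("\\.$", l)) l <- paste0(l, ".")   # add the missing period
    return(c("SEN", paste(l, r)))
  }
  paste0("cannot compose ", lt, " with ", rt)    # length-1 result on error
}

compose_sketch(c("S", "men"), c("V", "to bring grain into the temple"))
compose_sketch(c("S", "place {earth}"), c("S", "grain"))
compose_sketch(c("S", "place {earth}"), c("S", "grain"), sep = ", ")
compose_sketch(c("V", "to bring"), c("A", "great"))
```

A successful composition yields a vector of length 2 and a failed one a vector of length 1, which matches the return convention described under Value.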

Value

A character vector of length 2: c(result_type, result_translation).

On error (e.g. incompatible types in composition), a character vector of length 1 containing the error message.

Examples

# --- Base type (operator without arguments) ---
# A simple substantive passes through unchanged:
sumer:::eval_operator("S", "temple")
# [1] "S"      "temple"

# With {specific} extraction:
sumer:::eval_operator("S", "place {earth}")
# [1] "S"     "earth"

# --- Operator with one argument: Sx->V ---
# The S argument gets "the " prepended (since result type is V, not S):
args <- list(c("S", "temple"))
sumer:::eval_operator("Sx->V", "to utilize S", args)
# [1] "V"                      "to utilize the temple"

# --- Operator with two S arguments: SSx->V ---
# Duplicate types get indexed names (S1, S2):
args <- list(c("S1", "leader"), c("S2", "temple"))
sumer:::eval_operator("SSx->V", "to equip S1 with S2", args)
# [1] "V"                                  "to equip the leader with the temple"

# --- Composition: S + A -> S ---
args <- list(c("S", "temple"), c("A", ", which is a great one"))
sumer:::eval_operator(NULL, NULL, args)
# [1] "S"            "temple, which is a great one"

# --- Composition: S + V -> SEN ---
args <- list(c("S", "men"), c("V", "to bring grain into the temple"))
sumer:::eval_operator(NULL, NULL, args)
# [1] "SEN"       "men bring grain into the temple"

# --- Full pipeline with add_brackets and apply_translation_rules ---
x <- "mec3-ki-aj2-ga-ce-er"
x <- as.cuneiform(x)
x

meaning <- rbind( c("S",      "a man who relies on his own strength"),
                  c("S",      "place {earth}"),
                  c("Sx->A",  ", whose allocated resource is S"),
                  c("xS->A",   ", whose sustenance is S"),
                  c("S",      "grain"),
                  c("Sx->S",  "lamented S"))

df <- data.frame(
    type  = meaning[,1],
    translation = meaning[,2],
    expr  = split_sumerian(x)$signs)

s <- x
for (i in seq_len(nrow(df))) {
  s <- sub(df$expr[i], paste0("#", i), s, fixed = TRUE)
}
s

s_bracketed <- sumer:::add_brackets(s, df$type)
s_bracketed

apply_translation_rules(s_bracketed$string, df$type, df$translation)

Extract Hierarchical Skeleton Entries from Bracketed Text

Description

Recursively extracts the contents of nested round brackets from a normalized Sumerian text string and returns them as a data frame with position, length, nesting depth, and expression for each entry.

This is an internal helper function used by skeleton.

Usage

extract_skeleton_entries(x)

Arguments

x

A character string containing Sumerian text with round brackets, as returned by mark_skeleton_entries.

Details

The first row of the result always represents the entire input expression at depth 0 (the root entry). The function then extracts the contents of all outermost (top-level) bracket pairs using an internal helper function. For each extracted group, a row is added to the result data frame at depth 1. If a group itself contains further nested brackets, the function recurses into it to extract deeper levels.

The depth value of each entry reflects the nesting level: the root entry has depth 0, entries from the outermost brackets have depth 1, entries nested one level deeper have depth 2, and so on.

The start column records the position (in tokens) of the first token in each group, relative to the full input. The n_tokens column gives the number of tokens in the group as determined by split_sumerian.
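
The depth value can be illustrated without the package. The sketch below assigns a nesting depth to every character of a string with balanced round brackets; bracket_depth_sketch is a hypothetical helper and works on characters, whereas the package works on tokens.

```r
# Hypothetical helper: nesting depth of each character with respect to
# round brackets. A bracket character gets the depth of the group it delimits.
bracket_depth_sketch <- function(x) {
  ch <- strsplit(x, "", fixed = TRUE)[[1]]
  opens  <- ch == "("
  closes <- ch == ")"
  # Running count of open groups; a closing bracket still belongs to its group.
  depth <- cumsum(opens) - cumsum(closes) + closes
  stats::setNames(depth, ch)
}

bracket_depth_sketch("a (b (c) d)")
```

Here "a" lies at depth 0 (the root entry), "b" and "d" at depth 1, and "c" at depth 2.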

Value

A data frame with one row per extracted entry and the following columns:

start

Integer. The token position of the first token in the group (1-based).

n_tokens

Integer. The number of Sumerian tokens (signs) in the group.

depth

Integer. The nesting depth of the entry (0 for the root entry representing the full expression, 1 for top-level groups, 2 for groups nested one level deeper, etc.).

expr

Character. The text content of the bracket group (without the surrounding brackets). For the root entry (row 1), this is the full input string.

The result always has at least one row (the root entry).

Examples

# First normalize the input with mark_skeleton_entries
x <- "<d-nu-dim2-mud> ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"
normalized <- sumer:::mark_skeleton_entries(x)
normalized

# Then extract the hierarchical structure
sumer:::extract_skeleton_entries(normalized)

Posterior Probabilities of Grammatical Types for Each Sign

Description

For each cuneiform sign in a sentence, computes Bayesian posterior probabilities for all grammatical types, combining prior beliefs from prior_probs with observed dictionary frequencies. The dictionary counts are corrected for verb underrepresentation using the sentence_prob stored in the prior.

Usage

grammar_probs(sg, prior, dic, alpha0 = 1)

Arguments

sg

A data frame as returned by sign_grammar.

prior

A named numeric vector as returned by prior_probs, with a sentence_prob attribute.

dic

A dictionary data frame as returned by read_dictionary.

alpha0

Numeric (>= 0). Strength of the prior (pseudo sample size). Larger values pull the posterior towards the prior. When alpha0 = 0, the result is purely data-driven. Default: 1.

Details

For each sign at position i in the sentence, the function computes:

The raw dictionary counts n_k for each grammar type k.
A correction factor x_k = 1 / \mathrm{sentence\_prob} for verb-like types, x_k = 1 otherwise. The corrected counts are m_k = n_k \cdot x_k with total M = \sum_k m_k.
The posterior probability (Dirichlet-Multinomial model):

\theta_k = \frac{\alpha_0 \, p_k + m_k}{\alpha_0 + M}

where p_k is the prior probability from prior_probs().

For signs not in the dictionary (M = 0), the posterior equals the prior. For signs with many observations (M \gg \alpha_0), the posterior is dominated by the data.
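
The formula can be checked by hand with a few lines of base R. This is a sketch of the arithmetic only: posterior_sketch is a hypothetical helper, and the counts, the prior, and the choice of which types count as verb-like are made-up illustration values.

```r
# Hypothetical helper implementing theta_k = (alpha0 * p_k + m_k) / (alpha0 + M).
posterior_sketch <- function(n, p, verb_like, sentence_prob, alpha0 = 1) {
  x <- ifelse(verb_like, 1 / sentence_prob, 1)  # correction factor x_k
  m <- n * x                                    # corrected counts m_k
  M <- sum(m)
  (alpha0 * p + m) / (alpha0 + M)
}

n <- c(S = 6, V = 1, A = 0)        # made-up dictionary counts n_k
p <- c(S = 0.5, V = 0.3, A = 0.2)  # made-up prior p_k
verb_like <- c(FALSE, TRUE, FALSE)

theta <- posterior_sketch(n, p, verb_like, sentence_prob = 0.25)
theta
sum(theta)

# A sign without dictionary entries (M = 0) falls back to the prior.
posterior_sketch(c(S = 0, V = 0, A = 0), p, verb_like, sentence_prob = 0.25)
```

With sentence_prob = 0.25 the single verb observation is weighted four times, so the corrected counts are 6, 4, and 0 with M = 10, and the posterior is (6.5, 4.3, 0.2) / 11.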

Value

A data frame with columns:

position: Integer. Position of the sign in the sentence.
sign_name: Character. The sign name.
cuneiform: Character. The cuneiform character.
type: Character. The grammar type (e.g., "S", "V", "Sx->S").
prob: Numeric. Posterior probability for this type at this position.
n: Numeric. Number of counts in the dictionary.

Examples

dic   <- read_dictionary()
sg    <- sign_grammar("a-ma-ru ba-ur3 ra", dic)
prior <- prior_probs(dic, sentence_prob = 0.25)
gp    <- grammar_probs(sg, prior, dic, alpha0 = 1)
print(gp)

Grammatical Structure of a Sumerian Expression

Description

Determines and visualizes the grammatical structure of a Sumerian expression. The function groups sub-expressions according to operator binding and composition rules and returns a bracketed string in which each bracket type indicates the grammatical role of the group:

() – substantive (S)
<> – verb (V)
[] – attribute (A)
{} – sentence (SEN)

The result has class "grammatical_structure" and comes with a print method that displays the bracket tree with color-coded groups in the console (requires ANSI color support).

Usage

grammatical_structure(s, type, expr = NULL)

## S3 method for class 'grammatical_structure'
print(x, ...)

Arguments

s

Character string. A Sumerian expression in cuneiform characters.

type

Character vector of grammatical types, one per sub-expression. Each entry is either a base type ("S", "V", "A") or an operator type (e.g. "Sx->V", "xS->A").

expr

Character vector of sub-expressions (e.g. the individual signs or sign groups that make up s). The function matches each expr[k] in s, determines the grouping, and returns the result with the original expressions in place.

x

An object of class "grammatical_structure".

...

Further arguments (currently unused).

Details

The grouping is performed in two stages. First, add_brackets inserts bracket groups based on operator binding strength and pairwise composition rules. Then each group is assigned a bracket type that reflects its grammatical role, as determined by the operator it contains or by the types of its elements.

The print method displays the resulting string with ANSI colors in the console. Each bracket type and its direct content (nesting level 0) are shown in a distinct color: green for (), blue for [], red for <>, and yellowish-brown for {}. Bracket pairs that contain only nested sub-groups (no bare symbols at nesting level 0) are shown in light gray.

Value

A character string of class "grammatical_structure" with typed brackets showing the grammatical grouping. On error, a plain character string containing the error message.

Examples

x <- "mec3-ki-aj2-ga-ce-er ce du"
x <- as.cuneiform(x)
x

expr <- split_sumerian(x)$signs
expr

type <- c("S", "S", "Sx->A", "xS->A",  "S",  "Sx->S", "S", "Sx->V")

grammatical_structure(x, type, expr)


# You can also work with the transliteration:

x <- "mec3 ki aj2 ga ce er ce du"
expr <- split_sumerian(x)$signs

grammatical_structure(x, type, expr)

Look Up Translations for All Substrings of a Sumerian Text

Description

Converts a Sumerian text string into cuneiform tokens, generates all contiguous substrings, and looks up the most frequent translation for each substring in one or more dictionaries.

Usage

guess_substr_info(x, dic, mapping = NULL)

Arguments

x

A character string of length 1 containing Sumerian text (transliteration, sign names, or cuneiform characters). May contain brackets as used by skeleton.

dic

A dictionary, a list of dictionaries, or a character vector of file paths to dictionary files. If file paths are given, each file is loaded with read_dictionary. Dictionaries are tried in order: the first dictionary that contains a translation for a given substring wins.

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

Details

The function performs the following steps:

If dic is a character vector of file paths, the dictionaries are loaded with read_dictionary. If dic is a single data frame, it is wrapped in a list.
The input string x is converted to cuneiform with as.cuneiform and split into individual tokens with split_sumerian.
A data frame of all contiguous substrings is created with init_substr_info.
A sign_name column is added by converting each substring expression with as.sign_name.
For each substring, the dictionaries are searched in order. The most frequent translation (highest count among rows with row_type == "trans.") from the first dictionary that contains a match is used to fill in the type and translation columns.
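
The lookup rule in the last step ("first dictionary with a match wins, then highest count") can be sketched as follows. best_translation_sketch is a hypothetical helper, and the two data frames are toy dictionaries with made-up rows that only borrow the column names sign_name, row_type, count, type, and meaning.

```r
# Hypothetical helper: search the dictionaries in order and return the most
# frequent translation from the first dictionary that has a match.
best_translation_sketch <- function(sign_name, dics) {
  for (dic in dics) {
    hit <- dic[dic$sign_name == sign_name & dic$row_type == "trans.", ,
               drop = FALSE]
    if (nrow(hit) > 0) {
      top <- hit[which.max(hit$count), ]
      return(c(type = top$type, translation = top$meaning))
    }
  }
  c(type = "", translation = "")   # no dictionary contains the sign
}

# Toy dictionaries (made-up rows for illustration).
dic1 <- data.frame(sign_name = c("AN", "AN"), row_type = "trans.",
                   count = c(2, 5), type = "S",
                   meaning = c("sky", "heaven"))
dic2 <- data.frame(sign_name = c("AN", "KI"), row_type = "trans.",
                   count = c(9, 1), type = "S",
                   meaning = c("god", "place"))

best_translation_sketch("AN", list(dic1, dic2))  # dic1 wins despite dic2's count
best_translation_sketch("KI", list(dic1, dic2))  # only found in dic2
best_translation_sketch("E2", list(dic1, dic2))  # not found
```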

Value

A data frame with one row per substring and the following columns:

start

Integer. The token position of the first token in the substring (1-based).

n_tokens

Integer. The number of tokens in the substring.

expr

Character. The concatenated cuneiform tokens of the substring.

type

Character. The grammatical type of the most frequent translation (e.g. "S", "V"), or "" if no translation was found.

translation

Character. The most frequent translation from the dictionaries, or "" if no translation was found.

sign_name

Character. The sign name representation of the substring.

The rows are ordered as in init_substr_info (by n_tokens descending, then start ascending), so that row indices can be computed with substr_position.

Examples

# Load the built-in dictionary
dic <- read_dictionary()

# Look up translations for all substrings
x <- "lugal kur-ra-ke4"
df <- guess_substr_info(x, dic)

# Show rows that have a translation
df[df$translation != "", ]

# Dictionaries can also be given as file paths
# (with several paths, order them by reliability: the first match wins)
file1 <- system.file("extdata", "sumer-dictionary.txt", package = "sumer")
df <- guess_substr_info(x, file1)

Retrieve Information About Sumerian Signs

Description

Analyzes a transliterated Sumerian text string and retrieves detailed information about each sign, including syllabic readings, sign names, cuneiform symbols, and alternative readings.

The function info computes the result and returns an object of class "info". The print method displays a summary of different text representations in the console.

Usage

info(x, mapping = NULL)

## S3 method for class 'info'
print(x, flatten = FALSE, ...)

Arguments

x

For info: a character string of length 1 containing transliterated Sumerian text.

For print.info: an object of class "info".

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

flatten

Logical. If TRUE, grammar indicators in the text are removed (such as parentheses, brackets, braces, and operators). If FALSE (the default), the original separators are preserved.

...

Additional arguments passed to the print method (currently unused).

Details

The function info performs the following steps:

Splits the input string into signs and separators using split_sumerian.
Standardizes the signs.
Looks up each sign in the mapping table based on its type:
- Type 1 (lowercase): Searches for a matching syllable reading.
- Type 2 (uppercase): Searches for a matching sign name.
- Type 3 (cuneiform): Searches for a matching cuneiform character.
Returns a data frame with the results, along with the separators stored as an attribute.

The mapping table must contain the following columns:

syllables: Comma-separated list of possible syllabic readings for the sign. The first reading is used as the default.
name: The canonical sign name in uppercase.
cuneiform: The Unicode cuneiform character.

The print method displays each sign with its name and alternative readings, followed by three text representations: syllables, sign names, and cuneiform text.

Value

info returns a data frame of class c("info", "data.frame") with one row per sign and the following columns:

reading

The syllabic reading of the sign. For lowercase input, this is the standardized input; for other types, this is the default syllable from the mapping.

sign

The Unicode cuneiform character corresponding to the sign.

name

The canonical sign name in uppercase.

alternatives

A comma-separated string of all possible syllabic readings for the sign.

The data frame has an attribute "separators" containing the separator characters between signs.

print.info prints the following to the console and returns x invisibly:

Sign table: Each sign with its cuneiform symbol, name, and alternative readings.
syllables: The text with syllabic readings, using hyphens as separators within words.
sign names: The text with sign names, using periods as separators within words.
cuneiform text: The text rendered in Unicode cuneiform characters, with hyphens and periods removed.

Note

If no custom mapping is provided, the function loads the internal mapping file included with the sumer package.

Examples

library(stringr)

# Basic usage - compute and print
info("lugal-e")

# Store the result for further processing
result <- info("an-ki")
result

# Access the underlying data frame
result$sign
result$name

# Print with and without flattened separators
result <- info("(an)na")
print(result)
print(result, flatten = TRUE)

# Using a custom mapping table
path <- system.file("extdata", "etcsl_mapping.txt", package = "sumer")
my_mapping <- read.csv2(path, sep=";", na.strings="")
info("an-ki", mapping = my_mapping)

Initialize a Data Frame of All Substrings

Description

Creates a data frame containing all contiguous substrings of a token vector, including the full token sequence itself. Each row represents one substring, with its starting position, length in tokens, the concatenated expression, and empty columns for type and translation.

The rows are ordered by n_tokens descending and start ascending, so that the row number can be computed from start and n_tokens using substr_position.

This is an internal helper function.

Usage

init_substr_info(token)

Arguments

token

A character vector of Sumerian tokens (e.g. cuneiform signs).

Details

For a token vector of length N, the function generates all N(N+1)/2 contiguous substrings. The substrings are ordered by n_tokens descending (longest first) and within each group by start ascending. This ordering ensures that the row index of any substring can be computed with the formula

row = (N - k)(N - k + 1)/2 + s

where k is the number of tokens (n_tokens) and s is the starting position (start).

The expr column contains the tokens concatenated without separators. The type and translation columns are initialized as empty strings, intended to be filled in later.
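The ordering and the row formula can be checked with a small base R sketch. The helpers all_substrings and row_index below are hypothetical stand-ins written for illustration; they are not the package's internal code, and Latin syllables are used instead of cuneiform signs.

```r
# Hypothetical re-implementation for illustration only (not the package code).
all_substrings <- function(token) {
  N <- length(token)
  rows <- list()
  for (k in N:1) {                    # longest substrings first
    for (s in 1:(N - k + 1)) {        # start positions ascending
      rows[[length(rows) + 1]] <- data.frame(
        start       = s,
        n_tokens    = k,
        expr        = paste0(token[s:(s + k - 1)], collapse = ""),
        type        = "",
        translation = "",
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

# Row index from start (s) and n_tokens (k), following the formula above
row_index <- function(start, n_tokens, N) {
  (N - n_tokens) * (N - n_tokens + 1) / 2 + start
}

token <- c("an", "ki", "a")
df <- all_substrings(token)
df

# Every row number is recovered from its start and n_tokens
all(seq_len(nrow(df)) == row_index(df$start, df$n_tokens, length(token)))
```

For N = 3 the sketch produces 3 * 4 / 2 = 6 rows, beginning with the full sequence and ending with the last single token.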

Value

A data frame with N(N+1)/2 rows and the following columns:

start

Integer. The position of the first token in the substring (1-based).

n_tokens

Integer. The number of tokens in the substring.

expr

Character. The concatenated token sequence (without separators).

type

Character. Initialized as empty string "".

translation

Character. Initialized as empty string "".

Examples

x <- "<d-nu-dim2-mud> ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"

token <- split_sumerian(as.cuneiform(x))$signs

df <- sumer:::init_substr_info(token)
df

# Verify that substr_position recovers the row indices
N <- length(token)
all(seq_len(nrow(df)) == sumer:::substr_position(df$start, df$n_tokens, N))

Look Up Sumerian Signs or Search for Translations

Description

Searches a Sumerian dictionary either by sign name (forward lookup) or by translation text (reverse lookup).

The function look_up computes the search results and returns an object of class "look_up". The print method displays formatted results with cuneiform representations, grammatical types, and translation counts.

Usage

look_up(x, dic, lang = "sumer", width = 70, mapping = NULL)

## S3 method for class 'look_up'
print(x, ...)

Arguments

x

For look_up: A character string specifying the search term. Can be either:

A Sumerian sign name (e.g., "AN", "AN.EN.ZU")
A cuneiform character string
A word or phrase to search in translations (e.g., "Gilgamesh", "heaven")

For print.look_up: An object of class "look_up" as returned by look_up.

dic

A dictionary data frame, typically created by make_dictionary or loaded with read_dictionary. Must contain columns sign_name, row_type, count, type, and meaning.

lang

Character string specifying whether x is a Sumerian expression ("sumer") or an English expression ("en").

width

Integer specifying the text width for line wrapping. Default is 70.

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

...

Additional arguments passed to the print method (currently unused).

Details

Search Modes

The function operates in two modes, depending on whether the search term is Sumerian (lang = "sumer") or English (lang = "en"):

Forward Lookup (Sumerian search term):

Converts the sign name to cuneiform
Retrieves all translations for the exact sign combination
Retrieves translations for all individual signs and substrings

Reverse Lookup (English search term):

Searches for the term in all translation meanings
Retrieves matching entries with sign names and cuneiform
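The two modes can be pictured as two different filters on the dictionary data frame. The following base R sketch uses a toy dictionary with invented counts and only the columns needed here; it illustrates the idea and is not how look_up is implemented.

```r
# Toy dictionary (counts are invented for illustration)
toy_dic <- data.frame(
  sign_name = c("AN", "AN", "A"),
  row_type  = c("trans.", "trans.", "trans."),
  count     = c(5, 2, 3),
  type      = c("S", "S", "S"),
  meaning   = c("god of heaven", "heaven", "water"),
  stringsAsFactors = FALSE
)

# Forward lookup: exact match on the sign name
forward <- toy_dic[toy_dic$sign_name == "AN", ]

# Reverse lookup: search for a term inside the translated meanings
reverse <- toy_dic[grepl("heaven", toy_dic$meaning, ignore.case = TRUE), ]

forward
reverse
```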

Output Format

The print method displays results with:

Sign names with cuneiform representations
Occurrence counts in brackets (e.g., [29])
Grammatical type abbreviations (e.g., S, V)
Translation meanings with automatic line wrapping
Search term highlighting in blue for reverse lookups (only for ANSI-compatible terminals)

Value

look_up returns an object of class "look_up", which is a list containing:

search

The original search term.

lang

The language setting used for the search.

width

The text width for formatting.

cuneiform

The cuneiform representation (only for Sumerian searches).

sign_name

The canonical sign name (only for Sumerian searches).

translations

A data frame with translations for the exact sign combination (only for Sumerian searches).

substrings

A named list of data frames with translations for individual signs and substrings (only for Sumerian searches).

matches

A data frame with matching entries (only for non-Sumerian searches).

print.look_up prints formatted dictionary entries to the console and returns x invisibly.

Examples

# Load dictionary
dic <- read_dictionary()

# Forward lookup: search by phonetic spelling
look_up("d-suen", dic)

# Forward lookup: search by Sumerian sign name
look_up("AN", dic)
look_up("AN.EN.ZU", dic)

# Forward lookup: search by cuneiform character string
AN.NA <- paste0(intToUtf8(0x1202D), intToUtf8(0x1223E))
AN.NA
look_up(AN.NA, dic)

# Reverse lookup: search in translations
look_up("Gilgamesh", dic, "en")

# Adjust output width for narrow terminals
look_up("water", dic, "en", width = 50)

# Store results for later use
result <- look_up("lugal", dic)
result$cuneiform
result$translations

# Print stored results
print(result)

Create a Sumerian Dictionary from Annotated Text Files

Description

Parses Word documents (.docx) or plain text files containing annotated Sumerian translations and creates a structured dictionary data frame. The function extracts sign names, their cuneiform representations, possible readings, and translations with grammatical types.

Usage

make_dictionary(file, mapping = NULL)

Arguments

file

A character vector of file paths to .docx or text files. Files must contain translation lines that are formatted as described below.

mapping

A data frame containing sign-to-reading mappings with columns name, cuneiform and syllables. If NULL (default), the package's built-in mapping file etcsl_mapping.txt is used.

Details

Input Format

The input files must contain lines starting with | in the following format:

|sign_name: TYPE: meaning

|equation for sign_name: TYPE: meaning

For example:

|a2-tab: S: the double amount of work performance
|me=ME: S: divine force
|AN: S: god of heaven
|na=NA: Sx->A: whose existence is bound to S

Lines not starting with | are ignored. Only the first entry in an equation of sign names is used for the dictionary. The following notation is suggested for grammatical types:

S for substantives and noun phrases (e.g., "the old man in the temple")
V for verbs and decorated verbs (e.g., "to go", "to bring the delivery into the temple")
A for adjectives, attributes and subordinate clauses that further define the subject (e.g., "who/which is weak", "whose resource for sustaining life is grain")
Sx->A for a symbol that transforms the preceding noun phrase into an attribute (e.g., "whose resource for sustaining life is S"). Other transformations are denoted accordingly.
N for numbers
D for everything else
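As a rough illustration of the input format, the following base R sketch parses the example lines shown above. The helper parse_line is hypothetical and makes simplifying assumptions (fields are split at colons, and only the first entry of an equation is kept); the package's own parser may behave differently.

```r
lines <- c(
  "A commentary line that is ignored",
  "|a2-tab: S: the double amount of work performance",
  "|me=ME: S: divine force",
  "|AN: S: god of heaven",
  "|na=NA: Sx->A: whose existence is bound to S"
)

# Keep only the translation lines
entries <- lines[startsWith(lines, "|")]

# Hypothetical helper: split one line into sign, type, and meaning
parse_line <- function(line) {
  body   <- sub("^\\|", "", line)
  fields <- strsplit(body, ":", fixed = TRUE)[[1]]
  # Only the first entry of an equation such as "me=ME" is used
  sign   <- trimws(strsplit(fields[1], "=", fixed = TRUE)[[1]][1])
  data.frame(
    sign    = sign,
    type    = trimws(fields[2]),
    # Re-join the rest in case the meaning itself contains a colon
    meaning = trimws(paste(fields[-(1:2)], collapse = ":")),
    stringsAsFactors = FALSE
  )
}

parsed <- do.call(rbind, lapply(entries, parse_line))
parsed
```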

Processing Steps

Extracts text from .docx files or reads plain text
Filters lines starting with |
Normalizes sign names and looks up possible readings from the mapping table
Aggregates translations and counts occurrences

Output Structure

For each unique sign, the output contains:

One cunei. row with the cuneiform character(s)
One reading row with possible phonetic readings
One or more trans. rows with translations, sorted by frequency
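The aggregation step can be sketched in base R as follows. The toy translations (including the second meaning of A) and the use of aggregate are assumptions made for illustration; the sketch only shows how identical translations could be counted and sorted by frequency.

```r
# Toy extracted translations; repeated rows stand for repeated occurrences
translations <- data.frame(
  sign_name = c("A", "A", "A", "A", "AN"),
  type      = c("S", "S", "S", "S", "S"),
  meaning   = c("water", "water", "water", "seed", "god of heaven"),
  stringsAsFactors = FALSE
)

# Count identical (sign_name, type, meaning) combinations
translations$count <- 1L
agg <- aggregate(count ~ sign_name + type + meaning,
                 data = translations, FUN = sum)

# Within each sign, the most frequent translation comes first
agg <- agg[order(agg$sign_name, -agg$count), ]
agg
```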

Value

A data frame with the following columns:

sign_name: The normalized Sumerian sign name (e.g., "A", "AN", "ME")
row_type: Type of entry: "cunei." (cuneiform), "reading" (phonetic readings), or "trans." (translation)
count: Number of occurrences for translations; NA for cuneiform and reading entries
type: Grammatical type (e.g., "S", "V", "Sx->A") for translations; empty for other row types
meaning: The cuneiform character(s), reading(s), or translated meaning depending on row_type

Examples


# Create a dictionary from a single text document
filename  <- system.file("extdata", "text_with_translations.txt", package = "sumer")
dict <- make_dictionary(filename)

# Use the dictionary
look_up("an", dict)

Mark N-gram Combinations in Cuneiform Text

Description

Takes a character vector of Sumerian text and marks all n-gram combinations (from ngram_frequencies) with curly braces. Longer combinations are marked first, shorter ones afterwards (including inside already-marked regions).

Usage

mark_ngrams(x, ngram, mapping = NULL)

Arguments

x

A character vector of Sumerian text (transliteration, sign names, or cuneiform). Will be converted to cuneiform internally.

ngram

A data frame as returned by ngram_frequencies, with at least columns combination and length.

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

Details

The function first converts x to cuneiform (if not already) and removes spaces and brackets ()[]{}.

Then it sorts ngram by length in descending order and replaces each occurrence of a combination with " {combination} " (space, opening brace, combination, closing brace, space).

Shorter n-grams may be marked inside already-marked longer n-grams (nesting is allowed).
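The effect of marking longer combinations first can be seen in a minimal sketch. The function mark is hypothetical, and Latin letters stand in for cuneiform signs; it only mirrors the replacement rule described above.

```r
# Hypothetical sketch: mark combinations, longest first
mark <- function(x, combos) {
  combos <- combos[order(-nchar(combos))]
  for (cmb in combos) {
    x <- gsub(cmb, paste0(" {", cmb, "} "), x, fixed = TRUE)
  }
  x
}

marked <- mark("ABCD", c("BC", "ABC"))
marked
```

Here "ABC" is marked first, and "BC" is then marked inside the already-marked region, which yields a nested result.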

Value

A character vector of cuneiform text with n-gram combinations enclosed in curly braces and surrounded by spaces.

Examples


# Load the example text of "Enki and the World Order"
path  <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding="UTF-8")
cat(text[1:10],sep="\n")

# Find combinations that appear at least 6 times in the text
freq <- ngram_frequencies(text, min_freq = 6)
freq[1:10,]

# Mark these combinations in the text
text_marked <- mark_ngrams(text, freq)
cat(text_marked[1:10], sep="\n")

# You can enter transliterated text
x <- "kij2-sig unu2 gal d-re-e-ne-ka me-te-ac im-mi-ib-jal2"
mark_ngrams(x, freq)

# Find all occurrences of a pattern in the annotated text
term     <- "IGI.DIB.TU"
(pattern <- mark_ngrams(term, freq))
result   <- text_marked[grepl(pattern, text_marked, fixed=TRUE)]
cat(result, sep="\n")

Normalize Brackets for Skeleton Generation

Description

Transforms a transliterated Sumerian text string into a normalized form that contains only round brackets. This prepares the input for hierarchical extraction by extract_skeleton_entries.

This is an internal helper function used by skeleton.

Usage

mark_skeleton_entries(x)

Arguments

x

A character string of length 1 containing transliterated Sumerian text. The string may contain angle brackets (< >), round brackets (( )), and curly braces ({ }) to annotate token groups (see Details).

Details

The function performs the following transformations:

Tokenizes the input using an internal helper function. Tokens enclosed in angle brackets are merged into a single token.
Removes angle brackets from the separators, replacing them with spaces. Curly braces are preserved.
Wraps every token that is not already enclosed in round brackets with round brackets.

The result is a string in which every token is enclosed in round brackets. Existing round brackets from the input are preserved, so the nesting structure reflects the grouping specified in the original input.

For example, the input

"<d-nu-dim2-mud> ki a. jal2 (e2{kur}) ra"

is transformed into a string where d-nu-dim2-mud appears as a single bracketed token, e2 and kur are individually bracketed inside the existing round brackets around e2{kur}, and all other tokens (ki, a, jal2, ra) are each wrapped in their own round brackets.

Value

A character string of length 1 in which all tokens are enclosed in round brackets. Angle brackets are removed; curly braces from the input are preserved.

Examples

# Input with all three bracket types
x <- "<d-nu-dim2-mud> ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"
sumer:::mark_skeleton_entries(x)

# Input without any brackets: each token gets wrapped in round brackets
sumer:::mark_skeleton_entries("LUGAL.E")

# Angle brackets merge tokens into a single unit
sumer:::mark_skeleton_entries("<an-ki> lugal")

Frequency Analysis of Cuneiform Sign Combinations (N-grams)

Description

Analyzes a Sumerian text for frequently occurring cuneiform sign combinations (n-grams). The input can be either cuneiform text or transliterated text (which is automatically converted to cuneiform via as.cuneiform). The analysis starts with the longest combinations and works down to single signs, masking already-counted occurrences to avoid reporting subsequences that are only frequent because they are part of a longer frequent combination. N-grams are searched within lines only (not across line boundaries).

Usage

ngram_frequencies(x, min_freq = c(6, 4, 2), mapping = NULL)

Arguments

x

Character vector whose elements are the lines of a Sumerian text. The input can be either cuneiform characters or transliterated text. If no cuneiform characters (U+12000 to U+1254F) are detected, the input is automatically converted using as.cuneiform. Lines starting with # are treated as comments and ignored. Optional line numbers at the beginning of a line (e.g., "42)\t") are automatically removed. Spaces are removed before tokenization.

min_freq

Integer vector specifying minimum frequencies (default: c(6, 4, 2)). The i-th value specifies the minimum frequency for combinations of length i. For lengths beyond the vector's length, the last value is used.

The default c(6, 4, 2) means: single signs must occur at least 6 times, pairs at least 4 times, and all longer combinations at least 2 times.
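The rule for lengths beyond the end of the vector can be written as a one-line lookup. The helper threshold_for is a hypothetical name used only for this illustration.

```r
# Minimum frequency for combinations of a given length;
# lengths beyond the vector reuse its last value
threshold_for <- function(len, min_freq = c(6, 4, 2)) {
  min_freq[pmin(len, length(min_freq))]
}

thresholds <- threshold_for(1:5)
thresholds
```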

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

Details

A “sign” is defined as either a single cuneiform Unicode character (U+12000 to U+1254F) or a character sequence enclosed in mathematical angle brackets (U+27E8 ... U+27E9), which is treated as a single token. All other characters (spaces, X, numbers, punctuation, etc.) are skipped during tokenization.

The maximum n-gram length is automatically determined as the length of the longest tokenized line in the input.

The analysis proceeds from the longest combinations down to single signs. When a combination is identified as frequent (i.e., meets the minimum frequency threshold), all occurrences except the first are masked before continuing with shorter combinations. This prevents subsequences from being reported as frequent when their frequency is solely due to a longer frequent combination.
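A simplified sketch of counting n-grams of one fixed length within lines is given below. It deliberately omits the masking step and the cuneiform tokenization, uses Latin letters as stand-ins for signs, and the helper count_ngrams is hypothetical.

```r
# Count n-grams of length n within each tokenized line (no masking)
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste0(tok[i:(i + n - 1)], collapse = ""),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Two toy lines; n-grams never span the line boundary
toy_lines <- list(c("A", "B", "C"), c("A", "B", "D"))
pairs <- count_ngrams(toy_lines, 2)
pairs
```

The pair "CA" is not counted because C ends the first line and A begins the second.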

Value

A data frame with three columns, sorted by descending length, then descending frequency:

frequency

Integer. The number of occurrences of the combination.

length

Integer. The number of signs in the combination.

combination

Character. The cuneiform sign combination (e.g., "\U0001202D\U00012097\U000120A0").

Examples

# Read the text "Enki and the World Order"

path  <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding="UTF-8")

cat(text[1:10],sep="\n")

# Find combinations that appear at least 6 times in the text
freq <- ngram_frequencies(text, min_freq = 6)

freq[1:10,]

Stacked Bar Chart of Grammatical Type Frequencies

Description

Creates a stacked bar chart from the output of sign_grammar or grammar_probs. Each bar represents one sign position in the sentence. The colours indicate the relative frequency or posterior probability of each individual grammatical type.

Usage

plot_sign_grammar(sg,
                  output_file = NULL,
                  width       = 10,
                  height      = 5,
                  sign_names  = FALSE,
                  font_family = NULL,
                  mapping     = NULL)

Arguments

sg

A data frame as returned by sign_grammar (with column n) or grammar_probs (with column prob).

output_file

Character. File path for saving the plot (PNG or JPG). If NULL (default), the plot is displayed on the current device.

width

Numeric. Plot width in inches. Default: 10.

height

Numeric. Plot height in inches. Default: 5.

sign_names

Logical. Whether sign names or cuneiform characters should be used as labels of the x-axis. Default: FALSE.

font_family

Character. Font family for cuneiform x-axis labels. If NULL (default), a cuneiform-capable font is detected automatically.

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

Details

When the input comes from sign_grammar() (column n), absolute frequencies are converted to percentages so that bars sum to 100%. When the input comes from grammar_probs() (column prob), posterior probabilities are used directly.
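The conversion to percentages can be illustrated with a toy data frame that has the columns position, type, and n. This is a sketch of the arithmetic only, under the assumption that every position has at least one non-zero count.

```r
# Toy input in the shape of sign_grammar() output (reduced to three columns)
sg_toy <- data.frame(
  position = c(1, 1, 2, 2),
  type     = c("S", "V", "S", "V"),
  n        = c(3, 1, 0, 4)
)

# Percentage of each type within its sign position
sg_toy$percent <- 100 * sg_toy$n / ave(sg_toy$n, sg_toy$position, FUN = sum)
sg_toy
```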

Colours are assigned per grammatical type, grouped by class:

Red shades: Verbs (V) and operators returning verbs
Blue shades: Operators returning attributes A
Orange: Adjectives and other signs with grammatical type (Sx->S)
Green: Nouns
Grey/other shades: All other types

Value

Invisibly returns the ggplot2 plot object.

Examples

dic   <- read_dictionary()
sg    <- sign_grammar("a-ma-ru ba-ur3 ra", dic)

# Plot raw frequencies
file <- file.path(tempdir(), "test.png")
plot_sign_grammar(sg, file)

# Plot probabilities
prior <- prior_probs(dic, sentence_prob = 0.25)
gp    <- grammar_probs(sg, prior, dic, alpha0 = 1)
file  <- file.path(tempdir(), "test2.png")
plot_sign_grammar(gp, file)

Prior Probabilities of Grammatical Types

Description

Computes prior probabilities for each grammatical type (e.g., S, V, Sx->S, xS->A, etc.) from a dictionary. The priors can be corrected for verb underrepresentation in the dictionary data.

Usage

prior_probs(dic, sentence_prob = 1.0)

Arguments

dic

A dictionary data frame as returned by read_dictionary.

sentence_prob

Numeric in (0, 1]. The estimated proportion of complete sentences (as opposed to noun phrases) in the training data from which the dictionary was created. Verbs appear in complete sentences, so a value less than 1 upweights verb-like types. Default: 1.0.

Details

The function proceeds in three steps:

For each single-sign dictionary entry with at least one count, the counts per grammatical type are normalised to sum to 1.
The prior probability of each type is the mean of these normalised frequencies across all signs.
A correction is applied: counts of verb-like types (V and all operators with return type V, such as Vx->V or xV->V) are multiplied by 1/sentence_prob, then all probabilities are renormalised. This compensates for the fact that verbs are underrepresented when most dictionary entries are obtained from noun phrases rather than complete sentences.

When sentence_prob = 1, no correction is applied.
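The three steps can be followed on a toy count matrix with one row per sign and one column per type. The counts are invented, and applying the correction to the averaged probabilities is an assumption of this sketch; the package's implementation may differ in detail.

```r
# Toy counts per sign (rows) and grammatical type (columns)
counts <- rbind(A  = c(S = 3, V = 1),
                AN = c(S = 4, V = 0),
                DU = c(S = 0, V = 2))

# Step 1: normalise the counts of each sign to sum to 1
norm <- counts / rowSums(counts)

# Step 2: average the normalised frequencies across signs
prior <- colMeans(norm)

# Step 3: upweight verb-like types (V or return type V) and renormalise
sentence_prob <- 0.25
verb_like <- grepl("(^V$)|(->V$)", names(prior))
prior[verb_like] <- prior[verb_like] / sentence_prob
prior <- prior / sum(prior)
prior
```

With sentence_prob = 0.25 the weight of V is multiplied by 4 before renormalisation, so V ends up with the larger prior in this toy example.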

Value

A named numeric vector with one element per grammatical type found in the dictionary, summing to 1. The names are the type strings as they appear in the dictionary (e.g., "S", "V", "Sx->S"). The sentence_prob parameter is stored as an attribute.

Examples

dic   <- read_dictionary()

# Default usage
prior_probs(dic)

# Applying correction (only 25% sentences in training data)
prior_probs(dic, sentence_prob = 0.25)

Read a Sumerian Dictionary from File

Description

Reads a Sumerian dictionary from a semicolon-separated text file, optionally displaying the metadata header with author, version, and update information.

Usage

read_dictionary(file = NULL, verbose = TRUE)

Arguments

file

A character string specifying the path to the dictionary file. If NULL (default), the package's built-in dictionary sumer-dictionary.txt is loaded.

verbose

Logical. If TRUE (default), the metadata header (author, year, version, URL) is printed to the console.

Details

File Format

The function expects a semicolon-separated file with a metadata header. Lines starting with # are treated as comments. The expected format is:

###---------------------------------------------------------------
###                Sumerian Dictionary
###
### Author:  Robin Wellmann
### Year:    2026
### Version: 0.5
### Watch for Updates:
###   https://founder-hypothesis.com/en/sumerian-mythology/downloads/
###---------------------------------------------------------------
sign_name;row_type;count;type;meaning
A;cunei.;;;<here would be the cuneiform sign for A>
A;reading;;;{a, dur5, duru5}
A;trans.;3;S;water

Encoding

The file is read with UTF-8 encoding to properly handle cuneiform characters.
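To see how such a file maps onto a data frame, the following sketch writes a tiny file in this format and reads it back with read.table. The file content and the author name are invented, and read.table with comment.char = "#" is only one possible way to parse the format, not necessarily what read_dictionary does.

```r
# Write a tiny dictionary file with a metadata header
path <- tempfile(fileext = ".txt")
writeLines(c(
  "### Author:  Jane Doe",
  "### Version: 0.1",
  "sign_name;row_type;count;type;meaning",
  "A;reading;;;{a, dur5, duru5}",
  "A;trans.;3;S;water"
), path)

# Header lines start with # and are skipped as comments
toy <- read.table(path, sep = ";", header = TRUE, comment.char = "#",
                  quote = "", stringsAsFactors = FALSE, encoding = "UTF-8")
toy
```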

Value

A data frame with the following columns:

sign_name: The Sumerian sign name (e.g., "A", "AN", "ME")
row_type: Type of entry: "cunei." (cuneiform character), "reading" (phonetic readings), or "trans." (translation)
count: Number of occurrences for translations; NA for cuneiform and reading entries
type: Grammatical type (e.g., "S", "V") for translations; empty string for other row types
meaning: The cuneiform character(s), phonetic reading(s), or translated meaning depending on row_type

Examples

# Load the built-in dictionary
dic <- read_dictionary()

# Load a dictionary from an explicit file path (here: the built-in file)
filename <- system.file("extdata", "sumer-dictionary.txt", package = "sumer")
dic <- read_dictionary(filename)

# Look up an entry
look_up("d-suen", dic)

Read Annotated Sumerian Translations from Text Files

Description

Reads Word documents (.docx) or plain text files containing annotated Sumerian translations and extracts sign names, grammatical types, and meanings into a structured data frame.

Usage

read_translated_text(file, mapping=NULL)

Arguments

file

A character vector of file paths to .docx or text files. Files must contain translation lines that are formatted as described below.

mapping

A data frame containing sign-to-reading mappings with columns name, cuneiform and syllables. If NULL (default), the package's built-in mapping file etcsl_mapping.txt is used.

Details

Input Format

The input files must contain lines starting with | in the following format:

|sign_name: TYPE: meaning

|equation for sign_name: TYPE: meaning

For example:

|a2-tab: S: the double amount of work performance
|me=ME: S: divine force
|AN: S: god of heaven
|na=NA: Sx->A: whose existence is bound to S

Lines not starting with | are ignored. Only the first entry in an equation of sign names is extracted. The following notation is suggested for grammatical types:

S for substantives and noun phrases (e.g., "the old man in the temple")
V for verbs and decorated verbs (e.g., "to go", "to bring the delivery into the temple")
A for adjectives, attributes and subordinate clauses that further define the subject (e.g., "who/which is weak", "whose resource for sustaining life is grain")
Sx->A for a symbol that transforms the preceding noun phrase into an attribute (e.g., "whose resource for sustaining life is S"). Other transformations are denoted accordingly.
N for numbers
D for everything else

Processing Steps

Reads text from .docx files or plain text files
Filters lines starting with |
Parses each line into sign name, type, and meaning components
Normalizes transliterated text by removing separators and looking up the sign names from the mapping
Cleans meaning field by removing content after ; or | delimiters
Issues a warning for entries with missing type annotations
Excludes empty sign names from the result
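The cleaning of the meaning field can be sketched with a single regular expression. The example strings are invented, and the pattern is an assumption about how content after ; or | might be removed.

```r
# Toy meanings with trailing remarks after ";" or "|"
meaning <- c("divine force; compare the entry for ME",
             "god of heaven | uncertain",
             "water")

# Drop everything from the first ";" or "|" onwards, then trim whitespace
cleaned <- trimws(sub("[;|].*$", "", meaning))
cleaned
```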

Value

A data frame with the following columns:

sign_name: The normalized sign name with components separated by hyphens (e.g., "A", "AN", "X-NA")
type: Grammatical type (e.g., "S", "V", "A", "Sx->A")
meaning: The translated meaning of the sign

Note

If any translations have missing type annotations, the function prints a warning message listing the affected entries.

Examples


# Read translations from a single text document
filename     <- system.file("extdata", "text_with_translations.txt", package = "sumer")
translations <- read_translated_text(filename)

# View the structure
head(translations)

# Filter by grammatical type
nouns <- translations[translations$type == "S", ]
nouns

# Apply custom normalizations (here: removing the word "the")
translations$meaning <- gsub("\\bthe\\b", "", translations$meaning, ignore.case = TRUE)
translations$meaning <- trimws(gsub("\\s+", " ", translations$meaning))

# View the structure
head(translations)

# Convert the result into a dictionary
dictionary   <- convert_to_dictionary(translations)

# View the structure
head(dictionary)

Save a Sumerian Dictionary to File

Description

Saves a Sumerian dictionary data frame to a semicolon-separated text file with a metadata header containing author, year, version, and URL information.

Usage

save_dictionary(dic, file, author = "", year = "", version = "", url = "")

Arguments

dic

A dictionary data frame, typically created by make_dictionary or convert_to_dictionary. Must contain columns sign_name, row_type, count, type, and meaning.

file

A character string specifying the output file path.

author

A character string with the author name(s) for the metadata header.

year

A character string with the year of creation for the metadata header.

version

A character string with the version number for the metadata header.

url

A character string with a URL where updates can be found.

Details

Output Format

The output file consists of two parts:

A metadata header with lines starting with ###, containing author, year, version, and URL information
The dictionary data in semicolon-separated format with columns: sign_name, row_type, count, type, meaning

Example output:

###---------------------------------------------------------------
###                Sumerian Dictionary
###
### Author:  Robin Wellmann
### Year:    2026
### Version: 1.0
### Watch for Updates: https://founder-hypothesis.com/sumer/
###---------------------------------------------------------------
sign_name;row_type;count;type;meaning
A;cunei.;;;<cuneiform sign for A>
A;reading;;;{a, dur5, duru5}
A;trans.;3;S;water
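A file of this shape can be produced in base R by writing the header lines and the table to the same connection. This is a sketch under the assumption that missing counts are written as empty fields; the data and the author name are invented, and it is not the code of save_dictionary.

```r
# Toy dictionary with one reading row and one translation row
toy <- data.frame(
  sign_name = c("A", "A"),
  row_type  = c("reading", "trans."),
  count     = c(NA, 3L),
  type      = c("", "S"),
  meaning   = c("{a, dur5, duru5}", "water"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".txt")
con  <- file(path, open = "w", encoding = "UTF-8")
# Metadata header first, then the semicolon-separated table
writeLines(c("### Author:  Jane Doe", "### Version: 0.1"), con)
write.table(toy, con, sep = ";", quote = FALSE, row.names = FALSE, na = "")
close(con)

out <- readLines(path, encoding = "UTF-8")
out
```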

Value

No return value. The function is called for its side effect of writing the dictionary to a file.

Examples

# Create and save a dictionary

filename  <- system.file("extdata", "text_with_translations.txt", package = "sumer")
dictionary <- make_dictionary(filename)

save_dictionary(
  dic     = dictionary,
  file    = file.path(tempdir(), "sumerian_dictionary.txt"),
  author  = "John Doe",
  year    = "2026",
  version = "1.0",
  url     = "https://example.com/dictionary"
)

Grammatical Type Frequencies for Each Sign in a Sumerian Sentence

Description

For each cuneiform sign in a Sumerian sentence, looks up the dictionary to determine the frequency of each individual grammatical type (e.g., S, V, Sx->S, xS->A). Returns a data frame with one row per sign per grammatical type.

Usage

sign_grammar(x, dic, mapping = NULL)

Arguments

x

A single character string containing a Sumerian sentence (cuneiform, sign names, or transliteration).

dic

A dictionary data frame as returned by read_dictionary.

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

Details

The function converts the input to cuneiform, splits it into individual signs, and looks up each sign in the dictionary. For each sign, the translations are grouped by their individual type string (e.g., "S", "V", "Sx->S", "xS->A").

For each type, the dictionary count values are summed. If a translation entry has no count, it is treated as 1.

The set of types returned is the union of all types found across all signs in the sentence. Each sign gets one row per type, even if the count is 0 for that type.
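The counting rules above can be sketched on a toy translation table. The entries and counts are invented, and the construction with expand.grid is an assumption chosen for illustration, not the package's implementation.

```r
# Toy translation rows for two signs; one entry has no count
trans <- data.frame(
  sign_name = c("A", "A", "A", "RA"),
  type      = c("S", "S", "V", "Sx->A"),
  count     = c(3, NA, 1, 2),
  stringsAsFactors = FALSE
)
trans$count[is.na(trans$count)] <- 1   # a missing count is treated as 1

signs <- c("A", "RA")                  # signs of the sentence, in order
types <- unique(trans$type)            # union of all types found

# One row per sign position and type, including zero counts
grid <- expand.grid(type = types, position = seq_along(signs),
                    stringsAsFactors = FALSE)
grid$sign_name <- signs[grid$position]
grid$n <- unname(mapply(
  function(s, t) sum(trans$count[trans$sign_name == s & trans$type == t]),
  grid$sign_name, grid$type
))
grid
```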

Value

A data frame with columns:

position: Integer. Position of the sign in the sentence.
sign_name: Character. The sign name (e.g., "KA").
cuneiform: Character. The cuneiform character.
type: Character. The grammar type string (e.g., "S", "V", "Sx->S").
n: Integer. Sum of dictionary counts for this sign and this type.

Examples

dic <- read_dictionary()

# Analyse a sentence
sg <- sign_grammar("a-ma-ru ba-ur3 ra", dic)
print(sg)

# Use with cuneiform input
x <- "\U00012000\U000121AD"
print(x)
sg <- sign_grammar(x, dic)
print(sg)

Create a Translation Template for Sumerian Text

Description

Creates a structured template (skeleton) for translating Sumerian text. The template displays each token and subexpression with its syllabic reading, sign name, and cuneiform representation, providing a framework for adding translations.

The input may contain three types of brackets to control how the template is generated (see Details). Optionally, the template can be pre-filled with translations from one or more dictionaries using guess_substr_info.

The function skeleton computes the template and returns an object of class "skeleton". The print method displays the template in the console.

Usage

skeleton(x, mapping = NULL, fill = NULL, space = FALSE)

## S3 method for class 'skeleton'
print(x, ...)

Arguments

x

For skeleton: A character string of length 1 containing Sumerian text (transliteration, sign names, or cuneiform characters). Tokens may be grouped with brackets to control template generation (see Details).

For print.skeleton: An object of class "skeleton" as returned by skeleton.

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

fill

A data frame as returned by guess_substr_info, containing translations and grammatical types for all substrings of x. If provided, the template lines are pre-filled with the corresponding type and translation. If NULL (the default), the template lines are left empty.

space

Logical. If TRUE, an empty line is inserted before each entry at nesting depth 1, visually separating top-level groups. Defaults to FALSE.

...

Additional arguments passed to the print method (currently unused).

Details

The function generates a hierarchical template from a Sumerian text string. The input is first converted to cuneiform with as.cuneiform. The input string may contain three types of brackets that control how entries in the template are generated:

Angle brackets < >: The enclosed token sequence is treated as a fixed term. No individual skeleton entries are generated for the tokens inside. For example, <d-nu-dim2-mud> is treated as a single unit.
Round brackets ( ): The enclosed token sequence is a coherent term for which a single skeleton entry is generated, in addition to entries for its individual tokens. Nesting is allowed.
Curly braces { }: Ignored during skeleton generation. They can be used in the input to indicate which tokens serve as arguments to an operator, but this information is not needed for the skeleton.

In addition, a skeleton entry is generated for every individual token that does not appear inside angle brackets.

Each line in the resulting template follows the format:

|[tabs]reading=SIGN.NAME=cuneiform:type:translation

When fill is not provided, the type and translation fields are left empty:

|[tabs]reading=SIGN.NAME=cuneiform::

The template should then be filled in as follows:

Between the two colons: the grammatical type of the expression (e.g., S for noun phrases, V for verbs). See make_dictionary for details.
After the second colon: the translation.

The indentation level (number of tabs) reflects the nesting depth: top-level entries have no indentation, their sub-entries have one tab, and so on.
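The line format and the role of the indentation can be illustrated with a small formatter. The function template_line is hypothetical and only assembles a string according to the format described above.

```r
# Hypothetical helper: build one template line at a given nesting depth
template_line <- function(reading, sign_name, cuneiform,
                          depth = 0, type = "", translation = "") {
  paste0("|", strrep("\t", depth),
         reading, "=", sign_name, "=", cuneiform,
         ":", type, ":", translation)
}

# Empty top-level entry and a filled sub-entry one level down
empty_line  <- template_line("an", "AN", "\U0001202D")
filled_line <- template_line("an", "AN", "\U0001202D", depth = 1,
                             type = "S", translation = "god of heaven")
cat(empty_line, filled_line, sep = "\n")
```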

The template format is designed to be saved as a text file (.txt) or Word document (.docx), edited manually, and then used as input for make_dictionary to create a custom dictionary.

If fill is provided, the function validates that fill matches x: the cuneiform tokens of the first row in fill must be identical to the tokens of x, and the number of rows must equal N(N+1)/2 where N is the number of tokens.

Value

skeleton returns a character vector of class c("skeleton", "character") containing the template lines. The first line is the header with the full reading of the input, followed by one line per skeleton entry. If space = TRUE, empty strings are inserted as separator lines.

print.skeleton prints the template to the console (one line per element) and returns x invisibly.

Examples

# Create an empty template
x <- "<d-nu-dim2-mud> ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"
skeleton(x)

# Pre-fill the template with dictionary translations
dic <- read_dictionary()
fill <- guess_substr_info(x, dic)
skeleton(x, fill = fill)

# Use spacing to visually separate top-level groups
skeleton(x, fill = fill, space = TRUE)

Split a String into Sumerian Signs and Separators

Description

Splits a transliterated Sumerian text string into its constituent signs and the separators between them. The function recognizes three types of Sumerian sign representations: lowercase transliterations, uppercase sign names, and Unicode cuneiform characters.

Usage

split_sumerian(x)

Arguments

x

A character string containing transliterated Sumerian text.

Details

The function identifies Sumerian signs based on three patterns:

Lowercase transliterations (type 1): Sequences of lowercase letters (a-z) including special characters (ĝ, š, ...) and accented vowels (á, é, í, ú, à, è, ì, ù), optionally followed by a numeric index.
Uppercase sign names (type 2): Sequences starting with an uppercase letter, optionally followed by additional uppercase letters, digits, or the characters +, /, and ×.
Cuneiform characters (type 3): Unicode characters in the Cuneiform block (U+12000 to U+12500).

The function returns the signs and separators in a format that allows exact reconstruction of the original string using paste0(c("", signs), separators, collapse = "").
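The relationship between signs and separators can be illustrated in base R. The sketch below is not the package's implementation: split_sketch and its pattern are hypothetical and cover only ASCII transliterations and sign names (no special characters such as ĝ or š, no accented vowels, no cuneiform characters).

```r
# Hypothetical simplified splitter (not part of sumer)
split_sketch <- function(x) {
  # type 1: lowercase letters + optional index; type 2: uppercase sign name
  pattern <- "[a-z]+[0-9]*|[A-Z][A-Z0-9]*"
  m <- gregexpr(pattern, x, perl = TRUE)
  signs <- regmatches(x, m)[[1]]
  # invert = TRUE returns the text before, between, and after the matches,
  # i.e. length(signs) + 1 pieces
  separators <- regmatches(x, m, invert = TRUE)[[1]]
  types <- ifelse(grepl("^[A-Z]", signs, perl = TRUE), 2L, 1L)
  list(signs = signs, separators = separators, types = types)
}

x <- "en-DARA3.AN.na-ke4"
res <- split_sketch(x)

# Interleave signs and separators to recover the input
rebuilt <- paste0(c("", res$signs), res$separators, collapse = "")
identical(rebuilt, x)
```

Prepending the empty string to signs aligns each sign with the separator that follows it, which is why the first separator (the text before the first sign) is pasted onto "".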

Value

A list with three components:

signs

A character vector containing the extracted Sumerian signs.

separators

A character vector of length length(signs) + 1 containing the separators. The first element contains any text before the first sign, subsequent elements contain text between consecutive signs, and the last element contains any text after the final sign. Empty strings indicate no separator at that position.

types

An integer vector of the same length as signs indicating the type of each sign: 1 for lowercase transliterations, 2 for uppercase sign names, and 3 for cuneiform characters.

Examples


# Example 1
x <- "en-tarah-an-na-ke4"
result <- split_sumerian(x)
result

# Example 2
x <- "en-DARA3.AN.na-ke4"
result <- split_sumerian(x)
result

# Reconstruct the original string
paste0(c("", result$signs), result$separators, collapse = "")

Compute Row Index of a Substring in a Substring Data Frame

Description

Computes the row index of a substring in the data frame created by init_substr_info, given its starting position, its length in tokens, and the total number of tokens.

This is an internal helper function.

Usage

substr_position(start, n_tokens, N)

Arguments

start

Integer (or integer vector). The starting position of the substring (1-based).

n_tokens

Integer (or integer vector). The number of tokens in the substring.

N

Integer. The total number of tokens in the full token sequence.

Details

The data frame returned by init_substr_info is ordered by n_tokens descending and start ascending. This function computes the corresponding row index using the formula

\mathrm{row} = \frac{(N - k)(N - k + 1)}{2} + s

where k = n_tokens and s = start.

The function is vectorized: if start and n_tokens are vectors of the same length, a vector of row indices is returned.
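The formula can be checked without the package by enumerating all (n_tokens, start) pairs in the stated order. The sketch below uses only base R; row_index is a hypothetical stand-in written from the formula above, not the internal function itself.

```r
# Hypothetical re-implementation of the row formula (not the package code)
row_index <- function(start, n_tokens, N) {
  (N - n_tokens) * (N - n_tokens + 1) / 2 + start
}

N <- 4

# Enumerate substrings ordered by n_tokens descending, start ascending:
# a substring of k tokens can start at positions 1, ..., N - k + 1
grid <- do.call(rbind, lapply(N:1, function(k) {
  data.frame(n_tokens = k, start = seq_len(N - k + 1))
}))

# The formula should reproduce the row numbers 1, ..., N(N+1)/2
rows <- row_index(grid$start, grid$n_tokens, N)
all(rows == seq_len(nrow(grid)))
```

The first term counts the rows occupied by all longer substrings: there are 1 + 2 + ... + (N - k) of them, since a substring of j tokens has N - j + 1 possible starting positions.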

Value

A numeric vector of row indices (1-based).

Examples


# Create a character vector with tokens
x <- "<d-nu-dim2-mud> ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"
token <- split_sumerian(as.cuneiform(x))$signs
token

N <- length(token)

# Create a data frame with all substrings
df <- sumer:::init_substr_info(token)

# The full string (start=1, n_tokens=N) is in row 1
pos <- sumer:::substr_position(1, N, N)
pos
df$expr[pos]


# The last single token (start=N, n_tokens=1) is in the last row
pos <- sumer:::substr_position(N, 1, N)
pos
df$expr[pos]

# Vectorized call
start <- c(1, 2, 1)
n_token <- c(2, 2, 1)
pos <- sumer:::substr_position(start, n_token, N)
pos
df$expr[pos]

Interactive Translation Tool for Sumerian Text

Description

Opens an interactive Shiny gadget for translating a single line of Sumerian cuneiform text. The gadget displays four sections on a single scrollable page: n-gram patterns, context with neighbouring lines, grammar probabilities, and an interactive skeleton with dictionary lookup. When the user clicks “Done”, the function returns a skeleton object with the updated translations.

Usage

translate(x, text = NULL, dic = NULL, mapping = NULL, fill = NULL,
          min_freq = c(6, 4, 2), sentence_prob = 1.0,
          viewer = shiny::paneViewer())

Arguments

x

A single Sumerian text string (transliteration, sign names, or cuneiform), or an integer line number indexing into text.

text

A character vector containing the full text being translated (one line per element), a single string naming an existing file (which is loaded automatically with readLines()), or NULL. Lines may start with numbering like "12)\t..." or "12. ...". Required when x is an integer; optional otherwise.

dic

A dictionary (data.frame), a list of dictionaries, or a character vector of file paths to dictionary files. If file paths are given, each is loaded with read_dictionary. If NULL, the built-in dictionary is loaded via read_dictionary().

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file ‘etcsl_mapping.txt’ is loaded.

fill

A pre-computed substring info data frame (as returned by init_substr_info or guess_substr_info). If NULL, it is computed automatically via guess_substr_info.

min_freq

Minimum frequency thresholds passed to ngram_frequencies. A numeric vector where the i-th element is the minimum frequency for n-grams of length i. Default is c(6, 4, 2).

sentence_prob

Probability that a randomly chosen sign is part of a sentence with a verb, passed to prior_probs. Default is 1.0.

viewer

A Shiny viewer function that controls where the gadget window is opened. The default is shiny::paneViewer(), which uses the RStudio Viewer pane. Use shiny::browserViewer() to open in the system browser, or shiny::dialogViewer("Translate", width = 842, height = 900) to open a fixed-size dialog in RStudio.
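The line numbering accepted in text ("12)\t..." or "12. ...") can be pictured with a short base R sketch. How the package actually strips the numbering is not documented here; strip_numbering and its regular expression are assumptions made for illustration.

```r
# Hypothetical helper (not part of sumer): drop a leading "12)" or "12."
# together with the whitespace that follows it
strip_numbering <- function(lines) {
  sub("^\\s*[0-9]+[.)]\\s*", "", lines, perl = TRUE)
}

lines <- c("12)\tlugal kur-ra-ke4", "13. an ki a", "gaba jal2")
stripped <- strip_numbering(lines)
stripped
```

Lines without a numbering prefix pass through unchanged.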

Details

The gadget opens in the viewer specified by the viewer parameter (by default the RStudio Viewer pane) and displays four sections on a single scrollable page. The first three sections (N-grams, Context, Grammar) can be collapsed individually. A sticky navigation menu at the top allows jumping to each section.

N-gram Patterns: Displays a merged table of n-gram combinations that appear in the current line: n-grams of length 2 or more from the full text (controlled by min_freq), combined with shared n-grams found in neighbouring lines. A “Theme” column marks n-grams shared with the context. Frequencies refer to the full text.
Context: Shows neighbouring lines (up to 2 before and after) with frequent n-grams marked. Only available when text is provided and the line index is known.
Grammar Probabilities: Displays a bar chart of grammar probabilities for each sign in the line, computed via grammar_probs with the given sentence_prob.
Translation: The main interactive section with dictionary selection checkboxes, a bracket input field for editing the skeleton structure, an interactive skeleton display with type and translation fields, and a dictionary lookup panel. Clicking a dictionary row adopts its type and translation into the selected skeleton entry.

When the line contains multiple sentences (separated by dots in the transliteration), skeleton entries belonging to different sentences are displayed with alternating background colours.

The bracket input field allows the user to add or modify brackets (), <>, {} to control the grouping structure of the skeleton. Pressing “Update Skeleton” rebuilds the skeleton display while preserving all translations in the fill data frame.
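The role of min_freq in the N-gram section can be illustrated with a base R sketch of n-gram counting with a per-length threshold. This is not how ngram_frequencies is implemented; ngram_counts, the toy token lists, and the thresholds are all hypothetical.

```r
# Hypothetical helper (not part of sumer): count n-grams of length n
# over a list of token vectors, one vector per line
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Toy text: three lines, already split into tokens
lines <- list(c("an", "ki", "a"), c("an", "ki", "gal"), c("ki", "a"))

# The i-th element is the minimum frequency for n-grams of length i
min_freq <- c(3, 2, 1)

n <- 2
counts <- ngram_counts(lines, n)
frequent <- names(counts)[counts >= min_freq[n]]
frequent
```

Longer n-grams are rarer, which is why the default thresholds c(6, 4, 2) decrease with n-gram length.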

Value

A skeleton object (character vector of class c("skeleton", "character")), generated by calling skeleton with the final bracket string and updated fill data frame. Returns invisible(NULL) if the user closes the window without clicking “Done”.

Note

Requires the shiny package (listed in Imports) and miniUI. By default, the gadget opens in the RStudio Viewer pane. To get a resizable window or a more stable connection (e.g. when the computer may enter standby), use viewer = shiny::browserViewer().

Examples

## Not run: 
# Basic usage with a transliterated string
result <- translate("lugal kur-ra-ke4")


# Full example with package data
x <- "<d-nu-dim2-mud> ki a. jal2 (e2-kur) ra. gaba jal2. an ki a"

dict_file <- system.file("extdata", "sumer-dictionary.txt", package = "sumer")
text_file <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")

result <- translate(x,
                text = text_file,
                dic = dict_file,
                min_freq = c(6, 4, 2),
                sentence_prob = 0.25)
print(result)



# Open in system browser (resizable, survives standby)

x <- 9

result <- translate(x,
                text = text_file,
                dic = dict_file,
                min_freq = c(6, 4, 2),
                sentence_prob = 0.25,
                viewer = shiny::browserViewer())
print(result)


## End(Not run)
