String manipulation

From ConTeXt wiki

Overview

In comparison with other scripting languages the bare Lua string library lacks some very useful “features”. But before a texnician reimplements all the goodies that E knows from Eir favorite language, E should first have a look at the helper functions that ConTeXt already provides. The following section briefly introduces these extensions with regard to string manipulation.

The function name and its arguments are given as the heading; the content consists of an example followed by a short description of what the function does and, where applicable, its peculiarities. The examples are designed to work when copy-pasted into an empty Lua file and processed with context.

l-string.lua

string.esc(string)

print(string.esc([[^([\"\'])(.*)%1$","%2]]))

Returns string with all occurrences of %, ., +, -, *, ^, $, [, ], (, ), {, and } escaped with the percent symbol. Cf. string.escapedpattern.
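To make the effect concrete, here is a minimal plain-Lua sketch of such an escaping function. It is an approximation built on string.gsub, not the code from l-string.lua, and the name esc_sketch is made up for illustration.

```lua
-- Hypothetical stand-in for string.esc, written with plain Lua only.
-- Every magic character from the list above gets a percent sign in front of it.
local function esc_sketch(s)
    -- In the replacement, "%%" is a literal percent sign and "%0" is the whole match.
    return (s:gsub("[%%%.%+%-%*%^%$%[%]%(%)%{%}]", "%%%0"))
end

print(esc_sketch("1+1=2 (100%)"))

-- The escaped string can then be used as a literal pattern:
print(string.find("see file.lua", esc_sketch("file.lua")))
```

Without the escaping step, the dot in "file.lua" would act as a wildcard and also match, say, "fileXlua".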

string.unquote(string)

print(string.unquote[["Do you see any quotes around here?"]])
print(string.unquote[['Do you see any quotes around here?']])
print(string.unquote[["Do you see any quotes around here?']]) -- Doesn't match
print(string.unquote[[“Do you see any quotes around here?”]]) -- Doesn't match

Returns string with the surrounding quotes removed if and only if they are of the same kind (ASCII single or double quotes only).
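The "same kind" rule can be expressed with a back-reference in a Lua pattern. The following is a plain-Lua sketch of that idea, assuming nothing about the actual ConTeXt implementation; unquote_sketch is a hypothetical name.

```lua
-- Hypothetical stand-in for string.unquote.
local function unquote_sketch(s)
    -- (["']) captures the opening quote; %1 demands the very same character
    -- at the end of the string. If that fails, gsub leaves s untouched.
    return (s:gsub("^([\"'])(.*)%1$", "%2"))
end

print(unquote_sketch([["double"]]))
print(unquote_sketch([['single']]))
print(unquote_sketch([["mixed']])) -- opening and closing quote differ: unchanged
```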

string.quote(string)

Equivalent to string.format("%q", string).

string.count(string, pattern)

print(string.count("How many a's?", "a"))
print(string.count("How many many's?", "many"))

Returns the count of matches for pattern in string.
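Counting matches needs nothing beyond string.gmatch. The sketch below is a plain-Lua approximation (count_sketch is a hypothetical name, not the ConTeXt function) and also shows that the second argument is interpreted as a Lua pattern, not as a literal string.

```lua
-- Hypothetical stand-in for string.count.
local function count_sketch(s, pattern)
    local n = 0
    for _ in s:gmatch(pattern) do
        n = n + 1
    end
    return n
end

print(count_sketch("How many a's?", "a"))
print(count_sketch("How many many's?", "many"))
print(count_sketch("a.b.c", "%."))  -- escaped dot: counts the literal dots
print(count_sketch("a.b.c", "."))   -- bare dot: counts every character
```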

string.limit(string, max, [tail])

s = "This string is too long for our purpose."
print(string.limit(s, 15))
print(string.limit(s, 15, " …")) -- "…" seems to be three bytes long.

Returns the string capped at position max (minus the byte count of tail) with tail appended. The optional tail defaults to "...".
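The byte arithmetic behind this description can be sketched in plain Lua as follows. This is an illustration under the assumption that the cut is made on bytes, as the description suggests; limit_sketch is a hypothetical name.

```lua
-- Hypothetical stand-in for string.limit.
local function limit_sketch(s, max, tail)
    if #s <= max then
        return s                      -- short enough: returned unchanged
    end
    tail = tail or "..."
    -- Reserve room for the tail so that the result is max bytes long.
    return s:sub(1, max - #tail) .. tail
end

local s = "This string is too long for our purpose."
print(limit_sketch(s, 15))
print(#limit_sketch(s, 15))
```

Because the cut is made on bytes, a multi-byte UTF-8 character that straddles the cut position would be split in a sketch like this one.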

string.strip(string)

print(string.strip([[ I once used to be surrounded by whitespace.                             
    ]]))

Yields string with leading and trailing whitespace (spaces, horizontal and vertical tabs, newlines) removed.
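For comparison, a common plain-Lua idiom for the same task is shown below. It is a sketch, not the ConTeXt implementation, and strip_sketch is a hypothetical name.

```lua
-- Hypothetical stand-in for string.strip.
local function strip_sketch(s)
    -- %s covers spaces, tabs, newlines, vertical tabs, form feeds and carriage returns.
    -- (.-) is the shortest possible middle part, so trailing whitespace is excluded.
    return (s:match("^%s*(.-)%s*$"))
end

print("[" .. strip_sketch("  \t padded on both sides \n ") .. "]")
```

Whitespace inside the string is left alone; only the two ends are trimmed.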

string.is_empty(string)

print(string.is_empty([[]]))
print(string.is_empty([[notempty]]))
print(string.is_empty(9))

Returns true if the argument is an empty string and false for nonempty strings and numbers. (Throws an error with functions, booleans or nil.)

string.enhance(string, pattern, function)

s = "I'd like to file a complaint."
f = function () 
    return "have an argument!"
end
io.write(string.enhance(s, "file a complaint.", f))

Returns the input string with function applied to all matches for pattern. Note: string.enhance relies on gsub from the Lua string library, so unfortunately you can't pass it an LPEG pattern as second argument.
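Since the function is described as relying on gsub, the underlying mechanism can be tried with plain string.gsub and a function as replacement. The sketch below uses only the standard library; the helper name argue is made up. It also shows why the dot in the pattern deserves an escape.

```lua
local s = "I'd like to file a complaint."

-- The replacement function receives the matched text and returns its substitute.
local function argue(match)
    return "have an argument!"
end

-- With "%." only a literal full stop is accepted at the end of the match.
local result, n = s:gsub("file a complaint%.", argue)
print(result, n)

-- An unescaped "." is a wildcard, so this also matches a trailing "!".
local loose = ("file a complaint!"):gsub("file a complaint.", argue)
print(loose)
```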

string.characters(string)

n = 0
for chr in string.characters("Some bytes") do
    n = n + 1
    io.write(string.format("Nr. %2u is %s.\n", n, chr))
end

Returns an iterator over the ASCII characters in string. Note: As this relies on the string library, you should expect unwelcome results (e.g. invalid UTF-8 sequences) when using it for anything other than 7-bit ASCII.

Cf. string.utfcharacters from LuaTeX.

string.bytes(string)

n = 0
for byte in string.bytes("Some bytes") do
    n = n + 1
    io.write(string.format("Nr. %2u is %3u.\n", n, byte))
end

The same as string.characters but returns bytes as base 10 integers.

string.lpadd(string, n, character) | string.rpadd(string, n, character)

s = "Some string."
print(string.rpadd(s, 15, "r"))
print(string.lpadd(s, 15, "l"))

Pads string on the left or right, respectively, with as many copies of character as needed to reach length n. Note: character can in fact be a string of any length, which can distort the result.
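The padding arithmetic, including the distortion caused by a multi-character padding string, can be reproduced with string.rep. This is a plain-Lua sketch with hypothetical names; the default of a single space for the padding character is an assumption of the sketch.

```lua
-- Hypothetical stand-ins for string.rpadd and string.lpadd.
local function rpad_sketch(s, n, c)
    c = c or " "                  -- assumed default
    return s .. c:rep(n - #s)     -- rep yields "" when the count is not positive
end

local function lpad_sketch(s, n, c)
    c = c or " "
    return c:rep(n - #s) .. s
end

print(rpad_sketch("abc", 5, "-"))
print(lpad_sketch("abc", 5, "0"))
-- A two-byte padding string is repeated twice, so the result has 7 bytes, not 5:
print(#rpad_sketch("abc", 5, "xy"))
```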

string.escapedpattern(string) | string.partialescapedpattern(string)

print(string.escapedpattern("Some characters like *, + and ] need to be escaped for formatting."))
print(string.partialescapedpattern("Some characters like *, + and ] need to be escaped for formatting."))

string.escapedpattern escapes all occurrences of the characters -, ., +, *, %, (, ), [, and ] using percent signs (%); string.partialescapedpattern only escapes - and . using percent signs, whereas ? and * are prefixed with dots (.). The latter is used for pattern building, e.g. in trac-set.lua.


string.tohash(string)

t = string.tohash("Comma,or space,separated values")
for k,v in pairs(t) do
    print(k,v)
end

Returns a hashtable with every substring of string between spaces and commas as keys and true as values.
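The resulting data structure is what is commonly called a set. A plain-Lua sketch that builds the same kind of table with string.gmatch (tohash_sketch is a hypothetical name, not the ConTeXt code):

```lua
-- Hypothetical stand-in for string.tohash.
local function tohash_sketch(s)
    local t = {}
    -- Every maximal run of characters that are neither commas nor spaces is a key.
    for key in s:gmatch("[^, ]+") do
        t[key] = true
    end
    return t
end

local set = tohash_sketch("Comma,or space,separated values")
-- Membership tests are then plain table lookups:
print(set["space"], set["missing"])
```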

string.totable(string)

t = string.totable("Insert your favorite string here!")
for k,v in pairs(t) do
    print(k,v)
end

Returns a list of ascii characters that constitute string. Note: As this relies on LPEG's character pattern it is guaranteed to turn your multi-byte sequences into garbage!

string.tabtospace(string, [tabsize])

local t = {
    "1234567123456712345671234567",
    "a\tb\tc",
    "aa\tbb\tcc",
    "aaa\tbbb\tccc",
    "aaaa\tbbbb\tcccc",
    "aaaaa\tbbbbb\tccccc",
    "aaaaaa\tbbbbbb\tcccccc",
}
for k,v in ipairs(t) do
    print(string.tabtospace(t[k]))
end

(Modified example from the context sources.) Replaces tabs with spaces depending on position. The optional argument tabsize defaults to 7.
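"Depending on position" means that each tab is expanded up to the next tab stop, not to a fixed number of spaces. The sketch below illustrates that rule in plain Lua; it is not taken from the ConTeXt sources, and tabtospace_sketch is a hypothetical name.

```lua
-- Hypothetical stand-in for string.tabtospace.
local function tabtospace_sketch(s, tabsize)
    tabsize = tabsize or 7
    local out, col = {}, 0          -- col counts the characters emitted so far
    for ch in s:gmatch(".") do
        if ch == "\t" then
            -- Fill up to the next multiple of tabsize (always at least one space).
            local n = tabsize - col % tabsize
            out[#out + 1] = (" "):rep(n)
            col = col + n
        else
            out[#out + 1] = ch
            col = col + 1
        end
    end
    return table.concat(out)
end

print(tabtospace_sketch("a\tb\tc"))
print(tabtospace_sketch("aaaaaa\tbbbbbb\tcccccc"))
```

The sketch treats the input as a single line of single-byte characters; newlines do not reset the column counter.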

string.compactlong(string) | utilities.strings.striplong(string)

s = [[
This is 
a fairly
    long
        string
            with some
                rather
                    silly
indents.]]

print(string.compactlong(s))
print(utilities.strings.striplong(s))

string.compactlong removes newlines (dos and unix) and leading spaces from string.

utilities.strings.striplong removes leading spaces and converts dos newlines to unix newlines.

string.topattern(string, lowercase, strict)

print(string.topattern("Sudo make Me a Pattern from '*' and '-'!", false, false))
print(string.topattern("Sudo make Me a Pattern from '*' and '-'!", true,  false))
print(string.topattern("Sudo make Me a Pattern from '*' and '-'!", true,  true ))

Returns a valid pattern from string that can be used with string.find et al. The return value is essentially the same as with string.escapedpattern plus two options: the boolean lowercase specifies whether the string is to be lowercased first, whereas strict results in a whole line pattern.

l-lpeg.lua

Predefined Patterns

Dozens of the most common patterns, including hexadecimal numbers, diverse whitespace and line endings, punctuation, and even an XML path parser, are already predefined. To get an impression of what they do, you can inspect them with the following snippet:

function show_patterns (p, super)
    for i,j in pairs(p) do
        i = super and super..":"..i or i
        print(string.rpadd("=== " .. i .. " ", 80, "="))
        if type(j) == "userdata" then
            j:print()
        else -- descend into next level
            show_patterns(j, i)
        end 
    end 
end

local p = lpeg.patterns

show_patterns(p)

lpeg.anywhere(string|pattern)

str = "Fetchez la vache!"
print(lpeg.anywhere("a" ):match(str))
print(lpeg.anywhere("la"):match(str))
print(lpeg.anywhere("ac"):match(str))

Returns a pattern that matches the first occurrence of string, returning the position of the last character in the matched sequence. Keep in mind that you can pass it patterns as well:

print(lpeg.anywhere(lpeg.P"a"*lpeg.patterns.whitespace):match(str))

lpeg.splitter(delimiter, function)

wedge = function (str)
    local dict = {
        Romanes = "Romani",
           evnt = "ite",
          domvs = "domvm",
    }
    return dict[str] or ""
end

splitme = "Romanes evnt domvs"
print(lpeg.splitter(" ", wedge):match(splitme))

Returns a pattern that can be used to apply function to all substrings delimited by delimiter which can be a string or a pattern.

string.splitlines(string)

str = [[
Bravely bold Sir Robin rode forth from Camelot.
He was not afraid to die, O brave Sir Robin!
He was not at all afraid to be killed in nasty ways,
Brave, brave, brave, brave Sir Robin!
]]

for n,line in ipairs(string.splitlines(str)) do
    io.write(string.format("%u: %s\n", n, line))
end

Splits string into a list of lines where empty lines – i.e. consecutive \n's – yield the empty string.

lpeg.splitat(delimiter, [single])

str = [[
Number twenty-three. The shin.
Number twenty-four. Reginald Maudling's shin.
Number twenty-five. The brain.
Number twenty-six. Margaret Thatcher's brain.
Number twenty-seven. More naughty bits.
]]

t = {lpeg.splitat("Number", false):match(str)}
for n,element in pairs(t) do 
    element = element == "" and element .. "\n" or element
    io.write(n..": "..element) 
end

Returns a pattern that produces a list of substrings delimited by delimiter (which can be a pattern or a string). The optional boolean single determines whether the string should be split only at the first match.

string.split(string, separator) | string.checkedsplit(string, separator)

theory = [[All brontosauruses are thin at one end, much much thicker in the middle, and then thin again at the far end.]]

theorems = string.split(theory, lpeg.P", " * lpeg.P"and "^-1) 

for n, element in ipairs(theorems) do
    io.write (string.format("Theorem %u: %s\n", n, element))
end

string.split returns, as you would expect, a list of substrings of string delimited by separator. Consecutive separators result in the empty string; its counterpart string.checkedsplit does not match these sequences, returning nil instead.

Note: The corresponding pattern generators are lpeg.split and lpeg.checkedsplit.
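The treatment of consecutive separators can be illustrated without LPEG. The following plain-Lua sketch splits at a literal separator string only (it does not accept patterns, unlike the functions described above); split_sketch is a hypothetical name.

```lua
-- Hypothetical, simplified stand-in for string.split with a literal separator.
local function split_sketch(s, sep)
    assert(sep ~= "", "separator must not be empty")
    local parts, start = {}, 1
    while true do
        -- The fourth argument (true) switches pattern matching off.
        local i, j = s:find(sep, start, true)
        if not i then
            parts[#parts + 1] = s:sub(start)   -- remainder after the last separator
            break
        end
        parts[#parts + 1] = s:sub(start, i - 1)
        start = j + 1
    end
    return parts
end

for n, element in ipairs(split_sketch("thin,,thick,thin", ",")) do
    print(n, "[" .. element .. "]")
end
```

Two separators in a row leave nothing between them, which is why the second element in the example is the empty string.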

lpeg.stripper(string|pattern) | lpeg.keeper(string|pattern)

str = "A dromedary has one hump and a camel has a refreshment car, buffet, and ticket collector."
print(lpeg.stripper("aeiou")       :match(str))
print(lpeg.stripper(lpeg.P"camel "):match(str))

lpeg.stripper returns a pattern that removes either, if the argument is a string, all occurrences of every character of that string or, if the argument is a pattern, all occurrences of that pattern. Its complement, lpeg.keeper, removes anything but the string or pattern respectively. Note: lpeg.keeper does not seem to work as expected with patterns consisting of more than one byte, e.g. lpeg.P("camel").

lpeg.replacer(table)

str = "Luxury Yacht"

rep = {
    [1] = { "Luxury", "Throatwobbler"   },
    [2] = { "Yacht",  "Mangrove"        },
}

print("My name is spelled “" .. str .. "”, but it's pronounced “" .. lpeg.replacer(rep):match(str) .. "”.")

Accepts a list of pairs and returns a pattern that substitutes every occurrence of the first element of a pair with the second element of that pair. The latter can be a string, a hashtable, or a function (whatever fits with lpeg.Cs).

Note: Choose the order of elements in table with care. Because LPEG tries the alternatives of a disjunction from left to right and takes the first one that matches, the order can be crucial, as in the following example:

str = "aaababaaba"
rep1 = {
    { "a",  "x" },
    { "aa", "y" },
}

rep2 = {
    { "aa", "y" },
    { "a",  "x" },
}

print(lpeg.replacer(rep1):match(str))
print(lpeg.replacer(rep2):match(str))
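The effect of the ordering can be traced by hand with a plain-Lua imitation of ordered choice: at each position the rules are tried in list order and the first one that fits wins. This is only a sketch of the principle with a hypothetical function name; it does not use LPEG and is not how lpeg.replacer is implemented.

```lua
-- Hypothetical imitation of ordered choice over literal search strings.
local function replace_ordered_sketch(s, rules)
    local out, i = {}, 1
    while i <= #s do
        local matched = false
        for _, rule in ipairs(rules) do
            local from, to = rule[1], rule[2]
            if #from > 0 and s:sub(i, i + #from - 1) == from then
                out[#out + 1] = to
                i = i + #from
                matched = true
                break               -- first fitting rule wins, later ones are ignored
            end
        end
        if not matched then         -- no rule fits: copy one byte unchanged
            out[#out + 1] = s:sub(i, i)
            i = i + 1
        end
    end
    return table.concat(out)
end

local str = "aaababaaba"
print(replace_ordered_sketch(str, { { "a", "x" }, { "aa", "y" } }))
print(replace_ordered_sketch(str, { { "aa", "y" }, { "a", "x" } }))
```

With "a" listed first, the rule for "aa" never gets a chance, because every "a" is consumed before the longer alternative is tried.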

lpeg.firstofsplit(separator) | lpeg.secondofsplit(separator)

str = "menu = spam, spam, spam, spam, spam, baked beans, spam, spam and spam"
print(lpeg.firstofsplit (" = "):match(str))
print(lpeg.secondofsplit(" = "):match(str))

lpeg.firstofsplit returns a pattern that matches the substring up to the first occurrence of separator; its complement, generated by lpeg.secondofsplit, matches the whole rest after that, regardless of any further occurrences of separator.
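The same division into "everything before the first separator" and "everything after it" can be sketched with string.find in plain Lua. The function name is hypothetical, the separator is taken literally, and returning the whole string plus nil when no separator is found is an assumption of this sketch, not documented behaviour of the LPEG helpers.

```lua
-- Hypothetical helper: split once, at the first literal occurrence of sep.
local function first_and_rest_sketch(s, sep)
    local i, j = s:find(sep, 1, true)   -- plain find, no pattern matching
    if not i then
        return s, nil                   -- assumption: no separator, no second part
    end
    return s:sub(1, i - 1), s:sub(j + 1)
end

local first, rest = first_and_rest_sketch("menu = spam, spam = eggs", " = ")
print(first)
print(rest)   -- later occurrences of the separator stay in the second part
```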

util-prs.lua

utilities.parsers.settings_to_hash(str)

str = 'a=1, b=2, c=3'
utilities.parsers.settings_to_hash(str)
--> { a = 1, b = 2, c = 3 }

utilities.parsers.settings_to_hash takes a string of comma-separated key=value statements and returns an associative array of ["key"] = value entries. Very useful for parsing and accessing macro arguments at the Lua end.
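For simple input, the key=value parsing can be approximated with a single string.gmatch loop. The sketch below is plain Lua with a hypothetical name; it ignores braces and nested commas, which the real parser is built to handle, and it keeps every value as a string.

```lua
-- Hypothetical, much simplified stand-in for a key=value settings parser.
local function settings_to_hash_sketch(s)
    local t = {}
    -- Skip leading whitespace, then capture the key up to "=" and the value up to the next comma.
    for key, value in s:gmatch("%s*([^=,]+)=([^,]*)") do
        t[key] = value
    end
    return t
end

local settings = settings_to_hash_sketch("a=1, b=2, c=3")
print(settings.a, settings.b, settings.c)
print(type(settings.a))
```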

utilities.parsers.settings_to_array(str)

str = 'top, inmargin=2, top, {here,now}'
utilities.parsers.settings_to_array(str)
--> { "top", "inmargin=2", "top", "here,now" }

utilities.parsers.settings_to_array takes a string of comma-separated keywords and returns an array of those keywords in the order in which they appear. Duplicates are not filtered. Key=value strings are taken as a single keyword. Surrounding braces are removed.

utilities.parsers.settings_to_set(str)

str = 'top, inmargin=2, top, {here,now}'
utilities.parsers.settings_to_set(str)
--> { ["top"]=true, ["inmargin=2"]=true, ["here,now"]=true }

utilities.parsers.settings_to_set takes a string of comma-separated keywords and returns a set, i.e. a table with each keyword as key and true as value, so duplicates collapse into a single entry. Key=value strings are taken as a single keyword. Surrounding braces are removed.

Other functions in utilities.parsers

utilities.parsers.add_settings_to_array: parses and writes directly into a table
utilities.parsers.arguments_to_table: parses arguments and returns a table
utilities.parsers.array_to_string(a,sep): concatenates a with a custom sep or a comma
utilities.parsers.getparameters: writes the result of settings_to_hash to an array with a metatable. A metatable is a sort of parent: when a table is accessed, undefined values will be looked up in the metatable.
utilities.parsers.hash_to_string: turns a hash into a string, with optional strictness settings
utilities.parsers.simple_hash_to_string: concatenates the values of a hash
utilities.parsers.make_settings_to_hash_pattern: returns a parser pattern for strict, tolerant, or normal argument parsing
utilities.parsers.settings_to_hash_strict: like settings_to_hash, with strict parsing
utilities.parsers.settings_to_hash_tolerant: like settings_to_hash, with tolerant parsing
utilities.parsers.splitthousands: turns 12345678.44 into 12,345,678.44

LuaTeX

Some very useful functionality is already implemented at the lowest level. See the LuaTeX Reference for further information.

string.explode(string, [character])

str = "Amongst our weaponry are such diverse elements as fear, surprise, ruthless efficiency, and an almost fanatical devotion to the Pope, and nice red uniforms."

for _, elm in ipairs(string.explode(str, ",")) do
    print(elm)
end

Returns a list of strings from string, split at every occurrence of character (default: " +"). Appending "+" to character ignores consecutive occurrences, so that no empty strings are produced. Note: character should consist of a single byte; otherwise only the first byte is respected.

string.characters(string) | string.utfcharacters(string)

alphabet = "abcdefghijklmnopqrstuvwxyz"
alphabit = "абвгдежзийклмнопрстуфхцчшщъыьэюя"

for char in alphabet:characters()    do io.write(char .. ",") end
io.write("\n")
for char in alphabit:utfcharacters() do io.write(char .. ",") end
io.write("\n")

These iterators can be used to walk over strings character by character. They are extremely fast in comparison with equivalent LPEG patterns. For instance, when hopping once through Anna Karenina (about 3M of 2-byte UTF-8 characters), string.utfcharacters turned out to be almost twice as fast as an LPEG iterator.
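The difference between walking over bytes and walking over UTF-8 characters can be reproduced in plain Lua with string.gmatch. The sketch below is an illustration with hypothetical names, not the LuaTeX implementation, and it assumes well-formed UTF-8 input. The pattern follows the same idea as utf8.charpattern in Lua 5.3 and later: one lead byte followed by any number of continuation bytes.

```lua
-- Hypothetical stand-in for string.utfcharacters.
local function utfcharacters_sketch(s)
    -- Lead byte: ASCII (1..127) or 0xC2..0xF4; continuation bytes: 0x80..0xBF.
    return s:gmatch("[\1-\127\194-\244][\128-\191]*")
end

-- The first three Cyrillic letters, written as byte escapes (two bytes each).
local abv = "\208\176\208\177\208\178"

local nchars, nbytes = 0, 0
for _ in utfcharacters_sketch(abv) do nchars = nchars + 1 end
for _ in abv:gmatch(".") do nbytes = nbytes + 1 end
print(nchars, nbytes)
```

A byte-wise iterator hands out halves of characters for such input, which is the garbage that the notes on string.characters and string.totable warn about.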

Recipes

General: string section of the Lua wiki.

You have a useful function for string manipulation and want to share it? Do go on!

Iterator: words (string, chr)

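local function words (str, chr)
    local C, Cp, P = lpeg.C, lpeg.Cp, lpeg.P

    -- delimiter pattern; defaults to a single space
    local chr = chr and P(chr) or P" "
    -- one word, then any run of delimiters, then the position to resume from;
    -- chr^0 (rather than chr^1) also returns a last word without trailing delimiter
    local g = C((1 - chr)^1) * chr^0 * Cp()
    local pos = 1

    -- declared local so that the iterator does not leak into the global namespace
    local function iterator()
        local word, newpos = g:match(str, pos)
        pos = newpos
        return word
    end
    return iterator
end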

Iterates over substrings delimited by pattern chr (defaults to space, ignores consecutive occurrences).

Usage:

for word in words(text, " ") do
    -- pass
end

Comparison with similar iterators (Empty loop; texlua v. beta-0.62.0-2010082314; text is the aforementioned Anna Karenina, 3MB UTF-8):

ipairs(string.explode(text, " +")) : 0.262s
unicode.utf8.gmatch(text,"%w+")    : 0.363s
unicode.utf8.gmatch(text,"%S+")    : 0.384s
words(text, " ")                   : 0.448s

The results slightly differ depending on the treatment of consecutive spaces. words has the advantage that it allows for arbitrary patterns as delimiters.

String formatter

The context() function uses its own formatter, of the form context("something %Z something", object_formatted_by_Z). Below is a table of the available formatting codes.

result type      code      input type
integer          %...i     number
integer          %...d     number
unsigned         %...u     number
utf character    %...c     number
hexadecimal      %...x     number
HEXADECIMAL      %...X     number
octal            %...o     number
string           %...s     string, number
float            %...f     number
exponential      %...e     number
exponential      %...E     number
autofloat        %...g     number
autofloat        %...G     number
force tostring   %...S     any
force tostring   %Q        any
force tonumber   %N        number (strips leading zeros)
signed number    %I        number
rounded number   %r        number
0xhexadecimal    %...h     character, number
0xHEXADECIMAL    %...H     character, number
U+hexadecimal    %...u     character, number
U+HEXADECIMAL    %...U     character, number
points           %p        number in scaled points (65536sp = 1pt)
basepoints       %b        number in scaled points
table concat     %...t     table
true or false    %l        boolean
TRUE or FALSE    %L        boolean
number spaces    %...w     number
escaped XML      %!xml!    string
escaped TeX      %!tex!    string, number