Changes

15,848 bytes added , 12:24, 29 August 2010

Please correct and extend!

=Overview=

In comparison with other scripting languages the bare Lua string library
lacks some very useful “features”.
But before a texnician reimplements all the goodies that E knows from Eir
favorite language, E should first have a look at the helper functions that
ConTeXt already provides.
The following section briefly introduces these extensions with regard to
string manipulation.

The function name and its arguments are given as heading, the content is split
into, first, an example and, second, a short description of what the function
does and its peculiarities if applicable.
The examples are designed so that they should work when copy-pasted into an
empty Lua file and processed with context.

=l-string.lua=

==string.esc(string)==

<pre>
print(string.esc([[^([\"\'])(.*)%1$","%2]]))
</pre>

Returns <tt>string</tt> with all occurrences of
<tt>%</tt>,
<tt>.</tt>,
<tt>+</tt>,
<tt>-</tt>,
<tt>*</tt>,
<tt>^</tt>,
<tt>$</tt>,
<tt>[</tt>,
<tt>]</tt>,
<tt>(</tt>,
<tt>)</tt>,
<tt>{</tt>, and
<tt>}</tt>
escaped with the percent symbol.
Cf. [[#string.escapedpattern.28string.29_.7C_string.partialescapedpattern.28string.29|string.escapedpattern]].
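
The escaping described above can be approximated in plain Lua. The following is only a minimal sketch and not the code from <tt>l-string.lua</tt>; the name <tt>esc</tt> is a hypothetical stand-in.

<pre>
-- Hypothetical stand-in for string.esc (an assumption, not the ConTeXt source).
-- Prefixes each of the magic characters listed above with a percent sign.
local function esc(s)
    -- The outer parentheses drop gsub's second return value (the match count).
    return (s:gsub("[%%%.%+%-%*%^%$%[%]%(%)%{%}]", "%%%0"))
end

local needle = "1.5*(a+b)"
print(esc(needle))                                --> 1%.5%*%(a%+b%)
-- The escaped string can be used as a pattern that matches itself literally:
print(string.find("x = 1.5*(a+b)", esc(needle)))  -- prints 5 and 13
</pre>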

==string.unquote(string)==

<pre>
print(string.unquote[["Do you see any quotes around here?"]])
print(string.unquote[['Do you see any quotes around here?']])
print(string.unquote[["Do you see any quotes around here?']]) -- Doesn't match
print(string.unquote[[“Do you see any quotes around here?”]]) -- Doesn't match
</pre>

Returns <tt>string</tt> with surrounding quotes removed iff they are
of the same kind (ascii single/double quote only).
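
The "same kind" condition maps naturally onto a back reference in a Lua pattern. A minimal sketch, assuming the hypothetical name <tt>unquote</tt> (it is not claimed to be the ConTeXt implementation):

<pre>
-- Hypothetical stand-in for string.unquote.
-- %1 refers back to the opening quote, so both quotes must be identical.
local function unquote(s)
    return (s:gsub("^([\"'])(.*)%1$", "%2"))
end

print(unquote([["same kind"]]))   --> same kind
print(unquote([['same kind']]))   --> same kind
print(unquote([["mixed kind']]))  --> "mixed kind'
</pre>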

==string.quote(string)==

Equivalent to <tt>string.format("%q", string)</tt>.

==string.count(string, pattern)==

<pre>
print(string.count("How many a's?", "a"))
print(string.count("How many many's?", "many"))
</pre>

Returns the count of matches for <tt>pattern</tt> in <tt>string</tt>.
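
Keep in mind that <tt>pattern</tt> is a Lua pattern, not a literal string. The plain-Lua sketch below (the name <tt>count</tt> is hypothetical) shows the difference this makes for a magic character like the dot:

<pre>
-- Hypothetical stand-in for string.count, built on gmatch.
local function count(s, pattern)
    local n = 0
    for _ in s:gmatch(pattern) do
        n = n + 1
    end
    return n
end

print(count("How many a's?", "a"))        --> 2
print(count("How many many's?", "many"))  --> 2
print(count("1.2.3", "."))                --> 5 (an unescaped dot matches any character)
print(count("1.2.3", "%."))               --> 2
</pre>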

==string.limit(string, max, [tail])==

<pre>
s = "This string is too long for our purpose."
print(string.limit(s, 15))
print(string.limit(s, 15, " …")) -- "…" is three bytes long in UTF-8.
</pre>

Returns the string capped at position <tt>max</tt> (minus the byte
count of <tt>tail</tt>) with <tt>tail</tt> appended. The optional <tt>tail</tt>
defaults to <tt>"..."</tt>.
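
The byte arithmetic can be made explicit with a plain-Lua sketch. The name <tt>limit</tt> is hypothetical and the default tail follows the description above, not necessarily the ConTeXt source:

<pre>
-- Hypothetical stand-in for string.limit; all lengths are counted in bytes.
local function limit(s, max, tail)
    if #s <= max then
        return s                          -- short enough, nothing to cut
    end
    tail = tail or "..."
    return s:sub(1, max - #tail) .. tail  -- leave room for the tail
end

local s = "This string is too long for our purpose."
print(limit(s, 15))      --> This string ...
print(#limit(s, 15))     --> 15
print(limit("short", 15)) --> short
</pre>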

==string.strip(string)==

<pre>
print(string.strip([[ I once used to be surrounded by whitespace.
]]))
</pre>

Yields <tt>string</tt> with leading and trailing whitespace (spaces, horizontal
and vertical tabs, newlines) removed.
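
For comparison, the usual plain-Lua idiom for trimming looks as follows. This is a sketch with the hypothetical name <tt>strip</tt>; it relies on <tt>%s</tt>, which covers spaces, tabs and newlines:

<pre>
-- Hypothetical stand-in for string.strip.
-- The lazy capture (.-) keeps everything between the outer whitespace runs.
local function strip(s)
    return (s:gsub("^%s*(.-)%s*$", "%1"))
end

print("[" .. strip("  \t surrounded by whitespace \n") .. "]")
--> [surrounded by whitespace]
</pre>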

==string.is_empty(string)==

<pre>
print(string.is_empty([[]]))
print(string.is_empty([[notempty]]))
print(string.is_empty(9))
</pre>

Returns <tt>true</tt> if the argument is an empty string and
<tt>false</tt> for nonempty strings and numbers.
(Throws an error with functions, booleans or <tt>nil</tt>.)

==string.enhance(string, pattern, function)==

<pre>
s = "I'd like to file a complaint."
f = function ()
    return "have an argument!"
end
io.write(string.enhance(s, "file a complaint.", f))
</pre>

Returns the input string with <tt>function</tt> applied to all matches
for <tt>pattern</tt>.
''Note'': <tt>string.enhance</tt> relies on <tt>gsub</tt> from the Lua
string library, so unfortunately you can't pass it an LPEG pattern as second
argument.

==string.characters(string)==

<pre>
n = 0
for chr in string.characters("Some bytes") do
    n = n + 1
    io.write(string.format("Nr. %2u is %s.\n", n, chr))
end
</pre>

Returns an iterator over the ascii characters in <tt>string</tt>.
''Note'': As this relies on the string library you should expect
unwelcome results (e.g. invalid utf8 sequences) when using it for
anything other than 7-bit ascii.

Cf.
[[#string.characters(string) .7C string.utfcharacters(string)|string.utfcharacters]]
from LuaTeX.

==string.bytes(string)==

<pre>
n = 0
for byte in string.bytes("Some bytes") do
    n = n + 1
    io.write(string.format("Nr. %2u is %3u.\n", n, byte))
end
</pre>

The same as <tt>string.characters</tt> but returns bytes as base 10
integers.

==string.lpadd(string, n, character) | string.rpadd(string, n, character)==

<pre>
s = "Some string."
print(string.rpadd(s, 15, "r"))
print(string.lpadd(s, 15, "l"))
</pre>

Pads <tt>string</tt> on the left or right respectively with as many
copies of <tt>character</tt> as are needed to reach length <tt>n</tt>.
''Note'': <tt>character</tt> can in fact be a string
of any length, which can distort the result.
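
Why a longer <tt>character</tt> distorts the result becomes obvious in a plain-Lua sketch. The names <tt>rpad</tt> and <tt>lpad</tt> are hypothetical, and the sketch assumes that the padding is simply repeated once per missing byte:

<pre>
-- Hypothetical stand-ins for string.rpadd and string.lpadd.
local function rpad(s, n, chr)
    return s .. string.rep(chr or " ", n - #s)
end

local function lpad(s, n, chr)
    return string.rep(chr or " ", n - #s) .. s
end

local s = "Some string."         -- 12 bytes
print(rpad(s, 15, "r"))          --> Some string.rrr
print(lpad(s, 15, "l"))          --> lllSome string.
-- A two-byte pad is repeated three times as well, overshooting the target:
print(#rpad(s, 15, "ab"))        --> 18
</pre>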

==string.escapedpattern(string) | string.partialescapedpattern(string)==

<pre>
print(string.escapedpattern("Some characters like *, + and ] need to be escaped for formatting."))
print(string.partialescapedpattern("Some characters like *, + and ] need to be escaped for formatting."))
</pre>

<tt>string.escapedpattern</tt> escapes all occurrences of the characters
<tt>-</tt>,
<tt>.</tt>,
<tt>+</tt>,
<tt>*</tt>,
<tt>%</tt>,
<tt>(</tt>,
<tt>)</tt>,
<tt>[</tt>, and
<tt>]</tt>
using percent signs (<tt>%</tt>);
<tt>string.partialescapedpattern</tt> only escapes
<tt>-</tt> and
<tt>.</tt> using percent signs whereas
<tt>?</tt> and
<tt>*</tt> are prefixed with dots (<tt>.</tt>).
The latter is used for pattern building, e.g. in [[source:trac-set.lua|trac-set.lua]].
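
The partial variant in effect turns a shell-like wildcard into a Lua pattern. A sketch of that idea in plain Lua, following the description above; the name <tt>partialescape</tt> and the replacement table are assumptions, not the ConTeXt source:

<pre>
-- Hypothetical replacement table: escape "-" and ".", put a dot before "?" and "*".
local simple = {
    ["-"] = "%-",
    ["."] = "%.",
    ["?"] = ".?",
    ["*"] = ".*",
}

local function partialescape(s)
    -- gsub accepts a table: each match is looked up and replaced by its value.
    return (s:gsub("[%-%.%?%*]", simple))
end

local pattern = partialescape("l-*.lua")
print(pattern)                                           --> l%-.*%.lua
print(("l-string.lua"):match("^" .. pattern .. "$"))     --> l-string.lua
</pre>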

==string.tohash(string)==

<pre>
t = string.tohash("Comma,or space,separated values")
for k,v in pairs(t) do
    print(k,v)
end
</pre>

Returns a hashtable with every substring of <tt>string</tt>
between spaces and commas as keys and <tt>true</tt> as values.
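
Such a table is handy for membership tests. A plain-Lua sketch with the hypothetical name <tt>tohash</tt> (not the ConTeXt implementation):

<pre>
-- Hypothetical stand-in for string.tohash.
local function tohash(s)
    local t = {}
    -- Every run of characters that are neither comma nor space becomes a key.
    for key in s:gmatch("[^, ]+") do
        t[key] = true
    end
    return t
end

local t = tohash("Comma,or space,separated values")
print(t.space)   --> true
print(t.tab)     --> nil
</pre>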

==string.totable(string)==

<pre>
t = string.totable("Insert your favorite string here!")
for k,v in pairs(t) do
    print(k,v)
end
</pre>

Returns a list of ascii characters that constitute <tt>string</tt>.
''Note'': As this relies on LPEG's character pattern it
'''is guaranteed'''
to turn your multi-byte sequences into garbage!

==string.tabtospace(string, [tabsize])==

<pre>
local t = {
    "1234567123456712345671234567",
    "a\tb\tc",
    "aa\tbb\tcc",
    "aaa\tbbb\tccc",
    "aaaa\tbbbb\tcccc",
    "aaaaa\tbbbbb\tccccc",
    "aaaaaa\tbbbbbb\tcccccc",
}
for k,v in ipairs(t) do
    print(string.tabtospace(t[k]))
end
</pre>

(Modified example from [[source:l-string.lua|the context sources]].)
Replaces tabs with spaces depending on position. The optional argument
<tt>tabsize</tt> defaults to <em>7</em>.
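
"Depending on position" means that each tab is filled up to the next tab stop. The following plain-Lua sketch illustrates this under the assumption of the default of 7 mentioned above; the name <tt>tabtospace</tt> is used for a local, hypothetical function and embedded newlines are not handled:

<pre>
-- Hypothetical stand-in for string.tabtospace.
local function tabtospace(s, tabsize)
    tabsize = tabsize or 7
    while true do
        local pos = s:find("\t", 1, true)          -- plain search for the next tab
        if not pos then break end
        -- Number of spaces needed to reach the next multiple of tabsize.
        local fill = tabsize - (pos - 1) % tabsize
        s = s:sub(1, pos - 1) .. string.rep(" ", fill) .. s:sub(pos + 1)
    end
    return s
end

print("1234567123456712345671234567")
print(tabtospace("a\tb\tc"))     -- b starts in column 8, c in column 15
print(tabtospace("aa\tbb\tcc"))  -- bb starts in column 8, cc in column 15
</pre>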

==string.compactlong(string) | string.striplong(string)==

<pre>
s = [[
This is
      a fairly
   long
            string
  with some
         rather
    silly
 indents.]]

print(string.compactlong(s))
print(string.striplong(s))
</pre>

<tt>string.compactlong</tt> removes newlines (dos and unix) and
leading spaces from <tt>string</tt>.

<tt>string.striplong</tt> removes leading spaces and converts dos
newlines to unix newlines.

==string.topattern(string, lowercase, strict)==

<pre>
print(string.topattern("Sudo make Me a Pattern from '*' and '-'!", false, false))
print(string.topattern("Sudo make Me a Pattern from '*' and '-'!", true, false))
print(string.topattern("Sudo make Me a Pattern from '*' and '-'!", true, true ))
</pre>

Returns a valid pattern from <tt>string</tt> that can be used with
<tt>string.find</tt> et al.
The return value is essentially the same as with
<tt>string.escapedpattern</tt> plus two options:
the boolean <tt>lowercase</tt> specifies whether the string is to be
lowercased first,
whereas <tt>strict</tt> results in a whole line pattern.

=l-lpeg.lua=

==Predefined Patterns==

Dozens of the most common patterns including
hexadecimal numbers, diverse whitespace and line endings, punctuation,
and even an XML path parser etc. are already predefined.
To get an impression of what they do you can inspect them with the
following snippet:

<pre>
function show_patterns (p, super)
    for i,j in pairs(p) do
        i = super and super..":"..i or i
        print(string.rpadd("=== " .. i .. " ", 80, "="))
        if type(j) == "userdata" then
            j:print()
        else -- descend into next level
            show_patterns(j, i)
        end
    end
end

local p = lpeg.patterns

show_patterns(p)
</pre>

==lpeg.anywhere(string|pattern)==

<pre>
str = "Fetchez la vache!"
print(lpeg.anywhere("a" ):match(str))
print(lpeg.anywhere("la"):match(str))
print(lpeg.anywhere("ac"):match(str))
</pre>

Returns a pattern that matches the first occurrence of
<tt>string</tt>, returning the position of the last character in the
matched sequence.
Keep in mind that you can pass it patterns as well:

<pre>
print(lpeg.anywhere(lpeg.P"a"*lpeg.patterns.whitespace):match(str))
</pre>

==lpeg.splitter(delimiter, function)==

<pre>
wedge = function (str)
    local dict = {
        Romanes = "Romani",
        evnt    = "ite",
        domvs   = "domvm",
    }
    return dict[str] or ""
end

splitme = "Romanes evnt domvs"
print(lpeg.splitter(" ", wedge):match(splitme))
</pre>

Returns a pattern that can be used to apply <tt>function</tt> to all
substrings delimited by <tt>delimiter</tt> which can be a string or a
pattern.

==string.splitlines(string)==

<pre>
str = [[
Bravely bold Sir Robin rode forth from Camelot.
He was not afraid to die, O brave Sir Robin!
He was not at all afraid to be killed in nasty ways,
Brave, brave, brave, brave Sir Robin!
]]

for n,line in ipairs(string.splitlines(str)) do
    io.write(string.format("%u: %s\n", n, line))
end
</pre>

Splits <tt>string</tt> into a list of lines where empty lines – i.e.
consecutive <tt>\n</tt>'s – yield the empty string.
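
The treatment of empty lines can be mimicked in plain Lua. This is a sketch with the hypothetical name <tt>splitlines</tt>; how the ConTeXt function treats a trailing newline is not covered here, the sketch simply reports one more (empty) line in that case:

<pre>
-- Hypothetical stand-in for string.splitlines.
local function splitlines(s)
    local lines = {}
    -- Append a newline so that the last line is terminated like all the others.
    for line in (s .. "\n"):gmatch("(.-)\r?\n") do
        lines[#lines + 1] = line
    end
    return lines
end

for n, line in ipairs(splitlines("one\n\nthree")) do
    print(n, "[" .. line .. "]")
end
-- prints three lines; the second one is the empty string
</pre>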

==lpeg.splitat(delimiter, [single])==

<pre>
str = [[
Number twenty-three. The shin.
Number twenty-four. Reginald Maudling's shin.
Number twenty-five. The brain.
Number twenty-six. Margaret Thatcher's brain.
Number twenty-seven. More naughty bits.
]]

t = {lpeg.splitat("Number", false):match(str)}
for n,element in pairs(t) do
    element = element == "" and element .. "\n" or element
    io.write(n..": "..element)
end
</pre>

Returns a pattern that produces a list of substrings delimited by
<tt>delimiter</tt> (which can be a pattern or a string).
The optional boolean <tt>single</tt> determines whether the string
should be split only at the first match.

==string.split(string, separator) | string.checkedsplit(string, separator)==

<pre>
theory = [[All brontosauruses are thin at one end, much much thicker in the middle, and then thin again at the far end.]]

theorems = string.split(theory, lpeg.P", " * lpeg.P" and"^-1)

for n, element in ipairs(theorems) do
    io.write (string.format("Theorem %u: %s\n", n, element))
end
</pre>

<tt>string.split</tt> returns, as you would expect, a list of
substrings of <tt>string</tt> delimited by <tt>separator</tt>.
Consecutive separators result in the empty string;
its counterpart <tt>string.checkedsplit</tt> does not match these
sequences, returning <tt>nil</tt> instead.

''Note'': The corresponding pattern generators are <tt>lpeg.split</tt>
and <tt>lpeg.checkedsplit</tt>.

==lpeg.stripper(string|pattern) | lpeg.keeper(string|pattern)==

<pre>
str = "A dromedary has one hump and a camel has a refreshment car, buffet, and ticket collector."
print(lpeg.stripper("aeiou") :match(str))
print(lpeg.stripper(lpeg.P"camel "):match(str))
</pre>

<tt>lpeg.stripper</tt> returns a pattern that removes either, if the
argument is a string, all occurrences of every character of that
string or, if the argument is a pattern, all occurrences of that pattern.
Its complement, <tt>lpeg.keeper</tt>, removes anything but the string
or pattern respectively.
''Note'': <tt>lpeg.keeper</tt> does not seem to work as expected with
patterns that match more than one byte, e.g. <tt>lpeg.P("camel")</tt>.

==lpeg.replacer(table)==

<pre>
str = "Luxury Yacht"

rep = {
    [1] = { "Luxury", "Throatwobbler" },
    [2] = { "Yacht",  "Mangrove"      },
}

print("My name is spelled “" .. str .. "”, but it's pronounced “" .. lpeg.replacer(rep):match(str) .. "”.")
</pre>

Accepts a list of pairs and returns a pattern that replaces every match
of a pair's first element with its second element.
The latter can be a string, a hashtable, or a function (whatever fits
with <tt>lpeg.Cs</tt>).

''Note'': Choose the order of elements in <tt>table</tt> with care.
LPEG's choice is ordered: the leftmost alternative that matches wins and
later ones are not tried at that position.
The order can therefore be as crucial as in the following example:

<pre>
str = "aaababaaba"
rep1 = {
    { "a",  "x" },
    { "aa", "y" },
}

rep2 = {
    { "aa", "y" },
    { "a",  "x" },
}

print(lpeg.replacer(rep1):match(str))
print(lpeg.replacer(rep2):match(str))
</pre>
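
The effect of the ordering can also be reproduced without LPEG. The following plain-Lua sketch only simulates the "first alternative that matches wins" behaviour with literal strings; <tt>replace_ordered</tt> is a hypothetical helper, not part of ConTeXt:

<pre>
-- Hypothetical simulation of an ordered choice over literal strings.
local function replace_ordered(str, rules)
    local out, pos = {}, 1
    while pos <= #str do
        local matched = false
        for _, rule in ipairs(rules) do            -- try the rules in list order
            local from, to = rule[1], rule[2]
            if str:sub(pos, pos + #from - 1) == from then
                out[#out + 1] = to
                pos = pos + #from
                matched = true
                break                              -- later rules are never tried
            end
        end
        if not matched then                        -- copy unmatched bytes verbatim
            out[#out + 1] = str:sub(pos, pos)
            pos = pos + 1
        end
    end
    return table.concat(out)
end

local str = "aaababaaba"
print(replace_ordered(str, { { "a", "x" }, { "aa", "y" } }))  --> xxxbxbxxbx
print(replace_ordered(str, { { "aa", "y" }, { "a", "x" } }))  --> yxbxbybx
</pre>

With <tt>"a"</tt> listed first, the rule for <tt>"aa"</tt> never gets a chance to match.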

==lpeg.firstofsplit(separator) | lpeg.secondofsplit(separator)==

<pre>
str = "menu = spam, spam, spam, spam, spam, baked beans, spam, spam and spam"
print(lpeg.firstofsplit (" = "):match(str))
print(lpeg.secondofsplit(" = "):match(str))
</pre>

<tt>lpeg.firstofsplit</tt> returns a pattern that matches the
substring up to the first occurrence of <tt>separator</tt>. Its
complement, generated by <tt>lpeg.secondofsplit</tt>, matches everything
after that occurrence, regardless of any further occurrences of <tt>separator</tt>.
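
For literal separators the same split can be sketched in plain Lua with a plain-text <tt>find</tt>. The helper name <tt>split_once</tt> is hypothetical:

<pre>
-- Hypothetical helper: returns the parts before and after the first separator.
local function split_once(s, separator)
    local from, to = s:find(separator, 1, true)   -- true: no pattern matching
    if not from then
        return s, nil                             -- separator not found
    end
    return s:sub(1, from - 1), s:sub(to + 1)
end

local first, second = split_once("menu = spam = baked beans", " = ")
print(first)   --> menu
print(second)  --> spam = baked beans
</pre>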

=LuaTeX=

Some very useful functionality is already implemented at the lowest
level.
See the [http://www.luatex.org/svn/trunk/manual/luatexref-t.pdf LuaTeX Reference] for further information.

==string.explode(string, [character])==

<pre>
str = "Amongst our weaponry are such diverse elements as fear, surprise, ruthless efficiency, and an almost fanatical devotion to the Pope, and nice red uniforms."

for _, elm in ipairs(string.explode(str, ",")) do
    print(elm)
end
</pre>

Returns a list of strings from <tt>string</tt> split at every
occurrence of <tt>character</tt> (default: space).
Adding <tt>"+"</tt> to <tt>character</tt> ignores consecutive occurrences (not
producing the empty string).
''Note'': <tt>character</tt> should be a single byte; otherwise only
the first byte is respected.
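
The difference the <tt>"+"</tt> makes can be illustrated with a plain-Lua sketch. It is only a model of the behaviour described above, with the hypothetical name <tt>explode</tt>; the separator has to be given explicitly and edge cases (e.g. separators at the very start of the string) may differ from LuaTeX:

<pre>
-- Hypothetical model of string.explode for a one-byte separator.
local function explode(s, sep)
    local chr = sep:sub(1, 1)                -- only the first byte is respected
    local collapse = sep:sub(2, 2) == "+"    -- "+" merges consecutive separators
    local parts, field = {}, {}
    local prev_was_sep = false
    for i = 1, #s do
        local c = s:sub(i, i)
        if c == chr then
            if not (collapse and prev_was_sep) then
                parts[#parts + 1] = table.concat(field)
                field = {}
            end
            prev_was_sep = true
        else
            field[#field + 1] = c
            prev_was_sep = false
        end
    end
    parts[#parts + 1] = table.concat(field)
    return parts
end

print(#explode("fear,,surprise", ","))   --> 3 (the middle element is empty)
print(#explode("fear,,surprise", ",+"))  --> 2
</pre>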

==string.characters(string) | string.utfcharacters(string)==

<pre>
alphabet = "abcdefghijklmnopqrstuvwxyz"
alphabit = "абвгдежзийклмнопрстуфхцчшщъыьэюя"

for char in alphabet:characters() do io.write(char .. ",") end
io.write("\n")
for char in alphabit:utfcharacters() do io.write(char .. ",") end
io.write("\n")
</pre>

These iterators can be used to walk over strings character by
character. They are extremely fast in comparison with equivalent
LPEG patterns. For instance, when hopping once through
[http://az.lib.ru/t/tolstoj_lew_nikolaewich/text_0080.shtml Anna Karenina]
(about 3M of 2-byte utf8 characters) <tt>string.utfcharacters</tt>
turned out to be almost twice as fast as an LPEG iterator.
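
Outside of LuaTeX, a comparable iterator can be sketched with <tt>gmatch</tt> and a byte-range pattern (a lead byte followed by any continuation bytes). The name <tt>utfchars</tt> is hypothetical; the sketch assumes valid UTF-8 input without NUL bytes and performs no validation:

<pre>
-- Hypothetical stand-in for string.utfcharacters in plain Lua.
local function utfchars(s)
    -- \1-\127: ascii, \194-\244: lead bytes, \128-\191: continuation bytes
    return s:gmatch("[\1-\127\194-\244][\128-\191]*")
end

local s = "абв"          -- three characters, two bytes each
print(#s)                --> 6
local n = 0
for char in utfchars(s) do
    n = n + 1
end
print(n)                 --> 3
</pre>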

=Recipes=

General: [http://lua-users.org/wiki/StringRecipes string section of the Lua wiki].

You have a useful function for string manipulation and want to share
it? Do go on!

==Iterator: words (string, chr)==

<pre>
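local function words (str, chr)
    local C, Cp, P = lpeg.C, lpeg.Cp, lpeg.P

    local chr = chr and P(chr) or P" "
    -- A word, any delimiters following it, and the position to resume at.
    -- chr^0 instead of chr^1 lets the last word match at the end of the string.
    local g = C((1 - chr)^1) * chr^0 * Cp()
    local pos = 1

    -- local, so that the iterator does not leak into the global namespace
    local function iterator()
        local word, newpos = g:match(str, pos)
        pos = newpos
        return word
    end
    return iterator
end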
</pre>

Iterates over substrings delimited by pattern <tt>chr</tt>
(defaults to space, ignores consecutive occurrences).

Usage:

<pre>
for word in words(text, " ") do
    -- pass
end
</pre>

Comparison with similar iterators (Empty loop; <tt>texlua</tt>
v. beta-0.62.0-2010082314; <tt>text</tt> is the aforementioned Anna
Karenina, 3MB UTF-8):

<pre>
ipairs(string.explode(text, " +")) : 0.262s
unicode.utf8.gmatch(text,"%w+") : 0.363s
unicode.utf8.gmatch(text,"%S+") : 0.384s
words(text, " ") : 0.448s
</pre>

The results differ slightly depending on the treatment of consecutive spaces.
<tt>words</tt> has the advantage that it allows for arbitrary patterns as
delimiters.

Phg

