bug#41970: Suggestions for corrections to Emacs and Elisp manuals


Jay Bingham-3

Information about the operators and constructs used to create regular expressions is contained in two locations in the Info manuals, one in the Emacs manual (section 15.6 Syntax of Regular Expressions), the other in the Elisp manual (section 34.3.1.1 Special Characters in Regular Expressions). The first paragraph in section 15.6 of the Emacs manual provides the justification for maintaining two versions of the material, even though the two versions contain mostly the same information. There are legitimate differences; however, not all of the differences can be attributed to the "features used mainly in Lisp programs". Here are the differences that I have noticed and that I believe should not be differences.

Section 15.6 Syntax of Regular Expressions of the Emacs manual contains descriptions of the postfix repetition operators ‘\{N\}’ and ‘\{N,M\}’. These operators are not described in section 34.3.1.1 of the Elisp manual, but are described in section 34.3.1.3 Backslash Constructs in Regular Expressions, where they are defined as ‘\{M\}’ and ‘\{M,N\}’. Since the Emacs manual also has a section for backslash constructs, 15.7 Backslash in Regular Expressions, moving the descriptions of the postfix repetition operators to section 15.7 and naming them as they are named in the Elisp manual would contribute greatly to the consistency of the two manuals. Additionally, the description of ‘\{M,N\}’ in the Elisp manual contains information not included in the Emacs manual version that would be appropriate to include there.

The terminology used in section 15.6 Syntax of Regular Expressions to describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs is inconsistent. The first paragraph and the final paragraph in the section both refer to these constructs as "a character alternative", while the paragraphs describing them call them a “character set”. In section 34.3.1.1 of the Elisp manual the phrase used consistently to describe them and refer to them is "a character alternative". It would increase the consistency of both manuals to use the same terminology to describe and refer to these constructs. A more grammatically correct phrase to describe these features would be "a set of alternative characters" (but when have programming nerds ever been that concerned with grammatical correctness?). Whatever phrase is used to describe and refer to these constructs, it should be consistent throughout both manuals (this includes the introduction to section 34.3.1.2 Character Classes in the Elisp manual).

In both section 15.6 Syntax of Regular Expressions and section 34.3.1.1 Special Characters in Regular Expressions near the end of each section is a paragraph which contains the sentence:

As a ‘\’ is not special inside a character alternative, it can never remove the special meaning of ‘-’ or ‘]’.

In both sections, in the description of the ‘[ ... ]’ construct, is a sentence which states that the characters ‘]’, ‘-’ and ‘^’ are special inside character alternatives.

Shouldn't the sentences found in both sections that are cited above include the '^' character?
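The behaviour of ‘]’, ‘-’, ‘^’ and ‘\’ inside a character alternative can be checked by evaluating a few matches. This is only a rough sketch, assuming a stock Emacs; the helper name my-alt-matches-p is made up for this example and is not part of Emacs.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-alt-matches-p (regexp string)
  "Return t if REGEXP matches the whole of STRING, else nil."
  (let ((case-fold-search nil))
    (and (string-match (concat "\\`\\(?:" regexp "\\)\\'") string) t)))

;; `]' is literal when it comes first, `-' when it comes last,
;; and `^' when it does not come first.
(my-alt-matches-p "[]a]" "]")    ; t
(my-alt-matches-p "[a-]" "-")    ; t
(my-alt-matches-p "[a^]" "^")    ; t

;; `\' is an ordinary member of the set: the regexp [\^] matches
;; either a backslash or `^'; the backslash does not quote anything.
(my-alt-matches-p "[\\^]" "\\")  ; t
(my-alt-matches-p "[\\^]" "^")   ; t

;; A leading `^' complements the set instead of standing for itself.
(my-alt-matches-p "[^a]" "a")    ; nil
```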

The construct ‘\(?NUM: ... \)’ that is described in the Elisp manual, section 34.3.1.3 Backslash Constructs in Regular Expressions, is not included in the Emacs manual section 15.7 Backslash in Regular Expressions; it should be. However, the description of the construct in section 34.3.1.3 should be modified to make it clear that only the digits 1 through 9 can be used as NUM. Here is a suggestion for doing that:

\(?DIGIT:...\)

is the explicitly numbered group construct. Normal groups get their number implicitly, based on their position, which can be inconvenient. This construct allows a specific group number (limited to the digits 1 through 9, see: ‘\DIGIT’ construct) to be assigned to the group construct. There is no particular restriction on the numbering, e.g., several groups can have the same number in which case the last one to match (i.e., the rightmost match) will be recorded. Implicitly numbered groups always get the smallest integer larger than the largest one of any previous group.

In the Emacs manual section 15.7 Backslash in Regular Expressions in the description of the ‘\D’ construct the following sentence in the second paragraph is misleading:

Then, later on in the regular expression, you can use ‘\’ followed by the digit D to mean “match the same text matched the Dth time by the ‘\( ... \)’ construct”.

This does not agree with the description in the paragraphs that surround it nor with the description of the construct in the Elisp manual, section 34.3.1.3 Backslash Constructs in Regular Expressions. This is not an error introduced in version 26, it has been present since at least version 23. It should read:

Then, later on in the regular expression, ‘\’ followed by the digit D can be used to mean “match the same text matched by the Dth ‘\( ... \)’ construct”.

In section 15.7 Backslash in Regular Expressions of the Emacs manual the descriptions for the constructs ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\w’, ‘\W’, ‘\_<’, ‘\_>’, ‘\sC’, ‘\SC’, ‘\cC’ and ‘\CC’ appear in the order shown here, while in section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual they appear in the following order: ‘\w’, ‘\W’, ‘\sCODE’, ‘\SCODE’, ‘\cC’, ‘\CC’, ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\_<’ and ‘\_>’. The Elisp order groups together the constructs that match characters, and likewise those that match the empty string relative to a position. This grouping makes much more sense than the apparently haphazard order used in the Emacs manual. The order in the Emacs manual should match that of the Elisp manual.

Also in section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual, the four constructs having placeholders, ‘\sCODE’, ‘\SCODE’, ‘\cC’ and ‘\CC’, do not use the same convention for specifying the placeholders. Either the constructs ‘\sCODE’ and ‘\SCODE’ should be written as ‘\sC’ and ‘\SC’, or the constructs ‘\cC’ and ‘\CC’ should be written as ‘\cCODE’ and ‘\CCODE’, making the convention consistent throughout the section. The same convention should be used in both the Emacs manual and the Elisp manual in all constructs where placeholders occur. I prefer the use of a mnemonic as a placeholder over the use of a single character.

Adopting this convention would necessitate changing the ‘\{M\}’, ‘\{M,N\}’ and ‘\D’ constructs as well. I suggest the following: ‘\{NUM\}’, ‘\{MIN,MAX\}’ and ‘\DIGIT’. I prefer the convention used in the online version of the Elisp manual, where placeholders are shown in lowercase italics. I do not know if that is possible to do or if it would conflict with the convention of showing placeholders in all caps that is used in function descriptions. Since it is possible to cause links to files and the names of variables to be displayed differently in function descriptions, it should not be difficult to define a mechanism for displaying placeholders in italics in function descriptions.

In section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual, in the paragraph that introduces the regular expression constructs that match the empty string, the word ‘consume’ would be more appropriate than the phrase ‘use up’.

The format of the descriptions in section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual is not consistent. I offer the following rewrite, in which I have tried to add some consistency by stating the name of each operator or construct and then describing how it is used. The corrections and improvements mentioned above are incorporated into what follows.

For the most part, ‘\’ followed by any character matches only that character. However, there are several exceptions: two-character sequences starting with ‘\’ that have special meanings. The second character in the sequence is always an ordinary character when used on its own. Here are the ‘\’ operators and constructs.

\|

is the alternative operator. Two regular expressions A and B with ‘\|’ between them form an expression that matches either the text matched by A or the text matched by B.

Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string.

‘\|’ applies to the largest possible surrounding expressions. Only a surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’.

When full backtracking capability is needed to handle multiple uses of ‘\|’, use the POSIX regular expression functions (see POSIX Regexps in the Elisp manual).
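The scope of ‘\|’ is easy to see by comparing a grouped and an ungrouped pattern. A small sketch follows; the helper my-alt-demo-match is a made-up name used only for illustration.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-alt-demo-match (regexp string)
  "Return the text that REGEXP matches in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match regexp string)
      (match-string 0 string))))

(my-alt-demo-match "foo\\|bar" "a bar")        ; "bar"

;; Ungrouped, `\|' takes everything on each side of it,
;; so this pattern means "foo" or "barx".
(my-alt-demo-match "foo\\|barx" "foox")        ; "foo"

;; Grouped, the trailing `x' applies to both alternatives.
(my-alt-demo-match "\\(foo\\|bar\\)x" "foox")  ; "foox"
```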

\{num\}

is the postfix number of repetitions operator. It specifies the exact number of consecutive repetitions that the preceding regular expression must match. For example, ‘x\{4\}’ matches only the string ‘xxxx’; ‘c[ad]\{3\}r’ matches only the eight valid strings that can be created with two characters in three places, that is the strings: ‘caaar’, ‘caadr’, ‘cadar’, ‘caddr’, ‘cdaar’, ‘cdadr’, ‘cddar’, ‘cdddr’.

\{min,max\}

is the postfix range of repetitions operator. It specifies the range of consecutive repetitions that the preceding regular expression must match, i.e. at least min times, but no more than max times. If min is omitted, the minimum is 0 and the preceding regular expression may match at most max times; if max is omitted, there is no maximum.

‘\{0,1\}’ or ‘\{,1\}’ is equivalent to ‘?’.

‘\{0,\}’ or ‘\{,\}’ is equivalent to ‘*’.

‘\{1,\}’ is equivalent to ‘+’.

For example, ‘c[ad]\{1,2\}r’ matches only the strings: ‘car’, ‘cdr’, ‘caar’, ‘cadr’, ‘cdar’, and ‘cddr’.

The maximum value allowed for num, min and max is 2**15 − 1.
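The repetition operators can be tried out by filtering a list of candidate strings. This is a minimal sketch; my-rep-full-match-p and my-rep-filter are hypothetical helpers written for this example.

```elisp
;; Hypothetical helpers (not part of Emacs).
(defun my-rep-full-match-p (regexp string)
  "Return t if REGEXP matches the whole of STRING, else nil."
  (let ((case-fold-search nil))
    (and (string-match (concat "\\`\\(?:" regexp "\\)\\'") string) t)))

(defun my-rep-filter (regexp strings)
  "Return the members of STRINGS that REGEXP matches in full."
  (delq nil (mapcar (lambda (s) (and (my-rep-full-match-p regexp s) s))
                    strings)))

(my-rep-full-match-p "x\\{4\\}" "xxxx")    ; t
(my-rep-full-match-p "x\\{4\\}" "xxxxx")   ; nil

(my-rep-filter "c[ad]\\{1,2\\}r" '("cr" "car" "cdr" "cadr" "cddr" "caddr"))
;; ("car" "cdr" "cadr" "cddr")

(my-rep-filter "c[ad]\\{3\\}r" '("cadr" "caddr" "cdddr" "cadddr"))
;; ("caddr" "cdddr")

;; `\{0,1\}' behaves like `?', and `\{1,\}' behaves like `+'.
(my-rep-filter "ab\\{0,1\\}" '("a" "ab" "abb"))   ; ("a" "ab")
(my-rep-filter "ab\\{1,\\}" '("a" "ab" "abbb"))   ; ("ab" "abbb")
```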

\( … \)

is the grouping construct that serves three purposes:

  1. To enclose a set of ‘\|’ alternatives for other operations. Thus, ‘\(foo\|bar\)x’ matches either ‘foox’ or ‘barx’.

  2. To enclose a complicated expression for the postfix operators ‘*’, ‘+’ and ‘?’ to operate on. Thus, ‘ba\(na\)*’ matches ‘bananana’, etc., with any number of (zero or more) ‘na’ strings.

  3. To record a matched substring for future reference with ‘\digit’ (described below).

This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature that is assigned as a second meaning to the same ‘\( … \)’ construct. In practice there is usually no conflict between the two meanings; when there is a conflict, a “shy” group (described below) can be used.

\(?: … \)

is the “shy” group construct. A shy group serves the first two purposes of an ordinary group (controlling the nesting of other operators), but it does not record the matched substring; it can’t be referred back to with the ‘\digit’ construct (see below). This is useful in mechanically combining regular expressions, so that groups can be added for syntactic purposes without interfering with the numbering of the groups that are meant to be referred to.
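The difference between an ordinary group and a shy group shows up in the match data. The following is an illustrative sketch only; my-group-demo is a made-up helper name.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-group-demo (regexp string)
  "Return (WHOLE GROUP1 GROUP2) for REGEXP matched in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match regexp string)
      (list (match-string 0 string)
            (match-string 1 string)
            (match-string 2 string)))))

;; Two ordinary groups: both substrings are recorded.
(my-group-demo "\\(foo\\|bar\\)\\(x+\\)" "barxx")    ; ("barxx" "bar" "xx")

;; The shy group still limits `\|' but takes no group number,
;; so `x+' becomes group 1.
(my-group-demo "\\(?:foo\\|bar\\)\\(x+\\)" "barxx")  ; ("barxx" "xx" nil)

;; A repeated group records only its last repetition.
(my-group-demo "ba\\(na\\)*" "bananana")             ; ("bananana" "na" nil)
```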

\(?digit: … \)

is the explicitly numbered group construct. Normal groups get their number implicitly, based on their position, which can be inconvenient. This construct allows a specific group number (limited to the digits 1 through 9, see: ‘\digit’ construct) to be assigned to the group construct. There is no particular restriction on the numbering, e.g., several groups can have the same number in which case the last one to match (i.e., the rightmost match) will be recorded. Implicitly numbered groups always get the smallest integer larger than the largest one of any previous group.
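Explicit numbering can be observed with match-string. This is a rough sketch; my-numbered-demo is a hypothetical helper, and the example only uses group numbers within the 1 through 9 range discussed above.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-numbered-demo (regexp string groups)
  "Match REGEXP in STRING and return the text of each group in GROUPS."
  (when (string-match regexp string)
    (mapcar (lambda (n) (match-string n string)) groups)))

;; The first group is explicitly numbered 3, so the implicit group
;; that follows gets number 4; groups 1 and 2 are never set.
(my-numbered-demo "\\(?3:[a-z]+\\)-\\([0-9]+\\)" "abc-42" '(1 3 4))
;; (nil "abc" "42")

;; Two groups may share a number; whichever one matched is recorded.
(my-numbered-demo "\\(?1:a+\\)\\|\\(?1:b+\\)" "bbb" '(1))
;; ("bbb")
```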

\digit

is the back reference operator. It matches the same text that matched the digit-th occurrence of a ‘\( … \)’ construct.

After the end of a ‘\( … \)’ construct, the matcher remembers the beginning and end of the text matched by that construct. Later in the regular expression, ‘\’ followed by the digit can be used to match the same text matched by the digit-th ‘\( … \)’ construct.

The strings matching the first nine ‘\( … \)’ constructs appearing in a regular expression are assigned numbers 1 through 9 in the order that the open-parentheses appear in the regular expression. So ‘\1’ through ‘\9’ can be used to refer to the text matched by the corresponding ‘\( … \)’ constructs.

For example, ‘\(.*\)\1’ matches any newline-free string that is composed of two identical halves. The ‘\(.*\)’ matches the first half, which may be anything, but the ‘\1’ that follows must match the same exact text.

If a ‘\( … \)’ construct matches more than once (which can easily happen if it is followed by ‘*’), only the last match is recorded.

If a particular grouping construct in the regular expression was never matched—for instance, if it appears inside of an alternative that wasn’t used, or inside of a repetition that repeated zero times—then the corresponding ‘\digit’ construct never matches anything. For example, the regexp ‘\(foo\(b*\)\|lose\)\2’ cannot match ‘lose’ because the second alternative inside the larger group matches it, which results in ‘\2’ being undefined and unable to match anything. It can match ‘foobb’, because the first alternative matches ‘foob’ and ‘\2’ matches the second ‘b’.
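The back reference examples above can be reproduced directly. A minimal sketch, with the patterns anchored by ‘\`’ and ‘\'’ so that the whole string must match; my-backref-demo is a made-up helper name.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-backref-demo (regexp string)
  "Return (WHOLE GROUP1 GROUP2) for REGEXP matched in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match regexp string)
      (list (match-string 0 string)
            (match-string 1 string)
            (match-string 2 string)))))

;; Two identical halves match; differing halves do not.
(my-backref-demo "\\`\\(.*\\)\\1\\'" "abcabc")   ; ("abcabc" "abc" nil)
(my-backref-demo "\\`\\(.*\\)\\1\\'" "abcabd")   ; nil

;; Group 2 is never matched when the `lose' alternative is taken,
;; so `\2' cannot match and the whole pattern fails.
(my-backref-demo "\\`\\(foo\\(b*\\)\\|lose\\)\\2\\'" "lose")   ; nil
(my-backref-demo "\\`\\(foo\\(b*\\)\\|lose\\)\\2\\'" "foobb")
;; ("foobb" "foob" "b")

;; A group that matches more than once records only its last match.
(my-backref-demo "\\(a\\|b\\)*" "ab")            ; ("ab" "b" nil)
```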

The following operators pertaining to words and syntax are controlled by the setting of the syntax table (See: Table of Syntax Classes).

\w

is the word-constituent operator, it matches any word-constituent character. The syntax table determines which characters these are. (See: Table of Syntax Classes)

\W

is the non-word-constituent operator, it matches any character that is not a word-constituent. (See: Table of Syntax Classes)

\scode

is the syntax class operator, it matches any character whose syntax is code. Here code is a character that designates a particular syntax class: thus, ‘w’ for word constituent, ‘-’ or ‘ ’ for whitespace, ‘.’ for ordinary punctuation, etc. (See: Table of Syntax Classes)

\Scode

is the non syntax class operator, it matches any character whose syntax is not code. (See: Table of Syntax Classes)
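Because these operators consult the syntax table, the same regexp can give different results under different tables. The sketch below assumes the standard syntax table gives ‘_’ symbol syntax and the space character whitespace syntax; my-syntax-demo is a hypothetical helper.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-syntax-demo (regexp string &optional table)
  "Return the index where REGEXP matches in STRING, or nil.
The match uses syntax TABLE, which defaults to the standard table."
  (with-temp-buffer
    (with-syntax-table (or table (standard-syntax-table))
      (let ((case-fold-search nil))
        (string-match regexp string)))))

;; Under the standard table `_' is not a word constituent.
(my-syntax-demo "\\`\\w+\\'" "foo_bar")   ; nil
(my-syntax-demo "\\s_" "foo_bar")         ; 3
(my-syntax-demo "\\s-+" "foo  bar")       ; 3
(my-syntax-demo "\\S-" "  x")             ; 2

;; A private table that gives `_' word syntax changes what `\w' matches.
(let ((table (make-syntax-table)))
  (modify-syntax-entry ?_ "w" table)
  (my-syntax-demo "\\`\\w+\\'" "foo_bar" table))   ; 0
```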

\ccode

is the character category operator, it matches any character that belongs to the category code. For example, ‘\cc’ matches Chinese characters, ‘\cg’ matches Greek characters, etc. For the description of the known categories, type ‘M-x describe-categories <RET>’. (See also: Category Characters)

\Ccode

is the non character category operator, it matches any character that does not belong to category code. (See: Category Characters)
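The category operators can be tried with the Greek category ‘g’ mentioned above. This is a small sketch, assuming the standard category table of a stock Emacs; my-category-demo is a made-up helper, and the Greek letters alpha, beta and gamma are written with \u escapes.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-category-demo (regexp string)
  "Return the text that REGEXP matches in STRING, or nil."
  (when (string-match regexp string)
    (match-string 0 string)))

;; `\cg+' picks out the run of Greek letters.
(my-category-demo "\\cg+" "abc \u03b1\u03b2\u03b3")

;; `\Cg+' skips the Greek letters and matches the Latin ones.
(my-category-demo "\\Cg+" "\u03b1\u03b2\u03b3abc")   ; "abc"
```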

The following regular expression constructs match the empty string—that is, they don't consume any characters—but whether they match depends on the context. For all, the beginning and end of the accessible portion of the buffer are treated as if they were the actual beginning and end of the buffer.

\`

is the beginning of string operator, it matches the empty string, but only at the beginning of the string or buffer (or its accessible portion) being matched against.

\'

is the end of string operator, it matches the empty string, but only at the end of the string or buffer (or its accessible portion) being matched against.

\=

is the at point operator, it matches the empty string, but only at point.
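The three position operators above can be illustrated as follows. This is only a sketch; my-anchor-demo is a hypothetical function, and the buffer text and positions are chosen purely for the example.

```elisp
;; `\`' anchors at the start of the whole string, unlike `^',
;; which also matches after a newline.
(string-match "\\`foo" "bar\nfoo")   ; nil
(string-match "^foo" "bar\nfoo")     ; 4

;; Hypothetical function (not part of Emacs).
(defun my-anchor-demo ()
  "Return the results of three searches that use point and narrowing."
  (with-temp-buffer
    (insert "foo foo bar")
    (goto-char (point-min))
    (list
     ;; Point is just before the first "foo", so `\=' can match.
     (re-search-forward "\\=foo" nil t)
     ;; Point is now before a space; "foo" occurs later, but not at point.
     (re-search-forward "\\=foo" nil t)
     ;; After narrowing to "bar", `\`' and `\'' match at the ends
     ;; of the accessible portion.
     (progn (narrow-to-region 9 12)
            (goto-char (point-min))
            (looking-at "\\`bar\\'")))))

(my-anchor-demo)   ; (4 nil t)
```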

\b

is the beginning or end of word operator, it matches the empty string, but only at the beginning or end of a word. Thus, ‘\bfoo\b’ matches any occurrence of ‘foo’ as a separate word. ‘\bballs?\b’ matches ‘ball’ or ‘balls’ as a separate word.

‘\b’ matches at the beginning or end of the buffer regardless of what text appears next to it.

\B

is the middle of word operator, it matches the empty string, but not at the beginning or end of a word.

\<

is the beginning of word operator, it matches the empty string, but only at the beginning of a word; furthermore, ‘\<’ matches at the beginning of the buffer only if a word-constituent character follows.

\>

is the end of word operator, it matches the empty string, but only at the end of a word; furthermore, ‘\>’ matches at the end of the buffer only if the contents end with a word-constituent character.
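The word-boundary operators, including their behaviour at the edges of the text, can be compared side by side. A minimal sketch under the standard syntax table; my-word-demo is a made-up helper name.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-word-demo (regexp string)
  "Return the index where REGEXP matches in STRING, or nil.
The match is made under the standard syntax table."
  (with-temp-buffer
    (with-syntax-table (standard-syntax-table)
      (let ((case-fold-search nil))
        (string-match regexp string)))))

(my-word-demo "\\bfoo\\b" "a foo b")        ; 2
(my-word-demo "\\bfoo\\b" "afoob")          ; nil
(my-word-demo "\\bballs?\\b" "two balls")   ; 4
(my-word-demo "\\Boo" "foo")                ; 1
(my-word-demo "\\<" "  foo")                ; 2
(my-word-demo "\\>" "foo  ")                ; 3

;; At the very start of the text `\b' matches whatever follows,
;; while `\<' needs a word-constituent character after it.
(my-word-demo "\\b" "  ")                   ; 0
(my-word-demo "\\<" "  ")                   ; nil
```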

\_<

is the beginning of symbol operator, it matches the empty string, but only at the beginning of a symbol. A symbol is a sequence of one or more symbol-constituent characters. A symbol-constituent character is a character whose syntax is either ‘w’ or ‘_’. It matches at the beginning of the buffer only if a symbol-constituent character immediately follows the beginning of the buffer. As with words, the syntax table determines which characters are symbol-constituent.

\_>

is the end of symbol operator, it matches the empty string, but only at the end of a symbol. It matches at the end of the buffer only if a symbol-constituent character immediately precedes the end of the buffer.
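The difference between word and symbol boundaries shows up with an identifier such as ‘foo_bar’. This sketch assumes the standard syntax table, where ‘_’ has symbol syntax; my-symbol-demo is a hypothetical helper.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-symbol-demo (regexp string)
  "Return the index where REGEXP matches in STRING, or nil.
The match is made under the standard syntax table."
  (with-temp-buffer
    (with-syntax-table (standard-syntax-table)
      (let ((case-fold-search nil))
        (string-match regexp string)))))

;; As a word, "foo" ends before the `_', so the first "foo" matches.
(my-symbol-demo "\\<foo\\>" "foo_bar foo")       ; 0

;; As a symbol, "foo_bar" is one unit, so only the last "foo" matches.
(my-symbol-demo "\\_<foo\\_>" "foo_bar foo")     ; 8

;; The parentheses are not symbol constituents.
(my-symbol-demo "\\_<[a-z_]+\\_>" "(foo_bar)")   ; 1
```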

Not every string is a valid regular expression. For example, a string that ends inside a set of alternative characters without a terminating ‘]’ is invalid, and so is a string that ends with a single ‘\’. If an invalid regular expression is passed to any of the search functions, an invalid-regexp error is signaled.
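The invalid-regexp error can be caught with condition-case, which gives a simple way to test whether a string is a valid regular expression. A rough sketch follows; my-regexp-valid-p is a made-up name, not an existing Emacs function.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-regexp-valid-p (regexp)
  "Return t if REGEXP compiles, nil if it signals `invalid-regexp'."
  (condition-case nil
      (progn (string-match regexp "") t)
    (invalid-regexp nil)))

(my-regexp-valid-p "[abc]")   ; t
(my-regexp-valid-p "[abc")    ; nil, no terminating `]'
(my-regexp-valid-p "abc\\")   ; nil, ends with a single `\'
```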


J C Bingham
   - Georgetown, TX USA -

Drew Adams
> The terminology used in section 15.6 Syntax of Regular Expressions to describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs. The first paragraph and the final paragraph in the section both refer to these constructs as "a character alternative", while the paragraphs describing them call them a “character set”. In section 34.3.1.1 of the Elisp manual the phrase used consistently to describe them and refer to them is "a character alternative".

> It would increase the consistency of both manuals to use the same terminology to describe and refer to these constructs. A more grammatically correct phrase to describe these features would be "a set of alternative characters" (but when have programming nerds ever been that concerned with grammatical correctness).

A nit:

These references refer to the syntax construct [...], and not to the set of chars that it represents.  It is wrong to call this construct "a character set", and it would be wrong to call it "a set of alternative characters".  What it _matches_, or represents, is any _one_ char of a set of alternative chars.  But the syntax construct is not a set of chars.