docs(ref): Specify frontmatter #1974

epage · 2025-08-21T21:25:24Z

epage · 2025-08-21T21:25:39Z

src/SUMMARY.md

    - [Input format](input-format.md)
    - [Keywords](keywords.md)
    - [Identifiers](identifiers.md)
+    - [Frontmatter](frontmatter.md)


Not stable yet: rust-lang/rust#136889

epage · 2025-08-21T21:26:07Z

src/SUMMARY.md

    - [Input format](input-format.md)
    - [Keywords](keywords.md)
    - [Identifiers](identifiers.md)
+    - [Frontmatter](frontmatter.md)


I put this under "Lexical structure" because that is where shebang is documented.

epage · 2025-08-21T21:27:19Z

src/crates-and-source-files.md

+    FRONTMATTER?
    InnerAttribute*
    Item*


I used comments as my guide which were mostly SCREAMING_CASE. Unsure when things should be SCREAMING_CASE vs UpperCamelCase.

epage · 2025-08-21T21:28:47Z

src/crates-and-source-files.md

+    FRONTMATTER?
    InnerAttribute*
    Item*


Although shebang isn't listed here, I assumed I should put FRONTMATTER here.

The reason shebang isn't there is that shebangs are ignored at the start of any file, not just in the file which defines a crate.

If that's true for frontmatter too, input-format.md might be a better place to say how frontmatter.md fits in.

This is a bit confusing.

input-format.md is specifically under the section for lexical analysis.

crates-and-source-files.md includes what looks like the AST, but it's specifically scoped to the crate root. r[crate-items] talks generally about any Rust source file.

So, like shebang, I'll leave this out but this feels like something that could be cleaned up.

epage · 2025-08-21T21:29:41Z

src/frontmatter.md

+r[frontmatter.body]
+The body of the frontmatter may contain any content except for a line starting with as many or more dashes (`-`) than in the fences.


This follows the RFC rather than rustc's current implementation, see rust-lang/rust#141367
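
To make the rule concrete, here is a minimal sketch of the check as worded in `r[frontmatter.body]` above. It follows the quoted rule text, not rustc's current behavior, and the function names `leading_dashes` and `is_valid_body_line` are hypothetical helpers invented for illustration.

```rust
/// Counts the `-` characters at the very start of a line.
/// (Hypothetical helper, not taken from rustc or the Reference.)
fn leading_dashes(line: &str) -> usize {
    line.chars().take_while(|&c| c == '-').count()
}

/// Returns true if `line` may appear in the body of a frontmatter whose
/// fences are `fence_len` dashes long, per the rule text: a body line must
/// not start with as many or more dashes than the fences.
fn is_valid_body_line(line: &str, fence_len: usize) -> bool {
    leading_dashes(line) < fence_len
}
```

Under this reading, a `---` line is rejected inside a three-dash frontmatter but accepted inside a four-dash one, which is how a longer fence lets the body contain shorter dash runs.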

epage · 2025-08-21T21:34:50Z

src/crates-and-source-files.md

    # Crates and source files

    r[crate.syntax]
    ```grammar,items


Aside: it took me a while before I found the appropriate documentation for this.

I quickly went to CONTRIBUTING.md but skipped over the link to the authoring page because it was at the end of the intro. I skimmed the sections until I found "Adding Documentation" which seemed to describe my situation but I found no link.

Eventually I found authoring.md and thankfully I read thoroughly enough to notice the "Grammar" section at the bottom. At that point, I was mostly able to interpret the results to figure out what I needed (e.g. the limitations of ~, what to search for to understand how to do footnotes).

As noted at https://github.com/rust-lang/reference/pull/1974/files#r2292171963, it doesn't really give any style guidance on casing.

epage · 2025-08-21T21:35:25Z

src/frontmatter.md

+FRONTMATTER ->
+      FRONTMATTER_FENCE INFOSTRING? LF
+      (FRONTMATTER_LINE LF )*
+      FRONTMATTER_FENCE[^matched-fence] LF
+
+FRONTMATTER_FENCE -> `---` `-`*
+
+INFOSTRING -> XID_Start ( XID_Continue | `.` )*
+
+FRONTMATTER_LINE -> (~INVALID_FRONTMATTER_LINE_START (~INVALID_FRONTMATTER_LINE_CONTINUE)*)?
+
+INVALID_FRONTMATTER_LINE_START -> (FRONTMATTER_FENCE[^escaped-fence] | LF)
+
+INVALID_FRONTMATTER_LINE_CONTINUE -> LF


Was unsure whether to create a new rule for non-newline whitespace and specify that in every location or not.

epage · 2025-08-25T20:47:01Z

src/frontmatter.md

+FRONTMATTER ->
+      FRONTMATTER_FENCE INFOSTRING? LF
+      (FRONTMATTER_LINE LF )*
+      FRONTMATTER_FENCE[^matched-fence] LF
+
+FRONTMATTER_FENCE -> `---` `-`*
+
+INFOSTRING -> (XID_Start | `_`) ( XID_Continue | `-` | `.` )*
+
+FRONTMATTER_LINE -> (~INVALID_FRONTMATTER_LINE_START (~INVALID_FRONTMATTER_LINE_CONTINUE)*)?
+
+INVALID_FRONTMATTER_LINE_START -> (FRONTMATTER_FENCE[^escaped-fence] | LF)
+
+INVALID_FRONTMATTER_LINE_CONTINUE -> LF


Thinking more on this, I feel like I should be explicit where frontmatter whitespace is allowed.

I assume I should then pull out a whitespace grammar rule to build on that.
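
For anyone following along, the grammar quoted above can be read as the following rough sketch. It is only an illustration under several assumptions: `Frontmatter`, `is_infostring`, and `parse_frontmatter` are hypothetical names, `XID_Start` and `XID_Continue` are approximated with ASCII checks because the standard library has no XID predicates, and no horizontal whitespace or CR handling is included since the grammar as quoted does not yet spell that out. It is not how rustc implements the feature.

```rust
/// Result of splitting a source string at its frontmatter (hypothetical type).
struct Frontmatter<'a> {
    infostring: Option<&'a str>,
    /// Body lines, each still terminated by its LF.
    body: &'a str,
    /// Everything after the closing fence line.
    rest: &'a str,
}

/// ASCII-only approximation of INFOSTRING (an assumption; the grammar uses
/// XID_Start / XID_Continue, which are larger Unicode sets).
fn is_infostring(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-' || c == '.')
}

/// Parses `FENCE INFOSTRING? LF (LINE LF)* FENCE LF` at the start of `src`.
/// Returns None if there is no frontmatter or it is malformed.
fn parse_frontmatter(src: &str) -> Option<Frontmatter<'_>> {
    // FRONTMATTER_FENCE: three or more dashes.
    let fence_len = src.chars().take_while(|&c| c == '-').count();
    if fence_len < 3 {
        return None;
    }
    // The remainder of the first line is the optional infostring.
    let (first_line, mut remaining) = src[fence_len..].split_once('\n')?;
    let infostring = if first_line.is_empty() {
        None
    } else if is_infostring(first_line) {
        Some(first_line)
    } else {
        return None;
    };
    let body_start = src.len() - remaining.len();
    loop {
        // Every line, including the closing fence, must end in LF.
        let (line, after) = remaining.split_once('\n')?;
        let dashes = line.chars().take_while(|&c| c == '-').count();
        if dashes >= fence_len {
            // Only an exactly matching fence closes the frontmatter; a
            // longer dash run or trailing text is treated as an error here.
            if line.len() != fence_len {
                return None;
            }
            let body_end = src.len() - remaining.len();
            return Some(Frontmatter {
                infostring,
                body: &src[body_start..body_end],
                rest: after,
            });
        }
        remaining = after;
    }
}
```

In this sketch, calling `parse_frontmatter` on a source that opens with `---cargo`, one manifest line, and a closing `---` yields the infostring `cargo`, the manifest line as the body, and the Rust code after the fence as `rest`. Deciding where whitespace is allowed would change the first-line and closing-fence checks.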

epage · 2025-08-25T20:48:32Z

src/frontmatter.md

+FRONTMATTER ->
+      FRONTMATTER_FENCE INFOSTRING? LF
+      (FRONTMATTER_LINE LF )*
+      FRONTMATTER_FENCE[^matched-fence] LF
+
+FRONTMATTER_FENCE -> `---` `-`*
+
+INFOSTRING -> (XID_Start | `_`) ( XID_Continue | `-` | `.` )*
+
+FRONTMATTER_LINE -> (~INVALID_FRONTMATTER_LINE_START (~INVALID_FRONTMATTER_LINE_CONTINUE)*)?
+
+INVALID_FRONTMATTER_LINE_START -> (FRONTMATTER_FENCE[^escaped-fence] | LF)
+
+INVALID_FRONTMATTER_LINE_CONTINUE -> LF


Something that I don't make explicit is CR. In most places that can be chalked up to whitespace, except for INVALID_FRONTMATTER_LINE_START and INVALID_FRONTMATTER_LINE_CONTINUE.

I created productions for `END_OF_LINE`, `IGNORABLE_CODE_POINT`, and `HORIZONTAL_WHITESPACE`, as that is how the Unicode standard is written, and in preparation for rust-lang#1974, which will make use of `HORIZONTAL_WHITESPACE`.

This does not create any new productions, instead preferring comments. rust-lang#1974 will involve pulling out the horizontal whitespace into a separate production. Comment wording (and casing) is modeled off of https://www.unicode.org/reports/tr31/#R3a. I left off a "unicode" prefix for ASCII items as they are likely common enough in that context that specifying them as "unicode" could cause more confusion.

This reverts commit 60eb145. This re-formats our Whitespace to be centered on Unicode's definition. This makes it easy to compare with the standard and helps with Frontmatter. Unlike regular Rust, Frontmatter cares about the type of Whitespace. Even if we want to duplicate the definition, having them formatted similarly makes them easy to compare.

I'm splitting out `HORIZONTAL_WHITESPACE` to make it easier to connect the concept in frontmatter with `WHITESPACE`. In doing this, I found it awkward to pull out only part when the comments are there. I needed either a redundant comment to start a section, a re-arrangement out of Unicode's order, or to split out all classes. I did the latter.

traviscross

Thanks for the PR. Before digging into the rest, let me ask some questions about changes here related to layout, organization, and style.

traviscross · 2025-10-29T23:52:19Z

src/whitespace.md

-      U+0009 // Horizontal tab, `'\t'`
-    | U+000A // Line feed, `'\n'`
-    | U+000B // Vertical tab
-    | U+000C // Form feed
-    | U+000D // Carriage return, `'\r'`
-    | U+0020 // Space, `' '`
-    | U+0085 // Next line
-    | U+200E // Left-to-right mark
-    | U+200F // Right-to-left mark
-    | U+2028 // Line separator
-    | U+2029 // Paragraph separator
-
-TAB -> U+0009 // Horizontal tab, `'\t'`
-
-LF -> U+000A  // Line feed, `'\n'`
-
-CR -> U+000D  // Carriage return, `'\r'`
+      END_OF_LINE
+    | IGNORABLE_CODE_POINT
+    | HORIZONTAL_WHITESPACE
+
+END_OF_LINE ->
+      LF
+    | U+000B // vertical tabulation
+    | U+000C // form feed
+    | CR
+    | U+0085 // Unicode next line
+    | U+2028 // Unicode LINE SEPARATOR
+    | U+2029 // Unicode PARAGRAPH SEPARATOR
+
+IGNORABLE_CODE_POINT ->
+      U+200E // Unicode LEFT-TO-RIGHT MARK
+    | U+200F // Unicode RIGHT-TO-LEFT MARK
+
+HORIZONTAL_WHITESPACE ->
+      TAB
+    | U+0020  // space ' '
+
+TAB -> U+0009  // horizontal tab ('\t')
+
+LF -> U+000A  // line feed ('\n')
+
+CR -> U+000D  // carriage return ('\r')


In these sorts of grammar productions that are essentially lists, we generally find it clearer not to mix terminals and non-terminals, even if that leads to some duplication. As applied here, that would suggest inlining LF and CR in END_OF_LINE and TAB in HORIZONTAL_WHITESPACE. Does that make sense, or are there non-obvious problems with that here?

I can see it both ways.

Extra abstractions can make things less clear.

Lack of use of abstractions highlights a difference which can also be confusing or make searching more difficult.

Which gets me thinking, with how thin this abstraction is, should we even have a production for TAB, LF, or CR?

It's fairly common in grammars to have separate productions for TAB, LF, and CR, even if duplicated. I'd suggest keeping these three but inlining the terminals into the other productions.
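
As a reading aid for this thread, the three-way split proposed in the diff above can be sketched as a classification function. The code points are copied from the proposed productions; the enum `WhitespaceClass` and the function `classify_whitespace` are hypothetical names used only for illustration, and the terminals are written inline, in the style suggested in the review.

```rust
/// The three categories from the proposed WHITESPACE productions
/// (hypothetical type, named after the productions in the diff).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WhitespaceClass {
    EndOfLine,
    IgnorableCodePoint,
    HorizontalWhitespace,
}

/// Classifies a character according to the proposed productions, or
/// returns None if it is not WHITESPACE under that grammar.
fn classify_whitespace(c: char) -> Option<WhitespaceClass> {
    match c {
        // END_OF_LINE: LF, vertical tab, form feed, CR, next line,
        // line separator, paragraph separator.
        '\u{000A}' | '\u{000B}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}'
        | '\u{2029}' => Some(WhitespaceClass::EndOfLine),
        // IGNORABLE_CODE_POINT: left-to-right and right-to-left marks.
        '\u{200E}' | '\u{200F}' => Some(WhitespaceClass::IgnorableCodePoint),
        // HORIZONTAL_WHITESPACE: horizontal tab and space.
        '\u{0009}' | '\u{0020}' => Some(WhitespaceClass::HorizontalWhitespace),
        _ => None,
    }
}
```

Whether the terminals are inlined or referenced through TAB, LF, and CR does not change the set of accepted code points; the question in this thread is only how the productions are presented.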

traviscross · 2025-10-30T00:24:26Z

src/whitespace.md

+    | U+0085 // Unicode next line
+    | U+2028 // Unicode LINE SEPARATOR
+    | U+2029 // Unicode PARAGRAPH SEPARATOR


In my copy of the Unicode Character Database, I see:

0085;<control>;Cc;0;B;;;;;N;NEXT LINE (NEL);;;;

2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;

2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;

In a somewhat more pretty form:

https://util.unicode.org/UnicodeJsps/character.jsp?a=0085

https://util.unicode.org/UnicodeJsps/character.jsp?a=2028

https://util.unicode.org/UnicodeJsps/character.jsp?a=2029

That is, in the database, U+0085 has a null name and a name alias of "NEXT LINE|NEL", U+2028 has a name of "LINE SEPARATOR" and a null name alias, and U+2029 has a name of "PARAGRAPH SEPARATOR" and a null name alias.

Given that, is there a reason that we need to capitalize these differently, or could we perhaps align the capitalization with Reference norms without losing something important here?

I do see, in The Unicode Standard, Version 17.0.0, §5.8.1 "Definitions", Table 5-1 that all of "next line", "line separator", and "paragraph separator" are presented in lowercase, but that doesn't seem to explain why we might want to treat these differently.

This was to match how they are specified where it talks about the different categories of whitespace, see https://www.unicode.org/reports/tr31/#R3a-1

It's not clear to me that the capitalization and other matters of presentation there are intended to be more normative than the capitalization in the Unicode Character Database (which is normatively part of the Unicode standard).

To my eyes, it looks like they made the editorial choice in tr31 to notate code point names as presented in the database, and then where the name is null, to present the primary code point alias in parentheses and in lowercase (and to elide secondary aliases). This presentation was, I'm guessing, intended to notate this difference between names and aliases.

Is there a reason that it's important to follow the presentation in tr31 exactly? If so, I'd expect that we'd want to follow the convention with the parentheses as well.

However, it's not clear to me that it's important, for our purposes, to make this distinction between code point names and code point aliases. If that's right, and the distinction isn't important, then I'd prefer we render these according to our normal conventions for comments, i.e., in sentence case (especially as the authors of Unicode standards documents also seem to treat the capitalizations as malleable).

But, is that right, or is there an important reason we need to distinguish, for our purposes, code point names and aliases (where the name is null)?

> Is there a reason that it's important to follow the presentation in tr31 exactly? If so, I'd expect that we'd want to follow the convention with the parentheses as well.

That is the document we are referring back to on this page.

Yes, understood. How would you then apply that to the points and questions raised above?

traviscross · 2025-10-30T00:38:03Z

src/whitespace.md

+HORIZONTAL_WHITESPACE ->
+      TAB
+    | U+0020  // space ' '


Looking at the grammar productions above for frontmatter, it seems that only HORIZONTAL_WHITESPACE is used (and LF, of course). If we were willing to accept some duplication, i.e. by defining HORIZONTAL_WHITESPACE directly in terms of terminals (that also would appear under WHITESPACE), does that work for defining frontmatter without these other changes, or is there a reason these other changes to the layout, organization, and styling of the WHITESPACE grammar are important for defining frontmatter?

I think duplicating an entry like HORIZONTAL_WHITESPACE is anti-user.

> I think duplicating an entry like HORIZONTAL_WHITESPACE is anti-user.

Obviously we strive to be pro-user. However, I'm not clear how this stylistic choice would be anti-user. If you could please elaborate, I'd appreciate it.

First off, I am not a DRY absolutist. The mere appearance of duplication isn't a reason to de-duplicate, but we should work towards understanding the underlying principles and requirements, de-duplicate those where it works well, and reduce the burden of duplication where it doesn't (e.g. by having two definitions next to each other). In this case, we are working from an item (whitespace) that is defined in three categories, each of which is in turn defined as literals, which naturally leads to a way to divide things up without duplicating knowledge.

I see this is similar to but worse than #1974 (comment)

At least with TAB, the abstraction is thin (too thin, as I suggested in that thread). Here, we'd be duplicating a lot of the definition. This puts a burden on the user to match up the entries in order to:

- Understand that one is a superset of the other
- Understand what role each item plays

When seeing duplication like this, it is natural to expect there to be a reason for it, and for that reason to be that there are differences. For me, this makes me doubt myself and pore over something more, wasting my time, making me less confident that I understand what I'm working from, and leaving me more annoyed in the process.

I suspect this will also be a honey trap for contributors who will see this and then submit PRs to do what I did.

epage force-pushed the frontmatter branch from 3243e11 to dce5d20 Compare August 21, 2025 21:30

epage mentioned this pull request Aug 19, 2025

Tracking Issue for frontmatter rust-lang/rust#136889

Open

14 tasks

ehuss added the S-waiting-on-stabilization Waiting for a stabilization PR to be merged in the main Rust repository label Aug 21, 2025

epage force-pushed the frontmatter branch 2 times, most recently from 161b0fa to 61afc85 Compare August 22, 2025 13:54

mattheww reviewed Aug 25, 2025

epage force-pushed the frontmatter branch from 61afc85 to 2b3a5fa Compare August 25, 2025 20:44

This was referenced Aug 27, 2025

Frontmatter and cargo-script rust-lang/project-goal-reference-expansion#9

Open

feature(frontmatter) allows vertical whitespace where only horizontal whitespace is intended rust-lang/rust#145971

Closed

joshtriplett mentioned this pull request Sep 3, 2025

Tracking issue rust-lang/project-goal-reference-expansion#11

Open

epage mentioned this pull request Sep 4, 2025

Create Whitespace grammar productions #1991

Merged

epage mentioned this pull request Sep 16, 2025

Stabilize cargo-script rust-lang/rust-project-goals#119

Open

11 tasks

traviscross added the S-waiting-on-review Status: The marked PR is awaiting review from a maintainer label Oct 4, 2025

epage added 3 commits October 9, 2025 14:01

docs(ref): Specify frontmatter

d7203a4

epage force-pushed the frontmatter branch from 2b3a5fa to df1726b Compare October 9, 2025 19:20

epage mentioned this pull request Oct 23, 2025

Stabilize Frontmatter rust-lang/rust#148051

Open

10 tasks

traviscross reviewed Oct 30, 2025

epage force-pushed the frontmatter branch 5 times, most recently from 3d69944 to d7203a4 Compare October 30, 2025 20:36
