An opinionated HTML Serializer for PHP 8.4
https://shkspr.mobi/blog/2025/04/an-opinionated-html-serializer-for-php-8-4/
A few days ago, I wrote a shitty pretty-printer for PHP 8.4's new Dom\HTMLDocument class.
I've since re-written it to be faster and more stylistically correct.
It turns this:
<html lang="en-GB"><head><title id="something">Test</title></head><body><h1 class="top upper">Testing</h1><main><p>Some <em>HTML</em> and an <img src="example.png" alt="Alternate Text"></p>Text not in an element<ol><li>List</li><li>Another list</li></ol></main></body></html>
Into this:
<!doctype html>
<html lang=en-GB>
	<head>
		<title id=something>Test</title>
	</head>
	<body>
		<h1 class="top upper">Testing</h1>
		<main>
			<p>
				Some
				<em>HTML</em>
				and an
				<img src=example.png alt="Alternate Text">
			</p>
			Text not in an element
			<ol>
				<li>List</li>
				<li>Another list</li>
			</ol>
		</main>
	</body>
</html>
I say it is "opinionated" because it does the following:
- Attributes are unquoted unless necessary.
- Every element is logically indented.
- Text content of CSS and JS is unaltered. No pretty-printing, minification, or checking for correctness.
- Text content of elements may have extra newlines and tabs. Browsers will tend to ignore multiple whitespaces unless the CSS tells them otherwise.
- This fucks up <pre> blocks which contain markup.
It is primarily designed to make the markup easy to read. Because according to the experts:
A computer language is not just a way of getting a computer to perform operations but rather … it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.
I'm fairly sure this all works properly. But feel free to argue in the comments or send me a pull request.
Here's how it works.
When is an element not an element? When it is a void!
Modern HTML has the concept of "Void Elements". Normally, something like <a> must eventually be followed by a closing </a>. But Void Elements don't need closing.
The serializer keeps a list of elements which must not be explicitly closed.
$void_elements = [ "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "param", "source", "track", "wbr",];
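As a rough sketch of how that list gets used: the check is just a lookup before deciding whether to emit a closing tag. The helper name closing_tag() is made up for this illustration and isn't part of the serializer.

```php
<?php
$void_elements = [ "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "param", "source", "track", "wbr" ];

//	Hypothetical helper: returns the closing tag, or nothing for a void element.
function closing_tag( string $name, array $void_elements ): string
{
	return in_array( $name, $void_elements, true ) ? "" : "</{$name}>";
}

echo closing_tag( "p",   $void_elements ) . "\n"; //	Prints </p>
echo closing_tag( "img", $void_elements ) . "\n"; //	Prints an empty line
```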
Tabs vs Space
Tabs, obviously. Users can set their tab width to their personal preference and it won't get confused with semantically significant whitespace.
$indent_character = "\t";
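A small sketch of why a single variable is handy here: indentation is just that character repeated once per level of nesting, so switching to spaces would be a one-line change. The indent_line() helper is hypothetical and only exists for this example.

```php
<?php
$indent_character = "\t";

//	Hypothetical helper: prefix a line with one indent character per level of depth.
function indent_line( string $text, int $depth, string $indent_character ): string
{
	return str_repeat( $indent_character, $depth ) . $text;
}

//	Two levels deep means two tabs before the markup.
echo indent_line( "<li>List</li>", 2, $indent_character ) . "\n";
```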
Setting up the DOM
The new HTMLDocument should be broadly familiar to anyone who has used the previous one.
$html = '<html lang="en-GB"><head><title id="something">Test</title></head><body><h1 class="top upper">Testing</h1><main><p>Some <em>HTML</em> and an <img src="example.png" alt="Alternate Text"></p>Text not in an element<ol><li>List</li><li>Another list</li></ol></main></body></html>';
$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );
This automatically adds <head> and <body> elements. If you don't want that, use the LIBXML_HTML_NOIMPLIED flag:
$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
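If you want to see what the flag changes, a quick comparison along these lines should do it. This is a sketch which needs PHP 8.4; the fragment and the variable names are made up for the example, and saveHtml() is the document's built-in serializer rather than the one described in this post.

```php
<?php
//	A fragment with no <html>, <head>, or <body> of its own.
$fragment = '<p>Hello</p>';

//	Default behaviour: the parser adds the implied elements.
$implied = Dom\HTMLDocument::createFromString( $fragment, LIBXML_NOERROR, "UTF-8" );

//	With the flag: the implied elements are left out.
$bare = Dom\HTMLDocument::createFromString( $fragment, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );

echo $implied->saveHtml() . "\n";
echo $bare->saveHtml() . "\n";
```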
To Quote or Not To Quote?
Traditionally, HTML attributes needed quotes:
<img src="example.png" class="avatar no-border" id="user-123">
Modern HTML allows those attributes to be unquoted as long as they aren't empty and don't contain ASCII Whitespace or any of the characters ", ', =, <, >, or `.
For example, the above becomes:
<img src=example.png class="avatar no-border" id=user-123>
This function looks for the presence of those characters:
function value_unquoted( $haystack )
{
	//	Must not be null or empty.
	if ( $haystack === null || $haystack === "" ) {
		return false;
	}

	//	Must not contain specific characters.
	$needles = [
		//	https://infra.spec.whatwg.org/#ascii-whitespace
		"\t", "\n", "\f", "\r", " ",
		//	https://html.spec.whatwg.org/multipage/syntax.html#unquoted
		"\"", "'", "=", "<", ">", "`"
	];

	foreach ( $needles as $needle ) {
		if ( str_contains( $haystack, $needle ) ) {
			return false;
		}
	}

	return true;
}
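For comparison, here's a more compact sketch of the same check using strpbrk(), which scans for any one of a set of characters in a single call. The name can_be_unquoted() is hypothetical, and the character set assumes the WHATWG definition of ASCII whitespace (tab, line feed, form feed, carriage return, space).

```php
<?php
//	Hypothetical alternative to value_unquoted(), not part of the serializer.
function can_be_unquoted( ?string $value ): bool
{
	//	An empty or missing value always needs quotes.
	if ( $value === null || $value === "" ) {
		return false;
	}

	//	ASCII whitespace, plus the characters forbidden in unquoted values.
	$forbidden = "\t\n\f\r \"'=<>`";

	//	strpbrk() returns false when none of the characters are present.
	return strpbrk( $value, $forbidden ) === false;
}

var_dump( can_be_unquoted( "example.png" ) );      //	bool(true)
var_dump( can_be_unquoted( "avatar no-border" ) ); //	bool(false)
var_dump( can_be_unquoted( "" ) );                 //	bool(false)
```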
Re-re-re-recursion
I've tried to document this as best I can.
It traverses the DOM tree, printing out correctly indented opening elements and their attributes. If there's text content, that's printed. If an element needs closing, that's printed with the appropriate indentation.
function serializeHTML( $node, $treeIndex = 0, $output = "")
{
	global $indent_character, $preserve_internal_whitespace, $void_elements;

	//	Manually add the doctype to start.
	if ( $output == "" ) {
		$output .= "<!doctype html>\n";
	}

	if( property_exists( $node, "localName" ) ) {
		//	This is an Element.

		//	Get all the Attributes (id, class, src, &c.).
		$attributes = "";
		if ( property_exists($node, "attributes")) {
			foreach( $node->attributes as $attribute ) {
				$value = $attribute->nodeValue;
				//	Only add " if the value contains specific characters.
				$quote = value_unquoted( $value ) ? "" : "\"";
				$attributes .= " {$attribute->nodeName}={$quote}{$value}{$quote}";
			}
		}

		//	Print the opening element and all attributes.
		$output .= "<{$node->localName}{$attributes}>";
	} else if( property_exists( $node, "nodeName" ) && $node->nodeName == "#comment" ) {
		//	Comment
		$output .= "<!-- {$node->textContent} -->";
	}

	//	Increase indent.
	$treeIndex++;
	$tabStart = "\n" . str_repeat( $indent_character, $treeIndex );
	$tabEnd   = "\n" . str_repeat( $indent_character, $treeIndex - 1);

	//	Does this node have children?
	if( property_exists( $node, "childElementCount" ) && $node->childElementCount > 0 ) {
		//	Loop through the children.
		$i=0;
		while( $childNode = $node->childNodes->item( $i++ ) ) {
			//	Is this a text node?
			if ($childNode->nodeType == 3 ) {
				//	Only print output if there's no HTML inside the content.
				//	Ignore Void Elements.
				if ( !str_contains( $childNode->textContent, "<" ) && property_exists( $childNode, "localName" ) && !in_array( $childNode->localName, $void_elements ) ) {
					$output .= $tabStart . $childNode->textContent;
				}
			} else {
				$output .= $tabStart;
			}
			//	Recursively indent all children.
			$output = serializeHTML( $childNode, $treeIndex, $output );
		};
		//	Suffix with a "\n" and a suitable number of "\t"s.
		$output .= "{$tabEnd}";
	} else if ( property_exists( $node, "childElementCount" ) && property_exists( $node, "innerHTML" ) ) {
		//	If there are no children and the node contains content, print the contents.
		$output .= $node->innerHTML;
	}

	//	Close the element, unless it is a void.
	if( property_exists( $node, "localName" ) && !in_array( $node->localName, $void_elements ) ) {
		$output .= "</{$node->localName}>";
	}

	//	Return a string of fully indented HTML.
	return $output;
}
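If the shape of the recursion is hard to follow with all the DOM checks in the way, this toy version may help. It is only a sketch of the same depth-first idea and not the serializer itself: it walks plain nested arrays instead of DOM nodes, and both the [ tag, children ] format and the render_node() name are invented for the example.

```php
<?php
//	A toy tree: each node is [ tag, children ]. Children are strings (text) or nested nodes.
$tree = [ "ol", [
	[ "li", [ "List" ] ],
	[ "li", [ "Another list" ] ],
] ];

function render_node( array $node, int $depth = 0 ): string
{
	[ $tag, $children ] = $node;
	$indent = str_repeat( "\t", $depth );

	//	Does this node contain child elements, or only text?
	$has_elements = false;
	foreach ( $children as $child ) {
		if ( is_array( $child ) ) {
			$has_elements = true;
		}
	}

	//	Text only: keep the element on a single line.
	if ( !$has_elements ) {
		return "{$indent}<{$tag}>" . implode( "", $children ) . "</{$tag}>\n";
	}

	//	Otherwise open the element, recurse one level deeper, then close it.
	$output = "{$indent}<{$tag}>\n";
	foreach ( $children as $child ) {
		if ( is_array( $child ) ) {
			$output .= render_node( $child, $depth + 1 );
		} else {
			$output .= "{$indent}\t{$child}\n";
		}
	}
	return $output . "{$indent}</{$tag}>\n";
}

echo render_node( $tree );
```

Each call only needs its own depth to work out the indentation, which is the same job $treeIndex does in serializeHTML().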
Print it out
The serialized string hardcodes the <!doctype html> - which is probably fine. The full HTML is shown with:
echo serializeHTML( $dom->documentElement );
Next Steps
Please raise any issues on GitLab or leave a comment.