python - Regular expression matching liquid code

Question

Welcome To Ask or Share your Answers For Others

python - Regular expression matching liquid code

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

python - Regular expression matching liquid code

I'm building a website using Jekyll I would like to automatically remove liquid code (and only liquid code) from a given HTML file. I'm doing it in Python using regular expressions, and so far I have this one:

{.*?}|{{.*?}}

As I am not too familiar with liquid (and .html), could someone confirm that this will suffice for my goal?

Here is an example of the kind of file I will be working with:

<div class="post-preview">
    <div class="post-title">
        <div class="post-name">
            <a href="{{ post.url }}">{{ post.title }}</a>
        </div>
        <div class="post-date">
            {% include time.html %}
        </div>
    </div>

    <div class="post-snippet">
        {% if post.content contains '<!--break-->' %}
            {{ post.content | split:'<!--break-->' | first }}
            <div class="post-readmore">
                <a href="{{ post.url }}">read more-></a>
            </div>
        {% endif %}
    </div>
    {% include post-meta.html %}
</div>

In this case my regex works, but I wanted to make sure I'm not missing something for the future. I could go for a hackish way and surround all liquid code with comments like

/* start_liquid */ {blalala} /* end_liquid */

but I'm looking for a more elegant way to do it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:26:37+0000

The regular expression and tokenizer in Python-Liquid might be overkill for your use case, but will handle {% comment %} and {% raw %} blocks, and any multi-line tags. Something a simple regular expression alone would not cope with.

The Writing a tokenizer example from Python's re docs uses the same technique.

You could use python-liquid to filter out Liquid tokens like this.

from liquid.lex import get_lexer
from liquid.token import TOKEN_LITERAL

s = """
<div class="post-preview">
    <div class="post-title">
        <div class="post-name">
            <a href="{{ post.url }}">{{ post.title }}</a>
        </div>
        <div class="post-date">
            {% include time.html %}
        </div>
    </div>

    <div class="post-snippet">
        {% if post.content contains '<!--break-->' %}
            {{ post.content | split:'<!--break-->' | first }}
            <div class="post-readmore">
                <a href="{{ post.url }}">read more-></a>
            </div>
        {% endif %}
    </div>
    {% include post-meta.html %}
</div>
"""

tokenize = get_lexer()
tokens = tokenize(s)
only_html = [token.value for token in tokens if token.type == TOKEN_LITERAL]
print("".join(only_html))

Notice the output preserves newlines outside Liquid markup.

<div class="post-preview">
    <div class="post-title">
        <div class="post-name">
            <a href=""></a>
        </div>
        <div class="post-date">
            
        </div>
    </div>

    <div class="post-snippet">
        
            
            <div class="post-readmore">
                <a href="">read more-></a>
            </div>
        
    </div>
    
</div>

Categories

python - Regular expression matching liquid code

python - Regular expression matching liquid code

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags