Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
481 views
in Technique[技术] by (71.8m points)

python - Regular expression matching liquid code

I'm building a website using Jekyll I would like to automatically remove liquid code (and only liquid code) from a given HTML file. I'm doing it in Python using regular expressions, and so far I have this one:

{.*?}|{{.*?}}

As I am not too familiar with liquid (and .html), could someone confirm that this will suffice for my goal?

Here is an example of the kind of file I will be working with:

<div class="post-preview">
    <div class="post-title">
        <div class="post-name">
            <a href="{{ post.url }}">{{ post.title }}</a>
        </div>
        <div class="post-date">
            {% include time.html %}
        </div>
    </div>

    <div class="post-snippet">
        {% if post.content contains '<!--break-->' %}
            {{ post.content | split:'<!--break-->' | first }}
            <div class="post-readmore">
                <a href="{{ post.url }}">read more-></a>
            </div>
        {% endif %}
    </div>
    {% include post-meta.html %}
</div>

In this case my regex works, but I wanted to make sure I'm not missing something for the future. I could go for a hackish way and surround all liquid code with comments like

/* start_liquid */ {blalala} /* end_liquid */

but I'm looking for a more elegant way to do it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The regular expression and tokenizer in Python-Liquid might be overkill for your use case, but will handle {% comment %} and {% raw %} blocks, and any multi-line tags. Something a simple regular expression alone would not cope with.

The Writing a tokenizer example from Python's re docs uses the same technique.

You could use python-liquid to filter out Liquid tokens like this.

from liquid.lex import get_lexer
from liquid.token import TOKEN_LITERAL

s = """
<div class="post-preview">
    <div class="post-title">
        <div class="post-name">
            <a href="{{ post.url }}">{{ post.title }}</a>
        </div>
        <div class="post-date">
            {% include time.html %}
        </div>
    </div>

    <div class="post-snippet">
        {% if post.content contains '<!--break-->' %}
            {{ post.content | split:'<!--break-->' | first }}
            <div class="post-readmore">
                <a href="{{ post.url }}">read more-></a>
            </div>
        {% endif %}
    </div>
    {% include post-meta.html %}
</div>
"""

tokenize = get_lexer()
tokens = tokenize(s)
only_html = [token.value for token in tokens if token.type == TOKEN_LITERAL]
print("".join(only_html))

Notice the output preserves newlines outside Liquid markup.

<div class="post-preview">
    <div class="post-title">
        <div class="post-name">
            <a href=""></a>
        </div>
        <div class="post-date">
            
        </div>
    </div>

    <div class="post-snippet">
        
            
            <div class="post-readmore">
                <a href="">read more-></a>
            </div>
        
    </div>
    
</div>

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...