Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


html - Find a keyword in text when the keyword matches certain conditions - C#

I'm looking for a nice way to do the following:

I have an article that contains HTML tags such as anchors and paragraphs.
I also have a keyword that I need to find in the article and turn into an anchor (I have a URL to use for it).
If the keyword exists in the article, it must meet the following TWO conditions BEFORE it is made an anchor:

  1. It cannot be inside any tag. For example, something like

    <img alt="keyword">

    will not be valid/matched.

  2. The keyword can't already be inside an anchor. For example, something like

    <a>keyword</a>

    will not be valid/matched.

Any help would be appreciated. Thanks


1 Reply


I have managed to get it done!

Many thanks to this post, which helped me a lot with the XPath expression: http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/beae72d6-844f-4a9b-ad56-82869d685037/

My task was to add X keywords to the article using a table of keywords and URLs in my database.
Once a keyword has been matched, the code does not search for it again but moves on to the next keyword.
A keyword may consist of more than one word. That's why I added the Replace(" ", @"\s+").
Also, I had to give precedence to the longest keywords. That is, if I had
"good day" and "good" as two different keywords, "good day" always wins.

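The two points above (longest keyword first, and any run of white space between the words of a keyword) can be sketched on their own with nothing but the base class library. This is only an illustration, not the code from the answer: the class and method names are made up, it escapes the keyword with Regex.Escape (which the solution below does not do), and it leaves out the \w* prefix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class KeywordPatternSketch
{
    // Longest keywords first, so "good day" is tried before "good".
    public static List<string> OrderLongestFirst(IEnumerable<string> keywords)
    {
        return keywords.OrderByDescending(k => k.Length).ToList();
    }

    // Builds a whole-word pattern in which each space matches any run of white space.
    public static string BuildPattern(string keyword)
    {
        // Regex.Escape turns a space into "\ ", so that escaped form is what gets replaced.
        string escaped = Regex.Escape(keyword.Trim()).Replace(@"\ ", @"\s+");
        return @"\b" + escaped + @"\b";
    }
}
```

Escaping matters as soon as a keyword contains regex metacharacters such as "." or "(", which would otherwise change the meaning of the pattern.
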
This is my solution:

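using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// NOTE: the enclosing class was not shown in the original post; the name here is a placeholder.
public static class ArticleLinker
{
    public static string AddLinksToArticle(string article, int linksToAdd)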
    {
        try
        {
            //load keywords and urls
            var dt = new DAL().GetArticleLinks();

            // sort by keyword length, longest first, so "good day" is tried before "good"
            IEnumerable<ArticlesRow> sortedArticles = dt.OrderBy(row => row.keyword, new StringLengthComparer());

            // iterate the sorted rows and replace each keyword with an anchor
            foreach (var item in sortedArticles)
            {
                article = FindAndReplaceKeywordWithAnchor(article, item.keyword, item.url, ref linksToAdd);
                if (linksToAdd == 0)
                {
                    break;
                }
            }

            return article;
        }
        catch (Exception ex)
        {
            Utils.LogErrorAdmin(ex);
            return null;
        }
    }

    private static string FindAndReplaceKeywordWithAnchor(string article, string keyword, string url, ref int linksToAdd)
    {
        // parse the article text as HTML
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(article);

        // \w* - allows any word characters directly before the keyword
        // \s+ - replaces each space, so any run of white space matches between the words of a keyword
        // \b  - sets word boundaries around the match
        string pattern = @"\b" + keyword.Trim().Insert(0, @"\w*").Replace(" ", @"\s+") + @"\b";

        // select every text node that is not inside an anchor element
        var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]") ?? new HtmlAgilityPack.HtmlNodeCollection(null);
        Regex regex = new Regex(pattern);
        foreach (var node in nodes)
        {
            // test with the regex itself, so a link is only counted when a replacement really happens
            if (regex.IsMatch(node.InnerHtml))
            {
                node.InnerHtml = regex.Replace(node.InnerHtml, "<a href=\"" + url + "\">" + keyword + "</a>", 1); // replace only the first occurrence
                linksToAdd--;
                break;
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}

public class StringLengthComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        return y.Length.CompareTo(x.Length);
    }
}
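
The single-replacement step above writes the keyword itself as the link text, so whatever the pattern actually matched (extra white space, or characters picked up by \w*) is replaced by the keyword. A possible variation, shown here only as a sketch with hypothetical names, keeps the matched text and uses a MatchEvaluator so that a '$' in the URL is not read as a substitution token. It also HTML-encodes the URL, which is an addition the original does not make.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FirstMatchLinkSketch
{
    // Wraps only the first match of the pattern in an anchor and keeps the matched text as written.
    public static string LinkFirstMatch(string text, string pattern, string url)
    {
        var regex = new Regex(pattern);
        string href = WebUtility.HtmlEncode(url);

        // The count argument (1) limits the replacement to the first occurrence.
        return regex.Replace(text, m => "<a href=\"" + href + "\">" + m.Value + "</a>", 1);
    }
}
```

In the solution above this would be applied to the content of a single text node, in place of the regex.Replace call.
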

Hope it will help someone in the future.

