regex - Regular Expression for extracting text from an RTF string

Question

Welcome To Ask or Share your Answers For Others

regex - Regular Expression for extracting text from an RTF string

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

regex - Regular Expression for extracting text from an RTF string

I was looking for a way to remove text from and RTF string and I found the following regex:

({\)(.+?)(})|(\)(.+?)()

However the resulting string has two right angle brackets "}"

Before: { tf1ansiansicpg1252deff0deflang1033{fonttbl{f0fnilfcharset0 MS Shell Dlg 2;}{f1fnil MS Shell Dlg 2;}} {colortbl ; ed0green0lue0;} {*generator Msftedit 5.41.15.1507;}viewkind4uc1pardx720cf1f0fs20 can u send me info for the call plsf1par }

After: } can u send me info for the call pls }

Any thoughts on how to improve the regex?

Edit: A more complicated string such as this one does not work: { tf1ansiansicpg1252deff0deflang1033{fonttbl{f0fnilfcharset0 MS Shell Dlg 2;}} {colortbl ; ed0green0lue0;} {*generator Msftedit 5.41.15.1507;}viewkind4uc1pardx720cf1f0fs20 HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\test\myapp\Apps\{3423234-283B-43d2-BCE6-A324B84CC70E}par }

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:13:20+0000

In RTF, { and } marks a group. Groups can be nested. marks beginning of a control word. Control words end with either a space or a non alphabetic character. A control word can have a numeric parameter following, without any delimiter in between. Some control words also take text parameters, separated by ';'. Those control words are usually in their own groups.

I think I have managed to make a pattern that takes care of most the cases.

{*?\[^{}]+}|[{}]|\
?[A-Za-z]+
?(?:-?d+)?[ ]?

It leaves a few spaces when run on your pattern though.

Going trough the RTF specification (some of it), I see that there are a lot of pitfalls for pure regex based strippers. The most obvious one are that some groups should be ignored (headers, footers, etc.), while others should be rendered (formatting).

I have written a Python script that should work better than my regex above:

def striprtf(text):
   pattern = re.compile(r"\([a-z]{1,32})(-?d{1,10})?[ ]?|\'([0-9a-f]{2})|\([^a-z])|([{}])|[
]+|(.)", re.I)
   # control words which specify a "destionation".
   destinations = frozenset((
      'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
      'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
      'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
      'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
      'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
      'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
      'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
      'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
      'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
      'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
      'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
      'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
      'listoverridetable','listpicture','liststylename','listtable','listtext',
      'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
      'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
      'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
      'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
      'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
      'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
      'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
      'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
      'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
      'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
      'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
      'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
      'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
      'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
      'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
      'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
      'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
      'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
      'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
      'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
      'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
      'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
      'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
      'svb','tc','template','themedata','title','txe','ud','upr','userprops',
      'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
      'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
      'xmlopen',
   ))
   # Translation of some special characters.
   specialchars = {
      'par': '
',
      'sect': '

',
      'page': '

',
      'line': '
',
      'tab': '',
      'emdash': u'u2014',
      'endash': u'u2013',
      'emspace': u'u2003',
      'enspace': u'u2002',
      'qmspace': u'u2005',
      'bullet': u'u2022',
      'lquote': u'u2018',
      'rquote': u'u2019',
      'ldblquote': u'201C',
      'rdblquote': u'u201D', 
   }
   stack = []
   ignorable = False       # Whether this group (and all inside it) are "ignorable".
   ucskip = 1              # Number of ASCII characters to skip after a unicode character.
   curskip = 0             # Number of ASCII characters left to skip
   out = []                # Output buffer.
   for match in pattern.finditer(text):
      word,arg,hex,char,brace,tchar = match.groups()
      if brace:
         curskip = 0
         if brace == '{':
            # Push state
            stack.append((ucskip,ignorable))
         elif brace == '}':
            # Pop state
            ucskip,ignorable = stack.pop()
      elif char: # x (not a letter)
         curskip = 0
         if char == '~':
            if not ignorable:
                out.append(u'xA0')
         elif char in '{}\':
            if not ignorable:
               out.append(char)
         elif char == '*':
            ignorable = True
      elif word: # foo
         curskip = 0
         if word in destinations:
            ignorable = True
         elif ignorable:
            pass
         elif word in specialchars:
            out.append(specialchars[word])
         elif word == 'uc':
            ucskip = int(arg)
         elif word == 'u':
            c = int(arg)
            if c < 0: c += 0x10000
            if c > 127: out.append(unichr(c))
            else: out.append(chr(c))
            curskip = ucskip
      elif hex: # 'xx
         if curskip > 0:
            curskip -= 1
         elif not ignorable:
            c = int(hex,16)
            if c > 127: out.append(unichr(c))
            else: out.append(chr(c))
      elif tchar:
         if curskip > 0:
            curskip -= 1
         elif not ignorable:
            out.append(tchar)
   return ''.join(out)

It works by parsing the RTF code, and skipping any groups which has a "destination" specified, and all "ignorable" groups ({*...}). I also added handling of some special characters.

There are lots of features missing to make this a full parser, but should be enough for simple documents.

UPDATED: This url have this script updated to run on Python 3.x:

https://gist.github.com/gilsondev/7c1d2d753ddb522e7bc22511cfb08676

Categories

regex - Regular Expression for extracting text from an RTF string

regex - Regular Expression for extracting text from an RTF string

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags