You (originally) asked for an awk-based solution. As others mentioned in the comments, there are better tools for the job. That said, based on 4.9 Multiple-Line Records and 4.7 Defining Fields by Content in the gawk manual, you can try something like:
$ awk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
[...]
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
RS = ";\n" assumes that your input file contains multiple ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records and that the records are separated by a ; (the one after valueY in your example) followed by a newline.
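To see what this RS setting does in isolation, here is a minimal sketch. The function name count_record_lines and the two-record sample are made up for illustration, and a multi-character RS is an extension (gawk supports it) rather than something every awk guarantees.

```shell
# Hypothetical helper: read records terminated by ";" + newline from stdin
# and report how many physical lines each record spans.
count_record_lines() {
  awk 'BEGIN { RS = ";\n" }
       { printf("record %d has %d line(s)\n", NR, split($0, lines, "\n")) }'
}

# Made-up sample: the first record has a newline inside its quoted field.
printf 'a;"<p>x\ny</p>";b;\nc;"<p>z</p>";d;\n' | count_record_lines
```

The newline inside the quoted field does not end the record, because it is not preceded by a ;. The flip side is that a ; sitting at the very end of a line inside DESCRIPTION would be taken as a record separator.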
FPAT = "([^;]+)|(\"<p.+p>\")" is a "best-effort" way to tell (g)awk what the fields of your records look like. You may need to adapt it to your needs. What it actually says is that there are two field formats (see (...)|(...)). The first format captures strings that do not contain ; and covers all the fields except DESCRIPTION. The second format captures strings that start with "<p and end with p>", which is the DESCRIPTION field. Where both formats match at the same position, awk takes the longest match, so a ; inside DESCRIPTION does not split the field.
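The interplay of the two field formats can be checked on a tiny input. This is only a sketch: the helper name show_fields and the one-line sample record are invented for illustration, and FPAT needs gawk 4.0 or later.

```shell
# Hypothetical helper: split each input record with the FPAT discussed above
# and print every field on its own line. Requires gawk (FPAT is gawk-only).
show_fields() {
  awk 'BEGIN { FPAT = "([^;]+)|(\"<p.+p>\")" }
       { for (i = 1; i <= NF; i++) printf("$%d = %s\n", i, $i) }'
}

# Made-up record whose quoted field contains a ";".
printf '%s\n' 'id;"<p>a;b</p>";tail' | show_fields
```

With gawk this yields three fields, with "<p>a;b</p>" kept intact as $2. An awk without FPAT support silently ignores the variable and splits on whitespace instead.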
Against a file with two ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records:
$ cat testfile
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
NF = 7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
NF = 7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY