Your data looks like SGML, the superset of XML that allows tag inference/omission. I'm preparing to release an SGML parser for JavaScript (for the browser, node.js and other CommonJS platforms), but it's not available yet. For the time being, I suggest using the venerable OpenSP software. It doesn't have an npm integration package, but you can easily install it on e.g. Ubuntu/Debian using sudo apt-get install opensp, and similarly on other Linuxen and on Mac OS via MacPorts.
The OpenSP package contains the osx command-line utility, which down-converts SGML to XML. You can use node's child_process core module to invoke osx, pipe your SGML data to it, capture the XML it produces, and then feed that XML to the XML parser of your choice in your node app.
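As a rough sketch of that pipeline (not a tested integration), the helper below uses spawnSync from child_process to write the SGML to the program's standard input and collect its output. The function name sgmlToXml and its return shape are made up for illustration, and it assumes osx is on your PATH and reads from standard input when given no file argument. The command is a parameter so you can swap in a different program.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: run `command` synchronously, feed it `sgml` on
// stdin, and return what it wrote to stdout and stderr.
// Assumption: `osx` is installed and reads standard input when no
// input file is named on the command line.
function sgmlToXml(sgml, command = 'osx', args = []) {
  const result = spawnSync(command, args, {
    input: sgml,                  // written to the child's stdin, then closed
    encoding: 'utf8',             // return stdout/stderr as strings
    maxBuffer: 64 * 1024 * 1024,  // leave room for large documents
  });
  if (result.error) {
    // e.g. ENOENT when the program cannot be found
    throw result.error;
  }
  return {
    xml: result.stdout,
    messages: result.stderr,      // parser diagnostics, if any
    status: result.status,        // exit code of the child process
  };
}
```

You would then hand the xml string to your XML parser, and check messages and status before trusting the output. For large inputs or a server context, the asynchronous spawn with streams is probably the better fit; spawnSync just keeps the sketch short.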
The osx program must be told to add the omitted end-element tags for CONFORMED-NAME, CIK, and the other elements whose end-element tags are left out. You do that by prepending a document type declaration (a DOCTYPE with an inline DTD) to your SGML content. In your case, what you supply to osx should look as follows:
<!DOCTYPE ISSUER [
<!ELEMENT ISSUER - -
(COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
<!ELEMENT COMPANY-DATA - -
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
<!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
(STREET1,CITY,STATE,ZIP)>
<!ELEMENT
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END,
STREET1,CITY,STATE,ZIP) - O (#PCDATA)>
]>
<ISSUER> ... rest of your input data following here
Crucially, the declarations for CONFORMED-NAME, CIK, and the other field-like elements use - O (hyphen-minus and letter O) as tag omission indicators: the hyphen says the start-element tag is required, and the O says the end-element tag may be omitted. This tells SGML that the end-element tags for these elements can be left out, and osx will insert them automatically in its XML output.
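In a node app, the prepending step could look roughly like this. The names ISSUER_DOCTYPE and withDoctype are made up for illustration; the declaration text is the one shown above.

```javascript
// The document type declaration from above, kept as a string constant.
const ISSUER_DOCTYPE = `<!DOCTYPE ISSUER [
<!ELEMENT ISSUER - -
(COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
<!ELEMENT COMPANY-DATA - -
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
<!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
(STREET1,CITY,STATE,ZIP)>
<!ELEMENT
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END,
STREET1,CITY,STATE,ZIP) - O (#PCDATA)>
]>`;

// Hypothetical helper: put the declaration in front of the SGML data,
// unless the data already begins with a DOCTYPE of its own.
function withDoctype(doctype, sgml) {
  if (/^\s*<!DOCTYPE\b/i.test(sgml)) {
    return sgml;
  }
  return doctype.trimEnd() + '\n' + sgml;
}
```

The string returned by withDoctype(ISSUER_DOCTYPE, yourData) is what you would pipe to osx on standard input.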
You can read more about the meaning of these declarations on my project page at http://sgmljs.net/docs/sgmlrefman.html .