Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - Parsing a chemical formula

I'm trying to write a method for an app that takes a chemical formula like "CH3COOH" and returns some sort of collection of its element symbols.

CH3COOH would return [C,H,H,H,C,O,O,H]

I already have something that is kinda working, but it's very complicated and uses a lot of code with a lot of nested if-else structures and loops.

Is there a way to do this with some kind of regular expression and String.split, or with some other simple, clever code?


1 Reply


I have developed a couple of series of articles on how to parse molecular formulas, including more complex formulas like C6H2(NO2)3CH3.

The most recent is my presentation "PLY and PyParsing" at PyCon2010 where I compare those two Python parsing systems using a molecular formula evaluator as my sample problem. There's even a video of my presentation.

The presentation was based on a three-part series of articles I did developing a molecular formula parser using ANTLR. In part 3 I compare the ANTLR solution to a hand-written regular expression parser and solutions in PLY and PyParsing.

The regexp and PLY solutions were first developed in a two-part series on two ways of writing parsers in Python.

The regexp solution and the base ANTLR/PLY/PyParsing solutions use a regular expression like [A-Z][a-z]?\d* to match terms in the formula. This is what @David M suggested.

Here it is worked out in Python:

import re

# element_name is: capital letter followed by optional lower-case
# count is: empty string (so the count is 1), or a set of digits
element_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

all_elements = []
for (element_name, count) in element_pat.findall("CH3COOH"):
    if count == "":
        count = 1
    else:
        count = int(count)
    all_elements.extend([element_name] * count)

print(all_elements)

When I run this (it's hard-coded to use acetic acid, CH3COOH) I get

['C', 'H', 'H', 'H', 'C', 'O', 'O', 'H']

Do note that this short bit of code assumes the molecular formula is correct. If you give it something like "##$%^O2#$$#", it will ignore the characters it doesn't recognize and give ['O', 'O']. If you don't want that, you'll have to make it a bit more robust.
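One way to make it more robust, as a minimal sketch: check that the whole string consists of element terms before expanding it. The names parse_formula_strict, term_pat and formula_pat are made up for this illustration, and the check only covers the shape of the formula. It does not verify that each symbol is a real element.

```python
import re

# One term: an element symbol (capital letter, optional lower-case letter)
# followed by an optional run of digits.
term_pat = re.compile(r"([A-Z][a-z]?)(\d*)")
# The whole formula must be one or more such terms and nothing else.
formula_pat = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def parse_formula_strict(formula):
    """Expand a flat formula into a list of symbols, rejecting stray characters."""
    if not formula_pat.fullmatch(formula):
        raise ValueError("not a valid flat molecular formula: %r" % formula)
    symbols = []
    for name, count in term_pat.findall(formula):
        # An empty count means the element appears once.
        symbols.extend([name] * (int(count) if count else 1))
    return symbols


print(parse_formula_strict("CH3COOH"))
```

With this version, "##$%^O2#$$#" raises a ValueError instead of quietly returning ['O', 'O'].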

If you want to support more complicated formulas, like C6H2(NO2)3CH3, then you'll need to know a bit about tree data structures, specifically (as @Roman points out), abstract syntax trees (most often called ASTs). That's too complicated to get into here, so see my talk and essays for more details.
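For a rough idea of what handling parentheses involves, here is a sketch that uses a stack of lists instead of building an AST. This is a simpler substitute technique, not the approach from the talk or essays. The names expand_formula and token_pat are invented for this example, and it assumes a count may follow either an element symbol or a closing parenthesis.

```python
import re

# A token is an element symbol with an optional count, an opening paren,
# or a closing paren with an optional count for the whole group.
token_pat = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def expand_formula(formula):
    """Expand a formula with parenthesised groups into a flat list of symbols."""
    stack = [[]]  # stack[-1] is the group currently being filled
    pos = 0
    while pos < len(formula):
        m = token_pat.match(formula, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        pos = m.end()
        symbol, count, open_paren, close_paren, group_count = m.groups()
        if symbol:
            stack[-1].extend([symbol] * (int(count) if count else 1))
        elif open_paren:
            stack.append([])  # start collecting a new nested group
        else:
            if len(stack) == 1:
                raise ValueError("unbalanced ')' at position %d" % m.start())
            group = stack.pop()
            # Repeat the finished group and add it to the enclosing group.
            stack[-1].extend(group * (int(group_count) if group_count else 1))
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


print(expand_formula("C6H2(NO2)3CH3"))
```

For C6H2(NO2)3CH3 the resulting list contains 7 C, 5 H, 3 N and 6 O. A flat list is enough for counting atoms, but it throws away the grouping, which is where an AST becomes useful.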

