Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Reconstructing two (string concatenated) numbers that were originally floats

Unfortunately, the print statement in my code was written without an end-of-line character, so one in every 26 numbers actually consists of two numbers joined together. The code below reproduces this behaviour; a fragment of the original data is at the end.

import numpy as np

for _ in range(2):
  A=np.random.rand()+np.random.randint(0,100)
  B=np.random.rand()+np.random.randint(0,100)
  C=np.random.rand()+np.random.randint(0,100)
  D=np.random.rand()+np.random.randint(0,100)
  with open('file.txt','a') as f:
    f.write(f'{A},{B},{C},{D}')

The resulting output file looks something like this:

40.63358599010553,53.86722741700399,21.800795158561158,13.95828176311762557.217562728494684,2.626308403991772,4.840593988487278,32.401778122213486

The issue is that two numbers are 'printed together'. In the example they are:

13.95828176311762557.217562728494684

So you cannot know if they should be

13.958281763117625, 57.217562728494684

or

13.9582817631176255, 7.217562728494684

Please understand that in this case there are only two options, but the problem I want to address involves 'unbounded' numbers of Python's "float" type (where 'unbounded' means in a range we don't know in advance, e.g. within ±1E4).

Can the original numbers be reconstructed based on some internal Python behavior I'm missing?

Actual data with periodicity 27 (i.e. the 26th number consists of 2 joined together):

0.9221878978925224, 0.9331311610066017,0.8600582424784715,0.8754578588852764,0.8738648974725404, 0.8897837559800233,0.6773502027673041,0.736325377603136,0.7956454122424133, 0.8083168444596229,0.7089031184165164, 0.7475306242508357,0.9702361286847581, 0.9900689384633811,0.7453878225174624, 0.7749000030576826,0.7743879170108678, 0.8032590543649807,0.002434,0.003673,0.004194,0.327903,11.357262,13.782266,20.14374,31.828905,33.9260060.9215201173775437, 0.9349343132442707,0.8605282244327555,0.8741626682026793,0.8742163597524663, 0.8874673376386358,0.7109322043854609,0.7376362393985332,0.796158275345
question from:https://stackoverflow.com/questions/65928610/reconstructing-two-string-concatenated-numbers-that-were-originally-floats


1 Reply


To expand my comment into an actual answer:

We do have some information: a Python float is an IEEE-754 double, which has 64 bits in total, 53 of them for the significand (mantissa), so it carries only about 15 to 17 significant decimal digits and not every number can be represented. For datasets like yours, the values are brushing up against the edge of that precision.

We can make that work for us: we just need to test, at each possible split point, whether each piece is a string that a float would actually print as. We can abuse strings for this by testing num_str == str(float(num_str)) (i.e. the string remains the same after being converted to a float and back to a string). This relies on Python printing a float as the shortest string that converts back to exactly the same float.

  • If the string is exactly what Python would print for some float (the shortest decimal string that uniquely identifies that float), then the before and after will be equal.
  • If it is not, for example because it carries more digits than a float can hold or ends in redundant zeros, float() coerces it to the nearest value a float can represent. If we then convert this back to a string, it will not be identical to the original.
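
As a quick illustration of the two cases above, here is a minimal sketch. The helper name round_trips is made up for this example, and it assumes Python 3, where str() of a float gives the shortest round-tripping string.

```python
def round_trips(num_str: str) -> bool:
    """True if num_str is exactly what Python prints for the float it parses to."""
    return str(float(num_str)) == num_str

print(round_trips('33.926006'))            # True
print(round_trips('0.1'))                  # True, even though 0.1 is not stored exactly
print(round_trips('33.92600'))             # False: it prints back as '33.926'
print(round_trips('13.9582817631176255'))  # False: 18 significant digits, Python never prints more than 17
```

Note that '0.1' passes: the test is not really about exact representability, but about whether the string is the canonical printed form of a float.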

Here's a snippet, for example, that you can play around with:

from typing import List

def parse_number(s: str) -> List[float]:
    if s.count('.') == 2:
        first_decimal = s.index('.')
        second_decimal = s.index('.', first_decimal + 1)
        # try each split point, starting with the longest possible first number
        for split_idx in range(second_decimal - 1, first_decimal + 1, -1):
            a, b = s[:split_idx], s[split_idx:]
            # accept the split only if both halves survive a float round trip
            if str(float(a)) == a and str(float(b)) == b:
                return [float(a), float(b)]
        # default to returning as large an a as possible
        return [float(s[:second_decimal - 1]), float(s[second_decimal - 1:])]
    else:
        return [float(s)]

parse_number('33.9260060.9215201173775437')
# [33.926006, 0.9215201173775437]
# this is the only possible combination that actually works for this particular input
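
To run this over a whole record rather than a single token, something along these lines might work. It is only a sketch: split_fused and repair_line are hypothetical names, and it assumes the values are separated by commas with optional spaces, as in the data fragment from the question.

```python
from typing import List

def split_fused(token: str) -> List[str]:
    """Split a token holding two fused floats; return other tokens unchanged."""
    if token.count('.') != 2:
        return [token]
    first = token.index('.')
    second = token.index('.', first + 1)
    # longest possible first number first, same bias as parse_number
    for i in range(second - 1, first + 1, -1):
        a, b = token[:i], token[i:]
        if str(float(a)) == a and str(float(b)) == b:
            return [a, b]
    # fallback: keep the first number as long as possible
    return [token[:second - 1], token[second - 1:]]

def repair_line(line: str) -> List[float]:
    """Parse one comma-separated record, expanding any fused token."""
    values: List[float] = []
    for token in line.split(','):
        token = token.strip()
        if token:
            values.extend(float(part) for part in split_fused(token))
    return values

# fragment of the data from the question, containing one fused token
line = '20.14374,31.828905,33.9260060.9215201173775437, 0.9349343132442707'
values = repair_line(line)
print(len(values))  # 5: the fused token contributes two values
```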

Obviously this isn't foolproof, and for some numbers there may not be enough information to differentiate the first number from the second. Additionally, for this to work, the tool that generated your data needs to have worked with IEEE standards-compliant floats and printed them the way Python does. That does appear to be the case in this example, but may not be if the results were generated using a class like Decimal (Python), BigDecimal (Java) or something else.
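
To see why the source of the numbers matters, here is a small sketch using Python's decimal module with its default context (28 significant digits). The helper round_trips is a made-up name for the string test described earlier.

```python
from decimal import Decimal

def round_trips(num_str: str) -> bool:
    """True if num_str is exactly what Python prints for the float it parses to."""
    return str(float(num_str)) == num_str

# A value produced with Decimal carries more digits than a float would print
third = str(Decimal(1) / Decimal(3))
print(len(third))               # 30 characters: '0.' plus 28 digits
print(round_trips(third))       # False
print(round_trips(str(1 / 3)))  # True: produced by an ordinary float
```

With data like this, every piece would fail the test and the snippet would always fall back to its default split.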

Some inputs might also have multiple possibilities. In the above snippet I've biased it to take the longest possible [first number], but you could modify it to go in the opposite order and instead take the shortest possible [first number].
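
If you would rather detect ambiguous inputs than silently pick one, you could collect every split that passes the test. This is a sketch only: candidate_splits is a hypothetical helper, and the first input below is an invented token chosen to show the ambiguity.

```python
from typing import List, Tuple

def candidate_splits(s: str) -> List[Tuple[str, str]]:
    """Return every (a, b) split of s where both halves round-trip through float."""
    if s.count('.') != 2:
        return []
    first = s.index('.')
    second = s.index('.', first + 1)
    found = []
    # longest first number first, so found[0] matches parse_number's choice
    for i in range(second - 1, first + 1, -1):
        a, b = s[:i], s[i:]
        if str(float(a)) == a and str(float(b)) == b:
            found.append((a, b))
    return found

# A short second number leaves several valid readings
print(candidate_splits('33.9260060.75'))
# [('33.926006', '0.75'), ('33.92', '60060.75'), ('33.9', '260060.75')]

# A full-precision second number pins the split down
fused = '33.926006' + repr(0.1 + 0.2)
print(candidate_splits(fused))
# [('33.926006', '0.30000000000000004')]
```

When more than one candidate comes back, the string test alone cannot decide, and you would need outside knowledge (for example the expected range of each column) to choose.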

