Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Cleaning messy observations while keeping information

I'm practising my importing and cleaning skills and have hit a snag. I've been importing from the Wikipedia page in the code below. The import works and I have been able to drop the NaN values. However, some observations are written with a year in parentheses (for example, 13.7 (2016)). Because of this they are read in as strings, and even if they were parsed as numbers they would contain false information.

I want to get rid of the years in parentheses but preserve the data value itself.

At present here is my code:

import numpy as np
import pandas as pd

# Values to treat as missing
missing_values = ['?', np.nan]
# Read every table on the page into a list of DataFrames
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate', na_values=missing_values)
# Select the table of interest and drop the columns that are not needed
df = dfs[3]
df = df.drop(columns=['Year', 'Undetermined', 'Sources and notes'])

print(df)
# pd.to_numeric(df['Guns per 100 inhabitants'])

Any help appreciated!



1 Reply


Bit of a workaround, but you could clean it up by splitting each string on a space and taking the first piece.

df['Guns per 100 inhabitants (clean)'] = np.array([float(s.split(' ')[0]) for s in df['Guns per 100 inhabitants'])

I tried it on your data and some errors remain: one entry is formatted as a range ('6.2-19.4'), and some entries are already floats rather than strings, so s.split(' ') raises an AttributeError on them. Still, I think this solves the year-in-parentheses issue.
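If you want something that also copes with those two leftover cases, here is a rough sketch using a regular expression instead of split. The helper name leading_number is made up for illustration, and it assumes that for a range such as '6.2-19.4' you are happy to keep only the first (lower) number, and that values are never negative. Adjust it if those assumptions don't fit your data.

```python
import math
import re

# Matches an integer or decimal number at the start of a cell,
# e.g. "13.7" in "13.7 (2016)" or "6.2" in "6.2-19.4".
_LEADING_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)?)")


def leading_number(value):
    """Return the first number in a cell as a float, or NaN if there is none.

    Hypothetical helper: keeps only the first number, so a range like
    "6.2-19.4" becomes 6.2 (an assumption, not the only sensible choice).
    """
    if isinstance(value, (int, float)):
        # Already numeric (this includes NaN), so just pass it through.
        return float(value)
    match = _LEADING_NUMBER.match(str(value))
    return float(match.group(1)) if match else math.nan
```

With the DataFrame from the question, this could be applied as df['Guns per 100 inhabitants'].map(leading_number), which should leave you with a float column where anything unparseable becomes NaN.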

