Consider the following code in Python 3.8:
import re
from datetime import datetime

import mysql.connector

# database_credentials (a dict of connection settings) is assumed to be defined elsewhere.


def clean_data(data) -> list:
    """Takes a list of WebElements and extracts the needed information using regex.
    Groups the information into dictionaries.
    Returns a list of dictionaries."""
    clean_data = []
    for element in data:
        datapoint = {}
        text = element.text
        name = re.search(r"A.*", text).group()
        auction_id = re.search(r"(?:LOT ID: )(\d+)", text).group(1)
        auction_price = re.search(r"(?:€)(\d+.\d+.\d+)", text).group(1)
        # Strip the thousands separator before converting to float.
        auction_price = auction_price.replace(",", "")
        auction_price = float(auction_price)
        date = re.search(r"(?:Sold on |Reserve not met )(.*)", text).group(1)
        date = datetime.strptime(date, "%d/%m/%Y")
        datapoint["auction_id"] = int(auction_id)
        datapoint["name"] = name
        datapoint["auction_price"] = auction_price
        datapoint["date"] = date
        clean_data.append(datapoint)
    return clean_data


def read_id_from_database() -> list:
    """Connects to the MySQL database and returns a list of all existing auction_ids."""
    connection = mysql.connector.connect(**database_credentials)
    cursor = connection.cursor()
    SQL = """SELECT auction_id FROM database"""
    cursor.execute(SQL)
    ids = cursor.fetchall()
    cursor.close()
    connection.close()
    # fetchall() returns one tuple per row; keep only the first column.
    existing_ids = [row[0] for row in ids]
    return existing_ids


def select_new_datapoints(data: list) -> list:
    """Takes a list of dictionaries. Compares each auction_id with a list of existing auction IDs.
    Returns a list of all dictionaries with new auction IDs."""
    existing_ids = read_id_from_database()
    new_datapoints = []
    for element in data:
        if int(element["auction_id"]) not in existing_ids:
            new_datapoints.append(element)
    return new_datapoints
clean_data extracts the relevant information on auction items from the WebElements returned by the webdriver. Every item is represented as a dictionary called datapoint, and every key-value pair in this dictionary is one piece of information. The function returns a list of these dictionaries.
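To illustrate what clean_data produces, here is a minimal sketch that runs the same patterns on a single hard-coded string. The sample text and the helper name parse_auction_text are made up for illustration; the real element.text may be laid out differently.

```python
import re
from datetime import datetime


def parse_auction_text(text: str) -> dict:
    """Hypothetical helper: applies the patterns from clean_data to one string."""
    auction_id = re.search(r"(?:LOT ID: )(\d+)", text).group(1)
    price = re.search(r"(?:€)(\d+.\d+.\d+)", text).group(1)
    date = re.search(r"(?:Sold on |Reserve not met )(.*)", text).group(1)
    return {
        "auction_id": int(auction_id),
        "name": re.search(r"A.*", text).group(),
        "auction_price": float(price.replace(",", "")),
        "date": datetime.strptime(date, "%d/%m/%Y"),
    }


# Made-up sample text; the real page layout is an assumption.
sample_text = "Aston Martin DB5\nLOT ID: 12345\n€1,234.56\nSold on 21/01/2021"
datapoint = parse_auction_text(sample_text)
print(datapoint)
```

Note that auction_id ends up as an int in the dictionary, which is the value later compared against the IDs from the database.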
read_id_from_database just returns a list of all auction_ids which are already stored in my database.
select_new_datapoints checks whether an item in the clean_data list has an ID which is already in my database by comparing its ID to the list provided by read_id_from_database. If the ID does not already exist, the item gets added to the new_datapoints list.
for element in data:
    if int(element["auction_id"]) not in existing_ids:
        new_datapoints.append(element)
Those items get added to my database later on.
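The same filter can also be sketched with a set instead of a list, which makes each membership test a hash lookup. This is only an illustration with made-up IDs: existing_ids is hard-coded here instead of coming from read_id_from_database, and filter_new is a hypothetical name. It is not expected to change the result of the comparison.

```python
def filter_new(data: list, existing_ids: list) -> list:
    """Hypothetical variant of select_new_datapoints that takes the IDs as an argument."""
    known = set(existing_ids)  # set membership is a hash lookup
    return [element for element in data if int(element["auction_id"]) not in known]


# Made-up data for illustration.
data = [{"auction_id": 1}, {"auction_id": 2}, {"auction_id": 3}]
existing_ids = [1, 2]
new_datapoints = filter_new(data, existing_ids)
print(new_datapoints)  # [{'auction_id': 3}]
```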
The problem is that select_new_datapoints does not work: it always keeps all items, even though most or even all of the IDs are already present in my database.
I already checked this by writing the IDs from clean_data and existing_ids to a CSV file and comparing them visually:
[Screenshot: CSV file]
Both lists are identical, so select_new_datapoints should return an empty new_datapoints list. Instead, it returns a list with 226 elements, so something is going wrong here:
new_datapoints = []
for element in data:
    if int(element["auction_id"]) not in existing_ids:
        new_datapoints.append(element)
return new_datapoints
I assume that, for some reason,
if int(element["auction_id"]) not in existing_ids:
    new_datapoints.append(element)
always evaluates to True, and therefore all elements get appended.
I was unable to find out why, since
if int(element["auction_id"]) not in existing_ids:
should work as intended, and my CSV showed me that every element's ID is already in the existing_ids list. I have also checked the types of the values I am comparing: they are both int.
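A minimal sketch of how this check could be done in code rather than by eye, assuming the scraped IDs and the database IDs are available as two plain lists. The function name compare_ids and the sample values are hypothetical. It reports which types occur on each side and which scraped IDs are missing from the database list; it does not by itself explain the behaviour described above.

```python
from collections import Counter


def compare_ids(scraped_ids: list, existing_ids: list) -> dict:
    """Hypothetical helper: summarize types and differences between two ID lists."""
    return {
        # Count how often each type name occurs on either side.
        "scraped_types": Counter(type(i).__name__ for i in scraped_ids),
        "existing_types": Counter(type(i).__name__ for i in existing_ids),
        # IDs that were scraped but are not in the database list.
        # key=repr keeps sorting safe even if types are mixed.
        "missing": sorted(set(scraped_ids) - set(existing_ids), key=repr),
    }


# Made-up values for illustration.
scraped_ids = [101, 102, 103]
existing_ids = [101, 102]
report = compare_ids(scraped_ids, existing_ids)
print(report["scraped_types"], report["existing_types"])
print(report["missing"])  # [103]
```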
question from:
https://stackoverflow.com/questions/65831154/i-use-not-in-to-compare-and-objects-id-to-a-list-of-ids-comparison-seems-to-al