I have a list of columns from a set of CSV files that I am searching through, assigning new column names based on what I find.
I want to search through the list of columns to find the cases where a date (YYYY-MM-DD) is contained in the column name (I don't know what the date will be, or whether it will exist at all). I do know that there will most likely be a column named col_of_interest_variable that I definitely want to exclude.
I then want to search through the list of date-containing columns and choose the latest one. I have part of the function written, but I am not sure of the best way to search through the list of columns and create a new list containing every column that matches the pattern.

The rub with this function is that I am reading in hundreds of CSV files, all containing different numbers of columns with different names. The one thing I do know is that the column I want matches the pattern below, with the most recent date:

col_of_interest_\d{4}-\d{2}-\d{2}_variable
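For the "build a new list of matching columns" step, one option is to compile the pattern once and keep only the names that fully match it, which also excludes the bare col_of_interest_variable column. This is a minimal sketch; the column names below are made up for illustration:

    import re

    # Hypothetical column names standing in for one CSV's header
    columns = [
        "id",
        "col_of_interest_variable",               # no date, should be excluded
        "col_of_interest_2020-06-01_variable",
        "col_of_interest_2020-08-28_variable",
    ]

    pattern = re.compile(r"col_of_interest_\d{4}-\d{2}-\d{2}_variable")

    # fullmatch returns a Match only when the whole name fits the pattern,
    # so col_of_interest_variable is dropped automatically
    date_cols = [c for c in columns if pattern.fullmatch(c)]
    print(date_cols)
    # ['col_of_interest_2020-06-01_variable', 'col_of_interest_2020-08-28_variable']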
Here is what I currently have, and it doesn't work the way I want it to. The findall approach doesn't seem like the best one, and the list comprehension that builds date_cols doesn't pick out the most recent date.
    import re
    import numpy as np
    import pandas as pd

    # One findall result per column (an empty list for every non-matching column)
    filtered_columns = [
        re.findall(r"col_of_interest_\d{4}-\d{2}-\d{2}_variable", column)
        for column in df.columns
    ]

    if len(filtered_columns) == 0:
        df["new_col_of_interest"] = np.nan
    if len(filtered_columns) == 1:
        df["new_col_of_interest"] = df["col_of_interest_\d{4}-\d{2}-\d{2}_variable"]
    elif len(filtered_columns) > 1:
        # First non-empty findall result
        date_cols = next(s for s in filtered_columns if s)
        # Parse the date out of each matching column name
        application_dates = [
            pd.to_datetime(re.search(r"\d{4}-\d{2}-\d{2}", column).group())
            for column in date_cols
        ]
        last_application_date_index = np.argmax(application_dates)
        df["new_col_of_interest"] = df[filtered_columns[last_application_date_index]]
My goal is a DataFrame output with a new column, new_col_of_interest, containing the values (in this case) from the col_of_interest_2020-08-28_variable column.
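One way to get that output is to parse the date out of every matching column name, take the maximum, and copy that column. This is a minimal sketch, not your exact function: it assumes each CSV has already been read into df, and the helper name add_latest_col_of_interest is made up:

    import re
    import numpy as np
    import pandas as pd

    def add_latest_col_of_interest(df):
        """Copy the most recent col_of_interest_YYYY-MM-DD_variable column
        into new_col_of_interest (NaN when no such column exists)."""
        pattern = re.compile(r"col_of_interest_(\d{4}-\d{2}-\d{2})_variable")

        # (column name, parsed date) pairs for every column matching the pattern
        dated = [
            (col, pd.to_datetime(m.group(1)))
            for col in df.columns
            if (m := pattern.fullmatch(col))
        ]

        if not dated:
            df["new_col_of_interest"] = np.nan
        else:
            # max by parsed date, then keep the column name
            latest_col = max(dated, key=lambda pair: pair[1])[0]
            df["new_col_of_interest"] = df[latest_col]
        return df

    # Hypothetical usage with made-up data
    df = pd.DataFrame({
        "col_of_interest_variable": [1, 2],
        "col_of_interest_2020-06-01_variable": [3, 4],
        "col_of_interest_2020-08-28_variable": [5, 6],
    })
    print(add_latest_col_of_interest(df)["new_col_of_interest"].tolist())  # [5, 6]

Working with (name, date) pairs avoids keeping separate filtered_columns and date_cols lists, so an np.argmax index can never end up pointing into the wrong list.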