2021年1月25日星期一

How to search through a list of columns to find a specific regular expression pattern, and create a new column based on this value?

I have a list of columns in a set of CSV files I am searching for and assigning new column names based on.

I want to search through the list of columns to find the case where a date variable (YYYY-MM-DD) is contained in the name (I don't know what the date will be or if it will even exist). I do know that there will most likely be a variable col_of_interest_variable that I definitely want to exclude.

I then want to search through the list of date containing columns, and choose the latest one. I have part of the function but am not sure the best way to search through a list of columns and create a NEW list containing all columns matching this formula:

Oh also, the rub with this function is that I am reading in hundreds of CSV files all containing different #s of columns with different names. The one thing I do know is that the pattern below, containing the most recent date, is the column I want.

col_of_interest_\d{4}-\d{2}-\d{2}_variable

Here is what I currently have and it doesn't work the way I want it to: The findall method doesn't seem like the best approach, and the list comprehension to get the date_cols doesn't get the most recent date.

    filtered_columns = [          re.findall("col_of_interest_\d{4}-\d{2}-\d{2}_variable", column)          for column in df.columns      ]      if len(filtered_columns) == 0:          df["new_col_of_interest"] = np.nan        if len(filtered_columns) == 1:          df["new_col_of_interest"] = df[col_of_interest_\d{4}-\d{2}-\d{2}_variable]        elif len(filtered_columns) > 1:          date_cols = next(s for s in filtered_columns if s)            application_dates = [              pd.to_datetime(re.search("\d{4}-\d{2}-\d{2}", column).group())              for column in date_cols          ]          last_application_date_index = np.argmax(application_dates)            df["new_col_of_interest"] = df[              filtered_columns[last_application_date_index]          ]  

enter image description here

My goal is to have this Dataframe output, with a new column new_col_of_interest containing the values (in this case) from the new_col_of_interest_2020-08-28_variable column.

https://stackoverflow.com/questions/65894961/how-to-search-through-a-list-of-columns-to-find-a-specific-regular-expression-pa January 26, 2021 at 10:04AM

没有评论:

发表评论