Friday, March 19, 2021

Pandas read xml not working properly for single tag xml

I am using the pandas_read_xml package to read XML files into a pandas DataFrame. The package works fine for my purpose in the vast majority of cases. However, the resulting DataFrame is wrong when the XML at the URL contains only a single occurrence of the target tag. The following two examples illustrate this.

    # Import package
    import pandas_read_xml as pdx
    from pandas_read_xml import fully_flatten

    # Example 1
    url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
    df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    df_1 = pdx.fully_flatten(df_1)

The resulting df_1 contains 163 rows and 31 columns, where each row corresponds to a unique security. This is the result I want. However, the output is strange when I read an XML file in which the tag 'invstOrSec' occurs only once.

    # Example 2
    url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
    df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    df_2 = pdx.fully_flatten(df_2)

The resulting df_2 contains 6 rows and 19 columns. I can't make sense of why it contains 6 rows when it should contain just one. I have observed this behavior only in cases where the tag 'invstOrSec' occurs exactly once. Any help with this would be greatly appreciated. Please let me know if my question isn't clear.
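To make the expected result concrete, here is a minimal sketch of the "one row per invstOrSec" behavior described above, using only the standard library instead of pandas_read_xml. The XML snippets, the child fields (name, balance), and the helper `records` are hypothetical and invented for illustration; the real filings are much larger, contain nested children, and may declare XML namespaces, none of which this sketch handles. It does not diagnose what pandas_read_xml does internally; it only shows that selecting elements with `findall` yields a list whether the tag occurs once or many times.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified documents. The tag path mirrors the question;
# the child fields and their values are made up for illustration.
XML_MANY = """<edgarSubmission><formData><invstOrSecs>
  <invstOrSec><name>A</name><balance>10</balance></invstOrSec>
  <invstOrSec><name>B</name><balance>20</balance></invstOrSec>
</invstOrSecs></formData></edgarSubmission>"""

XML_ONE = """<edgarSubmission><formData><invstOrSecs>
  <invstOrSec><name>C</name><balance>30</balance></invstOrSec>
</invstOrSecs></formData></edgarSubmission>"""


def records(xml_text, path="formData/invstOrSecs/invstOrSec"):
    """Return one flat dict per element matched by `path` (hypothetical helper).

    The path is relative to the root element (edgarSubmission here).
    findall() always returns a list, so a single occurrence of the tag
    still produces exactly one record.
    """
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in elem}  # direct children only
        for elem in root.findall(path)
    ]


many = records(XML_MANY)  # two records, one per invstOrSec
one = records(XML_ONE)    # a list with a single record
```

A list of dicts like this can be passed to `pandas.DataFrame(...)` to get one row per security in both cases.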

https://stackoverflow.com/questions/66710039/pandas-read-xml-not-working-properly-for-single-tag-xml March 19, 2021 at 10:27PM
