Wednesday, March 31, 2021

Pass url column's values one by one to web crawler code in Python

Based on the answer code from this link, I'm able to create a new column: df['url'] = 'https://www.cspea.com.cn/list/c01/' + df['projectCode'].
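As a minimal illustration of that column-building step (the second projectCode value here is made up for the example; only gr2021bj1000186 appears in the original post):

```python
import pandas as pd

# Hypothetical projectCode values; the first one comes from the question.
df = pd.DataFrame({"projectCode": ["gr2021bj1000186", "gr2021bj1000187"]})

# String + Series broadcasts the prefix over every row.
df["url"] = "https://www.cspea.com.cn/list/c01/" + df["projectCode"]
```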

As a next step, I would like to pass the url column's values one by one to the following code and append all the scraped content into a single dataframe.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186"  # url column's values should be passed here one by one
soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")

index, data = [], []
for th in soup.select(".project-detail-left th"):
    h = th.get_text(strip=True)
    t = th.find_next("td").get_text(strip=True)
    index.append(h)
    data.append(t)

df = pd.DataFrame(data, index=index, columns=["value"])
print(df)

How could I do that in Python? Thanks.
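One way to structure this (a sketch, not from the original post: the function names scrape_detail and scrape_all are my own) is to split the parsing out into a function that takes HTML, then call it once per URL and collect one dict per page:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrape_detail(html):
    """Parse one project-detail page into a {header: value} dict."""
    soup = BeautifulSoup(html, "html.parser")
    row = {}
    for th in soup.select(".project-detail-left th"):
        row[th.get_text(strip=True)] = th.find_next("td").get_text(strip=True)
    return row


def scrape_all(urls):
    """Fetch every URL and stack the parsed dicts into one DataFrame, one row per page."""
    rows = [scrape_detail(requests.get(u, verify=False).content) for u in urls]
    return pd.DataFrame(rows)
```

With this layout the parsing logic can be tested offline against a saved HTML snippet, and scrape_all(df.url.tolist()) gives the combined table directly.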

Updated:

import requests
from bs4 import BeautifulSoup
import pandas as pd

df = pd.read_excel('items_scraped.xlsx')

urls = df.url.tolist()
for url_link in urls:
    url = url_link
    # url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186"
    soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")

    index, data = [], []
    for th in soup.select(".project-detail-left th"):
        h = th.get_text(strip=True)
        t = th.find_next("td").get_text(strip=True)
        index.append(h)
        data.append(t)

    df = pd.DataFrame(data, index=index, columns=["value"])
    df = df.T
    df.reset_index(drop=True, inplace=True)
    print(df)

df.to_excel('result.xlsx', index=False)

But it only saves one row to the Excel file.
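That happens because df is reassigned on every pass through the loop, so only the last page survives when to_excel runs. A minimal fix (a sketch assuming the same file names as above; combine_rows is my own helper name) is to collect each one-row frame in a list and concatenate once after the loop:

```python
import pandas as pd


def combine_rows(frames):
    """Stack single-row frames into one table, one row per scraped URL."""
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    import requests
    from bs4 import BeautifulSoup

    df = pd.read_excel("items_scraped.xlsx")

    frames = []  # accumulate instead of overwriting df each iteration
    for url in df.url.tolist():
        soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")
        index, data = [], []
        for th in soup.select(".project-detail-left th"):
            index.append(th.get_text(strip=True))
            data.append(th.find_next("td").get_text(strip=True))
        frames.append(pd.DataFrame(data, index=index, columns=["value"]).T)

    combine_rows(frames).to_excel("result.xlsx", index=False)
```

ignore_index=True renumbers the rows 0..n-1 so the output has one clean row per URL.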

https://stackoverflow.com/questions/66890578/pass-url-columns-values-one-by-one-to-web-crawler-code-in-python March 31, 2021 at 11:15PM
