Saturday, December 19, 2020

How do I implement multithreading in my web scraper?

I am currently working on a project that requires me to extract data from hundreds of pages. However, I have noticed that the extraction is taking too long, since the scraper has to process around 800+ pages. I have read about multiprocessing, which I believe can speed things up, but I don't really know how to integrate it into my current code.

from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import time

final_data = []

for i in range(1, 8271, 10):
    url = f'https://www.fca.org.uk/news/search-results?np_category=warnings&start={i}'
    req = requests.get(url)
    start = time.process_time()
    page_html = req.content
    page_soup = soup(page_html, "lxml")
    data = page_soup.find_all('li', class_='search-item')
    print(f'Processing {url}')

    for x in data:
        item = {}  # renamed from "list", which shadows the built-in
        item['name'] = x.find('a', 'search-item__clickthrough').text.strip()
        try:
            item['published_date'] = x.find('span', 'meta-item published-date').text
        except AttributeError:  # find() returned None: no published date on this item
            item['published_date'] = 'None'
        item['modified_date'] = x.find('span', 'meta-item modified-date').text

        final_data.append(item)

df = pd.DataFrame(final_data)
TodaysDate = time.strftime("%Y%m%d")
csvfilename = TodaysDate + "_FCA Macro.csv"
df.to_csv(csvfilename, encoding="utf-8-sig")
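Since the bottleneck here is network I/O rather than CPU work, threads (not multiprocessing) are usually the simpler fit, e.g. via `concurrent.futures.ThreadPoolExecutor` from the standard library. Below is a minimal sketch of the pattern: the `scrape_page` helper and the `max_workers=10` value are illustrative assumptions, and the body of `scrape_page` is a placeholder where the `requests.get` / BeautifulSoup parsing from the code above would go.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_page(i):
    # Placeholder for the per-page work: in the real scraper this would
    # fetch the URL for offset i, parse it with BeautifulSoup, and return
    # a list of result dicts for that page.
    return [{'name': f'result-{i}'}]  # stand-in for the parsed items

offsets = range(1, 8271, 10)
final_data = []
with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map() runs scrape_page concurrently across the worker threads
    # but yields results in the same order as the input offsets, so the
    # output matches what the serial loop would produce.
    for page_items in pool.map(scrape_page, offsets):
        final_data.extend(page_items)
```

With real network calls, each thread spends most of its time waiting on the server, so ten workers can fetch roughly ten pages at once; building the DataFrame and writing the CSV stay exactly as in the original code, after the pool finishes.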
https://stackoverflow.com/questions/65353367/how-i-do-implement-multithread-in-my-web-scarper December 18, 2020 at 03:46PM
