I am currently working on a project that requires me to extract data from hundreds of pages. However, the whole extraction is taking too long because the scraper has to process around 800+ pages. I have read about multiprocessing, which I believe can speed things up, but I don't really know how to integrate it into my current code.
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import time

final_data = []
for i in range(1, 8271, 10):
    url = f'https://www.fca.org.uk/news/search-results?np_category=warnings&start={i}'
    req = requests.get(url)
    page_html = req.content
    page_soup = soup(page_html, "lxml")
    data = page_soup.find_all('li', class_='search-item')
    print(f'Processing {url}')
    for x in data:
        record = {}  # renamed from 'list', which shadowed the builtin
        record['name'] = x.find('a', 'search-item__clickthrough').text.strip()
        try:
            record['published_date'] = x.find('span', 'meta-item published-date').text
        except AttributeError:  # some items have no published-date span
            record['published_date'] = 'None'
        record['modified_date'] = x.find('span', 'meta-item modified-date').text
        final_data.append(record)

df = pd.DataFrame(final_data)
TodaysDate = time.strftime("%Y%m%d")
csvfilename = TodaysDate + "_FCA Macro.csv"
df.to_csv(csvfilename, encoding="utf-8-sig")
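To illustrate what I mean, here is a rough, untested sketch of what I imagine the concurrent version might look like. Since the work is network-bound, I used concurrent.futures.ThreadPoolExecutor rather than multiprocessing; the scrape_page helper, the worker count of 16, and the request timeout are my own guesses, not part of my working code:

from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import time

def scrape_page(start):
    """Fetch one results page and return its rows as a list of dicts."""
    url = f'https://www.fca.org.uk/news/search-results?np_category=warnings&start={start}'
    page_soup = soup(requests.get(url, timeout=30).content, 'lxml')
    rows = []
    for x in page_soup.find_all('li', class_='search-item'):
        record = {'name': x.find('a', 'search-item__clickthrough').text.strip()}
        # Both date spans can be missing, so guard each lookup
        for key, cls in (('published_date', 'meta-item published-date'),
                         ('modified_date', 'meta-item modified-date')):
            tag = x.find('span', cls)
            record[key] = tag.text if tag else 'None'
        rows.append(record)
    return rows

final_data = []
# pool.map keeps results in page order; the worker count is a guess to tune
with ThreadPoolExecutor(max_workers=16) as pool:
    for rows in pool.map(scrape_page, range(1, 8271, 10)):
        final_data.extend(rows)

df = pd.DataFrame(final_data)
df.to_csv(time.strftime('%Y%m%d') + '_FCA Macro.csv', encoding='utf-8-sig')

Is this the right general shape, or would multiprocessing.Pool be a better fit here?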