I want to extract research abstracts from PubMed. I have multiple search URLs, and some of them return the same articles as others. Each article has a unique ID called a PMID, and the URL of each abstract is a base URL plus that PMID (for example: https://pubmed.ncbi.nlm.nih.gov/ + 32663045). I don't want to extract the same article twice, for several reasons (it takes longer for the code to finish and uses more bandwidth), so once I extract a PMID I add it to a list. I'm trying to make my code extract the information from each abstract only once, but it is still producing duplicate PMIDs and publication titles.
I know how to drop duplicates from my output in Pandas, but that's not what I want. I want to skip over PMIDs/URLs that I have already scraped.
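The skip-over pattern described here is usually built on a set rather than a list, since set membership tests are constant-time. A minimal sketch of the idea (the PMIDs are hard-coded from the output below purely for illustration; in the real script they would come from the search pages):

```python
seen_pmids = set()
to_scrape = []

for pmid in ['32663045', '32941086', '32663045', '32941086']:
    if pmid in seen_pmids:
        continue  # already saw this article -- skip it entirely
    seen_pmids.add(pmid)
    to_scrape.append('https://pubmed.ncbi.nlm.nih.gov/' + pmid)

# to_scrape now holds each article URL exactly once, in first-seen order
```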
Current Output
Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
Desired Output
Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
Here's my code:
from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd

all_pmids = []
out = []

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=',
               'https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']

for search_url in search_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')
    for pmid in all_pmids:
        url = 'https://pubmed.ncbi.nlm.nih.gov/' + pmid
        response2 = requests.get(url)
        soup2 = BeautifulSoup(response2.content, 'html.parser')
        title = soup2.select('h1.heading-title')[0].text.strip()
        data = {'title': title, 'pmid': pmid, 'url': url}
        time.sleep(3)
        out.append(data)

df = pd.DataFrame(out)
df.to_excel('my_results.xlsx')
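The likely culprit in the code above is that the for pmid in all_pmids loop is nested inside the for search_url in search_urls loop: the PMID list itself is deduplicated correctly, but after each search page the code walks the entire list again and appends every article to out a second time. A minimal sketch of a restructured version that fetches each article exactly once; the function names (dedupe_keep_order, scrape) are illustrative, not part of the original, and the third-party imports are kept inside scrape so the dedup helper can run on its own:

```python
import time

def dedupe_keep_order(pmids):
    """Drop duplicate PMIDs while keeping first-seen order."""
    seen = set()
    unique = []
    for pmid in pmids:
        if pmid not in seen:
            seen.add(pmid)
            unique.append(pmid)
    return unique

def scrape(search_urls):
    # Third-party imports live here so dedupe_keep_order above
    # is usable without requests/bs4/pandas installed.
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Phase 1: gather PMIDs from every search page; duplicates are fine here.
    collected = []
    for search_url in search_urls:
        soup = BeautifulSoup(requests.get(search_url).content, 'html.parser')
        for span in soup.find_all('span', {'class': 'docsum-pmid'}):
            collected.append(span.get_text())

    # Phase 2: runs ONCE, after ALL search pages -- not once per search page.
    out = []
    for pmid in dedupe_keep_order(collected):
        url = 'https://pubmed.ncbi.nlm.nih.gov/' + pmid
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        title = soup.select_one('h1.heading-title').text.strip()
        out.append({'title': title, 'pmid': pmid, 'url': url})
        time.sleep(3)  # be polite to the server
    return pd.DataFrame(out)
```

With this shape, scrape(search_urls).to_excel('my_results.xlsx') would produce the desired output, because the detail pages are only requested after the full, deduplicated PMID list is known.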
https://stackoverflow.com/questions/65838383/python-beautiful-soup-webscraping-pubmed-extracting-pmids-an-article-id January 22, 2021 at 10:05AM