Thursday, January 21, 2021

Python - Beautiful Soup: Webscraping PubMed - extracting PMIDs (an article ID), adding to list, and preventing duplicate scraping

I want to extract research abstracts from PubMed. I will be searching multiple URLs for publications, and some of them will return the same articles as others. Each article has a unique ID called a PMID, and each article's abstract lives at a base URL plus the PMID (example: https://pubmed.ncbi.nlm.nih.gov/ + 32663045). I don't want to extract the same article twice, for multiple reasons (e.g., it takes longer to run the whole script and uses more bandwidth), so once I extract a PMID, I add it to a list. I'm trying to make my code extract information from each abstract only once, but it is still extracting duplicate PMIDs and publication titles.

I know how to drop duplicates from my output with Pandas, but that's not what I want to do. I want to skip over PMIDs/URLs that I have already scraped.
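The skip-if-already-seen idea itself can be sketched with a set, which makes the membership check cheap (the PMIDs below are just the two from the example output, reused for illustration):

```python
# Toy illustration of skip-if-seen; set membership is O(1) vs O(n) for a list
seen = set()
kept = []
for pmid in ["32663045", "32941086", "32663045"]:
    if pmid in seen:
        continue  # already scraped this article; skip the duplicate
    seen.add(pmid)
    kept.append(pmid)

print(kept)  # ['32663045', '32941086']
```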

Current Output

Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086

Desired Output

Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086

Here's my code:

from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd

all_pmids = []
out = []

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=', 'https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:

    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    pmids = soup.find_all('span', {'class': 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

    for pmid in all_pmids:
        url = 'https://pubmed.ncbi.nlm.nih.gov/' + pmid
        response2 = requests.get(url)
        soup2 = BeautifulSoup(response2.content, 'html.parser')

        title = soup2.select('h1.heading-title')[0].text.strip()

        data = {'title': title, 'pmid': pmid, 'url': url}
        time.sleep(3)
        out.append(data)

df = pd.DataFrame(out)

df.to_excel('my_results.xlsx')
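The duplicates come from loop placement rather than the membership check: `for pmid in all_pmids:` sits inside `for search_url in search_urls:`, so every PMID accumulated so far is fetched again on each pass through the search URLs. One way to restructure is to collect the PMIDs from all result pages first and fetch each article page only afterward. Below is a sketch of that idea; `dedupe_pmids` is a hypothetical helper name, and the network portion is shown only as comments so the dedup logic stands alone:

```python
# Hypothetical helper: flatten per-page PMID lists, keeping only the
# first occurrence of each PMID, in order of appearance.
def dedupe_pmids(pmid_pages):
    seen = set()
    unique = []
    for page in pmid_pages:
        for pmid in page:
            if pmid in seen:
                continue  # already collected; skip the duplicate
            seen.add(pmid)
            unique.append(pmid)
    return unique

# With this helper the per-article requests move *outside* the
# search-URL loop, so each article page is fetched exactly once:
#
#   pages = []
#   for search_url in search_urls:
#       soup = BeautifulSoup(requests.get(search_url).content, 'html.parser')
#       pages.append([s.get_text() for s in soup.find_all('span', {'class': 'docsum-pmid'})])
#   for pmid in dedupe_pmids(pages):
#       ...fetch 'https://pubmed.ncbi.nlm.nih.gov/' + pmid once...

print(dedupe_pmids([["32663045", "32941086"], ["32663045", "32941086"]]))
# → ['32663045', '32941086']
```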
https://stackoverflow.com/questions/65838383/python-beautiful-soup-webscraping-pubmed-extracting-pmids-an-article-id January 22, 2021 at 10:05AM
