I am trying to scrape some LinkedIn profiles of well-known people. The code takes a list of LinkedIn profile URLs and then uses Selenium and scrape_linkedin to collect the information and save it into a folder as a .json file.
The problem I am running into is that LinkedIn blocks the scraper from collecting some profiles. I am always able to get the first profile in the list of URLs, which I put down to the fact that it opens a new Google Chrome window and then goes to the LinkedIn page. (I could be wrong on this point, however.)
What I would like to do is add a line to the for loop that opens a new Google Chrome session and, once the scraper has collected the data, closes it, so that the next iteration of the loop opens a fresh Chrome session.
The scrape_linkedin package documentation states:
driver {selenium.webdriver}: driver type to use default: selenium.webdriver.Chrome
Looking at the Selenium documentation, I see:
driver = webdriver.Firefox() ... driver.close()
So Selenium does expose a close() method (and quit(), which ends the entire session rather than just closing the current window).
How can I add opening and closing of the Google Chrome browser to the for loop?
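Since ProfileScraper is already used as a context manager, one candidate answer is simply moving the `with` block inside the for loop, assuming ProfileScraper quits its driver on exit (which its context-manager usage suggests). The sketch below demonstrates the control flow with a made-up `FakeScraper` stand-in, so it runs without a real browser or the scrape_linkedin library:

```python
class FakeScraper:
    """Stand-in for ProfileScraper: 'opens a browser' on entry, closes it on exit."""
    sessions_opened = 0
    sessions_closed = 0

    def __enter__(self):
        # In the real library this is where the Chrome driver would be launched.
        FakeScraper.sessions_opened += 1
        return self

    def __exit__(self, *exc):
        # And this is where the driver would be quit.
        FakeScraper.sessions_closed += 1
        return False

    def scrape(self, url):
        return {"url": url}

urls = ["https://www.linkedin.com/in/williamhgates/",
        "https://www.linkedin.com/in/christinelagarde/"]

results = []
for link in urls:
    # One fresh session per profile; closed automatically when the block ends.
    with FakeScraper() as scraper:
        results.append(scraper.scrape(url=link))

print(FakeScraper.sessions_opened)  # 2 -- one browser per URL
```

With the real ProfileScraper, the cookie and scroll parameters would be passed to the constructor inside the loop in the same way.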
I have tried alternative approaches to collecting the data, such as increasing time.sleep() to 10 minutes and changing the scroll_increment and scroll_pause values, but the scraper still does not download the whole profile after the first one has been collected.
Code:
from datetime import datetime
from scrape_linkedin import ProfileScraper
import pandas as pd
import json
import os
import re
import time

my_profile_list = ['https://www.linkedin.com/in/williamhgates/',
                   'https://www.linkedin.com/in/christinelagarde/',
                   'https://www.linkedin.com/in/ursula-von-der-leyen/']

# To get the LI_AT key:
# Navigate to www.linkedin.com and log in
# Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
# Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
# Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
# Find and copy the li_at value
myLI_AT_Key = 'INSERT LI_AT Key'

with ProfileScraper(cookie=myLI_AT_Key, scroll_increment=50, scroll_pause=0.8) as scraper:
    for link in my_profile_list:
        print('Currently scraping: ', link, 'Time: ', datetime.now())
        profile = scraper.scrape(url=link)
        dataJSON = profile.to_dict()

        profileName = re.sub('https://www.linkedin.com/in/', '', link)
        profileName = profileName.replace("?originalSubdomain=es", "")
        profileName = profileName.replace("?originalSubdomain=pe", "")
        profileName = profileName.replace("?locale=en_US", "")
        profileName = profileName.replace("?locale=es_ES", "")
        profileName = profileName.replace("?originalSubdomain=uk", "")
        profileName = profileName.replace("/", "")

        with open(os.path.join(os.getcwd(), 'ScrapedLinkedInprofiles', profileName + '.json'), 'w') as json_file:
            json.dump(dataJSON, json_file)
        time.sleep(10)

print('The first observation scraped was:', my_profile_list[0])
print('The last observation scraped was:', my_profile_list[-1])
print('END')
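As a side note, the chain of .replace() calls that strips query strings could be collapsed with the standard library's urllib.parse, which removes any ?originalSubdomain=... or ?locale=... suffix without listing each variant by hand. A sketch (`profile_slug` is a name I made up):

```python
from urllib.parse import urlparse

def profile_slug(url):
    # urlparse splits the URL into components; .path already excludes the
    # query string, so every ?key=value suffix disappears in one step.
    path = urlparse(url).path          # e.g. '/in/williamhgates/'
    return path.strip("/").split("/")[-1]

print(profile_slug("https://www.linkedin.com/in/williamhgates/?locale=en_US"))
# williamhgates
```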
https://stackoverflow.com/questions/65531036/adding-an-open-close-google-chrome-browser-to-selenium-linkedin-scraper-code January 01, 2021 at 11:33PM