Friday, January 1, 2021

Adding an open/close Google Chrome browser to Selenium linkedin_scraper code

I am trying to scrape the LinkedIn profiles of some well-known people. The code takes a list of LinkedIn profile URLs and then uses Selenium and scrape_linkedin to collect the information and save it into a folder as a .json file.

The problem I am running into is that LinkedIn blocks the scraper from collecting some profiles. I am always able to scrape the first profile in the list of URLs. I put this down to the fact that the scraper opens a fresh Google Chrome window for that first profile and then navigates to the LinkedIn page. (I could be wrong about this, however.)

What I would like to do is add a line to the for loop that opens a new Google Chrome session and, once the scraper has collected the data, closes that session, so that each iteration of the loop starts with a fresh Google Chrome session.

From the package website here it states:

driver {selenium.webdriver}: driver type to use  default: selenium.webdriver.Chrome  

Looking at the Selenium package website here I see:

driver = webdriver.Firefox()
...
driver.close()

So Selenium does have a close() option. (Note that driver.close() only closes the current window; driver.quit() ends the whole browser session.)

How can I add an open and close Google Chrome browser to the for loop?
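One approach (a sketch, not tested against LinkedIn): since ProfileScraper is already used as a context manager, moving the `with` statement inside the for loop should open a fresh Chrome session on each iteration and close it when the iteration ends. The `FakeScraper` class below is a hypothetical stand-in for ProfileScraper, used only to demonstrate the open/close-per-iteration pattern without launching a real browser:

```python
# Demonstrates opening and closing a session on every loop iteration by
# moving the `with` block inside the loop. FakeScraper is a hypothetical
# stand-in for ProfileScraper; the real class would launch and quit Chrome.
events = []

class FakeScraper:
    def __enter__(self):
        events.append('open')    # real scraper: start a new Chrome session
        return self

    def __exit__(self, *exc):
        events.append('close')   # real scraper: close the Chrome session

    def scrape(self, url):
        return {'url': url}

profiles = ['profile-a', 'profile-b']
results = []
for link in profiles:
    with FakeScraper() as scraper:   # fresh session for each profile
        results.append(scraper.scrape(link))

print(events)  # ['open', 'close', 'open', 'close']
```

The trade-off is that every iteration pays the cost of starting a new browser and re-authenticating with the li_at cookie, but each profile is scraped from a clean session.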

I have tried alternative methods to collect the data, such as increasing time.sleep() to 10 minutes and changing the scroll_increment and scroll_pause, but the scraper still does not download a whole profile after the first one has been collected.

Code:

from datetime import datetime
from scrape_linkedin import ProfileScraper
import pandas as pd
import json
import os
import re
import time

my_profile_list = ['https://www.linkedin.com/in/williamhgates/',
                   'https://www.linkedin.com/in/christinelagarde/',
                   'https://www.linkedin.com/in/ursula-von-der-leyen/']

# To get the LI_AT key:
# Navigate to www.linkedin.com and log in
# Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
# Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
# Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
# Find and copy the li_at value
myLI_AT_Key = 'INSERT LI_AT Key'

with ProfileScraper(cookie=myLI_AT_Key, scroll_increment=50, scroll_pause=0.8) as scraper:
    for link in my_profile_list:
        print('Currently scraping: ', link, 'Time: ', datetime.now())
        profile = scraper.scrape(url=link)
        dataJSON = profile.to_dict()

        # Strip the URL down to a bare profile name for use as a filename
        profileName = re.sub('https://www.linkedin.com/in/', '', link)
        profileName = profileName.replace("?originalSubdomain=es", "")
        profileName = profileName.replace("?originalSubdomain=pe", "")
        profileName = profileName.replace("?locale=en_US", "")
        profileName = profileName.replace("?locale=es_ES", "")
        profileName = profileName.replace("?originalSubdomain=uk", "")
        profileName = profileName.replace("/", "")

        with open(os.path.join(os.getcwd(), 'ScrapedLinkedInprofiles', profileName + '.json'), 'w') as json_file:
            json.dump(dataJSON, json_file)
            time.sleep(10)

print('The first observation scraped was:', my_profile_list[0])
print('The last observation scraped was:', my_profile_list[-1])
print('END')
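As a side note, the chain of .replace() calls that strips query strings from the URL can be handled generically with urllib.parse, so new ?originalSubdomain=... variants do not each need their own line. A sketch (profile_name is a hypothetical helper name):

```python
from urllib.parse import urlparse

def profile_name(url):
    # urlparse separates the query string, so any '?originalSubdomain=...'
    # or '?locale=...' suffix is discarded automatically; then drop the
    # '/in/' prefix and any trailing slash from the path.
    path = urlparse(url).path
    return path.replace('/in/', '').strip('/')

print(profile_name('https://www.linkedin.com/in/williamhgates/?locale=en_US'))
# williamhgates
```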
https://stackoverflow.com/questions/65531036/adding-an-open-close-google-chrome-browser-to-selenium-linkedin-scraper-code January 01, 2021 at 11:33PM
