I am facing a web scraping problem. I want to scrape a few comments from TripAdvisor using rvest,
and I want the comments in all languages. From this question I understood that one way is to append ?filterLang=ALL
to the end of the URL. In a web browser this works: opening the restaurant page with ?filterLang=ALL appended
shows the comments with "All languages" selected (and you can see a lot of French comments). Here is my problem. I try to get the comment titles:
    library(rvest)

    url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
    reviews_html <- read_html(url)
    reviews_html %>%
      html_nodes(xpath = "//span[@class='noQuotes']") %>%
      html_text()

     [1] "I've never visited this restaurant," "Perfect"
     [3] "Memorable experience"                "Tasty"
     [5] "Absolutely spectacular"              "Excellent"
     [7] "Wonderfullll"                        "A Perfect Evening"
     [9] "Dinner "                             "Perfect dinner and evening"
I only get the English ones. And here is the weird thing: if I try to get the number of pages:
    reviews_html %>%
      html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]") %>%
      html_text()

    [1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "176"
I get the number of comment pages corresponding to the "All languages" selection! Compare this with the case without a language selection:
    url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"
    reviews_html <- read_html(url)
    reviews_html %>%
      html_nodes(xpath = "//span[@class='noQuotes']") %>%
      html_text()

     [1] "I've never visited this restaurant," "Perfect"
     [3] "Memorable experience"                "Tasty"
     [5] "Absolutely spectacular"              "Excellent"
     [7] "Wonderfullll"                        "A Perfect Evening"
     [9] "Dinner "                             "Perfect dinner and evening"
I get the same comments, but:
    reviews_html %>%
      html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]") %>%
      html_text()

    [1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "61"
Here I get the number of pages corresponding to the English language selection. I also tried setting the cookies:
    library(httr)

    url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
    httr::GET(url, set_cookies(`TALanguage` = "ALL", `Domain` = ".tripadvisor.com")) %>%
      read_html() %>%
      html_nodes(xpath = "//span[@class='noQuotes']") %>%
      html_text()
But it did not work either. Does anyone understand what is going on, and what I could do to actually get the comments in all languages with rvest?
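The page-count discrepancy described above can be reduced to one number per URL, which makes it easier to check whether a given request was treated as "All languages" or as English only. The following is a minimal sketch in base R, not a fix for the problem: `last_page` is a hypothetical helper name (it is not part of rvest), and the two label vectors are simply copied from the pagination outputs shown earlier.

```r
# Pagination labels copied from the two outputs above.
labels_all     <- c("Next", "1", "2", "3", "4", "5", "6", "176")  # with ?filterLang=ALL
labels_default <- c("Next", "1", "2", "3", "4", "5", "6", "61")   # without the parameter

# last_page() is a hypothetical helper: it keeps only the purely numeric
# labels (dropping "Next") and returns the largest one as an integer.
last_page <- function(labels) {
  nums <- grep("^[0-9]+$", trimws(labels), value = TRUE)
  if (length(nums) == 0) return(NA_integer_)
  max(as.integer(nums))
}

pages_all     <- last_page(labels_all)
pages_default <- last_page(labels_default)

# TRUE when the two requests report different page counts, i.e. the
# pagination reacted to the language filter even though the titles did not.
pagination_differs <- pages_all != pages_default
```

With a live page, the result of the `html_text()` call on the pagination nodes could be passed to `last_page()` directly instead of the hard-coded vectors.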
https://stackoverflow.com/questions/66916363/rvest-language-selection-not-working-in-tripadvisor April 02, 2021 at 04:14PM