Saturday, April 3, 2021

rvest: language selection not working in tripadvisor

I am facing a web scraping problem. I want to scrape a few comments on TripAdvisor with rvest, and I want the comments in all languages. From this question I understood that one way is to append ?filterLang=ALL to the URL. In a web browser, this works. Example:

https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL

This URL does show comments with "All languages" selected (you can see a lot of French comments). Here is my problem: when I try to get the comment titles:

library(rvest)

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

 [1] "I've never visited this restaurant," "Perfect"
 [3] "Memorable experience"                "Tasty"
 [5] "Absolutely spectacular"              "Excellent"
 [7] "Wonderfullll"                        "A Perfect Evening"
 [9] "Dinner "                             "Perfect dinner and evening"
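To make clear what that XPath selects, here is a minimal sketch run on a small hand-written HTML snippet instead of the live page. The snippet and the `otherClass` span are assumptions made up for illustration; only the XPath expression comes from the code above. It shows that the expression returns whatever `noQuotes` spans are present in the HTML that `read_html()` receives, nothing more.

```r
library(rvest)

# Hand-written stand-in for a review page (illustrative only, not TripAdvisor markup)
snippet <- paste0(
  "<html><body>",
  "<span class='noQuotes'>Perfect</span>",
  "<span class='otherClass'>Not a title</span>",
  "<span class='noQuotes'>Tasty</span>",
  "</body></html>"
)

titles <- read_html(snippet) %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

titles
```

Only the two spans whose class is exactly `noQuotes` are returned, in document order.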

I only get the English ones. And here is the weird thing: if I try to get the number of pages:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]") %>%
  html_text()

[1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "176"

I get the number of comment pages corresponding to the "All languages" selection! Compare this with the case without a language selection:

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

 [1] "I've never visited this restaurant," "Perfect"
 [3] "Memorable experience"                "Tasty"
 [5] "Absolutely spectacular"              "Excellent"
 [7] "Wonderfullll"                        "A Perfect Evening"
 [9] "Dinner "                             "Perfect dinner and evening"

I get the same comments, but:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]") %>%
  html_text()

[1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "61"
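To compare the two runs more directly, the last page number can be pulled out of the pagination labels. This is a minimal base R sketch; `last_page` is a hypothetical helper name, and the two vectors are simply the outputs printed above, copied by hand.

```r
# Hypothetical helper: largest numeric label among the pagination links
last_page <- function(labels) {
  nums <- labels[grepl("^[0-9]+$", labels)]  # drop "Next" and any other non-numeric label
  if (length(nums) == 0) return(NA_integer_)
  max(as.integer(nums))
}

# Pagination labels copied from the two outputs above
pages_all     <- c("Next", "1", "2", "3", "4", "5", "6", "176")  # with ?filterLang=ALL
pages_default <- c("Next", "1", "2", "3", "4", "5", "6", "61")   # without the parameter

last_page(pages_all)      # 176
last_page(pages_default)  # 61
```

So the pagination reacts to the parameter (176 pages versus 61), while the titles returned by the first XPath are the same in both cases.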

I get the number of pages corresponding to the English language selection. I also tried setting cookies:

library(httr)

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"

httr::GET(url,
          set_cookies(`TALanguage` = "ALL",
                      `Domain` = ".tripadvisor.com")) %>%
  read_html() %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

But that did not work either. Does anyone understand what is going on, and what I could do to actually get the comments in all languages with rvest?

https://stackoverflow.com/questions/66916363/rvest-language-selection-not-working-in-tripadvisor April 02, 2021 at 04:14PM
