akzainda11 Posted January 12 (edited)

I'm simply trying to make an HTTP request to this product on the Pokemon Center website. If you clear your cookies and open the URL in a browser, the page eventually loads (200 status), but the first request, sent without any cookies, only gets a `Set-Cookie` response (no HTML body). The webpage then reloads with the received cookies, and that GET request succeeds and returns the actual HTML.

I'm having trouble replicating this in Golang: my first request responds with 200 and the `Set-Cookie` headers, and the second request should carry the cookies, but it gets the same response instead of the HTML. There are a bunch of requests that take place between the first product page GET request and the second one. You can see them by clearing your cookies and then loading that URL with the browser's Network dev tool open. It seems like an Incapsula block. How can I get past this?

The images attached are for the first GET request with NO cookies, and the second is for the GET request WITH cookies.
    func FetchPokemonURL(client *http.Client, productUrl string, firstTry bool) (string, bool, string) {
        req, err := http.NewRequest("GET", productUrl, nil)
        if err != nil {
            return "", false, "Failed to create request: " + err.Error()
        }

        req.Header.Set("Host", "www.pokemoncenter.com")
        req.Header.Set("Sec-Fetch-Dest", "document")
        req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.0.1 Safari/605.1.15")
        req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        if firstTry {
            req.Header.Set("Sec-Fetch-Site", "none")
        } else {
            req.Header.Set("Sec-Fetch-Site", "same-origin")
        }
        req.Header.Set("Sec-Fetch-Mode", "navigate")
        if !firstTry {
            req.Header.Set("Referer", productUrl)
            req.Header.Set("Cache-Control", "max-age=0")
        }
        req.Header.Set("Accept-Language", "en-US,en;q=0.9")
        req.Header.Set("Priority", "u=0, i")
        req.Header.Set("Accept-Encoding", "gzip, deflate, br")
        req.Header.Set("Connection", "keep-alive")

        resp, err := client.Do(req)
        if err != nil {
            return "", false, fmt.Sprintf("Error making request: %v\n", err)
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            return "", false, fmt.Sprintf("Error reading response body: %v\n", err)
        }
        strBody := string(body)

        if firstTry {
            return FetchPokemonURL(client, productUrl, false)
        } else {
            return strBody, false, ""
        }
    }

Edited January 12 by akzainda11
jackyjask Posted January 12

I tried your link for the very first time and hit CAPTCHA protection! How are you going to overcome it?
Progman Posted March 12

It looks like you're dealing with a common issue when scraping websites that use anti-bot measures like Incapsula. These measures often require you to handle cookies, headers, and sometimes even JavaScript challenges before you can access the content. Here’s a step-by-step approach to handle this in Go:

1. **Initial Request to Get Cookies**: The first request will return a `Set-Cookie` header. You need to capture these cookies and use them in subsequent requests.
2. **Follow-Up Request with Cookies**: Use the cookies from the first response to make a second request. If the block depends only on those cookies, this should return the actual HTML content.
3. **Handle Headers and User-Agent**: Ensure that your headers, especially the `User-Agent`, are set correctly to mimic a real browser.

Here’s an updated version of your function to handle this:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"time"
)

// FetchPokemonURL performs one browser-like GET and returns the response body.
// Cookies are stored and replayed by the client's cookie jar.
func FetchPokemonURL(client *http.Client, productUrl string) (string, error) {
	req, err := http.NewRequest("GET", productUrl, nil)
	if err != nil {
		return "", fmt.Errorf("failed to create request: %w", err)
	}

	// No explicit "Host" header: net/http ignores it in req.Header and
	// derives Host from the URL (or from req.Host if set).
	req.Header.Set("Sec-Fetch-Dest", "document")
	req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.0.1 Safari/605.1.15")
	req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
	req.Header.Set("Sec-Fetch-Site", "none")
	req.Header.Set("Sec-Fetch-Mode", "navigate")
	req.Header.Set("Accept-Language", "en-US,en;q=0.9")
	req.Header.Set("Priority", "u=0, i")
	// No explicit "Accept-Encoding": setting it by hand turns off Go's
	// transparent gzip decoding, so the body would come back compressed.
	req.Header.Set("Connection", "keep-alive")

	resp, err := client.Do(req)
	if err != nil {
		return "", fmt.Errorf("error making request: %w", err)
	}
	defer resp.Body.Close()

	// Read the response body
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("error reading response body: %w", err)
	}

	return string(body), nil
}

func main() {
	// Create a cookie jar to handle cookies
	jar, err := cookiejar.New(nil)
	if err != nil {
		fmt.Printf("Error creating cookie jar: %v\n", err)
		return
	}

	client := &http.Client{
		Jar:     jar,
		Timeout: 10 * time.Second,
	}

	productUrl := "https://www.pokemoncenter.com/product/your-product-id"

	// First request to get cookies
	_, err = FetchPokemonURL(client, productUrl)
	if err != nil {
		fmt.Printf("Error on first request: %v\n", err)
		return
	}

	// Second request with cookies
	html, err := FetchPokemonURL(client, productUrl)
	if err != nil {
		fmt.Printf("Error on second request: %v\n", err)
		return
	}

	fmt.Println(html)
}
```

### Explanation:

1. **Cookie Jar**: The `cookiejar` automatically handles cookies for you. It stores cookies from the first response and sends them in subsequent requests made with the same client.
2. **Headers**: The headers are set to mimic a real browser request. `Host` and `Accept-Encoding` are left to `net/http`, which sets them itself and decompresses gzip responses transparently.
3. **Two Requests**: The first request is made to get the cookies, and the second request uses those cookies to get the actual HTML content.

### Additional Tips:

- **JavaScript Challenges**: If the site uses JavaScript challenges, you might need to use a headless browser like Puppeteer or Selenium.
- **Rate Limiting**: Be mindful of rate limiting. Make sure to add delays between requests if necessary.
- **Error Handling**: Add more robust error handling to manage different types of errors that might occur.

This covers the cookie-handling part of the problem. If the second request still returns the same empty response, the cookies that matter are most likely set by the intermediate, JavaScript-driven requests you saw in the Network tab, and a plain HTTP client will not be enough on its own.
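To check the cookie-jar mechanics in isolation, here is a minimal sketch that runs the same two-request flow against a local test server instead of the real site. The server, the `session` cookie name, and the helper names (`newChallengeServer`, `get`, `fetchTwice`) are all hypothetical stand-ins invented for this illustration; they only imitate the "empty body plus `Set-Cookie`, then HTML" behaviour described in the question and say nothing about how Incapsula actually works.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// newChallengeServer returns a hypothetical local stand-in for the site:
// without the "session" cookie it replies 200 with Set-Cookie and an empty
// body; with the cookie it replies with the HTML.
func newChallengeServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("session"); err != nil {
			http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc", Path: "/"})
			return // 200 OK with an empty body
		}
		fmt.Fprint(w, "<html>product</html>")
	}))
}

// get performs a GET with the given client and returns the body as a string.
func get(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// fetchTwice requests url twice with one jar-backed client, so cookies set
// by the first response are sent automatically with the second request.
func fetchTwice(url string) (first, second string, err error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return "", "", err
	}
	client := &http.Client{Jar: jar}
	if first, err = get(client, url); err != nil {
		return "", "", err
	}
	second, err = get(client, url)
	return first, second, err
}

func main() {
	srv := newChallengeServer()
	defer srv.Close()

	first, second, err := fetchTwice(srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("first body: %q\nsecond body: %q\n", first, second)
}
```

Against this stand-in server the first body is empty and the second contains the HTML, which shows the jar is replaying the cookie. If the same client still gets an empty second response from the real site, cookie handling in Go is probably not the missing piece.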