In the [**Web scraping for beginners**](../../scraping_basics_javascript/index.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.
Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.
## Implementing proxies {#implementing-proxies}
Let's build on top of the code which appears at the end of the [Professional scraping](../../scraping_basics_javascript/crawling/pro_scraping.md) lesson of the **Web Scraping for Beginners** course.
Let's paste the same code into a new file, `proxies.js`, and make some changes. The code crawls the [Sales](https://warehouse-theme-metal.myshopify.com/collections/sales) page of a sample e-commerce website. It goes through all of the product links, enqueues requests to each page with a product detail, and scrapes data about all of the products:
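
For reference, here's a sketch of that starting point, reconstructed from the linked lesson. The CSS selectors are illustrative and may differ from what the site currently uses:

```js
// proxies.js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        console.log(`Fetching URL: ${request.url}`);

        if (request.label === 'start-url') {
            // On the Sales page, enqueue all links to product details.
            await enqueueLinks({
                selector: 'a.product-item__title', // illustrative selector
            });
            return;
        }

        // On a product detail page, scrape the product data.
        await Dataset.pushData({
            title: $('h1').text().trim(),
            vendor: $('a.product-meta__vendor').text().trim(), // illustrative selector
            url: request.url,
        });
    },
});

await crawler.addRequests([{
    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
    label: 'start-url',
}]);

await crawler.run();
```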
We'll want all the requests to go through proxies. For that, we obviously need proxies! To get some, we can use Matthias Stephens' [free proxy scraper](https://apify.com/mstephen190/proxy-scraper). It can find tens of reliable proxies out of the thousands it scrapes.
Once we have a list of proxies, we can add [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and pass it to our crawler.
Proxy pools usually consist of many proxy URLs, but to keep this lesson simple, we'll list just three. By the time you're reading this text, they most probably won't work anymore, so be sure to use your own values.
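A minimal sketch of that setup follows. The proxy URLs below are placeholders (reserved documentation addresses), so substitute the ones you've collected:

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs, replace them with your own.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://203.0.113.10:8080',
        'http://198.51.100.20:3128',
        'http://192.0.2.30:80',
    ],
});

const crawler = new CheerioCrawler({
    // Pass the proxy configuration to the crawler.
    proxyConfiguration,
    requestHandler: async ({ $, request, enqueueLinks }) => {
        // ... the same handler as before ...
    },
});
```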
The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` array.
## Debugging proxies {#debugging-proxies}
To check that we're scraping through the proxies, we can get `proxyInfo` from the handler's context, which includes useful data about the proxy used to make the request.
In the code example, we already destructure the context object to `$` and `request`, so we can just add `proxyInfo` as something we want to access in the handler, too.
```js
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ $, request, proxyInfo, enqueueLinks }) => {
        // Log details about the proxy used for this request.
        console.log(proxyInfo);

        // ... the rest of the handler stays the same ...
    },
});
```
After modifying the code to log `proxyInfo` and running the scraper, we can see proxy details for each request made:

These logs confirm that Crawlee uses and automatically rotates the proxies. Such logs can also be useful for debugging slow or broken proxies.
If scraping and managing proxies on your own feels tedious, there are services which do that for you. One of them is [Apify Proxy](https://apify.com/proxy), which provides proxies with both residential and datacenter IP addresses. The integration with Crawlee is seamless, but first you need the Apify SDK:
```shell
npm install apify
```
Then you can create the `proxyConfiguration` like this:
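
A sketch of what that might look like, assuming you're signed in to Apify (for example via `apify login` in the Apify CLI) so the SDK can authenticate against Apify Proxy:

```js
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Let the Apify SDK build a configuration backed by Apify Proxy
// instead of listing proxy URLs manually.
const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ $, request, proxyInfo }) => {
        // proxyInfo works the same way as with a custom proxy pool.
        console.log(proxyInfo);
    },
});

await Actor.exit();
```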