
Commit ad61173

feat: update lesson about using proxies
Close #966
1 parent 5256524 commit ad61173

File tree

1 file changed, +61 -44 lines changed

sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md

@@ -11,99 +11,110 @@ slug: /anti-scraping/mitigation/using-proxies

---

-In the [**Web scraping for beginners**](../../scraping_basics_javascript/crawling/pro_scraping.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.
+In the [**Web scraping for beginners**](../../scraping_basics_javascript/index.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.

Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.
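
To get a feel for what that rotation means in practice, here's a minimal sketch using Crawlee's `ProxyConfiguration` on its own, with made-up placeholder proxy URLs:

```js
import { ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs; substitute your own working proxies.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

// Each call picks a proxy from the pool, so consecutive requests
// don't all come from the same IP address.
for (let i = 0; i < 4; i++) {
    console.log(await proxyConfiguration.newUrl());
}
```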

-## Implementing proxies in a scraper {#implementing-proxies}
+## Implementing proxies {#implementing-proxies}

-Let's borrow some scraper code from the end of the [pro-scraping](../../scraping_basics_javascript/crawling/pro_scraping.md) lesson in our **Web Scraping for Beginners** course and paste it into a new file called **proxies.js**. This code enqueues all of the product links on [demo-webstore.apify.org](https://demo-webstore.apify.org)'s on-sale page, then makes a request to each product page and scrapes data about each one:
+Let's build on top of the code which appears at the end of the [Professional scraping](../../scraping_basics_javascript/crawling/pro_scraping.md) lesson of the **Web Scraping for Beginners** course.

-```js
-// crawlee.js
+Let's paste the same code to a new file, `proxies.js`, and make some changes. The code crawls the [Sales](https://warehouse-theme-metal.myshopify.com/collections/sales) page of a sample e-commerce website. It goes through all of the product links, enqueues a request for each product detail page, and scrapes data about all of the products:
+
+```js title=proxies.js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
-        if (request.label === 'START') {
+        console.log(`Fetching URL: ${request.url}`);
+
+        if (request.label === 'start-url') {
            await enqueueLinks({
-                selector: 'a[href*="/product/"]',
+                selector: 'a.product-item__title',
            });
-
-            // When on the START page, we don't want to
-            // extract any data after we extract the links.
            return;
        }

-        // We copied and pasted the extraction code
-        // from the previous lesson
-        const title = $('h3').text().trim();
-        const price = $('h3 + div').text().trim();
-        const description = $('div[class*="Text_body"]').text().trim();
+        const title = $('h1').text().trim();
+        const vendor = $('a.product-meta__vendor').text().trim();
+        const price = $('span.price').contents()[2].nodeValue;
+        const reviewCount = parseInt($('span.rating__caption').text(), 10);
+        const description = $('div[class*="description"] div.rte').text().trim();

-        // Instead of saving the data to a variable,
-        // we immediately save everything to a file.
        await Dataset.pushData({
            title,
-            description,
+            vendor,
            price,
+            reviewCount,
+            description,
        });
    },
});

await crawler.addRequests([{
-    url: 'https://demo-webstore.apify.org/search/on-sale',
-    // By labeling the Request, we can very easily
-    // identify it later in the requestHandler.
-    label: 'START',
+    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+    label: 'start-url',
}]);

await crawler.run();
```

-In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free [proxy scraper](https://apify.com/mstephen190/proxy-scraper) on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and configure it with our custom proxies, like so:
+We'll want all the requests to go through proxies. For that, we obviously need some proxies! To get them, we can use Matthias Stephens' [free proxy scraper](https://apify.com/mstephen190/proxy-scraper). It can find dozens of reliable proxies among the thousands it scrapes.
+
+Once we have a list of proxies, we can add a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and pass it to our crawler.
+
+Proxy pools usually consist of many proxy URLs, but for the sake of simplicity of this lesson we'll list just three. By the time you're reading this, they most probably won't work anymore, so be sure to use your own values.

```js
-import { ProxyConfiguration } from 'crawlee';
+import { CheerioCrawler, Dataset, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394', 'http://51.79.49.178:3128'],
});
-```

-Awesome, so there's our proxy pool! Usually, a proxy pool is much larger than this; however, a three proxies pool is totally fine for tutorial purposes. Finally, we can pass the `proxyConfiguration` into our crawler's options:
-
-```js
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ $, request, enqueueLinks }) => {
-        if (request.label === 'START') {
+        console.log(`Fetching URL: ${request.url}`);
+
+        if (request.label === 'start-url') {
            await enqueueLinks({
-                selector: 'a[href*="/product/"]',
+                selector: 'a.product-item__title',
            });
            return;
        }

-        const title = $('h3').text().trim();
-        const price = $('h3 + div').text().trim();
-        const description = $('div[class*="Text_body"]').text().trim();
+        const title = $('h1').text().trim();
+        const vendor = $('a.product-meta__vendor').text().trim();
+        const price = $('span.price').contents()[2].nodeValue;
+        const reviewCount = parseInt($('span.rating__caption').text(), 10);
+        const description = $('div[class*="description"] div.rte').text().trim();

        await Dataset.pushData({
            title,
-            description,
+            vendor,
            price,
+            reviewCount,
+            description,
        });
    },
});
+
+await crawler.addRequests([{
+    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+    label: 'start-url',
+}]);
+
+await crawler.run();
```

-> Note that if you run this code, it may not work, as the proxies could potentially be down/non-operating at the time you are going through this course.
+The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` array.

-That's it! The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` option.
+## Debugging proxies {#debugging-proxies}

-## A bit about debugging proxies {#debugging-proxies}
+To check that we're scraping through the proxies, we can get `proxyInfo` from the handler's context, which includes useful data about the proxy used to make the request.

-At the time of writing, our above scraper utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request.
+In the code example, we already destructure `$` and `request` out of the context object, so we can simply add `proxyInfo` as something else we want to access in the handler.

```js
const crawler = new CheerioCrawler({
@@ -118,15 +129,21 @@ const crawler = new CheerioCrawler({
});
```
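
Put together, a minimal self-contained sketch of logging `proxyInfo` might look like this; the proxy URLs are the same throwaway examples as above and will most likely be dead by the time you try them:

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Placeholder proxies; replace them with your own working ones.
    proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, proxyInfo }) => {
        // proxyInfo describes the proxy Crawlee picked for this request.
        console.log(`Fetched ${request.url} via`, proxyInfo);
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
```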

-After modifying your code to log `proxyInfo` to the console and running the scraper, you're going to see some logs which look like this:
+After modifying the code to log `proxyInfo` and running the scraper, we can see the proxy details for each request made:
+
+![Sample logs of proxyInfo](./images/proxy-info-logs.png)

-![proxyInfo being logged by the scraper](./images/proxy-info-logs.png)
+These logs confirm that Crawlee uses and automatically rotates the proxies. Such logs can also be useful for debugging slow or broken proxies.

-These logs confirm that our proxies are being used and rotated successfully by Crawlee, and can also be used to debug slow or broken proxies.
+## Carefree proxy scraping {#higher-level-proxy-scraping}

-## Higher level proxy scraping {#higher-level-proxy-scraping}
+If scraping and managing proxies on your own feels tedious, there are services which do that for you. One of them is [Apify Proxy](https://apify.com/proxy), which provides proxies with both residential and datacenter IP addresses. The integration with Crawlee is seamless, but first you need to install the Apify SDK:
+
+```shell
+npm install apify
+```

-Though we will discuss it more in-depth in future courses, it is still important to mention that Crawlee has integrated support for the Apify SDK, which supports [Apify Proxy](https://apify.com/proxy) - a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this:
+Then you can create the `proxyConfiguration` like this:

```js
import { Actor } from 'apify';
@@ -136,7 +153,7 @@ const proxyConfiguration = await Actor.createProxyConfiguration({
});
```
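
A fuller sketch of wiring this into the crawler might look as follows; it assumes you are logged in to Apify (or have the proxy password available in your environment), and `SHADER` is just an example group name, so substitute one available to your account:

```js
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// 'SHADER' is only an example proxy group name.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['SHADER'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(`Fetched ${request.url} via`, proxyInfo);
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);

await Actor.exit();
```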

-Notice that we didn't provide it a list of proxy URLs. This is because the `SHADER` group already serves as our proxy pool (courtesy of Apify Proxy).
+For more information about the integration, refer to the [Apify SDK documentation](https://docs.apify.com/sdk/js/docs/guides/proxy-management).

## Next up {#next}
