title | description | sidebar_position | slug |
---|---|---|---|
Extracting data | Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio. | 2 | /puppeteer-playwright/executing-scripts/collecting-data |
Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio.
Now that we know how to execute scripts on a page, we're ready to learn a bit about data extraction. In this lesson, we'll be scraping all the on-sale products from warehouse-theme-metal.myshopify.com, a sample Shopify website.
Most web data extraction cases involve looping through a list of items of some sort.
Playwright & Puppeteer offer two main methods for data extraction:
- Directly in page.evaluate() and other evaluate functions such as page.$$eval().
- In the Node.js context using a parsing library such as Cheerio.
Here is the base setup for our code, which we'll build upon throughout this lesson:
// Playwright
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');
// code will go here
await page.waitForTimeout(10000);
await browser.close();
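// Puppeteer
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');
// code will go here
// page.waitForTimeout() is no longer available in recent Puppeteer versions, so pause with a plain timer instead
await new Promise((resolve) => setTimeout(resolve, 10000));
await browser.close();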
Whatever is returned from the callback function in page.evaluate() will be returned by the evaluate function, which means that we can set it to a variable like so:
const products = await page.evaluate(() => ({ foo: 'bar' }));
console.log(products); // -> { foo: 'bar' }
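The value returned from the callback has to travel from the browser to Node.js, so only serializable data makes the trip; functions and similar non-data values do not. The sketch below is only a rough stand-in for that hop. It assumes a JSON round trip is a close enough approximation (the real serialization differs between the two libraries), and crossBoundary is a hypothetical name, not part of either API.

```javascript
// Hypothetical stand-in for the browser -> Node.js hop.
// Assumption: a JSON round trip approximates what survives serialization.
const crossBoundary = (value) => JSON.parse(JSON.stringify(value));

// Plain data comes back unchanged.
const plain = crossBoundary({ name: 'JBL Flip 4', price: '$74.95' });

// A function-valued property is silently dropped along the way.
const lossy = crossBoundary({ name: 'JBL Flip 4', format: () => 'not data' });

console.log(plain); // -> { name: 'JBL Flip 4', price: '$74.95' }
console.log(lossy); // -> { name: 'JBL Flip 4' }
```

This is why the extraction code in this lesson returns plain objects made of strings rather than the DOM elements themselves.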
We'll be returning a bunch of product objects from this function, which will be accessible back in our Node.js context after the promise has resolved. Let's now go ahead and write some data extraction code to collect each product:
const products = await page.evaluate(() => {
    const productCards = Array.from(document.querySelectorAll('.product-item'));
    return productCards.map((element) => {
        const name = element.querySelector('.product-item__title').textContent;
        const price = element.querySelector('.price').lastChild.textContent;
        return { name, price };
    });
});
console.log(products);
When we run this code, we see this logged to our console:
$ node index.js
[
    {
        name: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker',
        price: '$74.95'
    },
    {
        name: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV',
        price: 'From $1,398.00'
    },
    ...
]
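The prices above come back as display strings, which are awkward to sort or compare. As a minimal sketch, they could be normalized in Node.js after extraction. The parsePrice helper is hypothetical, and reading the "From" prefix as a starting price is an assumption based only on the sample output above.

```javascript
// Hypothetical helper: turns a scraped price string into structured data.
function parsePrice(rawPrice) {
    const text = rawPrice.trim();
    // Assumption: a leading "From" marks a starting price rather than a fixed one.
    const isStartingPrice = /^from\b/i.test(text);
    // Take the first number, allowing thousands separators and optional decimals.
    const match = text.match(/\d[\d,]*(?:\.\d+)?/);
    const amount = match ? Number(match[0].replace(/,/g, '')) : null;
    return { amount, isStartingPrice };
}

// Sample data copied from the output above.
const products = [
    { name: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', price: '$74.95' },
    { name: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', price: 'From $1,398.00' },
];

// Keep the original fields and add the parsed ones next to them.
const normalized = products.map((product) => ({ ...product, ...parsePrice(product.price) }));
console.log(normalized);
```

Keeping the raw price string alongside the parsed amount makes it easier to spot cases where the pattern did not match as expected.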
Working with document.querySelector is cumbersome and quite verbose, but with the page.addScriptTag() function and a jQuery CDN link, we can inject jQuery into the current page to gain access to its syntactical sweetness:
await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' });
This function will literally append a <script> tag to the <head> element of the current page, allowing access to jQuery's API when using page.evaluate() to run code in the browser context.
Now, since we're able to use jQuery, let's translate our vanilla JavaScript code within the page.evaluate() function to jQuery:
await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' });
const products = await page.evaluate(() => {
    const productCards = $('.product-item');
    return productCards.map(function () {
        const card = $(this);
        const name = card.find('.product-item__title').text();
        const price = card.find('.price').contents().last().text();
        return { name, price };
    }).get();
});
console.log(products);
This will output exactly the same result as the code in the previous section.
One of the most popular parsing libraries for Node.js is Cheerio, which can be used in tandem with Playwright and Puppeteer. It is extremely beneficial to parse the page's HTML in the Node.js context for a number of reasons:
- You can easily port the code between headless browser data extraction and plain HTTP data extraction
- You don't have to worry about which context you're working in (which can sometimes be confusing)
- Errors are easier to handle when running in the base Node.js context
To install Cheerio, run the following command within your project's directory:
npm install cheerio
Then, we'll import the load function like so:
import { load } from 'cheerio';
Finally, we can create a Cheerio object based on our page's current content like so:
const $ = load(await page.content());
It's important to note that this $ object is static. If any content on the page changes, the $ variable will not automatically be updated. It will need to be re-declared or re-defined.
Here's our full code so far:
// Playwright
import { chromium } from 'playwright';
import { load } from 'cheerio';
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');
const $ = load(await page.content());
// code will go here
await browser.close();
// Puppeteer
import puppeteer from 'puppeteer';
import { load } from 'cheerio';
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');
const $ = load(await page.content());
// code will go here
await browser.close();
Now, to loop through all of the products, we'll use the $ object while staying safely in the server-side context rather than running the code in the browser. Notice that this code is nearly identical to the jQuery code above. The only difference is that it does not run inside of a page.evaluate() call in the browser context.
const $ = load(await page.content());
const productCards = $('.product-item');
const products = productCards.map(function () {
    const card = $(this);
    const name = card.find('.product-item__title').text();
    const price = card.find('.price').contents().last().text();
    return { name, price };
}).get();
console.log(products);
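Since this logic depends only on the $ object, one possible refactor is to move it into a function that receives $ as a parameter. The sketch below does that. The name extractProducts is hypothetical, and the added trim() calls are an assumption about wanting whitespace-free values, not something the code above does.

```javascript
// Hypothetical helper: accepts any Cheerio-style `$` function and
// returns an array of plain product objects.
function extractProducts($) {
    return $('.product-item')
        .map(function () {
            const card = $(this);
            // trim() is an addition here; it removes surrounding whitespace.
            const name = card.find('.product-item__title').text().trim();
            const price = card.find('.price').contents().last().text().trim();
            return { name, price };
        })
        .get();
}

// Usage with this lesson's setup (not executed here):
//   const products = extractProducts(load(await page.content()));
// The same call would work for HTML downloaded without a browser:
//   const products = extractProducts(load(html));
```

Written this way, the extraction code does not need to know whether the HTML came from a headless browser or from a plain HTTP response.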
Here's what our final optimized code looks like:
// Playwright
import { chromium } from 'playwright';
import { load } from 'cheerio';
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');
const $ = load(await page.content());
const productCards = $('.product-item');
const products = productCards.map(function () {
    const card = $(this);
    const name = card.find('.product-item__title').text();
    const price = card.find('.price').contents().last().text();
    return { name, price };
}).get();
console.log(products);
await browser.close();
// Puppeteer
import puppeteer from 'puppeteer';
import { load } from 'cheerio';
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');
const $ = load(await page.content());
const productCards = $('.product-item');
const products = productCards.map(function () {
    const card = $(this);
    const name = card.find('.product-item__title').text();
    const price = card.find('.price').contents().last().text();
    return { name, price };
}).get();
console.log(products);
await browser.close();
Our next lesson will be discussing something super cool - request interception and reading data from requests and responses. It's just like using DevTools, except programmatically!