Commit 3f87295

feat: change naming of JS variables, update crawling to be about JS
1 parent b689317 commit 3f87295

11 files changed: +238 -246 lines changed

sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md

Lines changed: 26 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -80,15 +80,15 @@ if (response.ok) {
8080
const $ = cheerio.load(html);
8181

8282
$(".product-item").each((i, element) => {
83-
const productItem = $(element);
83+
const $productItem = $(element);
8484

85-
const title = productItem.find(".product-item__title");
86-
const titleText = title.text();
85+
const $title = $productItem.find(".product-item__title");
86+
const title = $title.text();
8787

88-
const price = productItem.find(".price");
89-
const priceText = price.text();
88+
const $price = $productItem.find(".price");
89+
const price = $price.text();
9090

91-
console.log(`${titleText} | ${priceText}`);
91+
console.log(`${title} | ${price}`);
9292
});
9393
} else {
9494
throw new Error(`HTTP ${response.status}`);
@@ -170,16 +170,16 @@ if (response.ok) {
170170
const $ = cheerio.load(html);
171171

172172
$(".product-item").each((i, element) => {
173-
const productItem = $(element);
173+
const $productItem = $(element);
174174

175-
const title = productItem.find(".product-item__title");
176-
const titleText = title.text();
175+
const $title = $productItem.find(".product-item__title");
176+
const title = $title.text();
177177

178178
// highlight-next-line
179-
const price = productItem.find(".price").contents().last();
180-
const priceText = price.text();
179+
const $price = $productItem.find(".price").contents().last();
180+
const price = $price.text();
181181

182-
console.log(`${titleText} | ${priceText}`);
182+
console.log(`${title} | ${price}`);
183183
});
184184
} else {
185185
throw new Error(`HTTP ${response.status}`);
@@ -243,18 +243,17 @@ Djibouti
243243
const $ = cheerio.load(html);
244244

245245
$(".wikitable").each((i, tableElement) => {
246-
const table = $(tableElement);
247-
const rows = table.find("tr");
248-
249-
rows.each((j, rowElement) => {
250-
const row = $(rowElement);
251-
const cells = row.find("td");
252-
253-
if (cells.length > 0) {
254-
const thirdColumn = $(cells[2]);
255-
const link = thirdColumn.find("a").first();
256-
const linkText = link.text();
257-
console.log(linkText);
246+
const $table = $(tableElement);
247+
const $rows = $table.find("tr");
248+
249+
$rows.each((j, rowElement) => {
250+
const $row = $(rowElement);
251+
const $cells = $row.find("td");
252+
253+
if ($cells.length > 0) {
254+
const $thirdColumn = $($cells[2]);
255+
const $link = $thirdColumn.find("a").first();
256+
console.log($link.text());
258257
}
259258
});
260259
});
@@ -289,10 +288,9 @@ Simplify the code from previous exercise. Use a single for loop and a single CSS
289288
const $ = cheerio.load(html);
290289

291290
$(".wikitable tr td:nth-child(3)").each((i, element) => {
292-
const nameCell = $(element);
293-
const link = nameCell.find("a").first();
294-
const linkText = link.text();
295-
console.log(linkText);
291+
const $nameCell = $(element);
292+
const $link = $nameCell.find("a").first();
293+
console.log($link.text());
296294
});
297295
} else {
298296
throw new Error(`HTTP ${response.status}`);

sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md

Lines changed: 33 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -36,14 +36,14 @@ It's because some products have variants with different prices. Later in the cou
3636
Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix?
3737

3838
```js
39-
const priceText = price.text().replace("From ", "");
39+
const priceText = $price.text().replace("From ", "");
4040
```
4141

4242
In other cases, they'd tell us the data must include the range. And in cases when we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. If we don't know the exact price, we leave it empty:
4343

4444
```js
4545
const priceRange = { minPrice: null, price: null };
46-
const priceText = price.text()
46+
const priceText = $price.text()
4747
if (priceText.startsWith("From ")) {
4848
priceRange.minPrice = priceText.replace("From ", "");
4949
} else {
@@ -71,22 +71,22 @@ if (response.ok) {
7171
const $ = cheerio.load(html);
7272

7373
$(".product-item").each((i, element) => {
74-
const productItem = $(element);
74+
const $productItem = $(element);
7575

76-
const title = productItem.find(".product-item__title");
77-
const titleText = title.text();
76+
const $title = $productItem.find(".product-item__title");
77+
const title = $title.text();
7878

79-
const price = productItem.find(".price").contents().last();
79+
const $price = $productItem.find(".price").contents().last();
8080
const priceRange = { minPrice: null, price: null };
81-
const priceText = price.text();
81+
const priceText = $price.text();
8282
if (priceText.startsWith("From ")) {
8383
priceRange.minPrice = priceText.replace("From ", "");
8484
} else {
8585
priceRange.minPrice = priceText;
8686
priceRange.price = priceRange.minPrice;
8787
}
8888

89-
console.log(`${titleText} | ${priceRange.minPrice} | ${priceRange.price}`);
89+
console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
9090
});
9191
} else {
9292
throw new Error(`HTTP ${response.status}`);
@@ -100,9 +100,9 @@ Often, the strings we extract from a web page start or end with some amount of w
100100
We call the operation of removing whitespace _trimming_ or _stripping_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add JavaScript's built-in [.trim()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim):
101101

102102
```js
103-
const titleText = title.text().trim();
103+
const title = $title.text().trim();
104104

105-
const priceText = price.text().trim();
105+
const priceText = $price.text().trim();
106106
```
107107

108108
## Removing dollar sign and commas
@@ -124,7 +124,7 @@ The demonstration above is inside the Node.js' [interactive REPL](https://nodejs
124124
We need to remove the dollar sign and the commas used as thousands separators. For this type of cleaning, [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) are often the best tool for the job, but in this case [`.replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) is also sufficient:
125125

126126
```js
127-
const priceText = price
127+
const priceText = $price
128128
.text()
129129
.trim()
130130
.replace("$", "")
@@ -137,7 +137,7 @@ Now we should be able to add `parseFloat()`, so that we have the prices not as a
137137

138138
```js
139139
const priceRange = { minPrice: null, price: null };
140-
const priceText = price.text()
140+
const priceText = $price.text()
141141
if (priceText.startsWith("From ")) {
142142
priceRange.minPrice = parseFloat(priceText.replace("From ", ""));
143143
} else {
@@ -156,7 +156,7 @@ Great! Only if we didn't overlook an important pitfall called [floating-point er
156156
These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. We won't store dollars, but cents:
157157

158158
```js
159-
const priceText = price
159+
const priceText = $price
160160
.text()
161161
.trim()
162162
.replace("$", "")
@@ -178,14 +178,14 @@ if (response.ok) {
178178
const $ = cheerio.load(html);
179179

180180
$(".product-item").each((i, element) => {
181-
const productItem = $(element);
181+
const $productItem = $(element);
182182

183-
const title = productItem.find(".product-item__title");
184-
const titleText = title.text().trim();
183+
const $title = $productItem.find(".product-item__title");
184+
const title = $title.text().trim();
185185

186-
const price = productItem.find(".price").contents().last();
186+
const $price = $productItem.find(".price").contents().last();
187187
const priceRange = { minPrice: null, price: null };
188-
const priceText = price
188+
const priceText = $price
189189
.text()
190190
.trim()
191191
.replace("$", "")
@@ -199,7 +199,7 @@ if (response.ok) {
199199
priceRange.price = priceRange.minPrice;
200200
}
201201

202-
console.log(`${titleText} | ${priceRange.minPrice} | ${priceRange.price}`);
202+
console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
203203
});
204204
} else {
205205
throw new Error(`HTTP ${response.status}`);
@@ -249,12 +249,12 @@ Denon AH-C720 In-Ear Headphones | 236
249249
const $ = cheerio.load(html);
250250

251251
$(".product-item").each((i, element) => {
252-
const productItem = $(element);
252+
const $productItem = $(element);
253253

254-
const title = productItem.find(".product-item__title");
255-
const titleText = title.text().trim();
254+
const title = $productItem.find(".product-item__title");
255+
const title = $title.text().trim();
256256

257-
const unitsText = productItem
257+
const unitsText = $productItem
258258
.find(".product-item__inventory")
259259
.text()
260260
.replace("In stock,", "")
@@ -265,7 +265,7 @@ Denon AH-C720 In-Ear Headphones | 236
265265
const unitsCount = unitsText === "Sold out" ? 0
266266
: parseInt(unitsText);
267267

268-
console.log(`${titleText} | ${unitsCount}`);
268+
console.log(`${title} | ${unitsCount}`);
269269
});
270270
} else {
271271
throw new Error(`HTTP ${response.status}`);
@@ -298,19 +298,19 @@ Simplify the code from previous exercise. Use [regular expressions](https://deve
298298
const $ = cheerio.load(html);
299299

300300
$(".product-item").each((i, element) => {
301-
const productItem = $(element);
301+
const $productItem = $(element);
302302

303-
const title = productItem.find(".product-item__title");
304-
const titleText = title.text().trim();
303+
const $title = $productItem.find(".product-item__title");
304+
const title = $title.text().trim();
305305

306-
const unitsText = productItem
306+
const unitsText = $productItem
307307
.find(".product-item__inventory")
308308
.text()
309309
.trim();
310310
const unitsCount = unitsText === "Sold out" ? 0
311311
: parseInt(unitsText.match(/\d+/));
312312

313-
console.log(`${titleText} | ${unitsCount}`);
313+
console.log(`${title} | ${unitsCount}`);
314314
});
315315
} else {
316316
throw new Error(`HTTP ${response.status}`);
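
The count parsing in the hunk above passes the result of `.match()` straight to `parseInt()`, which works because the match array is coerced to a string, but it yields `NaN` when the text has no digits. A minimal sketch of a more explicit variant follows. The helper name `parseUnitsCount` and the sample strings are assumptions for illustration and are not part of the commit:

```js
// Hypothetical helper (not in the lesson): turns inventory text into a unit count.
function parseUnitsCount(rawText) {
  const text = rawText.trim();
  if (text === "Sold out") {
    return 0;
  }
  // .match() returns null when the text contains no digits at all
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : null;
}

// Sample strings are assumed, modeled on the "In stock," and "Sold out" texts above
console.log(parseUnitsCount("In stock, 76 left")); // 76
console.log(parseUnitsCount("Sold out")); // 0
console.log(parseUnitsCount("In stock")); // null
```

Returning `null` for text without digits is one possible design choice; it keeps "unknown" distinct from a real count of zero.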
@@ -364,19 +364,19 @@ Hints:
364364
const $ = cheerio.load(html);
365365

366366
$("#maincontent ul li").each((i, element) => {
367-
const article = $(element);
367+
const $article = $(element);
368368

369-
const titleText = article
369+
const title = $article
370370
.find("h3")
371371
.text()
372372
.trim();
373-
const dateText = article
373+
const dateText = $article
374374
.find("time")
375375
.attr("datetime")
376376
.trim();
377377
const date = new Date(dateText);
378378

379-
console.log(`${titleText} | ${date.toDateString()}`);
379+
console.log(`${title} | ${date.toDateString()}`);
380380
});
381381
} else {
382382
throw new Error(`HTTP ${response.status}`);
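
The price cleaning steps changed in this file (trimming, stripping the `From ` prefix, the dollar sign and the commas, then storing cents instead of dollars) can be summarized in one standalone sketch. The helper name `parsePriceRange` is hypothetical, and the code assumes every price is written with exactly two decimal places; it uses a regular expression rather than the chained `.replace()` calls from the lesson:

```js
// Hypothetical helper (not in the lesson): converts text such as
// "From $1,398.00" into a price range expressed in cents.
function parsePriceRange(rawText) {
  const priceRange = { minPrice: null, price: null };
  const text = rawText.trim();
  const isMinimum = text.startsWith("From ");

  // Keep digits only: "$1,398.00" becomes "139800", which is the amount in cents.
  // Assumption: the price always has exactly two decimal places.
  const cents = parseInt(text.replace(/\D/g, ""), 10);

  priceRange.minPrice = cents;
  if (!isMinimum) {
    priceRange.price = cents;
  }
  return priceRange;
}

// Why cents: floating-point arithmetic on dollars is not exact
console.log(0.1 + 0.2); // 0.30000000000000004

console.log(parsePriceRange("  $74.95 ")); // { minPrice: 7495, price: 7495 }
console.log(parsePriceRange("From $1,398.00")); // { minPrice: 139800, price: null }
```

Working with integers keeps sums and comparisons exact; the values can be divided by 100 only when they are displayed.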

sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md

Lines changed: 28 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ We should use widely popular formats that have well-defined solutions for all th
2525

2626
## Collecting data
2727

28-
Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable. This'll take three changes to our program:
28+
Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable. This'll take four changes to our program:
2929

3030
```js
3131
import * as cheerio from 'cheerio';
@@ -38,16 +38,15 @@ if (response.ok) {
3838
const $ = cheerio.load(html);
3939

4040
// highlight-next-line
41-
const data = [];
42-
$(".product-item").each((i, element) => {
43-
const productItem = $(element);
41+
const $items = $(".product-item").map((i, element) => {
42+
const $productItem = $(element);
4443

45-
const title = productItem.find(".product-item__title");
46-
const titleText = title.text().trim();
44+
const $title = $productItem.find(".product-item__title");
45+
const title = $title.text().trim();
4746

48-
const price = productItem.find(".price").contents().last();
47+
const $price = $productItem.find(".price").contents().last();
4948
const priceRange = { minPrice: null, price: null };
50-
const priceText = price
49+
const priceText = $price
5150
.text()
5251
.trim()
5352
.replace("$", "")
@@ -62,17 +61,34 @@ if (response.ok) {
6261
}
6362

6463
// highlight-next-line
65-
data.push({ title: titleText, ...priceRange });
64+
return { title, ...priceRange };
6665
});
67-
66+
// highlight-next-line
67+
const data = $items.get();
6868
// highlight-next-line
6969
console.log(data);
7070
} else {
7171
throw new Error(`HTTP ${response.status}`);
7272
}
7373
```
7474

75-
Before looping over the products, we prepare an empty array. Then, instead of printing each line, we append the data of each product to the array in the form of a JavaScript object. At the end of the program, we print the entire array at once.
75+
Instead of printing each line, we now return the data for each product as a JavaScript object. We've replaced `.each()` with [`.map()`](https://cheerio.js.org/docs/api/classes/Cheerio#map-3), which also iterates over the selection but, in addition, collects all the results and returns them as a Cheerio collection. We then convert it into a standard JavaScript array by calling [`.get()`](https://cheerio.js.org/docs/api/classes/Cheerio#call-signature-32). Near the end of the program, we print the entire array.
76+
77+
:::tip Advanced syntax
78+
79+
When returning the item object, we use [shorthand property syntax](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Object_initializer#property_definitions) to set the title, and [spread syntax](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax) to set the prices. It's the same as if we wrote the following:
80+
81+
```js
82+
{
83+
title: title,
84+
minPrice: priceRange.minPrice,
85+
price: priceRange.price,
86+
}
87+
```
88+
89+
:::
90+
91+
The program should now print the results as a single large JavaScript array:
7692

7793
```text
7894
$ node index.js
@@ -91,20 +107,6 @@ $ node index.js
91107
]
92108
```
93109

94-
:::tip Spread syntax
95-
96-
The three dots in `{ title: titleText, ...priceRange }` are called [spread syntax](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax). It's the same as if we wrote the following:
97-
98-
```js
99-
{
100-
title: titleText,
101-
minPrice: priceRange.minPrice,
102-
price: priceRange.price,
103-
}
104-
```
105-
106-
:::
107-
108110
## Saving data as JSON
109111

110112
The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of JavaScript objects, but people now use it across programming languages.
@@ -202,7 +204,7 @@ In this lesson, we created export files in two formats. The following challenges
202204

203205
### Process your JSON
204206

205-
Write a new Node.js program that reads `products.json`, finds all products with a min price greater than $500, and prints each of them.
207+
Write a new Node.js program that reads the `products.json` file we created in the lesson, finds all products with a min price greater than $500, and prints each of them.
206208

207209
<details>
208210
<summary>Solution</summary>
