From 7a7bdb35c0e8497f9535a017563787e9dd71d533 Mon Sep 17 00:00:00 2001 From: Honza Javorek Date: Wed, 24 Apr 2024 17:55:02 +0200 Subject: [PATCH 01/15] fix: typos and stylistic improvements --- sources/academy/webscraping/anti_scraping/index.md | 8 ++++---- .../anti_scraping/techniques/fingerprinting.md | 2 +- .../webscraping/anti_scraping/techniques/rate_limiting.md | 2 +- .../general_api_scraping/handling_pagination.md | 2 +- .../executing_scripts/extracting_data.md | 6 +----- 5 files changed, 8 insertions(+), 12 deletions(-) diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md index 8446d03088..f4393ff0bc 100644 --- a/sources/academy/webscraping/anti_scraping/index.md +++ b/sources/academy/webscraping/anti_scraping/index.md @@ -12,7 +12,7 @@ slug: /anti-scraping --- -If at any point in time you've strayed away from the Academy's demo content, and into the wild west by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions. +If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions. This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more. @@ -65,7 +65,7 @@ Unfortunately for these websites, they have to make compromises and tradeoffs. W Anti-scraping protections can work on many different layers and use a large amount of bot-identification techniques. 1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate a different IP addresses but their quality matters a lot. -2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, cyphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration). +2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, ciphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration). 3. **What you are scraping** - The same data can be extracted in many ways from a website. You can just get the inital HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently. 4. **How you behave** - The website can see patterns in how you are ordering your requests, how fast you are scraping, etc. It can also analyze browser behavior like mouse movement, clicks or key presses. @@ -91,7 +91,7 @@ A common workflow of a website after it has detected a bot goes as follows: 2. A [Turing test](https://en.wikipedia.org/wiki/Turing_test) is provided to the bot. Typically a **captcha**. 
If the bot succeeds, it is added to the whitelist. 3. If the captcha is failed, the bot is added to the blacklist. -One thing to keep in mind while navigating through this course is that advanced scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations. +One thing to keep in mind while navigating through this course is that advanced anti-scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations. Watch a conference talk by [Ondra Urban](https://github.com/mnmkng), which provides an overview of various anti-scraping measures and tactics for circumventing them. @@ -111,7 +111,7 @@ Because we here at Apify scrape for a living, we have discovered many popular an ### IP rate-limiting -This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rating don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address. +This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address. > Learn more about rate limiting [here](./techniques/rate_limiting.md) diff --git a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md index b666f178a8..1a689ef7e4 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md @@ -176,7 +176,7 @@ The script is modified with some random JavaScript elements. Additionally, it al ### Data obfuscation -Two main data obfuscation techniues are widely employed: +Two main data obfuscation techniques are widely employed: 1. **String splitting** uses the concatenation of multiple substrings. It is mostly used alongside an `eval()` or `document.write()`. 2. **Keyword replacement** allows the script to mask the accessed properties. This allows the script to have a random order of the substrings and makes it harder to detect. 
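
To make the two techniques above more concrete, here is a small illustrative sketch. It is not taken from any real fingerprinting script, and the property names are chosen only for the example:

```js
// String splitting: the code is assembled from substrings at runtime,
// often passed to eval() or document.write(), so keywords such as
// "userAgent" never appear verbatim in the source.
const ua = eval('navi' + 'gator.user' + 'Agent');

// Keyword replacement: accessed properties are looked up through a mapping
// object, so the substrings can be stored in a different (randomized) order
// in every build of the script.
const keys = { x: 'hardwareConcurrency', y: 'language' };
const sample = {
    cores: navigator[keys.x],
    language: navigator[keys.y],
};
```
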
diff --git a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md index ce2c029761..fdf063c54d 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md @@ -17,7 +17,7 @@ In the past, most websites had their own anti-scraping solutions, the most commo In cases when a higher number of requests is expected for the crawler, using a [proxy](../mitigation/proxies.md) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked. -## Dealing rate limiting with proxy/session rotating {#dealing-with-rate-limiting} +## Dealing with rate limiting by rotating proxy or session {#dealing-with-rate-limiting} The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies](../mitigation/proxies.md) after every **n** number of requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make large amounts to a website without getting restricted. diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index d2a87ffc90..71fe1105c5 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -37,7 +37,7 @@ If we were to make a request with the **limit** set to **5** and the **offset** ## Cursor pagination {#cursor-pagination} -Becoming more and more common is cursor-based pagination. Like with offset-based pagination, a **limit** parameter is usually present; however, instead of **offset**, **cursor** is used instead. A cursor is just a marker (sometimes a token, a date, or just a number) for an item in the dataset. All results returned back from the API will be records that come after the item matching the **cursor** parameter provided. +Sometimes pagination uses **cursor** instead of **offset**. Cursor is a marker of an item in the dataset. It can be a date, number, or a more or less random string of letters and numbers. Request with a **cursor** parameter will result in an API response containing items which follow after the item which the cursor points to. One of the most painful things about scraping APIs with cursor pagination is that you can't skip to, for example, the 5th page. You have to paginate through each page one by one. diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md index 52b2038971..d94fc6a967 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md @@ -14,11 +14,7 @@ import TabItem from '@theme/TabItem'; --- -Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../web_scraping_for_beginners/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. 
- -> Most web data extraction cases involve looping through a list of items of some sort. - -Playwright & Puppeteer offer two main methods for data extraction +Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../web_scraping_for_beginners/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. Playwright & Puppeteer offer two main methods for data extraction: 1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`. 2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio) From 68bf8d4923485b1a76a0de0f20da6dbc7e5539e0 Mon Sep 17 00:00:00 2001 From: Honza Javorek Date: Thu, 25 Apr 2024 13:37:14 +0200 Subject: [PATCH 02/15] style: don't tell the reader something is 'easy' --- .../platform/deploying_your_code/deploying.md | 4 +--- .../platform/deploying_your_code/inputs_outputs.md | 2 +- .../managing_source_code.md | 2 -- .../solutions/using_storage_creating_tasks.md | 2 +- .../get_most_of_actors/seo_and_promotion.md | 2 +- .../academy/platform/getting_started/apify_api.md | 2 +- .../platform/getting_started/inputs_outputs.md | 2 +- .../api/run_actor_and_retrieve_data_via_api.md | 4 ++-- .../tutorials/apify_scrapers/puppeteer_scraper.md | 13 ++++++------- .../academy/tutorials/apify_scrapers/web_scraper.md | 5 ++--- .../node_js/add_external_libraries_web_scraper.md | 4 ++-- .../tutorials/node_js/choosing_the_right_scraper.md | 2 +- .../tutorials/node_js/dealing_with_dynamic_pages.md | 6 +----- .../processing_multiple_pages_web_scraper.md | 6 +++--- .../tutorials/node_js/scraping_shadow_doms.md | 4 ++-- .../node_js/when_to_use_puppeteer_scraper.md | 4 ++-- .../academy/tutorials/php/using_apify_from_php.md | 4 +--- .../academy/tutorials/python/scrape_data_python.md | 2 +- .../scraping_paginated_sites.md | 2 +- .../mitigation/generating_fingerprints.md | 2 +- .../anti_scraping/mitigation/using_proxies.md | 2 +- .../general_api_scraping/locating_and_learning.md | 2 +- .../webscraping/puppeteer_playwright/browser.md | 2 +- .../common_use_cases/downloading_files.md | 2 +- .../page/interacting_with_a_page.md | 2 +- .../web_scraping_for_beginners/challenge/index.md | 2 +- .../challenge/initializing_and_setting_up.md | 2 +- .../crawling/exporting_data.md | 2 +- .../crawling/finding_links.md | 4 +--- .../crawling/headless_browser.md | 2 +- .../data_extraction/project_setup.md | 4 ++-- 31 files changed, 43 insertions(+), 57 deletions(-) diff --git a/sources/academy/platform/deploying_your_code/deploying.md b/sources/academy/platform/deploying_your_code/deploying.md index 8e6b2c89ce..bfabc2aa7e 100644 --- a/sources/academy/platform/deploying_your_code/deploying.md +++ b/sources/academy/platform/deploying_your_code/deploying.md @@ -21,12 +21,10 @@ Before we deploy our project onto the Apify platform, let's ensure that we've pu ### Creating the Actor -Before anything can be integrated, we've gotta create a new Actor. Luckily, this is super easy to do. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal) and click on the **New** button, then select the **Empty** template. +Before anything can be integrated, we've gotta create a new Actor. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal) and click on the **New** button, then select the **Empty** template. 
![Create new button](../getting_started/images/create-new-actor.png) -Easy peasy! - ### Changing source code location {#change-source-code} In the **Source** tab on the new Actor's page, we'll click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**. diff --git a/sources/academy/platform/deploying_your_code/inputs_outputs.md b/sources/academy/platform/deploying_your_code/inputs_outputs.md index 3bae75dc09..293db294fe 100644 --- a/sources/academy/platform/deploying_your_code/inputs_outputs.md +++ b/sources/academy/platform/deploying_your_code/inputs_outputs.md @@ -11,7 +11,7 @@ slug: /deploying-your-code/inputs-outputs --- -Most of the time when you're creating a project, you are expecting some sort of input from which your software will run off. Oftentimes as well, you want to provide some sort of output once your software has completed running. With Apify, it is extremely easy to take in inputs and deliver outputs. +Most of the time when you're creating a project, you are expecting some sort of input from which your software will run off. Oftentimes as well, you want to provide some sort of output once your software has completed running. Apify provides a convenient way to handle inputs and deliver outputs. An important thing to understand regarding inputs and outputs is that they are read/written differently depending on where the Actor is running: diff --git a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md index 09cb219abc..38d3fdaa95 100644 --- a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md +++ b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md @@ -31,8 +31,6 @@ Also, try to explore the **Multifile editor** in one of the Actors you developed ## Our task {#our-task} -> This lesson's task is so quick and easy, we won't even be splitting this topic into two parts like the previous two topics! - First, we must initialize a GitHub repository (you can use Gitlab if you like, but this lesson's examples will be using GitHub). Then, after pushing our main Amazon Actor's code to the repo, we must switch its source code to use the content of the GitHub repository instead. ## Integrating GitHub source code {#integrating-github} diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md index b533b5838b..32c306b503 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md @@ -232,7 +232,7 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => { Don't forget to push your changes to GitHub using `git push origin MAIN_BRANCH_NAME` to see them on the Apify platform! -## Creating a task (It's easy!) 
{#creating-task} +## Creating a task {#creating-task} Back on the platform, on your Actor's page, you can see a button in the top right hand corner that says **Create new task**: diff --git a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md index f0e806e65b..20a57e55cd 100644 --- a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md +++ b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md @@ -23,7 +23,7 @@ On the other hand, if you precisely address a niche segment of users who will be ## Keywords -Several freemium tools exist that make it easy to identify the right phrases and keywords: +Several freemium tools exist that help with identifying the right phrases and keywords: - [wordstream.com/keywords](https://www.wordstream.com/keywords) - [neilpatel.com/ubersuggest](https://neilpatel.com/ubersuggest/) diff --git a/sources/academy/platform/getting_started/apify_api.md b/sources/academy/platform/getting_started/apify_api.md index 8a7cbceb60..f384958fde 100644 --- a/sources/academy/platform/getting_started/apify_api.md +++ b/sources/academy/platform/getting_started/apify_api.md @@ -47,7 +47,7 @@ https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-it Additional parameters can be passed to this endpoint. You can learn about them [here](/api/v2#/reference/actors/run-actor-synchronously-and-get-dataset-items/run-actor-synchronously-with-input-and-get-dataset-items) -> Note: It is safer to put your API token in the **Authorization** header like so: `Authorization: Bearer YOUR_TOKEN`. This is very easy to configure in popular HTTP clients, such as [Postman](../../glossary/tools/postman.md), [Insomnia](../../glossary/tools/insomnia.md). +> Network components can record visited URLs, so it's more secure to send the token as a HTTP header, not as a parameter. The header should look like `Authorization: Bearer YOUR_TOKEN`. Popular HTTP clients, such as [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md), provide a convenient way to configure the Authorization header for all your API requests. ## Sending the request {#sending-the-request} diff --git a/sources/academy/platform/getting_started/inputs_outputs.md b/sources/academy/platform/getting_started/inputs_outputs.md index b7a95a8e66..88a87c5810 100644 --- a/sources/academy/platform/getting_started/inputs_outputs.md +++ b/sources/academy/platform/getting_started/inputs_outputs.md @@ -11,7 +11,7 @@ slug: /getting-started/inputs-outputs --- -Most of the time, when you are writing any sort of software, it will generally expect some sort of input and generate some sort of output. It is the same exact story when it comes to Actors, which is why we at Apify have made it so easy to accept input into an Actor and store its results somewhere. +Actors, as any other programs, take inputs and generate outputs. The Apify platform has a way how to specify what inputs the Actor expects, and a way to temporarily or permanently store its results. In this lesson, we'll be demonstrating inputs and outputs by building an Actor which takes two numbers as input, adds them up, and then outputs the result. 
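
As a rough preview of where the lesson is heading, the core of such an Actor could look something like the sketch below. The `num1` and `num2` field names are an assumption borrowed from how the adding Actor is described elsewhere in these docs, so treat this only as an outline, not as the final code:

```js
import { Actor } from 'apify';

await Actor.init();

// Expects an input object such as { "num1": 5, "num2": 7 }
const { num1, num2 } = await Actor.getInput();

// Push the result to the run's default dataset so it shows up as output
await Actor.pushData({ sum: num1 + num2 });

await Actor.exit();
```
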
diff --git a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md index 074a5030a0..0269c05aff 100644 --- a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md +++ b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md @@ -9,9 +9,9 @@ slug: /api/run-actor-and-retrieve-data-via-api --- -The most popular way of [integrating](https://help.apify.com/en/collections/1669769-integrations) the Apify platform with an external project/application is by programmatically running an [Actor](/platform/actors) or [task](/platform/actors/running/tasks), waiting for it to complete its run, then collecting its data and using it within the project. Though this process sounds somewhat complicated, it's actually quite easy to do; however, due to the plethora of features offered on the Apify platform, new users may not be sure how exactly to implement this type of integration. Let's dive in and see how you can do it. +The most popular way of [integrating](https://help.apify.com/en/collections/1669769-integrations) the Apify platform with an external project/application is by programmatically running an [Actor](/platform/actors) or [task](/platform/actors/running/tasks), waiting for it to complete its run, then collecting its data and using it within the project. Follow this tutorial to have an idea on how to approach this, it isn't as complicated as it sounds! -> Remember to check out our [API documentation](/api/v2) with examples in different languages and a live API console. We also recommend testing the API with a nice desktop client like [Postman](https://www.getpostman.com/) or [Insomnia](https://insomnia.rest). +> Remember to check out our [API documentation](/api/v2) with examples in different languages and a live API console. We also recommend testing the API with a desktop client like [Postman](https://www.getpostman.com/) or [Insomnia](https://insomnia.rest). Apify API offers two ways of interacting with it: diff --git a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md index 98e5618286..0f5e0dbed6 100644 --- a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md @@ -374,7 +374,6 @@ JavaScript had the time to run. At first, you may think that the scraper is broken, but it just cannot wait for all the JavaScript in the page to finish executing. For a lot of pages, there's always some JavaScript executing or some network requests being made. It would never stop waiting. It is therefore up to you, the programmer, to wait for the elements you need. -Fortunately, we have an easy solution. #### The `context.page.waitFor()` function @@ -409,13 +408,13 @@ With those tools, you should be able to handle any dynamic content the website t ### [](#how-to-paginate) How to paginate -With the theory out of the way, this should be pretty easy. The algorithm is a loop: +After going through the theory, let's design the algorithm: - 1. Wait for the **Show more** button. - 2. Click it. - 3. Is there another **Show more** button? - - Yes? Repeat the above. (loop) - - No? We're done. We have all the Actors. +1. Wait for the **Show more** button. +2. Click it. +3. Is there another **Show more** button? + - Yes? Repeat from 1. (loop) + - No? We're done. We have all the Actors. 
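
Before the tutorial breaks these steps down one by one, here is a rough sketch of how the whole loop could look inside the page function. The selector and timeout are placeholders, not the values used later in the tutorial:

```js
// context.page is the Puppeteer Page available in Puppeteer Scraper's pageFunction
const { page } = context;
const buttonSelector = 'button.show-more'; // placeholder selector

for (;;) {
    try {
        // Give the button a limited time to appear; if it never shows up, we're done.
        await page.waitForSelector(buttonSelector, { timeout: 5000 });
    } catch (err) {
        break; // no more "Show more" button, all Actors are loaded
    }
    await page.click(buttonSelector);
}
```
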
#### Waiting for the button diff --git a/sources/academy/tutorials/apify_scrapers/web_scraper.md b/sources/academy/tutorials/apify_scrapers/web_scraper.md index 03dcc27008..08ed5a2b66 100644 --- a/sources/academy/tutorials/apify_scrapers/web_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/web_scraper.md @@ -272,7 +272,6 @@ JavaScript had the time to run. At first, you may think that the scraper is broken, but it just cannot wait for all the JavaScript in the page to finish executing. For a lot of pages, there's always some JavaScript executing or some network requests being made. It would never stop waiting. It is therefore up to you, the programmer, to wait for the elements you need. -Fortunately, we have an easy solution. #### The `context.waitFor()` function @@ -304,12 +303,12 @@ With those tools, you should be able to handle any dynamic content the website t ### [](#how-to-paginate) How to paginate -With the theory out of the way, this should be pretty easy. The algorithm is a loop: +After going through the theory, let's design the algorithm: 1. Wait for the **Show more** button. 2. Click it. 3. Is there another **Show more** button? - - Yes? Repeat the above. (loop) + - Yes? Repeat from 1. (loop) - No? We're done. We have all the Actors. #### Waiting for the button diff --git a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md index 5a68a08d6a..7f445e2d98 100644 --- a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md +++ b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md @@ -1,11 +1,11 @@ --- title: How to add external libraries to Web Scraper -description: Web Scraper does not provide an option to load external JavaScript libraries. Fortunately, there's an easy way to do it. Learn how. +description: Learn how to load external JavaScript libraries in Apify's Web Scraper Actor. sidebar_position: 15.7 slug: /node-js/add-external-libraries-web-scraper --- -Sometimes you need to use some extra JavaScript in your [Web Scraper](https://apify.com/apify/web-scraper) page functions. Whether it is to work with dates and times using [Moment.js](https://momentjs.com/), or to manipulate the DOM using [jQuery](https://jquery.com/), libraries save precious time and make your code more concise and readable. Web Scraper already provides an easy way to add jQuery to your page functions. All you need to do is to check the Inject jQuery input option. There's also the option to Inject Underscore, a popular helper function library. +Sometimes you need to use some extra JavaScript in your [Web Scraper](https://apify.com/apify/web-scraper) page functions. Whether it is to work with dates and times using [Moment.js](https://momentjs.com/), or to manipulate the DOM using [jQuery](https://jquery.com/), libraries save precious time and make your code more concise and readable. Web Scraper already provides a way to add jQuery to your page functions. All you need to do is to check the Inject jQuery input option. There's also the option to Inject Underscore, a popular helper function library. In this tutorial, we'll learn how to inject any JavaScript library into your page functions, with the only limitation being that the library needs to be available somewhere on the internet as a downloadable file (typically a CDN). 
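
As a preview of the approach, one way to pull a library in is to append a `<script>` tag pointing to the CDN from inside the page function and wait for it to load. The sketch below uses Moment.js and a cdnjs URL only as an example:

```js
async function pageFunction(context) {
    // Inject Moment.js from a public CDN and wait until it has loaded.
    await new Promise((resolve, reject) => {
        const script = document.createElement('script');
        script.src = 'https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.29.4/moment.min.js';
        script.onload = resolve;
        script.onerror = reject;
        document.head.appendChild(script);
    });

    // The library is now available as a global inside the page.
    return { scrapedAt: moment().format('YYYY-MM-DD') };
}
```
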
diff --git a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md index 58516d91fb..75d41d8873 100644 --- a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md +++ b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md @@ -26,7 +26,7 @@ If it were only a question of performance, you'd of course use request-based scr ## Dynamic pages & blocking {#dynamic-pages} -Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages](./dealing_with_dynamic_pages.md)). Another problem is blocking. If the website is collecting a [browser fingerprint](../../webscraping/anti_scraping/techniques/fingerprinting.md), it is very easy for it to distinguish between a real user and a bot (crawler) and block access. +Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages](./dealing_with_dynamic_pages.md)). Another problem is blocking. If the website collects a [browser fingerprint](../../webscraping/anti_scraping/techniques/fingerprinting.md), it can easily distinguish between a real user and a bot (crawler) and block access. ## Making the choice {#making-the-choice} diff --git a/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md b/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md index 7fdfccc954..9a8d27dcb0 100644 --- a/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md +++ b/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md @@ -13,13 +13,9 @@ import Example from '!!raw-loader!roa-loader!./dealing_with_dynamic_pages.js'; --- - - -In this lesson, we'll be discussing dynamic content and how to scrape it while utilizing Crawlee. - ## A quick experiment {#quick-experiment} -From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. Easy enough! We did something very similar in the previous modules. +From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. ![New arrival products in Fakestore](./images/new-arrivals.jpg) diff --git a/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md b/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md index f6c000c926..cfc5648868 100644 --- a/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md +++ b/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md @@ -5,11 +5,11 @@ sidebar_position: 15.6 slug: /node-js/processing-multiple-pages-web-scraper --- -There is a certain scraping scenario in which you need to process the same URL many times, but each time with a different setup (e.g. filling in a form with different data each time). This is easy to do with Apify, but how to go about it may not be obvious at first glance. +Sometimes you need to process the same URL several times, but each time with a different setup. For example, you may want to submit the same form with different data each time. 
-We'll show you how to do this with a simple example: starting a scraper with an array of keywords, inputting each of the keywords separately into Google, and retrieving the results on the last page. The tutorial will be split into these three main parts. +Let's illustrate a solution to this problem by creating a scraper which starts with an array of keywords and inputs each of them to Google, one by one. Then it retrieves the results. -This whole thing could be done in a much easier way, by directly enqueuing the search URL, but we're choosing this approach to demonstrate some of the not so obvious features of the Apify scraper. +> This isn't an efficient solution to searching keywords on Google. You could directly enqueue search URLs like `https://www.google.cz/search?q=KEYWORD`. # Enqueuing start pages for all keywords diff --git a/sources/academy/tutorials/node_js/scraping_shadow_doms.md b/sources/academy/tutorials/node_js/scraping_shadow_doms.md index ada3afcb0e..1151c3b6ef 100644 --- a/sources/academy/tutorials/node_js/scraping_shadow_doms.md +++ b/sources/academy/tutorials/node_js/scraping_shadow_doms.md @@ -1,13 +1,13 @@ --- title: How to scrape sites with a shadow DOM -description: The shadow DOM enables the isolation of web components, but causes problems for those building web scrapers. Here's an easy workaround. +description: The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround. sidebar_position: 14.8 slug: /node-js/scraping-shadow-doms --- # How to scrape sites with a shadow DOM {#scraping-shadow-doms} -**The shadow DOM enables the isolation of web components, but causes problems for those building web scrapers. Here's an easy workaround.** +**The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround.** --- diff --git a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md index 93cc431886..bc065fcf4d 100644 --- a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md +++ b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md @@ -126,7 +126,7 @@ Since we're actually clicking in the page, which may or may not trigger some nas ## Plain form submit navigations -This is easy and will work out of the box. It's typically used on older websites such as [Turkish Remax](https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama). For a site like this you can just set the `Clickable elements selector` and you're good to go: +This works out of the box. It's typically used on older websites such as [Turkish Remax](https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama). For a site like this you can just set the `Clickable elements selector` and you're good to go: ```js 'a[onclick ^= getPage]'; @@ -142,7 +142,7 @@ Those are similar to the ones above with an important caveat. Once you click the ## Frontend navigations -Websites often won't navigate away just to fetch the next set of results. They will do it in the background and just update the displayed data. To paginate websites like that is quite easy actually and it can be done in both Web Scraper and Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Just click the next button to load the next set of courses. +Websites often won't navigate away just to fetch the next set of results. 
They will do it in the background and just update the displayed data. You can paginate such websites with either Web Scraper or Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Click the next button to load the next set of courses. ```js // Web Scraper\ diff --git a/sources/academy/tutorials/php/using_apify_from_php.md b/sources/academy/tutorials/php/using_apify_from_php.md index 885354b3b1..ea8401fcab 100644 --- a/sources/academy/tutorials/php/using_apify_from_php.md +++ b/sources/academy/tutorials/php/using_apify_from_php.md @@ -230,9 +230,7 @@ $response = $client->post('acts/mhamas~html-string-to-pdf/runs', [ ## How to use Apify Proxy -A [proxy](/platform/proxy) is another important Apify feature you will need. Guzzle makes it easy to use. - -If you just want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode. +Let's use another important feature: [proxy](/platform/proxy). If you just want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode. ```php $client = new \GuzzleHttp\Client([ diff --git a/sources/academy/tutorials/python/scrape_data_python.md b/sources/academy/tutorials/python/scrape_data_python.md index 7be6866f6f..eb14f79b77 100644 --- a/sources/academy/tutorials/python/scrape_data_python.md +++ b/sources/academy/tutorials/python/scrape_data_python.md @@ -11,7 +11,7 @@ slug: /python/scrape-data-python --- -Web scraping is not limited to the JavaScript world. The Python ecosystem contains some pretty powerful scraping tools as well. One of those is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), a library for parsing HTML and easy navigation or modification of a DOM tree. +Web scraping is not limited to the JavaScript world. The Python ecosystem contains some pretty powerful scraping tools as well. One of those is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), a library for parsing HTML and navigating or modifying of its DOM tree. This tutorial shows you how to write a Python [Actor](../../platform/getting_started/actors.md) for scraping the weather forecast from [BBC Weather](https://www.bbc.com/weather) and process the scraped data using [Pandas](https://pandas.pydata.org/). diff --git a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md index badda5bb01..c6ce3fea19 100644 --- a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md +++ b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md @@ -97,7 +97,7 @@ In addition, XHRs are smaller and faster than loading an HTML page. On the other #### Does the website show the number of products for each filtered page? {#does-the-website-show-the-number-of-products-for-each-filtered-page} -If it does, it is a nice bonus. It gives us an easy way to check if we are over or below the pagination limit and helps with analytics. +If it does, it's a nice bonus. It gives us a way to check if we are over or below the pagination limit and helps with analytics. If it doesn't, we have to find a different way to check if the number of listings is within a limit. One option is to go to the last allowed page of the pagination. If that page is still full of products, we can assume the filter is over the limit. 
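
For illustration, such a check could look roughly like the sketch below. It assumes Playwright, and the page limit, URL shape, page size, and product selector are all placeholders that depend on the target website:

```js
const LAST_ALLOWED_PAGE = 50; // assumed pagination cap of the website
const PAGE_SIZE = 24; // assumed number of products per full page

await page.goto(`${filterUrl}&page=${LAST_ALLOWED_PAGE}`);
const productCount = await page.locator('[data-testid="product-card"]').count();

// If the last allowed page is still completely full, the current filter
// most likely exceeds the pagination limit and needs to be split further.
const isOverLimit = productCount >= PAGE_SIZE;
```
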
diff --git a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md index 5ba143ef47..e309502b17 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md @@ -11,7 +11,7 @@ slug: /anti-scraping/mitigation/generating-fingerprints --- -In [**Crawlee**](https://crawlee.dev), it's extremely easy to automatically generate fingerprints using the [**FingerprintOptions**](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler. +In [**Crawlee**](https://crawlee.dev), you can use [**FingerprintOptions**](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler to automatically generate fingerprints. ```js import { PlaywrightCrawler } from 'crawlee'; diff --git a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md index cb0e21f38c..ca61b50cff 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md @@ -13,7 +13,7 @@ slug: /anti-scraping/mitigation/using-proxies In the [**Web scraping for beginners**](../../web_scraping_for_beginners/crawling/pro_scraping.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg. -Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool. +Because proxies are so widely used in the scraping world, Crawlee has built-in features for implementing them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool. ## Implementing proxies in a scraper {#implementing-proxies} diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md index bdf7d691e2..c49dc0cea8 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md @@ -41,7 +41,7 @@ Since our request doesn't have any body/payload, we just need to analyze the URL ![Breaking down the request url into understandable chunks](./images/analyzing-the-url.png) -Understanding an API's various configurations helps with creating a game-plan on how to best scrape it, as many of the parameters can be utilized for easy pagination, or easy data-filtering. Additionally, these values can be mapped to a scraper's configuration options, which overall makes the scraper more versatile. +Understanding an API's various configurations helps with creating a game-plan on how to best scrape it, as many of the parameters can be utilized for pagination, or data-filtering. 
Additionally, these values can be mapped to a scraper's configuration options, which overall makes the scraper more versatile. Let's say we want to receive all of the user's tracks in one request. Based on our observations of the endpoint's different parameters, we can modify the URL and utilize the `limit` option to return more than just twenty songs. The `limit` option is extremely common with most APIs, and allows the person making the request to literally limit the maximum number of results to be returned in the request: diff --git a/sources/academy/webscraping/puppeteer_playwright/browser.md b/sources/academy/webscraping/puppeteer_playwright/browser.md index 38170b3802..7cbc78725b 100644 --- a/sources/academy/webscraping/puppeteer_playwright/browser.md +++ b/sources/academy/webscraping/puppeteer_playwright/browser.md @@ -82,7 +82,7 @@ There are a whole lot more options that we can pass into the `launch()` function ## Browser methods {#browser-methods} -The `launch()` function also returns an object representation of the browser that we can use to interact with the browser right from our code. This **Browser** object ([Puppeteer](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-class-browser), [Playwright](https://playwright.dev/docs/api/class-browser)) has many functions which make it easy to do this. One of these functions is `close()`. Until now, we've been using **control^** + **C** to force quit the process, but with this function, we'll no longer have to do that. +The `launch()` function also returns a **Browser** object ([Puppeteer](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-class-browser), [Playwright](https://playwright.dev/docs/api/class-browser)), which is a representation of the browser. This object has many methods, which allow us to interact with the browser from our code. One of them is `close()`. Until now, we've been using **control^** + **C** to force quit the process, but with this function, we'll no longer have to do that. diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md index fc10172f6c..bb17676869 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md @@ -11,7 +11,7 @@ slug: /puppeteer-playwright/common-use-cases/downloading-files --- -Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it the easy way. However, there are different techniques that work (most of the time). +Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it in a straightforward way. However, there are different techniques that work (most of the time). These techniques are only necessary when we don't have a direct file link, which is usually the case when the file being downloaded is based on more complicated data export. 
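
One of those techniques, shown here only as a sketch, is to skip the browser's download mechanism entirely and capture the file from the network response. The URL fragment and the button selector are assumptions made up for the example:

```js
import { promises as fs } from 'fs';

page.on('response', async (response) => {
    // Assumption: the export endpoint can be recognized by its URL.
    if (response.url().includes('/export')) {
        const buffer = await response.buffer();
        await fs.writeFile('export.csv', buffer);
    }
});

// Trigger the export, for example by clicking the site's download button.
await page.click('button.export');
```
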
diff --git a/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md b/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md index aaa689c1dc..60456e2ce1 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md @@ -26,7 +26,7 @@ Let's say that we want to automate searching for **hello world** on Google, then 6. Read the title of the clicked result's loaded page 7. Screenshot the page -Though it seems complex, the wonderful **Page** API makes all of these actions extremely easy to perform. +Though it seems complex, the wonderful **Page** API can help us with all the steps. ## Clicking & pressing keys {#clicking-and-pressing-keys} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md b/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md index a0e351b68e..0db3f48a1f 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md @@ -85,4 +85,4 @@ From this course, you should have all the knowledge to build this scraper by you The challenge can be completed using either [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) or [PlaywrightCrawler](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler). Playwright is significantly slower but doesn't get blocked as much. You will learn the most by implementing both. -Let's start off this section easy by [initializing and setting up](./initializing_and_setting_up.md) our project with the Crawlee CLI (don't worry, no additional installation is required). +Let's start off this section by [initializing and setting up](./initializing_and_setting_up.md) our project with the Crawlee CLI (don't worry, no additional installation is required). diff --git a/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md b/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md index 7021eff607..c0cf40bc11 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md @@ -11,7 +11,7 @@ slug: /web-scraping-for-beginners/challenge/initializing-and-setting-up --- -The Crawlee CLI makes it extremely easy for us to set up a project in Crawlee and hit the ground running. Navigate to the directory you'd like your project's folder to live, then open up a terminal instance and run the following command: +The Crawlee CLI speeds up the process of setting up a Crawlee project. 
Navigate to the directory you'd like your project's folder to live, then open up a terminal instance and run the following command: ```shell npx crawlee create amazon-crawler diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md index 8113e34d26..d43c18ab55 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md @@ -20,7 +20,7 @@ But when we look inside the folder, we see that there are a lot of files, and we ## Exporting data to CSV {#export-csv} -Crawlee's `Dataset` provides an easy way to export all your scraped data into one big CSV file. You can then open it in Excel or any other data processor. To do that, you simply need to call [`Dataset.exportToCSV()`](https://crawlee.dev/api/core/class/Dataset#exportToCSV) after collecting all the data. That means, after your crawler run finishes. +Crawlee's `Dataset` provides a way to export all your scraped data into one big CSV file. You can then open it in Excel or any other data processor. To do that, you need to call [`Dataset.exportToCSV()`](https://crawlee.dev/api/core/class/Dataset#exportToCSV) after collecting all the data. That means, after your crawler run finishes. ```js title=browser.js // ... diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md index 77f8c3c880..67f8ada243 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md @@ -25,9 +25,7 @@ On a webpage, the link above will look like this: [This is a link to example.com ## Extracting links 🔗 {#extracting-links} -So, if a link is just an HTML element, and the URL is just an attribute, this means that we can extract links exactly the same way as we extracted data.💡 Easy! - -To test this theory in the browser, we can try running the following code in our DevTools console on any website. +If a link is just an HTML element, and the URL is just an attribute, this means that we can extract links the same way as we extracted data. To test this theory in the browser, we can try running the following code in our DevTools console on any website. ```js // Select all the elements. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md index bdfc3d13ec..da5c798251 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md @@ -20,7 +20,7 @@ A headless browser is simply a browser that runs without a user interface (UI). > Our focus will be on Playwright, which boasts additional features and better documentation. Notably, it originates from the same team responsible for Puppeteer. -Building a Playwright scraper with Crawlee is extremely easy. To show you how easy it really is, we'll reuse the Cheerio scraper code from the previous lesson. By changing only a few lines of code, we'll turn it into a full headless scraper. +Crawlee has a built-in support for building Playwright scrapers. Let's reuse code of the Cheerio scraper from the previous lesson. 
It'll take us just a few changes to turn it into a full headless scraper. First, we must install Playwright into our project. It's not included in Crawlee, because it's quite large as it bundles all the browsers. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md index 05abd5211e..8af848745a 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md @@ -47,9 +47,9 @@ Now that we have a project set up, we can install npm modules into the project. npm install got-scraping cheerio ``` -[**got-scraping**](https://github.com/apify/got-scraping) is a library that's made especially for scraping and downloading page's HTML. It's based on the very popular [**got** library](https://github.com/sindresorhus/got), which means any features of **got** are also available in **got-scraping**. Both **got** and **got-scraping** are HTTP clients. To learn more about HTTP, [visit this MDN tutorial](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP). +[**got-scraping**](https://github.com/apify/got-scraping) is a library that's made especially for scraping and downloading page's HTML. It's based on the popular [**got** library](https://github.com/sindresorhus/got), which means any features of **got** are also available in **got-scraping**. Both **got** and **got-scraping** are HTTP clients. To learn more about HTTP, [visit this MDN tutorial](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP). -[Cheerio](https://github.com/cheeriojs/cheerio) is a very popular Node.js library for parsing (processing) HTML. If you're familiar with good old [jQuery](https://jquery.com/), you'll find working with Cheerio really easy. +[Cheerio](https://github.com/cheeriojs/cheerio) is a popular Node.js library for parsing and processing HTML. If you know how to work with [jQuery](https://jquery.com/), you'll find Cheerio familiar. 
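
The next section verifies that everything works, but as a rough idea of how the two libraries fit together, a minimal sketch might look like this (example.com stands in for a real target website):

```js
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Download the page's HTML with got-scraping...
const { body } = await gotScraping('https://example.com');

// ...and let Cheerio parse it, so we can query it with jQuery-like selectors.
const $ = cheerio.load(body);
console.log($('title').text());
```
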
## Test everything {#testing} From 016a31581e77419000dce85223927d776d189a1e Mon Sep 17 00:00:00 2001 From: Honza Javorek Date: Thu, 25 Apr 2024 13:49:40 +0200 Subject: [PATCH 03/15] style: don't use 'simply', let the reader decide what is simple --- sources/academy/glossary/concepts/http_headers.md | 4 ++-- sources/academy/glossary/tools/edit_this_cookie.md | 2 +- .../academy/platform/deploying_your_code/input_schema.md | 2 +- .../academy/platform/deploying_your_code/output_schema.md | 2 +- .../solutions/rotating_proxies.md | 2 +- .../academy/platform/get_most_of_actors/actor_readme.md | 2 +- sources/academy/platform/getting_started/actors.md | 2 +- sources/academy/platform/getting_started/apify_api.md | 2 +- .../academy/platform/getting_started/creating_actors.md | 6 +++--- sources/academy/platform/running_a_web_server.md | 2 +- .../tutorials/api/run_actor_and_retrieve_data_via_api.md | 8 ++++---- .../academy/tutorials/apify_scrapers/cheerio_scraper.md | 5 ++--- .../academy/tutorials/apify_scrapers/puppeteer_scraper.md | 6 ++---- sources/academy/tutorials/apify_scrapers/web_scraper.md | 3 +-- .../node_js/add_external_libraries_web_scraper.md | 2 +- .../academy/tutorials/node_js/debugging_web_scraper.md | 4 ++-- .../node_js/filter_blocked_requests_using_sessions.md | 6 +++--- .../node_js/handle_blocked_requests_puppeteer.md | 2 +- .../academy/tutorials/node_js/how_to_fix_target_closed.md | 2 +- .../tutorials/node_js/when_to_use_puppeteer_scraper.md | 2 +- .../advanced_web_scraping/tips_and_tricks_robustness.md | 2 +- sources/academy/webscraping/anti_scraping/index.md | 2 +- .../anti_scraping/techniques/fingerprinting.md | 2 +- .../academy/webscraping/anti_scraping/techniques/index.md | 2 +- .../general_api_scraping/handling_pagination.md | 2 +- .../api_scraping/graphql_scraping/custom_queries.md | 2 +- .../common_use_cases/logging_into_a_website.md | 2 +- .../submitting_a_form_with_a_file_attachment.md | 2 +- .../executing_scripts/injecting_code.md | 2 +- .../webscraping/puppeteer_playwright/page/waiting.md | 2 +- .../academy/webscraping/switching_to_typescript/enums.md | 4 ++-- .../unknown_and_type_assertions.md | 2 +- .../switching_to_typescript/using_types_continued.md | 2 +- .../crawling/filtering_links.md | 2 +- .../crawling/headless_browser.md | 4 ++-- .../web_scraping_for_beginners/crawling/relative_urls.md | 2 +- .../data_extraction/computer_preparation.md | 2 +- 37 files changed, 50 insertions(+), 54 deletions(-) diff --git a/sources/academy/glossary/concepts/http_headers.md b/sources/academy/glossary/concepts/http_headers.md index 4e3c715285..64266bc8df 100644 --- a/sources/academy/glossary/concepts/http_headers.md +++ b/sources/academy/glossary/concepts/http_headers.md @@ -23,7 +23,7 @@ For some websites, you won't need to worry about modifying headers at all, as th Some websites will require certain default browser headers to work properly, such as **User-Agent** (though, this header is becoming more obsolete, as there are more sophisticated ways to detect and block a suspicious user). -Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which simply would not know which data to return without knowing which exact website is requesting it. +Another example of such a "default" header is **Referer**. 
Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which would not know which data to return without knowing which exact website is requesting it. ## Custom headers required {#needs-custom-headers} @@ -44,7 +44,7 @@ You could use Chrome DevTools to inspect request headers, and [Insomnia](../tool HTTP/1.1 and HTTP/2 headers have several differences. Here are the three key differences that you should be aware of: 1. HTTP/2 headers do not include status messages. They only contain status codes. -2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will simply ignore them, Safari and other Webkit-based browsers will outright reject any response that contains them. Easy to do by accident, and a big problem. +2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will ignore them, Safari and other Webkit-based browsers will outright reject any response that contains them. Easy to do by accident, and a big problem. 3. While HTTP/1.1 headers are case-insensitive and could be sent by the browsers with capitalized letters (e.g. **Accept-Encoding**, **Cache-Control**, **User-Agent**), HTTP/2 headers must be lower-cased (e.g. **accept-encoding**, **cache-control**, **user-agent**). > To learn more about the difference between HTTP/1.1 and HTTP/2 headers, check out [this](https://httptoolkit.tech/blog/translating-http-2-into-http-1/) article diff --git a/sources/academy/glossary/tools/edit_this_cookie.md b/sources/academy/glossary/tools/edit_this_cookie.md index 3e96ba1b9a..9ce785d944 100644 --- a/sources/academy/glossary/tools/edit_this_cookie.md +++ b/sources/academy/glossary/tools/edit_this_cookie.md @@ -21,7 +21,7 @@ At the top of the popup, there is a row of buttons. From left to right, here is ### Delete all cookies -Clicking this button will simply remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again. +Clicking this button will remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again. ### Reset diff --git a/sources/academy/platform/deploying_your_code/input_schema.md b/sources/academy/platform/deploying_your_code/input_schema.md index d1f78da242..759f322963 100644 --- a/sources/academy/platform/deploying_your_code/input_schema.md +++ b/sources/academy/platform/deploying_your_code/input_schema.md @@ -28,7 +28,7 @@ In the root of our project, we'll create a file named **INPUT_SCHEMA.json** and } ``` -The **title** and **description** simply describe what the input schema is for, and a bit about what the Actor itself does. +The **title** and **description** describe what the input schema is for, and a bit about what the Actor itself does. 
## Properties {#properties} diff --git a/sources/academy/platform/deploying_your_code/output_schema.md b/sources/academy/platform/deploying_your_code/output_schema.md index afdd1eaecb..25dc9bc6b8 100644 --- a/sources/academy/platform/deploying_your_code/output_schema.md +++ b/sources/academy/platform/deploying_your_code/output_schema.md @@ -69,7 +69,7 @@ Next, copy-paste the following template code into your `actor.json` file. } ``` -To configure the output schema, simply replace the fields in the template with the relevant fields to your Actor. +To configure the output schema, replace the fields in the template with the relevant fields to your Actor. For reference, you can use the [Zappos Scraper source code](https://github.com/PerVillalva/zappos-scraper-actor/blob/main/.actor/actor.json) as an example of how the final implementation of the output tab should look in a live Actor. diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md index a9cf0ac810..9e4beced8b 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md @@ -106,7 +106,7 @@ const proxyConfiguration = await Actor.createProxyConfiguration({ **Q: What do you need to do to rotate a proxy (one proxy usually has one IP)? How does this differ for CheerioCrawler and PuppeteerCrawler?** -**A:** Simply making a new request with the proxy endpoint above will automatically rotate it. Sessions can also be used to automatically do this. While proxy rotation is fairly straightforward for Cheerio, it's more complex in Puppeteer, as you have to retire the browser each time a new proxy is rotated in. The SessionPool will automatically retire a browser when a session is retired. Sessions can be manually retired with `session.retire()`. +**A:** Making a new request with the proxy endpoint above will automatically rotate it. Sessions can also be used to automatically do this. While proxy rotation is fairly straightforward for Cheerio, it's more complex in Puppeteer, as you have to retire the browser each time a new proxy is rotated in. The SessionPool will automatically retire a browser when a session is retired. Sessions can be manually retired with `session.retire()`. **Q: Name a few different ways how a website can prevent you from scraping it.** diff --git a/sources/academy/platform/get_most_of_actors/actor_readme.md b/sources/academy/platform/get_most_of_actors/actor_readme.md index a5c766cc4a..06054cb82d 100644 --- a/sources/academy/platform/get_most_of_actors/actor_readme.md +++ b/sources/academy/platform/get_most_of_actors/actor_readme.md @@ -57,7 +57,7 @@ Aim for sections 1–6 below and try to include at least 300 words. You can move - Add a video tutorial or GIF from an ideal Actor run. - > Tip: For better user experience, Apify Console automatically renders every YouTube URL as an embedded video player. Simply add a separate line with the URL of your YouTube video. + > Tip: For better user experience, Apify Console automatically renders every YouTube URL as an embedded video player. Add a separate line with the URL of your YouTube video. - Consider adding a short numbered tutorial as Google will sometimes pick these up as rich snippets. Remember that this might be in search results, so you can repeat the name of the Actor and give a link, e.g. 
diff --git a/sources/academy/platform/getting_started/actors.md b/sources/academy/platform/getting_started/actors.md index 565a878a1c..e74a01cd6d 100644 --- a/sources/academy/platform/getting_started/actors.md +++ b/sources/academy/platform/getting_started/actors.md @@ -15,7 +15,7 @@ After you've followed the **Getting started** lesson, you're almost ready to sta ## What's an Actor? {#what-is-an-actor} -When you deploy your script to the Apify platform, it is then called an **Actor**, which is simply a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours or even infinitely. An Actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. +When you deploy your script to the Apify platform, it is then called an **Actor**, which is a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours or even infinitely. An Actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. Once an Actor has been pushed to the Apify platform, they can be shared to the world through the [Apify Store](https://apify.com/store), and even monetized after going public. diff --git a/sources/academy/platform/getting_started/apify_api.md b/sources/academy/platform/getting_started/apify_api.md index f384958fde..b1ccdb631b 100644 --- a/sources/academy/platform/getting_started/apify_api.md +++ b/sources/academy/platform/getting_started/apify_api.md @@ -39,7 +39,7 @@ Our **adding-actor** takes in two input values (`num1` and `num2`). When using t ## Parameters {#parameters} -Let's say we want to run our **adding-actor** via API and view its results in CSV format at the end. We'll achieve this by simply passing the **format** parameter with a value of **csv** to change the output format: +Let's say we want to run our **adding-actor** via API and view its results in CSV format at the end. We'll achieve this by passing the **format** parameter with a value of **csv** to change the output format: ```text https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN_HERE&format=csv diff --git a/sources/academy/platform/getting_started/creating_actors.md b/sources/academy/platform/getting_started/creating_actors.md index 359b4688fc..5e952f8d30 100644 --- a/sources/academy/platform/getting_started/creating_actors.md +++ b/sources/academy/platform/getting_started/creating_actors.md @@ -68,7 +68,7 @@ If you want to use the template locally, you can again use our [Apify CLI](/cli) When you click on the **Use locally** button, you'll be presented with instructions on how to create an Actor from this template in your local environment. 
-With the Apify CLI installed, you can simply run the following commands in your terminal: +With the Apify CLI installed, you can run the following commands in your terminal: ```shell apify create my-actor -t getting_started_node @@ -153,7 +153,7 @@ And now we are ready to run the Actor. But before we do that, let's give the Act The input tab is where you can provide the Actor with some meaningful input. In this case, we'll be providing the Actor with a URL to scrape. For now, we'll use the prefilled value of [Apify website](https://apify.com/) (`https://apify.com/`). -You can change the website you want to extract the data from by simply changing the URL in the input field. +You can change the website you want to extract the data from by changing the URL in the input field. ![Input tab](./images/actor-input-tab.png) @@ -163,7 +163,7 @@ Once you have provided the Actor with some URL you want to extract the data from ![Actor run logs](./images/actor-run.png) -After the Actor finishes, you can preview or download the extracted data simply by clicking on the **Export X results** button. +After the Actor finishes, you can preview or download the extracted data by clicking on the **Export X results** button. ![Export results](./images/actor-run-dataset.png) diff --git a/sources/academy/platform/running_a_web_server.md b/sources/academy/platform/running_a_web_server.md index ed3f0e4838..6aff852f03 100644 --- a/sources/academy/platform/running_a_web_server.md +++ b/sources/academy/platform/running_a_web_server.md @@ -61,7 +61,7 @@ Now we need to read the following environment variables: - **APIFY_CONTAINER_PORT** contains a port number where we must start the server. - **APIFY_CONTAINER_URL** contains a URL under which we can access the container. -- **APIFY_DEFAULT_KEY_VALUE_STORE_ID** is simply the ID of the default key-value store of this Actor where we can store screenshots. +- **APIFY_DEFAULT_KEY_VALUE_STORE_ID** is the ID of the default key-value store of this Actor where we can store screenshots. ```js const { diff --git a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md index 0269c05aff..a40aea9fe2 100644 --- a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md +++ b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md @@ -53,7 +53,7 @@ If we send a correct POST request to one of these endpoints, the actor/actor-tas ### Additional settings {#additional-settings} -We can also add settings for the Actor (which will override the default settings) as additional query parameters. For example, if we wanted to change how much memory the Actor's run should be allocated and which build to run, we could simply add the `memory` and `build` parameters separated by `&`. +We can also add settings for the Actor (which will override the default settings) as additional query parameters. For example, if we wanted to change how much memory the Actor's run should be allocated and which build to run, we could add the `memory` and `build` parameters separated by `&`. ```cURL https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN&memory=8192&build=beta @@ -198,7 +198,7 @@ For runs longer than 5 minutes, the process consists of three steps: ### Wait for the run to finish {#wait-for-the-run-to-finish} -There may be cases where we need to simply run the Actor and go away. But in any kind of integration, we are usually interested in its output. 
We have three basic options for how to wait for the actor/task to finish. +There may be cases where we need to run the Actor and go away. But in any kind of integration, we are usually interested in its output. We have three basic options for how to wait for the actor/task to finish. - [`waitForFinish` parameter](#waitforfinish-parameter) - [Webhooks](#webhooks) @@ -218,7 +218,7 @@ Once again, the final response will be the **run info object**; however, now its #### Webhooks {#webhooks} -If you have a server, [webhooks](/platform/integrations/webhooks) are the most elegant and flexible solution for integrations with Apify. You can simply set up a webhook for any Actor or task, and that webhook will send a POST request to your server after an [event](/platform/integrations/webhooks/events) has occurred. +If you have a server, [webhooks](/platform/integrations/webhooks) are the most elegant and flexible solution for integrations with Apify. You can set up a webhook for any Actor or task, and that webhook will send a POST request to your server after an [event](/platform/integrations/webhooks/events) has occurred. Usually, this event is a successfully finished run, but you can also set a different webhook for failed runs, etc. @@ -236,7 +236,7 @@ What if you don't have a server, and the run you'd like to do is much too long t When we run the Actor with the [usual API call](#run-an-actor-or-task) shown above, we will back a response with the **run info** object. From this JSON object, we can then extract the ID of the Actor run that we just started from the `id` field. Then, we can set an interval that will poll the Apify API (let's say every 5 seconds) by calling the [**Get run**](https://apify.com/docs/api/v2#/reference/actors/run-object/get-run) endpoint to retrieve the run's status. -Simply replace the `RUN_ID` in the following URL with the ID you extracted earlier: +Replace the `RUN_ID` in the following URL with the ID you extracted earlier: ```cURL https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs/RUN_ID diff --git a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md index 73f10170cd..2741784a63 100644 --- a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md @@ -81,8 +81,7 @@ async function pageFunction(context) { ### [](#description) Description -Getting the Actor's description is a little more involved, but still pretty straightforward. We can't just simply search for a `

` tag, because -there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within +Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `

` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `

` element too, same as the title. Moreover, the actual description is nested inside a `` tag with a class `actor-description`. ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/description.webp) @@ -267,7 +266,7 @@ but it requires some clever DevTools-Fu. ### [](#analyzing-the-page) Analyzing the page -While with Web Scraper and **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)), we could get away with simply clicking a button, +While with Web Scraper and **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)), we could get away with clicking a button, with Cheerio Scraper we need to dig a little deeper into the page's architecture. For this, we will use the Network tab of the Chrome DevTools. diff --git a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md index 0f5e0dbed6..1e3195a79a 100644 --- a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md @@ -44,8 +44,7 @@ It's also much easier to work with external APIs, databases or the [Apify SDK](h in the Node.js context. The tradeoff is simple. Power vs simplicity. Web Scraper is simple, Puppeteer Scraper is powerful (and the [Apify SDK](https://sdk.apify.com) is super-powerful). -> Simply put, Web Scraper's `pageFunction` is just a single -[page.evaluate()](https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) call. +> In other words, Web Scraper's `pageFunction` is just a single [page.evaluate()](https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) call. Now that's out of the way, let's open one of the Actor detail pages in the Store, for example the Web Scraper page and use our DevTools-Fu to scrape some data. @@ -106,8 +105,7 @@ is automatically passed back to the Node.js context, so we receive an actual `st ### [](#description) Description -Getting the Actor's description is a little more involved, but still pretty straightforward. We can't just simply search for a `

` tag, because -there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within +Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `

` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `

` element too, same as the title. Moreover, the actual description is nested inside a `` tag with a class `actor-description`. ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/description.webp) diff --git a/sources/academy/tutorials/apify_scrapers/web_scraper.md b/sources/academy/tutorials/apify_scrapers/web_scraper.md index 08ed5a2b66..3fedac29d2 100644 --- a/sources/academy/tutorials/apify_scrapers/web_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/web_scraper.md @@ -80,8 +80,7 @@ async function pageFunction(context) { ### [](#description) Description -Getting the Actor's description is a little more involved, but still pretty straightforward. We can't just simply search for a `

` tag, because -there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within +Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `

` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `

` element too, same as the title. Moreover, the actual description is nested inside a `` tag with a class `actor-description`. ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/description.webp) diff --git a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md index 7f445e2d98..ac85868bb2 100644 --- a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md +++ b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md @@ -60,7 +60,7 @@ async function pageFunction(context) { } ``` -With jQuery, we're simply using the `$.getScript()` helper to fetch the script for us and wait for it to load. +With jQuery, we're using the `$.getScript()` helper to fetch the script for us and wait for it to load. ## Dealing with errors diff --git a/sources/academy/tutorials/node_js/debugging_web_scraper.md b/sources/academy/tutorials/node_js/debugging_web_scraper.md index 1bac208978..792f2f33c1 100644 --- a/sources/academy/tutorials/node_js/debugging_web_scraper.md +++ b/sources/academy/tutorials/node_js/debugging_web_scraper.md @@ -29,7 +29,7 @@ You can test a `pageFunction` code in two ways in your console: ## Pasting and running a small code snippet -Usually, you don't need to paste in the whole pageFunction as you can simply isolate the critical part of the code you are trying to debug. You will need to remove any references to the `context` object and its properties like `request` and the final return statement but otherwise, the code should work 1:1. +Usually, you don't need to paste in the whole pageFunction as you can isolate the critical part of the code you are trying to debug. You will need to remove any references to the `context` object and its properties like `request` and the final return statement but otherwise, the code should work 1:1. I will also usually remove `const` declarations on the top level variables. This helps you to run the same code many times over without needing to restart the console (you cannot declare constants more than once). My declaration will change from: @@ -56,7 +56,7 @@ $('.my-list-item').each((i, el) => { }); ``` -Now the `results` variable stays on the page and you can do whatever you wish with it. Usually, simply log it to analyze if your scraping code is correct. Writing a single expression will also log it in a browser console. +Now the `results` variable stays on the page and you can do whatever you wish with it. Log it to analyze if your scraping code is correct. Writing a single expression will also log it in a browser console. ```js results; diff --git a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md index f107388cce..0afd0912d8 100644 --- a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md +++ b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md @@ -23,7 +23,7 @@ You want to crawl a website with a proxy pool, but most of your proxies are bloc Nobody can make sure that a proxy will work infinitely. The only real solution to this problem is to use [residential proxies](/platform/proxy#residential-proxy), but they can sometimes be too costly. -However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. 
You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (simply it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually just throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with `PuppeteerCrawler`  class. +However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually just throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with `PuppeteerCrawler`  class. ### Solution @@ -162,7 +162,7 @@ const crawler = new Apify.PuppeteerCrawler({ }); ``` -We picked the session and added it to the browser as `apifyProxySession` but for userAgent, we didn't simply passed the user agent as it is but added the session name into it. That is the hack because we can retrieve the user agent from the Puppeteer browser itself. +We picked the session and added it to the browser as `apifyProxySession` but for userAgent, we didn't pass the User-Agent as it is but added the session name into it. That is the hack because we can retrieve the user agent from the Puppeteer browser itself. Now we need to retrieve the session name back in the `gotoFunction`, pass it into userData and fix the hacked userAgent back to normal so it is not suspicious for the website. @@ -202,7 +202,7 @@ Things to consider 1. Since the good and bad proxies are getting filtered over time, this solution only makes sense for crawlers with at least hundreds of requests. -2. This solution will not help you if you simply don't have enough proxies for your job. It can even get your proxies banned faster (since the good ones will be used more often), so you should be cautious about the speed of your crawl. +2. This solution will not help you if you don't have enough proxies for your job. It can even get your proxies banned faster (since the good ones will be used more often), so you should be cautious about the speed of your crawl. 3. If you are more concerned about the speed of your crawler and less about banning proxies, set the `maxSessions` parameter of `pickSession` function to a number relatively lower than your total number of proxies. If on the other hand, keeping your proxies alive is more important, set `maxSessions`  relatively higher so you will always pick new proxies. 
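
The considerations above refer to a `pickSession` helper with a `maxSessions` parameter. The article's own implementation may differ; the following is only a minimal sketch of what such a helper could look like, where the shape of the `sessions` object (session name mapped to a failure count) and the error threshold are assumptions:

```js
// Minimal sketch of a pickSession helper. The `sessions` object (name -> failure count)
// and the maxErrors threshold are illustrative assumptions.
const randomFromArray = (array) => array[Math.floor(Math.random() * array.length)];

const pickSession = (sessions, maxSessions = 100, maxErrors = 3) => {
    // Keep only sessions that haven't failed too many times (their proxy still works).
    const usable = Object.keys(sessions).filter((name) => sessions[name] < maxErrors);

    // While there is still room in the pool, create a brand new session so that
    // fresh proxies keep entering the rotation.
    if (usable.length < maxSessions) {
        const name = `session_${Math.random().toString(36).slice(2)}`;
        sessions[name] = 0;
        return name;
    }

    // Otherwise reuse one of the sessions that has been working so far.
    return randomFromArray(usable);
};

// Usage: const sessions = {}; const session = pickSession(sessions, 500);
```

A higher `maxSessions` keeps adding new sessions (and therefore new proxies), while a lower value favors reusing the sessions that have already proven to work.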
diff --git a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md index 1e4e0d9a93..d36e80262d 100644 --- a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md +++ b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md @@ -39,7 +39,7 @@ With [PuppeteerCrawler](/sdk/js/docs/api/puppeteer-crawler) the situation is a l The straightforward solution would be to set the 'retireInstanceAfterRequestCount' option to 1. PuppeteerCrawler would then rotate the proxies in the same way as BasicCrawler. While this approach could sometimes be useful for the toughest websites, the price you pay is in performance. Restarting the browser is an expensive operation. -That's why PuppeteerCrawler offers a utility retire() function through a PuppeteerPool class. You can access PuppeteerPool by simply passing it into the object parameter of gotoFunction or handlePageFunction. +That's why PuppeteerCrawler offers a utility retire() function through a PuppeteerPool class. You can access PuppeteerPool by passing it into the object parameter of gotoFunction or handlePageFunction. ```js const crawler = new PuppeteerCrawler({ diff --git a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md index af13474883..81dc61db1c 100644 --- a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md +++ b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md @@ -17,7 +17,7 @@ The `Target closed` error happens when you try to access the `page` object (or s ![Chrome crashed tab](./images/chrome-crashed-tab.png) -Browsers create a separate process for each tab. That means each tab lives with a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot simply close your old tabs to free extra memory so it will usually kill your current memory hungry tab. +Browsers create a separate process for each tab. That means each tab lives with a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot close your old tabs to free extra memory so it will usually kill your current memory hungry tab. ### Memory solution diff --git a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md index bc065fcf4d..261b408a83 100644 --- a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md +++ b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md @@ -41,7 +41,7 @@ const bodyHTML = await context.page.evaluate(() => { }); ``` -The `context.page.evaluate()` call executes the provided function in the browser environment and passes back the return value back to the Node.js environment. One very important caveat though! Since we're in different environments, we cannot simply use our existing variables, such as `context` inside of the evaluated function, because they are not available there. Different environments, different variables. +The `context.page.evaluate()` call executes the provided function in the browser environment and passes back the return value back to the Node.js environment. One very important caveat though! Since we're in different environments, we cannot use our existing variables, such as `context` inside of the evaluated function, because they are not available there. Different environments, different variables. 
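
Values from the Node.js scope can still reach the browser context if you pass them as extra arguments to `page.evaluate()`; they are serialized and handed to the evaluated function. A small sketch meant to run inside a `pageFunction` (the selector is only illustrative):

```js
// The selector lives in Node.js, but is passed into the browser as an argument.
const selector = '.actor-description';
const description = await context.page.evaluate((sel) => {
    // This function runs in the browser, so `sel` arrives as a serialized copy.
    const element = document.querySelector(sel);
    return element ? element.textContent.trim() : null;
}, selector);
```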
_See the_ `page.evaluate()` _[documentation](https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) for info on how to pass variables from Node.js to browser._ diff --git a/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md b/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md index 4d52907787..d0b5b979d5 100644 --- a/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md +++ b/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md @@ -36,7 +36,7 @@ async function isPaymentSuccessful() { } ``` -**Avoid**: Relying on the absence of an element that may have been simply updated or changed. +**Avoid**: Relying on the absence of an element that may have been updated or changed. ```js async function isPaymentSuccessful() { diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md index f4393ff0bc..b73f86132a 100644 --- a/sources/academy/webscraping/anti_scraping/index.md +++ b/sources/academy/webscraping/anti_scraping/index.md @@ -16,7 +16,7 @@ If at any point in time you've strayed away from the Academy's demo content, and This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more. -In development, it is crucial to check and adjust the configurations related to our next lessons' topics, as simply doing this can fix blocking issues on the majority of websites. +In development, it is crucial to check and adjust the configurations related to our next lessons' topics, as doing this can fix blocking issues on the majority of websites. ## Quick start {#quick-start} diff --git a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md index 1a689ef7e4..99962e2d5a 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md @@ -103,7 +103,7 @@ Here's an example of multiple WebGL scenes visibly being rendered differently on The [AudioContext](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext) API represents an audio-processing graph built from audio modules linked together, each represented by an [AudioNode](https://developer.mozilla.org/en-US/docs/Web/API/AudioNode) ([OscillatorNode](https://developer.mozilla.org/en-US/docs/Web/API/OscillatorNode)). -In the simplest cases, the fingerprint can be obtained by simply checking for the existence of AudioContext. However, this doesn't provide very much information. In advanced cases, the technique used to collect a fingerprint from AudioContext is quite similar to the `` method: +In the simplest cases, the fingerprint can be obtained by checking for the existence of AudioContext. However, this doesn't provide very much information. In advanced cases, the technique used to collect a fingerprint from AudioContext is quite similar to the `` method: 1. Audio is passed through an OscillatorNode. 2. The signal is processed and collected. 
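
A browser-side sketch of the oscillator technique outlined in the steps above. Real fingerprinting scripts are more elaborate; this only shows the principle of rendering a known signal off-screen and deriving a number from the resulting samples:

```js
// Browser-side sketch of an oscillator-based audio fingerprint.
async function getAudioFingerprint() {
    // Render off-screen so no sound is played: 1 channel, 44100 samples, 44.1 kHz.
    const ctx = new OfflineAudioContext(1, 44100, 44100);

    const oscillator = ctx.createOscillator();
    oscillator.type = 'triangle';
    oscillator.frequency.value = 10000;

    // A compressor amplifies the tiny differences between audio stacks.
    const compressor = ctx.createDynamicsCompressor();
    oscillator.connect(compressor);
    compressor.connect(ctx.destination);
    oscillator.start(0);

    const buffer = await ctx.startRendering();
    const samples = buffer.getChannelData(0);

    // Summing a slice of the rendered samples yields a value that differs
    // slightly across devices, drivers and browsers.
    let sum = 0;
    for (let i = 4500; i < 5000; i++) sum += Math.abs(samples[i]);
    return sum.toString();
}
```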
diff --git a/sources/academy/webscraping/anti_scraping/techniques/index.md b/sources/academy/webscraping/anti_scraping/techniques/index.md index 95ab678cfe..b1dfdb3ee3 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/index.md +++ b/sources/academy/webscraping/anti_scraping/techniques/index.md @@ -27,7 +27,7 @@ Probably the most common blocking method. The website gives you a chance to prov ## Redirect {#redirect} -Another common method is simply redirecting to the home page of the site (or a different location). +Another common method is redirecting to the home page of the site (or a different location). ## Request timeout/Socket hangup {#request-timeout} diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index 71fe1105c5..1eaea1294c 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -17,7 +17,7 @@ If you've never dealt with it before, trying to scrape thousands to hundreds of ## Page-number pagination {#page-number} -The most common and rudimentary form of pagination is simply having page numbers, which can be compared to paginating through a typical e-commerce website. +The most common and rudimentary forms of pagination have page numbers. Imagine paginating through a typical e-commerce website. ![Amazon pagination](https://apify-docs.s3.amazonaws.com/master/docs/assets/tutorials/images/pagination.jpg) diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md index 0a433b31b1..836feb6693 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md @@ -44,7 +44,7 @@ Finally, create a file called **index.js**. This is the file we will be working If we remember from the last lesson, we need to pass a valid "app token" within the **X-App-Token** header of every single request we make, or else we will be blocked. When testing queries, we just copied this value straight from the **Network** tab; however, since this is a dynamic value, we should farm it. -Since we know requests with this header are sent right when the front page is loaded, it can be farmed by simply visiting the page and intercepting requests in Puppeteer like so: +Since we know requests with this header are sent right when the front page is loaded, it can be farmed by visiting the page and intercepting requests in Puppeteer like so: ```js // scrapeAppToken.js diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md index e8e28d3cb9..34bfbbfd9a 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md @@ -124,7 +124,7 @@ const emailsToSend = [ ]; ``` -What we could do is log in 3 different times, then simply automate the sending of each email; however, this is extremely inefficient. 
When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the [cookies](../../../glossary/concepts/http_cookies.md) stored in your browser. These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. +What we could do is log in 3 different times, then automate the sending of each email; however, this is extremely inefficient. When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the [cookies](../../../glossary/concepts/http_cookies.md) stored in your browser. These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. With this knowledge of cookies, it can be concluded that we can just pass the cookies generated by the code above right into each new browser context that we use to send each email. That way, we won't have to run the login flow each time. diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md index 66ec431c2c..1bb718d015 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md @@ -22,7 +22,7 @@ import * as fs from 'fs/promises'; import request from 'request-promise'; ``` -The actual downloading is slightly different for text and binary files. For a text file, it can simply be done like this: +The actual downloading is slightly different for text and binary files. For a text file, it can be done like this: ```js const fileData = await request('https://some-site.com/file.txt'); diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md index 3261e08a95..16a0a04794 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md @@ -71,7 +71,7 @@ await browser.close(); ## Exposing functions {#exposing-functions} -Here's a super awesome function we've created called `returnMessage()`, which simply returns the string **Apify Academy!**: +Here's a super awesome function we've created called `returnMessage()`, which returns the string **Apify Academy!**: ```js const returnMessage = () => 'Apify academy!'; diff --git a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md index fe3eae068b..a47697d48b 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md @@ -58,7 +58,7 @@ Now, we won't see the error message anymore, and the first result will be succes If we remember properly, after clicking the first result, we want to console log the title of the result's page and save a screenshot into the filesystem. In order to grab a solid screenshot of the loaded page though, we should **wait for navigation** before snapping the image. 
This can be done with [`page.waitForNavigation()`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-pagewaitfornavigationoptions). -> A navigation is simply when a new [page load](../../../glossary/concepts/dynamic_pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire. +> A navigation is when a new [page load](../../../glossary/concepts/dynamic_pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire. Naively, you might immediately think that this is the way we should wait for navigation after clicking the first result: diff --git a/sources/academy/webscraping/switching_to_typescript/enums.md b/sources/academy/webscraping/switching_to_typescript/enums.md index 252ff2cc98..56cdd4b0d1 100644 --- a/sources/academy/webscraping/switching_to_typescript/enums.md +++ b/sources/academy/webscraping/switching_to_typescript/enums.md @@ -66,7 +66,7 @@ Because of the custom type definition for `fileExtensions` and the type annotati ## Creating enums {#creating-enums} -The [`enum`](https://www.typescriptlang.org/docs/handbook/enums.html) keyword is a new keyword brought to us by TypeScript that allows us the same functionality we implemented in the above section, plus more. To create one, simply use the keyword followed by the name you'd like to use (the naming convention is generally **CapitalizeEachFirstLetterAndSingular**). +The [`enum`](https://www.typescriptlang.org/docs/handbook/enums.html) keyword is a new keyword brought to us by TypeScript that allows us the same functionality we implemented in the above section, plus more. To create one, use the keyword followed by the name you'd like to use (the naming convention is generally **CapitalizeEachFirstLetterAndSingular**). ```ts enum FileExtension { @@ -80,7 +80,7 @@ enum FileExtension { ## Using enums {#using-enums} -Using enums is straightforward. Simply use dot notation as you normally would with a regular object. +Using enums is straightforward. Use dot notation as you would with a regular object. ```ts enum FileExtension { diff --git a/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md b/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md index 080fc907d2..7b36915500 100644 --- a/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md +++ b/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md @@ -107,7 +107,7 @@ let job: undefined | string; const chars = job.split(''); ``` -TypeScript will yell at you when trying to compile this code, stating that **Object is possibly 'undefined'**, which is true. In order to assert that `job` will not be `undefined` in this case, we can simply add an exclamation mark before the dot. +TypeScript will yell at you when trying to compile this code, stating that **Object is possibly 'undefined'**, which is true. To assert that `job` will not be `undefined` in this case, we can add an exclamation mark before the dot. 
```ts let job: undefined | string; diff --git a/sources/academy/webscraping/switching_to_typescript/using_types_continued.md b/sources/academy/webscraping/switching_to_typescript/using_types_continued.md index ab68ab5e7b..eda345b706 100644 --- a/sources/academy/webscraping/switching_to_typescript/using_types_continued.md +++ b/sources/academy/webscraping/switching_to_typescript/using_types_continued.md @@ -84,7 +84,7 @@ const course2 = { }; ``` -Then, in the type definition, we can add a `typesLearned` key. Then, by simply writing the type that the array's elements are followed by two square brackets (`[]`), we can form an array type. +Then, in the type definition, we can add a `typesLearned` key. Then, by writing the type that the array's elements are followed by two square brackets (`[]`), we can form an array type. ```ts const course: { diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md index 5519473774..34d4961aaa 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md @@ -18,7 +18,7 @@ Web pages are full of links, but frankly, most of them are useless to us when sc ## Filtering with unique CSS selectors {#css-filtering} -In the previous lesson, we simply grabbed all the links from the HTML document. +In the previous lesson, we grabbed all the links from the HTML document. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md index da5c798251..e239cdb59c 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -A headless browser is simply a browser that runs without a user interface (UI). This means that it's normally controlled by automated scripts. Headless browsers are very popular in scraping because they can help you render JavaScript or programmatically behave like a human user to prevent blocking. The two most popular libraries for controlling headless browsers are [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/). **Crawlee** supports both. +A headless browser is a browser that runs without a user interface (UI). This means that it's normally controlled by automated scripts. Headless browsers are very popular in scraping because they can help you render JavaScript or programmatically behave like a human user to prevent blocking. The two most popular libraries for controlling headless browsers are [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/). **Crawlee** supports both. ## Building a Playwright scraper {#playwright-scraper} @@ -85,7 +85,7 @@ The `parseWithCheerio` function is available even in `CheerioCrawler` and all th When you run the code with `node browser.js`, you'll see a browser window open and then the individual pages getting scraped, each in a new browser tab. -So, that's it. In 4 lines of code, we transformed our crawler from a static HTTP crawler to a headless browser crawler. The crawler now runs exactly the same as before, but uses a Chromium browser instead of plain HTTP requests. This simply is not possible without Crawlee. +So, that's it. 
In 4 lines of code, we transformed our crawler from a static HTTP crawler to a headless browser crawler. The crawler now runs exactly the same as before, but uses a Chromium browser instead of plain HTTP requests. This isn't possible without Crawlee. Using Playwright in combination with Cheerio like this is only one of many ways how you can utilize Playwright (and Puppeteer) with Crawlee. In the advanced courses of the Academy, we will go deeper into using headless browsers for scraping and web automation (RPA) use cases. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md index 8ab7b5e525..f9487c80a8 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md @@ -52,7 +52,7 @@ for (const link of productLinks) { } ``` -When you run this file in your terminal, you'll immediately see the difference. Unlike in the browser, where looping over elements produced absolute URLs, here in Node.js it only produces the relative ones. This is bad, because we can't use the relative URLs to crawl. They simply don't include all the necessary information. +When you run this file in your terminal, you'll immediately see the difference. Unlike in the browser, where looping over elements produced absolute URLs, here in Node.js it only produces the relative ones. This is bad, because we can't use the relative URLs to crawl. They don't include all the necessary information. ## Resolving URLs {#resolving-urls} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md index 724df66e79..9f598c6406 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md @@ -15,7 +15,7 @@ Before you can start writing scraper code, you need to have your computer set up ## Install Node.js {#install-node} -Let's start with the installation of Node.js. Node.js is an engine for running JavaScript, quite similar to the browser console we used in the previous lessons. You feed it JavaScript code, and it executes it for you. Why not just use the browser console? Simply put, because it's limited in its capabilities. Node.js is way more powerful and is much better suited for coding scrapers. +Let's start with the installation of Node.js. Node.js is an engine for running JavaScript, quite similar to the browser console we used in the previous lessons. You feed it JavaScript code, and it executes it for you. Why not just use the browser console? Because it's limited in its capabilities. Node.js is way more powerful and is much better suited for coding scrapers. If you're on macOS, use [this tutorial to install Node.js](https://blog.apify.com/how-to-install-nodejs/). If you're using Windows [visit the official Node.js website](https://nodejs.org/en/download/). And if you're on Linux, just use your package manager to install `nodejs`. 
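
As a companion to the **Resolving URLs** section referenced above, relative links can be turned into absolute ones with the `URL` class built into Node.js; the URLs below are only illustrative:

```js
// Resolving relative links against the page URL with Node.js's built-in URL class.
const baseUrl = 'https://example.com/collections/sales';
const relativeLinks = ['/products/laptop-15', '/products/headphones'];

const absoluteUrls = relativeLinks.map((link) => new URL(link, baseUrl).href);
console.log(absoluteUrls);
// [ 'https://example.com/products/laptop-15', 'https://example.com/products/headphones' ]
```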
From 14d0b97d1cf62adde9e09cba91055358670934fe Mon Sep 17 00:00:00 2001 From: Honza Javorek Date: Thu, 25 Apr 2024 17:11:51 +0200 Subject: [PATCH 04/15] style: remove unnecessary 'just' occurences --- .../glossary/tools/edit_this_cookie.md | 2 +- sources/academy/glossary/tools/insomnia.md | 2 +- sources/academy/glossary/tools/postman.md | 2 +- .../deploying_your_code/docker_file.md | 6 +-- .../platform/deploying_your_code/index.md | 17 ++++----- .../actors_webhooks.md | 2 +- .../migrations_maintaining_state.md | 2 +- .../solutions/integrating_webhooks.md | 2 +- .../solutions/rotating_proxies.md | 2 +- .../solutions/saving_stats.md | 2 +- .../solutions/using_storage_creating_tasks.md | 2 +- .../get_most_of_actors/actor_readme.md | 2 +- .../get_most_of_actors/seo_and_promotion.md | 6 +-- .../platform/getting_started/actors.md | 4 +- .../platform/getting_started/apify_client.md | 4 +- .../getting_started/creating_actors.md | 4 +- .../run_actor_and_retrieve_data_via_api.md | 8 ++-- .../apify_scrapers/cheerio_scraper.md | 12 +++--- .../apify_scrapers/getting_started.md | 38 +++++++++---------- .../academy/tutorials/apify_scrapers/index.md | 2 +- .../apify_scrapers/puppeteer_scraper.md | 25 ++++++------ .../tutorials/apify_scrapers/web_scraper.md | 20 +++++----- .../add_external_libraries_web_scraper.md | 4 +- .../analyzing_pages_and_fixing_errors.md | 4 +- .../node_js/block_requests_puppeteer.md | 2 +- .../node_js/debugging_web_scraper.md | 2 +- .../filter_blocked_requests_using_sessions.md | 6 +-- .../handle_blocked_requests_puppeteer.md | 2 +- .../node_js/how_to_fix_target_closed.md | 2 +- sources/academy/tutorials/node_js/index.md | 2 +- .../node_js/request_labels_in_apify_actors.md | 4 +- .../node_js/when_to_use_puppeteer_scraper.md | 8 ++-- sources/academy/tutorials/php/index.md | 2 +- .../tutorials/php/using_apify_from_php.md | 2 +- sources/academy/tutorials/python/index.md | 2 +- .../python/process_data_using_python.md | 4 +- .../tutorials/python/scrape_data_python.md | 6 +-- .../scraping_paginated_sites.md | 12 +++--- .../webscraping/anti_scraping/index.md | 6 +-- .../anti_scraping/mitigation/proxies.md | 2 +- .../anti_scraping/mitigation/using_proxies.md | 2 +- .../techniques/browser_challenges.md | 2 +- .../anti_scraping/techniques/geolocation.md | 2 +- .../handling_pagination.md | 2 +- .../locating_and_learning.md | 4 +- .../graphql_scraping/custom_queries.md | 6 +-- .../graphql_scraping/introspection.md | 10 ++--- .../graphql_scraping/modifying_variables.md | 4 +- .../common_use_cases/index.md | 2 +- .../logging_into_a_website.md | 2 +- .../paginating_through_results.md | 2 +- .../executing_scripts/extracting_data.md | 2 +- .../webscraping/puppeteer_playwright/index.md | 4 +- .../puppeteer_playwright/page/index.md | 2 +- .../puppeteer_playwright/proxies.md | 2 +- .../best_practices.md | 2 +- .../challenge/scraping_amazon.md | 4 +- .../crawling/exporting_data.md | 2 +- .../crawling/finding_links.md | 2 +- .../crawling/first_crawl.md | 2 +- .../crawling/pro_scraping.md | 2 +- .../data_extraction/browser_devtools.md | 4 +- .../data_extraction/computer_preparation.md | 4 +- 63 files changed, 152 insertions(+), 156 deletions(-) diff --git a/sources/academy/glossary/tools/edit_this_cookie.md b/sources/academy/glossary/tools/edit_this_cookie.md index 9ce785d944..e2c7f942c8 100644 --- a/sources/academy/glossary/tools/edit_this_cookie.md +++ b/sources/academy/glossary/tools/edit_this_cookie.md @@ -25,7 +25,7 @@ Clicking this button will remove all cookies associated with the current domain. 
### Reset -Basically just a refresh button. +A refresh button. ### Add a new cookie diff --git a/sources/academy/glossary/tools/insomnia.md b/sources/academy/glossary/tools/insomnia.md index dc3523f19e..760c9dc67c 100644 --- a/sources/academy/glossary/tools/insomnia.md +++ b/sources/academy/glossary/tools/insomnia.md @@ -66,4 +66,4 @@ This will bring up the **Manage cookies** window, where all cached cookies can b ## Postman or Insomnia {#postman-or-insomnia} -The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, just choose the one that has the most intuitive interface for you. +The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, choose the one that has the most intuitive interface for you. diff --git a/sources/academy/glossary/tools/postman.md b/sources/academy/glossary/tools/postman.md index d1671cc68a..ac24bf9c03 100644 --- a/sources/academy/glossary/tools/postman.md +++ b/sources/academy/glossary/tools/postman.md @@ -43,7 +43,7 @@ In order to use a proxy, the proxy's server and configuration must be provided i ![Proxy configuration in Postman settings](./images/postman-proxy.png) -After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings just needs to be un-ticked to disable it. +After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings needs to be un-ticked to disable it. ## Managing the cookies cache {#managing-cookies} diff --git a/sources/academy/platform/deploying_your_code/docker_file.md b/sources/academy/platform/deploying_your_code/docker_file.md index 81ab4704c5..43e0902dc5 100644 --- a/sources/academy/platform/deploying_your_code/docker_file.md +++ b/sources/academy/platform/deploying_your_code/docker_file.md @@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem'; The **Dockerfile** is a file which gives the Apify platform (or Docker, more specifically) instructions on how to create an environment for your code to run in. Every Actor must have a Dockerfile, as Actors run in Docker containers. -> Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to just run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, GO, etc). +> Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, GO, etc). ## Base images {#base-images} @@ -24,7 +24,7 @@ If your project doesn’t already contain a Dockerfile, don’t worry! 
Apify off > Tip: You can see all of Apify's Docker images [on DockerHub](https://hub.docker.com/r/apify/). -At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or just install them yourself during the build step. +At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or install them yourself during the build step. Once you find the base image you need, you can add it as the initial `FROM` statement: @@ -111,7 +111,7 @@ CMD python3 main.py ## Examples {#examples} -The examples we just showed were for Node.js and Python, however, to drive home the fact that Actors can be written in any language, here are some examples of some Dockerfiles for Actors written in different programming languages: +The examples above show how to deploy Actors written in Node.js or Python, but you can use any language. As an inspiration, here are a few examples for other languages: Go, Rust, Julia. diff --git a/sources/academy/platform/deploying_your_code/index.md b/sources/academy/platform/deploying_your_code/index.md index cbbe233c1f..c016bd8bb5 100644 --- a/sources/academy/platform/deploying_your_code/index.md +++ b/sources/academy/platform/deploying_your_code/index.md @@ -1,6 +1,6 @@ --- title: Deploying your code -description: In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor in just a few minutes! +description: In this course learn how to take an existing project of yours and deploy it to the Apify platform as an actor. sidebar_position: 9 category: apify platform slug: /deploying-your-code @@ -11,25 +11,24 @@ import TabItem from '@theme/TabItem'; # Deploying your code to Apify {#deploying} -**In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor in just a few minutes!** +**In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor.** --- This section will discuss how to use your newfound knowledge of the Apify platform and Actors from the [**Getting started**](../getting_started/index.md) section to deploy your existing project's code to the Apify platform as an Actor. - -Because Actors are basically just chunks of code running in Docker containers, you're able to **_Actorify_** just about anything! +Any program running in a Docker container can become an Apify Actor. ![The deployment workflow](../../images/deployment-workflow.png) -Actors are language agnostic, which means that the language your project is written in does not affect your ability to actorify it. +Apify provides detailed guidance on how to deploy Node.js and Python programs as Actors, but apart from that you're not limited in what programming language you choose for your scraper. 
![Supported languages](../../images/supported-languages.jpg) -Though the majority of Actors currently on the platform were written in Node.js, and despite the fact our current preferred languages are JavaScript and Python, there are a few examples of Actors in other languages: +Here are a few examples of Actors in other languages: -- [Actor written in Rust](https://apify.com/lukaskrivka/rust-actor-example) -- [GO Actor](https://apify.com/jirimoravcik/go-actor-example) -- [Actor written with Julia](https://apify.com/jirimoravcik/julia-actor-example) +- [Rust actor](https://apify.com/lukaskrivka/rust-actor-example) +- [Go actor](https://apify.com/jirimoravcik/go-actor-example) +- [Julia actor](https://apify.com/jirimoravcik/julia-actor-example) ## The "actorification" workflow {#workflow} diff --git a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md index 4beda0b22f..bbf31f7306 100644 --- a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md +++ b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md @@ -17,7 +17,7 @@ Thus far, you've run Actors on the platform and written an Actor of your own, wh In this course, we'll be working out of the Amazon scraper project from the **Web scraping for beginners** course. If you haven't already built that project, you can do it in three short lessons [here](../../webscraping/web_scraping_for_beginners/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same. -Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is basically just a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile). +Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile). ## Webhooks {#webhooks} diff --git a/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md b/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md index c3f9d5e155..c59da80531 100644 --- a/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md +++ b/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md @@ -11,7 +11,7 @@ slug: /expert-scraping-with-apify/migrations-maintaining-state --- -We already know that Actors are basically just Docker containers that can be run on any server. This means that they can be allocated anywhere there is space available, making them very efficient. Unfortunately, there is one big caveat: Actors move - a lot. When an Actor moves, it is called a **migration**. 
+We already know that Actors are Docker containers that can be run on any server. This means that they can be allocated anywhere there is space available, making them very efficient. Unfortunately, there is one big caveat: Actors move - a lot. When an Actor moves, it is called a **migration**. On migration, the process inside of an Actor is completely restarted and everything in its memory is lost, meaning that any values stored within variables or classes are lost. diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md b/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md index f8740cfa3f..f2fe3b8809 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md @@ -68,7 +68,7 @@ const filtered = items.reduce((acc, curr) => { }, {}); ``` -The results should be an array, so finally, we can take the map we just created and push an array of all of its values to the Actor's default dataset: +The results should be an array, so we can take the map we just created and push an array of its values to the Actor's default dataset: ```js await Actor.pushData(Object.values(filtered)); diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md index 9e4beced8b..ee9cf2960a 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md @@ -73,7 +73,7 @@ And that's it! We've successfully configured the session pool to match the task' ## Limiting proxy location {#limiting-proxy-location} -The final requirement was to only use proxies from the US. Back in our **ProxyConfiguration**, we just need to add the **countryCode** key and set it to **US**: +The final requirement was to use proxies only from the US. Back in our **ProxyConfiguration**, we need to add the **countryCode** key and set it to **US**: ```js const proxyConfiguration = await Actor.createProxyConfiguration({ diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md index a8bcd851b4..2fab8d5187 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md @@ -88,7 +88,7 @@ const crawler = new CheerioCrawler({ ## Tracking total saved {#tracking-total-saved} -Now, we'll just increment our **totalSaved** count for every offer added to the dataset. +Now, we'll increment our **totalSaved** count for every offer added to the dataset. 
```js router.addHandler(labels.OFFERS, async ({ $, request }) => { diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md index 32c306b503..87ae6b9104 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md @@ -108,7 +108,7 @@ export const CHEAPEST_ITEM = 'CHEAPEST-ITEM'; ## Code check-in {#code-check-in} -Just to ensure we're all on the same page, here is what the **main.js** file looks like now: +Here is what the **main.js** file looks like now: ```js // main.js diff --git a/sources/academy/platform/get_most_of_actors/actor_readme.md b/sources/academy/platform/get_most_of_actors/actor_readme.md index 06054cb82d..832a70686a 100644 --- a/sources/academy/platform/get_most_of_actors/actor_readme.md +++ b/sources/academy/platform/get_most_of_actors/actor_readme.md @@ -71,7 +71,7 @@ Aim for sections 1–6 below and try to include at least 300 words. You can move 6. **Input** - - Each Actor detail page has an input tab, so you just need to refer to that. If you like, you can add a screenshot showing the user what the input fields will look like. + - Refer to the input tab on Actor's detail page. If you like, you can add a screenshot showing the user what the input fields will look like. - This is an example of how to refer to the input tab: > Twitter Scraper has the following input options. Click on the [input tab](https://apify.com/vdrmota/twitter-scraper/input-schema) for more information. diff --git a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md index 20a57e55cd..4e959b17c0 100644 --- a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md +++ b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md @@ -41,7 +41,7 @@ The best combinations are those with **high search volume** and **low competitio - Page body (e.g., README). - The texts in your links. -> While crafting your content with keywords, beware of [over-optimizing or keyword stuffing](https://yoast.com/over-optimized-website/) the page. You can use synonyms or related keywords to help this. Google is smart enough to evaluate the page based on how well the whole topic is covered (not just based on keywords), but using them helps. +> While crafting your content with keywords, beware of [over-optimizing or keyword stuffing](https://yoast.com/over-optimized-website/) the page. You can use synonyms or related keywords to help this. Google is smart enough to evaluate the page based on how well the whole topic is covered (not only by keywords), but using them helps. ## Optimizing your Actor details @@ -49,9 +49,7 @@ While blog posts and promotion are important, your Actor is the main product. He ### Name -The Actor name is your Actor's developer-style name, which is prefixed by your username (e.g. `jancurn/find-broken-links`). The name is used to generate URL used for your Actor (e.g. ), making it an important signal for search engines. - -However, the name should also be readable and clear enough, so that people using your Actor can understand what it does just from the name. +The Actor name is your Actor's developer-style name, which is prefixed by your username (e.g. `jancurn/find-broken-links`). 
The name is used to generate URL used for your Actor (e.g. ), making it an important signal for search engines. The name should also be readable and clear enough, so that people using your Actor can understand what it does. [Read more about naming your Actor](./naming_your_actor.md)!. diff --git a/sources/academy/platform/getting_started/actors.md b/sources/academy/platform/getting_started/actors.md index e74a01cd6d..caf4ac6966 100644 --- a/sources/academy/platform/getting_started/actors.md +++ b/sources/academy/platform/getting_started/actors.md @@ -19,7 +19,7 @@ When you deploy your script to the Apify platform, it is then called an **Actor* Once an Actor has been pushed to the Apify platform, they can be shared to the world through the [Apify Store](https://apify.com/store), and even monetized after going public. -> Though the majority of Actors that are currently on the Apify platform are scrapers, crawlers, or automation software, Actors are not limited to just scraping. They are just pieces of code running in Docker containers, which means they can be used for nearly anything. +> Though the majority of Actors that are currently on the Apify platform are scrapers, crawlers, or automation software, Actors are not limited to scraping. They can be any program running in a Docker container. ## Actors on the Apify platform {#actors-on-platform} @@ -29,7 +29,7 @@ On the front page of the Actor, click the green **Try for free** button. If you' ![Actor configuration](./images/seo-actor-config.png) -This is where we can provide input to the Actor. The defaults here are just fine, so we'll just leave it as is and click the green **Start** button to run it. While the Actor is running, you'll see it log some information about itself. +This is where we can provide input to the Actor. The defaults here are just fine, so we'll leave it as is and click the green **Start** button to run it. While the Actor is running, you'll see it log some information about itself. ![Actor logs](./images/actor-logs.jpg) diff --git a/sources/academy/platform/getting_started/apify_client.md b/sources/academy/platform/getting_started/apify_client.md index df0edcecb1..27364ffeee 100644 --- a/sources/academy/platform/getting_started/apify_client.md +++ b/sources/academy/platform/getting_started/apify_client.md @@ -242,7 +242,7 @@ actor = client.actor('YOUR_USERNAME/adding-actor') -Then, we'll just call the `.update()` method on the `actor` variable we created and pass in our new **default run options**: +Then, we'll call the `.update()` method on the `actor` variable we created and pass in our new **default run options**: @@ -274,7 +274,7 @@ After running the code, go back to the **Settings** page of **adding-actor**. If ## Overview {#overview} -You can do so much more with the Apify client than just running Actors, updating Actors, and downloading dataset items. The purpose of this lesson was just to get you comfortable using the client in your own projects, as it's the absolute best developer tool for integrating the Apify platform with an external system. +You can do so much more with the Apify client than running Actors, updating Actors, and downloading dataset items. The purpose of this lesson was to get you comfortable using the client in your own projects, as it's the absolute best developer tool for integrating the Apify platform with an external system. 
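For a rough idea of what the `.update()` call from earlier in this lesson looks like end to end, here is a minimal sketch using the JavaScript `apify-client` package. The Actor name and the values inside `defaultRunOptions` are placeholders — swap in your own Actor and the run options you actually want.

```js
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify token (assumed to be in an environment variable).
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Point the client at your Actor — 'YOUR_USERNAME/adding-actor' is a placeholder.
const actorClient = client.actor('YOUR_USERNAME/adding-actor');

// Update the Actor's default run options. The exact values here are illustrative.
await actorClient.update({
    defaultRunOptions: {
        build: 'latest',
        memoryMbytes: 256,
        timeoutSecs: 20,
    },
});
```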
For a more in-depth understanding of the Apify API client, give these a quick lookover: diff --git a/sources/academy/platform/getting_started/creating_actors.md b/sources/academy/platform/getting_started/creating_actors.md index 5e952f8d30..cc96bb720a 100644 --- a/sources/academy/platform/getting_started/creating_actors.md +++ b/sources/academy/platform/getting_started/creating_actors.md @@ -30,7 +30,7 @@ You'll be presented with a page featuring two ways to get started with a new Act ## Creating Actor from existing source code {#existing-source-code} -If you already have your code hosted by a Git provider, you can use it to create an Actor by linking the repository. If you use GitHub, you can use our [GitHub integration](/platform/integrations/github) to create an Actor from your public or private repository with just a few clicks. You can also use GitLab, Bitbucket or other Git providers or external repositories. +If you already have your code hosted by a Git provider, you can use it to create an Actor by linking the repository. If you use GitHub, you can use our [GitHub integration](/platform/integrations/github) to create an Actor from your public or private repository. You can also use GitLab, Bitbucket or other Git providers or external repositories. ![Create an Actor from Git repository](./images/create-actor-git.png) @@ -135,7 +135,7 @@ The Actor takes the `url` from the input and then: The extracted data is stored in the [Dataset](/platform/storage/dataset) where you can preview it and download it. We'll show how to do that later in [Run the Actor](#run-the-actor) section. -> Feel free to play around with the code and add some more features to it. For example, you can extract all the links from the page or extract all the images or completely change the logic of this template. Just keep in mind that this template uses [input schema](/academy/deploying-your-code/input-schema) defined in the `.actor/input_schema.json` file and linked to the `.actor/actor.json`. If you want to change the input schema, you need to change it in those files as well. Learn more about the Actor input and output [in the next page](/academy/getting-started/inputs-outputs). +> Feel free to play around with the code and add some more features to it. For example, you can extract all the links from the page or extract all the images or completely change the logic of this template. Keep in mind that this template uses [input schema](/academy/deploying-your-code/input-schema) defined in the `.actor/input_schema.json` file and linked to the `.actor/actor.json`. If you want to change the input schema, you need to change it in those files as well. Learn more about the Actor input and output [in the next page](/academy/getting-started/inputs-outputs). ## Build the Actor 🧱 {#build-an-actor} diff --git a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md index a40aea9fe2..2d30e87866 100644 --- a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md +++ b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md @@ -23,7 +23,7 @@ If the Actor being run via API takes 5 minutes or less to complete a typical run ## Run an Actor or task {#run-an-actor-or-task} -> If you are unsure about the differences between an Actor and a task, you can read about them in the [tasks](/platform/actors/running/tasks) documentation. In brief, tasks are just pre-configured inputs for Actors. 
+> If you are unsure about the differences between an Actor and a task, you can read about them in the [tasks](/platform/actors/running/tasks) documentation. In brief, tasks are pre-configured inputs for Actors. The API endpoints and usage (for both sync and async) for [Actors](/api/v2#/reference/actors/run-collection/run-actor) and [tasks](/api/v2#/reference/actor-tasks/run-collection/run-task) are essentially the same. @@ -43,7 +43,7 @@ The URL of [POST request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Meth https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN ``` -For tasks, we can just switch the path from **acts** to **actor-tasks** and keep the rest the same: +For tasks, we can switch the path from **acts** to **actor-tasks** and keep the rest the same: ```cURL https://api.apify.com/v2/actor-tasks/TASK_NAME_OR_ID/runs?token=YOUR_TOKEN @@ -224,7 +224,7 @@ Usually, this event is a successfully finished run, but you can also set a diffe ![Webhook example](./images/webhook.png) -The webhook will send you a pretty complicated [JSON object](/platform/integrations/webhooks/actions), but usually, you would only be interested in the `resource` object within the response, which is essentially just the **run info** JSON from the previous sections. We can leave the payload template as is for our example since it is all we need. +The webhook will send you a pretty complicated [JSON object](/platform/integrations/webhooks/actions), but usually, you would only be interested in the `resource` object within the response, which is like the **run info** JSON from the previous sections. We can leave the payload template as is for our example since it is all we need. Once your server receives this request from the webhook, you know that the event happened, and you can ask for the complete data. @@ -254,7 +254,7 @@ The **run info** JSON also contains the IDs of the default [dataset](/platform/s > If you are scraping products, or any list of items with similar fields, the [dataset](/platform/storage/dataset) should be your storage of choice. Don't forget though, that dataset items are immutable. This means that you can only add to the dataset, and not change the content that is already inside it. -Retrieving the data from a dataset is simple. Just send a GET request to the [**Get items**](/api/v2#/reference/datasets/item-collection/get-items) endpoint and pass the `defaultDatasetId` into the URL. For a GET request to the default dataset, no token is needed. +Retrieving the data from a dataset is simple. Send a GET request to the [**Get items**](/api/v2#/reference/datasets/item-collection/get-items) endpoint and pass the `defaultDatasetId` into the URL. For a GET request to the default dataset, no token is needed. ```cURL https://api.apify.com/v2/datasets/DATASET_ID/items diff --git a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md index 2741784a63..3e8167540f 100644 --- a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md @@ -24,7 +24,7 @@ so now it's time to add more data to the results. To do that, we'll be using the [Cheerio](https://github.com/cheeriojs/cheerio) library. This may not sound familiar, so let's try again. Does [jQuery](https://jquery.com/) ring a bell? If it does you're in luck, -because Cheerio is just jQuery that doesn't need an actual browser to run. Everything else is the same. 
+because Cheerio is like jQuery that doesn't need an actual browser to run. Everything else is the same. All the functions you already know are there and even the familiar `$` is used. If you still have no idea what either of those are, don't worry. We'll walk you through using them step by step. @@ -65,7 +65,7 @@ element that we can use to select only the heading we're interested in. > their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler > code all the time. -To get the title we just need to find it using a `header h1` selector, which selects all `
<h1>` elements that have a `<header>` ancestor.
+To get the title we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>
` ancestor. And as we already know, there's only one. ```js @@ -250,12 +250,12 @@ You nailed it! ## [](#pagination) Pagination -Pagination is just a term that represents "going to the next page of results". You may have noticed that we did not +Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination. > This is a typical JavaScript pagination, sometimes called infinite scroll. Other pages may use links -that take you to the next page. If you encounter those, just make a Pseudo URL for those links and they +that take you to the next page. If you encounter those, make a Pseudo URL for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing. @@ -305,7 +305,7 @@ we need is there, in the `data.props.pageProps.items` array. Great! ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/inspect-data.webp) > It's obvious that all the information we set to scrape is available in this one data object, -so you might already be wondering, can I just make one request to the store to get this JSON +so you might already be wondering, can I make one request to the store to get this JSON and then parse it out and be done with it in a single request? Yes you can! And that's the power of clever page analysis. @@ -403,7 +403,7 @@ async function pageFunction(context) { That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper scrape all of the Actors' data. After it succeeds, open the **Dataset** tab again click on **Preview**. You should have a table of all the Actor's details in front of you. If you do, great job! You've successfully -scraped Apify Store. And if not, no worries, just go through the code examples again, it's probably just some typo. +scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo. > There's an important caveat. The way we implemented pagination here is in no way a generic system that you can easily use with other websites. Cheerio is fast (and that means it's cheap), but it's not easy. Sometimes there's just no way diff --git a/sources/academy/tutorials/apify_scrapers/getting_started.md b/sources/academy/tutorials/apify_scrapers/getting_started.md index 7af4e67e16..c503e2e891 100644 --- a/sources/academy/tutorials/apify_scrapers/getting_started.md +++ b/sources/academy/tutorials/apify_scrapers/getting_started.md @@ -15,9 +15,9 @@ Welcome to the getting started tutorial! It will walk you through creating your ## [](#what-is-an-apify-scraper) What is an Apify scraper -It doesn't matter whether you arrived here from **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)), **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)) or **Cheerio Scraper** ([apify/cheerio-scraper](https://apify.com/apify/cheerio-scraper)). All of them are **Actors** and for now, let's just think of an **Actor** as an application that you can use with your own configuration. **apify/web-scraper** is therefore an application called **web-scraper**, built by **apify**, that you can configure to scrape any webpage. We call these configurations **tasks**. 
+It doesn't matter whether you arrived here from **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)), **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)) or **Cheerio Scraper** ([apify/cheerio-scraper](https://apify.com/apify/cheerio-scraper)). All of them are **Actors** and for now, let's think of an **Actor** as an application that you can use with your own configuration. **apify/web-scraper** is therefore an application called **web-scraper**, built by **apify**, that you can configure to scrape any webpage. We call these configurations **tasks**. -> If you need help choosing the right scraper, see this [great article](https://help.apify.com/en/articles/3024655-choosing-the-right-solution). And if you just want to learn more about Actors in general, you can read our [Actors page](https://apify.com/actors) or [browse the documentation](/platform/actors). +> If you need help choosing the right scraper, see this [great article](https://help.apify.com/en/articles/3024655-choosing-the-right-solution). If you want to learn more about Actors in general, you can read our [actors page](https://apify.com/actors) or [browse the documentation](/platform/actors). You can create 10 different **tasks** for 10 different websites, with very different options, but there will always be just one **Actor**, the `apify/*-scraper` you chose. This is the essence of tasks. They are nothing but **saved configurations** of the Actor that you can run easily and repeatedly. @@ -31,11 +31,11 @@ Depending on how you arrived at this tutorial, you may already have your first t ### [](#running-a-task) Running a task -This takes you to the **Input and options** tab of the task configuration. Before we delve into the details, let's just see how the example works. You can see that there are already some pre-configured input values. It says that the task should visit **** and all its subpages, such as **** and scrape some data using the provided `pageFunction`, specifically the `` of the page and its URL. +This takes you to the **Input and options** tab of the task configuration. Before we delve into the details, let's see how the example works. You can see that there are already some pre-configured input values. It says that the task should visit **<https://apify.com>** and all its subpages, such as **<https://apify.com/contact>** and scrape some data using the provided `pageFunction`, specifically the `<title>` of the page and its URL. -Scroll down to the **Performance and limits** section and set the **Max pages per run** option to **10**. This tells your task to finish after 10 pages have been visited. We don't need to crawl the whole domain just to see that the Actor works. +Scroll down to the **Performance and limits** section and set the **Max pages per run** option to **10**. This tells your task to finish after 10 pages have been visited. We don't need to crawl the whole domain to see that the Actor works. -> This also helps with keeping your [compute unit](/platform/actors/running/usage-and-resources) (CU) consumption low. Just to get an idea, our free plan includes 10 CUs and this run will consume about 0.04 CU, so you can run it 250 times a month for free. If you accidentally go over the limit, no worries, we won't charge you for it. You just won't be able to run more tasks that month. +> This also helps with keeping your [compute unit](/platform/actors/running/usage-and-resources) (CU) consumption low. 
To get an idea, our free plan includes 10 CUs and this run will consume about 0.04 CU, so you can run it 250 times a month for free. If you accidentally go over the limit, no worries, we won't charge you for it. You just won't be able to run more tasks that month. Now click **Save & Run**! *(in the bottom-left part of your screen)* @@ -45,25 +45,25 @@ After clicking **Save & Run**, the window will change to the run detail. Here, y > Feel free to browse through the various new tabs: **Log**, **Info**, **Input** and other, but for the sake of brevity, we will not explain all their features in this tutorial. -Now that the run has `SUCCEEDED`, click on the glowing **Results** card to see the scrape's results. This takes you to the **Dataset** tab, where you can display or download the results in various formats. For now, just click the **Preview** button. Voila, the scraped data! +Now that the run has `SUCCEEDED`, click on the glowing **Results** card to see the scrape's results. This takes you to the **Dataset** tab, where you can display or download the results in various formats. For now, click the **Preview** button. Voila, the scraped data! ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/the-run-detail.webp) -Good job! We've run our first task and got some results. Let's learn how to change the default configuration to scrape something more interesting than just the page's `<title>`. +Good job! We've run our first task and got some results. Let's learn how to change the default configuration to scrape something more interesting than the page's `<title>`. ## [](#creating-your-own-task) Creating your own task -Before we jump into the scraping itself, let's just have a quick look at the user interface that's available to us. Click on the task's name in the top-left corner to visit the task's configuration. +Before we jump into the scraping itself, let's have a quick look at the user interface that's available to us. Click on the task's name in the top-left corner to visit the task's configuration. ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/task-name.webp) ### [](#input) Input and options -The **Input** tab is where we started and it's the place where you create your scraping configuration. The Actor's creator prepares the **Input** form so that you can easily tell the Actor what to do. Feel free to check the tooltips of the various options to get a better idea of what they do. To display the tooltip, just click the question mark next to each input field's name. +The **Input** tab is where we started and it's the place where you create your scraping configuration. The Actor's creator prepares the **Input** form so that you can easily tell the Actor what to do. Feel free to check the tooltips of the various options to get a better idea of what they do. To display the tooltip, click the question mark next to each input field's name. > We will not go through all the available input options in this tutorial. See the Actor's README for detailed information. -Below the input fields are the Build, Timeout and Memory options. Let's keep them at default settings for now. Just remember that if you see a yellow `TIMED-OUT` status after running your task, you might want to come back here and increase the timeout. +Below the input fields are the Build, Timeout and Memory options. Let's keep them at default settings for now. 
Remember that if you see a yellow `TIMED-OUT` status after running your task, you might want to come back here and increase the timeout. > Timeouts are there to prevent tasks from running forever. Always set a reasonable timeout to prevent a rogue task from eating up all your compute units. @@ -81,7 +81,7 @@ Webhooks are a feature that help keep you aware of what's happening with your ta ### [](#readme) Information -Since tasks are just configurations for Actors, this tab shows you all the information about the underlying Actor, the Apify scraper of your choice. You can see the available versions and their READMEs - it's always a good idea to read an Actor's README first before creating a task for it. +Since tasks are configurations for Actors, this tab shows you all the information about the underlying Actor, the Apify scraper of your choice. You can see the available versions and their READMEs - it's always a good idea to read an Actor's README first before creating a task for it. ### [](#api) API @@ -116,7 +116,7 @@ How do we choose the new **Start URL**? The goal is to scrape all Actors in the https://apify.com/store ``` -We also need to somehow distinguish the **Start URL** from all the other URLs that the scraper will add later. To do this, click the **Details** button in the **Start URL** form and see the **User data** input. Here you can add any information you'll need during the scrape in a JSON format. For now, just add a label to the **Start URL**. +We also need to somehow distinguish the **Start URL** from all the other URLs that the scraper will add later. To do this, click the **Details** button in the **Start URL** form and see the **User data** input. Here you can add any information you'll need during the scrape in a JSON format. For now, add a label to the **Start URL**. ```json { @@ -130,13 +130,13 @@ The **Link selector**, together with **Pseudo URL**s, are your URL matching arse What's the connection to **Pseudo URL**s? Well, first, all the URLs found in the elements that match the Link selector are collected. Then, **Pseudo URL**s are used to filter through those URLs and enqueue only the ones that match the **Pseudo URL** structure. Simple. -To scrape all the Actors in Apify Store, we should use the Link selector to tell the scraper where to find the URLs we need. For now, let us just tell you that the Link selector you're looking for is: +To scrape all the Actors in Apify Store, we should use the Link selector to tell the scraper where to find the URLs we need. For now, let us tell you that the Link selector you're looking for is: ```css div.item > a ``` -Save it as your **Link selector**. If you're wondering how we figured this out, just follow along with the tutorial. By the time we finish, you'll know why we used this selector, too. +Save it as your **Link selector**. If you're wondering how we figured this out, follow along with the tutorial. By the time we finish, you'll know why we used this selector, too. ### [](#crawling-the-website-with-pseudo-url) Crawling the website with pseudo URLs @@ -152,7 +152,7 @@ In the structures, only the `OWNER` and `NAME` change. We can leverage this in a #### Making a pseudo URL -**Pseudo URL**s are really just URLs with some variable parts in them. Those variable parts are represented by [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) enclosed in brackets `[]`. +**Pseudo URL**s are URLs with some variable parts in them. 
Those variable parts are represented by [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) enclosed in brackets `[]`. Working with our Actor details example, we could produce a **Pseudo URL** like this: @@ -190,7 +190,7 @@ Let's use the above **Pseudo URL** in our task. We should also add a label as we ### [](#test-run) Test run -Now that we've added some configuration, it's time to test it. Just run the task, keeping the **Max pages per run** set to `10` and the `pageFunction` as it is. You should see in the log that the scraper first visits the **Start URL** and then several of the Actor details matching the **Pseudo URL**. +Now that we've added some configuration, it's time to test it. Run the task, keeping the **Max pages per run** set to `10` and the `pageFunction` as it is. You should see in the log that the scraper first visits the **Start URL** and then several of the Actor details matching the **Pseudo URL**. ## [](#the-page-function) The page function @@ -200,7 +200,7 @@ The `pageFunction` is a JavaScript function that gets executed for each page the Open [Apify Store](https://apify.com/store) in the Chrome browser (or use any other browser, just note that the DevTools may differ slightly) and open the DevTools, either by right-clicking on the page and selecting **Inspect** or by pressing **F12**. -The DevTools window will pop up and display a lot of, perhaps unfamiliar, information. Don't worry about that too much - just open the Elements tab (the one with the page's HTML). The Elements tab allows you to browse the page's structure and search within it using the search tool. You can open the search tool by pressing **CTRL+F** or **CMD+F**. Try typing **title** into the search bar. +The DevTools window will pop up and display a lot of, perhaps unfamiliar, information. Don't worry about that too much - open the Elements tab (the one with the page's HTML). The Elements tab allows you to browse the page's structure and search within it using the search tool. You can open the search tool by pressing **CTRL+F** or **CMD+F**. Try typing **title** into the search bar. You'll see that the Element tab jumps to the first `<title>` element of the current page and that the title is **Store · Apify**. It's always good practice to do your research using the DevTools before writing the `pageFunction` and running your task. @@ -286,14 +286,14 @@ The scraper: 5. Enqueues the matching URLs to the end of the crawling queue. 6. Closes the page and selects a new URL to visit, either from the **Start URL**s if there are any left, or from the beginning of the crawling queue. -> When you're not using the request queue, the scraper just repeats steps 1 and 2. You would not use the request queue when you already know all the URLs you want to visit. For example, when you have a pre-existing list of a thousand URLs that you uploaded as a text file. Or when scraping just a single URL. +> When you're not using the request queue, the scraper repeats steps 1 and 2. You would not use the request queue when you already know all the URLs you want to visit. For example, when you have a pre-existing list of a thousand URLs that you uploaded as a text file. Or when scraping a single URL. ## [](#scraping-practice) Scraping practice We've covered all the concepts that we need to understand to successfully scrape the data in our goal, so let's get to it and start with something really simple. 
We will only output data that are already available to us in the page's URL. Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique -identifier** in our results. To get those, we just need the `request.url` because it is the URL and +identifier** in our results. To get those, we need the `request.url` because it is the URL and includes the Unique identifier. ```js diff --git a/sources/academy/tutorials/apify_scrapers/index.md b/sources/academy/tutorials/apify_scrapers/index.md index 9ad91b4d60..5816b900d3 100644 --- a/sources/academy/tutorials/apify_scrapers/index.md +++ b/sources/academy/tutorials/apify_scrapers/index.md @@ -13,7 +13,7 @@ slug: /apify-scrapers Scraping and crawling the web can be difficult and time-consuming without the right tools. That's why Apify provides ready-made solutions to crawl and scrape any website. They are based on our [Actors](https://apify.com/actors), the [Apify SDK](/sdk/js) and [Crawlee](https://crawlee.dev/). -Don't let the number of options confuse you. Unless you're really sure you need to use a specific tool, just go ahead and use **Web Scraper** ([apify/web-scraper](./web_scraper.md)). It is the easiest to pick up and can handle almost anything. Look at **Puppeteer Scraper** ([apify/puppeteer-scraper](./puppeteer_scraper.md)) or **Cheerio Scraper** ([apify/cheerio-scraper](./cheerio_scraper.md)) only after you know your target websites well and need to optimize your scraper. +Don't let the number of options confuse you. Unless you're really sure you need to use a specific tool, go ahead and use **Web Scraper** ([apify/web-scraper](./web_scraper.md)). It is the easiest to pick up and can handle almost anything. Look at **Puppeteer Scraper** ([apify/puppeteer-scraper](./puppeteer_scraper.md)) or **Cheerio Scraper** ([apify/cheerio-scraper](./cheerio_scraper.md)) only after you know your target websites well and need to optimize your scraper. [Visit the Scraper introduction tutorial to get started!](./getting_started.md) diff --git a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md index 1e3195a79a..313238bb7f 100644 --- a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md @@ -44,7 +44,7 @@ It's also much easier to work with external APIs, databases or the [Apify SDK](h in the Node.js context. The tradeoff is simple. Power vs simplicity. Web Scraper is simple, Puppeteer Scraper is powerful (and the [Apify SDK](https://sdk.apify.com) is super-powerful). -> In other words, Web Scraper's `pageFunction` is just a single [page.evaluate()](https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) call. +> In other words, Web Scraper's `pageFunction` is like a single [page.evaluate()](https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) call. Now that's out of the way, let's open one of the Actor detail pages in the Store, for example the Web Scraper page and use our DevTools-Fu to scrape some data. @@ -80,7 +80,7 @@ element that we can use to select only the heading we're interested in. > their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler > code all the time. -To get the title we just need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. 
+To get the title we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one. ```js @@ -352,12 +352,12 @@ You nailed it! ## [](#pagination) Pagination -Pagination is just a term that represents "going to the next page of results". You may have noticed that we did not +Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination. -> This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may just use links -that take you to the next page. If you encounter those, just make a **Pseudo URL** for those links and they will +> This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may use links +that take you to the next page. If you encounter those, make a **Pseudo URL** for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing. ### [](#waiting-for-dynamic-content) Waiting for dynamic content @@ -428,7 +428,7 @@ div.show-more > button ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/waiting-for-the-button.webp) -Now that we know what to wait for, we just plug it into the `waitFor()` function. +Now that we know what to wait for, we plug it into the `waitFor()` function. ```js await page.waitFor('div.show-more > button'); @@ -480,11 +480,11 @@ async function pageFunction(context) { ``` We want to run this until the `waitFor()` function throws, so that's why we use a `while(true)` loop. We're also not -interested in the error, because we're expecting it, so we just ignore it and print a log message instead. +interested in the error, because we're expecting it, so we ignore it and print a log message instead. You might be wondering what's up with the `timeout`. Well, for the first page load, we want to wait longer, so that all the page's JavaScript has had a chance to execute, but for the other iterations, the JavaScript is -already loaded and we're just waiting for the page to re-render so waiting for `2` seconds is enough to confirm +already loaded and we're waiting for the page to re-render so waiting for `2` seconds is enough to confirm that the button is not there. We don't want to stall the scraper for `30` seconds just to make sure that there's no button. @@ -576,8 +576,8 @@ async function pageFunction(context) { That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the Actors and then scrape all of their data. After it succeeds, open the **Dataset** tab again and click on **Preview****. You should have a table of all the Actor's details in front of you. If you do, great job! -You've successfully scraped Apify Store. And if not, no worries, just go through the code examples again, -it's probably just some typo. +You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, +it's probably just a typo. ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp) @@ -705,10 +705,9 @@ you can easily use jQuery with Puppeteer Scraper too. 
### [](#injecting-jquery) Injecting jQuery -To be able to use jQuery, we first need to introduce it to the browser. Fortunately, we have a helper function to -do just that: [`Apify.utils.puppeteer.injectJQuery`](https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) +To be able to use jQuery, we first need to introduce it to the browser. The [`Apify.utils.puppeteer.injectJQuery`](https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) function will help us with the task. -> Just a friendly warning. Injecting jQuery into a page may break the page itself, if it expects a specific version +> Friendly warning: Injecting jQuery into a page may break the page itself, if it expects a specific version of jQuery to be available and you override it with an incompatible one. Be careful. You can either call this function directly in your `pageFunction`, or you can set up jQuery injection in the diff --git a/sources/academy/tutorials/apify_scrapers/web_scraper.md b/sources/academy/tutorials/apify_scrapers/web_scraper.md index 3fedac29d2..1507d30fdb 100644 --- a/sources/academy/tutorials/apify_scrapers/web_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/web_scraper.md @@ -26,7 +26,7 @@ we've confirmed that the scraper works as expected, so now it's time to add more To do that, we'll be using the [jQuery library](https://jquery.com/), because it provides some nice tools and a lot of people familiar with JavaScript already know how to use it. -> [Check out the jQuery docs](https://api.jquery.com/) if you're not familiar with it. And if you just don't want to use it, that's okay. Everything can be done using pure JavaScript, too. +> [Check out the jQuery docs](https://api.jquery.com/) if you're not familiar with it. And if you don't want to use it, that's okay. Everything can be done using pure JavaScript, too. To add jQuery, all we need to do is turn on **Inject jQuery** under the **Input and options** tab. This will add a `context.jQuery` function that you can use. @@ -63,7 +63,7 @@ element that we can use to select only the heading we're interested in. > their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler > code all the time. -To get the title we just need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. +To get the title we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one. ```js @@ -251,12 +251,12 @@ You nailed it! ## [](#pagination) Pagination -Pagination is just a term that represents "going to the next page of results". You may have noticed that we did not +Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination. -> This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may just use links -that take you to the next page. If you encounter those, just make a **Pseudo URL** for those links and they will +> This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may use links +that take you to the next page. 
If you encounter those, make a **Pseudo URL** for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing. ### [](#waiting-for-dynamic-content) Waiting for dynamic content @@ -324,7 +324,7 @@ div.show-more > button ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/waiting-for-the-button.webp) -Now that we know what to wait for, we just plug it into the `waitFor()` function. +Now that we know what to wait for, we plug it into the `waitFor()` function. ```js await waitFor('div.show-more > button'); @@ -374,11 +374,11 @@ async function pageFunction(context) { ``` We want to run this until the `waitFor()` function throws, so that's why we use a `while(true)` loop. We're also not -interested in the error, because we're expecting it, so we just ignore it and print a log message instead. +interested in the error, because we're expecting it, so we ignore it and print a log message instead. You might be wondering what's up with the `timeoutMillis`. Well, for the first page load, we want to wait longer, so that all the page's JavaScript has had a chance to execute, but for the other iterations, the JavaScript is -already loaded and we're just waiting for the page to re-render so waiting for `2` seconds is enough to confirm +already loaded and we're waiting for the page to re-render so waiting for `2` seconds is enough to confirm that the button is not there. We don't want to stall the scraper for `20` seconds just to make sure that there's no button. @@ -452,8 +452,8 @@ async function pageFunction(context) { That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the Actors and then scrape all of their data. After it succeeds, open the **Dataset** tab again click on **Preview**. You should have a table of all the Actor's details in front of you. If you do, great job! -You've successfully scraped Apify Store. And if not, no worries, just go through the code examples again, -it's probably just some typo. +You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, +it's probably just a typo. ![$1](https://raw.githubusercontent.com/apifytech/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp) diff --git a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md index ac85868bb2..aeed367d3a 100644 --- a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md +++ b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md @@ -39,7 +39,7 @@ async function pageFunction(context) { } ``` -We're creating a script element in the page's DOM and waiting for the script to load. Afterwards, we just confirm that the library has been successfully loaded by using one of its functions. +We're creating a script element in the page's DOM and waiting for the script to load. Afterwards, we confirm that the library has been successfully loaded by using one of its functions. ## Injecting a library using jQuery @@ -64,6 +64,6 @@ With jQuery, we're using the `$.getScript()` helper to fetch the script for us a ## Dealing with errors -Some websites employ security measures that disallow loading external scripts within their pages. Luckily, those measures can be easily overridden with Web Scraper. 
If you are encountering errors saying that your library cannot be loaded due to a security policy, just select the Ignore CORS and CSP input option at the very bottom of Web Scraper input and the errors should go away. +Some websites employ security measures that disallow loading external scripts within their pages. Luckily, those measures can be easily overridden with Web Scraper. If you are encountering errors saying that your library cannot be loaded due to a security policy, select the Ignore CORS and CSP input option at the very bottom of Web Scraper input and the errors should go away. Happy scraping! diff --git a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md index e23918dedc..5f42d15ed7 100644 --- a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md +++ b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md @@ -35,7 +35,7 @@ Here are the most common reasons your working solution may break. Web scraping and automation are very specific types of programming. It is not possible to rely on specialized debugging tools, since the code does not output the same results every time. However, there are still many ways to diagnose issues in a crawler. -> Many issues are edge cases, which occur in just one of a thousand pages or are time-dependent. Because of this, you cannot rely only on [determinism](https://en.wikipedia.org/wiki/Deterministic_algorithm). +> Many issues are edge cases, which occur in one of a thousand pages or are time-dependent. Because of this, you cannot rely only on [determinism](https://en.wikipedia.org/wiki/Deterministic_algorithm). ### Logging {#logging} @@ -123,7 +123,7 @@ To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a ran ### Error reporting {#error-reporting} -Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's just look at simple **dataset** reporting. +Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's look at simple **dataset** reporting. ## With the Apify SDK {#with-the-apify-sdk} diff --git a/sources/academy/tutorials/node_js/block_requests_puppeteer.md b/sources/academy/tutorials/node_js/block_requests_puppeteer.md index ffbc2538ce..bb763e70f8 100644 --- a/sources/academy/tutorials/node_js/block_requests_puppeteer.md +++ b/sources/academy/tutorials/node_js/block_requests_puppeteer.md @@ -65,6 +65,6 @@ With this code set up this is the output: And except for different ads, the page should look the same. -From this we can see, that just by blocking a few analytics and tracking scripts the page was loaded nearly 25 seconds faster and downloaded 35% less data (approximately since the data is measured after it's decompressed). +From this we can see that just by blocking a few analytics and tracking scripts the page was loaded nearly 25 seconds faster and downloaded 35% less data (approximately since the data is measured after it's decompressed). Hopefully this helps you make your solutions faster and use fewer resources. 
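As a rough illustration of the mechanism behind numbers like these, here is a minimal sketch of request blocking with plain Puppeteer's request interception. The blocked domains and the target URL are placeholders, not the exact ones measured above.

```js
const puppeteer = require('puppeteer');

// Illustrative list of tracker domains to block — adjust to what you see in your own runs.
const BLOCKED_PATTERNS = ['google-analytics.com', 'doubleclick.net', 'connect.facebook.net'];

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Intercept every request and abort the ones matching a blocked pattern.
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        const url = request.url();
        if (BLOCKED_PATTERNS.some((pattern) => url.includes(pattern))) {
            request.abort();
        } else {
            request.continue();
        }
    });

    await page.goto('https://example.com');
    // ... measure load time or scrape as usual ...
    await browser.close();
})();
```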
diff --git a/sources/academy/tutorials/node_js/debugging_web_scraper.md b/sources/academy/tutorials/node_js/debugging_web_scraper.md index 792f2f33c1..2abe5e1361 100644 --- a/sources/academy/tutorials/node_js/debugging_web_scraper.md +++ b/sources/academy/tutorials/node_js/debugging_web_scraper.md @@ -65,7 +65,7 @@ results; ## Pasting and running a full pageFunction -If you don't want to deal with copy/pasting a proper snippet, you can always paste the whole pageFunction. You will just have to mock the context object when calling it. If you use some advanced tricks, this might not work but in most cases copy pasting this code should do it. This code is only for debugging your Page Function for a particular page. It does not crawl the website and the output is not saved anywhere. +If you don't want to deal with copy/pasting a proper snippet, you can always paste the whole pageFunction. You will have to mock the context object when calling it. If you use some advanced tricks, this might not work but in most cases copy pasting this code should do it. This code is only for debugging your Page Function for a particular page. It does not crawl the website and the output is not saved anywhere. <!-- eslint-disable --> ```js diff --git a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md index 0afd0912d8..61a869e320 100644 --- a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md +++ b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md @@ -5,7 +5,7 @@ sidebar_position: 16 slug: /node-js/filter-blocked-requests-using-sessions --- -_This article explains how the problem was solved before the [SessionPool](/sdk/js/docs/api/session-pool) class was added into [Apify SDK](/sdk/js/). We are keeping the article here as it might be interesting for people who want to see how to work with sessions on a lower level. For any practical usage of sessions, just follow the documentation and examples of SessionPool._ +_This article explains how the problem was solved before the [SessionPool](/sdk/js/docs/api/session-pool) class was added into [Apify SDK](/sdk/js/). We are keeping the article here as it might be interesting for people who want to see how to work with sessions on a lower level. For any practical usage of sessions, follow the documentation and examples of SessionPool._ ### Overview of the problem @@ -23,7 +23,7 @@ You want to crawl a website with a proxy pool, but most of your proxies are bloc Nobody can make sure that a proxy will work infinitely. The only real solution to this problem is to use [residential proxies](/platform/proxy#residential-proxy), but they can sometimes be too costly. -However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually just throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with `PuppeteerCrawler`  class. +However, usually, at least some of our proxies work. 
To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with `PuppeteerCrawler`  class. ### Solution @@ -52,7 +52,7 @@ Apify.main(async () => { ### Algorithm -You don't necessarily need to understand the solution below - it should be fine to just copy/paste it to your Actor. +You don't necessarily need to understand the solution below - it should be fine to copy/paste it to your Actor. `sessions`  will be an object whose keys will be the names of the sessions and values will be objects with the name of the session (we choose a random number as a name here) and user agent (you can add any other useful properties that you want to match with each session.) This will be created automatically, for example: diff --git a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md index d36e80262d..43ddf9ade3 100644 --- a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md +++ b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md @@ -54,7 +54,7 @@ const crawler = new PuppeteerCrawler({ }); ``` -It is really up to a developer to spot if something is wrong with his request. A website can interfere with your crawling in [many ways](https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections). Page loading can be cancelled right away, it can timeout, the page can display a captcha, some error or warning message, or the data may be just missing or corrupted. The developer can then choose if he will try to handle these problems in the code or just focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error. +It is really up to a developer to spot if something is wrong with his request. A website can interfere with your crawling in [many ways](https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections). Page loading can be cancelled right away, it can timeout, the page can display a captcha, some error or warning message, or the data may be missing or corrupted. The developer can then choose if he will try to handle these problems in the code or focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error. Now that we know when the request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code some simple Google search crawler. The two main blocking mechanisms used by Google is either to display their (in)famous 'sorry' captcha or to not load the page at all so we will focus on covering these. 
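To make the advice about throwing a proper error more concrete, a check inside a crawler's page handler might look roughly like the sketch below. It assumes the Puppeteer `page` and its main `response` are available (as they are in `PuppeteerCrawler`); the captcha selector is a made-up placeholder rather than Google's real markup.

```js
// Hedged sketch: detect a likely block and throw, so the crawler retries
// the request later (typically with a different proxy or session).
async function ensureNotBlocked(page, response) {
    // Status codes of 400 and above usually mean the request was refused.
    if (response && response.status() >= 400) {
        throw new Error(`Request blocked - received status ${response.status()}`);
    }

    // '#captcha-form' is only a placeholder selector for a captcha element.
    if (await page.$('#captcha-form')) {
        throw new Error('Request blocked - captcha page was displayed');
    }
}
```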
diff --git a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md index 81dc61db1c..ac42f16009 100644 --- a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md +++ b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md @@ -23,7 +23,7 @@ Browsers create a separate process for each tab. That means each tab lives with If you use [Crawlee](https://crawlee.dev/), your concurrency automatically scales up and down to fit in the allocated memory. You can change the allocated memory using the environment variable or the [Configuration](https://crawlee.dev/docs/guides/configuration) class. But very hungry pages can still occasionally cause sudden memory spikes, and you might have to limit the [maxConcurrency](https://crawlee.dev/docs/guides/scaling-crawlers#minconcurrency-and-maxconcurrency) of the crawler. This problem is very rare, though. -Without Crawlee, you will need to predict the maximum concurrency the particular use case can handle or just increase the allocated memory. +Without Crawlee, you will need to predict the maximum concurrency the particular use case can handle or increase the allocated memory. ## Page closed prematurely diff --git a/sources/academy/tutorials/node_js/index.md b/sources/academy/tutorials/node_js/index.md index 7873f0d4f5..c8abaa847b 100644 --- a/sources/academy/tutorials/node_js/index.md +++ b/sources/academy/tutorials/node_js/index.md @@ -12,4 +12,4 @@ slug: /node-js --- -This section contains various web-scraping or web-scraping related tutorials for Node.js. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow Puppeteer scraper, or just need some general tips for scraping in Node.js, this section is right for you. +This section contains various web-scraping or web-scraping related tutorials for Node.js. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow Puppeteer scraper, or need some general tips for scraping in Node.js, this section is right for you. diff --git a/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md b/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md index 01b27a5fe1..1d2ece675f 100644 --- a/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md +++ b/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md @@ -50,13 +50,13 @@ await requestQueue.addRequest({ }); ``` -Now, in the "SELLERDETAIL" url, we can just evaluate the page and extracted data merge to the object from the item detail, for example like this +Now, in the "SELLERDETAIL" url, we can evaluate the page and merge the extracted data into the object from the item detail, for example like this ```js const result = { ...request.userData.data, ...sellerDetail }; ``` -So next just save the results and we're done! +Save the results and we're done! ```js await Apify.pushData(result); diff --git a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md index 261b408a83..bb632ce99b 100644 --- a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md +++ b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md @@ -19,7 +19,7 @@ Ok, so both Web Scraper and Puppeteer Scraper use Puppeteer to give commands to ## Execution environment -It may sound fancy, but it's just a technical term for "where does my code run".
When you open the DevTools and start typing JavaScript in the browser Console, it gets executed in the browser. Browser is the code's execution environment. But you can't control the browser from the inside. For that, you need a different environment. Puppeteer's environment is Node.js. If you don't know what Node.js is, don't worry about it too much. Just remember that it's the environment where Puppeteer runs. +It may sound fancy, but it's just a technical term for "where does my code run". When you open the DevTools and start typing JavaScript in the browser Console, it gets executed in the browser. Browser is the code's execution environment. But you can't control the browser from the inside. For that, you need a different environment. Puppeteer's environment is Node.js. If you don't know what Node.js is, don't worry about it too much. Remember that it's the environment where Puppeteer runs. By now you probably figured this out on your own, so this will not come as a surprise. The difference between Web Scraper and Puppeteer Scraper is where your page function gets executed. When using the Web Scraper, it's executed in the browser environment. It means that it gets access to all the browser specific features such as the `window` or `document` objects, but it cannot control the browser with Puppeteer directly. This is done automatically in the background by the scraper. Whereas in Puppeteer Scraper, the page function is executed in the Node.js environment, giving you full access to Puppeteer and all its features. @@ -32,7 +32,7 @@ Ok, cool, different environments, but how does that help you scrape stuff? Actua ## Evaluating in-browser code -In Web Scraper, everything runs in the browser, so there's really not much to talk about there. With Puppeteer Scraper, it's just a single function call away. +In Web Scraper, everything runs in the browser, so there's really not much to talk about there. With Puppeteer Scraper, it's a single function call away. ```js const bodyHTML = await context.page.evaluate(() => { @@ -126,7 +126,7 @@ Since we're actually clicking in the page, which may or may not trigger some nas ## Plain form submit navigations -This works out of the box. It's typically used on older websites such as [Turkish Remax](https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama). For a site like this you can just set the `Clickable elements selector` and you're good to go: +This works out of the box. It's typically used on older websites such as [Turkish Remax](https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama). For a site like this you can set the `Clickable elements selector` and you're good to go: ```js 'a[onclick ^= getPage]'; @@ -142,7 +142,7 @@ Those are similar to the ones above with an important caveat. Once you click the ## Frontend navigations -Websites often won't navigate away just to fetch the next set of results. They will do it in the background and just update the displayed data. You can paginate such websites with either Web Scraper or Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Click the next button to load the next set of courses. +Websites often won't navigate away just to fetch the next set of results. They will do it in the background and update the displayed data. You can paginate such websites with either Web Scraper or Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Click the next button to load the next set of courses. 
```js // Web Scraper\ diff --git a/sources/academy/tutorials/php/index.md b/sources/academy/tutorials/php/index.md index 241dac6f7b..dbf0751615 100644 --- a/sources/academy/tutorials/php/index.md +++ b/sources/academy/tutorials/php/index.md @@ -12,4 +12,4 @@ slug: /php --- -This section contains web-scraping or web-scraping related tutorials for PHP. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or just need some general tips for scraping in Apify with PHP, this section is right for you. +This section contains web-scraping or web-scraping related tutorials for PHP. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or need some general tips for scraping in Apify with PHP, this section is right for you. diff --git a/sources/academy/tutorials/php/using_apify_from_php.md b/sources/academy/tutorials/php/using_apify_from_php.md index ea8401fcab..7df379001e 100644 --- a/sources/academy/tutorials/php/using_apify_from_php.md +++ b/sources/academy/tutorials/php/using_apify_from_php.md @@ -230,7 +230,7 @@ $response = $client->post('acts/mhamas~html-string-to-pdf/runs', [ ## How to use Apify Proxy -Let's use another important feature: [proxy](/platform/proxy). If you just want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode. +Let's use another important feature: [proxy](/platform/proxy). If you want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode. ```php $client = new \GuzzleHttp\Client([ diff --git a/sources/academy/tutorials/python/index.md b/sources/academy/tutorials/python/index.md index c01869468e..ea2a5e0880 100644 --- a/sources/academy/tutorials/python/index.md +++ b/sources/academy/tutorials/python/index.md @@ -12,4 +12,4 @@ slug: /python --- -This section contains various web-scraping or web-scraping related tutorials for Python. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or just need some general tips for scraping in Python, this section is right for you. +This section contains various web-scraping or web-scraping related tutorials for Python. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or need some general tips for scraping in Python, this section is right for you. diff --git a/sources/academy/tutorials/python/process_data_using_python.md b/sources/academy/tutorials/python/process_data_using_python.md index 785f665290..5a8b9eb2b1 100644 --- a/sources/academy/tutorials/python/process_data_using_python.md +++ b/sources/academy/tutorials/python/process_data_using_python.md @@ -29,7 +29,7 @@ First, we need to create another Actor. You can do it the same way as before - g In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-parser`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`. -First, we'll start with the `requirements.txt` file. 
Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't particularly care about the specific versions of these packages, so we just list them in the file: +First, we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't care about versions of these packages, so we list just their names: ```py # Add your dependencies here. @@ -77,7 +77,7 @@ dataset_client = client.dataset(scraper_run['defaultDatasetId']) ### Processing the data -Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we just create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. +Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. ```py # Load the dataset items into a pandas dataframe diff --git a/sources/academy/tutorials/python/scrape_data_python.md b/sources/academy/tutorials/python/scrape_data_python.md index eb14f79b77..b93f96c83e 100644 --- a/sources/academy/tutorials/python/scrape_data_python.md +++ b/sources/academy/tutorials/python/scrape_data_python.md @@ -61,7 +61,7 @@ First, we need to create a new Actor. To do this, go to [Apify Console](https:// In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-scraper`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`. -First we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `requests` package for downloading the BBC Weather pages, and the `beautifulsoup4` package for parsing and processing the downloaded pages. We don't particularly care about the specific versions of these packages, so we just list them in the file: +First we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `requests` package for downloading the BBC Weather pages, and the `beautifulsoup4` package for parsing and processing the downloaded pages. We don't care about versions of these packages, so we list just their names: ```py # Add your dependencies here. @@ -229,7 +229,7 @@ First, we need to create another Actor. You can do it the same way as before - g In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-parser`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`. -First, we'll start with the `requirements.txt` file. 
Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't particularly care about the specific versions of these packages, so we just list them in the file: +First, we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't care about versions of these packages, so we list just their names: ```py # Add your dependencies here. @@ -277,7 +277,7 @@ dataset_client = client.dataset(scraper_run['defaultDatasetId']) ### Processing the data -Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we just create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. +Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. ```py # Load the dataset items into a pandas dataframe diff --git a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md index c6ce3fea19..e8faee7a8d 100644 --- a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md +++ b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md @@ -30,7 +30,7 @@ This is usually the first solution that comes to mind. You traverse the smallest 1. Any subcategory might be bigger than the pagination limit. 2. Some listings from the parent category might not be present in any subcategory. -While you can often manually test if the second problem is true on the site, the first problem is a hard blocker. You might be just lucky, and it may work on this site but usually, traversing subcategories is just not enough. It can be used as a first step of the solution but not as the solution itself. +While you can often manually test if the second problem is true on the site, the first problem is a hard blocker. You might be just lucky, and it may work on this site but usually, traversing subcategories is not enough. It can be used as a first step of the solution but not as the solution itself. ### Using filters {#using-filters} @@ -83,7 +83,7 @@ If the website supports only overlapping ranges (e.g. **$0-$5**, **$5–10**), i In rare cases, a listing can have more than one value that you are filtering in a range. A typical example is Amazon, where each product has several offers and those offers have different prices. If any of those offers is within the range, the product is shown. -No easy way exists to get around this but the price range split works even with duplicate listings, just use a [JS set](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set) or request queue to deduplicate them. +No easy way exists to get around this but the price range split works even with duplicate listings, use a [JS set](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set) or request queue to deduplicate them. #### How is the range passed to the URL? 
{#how-is-the-range-passed-to-the-url} @@ -105,7 +105,7 @@ If it doesn't, we have to find a different way to check if the number of listing Logically, every full (price) range starts at 0 and ends at infinity. But the way this is encoded will differ on each site. The end of the price range can be either closed (0) or open (infinity). Open ranges require special handling when you split them (we will get to that). -Most sites will let you start with 0 (there might be exceptions, where you will have to make the start open), so we can use just that. The high end is more complicated. Because you don't know the biggest price, it is best to leave it open and handle it specially. Internally you can just assign `null` to the value. +Most sites will let you start with 0 (there might be exceptions, where you will have to make the start open), so we can use just that. The high end is more complicated. Because you don't know the biggest price, it is best to leave it open and handle it specially. Internally you can assign `null` to the value. Here are a few examples of a query parameter with an open and closed high-end range: @@ -144,7 +144,7 @@ await Actor.init(); const MAX_PRODUCTS_PAGINATION = 1000; -// These is just an example, choose what makes sense for your site +// Just an example, choose what makes sense for your site const PIVOT_PRICE_RANGES = [ { min: 0, max: 9.99 }, { min: 10, max: 99.99 }, @@ -208,7 +208,7 @@ const crawler = new CheerioCrawler({ // The filter is either good enough of we have to split it if (numberOfProducts <= MAX_PRODUCTS_PAGINATION) { - // We just pass the URL for scraping, we could optimize it so the page is not opened again + // We pass the URL for scraping, we could optimize it so the page is not opened again await crawler.addRequests([{ url: `${request.url}&page=1`, userData: { label: 'PAGINATION' }, @@ -268,7 +268,7 @@ const { min, max } = getFiltersFromUrl(request.url); // Our generic splitFilter function doesn't account for decimal values so we will have to convert to cents and back to dollars const newFilters = splitFilter({ min: min * 100, max: max * 100 }); -// And we just enqueue those 2 new filters so the process will recursively repeat until all pages get to the PAGINATION phase +// And we enqueue those 2 new filters so the process will recursively repeat until all pages get to the PAGINATION phase const requestsToEnqueue = []; for (const filter of newFilters) { requestsToEnqueue.push({ diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md index b73f86132a..2b0d9133b1 100644 --- a/sources/academy/webscraping/anti_scraping/index.md +++ b/sources/academy/webscraping/anti_scraping/index.md @@ -20,7 +20,7 @@ In development, it is crucial to check and adjust the configurations related to ## Quick start {#quick-start} -If you don't have time to read about the theory behind anti-scraping protections to fine-tune your scraping project and instead you just need to get unblocked ASAP, here are some quick tips: +If you don't have time to read about the theory behind anti-scraping protections to fine-tune your scraping project and instead you need to get unblocked ASAP, here are some quick tips: - Use high-quality proxies. [Residential proxies](/platform/proxy/residential-proxy) are the least blocked. You can find many providers out there like Apify, BrightData, Oxylabs, NetNut, etc. - Set **real-user-like HTTP settings** and **browser fingerprints**. 
[Crawlee](https://crawlee.dev/) uses statistically generated realistic HTTP headers and browser fingerprints by default for all of its crawlers. @@ -66,7 +66,7 @@ Anti-scraping protections can work on many different layers and use a large amou 1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate a different IP addresses but their quality matters a lot. 2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, ciphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration). -3. **What you are scraping** - The same data can be extracted in many ways from a website. You can just get the inital HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently. +3. **What you are scraping** - The same data can be extracted in many ways from a website. You can get the initial HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently. 4. **How you behave** - The website can see patterns in how you are ordering your requests, how fast you are scraping, etc. It can also analyze browser behavior like mouse movement, clicks or key presses. These are the 4 main principles that anti-scraping protections are based on. @@ -107,7 +107,7 @@ Although the talk, given in 2021, features some outdated code examples, it still Because we here at Apify scrape for a living, we have discovered many popular and niche anti-scraping techniques. We've compiled them into a short and comprehensible list here to help understand the roadblocks before this course teaches you how to get around them. -> However, not all issues you encounter are caused by anti-scraping systems. Sometimes, it's just a simple configuration issue. Learn [how to effectively debug your programs here](/academy/node-js/analyzing-pages-and-fixing-errors). +> Not all issues you encounter are caused by anti-scraping systems. Sometimes, it's a simple configuration issue. Learn [how to effectively debug your programs here](/academy/node-js/analyzing-pages-and-fixing-errors). ### IP rate-limiting diff --git a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md index b328cfebe0..f17faea8c0 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md @@ -22,7 +22,7 @@ There are a few factors that determine the quality of a proxy IP: - How long was the proxy left to "heal" before it was resold? - What is the quality of the underlying server of the proxy? (latency) -Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate-limiting, which brings new challenges for scrapers that can no longer just rely on simple IP rotation. Anti-scraping software providers, such as CloudFlare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) was blocked altogether.
If you care about the quality of your IPs, use them as a real user, and any website will have a hard time banning them completely. +Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate-limiting, which brings new challenges for scrapers that can no longer rely on simple IP rotation. Anti-scraping software providers, such as CloudFlare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) was blocked altogether. If you care about the quality of your IPs, use them as a real user, and any website will have a hard time banning them completely. Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the [previous lesson](../index.md). diff --git a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md index ca61b50cff..1f09724663 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md @@ -103,7 +103,7 @@ That's it! The crawler will now automatically rotate through the proxies we prov ## A bit about debugging proxies {#debugging-proxies} -At the time of writing, our above scraper utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request. +At the time of writing, the scraper above utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request. ```js const crawler = new CheerioCrawler({ diff --git a/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md b/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md index 3a606317b9..521d11c9c4 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md +++ b/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md @@ -13,7 +13,7 @@ slug: /anti-scraping/techniques/browser-challenges Browser challenges are a type of security measure that relies on browser fingerprints. These challenges typically involve a JavaScript program that collects both static and dynamic browser fingerprints. Static fingerprints include attributes such as User-Agent, video card, and number of CPU cores available. 
Dynamic fingerprints, on the other hand, might involve rendering fonts or objects in the canvas (known as a [canvas fingerprint](./fingerprinting.md#with-canvases)), or playing audio in the [AudioContext](./fingerprinting.md#from-audiocontext). We were covering the details in the previous [fingerprinting](./fingerprinting.md) lesson. -While some browser challenges are relatively straightforward - for example, just loading an image and checking if it renders correctly - others can be much more complex. One well-known example of a complex browser challenge is Cloudflare's browser screen check. In this challenge, Cloudflare visually inspects the browser screen and blocks the first request if any inconsistencies are found. This approach provides an extra layer of protection against automated attacks. +While some browser challenges are relatively straightforward - for example, loading an image and checking if it renders correctly - others can be much more complex. One well-known example of a complex browser challenge is Cloudflare's browser screen check. In this challenge, Cloudflare visually inspects the browser screen and blocks the first request if any inconsistencies are found. This approach provides an extra layer of protection against automated attacks. Many online protections incorporate browser challenges into their security measures, but the specific techniques used can vary. diff --git a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md index 774b8f0d58..fe964d1c06 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md +++ b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md @@ -17,7 +17,7 @@ Geolocation is yet another way websites can detect and block access or show limi Certain websites might use certain location-specific/language-specific [headers](../../../glossary/concepts/http_headers.md)/[cookies](../../../glossary/concepts/http_cookies.md) to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/using-cloudfront-headers.html)). -On targets which are just utilizing cookies and headers to identify the location from which a request is coming from, it is pretty straightforward to make requests which appear like they are coming from somewhere else. +On targets which are utilizing just cookies and headers to identify the location from which a request is coming from, it is pretty straightforward to make requests which appear like they are coming from somewhere else. ## IP address {#ip-address} diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index 1eaea1294c..22049bacfe 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -21,7 +21,7 @@ The most common and rudimentary forms of pagination have page numbers. Imagine p ![Amazon pagination](https://apify-docs.s3.amazonaws.com/master/docs/assets/tutorials/images/pagination.jpg) -This implementation makes it fairly straightforward to programmatically paginate through an API, as it pretty much entails just incrementing up or down in order to receive the next set of items. 
The page number is usually provided right in the parameters of the request URL; however, some APIs require it to be provided in the request body instead. +This implementation makes it fairly straightforward to programmatically paginate through an API, as it pretty much entails incrementing up or down in order to receive the next set of items. The page number is usually provided right in the parameters of the request URL; however, some APIs require it to be provided in the request body instead. ## Offset pagination {#offset-pagination} diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md index c49dc0cea8..8342f186b4 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md @@ -37,13 +37,13 @@ Here's what our target endpoint's URL looks like coming directly from the Networ https://api-v2.soundcloud.com/users/141707/tracks?representation=&client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=20&offset=0&linked_partitioning=1&app_version=1646987254&app_locale=en ``` -Since our request doesn't have any body/payload, we just need to analyze the URL. We can break this URL down into chunks that help us understand what each value does. +Since our request doesn't have any body/payload, we need to analyze the URL. We can break this URL down into chunks that help us understand what each value does. ![Breaking down the request url into understandable chunks](./images/analyzing-the-url.png) Understanding an API's various configurations helps with creating a game-plan on how to best scrape it, as many of the parameters can be utilized for pagination, or data-filtering. Additionally, these values can be mapped to a scraper's configuration options, which overall makes the scraper more versatile. -Let's say we want to receive all of the user's tracks in one request. Based on our observations of the endpoint's different parameters, we can modify the URL and utilize the `limit` option to return more than just twenty songs. The `limit` option is extremely common with most APIs, and allows the person making the request to literally limit the maximum number of results to be returned in the request: +Let's say we want to receive all of the user's tracks in one request. Based on our observations of the endpoint's different parameters, we can modify the URL and utilize the `limit` option to return more than twenty songs. The `limit` option is extremely common with most APIs, and allows the person making the request to literally limit the maximum number of results to be returned in the request: ```text https://api-v2.soundcloud.com/users/141707/tracks?client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=99999 diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md index 836feb6693..b7aaabeb30 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md @@ -42,7 +42,7 @@ Finally, create a file called **index.js**. 
This is the file we will be working ## Preparations {#preparations} -If we remember from the last lesson, we need to pass a valid "app token" within the **X-App-Token** header of every single request we make, or else we will be blocked. When testing queries, we just copied this value straight from the **Network** tab; however, since this is a dynamic value, we should farm it. +If we remember from the last lesson, we need to pass a valid "app token" within the **X-App-Token** header of every single request we make, or else we will be blocked. When testing queries, we copied this value straight from the **Network** tab; however, since this is a dynamic value, we should farm it. Since we know requests with this header are sent right when the front page is loaded, it can be farmed by visiting the page and intercepting requests in Puppeteer like so: @@ -71,7 +71,7 @@ const scrapeAppToken = async () => { await page.waitForNetworkIdle(); - // otherwise, just close the browser after networkidle + // otherwise, close the browser after networkidle // has been fired await browser.close(); @@ -135,7 +135,7 @@ query SearchQuery($query: String!) { } ``` -The next step is to just fill out the fields we'd like back, and we've got our final query! +The next step is to fill out the fields we'd like back, and we've got our final query! ```graphql query SearchQuery($query: String!) { diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md index 79f3a995f4..255f404294 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md @@ -21,7 +21,7 @@ Not only does becoming comfortable with and understanding the ins and outs of us ! Cheddar website was changed and the below example no longer works there. Nonetheless, the general approach is still viable on some websites even though introspection is disabled on most. -In order to perform introspection on our [target website](https://cheddar.com), we just need to make a request to their GraphQL API with this introspection query using [Insomnia](../../../glossary/tools/insomnia.md) or another HTTP client that supports GraphQL: +In order to perform introspection on our [target website](https://cheddar.com), we need to make a request to their GraphQL API with this introspection query using [Insomnia](../../../glossary/tools/insomnia.md) or another HTTP client that supports GraphQL: > To make a GraphQL query in Insomnia, make sure you've set the HTTP method to **POST** and the request body type to **GraphQL Query**. @@ -132,7 +132,7 @@ The response body of our introspection query contains a whole lot of useful info ## Understanding the response {#understanding-the-response} -An introspection query's response body size will vary depending on how big the target API is. In our case, what we got back is a 27 thousand line JSON response 🤯 If you just thought to yourself, "Wow, that's a whole lot to sift through! I don't want to look through that!", you are absolutely right. Luckily for us, there is a fantastic online tool called [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) (no install required) which can take this massive JSON response and turn it into a digestable visualization of the API. +An introspection query's response body size will vary depending on how big the target API is. 
In our case, what we got back is a 27 thousand line JSON response 🤯 If you thought to yourself, "Wow, that's a whole lot to sift through! I don't want to look through that!", you are absolutely right. Luckily for us, there is a fantastic online tool called [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) (no install required) which can take this massive JSON response and turn it into a digestible visualization of the API. Let's copy the response to our clipboard by clicking inside of the response body and pressing **CMD** + **A**, then subsequently **CMD** + **C**. Now, we'll head over to [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) and click on **Change Schema**. In the modal, we'll click on the **Introspection** tab and paste our data into the text area. @@ -146,9 +146,9 @@ Now that we have this visualization to work off of, it will be much easier to bu ## Building a query {#building-a-query} -In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's just get our feet wet by using the data we have from GraphQL Voyager to build a simple query. +In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a simple query. -Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After just a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out! +Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out! ![The media field pointing to datatype slugable](./images/media-field.jpg) @@ -181,7 +181,7 @@ Let's send it! Oh, okay. That didn't work. But **why**? -Rest assured, nothing is wrong with our query. We are most likely just missing an authorization token/parameter. Let's check back on the Cheddar website within our browser to see what types of headers are being sent with the requests there: +Rest assured, nothing is wrong with our query. We are most likely missing an authorization token/parameter. Let's check back on the Cheddar website within our browser to see what types of headers are being sent with the requests there: ![Request headers back on the Cheddar website](./images/cheddar-headers.jpg) diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md index 429cae9c53..33ce89d2fb 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md @@ -103,7 +103,7 @@ query SearchQuery($query: String!, $count: Int!, $cursor: String) { } } -If the query provided in the payload you find in the **Network** tab is good enough for your scraper's needs, you don't actually have to go down the GraphQL rabbit hole.
Rather, you can just change the variables to receive the data you want. For example, right now, our example payload is set up to search for articles matching the keyword **test**. However, if we wanted to search for articles matching **cats** instead, we could do that by changing the **query** variable like so: +If the query provided in the payload you find in the **Network** tab is good enough for your scraper's needs, you don't actually have to go down the GraphQL rabbit hole. Rather, you can change the variables to receive the data you want. For example, right now, our example payload is set up to search for articles matching the keyword **test**. However, if we wanted to search for articles matching **cats** instead, we could do that by changing the **query** variable like so: ```json { @@ -112,7 +112,7 @@ If the query provided in the payload you find in the **Network** tab is good eno } ``` -Depending on the API, just doing this can be sufficient. However, sometimes we want to utilize complex GraphQL features in order to optimize our scrapers or just to receive more data than is being provided in the response of the request found in the **Network** tab. This is what we will be discussing in the next lessons. +Depending on the API, doing just this can be sufficient. However, sometimes we want to utilize complex GraphQL features in order to optimize our scrapers or to receive more data than is being provided in the response of the request found in the **Network** tab. This is what we will be discussing in the next lessons. ## Next up {#next} diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md index d4249ce37c..c79c8ae338 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md @@ -11,7 +11,7 @@ slug: /puppeteer-playwright/common-use-cases --- -You can do just about anything with a headless browser, but, there are some extremely common use cases that are important to understand and be prepared for when you might run into them. This short section will be all about solving these common situations. Here's what we'll be covering: +You can do almost anything with a headless browser, but there are some extremely common use cases that are important to understand and be prepared for when you might run into them. This short section will be all about solving these common situations. Here's what we'll be covering: 1. Login flow (logging into an account) 2. Paginating through results on a website diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md index 34bfbbfd9a..e237137b82 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md @@ -126,7 +126,7 @@ const emailsToSend = [ What we could do is log in 3 different times, then automate the sending of each email; however, this is extremely inefficient. When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the [cookies](../../../glossary/concepts/http_cookies.md) stored in your browser.
These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. -With this knowledge of cookies, it can be concluded that we can just pass the cookies generated by the code above right into each new browser context that we use to send each email. That way, we won't have to run the login flow each time. +With this knowledge of cookies, it can be concluded that we can pass the cookies generated by the code above right into each new browser context that we use to send each email. That way, we won't have to run the login flow each time. ### Retrieving cookies {#retrieving-cookies} diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md index 233048d998..0085904858 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even just hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. +If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. ![Amazon pagination](../../advanced_web_scraping/images/pagination.png) diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md index d94fc6a967..58ac96fd55 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md @@ -302,4 +302,4 @@ await browser.close(); ## Next up {#next} -Our [next lesson](../reading_intercepting_requests.md) will be discussing something super cool - request interception and reading data from requests and responses. It's just like using DevTools, except programmatically! +Our [next lesson](../reading_intercepting_requests.md) will be discussing something super cool - request interception and reading data from requests and responses. It's like using DevTools, except programmatically! diff --git a/sources/academy/webscraping/puppeteer_playwright/index.md b/sources/academy/webscraping/puppeteer_playwright/index.md index 672adfdc5c..186b3b8d9b 100644 --- a/sources/academy/webscraping/puppeteer_playwright/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/index.md @@ -17,7 +17,7 @@ import TabItem from '@theme/TabItem'; [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are both libraries which allow you to write code in Node.js which automates a headless browser. -> A headless browser is just a regular browser like the one you're using right now, but without the user-interface. Because they don't have a UI, they generally perform faster as they don't render any visual content. 
For an in-depth understanding of headless browsers, check out [this short article](https://blog.arhg.net/2009/10/what-is-headless-browser.html) about them. +> A headless browser is a regular browser like the one you're using right now, but without the user interface. Because they don't have a UI, they generally perform faster as they don't render any visual content. For an in-depth understanding of headless browsers, check out [this short article](https://blog.arhg.net/2009/10/what-is-headless-browser.html) about them. Both packages were developed by the same team and are very similar, which is why we have combined the Puppeteer course and the Playwright course into one super-course that shows code examples for both technologies. There are some small differences between the two, which will be highlighted in the examples. @@ -25,7 +25,7 @@ Both packages were developed by the same team and are very similar, which is why ## Advantages of using a headless browser {#advantages-of-headless-browsers} -When automating a headless browser, you can do a whole lot more in comparison to just making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc. +When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc. Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped). diff --git a/sources/academy/webscraping/puppeteer_playwright/page/index.md b/sources/academy/webscraping/puppeteer_playwright/page/index.md index 5344c0ed5d..d96db0d003 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/index.md @@ -47,7 +47,7 @@ await browser.close(); </TabItem> </Tabs> -Then, we can visit a website with the `page.goto()` method. Let's go to [Google](https://google.com) for now. We'll also use the `page.waitForTimeout()` function, which will force the program to wait for a number of seconds before quitting (otherwise, everything will just flash before our eyes and we won't really be able to tell what's going on): +Then, we can visit a website with the `page.goto()` method. Let's go to [Google](https://google.com) for now. We'll also use the `page.waitForTimeout()` function, which will force the program to wait for a number of seconds before quitting (otherwise, everything will flash before our eyes and we won't really be able to tell what's going on): <Tabs groupId="main"> <TabItem value="Playwright" label="Playwright"> diff --git a/sources/academy/webscraping/puppeteer_playwright/proxies.md b/sources/academy/webscraping/puppeteer_playwright/proxies.md index 8b3d3532a7..556638ab6f 100644 --- a/sources/academy/webscraping/puppeteer_playwright/proxies.md +++ b/sources/academy/webscraping/puppeteer_playwright/proxies.md @@ -169,7 +169,7 @@ const browser = await puppeteer.launch({ </TabItem> </Tabs> -However, authentication parameters need to be passed in separately in order to work.
In Puppeteer, the username and password need to be passed into the `page.authenticate()` prior to any navigations being made, while in Playwright they just need to be passed into the **proxy** option object. +However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed into the `page.authenticate()` prior to any navigations being made, while in Playwright they can be passed into the **proxy** option object. <Tabs groupId="main"> <TabItem value="Playwright" label="Playwright"> diff --git a/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md b/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md index a6fb8ec610..6c3036cfba 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md @@ -124,7 +124,7 @@ This really depends on your use case though. If you want 100% clean data, you mi ## Recap {#recap} -Wow, that's a whole lot of things to abide by! How will you remember all of them? Well, to simplify everything, just try to follow these three points: +Wow, that's a whole lot of things to abide by! How will you remember all of them? Try to follow these three points: 1. Describe your code as you write it with good naming, constants, and comments. It **should read like a book**. 2. Add log messages at points throughout your code so that when it's running, you (and everyone else) know what's going on. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md b/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md index eebeb09628..de17ebc4f7 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md @@ -11,7 +11,7 @@ slug: /web-scraping-for-beginners/challenge/scraping-amazon --- -In our quick chat about modularity, we finished the code for the results page and added a request for each product to the crawler's **RequestQueue**. Here, we just need to scrape the description, so it shouldn't be too hard: +In our quick chat about modularity, we finished the code for the results page and added a request for each product to the crawler's **RequestQueue**. Here, we need to scrape the description, so it shouldn't be too hard: ```js // routes.js @@ -99,7 +99,7 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => { ## Final code {#final-code} -That should be it! Let's just make sure we've all got the same code: +That should be it! Let's make sure we've all got the same code: ```js // constants.js diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md index d43c18ab55..31969fc261 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md @@ -43,7 +43,7 @@ After you add this one line and run the code, you'll find your CSV with all the ## Exporting data to JSON {#export-json} -Exporting to JSON is very similar to exporting to CSV, we just need to use a different function: [`Dataset.exportToJSON`](https://crawlee.dev/api/core/class/Dataset#exportToJSON). 
Exporting to JSON is useful when you don't want to work with each item separately, but would rather have one big JSON file with all the results. +Exporting to JSON is very similar to exporting to CSV, but we'll use a different function: [`Dataset.exportToJSON`](https://crawlee.dev/api/core/class/Dataset#exportToJSON). Exporting to JSON is useful when you don't want to work with each item separately, but would rather have one big JSON file with all the results. ```js title=browser.js // ... diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md index 67f8ada243..1e74eddf2e 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md @@ -25,7 +25,7 @@ On a webpage, the link above will look like this: [This is a link to example.com ## Extracting links 🔗 {#extracting-links} -If a link is just an HTML element, and the URL is just an attribute, this means that we can extract links the same way as we extracted data. To test this theory in the browser, we can try running the following code in our DevTools console on any website. +If a link is an HTML element, and the URL is an attribute, this means that we can extract links the same way as we extracted data. To test this theory in the browser, we can try running the following code in our DevTools console on any website. ```js // Select all the <a> elements. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md index 588f4177fa..9123e9dab8 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md @@ -107,7 +107,7 @@ for (const url of productUrls) { console.log(productPageTitle); } catch (error) { // In the catch block, we handle errors. - // This time, we will just print + // This time, we will print // the error message and the url. console.error(error.message, url); } diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md index 45e1ed02dc..09754ab160 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md @@ -170,7 +170,7 @@ When you run the code, you'll see the names and URLs of all the products printed ## Extracting data {#extracting-data} -We have the crawler in place, and it's time to extract data. We already have the extraction code from the previous lesson, so we can just copy and paste it into the `requestHandler` with tiny changes. Instead of printing results to the terminal, we will save it to disk. +We have the crawler in place, and it's time to extract data. We already have the extraction code from the previous lesson, so we can copy and paste it into the `requestHandler` with tiny changes. Instead of printing results to the terminal, we will save it to disk. ```js title=crawlee.js // To save data to disk, we need to import Dataset. 
diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md index 38583801c2..76f13dc62f 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md @@ -25,7 +25,7 @@ When you first open Chrome DevTools on Wikipedia, you will start on the Elements Each element is enclosed in an HTML tag. For example `<div>`, `<p>`, and `<span>` are all tags. When you add something inside of those tags, like `<p>Hello!</p>` you create an element. You can also see elements inside other elements in the **Elements** tab. This is called nesting, and it gives the page its structure. -At the bottom, there's the **JavaScript console**, which is a powerful tool which can be used to manipulate the website. If the console is not there, you can press **ESC** to toggle it. All of this might look super complicated at first, but don't worry, there's no need to understand everything just yet - we'll walk you through all the important things you need to know. +At the bottom, there's the **JavaScript console**, which is a powerful tool which can be used to manipulate the website. If the console is not there, you can press **ESC** to toggle it. All of this might look super complicated at first, but don't worry, there's no need to understand everything yet - we'll walk you through all the important things you need to know. ![Console in Chrome DevTools](./images/browser-devtools-console.png) @@ -71,4 +71,4 @@ By changing HTML elements from the Console, you can change what's displayed on t In this lesson, we learned the absolute basics of interaction with a page using the DevTools. In the [next lesson](./using_devtools.md), you will learn how to extract data from it. We will extract data about the on-sale products on the [Warehouse store](https://warehouse-theme-metal.myshopify.com). -It isn't a real store, but a full-featured demo of a Shopify online store. And that is just perfect for our purposes. Shopify is one of the largest e-commerce platforms in the world, and it uses all the latest technologies that a real e-commerce web application would use. Learning to scrape a Shopify store is useful, because you can immediately apply the learnings to millions of websites. +It isn't a real store, but a full-featured demo of a Shopify online store. And that is perfect for our purposes. Shopify is one of the largest e-commerce platforms in the world, and it uses all the latest technologies that a real e-commerce web application would use. Learning to scrape a Shopify store is useful, because you can immediately apply the learnings to millions of websites. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md index 9f598c6406..c4b9baf78b 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md @@ -17,11 +17,11 @@ Before you can start writing scraper code, you need to have your computer set up Let's start with the installation of Node.js. Node.js is an engine for running JavaScript, quite similar to the browser console we used in the previous lessons. 
You feed it JavaScript code, and it executes it for you. Why not just use the browser console? Because it's limited in its capabilities. Node.js is way more powerful and is much better suited for coding scrapers. -If you're on macOS, use [this tutorial to install Node.js](https://blog.apify.com/how-to-install-nodejs/). If you're using Windows [visit the official Node.js website](https://nodejs.org/en/download/). And if you're on Linux, just use your package manager to install `nodejs`. +If you're on macOS, use [this tutorial to install Node.js](https://blog.apify.com/how-to-install-nodejs/). If you're using Windows [visit the official Node.js website](https://nodejs.org/en/download/). And if you're on Linux, use your package manager to install `nodejs`. ## Install a text editor {#install-an-editor} -Many text editors are available for you to choose from when programming. You might already have a preferred one so feel free to use that. Just make sure it has syntax highlighting and support for Node.js. If you don't have a text editor, we suggest starting with VSCode. It's free, very popular, and well maintained. [Download it here](https://code.visualstudio.com/download). +Many text editors are available for you to choose from when programming. You might already have a preferred one so feel free to use that. Make sure it has syntax highlighting and support for Node.js. If you don't have a text editor, we suggest starting with VSCode. It's free, very popular, and well maintained. [Download it here](https://code.visualstudio.com/download). Once you downloaded and installed it, you can open a folder where we will build your scraper. We recommend starting with a new, empty folder. From 8a92855016c08be894d55980899eb71b083b5407 Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Thu, 25 Apr 2024 17:21:06 +0200 Subject: [PATCH 05/15] style: remove 'easily' --- sources/academy/glossary/concepts/css_selectors.md | 2 +- sources/academy/glossary/tools/modheader.md | 2 +- sources/academy/glossary/tools/postman.md | 4 ++-- sources/academy/glossary/tools/user_agent_switcher.md | 4 ++-- .../academy/platform/deploying_your_code/input_schema.md | 2 +- .../platform/deploying_your_code/inputs_outputs.md | 2 +- .../solutions/handling_migrations.md | 2 +- .../solutions/rotating_proxies.md | 2 +- .../expert_scraping_with_apify/solutions/saving_stats.md | 2 +- sources/academy/platform/getting_started/apify_api.md | 2 +- sources/academy/platform/running_a_web_server.md | 4 ++-- .../academy/tutorials/apify_scrapers/cheerio_scraper.md | 4 ++-- .../academy/tutorials/apify_scrapers/getting_started.md | 4 ++-- .../academy/tutorials/apify_scrapers/puppeteer_scraper.md | 4 ++-- sources/academy/tutorials/apify_scrapers/web_scraper.md | 2 +- .../node_js/add_external_libraries_web_scraper.md | 4 ++-- .../node_js/analyzing_pages_and_fixing_errors.md | 2 +- .../tutorials/node_js/choosing_the_right_scraper.md | 2 +- .../academy/tutorials/node_js/debugging_web_scraper.md | 2 +- .../node_js/handle_blocked_requests_puppeteer.md | 4 ++-- .../tutorials/node_js/request_labels_in_apify_actors.md | 2 +- sources/academy/tutorials/node_js/scraping_shadow_doms.md | 6 +++--- .../tutorials/node_js/when_to_use_puppeteer_scraper.md | 2 +- .../advanced_web_scraping/tips_and_tricks_robustness.md | 2 +- .../anti_scraping/mitigation/generating_fingerprints.md | 6 +++--- .../webscraping/anti_scraping/mitigation/using_proxies.md | 8 ++++---- .../anti_scraping/techniques/fingerprinting.md | 2 +- 
.../webscraping/anti_scraping/techniques/rate_limiting.md | 2 +- sources/academy/webscraping/api_scraping/index.md | 4 ++-- .../executing_scripts/extracting_data.md | 2 +- .../webscraping/puppeteer_playwright/page/page_methods.md | 2 +- .../academy/webscraping/switching_to_typescript/enums.md | 4 ++-- .../webscraping/switching_to_typescript/interfaces.md | 2 +- .../webscraping/switching_to_typescript/mini_project.md | 2 +- .../web_scraping_for_beginners/best_practices.md | 2 +- .../crawling/headless_browser.md | 2 +- .../data_extraction/devtools_continued.md | 2 +- .../data_extraction/project_setup.md | 2 +- .../data_extraction/using_devtools.md | 4 ++-- .../web_scraping_for_beginners/introduction.md | 2 +- 40 files changed, 58 insertions(+), 58 deletions(-) diff --git a/sources/academy/glossary/concepts/css_selectors.md b/sources/academy/glossary/concepts/css_selectors.md index 4c2fd79eda..ee36b53c0e 100644 --- a/sources/academy/glossary/concepts/css_selectors.md +++ b/sources/academy/glossary/concepts/css_selectors.md @@ -59,7 +59,7 @@ CSS selectors are important for web scraping because they allow you to target sp For example, if you wanted to scrape a list of all the titles of blog posts on a website, you could use a CSS selector to select all the elements that contain the title text. Once you have selected these elements, you can extract the text from them and use it for your scraping project. -Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it easily. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need. +Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need. ## Resources diff --git a/sources/academy/glossary/tools/modheader.md b/sources/academy/glossary/tools/modheader.md index 7cafa0ae57..354abe743f 100644 --- a/sources/academy/glossary/tools/modheader.md +++ b/sources/academy/glossary/tools/modheader.md @@ -21,7 +21,7 @@ After you install the ModHeader extension, you should see it pinned in Chrome's ![Modheader's simple interface](./images/modheader.jpg) -Here, you can add headers, remove headers, and even save multiple collections of headers that you can easily toggle between (which are called **Profiles** within the extension itself). +Here, you can add headers, remove headers, and even save multiple collections of headers that you can toggle between (which are called **Profiles** within the extension itself). ## Use cases {#use-cases} diff --git a/sources/academy/glossary/tools/postman.md b/sources/academy/glossary/tools/postman.md index ac24bf9c03..a41c4aeabb 100644 --- a/sources/academy/glossary/tools/postman.md +++ b/sources/academy/glossary/tools/postman.md @@ -11,7 +11,7 @@ slug: /tools/postman --- -[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). 
This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to easily test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper. +[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper. The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a simple signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/getting-started/introduction/). @@ -55,7 +55,7 @@ In order to check whether there are any cookies associated with a certain reques ![Button to view the cached cookies](./images/postman-cookies-button.png) -Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to **<https://github.com/apify>**, within this window we would be able to easily find cached cookies associated with github.com. Cookies can also be easily edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here. +Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to **<https://github.com/apify>**, within this window we would be able to find cached cookies associated with github.com. Cookies can also be edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here. ![Managing cookies in Postman with the "MANAGE COOKIES" window](./images/postman-manage-cookies.png) diff --git a/sources/academy/glossary/tools/user_agent_switcher.md b/sources/academy/glossary/tools/user_agent_switcher.md index 7b86fcbcc8..e83f5140d0 100644 --- a/sources/academy/glossary/tools/user_agent_switcher.md +++ b/sources/academy/glossary/tools/user_agent_switcher.md @@ -1,13 +1,13 @@ --- title: User-Agent Switcher -description: Learn how to easily switch your User-Agent header to different values in order to monitor how a certain site responds to the changes. +description: Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes. 
sidebar_position: 9.8 slug: /tools/user-agent-switcher --- # User-Agent Switcher -**Learn how to easily switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.** +**Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.** --- diff --git a/sources/academy/platform/deploying_your_code/input_schema.md b/sources/academy/platform/deploying_your_code/input_schema.md index 759f322963..4a60c8c9f3 100644 --- a/sources/academy/platform/deploying_your_code/input_schema.md +++ b/sources/academy/platform/deploying_your_code/input_schema.md @@ -102,7 +102,7 @@ Here is what the input schema we wrote will render on the platform: ![Rendered UI from input schema](./images/rendered-ui.png) -Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to easily understand the Actor and not become overwhelmed. +Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to understand the Actor and not become overwhelmed. It's not expected to memorize all of the fields that properties can take or the different editor types available, which is why it's always good to reference the [input schema documentation](/platform/actors/development/actor-definition/input-schema) when writing a schema. diff --git a/sources/academy/platform/deploying_your_code/inputs_outputs.md b/sources/academy/platform/deploying_your_code/inputs_outputs.md index 293db294fe..45a5be745a 100644 --- a/sources/academy/platform/deploying_your_code/inputs_outputs.md +++ b/sources/academy/platform/deploying_your_code/inputs_outputs.md @@ -221,4 +221,4 @@ After running our script, there should be a single item in the default dataset t ## Next up {#next} -That's it! We've now added all of the files and code necessary to convert our software into an Actor. In the [next lesson](./input_schema.md), we'll be learning how to easily generate a user interface for our Actor's input so that users don't have to provide the input in raw JSON format. +That's it! We've now added all of the files and code necessary to convert our software into an Actor. In the [next lesson](./input_schema.md), we'll be learning how to generate a user interface for our Actor's input so that users don't have to provide the input in raw JSON format. diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md index 2cfba52a4d..fe23b28fc4 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md @@ -11,7 +11,7 @@ slug: /expert-scraping-with-apify/solutions/handling-migrations --- -Let's first head into our **demo-actor** and create a new file named **asinTracker.js** in the **src** folder. Within this file, we are going to build a utility class which will allow us to easily store, modify, persist, and log our tracked ASIN data. +Let's first head into our **demo-actor** and create a new file named **asinTracker.js** in the **src** folder. Within this file, we are going to build a utility class which will allow us to store, modify, persist, and log our tracked ASIN data. 
Here's the skeleton of our class: diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md index ee9cf2960a..73c4741d19 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md @@ -94,7 +94,7 @@ const proxyConfiguration = await Actor.createProxyConfiguration({ **Q: How can you prevent an error from occurring if one of the proxy groups that a user has is removed? What are the best practices for these scenarios?** -**A:** By making the proxy for the scraper to use be configurable by the user through the Actor's input. That way, they can easily switch proxies if the Actor stops working due to proxy-related issues. It can also be done by using the **AUTO** proxy instead of specific groups. +**A:** By making the proxy for the scraper to use be configurable by the user through the Actor's input. That way, they can switch proxies if the Actor stops working due to proxy-related issues. It can also be done by using the **AUTO** proxy instead of specific groups. **Q: Does it make sense to rotate proxies when you are logged into a website?** diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md index 2fab8d5187..95970aab02 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md @@ -114,7 +114,7 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => { ## Saving stats with dataset items {#saving-stats-with-dataset-items} -Still, in the **OFFERS** handler, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is easily accessible in the context object. +Still, in the **OFFERS** handler, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is accessible in the context object. ```js router.addHandler(labels.OFFERS, async ({ $, request }) => { diff --git a/sources/academy/platform/getting_started/apify_api.md b/sources/academy/platform/getting_started/apify_api.md index b1ccdb631b..0f410f7b7f 100644 --- a/sources/academy/platform/getting_started/apify_api.md +++ b/sources/academy/platform/getting_started/apify_api.md @@ -69,7 +69,7 @@ What we've done in this lesson only scratches the surface of what the Apify API ## Next up {#next} -[Next up](./apify_client.md), we'll be learning about how to use Apify's JavaScript and Python clients to easily interact with the API right within our code. +[Next up](./apify_client.md), we'll be learning about how to use Apify's JavaScript and Python clients to interact with the API right within our code. <!-- Note: From the previous version of this lesson, some now unused but useful images still remain. diff --git a/sources/academy/platform/running_a_web_server.md b/sources/academy/platform/running_a_web_server.md index 6aff852f03..ce80891b7d 100644 --- a/sources/academy/platform/running_a_web_server.md +++ b/sources/academy/platform/running_a_web_server.md @@ -1,6 +1,6 @@ --- title: Running a web server on the Apify platform -description: A web server running in an Actor can act as a communication channel with the outside world. 
Learn how to easily set one up with Node.js. sidebar_position: 11 category: apify platform slug: /running-a-web-server +description: A web server running in an Actor can act as a communication channel with the outside world. Learn how to set one up with Node.js. # Running a web server on the Apify platform -**A web server running in an Actor can act as a communication channel with the outside world. Learn how to easily set one up with Node.js.** +**A web server running in an Actor can act as a communication channel with the outside world. Learn how to set one up with Node.js.** --- diff --git a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md index 3e8167540f..3870e2a8c9 100644 --- a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md @@ -405,7 +405,7 @@ scrape all of the Actors' data. After it succeeds, open the **Dataset** tab agai You should have a table of all the Actor's details in front of you. If you do, great job! You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo. -> There's an important caveat. The way we implemented pagination here is in no way a generic system that you can easily +> There's an important caveat. The way we implemented pagination here is in no way a generic system that you can use with other websites. Cheerio is fast (and that means it's cheap), but it's not easy. Sometimes there's just no way to get all results with Cheerio only and other times it takes hours of research. Keep this in mind when choosing the right scraper for your job. But don't get discouraged. Often times, the only thing you will ever need is to @@ -497,7 +497,7 @@ of JavaScript. It helps you put what matters on top, if you so desire. ## [](#final-word) Final word -Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify easily and effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! +Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! ## [](#whats-next) What's next diff --git a/sources/academy/tutorials/apify_scrapers/getting_started.md b/sources/academy/tutorials/apify_scrapers/getting_started.md index c503e2e891..ac199f473f 100644 --- a/sources/academy/tutorials/apify_scrapers/getting_started.md +++ b/sources/academy/tutorials/apify_scrapers/getting_started.md @@ -19,7 +19,7 @@ It doesn't matter whether you arrived here from **Web Scraper** ([apify/web-scra > If you need help choosing the right scraper, see this [great article](https://help.apify.com/en/articles/3024655-choosing-the-right-solution). If you want to learn more about Actors in general, you can read our [actors page](https://apify.com/actors) or [browse the documentation](/platform/actors). 
-You can create 10 different **tasks** for 10 different websites, with very different options, but there will always be just one **Actor**, the `apify/*-scraper` you chose. This is the essence of tasks. They are nothing but **saved configurations** of the Actor that you can run easily and repeatedly. +You can create 10 different **tasks** for 10 different websites, with very different options, but there will always be just one **Actor**, the `apify/*-scraper` you chose. This is the essence of tasks. They are nothing but **saved configurations** of the Actor that you can run repeatedly. ## [](#trying-it-out) Trying it out @@ -59,7 +59,7 @@ Before we jump into the scraping itself, let's have a quick look at the user int ### [](#input) Input and options -The **Input** tab is where we started and it's the place where you create your scraping configuration. The Actor's creator prepares the **Input** form so that you can easily tell the Actor what to do. Feel free to check the tooltips of the various options to get a better idea of what they do. To display the tooltip, click the question mark next to each input field's name. +The **Input** tab is where we started and it's the place where you create your scraping configuration. The Actor's creator prepares the **Input** form so that you can tell the Actor what to do. Feel free to check the tooltips of the various options to get a better idea of what they do. To display the tooltip, click the question mark next to each input field's name. > We will not go through all the available input options in this tutorial. See the Actor's README for detailed information. diff --git a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md index 313238bb7f..a7b77a8cff 100644 --- a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md @@ -701,7 +701,7 @@ of JavaScript. It helps you put what matters on top, if you so desire. If you're familiar with the [jQuery library](https://jquery.com/), you may have looked at the scraping code and thought that it's unnecessarily complicated. That's probably up to everyone to decide on their own, but the good news is, -you can easily use jQuery with Puppeteer Scraper too. +you can use jQuery with Puppeteer Scraper too. ### [](#injecting-jquery) Injecting jQuery @@ -817,7 +817,7 @@ function to run the script in the context of the browser and the return value is ## [](#final-word) Final word -Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify easily and effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! +Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! ## [](#whats-next) What's next? 
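To make the jQuery injection mentioned in the Puppeteer Scraper section above more concrete, here is a minimal sketch of one way to do it with plain Puppeteer. It assumes an already opened `page`; the CDN URL and the selector are placeholders rather than values taken from the tutorial:

```js
// Load jQuery from a CDN into the page (placeholder URL).
await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' });

const heading = await page.evaluate(() => {
    // Inside the browser context, jQuery is now available as the global $.
    return $('h1').first().text().trim(); // 'h1' is only an illustrative selector
});

console.log(heading);
```
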
diff --git a/sources/academy/tutorials/apify_scrapers/web_scraper.md b/sources/academy/tutorials/apify_scrapers/web_scraper.md index 1507d30fdb..90bd57a2a1 100644 --- a/sources/academy/tutorials/apify_scrapers/web_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/web_scraper.md @@ -551,7 +551,7 @@ of JavaScript. It helps you put what matters on top, if you so desire. ## [](#final-word) Final word -Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify easily and effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! +Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! ## [](#whats-next) What's next? diff --git a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md index aeed367d3a..328a635148 100644 --- a/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md +++ b/sources/academy/tutorials/node_js/add_external_libraries_web_scraper.md @@ -13,7 +13,7 @@ In this tutorial, we'll learn how to inject any JavaScript library into your pag Moment.js is a very popular library for working with date and time. It helps you with the parsing, manipulation, and formatting of datetime values in multiple locales and has become the de-facto standard for this kind of work in JavaScript. -To inject Moment.js into our page function (or any other library using the same method), we first need to find a link to download it from. We can easily find it in [Moment.js' documentation](https://momentjs.com/docs/#/use-it/browser/) under the CDN links. +To inject Moment.js into our page function (or any other library using the same method), we first need to find a link to download it from. We can find it in [Moment.js' documentation](https://momentjs.com/docs/#/use-it/browser/) under the CDN links. > <https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js> @@ -64,6 +64,6 @@ With jQuery, we're using the `$.getScript()` helper to fetch the script for us a ## Dealing with errors -Some websites employ security measures that disallow loading external scripts within their pages. Luckily, those measures can be easily overridden with Web Scraper. If you are encountering errors saying that your library cannot be loaded due to a security policy, select the Ignore CORS and CSP input option at the very bottom of Web Scraper input and the errors should go away. +Some websites employ security measures that disallow loading external scripts within their pages. Luckily, those measures can be overridden with Web Scraper. If you are encountering errors saying that your library cannot be loaded due to a security policy, select the Ignore CORS and CSP input option at the very bottom of Web Scraper input and the errors should go away. Happy scraping! 
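As a concrete illustration of the approach above, a Web Scraper `pageFunction` that loads Moment.js with `$.getScript()` might look roughly like this. The sketch assumes the **Inject jQuery** option is enabled, and the returned fields are only examples:

```js
async function pageFunction(context) {
    const $ = context.jQuery;

    // Fetch and evaluate Moment.js from the CDN link mentioned above.
    await $.getScript('https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js');

    return {
        title: $('title').text(),
        // moment() is now available as a global inside the page.
        scrapedAt: moment().format('YYYY-MM-DD HH:mm'),
    };
}
```
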
diff --git a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md index 5f42d15ed7..57e4f1a492 100644 --- a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md +++ b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md @@ -75,7 +75,7 @@ Read more information about logging and error handling in our developer [best pr ### Saving snapshots {#saving-snapshots} -By snapshots, we mean **screenshots** if you use a [browser with Puppeteer/Playwright](../../webscraping/puppeteer_playwright/index.md) and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can easily display in your own browser. Snapshots are useful throughout your code but especially important in error handling. +By snapshots, we mean **screenshots** if you use a [browser with Puppeteer/Playwright](../../webscraping/puppeteer_playwright/index.md) and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can display in your own browser. Snapshots are useful throughout your code but especially important in error handling. Note that an error can happen only in a few pages out of a thousand and look completely random. There is not much you can do other than save and analyze a snapshot. diff --git a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md index 75d41d8873..1af2b43118 100644 --- a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md +++ b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md @@ -26,7 +26,7 @@ If it were only a question of performance, you'd of course use request-based scr ## Dynamic pages & blocking {#dynamic-pages} -Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages](./dealing_with_dynamic_pages.md)). Another problem is blocking. If the website collects a [browser fingerprint](../../webscraping/anti_scraping/techniques/fingerprinting.md), it can easily distinguish between a real user and a bot (crawler) and block access. +Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages](./dealing_with_dynamic_pages.md)). Another problem is blocking. If the website collects a [browser fingerprint](../../webscraping/anti_scraping/techniques/fingerprinting.md), it can distinguish between a real user and a bot (crawler) and block access. 
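To make the snapshot advice above more concrete, here is a minimal sketch of saving both a screenshot and the page HTML to a key-value store with Crawlee. The key names are only illustrative, and `page` is assumed to be an open Puppeteer or Playwright page:

```js
import { KeyValueStore } from 'crawlee';

async function saveErrorSnapshot(page, id) {
    const store = await KeyValueStore.open();
    // The key names are placeholders; use whatever naming fits your run.
    await store.setValue(`ERROR-${id}`, await page.screenshot(), { contentType: 'image/png' });
    await store.setValue(`ERROR-${id}-html`, await page.content(), { contentType: 'text/html' });
}
```
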
## Making the choice {#making-the-choice} diff --git a/sources/academy/tutorials/node_js/debugging_web_scraper.md b/sources/academy/tutorials/node_js/debugging_web_scraper.md index 2abe5e1361..e43759f099 100644 --- a/sources/academy/tutorials/node_js/debugging_web_scraper.md +++ b/sources/academy/tutorials/node_js/debugging_web_scraper.md @@ -44,7 +44,7 @@ into results = []; ``` -You can easily get all the information you need by running a small snippet of your pageFunction like this +You can get all the information you need by running a snippet of your `pageFunction` like this: ```js results = []; diff --git a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md index 43ddf9ade3..af114d551e 100644 --- a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md +++ b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md @@ -5,7 +5,7 @@ sidebar_position: 15.9 slug: /node-js/handle-blocked-requests-puppeteer --- -One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address. That's why Apify provides a [proxy](https://www.apify.com/docs/proxy) component with intelligent rotation. With a large enough pool of proxies, you can multiply the number of allowed requests per day to easily cover your crawling needs. Let's look at how we can rotate proxies when using our [JavaScript SDK](https://github.com/apify/apify-sdk-js). +One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address. That's why Apify provides a [proxy](https://www.apify.com/docs/proxy) component with intelligent rotation. With a large enough pool of proxies, you can multiply the number of allowed requests per day to cover your crawling needs. Let's look at how we can rotate proxies when using our [JavaScript SDK](https://github.com/apify/apify-sdk-js). # BasicCrawler @@ -31,7 +31,7 @@ const crawler = new Apify.BasicCrawler({ }); ``` -Each time handleRequestFunction is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not easily burn through your proxies. +Each time handleRequestFunction is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not burn through your proxies. # Puppeteer Crawler diff --git a/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md b/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md index 1d2ece675f..1d3a06dfc4 100644 --- a/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md +++ b/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md @@ -5,7 +5,7 @@ sidebar_position: 15.1 slug: /node-js/request-labels-in-apify-actors --- -Are you trying to use Actors for the first time and don't know how to deal with the request label or how to pass data to the request easily? +Are you trying to use Actors for the first time and don't know how to deal with the request label or how to pass data to the request? Here's how to do it. 
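For instance, with today's Crawlee the label and any extra data travel with the request itself. Below is a minimal sketch of that pattern; the URLs, the `DETAIL` label, and the `userData` fields are only placeholders:

```js
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Requests labeled 'DETAIL' end up here, together with their userData.
router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Scraping detail page from category: ${request.userData.category}`);
});

// The default handler enqueues a labeled request with extra data attached.
router.addDefaultHandler(async ({ crawler }) => {
    await crawler.addRequests([{
        url: 'https://example.com/product/123', // placeholder URL
        label: 'DETAIL',
        userData: { category: 'shoes' },
    }]);
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com']); // placeholder start URL
```
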
diff --git a/sources/academy/tutorials/node_js/scraping_shadow_doms.md b/sources/academy/tutorials/node_js/scraping_shadow_doms.md index 1151c3b6ef..eaeca879d3 100644 --- a/sources/academy/tutorials/node_js/scraping_shadow_doms.md +++ b/sources/academy/tutorials/node_js/scraping_shadow_doms.md @@ -11,7 +11,7 @@ slug: /node-js/scraping-shadow-doms --- -Each website is represented by an HTML DOM, a tree-like structure consisting of HTML elements (e.g. paragraphs, images, videos) and text. [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM) allows the separate DOM trees to be attached to the main DOM while remaining isolated in terms of CSS inheritance and JavaScript DOM manipulation. The CSS and JavaScript codes of separate shadow DOM components do not clash, but the downside is that you can't easily access the content from outside. +Each website is represented by an HTML DOM, a tree-like structure consisting of HTML elements (e.g. paragraphs, images, videos) and text. [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM) allows the separate DOM trees to be attached to the main DOM while remaining isolated in terms of CSS inheritance and JavaScript DOM manipulation. The CSS and JavaScript codes of separate shadow DOM components do not clash, but the downside is that you can't access the content from outside. Let's take a look at this page [alodokter.com](https://www.alodokter.com/). If you click on the menu and open a Chrome debugger, you will see that the menu tree is attached to the main DOM as shadow DOM under the element `<top-navbar-view id="top-navbar-view">`. @@ -32,7 +32,7 @@ const links = $(shadowRoot.innerHTML).find('a'); const urls = links.map((obj, el) => el.href); ``` -However, this isn't very convenient, because you have to find the root element of each component you want to work with, and you can't easily take advantage of all the scripts and tools you already have. +However, this isn't very convenient, because you have to find the root element of each component you want to work with, and you can't take advantage of all the scripts and tools you already have. So instead of that, we can replace the content of each element containing shadow DOM with the HTML of shadow DOM. @@ -45,7 +45,7 @@ for (const el of document.getElementsByTagName('*')) { } ``` -After you run this, you can access all the elements and content easily using jQuery or plain JavaScript. The downside is that it breaks all the interactive components because you create a new copy of the shadow DOM HTML content without the JavaScript code and CSS attached, so this must be done after all the content has been rendered. +After you run this, you can access all the elements and content using jQuery or plain JavaScript. The downside is that it breaks all the interactive components because you create a new copy of the shadow DOM HTML content without the JavaScript code and CSS attached, so this must be done after all the content has been rendered. Some websites may contain shadow DOMs recursively inside of shadow DOMs. 
In these cases, we must replace them with HTML recursively: diff --git a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md index bb632ce99b..033120f12a 100644 --- a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md +++ b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md @@ -28,7 +28,7 @@ _This does not mean that you can't execute in-browser code with Puppeteer Scrape ## Practical differences -Ok, cool, different environments, but how does that help you scrape stuff? Actually, quite a lot. Some things you just can't do from within the browser, but you can easily do them with Puppeteer. We will not attempt to create an exhaustive list, but rather show you some very useful features that we use every day in our scraping. +Ok, cool, different environments, but how does that help you scrape stuff? Actually, quite a lot. Some things you just can't do from within the browser, but you can do them with Puppeteer. We will not attempt to create an exhaustive list, but rather show you some very useful features that we use every day in our scraping. ## Evaluating in-browser code diff --git a/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md b/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md index d0b5b979d5..601536ddbb 100644 --- a/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md +++ b/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md @@ -80,7 +80,7 @@ async function submitPayment() { } ``` -**Avoid**: Not verifying an outcome. It can easily fail despite output claiming otherwise. +**Avoid**: Not verifying an outcome. It can fail despite output claiming otherwise. ```js async function submitPayment() { diff --git a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md index e309502b17..6616be75b6 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md @@ -1,13 +1,13 @@ --- title: Generating fingerprints -description: Learn how to use two super handy npm libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page. +description: Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page. sidebar_position: 3 slug: /anti-scraping/mitigation/generating-fingerprints --- # Generating fingerprints {#generating-fingerprints} -**Learn how to use two super handy npm libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page.** +**Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page.** --- @@ -33,7 +33,7 @@ const crawler = new PlaywrightCrawler({ ## Using the fingerprint-generator package {#using-fingerprint-generator} -Crawlee uses the [Fingerprint generator](https://github.com/apify/fingerprint-suite) npm package to do its fingerprint generating magic. For maximum control outside of Crawlee, you can install it on its own. With this package, you can easily generate browser fingerprints. +Crawlee uses the [Fingerprint generator](https://github.com/apify/fingerprint-suite) npm package to do its fingerprint generating magic. 
For maximum control outside of Crawlee, you can install it on its own. With this package, you can generate browser fingerprints. > It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results. diff --git a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md index 1f09724663..d32cf122d8 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md @@ -1,13 +1,13 @@ --- title: Using proxies -description: Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to easily obtain pools of proxies. +description: Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies. sidebar_position: 2 slug: /anti-scraping/mitigation/using-proxies --- # Using proxies {#using-proxies} -**Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to easily obtain pools of proxies.** +**Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.** --- @@ -53,8 +53,8 @@ const crawler = new CheerioCrawler({ await crawler.addRequests([{ url: 'https://demo-webstore.apify.org/search/on-sale', - // By labeling the Request, we can very easily - // identify it later in the requestHandler. + // By labeling the Request, we can identify it + // later in the requestHandler. label: 'START', }]); diff --git a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md index 99962e2d5a..1fadca91fd 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md @@ -13,7 +13,7 @@ slug: /anti-scraping/techniques/fingerprinting Browser fingerprinting is a method that some websites use to collect information about a browser's type and version, as well as the operating system being used, any active plugins, the time zone and language of the machine, the screen resolution, and various other active settings. All of this information is called the **fingerprint** of the browser, and the act of collecting it is called **fingerprinting**. -Yup! Surprisingly enough, browsers provide a lot of information about the user (and even their machine) that is easily accessible to websites! Browser fingerprinting wouldn't even be possible if it weren't for the sheer amount of information browsers provide, and the fact that each fingerprint is unique. +Yup! Surprisingly enough, browsers provide a lot of information about the user (and even their machine) that is accessible to websites! Browser fingerprinting wouldn't even be possible if it weren't for the sheer amount of information browsers provide, and the fact that each fingerprint is unique. Based on [research](https://www.eff.org/press/archives/2010/05/13) carried out by the Electronic Frontier Foundation, 84% of collected fingerprints are globally exclusive, and they found that the next 9% were in sets with a size of two. 
They also stated that even though fingerprints are dynamic, new ones can be matched up with old ones with 99.1% correctness. This makes fingerprinting a very viable option for websites that want to track the online behavior of their users in order to serve hyper-personalized advertisements to them. In some cases, it is also used to aid in preventing bots from accessing the websites (or certain sections of it). diff --git a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md index fdf063c54d..ee239ea4cc 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md @@ -11,7 +11,7 @@ slug: /anti-scraping/techniques/rate-limiting --- -When crawling a website, a web scraping bot will typically send many more requests from a single IP address than a human user could generate over the same period. Websites can easily monitor how many requests they receive from a single IP address, and block it or require a [captcha](./captchas.md) test to continue making requests. +When crawling a website, a web scraping bot will typically send many more requests from a single IP address than a human user could generate over the same period. Websites can monitor how many requests they receive from a single IP address, and block it or require a [captcha](./captchas.md) test to continue making requests. In the past, most websites had their own anti-scraping solutions, the most common of which was IP address rate-limiting. In recent years, the popularity of third-party specialized anti-scraping providers has dramatically increased, but a lot of websites still use rate-limiting to only allow a certain number of requests per second/minute/hour to be sent from a single IP; therefore, crawler requests have the potential of being blocked entirely quite quickly. diff --git a/sources/academy/webscraping/api_scraping/index.md b/sources/academy/webscraping/api_scraping/index.md index a6467f8049..ddf9609f16 100644 --- a/sources/academy/webscraping/api_scraping/index.md +++ b/sources/academy/webscraping/api_scraping/index.md @@ -59,7 +59,7 @@ Since the data is coming directly from the site's API, as opposed to the parsing ### 2. Configurable -Most APIs accept query parameters such as `maxPosts` or `fromCountry`. These parameters can be mapped to the configuration options of the scraper, which makes creating a scraper that supports various requirements and use-cases much easier. They can also be utilized to easily filter and/or limit data results. +Most APIs accept query parameters such as `maxPosts` or `fromCountry`. These parameters can be mapped to the configuration options of the scraper, which makes creating a scraper that supports various requirements and use-cases much easier. They can also be utilized to filter and/or limit data results. ### 3. Fast and efficient @@ -91,7 +91,7 @@ For complex APIs that require certain headers and/or payloads in order to make a APIs come in all different shapes and sizes. That means every API will vary in not only the quality of the data that it returns, but also the format that it is in. The two most common formats are JSON and HTML. -JSON responses are the most ideal, as they are easily manipulated in JavaScript code. In general, no serious parsing is necessary, and the data can be easily filtered and formatted to fit a scraper's output schema. 
+JSON responses are ideal, as they can be manipulated in JavaScript code. In general, no serious parsing is necessary, and the data can be filtered and formatted to fit a scraper's output schema. APIs which output HTML generally return the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response. diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md index 58ac96fd55..8bb93b2b0d 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md @@ -138,7 +138,7 @@ This will output the same exact result as the code in the previous section. One of the most popular parsing libraries for Node.js is [Cheerio](https://www.npmjs.com/package/cheerio), which can be used in tandem with Playwright and Puppeteer. It is extremely beneficial to parse the page's HTML in the Node.js context for a number of reasons: -- You can easily port the code between headless browser data extraction and plain HTTP data extraction +- You can port the code between headless browser data extraction and plain HTTP data extraction - You don't have to worry in which context you're working (which can sometimes be confusing) - Errors are easier to handle when running in the base Node.js context diff --git a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md index 6bba4dad69..40671a3241 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md @@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem'; Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-pagereloadoptions), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/#?product=Puppeteer&show=api-pagecontent). -Last lesson, we left off at a point where we were waiting for the page to navigate so that we can extract the page's title and take a screenshot of it. In this lesson, we'll be learning about the two methods we can use to easily achieve both of those things. +Last lesson, we left off at a point where we were waiting for the page to navigate so that we can extract the page's title and take a screenshot of it. In this lesson, we'll be learning about the two methods we can use to achieve both of those things. 
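Concretely, those two methods are `page.title()` and `page.screenshot()`. A minimal sketch, assuming an already opened Playwright or Puppeteer `page` (the file name is just an example):

```js
// Grab the title of the currently loaded page.
const title = await page.title();
console.log(title);

// Save a screenshot to disk ('screenshot.png' is an illustrative file name).
await page.screenshot({ path: 'screenshot.png' });
```
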
## Grabbing the title {#grabbing-the-title} diff --git a/sources/academy/webscraping/switching_to_typescript/enums.md b/sources/academy/webscraping/switching_to_typescript/enums.md index 56cdd4b0d1..8f3b81bb43 100644 --- a/sources/academy/webscraping/switching_to_typescript/enums.md +++ b/sources/academy/webscraping/switching_to_typescript/enums.md @@ -1,13 +1,13 @@ --- title: Enums -description: Learn how to easily define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table. +description: Learn how to define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table. sidebar_position: 7.4 slug: /switching-to-typescript/enums --- # Enums! {#enums} -**Learn how to easily define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table.** +**Learn how to define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table.** --- diff --git a/sources/academy/webscraping/switching_to_typescript/interfaces.md b/sources/academy/webscraping/switching_to_typescript/interfaces.md index e210c19e68..b83cd96705 100644 --- a/sources/academy/webscraping/switching_to_typescript/interfaces.md +++ b/sources/academy/webscraping/switching_to_typescript/interfaces.md @@ -29,7 +29,7 @@ We can keep this just as it is, which would be totally okay, or we could use an > When working with object types, it usually just comes down to preference whether you decide to use an interface or a type alias. -Using the `interface` keyword, we can easily turn our `Person` type into an interface. +Using the `interface` keyword, we can turn our `Person` type into an interface. ```ts // Interfaces don't need an "=" sign diff --git a/sources/academy/webscraping/switching_to_typescript/mini_project.md b/sources/academy/webscraping/switching_to_typescript/mini_project.md index 211432d38b..014bc2cd61 100644 --- a/sources/academy/webscraping/switching_to_typescript/mini_project.md +++ b/sources/academy/webscraping/switching_to_typescript/mini_project.md @@ -21,7 +21,7 @@ Here's a rundown of what our project should be able to do: 2. Fetch the data and get full TypeScript support on the response object (no `any`!). 3. Sort and modify the data, receiving TypeScript support for the new modified data. -We'll be using a single external package called [**Axios**](https://www.npmjs.com/package/axios) to easily fetch the data from the API, which can be installed with the following command: +We'll be using a single external package called [**Axios**](https://www.npmjs.com/package/axios) to fetch the data from the API, which can be installed with the following command: ```shell npm i axios diff --git a/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md b/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md index 6c3036cfba..b3e1540cc4 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md @@ -90,7 +90,7 @@ When allowing your users to pass input properties which could break the scraper Validate the input provided by the user! This should be the very first thing your scraper does. If the fields in the input are missing or in an incorrect type/format, either parse the value and correct it programmatically or throw an informative error telling the user how to fix the error. 
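
As a sketch of this validate-first approach, assuming the Apify SDK and a hypothetical input with `startUrls` and `maxResults` fields:

```js
import { Actor } from 'apify';

await Actor.init();

// Hypothetical input shape: { startUrls: string[], maxResults?: number }
const input = await Actor.getInput();

if (!input || !Array.isArray(input.startUrls) || input.startUrls.length === 0) {
    throw new Error('Invalid input: "startUrls" must be a non-empty array of URLs. '
        + 'Add at least one URL and run the scraper again.');
}

// Users often pass numbers as strings - parse the value and correct it programmatically.
const maxResults = Number(input.maxResults) || 100;

// ... the rest of the scraper would use startUrls and maxResults here ...

await Actor.exit();
```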
-> On the Apify platform, you can use the [input schema](../../platform/deploying_your_code/input_schema.md) to both easily validate inputs and generate a clean UI for those using your scraper. +> On the Apify platform, you can use the [input schema](../../platform/deploying_your_code/input_schema.md) to both validate inputs and generate a clean UI for those using your scraper. ## Error handling {#error-handling} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md index e239cdb59c..ef3fdd00be 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md @@ -121,7 +121,7 @@ $env:CRAWLEE_HEADLESS=1; & node browser.js ## Dynamically loaded data {#dynamic-data} -One of the important benefits of using a browser is that it allows you to easily extract data that's dynamically loaded, such as data that's only fetched after a user scrolls or interacts with the page. In our case, it's the "**You may also like**" section of the product detail pages. Those products aren't available in the initial HTML, but the browser loads them later using an API. +One of the important benefits of using a browser is that it allows you to extract data that's dynamically loaded, such as data that's only fetched after a user scrolls or interacts with the page. In our case, it's the "**You may also like**" section of the product detail pages. Those products aren't available in the initial HTML, but the browser loads them later using an API. ![headless-dynamic-data.png](./images/headless-dynamic-data.png) diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md index e10d149af2..45dd302e06 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md @@ -52,7 +52,7 @@ for (const product of products) { ## Extracting more data {#extracting-data-in-loop} -We will add the price extraction from the previous lesson to the loop. We will also save all the data to an array so that we can easily work with it. Run this in the Console: +We will add the price extraction from the previous lesson to the loop. We will also save all the data to an array so that we can work with it. Run this in the Console: > The `results.push()` function takes its argument and pushes (adds) it to the `results` array. [Learn more about it here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/push). diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md index 8af848745a..9d0c9a6a26 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md @@ -41,7 +41,7 @@ Node.js and npm support two types of projects, let's call them legacy and modern ## Installing necessary libraries {#install-libraries} -Now that we have a project set up, we can install npm modules into the project. 
Let's install libraries that will help us easily download and process websites' HTML. In the project directory, run the following command, which will install two libraries into your project. **got-scraping** and Cheerio. +Now that we have a project set up, we can install npm modules into the project. Let's install libraries that will help us with downloading and processing websites' HTML. In the project directory, run the following command, which will install two libraries into your project. **got-scraping** and Cheerio. ```shell npm install got-scraping cheerio diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md index cb7c721c45..5cd197f394 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md @@ -151,7 +151,7 @@ price.textContent; ![Extract product price](./images/devtools-extract-product-price.png) -It worked, but the price was not alone in the result. We extracted it together with some extra text. This is very common in web scraping. Sometimes it's not possible to easily separate the data we need by element selection alone, and we have to clean the data using other methods. +It worked, but the price was not alone in the result. We extracted it together with some extra text. This is very common in web scraping. Sometimes it's impossible to separate the data we need by element selection alone, and we have to clean the data using other methods. ### Cleaning extracted data {#cleaning-extracted-data} @@ -192,7 +192,7 @@ price.textContent.split('$')[1]; And there you go. Notice that this time we extracted the price without the `$` dollar sign. This could be desirable, because we wanted to convert the price from a string to a number, or not, depending on individual circumstances of the scraping project. -Which method to choose? Neither is the perfect solution. The first method could easily break if the website's developers change the structure of the `<span>` elements and the price will no longer be in the third position - a very small change that can happen at any moment. +Which method to choose? Neither is the perfect solution. The first method could break if the website's developers change the structure of the `<span>` elements and the price will no longer be in the third position - a very small change that can happen at any moment. The second method seems more reliable, but only until the website adds prices in other currency or decides to replace `$` with `USD`. It's up to you, the scraping developer to decide which of the methods will be more resilient on the website you scrape. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md index bf4504aaf4..6ee66f548e 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md @@ -16,7 +16,7 @@ Web scraping or crawling? Web data extraction, mining, or collection? You can fi ## What is web data extraction? {#what-is-data-extraction} -Web data extraction (or collection) is a process that takes a web page, like an Amazon product page, and collects useful information from the page, such as the product's name and price. 
Web pages are an unstructured data source and the goal of web data extraction is to make information from websites structured, so that it can be easily processed by data analysis tools or integrated with computer systems. The main sources of data on a web page are HTML documents and API calls, but also images, PDFs, and so on. +Web data extraction (or collection) is a process that takes a web page, like an Amazon product page, and collects useful information from the page, such as the product's name and price. Web pages are an unstructured data source and the goal of web data extraction is to make information from websites structured, so that it can be processed by data analysis tools or integrated with computer systems. The main sources of data on a web page are HTML documents and API calls, but also images, PDFs, and so on. ![product data extraction from Amazon](./images/beginners-data-extraction.png) From dae908bf2a06b345eaec51512f610c31250fa870 Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Thu, 25 Apr 2024 17:21:40 +0200 Subject: [PATCH 06/15] fix: strange link formatting --- .../academy/platform/get_most_of_actors/seo_and_promotion.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md index 4e959b17c0..ec50b909d9 100644 --- a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md +++ b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md @@ -112,7 +112,7 @@ Now that you’ve created a cool new Actor, let others see it! Share it on your - Use relevant and widely used hashtags (Twitter). -> **GOOD**: Need to crawl #Amazon or #Yelp? See my Amazon crawler <https://>... +> **GOOD**: Need to crawl #Amazon or #Yelp? See my Amazon crawler https:\/\/... > <br/> **AVOID**: I just #created something, check it out on Apify... - Post in groups or pages with relevant target groups (Facebook and LinkedIn). 
From f1109e57dd3d35e8812171f9613d25492c4bfd02 Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Thu, 25 Apr 2024 17:53:51 +0200 Subject: [PATCH 07/15] style: let the reader decide if something is 'simple' --- sources/academy/glossary/tools/edit_this_cookie.md | 2 +- sources/academy/glossary/tools/insomnia.md | 4 ++-- sources/academy/glossary/tools/modheader.md | 2 +- sources/academy/glossary/tools/postman.md | 6 +++--- .../glossary/tools/quick_javascript_switcher.md | 6 +++--- sources/academy/glossary/tools/user_agent_switcher.md | 2 +- .../platform/get_most_of_actors/actor_readme.md | 2 +- .../get_most_of_actors/guidelines_for_writing.md | 8 ++++---- .../get_most_of_actors/monetizing_your_actor.md | 2 +- sources/academy/platform/getting_started/actors.md | 2 +- .../api/run_actor_and_retrieve_data_via_api.md | 8 ++++---- .../tutorials/apify_scrapers/cheerio_scraper.md | 6 +++--- .../tutorials/apify_scrapers/getting_started.md | 10 +++------- sources/academy/tutorials/apify_scrapers/index.md | 2 +- .../tutorials/apify_scrapers/puppeteer_scraper.md | 8 ++++---- .../academy/tutorials/apify_scrapers/web_scraper.md | 6 +++--- .../node_js/analyzing_pages_and_fixing_errors.md | 2 +- .../academy/tutorials/node_js/debugging_web_scraper.md | 2 +- .../node_js/handle_blocked_requests_puppeteer.md | 6 +++--- .../node_js/how_to_save_screenshots_puppeteer.md | 2 +- .../academy/tutorials/node_js/optimizing_scrapers.md | 2 +- .../tutorials/node_js/when_to_use_puppeteer_scraper.md | 6 +++--- sources/academy/tutorials/php/using_apify_from_php.md | 2 +- .../tutorials/python/process_data_using_python.md | 2 +- sources/academy/tutorials/python/scrape_data_python.md | 2 +- .../advanced_web_scraping/scraping_paginated_sites.md | 8 ++++---- sources/academy/webscraping/anti_scraping/index.md | 2 +- .../webscraping/anti_scraping/mitigation/proxies.md | 2 +- .../general_api_scraping/cookies_headers_tokens.md | 2 +- .../api_scraping/graphql_scraping/introspection.md | 2 +- .../graphql_scraping/modifying_variables.md | 2 +- .../common_use_cases/logging_into_a_website.md | 2 +- .../common_use_cases/scraping_iframes.md | 2 +- .../switching_to_typescript/installation.md | 2 +- .../switching_to_typescript/type_aliases.md | 2 +- .../unknown_and_type_assertions.md | 2 +- .../crawling/pro_scraping.md | 4 ++-- .../crawling/recap_extraction_basics.md | 2 +- .../data_extraction/devtools_continued.md | 2 +- .../data_extraction/project_setup.md | 2 +- .../webscraping/web_scraping_for_beginners/index.md | 2 +- .../web_scraping_for_beginners/introduction.md | 2 +- 42 files changed, 70 insertions(+), 74 deletions(-) diff --git a/sources/academy/glossary/tools/edit_this_cookie.md b/sources/academy/glossary/tools/edit_this_cookie.md index e2c7f942c8..cfdf1eefea 100644 --- a/sources/academy/glossary/tools/edit_this_cookie.md +++ b/sources/academy/glossary/tools/edit_this_cookie.md @@ -11,7 +11,7 @@ slug: /tools/edit-this-cookie --- -**EditThisCookie** is a simple Chrome extension to manage your browser's cookies. It can be added through the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain. +**EditThisCookie** is a Chrome extension to manage your browser's cookies. 
It can be added through the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain. ![EditThisCookie popup](./images/edit-this-cookie-popup.png) diff --git a/sources/academy/glossary/tools/insomnia.md b/sources/academy/glossary/tools/insomnia.md index 760c9dc67c..143e57a4ec 100644 --- a/sources/academy/glossary/tools/insomnia.md +++ b/sources/academy/glossary/tools/insomnia.md @@ -1,13 +1,13 @@ --- title: Insomnia -description: Learn about Insomnia, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers. +description: Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers. sidebar_position: 9.2 slug: /tools/insomnia --- # What is Insomnia {#what-is-insomnia} -**Learn about Insomnia, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.** +**Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers.** --- diff --git a/sources/academy/glossary/tools/modheader.md b/sources/academy/glossary/tools/modheader.md index 354abe743f..581e6628dd 100644 --- a/sources/academy/glossary/tools/modheader.md +++ b/sources/academy/glossary/tools/modheader.md @@ -19,7 +19,7 @@ If you read about [Postman](./postman.md), you might remember that you can use i After you install the ModHeader extension, you should see it pinned in Chrome's task bar. When you click it, you'll see an interface like this pop up: -![Modheader's simple interface](./images/modheader.jpg) +![Modheader's interface](./images/modheader.jpg) Here, you can add headers, remove headers, and even save multiple collections of headers that you can toggle between (which are called **Profiles** within the extension itself). diff --git a/sources/academy/glossary/tools/postman.md b/sources/academy/glossary/tools/postman.md index a41c4aeabb..27fb8a5238 100644 --- a/sources/academy/glossary/tools/postman.md +++ b/sources/academy/glossary/tools/postman.md @@ -1,19 +1,19 @@ --- title: Postman -description: Learn about Postman, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers. +description: Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers. sidebar_position: 9.3 slug: /tools/postman --- # What is Postman? {#what-is-postman} -**Learn about Postman, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.** +**Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers.** --- [Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper. 
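
Once a request behaves the way we want in Postman, we can replicate it in code. Here's a hedged sketch using got-scraping; the URL, header, and cookie values are placeholders:

```js
import { gotScraping } from 'got-scraping';

// Replay the request with the same headers and cookies we tested in Postman.
const response = await gotScraping({
    url: 'https://example.com/api/products?page=1',
    headers: {
        accept: 'application/json',
        cookie: 'sessionid=PLACEHOLDER',
    },
});

console.log(response.statusCode);
console.log(response.body);
```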
-The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a simple signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/getting-started/introduction/). +The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/getting-started/introduction/). ## Understanding the interface {#understanding-the-interface} diff --git a/sources/academy/glossary/tools/quick_javascript_switcher.md b/sources/academy/glossary/tools/quick_javascript_switcher.md index ba2f3d5806..eca2c21b34 100644 --- a/sources/academy/glossary/tools/quick_javascript_switcher.md +++ b/sources/academy/glossary/tools/quick_javascript_switcher.md @@ -1,17 +1,17 @@ --- title: Quick JavaScript Switcher -description: Discover a super simple tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs. +description: Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs. sidebar_position: 9.9 slug: /tools/quick-javascript-switcher --- # Quick JavaScript Switcher -**Discover a super simple tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.** +**Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.** --- -**Quick JavaScript Switcher** is a very simple Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed. +**Quick JavaScript Switcher** is a Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed. If JavaScript is enabled - clicking the button will switch it off and reload the page. The next click will re-enable JavaScript and refresh the page. This extension is useful for checking whether a certain website will work without JavaScript (and thus could be parsed without using a browser with a plain HTTP request) or not. diff --git a/sources/academy/glossary/tools/user_agent_switcher.md b/sources/academy/glossary/tools/user_agent_switcher.md index e83f5140d0..65a1445a7b 100644 --- a/sources/academy/glossary/tools/user_agent_switcher.md +++ b/sources/academy/glossary/tools/user_agent_switcher.md @@ -11,7 +11,7 @@ slug: /tools/user-agent-switcher --- -**User-Agent Switcher** is a simple Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. 
Clicking on it will open up a list of various **User-Agent** groups. +**User-Agent Switcher** is a Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. Clicking on it will open up a list of various **User-Agent** groups. ![User-Agent Switcher groups](./images/user-agent-switcher-groups.png) diff --git a/sources/academy/platform/get_most_of_actors/actor_readme.md b/sources/academy/platform/get_most_of_actors/actor_readme.md index 832a70686a..834de58787 100644 --- a/sources/academy/platform/get_most_of_actors/actor_readme.md +++ b/sources/academy/platform/get_most_of_actors/actor_readme.md @@ -42,7 +42,7 @@ Aim for sections 1–6 below and try to include at least 300 words. You can move 3. **How much will it cost to scrape (target site)?** - - Simple text explaining what type of proxies are needed and how many platform credits (calculated mainly from consumption units) are needed for 1000 results. + - Explanation of what type of proxies are needed and how many platform credits (calculated mainly from consumption units) are needed for 1000 results. - This is calculated from carrying out several runs (or from runs saved in the DB).<!-- @Zuzka can help if needed. [Information in this table](https://docs.google.com/spreadsheets/d/1NOkob1eYqTsRPTVQdltYiLUsIipvSFXswRcWQPtCW9M/edit#gid=1761542436), tab "cost of usage". --> - Here’s an example for this section: diff --git a/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md b/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md index dbe76dc268..43ad39b74f 100644 --- a/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md +++ b/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md @@ -1,5 +1,5 @@ --- -title: Guidelines for writing tutorials +title: Guidelines for writing tutorials description: Create a guide for your users so they can get the best out of your Actor. Make sure your tutorial is both user- and SEO-friendly. Your tutorial will be published on Apify Blog. sidebar_position: 3 slug: /get-most-of-actors/guidelines-writing-tutorials @@ -42,12 +42,12 @@ These guidelines are of course not set in stone. They are here to give you a gen ## Tutorial template -A simple tutorial template for you to start from. Feel free to expand and modify it as you see fit. +A tutorial template for you to start from. Feel free to expand and modify it as you see fit. ```markdown # How to [perform task] automatically -A simple step-by-step guide to [describe what the guide helps achieve]. +A step-by-step guide to [describe what the guide helps achieve]. The web is a vast and dynamic space, continuously expanding and evolving. Often, there's a need to [describe the problem or need the tool addresses]. A handy tool for anyone who wants to [describe what the tool helps with] would be invaluable. @@ -69,7 +69,7 @@ Here's how to [quick intro to the tutorial itself] ### Step 1. Find the [Actor name] -Navigate to [Tool Name] and click the [CTA button]. You'll be redirected to Apify Console. +Navigate to [Tool Name] and click the [CTA button]. You'll be redirected to Apify Console. ### Step 2. 
Add URL or choose [setting 1], [setting 2], and [setting 3] diff --git a/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md b/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md index 5f6351cde1..b3d7f99b44 100644 --- a/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md +++ b/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md @@ -94,7 +94,7 @@ To dive deep into numbers for a specific Actor, you can visit the Actor insights Your paid Actors’ profits are directly related to the amount of paying users you have for your tool. After publishing and monetizing your software, comes a crucial step for your Actor’s success: **attracting users**. -Getting new users can be an art in itself, but there are **two simple steps** you can take to ensure your Actor is getting the attention it deserves. +Getting new users can be an art in itself, but there are **two proven steps** you can take to ensure your Actor is getting the attention it deserves. 1. **SEO-optimized description and README** diff --git a/sources/academy/platform/getting_started/actors.md b/sources/academy/platform/getting_started/actors.md index caf4ac6966..1f89ee263f 100644 --- a/sources/academy/platform/getting_started/actors.md +++ b/sources/academy/platform/getting_started/actors.md @@ -15,7 +15,7 @@ After you've followed the **Getting started** lesson, you're almost ready to sta ## What's an Actor? {#what-is-an-actor} -When you deploy your script to the Apify platform, it is then called an **Actor**, which is a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours or even infinitely. An Actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. +When you deploy your script to the Apify platform, it is then called an **Actor**, which is a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours or even infinitely. An Actor can perform anything from a basic action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. Once an Actor has been pushed to the Apify platform, they can be shared to the world through the [Apify Store](https://apify.com/store), and even monetized after going public. diff --git a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md index 2d30e87866..1d5b43ab45 100644 --- a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md +++ b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md @@ -69,7 +69,7 @@ Most Actors would not be much use if input could not be passed into them to chan Good Actors have reasonable defaults for most input fields, so if you want to run one of the major Actors from [Apify Store](https://apify.com/store), you usually do not need to provide all possible fields. 
-Via API, let's quickly try to run [Web Scraper](https://apify.com/apify/web-scraper), which is the most popular Actor on the Apify Store at the moment. The full input with all possible fields is [pretty long and ugly](https://apify.com/apify/web-scraper?section=example-run), so we will not show it here. Because it has default values for most fields, we can provide a simple JSON input containing only the fields we'd like to customize. We will send a POST request to the endpoint below and add the JSON as the **body** of the request: +Via API, let's quickly try to run [Web Scraper](https://apify.com/apify/web-scraper), which is the most popular Actor on the Apify Store at the moment. The full input with all possible fields is [pretty long and ugly](https://apify.com/apify/web-scraper?section=example-run), so we will not show it here. Because it has default values for most fields, we can provide a JSON input containing only the fields we'd like to customize. We will send a POST request to the endpoint below and add the JSON as the **body** of the request: ```cURL https://api.apify.com/v2/acts/apify~web-scraper/runs?token=YOUR_TOKEN @@ -87,7 +87,7 @@ We will later use this **run info** JSON to retrieve the run's output data. This ## JavaScript and Python client {#javascript-and-python-client} -If you are using JavaScript or Python, we highly recommend using the Apify API client ([JavaScript](https://docs.apify.com/api/client/js/), [Python](https://docs.apify.com/api/client/python/)) instead of the raw HTTP API. The client implements smart polling and exponential backoff, which makes calling Actors and getting results very simple. +If you are using JavaScript or Python, we highly recommend using the Apify API client ([JavaScript](https://docs.apify.com/api/client/js/), [Python](https://docs.apify.com/api/client/python/)) instead of the raw HTTP API. The client implements smart polling and exponential backoff, which makes calling Actors and getting results efficient. You can skip most of this tutorial by following this code example that calls Google Search Results Scraper and logs its results: @@ -149,7 +149,7 @@ If your synchronous run exceeds the 5-minute time limit, the response will be a Most Actor runs will store their data in the default [dataset](/platform/storage/dataset). The Apify API provides **run-sync-get-dataset-items** endpoints for [Actors](/api/v2#/reference/actors/run-actor-synchronously-and-get-dataset-items/run-actor-synchronously-with-input-and-get-dataset-items) and [tasks](/api/v2#/reference/actor-tasks/run-task-synchronously-and-get-dataset-items/run-task-synchronously-and-get-dataset-items-(post)), which allow you to run an Actor and receive the items from the default dataset once the run has finished. -Here is a simple Node.js example of calling a task via the API and logging the dataset items to the console: +Here is a Node.js example of calling a task via the API and logging the dataset items to the console: ```js // Use your favorite HTTP client @@ -254,7 +254,7 @@ The **run info** JSON also contains the IDs of the default [dataset](/platform/s > If you are scraping products, or any list of items with similar fields, the [dataset](/platform/storage/dataset) should be your storage of choice. Don't forget though, that dataset items are immutable. This means that you can only add to the dataset, and not change the content that is already inside it. -Retrieving the data from a dataset is simple. 
Send a GET request to the [**Get items**](/api/v2#/reference/datasets/item-collection/get-items) endpoint and pass the `defaultDatasetId` into the URL. For a GET request to the default dataset, no token is needed. +To retrieve the data from a dataset, send a GET request to the [**Get items**](/api/v2#/reference/datasets/item-collection/get-items) endpoint and pass the `defaultDatasetId` into the URL. For a GET request to the default dataset, no token is needed. ```cURL https://api.apify.com/v2/datasets/DATASET_ID/items diff --git a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md index 3870e2a8c9..5659a11f61 100644 --- a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md @@ -501,9 +501,9 @@ Thank you for reading this whole tutorial! Really! It's important to us that our ## [](#whats-next) What's next -* Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a simple `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. -* [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. -* Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/custom-solutions) from an Apify-certified developer. +* Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. +* [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on actors. +* Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom actor](https://apify.com/custom-solutions) from an Apify-certified developer. **Learn how to scrape a website using Apify's Cheerio Scraper. Build an Actor's page function, extract information from a web page and download your data.** diff --git a/sources/academy/tutorials/apify_scrapers/getting_started.md b/sources/academy/tutorials/apify_scrapers/getting_started.md index ac199f473f..7d98cdb70c 100644 --- a/sources/academy/tutorials/apify_scrapers/getting_started.md +++ b/sources/academy/tutorials/apify_scrapers/getting_started.md @@ -108,7 +108,7 @@ Some of this information may be scraped directly from the listing pages, but for ### [](#the-start-url) The start URL -Let's start with something simple. In the **Input** tab of the task we have, we'll change the **Start URL** from **<https://apify.com>**. This will tell the scraper to start by opening a different URL. 
You can add more **Start URL**s or even [use a file with a list of thousands of them](#-crawling-the-website-with-pseudo-urls), but in this case, we'll be good with just one. +In the **Input** tab of the task we have, we'll change the **Start URL** from **<https://apify.com>**. This will tell the scraper to start by opening a different URL. You can add more **Start URL**s or even [use a file with a list of thousands of them](#-crawling-the-website-with-pseudo-urls), but in this case, we'll be good with just one. How do we choose the new **Start URL**? The goal is to scrape all Actors in the store, which is available at [https://apify.com/store](https://apify.com/store), so we choose this URL as our **Start URL**. @@ -128,7 +128,7 @@ We also need to somehow distinguish the **Start URL** from all the other URLs th The **Link selector**, together with **Pseudo URL**s, are your URL matching arsenal. The Link selector is a CSS selector and its purpose is to select the HTML elements where the scraper should look for URLs. And by looking for URLs, we mean finding the elements' `href` attributes. For example, to enqueue URLs from `<div class="my-class" href=...>` tags, we would enter `'div.my-class'`. -What's the connection to **Pseudo URL**s? Well, first, all the URLs found in the elements that match the Link selector are collected. Then, **Pseudo URL**s are used to filter through those URLs and enqueue only the ones that match the **Pseudo URL** structure. Simple. +What's the connection to **Pseudo URL**s? Well, first, all the URLs found in the elements that match the Link selector are collected. Then, **Pseudo URL**s are used to filter through those URLs and enqueue only the ones that match the **Pseudo URL** structure. To scrape all the Actors in Apify Store, we should use the Link selector to tell the scraper where to find the URLs we need. For now, let us tell you that the Link selector you're looking for is: @@ -290,11 +290,7 @@ The scraper: ## [](#scraping-practice) Scraping practice -We've covered all the concepts that we need to understand to successfully scrape the data in our goal, -so let's get to it and start with something really simple. We will only output data that are already available -to us in the page's URL. Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique -identifier** in our results. To get those, we need the `request.url` because it is the URL and -includes the Unique identifier. +We've covered all the concepts that we need to understand to successfully scrape the data in our goal, so let's get to it. We will only output data that are already available to us in the page's URL. Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique identifier** in our results. To get those, we need the `request.url` because it is the URL and includes the Unique identifier. ```js const { url } = request; diff --git a/sources/academy/tutorials/apify_scrapers/index.md b/sources/academy/tutorials/apify_scrapers/index.md index 5816b900d3..c146e2fd71 100644 --- a/sources/academy/tutorials/apify_scrapers/index.md +++ b/sources/academy/tutorials/apify_scrapers/index.md @@ -21,7 +21,7 @@ Don't let the number of options confuse you. Unless you're really sure you need Web Scraper is a ready-made solution for scraping the web using the Chrome browser. It takes away all the work necessary to set up a browser for crawling, controls the browser automatically and produces machine-readable results in several common formats. 
-Underneath, it uses the Puppeteer library to control the browser, but you don't need to worry about that. Using a simple web UI and a little of basic JavaScript, you can tweak it to serve almost any scraping need. +Underneath, it uses the Puppeteer library to control the browser, but you don't need to worry about that. Using a web UI and a little of basic JavaScript, you can tweak it to serve almost any scraping need. [Visit the Web Scraper tutorial to get started!](./web_scraper.md) diff --git a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md index a7b77a8cff..0205d65eff 100644 --- a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md @@ -41,7 +41,7 @@ Where Web Scraper only gives you access to in-browser JavaScript and the `pageFu in the browser context, Puppeteer Scraper's `pageFunction` is executed in Node.js context, giving you much more freedom to bend the browser to your will. You're the puppeteer and the browser is your puppet. It's also much easier to work with external APIs, databases or the [Apify SDK](https://sdk.apify.com) -in the Node.js context. The tradeoff is simple. Power vs simplicity. Web Scraper is simple, +in the Node.js context. The tradeoff is simplicity vs power. Web Scraper is simple, Puppeteer Scraper is powerful (and the [Apify SDK](https://sdk.apify.com) is super-powerful). > In other words, Web Scraper's `pageFunction` is like a single [page.evaluate()](https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) call. @@ -821,9 +821,9 @@ Thank you for reading this whole tutorial! Really! It's important to us that our ## [](#whats-next) What's next? -- Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a simple `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. -- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. -- Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/custom-solutions) from an Apify-certified developer. +- Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. +- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on actors. +- Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom actor](https://apify.com/custom-solutions) from an Apify-certified developer. **Learn how to scrape a website using Apify's Puppeteer Scraper. 
Build an Actor's page function, extract information from a web page and download your data.** diff --git a/sources/academy/tutorials/apify_scrapers/web_scraper.md b/sources/academy/tutorials/apify_scrapers/web_scraper.md index 90bd57a2a1..335f63536d 100644 --- a/sources/academy/tutorials/apify_scrapers/web_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/web_scraper.md @@ -555,9 +555,9 @@ Thank you for reading this whole tutorial! Really! It's important to us that our ## [](#whats-next) What's next? -- Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a simple `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. -- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. -- Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/custom-solutions) from an Apify-certified developer. +- Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. +- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on actors. +- Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom actor](https://apify.com/custom-solutions) from an Apify-certified developer. **Learn how to scrape a website using Apify's Web Scraper. Build an Actor's page function, extract information from a web page and download your data.** diff --git a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md index 57e4f1a492..f9c9027e96 100644 --- a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md +++ b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md @@ -123,7 +123,7 @@ To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a ran ### Error reporting {#error-reporting} -Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's look at simple **dataset** reporting. +Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. 
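
To make the **ERROR-LOGIN** snapshot idea above concrete, here is a minimal sketch of a helper that stores a named screenshot and the page's HTML with the Apify SDK. The helper name and the way the error is detected are illustrative, and it assumes it runs inside an initialized Actor with a Puppeteer or Playwright `page`:

```js
import { Actor } from 'apify';

// Illustrative helper - call it from a request handler when something goes wrong,
// e.g. const key = await saveErrorSnapshot(page, 'LOGIN');
const saveErrorSnapshot = async (page, label) => {
    // The random number makes the key unique, as described above.
    const key = `ERROR-${label}-${Math.floor(Math.random() * 100000)}`;

    // Store the screenshot and the raw HTML in the default key-value store.
    await Actor.setValue(key, await page.screenshot(), { contentType: 'image/png' });
    await Actor.setValue(`${key}.html`, await page.content(), { contentType: 'text/html' });

    return key;
};
```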
## With the Apify SDK {#with-the-apify-sdk} diff --git a/sources/academy/tutorials/node_js/debugging_web_scraper.md b/sources/academy/tutorials/node_js/debugging_web_scraper.md index e43759f099..de668ba9a7 100644 --- a/sources/academy/tutorials/node_js/debugging_web_scraper.md +++ b/sources/academy/tutorials/node_js/debugging_web_scraper.md @@ -7,7 +7,7 @@ slug: /node-js/debugging-web-scraper A lot of beginners struggle through trial and error while scraping a simple site. They write some code that might work, press the run button, see that error happened and they continue writing more code that might work but probably won't. This is extremely inefficient and gets tedious really fast. -What beginners are missing are simple tools and tricks to get things done quickly. One of these wow tricks is the option to run the JavaScript code directly in your browser. +What beginners are missing are basic tools and tricks to get things done quickly. One of these wow tricks is the option to run the JavaScript code directly in your browser. Pressing F12 while browsing with Chrome, Firefox, or other popular browsers opens up the browser console, the magic toolbox of any web developer. The console allows you to run a code in the context of the website you are in. Don't worry, you cannot mess the site up (well, unless you start doing really nasty tricks) as the page content is downloaded on your computer and any change is only local to your PC. diff --git a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md index af114d551e..da2a425b3f 100644 --- a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md +++ b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md @@ -11,7 +11,7 @@ One of the main defense mechanisms websites use to ensure they are not scraped b > Getting around website defense mechanisms when crawling. -Setting proxy rotation in [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler) is pretty simple. When using plain HTTP requests (like with the popular '[request-promise](https://www.npmjs.com/package/request-promise)' npm package), a fresh proxy is set up on each request. +You can use `handleRequestFunction` to set up proxy rotation for a [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler). The following example shows how to use a fresh proxy on each request if you make requests through the popular [request-promise](https://www.npmjs.com/package/request-promise) npm package: ```js const Apify = require('apify'); @@ -31,7 +31,7 @@ const crawler = new Apify.BasicCrawler({ }); ``` -Each time handleRequestFunction is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not burn through your proxies. +Each time `handleRequestFunction` is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not burn through your proxies. # Puppeteer Crawler @@ -56,7 +56,7 @@ const crawler = new PuppeteerCrawler({ It is really up to a developer to spot if something is wrong with his request. A website can interfere with your crawling in [many ways](https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections). 
Page loading can be cancelled right away, it can timeout, the page can display a captcha, some error or warning message, or the data may be missing or corrupted. The developer can then choose if he will try to handle these problems in the code or focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error. -Now that we know when the request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code some simple Google search crawler. The two main blocking mechanisms used by Google is either to display their (in)famous 'sorry' captcha or to not load the page at all so we will focus on covering these. +Now that we know when the request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code a Google search crawler. The two main blocking mechanisms used by Google is either to display their (in)famous 'sorry' captcha or to not load the page at all so we will focus on covering these. For example, let's assume we have already initialized a requestList of Google search pages. Let's show how you can use the retire() function in both gotoFunction and handlePageFunction. diff --git a/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md b/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md index 86891b5918..da90807f59 100644 --- a/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md +++ b/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md @@ -28,7 +28,7 @@ Because this is so common use-case Apify SDK has a utility function called [save - You can also save the HTML of the page -A simple example in an Apify Actor: +An example of such Apify Actor: ```js import { Actor } from 'apify'; diff --git a/sources/academy/tutorials/node_js/optimizing_scrapers.md b/sources/academy/tutorials/node_js/optimizing_scrapers.md index 790af47096..6160096780 100644 --- a/sources/academy/tutorials/node_js/optimizing_scrapers.md +++ b/sources/academy/tutorials/node_js/optimizing_scrapers.md @@ -13,7 +13,7 @@ slug: /node-js/optimizing-scrapers Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). The slower and heavier your program is, the more proxy bandwidth, storage, [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) and higher [subscription plan](https://apify.com/pricing) you'll need. -The goal of optimization is simple: Make the code run as fast as possible and use the least resources possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). The memory alone should never be a bottleneck though. If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation into smaller parts). The rest of this article will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU. +The goal of optimization is to make the code run as fast as possible while using the least resources possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). The memory alone should never be a bottleneck though. 
If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation into smaller parts). The rest of this article will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU. One more thing to remember. Optimization has its own cost: development time. You should always think about how much time you're able to spend on it and if it's worth it. diff --git a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md index 033120f12a..c73105246f 100644 --- a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md +++ b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md @@ -102,7 +102,7 @@ await context.page.goto('https://some-new-page.com'); Some very useful scraping techniques revolve around listening to network requests and responses and even modifying them on the fly. Web Scraper's page function doesn't have access to the network, besides calling JavaScript APIs such as `fetch()`. Puppeteer Scraper, on the other hand, has full control over the browser's network activity. -With a simple call, you can listen to all the network requests that are being dispatched from the browser. For example, the following code will print all their URLs to the console. +You can listen to all the network requests that are being dispatched from the browser. For example, the following code will print all their URLs to the console. ```js context.page.on('request', (req) => console.log(req.url())); @@ -116,7 +116,7 @@ _Explaining how to do interception properly is out of scope of this article. See A large number of websites use either form submissions or JavaScript redirects for navigation and displaying of data. With Web Scraper, you cannot crawl those websites, because there are no links to find and enqueue on those pages. Puppeteer Scraper enables you to automatically click all those elements that cause navigation, intercept the navigation requests and enqueue them to the request queue. -If it seems complicated, don't worry. We've abstracted all the complexity away into a simple `Clickable elements selector` input option. When left empty, none of the said clicking and intercepting happens, but once you choose a selector, Puppeteer Scraper will automatically click all the selected elements, watch for page navigations and enqueue them into the `RequestQueue`. +If it seems complicated, don't worry. We've abstracted all the complexity away to a `Clickable elements selector` input option. When left empty, none of the said clicking and intercepting happens, but once you choose a selector, Puppeteer Scraper will automatically click all the selected elements, watch for page navigations and enqueue them into the `RequestQueue`. _The_ `Clickable elements selector` _will also work on regular non-JavaScript links, however, it is significantly slower than using the plain_ `Link selector`_. Unless you know you need it, use the_ `Link selector` _for best performance._ @@ -170,6 +170,6 @@ And we're only scratching the surface here. ## Wrapping it up -Many more techniques are available to Puppeteer Scraper that are either too complicated to replicate in Web Scraper or downright impossible to do. For basic scraping of simple websites Web Scraper is a great tool, because it goes right to the point and uses in-browser JavaScript which is well-known to millions of people, even non-developers. 
+Many more techniques are available to Puppeteer Scraper that are either too complicated to replicate in Web Scraper or downright impossible to do. Web Scraper is a great tool for basic scraping, because it goes right to the point and uses in-browser JavaScript which is well-known to millions of people, even non-developers. Once you start hitting some roadblocks, you may find that Puppeteer Scraper is just what you need to overcome them. And if Puppeteer Scraper still doesn't cut it, there's still Apify SDK to rule them all. We hope you found this tutorial helpful and happy scraping. diff --git a/sources/academy/tutorials/php/using_apify_from_php.md b/sources/academy/tutorials/php/using_apify_from_php.md index 7df379001e..b6ea6425ca 100644 --- a/sources/academy/tutorials/php/using_apify_from_php.md +++ b/sources/academy/tutorials/php/using_apify_from_php.md @@ -166,7 +166,7 @@ $data = $parsedResponse['data']; echo \json_encode($data, JSON_PRETTY_PRINT); ``` -We can see that there are two record keys: `INPUT` and `OUTPUT`. The HTML String to PDF Actor's README states that the PDF is stored under the `OUTPUT` key. Downloading it is simple: +We can see that there are two record keys: `INPUT` and `OUTPUT`. The HTML String to PDF Actor's README states that the PDF is stored under the `OUTPUT` key. Let's download it: ```php // Don't forget to replace the <RUN_ID> diff --git a/sources/academy/tutorials/python/process_data_using_python.md b/sources/academy/tutorials/python/process_data_using_python.md index 5a8b9eb2b1..5e72eaddb9 100644 --- a/sources/academy/tutorials/python/process_data_using_python.md +++ b/sources/academy/tutorials/python/process_data_using_python.md @@ -21,7 +21,7 @@ In this tutorial, we will use the Actor we created in the [previous tutorial](/a In the previous tutorial, we set out to select our next holiday destination based on the forecast of the upcoming weather there. We have written an Actor that scrapes the BBC Weather forecast for the upcoming two weeks for three destinations: Prague, New York, and Honolulu. It then saves the scraped data to a [dataset](/platform/storage/dataset) on the Apify platform. -Now, we need to process the scraped data and make a simple visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. +Now, we need to process the scraped data and make a visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. ### Setting up the Actor {#setting-up-the-actor} diff --git a/sources/academy/tutorials/python/scrape_data_python.md b/sources/academy/tutorials/python/scrape_data_python.md index b93f96c83e..47a7ae39a9 100644 --- a/sources/academy/tutorials/python/scrape_data_python.md +++ b/sources/academy/tutorials/python/scrape_data_python.md @@ -221,7 +221,7 @@ Earlier in this tutorial, we learned how to scrape data from the web in Python u In the previous tutorial, we set out to select our next holiday destination based on the forecast of the upcoming weather there. We have written an Actor that scrapes the BBC Weather forecast for the upcoming two weeks for three destinations: Prague, New York, and Honolulu. It then saves the scraped data to a [dataset](/platform/storage/dataset) on the Apify platform. -Now, we need to process the scraped data and make a simple visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. 
+Now, we need to process the scraped data and make a visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. ### Setting up the Actor {#setting-up-the-actor} diff --git a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md index e8faee7a8d..a1f9b42a1c 100644 --- a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md +++ b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md @@ -49,7 +49,7 @@ This has several benefits: 1. All listings can eventually be found in a range. 2. The ranges do not overlap, so we scrape the smallest possible number of pages and avoid duplicate listings. -3. Ranges can be controlled by a generic algorithm that is simple to reuse for different sites. +3. Ranges can be controlled by a generic algorithm that can be reused for different sites. ## Splitting pages with range filters {#splitting-pages-with-range-filters} @@ -59,7 +59,7 @@ In the previous section, we analyzed different options to split the pages to ove ### The algorithm {#the-algorithm} -The core algorithm is simple and can be used on any (even overlapping) range. This is a simplified presentation, we will discuss the details later. +The core algorithm can be used on any (even overlapping) range. This is a simplified presentation, we will discuss the details later. 1. We choose a few pivot ranges with a similar number of products and enqueue them. For example, **$0-$10**, **$100-$1000**, **$1000-$10000**, **$10000-**. 2. For each range, we open the page and check if the listings are below the limit. If yes, we continue to step 3. If not, we split the filter in half, e.g. **$0-$10** to **$0-$5** and **$5-$10** and enqueue those again. We recursively repeat step **2** for each range as long as needed. @@ -120,7 +120,7 @@ In this rare case, you will need to add another range or other filters to combin ### Implementing a range filter {#implementing-a-range-filter} -This section shows a simple code example implementing our solution for an imaginary website. Writing a real solution will bring up more complex problems but the previous section should prepare you for some of them. +This section shows a code example implementing our solution for an imaginary website. Writing a real solution will bring up more complex problems but the previous section should prepare you for some of them. First, let's define our imaginary site: @@ -283,7 +283,7 @@ await crawler.addRequests(requestsToEnqueue); ## Summary {#summary} -And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](../../platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. +And that's it. We have an elegant solution to a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](../../platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters). 
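To make the splitting step above more tangible, here's a minimal sketch of the recursive range-splitting logic. Everything in it is illustrative: `getListingCount()` is a hypothetical helper (on a real site you would open the filtered URL and read the result count from the page), and the limit and prices are made-up numbers rather than values from any actual website.

```js
// Sketch of the recursive range splitting described above. `getListingCount()`
// is a hypothetical helper, not a real library function.
const MAX_RESULTS_PER_RANGE = 1000; // imaginary limit of the site's pagination

const splitRange = async (min, max, getListingCount, ranges = []) => {
    const count = await getListingCount(min, max);
    // Keep the range if it fits under the limit or cannot be split any further.
    if (count <= MAX_RESULTS_PER_RANGE || min >= max) {
        ranges.push({ min, max, count });
        return ranges;
    }
    // Otherwise split it in half and process both halves recursively.
    const middle = Math.floor((min + max) / 2);
    await splitRange(min, middle, getListingCount, ranges);
    await splitRange(middle + 1, max, getListingCount, ranges);
    return ranges;
};

// Example run against a fake count function that pretends the whole
// $0-$10,000 range holds 5,000 listings spread evenly across prices.
const fakeCount = async (min, max) => Math.round(((max - min) / 10000) * 5000);
const ranges = await splitRange(0, 10000, fakeCount);
console.log(ranges); // each returned range holds at most ~1,000 listings
```

A real crawler would then enqueue one paginated request per returned range, as the full example linked above does.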
diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md index 2b0d9133b1..f7dcaa562a 100644 --- a/sources/academy/webscraping/anti_scraping/index.md +++ b/sources/academy/webscraping/anti_scraping/index.md @@ -107,7 +107,7 @@ Although the talk, given in 2021, features some outdated code examples, it still Because we here at Apify scrape for a living, we have discovered many popular and niche anti-scraping techniques. We've compiled them into a short and comprehensible list here to help understand the roadblocks before this course teaches you how to get around them. -> Not all issues you encounter are caused by anti-scraping systems. Sometimes, it's a simple configuration issue. Learn [how to effectively debug your programs here](/academy/node-js/analyzing-pages-and-fixing-errors). +> Not all issues you encounter are caused by anti-scraping systems. Sometimes, it's a configuration issue. Learn [how to effectively debug your programs here](/academy/node-js/analyzing-pages-and-fixing-errors). ### IP rate-limiting diff --git a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md index f17faea8c0..719e4e083f 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md @@ -22,7 +22,7 @@ There are a few factors that determine the quality of a proxy IP: - How long was the proxy left to "heal" before it was resold? - What is the quality of the underlying server of the proxy? (latency) -Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate-limiting, which brings new challenges for scrapers that can no longer rely on simple IP rotation. Anti-scraping software providers, such as CloudFlare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) was blocked altogether. If you care about the quality of your IPs, use them as a real user, and any website will have a hard time banning them completely. +Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate-limiting, which brings new challenges for scrapers that can no longer rely on IP rotation. Anti-scraping software providers, such as CloudFlare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) was blocked altogether. If you care about the quality of your IPs, use them as a real user, and any website will have a hard time banning them completely. Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the [previous lesson](../index.md). 
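As a rough illustration of how a proxy link is plugged into a scraper, here's a sketch using got-scraping's `proxyUrl` option with a small round-robin pool. The proxy URLs are placeholders you would replace with links from your provider, and as noted above, rotating IPs like this is only a baseline - the quality of the underlying IPs still matters most.

```js
import { gotScraping } from 'got-scraping';

// Placeholder proxy links - swap in the ones from your proxy provider.
const proxyUrls = [
    'http://username:password@proxy-1.example.com:8080',
    'http://username:password@proxy-2.example.com:8080',
];

// Pick a proxy in round-robin fashion so consecutive requests
// don't all come from the same IP address.
let counter = 0;
const nextProxyUrl = () => proxyUrls[counter++ % proxyUrls.length];

const response = await gotScraping({
    url: 'https://example.com',
    proxyUrl: nextProxyUrl(),
});

console.log(response.statusCode);
```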
diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md index 7444657e22..8c96eb343e 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md @@ -92,7 +92,7 @@ const response = await gotScraping({ For our SoundCloud example, testing the endpoint from the previous section in a tool like [Postman](../../../glossary/tools/postman.md) works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downfall is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud. -Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a simple way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead. +Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead. Here is a way you could dynamically scrape the `client_id` using Puppeteer: diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md index 255f404294..c6b9e328b8 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md @@ -146,7 +146,7 @@ Now that we have this visualization to work off of, it will be much easier to bu ## Building a query {#building-a-query} -In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a simple query. +In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a query. Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out! 
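Based only on the field names mentioned above, a first draft of such a query could look like the sketch below. The endpoint URL is a placeholder, and the query deliberately omits any arguments for limiting or sorting the articles, since those details come from the actual schema rather than from this lesson.

```js
import { gotScraping } from 'got-scraping';

// A rough draft of the query, using only the field names discussed above.
const query = `
    query {
        organization {
            media {
                title
                public_at
            }
        }
    }
`;

const response = await gotScraping({
    url: 'https://api.example.com/graphql', // placeholder endpoint, not the real one
    method: 'POST',
    json: { query },
    responseType: 'json',
});

console.log(response.body.data);
```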
diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md index 33ce89d2fb..9e8da6635c 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md @@ -116,4 +116,4 @@ Depending on the API, doing just this can be sufficient. However, sometimes we w ## Next up {#next} -In the [next lesson](./introspection.md) we will be walking you through how to learn about a GraphQL API before scraping it by using **introspection** (don't worry - it's a fancy word, but a simple concept). +In the [next lesson](./introspection.md) we will be walking you through how to learn about a GraphQL API before scraping it by using **introspection**. diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md index e237137b82..004a4a4e8a 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -Whether it's auto-renewing a service, automatically sending a message on an interval, or automatically cancelling a Netflix subscription, one of the most popular things headless browsers are used for is automating things within a user's account on a certain website. Of course, automating anything on a user's account requires the automation of the login process as well. In this lesson, we'll be covering how to build a simple login flow from start to finish with Playwright or Puppeteer. +Whether it's auto-renewing a service, automatically sending a message on an interval, or automatically cancelling a Netflix subscription, one of the most popular things headless browsers are used for is automating things within a user's account on a certain website. Of course, automating anything on a user's account requires the automation of the login process as well. In this lesson, we'll be covering how to build a login flow from start to finish with Playwright or Puppeteer. > In this lesson, we'll be using [yahoo.com](https://yahoo.com) as an example. Feel free to follow along using the academy Yahoo account credentials, or even deviate from the lesson a bit and try building a login flow for a different website of your choosing! diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md index bcc2e68a05..760baf0c49 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md @@ -17,7 +17,7 @@ Getting information from inside iFrames is a known pain, especially for new deve If you are using basic methods of page objects like `page.evaluate()`, you are actually already working with frames. Behind the scenes, Puppeteer will call `page.mainFrame().evaluate()`, so most of the methods you are using with page object can be used the same way with frame object. To access frames, you need to loop over the main frame's child frames and identify the one you want to use. 
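As a sketch of that idea, the snippet below opens a page and searches its child frames for one whose URL looks like a Twitter widget. The URL substring used for matching is an assumption - how you identify the right frame depends entirely on the page you're scraping.

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.imdb.com/');

// Look through the main frame's child frames and pick the one we need.
// Matching by a URL substring is only one way to identify a frame.
const targetFrame = page
    .mainFrame()
    .childFrames()
    .find((frame) => frame.url().includes('twitter'));

if (targetFrame) {
    // A frame supports the same evaluate() API as the page object itself.
    const text = await targetFrame.evaluate(() => document.body.innerText);
    console.log(text);
}

await browser.close();
```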
-As a simple demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/). +As a demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/). ```js import puppeteer from 'puppeteer'; diff --git a/sources/academy/webscraping/switching_to_typescript/installation.md b/sources/academy/webscraping/switching_to_typescript/installation.md index 361bb16ba8..330cd794d3 100644 --- a/sources/academy/webscraping/switching_to_typescript/installation.md +++ b/sources/academy/webscraping/switching_to_typescript/installation.md @@ -85,7 +85,7 @@ Let's create a folder called **learning-typescript**, adding a new file within i ![Example pasted into first-lines.ts](./images/pasted-example.png) -As seen above, TypeScript has successfully recognized our code; however, there are now red underlines under the `price1` and `price2` parameters in the function declaration of `addPrices`. This is because right now, the compiler has no idea what data types we're expecting to be passed in. This can be solved with the simple addition of **type annotations** to the parameters by using a colon (`:`) and the name of the parameter's type. +As seen above, TypeScript has successfully recognized our code; however, there are now red underlines under the `price1` and `price2` parameters in the function declaration of `addPrices`. This is because right now, the compiler has no idea what data types we're expecting to be passed in. This can be solved with the addition of **type annotations** to the parameters by using a colon (`:`) and the name of the parameter's type. ```ts const products = [ diff --git a/sources/academy/webscraping/switching_to_typescript/type_aliases.md b/sources/academy/webscraping/switching_to_typescript/type_aliases.md index 1b70f926c6..ba8910270b 100644 --- a/sources/academy/webscraping/switching_to_typescript/type_aliases.md +++ b/sources/academy/webscraping/switching_to_typescript/type_aliases.md @@ -72,7 +72,7 @@ console.log(returnValueAsString(myValue)); ## Function types {#function-types} -Before we learn about how to write function types, let's learn about a problem they can solve. We have a simple function called `addAll` which takes in array of numbers, adds them all up, and then returns the result. +Before we learn about how to write function types, let's learn about a problem they can solve. We have a function called `addAll` which takes in an array of numbers, adds them all up, and then returns the result. ```ts const addAll = (nums: number[]) => { diff --git a/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md b/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md index 7b36915500..1c961e072d 100644 --- a/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md +++ b/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md @@ -80,7 +80,7 @@ This works, and in fact, it's the most optimal solution for this use case. But w ## Type assertions {#type-assertions} -Despite the fancy name, [type assertions](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#type-assertions) are a simple concept based around a single keyword: `as`. We usually use this on values that we can't control the return type of, or values that we're sure have a certain type, but TypeScript needs a bit of help understanding that.
+Despite the fancy name, [type assertions](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#type-assertions) are a concept based around a single keyword: `as`. We usually use this on values that we can't control the return type of, or values that we're sure have a certain type, but TypeScript needs a bit of help understanding that. <!-- eslint-disable --> ```ts diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md index 09754ab160..e69994a5c7 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md @@ -152,7 +152,7 @@ const crawler = new CheerioCrawler({ }, }); -// Instead of using a simple URL string, we're now +// Instead of using a string with URL, we're now // using a request object to add more options. await crawler.addRequests([{ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales', @@ -224,7 +224,7 @@ When you run the code as usual, you'll see the product URLs printed to the termi ./storage/datasets/default/*.json ``` -Thanks to **Crawlee**, we were able to create a **faster and more robust scraper**, but **with less code** than what was needed for the simple scraper in the earlier lessons. +Thanks to **Crawlee**, we were able to create a **faster and more robust scraper**, but **with less code** than what was needed for the scraper in the earlier lessons. ## Next up {#next} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md index 1d44782c7d..fdcf2e93d4 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md @@ -11,7 +11,7 @@ slug: /web-scraping-for-beginners/crawling/recap-extraction-basics --- -We finished off the [first section](../data_extraction/index.md) of the _Web Scraping for Beginners_ course by creating a simple web scraper in Node.js. The scraper collected all the on-sale products from [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). Let's see the code with some comments added. +We finished off the [first section](../data_extraction/index.md) of the _Web Scraping for Beginners_ course by creating a web scraper in Node.js. The scraper collected all the on-sale products from [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). Let's see the code with some comments added. ```js // First, we imported all the libraries we needed to diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md index 45dd302e06..79278386a1 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md @@ -30,7 +30,7 @@ The `length` property of `products` tells us how many products we have in the li > [Visit this tutorial](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration) if you need to refresh the concept of loops in programming. 
-Now, we will loop over each product and print their titles. We will use a so-called `for..of` loop to do it. It is a simple loop that iterates through all items of an array. +Now, we will loop over each product and print their titles. We will use a so-called `for..of` loop to do it. It is a loop that iterates through all items of an array. Run the following command in the Console. Some notes: diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md index 9d0c9a6a26..be86277ddc 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md @@ -53,7 +53,7 @@ npm install got-scraping cheerio ## Test everything {#testing} -With the libraries installed, create a new file in the project's folder called **main.js**. This is where we will put all our code. Before we start scraping, though, let's do a simple check that everything was installed correctly. Add this piece of code inside **main.js**. +With the libraries installed, create a new file in the project's folder called **main.js**. This is where we will put all our code. Before we start scraping, though, let's do a check that everything was installed correctly. Add this piece of code inside **main.js**. ```js import { gotScraping } from 'got-scraping'; diff --git a/sources/academy/webscraping/web_scraping_for_beginners/index.md b/sources/academy/webscraping/web_scraping_for_beginners/index.md index c0ede32215..6842ca4c61 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/index.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/index.md @@ -65,7 +65,7 @@ Throughout the next lessons, we will sometimes use certain technologies and term ### jQuery or Cheerio {#jquery-or-cheerio} -We'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js. +We'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides an API using jQuery syntax to help traverse downloaded HTML within Node.js. ## Next up {#next} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md index 6ee66f548e..57b1169e0d 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md @@ -12,7 +12,7 @@ slug: /web-scraping-for-beginners/introduction --- -Web scraping or crawling? Web data extraction, mining, or collection? You can find various definitions on the web. Let's agree on simple explanations that we will use throughout this beginner course on web scraping. +Web scraping or crawling? Web data extraction, mining, or collection? You can find various definitions on the web. Let's agree on explanations that we will use throughout this beginner course on web scraping. ## What is web data extraction? 
{#what-is-data-extraction} From fc1bd20e5683902ac3ba6b61cd77f025a044f88c Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Thu, 25 Apr 2024 18:00:15 +0200 Subject: [PATCH 08/15] style: make vale happy --- sources/academy/tutorials/apify_scrapers/cheerio_scraper.md | 2 +- sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md | 2 +- sources/academy/tutorials/apify_scrapers/web_scraper.md | 2 +- .../webscraping/web_scraping_for_beginners/introduction.md | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md index 5659a11f61..58f8cbc4e7 100644 --- a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md @@ -502,7 +502,7 @@ Thank you for reading this whole tutorial! Really! It's important to us that our ## [](#whats-next) What's next * Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. -* [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on actors. +* [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. * Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom actor](https://apify.com/custom-solutions) from an Apify-certified developer. diff --git a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md index 0205d65eff..c2ef09d81e 100644 --- a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md @@ -822,7 +822,7 @@ Thank you for reading this whole tutorial! Really! It's important to us that our ## [](#whats-next) What's next? - Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. -- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on actors. +- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. - Found out you're not into the coding part but would still to use Apify Actors? 
Check out our [ready-made solutions](https://apify.com/store) or [order a custom actor](https://apify.com/custom-solutions) from an Apify-certified developer. diff --git a/sources/academy/tutorials/apify_scrapers/web_scraper.md b/sources/academy/tutorials/apify_scrapers/web_scraper.md index 335f63536d..3ee05e1f9e 100644 --- a/sources/academy/tutorials/apify_scrapers/web_scraper.md +++ b/sources/academy/tutorials/apify_scrapers/web_scraper.md @@ -556,7 +556,7 @@ Thank you for reading this whole tutorial! Really! It's important to us that our ## [](#whats-next) What's next? - Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. -- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on actors. +- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. - Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom actor](https://apify.com/custom-solutions) from an Apify-certified developer. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md index 57b1169e0d..aff6571d1f 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md @@ -16,7 +16,7 @@ Web scraping or crawling? Web data extraction, mining, or collection? You can fi ## What is web data extraction? {#what-is-data-extraction} -Web data extraction (or collection) is a process that takes a web page, like an Amazon product page, and collects useful information from the page, such as the product's name and price. Web pages are an unstructured data source and the goal of web data extraction is to make information from websites structured, so that it can be processed by data analysis tools or integrated with computer systems. The main sources of data on a web page are HTML documents and API calls, but also images, PDFs, and so on. +Web data extraction (or collection) is a process that takes a web page, like an Amazon product page, and collects useful information from the page, such as the product's name and price. Web pages are an unstructured data source and the goal of web data extraction is to make information from websites structured, so that it can be processed by data analysis tools or integrated with computer systems. The main sources of data on a web page are HTML documents and API calls, but also images, PDFs, etc. 
![product data extraction from Amazon](./images/beginners-data-extraction.png) From f3df1f6cee3a0a64d814b8904177281c0899368f Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Fri, 26 Apr 2024 10:34:55 +0200 Subject: [PATCH 09/15] style: removing implicit expectation --- .../data_extraction/browser_devtools.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md index 76f13dc62f..8afb17ce63 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md @@ -9,9 +9,7 @@ slug: /web-scraping-for-beginners/data-extraction/browser-devtools --- -Even though DevTools stands for developer tools, everyone can use them to inspect a website. Each major browser has its own DevTools. We will use Chrome DevTools as an example, but the advice is applicable to any browser, as the tools are extremely similar. To open Chrome DevTools, you can press **F12** or right-click anywhere in the page and choose **Inspect**. - -Now go to [Wikipedia](https://wikipedia.com) and open your DevTools there. Inspecting the same website as us will make this lesson easier to follow. +Even though DevTools stands for developer tools, everyone can use them to inspect a website. Each major browser has its own DevTools. We will use Chrome DevTools as an example, but the advice is applicable to any browser, as the tools are extremely similar. To open Chrome DevTools, you can press **F12** or right-click anywhere in the page and choose **Inspect**. Now go to [Wikipedia](https://wikipedia.com) and open your DevTools there. ![Wikipedia with Chrome DevTools open](./images/browser-devtools-wikipedia.png) From a1dd80c951dcab6fb5af8550a6065f88b09b8317 Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Fri, 26 Apr 2024 15:14:41 +0200 Subject: [PATCH 10/15] fix: broken links to pptr.dev docs --- .../webscraping/puppeteer_playwright/page/page_methods.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md index 40671a3241..7874517f65 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-pagereloadoptions), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/#?product=Puppeteer&show=api-pagecontent). 
+Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/api/puppeteer.page.reload), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/api/puppeteer.page.content/). Last lesson, we left off at a point where we were waiting for the page to navigate so that we can extract the page's title and take a screenshot of it. In this lesson, we'll be learning about the two methods we can use to achieve both of those things. From 222717a49e38c643b8391e2e20c61f6f078e3289 Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Fri, 26 Apr 2024 15:31:11 +0200 Subject: [PATCH 11/15] fix: wasn't clear what 'extensions' exactly means --- .../puppeteer_playwright/reading_intercepting_requests.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md index 0d2c2449a2..2d7d4d0072 100644 --- a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md +++ b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md @@ -231,7 +231,7 @@ Upon running this code, we'll see the API response logged into the console: One of the most popular ways of speeding up website loading in Puppeteer and Playwright is by blocking certain resources from loading. These resources are usually CSS files, images, and other miscellaneous resources that aren't super necessary (mainly because the computer doesn't have eyes - it doesn't care how the website looks!). -In Puppeteer, we must first enable request interception with the `page.setRequestInterception()` function. Then, we can check whether or not the request's resource ends with one of our blocked extensions. If so, we'll abort the request. Otherwise, we'll let it continue. All of this logic will still be within the `page.on()` method. +In Puppeteer, we must first enable request interception with the `page.setRequestInterception()` function. Then, we can check whether or not the request's resource ends with one of our blocked file extensions. If so, we'll abort the request. Otherwise, we'll let it continue. All of this logic will still be within the `page.on()` method. With Playwright, request interception is a bit different. We use the [`page.route()`](https://playwright.dev/docs/api/class-page#page-route) function instead of `page.on()`, passing in a string, regular expression, or a function that will match the URL of the request we'd like to read from. The second parameter is also a callback function, but with the [**Route**](https://playwright.dev/docs/api/class-route) object passed into it instead. 
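Putting the Puppeteer variant described above together, a sketch of resource blocking could look like this. The list of blocked file extensions and the target URL are arbitrary examples, not a recommendation for any particular site.

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

// File extensions we usually don't need for data extraction.
const blockedExtensions = ['.png', '.jpg', '.jpeg', '.svg', '.gif', '.css', '.woff2'];

// Request interception must be switched on before the handler can abort anything.
await page.setRequestInterception(true);

page.on('request', (request) => {
    const url = request.url();
    // Abort requests for blocked resources, let everything else through.
    if (blockedExtensions.some((extension) => url.endsWith(extension))) {
        request.abort();
    } else {
        request.continue();
    }
});

await page.goto('https://example.com');
await browser.close();
```

In Playwright, the equivalent check would live inside the callback passed to `page.route()`, using the **Route** object's methods to abort or continue each matched request.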
From 472c701676d24ce51b2f56647079a5761206cb49 Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Fri, 26 Apr 2024 15:57:11 +0200 Subject: [PATCH 12/15] style: clearer explanation --- .../api_scraping/general_api_scraping/locating_and_learning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md index 8342f186b4..d8909832bf 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md @@ -21,7 +21,7 @@ _Here's what we can see in the Network tab after reloading the page:_ Let's say that our target data is a full list of Tiësto's uploaded songs on SoundCloud. We can use the **Filter** option to search for the keyword `tracks`, and see if any endpoints have been hit that include that word. Multiple results may still be in the list when using this feature, so it is important to carefully examine the payloads and responses of each request in order to ensure that the correct one is found. -> **Note:** The keyword/piece of data that is used in this filtered search should be a target keyword or a piece of target data that that can be assumed will most likely be a part of the endpoint. +> To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case a string `tracks`). After a little bit of digging through the different response values of each request in our filtered list within the Network tab, we can discover this endpoint, which returns a JSON list including 20 of Tiësto's latest tracks: From 67e0dbe86ad55a8fa8e133694dbbd7ab0c93372f Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Fri, 26 Apr 2024 16:47:38 +0200 Subject: [PATCH 13/15] style: improve section about proxy links --- .../anti_scraping/mitigation/proxies.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md index 719e4e083f..e1109acffc 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md @@ -26,17 +26,27 @@ Although IP quality is still the most important factor when it comes to using pr Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the [previous lesson](../index.md). -## A bit about proxy links {#understanding-proxy-links} +## About proxy links {#understanding-proxy-links} -When using proxies in your crawlers, you'll most likely be using them in a format that looks like this: +To use a proxy, you need a proxy link, which contains the connection details, sometimes including credentials. ```text http://proxy.example.com:8080 ``` -This link is separated into two main components: the **host**, and the **port**. In our case, our hostname is `http://proxy.example.com`, and our port is `8080`. Sometimes, a proxy might use an IP address as the host, such as `103.130.104.33`. 
+The proxy link above has several parts: -If authentication (a username and a password) is required, the format will look a bit different: +- `http://` tells us we're using HTTP protocol, +- `proxy.example.com` is a hostname, i.e. an address to the proxy server, +- `8080` is a port number. + +Sometimes the proxy server has no name, so the link contains an IP address instead: + +```text +http://123.456.789.10:8080 +``` + +If proxy requires authentication, the proxy link can contain username and password: ```text http://USERNAME:PASSWORD@proxy.example.com:8080 From d487e5961de9ef120e914d90ab72db67321b7b0f Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Fri, 26 Apr 2024 17:52:56 +0200 Subject: [PATCH 14/15] fix: English --- sources/academy/webscraping/puppeteer_playwright/proxies.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sources/academy/webscraping/puppeteer_playwright/proxies.md b/sources/academy/webscraping/puppeteer_playwright/proxies.md index 556638ab6f..60c1d04416 100644 --- a/sources/academy/webscraping/puppeteer_playwright/proxies.md +++ b/sources/academy/webscraping/puppeteer_playwright/proxies.md @@ -169,7 +169,7 @@ const browser = await puppeteer.launch({ </TabItem> </Tabs> -However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed into the `page.authenticate()` prior to any navigations being made, while in Playwright they can be passed into the **proxy** option object. +However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed to the `page.authenticate()` prior to any navigations being made, while in Playwright they can be passed to the **proxy** option object. <Tabs groupId="main"> <TabItem value="Playwright" label="Playwright"> From a0d8b3aea434fd4b977bb4229852771c7658f07c Mon Sep 17 00:00:00 2001 From: Honza Javorek <mail@honzajavorek.cz> Date: Fri, 26 Apr 2024 18:27:10 +0200 Subject: [PATCH 15/15] style: sounds better --- .../webscraping/puppeteer_playwright/browser_contexts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md index 3fd94756cc..8891772f17 100644 --- a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md +++ b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md @@ -77,7 +77,7 @@ await browser.close(); ## Using browser contexts {#using-browser-contexts} -In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android: +In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android device: <Tabs groupId="main"> <TabItem value="Playwright" label="Playwright">