Karim Shoair committed

Commit 87facc2
Parent(s): 983da92

docs: update the titles for all files
- docs/development/adaptive_storage_system.md +2 -0
- docs/development/scrapling_custom_types.md +2 -0
- docs/fetching/choosing.md +2 -0
- docs/fetching/dynamic.md +1 -1
- docs/fetching/static.md +1 -1
- docs/fetching/stealthy.md +1 -1
- docs/parsing/adaptive.md +1 -1
- docs/parsing/main_classes.md +1 -1
- docs/parsing/selection.md +1 -1
- docs/spiders/architecture.md +1 -1
- docs/spiders/getting-started.md +2 -0
- docs/spiders/proxy-blocking.md +1 -1
- docs/spiders/requests-responses.md +1 -1
- docs/spiders/sessions.md +1 -1
docs/development/adaptive_storage_system.md CHANGED

@@ -1,3 +1,5 @@
+# Writing your retrieval system
+
 Scrapling uses SQLite by default, but this tutorial shows how to write your own storage system to store element properties for the `adaptive` feature.
 
 You might want to use Firebase, for example, and share the database between multiple spiders on different machines. It's a great idea to use an online database like that because spiders can share adaptive data with each other.
docs/development/scrapling_custom_types.md CHANGED

@@ -1,3 +1,5 @@
+# Using Scrapling's custom types
+
 > You can take advantage of the custom-made types for Scrapling and use them outside the library if you want. It's better than copying their code, after all :)
 
 ### All current types can be imported alone, like below
docs/fetching/choosing.md CHANGED

@@ -1,3 +1,5 @@
+# Fetchers basics
+
 ## Introduction
 Fetchers are classes that can do requests or fetch pages for you easily in a single-line fashion with many features and then return a [Response](#response-object) object. Starting with v0.3, all fetchers have separate classes to keep the session running, so for example, a fetcher that uses a browser will keep the browser open till you finish all your requests through it instead of opening multiple browsers. So it depends on your use case.
 
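The paragraph in this hunk contrasts one-shot fetchers with session classes that keep a browser open across requests. The design can be sketched generically like this, with a stub `Browser` standing in for a real headless browser (all names here are illustrative, not Scrapling's API):

```python
# Illustrative sketch of the one-browser-per-session idea described above.
# The Browser class is a stub, not real Scrapling or Playwright code.


class Browser:
    launches = 0  # count how many browsers get opened

    def __init__(self) -> None:
        Browser.launches += 1

    def fetch(self, url: str) -> str:
        return f"<html>{url}</html>"

    def close(self) -> None:
        pass


def one_shot_fetch(url: str) -> str:
    """Opens (and closes) a fresh browser for every single request."""
    browser = Browser()
    try:
        return browser.fetch(url)
    finally:
        browser.close()


class FetcherSession:
    """Keeps one browser alive for all requests made through the session."""

    def __enter__(self) -> "FetcherSession":
        self._browser = Browser()
        return self

    def fetch(self, url: str) -> str:
        return self._browser.fetch(url)

    def __exit__(self, *exc) -> None:
        self._browser.close()


with FetcherSession() as session:
    for url in ("https://a.test", "https://b.test", "https://c.test"):
        session.fetch(url)  # all three requests share a single browser

print(Browser.launches)  # → 1
```

The session form is what you want when making many browser-backed requests; the one-shot form is simpler when you only need a page or two.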
docs/fetching/dynamic.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Fetching dynamic websites
 
 Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and little under-the-hood stealth improvements.
 
docs/fetching/static.md CHANGED

@@ -1,4 +1,4 @@
-#
+# HTTP requests
 
 The `Fetcher` class provides rapid and lightweight HTTP requests using the high-performance `curl_cffi` library with a lot of stealth capabilities.
 
docs/fetching/stealthy.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Fetching dynamic websites with hard protections
 
 Here, we will discuss the `StealthyFetcher` class. This class is very similar to the [DynamicFetcher](dynamic.md#introduction) class, including the browsers, the automation, and the use of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities; most of them are handled automatically under the hood, and the rest is up to you to enable.
 
docs/parsing/adaptive.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Adaptive scraping
 
 !!! success "Prerequisites"
 
docs/parsing/main_classes.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Parsing main classes
 
 !!! success "Prerequisites"
 
docs/parsing/selection.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Querying elements
 Scrapling currently supports parsing HTML pages exclusively, so it doesn't support XML feeds. This decision was made because the adaptive feature won't work with XML, but that might change soon, so stay tuned :)
 
 In Scrapling, there are five main ways to find elements:
docs/spiders/architecture.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Spiders architecture
 
 !!! success "Prerequisites"
 
docs/spiders/getting-started.md CHANGED

@@ -1,3 +1,5 @@
+# Getting started
+
 ## Introduction
 
 !!! success "Prerequisites"
docs/spiders/proxy-blocking.md CHANGED

@@ -203,7 +203,7 @@ class MySpider(Spider):
         yield {"title": response.css("title::text").get("")}
 ```
 
-What happened above is that I left the blocking detection logic unchanged and
+What happened above is that I left the blocking detection logic unchanged and had the spider mainly use requests until it got blocked, then switch to the stealthy browser.
 
 
 Putting it all together:
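The sentence completed in this hunk describes a fallback strategy: use cheap HTTP requests until the site blocks you, then switch to a stealthy browser for the rest of the crawl. Sketched generically with stub fetch functions (none of these names are Scrapling's real API; swap in real HTTP and browser clients as needed):

```python
# Generic sketch of the "fast requests first, stealthy browser after a block"
# strategy. Both fetchers here are stubs for illustration only.
from typing import Callable

BLOCKED_STATUSES = {403, 429}  # status codes treated as "we got blocked"


def fetch_with_fallback(
    url: str,
    fast_fetch: Callable[[str], tuple[int, str]],
    stealthy_fetch: Callable[[str], tuple[int, str]],
    state: dict,
) -> tuple[int, str]:
    """Use the cheap fetcher until it gets blocked once, then stay stealthy."""
    if not state.get("blocked"):
        status, body = fast_fetch(url)
        if status not in BLOCKED_STATUSES:
            return status, body
        state["blocked"] = True  # remember the block for all later requests
    return stealthy_fetch(url)


# Stub fetchers: the fast one gets blocked on its second request.
calls = {"fast": 0}

def fast(url: str) -> tuple[int, str]:
    calls["fast"] += 1
    return (200, "ok") if calls["fast"] < 2 else (403, "blocked")

def stealthy(url: str) -> tuple[int, str]:
    return (200, "stealthy ok")

state: dict = {}
results = [
    fetch_with_fallback(f"https://site.test/{i}", fast, stealthy, state)
    for i in range(4)
]
print(results)
```

After the first 403, every subsequent request goes straight to the stealthy fetcher without wasting a doomed fast request first.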
docs/spiders/requests-responses.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Requests & Responses
 
 !!! success "Prerequisites"
 
docs/spiders/sessions.md CHANGED

@@ -1,4 +1,4 @@
-#
+# Spiders sessions
 
 !!! success "Prerequisites"
 