Karim shoair committed on
Commit · 3dc1188
1 Parent(s): d8893a8
docs: Style updates and a lot of clarifications

Browse files
- docs/cli/extract-commands.md +8 -0
- docs/cli/interactive-shell.md +8 -0
- docs/fetching/dynamic.md +8 -2
- docs/fetching/static.md +6 -0
- docs/fetching/stealthy.md +6 -0
- docs/index.md +1 -24
- docs/parsing/adaptive.md +7 -0
- docs/parsing/main_classes.md +29 -20
- mkdocs.yml +11 -26
docs/cli/extract-commands.md CHANGED

@@ -4,6 +4,14 @@
 
 The `scrapling extract` Command lets you download and extract content from websites directly from your terminal without writing any code. Ideal for beginners, researchers, and anyone requiring rapid web data extraction.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+> 4. You’ve completed or read at least one page from the fetchers section to use here for requests: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md).
+
+
 ## What is the Extract Command group?
 
 The extract command is a set of simple terminal tools that:
docs/cli/interactive-shell.md CHANGED

@@ -6,6 +6,14 @@
 
 The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools like curl command conversion.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+> 4. You’ve completed or read at least one page from the fetchers section to use here for requests: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md).
+
+
 ## Why use the Interactive Shell?
 
 The interactive shell transforms web scraping from a slow script-and-run cycle into a fast, exploratory experience. It's perfect for:
docs/fetching/dynamic.md CHANGED

@@ -4,6 +4,12 @@ Here, we will discuss the `DynamicFetcher` class (previously known as `PlayWrigh
 
 As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
 

@@ -275,7 +281,7 @@ async def scrape_multiple_sites():
     return pages
 ```
 
-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then:
+You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages that can be displayed at once. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then:
 
 1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
 2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.

@@ -301,4 +307,4 @@ Use DynamicFetcher when:
 - Need custom browser config
 - Want flexible stealth options
 
-If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
+If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
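The rotating tab-pool behaviour described above (reap finished tabs, open a new one if under the cap, otherwise poll sub-second for up to 60 seconds before raising `TimeoutError`) can be sketched in plain Python. This is a toy illustration with hypothetical names, not Scrapling's actual internals:

```python
import time


class TabPool:
    """Illustrative sketch of a rotating browser-tab pool with a hard cap."""

    def __init__(self, max_pages: int, timeout: float = 60.0):
        self.max_pages = max_pages
        self.timeout = timeout
        self.open_tabs: list[dict] = []  # each fake "tab" is {"finished": bool}

    def _reap_finished(self) -> None:
        # Close (drop) every tab that has finished its task
        self.open_tabs = [t for t in self.open_tabs if not t["finished"]]

    def acquire_tab(self) -> dict:
        deadline = time.monotonic() + self.timeout
        while time.monotonic() < deadline:
            self._reap_finished()
            if len(self.open_tabs) < self.max_pages:
                # Within the allowed range: open a new tab and proceed as normal
                tab = {"finished": False}
                self.open_tabs.append(tab)
                return tab
            time.sleep(0.1)  # re-check at a sub-second interval
        # No slot freed up before the deadline (e.g. an unresponsive website)
        raise TimeoutError("No free tab slot became available in time")
```

A request would call `acquire_tab()`, do its work, then mark the tab `finished` so a later request can reclaim the slot.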
docs/fetching/static.md CHANGED

@@ -2,6 +2,12 @@
 
 The `Fetcher` class provides rapid and lightweight HTTP requests using the high-performance `curl_cffi` library with a lot of stealth capabilities.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
 
docs/fetching/stealthy.md CHANGED

@@ -4,6 +4,12 @@ Here, we will discuss the `StealthyFetcher` class. This class is similar to [Dyn
 
 As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
 
docs/index.md CHANGED

@@ -83,33 +83,10 @@ Scrapling’s GitHub stars have grown steadily since its release (see chart belo
 
 <div id="chartContainer">
   <a href="https://github.com/D4Vinci/Scrapling">
-    <img id="chartImage" alt="Star History Chart" loading="lazy" src="https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=
+    <img id="chartImage" alt="Star History Chart" loading="lazy" src="https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=date&legend=top-left&theme=dark" height="400"/>
   </a>
 </div>
 
-<script>
-  const observer = new MutationObserver((mutations) => {
-    mutations.forEach((mutation) => {
-      if (mutation.attributeName === 'data-md-color-media') {
-        const colorMedia = document.body.getAttribute('data-md-color-media');
-        const isDarkScheme = document.body.getAttribute('data-md-color-scheme') === 'slate';
-        const chartImg = document.querySelector('#chartImage');
-        const baseUrl = 'https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date';
-
-        if (colorMedia === '(prefers-color-scheme)' ? isDarkScheme : colorMedia.includes('dark')) {
-          chartImg.src = `${baseUrl}&theme=dark`;
-        } else {
-          chartImg.src = baseUrl;
-        }
-      }
-    });
-  });
-
-  observer.observe(document.body, {
-    attributes: true,
-    attributeFilter: ['data-md-color-media', 'data-md-color-scheme']
-  });
-</script>
 
 ## Installation
 Scrapling requires Python 3.10 or higher:
docs/parsing/adaptive.md CHANGED

@@ -1,4 +1,11 @@
 ## Introduction
+
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector) object.
+> 2. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to understand the [Selector](../parsing/main_classes.md#selector) class.
+> <br><br>
+
 Adaptive scraping (previously known as automatch) is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.
 
 Let's say you are scraping a page with a structure like this:
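The core idea behind the adaptive scraping described above — save an element's properties, then relocate the closest match after the site changes — can be illustrated with a crude similarity score. This is a toy sketch with hypothetical names, not Scrapling's actual matching algorithm:

```python
def similarity(saved: dict, candidate: dict) -> float:
    """Toy score: fraction of saved properties the candidate still matches."""
    keys = ("tag", "text", "class", "parent_tag")
    hits = sum(1 for k in keys if saved.get(k) == candidate.get(k))
    return hits / len(keys)


def relocate(saved: dict, candidates: list[dict]) -> dict:
    # Pick the candidate most similar to the element we saved earlier
    return max(candidates, key=lambda c: similarity(saved, c))


# Properties saved from the price element before the redesign
saved = {"tag": "span", "text": "$10.99", "class": "price", "parent_tag": "article"}

# Elements found after the redesign: the price kept its tag, text, and
# parent, but its class was renamed
candidates = [
    {"tag": "p", "text": "Product 1", "class": "title", "parent_tag": "article"},
    {"tag": "span", "text": "$10.99", "class": "product-price", "parent_tag": "article"},
]
```

Even though the `class` selector broke, the second candidate still matches three of the four saved properties, so a similarity-based relocation recovers the right element.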
docs/parsing/main_classes.md CHANGED

@@ -1,7 +1,13 @@
 ## Introduction
-After exploring the various ways to select elements with Scrapling and related features, let's take a step back and examine the [Selector](#selector) class generally and other objects to better understand the parsing engine.
 
-
+> 💡 **Prerequisites:**
+>
+> - You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector) object.
+> <br><br>
+
+After exploring the various ways to select elements with Scrapling and its related features, let's take a step back and examine the [Selector](#selector) class in general, as well as other objects, to gain a better understanding of the parsing engine.
+
+The [Selector](#selector) class is the core parsing engine in Scrapling, providing HTML parsing and element selection capabilities. You can always import it with any of the following imports
 ```python
 from scrapling import Selector
 from scrapling.parser import Selector

@@ -133,7 +139,7 @@ Getting the attributes of the element
 >>> print(article.attrib)
 {'class': 'product', 'data-id': '1'}
 ```
-Access a specific attribute with any
+Access a specific attribute with any of the following
 ```python
 >>> article.attrib['class']
 >>> article.attrib.get('class')

@@ -151,14 +157,16 @@ Get the HTML content of the element
 ```
 Get the prettified version of the element's HTML content
 ```python
-
+print(article.prettify())
+```
+```html
 <article class="product" data-id="1"><h3>Product 1</h3>
 <p class="description">This is product 1</p>
 <span class="price">$10.99</span>
 <div class="hidden stock">In stock: 5</div>
 </article>
 ```
-Use `.body` property to get the raw content of page
+Use the `.body` property to get the raw content of the page
 ```python
 >>> page.body
 '<html>\n  <head>\n    <title>Some page</title>\n  </head>\n  <body>\n    <div class="product-list">\n      <article class="product" data-id="1">\n        <h3>Product 1</h3>\n        <p class="description">This is product 1</p>\n        <span class="price">$10.99</span>\n        <div class="hidden stock">In stock: 5</div>\n      </article>\n\n      <article class="product" data-id="2">\n        <h3>Product 2</h3>\n        <p class="description">This is product 2</p>\n        <span class="price">$20.99</span>\n        <div class="hidden stock">In stock: 3</div>\n      </article>\n\n      <article class="product" data-id="3">\n        <h3>Product 3</h3>\n        <p class="description">This is product 3</p>\n        <span class="price">$15.99</span>\n        <div class="hidden stock">Out of stock</div>\n      </article>\n    </div>\n\n    <script id="page-data" type="application/json">\n      {\n        "lastUpdated": "2024-09-22T10:30:00Z",\n        "totalProducts": 3\n      }\n    </script>\n  </body>\n</html>'

@@ -192,7 +200,7 @@ If you are unfamiliar with the DOM tree or the tree data structure in general, t
 
 If you are too lazy to search about it, here's a quick explanation to give you a good idea.<br/>
 In simple words, the `html` element is the root of the website's tree, as every page starts with an `html` element.<br/>
-This element will be directly above elements
+This element will be positioned directly above elements such as `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent". The element `body` is a "sibling" of the element `head` and vice versa.
 
 Accessing the parent of an element
 ```python

@@ -302,7 +310,7 @@ In the [Selector](#selector) class, all methods/properties that should return a
 
 Let's see what [Selectors](#selectors) class adds to the table with that out of the way.
 ### Properties
-Apart from the normal operations on Python lists
+Apart from the normal operations on Python lists, such as iteration and slicing, etc.
 
 You can do the following:
 

@@ -326,9 +334,9 @@ Execute CSS and XPath selectors directly on the [Selector](#selector) instances
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 ...]
 ```
-Run the `re` and `re_first` methods directly. They take the same arguments passed to the [Selector](#selector) class. I
+Run the `re` and `re_first` methods directly. They take the same arguments passed to the [Selector](#selector) class. I will still leave these methods to be explained in the [TextHandler](#texthandler) section below.
 
-However, in this class, the `re_first` behaves differently as it runs `re` on each [Selector](#selector) within and returns the first one with a result. The `re` method will return a [TextHandlers](#texthandlers) object as normal,
+However, in this class, the `re_first` behaves differently as it runs `re` on each [Selector](#selector) within and returns the first one with a result. The `re` method will return a [TextHandlers](#texthandlers) object as normal, which combines all the [TextHandler](#texthandler) instances into one [TextHandlers](#texthandlers) instance.
 ```python
 >>> page.css('.price_color').re(r'[\d\.]+')
 ['51.77',

@@ -381,15 +389,15 @@ Of course, TextHandler provides extra methods and properties that standard Pytho
 ### Usage
 First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a `TextHandler` again, so you can chain them as you want. If you find a method or property that returns a standard string instead of `TextHandler`, please open an issue, and we will override it as well.
 
-First, we start with the `re` and `re_first` methods. These are the same methods that exist in the
+First, we start with the `re` and `re_first` methods. These are the same methods that exist in the other classes ([Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers)), so they will accept the same arguments as well.
 
 - The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but as you probably figured out from the naming, it returns the first result only as a `TextHandler` instance.
 
 Also, it takes other helpful arguments, which are:
 
 - **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
-- **clean_match**: It's disabled by default. This
-- **case_sensitive**: It's enabled by default. As the name implies, disabling it will
+- **clean_match**: It's disabled by default. This causes the method to ignore all whitespace and consecutive spaces while matching.
+- **case_sensitive**: It's enabled by default. As the name implies, disabling it will cause the regex to ignore the case of letters while compiling.
 
 You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
 ```python

@@ -484,7 +492,7 @@ First, we start with the `re` and `re_first` methods. These are the same methods
 >>> page.json()
 {'some_key': 'some_value'}
 ```
-You might wonder how this happened
+You might wonder how this happened, given that the `html` tag doesn't contain direct text.<br/>
 Well, for cases like JSON responses, I made the [Selector](#selector) class maintain a raw copy of the content passed to it. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is not available like the case with the elements, it checks for the current element text content, or otherwise it used the `get_all_text` method directly.<br/><br/>This might sound hacky a bit but remember, Scrapling is currently optimized to work with HTML pages only so that's the best way till now to handle JSON responses currently without sacrificing speed. This will be changed in the upcoming versions.
 
 - Another handy method is `.clean()`, which will remove all white spaces and consecutive spaces for you and return a new `TextHandler` instance

@@ -509,10 +517,10 @@ Other methods and properties will be added over time, but remember that this cla
 ## TextHandlers
 You probably guessed it: This class is similar to [Selectors](#selectors) and [Selector](#selector), but here it inherits the same logic and method as standard lists, with only `re` and `re_first` as new methods.
 
-The only difference is that the `re_first` method logic here does `re` on each [TextHandler](#texthandler) within and returns the first result it has or `None`. Nothing
+The only difference is that the `re_first` method logic here does `re` on each [TextHandler](#texthandler) within and returns the first result it has or `None`. Nothing new needs to be explained here, but new methods will be added over time.
 
 ## AttributesHandler
-This is a read-only version of Python's standard dictionary or `dict` that
+This is a read-only version of Python's standard dictionary, or `dict`, that is used solely to store the attributes of each element or each [Selector](#selector) instance.
 ```python
 >>> print(page.find('script').attrib)
 {'id': 'page-data', 'type': 'application/json'}

@@ -525,7 +533,7 @@ It currently adds two extra simple methods:
 
 - The `search_values` method
 
-In standard dictionaries, you can do `dict.get("key_name")` to check if a key exists. However, if you want to search by values instead of keys, it will
+In standard dictionaries, you can do `dict.get("key_name")` to check if a key exists. However, if you want to search by values instead of keys, it will require some additional code lines. This method does that for you. It allows you to search the current attributes by values and returns a dictionary of each matching item.
 
 A simple example would be
 ```python

@@ -552,8 +560,9 @@ It currently adds two extra simple methods:
 
 - The `json_string` property
 
-
-
+This property converts current attributes to a JSON string if the attributes are JSON serializable; otherwise, it throws an error
+
+```python
 >>>page.find('script').attrib.json_string
-
-
+b'{"id":"page-data","type":"application/json"}'
+```
mkdocs.yml CHANGED

@@ -12,24 +12,9 @@ theme:
   logo: assets/logo.png
   favicon: assets/favicon.ico
   palette:
-
-
-
-        name: Switch to light mode
-    - media: "(prefers-color-scheme: light)"
-      scheme: default
-      primary: indigo
-      accent: indigo
-      toggle:
-        icon: material/toggle-switch
-        name: Switch to dark mode
-    - media: "(prefers-color-scheme: dark)"
-      scheme: slate
-      primary: black
-      accent: indigo
-      toggle:
-        icon: material/toggle-switch-off
-        name: Switch to system preference
+    scheme: slate
+    primary: black
+    accent: deep purple
   font:
     text: Open Sans
     code: JetBrains Mono

@@ -70,10 +55,10 @@ nav:
   - Main classes: parsing/main_classes.md
   - Adaptive scraping: parsing/adaptive.md
 - Fetching:
-  -
-  -
-  -
-  -
+  - Fetchers basics: fetching/choosing.md
+  - HTTP requests: fetching/static.md
+  - Dynamic websites: fetching/dynamic.md
+  - Dynamic websites with hard protections: fetching/stealthy.md
 - Command Line Interface:
   - Overview: cli/overview.md
   - Interactive shell: cli/interactive-shell.md

@@ -118,10 +103,10 @@ markdown_extensions:
 
 plugins:
 - search
-- social:
-    cards_layout_options:
-      background_color: "#1f1f1f"
-      font_family: Roboto
+# - social:
+#     cards_layout_options:
+#       background_color: "#1f1f1f"
+#       font_family: Roboto
 - mkdocstrings:
     handlers:
       python: