Karim Shoair committed
Commit 3dc1188 · 1 Parent(s): d8893a8

docs: Style updates and a lot of clarifications
docs/cli/extract-commands.md CHANGED
@@ -4,6 +4,14 @@
 
 The `scrapling extract` Command lets you download and extract content from websites directly from your terminal without writing any code. Ideal for beginners, researchers, and anyone requiring rapid web data extraction.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know which properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+> 4. You’ve completed or read at least one page from the fetchers section so you can make requests here: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md).
+
+
 ## What is the Extract Command group?
 
 The extract command is a set of simple terminal tools that:
docs/cli/interactive-shell.md CHANGED
@@ -6,6 +6,14 @@
 
 The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools like curl command conversion.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know which properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+> 4. You’ve completed or read at least one page from the fetchers section so you can make requests here: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md).
+
+
 ## Why use the Interactive Shell?
 
 The interactive shell transforms web scraping from a slow script-and-run cycle into a fast, exploratory experience. It's perfect for:
docs/fetching/dynamic.md CHANGED
@@ -4,6 +4,12 @@ Here, we will discuss the `DynamicFetcher` class (previously known as `PlayWrigh
 
 As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know which properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
 
@@ -275,7 +281,7 @@ async def scrape_multiple_sites():
     return pages
 ```
 
-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then:
+You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages that can be open at once. With each request, the library will close all tabs that have finished their task and check whether the number of current tabs is lower than the maximum allowed number of pages/tabs, then:
 
 1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
 2. Otherwise, it will keep checking every subsecond for 60 seconds whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
@@ -301,4 +307,4 @@ Use DynamicFetcher when:
 - Need custom browser config
 - Want flexible stealth options
 
-If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
+If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
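The tab-rotation bookkeeping described in the `max_pages` paragraph above (close finished tabs, open a new one if under the limit, otherwise poll sub-second for up to 60 seconds before raising `TimeoutError`) can be sketched in plain Python. All names here are hypothetical illustrations of the described logic, not Scrapling's actual implementation:

```python
import itertools
import time


class TabPool:
    """Conceptual sketch of a rotating pool of browser tabs."""

    def __init__(self, max_pages: int, timeout: float = 60.0):
        self.max_pages = max_pages            # maximum tabs open at once
        self.timeout = timeout                # how long to wait for a free slot
        self.open_tabs: list[str] = []
        self._finished: set[str] = set()
        self._ids = itertools.count()

    def mark_finished(self, tab: str) -> None:
        """Flag a tab as done so the next request can close it."""
        self._finished.add(tab)

    def acquire_tab(self, poll: float = 0.1) -> str:
        """Close finished tabs, then open a new one once under the limit."""
        deadline = time.monotonic() + self.timeout
        while True:
            # Close every tab that has finished its task.
            self.open_tabs = [t for t in self.open_tabs if t not in self._finished]
            self._finished.clear()
            if len(self.open_tabs) < self.max_pages:
                tab = f"tab-{next(self._ids)}"
                self.open_tabs.append(tab)
                return tab
            if time.monotonic() >= deadline:
                # No slot freed up in time -- e.g., the website is unresponsive.
                raise TimeoutError("no free tab slot within the allowed time")
            time.sleep(poll)  # keep checking at sub-second intervals
```

The key design point the docs describe is that cleanup happens lazily on each new request rather than in a background thread, which keeps the pool simple and lock-free for the common case.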
docs/fetching/static.md CHANGED
@@ -2,6 +2,12 @@
 
 The `Fetcher` class provides rapid and lightweight HTTP requests using the high-performance `curl_cffi` library with a lot of stealth capabilities.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know which properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
 
docs/fetching/stealthy.md CHANGED
@@ -4,6 +4,12 @@ Here, we will discuss the `StealthyFetcher` class. This class is similar to [Dyn
 
 As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
 
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know which properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
+
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
 
docs/index.md CHANGED
@@ -83,33 +83,10 @@ Scrapling’s GitHub stars have grown steadily since its release (see chart belo
 
 <div id="chartContainer">
   <a href="https://github.com/D4Vinci/Scrapling">
-    <img id="chartImage" alt="Star History Chart" loading="lazy" src="https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date" height="400"/>
+    <img id="chartImage" alt="Star History Chart" loading="lazy" src="https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=date&legend=top-left&theme=dark" height="400"/>
   </a>
 </div>
 
-<script>
-  const observer = new MutationObserver((mutations) => {
-    mutations.forEach((mutation) => {
-      if (mutation.attributeName === 'data-md-color-media') {
-        const colorMedia = document.body.getAttribute('data-md-color-media');
-        const isDarkScheme = document.body.getAttribute('data-md-color-scheme') === 'slate';
-        const chartImg = document.querySelector('#chartImage');
-        const baseUrl = 'https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date';
-
-        if (colorMedia === '(prefers-color-scheme)' ? isDarkScheme : colorMedia.includes('dark')) {
-          chartImg.src = `${baseUrl}&theme=dark`;
-        } else {
-          chartImg.src = baseUrl;
-        }
-      }
-    });
-  });
-
-  observer.observe(document.body, {
-    attributes: true,
-    attributeFilter: ['data-md-color-media', 'data-md-color-scheme']
-  });
-</script>
 
 ## Installation
 Scrapling requires Python 3.10 or higher:
docs/parsing/adaptive.md CHANGED
@@ -1,4 +1,11 @@
 ## Introduction
+
+> 💡 **Prerequisites:**
+>
+> 1. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector) object.
+> 2. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to understand the [Selector](../parsing/main_classes.md#selector) class.
+> <br><br>
+
 Adaptive scraping (previously known as automatch) is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.
 
 Let's say you are scraping a page with a structure like this:
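The "tracking and relocating elements" idea from the hunk above can be illustrated with a conceptual plain-Python sketch: save a fingerprint of the element you matched, and when the page changes, pick the candidate element most similar to it. This is only an illustration with hypothetical names; Scrapling's actual adaptive-matching algorithm is more sophisticated:

```python
def similarity(saved: dict, candidate: dict) -> float:
    """Score how closely a candidate element matches a saved fingerprint."""
    score = 0.0
    if candidate.get("tag") == saved.get("tag"):
        score += 1.0
    # Reward every attribute key/value pair the two elements share.
    shared = set(saved.get("attrs", {}).items()) & set(candidate.get("attrs", {}).items())
    score += len(shared)
    if saved.get("text") and saved["text"] == candidate.get("text"):
        score += 2.0
    return score


def relocate(saved: dict, candidates: list[dict]) -> dict:
    """Return the candidate most similar to the previously saved element."""
    return max(candidates, key=lambda c: similarity(saved, c))
```

So even if, say, a price element's class name changes between visits, its tag and text content can still identify the best match among the new page's elements.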
docs/parsing/main_classes.md CHANGED
@@ -1,7 +1,13 @@
 ## Introduction
-After exploring the various ways to select elements with Scrapling and related features, let's take a step back and examine the [Selector](#selector) class generally and other objects to better understand the parsing engine.
-
-The [Selector](#selector) class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities. You can always import it with any of the following imports
+
+> 💡 **Prerequisites:**
+>
+> - You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector) object.
+> <br><br>
+
+After exploring the various ways to select elements with Scrapling and its related features, let's take a step back and examine the [Selector](#selector) class in general, as well as other objects, to gain a better understanding of the parsing engine.
+
+The [Selector](#selector) class is the core parsing engine in Scrapling, providing HTML parsing and element selection capabilities. You can always import it with any of the following imports
 ```python
 from scrapling import Selector
 from scrapling.parser import Selector
@@ -133,7 +139,7 @@ Getting the attributes of the element
 >>> print(article.attrib)
 {'class': 'product', 'data-id': '1'}
 ```
-Access a specific attribute with any method of the following
+Access a specific attribute with any of the following
 ```python
 >>> article.attrib['class']
 >>> article.attrib.get('class')
@@ -151,14 +157,16 @@ Get the HTML content of the element
 ```
 Get the prettified version of the element's HTML content
 ```python
->>> print(article.prettify())
+print(article.prettify())
+```
+```html
 <article class="product" data-id="1"><h3>Product 1</h3>
   <p class="description">This is product 1</p>
   <span class="price">$10.99</span>
   <div class="hidden stock">In stock: 5</div>
 </article>
 ```
-Use `.body` property to get the raw content of page
+Use the `.body` property to get the raw content of the page
 ```python
 >>> page.body
 '<html>\n  <head>\n    <title>Some page</title>\n  </head>\n  <body>\n    <div class="product-list">\n      <article class="product" data-id="1">\n        <h3>Product 1</h3>\n        <p class="description">This is product 1</p>\n        <span class="price">$10.99</span>\n        <div class="hidden stock">In stock: 5</div>\n      </article>\n\n      <article class="product" data-id="2">\n        <h3>Product 2</h3>\n        <p class="description">This is product 2</p>\n        <span class="price">$20.99</span>\n        <div class="hidden stock">In stock: 3</div>\n      </article>\n\n      <article class="product" data-id="3">\n        <h3>Product 3</h3>\n        <p class="description">This is product 3</p>\n        <span class="price">$15.99</span>\n        <div class="hidden stock">Out of stock</div>\n      </article>\n    </div>\n\n    <script id="page-data" type="application/json">\n      {\n        "lastUpdated": "2024-09-22T10:30:00Z",\n        "totalProducts": 3\n      }\n    </script>\n  </body>\n</html>'
@@ -192,7 +200,7 @@ If you are unfamiliar with the DOM tree or the tree data structure in general, t
 
 If you are too lazy to search about it, here's a quick explanation to give you a good idea.<br/>
 In simple words, the `html` element is the root of the website's tree, as every page starts with an `html` element.<br/>
-This element will be directly above elements like `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent." The element `body` is a "sibling" of the element `head` and vice versa.
+This element will be positioned directly above elements such as `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent". The element `body` is a "sibling" of the element `head` and vice versa.
 
 Accessing the parent of an element
 ```python
@@ -302,7 +310,7 @@ In the [Selector](#selector) class, all methods/properties that should return a
 
 Let's see what the [Selectors](#selectors) class adds to the table with that out of the way.
 ### Properties
-Apart from the normal operations on Python lists like iteration, slicing, etc...
+Apart from the normal operations on Python lists, such as iteration and slicing,
 
 You can do the following:
 
@@ -326,9 +334,9 @@ Execute CSS and XPath selectors directly on the [Selector](#selector) instances
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 ...]
 ```
-Run the `re` and `re_first` methods directly. They take the same arguments passed to the [Selector](#selector) class. I'm still leaving these methods to be explained in the [TextHandler](#texthandler) section below.
+Run the `re` and `re_first` methods directly. They take the same arguments passed to the [Selector](#selector) class. I will still leave these methods to be explained in the [TextHandler](#texthandler) section below.
 
-However, in this class, the `re_first` behaves differently as it runs `re` on each [Selector](#selector) within and returns the first one with a result. The `re` method will return a [TextHandlers](#texthandlers) object as normal, that has all the [TextHandler](#texthandler) instances combined in one [TextHandlers](#texthandlers) instance.
+However, in this class, `re_first` behaves differently as it runs `re` on each [Selector](#selector) within and returns the first one with a result. The `re` method will return a [TextHandlers](#texthandlers) object as normal, which combines all the [TextHandler](#texthandler) instances into one [TextHandlers](#texthandlers) instance.
 ```python
 >>> page.css('.price_color').re(r'[\d\.]+')
 ['51.77',
@@ -381,15 +389,15 @@ Of course, TextHandler provides extra methods and properties that standard Pytho
 ### Usage
 First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a `TextHandler` again, so you can chain them as you want. If you find a method or property that returns a standard string instead of `TextHandler`, please open an issue, and we will override it as well.
 
-First, we start with the `re` and `re_first` methods. These are the same methods that exist in the rest of the classes ([Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers)), so they will take the same arguments as well.
+First, we start with the `re` and `re_first` methods. These are the same methods that exist in the other classes ([Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers)), so they will accept the same arguments as well.
 
 - The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but as you probably figured out from the naming, it returns the first result only as a `TextHandler` instance.
 
 Also, it takes other helpful arguments, which are:
 
 - **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
-- **clean_match**: It's disabled by default. This makes the method ignore all whitespaces and consecutive spaces while matching.
-- **case_sensitive**: It's enabled by default. As the name implies, disabling it will make the regex ignore the case of letters while compiling it.
+- **clean_match**: It's disabled by default. This causes the method to ignore all whitespace and consecutive spaces while matching.
+- **case_sensitive**: It's enabled by default. As the name implies, disabling it will cause the regex to ignore the case of letters while compiling.
 
 You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
 ```python
@@ -484,7 +492,7 @@ First, we start with the `re` and `re_first` methods. These are the same methods
 >>> page.json()
 {'some_key': 'some_value'}
 ```
-You might wonder how this happened while the `html` tag doesn't have direct text?<br/>
+You might wonder how this happened, given that the `html` tag doesn't contain direct text.<br/>
 Well, for cases like JSON responses, I made the [Selector](#selector) class maintain a raw copy of the content passed to it. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is not available, as is the case with elements, it checks the current element's text content, or otherwise it uses the `get_all_text` method directly.<br/><br/>This might sound a bit hacky, but remember that Scrapling is currently optimized to work with HTML pages only, so for now that's the best way to handle JSON responses without sacrificing speed. This will change in the upcoming versions.
 
 - Another handy method is `.clean()`, which will remove all white spaces and consecutive spaces for you and return a new `TextHandler` instance
@@ -509,10 +517,10 @@ Other methods and properties will be added over time, but remember that this cla
 ## TextHandlers
 You probably guessed it: This class is similar to [Selectors](#selectors) and [Selector](#selector), but here it inherits the same logic and methods as standard lists, with only `re` and `re_first` as new methods.
 
-The only difference is that the `re_first` method logic here does `re` on each [TextHandler](#texthandler) within and returns the first result it has or `None`. Nothing is new to explain here, but new methods will be added over time.
+The only difference is that the `re_first` method here runs `re` on each [TextHandler](#texthandler) within and returns the first result it gets, or `None`. Nothing new needs to be explained here, but new methods will be added over time.
 
 ## AttributesHandler
-This is a read-only version of Python's standard dictionary or `dict` that's only used to store the attributes of each element or each [Selector](#selector) instance, in other words.
+This is a read-only version of Python's standard dictionary, or `dict`, that is used solely to store the attributes of each element or, in other words, each [Selector](#selector) instance.
 ```python
 >>> print(page.find('script').attrib)
 {'id': 'page-data', 'type': 'application/json'}
@@ -525,7 +533,7 @@ It currently adds two extra simple methods:
 
 - The `search_values` method
 
-In standard dictionaries, you can do `dict.get("key_name")` to check if a key exists. However, if you want to search by values instead of keys, it will take you some code lines. This method does that for you. It allows you to search the current attributes by values and returns a dictionary of each matching item.
+In standard dictionaries, you can do `dict.get("key_name")` to check if a key exists. However, if you want to search by values instead of keys, it will require some additional lines of code. This method does that for you. It allows you to search the current attributes by values and returns a dictionary for each matching item.
 
 A simple example would be
 ```python
@@ -552,8 +560,9 @@ It currently adds two extra simple methods:
 
 - The `json_string` property
 
-This property converts current attributes to a JSON string if the attributes are JSON serializable; otherwise, it throws an error
-```python
+This property converts the current attributes to a JSON string if the attributes are JSON serializable; otherwise, it throws an error
+
+```python
 >>> page.find('script').attrib.json_string
-b'{"id":"page-data","type":"application/json"}'
-```
+b'{"id":"page-data","type":"application/json"}'
+```
mkdocs.yml CHANGED
@@ -12,24 +12,9 @@ theme:
   logo: assets/logo.png
   favicon: assets/favicon.ico
   palette:
-    - media: "(prefers-color-scheme)"
-      toggle:
-        icon: material/link
-        name: Switch to light mode
-    - media: "(prefers-color-scheme: light)"
-      scheme: default
-      primary: indigo
-      accent: indigo
-      toggle:
-        icon: material/toggle-switch
-        name: Switch to dark mode
-    - media: "(prefers-color-scheme: dark)"
-      scheme: slate
-      primary: black
-      accent: indigo
-      toggle:
-        icon: material/toggle-switch-off
-        name: Switch to system preference
+    scheme: slate
+    primary: black
+    accent: deep purple
   font:
     text: Open Sans
     code: JetBrains Mono
@@ -70,10 +55,10 @@ nav:
     - Main classes: parsing/main_classes.md
     - Adaptive scraping: parsing/adaptive.md
   - Fetching:
-    - Choosing a fetcher: fetching/choosing.md
-    - Static requests: fetching/static.md
-    - Dynamically loaded websites: fetching/dynamic.md
-    - Fully bypass protections while fetching: fetching/stealthy.md
+    - Fetchers basics: fetching/choosing.md
+    - HTTP requests: fetching/static.md
+    - Dynamic websites: fetching/dynamic.md
+    - Dynamic websites with hard protections: fetching/stealthy.md
   - Command Line Interface:
     - Overview: cli/overview.md
     - Interactive shell: cli/interactive-shell.md
@@ -118,10 +103,10 @@ markdown_extensions:
 
 plugins:
   - search
-  - social:
-      cards_layout_options:
-        background_color: "#1f1f1f"
-        font_family: Roboto
+  # - social:
+  #     cards_layout_options:
+  #       background_color: "#1f1f1f"
+  #       font_family: Roboto
   - mkdocstrings:
       handlers:
         python: