Karim Shoair committed on
Commit 2a85f06 · 1 Parent(s): 0832de7

First version of Scrapling full documentation

docs/Core/using scrapling custom types.md DELETED
@@ -1,21 +0,0 @@
1
- > You can take advantage from the custom-made types for Scrapling and use it outside the library if you want. It's better than copying their code after all :)
2
-
3
- ### All current types can be imported alone like below
4
- ```python
5
- >>> from scrapling.core.custom_types import TextHandler, AttributesHandler
6
-
7
- >>> somestring = TextHandler('{}')
8
- >>> somestring.json()
9
- '{}'
10
- >>> somedict_1 = AttributesHandler({'a': 1})
11
- >>> somedict_2 = AttributesHandler(a=1)
12
- ```
13
-
14
- Note `TextHandler` is a sub-class of Python's `str` so all normal operations/methods that work with Python strings will work.
15
- If you want to check for the type in your code, it's better to depend on Python built-in function `issubclass`.
16
-
17
- The class `AttributesHandler` is a sub-class of `collections.abc.Mapping` so it's immutable (read-only) and all operations are inherited from it. The data passed can be accessed later though the `._data` method but careful it's of type `types.MappingProxyType` so it's immutable (read-only) as well (faster than `collections.abc.Mapping` by fractions of seconds).
18
-
19
- So basically to make it simple to you if you are new to Python, the same operations and methods from Python standard `dict` type will all work with class `AttributesHandler` except the ones that try to modify the actual data.
20
-
21
- If you want to modify the data inside `AttributesHandler`, you have to convert it to dictionary first like with using the `dict` function and modify it outside.
docs/Examples/selectorless_stackoverflow.py DELETED
@@ -1,25 +0,0 @@
1
- """
2
- I only made this example to show how Scrapling features can be used to scrape a website without writing any selector
3
- so this script doesn't depend on the website structure.
4
- """
5
-
6
- import requests
7
-
8
- from scrapling import Adaptor
9
-
10
- response = requests.get('https://stackoverflow.com/questions/tagged/web-scraping?sort=MostVotes&filters=NoAcceptedAnswer&edited=true&pagesize=50&page=2')
11
- page = Adaptor(response.text, url=response.url)
12
- # First we will extract the first question title and its author based on the text content
13
- first_question_title = page.find_by_text('Run Selenium Python Script on Remote Server')
14
- first_question_author = page.find_by_text('Ryan')
15
- # because this page changes a lot
16
- if first_question_title and first_question_author:
17
- # If you want you can extract other questions tags like below
18
- first_question = first_question_title.find_ancestor(
19
- lambda ancestor: ancestor.attrib.get('id') and 'question-summary' in ancestor.attrib.get('id')
20
- )
21
- rest_of_questions = first_question.find_similar()
22
- # But since nothing to rely on to extract other titles/authors from these elements without CSS/XPath selectors due to the website nature
23
- # We will get all the rest of the titles/authors in the page depending on the first title and the first author we got above as a starting point
24
- for i, (title, author) in enumerate(zip(first_question_title.find_similar(), first_question_author.find_similar()), start=1):
25
- print(i, title.text, author.text)
docs/Extending Scrapling/writing storage system.md DELETED
@@ -1,17 +0,0 @@
1
- Scrapling by default is using SQLite but in case you want to write your storage system to store elements properties there for the auto-matching, this tutorial got you covered.
2
-
3
- You might want to use FireBase for example and share the database between multiple spiders on different machines, it's a great idea to use an online database like that because this way the spiders will share with each others.
4
-
5
- So first to make your storage class work, it must do the big 3:
6
- 1. Inherit from the abstract class `scrapling.storage_adaptors.StorageSystemMixin` and accept a string argument which will be the `url` argument to maintain the library logic.
7
- 2. Use the decorator `functools.lru_cache` on top of the class itself to follow the Singleton design pattern as other classes.
8
- 3. Implement methods `save` and `retrieve`, as you see from the type hints:
9
- - The method `save` returns nothing and will get two arguments from the library
10
- * The first one is of type `lxml.html.HtmlElement` which is the element itself, ofc. It must be converted to dictionary using the function `scrapling.utils._StorageTools.element_to_dict` so we keep the same format then saved to your database as you wish.
11
- * The second one is string which is the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row or the auto-match will be messed up.
12
- - The method `retrieve` takes a string which is the identifier, using it with the `url` passed on initialization the element's dictionary is retrieved from the database and returned if it exist otherwise it returns `None`
13
- > If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py) file
14
-
15
- If your class satisfy this, the rest is easy. If you are planning to use the library in a threaded application, make sure that your class supports it. The default used class is thread-safe.
16
-
17
- There are some helper functions added to the abstract class if you want to use it. It's easier to see it for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py), it's heavily commented :)
docs/api-reference/adaptor.md ADDED
@@ -0,0 +1,20 @@
1
+ # Adaptor Class
2
+
3
+ The `Adaptor` class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities.
4
+
5
+ Here's the reference information for the `Adaptor` class, with all its parameters, attributes, and methods.
6
+
7
+ You can import the `Adaptor` class directly from `scrapling.parser`:
8
+
9
+ ```python
10
+ from scrapling.parser import Adaptor
11
+ ```
12
+
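+ Below is a minimal usage sketch; the HTML snippet and selector are made up for illustration, and the full API is documented below:
+
+ ```python
+ from scrapling.parser import Adaptor
+
+ # Parse an HTML string directly and select with CSS
+ page = Adaptor('<html><body><h1>Hello</h1></body></html>')
+ title = page.css_first('h1::text')  # -> 'Hello'
+ ```
+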
13
+ ## ::: scrapling.parser.Adaptor
14
+ handler: python
15
+ :docstring:
16
+
17
+ ## ::: scrapling.parser.Adaptors
18
+ handler: python
19
+ :docstring:
20
+
docs/api-reference/custom-types.md ADDED
@@ -0,0 +1,21 @@
1
+ # Custom Types API Reference
2
+
3
+ Here's the reference information for all the custom-type classes Scrapling implements, with all their parameters, attributes, and methods.
4
+
5
+ You can import all of them directly like below:
6
+
7
+ ```python
8
+ from scrapling.core.custom_types import TextHandler, TextHandlers, AttributesHandler
9
+ ```
10
+
11
+ ## ::: scrapling.core.custom_types.TextHandler
12
+ handler: python
13
+ :docstring:
14
+
15
+ ## ::: scrapling.core.custom_types.TextHandlers
16
+ handler: python
17
+ :docstring:
18
+
19
+ ## ::: scrapling.core.custom_types.AttributesHandler
20
+ handler: python
21
+ :docstring:
docs/api-reference/fetchers.md ADDED
@@ -0,0 +1,25 @@
1
+ # Fetchers Classes
2
+
3
+ Here's the reference information for all fetcher-type classes' parameters, attributes, and methods.
4
+
5
+ You can import all of them directly like below:
6
+
7
+ ```python
8
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
9
+ ```
10
+
11
+ ## ::: scrapling.fetchers.Fetcher
12
+ handler: python
13
+ :docstring:
14
+
15
+ ## ::: scrapling.fetchers.AsyncFetcher
16
+ handler: python
17
+ :docstring:
18
+
19
+ ## ::: scrapling.fetchers.PlayWrightFetcher
20
+ handler: python
21
+ :docstring:
22
+
23
+ ## ::: scrapling.fetchers.StealthyFetcher
24
+ handler: python
25
+ :docstring:
docs/benchmarks.md ADDED
@@ -0,0 +1,44 @@
1
+ Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents.
2
+
3
+ Here are benchmarks comparing Scrapling's parsing speed to popular Python libraries in two tests.
4
+
5
+ ### Text Extraction Speed Test
6
+
7
+ This test consists of extracting the text content of 5000 nested div elements.
8
+
9
+ Here are the results comparing Scrapling to all well-known parsing libraries:
10
+
11
+
12
+ | # | Library | Time (ms) | vs Scrapling |
13
+ |---|:-----------------:|:---------:|:------------:|
14
+ | 1 | Scrapling | 5.44 | 1.0x |
15
+ | 2 | Parsel/Scrapy | 5.53 | 1.017x |
16
+ | 3 | Raw Lxml | 6.76 | 1.243x |
17
+ | 4 | PyQuery | 21.96 | 4.037x |
18
+ | 5 | Selectolax | 67.12 | 12.338x |
19
+ | 6 | BS4 with Lxml | 1307.03 | 240.263x |
20
+ | 7 | MechanicalSoup | 1322.64 | 243.132x |
21
+ | 8 | BS4 with html5lib | 3373.75 | 620.175x |
22
+
23
+ As you can see, Scrapling is on par with Scrapy and slightly faster than Lxml, the library both of them are built on top of; these are the closest results to Scrapling. PyQuery is also built on top of Lxml, but Scrapling is four times faster.
24
+
25
+ ### Extraction By Text Speed Test
26
+
27
+ Scrapling can find elements based on their text content and find elements similar to a given element. The only other known library with these two features is AutoScraper.
28
+
29
+ So, we ran this comparison to see how fast Scrapling is at these two tasks compared to AutoScraper.
30
+
31
+ Here are the results:
32
+
33
+ | Library | Time (ms) | vs Scrapling |
34
+ |-------------|:---------:|:------------:|
35
+ | Scrapling | 2.51 | 1.0x |
36
+ | AutoScraper | 11.41 | 4.546x |
37
+
38
+ Scrapling can find elements with more methods and returns the entire element as an `Adaptor` object, not only its text like AutoScraper does. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them.
39
+
40
+ As you see, Scrapling is still 4.5 times faster at the same task.
41
+
42
+ If we made Scrapling extract the elements only, without extracting each element's text, it would be about twice as fast as this, but as I said, that would make the comparison less fair :smile:
43
+
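+ For context, here is roughly what the Scrapling side of that task looks like. This is a rough sketch with illustrative values, not the exact benchmark code; see the linked `benchmarks.py` for the real methodology:
+
+ ```python
+ from scrapling import Adaptor
+
+ page = Adaptor(html_text)                          # html_text: the benchmark document's HTML
+ first = page.find_by_text('Some product name')     # locate one element by its text
+ texts = [el.text for el in first.find_similar()]   # find similar elements and grab their text
+ ```
+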
44
+ > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
docs/contributing.md ADDED
@@ -0,0 +1,102 @@
1
+ Thank you for your interest in contributing to Scrapling!
2
+
3
+ Everybody is invited and welcome to contribute to Scrapling.
4
+
5
+ Smaller changes have a better chance of getting included in a timely manner. Adding unit tests for new features or test cases for bugs you've fixed helps us to ensure that the Pull Request (PR) is acceptable.
6
+
7
+ There is a lot to do...
8
+
9
+ - If you are not a developer, you can help us improve the documentation.
10
+ - If you are a developer, most of the features I'm planning to add in the future are moved to [roadmap file](https://github.com/D4Vinci/Scrapling/blob/main/ROADMAP.md), so consider reading it.
11
+
12
+ ## Running tests
13
+ Scrapling includes a comprehensive test suite that can be executed with pytest, but first, you need to install all the libraries and pytest plugins listed in `tests/requirements.txt`. Then, running the tests will produce an output like this:
14
+ ```bash
15
+ $ pytest tests
16
+ =============================== test session starts ===============================
17
+ platform darwin -- Python 3.12.8, pytest-8.3.3, pluggy-1.5.0 -- /Users/<redacted>/.venv/bin/python3.12
18
+ cachedir: .pytest_cache
19
+ rootdir: /Users/<redacted>/scrapling
20
+ configfile: pytest.ini
21
+ plugins: cov-5.0.0, asyncio-0.25.0, base-url-2.1.0, httpbin-2.1.0, playwright-0.5.2, anyio-4.6.2.post1, xdist-3.6.1, typeguard-4.3.0
22
+ asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=function
23
+ collected 83 items
24
+
25
+ ...<shortened>...
26
+
27
+ =============================== 83 passed in 157.52s (0:02:37) =====================
28
+ ```
29
+ Tip: you can add `-n auto` to the command above to run the tests in parallel and speed them up.
30
+
31
+ Bonus: You can also see the test coverage with the pytest plugin below
32
+ ```bash
33
+ pytest --cov=scrapling tests/
34
+ ```
35
+
36
+ ## Installing the latest unstable version from the dev branch
37
+ ```bash
38
+ pip3 install git+https://github.com/D4Vinci/Scrapling.git@dev
39
+ ```
40
+
41
+ ## Development
42
+ Setting the scrapling logging level to `debug` makes it easier to know what's happening in the background.
43
+ ```python
44
+ >>> import logging
45
+ >>> logging.getLogger("scrapling").setLevel(logging.DEBUG)
46
+ ```
47
+ ### Code Style
48
+
49
+ We use:
50
+
51
+ 1. Type hints for better code clarity
52
+ 2. Flake8, bandit, isort, and other hooks through `pre-commit`. <br/>Please install the hooks before committing with:
53
+ ```bash
54
+ pip install pre-commit
55
+ pre-commit install
56
+ ```
57
+ It will run automatically on the code you push with each commit.
58
+ 3. Conventional commit message format. We use the prefixes below in commit messages:
59
+
60
+ | Prefix | When to use it |
61
+ |-------------|--------------------------|
62
+ | `feat:` | New feature added |
63
+ | `fix:` | Bug fix |
64
+ | `docs:` | Documentation change/add |
65
+ | `test:` | Tests |
66
+ | `refactor:` | Code refactoring |
67
+ | `chore:` | Maintenance tasks |
68
+
69
+ Example:
70
+ ```
71
+ feat: add auto-matching for similar elements
72
+
73
+ - Added find_similar() method
74
+ - Implemented pattern matching
75
+ - Added tests and documentation
76
+ ```
77
+
78
+ ### Push changes to the library
79
+
80
+ Then, the process is straightforward.
81
+
82
+ - Read [How to get faster PR reviews](https://github.com/kubernetes/community/blob/master/contributors/guide/pull-requests.md#best-practices-for-faster-reviews) by Kubernetes (but skip steps 0 and 1)
83
+ - Fork Scrapling [Git repository](https://github.com/D4Vinci/Scrapling.git).
84
+ - Make your changes, and don't forget to create a separate virtual environment for this project.
85
+ - Ensure all tests are passing.
86
+ - Create a Pull Request against the [**dev**](https://github.com/D4Vinci/Scrapling/tree/dev) branch of Scrapling.
87
+
88
+ A bonus: if you have more than one version of Python installed, you can use tox to run tests on each version with:
89
+ ```bash
90
+ pip install tox
91
+ tox
92
+ ```
93
+
94
+ > Note: All tests are automatically run on GitHub with each push on all supported Python versions using tox, so ensure all tests pass, or your PR will not be accepted.
95
+
96
+
97
+ ## Building Documentation
98
+ ```bash
99
+ pip install mkdocs-material
100
+ mkdocs serve # Local preview
101
+ mkdocs build # Build the static site
102
+ ```
docs/development/automatch_storage_system.md ADDED
@@ -0,0 +1,66 @@
1
+ Scrapling uses SQLite by default, but this tutorial covers writing your own storage system to store element properties for auto-matching.
2
+
3
+ You might want to use Firebase, for example, and share the database between multiple spiders on different machines. Using an online database like that is a great idea because the spiders can then share the matching data with each other.
4
+
5
+ So first, to make your storage class work, it must do the big 3:
6
+
7
+ 1. Inherit from the abstract class `scrapling.core.storage_adaptors.StorageSystemMixin` and accept a string argument, which will be the `url` argument to maintain the library logic.
8
+ 2. Use the decorator `functools.lru_cache` on top of the class to follow the Singleton design pattern as other classes.
9
+ 3. Implement methods `save` and `retrieve`, as you see from the type hints:
10
+ - The method `save` returns nothing and will get two arguments from the library
11
+ * The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the function `element_to_dict` from `scrapling.core.utils._StorageTools` to keep the same format, then saved to your database as you wish.
12
+ * The second one is a string, the identifier used for retrieval. The combination result of this identifier and the `url` argument from initialization must be unique for each row, or the auto-match will be messed up.
13
+ - The method `retrieve` takes a string, which is the identifier; using it with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists; otherwise, it returns `None`.
14
+
15
+ > If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage_adaptors.py) file
16
+
17
+ If your class meets these criteria, the rest is easy. If you plan to use the library in a threaded application, ensure your class supports it. The default used class is thread-safe.
18
+
19
+ Some helper functions are added to the abstract class if you want to use them. It's easier to see it for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage_adaptors.py); it's heavily commented :)
20
+
21
+
22
+ ## Real-World Example: Redis Storage
23
+
24
+ Here's a more practical example generated by AI using Redis:
25
+
26
+ ```python
27
+ import redis
28
+ import orjson
29
+ from functools import lru_cache
30
+ from scrapling.core.storage_adaptors import StorageSystemMixin
31
+ from scrapling.core.utils import _StorageTools
32
+
33
+ @lru_cache(None)
34
+ class RedisStorage(StorageSystemMixin):
35
+ def __init__(self, host='localhost', port=6379, db=0, url=None):
36
+ super().__init__(url)
37
+ self.redis = redis.Redis(
38
+ host=host,
39
+ port=port,
40
+ db=db,
41
+ decode_responses=False
42
+ )
43
+
44
+ def save(self, element, identifier: str) -> None:
45
+ # Convert element to dictionary
46
+ element_dict = _StorageTools.element_to_dict(element)
47
+
48
+ # Create key
49
+ key = f"scrapling:{self._get_base_url()}:{identifier}"
50
+
51
+ # Store as JSON
52
+ self.redis.set(
53
+ key,
54
+ orjson.dumps(element_dict)
55
+ )
56
+
57
+ def retrieve(self, identifier: str) -> dict:
58
+ # Get data
59
+ key = f"scrapling:{self._get_base_url()}:{identifier}"
60
+ data = self.redis.get(key)
61
+
62
+ # Parse JSON if exists
63
+ if data:
64
+ return orjson.loads(data)
65
+ return None
66
+ ```
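+
+ To actually use such a class, the parser accepts `storage` and `storage_args` arguments (the same parser options mentioned in the fetchers' configuration docs). A rough sketch, assuming `storage` takes the class itself and `storage_args` the keyword arguments it should be initialized with:
+
+ ```python
+ from scrapling import Adaptor
+
+ # html_text is a placeholder for the page source you already fetched
+ page = Adaptor(
+     html_text,
+     url='https://example.com',
+     auto_match=True,
+     storage=RedisStorage,                             # the class sketched above
+     storage_args={'host': 'localhost', 'port': 6379},
+ )
+ ```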
docs/development/scrapling_custom_types.md ADDED
@@ -0,0 +1,21 @@
1
+ > You can take advantage of the custom-made types for Scrapling and use them outside the library if you want. It's better than copying their code, after all :)
2
+
3
+ ### All current types can be imported alone like below
4
+ ```python
5
+ >>> from scrapling.core.custom_types import TextHandler, AttributesHandler
6
+
7
+ >>> somestring = TextHandler('{}')
8
+ >>> somestring.json()
9
+ '{}'
10
+ >>> somedict_1 = AttributesHandler({'a': 1})
11
+ >>> somedict_2 = AttributesHandler(a=1)
12
+ ```
13
+
14
+ Note that `TextHandler` is a subclass of Python's `str`, so all normal operations/methods that work with Python strings will work.
15
+ If you want to check for the type in your code, it's better to depend on Python's built-in function `issubclass`.
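+
+ A small sketch of that check:
+
+ ```python
+ >>> text = TextHandler('Hello world')
+ >>> issubclass(type(text), str)
+ True
+ >>> text.upper()  # normal str methods still work
+ 'HELLO WORLD'
+ ```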
16
+
17
+ The class `AttributesHandler` is a subclass of `collections.abc.Mapping`, so it's immutable (read-only), and all operations are inherited from it. The data passed can be accessed later through the `_data` property, but be careful; it's of type `types.MappingProxyType`, so it's immutable (read-only) as well (faster than `collections.abc.Mapping` by fractions of seconds).
18
+
19
+ So, to make it simple for you if you are new to Python, the same operations and methods from the Python standard `dict` type will all work with class `AttributesHandler` except the ones that try to modify the actual data.
20
+
21
+ If you want to modify the data inside `AttributesHandler`, you have to convert it to a dictionary first, e.g., with the `dict` function, and then modify it outside, as shown below.
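+
+ A minimal sketch of that conversion (the attribute values are made up for illustration):
+
+ ```python
+ >>> attrs = AttributesHandler({'class': 'product', 'id': 'item-1'})
+ >>> editable = dict(attrs)          # copy the read-only mapping into a normal dict
+ >>> editable['class'] = 'sold-out'
+ >>> editable
+ {'class': 'sold-out', 'id': 'item-1'}
+ ```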
docs/donate.md ADDED
@@ -0,0 +1,27 @@
1
+ I've been working on Scrapling and other public projects in my spare time and have invested considerable resources and effort to provide these projects for free to the community. By becoming a sponsor, you'd be directly funding my coffee reserves, helping me continuously update existing projects and potentially create new ones.
2
+
3
+ You can sponsor me directly through the [GitHub Sponsors program](https://github.com/sponsors/D4Vinci) or [Buy Me A Coffee](https://buymeacoffee.com/d4vinci). If you are a **company** looking to **advertise** your business through Scrapling or another project, check out the available plans on my [GitHub Sponsors page](https://github.com/sponsors/D4Vinci).
4
+
5
+ Below is the list of our Gold tier sponsors.
6
+
7
+ Thank you, stay curious, and hack the planet! ❤️
8
+
9
+ ---
10
+
11
+ ## Top Sponsors
12
+ ### Scrapeless
13
+
14
+ [Scrapeless Deep SerpApi](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=website&utm_medium=ads&utm_campaign=scraping&utm_term=d4vinci) From $0.10 per 1,000 queries with a 1-2 second response time!
15
+
16
+ [![Scrapeless Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg)](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci)
17
+
18
+ Deep SerpApi is a dedicated search engine designed for large language models (LLMs) and AI agents. It aims to provide real-time, accurate, and unbiased information to help AI applications retrieve and process data efficiently.
19
+
20
+ - Covers 20+ Google SERP scenarios and mainstream search engines.
21
+ - Supports real-time data updates to ensure accurate, up-to-date information.
22
+ - Integrates information from all available online channels and search engines.
23
+ - Deep SerpApi will simplify the process of integrating dynamic web information into AI solutions, and ultimately achieve an ALL-in-One API for one-click search and extraction of web data.
24
+ - **Developer Support Program**: Integrate Scrapeless Deep SerpApi into your AI tools, applications or projects. [We already support Dify, and will soon support frameworks such as Langchain, Langflow, FlowiseAI]. Then share your results on GitHub or social media, and you will get a 1-12 month free developer support opportunity, up to 500 free usage per month.
25
+ - 🚀 **Scraping API**: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
26
+ - ⚡ **Scraping Browser**: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
27
+ - 🌐 **Proxies**: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
docs/fetching/choosing.md ADDED
@@ -0,0 +1,77 @@
1
+ ## Introduction
2
+ Fetchers are classes that make requests or fetch pages for you easily in a single line, with many features, and then return a [Response](#response-object) object.
3
+
4
+ This feature was introduced because the only option before v0.2 was to fetch the page as you wanted, then pass it manually to the `Adaptor` class and start playing with it.
5
+
6
+ > Fetchers are not thin wrappers built on top of other libraries; they use these libraries as engines to make requests/fetch pages easily for you, while fully utilizing each engine and adding features that aren't included in it.
7
+
8
+ ## Fetchers Overview
9
+
10
+ Scrapling provides three different fetcher classes, each designed for specific use cases.
11
+
12
+ The following table compares them and can be quickly used for guidance.
13
+
14
+
15
+ | Feature | Fetcher | PlayWrightFetcher | StealthyFetcher |
16
+ |--------------------|----------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
17
+ | Relative speed | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
18
+ | Stealth | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
19
+ | Anti-Bot options | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
20
+ | JavaScript loading | ❌ | ✅ | ✅ |
21
+ | Memory Usage | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
22
+ | Best used for | Basic scraping | - Dynamically loaded websites <br/>- Small automation<br/>- Slight protections | - Dynamically loaded websites <br/>- Small automation <br/>- Complicated protections |
23
+ | Browser(s) | ❌ | Chromium and Google Chrome | Modified Firefox |
24
+ | Browser API used | ❌ | PlayWright | PlayWright |
25
+ | Setup Complexity | Simple | Simple | Simple |
26
+
27
+ In the following pages, we will talk about each one in detail.
28
+
29
+ ## Parser configuration in all fetchers
30
+ All fetcher classes share the same import style, as you will see in the upcoming pages:
31
+ ```python
32
+ >>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
33
+ ```
34
+ Then you can use it right away without initializing it, like this, and it will use the default parser settings:
35
+ ```python
36
+ >>> page = StealthyFetcher.fetch('https://example.com')
37
+ ```
38
+ If you want to configure the parser ([Adaptor class](../parsing/main_classes.md#adaptor)) that will be used on the response before returning it for you, then do this first:
39
+ ```python
40
+ >>> from scrapling.fetchers import Fetcher
41
+ >>> Fetcher.configure(auto_match=True, encoding="utf8", keep_comments=False, keep_cdata=False) # and the rest
42
+ ```
43
+ or
44
+ ```python
45
+ >>> from scrapling.fetchers import Fetcher
46
+ >>> Fetcher.auto_match=True
47
+ >>> Fetcher.encoding="utf8"
48
+ >>> Fetcher.keep_comments=False
49
+ >>> Fetcher.keep_cdata=False # and the rest
50
+ ```
51
+ Then, continue your code as usual.
52
+
53
+ The available configuration arguments are: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
54
+
55
+ > Note: The `auto_match` argument is disabled by default; you must enable it to use that feature.
56
+
57
+ ### Set parser config per request
58
+ As you probably understood, the logic above for setting the parser config will work globally for all requests/fetches done through that class, and it's intended for simplicity.
59
+
60
+ If your use case requires a different configuration for each request/fetch, you can pass a dictionary to an argument named `custom_config` on the request method (`fetch`/`get`/`post`/...), as shown below.
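+
+ A small sketch of that, assuming the keys are the same parser options listed above:
+
+ ```python
+ >>> from scrapling.fetchers import Fetcher
+ >>> # This single request keeps HTML comments in the parsed tree, whatever the global config says
+ >>> page = Fetcher.get('https://example.com', custom_config={'keep_comments': True})
+ ```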
61
+
62
+ ## Response Object
63
+ The `Response` object is the same as the [Adaptor](../parsing/main_classes.md#adaptor) class, but with added details about the response, like the response headers, status, cookies, etc., as shown below:
64
+ ```python
65
+ >>> from scrapling.fetchers import Fetcher
66
+ >>> page = Fetcher.get('https://example.com')
67
+
68
+ >>> page.status # HTTP status code
69
+ >>> page.reason # Status message
70
+ >>> page.cookies # Response cookies as a dictionary
71
+ >>> page.headers # Response headers
72
+ >>> page.request_headers # Request headers
73
+ >>> page.history # Response history of redirections, if any
74
+ >>> page.body # Raw response body
75
+ >>> page.encoding # Response encoding
76
+ ```
77
+ All fetchers return the `Response` object.
docs/fetching/dynamic.md ADDED
@@ -0,0 +1,248 @@
1
+ # Introduction
2
+
3
+ Here, we will discuss the `PlayWrightFetcher` class. This class provides flexible browser automation with multiple configuration options and some stealth capabilities. It uses [PlayWright](https://playwright.dev/python/docs/intro) as an engine for fetching websites.
4
+
5
+ As we will explain later, to automate the page, you need some knowledge of [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page).
6
+
7
+ ## Basic Usage
8
+ You have one primary way to import this Fetcher, which is the same for all fetchers.
9
+
10
+ ```python
11
+ >>> from scrapling.fetchers import PlayWrightFetcher
12
+ ```
13
+ Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
14
+
15
+ Now we will go over most of the arguments one by one with examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments).
16
+
17
+ > Notes:
18
+ >
19
+ > 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (waits for the `domcontentloaded` state).
20
+ > 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
21
+
22
+
23
+ This fetcher currently provides 4 main run options, but they can be mixed as you want.
24
+
25
+ Which are:
26
+
27
+ ### 1. Vanilla Playwright
28
+ ```python
29
+ PlayWrightFetcher.fetch('https://example.com')
30
+ ```
31
+ Using it like that will open a Chromium browser and fetch the page. There are no tricks or extra features; it's just a plain PlayWright API.
32
+
33
+ ### 2. Stealth Mode
34
+ ```python
35
+ PlayWrightFetcher.fetch('https://example.com', stealth=True)
36
+ ```
37
+ It's the same as the vanilla PlayWright option, but it provides a simple stealth mode suitable for websites with small-to-medium protection layers.
38
+
39
+ Some of the things this fetcher's stealth mode does include:
40
+
41
+ * Patching the CDP runtime fingerprint.
42
+ * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
43
+ * Custom flags are used on launch to hide Playwright even more and make it faster.
44
+ * Generates real browser headers matching the browser type and the user's OS, then appends them to the request's headers.
45
+
46
+ ### 3. Real Chrome
47
+ ```python
48
+ PlayWrightFetcher.fetch('https://example.com', real_chrome=True)
49
+ ```
50
+ If you have a Google Chrome browser installed, use this option. It's the same as the first option but will use the Google Chrome browser you installed on your device instead of Chromium.
51
+
52
+ This will make your requests look more like requests coming from an actual human, so it's less detectable, and you can even use the `stealth=True` mode with it for better results like below:
53
+ ```python
54
+ PlayWrightFetcher.fetch('https://example.com', real_chrome=True, stealth=True)
55
+ ```
56
+ If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
57
+ ```commandline
58
+ playwright install chrome
59
+ ```
60
+
61
+ ### 4. CDP Connection
62
+ ```python
63
+ PlayWrightFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
64
+ ```
65
+ Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
66
+
67
+ This fetcher takes it even a step further. You can use [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling the `nstbrowser_mode` option like below:
68
+ ```python
69
+ PlayWrightFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222', nstbrowser_mode=True)
70
+ ```
71
+ There's also an `nstbrowser_config` argument for the config you want to send with the requests to NSTBrowser. If you leave it empty, Scrapling defaults to an optimized config for NSTBrowser's docker browserless mode.
72
+
73
+ ## Full list of arguments
74
+ Scrapling provides many options with this fetcher, which works in all modes except the [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5) mode. To make it as simple as possible, we will list the options here and give examples of using most of them.
75
+
76
+ | Argument | Description | Optional |
77
+ |:-------------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
78
+ | url | Target url | ❌ |
79
+ | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
80
+ | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
81
+ | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser.** | ✔️ |
82
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
83
+ | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30000. | ✔️ |
84
+ | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
85
+ | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation, then returns `page` again. | ✔️ |
86
+ | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
87
+ | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
88
+ | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | ✔️ |
89
+ | extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
90
+ | proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
91
+ | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
92
+ | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
93
+ | stealth | Enables stealth mode; you should always check the documentation to see what stealth mode does currently. | ✔️ |
94
+ | real_chrome | If you have a Chrome browser installed on your device, enable this, and the fetcher will launch and use an instance of your installed browser instead of Chromium. | ✔️ |
95
+ | locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
96
+ | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
97
+ | nstbrowser_mode | Enables NSTBrowser mode; **it has to be used with the `cdp_url` argument or it will be completely ignored.** | ✔️ |
98
+ | nstbrowser_config | The config you want to send with requests to the NSTBrowser. _Scrapling defaults to an optimized NSTBrowser's docker browserless config if you leave this argument empty._ | ✔️ |
99
+
100
+
101
+ ## Examples
102
+ It's easier to understand with examples, so let's look at a few.
103
+
104
+ ### Resource Control
105
+
106
+ ```python
107
+ # Disable unnecessary resources
108
+ page = PlayWrightFetcher.fetch(
109
+ 'https://example.com',
110
+ disable_resources=True # Blocks fonts, images, media, etc...
111
+ )
112
+ ```
113
+
114
+ ### Network Control
115
+
116
+ ```python
117
+ # Wait for network idle (Consider fetch to be finished when there are no network connections for at least 500 ms)
118
+ page = PlayWrightFetcher.fetch('https://example.com', network_idle=True)
119
+
120
+ # Custom timeout (in milliseconds)
121
+ page = PlayWrightFetcher.fetch('https://example.com', timeout=30000) # 30 seconds
122
+
123
+ # Proxy support
124
+ page = PlayWrightFetcher.fetch(
125
+ 'https://example.com',
126
+ proxy='http://username:password@host:port' # Or it can be a dictionary with the keys 'server', 'username', and 'password' only
127
+ )
128
+ ```
129
+
130
+ ### Browser Automation
131
+ This is where your knowledge about [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, does what you want, and then returns it again for the current fetcher to continue working on it.
132
+
133
+ This function is executed right after waiting for network_idle (if enabled) and before waiting for the `wait_selector` argument, so it can be used for many things, not just automation. You can alter the page as you want.
134
+
135
+ In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
136
+ ```python
137
+ from playwright.sync_api import Page
138
+
139
+ def scroll_page(page: Page):
140
+ page.mouse.wheel(10, 0)
141
+ page.mouse.move(100, 400)
142
+ page.mouse.up()
143
+ return page
144
+
145
+ page = PlayWrightFetcher.fetch(
146
+ 'https://example.com',
147
+ page_action=scroll_page
148
+ )
149
+ ```
150
+ Of course, if you use the async fetch version, the function must also be async.
151
+ ```python
152
+ from playwright.async_api import Page
153
+
154
+ async def scroll_page(page: Page):
155
+ await page.mouse.wheel(10, 0)
156
+ await page.mouse.move(100, 400)
157
+ await page.mouse.up()
158
+ return page
159
+
160
+ page = await PlayWrightFetcher.async_fetch(
161
+ 'https://example.com',
162
+ page_action=scroll_page
163
+ )
164
+ ```
165
+
166
+ ### Wait Conditions
167
+
168
+ ```python
169
+ # Wait for the selector
170
+ page = PlayWrightFetcher.fetch(
171
+ 'https://example.com',
172
+ wait_selector='h1',
173
+ wait_selector_state='visible'
174
+ )
175
+ ```
176
+ This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
177
+
178
+ After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) and wait for them to be. If you have enabled `network_idle` with this, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
179
+
180
+ The states the fetcher can wait for can be either ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
181
+
182
+ - `attached`: Wait for an element to be present in DOM.
183
+ - `detached`: Wait for an element to not be present in DOM.
184
+ - `visible`: wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
185
+ - `hidden`: wait for an element to be either detached from DOM, or have an empty bounding box or `visibility:hidden`. This is opposite to the `'visible'` option.
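+
+ For example, here is a sketch of waiting for a loading indicator to disappear (the selector is illustrative):
+
+ ```python
+ # Consider the page ready once the element matching '.loading-spinner' is removed from the DOM
+ page = PlayWrightFetcher.fetch(
+     'https://example.com',
+     wait_selector='.loading-spinner',
+     wait_selector_state='detached'
+ )
+ ```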
186
+
187
+ ### Some Stealth Features
188
+
189
+ ```python
190
+ # Full stealth mode
191
+ page = PlayWrightFetcher.fetch(
192
+ 'https://example.com',
193
+ stealth=True,
194
+ hide_canvas=True,
195
+ disable_webgl=True,
196
+ google_search=True
197
+ )
198
+
199
+ # Custom user agent
200
+ page = PlayWrightFetcher.fetch(
201
+ 'https://example.com',
202
+ useragent='Mozilla/5.0...'
203
+ )
204
+
205
+ # Set browser locale
206
+ page = PlayWrightFetcher.fetch(
207
+ 'https://example.com',
208
+ locale='en-US'
209
+ )
210
+ ```
211
+ Note that the `hide_canvas` argument doesn't disable the canvas; it adds random noise to canvas operations to prevent fingerprinting. Also, if you don't set a useragent (preferred), the fetcher will generate a real useragent for the same browser and use it.
212
+
213
+ The `google_search` argument is enabled by default, making the request look like it came from Google. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
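+
+ For example, here is a sketch of disabling that behavior and setting your own referer through `extra_headers` (the referer value is illustrative):
+
+ ```python
+ page = PlayWrightFetcher.fetch(
+     'https://example.com',
+     google_search=False,
+     extra_headers={'Referer': 'https://duckduckgo.com/'}
+ )
+ ```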
214
+
215
+ ### General example
216
+ ```python
217
+ from scrapling.fetchers import PlayWrightFetcher
218
+
219
+ def scrape_dynamic_content():
220
+ # Use PlayWright for JavaScript content
221
+ page = PlayWrightFetcher.fetch(
222
+ 'https://example.com/dynamic',
223
+ network_idle=True,
224
+ wait_selector='.content'
225
+ )
226
+
227
+ # Extract dynamic content
228
+ content = page.css('.content')
229
+
230
+ return {
231
+ 'title': content.css_first('h1::text'),
232
+ 'items': [
233
+ item.text for item in content.css('.item')
234
+ ]
235
+ }
236
+ ```
237
+
238
+ ## When to Use
239
+
240
+ Use PlayWrightFetcher when:
241
+
242
+ - Need browser automation
243
+ - Want multiple browser options
244
+ - Using a real Chrome browser
245
+ - Need custom browser config
246
+ - Want flexible stealth options
247
+
248
+ If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
docs/fetching/static.md ADDED
@@ -0,0 +1,300 @@
1
+ # Introduction
2
+
3
+ The `Fetcher` class provides fast and lightweight HTTP requests with some stealth capabilities. This class uses [httpx](https://www.python-httpx.org/) as an engine for making requests. For advanced usage, you will need some knowledge of [httpx](https://www.python-httpx.org/), but it keeps getting simpler with user feedback and updates.
4
+
5
+ ## Basic Usage
6
+ You have one primary way to import this Fetcher, which is the same for all fetchers.
7
+
8
+ ```python
9
+ >>> from scrapling.fetchers import Fetcher
10
+ ```
11
+ Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
12
+
13
+ ### Shared arguments
14
+ All methods for making requests here share some arguments, so let's discuss them first.
15
+
16
+ - **url**: The URL you want to request, of course :)
17
+ - **proxy**: As the name implies, the proxy to use for this request; it routes all traffic (HTTP and HTTPS). The format accepted here is `http://username:password@localhost:8030`.
18
+ - **stealthy_headers**: Generates and uses real browser headers, then creates a referer header as if this request came from a Google search for this URL's domain. Enabled by default; any generated header can be overwritten through the `headers` argument.
19
+ - **follow_redirects**: As the name implies, tells the fetcher to follow redirections. Enabled by default.
20
+ - **timeout**: The time to wait for each request to finish, in milliseconds. The default is 30000 ms (30 seconds).
21
+ - **retries**: The number of retries that [httpx](https://www.python-httpx.org/) will do for failed requests. The default number of retries is 3.
22
+
23
+ Other than these, you can pass any arguments that `httpx.<method_name>` takes; that's why I said in the beginning that you need a bit of knowledge about [httpx](https://www.python-httpx.org/). The following examples try to cover most cases, and there's a small sketch of this pass-through right below.
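+
+ A small sketch of that pass-through (the cookie value is illustrative); `retries` is Scrapling's own argument, while `cookies` is handed to httpx as-is:
+
+ ```python
+ >>> from scrapling.fetchers import Fetcher
+ >>> page = Fetcher.get('https://httpbin.org/cookies', retries=5, cookies={'session': 'abc123'})
+ ```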
24
+
25
+ ### HTTP Methods
26
+ Examples are the best way to explain this
27
+
28
+ > Note: the `OPTIONS` and `HEAD` methods are not supported.
29
+ #### GET
30
+ ```python
31
+ >>> from scrapling.fetchers import Fetcher
32
+ >>> # Basic GET
33
+ >>> page = Fetcher.get('https://example.com')
34
+ >>> page = Fetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
35
+ >>> page = Fetcher.get('https://httpbin.org/get', proxy='http://username:password@localhost:8030')
36
+ >>> # With parameters
37
+ >>> page = Fetcher.get('https://example.com/search', params={'q': 'query'})
38
+ >>>
39
+ >>> # With headers
40
+ >>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
41
+ >>> # Basic HTTP authentication
42
+ >>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
43
+ ```
44
+ And for asynchronous requests, it's a small adjustment
45
+ ```python
46
+ >>> from scrapling.fetchers import AsyncFetcher
47
+ >>> # Basic GET
48
+ >>> page = await AsyncFetcher.get('https://example.com')
49
+ >>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
50
+ >>> page = await AsyncFetcher.get('https://httpbin.org/get', proxy='http://username:password@localhost:8030')
51
+ >>> # With parameters
52
+ >>> page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})
53
+ >>>
54
+ >>> # With headers
55
+ >>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
56
+ >>> # Basic HTTP authentication
57
+ >>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
58
+ ```
59
+ Needless to say, the `page` object in all cases is a [Response](choosing.md#response-object) object, which is an `Adaptor` as we said, so you can use it directly:
60
+ ```python
61
+ >>> page.css('.something.something')
62
+
63
+ >>> page = Fetcher.get('https://api.github.com/events')
64
+ >>> page.json()
65
+ [{'id': '<redacted>',
66
+ 'type': 'PushEvent',
67
+ 'actor': {'id': '<redacted>',
68
+ 'login': '<redacted>',
69
+ 'display_login': '<redacted>',
70
+ 'gravatar_id': '',
71
+ 'url': 'https://api.github.com/users/<redacted>',
72
+ 'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'},
73
+ 'repo': {'id': '<redacted>',
74
+ ...
75
+ ```
76
+ #### POST
77
+ ```python
78
+ >>> from scrapling.fetchers import Fetcher
79
+ >>> # Basic POST
80
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'})
81
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
82
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
83
+ >>> # Another example of form-encoded data
84
+ >>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
85
+ >>> # JSON data
86
+ >>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
87
+ >>> # Uploading file
88
+ >>> r = Fetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
89
+ ```
90
+ And for asynchronous requests, it's a small adjustment
91
+ ```python
92
+ >>> from scrapling.fetchers import AsyncFetcher
93
+ >>> # Basic POST
94
+ >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'})
95
+ >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
96
+ >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
97
+ >>> # Another example of form-encoded data
98
+ >>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
99
+ >>> # JSON data
100
+ >>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
101
+ >>> # Uploading file
102
+ >>> r = await AsyncFetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
103
+ ```
104
+ #### PUT
105
+ ```python
106
+ >>> from scrapling.fetchers import Fetcher
107
+ >>> # Basic PUT
108
+ >>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
109
+ >>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
110
+ >>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
111
+ >>> # Another example of form-encoded data
112
+ >>> page = Fetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
113
+ ```
114
+ And for asynchronous requests, it's a small adjustment
115
+ ```python
116
+ >>> from scrapling.fetchers import AsyncFetcher
117
+ >>> # Basic PUT
118
+ >>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
119
+ >>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
120
+ >>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
121
+ >>> # Another example of form-encoded data
122
+ >>> page = await AsyncFetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
123
+ ```
124
+
125
+ #### DELETE
126
+ ```python
127
+ >>> from scrapling.fetchers import Fetcher
128
+ >>> page = Fetcher.delete('https://example.com/resource/123')
129
+ >>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
130
+ >>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
131
+ ```
132
+ And for asynchronous requests, it's a small adjustment
133
+ ```python
134
+ >>> from scrapling.fetchers import AsyncFetcher
135
+ >>> page = await AsyncFetcher.delete('https://example.com/resource/123')
136
+ >>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
137
+ >>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
138
+ ```
139
+
140
+ ## Examples
141
+ Some well-rounded examples to aid newcomers to Web Scraping
142
+
143
+ ### Basic HTTP Request
144
+
145
+ ```python
146
+ from scrapling.fetchers import Fetcher
147
+
148
+ # Make a request
149
+ page = Fetcher.get('https://example.com')
150
+
151
+ # Check the status
152
+ if page.status == 200:
153
+ # Extract title
154
+ title = page.css_first('title::text')
155
+ print(f"Page title: {title}")
156
+
157
+ # Extract all links
158
+ links = page.css('a::attr(href)')
159
+ print(f"Found {len(links)} links")
160
+ ```
161
+
162
+ ### Product Scraping
163
+
164
+ ```python
165
+ from scrapling.fetchers import Fetcher
166
+
167
+ def scrape_products():
168
+ page = Fetcher.get('https://example.com/products')
169
+
170
+ # Find all product elements
171
+ products = page.css('.product')
172
+
173
+ results = []
174
+ for product in products:
175
+ results.append({
176
+ 'title': product.css_first('.title::text'),
177
+ 'price': product.css_first('.price::text').re_first(r'\d+\.\d{2}'),
178
+ 'description': product.css_first('.description::text'),
179
+ 'in_stock': product.has_class('in-stock')
180
+ })
181
+
182
+ return results
183
+ ```
184
+
185
+ ### Pagination Handling
186
+
187
+ ```python
188
+ from scrapling.fetchers import Fetcher
189
+
190
+ def scrape_all_pages():
191
+ base_url = 'https://example.com/products?page={}'
192
+ page_num = 1
193
+ all_products = []
194
+
195
+ while True:
196
+ # Get current page
197
+ page = Fetcher.get(base_url.format(page_num))
198
+
199
+ # Find products
200
+ products = page.css('.product')
201
+ if not products:
202
+ break
203
+
204
+ # Process products
205
+ for product in products:
206
+ all_products.append({
207
+ 'name': product.css_first('.name::text'),
208
+ 'price': product.css_first('.price::text')
209
+ })
210
+
211
+ # Next page
212
+ page_num += 1
213
+
214
+ return all_products
215
+ ```
216
+
217
+ ### Form Submission
218
+
219
+ ```python
220
+ from scrapling.fetchers import Fetcher
221
+
222
+ # Submit login form
223
+ response = Fetcher.post(
224
+ 'https://example.com/login',
225
+ data={
226
+ 'username': 'user@example.com',
227
+ 'password': 'password123'
228
+ }
229
+ )
230
+
231
+ # Check login success
232
+ if response.status == 200:
233
+ # Extract user info
234
+ user_name = response.css_first('.user-name::text')
235
+ print(f"Logged in as: {user_name}")
236
+ ```
237
+
238
+ ### Table Extraction
239
+
240
+ ```python
241
+ from scrapling.fetchers import Fetcher
242
+
243
+ def extract_table():
244
+ page = Fetcher.get('https://example.com/data')
245
+
246
+ # Find table
247
+ table = page.css_first('table')
248
+
249
+ # Extract headers
250
+ headers = [
251
+ th.text for th in table.css('thead th')
252
+ ]
253
+
254
+ # Extract rows
255
+ rows = []
256
+ for row in table.css('tbody tr'):
257
+ cells = [td.text for td in row.css('td')]
258
+ rows.append(dict(zip(headers, cells)))
259
+
260
+ return rows
261
+ ```
262
+
263
+ ### Navigation Menu
264
+
265
+ ```python
266
+ from scrapling.fetchers import Fetcher
267
+
268
+ def extract_menu():
269
+ page = Fetcher.get('https://example.com')
270
+
271
+ # Find navigation
272
+ nav = page.css_first('nav')
273
+
274
+ menu = {}
275
+ for item in nav.css('li'):
276
+ link = item.css_first('a')
277
+ if link:
278
+ menu[link.text] = {
279
+ 'url': link.attrib['href'],
280
+ 'has_submenu': bool(item.css('.submenu'))
281
+ }
282
+
283
+ return menu
284
+ ```
285
+
286
+ ## When to Use
287
+
288
+ Use `Fetcher` when:
289
+
290
+ - Need fast HTTP requests
291
+ - Want minimal overhead
292
+ - Don't need JavaScript
293
+ - Want simple configuration
294
+ - Need basic stealth features
295
+
296
+ Use other fetchers when:
297
+
298
+ - Need browser automation.
299
+ - Need advanced anti-bot/stealth.
300
+ - Need JavaScript support.
docs/fetching/stealthy.md ADDED
@@ -0,0 +1,218 @@
1
+ # Introduction
2
+
3
+ Here, we will discuss the `StealthyFetcher` class. This class is similar to [PlayWrightFetcher](dynamic.md#introduction) in many ways, like browser automation and using [PlayWright](https://playwright.dev/python/docs/intro) as an engine for fetching websites. The main difference is that this class provides advanced anti-bot bypass capabilities and uses a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), which is where most of the stealth comes from.
4
+
5
+ As with [PlayWrightFetcher](dynamic.md#introduction), you will need some knowledge about [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
6
+
7
+ ## Basic Usage
8
+ You have one primary way to import this Fetcher, which is the same for all fetchers.
9
+
10
+ ```python
11
+ >>> from scrapling.fetchers import StealthyFetcher
12
+ ```
13
+ Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
14
+
15
+ > Notes:
16
+ >
17
+ > 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (waits for the `domcontentloaded` state).
18
+ > 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
19
+
20
+ ## Full list of arguments
21
+ Before jumping to [examples](#examples), here's the full list of arguments
22
+
23
+
24
+ | Argument | Description | Optional |
25
+ |:--------------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
26
+ | url | Target url | ❌ |
27
+ | headless | Pass `True` to run the browser in headless/hidden (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | ✔️ |
28
+ | block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
29
+ | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
30
+ | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | ✔️ |
31
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
32
+ | block_webrtc | Blocks WebRTC entirely. | ✔️ |
33
+ | page_action | Added for automation. A function that takes the `page` object and does the automation you need, then returns `page` again. | ✔️ |
34
+ | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
35
+ | humanize | Humanize the cursor movement. The cursor movement takes either True or the MAX duration in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
36
+ | allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
37
+ | geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
38
+ | os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
39
+ | disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
40
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
41
+ | timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
42
+ | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
43
+ | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
44
+ | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
45
+ | proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
46
+ | additional_arguments | Arguments passed to Camoufox as additional settings that take higher priority than Scrapling's. | ✔️ |
47
+
48
+
49
+ ## Examples
50
+ It's easier to understand with examples, so let's go over most of the arguments individually.
51
+
52
+ ### Browser Modes
53
+
54
+ ```python
55
+ # Headless/hidden mode (default)
56
+ page = StealthyFetcher.fetch('https://example.com', headless=True)
57
+
58
+ # Virtual display mode (requires having `xvfb` installed)
59
+ page = StealthyFetcher.fetch('https://example.com', headless='virtual')
60
+
61
+ # Visible browser mode
62
+ page = StealthyFetcher.fetch('https://example.com', headless=False)
63
+ ```
64
+
65
+ ### Resource Control
66
+
67
+ ```python
68
+ # Block images
69
+ page = StealthyFetcher.fetch('https://example.com', block_images=True)
70
+
71
+ # Disable unnecessary resources
72
+ page = StealthyFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
73
+ ```
74
+
75
+ ### Additional stealth options
76
+
77
+ ```python
78
+ page = StealthyFetcher.fetch(
79
+ 'https://example.com',
80
+ block_webrtc=True, # Block WebRTC
81
+ allow_webgl=False, # Disable WebGL
82
+ humanize=True, # Make the mouse move as how a human would move it
83
+ geoip=True, # Use IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address...
84
+ os_randomize=True, # Randomize the OS fingerprints used. The default is matching the fingerprints with the current OS.
85
+ disable_ads=True, # Block ads with uBlock Origin addon (enabled by default)
86
+ google_search=True
87
+ )
88
+
89
+ # Custom user agent
90
+ page = StealthyFetcher.fetch(
91
+ 'https://example.com',
92
+ useragent='Mozilla/5.0...'
93
+ )
94
+
95
+ # Custom humanization duration
96
+ page = StealthyFetcher.fetch(
97
+ 'https://example.com',
98
+ humanize=1.5 # Max 1.5 seconds for cursor movement
99
+ )
100
+ ```
101
+
102
+ The `google_search` argument is enabled by default. It makes the request as if it came from Google, so for a request for `https://example.com`, it will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
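+
+ For example, combining both arguments could look like the sketch below (the extra header shown here is purely illustrative):
+ ```python
+ page = StealthyFetcher.fetch(
+     'https://example.com',
+     google_search=True,  # referer becomes https://www.google.com/search?q=example
+     extra_headers={'Accept-Language': 'en-US,en;q=0.9'},  # any referer set here would be overridden by google_search
+ )
+ ```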
103
+
104
+ ### Network Control
105
+
106
+ ```python
107
+ # Wait for network idle (Consider fetch to be finished when there are no network connections for at least 500 ms)
108
+ page = StealthyFetcher.fetch('https://example.com', network_idle=True)
109
+
110
+ # Custom timeout (in milliseconds)
111
+ page = StealthyFetcher.fetch('https://example.com', timeout=30000) # 30 seconds
112
+
113
+ # Proxy support
114
+ page = StealthyFetcher.fetch(
115
+ 'https://example.com',
116
+ proxy='http://username:password@host:port' # Or it can be a dictionary with the keys 'server', 'username', and 'password' only
117
+ )
118
+ ```
119
+
120
+ ### Browser Automation
121
+ This is where your knowledge about [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, does what you want, and then returns it again for the current fetcher to continue working on it.
122
+
123
+ This function is executed right after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, so it can be used for many things, not just automation. You can alter the page as you want.
124
+
125
+ In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
126
+ ```python
127
+ from playwright.sync_api import Page
128
+
129
+ def scroll_page(page: Page):
130
+ page.mouse.wheel(10, 0)
131
+ page.mouse.move(100, 400)
132
+ page.mouse.up()
133
+ return page
134
+
135
+ page = StealthyFetcher.fetch(
136
+ 'https://example.com',
137
+ page_action=scroll_page
138
+ )
139
+ ```
140
+ Of course, if you use the async fetch version, the function must also be async.
141
+ ```python
142
+ from playwright.async_api import Page
143
+
144
+ async def scroll_page(page: Page):
145
+ await page.mouse.wheel(10, 0)
146
+ await page.mouse.move(100, 400)
147
+ await page.mouse.up()
148
+ return page
149
+
150
+ page = await StealthyFetcher.async_fetch(
151
+ 'https://example.com',
152
+ page_action=scroll_page
153
+ )
154
+ ```
155
+
156
+ ### Wait Conditions
157
+ ```python
158
+ # Wait for the selector
159
+ page = StealthyFetcher.fetch(
160
+ 'https://example.com',
161
+ wait_selector='h1',
162
+ wait_selector_state='visible'
163
+ )
164
+ ```
165
+ This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
166
+
167
+ After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) and wait for them to be. If you have enabled `network_idle` with this, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
168
+
169
+ The states the fetcher can wait for can be either ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
170
+
171
+ - `attached`: wait for the element to be present in DOM.
172
+ - `detached`: wait for the element to not be present in DOM.
173
+ - `visible`: wait for the element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
174
+ - `hidden`: Wait for the element to be detached from DOM, have an empty bounding box, or have `visibility:hidden`. This is opposite to the `'visible'` option.
175
+
176
+ ### Firefox Addons
177
+
178
+ ```python
179
+ # Custom Firefox addons
180
+ page = StealthyFetcher.fetch(
181
+ 'https://example.com',
182
+ addons=['/path/to/addon1', '/path/to/addon2']
183
+ )
184
+ ```
185
+ The paths here must be paths of extracted addons, which will be installed automatically upon browser launch.
186
+
187
+ ### Real-world example (Amazon)
188
+ This is for educational purposes only. This example was generated by AI, which also shows how easy it is to work with Scrapling through AI.
189
+ ```python
190
+ def scrape_amazon_product(url):
191
+ # Use StealthyFetcher to bypass protection
192
+ page = StealthyFetcher.fetch(url)
193
+
194
+ # Extract product details
195
+ return {
196
+ 'title': page.css_first('#productTitle::text').clean(),
197
+ 'price': page.css_first('.a-price .a-offscreen::text'),
198
+ 'rating': page.css_first('[data-feature-name="averageCustomerReviews"] .a-popover-trigger .a-color-base::text'),
199
+ 'reviews_count': page.css('#acrCustomerReviewText::text').re_first(r'[\d,]+'),
200
+ 'features': [
201
+ li.clean() for li in page.css('#feature-bullets li span::text')
202
+ ],
203
+ 'availability': page.css_first('#availability').get_all_text(strip=True),
204
+ 'images': [
205
+ img.attrib['src'] for img in page.css('#altImages img')
206
+ ]
207
+ }
208
+ ```
209
+
210
+ ## When to Use
211
+
212
+ Use StealthyFetcher when:
213
+
214
+ - Bypassing anti-bot protection
215
+ - Need a reliable browser fingerprint
216
+ - Full JavaScript support needed
217
+ - Want automatic stealth features
218
+ - Need browser automation
docs/index.md CHANGED
@@ -1,2 +1,107 @@
1
- # This section is still under work but any help is highly appreciated
2
- ## I will try to make full detailed documentation with Sphinx ASAP.
1
+ # Scrapling
2
+
3
+ Scrapling is an undetectable, high-performance, intelligent web scraping library for Python 3 that makes web scraping easy!
4
+
5
+ Scrapling isn't only about making undetectable requests or fetching pages under the radar!
6
+
7
+ It has its own parser that adapts to website changes and provides many element selection/querying options other than traditional selectors, a powerful DOM traversal API, and many other features while significantly outperforming popular parsing alternatives.
8
+
9
+ Scrapling is built from the ground up by Web scraping experts for beginners and experts. The goal is to provide powerful features while maintaining simplicity and minimal boilerplate code.
10
+
11
+ ```python
12
+ >> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
13
+ >> StealthyFetcher.auto_match = True
14
+ # Fetch websites' source under the radar!
15
+ >> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
16
+ >> print(page.status)
17
+ 200
18
+ >> products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
19
+ >> # Later, if the website structure changes, pass `auto_match=True`
20
+ >> products = page.css('.product', auto_match=True) # and Scrapling still finds them!
21
+ ```
22
+ ## Key Features
23
+ ### Fetch websites as you prefer with async support
24
+ - **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class.
25
+ - **Dynamic Loading & Automation**: Fetch dynamic websites with the `PlayWrightFetcher` class through your real browser, Scrapling's stealth mode, Playwright's Chromium browser, or [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless!
26
+ - **Anti-bot Protections Bypass**: Easily bypass protections with the `StealthyFetcher` and `PlayWrightFetcher` classes.
27
+
28
+ ### Easy Scraping
29
+ - **Smart Element Tracking**: Relocate elements after website changes using an intelligent similarity system and integrated storage.
30
+ - **Flexible Selection**: CSS selectors, XPath selectors, filters-based search, text search, regex search, and more.
31
+ - **Find Similar Elements**: Automatically locate elements similar to the element you found!
32
+ - **Smart Content Scraping**: Extract data from multiple websites without specific selectors using Scrapling's powerful features.
33
+
34
+ ### High Performance
35
+ - **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
36
+ - **Memory Efficient**: Optimized data structures for minimal memory footprint.
37
+ - **Fast JSON serialization**: 10x faster than the standard library.
38
+
39
+ ### Developer Friendly
40
+ - **Powerful Navigation API**: Easy DOM traversal in all directions.
41
+ - **Rich Text Processing**: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that use less memory than standard dictionaries with added methods.
42
+ - **Auto Selectors Generation**: Generate robust short and full CSS/XPath selectors for any element.
43
+ - **Familiar API**: Similar to Scrapy/BeautifulSoup and the same CSS pseudo-elements used in Scrapy.
44
+ - **Type hints**: Complete type/doc-strings coverage for future-proofing and best autocompletion support.
45
+
46
+ ## Star History
47
+ Scrapling’s GitHub stars have grown steadily since its release (see chart below).
48
+
49
+ <div id="chartContainer">
50
+ <a href="https://github.com/D4Vinci/Scrapling">
51
+ <img id="chartImage" alt="Star History Chart" src="https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date" height="400"/>
52
+ </a>
53
+ </div>
54
+
55
+ <script>
56
+ const observer = new MutationObserver((mutations) => {
57
+ mutations.forEach((mutation) => {
58
+ if (mutation.attributeName === 'data-md-color-media') {
59
+ const colorMedia = document.body.getAttribute('data-md-color-media');
60
+ const isDarkScheme = document.body.getAttribute('data-md-color-scheme') === 'slate';
61
+ const chartImg = document.querySelector('#chartImage');
62
+ const baseUrl = 'https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date';
63
+
64
+ if (colorMedia === '(prefers-color-scheme)' ? isDarkScheme : colorMedia.includes('dark')) {
65
+ chartImg.src = `${baseUrl}&theme=dark`;
66
+ } else {
67
+ chartImg.src = baseUrl;
68
+ }
69
+ }
70
+ });
71
+ });
72
+
73
+ observer.observe(document.body, {
74
+ attributes: true,
75
+ attributeFilter: ['data-md-color-media', 'data-md-color-scheme']
76
+ });
77
+ </script>
78
+
79
+ ## Installation
80
+ Scrapling is a breeze to get started with!<br/>Starting from version 0.2.9, Scrapling requires at least Python 3.9.
81
+
82
+ Run this command to install it with Python's pip.
83
+ ```bash
84
+ pip3 install scrapling
85
+ ```
86
+ You are ready if you plan to use the parser only (the `Adaptor` class).
87
+
88
+ But if you are going to make requests or fetch pages with Scrapling, run this command to install the browser dependencies needed to use the fetchers:
89
+ ```bash
90
+ scrapling install
91
+ ```
92
+ If you have any installation issues, please open an [issue](https://github.com/D4Vinci/Scrapling/issues/new/choose).
93
+
94
+ ## How the documentation is organized
95
+ Scrapling has a lot of documentation, so we try to follow a guideline called the [Diátaxis documentation framework](https://diataxis.fr/).
96
+
97
+ ## Support
98
+
99
+ If you like Scrapling and want to support its development:
100
+
101
+ - ⭐ Star the [GitHub repository](https://github.com/D4Vinci/Scrapling)
102
+ - 💝 Consider [sponsoring the project or buying me a coffee](donate.md) :wink:
103
+ - 🐛 Report bugs and suggest features through [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues)
104
+
105
+ ## License
106
+
107
+ This project is licensed under the BSD-3 License. See the [LICENSE](https://github.com/D4Vinci/Scrapling/blob/main/LICENSE) file for details.
docs/overview.md ADDED
@@ -0,0 +1,328 @@
1
+ We will start by quickly reviewing the parsing capabilities. Then, we will fetch websites with custom browsers, make requests, and parse the response.
2
+
3
+ Here's an HTML document generated by ChatGPT we will be using as an example throughout this page:
4
+ ```html
5
+ <html>
6
+ <head>
7
+ <title>Complex Web Page</title>
8
+ <style>
9
+ .hidden { display: none; }
10
+ </style>
11
+ </head>
12
+ <body>
13
+ <header>
14
+ <nav>
15
+ <ul>
16
+ <li> <a href="#home">Home</a> </li>
17
+ <li> <a href="#about">About</a> </li>
18
+ <li> <a href="#contact">Contact</a> </li>
19
+ </ul>
20
+ </nav>
21
+ </header>
22
+ <main>
23
+ <section id="products" schema='{"jsonable": "data"}'>
24
+ <h2>Products</h2>
25
+ <div class="product-list">
26
+ <article class="product" data-id="1">
27
+ <h3>Product 1</h3>
28
+ <p class="description">This is product 1</p>
29
+ <span class="price">$10.99</span>
30
+ <div class="hidden stock">In stock: 5</div>
31
+ </article>
32
+
33
+ <article class="product" data-id="2">
34
+ <h3>Product 2</h3>
35
+ <p class="description">This is product 2</p>
36
+ <span class="price">$20.99</span>
37
+ <div class="hidden stock">In stock: 3</div>
38
+ </article>
39
+
40
+ <article class="product" data-id="3">
41
+ <h3>Product 3</h3>
42
+ <p class="description">This is product 3</p>
43
+ <span class="price">$15.99</span>
44
+ <div class="hidden stock">Out of stock</div>
45
+ </article>
46
+ </div>
47
+ </section>
48
+
49
+ <section id="reviews">
50
+ <h2>Customer Reviews</h2>
51
+ <div class="review-list">
52
+ <div class="review" data-rating="5">
53
+ <p class="review-text">Great product!</p>
54
+ <span class="reviewer">John Doe</span>
55
+ </div>
56
+ <div class="review" data-rating="4">
57
+ <p class="review-text">Good value for money.</p>
58
+ <span class="reviewer">Jane Smith</span>
59
+ </div>
60
+ </div>
61
+ </section>
62
+ </main>
63
+ <script id="page-data" type="application/json">
64
+ {
65
+ "lastUpdated": "2024-09-22T10:30:00Z",
66
+ "totalProducts": 3
67
+ }
68
+ </script>
69
+ </body>
70
+ </html>
71
+ ```
72
+ Start by loading the raw HTML above like this:
73
+ ```python
74
+ from scrapling.parser import Adaptor
75
+ page = Adaptor(html_doc)
76
+ page # <data='<html><head><title>Complex Web Page</tit...'>
77
+ ```
78
+ Get all text content on the page recursively
79
+ ```python
80
+ page.get_all_text(ignore_tags=('script', 'style'))
81
+ # 'Complex Web Page\nHome\nAbout\nContact\nProducts\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock\nCustomer Reviews\nGreat product!\nJohn Doe\nGood value for money.\nJane Smith'
82
+ ```
83
+
84
+ ## Finding elements
85
+ If there's an element you want to find on the page, you will! Your creativity level is the only limitation!
86
+
87
+ Finding the first HTML `section` element
88
+ ```python
89
+ section_element = page.find('section')
90
+ # <data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>
91
+ ```
92
+ Find all `section` elements
93
+ ```python
94
+ section_elements = page.find_all('section')
95
+ # [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>, <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
96
+ ```
97
+ Find all `section` elements whose `id` attribute value is `products`
98
+ ```python
99
+ section_elements = page.find_all('section', {'id':"products"})
100
+ # Same as
101
+ section_elements = page.find_all('section', id="products")
102
+ # [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
103
+ ```
104
+ Find all `section` elements whose `id` attribute value contains `product`
105
+ ```python
106
+ section_elements = page.find_all('section', {'id*':"product"})
107
+ ```
108
+ Find all `h3` elements whose text content matches this regex `Product \d`
109
+ ```python
110
+ page.find_all('h3', re.compile(r'Product \d'))
111
+ # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
112
+ ```
113
+ Find all `h3` and `h2` elements whose text content matches regex `Product` only
114
+ ```python
115
+ page.find_all(['h3', 'h2'], re.compile(r'Product'))
116
+ # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
117
+ ```
118
+ Find all elements whose text content matches exactly `Products` (whitespace is not taken into consideration)
119
+ ```python
120
+ page.find_by_text('Products', first_match=False)
121
+ # [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
122
+ ```
123
+ Or find all elements whose text content matches regex `Product \d`
124
+ ```python
125
+ page.find_by_regex(r'Product \d', first_match=False)
126
+ # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
127
+ ```
128
+ Find all elements that are similar to the element you want
129
+ ```python
130
+ target_element = page.find_by_regex(r'Product \d', first_match=True)
131
+ # <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>
132
+ target_element.find_similar()
133
+ # [<data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
134
+ ```
135
+ Find the first element that matches a CSS selector
136
+ ```python
137
+ page.css_first('.product-list [data-id="1"]')
138
+ # <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
139
+ ```
140
+ Find all elements that match a CSS selector
141
+ ```python
142
+ page.css('.product-list article')
143
+ # [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
144
+ ```
145
+ Find the first element that matches an XPath selector
146
+ ```python
147
+ page.xpath_first("//*[@id='products']/div/article")
148
+ # <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
149
+ ```
150
+ Find all elements that match an XPath selector
151
+ ```python
152
+ page.xpath("//*[@id='products']/div/article")
153
+ # [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
154
+ ```
155
+
156
+ With this, we just scratched the surface of these functions; more advanced options with these selection methods are shown later.
157
+ ## Accessing elements' data
158
+ It's as simple as
159
+ ```python
160
+ >>> section_element.tag
161
+ 'section'
162
+ >>> print(section_element.attrib)
163
+ {'id': 'products', 'schema': '{"jsonable": "data"}'}
164
+ >>> section_element.attrib['schema'].json() # If an attribute value can be converted to json, then use `.json()` to convert it
165
+ {'jsonable': 'data'}
166
+ >>> section_element.text # Direct text content
167
+ ''
168
+ >>> section_element.get_all_text() # All text content recursively
169
+ 'Products\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
170
+ >>> section_element.html_content # The HTML content of the element
171
+ '<section id="products" schema=\'{"jsonable": "data"}\'><h2>Products</h2>\n <div class="product-list">\n <article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article><article class="product" data-id="2"><h3>Product 2</h3>\n <p class="description">This is product 2</p>\n <span class="price">$20.99</span>\n <div class="hidden stock">In stock: 3</div>\n </article><article class="product" data-id="3"><h3>Product 3</h3>\n <p class="description">This is product 3</p>\n <span class="price">$15.99</span>\n <div class="hidden stock">Out of stock</div>\n </article></div>\n </section>'
172
+ >>> print(section_element.prettify()) # The prettified version
173
+ '''
174
+ <section id="products" schema='{"jsonable": "data"}'><h2>Products</h2>
175
+ <div class="product-list">
176
+ <article class="product" data-id="1"><h3>Product 1</h3>
177
+ <p class="description">This is product 1</p>
178
+ <span class="price">$10.99</span>
179
+ <div class="hidden stock">In stock: 5</div>
180
+ </article><article class="product" data-id="2"><h3>Product 2</h3>
181
+ <p class="description">This is product 2</p>
182
+ <span class="price">$20.99</span>
183
+ <div class="hidden stock">In stock: 3</div>
184
+ </article><article class="product" data-id="3"><h3>Product 3</h3>
185
+ <p class="description">This is product 3</p>
186
+ <span class="price">$15.99</span>
187
+ <div class="hidden stock">Out of stock</div>
188
+ </article>
189
+ </div>
190
+ </section>
191
+ '''
192
+ >>> section_element.path # All the ancestors in the DOM tree of this element
193
+ [<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>,
194
+ <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>,
195
+ <data='<html><head><title>Complex Web Page</tit...'>]
196
+ >>> section_element.generate_css_selector
197
+ '#products'
198
+ >>> section_element.generate_full_css_selector
199
+ 'body > main > #products > #products'
200
+ >>> section_element.generate_xpath_selector
201
+ "//*[@id='products']"
202
+ >>> section_element.generate_full_xpath_selector
203
+ "//body/main/*[@id='products']"
204
+ ```
205
+
206
+ ## Navigation
207
+ Using the elements we found above
208
+
209
+ ```python
210
+ >>> section_element.parent
211
+ <data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>
212
+ >>> section_element.parent.tag
213
+ 'main'
214
+ >>> section_element.parent.parent.tag
215
+ 'body'
216
+ >>> section_element.children
217
+ [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>,
218
+ <data='<div class="product-list"> <article clas...' parent='<section id="products" schema='{"jsonabl...'>]
219
+ >>> section_element.siblings
220
+ [<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
221
+ >>> section_element.next # gets the next element; the same logic applies to `section_element.previous`
222
+ <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>
223
+ >>> section_element.children.css('h2::text')
224
+ ['Products']
225
+ >>> page.css_first('[data-id="1"]').has_class('product')
226
+ True
227
+ ```
228
+ If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element like the one below
229
+ ```python
230
+ for ancestor in section_element.iterancestors():
231
+     print(ancestor.tag)  # do something with it...
232
+ ```
233
+ You can search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes an `Adaptor` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
234
+ ```python
235
+ >>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
236
+ <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
237
+ ```
238
+
239
+ ## Fetching websites
240
+ Instead of passing the raw HTML to Scrapling, you can get a website's response directly through HTTP requests or by fetching it from browsers.
241
+
242
+ A fetcher is made for every use case.
243
+
244
+ ### HTTP Requests
245
+ For simple HTTP requests, there's a `Fetcher` class that can be imported as below:
246
+ ```python
247
+ from scrapling.fetchers import Fetcher
248
+ ```
249
+ But that's a class, so you would need to create an instance of the Fetcher first, like this:
250
+ ```python
251
+ from scrapling.fetchers import Fetcher
252
+ fetcher = Fetcher()
253
+ page = fetcher.get('https://httpbin.org/get')
254
+ ```
255
+ This is intentional, and you will find it with all fetchers because there are settings you can pass to the `Fetcher()` initialization, but more on this later.
256
+
257
+ If you are going to use the default settings anyway, you can do this instead for a cleaner approach:
258
+ ```python
259
+ from scrapling.fetchers import Fetcher
260
+ page = Fetcher.get('https://httpbin.org/get')
261
+ ```
262
+ With that out of the way, here's how to do all HTTP methods:
263
+ ```python
264
+ >>> from scrapling.fetchers import Fetcher
265
+ >>> page = Fetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
266
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
267
+ >>> page = Fetcher.put('https://httpbin.org/put', data={'key': 'value'})
268
+ >>> page = Fetcher.delete('https://httpbin.org/delete')
269
+ ```
270
+ For Async requests, you will just replace the import like below:
271
+ ```python
272
+ >>> from scrapling.fetchers import AsyncFetcher
273
+ >>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
274
+ >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
275
+ >>> page = await AsyncFetcher.put('https://httpbin.org/put', data={'key': 'value'})
276
+ >>> page = await AsyncFetcher.delete('https://httpbin.org/delete')
277
+ ```
278
+
279
+ > Note: You have the `stealthy_headers` argument, which, when enabled, generates real browser headers for the request and uses them, including a referer header that makes the request look as if it came from a Google search for this URL's domain. It's enabled by default.
280
+
281
+ This is just the tip of the iceberg for this fetcher; check out the full page [here](fetching/static.md)
282
+
283
+ ### Dynamic loading
284
+ We have you covered if you deal with dynamic websites like most today!
285
+
286
+ The `PlayWrightFetcher` class provides many options to fetch/load websites' pages through browsers.
287
+ ```python
288
+ >>> from scrapling.fetchers import PlayWrightFetcher
289
+ >>> page = PlayWrightFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
290
+ >>> page.css_first("#search a::attr(href)")
291
+ 'https://github.com/D4Vinci/Scrapling'
292
+ >>> # The async version of fetch
293
+ >>> page = await PlayWrightFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
294
+ >>> page.css_first("#search a::attr(href)")
295
+ 'https://github.com/D4Vinci/Scrapling'
296
+ ```
297
+ It's named like that because it's built on top of [Playwright](https://playwright.dev/python/), and it currently provides 4 main run options that can be mixed as you want:
298
+
299
+ - Vanilla Playwright without any modifications other than the ones you chose.
300
+ - Stealthy Playwright with custom stealth mode explicitly written for it. It's not top-tier stealth mode but bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for more advanced stealth mode.
301
+ - Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher, and most of the options can be enabled on it.
302
+ - [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
303
+
304
+ > Note: By default, all requests made by this fetcher wait for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it also wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.
305
+
306
+ Again, this is just the tip of the iceberg for this fetcher. Check out the full page [here](fetching/dynamic.md) for all details and the complete list of arguments.
307
+
308
+ ### Dynamic anti-protection loading
309
+ We also have you covered if you deal with dynamic websites with annoying anti-protections!
310
+
311
+ The `StealthyFetcher` class uses a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of configuration on top of it to further increase performance and undetectability.
312
+ ```python
313
+ >>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
314
+ >>> page.status == 200
315
+ True
316
+ >>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True) # and the rest of arguments...
317
+ >>> # The async version of fetch
318
+ >>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')
319
+ >>> page.status == 200
320
+ True
321
+ ```
322
+ > Note: By default, all requests made by this fetcher wait for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it also wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.
323
+
324
+ Again, this is just the tip of the iceberg for this fetcher. Check out the full page [here](fetching/stealthy.md) for all details and the complete list of arguments.
325
+
326
+ ---
327
+
328
+ That's Scrapling at a glance. If you want to learn more about it, continue to the next section.
docs/parsing/automatch.md ADDED
@@ -0,0 +1,220 @@
1
+ ## Introduction
2
+ Auto-matching is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.
3
+
4
+ Let's say you are scraping a page with a structure like this:
5
+ ```html
6
+ <div class="container">
7
+ <section class="products">
8
+ <article class="product" id="p1">
9
+ <h3>Product 1</h3>
10
+ <p class="description">Description 1</p>
11
+ </article>
12
+ <article class="product" id="p2">
13
+ <h3>Product 2</h3>
14
+ <p class="description">Description 2</p>
15
+ </article>
16
+ </section>
17
+ </div>
18
+ ```
19
+ And you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this
20
+ ```python
21
+ page.css('#p1')
22
+ ```
23
+ When website owners implement structural changes like
24
+ ```html
25
+ <div class="new-container">
26
+ <div class="product-wrapper">
27
+ <section class="products">
28
+ <article class="product new-class" data-id="p1">
29
+ <div class="product-info">
30
+ <h3>Product 1</h3>
31
+ <p class="new-description">Description 1</p>
32
+ </div>
33
+ </article>
34
+ <article class="product new-class" data-id="p2">
35
+ <div class="product-info">
36
+ <h3>Product 2</h3>
37
+ <p class="new-description">Description 2</p>
38
+ </div>
39
+ </article>
40
+ </section>
41
+ </div>
42
+ </div>
43
+ ```
44
+ The selector will no longer function, and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
45
+
46
+ With Scrapling, you can enable the `automatch` feature the first time you select an element. The next time you select that element and it doesn't exist, Scrapling remembers its properties and searches the website for the element with the highest similarity to it, all without AI :)
47
+
48
+ ```python
49
+ from scrapling import Adaptor, Fetcher
50
+ # Before the change
51
+ page = Adaptor(page_source, auto_match=True, url='example.com')
52
+ # or
53
+ Fetcher.auto_match = True
54
+ page = Fetcher.get('https://example.com')
55
+ # then
56
+ element = page.css('#p1', auto_save=True)
57
+ if not element: # One day website changes?
58
+ element = page.css('#p1', auto_match=True) # Scrapling still finds it!
59
+ # the rest of your code...
60
+ ```
61
+ Below, I will show you one usage example for this feature. Then, we will dive deep into how to use it and provide details about this feature.
62
+
63
+ ## Real-World Scenario
64
+ Let's use a real website as an example and use one of the fetchers to fetch its source. To do this, we need to find a website that will soon change its design/structure, take a copy of its source, and then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner, but that will make it a staged test, haha.
65
+
66
+ To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverflow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/); pretty old, eh?<br/>Let's test if the automatch feature can extract the same button in the old design from 2010 and the current design using the same selector :)
67
+
68
+ If I want to extract the Questions button from the old design, I can use a selector like this: `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is too specific because it was generated by Google Chrome.
69
+
70
+
71
+ Now, let's test the same selector in both versions
72
+ ```python
73
+ >> from scrapling import Fetcher
74
+ >> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
75
+ >> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
76
+ >> new_url = "https://stackoverflow.com/"
77
+ >> Fetcher.configure(auto_match = True, automatch_domain='stackoverflow.com')
78
+ >>
79
+ >> page = Fetcher.get(old_url, timeout=30)
80
+ >> element1 = page.css_first(selector, auto_save=True)
81
+ >>
82
+ >> # Same selector but used in the updated website
83
+ >> page = Fetcher.get(new_url)
84
+ >> element2 = page.css_first(selector, auto_match=True)
85
+ >>
86
+ >> if element1.text == element2.text:
87
+ ... print('Scrapling found the same element in the old and new designs!')
88
+ 'Scrapling found the same element in the old and new designs!'
89
+ ```
90
+ Note that I used a new argument called `automatch_domain`; this is because, for Scrapling, these are two different domains (`archive.org` and `stackoverflow.com`), so Scrapling will isolate their `auto_match` data. To tell Scrapling they are the same website, we need to pass the custom domain we want to use while saving auto-match data for them both so Scrapling doesn't isolate them.
91
+
92
+ The code will be the same in a real-world scenario, except it will use the same URL for both requests, so you won't need to use the `automatch_domain` argument. This is the closest example I can give to real-world cases, so I hope it didn't confuse you :)
93
+
94
+ Note that in the two examples above, I used both the `Adaptor` class and the `Fetcher` class to show you that the automatch logic is the same.
95
+
96
+ ## How the automatch feature works
97
+ Auto-matching works in two phases:
98
+
99
+ 1. **Save Phase**: Store unique properties of elements
100
+ 2. **Match Phase**: Find elements with similar properties later
101
+
102
+ Let's say you have an element you got through selection or any method and want the library to find it the next time you scrape this website, even if it had structural/design changes.
103
+
104
+ With as few technical details as possible, the general logic goes as follows:
105
+
106
+ 1. You tell Scrapling to save that element's unique properties in one of the ways we will show below.
107
+ 2. Scrapling uses its configured database (SQLite by default) and saves each element's unique properties.
108
+ 3. Now, because everything about the element can be changed or removed by the website's owner(s), nothing from the element can be used as a unique identifier for the database. To solve this issue, I made the storage system rely on two things:
109
+ 1. The domain of the current website. If you are using the `Adaptor` class, you should pass it while initializing the class, or if you are using one of the fetchers, the domain will be taken from the URL automatically.
110
+ 2. An `identifier` to query that element's properties from the database. You don't always have to set the identifier yourself, as you will see later when we discuss this.
111
+
112
+ Together, they will be used to retrieve the element's unique properties from the database later.
113
+
114
+ 4. Later, when the website's structure changes, you tell Scrapling to automatch the element. Scrapling retrieves the element's unique properties and matches all elements on the page against the unique properties we already have for this element. A score is calculated for their similarity to the element we want. In that comparison, everything is taken into consideration, as you will see later.
115
+ 5. The element(s) with the highest similarity score to the wanted element are returned.
116
+
117
+ ### The unique properties
118
+ You might wonder: if all aspects of an element can be removed or changed, what unique properties are we talking about?
119
+
120
+ For Scrapling, the unique elements we are relying on are:
121
+
122
+ - Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
123
+ - Element's parent tag name, attributes (names and values), and text.
124
+
125
+ But you need to understand that the comparison between elements is not exact; it's more about finding how similar these values are. So everything is considered, even the values' order, like the order in which the element class names were written before and the order in which the same element class names are written now.
126
+
127
+ ## How to use automatch feature
128
+ The automatch feature can be used on any element you have, and it's added as arguments to CSS/XPath Selection methods, as you saw above, but we will get back to that later.
129
+
130
+ First, you must enable the automatch feature by passing `auto_match=True` to the [Adaptor](main_classes.md#adaptor) class when you initialize it, or enable it in the fetcher you are using, as we will show.
131
+
132
+ Examples:
133
+ ```python
134
+ >>> from scrapling import Adaptor, Fetcher
135
+ >>> page = Adaptor(html_doc, auto_match=True)
136
+ # OR
137
+ >>> Fetcher.auto_match = True
138
+ >>> page = Fetcher.get('https://example.com')
139
+ ```
140
+ If you are using the [Adaptor](main_classes.md#adaptor) class, you need to pass the url of the website you are using with the argument `url` so Scrapling can separate the properties saved for each element by domain.
141
+
142
+ If you didn't pass a URL, the word `default` will be used in place of the URL field while saving the element's unique properties. So, this will only be an issue if you used the same identifier later for a different website and didn't pass the URL parameter while initializing it. The save process will overwrite the previous data, and auto-matching only uses the latest saved properties.
143
+
144
+ Besides those arguments, we have `storage` and `storage_args`. Both configure the class used to connect to the database; by default, it's set to the SQLite class that the library uses. Those arguments shouldn't matter unless you want to write your own storage system, which we will cover on a [separate page in the development section](../development/automatch_storage_system.md).
145
+
146
+ Now, after enabling the automatch feature globally, you have two main ways to use it.
147
+
148
+ ### The CSS/XPath Selection way
149
+ As you have seen in the example above, first, you have to use the `auto_save` argument while selecting an element that exists on the page like below
150
+ ```python
151
+ element = page.css('#p1', auto_save=True)
152
+ ```
153
+ and when the element doesn't exist, you can use the same selector and the `auto_match` argument, and the library will find it for you
154
+ ```python
155
+ element = page.css('#p1', auto_match=True)
156
+ ```
157
+ Pretty simple, eh?
158
+
159
+ Well, a lot happened under the hood here. Remember the identifier part we mentioned before that you need to set so you can retrieve the element you want? Here, with the `css`/`css_first`/`xpath`/`xpath_first` methods, the identifier is set automatically as the selector you passed here to make things easier :)
160
+
161
+ That's also why all of these methods accept an `identifier` argument, so you can set it yourself when needed, for example, while saving the properties with the `auto_save` argument.
162
+
163
+ ### The manual way
164
+ You manually save and retrieve an element, then relocate it, which all happens within the automatch feature, as shown below. This allows you to automatch any element you have, no matter how you got it or which selection method you used!
165
+
166
+ First, let's say you got an element like this by text:
167
+ ```python
168
+ >>> element = page.find_by_text('Tipping the Velvet', first_match=True)
169
+ ```
170
+ You can save its unique properties with the `save` method like below, but you must set the identifier yourself. For this example, I chose `my_special_element` as an identifier, but it's best to use a meaningful identifier in your code for the same reason you use meaningful variable names :)
171
+ ```python
172
+ >>> page.save(element, 'my_special_element')
173
+ ```
174
+ Now, later, when you want to retrieve it and relocate it inside the page with auto-matching, it would be like this
175
+ ```python
176
+ >>> element_dict = page.retrieve('my_special_element')
177
+ >>> page.relocate(element_dict, adaptor_type=True)
178
+ [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
179
+ >>> page.relocate(element_dict, adaptor_type=True).css('::text')
180
+ ['Tipping the Velvet']
181
+ ```
182
+ This is how the `retrieve` and `relocate` methods are used.
183
+
184
+ If you want to keep it as an `lxml.etree` object, leave out the `adaptor_type` argument:
185
+ ```python
186
+ >>> page.relocate(element_dict)
187
+ [<Element a at 0x105a2a7b0>]
188
+ ```
189
+
190
+ ## Troubleshooting
191
+
192
+ ### No Matches Found
193
+ ```python
194
+ # 1. Check if data was saved
195
+ element_data = page.retrieve('identifier')
196
+ if not element_data:
197
+ print("No data saved for this identifier")
198
+
199
+ # 2. Try with different identifier
200
+ products = page.css('.product', auto_match=True, identifier='old_selector')
201
+
202
+ # 3. Save again with new identifier
203
+ products = page.css('.new-product', auto_save=True, identifier='new_identifier')
204
+ ```
205
+
206
+ ### Wrong Elements Matched
207
+ ```python
208
+ # Use more specific selectors
209
+ products = page.css('.product-list .product', auto_save=True)
210
+
211
+ # Or save with more context
212
+ product = page.find_by_text('Product Name').parent
213
+ page.save(product, 'specific_product')
214
+ ```
215
+
216
+ ## Known Issues
217
+ In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page in different locations, auto-matching will return the first element to you only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector, for example), as these selectors get separated, and each selector gets executed alone.
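+
+ A rough illustration of this behavior (the selectors here are hypothetical):
+ ```python
+ # Only the unique properties of the first matched element get saved for this identifier
+ prices = page.css('.price', auto_save=True)
+
+ # A combined selector is separated, and each selector gets saved/matched on its own
+ mixed = page.css('.price, .description', auto_save=True)
+ ```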
218
+
219
+ ## Final thoughts
220
+ Explaining this feature in detail without complications turned out to be challenging, but still, if there's something left unclear, you can head out to the [discussions section](https://github.com/D4Vinci/Scrapling/discussions), and I will reply to you ASAP or reach out to me privately and have a chat :)
docs/parsing/main_classes.md ADDED
@@ -0,0 +1,539 @@
1
+ ## Introduction
2
+ After exploring the various ways to select elements with Scrapling and related features, let's take a step back and examine the [Adaptor](#adaptor) class generally and other objects to better understand the parsing engine.
3
+
4
+ The [Adaptor](#adaptor) class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities. You can always import it with any of the following imports
5
+ ```python
6
+ from scrapling import Adaptor
7
+ from scrapling.parser import Adaptor
8
+ ```
9
+ then use it directly as you already learned in the [overview](../overview.md) page
10
+ ```python
11
+ adaptor = Adaptor(
12
+ text='<html>...</html>',
13
+ url='https://example.com'
14
+ )
15
+
16
+ # Then select elements as you like
17
+ elements = adaptor.css('.product')
18
+ ```
19
+ In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, an [Adaptor](#adaptor) object. Any operation you do, like selection, navigation, etc., will return either an [Adaptor](#adaptor) object or an [Adaptors](#adaptors) object, given that the result is element/elements from the page, not text or similar.
20
+
21
+ In other words, the main page is an [Adaptor](#adaptor) object, the elements within are [Adaptor](#adaptor) objects, and so on. Any text, such as the text content inside elements or the text inside element attributes, is a [TextHandler](#texthandler) object, and the attributes of each element are stored as an [AttributesHandler](#attributeshandler). We will return to both objects later, so let's focus on the [Adaptor](#adaptor) object.
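+
+ A minimal sketch of that object model, reusing the `adaptor` instance from above (the `.product` selector is just an assumed example):
+ ```python
+ products = adaptor.css('.product')  # an `Adaptors` object (a list-like container of elements)
+ first_product = products[0]         # each item is an `Adaptor` object
+ name = first_product.text           # text content is a `TextHandler` object
+ attributes = first_product.attrib   # attributes are stored as an `AttributesHandler`
+ ```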
22
+
23
+ ## Adaptor
24
+ ### Arguments explained
25
+ The most important ones are `text` and `body`. Both are used to pass the HTML code you want to parse, but the former accepts `str` and the latter accepts `bytes`, as you may be used to with `parsel` :)
26
+
27
+ Otherwise, you have the arguments `url`, `auto_match`, `storage`, and `storage_args`. All these arguments are settings used with the `auto_match` feature, and they don't make a difference if you are not going to use that feature, so just ignore them for now, and we will explain them in the [automatch](automatch.md) feature page.
28
+
29
+ Then you have the arguments for adjusting how the HTML is parsed or manipulated while the library parses it (a quick sketch follows this list):
30
+
31
+ - **encoding**: This is the encoding that will be used while parsing the HTML. The default is `UTF-8`.
32
+ - **keep_comments**: This tells the library whether to keep HTML comments while parsing the page. It's disabled by default, as it can mess up your scraping in many ways.
33
+ - **keep_cdata**: Same logic as the HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML. This also means that when you check the raw HTML content, you will find it doesn't have the CDATA.
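+
+ A minimal sketch of passing these parsing adjustments, using the `html_doc` string from the examples on this page (all values shown are the defaults described above):
+ ```python
+ page = Adaptor(
+     html_doc,
+     encoding='UTF-8',     # encoding used while parsing
+     keep_comments=False,  # drop HTML comments while parsing
+     keep_cdata=False,     # drop CDATA sections for cleaner HTML
+ )
+ ```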
34
+
35
+ I have intentionally ignored the arguments `huge_tree` and `root` to avoid making this page more complicated than needed.
36
+ You may notice that I'm doing that a lot, and that's because it's something you don't need to know to use the library. The development section will cover these missing parts if you are that interested.
37
+
38
+ After that, for the main page and elements within, most properties don't get initialized until you use them, like the text content of a page/element, and this is one of the reasons for Scrapling's speed :)
39
+
40
+ ### Properties
41
+ You have already seen much of this on the [overview](../overview.md) page, but don't worry if you haven't. We will review it more thoroughly with more advanced methods/usages. For clarity, the properties for traversal are separated below in the [traversal](#traversal) section.
42
+
43
+ Let's say we are parsing this HTML page for simplicity:
44
+ ```html
45
+ <html>
46
+ <head>
47
+ <title>Some page</title>
48
+ </head>
49
+ <body>
50
+ <div class="product-list">
51
+ <article class="product" data-id="1">
52
+ <h3>Product 1</h3>
53
+ <p class="description">This is product 1</p>
54
+ <span class="price">$10.99</span>
55
+ <div class="hidden stock">In stock: 5</div>
56
+ </article>
57
+
58
+ <article class="product" data-id="2">
59
+ <h3>Product 2</h3>
60
+ <p class="description">This is product 2</p>
61
+ <span class="price">$20.99</span>
62
+ <div class="hidden stock">In stock: 3</div>
63
+ </article>
64
+
65
+ <article class="product" data-id="3">
66
+ <h3>Product 3</h3>
67
+ <p class="description">This is product 3</p>
68
+ <span class="price">$15.99</span>
69
+ <div class="hidden stock">Out of stock</div>
70
+ </article>
71
+ </div>
72
+
73
+ <script id="page-data" type="application/json">
74
+ {
75
+ "lastUpdated": "2024-09-22T10:30:00Z",
76
+ "totalProducts": 3
77
+ }
78
+ </script>
79
+ </body>
80
+ </html>
81
+ ```
82
+ Load the page directly as shown before:
83
+ ```python
84
+ from scrapling import Adaptor
85
+ page = Adaptor(html_doc)
86
+ ```
87
+ Get all text content on the page recursively
88
+ ```python
89
+ >>> page.get_all_text()
90
+ 'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
91
+ ```
92
+ Get the first article as explained before; we will use it as an example
93
+ ```python
94
+ article = page.find('article')
95
+ ```
96
+ With the same logic, get all text content on the element recursively
97
+ ```python
98
+ >>> article.get_all_text()
99
+ 'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
100
+ ```
101
+ But if you try to get the direct text content, it will be empty; notice the logic difference
102
+ ```python
103
+ >>> article.text
104
+ ''
105
+ ```
106
+ The `get_all_text` method has the following optional arguments:
107
+
108
+ 1. **separator**: All strings collected will be concatenated using this separator. The default is '\n'
109
+ 2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
110
+ 3. **ignore_tags**: A tuple of all tag names you want to ignore in the final results. The default is `('script', 'style',)`.
111
+ 4. **valid_values**: If enabled, the method will only collect elements with real values, so all elements with empty or whitespace-only text content will be ignored. It's enabled by default.
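+ For instance, here's a hedged sketch of tweaking these arguments on the `article` element from above (the output is illustrative):
+ ```python
+ >>> article.get_all_text(separator=' | ', strip=True)
+ 'Product 1 | This is product 1 | $10.99 | In stock: 5'
+ ```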
112
+
113
+ By the way, the text returned here is not a standard string but a [TextHandler](#texthandler); we will get to this in detail later. If the text content can be serialized to JSON, you can use `.json()` on it
114
+ ```python
115
+ >>> script = page.find('script')
116
+ >>> script.json()
117
+ {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
118
+ ```
119
+ Let's continue to get the element tag
120
+ ```python
121
+ >>> article.tag
122
+ 'article'
123
+ ```
124
+ If you used it on the page directly, you will find you are operating on the root `html` element
125
+ ```python
126
+ >>> page.tag
127
+ 'html'
128
+ ```
129
+ Now, I think I have hammered home the (`page`/`element`) idea, so I won't return to it again.
130
+
131
+ Getting the attributes of the element
132
+ ```python
133
+ >>> print(article.attrib)
134
+ {'class': 'product', 'data-id': '1'}
135
+ ```
136
+ Get the HTML content of the element
137
+ ```python
138
+ >>> article.html_content
139
+ '<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
140
+ ```
141
+ It's the same if you used the `.body` property
142
+ ```python
143
+ >>> article.body
144
+ '<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
145
+ ```
146
+ Get the prettified version of the HTML content of the element
147
+ ```python
148
+ >>> print(article.prettify())
149
+ <article class="product" data-id="1"><h3>Product 1</h3>
150
+ <p class="description">This is product 1</p>
151
+ <span class="price">$10.99</span>
152
+ <div class="hidden stock">In stock: 5</div>
153
+ </article>
154
+ ```
155
+ To get all the ancestors in the DOM tree of this element
156
+ ```python
157
+ >>> article.path
158
+ [<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>,
159
+ <data='<body> <div class="product-list"> <artic...' parent='<html><head><title>Some page</title></he...'>,
160
+ <data='<html><head><title>Some page</title></he...'>]
161
+ ```
162
+ Generate a shortened CSS selector if possible, or generate the full selector
163
+ ```python
164
+ >>> article.generate_css_selector
165
+ 'body > div > article'
166
+ >>> article.generate_full_css_selector
167
+ 'body > div > article'
168
+ ```
169
+ Same case with XPath
170
+ ```python
171
+ >>> article.generate_xpath_selector
172
+ "//body/div/article"
173
+ >>> article.generate_full_xpath_selector
174
+ "//body/div/article"
175
+ ```
176
+
177
+ ### Traversal
178
+ Using the elements we found above, we will go over the properties/methods for moving in the page in detail.
179
+
180
+ If you are unfamiliar with the DOM tree or the tree data structure in general, the following traversal part can be confusing. I recommend you look up these concepts online for a better understanding.
181
+
182
+ If you are too lazy to search about it, here's a quick explanation to give you a good idea.<br/>
183
+ Simply put, the `html` element is the root of the website's tree, as every page starts with an `html` element.<br/>
184
+ This element will be directly above elements like `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent." The element `body` is a "sibling" of the element `head` and vice versa.
185
+
186
+ Accessing the parent of an element
187
+ ```python
188
+ >>> article.parent
189
+ <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
190
+ >>> article.parent.tag
191
+ 'div'
192
+ ```
193
+ You can chain it as you want, which applies to all similar properties/methods we will review.
194
+ ```python
195
+ >>> article.parent.parent.tag
196
+ 'body'
197
+ ```
198
+ Get the children of an element
199
+ ```python
200
+ >>> article.children
201
+ [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
202
+ <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
203
+ <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
204
+ <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
205
+ ```
206
+ Get all elements underneath an element. It acts as a nested version of the `children` property
207
+ ```python
208
+ >>> article.below_elements
209
+ [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
210
+ <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
211
+ <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
212
+ <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
213
+ ```
214
+ For this element, it returns the same result as the `children` property because its children don't have children of their own.
215
+
216
+ Another example, using the element with the `product-list` class, will clarify the difference between the `children` property and the `below_elements` property
217
+ ```python
218
+ >>> products_list = page.css_first('.product-list')
219
+ >>> products_list.children
220
+ [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
221
+ <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
222
+ <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
223
+
224
+ >>> products_list.below_elements
225
+ [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
226
+ <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
227
+ <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
228
+ <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
229
+ <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>,
230
+ <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
231
+ ...]
232
+ ```
233
+ Get the siblings of an element
234
+ ```python
235
+ >>> article.siblings
236
+ [<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
237
+ <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
238
+ ```
239
+ Get the next element of the current element
240
+ ```python
241
+ >>> article.next # gets the next element, the same logic applies to `article.previous`
242
+ <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
243
+ ```
244
+ The same logic applies to the `previous` property
245
+ ```python
246
+ >>> article.previous # It's the first child, so it doesn't have a previous element
247
+ >>> second_article = page.css_first('.product[data-id="2"]')
248
+ >>> second_article.previous
249
+ <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
250
+ ```
251
+ You can easily and quickly check whether an element has a specific class name or not
252
+ ```python
253
+ >>> article.has_class('product')
254
+ True
255
+ ```
256
+ If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element, like the example below
257
+ ```python
258
+ for ancestor in article.iterancestors():
259
+     print(ancestor.tag)  # or do something else with each ancestor...
260
+ ```
261
+ You can search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes an [Adaptor](#adaptor) object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
262
+ ```python
263
+ >>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
264
+ <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
265
+
266
+ >>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list')) # Same result, different approach
267
+ <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
268
+ ```
269
+ ## Adaptors
270
+ The class `Adaptors` is the "List" version of the [Adaptor](#adaptor) class. It inherits from the Python standard `list` type, so it shares all `list` properties and methods while adding more methods to make the operations you want to execute on the [Adaptor](#adaptor) instances within more straightforward.
271
+
272
+ In the [Adaptor](#adaptor) class, all methods/properties that should return a group of elements return them as an [Adaptors](#adaptors) class instance. The only exceptions are when you use the CSS/XPath methods as follows:
273
+
274
+ - If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
275
+ ```python
276
+ >>> page.css('a::text') # -> TextHandlers
277
+ >>> page.xpath('//a/text()') # -> TextHandlers
278
+ >>> page.css_first('a::text') # -> TextHandler
279
+ >>> page.xpath_first('//a/text()') # -> TextHandler
280
+ >>> page.css('a::attr(href)') # -> TextHandlers
281
+ >>> page.xpath('//a/@href') # -> TextHandlers
282
+ >>> page.css_first('a::attr(href)') # -> TextHandler
283
+ >>> page.xpath_first('//a/@href') # -> TextHandler
284
+ ```
285
+ - If you used a combined selector that returns mixed types, the result will be a Python standard `list`. <br/>Examples:
286
+ ```python
287
+ >>> page.css('.price_color') # -> Adaptors
288
+ >>> page.css('.product_pod a::attr(href)') # -> TextHandlers
289
+ >>> page.css('.price_color, .product_pod a::attr(href)') # -> List
290
+ ```
291
+
292
+ Let's see what [Adaptors](#adaptors) class adds to the table with that out of the way.
293
+ ### Properties
294
+ Apart from the normal operations on Python lists like iteration, slicing, etc...
295
+
296
+ You can do the following:
297
+
298
+ Execute CSS and XPath selectors directly on the [Adaptor](#adaptor) instances it contains; the arguments and the return types are the same as the [Adaptor](#adaptor) class's `css` and `xpath` methods. This, of course, makes chaining methods very straightforward.
299
+ ```python
300
+ >>> page.css('.product_pod a')
301
+ [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
302
+ <data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
303
+ <data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
304
+ <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
305
+ <data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
306
+ <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
307
+ ...]
308
+
309
+ >>> page.css('.product_pod').css('a') # Returns the same result
310
+ [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
311
+ <data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
312
+ <data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
313
+ <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
314
+ <data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
315
+ <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
316
+ ...]
317
+ ```
318
+ Run the `re` and `re_first` methods directly. They take the same arguments as the [Adaptor](#adaptor) class's versions. I'm still leaving these methods to be explained in the [TextHandler](#texthandler) section below.
319
+
320
+ However, in this class, `re_first` behaves differently as it runs `re` on each [Adaptor](#adaptor) within and returns the first one with a result. The `re` method returns, as usual, a [TextHandlers](#texthandlers) object that has all the results combined in one instance.
321
+ ```python
322
+ >>> page.css('.price_color').re(r'[\d\.]+')
323
+ ['51.77',
324
+ '53.74',
325
+ '50.10',
326
+ '47.82',
327
+ '54.23',
328
+ ...]
329
+
330
+ >>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
331
+ ['a-light-in-the-attic_1000',
332
+ 'tipping-the-velvet_999',
333
+ 'soumission_998',
334
+ 'sharp-objects_997',
335
+ ...]
336
+ ```
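+ And `re_first` here returns the first match it finds across the elements within, for example:
+ ```python
+ >>> page.css('.price_color').re_first(r'[\d\.]+')
+ '51.77'
+ ```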
337
+ With the `search` method, you can search quickly through the available [Adaptor](#adaptor) instances. The function you pass must accept an [Adaptor](#adaptor) instance as the first argument and return True/False. The method will return the first [Adaptor](#adaptor) instance that satisfies the function; otherwise, it will return `None`.
338
+ ```python
339
+ # Find the first product with the price '54.23'
340
+ >>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
341
+ >>> page.css('.product_pod').search(search_function)
342
+ <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
343
+ ```
344
+ You can use the `filter` method, too, which takes a function like the `search` method but returns an `Adaptors` instance of all the [Adaptor](#adaptor) instances that satisfy the function
345
+ ```python
346
+ # Find all products with prices over $50
347
+ >>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
348
+ >>> page.css('.product_pod').filter(filtering_function)
349
+ [<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
350
+ <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
351
+ <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
352
+ ...]
353
+ ```
354
+
355
+ ## TextHandler
356
+ This class is essential to understand, as all methods/properties that should return a string for you will return `TextHandler`, and the ones that should return a list of strings will return [TextHandlers](#texthandlers) instead.
357
+
358
+ TextHandler is a subclass of the standard Python string, so you can do anything with it. So, what is the difference that requires a different naming?
359
+
360
+ Of course, TextHandler provides extra methods and properties that standard Python strings don't have. We will review them now, but remember that all methods and properties in all classes that return string(s) are returning TextHandler, which opens the door for creativity and makes the code shorter and cleaner, as you will see. Also, you can import it directly and use it on any string, which we will explain later.
361
+ ### Usage
362
+ First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a TextHandler again, so you can chain them as you want. If you find a method or property that returns a standard string instead of TextHandler, please open an issue, and we will override it as well.
363
+
364
+ First, we start with the `re` and `re_first` methods. These are the same methods that exist in the rest of the classes ([Adaptor](#adaptor), [Adaptors](#adaptors), and [TextHandlers](#texthandlers)), so they will take the same arguments as well.
365
+
366
+ The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but as you probably figured out from the naming, it returns the first result only as a `TextHandler` instance.
367
+
368
+ Also, it takes other helpful arguments, which are:
369
+
370
+ - **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
371
+ - **clean_match**: It's disabled by default. This makes the method ignore all whitespaces and consecutive spaces while matching.
372
+ - **case_sensitive**: It's enabled by default. As the name implies, disabling it will make the regex ignore letters case while compiling it.
373
+
374
+ You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
375
+ ```python
376
+ >>> page.css('.price_color').re(r'[\d\.]+')
377
+ ['51.77',
378
+ '53.74',
379
+ '50.10',
380
+ '47.82',
381
+ '54.23',
382
+ ...]
383
+
384
+ >>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
385
+ ['a-light-in-the-attic_1000',
386
+ 'tipping-the-velvet_999',
387
+ 'soumission_998',
388
+ 'sharp-objects_997',
389
+ ...]
390
+ ```
391
+ To explain the other arguments better, we will use a custom string for each example below
392
+ ```python
393
+ >>> from scrapling import TextHandler
394
+ >>> test_string = TextHandler('hi  there') # Notice the two spaces
395
+ >>> test_string.re('hi there')
396
+ >>> test_string.re('hi there', clean_match=True) # Using `clean_match` will clean the string before matching the regex
397
+ ['hi there']
398
+
399
+ >>> test_string2 = TextHandler('Oh, Hi Mark')
400
+ >>> test_string2.re_first('oh, hi Mark')
401
+ >>> test_string2.re_first('oh, hi Mark', case_sensitive=False) # Notice we disabled `case_sensitive`
402
+ 'Oh, Hi Mark'
403
+
404
+ # Mixing arguments
405
+ >>> test_string.re('hi there', clean_match=True, case_sensitive=False)
406
+ ['hi there']
407
+ ```
408
+ Another benefit of replacing strings with `TextHandler` everywhere is that a property like `html_content` returns a `TextHandler`, so you can run regex on the HTML content if you want:
409
+ ```python
410
+ >>> page.html_content.re('div class=".*">(.*)</div')
411
+ ['In stock: 5', 'In stock: 3', 'Out of stock']
412
+ ```
413
+
414
+ - You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it throws an error
415
+ ```python
416
+ >>> page.css_first('#page-data::text')
417
+ '\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
418
+ >>> page.css_first('#page-data::text').json()
419
+ {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
420
+ ```
421
+ Note that if you didn't select a text node (like the text content or an attribute's text content) but selected the element itself, the text content will be used automatically, like this
422
+ ```python
423
+ >>> page.css_first('#page-data').json()
424
+ {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
425
+ ```
426
+ The [Adaptor](#adaptor) class adds one thing here, too; let's say this is the page we are working with:
427
+ ```html
428
+ <html>
429
+ <body>
430
+ <div>
431
+ <script id="page-data" type="application/json">
432
+ {
433
+ "lastUpdated": "2024-09-22T10:30:00Z",
434
+ "totalProducts": 3
435
+ }
436
+ </script>
437
+ </div>
438
+ </body>
439
+ </html>
440
+ ```
441
+ The [Adaptor](#adaptor) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
442
+ So, in this case, if you did something like this
443
+ ```python
444
+ >>> page.css_first('div::text').json()
445
+ ```
446
+ You will get an error because the `div` tag doesn't have direct text content that can be serialized to JSON; it actually doesn't have text content at all.<br/><br/>
447
+ In this case, the `get_all_text` method comes to the rescue, so you can do something like that
448
+ ```python
449
+ >>> page.css_first('div').get_all_text(ignore_tags=[]).json()
450
+ {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
451
+ ```
452
+ I used the `ignore_tags` argument here because the default value of it is `('script', 'style',)`, as you are aware.<br/><br/>
453
+ Another related behavior you should be aware of is the case while using any of the fetchers, which we will explain later. If you have a JSON response like this example:
454
+ ```python
455
+ >>> page = Adaptor("""{"some_key": "some_value"}""")
456
+ ```
457
+ Because the [Adaptor](#adaptor) class is optimized for HTML pages, it will treat it as a broken HTML response and fix it, so if you use the `html_content` property, you get this
458
+ ```python
459
+ >>> page.html_content
460
+ '<html><body><p>{"some_key": "some_value"}</p></body></html>'
461
+ ```
462
+ Here, you can use the `json` method directly, and it will work
463
+ ```python
464
+ >>> page.json()
465
+ {'some_key': 'some_value'}
466
+ ```
467
+ You might wonder how this works when the `html` tag lacks direct text content?<br/>
468
+ Well, for cases like JSON responses, I made the `.json()` method in the [Adaptor](#adaptor) class check whether the current element has direct text content; if it doesn't, it uses the `get_all_text` method directly.<br/><br/>It might sound a bit hacky, but remember that Scrapling is currently optimized to work with HTML pages only, so that's the best way for now to handle JSON responses without sacrificing speed. This will change in upcoming versions.
469
+
470
+ - Another handy method is `.clean()`; it removes whitespace characters and consecutive spaces for you and returns a new `TextHandler`, wonderful
471
+ ```python
472
+ >>> TextHandler('\n wonderful idea, \reh?').clean()
473
+ 'wonderful idea, eh?'
474
+ ```
475
+
476
+ - Another method that might be helpful in some cases is the `.sort()` method, which sorts the string's characters for you as you would do with lists
477
+ ```python
478
+ >>> TextHandler('acb').sort()
479
+ 'abc'
480
+ ```
481
+ Or do it in reverse:
482
+ ```python
483
+ >>> TextHandler('acb').sort(reverse=True)
484
+ 'cba'
485
+ ```
486
+
487
+ Other methods and properties will be added over time, but remember that this class is returned in place of strings nearly everywhere in the library.
488
+
489
+ ## TextHandlers
490
+ You probably guessed it: this class is to [TextHandler](#texthandler) what [Adaptors](#adaptors) is to [Adaptor](#adaptor). It inherits the same logic and methods as standard lists, with only `re` and `re_first` added as new methods.
491
+
492
+ The only difference is that the `re_first` method here runs `re` on each [TextHandler](#texthandler) within and returns the first result it finds, or `None`. Nothing else is new to explain here, but new methods will be added with time.
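+ A quick sketch using the same page as the `.price_color` examples above:
+ ```python
+ >>> prices = page.css('.price_color::text')  # -> TextHandlers
+ >>> prices.re_first(r'[\d\.]+')              # runs `re` on each TextHandler within and returns the first result
+ '51.77'
+ ```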
493
+
494
+ ## AttributesHandler
495
+ This is a read-only version of Python's standard dictionary, or `dict`, that's only used to store the attributes of each element (each [Adaptor](#adaptor) instance, in other words).
496
+ ```python
497
+ >>> print(page.find('script').attrib)
498
+ {'id': 'page-data', 'type': 'application/json'}
499
+ >>> type(page.find('script').attrib).__name__
500
+ 'AttributesHandler'
501
+ ```
502
+ Because it's read-only, it uses fewer resources than the standard dictionary. Still, it has the same dictionary methods/properties other than those that allow you to modify/override the data.
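+ A small sketch of what that means in practice, using the `script` element from the example above (the last line fails because the data is read-only):
+ ```python
+ >>> attrs = page.find('script').attrib
+ >>> attrs['id']             # reading works like a normal dict
+ 'page-data'
+ >>> attrs.get('type')
+ 'application/json'
+ >>> attrs['id'] = 'new-id'  # modifying is not allowed and raises an error
+ ```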
503
+
504
+ It currently adds two extra simple methods:
505
+
506
+ - The `search_values` method
507
+
508
+ In standard dictionaries, you can do `dict.get("key_name")` to check if a key exists. However, if you want to search by values instead of keys, it takes a few lines of code. This method does that for you: it allows you to search the current attributes by value and yields a dictionary for each matching item.
509
+
510
+ A simple example would be
511
+ ```python
512
+ >>> for i in page.find('script').attrib.search_values('page-data'):
513
+ print(i)
514
+ {'id': 'page-data'}
515
+ ```
516
+ But this method provides the `partial` argument as well, which allows you to search by part of the value:
517
+ ```python
518
+ >>> for i in page.find('script').attrib.search_values('page', partial=True):
519
+ print(i)
520
+ {'id': 'page-data'}
521
+ ```
522
+ These examples are contrived; a more realistic example would be using it with the `find_all` method to find all elements that have a specific value in their attributes:
523
+ ```python
524
+ >>> page.find_all(lambda element: list(element.attrib.search_values('product')))
525
+ [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
526
+ <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
527
+ <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
528
+ ```
529
+ All these elements have 'product' as a value for the attribute `class`.
530
+
531
+ Note that I used the `list` function here because `search_values` returns a generator object, which would otherwise be truthy for all elements.
532
+
533
+ - The `json_string` property
534
+
535
+ This property converts the current attributes to a JSON string if the attributes are JSON serializable; otherwise, it throws an error
536
+ ```python
537
+ >>> page.find('script').attrib.json_string
538
+ b'{"id":"page-data","type":"application/json"}'
539
+ ```
docs/parsing/selection.md ADDED
@@ -0,0 +1,512 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## Introduction
2
+ Scrapling currently supports parsing HTML pages exclusively, so it doesn't support XML feeds. This decision was made because the automatch feature won't work with XML, but that might change soon, so stay tuned :)
3
+
4
+ In Scrapling, there are 5 main ways to find elements:
5
+
6
+ 1. CSS3 Selectors
7
+ 2. XPath Selectors
8
+ 3. Finding elements based on filters/conditions.
9
+ 4. Finding elements whose content contains specific text
10
+ 5. Finding elements whose content matches specific regex
11
+
12
+ Of course, there are other indirect ways to find elements with Scrapling, but here we will discuss the main ways in detail. We will also bring up one of the most remarkable features of Scrapling: the ability to find elements that are similar to the element you have; you can jump to that section directly from [here](#finding-similar-elements).
13
+
14
+ If you are new to Web Scraping, have little to no experience writing selectors, and want to start quickly, I recommend you jump directly to learning the `find`/`find_all` methods from [here](#filters-based-searching).
15
+
16
+ ## CSS/XPath selectors
17
+
18
+ ### What are CSS selectors?
19
+ [CSS](https://en.wikipedia.org/wiki/CSS) is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
20
+
21
+ Scrapling implements CSS3 selectors as described in the [W3C specification](http://www.w3.org/TR/2011/REC-css3-selectors-20110929/). CSS selectors support comes from cssselect, so it's better to read about which [selectors, pseudo-functions, and pseudo-elements are supported by cssselect](https://cssselect.readthedocs.io/en/latest/#supported-selectors).
22
+
23
+ Also, Scrapling implements some non-standard pseudo-elements like:
24
+
25
+ * To select text nodes, use ``::text``
26
+ * To select attribute values, use ``::attr(name)`` where name is the name of the attribute that you want the value of
27
+
28
+ In short, if you come from Scrapy/Parsel, you will find the same logic for selectors here to make things easier. No need to learn a logic different from the one most of us are used to :)
29
+
30
+ To select elements with CSS selectors, you have the `css` and `css_first` methods. The latter is useful when you are only interested in the first element found (or when there's only one), and the former when you expect more than one, as it returns `Adaptors`.
31
+
32
+ ### What are XPath selectors?
33
+ [XPath](https://en.wikipedia.org/wiki/XPath) is a language for selecting nodes in XML documents, which can also be used with HTML. This [cheatsheet](https://devhints.io/xpath) is a good resource for learning about [XPath](https://en.wikipedia.org/wiki/XPath). Scrapling adds XPath selectors directly through LXML.
34
+
35
+ In short, it is the same situation as CSS Selectors; if you come from Scrapy/Parsel, you will find the same logic for selectors here. BUT Scrapling doesn't implement the XPath extension function `has-class` that Scrapy/Parsel provides; instead, there's the `has_class` method that you can use on returned elements for the same purpose, as sketched below.
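+ For example, here's a hedged sketch of replacing a Parsel-style `has-class` check (assuming `page` is an already-parsed page that contains `article.product` elements):
+ ```python
+ # Instead of an XPath extension like //article[has-class("product")] (not supported here),
+ # select the element with plain XPath and check its classes on the Python side:
+ article = page.xpath_first('//article')
+ if article and article.has_class('product'):
+     print('Found a product card')
+ ```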
36
+
37
+ To select elements with XPath selectors, you have the `xpath` and `xpath_first` methods. Again, these methods follow the same logic as the CSS selectors methods above.
38
+
39
+ > Note that the `css`, `css_first`, `xpath`, and `xpath_first` methods each have additional arguments, but we didn't explain them here as they are all about the automatch feature, which will be described in detail on its own page later.
40
+
41
+ ### Selectors examples
42
+ Let's see some shared examples of using CSS and XPath Selectors.
43
+
44
+ Select all elements with the class `product`
45
+ ```python
46
+ products = page.css('.product')
47
+ products = page.xpath('//*[@class="product"]')
48
+ ```
49
+ Note: The XPath one won't be accurate if there's another class; it's better to rely on CSS when selecting by class
50
+
51
+ Select the first element with the class `product`
52
+ ```python
53
+ product = page.css_first('.product')
54
+ product = page.xpath_first('//*[@class="product"]')
55
+ ```
56
+ Which would be the same as doing
57
+ ```python
58
+ product = page.css('.product')[0]
59
+ product = page.xpath('//*[@class="product"]')[0]
60
+ ```
61
+ Get the text of the first element with the `h1` tag name
62
+ ```python
63
+ title = page.css_first('h1::text')
64
+ title = page.xpath_first('//h1//text()')
65
+ ```
66
+ Which is again the same as doing
67
+ ```python
68
+ title = page.css_first('h1').text
69
+ title = page.xpath_first('//h1').text
70
+ ```
71
+ Get the `href` attribute of the first element with `a` tag name
72
+ ```python
73
+ link = page.css_first('a::attr(href)')
74
+ link = page.xpath_first('//a/@href')
75
+ ```
76
+ Select the text of the first element with the `h1` tag name, which contains 'Phone' and under an element with class 'product'
77
+ ```python
78
+ title = page.css_first('.product h1:contains("Phone")::text')
79
+ title = page.xpath_first('//*[@class="product"]//h1[contains(text(),"Phone")]/text()')
80
+ ```
81
+ You can nest and chain selectors as you want, given that it returns results
82
+ ```python
83
+ page.css_first('.product').css_first('h1:contains("Phone")::text')
84
+ page.xpath_first('//*[@class="product"]').xpath_first('//h1[contains(text(),"Phone")]/text()')
85
+ page.xpath_first('//*[@class="product"]').css_first('h1:contains("Phone")::text')
86
+ ```
87
+ Another example
88
+
89
+ All links that have 'image' in their 'href' attribute
90
+ ```python
91
+ links = page.css('a[href*="image"]')
92
+ links = page.xpath('//a[contains(@href, "image")]')
93
+ for index, link in enumerate(links):
94
+ link_value = link.attrib['href'] # Cleaner than link.css('::attr(href)')
95
+ link_text = link.text
96
+ print(f'Link number {index} points to this url {link_value} with text content as "{link_text}"')
97
+ ```
98
+
99
+ ## Text-content selection
100
+ Scrapling provides the ability to select elements based on their direct text content, and you have two ways to do this:
101
+
102
+ 1. Elements whose direct text content contains given text with many options through the `find_by_text` method.
103
+ 2. Elements whose direct text content matches the given regex pattern with many options through the `find_by_regex` method.
104
+
105
+ What you can do with `find_by_text` can be done with `find_by_regex` if you are good enough with regular expressions (regex), but we are providing more options to make them easier for all users to access.
106
+
107
+ With `find_by_text`, you will pass the text as the first argument; with the `find_by_regex` method, the regex pattern is the first. Both methods share the following arguments:
108
+
109
+ * **first_match**: If `True` (the default), the method used will return the first result it finds.
110
+ * **case_sensitive**: If `True`, the case of the letters will be considered.
111
+ * **clean_match**: If `True`, all whitespaces and consecutive spaces will be ignored while matching.
112
+
113
+ By default, Scrapling searches for an exact match of the text you pass to `find_by_text`, so the text content of the wanted element has to be ONLY the text you passed, and that's why it also has one extra argument, which is:
114
+
115
+ * **partial**: If enabled, `find_by_text` will return elements that contain the input text. So it's not an exact match anymore
116
+
117
+ Note: The method `find_by_regex` can accept both regular strings and a compiled regex pattern as its first argument, as you will see in the upcoming examples.
118
+
119
+ ### Finding Similar Elements
120
+ One of the most remarkable new features Scrapling puts on the table is the ability to tell Scrapling to find elements similar to the element at hand. The inspiration for this feature came from the AutoScraper library, but here, it can be used on elements found by any method. Most of its usage will likely come after finding elements through text content, like how AutoScraper works, so it's also convenient to explain it here.
121
+
122
+ So, how does it work?
123
+
124
+ Imagine a scenario where you found a product by its title, for example, and you want to extract other products listed in the same table/container. With the element you have, you can simply call the method `.find_similar()` on it, and Scrapling will:
125
+
126
+ 1. Find all page elements with the same tree depth as this element.
127
+ 2. All found elements will be checked, and those without the same tag name, parent tag name, and grandparent tag name will be dropped.
128
+ 3. Now we are sure (like 99% sure) that these elements are the ones we want, but as a last check, Scrapling will use fuzzy matching to drop the elements whose attributes don't look like the attributes of our element. There's a percentage to control this step, and I recommend you not play with it unless the default settings don't get the elements you want.
129
+
130
+ That's a lot of talking, I know, but I had to go deep. I will give examples of using this method in the next section, but first, these are the arguments that can be passed to it:
131
+
132
+ * **similarity_threshold**: This is the percentage we discussed in step 3 for comparing elements' attributes. The default value is 0.2. In simpler words, the attributes' values of both elements should be at least 20% similar. If you want to turn off this check (step 3, basically), you can set this argument to 0, but I recommend you read what the other arguments do first.
133
+ * **ignore_attributes**: The attribute names passed will be ignored while matching the attributes in the last step. The default value is `('href', 'src',)` because URLs can change a lot between elements, making them unreliable.
134
+ * **match_text**: If `True`, the element's text content will be considered when matching. Using this in normal cases is not recommended, but it depends.
135
+
136
+ Now, let's check out the examples below.
137
+
138
+ ### Examples
139
+ Let's see some shared examples of finding elements with raw text and regex.
140
+
141
+ I will use the `Fetcher` to clarify these examples, but it will be explained in detail later.
142
+ ```python
143
+ from scrapling.fetchers import Fetcher
144
+ page = Fetcher.get('https://books.toscrape.com/index.html')
145
+ ```
146
+ Find the first element whose text fully matches this text
147
+ ```python
148
+ >>> page.find_by_text('Tipping the Velvet')
149
+ <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
150
+ ```
151
+ Combining it with `page.urljoin` to return the full URL from the relative `href`
152
+ ```python
153
+ >>> page.find_by_text('Tipping the Velvet').attrib['href']
154
+ 'catalogue/tipping-the-velvet_999/index.html'
155
+ >>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href'])
156
+ 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'
157
+ ```
158
+ Get all matches if there are more (hence, it returned a list)
159
+ ```python
160
+ >>> page.find_by_text('Tipping the Velvet', first_match=False)
161
+ [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
162
+ ```
163
+ Get all elements that contain the word `the` (Partial matching)
164
+ ```python
165
+ >>> results = page.find_by_text('the', partial=True, first_match=False)
166
+ >>> [i.text for i in results]
167
+ ['A Light in the ...',
168
+ 'Tipping the Velvet',
169
+ 'The Requiem Red',
170
+ 'The Dirty Little Secrets ...',
171
+ 'The Coming Woman: A ...',
172
+ 'The Boys in the ...',
173
+ 'The Black Maria',
174
+ 'Mesaerion: The Best Science ...',
175
+ "It's Only the Himalayas"]
176
+ ```
177
+ The search is case insensitive, so those results have `The`, not only the lowercase one `the`; let's limit the search to the elements with `the` only.
178
+ ```python
179
+ >>> results = page.find_by_text('the', partial=True, first_match=False, case_sensitive=True)
180
+ >>> [i.text for i in results]
181
+ ['A Light in the ...',
182
+ 'Tipping the Velvet',
183
+ 'The Boys in the ...',
184
+ "It's Only the Himalayas"]
185
+ ```
186
+ Get the first element whose text content matches my price regex
187
+ ```python
188
+ >>> page.find_by_regex(r'£[\d\.]+')
189
+ <data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>
190
+ >>> page.find_by_regex(r'£[\d\.]+').text
191
+ '£51.77'
192
+ ```
193
+ It's the same if you pass the compiled regex as well; Scrapling will detect the input type and act upon that:
194
+ ```python
195
+ >>> import re
196
+ >>> regex = re.compile(r'£[\d\.]+')
197
+ >>> page.find_by_regex(regex)
198
+ <data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>
199
+ >>> page.find_by_regex(regex).text
200
+ '£51.77'
201
+ ```
202
+ Get all elements that match the regex
203
+ ```python
204
+ >>> page.find_by_regex(r'£[\d\.]+', first_match=False)
205
+ [<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
206
+ <data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
207
+ <data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
208
+ <data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
209
+ ...]
210
+ ```
211
+ And so on...
212
+
213
+ Find all elements similar to the current element in location and attributes. For our case, ignore the 'title' attribute while matching
214
+ ```python
215
+ >>> element = page.find_by_text('Tipping the Velvet')
216
+ >>> element.find_similar(ignore_attributes=['title'])
217
+ [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
218
+ <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
219
+ <data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
220
+ ...]
221
+ ```
222
+ Notice that the number of elements is 19, not 20, because the current element is not included in the results.
223
+ ```python
224
+ >>> len(element.find_similar(ignore_attributes=['title']))
225
+ 19
226
+ ```
227
+ Get the `href` attribute from all similar elements
228
+ ```python
229
+ >>> [
230
+ element.attrib['href']
231
+ for element in element.find_similar(ignore_attributes=['title'])
232
+ ]
233
+ ['catalogue/a-light-in-the-attic_1000/index.html',
234
+ 'catalogue/soumission_998/index.html',
235
+ 'catalogue/sharp-objects_997/index.html',
236
+ ...]
237
+ ```
238
+ To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point for some reason
239
+ ```python
240
+ >>> for product in element.parent.parent.find_similar():
241
+ print({
242
+ "name": product.css_first('h3 a::text'),
243
+ "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
244
+ "stock": product.css('.availability::text')[-1].clean()
245
+ })
246
+ {'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
247
+ {'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
248
+ {'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
249
+ ...
250
+ ```
251
+ ### Advanced examples
252
+ See more advanced or real-world examples using the `find_similar` method.
253
+
254
+ E-commerce Product Extraction
255
+ ```python
256
+ def extract_product_grid(page):
257
+ # Find the first product card
258
+ first_product = page.find_by_text('Add to Cart').find_ancestor(
259
+ lambda e: e.has_class('product-card')
260
+ )
261
+
262
+ # Find similar product cards
263
+ products = first_product.find_similar()
264
+
265
+ return [
266
+ {
267
+ 'name': p.css_first('h3::text'),
268
+ 'price': p.css_first('.price::text').re_first(r'\d+\.\d{2}'),
269
+ 'stock': 'In stock' in p.text,
270
+ 'rating': p.css_first('.rating').attrib.get('data-rating')
271
+ }
272
+ for p in products
273
+ ]
274
+ ```
275
+ Table Row Extraction
276
+ ```python
277
+ def extract_table_data(page):
278
+ # Find the first data row
279
+ first_row = page.css_first('table tbody tr')
280
+
281
+ # Find similar rows
282
+ rows = first_row.find_similar()
283
+
284
+ return [
285
+ {
286
+ 'column1': row.css_first('td:nth-child(1)::text'),
287
+ 'column2': row.css_first('td:nth-child(2)::text'),
288
+ 'column3': row.css_first('td:nth-child(3)::text')
289
+ }
290
+ for row in rows
291
+ ]
292
+ ```
293
+ Form Field Extraction
294
+ ```python
295
+ def extract_form_fields(page):
296
+ # Find first form field container
297
+ first_field = page.css_first('input').find_ancestor(
298
+ lambda e: e.has_class('form-field')
299
+ )
300
+
301
+ # Find similar field containers
302
+ fields = first_field.find_similar()
303
+
304
+ return [
305
+ {
306
+ 'label': f.css_first('label::text'),
307
+ 'type': f.css_first('input').attrib.get('type'),
308
+ 'required': 'required' in f.css_first('input').attrib
309
+ }
310
+ for f in fields
311
+ ]
312
+ ```
313
+ Extracting reviews from a website
314
+ ```python
315
+ def extract_reviews(page):
316
+ # Find first review
317
+ first_review = page.find_by_text('Great product!')
318
+ review_container = first_review.find_ancestor(
319
+ lambda e: e.has_class('review')
320
+ )
321
+
322
+ # Find similar reviews
323
+ all_reviews = review_container.find_similar()
324
+
325
+ return [
326
+ {
327
+ 'text': r.css_first('.review-text::text'),
328
+ 'rating': r.attrib.get('data-rating'),
329
+ 'author': r.css_first('.reviewer::text')
330
+ }
331
+ for r in all_reviews
332
+ ]
333
+ ```
334
+ ## Filters-based searching
335
+ This search method is arguably the best way to find elements in Scrapling because it is powerful and easier for newcomers to Web Scraping to learn than writing selectors.
336
+
337
+ Inspired by BeautifulSoup's `find_all` function, you can find elements using the `find_all` and `find` methods. Both methods can take multiple types of filters and return all elements on the page that all these filters apply to.
338
+
339
+ To be more specific:
340
+
341
+ * Any string passed is considered a tag name.
342
+ * Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
343
+ * Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
344
+ * Any regex patterns passed are used to filter elements by content like the `find_by_regex` method
345
+ * Any functions passed are used to filter elements
346
+ * Any keyword argument passed is considered as an HTML element attribute with its value.
347
+
348
+ It collects all passed arguments and keywords, and each filter passes its results to the following filter in a waterfall-like filtering system.
349
+
350
+ It filters all elements in the current page/element in the following order:
351
+
352
+ 1. All elements with the passed tag name(s) get collected.
353
+ 2. All elements that match all passed attribute(s) are collected; if a previous filter is used, then previously collected elements are filtered.
354
+ 3. All elements that match all passed regex patterns are collected, or if previous filter(s) are used, then previously collected elements are filtered.
355
+ 4. All elements that fulfill all passed function(s) are collected; if a previous filter(s) is used, then previously collected elements are filtered.
356
+
357
+ Notes:
358
+
359
+ 1. As you probably understood, the filtering process always starts from the first filter it finds in the filtering order above. So, if no tag name(s) are passed but attributes are passed, the process starts from that layer, and so on.
360
+ 2. The order in which you pass the arguments doesn't matter. The only order that's taken into consideration is the order explained above.
361
+
362
+ Check examples to clear any confusion :)
363
+
364
+ ### Examples
365
+ ```python
366
+ >>> from scrapling.fetchers import Fetcher
367
+ >>> page = Fetcher.get('https://quotes.toscrape.com/')
368
+ ```
369
+ Find all elements with the tag name `div`.
370
+ ```python
371
+ >>> page.find_all('div')
372
+ [<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
373
+ <data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
374
+ ...]
375
+ ```
376
+ Find all div elements with a class that equals `quote`.
377
+ ```python
378
+ >>> page.find_all('div', class_='quote')
379
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
380
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
381
+ ...]
382
+ ```
383
+ Same as above.
384
+ ```python
385
+ >>> page.find_all('div', {'class': 'quote'})
386
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
387
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
388
+ ...]
389
+ ```
390
+ Find all elements with a class that equals `quote`.
391
+ ```python
392
+ >>> page.find_all({'class': 'quote'})
393
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
394
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
395
+ ...]
396
+ ```
397
+ Find all div elements with a class that equals `quote` and contains the element `.text`, which contains the word 'world' in its content.
398
+ ```python
399
+ >>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
400
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
401
+ ```
402
+ Find all elements that have children.
403
+ ```python
404
+ >>> page.find_all(lambda element: len(element.children) > 0)
405
+ [<data='<html lang="en"><head><meta charset="UTF...'>,
406
+ <data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
407
+ <data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
408
+ ...]
409
+ ```
410
+ Find all elements that contain the word 'world' in their content.
411
+ ```python
412
+ >>> page.find_all(lambda element: "world" in element.text)
413
+ [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
414
+ <data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]
415
+ ```
416
+ Find all span elements that match the given regex
417
+ ```python
418
+ >>> page.find_all('span', re.compile(r'world'))
419
+ [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]
420
+ ```
421
+ Find all div and span elements with class 'quote' (No span elements like that, so only div returned)
422
+ ```python
423
+ >>> page.find_all(['div', 'span'], {'class': 'quote'})
424
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
425
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
426
+ ...]
427
+ ```
428
+ Mix things up
429
+ ```python
430
+ >>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
431
+ ['Albert Einstein',
432
+ 'J.K. Rowling',
433
+ ...]
434
+ ```
435
+ A bonus pro tip: Find all elements whose `href` attribute's value ends with the word 'Einstein'.
436
+ ```python
437
+ >>> page.find_all({'href$': 'Einstein'})
438
+ [<data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
439
+ <data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
440
+ <data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>]
441
+ ```
442
+ Another pro tip: Find all elements whose `href` attribute's value has '/author/' in it
443
+ ```python
444
+ >>> page.find_all({'href*': '/author/'})
445
+ [<data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
446
+ <data='<a href="/author/J-K-Rowling">(about)</a...' parent='<span>by <small class="author" itemprop=...'>,
447
+ <data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
448
+ ...]
449
+ ```
450
+ And so on...
451
+
452
+ ## Generating selectors
453
+ You can always generate CSS/XPath selectors for any element that can be reused here or anywhere else, and the most remarkable thing is that it doesn't matter what method you used to find that element!
454
+
455
+ Generate a short CSS selector for the `url_element` element (if possible, create a short one; otherwise, it's a full selector)
456
+ ```python
457
+ >>> url_element = page.find({'href*': '/author/'})
458
+ >>> url_element.generate_css_selector
459
+ 'body > div > div:nth-of-type(2) > div > div > span:nth-of-type(2) > a'
460
+ ```
461
+ Generate a full CSS selector for the `url_element` element from the start of the page
462
+ ```python
463
+ >>> url_element.generate_full_css_selector
464
+ 'body > div > div:nth-of-type(2) > div > div > span:nth-of-type(2) > a'
465
+ ```
466
+ Generate a short XPath selector for the `url_element` element (if possible, create a short one; otherwise, it's a full selector)
467
+ ```python
468
+ >>> url_element.generate_xpath_selector
469
+ '//body/div/div[2]/div/div/span[2]/a'
470
+ ```
471
+ Generate a full XPath selector for the `url_element` element from the start of the page
472
+ ```python
473
+ >>> url_element.generate_full_xpath_selector
474
+ '//body/div/div[2]/div/div/span[2]/a'
475
+ ```
476
+ > Note: <br>
477
+ > When you tell Scrapling to create a short selector, it tries to find a unique element to use as a stop point in generation, like an element with an `id` attribute, but in our case, there wasn't any, which is why the short and the full selectors are the same.
478
+
479
+ ## Using selectors with regular expressions
480
+ Like in `parsel`/`scrapy`, you have the methods `re` and `re_first` for extracting data using regular expressions. However, unlike those libraries, these methods exist in nearly all classes (`Adaptor`/`Adaptors`/`TextHandler` and `TextHandlers`), which means you can use them directly on an element even if you didn't select a text node.
481
+
482
+ We will have a deep look at it while explaining the [TextHandler](main_classes.md#texthandler) class, but in general, it works like the below examples:
483
+ ```python
484
+ >>> page.css_first('.price_color').re_first(r'[\d\.]+')
485
+ '51.77'
486
+
487
+ >>> page.css('.price_color').re_first(r'[\d\.]+')
488
+ '51.77'
489
+
490
+ >>> page.css('.price_color').re(r'[\d\.]+')
491
+ ['51.77',
492
+ '53.74',
493
+ '50.10',
494
+ '47.82',
495
+ '54.23',
496
+ ...]
497
+
498
+ >>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
499
+ ['a-light-in-the-attic_1000',
500
+ 'tipping-the-velvet_999',
501
+ 'soumission_998',
502
+ 'sharp-objects_997',
503
+ ...]
504
+
505
+ >>> filtering_function = lambda e: e.parent.tag == 'h3' and e.parent.parent.has_class('product_pod')  # Equivalent to the CSS selector used above
506
+ >>> page.find('a', filtering_function).attrib['href'].re(r'catalogue/(.*)/index.html')
507
+ ['a-light-in-the-attic_1000']
508
+
509
+ >>> page.find_by_text('Tipping the Velvet').attrib['href'].re(r'catalogue/(.*)/index.html')
510
+ ['tipping-the-velvet_999']
511
+ ```
512
+ And so on; you get the idea. We will cover this in more detail on the next page while explaining the [TextHandler](main_classes.md#texthandler) class.
docs/stylesheets/extra.css ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ .md-grid {
2
+ max-width: 65%;
3
+ }
docs/tutorials/migrating_from_beautifulsoup.md ADDED
@@ -0,0 +1,98 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Migrating from BeautifulSoup to Scrapling
2
+
3
+ <style>
4
+ .md-grid {
5
+ max-width: 85%;
6
+ }
7
+ </style>
8
+
9
+ If you're already familiar with BeautifulSoup, you're in for a treat. Scrapling is faster, provides similar parsing capabilities, and adds powerful new features for fetching and handling modern web pages. This guide will help you quickly adapt your existing BeautifulSoup code to take advantage of Scrapling's capabilities.
10
+
11
+ Below is a table that covers the most common operations you'll perform when scraping web pages. Each row shows how to accomplish a specific task in BeautifulSoup and the corresponding way to do it in Scrapling.
12
+
13
+ You will notice that some BeautifulSoup shortcuts are missing in Scrapling, but those shortcuts are one of the reasons BeautifulSoup is slower than Scrapling. The point is: if the same feature can be achieved with a short one-liner, there is no need to sacrifice performance just to make that short line even shorter :)
14
+
15
+
16
+ | Task | BeautifulSoup Code | Scrapling Code |
17
+ |-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
18
+ | Parser import | `from bs4 import BeautifulSoup` | `from scrapling.parser import Adaptor` |
19
+ | Parsing HTML from string | `soup = BeautifulSoup(html, 'html.parser')` | `page = Adaptor(html)` |
20
+ | Finding a single element | `element = soup.find('div', class_='example')` | `element = page.find('div', class_='example')` |
21
+ | Finding multiple elements | `elements = soup.find_all('div', class_='example')` | `elements = page.find_all('div', class_='example')` |
22
+ | Finding a single element (Example 2) | `element = soup.find('div', attrs={"class": "example"})` | `element = page.find('div', {"class": "example"})` |
23
+ | Finding a single element (Example 3) | `element = soup.find(re.compile("^b"))` | `element = page.find(re.compile("^b"))`<br/>`element = page.find_by_regex(r"^b")` |
24
+ | Finding a single element (Example 4) | `element = soup.find(lambda e: len(list(e.children)) > 0)` | `element = page.find(lambda e: len(e.children) > 0)` |
25
+ | Finding a single element (Example 5) | `element = soup.find(["a", "b"])` | `element = page.find(["a", "b"])` |
26
+ | Find element by its text content | `element = soup.find(text="some text")` | `element = page.find_by_text("some text", partial=False)` |
27
+ | Using CSS selectors to find the first matching element          | `element = soup.select_one('div.example')`                                                                      | `element = page.css_first('div.example')`                                          |
28
+ | Using CSS selectors to find all matching elements                | `elements = soup.select('div.example')`                                                                         | `elements = page.css('div.example')`                                                |
29
+ | Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
30
+ | Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.body` |
31
+ | Get tag name of an element | `name = element.name` | `name = element.tag` |
32
+ | Extracting text content of an element | `string = element.string` | `string = element.text` |
33
+ | Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
34
+ | Access the dictionary of attributes | `attrs = element.attrs` | `attrs = element.attrib` |
35
+ | Extracting attributes | `attr = element['href']` | `attr = element.attrib['href']` |
36
+ | Navigating to parent | `parent = element.parent` | `parent = element.parent` |
37
+ | Get all parents of an element | `parents = list(element.parents)` | `parents = list(element.iterancestors())` |
38
+ | Searching for an element in the parents of an element | `target_parent = element.find_parent("a")` | `target_parent = element.find_ancestor(lambda p: p.tag == 'a')` |
39
+ | Get all siblings of an element | N/A | `siblings = element.siblings` |
40
+ | Get next sibling of an element | `next_element = element.next_sibling` | `next_element = element.next` |
41
+ | Searching for an element in the siblings of an element | `target_sibling = element.find_next_sibling("a")`<br/>`target_sibling = element.find_previous_sibling("a")` | `target_sibling = element.siblings.search(lambda s: s.tag == 'a')` |
42
+ | Searching for elements in the siblings of an element | `target_sibling = element.find_next_siblings("a")`<br/>`target_sibling = element.find_previous_siblings("a")` | `target_sibling = element.siblings.filter(lambda s: s.tag == 'a')` |
43
+ | Searching for an element in the next elements of an element | `target_parent = element.find_next("a")` | `target_parent = element.below_elements.search(lambda p: p.tag == 'a')` |
44
+ | Searching for elements in the next elements of an element | `target_parent = element.find_all_next("a")` | `target_parent = element.below_elements.filter(lambda p: p.tag == 'a')` |
45
+ | Searching for an element in the previous elements of an element | `target_parent = element.find_previous("a")` | `target_parent = element.path.search(lambda p: p.tag == 'a')` |
46
+ | Searching for elements in the previous elements of an element | `target_parent = element.find_all_previous("a")` | `target_parent = element.path.filter(lambda p: p.tag == 'a')` |
47
+ | Get previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
48
+ | Navigating to children | `children = list(element.children)` | `children = element.children` |
49
+ | Get all descendants of an element | `children = list(element.descendants)` | `children = element.below_elements` |
50
+ | Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
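+ 
+ To make one of the less obvious rows above concrete, here is a minimal sketch of the sibling-search row. The tiny `html` snippet and the element names are made up purely for illustration:
+ 
+ ```python
+ from bs4 import BeautifulSoup
+ from scrapling.parser import Adaptor
+ 
+ html = '<div><p id="intro">Intro</p><a href="/a">A</a><p>Outro</p></div>'
+ 
+ # BeautifulSoup: look through the following siblings for an <a> tag
+ soup = BeautifulSoup(html, 'html.parser')
+ intro = soup.find('p', id='intro')
+ print(intro.find_next_sibling('a'))
+ 
+ # Scrapling: search all siblings with a lambda, as in the table above
+ page = Adaptor(html)
+ intro = page.find('p', {'id': 'intro'})
+ print(intro.siblings.search(lambda s: s.tag == 'a'))
+ ```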
51
+
52
+
53
+ One point to remember: BeautifulSoup provides features for modifying and manipulating the page after parsing it. Scrapling focuses more on scraping the page faster for you, and then you can do what you want with the extracted information. So, both tools can be used in Web Scraping, but only one of them specializes in it :)
54
+
55
+ ### Putting It All Together
56
+
57
+ Here's a simple example of scraping a web page to extract all the links using BeautifulSoup and Scrapling.
58
+
59
+ **With BeautifulSoup:**
60
+
61
+ ```python
62
+ import requests
63
+ from bs4 import BeautifulSoup
64
+
65
+ url = 'http://example.com'
66
+ response = requests.get(url)
67
+ soup = BeautifulSoup(response.text, 'html.parser')
68
+
69
+ links = soup.find_all('a')
70
+ for link in links:
71
+ print(link['href'])
72
+ ```
73
+
74
+ **With Scrapling:**
75
+
76
+ ```python
77
+ from scrapling import Fetcher
78
+
79
+ url = 'http://example.com'
80
+ page = Fetcher.get(url=url)
81
+
82
+ links = page.css('a::attr(href)')
83
+ for link in links:
84
+ print(link)
85
+ ```
86
+
87
+ As you can see, Scrapling simplifies the process by handling the fetching and parsing in a single step, making your code cleaner and more efficient.
88
+
89
+ **Additional Notes:**
90
+
91
+ - **Different parsers**: BeautifulSoup lets you choose which parser engine to use, and one of the options is `lxml`. Scrapling doesn't offer that choice and uses the `lxml` library by default for performance reasons.
92
+ - **Element Types**: In BeautifulSoup, elements are `Tag` objects, while in Scrapling, they are `Adaptor` objects. However, they provide similar methods and properties for navigation and data extraction.
93
+ - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.css_first()`). To avoid errors, check for `None` before accessing properties, as shown in the short sketch after this list.
94
+ - **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can be helpful for removing extra whitespace or unwanted characters. Please check out the documentation for the complete list.
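+ 
+ Here is a short sketch of the `None` check mentioned above (the URL and the selector are placeholders):
+ 
+ ```python
+ from scrapling import Fetcher
+ 
+ page = Fetcher.get(url='http://example.com')
+ 
+ element = page.css_first('div.example')  # returns None when nothing matches
+ if element is not None:
+     # element.text is a TextHandler, so helpers like clean() are available on it
+     print(element.text.clean())
+ else:
+     print('No matching element found')
+ ```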
95
+
96
+ The documentation provides more details on Scrapling's features and the full list of arguments that can be passed to all methods.
97
+
98
+ This guide should make your transition from BeautifulSoup to Scrapling smooth and straightforward. Happy scraping!
docs/tutorials/replacing_ai.md ADDED
@@ -0,0 +1 @@
 
 
1
+ WIP
mkdocs.yml ADDED
@@ -0,0 +1,142 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ site_name: Scrapling
2
+ site_description: Scrapling - a Python library to make Web Scraping easy again!
3
+ site_author: Karim Shoair
4
+ repo_url: https://github.com/D4Vinci/Scrapling
5
+ repo_name: D4Vinci/Scrapling
6
+ copyright: Copyright &copy; 2025 Karim Shoair
7
+
8
+ theme:
9
+ name: material
10
+ language: en
11
+ palette:
12
+ - media: "(prefers-color-scheme)"
13
+ toggle:
14
+ icon: material/link
15
+ name: Switch to light mode
16
+ - media: "(prefers-color-scheme: light)"
17
+ scheme: default
18
+ primary: indigo
19
+ accent: indigo
20
+ toggle:
21
+ icon: material/toggle-switch
22
+ name: Switch to dark mode
23
+ - media: "(prefers-color-scheme: dark)"
24
+ scheme: slate
25
+ primary: black
26
+ accent: indigo
27
+ toggle:
28
+ icon: material/toggle-switch-off
29
+ name: Switch to system preference
30
+ font:
31
+ text: Roboto
32
+ code: Roboto Mono
33
+ icon:
34
+ repo: fontawesome/brands/github-alt
35
+ features:
36
+ - announce.dismiss
37
+ - navigation.top
38
+ - navigation.footer
39
+ - navigation.instant
40
+ - navigation.indexes
41
+ - navigation.sections
42
+ - navigation.tracking
43
+ - navigation.instant
44
+ - navigation.instant.progress
45
+ # - navigation.tabs
46
+ # - navigation.expand
47
+ # - toc.integrate
48
+ - search.share
49
+ - search.suggest
50
+ - search.highlight
51
+ - content.tabs.link
52
+ - content.width.full
53
+ - content.action.view
54
+ - content.action.edit
55
+ - content.code.copy
56
+ - content.code.annotate
57
+ - content.code.annotation
58
+ # logo: assets/logo.png
59
+ # favicon: assets/favicon.png
60
+
61
+ nav:
62
+ - Introduction: index.md
63
+ - Overview: overview.md
64
+ - Parsing Performance: benchmarks.md
65
+ - User Guide:
66
+ - Parsing:
67
+ - Querying elements: parsing/selection.md
68
+ - Main classes: parsing/main_classes.md
69
+ - Using automatch feature: parsing/automatch.md
70
+ - Fetching:
71
+ - Choosing a fetcher: fetching/choosing.md
72
+ - Static requests: fetching/static.md
73
+ - Dynamically loaded websites: fetching/dynamic.md
74
+ - Fully bypass protections while fetching: fetching/stealthy.md
75
+ - Tutorials:
76
+ - Using Scrapling instead of AI: tutorials/replacing_ai.md
77
+ - Migrating from BeautifulSoup: tutorials/migrating_from_beautifulsoup.md
78
+ # - Migrating from AutoScraper: tutorials/migrating_from_autoscraper.md
79
+ - Development:
80
+ - API Reference:
81
+ - Adaptor: api-reference/adaptor.md
82
+ - Fetchers: api-reference/fetchers.md
83
+ - Custom Types: api-reference/custom-types.md
84
+ - Writing your retrieval system: development/automatch_storage_system.md
85
+ - Using Scrapling's custom types: development/scrapling_custom_types.md
86
+ - Support and Sponsors: donate.md
87
+ - Contributing: contributing.md
88
+ - Changelog: 'https://github.com/D4Vinci/Scrapling/releases'
89
+
90
+ markdown_extensions:
91
+ - admonition
92
+ - abbr
93
+ # - mkautodoc
94
+ - pymdownx.emoji
95
+ - pymdownx.details
96
+ - pymdownx.superfences
97
+ - pymdownx.highlight:
98
+ anchor_linenums: true
99
+ - pymdownx.inlinehilite
100
+ - pymdownx.snippets
101
+ - pymdownx.tabbed:
102
+ alternate_style: true
103
+ - tables
104
+ - codehilite:
105
+ css_class: highlight
106
+ - toc:
107
+ permalink: true
108
+
109
+ plugins:
110
+ - search
111
+ - mkdocstrings:
112
+ handlers:
113
+ python:
114
+ paths: [scrapling]
115
+ options:
116
+ docstring_style: sphinx
117
+ show_source: true
118
+ show_root_heading: true
119
+ show_if_no_docstring: true
120
+ inherited_members: true
121
+ members_order: source
122
+ separate_signature: true
123
+ unwrap_annotated: true
124
+ filters:
125
+ - '!^_'
126
+ merge_init_into_class: true
127
+ docstring_section_style: spacy
128
+ signature_crossrefs: true
129
+ show_symbol_type_heading: true
130
+ show_symbol_type_toc: true
131
+
132
+ extra:
133
+ social:
134
+ - icon: fontawesome/brands/github
135
+ link: https://github.com/D4Vinci/Scrapling
136
+ - icon: fontawesome/brands/python
137
+ link: https://pypi.org/project/scrapling/
138
+ - icon: fontawesome/brands/x-twitter
139
+ link: https://x.com/D4Vinci1
140
+
141
+ extra_css:
142
+ - stylesheets/extra.css